Wednesday, November 11, 2009

Chapter 5. Hashes










Chapter 5. Hashes


Hashes and arrays are the two basic "aggregate" data types supported by most modern programming lagnguages. The basic interface of a hash is similar to that of an array. The difference is that while an array stores items according to a numeric index, the index of a hash can be any object at all.


Arrays and strings have been built into programming languages for decades, but built-in hashes are a relatively recent development. Now that they're around, it's hard to live without them: they're at least as useful as arrays.


You can create a Hash by calling
Hash.new
or by using one of the special sytaxes Hash[] or {}. With the Hash[] syntax, you pass in the initial elements as comma-separated object references. With the {} syntax, you pass in the initial contents as comma-separated key-value pairs.



empty = Hash.new # => {}
empty ={} # => {}
numbers = { 'two' => 2, 'eight' => 8} # => {"two"=>2, "eight"=>8}
numbers = Hash['two', 2, 'eight', 8] # => {"two"=>2, "eight"=>8}



Once the hash is created, you can do hash lookups and element assignments using the same syntax you would use to view and modify array elements:



numbers["two"] # => 2
numbers["ten"] = 10 # => 10
numbers # => {"two"=>2, "eight"=>8, "ten"=>10}



You can get an array containing the keys or values of a hash with Hash#keys or Hash#values. You can get the entire hash as an array with Hash#to_a:



numbers.keys # => ["two", "eight", "ten"]
numbers.values # => [2, 8, 10]
numbers.to_a # => [["two", 2], ["eight", 8], ["ten", 10]]



Like an array, a hash contains references to objects, not copies of them. Modifications to the original objects will affect all references to them:



motto = "Don't tread on me"
flag = { :motto => motto,
:picture => "rattlesnake.png"}
motto.upcase!
flag[:motto] # => "DON'T TREAD ON ME"



The defining feature of an array is its ordering. Each element of an array is assigned a Fixnum object as its key. The keys start from zero and there can never be gaps. In contrast, a hash has no natural ordering, since its keys can be any objects at all. This feature make hashes useful for storing lightly structured data or key-value pairs.


Consider some simple data for a person in an address book. For a side-by-side comparison I'll represent identical data as an array, then as a hash:



a = ["Maury", "Momento", "123 Elm St.", "West Covina", "CA"]
h = { :first_name => "Maury",
:last_name => "Momento",
:address => "123 Elm St."
:city => "West Covina",
:state => "CA" }



The array version is more concise, and if you know the numeric index, you can retrieve any element from it in constant time. The problem is knowing the index, and knowing what it means. Other than inspecting the records, there's no way to know whether the element at index 1 is a last name or a first name. Worse, if the array format changes to add an apartment number between the street address and city, all code that uses a[3] or a[4] will need to have its index changed.


The hash version doesn't have these problems. The last name will always be at :last_name, and it's easy (for a human, anyway) to know what :last_name means. Most of the time, hash lookups take no longer than array lookups.


The main advantage of a hash is that it's often easier to find what you're looking for. Checking whether an array contains a certain value might require scanning the entire array. To see whether a hash contains a value for a certain key, you only need to look up that key. The set library (as seen in the previous chapter) exploits this behavior to implement a class that looks like an array, but has the performance characteristics of a hash.


The downside of using a hash is that since it has no natural ordering, it can't be sorted except by turning it into an array first. There's also no guarantee of order when you iterate over a hash. Here's a contrasting case, in which an array is obviously the right choice:



a = [1, 4, 9, 16]
h = { :one_squared => 1, two_squared => 4, three_squared => 9,
:four_squared => 16 }



In this case, there's a numeric order to the entries, and giving them additional labels distracts more than it helps.


A hash in Ruby is actually implemented as an array. When you look up a key in a hash (either to see what's associated with that key, or to associate a value with the key), Ruby calculates the

hash code
of the key by calling its
hash
method. The result is used as a numeric index in the array. Recipe 5.5 will help you with the most common problem related to
hash codes.


The performance of a hash depends a lot on the fact that it's very rare for two objects to have the same hash code. If all objects in a hash had the same hash code, a hash would be much slower than an array. Code like this would be a very bad idea:



class BadIdea
def hash
100
end
end



Except for strings and other built-in objects, most objects have a hash code equivalent to their internal object ID. As seen above, you can override Object#hash to change this, but the only time you should need to do this is if your class also overrides Object#==. If two objects are considered equal, they should also have the same hash code; otherwise, they will behave strangely when you put them into hashes. Code like the fragment below is a very good idea:



class StringHolder
attr_reader :string
def initialize(s)
@string = s
end

def ==(other)
@string == other.string
end

def hash
@string.hash
end
end
a = StringHolder.new("The same string.")
b = StringHolder.new("The same string.")
a == b # => true
a.hash # => -1007666862
b.hash # => -1007666862













No comments: