4.2 Implementations
Now
that you have the various types of collections in mind, it is time to
turn your attention to the implementation of these collections. A
List, for example, can be implemented a number of
ways, each with advantages and disadvantages. However, before you can
understand the various implementations of Lists
and other collections, you need to understand how collections and
maps determine equality and order.
4.2.1 Determining Equality and Order
Until now, we have been
discussing equality and order quite a bit, but without explaining how
these conditions are actually discovered by the collections. For
example, how does a Map decide if two keys are the
same, and how does a SortedSet determine the order
used to sort the objects? To answer these questions, you have to
tackle the basic principles of object equality: identity and
comparability.
4.2.1.1 Equality versus identity
For many collections to do their job, they need to know which objects
are equal. For example, a Set
can't exclude duplicate entries if it
doesn't know how to check to see whether supplied
objects are equal. These collections take advantage of the fact that
all objects in Java descend from the common class
java.lang.Object.
Since all objects descend from Object, the
Set implementations can call the method
equals( ) on all objects. This allows the set to
compare objects. However, there is one catch: for this to be
effective, the object must define (or inherit) a valid implementation
of equals( ).
The default implementation of equals( ) compares
two objects to see whether they are the same object in the virtual
machine by identity, not by
equality. Suppose you create two
Address objects that define the addresses of
employees. Both address objects can contain the same street name,
number, country, postal code, and other data. However, these objects
are two different instances (and therefore reside at two different
memory addresses). In this case, the Address
objects are equivalent, but they are not the same object梚.e.,
they are not identical. If you use the default implementation of the
equals( ) method, you would always get a result of
false despite the fact that the objects are equal.
To solve this problem, you need to override the definition of
equals( ) to compare the actual data in the class.
While creating your new equals method, you need to
keep the following requirements in mind:
The method must consistently return the same value for any single
comparison of objects x and y
no matter how many times the comparison is run, assuming that no data
changes in x and y. The method should be symmetric so that x.equals(y)
returns the same result as y.equals(x). The method should be transitive so that if
x.equals(y) and y.equals(z)
returns true, then z.equals(x)
also returns true. Whenever the comparison is against a null value,
the method should always return false梚.e.,
(x.equals(null) == false) should be
true for all types of x. The method should be reflexive so that x.equals(x)
is always true.
|
Don't forget to override the implementation of
hashCode( ) when you override equals(
). There is a contract between these two methods that says
that any objects evaluating as equivalent must also return the same
hash code. For more information on this contract, see the Javadoc for
java.lang.Object.equals( ).
|
|
The overridden equals( ) method ensures that
collections that are not supposed to contain duplicates do indeed
prevent them. It also illustrates an important concept in
programming. Two objects are said to be equal in
identity if they are the same object instance,
whereas two objects are said to have equality if
they contain the same data.
4.2.1.2 Comparing objects
The problem of sorting in collections is solved by defining a way to
compare two objects. This can be accomplished using two different
techniques. The first technique is to have the objects implement
java.lang.Comparable. The
Comparable interface contains a single method
defined by the following signature:
public interface Comparable {
public int compareTo(Object o);
}
The compareTo( ) method returns an integer
indicating the comparison of the this object to
the target object. If compareTo( ) returns a
negative integer, then the this object is said to
be less than the object given in the parameter o.
If the returned integer is positive, then the this
object is greater than the given object. If the integer is
0, the object is comparatively the same as the
this object. Generally, implementations will
ignore anything other than whether the return value is positive,
negative, or 0. Therefore, it is useless to worry
about defining how much less than the given object the return value
is. For example, a returned value of -8 would be
treated the same as a returned value of -1.
One important factor to remember about this comparison is that it
does not compare equality. Two objects that are
not equivalent may result in a compareTo( ) result
of 0. This occurs often when objects try to sort
on only one or two fields in the class. For example, if you consider
a Person object, you would want to sort on first
name and last name but not necessarily on the other fields, such as
age and tax identification number. In this case, compareTo(
) would return a value of 0 for two
people named Paul Henry despite the fact that one is a senior
executive and the other is an infant.
The compareTo( ) method has a number of
requirements similar to the equals( ) method:
Like the equals( ) method, the compareTo(
) method must return the same value for any single
comparison of objects x and y
no matter how many times the comparison is run, assuming that no data
changes in x and y. The method should be transitive梚.e., if object
x is comparatively greater than object
y, and object y is
comparatively greater than object z, object
x should be greater than object
z. This is generally written using the following
mathematical notation: If x > y and y > z then x > z The method must ensure that the comparison is consistent. For
example, if x.compareTo(y) yields
-1, then y.compareTo(x) should
yield 1. The method should be reflexive so that
x.compareTo(x) always returns
0. It is recommended that objects that are
comparatively 0, but are not necessarily
equals( ), indicate this in the documentation to
the class.
It is always a good idea to implement the
Comparable interface for objects that may be
stored inside ordered collections. This implementation defines what
is known as the natural
ordering of instances. However, there are some
situations in which a developer may want to sort the objects in a
collection based on a value other than the natural ordering. The
java.util.Comparator interface is used for this
purpose.
Suppose you have a set of data objects that contain a variety of
information, and you want to be able to sort that information to
achieve something other than the natural ordering. For example, if
you have a set of Person objects, you may want to
sort them by age or occupation instead of name. To accomplish this,
create a new implementation of the Comparator
interface that will sort the objects. Example 4-1
shows a Comparator that sorts
Person objects based on their tax identification
number.
Example 4-1. An age Comparator
package oreilly.hcj.collections;
import java.util.Comparator;
import oreilly.hcj.bankdata.Person;
public class TaxIdComparator implements Comparator {
public int compare(final Object x, final Object y) {
return ((Person)x).getTaxID( )
.compareTo(((Person)y).getTaxID( ));
}
}
In this example, you take advantage of the compareTo(
) method built into java.lang.String to
compare the two people based on the values of their
taxID property. Now that you have a new
Comparator implementation, you can sort the people
by passing this comparator to sorting collections:
public class SomeClass {
protected void demoComparator( ) {
SortedSet people = new TreeSet(new TaxIdComparator( ));
// . . . add people to the set.
}
}
Once you have created the sorted set and assigned the comparator, you
can be sure that the people inserted into the set will always be
sorted in order of taxID. You can also implement
more complicated comparators that are configurable.
4.2.2 Big O Notation
Implementing
collections in programming languages
has been a topic of computer science since the first days of
structured languages. This topic was born out of the desire to
efficiently store and analyze information in computer systems. It is
a true science with theories, mathematics, and concepts. One of the
most important pieces of mathematics necessary to understanding
collections is the concept of Big O Notation.
Big O Notation is a mathematical term used to express the efficiency
of an algorithm depending on the number of elements the algorithm
must work on. The notation is called Big O Notation because of the
conspicuous usage of the letter O in the formulas. The O stands for
orders of magnitude. The general
idea is that the lower the orders of magnitude, the faster the
algorithm is.
Consider a list of people stored in a random order. Big O Notation
will tell you how long, on average, it will take you to find any one
person in the list by name. If you think about the worst-case
scenario, you will realize that you may look through all of the
elements in a list, only to find that your target person is the last
element on the list. As the list grows in size, your worst-case
scenario gets . . . worse. If you have 100 people in your list, your
worst-case scenario would involve performing 99 checks to find a
person. If your list is 100,000 people, then you would have to do
99,999 checks before you find the person. This is described in Big O
Notation with the formula O(n). This shorthand is
read as "the orders of magnitude grows in proportion
to n."
By contrast, if you store people that are sorted by name in an
ordered list, you can use a binary search algorithm to find the
target person, called x, by name. This algorithm
compares the person at the halfway point in the list,
y, to the target person. If the person
x evaluates as equal to person
y, then the person is found, and the algorithm
stops. However, if the person x evaluates as
greater than person y, then the algorithm uses the
upper half of the list for the search. If person x
evaluates as less than person y, the algorithm
uses the lower half of the list. The algorithm then splits up the
targeted half and repeats the splitting process until it finds the
person or narrows the list to one person that isn't
the target person.
The binary search algorithm has an order
of magnitude of O(log2(n)). This means that the
worst-case scenario to find a person in a list of size
n is the result of
log2(n).
In the case of your 100,000-person list, the result of this equation
is roughly 16.609 operations. In other words, the worst-case scenario
is that you would need 17 compares to find the target person.
|
Most calculators have only a
log10(n)
function. To find
log2(n)
on these calculators, you can employ the simple formula
log2(n)
-
log10(n)/log10(2).
|
|
The most common orders of magnitude that you will encounter are the
following. They are listed in order of most efficient to least
efficient.
- O(1)
Constant time. No matter how big the list, the time to find the
target is always the same.
- O(log2(n))
This notation is often abbreviated O(log(n)). For
100,000 people, the value would be 17 compares.
- O(n)
For 100,000 people, you would need 99,999 compares.
- O(n x log(n))
For 100,000 people, you would need 1,700,000 compares!
- O(n2)
My computer couldn't compute this result for 100,000
entries; however, for 100 entries, it is 10,000. This number
increases very rapidly.
- O(n!)
The factorial is the series of integers from 1 to
n multiplied together. For example, if
n = 10, the factorial is
10 x 9 x
8 x 7 x
6 x 5 x
4 x 3 x
2 x 1
= 3628800. If you write an
algorithm with this order of magnitude, you will definitely have
problems.
Big O Notation is not limited to calculating search time in a
collection. In fact, it can be applied to any computer algorithm as a
measure of efficiency. Generally, algorithms higher than
O(n) are not feasible as solutions to practical
problems.
4.2.3 Lists
Now
that you have an understanding of the basic concepts, you are ready
to study various types of collections. Once you have decided that you
need a List, the next step is to decide exactly
which List you need to use. To make a good
decision, you need to have a good understanding of the three standard
implementations of List.
|
Although anyone can implement the collection interfaces in their
code, we will stick to the classes defined in the
java.util package. Personally, I like to augment
this with the collections classes from Jakarta Commons, available at
http://jakarta.apache.org/.
|
|
4.2.3.1 java.util.Vector
The Vector class has been with the JDK for quite a
long time. If you look at the @since tag in the
Javadoc for this class, you will see that it was introduced in JDK
1.0.
A Vector is essentially a
dynamic resizing array. You create a
vector by specifying its initial size and the size of the chunk you
want to use to resize the Vector should it exceed
its storage capacity. The following code creates a
Vector that starts with 100 elements and grows by
100 elements (when needed):
Vector v = new Vector(100, 100);
|
There are other constructors to vectors that use default values for
sizes and increments. Additionally, there is an extra constructor for
Vector that copies a collection into a
Vector and uses the default value for
incrementation.
|
|
Since Vector is an implementation of the
List interface, instances of
Vector also allow the user to access each element
in a similar manner to that of an array. Therefore, you can think of
Vectors as something of a resizable array.
In fact, inside the Vector class, the data is
stored as an array of objects. When you initialize the
Vector, this array is allocated according to the
initial size that you specify in the constructor. As the
capacity of the Vector is
exceeded, a new array is allocated that is equal to the size of the
old array plus the capacityIncrement you specified
in the constructor. If you didn't specify either of
these values, the class uses its default sizes.
When the Vector class has to resize, it copies the
old array into the new array using the System.arraycopy(
) method. Although this method uses fast
native code to copy the array, copying is still quite slow and will
get slower as the size of the Vector grows. This
is why it is important to assign values to
capacity and capacityIncrement
that minimize the number of resize operations without wasting too
much memory.
Using an array to store the data inside a Vector
has one other major disadvantage. When inserting an element into a
Vector at a specific position, the
Vector must use System.arraycopy(
) to move the data to accommodate the insertion, as Figure 4-1 demonstrates.
Figure 4-1 shows the result of inserting a
String element at index 2 into the
Vector. Since there was already an element at
index 2, the Vector had to use
System.arraycopy( ) to push the elements further
down into the array. If you remove elements at an index, the same
process has to be performed to compress the array.
What this all adds up to is that Vector is not
very good at inserts and removals of items. However, the array
storage paradigm makes Vector very good at
accessing elements quickly. Therefore, Vector is a
good choice for a list whose prime function is to access the data in
the list rapidly, and a poor choice if the data in the list will be
modified frequently.
However, the Vector class has one major problem: since
Vector was designed to be thread-safe, it is
replete with synchronized blocks in the code. This
allows multiple threads to use a Vector safely,
but significantly slows down its performance. The semantics of
synchronized code are outside the scope of this book; however, I can
tell you that the process of synchronization is a costly one.
|
If you are interested in learning more about synchronization and
thread programming, check out Java Threads by
Scott Oaks and Henry Wong (O'Reilly).
|
|
In situations in which data structures will not be accessed by
multiple threads or in which the synchronization can be provided
outside of the list, using Vector is a sad waste
of performance. However, if you will be tossing your list around
various threads, you should stick with Vector.
4.2.3.2 java.util.ArrayList
The ArrayList class is JDK 1.2's
answer to the synchronization problem in the
Vector class. This class does essentially the same
job as the Vector class but without
synchronization. The lack of thread safety within the
ArrayList class makes it much faster than
Vector. However, it still suffers from the same
performance problems as Vector when resizing,
adding, or removing elements.
If you need a class that has all of the features of
Vector but not the internal synchronization, you
should choose a list class.
If you look at the documentation for the ArrayList
class, you will notice that in addition to implementing the
List interface, it implements
java.io.Serializable and
java.util.RandomAccess. However, these two
interfaces don't declare any methods or attributes.
These two interfaces are examples of marker
interfaces. Marker interfaces are used to mark
classes conceptually. For example, the
RandomAccess interface marks
List implementations that can be accessed in a
random, rapid, and predictable order. Although this marker interface
is useful in some situations, I haven't seen it used
professionally.
On the other hand, Serializable is a marker
interface that you see all over the place. This interface marks a
class as having the ability to be flattened and serialized to a
stream and then restored from that stream. This allows Java to store
objects in files and send these objects across a network connection.
In this case, the marker interface lets the virtual machine know that
this class can be introspected and then written out. Classes that are
not serializable do not have this ability.
By using marker interfaces, a developer can tag certain code without
complicating its inheritance structure.
|
4.2.3.3 java.util.LinkedList
The LinkedList class also implements the
List interface and allows you to store objects in
a defined order. However, it implements this ability in a very
different way than Vector or
ArrayList.
Whereas Vector and ArrayList
hold their data in an array of objects, LinkedList
uses a special node structure to connect the objects. Each object is
stored in a node that contains three attributes: one to contain the
data in the node, one to be used as a reference to the next node, and
one to be used as a reference to the previous node. The start of the
chain of references is usually referred to as the
head of the list, and is the only node actually
stored in the class variable. The structure of a
LinkedList is shown in Figure 4-2.
This node-based structure makes it easy to insert or remove elements
in the LinkedList. All that the list needs to do
is create a node and rewire the references to the previous and next
objects. There is (thankfully) no need to copy references or perform
any other costly actions.
On the other hand, the structure of a LinkedList
makes finding any particular element in the list a costly exercise.
Since the LinkedList class has to step through the
nodes one by one from the head of the list until the element is
found, the search will be slow.
|
Some of the data structure purists out there may have noticed that
LinkedList is, in fact, not a normal linked list
at all. In fact, a normal linked list has only a reference to the
next object and not a reference to the previous object. The
LinkedList class in the
java.util package is actually a doubly
linked list.
|
|
For these reasons, LinkedList is a great class for
manipulating data and shuffling it around. It is ideal for tasks such
as sorting and performing frequent insertions and removals. However,
it is really bad at accessing data quickly. You should use
LinkedList if the primary function of your list is
to manipulate data in the list.
4.2.4 Maps and SortedMaps
Maps
are useful structures for storing things such as key/value pairs. The
Map interface and its descendant
SortedMap have many implementations. Once you have
decided that you need to use a map as your collection, you will then
need to decide exactly which map to use.
4.2.4.1 java.util.HashTable
The Hashtable class is one of the two grandfathers of
Java collections, the other being Vector. The
Hashtable class stores a series of keyed values in
a structure that uses the hash code of the object to store things
quickly. To understand how a Hashtable works, it
is important that you understand the hash system.
When a Hashtable is constructed, it creates a
table to hold the data of the collection. This table is an array of
arrays. Each of the subarrays represents a bucket in which the class
can store a set of values. All of the buckets are the same size.
As you insert new items into the table, the hash code of the object
is used to determine which bucket the object should go into. Once the
correct bucket is found, the object is inserted into that bucket. The
structure of the class is shown in Figure 4-3.
Each bucket holds data within a particular range of hash codes. As
you add items to the table, the hash code of the key is computed and
the correct bucket is chosen.
One of the advantages of this structure is that it can hold very
large data sets without impeding performance. As the map grows, the
number of buckets grows, and the range of hash codes allowed in each
bucket gets smaller. The result is that the map can find any object,
taking, on average, the same amount of time as it does to find any
other object. In Big O Notation, this is referred to as
O(1). This is the best-case scenario for algorithm
efficiency.
As the map grows, it fills up until its load reaches a certain
percentage. This loadFactor can be given to the
class at construction time, or the user can choose to use the default
value of 0.75f. When using the default, as soon as
the map becomes 75% full, the class will increase the size of the
map, usually by creating more buckets and narrowing the range. Once
the new map is ready, the class will rehash the
map by computing the hash codes of each object in the map and filing
them in their proper buckets. This is a fairly slow process, so you
don't want to do it often. On the other hand, since
hash-based data structures take up a lot of memory, you
wouldn't want to allocate maps with huge initial
capacities either. Finding the optimal performance for a hash-based
data structure is often dependent on how the class will be used in
the program. If the data size in the map is fairly constant, you can
use higher load factors. However, if the data is constantly changing
size and memory is tight, you may want to use lower load factors. The
default size of 0.75 provides this good balance in
most situations.
The Hashtable class is indeed useful. However, it
is internally synchronized, which makes it thread-safe but much
slower than other alternatives.
4.2.4.2 java.util.HashMap
Essentially, a HashMap is an implementation of the
Hashtable class without internal synchronization.
It sacrifices thread safety for higher performance. Other than this
difference, they are conceptually the same. If you
don't need to pass your map around in a threaded
environment, or if thread safety can be provided externally to the
map, then you should use a HashMap.
4.2.4.3 java.util.LinkedHashMap
A LinkedHashMap is an implementation of the
SortedMap interface that allows you to guarantee
that the iteration of the keys is in a predictable order. It uses the
same technique as the HashMap and
HashTable classes to store the data. However,
LinkedHashMap also maintains a running
LinkedList that contains all of the keys in the
map in a sorted order.
Although keeping the keys in a sorted order is advantageous when it
comes to iterating through the map, maintaining the extra linked list
causes LinkedHashMap to be slower than
HashMap. Use it with caution.
4.2.4.4 java.util.IdentityHashMap
An IdentityHashMap is the same as a
HashMap, except that objects are compared based on
identity, not equality.
In this map, key objects evaluate as equal only if they are the same
instance in the virtual machine. This differs from other collections
that evaluate for equality based on the result of calling the
equals( ) method.
Since this map doesn't compare based on equality,
you can insert multiple keys into the map even when they evaluate as
equals according to the equals( ) method; they
only need to be different instances. For example, the following two
instances could be placed in the same
IdentityhashMap:
String x = new String("tom");
String y = new String("tom"); // <== Different instance!
IdentityHashMap ihm = new IdentityHashMap( );
ihm.put(x, new Integer(15));
ihm.put(y, new Integer(8));
If you try this with a normal HashMap, the value
for the key tom would be replaced on the second
line. With an IdentityHashMap, both values are
inserted. However, just because you can do this
doesn't mean you should; I don't
recommend inserting duplicate keys into an
IdentityHashMap because it can result in some very
confusing code.
All other operations on an IdentityHashMap work
the same way as they do on a HashMap. The
following code takes up where the last snippet left off:
String z = new String("tom");
Integer obj = ihm.get(z);
In this code, the variable element will contain the value
null. This is because z is not
the same instance as x or y
even though they contain the same value. When working with
IdentityHashMaps, you should keep these gotchas in
mind.
4.2.4.5 java.util.WeakHashMap
A WeakHashMap is just like a
HashMap, except that the keys used in the map are
stored in weak references. We will cover weak references in Chapter 10.
4.2.4.6 java.util.TreeMap
A TreeMap is the other implementation of the
SortedSet interface. However, unlike
LinkedHashMap, the TreeMap
class doesn't store its elements in a hash-based
data structure. Instead, it stores its elements, unsurprisingly, in a
tree.
Specifically, the TreeMap class uses a structure
called a binary tree. Figure 4-4 shows how this
structure looks in memory.
Inside TreeMap is a variable called
root, which stores the root of the tree. Each node
in the tree stores a key and a value, and each element stores a left
and a right element. When a new element is inserted, the key is
evaluated against the root. If the key is the same, then the
root value in the root entry will be replaced with
the new value. However, if the key evaluates as less than the
root (compareTo( ) returns
-1), the TreeMap will use the
left member of the root to navigate to the next
element. If the key evaluates as greater than the
root (compareTo( ) returns
1), the TreeMap will use the
right member to navigate to the next element. This
process of evaluation continues until the key is found or the
TreeMap hits a leaf node, which is a node without
a left or right member. If the TreeMap hits a leaf
node, it will add a new entry containing the key and value and tack
it on to the leaf node.
Keep in mind that determining whether a node is a leaf is dependent
on which side, left or right,
to which the TreeMap wants to surf. A particular
node could be a leaf node with respect to left but
not to right, as right may have
a reference to another entry.
Every now and then, the tree will become lopsided梩hat is, it
has more elements on one side of root than on the
other. When this occurs, the tree is rebalanced by using the middle
key in the sorted keys as the new root entry. The
other entries are then copied into the new tree using the algorithm
for adding new entries. This rebalancing process is slow.
The process of adding or removing a key from a
TreeMap has a O(log2(n))
efficiency. It isn't as good as a
LinkedHashMap with its O(1)
efficiency, but a LinkedHashMap has to store two
data structures for the elements in the map梠ne to accomplish
the actual storage and another to keep a list of the keys in a sorted
order. For this reason, TreeMaps are better at
storing large sets of data. Since log2(n) grows
very slowly as n increases, this size trade-off is
often worth the cost.
If you are morbidly curious, where n is
100,000,000, log2(n) is only 26.5754247590989.
This means that if you have a map that contains a hundred million
entries, a TreeMap will be able to find any
arbitrary key using a maximum of 27 compares. That is usually good
enough in most situations. On the other hand, storing references to
all 100,000,000 entries twice would take up too
much memory.
4.2.5 Sets and SortedSets
If
you have opted to use a set as your collection, you will first need
to decide whether the data needs to maintain a sorting order. This
question is more complex than it appears.
For example, a set of contacts in an address book
doesn't necessarily need to maintain sorting order
even though the items may be displayed in a sorted order in the
interface. In fact, the sorting order may be invalid if you built the
list to sort based on last name, and the user wants it sorted based
on email address. However, a set that holds the instructions to make
a certain part in a factory must be kept in a specific sorted order
to ensure that they are followed correctly.
The question you
should ask yourself is, "Do my objects need to be in
a sorted order for presentation only, or is there some other
underlying logical reason for the sorting?" If they
only need to be presented in sorted order, then
you can easily copy them to a list or SortedSet in
the GUI and sort them at presentation time. However, if there is
another underlying logical reason to sort them in one predefined
order, then you should opt for a SortedSet. Try to
avoid sorting your sets. Sorting sets introduces significant overhead
in the insertion and deletion of items in a set and can impede
performance.
4.2.5.1 java.util.HashSet
A HashSet is an implementation of the
Set interface that stores its contents using a
hash system. This system works the same way as a
HashMap. In fact, one interesting piece of Java
trivia is that the HashSet class is implemented
using only the key set of the HashMap class. The
HashSet class merely inserts a dummy value for
each key in the map and hides the fact that it is using a map from
the user.
4.2.5.2 java.util.LinkedSet
A LinkedSet is an implementation of the
Set interface that guarantees a predictable
iteration order. It does this by maintaining a
LinkedList of the objects in the set, which can
significantly decrease the amount of memory.
Note that just because the order is predictable does not mean that
the order is sorted. In fact, the only real benefit of a
LinkedHashSet over a regular
HashSet is that the iteration of the set remains
constant over time and isn't as random as
HashSet. However, I have never seen a programming
situation in which this is enough of a concern to balance the
significantly larger memory consumption.
4.2.5.3 java.util.TreeSet
A TreeSet is implemented the same way
HashSet is. However, instead of using a
HashMap as a backing store, it uses
TreeMap. Therefore, the TreeSet
class inherits the same benefits and limitations as the
TreeMap class.
|
No comments:
Post a Comment