Monday, January 4, 2010

4.5 Maps



[ Team LiB ]





4.5 Maps


When we use an array index variable to access array elements, we perform a very efficient operation, typically implemented with one or two machine instructions. This feature makes arrays ideal for organizing simple maps or lookup tables keyed by sequential integers starting from zero. In cases where the array elements are known in advance, they can be used to directly initialize the array contents.[30]

[30] netbsdsrc/lib/libc/time/localtime.c:453�455



static const int year_lengths[2] = {
DAYSPERNYEAR, DAYSPERLYEAR
};

The array defined in this example contains just two elements denoting the number of days in normal and leap years. What is important is not the array's size but the uniform random access made possible through its index.[31]

[31] netbsdsrc/lib/libc/time/localtime.c:832�1264



janfirst += year_lengths[isleap(year)] * SECSPERDAY;
[...]
while (days < 0 || days >= (long) year_lengths[yleap = isleap(y)]) {
[...]
yourtm.tm_mday += year_lengths[isleap(i)];

The same principle is also often employed for tables of more than one dimension.[32]

[32] netbsdsrc/lib/libc/time/localtime.c:448�451



static const int mon_lengths[2][MONSPERYEAR] = {
{ 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 },
{ 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 }
};

The above example initializes a two-dimensional array to store the number of days for each month in normal and leap years. You may find the information stored in the Mon_lengths array trivial and a waste of space. Far from it�as in this case, arrays are often used to efficiently encode control structures, thus simplifying a program's logic.


Figure 4.6 Table-driven program operation.


static struct key {
char *name; <-- a
void (*f)(struct info *); <-- b
[...]
} keys[] = {
<-- c
{ "all", f_all, 0 },
{ "cbreak", f_cbreak, F_OFFOK },
[...]
{ "tty", f_tty, 0 },
};
[...]
int
ksearch(char ***argvp, struct info *ip)
{
char *name;
struct key *kp, tmp;

name = **argvp;
[...]
tmp.name = name;
if (!(kp = (struct key *)bsearch(&tmp, keys, <-- d
sizeof(keys)/sizeof(struct key), sizeof(struct key), c_key)))
return (0);
[...]
kp->f(ip); <-- e
return (1);
}

<-- f
void
f_all(struct info *ip)
{
print(&ip->t, &ip->win, ip->ldisc, BSD);
}


(a)
Command name


(b)
Function to process it


(c)
Command and function table


(d)
Look for command in table


(e)
Execute the respective function


(f)
Processing function for the "all" command


Consider how the following code excerpts would have to be coded using explicit if statements.[33]

[33] netbsdsrc/lib/libc/time/localtime.c:682�1373



value += mon_lengths[leapyear][i] * SECSPERDAY;
[...]
if (d + DAYSPERWEEK >= mon_lengths[leapyear][rulep->r_mon - 1])
[...]
i = mon_lengths[isleap(yourtm.tm_year)][yourtm.tm_mon];

The logical conclusion of using an array to encode a program's control elements is the complete table-driven operation of a program. In the most common case arrays are used to associate data with code by storing in each position a data element and a pointer to the element's processing function. The example in Figure 4.6[34] uses an array initialized with command names and pointers to the function used to execute each respective command. The ksearch function performs a binary search for the command name over the array contents (sorted in ascending order by the command name). If an entry for the command is found, ksearch will call the respective function. Generalizing from the above examples, we note that arrays can control a program's operation by storing data or code used by abstract or virtual machines implemented within that program. Typical examples of this approach include the implementation of parsers and lexical analyzers using pushdown and stack automata and the realization of interpreters based on the execution of virtual machine instructions.

[34] netbsdsrc/bin/stty/key.c:74�148


A minor but frequently encountered idiom used in Figure 4.6 concerns the calculation of the number of elements of the key array. When the key array is defined, its size is not given; the compiler determines it based on the number of elements used to initialize it. Within the program, in places where the number of array elements is needed, it can be obtained as the result of the expression sizeof(keys)/sizeof(struct key). You can always read the expression sizeof(x)/sizeof(x[0]) as the number of elements of the array x. The number of elements is calculated as a compile-time constant by dividing the size needed for the whole array by the size needed for each of its elements. Since the result of this operation is a compile-time constant you will often see it used for specifying the size of or initializing other arrays.[35]

[35] netbsdsrc/sys/netinet/in proto.c:167�177



struct protosw impsw[] = {
{ SOCK_RAW, &impdomain, 0, PR_ATOMIC|PR_ADDR,
[...]
},
};

struct domain impdomain =
{ AF�IMPLINK, "imp", 0, 0, 0,
impsw, &impsw[sizeof (impsw)/sizeof(impsw[0])] };

4.5.1 Hash Tables


Lookup tables are not always initialized from static data and accessed using a zero-based key index. Very often we use the efficient random-access mode that arrays give us to organize a map that changes during a program's operation and to implement element lookup schemes with keys different from a simple array index. You will find such schemes referred to as hash tables. The main idea behind such a scheme is to construct an array index out of the lookup key we want to use. When the lookup key is another (potentially very large) integer, a simple solution involves using the remainder of the key divided by the number of array elements as the array index.[36]

[36] netbsdsrc/bin/pax/tables.c:163�167



/*
* hash inode number and look for this file
*/
indx = ((unsigned)arcn->sb.st_ino) % L_TAB�SZ;
if ((pt = ltab[indx]) != NULL) {

In the above example a file's unique identifying number st_ino (called the inode number) is used as an array index for the ltab array, with L_TAB_SZ elements. The result of the modulo operation is guaranteed to match exactly the range of allowable array indices and is therefore then directly used to access array elements. Often the cost of the modulo operation is avoided by having the array contain a number of elements equal to a power of 2 and using a bitwise-AND operation, as we described in Section 2.10, instead.


When the index variable we want to use on the map is not an integer, you will find it converted into an integer by using a hash function. The hash function combines the key data elements so as to convert them into an integer. As an example, the hash function in Figure 4.7:1[37] converts a string into an integer by using the exclusive-OR of each string's character element with the accumulated hash value shifted left by three bit positions; negative results are converted to positive at the end of the conversion. Before looking up the array, the computed value is again confined to the legal array index values by using a bitwise-AND operation (Figure 4.7:2). Keep in mind that many different map index values may hash into the same array position. Therefore, after locating the array position, we need to check that the element in that position is indeed the element we were looking for (Figure 4.7:3). A final complication arises from the fact that more than one element may need to be stored in the same position. A strategy often used in this case is calculating new candidate array positions by using another function (Figure 4.7:4). These alternative positions are then used for placing elements when their primary position is occupied and for searching elements when the primary position does not contain the element we were looking for (Figure 4.7:5).

[37] XFree86-3.3/xc/lib/font/util/atom.c:55�173


Figure 4.7 Hash function and hash table access.



static
Hash(char *string, int len)
{
int h;

h = 0;
while (len--)
h = (h << 3) ^ *string++;
if (h < 0)
return -h;
return h;
}

[...]
Atom
MakeAtom(char *string, unsigned len, int makeit)
{
int hash, h, r;

hash = Hash (string, len); <-- a
if (hashTable) {
h = hash & hashMask;
if (hashTable[h]) {

if (hashTable[h]->hash == hash &&
hashTable[h]->len == len &&
NameEqual (hashTable[h]->name, string, len))
return hashTable[h]->atom;

r = (hash % rehash) | 1;


for (;;) {
h += r;
if (h >= hashSize)
h -= hashSize;
if (!hashTable[h])
break;
if (hashTable[h]->hash == hash &&
hashTable[h]->len == len &&
NameEqual (hashTable[h]->name, string, len))
return hashTable[h]->atom;
}
}

Hash function


(a)
Compute hash value

Limit hash value to the array size

Verify that the correct element was found

Value for calculating alternative positions

Search alternative positions


Exercise 4.14
Locate in the book's CD-ROM an instance where an array is used to encode control information. Rewrite the code using explicit control statements. Measure the difference in lines of code and comment on the relative maintainability, extensibility, and efficiency of the two approaches.


Exercise 4.15
Locate code where explicit control structures can be substituted by an array lookup mechanism. Explain how the code would be implemented.


Exercise 4.16
Locate in the book's CD-ROM at least three different implementations of a string hash function. Select a representative set of input strings and compare the implementations based on the number of elements that collide in the same array position and the execution time of each function.


Exercise 4.17
Do you envisage any difficulties in implementing a hash-based map as an abstract data type? Suggest ways to overcome them.





    [ Team LiB ]



    No comments: