A simple way to use hash codes


1- Introduction


Here is an example of how to build an index of variable-length strings (or data blocks) that lets you quickly find out whether a string is present or not.

First, a few notes. I use a hash function that returns a 32-bit value, but the index itself is only 8 bits wide (256 entries). The reason is that a table addressed directly by all 32 bits would be too big (16 GB of memory, assuming 4 bytes per entry). Entry i of the array holds a list of the objects for which hash % INDEX_SIZE equals i, where hash is the hash code of the object and INDEX_SIZE is the number of entries in the index. For example, if the index size is 256, entry 3 will hold the objects whose hash codes are 3, 259, 515, 771, 1027 and so on. This allows fast access. You are free to use another value for the index size (even a prime number if you like). One thing is sure: the bigger the index table, the less time it takes to find an object in it. You will get far better access times with a 65536-entry index than with a 256-entry one, but it will take more memory (exactly 256 times more). So you should pick a size that is not too big. A good value is the number of elements you expect to enter in the index divided by, say, 2. Depending on the structure of the objects, you will then typically get between 2 and 3 collisions per list.
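To make the mapping from hash code to index entry concrete, here is a minimal sketch. BucketOf and ShowBuckets are names made up for this illustration; only the modulo by INDEX_SIZE comes from the text above.

```c
#include <stdio.h>

#define INDEX_SIZE 256

/* Map a 32-bit hash value to an entry of the index table. */
unsigned long BucketOf(unsigned long hash)
{
    return hash % INDEX_SIZE;
}

/* Print the entry chosen for the sample hash codes quoted in the text. */
void ShowBuckets(void)
{
    unsigned long samples[] = { 3, 259, 515, 771, 1027 };
    size_t i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
        printf("hash %lu -> entry %lu\n", samples[i], BucketOf(samples[i]));
}
```

Calling ShowBuckets() reports entry 3 for all five values, since each of them is 3 plus a multiple of 256.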

For really huge indexes, there is a special structure called the B-tree, which keeps the number of operations needed to find an entry bounded (it grows only logarithmically with the number of entries). It's a great thing, but it's a pain to program. If I manage to program it simply, I will add it to the library.


2- Declaration


First, we'll declare the index table:

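#include <stdlib.h>     /* malloc */
#include <string.h>     /* memcmp */
#include <sys/types.h>  /* u_long, u_char on most Unix systems */

#define INDEX_SIZE  256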

typedef struct _cell {
    char *string;
    long size;
    u_long hash;
    struct _cell *next;
} Cell;

Cell Index[INDEX_SIZE];

As you can see, nothing special: we create an array of lists. Each cell of a list contains a pointer to the string, the string's length and its hash value.
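The functions in the following sections call RawHash, which is not listed in this text. Purely as a stand-in so that the code can be tried out, here is a sketch based on the 32-bit FNV-1a hash. This is an assumption, not the hash function actually used here, and the parameter types are written as plain unsigned char and unsigned long in place of u_char and u_long.

```c
/* Stand-in for RawHash (assumption): 32-bit FNV-1a over 'size' bytes. */
unsigned long RawHash(const unsigned char *data, long size)
{
    unsigned long hash = 2166136261UL;      /* FNV-1a 32-bit offset basis */
    long i;

    for (i = 0; i < size; i++) {
        hash ^= data[i];
        /* Multiply by the FNV prime and keep only the low 32 bits,
         * so the result is the same where unsigned long is 64 bits. */
        hash = (hash * 16777619UL) & 0xFFFFFFFFUL;
    }
    return hash;
}
```

Any other function that spreads its 32-bit results evenly would do just as well; the index only needs the same bytes to always give the same value.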


3- Initialization

Next, we shall initialize the index:

void InitializeArray(void)
{
    int i;

    for (i=0; i<INDEX_SIZE; i++) {
	Index[i].string = NULL;
	Index[i].size = 0;
	Index[i].hash = 0;
	Index[i].next = NULL;
    }
}

Nothing special here either.


4- Find

Now let's look at the find function. It takes two parameters: a pointer to char and the length of the zone. It checks whether the pointed-to string is present in the index by computing its hash value and looking in the index for strings previously stored under that hash value. If there are any, it compares the hash values and the lengths of the zones, and finally compares the stored string with the argument string byte by byte. If everything matches, a pointer to the corresponding cell is returned. Otherwise, a NULL pointer is returned.

Cell *Find(char *string, long size)
{
    Cell *current;
    u_long hash;

    hash = RawHash((u_char*)string, size);
    current = Index + (hash % INDEX_SIZE);
    if (current->string == NULL)
	return NULL;
    while (current != NULL) {
	if (current->hash == hash && current->size == size &&
	    memcmp(current->string, string, size) == 0)
	    return current;
	current = current->next;
    }

    return NULL;
}

As you can see, many string compares are avoided. Most of the time, the test will fail on the first integer compare (the hash-value compare). Note that the bigger the index, the fewer collisions you will get. Thus, with a big index, finding a string often costs a few integer compares and one string compare.


5- Insert


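Now let's see the insertion function. In fact it is very simple: we compute the hash value, find the index entry, and insert the string there. The trick is that we don't want the same string to be repeated in the index, so we check that the argument string is not already in it. Note that only the pointer is stored, not a copy of the string, so the string must stay valid as long as it is in the index.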

void Insert(char *string, long size)
{
    Cell *current, *previous=NULL;
    u_long hash;

    hash = RawHash((u_char*)string, size);
    current = Index + (hash % INDEX_SIZE);
    if (current->string != NULL)
        /* Walk the list; give up if the string is already stored. */
        while (current) {
            if (current->hash == hash && current->size == size &&
                memcmp(current->string, string, size) == 0)
                return;
            previous = current;
            current = current->next;
        }

    if (previous) {
        /* The head cell is taken: append a new cell to the list. */
        current = (Cell*)malloc(sizeof(Cell));
        if (current == NULL)
            return;     /* out of memory: the string is not indexed */
        previous->next = current;
    }
    current->string = string;
    current->size = size;
    current->hash = hash;
    current->next = NULL;
}

There are no particular tricks in this code. If you want speed, you can use the following function, even simpler:

void SpeedInsert(char *string, long size)
{
    Cell *current, *cell;
    u_long hash;

    hash = RawHash((u_char*)string, size);
    current = Index + (hash % INDEX_SIZE);
    if (current->string != NULL) {
        /* The head cell is taken: link a new cell right after it. */
        cell = (Cell*)malloc(sizeof(Cell));
        if (cell == NULL)
            return;     /* out of memory: the string is not indexed */
        cell->string = string;
        cell->size = size;
        cell->hash = hash;
        cell->next = current->next;
        current->next = cell;
    } else {
        current->string = string;
        current->size = size;
        current->hash = hash;
    }
}

This one doesn't check whether the string is already in the index. This means you can insert duplicate strings; that won't affect lookups (and will even speed up finding a duplicated string), but it will increase the memory cost of your index.


6- Conclusion

As you can see, this simple index can really speed up your string lookups and improve the efficiency of your programs. As with most optimizations, the speed improvement comes at a memory cost, so this kind of index pays off mostly for long strings or memory blocks. I do not claim to give a very good solution (there are much better algorithms for string searching), but I give a simple solution to a problem I have encountered many times without finding a solution.

I have used a similar algorithm to speed up string searching in the LZW algorithm, and it gives pretty good results. I have not measured the collision rate I get, but it doesn't seem very high.
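If you want to measure the collision rate on your own data, a sketch along the following lines could be a starting point. ChainLength and PrintIndexStats are hypothetical helpers that are not part of the code above. The Cell declaration is repeated so that the block stands on its own, with unsigned long written in place of u_long.

```c
#include <stdio.h>

#define INDEX_SIZE 256

typedef struct _cell {
    char *string;
    long size;
    unsigned long hash;
    struct _cell *next;
} Cell;

Cell Index[INDEX_SIZE];

/* Number of strings stored in entry i (0 when the head cell is empty). */
long ChainLength(int i)
{
    const Cell *current = &Index[i];
    long length = 0;

    if (current->string == NULL)
        return 0;
    for (; current != NULL; current = current->next)
        length++;
    return length;
}

/* Print how many entries are in use and how long the longest list is. */
void PrintIndexStats(void)
{
    long used = 0, total = 0, longest = 0;
    int i;

    for (i = 0; i < INDEX_SIZE; i++) {
        long n = ChainLength(i);
        if (n > 0)
            used++;
        total += n;
        if (n > longest)
            longest = n;
    }
    printf("%ld strings in %ld of %d entries, longest list %ld\n",
           total, used, INDEX_SIZE, longest);
}
```

A longest list much larger than the average (total divided by used) suggests that either the index is too small or the hash function spreads the data poorly.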

You can improve this example, for instance by adding a RemoveIndex function. You can also avoid most of the malloc calls by using your own allocation function, especially if you know exactly how many entries you have to insert. You can also write a simple CleanIndexTable function that frees all the allocated Cell chunks. There are many, many things to do here.
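As one possible shape for the CleanIndexTable function mentioned above, here is a sketch. It assumes that the head cells live in the Index array (so only the chained cells were obtained from malloc) and that the strings belong to the caller. The Cell declaration is repeated so that the block stands on its own, with unsigned long written in place of u_long.

```c
#include <stdlib.h>

#define INDEX_SIZE 256

typedef struct _cell {
    char *string;
    long size;
    unsigned long hash;
    struct _cell *next;
} Cell;

Cell Index[INDEX_SIZE];

/* Free every chained cell and put each head cell back in its empty state.
 * The strings themselves are not freed: the index only borrows them. */
void CleanIndexTable(void)
{
    int i;

    for (i = 0; i < INDEX_SIZE; i++) {
        Cell *current = Index[i].next;  /* head cells were not malloc'ed */

        while (current != NULL) {
            Cell *next = current->next; /* read the link before freeing */
            free(current);
            current = next;
        }
        Index[i].string = NULL;
        Index[i].size = 0;
        Index[i].hash = 0;
        Index[i].next = NULL;
    }
}
```

After the call, the table is in the same state as after InitializeArray, so it can be filled again.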

You should also note that this kind of program is not restricted to character strings only. You can put whatever you want in the Cell structure. For a symbol table, you would put the symbol name, together with its type, memory size, and other things.
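For the symbol table case, the cell might look like the sketch below. Every name in it (SymbolKind, Symbol, InitSymbol and the fields) is made up for illustration; only the idea of storing the name together with its type and memory size comes from the text.

```c
#include <stddef.h>

/* Hypothetical symbol kinds, for illustration only. */
typedef enum { SYM_VARIABLE, SYM_FUNCTION, SYM_TYPE } SymbolKind;

typedef struct _symbol {
    char *name;             /* key: plays the role of 'string' in Cell */
    long name_size;         /* length of the name, as 'size' in Cell */
    unsigned long hash;     /* hash value of the name */
    SymbolKind kind;        /* payload: what the symbol denotes */
    size_t memory_size;     /* payload: storage the symbol needs */
    struct _symbol *next;   /* next symbol in the same index entry */
} Symbol;

/* Fill in a symbol cell; the hash is computed by the caller. */
void InitSymbol(Symbol *symbol, char *name, long name_size,
                unsigned long hash, SymbolKind kind, size_t memory_size)
{
    symbol->name = name;
    symbol->name_size = name_size;
    symbol->hash = hash;
    symbol->kind = kind;
    symbol->memory_size = memory_size;
    symbol->next = NULL;
}
```

The find and insert functions stay the same apart from the field names, because they only look at the key, its length and its hash value.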