SourceForge.net Logo
sourceforge.net > uthash project > uthash home > uthash user guide

A hash in C

This document is geared towards C programmers of UNIX systems. Since you're reading this, chances are that you know a hash is used for looking up items using a key. In scripting languages like Perl, hashes are used all the time. In C, hashes don't exist in the language itself. This software provides a hash for C structures.

What can it do?

This software supports four basic operations on hashes.

  1. add

  2. find

  3. delete

  4. iterate/sort

Is it fast?

Add, find and delete are normally constant-time operations. This is influenced by your key domain and the hash function.

This hash aims to be minimalistic and efficient. It's around 500 lines of C. It inlines automatically because it's implemented as macros. It's fast as long as the hash function is properly chosen to suit your keys. This is easy; see Appendix A: Choosing the hash function.

Is it a library?

No, it's just a single header file: uthash.h. All you need to do is copy the header file into your project, and:

#include "uthash.h"

You may then utilize the hash macros as explained below.

Supported platforms

This software has been tested on Linux, Solaris, OpenBSD, Mac OSX and Cygwin. It probably runs on other Unix or Unix-like platforms too.

BSD licensed

This software is made available under the BSD license. It is free and open source.

Obtaining uthash

Please follow the link to download on the uthash website.

Getting help

You can email the author at thanson@users.sourceforge.net if you have questions not answered by this document.

Keys and values

A hash is comprised of structures. Each structure represents a key-value association. One or more fields act as the key; the structure itself is the value.

Example: Defining a structure that can be hashed
#include "uthash.h"

struct my_struct {
    int id;                    /* key */
    char name[10];
    UT_hash_handle hh;         /* makes this structure hashable */
};

There are no restrictions on the data type or name of the key field(s).

Note
Just remember that, as with any hash, the keys have to be unique.

The UT_hash_handle field

The UT_hash_handle field must be present in your structure in order to hash it. It is used for the internal bookkeeping that makes the hash work. It does not require initialization. It can be named anything, but you can simplify matters by naming it hh. This allows you to use the easier "convenience" macros to add, find and delete items.

Adding, finding, deleting and iterating

This section introduces the HASH_ADD, HASH_FIND, HASH_DELETE, and HASH_SORT macros. This section demonstrates the macros by example. For a more succinct listing, see Appendix F: Macro Reference.

Adding an item

Your hash must be declared as a NULL-initialized pointer to your structure. Then allocate and initialize your structure and call HASH_ADD. (Here we use the convenience macro HASH_ADD_INT, which is for keys of type int.)

Important
The hash must be initialized to NULL!
Example: Adding an item to a hash
struct my_struct *users = NULL;    /* important! initialize to NULL */

int add_user(int user_id, char *name) {
    struct my_struct *s;

    s = malloc(sizeof(struct my_struct));
    s->id = user_id;
    strcpy(s->name, name);
    HASH_ADD_INT( users, id, s );  /* id: name of key field */
}

The second parameter to HASH_ADD_INT is the name of the key field. Here, this is id.

Once an item has been added to the hash, do not change the value of its key. Instead, delete the item from the hash, change the key, and then re-add it.

Important
Key uniqueness

Your application must enforce key uniqueness. In other words, do not add the same key to the hash more than once. If you do, hash behavior will be undefined.

You can test whether a key already exists in the hash using HASH_FIND.

Finding an item

To look up an item in a hash, you must have its key. Then call HASH_FIND. (Here we use the convenience macro HASH_FIND_INT for keys of type int).

Example: Looking up an item in a hash by its key

struct my_struct *find_user(int user_id) {
    struct my_struct *s;

    HASH_FIND_INT( users, &user_id, s );  /* s: output pointer */
    return s;
}
Note
The middle argument is a pointer to the key. You can't pass a literal key value to HASH_FIND. Instead assign the literal value to a variable, and pass a pointer to the variable.

Here, s is the output variable of HASH_FIND_INT. It's a pointer to the sought item, or NULL if the key wasn't found in the hash.

Deleting an item

To delete an item from a hash, you must have a pointer to the item.

Example: Deleting an item from a hash

void delete_user(struct my_struct *user) {
    HASH_DEL( users, user);  /* user: pointer to deletee */
}
Note
Deleting an item from a hash just removes it— it does not free it.

Iterating and sorting

You can loop over the items in the hash by starting from the beginning and following the hh.next pointer.

Example: Iterating over all the items in a hash

void print_users() {
    struct my_struct *s;

    for(s=users; s != NULL; s=s->hh.next) {
        printf("user id %d: name %s\n", s->id, s->name);
    }
}

There is also an hh.prev pointer you could use to iterate backwards through the hash, starting from any known item.

Sorted iteration

The items in the hash are, by default, traversed in the order they were added ("insertion order") when you follow the hh.next pointer. But you can sort the items into a new order using HASH_SORT. E.g.,

HASH_SORT( users, name_sort );

The second argument is a pointer to a comparison function. It must accept two arguments which are pointers to two items to compare. Its return value should be less than zero, zero, or greater than zero, if the first item sorts before, equal to, or after the second item, respectively. (Just like strcmp).

Example: Sorting the items in the hash

int name_sort(struct my_struct *a, struct my_struct *b) {
    return strcmp(a->name,b->name);
}

int id_sort(struct my_struct *a, struct my_struct *b) {
    return (a->id - b->id);
}

void sort_by_name() {
    HASH_SORT(users, name_sort);
}

void sort_by_id() {
    HASH_SORT(users, id_sort);
}

When the items in the hash are sorted, the first item may also change position; so in the example above, users may point to a different item after calling HASH_SORT.

Putting it all together: A complete Example

We'll repeat all the code and embellish it with a main() function to form a working example.

If this code was placed in a file called example.c in the same directory as uthash.h, it could be compiled and run like this:

cc -o example example.c
./example

Follow the prompts to try the program, and type Ctrl-C when done.

Example: A complete program
#include <stdio.h>   /* gets */
#include <stdlib.h>  /* atoi, malloc */
#include <string.h>  /* strcpy */
#include "uthash.h"

struct my_struct {
    int id;                    /* key */
    char name[10];
    UT_hash_handle hh;         /* makes this structure hashable */
};

struct my_struct *users = NULL;

int add_user(int user_id, char *name) {
    struct my_struct *s;

    s = malloc(sizeof(struct my_struct));
    s->id = user_id;
    strcpy(s->name, name);
    HASH_ADD_INT( users, id, s );  /* id: name of key field */
}

struct my_struct *find_user(int user_id) {
    struct my_struct *s;

    HASH_FIND_INT( users, &user_id, s );  /* s: output pointer */
    return s;
}

void delete_user(struct my_struct *user) {
    HASH_DEL( users, user);  /* user: pointer to deletee */
}

void print_users() {
    struct my_struct *s;

    for(s=users; s != NULL; s=s->hh.next) {
        printf("user id %d: name %s\n", s->id, s->name);
    }
}

int name_sort(struct my_struct *a, struct my_struct *b) {
    return strcmp(a->name,b->name);
}

int id_sort(struct my_struct *a, struct my_struct *b) {
    return (a->id - b->id);
}

void sort_by_name() {
    HASH_SORT(users, name_sort);
}

void sort_by_id() {
    HASH_SORT(users, id_sort);
}

int main(int argc, char *argv[]) {
    char in[10];
    int id=1;
    struct my_struct *s;

    while (1) {
        printf("1. add user\n");
        printf("2. find user\n");
        printf("3. delete user\n");
        printf("4. sort items by name\n");
        printf("5. sort items by id\n");
        printf("6. print users\n");
        gets(in);
        switch(atoi(in)) {
            case 1:
                printf("name?\n");
                add_user(id++, gets(in));
                break;
            case 2:
                printf("id?\n");
                s = find_user(atoi(gets(in)));
                printf("user: %s\n", s ? s->name : "unknown");
                break;
            case 3:
                printf("id?\n");
                s = find_user(atoi(gets(in)));
                if (s) delete_user(s);
                else printf("id unknown\n");
                break;
            case 4:
                sort_by_name();
                break;
            case 5:
                sort_by_id();
                break;
            case 6:
                print_users();
                break;
        }
    }
}

This program is included in the distribution in tests/example.c. You can run make example in that directory to compile it easily.

Arbitrary keys

An example with a string key

This is almost the same as using integer keys except the macros are named HASH_ADD_STR and HASH_FIND_STR.

Note
char[ ] vs. char*

The string is within the structure in this example— name is a char[10] field. If instead our structure merely pointed to the key (i.e., name was declared char *), we'd use HASH_ADD_KEYPTR, described in Appendix F.

Example: A string-keyed hash
#include <string.h>  /* strcpy */
#include <stdlib.h>  /* malloc */
#include <stdio.h>   /* printf */
#include "uthash.h"

struct my_struct {
    char name[10];             /* key */
    int id;
    UT_hash_handle hh;         /* makes this structure hashable */
};


int main(int argc, char *argv[]) {
    char **n, *names[] = { "joe", "bob", "betty", NULL };
    struct my_struct *s, *users = NULL;
    int i=0;

    for (n = names; *n != NULL; n++) {
        s = malloc(sizeof(struct my_struct));
        strcpy(s->name, *n);
        s->id = i++;
        HASH_ADD_STR( users, name, s );
    }

    HASH_FIND_STR( users, "betty", s);
    if (s) printf("betty's id is %d\n", s->id);
}

This example is included in the distribution in tests/test15.c. It prints:

betty's id is 2
Note
Remember, don't change an item's key after adding it to the hash. (Instead, delete the item from the hash, change the key and then re-add it to the hash).

An example with a multi-field, binary key

Your key field can have any data type, and it can even be an aggregate of multiple contiguous fields.

The generalized version of the macros (e.g., HASH_ADD) must be used. In addition to the usual arguments, the generalized macros require the name of the UT_hash_handle field, and the key length.

Example: A hash with a multi-field binary key
#include <stdlib.h>    /* malloc       */
#include <stddef.h>    /* offsetof     */
#include <stdio.h>     /* printf       */
#include <string.h>    /* memset       */
#include "uthash.h"

struct my_event {
    struct timeval tv;         /* key is aggregate of this field */
    char event_code;           /* and this field.                */
    int user_id;
    UT_hash_handle hh;         /* makes this structure hashable */
};


int main(int argc, char *argv[]) {
    struct my_event *e, ev, *events = NULL;
    int i, keylen;

    keylen =   offsetof(struct my_event, event_code) + sizeof(char)
             - offsetof(struct my_event, tv);

    for(i = 0; i < 10; i++) {
        e = malloc(sizeof(struct my_event));
        memset(e,0,sizeof(struct my_event));
        e->tv.tv_sec = i * (60*60*24*365);          /* i years (sec)*/
        e->tv.tv_usec = 0;
        e->event_code = 'a'+(i%2);                   /* meaningless */
        e->user_id = i;

        HASH_ADD( hh, events, tv, keylen, e);
    }

    /* look for one specific event */
    memset(&ev,0,sizeof(struct my_event));
    ev.tv.tv_sec = 5 * (60*60*24*365);
    ev.tv.tv_usec = 0;
    ev.event_code = 'b';
    HASH_FIND( hh, events, &ev.tv, keylen, e );
    if (e) printf("found: user %d, unix time %ld\n", e->user_id, e->tv.tv_sec);
}

This example is included in the distribution in tests/test16.c.

Participating in multiple hashes

A structure can be in multiple hashes simultaneously. For example, you might want to hash a structure on both an ID field and on a unique username.

Example: A structure with two keys
struct my_struct {
    int id;                    /* key 1 */
    char username[10];         /* key 2 */
    UT_hash_handle hh1,hh2;    /* makes this structure hashable */
};

In the example above, the structure can now be added to two separate hashes; in one hash, id is its key, while in the other hash, username is the key. (There is no requirement that the two hashes have different key fields. They could both use the same key, such as id).

Note
You need to have a separate UT_hash_handle for each hash that your structure will participate in simultaneously. These are hh1 and hh2 in this example.
Example: A structure participating in two hashes
    struct my_struct *users_by_id = NULL, *users_by_name = NULL, *s;
    int i;
    char *name;

    s = malloc(sizeof(struct my_struct));
    s->id = 1;
    strcpy(s->username, "thanson");

    HASH_ADD(hh1, users_by_id, id, sizeof(int), s);
    HASH_ADD(hh2, users_by_name, username, strlen(s->username), s);

    /* lookup user by ID */
    i=1;
    HASH_FIND(hh1, users_by_id, &i, sizeof(int), s);
    if (s) printf("found id %d: %s\n", i, s->username);

    /* lookup user by username */
    name = "thanson";
    HASH_FIND(hh2, users_by_name, name, strlen(name), s);
    if (s) printf("found user %s: %d\n", name, s->id);

When participating in two or more hashes simultaneously, always use the same key field with the same hash handle. E.g., in this example, hh1 is always used with the id key, and hh2 is always used with the username key.

Threaded programs

You can use this hash in a threaded program. But, the locking is up to you. You must protect the hash against concurrent modification. For every hash that you create in a threaded program, you should create a lock to protect it.

Concurrent hash lookups (HASH_FIND) are permitted, but modifications (HASH_ADD, HASH_DEL, and HASH_SORT) cannot be concurrent. Therefore, you can use a read/write lock which provides concurrent-read/exclusive-write semantics. Or you can use a mutex-style lock to prevent any concurrent access to the hash, but this is less efficient.

Appendix A: Choosing the hash function

Internally this software uses a particular hash function, such as Bernstein's hash, to transform a key to a bucket number.

Example: Bernstein hash function
#define HASH_BER(key,keylen,num_bkts,bkt,i,j,k)                        \
  bkt = 0;                                                             \
  while (keylen--)  bkt = (bkt * 33) + *key++;                         \
  bkt &= (num_bkts-1);

Several such hash functions are built-in. The default is Jenkin's hash, but your key domain may get better distribution with another function. You can use a specific hash function by compiling your program with -DHASH_FUNCTION=HASH_xyz where xyz is one of the symbolic names listed below. E.g.,

cc -DHASH_FUNCTION=HASH_OAT -o program program.c
Table: Built-in hash functions
Symbol Name
JEN Jenkins (default)
BER Bernstein
SAX Shift-Add-Xor
OAT One-at-a-time
FNV Fowler/Noll/Vo
JSW Julienne Walker

Which hash function is best?

You can easily determine the best hash function for your key domain. To do so, you'll need to run your program once in a data-collection pass, and then run the collected data through an included analysis utility.

First you must build the analysis utility. From the top-level directory,

cd tests/
make

We'll use test14.c to demonstrate the data-collection and analysis steps (using sh syntax):

cc -DHASH_EMIT_KEYS=3 -I../src -o test14 test14.c
./test14 3>test14.keys
./keystats test14.keys

The numeric value of HASH_EMIT_KEYS is a file descriptor. Any file descriptor that your program doesn't use for its own purposes can be used instead of 3.

Note
The data-collection mode enabled by -DHASH_EMIT_KEYS=x should not be used in production code.

The result of running the keystats command should look like this:

fcn     hash_q     #items   #buckets      #dups      flags   add_usec  find_usec
--- ---------- ---------- ---------- ---------- ---------- ---------- ----------
BER   1.000000       1219        512          0         ok       1206        550
FNV   1.000000       1219        512          0         ok       1734        660
JEN   1.000000       1219        512          0         ok       2008        845
OAT   0.998897       1219        512          0         ok       1941        833
SAX   0.997514       1219       1024          0         ok       2217        674
JSW   0.994652       1219        512          0         ok       1369        608

Usually, you should just pick the first hash function that is listed. Here, this is BER. This is the function that provides the most even distribution for your keys. If several have the same hash_q, then choose the fastest one according to the find_usec column.

Tip
Sometimes you might pick the fastest one even though its hash_q isn't the best. This is ok, particularly if its substantially faster (ideally hash functions have negligible performance overhead but in reality they can vary).

keystats column reference

fcn

symbolic name of hash function

hash_q

a number between 0 and 1 that measures how evenly the items are distributed among the buckets (1 is best)

#items

the number of keys that were read in from the emitted key file

#buckets

the number of buckets in the hash after all the keys were added

#dups

the number of duplicate keys encounted in the emitted key file. Duplicates keys are filtered out to maintain key uniqueness. (Duplicates are normal. For example, if the application adds an item to a hash, deletes it, then re-adds it, the key is written twice to the emitted file.)

flags

this is either ok, or noexpand if the expansion inhibited flag is set, described in Appendix C. It is not recommended to use a hash function that has the noexpand flag set.

add_usec

the clock time in microseconds required to add all the keys to a hash

find_usec

the clock time in microseconds required to look up every key in the hash

Appendix B: Statistic hash_q

The hash_q statistic measures how evenly the keys are distributed among the buckets. It is a value between 0 (worst) and 1 (best). You can access it in this way, if you're interested in observing how well your hash keys hold up to the ideal distribution.

Example: Observing the hash_q statistic

/* using the example "users" hash from previous listings */

void print_hashq() {
    if (users) printf("hash_q: %f\n", users->hh.tbl->hash_q);
}

The hash_q statistic is updated whenever bucket expansion occurs (see Appendix C).

A low value of hash_q may lead to reduced lookup (HASH_FIND) performance. You can change the hash function to one that distributes your keys more evenly; see Appendix A: Choosing the hash function.

Appendix C: Bucket expansion

Internally this hash manages the number of buckets, with the goal of having enough buckets so that each one contains only small number of items.

This hash attempts to keep fewer than 10 items in each bucket. When an item is added that would cause a bucket to exceed this number, the number of buckets in the hash is doubled and the items are redistributed.

Note
Bucket expansion occurs automatically and invisibly as needed. There is no need for the application to know when it occurs.

Bucket expansion hook

There is a hook for this event, primarily for uthash development.

Example: Bucket expansion hook
#include "uthash.h"

#undef uthash_expand_fyi
#define uthash_expand_fyi(tbl) printf("expanded to %d buckets\n", tbl->num_buckets)

...

During bucket expansion, if the number of items in a bucket exceeds the ideal number, then the expansion threshold for that bucket will be increased (instead of being 10, it will be a multiple of 10). This prevents excessive bucket expansion for hash functions that tend to over-utilize a few buckets yet have good distribution overall.

Bucket expansion inhibited hook

Some key domains may not yield a good bucket distribution using the default (or manually-specified) hash function.

If this occurs, statistic hash_q reflects the uneven distribution by tending closer to 0 than 1. This means that items are piling up in some buckets while other buckets are under-utilized. Bucket expansion then takes place as this hash tries to keep a bounded number of items in each bucket.

However, if items continue to be distributed unevenly after repeated bucket expansion, there is a point at which this hash in effect says, "This key domain is really incompatible with this hash function. I could spend all day expanding the buckets but it isn't helping to spread out the items in the hash, so I'm going to cut my losses and stop expanding the buckets."

Inhibition criteria

The decision to inhibit expansion is made by a heuristic rule: if two consecutive bucket expansions yield hash_q values below 0.5, then the bucket expansion inhibited flag is set for this hash.

Because this condition would cause HASH_FIND lookups to exhibit worse than constant-time performance, your application can choose to be made aware if it occurs. (Usually you don't need to worry about this, if you used the keystats utility during development to select a good hash for your keys).

Example: Bucket expansion inhibited hook
#include "uthash.h"

#undef uthash_noexpand_fyi
#define uthash_noexpand_fyi printf("warning: bucket expansion inhibited\n");

...

Once set, the bucket expansion inhibited flag remains in effect as long as the hash has items in it.

Appendix D: Hooks for memory allocation

By default this hash implementation uses malloc and free to manage memory. If your application uses its own custom allocator, this hash can use them too.

Example: Specifying alternate memory management functions
#include "uthash.h"


/* undefine the defaults */
#undef uthash_bkt_malloc
#undef uthash_bkt_free
#undef uthash_tbl_malloc
#undef uthash_tbl_free

/* re-define, specifying alternate functions */
#define uthash_bkt_malloc(sz) my_malloc(sz)  /* for UT_hash_bucket */
#define uthash_bkt_free(ptr) my_free(ptr)
#define uthash_tbl_malloc(sz) my_malloc(sz)  /* for UT_hash_table  */
#define uthash_tbl_free(ptr) my_free(ptr)

...

Out of memory handling

If memory allocation fails (i.e., the malloc function returned NULL), the default behavior is to terminate the process by calling exit(-1). This can be modified by re-defining the uthash_fatal macro.

#undef uthash_fatal
#define uthash_fatal(msg) my_fatal_function(msg);

The fatal function should terminate the process; uthash does not support "returning a failure" if memory cannot be allocated.

Appendix E: Internal implementation notes

Application and bucket order

The UT_hash_handle data structure includes next, prev, hh_next and hh_prev fields. The former two fields determine the "application" ordering (that is, iteration order that corresponds to the order the items were added). The latter two fields determine the "bucket" order. These link the UT_hash_handles together in a doubly-linked list that defines a bucket.

Internal consistency check with -DHASH_DEBUG=1

If a program that uses this hash is compiled with -DHASH_DEBUG=1, a special internal consistency-checking mode is activated. In this mode, the integrity of the whole hash is checked following every add or delete operation.

Note
This is for debugging the uthash software only, not for use in production code.

In this mode, any internal errors in the hash data structure will cause a message to be printed to stderr and the program to exit.

Checks performed in -DHASH_DEBUG=1 mode:

In the tests/ directory, running make debug will run all the tests in this mode.

Appendix F: Macro reference

Generalized macros

These macros add, find, delete and sort the items in a hash.

Table: Generalized macros
macro arguments
HASH_ADD (hh_name, head, keyfield_name, key_len, item_ptr)
HASH_ADD_KEYPTR (hh_name, head, key_ptr, key_len, item_ptr)
HASH_FIND (hh_name, head, key_ptr, key_len, item_ptr)
HASH_DELETE (hh_name, head, item_ptr)
HASH_SORT (head, cmp)
Note
HASH_ADD_KEYPTR is used when the structure contains a pointer to the key, rather than the key itself.

Convenience macros

The convenience macros do the same thing as the generalized macros, but require fewer arguments. They have key and naming restrictions (see below).

Table: Convenience macros
macro arguments
HASH_ADD_INT (head, keyfield_name, item_ptr)
HASH_FIND_INT (head, key_ptr, item_ptr)
HASH_ADD_STR (head, keyfield_name, item_ptr)
HASH_FIND_STR (head, key_ptr, item_ptr)
HASH_DEL (head, item_ptr)

Key and naming restrictions for convenience macros

In order to use the convenience macros,

  1. the structure's UT_hash_handle field must be named hh, and

  2. the key field must be of type int or char[]

HASH_DEL does not require the second condition.

Argument descriptions

hh_name

name of the UT_hash_handle field in the structure. Conventionally called hh.

head

the structure pointer variable which acts as the "head" of the hash. So named because it points to the first item which is added to the hash.

keyfield_name

the name of the key field in the structure. (In the case of a multi-field key, this is the first field of the key). If you're new to macros, it might seem strange to pass the name of a field as a parameter. See note.

key_len

the length of the key field in bytes. E.g. for an integer key, this is sizeof(int), while for a string key it's strlen(key). (For a multi-field key, see the notes in this guide on calculating key length).

key_ptr

for HASH_FIND, this is a pointer to the key to look up in the hash (since it's a pointer, you can't directly pass a literal value here). For HASH_ADD_KEYPTR, this is the address of the key of the item being added.

item_ptr

pointer to the structure being added, deleted or looked up. This is an input parameter for HASH_ADD and HASH_DELETE macros, and an output parameter for HASH_FIND.

cmp

pointer to comparison function which accepts two arguments (pointers to items to compare) and returns an int specifying whether the first item should sort before, equal to, or after the second item (like strcmp).