Generic hash table implementation with focus on being minimally
invasive on existing items to be indexed.

The key is stored arbitrarily in the referenced item. A custom match
function `HT_MATCH` provides the necessary abstraction. Items are
NOT allocated by the hash table.

Removed items are replaced with a sentinel value (1) to preserve
chaining.

See the example implementations `hash_set.h`, `hash_item_table.h`,
and `hash_test.c`.

The hash function can also be customized, see the default below.

In all cases the key is assumed to be a char string that is not
(necessarily) zero-terminated. The length is given separately. Keys
can therefore be arbitrary binary values of arbitrary length.

Instead of initializing the hash table, it may be zeroed. In that
case the count defaults to 4 upon first insert, meaning it can hold
up to 4 items before resizing, or fewer depending on load factor. By
zeroing memory, hash tables use no memory until actually used.

For increased portability we do not rely upon `stdint.h` outside the
default hash function.

Build
-----

There are no special build requirements.

CMakeLists.txt simply links the appropriate hash function with the test
files, but CMake is not required, for example:

    cc load_test.c ptr_set.c cmetrohash64.c -O4 -DNDEBUG -o load_test

There are several significant flags that can be set; see
`CMakeLists.txt`, `hash_test.c`, and `load_test.c`.

`initbuild.sh` is an easy way to create a CMake Ninja build for
platforms that support it.

Usage
-----

The hash table is implemented in a generic form with a static (private)
interface. The macro `HASH_TABLE_HEADER(name, item)` defines the public
prototype for the specialized type, and `HASH_TABLE_API(name)` defines
non-static wrapper functions to access the generic implementation.
This avoids creating all the code as macros, which are painful to
develop and debug.

See `token_map.h`, `token_map.c`, which are used in `hash_test.c`.

If the datatype is only needed in one file, the implementation such as
`token_map.c` can be included after defining `HT_PRIVATE`. This gives
the compiler better optimization opportunities and hides the interface
from other compilation units.

The basic datatype `hash_table_t` is a small struct that can be embedded
anywhere and used as the instance of any hash table derived type.


Note on algorithmic choice
--------------------------

We use linear or quadratic probing hash tables because they allow for
many small hash tables. We overallocate the hash table by a factor of 2
(default) but only store a single pointer per item. This probing does
not allow for dense tables by itself, but because the hash table only
stores a single pointer per bucket, we can afford a larger table.
Advanced hashing such as Hopscotch can pack much more densely, but
Hopscotch needs to store a bitmask, thus already doubling the size of
the table. Hopscotch is probably good, but more complex, and depends on
optimized bit scan instructions; furthermore, when the use case is many
small tables such as symbol table scopes, cache locality is less
relevant. Chained hashing with a 50% load factor is a good choice, but
requires intrusive links and cannot trivially hash string sets without
extra allocation. There is some evidence that linear probing may be
faster than quadratic probing due to cache effects, as long as we do
not pack too densely; however, the traditional quadratic probing
(k + i * i) modulo a prime does not cover all buckets. We use
(k + i * (i + 1) / 2) modulo a power of 2, which covers all buckets, so
without experimentation it is unclear whether linear probing or
quadratic probing is best.
The use of open addressing leads to more key comparisons than chained
hashing. The fact that we store the keys indirectly in the stored item
is also not ideal, except when the item is also directly the key. If we
use larger hash tables from the saved space, we suspect this will
still perform well, also considering external factors such as not
having to allocate and copy a key from e.g. a text buffer being
parsed.

It is generally understood that linear probing degrades significantly
with a load factor above 0.7. In this light, it is interesting to note
that Emmanuel Goossaert tested hopscotch hashing and found that bucket
swaps only take place in significant numbers above a load factor of
0.7. A commenter on Goossaert's blog also found that neighbourhoods
rarely exceed 64 even when allowed to grow on demand. Without deep
analysis it would appear that linear probing and hopscotch are pretty
similar at a load factor of 0.5, especially if tombstones are not
present. Because hopscotch requires extra data (e.g. the hash key or a
bitmap or a linked list), this confirms our intuition that lower load
factors and smaller buckets serve us better than advanced algorithms.
Furthermore, hopscotch insert degrades badly when it needs to search
for empty buckets at high load factors. Of course, for on-disk storage
it is a different matter, and this is why Goossaert is interested
in caching hash keys in buckets.

Robin Hood hashing is mostly interesting when there are many deletions
to clean up and when the load factor increases. In our implementation we
try to keep the per-bucket size small: a pointer and an 8-bit offset, or
just a pointer for the linear and quadratic probing implementations.
This makes it affordable with a lower load factor.

This Robin Hood variation stores the offset from the hashed bucket to
where the first entry is stored.
This means we can avoid sampling any
bucket not indexed by the current hash key, and it also means that we
avoid having to store or calculate the hash key when updating.

A sorted Robin Hood hashing implementation was also made, but it proved
to be error-prone with many special cases and slower than regular Robin
Hood hashing. It would conceivably protect against hash collision
attacks through exponential search, but insertions and deletions would
still need to move memory in linear time, making this point moot.
Therefore the sorted Robin Hood variant has been removed.


Section 4.5:
<http://codecapsule.com/2014/05/07/implementing-a-key-value-store-part-6-open-addressing-hash-tables/>

<http://codecapsule.com/2013/08/11/hopscotch-hashing/>

Source file references
----------------------

<http://www.jandrewrogers.com/2015/05/27/metrohash/>

downloaded from

<https://github.com/rurban/smhasher>
<https://github.com/rurban/smhasher/commit/00a4e5ab6bfb7b25bd3c7cf915f68984d4910cfd>

<https://raw.githubusercontent.com/rurban/smhasher/master/cmetrohash64.c>
<https://raw.githubusercontent.com/rurban/smhasher/master/cmetrohash.h>
<https://raw.githubusercontent.com/rurban/smhasher/master/PMurHash.c>
<https://raw.githubusercontent.com/rurban/smhasher/master/PMurHash.h>

As of July 2015, for 64-bit hashes, the C port of the 64-bit metro hash
is a good trade-off between speed and simplicity. For a 32-bit C hash
function, the ported MurmurHash3 is safe and easy to use in this
environment, but xxHash32 may also be worth considering.

See also <http://www.strchr.com/hash_functions>