nostrdb

an unfairly fast embedded nostr database backed by lmdb
git clone git://jb55.com/nostrdb
Log | Files | Refs | Submodules | README | LICENSE

README.md (7127B)


      1 Generic hash table implementation with focus on being minimally
      2 invasive on existing items to be indexed.
      3 
      4 The key is stored arbitrarily in the referenced item. A custom match
      5 function `HT_MATCH` provides the necessary abstraction. Items are
      6 NOT allocated by the hash table.
      7 
      8 Removed items are replaced with a sentinel value (1) to preserve
      9 chaining.
     10 
     11 See the example implementations `hash_set.h`,  `hash_item_table.h`,
     12 and `hash_test.c`.
     13 
     14 The hash function can also be customized, see the default below.
     15 
     16 In all cases the key as assumed to be char string that is not
     17 (necessarily) zero terminated. The length is given separately. Keys
     18 can therefore be arbitrary binary values of arbitrary length.
     19 
     20 Instead of initializing the hash table, it may be zeroed. In that
     21 case the count defaults to 4 upon first insert, meaning it can hold
     22 up to 4 items before resizing or less depending on load factor. By
     23 zeroing memory, hash tables use no memory until actually used.
     24 
     25 For increased portability we do not rely upon `stdint.h` outside the
     26 default hash function.
     27 
     28 Build
     29 -----
     30 
     31 There are no special build requirements.
     32 
     33 CMakeLists.txt simply links the appropriate hash function with the test
     34 files, but CMake is not required, for example:
     35 
     36     cc load_test.c ptr_set.c cmetrohash64.c -O4 -DNDEBUG -o load_test
     37 
     38 There are several significant flags that can be set, but look at
     39 `CMakeLists.txt`, `hash_test.c`, and `load_test.c`.
     40 
     41 `initbuild.sh` is an easy way to create a CMake Ninja build for
     42 platforms that support it.
     43 
     44 Usage
     45 -----
     46 
     47 The hash table is implemented in a generic form with a static (private)
     48 interface. The macros
     49 
     50 `HASH_TABLE_HEADER(name, item)` defines the public prototype for the
     51 specialized type, and `HASH_TABLE_API(name)` defines non-static wrapper
     52 functions to access the generic implementation. This avoids creating all
     53 the code as macros which are painful to develop and debug.
     54 
     55 See `token_map.h`, `token_map.c` which are used in `hash_test.c`.
     56 
     57 If the datatype is only needed in one file, the implementation such as
     58 `token_map.c` can be included after defining `HT_PRIVATE`. This gives
     59 the compiler better optimization opportunities and hides the interface
     60 from other compilation units.
     61 
     62 The basic datatype `hash_table_t` is a small struct that can be embedded
     63 anywhere and used as the instance of any hash table derived type.
     64 
     65 
     66 Note on algorithmic choice
     67 --------------------------
     68 
     69 We use linear or quadratic probing hash tables because it allows for
     70 many small hash tables. We overallocate the hash table by a factor 2
     71 (default) but only store a single pointer per item. This probing does
     72 not allow for dense tables by itself, but because the hash table only
     73 stores a single pointer per bucket, we can afford a larger table.
     74 Advanced hashing such as Hopscotch can pack much more densely but
     75 e.g. Hopscotch need to store a bitmask, thus already doubling the
     76 size of the table. Hopscotch is probably good, but more complex and
     77 depends on optimizing bit scan insructions, furthermore, when the use
     78 case is many small tables such as symbol table scopes, cache locality
     79 is less relevant. Chained hashing with 50% load factor is a good
     80 choice, but require intrusive links, and cannot trivially hash string
     81 sets without extra allocation. There is some evidence that linear
     82 probing may be faster than quadratic probing due to cache effects, as
     83 long as we do not pack too densely - however, the tradional quadratic
     84 probing (k + i * i) modulo prime does not cover all buckets. We use
     85 (k + i * (i + 1) / 2) modulo power of 2 which covers all buckets so
     86 without experimentation it is unclear whether linear probing or
     87 quadratic probing is best.
     88 
     89 The use of open addressing leads to more key comparisons than chained
     90 hashing. The fact we store the keys indirectly in the stored item is
     91 also not ideal, except when the item is also directly the key. If we
     92 use larger hash tables from the saved space, we suspect this will
     93 still perform well, also considering external factors such as not
     94 having to allocate and copy a key from e.g. a text buffer being
     95 parsed.
     96 
     97 It is generally understood that linear probing degrades significantly
     98 with a load factor above 0.7. In this light, it is interesting to note
     99 that Emmanuel Goossaert tested hopscotch hashing and found that bucket
    100 swaps only take place in significance above a load factor of 0.7. A
    101 commenter to Goossaert's blog also found that neighbourhoods rarely
    102 exceed 64 even when allowed to grow on demand. Without deep analysis
    103 it would appear that linear probing and hopscotch is pretty similar
    104 at a load factor of 0.5 especially if tombstones are not present.
    105 Because hopscotch requires extra data (e.g. the hash key or a bitmap
    106 or a linked list) this confirms our intuition that it is better with
    107 lower load factors and smaller buckets, than advanced algorithms.
    108 Furthermore, hopscotch insert degrades badly when it needs to search for
    109 empty buckets at high load factors. Of course, for on disk storage
    110 it is a different matter, and this is why Goossaert is interested
    111 in caching hash keys in buckets.
    112 
    113 Robin Hood hashing is mostly interesting when there are many deletions
    114 to clean up and when the load factor increases. In our implementation we
    115 try to keep the per bucket size small: a pointer and a 8 bit offset, or
    116 just a pointer for the linear and quadratic probing implementations.
    117 This makes it affordable with a lower load factor.
    118 
    119 This Robin Hood variation stores the offset from the hashed bucket to
    120 where the first entry is stored. This means we can avoiding sampling any
    121 bucket not indexed by the current hash key, and it also means that we
    122 avoid having to store or calculate the hash key when updating.
    123 
    124 A sorted Robin Hood hashing implementation was also made, but it prooved
    125 to be error prone with many special cases and slower than regular Robin
    126 Hood hashing. It would conceivably protect against hash collision
    127 attacks through exponential search, but insertions and deletions would
    128 still need to move memory in linear time, making this point mood.
    129 Therefore the sorted Robin Hood variant has been removed.
    130 
    131 
    132 Section 4.5:
    133 <http://codecapsule.com/2014/05/07/implementing-a-key-value-store-part-6-open-addressing-hash-tables/>
    134 
    135 <http://codecapsule.com/2013/08/11/hopscotch-hashing/>
    136 
    137 Source file references
    138 ----------------------
    139 
    140 <http://www.jandrewrogers.com/2015/05/27/metrohash/>
    141 
    142 downloaded from
    143 
    144     <https://github.com/rurban/smhasher>
    145     <https://github.com/rurban/smhasher/commit/00a4e5ab6bfb7b25bd3c7cf915f68984d4910cfd>
    146 
    147     <https://raw.githubusercontent.com/rurban/smhasher/master/cmetrohash64.c>
    148     <https://raw.githubusercontent.com/rurban/smhasher/master/cmetrohash.h>
    149     <https://raw.githubusercontent.com/rurban/smhasher/master/PMurHash.c>
    150     <https://raw.githubusercontent.com/rurban/smhasher/master/PMurHash.h>
    151 
    152 As of July 2015, for 64-bit hashes, the C port of the 64 bit metro hash
    153 is a good trade-off between speed and simplicity. The For a 32-bit C hash
    154 function, the ported MurmurHash3 is safe and easy to use in this
    155 environment, but xxHash32 may also be worth considering.
    156 
    157 See also <http://www.strchr.com/hash_functions>
    158