Generic hash table implementation with focus on being minimally
invasive on existing items to be indexed.

The key is stored arbitrarily in the referenced item. A custom match
function `HT_MATCH` provides the necessary abstraction. Items are
NOT allocated by the hash table.

Removed items are replaced with a sentinel value (1) to preserve
chaining.

See the example implementations `hash_set.h`, `hash_item_table.h`,
and `hash_test.c`.

The hash function can also be customized, see the default below.

In all cases the key is assumed to be a char string that is not
(necessarily) zero-terminated. The length is given separately. Keys
can therefore be arbitrary binary values of arbitrary length.

Instead of initializing the hash table, it may be zeroed. In that
case the count defaults to 4 upon first insert, meaning it can hold
up to 4 items before resizing, or fewer depending on load factor. By
zeroing memory, hash tables use no memory until actually used.

For increased portability we do not rely upon `stdint.h` outside the
default hash function.

Build
-----

There are no special build requirements.

CMakeLists.txt simply links the appropriate hash function with the test
files, but CMake is not required, for example:

    cc load_test.c ptr_set.c cmetrohash64.c -O4 -DNDEBUG -o load_test

There are several significant flags that can be set; see
`CMakeLists.txt`, `hash_test.c`, and `load_test.c`.

`initbuild.sh` is an easy way to create a CMake Ninja build for
platforms that support it.

Usage
-----

The hash table is implemented in a generic form with a static (private)
interface. The macro `HASH_TABLE_HEADER(name, item)` defines the public
prototype for the specialized type, and `HASH_TABLE_API(name)` defines
non-static wrapper functions to access the generic implementation.
This avoids creating all the code as macros, which are painful to
develop and debug.

See `token_map.h`, `token_map.c`, which are used in `hash_test.c`.

If the datatype is only needed in one file, the implementation such as
`token_map.c` can be included after defining `HT_PRIVATE`. This gives
the compiler better optimization opportunities and hides the interface
from other compilation units.

The basic datatype `hash_table_t` is a small struct that can be embedded
anywhere and used as the instance of any hash table derived type.


Note on algorithmic choice
--------------------------

We use linear or quadratic probing hash tables because they allow for
many small hash tables. We overallocate the hash table by a factor of 2
(default) but only store a single pointer per item. This probing does
not allow for dense tables by itself, but because the hash table only
stores a single pointer per bucket, we can afford a larger table.
Advanced hashing such as Hopscotch can pack much more densely, but
Hopscotch needs to store a bitmask, thus already doubling the size of
the table. Hopscotch is probably good, but more complex, and depends on
optimized bit scan instructions; furthermore, when the use case is many
small tables such as symbol table scopes, cache locality is less
relevant. Chained hashing with a 50% load factor is a good choice, but
requires intrusive links and cannot trivially hash string sets without
extra allocation. There is some evidence that linear probing may be
faster than quadratic probing due to cache effects, as long as we do
not pack too densely; however, the traditional quadratic probing
(k + i * i) modulo a prime does not cover all buckets. We use
(k + i * (i + 1) / 2) modulo a power of 2, which covers all buckets, so
without experimentation it is unclear whether linear probing or
quadratic probing is best.
The use of open addressing leads to more key comparisons than chained
hashing. The fact that we store the keys indirectly in the stored item
is also not ideal, except when the item is also directly the key. If we
use larger hash tables from the saved space, we suspect this will
still perform well, also considering external factors such as not
having to allocate and copy a key from e.g. a text buffer being
parsed.

It is generally understood that linear probing degrades significantly
with a load factor above 0.7. In this light, it is interesting to note
that Emmanuel Goossaert tested hopscotch hashing and found that bucket
swaps only take place in significant numbers above a load factor of
0.7. A commenter on Goossaert's blog also found that neighbourhoods
rarely exceed 64 even when allowed to grow on demand. Without deep
analysis it would appear that linear probing and hopscotch are pretty
similar at a load factor of 0.5, especially if tombstones are not
present. Because hopscotch requires extra data (e.g. the hash key or a
bitmap or a linked list), this confirms our intuition that lower load
factors and smaller buckets serve us better than advanced algorithms.
Furthermore, hopscotch insert degrades badly when it needs to search
for empty buckets at high load factors. Of course, for on-disk storage
it is a different matter, and this is why Goossaert is interested
in caching hash keys in buckets.

Robin Hood hashing is mostly interesting when there are many deletions
to clean up and when the load factor increases. In our implementation we
try to keep the per-bucket size small: a pointer and an 8-bit offset, or
just a pointer for the linear and quadratic probing implementations.
This makes it affordable with a lower load factor.

This Robin Hood variation stores the offset from the hashed bucket to
where the first entry is stored.
This means we can avoid sampling any
bucket not indexed by the current hash key, and it also means that we
avoid having to store or calculate the hash key when updating.

A sorted Robin Hood hashing implementation was also made, but it proved
to be error-prone with many special cases and slower than regular Robin
Hood hashing. It would conceivably protect against hash collision
attacks through exponential search, but insertions and deletions would
still need to move memory in linear time, making this point moot.
Therefore the sorted Robin Hood variant has been removed.


Section 4.5:
<http://codecapsule.com/2014/05/07/implementing-a-key-value-store-part-6-open-addressing-hash-tables/>

<http://codecapsule.com/2013/08/11/hopscotch-hashing/>

Source file references
----------------------

<http://www.jandrewrogers.com/2015/05/27/metrohash/>

downloaded from

<https://github.com/rurban/smhasher>
<https://github.com/rurban/smhasher/commit/00a4e5ab6bfb7b25bd3c7cf915f68984d4910cfd>

<https://raw.githubusercontent.com/rurban/smhasher/master/cmetrohash64.c>
<https://raw.githubusercontent.com/rurban/smhasher/master/cmetrohash.h>
<https://raw.githubusercontent.com/rurban/smhasher/master/PMurHash.c>
<https://raw.githubusercontent.com/rurban/smhasher/master/PMurHash.h>

As of July 2015, for 64-bit hashes, the C port of the 64-bit metro hash
is a good trade-off between speed and simplicity. For a 32-bit C hash
function, the ported MurmurHash3 is safe and easy to use in this
environment, but xxHash32 may also be worth considering.

See also <http://www.strchr.com/hash_functions>