nostrdb

an unfairly fast embedded nostr database backed by lmdb
git clone git://jb55.com/nostrdb
Log | Files | Refs | Submodules | README | LICENSE

json_parser_design.md (6760B)


      1 # JSON Parser Design
      2 
      3 The overall principle of the json parser is as follows:
      4 
      5 `flatcc/flatcc_json.h` contains functions to parse json primitives
      6 and type conversions, and a generic json parser to skip unrecognized
      7 fields.
      8 
      9 For each table all known fields are sorted and a trie is constructed.
     10 After recognizing a json object, each member name is read partially
     11 and matched against the trie, and more of the field name is read as
     12 demanded by the trie and a match or rejection is decided.
     13 
     14 The member name is read as a single 64 bit word read in big endian
     15 format (zero padded at the end of file). Big endian makes the first
     16 character in the name most significant and allows the word to contain
     17 trailing "garbage". The efficiency depends on unaligned 64-bit reads
     18 with a fast byteswapping operation, but there is a fall-back that
     19 works in all cases via an unaligned read support function.
     20 
     21 The trie is constructed by sorting all field names and taking a
     22 median split. If the split name has preceeding name which is a strict
     23 prefix, that name is chosen as split instead. This is important
     24 because words are read with trailing "garbage". If the median name is
     25 longer than a 64 bit word, it is treated specially where a match
     26 triggers reading more data and repeating the process.
     27 
     28 The median is tested for less than, but not equality. Each side if
     29 the choice uses a new median like above, until there is only one
     30 option left. This option is then tested for exact match by masking
     31 out any trailing data. If the match succeeds the input member name
     32 must be checked for closing `"` or other valid termination if
     33 allowing unquoted names.
     34 
     35 Note that this match only visits data once, unlike a hash table where
     36 a lookup still requires matching the name. Since we expect the number
     37 of field names per table to be reasonable, this approach is likely
     38 faster than a hash table.
     39 
     40 If the match fails, then an error can be generated with a "unknown
     41 field" error, or the field can be consumed by the generic json
     42 parser.
     43 
     44 When a member name is successfully matched againts a known field, the
     45 member value is expected to be of a primitive type such as a string
     46 or an integer, the value is parsed using a json support parser and
     47 type converter which may issue overflow and format errors. For
     48 example, integers won't accept decimal points and ubyte fields won't
     49 accept integers of value 256 or higher. If the parse was successful
     50 the member is added the current flatbuffer table opened at start of
     51 the json object. In fact, the value is parsed directly into the
     52 flatbuffer builders entry for the field.
     53 
     54 If a scalar field fails to parse its value, a second attempt is
     55 done parsing the value as a symbolic value. This parse is similar
     56 to parsing member names but uses a global set of known constants
     57 from enumerations. If this also fails, the field is invalid and
     58 an error is generated.
     59 
     60 When the field is parsed, comma is detected and the process is
     61 repeated. It may fail on duplicate fields because the builder will
     62 complain.
     63 
     64 Once a closing bracket is matched, the table is closed in the
     65 builder.
     66 
     67 In the above we discussed tables, but structs work the same with the
     68 exception that any fields not present are simply set to zero and it
     69 is not checked. Nor are duplicate fields checked. This is to avoid
     70 allocating extra space for something that isn't very important.
     71 
     72 Complex member types require more consideration. We wish to avoid a
     73 recursive descent parser because it imposes limits on nesting depth,
     74 but we want to have a composable parser. This leads to solutions such
     75 as a trampoline, returning a parser function for the matched field
     76 which is called by a driver function, before resuming the current
     77 parser. We wish to avoid this as it adds overhead, especially in
     78 inlining.
     79 
     80 To avoid trampolines we compile all tables and structs into one large
     81 function where each type can be reached by a goto label or a switch
     82 statement. We can then use a variant of Simon Tathams famous
     83 co-routine by duff device to abort and resume the current parse. We
     84 still need a stack to track return state.
     85 <http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html>
     86 
     87 Outside the giant state machine we provide wrappers so we can parse
     88 each type from a clean call interface to parse root type.
     89 
     90 The above solution has one major problem: we apparently have to
     91 include types from included schema as well. But we don't have to it
     92 turns out: Each included schema is closed and can never have a member
     93 type in our current schema. Therefore mutual recursion is not
     94 possible. This means we can afford to use function call recursion
     95 into included parsers with recursion depth bounded by the number of
     96 included schema.
     97 
     98 Spaces are optimized by first testing to see if there is a character
     99 above space and exit immediately if so.  Otherwise, if unaligned
    100 access is valid and the buffer holds at least 16 bytes, we descend
    101 hierarchically to test ever small unligned loads against a word of
    102 spaces. This loops as long as 2x64 bit words can be matched to a word
    103 of pure spaces (ASCII 0x20).  If there is any control character, or
    104 if we reach near the end or, or if unaligned access is not available,
    105 we bail out to a standard character parsing loop that also does line
    106 counting.
    107 
    108 A final, and major issue, is how to handle unions:
    109 
    110 Googles `flatc` compiler makes the assumtion that the union type
    111 field appear before the union table field. However, this order cannot
    112 be trusted in standard JSON. If we were to sort the fields first
    113 or prescan for type, it would complete ruin our parsing strategy.
    114 
    115 To handle unions efficiently we therefore require either that union
    116 fields appear before the union table, and that the associated
    117 union table is present before another unions type field, or we
    118 require that the member name is tagged with its type. We prefer
    119 the tagged approach, but it is not compliant with `flatc v1.2`.
    120 
    121 By using a tag such as `"test_as_Monster"` instead of just "test",
    122 the parser can treat the field as any other table field. Of course
    123 there is only allowed to be one instance, so `"test_as_Weapon"`
    124 in the same object would not be allowed. To the parser it just
    125 triggers a duplicate field error.
    126 
    127 In addition, the `"test_type"` field is allowed to appear anywhere,
    128 or to be absent. If it is present, it must have a type that
    129 matches the tagged member, unless it is of type NONE. NONE may
    130 also be represented as `"test_as_NONE": null`.
    131 
    132 While the tagged approach has no use of the type field, it may still
    133 be very useful to other JSON consumers so they can know what tagged
    134 field to look up in the JSON object.
    135 
    136 The parser handles the type field by setting the type field in the
    137 flatcc builder table and checking if it is already set when
    138 seeing a tagged or untagged union table field.