json_parser_design.md (6760B)
1 # JSON Parser Design 2 3 The overall principle of the json parser is as follows: 4 5 `flatcc/flatcc_json.h` contains functions to parse json primitives 6 and type conversions, and a generic json parser to skip unrecognized 7 fields. 8 9 For each table all known fields are sorted and a trie is constructed. 10 After recognizing a json object, each member name is read partially 11 and matched against the trie, and more of the field name is read as 12 demanded by the trie and a match or rejection is decided. 13 14 The member name is read as a single 64 bit word read in big endian 15 format (zero padded at the end of file). Big endian makes the first 16 character in the name most significant and allows the word to contain 17 trailing "garbage". The efficiency depends on unaligned 64-bit reads 18 with a fast byteswapping operation, but there is a fall-back that 19 works in all cases via an unaligned read support function. 20 21 The trie is constructed by sorting all field names and taking a 22 median split. If the split name has preceeding name which is a strict 23 prefix, that name is chosen as split instead. This is important 24 because words are read with trailing "garbage". If the median name is 25 longer than a 64 bit word, it is treated specially where a match 26 triggers reading more data and repeating the process. 27 28 The median is tested for less than, but not equality. Each side if 29 the choice uses a new median like above, until there is only one 30 option left. This option is then tested for exact match by masking 31 out any trailing data. If the match succeeds the input member name 32 must be checked for closing `"` or other valid termination if 33 allowing unquoted names. 34 35 Note that this match only visits data once, unlike a hash table where 36 a lookup still requires matching the name. Since we expect the number 37 of field names per table to be reasonable, this approach is likely 38 faster than a hash table. 39 40 If the match fails, then an error can be generated with a "unknown 41 field" error, or the field can be consumed by the generic json 42 parser. 43 44 When a member name is successfully matched againts a known field, the 45 member value is expected to be of a primitive type such as a string 46 or an integer, the value is parsed using a json support parser and 47 type converter which may issue overflow and format errors. For 48 example, integers won't accept decimal points and ubyte fields won't 49 accept integers of value 256 or higher. If the parse was successful 50 the member is added the current flatbuffer table opened at start of 51 the json object. In fact, the value is parsed directly into the 52 flatbuffer builders entry for the field. 53 54 If a scalar field fails to parse its value, a second attempt is 55 done parsing the value as a symbolic value. This parse is similar 56 to parsing member names but uses a global set of known constants 57 from enumerations. If this also fails, the field is invalid and 58 an error is generated. 59 60 When the field is parsed, comma is detected and the process is 61 repeated. It may fail on duplicate fields because the builder will 62 complain. 63 64 Once a closing bracket is matched, the table is closed in the 65 builder. 66 67 In the above we discussed tables, but structs work the same with the 68 exception that any fields not present are simply set to zero and it 69 is not checked. Nor are duplicate fields checked. This is to avoid 70 allocating extra space for something that isn't very important. 71 72 Complex member types require more consideration. We wish to avoid a 73 recursive descent parser because it imposes limits on nesting depth, 74 but we want to have a composable parser. This leads to solutions such 75 as a trampoline, returning a parser function for the matched field 76 which is called by a driver function, before resuming the current 77 parser. We wish to avoid this as it adds overhead, especially in 78 inlining. 79 80 To avoid trampolines we compile all tables and structs into one large 81 function where each type can be reached by a goto label or a switch 82 statement. We can then use a variant of Simon Tathams famous 83 co-routine by duff device to abort and resume the current parse. We 84 still need a stack to track return state. 85 <http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html> 86 87 Outside the giant state machine we provide wrappers so we can parse 88 each type from a clean call interface to parse root type. 89 90 The above solution has one major problem: we apparently have to 91 include types from included schema as well. But we don't have to it 92 turns out: Each included schema is closed and can never have a member 93 type in our current schema. Therefore mutual recursion is not 94 possible. This means we can afford to use function call recursion 95 into included parsers with recursion depth bounded by the number of 96 included schema. 97 98 Spaces are optimized by first testing to see if there is a character 99 above space and exit immediately if so. Otherwise, if unaligned 100 access is valid and the buffer holds at least 16 bytes, we descend 101 hierarchically to test ever small unligned loads against a word of 102 spaces. This loops as long as 2x64 bit words can be matched to a word 103 of pure spaces (ASCII 0x20). If there is any control character, or 104 if we reach near the end or, or if unaligned access is not available, 105 we bail out to a standard character parsing loop that also does line 106 counting. 107 108 A final, and major issue, is how to handle unions: 109 110 Googles `flatc` compiler makes the assumtion that the union type 111 field appear before the union table field. However, this order cannot 112 be trusted in standard JSON. If we were to sort the fields first 113 or prescan for type, it would complete ruin our parsing strategy. 114 115 To handle unions efficiently we therefore require either that union 116 fields appear before the union table, and that the associated 117 union table is present before another unions type field, or we 118 require that the member name is tagged with its type. We prefer 119 the tagged approach, but it is not compliant with `flatc v1.2`. 120 121 By using a tag such as `"test_as_Monster"` instead of just "test", 122 the parser can treat the field as any other table field. Of course 123 there is only allowed to be one instance, so `"test_as_Weapon"` 124 in the same object would not be allowed. To the parser it just 125 triggers a duplicate field error. 126 127 In addition, the `"test_type"` field is allowed to appear anywhere, 128 or to be absent. If it is present, it must have a type that 129 matches the tagged member, unless it is of type NONE. NONE may 130 also be represented as `"test_as_NONE": null`. 131 132 While the tagged approach has no use of the type field, it may still 133 be very useful to other JSON consumers so they can know what tagged 134 field to look up in the JSON object. 135 136 The parser handles the type field by setting the type field in the 137 flatcc builder table and checking if it is already set when 138 seeing a tagged or untagged union table field.