lib/misc/fts/README.md

   1 # LWS Full Text Search
   2
   3 ## Introduction
   4
   5 ![lwsac flow](/doc-assets/lws-fts.svg)
   6
   7 The general approach is to scan one or more UTF-8 input text "files" (they may
   8 only exist in memory) and create an in-memory optimized trie for every token in
   9 the file.
  10
  11 This can then be serialized out to disk in the form of a single index file (no
  12 matter how many input files were involved or how large they were).
  13
  14 The implementation is designed to be modest on memory and cpu for both index
  15 creation and querying, and suitable for weak machines with some kind of random
  16 access storage.  For searching only memory to hold results is required, the
  17 actual searches and autocomplete suggestions are done very rapidly by seeking
  18 around structures in the on-disk index file.
  19
  20 Function|Related Link
  21 ---|---
  22 Public API|[include/libwebsockets/lws-fts.h](https://libwebsockets.org/git/libwebsockets/tree/include/libwebsockets/lws-fts.h)
  23 CI test app|[minimal-examples/api-tests/api-test-fts](https://libwebsockets.org/git/libwebsockets/tree/minimal-examples/api-tests/api-test-fts)
  24 Demo minimal example|[minimal-examples/http-server/minimal-http-server-fulltext-search](https://libwebsockets.org/git/libwebsockets/tree/minimal-examples/http-server/minimal-http-server-fulltext-search)
  25 Live Demo|[https://libwebsockets.org/ftsdemo/](https://libwebsockets.org/ftsdemo/)
  26
  27 ## Query API overview
  28
  29 Searching returns a potentially very large lwsac allocated object, with contents
  30 and max size controlled by the members of a struct lws_fts_search_params passed
  31 to the search function.  Three kinds of result are possible:
  32
  33 ### Autocomplete suggestions
  34
  35 These are useful to provide lists of extant results in
  36 realtime as the user types characters that constrain the search.  So if the
  37 user has typed 'len', any hits for 'len' itself are reported along with
  38 'length', and whatever else is in the index beginning 'len'..  The results are
  39 selected using and are accompanied by an aggregated count of results down that
  40 path, and the results so the "most likely" results already measured by potential
  41 hits appear first.
  42
  43 These results are in a linked-list headed by `result.autocomplete_head` and
  44 each is in a `struct lws_fts_result_autocomplete`.
  45
  46 They're enabled in the search results by giving the flag
  47  `LWSFTS_F_QUERY_AUTOCOMPLETE` in the search parameter flags.
  48
  49 ### Filepath results
  50
  51 Simply a list of input files containing the search term with some statistics,
  52 one file is mentioned in a `struct lws_fts_result_filepath` result struct.
  53
  54 This would be useful for creating a selection UI to "drill down" to individual
  55 files when there are many with matches.
  56
  57 This is enabled by the `LWSFTS_F_QUERY_FILES` search flag.
  58
  59 ### Filepath and line results
  60
  61 Same as the file path list, but for each filepath, information on the line
  62 numbers and input file offset where the line starts are provided.
  63
  64 This is enabled by `LWSFTS_F_QUERY_FILE_LINES`... if you additionally give
  65 `LWSFTS_F_QUERY_QUOTE_LINE` flag then the contents of each hit line from the
  66 input file are also provided.
  67
  68 ## Result format inside the lwsac
  69
  70 A `struct lws_fts_result` at the start of the lwsac contains heads for linked-
  71 lists of autocomplete and filepath results inside the lwsac.
  72
  73 For autocomplete suggestions, the string itself is immediately after the
  74 `struct lws_fts_result_autocomplete` in memory.  For filepath results, after
  75 each `struct lws_fts_result_filepath` is
  76
  77  - match information depending on the flags given to the search
  78  - the filepath string
  79
  80 You can always skip the line number table to get the filepath string by adding
  81 .matches_length to the address of the byte after the struct.
  82
  83 The matches information is either
  84
  85  - 0 bytes per match
  86
  87  - 2x int32_t per match (8 bytes) if `LWSFTS_F_QUERY_FILE_LINES` given... the
  88    first is the native-endian line number of the match, the second is the
  89    byte offset in the original file where that line starts
  90
  91  - 2 x int32_t as above plus a const char * if `LWSFTS_F_QUERY_QUOTE_LINE` is
  92    also given... this points to a NUL terminated string also stored in the
  93    results lwsac that contains up to 255 chars of the line from the original
  94    file.  In some cases, the original file was either virtual (you are indexing
  95    a git revision) or is not stored with the index, in that case you can't
  96    usefully use `LWSFTS_F_QUERY_QUOTE_LINE`.
  97
  98 To facilitate interpreting what is stored per match, the original search flags
  99 that created the result are stored in the `struct lws_fts_result`.
 100
 101 ## Indexing In-memory and serialized to file
 102
 103 When creating the trie, in-memory structs are used with various optimization
 104 schemes trading off memory usage for speed.  While in-memory, it's possible to
 105 add more indexed filepaths to the single index.  Once the trie is complete in
 106 terms of having indexed everything, it is serialized to disk.
 107
 108 These contain many additional housekeeping pointers and trie entries which can
 109 be optimized out.  Most in-memory values must be held literally in large types,
 110 whereas most of the values in the serialized file use smaller VLI which use
 111 more or less bytes according to the value.  So the peak memory requirements for
 112 large tries are much bigger than the size of the serialized trie file that is
 113 output.
 114
 115 For the linux kernel at 4.14 and default indexing whitelist on a 2.8GHz AMD
 116 threadripper (using one thread), the stats are:
 117
 118 Name|Value
 119 ---|---
 120 Files indexed|52932
 121 Input corpus size|694MiB
 122 Indexing cpu time|50.1s (>1000 files / sec; 13.8MBytes/sec)
 123 Peak alloc|78MiB
 124 Serialization time|202ms
 125 Trie File size|347MiB
 126
 127 To index libwebsockets master under the same conditions:
 128
 129 Name|Value
 130 ---|---
 131 Files indexed|489
 132 Input corpus size|3MiB
 133 Indexing time|123ms
 134 Peak alloc|3MiB
 135 Serialization time|1ms
 136 Trie File size|1.4MiB
 137
 138
 139 Once it's generated, querying the trie file is very inexpensive, even when there
 140 are lots of results.
 141
 142  - trie entry child lists are kept sorted by the character they map to.  This
 143    allows discovering there is no match as soon as a character later in the
 144    order than the one being matched is seen
 145
 146  - for the root trie, in addition to the linked-list child + sibling entries,
 147    a 256-entry pointer table is associated with the root trie, allowing one-
 148    step lookup.  But as the table is 2KiB, it's too expensive to use on all
 149    trie entries
 150
 151 ## Structure on disk
 152
 153 All explicit multibyte numbers are stored in Network (MSB-first) byte order.
 154
 155  - file header
 156  - filepath line number tables
 157  - filepath information
 158  - filepath map table
 159  - tries, trie instances (hits), trie child tables
 160
 161 ### VLI coding
 162
 163 VLI (Variable Length Integer) coding works like this
 164
 165 [b7 EON] [b6 .. b0  DATA]
 166
 167 If EON = 0, then DATA represents the Least-significant 7 bits of the number.
 168 if EON = 1, DATA represents More-significant 7-bits that should be shifted
 169 left until the byte with EON = 0 is found to terminate the number.
 170
 171 The VLI used is predicated around 32-bit unsigned integers
 172
 173 Examples:
 174
 175  - 0x30            =    48
 176  - 0x81 30         =   176
 177  - 0x81 0x80 0x00  = 16384
 178
 179 Bytes | Range
 180 ---|---
 181 1|<= 127
 182 2|<= 16K - 1
 183 3|<= 2M -1
 184 4|<= 256M - 1
 185 5|<= 4G - 1
 186
 187 The coding is very efficient if there's a high probabilty the number being
 188 stored is not large.  So it's great for line numbers for example, where most
 189 files have less that 16K lines and the VLI for the line number fits in 2 bytes,
 190 but if you meet a huge file, the VLI coding can also handle it.
 191
 192 All numbers except a few in the headers that are actually written after the
 193 following data are stored using VLI for space- efficiency without limiting
 194 capability.  The numbers that are fixed up after the fact have to have a fixed
 195 size and can't use VLI.
 196
 197 ### File header
 198
 199 The first byte of the file header where the magic is, is "fileoffset" 0.  All
 200 the stored "fileoffset"s are relative to that.
 201
 202 The header has a fixed size of 16 bytes.
 203
 204 size|function
 205 ---|---
 206 32-bits|Magic 0xCA7A5F75
 207 32-bits|Fileoffset to root trie entry
 208 32-bits|Size of the trie file when it was created (to detect truncation)
 209 32-bits|Fileoffset to the filepath map
 210 32-bits|Number of filepaths
 211
 212 ### Filepath line tables
 213
 214 Immediately after the file header are the line length tables.
 215
 216 As the input files are parsed, line length tables are written for each file...
 217 at that time the rest of the parser data is held in memory so nothing else is
 218 in the file yet.  These allow you to map logical line numbers in the file to
 219 file offsets space- and time- efficiently without having to walk through the
 220 file contents.
 221
 222 The line information is cut into blocks, allowing quick skipping over the VLI
 223 data that doesn't contain the line you want just by following the 8-byte header
 224 part.
 225
 226 Once you find the block with your line, you have to iteratively add the VLIs
 227 until you hit the one you want.
 228
 229 For normal text files with average line length below 128, the VLIs will
 230 typically be a single byte.  So a block of 200 line lengths is typically
 231 208 bytes long.
 232
 233 There is a final linetable chunk consisting of all zeros to indicate the end
 234 of the filepath line chunk series for a filepath.
 235
 236 size|function
 237 ---|---
 238 16-bit|length of this chunk itself in bytes
 239 16-bit|count of lines covered in this chunk
 240 32-bit|count of bytes in the input file this chunk covers
 241 VLI...|for each line in the chunk, the number of bytes in the line
 242
 243
 244 ### Filepaths
 245
 246 The single trie in the file may contain information from multiple files, for
 247 example one trie may cover all files in a directory.  The "Filepaths" are
 248 listed after the line tables, and referred to by index thereafter.
 249
 250 For each filepath, one after the other:
 251
 252 size|function
 253 ---|---
 254 VLI|fileoffset of the start of this filepath's line table
 255 VLI|count of lines in the file
 256 VLI|length of filepath in bytes
 257 ...|the filepath (with no NUL)
 258
 259 ### Filepath map
 260
 261 To facilitate rapid filepath lookup, there's a filepath map table with a 32-bit
 262 fileoffset per filepath.  This is the way to convert filepath indexes to
 263 information on the filepath like its name, etc
 264
 265 size|function
 266 ---|---
 267 32-bit...|fileoffset to filepath table for each filepath
 268
 269 ### Trie entries
 270
 271 Immediately after that, the trie entries are dumped, for each one a header:
 272
 273 #### Trie entry header
 274
 275 size|function
 276 ---|---
 277 VLI|Fileoffset of first file table in this trie entry instance list
 278 VLI|number of child trie entries this trie entry has
 279 VLI|number of instances this trie entry has
 280
 281 The child list follows immediately after this header
 282
 283 #### Trie entry instance file
 284
 285 For each file that has instances of this symbol:
 286
 287 size|function
 288 ---|---
 289 VLI|Fileoffset of next file table in this trie entry instance list
 290 VLI|filepath index
 291 VLI|count of line number instances following
 292
 293 #### Trie entry file line number table
 294
 295 Then for the file mentioned above, a list of all line numbers in the file with
 296 the symbol in them, in ascending order.  As a VLI, the median size per entry
 297 will typically be ~15.9 bits due to the probability of line numbers below 16K.
 298
 299 size|function
 300 ---|---
 301 VLI|line number
 302 ...
 303
 304 #### Trie entry child table
 305
 306 For each child node
 307
 308 size|function
 309 ---|---
 310 VLI|file offset of child
 311 VLI|instance count belonging directly to this child
 312 VLI|aggregated number of instances down all descendent paths of child
 313 VLI|aggregated number of children down all descendent paths of child
 314 VLI|match string length
 315 ...|the match string