docs/source/Internals.md

   1 FlatBuffer Internals    {#flatbuffers_internals}
   2 ====================
   3
   4 This section is entirely optional for the use of FlatBuffers. In normal
   5 usage, you should never need the information contained herein. If you're
   6 interested however, it should give you more of an appreciation of why
   7 FlatBuffers is both efficient and convenient.
   8
   9 ### Format components
  10
  11 A FlatBuffer is a binary file and in-memory format consisting mostly of
  12 scalars of various sizes, all aligned to their own size. Each scalar is
  13 also always represented in little-endian format, as this corresponds to
  14 all commonly used CPUs today. FlatBuffers will also work on big-endian
  15 machines, but will be slightly slower because of additional
  16 byte-swap intrinsics.
  17
  18 It is assumed that the following conditions are met, to ensure
  19 cross-platform interoperability:
  20 - The binary `IEEE-754` format is used for floating-point numbers.
  21 - The `two's complemented` representation is used for signed integers.
  22 - The endianness is the same for floating-point numbers as for integers.
  23
  24 On purpose, the format leaves a lot of details about where exactly
  25 things live in memory undefined, e.g. fields in a table can have any
  26 order, and objects to some extent can be stored in many orders. This is
  27 because the format doesn't need this information to be efficient, and it
  28 leaves room for optimization and extension (for example, fields can be
  29 packed in a way that is most compact). Instead, the format is defined in
  30 terms of offsets and adjacency only. This may mean two different
  31 implementations may produce different binaries given the same input
  32 values, and this is perfectly valid.
  33
  34 ### Format identification
  35
  36 The format also doesn't contain information for format identification
  37 and versioning, which is also by design. FlatBuffers is a statically typed
  38 system, meaning the user of a buffer needs to know what kind of buffer
  39 it is. FlatBuffers can of course be wrapped inside other containers
  40 where needed, or you can use its union feature to dynamically identify
  41 multiple possible sub-objects stored. Additionally, it can be used
  42 together with the schema parser if full reflective capabilities are
  43 desired.
  44
  45 Versioning is something that is intrinsically part of the format (the
  46 optionality / extensibility of fields), so the format itself does not
  47 need a version number (it's a meta-format, in a sense). We're hoping
  48 that this format can accommodate all data needed. If format breaking
  49 changes are ever necessary, it would become a new kind of format rather
  50 than just a variation.
  51
  52 ### Offsets
  53
  54 The most important and generic offset type (see `flatbuffers.h`) is
  55 `uoffset_t`, which is currently always a `uint32_t`, and is used to
  56 refer to all tables/unions/strings/vectors (these are never stored
  57 in-line). 32bit is
  58 intentional, since we want to keep the format binary compatible between
  59 32 and 64bit systems, and a 64bit offset would bloat the size for almost
  60 all uses. A version of this format with 64bit (or 16bit) offsets is easy to set
  61 when needed. Unsigned means they can only point in one direction, which
  62 typically is forward (towards a higher memory location). Any backwards
  63 offsets will be explicitly marked as such.
  64
  65 The format starts with an `uoffset_t` to the root table in the buffer.
  66
  67 We have two kinds of objects, structs and tables.
  68
  69 ### Structs
  70
  71 These are the simplest, and as mentioned, intended for simple data that
  72 benefits from being extra efficient and doesn't need versioning /
  73 extensibility. They are always stored inline in their parent (a struct,
  74 table, or vector) for maximum compactness. Structs define a consistent
  75 memory layout where all components are aligned to their size, and
  76 structs aligned to their largest scalar member. This is done independent
  77 of the alignment rules of the underlying compiler to guarantee a cross
  78 platform compatible layout. This layout is then enforced in the generated
  79 code.
  80
  81 ### Tables
  82
  83 Unlike structs, these are not stored in inline in their parent, but are
  84 referred to by offset.
  85
  86 They start with an `soffset_t` to a vtable. This is a signed version of
  87 `uoffset_t`, since vtables may be stored anywhere relative to the object.
  88 This offset is substracted (not added) from the object start to arrive at
  89 the vtable start. This offset is followed by all the
  90 fields as aligned scalars (or offsets). Unlike structs, not all fields
  91 need to be present. There is no set order and layout.
  92
  93 To be able to access fields regardless of these uncertainties, we go
  94 through a vtable of offsets. Vtables are shared between any objects that
  95 happen to have the same vtable values.
  96
  97 The elements of a vtable are all of type `voffset_t`, which is
  98 a `uint16_t`. The first element is the size of the vtable in bytes,
  99 including the size element. The second one is the size of the object, in bytes
 100 (including the vtable offset). This size could be used for streaming, to know
 101 how many bytes to read to be able to access all *inline* fields of the object.
 102 The remaining elements are the N offsets, where N is the amount of fields
 103 declared in the schema when the code that constructed this buffer was
 104 compiled (thus, the size of the table is N + 2).
 105
 106 All accessor functions in the generated code for tables contain the
 107 offset into this table as a constant. This offset is checked against the
 108 first field (the number of elements), to protect against newer code
 109 reading older data. If this offset is out of range, or the vtable entry
 110 is 0, that means the field is not present in this object, and the
 111 default value is return. Otherwise, the entry is used as offset to the
 112 field to be read.
 113
 114 ### Strings and Vectors
 115
 116 Strings are simply a vector of bytes, and are always
 117 null-terminated. Vectors are stored as contiguous aligned scalar
 118 elements prefixed by a 32bit element count (not including any
 119 null termination). Neither is stored inline in their parent, but are referred to
 120 by offset.
 121
 122 ### Construction
 123
 124 The current implementation constructs these buffers backwards (starting
 125 at the highest memory address of the buffer), since
 126 that significantly reduces the amount of bookkeeping and simplifies the
 127 construction API.
 128
 129 ### Code example
 130
 131 Here's an example of the code that gets generated for the `samples/monster.fbs`.
 132 What follows is the entire file, broken up by comments:
 133
 134     // automatically generated, do not modify
 135
 136     #include "flatbuffers/flatbuffers.h"
 137
 138     namespace MyGame {
 139     namespace Sample {
 140
 141 Nested namespace support.
 142
 143     enum {
 144       Color_Red = 0,
 145       Color_Green = 1,
 146       Color_Blue = 2,
 147     };
 148
 149     inline const char **EnumNamesColor() {
 150       static const char *names[] = { "Red", "Green", "Blue", nullptr };
 151       return names;
 152     }
 153
 154     inline const char *EnumNameColor(int e) { return EnumNamesColor()[e]; }
 155
 156 Enums and convenient reverse lookup.
 157
 158     enum {
 159       Any_NONE = 0,
 160       Any_Monster = 1,
 161     };
 162
 163     inline const char **EnumNamesAny() {
 164       static const char *names[] = { "NONE", "Monster", nullptr };
 165       return names;
 166     }
 167
 168     inline const char *EnumNameAny(int e) { return EnumNamesAny()[e]; }
 169
 170 Unions share a lot with enums.
 171
 172     struct Vec3;
 173     struct Monster;
 174
 175 Predeclare all data types since circular references between types are allowed
 176 (circular references between object are not, though).
 177
 178     FLATBUFFERS_MANUALLY_ALIGNED_STRUCT(4) Vec3 {
 179      private:
 180       float x_;
 181       float y_;
 182       float z_;
 183
 184      public:
 185       Vec3(float x, float y, float z)
 186         : x_(flatbuffers::EndianScalar(x)), y_(flatbuffers::EndianScalar(y)), z_(flatbuffers::EndianScalar(z)) {}
 187
 188       float x() const { return flatbuffers::EndianScalar(x_); }
 189       float y() const { return flatbuffers::EndianScalar(y_); }
 190       float z() const { return flatbuffers::EndianScalar(z_); }
 191     };
 192     FLATBUFFERS_STRUCT_END(Vec3, 12);
 193
 194 These ugly macros do a couple of things: they turn off any padding the compiler
 195 might normally do, since we add padding manually (though none in this example),
 196 and they enforce alignment chosen by FlatBuffers. This ensures the layout of
 197 this struct will look the same regardless of compiler and platform. Note that
 198 the fields are private: this is because these store little endian scalars
 199 regardless of platform (since this is part of the serialized data).
 200 `EndianScalar` then converts back and forth, which is a no-op on all current
 201 mobile and desktop platforms, and a single machine instruction on the few
 202 remaining big endian platforms.
 203
 204     struct Monster : private flatbuffers::Table {
 205       const Vec3 *pos() const { return GetStruct<const Vec3 *>(4); }
 206       int16_t mana() const { return GetField<int16_t>(6, 150); }
 207       int16_t hp() const { return GetField<int16_t>(8, 100); }
 208       const flatbuffers::String *name() const { return GetPointer<const flatbuffers::String *>(10); }
 209       const flatbuffers::Vector<uint8_t> *inventory() const { return GetPointer<const flatbuffers::Vector<uint8_t> *>(14); }
 210       int8_t color() const { return GetField<int8_t>(16, 2); }
 211     };
 212
 213 Tables are a bit more complicated. A table accessor struct is used to point at
 214 the serialized data for a table, which always starts with an offset to its
 215 vtable. It derives from `Table`, which contains the `GetField` helper functions.
 216 GetField takes a vtable offset, and a default value. It will look in the vtable
 217 at that offset. If the offset is out of bounds (data from an older version) or
 218 the vtable entry is 0, the field is not present and the default is returned.
 219 Otherwise, it uses the entry as an offset into the table to locate the field.
 220
 221     struct MonsterBuilder {
 222       flatbuffers::FlatBufferBuilder &fbb_;
 223       flatbuffers::uoffset_t start_;
 224       void add_pos(const Vec3 *pos) { fbb_.AddStruct(4, pos); }
 225       void add_mana(int16_t mana) { fbb_.AddElement<int16_t>(6, mana, 150); }
 226       void add_hp(int16_t hp) { fbb_.AddElement<int16_t>(8, hp, 100); }
 227       void add_name(flatbuffers::Offset<flatbuffers::String> name) { fbb_.AddOffset(10, name); }
 228       void add_inventory(flatbuffers::Offset<flatbuffers::Vector<uint8_t>> inventory) { fbb_.AddOffset(14, inventory); }
 229       void add_color(int8_t color) { fbb_.AddElement<int8_t>(16, color, 2); }
 230       MonsterBuilder(flatbuffers::FlatBufferBuilder &_fbb) : fbb_(_fbb) { start_ = fbb_.StartTable(); }
 231       flatbuffers::Offset<Monster> Finish() { return flatbuffers::Offset<Monster>(fbb_.EndTable(start_, 7)); }
 232     };
 233
 234 `MonsterBuilder` is the base helper struct to construct a table using a
 235 `FlatBufferBuilder`. You can add the fields in any order, and the `Finish`
 236 call will ensure the correct vtable gets generated.
 237
 238     inline flatbuffers::Offset<Monster> CreateMonster(flatbuffers::FlatBufferBuilder &_fbb,
 239                                                       const Vec3 *pos, int16_t mana,
 240                                                       int16_t hp,
 241                                                       flatbuffers::Offset<flatbuffers::String> name,
 242                                                       flatbuffers::Offset<flatbuffers::Vector<uint8_t>> inventory,
 243                                                       int8_t color) {
 244       MonsterBuilder builder_(_fbb);
 245       builder_.add_inventory(inventory);
 246       builder_.add_name(name);
 247       builder_.add_pos(pos);
 248       builder_.add_hp(hp);
 249       builder_.add_mana(mana);
 250       builder_.add_color(color);
 251       return builder_.Finish();
 252     }
 253
 254 `CreateMonster` is a convenience function that calls all functions in
 255 `MonsterBuilder` above for you. Note that if you pass values which are
 256 defaults as arguments, it will not actually construct that field, so
 257 you can probably use this function instead of the builder class in
 258 almost all cases.
 259
 260     inline const Monster *GetMonster(const void *buf) { return flatbuffers::GetRoot<Monster>(buf); }
 261
 262 This function is only generated for the root table type, to be able to
 263 start traversing a FlatBuffer from a raw buffer pointer.
 264
 265     }; // namespace MyGame
 266     }; // namespace Sample
 267
 268 ### Encoding example.
 269
 270 Below is a sample encoding for the following JSON corresponding to the above
 271 schema:
 272
 273     { pos: { x: 1, y: 2, z: 3 }, name: "fred", hp: 50 }
 274
 275 Resulting in this binary buffer:
 276
 277     // Start of the buffer:
 278     uint32_t 20  // Offset to the root table.
 279
 280     // Start of the vtable. Not shared in this example, but could be:
 281     uint16_t 16 // Size of table, starting from here.
 282     uint16_t 22 // Size of object inline data.
 283     uint16_t 4, 0, 20, 16, 0, 0  // Offsets to fields from start of (root) table, 0 for not present.
 284
 285     // Start of the root table:
 286     int32_t 16     // Offset to vtable used (default negative direction)
 287     float 1, 2, 3  // the Vec3 struct, inline.
 288     uint32_t 8     // Offset to the name string.
 289     int16_t 50     // hp field.
 290     int16_t 0      // Padding for alignment.
 291
 292     // Start of name string:
 293     uint32_t 4  // Length of string.
 294     int8_t 'f', 'r', 'e', 'd', 0, 0, 0, 0  // Text + 0 termination + padding.
 295
 296 Note that this not the only possible encoding, since the writer has some
 297 flexibility in which of the children of root object to write first (though in
 298 this case there's only one string), and what order to write the fields in.
 299 Different orders may also cause different alignments to happen.
 300
 301 ### Additional reading.
 302
 303 The author of the C language implementation has made a similar
 304 [document](https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#flatbuffers-binary-format)
 305 that may further help clarify the format.
 306
 307 # FlexBuffers
 308
 309 The [schema-less](@ref flexbuffers) version of FlatBuffers have their
 310 own encoding, detailed here.
 311
 312 It shares many properties mentioned above, in that all data is accessed
 313 over offsets, all scalars are aligned to their own size, and
 314 all data is always stored in little endian format.
 315
 316 One difference is that FlexBuffers are built front to back, so children are
 317 stored before parents, and the root of the data starts at the last byte.
 318
 319 Another difference is that scalar data is stored with a variable number of bits
 320 (8/16/32/64). The current width is always determined by the *parent*, i.e. if
 321 the scalar sits in a vector, the vector determines the bit width for all
 322 elements at once. Selecting the minimum bit width for a particular vector is
 323 something the encoder does automatically and thus is typically of no concern
 324 to the user, though being aware of this feature (and not sticking a double in
 325 the same vector as a bunch of byte sized elements) is helpful for efficiency.
 326
 327 Unlike FlatBuffers there is only one kind of offset, and that is an unsigned
 328 integer indicating the number of bytes in a negative direction from the address
 329 of itself (where the offset is stored).
 330
 331 ### Vectors
 332
 333 The representation of the vector is at the core of how FlexBuffers works (since
 334 maps are really just a combination of 2 vectors), so it is worth starting there.
 335
 336 As mentioned, a vector is governed by a single bit width (supplied by its
 337 parent). This includes the size field. For example, a vector that stores the
 338 integer values `1, 2, 3` is encoded as follows:
 339
 340     uint8_t 3, 1, 2, 3, 4, 4, 4
 341
 342 The first `3` is the size field, and is placed before the vector (an offset
 343 from the parent to this vector points to the first element, not the size
 344 field, so the size field is effectively at index -1).
 345 Since this is an untyped vector `SL_VECTOR`, it is followed by 3 type
 346 bytes (one per element of the vector), which are always following the vector,
 347 and are always a uint8_t even if the vector is made up of bigger scalars.
 348
 349 ### Types
 350
 351 A type byte is made up of 2 components (see flexbuffers.h for exact values):
 352
 353 * 2 lower bits representing the bit-width of the child (8, 16, 32, 64).
 354   This is only used if the child is accessed over an offset, such as a child
 355   vector. It is ignored for inline types.
 356 * 6 bits representing the actual type (see flexbuffers.h).
 357
 358 Thus, in this example `4` means 8 bit child (value 0, unused, since the value is
 359 in-line), type `SL_INT` (value 1).
 360
 361 ### Typed Vectors
 362
 363 These are like the Vectors above, but omit the type bytes. The type is instead
 364 determined by the vector type supplied by the parent. Typed vectors are only
 365 available for a subset of types for which these savings can be significant,
 366 namely inline signed/unsigned integers (`TYPE_VECTOR_INT` / `TYPE_VECTOR_UINT`),
 367 floats (`TYPE_VECTOR_FLOAT`), and keys (`TYPE_VECTOR_KEY`, see below).
 368
 369 Additionally, for scalars, there are fixed length vectors of sizes 2 / 3 / 4
 370 that don't store the size (`TYPE_VECTOR_INT2` etc.), for an additional savings
 371 in space when storing common vector or color data.
 372
 373 ### Scalars
 374
 375 FlexBuffers supports integers (`TYPE_INT` and `TYPE_UINT`) and floats
 376 (`TYPE_FLOAT`), available in the bit-widths mentioned above. They can be stored
 377 both inline and over an offset (`TYPE_INDIRECT_*`).
 378
 379 The offset version is useful to encode costly 64bit (or even 32bit) quantities
 380 into vectors / maps of smaller sizes, and to share / repeat a value multiple
 381 times.
 382
 383 ### Booleans and Nulls
 384
 385 Booleans (`TYPE_BOOL`) and nulls (`TYPE_NULL`) are encoded as inlined unsigned integers.
 386
 387 ### Blobs, Strings and Keys.
 388
 389 A blob (`TYPE_BLOB`) is encoded similar to a vector, with one difference: the
 390 elements are always `uint8_t`. The parent bit width only determines the width of
 391 the size field, allowing blobs to be large without the elements being large.
 392
 393 Strings (`TYPE_STRING`) are similar to blobs, except they have an additional 0
 394 termination byte for convenience, and they MUST be UTF-8 encoded (since an
 395 accessor in a language that does not support pointers to UTF-8 data may have to
 396 convert them to a native string type).
 397
 398 A "Key" (`TYPE_KEY`) is similar to a string, but doesn't store the size
 399 field. They're so named because they are used with maps, which don't care
 400 for the size, and can thus be even more compact. Unlike strings, keys cannot
 401 contain bytes of value 0 as part of their data (size can only be determined by
 402 `strlen`), so while you can use them outside the context of maps if you so
 403 desire, you're usually better off with strings.
 404
 405 ### Maps
 406
 407 A map (`TYPE_MAP`) is like an (untyped) vector, but with 2 prefixes before the
 408 size field:
 409
 410 | index | field                                                        |
 411 | ----: | :----------------------------------------------------------- |
 412 | -3    | An offset to the keys vector (may be shared between tables). |
 413 | -2    | Byte width of the keys vector.                               |
 414 | -1    | Size (from here on it is compatible with `TYPE_VECTOR`)      |
 415 | 0     | Elements.                                                    |
 416 | Size  | Types.                                                       |
 417
 418 Since a map is otherwise the same as a vector, it can be iterated like
 419 a vector (which is probably faster than lookup by key).
 420
 421 The keys vector is a typed vector of keys. Both the keys and corresponding
 422 values *have* to be stored in sorted order (as determined by `strcmp`), such
 423 that lookups can be made using binary search.
 424
 425 The reason the key vector is a seperate structure from the value vector is
 426 such that it can be shared between multiple value vectors, and also to
 427 allow it to be treated as its own individual vector in code.
 428
 429 An example map { foo: 13, bar: 14 } would be encoded as:
 430
 431     0 : uint8_t 'b', 'a', 'r', 0
 432     4 : uint8_t 'f', 'o', 'o', 0
 433     8 : uint8_t 2      // key vector of size 2
 434     // key vector offset points here
 435     9 : uint8_t 9, 6   // offsets to bar_key and foo_key
 436     11: uint8_t 2, 1   // offset to key vector, and its byte width
 437     13: uint8_t 2      // value vector of size
 438     // value vector offset points here
 439     14: uint8_t 14, 13 // values
 440     16: uint8_t 4, 4   // types
 441
 442 ### The root
 443
 444 As mentioned, the root starts at the end of the buffer.
 445 The last uint8_t is the width in bytes of the root (normally the parent
 446 determines the width, but the root has no parent). The uint8_t before this is
 447 the type of the root, and the bytes before that are the root value (of the
 448 number of bytes specified by the last byte).
 449
 450 So for example, the integer value `13` as root would be:
 451
 452     uint8_t 13, 4, 1    // Value, type, root byte width.
 453
 454
 455 <br>