third_party/pigweed/repo/pw_tokenizer/docs.rst

   1 .. _module-pw_tokenizer:
   2
   3 ------------
   4 pw_tokenizer
   5 ------------
   6 Logging is critical, but developers are often forced to choose between
   7 additional logging and saving crucial flash space. The ``pw_tokenizer`` module
   8 helps address this by replacing printf-style strings with binary tokens during
   9 compilation. This enables extensive logging with substantially less memory
  10 usage.
  11
  12 .. note::
  13   This usage of the term "tokenizer" is not related to parsing! The
  14   module is called tokenizer because it replaces a whole string literal with an
  15   integer token. It does not parse strings into separate tokens.
  16
  17 The most common application of ``pw_tokenizer`` is binary logging, and it is
  18 designed to integrate easily into existing logging systems. However, the
  19 tokenizer is general purpose and can be used to tokenize any strings, with or
  20 without printf-style arguments.
  21
  22 **Why tokenize strings?**
  23
  24   * Dramatically reduce binary size by removing string literals from binaries.
  25   * Reduce I/O traffic, RAM, and flash usage by sending and storing compact
  26     tokens instead of strings. We've seen over 50% reduction in encoded log
  27     contents.
  28   * Reduce CPU usage by replacing snprintf calls with simple tokenization code.
  29   * Remove potentially sensitive log, assert, and other strings from binaries.
  30
  31 Basic overview
  32 ==============
  33 There are two sides to ``pw_tokenizer``, which we call tokenization and
  34 detokenization.
  35
  36   * **Tokenization** converts string literals in the source code to
  37     binary tokens at compile time. If the string has printf-style arguments,
  38     these are encoded to compact binary form at runtime.
  39   * **Detokenization** converts tokenized strings back to the original
  40     human-readable strings.
  41
  42 Here's an overview of what happens when ``pw_tokenizer`` is used:
  43
  44   1. During compilation, the ``pw_tokenizer`` module hashes string literals to
  45      generate stable 32-bit tokens.
  46   2. The tokenization macro removes these strings by declaring them in an ELF
  47      section that is excluded from the final binary.
  48   3. After compilation, strings are extracted from the ELF to build a database
  49      of tokenized strings for use by the detokenizer. The ELF file may also be
  50      used directly.
  51   4. During operation, the device encodes the string token and its arguments, if
  52      any.
  53   5. The encoded tokenized strings are sent off-device or stored.
  54   6. Off-device, the detokenizer tools use the token database to decode the
  55      strings to human-readable form.
  56
  57 Example: tokenized logging
  58 --------------------------
  59 This example demonstrates using ``pw_tokenizer`` for logging. In this example,
  60 tokenized logging saves ~90% in binary size (41 → 4 bytes) and 70% in encoded
  61 size (49 → 15 bytes).
  62
  63 **Before**: plain text logging
  64
  65 +------------------+-------------------------------------------+---------------+
  66 | Location         | Logging Content                           | Size in bytes |
  67 +==================+===========================================+===============+
  68 | Source contains  | ``LOG("Battery state: %s; battery         |               |
  69 |                  | voltage: %d mV", state, voltage);``       |               |
  70 +------------------+-------------------------------------------+---------------+
  71 | Binary contains  | ``"Battery state: %s; battery             | 41            |
  72 |                  | voltage: %d mV"``                         |               |
  73 +------------------+-------------------------------------------+---------------+
  74 |                  | (log statement is called with             |               |
  75 |                  | ``"CHARGING"`` and ``3989`` as arguments) |               |
  76 +------------------+-------------------------------------------+---------------+
  77 | Device transmits | ``"Battery state: CHARGING; battery       | 49            |
  78 |                  | voltage: 3989 mV"``                       |               |
  79 +------------------+-------------------------------------------+---------------+
  80 | When viewed      | ``"Battery state: CHARGING; battery       |               |
  81 |                  | voltage: 3989 mV"``                       |               |
  82 +------------------+-------------------------------------------+---------------+
  83
  84 **After**: tokenized logging
  85
  86 +------------------+-----------------------------------------------------------+---------+
  87 | Location         | Logging Content                                           | Size in |
  88 |                  |                                                           | bytes   |
  89 +==================+===========================================================+=========+
  90 | Source contains  | ``LOG("Battery state: %s; battery                         |         |
  91 |                  | voltage: %d mV", state, voltage);``                       |         |
  92 +------------------+-----------------------------------------------------------+---------+
  93 | Binary contains  | ``d9 28 47 8e`` (0x8e4728d9)                              | 4       |
  94 +------------------+-----------------------------------------------------------+---------+
  95 |                  | (log statement is called with                             |         |
  96 |                  | ``"CHARGING"`` and ``3989`` as arguments)                 |         |
  97 +------------------+-----------------------------------------------------------+---------+
  98 | Device transmits | =============== ============================== ========== | 15      |
  99 |                  | ``d9 28 47 8e`` ``08 43 48 41 52 47 49 4e 47`` ``aa 3e``  |         |
 100 |                  | --------------- ------------------------------ ---------- |         |
 101 |                  | Token           ``"CHARGING"`` argument        ``3989``,  |         |
 102 |                  |                                                as         |         |
 103 |                  |                                                varint     |         |
 104 |                  | =============== ============================== ========== |         |
 105 +------------------+-----------------------------------------------------------+---------+
 106 | When viewed      | ``"Battery state: CHARGING; battery voltage: 3989 mV"``   |         |
 107 +------------------+-----------------------------------------------------------+---------+
 108
 109 Getting started
 110 ===============
 111 Integrating ``pw_tokenizer`` requires a few steps beyond building the code. This
 112 section describes one way ``pw_tokenizer`` might be integrated with a project.
 113 These steps can be adapted as needed.
 114
 115   1. Add ``pw_tokenizer`` to your build. Build files for GN, CMake, and Bazel
 116      are provided. For Make or other build systems, add the files specified in
 117      the BUILD.gn's ``pw_tokenizer`` target to the build.
 118   2. Use the tokenization macros in your code. See `Tokenization`_.
 119   3. Add the contents of ``pw_tokenizer_linker_sections.ld`` to your project's
 120      linker script. In GN and CMake, this step is done automatically.
 121   4. Compile your code to produce an ELF file.
 122   5. Run ``database.py create`` on the ELF file to generate a CSV token
 123      database. See `Managing token databases`_.
 124   6. Commit the token database to your repository. See notes in `Database
 125      management`_.
 126   7. Integrate a ``database.py add`` command to your build to automatically
 127      update the committed token database. In GN, use the
 128      ``pw_tokenizer_database`` template to do this. See `Update a database`_.
 129   8. Integrate ``detokenize.py`` or the C++ detokenization library with your
 130      tools to decode tokenized logs. See `Detokenization`_.
 131
 132 Tokenization
 133 ============
 134 Tokenization converts a string literal to a token. If it's a printf-style
 135 string, its arguments are encoded along with it. The results of tokenization can
 136 be sent off device or stored in place of a full string.
 137
 138 Tokenization macros
 139 -------------------
 140 Adding tokenization to a project is simple. To tokenize a string, include
 141 ``pw_tokenizer/tokenize.h`` and invoke one of the ``PW_TOKENIZE_`` macros.
 142
 143 Tokenize a string literal
 144 ^^^^^^^^^^^^^^^^^^^^^^^^^
 145 The ``PW_TOKENIZE_STRING`` macro converts a string literal to a ``uint32_t``
 146 token.
 147
 148 .. code-block:: cpp
 149
 150   constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");
 151
 152 .. admonition:: When to use this macro
 153
 154   Use ``PW_TOKENIZE_STRING`` to tokenize string literals that do not have
 155   %-style arguments.
 156
 157 Tokenize to a handler function
 158 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 159 ``PW_TOKENIZE_TO_GLOBAL_HANDLER`` is the most efficient tokenization macro,
 160 since it takes the fewest arguments. It encodes a tokenized string to a
 161 buffer on the stack. The size of the buffer is set with
 162 ``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.
 163
 164 This macro is provided by the ``pw_tokenizer:global_handler`` facade. The
 165 backend for this facade must define the ``pw_tokenizer_HandleEncodedMessage``
 166 C-linkage function.
 167
 168 .. code-block:: cpp
 169
 170   PW_TOKENIZE_TO_GLOBAL_HANDLER(format_string_literal, arguments...);
 171
 172   void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
 173                                          size_t size_bytes);
 174
 175 ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` is similar, but passes a
 176 ``uintptr_t`` argument to the global handler function. Values like a log level
 177 can be packed into the ``uintptr_t``.
 178
 179 This macro is provided by the ``pw_tokenizer:global_handler_with_payload``
 180 facade. The backend for this facade must define the
 181 ``pw_tokenizer_HandleEncodedMessageWithPayload`` C-linkage function.
 182
 183 .. code-block:: cpp
 184
 185   PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD(payload,
 186                                              format_string_literal,
 187                                              arguments...);
 188
 189   void pw_tokenizer_HandleEncodedMessageWithPayload(
 190       uintptr_t payload, const uint8_t encoded_message[], size_t size_bytes);
 191
 192 .. admonition:: When to use these macros
 193
 194   Use these macros whenever a global handler is sufficient, particularly for
 195   widely expanded macros, like a logging macro. ``PW_TOKENIZE_TO_GLOBAL_HANDLER``
 196   and ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` are the most efficient
 197   macros for tokenizing printf-style strings.
 198
 199 Tokenize to a callback
 200 ^^^^^^^^^^^^^^^^^^^^^^
 201 ``PW_TOKENIZE_TO_CALLBACK`` tokenizes to a buffer on the stack and calls a
 202 ``void(const uint8_t* buffer, size_t buffer_size)`` callback that is provided at
 203 the call site. The size of the buffer is set with
 204 ``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.
 205
 206 .. code-block:: cpp
 207
 208   PW_TOKENIZE_TO_CALLBACK(HandlerFunction, "Format string: %x", arguments...);
 209
 210 .. admonition:: When to use this macro
 211
 212   Use ``PW_TOKENIZE_TO_CALLBACK`` if the global handler version is already in
 213   use for another purpose or more flexibility is needed.
 214
 215 Tokenize to a buffer
 216 ^^^^^^^^^^^^^^^^^^^^
 217 The most flexible tokenization macro is ``PW_TOKENIZE_TO_BUFFER``, which encodes
 218 to a caller-provided buffer.
 219
 220 .. code-block:: cpp
 221
 222   uint8_t buffer[BUFFER_SIZE];
 223   size_t size_bytes = sizeof(buffer);
 224   PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, format_string_literal, arguments...);
 225
 226 While ``PW_TOKENIZE_TO_BUFFER`` is maximally flexible, it takes more arguments
 227 than the other macros, so its per-use code size overhead is larger.
 228
 229 .. admonition:: When to use this macro
 230
 231   Use ``PW_TOKENIZE_TO_BUFFER`` to encode to a custom-sized buffer or if the
 232   other macros are insufficient. Avoid using ``PW_TOKENIZE_TO_BUFFER`` in
 233   widely expanded macros, such as a logging macro, because it will result in
 234   larger code size than its alternatives.
 235
 236 Example: binary logging
 237 ^^^^^^^^^^^^^^^^^^^^^^^
 238 String tokenization is perfect for logging. Consider the following log macro,
 239 which gathers the file, line number, and log message. It calls the ``RecordLog``
 240 function, which formats the log string, collects a timestamp, and transmits the
 241 result.
 242
 243 .. code-block:: cpp
 244
 245   #define LOG_INFO(format, ...) \
 246       RecordLog(LogLevel_INFO, __FILE_NAME__, __LINE__, format, ##__VA_ARGS__)
 247
 248   void RecordLog(LogLevel level, const char* file, int line, const char* format,
 249                  ...) {
 250     if (level < current_log_level) {
 251       return;
 252     }
 253
 254     int bytes = snprintf(buffer, sizeof(buffer), "%s:%d ", file, line);
 255
 256     va_list args;
 257     va_start(args, format);
 258     bytes += vsnprintf(&buffer[bytes], sizeof(buffer) - bytes, format, args);
 259     va_end(args);
 260
 261     TransmitLog(TimeSinceBootMillis(), buffer, bytes);
 262   }
 263
 264 It is trivial to convert this to a binary log using the tokenizer. The
 265 ``RecordLog`` call is replaced with a
 266 ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` invocation. The
 267 ``pw_tokenizer_HandleEncodedMessageWithPayload`` implementation collects the
 268 timestamp and transmits the message with ``TransmitLog``.
 269
 270 .. code-block:: cpp
 271
 272   #define LOG_INFO(format, ...)                   \
 273       PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD( \
 274           (pw_tokenizer_Payload)LogLevel_INFO,    \
 275           __FILE_NAME__ ":%d " format,            \
 276           __LINE__,                               \
 277           ##__VA_ARGS__)
 278
 279   extern "C" void pw_tokenizer_HandleEncodedMessageWithPayload(
 280       uintptr_t level, const uint8_t encoded_message[], size_t size_bytes) {
 281     if (static_cast<LogLevel>(level) >= current_log_level) {
 282       TransmitLog(TimeSinceBootMillis(), encoded_message, size_bytes);
 283     }
 284   }
 285
 286 Note that the ``__FILE_NAME__`` string is directly included in the log format
 287 string. Since the string is tokenized, this has no effect on binary size. A
 288 ``%d`` for the line number is added to the format string, so that changing the
 289 line of the log message does not generate a new token. There is no overhead for
 290 additional tokens, but it may not be desirable to fill a token database with
 291 duplicate log lines.
 292
 293 Tokenizing function names
 294 -------------------------
 295 The string literal tokenization functions support tokenizing string literals or
 296 constexpr character arrays (``constexpr const char[]``). In GCC and Clang, the
 297 special ``__func__`` variable and ``__PRETTY_FUNCTION__`` extension are declared
 298 as ``static constexpr char[]`` in C++ instead of the standard ``static const
 299 char[]``. This means that ``__func__`` and ``__PRETTY_FUNCTION__`` can be
 300 tokenized while compiling C++ with GCC or Clang.
 301
 302 .. code-block:: cpp
 303
 304   // Tokenize the special function name variables.
 305   constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
 306   constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);
 307
 308   // Tokenize the function name variables to a handler function.
 309   PW_TOKENIZE_TO_GLOBAL_HANDLER(__func__);
 310   PW_TOKENIZE_TO_GLOBAL_HANDLER(__PRETTY_FUNCTION__);
 311
 312 Note that ``__func__`` and ``__PRETTY_FUNCTION__`` are not string literals.
 313 They are defined as static character arrays, so they cannot be implicitly
 314 concatenated with string literals. For example, ``printf(__func__ ": %d",
 315 123);`` will not compile.
 316
 317 Tokenization in Python
 318 ----------------------
 319 The Python ``pw_tokenizer.encode`` module has limited support for encoding
 320 tokenized messages with the ``encode_token_and_args`` function.
 321
 322 .. autofunction:: pw_tokenizer.encode.encode_token_and_args
 323
 324 Encoding
 325 --------
 326 The token is a 32-bit hash calculated during compilation. A tokenized string is
 327 encoded as the little-endian token followed by its arguments, if any. For
 328 example, the 31-byte string ``You can go about your business.`` hashes to
 329 0xdac9a244, which is encoded as 4 bytes: ``44 a2 c9 da``.
 330
 331 Arguments are encoded as follows:
 332
 333   * **Integers** (1--10 bytes) --
 334     `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-integers>`_,
 335     similarly to Protocol Buffers. Smaller values take fewer bytes.
 336   * **Floating point numbers** (4 bytes) -- Single precision floating point.
 337   * **Strings** (1--128 bytes) -- Length byte followed by the string contents.
 338     The top bit of the length byte indicates whether the string was truncated.
 339     The remaining 7 bits encode the string length, with a maximum of 127
 340     bytes.
 341
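The argument encoding rules above can be illustrated with a short Python
sketch. It is a standalone illustration, not the module's implementation: the
helper names (``zigzag_varint``, ``encode_string``, ``encode_message``) are
hypothetical, and integer arguments are assumed to fit in 64 bits. It
reproduces the 15-byte battery message from the tokenized logging example.

.. code-block:: python

  import struct


  def zigzag_varint(value: int) -> bytes:
      """ZigZag-encodes a signed 64-bit integer, then varint-encodes it."""
      zigzag = ((value << 1) ^ (value >> 63)) & 0xFFFFFFFFFFFFFFFF
      out = bytearray()
      while True:
          byte = zigzag & 0x7F
          zigzag >>= 7
          if zigzag:
              out.append(byte | 0x80)  # Continuation bit: more bytes follow.
          else:
              out.append(byte)
              return bytes(out)


  def encode_string(text: str, max_length: int = 127) -> bytes:
      """Encodes a length byte (top bit set if truncated) plus the contents."""
      data = text.encode('utf-8')
      truncated = len(data) > max_length
      data = data[:max_length]
      return bytes([len(data) | (0x80 if truncated else 0)]) + data


  def encode_message(token: int, *args) -> bytes:
      """Encodes a little-endian 32-bit token followed by its arguments."""
      encoded = struct.pack('<I', token)
      for arg in args:
          if isinstance(arg, int):
              encoded += zigzag_varint(arg)
          elif isinstance(arg, float):
              encoded += struct.pack('<f', arg)  # Single precision.
          else:
              encoded += encode_string(arg)
      return encoded


  message = encode_message(0x8E4728D9, 'CHARGING', 3989)
  print(message.hex(' '))  # d9 28 47 8e 08 43 48 41 52 47 49 4e 47 aa 3e

Small integers dominate the savings here: ``3989`` takes two bytes, and ``-1``
takes one byte because ZigZag maps it to ``1`` before the varint step.
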
 342 .. TODO: insert diagram here!
 343
 344 .. tip::
 345   ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s`` arguments
 346   short or avoid encoding them as strings (e.g. encode an enum as an integer
 347   instead of a string). See also `Tokenized strings as %s arguments`_.
 348
 349 Token generation: fixed length hashing at compile time
 350 ------------------------------------------------------
 351 String tokens are generated using a modified version of the x65599 hash used by
 352 the SDBM project. All hashing is done at compile time.
 353
 354 In C code, strings are hashed with a preprocessor macro. For compatibility with
 355 macros, the hash must be limited to a fixed maximum number of characters. This
 356 value is set by ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. Increasing
 357 ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` increases the compilation time for C due to
 358 the complexity of the hashing macros.
 359
 360 C++ macros use a constexpr function instead of a macro. This function works with
 361 any length of string and has lower compilation time impact than the C macros.
 362 For consistency, C++ tokenization uses the same hash algorithm, but the
 363 calculated values will differ between C and C++ for strings longer than
 364 ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` characters.
 365
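The effect of the fixed hash length can be sketched in Python. The function
below is an assumption about the general shape of the hash (a 65599-based
polynomial hash seeded with the string length and truncated to 32 bits), and
both the function name and the hash length of 128 are hypothetical. Consult the
``pw_tokenizer`` sources for the authoritative definition.

.. code-block:: python

  def x65599_hash(string: str, hash_length=None) -> int:
      """Hashes a string to 32 bits using powers of 65599.

      Assumption: the hash is seeded with the full string length, and only
      the first hash_length characters contribute (all of them if None).
      """
      hash_value = len(string)
      coefficient = 65599
      for char in string[:hash_length]:
          hash_value = (hash_value + coefficient * ord(char)) % 2**32
          coefficient = (coefficient * 65599) % 2**32
      return hash_value


  C_HASH_LENGTH = 128  # Hypothetical PW_TOKENIZER_CFG_C_HASH_LENGTH value.

  short = 'Hello, world!'
  long_string = 'x' * 200

  # Short strings hash identically with or without the length limit.
  print(x65599_hash(short, C_HASH_LENGTH) == x65599_hash(short))  # True

  # Strings longer than the limit get different tokens in C and C++.
  print(x65599_hash(long_string, C_HASH_LENGTH) == x65599_hash(long_string))  # False

Under these assumptions, a project that mixes C and C++ gets matching tokens
for a given string only while the string fits within the C hash length.
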
 366 Tokenization domains
 367 --------------------
 368 ``pw_tokenizer`` supports having multiple tokenization domains. A domain is a
 369 string label associated with each tokenized string. This allows projects to keep
 370 tokens from different sources separate. Potential use cases include the
 371 following:
 372
 373 * Keep large sets of tokenized strings separate to avoid collisions.
 374 * Create a separate database for a small number of strings that use truncated
 375   tokens, for example only 10 or 16 bits instead of the full 32 bits.
 376
 377 If no domain is specified, the domain is empty (``""``). For many projects, this
 378 default domain is sufficient, so no additional configuration is required.
 379
 380 .. code-block:: cpp
 381
 382   // Tokenizes this string to the default ("") domain.
 383   PW_TOKENIZE_STRING("Hello, world!");
 384
 385   // Tokenizes this string to the "my_custom_domain" domain.
 386   PW_TOKENIZE_STRING_DOMAIN("my_custom_domain", "Hello, world!");
 387
 388 The database and detokenization command line tools default to reading from the
 389 default domain. The domain may be specified for ELF files by appending
 390 ``#DOMAIN_NAME`` to the file path. Use ``#.*`` to read from all domains. For
 391 example, the following reads strings in ``some_domain`` from ``my_image.elf``.
 392
 393 .. code-block:: sh
 394
 395   ./database.py create --database my_db.csv path/to/my_image.elf#some_domain
 396
 397 See `Managing token databases`_ for information about the ``database.py``
 398 command line tool.
 399
 400 Token databases
 401 ===============
 402 Token databases store a mapping of tokens to the strings they represent. An ELF
 403 file can be used as a token database, but it only contains the strings for its
 404 exact build. A token database file aggregates tokens from multiple ELF files, so
 405 that a single database can decode tokenized strings from any known ELF.
 406
 407 Token databases contain the token, removal date (if any), and string for each
 408 tokenized string. Two token database formats are supported: CSV and binary.
 409
 410 CSV database format
 411 -------------------
 412 The CSV database format has three columns: the token in hexadecimal, the removal
 413 date (if any) in year-month-day format, and the string literal, surrounded by
 414 quotes. Quote characters within the string are represented as two quote
 415 characters.
 416
 417 This example database contains six strings, three of which have removal dates.
 418
 419 .. code-block::
 420
 421   141c35d5,          ,"The answer: ""%s"""
 422   2e668cd6,2019-12-25,"Jello, world!"
 423   7b940e2a,          ,"Hello %s! %hd %e"
 424   851beeb6,          ,"%u %d"
 425   881436a0,2020-01-01,"The answer is: %s"
 426   e13b0f94,2020-04-01,"%llu"
 427
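Because the format is plain CSV, the example database can be read with
Python's standard ``csv`` module. This is a minimal sketch for illustration,
not the ``database.py`` implementation, and ``parse_csv_database`` is a
hypothetical helper name.

.. code-block:: python

  import csv
  import datetime

  EXAMPLE_ROWS = [
      '141c35d5,          ,"The answer: ""%s"""',
      '2e668cd6,2019-12-25,"Jello, world!"',
      '7b940e2a,          ,"Hello %s! %hd %e"',
      '851beeb6,          ,"%u %d"',
      '881436a0,2020-01-01,"The answer is: %s"',
      'e13b0f94,2020-04-01,"%llu"',
  ]


  def parse_csv_database(rows):
      """Yields (token, removal_date, string) for each database row."""
      for token, date, string in csv.reader(rows):
          date = date.strip()  # The date column is blank-padded when unset.
          removal_date = datetime.date.fromisoformat(date) if date else None
          yield int(token, 16), removal_date, string


  entries = list(parse_csv_database(EXAMPLE_ROWS))
  removed = [entry for entry in entries if entry[1] is not None]
  print(len(entries), len(removed))  # 6 3

The ``csv`` reader collapses the doubled quote characters, so the first entry's
string is read back as ``The answer: "%s"``.
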
 428 Binary database format
 429 ----------------------
 430 The binary database format consists of a 16-byte header followed by a series
 431 of 8-byte entries. Each entry stores the token and the removal date, which is
 432 0xFFFFFFFF if there is none. The string literals are stored next in the same
 433 order as the entries. Strings are stored with null terminators. See
 434 `token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/refs/heads/master/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
 435 for full details.
 436
 437 The binary form of the CSV database is shown below. It contains the same
 438 information, but in a more compact and easily processed form. It takes 141 B
 439 compared with the CSV database's 211 B.
 440
 441 .. code-block:: text
 442
 443   [header]
 444   0x00: 454b4f54 0000534e  TOKENS..
 445   0x08: 00000006 00000000  ........
 446
 447   [entries]
 448   0x10: 141c35d5 ffffffff  .5......
 449   0x18: 2e668cd6 07e30c19  ..f.....
 450   0x20: 7b940e2a ffffffff  *..{....
 451   0x28: 851beeb6 ffffffff  ........
 452   0x30: 881436a0 07e40101  .6......
 453   0x38: e13b0f94 07e40401  ..;.....
 454
 455   [string table]
 456   0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22  The answer: "%s"
 457   0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48  .Jello, world!.H
 458   0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00  ello %s! %hd %e.
 459   0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72  %u %d.The answer
 460   0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00            is: %s.%llu.
 461
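The layout in the dump above can be read back with Python's ``struct`` module.
This sketch infers the field layout from the example only: a 6-byte ``TOKENS``
magic string, two bytes that are zero in the example (treated here as a version
field, which is an assumption), a 4-byte entry count, and four more zero bytes.
The names are hypothetical, and ``token_database.h`` remains the authoritative
description.

.. code-block:: python

  import struct

  HEADER = struct.Struct('<6sHII')  # Magic, version (assumed), count, reserved.
  ENTRY = struct.Struct('<II')      # Token, removal date.
  NO_REMOVAL_DATE = 0xFFFFFFFF


  def parse_binary_database(data: bytes):
      """Returns a list of (token, removal_date, string) tuples."""
      magic, _version, entry_count, _reserved = HEADER.unpack_from(data, 0)
      if magic != b'TOKENS':
          raise ValueError('Not a token database')

      # Null-terminated strings follow the entries, in the same order.
      strings_offset = HEADER.size + entry_count * ENTRY.size
      strings = data[strings_offset:].split(b'\0')

      entries = []
      for i in range(entry_count):
          token, date = ENTRY.unpack_from(data, HEADER.size + i * ENTRY.size)
          if date == NO_REMOVAL_DATE:
              removal_date = None
          else:
              # Year in the upper 16 bits, then one byte each for month, day.
              removal_date = (date >> 16, (date >> 8) & 0xFF, date & 0xFF)
          entries.append((token, removal_date, strings[i].decode()))
      return entries


  # Rebuild the first two entries of the example database and parse them.
  blob = (HEADER.pack(b'TOKENS', 0, 2, 0)
          + ENTRY.pack(0x141C35D5, NO_REMOVAL_DATE)
          + ENTRY.pack(0x2E668CD6, 0x07E30C19)
          + b'The answer: "%s"\0Jello, world!\0')

  for token, removal_date, string in parse_binary_database(blob):
      print(f'{token:08x}', removal_date, string)

The removal date ``07e30c19`` decodes to year ``0x07e3`` (2019), month ``0x0c``
(12), and day ``0x19`` (25), matching the ``2019-12-25`` date in the CSV
example.
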
 462 Managing token databases
 463 ------------------------
 464 Token databases are managed with the ``database.py`` script. This script can be
 465 used to extract tokens from compilation artifacts and manage database files.
 466 Invoke ``database.py`` with ``-h`` for full usage information.
 467
 468 An example ELF file with tokenized logs is provided at
 469 ``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that
 470 file to experiment with the ``database.py`` commands.
 471
 472 Create a database
 473 ^^^^^^^^^^^^^^^^^
 474 The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
 475 etc.), archives (.a), or existing token databases (CSV or binary).
 476
 477 .. code-block:: sh
 478
 479   ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...
 480
 481 Two database formats are supported: CSV and binary. Provide ``--type binary`` to
 482 ``create`` to generate a binary database instead of the default CSV. CSV
 483 databases are great for checking into source control or for human review.
 484 Binary databases are more compact and simpler to parse. The C++ detokenizer
 485 library only supports binary databases currently.
 486
 487 Update a database
 488 ^^^^^^^^^^^^^^^^^
 489 As new tokenized strings are added, update the database with the ``add``
 490 command.
 491
 492 .. code-block:: sh
 493
 494   ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...
 495
 496 A CSV token database can be checked into a source repository and updated as code
 497 changes are made. The build system can invoke ``database.py`` to update the
 498 database after each build.
 499
 500 GN integration
 501 ^^^^^^^^^^^^^^
 502 Token databases may be updated or created as part of a GN build. The
 503 ``pw_tokenizer_database`` template provided by ``dir_pw_tokenizer/database.gni``
 504 automatically updates an in-source tokenized strings database or creates a new
 505 database with artifacts from one or more GN targets or other database files.
 506
 507 To create a new database, set the ``create`` variable to the desired database
 508 type (``"csv"`` or ``"binary"``). The database will be created in the output
 509 directory. To update an existing database, provide the path to the database with
 510 the ``database`` variable.
 511
 512 Each database in the source tree can only be updated from a single
 513 ``pw_tokenizer_database`` rule. Updating the same database in multiple rules
 514 results in ``Duplicate output file`` GN errors or ``multiple rules generate
 515 <file>`` Ninja errors. To avoid these errors, ``pw_tokenizer_database`` rules
 516 should be defined in the default toolchain, and the input targets should be
 517 referenced with specific toolchains.
 518
 519 .. code-block::
 520
 521   import("//build_overrides/pigweed.gni")
 522
 523   import("$dir_pw_tokenizer/database.gni")
 524
 525   pw_tokenizer_database("my_database") {
 526     database = "database_in_the_source_tree.csv"
 527     targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
 528     input_databases = [ "other_database.csv" ]
 529   }
 530
 531 Detokenization
 532 ==============
 533 Detokenization is the process of expanding a token to the string it represents
 534 and decoding its arguments. This module provides Python and C++ detokenization
 535 libraries.
 536
 537 **Example: decoding tokenized logs**
 538
 539 A project might tokenize its log messages with the `Base64 format`_. Consider
 540 the following log file, which has four tokenized logs and one plain text log:
 541
 542 .. code-block:: text
 543
 544   20200229 14:38:58 INF $HL2VHA==
 545   20200229 14:39:00 DBG $5IhTKg==
 546   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
 547   20200229 14:39:21 INF $EgFj8lVVAUI=
 548   20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=
 549
 550 The project's log strings are stored in a database like the following:
 551
 552 .. code-block::
 553
 554   1c95bd1c,          ,"Initiating retrieval process for recovery object"
 555   2a5388e4,          ,"Determining optimal approach and coordinating vectors"
 556   3743540c,          ,"Recovery object retrieval failed with status %s"
 557   f2630112,          ,"Calculated acceptable probability of success (%.2f%%)"
 558
 559 Using the detokenizing tools with the database, the logs can be decoded:
 560
 561 .. code-block:: text
 562
 563   20200229 14:38:58 INF Initiating retrieval process for recovery object
 564   20200229 14:39:00 DBG Determining optimal approach and coordinating vectors
 565   20200229 14:39:20 DBG Crunching numbers to calculate probability of success
 566   20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
 567   20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY
 568
 569 .. note::
 570
 571   This example uses the `Base64 format`_, which occupies about 4/3 (133%) as
 572   much space as the default binary format when encoded. For projects that wish
 573   to interleave tokenized messages with plain text, using Base64 is a worthwhile
 574   tradeoff.
 575
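To make the decoding step concrete, here is a small standalone Python sketch
that decodes the log lines above using the example database. It does not use
the ``pw_tokenizer`` package: ``decode_args`` and ``decode_log_line`` are
hypothetical helpers, only ``%s``, integer, and floating point conversions are
handled, and formatting relies on Python's ``%`` operator, which does not
accept every C length modifier.

.. code-block:: python

  import base64
  import re
  import struct

  DATABASE = {
      0x1C95BD1C: 'Initiating retrieval process for recovery object',
      0x2A5388E4: 'Determining optimal approach and coordinating vectors',
      0x3743540C: 'Recovery object retrieval failed with status %s',
      0xF2630112: 'Calculated acceptable probability of success (%.2f%%)',
  }

  # Captures the conversion character of each printf-style specifier.
  SPEC = re.compile(r'%[-+ #0]*\d*(?:\.\d+)?[hlLjzt]*([%a-zA-Z])')
  BASE64_MESSAGE = re.compile(r'\$([A-Za-z0-9+/]+={0,2})')


  def decode_args(format_string: str, data: bytes) -> tuple:
      """Decodes the encoded arguments that follow the token."""
      args = []
      for spec in SPEC.findall(format_string):
          if spec == '%':
              continue  # "%%" is a literal percent sign with no argument.
          if spec == 's':
              length = data[0] & 0x7F  # Top bit is the truncation flag.
              args.append(data[1:1 + length].decode())
              data = data[1 + length:]
          elif spec in 'fFeEgG':
              args.append(struct.unpack_from('<f', data)[0])
              data = data[4:]
          else:  # Integers: decode the varint, then undo ZigZag.
              value = shift = 0
              while True:
                  byte, data = data[0], data[1:]
                  value |= (byte & 0x7F) << shift
                  shift += 7
                  if not byte & 0x80:
                      break
              args.append((value >> 1) ^ -(value & 1))
      return tuple(args)


  def decode_log_line(line: str) -> str:
      """Replaces each $-prefixed Base64 message with its decoded text."""
      def replace(match):
          data = base64.b64decode(match.group(1))
          (token,) = struct.unpack_from('<I', data)
          format_string = DATABASE.get(token)
          if format_string is None:
              return match.group(0)  # Unknown token: leave the text as is.
          return format_string % decode_args(format_string, data[4:])

      return BASE64_MESSAGE.sub(replace, line)


  print(decode_log_line('20200229 14:39:21 INF $EgFj8lVVAUI='))
  print(decode_log_line('20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk='))

Lines without a ``$``-prefixed message, such as the plain text ``DBG`` log,
pass through unchanged, which is what makes interleaving practical.
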
 576 Python
 577 ------
 578 To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer``
 579 package, and instantiate it with paths to token databases or ELF files.
 580
 581 .. code-block:: python
 582
 583   import pw_tokenizer
 584
 585   detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')
 586
 587   def process_log_message(log_message):
 588       result = detokenizer.detokenize(log_message.payload)
 589       print(str(result))
 590
 591 The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
 592 class, which can be used in place of the standard ``Detokenizer``. This class
 593 monitors database files for changes and automatically reloads them when they
 594 change. This is helpful for long-running tools that use detokenization.
 595
 596 C++
 597 ---
 598 The C++ detokenization libraries can be used in C++ or any language that can
 599 call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
 600 Java Native Interface (JNI) implementation is provided.
 601
 602 The C++ detokenization library uses binary-format token databases (created with
 603 ``database.py create --type binary``). Read a binary format database from a
 604 file or include it in the source code. Pass the database array to
 605 ``TokenDatabase::Create``, and construct a detokenizer.
 606
 607 .. code-block:: cpp
 608
 609   Detokenizer detokenizer(TokenDatabase::Create(token_database_array));
 610
 611   std::string ProcessLog(span<uint8_t> log_data) {
 612     return detokenizer.Detokenize(log_data).BestString();
 613   }
 614
 615 The ``TokenDatabase`` class verifies that its data is valid before using it. If
 616 it is invalid, ``TokenDatabase::Create`` returns an empty database for which
 617 ``ok()`` returns false. If the token database is included in the source code,
 618 this check can be done at compile time.
 619
 620 .. code-block:: cpp
 621
 622   // This line fails to compile with a static_assert if the database is invalid.
 623   constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();
 624
 625   Detokenizer OpenDatabase(std::string_view path) {
 626     std::vector<uint8_t> data = ReadWholeFile(path);
 627
 628     TokenDatabase database = TokenDatabase::Create(data);
 629
 630     // This checks if the file contained a valid database. It is safe to use a
 631     // TokenDatabase that failed to load (it will be empty), but it may be
 632     // desirable to provide a default database or otherwise handle the error.
 633     if (database.ok()) {
 634       return Detokenizer(database);
 635     }
 636     return Detokenizer(kDefaultDatabase);
 637   }
 638
 639 Base64 format
 640 =============
 641 The tokenizer encodes messages to a compact binary representation. Applications
 642 may desire a textual representation of tokenized strings. This makes it easy to
 643 use tokenized messages alongside plain text messages, but comes at a small
 644 efficiency cost: encoded Base64 messages occupy about 4/3 (133%) as much memory
 645 as binary messages.
 646
 647 The Base64 format consists of a ``$`` character followed by the
 648 Base64-encoded contents of the tokenized message. For example, consider
 649 tokenizing the string ``This is an example: %d!`` with the argument -1. The
 650 string's token is 0x4b016e66.
 651
 652 .. code-block:: text
 653
 654   Source code: PW_TOKENIZE_TO_GLOBAL_HANDLER("This is an example: %d!", -1);
 655
 656    Plain text: This is an example: -1! [23 bytes]
 657
 658        Binary: 66 6e 01 4b 01          [ 5 bytes]
 659
 660        Base64: $Zm4BSwE=               [ 9 bytes]
 661
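The Base64 form of the example above can be reproduced in a few lines of Python.
This is an illustrative sketch, not the ``pw_tokenizer`` implementation: the
helper names are hypothetical, and it assumes the message is the token as four
little-endian bytes followed by the ``%d`` argument as a ZigZag-encoded varint,
which matches the binary bytes shown above.

.. code-block:: python

  import base64
  import struct


  def zigzag_varint(value: int) -> bytes:
      """Encodes a signed 64-bit integer as a ZigZag varint."""
      zigzag = (value << 1) ^ (value >> 63)  # -1 -> 1, 1 -> 2, -2 -> 3, ...
      encoded = bytearray()
      while True:
          low_bits = zigzag & 0x7F
          zigzag >>= 7
          if not zigzag:
              encoded.append(low_bits)
              return bytes(encoded)
          encoded.append(low_bits | 0x80)  # Continuation bit: more bytes follow.


  def encode_message(token: int, *int_args: int) -> bytes:
      """Packs the token (little-endian) followed by the integer arguments."""
      return struct.pack('<I', token) + b''.join(map(zigzag_varint, int_args))


  binary = encode_message(0x4B016E66, -1)
  prefixed = '$' + base64.b64encode(binary).decode('ascii')

  print(binary.hex())  # 666e014b01
  print(prefixed)      # $Zm4BSwE=

The 5-byte binary message becomes 8 Base64 characters plus the ``$`` prefix,
which accounts for the 9 bytes listed above.
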
Encoding
--------
To encode with the Base64 format, add a call to
``pw::tokenizer::PrefixedBase64Encode`` or ``pw_tokenizer_PrefixedBase64Encode``
in the tokenizer handler function. For example:

.. code-block:: cpp

  void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
                                         size_t size_bytes) {
    char base64_buffer[64];
    size_t base64_size = pw::tokenizer::PrefixedBase64Encode(
        pw::span(encoded_message, size_bytes), base64_buffer);

    TransmitLogMessage(base64_buffer, base64_size);
  }

Decoding
--------
Base64 decoding and detokenizing are supported in the Python detokenizer
through the ``detokenize_base64`` and related functions.

.. tip::
  The Python detokenization tools support recursive detokenization for prefixed
  Base64 text. Tokenized strings found in detokenized text are detokenized, so
  prefixed Base64 messages can be passed as ``%s`` arguments.

  For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could be
  passed as an argument to the printf-style string ``Nested message: %s``, which
  encodes to ``$pEVTYQkkUmhZam1RPT0=``. The detokenizer would decode the message
  as follows:

  ::

   "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"

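The recursive decoding described in the tip can be sketched in Python with a toy
in-memory database. This is not the ``pw_tokenizer`` detokenizer or its API: the
function names are hypothetical, the two tokens are simply the first four bytes
(read as little-endian) of the example messages, and only ``%s`` arguments are
handled, assuming each is stored as a length byte followed by the string data.

.. code-block:: python

  import base64
  import re
  import struct

  # Toy database for illustration; tokens taken from the example messages.
  DATABASE = {
      0x99231646: 'Wow!',
      0x615345A4: 'Nested message: %s',
  }

  # A "$" followed by Base64 characters and optional padding.
  PREFIXED_BASE64 = re.compile(r'\$[A-Za-z0-9+/]+={0,2}')


  def detokenize_binary(data: bytes) -> str:
      """Looks up the token and fills in %s arguments (the only type handled)."""
      (token,) = struct.unpack_from('<I', data)
      pieces = DATABASE[token].split('%s')
      args = data[4:]
      result = pieces[0]
      for piece in pieces[1:]:
          # Assumed layout: one length byte (top bit flags truncation), then
          # that many bytes of string data.
          length = args[0] & 0x7F
          result += args[1:1 + length].decode() + piece
          args = args[1 + length:]
      return result


  def _decode_match(match) -> str:
      try:
          return detokenize_binary(base64.b64decode(match.group()[1:]))
      except (KeyError, IndexError, ValueError, struct.error):
          return match.group()  # Not a known message; leave the text as is.


  def detokenize_text(text: str, max_depth: int = 4) -> str:
      """Repeatedly expands prefixed Base64 messages found in text."""
      for _ in range(max_depth):
          expanded = PREFIXED_BASE64.sub(_decode_match, text)
          if expanded == text:
              break
          text = expanded
      return text


  print(detokenize_text('$pEVTYQkkUmhZam1RPT0='))  # Nested message: Wow!

The depth limit is a design choice of this sketch: it bounds the work done if a
message ever expands to text that contains itself.
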
Base64 decoding is supported in C++ or C with the
``pw::tokenizer::PrefixedBase64Decode`` or ``pw_tokenizer_PrefixedBase64Decode``
functions.

.. code-block:: cpp

  // A sketch of decoding received text. Check pw_tokenizer/base64.h for the
  // exact signatures. HandleBase64Message and HandleBinaryMessage are
  // placeholder names for application code.
  void HandleBase64Message(std::string_view base64_message) {
    std::byte binary_buffer[64];
    size_t binary_size = pw::tokenizer::PrefixedBase64Decode(
        base64_message, binary_buffer);

    // A size of 0 is assumed here to mean the message could not be decoded.
    if (binary_size > 0) {
      HandleBinaryMessage(binary_buffer, binary_size);
    }
  }

Command line utilities
^^^^^^^^^^^^^^^^^^^^^^
``pw_tokenizer`` provides two standalone command line utilities for detokenizing
Base64-encoded tokenized strings.

* ``detokenize.py`` -- Detokenizes Base64-encoded strings in files or from
  stdin.
* ``detokenize_serial.py`` -- Detokenizes Base64-encoded strings from a
  connected serial device.

If the ``pw_tokenizer`` Python package is installed, these tools may be executed
as runnable modules. For example:

.. code-block:: sh

  # Detokenize Base64-encoded strings in a file
  python -m pw_tokenizer.detokenize -i input_file.txt

  # Detokenize Base64-encoded strings in output from a serial device
  python -m pw_tokenizer.detokenize_serial --device /dev/ttyACM0

See the ``--help`` options for these tools for full usage information.

Deployment war story
====================
The tokenizer module was developed to bring tokenized logging to an
in-development product. The product already had an established text-based
logging system. Deploying tokenization was straightforward and had substantial
benefits.

Results
-------
  * Log contents shrank by over 50%, even with Base64 encoding.

    * Significant size savings for encoded logs, even using the less-efficient
      Base64 encoding required for compatibility with the existing log system.
    * Freed valuable communication bandwidth.
    * Allowed storing many more logs in crash dumps.

  * Substantial flash savings.

    * Reduced the size of firmware images by up to 18%.

  * Simpler logging code.

    * Removed CPU-heavy ``snprintf`` calls.
    * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
-------------------
  * In the project's logging macro, calls to the underlying logging function
    were replaced with a ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD``
    invocation.
  * The log level was passed as the payload argument to facilitate runtime log
    level control.
  * For this project, it was necessary to encode the log messages as text. In
    ``pw_tokenizer_HandleEncodedMessageWithPayload``, the log messages were
    encoded in the $-prefixed `Base64 format`_, then dispatched as normal log
    messages.
  * Asserts were tokenized using ``PW_TOKENIZE_TO_CALLBACK``.

.. attention::
  Do not encode line numbers in tokenized strings. This results in a huge
  number of lines being added to the database, since every time code moves,
  new strings are tokenized. If line numbers are desired in a tokenized
  string, add a ``"%d"`` to the string and pass ``__LINE__`` as an argument.

Database management
-------------------
  * The token database was stored as a CSV file in the project's Git repo.
  * The token database was automatically updated as part of the build, and
    developers were expected to check in the database changes alongside their
    code changes.
  * A presubmit check verified that all strings added by a change were added to
    the token database.
  * The token database included logs and asserts for all firmware images in the
    project.
  * No strings were purged from the token database.

.. tip::
  Merge conflicts may be a frequent occurrence with an in-source database. If
  the database is in-source, make sure there is a simple script to resolve any
  merge conflicts. The script could either keep both sets of lines or discard
  local changes and regenerate the database.

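The "keep both sets of lines" option from the tip could look like the following
Python sketch. It is a hypothetical helper, not part of ``pw_tokenizer``. It
assumes default Git conflict markers, assumes that no database row begins with
a marker string, and treats rows as opaque text. The rows in the usage example
are made up for illustration.

.. code-block:: python

  CONFLICT_MARKERS = ('<<<<<<<', '=======', '>>>>>>>')


  def resolve_database_conflicts(text: str) -> str:
      """Drops Git conflict markers and keeps each database row once."""
      kept = []
      seen = set()
      for line in text.splitlines():
          if line.startswith(CONFLICT_MARKERS) or line in seen:
              continue
          seen.add(line)
          kept.append(line)
      return ''.join(line + '\n' for line in kept)


  conflicted = (
      '00000001,,"Shared row"\n'
      '<<<<<<< HEAD\n'
      '00000002,,"Row added locally"\n'
      '=======\n'
      '00000003,,"Row added upstream"\n'
      '>>>>>>> main\n'
  )
  resolved = resolve_database_conflicts(conflicted)
  print(resolved, end='')

Rows from both sides survive in the order they appear in the file. Rerunning
the project's usual database update step afterwards would restore whatever
ordering the tools expect.
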
Decoding tooling deployment
---------------------------
  * The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

      * Product-specific Python command line tools, using
        ``pw_tokenizer.Detokenizer``.
      * Standalone script for decoding prefixed Base64 tokens in files or
        live output (e.g. from ``adb``), using ``detokenize.py``'s command line
        interface.

  * The C++ detokenizer library was deployed to two Android apps with a Java
    Native Interface (JNI) layer.

      * The binary token database was included as a raw resource in the APK.
      * In one app, the built-in token database could be overridden by copying a
        file to the phone.

.. tip::
  Make the tokenized logging tools simple to use for your project.

  * Provide simple wrapper shell scripts that fill in arguments for the
    project. For example, point ``detokenize.py`` to the project's token
    databases.
  * Use ``pw_tokenizer.AutoReloadingDetokenizer`` to decode in
    continuously-running tools, so that users don't have to restart the tool
    when the token database updates.
  * Integrate detokenization everywhere it is needed. Integrating the tools
    takes just a few lines of code, and token databases can be embedded in
    APKs or binaries.

Limitations and future work
===========================

GCC bug: tokenization in template functions
-------------------------------------------
GCC incorrectly ignores the section attribute for template
`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and
`variables <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. Due to this
bug, tokenized strings in template functions may be emitted into ``.rodata``
instead of the special tokenized string section. This causes two problems:

  1. Tokenized strings will not be discovered by the token database tools.
  2. Tokenized strings may not be removed from the final binary.

Clang does **not** have this issue! Use Clang to avoid it.

It is possible to work around this bug in GCC. One approach would be to tag
format strings so that the database tools can find them in ``.rodata``. Then, to
remove the strings, compile two binaries: one metadata binary with all tokenized
strings and a second, final binary that removes the strings. The strings could
be removed by providing the appropriate linker flags or by removing the ``used``
attribute from the tokenized string character array declaration.

64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the
``.pw_tokenizer.info`` ELF section, so the sizes of these types can be verified
by checking the ELF file, if necessary.

Tokenization in headers
-----------------------
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That
is because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.

Tokenized strings as ``%s`` arguments
-------------------------------------
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. It would be better to send a tokenized string
literal as an integer instead of a string argument, but this is not yet
supported.

A string token could be sent by marking an integer % argument in a way
recognized by the detokenization tools. The detokenizer would expand the
argument to the string represented by the integer.

.. code-block:: cpp

  #define PW_TOKEN_ARG PRIx32 "<PW_TOKEN]"

  constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");

  PW_TOKENIZE_TO_GLOBAL_HANDLER("Knock knock: %" PW_TOKEN_ARG "?", answer_token);

Strings with arguments could be encoded to a buffer, but since printf strings
are null-terminated, a binary encoding would not work. These strings can be
prefixed Base64-encoded and sent as ``%s`` instead. See `Base64 format`_.

Another possibility: encode strings with arguments to a ``uint64_t`` and send
them as an integer. This would be efficient and simple, but would only support
a small number of arguments.

Legacy tokenized string ELF format
==================================
The original version of ``pw_tokenizer`` stored tokenized strings as plain C
strings in the ELF file instead of structured tokenized string entries. Strings
in different domains were stored in different linker sections. The Python script
that parsed the ELF file would recalculate the tokens.

In the current version of ``pw_tokenizer``, tokenized strings are stored in a
structured entry containing a token, domain, and length-delimited string. This
has several advantages over the legacy format:

* The Python script does not have to recalculate the token, so any hash
  algorithm may be used in the firmware.
* In C++, the tokenization hash no longer has a length limitation.
* Strings with null terminators in them are properly handled.
* Only one linker section is required in the linker script, instead of a
  separate section for each domain.

To migrate to the new format, all that is required is to update the linker
sections to match those in ``pw_tokenizer_linker_sections.ld``. Replace all
``pw_tokenized.<DOMAIN>`` sections with one ``pw_tokenizer.entries`` section.
The Python tooling continues to support the legacy tokenized string ELF format.

Compatibility
=============
  * C11
  * C++11
  * Python 3

Dependencies
============
  * ``pw_varint`` module
  * ``pw_preprocessor`` module
  * ``pw_span`` module