.. _module-pw_tokenizer:

------------
pw_tokenizer
------------
Logging is critical, but developers are often forced to choose between
additional logging or saving crucial flash space. The ``pw_tokenizer`` module
helps address this by replacing printf-style strings with binary tokens during
compilation. This enables extensive logging with substantially less memory
usage.
.. note::
  This usage of the term "tokenizer" is not related to parsing! The
  module is called tokenizer because it replaces a whole string literal with an
  integer token. It does not parse strings into separate tokens.
17 The most common application of ``pw_tokenizer`` is binary logging, and it is
18 designed to integrate easily into existing logging systems. However, the
19 tokenizer is general purpose and can be used to tokenize any strings, with or
20 without printf-style arguments.
22 **Why tokenize strings?**
24 * Dramatically reduce binary size by removing string literals from binaries.
* Reduce I/O traffic, RAM, and flash usage by sending and storing compact
  tokens instead of strings. We've seen over 50% reduction in encoded log
  contents.
* Reduce CPU usage by replacing ``snprintf`` calls with simple tokenization code.
29 * Remove potentially sensitive log, assert, and other strings from binaries.
Basic overview
==============
There are two sides to ``pw_tokenizer``, which we call tokenization and
detokenization.
36 * **Tokenization** converts string literals in the source code to
37 binary tokens at compile time. If the string has printf-style arguments,
38 these are encoded to compact binary form at runtime.
39 * **Detokenization** converts tokenized strings back to the original
40 human-readable strings.
42 Here's an overview of what happens when ``pw_tokenizer`` is used:
44 1. During compilation, the ``pw_tokenizer`` module hashes string literals to
45 generate stable 32-bit tokens.
46 2. The tokenization macro removes these strings by declaring them in an ELF
47 section that is excluded from the final binary.
48 3. After compilation, strings are extracted from the ELF to build a database
   of tokenized strings for use by the detokenizer. The ELF file may also be
   used directly.
4. During operation, the device encodes the string token and its arguments, if
   any.
5. The encoded tokenized strings are sent off-device or stored.
54 6. Off-device, the detokenizer tools use the token database to decode the
55 strings to human-readable form.
57 Example: tokenized logging
58 --------------------------
This example demonstrates using ``pw_tokenizer`` for logging. In this example,
tokenized logging saves ~90% in binary size (41 → 4 bytes) and ~70% in encoded
log size (49 → 15 bytes).
63 **Before**: plain text logging
+------------------+-------------------------------------------+---------------+
| Location         | Logging Content                           | Size in bytes |
+==================+===========================================+===============+
| Source contains  | ``LOG("Battery state: %s; battery         |               |
|                  | voltage: %d mV", state, voltage);``       |               |
+------------------+-------------------------------------------+---------------+
| Binary contains  | ``"Battery state: %s; battery             | 41            |
|                  | voltage: %d mV"``                         |               |
+------------------+-------------------------------------------+---------------+
|                  | (log statement is called with             |               |
|                  | ``"CHARGING"`` and ``3989`` as arguments) |               |
+------------------+-------------------------------------------+---------------+
| Device transmits | ``"Battery state: CHARGING; battery       | 49            |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+
| When viewed      | ``"Battery state: CHARGING; battery       |               |
|                  | voltage: 3989 mV"``                       |               |
+------------------+-------------------------------------------+---------------+

84 **After**: tokenized logging
+------------------+-----------------------------------------------------------+---------+
| Location         | Logging Content                                           | Size in |
|                  |                                                           | bytes   |
+==================+===========================================================+=========+
| Source contains  | ``LOG("Battery state: %s; battery                         |         |
|                  | voltage: %d mV", state, voltage);``                       |         |
+------------------+-----------------------------------------------------------+---------+
| Binary contains  | ``d9 28 47 8e`` (0x8e4728d9)                              | 4       |
+------------------+-----------------------------------------------------------+---------+
|                  | (log statement is called with                             |         |
|                  | ``"CHARGING"`` and ``3989`` as arguments)                 |         |
+------------------+-----------------------------------------------------------+---------+
| Device transmits | =============== ============================== ========== | 15      |
|                  | ``d9 28 47 8e`` ``08 43 48 41 52 47 49 4E 47`` ``aa 3e``  |         |
|                  | --------------- ------------------------------ ---------- |         |
|                  | Token           ``"CHARGING"`` argument        ``3989``,  |         |
|                  |                                                as varint  |         |
|                  | =============== ============================== ========== |         |
+------------------+-----------------------------------------------------------+---------+
| When viewed      | ``"Battery state: CHARGING; battery voltage: 3989 mV"``   |         |
+------------------+-----------------------------------------------------------+---------+

Getting started
===============
Integrating ``pw_tokenizer`` requires a few steps beyond building the code. This
112 section describes one way ``pw_tokenizer`` might be integrated with a project.
113 These steps can be adapted as needed.
115 1. Add ``pw_tokenizer`` to your build. Build files for GN, CMake, and Bazel
116 are provided. For Make or other build systems, add the files specified in
117 the BUILD.gn's ``pw_tokenizer`` target to the build.
118 2. Use the tokenization macros in your code. See `Tokenization`_.
119 3. Add the contents of ``pw_tokenizer_linker_sections.ld`` to your project's
120 linker script. In GN and CMake, this step is done automatically.
121 4. Compile your code to produce an ELF file.
122 5. Run ``database.py create`` on the ELF file to generate a CSV token
123 database. See `Managing token databases`_.
6. Commit the token database to your repository. See notes in `Database
   management`_.
126 7. Integrate a ``database.py add`` command to your build to automatically
127 update the committed token database. In GN, use the
128 ``pw_tokenizer_database`` template to do this. See `Update a database`_.
129 8. Integrate ``detokenize.py`` or the C++ detokenization library with your
130 tools to decode tokenized logs. See `Detokenization`_.
Tokenization
============
Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization can
be sent off device or stored in place of a full string.
Tokenization in C and C++
-------------------------
Adding tokenization to a project is simple. To tokenize a string, include
141 ``pw_tokenizer/tokenize.h`` and invoke one of the ``PW_TOKENIZE_`` macros.
143 Tokenize a string literal
144 ^^^^^^^^^^^^^^^^^^^^^^^^^
The ``PW_TOKENIZE_STRING`` macro converts a string literal to a ``uint32_t``
token at compile time.

.. code-block:: cpp

  constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");
.. admonition:: When to use this macro

  Use ``PW_TOKENIZE_STRING`` to tokenize string literals that do not have
  printf-style arguments.
157 Tokenize to a handler function
158 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
159 ``PW_TOKENIZE_TO_GLOBAL_HANDLER`` is the most efficient tokenization function,
160 since it takes the fewest arguments. It encodes a tokenized string to a
161 buffer on the stack. The size of the buffer is set with
162 ``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.
This macro is provided by the ``pw_tokenizer:global_handler`` facade. The
backend for this facade must define the ``pw_tokenizer_HandleEncodedMessage``
C-linkage function.

.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER(format_string_literal, arguments...);

  void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
                                         size_t size_bytes);
175 ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` is similar, but passes a
176 ``uintptr_t`` argument to the global handler function. Values like a log level
177 can be packed into the ``uintptr_t``.
179 This macro is provided by the ``pw_tokenizer:global_handler_with_payload``
180 facade. The backend for this facade must define the
181 ``pw_tokenizer_HandleEncodedMessageWithPayload`` C-linkage function.
.. code-block:: cpp

  PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD(payload,
                                             format_string_literal,
                                             arguments...);

  void pw_tokenizer_HandleEncodedMessageWithPayload(
      uintptr_t payload, const uint8_t encoded_message[], size_t size_bytes);
.. admonition:: When to use these macros

  Use these macros anytime a global handler is sufficient, particularly for
  widely expanded macros, like a logging macro. ``PW_TOKENIZE_TO_GLOBAL_HANDLER``
  and ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` are the most efficient
  macros for tokenizing printf-style strings.
199 Tokenize to a callback
200 ^^^^^^^^^^^^^^^^^^^^^^
201 ``PW_TOKENIZE_TO_CALLBACK`` tokenizes to a buffer on the stack and calls a
202 ``void(const uint8_t* buffer, size_t buffer_size)`` callback that is provided at
203 the call site. The size of the buffer is set with
204 ``PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES``.
.. code-block:: cpp

  PW_TOKENIZE_TO_CALLBACK(HandlerFunction, "Format string: %x", arguments...);
.. admonition:: When to use this macro

  Use ``PW_TOKENIZE_TO_CALLBACK`` if the global handler version is already in
  use for another purpose or more flexibility is needed.
Tokenize to a buffer
^^^^^^^^^^^^^^^^^^^^
The most flexible tokenization macro is ``PW_TOKENIZE_TO_BUFFER``, which encodes
218 to a caller-provided buffer.
.. code-block:: cpp

  uint8_t buffer[BUFFER_SIZE];
  size_t size_bytes = sizeof(buffer);
  PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, format_string_literal, arguments...);
226 While ``PW_TOKENIZE_TO_BUFFER`` is maximally flexible, it takes more arguments
227 than the other macros, so its per-use code size overhead is larger.
.. admonition:: When to use this macro

  Use ``PW_TOKENIZE_TO_BUFFER`` to encode to a custom-sized buffer or if the
  other macros are insufficient. Avoid using ``PW_TOKENIZE_TO_BUFFER`` in
  widely expanded macros, such as a logging macro, because it will result in
  larger code size than its alternatives.
236 Example: binary logging
237 ^^^^^^^^^^^^^^^^^^^^^^^
String tokenization is perfect for logging. Consider the following log macro,
which gathers the file, line number, and log message. It calls the ``RecordLog``
function, which formats the log string, collects a timestamp, and transmits the
result.
.. code-block:: cpp

  #define LOG_INFO(format, ...) \
      RecordLog(LogLevel_INFO, __FILE_NAME__, __LINE__, format, ##__VA_ARGS__)

  void RecordLog(LogLevel level, const char* file, int line, const char* format,
                 ...) {
    if (level < current_log_level) {
      return;
    }

    char buffer[kLogBufferSizeBytes];  // buffer for the formatted log line
    int bytes = snprintf(buffer, sizeof(buffer), "%s:%d ", file, line);

    va_list args;
    va_start(args, format);
    bytes += vsnprintf(&buffer[bytes], sizeof(buffer) - bytes, format, args);
    va_end(args);

    TransmitLog(TimeSinceBootMillis(), buffer, bytes);
  }
264 It is trivial to convert this to a binary log using the tokenizer. The
265 ``RecordLog`` call is replaced with a
266 ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD`` invocation. The
267 ``pw_tokenizer_HandleEncodedMessageWithPayload`` implementation collects the
268 timestamp and transmits the message with ``TransmitLog``.
.. code-block:: cpp

  #define LOG_INFO(format, ...)                   \
      PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD( \
          (pw_tokenizer_Payload)LogLevel_INFO,    \
          __FILE_NAME__ ":%d " format,            \
          __LINE__,                               \
          ##__VA_ARGS__)

  extern "C" void pw_tokenizer_HandleEncodedMessageWithPayload(
      uintptr_t level, const uint8_t encoded_message[], size_t size_bytes) {
    if (static_cast<LogLevel>(level) >= current_log_level) {
      TransmitLog(TimeSinceBootMillis(), encoded_message, size_bytes);
    }
  }
286 Note that the ``__FILE_NAME__`` string is directly included in the log format
287 string. Since the string is tokenized, this has no effect on binary size. A
288 ``%d`` for the line number is added to the format string, so that changing the
289 line of the log message does not generate a new token. There is no overhead for
additional tokens, but it may not be desirable to fill a token database with
duplicate log lines.
293 Tokenizing function names
294 -------------------------
295 The string literal tokenization functions support tokenizing string literals or
296 constexpr character arrays (``constexpr const char[]``). In GCC and Clang, the
297 special ``__func__`` variable and ``__PRETTY_FUNCTION__`` extension are declared
298 as ``static constexpr char[]`` in C++ instead of the standard ``static const
299 char[]``. This means that ``__func__`` and ``__PRETTY_FUNCTION__`` can be
300 tokenized while compiling C++ with GCC or Clang.
.. code-block:: cpp

  // Tokenize the special function name variables.
  constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
  constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);

  // Tokenize the function name variables to a handler function.
  PW_TOKENIZE_TO_GLOBAL_HANDLER(__func__);
  PW_TOKENIZE_TO_GLOBAL_HANDLER(__PRETTY_FUNCTION__);
312 Note that ``__func__`` and ``__PRETTY_FUNCTION__`` are not string literals.
They are defined as static character arrays, so they cannot be implicitly
concatenated with string literals. For example, ``printf(__func__ ": %d",
123);`` will not compile.
317 Tokenization in Python
318 ----------------------
319 The Python ``pw_tokenizer.encode`` module has limited support for encoding
320 tokenized messages with the ``encode_token_and_args`` function.
322 .. autofunction:: pw_tokenizer.encode.encode_token_and_args
Encoding
--------
The token is a 32-bit hash calculated during compilation. The string is encoded
little-endian with the token followed by arguments, if any. For example, the
31-byte string ``You can go about your business.`` hashes to 0xdac9a244.
This is encoded as 4 bytes: ``44 a2 c9 da``.
331 Arguments are encoded as follows:
* **Integers** (1--10 bytes) --
  `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-integers>`_,
  similarly to Protocol Buffers. Smaller values take fewer bytes.
* **Floating point numbers** (4 bytes) -- Single precision floating point.
* **Strings** (1--128 bytes) -- Length byte followed by the string contents.
  The top bit of the length byte indicates whether the string was truncated or
  not. The remaining 7 bits encode the string length, with a maximum of 127
  bytes.
342 .. TODO: insert diagram here!
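
The following Python sketch illustrates how a message is assembled. It is a
simplified illustration rather than the production encoder
(``pw_tokenizer.encode.encode_token_and_args`` is the real implementation); it
assumes 32-bit integer arguments and omits string truncation marking.

.. code-block:: python

  import struct

  def _zigzag(value: int, bits: int = 32) -> int:
      """ZigZag-encodes a signed integer so small magnitudes stay small."""
      return ((value << 1) ^ (value >> (bits - 1))) & ((1 << bits) - 1)

  def _varint(value: int) -> bytes:
      """Encodes an unsigned integer as a base-128 varint, low bits first."""
      output = bytearray()
      while True:
          output.append((value & 0x7F) | (0x80 if value > 0x7F else 0x00))
          value >>= 7
          if not value:
              return bytes(output)

  def encode(token: int, *args) -> bytes:
      """Encodes a token and its arguments in the format described above."""
      message = bytearray(struct.pack('<I', token))  # little-endian 32-bit token
      for arg in args:
          if isinstance(arg, int):
              message += _varint(_zigzag(arg))
          elif isinstance(arg, float):
              message += struct.pack('<f', arg)  # single-precision float
          else:  # string: length byte, then contents (truncation bit omitted)
              data = str(arg).encode()[:127]
              message.append(len(data))
              message += data
      return bytes(message)

  # Reproduces the tokenized logging example above (15 bytes):
  # d9 28 47 8e 08 43 48 41 52 47 49 4e 47 aa 3e
  print(encode(0x8E4728D9, 'CHARGING', 3989).hex(' '))
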
345 ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s`` arguments
346 short or avoid encoding them as strings (e.g. encode an enum as an integer
347 instead of a string). See also `Tokenized strings as %s arguments`_.
349 Token generation: fixed length hashing at compile time
350 ------------------------------------------------------
351 String tokens are generated using a modified version of the x65599 hash used by
352 the SDBM project. All hashing is done at compile time.
354 In C code, strings are hashed with a preprocessor macro. For compatibility with
355 macros, the hash must be limited to a fixed maximum number of characters. This
356 value is set by ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. Increasing
357 ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` increases the compilation time for C due to
358 the complexity of the hashing macros.
In C++, the tokenization macros use a constexpr hash function instead of a
preprocessor macro. This function works with any length of string and has a
lower compilation-time impact than the C macros. For consistency, C++
tokenization uses the same hash algorithm, but the calculated values will differ
between C and C++ for strings longer than ``PW_TOKENIZER_CFG_C_HASH_LENGTH``
characters.
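
As an illustration, a fixed-length hash of this style can be computed in a few
lines of Python. This sketch is illustrative only; the exact modified x65599
algorithm lives in the ``pw_tokenizer`` sources, and the ``hash_length``
parameter stands in for ``PW_TOKENIZER_CFG_C_HASH_LENGTH``.

.. code-block:: python

  from typing import Optional

  def x65599_style_hash(string: str, hash_length: Optional[int] = None) -> int:
      """Computes a fixed-length x65599-style hash (illustrative sketch).

      Only the first hash_length characters contribute to the sum, as with
      the C macros; None hashes the whole string, as the C++ function does.
      """
      hash_value = len(string)  # seed with the string's full length
      coefficient = 65599
      for char in string[:hash_length]:
          hash_value = (hash_value + ord(char) * coefficient) % 2**32
          coefficient = (coefficient * 65599) % 2**32
      return hash_value
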
Tokenization domains
--------------------
``pw_tokenizer`` supports having multiple tokenization domains. Domains are a
string label associated with each tokenized string. This allows projects to keep
tokens from different sources separate. Potential use cases include the
following:
373 * Keep large sets of tokenized strings separate to avoid collisions.
374 * Create a separate database for a small number of strings that use truncated
375 tokens, for example only 10 or 16 bits instead of the full 32 bits.
377 If no domain is specified, the domain is empty (``""``). For many projects, this
378 default domain is sufficient, so no additional configuration is required.
.. code-block:: cpp

  // Tokenizes this string to the default ("") domain.
  PW_TOKENIZE_STRING("Hello, world!");

  // Tokenizes this string to the "my_custom_domain" domain.
  PW_TOKENIZE_STRING_DOMAIN("my_custom_domain", "Hello, world!");
388 The database and detokenization command line tools default to reading from the
389 default domain. The domain may be specified for ELF files by appending
390 ``#DOMAIN_NAME`` to the file path. Use ``#.*`` to read from all domains. For
391 example, the following reads strings in ``some_domain`` from ``my_image.elf``.
.. code-block:: sh

  ./database.py create --database my_db.csv path/to/my_image.elf#some_domain
See `Managing token databases`_ for information about the ``database.py``
commands.
Token databases
===============
Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.
407 Token databases contain the token, removal date (if any), and string for each
408 tokenized string. Two token database formats are supported: CSV and binary.
CSV database format
-------------------
The CSV database format has three columns: the token in hexadecimal, the removal
date (if any) in year-month-day format, and the string literal, surrounded by
quotes. Quote characters within the string are represented as two quote
characters.

This example database contains six strings, three of which have removal dates.
.. code-block:: text

  141c35d5,          ,"The answer: ""%s"""
  2e668cd6,2019-12-25,"Jello, world!"
  7b940e2a,          ,"Hello %s! %hd %e"
  851beeb6,          ,"%u %d"
  881436a0,2020-01-01,"The answer is: %s"
  e13b0f94,2020-04-01,"%llu"
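
Because the format is plain CSV, databases are easy to process with standard
tools. The following minimal Python sketch is illustrative; it is not part of
the ``pw_tokenizer`` API.

.. code-block:: python

  import csv
  from datetime import date
  from typing import Iterator, Optional, Tuple

  def read_csv_database(path: str) -> Iterator[Tuple[int, Optional[date], str]]:
      """Yields (token, removal date or None, string) rows from a CSV database."""
      with open(path, newline='') as db_file:
          for token, removal, string in csv.reader(db_file):
              yield (int(token, 16),
                     date.fromisoformat(removal) if removal.strip() else None,
                     string)
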
428 Binary database format
429 ----------------------
The binary database format consists of a 16-byte header followed by a series
of 8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals are stored next in the same
order as the entries. Strings are stored with null terminators. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/refs/heads/master/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for details.
437 The binary form of the CSV database is shown below. It contains the same
438 information, but in a more compact and easily processed form. It takes 141 B
439 compared with the CSV database's 211 B.
.. code-block:: text

  0x00: 454b4f54 0000534e  TOKENS..
  0x08: 00000006 00000000  ........

  0x10: 141c35d5 ffffffff  .5......
  0x18: 2e668cd6 07e30c19  ..f.....
  0x20: 7b940e2a ffffffff  *..{....
  0x28: 851beeb6 ffffffff  ........
  0x30: 881436a0 07e40101  .6......
  0x38: e13b0f94 07e40401  ..;.....

  0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22  The answer: "%s"
  0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48  .Jello, world!.H
  0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00  ello %s! %hd %e.
  0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72  %u %d.The answer
  0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00           is: %s.%llu.
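
For illustration, the following Python sketch parses this layout: it reads the
entry count from the header, then the 8-byte entries, then the null-terminated
strings. ``token_database.h`` is the authoritative definition of the format.

.. code-block:: python

  import struct

  def parse_binary_database(data: bytes):
      """Returns (token, raw removal date, string) tuples (illustrative sketch)."""
      entry_count = struct.unpack_from('<I', data, 8)[0]  # count in 16-byte header
      entries = []
      offset = 16
      for _ in range(entry_count):
          token, removal = struct.unpack_from('<II', data, offset)
          # 0xFFFFFFFF means no removal date; in the dump above, 0x07e30c19
          # corresponds to 2019-12-25.
          entries.append((token, None if removal == 0xFFFFFFFF else removal))
          offset += 8
      strings = data[offset:].split(b'\x00')[:entry_count]  # in entry order
      return [(token, removal, string.decode())
              for (token, removal), string in zip(entries, strings)]
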
462 Managing token databases
463 ------------------------
464 Token databases are managed with the ``database.py`` script. This script can be
465 used to extract tokens from compilation artifacts and manage database files.
466 Invoke ``database.py`` with ``-h`` for full usage information.
468 An example ELF file with tokenized logs is provided at
469 ``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that
470 file to experiment with the ``database.py`` commands.
Create a database
^^^^^^^^^^^^^^^^^
The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
etc.), archives (.a), or existing token databases (CSV or binary).

.. code-block:: sh

  ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...
Two database formats are supported: CSV and binary. Provide ``--type binary`` to
``create`` to generate a binary database instead of the default CSV. CSV
databases are great for checking into source control and for human review.
Binary databases are more compact and simpler to parse. The C++ detokenizer
library currently only supports binary databases.
Update a database
^^^^^^^^^^^^^^^^^
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

  ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...
496 A CSV token database can be checked into a source repository and updated as code
497 changes are made. The build system can invoke ``database.py`` to update the
498 database after each build.
GN integration
^^^^^^^^^^^^^^
Token databases may be updated or created as part of a GN build. The
503 ``pw_tokenizer_database`` template provided by ``dir_pw_tokenizer/database.gni``
504 automatically updates an in-source tokenized strings database or creates a new
505 database with artifacts from one or more GN targets or other database files.
507 To create a new database, set the ``create`` variable to the desired database
508 type (``"csv"`` or ``"binary"``). The database will be created in the output
509 directory. To update an existing database, provide the path to the database with
510 the ``database`` variable.
512 Each database in the source tree can only be updated from a single
513 ``pw_tokenizer_database`` rule. Updating the same database in multiple rules
514 results in ``Duplicate output file`` GN errors or ``multiple rules generate
515 <file>`` Ninja errors. To avoid these errors, ``pw_tokenizer_database`` rules
516 should be defined in the default toolchain, and the input targets should be
517 referenced with specific toolchains.
.. code-block::

  import("//build_overrides/pigweed.gni")

  import("$dir_pw_tokenizer/database.gni")

  pw_tokenizer_database("my_database") {
    database = "database_in_the_source_tree.csv"
    targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
    input_databases = [ "other_database.csv" ]
  }
Detokenization
==============
Detokenization is the process of expanding a token to the string it represents
and decoding its arguments. This module provides Python and C++ detokenization
libraries.
537 **Example: decoding tokenized logs**
539 A project might tokenize its log messages with the `Base64 format`_. Consider
540 the following log file, which has four tokenized logs and one plain text log:
.. code-block:: text

  20200229 14:38:58 INF $HL2VHA==
  20200229 14:39:00 DBG $5IhTKg==
  20200229 14:39:20 DBG Crunching numbers to calculate probability of success
  20200229 14:39:21 INF $EgFj8lVVAUI=
  20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=
The project's log strings are stored in a database like the following:

.. code-block:: text

  1c95bd1c,          ,"Initiating retrieval process for recovery object"
  2a5388e4,          ,"Determining optimal approach and coordinating vectors"
  3743540c,          ,"Recovery object retrieval failed with status %s"
  f2630112,          ,"Calculated acceptable probability of success (%.2f%%)"
Using the detokenizing tools with the database, the logs can be decoded:

.. code-block:: text

  20200229 14:38:58 INF Initiating retrieval process for recovery object
  20200229 14:39:00 DBG Determining optimal approach and coordinating vectors
  20200229 14:39:20 DBG Crunching numbers to calculate probability of success
  20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
  20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY
This example uses the `Base64 format`_, which occupies about 4/3 (133%) as
much space as the default binary format when encoded. For projects that wish
to interleave tokenized with plain text, using Base64 is a worthwhile
tradeoff.
Python
------
To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer``
579 package, and instantiate it with paths to token databases or ELF files.
.. code-block:: python

  import pw_tokenizer

  detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')

  def process_log_message(log_message):
      result = detokenizer.detokenize(log_message.payload)
      print(result)
591 The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
592 class, which can be used in place of the standard ``Detokenizer``. This class
593 monitors database files for changes and automatically reloads them when they
594 change. This is helpful for long-running tools that use detokenization.
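
For example, a long-running tool might construct one in place of a
``Detokenizer``. This is a sketch, assuming the class takes database paths the
same way ``Detokenizer`` does:

.. code-block:: python

  import pw_tokenizer

  # Reloads the token databases automatically when the files change on disk.
  detokenizer = pw_tokenizer.AutoUpdatingDetokenizer(
      'path/to/database.csv', 'other/path.elf')
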
C++
---
The C++ detokenization libraries can be used in C++ or any language that can
599 call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
600 Java Native Interface (JNI) implementation is provided.
602 The C++ detokenization library uses binary-format token databases (created with
603 ``database.py create --type binary``). Read a binary format database from a
604 file or include it in the source code. Pass the database array to
605 ``TokenDatabase::Create``, and construct a detokenizer.
.. code-block:: cpp

  Detokenizer detokenizer(TokenDatabase::Create(token_database_array));

  std::string ProcessLog(span<uint8_t> log_data) {
    return detokenizer.Detokenize(log_data).BestString();
  }
615 The ``TokenDatabase`` class verifies that its data is valid before using it. If
it is invalid, ``TokenDatabase::Create`` returns an empty database for which
617 ``ok()`` returns false. If the token database is included in the source code,
618 this check can be done at compile time.
.. code-block:: cpp

  // This line fails to compile with a static_assert if the database is invalid.
  constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();

  Detokenizer OpenDatabase(std::string_view path) {
    std::vector<uint8_t> data = ReadWholeFile(path);

    TokenDatabase database = TokenDatabase::Create(data);

    // This checks if the file contained a valid database. It is safe to use a
    // TokenDatabase that failed to load (it will be empty), but it may be
    // desirable to provide a default database or otherwise handle the error.
    if (database.ok()) {
      return Detokenizer(database);
    }
    return Detokenizer(kDefaultDatabase);
  }
Base64 format
=============
The tokenizer encodes messages to a compact binary representation. Applications
may desire a textual representation of tokenized strings. This makes it easy to
use tokenized messages alongside plain text messages, but comes at a small
efficiency cost: encoded Base64 messages occupy about 4/3 (133%) as much memory
as binary messages.
647 The Base64 format is comprised of a ``$`` character followed by the
648 Base64-encoded contents of the tokenized message. For example, consider
649 tokenizing the string ``This is an example: %d!`` with the argument -1. The
650 string's token is 0x4b016e66.
.. code-block:: text

  Source code: PW_TOKENIZE_TO_GLOBAL_HANDLER("This is an example: %d!", -1);

  Plain text:  This is an example: -1! [23 bytes]

  Binary:      66 6e 01 4b 01          [ 5 bytes]

  Base64:      $Zm4BSwE=               [ 9 bytes]
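
This conversion is easy to reproduce off-device. The following Python sketch is
illustrative (the C and C++ functions below are the production encoders); it
produces the Base64 message shown above:

.. code-block:: python

  import base64

  def prefixed_base64_encode(encoded_message: bytes) -> str:
      """Returns the $-prefixed Base64 form of a binary tokenized message."""
      return '$' + base64.b64encode(encoded_message).decode()

  # The binary message from the example above: token 0x4b016e66, argument -1.
  print(prefixed_base64_encode(bytes.fromhex('666e014b01')))  # $Zm4BSwE=
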
Encoding
--------
To encode with the Base64 format, add a call to
``pw::tokenizer::PrefixedBase64Encode`` or ``pw_tokenizer_PrefixedBase64Encode``
in the tokenizer handler function. For example:
.. code-block:: cpp

  void pw_tokenizer_HandleEncodedMessage(const uint8_t encoded_message[],
                                         size_t size_bytes) {
    char base64_buffer[64];
    size_t base64_size = pw::tokenizer::PrefixedBase64Encode(
        pw::span(encoded_message, size_bytes), base64_buffer);

    TransmitLogMessage(base64_buffer, base64_size);
  }
Decoding
--------
Base64 decoding and detokenizing are supported in the Python detokenizer through
the ``detokenize_base64`` and related functions.
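
For example, a log processor might decode each line as it arrives. This sketch
assumes ``detokenize_base64`` takes the raw line and returns it with each
``$``-prefixed message replaced by its detokenized string; check the module for
the exact signature:

.. code-block:: python

  import pw_tokenizer

  detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv')

  def decode_log_line(line: bytes) -> bytes:
      # Plain text passes through unchanged; $-prefixed Base64 messages are
      # replaced with the strings they represent.
      return detokenizer.detokenize_base64(line)
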
685 The Python detokenization tools support recursive detokenization for prefixed
686 Base64 text. Tokenized strings found in detokenized text are detokenized, so
687 prefixed Base64 messages can be passed as ``%s`` arguments.
For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could be
passed as an argument to the printf-style string ``Nested message: %s``, which
encodes to ``$pEVTYQkkUmhZam1RPT0=``. The detokenizer decodes the message in
two steps:

.. code-block:: text

  "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"
Base64 decoding is supported in C++ or C with the
``pw::tokenizer::PrefixedBase64Decode`` or ``pw_tokenizer_PrefixedBase64Decode``
functions. The following is a sketch with placeholder handler names; it assumes
the C++ overload that decodes a Base64 message into a caller-provided binary
buffer and returns the number of decoded bytes (see ``pw_tokenizer/base64.h``
for the exact signatures):

.. code-block:: cpp

  void HandleBase64Message(std::string_view base64_message) {
    // Buffer for the decoded binary message; the size is arbitrary here.
    std::byte binary_buffer[64];
    size_t binary_size = pw::tokenizer::PrefixedBase64Decode(
        base64_message, binary_buffer);

    // Hand the binary message to a detokenizer or other handler.
    HandleBinaryMessage(binary_buffer, binary_size);
  }
713 Command line utilities
714 ^^^^^^^^^^^^^^^^^^^^^^
715 ``pw_tokenizer`` provides two standalone command line utilities for detokenizing
716 Base64-encoded tokenized strings.
* ``detokenize.py`` -- Detokenizes Base64-encoded strings in files or from
  stdin.
* ``detokenize_serial.py`` -- Detokenizes Base64-encoded strings from a
  connected serial device.
723 If the ``pw_tokenizer`` Python package is installed, these tools may be executed
724 as runnable modules. For example:
.. code-block:: sh

  # Detokenize Base64-encoded strings in a file
  python -m pw_tokenizer.detokenize -i input_file.txt

  # Detokenize Base64-encoded strings in output from a serial device
  python -m pw_tokenizer.detokenize_serial --device /dev/ttyACM0
734 See the ``--help`` options for these tools for full usage information.
Deployment war story
====================
The tokenizer module was developed to bring tokenized logging to an
in-development product. The product already had an established text-based
logging system. Deploying tokenization was straightforward and had substantial
benefits.

**Results**

* Log contents shrunk by over 50%, even with Base64 encoding.

  * Significant size savings for encoded logs, even using the less-efficient
    Base64 encoding required for compatibility with the existing log system.
  * Freed valuable communication bandwidth.
  * Allowed storing many more logs in crash dumps.

* Substantial flash savings.

  * Reduced the size of firmware images by up to 18%.

* Simpler logging code.

  * Removed CPU-heavy ``snprintf`` calls.
  * Removed complex code for forwarding log arguments to a low-priority task.
This section describes the tokenizer deployment process and highlights key
insights.
Firmware deployment
-------------------
* In the project's logging macro, calls to the underlying logging function
  were replaced with a ``PW_TOKENIZE_TO_GLOBAL_HANDLER_WITH_PAYLOAD``
  invocation.
* The log level was passed as the payload argument to facilitate runtime log
  level filtering.
* For this project, it was necessary to encode the log messages as text. In
  ``pw_tokenizer_HandleEncodedMessageWithPayload``, the log messages were
  encoded in the $-prefixed `Base64 format`_, then dispatched as normal log
  messages.
* Asserts were tokenized using ``PW_TOKENIZE_TO_CALLBACK``.
778 Do not encode line numbers in tokenized strings. This results in a huge
779 number of lines being added to the database, since every time code moves,
780 new strings are tokenized. If line numbers are desired in a tokenized
781 string, add a ``"%d"`` to the string and pass ``__LINE__`` as an argument.
Database management
-------------------
* The token database was stored as a CSV file in the project's Git repo.
* The token database was automatically updated as part of the build, and
  developers were expected to check in the database changes alongside their
  code changes.
* A presubmit check verified that all strings added by a change were added to
  the token database.
* The token database included logs and asserts for all firmware images in the
  project.
* No strings were purged from the token database.
796 Merge conflicts may be a frequent occurrence with an in-source database. If
797 the database is in-source, make sure there is a simple script to resolve any
798 merge conflicts. The script could either keep both sets of lines or discard
799 local changes and regenerate the database.
801 Decoding tooling deployment
802 ---------------------------
* The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

  * Product-specific Python command line tools, using
    ``pw_tokenizer.Detokenizer``.
  * Standalone script for decoding prefixed Base64 tokens in files or
    live output (e.g. from ``adb``), using ``detokenize.py``'s command line
    interface.

* The C++ detokenizer library was deployed to two Android apps with a Java
  Native Interface (JNI) layer.

  * The binary token database was included as a raw resource in the APK.
  * In one app, the built-in token database could be overridden by copying a
    file to the phone.
Make the tokenized logging tools simple to use for your project.

* Provide simple wrapper shell scripts that fill in arguments for the
  project. For example, point ``detokenize.py`` to the project's token
  databases.
* Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
  continuously-running tools, so that users don't have to restart the tool
  when the token database updates.
* Integrate detokenization everywhere it is needed. Integrating the tools
  takes just a few lines of code, and token databases can be embedded in
  APKs or binaries.
831 Limitations and future work
832 ===========================
834 GCC bug: tokenization in template functions
835 -------------------------------------------
836 GCC incorrectly ignores the section attribute for template
837 `functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and
838 `variables <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. Due to this
839 bug, tokenized strings in template functions may be emitted into ``.rodata``
840 instead of the special tokenized string section. This causes two problems:
842 1. Tokenized strings will not be discovered by the token database tools.
843 2. Tokenized strings may not be removed from the final binary.
Clang does **not** have this issue! Use Clang to avoid it.
847 It is possible to work around this bug in GCC. One approach would be to tag
848 format strings so that the database tools can find them in ``.rodata``. Then, to
849 remove the strings, compile two binaries: one metadata binary with all tokenized
850 strings and a second, final binary that removes the strings. The strings could
851 be removed by providing the appropriate linker flags or by removing the ``used``
852 attribute from the tokenized string character array declaration.
64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
857 tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
858 ``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
859 device performed the tokenization.
861 Supporting detokenization of strings tokenized on 64-bit targets would be
862 simple. This could be done by adding an option to switch the 32-bit types to
863 64-bit. The tokenizer stores the sizes of these types in the
864 ``.pw_tokenizer.info`` ELF section, so the sizes of these types can be verified
865 by checking the ELF file, if necessary.
867 Tokenization in headers
868 -----------------------
869 Tokenizing code in header files (inline functions or templates) may trigger
870 warnings such as ``-Wlto-type-mismatch`` under certain conditions. That
871 is because tokenization requires declaring a character array for each tokenized
872 string. If the tokenized string includes macros that change value, the size of
873 this character array changes, which means the same static variable is defined
874 with different sizes. It should be safe to suppress these warnings, but, when
875 possible, code that tokenizes strings with macros that can change value should
876 be moved to source files rather than headers.
878 Tokenized strings as ``%s`` arguments
879 -------------------------------------
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. It would be better to send a tokenized string
literal as an integer instead of a string argument, but this is not yet
supported.

A string token could be sent by marking an integer argument in a way
recognized by the detokenization tools. The detokenizer would expand the
argument to the string represented by the integer.
.. code-block:: cpp

  #define PW_TOKEN_ARG PRIx32 "<PW_TOKEN]"

  constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");

  PW_TOKENIZE_TO_GLOBAL_HANDLER("Knock knock: %" PW_TOKEN_ARG "?", answer_token);
897 Strings with arguments could be encoded to a buffer, but since printf strings
898 are null-terminated, a binary encoding would not work. These strings can be
899 prefixed Base64-encoded and sent as ``%s`` instead. See `Base64 format`_.
Another possibility: encode strings with arguments to a ``uint64_t`` and send
them as an integer. This would be efficient and simple, but would only support
a small number of arguments.
905 Legacy tokenized string ELF format
906 ==================================
The original version of ``pw_tokenizer`` stored tokenized strings as plain C
strings in the ELF file instead of structured tokenized string entries. Strings
in different domains were stored in different linker sections. The Python script
that parsed the ELF file would re-calculate the tokens.
912 In the current version of ``pw_tokenizer``, tokenized strings are stored in a
913 structured entry containing a token, domain, and length-delimited string. This
914 has several advantages over the legacy format:
916 * The Python script does not have to recalculate the token, so any hash
917 algorithm may be used in the firmware.
918 * In C++, the tokenization hash no longer has a length limitation.
919 * Strings with null terminators in them are properly handled.
920 * Only one linker section is required in the linker script, instead of a
921 separate section for each domain.
To migrate to the new format, all that is required is to update the linker
sections to match those in ``pw_tokenizer_linker_sections.ld``. Replace all
``pw_tokenized.<DOMAIN>`` sections with one ``pw_tokenizer.entries`` section.
The Python tooling continues to support the legacy tokenized string ELF format.
Dependencies
============
* ``pw_varint`` module
937 * ``pw_preprocessor`` module