updated LZ4 block format

author Yann Collet <cyan@fb.com>

Wed, 2 Jan 2019 22:36:12 +0000 (14:36 -0800)

committer Yann Collet <cyan@fb.com>

Wed, 2 Jan 2019 22:36:12 +0000 (14:36 -0800)
author Yann Collet <cyan@fb.com>
Wed, 2 Jan 2019 22:36:12 +0000 (14:36 -0800)
committer Yann Collet <cyan@fb.com>
Wed, 2 Jan 2019 22:36:12 +0000 (14:36 -0800)
diff --git a/doc/lz4_Block_format.md b/doc/lz4_Block_format.md

index 5438730..2fb4c19 100644 (file)
--- a/doc/lz4_Block_format.md
+++ b/doc/lz4_Block_format.md
@@ -1,6 +1,6 @@
  LZ4 Block Format Description
  ============================
-Last revised: 2018-04-25.
+Last revised: 2018-12-30.
  Author : Yann Collet
  
  
@@ -10,7 +10,8 @@ using any programming language.
  
  LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
  There is no entropy encoder back-end nor framing layer.
-The latter is assumed to be handled by other parts of the system (see [LZ4 Frame format]).
+The latter is assumed to be handled by other parts of the system
+(see [LZ4 Frame format]).
  This design is assumed to favor simplicity and speed.
  It helps later on for optimizations, compactness, and features.
  
@@ -104,45 +105,41 @@ A common case is an offset of 1,
  meaning the last byte is repeated `matchlength` times.
  
  
-Parsing restrictions
+End of block restrictions
  -----------------------
-There are specific parsing rules to respect in order to remain compatible
-with assumptions made by the decoder :
-
-1. The last 5 bytes are always literals.  In other words, the last five bytes
-   from the uncompressed input (or all bytes, if the input has less than five
-   bytes) must be encoded as literals on behalf of the last sequence.
-   The last sequence is incomplete, and stops right after the literals.
-2. The last match must start at least 12 bytes before end of block.
-   The last match is part of the penultimate sequence,
-   since the last sequence stops right after literals.
+There are specific rules required to terminate a block.
+
+1. The last sequence only contains literals. The block ends right after them.
+1. The last 5 bytes of input are always literals.
+   Therefore, the last sequence contains at least 5 bytes,
+   or all input bytes if input is smaller than 5 bytes
+   (empty input can be represented with a zero byte,
+   interpreted as a token without literal and without a match).
+2. The last match must start at least 12 bytes before the end of block.
+   The last match is part of the penultimate sequence.
+   It is followed by the last sequence, which only contains literals.
     Note that, as a consequence, blocks < 13 bytes cannot be compressed.
  
-These rules are in place to ensure that the decoder
-can speculatively execute copy instructions
-without ever reading nor writing beyond provided I/O buffers.
-
-1. To copy literals from a non-last sequence, an 8-byte copy instruction
-   can always be safely issued (without reading past the input),
-   because literals are followed by a 2-byte offset,
-   and last sequence is at least 1+5 bytes long.
-2. Similarly, a match operation can speculatively copy up to 12 bytes
-   while remaining within output buffer boundaries.
-
-Empty inputs can be represented with a zero byte,
-interpreted as a token without literals and without a match.
+These rules are in place to ensure that a compatible decoder
+can be designed for speed, issuing speculatively instructions,
+while never reading nor writing beyond provided I/O buffers.
  
  
  Additional notes
  -----------------------
-There is no assumption nor limits to the way the compressor
+If the decoder will decompress data from an external source,
+it is recommended to ensure that the decoder will not be vulnerable to
+buffer overflow manipulations.
+Always ensure that read and write operations
+remain within the limits of provided buffers.
+Test the decoder with fuzzers
+to ensure it's resilient to improbable combinations.
+
+The format makes no assumption nor limits to the way the compressor
  searches and selects matches within the source data block.
-It could be a fast scan, a multi-probe, a full search using BST,
-standard hash chains or MMC, well whatever.
-
-Advanced parsing strategies can also be implemented, such as lazy match,
-or full optimal parsing.
-
-All these trade-off offer distinctive speed/memory/compression advantages.
-Whatever the method used by the compressor, its result will be decodable
-by any LZ4 decoder if it follows the format specification described above.
+Multiple techniques can be considered,
+featuring distinct time / performance trade offs.
+As long as the format is respected,
+the result will be compatible and decodable by any compliant decoder.
+An upper compression limit can be reached,
+using a technique called "full optimal parsing", at high cpu cost.
author	Yann Collet <cyan@fb.com>
	Wed, 2 Jan 2019 22:36:12 +0000 (14:36 -0800)
committer	Yann Collet <cyan@fb.com>
	Wed, 2 Jan 2019 22:36:12 +0000 (14:36 -0800)