draft-korn-vcdiff.txt

   1                                                      David G. Korn, AT&T Labs
   2                                              Joshua P. MacDonald, UC Berkeley
   3                                                  Jeffrey C. Mogul, Compaq WRL
   4 Internet-Draft                                       Kiem-Phong Vo, AT&T Labs
   5 Expires: 09 November 2002                                    09 November 2001
   6
   7
   8         The VCDIFF Generic Differencing and Compression Data Format
   9
  10                          draft-korn-vcdiff-06.txt
  11
  12
  13
  14 Status of this Memo
  15
  16     This document is an Internet-Draft and is in full conformance
  17     with all provisions of Section 10 of RFC2026.
  18
  19     Internet-Drafts are working documents of the Internet Engineering
  20     Task Force (IETF), its areas, and its working groups.  Note that
  21     other groups may also distribute working documents as
  22     Internet-Drafts.
  23
  24     Internet-Drafts are draft documents valid for a maximum of six
  25     months and may be updated, replaced, or obsoleted by other
  26     documents at any time.  It is inappropriate to use Internet-
  27     Drafts as reference material or to cite them other than as
  28     "work in progress."
  29
  30     The list of current Internet-Drafts can be accessed at
  31     http://www.ietf.org/ietf/1id-abstracts.txt
  32
  33     The list of Internet-Draft Shadow Directories can be accessed at
  34     http://www.ietf.org/shadow.html.
  35
  36
  37 Abstract
  38
  39     This memo describes a general, efficient and portable data format
  40     suitable for encoding compressed and/or differencing data so that
  41     they can be easily transported among computers.
  42
  43
  44 Table of Contents:
  45
  46     1.  EXECUTIVE SUMMARY ............................................  2
  47     2.  CONVENTIONS ..................................................  3
  48     3.  DELTA INSTRUCTIONS ...........................................  4
  49     4.  DELTA FILE ORGANIZATION ......................................  5
  50     5.  DELTA INSTRUCTION ENCODING ...................................  9
  51     6.  DECODING A TARGET WINDOW ..................................... 14
  52     7.  APPLICATION-DEFINED CODE TABLES .............................. 16
  53     8.  PERFORMANCE .................................................. 16
  54     9.  FURTHER ISSUES ............................................... 17
  55    10.  SUMMARY ...................................................... 18
  56    11.  ACKNOWLEDGEMENTS ............................................. 18
  57    12.  SECURITY CONSIDERATIONS ...................................... 18
  58    13.  SOURCE CODE AVAILABILITY ..................................... 18
  59    14.  INTELLECTUAL PROPERTY RIGHTS ................................. 18
  60    15.  IANA CONSIDERATIONS .......................................... 19
  61    16.  REFERENCES ................................................... 19
  62    17.  AUTHOR'S ADDRESS ............................................. 20
  63
  64
  65 1.  EXECUTIVE SUMMARY
  66
  67     Compression and differencing techniques can greatly improve storage
  68     and transmission of files and file versions.  Since files are often
  69     transported across machines with distinct architectures and performance
  70     characteristics, such data should be encoded in a form that is portable
  71     and can be decoded with little or no knowledge of the encoders.
  72     This document describes Vcdiff, a compact portable encoding format
  73     designed for these purposes.
  74
  75     Data differencing is the process of computing a compact and invertible
  76     encoding of a "target file" given a "source file".  Data compression
  77     is similar but without the use of source data.  The UNIX utilities diff,
  78     compress, and gzip are well-known examples of data differencing and
  79     compression tools.  For data differencing, the computed encoding is
  80     called a "delta file", and, for data compression, it is called
  81     a "compressed file".  Delta and compressed files are good for storage
  82     and transmission as they are often smaller than the originals.
  83
  84     Data differencing and data compression are traditionally treated
  85     as distinct types of data processing.  However, as shown in the Vdelta
  86     technique by Korn and Vo [1], compression can be thought of as a special
  87     case of differencing in which the source data is empty. The basic idea
  88     is to unify the string parsing scheme used in the Lempel-Ziv'77 style
  89     compressors [2], and the block-move technique of Tichy [3].  Loosely
  90     speaking, this works as follows:
  91
  92         a. Concatenate source and target data.
  93         b. Parse the data from left to right as in LZ'77 but
  94            make sure that a parsed segment starts the target data.
  95         c. Start to output when reaching target data.
  96
  97     Parsing is based on string matching algorithms such as suffix trees [4]
  98     or hashing with different time and space performance characteristics.
  99     Vdelta uses a fast string matching algorithm that requires less memory
 100     than other techniques [5,6].  However, even with this algorithm, the
 101     memory requirement can still be prohibitive for large files.  A common
 102     way to deal with memory limitation is to partition an input file into
 103     chunks called "windows" and process them separately. Here, except for
 104     unpublished work by Vo, little has been done on designing effective
 105     windowing schemes. Current techniques, including Vdelta, simply use
 106     source and target windows with corresponding addresses across source
 107     and target files.
 108
 109     String matching and windowing algorithms have large influence on the
 110     compression rate of delta and compressed files. However, it is desirable
 111     to have a portable encoding format that is independent of such algorithms.
 112     This enables construction of client-server applications in which a server
 113     may serve clients with unknown computing characteristics.  Unfortunately,
 114     all current differencing and compressing tools, including Vdelta, fall
 115     short in this respect. Their storage formats are closely intertwined
 116     with the implemented string matching and/or windowing algorithms.
 117
 118     The encoding format Vcdiff proposed here addresses the above issues.
 119     Vcdiff achieves the below characteristics:
 120
 121         Output compactness:
 122             The basic encoding format compactly represents compressed or delta
 123             files. Applications can further extend the basic encoding format
 124             with "secondary encoders" to achieve more compression.
 125
 126         Data portability:
 127             The basic encoding format is free from machine byte order and
 128             word size issues. This allows data to be encoded on one machine
 129             and decoded on a different machine with different architecture.
 130
 131         Algorithm genericity:
 132             The decoding algorithm is independent from string matching and
 133             windowing algorithms. This allows competition among implementations
 134             of the encoder while keeping the same decoder.
 135
 136         Decoding efficiency:
 137             Except for secondary encoder issues, the decoding algorithm runs
 138             in time proportional to the size of the target file and uses space
 139             proportional to the maximal window size.  Vcdiff differs from more
 140             conventional compressors in that it uses only byte-aligned
 141             data, thus avoiding bit-level operations, which improves
 142             decoding speed at the slight cost of compression efficiency.
 143
 144     The Vcdiff data format and the algorithms for decoding data shall be
 145     described next.  Since Vcdiff treats compression as a special case of
 146     differencing, we shall use the term "delta file" to indicate the
 147     compressed output for both cases.
 148
 149
 150 2. CONVENTIONS
 151
 152     The basic data unit is a byte.  For portability, Vcdiff shall limit
 153     a byte to its lower eight bits even on machines with larger bytes.
 154     The bits in a byte are ordered from right to left so that the least
 155     significant bit (LSB) has value 1, and the most significant bit (MSB),
 156     has value 128.
 157
 158     For purposes of exposition in this document, we adopt the convention
 159     that the LSB is numbered 0, and the MSB is numbered 7.  Bit numbers
 160     never appear in the encoded format itself.
 161
 162     Vcdiff encodes unsigned integer values using a portable variable-sized
 163     format (originally introduced in the Sfio library [7]). This encoding
 164     treats an integer as a number in base 128. Then, each digit in this
 165     representation is encoded in the lower seven bits of a byte. Except for
 166     the least significant byte, other bytes have their most significant bit
 167     turned on to indicate that there are still more digits in the encoding.
 168     The two key properties of this integer encoding that are beneficial
 169     to a data compression format are:
 170
 171         a. The encoding is portable among systems using 8-bit bytes, and
 172         b. Small values are encoded compactly.
 173
 174     For example, consider the value 123456789 which can be represented with
 175     four 7-bit digits whose values are 58, 111, 26, 21 in order from most
 176     to least significant. Below is the 8-bit byte encoding of these digits.
 177     Note that the MSBs of 58, 111 and 26 are on.
 178
 179                  +-------------------------------------------+
 180                  | 10111010 | 11101111 | 10011010 | 00010101 |
 181                  +-------------------------------------------+
 182                    MSB+58     MSB+111    MSB+26     0+21
 183
 184
 185     Henceforth, the terms "byte" and "integer" will refer to a byte and an
 186     unsigned integer as described.
 187
 188
 189     From time to time, algorithms are exhibited to clarify the descriptions
 190     of parts of the Vcdiff format. On such occasions, the C language will be
 191     used to make precise the algorithms.  The C code shown in this
 192     document is meant for clarification only, and is not part of the
 193     actual specification of the Vcdiff format.
 194
 195     In this specification, the key words "MUST", "MUST NOT",
 196     "SHOULD", "SHOULD NOT", and "MAY" document are to be interpreted as
 197     described in RFC2119 [12].
 198
 199
 200 3.  DELTA INSTRUCTIONS
 201
 202     A large target file is partitioned into non-overlapping sections
 203     called "target windows". These target windows are processed separately
 204     and sequentially based on their order in the target file.
 205
 206     A target window T of length t may be compared against some source data
 207     segment S of length s. By construction, this source data segment S
 208     comes either from the source file, if one is used, or from a part of
 209     the target file earlier than T.  In this way, during decoding, S is
 210     completely known when T is being decoded.
 211
 212     The choices of T, t, S and s are made by some window selection algorithm
 213     which can greatly affect the size of the encoding. However, as seen later,
 214     these choices are encoded so that no knowledge of the window selection
 215     algorithm is needed during decoding.
 216
 217     Assume that S[j] represents the jth byte in S, and T[k] represents
 218     the kth byte in T.  Then, for the delta instructions, we treat the data
 219     windows S and T as substrings of a superstring U formed by concatenating
 220     them like this:
 221
 222         S[0]S[1]...S[s-1]T[0]T[1]...T[t-1]
 223
 224     The "address" of a byte in S or T is referred to by its location in U.
 225     For example, the address of T[k] is s+k.
 226
 227     The instructions to encode and direct the reconstruction of a target
 228     window are called delta instructions. There are three types:
 229
 230         ADD: This instruction has two arguments, a size x and a sequence of
 231             x bytes to be copied.
 232         COPY: This instruction has two arguments, a size x and an address p
 233             in the string U. The arguments specify the substring of U that
 234             must be copied. We shall assert that such a substring must be
 235             entirely contained in either S or T.
 236         RUN: This instruction has two arguments, a size x and a byte b that
 237             will be repeated x times.
 238
 239     Below are example source and target windows and the delta instructions
 240     that encode the target window in terms of the source window.
 241
 242         a b c d e f g h i j k l m n o p
 243         a b c d w x y z e f g h e f g h e f g h e f g h z z z z
 244
 245         COPY  4, 0
 246         ADD   4, w x y z
 247         COPY  4, 4
 248         COPY 12, 24
 249         RUN   4, z
 250
 251
 252     Thus, the first letter 'a' in the target window is at location 16
 253     in the superstring. Note that the fourth instruction, "COPY 12, 24",
 254     copies data from T itself since address 24 is position 8 in T.
 255     This instruction also shows that it is fine to overlap the data to be
 256     copied with the data being copied from as long as the latter starts
 257     earlier. This enables efficient encoding of periodic sequences,
 258     i.e., sequences with regularly repeated subsequences. The RUN instruction
 259     is a compact way to encode a sequence repeating the same byte even though
 260     such a sequence can be thought of as a periodic sequence with period 1.
 261
 262     To reconstruct the target window, one simply processes one delta
 263     instruction at a time and copy the data either from the source window
 264     or the being reconstructed target window based on the type of the
 265     instruction and the associated address, if any.
 266
 267
 268 4.  DELTA FILE ORGANIZATION
 269
 270     A Vcdiff delta file starts with a Header section followed by a sequence
 271     of Window sections. The Header section includes magic bytes to identify
 272     the file type, and information concerning data processing beyond the
 273     basic encoding format. The Window sections encode the target windows.
 274
 275     Below is the overall organization of a delta file. The indented items
 276     refine the ones immediately above them. An item in square brackets may
 277     or may not be present in the file depending on the information encoded
 278     in the Indicator byte above it.
 279
 280         Header
 281             Header1                                  - byte
 282             Header2                                  - byte
 283             Header3                                  - byte
 284             Header4                                  - byte
 285             Hdr_Indicator                            - byte
 286             [Secondary compressor ID]                - byte
 287
 288 [@@@ Why is compressor ID not an integer? ]
 289 [@@@ If we aren't defining any secondary compressors yet, then it seems
 290 that defining the [Secondary compressor ID] and the corresponding
 291 VCD_DECOMPRESS Hdr_Indicator bit in this draft has no real value.  An
 292 implementation of this specification won't be able to decode a VCDIFF
 293 encoded with this option if it doesn't know about any secondary
 294 compressors.  It seems that you should specify the bits related to
 295 secondary compressors once you have defined the first a secondary
 296 compressor.  I can imagine a secondary-compressor might want to supply
 297 extra information, such as a dictionary of some kind, in which case
 298 this speculative treatment wouldn't go far enough.]
 299
 300             [Length of code table data]              - integer
 301             [Code table data]
 302                 Size of near cache                   - byte
 303                 Size of same cache                   - byte
 304                 Compressed code table data
 305         Window1
 306             Win_Indicator                            - byte
 307             [Source segment size]                    - integer
 308             [Source segment position]                - integer
 309             The delta encoding of the target window
 310                 Length of the delta encoding         - integer
 311                 The delta encoding
 312                     Size of the target window        - integer
 313                     Delta_Indicator                  - byte
 314                     Length of data for ADDs and RUNs - integer
 315                     Length of instructions and sizes - integer
 316                     Length of addresses for COPYs    - integer
 317                     Data section for ADDs and RUNs   - array of bytes
 318                     Instructions and sizes section   - array of bytes
 319                     Addresses section for COPYs      - array of bytes
 320         Window2
 321         ...
 322
 323
 324
 325 4.1 The Header Section
 326
 327     Each delta file starts with a header section organized as below.
 328     Note the convention that square-brackets enclose optional items.
 329
 330             Header1                                  - byte = 0xE6
 331             Header2                                  - byte = 0xD3
 332             Header3                                  - byte = 0xD4
 333
 334 HMMM
 335
 336 0xD6
 337 0xC3
 338 0xC4
 339
 340             Header4                                  - byte
 341             Hdr_Indicator                            - byte
 342             [Secondary compressor ID]                - byte
 343             [Length of code table data]              - integer
 344             [Code table data]
 345
 346     The first three Header bytes are the ASCII characters 'V', 'C' and 'D'
 347     with their most significant bits turned on (in hexadecimal, the values
 348     are 0xE6, 0xD3, and 0xD4). The fourth Header byte is currently set to
 349     zero. In the future, it might be used to indicate the version of Vcdiff.
 350
 351     The Hdr_Indicator byte shows if there are any initialization data
 352     required to aid in the reconstruction of data in the Window sections.
 353     This byte MAY have non-zero values for either, both, or neither of
 354     the two bits VCD_DECOMPRESS and VCD_CODETABLE below:
 355
 356             7 6 5 4 3 2 1 0
 357            +-+-+-+-+-+-+-+-+
 358            | | | | | | | | |
 359            +-+-+-+-+-+-+-+-+
 360                         ^ ^
 361                         | |
 362                         | +-- VCD_DECOMPRESS
 363                         +---- VCD_CODETABLE
 364
 365     If bit 0 (VCD_DECOMPRESS) is non-zero, this indicates that a secondary
 366     compressor may have been used to further compress certain parts of the
 367     delta encoding data as described in Sections 4.3 and 6. In that case,
 368     the ID of the secondary compressor is given next. If this bit is zero,
 369     the compressor ID byte is not included.
 370
 371 [@@@ If we aren't defining any secondary compressors yet, then it seems
 372 this bit has no real value yet..]
 373
 374     If bit 1 (VCD_CODETABLE) is non-zero, this indicates that an
 375     application-defined code table is to be used for decoding the delta
 376     instructions. This table itself is compressed.  The length of the data
 377     comprising this compressed code table and the data follow next. Section 7
 378     discusses application-defined code tables.  If this bit is zero, the code
 379     table data length and the code table data are not included.
 380
 381     If both bits are set, then the compressor ID byte is included
 382     before the code table data length and the code table data.
 383
 384
 385 4.2 The Format of a Window Section
 386
 387     Each Window section is organized as follows:
 388
 389             Win_Indicator                            - byte
 390             [Source segment length]                  - integer
 391             [Source segment position]                - integer
 392             The delta encoding of the target window
 393
 394
 395     Below are the detail of the various items:
 396
 397 [@@@ Here, I want to replace the Win_Indicator with a source-count,
 398 followed by source-count length/position pairs?]
 399
 400         Win_Indicator:
 401             This byte is a set of bits, as shown:
 402
 403             7 6 5 4 3 2 1 0
 404            +-+-+-+-+-+-+-+-+
 405            | | | | | | | | |
 406            +-+-+-+-+-+-+-+-+
 407                         ^ ^
 408                         | |
 409                         | +-- VCD_SOURCE
 410                         +---- VCD_TARGET
 411
 412
 413             If bit 0 (VCD_SOURCE) is non-zero, this indicates that a segment
 414             of data from the "source" file was used as the corresponding
 415             source window of data to encode the target window. The decoder
 416             will use this same source data segment to decode the target window.
 417
 418             If bit 1 (VCD_TARGET) is non-zero, this indicates that a segment
 419             of data from the "target" file was used as the corresponding
 420             source window of data to encode the target window. As above, this
 421             same source data segment is used to decode the target window.
 422
 423             The Win_Indicator byte MUST NOT have more than one of the bits
 424             set (non-zero).  It MAY have none of these bits set.
 425
 426             If one of these bits is set, the byte is followed by two
 427             integers to indicate respectively the length and position of
 428             the source data segment in the relevant file.  If the
 429             indicator byte is zero, the target window was compressed
 430             by itself without comparing against another data segment,
 431             and these two integers are not included.
 432
 433         The delta encoding of the target window:
 434             This contains the delta encoding of the target window either
 435             in terms of the source data segment (i.e., VCD_SOURCE
 436             or VCD_TARGET was set) or by itself if no source window
 437             is specified. This data format is discussed next.
 438
 439
 440 4.3 The Delta Encoding of a Target Window
 441
 442     The delta encoding of a target window is organized as follows:
 443
 444         Length of the delta encoding            - integer
 445         The delta encoding
 446             Length of the target window         - integer
 447             Delta_Indicator                     - byte
 448             Length of data for ADDs and RUNs    - integer
 449             Length of instructions section      - integer
 450             Length of addresses for COPYs       - integer
 451             Data section for ADDs and RUNs      - array of bytes
 452             Instructions and sizes section      - array of bytes
 453             Addresses section for COPYs         - array of bytes
 454
 455
 456         Length of the delta encoding:
 457             This integer gives the total number of remaining bytes that
 458             comprise data of the delta encoding for this target window.
 459
 460         The delta encoding:
 461             This contains the data representing the delta encoding which
 462             is described next.
 463
 464         Length of the target window:
 465             This integer indicates the actual size of the target window
 466             after decompression. A decoder can use this value to allocate
 467             memory to store the uncompressed data.
 468
 469         Delta_Indicator:
 470             This byte is a set of bits, as shown:
 471
 472             7 6 5 4 3 2 1 0
 473            +-+-+-+-+-+-+-+-+
 474            | | | | | | | | |
 475            +-+-+-+-+-+-+-+-+
 476                       ^ ^ ^
 477                       | | |
 478                       | | +-- VCD_DATACOMP
 479                       | +---- VCD_INSTCOMP
 480                       +------ VCD_ADDRCOMP
 481
 482                 VCD_DATACOMP:   bit value 1.
 483                 VCD_INSTCOMP:   bit value 2.
 484                 VCD_ADDRCOMP:   bit value 4.
 485
 486             As discussed, the delta encoding consists of COPY, ADD and RUN
 487             instructions. The ADD and RUN instructions have accompanying
 488             unmatched data (that is, data that does not specifically match
 489             any data in the source window or in some earlier part of the
 490             target window) and the COPY instructions have addresses of where
 491             the matches occur. OPTIONALLY, these types of data MAY be further
 492             compressed using a secondary compressor. Thus, Vcdiff separates
 493             the encoding of the delta instructions into three parts:
 494
 495                 a. The unmatched data in the ADD and RUN instructions,
 496                 b. The delta instructions and accompanying sizes, and
 497                 c. The addresses of the COPY instructions.
 498
 499             If the bit VCD_DECOMPRESS (Section 4.1) was on, each of these
 500             sections may have been compressed using the specified secondary
 501             compressor. The bit positions 0 (VCD_DATACOMP), 1 (VCD_INSTCOMP),
 502             and 2 (VCD_ADDRCOMP) respectively indicate, if non-zero, that
 503             the corresponding parts are compressed. Then, these parts MUST
 504             be decompressed before decoding the delta instructions.
 505
 506         Length of data for ADDs and RUNs:
 507             This is the length (in bytes) of the section of data storing
 508             the unmatched data accompanying the ADD and RUN instructions.
 509
 510         Length of instructions section:
 511             This is the length (in bytes) of the delta instructions and
 512             accompanying sizes.
 513
 514         Length of addresses for COPYs:
 515             This is the length (in bytes) of the section storing
 516             the addresses of the COPY instructions.
 517
 518         Data section for ADDs and RUNs:
 519             This sequence of bytes encodes the unmatched data for the ADD
 520             and RUN instructions.
 521
 522         Instructions and sizes section:
 523             This sequence of bytes encodes the instructions and their sizes.
 524
 525         Addresses section for COPYs:
 526             This sequence of bytes encodes the addresses of the COPY
 527             instructions.
 528
 529
 530 5. DELTA INSTRUCTION ENCODING
 531
 532     The delta instructions described in Section 3 represent the results of
 533     string matching. For many data differencing applications in which the
 534     changes between source and target data are small, any straightforward
 535     representation of these instructions would be adequate.  However, for
 536     applications including data compression, it is important to encode
 537     these instructions well to achieve good compression rates.  From our
 538     experience, the following observations can be made:
 539
 540     a. The addresses in COPY instructions are locations of matches and
 541        often occur close by or even exactly equal to one another. This is
 542        because data in local regions are often replicated with minor changes.
 543        In turn, this means that coding a newly matched address against some
 544        set of recently matched addresses can be beneficial.
 545
 546     b. The matches are often short in length and separated by small amounts
 547        of unmatched data. That is, the lengths of COPY and ADD instructions
 548        are often small. This is particularly true of binary data such as
 549        executable files or structured data such as HTML or XML. In such cases,
 550        compression can be improved by combining the encoding of the sizes
 551        and the instruction types as well as combining the encoding of adjacent
 552        delta instructions with sufficiently small data sizes.
 553
 554     The below subsections discuss how the Vcdiff data format provides
 555     mechanisms enabling encoders to use the above observations to improve
 556     compression rates.
 557
 558
 559 5.1 Address Encoding Modes of COPY Instructions
 560
 561     As mentioned earlier, addresses of COPY instructions often occur close
 562     to one another or are exactly equal. To take advantage of this phenomenon
 563     and encode addresses of COPY instructions more efficiently, the Vcdiff
 564     data format supports the use of two different types of address caches.
 565     Both the encoder and decoder maintain these caches, so that decoder's
 566     caches remain synchronized with the encoder's caches.
 567
 568     a. A "near" cache is an array with "s_near" slots, each containing an
 569        address used for encoding addresses nearby to previously encoded
 570        addresses (in the positive direction only).  The near cache also
 571        maintains a "next_slot" index to the near cache.  New entries to the
 572        near cache are always inserted in the next_slot index, which maintains
 573        a circular buffer of the s_near most recent addresses.
 574
 575     b. A "same" cache is an array with "s_same" multiple of 256 slots, each
 576        containing an address.  The same cache maintains a hash table of recent
 577        addresses used for repeated encoding of the exact same address.
 578
 579
 580     By default, the parameters s_near and s_same are respectively set to 4
 581     and 3. An encoder MAY modify these values, but then it MUST encode the
 582     new values in the encoding itself, as discussed in Section 7, so that
 583     the decoder can properly set up its own caches.
 584
 585     At the start of processing a target window, an implementation
 586     (encoder or decoder) initializes all of the slots in both caches
 587     to zero.  The next_slot pointer of the near cache is set
 588     to point to slot zero.
 589
 590     Each time a COPY instruction is processed by the encoder or
 591     decoder, the implementation's caches are updated as follows, where
 592     "addr" is the address in the COPY instruction.
 593
 594     a. The slot in the near cache referenced by the next_slot
 595        index is set to addr.  The next_slot index is then incremented
 596        modulo s_near.
 597
 598     b. The slot in the same cache whose index is addr%(s_same*256)
 599        is set to addr. [We use the C notations of % for modulo and
 600        * for multiplication.]
 601
 602
 603 5.2 Example code for maintaining caches
 604
 605     To make clear the above description, below are example cache data
 606     structures and algorithms to initialize and update them:
 607
 608         typedef struct _cache_s
 609         {
 610             int*  near;      /* array of size s_near        */
 611             int   s_near;
 612             int   next_slot; /* the circular index for near */
 613             int*  same;      /* array of size s_same*256    */
 614             int   s_same;
 615         } Cache_t;
 616
 617         cache_init(Cache_t* ka)
 618         {
 619             int   i;
 620
 621             ka->next_slot = 0;
 622             for(i = 0; i < ka->s_near; ++i)
 623                  ka->near[i] = 0;
 624
 625             for(i = 0; i < ka->s_same*256; ++i)
 626                  ka->same[i] = 0;
 627         }
 628
 629         cache_update(Cache_t* ka, int addr)
 630         {
 631             if(ka->s_near > 0)
 632             {   ka->near[ka->next_slot] = addr;
 633                 ka->next_slot = (ka->next_slot + 1) % ka->s_near;
 634             }
 635
 636             if(ka->s_same > 0)
 637                 ka->same[addr % (ka->s_same*256)] = addr;
 638         }
 639
 640
 641 5.3 Encoding of COPY instruction addresses
 642
 643     The address of a COPY instruction is encoded using different modes
 644     depending on the type of cached address used, if any.
 645
 646     Let "addr" be the address of a COPY instruction to be decoded and "here"
 647     be the current location in the target data (i.e., the start of the data
 648     about to be encoded or decoded).  Let near[j] be the jth element in
 649     the near cache, and same[k] be the kth element in the same cache.
 650     Below are the possible address modes:
 651
 652         VCD_SELF: This mode has value 0. The address was encoded by itself
 653             as an integer.
 654
 655         VCD_HERE: This mode has value 1. The address was encoded as
 656             the integer value "here - addr".
 657
 658         Near modes: The "near modes" are in the range [2,s_near+1]. Let m
 659             be the mode of the address encoding. The address was encoded
 660             as the integer value "addr - near[m-2]".
 661
 662         Same modes: The "same modes" are in the range
 663             [s_near+2,s_near+s_same+1]. Let m be the mode of the encoding.
 664             The address was encoded as a single byte b such that
 665             "addr == same[(m - (s_near+2))*256 + b]".
 666
 667
 668 5.3 Example code for encoding and decoding of COPY instruction addresses
 669
 670     We show example algorithms below to demonstrate use of address modes more
 671     clearly. The encoder has freedom to choose address modes, the sample
 672     addr_encode() algorithm merely shows one way of picking the address
 673     mode. The decoding algorithm addr_decode() will uniquely decode addresses
 674     regardless of the encoder's algorithm choice.
 675
 676     Note that the address caches are updated immediately after an address is
 677     encoded or decoded. In this way, the decoder is always synchronized with
 678     the encoder.
 679
 680         int addr_encode(Cache_t* ka, int addr, int here, int* mode)
 681         {
 682             int  i, d, bestd, bestm;
 683
 684             /* Attempt to find the address mode that yields the
 685              * smallest integer value for "d", the encoded address
 686              * value, thereby minimizing the encoded size of the
 687              * address. */
 688
 689             bestd = addr; bestm = VCD_SELF;      /* VCD_SELF == 0 */
 690
 691             if((d = here-addr) < bestd)
 692                 { bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */
 693
 694             for(i = 0; i < ka->s_near; ++i)
 695                 if((d = addr - ka->near[i]) >= 0 && d < bestd)
 696                     { bestd = d; bestm = i+2; }
 697
 698             if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr)
 699                 { bestd = d%256; bestm = ka->s_near + 2 + d/256; }
 700
 701             cache_update(ka,addr);
 702
 703             *mode = bestm; /* this returns the address encoding mode */
 704             return  bestd; /* this returns the encoded address       */
 705         }
 706
 707     Note that the addr_encode() algorithm chooses the best address mode using a
 708     local optimization, but that may not lead to the best encoding efficiency
 709     because different modes lead to different instruction encodings, as    described below.
 710
 711     The functions addrint() and addrbyte() used in addr_decode() obtain from
 712     the "Addresses section for COPYs" (Section 4.3) an integer or a byte,
 713     respectively. These utilities will not be described here.  We simply
 714     recall that an integer is represented as a compact variable-sized string
 715     of bytes as described in Section 2 (i.e., base 128).
 716
 717         int addr_decode(Cache_t* ka, int here, int mode)
 718         {   int  addr, m;
 719
 720             if(mode == VCD_SELF)
 721                  addr = addrint();
 722             else if(mode == VCD_HERE)
 723                  addr = here - addrint();
 724             else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */
 725                  addr = ka->near[m] + addrint();
 726             else /* same cache */
 727             {    m = mode - (2 + ka->s_near);
 728                  addr = ka->same[m*256 + addrbyte()];
 729             }
 730
 731             cache_update(ka, addr);
 732
 733             return addr;
 734         }
 735
 736
 737 5.4 Instruction Codes
 738
 739     As noted, the data sizes associated with delta instructions are often
 740     small. Thus, compression efficiency can be improved by combining the sizes
 741     and instruction types in a single encoding, as well by combining certain
 742     pairs of adjacent delta instructions. Effective choices of when to perform
 743     such combinations depend on many factors including the data being processed
 744     and the string matching algorithm in use. For example, if many COPY
 745     instructions have the same data sizes, it may be worth to encode these
 746     instructions more compactly than others.
 747
 748     The Vcdiff data format is designed so that a decoder does not need to be
 749     aware of the choices made in encoding algorithms. This is achieved with the
 750     notion of an "instruction code table" containing 256 entries. Each entry
 751     defines either a single delta instruction or a pair of instructions that
 752     have been combined.  Note that the code table itself only exists in main
 753     memory, not in the delta file (unless using an application-defined code
 754     table, described in Section 7). The encoded data simply includes the index
 755     of each instruction and, since there are only 256 indices, each index
 756     can be represented as a single byte.
 757
 758     Each instruction code entry contains six fields, each of which
 759     is a single byte with unsigned value:
 760
 761             +-----------------------------------------------+
 762             | inst1 | size1 | mode1 | inst2 | size2 | mode2 |
 763             +-----------------------------------------------+
 764
 765 @@@ could be more compact
 766
 767     Each triple (inst,size,mode) defines a delta instruction. The meanings
 768     of these fields are as follows:
 769
 770     inst: An "inst" field can have one of the four values: NOOP (0), ADD (1),
 771         RUN (2) or COPY (3) to indicate the instruction types. NOOP means
 772         that no instruction is specified. In this case, both the corresponding
 773         size and mode fields will be zero.
 774
 775     size: A "size" field is zero or positive. A value zero means that the
 776         size associated with the instruction is encoded separately as
 777         an integer in the "Instructions and sizes section" (Section 6).
 778         A positive value for "size" defines the actual data size.
 779         Note that since the size is restricted to a byte, the maximum
 780         value for any instruction with size implicitly defined in the code
 781         table is 255.
 782
 783     mode: A "mode" field is significant only when the associated delta
 784         instruction is a COPY. It defines the mode used to encode the
 785         associated addresses. For other instructions, this is always zero.
 786
 787
 788 5.5 The Code Table
 789
 790     Following the discussions on address modes and instruction code tables,
 791     we define a "Code Table" to have the data below:
 792
 793         s_near: the size of the near cache,
 794         s_same: the size of the same cache,
 795         i_code: the 256-entry instruction code table.
 796
 797     Vcdiff itself defines a "default code table" in which s_near is 4
 798     and s_same is 3. Thus, there are 9 address modes for a COPY instruction.
 799     The first two are VCD_SELF (0) and VCD_HERE (1). Modes 2, 3, 4 and 5
 800     are for addresses coded against the near cache. And, modes 6, 7  and 8
 801     are for addresses coded against the same cache.
 802
 803     The default instruction code table is depicted below, in a compact
 804     representation that we use only for descriptive purposes.  See section 7
 805     for the specification of how an instruction code table is represented
 806     in the Vcdiff encoding format.  In the depiction, a zero value for
 807     size indicates that the size is separately coded. The mode of non-COPY
 808     instructions is represented as 0 even though they are not used.
 809
 810
 811          TYPE      SIZE     MODE    TYPE     SIZE     MODE     INDEX
 812         ---------------------------------------------------------------
 813      1.  RUN         0        0     NOOP       0        0        0
 814      2.  ADD    0, [1,17]     0     NOOP       0        0      [1,18]
 815      3.  COPY   0, [4,18]     0     NOOP       0        0     [19,34]
 816      4.  COPY   0, [4,18]     1     NOOP       0        0     [35,50]
 817      5.  COPY   0, [4,18]     2     NOOP       0        0     [51,66]
 818      6.  COPY   0, [4,18]     3     NOOP       0        0     [67,82]
 819      7.  COPY   0, [4,18]     4     NOOP       0        0     [83,98]
 820      8.  COPY   0, [4,18]     5     NOOP       0        0     [99,114]
 821      9.  COPY   0, [4,18]     6     NOOP       0        0    [115,130]
 822     10.  COPY   0, [4,18]     7     NOOP       0        0    [131,146]
 823     11.  COPY   0, [4,18]     8     NOOP       0        0    [147,162]
 824     12.  ADD       [1,4]      0     COPY     [4,6]      0    [163,174]
 825     13.  ADD       [1,4]      0     COPY     [4,6]      1    [175,186]
 826     14.  ADD       [1,4]      0     COPY     [4,6]      2    [187,198]
 827     15.  ADD       [1,4]      0     COPY     [4,6]      3    [199,210]
 828     16.  ADD       [1,4]      0     COPY     [4,6]      4    [211,222]
 829     17.  ADD       [1,4]      0     COPY     [4,6]      5    [223,234]
 830     18.  ADD       [1,4]      0     COPY       4        6    [235,238]
 831     19.  ADD       [1,4]      0     COPY       4        7    [239,242]
 832     20.  ADD       [1,4]      0     COPY       4        8    [243,246]
 833     21.  COPY        4      [0,8]   ADD        1        0    [247,255]
 834         ---------------------------------------------------------------
 835
 836     In the above depiction, each numbered line represents one or more
 837     entries in the actual instruction code table (recall that an entry in
 838     the instruction code table may represent up to two combined delta
 839     instructions.) The last column ("INDEX") shows which index value or
 840     range of index values of the entries covered by that line. The notation
 841     [i,j] means values from i through j, inclusive. The first 6 columns of
 842     a line in the depiction describe the pairs of instructions used for
 843     the corresponding index value(s).
 844
 845     If a line in the depiction includes a column entry using the [i,j]
 846     notation, this means that the line is instantiated for each value
 847     in the range from i to j, inclusive.  The notation "0, [i,j]" means
 848     that the line is instantiated for the value 0 and for each value
 849     in the range from i to j, inclusive.
 850
 851     If a line in the depiction includes more than one entry using the [i,j]
 852     notation, implying a "nested loop" to convert the line to a range of
 853     table entries, the first such [i,j] range specifies the outer loop,
 854     and the second specifies the inner loop.
 855
 856     The below examples should make clear the above description:
 857
 858     Line 1 shows the single RUN instruction with index 0. As the size field
 859     is 0, this RUN instruction always has its actual size encoded separately.
 860
 861     Line 2 shows the 18 single ADD instructions. The ADD instruction with
 862     size field 0 (i.e., the actual size is coded separately) has index 1.
 863     ADD instructions with sizes from 1 to 17 use code indices 2 to 18 and
 864     their sizes are as given (so they will not be separately encoded.)
 865
 866     Following the single ADD instructions are the single COPY instructions
 867     ordered by their address encoding modes. For example, line 11 shows the
 868     COPY instructions with mode 8, i.e., the last of the same cache.
 869     In this case, the COPY instruction with size field 0 has index 147.
 870     Again, the actual size of this instruction will be coded separately.
 871
 872     Lines 12 to 21 show the pairs of instructions that are combined together.
 873     For example, line 12 depicts the 12 entries in which an ADD instruction
 874     is combined with an immediately following COPY instruction. The entries
 875     with indices 163, 164, 165 represent the pairs in which the ADD
 876     instructions all have size 1 while the COPY instructions has mode
 877     0 (VCD_SELF) and sizes 4, 5 and 6 respectively.
 878
 879     The last line, line 21, shows the eight instruction pairs where the first
 880     instruction is a COPY and the second is an ADD. In this case, all COPY
 881     instructions have size 4 with mode ranging from 0 to 8 and all the ADD
 882     instructions have size 1. Thus, the entry with largest index 255
 883     combines a COPY instruction of size 4 and mode 8 with an ADD instruction
 884     of size 1.
 885
 886     The choice of the minimum size 4 for COPY instructions in the default code
 887     table was made from experiments that showed that excluding small matches
 888     (less then 4 bytes long) improved the compression rates.
 889
 890
 891 6. DECODING A TARGET WINDOW
 892
 893     Section 4.3 discusses that the delta instructions and associated data
 894     are encoded in three arrays of bytes:
 895
 896         Data section for ADDs and RUNs,
 897         Instructions and sizes section, and
 898         Addresses section for COPYs.
 899
 900
 901     Further, these data sections may have been further compressed by some
 902     secondary compressor. Assuming that any such compressed data has been
 903     decompressed so that we now have three arrays:
 904
 905         inst: bytes coding the instructions and sizes.
 906         data: unmatched data associated with ADDs and RUNs.
 907         addr: bytes coding the addresses of COPYs.
 908
 909     These arrays are organized as follows:
 910
 911         inst:
 912             a sequence of (index, [size1], [size2]) tuples, where "index"
 913             is an index into the instruction code table, and size1 and size2
 914             are integers that MAY or MAY NOT be included in the tuple as
 915             follows. The entry with the given "index" in the instruction
 916             code table potentially defines two delta instructions. If the
 917             first delta instruction is not a VCD_NOOP and its size is zero,
 918             then size1 MUST be present. Otherwise, size1 MUST be omitted and
 919             the size of the instruction (if it is not VCD_NOOP) is as defined
 920             in the table. The presence or absence of size2 is defined
 921             similarly with respect to the second delta instruction.
 922
 923         data:
 924             a sequence of data values, encoded as bytes.
 925
 926         addr:
 927             a sequence of address values. Addresses are normally encoded as
 928             integers as described in Section 2 (i.e., base 128).
 929             Since the same cache emits addresses in the range [0,255],
 930             however, same cache addresses are always encoded as a
 931             single byte.
 932
 933     To summarize, each tuple in the "inst" array includes an index to some
 934     entry in the instruction code table that determines:
 935
 936     a. Whether one or two instructions were encoded and their types.
 937
 938     b. If the instructions have their sizes encoded separately, these
 939        sizes will follow, in order, in the tuple.
 940
 941     c. If the instructions have accompanying data, i.e., ADDs or RUNs,
 942        their data will be in the array "data".
 943
 944     d. Similarly, if the instructions are COPYs, the coded addresses are
 945        found in the array "addr".
 946
 947     The decoding procedure simply processes the arrays by reading one code
 948     index at a time, looking up the corresponding instruction code entry,
 949     then consuming the respective sizes, data and addresses following the
 950     directions in this entry. In other words, the decoder maintains an implicit
 951     next-element pointer for each array; "consuming" an instruction tuple,
 952     data, or address value implies incrementing the associated pointer.
 953
 954     For example, if during the processing of the target window, the next
 955     unconsumed tuple in the inst array has index value 19, then the first
 956     instruction is a COPY, whose size is found as the immediately following
 957     integer in the inst array.  Since the mode of this COPY instruction is
 958     VCD_SELF, the corresponding address is found by consuming the next
 959     integer in the addr array.  The data array is left intact. As the second
 960     instruction for code index 19 is a NOOP, this tuple is finished.
 961
 962
 963 7. APPLICATION-DEFINED CODE TABLES
 964
 965     Although the default code table used in Vcdiff is good for general
 966     purpose encoders, there are times when other code tables may perform
 967     better. For example, to code a file with many identical segments of data,
 968     it may be advantageous to have a COPY instruction with the specific size
 969     of these data segments so that the instruction can be encoded in a single
 970     byte. Such a special code table MUST then be encoded in the delta file
 971     so that the decoder can reconstruct it before decoding the data.
 972
 973     Vcdiff allows an application-defined code table to be specified
 974     in a delta file with the following data:
 975
 976         Size of near cache            - byte
 977         Size of same cache            - byte
 978         Compressed code table data
 979
 980     The "compressed code table data" encodes the delta between the default
 981     code table (source) and the new code table (target) in the same manner as
 982     described in Section 4.3 for encoding a target window in terms of a
 983     source window. This delta is computed using the following steps:
 984
 985     a.  Convert the new instruction code table into a string, "code", of
 986         1536 bytes using the below steps in order:
 987
 988         i. Add in order the 256 bytes representing the types of the first
 989            instructions in the instruction pairs.
 990        ii. Add in order the 256 bytes representing the types of the second
 991            instructions in the instruction pairs.
 992       iii. Add in order the 256 bytes representing the sizes of the first
 993            instructions in the instruction pairs.
 994        iv. Add in order the 256 bytes representing the sizes of the second
 995            instructions in the instruction pairs.
 996         v. Add in order the 256 bytes representing the modes of the first
 997            instructions in the instruction pairs.
 998        vi. Add in order the 256 bytes representing the modes of the second
 999            instructions in the instruction pairs.
1000
1001     b.  Similarly, convert the default instruction code table into
1002         a string "dflt".
1003
1004     c.  Treat the string "code" as a target window and "dflt" as the
1005         corresponding source data and apply an encoding algorithm to
1006         compute the delta encoding of "code" in terms of "dflt".
1007         This computation MUST use the default code table for encoding
1008         the delta instructions.
1009
1010     The decoder can then reverse the above steps to decode the compressed
1011     table data using the method of Section 6, employing the default code
1012     table, to generate the new code table.  Note that the decoder does not
1013     need to know anything about the details of the encoding algorithm used
1014     in step (c). The decoder is still able to decode the new code table
1015     because the Vcdiff format is independent from the choice of encoding
1016     algorithm, and because the encoder in step (c) uses the known, default
1017     code table.
1018
1019
1020 8. PERFORMANCE
1021
1022     The encoding format is compact. For compression only, using the LZ-77
1023     string parsing strategy and without any secondary compressors, the typical
1024     compression rate is better than Unix compress and close to gzip.  For
1025     differencing, the data format is better than all known methods in
1026     terms of its stated goal, which is primarily decoding speed and
1027     encoding efficiency.
1028
1029     We compare the performance of compress, gzip and Vcdiff using the
1030     archives of three versions of the Gnu C compiler, gcc-2.95.1.tar,
1031     gcc-2.95.2.tar and gcc-2.95.3.tar. The experiments were done on an
1032     SGI-MIPS3, 400MHZ. Gzip was used at its default compression level.
1033     Vcdiff timings were done using the Vcodex/Vcdiff software (Section 13).
1034     As string and window matching typically dominates the computation during
1035     compression, the Vcdiff compression times were directly due to the
1036     algorithms used in the Vcodex/Vcdiff software. However, the decompression
1037     times should be generic and representative of any good implementation
1038     of the Vcdiff data format. Timing was done by running each program
1039     three times and taking the average of the total cpu+system times.
1040
1041     Below are the different Vcdiff runs:
1042
1043         Vcdiff: vcdiff is used as compressor only.
1044
1045         Vcdiff-d: vcdiff is used as a differencer only. That is, it only
1046                 compares target data against source data.  Since the files
1047                 involved are large, they are broken into windows. In this
1048                 case, each target window starting at some file offset in
1049                 the target file is compared against a source window with
1050                 the same file offset (in the source file). The source
1051                 window is also slightly larger than the target window
1052                 to increase matching opportunities. The -d option also gives
1053                 a hint to the string matching algorithm of Vcdiff that
1054                 the two files are very similar with long stretches of matches.
1055                 The algorithm takes advantage of this to minimize its
1056                 processing of source data and save time.
1057
1058         Vcdiff-dc: This is similar to Vcdiff-d but vcdiff can also compare
1059                 target data against target data as applicable. Thus, vcdiff
1060                 both computes differences and compresses data. The windowing
1061                 algorithm is the same as above. However, the above hint is
1062                 recinded in this case.
1063
1064         Vcdiff-dcs: This is similar to Vcdiff-dc but the windowing algorithm
1065                 uses a content-based heuristic to select source data segments
1066                 that are more likely to match with a given target window.
1067                 Thus, the source data segment selected for a target window
1068                 often will not be aligned with the file offsets of this
1069                 target window.
1070
1071
1072                 gcc-2.95.1    gcc-2.95.2    compression   decompression
1073     raw size      55746560      55797760
1074     compress         -          19939390       13.85s         7.09s
1075     gzip             -          12973443       42.99s         5.35s
1076     Vcdiff           -          15358786       20.04s         4.65s
1077     Vcdiff-d         -            100971       10.93s         1.92s
1078     Vcdiff-dc        -             97246       20.03s         1.84s
1079     Vcdiff-dcs       -            256445       44.81s         1.84s
1080
1081                 TABLE 1. Compressing gcc-2.95.2.tar given gcc-2.95.1
1082
1083
1084     TABLE 1 shows the raw sizes of gcc-2.95.1.tar and gcc-2.95.2.tar and the
1085     sizes of the compressed results. As a pure compressor, the compression
1086     rate for Vcdiff is worse than gzip and better than compress. The last
1087     three rows shows that when two file versions are very similar, differencing
1088     can have dramatically good compression rates. Vcdiff-d and Vcdiff-dc use
1089     the same simple window selection method but Vcdiff-dc also does compression
1090     so its output is slightly smaller. Vcdiff-dcs uses a heuristic based on
1091     data content to search for source data that likely will match a given target
1092     window. Although it does a good job, the heuristic did not always find the
1093     best matches which are given by the simple algorithm of Vcdiff-d.  As a
1094     result, the output size is slightly larger. Note also that there is a large
1095     cost in computing matching windows this way. Finally, the compression times
1096     of Vcdiff-d is nearly half of that of Vcdiff-dc. It is tempting to conclude
1097     that the compression feature causes the additional time in Vcdiff-dc
1098     relative to Vcdiff-d.  However, this is not the case. The hint given to
1099     the Vcdiff string matching algorithm that the two files are likely to
1100     have very long stretches of matches helps the algorithm to minimize
1101     processing of the "source data", thus saving half the time. However, as we
1102     shall see below when this hint is wrong, the result is even longer time.
1103
1104
1105                 gcc-2.95.2    gcc-2.95.3    compression   decompression
1106     raw size      55797760      55787520
1107     compress         -          19939453       13.54s         7.00s
1108     gzip             -          12998097       42.63s         5.62s
1109     Vcdiff           -          15371737       20.09s         4.74s
1110     Vcdiff-d         -          26383849       71.41s         6.41s
1111     Vcdiff-dc        -          14461203       42.48s         4.82s
1112     Vcdiff-dcs       -           1248543       61.18s         1.99s
1113
1114                 TABLE 2. Compressing gcc-2.95.3.tar given gcc-2.95.2
1115
1116
1117     TABLE 2 shows the raw sizes of gcc-2.95.2.tar and gcc-2.95.3.tar and
1118     the sizes of the compressed results. In this case, the tar file of
1119     gcc-2.95.3 is rearranged in a way that makes the straightforward method
1120     of matching file offsets for source and target windows fail. As a
1121     result, Vcdiff-d performs rather dismally both in time and output size.
1122     The large time for Vcdiff-d is directly due to fact that the string
1123     matching algorithm has to work much harder to find matches when the hint
1124     that two files have long matching stretches fails to hold. On the other
1125     hand, Vcdiff-dc does both differencing and compression resulting in good
1126     output size. Finally, the window searching heuristic used in Vcdiff-dcs is
1127     effective in finding the right matching source windows for target windows
1128     resulting a small output size. This shows why the data format needs to
1129     have a way to specify matching windows to gain performance. Finally,
1130     we note that the decoding times are always good regardless of how
1131     the string matching or window searching algorithms perform.
1132
1133
1134 9. FURTHER ISSUES
1135
1136     This document does not address a few issues:
1137
1138     Secondary compressors:
1139         As discussed in Section 4.3, certain sections in the delta encoding
1140         of a window may be further compressed by a secondary compressor.
1141         In our experience, the basic Vcdiff format is adequate for most
1142         purposes so that secondary compressors are seldom needed. In
1143         particular, for normal use of data differencing where the files to
1144         be compared have long stretches of matches, much of the gain in
1145         compression rate is already achieved by normal string matching.
1146         Thus, the use of secondary compressors is seldom needed in this case.
1147         However, for applications beyond differencing of such nearly identical
1148         files, secondary compressors may be needed to achieve maximal
1149         compressed results.
1150
1151         Therefore, we recommend to leave the Vcdiff data format defined
1152         as in this document so that the use of secondary compressors
1153         can be implemented when they become needed in the future.
1154         The formats of the compressed data via such compressors or any
1155         compressors that may be defined in the future are left open to
1156         their implementations.  These could include Huffman encoding,
1157         arithmetic encoding, and splay tree encoding [8,9].
1158
1159     Large file system vs. small file system:
1160         As discussed in Section 4, a target window in a large file may be
1161         compared against some source window in another file or in the same
1162         file (from some earlier part). In that case, the file offset of the
1163         source window is specified as a variable-sized integer in the delta
1164         encoding. There is a possibility that the encoding was computed on
1165         a system supporting much larger files than in a system where
1166         the data may be decoded (e.g., 64-bit file systems vs. 32-bit file
1167         systems). In that case, some target data may not be recoverable.
1168         This problem could afflict any compression format, and ought
1169         to be resolved with a generic negotiation mechanism in the
1170         appropriate protocol(s).
1171
1172
1173 10.  SUMMARY
1174
1175     We have described Vcdiff, a general and portable encoding format for
1176     compression and differencing. The format is good in that it allows
1177     implementing a decoder without knowledge of the encoders. Further,
1178     ignoring the use of secondary compressors not defined within the format,
1179     the decoding algorithms runs in linear time and requires working space
1180     proportional to window sizes.
1181
1182
1183
1184 11. ACKNOWLEDGEMENTS
1185
1186     Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff
1187     who provided much encouragement to publicize Vcdiff. In particular, Jeff
1188     helped clarifying the description of the data format presented here.
1189
1190
1191
1192 12. SECURITY CONSIDERATIONS
1193
1194     Vcdiff only provides a format to encode compressed and differenced data.
1195     It does not address any issues concerning how such data are, in fact,
1196     stored in a given file system or the run-time memory of a computer system.
1197     Therefore, we do not anticipate any security issues with respect to Vcdiff.
1198
1199
1200
1201 13. SOURCE CODE AVAILABILITY
1202
1203     Vcdiff is implemented as a data transforming method in Phong Vo's
1204     Vcodex library. AT&T Corp. has made the source code for Vcodex available
1205     for anyone to use to transmit data via HTTP/1.1 Delta Encoding [10,11].
1206     The source code and according license is accessible at the below URL:
1207
1208           http://www.research.att.com/sw/tools
1209
1210
1211 14. INTELLECTUAL PROPERTY RIGHTS
1212
1213    The IETF has been notified of intellectual property rights claimed in
1214    regard to some or all of the specification contained in this
1215    document.  For more information consult the online list of claimed
1216    rights, at <http://www.ietf.org/ipr.html>.
1217
1218    The IETF takes no position regarding the validity or scope of any
1219    intellectual property or other rights that might be claimed to
1220    pertain to the implementation or use of the technology described in
1221    this document or the extent to which any license under such rights
1222    might or might not be available; neither does it represent that it
1223    has made any effort to identify any such rights.  Information on the
1224    IETF's procedures with respect to rights in standards-track and
1225    standards-related documentation can be found in BCP-11.  Copies of
1226    claims of rights made available for publication and any assurances of
1227    licenses to be made available, or the result of an attempt made to
1228    obtain a general license or permission for the use of such
1229    proprietary rights by implementors or users of this specification can
1230    be obtained from the IETF Secretariat.
1231
1232
1233
1234 15. IANA CONSIDERATIONS
1235
1236    The Internet Assigned Numbers Authority (IANA) administers the number
1237    space for Secondary Compressor ID values.  Values and their meaning
1238    must be documented in an RFC or other peer-reviewed, permanent, and
1239    readily available reference, in sufficient detail so that
1240    interoperability between independent implementations is possible.
1241    Subject to these constraints, name assignments are First Come, First
1242    Served - see RFC2434 [13].  Legal ID values are in the range 1..255.
1243
1244    This document does not define any values in this number space.
1245
1246
1247 16. REFERENCES
1248
1249     [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression,
1250         Practical Reusable Unix Software, Editor B. Krishnamurthy,
1251         John Wiley & Sons, Inc., 1995.
1252
1253     [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data
1254         Compression, IEEE Trans. on Information Theory, 23(3):337-343, 1977.
1255
1256     [3] W. Tichy, The String-to-String Correction Problem with Block Moves,
1257         ACM Transactions on Computer Systems, 2(4):309-321, November 1984.
1258
1259     [4] E.M. McCreight, A Space-Economical Suffix Tree Construction
1260         Algorithm, Journal of the ACM, 23:262-272, 1976.
1261
1262     [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms,
1263         IEEE Software Configuration and Maintenance Workshop, 1996.
1264
1265     [6] J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical Analysis,
1266         ACM Trans. on Software Engineering and Methodology, 7:192-214, 1998.
1267
1268     [7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library,
1269         Proc. of the Summer '91 Usenix Conference, 1991.
1270
1271     [8] D. W. Jones, Application of Splay Trees to Data Compression,
1272         CACM, 31(8):996:1007.
1273
1274     [9] M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851-434-1,
1275         M&T Books, New York, NY, 1995.
1276
1277    [10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy,
1278         Potential benefits of delta encoding and data compression for HTTP,
1279         SIGCOMM '97, Cannes, France, 1997.
1280
1281    [11] J.C. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann,
1282         Y. Goland, and A. Van Hoff, Delta Encoding in HTTP,
1283         IETF, draft-mogul-http-delta-10, 2001.
1284
1285    [12] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels,
1286         RFC 2119, March 1997.
1287
1288    [13] T. Narten, H. Alvestrand, Guidelines for Writing an IANA
1289         Considerations Section in RFCs, RFC2434, October 1998.
1290
1291
1292
1293 17. AUTHOR'S ADDRESS
1294
1295     Kiem-Phong Vo (main contact)
1296     AT&T Labs, Room D223
1297     180 Park Avenue
1298     Florham Park, NJ 07932
1299     Email: kpv@research.att.com
1300     Phone: 1 973 360 8630
1301
1302     David G. Korn
1303     AT&T Labs, Room D237
1304     180 Park Avenue
1305     Florham Park, NJ 07932
1306     Email: dgk@research.att.com
1307     Phone: 1 973 360 8602
1308
1309     Jeffrey C. Mogul
1310     Western Research Laboratory
1311     Compaq Computer Corporation
1312     250 University Avenue
1313     Palo Alto, California, 94305, U.S.A.
1314     Email: JeffMogul@acm.org
1315     Phone: 1 650 617 3304 (email preferred)
1316
1317     Joshua P. MacDonald
1318     Computer Science Division
1319     University of California, Berkeley
1320     345 Soda Hall
1321     Berkeley, CA 94720
1322     Email: jmacd@cs.berkeley.edu