doc/html/libarchive_internals.3.html

   1 <!-- Creator     : groff version 1.22.3 -->
   2 <!-- CreationDate: Tue Feb 11 22:58:46 2020 -->
   3 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   4 "http://www.w3.org/TR/html4/loose.dtd">
   5 <html>
   6 <head>
   7 <meta name="generator" content="groff -Thtml, see www.gnu.org">
   8 <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
   9 <meta name="Content-Style" content="text/css">
  10 <style type="text/css">
  11        p       { margin-top: 0; margin-bottom: 0; vertical-align: top }
  12        pre     { margin-top: 0; margin-bottom: 0; vertical-align: top }
  13        table   { margin-top: 0; margin-bottom: 0; vertical-align: top }
  14        h1      { text-align: center }
  15 </style>
   16 <title>LIBARCHIVE_INTERNALS(3)</title>
  17 </head>
  18 <body>
  19
  20 <hr>
  21
  22
  23 <p>LIBARCHIVE_INTERNALS(3) BSD Library Functions Manual
  24 LIBARCHIVE_INTERNALS(3)</p>
  25
  26 <p style="margin-top: 1em"><b>NAME</b></p>
  27
  28 <p style="margin-left:6%;"><b>libarchive_internals</b>
  29 &mdash; description of libarchive internal interfaces</p>
  30
  31 <p style="margin-top: 1em"><b>OVERVIEW</b></p>
  32
  33 <p style="margin-left:6%;">The <b>libarchive</b> library
  34 provides a flexible interface for reading and writing
  35 streaming archive files such as tar and cpio. Internally, it
  36 follows a modular layered design that should make it easy to
  37 add new archive and compression formats.</p>
  38
  39 <p style="margin-top: 1em"><b>GENERAL ARCHITECTURE</b></p>
  40
  41 <p style="margin-left:6%;">Externally, libarchive exposes
  42 most operations through an opaque, object-style interface.
  43 The archive_entry(3) objects store information about a
  44 single filesystem object. The rest of the library provides
  45 facilities to write archive_entry(3) objects to archive
  46 files, read them from archive files, and write them to disk.
  47 (There are plans to add a facility to read archive_entry(3)
  48 objects from disk as well.)</p>
  49
  50 <p style="margin-left:6%; margin-top: 1em">The read and
  51 write APIs each have four layers: a public API layer, a
  52 format layer that understands the archive file format, a
  53 compression layer, and an I/O layer. The I/O layer is
  54 completely exposed to clients who can replace it entirely
  55 with their own functions.</p>
  56
  57 <p style="margin-left:6%; margin-top: 1em">In order to
  58 provide as much consistency as possible for clients, some
  59 public functions are virtualized. Eventually, it should be
  60 possible for clients to open an archive or disk writer, and
  61 then use a single set of code to select and write entries,
  62 regardless of the target.</p>
  63
  64 <p style="margin-top: 1em"><b>READ ARCHITECTURE</b></p>
  65
  66 <p style="margin-left:6%;">From the outside, clients use
  67 the archive_read(3) API to manipulate an <b>archive</b>
  68 object to read entries and bodies from an archive stream.
  69 Internally, the <b>archive</b> object is cast to an
  70 <b>archive_read</b> object, which holds all read-specific
  71 data. The API has four layers: The lowest layer is the I/O
  72 layer. This layer can be overridden by clients, but most
  73 clients use the packaged I/O callbacks provided, for
  74 example, by archive_read_open_memory(3), and
  75 archive_read_open_fd(3). The compression layer calls the I/O
  76 layer to read bytes and decompresses them for the format
  77 layer. The format layer unpacks a stream of uncompressed
  78 bytes and creates <b>archive_entry</b> objects from the
  79 incoming data. The API layer tracks overall state (for
  80 example, it prevents clients from reading data before
  81 reading a header) and invokes the format and compression
  82 layer operations through registered function pointers. In
  83 particular, the API layer drives the format-detection
  84 process: When opening the archive, it reads an initial block
  85 of data and offers it to each registered compression
  86 handler. The one with the highest bid is initialized with
  87 the first block. Similarly, the format handlers are polled
  88 to see which handler is the best for each archive. (Prior to
  89 2.4.0, the format bidders were invoked for each entry, but
  90 this design hindered error recovery.)</p>
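<p style="margin-left:6%; margin-top: 1em">The bidding step
described above can be sketched as a loop over the registered
handlers. This is a minimal illustration, not libarchive's code:
<b>struct bidder</b>, <b>pick_best</b>(), and the two sample bid
functions are hypothetical names invented for this example.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler record; libarchive's real structures differ. */
struct bidder {
    const char *name;
    /* Returns 0 to decline, or a positive confidence score. */
    int (*bid)(const unsigned char *buf, size_t len);
};

/* Sample bidder: accepts anything non-empty with the weakest possible bid. */
static int
bid_passthrough(const unsigned char *buf, size_t len)
{
    (void)buf;
    return len > 0 ? 1 : 0;
}

/* Sample bidder: checks two magic bytes and bids the 16 bits it verified. */
static int
bid_two_byte_magic(const unsigned char *buf, size_t len)
{
    return (len >= 2 && buf[0] == 0x1f && buf[1] == 0x8b) ? 16 : 0;
}

/* Offer the first block to every handler and keep the highest bid.
 * Returns NULL when no handler bids above zero. */
static const struct bidder *
pick_best(const struct bidder *handlers, size_t count,
    const unsigned char *buf, size_t len)
{
    const struct bidder *best = NULL;
    int best_bid = 0;
    size_t i;

    assert(handlers != NULL || count == 0);
    for (i = 0; i < count; i++) {
        int b = handlers[i].bid(buf, len);
        if (b > best_bid) {
            best_bid = b;
            best = &handlers[i];
        }
    }
    return best;
}
```

<p style="margin-left:6%; margin-top: 1em">Because a bid of
zero never beats the initial <i>best_bid</i>, a handler that
declines can never be selected.</p>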
  91
  92 <p style="margin-left:6%; margin-top: 1em"><b>I/O Layer and
  93 Client Callbacks</b> <br>
  94 The read API goes to some lengths to be nice to clients. As
  95 a result, there are few restrictions on the behavior of the
  96 client callbacks.</p>
  97
  98 <p style="margin-left:6%; margin-top: 1em">The client read
  99 callback is expected to provide a block of data on each
 100 call. A zero-length return does indicate end of file, but
 101 otherwise blocks may be as small as one byte or as large as
 102 the entire file. In particular, blocks may be of different
 103 sizes.</p>
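<p style="margin-left:6%; margin-top: 1em">As a rough
illustration of these rules, the sketch below hands out an
in-memory buffer in blocks of changing size and returns zero
at end of file. The function is shaped like an archive_read(3)
read callback, but the example is self-contained: it uses
<i>long</i> in place of the library&rsquo;s return type, and
<b>struct mem_source</b> and <b>mem_read</b>() are hypothetical
names.</p>

```c
#include <assert.h>
#include <stddef.h>

struct archive;  /* opaque; stands in for libarchive's type */

/* Hypothetical client state: an in-memory source handed out in
 * blocks whose size changes from call to call. */
struct mem_source {
    const unsigned char *data;
    size_t size;
    size_t pos;
    size_t next_block;  /* size of the next block to hand out */
};

/* Point *buffer at the next block and return its length.
 * A return of 0 signals end of file. */
static long
mem_read(struct archive *a, void *client_data, const void **buffer)
{
    struct mem_source *src = client_data;
    size_t avail, n;

    (void)a;
    assert(src != NULL && src->pos <= src->size);
    assert(src->next_block > 0);
    avail = src->size - src->pos;
    n = src->next_block < avail ? src->next_block : avail;
    *buffer = src->data + src->pos;
    src->pos += n;
    src->next_block *= 2;  /* blocks need not be the same size */
    return (long)n;
}
```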
 104
 105 <p style="margin-left:6%; margin-top: 1em">The client skip
 106 callback returns the number of bytes actually skipped, which
 107 may be much smaller than the skip requested. The only
  108 requirement is that the skip not exceed the request. In particular,
 109 clients are allowed to return zero for any skip that they
 110 don&rsquo;t want to handle. The skip callback must never be
 111 invoked with a negative value.</p>
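<p style="margin-left:6%; margin-top: 1em">A skip callback
that follows this contract might look like the sketch below.
It is self-contained and only mirrors the shape of the real
callback; <b>struct seekable_source</b>, its <i>max_skip</i>
field, and <b>source_skip</b>() are assumptions made for the
example.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct archive;  /* opaque; stands in for libarchive's type */

/* Hypothetical client state for a source that can seek forward. */
struct seekable_source {
    int64_t size;
    int64_t pos;
    int64_t max_skip;  /* largest skip the client will do; 0 = never skip */
};

/* Skip up to `request` bytes and return the number actually skipped.
 * The result may be smaller than the request (even zero), never larger. */
static int64_t
source_skip(struct archive *a, void *client_data, int64_t request)
{
    struct seekable_source *src = client_data;
    int64_t n = request;

    (void)a;
    assert(src != NULL);
    assert(request >= 0);  /* callers never pass a negative value */
    if (n > src->max_skip)
        n = src->max_skip;
    if (n > src->size - src->pos)
        n = src->size - src->pos;
    src->pos += n;
    return n;
}
```

<p style="margin-left:6%; margin-top: 1em">A client that
cannot skip at all would simply return zero for every
request.</p>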
 112
 113 <p style="margin-left:6%; margin-top: 1em">Keep in mind
 114 that not all clients are reading from disk: clients reading
 115 from networks may provide different-sized blocks on every
 116 request and cannot skip at all; advanced clients may use
 117 mmap(2) to read the entire file into memory at once and
 118 return the entire file to libarchive as a single block;
 119 other clients may begin asynchronous I/O operations for the
 120 next block on each request.</p>
 121
 122
  123 <p style="margin-left:6%; margin-top: 1em"><b>Decompression
 124 Layer</b> <br>
 125 The decompression layer not only handles decompression, it
 126 also buffers data so that the format handlers see a much
  127 nicer I/O model. The decompression API is a two-stage
 128 peek/consume model. A read_ahead request specifies a minimum
 129 read amount; the decompression layer must provide a pointer
 130 to at least that much data. If more data is immediately
 131 available, it should return more: the format layer handles
 132 bulk data reads by asking for a minimum of one byte and then
 133 copying as much data as is available.</p>
 134
 135 <p style="margin-left:6%; margin-top: 1em">A subsequent
 136 call to the <b>consume</b>() function advances the read
 137 pointer. Note that data returned from a <b>read_ahead</b>()
 138 call is guaranteed to remain in place until the next call to
 139 <b>read_ahead</b>(). Intervening calls to <b>consume</b>()
 140 should not cause the data to move.</p>
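<p style="margin-left:6%; margin-top: 1em">The peek/consume
model can be illustrated with a small self-contained sketch.
It assumes the whole decompressed stream already sits in one
buffer, which sidesteps refilling; <b>struct peek_stream</b>,
<b>stream_read_ahead</b>(), and <b>stream_consume</b>() are
hypothetical names, not the internal libarchive functions.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffered stream over already-decompressed bytes. */
struct peek_stream {
    const unsigned char *buf;
    size_t len;
    size_t pos;
};

/* Peek: return a pointer to at least `min` bytes without advancing,
 * and report in *avail how many bytes are really available.
 * Returns NULL if fewer than `min` bytes remain. */
static const unsigned char *
stream_read_ahead(struct peek_stream *s, size_t min, size_t *avail)
{
    assert(s != NULL && avail != NULL && s->pos <= s->len);
    *avail = s->len - s->pos;
    if (*avail < min)
        return NULL;
    return s->buf + s->pos;
}

/* Consume: advance the read position. Bytes behind a pointer
 * returned earlier stay where they are. */
static void
stream_consume(struct peek_stream *s, size_t n)
{
    assert(s != NULL && n <= s->len - s->pos);
    s->pos += n;
}
```

<p style="margin-left:6%; margin-top: 1em">A caller that
wants bulk data asks for a minimum of one byte, copies up to
<i>avail</i> bytes, and then consumes exactly what it
used.</p>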
 141
 142 <p style="margin-left:6%; margin-top: 1em">Skip requests
 143 must always be handled exactly. Decompression handlers that
 144 cannot seek forward should not register a skip handler; the
 145 API layer fills in a generic skip handler that reads and
 146 discards data.</p>
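<p style="margin-left:6%; margin-top: 1em">A generic
read-and-discard skip of the kind described here could be
sketched as follows. The forward-only source is simulated
with counters, and <b>struct fwd_source</b>, <b>read_some</b>(),
and <b>generic_skip</b>() are hypothetical names rather than
the handler libarchive installs.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical forward-only source that yields at most `chunk`
 * bytes per read. */
struct fwd_source {
    int64_t remaining;
    int64_t chunk;
};

/* Read and throw away up to `max` bytes; returns 0 at end of input. */
static int64_t
read_some(struct fwd_source *s, int64_t max)
{
    int64_t n = s->chunk < s->remaining ? s->chunk : s->remaining;

    if (n > max)
        n = max;
    s->remaining -= n;
    return n;
}

/* Skip exactly `request` bytes by reading and discarding. The result
 * is smaller than the request only if the input ends early. */
static int64_t
generic_skip(struct fwd_source *s, int64_t request)
{
    int64_t skipped = 0;

    assert(s != NULL && request >= 0);
    while (skipped < request) {
        int64_t n = read_some(s, request - skipped);
        if (n == 0)
            break;  /* truncated input */
        skipped += n;
    }
    return skipped;
}
```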
 147
 148 <p style="margin-left:6%; margin-top: 1em">A decompression
 149 handler has a specific lifecycle:</p>
 150
 151 <p>Registration/Configuration</p>
 152
 153 <p style="margin-left:17%;">When the client invokes the
 154 public support function, the decompression handler invokes
 155 the internal <b>__archive_read_register_compression</b>()
 156 function to provide bid and initialization functions. This
 157 function returns <b>NULL</b> on error or else a pointer to a
 158 <b>struct decompressor_t</b>. This structure contains a
 159 <i>void * config</i> slot that can be used for storing any
 160 customization information.</p>
 161
 162 <p>Bid</p>
 163
 164 <p style="margin-left:17%; margin-top: 1em">The bid
 165 function is invoked with a pointer and size of a block of
 166 data. The decompressor can access its config data through
 167 the <i>decompressor</i> element of the <b>archive_read</b>
 168 object. The bid function is otherwise stateless. In
 169 particular, it must not perform any I/O operations.</p>
 170
 171 <p style="margin-left:17%; margin-top: 1em">The value
 172 returned by the bid function indicates its suitability for
 173 handling this data stream. A bid of zero will ensure that
 174 this decompressor is never invoked. Return zero if magic
 175 number checks fail. Otherwise, your initial implementation
 176 should return the number of bits actually checked. For
 177 example, if you verify two full bytes and three bits of
 178 another byte, bid 19. Note that the initial block may be
 179 very short; be careful to only inspect the data you are
 180 given. (The current decompressors require two bytes for
 181 correct bidding.)</p>
 182
 183 <p>Initialize</p>
 184
 185 <p style="margin-left:17%;">The winning bidder will have
 186 its init function called. This function should initialize
 187 the remaining slots of the <i>struct decompressor_t</i>
 188 object pointed to by the <i>decompressor</i> element of the
 189 <i>archive_read</i> object. In particular, it should
 190 allocate any working data it needs in the <i>data</i> slot
 191 of that structure. The init function is called with the
 192 block of data that was used for tasting. At this point, the
 193 decompressor is responsible for all I/O requests to the
 194 client callbacks. The decompressor is free to read more data
 195 as and when necessary.</p>
 196
 197 <p>Satisfy I/O requests</p>
 198
 199 <p style="margin-left:17%;">The format handler will invoke
 200 the <i>read_ahead</i>, <i>consume</i>, and <i>skip</i>
 201 functions as needed.</p>
 202
 203 <p>Finish</p>
 204
 205 <p style="margin-left:17%; margin-top: 1em">The finish
 206 method is called only once when the archive is closed. It
 207 should release anything stored in the <i>data</i> and
 208 <i>config</i> slots of the <i>decompressor</i> object. It
 209 should not invoke the client close callback.</p>
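<p style="margin-left:6%; margin-top: 1em">The bid rule in
the lifecycle above (return the number of bits actually
checked) can be illustrated with a gzip-style check. The
magic bytes 0x1f 0x8b and the three reserved flag bits come
from the gzip file format (RFC 1952); the function itself,
<b>gzip_like_bid</b>(), is a hypothetical sketch and not the
bidder shipped with libarchive.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Stateless bid: inspects only the bytes it was given and performs
 * no I/O. Returns 0 to decline, else the number of bits verified. */
static int
gzip_like_bid(const unsigned char *buf, size_t len)
{
    int bits = 0;

    assert(buf != NULL || len == 0);
    if (len < 2)
        return 0;  /* too short to check the magic number */
    if (buf[0] != 0x1f || buf[1] != 0x8b)
        return 0;  /* magic number mismatch */
    bits += 16;    /* two full bytes verified */

    if (len >= 4) {
        /* The top three bits of the flag byte are reserved and
         * must be zero. */
        if ((buf[3] & 0xE0) != 0)
            return 0;
        bits += 3;  /* three more bits verified: 19 in total */
    }
    return bits;
}
```

<p style="margin-left:6%; margin-top: 1em">With a four-byte
block whose flag byte is valid this bids 19; with only the
two magic bytes available it bids 16.</p>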
 210
 211 <p style="margin-left:6%; margin-top: 1em"><b>Format
 212 Layer</b> <br>
 213 The read formats have a similar lifecycle to the
 214 decompression handlers:</p>
 215
 216 <p>Registration</p>
 217
 218 <p style="margin-left:17%;">Allocate your private data and
 219 initialize your pointers.</p>
 220
 221 <p>Bid</p>
 222
 223 <p style="margin-left:17%; margin-top: 1em">Formats bid by
 224 invoking the <b>read_ahead</b>() decompression method but
 225 not calling the <b>consume</b>() method. This allows each
 226 bidder to look ahead in the input stream. Bidders should not
  227 look further ahead than necessary, as long look-aheads put
  228 pressure on the decompression layer to buffer lots of data.
  229 Most formats only require a few hundred bytes of look-ahead;
  230 look-aheads of a few kilobytes are reasonable. (The ISO9660
 231 reader sometimes looks ahead by 48k, which should be
 232 considered an upper limit.)</p>
 233
 234 <p>Read header</p>
 235
 236 <p style="margin-left:17%;">The header read is usually the
 237 most complex part of any format. There are a few strategies
 238 worth mentioning: For formats such as tar or cpio, reading
 239 and parsing the header is straightforward since headers
 240 alternate with data. For formats that store all header data
 241 at the beginning of the file, the first header read request
 242 may have to read all headers into memory and store that
 243 data, sorted by the location of the file data. Subsequent
 244 header read requests will skip forward to the beginning of
 245 the file data and return the corresponding header.</p>
 246
 247 <p>Read Data</p>
 248
 249 <p style="margin-left:17%;">The read data interface
 250 supports sparse files; this requires that each call return a
 251 block of data specifying the file offset and size. This may
 252 require you to carefully track the location so that you can
 253 return accurate file offsets for each read. Remember that
 254 the decompressor will return as much data as it has.
 255 Generally, you will want to request one byte, examine the
 256 return value to see how much data is available, and possibly
 257 trim that to the amount you can use. You should invoke
 258 consume for each block just before you return it.</p>
 259
 260 <p>Skip All Data</p>
 261
 262 <p style="margin-left:17%;">The skip data call should skip
 263 over all file data and trailing padding. This is called
 264 automatically by the API layer just before each header read.
 265 It is also called in response to the client calling the
 266 public <b>data_skip</b>() function.</p>
 267
 268 <p>Cleanup</p>
 269
 270 <p style="margin-left:17%;">On cleanup, the format should
 271 release all of its allocated memory.</p>
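<p style="margin-left:6%; margin-top: 1em">The &ldquo;Read
Data&rdquo; step above (take what the lower layer has ready,
trim it to the current entry, report the file offset, then
consume) might be sketched like this. The lower layer is
simulated by a buffer that exposes at most <i>ready</i> bytes
per peek; every name here is hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lower layer: exposes at most `ready` bytes per peek. */
struct peek_stream {
    const unsigned char *buf;
    size_t len;
    size_t pos;
    size_t ready;
};

struct entry_reader {
    struct peek_stream in;
    int64_t entry_remaining;  /* bytes of file data left in this entry */
    int64_t entry_offset;     /* file offset of the next block */
};

/* Return one block of entry data through *block and *offset.
 * The return value is the block size; 0 means end of entry data. */
static size_t
entry_read_block(struct entry_reader *r, const unsigned char **block,
    int64_t *offset)
{
    size_t avail;

    assert(r != NULL && r->in.pos <= r->in.len);
    assert(r->entry_remaining >= 0);
    avail = r->in.len - r->in.pos;      /* what is left in the input */
    if (avail > r->in.ready)
        avail = r->in.ready;            /* what is ready right now */
    if ((int64_t)avail > r->entry_remaining)
        avail = (size_t)r->entry_remaining;  /* trim to this entry */

    *block = r->in.buf + r->in.pos;
    *offset = r->entry_offset;

    r->in.pos += avail;                 /* consume just before returning */
    r->entry_offset += (int64_t)avail;
    r->entry_remaining -= (int64_t)avail;
    return avail;
}
```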
 272
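<p style="margin-left:6%; margin-top: 1em">For the
&ldquo;Skip All Data&rdquo; step, the amount to skip is the
unread file data plus any trailing padding. The sketch below
assumes a format that pads each entry body to a fixed block
size, as tar does with 512-byte blocks;
<b>bytes_to_next_header</b>() is a hypothetical helper.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Bytes between the current read position and the next header:
 * the unread part of the entry plus padding up to the next block
 * boundary. */
static int64_t
bytes_to_next_header(int64_t entry_size, int64_t already_read,
    int64_t block_size)
{
    int64_t padded;

    assert(block_size > 0);
    assert(already_read >= 0 && already_read <= entry_size);
    /* Round the entry size up to a whole number of blocks. */
    padded = (entry_size + block_size - 1) / block_size * block_size;
    return padded - already_read;
}
```

<p style="margin-left:6%; margin-top: 1em">For a 1000-byte
entry in 512-byte blocks, a reader that has consumed all 1000
bytes still has 24 bytes of padding to skip.</p>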
 273 <p style="margin-left:6%; margin-top: 1em"><b>API Layer</b>
 274 <br>
 275 XXX to do XXX</p>
 276
 277 <p style="margin-top: 1em"><b>WRITE ARCHITECTURE</b></p>
 278
 279 <p style="margin-left:6%;">The write API has a similar set
 280 of four layers: an API layer, a format layer, a compression
 281 layer, and an I/O layer. The registration here is much
 282 simpler because only one format and one compression can be
 283 registered at a time.</p>
 284
 285 <p style="margin-left:6%; margin-top: 1em"><b>I/O Layer and
 286 Client Callbacks</b> <br>
 287 XXX To be written XXX</p>
 288
 289 <p style="margin-left:6%; margin-top: 1em"><b>Compression
 290 Layer</b> <br>
 291 XXX To be written XXX</p>
 292
 293 <p style="margin-left:6%; margin-top: 1em"><b>Format
 294 Layer</b> <br>
 295 XXX To be written XXX</p>
 296
 297 <p style="margin-left:6%; margin-top: 1em"><b>API Layer</b>
 298 <br>
 299 XXX To be written XXX</p>
 300
 301 <p style="margin-top: 1em"><b>WRITE_DISK
 302 ARCHITECTURE</b></p>
 303
 304 <p style="margin-left:6%;">The write_disk API is intended
 305 to look just like the write API to clients. Since it does
 306 not handle multiple formats or compression, it is not
 307 layered internally.</p>
 308
 309 <p style="margin-top: 1em"><b>GENERAL SERVICES</b></p>
 310
 311 <p style="margin-left:6%;">The <b>archive_read</b>,
 312 <b>archive_write</b>, and <b>archive_write_disk</b> objects
 313 all contain an initial <b>archive</b> object which provides
 314 common support for a set of standard services. (Recall that
 315 ANSI/ISO C90 guarantees that you can cast freely between a
 316 pointer to a structure and a pointer to the first element of
 317 that structure.) The <b>archive</b> object has a magic value
 318 that indicates which API this object is associated with,
 319 slots for storing error information, and function pointers
 320 for virtualized API functions.</p>
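<p style="margin-left:6%; margin-top: 1em">The first-member
cast and the magic-value check can be shown in a few lines.
The structure names, fields, and magic constants below are
invented for illustration and do not match libarchive&rsquo;s
real definitions.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical magic values; libarchive's real constants differ. */
#define MAGIC_READ  0x1001u
#define MAGIC_WRITE 0x1002u

/* Common part shared by every object type. */
struct base {
    unsigned magic;   /* identifies which API owns this object */
    int last_error;   /* slot for error information */
};

struct reader {
    struct base base;  /* must be the first member */
    size_t bytes_read;
};

/* Recover the read-specific object from the generic pointer, checking
 * the magic value first. Returns NULL and records an error if the
 * object belongs to a different API. */
static struct reader *
as_reader(struct base *b)
{
    assert(b != NULL);
    if (b->magic != MAGIC_READ) {
        b->last_error = -1;
        return NULL;
    }
    /* Valid because `base` is the first member of struct reader. */
    return (struct reader *)b;
}
```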
 321
 322 <p style="margin-top: 1em"><b>MISCELLANEOUS NOTES</b></p>
 323
 324 <p style="margin-left:6%;">Connecting existing archiving
 325 libraries into libarchive is generally quite difficult. In
 326 particular, many existing libraries strongly assume that you
 327 are reading from a file; they seek forwards and backwards as
 328 necessary to locate various pieces of information. In
 329 contrast, libarchive never seeks backwards in its input,
 330 which sometimes requires very different approaches.</p>
 331
 332 <p style="margin-left:6%; margin-top: 1em">For example,
 333 libarchive&rsquo;s ISO9660 support operates very differently
 334 from most ISO9660 readers. The libarchive support utilizes a
 335 work-queue design that keeps a list of known entries sorted
 336 by their location in the input. Whenever libarchive&rsquo;s
  337 ISO9660 implementation is asked for the next header, it checks
 338 this list to find the next item on the disk. Directories are
 339 parsed when they are encountered and new items are added to
 340 the list. This design relies heavily on the ISO9660 image
 341 being optimized so that directories always occur earlier on
 342 the disk than the files they describe.</p>
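<p style="margin-left:6%; margin-top: 1em">A work queue
ordered by input location, in the spirit of the design just
described, could be sketched with a sorted linked list. This
is not the ISO9660 reader&rsquo;s actual data structure;
<b>struct pending</b>, <b>queue_add</b>(), and
<b>queue_next</b>() are hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for an entry that is known but not yet read. */
struct pending {
    int64_t offset;        /* location of the item in the input */
    const char *name;
    struct pending *next;
};

/* Insert an item, keeping the list in ascending offset order. */
static void
queue_add(struct pending **head, struct pending *item)
{
    assert(head != NULL && item != NULL);
    while (*head != NULL && (*head)->offset <= item->offset)
        head = &(*head)->next;
    item->next = *head;
    *head = item;
}

/* Remove and return the item closest to the start of the input,
 * or NULL when the queue is empty. */
static struct pending *
queue_next(struct pending **head)
{
    struct pending *first;

    assert(head != NULL);
    first = *head;
    if (first != NULL)
        *head = first->next;
    return first;
}
```

<p style="margin-left:6%; margin-top: 1em">Because the next
item is always the one with the lowest offset, the reader only
ever moves forward through the input.</p>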
 343
 344 <p style="margin-left:6%; margin-top: 1em">Depending on the
 345 specific format, such approaches may not be possible. The
 346 ZIP format specification, for example, allows archivers to
 347 store key information only at the end of the file. In
 348 theory, it is possible to create ZIP archives that cannot be
 349 read without seeking. Fortunately, such archives are very
 350 rare, and libarchive can read most ZIP archives, though it
 351 cannot always extract as much information as a dedicated ZIP
 352 program.</p>
 353
 354 <p style="margin-top: 1em"><b>SEE ALSO</b></p>
 355
 356 <p style="margin-left:6%;">archive_entry(3),
 357 archive_read(3), archive_write(3), archive_write_disk(3),
 358 libarchive(3)</p>
 359
 360 <p style="margin-top: 1em"><b>HISTORY</b></p>
 361
 362 <p style="margin-left:6%;">The <b>libarchive</b> library
 363 first appeared in FreeBSD&nbsp;5.3.</p>
 364
 365 <p style="margin-top: 1em"><b>AUTHORS</b></p>
 366
 367 <p style="margin-left:6%;">The <b>libarchive</b> library
 368 was written by Tim Kientzle &lt;kientzle@acm.org&gt;.</p>
 369
 370 <p style="margin-left:6%; margin-top: 1em">BSD
 371 January&nbsp;26, 2011 BSD</p>
 372 <hr>
 373 </body>
 374 </html>