doc/01-introduction.tex

   1 % -*- mode: latex; TeX-master: "Vorbis_I_spec"; -*-
   2 %!TEX root = Vorbis_I_spec.tex
   3 \section{Introduction and Description} \label{vorbis:spec:intro}
   4
   5 \subsection{Overview}
   6
   7 This document provides a high level description of the Vorbis codec's
   8 construction.  A bit-by-bit specification appears beginning in
   9 \xref{vorbis:spec:codec}.
  10 The later sections assume a high-level
  11 understanding of the Vorbis decode process, which is
  12 provided here.
  13
  14 \subsubsection{Application}
  15 Vorbis is a general purpose perceptual audio CODEC intended to allow
  16 maximum encoder flexibility, thus allowing it to scale competitively
  17 over an exceptionally wide range of bitrates.  At the high
  18 quality/bitrate end of the scale (CD or DAT rate stereo, 16/24 bits)
  19 it is in the same league as MPEG-2 and MPC.  Similarly, the 1.0
  20 encoder can encode high-quality CD and DAT rate stereo at below 48kbps
  21 without resampling to a lower rate.  Vorbis is also intended for
  22 lower and higher sample rates (from 8kHz telephony to 192kHz digital
  23 masters) and a range of channel representations (monaural,
  24 polyphonic, stereo, quadraphonic, 5.1, ambisonic, or up to 255
  25 discrete channels).
  26
  27
  28 \subsubsection{Classification}
  29 Vorbis I is a forward-adaptive monolithic transform CODEC based on the
  30 Modified Discrete Cosine Transform.  The codec is structured to allow
  31 addition of a hybrid wavelet filterbank in Vorbis II to offer better
  32 transient response and reproduction using a transform better suited to
  33 localized time events.
  34
  35
  36 \subsubsection{Assumptions}
  37
  38 The Vorbis CODEC design assumes a complex, psychoacoustically-aware
  39 encoder and simple, low-complexity decoder. Vorbis decode is
  40 computationally simpler than mp3, although it does require more
  41 working memory as Vorbis has no static probability model; the vector
  42 codebooks used in the first stage of decoding from the bitstream are
  43 packed in their entirety into the Vorbis bitstream headers. In
  44 packed form, these codebooks occupy only a few kilobytes; the extent
  45 to which they are pre-decoded into a cache is the dominant factor in
  46 decoder memory usage.
  47
  48
  49 Vorbis provides none of its own framing, synchronization or protection
  50 against errors; it is solely a method of accepting input audio,
  51 dividing it into individual frames and compressing these frames into
  52 raw, unformatted 'packets'. The decoder then accepts these raw
  53 packets in sequence, decodes them, synthesizes audio frames from
  54 them, and reassembles the frames into a facsimile of the original
  55 audio stream. Vorbis is a free-form variable bit rate (VBR) codec and packets have no
  56 minimum size, maximum size, or fixed/expected size.  Packets
  57 are designed that they may be truncated (or padded) and remain
  58 decodable; this is not to be considered an error condition and is used
  59 extensively in bitrate management in peeling.  Both the transport
  60 mechanism and decoder must allow that a packet may be any size, or
  61 end before or after packet decode expects.
  62
  63 Vorbis packets are thus intended to be used with a transport mechanism
  64 that provides free-form framing, sync, positioning and error correction
  65 in accordance with these design assumptions, such as Ogg (for file
  66 transport) or RTP (for network multicast).  For purposes of a few
  67 examples in this document, we will assume that Vorbis is to be
  68 embedded in an Ogg stream specifically, although this is by no means a
  69 requirement or fundamental assumption in the Vorbis design.
  70
  71 The specification for embedding Vorbis into
  72 an Ogg transport stream is in \xref{vorbis:over:ogg}.
  73
  74
  75
  76 \subsubsection{Codec Setup and Probability Model}
  77
  78 Vorbis' heritage is as a research CODEC and its current design
  79 reflects a desire to allow multiple decades of continuous encoder
  80 improvement before running out of room within the codec specification.
  81 For these reasons, configurable aspects of codec setup intentionally
  82 lean toward the extreme of forward adaptive.
  83
  84 The single most controversial design decision in Vorbis (and the most
  85 unusual for a Vorbis developer to keep in mind) is that the entire
  86 probability model of the codec, the Huffman and VQ codebooks, is
  87 packed into the bitstream header along with extensive CODEC setup
  88 parameters (often several hundred fields).  This makes it impossible,
  89 as it would be with MPEG audio layers, to embed a simple frame type
  90 flag in each audio packet, or begin decode at any frame in the stream
  91 without having previously fetched the codec setup header.
  92
  93
  94 \begin{note}
  95 Vorbis \emph{can} initiate decode at any arbitrary packet within a
  96 bitstream so long as the codec has been initialized/setup with the
  97 setup headers.
  98 \end{note}
  99
 100 Thus, Vorbis headers are both required for decode to begin and
 101 relatively large as bitstream headers go.  The header size is
 102 unbounded, although for streaming a rule-of-thumb of 4kB or less is
 103 recommended (and Xiph.Org's Vorbis encoder follows this suggestion).
 104
 105 Our own design work indicates the primary liability of the
 106 required header is in mindshare; it is an unusual design and thus
 107 causes some amount of complaint among engineers as this runs against
 108 current design trends (and also points out limitations in some
 109 existing software/interface designs, such as Windows' ACM codec
 110 framework).  However, we find that it does not fundamentally limit
 111 Vorbis' suitable application space.
 112
 113
 114 \subsubsection{Format Specification}
 115 The Vorbis format is well-defined by its decode specification; any
 116 encoder that produces packets that are correctly decoded by the
 117 reference Vorbis decoder described below may be considered a proper
 118 Vorbis encoder.  A decoder must faithfully and completely implement
 119 the specification defined below (except where noted) to be considered
 120 a proper Vorbis decoder.
 121
 122 \subsubsection{Hardware Profile}
 123 Although Vorbis decode is computationally simple, it may still run
 124 into specific limitations of an embedded design.  For this reason,
 125 embedded designs are allowed to deviate in limited ways from the
 126 `full' decode specification yet still be certified compliant.  These
 127 optional omissions are labelled in the spec where relevant.
 128
 129
 130 \subsection{Decoder Configuration}
 131
 132 Decoder setup consists of configuration of multiple, self-contained
 133 component abstractions that perform specific functions in the decode
 134 pipeline.  Each different component instance of a specific type is
 135 semantically interchangeable; decoder configuration consists both of
 136 internal component configuration, as well as arrangement of specific
 137 instances into a decode pipeline.  Componentry arrangement is roughly
 138 as follows:
 139
 140 \begin{center}
 141 \includegraphics[width=\textwidth]{components}
 142 \captionof{figure}{decoder pipeline configuration}
 143 \end{center}
 144
 145 \subsubsection{Global Config}
 146 Global codec configuration consists of a few audio related fields
 147 (sample rate, channels), Vorbis version (always '0' in Vorbis I),
 148 bitrate hints, and the lists of component instances.  All other
 149 configuration is in the context of specific components.
 150
 151 \subsubsection{Mode}
 152
 153 Each Vorbis frame is coded according to a master 'mode'.  A bitstream
 154 may use one or many modes.
 155
 156 The mode mechanism is used to encode a frame according to one of
 157 multiple possible methods with the intention of choosing a method best
 158 suited to that frame.  Different modes are, e.g. how frame size
 159 is changed from frame to frame. The mode number of a frame serves as a
 160 top level configuration switch for all other specific aspects of frame
 161 decode.
 162
 163 A 'mode' configuration consists of a frame size setting, window type
 164 (always 0, the Vorbis window, in Vorbis I), transform type (always
 165 type 0, the MDCT, in Vorbis I) and a mapping number.  The mapping
 166 number specifies which mapping configuration instance to use for
 167 low-level packet decode and synthesis.
 168
 169
 170 \subsubsection{Mapping}
 171
 172 A mapping contains a channel coupling description and a list of
 173 'submaps' that bundle sets of channel vectors together for grouped
 174 encoding and decoding. These submaps are not references to external
 175 components; the submap list is internal and specific to a mapping.
 176
 177 A 'submap' is a configuration/grouping that applies to a subset of
 178 floor and residue vectors within a mapping.  The submap functions as a
 179 last layer of indirection such that specific special floor or residue
 180 settings can be applied not only to all the vectors in a given mode,
 181 but also specific vectors in a specific mode.  Each submap specifies
 182 the proper floor and residue instance number to use for decoding that
 183 submap's spectral floor and spectral residue vectors.
 184
 185 As an example:
 186
 187 Assume a Vorbis stream that contains six channels in the standard 5.1
 188 format.  The sixth channel, as is normal in 5.1, is bass only.
 189 Therefore it would be wasteful to encode a full-spectrum version of it
 190 as with the other channels.  The submapping mechanism can be used to
 191 apply a full range floor and residue encoding to channels 0 through 4,
 192 and a bass-only representation to the bass channel, thus saving space.
 193 In this example, channels 0-4 belong to submap 0 (which indicates use
 194 of a full-range floor) and channel 5 belongs to submap 1, which uses a
 195 bass-only representation.
 196
 197
 198 \subsubsection{Floor}
 199
 200 Vorbis encodes a spectral 'floor' vector for each PCM channel.  This
 201 vector is a low-resolution representation of the audio spectrum for
 202 the given channel in the current frame, generally used akin to a
 203 whitening filter.  It is named a 'floor' because the Xiph.Org
 204 reference encoder has historically used it as a unit-baseline for
 205 spectral resolution.
 206
 207 A floor encoding may be of two types.  Floor 0 uses a packed LSP
 208 representation on a dB amplitude scale and Bark frequency scale.
 209 Floor 1 represents the curve as a piecewise linear interpolated
 210 representation on a dB amplitude scale and linear frequency scale.
 211 The two floors are semantically interchangeable in
 212 encoding/decoding. However, floor type 1 provides more stable
 213 inter-frame behavior, and so is the preferred choice in all
 214 coupled-stereo and high bitrate modes.  Floor 1 is also considerably
 215 less expensive to decode than floor 0.
 216
 217 Floor 0 is not to be considered deprecated, but it is of limited
 218 modern use.  No known Vorbis encoder past Xiph.Org's own beta 4 makes
 219 use of floor 0.
 220
 221 The values coded/decoded by a floor are both compactly formatted and
 222 make use of entropy coding to save space.  For this reason, a floor
 223 configuration generally refers to multiple codebooks in the codebook
 224 component list.  Entropy coding is thus provided as an abstraction,
 225 and each floor instance may choose from any and all available
 226 codebooks when coding/decoding.
 227
 228
 229 \subsubsection{Residue}
 230 The spectral residue is the fine structure of the audio spectrum
 231 once the floor curve has been subtracted out.  In simplest terms, it
 232 is coded in the bitstream using cascaded (multi-pass) vector
 233 quantization according to one of three specific packing/coding
 234 algorithms numbered 0 through 2.  The packing algorithm details are
 235 configured by residue instance.  As with the floor components, the
 236 final VQ/entropy encoding is provided by external codebook instances
 237 and each residue instance may choose from any and all available
 238 codebooks.
 239
 240 \subsubsection{Codebooks}
 241
 242 Codebooks are a self-contained abstraction that perform entropy
 243 decoding and, optionally, use the entropy-decoded integer value as an
 244 offset into an index of output value vectors, returning the indicated
 245 vector of values.
 246
 247 The entropy coding in a Vorbis I codebook is provided by a standard
 248 Huffman binary tree representation.  This tree is tightly packed using
 249 one of several methods, depending on whether codeword lengths are
 250 ordered or unordered, or the tree is sparse.
 251
 252 The codebook vector index is similarly packed according to index
 253 characteristic.  Most commonly, the vector index is encoded as a
 254 single list of values of possible values that are then permuted into
 255 a list of n-dimensional rows (lattice VQ).
 256
 257
 258
 259 \subsection{High-level Decode Process}
 260
 261 \subsubsection{Decode Setup}
 262
 263 Before decoding can begin, a decoder must initialize using the
 264 bitstream headers matching the stream to be decoded.  Vorbis uses
 265 three header packets; all are required, in-order, by this
 266 specification. Once set up, decode may begin at any audio packet
 267 belonging to the Vorbis stream. In Vorbis I, all packets after the
 268 three initial headers are audio packets.
 269
 270 The header packets are, in order, the identification
 271 header, the comments header, and the setup header.
 272
 273 \paragraph{Identification Header}
 274 The identification header identifies the bitstream as Vorbis, Vorbis
 275 version, and the simple audio characteristics of the stream such as
 276 sample rate and number of channels.
 277
 278 \paragraph{Comment Header}
 279 The comment header includes user text comments (``tags'') and a vendor
 280 string for the application/library that produced the bitstream.  The
 281 encoding and proper use of the comment header is described in \xref{vorbis:spec:comment}.
 282
 283 \paragraph{Setup Header}
 284 The setup header includes extensive CODEC setup information as well as
 285 the complete VQ and Huffman codebooks needed for decode.
 286
 287
 288 \subsubsection{Decode Procedure}
 289
 290 The decoding and synthesis procedure for all audio packets is
 291 fundamentally the same.
 292 \begin{enumerate}
 293 \item decode packet type flag
 294 \item decode mode number
 295 \item decode window shape (long windows only)
 296 \item decode floor
 297 \item decode residue into residue vectors
 298 \item inverse channel coupling of residue vectors
 299 \item generate floor curve from decoded floor data
 300 \item compute dot product of floor and residue, producing audio spectrum vector
 301 \item inverse monolithic transform of audio spectrum vector, always an MDCT in Vorbis I
 302 \item overlap/add left-hand output of transform with right-hand output of previous frame
 303 \item store right hand-data from transform of current frame for future lapping
 304 \item if not first frame, return results of overlap/add as audio result of current frame
 305 \end{enumerate}
 306
 307 Note that clever rearrangement of the synthesis arithmetic is
 308 possible; as an example, one can take advantage of symmetries in the
 309 MDCT to store the right-hand transform data of a partial MDCT for a
 310 50\% inter-frame buffer space savings, and then complete the transform
 311 later before overlap/add with the next frame.  This optimization
 312 produces entirely equivalent output and is naturally perfectly legal.
 313 The decoder must be \emph{entirely mathematically equivalent} to the
 314 specification, it need not be a literal semantic implementation.
 315
 316 \paragraph{Packet type decode}
 317
 318 Vorbis I uses four packet types. The first three packet types mark each
 319 of the three Vorbis headers described above. The fourth packet type
 320 marks an audio packet. All other packet types are reserved; packets
 321 marked with a reserved type should be ignored.
 322
 323 Following the three header packets, all packets in a Vorbis I stream
 324 are audio.  The first step of audio packet decode is to read and
 325 verify the packet type; \emph{a non-audio packet when audio is expected
 326 indicates stream corruption or a non-compliant stream. The decoder
 327 must ignore the packet and not attempt decoding it to
 328 audio}.
 329
 330
 331
 332
 333 \paragraph{Mode decode}
 334 Vorbis allows an encoder to set up multiple, numbered packet 'modes',
 335 as described earlier, all of which may be used in a given Vorbis
 336 stream. The mode is encoded as an integer used as a direct offset into
 337 the mode instance index.
 338
 339
 340 \paragraph{Window shape decode (long windows only)} \label{vorbis:spec:window}
 341
 342 Vorbis frames may be one of two PCM sample sizes specified during
 343 codec setup.  In Vorbis I, legal frame sizes are powers of two from 64
 344 to 8192 samples.  Aside from coupling, Vorbis handles channels as
 345 independent vectors and these frame sizes are in samples per channel.
 346
 347 Vorbis uses an overlapping transform, namely the MDCT, to blend one
 348 frame into the next, avoiding most inter-frame block boundary
 349 artifacts.  The MDCT output of one frame is windowed according to MDCT
 350 requirements, overlapped 50\% with the output of the previous frame and
 351 added.  The window shape assures seamless reconstruction.
 352
 353 This is easy to visualize in the case of equal sized-windows:
 354
 355 \begin{center}
 356 \includegraphics[width=\textwidth]{window1}
 357 \captionof{figure}{overlap of two equal-sized windows}
 358 \end{center}
 359
 360 And slightly more complex in the case of overlapping unequal sized
 361 windows:
 362
 363 \begin{center}
 364 \includegraphics[width=\textwidth]{window2}
 365 \captionof{figure}{overlap of a long and a short window}
 366 \end{center}
 367
 368 In the unequal-sized window case, the window shape of the long window
 369 must be modified for seamless lapping as above.  It is possible to
 370 correctly infer window shape to be applied to the current window from
 371 knowing the sizes of the current, previous and next window.  It is
 372 legal for a decoder to use this method. However, in the case of a long
 373 window (short windows require no modification), Vorbis also codes two
 374 flag bits to specify pre- and post- window shape.  Although not
 375 strictly necessary for function, this minor redundancy allows a packet
 376 to be fully decoded to the point of lapping entirely independently of
 377 any other packet, allowing easier abstraction of decode layers as well
 378 as allowing a greater level of easy parallelism in encode and
 379 decode.
 380
 381 A description of valid window functions for use with an inverse MDCT
 382 can be found in \cite{Sporer/Brandenburg/Edler}.  Vorbis windows
 383 all use the slope function
 384 \[ y = \sin(.5*\pi \, \sin^2((x+.5)/n*\pi)) . \]
 385
 386
 387
 388 \paragraph{floor decode}
 389 Each floor is encoded/decoded in channel order, however each floor
 390 belongs to a 'submap' that specifies which floor configuration to
 391 use.  All floors are decoded before residue decode begins.
 392
 393
 394 \paragraph{residue decode}
 395
 396 Although the number of residue vectors equals the number of channels,
 397 channel coupling may mean that the raw residue vectors extracted
 398 during decode do not map directly to specific channels.  When channel
 399 coupling is in use, some vectors will correspond to coupled magnitude
 400 or angle.  The coupling relationships are described in the codec setup
 401 and may differ from frame to frame, due to different mode numbers.
 402
 403 Vorbis codes residue vectors in groups by submap; the coding is done
 404 in submap order from submap 0 through n-1.  This differs from floors
 405 which are coded using a configuration provided by submap number, but
 406 are coded individually in channel order.
 407
 408
 409
 410 \paragraph{inverse channel coupling}
 411
 412 A detailed discussion of stereo in the Vorbis codec can be found in
 413 the document \href{stereo.html}{Stereo Channel Coupling in the
 414 Vorbis CODEC}.  Vorbis is not limited to only stereo coupling, but
 415 the stereo document also gives a good overview of the generic coupling
 416 mechanism.
 417
 418 Vorbis coupling applies to pairs of residue vectors at a time;
 419 decoupling is done in-place a pair at a time in the order and using
 420 the vectors specified in the current mapping configuration.  The
 421 decoupling operation is the same for all pairs, converting square
 422 polar representation (where one vector is magnitude and the second
 423 angle) back to Cartesian representation.
 424
 425 After decoupling, in order, each pair of vectors on the coupling list,
 426 the resulting residue vectors represent the fine spectral detail
 427 of each output channel.
 428
 429
 430
 431 \paragraph{generate floor curve}
 432
 433 The decoder may choose to generate the floor curve at any appropriate
 434 time.  It is reasonable to generate the output curve when the floor
 435 data is decoded from the raw packet, or it can be generated after
 436 inverse coupling and applied to the spectral residue directly,
 437 combining generation and the dot product into one step and eliminating
 438 some working space.
 439
 440 Both floor 0 and floor 1 generate a linear-range, linear-domain output
 441 vector to be multiplied (dot product) by the linear-range,
 442 linear-domain spectral residue.
 443
 444
 445
 446 \paragraph{compute floor/residue dot product}
 447
 448 This step is straightforward; for each output channel, the decoder
 449 multiplies the floor curve and residue vectors element by element,
 450 producing the finished audio spectrum of each channel.
 451
 452 % TODO/FIXME: The following two paragraphs have identical twins
 453 %   in section 4 (under "dot product")
 454 One point is worth mentioning about this dot product; a common mistake
 455 in a fixed point implementation might be to assume that a 32 bit
 456 fixed-point representation for floor and residue and direct
 457 multiplication of the vectors is sufficient for acceptable spectral
 458 depth in all cases because it happens to mostly work with the current
 459 Xiph.Org reference encoder.
 460
 461 However, floor vector values can span \~{}140dB (\~{}24 bits unsigned), and
 462 the audio spectrum vector should represent a minimum of 120dB (\~{}21
 463 bits with sign), even when output is to a 16 bit PCM device.  For the
 464 residue vector to represent full scale if the floor is nailed to
 465 $-140$dB, it must be able to span 0 to $+140$dB.  For the residue vector
 466 to reach full scale if the floor is nailed at 0dB, it must be able to
 467 represent $-140$dB to $+0$dB.  Thus, in order to handle full range
 468 dynamics, a residue vector may span $-140$dB to $+140$dB entirely within
 469 spec.  A 280dB range is approximately 48 bits with sign; thus the
 470 residue vector must be able to represent a 48 bit range and the dot
 471 product must be able to handle an effective 48 bit times 24 bit
 472 multiplication.  This range may be achieved using large (64 bit or
 473 larger) integers, or implementing a movable binary point
 474 representation.
 475
 476
 477
 478 \paragraph{inverse monolithic transform (MDCT)}
 479
 480 The audio spectrum is converted back into time domain PCM audio via an
 481 inverse Modified Discrete Cosine Transform (MDCT).  A detailed
 482 description of the MDCT is available in \cite{Sporer/Brandenburg/Edler}.
 483
 484 Note that the PCM produced directly from the MDCT is not yet finished
 485 audio; it must be lapped with surrounding frames using an appropriate
 486 window (such as the Vorbis window) before the MDCT can be considered
 487 orthogonal.
 488
 489
 490
 491 \paragraph{overlap/add data}
 492 Windowed MDCT output is overlapped and added with the right hand data
 493 of the previous window such that the 3/4 point of the previous window
 494 is aligned with the 1/4 point of the current window (as illustrated in
 495 the window overlap diagram). At this point, the audio data between the
 496 center of the previous frame and the center of the current frame is
 497 now finished and ready to be returned.
 498
 499
 500 \paragraph{cache right hand data}
 501 The decoder must cache the right hand portion of the current frame to
 502 be lapped with the left hand portion of the next frame.
 503
 504
 505
 506 \paragraph{return finished audio data}
 507
 508 The overlapped portion produced from overlapping the previous and
 509 current frame data is finished data to be returned by the decoder.
 510 This data spans from the center of the previous window to the center
 511 of the current window.  In the case of same-sized windows, the amount
 512 of data to return is one-half block consisting of and only of the
 513 overlapped portions. When overlapping a short and long window, much of
 514 the returned range is not actually overlap.  This does not damage
 515 transform orthogonality.  Pay attention however to returning the
 516 correct data range; the amount of data to be returned is:
 517
 518 \begin{Verbatim}[commandchars=\\\{\}]
 519 window\_blocksize(previous\_window)/4+window\_blocksize(current\_window)/4
 520 \end{Verbatim}
 521
 522 from the center of the previous window to the center of the current
 523 window.
 524
 525 Data is not returned from the first frame; it must be used to 'prime'
 526 the decode engine.  The encoder accounts for this priming when
 527 calculating PCM offsets; after the first frame, the proper PCM output
 528 offset is '0' (as no data has been returned yet).