Vorbis I specification

This document provides a high-level description of the Vorbis codec's construction. A bit-by-bit specification appears beginning in Section 4, “Codec Setup and Packet Decode”. The later sections assume a high-level understanding of the Vorbis decode process, which is -provided here.

1.1.1. Application

+provided here.

1.1.1. Application

Vorbis is a general purpose perceptual audio CODEC intended to allow maximum encoder flexibility, thus allowing it to scale competitively over an exceptionally wide range of bitrates. At the high @@ -18,13 +18,13 @@ lower and higher sample rates (from 8kHz telephony to 192kHz digital masters) and a range of channel representations (monaural, polyphonic, stereo, quadraphonic, 5.1, ambisonic, or up to 255 discrete channels). -

1.1.2. Classification

Vorbis I is a forward-adaptive monolithic transform CODEC based on the Modified Discrete Cosine Transform. The codec is structured to allow addition of a hybrid wavelet filterbank in Vorbis II to offer better transient response and reproduction using a transform better suited to localized time events. -

1.1.3. Assumptions

The Vorbis CODEC design assumes a complex, psychoacoustically-aware encoder and simple, low-complexity decoder. Vorbis decode is computationally simpler than mp3, although it does require more @@ -56,8 +56,8 @@ examples in this document, we will assume that Vorbis is to be embedded in an Ogg stream specifically, although this is by no means a requirement or fundamental assumption in the Vorbis design.

The specification for embedding Vorbis into -an Ogg transport stream is in Appendix 1, Embedding Vorbis into an Ogg stream. -

1.1.4. Codec Setup and Probability Model

+an Ogg transport stream is in Appendix A, Embedding Vorbis into an Ogg stream. +

1.1.4. Codec Setup and Probability Model

Vorbis' heritage is as a research CODEC and its current design reflects a desire to allow multiple decades of continuous encoder improvement before running out of room within the codec specification. @@ -85,29 +85,29 @@ causes some amount of complaint among engineers as this runs against current design trends (and also points out limitations in some existing software/interface designs, such as Windows' ACM codec framework). However, we find that it does not fundamentally limit -Vorbis' suitable application space.

1.1.5. Format Specification

+Vorbis' suitable application space.

1.1.5. Format Specification

The Vorbis format is well-defined by its decode specification; any encoder that produces packets that are correctly decoded by the reference Vorbis decoder described below may be considered a proper Vorbis encoder. A decoder must faithfully and completely implement the specification defined below (except where noted) to be considered -a proper Vorbis decoder.

1.1.6. Hardware Profile

+a proper Vorbis decoder.

1.1.6. Hardware Profile

Although Vorbis decode is computationally simple, it may still run into specific limitations of an embedded design. For this reason, embedded designs are allowed to deviate in limited ways from the 'full' decode specification yet still be certified compliant. These -optional omissions are labelled in the spec where relevant.

1.2. Decoder Configuration

+optional omissions are labelled in the spec where relevant.

1.2. Decoder Configuration

Decoder setup consists of configuration of multiple, self-contained component abstractions that perform specific functions in the decode pipeline. Each different component instance of a specific type is semantically interchangeable; decoder configuration consists both of internal component configuration, as well as arrangement of specific instances into a decode pipeline. Componentry arrangement is roughly -as follows:

1.2.1. Global Config

+as follows:

1.2.1. Global Config

Global codec configuration consists of a few audio related fields (sample rate, channels), Vorbis version (always '0' in Vorbis I), bitrate hints, and the lists of component instances. All other -configuration is in the context of specific components.

1.2.2. Mode

+configuration is in the context of specific components.

1.2.2. Mode

Each Vorbis frame is coded according to a master 'mode'. A bitstream may use one or many modes.

The mode mechanism is used to encode a frame according to one of @@ -120,7 +120,7 @@ A 'mode' configuration consists of a frame size setting, window type (always 0, the Vorbis window, in Vorbis I), transform type (always type 0, the MDCT, in Vorbis I) and a mapping number. The mapping number specifies which mapping configuration instance to use for -low-level packet decode and synthesis.

1.2.3. Mapping

+low-level packet decode and synthesis.

1.2.3. Mapping

A mapping contains a channel coupling description and a list of 'submaps' that bundle sets of channel vectors together for grouped encoding and decoding. These submaps are not references to external @@ -141,7 +141,7 @@ apply a full range floor and residue encoding to channels 0 through 4, and a bass-only representation to the bass channel, thus saving space. In this example, channels 0-4 belong to submap 0 (which indicates use of a full-range floor) and channel 5 belongs to submap 1, which uses a -bass-only representation.

1.2.4. Floor

+bass-only representation.

1.2.4. Floor

Vorbis encodes a spectral 'floor' vector for each PCM channel. This vector is a low-resolution representation of the audio spectrum for the given channel in the current frame, generally used akin to a @@ -165,7 +165,7 @@ make use of entropy coding to save space. For this reason, a floor configuration generally refers to multiple codebooks in the codebook component list. Entropy coding is thus provided as an abstraction, and each floor instance may choose from any and all available -codebooks when coding/decoding.

1.2.5. Residue

+codebooks when coding/decoding.

1.2.5. Residue

The spectral residue is the fine structure of the audio spectrum once the floor curve has been subtracted out. In simplest terms, it is coded in the bitstream using cascaded (multi-pass) vector @@ -174,7 +174,7 @@ algorithms numbered 0 through 2. The packing algorithm details are configured by residue instance. As with the floor components, the final VQ/entropy encoding is provided by external codebook instances and each residue instance may choose from any and all available -codebooks.

1.2.6. Codebooks

+codebooks.

1.2.6. Codebooks

Codebooks are a self-contained abstraction that perform entropy decoding and, optionally, use the entropy-decoded integer value as an offset into an index of output value vectors, returning the indicated @@ -186,7 +186,7 @@ ordered or unordered, or the tree is sparse.

The codebook vector index is similarly packed according to index characteristic. Most commonly, the vector index is encoded as a single list of possible values that are then permuted into -a list of n-dimensional rows (lattice VQ).

1.3. High-level Decode Process

1.3.1. Decode Setup

+a list of n-dimensional rows (lattice VQ).

1.3. High-level Decode Process

1.3.1. Decode Setup

Before decoding can begin, a decoder must initialize using the bitstream headers matching the stream to be decoded. Vorbis uses three header packets; all are required, in-order, by this @@ -194,16 +194,16 @@ specification. Once set up, decode may begin at any audio packet belonging to the Vorbis stream. In Vorbis I, all packets after the three initial headers are audio packets.

The header packets are, in order, the identification -header, the comments header, and the setup header.

1.3.1.1. Identification Header

+header, the comments header, and the setup header.

1.3.1.1. Identification Header

The identification header identifies the bitstream as Vorbis, Vorbis version, and the simple audio characteristics of the stream such as -sample rate and number of channels.

1.3.1.2. Comment Header

+sample rate and number of channels.

1.3.1.2. Comment Header

The comment header includes user text comments ("tags") and a vendor string for the application/library that produced the bitstream. The encoding and proper use of the comment header is described in -Section 5, “comment field and header specification”.

1.3.1.3. Setup Header

+Section 5, “comment field and header specification”.

1.3.1.3. Setup Header

The setup header includes extensive CODEC setup information as well as -the complete VQ and Huffman codebooks needed for decode.

1.3.2. Decode Procedure

+the complete VQ and Huffman codebooks needed for decode.

1.3.2. Decode Procedure

The decoding and synthesis procedure for all audio packets is fundamentally the same.

decode packet type flag
decode mode number
decode window shape (long windows only)
decode floor
decode residue into residue vectors
inverse channel coupling of residue vectors
generate floor curve from decoded floor data
compute dot product of floor and residue, producing audio spectrum vector
inverse monolithic transform of audio spectrum vector, always an MDCT in Vorbis I
overlap/add left-hand output of transform with right-hand output of previous frame
store right-hand data from transform of current frame for future lapping
if not first frame, return results of overlap/add as audio result of current frame
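
The last three steps of the list above (overlap/add, caching the right-hand data, and returning audio only after the first frame) can be sketched as follows. This is a minimal illustration rather than the reference decoder: it assumes every frame has the same size and has already been windowed, and the name `lap_frame` is hypothetical.

```python
def lap_frame(prev_right, frame):
    """Overlap/add one windowed frame with the cached right half of the previous one.

    Assumes equal-sized frames (an assumption for illustration only).
    Returns (finished_audio, new_cache).
    """
    half = len(frame) // 2
    left, right = frame[:half], frame[half:]
    if prev_right is None:
        # First frame: nothing to lap against, so no audio is returned yet.
        return [], right
    finished = [a + b for a, b in zip(prev_right, left)]
    return finished, right


cache = None
out1, cache = lap_frame(cache, [1.0, 2.0, 3.0, 4.0])      # first frame primes the cache
out2, cache = lap_frame(cache, [10.0, 20.0, 30.0, 40.0])  # second frame yields audio
```

Here the first call returns no samples, and the second returns the sum of the cached right half and the new left half.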

@@ -215,7 +215,7 @@ MDCT to store the right-hand transform data of a partial MDCT for a later before overlap/add with the next frame. This optimization produces entirely equivalent output and is naturally perfectly legal. The decoder must be entirely mathematically equivalent to the -specification; it need not be a literal semantic implementation.

1.3.2.1. Packet type decode

+specification; it need not be a literal semantic implementation.

1.3.2.1. Packet type decode

Vorbis I uses four packet types. The first three packet types mark each of the three Vorbis headers described above. The fourth packet type marks an audio packet. All other packet types are reserved; packets @@ -225,7 +225,7 @@ are audio. The first step of audio packet decode is to read and verify the packet type; a non-audio packet when audio is expected indicates stream corruption or a non-compliant stream. The decoder must ignore the packet and not attempt decoding it to -audio.

1.3.2.2. Mode decode

+audio.

1.3.2.2. Mode decode

Vorbis allows an encoder to set up multiple, numbered packet 'modes', as described earlier, all of which may be used in a given Vorbis stream. The mode is encoded as an integer used as a direct offset into @@ -262,10 +262,10 @@ The use of multirate filter banks for coding of high quality digital audio”, by T. Sporer, K. Brandenburg and B. Edler. Vorbis windows all use the slope function . -

1.3.2.4. floor decode

Each floor is encoded/decoded in channel order; however, each floor belongs to a 'submap' that specifies which floor configuration to -use. All floors are decoded before residue decode begins.

1.3.2.5. residue decode

+use. All floors are decoded before residue decode begins.

1.3.2.5. residue decode

Although the number of residue vectors equals the number of channels, channel coupling may mean that the raw residue vectors extracted during decode do not map directly to specific channels. When channel @@ -275,7 +275,7 @@ and may differ from frame to frame, due to different mode numbers.

Vorbis codes residue vectors in groups by submap; the coding is done in submap order from submap 0 through n-1. This differs from floors which are coded using a configuration provided by submap number, but -are coded individually in channel order.

1.3.2.6. inverse channel coupling

+are coded individually in channel order.

1.3.2.6. inverse channel coupling

A detailed discussion of stereo in the Vorbis codec can be found in the document Stereo Channel Coupling in the Vorbis CODEC. Vorbis is not limited to only stereo coupling, but @@ -289,7 +289,7 @@ polar representation (where one vector is magnitude and the second angle) back to Cartesian representation.

After decoupling, in order, each pair of vectors on the coupling list, the resulting residue vectors represent the fine spectral detail -of each output channel.

1.3.2.7. generate floor curve

+of each output channel.

1.3.2.7. generate floor curve

The decoder may choose to generate the floor curve at any appropriate time. It is reasonable to generate the output curve when the floor data is decoded from the raw packet, or it can be generated after @@ -298,7 +298,7 @@ combining generation and the dot product into one step and eliminating some working space.

Both floor 0 and floor 1 generate a linear-range, linear-domain output vector to be multiplied (dot product) by the linear-range, -linear-domain spectral residue.

1.3.2.8. compute floor/residue dot product

+linear-domain spectral residue.

1.3.2.8. compute floor/residue dot product

This step is straightforward; for each output channel, the decoder multiplies the floor curve and residue vectors element by element, producing the finished audio spectrum of each channel.
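
Note that the 'dot product' described here is an element-by-element multiplication that yields a vector, not a sum reduced to a single scalar. A minimal sketch for one channel follows; the function name `apply_floor` is hypothetical and floating-point lists stand in for whatever numeric representation a decoder actually uses.

```python
def apply_floor(floor_curve, residue):
    """Multiply floor and residue element by element for one channel."""
    if len(floor_curve) != len(residue):
        raise ValueError("floor and residue vectors must have equal length")
    return [f * r for f, r in zip(floor_curve, residue)]


spectrum = apply_floor([0.5, 2.0, 1.0], [4.0, -1.0, 0.0])
```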

@@ -321,7 +321,7 @@ residue vector must be able to represent a 48 bit range and the dot product must be able to handle an effective 48 bit times 24 bit multiplication. This range may be achieved using large (64 bit or larger) integers, or implementing a movable binary point -representation.

1.3.2.9. inverse monolithic transform (MDCT)

+representation.

1.3.2.9. inverse monolithic transform (MDCT)

The audio spectrum is converted back into time domain PCM audio via an inverse Modified Discrete Cosine Transform (MDCT). A detailed description of the MDCT is available in the paper “The use of multirate filter banks for coding of high quality digital @@ -329,16 +329,16 @@ audio”, by T. Sporer, K. Brandenburg and B. Edler.

Note that the PCM produced directly from the MDCT is not yet finished audio; it must be lapped with surrounding frames using an appropriate window (such as the Vorbis window) before the MDCT can be considered -orthogonal.

1.3.2.10. overlap/add data

+orthogonal.

1.3.2.10. overlap/add data

Windowed MDCT output is overlapped and added with the right hand data of the previous window such that the 3/4 point of the previous window is aligned with the 1/4 point of the current window (as illustrated in the window overlap diagram). At this point, the audio data between the center of the previous frame and the center of the current frame is -now finished and ready to be returned.

1.3.2.11. cache right hand data

+now finished and ready to be returned.

1.3.2.11. cache right hand data

The decoder must cache the right hand portion of the current frame to be lapped with the left hand portion of the next frame. -

1.3.2.12. return finished audio data

The overlapped portion produced from overlapping the previous and current frame data is finished data to be returned by the decoder. This data spans from the center of the previous window to the center @@ -360,7 +360,7 @@ the decode engine. The encoder accounts for this priming when calculating PCM offsets; after the first frame, the proper PCM output offset is '0' (as no data has been returned yet).

2. Bitpacking Convention

$Id: 02-bitpacking.xml 7186 2004-07-20 07:19:25Z xiphmont $ -

2.1. Overview

The Vorbis codec uses relatively unstructured raw packets containing arbitrary-width binary integer fields. Logically, these packets are a bitstream in which bits are coded one-by-one by the encoder and then @@ -370,7 +370,7 @@ native word size of eight bits (octets), sixteen bits, thirty-two bits or, less commonly, other fixed word sizes. The Vorbis bitpacking convention specifies the correct mapping of the logical packet bitstream into an actual representation in fixed-width words. -

2.1.1. octets, bytes and words

In most contemporary architectures, a 'byte' is synonymous with an 'octet', that is, eight bits. This has not always been the case; seven, ten, eleven and sixteen bit 'bytes' have been used. For @@ -386,13 +386,13 @@ octet (eight bits) and a word to be a group of two, four or eight bytes (16, 32 or 64 bits). Note however that the Vorbis bitpacking convention is still well defined for any native byte size; Vorbis uses the native bit-width of a given storage system. This document assumes -that a byte is one octet for purposes of example.

2.1.2. bit order

+that a byte is one octet for purposes of example.

2.1.2. bit order

A byte has a well-defined 'least significant' bit (LSb), which is the only bit set when the byte is storing the two's complement integer value +1. A byte's 'most significant' bit (MSb) is at the opposite end of the byte. Bits in a byte are numbered from zero at the LSb to n (n=7 in an octet) for the -MSb.

2.1.3. byte order

+MSb.

2.1.3. byte order

Words are native groupings of multiple bytes. Several byte orderings are possible in a word; the common ones are 3-2-1-0 ('big endian' or 'most significant byte first' in which the highest-valued byte comes @@ -404,7 +404,7 @@ manipulation at the byte, not word, level, thus host word ordering is a concern only during optimization when writing high performance code that operates on a word of storage at a time rather than by byte. Logically, bytes are always coded and decoded in order from byte zero -through byte n.

2.1.4. coding bits into byte sequences

+through byte n.

2.1.4. coding bits into byte sequences

The Vorbis codec needs to code arbitrary bit-width integers, from zero to 32 bits wide, into packets. These integer fields are not aligned to the boundaries of the byte representation; the next field @@ -420,13 +420,13 @@ the requested number of bits. When all bits of the destination byte have been filled, encoding continues by zeroing all bits of the next byte and writing the next bit into bit position 0 of that byte. Decoding follows the same process as encoding, but by reading bits -from the byte stream and reassembling them into integers.

2.1.5. signedness

+from the byte stream and reassembling them into integers.

2.1.5. signedness

The signedness of a specific number resulting from decode is to be interpreted by the decoder given decode context. That is, the three bit binary pattern 'b111' can be taken to represent either 'seven' as an unsigned integer, or '-1' as a signed, two's complement integer. The encoder and decoder are responsible for knowing if fields are to -be treated as signed or unsigned.

2.1.6. coding example

+be treated as signed or unsigned.
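
As an illustration of the point above, the same bit pattern can be reinterpreted after the read once the decode context says the field is signed. The helper name `as_signed` is hypothetical; this is a sketch of ordinary two's complement sign extension, not a routine named by this specification.

```python
def as_signed(value, bits):
    """Reinterpret an unsigned `bits`-wide field as a two's complement integer."""
    if bits == 0:
        return 0
    sign_bit = 1 << (bits - 1)
    # Flipping and then subtracting the sign bit maps the upper half of the
    # unsigned range onto the negative numbers.
    return (value ^ sign_bit) - sign_bit


unsigned_view = 0b111                    # seven when treated as unsigned
signed_view = as_signed(0b111, 3)        # the same three bits treated as signed
```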

2.1.6. coding example

Code the 4 bit integer value '12' [b1100] into an empty bytestream. Bytestream result: @@ -490,7 +490,7 @@ byte 3 [0 0 0 0 0 1 1 0] <- byte n [ ] bytestream length == 4 bytes

2.1.7. decoding example

Reading from the beginning of the bytestream encoded in the above example:

@@ -515,7 +515,7 @@ boundaries maintained in the bitstream.
The second value is the
 two-bit-wide integer 'b11'.  This value may be interpreted either as
 the unsigned value '3', or the signed value '-1'.  Signedness is
 dependent on decode context.
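
The packing and unpacking rules can be sketched in a few lines. The sketch below assumes eight-bit bytes, and the function names `pack_fields` and `unpack_fields` are hypothetical. It codes the four-bit value 12 and then reads it back as two two-bit fields, as in the examples above.

```python
def pack_fields(fields):
    """Pack (value, width) pairs into bytes, filling each byte from bit 0 upward."""
    out = bytearray()
    bit_pos = 0  # total number of bits written so far
    for value, width in fields:
        for i in range(width):
            if bit_pos % 8 == 0:
                out.append(0)  # start a new, zeroed byte
            bit = (value >> i) & 1
            out[-1] |= bit << (bit_pos % 8)
            bit_pos += 1
    return bytes(out)


def unpack_fields(data, widths):
    """Read unsigned fields of the given widths from the start of `data`."""
    values = []
    bit_pos = 0
    for width in widths:
        value = 0
        for i in range(width):
            bit = (data[bit_pos // 8] >> (bit_pos % 8)) & 1
            value |= bit << i
            bit_pos += 1
        values.append(value)
    return values


packed = pack_fields([(12, 4)])               # b1100 into an empty bytestream
first, second = unpack_fields(packed, [2, 2])  # two two-bit reads
```

The reads do not have to follow the field boundaries used when writing; the second two-bit read returns 'b11', which the caller may treat as 3 or as -1.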

2.1.8. end-of-packet alignment

The typical use of bitpacking is to produce many independent byte-aligned packets which are embedded into a larger byte-aligned container structure, such as an Ogg transport bitstream. Externally, @@ -533,7 +533,7 @@ remaining data to fulfill the desired read size. Vorbis uses truncated packets as a normal mode of operation, and as such, decoders must handle reading past the end of a packet as a typical mode of operation. Any further read operations after an 'end-of-packet' -condition shall also return 'end-of-packet'.

2.1.9. reading zero bits

+condition shall also return 'end-of-packet'.

2.1.9. reading zero bits

Reading a zero-bit-wide integer returns the value '0' and does not increment the stream cursor. Reading to the end of the packet (but not past, such that an 'end-of-packet' condition has not triggered) @@ -542,7 +542,7 @@ not trigger an end-of-packet condition. Reading a zero-bit-wide integer after a previous read sets 'end-of-packet' shall also fail with 'end-of-packet'.
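
The end-of-packet and zero-bit rules above can be modeled with a small reader. This is a sketch under stated assumptions: the class name `BitReader` is hypothetical, bytes are eight bits wide, and `None` stands in for the 'end-of-packet' result.

```python
class BitReader:
    """LSb-first bit reader with a sticky end-of-packet condition."""

    def __init__(self, packet):
        self.packet = packet
        self.cursor = 0              # read position, in bits
        self.end_of_packet = False

    def read(self, width):
        """Return the next `width`-bit unsigned value, or None at end-of-packet."""
        if self.end_of_packet:
            return None              # sticky, even for zero-bit reads
        if self.cursor + width > 8 * len(self.packet):
            self.end_of_packet = True
            return None
        value = 0
        for i in range(width):
            byte = self.packet[self.cursor // 8]
            value |= ((byte >> (self.cursor % 8)) & 1) << i
            self.cursor += 1
        return value


reader = BitReader(bytes([0x0C]))
whole = reader.read(8)           # reads exactly to the end of the packet
zero_ok = reader.read(0)         # zero-bit read at the end still succeeds
past = reader.read(1)            # reading past the end sets end-of-packet
zero_after = reader.read(0)      # now even a zero-bit read fails
```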

3. Probability Model and Codebooks

$Id: 03-codebook.xml 7186 2004-07-20 07:19:25Z xiphmont $ -

3.1. Overview

Unlike practically every other mainstream audio codec, Vorbis has no statically configured probability model, instead packing all entropy decoding configuration, VQ and Huffman, into the bitstream itself in @@ -551,15 +551,15 @@ consists of multiple 'codebooks', each containing a specific Huffman-equivalent representation for decoding compressed codewords as well as an optional lookup table of output vector values to which a decoded Huffman value is applied as an offset, generating the final -decoded output corresponding to a given compressed codeword.

3.1.1. Bitwise operation

+decoded output corresponding to a given compressed codeword.

3.1.1. Bitwise operation

The codebook mechanism is built on top of the Vorbis bitpacker. Both the codebooks themselves and the codewords they decode are unrolled from a packet as a series of arbitrary-width values read from the -stream according to Section 2, “Bitpacking Convention”.

3.2. Packed codebook format

+stream according to Section 2, “Bitpacking Convention”.

3.2. Packed codebook format

For purposes of the examples below, we assume that the storage system's native byte width is eight bits. This is not universally true; see Section 2, “Bitpacking Convention” for discussion -relating to non-eight-bit bytes.

3.2.1. codebook decode

+relating to non-eight-bit bytes.

3.2.1. codebook decode

A codebook begins with a 24 bit sync pattern, 0x564342:

@@ -689,7 +689,7 @@ and indicates a stream that is not decodable by the specification in this
 document.

An 'end of packet' during any read operation in the above steps is -considered an error condition rendering the stream undecodable.

3.2.1.1. Huffman decision tree representation

+considered an error condition rendering the stream undecodable.

3.2.1.1. Huffman decision tree representation

The [codebook_codeword_lengths] array and [codebook_entries] value uniquely define the Huffman decision tree used for entropy decoding.

@@ -747,7 +747,7 @@ undecodable.

Codebook entries marked 'unused' are simply skipped in the assigning process. They have no codeword and do not appear in the decision tree, thus it's impossible for any bit pattern read from the stream to -decode to that entry number.

3.2.1.2. VQ lookup table vector representation

+decode to that entry number.

3.2.1.2. VQ lookup table vector representation

Unpacking the VQ lookup table vectors relies on the following values:

 the [codebook_multiplicands] array
@@ -763,7 +763,7 @@ the [codebook_multiplicands] array
 Decoding (unpacking) a specific vector in the vector lookup table
 proceeds according to [codebook_lookup_type].  The unpacked
 vector values are what a codebook would return during audio packet
-decode in a VQ context.
3.2.1.2.1. Vector value decode: Lookup type 1

+decode in a VQ context.
3.2.1.2.1. Vector value decode: Lookup type 1

 Lookup type one specifies a lattice VQ lookup table built
 algorithmically from a list of scalar values.  Calculate (unpack) the
 final values of a codebook entry vector from the entries in
@@ -790,7 +790,7 @@ is the output vector representing the vector of values for entry number
      }
  
   8) vector calculation completed.
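
The general idea of a lattice table built from a short list of scalars can be illustrated as follows. This sketch shows only the index decomposition: it treats the entry number as a multi-digit number whose digits select scalars, with the first dimension taken as the least significant digit (an assumption made here). It deliberately omits any scaling, offset or accumulation that the full procedure applies, and the names `lattice_vector`, `scalars` and `dimensions` are hypothetical.

```python
def lattice_vector(entry, scalars, dimensions):
    """Select one scalar per dimension using the digits of `entry` in base len(scalars)."""
    base = len(scalars)
    vector = []
    divisor = 1
    for _ in range(dimensions):
        vector.append(scalars[(entry // divisor) % base])
        divisor *= base
    return vector


# Three scalars and two dimensions give 3**2 = 9 distinct entries.
table = [lattice_vector(e, [-1, 0, 1], 2) for e in range(9)]
```

The point of the design is that a list of only three scalars describes all nine two-dimensional vectors, so the table itself never has to be stored entry by entry.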
-

3.2.1.2.2. Vector value decode: Lookup type 2

Lookup type two specifies a VQ lookup table in which each scalar in each vector is explicitly set by the [codebook_multiplicands] array in a one-to-one mapping. Calculate [unpack] the @@ -815,7 +815,7 @@ is the output vector representing the vector of values for entry number } 7) vector calculation completed. -

3.3. Use of the codebook abstraction

The decoder uses the codebook abstraction much as it does the bit-unpacking convention; a specific codebook reads a codeword from the bitstream, decoding it into an entry number, and then @@ -847,19 +847,19 @@ desired return value.

When used in a VQ context, the codeword entry number is used as an offset into the VQ lookup table. The value returned to the decoder is the vector of scalars corresponding to this offset.

4. Codec Setup and Packet Decode

- $Id: 04-codec.xml 7186 2004-07-20 07:19:25Z xiphmont $ -

4.1. Overview

+ $Id: 04-codec.xml 10466 2005-11-28 00:34:44Z giles $ +