doc/stereo.html

   1 <HTML><HEAD><TITLE>xiph.org: Ogg Vorbis documentation</TITLE>
   2 <BODY bgcolor="#ffffff" text="#202020" link="#006666" vlink="#000000">
   3 <nobr><img src="white-ogg.png"><img src="vorbisword2.png"></nobr><p>
   4
   5
   6 <h1><font color=#000070>
   7 Stereo Channel Coupling in the Vorbis CODEC
   8 </font></h1>
   9
  10 <em>Last update to this document: June 27, 2001</em><br>
  11
  12 <h2>Abstract</h2> The Vorbis audio CODEC provides channel coupling
  13 mechanisms designed to reduce effective bitrate by both eliminating
  14 interchannel redundancy and eliminating stereo image information
  15 labelled inaudible or undesirable according to spatial psychoacoustic
  16 models.  This document describes both the mechanical coupling
  17 mechanisms available within the Vorbis specification, as well as the
  18 specific stereo coupling models used by the reference
  19 <tt>libvorbis</tt> CODEC provided by xiph.org.
  20
  21 <h2>Terminology</h2> Terminology as used in this document is based on
  22 common terminology associated with contemporary CODECs such as MPEG I
  23 audio layer 3 (mp3).  However, some differences in terminology are
  24 useful in the context of Vorbis as Vorbis functions somewhat
  25 differently than most current formats.  For clarity, a few terms are
  26 defined beforehand here, and others will be defined where they first
  27 appear in context.<p>
  28
  29 <h3>Subjective and Objective</h3>
  30
  31 <em>Objective</em> fidelity is a measure, based on a computable,
  32 mechanical metric, of how carefully an output matches an input.  For
  33 example, a stereo amplifier may claim to introduce less than 0.01%
  34 total harmonic distortion when amplifying an input signal; this claim
  35 is easy to verify given proper equipment, and any number of testers are
  36 likely to arrive at exactly the same results.  One need not listen to
  37 the equipment to make this measurement.<p>
  38
  39 However, given two amplifiers with identical, verifiable objective
  40 specifications, listeners may strongly prefer the sound quality of one
  41 over the other.  This is actually the case in the decades old debate
  42 [some would say jihad] among audiophiles involving vacuum tube versus
  43 solid state amplifiers.  There are people who can tell the difference,
  44 and strongly prefer one over the other despite seemingly identical,
  45 measurable quality.  This preference is <em>subjective</em> and
  46 difficult to measure but nonetheless real.<p>
  47
  48 Individual elements of subjective differences often can be quantified,
  49 but overall subjective quality generally is not measurable.  Different
  50 observers are likely to disagree on the exact results of a subjective
  51 test as each observer's perspective differs.  When measuring
  52 subjective qualities, the best one can hope for is average, empirical
  53 results that show statistical significance across a group.<p>
  54
  55 Perceptual codecs are most concerned with subjective, not objective,
  56 quality.  This is why evaluating a perceptual codec via distortion
  57 measures and sonograms alone is useless; these objective measures may
  58 provide insight into the quality or functioning of a codec, but cannot
  59 answer the much squishier subjective question, "Does it sound
  60 good?". The tube amplifier example is perhaps not the best as very few
  61 people can hear, or care to hear, the minute differences between tubes
  62 and transistors, whereas the subjective differences in perceptual
  63 codecs tend to be quite large even when objective differences are
  64 not.<p>
  65
  66 <h3>Fidelity, Artifacts and Differences</h3> Audio <em>artifacts</em>
  67 and loss of fidelity, or more simply put, audio <em>differences</em>,
  68 are not the same thing.<p>
  69
  70 A loss of fidelity implies differences between the perceived input and
  71 output signal; it does not necessarily imply that the differences in
  72 output are displeasing or that the output sounds poor (although this
  73 is often the case).  Tube amplifiers are <em>not</em> higher fidelity
  74 than modern solid state and digital systems.  They simply produce a
  75 form of distortion and coloring that is either unnoticeable or actually
  76 pleasing to many ears.<p>
  77
  78 As compared to an original signal using hard metrics, all perceptual
  79 codecs [ASPEC, ATRAC, MP3, WMA, AAC, TwinVQ, AC3 and Vorbis included]
  80 lose objective fidelity in order to reduce bitrate.  This is fact. The
  81 idea is to lose fidelity in ways that cannot be perceived.  However,
  82 most current streaming applications demand bitrates lower than what
  83 can be achieved by sacrificing only objective fidelity; this is also
  84 fact, despite whatever various company press releases might claim.
  85 Subjective fidelity eventually must suffer in one way or another.<p>
  86
  87 The goal is to choose the best possible tradeoff such that the
  88 fidelity loss is graceful and not obviously noticeable.  Most listeners
  89 of FM radio do not realize how much lower fidelity that medium is as
  90 compared to compact discs or DAT.  However, when compared directly to
  91 source material, the difference is obvious.  A cassette tape is lower
  92 fidelity still, and yet the degradation, relatively speaking, is
  93 graceful and generally easy to overlook.  Compare this graceful loss
  94 of quality to an average 44.1kHz stereo mp3 encoded at 80 or 96kbps.
  95 The mp3 might actually be higher objective fidelity but subjectively
  96 sounds much worse.<p>
  97
  98 Thus, when a CODEC <em>must</em> sacrifice subjective quality in order
  99 to satisfy a user's requirements, the result should be a
 100 <em>difference</em> that is generally either difficult to notice
 101 without comparison, or easy to ignore.  An <em>artifact</em>, on the
 102 other hand, is an element introduced into the output that is
 103 immediately noticeable, obviously foreign, and undesired.  The famous
 104 'underwater' or 'twinkling' effect synonymous with low bitrate (or
 105 poorly encoded) mp3 is an example of an <em>artifact</em>.  This
 106 working definition differs slightly from common usage, but the coined
 107 distinction between differences and artifacts is useful for our
 108 discussion.<p>
 109
 110 The goal, when it is absolutely necessary to sacrifice subjective
 111 fidelity, is obviously to strive for differences and not artifacts.
 112 The vast majority of CODECs today fail at this task miserably,
 113 predictably, and regularly in one way or another.  Avoiding such
 114 failures when it is necessary to sacrifice subjective quality is a
 115 fundamental design objective of Vorbis and that objective is reflected
 116 in Vorbis's channel coupling design.<p>
 117
 118 <h2>Mechanisms</h2>
 119
 120 In encoder release beta 4 and earlier, Vorbis supported multiple
 121 channel encoding, but the channels were encoded entirely separately
 122 with no cross-analysis or redundancy elimination between channels.
 123 This multichannel strategy is very similar to mp3's <em>dual
 124 stereo</em> mode, and Vorbis uses the same name for its analogous
 125 uncoupled multichannel modes.<p>
 126
 127 However, the Vorbis spec provides for, and Vorbis release 1.0 rc1 and
 128 later implement, a coupled channel strategy.  Vorbis has two specific
 129 mechanisms that may be used alone or in conjunction to implement
 130 channel coupling.  The first is <em>channel interleaving</em> via
 131 residue backend #2, and the second is <em>square polar mapping</em>.
 132 These two general mechanisms are particularly well suited to coupling
 133 due to the structure of Vorbis encoding, as we'll explore below, and
 134 using both we can implement both totally <em>lossless stereo image
 135 coupling</em>, as well as various lossy models that seek to eliminate
 136 inaudible or unimportant aspects of the stereo image in order to
 137 reduce bitrate. The exact coupling implementation is generalized to
 138 allow the encoder a great deal of flexibility in implementation of a
 139 stereo model without requiring any significant complexity increase
 140 over the combinatorially simpler mid/side joint stereo of mp3 and
 141 other current audio codecs.<p>
 142
 143 Channel interleaving may be applied directly to more than a single
 144 channel and polar mapping is hierarchical such that polar coupling may be
 145 extrapolated to an arbitrary number of channels and is not restricted
 146 to only stereo, quadraphonics, ambisonics or 5.1 surround.  However,
 147 the scope of this document restricts itself to the stereo coupling
 148 case.<p>
 149
 150 <h3>Square Polar Mapping</h3>
 151
 152 <h4>maximal correlation</h4>
 153
 154 Recall that the basic structure of a Vorbis I stream first generates
 155 from input audio a spectral 'floor' function that serves as an
 156 MDCT-domain whitening filter.  This floor is meant to represent the
 157 rough envelope of the frequency spectrum, using whatever metric the
 158 encoder cares to define.  This floor is subtracted from the log
 159 frequency spectrum, effectively normalizing the spectrum by frequency.
 160 Each input channel is associated with a unique floor function.<p>
 161
 162 The basic idea behind any stereo coupling is that the left and right
 163 channels usually correlate.  This correlation is even stronger if one
 164 first accounts for energy differences in any given frequency band
 165 across left and right; think for example of individual instruments
 166 mixed into different portions of the stereo image, or a stereo
 167 recording with a dominant feature not perfectly in the center.  The
 168 floor functions, each specific to a channel, provide the perfect means
 169 of normalizing left and right energies across the spectrum to maximize
 170 correlation before coupling. This feature of the Vorbis format is not
 171 a convenient accident.<p>
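To make the normalization concrete, here is a minimal Python sketch,
not taken from libvorbis: it subtracts a per-channel floor from each
channel's log spectrum, band by band.  The band values and the helper
name <tt>residue</tt> are invented for illustration, and real floors
are fitted by the encoder rather than written by hand.<p>

```python
def residue(log_spectrum, log_floor):
    """Subtract a channel's floor from its log spectrum, band by band."""
    if len(log_spectrum) != len(log_floor):
        raise ValueError("spectrum and floor must cover the same bands")
    return [s - f for s, f in zip(log_spectrum, log_floor)]


# Hypothetical log-magnitude values for three frequency bands.  The source
# is mixed louder on the left, so the raw channel spectra differ per band.
left_spectrum, left_floor = [10, 7, 3], [8, 6, 4]
right_spectrum, right_floor = [4, 5, 1], [2, 4, 2]

left_residue = residue(left_spectrum, left_floor)
right_residue = residue(right_spectrum, right_floor)

print(left_residue)   # [2, 1, -1]
print(right_residue)  # [2, 1, -1]
```

Although the raw spectra differ in every band, the two residues come
out identical once each channel's own envelope is removed, which is
the situation coupling is designed to exploit.<p>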
 172
 173 Because we strive to maximally correlate the left and right channels
 174 and generally succeed in doing so, left and right residue is typically
 175 nearly identical.  We could use channel interleaving (discussed below)
 176 alone to efficiently remove the redundancy between the left and right
 177 channels as a side effect of entropy encoding, but a polar
 178 representation gives benefits when left/right correlation is
 179 strong. <p>
 180
 181 <h4>point and diffuse imaging</h4>
 182
 183 The first advantage of a polar representation is that it effectively
 184 separates the spatial audio information into a 'point image'
 185 (magnitude) at a given frequency and located somewhere in the sound
 186 field, and a 'diffuse image' (angle) that fills a large amount of
 187 space simultaneously.  Even if we preserve only the magnitude (point)
 188 data, a detailed and carefully chosen floor function in each channel
 189 provides us with a free, fine-grained, frequency relative intensity
 190 stereo*.  Angle information represents diffuse sound fields, such as
 191 reverberation that fills the entire space simultaneously.<p>
 192
 193 *<em>Because the Vorbis model supports a number of different possible
 194 stereo models and these models may be mixed, we do not use the term
 195 'intensity stereo' when talking about Vorbis; instead we use the terms
 196 'point stereo', 'phase stereo' and subcategories of each.</em><p>
 197
 198 The majority of a stereo image is representable by polar magnitude
 199 alone, as strong sounds tend to be produced at near-point sources;
 200 even non-diffuse, fast, sharp echoes track very accurately using
 201 magnitude representation almost alone (for those experimenting with
 202 Vorbis tuning, this strategy works much better with the precise,
 203 piecewise control of floor 1; the continuous approximation of floor 0
 204 results in unstable imaging).  Reverberation and diffuse sounds tend
 205 to contain less energy and be psychoacoustically dominated by the
 206 point sources embedded in them.  Thus, we again tend to concentrate
 207 more represented energy into a predictably smaller number of values.
 208 Separating representation of point and diffuse imaging also allows us
 209 to model and manipulate point and diffuse qualities separately.<p>
 210
 211 <h4>controlling bit leakage and symbol crosstalk</h4> Because polar
 212 representation concentrates represented energy into fewer large
 213 values, we reduce bit 'leakage' during cascading (multistage VQ
 214 encoding) as a secondary benefit.  A single large, monolithic VQ
 215 codebook is more efficient than a cascaded book due to entropy
 216 'crosstalk' among symbols between different stages of a multistage cascade.
 217 Polar representation is a way of further concentrating entropy into
 218 predictable locations so that codebook design can take steps to
 219 improve multistage codebook efficiency.  It also allows us to cascade
 220 various elements of the stereo image independently.<p>
 221
 222 <h4>eliminating trigonometry and rounding</h4>
 223
 224 Rounding and computational complexity are potential problems with a
 225 polar representation. As our encoding process involves quantization,
 226 mixing a polar representation and quantization makes it potentially
 227 impossible, depending on implementation, to construct a coupled stereo
 228 mechanism that results in bit-identical decompressed output compared
 229 to an uncoupled encoding should the encoder desire it.<p>
 230
 231 Vorbis uses a mapping that preserves the most useful qualities of
 232 polar representation, relies only on addition/subtraction, and makes
 233 it trivial before or after quantization to represent an
 234 angle/magnitude through a one-to-one mapping from possible left/right
 235 value permutations.  We do this by basing our polar representation on
 236 the unit square rather than the unit circle.<p>
 237
 238 Given a magnitude and angle, we recover left and right using the
 239 following function (note that A/B may be left/right or right/left
 240 depending on the coupling definition used by the encoder):<p>
 241
 242 <pre>
 243       if(magnitude>0){
 244         if(angle>0){
 245           A=magnitude;
 246           B=magnitude-angle;
 247         }else{
 248           B=magnitude;
 249           A=magnitude+angle;
 250         }
 251       }else{
 252         if(angle>0){
 253           A=magnitude;
 254           B=magnitude+angle;
 255         }else{
 256           B=magnitude;
 257           A=magnitude-angle;
 258         }
 259       }
 260 </pre>
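The decode function above can be paired with a matching forward
mapping.  The Python sketch below shows one possible forward rule,
chosen here for illustration and not copied from libvorbis: the value
with the larger absolute value becomes the magnitude (B wins ties),
and the angle is the signed difference.  The names
<tt>square_polar_encode</tt> and <tt>square_polar_decode</tt> are
hypothetical; both functions handle A and B in that order.<p>

```python
def square_polar_decode(magnitude, angle):
    """Recover (A, B) from (magnitude, angle); mirrors the function above."""
    if magnitude > 0:
        if angle > 0:
            return magnitude, magnitude - angle
        return magnitude + angle, magnitude
    if angle > 0:
        return magnitude, magnitude + angle
    return magnitude - angle, magnitude


def square_polar_encode(a, b):
    """Map (A, B) to (magnitude, angle).

    Assumed forward rule: the value with the larger absolute value becomes
    the magnitude, and B wins ties.
    """
    if abs(a) > abs(b):
        magnitude = a
    else:
        magnitude = b
    angle = a - b if magnitude > 0 else b - a
    return magnitude, angle


# The worked example from the text: magnitude 5, angle -2.
print(square_polar_decode(5, -2))   # (3, 5)
print(square_polar_encode(3, 5))    # (5, -2)
```

Under this rule, encoding and then decoding returns the original pair
for every integer A/B combination in -5 through 5, using only
comparison, addition and subtraction.<p>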
 261
 262 The function is antisymmetric for positive and negative magnitudes in
 263 order to eliminate a redundant value when quantizing.  For example, if
 264 we're quantizing to integer values, we can visualize a magnitude of 5
 265 and an angle of -2 as follows:<p>
 266
 267 <img src="squarepolar.png">
 268
 269 <p>
 270 This representation loses or replicates no values; if A and B each
 271 range over the integers -5 through 5, the number of possible Cartesian
 272 combinations is 121.  Represented in square polar notation, the
 273 possible values are:
 274
 275 <pre>
 276  0, 0
 277
 278 -1,-2  -1,-1  -1, 0  -1, 1
 279
 280  1,-2   1,-1   1, 0   1, 1
 281
 282 -2,-4  -2,-3  -2,-2  -2,-1  -2, 0  -2, 1  -2, 2  -2, 3
 283
 284  2,-4   2,-3   ... following the pattern ...
 285
 286  ...    5, 1   5, 2   5, 3   5, 4   5, 5   5, 6   5, 7   5, 8   5, 9
 287
 288 </pre>
 289
 290 ...for a grand total of 121 possible values, the same number as in
 291 Cartesian representation (note that, for example, <tt>5,-10</tt> is
 292 the same as <tt>-5,10</tt>, so there's no reason to represent
 293 both; <tt>2,10</tt> cannot happen, and there's no reason to account for it).
 294 It's also obvious that this mapping is exactly reversible.<p>
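The count of 121 can be checked by brute force.  This Python sketch
assumes a hypothetical forward rule (the larger absolute value becomes
the magnitude, B wins ties, and the angle is the signed difference);
it is an illustration rather than the libvorbis encoder, and the
function name is invented for the example.<p>

```python
def square_polar_encode(a, b):
    """Assumed forward rule: larger absolute value is the magnitude."""
    magnitude = a if abs(a) > abs(b) else b
    angle = a - b if magnitude > 0 else b - a
    return magnitude, angle


values = range(-5, 6)
polar = {square_polar_encode(a, b) for a in values for b in values}
print(len(polar))   # 121


def angles_for(magnitude):
    """All angles that occur together with the given magnitude."""
    return sorted(angle for m, angle in polar if m == magnitude)


print(angles_for(1))    # [-2, -1, 0, 1]
print(angles_for(-2))   # [-4, -3, -2, -1, 0, 1, 2, 3]
```

The angle lists for magnitudes 1 and -2 match the rows written out
above, and every distinct A/B pair lands on its own magnitude/angle
pair.<p>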
 295
 296 <h3>Channel interleaving</h3>
 297
 298 We can remap an A/B vector using polar mapping into a magnitude/angle
 299 vector, and it's clear that, in general, this concentrates energy in
 300 the magnitude vector and reduces the amount of information to encode
 301 in the angle vector.  Encoding these vectors independently with
 302 residue backend #0 or residue backend #1 will result in substantial
 303 bitrate savings.  However, there are still implicit correlations
 304 between the magnitude and angle vectors.  The most obvious is that the
 305 amplitude of the angle is bounded by its corresponding magnitude
 306 value.<p>
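A small numeric illustration of both points, energy concentration and
the bound on the angle, is sketched below in Python.  The residue
values are made up, and <tt>square_polar_encode</tt> is a hypothetical
forward rule (larger absolute value becomes the magnitude, B wins
ties), not the libvorbis implementation.<p>

```python
def square_polar_encode(a, b):
    """Assumed forward rule: larger absolute value is the magnitude."""
    magnitude = a if abs(a) > abs(b) else b
    angle = a - b if magnitude > 0 else b - a
    return magnitude, angle


# Invented, strongly correlated residue vectors for the two channels.
vector_a = [5, 4, -3, 2, 0, -6]
vector_b = [5, 3, -3, 2, 1, -6]

pairs = [square_polar_encode(a, b) for a, b in zip(vector_a, vector_b)]
magnitude = [m for m, _ in pairs]
angle = [t for _, t in pairs]

print(magnitude)   # [5, 4, -3, 2, 1, -6]
print(angle)       # [0, 1, 0, 0, -1, 0]

# The angle never exceeds twice its magnitude in absolute value.
bounded = all(2 * abs(m) >= abs(t) for m, t in pairs)
print(bounded)     # True
```

Nearly all of the signal ends up in the magnitude vector, while the
angle vector is mostly zeros and therefore cheap to code.<p>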
 307
 308 Entropy coding the results, then, further benefits from the entropy
 309 model being able to compress magnitude and angle simultaneously.  For
 310 this reason, Vorbis implements residue backend #2, which preinterleaves
 311 a number of input vectors (in the stereo case, two, A and B) into a
 312 single output vector (with the elements in the order of
 313 A_0, B_0, A_1, B_1, A_2 ... A_n-1, B_n-1) before entropy encoding.  Thus
 314 each vector to be coded by the vector quantization backend consists of
 315 matching magnitude and angle values.<p>
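The ordering described above can be sketched in a few lines of
Python.  This models only the element order A_0, B_0, A_1, B_1 and so
on; it says nothing about the codebook partitioning or entropy coding
that the residue backend performs afterwards, and the function names
are invented for the example.<p>

```python
def interleave(*vectors):
    """Interleave equal-length vectors element by element."""
    lengths = {len(v) for v in vectors}
    if len(lengths) > 1:
        raise ValueError("all vectors must have the same length")
    return [value for group in zip(*vectors) for value in group]


def deinterleave(flat, channels):
    """Split an interleaved vector back into `channels` separate vectors."""
    return [flat[start::channels] for start in range(channels)]


magnitude = [5, 4, -3, 2]
angle = [0, 1, 0, 0]

combined = interleave(magnitude, angle)
print(combined)                    # [5, 0, 4, 1, -3, 0, 2, 0]
print(deinterleave(combined, 2))   # [[5, 4, -3, 2], [0, 1, 0, 0]]
```

Because <tt>interleave</tt> accepts any number of vectors, the same
sketch covers more than two channels.<p>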
 316
 317 The astute reader, at this point, will notice that in the theoretical
 318 case in which we can use monolithic codebooks of arbitrarily large
 319 size, we can directly interleave and encode left and right without
 320 polar mapping; in fact, the polar mapping does not appear to lend any
 321 benefit whatsoever to the efficiency of the entropy coding.  In fact,
 322 it is perfectly possible and reasonable to build a Vorbis encoder that
 323 dispenses with polar mapping entirely and merely interleaves the
 324 channels.  Libvorbis-based encoders may configure such an encoding and
 325 it will work as intended.<p>
 326
 327 However, when we leave the ideal/theoretical domain, we notice that
 328 polar mapping does give additional practical benefits, as discussed in
 329 the above section on polar mapping and summarized again here:<p>
 330 <ul>
 331 <li>Polar mapping aids in controlling entropy 'leakage' between stages
 332 of a cascaded codebook.
 333 <li>Polar mapping separates the stereo image into point and diffuse
 334 components which may be analyzed and handled differently.
 335 </ul>
 336
 337 <h2>Stereo Models</h2>
 338
 339 <h3>Dual Stereo</h3>
 340
 341 Dual stereo refers to stereo encoding where the channels are entirely
 342 separate; they are analyzed and encoded as entirely distinct entities.
 343 This terminology is familiar from mp3.<p>
 344
 345 <h3>Lossless Stereo</h3>
 346
 347 Using polar mapping and/or channel interleaving, it's possible to
 348 couple Vorbis channels losslessly, that is, construct a stereo
 349 coupling encoding that both saves space and decodes
 350 bit-identically to dual stereo.  OggEnc 1.0 and later offers this
 351 mode.<p>
 352
 353 Overall, this stereo mode is overkill; however, it offers a safe
 354 alternative to users concerned about the slightest possible
 355 degradation to the stereo image or archival quality audio.<p>
 356
 357 <h3>Phase Stereo</h3>
 358
 359 Phase stereo is the least aggressive means of gracefully dropping
 360 resolution from the stereo image; it affects only diffuse imaging.<p>
 361
 362 It's often quoted that the human ear is nearly entirely deaf to signal
 363 phase above about 4kHz; this is nearly true and a passable rule of
 364 thumb, but it can be demonstrated that even an average user can tell
 365 the difference between high frequency in-phase and out-of-phase noise.
 366 Obviously then, the statement is not entirely true.  However, it's
 367 also the case that one must resort to nearly such an extreme
 368 demonstration before finding the counterexample.<p>
 369
 370 'Phase stereo' is simply a more aggressive quantization of the polar
 371 angle vector; above 4kHz it's generally quite safe to quantize noise
 372 and noisy elements to only a handful of allowed phases.  The phases of
 373 high amplitude pure tones may or may not be preserved more carefully
 374 (they are relatively rare and L/R tend to be in phase, so there is
 375 generally little reason not to spend a few more bits on them).<p>
 376
 377 <h4>eight phase stereo</h4>
 378
 379 Vorbis implements phase stereo coupling by preserving the entirety of the magnitude vector (essential to fine amplitude and energy resolution overall) and quantizing the angle vector to one of only four possible values. Given that the magnitude vector may be positive or negative, this results in left and right phase having eight possible permutations, thus 'eight phase stereo':<p>
 380
 381 <img src="eightphase.png"><p>
 382
 383 Left and right may be in phase (positive or negative), the most common
 384 case by far, or out of phase by 90 or 180 degrees.<p>
 385
 386 <h4>four phase stereo</h4>
 387
 388 Four phase stereo takes the quantization one step further; it allows
 389 only in-phase and 180 degree out-of-phase signals:<p>
 390
 391 <img src="fourphase.png"><p>
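The eight and four phase cases can be enumerated with the decode
function given earlier.  The Python sketch below rests on an
assumption this text does not confirm: that the four allowed angle
values are 0, plus and minus the absolute magnitude, and minus twice
the absolute magnitude, corresponding to in phase, 90 degrees either
way, and 180 degrees.  The constant and function names are invented
for the example.<p>

```python
def square_polar_decode(magnitude, angle):
    """Recover (A, B) from a square polar (magnitude, angle) pair."""
    if magnitude > 0:
        if angle > 0:
            return magnitude, magnitude - angle
        return magnitude + angle, magnitude
    if angle > 0:
        return magnitude, magnitude + angle
    return magnitude - angle, magnitude


# Assumed angle choices, expressed as multiples of abs(magnitude).
EIGHT_PHASE_STEPS = (0, 1, -1, -2)
FOUR_PHASE_STEPS = (0, -2)


def phase_points(magnitude, angle_steps):
    """Decode every allowed angle for both signs of the magnitude."""
    points = []
    for m in (magnitude, -magnitude):
        for step in angle_steps:
            points.append(square_polar_decode(m, step * abs(m)))
    return points


print(phase_points(5, EIGHT_PHASE_STEPS))
# [(5, 5), (5, 0), (0, 5), (-5, 5), (-5, -5), (-5, 0), (0, -5), (5, -5)]
print(phase_points(5, FOUR_PHASE_STEPS))
# [(5, 5), (-5, 5), (-5, -5), (5, -5)]
```

Under that assumption, four angle values and two magnitude signs give
eight distinct A/B points, and dropping the 90 degree angles leaves
the four in-phase and out-of-phase points.<p>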
 392
 393 <h3>Point Stereo</h3>
 394
 395 Point stereo eliminates the possibility of out-of-phase signal
 396 entirely.  Any diffuse quality to a sound source tends to collapse
 397 inward to a point somewhere within the stereo image.  A practical
 398 example would be balanced reverberations within a large, live space;
 399 normally the sound is diffuse and soft, giving a sonic impression of
 400 volume.  In point-stereo, the reverberations would still exist, but
 401 sound fairly firmly centered within the image (assuming the
 402 reverberation was centered overall; if the reverberation is stronger
 403 to the left, then the point of localization in point stereo would be
 404 to the left).  This effect is most noticeable at low and mid
 405 frequencies and using headphones (which grant perfect stereo
 406 separation). Point stereo is a graceful but generally easy to
 407 detect degradation to the sound quality and is thus used in frequency
 408 ranges where it is least noticeable.<p>
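As a rough Python sketch of why point stereo still localizes sound:
with the angle fixed at zero, both channels receive the same residue,
and only the per-channel floors differ.  Floors are modeled here as
plain linear gains per band, which is a simplification, and all values
and names are invented for the example.<p>

```python
def point_stereo_decode(magnitude, left_floor, right_floor):
    """Rebuild both channels from one shared magnitude vector.

    Each channel is the shared residue scaled by that channel's own floor.
    """
    left = [m * f for m, f in zip(magnitude, left_floor)]
    right = [m * f for m, f in zip(magnitude, right_floor)]
    return left, right


magnitude = [3, -2, 1]
left_floor = [4.0, 2.0, 1.0]    # first band is louder on the left
right_floor = [1.0, 2.0, 4.0]   # last band is louder on the right

left, right = point_stereo_decode(magnitude, left_floor, right_floor)
print(left)    # [12.0, -4.0, 1.0]
print(right)   # [3.0, -4.0, 4.0]
```

The two outputs always share a sign in each band, so nothing is out of
phase, yet the level differences still place each band somewhere
between left and right.<p>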
 409
 410 <h3>Mixed Stereo</h3>
 411
 412 Mixed stereo is the simultaneous use of more than one of the above
 413 stereo encoding models, generally using more aggressive modes in
 414 higher frequencies, lower amplitudes or 'nearly' in-phase sound.<p>
 415
 416 It is also the case that near-DC frequencies should be encoded using
 417 lossless coupling to avoid frame blocking artifacts.<p>
 418
 419 <h3>Vorbis Stereo Modes</h3>
 420
 421 Vorbis, for the most part, uses lossless stereo and a number of mixed
 422 modes constructed out of the above models.  As of the current pre-1.0
 423 testing version of the encoder, oggenc supports the following modes.
 424 Oggenc's default choice varies by bitrate and each mode is selectable
 425 by the user:<p>
 426
 427 <dl>
 428 <dt>dual stereo
 429 <dd>uncoupled stereo encoding<p>
 430
 431 <dt>lossless stereo
 432 <dd>lossless stereo coupling; produces exactly equivalent output to dual stereo<p>
 433
 434 <dt>eight phase stereo
 435 <dd>A mixed mode combining lossless stereo for frequencies to approximately 4 kHz (and for all strong pure tones) and eight phase stereo above.<p>
 436
 437 <dt>aggressive eight phase stereo
 438 <dd>A mixed mode combining lossless stereo for frequencies to approximately 2 kHz (and for all strong pure tones) and eight phase stereo above.<p>
 439
 440 <dt>eight phase/point stereo <dd>A mixed mode combining lossless stereo
 441 for bass, eight phase stereo for noisy content and lossless stereo for
 442 tones to approximately 4kHz and point stereo above 4kHz.<p>
 443
 444 <dt>aggressive eight phase/point stereo
 445 <dd>A mixed mode combining lossless stereo
 446 for bass, eight phase stereo to approximately 2kHz and point stereo above 2kHz.<p>
 447
 448 <dt>point stereo
 449 <dd>A mixed mode combining lossless stereo to approximately 4kHz and point stereo above 4kHz.<p>
 450
 451 <dt>aggressive point stereo
 452 <dd>A mixed mode combining lossless stereo to approximately 1-2kHz and point stereo above.<p>
 453
 454 </dl>
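The simpler entries in this list can be read as a crossover table.
The Python sketch below is a hypothetical illustration, not oggenc's
configuration format: it covers only the modes for which the list
gives a single crossover frequency, treats the approximate figures as
exact, and uses invented names throughout.<p>

```python
# mode name: (lossless up to this frequency in Hz,
#             model used above it,
#             whether strong pure tones stay lossless above it)
MODES = {
    "eight phase stereo": (4000, "eight phase", True),
    "aggressive eight phase stereo": (2000, "eight phase", True),
    "point stereo": (4000, "point", False),
}


def coupling_model(mode, frequency_hz, strong_pure_tone=False):
    """Pick the stereo model for one frequency under a given mode."""
    crossover, upper_model, keeps_tones = MODES[mode]
    if frequency_hz > crossover:
        if strong_pure_tone and keeps_tones:
            return "lossless"
        return upper_model
    return "lossless"


print(coupling_model("eight phase stereo", 1000))         # lossless
print(coupling_model("eight phase stereo", 8000))         # eight phase
print(coupling_model("eight phase stereo", 8000, True))   # lossless
print(coupling_model("point stereo", 8000))               # point
```

The eight phase/point modes are left out because the list does not say
where 'bass' ends, and aggressive point stereo is left out because its
crossover is given only as a range.<p>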
 455
 456 <hr>
 457 <a href="http://www.xiph.org/">
 458 <img src="white-xifish.png" align=left border=0>
 459 </a>
 460 <font size=-2 color=#505050>
 461
 462 Ogg is a <a href="http://www.xiph.org">Xiphophorus</a> effort to
 463 protect essential tenets of Internet multimedia from corporate
 464 hostage-taking; Open Source is the net's greatest tool to keep
 465 everyone honest. See <a href="http://www.xiph.org/about.html">About
 466 Xiphophorus</a> for details.
 467 <p>
 468
 469 Ogg Vorbis is the first Ogg audio CODEC.  Anyone may
 470 freely use and distribute the Ogg and Vorbis specification,
 471 whether in a private, public or corporate capacity.  However,
 472 Xiphophorus and the Ogg project (xiph.org) reserve the right to set
 473 the Ogg/Vorbis specification and certify specification compliance.<p>
 474
 475 Xiphophorus's Vorbis software CODEC implementation is distributed
 476 under a BSD-like License.  This does not restrict third parties from
 477 distributing independent implementations of Vorbis software under
 478 other licenses.<p>
 479
 480 OggSquish, Vorbis, Xiphophorus and their logos are trademarks (tm) of
 481 <a href="http://www.xiph.org/">Xiphophorus</a>.  These pages are
 482 copyright (C) 1994-2001 Xiphophorus. All rights reserved.<p>
 483
 484 </body>
 485
 486
 487
 488
 489
 490