doc/stereo.html

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\r
   2 <html>\r
   3 <head>\r
   4 \r
   5 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15"/>\r
   6 <title>Ogg Vorbis Documentation</title>\r
   7 \r
   8 <style type="text/css">\r
   9 body {\r
  10   margin: 0 18px 0 18px;\r
  11   padding-bottom: 30px;\r
  12   font-family: Verdana, Arial, Helvetica, sans-serif;\r
  13   color: #333333;\r
  14   font-size: .8em;\r
  15 }\r
  16 \r
  17 a {\r
  18   color: #3366cc;\r
  19 }\r
  20 \r
  21 img {\r
  22   border: 0;\r
  23 }\r
  24 \r
  25 #xiphlogo {\r
  26   margin: 30px 0 16px 0;\r
  27 }\r
  28 \r
  29 #content p {\r
  30   line-height: 1.4;\r
  31 }\r
  32 \r
  33 h1, h1 a, h2, h2 a, h3, h3 a, h4, h4 a {\r
  34   font-weight: bold;\r
  35   color: #ff9900;\r
  36   margin: 1.3em 0 8px 0;\r
  37 }\r
  38 \r
  39 h1 {\r
  40   font-size: 1.3em;\r
  41 }\r
  42 \r
  43 h2 {\r
  44   font-size: 1.2em;\r
  45 }\r
  46 \r
  47 h3 {\r
  48   font-size: 1.1em;\r
  49 }\r
  50 \r
  51 li {\r
  52   line-height: 1.4;\r
  53 }\r
  54 \r
  55 #copyright {\r
  56   margin-top: 30px;\r
  57   line-height: 1.5em;\r
  58   text-align: center;\r
  59   font-size: .8em;\r
  60   color: #888888;\r
  61   clear: both;\r
  62 }\r
  63 </style>\r
  64 \r
  65 </head>\r
  66 \r
  67 <body>\r
  68 \r
  69 <div id="xiphlogo">\r
  70   <a href="http://www.xiph.org/"><img src="fish_xiph_org.png" alt="Fish Logo and Xiph.org"/></a>\r
  71 </div>\r
  72 \r
  73 <h1>Ogg Vorbis stereo-specific channel coupling discussion</h1>\r
  74 \r
  75 <h2>Abstract</h2>\r
  76 \r
  77 <p>The Vorbis audio CODEC provides channel coupling\r
  78 mechanisms designed to reduce effective bitrate by both eliminating\r
  79 interchannel redundancy and eliminating stereo image information\r
  80 labeled inaudible or undesirable according to spatial psychoacoustic\r
  81 models. This document describes the mechanical coupling\r
  82 mechanisms available within the Vorbis specification, as well as the\r
  83 specific stereo coupling models used by the reference\r
  84 <tt>libvorbis</tt> codec provided by xiph.org.</p>\r
  85 \r
  86 <h2>Mechanisms</h2>\r
  87 \r
  88 <p>In encoder release beta 4 and earlier, Vorbis supported multiple\r
  89 channel encoding, but the channels were encoded entirely separately\r
  90 with no cross-analysis or redundancy elimination between channels.\r
  91 This multichannel strategy is very similar to mp3's <em>dual\r
  92 stereo</em> mode and Vorbis uses the same name for its analogous\r
  93 uncoupled multichannel modes.</p>\r
  94 \r
  95 <p>However, the Vorbis spec provides for, and Vorbis release 1.0 rc1 and\r
  96 later implement, a coupled channel strategy. Vorbis has two specific\r
  97 mechanisms that may be used alone or in conjunction to implement\r
  98 channel coupling. The first is <em>channel interleaving</em> via\r
  99 residue backend type 2, and the second is <em>square polar\r
 100 mapping</em>. These two general mechanisms are particularly well\r
 101 suited to coupling due to the structure of Vorbis encoding, as we'll\r
 102 explore below, and using both we can implement both totally\r
 103 <em>lossless stereo image coupling</em> [bit-for-bit decode-identical\r
 104 to uncoupled modes], as well as various lossy models that seek to\r
 105 eliminate inaudible or unimportant aspects of the stereo image in\r
 106 order to reduce bitrate. The exact coupling implementation is\r
 107 generalized to allow the encoder a great deal of flexibility in\r
 108 implementation of a stereo or surround model without requiring any\r
 109 significant complexity increase over the combinatorially simpler\r
 110 mid/side joint stereo of mp3 and other current audio codecs.</p>\r
 111 \r
 112 <p>A particular Vorbis bitstream may apply channel coupling directly to\r
 113 more than a pair of channels; polar mapping is hierarchical such that\r
 114 polar coupling may be extrapolated to an arbitrary number of channels\r
 115 and is not restricted to only stereo, quadraphonics, ambisonics or 5.1\r
 116 surround. However, the scope of this document restricts itself to the\r
 117 stereo coupling case.</p>\r
 118 \r
 119 <h3>Square Polar Mapping</h3>\r
 120 \r
 121 <h4>maximal correlation</h4>\r
 122  \r
 123 <p>Recall that the basic structure of a Vorbis I stream first generates\r
 124 from input audio a spectral 'floor' function that serves as an\r
 125 MDCT-domain whitening filter. This floor is meant to represent the\r
 126 rough envelope of the frequency spectrum, using whatever metric the\r
 127 encoder cares to define. This floor is subtracted from the log\r
 128 frequency spectrum, effectively normalizing the spectrum by frequency.\r
 129 Each input channel is associated with a unique floor function.</p>\r
 130 \r
 131 <p>The basic idea behind any stereo coupling is that the left and right\r
 132 channels usually correlate. This correlation is even stronger if one\r
 133 first accounts for energy differences in any given frequency band\r
 134 across left and right; think for example of individual instruments\r
 135 mixed into different portions of the stereo image, or a stereo\r
 136 recording with a dominant feature not perfectly in the center. The\r
 137 floor functions, each specific to a channel, provide the perfect means\r
 138 of normalizing left and right energies across the spectrum to maximize\r
 139 correlation before coupling. This feature of the Vorbis format is not\r
 140 a convenient accident.</p>\r
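<p>As a rough illustration of the normalization described above, the
following C sketch divides each channel by its own floor curve, which is
equivalent to subtracting the floor in the log domain. The function name
<tt>remove_floor</tt> and the use of linear-amplitude float arrays are
assumptions made for this example; they are not libvorbis API. When one
channel is a scaled copy of the other and each floor tracks its own
channel, the two residues come out identical, which is the correlation
that coupling goes on to exploit.</p>

```c
#include <stddef.h>

/* Hypothetical helper (not libvorbis API): whiten one channel by dividing
   each linear-amplitude bin by that channel's floor curve.  Dividing here
   corresponds to subtracting the floor from the log spectrum.
   Assumes floor_curve[i] > 0 for every bin. */
void remove_floor(const float *spectrum, const float *floor_curve,
                  size_t n, float *residue)
{
    for (size_t i = 0; i < n; i++)
        residue[i] = spectrum[i] / floor_curve[i];
}
```

<p>For instance, a right channel at one quarter of the left channel's
amplitude, paired with a floor that is also one quarter as large, yields
the same residue as the left channel.</p>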
 141 \r
 142 <p>Because we strive to maximally correlate the left and right channels\r
 143 and generally succeed in doing so, left and right residue is typically\r
 144 nearly identical. We could use channel interleaving (discussed below)\r
 145 alone to efficiently remove the redundancy between the left and right\r
 146 channels as a side effect of entropy encoding, but a polar\r
 147 representation gives benefits when left/right correlation is\r
 148 strong.</p>\r
 149 \r
 150 <h4>point and diffuse imaging</h4>\r
 151 \r
 152 <p>The first advantage of a polar representation is that it effectively\r
 153 separates the spatial audio information into a 'point image'\r
 154 (magnitude) at a given frequency and located somewhere in the sound\r
 155 field, and a 'diffuse image' (angle) that fills a large amount of\r
 156 space simultaneously. Even if we preserve only the magnitude (point)\r
 157 data, a detailed and carefully chosen floor function in each channel\r
 158 provides us with a free, fine-grained, frequency relative intensity\r
 159 stereo*. Angle information represents diffuse sound fields, such as\r
 160 reverberation that fills the entire space simultaneously.</p>\r
 161 \r
 162 <p>*<em>Because the Vorbis model supports a number of different possible\r
 163 stereo models and these models may be mixed, we do not use the term\r
 164 'intensity stereo' when talking about Vorbis; instead we use the terms\r
 165 'point stereo', 'phase stereo' and subcategories of each.</em></p>\r
 166 \r
 167 <p>The majority of a stereo image is representable by polar magnitude\r
 168 alone, as strong sounds tend to be produced at near-point sources;\r
 169 even non-diffuse, fast, sharp echoes track very accurately using\r
 170 magnitude representation almost alone (for those experimenting with\r
 171 Vorbis tuning, this strategy works much better with the precise,\r
 172 piecewise control of floor 1; the continuous approximation of floor 0\r
 173 results in unstable imaging). Reverberation and diffuse sounds tend\r
 174 to contain less energy and be psychoacoustically dominated by the\r
 175 point sources embedded in them. Thus, we again tend to concentrate\r
 176 more represented energy into a predictably smaller number of values.\r
 177 Separating representation of point and diffuse imaging also allows us\r
 178 to model and manipulate point and diffuse qualities separately.</p>\r
 179 \r
 180 <h4>controlling bit leakage and symbol crosstalk</h4>\r
 181 \r
 182 <p>Because polar\r
 183 representation concentrates represented energy into fewer large\r
 184 values, we reduce bit 'leakage' during cascading (multistage VQ\r
 185 encoding) as a secondary benefit. A single large, monolithic VQ\r
 186 codebook is more efficient than a cascaded book due to entropy\r
 187 'crosstalk' among symbols between different stages of a multistage cascade.\r
 188 Polar representation is a way of further concentrating entropy into\r
 189 predictable locations so that codebook design can take steps to\r
 190 improve multistage codebook efficiency. It also allows us to cascade\r
 191 various elements of the stereo image independently.</p>\r
 192 \r
 193 <h4>eliminating trigonometry and rounding</h4>\r
 194 \r
 195 <p>Rounding and computational complexity are potential problems with a\r
 196 polar representation. As our encoding process involves quantization,\r
 197 mixing a polar representation and quantization makes it potentially\r
 198 impossible, depending on implementation, to construct a coupled stereo\r
 199 mechanism that results in bit-identical decompressed output compared\r
 200 to an uncoupled encoding should the encoder desire it.</p>\r
 201 \r
 202 <p>Vorbis uses a mapping that preserves the most useful qualities of\r
 203 polar representation, relies only on addition/subtraction (during\r
 204 decode; high quality encoding still requires some trig), and makes it\r
 205 trivial before or after quantization to represent an angle/magnitude\r
 206 through a one-to-one mapping from possible left/right value\r
 207 permutations. We do this by basing our polar representation on the\r
 208 unit square rather than the unit circle.</p>\r
 209 \r
 210 <p>Given a magnitude and angle, we recover left and right using the\r
 211 following function (note that A/B may be left/right or right/left\r
 212 depending on the coupling definition used by the encoder):</p>\r
 213 \r
 214 <pre>\r
 215       if(magnitude>0){\r
 216         if(angle>0){            /* A carries the magnitude */\r
 217           A=magnitude;\r
 218           B=magnitude-angle;\r
 219         }else{                  /* B carries the magnitude */\r
 220           B=magnitude;\r
 221           A=magnitude+angle;\r
 222         }\r
 223       }else{                    /* mirrored signs for magnitude<=0 */\r
 224         if(angle>0){\r
 225           A=magnitude;\r
 226           B=magnitude+angle;\r
 227         }else{\r
 228           B=magnitude;\r
 229           A=magnitude-angle;\r
 230         }\r
 231       }\r
 232 </pre>\r
 233 \r
 234 <p>The function is antisymmetric for positive and negative magnitudes in\r
 235 order to eliminate a redundant value when quantizing. For example, if\r
 236 we're quantizing to integer values, we can visualize a magnitude of 5\r
 237 and an angle of -2 as follows:</p>\r
 238 \r
 239 <p><img src="squarepolar.png" alt="square polar"/></p>\r
 240 \r
 241 <p>This representation loses or replicates no values; if A and B each\r
 242 range over the integers -5 through 5, the number of possible Cartesian\r
 243 permutations is 121. Represented in square polar notation, the\r
 244 possible values are:</p>\r
 245 \r
 246 <pre>\r
 247  0, 0\r
 248 \r
 249 -1,-2  -1,-1  -1, 0  -1, 1\r
 250 \r
 251  1,-2   1,-1   1, 0   1, 1\r
 252 \r
 253 -2,-4  -2,-3  -2,-2  -2,-1  -2, 0  -2, 1  -2, 2  -2, 3  \r
 254 \r
 255  2,-4   2,-3   ... following the pattern ...\r
 256 \r
 257  ...   5, 1   5, 2   5, 3   5, 4   5, 5   5, 6   5, 7   5, 8   5, 9\r
 258 \r
 259 </pre>\r
 260 \r
 261 <p>...for a grand total of 121 possible values, the same number as in\r
 262 Cartesian representation (note that, for example, <tt>5,-10</tt> is\r
 263 the same as <tt>-5,10</tt>, so there's no reason to represent\r
 264 both. 2,10 cannot happen, and there's no reason to account for it.)\r
 265 It's also obvious that this mapping is exactly reversible.</p>\r
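<p>The reversibility claim can be checked mechanically. The C sketch below
pairs the decode rule given earlier with one possible integer encoder and
counts how many Cartesian pairs survive a round trip. The names
<tt>sp_decode</tt>, <tt>sp_encode</tt> and <tt>sp_count_round_trips</tt>
are hypothetical, and the encoder is an illustrative inverse written for
this document rather than the libvorbis implementation; in particular,
the tie-breaking rule for |A| == |B| is an assumption chosen so that
A == -B always lands on the representation with angle -2|m|.</p>

```c
#include <stdlib.h>

/* Decode: square polar (m, a) -> Cartesian (A, B); mirrors the decode
   rule given earlier in this document. */
void sp_decode(int m, int a, int *A, int *B)
{
    if (m > 0) {
        if (a > 0) { *A = m; *B = m - a; }
        else       { *B = m; *A = m + a; }
    } else {
        if (a > 0) { *A = m; *B = m + a; }
        else       { *B = m; *A = m - a; }
    }
}

/* Encode: Cartesian (A, B) -> square polar (m, a).  Illustrative inverse;
   ties with |A| == |B| are given to B so that A == -B always maps to the
   single representation with angle -2*|m|. */
void sp_encode(int A, int B, int *m, int *a)
{
    if (abs(A) > abs(B)) {
        *m = A;
        *a = (A > 0) ? A - B : B - A;   /* always > 0 in this branch */
    } else {
        *m = B;
        *a = (B > 0) ? A - B : B - A;   /* always <= 0 in this branch */
    }
}

/* Count the pairs in [-range, range]^2 that survive encode -> decode and
   whose angle stays within -2|m| .. 2|m|-1 (exactly 0 when m == 0). */
int sp_count_round_trips(int range)
{
    int ok = 0;
    for (int A = -range; A <= range; A++) {
        for (int B = -range; B <= range; B++) {
            int m, a, A2, B2;
            sp_encode(A, B, &m, &a);
            sp_decode(m, a, &A2, &B2);
            int in_bounds = (m == 0) ? (a == 0)
                                     : (a >= -2 * abs(m) && a < 2 * abs(m));
            if (A2 == A && B2 == B && in_bounds)
                ok++;
        }
    }
    return ok;
}
```

<p>With a range of 5, every one of the 121 pairs round-trips, so the
encoder never maps two different pairs to the same magnitude/angle
value. Decoding a magnitude of 5 and an angle of -2 gives A=3, B=5.</p>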
 266 \r
 267 <h3>Channel interleaving</h3>\r
 268 \r
 269 <p>We can remap an A/B vector using polar mapping into a magnitude/angle\r
 270 vector, and it's clear that, in general, this concentrates energy in\r
 271 the magnitude vector and reduces the amount of information to encode\r
 272 in the angle vector. Encoding these vectors independently with\r
 273 residue backend #0 or residue backend #1 will result in bitrate\r
 274 savings. However, there are still implicit correlations between the\r
 275 magnitude and angle vectors. The most obvious is that the amplitude\r
 276 of the angle is bounded by its corresponding magnitude value.</p>\r
 277 \r
 278 <p>Entropy coding the results, then, further benefits from the entropy\r
 279 model being able to compress magnitude and angle simultaneously. For\r
 280 this reason, Vorbis implements residue backend #2 which pre-interleaves\r
 281 a number of input vectors (in the stereo case, two, A and B) into a\r
 282 single output vector (with the elements in the order of\r
 283 A_0, B_0, A_1, B_1, A_2 ... A_n-1, B_n-1) before entropy encoding. Thus\r
 284 each vector to be coded by the vector quantization backend consists of\r
 285 matching magnitude and angle values.</p>\r
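<p>The interleaving order described above can be sketched in a few lines
of C. The helper name <tt>interleave2</tt> and the use of plain
<tt>int</tt> arrays are assumptions for illustration only; the actual
residue backend works on the codec's own residue vectors and partitions
them for the codebooks.</p>

```c
#include <stddef.h>

/* Hypothetical helper: interleave two n-element vectors as
   a[0], b[0], a[1], b[1], ... into out, which must hold 2*n elements. */
void interleave2(const int *a, const int *b, size_t n, int *out)
{
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = a[i];   /* even slots: first vector  */
        out[2 * i + 1] = b[i];   /* odd slots:  second vector */
    }
}
```

<p>Passing the magnitude vector as <tt>a</tt> and the angle vector as
<tt>b</tt> places each magnitude next to its matching angle, so a
two-dimensional codebook entry covers one coupled pair.</p>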
 286 \r
 287 <p>The astute reader, at this point, will notice that in the theoretical\r
 288 case in which we can use monolithic codebooks of arbitrarily large\r
 289 size, we can directly interleave and encode left and right without\r
 290 polar mapping; in fact, the polar mapping does not appear to lend any\r
 291 benefit whatsoever to the efficiency of the entropy coding. In fact,\r
 292 it is perfectly possible and reasonable to build a Vorbis encoder that\r
 293 dispenses with polar mapping entirely and merely interleaves the\r
 294 channels. Libvorbis-based encoders may configure such an encoding and\r
 295 it will work as intended.</p>\r
 296 \r
 297 <p>However, when we leave the ideal/theoretical domain, we notice that\r
 298 polar mapping does give additional practical benefits, as discussed in\r
 299 the above section on polar mapping and summarized again here:</p>\r
 300 \r
 301 <ul>\r
 302 <li>Polar mapping aids in controlling entropy 'leakage' between stages\r
 303 of a cascaded codebook.</li>\r
 304 <li>Polar mapping separates the stereo image\r
 305 into point and diffuse components which may be analyzed and handled\r
 306 differently.</li>\r
 307 </ul>\r
 308 \r
 309 <h2>Stereo Models</h2>\r
 310 \r
 311 <h3>Dual Stereo</h3>\r
 312 \r
 313 <p>Dual stereo refers to stereo encoding where the channels are entirely\r
 314 separate; they are analyzed and encoded as entirely distinct entities.\r
 315 This terminology is familiar from mp3.</p>\r
 316 \r
 317 <h3>Lossless Stereo</h3>\r
 318 \r
 319 <p>Using polar mapping and/or channel interleaving, it's possible to\r
 320 couple Vorbis channels losslessly, that is, construct a stereo\r
 321 coupling encoding that both saves space and decodes\r
 322 bit-identically to dual stereo. OggEnc 1.0 and later use this\r
 323 mode in all high-bitrate encoding.</p>\r
 324 \r
 325 <p>Overall, this stereo mode is overkill; however, it offers a safe\r
 326 alternative to users concerned about the slightest possible\r
 327 degradation to the stereo image or archival quality audio.</p>\r
 328 \r
 329 <h3>Phase Stereo</h3>\r
 330 \r
 331 <p>Phase stereo is the least aggressive means of gracefully dropping\r
 332 resolution from the stereo image; it affects only diffuse imaging.</p>\r
 333 \r
 334 <p>It's often quoted that the human ear is deaf to signal phase above\r
 335 about 4kHz; this is nearly true and a passable rule of thumb, but it\r
 336 can be demonstrated that even an average user can tell the difference\r
 337 between high frequency in-phase and out-of-phase noise. Obviously\r
 338 then, the statement is not entirely true. However, it's also the case\r
 339 that one must resort to nearly such an extreme demonstration before\r
 340 finding the counterexample.</p>\r
 341 \r
 342 <p>'Phase stereo' is simply a more aggressive quantization of the polar\r
 343 angle vector; above 4kHz it's generally quite safe to quantize noise\r
 344 and noisy elements to only a handful of allowed phases, or to thin the\r
 345 phase with respect to the magnitude. The phases of high amplitude\r
 346 pure tones may or may not be preserved more carefully (they are\r
 347 relatively rare and L/R tend to be in phase, so there is generally\r
 348 little reason not to spend a few more bits on them).</p>\r
 349 \r
 350 <h4>example: eight phase stereo</h4>\r
 351 \r
 352 <p>Vorbis may implement phase stereo coupling by preserving the entirety\r
 353 of the magnitude vector (essential to fine amplitude and energy\r
 354 resolution overall) and quantizing the angle vector to one of only\r
 355 four possible values. Given that the magnitude vector may be positive\r
 356 or negative, this results in left and right phase having eight\r
 357 possible permutations, thus 'eight phase stereo':</p>\r
 358 \r
 359 <p><img src="eightphase.png" alt="eight phase"/></p>\r
 360 \r
 361 <p>Left and right may be in phase (positive or negative), the most common\r
 362 case by far, or out of phase by 90 or 180 degrees.</p>\r
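<p>One way to see where the eight phases come from is to enumerate them.
The C sketch below assumes, as one possible reading of the description
above, that the four allowed angle values for a magnitude of absolute
value k are 0, k, -k and -2k (in phase, 90 degrees either way, and 180
degrees). That angle set and the names <tt>decode_square_polar</tt> and
<tt>eight_phase_points</tt> are assumptions for illustration, not taken
from libvorbis.</p>

```c
/* Same decode rule as given earlier in this document. */
static void decode_square_polar(int m, int a, int *A, int *B)
{
    if (m > 0) {
        if (a > 0) { *A = m; *B = m - a; }
        else       { *B = m; *A = m + a; }
    } else {
        if (a > 0) { *A = m; *B = m + a; }
        else       { *B = m; *A = m - a; }
    }
}

/* Fill out[8][2] with the (A, B) pairs reachable from magnitudes +k and -k
   when the angle is restricted to the assumed set {0, k, -k, -2k}. */
void eight_phase_points(int k, int out[8][2])
{
    const int mags[2]   = { k, -k };
    const int angles[4] = { 0, k, -k, -2 * k };
    int n = 0;
    for (int s = 0; s < 2; s++) {
        for (int j = 0; j < 4; j++) {
            decode_square_polar(mags[s], angles[j], &out[n][0], &out[n][1]);
            n++;
        }
    }
}
```

<p>For k=5 this yields eight distinct points on the square of side 10:
the four corners, where A and B are equal or opposite, and the four edge
midpoints, where one channel is zero.</p>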
 363 \r
 364 <h4>example: four phase stereo</h4>\r
 365 \r
 366 <p>Similarly, four phase stereo takes the quantization one step further;\r
 367 it allows only in-phase and 180 degree out-of-phase signals:</p>\r
 368 \r
 369 <p><img src="fourphase.png" alt="four phase"/></p>\r
 370 \r
 371 <h3>Point Stereo</h3>\r
 372 \r
 373 <p>Point stereo eliminates the possibility of out-of-phase signal\r
 374 entirely. Any diffuse quality to a sound source tends to collapse\r
 375 inward to a point somewhere within the stereo image. A practical\r
 376 example would be balanced reverberations within a large, live space;\r
 377 normally the sound is diffuse and soft, giving a sonic impression of\r
 378 volume. In point-stereo, the reverberations would still exist, but\r
 379 sound fairly firmly centered within the image (assuming the\r
 380 reverberation was centered overall; if the reverberation is stronger\r
 381 to the left, then the point of localization in point stereo would be\r
 382 to the left). This effect is most noticeable at low and mid\r
 383 frequencies and using headphones (which grant perfect stereo\r
 384 separation). Point stereo is a graceful but generally easy to\r
 385 detect degradation to the sound quality and is thus used in frequency\r
 386 ranges where it is least noticeable.</p>\r
 387 \r
 388 <h3>Mixed Stereo</h3>\r
 389 \r
 390 <p>Mixed stereo is the simultaneous use of more than one of the above\r
 391 stereo encoding models, generally using more aggressive modes in\r
 392 higher frequencies, lower amplitudes or 'nearly' in-phase sound.</p>\r
 393 \r
 394 <p>It is also the case that near-DC frequencies should be encoded using\r
 395 lossless coupling to avoid frame blocking artifacts.</p>\r
 396 \r
 397 <h3>Vorbis Stereo Modes</h3>\r
 398 \r
 399 <p>Vorbis, as of 1.0, uses lossless stereo and a number of mixed modes\r
 400 constructed out of lossless and point stereo. Phase stereo was used\r
 401 in the rc2 encoder, but is not currently used for simplicity's sake. It\r
 402 will likely be re-added to the stereo model in the future.</p>\r
 403 \r
 404 <div id="copyright">\r
 405   The Xiph Fish Logo is a\r
 406   trademark (&trade;) of Xiph.Org.<br/>\r
 407 \r
 408   These pages &copy; 1994 - 2005 Xiph.Org. All rights reserved.\r
 409 </div>\r
 410 \r
 411 </body>\r
 412 </html>\r
 413 \r
 414 \r
 415 \r
 416 \r
 417 \r
 418 \r