# Stereoscopic & Multiview Video Handling

There are two cases to handle:

 - Encoded video output from a demuxer to parser / decoder or from encoders
   into a muxer.

 - Raw video buffers

The design below is somewhat based on the proposals from
[bug 611157](https://bugzilla.gnome.org/show_bug.cgi?id=611157).

Multiview is used as a generic term that covers both
stereo content (left and right eye only) and extensions for videos
containing multiple independent viewpoints.

## Encoded Signalling

This covers the signalling in caps and buffers from demuxers to
parsers (sometimes) or out from encoders.

For backward compatibility with existing codecs, many transports of
stereoscopic 3D content use normal 2D video with 2 views packed spatially
in some way, and put extra new descriptions in the container/mux.

Info in the demuxer seems to apply to stereo encodings only. For all
MVC methods I know, the multiview encoding is in the video bitstream itself
and therefore already available to decoders. Only stereo systems have been
retro-fitted into the demuxer.

Also, sometimes extension descriptions are in the codec (e.g. H.264 SEI FPA packets)
and it would be useful to be able to put the info onto caps and buffers from the
parser without decoding.

To handle both cases, we need to be able to output the required details on
encoded video for decoders to apply onto the raw video buffers they decode.

*If there ever is a need to transport multiview info for encoded data, the
same system below for raw video or some variation should work*

### Encoded Video: Properties that need to be encoded into caps

1. multiview-mode (called "Channel Layout" in bug 611157)
    * Whether a stream is mono, for a single eye, stereo, mixed-mono-stereo
      (switches between mono and stereo - mp4 can do this)
    * Uses a buffer flag to mark individual buffers as mono or "not mono"
      (single|stereo|multiview) for mixed scenarios. The alternative (not
      proposed) is for the demuxer to switch caps for each mono to not-mono
      change, and not use a 'mixed' caps variant at all.
    * _single_ refers to a stream of buffers that only contain 1 view.
      It is different from mono in that the stream is marked as a left or right
      eye stream for later combining in a mixer or when displaying.
    * _multiple_ marks a stream with multiple independent views encoded.
      It is included in this list for completeness. As noted above, there's
      currently no scenario that requires marking encoded buffers as MVC.

2. Frame-packing arrangements / view sequence orderings
    * Possible frame packings: side-by-side, side-by-side-quincunx,
      column-interleaved, row-interleaved, top-bottom, checker-board
    * bug 611157 - sreerenj added side-by-side-full and top-bottom-full but
      I think that's covered by suitably adjusting pixel-aspect-ratio. If
      not, they can be added later.
    * _top-bottom_, _side-by-side_, _column-interleaved_, _row-interleaved_ are as the names suggest.
    * _checker-board_: samples are left/right pixels in a chess grid `+-+-+-/-+-+-+`
    * _side-by-side-quincunx_: side-by-side packing, but with quincunx sampling -
      the 1 pixel offset of each eye needs to be accounted for when upscaling or displaying
    * there may be other packings (future expansion)
    * Possible view sequence orderings: frame-by-frame, frame-primary-secondary-tracks, sequential-row-interleaved
    * _frame-by-frame_: each buffer is left, then right view etc
    * _frame-primary-secondary-tracks_: the file has 2 video tracks (primary and secondary), one is left eye, one is right.
      Demuxer info indicates which one is which.
      Handling this means marking each stream as all-left and all-right views, decoding separately, and combining automatically (inserting a mixer/combiner in playbin)
      -> *Leave this for future expansion*
    * _sequential-row-interleaved_: mentioned by sreerenj in bug patches, but I can't find a mention of such a thing elsewhere. Maybe it's in MPEG-2
      -> *Leave this for future expansion / deletion*

3. view encoding order
    * Describes how to decide which piece of each frame corresponds to left or right eye
    * Possible orderings: left, right, left-then-right, right-then-left
    - Need to figure out how we find the correct frame in the demuxer to start decoding when seeking in frame-sequential streams
    - Need a buffer flag for marking the first buffer of a group.

4. "Frame layout flags"
    * flags for view specific interpretation
    * horizontal-flip-left, horizontal-flip-right, vertical-flip-left, vertical-flip-right:
      indicates that one or more views have been encoded in a flipped orientation, usually due to a camera with a mirror or displays with mirrors.
    * This should be an actual flags field. Registered GLib flags types aren't generally well supported in our caps - the type might not be loaded/registered yet when parsing a caps string, so they can't be used in caps templates in the registry.
    * It might be better just to use a hex value / integer
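
As a rough illustration of items 2 and 3, the sketch below works out which region of a packed frame belongs to each eye for the two simplest packings. The `Packing` enum, the `ViewRect` struct and `view_rect()` are hypothetical names made up for this example, not existing GStreamer API, and the sketch assumes each view occupies exactly half of the frame.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration only; not the GStreamer API. */
typedef enum {
  PACKING_SIDE_BY_SIDE,
  PACKING_TOP_BOTTOM
} Packing;

typedef struct {
  int x, y, width, height;
} ViewRect;

/* Return the region of a packed frame that holds the requested eye.
 * want_right selects the right-eye view; right_view_first says that the
 * right view occupies the first (left or top) half of the frame. */
static ViewRect
view_rect (Packing packing, int frame_width, int frame_height,
    bool want_right, bool right_view_first)
{
  /* The wanted view is in the second half unless it is the one stored first */
  bool second_half = (want_right != right_view_first);
  ViewRect r = { 0, 0, frame_width, frame_height };

  assert (frame_width > 0 && frame_height > 0);

  if (packing == PACKING_SIDE_BY_SIDE) {
    r.width = frame_width / 2;
    r.x = second_half ? r.width : 0;
  } else {
    r.height = frame_height / 2;
    r.y = second_half ? r.height : 0;
  }
  return r;
}
```

The interleaved and checker-board packings cannot be described by a rectangle like this, which is one reason they need more than a simple helper.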

## Buffer representation for raw video

 - Transported as normal video buffers with extra metadata
 - The caps define the overall buffer width/height, with helper functions to
   extract the individual views for packed formats
 - pixel-aspect-ratio adjusted if needed to double the overall width/height
 - video sinks that don't know about multiview extensions yet will show the
   packed view as-is. For frame-sequence outputs, things might look weird, but
   just adding multiview-mode to the sink caps can disallow those transports.
 - _row-interleaved_ packing is actually just the side-by-side memory layout
   described with half the frame width and twice the height, so it can be
   handled by adjusting the overall caps and strides
 - Other exotic layouts need new pixel formats defined (checker-board,
   column-interleaved, side-by-side-quincunx)
 - _Frame-by-frame_ - one view per buffer, but with alternating metas marking
   which buffer is which left/right/other view and using a new buffer flag as
   described above to mark the start of a group of corresponding frames.
 - New video caps addition as for encoded buffers
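
The row-interleaved point can be checked with a small self-contained sketch: reading a row-interleaved plane of W x H as side-by-side gives a plane of 2W x H/2 over the same memory, and vice versa. `PlaneView` and both functions are hypothetical helpers for a single plane of 1-byte samples, and the reinterpretation only holds when rows are tightly packed, which is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Describes how to walk a plane of 1-byte samples (hypothetical type). */
typedef struct {
  const uint8_t *data;
  int width;      /* samples per row */
  int height;     /* number of rows */
  size_t stride;  /* bytes between the starts of consecutive rows */
} PlaneView;

/* Reinterpret a row-interleaved plane as side-by-side without copying:
 * each pair of interleaved rows becomes one row of twice the width.
 * Requires tightly packed rows (stride == width), otherwise row padding
 * would end up in the middle of the reinterpreted row. */
static PlaneView
row_interleaved_as_side_by_side (PlaneView in)
{
  PlaneView out;

  assert (in.height % 2 == 0);
  assert (in.stride == (size_t) in.width);

  out.data = in.data;
  out.width = in.width * 2;
  out.height = in.height / 2;
  out.stride = in.stride * 2;
  return out;
}

/* Read one sample from a plane using its own width/stride description. */
static uint8_t
sample_at (PlaneView p, int x, int y)
{
  return p.data[(size_t) y * p.stride + (size_t) x];
}
```

No pixel data moves here; only the description of the buffer changes, which is what adjusting the caps and strides amounts to.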

### Proposed Caps fields

Combining the requirements above and collapsing the combinations into mnemonics:

* multiview-mode =
   mono | left | right | sbs | sbs-quin | col | row | topbot | checkers |
   frame-by-frame | mixed-sbs | mixed-sbs-quin | mixed-col | mixed-row |
   mixed-topbot | mixed-checkers | mixed-frame-by-frame | multiview-frames |
   mixed-multiview-frames

* multiview-flags =
    + 0x0000 none
    + 0x0001 right-view-first
    + 0x0002 left-h-flipped
    + 0x0004 left-v-flipped
    + 0x0008 right-h-flipped
    + 0x0010 right-v-flipped
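
If multiview-flags travels in caps as a plain hex value, reading it back could look roughly like this. The numeric values are the ones listed above; the `MV_FLAG_*` names and `mv_flags_from_string()` are invented for this sketch and are not existing GStreamer API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Values copied from the list above; identifier names are hypothetical. */
typedef enum {
  MV_FLAG_NONE             = 0x0000,
  MV_FLAG_RIGHT_VIEW_FIRST = 0x0001,
  MV_FLAG_LEFT_H_FLIPPED   = 0x0002,
  MV_FLAG_LEFT_V_FLIPPED   = 0x0004,
  MV_FLAG_RIGHT_H_FLIPPED  = 0x0008,
  MV_FLAG_RIGHT_V_FLIPPED  = 0x0010
} MvFlags;

/* Mask of all bits defined so far */
#define MV_FLAG_ALL 0x001fu

/* Parse a flags value written as a hex string such as "0x0009".
 * Returns false on malformed input or if unknown bits are set. */
static bool
mv_flags_from_string (const char *str, unsigned int *flags)
{
  char *end = NULL;
  unsigned long v;

  assert (str != NULL && flags != NULL);

  v = strtoul (str, &end, 16);
  if (end == str || *end != '\0' || (v & ~(unsigned long) MV_FLAG_ALL) != 0)
    return false;

  *flags = (unsigned int) v;
  return true;
}
```

Rejecting unknown bits is a choice made for the example; a real parser might prefer to ignore them so that older elements keep working when new flags are added.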

### Proposed new buffer flags

Add two new `GST_VIDEO_BUFFER_*` flags in video-frame.h and make it clear that
those flags can apply to encoded video buffers too. wtay says that's currently
the case anyway, but the documentation should say it.

 - **`GST_VIDEO_BUFFER_FLAG_MULTIPLE_VIEW`** - Marks a buffer as representing
   non-mono content, although it may be a single (left or right) eye view.

 - **`GST_VIDEO_BUFFER_FLAG_FIRST_IN_BUNDLE`** - For frame-sequential methods of
   transport, marks the "first" of a left/right/other group of frames.
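
A minimal sketch of how the two flags might be used on a frame-by-frame stream, including the seeking concern raised earlier (finding a buffer where decoding can start on a bundle boundary). The flag bit values and both function names are placeholders chosen for this example; real values would be allocated alongside the other `GST_VIDEO_BUFFER_*` flags.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder bit values for this sketch only. */
#define FLAG_MULTIPLE_VIEW   (1u << 0)
#define FLAG_FIRST_IN_BUNDLE (1u << 1)

/* Set flags for n_frames consecutive buffers of a frame-by-frame stream
 * that carries views_per_bundle views per time instant. */
static void
mark_bundles (unsigned int *flags, size_t n_frames, size_t views_per_bundle)
{
  size_t i;

  assert (flags != NULL && views_per_bundle > 0);

  for (i = 0; i < n_frames; i++) {
    flags[i] = FLAG_MULTIPLE_VIEW;
    if (i % views_per_bundle == 0)
      flags[i] |= FLAG_FIRST_IN_BUNDLE;
  }
}

/* After a seek, find the first index at or after pos that starts a
 * bundle. Returns n_frames if there is none. */
static size_t
next_bundle_start (const unsigned int *flags, size_t n_frames, size_t pos)
{
  while (pos < n_frames && !(flags[pos] & FLAG_FIRST_IN_BUNDLE))
    pos++;
  return pos;
}
```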

### A new GstMultiviewMeta

This provides a place to describe all provided views in a buffer / stream,
and through Meta negotiation to inform decoders about which views to decode if
not all are wanted.

* Logical labels/names and mapping to `GstVideoMeta` numbers
* Standard view labels LEFT/RIGHT, and non-standard ones (strings)

```c
/* Standard view labels */
enum {
    GST_VIDEO_MULTIVIEW_VIEW_LEFT = 1,
    GST_VIDEO_MULTIVIEW_VIEW_RIGHT = 2
};

typedef struct {
    guint view_label;
    guint meta_id; /* id of the GstVideoMeta for this view */

    gpointer _gst_reserved[GST_PADDING]; /* padding for future expansion */
} GstVideoMultiviewViewInfo;

typedef struct {
    guint n_views;
    GstVideoMultiviewViewInfo *view_info; /* array of n_views entries */
} GstVideoMultiviewMeta;
```

The meta is optional, and probably only useful later for MVC.
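
To show how a consumer might use such a meta, here is a sketch of looking up the `GstVideoMeta` id for a given view label. It uses simplified stand-in types (`ViewInfo`, `MultiviewMeta`, a local `guint` typedef) so that it compiles without GLib; those names and `multiview_meta_find()` are assumptions for illustration and not a proposed API.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int guint;   /* stand-in for GLib's guint */

enum {
  VIEW_LEFT = 1,
  VIEW_RIGHT = 2
};

/* Simplified mirrors of the structures sketched above */
typedef struct {
  guint view_label;
  guint meta_id;              /* id of the GstVideoMeta for this view */
} ViewInfo;

typedef struct {
  guint n_views;
  const ViewInfo *view_info;
} MultiviewMeta;

/* Find the video meta id for a view label. Returns 1 and stores the id
 * on success, 0 if the buffer does not carry that view. */
static int
multiview_meta_find (const MultiviewMeta *meta, guint label, guint *meta_id)
{
  guint i;

  assert (meta != NULL && meta_id != NULL);

  for (i = 0; i < meta->n_views; i++) {
    if (meta->view_info[i].view_label == label) {
      *meta_id = meta->view_info[i].meta_id;
      return 1;
    }
  }
  return 0;
}
```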

## Outputting stereo content

The initial implementation for output will be stereo content in glimagesink.

### Output Considerations with OpenGL

 - If we have support for stereo GL buffer formats, we can output separate
   left/right eye images and let the hardware take care of display.

 - Otherwise, glimagesink needs to render one window with left/right in a
   suitable frame packing and that will only show correctly in fullscreen on a
   device set for the right 3D packing -> requires app intervention to set the
   video mode.

 - That could be done manually on the TV, or with HDMI 1.4 by setting the
   right video mode for the screen to inform the TV. As a third option, we
   could support rendering to two separate overlay areas on the screen - one
   for left eye, one for right - which can be supported using the 'splitter'
   element and two output sinks or, better, by adding a 2nd window overlay for
   split stereo output.

 - Intel hardware doesn't do stereo GL buffers - only nvidia and AMD, so
   initial implementation won't include that

## Other elements for handling multiview content

 - videooverlay interface extensions
   - __Q__: Should this be a new interface?
   - Element message to communicate the presence of stereoscopic information to the app
   - App needs to be able to override the input interpretation - i.e., set multiview-mode and multiview-flags
     - Most videos I've seen are side-by-side or top-bottom with no frame-packing metadata
   - New API for the app to set rendering options for stereo/multiview content
   - This might be best implemented as a **multiview GstContext**, so that
     the pipeline can share app preferences for content interpretation and downmixing
     to mono for output, or in the sink, and have those applied as far upstream/downstream as possible.

 - Converter element
   - convert different view layouts
   - Render to anaglyphs of different types (magenta/green, red/blue, etc) and output as mono

 - Mixer element
   - take 2 video streams and output as stereo
   - later take n video streams
   - share code with the converter, it just takes input from n pads instead of one.

 - Splitter element
   - Output one pad per view
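
As a rough idea of what the converter's anaglyph output involves, the sketch below builds a red/cyan anaglyph on the CPU from two already separated views. Red/cyan is picked here only because it is the simplest channel split to show; the other anaglyph types mentioned above would mix channels differently. The function name and the packed 3-bytes-per-pixel RGB layout are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine two packed RGB views (3 bytes per pixel, same size) into a
 * red/cyan anaglyph: red comes from the left eye, green and blue from
 * the right eye. */
static void
anaglyph_red_cyan (const uint8_t *left, const uint8_t *right, uint8_t *out,
    size_t n_pixels)
{
  size_t i;

  assert (left != NULL && right != NULL && out != NULL);

  for (i = 0; i < n_pixels; i++) {
    out[3 * i + 0] = left[3 * i + 0];   /* R from left view */
    out[3 * i + 1] = right[3 * i + 1];  /* G from right view */
    out[3 * i + 2] = right[3 * i + 2];  /* B from right view */
  }
}
```

The result is an ordinary mono frame, so it can be shown by sinks that know nothing about multiview.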

### Implementing MVC handling in decoders / parsers (and encoders)

Things to do to implement MVC handling:

1. Parsing SEI in h264parse and setting caps (patches available in
   bugzilla for parsing, see below)
2. Integrate gstreamer-vaapi MVC support with this proposal
3. Help with [libav MVC implementation](https://wiki.libav.org/Blueprint/MVC)
4. Generating SEI in H.264 encoder
5. Support for MPEG2 MVC extensions

## Relevant bugs

 - [bug 685215](https://bugzilla.gnome.org/show_bug.cgi?id=685215) - codecparser h264: Add initial MVC parser
 - [bug 696135](https://bugzilla.gnome.org/show_bug.cgi?id=696135) - h264parse: Add mvc stream parsing support
 - [bug 732267](https://bugzilla.gnome.org/show_bug.cgi?id=732267) - h264parse: extract base stream from MVC or SVC encoded streams

## Other Information

[Matroska 3D support notes](http://www.matroska.org/technical/specs/notes.html#3D)

## Open Questions

### Background

### Representation for GstGL

When uploading raw video frames to GL textures, the goal is to implement:

Split packed frames into separate GL textures when uploading, and
attach multiple `GstGLMemory` to the `GstBuffer`. The multiview-mode and
multiview-flags fields in the caps should change to reflect the conversion
from one incoming `GstMemory` to multiple `GstGLMemory`, and change the
width/height in the output info as needed.

This is (currently) targeted as 2 render passes - upload as normal
to a single stereo-packed RGBA texture, and then unpack into 2
smaller textures, output with `GST_VIDEO_MULTIVIEW_MODE_SEPARATED`, as
2 `GstGLMemory` attached to one buffer. We can optimise the upload later
to go directly to 2 textures for common input formats.
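
The unpack step is done in GL, but what it does can be shown with a CPU-side analogue for the side-by-side case. This is only a sketch of the data movement, not the GL code; the function name, the tightly packed outputs and the byte-based stride parameter are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Split a side-by-side packed RGBA frame into two half-width frames.
 * packed_stride is in bytes; left and right are tightly packed and must
 * each hold (packed_width / 2) * height pixels. */
static void
unpack_side_by_side_rgba (const uint8_t *packed, size_t packed_stride,
    int packed_width, int height, uint8_t *left, uint8_t *right)
{
  const size_t view_row_bytes = (size_t) (packed_width / 2) * 4;
  int y;

  assert (packed_width % 2 == 0);
  assert (packed_stride >= 2 * view_row_bytes);

  for (y = 0; y < height; y++) {
    const uint8_t *row = packed + (size_t) y * packed_stride;

    /* First half of each packed row is the left view, second half the right */
    memcpy (left + (size_t) y * view_row_bytes, row, view_row_bytes);
    memcpy (right + (size_t) y * view_row_bytes, row + view_row_bytes,
        view_row_bytes);
  }
}
```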

Separate output textures have a few advantages:

 - Filter elements can more easily apply filters in several passes to each
   texture without fundamental changes to our filters to avoid mixing pixels
   from separate views.

 - Centralises the sampling of input video frame packings in the upload code,
   which makes adding new packings in the future easier.

 - Sampling multiple textures to generate various output frame-packings
   for display is conceptually simpler than converting from any input packing
   to any output packing.

 - In implementations that support quad buffers, having separate textures
   makes it trivial to do `GL_LEFT`/`GL_RIGHT` output

For either option, we'll need new glsink output API to pass more
information to applications about multiple views for the draw signal/callback.

I don't know if it's desirable to support *both* methods of representing
views. If so, that should be signalled in the caps too. That could be a
new multiview-mode for passing views in separate `GstMemory` objects
attached to a `GstBuffer`, which would not be GL specific.

### Overriding frame packing interpretation

Most sample videos available are frame packed, with no metadata
to say so. How should we override that interpretation?

 - Simple answer: Use capssetter + new properties on playbin to
   override the multiview fields. *Basically implemented in playbin, using*
   *a pad probe. Needs more work for completeness*

### Adding extra GstVideoMeta to buffers

There should be one `GstVideoMeta` for the entire video frame in packed
layouts, and one `GstVideoMeta` per `GstGLMemory` when views are attached
to a `GstBuffer` separately. This should be done by the buffer pool,
which knows the layout from the caps.

### videooverlay interface extensions

GstVideoOverlay needs:

- A way to announce the presence of multiview content when it is
  detected/signalled in a stream.
- A way to tell applications which output methods are supported/available
- A way to tell the sink which output method it should use
- Possibly a way to tell the sink to override the input frame
  interpretation / caps - depends on the answer to the question
  above about how to model overriding input interpretation.

### What's implemented

- Caps handling
- gst-plugins-base libgstvideo pieces
- playbin caps overriding
- conversion elements - glstereomix, gl3dconvert (needs a rename),
  glstereosplit.

### Possible future enhancements

- Make GLupload split to separate textures at upload time?
  - Needs new API to extract multiple textures from the upload. Currently only outputs 1 result RGBA texture.
- Make GLdownload able to take 2 input textures, pack them and colorconvert / download as needed.
  - currently done by packing then downloading, which isn't acceptable overhead for RGBA download
- Think about how we integrate GLstereo - do we need to do anything special,
  or can the app just render to stereo/quad buffers if they're available?