docs/isl/tiling.rst

   1 Tiling
   2 ======
   3
   4 The naive view of an image in memory is that the pixels are stored one after
   5 another, usually in X-major order.  An image that is arranged in
   6 this way is called "linear".  Linear images, while easy to reason about, can
   7 have very bad cache locality.  Graphics operations tend to act on pixels that
   8 are close together in 2-D Euclidean space.  If you move one pixel to the right
   9 or left in a linear image, you only move a few bytes to one side or the other
  10 in memory.  However, if you move one pixel up or down you can end up kilobytes
  11 or even megabytes away.
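
To make the distance concrete, the byte offset of a pixel in a linear image can
be sketched as below.  The function name and its parameters are illustrative
only and are not part of ISL.

.. code-block:: c

   #include <stdint.h>

   /* Byte offset of pixel (x, y) in a linear (X-major) image. */
   static inline uint64_t
   linear_offset_B(uint32_t x, uint32_t y,
                   uint32_t row_pitch_B, uint32_t bytes_per_pixel)
   {
      return (uint64_t)y * row_pitch_B + (uint64_t)x * bytes_per_pixel;
   }

For a 1920-pixel-wide RGBA8888 image with a tightly packed pitch of 7680 bytes,
stepping one pixel to the right moves 4 bytes while stepping one pixel down
moves 7680 bytes.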
  12
  13 Tiling (sometimes referred to as swizzling) is a method of re-arranging the
  14 pixels of a surface so that pixels which are close in 2-D Euclidean space are
  15 likely to be close in memory.
  16
  17 Basics
  18 ------
  19
  20 The basic idea of a tiled image is that the image is first divided into
  21 two-dimensional blocks or tiles.  Each tile takes up a chunk of contiguous
  22 memory and the tiles are arranged like pixels in a linear surface.  This is best
  23 demonstrated with a specific example.  Suppose we have an RGBA8888 X-tiled
  24 surface on Intel graphics.  Then the surface is divided into 128x8 pixel tiles
  25 each of which is 4KB of memory.  Within each tile, the pixels are laid out like
  26 a 128x8 linear image.  The tiles themselves are laid out row-major in memory
  27 like giant pixels.  This means that, as long as you don't leave your 128x8
  28 tile, you can move in both dimensions without leaving the same 4K page in
  29 memory.
  30
  31 .. image:: tiling-basic.svg
  32    :alt: Example of an X-tiled image
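
The RGBA8888 X-tiled example can be sketched in code as follows.  This is a
minimal illustration that assumes the 128x8 pixel, 4KB tile described above and
ignores bit-6 swizzling; the function name and the ``width_tl`` parameter (the
surface width in tiles) are hypothetical.

.. code-block:: c

   #include <stdint.h>

   /* Byte offset of pixel (x, y) in an X-tiled RGBA8888 surface that is
    * width_tl tiles wide.  Bit-6 swizzling is ignored.
    */
   static inline uint64_t
   xtiled_rgba8888_offset_B(uint32_t x, uint32_t y, uint32_t width_tl)
   {
      const uint32_t tile_w_px = 128, tile_h_px = 8;
      const uint32_t tile_size_B = 4096, bytes_per_px = 4;

      /* Tiles are laid out row-major, like giant pixels. */
      uint64_t tile_index = (uint64_t)(y / tile_h_px) * width_tl + x / tile_w_px;

      /* Inside a tile, the pixels form a 128x8 linear image. */
      uint32_t in_tile = (y % tile_h_px) * (tile_w_px * bytes_per_px) +
                         (x % tile_w_px) * bytes_per_px;

      return tile_index * tile_size_B + in_tile;
   }

With ``width_tl = 4``, pixel (127, 7) lands at byte 4092, the last pixel of the
first 4K page, while pixel (128, 0) starts the next tile at byte 4096.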
  33
  34 You can, however, do even better than this.  Suppose that same image is,
  35 instead, Y-tiled.  Then the surface is divided into 32x32 pixel tiles each of
  36 which is 4KB of memory.  Within a tile, each 64B cache line corresponds to a 4x4
  37 pixel region of the image (you can think of it as a tile within a tile).  This
  38 means that very small deviations don't even leave the cache line.  This added
  39 bit of pixel shuffling is known to have a substantial performance impact in
  40 most real-world applications.
  41
  42 Intel graphics has several different tiling formats that we'll discuss in
  43 detail in later sections.  The most commonly used as of the writing of this
  44 chapter is Y-tiling.  In all tiling formats the basic principle is the same:
  45 The image is divided into tiles of a particular size and, within those tiles,
  46 the data is re-arranged (or swizzled) based on a particular pattern.  A tile
  47 size will always be specified in bytes by rows and the actual X-dimension of
  48 the tile in elements depends on the size of the element in bytes.
  49
  50 Bit-6 Swizzling
  51 ^^^^^^^^^^^^^^^
  52
  53 On some older hardware, there is an additional address swizzle that is applied
  54 on top of the tiling format.  This has been removed starting with Broadwell
  55 because, as it says in the Broadwell PRM Vol 5 "Tiling Algorithm" (p. 17):
  56
  57    Address Swizzling for Tiled-Surfaces is no longer used because the main
  58    memory controller has a more effective address swizzling algorithm.
  59
  60 Whether or not swizzling is enabled depends on the memory configuration of the
  61 system.  Generally, systems with dual-channel RAM have swizzling enabled and
  62 single-channel do not.  Supposedly, this swizzling allows for better balancing
  63 between the two memory channels and increases performance. Because it depends
  64 on the memory configuration which may change from one boot to the next, it
  65 requires a run-time check.
  66
  67 The best documentation for bit-6 swizzling can be found in the Haswell PRM Vol.
  68 5 "Memory Views" in the section entitled "Address Swizzling for Tiled-Y
  69 Surfaces".  It exists on older platforms but the docs get progressively worse
  70 the further you go back.
  71
  72 ISL Representation
  73 ------------------
  74
  75 The structure of any given tiling format is represented by ISL using the
  76 :cpp:enum:`isl_tiling` enum and the :cpp:struct:`isl_tile_info` structure:
  77
  78 .. doxygenenum:: isl_tiling
  79
  80 .. doxygenfunction:: isl_tiling_get_info
  81
  82 .. doxygenstruct:: isl_tile_info
  83    :members:
  84
  85 The ``isl_tile_info`` structure has two different sizes for a tile: a logical
  86 size in surface elements and a physical size in bytes.  In order to determine
  87 the proper logical size, the bits-per-block of the underlying format has to be
  88 passed into ``isl_tiling_get_info``.  The proper way to compute the size of an
  89 image in bytes given a width and height in elements is as follows:
  90
  91 .. code-block:: c
  92
  93    uint32_t width_tl = DIV_ROUND_UP(width_el * (format_bpb / tile_info.format_bpb),
  94                                     tile_info.logical_extent_el.w);
  95    uint32_t height_tl = DIV_ROUND_UP(height_el, tile_info.logical_extent_el.h);
  96    uint32_t row_pitch = width_tl * tile_info.phys_extent_B.w;
  97    uint32_t size = height_tl * tile_info.phys_extent_B.h * row_pitch;
  98
  99 It is very important to note that there is no direct conversion between
 100 :cpp:member:`isl_tile_info::logical_extent_el` and
 101 :cpp:member:`isl_tile_info::phys_extent_B`.  It is tempting to assume that the
 102 logical and physical heights are the same and simply divide the width of
 103 :cpp:member:`isl_tile_info::phys_extent_B` by the size of the format (which is
 104 what the PRM does) to get :cpp:member:`isl_tile_info::logical_extent_el` but
 105 this is not at all correct. Some tiling formats have logical and physical
 106 heights that differ and so no such calculation will work in general.  The
 107 easiest case study for this is W-tiling. From the Sky Lake PRM Vol. 2d,
 108 "RENDER_SURFACE_STATE" (p. 427):
 109
 110    If the surface is a stencil buffer (and thus has Tile Mode set to
 111    TILEMODE_WMAJOR), the pitch must be set to 2x the value computed based on
 112    width, as the stencil buffer is stored with two rows interleaved.
 113
 114 What does this mean?  Why are we multiplying the pitch by two?  What does it
 115 mean that "the stencil buffer is stored with two rows interleaved"?  The
 116 explanation for all these questions is that a W-tile (which is only used for
 117 stencil) has a logical size of 64el x 64el but a physical size of 128B
 118 x 32rows.  In memory, a W-tile has the same footprint as a Y-tile (128B
 119 x 32rows) but every pair of rows in the stencil buffer is interleaved into
 120 a single row of bytes yielding a two-dimensional area of 64el x 64el.  You can
 121 consider this as its own tiling format or as a modification of Y-tiling.  The
 122 interpretation in the PRMs varies by hardware generation; on Sandy Bridge they
 123 simply said it was Y-tiled but by Sky Lake there is almost no mention of
 124 Y-tiling in connection with stencil buffers and they are always W-tiled.  This
 125 mismatch between logical and physical tile sizes is also relevant for
 126 hierarchical depth buffers as well as single-channel MCS and CCS buffers.
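
The W-tile pitch arithmetic can be worked through with a small sketch.  The
``tile_extents`` structure below is a hypothetical stand-in that only mirrors
the two extents discussed above; it is not the real ``isl_tile_info``.

.. code-block:: c

   #include <stdint.h>

   /* Hypothetical stand-in for the two extents kept in isl_tile_info. */
   struct tile_extents {
      uint32_t logical_w_el, logical_h_el; /* logical size in elements */
      uint32_t phys_w_B, phys_h_rows;      /* physical size in bytes x rows */
   };

   /* W-tile: 64el x 64el logically, 128B x 32rows physically. */
   static const struct tile_extents w_tile = { 64, 64, 128, 32 };

   /* Row pitch in bytes of a surface that is width_el elements wide. */
   static inline uint32_t
   tiled_row_pitch_B(const struct tile_extents *t, uint32_t width_el)
   {
      uint32_t width_tl = (width_el + t->logical_w_el - 1) / t->logical_w_el;
      return width_tl * t->phys_w_B;
   }

For a 256-element-wide stencil buffer this gives 4 tiles x 128B = 512B, which
is the 2x pitch the PRM asks for.  Dividing the 128B physical width by the
1-byte element size would instead suggest 128-element tiles and a 256B pitch.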
 127
 128 X-tiling
 129 --------
 130
 131 The simplest tiling format available on Intel graphics (which has been
 132 available since gen4) is X-tiling.  An X-tile is 512B x 8rows and, within the
 133 tile, the data is arranged in an X-major linear fashion.  You can also look at
 134 X-tiling as being an 8x8 cache line grid where the cache lines are arranged
 135 X-major as follows:
 136
 137 ===== ===== ===== ===== ===== ===== ===== =====
 138 ===== ===== ===== ===== ===== ===== ===== =====
 139 0x000 0x040 0x080 0x0c0 0x100 0x140 0x180 0x1c0
 140 0x200 0x240 0x280 0x2c0 0x300 0x340 0x380 0x3c0
 141 0x400 0x440 0x480 0x4c0 0x500 0x540 0x580 0x5c0
 142 0x600 0x640 0x680 0x6c0 0x700 0x740 0x780 0x7c0
 143 0x800 0x840 0x880 0x8c0 0x900 0x940 0x980 0x9c0
 144 0xa00 0xa40 0xa80 0xac0 0xb00 0xb40 0xb80 0xbc0
 145 0xc00 0xc40 0xc80 0xcc0 0xd00 0xd40 0xd80 0xdc0
 146 0xe00 0xe40 0xe80 0xec0 0xf00 0xf40 0xf80 0xfc0
 147 ===== ===== ===== ===== ===== ===== ===== =====
 148
 149 Each cache line represents a piece of a single row of pixels within the image.
 150 The memory locations of two vertically adjacent pixels within the same X-tile
 151 always differ by 512B or 8 cache lines.
 152
 153 As mentioned above, X-tiling is slower than Y-tiling (though still faster than
 154 linear).  However, until Sky Lake, the display scan-out hardware could only do
 155 X-tiling so we have historically used X-tiling for all window-system buffers
 156 (because X or a Wayland compositor may want to put them in a plane).
 157
 158 Bit-6 Swizzling
 159 ^^^^^^^^^^^^^^^
 160
 161 When bit-6 swizzling is enabled, bits 9 and 10 are XORed in with bit 6 of the
 162 tiled address:
 163
 164 .. code-block:: c
 165
 166    addr[6] ^= addr[9] ^ addr[10];
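
The ``addr[6]`` notation above is shorthand for a single address bit rather
than an array access.  Spelled out with shifts and masks, the same operation
might look like the sketch below; the function name is illustrative and is not
an ISL entry point.

.. code-block:: c

   #include <stdint.h>

   /* Apply X-tiling bit-6 swizzling to a tiled byte address. */
   static inline uint64_t
   swizzle_bit6_x(uint64_t addr)
   {
      uint64_t bit9  = (addr >> 9) & 1;
      uint64_t bit10 = (addr >> 10) & 1;

      /* Flip bit 6 when exactly one of bits 9 and 10 is set. */
      return addr ^ ((bit9 ^ bit10) << 6);
   }

Because the operation is an XOR with bits that it does not modify, applying it
twice returns the original address.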
 167
 168 Y-tiling
 169 --------
 170
 171 The Y-tiling format, also available since gen4, is substantially different from
 172 X-tiling and performs much better in practice.  Each Y-tile is an 8x8 grid of cache lines arranged Y-major as follows:
 173
 174 ===== ===== ===== ===== ===== ===== ===== =====
 175 ===== ===== ===== ===== ===== ===== ===== =====
 176 0x000 0x200 0x400 0x600 0x800 0xa00 0xc00 0xe00
 177 0x040 0x240 0x440 0x640 0x840 0xa40 0xc40 0xe40
 178 0x080 0x280 0x480 0x680 0x880 0xa80 0xc80 0xe80
 179 0x0c0 0x2c0 0x4c0 0x6c0 0x8c0 0xac0 0xcc0 0xec0
 180 0x100 0x300 0x500 0x700 0x900 0xb00 0xd00 0xf00
 181 0x140 0x340 0x540 0x740 0x940 0xb40 0xd40 0xf40
 182 0x180 0x380 0x580 0x780 0x980 0xb80 0xd80 0xf80
 183 0x1c0 0x3c0 0x5c0 0x7c0 0x9c0 0xbc0 0xdc0 0xfc0
 184 ===== ===== ===== ===== ===== ===== ===== =====
 185
 186 Each 64B cache line within the tile is laid out as 4 rows of 16B each:
 187
 188 ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
 189 ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
 190 0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x0a 0x0b 0x0c 0x0d 0x0e 0x0f
 191 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18 0x19 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f
 192 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28 0x29 0x2a 0x2b 0x2c 0x2d 0x2e 0x2f
 193 0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37 0x38 0x39 0x3a 0x3b 0x3c 0x3d 0x3e 0x3f
 194 ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
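
Combining the two tables, the offset of a byte within a Y-tile can be sketched
as follows.  Here ``u`` is the byte column in [0, 128) and ``v`` is the row in
[0, 32); the function name is hypothetical and bit-6 swizzling is not applied.

.. code-block:: c

   #include <stdint.h>

   /* Byte offset within a Y-tile of the byte at column u, row v. */
   static inline uint32_t
   y_tile_offset_B(uint32_t u, uint32_t v)
   {
      /* Each 16B-wide column of the tile occupies 512B (8 cache lines).
       * Within a column, rows are simply stacked 16B apart, so four
       * consecutive rows fill one 64B cache line.
       */
      return ((u >> 4) << 9) | (v << 4) | (u & 0xf);
   }

Moving 16 bytes to the right jumps to the next column of cache lines at 0x200,
while moving down 4 rows only advances to the next cache line at 0x040.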
 195
 196 Y-tiling is widely regarded as being substantially faster than X-tiling so it
 197 is generally preferred.  However, prior to Sky Lake, Y-tiling was not available
 198 for scanout so X-tiling was used for any sort of window-system buffers.
 199 Starting with Sky Lake, we can scan out from Y-tiled buffers.
 200
 201 Bit-6 Swizzling
 202 ^^^^^^^^^^^^^^^
 203
 204 When bit-6 swizzling is enabled, bit 9 is XORed in with bit 6 of the tiled
 205 address:
 206
 207 .. code-block:: c
 208
 209    addr[6] ^= addr[9];
 210
 211 W-tiling
 212 --------
 213
 214 W-tiling is a new tiling format added on Sandy Bridge for use in stencil
 215 buffers.  W-tiling is similar to Y-tiling in that it's arranged as an 8x8
 216 Y-major grid of cache lines.  The bytes within each cache line are arranged as
 217 follows:
 218
 219 ==== ==== ==== ==== ==== ==== ==== ====
 220 ==== ==== ==== ==== ==== ==== ==== ====
 221 0x00 0x01 0x04 0x05 0x10 0x11 0x14 0x15
 222 0x02 0x03 0x06 0x07 0x12 0x13 0x16 0x17
 223 0x08 0x09 0x0c 0x0d 0x18 0x19 0x1c 0x1d
 224 0x0a 0x0b 0x0e 0x0f 0x1a 0x1b 0x1e 0x1f
 225 0x20 0x21 0x24 0x25 0x30 0x31 0x34 0x35
 226 0x22 0x23 0x26 0x27 0x32 0x33 0x36 0x37
 227 0x28 0x29 0x2c 0x2d 0x38 0x39 0x3c 0x3d
 228 0x2a 0x2b 0x2e 0x2f 0x3a 0x3b 0x3e 0x3f
 229 ==== ==== ==== ==== ==== ==== ==== ====
 230
 231 While W-tiling has been required for stencil all the way back to Sandy Bridge,
 232 the docs are somewhat confused as to whether stencil buffers are W- or Y-tiled.
 233 This seems to stem from the fact that the hardware seems to implement W-tiling
 234 as a sort of modified Y-tiling.  One example of this is the somewhat odd
 235 requirement that W-tiled buffers have their pitch multiplied by 2.  From the
 236 Sky Lake PRM Vol. 2d, "RENDER_SURFACE_STATE" (p. 427):
 237
 238    If the surface is a stencil buffer (and thus has Tile Mode set to
 239    TILEMODE_WMAJOR), the pitch must be set to 2x the value computed based on
 240    width, as the stencil buffer is stored with two rows interleaved.
 241
 242 The last phrase holds the key here: "the stencil buffer is stored with two rows
 243 interleaved".  More accurately, a W-tiled buffer can be viewed as a Y-tiled
 244 buffer with each set of 4 W-tiled lines interleaved to form 2 Y-tiled lines. In
 245 ISL, we represent a W-tile as a tiling with a logical dimension of 64el x 64el
 246 but a physical size of 128B x 32rows.  This cleanly takes care of the pitch
 247 issue above and seems to nicely model the hardware.
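
As a rough illustration of the layout described in this section, the offset of
a stencil element within a W-tile can be computed by interleaving the low bits
of its coordinates.  The function name is hypothetical; ``u`` and ``v`` are the
element column and row, each in [0, 64).

.. code-block:: c

   #include <stdint.h>

   /* Byte offset within a W-tile of the 1-byte element at (u, v). */
   static inline uint32_t
   w_tile_offset_B(uint32_t u, uint32_t v)
   {
      /* Which cache line: 8x8 grid arranged Y-major, 8x8 elements each. */
      uint32_t cache_line = ((u >> 3) << 9) | ((v >> 3) << 6);

      /* Byte within the cache line: interleave the low 3 bits of u and v. */
      uint32_t in_line = ((u & 1) << 0) | ((v & 1) << 1) |
                         ((u & 2) << 1) | ((v & 2) << 2) |
                         ((u & 4) << 2) | ((v & 4) << 3);

      return cache_line | in_line;
   }

For example, element (3, 5) maps to 0x27, which matches row 5, column 3 of the
cache line table above.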
 248
 249 Tile4
 250 -----
 251
 252 The tile4 format, introduced on Xe-HP, is somewhat similar to Y but with more
 253 internal shuffling.  Each tile4 tile is an 8x8 grid of cache lines arranged
 254 as follows:
 255
 256 ===== ===== ===== ===== ===== ===== ===== =====
 257 ===== ===== ===== ===== ===== ===== ===== =====
 258 0x000 0x040 0x080 0x0c0 0x200 0x240 0x280 0x2c0
 259 0x100 0x140 0x180 0x1c0 0x300 0x340 0x380 0x3c0
 260 0x400 0x440 0x480 0x4c0 0x600 0x640 0x680 0x6c0
 261 0x500 0x540 0x580 0x5c0 0x700 0x740 0x780 0x7c0
 262 0x800 0x840 0x880 0x8c0 0xa00 0xa40 0xa80 0xac0
 263 0x900 0x940 0x980 0x9c0 0xb00 0xb40 0xb80 0xbc0
 264 0xc00 0xc40 0xc80 0xcc0 0xe00 0xe40 0xe80 0xec0
 265 0xd00 0xd40 0xd80 0xdc0 0xf00 0xf40 0xf80 0xfc0
 266 ===== ===== ===== ===== ===== ===== ===== =====
 267
 268 Each 64B cache line within the tile is laid out the same way as for a Y-tile,
 269 as 4 rows of 16B each:
 270
 271 ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
 272 ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
 273 0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x0a 0x0b 0x0c 0x0d 0x0e 0x0f
 274 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18 0x19 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f
 275 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x27 0x28 0x29 0x2a 0x2b 0x2c 0x2d 0x2e 0x2f
 276 0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37 0x38 0x39 0x3a 0x3b 0x3c 0x3d 0x3e 0x3f
 277 ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
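
A sketch of the tile4 address computation is given below.  It assumes the bit
assignment listed for ``ISL_TILING_4`` in the bit-pattern table later in this
chapter; the function name is hypothetical.  ``u`` is the byte column in
[0, 128) and ``v`` is the row in [0, 32).

.. code-block:: c

   #include <stdint.h>

   /* Byte offset within a tile4 tile of the byte at column u, row v. */
   static inline uint32_t
   tile4_offset_B(uint32_t u, uint32_t v)
   {
      return  (u & 0xf)                  /* u3..u0 -> bits 3..0   */
           | ((v & 0x3) << 4)            /* v1..v0 -> bits 5..4   */
           | (((u >> 4) & 0x3) << 6)     /* u5..u4 -> bits 7..6   */
           | (((v >> 2) & 0x1) << 8)     /* v2     -> bit 8       */
           | (((u >> 6) & 0x1) << 9)     /* u6     -> bit 9       */
           | (((v >> 3) & 0x3) << 10);   /* v4..v3 -> bits 11..10 */
   }

Every cache line address produced this way is a multiple of 64B: moving right
by 16, 32 and 64 bytes yields 0x040, 0x080 and 0x200, and moving down by 4, 8
and 16 rows yields 0x100, 0x400 and 0x800.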
 278
 279 Tiling as a bit pattern
 280 -----------------------
 281
 282 There is one more important angle on tiling that should be discussed before we
 283 finish.  Every tiling can be described by three things:
 284
 285  1. A logical width and height in elements
 286  2. A physical width in bytes and height in rows
 287  3. A mapping from logical elements to physical bytes within the tile
 288
 289 We have spent a good deal of time on the first two because this is what you
 290 really need for doing surface layout calculations.  However, there are cases in
 291 which the map from logical to physical elements is critical.  One example is
 292 W-tiling where we have code to do W-tiled encoding and decoding in the shader
 293 for doing stencil blits because the hardware does not allow us to render to
 294 W-tiled surfaces.
 295
 296 There are many ways to mathematically describe the mapping from logical
 297 elements to physical bytes.  In the PRMs they give a very complicated set of
 298 formulas involving lots of multiplication, modulus, and sums that show you how
 299 to compute the mapping.  With a little creativity, you can easily reduce those
 300 to a set of bit shifts and ORs.  By far the simplest formulation, however, is
 301 as a mapping from the bits of the texture coordinates to bits in the address.
 302 Suppose that :math:`(u, v)` is the location of a 1-byte element within a tile.  If
 303 you represent :math:`u` as :math:`u_n u_{n-1} \cdots u_2 u_1 u_0` where
 304 :math:`u_0` is the LSB and :math:`u_n` is the MSB of :math:`u` and similarly
 305 :math:`v = v_m v_{m-1} \cdots v_2 v_1 v_0`, then the bits of the address within
 306 the tile are given by the table below:
 307
 308 =========================================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
 309  Tiling                                          11          10          9           8           7           6           5           4           3           2           1           0
 310 =========================================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
 311 :cpp:enumerator:`isl_tiling::ISL_TILING_X`  :math:`v_2` :math:`v_1` :math:`v_0` :math:`u_8` :math:`u_7` :math:`u_6` :math:`u_5` :math:`u_4` :math:`u_3` :math:`u_2` :math:`u_1` :math:`u_0`
 312 :cpp:enumerator:`isl_tiling::ISL_TILING_Y0` :math:`u_6` :math:`u_5` :math:`u_4` :math:`v_4` :math:`v_3` :math:`v_2` :math:`v_1` :math:`v_0` :math:`u_3` :math:`u_2` :math:`u_1` :math:`u_0`
 313 :cpp:enumerator:`isl_tiling::ISL_TILING_W`  :math:`u_5` :math:`u_4` :math:`u_3` :math:`v_5` :math:`v_4` :math:`v_3` :math:`v_2` :math:`u_2` :math:`v_1` :math:`u_1` :math:`v_0` :math:`u_0`
 314 :cpp:enumerator:`isl_tiling::ISL_TILING_4`  :math:`v_4` :math:`v_3` :math:`u_6` :math:`v_2` :math:`u_5` :math:`u_4` :math:`v_1` :math:`v_0` :math:`u_3` :math:`u_2` :math:`u_1` :math:`u_0`
 315 =========================================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
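
One way to turn the table into code is to store, for each address bit, which
coordinate bit feeds it.  The sketch below does this for the X and Y rows of
the table; all type and function names are hypothetical and do not correspond
to anything in ISL.

.. code-block:: c

   #include <stdint.h>

   enum tile_axis { AXIS_U, AXIS_V };

   /* Source of one address bit: bit `bit` of coordinate `axis`. */
   struct tile_bit_src {
      enum tile_axis axis;
      uint8_t bit;
   };

   /* Address bits 0..11, LSB first (each table row read right to left). */
   static const struct tile_bit_src x_tile_bits[12] = {
      {AXIS_U, 0}, {AXIS_U, 1}, {AXIS_U, 2}, {AXIS_U, 3},
      {AXIS_U, 4}, {AXIS_U, 5}, {AXIS_U, 6}, {AXIS_U, 7},
      {AXIS_U, 8}, {AXIS_V, 0}, {AXIS_V, 1}, {AXIS_V, 2},
   };

   static const struct tile_bit_src y_tile_bits[12] = {
      {AXIS_U, 0}, {AXIS_U, 1}, {AXIS_U, 2}, {AXIS_U, 3},
      {AXIS_V, 0}, {AXIS_V, 1}, {AXIS_V, 2}, {AXIS_V, 3},
      {AXIS_V, 4}, {AXIS_U, 4}, {AXIS_U, 5}, {AXIS_U, 6},
   };

   /* Gather the selected coordinate bits into a 12-bit in-tile offset. */
   static inline uint32_t
   map_tile_bits(const struct tile_bit_src bits[12], uint32_t u, uint32_t v)
   {
      uint32_t addr = 0;
      for (unsigned i = 0; i < 12; i++) {
         uint32_t coord = bits[i].axis == AXIS_U ? u : v;
         addr |= ((coord >> bits[i].bit) & 1u) << i;
      }
      return addr;
   }

A table-driven loop like this is convenient for experimentation; a real
implementation would more likely hard-code the shifts and masks per tiling.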
 316
 317 Constructing the mapping this way makes a lot of sense when you think about
 318 hardware.  It may seem complex on paper but "simple" things such as addition
 319 are relatively expensive in hardware while interleaving bits in a well-defined
 320 pattern is practically free. For a format that has more than one byte per
 321 element, you simply chop bits off the bottom of the pattern, hard-code them to
 322 0, and adjust bit indices as needed.  For a 128-bit format, for instance, the
 323 Y-tiled pattern becomes :math:`u_2 u_1 u_0 v_4 v_3 v_2 v_1 v_0`.  The Sky Lake
 324 PRM Vol. 5 in the section "2D Surfaces" contains an expanded version of the
 325 above table (which we will not repeat here) that also includes the bit patterns
 326 for the Ys and Yf tiling formats.
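
The 128-bit Y-tiled case mentioned above can be sketched directly.  The
function name is hypothetical; ``x_el`` is the element column in [0, 8), since
a 128B-wide Y-tile holds 8 elements of 16B each, and ``v`` is the row in
[0, 32).

.. code-block:: c

   #include <stdint.h>

   /* Byte offset within a Y-tile of element (x_el, v) for a 128-bit format. */
   static inline uint32_t
   y_tile_offset_128bpb_B(uint32_t x_el, uint32_t v)
   {
      /* u2 u1 u0 v4 v3 v2 v1 v0, followed by four hard-coded zero bits. */
      return ((x_el & 0x7) << 9) | ((v & 0x1f) << 4);
   }

The low four address bits are always zero because every element starts on a
16B boundary.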