inference-engine/thirdparty/mkl-dnn/doc/design/understanding_memory_formats.md

   1 Understanding Memory Formats {#understanding_memory_formats}
   2 ============================================================
   3
   4 ## Introduction
   5
   6 Most of the computations are about data: analyzing data, adjusting
   7 data, reading and storing data, generating data... DNN domain is no
   8 exception. Images, weights/filters, sound, and text require
   9 efficient representation in computer memory to perform operations fast and in
  10 the most convenient way.
  11
  12 This article is devoted to **data format** -- one form of data
  13 representation that describes how multidimensional arrays (nD) are stored in
  14 linear (1D) memory address space and why this is important for
  15 [Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN)](https://github.com/intel/mkl-dnn/).
  16
  17 > Note: for the purpose of this article data *format* and *layout*
  18 >       are used interchangeably.
  19
  20 ### Nomenclature used
  21
  22 - Channels mean the same as feature maps
  23 - Upper-case letters denote the dimensions (e.g. `N`)
  24 - Lower-case letters denote the index (e.g. `n`, where `0 <= n < N`)
  25 - The notation for the activations:
  26
  27   *batch*         **N**,
  28   *channels*      **C**,
  29   *depth*         **D**,
  30   *height*        **H**,
  31   *width*         **W**
  32
  33 - The notation for the weights:
  34
  35   *groups*            **G**,
  36   *output channels*   **O**,
  37   *input channels*    **I**,
  38   *depth*             **D**,
  39   *height*            **H**,
  40   *width*             **W**
  41
  42 ## Data formats
  43
  44 Let's first focus on data formats for activations (images).
  45
  46 Activations consist of channels (aka feature maps) and a spatial domain,
  47 either 2D or 3D. Spatial domain together with channels form an image.
  48 During the training phase images are typically grouped together in batches.
  49 Even if there is only one image, we would still assume there is a batch
  50 with batch size equal to 1.
  51 Hence, the overall dimensionality of activations is 4D (**N**, **C**, **H**,
  52 and **W**) or 5D (**N**, **C**, **D**, **H**, and **W**).
  53
  54 In this article for the sake of simplicity we will use 2D spatial only.
  55
  56 ### Plain data formats
  57
  58 It would be simpler to start with an example.
  59
  60 Consider 4D activations with batch equals 2, 16 channels, and 5 x 4 spatial
  61 domain.
  62 Logical representation is given on the picture below.
  63 ![Activations](mem_fmt_img1.png)
  64
  65 The value at the position (n, c, h, w) is generated with the following formula:
  66 ~~~
  67     value(n, c, h, w) = n * CHW + c * HW + h * W + w
  68 ~~~
  69
  70 In order to define how data in this 4D-tensor is laid out in memory we need to
  71 define how to map it to a 1D tensor via an **offset** function that takes
  72 logical index (n, c, h, w) as an input and returns an address displacement to
  73 where the value is located:
  74 ~~~
  75     offset : (int, int, int, int) --> int
  76 ~~~
  77
  78 #### NCHW
  79
  80 Let's describe the order in which the tensor values are laid out in memory
  81 for one the very popular format **NCHW**.
  82 The `[a:?]` marks refer to the jumps shown in the picture below that
  83 shows the 1D representation of an NCHW tensor in memory.
  84
  85 -
  86   `[a:0]`
  87   First within a line, from left to right
  88
  89 -
  90   `[a:1]`
  91   Then line by line from top to bottom
  92
  93 -
  94   `[a:2]`
  95   Then go from one plane to another (in depth)
  96
  97 -
  98   `[a:3]`
  99   And finally switch from one image in a batch (**n** = 0)
 100   to another (**n** = 1)
 101
 102 Then the offset function is:
 103 ~~~
 104     offset_nchw(n, c, h, w) = n * CHW + c * HW + h * W + w
 105 ~~~
 106
 107 We use `nchw` here to denote that `w` is the inner-most dimension, meaning that
 108 two elements adjacent in memory would share the same indices of `n`, `c`,
 109 and `h`, and their index of `w` would be different by `1`. This is of course
 110 true only for non-border elements. On the contrary `n` is the outermost dimension
 111 here, meaning that if you need to take the same pixel `(c, h, w)` but on the
 112 next image you have to jump over the whole image size `C*H*W`.
 113
 114 This data format is called **NCHW** and used by default in BVLC\* Caffe.
 115 TensorFlow\* also supports this data format.
 116
 117 > Note: It is just a coincidence that `offset_nchw()` is the same as `value()`
 118 >       in this example.
 119
 120 One can create memory with **NCHW** data layout using
 121 #mkldnn_nchw of the enum type #mkldnn_memory_format_t defined in
 122 [mkldnn_types.h](https://github.com/intel/mkl-dnn/blob/master/include/mkldnn_types.h)
 123 for C API and mkldnn::memory::nchw defined in
 124 [mkldnn.hpp](https://github.com/intel/mkl-dnn/blob/master/include/mkldnn.hpp)
 125 for C++ API.
 126
 127
 128 #### NHWC
 129
 130 Another quite popular data format is **NHWC**
 131 and it uses the following offset function:
 132 ~~~
 133     offset_nhwc(n, c, h, w) = n * HWC + h * WC + w * C + c
 134 ~~~
 135
 136 In this case the inner-most dimension is channels (`[b:0]`) that is followed
 137 by width (`[b:1]`), height (`[b:2]`), and finally batch (`[b:3]`).
 138
 139 For a single image (**N** = 1), this format is very similar to how
 140 [BMP-file format](https://en.wikipedia.org/wiki/BMP_file_format) works,
 141 where the image is kept pixel by pixel and every pixel contains all
 142 required information about colors (for instance 3 channels for 24bit BMP).
 143
 144 NHWC data format is the default one for
 145 [TensorFlow](https://www.tensorflow.org/performance/performance_guide#data_formats).
 146
 147 This layout corresponds to #mkldnn_nhwc or mkldnn::memory::nhwc.
 148
 149
 150 #### CHWN
 151
 152 The last example here for the plain data layout is **CHWN** which is used by
 153 [Neon](https://neon.nervanasys.com/index.html/design.html#data-layout).
 154 This layout might be very interesting from a vectorization perspective if
 155 an appropriate batch size is used, but on the other hand users cannot always
 156 have *good* batch size
 157 (e.g. in case of real-time inference batch is typically 1).
 158
 159 The dimensions order is (from inner-most to outer-most):
 160 batch (`[c:0]`), width (`[c:1]`), height (`[c:2]`), channels (`[c:3]`).
 161
 162 The offset function for **CHWN** format is defined as:
 163 ~~~
 164     offset_chwn(n, c, h, w) = c * HWN + h * WN + w * N + n
 165 ~~~
 166
 167 This layout corresponds to #mkldnn_chwn or mkldnn::memory::chwn.
 168
 169 ![Different plain layouts](mem_fmt_img2.png)
 170
 171 #### Relevant reading
 172
 173 [TensorFlow Doc. Shapes and Layout](https://www.tensorflow.org/performance/xla/shapes)
 174
 175 ### Generalization of the plain data layout
 176
 177 #### Strides
 178
 179 In the previous examples the data was kept packed or in dense form, meaning
 180 pixels follow one another.
 181 Sometimes it might be necessary to not keep data contiguous in memory.
 182 For instance some might need to work with sub-tensor within a bigger tensor.
 183 Sometimes it might be beneficial to artificially make the data disjoint, like
 184 in case of GEMM with non-trivial leading dimension to get better performance
 185 ([see Tips 6](https://software.intel.com/en-us/articles/a-simple-example-to-measure-the-performance-of-an-intel-mkl-function)).
 186
 187 The following picture shows simplified case for 2D matrix of size
 188 `rows x columns` kept in row-major format where rows have some non-trivial
 189 (i.e. not equal to the number of columns) stride.
 190
 191 ![Strides](strides.png)
 192
 193 In this case the general offset function looks like:
 194 ~~~
 195     offset(n, c, h, w) = n * stride_n
 196                        + c * stride_c
 197                        + h * stride_h
 198                        + w * stride_w
 199 ~~~
 200
 201 Note, then **NCHW**, **NHWC**, and **CHWN** formats are just special
 202 cases of the format with strides. For example for **NCHW** we have:
 203 ~~~
 204     stride_n = CHW, stride_c = HW, stride_h = W, stride_w = 1
 205 ~~~
 206
 207
 208
 209 Intel MKL-DNN supports strides via blocking structure. The pseudo code is:
 210 ~~~cpp
 211     memory_desc_t md; // memory descriptor object
 212
 213     // logical description, layout independent
 214     md.ndims = 4;           // # dimensions
 215     md.dims = {N, C, H, W}; // dimensions themselves
 216
 217     // physical description
 218     md.memory_format = mkldnn_blocked; // generic blocked format
 219     md.layout_desc.blocking.strides[0] = {
 220         stride_n, stride_c, stride_h, stride_w
 221     };
 222 ~~~
 223 In particular whenever a user creates memory with #mkldnn_nchw format MKL-DNN
 224 computes the strides and fills the structure on behalf of the user. That
 225 can be done manually though.
 226
 227
 228 ## Blocked layout
 229
 230 Plain layouts give great flexibility and are very convenient for use. That's
 231 why most of the frameworks and applications use either **NCHW** or **NHWC**
 232 layouts. However depending on the operation that is performed on data it might
 233 turn out that those layouts are sub-optimal from performance perspective.
 234
 235 In order to achieve better vectorization and cache re-usage Intel MKL-DNN
 236 introduces blocked layout that splits one or several dimensions into the
 237 blocks of fixed size. The most popular Intel MKL-DNN data format is
 238 **nChw16c** on AVX512+ systems and **nChw8c** on SSE4.2+ systems. As one
 239 might guess from the name the only dimension that is blocked is channels and
 240 the block size is either 16 in the former case or 8 in the later case.
 241
 242 Precisely, the offset function for **nChw8c** is:
 243 ~~~
 244     offset_nChw8c(n, c, h, w) = n * CHW
 245                               + (c / 8) * HW*8
 246                               + h * W*8
 247                               + w * 8
 248                               + (c % 8)
 249 ~~~
 250
 251 Note that blocks of 8 channels are kept contiguously in memory. Pixel
 252 by pixel the spatial domain is covered. Then next slice covers subsequent
 253 8 channels (i.e. moving from `c=0..7` to `c=8..15`).
 254 Once all channel blocks are covered the next image in the batch appears.
 255
 256 ![nChw8c format](mem_fmt_blk.png)
 257
 258 > Note: we use lower- and uppercase letters in the formats to distinguish
 259 >       between the blocks (e.g. 8c) and the remaining co-dimension
 260 >       (**C** = channels / 8).
 261
 262 The reason behind the format choice can be found in
 263 [this paper](https://arxiv.org/pdf/1602.06709v1.pdf).
 264
 265 Intel MKL-DNN describes this type of memory via blocking structure
 266 as well. The pseudo code is:
 267 ~~~cpp
 268     memory_desc_t md;
 269     // logical description, layout independent
 270     md.ndims = 4;           // # dimensions
 271     md.dims = {N, C, H, W}; // dimensions themselves
 272
 273     // physical description
 274     md.memory_format = mkldnn_nChw8c; // blocked layout with
 275                                       // channels blocked by 8
 276     md.layout_desc.blocking.block_dims = {
 277         1, // no blocking by n, hence 1
 278         8, //    blocking by c, hence 8
 279         1, // no blocking by h, hence 1
 280         1, // no blocking by w, hence 1
 281     };
 282
 283     ptrdiff_t stride_n = C*H*W;
 284     ptrdiff_t stride_C = H*W*8;
 285     ptrdiff_t stride_h =   W*8;
 286     ptrdiff_t stride_w =     8;
 287     ptrdiff_t stride_8c =    1;
 288
 289     md.layout_desc.blocking.strides[0] = { // strides between blocks
 290         stride_n, stride_C, stride_h, stride_w
 291     };
 292     md.layout_desc.blocking.strides[1] = { // strides within blocks
 293         1,         // ignored since no blocking by n
 294         stride_8c, // blocks of channels are contiguous
 295         1,         // ignored since no blocking by h
 296         1,         // ignored since no blocking by w
 297     };
 298 ~~~
 299
 300 ### What if channels are not multiple of 8 (or 16)?
 301
 302 Blocking data layout gives significant performance improvement for the
 303 convolutions, but what to do when the number of channels is not multiple
 304 of the block size, say 17 channels for **nChw8c** format?
 305
 306 Well one of the possible ways to handle that would be to use blocked layout for
 307 as many channels as possible by rounding them down to a number that is
 308 a multiple of the block size (in this case `16 = 17 / 8 * 8`) and process
 309 the tail somehow. However that would lead to introduction of very special tail
 310 processing code into many Intel MKL-DNN kernels.
 311
 312 So we came up with another solution using zero-padding. The idea is to round
 313 the channels up to make them multiples of the block size and pad created tail
 314 with zeros (in the example above `24 = div_up(17, 8) * 8`). Then primitives
 315 like convolutions might work with rounded-up number of channels instead of
 316 the original ones and compute the correct result (adding zeros doesn't change
 317 the result).
 318
 319 That allows supporting arbitrary number of channels with almost no changes
 320 to the kernels. The price would be some extra computations on those zeros,
 321 but this is either negligible or the performance with overheads is still
 322 higher than the performance with plain data layout.
 323
 324 The picture below depicts the idea. Note that some extra computations
 325 happen while computing `d0`, but that does not affect the result.
 326
 327 ![Padded format](mem_fmt_padded_blk.png)
 328
 329 The feature is supported starting with Intel MKL-DNN
 330 [v0.15](https://github.com/intel/mkl-dnn/releases/tag/v0.15). So far the
 331 support is limited by `f32` data type and on the AVX512+ systems. We plan
 332 to extend the implementations for other cases as well.
 333
 334 Some pitfalls of the given approach:
 335
 336 - To keep *padded data are zeros* invariant mkldnn_memory_set_data_handle()
 337   and mkldnn::memory::set_data_handle() physically put zeros
 338   whenever user attaches a pointer to a memory that uses zero padding. That
 339   might affect performance if too many unnecessary calls to these functions
 340   are made. We might consider extending our API in future to allow attaching
 341   pointers without subsequent initialization with zeros, if user can guarantee
 342   the padding is already filled correctly
 343
 344 - Memory size required to keep the data cannot be computed by the formula
 345   `sizeof(data_type) * N * C * H * W` anymore. Actual size should always be
 346   queried via mkldnn_memory_primitive_desc_get_size() in C and
 347   mkldnn::memory::primitive_desc::get_size() in C++
 348
 349 - Element-wise operations that are implemented in user's code and directly
 350   operate on Intel MKL-DNN blocked layout like that:
 351   ~~~
 352       for (int e = 0; e < phys_size; ++e)
 353           x[e] = eltwise_op(x[e])
 354   ~~~
 355   are not safe if the data is padded with zeros and the `eltwise_op(0) != 0`
 356
 357 Relevant Intel MKL-DNN code:
 358 ~~~cpp
 359     const int C = 17;
 360     const int C_padded = div_up(17, 8) * 8; // 24
 361
 362     // logical description, layout independent
 363     const int ndims    = 4;            // # of dimensions
 364     mkldnn_dims_t dims = {N, C, H, W}; // dimensions themselves
 365
 366     memory_desc_t md;
 367     // initialize memory descriptor
 368     mkldnn_memory_desc_init(&md, ndims,
 369                                  dims,
 370                                  mkldnn_f32,   // single precision data type
 371                                  mkldnn_nChw8c // blocked layout
 372                                  );
 373
 374     ptrdiff_t expect_stride_n = C_padded*H*W;   // note C_padded here, not C
 375     ptrdiff_t expect_stride_C =          H*W*8;
 376     ptrdiff_t expect_stride_h =            W*8;
 377     ptrdiff_t expect_stride_w =              8;
 378     ptrdiff_t expect_stride_8c =             1;
 379
 380     bool expect_true = true
 381         && true // logical dims stay as is
 382         && md.dims[0] == N
 383         && md.dims[1] == C
 384         && md.dims[2] == H
 385         && md.dims[3] == W
 386         && true // padded dims are rounded accordingly
 387         && md.layout_desc.blocking.padding_dims[0] == N
 388         && md.layout_desc.blocking.padding_dims[1] == C_padded
 389         && md.layout_desc.blocking.padding_dims[2] == H
 390         && md.layout_desc.blocking.padding_dims[3] == W
 391         && true // strides correspond to the physcal layout
 392         && md.layout_desc.blocking.strides[0][0] == expect_stride_n
 393         && md.layout_desc.blocking.strides[0][1] == expect_stride_C
 394         && md.layout_desc.blocking.strides[0][2] == expect_stride_h
 395         && md.layout_desc.blocking.strides[0][3] == expect_stride_w
 396         && md.layout_desc.blocking.strides[1][1] == expect_stride_8c;
 397
 398     assert(expect_true);
 399 ~~~
 400
 401 ---
 402
 403 [Legal information](@ref legal_information)