Understanding Memory Formats {#understanding_memory_formats}
============================================================
Most computations are about data: analyzing data, adjusting
data, reading and storing data, generating data... The DNN domain is no
exception. Images, weights/filters, sound, and text require an
efficient representation in computer memory to perform operations fast and in
the most convenient way.
This article is devoted to the **data format** -- a form of data
representation that describes how multidimensional arrays (nD) are stored in
a linear (1D) memory address space and why this is important for
[Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN)](https://github.com/intel/mkl-dnn/).
> Note: for the purpose of this article, data *format* and *layout*
> are used interchangeably.
- Channels mean the same as feature maps
- Upper-case letters denote dimensions (e.g. `N`)
- Lower-case letters denote indices (e.g. `n`, where `0 <= n < N`)
- The notation for the activations:
  *batch* **N**,
  *channels* **C**,
  *depth* **D**,
  *height* **H**,
  *width* **W**
- The notation for the weights:
  *output channels* **O**,
  *input channels* **I**,
  *height* **H**,
  *width* **W**
Let's first focus on data formats for activations (images).
Activations consist of channels (aka feature maps) and a spatial domain,
either 2D or 3D. The spatial domain together with channels forms an image.
During the training phase, images are typically grouped together in batches.
Even if there is only one image, we would still assume there is a batch
with batch size equal to 1.
Hence, the overall dimensionality of activations is 4D (**N**, **C**, **H**,
and **W**) or 5D (**N**, **C**, **D**, **H**, and **W**).
For the sake of simplicity, this article uses the 2D spatial domain only.
### Plain data formats
It is simplest to start with an example.

Consider 4D activations with batch equal to 2, 16 channels, and a 5 x 4 spatial
domain.
The logical representation is given in the picture below.

![Activations](mem_fmt_img1.png)
The value at position (n, c, h, w) is generated with the following formula:

    value(n, c, h, w) = n * CHW + c * HW + h * W + w
In order to define how the data in this 4D tensor is laid out in memory, we
need to define how to map it to 1D memory via an **offset** function that
takes a logical index (n, c, h, w) as input and returns the address
displacement at which the value is located:

    offset : (int, int, int, int) --> int
Let's describe the order in which the tensor values are laid out in memory
for one of the most popular formats, **NCHW**.
The `[a:?]` marks refer to the jumps shown in the picture below, which
shows the 1D representation of an NCHW tensor in memory.
1. First within a line, from left to right
2. Then line by line, from top to bottom
3. Then from one plane to another (in depth)
4. And finally from one image in the batch (**n** = 0)
   to another (**n** = 1)
The offset function is then:

    offset_nchw(n, c, h, w) = n * CHW + c * HW + h * W + w
We use `nchw` here to denote that `w` is the inner-most dimension, meaning that
two elements adjacent in memory share the same indices `n`, `c`,
and `h`, and their `w` indices differ by `1`. This is, of course,
true only for non-border elements. On the contrary, `n` is the outer-most
dimension here, meaning that to take the same pixel `(c, h, w)` on the
next image you have to jump over the whole image size `C*H*W`.
This data format is called **NCHW** and is used by default in BVLC\* Caffe.
TensorFlow\* also supports this data format.
> Note: It is just a coincidence that `offset_nchw()` is the same as `value()`
> in this example.
One can create memory with the **NCHW** data layout using
#mkldnn_nchw of the enum type #mkldnn_memory_format_t defined in
[mkldnn_types.h](https://github.com/intel/mkl-dnn/blob/master/include/mkldnn_types.h)
for the C API, and mkldnn::memory::nchw defined in
[mkldnn.hpp](https://github.com/intel/mkl-dnn/blob/master/include/mkldnn.hpp)
for the C++ API.
Another quite popular data format is **NHWC**, which uses the following
offset function:

    offset_nhwc(n, c, h, w) = n * HWC + h * WC + w * C + c
In this case the inner-most dimension is channels (`[b:0]`), followed
by width (`[b:1]`), height (`[b:2]`), and finally batch (`[b:3]`).
For a single image (**N** = 1), this format is very similar to how the
[BMP file format](https://en.wikipedia.org/wiki/BMP_file_format) works,
where the image is kept pixel by pixel and every pixel contains all the
required color information (for instance, 3 channels for 24-bit BMP).
NHWC is the default data format for
[TensorFlow](https://www.tensorflow.org/performance/performance_guide#data_formats).
This layout corresponds to #mkldnn_nhwc or mkldnn::memory::nhwc.
The last example of a plain data layout is **CHWN**, which is used by
[Neon](https://neon.nervanasys.com/index.html/design.html#data-layout).
This layout can be very interesting from a vectorization perspective if
an appropriate batch size is used, but users cannot always
have a *good* batch size
(e.g. in the case of real-time inference, the batch is typically 1).
The dimension order is (from inner-most to outer-most):
batch (`[c:0]`), width (`[c:1]`), height (`[c:2]`), channels (`[c:3]`).
The offset function for the **CHWN** format is defined as:

    offset_chwn(n, c, h, w) = c * HWN + h * WN + w * N + n
This layout corresponds to #mkldnn_chwn or mkldnn::memory::chwn.
![Different plain layouts](mem_fmt_img2.png)
#### Relevant reading

[TensorFlow Doc. Shapes and Layout](https://www.tensorflow.org/performance/xla/shapes)
### Generalization of the plain data layout
In the previous examples the data was kept packed, or dense, meaning
pixels follow one another.
Sometimes it might be necessary not to keep data contiguous in memory.
For instance, one might need to work with a sub-tensor within a bigger tensor.
Sometimes it might be beneficial to artificially make the data disjoint, as
in the case of GEMM with a non-trivial leading dimension to get better
performance
([see Tip 6](https://software.intel.com/en-us/articles/a-simple-example-to-measure-the-performance-of-an-intel-mkl-function)).
The following picture shows a simplified case for a 2D matrix of size
`rows x columns` kept in row-major format where rows have a non-trivial
(i.e. not equal to the number of columns) stride.

![Strides](strides.png)
In this case the general offset function looks like:

    offset(n, c, h, w) = n * stride_n
                       + c * stride_c
                       + h * stride_h
                       + w * stride_w
Note that the **NCHW**, **NHWC**, and **CHWN** formats are just special
cases of the format with strides. For example, for **NCHW** we have:

    stride_n = CHW, stride_c = HW, stride_h = W, stride_w = 1
Intel MKL-DNN supports strides via the blocking structure. The pseudo-code is:

    memory_desc_t md; // memory descriptor object

    // logical description, layout independent
    md.ndims = 4; // # of dimensions
    md.dims = {N, C, H, W}; // dimensions themselves

    // physical description
    md.memory_format = mkldnn_blocked; // generic blocked format
    md.layout_desc.blocking.strides[0] = {
        stride_n, stride_c, stride_h, stride_w
    };
In particular, whenever a user creates memory with the #mkldnn_nchw format,
Intel MKL-DNN computes the strides and fills the structure on behalf of the
user. This can also be done manually.
### Blocked data layout

Plain layouts give great flexibility and are very convenient to use. That's
why most frameworks and applications use either the **NCHW** or the **NHWC**
layout. However, depending on the operation that is performed on the data,
those layouts may turn out to be sub-optimal from a performance perspective.
In order to achieve better vectorization and cache reuse, Intel MKL-DNN
introduces a blocked layout that splits one or several dimensions into
blocks of fixed size. The most popular Intel MKL-DNN data format is
**nChw16c** on AVX512+ systems and **nChw8c** on SSE4.2+ systems. As one
might guess from the names, the only dimension that is blocked is channels,
and the block size is either 16 in the former case or 8 in the latter.
Precisely, the offset function for **nChw8c** is:

    offset_nChw8c(n, c, h, w) = n * CHW
                              + (c / 8) * HW * 8
                              + h * W * 8
                              + w * 8
                              + (c % 8)
Note that blocks of 8 channels are kept contiguously in memory. The spatial
domain is covered pixel by pixel; then the next slice covers the subsequent
8 channels (i.e. moving from `c=0..7` to `c=8..15`).
Once all channel blocks are covered, the next image in the batch appears.
![nChw8c format](mem_fmt_blk.png)
> Note: we use lower- and uppercase letters in the formats to distinguish
> between the blocks (e.g. 8c) and the remaining co-dimension
> (**C** = channels / 8).
The reasoning behind the format choice can be found in
[this paper](https://arxiv.org/pdf/1602.06709v1.pdf).
Intel MKL-DNN describes this type of memory via the blocking structure
as well. The pseudo-code is:

    // logical description, layout independent
    md.ndims = 4; // # of dimensions
    md.dims = {N, C, H, W}; // dimensions themselves

    // physical description
    md.memory_format = mkldnn_nChw8c; // blocked layout with
                                      // channels blocked by 8
    md.layout_desc.blocking.block_dims = {
        1, // no blocking by n, hence 1
        8, // blocking by c, hence 8
        1, // no blocking by h, hence 1
        1  // no blocking by w, hence 1
    };

    ptrdiff_t stride_n = C*H*W;
    ptrdiff_t stride_C = H*W*8;
    ptrdiff_t stride_h = W*8;
    ptrdiff_t stride_w = 8;
    ptrdiff_t stride_8c = 1;

    md.layout_desc.blocking.strides[0] = { // strides between blocks
        stride_n, stride_C, stride_h, stride_w
    };
    md.layout_desc.blocking.strides[1] = { // strides within blocks
        1,         // ignored since no blocking by n
        stride_8c, // blocks of channels are contiguous
        1,         // ignored since no blocking by h
        1          // ignored since no blocking by w
    };
### What if channels are not a multiple of 8 (or 16)?
The blocked data layout gives a significant performance improvement for
convolutions, but what to do when the number of channels is not a multiple
of the block size, say 17 channels for the **nChw8c** format?
One possible way to handle that would be to use the blocked layout for
as many channels as possible by rounding them down to a
multiple of the block size (in this case `16 = 17 / 8 * 8`) and process
the tail somehow. However, that would introduce very special tail-processing
code into many Intel MKL-DNN kernels.
So we came up with another solution: zero-padding. The idea is to round
the number of channels up to a multiple of the block size and pad the created
tail with zeros (in the example above, `24 = div_up(17, 8) * 8`). Then
primitives like convolutions can work with the rounded-up number of channels
instead of the original one and still compute the correct result (adding zeros
doesn't change the result).
That allows supporting an arbitrary number of channels with almost no changes
to the kernels. The price is some extra computation on those zeros, but it is
either negligible, or the performance including this overhead is still higher
than the performance with a plain data layout.
The picture below depicts the idea. Note that some extra computation
happens while computing `d0`, but it does not affect the result.

![Padded format](mem_fmt_padded_blk.png)
The feature is supported starting with Intel MKL-DNN
[v0.15](https://github.com/intel/mkl-dnn/releases/tag/v0.15). So far the
support is limited to the `f32` data type and AVX512+ systems. We plan
to extend the implementation to other cases as well.
Some pitfalls of this approach:
- To keep the *padded data are zeros* invariant, mkldnn_memory_set_data_handle()
  and mkldnn::memory::set_data_handle() physically write zeros
  whenever a user attaches a pointer to memory that uses zero-padding. That
  might affect performance if too many unnecessary calls to these functions
  are made. We might consider extending our API in the future to allow attaching
  pointers without subsequent initialization with zeros, if the user can
  guarantee the padding is already filled correctly.
- The memory size required to keep the data can no longer be computed by the
  formula `sizeof(data_type) * N * C * H * W`. The actual size should always be
  queried via mkldnn_memory_primitive_desc_get_size() in C and
  mkldnn::memory::primitive_desc::get_size() in C++.
- Element-wise operations that are implemented in the user's code and directly
  operate on the Intel MKL-DNN blocked layout, like this:

      for (int e = 0; e < phys_size; ++e)
          x[e] = eltwise_op(x[e])

  are not safe if the data is padded with zeros and `eltwise_op(0) != 0`.
Relevant Intel MKL-DNN code:
    const int C_padded = div_up(17, 8) * 8; // 24

    // logical description, layout independent
    const int ndims = 4; // # of dimensions
    mkldnn_dims_t dims = {N, C, H, W}; // dimensions themselves

    mkldnn_memory_desc_t md;
    // initialize memory descriptor
    mkldnn_memory_desc_init(&md, ndims,
            dims,        // dimensions with the original number of channels
            mkldnn_f32,  // single precision data type
            mkldnn_nChw8c); // blocked layout

    ptrdiff_t expect_stride_n = C_padded*H*W; // note C_padded here, not C
    ptrdiff_t expect_stride_C = H*W*8;
    ptrdiff_t expect_stride_h = W*8;
    ptrdiff_t expect_stride_w = 8;
    ptrdiff_t expect_stride_8c = 1;

    bool expect_true = true
        && true // logical dims stay as is
        && md.dims[0] == N
        && md.dims[1] == C
        && md.dims[2] == H
        && md.dims[3] == W
        && true // padded dims are rounded accordingly
        && md.layout_desc.blocking.padding_dims[0] == N
        && md.layout_desc.blocking.padding_dims[1] == C_padded
        && md.layout_desc.blocking.padding_dims[2] == H
        && md.layout_desc.blocking.padding_dims[3] == W
        && true // strides correspond to the physical layout
        && md.layout_desc.blocking.strides[0][0] == expect_stride_n
        && md.layout_desc.blocking.strides[0][1] == expect_stride_C
        && md.layout_desc.blocking.strides[0][2] == expect_stride_h
        && md.layout_desc.blocking.strides[0][3] == expect_stride_w
        && md.layout_desc.blocking.strides[1][1] == expect_stride_8c;
[Legal information](@ref legal_information)