# benchdnn

**benchdnn** is a standalone correctness and performance benchmark for the
[Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN)](/intel/mkl-dnn) library.
The purpose of the benchmark is extended and robust correctness verification of
the primitives provided by Intel MKL-DNN. So far **benchdnn** supports convolutions
and inner products of different data types. It also implicitly tests reorders.


## License
**benchdnn** is licensed under
[Apache License Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).


## Usage (main driver)

**benchdnn** itself is a driver for different implementation-specific
harnesses. So far it has harnesses for Intel MKL-DNN convolution, inner product,
reorder, batch normalization, and RNN, plus a harness for testing itself.
The usage:
```
    $ ./benchdnn [--HARNESS] [--mode=MODE] [-vN|--verbose=N] HARNESS-OPTS
```
where:

 - `HARNESS` is either `conv` [default], `ip`, `reorder`, `bnorm`, `rnn` or `self`

 - `MODE` -- string that contains flags for benchmark mode. Use `C` or `c` for correctness (used by default), and `P` or `p` for performance

 - `N` -- verbose level (integer from 0 [default] to ...)

 - `HARNESS-OPTS` are passed to the chosen harness

Returns `0` on success (all tests passed), and non-zero if any error
occurred.


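
Because the driver reports its result through the exit code, it is easy to wrap in a script. The following is a minimal sketch, not part of **benchdnn**: the helper names `build_benchdnn_cmd` and `run_benchdnn` are hypothetical, and the binary path `./benchdnn` is assumed to be the one used in the usage line above.

```python
import subprocess


def build_benchdnn_cmd(harness="conv", mode="C", verbose=0, harness_opts=(),
                       binary="./benchdnn"):
    """Assemble argv in the order: binary, --HARNESS, --mode=MODE, -vN, HARNESS-OPTS."""
    return [binary, f"--{harness}", f"--mode={mode}", f"-v{verbose}", *harness_opts]


def run_benchdnn(cmd):
    """Run the command and map the exit code to a boolean (0 means all tests passed)."""
    return subprocess.run(cmd).returncode == 0


# Build (but do not run) a correctness-only invocation of the convolution harness.
cmd = build_benchdnn_cmd(
    harness_opts=["--cfg=f32", "--dir=FWD_B", "--batch=inputs/conv_all"])
```
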
## Usage (convolution harness)

The usage:
```
    [harness-knobs] [conv-desc] ...
```

where *harness-knobs* are:

 - `--cfg={f32, u8s8u8s32, ...}` configuration (see below), default `f32`
 - `--dir={FWD_D (forward data), FWD_B (forward data + bias), BWD_D (backward data), BWD_W (backward weights), BWD_WB (backward weights + bias)}` direction, default `FWD_B`
 - `--alg={DIRECT, WINO}` convolution algorithm, default DIRECT
 - `--merge={NONE, RELU}` merged primitive, default NONE (nothing merged)
 - `--attr="attr_str"` convolution attributes (see the attribute string description below), default `""` (no attributes set)
 - `--mb=N` override the minibatch specified in the convolution description, default `0` (use mb specified in conv-desc)
 - `--match=regex` check only convolutions that match the regex, default is `".*"`. Note: Windows may only interpret string arguments surrounded by double quotation marks.
 - `--skip-impl="str1[:str2]..."` skip implementation (see mkldnn_query_impl_info_str), default `""`
 - `--allow-unimpl=true|false` do not treat an unimplemented configuration as an error, default `false`
 - `--perf-template=template-str` set template for performance report (see section *Performance measurements*)
 - `--reset` reset all previously set parameters to their defaults
 - `-vN|--verbose=N` verbose level, default `0`
 - `--batch=file` use options from the given file (see the `inputs` subdirectory)

and *conv-desc* is the convolution description. The canonical form is:
```
    gXmbXicXihXiwXocXohXowXkhXkwXshXswXphXpwXdhXdwXnS
```
Here X is a number and S is a string (n stands for name). Some of the parameters
may be omitted if there is either a default (e.g. if g is not specified
**benchdnn** uses 1) or if they can be computed automatically (e.g. the output shape
can be derived from the input shape and the kernel). Also, if either width or height
is not specified, then height == width is assumed. The special symbol `_` is
ignored, hence it may be used as a delimiter. See `str2desc()` in conv/conv_aux.cpp
for more details and implicit rules :^)

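
The rules above can be sketched as a small parser. This is an illustration only and not a port of `str2desc()`: the function name `parse_conv_desc` is hypothetical, the defaults of stride 1 and padding 0 are assumptions, and the output size is derived with the usual formula for a non-dilated convolution with symmetric padding.

```python
import re

# Two-letter keys are listed before "g" so that the longest key wins.
_TOKEN = re.compile(r"(mb|ic|ih|iw|oc|oh|ow|kh|kw|sh|sw|ph|pw|dh|dw|g)(\d+)")


def parse_conv_desc(desc):
    """Parse a conv-desc string into a dict and fill in omitted values."""
    out, pos = {}, 0
    while pos < len(desc):
        if desc[pos] == "_":          # delimiter, ignored
            pos += 1
            continue
        if desc[pos] == "n":          # the name runs to the end of the string
            out["n"] = desc[pos + 1:]
            break
        m = _TOKEN.match(desc, pos)
        if m is None:
            raise ValueError(f"cannot parse {desc[pos:]!r}")
        out[m.group(1)] = int(m.group(2))
        pos = m.end()

    out.setdefault("g", 1)            # documented default
    # If only one of height/width is given, assume height == width.
    for h, w in (("ih", "iw"), ("oh", "ow"), ("kh", "kw"),
                 ("sh", "sw"), ("ph", "pw"), ("dh", "dw")):
        if h in out and w not in out:
            out[w] = out[h]
        elif w in out and h not in out:
            out[h] = out[w]
    # Assumed defaults (not confirmed by the text above): stride 1, padding 0.
    for key, default in (("sh", 1), ("sw", 1), ("ph", 0), ("pw", 0)):
        out.setdefault(key, default)
    # Derive the output shape when omitted (assumes no dilation).
    for i, k, s, p, o in (("ih", "kh", "sh", "ph", "oh"),
                          ("iw", "kw", "sw", "pw", "ow")):
        if o not in out and i in out and k in out:
            out[o] = (out[i] - out[k] + 2 * out[p]) // out[s] + 1
    return out


full = parse_conv_desc(
    "ic3ih227iw227_oc96oh55ow55_kh11kw11_sh4sw4ph0pw0_nalexnet:conv1")
short = parse_conv_desc("mb4ic3ih227oc96kh11sh4")
```
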
The attribute string *attr_str* is defined as (new lines for readability):
```
    [irmode={nearest,down};]
    [oscale={none,common,per_oc}[:scale];]
    [post_ops='[{relu,sum[:sum_scale]};]...';]
```

Here `irmode` defines the rounding mode for integer output (default is nearest).

Next, `oscale` stands for output_scales. The first parameter is the policy, as
defined below. The second, optional parameter is a scale that specifies
either the one common output scale (for the `none` and `common` policies) or a
starting point for the `per_oc` policy, which uses many scales. The default scale
is 1.0. Known policies are:

  - `none` (default) means no output scales set (i.e. scale = 1.)
  - `common` corresponds to `mask=0` with a common scale factor
  - `per_oc` corresponds to `mask=1<<1` (i.e. output channels) with different scale factors

Next, `post_ops` stands for the post operation sequence. Currently supported post
ops are:

  - `relu` with no parameters (i.e. corresponding scale is 1., alg = eltwise_relu, alpha = beta = 0.)
  - `sum` with an optional scale parameter (default 1.)

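
To make the grammar of *attr_str* concrete, here is a minimal sketch of a parser for it. It is not the parser used by **benchdnn**; the names `split_outside_quotes` and `parse_attr` and the returned dictionary layout are hypothetical. The one subtle point it shows is that the semicolons inside the single-quoted `post_ops` value must not split the outer list.

```python
def split_outside_quotes(text, sep=";"):
    """Split on `sep`, ignoring separators inside single quotes (quotes are dropped)."""
    parts, current, quoted = [], [], False
    for ch in text:
        if ch == "'":
            quoted = not quoted
        elif ch == sep and not quoted:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return [p for p in parts if p]


def parse_attr(attr_str):
    """Return the attributes as a dict, starting from the documented defaults."""
    attr = {"irmode": "nearest", "oscale": ("none", 1.0), "post_ops": []}
    for item in split_outside_quotes(attr_str):
        key, _, value = item.partition("=")
        if key == "irmode":
            attr["irmode"] = value
        elif key == "oscale":
            policy, _, scale = value.partition(":")
            attr["oscale"] = (policy, float(scale) if scale else 1.0)
        elif key == "post_ops":
            ops = []
            for op in value.split(";"):
                if op:
                    kind, _, scale = op.partition(":")
                    # relu takes no parameter; its scale is 1. as stated above
                    ops.append((kind, float(scale) if scale else 1.0))
            attr["post_ops"] = ops
        else:
            raise ValueError(f"unknown attribute {key!r}")
    return attr


attr = parse_attr("oscale=common:.5;post_ops='relu;sum:.3;relu'")
```
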
### convolution configurations (aka precision specification)

The `--cfg` option specifies which data types the convolution uses.
It also determines how the data is filled internally. For integer types,
saturation is implied.

Finally, the configuration defines the threshold for computation errors (ideally we
want to keep it at 0, and that seems to work for now).

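
As a rough illustration of what saturation means for an integer destination, the sketch below converts one accumulator value to the destination type. It is an assumption-laden model, not the library's code: the function name `finalize`, the order of operations (scale, then round, then saturate), and the tie-breaking of the `nearest` mode (ties to even, as Python's `round` does) are all assumptions.

```python
import math

# Representable ranges of the integer destination types named in the cfg strings.
RANGES = {"u8": (0, 255), "s8": (-128, 127), "s32": (-2**31, 2**31 - 1)}


def finalize(acc, dst_type, oscale=1.0, irmode="nearest"):
    """Scale an accumulator, round it per irmode, and clamp it to dst_type."""
    scaled = acc * oscale
    if irmode == "nearest":
        rounded = round(scaled)        # assumption: ties go to the even integer
    elif irmode == "down":
        rounded = math.floor(scaled)   # round toward negative infinity
    else:
        raise ValueError(f"unknown irmode {irmode!r}")
    lo, hi = RANGES[dst_type]
    return max(lo, min(hi, rounded))   # saturation instead of wrap-around


too_big = finalize(301, "u8")
negative = finalize(-5, "u8")
scaled_down = finalize(5, "u8", oscale=0.5, irmode="down")
```
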
The table below shows the cases supported by Intel MKL-DNN and the corresponding
configurations for **benchdnn**:

|src type | wei type | dst type | acc type | cfg          | notes
|:---     |:---      |:---      |:---      |:---          |:---
| f32     | f32      | f32      | f32      | f32          | inference optimized for sse4.2+, training avx2+
| s16     | s16      | s32      | s32      | s16s16s32s32 | optimized for processors with support of 4vnni, forward pass only (aka FWD_D, FWD_B)
| s32     | s16      | s16      | s32      | s32s16s16s32 | optimized for processors with support of 4vnni, backward wrt data only (aka BWD_D)
| s16     | s32      | s16      | s32      | s16s32s16s32 | optimized for processors with support of 4vnni, backward wrt weights (aka BWD_W, BWD_WB)
| u8      | s8       | f32      | s32      | u8s8f32s32   | optimized for processors with support of avx512vl, forward pass only (aka FWD_D, FWD_B)
| u8      | s8       | s32      | s32      | u8s8s32s32   | same notes as for u8s8f32s32
| u8      | s8       | s8       | s32      | u8s8s8s32    | same notes as for u8s8f32s32
| u8      | s8       | u8       | s32      | u8s8u8s32    | same notes as for u8s8f32s32


## Performance measurements

**benchdnn** supports a custom performance report. The template is passed via
the command line and consists of terminal and nonterminal symbols. Nonterminal
symbols are printed as is. A description of the terminal symbols is given below.
There is also a notion of modifiers (marked as @) that change the meaning of
terminal symbols, e.g. the sign '-' means the minimum (in terms of time). See
the table of modifiers below.

> **caution:** threads have to be pinned in order to get consistent frequency

| abbreviation  | description
|:------------  |:-----------
| %d            | problem descriptor
| %D            | expanded problem descriptor (conv parameters in csv format)
| %n            | problem name
| %z            | direction
| %@F           | effective cpu frequency computed as clocks[@] / time[@]
| %O            | number of ops required (padding is not taken into account)
| %@t           | time in ms
| %@c           | time in clocks
| %@p           | ops per second

| modifier  | description
|:--------  |:-----------
|           | default
| -         | min (time) -- default
| 0         | avg (time)
| +         | max (time)
|           |
| K         | Kilo (1e3)
| M         | Mega (1e6)
| G         | Giga (1e9)

The definition of the expanded problem descriptor is:
`g,mb,ic,ih,iw,oc,oh,ow,kh,kw,sh,sw,ph,pw`.

The default template, which can be found in conv/bench_conv.cpp, is
`perf,%n,%d,%GO,%GF,%-t,%-Gp,%0t,%0Gp`. It produces the following output
in CSV format:
```
string: perf
convolution name
full conv-desc
number of giga ops calculated
effective cpu frequency in GHz (aka clocks[min] / time[min])
minimum time spent in ms
best gigaops (since it corresponds to minimum time)
average time spent in ms
average gigaops (since it corresponds to average time)
```

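
The expansion of a template can be sketched as follows. This is a simplified model and not the code in conv/bench_conv.cpp: the function name `expand`, the input dictionaries, the `%g`-style number formatting, and the sample measurements are all hypothetical. It only illustrates how the time modifier (`-`, `0`, `+`) selects a measurement and how the unit modifier (`K`, `M`, `G`) rescales it.

```python
import re

_UNITS = {"": 1.0, "K": 1e3, "M": 1e6, "G": 1e9}
# %, optional time modifier, optional unit modifier, then the symbol itself.
_SPEC = re.compile(r"%([-0+]?)([KMG]?)([dDnzOFtcp])")


def expand(template, prob, times_ms, clocks, ops):
    """Expand a template; times_ms and clocks are dicts keyed by '-', '0', '+'."""
    def repl(match):
        mod = match.group(1) or "-"          # min time is the default modifier
        unit, sym = match.group(2), match.group(3)
        if sym in "dDnz":                    # textual fields are copied as is
            return str(prob[sym])
        t_sec = times_ms[mod] / 1e3
        value = {"O": ops,
                 "t": times_ms[mod],
                 "c": clocks[mod],
                 "F": clocks[mod] / t_sec,   # clocks[@] / time[@], in Hz
                 "p": ops / t_sec}[sym]      # ops per second
        return f"{value / _UNITS[unit]:g}"
    return _SPEC.sub(repl, template)


# Made-up measurements for a single problem.
line = expand(
    "perf,%n,%d,%GO,%GF,%-t,%-Gp,%0t,%0Gp",
    prob={"n": "conv1", "d": "ic3ih227oc96kh11sh4", "D": "", "z": "FWD_B"},
    times_ms={"-": 10.0, "0": 20.0, "+": 40.0},
    clocks={"-": 3e7, "0": 6e7, "+": 1.2e8},
    ops=2e9,
)
```
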
## Examples

Run the set of f32 forward convolutions from the inputs/conv_all file w/ bias and the default minibatch:
```
    $ ./benchdnn --conv \
        --cfg=f32 --dir=FWD_B --batch=inputs/conv_all
```

Run the same but with merged ReLU:
```
    $ ./benchdnn --conv \
        --cfg=f32 --dir=FWD_B --merge=RELU --batch=inputs/conv_all
```

Run the same as previous but also measure performance:
```
    $ ./benchdnn --conv --mode=CORRnPERF \
        --cfg=f32 --dir=FWD_B --merge=RELU --batch=inputs/conv_all
```

> **note**: instead of `CORRnPERF` one can use `CP`, `PC`, `cp`, or `pc`

Run a set of f32 backward convolutions wrt weights with kh=3 and
verbose level set to 2:
```
    $ ./benchdnn --conv -v2 \
        --cfg=f32 --dir=BWD_W --match='.*kh3[^0-9].*' --batch=inputs/conv_all
```

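
The `[^0-9]` in the pattern above keeps `kh3` from also matching `kh30`, `kh31`, and so on. The sketch below checks this with Python's `re` module on a few made-up descriptors; **benchdnn** itself is C++ and its regex engine may differ in details, so treat this purely as a way to reason about the pattern. Note that a descriptor that ends right after `kh3` would not match, because `[^0-9]` requires one more character.

```python
import re

pattern = re.compile(r".*kh3[^0-9].*")

# Hypothetical descriptors, written in the conv-desc form described earlier.
descs = [
    "ic64ih56oc64oh56kh3ph1nres:conv",        # kh=3, followed by "ph1"
    "ic3ih227oc96oh55kh11sh4nalexnet:conv1",  # kh=11
    "ic16ih64oc16oh34kh31nwide",              # kh=31
    "ic8ih8oc8kh3",                           # kh=3 at the very end
]

selected = [d for d in descs if pattern.fullmatch(d)]
```
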
Run a set of u8s8u8s32 backward convolutions wrt data but skip all
the convolutions that will use a reference or gemm-based implementation:
```
    $ ./benchdnn --conv \
        --cfg=u8s8u8s32 --dir=BWD_D --skip-impl='ref:gemm' --batch=inputs/conv_all
```

Run the explicitly specified 1st forward convolution (including bias) from Alexnet
with the minibatch set to 4 and verbose level set to 1 for two given
configurations (`u8s8u8s32` and `f32`):
```
    $ ./benchdnn --conv -v1 \
        --mb=4 --dir=FWD_B \
        --cfg=u8s8u8s32 ic3ih227iw227_oc96oh55ow55_kh11kw11_sh4sw4ph0pw0_n"alexnet:conv1" \
        --cfg=f32 ic3ih227iw227_oc96oh55ow55_kh11kw11_sh4sw4ph0pw0_n"alexnet:conv1"
```

Run a batch file for different algorithms (assuming the file only specifies
convolutions and does not include harness options that would override the ones
passed on the command line). Also ignore mkldnn_unimplemented errors in the case of
Winograd:
```
    $ ./benchdnn --conv \
        --alg=DIRECT --batch=convs.in \
        --allow-unimpl=true \
        --alg=WINO   --batch=convs.in
```

Run a set of u8s8u8s32 forward convolutions w/o bias, skipping
reference implementations and not treating unimplemented configurations as errors, with
one common output scale set to 0.5 and the rounding mode set to down
(via attributes):
```
    $ ./benchdnn --conv \
        --cfg=u8s8u8s32 --dir=FWD_D --skip-impl="ref" --allow-unimpl=true \
        --attr="irmode=down;oscale=common:.5" --batch=inputs/conv_all
```

Almost the same as above (with minor changes), but also add a post operation
sequence **(relu, then sum with scale .3, then relu)** using
attributes/mkldnn_post_ops_t:
```
    $ ./benchdnn --conv \
        --cfg=u8s8s32s32 --dir=FWD_B \
        --attr="oscale=common:.5;post_ops='relu;sum:.3;relu'" --batch=inputs/conv_all
```


## Notations / Glossary / Abbreviations

|Abbreviation   | Description
|:---           |:---
| src           | Source image (input image for forward convolution)
| wei           | Weights (aka filter)
| bia           | Bias
| dst           | Destination image (output image for forward convolution)
| acc           | Accumulation (typically in terms of data type)
| ic, oc        | Input/Output channels (aka feature maps)
| ih, iw        | Input height and width
| oh, ow        | Output height and width
| kh, kw        | Kernel (filter, weights) height and width
| sh, sw        | Convolution stride over height and width
| ph, pw        | Convolution top and left padding
| mb            | Minibatch (number of images processed at once)
| g             | Groups (a way to reduce the amount of computations, see Alexnet topology)
| FWD_{D,B}     | forward w/o and w/ bias
| BWD_{D,W,WB}  | backward wrt data, weights, and weights and bias
| DIRECT, WINO  | convolution algorithm: direct or Winograd based
| NONE, RELU    | merged primitives: nothing or ReLU


## Usage (batch normalization harness)

The usage:
```
    ./benchdnn --bnorm [harness-knobs] bnorm-desc ...
```

where *harness-knobs* are:

 - `--mb=N` override the minibatch specified in the batch normalization description, default `0` (use mb specified in bnorm-desc)
 - `--dir={FWD_D (forward data /training), FWD_I (forward data /inference), BWD_D (backward data), BWD_DW (backward data + weights)}` direction, default `FWD_D`
 - `--dt={f32, s32, ...}` base data type, default `f32`
 - `--fmt={nchw, nChw16c, ...}` data layout, default `nchw`
 - `--flags=[|G|S|R]` batch normalization flags, default `none` (G -- global stats, S -- use scale shift, R -- fuse with ReLU)
 - `--attr="attr_str"` attributes (see the convolution section above), default `""` (no attributes set)
 - `--match=regex` check only batch normalizations that match the regex, default is `".*"`. Note: Windows may only interpret string arguments surrounded by double quotation marks.
 - `--skip-impl="str1[:str2]..."` skip implementation (see mkldnn_query_impl_info_str), default `""`
 - `--perf-template=template-str` set template for performance report (very similar to the convolution one)
 - `--reset` reset all previously set parameters to their defaults
 - `-vN|--verbose=N` verbose level, default `0`
 - `--batch=file` use options from the given file (see the `inputs` subdirectory)

and *bnorm-desc* is a batch normalization description. The canonical form is:
```
    mbXicXihXiwXepsYnS
```
Here X is an integer number, Y is a real number, and S is a string (n stands for
name). The special symbol `_` is ignored, hence it may be used as a delimiter. There are
some implicit rules:

 - if mb is omitted set mb to 2

 - if iw is omitted set iw to ih (and vice versa)

 - if eps is omitted set eps to 1./16


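
The implicit rules for *bnorm-desc* can be illustrated with a short sketch. As with the convolution descriptor, this is not the harness's own parser; the name `parse_bnorm_desc` and the accepted spelling of the real number after `eps` are assumptions.

```python
import re

_INT = re.compile(r"(mb|ic|ih|iw)(\d+)")
# Assumed syntax for Y: plain decimal with an optional exponent, e.g. 0.001 or 1e-5.
_EPS = re.compile(r"eps((?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)")


def parse_bnorm_desc(desc):
    """Parse a bnorm-desc string and apply the three implicit rules."""
    out, pos = {}, 0
    while pos < len(desc):
        if desc[pos] == "_":              # delimiter, ignored
            pos += 1
            continue
        if desc[pos] == "n":              # the name runs to the end of the string
            out["n"] = desc[pos + 1:]
            break
        m = _INT.match(desc, pos)
        if m is not None:
            out[m.group(1)] = int(m.group(2))
        else:
            m = _EPS.match(desc, pos)
            if m is None:
                raise ValueError(f"cannot parse {desc[pos:]!r}")
            out["eps"] = float(m.group(1))
        pos = m.end()

    out.setdefault("mb", 2)               # rule 1
    if "ih" in out:                       # rule 2, both directions
        out.setdefault("iw", out["ih"])
    elif "iw" in out:
        out["ih"] = out["iw"]
    out.setdefault("eps", 1.0 / 16)       # rule 3
    return out


named = parse_bnorm_desc("ic64ih28_nbn1")
explicit = parse_bnorm_desc("mb16ic32iw7eps0.001")
```
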
## Installation

**benchdnn** is automatically built with Intel MKL-DNN. For convenience, one
may build **benchdnn** using cmake or make.


## Essence of convolution testing

Intel MKL-DNN supports different data types, such as single-precision floating
point (`mkldnn_f32`) and signed/unsigned integers of different lengths
(`mkldnn_{s,u}{8,16,32}`). We need to cover all those cases with tests. It is
essential to test real convolution sizes, since Intel MKL-DNN provides
different optimizations depending on convolution parameters, so there is no
one unified approach inside, which means it would not be enough to test only
a few convolutions (aka unit tests).

But even for a given convolution, the correctness test is not as
simple as it might seem at first sight. One of the biggest problems we
encountered is numerical instability. For every output point a lot of
operations may happen. For instance, on backward propagation with respect to
the filter, each filter point requires `mb * oh * ow` operations (see the *Notations*
section above). That large number of compute operations may lead to either
integer overflow or accuracy loss if the initial data was chosen inadequately.

These two main things complicate testing. **benchdnn** tries to address these
issues by using integers for initialization with uniform distribution in a
range `[cfg->f_min .. cfg->f_max]`, with the step `cfg->f_step`
(see `struct dt_conf_t` in conv/conv.hpp). `f_min` and `f_max` are chosen so
that most of the results belong to the `[cfg->min .. cfg->max]` range. Also,
for floating point all integers in both ranges have an exact representation (i.e.
the absolute numbers are less than `2^size_of_mantissa`). A uniform distribution
leads to uniformly distributed results, and quite small `f_min/f_max` keep
the result in a reasonable range. Yet another trick: not all the points are
initialized with non-zero values: see `fill_{src,wei,bia,dst}` in
conv/conv.cpp.


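
The effect of filling with small integers can be demonstrated with a short experiment. The sketch below emulates f32 arithmetic with the standard `struct` module and accumulates the same integer-valued dot product in two different orders. The fill range and the number of terms are hypothetical stand-ins for `f_min`, `f_max`, `f_step` and `mb * oh * ow`; they are not the values used by **benchdnn**. As long as the worst-case sum stays below `2^24` (the f32 mantissa size), every partial sum is exactly representable and the order of accumulation does not matter.

```python
import random
import struct


def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_f32(a, b):
    """Dot product with every intermediate result rounded to f32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(x * y))
    return acc


F_MIN, F_MAX, F_STEP = -8, 8, 1   # hypothetical fill range
N = 4096                          # stands in for mb * oh * ow

rng = random.Random(0)
values = range(F_MIN, F_MAX + 1, F_STEP)
src = [float(rng.choice(values)) for _ in range(N)]
wei = [float(rng.choice(values)) for _ in range(N)]

# Worst case |sum| is N * 8 * 8 = 262144, well below 2**24 = 16777216.
assert N * F_MAX * F_MAX < 2 ** 24

forward = dot_f32(src, wei)
backward = dot_f32(src[::-1], wei[::-1])
exact = sum(int(x) * int(y) for x, y in zip(src, wei))
```
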
## Further plans

Please see TODO.md in the **benchdnn** root directory for development plans.


## Issues and contributions

We welcome community contributions to **benchdnn** as well as Intel MKL-DNN.
If you have any ideas or issues please submit an issue or pull request. For
clarity please include "benchdnn: " in the title.


## Inspiration

bench{yet another 3 letters where the first one equals the second}...