
   1 # benchdnn
   2
**benchdnn** is a standalone correctness and performance benchmark for
[Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN)](/intel/mkl-dnn).
The purpose of the benchmark is extended and robust correctness verification of
the primitives provided by Intel MKL-DNN. Currently, **benchdnn** supports
convolution, inner product, reorder, batch normalization, deconvolution,
recurrent neural network (RNN), and shuffle primitives of different data types.
   8
   9
  10 ## License
  11 **benchdnn** is licensed under
  12 [Apache License Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).
  13
  14
  15 ## Usage (main driver)
  16
  17 **benchdnn** itself is a driver for different implementation-specific
  18 harnesses. So far it uses a harness for Intel MKL-DNN [convolution](/tests/benchdnn/README.md#usage-convolution-harness), [inner product](/tests/benchdnn/README.md#usage-ip-harness),
  19 [reorder](/tests/benchdnn/README.md#usage-reorder-harness), [batch normalization](/tests/benchdnn/README.md#usage-batch-normalization-harness), [deconvolution](/tests/benchdnn/README.md#usage-deconvolution-harness), [shuffle](/tests/benchdnn/README.md#usage-shuffle-harness), and [recurrent neural network](/tests/benchdnn/README.md#usage-rnn-harness) as well as a
  20 harness for testing [itself](/tests/benchdnn/README.md#usage-self-harness).
  21
Usage:
```
    $ ./benchdnn [--HARNESS] [--mode=MODE] [--max-ms-per-prb=MAX-MS-PER-PRB] [-vN|--verbose=N] HARNESS-OPTS
```
where:

 - `HARNESS` is one of `conv` (default), `ip`, `shuffle`, `reorder`, `bnorm`, `deconv`, `rnn`, or `self`
 - `MODE` is a string of flags that select the benchmark mode: `C` or `c` for correctness (default), `P` or `p` for performance
 - `MAX-MS-PER-PRB` is the maximum time spent per problem, in milliseconds; default `3e3`
 - `-vN|--verbose=N` sets the verbosity level; default `0`
 - `HARNESS-OPTS` are passed to the chosen harness
  36
  37 Returns `0` on success (all tests passed) or non-zero in case of any error.
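Because the driver reports success through its exit status, a wrapper script can gate on that status directly. The sketch below is a minimal illustration; `check_run` is a hypothetical helper name (not part of benchdnn) and works with any command.

```shell
# check_run is a hypothetical helper (not part of benchdnn): it runs the
# given command and prints PASS or FAIL based only on its exit status.
check_run() {
    if "$@"; then
        echo "PASS"
    else
        echo "FAIL"
        return 1
    fi
}

# Example invocation, using only options documented in this README:
#   check_run ./benchdnn --conv --cfg=f32 --dir=FWD_B --batch=inputs/conv_all
```

In a CI script, the non-zero return value of `check_run` can be used to stop the job on the first failing batch.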
  38
  39 ## Notations / Glossary / Abbreviations
  40
  41 |Abbreviation   | Description
  42 |:---           |:---
  43 | src           | Source image (input image for forward convolution)
  44 | wei           | Weights (aka filter)
  45 | bia           | Bias
  46 | dst           | Destination image (output image for forward convolution)
  47 | acc           | Accumulation (typically in terms of data type)
  48 | ic, oc        | Input/Output channels (aka feature maps)
  49 | ih, iw        | Input height and width
  50 | oh, ow        | Output height and width
  51 | kh, kw        | Kernel (filter, weights) height and width
  52 | sh, sw        | Convolution stride over height and width
  53 | ph, pw        | Convolution top and left padding
  54 | mb            | Minibatch (amount of images processed at once)
  55 | g             | Groups (a way to reduce the amount of computations, see Alexnet topology)
  56 | FWD_{D,B}     | forward w/o and w/ bias
  57 | BWD_{D,W,WB}  | backward wrt data, weights, and weights and bias
  58 | DIRECT, WINO  | convolution algorithm: direct or Winograd based
  59 | AUTO          | convolution algorithm is chosen by MKL-DNN for best performance
  60
  61
  62 ## Usage (convolution harness)
  63
  64 ```
  65     [harness-knobs] [conv-desc] ...
  66 ```
  67
  68 where *harness-knobs* are:
  69
 - `--cfg={f32, u8s8u8s32, ...}` configuration (see [convolution configuration](/tests/benchdnn/README.md#convolution-configurations-also-known-as-precision-specification) below), default `f32`
 - `--dir={FWD_D (forward data), FWD_B (forward data + bias), FWD_I (forward data inference), BWD_D (backward data), BWD_W (backward weights), BWD_WB (backward weights + bias)}` direction, default `FWD_B`
 - `--alg={DIRECT, WINO, AUTO}` convolution algorithm, default `DIRECT`
 - `--attr="attr_str"` convolution attributes (see the description of *attr_str* below), default `""` (no attributes set)
 - `--mb=N` override the minibatch that is specified in the convolution description, default `0` (use mb specified in conv-desc)
 - `--match=regex` check only the convolutions that match the regex, default is `".*"`. Note: on Windows, string arguments may need to be enclosed in double quotation marks.
 - `--skip-impl="str1[:str2]..."` skip implementation (see mkldnn_query_impl_info_str), default `""`
 - `--allow-unimpl=true|false` do not treat an unimplemented configuration as an error, default `false`
 - `--perf-template=template-str` set the template for the performance report (see section *Performance measurements*)
 - `--reset` reset all previously set parameters to their defaults
 - `-vN|--verbose=N` verbose level, default `0`
 - `--batch=file` use options from the given file (see the `inputs` subdirectory)
 - `--mode=MODE` string of flags that select the benchmark mode: `C` or `c` for correctness (default), `P` or `p` for performance
  83
  84 and *conv-desc* is the convolution description. The canonical form is:
  85 ```
  86     gXmbXicXihXiwXocXohXowXkhXkwXshXswXphXpwXdhXdwXnS
  87 ```
  88 Here X is a number and S is a string (n stands for name). Some of the parameters
  89 may be omitted if a default exists (for example, if g is not specified
  90 **benchdnn** uses 1) or if it can be computed automatically (for example, the output shape
  91 can be derived from the input one and the kernel). Also, if either width or height
  92 is not specified, it is assumed that height == width. The special symbol `_` is
  93 ignored, so it may be used as a delimiter. See `str2desc()` in conv/conv_aux.cpp
  94 for more details and implicit rules.
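As an illustration of how an omitted output size can be derived, the sketch below applies the standard convolution output-size arithmetic without dilation. The helper name `conv_out_size` and its defaults (stride 1, padding 0) are assumptions made for this example; the authoritative rules, including dilation, are in `str2desc()`.

```shell
# conv_out_size IN KERNEL [STRIDE] [PAD]
# Hypothetical helper: standard convolution output size, no dilation.
# The defaults STRIDE=1 and PAD=0 are assumptions of this sketch.
conv_out_size() {
    isz=$1; ksz=$2; stride=${3:-1}; pad=${4:-0}
    echo $(( ($isz - $ksz + 2 * $pad) / $stride + 1 ))
}

conv_out_size 227 11 4 0   # ih227, kh11, sh4, ph0 -> prints 55
conv_out_size 610 3        # ih610, kh3            -> prints 608
```

Both results agree with the `oh` values that appear next to these sizes elsewhere in this README (`oh55` and `oh608`).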
  95
  96 The attribute string *attr_str* is defined as follows (line breaks are for readability):
  97 ```
  98     [irmode={nearest,down};]
  99     [oscale={none,common,per_oc}[:scale];]
 100     [post_ops='[{relu,sum[:sum_scale]};]...';]
 101 ```
 102
 103 Here `irmode` defines the rounding mode for integer output (default is nearest).
 104
Next, `oscale` stands for output_scales. The first parameter is the policy, as
defined below. The second, optional parameter is a scale that specifies
either the single common output scale (for the `none` and `common` policies) or a
starting point for the `per_oc` policy, which uses many scales. The default scale
is 1.0. Known policies are:
 110
 111   - `none` (default) means no output scales set (i.e. scale = 1.)
 112   - `common` corresponds to `mask=0` with common scale factor
 113   - `per_oc` corresponds to `mask=1<<1` (i.e. output channels) with different scale factors
 114
 115 Next, `post_ops` stands for post operation sequence. Currently supported post
 116 operations are:
 117
 118   - `relu` with no parameters (i.e. corresponding scale is 1., alg = eltwise_relu, alpha = beta = 0.)
 119   - `sum` with optional parameter scale (default 1.)
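When attribute strings are generated by a script, it can help to assemble them from validated pieces. The following sketch covers only the `irmode` and `oscale` parts of the grammar above; `make_attr` is a hypothetical helper, not something benchdnn provides.

```shell
# make_attr IRMODE POLICY [SCALE]
# Hypothetical helper: builds "irmode=...;oscale=...[:scale]" and rejects
# values that the attr_str grammar does not list.
make_attr() {
    case $1 in
        nearest|down) ;;
        *) echo "bad irmode: $1" >&2; return 1 ;;
    esac
    case $2 in
        none|common|per_oc) ;;
        *) echo "bad oscale policy: $2" >&2; return 1 ;;
    esac
    attr="irmode=$1;oscale=$2"
    # The scale is optional; append it only when one was given.
    if [ -n "${3:-}" ]; then
        attr="$attr:$3"
    fi
    echo "$attr"
}

# Possible use with documented options:
#   ./benchdnn --conv --cfg=u8s8u8s32 --dir=FWD_D \
#       --attr="$(make_attr down common .5)" --batch=inputs/conv_all
```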
 120
 121 ### Convolution configurations (also known as precision specification)
 122
The `--cfg` option specifies the data types used by the convolution. It also
determines how the tensors are filled with test data. For integer types,
saturation is implied.

Finally, the configuration defines the threshold for computation errors (ideally
it stays at 0, which holds so far).
 129
 130 The table below shows cases supported by Intel MKL-DNN and corresponding
 131 configurations for **benchdnn**:
 132
 133 |src type | wei type | dst type | acc type | cfg          | notes
 134 |:---     |:---      |:---      |:---      |:---          |:---
 135 | f32     | f32      | f32      | f32      | f32          | inference optimized for sse4.2+, training avx2+
 136 | s16     | s16      | s32      | s32      | s16s16s32s32 | optimized for processors with support of 4vnni, forward pass only (aka FWD_D, FWD_B)
 137 | s32     | s16      | s16      | s32      | s32s16s16s32 | optimized for processors with support of 4vnni, backward wrt data only (aka BWD_D)
 138 | s16     | s32      | s16      | s32      | s16s32s16s32 | optimized for processors with support of 4vnni, backward wrt weights (aka BWD_W, BWD_WB)
 139 | u8      | s8       | f32      | s32      | u8s8f32s32   | optimized for processors with support of avx512vl, forward pass only (aka FWD_D, FWD_B)
 140 | u8      | s8       | s32      | s32      | u8s8s32s32   | same notes as for u8s8f32s32
 141 | u8      | s8       | s8       | s32      | u8s8s8s32    | same notes as for u8s8f32s32
 142 | u8      | s8       | u8       | s32      | u8s8u8s32    | same notes as for u8s8f32s32
 143 | s8      | s8       | f32      | s32      | s8s8f32s32   | same notes as for u8s8f32s32
 144 | s8      | s8       | s32      | s32      | s8s8s32s32   | same notes as for u8s8f32s32
 145 | s8      | s8       | s8       | s32      | s8s8s8s32    | same notes as for u8s8f32s32
 146 | s8      | s8       | u8       | s32      | s8s8u8s32    | same notes as for u8s8f32s32
 147
 148
 149 ### Performance measurements (convolution harness)
 150
 151 **benchdnn** supports a custom performance report. A template is passed via the
 152 command line and consists of terminal and nonterminal symbols. Nonterminal
 153 symbols are printed as-is. A description of terminal symbols is given below.
 154 There is also a notion of modifiers (marked with @) that change the meaning of
 155 terminal symbols; for example, the sign '-' means minimum of (in terms of time).
 156 See the table of modifiers below.
 157
 158 > **Caution:** Threads must be pinned in order to get consistent frequency.
 159
 160 | Abbreviation  | Description
 161 |:------------  |:-----------
 162 | %d            | problem descriptor
 163 | %D            | expanded problem descriptor (conv parameters in csv format)
 164 | %n            | problem name
 165 | %z            | direction
 166 | %@F           | effective cpu frequency computed as clocks[@] / time[@]
 167 | %O            | number of ops required (padding is not taken into account)
 168 | %@t           | time in ms
 169 | %@c           | time in clocks
 170 | %@p           | ops per second
 171
 172 | Modifier  | Description
 173 |:--------  |:-----------
 174 |           | default
 175 | -         | min (time) -- default
 176 | 0         | avg (time)
 177 | +         | max (time)
 178 |           |
 179 | K         | Kilo (1e3)
 180 | M         | Mega (1e6)
 181 | G         | Giga (1e9)
 182
 183 The definition of expanded problem descriptor is:
 184 `g,mb,ic,ih,iw,oc,oh,ow,kh,kw,sh,sw,ph,pw`.
 185
 186 The default template can be found in conv/bench_conv.cpp and is defined as
 187 `perf,%n,%d,%GO,%GF,%-t,%-Gp,%0t,%0Gp`. That will produce the following output
 188 in CSV format:
```
string: perf
convolution name
full conv-desc
number of giga ops calculated
effective cpu frequency in GHz (computed as clocks[min] / time[min])
minimum time spent in ms
best gigaops (since it corresponds to minimum time)
average time spent in ms
average gigaops (since it corresponds to average time)
```
Here is an example of the performance output:
```
 perf,"yolov2:conv1",mb16ic3ih610oc32oh608kh3n"yolov2:conv1",10.2205,0,43.9827,232.375,58.0146,176.171
```
The full convolution descriptor is `mb16ic3ih610oc32oh608kh3n"yolov2:conv1"` in the above example.
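The numbers in the example line can be cross-checked by hand. The sketch below assumes two operations per multiply-accumulate and the op count `2 * mb * oc * (ic / g) * oh * ow * kh * kw`; this formula is a reconstruction that happens to reproduce the example values, not something taken from the benchdnn sources. The helper names are hypothetical.

```shell
# conv_gops MB IC OC OH OW KH KW [G]
# Hypothetical helper: giga-ops of a convolution, padding ignored.
conv_gops() {
    awk -v mb="$1" -v ic="$2" -v oc="$3" -v oh="$4" -v ow="$5" \
        -v kh="$6" -v kw="$7" -v g="${8:-1}" \
        'BEGIN { printf "%.4f\n", 2 * mb * oc * (ic / g) * oh * ow * kh * kw / 1e9 }'
}

# perf_rate GOPS TIME_MS
# Giga-ops per second for a given op count and time in milliseconds.
perf_rate() {
    awk -v gops="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", gops / (ms / 1000) }'
}

conv_gops 16 3 32 608 608 3 3   # sizes from mb16ic3ih610oc32oh608kh3 -> 10.2205
perf_rate 10.2205 43.9827       # min time from the example -> 232.4
```

The first value matches the `%GO` field of the example and the second matches its `%-Gp` field after rounding.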
 205
 206 ### Examples (convolution harness)
 207
 208 Run the set of f32 forward convolutions from inputs/conv_all file w/ bias and default minibatch:
 209 ```
 210     $ ./benchdnn --conv \
 211         --cfg=f32 --dir=FWD_B --batch=inputs/conv_all
 212 ```
 213
 214 Run the same but with post_ops ReLU:
 215 ```
 216     $ ./benchdnn --conv \
 217         --cfg=f32 --dir=FWD_B --attr="post_ops='relu'" --batch=inputs/conv_all
 218 ```
 219
 220 Run the same as previous but also measure performance:
 221 ```
 222     $ ./benchdnn --conv  --mode=CORRnPERF \
 223         --cfg=f32 --dir=FWD_B --attr="post_ops='relu'" --batch=inputs/conv_all
 224 ```
 225
 226 > **Note**: Instead of `CORRnPERF`, one can use `CP`, `PC`, `cp`, or `pc`
 227
 228 Run a set of f32 backward convolutions wrt weights with kh=3 and
 229 verbose level set to 2:
 230 ```
 231     $ ./benchdnn --conv -v2 \
 232         --cfg=f32 --dir=BWD_W --match='.*kh3[^0-9].*' --batch=inputs/conv_all
 233 ```
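To see why the pattern ends in `[^0-9]`, it can be previewed outside benchdnn. The sketch below assumes the pattern behaves like an extended regular expression searched against the descriptor text, so `grep -E` is used as a stand-in; the three descriptors are made up for illustration.

```shell
# preview_match REGEX
# Hypothetical helper: shows which of three made-up descriptors a pattern
# would select, using grep -E as a stand-in for benchdnn's matcher.
preview_match() {
    printf '%s\n' \
        'ic16ih32oc32kh3n"a"' \
        'ic16ih32oc32kh11n"b"' \
        'ic16ih32oc32kh31n"c"' | grep -E "$1"
}

preview_match '.*kh3[^0-9].*'   # prints only ic16ih32oc32kh3n"a"
```

Without the trailing `[^0-9]`, the descriptor with `kh31` would also be selected.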
 234
Run a set of u8s8u8s32 backward convolutions wrt data but skip all
the convolutions that will use reference or gemm-based implementations:
```
    $ ./benchdnn --conv \
        --cfg=u8s8u8s32 --dir=BWD_D --skip-impl='ref:gemm' --batch=inputs/conv_all
```
 241
 242 Run explicitly specified 1st forward convolution (including bias) from Alexnet
 243 with the minibatch set to 4, verbose level set to 1 for two given
 244 configurations (`u8s8u8s32` and `f32`):
 245 ```
 246     $ ./benchdnn --conv -v1 \
 247         --mb=4 --dir=FWD_B \
 248         --cfg=u8s8u8s32 ic3ih227iw227_oc96oh55ow55_kh11kw11_sh4sw4ph0pw0_n"alexnet:conv1" \
 249         --cfg=f32 ic3ih227iw227_oc96oh55ow55_kh11kw11_sh4sw4ph0pw0_n"alexnet:conv1"
 250 ```
 251
 252 Run batch file for different algorithms (assuming the file specifies only
 253 convolutions and does not include harness options that would override any
 254 passed on the command line). Also ignore mkldnn_unimplemented errors in case of
 255 Winograd:
 256 ```
 257     $ ./benchdnn --conv \
 258         --alg=DIRECT --batch=convs.in \
 259         --allow-unimpl=true \
 260         --alg=WINO   --batch=convs.in \
 261         --alg=AUTO   --batch=convs.in
 262 ```
 263
 264 Run a set of u8s8u8s32 forward convolutions without bias, skipping
 265 reference implementations and not triggering unimplemented as an error, with
 266 one common output scale set to 0.5 with rounding mode set to down
 267 (via attributes):
 268 ```
 269     $ ./benchdnn --conv \
 270         --cfg=u8s8u8s32 --dir=FWD_D --skip-impl="ref" --allow-unimpl=true \
 271         --attr="irmode=down;oscale=common:.5" --batch=inputs/conv_all
 272 ```
 273
 274
 275
 276 ## Usage (batch normalization harness)
 277
 278 ```
 279     ./benchdnn --bnorm [harness-knobs] bnorm-desc ...
 280 ```
 281
 282 where *harness-knobs* are:
 283
 284  - `--mb=N` override minibatch that is specified in batch normalization description, default `0` (use mb specified in bnorm-desc)
 285  - `--dir={FWD_D (forward data /training), FWD_I (forward data /inference), BWD_D (backward data), BWD_DW (backward data + weights)}` direction, default `FWD_D`
 286  - `--dt={f32, s32, ...}` base data type, default `f32`
 287  - `--fmt={nchw, nChw16c, ...}` data layout, default `nchw`
 288  - `--flags=[|G|S|R]` batch normalization flags, default `none` (G -- global stats, S -- use scale shift, R -- fuse with ReLU)
 - `--attr="attr_str"` attributes (see the convolution section above), default `""` (no attributes set)
 - `--match=regex` check only the bnorms that match the regex, default is `".*"`. Note: on Windows, string arguments may need to be enclosed in double quotation marks.
 - `--skip-impl="str1[:str2]..."` skip implementation (see mkldnn_query_impl_info_str), default `""`
 - `--perf-template=template-str` set the template for the performance report (very similar to the convolution one)
 - `--reset` reset all previously set parameters to their defaults
 - `-vN|--verbose=N` verbose level, default `0`
 - `--batch=file` use options from the given file (see the `inputs` subdirectory)
 296
 297 and *bnorm-desc* is a batch normalization description. The canonical form is:
 298 ```
 299     mbXicXidXihXiwXepsYnS
 300 ```
Here X is an integer number, Y is a real number, and S is a string (n stands for
name). The special symbol `_` is ignored, so it may be used as a delimiter. There are
some implicit rules:

 - if mb is omitted, mb is set to 2
 - if iw is omitted, iw is set to ih (and vice versa)
 - if eps is omitted, eps is set to 1./16
 309
 310 ### Performance measurements (batch normalization harness)
 311
 312 **benchdnn** supports a custom performance report. A template is passed via the
 313 command line and consists of terminal and nonterminal symbols. Nonterminal
 314 symbols are printed as-is. A description of terminal symbols is given below.
 315 There is also a notion of modifiers (marked with @) that change the meaning of
 316 terminal symbols; for example, the sign '-' means minimum of (in terms of time). See the
 317 table of modifiers below.
 318
 319 > **Caution:** Threads must be pinned in order to get consistent frequency.
 320
 321 | abbreviation  | description
 322 |:------------  |:-----------
 323 | %d            | problem descriptor
 324 | %D            | expanded problem descriptor (parameters in csv format)
 325 | %n            | problem name
 326 | %z            | direction
 327 | %f            | flags
 328 | %q            | data type (precision)
 329 | %f            | data format (layout)
 330 | %@t           | time in ms
 331
 332 The definition of expanded problem descriptor is: `mb,ic,id,ih,iw,eps`.
 333
 334 The default template can be found in bnorm/bench_bnorm.cpp and is defined as
 335 `perf,%n,%z,%f,%q,%f,%D,%-t,%0t`. That will produce the following output
 336 in CSV format:
 337 ```
 338 string: perf
 339 bnorm name
 340 direction
 341 batch normalization flags
 342 base data type
 343 batch normalization flags
 344 expanded bnorm problem descriptor
 345 minimum time spent in ms
 346 average time spent in ms
 347 ```
 348 Here is an example of performance output:
 349 ```
 350 perf,"resnet_50:bn_conv1",FWD_D,,f32,,50,64,1,112,112,0.0625,10.7729,77.1917
 351 ```
The expanded bnorm problem descriptor is `50,64,1,112,112,0.0625` in the above example.
 353
 354 ### Examples (batch normalization harness)
 355
 356 Run the set of bnorms from inputs/bnorm/bnorm_resnet_50 file with default minibatch:
 357 ```
 358     $ ./benchdnn --bnorm \
 359          --batch=inputs/bnorm/bnorm_resnet_50
 360 ```
 361
 362 Run the same as previous but also measure performance:
 363 ```
 364     $ ./benchdnn --bnorm --mode=CORRnPERF \
 365          --batch=inputs/bnorm/bnorm_resnet_50
 366 ```
 367
 368
 369 ## Usage (rnn harness)
 370
 371 ```
 372     ./benchdnn --rnn [harness-knobs] [rnn-desc] ...
 373 ```
 374
 375 where *harness-knobs* are:
 376
 - `--prop={FWD_D (forward data), BWD_DW (backward data + weights)}` propagation direction, default `FWD_D`
 - `--alg={VANILLA_RNN, VANILLA_LSTM, VANILLA_GRU, LBR_GRU}` algorithm, default `VANILLA_RNN`
 - `--direction={left2right, right2left, concat, sum}` direction, default `left2right`
 - `--activation={RELU, LOGISTIC, TANH}` activation, default `RELU`
 - `--reset` reset all previously set parameters to their defaults
 - `--batch=file` use options from the given file (see the `inputs` subdirectory)
 383
and *rnn-desc* is the RNN description. The canonical form is:
```
 lXtXmbXsicXslcXdicXdlcXnS
```
Here X is a number and S is a string (n stands for name). Some implicit rules:

 - default values: l = 1, t = 1, mb = 2, S="wip"
 - if slc/dlc/dic is undefined => slc/dlc/dic = sic

See `str2desc()` in rnn/rnn_aux.cpp for more details and implicit rules.
 395
 396 ### Performance measurements (rnn harness)
 397
 398
Running rnn in performance measurement mode produces the following output
in CSV format:
 401 ```
 402 string: perf
 403 algorithm
 404 activation function
 405 direction
 406 expanded rnn problem descriptor
 407 name
 408 time spent in ms
 409 minimum time spent in ms
 410 maximum time spent in ms
 411 average time spent in ms
 412 ```
 413 Here is an example of performance output:
 414 ```
 415 perf,VANILLA_RNN,RELU,left2right,l1t1mb128sic512slc512dic512dlc512n""GNMT_enc-training"",time(ms):min=68.0007,max=176.006,avg=91.2686
 416 ```
The expanded rnn problem descriptor is `l1t1mb128sic512slc512dic512dlc512n` in the above example.
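Unlike the other harnesses, the rnn line carries its timings as `name=value` pairs, so splitting on commas alone is awkward. A minimal sketch for pulling out the three timings follows; `rnn_times` is a hypothetical helper and the field layout is taken from the example line above.

```shell
# rnn_times: read rnn perf lines on stdin and print "min max avg" (in ms).
# Hypothetical helper; lines without a time(ms) field are skipped.
rnn_times() {
    sed -n 's/.*time(ms):min=\([0-9.]*\),max=\([0-9.]*\),avg=\([0-9.]*\).*/\1 \2 \3/p'
}

# Possible use with documented options:
#   ./benchdnn --rnn --mode=CORRnPERF --batch=inputs/rnn/rnn_training | rnn_times
```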
 418
 419 ### Examples (rnn harness)
 420
 421 Run the set of rnn training from inputs/rnn/rnn_training file with default minibatch:
 422 ```
 423     $ ./benchdnn --rnn \
 424          --batch=inputs/rnn/rnn_training
 425 ```
 426
 427 Run the same as previous but also measure performance:
 428 ```
 429     $ ./benchdnn --rnn --mode=CORRnPERF \
 430          --batch=inputs/rnn/rnn_training
 431 ```
 432
 433
 434 ## Usage (deconvolution harness)
 435
 436 ```
 437     ./benchdnn --deconv [harness-knobs] [deconv-desc] ...
 438 ```
 439
 440 where *harness-knobs* are:
 441
 - `--cfg={f32, u8s8u8s32, ...}` configuration (see the [convolution configuration](/tests/benchdnn/README.md#convolution-configurations-also-known-as-precision-specification) section above), default `f32`
 - `--match=regex` check only the deconvolutions that match the regex, default is `".*"`. Note: on Windows, string arguments may need to be enclosed in double quotation marks.
 - `--mb=N` override the minibatch that is specified in the deconvolution description, default `0` (use mb specified in deconv-desc)
 - `--dir={FWD_D (forward data), FWD_B (forward data + bias), FWD_I (forward data inference), BWD_D (backward data), BWD_W (backward weights), BWD_WB (backward weights + bias)}` direction, default `FWD_B`
 - `--alg={DIRECT, WINO, AUTO}` deconvolution algorithm, default `DIRECT`
 - `--attr="attr_str"` deconvolution attributes (see the convolution section above), default `""` (no attributes set)
 - `--skip-impl="str1[:str2]..."` skip implementation (see mkldnn_query_impl_info_str), default `""`
 - `--allow-unimpl=true|false` do not treat an unimplemented configuration as an error, default `false`
 - `--perf-template=template-str` set the template for the performance report (see section *Performance measurements*)
 - `--mode=MODE` string of flags that select the benchmark mode: `C` or `c` for correctness (default), `P` or `p` for performance
 - `--reset` reset all previously set parameters to their defaults
 - `-vN|--verbose=N` verbose level, default `0`
 - `--batch=file` use options from the given file (see the `inputs` subdirectory)
 455
and *deconv-desc* is the deconvolution description. The canonical form is:
```
    gXmbXicXihXiwXocXohXowXkhXkwXshXswXphXpwXdhXdwXnS
```
Here X is a number and S is a string (n stands for name). Some of the parameters
may be omitted if a default exists (for example, if g is not specified
**benchdnn** uses 1) or if they can be computed automatically (for example, the output shape
can be derived from the input one and the kernel). Also, if either width or height
is not specified, it is assumed that height == width. The special symbol `_` is
ignored, so it may be used as a delimiter. See `str2desc()` in conv/conv_aux.cpp
for more details and implicit rules.
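For deconvolution the shape relation runs in the opposite direction to convolution. The sketch below uses the standard transposed-convolution arithmetic without dilation; `deconv_out_size` and its defaults (stride 1, padding 0) are assumptions of this example, and the exact rules remain those of `str2desc()`.

```shell
# deconv_out_size IN KERNEL [STRIDE] [PAD]
# Hypothetical helper: standard deconvolution output size, no dilation.
deconv_out_size() {
    isz=$1; ksz=$2; stride=${3:-1}; pad=${4:-0}
    echo $(( ($isz - 1) * $stride + $ksz - 2 * $pad ))
}

deconv_out_size 55 11 4 0   # ih55, kh11, sh4, ph0 -> prints 227
```

The result agrees with the `ih55 ... oh227kh11sh4` descriptor used in the deconvolution performance example of this README.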
 467
 468
 469 ### Performance measurements (deconvolution harness)
 470
**benchdnn** supports a custom performance report. For details, see the [performance measurements section of the convolution harness](/tests/benchdnn/README.md#performance-measurements-convolution-harness).
 472
 473 The default template can be found in conv/bench_deconv.cpp and is defined as
 474 `perf,%n,%d,%GO,%GF,%-t,%-Gp,%0t,%0Gp`. That will produce the following output
 475 in CSV format:
```
string: perf
deconvolution name
full deconv-desc
number of giga ops calculated
effective cpu frequency in GHz (computed as clocks[min] / time[min])
minimum time spent in ms
best gigaops (since it corresponds to minimum time)
average time spent in ms
average gigaops (since it corresponds to average time)
```
Here is an example of the performance output:
```
 perf,"alexnet:deconv1",mb256ic96ih55oc3oh227kh11sh4n"alexnet:deconv1",2.9733,0,249.474,11.9183,307.702,9.66291
```
The full deconvolution descriptor is `mb256ic96ih55oc3oh227kh11sh4n"alexnet:deconv1"` in the above example.
 492
 493 ### Examples (deconvolution harness)
 494
 495 Run the set of f32 forward deconvolutions from inputs/deconv_all file w/ bias and default minibatch:
 496 ```
 497     $ ./benchdnn --deconv \
 498         --cfg=f32 --dir=FWD_B --batch=inputs/deconv_all
 499 ```
 500
 501 Run the same as previous but also measure performance:
 502 ```
 503     $ ./benchdnn --deconv  --mode=CORRnPERF \
 504         --cfg=f32 --dir=FWD_B  --batch=inputs/deconv_all
 505 ```
 506
 507 ## Usage (ip harness)
 508
 509 ```
 510     ./benchdnn --ip [harness-knobs] [ip-desc] ...
 511 ```
 512
 513 where *harness-knobs* are:
 514
 - `--cfg={f32, u8s8u8s32, ...}` configuration (see the [convolution configuration](/tests/benchdnn/README.md#convolution-configurations-also-known-as-precision-specification) section above), default `f32`
 - `--mb=N` override the minibatch that is specified in the ip description, default `0` (use mb specified in ip-desc)
 - `--dir={FWD_D (forward data), FWD_B (forward data + bias), FWD_I (forward data inference), BWD_D (backward data), BWD_W (backward weights), BWD_WB (backward weights + bias)}` direction, default `FWD_B`
 - `--attr="attr_str"` ip attributes (see the convolution section above), default `""` (no attributes set)
 - `--allow-unimpl=true|false` do not treat an unimplemented configuration as an error, default `false`
 - `--perf-template=template-str` set the template for the performance report (see section *Performance measurements*)
 - `--mode=MODE` string of flags that select the benchmark mode: `C` or `c` for correctness (default), `P` or `p` for performance
 - `--reset` reset all previously set parameters to their defaults
 - `-vN|--verbose=N` verbose level, default `0`
 - `--batch=file` use options from the given file (see the `inputs` subdirectory)
 525
and *ip-desc* is the inner product description. The canonical form is:
```
    mbXicXidXihXiwXocXnS
```
Here X is a number and S is a string (n stands for name).
The special symbol `_` is ignored, so it may be used as a delimiter.
Some implicit rules:

 - default values: mb = 2, id = 1, S="wip"
 - if ih is undefined => ih = iw
 - if iw is undefined => iw = ih

See `str2desc()` in ip/ip_aux.cpp for more details and implicit rules.
 541
 542 ### Performance measurements (ip harness)
 543
 544 **benchdnn** supports a custom performance report. A template is passed via the
 545 command line and consists of terminal and nonterminal symbols. Nonterminal
 546 symbols are printed as-is. A description of terminal symbols is given below.
 547 There is also a notion of modifiers (marked with @) that change the meaning of
 548 terminal symbols; for example, the sign '-' means minimum of (in terms of time). See the
 549 table of modifiers below.
 550
 551 > **Caution:** Threads must be pinned in order to get consistent frequency.
 552
 553 | abbreviation  | description
 554 |:------------  |:-----------
 555 | %d            | problem descriptor
 556 | %D            | expanded problem descriptor (parameters in csv format)
 557 | %n            | problem name
 558 | %z            | direction
 559 | %f            | flags
 560 | %q            | data type (precision)
 561 | %f            | data format (layout)
 562 | %@t           | time in ms
 563
 564 The definition of expanded problem descriptor is: `mb,oc,ic,id,ih,iw`.
 565
The default template can be found in ip/bench_ip.cpp and is defined as
`perf,%D,%n,%z,%q,%-t,%-Gp,%0t,%0Gp`. That will produce the following output
in CSV format:
```
string: perf
expanded ip problem descriptor
name
direction
data type
minimum time spent in ms
best gigaops (since it corresponds to minimum time)
average time spent in ms
average gigaops (since it corresponds to average time)
```

Here is an example of the performance output:
```
perf,112,1000,2048,1,1,1,"resnet:ip1",FWD_B,f32,3.99976,114.695,19.0323,24.1039
```
The expanded ip problem descriptor is `112,1000,2048,1,1,1` in the above example.
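The rate in the example line can be reproduced from the expanded descriptor and the minimum time. The sketch below assumes two operations per multiply-accumulate, giving `2 * mb * oc * ic * id * ih * iw` operations; this is a reconstruction that matches the example, not a formula quoted from the benchdnn sources, and `ip_gops_per_sec` is a hypothetical helper.

```shell
# ip_gops_per_sec MB OC IC ID IH IW TIME_MS
# Hypothetical helper: giga-ops per second of an inner product, with
# arguments in the order of the expanded descriptor (mb,oc,ic,id,ih,iw).
ip_gops_per_sec() {
    awk -v mb="$1" -v oc="$2" -v ic="$3" -v id="$4" -v ih="$5" -v iw="$6" \
        -v ms="$7" \
        'BEGIN {
            ops = 2 * mb * oc * ic * id * ih * iw
            printf "%.1f\n", ops / 1e9 / (ms / 1000)
        }'
}

ip_gops_per_sec 112 1000 2048 1 1 1 3.99976   # prints 114.7
```

The result agrees with the `114.695` reported in the example after rounding to one decimal place.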
 586
 587 ### Examples (ip harness)
 588
 589 Run the set of ip from inputs/ip/ip_all file with default minibatch:
 590 ```
 591     $ ./benchdnn --ip \
 592          --batch=inputs/ip/ip_all
 593 ```
 594
 595 Run the same as previous but also measure performance:
 596 ```
 597     $ ./benchdnn --ip --mode=CORRnPERF \
 598          --batch=inputs/ip/ip_all
 599 ```
 600
 601 ## Usage (shuffle harness)
 602
 603 ```
 604     ./benchdnn --shuffle [harness-knobs]  [dim]...
 605 ```
 606
 607 where *harness-knobs* are:
 608
 - `--match=regex` check only the shuffles that match the regex, default is `".*"`. Note: on Windows, string arguments may need to be enclosed in double quotation marks.
 - `--dir={FWD_D (forward data), FWD_B (forward data + bias), FWD_I (forward data inference), BWD_D (backward data), BWD_W (backward weights), BWD_WB (backward weights + bias)}` direction, default `FWD_B`
 - `--dt={f32, s32, ...}` base data type, default `f32`
 - `--fmt={nchw, nChw16c, ...}` data layout, default `nchw`
 - `--axis=N` axis, default `1`
 - `--group=N` group size, default `1`
 - `--mode=MODE` string of flags that select the benchmark mode: `C` or `c` for correctness (default), `P` or `p` for performance
 - `-vN|--verbose=N` verbose level, default `0`
 - `--batch=file` use options from the given file (see the `inputs` subdirectory)
 618
and *dim* is the shuffle problem description (the tensor dimensions). The canonical form is:
```
    dxdxdxdxd
```
Here d is a number.
 624
 625 See `str2dims()` in shuffle/shuffle_aux.cpp for more details.
 626
 627 ### Performance measurements (shuffle harness)
 628
 629 **benchdnn** supports a custom performance report. A template is passed via the
 630 command line and consists of terminal and nonterminal symbols. Nonterminal
 631 symbols are printed as-is. A description of terminal symbols is given below.
 632 There is also a notion of modifiers (marked with @) that change the meaning of
 633 terminal symbols; for example, the sign '-' means minimum of (in terms of time). See the
 634 table of modifiers below.
 635
 636 > **Caution:** Threads must be pinned in order to get consistent frequency.
 637
 638 | Abbreviation  | Description
 639 |:------------  |:-----------
 640 | %d            | problem descriptor
 641 | %D            | expanded problem descriptor (parameters in csv format)
 642 | %z            | direction
 643 | %q            | data type (precision)
 644 | %f            | data format (layout)
 645 | %a            | axis
 646 | %g            | group size
 647 | %@t           | time in ms
 648
 649 The definition of expanded problem descriptor is: `dxdxdxdxd`.
 650
 651 The default template can be found in shuffle/bench_shuffle.cpp and is defined as
 652 `perf,%z,%q,%f,%D,%a,%g,%-t,%0t`. That will produce the following output
 653 in CSV format:
 654 ```
 655 string: perf
 656 direction
 657 data type
 658 data format
 659 expanded shuffle problem descriptor
 660 axis
 661 group size
 662 minimum time spent in ms
 663 average time spent in ms
 664 ```
Here is an example of the performance output:
```
perf,FWD_D,u8,nCdhw16c,1x272x2x56x56,4,4,11.6177,16.509
```
The expanded shuffle problem descriptor is `1x272x2x56x56` in the above example.
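Since the shuffle report contains only times, any throughput figure has to be derived from the descriptor. As a starting point, the sketch below computes the number of elements in a `dxdx...xd` descriptor; `dims_elems` is a hypothetical helper.

```shell
# dims_elems DIMS
# Hypothetical helper: product of the dimensions in a dxdx...xd descriptor.
dims_elems() {
    echo "$1" | awk -Fx '{ n = 1; for (i = 1; i <= NF; i++) n *= $i; printf "%d\n", n }'
}

dims_elems 1x272x2x56x56   # prints 1705984
```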

### Examples (shuffle harness)

Run the set of shuffle problems from the inputs/shuffle/test_shuffle_axis file with the default minibatch:
```
    $ ./benchdnn --shuffle \
         --batch=inputs/shuffle/test_shuffle_axis
```

Run the same set as above but also measure performance:
```
    $ ./benchdnn --shuffle --mode=CORRnPERF \
         --batch=inputs/shuffle/test_shuffle_axis
```

## Usage (reorder harness)

```
    ./benchdnn --reorder [harness-knobs] ...
```

where *harness-knobs* are:

 - `--idt={f32, s32, ...}` base input data type, default `f32`
 - `--odt={f32, s32, ...}` base output data type, default `f32`
 - `--dt={f32, s32, ...}` base data type, default `f32`
 - `--ifmt={nchw, nChw16c, ...}` input data layout, default `nchw`
 - `--ofmt={nchw, nChw16c, ...}` output data layout, default `nchw`
 - `--fmt={nchw, nChw16c, ...}` data layout, default `nchw`
 - `--def-scales={,,}` user-defined input scales, separated by `,`; for example: 0.125, 0.25, 0.5, 1, 2, 4, 8
 - `--attr="attr_str"` reorder attributes (see the section below), default `""` (no attributes set)
 - `--both-dir-dt=true|false`, default `false`
 - `--both-dir-fmt=true|false`, default `false`
 - `--allow-unimpl=true|false` do not treat unimplemented configuration as an error, default `false`
 - `--run` run the reorder benchmark
 - `--perf-template=template-str` set template for performance report (see section *Performance measurements*)
 - `--reset` reset all previously set parameters to their defaults
 - `--mode=` string that contains flags for benchmark mode. Use `C` or `c` for correctness (used by default), and `P` or `p` for performance
 - `-vN|--verbose=N` verbose level, default `0`
 - `--batch=file` use options from the given file (see the inputs/reorder subdirectory)

### Performance measurements (reorder harness)

**benchdnn** supports a custom performance report. A template is passed via the
command line and consists of terminal and nonterminal symbols. Nonterminal
symbols are printed as-is. The terminal symbols are described in the table below.
Modifiers (marked with `@`) change the meaning of a terminal symbol; for
example, the `-` modifier selects the minimum (in terms of time). See the
table of modifiers below.

> **Caution:** Threads must be pinned in order to get consistent frequency.

| Abbreviation  | Description
|:------------  |:-----------
| %d            | problem descriptor
| %D            | expanded problem descriptor (reorder parameters in csv format)
| %n            | dimensionality of the problem
| %@O           | number of elements being reordered
| %@t           | time in ms
| %@p           | elements per second

| Modifier  | Description
|:--------  |:-----------
|           | default
| -         | min (time) -- default
| 0         | avg (time)
| +         | max (time)
|           |
| K         | Kilo (1e3)
| M         | Mega (1e6)
| G         | Giga (1e9)

The definition of the expanded problem descriptor is:
`idt,odt,ifmt,ofmt,attrs,dims`.

The default template can be found in reorder/bench_reorder.cpp and is defined as
`perf,%n,%D,%O,%-t,%-Gp,%0t,%0Gp`. It produces the following output
in CSV format:
```
string: perf
dimensionality of the problem
expanded reorder problem descriptor
number of elements being reordered
minimum time spent in ms
best giga-elements per second (since it corresponds to minimum time)
average time spent in ms
average giga-elements per second (since it corresponds to average time)
```
Here is an example of performance output:
```
 perf,4,f32,f32,nchw,nchw,irmode=nearest;oscale=per_oc:0.125;post_ops='',2x64x3x3,1152,4.00244,0.000287824,24.0279,4.79442e-05
```
The expanded reorder problem descriptor is `f32,f32,nchw,nchw,irmode=nearest;oscale=per_oc:0.125;post_ops='',2x64x3x3` in the above example.
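The modifier table and the example line above can be cross-checked with a few lines of Python. This is only a sketch: `elems_per_second` is a hypothetical helper, not a **benchdnn** function, and it assumes that `%@p` is the element count divided by the selected time, scaled by the unit modifier. It also assumes the attribute string contains no commas, so the line splits into 13 fields.

```python
import math

# Hypothetical helpers, not part of benchdnn.
TIME_STATS = {
    "-": min,
    "0": lambda samples: sum(samples) / len(samples),
    "+": max,
}
UNITS = {"K": 1e3, "M": 1e6, "G": 1e9}


def elems_per_second(modifiers, n_elems, samples_ms):
    """Evaluate %@p under the assumptions stated above."""
    stat, unit = min, 1.0  # defaults from the table: min time, no scaling
    for ch in modifiers:
        if ch in TIME_STATS:
            stat = TIME_STATS[ch]
        elif ch in UNITS:
            unit = UNITS[ch]
        else:
            raise ValueError("unknown modifier: %r" % ch)
    seconds = stat(samples_ms) / 1e3  # times are reported in ms
    return n_elems / seconds / unit


line = ("perf,4,f32,f32,nchw,nchw,"
        "irmode=nearest;oscale=per_oc:0.125;post_ops='',"
        "2x64x3x3,1152,4.00244,0.000287824,24.0279,4.79442e-05")
fields = line.split(",")
ndims = int(fields[1])
idt, odt, ifmt, ofmt, attrs, dims = fields[2:8]
n_elems = int(fields[8])
min_ms, best_gp, avg_ms, avg_gp = (float(f) for f in fields[9:13])

shape = [int(d) for d in dims.split("x")]
print(len(shape) == ndims, math.prod(shape) == n_elems)
# The report only carries the already reduced times, so each one is passed
# as a single-sample list here.
print(elems_per_second("-G", n_elems, [min_ms]), best_gp)
print(elems_per_second("0G", n_elems, [avg_ms]), avg_gp)
```

Under these assumptions the recomputed rates agree with the reported `%-Gp` and `%0Gp` values to within the printed precision.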

### Examples (reorder harness)

Run the set of reorder problems from the inputs/reorder/test_default file with the default minibatch:
```
    $ ./benchdnn --reorder \
        --batch=inputs/reorder/test_default
```

Run the same set as above but also measure performance:
```
    $ ./benchdnn --reorder --mode=CORRnPERF \
        --batch=inputs/reorder/test_default
```
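When such runs are scripted, the command line can be assembled programmatically. The following Python sketch uses only the options shown in this README (`--HARNESS`, `--mode=`, `--batch=`); the helper names `benchdnn_cmd`, `perf_lines`, and `run_batch` are hypothetical, the binary path `./benchdnn` is taken from the examples above, and the assumption that report lines go to standard output and start with the literal `perf,` of the default templates has not been verified against the sources.

```python
import subprocess


def benchdnn_cmd(harness, batch, mode=None, binary="./benchdnn"):
    """Build the argument list for one batch run (hypothetical helper)."""
    cmd = [binary, "--" + harness]
    if mode is not None:
        cmd.append("--mode=" + mode)
    cmd.append("--batch=" + batch)
    return cmd


def perf_lines(output):
    """Keep only lines that look like default-template report lines."""
    stripped = (ln.strip() for ln in output.splitlines())
    return [ln for ln in stripped if ln.startswith("perf,")]


def run_batch(harness, batch, mode=None):
    """Run benchdnn and return its report lines (assumes they go to stdout)."""
    done = subprocess.run(benchdnn_cmd(harness, batch, mode),
                          capture_output=True, text=True, check=False)
    return perf_lines(done.stdout)


cmd = benchdnn_cmd("reorder", "inputs/reorder/test_default", mode="CORRnPERF")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues, which matters once attribute strings containing `;` or quotes are added.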

## Usage (self harness)

```
    ./benchdnn --self ...
```

Check enumeration types, attributes, flags, and descriptions.


## Installation

**benchdnn** is automatically built with Intel MKL-DNN. For convenience, you can
build **benchdnn** using cmake or make.


## Essence of convolution testing

Intel MKL-DNN supports different data types, such as single-precision floating
point (`mkldnn_f32`) and signed/unsigned integers of different lengths
(`mkldnn_{s,u}{8,16,32}`). We need to cover all those cases with tests. It is
essential to test real convolution sizes, because Intel MKL-DNN provides
different optimizations depending on convolution parameters. There is no
single unified approach inside, so it would not be enough to test only a few
convolutions (also known as unit tests).

Even for a given convolution, testing correctness is not as
simple as it might seem at first sight. One of the biggest problems we
encountered is numerical instability. For every output point, a lot of
operations may occur. For instance, on backward propagation with respect to
the filter, each filter point requires `mb * oh * ow` operations (see the *Notation*
section below). That large number of operations may lead to either
integer overflow or accuracy loss if the initial data is chosen poorly.

These two issues complicate testing. **benchdnn** tries to address them
by initializing data with integers drawn from a uniform distribution over the
range `[cfg->f_min .. cfg->f_max]`, with the step `cfg->f_step`
(see `struct dt_conf_t` in conv/conv.hpp). `f_min` and `f_max` are chosen so
that most of the results fall in the `[cfg->min .. cfg->max]` range. In addition,
for floating point, all integers in both ranges have an exact representation (that is,
their absolute values are less than `2^size_of_mantissa`). Uniform distribution
leads to results that are uniformly distributed and quite small. `f_min/f_max` keep
the result in a reasonable range. Yet another trick: not all the points are
initialized with non-zero values; see `fill_{src,wei,bia,dst}` in
conv/conv.cpp.
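The effect of integer initialization can be illustrated outside of **benchdnn**. The Python sketch below emulates single-precision accumulation with the standard `struct` module. The problem sizes and the fill range `[-4, 4]` are hypothetical values picked for the illustration; the real `f_min`, `f_max`, and `f_step` live in `struct dt_conf_t` and are not reproduced here.

```python
import random
import struct
from fractions import Fraction


def f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def f32_dot(a, b):
    """Accumulate sum(a[i] * b[i]) with every step rounded to f32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(f32(x) * f32(y)))
    return acc


def worst_case_sum(n_terms, a_max, b_max):
    """Upper bound on |sum| when |a[i]| <= a_max and |b[i]| <= b_max."""
    return n_terms * a_max * b_max


# Hypothetical sizes: mb * oh * ow accumulations per filter point.
n_small = 2 * 56 * 56
n_large = 256 * 56 * 56

# Small integer inputs keep every partial sum below 2^24 (f32 mantissa).
fits_f32 = worst_case_sum(n_small, 4, 4) < 2 ** 24
# Full-range u8 inputs could exceed a signed 32-bit accumulator.
overflows_s32 = worst_case_sum(n_large, 255, 255) > 2 ** 31 - 1

rng = random.Random(0)
src = [float(rng.randint(-4, 4)) for _ in range(n_small)]
dst = [float(rng.randint(-4, 4)) for _ in range(n_small)]
exact_int = sum(int(x) * int(y) for x, y in zip(src, dst))
int_is_exact = f32_dot(src, dst) == exact_int

# Fractional inputs: the f32 accumulation no longer matches the exact sum.
frac = [0.1] * n_small
ones = [1.0] * n_small
exact_frac = sum(Fraction(f32(x)) for x in frac)
frac_is_exact = Fraction(f32_dot(frac, ones)) == exact_frac

print(fits_f32, overflows_s32, int_is_exact, frac_is_exact)
```

With integer data the single-precision result can be compared against the reference for exact equality, whereas fractional data would require a tolerance that grows with the number of accumulations.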


## Further plans

Please see TODO.md in the **benchdnn** root directory for development plans.


## Issues and contributions

We welcome community contributions to **benchdnn** as well as to Intel MKL-DNN.
If you have any ideas or issues, please submit an issue or pull request. For
clarity, please include "benchdnn: " in the title.


## Inspiration

bench{yet another 3 letters where the first one equals the second}...