**benchdnn** is a standalone correctness and performance benchmark for
[Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN)](/intel/mkl-dnn).
The purpose of the benchmark is extended and robust correctness verification of
the primitives provided by Intel MKL-DNN. Currently, **benchdnn** supports
convolution, inner product, reorder, batch normalization, deconvolution,
recurrent neural network, and shuffle primitives of different data types.
**benchdnn** is licensed under
[Apache License Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).
## Usage (main driver)
**benchdnn** itself is a driver for different implementation-specific
harnesses. So far it provides harnesses for Intel MKL-DNN
[convolution](/tests/benchdnn/README.md#usage-convolution-harness),
[inner product](/tests/benchdnn/README.md#usage-ip-harness),
[reorder](/tests/benchdnn/README.md#usage-reorder-harness),
[batch normalization](/tests/benchdnn/README.md#usage-batch-normalization-harness),
[deconvolution](/tests/benchdnn/README.md#usage-deconvolution-harness),
[shuffle](/tests/benchdnn/README.md#usage-shuffle-harness), and
[recurrent neural network](/tests/benchdnn/README.md#usage-rnn-harness), as well as a
harness for testing [itself](/tests/benchdnn/README.md#usage-self-harness).

    ./benchdnn [--HARNESS] [--mode=MODE] [--max-ms-per-prb=MAX-MS-PER-PRB] [-vN|--verbose=N] HARNESS-OPTS
where:

- `HARNESS` is either `conv` [default], `ip`, `shuffle`, `reorder`, `bnorm`, `rnn`, or `self`
- `MODE` -- string of benchmark mode flags: use `C` or `c` for correctness (default) and `P` or `p` for performance
- `MAX-MS-PER-PRB` -- maximum time spent per problem in milliseconds, default `3e3`
- `-vN|--verbose=N` -- verbose level, default `0`
- `HARNESS-OPTS` are passed to the chosen harness
Returns `0` on success (all tests passed) or non-zero in case of any error.
## Notations / Glossary / Abbreviations
| Abbreviation | Description
|:--- |:---
| src | Source image (input image for forward convolution)
| wei | Weights (aka filter)
| bia | Bias
| dst | Destination image (output image for forward convolution)
| acc | Accumulation (typically in terms of data type)
| ic, oc | Input/output channels (aka feature maps)
| ih, iw | Input height and width
| oh, ow | Output height and width
| kh, kw | Kernel (filter, weights) height and width
| sh, sw | Convolution stride over height and width
| ph, pw | Convolution top and left padding
| mb | Minibatch (number of images processed at once)
| g | Groups (a way to reduce the number of computations; see the AlexNet topology)
| FWD_{D,B} | forward w/o and w/ bias
| BWD_{D,W,WB} | backward wrt data, weights, and weights and bias
| DIRECT, WINO | convolution algorithm: direct or Winograd-based
| AUTO | convolution algorithm chosen by MKL-DNN for best performance
## Usage (convolution harness)

    ./benchdnn --conv [harness-knobs] [conv-desc] ...

where *harness-knobs* are:
- `--cfg={f32, u8s8u8s32, ...}` configuration (see [convolution configurations](/tests/benchdnn/README.md#convolution-configurations-also-known-as-precision-specification) below), default `f32`
- `--dir={FWD_D (forward data), FWD_B (forward data + bias), FWD_I (forward data inference), BWD_D (backward data), BWD_W (backward weights), BWD_WB (backward weights + bias)}` direction, default `FWD_B`
- `--alg={DIRECT, WINO, AUTO}` convolution algorithm, default `DIRECT`
- `--attr="attr_str"` convolution attributes (see the section below), default `""` (no attributes set)
- `--mb=N` override the minibatch specified in the convolution description, default `0` (use the mb specified in conv-desc)
- `--match=regex` check only convolutions that match the regex, default `".*"`. Note: Windows may only interpret string arguments surrounded by double quotation marks.
- `--skip-impl="str1[:str2]..."` skip implementations (see mkldnn_query_impl_info_str), default `""`
- `--allow-unimpl=true|false` do not treat unimplemented configurations as an error, default `false`
- `--perf-template=template-str` set the template for the performance report (see the *Performance measurements* section)
- `--reset` reset all previously set parameters to their defaults
- `-vN|--verbose=N` verbose level, default `0`
- `--batch=file` use options from the given file (see the inputs subdirectory)
- `--mode=` string of benchmark mode flags: use `C` or `c` for correctness (default) and `P` or `p` for performance
and *conv-desc* is the convolution description. The canonical form is:

    gXmbXicXihXiwXocXohXowXkhXkwXshXswXphXpwXdhXdwXnS

Here X is a number and S is a string (n stands for name). Some of the
parameters may be omitted if a default exists (for example, if g is not
specified, **benchdnn** uses 1) or if they can be computed automatically (for
example, the output shape can be derived from the input shape and the kernel).
Also, if either width or height is not specified, it is assumed that
height == width. The special symbol `_` is ignored, so it may be used as a
delimiter. See `str2desc()` in conv/conv_aux.cpp for more details and implicit
rules.
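The implicit rules above can be sketched in a few lines of Python. This is a rough, hypothetical re-implementation for illustration only (the authoritative logic is `str2desc()` in conv/conv_aux.cpp; `parse_conv_desc` is not part of benchdnn):

```python
import re

def parse_conv_desc(desc):
    """Illustrative sketch of the conv-desc rules above; not benchdnn code."""
    # the trailing nS part, possibly quoted, is the problem name
    name, m = "", re.search(r'n"?([^"]*)"?$', desc)
    if m:
        name, desc = m.group(1), desc[:m.start()]
    desc = desc.replace("_", "")                       # '_' is only a delimiter
    d = {k: int(v) for k, v in re.findall(r"([a-z]+)(\d+)", desc)}
    d.setdefault("g", 1)                               # default group count is 1
    for h, w in (("ih", "iw"), ("oh", "ow"), ("kh", "kw"),
                 ("sh", "sw"), ("ph", "pw"), ("dh", "dw")):
        if h in d and w not in d: d[w] = d[h]          # height == width when
        if w in d and h not in d: d[h] = d[w]          # only one is given
    for k, v in (("sh", 1), ("sw", 1), ("ph", 0),
                 ("pw", 0), ("dh", 0), ("dw", 0)):
        d.setdefault(k, v)
    if "oh" not in d:                                  # derive the output shape
        d["oh"] = (d["ih"] - ((d["kh"] - 1) * (d["dh"] + 1) + 1)
                   + 2 * d["ph"]) // d["sh"] + 1
        d["ow"] = (d["iw"] - ((d["kw"] - 1) * (d["dw"] + 1) + 1)
                   + 2 * d["pw"]) // d["sw"] + 1
    return d, name

d, name = parse_conv_desc('mb16ic3ih610oc32oh608kh3n"yolov2:conv1"')
# name == "yolov2:conv1"; iw mirrors ih, kw mirrors kh, g defaults to 1
```

For instance, `parse_conv_desc("ic3ih13oc16kh3")` derives `oh = ow = 11`, matching the "output shape can be derived" rule.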
The attribute string *attr_str* is defined as follows (line breaks are for readability):

    [irmode={nearest,down};]
    [oscale={none,common,per_oc}[:scale];]
    [post_ops='[{relu,sum[:sum_scale]};]...';]
Here `irmode` defines the rounding mode for integer output (default is nearest).

Next, `oscale` stands for output_scales. The first parameter is the policy,
defined below. The second optional parameter is a scale that specifies either
the one common output scale (for the `none` and `common` policies) or a
starting point for the `per_oc` policy, which uses many scales. The default
scale is 1.0. Known policies are:

- `none` (default) means no output scales are set (i.e. scale = 1.)
- `common` corresponds to `mask=0` with a common scale factor
- `per_oc` corresponds to `mask=1<<1` (i.e. output channels) with different scale factors

Next, `post_ops` stands for the post-operation sequence. Currently supported
post operations are:

- `relu` with no parameters (i.e. the corresponding scale is 1., alg = eltwise_relu, alpha = beta = 0.)
- `sum` with an optional scale parameter (default 1.)
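For concreteness, here are a few complete attribute strings assembled from the pieces above. The first also appears in a convolution example later in this README; the others are hypothetical combinations that follow the documented grammar:

```
--attr="irmode=down;oscale=common:.5"
--attr="oscale=per_oc:0.125"
--attr="post_ops='sum:2;relu'"
```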
### Convolution configurations (also known as precision specification)

The `--cfg` option specifies which data types the convolution uses. It also
defines how the data is filled. For integer types, saturation is implied.

Finally, the configuration defines the threshold for computation errors
(ideally we want to keep it at 0, and that seems to work for now).

The table below shows the cases supported by Intel MKL-DNN and the
corresponding **benchdnn** configurations:
| src type | wei type | dst type | acc type | cfg | notes
|:--- |:--- |:--- |:--- |:--- |:---
| f32 | f32 | f32 | f32 | f32 | inference optimized for sse4.2+, training for avx2+
| s16 | s16 | s32 | s32 | s16s16s32s32 | optimized for processors with 4vnni support, forward pass only (aka FWD_D, FWD_B)
| s32 | s16 | s16 | s32 | s32s16s16s32 | optimized for processors with 4vnni support, backward wrt data only (aka BWD_D)
| s16 | s32 | s16 | s32 | s16s32s16s32 | optimized for processors with 4vnni support, backward wrt weights (aka BWD_W, BWD_WB)
| u8 | s8 | f32 | s32 | u8s8f32s32 | optimized for processors with avx512vl support, forward pass only (aka FWD_D, FWD_B)
| u8 | s8 | s32 | s32 | u8s8s32s32 | same notes as for u8s8f32s32
| u8 | s8 | s8 | s32 | u8s8s8s32 | same notes as for u8s8f32s32
| u8 | s8 | u8 | s32 | u8s8u8s32 | same notes as for u8s8f32s32
| s8 | s8 | f32 | s32 | s8s8f32s32 | same notes as for u8s8f32s32
| s8 | s8 | s32 | s32 | s8s8s32s32 | same notes as for u8s8f32s32
| s8 | s8 | s8 | s32 | s8s8s8s32 | same notes as for u8s8f32s32
| s8 | s8 | u8 | s32 | s8s8u8s32 | same notes as for u8s8f32s32
### Performance measurements (convolution harness)

**benchdnn** supports a custom performance report. A template is passed via the
command line and consists of terminal and nonterminal symbols. Nonterminal
symbols are printed as-is. A description of terminal symbols is given below.
There is also a notion of modifiers (marked with @) that change the meaning of
terminal symbols; for example, the sign '-' means minimum of (in terms of
time). See the table of modifiers below.

> **Caution:** Threads must be pinned in order to get consistent frequency.
| Abbreviation | Description
|:------------ |:-----------
| %d | problem descriptor
| %D | expanded problem descriptor (conv parameters in csv format)
| %@F | effective cpu frequency computed as clocks[@] / time[@]
| %O | number of ops required (padding is not taken into account)
| %@c | time in clocks
| %@p | ops per second
| Modifier | Description
|:-------- |:-----------
| - | min (time) -- default
The definition of the expanded problem descriptor is:
`g,mb,ic,ih,iw,oc,oh,ow,kh,kw,sh,sw,ph,pw`.

The default template can be found in conv/bench_conv.cpp and is defined as
`perf,%n,%d,%GO,%GF,%-t,%-Gp,%0t,%0Gp`. It produces the following output:

    number of giga ops calculated
    effective cpu frequency in GHz (aka clocks[min] / time[min])
    minimum time spent in ms
    best gigaops (since it corresponds to minimum time)
    average time spent in ms
    average gigaops (since it corresponds to average time)
Here is an example of the performance output:

    perf,"yolov2:conv1",mb16ic3ih610oc32oh608kh3n"yolov2:conv1",10.2205,0,43.9827,232.375,58.0146,176.171

The full convolution descriptor is `mb16ic3ih610oc32oh608kh3n"yolov2:conv1"` in the above example.
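As a sanity check, the sample line above can be split according to the default template with a few lines of Python (a hypothetical helper, not part of benchdnn). The fields are internally consistent: best gigaops per second equals giga ops over minimum time, and the ops count appears to follow the usual 2 × MAC formula for this descriptor:

```python
import csv, io

# Hypothetical helper: split the sample perf line according to the default
# template perf,%n,%d,%GO,%GF,%-t,%-Gp,%0t,%0Gp.
line = ('perf,"yolov2:conv1",mb16ic3ih610oc32oh608kh3n"yolov2:conv1",'
        '10.2205,0,43.9827,232.375,58.0146,176.171')
fields = next(csv.reader(io.StringIO(line)))
rec = dict(zip(["prefix", "name", "desc", "gops", "freq_ghz",
                "min_ms", "best_gops", "avg_ms", "avg_gops"], fields))

# best gigaops per second ~= giga ops / (minimum time in seconds)
assert abs(float(rec["gops"]) / (float(rec["min_ms"]) / 1e3)
           - float(rec["best_gops"])) < 0.01

# For this descriptor the ops field matches 2*mb*ic*oc*oh*ow*kh*kw / 1e9
# (padding does not enter the count, per the %O description above).
gops_est = 2 * 16 * 3 * 32 * 608 * 608 * 3 * 3 / 1e9
assert abs(gops_est - float(rec["gops"])) < 1e-3
```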
### Examples (convolution harness)

Run the set of f32 forward convolutions from the inputs/conv_all file with bias and the default minibatch:

    $ ./benchdnn --conv \
        --cfg=f32 --dir=FWD_B --batch=inputs/conv_all

Run the same set, but with the post_ops ReLU attribute:

    $ ./benchdnn --conv \
        --cfg=f32 --dir=FWD_B --attr="post_ops='relu'" --batch=inputs/conv_all

Run the same as above but also measure performance:

    $ ./benchdnn --conv --mode=CORRnPERF \
        --cfg=f32 --dir=FWD_B --attr="post_ops='relu'" --batch=inputs/conv_all

> **Note:** Instead of `CORRnPERF`, one can use `CP`, `PC`, `cp`, or `pc`.
Run a set of f32 backward convolutions wrt weights with kh=3 and
verbose level set to 2:

    $ ./benchdnn --conv -v2 \
        --cfg=f32 --dir=BWD_W --match='.*kh3[^0-9].*' --batch=inputs/conv_all

Run a set of u8s8u8s32 backward convolutions wrt data, but skip all
convolutions that would use a reference or gemm-based implementation:

    $ ./benchdnn --conv \
        --cfg=u8s8u8s32 --dir=BWD_D --skip-impl='ref:gemm' --batch=inputs/conv_all
Run the explicitly specified first forward convolution (including bias) from
AlexNet with the minibatch set to 4 and verbose level set to 1 for two given
configurations (`u8s8u8s32` and `f32`):

    $ ./benchdnn --conv -v1 \
        --mb=4 \
        --cfg=u8s8u8s32 ic3ih227iw227_oc96oh55ow55_kh11kw11_sh4sw4ph0pw0_n"alexnet:conv1" \
        --cfg=f32 ic3ih227iw227_oc96oh55ow55_kh11kw11_sh4sw4ph0pw0_n"alexnet:conv1"

Run a batch file for different algorithms (assuming the file specifies only
convolutions and does not include harness options that would override any
passed on the command line). Also ignore mkldnn_unimplemented errors in case of
Winograd:

    $ ./benchdnn --conv \
        --alg=DIRECT --batch=convs.in \
        --allow-unimpl=true \
        --alg=WINO --batch=convs.in \
        --alg=AUTO --batch=convs.in
Run a set of u8s8u8s32 forward convolutions without bias, skipping
reference implementations and not treating unimplemented configurations as
errors, with one common output scale set to 0.5 and rounding mode set to down:

    $ ./benchdnn --conv \
        --cfg=u8s8u8s32 --dir=FWD_D --skip-impl="ref" --allow-unimpl=true \
        --attr="irmode=down;oscale=common:.5" --batch=inputs/conv_all
## Usage (batch normalization harness)

    ./benchdnn --bnorm [harness-knobs] bnorm-desc ...

where *harness-knobs* are:
- `--mb=N` override the minibatch specified in the batch normalization description, default `0` (use the mb specified in bnorm-desc)
- `--dir={FWD_D (forward data / training), FWD_I (forward data / inference), BWD_D (backward data), BWD_DW (backward data + weights)}` direction, default `FWD_D`
- `--dt={f32, s32, ...}` base data type, default `f32`
- `--fmt={nchw, nChw16c, ...}` data layout, default `nchw`
- `--flags=[|G|S|R]` batch normalization flags, default `none` (G -- global stats, S -- use scale shift, R -- fuse with ReLU)
- `--attr="attr_str"` attributes (see the convolution section above), default `""` (no attributes set)
- `--match=regex` check only bnorms that match the regex, default `".*"`. Note: Windows may only interpret string arguments surrounded by double quotation marks.
- `--skip-impl="str1[:str2]..."` skip implementations (see mkldnn_query_impl_info_str), default `""`
- `--perf-template=template-str` set the template for the performance report (very similar to the convolution one)
- `--reset` reset all previously set parameters to their defaults
- `-vN|--verbose=N` verbose level, default `0`
- `--batch=file` use options from the given file (see the inputs subdirectory)
and *bnorm-desc* is the batch normalization description. The canonical form is:

    mbXicXidXihXiwXepsYnS

Here X is an integer number, Y is a real number, and S is a string (n stands
for name). The special symbol `_` is ignored, so it may be used as a
delimiter. There are implicit rules:

- if mb is omitted, mb is set to 2
- if iw is omitted, iw is set to ih (and vice versa)
- if eps is omitted, eps is set to 1./16
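The three defaults above can be sketched as follows. This is an illustrative, hypothetical parser, not benchdnn code (benchdnn's own parser lives in the bnorm harness sources):

```python
import re

def parse_bnorm_desc(desc):
    """Illustrative sketch of the bnorm-desc defaults above; not benchdnn code."""
    # the trailing nS part, possibly quoted, is the problem name
    name, m = "", re.search(r'n"?([^"]*)"?$', desc)
    if m:
        name, desc = m.group(1), desc[:m.start()]
    desc = desc.replace("_", "")                       # '_' is only a delimiter
    d = {k: float(v)
         for k, v in re.findall(r"([a-z]+)(\d+(?:\.\d+)?)", desc)}
    d.setdefault("mb", 2)                              # mb omitted -> 2
    if "ih" in d or "iw" in d:                         # iw omitted -> ih
        d.setdefault("iw", d.get("ih"))                # (and vice versa)
        d.setdefault("ih", d.get("iw"))
    d.setdefault("eps", 1.0 / 16)                      # eps omitted -> 1./16
    return d, name

d, name = parse_bnorm_desc('mb50ic64id1ih112iw112eps0.0625n"resnet_50:bn_conv1"')
# name == "resnet_50:bn_conv1"; d["eps"] == 0.0625
```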
### Performance measurements (batch normalization harness)

**benchdnn** supports a custom performance report. A template is passed via the
command line and consists of terminal and nonterminal symbols. Nonterminal
symbols are printed as-is. A description of terminal symbols is given below.
There is also a notion of modifiers (marked with @) that change the meaning of
terminal symbols; for example, the sign '-' means minimum of (in terms of
time). See the table of modifiers below.

> **Caution:** Threads must be pinned in order to get consistent frequency.
| Abbreviation | Description
|:------------ |:-----------
| %d | problem descriptor
| %D | expanded problem descriptor (parameters in csv format)
| %q | data type (precision)
| %f | data format (layout)
The definition of the expanded problem descriptor is: `mb,ic,id,ih,iw,eps`.

The default template can be found in bnorm/bench_bnorm.cpp and is defined as
`perf,%n,%z,%f,%q,%f,%D,%-t,%0t`. It produces the following output:

    batch normalization flags
    batch normalization flags
    expanded bnorm problem descriptor
    minimum time spent in ms
    average time spent in ms
Here is an example of the performance output:

    perf,"resnet_50:bn_conv1",FWD_D,,f32,,50,64,1,112,112,0.0625,10.7729,77.1917

The expanded bnorm problem descriptor is `50,64,1,112,112,0.0625` in the above example.
### Examples (batch normalization harness)

Run the set of bnorms from the inputs/bnorm/bnorm_resnet_50 file with the default minibatch:

    $ ./benchdnn --bnorm \
        --batch=inputs/bnorm/bnorm_resnet_50

Run the same as above but also measure performance:

    $ ./benchdnn --bnorm --mode=CORRnPERF \
        --batch=inputs/bnorm/bnorm_resnet_50
## Usage (rnn harness)

    ./benchdnn --rnn [harness-knobs] [rnn-desc] ...

where *harness-knobs* are:
- `--prop={FWD_D (forward data), BWD_DW (backward data + weights)}` direction, default `FWD_D`
- `--alg={VANILLA_RNN, VANILLA_LSTM, VANILLA_GRU, LBR_GRU}` algorithm, default `VANILLA_RNN`
- `--direction={left2right, right2left, concat, sum}` direction, default `left2right`
- `--activation={RELU, LOGISTIC, TANH}` activation, default `RELU`
- `--reset` reset all previously set parameters to their defaults
- `--batch=file` use options from the given file (see the inputs subdirectory)
and *rnn-desc* is the rnn description. The canonical form is:

    lXtXmbXsicXslcXdicXdlc

Here X is a number and S is a string. Some implicit rules:

- default values: l = 1, t = 1, mb = 2, S = "wip"
- if slc/dlc/dic is undefined => slc/dlc/dic = sic

See `str2desc()` in rnn/rnn_aux.cpp for more details and implicit rules.
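A sketch of these fallbacks, for illustration only (the authoritative logic is `str2desc()` in rnn/rnn_aux.cpp; `parse_rnn_desc` is not part of benchdnn):

```python
import re

def parse_rnn_desc(desc):
    """Illustrative sketch of the rnn-desc defaults above; not benchdnn code."""
    d = {k: int(v) for k, v in re.findall(r"([a-z]+)(\d+)", desc)}
    d.setdefault("l", 1)                  # default number of layers
    d.setdefault("t", 1)                  # default number of time steps
    d.setdefault("mb", 2)                 # default minibatch
    for k in ("slc", "dlc", "dic"):
        d.setdefault(k, d["sic"])         # undefined layer sizes fall back to sic
    return d

d = parse_rnn_desc("sic512")
# d == {"sic": 512, "l": 1, "t": 1, "mb": 2, "slc": 512, "dlc": 512, "dic": 512}
```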
### Performance measurements (rnn harness)

Running rnn in performance measurement mode will produce the following output:

    expanded rnn problem descriptor
    minimum time spent in ms
    maximum time spent in ms
    average time spent in ms
Here is an example of the performance output:

    perf,VANILLA_RNN,RELU,left2right,l1t1mb128sic512slc512dic512dlc512n""GNMT_enc-training"",time(ms):min=68.0007,max=176.006,avg=91.2686

The expanded rnn problem descriptor is `l1t1mb128sic512slc512dic512dlc512` in the above example.
### Examples (rnn harness)

Run the set of rnn training problems from the inputs/rnn/rnn_training file with the default minibatch:

    $ ./benchdnn --rnn \
        --batch=inputs/rnn/rnn_training

Run the same as above but also measure performance:

    $ ./benchdnn --rnn --mode=CORRnPERF \
        --batch=inputs/rnn/rnn_training
## Usage (deconvolution harness)

    ./benchdnn --deconv [harness-knobs] [deconv-desc] ...

where *harness-knobs* are:
- `--cfg={f32, u8s8u8s32, ...}` configuration (see [convolution configurations](/tests/benchdnn/README.md#convolution-configurations-also-known-as-precision-specification) above), default `f32`
- `--match=regex` check only deconvolutions that match the regex, default `".*"`. Note: Windows may only interpret string arguments surrounded by double quotation marks.
- `--mb=N` override the minibatch specified in the deconvolution description, default `0` (use the mb specified in deconv-desc)
- `--dir={FWD_D (forward data), FWD_B (forward data + bias), FWD_I (forward data inference), BWD_D (backward data), BWD_W (backward weights), BWD_WB (backward weights + bias)}` direction, default `FWD_B`
- `--alg={DIRECT, WINO, AUTO}` deconvolution algorithm, default `DIRECT`
- `--attr="attr_str"` deconvolution attributes (see the convolution section above), default `""` (no attributes set)
- `--skip-impl="str1[:str2]..."` skip implementations (see mkldnn_query_impl_info_str), default `""`
- `--allow-unimpl=true|false` do not treat unimplemented configurations as an error, default `false`
- `--perf-template=template-str` set the template for the performance report (see the *Performance measurements* section)
- `--mode=` string of benchmark mode flags: use `C` or `c` for correctness (default) and `P` or `p` for performance
- `--reset` reset all previously set parameters to their defaults
- `-vN|--verbose=N` verbose level, default `0`
- `--batch=file` use options from the given file (see the inputs subdirectory)
and *deconv-desc* is the deconvolution description. The canonical form is:

    gXmbXicXihXiwXocXohXowXkhXkwXshXswXphXpwXdhXdwXnS

Here X is a number and S is a string (n stands for name). Some of the
parameters may be omitted if a default exists (for example, if g is not
specified, **benchdnn** uses 1) or if they can be computed automatically (for
example, the output shape can be derived from the input shape and the kernel).
Also, if either width or height is not specified, it is assumed that
height == width. The special symbol `_` is ignored, so it may be used as a
delimiter. See `str2desc()` in conv/conv_aux.cpp for more details and implicit
rules.
### Performance measurements (deconvolution harness)

**benchdnn** supports a custom performance report. Please refer to the
[convolution harness](/tests/benchdnn/README.md#performance-measurements-convolution-harness)
section above for details.

The default template can be found in conv/bench_deconv.cpp and is defined as
`perf,%n,%d,%GO,%GF,%-t,%-Gp,%0t,%0Gp`. It produces the following output:

    number of giga ops calculated
    effective cpu frequency in GHz (aka clocks[min] / time[min])
    minimum time spent in ms
    best gigaops (since it corresponds to minimum time)
    average time spent in ms
    average gigaops (since it corresponds to average time)
Here is an example of the performance output:

    perf,"alexnet:deconv1",mb256ic96ih55oc3oh227kh11sh4n"alexnet:deconv1",2.9733,0,249.474,11.9183,307.702,9.66291

The full deconvolution descriptor is `mb256ic96ih55oc3oh227kh11sh4n"alexnet:deconv1"` in the above example.
### Examples (deconvolution harness)

Run the set of f32 forward deconvolutions from the inputs/deconv_all file with bias and the default minibatch:

    $ ./benchdnn --deconv \
        --cfg=f32 --dir=FWD_B --batch=inputs/deconv_all

Run the same as above but also measure performance:

    $ ./benchdnn --deconv --mode=CORRnPERF \
        --cfg=f32 --dir=FWD_B --batch=inputs/deconv_all
## Usage (ip harness)

    ./benchdnn --ip [harness-knobs] [ip-desc] ...

where *harness-knobs* are:
- `--cfg={f32, u8s8u8s32, ...}` configuration (see [convolution configurations](/tests/benchdnn/README.md#convolution-configurations-also-known-as-precision-specification) above), default `f32`
- `--mb=N` override the minibatch specified in the ip description, default `0` (use the mb specified in ip-desc)
- `--dir={FWD_D (forward data), FWD_B (forward data + bias), FWD_I (forward data inference), BWD_D (backward data), BWD_W (backward weights), BWD_WB (backward weights + bias)}` direction, default `FWD_B`
- `--attr="attr_str"` ip attributes (see the convolution section above), default `""` (no attributes set)
- `--allow-unimpl=true|false` do not treat unimplemented configurations as an error, default `false`
- `--perf-template=template-str` set the template for the performance report (see the *Performance measurements* section)
- `--mode=` string of benchmark mode flags: use `C` or `c` for correctness (default) and `P` or `p` for performance
- `--reset` reset all previously set parameters to their defaults
- `-vN|--verbose=N` verbose level, default `0`
- `--batch=file` use options from the given file (see the inputs subdirectory)
and *ip-desc* is the ip description. The canonical form is:

    mbXicXidXihXiwXocXnS

Here X is a number and S is a string (n stands for name).
The special symbol `_` is ignored, so it may be used as a delimiter.
Some implicit rules:

- default values: mb = 2, id = 1, S = "wip"
- if ih is undefined => ih = iw
- if iw is undefined => iw = ih

See `str2desc()` in ip/ip_aux.cpp for more details and implicit rules.
### Performance measurements (ip harness)

**benchdnn** supports a custom performance report. A template is passed via the
command line and consists of terminal and nonterminal symbols. Nonterminal
symbols are printed as-is. A description of terminal symbols is given below.
There is also a notion of modifiers (marked with @) that change the meaning of
terminal symbols; for example, the sign '-' means minimum of (in terms of
time). See the table of modifiers below.

> **Caution:** Threads must be pinned in order to get consistent frequency.
| Abbreviation | Description
|:------------ |:-----------
| %d | problem descriptor
| %D | expanded problem descriptor (parameters in csv format)
| %q | data type (precision)
| %f | data format (layout)
The definition of the expanded problem descriptor is: `mb,oc,ic,id,ih,iw`.

The default template can be found in ip/bench_ip.cpp and is defined as
`perf,%D,%n,%z,%q,%-t,%-Gp,%0t,%0Gp`. It produces the following output:

    expanded ip problem descriptor
    minimum time spent in ms
    best gigaops (since it corresponds to minimum time)
    average time spent in ms
    average gigaops (since it corresponds to average time)
Here is an example of the performance output:

    perf,112,1000,2048,1,1,1,"resnet:ip1",FWD_B,f32,3.99976,114.695,19.0323,24.1039

The expanded ip problem descriptor is `112,1000,2048,1,1,1` in the above example.
### Examples (ip harness)

Run the set of ips from the inputs/ip/ip_all file with the default minibatch:

    $ ./benchdnn --ip \
        --batch=inputs/ip/ip_all

Run the same as above but also measure performance:

    $ ./benchdnn --ip --mode=CORRnPERF \
        --batch=inputs/ip/ip_all
## Usage (shuffle harness)

    ./benchdnn --shuffle [harness-knobs] [dim] ...

where *harness-knobs* are:
- `--match=regex` check only shuffles that match the regex, default `".*"`. Note: Windows may only interpret string arguments surrounded by double quotation marks.
- `--dir={FWD_D (forward data), FWD_B (forward data + bias), FWD_I (forward data inference), BWD_D (backward data), BWD_W (backward weights), BWD_WB (backward weights + bias)}` direction, default `FWD_B`
- `--dt={f32, s32, ...}` base data type, default `f32`
- `--fmt={nchw, nChw16c, ...}` data layout, default `nchw`
- `--axis=N` shuffle axis, default `1`
- `--group=N` group size, default `1`
- `--mode=` string of benchmark mode flags: use `C` or `c` for correctness (default) and `P` or `p` for performance
- `-vN|--verbose=N` verbose level, default `0`
- `--batch=file` use options from the given file (see the inputs subdirectory)
and *dim* is the problem dimensions. The canonical form is:

    dxdxdxdxd

Here each d is an integer dimension, and dimensions are separated by `x`.
See `str2dims()` in shuffle/shuffle_aux.cpp for more details.
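The dims format is trivial to parse. A hypothetical one-liner for illustration (not benchdnn code):

```python
def parse_dims(s):
    """Split an 'x'-separated dims string into integer sizes (illustration)."""
    return [int(d) for d in s.split("x")]

parse_dims("1x272x2x56x56")   # -> [1, 272, 2, 56, 56]
```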
### Performance measurements (shuffle harness)

**benchdnn** supports a custom performance report. A template is passed via the
command line and consists of terminal and nonterminal symbols. Nonterminal
symbols are printed as-is. A description of terminal symbols is given below.
There is also a notion of modifiers (marked with @) that change the meaning of
terminal symbols; for example, the sign '-' means minimum of (in terms of
time). See the table of modifiers below.

> **Caution:** Threads must be pinned in order to get consistent frequency.
| Abbreviation | Description
|:------------ |:-----------
| %d | problem descriptor
| %D | expanded problem descriptor (parameters in csv format)
| %q | data type (precision)
| %f | data format (layout)
The definition of the expanded problem descriptor is: `dxdxdxdxd`.

The default template can be found in shuffle/bench_shuffle.cpp and is defined as
`perf,%z,%q,%f,%D,%a,%g,%-t,%0t`. It produces the following output:

    expanded shuffle problem descriptor
    minimum time spent in ms
    average time spent in ms
Here is an example of the performance output:

    perf,FWD_D,u8,nCdhw16c,1x272x2x56x56,4,4,11.6177,16.509

The expanded shuffle problem descriptor is `1x272x2x56x56` in the above example.
### Examples (shuffle harness)

Run the set of shuffles from the inputs/shuffle/test_shuffle_axis file with the default minibatch:

    $ ./benchdnn --shuffle \
        --batch=inputs/shuffle/test_shuffle_axis

Run the same as above but also measure performance:

    $ ./benchdnn --shuffle --mode=CORRnPERF \
        --batch=inputs/shuffle/test_shuffle_axis
## Usage (reorder harness)

    ./benchdnn --reorder [harness-knobs] ...

where *harness-knobs* are:
- `--idt={f32, s32, ...}` base input data type, default `f32`
- `--odt={f32, s32, ...}` base output data type, default `f32`
- `--dt={f32, s32, ...}` base data type, default `f32`
- `--ifmt={nchw, nChw16c, ...}` input data layout, default `nchw`
- `--ofmt={nchw, nChw16c, ...}` output data layout, default `nchw`
- `--fmt={nchw, nChw16c, ...}` data layout, default `nchw`
- `--def-scales={...}` user-defined scales, separated by ',' (for example: `0.125,0.25,0.5,1,2,4,8`)
- `--attr="attr_str"` reorder attributes (see the convolution section above), default `""` (no attributes set)
- `--both-dir-dt=true|false`, default `false`
- `--both-dir-fmt=true|false`, default `false`
- `--allow-unimpl=true|false` do not treat unimplemented configurations as an error, default `false`
- `--run` run the reorder benchmark
- `--perf-template=template-str` set the template for the performance report (see the *Performance measurements* section)
- `--reset` reset all previously set parameters to their defaults
- `--mode=` string of benchmark mode flags: use `C` or `c` for correctness (default) and `P` or `p` for performance
- `-vN|--verbose=N` verbose level, default `0`
- `--batch=file` use options from the given file (see the inputs subdirectory)
### Performance measurements (reorder harness)

**benchdnn** supports a custom performance report. A template is passed via the
command line and consists of terminal and nonterminal symbols. Nonterminal
symbols are printed as-is. A description of terminal symbols is given below.
There is also a notion of modifiers (marked with @) that change the meaning of
terminal symbols; for example, the sign '-' means minimum of (in terms of
time). See the table of modifiers below.

> **Caution:** Threads must be pinned in order to get consistent frequency.
| Abbreviation | Description
|:------------ |:-----------
| %d | problem descriptor
| %D | expanded problem descriptor (reorder parameters in csv format)
| %n | dimensionality of the problem
| %@O | number of elements being reordered
| %@p | elements per second
| Modifier | Description
|:-------- |:-----------
| - | min (time) -- default
The definition of the expanded problem descriptor is:
`idt,odt,ifmt,ofmt,attrs,dims`.

The default template can be found in reorder/bench_reorder.cpp and is defined as
`perf,%n,%D,%O,%-t,%-Gp,%0t,%0Gp`. It produces the following output:

    dimensionality of the problem
    expanded reorder problem descriptor
    number of elements being reordered
    minimum time spent in ms
    best gigaops (since it corresponds to minimum time)
    average time spent in ms
    average gigaops (since it corresponds to average time)
Here is an example of the performance output:

    perf,4,f32,f32,nchw,nchw,irmode=nearest;oscale=per_oc:0.125;post_ops='',2x64x3x3,1152,4.00244,0.000287824,24.0279,4.79442e-05

The expanded reorder problem descriptor is `f32,f32,nchw,nchw,irmode=nearest;oscale=per_oc:0.125;post_ops='',2x64x3x3` in the above example.
### Examples (reorder harness)

Run the set of reorders from the inputs/reorder/test_default file:

    $ ./benchdnn --reorder \
        --batch=inputs/reorder/test_default

Run the same as above but also measure performance:

    $ ./benchdnn --reorder --mode=CORRnPERF \
        --batch=inputs/reorder/test_default
## Usage (self harness)

    ./benchdnn --self ...

Checks enumeration types, attributes, flags, and descriptions.
## Installation

**benchdnn** is automatically built with Intel MKL-DNN. For convenience, you can
build **benchdnn** using cmake or make.
## Essence of convolution testing

Intel MKL-DNN supports different data types, such as single-precision floating
point (`mkldnn_f32`) and signed/unsigned integers of different lengths
(`mkldnn_{s,u}{8,16,32}`). We need to cover all those cases with tests. It is
essential to test real convolution sizes, because Intel MKL-DNN provides
different optimizations depending on convolution parameters. There is no
single unified approach inside, so it would not be enough to test only a few
convolutions (also known as unit tests).
But even for a given convolution, a correctness test is not as simple as it
might seem at first sight. One of the biggest problems we encountered is
numerical instability. For every output point, a lot of operations may occur.
For instance, on backward propagation with respect to filter, each filter
point requires `mb * oh * ow` operations (see the *Notations* section above).
That large number of compute operations may lead to either integer overflow or
accuracy loss if the initial data is chosen inadequately.
These two main issues complicate testing. **benchdnn** tries to address them
by using integers for initialization, drawn from a uniform distribution over
the range `[cfg->f_min .. cfg->f_max]` with the step `cfg->f_step`
(see `struct dt_conf_t` in conv/conv.hpp). `f_min` and `f_max` are chosen so
that most of the results belong to the `[cfg->min .. cfg->max]` range. Also,
for floating point, all integers in both ranges have an exact representation
(that is, the absolute values are less than `2^size_of_mantissa`). A uniform
distribution leads to results that are uniformly distributed and quite small;
`f_min`/`f_max` keep the result in a reasonable range. Yet another trick: not
all points are initialized with non-zero values; see `fill_{src,wei,bia,dst}`
in conv/conv.cpp.
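As a toy illustration of that filling scheme (this is not benchdnn's actual filler, and the range, step, and density values here are made up): integers are drawn uniformly from `[f_min .. f_max]` with step `f_step`, and only a fraction of points get a non-zero value, which keeps accumulated sums small and exactly representable:

```python
import random

def fill(n, f_min, f_max, f_step, density=0.5, seed=0):
    """Toy model of benchdnn-style data filling; the parameter values are
    illustrative, not the real cfg constants."""
    rng = random.Random(seed)
    choices = list(range(f_min, f_max + 1, f_step))
    # only some points are non-zero, which limits the magnitude of results
    return [rng.choice(choices) if rng.random() < density else 0
            for _ in range(n)]

data = fill(1000, -4, 4, 1)
# every value is a small integer, exactly representable in f32
assert all(-4 <= v <= 4 for v in data)
```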
Please see TODO.md in the **benchdnn** root directory for development plans.
## Issues and contributions

We welcome community contributions to **benchdnn** as well as to Intel MKL-DNN.
If you have any ideas or issues, please submit an issue or pull request. For
clarity, please include 'benchdnn: ' in the title.
bench{yet another 3 letters where the first one equals the second}...