1 Performance profiling {#perf_profile}
2 =====================================
4 It is often useful to collect information about how much of an application run
5 time is spent executing Intel(R) MKL-DNN primitives and which of those take
6 the most time. One of the popular methods to do this is to use profilers like
7 Linux\* perf or Intel(R) VTune(tm) Amplifier. Currently, Intel MKL-DNN has very
8 limited support for these tools since it does not annotate code generated at
9 run-time and thus the profiles cannot properly attribute it. However, Intel
10 MKL-DNN implements another feature called _verbose mode_ that allows tracing
11 execution of Intel MKL-DNN primitives and collection of basic statistics like
12 execution time and primitive parameters.
16 To enable Intel MKL-DNN verbose mode, set `MKLDNN_VERBOSE` environment variable
17 to `1` (to dump only execution time) or `2` (to dump both execution and
18 creation time). For example:
21 $ export MKLDNN_VERBOSE=1
22 $ ./benchdnn --conv ic16ih7oc16oh7kh5ph2n"wip"
25 This will produce the following output (the line break was added to fit into
29 mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nchw out:f32_nChw8c,num:1,2x16x7x7,0.484863
30 mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_goihw out:f32_gOIhw8i8o,num:1,1x16x16x5x5,0.494141
31 mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nchw out:f32_nChw8c,num:1,2x16x7x7,0.478027
32 mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_x out:f32_x,num:1,16,0.219971
33 mkldnn_verbose,exec,convolution,jit:avx2,forward_inference,fsrc:nChw8c fwei:gOIhw8i8o fbia:x \
34 fdst:nChw8c,alg:convolution_direct,mb2_g1ic16oc16_ih7oh7kh5sh1dh0ph2_iw7ow7kw5sw1dw0pw2,0.0170898
35 mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nChw8c out:f32_nchw,num:1,2x16x7x7,0.488037
36 mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nChw8c out:f32_nchw,num:1,2x16x7x7,0.00512695
37 0:PASSED __REPRO: ic16ih7oc16oh7kh5ph2nwip
38 tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 failed:0
41 Each line with verbose information is formatted as a comma-separated list
44 - `stage`, e.g. `create` or `exec`
45 - `primitive-kind`, e.g. `convolution`, `reorder`, `sum`, ...
46 - primitive implementation name
47 - propagation-kind, e.g. `forward_training`
48 - input/output data info, e.g. data type and data format
49 - auxiliary information, e.g. algorithm or number of input
51 - for convolution the problem description is dumped in benchdnn friendly format
52 - for reorder, sum, and concat problem description is simply logical dims
53 - for other primitives the problem description is similar to convolution one
54 - execution time in milliseconds
56 To get more information about verbose report format please refer to the
57 `verbose_templ()` function in the
58 [src/common/verbose.hpp](https://github.com/intel/mkl-dnn/blob/master/src/common/verbose.hpp)
63 The format is subject to change
70 Verbose mode has non-negligible performance impact especially if the output
75 ## Intel(R) VTune(TM) profiling
77 To collect performance data of JIT-kernels set `VTUNEROOT` environment variable
78 to path to VTune before building of Intel MKL-DNN. For example:
81 $ mkdir -p build && cd build && cmake -DVTUNEROOT=/path/to/vtune .. && make
85 To dump JIT-kernels set MKLDNN_JIT_DUMP environment variable to `1`. For example:
88 $ export MKLDNN_JIT_DUMP=1
92 This will produce the following output files:
94 mkldnn_dump_jit_uni_reorder_kernel_f32.0.bin
95 mkldnn_dump_jit_avx2_conv_fwd_kernel_f32.1.bin
96 mkldnn_dump_jit_uni_relu_kernel_f32.2.bin
97 mkldnn_dump_jit_uni_lrn_fwd_kernel_f32.3.bin
98 mkldnn_dump_jit_uni_lrn_fwd_kernel_f32.4.bin
99 mkldnn_dump_jit_uni_lrn_fwd_kernel_f32.5.bin
100 mkldnn_dump_jit_uni_reorder_kernel_f32.6.bin
101 mkldnn_dump_jit_uni_pool_kernel_f32.7.bin
103 To open these files any disassembler can be used. For example:
106 $ xed -64 -ir mkldnn_dump_jit_avx2_conv_fwd_kernel_f32.1.bin
107 XDIS 0: PUSH BASE 53 push rbx
108 XDIS 1: PUSH BASE 55 push rbp
109 XDIS 2: PUSH BASE 4154 push r12
110 XDIS 4: PUSH BASE 4155 push r13
111 XDIS 6: PUSH BASE 4156 push r14
112 XDIS 8: PUSH BASE 4157 push r15
113 XDIS a: DATAXFER BASE 488B07 mov rax, qword ptr [rdi]
114 XDIS d: DATAXFER BASE 488B7708 mov rsi, qword ptr [rdi+0x8]
115 XDIS 11: DATAXFER BASE 488B5710 mov rdx, qword ptr [rdi+0x10]
116 XDIS 15: DATAXFER BASE 488B5F18 mov rbx, qword ptr [rdi+0x18]
117 XDIS 19: DATAXFER BASE 488B8F98000000 mov rcx, qword ptr [rdi+0x98]
118 XDIS 20: DATAXFER BASE 448BAF00010000 mov r13d, dword ptr [rdi+0x100]
119 XDIS 27: DATAXFER BASE 4C8BB7D0000000 mov r14, qword ptr [rdi+0xd0]
120 XDIS 2e: BINARY BASE 4983FE04 cmp r14, 0x4
121 XDIS 32: COND_BR BASE 0F85EF030000 jnz 0x427
122 XDIS 38: LOGICAL BASE 4D31DB xor r11, r11
123 XDIS 3b: LOGICAL BASE 41F7C510000000 test r13d, 0x10
124 XDIS 42: COND_BR BASE 0F8558000000 jnz 0xa0
125 XDIS 48: DATAXFER AVX C5FC1006 vmovups ymm0, ymmword ptr [rsi]
126 XDIS 4c: DATAXFER AVX C5FC104E20 vmovups ymm1, ymmword ptr [rsi+0x20]
127 XDIS 51: DATAXFER AVX C5FC105640 vmovups ymm2, ymmword ptr [rsi+0x40]
128 XDIS 56: DATAXFER AVX C5FC109E207A0100 vmovups ymm3, ymmword ptr [rsi+0x17a20]
132 [Legal information](@ref legal_information)