Add a fast path for batch-norm CPU inference. (#19152)
author Xiaoqiang Zheng <zhengxq@fb.com>
Wed, 17 Apr 2019 02:22:13 +0000 (19:22 -0700)
committer Facebook Github Bot <facebook-github-bot@users.noreply.github.com>
Wed, 17 Apr 2019 02:27:54 +0000 (19:27 -0700)
commit 5627940e9c4ea18ba6c15a2f46f57d8905937c43
tree 39f02374f6fc87e1f6ca4e6a89ca5434feebba90
parent ff0a7ae43f9b814801d7fb7144f4d390c586838b
Add a fast path for batch-norm CPU inference. (#19152)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19152

Adding a fast path for batch-norm CPU inference when all tensors are contiguous.
* Leverage vectorization through simple loops.
* Fold the linear terms before computation (a sketch of the folding appears after the benchmark numbers below).
* For ResNeXt-101, this version runs 18.95 times faster.
* Add a microbenchmark; build and run it with:
  (buck build mode/opt -c python.package_style=inplace --show-output //caffe2/benchmarks/operator_benchmark:batchnorm_benchmark) && \
  (OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/batchnorm_benchmark#binary.par)
* batch_norm: data shape: [1, 256, 3136], bandwidth: 22.26 GB/s
* batch_norm: data shape: [1, 65536, 1], bandwidth: 5.57 GB/s
* batch_norm: data shape: [128, 2048, 1], bandwidth: 18.21 GB/s
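
For reference, a minimal sketch of the folding and the contiguous fast path. It assumes a flattened [N, C, HW] float layout, and the function and variable names are hypothetical, not the exact code landed in aten/src/ATen/native/Normalization.cpp:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Inference-only batch norm over contiguous [N, C, HW] float data.
// (x - mean) / sqrt(var + eps) * gamma + beta is folded into
// y = alpha * x + bias, with alpha/bias computed once per channel,
// so the inner loop is a single fused multiply-add.
void batch_norm_cpu_inference_contiguous(
    float* out, const float* in,
    const float* gamma, const float* beta,   // affine parameters
    const float* mean, const float* var,     // running statistics
    float eps, int64_t n, int64_t c, int64_t hw) {
  // Fold the linear terms once per channel.
  std::vector<float> alpha(c), bias(c);
  for (int64_t ch = 0; ch < c; ++ch) {
    const float inv_std = 1.0f / std::sqrt(var[ch] + eps);
    alpha[ch] = gamma[ch] * inv_std;
    bias[ch] = beta[ch] - mean[ch] * alpha[ch];
  }
  for (int64_t b = 0; b < n; ++b) {
    for (int64_t ch = 0; ch < c; ++ch) {
      const float a = alpha[ch];
      const float s = bias[ch];
      const float* x = in + (b * c + ch) * hw;
      float* y = out + (b * c + ch) * hw;
      for (int64_t i = 0; i < hw; ++i) {
        y[i] = a * x[i] + s;  // simple contiguous loop, auto-vectorizable
      }
    }
  }
}
```

Folding removes the per-element divide and square root, so the remaining work is memory-bound, which is why the microbenchmark above reports bandwidth rather than FLOPs.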

Reviewed By: soumith, BIT-silence

Differential Revision: D14889728

fbshipit-source-id: 20c9e567e38ff7dbb9097873b85160eca2b0a795
aten/src/ATen/native/Normalization.cpp
benchmarks/operator_benchmark/batchnorm_benchmark.py [new file with mode: 0644]