Speed-up adaptive average pooling for the common case of size=1 output (#17011)
author ngimel <ngimelshein@nvidia.com>
Fri, 15 Feb 2019 05:11:30 +0000 (21:11 -0800)
committer Facebook Github Bot <facebook-github-bot@users.noreply.github.com>
Fri, 15 Feb 2019 05:15:16 +0000 (21:15 -0800)
commit 91c50aeec6eccb9e23b8c08b161dbae63de9a0b0
tree 960392698734b84189667f491e0cc53a7f506b0a
parent 7cff803d0a09e36622f4e72e2ca2820cf9b97c52
Speed-up adaptive average pooling for the common case of size=1 output (#17011)

Summary:
When adaptive average pooling has to produce a single-pixel (1x1) feature map, it is faster to do so by calling .mean(). The generic backward pass calls a fairly inefficient CUDA kernel with atomics, which becomes extremely slow for half precision. For half, this PR provides an approximately 30x speed-up for adaptive average pooling, which translates into a 30% end-to-end speed-up on SENet. Improvements are smaller for float, but still significant (approximately 5x).
This PR also unifies the handling of 3d (no batch dimension) and 4d tensors by using negative dimension indices (sketched below).
cc ezyang for review.
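A minimal Python sketch of the equivalence the fast path relies on (illustration only, not the ATen/CUDA code changed in this PR; the helper name global_avg_pool is made up):

    import torch
    import torch.nn.functional as F

    def global_avg_pool(x):
        # Reducing over the last two dims with keepdim=True gives the same result as
        # adaptive_avg_pool2d with output_size=1, but its backward is a cheap broadcasted
        # division rather than the atomics-heavy pooling kernel.
        # Negative dim indices let the same expression cover both 3d (C, H, W)
        # and 4d (N, C, H, W) inputs.
        return x.mean(dim=(-2, -1), keepdim=True)

    x = torch.randn(8, 64, 7, 7)
    assert torch.allclose(global_avg_pool(x), F.adaptive_avg_pool2d(x, 1), atol=1e-6)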
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17011

Reviewed By: ailzhang

Differential Revision: D14078747

Pulled By: soumith

fbshipit-source-id: 0eb9255da2351190a6bcaf68c30e2ae2402a2dd9
aten/src/ATen/native/AdaptiveAveragePooling.cpp
aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu
aten/src/ATen/native/native_functions.yaml
test/common_nn.py
tools/autograd/derivatives.yaml
torch/csrc/jit/symbolic_script.cpp