Speed up adaptive average pooling for the common case of size=1 output (#17011)
Summary:
When adaptive average pooling has to produce a single-pixel feature map (output size 1), it is faster to compute it by calling .mean(). The existing backward pass calls a fairly inefficient CUDA kernel that uses atomics, which becomes extremely slow for half precision. For half, this PR provides an approximately 30x speed-up for adaptive average pooling, which results in a 30% end-to-end speed-up on SENet. Improvements are smaller for float, but still significant (approximately 5x).
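The equivalence behind this shortcut can be checked with a small pure-Python sketch. It is not the code from this PR: `adaptive_avg_pool2d_ref` and `global_mean` are hypothetical helper names, and the floor/ceil bin boundaries are an assumption about how adaptive pooling windows are usually defined. With a 1x1 output the single bin spans the whole feature map, so the pooled value is just the mean.

```python
def adaptive_avg_pool2d_ref(fmap, out_h, out_w):
    """Reference adaptive average pooling over one 2-D feature map (list of rows).

    Assumed bin rule: bin i covers [floor(i*in/out), ceil((i+1)*in/out)).
    """
    in_h, in_w = len(fmap), len(fmap[0])
    out = []
    for i in range(out_h):
        r0 = (i * in_h) // out_h                      # floor
        r1 = ((i + 1) * in_h + out_h - 1) // out_h    # integer ceil
        row = []
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = ((j + 1) * in_w + out_w - 1) // out_w
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out


def global_mean(fmap):
    """Plain mean over every element of the feature map."""
    cells = [v for row in fmap for v in row]
    return sum(cells) / len(cells)


fmap = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
pooled = adaptive_avg_pool2d_ref(fmap, 1, 1)
# The only bin is rows [0, 2) x cols [0, 3), i.e. the whole map.
assert pooled[0][0] == global_mean(fmap)
```

Because the forward pass reduces to a mean in this case, the backward pass can presumably go through the backward of the mean instead of the atomics-based kernel, which is where the speed-up described above would come from.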
This PR also unifies the handling of 3d (no batch dimension) and 4d tensors by using negative dimension indices.
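A minimal sketch of why negative indices remove the need for separate 3d and 4d code paths, using plain shape tuples. `pooled_shape` and `spatial_size` are hypothetical helpers for illustration, not functions from this PR.

```python
def pooled_shape(shape):
    """Shape after reducing the two trailing (spatial) dims to 1x1."""
    if len(shape) not in (3, 4):
        raise ValueError("expected a 3-d (C, H, W) or 4-d (N, C, H, W) shape")
    # Index -2 is H and index -1 is W in both layouts,
    # so no branch on the number of dimensions is needed.
    return tuple(shape[:-2]) + (1, 1)


def spatial_size(shape):
    """Number of elements averaged per output pixel when output size is 1."""
    return shape[-2] * shape[-1]


assert pooled_shape((64, 7, 7)) == (64, 1, 1)          # no batch dimension
assert pooled_shape((32, 64, 7, 7)) == (32, 64, 1, 1)  # batched
```

With positive indices the spatial dimensions would be 1 and 2 in the 3d case but 2 and 3 in the 4d case, which is what forces the two cases apart.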
cc ezyang for review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17011
Reviewed By: ailzhang
Differential Revision: D14078747
Pulled By: soumith
fbshipit-source-id: 0eb9255da2351190a6bcaf68c30e2ae2402a2dd9