(#14580)
author Jie <jiej@nvidia.com>
Thu, 6 Dec 2018 16:57:39 +0000 (08:57 -0800)
committer Facebook Github Bot <facebook-github-bot@users.noreply.github.com>
Thu, 6 Dec 2018 17:03:46 +0000 (09:03 -0800)
commit d2fdc33411a3ffeb0575604d60014869869c5653
tree 42f3279d5f7a81125f0cafe3a09db8c62a8902fe
parent eb3cabffd69e37162a3fe0bb1bbfa3de83404f3a

Summary:
Removes the explicit cast from half to float in torch.sum when the input
tensor is float16 and the output tensor is float32; instead, the data is
cast as each element is loaded inside the reduction kernel.

This should save a kernel launch as well as a full global-memory pass over
the promoted data type (float).
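The user-facing pattern this change optimizes can be sketched as follows. This is a minimal illustration, not part of the PR itself; it runs the reduction on CPU for portability, while the optimized path in this commit is the CUDA kernel (use device="cuda" to exercise it):

```python
import torch

# Summing a float16 tensor directly into a float32 accumulator.
# Before this change, the float16 input was first cast to a temporary
# float32 tensor (an extra kernel launch plus a full global-memory
# read/write of the promoted type); after it, the cast happens per
# element as the input is loaded inside the reduction kernel.
x = torch.randn(1024, dtype=torch.float16)
out = x.sum(dtype=torch.float32)

print(out.dtype)  # torch.float32
```

Accumulating in float32 also avoids the precision loss of a pure float16 reduction, which is why the promoted-output path exists in the first place.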
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14580

Differential Revision: D13356203

Pulled By: ezyang

fbshipit-source-id: 85e91225b880a65fe3ceb493371b9b36407fdf48
aten/src/ATen/native/ReduceOps.cpp
aten/src/ATen/native/TensorIterator.cpp
aten/src/ATen/native/TensorIterator.h
aten/src/ATen/native/cuda/Reduce.cuh
aten/src/ATen/native/cuda/ReduceOpsKernel.cu
test/test_cuda.py