[TensorIterator fixing mean to output correct result for half precision](#12115) (#14878)
Summary:
mean is calculated in two steps: sum()/numel(). For half precision, the data gets
cast back to half after sum(), so a sum above the half-precision maximum of
65504 overflows to inf before the division.
We fused the division into the reduction kernel by adding pre_op/post_op.
This allows torch.ones(65536).cuda().half().mean() to return the correct
result.
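The overflow can be illustrated without a GPU. The following is a minimal sketch in plain Python, not the kernel code from this PR: `to_half`, `mean_two_step`, and `mean_fused` are hypothetical helper names, half precision is emulated with the `struct` module's `e` format, and the accumulator is assumed to be wider than half. The sketch also assumes the division happens before the cast back to half, which is the effect described above.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (hypothetical helper).

    struct raises OverflowError for finite values too large for half,
    which corresponds to the cast producing +/-inf.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def mean_two_step(values):
    # Old behaviour (as described above): the sum is cast back to half
    # first, and only then divided by the number of elements.
    total = to_half(sum(values))
    return to_half(total / len(values))


def mean_fused(values):
    # Fused behaviour (assumption): the division is applied while the
    # value is still in the wider accumulator, then cast to half once.
    total = sum(values)
    return to_half(total / len(values))


ones = [1.0] * 65536
print(mean_two_step(ones))  # inf, because 65536 does not fit in half
print(mean_fused(ones))     # 1.0
```

With 65536 ones the intermediate sum is 65536, which is above the largest finite half value (65504), so the two-step version returns inf while the fused version returns 1.0.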
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14878
Differential Revision: D13491159
Pulled By: soumith
fbshipit-source-id: e83802e1628b6d2615c45e18d7acf991d143a09e