[XLA:GPU] Implement trivial (one-replica) cross-replica-sum on XLA:GPU.
author Justin Lebar <jlebar@google.com>
Tue, 22 May 2018 03:41:26 +0000 (20:41 -0700)
committer TensorFlower Gardener <gardener@tensorflow.org>
Tue, 22 May 2018 03:43:56 +0000 (20:43 -0700)
commit eab53f2cea0506d869b14713c6c532e0bbfd9c52
tree 60ef8fa706ad613261484666e66411df38d3969f
parent c0bf28ecc311759ac80e12515ad931b077aae635

Also fix the CPU implementation to work in the case when there are
multiple operands to the cross-replica-sum op.

PiperOrigin-RevId: 197506311
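
Illustrative note (not part of the commit): the sketch below shows why the one-replica case named in the subject line is "trivial", and what the multi-operand case mentioned in the message has to produce. It is a host-side model only, under stated assumptions. `Buffer` and `CrossReplicaSum` are hypothetical names invented for this example; they are not XLA types or functions, and the code does not reflect how the IR emitters touched by this change are written.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a device buffer; not an XLA type.
using Buffer = std::vector<float>;

// Models a cross-replica sum on the host.
// replicas[r][i] is operand i as held by replica r. The result has one
// output buffer per operand, each the elementwise sum over all replicas.
// Assumes every replica holds the same number of operands with the same
// sizes, and that there is at least one replica.
std::vector<Buffer> CrossReplicaSum(
    const std::vector<std::vector<Buffer>>& replicas) {
  // Start from replica 0's operands. With exactly one replica the loop
  // below never runs, so each output is simply a copy of its operand.
  std::vector<Buffer> result = replicas.at(0);
  for (std::size_t r = 1; r < replicas.size(); ++r) {
    for (std::size_t i = 0; i < result.size(); ++i) {
      for (std::size_t j = 0; j < result[i].size(); ++j) {
        result[i][j] += replicas[r][i][j];
      }
    }
  }
  return result;
}
```

In this model, passing a single replica with two operands returns two buffers equal to the inputs, which is the sense in which the one-replica sum reduces to a per-operand copy.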
tensorflow/compiler/xla/service/cpu/ir_emitter.cc
tensorflow/compiler/xla/service/gpu/gpu_copy_insertion.cc
tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h
tensorflow/compiler/xla/tests/BUILD
tensorflow/compiler/xla/tests/cross_replica_sum_test.cc [new file with mode: 0644]
tensorflow/compiler/xla/tests/hlo_test_base.cc
tensorflow/compiler/xla/tests/hlo_test_base.h