[XLA] Fix BF16 normalizer for CrossReplicaSum.
1. It may produce an incorrect result when mixed precision is not supported and
BF16 is unsupported for only a particular operand. In that case the pass may
introduce new mixed precision into an all-BF16 CRS. This is unlikely in
practical settings, but removing this constraint enables auto-generating
corner-case tests using this pass.
2. A cycle can be introduced in the tuple-shaped output. This wasn't caught by
the test because the DFS happened to succeed. Now add the verifier explicitly.
PiperOrigin-RevId: 187908099