Fix NCCL/Gloo process groups and DDP stream sync bug (#18465)
author Shen Li <shenli@fb.com>
Thu, 28 Mar 2019 22:05:53 +0000 (15:05 -0700)
committer Facebook Github Bot <facebook-github-bot@users.noreply.github.com>
Thu, 28 Mar 2019 22:12:40 +0000 (15:12 -0700)
commit aea8ee1f6831a7704044d85df7c24e041e2363f5
tree 87e97549839d5660ffc23e5d632d131f30681b5b
parent 9eb0f435d9218b2fe1d5b2d32a2e471ce5b4d7df
Fix NCCL/Gloo process groups and DDP stream sync bug (#18465)

Summary:
DDP with the NCCL backend uses a [worker stream](https://github.com/pytorch/pytorch/blob/d3eb941ed96774efb8d89a0b20c9e49807ea85a7/torch/csrc/distributed/c10d/ddp.cpp#L142) to flatten grad batch
tensors, and passes the flattened tensor to [another stream](https://github.com/pytorch/pytorch/blob/d3eb941ed96774efb8d89a0b20c9e49807ea85a7/torch/lib/c10d/ProcessGroupNCCL.cpp#L379) to
run ncclAllReduce. The flattened tensor has to be recorded on the
ncclAllReduce stream; otherwise the caching allocator only knows about the
worker stream and may free and reuse that memory while the allreduce stream
is still accessing it.
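
For context, a minimal sketch of the record-stream pattern this describes (the function name `queueAllReduce` and the variables `flat` / `ncclStream` are illustrative only, not the actual ddp.cpp / ProcessGroupNCCL.cpp code): the flattened tensor produced on the worker stream is registered with the CUDA caching allocator against the NCCL stream before the allreduce is launched on it.

```cpp
// Sketch only: illustrates the cross-stream synchronization pattern.
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAStream.h>

void queueAllReduce(at::Tensor flat, c10::cuda::CUDAStream ncclStream) {
  // `flat` was written on the current (worker) stream; make the NCCL
  // stream wait for that work before reading the tensor.
  at::cuda::CUDAEvent event;
  event.record(c10::cuda::getCurrentCUDAStream());
  event.block(ncclStream);

  // Tell the caching allocator that `flat`'s memory is also in use on the
  // NCCL stream, so it is not freed and reused until that stream is done.
  c10::cuda::CUDACachingAllocator::recordStream(
      flat.storage().data_ptr(), ncclStream);

  // ... ncclAllReduce(...) would then be launched on `ncclStream` ...
}
```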

cc ppwwyyxx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18465

Differential Revision: D14613449

Pulled By: mrshenli

fbshipit-source-id: b62773732552d12cc87b7adeb6897e9e11753ea9
torch/csrc/distributed/c10d/ddp.cpp
torch/lib/c10d/ProcessGroupGloo.cpp
torch/lib/c10d/ProcessGroupNCCL.cpp