Make NCCL backend support barrier op (#14142)
Summary:
This is a feature request from: https://github.com/pytorch/pytorch/issues/13573
As the title says, this PR makes NCCL backend support barrier op.
There are a couple of scenarios that need to be addressed:
(1) When an NCCL op has already run, we record which GPU device(s) the previous op used and queue the barrier's allreduce op on the same device(s).
(2) When no NCCL op has run yet, we make a best effort to assign each process its own single GPU.
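Scenario (2)'s best-effort device pick can be sketched as a simple round-robin of ranks over the visible GPUs. This is a minimal illustrative helper, not the actual C++ logic in ProcessGroupNCCL; the function name is hypothetical:

```python
def barrier_device_for_rank(rank: int, num_gpus: int) -> int:
    """Hypothetical sketch: spread processes across GPUs so each
    process lands on a distinct device whenever there are at least
    as many GPUs as processes."""
    if num_gpus <= 0:
        raise RuntimeError("NCCL barrier requires at least one CUDA device")
    # Round-robin assignment: rank 0 -> GPU 0, rank 1 -> GPU 1, ...
    return rank % num_gpus
```

For example, with 4 GPUs, ranks 0..3 would each get their own device (0, 1, 2, 3), while with 2 GPUs rank 2 would wrap around to device 0.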
As for the async work, wait() not only waits for the NCCL kernel to complete, but also blocks the calling thread until both the current stream and the NCCL stream have finished.
`test_distributed` covers this; I also manually tested both scenarios.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14142
Differential Revision: D13113391
Pulled By: teng-li
fbshipit-source-id: 96c33d4d129e2977e6892d85d0fc449424c35499