Fixed new_group won't work for two or more different rank groups (#14529)
author Teng Li <tengli@fb.com>
Fri, 30 Nov 2018 03:55:34 +0000 (19:55 -0800)
committer Facebook Github Bot <facebook-github-bot@users.noreply.github.com>
Fri, 30 Nov 2018 03:57:47 +0000 (19:57 -0800)
commit 9127ab386602d5b6509246332e71c63e6b741a25
tree f74c7ca87d92a394a2823150d28819b5a3b7ee0b
parent e227aa9e2e7e5bc010e82071e30422a2c5eae297

Summary:
This fixes two things:

(1) The NCCL backend did not support two or more groups. ProcessGroupNCCL needs a group name so that it can keep track of the ProcessGroup ID within that group name, and of the NCCL unique ID within that group name and process group ID. Without the group name, different processes create NCCL process groups in different orders and can clash on these names. Adding the group name fixes the NCCL problem.

(2) When new_group is called, every rank must enter the function and update its global group name counter, whether or not it is a member of the new group, so that all ranks always operate on the same group name.
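
The effect of (2) can be illustrated with a small pure-Python model. This is only a sketch under simplifying assumptions, not the code in distributed_c10d.py: the `RankState` class, the `always_count` flag, and the use of the counter value as the group name are all hypothetical and exist only to show why a counter that is skipped by non-member ranks produces clashing names.

```python
class RankState:
    """Hypothetical per-process state: a counter used to derive group names."""

    def __init__(self, rank):
        self.rank = rank
        self.group_count = 0

    def new_group(self, ranks, always_count):
        # Old behaviour (always_count=False): a non-member returns early
        # and never advances its counter.
        if not always_count and self.rank not in ranks:
            return None
        # New behaviour: every rank advances the counter on every call.
        self.group_count += 1
        name = str(self.group_count)
        return name if self.rank in ranks else None


def names_by_rank(always_count, group_specs=((0, 1), (2, 3))):
    """Return the group name each rank ends up using for its own group."""
    result = {}
    for rank in range(4):
        state = RankState(rank)
        # Every rank calls new_group for every group, in the same order.
        for ranks in group_specs:
            name = state.new_group(ranks, always_count)
            if name is not None:
                result[rank] = name
    return result


if __name__ == "__main__":
    print("counter skipped by non-members:", names_by_rank(False))
    print("counter updated by every rank: ", names_by_rank(True))
```

In this model, skipping the counter leaves the two distinct groups (0, 1) and (2, 3) with the same name, while updating it on every rank gives each group its own name that all ranks agree on.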

With both fixes, the repro code in https://github.com/pytorch/pytorch/issues/14528 should work with both the NCCL and Gloo backends:

```
tengli@learnfair096:~$ python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=30000 ~/github_issues/nccl_group.py
rank: 0 - val: 6.0
rank: 2 - val: 6.0
rank: 3 - val: 6.0
rank: 1 - val: 6.0
rank: 4 - val: 22.0
rank: 6 - val: 22.0
rank: 5 - val: 22.0
rank: 7 - val: 22.0
```
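
The values 6.0 and 22.0 are what a sum all-reduce of the rank numbers would give within ranks 0-3 and ranks 4-7. The script below is a hypothetical sketch of that kind of test, not the nccl_group.py from the issue: the `GROUPS` layout, the `expected_value` helper, and the choice to reduce each rank's own number are assumptions made to match the output above. It assumes one GPU per process and a launcher that sets `RANK` and passes `--local_rank`.

```python
import argparse
import os

# Hypothetical layout matching the output above: two groups of four ranks.
GROUPS = [[0, 1, 2, 3], [4, 5, 6, 7]]


def expected_value(rank):
    """Sum of the ranks in the group that contains `rank`."""
    for ranks in GROUPS:
        if rank in ranks:
            return float(sum(ranks))
    raise ValueError(f"rank {rank} is not in any group")


def main():
    # Imported here so the helper above is usable without PyTorch installed.
    import torch
    import torch.distributed as dist

    parser = argparse.ArgumentParser()
    # The launcher passes the local rank as a command-line argument.
    parser.add_argument("--local_rank", "--local-rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", 0)))
    args, _ = parser.parse_known_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")
    rank = dist.get_rank()

    # Every rank creates every group, in the same order, member or not.
    groups = [dist.new_group(ranks=ranks) for ranks in GROUPS]

    tensor = torch.tensor([float(rank)], device="cuda")
    for ranks, group in zip(GROUPS, groups):
        if rank in ranks:
            # Only members take part in the collective on their group.
            dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
    print(f"rank: {rank} - val: {tensor.item()}")


if __name__ == "__main__" and "RANK" in os.environ:
    # RANK is set by the launcher; without it there is no job to join.
    main()
```

Under these assumptions `expected_value(0)` is 6.0 and `expected_value(7)` is 22.0, which is the per-group sum each member rank should print.
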
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14529

Differential Revision: D13253434

Pulled By: teng-li

fbshipit-source-id: 8eb45882b996b06d951fc9a306d5de86a42e8b84
torch/csrc/distributed/c10d/init.cpp
torch/distributed/distributed_c10d.py
torch/lib/c10d/ProcessGroupNCCL.cpp
torch/lib/c10d/ProcessGroupNCCL.hpp