[c10d] Provide failure reason from ProcessGroup when aborting NCCL comm (#64241)
author Rohan Varma <rvarm1@fb.com>
Wed, 8 Sep 2021 16:17:49 +0000 (09:17 -0700)
committer Facebook GitHub Bot <facebook-github-bot@users.noreply.github.com>
Wed, 8 Sep 2021 16:19:24 +0000 (09:19 -0700)
commit e0e832c2baf181284bab924e687a9668ed6beef5
tree 2250ca8f48da2f922b096f1b53c9eafc7eaa8cec
parent 7205ca02107059443bddd14322d2b9ed8562c60b

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64241

When things go wrong, PG NCCL aborts NCCL communicators via `ncclCommAbort`. One issue is that the reported error is often `ncclSystemError` (see https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L176) even when that is not the true cause and the actual problem is that some prior work timed out, the communicator was aborted on another rank, etc.

This causes a lot of confusion when debugging jobs with a large number of processes, because the current message for `ncclSystemError` is not very informative: https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L22

The fix here is to pass a string exception message from PG NCCL down to `NCCLUtils`, which will aim to raise that message as the actual failure reason instead of the confusing `ncclSystemError` message.
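
As a rough illustration of the idea (not the code in this change), the sketch below shows a communicator wrapper that stores an optional abort reason and prefers it over the generic error text. Every name here (`FakeNcclResult`, `FakeComm`, `genericErrorString`, `describeFailure`, `ensureUsable`) is hypothetical and chosen only for this example; it does not call NCCL and assumes C++17 for `std::optional`.

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for NCCL result codes; NOT the real ncclResult_t.
enum class FakeNcclResult { Success, SystemError };

// Generic text for a result code, standing in for an uninformative
// system-error message.
inline std::string genericErrorString(FakeNcclResult r) {
  return r == FakeNcclResult::Success ? "success" : "unhandled system error";
}

// Hypothetical communicator wrapper that remembers why it was aborted.
class FakeComm {
 public:
  // The process group supplies the root cause (timeout, abort on a peer
  // rank, ...) when it aborts the communicator. The first abort wins.
  void abort(std::optional<std::string> reason = std::nullopt) {
    if (aborted_) {
      return;
    }
    aborted_ = true;
    abortReason_ = std::move(reason);
    asyncErr_ = FakeNcclResult::SystemError;
  }

  bool isAborted() const { return aborted_; }

  // Prefer the recorded reason; fall back to the generic text otherwise.
  std::string describeFailure() const {
    if (abortReason_) {
      return *abortReason_;
    }
    return genericErrorString(asyncErr_);
  }

  // Throws with the most specific message available if the comm is unusable.
  void ensureUsable() const {
    if (aborted_) {
      throw std::runtime_error(
          "Communicator was aborted: " + describeFailure());
    }
  }

 private:
  bool aborted_ = false;
  FakeNcclResult asyncErr_ = FakeNcclResult::Success;
  std::optional<std::string> abortReason_;
};
```

With this shape, a caller that aborts with `comm.abort("Work timed out on rank 3")` would later see that text in the exception from `ensureUsable()`, while an abort without a reason still falls back to the generic message.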

Test Plan: CI

Reviewed By: pallab-zz, cbalioglu

Differential Revision: D30658855

fbshipit-source-id: 17661dbe0a1bb8cc5b87b637c47634b1f52f54e1
test/distributed/test_c10d_nccl.py
torch/csrc/distributed/c10d/NCCLUtils.cpp
torch/csrc/distributed/c10d/NCCLUtils.hpp
torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp