review.tizen.org Git - platform/upstream/pytorch.git/commit

[c10d] Provide failure reason from ProcessGroup when aborting NCCL comm (#64241)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64241

When things go wrong, PG NCCL aborts NCCL communicators via `ncclCommAbort`. One issue is that the reported error is often `ncclSystemError` (see https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L176) even when that is not the true cause: the actual problem may be that some prior work timed out, the communicator was aborted on another rank, etc.

This causes a lot of confusion when debugging jobs with a large number of processes, because the current message for `ncclSystemError` is not very informative: https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L22

The fix here is to pass a string exception message from PG NCCL down to `NCCLUtils`, which will aim to raise that message as the actual issue instead of the confusing `ncclSystemError` message.
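
The idea can be illustrated with a small Python toy model. This is only a sketch of the pattern described above, not the code in this commit (which is C++): `ToyComm`, `CommAbortedError`, `describe_failure`, and the placeholder generic message are all hypothetical names invented for illustration.

```python
from typing import Optional


class CommAbortedError(RuntimeError):
    """Hypothetical error raised when an aborted communicator is used."""


class ToyComm:
    """Toy stand-in for a communicator wrapper (not the real NCCLUtils class)."""

    # Placeholder text standing in for an uninformative generic error message.
    GENERIC_MESSAGE = "generic system error (placeholder)"

    def __init__(self) -> None:
        self.aborted = False
        self.abort_reason: Optional[str] = None

    def abort(self, reason: Optional[str] = None) -> None:
        # Keep only the first abort and its reason, so that later
        # follow-on aborts do not overwrite the original cause.
        if self.aborted:
            return
        self.aborted = True
        self.abort_reason = reason

    def check(self) -> None:
        # Prefer the caller-supplied reason; fall back to the generic
        # message only when no reason was recorded.
        if not self.aborted:
            return
        raise CommAbortedError(self.abort_reason or self.GENERIC_MESSAGE)


def describe_failure(comm: ToyComm) -> str:
    """Return the message a user would see when using the communicator."""
    try:
        comm.check()
    except CommAbortedError as exc:
        return str(exc)
    return "ok"
```

In this sketch, the process-group layer would call `abort("work timed out")` (or a similar reason) when it detects the real failure, so a later `check()` surfaces that reason rather than the generic text.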

Test Plan: CI

Reviewed By: pallab-zz, cbalioglu

Differential Revision: D30658855

fbshipit-source-id: 17661dbe0a1bb8cc5b87b637c47634b1f52f54e1

author	Rohan Varma <rvarm1@fb.com>
	Wed, 8 Sep 2021 16:17:49 +0000 (09:17 -0700)
committer	Facebook GitHub Bot <facebook-github-bot@users.noreply.github.com>
	Wed, 8 Sep 2021 16:19:24 +0000 (09:19 -0700)
commit	e0e832c2baf181284bab924e687a9668ed6beef5
tree	2250ca8f48da2f922b096f1b53c9eafc7eaa8cec
parent	7205ca02107059443bddd14322d2b9ed8562c60b

test/distributed/test_c10d_nccl.py
torch/csrc/distributed/c10d/NCCLUtils.cpp
torch/csrc/distributed/c10d/NCCLUtils.hpp
torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp