TCP init method race condition fix (#15684)

Summary:
This PR fixes a race condition in the TCP init method: the master rank can exit earlier than the slave ranks, so the TCP daemon thread gets shut down before the other ranks are able to access it.

With this change, every rank (process) writes a special key to the store to mark that it has completed (and is thus about to exit). The master rank (which hosts the server) always waits until all ranks have completed before completing itself.
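
The completion handshake described above can be sketched roughly as follows. This is only an illustration, not the code in TCPStore.cpp: `InMemoryStore`, `DONE_KEY`, `finish`, and `run_ranks` are hypothetical names, threads stand in for processes, and an in-memory counter stands in for the TCP store.

```python
import threading


class InMemoryStore:
    """Hypothetical stand-in for the key-value store shared by all ranks."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def add(self, key, amount):
        # Atomically increment a counter and wake up any waiters.
        with self._cond:
            self._data[key] = self._data.get(key, 0) + amount
            self._cond.notify_all()
            return self._data[key]

    def wait_for_count(self, key, target, timeout=None):
        # Block until the counter reaches `target`; returns False on timeout.
        with self._cond:
            return self._cond.wait_for(
                lambda: self._data.get(key, 0) >= target, timeout=timeout
            )


DONE_KEY = "worker_done_count"  # hypothetical key name


def finish(store, rank, world_size, timeout=5.0):
    """Called by each rank right before it exits."""
    # Every rank, including the master, marks itself as completed.
    store.add(DONE_KEY, 1)
    if rank == 0:
        # The master hosts the store, so it must outlive the other ranks:
        # wait until all `world_size` ranks have written their mark.
        return store.wait_for_count(DONE_KEY, world_size, timeout)
    return True


def run_ranks(world_size):
    """Simulate `world_size` ranks with threads sharing one store."""
    store = InMemoryStore()
    results = {}

    def worker(rank):
        results[rank] = finish(store, rank, world_size)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In this sketch a single-process run must use a world size of 1; with a larger world size, rank 0 would wait for marks that never arrive and eventually time out.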

This should fix: https://github.com/pytorch/pytorch/issues/15638

Tested using the repro from https://github.com/pytorch/pytorch/issues/15638, which now works fine. test_distributed and test_c10d should also already cover this.

I had to set the world size of the rendezvous test in c10d to 1, since it is single-process code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15684

Differential Revision: D13570904

Pulled By: teng-li

fbshipit-source-id: 34f3bc471204bbd29320df359347ad5561c6b589

author	Teng Li <tengli@fb.com>
	Fri, 18 Jan 2019 10:23:51 +0000 (02:23 -0800)
committer	Facebook Github Bot <facebook-github-bot@users.noreply.github.com>
	Fri, 18 Jan 2019 10:29:38 +0000 (02:29 -0800)
commit	b4bc55beefda3a0724b0fb83c04b6bbd8dd46c77
tree	4a76722cec30fbaed189a530a7fe3b8a05fbd6b4
parent	aaff2fecda78ca8064e313944c05a6df720ba87e

test/test_c10d.py
torch/csrc/distributed/c10d/init.cpp
torch/distributed/rendezvous.py
torch/lib/c10d/TCPStore.cpp
torch/lib/c10d/TCPStore.hpp
torch/lib/c10d/test/TCPStoreTest.cpp