Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65149
Fixes #64789
There is a race condition between the time the free port is acquired and the time it is used to create the store: another process may bind the port in the interim. Since this test only checks that the timeout is triggered for TCPStore, we can bind to any available port (port 0) on store creation.
This only affects the server test (since that is where the port is bound), but I changed both tests for clarity.
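The race described above can be sketched with plain sockets. This is a minimal illustration, not the code under test: `find_free_port_racy` is a hypothetical stand-in for a "get a free port" helper (the real `get_free_port` implementation is not shown here), and `bind_to_any_port` only demonstrates the OS-level behaviour of binding to port 0, not how TCPStore handles it internally.

```python
import socket


def find_free_port_racy() -> int:
    """Hypothetical helper: ask the OS for a free port, then release it.

    Once the socket is closed, nothing reserves the port, so another
    process may bind it before the caller gets to use the number.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def bind_to_any_port() -> socket.socket:
    """Bind to port 0 and keep the socket open.

    The OS picks the port and the socket holds it, so there is no gap
    between choosing the port and using it.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    sock.listen()
    return sock


racy_port = find_free_port_racy()     # port number only; nothing holds it
server = bind_to_any_port()           # port is chosen and held atomically
bound_port = server.getsockname()[1]  # the port the OS actually assigned
server.close()
```

In the racy pattern the caller holds only a number; in the second pattern the caller holds a bound socket, which is why passing port 0 directly to the store avoids the window.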
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang cbalioglu gcramer23
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D30993166
Pulled By: H-Huang
fbshipit-source-id: eac4f28d641ac87c4ebee89df83f90955144f2f1
     def test_create_store_timeout_on_server(self):
         with self.assertRaises(TimeoutError):
-            port = get_free_port()
+            # use any available port (port 0) since timeout is expected
             create_c10d_store(
                 is_server=True,
                 server_addr=socket.gethostname(),
-                server_port=port,
+                server_port=0,
                 world_size=2,
                 timeout=1,
             )
     def test_create_store_timeout_on_worker(self):
         with self.assertRaises(TimeoutError):
-            port = get_free_port()
+            # use any available port (port 0) since timeout is expected
             create_c10d_store(
                 is_server=False,
                 server_addr=socket.gethostname(),
-                server_port=port,
+                server_port=0,
                 world_size=2,
                 timeout=1,
             )