Fix port allocation race condition for elastic test (#65149)
author Howard Huang <howardhuang@fb.com>
Fri, 17 Sep 2021 14:55:01 +0000 (07:55 -0700)
committer Facebook GitHub Bot <facebook-github-bot@users.noreply.github.com>
Fri, 17 Sep 2021 15:32:47 +0000 (08:32 -0700)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65149

Fixes #64789

There is a race condition between the time the free port is acquired and the time it is used to create the store: in that window, another process may have claimed the port. Since this test only checks that the timeout is triggered for TCPStore, we can bind to any available port (port 0) when the store is created.

This only affects the server-side test (since that is where the port is bound), but I changed both tests for clarity.
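
The race described above can be illustrated with plain sockets. This is only a minimal sketch, not the code under test: `find_free_port` and `bind_any_port` are hypothetical helpers written for illustration (they are not the `get_free_port` used by the test), and a raw TCP socket stands in for the store's listening socket.

```python
import socket


def find_free_port() -> int:
    """Racy pattern (hypothetical helper): ask the OS for a free port,
    then release it before anyone actually uses it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
        return s.getsockname()[1]  # the socket closes on exit, freeing the port


def bind_any_port() -> socket.socket:
    """Race-free pattern (hypothetical helper): bind to port 0 and keep
    the socket open, so the port is never released in between."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    s.listen()
    return s


# Racy: between this call and a later bind to `port`, another process
# may grab the same port, and the later bind can then fail.
port = find_free_port()

# Race-free: the OS assigns the port at bind time and the caller reads it back.
server = bind_any_port()
assigned_port = server.getsockname()[1]
server.close()
```

Passing port 0 at creation time follows the second pattern: the port is chosen and bound in a single step, so there is no window in which another process can take it.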

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang cbalioglu gcramer23

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30993166

Pulled By: H-Huang

fbshipit-source-id: eac4f28d641ac87c4ebee89df83f90955144f2f1

test/distributed/elastic/utils/distributed_test.py

index 5a31ee0..e3c1de3 100644
@@ -84,22 +84,22 @@ class DistributedUtilTest(unittest.TestCase):
 
     def test_create_store_timeout_on_server(self):
         with self.assertRaises(TimeoutError):
-            port = get_free_port()
+            # use any available port (port 0) since timeout is expected
             create_c10d_store(
                 is_server=True,
                 server_addr=socket.gethostname(),
-                server_port=port,
+                server_port=0,
                 world_size=2,
                 timeout=1,
             )
 
     def test_create_store_timeout_on_worker(self):
         with self.assertRaises(TimeoutError):
-            port = get_free_port()
+            # use any available port (port 0) since timeout is expected
             create_c10d_store(
                 is_server=False,
                 server_addr=socket.gethostname(),
-                server_port=port,
+                server_port=0,
                 world_size=2,
                 timeout=1,
             )