change dropout lowering in symbolic_script (#18375)
author    Natalia Gimelshein <ngimelshein@nvidia.com>
Tue, 26 Mar 2019 02:57:06 +0000 (19:57 -0700)
committer Facebook Github Bot <facebook-github-bot@users.noreply.github.com>
Tue, 26 Mar 2019 03:05:11 +0000 (20:05 -0700)
Summary:
Dropout is now eligible for fusion, and the generated fused kernels are just as fast as dropout in ATen. This changes its lowering in symbolic_script so that it can actually be fused. The lowering is still special-cased for CUDA, because without fusion the new formulation is less efficient than the existing one (bernoulli_ * input). Testing is covered by the test case that ailzhang added (test_dropout_cuda).
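
For reference, the new CUDA-path lowering corresponds to the following eager-mode sketch (a minimal illustration only; the function name and the train handling are additions for completeness, not part of the patch):

    import torch

    def dropout_lowering_sketch(input, p, train=True):
        # Fuser-friendly formulation: build a Bernoulli keep-mask from
        # uniform noise, then rescale the surviving elements by 1/(1-p).
        if not train:
            return input
        p1m = 1. - p                          # keep probability
        mask = torch.rand_like(input) < p1m   # boolean keep-mask
        return mask.type_as(input) * input * (1. / p1m)
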
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18375

Differential Revision: D14611938

Pulled By: soumith

fbshipit-source-id: 11b18f4784e6c9265e382a8f8deca7add8df3b37

test/test_jit.py
torch/csrc/jit/symbolic_script.cpp

diff --git a/test/test_jit.py b/test/test_jit.py
index 1c59929..c2318e1 100644
@@ -1362,6 +1362,8 @@ class TestJit(JitTestCase):
             self.assertEqual(outputs, m(*inputs))
 
     @unittest.skipIf(not RUN_CUDA, "test_dropout_cuda require CUDA")
+    @unittest.skipIf(IS_WINDOWS, "NYI: fuser support for Windows")
+    @skipIfRocm
     def test_dropout_cuda(self):
         # Dropout AD is dispatched to _fused_dropout in CUDA case,
         # which is not included in TestJitGeneratedFunctional
diff --git a/torch/csrc/jit/symbolic_script.cpp b/torch/csrc/jit/symbolic_script.cpp
index cce3552..6974936 100644
@@ -725,20 +725,20 @@ const std::vector<std::string> functions = {
                                       mask,
                                       p1m: float):
             p1r = 1. / p1m
-            if grad.requires_grad:
-                grad_input = grad * (mask.type_as(grad) * p1r)
-            else:
-                grad_input = torch._masked_scale(grad, mask, p1r)
+            grad_input = grad * (mask.type_as(grad) * p1r)
             return grad_input
 
         def dropout(input,
                     p: float,
                     train: bool):
             use_cuda = input.is_cuda
-            # CUDA has a fused dropout implementation
+            # The lowering is specialized for CUDA because the CUDA fuser can fuse
+            # these operations efficiently; for the CPU backend, where fusion is
+            # disabled, a different lowering that is more efficient without fusion is used.
             p1m = 1. - p
             if use_cuda:
-                res, mask = torch._fused_dropout(input, p1m)
+                mask = torch.rand_like(input) < p1m
+                res = mask.type_as(input) * input * (1./p1m)
             else:
                 mask = torch.empty_like(input)
                 mask.bernoulli_(p1m)
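
A quick way to sanity-check the scripted lowering against eager dropout, in the spirit of test_dropout_cuda (a hedged sketch; the tensor size, dropout probability, and tolerances are assumptions, not part of the patch):

    import torch
    import torch.nn.functional as F

    @torch.jit.script
    def scripted_dropout(x, p: float, train: bool):
        return F.dropout(x, p, train)

    if torch.cuda.is_available():
        x = torch.randn(1000, 1000, device='cuda', requires_grad=True)
        y = scripted_dropout(x, 0.3, True)
        # Inverted dropout scales survivors by 1/(1-p), so the mean is
        # preserved in expectation; check it statistically, not exactly.
        assert abs(y.mean().item() - x.mean().item()) < 1e-2
        # Roughly a fraction p of the elements should have been zeroed.
        assert abs((y == 0).float().mean().item() - 0.3) < 1e-2
        # The backward pass exercises the backward formula shown in the
        # first symbolic_script hunk above.
        y.sum().backward()
        assert x.grad is not None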