Jaliya Ekanayake [Thu, 29 Nov 2018 15:04:52 +0000 (07:04 -0800)]
Jaliyae/samplers (#13870)
Summary:
Make samplers optionally accept a new size in their reset() method. This lets a dataloader or dataset reset the sampler for an epoch or a chunk of data with a different size.
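As a rough illustration of the intent (not the API added here), a minimal Python sketch of a sampler whose reset() optionally takes a new size; the names are hypothetical:
```python
import random

class ResizableRandomSampler:
    """Illustrative only: reset() may be given a new dataset/chunk size."""

    def __init__(self, size):
        self.size = size
        self.reset()

    def reset(self, new_size=None):
        # A dataloader/dataset can pass the size of the next epoch or chunk.
        if new_size is not None:
            self.size = new_size
        self.indices = list(range(self.size))
        random.shuffle(self.indices)

    def __iter__(self):
        return iter(self.indices)

# e.g. sampler.reset(len(next_chunk)) before iterating over the next chunk
```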
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13870
Differential Revision:
D13240120
Pulled By: soumith
fbshipit-source-id:
19c53f8be13c0fdcf504f0637b0d3e6009a8e599
David Riazati [Thu, 29 Nov 2018 07:28:59 +0000 (23:28 -0800)]
Use nn module tests in test_jit (#14238)
Summary:
This PR adds weak modules for all activation modules and uses the `test_nn` module tests to exercise weak modules that have been annotated with `weak_module` and are therefore in `torch._jit_internal._weak_types`.
Also depends on #14379
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14238
Differential Revision:
D13252887
Pulled By: driazati
fbshipit-source-id:
e9638cf74089884a32b8f0f38396cf432c02c988
svcscm [Thu, 29 Nov 2018 05:37:40 +0000 (21:37 -0800)]
Updating submodules
Reviewed By: yns88
fbshipit-source-id:
f957056bb48c583738c5defaf3d1f01cd7df3915
svcscm [Thu, 29 Nov 2018 05:07:02 +0000 (21:07 -0800)]
Updating submodules
Reviewed By: yns88
fbshipit-source-id:
9800251baaa09d9f7988eff340ef36e0ab11f579
Peter Goldsborough [Thu, 29 Nov 2018 04:25:21 +0000 (20:25 -0800)]
Fix version.groups() (#14505)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/14502
fmassa soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14505
Differential Revision:
D13242386
Pulled By: goldsborough
fbshipit-source-id:
faebae8795e1efd9c0ebc2294fe9648193d16624
Elias Ellison [Thu, 29 Nov 2018 03:14:16 +0000 (19:14 -0800)]
Support Embedding + EmbeddingBag in Script + (Ignore flaky test) (#14509)
Summary:
Resubmitting PR #14415
The tests added for Embedding + EmbeddingBag had random numbers as input, which affected the random number generator and caused the flaky test to break.
Everything but the last two commits has already been accepted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14509
Differential Revision:
D13247917
Pulled By: eellison
fbshipit-source-id:
ea6963c47f666c07687787e2fa82020cddc6aa15
Elias Ellison [Thu, 29 Nov 2018 02:12:22 +0000 (18:12 -0800)]
pointwise_loss (#14134)
Summary:
Adding pointwise loss ops to weak_script
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14134
Differential Revision:
D13209455
Pulled By: eellison
fbshipit-source-id:
87fc0222121f34a2f4edb24c2da2a11124b097d8
James Sun [Thu, 29 Nov 2018 02:05:10 +0000 (18:05 -0800)]
Merge Caffe2 and PyTorch thread pool definitions (#14114)
Summary:
(1) Move Caffe2 thread pool to aten
(2) Use the same thread pool definition for PyTorch interpreter
(3) Make ivalue::Future thread-safe
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14114
Reviewed By: ilia-cher
Differential Revision:
D13110451
Pulled By: highker
fbshipit-source-id:
a83acb6a4bafb7f674e3fe3d58f7a74c68064fac
Sam Gross [Thu, 29 Nov 2018 01:51:01 +0000 (17:51 -0800)]
Ensure that indices are on the same device as self
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14504
Reviewed By: wat3rBro
Differential Revision:
D13242200
Pulled By: colesbury
fbshipit-source-id:
82731cee808681ec612d406342070640eb26e519
Dmytro Dzhulgakov [Wed, 28 Nov 2018 23:43:22 +0000 (15:43 -0800)]
Remove Context dependency from Tensor class (#14269)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14269
Removes reference to Context proper and instead adds a bool argument for async copy (the same as `copy_`)
For CopyFrom, I haven't tweaked all call sites yet. Instead I rely on a terrible hack: a pointer to the context is implicitly converted to bool when passed. It's not good code, and I propose to fix it in a follow-up diff (maybe using clangr tooling).
Reviewed By: ezyang
Differential Revision:
D13117981
fbshipit-source-id:
7cb1dc2ba6a4c50ac26614f45ab8318ea96e3138
Dmytro Dzhulgakov [Wed, 28 Nov 2018 23:43:22 +0000 (15:43 -0800)]
Change Tensor::CopyFrom to a simple double dispatch (#14268)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14268
Removes the need for Context in Tensor by doing simple dispatch for CopyBytes. It'd eventually be subsumed by Roy Li's changes for a proper copy_ op, but before that is done, let's get a clear picture of how copies are implemented and clean up some cruft in the CopyFrom implementation.
Note that with these changes one can probably get rid of Context::CopyFromCPU/CopyToCPU, but that's a matter for follow-up diffs.
This diff doesn't change the API of Tensor yet, but relies on the fact that passing `Context` to CopyFrom makes the copy async if the device is CUDA and doesn't have any effect otherwise (that's how the Context methods are implemented).
This doesn't change the semantics of the async copy implementation - as before, it blindly calls cudaMemcpyAsync, which probably means it can be misused if invoked separately outside of an operator body. I'll leave that for the follow-up copy_ unification.
For Extend() we always do an async copy - it makes sense as it's an in-place device-device operation and the result is only observed by further ops.
Note: there are now three ways of invoking a copy in C2 code - the templated CopyBytes, the virtual CopyFromCPU/etc, and the double-dispatch free function here. Hopefully we can get rid of the second one.
Also, please advise whether it's c10-worthy :)
Reviewed By: ezyang
Differential Revision:
D13117987
fbshipit-source-id:
a6772d6dcf3effaf06717da3a656fc9873b310b5
albanD [Wed, 28 Nov 2018 23:25:09 +0000 (15:25 -0800)]
Update Tensor doc (#14339)
Summary:
Add info about `.device`, `.is_cuda`, `.requires_grad`, `.is_leaf` and `.grad` to the Tensor doc.
Update the `register_backward_hook` doc with a warning stating that it does not work in all cases.
Add support in the `_add_docstr` function for adding docstrings to attributes.
There is an explicit cast here, and I am not sure how to handle it properly. The issue is that the doc field for getsetdescr is documented as a const char * (as are all other doc fields in descriptor objects) in the CPython online documentation, but in the code it is the only one that is not const.
I assumed this is a bug in the code, because it follows neither the doc nor the convention of the other descriptors, so I cast away the const.
EDIT: the online doc I was looking at is for 3.7, and in that version both the code and the doc are const. For older versions, both are non-const.
Please let me know if this should not be done, and if it should, whether there is a cleaner way to do it!
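For reference, a quick example of the attributes being documented (all of these already exist on `torch.Tensor`):
```python
import torch

x = torch.randn(2, 3, requires_grad=True)
y = (x * 2).sum()
y.backward()

print(x.device)         # e.g. device(type='cpu')
print(x.is_cuda)        # False for a CPU tensor
print(x.requires_grad)  # True
print(x.is_leaf)        # True: created by the user, not by an operation
print(x.grad)           # populated by backward()
```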
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14339
Differential Revision:
D13243266
Pulled By: ezyang
fbshipit-source-id:
75b7838f7cd6c8dc72b0c61950e7a971baefaeeb
andersj [Wed, 28 Nov 2018 22:40:50 +0000 (14:40 -0800)]
nccl fixes (#14195)
Summary:
This has 4 changes
1) propagate USE_SYSTEM_NCCL. Previously it was ignored and cmake always did a FindPackage
2) respect SCCACHE_DISABLE in our caffe2 sccache wrapper for circleci
3) use SCCACHE_DISABLE when building nccl, because it triggers the same bug as when using CCACHE (already tracked in https://github.com/pytorch/pytorch/issues/13362). This was hidden because we weren't respecting USE_SYSTEM_NCCL, and were never building nccl ourselves in CI
4) In one particular CI configuration (caffe2, cuda 8, cudnn 7), force USE_SYSTEM_NCCL=1. Building the bundled nccl triggers a bug in nvlink. I've done some investigation, but this looks like a tricky, preexisting bug, so rather than hold up this diff I'm tracking it separately in https://github.com/pytorch/pytorch/issues/14486
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14195
Differential Revision:
D13237502
Pulled By: anderspapitto
fbshipit-source-id:
1100ac1269c7cd39e2e0b3ba12a56a3ce8977c55
Edward Yang [Wed, 28 Nov 2018 21:59:59 +0000 (13:59 -0800)]
Clean up house on CUDAStream (#14247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14247
Just a bunch of clean up to get the code in a good state before we
enshrine it in c10.
Billing of changes:
- Inline all "pointer" API functions into their real implementations,
so we don't have a bunch of dead pointer functions hanging around.
- Replace all occurrences of int64_t with DeviceIndex, as appropriate
- Rename device field to device_index
- Add documentation for everything in CUDAStream.h
- Bring CUDAStream to API parity with Stream (e.g., support equality)
- Delete uncheckedSetCurrentCUDAStream, it didn't work anyway because
StreamId to internal pointer conversion has a bunch of ways it can
fail. Just hope for the best!
Reviewed By: dzhulgakov
Differential Revision:
D13141949
fbshipit-source-id:
a02f34921e3d8294bd77c262bd05da07d1740a71
Edward Yang [Wed, 28 Nov 2018 21:52:44 +0000 (13:52 -0800)]
Make clang-tidy shut up about Python C API macros.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14480
Reviewed By: goldsborough
Differential Revision:
D13235001
fbshipit-source-id:
cd7f00b12ed3d9ef0fb0d7bd6c428e21561ec1b6
Sebastian Messmer [Wed, 28 Nov 2018 21:37:31 +0000 (13:37 -0800)]
Make TensorImpl/StorageImpl safer (#14429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14429
- forbid copying
- make final what ought to be
Reviewed By: dzhulgakov
Differential Revision:
D13223125
fbshipit-source-id:
e6176cc916d4cd8370c835f243ca90d5c3124c4a
Sebastian Messmer [Wed, 28 Nov 2018 21:37:31 +0000 (13:37 -0800)]
Handle copying intrusive_ptr_target correctly (#14428)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14428
See in-code comment
Reviewed By: ezyang
Differential Revision:
D13223126
fbshipit-source-id:
1e87e6112bbcca6377ca04ef2ba25ef937931061
Edward Yang [Wed, 28 Nov 2018 21:36:40 +0000 (13:36 -0800)]
Revert
D13219647: [pytorch][PR] Support Embedding + EmbeddingBag in Script
Differential Revision:
D13219647
Original commit changeset:
c90706aa6fbd
fbshipit-source-id:
d189e717ba0773de43d633876bc3a688830a9303
Sebastian Messmer [Wed, 28 Nov 2018 21:30:36 +0000 (13:30 -0800)]
Remove StorageImpl::type() (#14139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14139
This seems to be neither used nor implemented. Also, it is a c10->aten dependency, which we don't want.
Reviewed By: ezyang
Differential Revision:
D13112298
fbshipit-source-id:
0407c4c3ac9b02bbd6fca478336cb6a6ae334930
Jerry Zhang [Wed, 28 Nov 2018 21:24:30 +0000 (13:24 -0800)]
Add XBlobGetMutableTensor that returns Tensor (#14424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14424
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14136
Since Tensor is now a shared_ptr, it doesn't make sense to have Tensor* around anymore,
so we want to change Tensor* to Tensor in the interface.
We added functions that work with `Tensor` instead of `Tensor*` in this diff.
To remove Tensor*, we'll do the following:
```
auto* Y = Output(0);
Y->mutable_data...
```
-->
```
auto Y = Output(0);
Y.mutable_data...
```
But to run clangr codemod, we'll keep both APIs in different names, e.g. `Output` and `XOutput`, and do the refactor and then delete the old method and rename the new method into the old one.
For example for `Output`, we'll first codemod the callsites from `Output` to `XOutput`, then delete the old `Output` and rename `XOutput` to `Output` in the end.
Reviewed By: smessmer
Differential Revision:
D12934074
fbshipit-source-id:
d0e85f6ef8d13ed4e7a7505faa5db292a507d54c
Pieter Noordhuis [Wed, 28 Nov 2018 19:32:47 +0000 (11:32 -0800)]
Add timeout kwarg to init_process_group (#14435)
Summary:
This applies to the gloo backend only. Timeout support for the NCCL and
MPI backends is tracked in issues #14371 and #14372 respectively.
When creating a new process group (either the global one or any subgroup
created through `new_group`) you can specify a timeout keyword
argument (of type datetime.timedelta). This timeout applies to all
collective operations executed against that process group, such that any
operation taking longer than the timeout will throw a runtime error.
Using a different, better catchable error type is tracked in #14433.
This fixes #14376.
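A minimal usage sketch of the keyword argument (assuming the usual rendezvous environment variables are set):
```python
import datetime
import torch.distributed as dist

# Collectives on this gloo process group throw a runtime error if they
# take longer than 60 seconds.
dist.init_process_group(
    backend="gloo",
    init_method="env://",
    timeout=datetime.timedelta(seconds=60),
)

# Subgroups created via new_group can carry their own timeout as well.
group = dist.new_group(ranks=[0, 1], timeout=datetime.timedelta(seconds=30))
```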
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14435
Differential Revision:
D13234317
Pulled By: pietern
fbshipit-source-id:
973993b67994dc64861c0977cbb6f051ec9d87f6
Edward Yang [Wed, 28 Nov 2018 19:05:36 +0000 (11:05 -0800)]
Add support for HIP to DispatchStub. (#14413)
Summary:
I feel a bit bad writing this patch, because there isn't really
any reason not to use the normal dispatch mechanism for CUDA
and HIP here (so we have *yet another dispatcher*), but I don't
really want to sign up to rewrite DispatchStub to deduplicate the
dispatcher right now.
Need to natively add support for HIP here, as I don't want to
have to HIPify files which are not in a CUDA directory.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14413
Differential Revision:
D13220358
Pulled By: ezyang
fbshipit-source-id:
cc61218322589a1dc2ab8eb9d5ddd3c616f6b712
Elias Ellison [Wed, 28 Nov 2018 18:50:26 +0000 (10:50 -0800)]
Support Embedding + EmbeddingBag in Script (#14415)
Summary:
Add support for Embedding and EmbeddingBag in script. Both functions require `with torch.no_grad()`, which we don't have any plans to support in the near future. To work around this, I added an embedding_renorm function without derivatives.
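A minimal sketch of the kind of code this enables (scripting a function that goes through the embedding path):
```python
import torch
import torch.nn.functional as F

@torch.jit.script
def lookup(weight, indices):
    # type: (Tensor, Tensor) -> Tensor
    return F.embedding(indices, weight)

weight = torch.randn(10, 4)
indices = torch.tensor([1, 3, 5])
print(lookup(weight, indices).shape)  # torch.Size([3, 4])
```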
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14415
Reviewed By: wanchaol
Differential Revision:
D13219647
Pulled By: eellison
fbshipit-source-id:
c90706aa6fbd48686eb10f3efdb65844be7b8717
Jongsoo Park [Wed, 28 Nov 2018 18:39:46 +0000 (10:39 -0800)]
fix build error from
D13188595 (#14481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14481
Fix build error in mode/opt
Reviewed By: dskhudia
Differential Revision:
D13234688
fbshipit-source-id:
6c8515c45f75e7b88713a303f22990ad85d68beb
Raghavendra Thodime [Wed, 28 Nov 2018 18:39:31 +0000 (10:39 -0800)]
Revert
D13144472: [fix] condition blob in while_op test changes data type
Differential Revision:
D13144472
Original commit changeset:
af4d920a3148
fbshipit-source-id:
74d9f69fc66964b5e68b4b2cd2fd2be1f63e9d69
Jiong Gong [Wed, 28 Nov 2018 18:35:28 +0000 (10:35 -0800)]
Fix the build issue in setup.py due to cmake version type x.x.x.x vio… (#14331)
Summary:
See https://github.com/pytorch/pytorch/issues/13226
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14331
Differential Revision:
D13234639
Pulled By: orionr
fbshipit-source-id:
87880057e84242e4af5ad6bf87e08831aa2c5459
JerryShih [Wed, 28 Nov 2018 17:26:25 +0000 (09:26 -0800)]
Update OpenMP cmake setting for xcode 9 compiler(AppleClang 9.0) (#14473)
Summary:
Original PR: https://github.com/pytorch/pytorch/pull/11563
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14473
Differential Revision:
D13234208
Pulled By: ezyang
fbshipit-source-id:
7d874c63659e93728af239ecdfb85547613e52ad
Edward Yang [Wed, 28 Nov 2018 15:38:04 +0000 (07:38 -0800)]
Revert
D13166626: [pytorch][PR] ignore generated caffe2 docs and virtualenvs
Differential Revision:
D13166626
Original commit changeset:
4f11228d8b5d
fbshipit-source-id:
ff301f1791ca8a390767ae43cde8637dcd044d0c
Brennan Vincent [Wed, 28 Nov 2018 14:50:49 +0000 (06:50 -0800)]
Make `mean` function work across multiple dimensions. (#14252)
Summary:
Multi-dimensional `sum` is already implemented, and it's trivial to implement `mean` in terms of `sum`, so just do it.
Bonus: Fix incomplete language in the `torch.sum` documentation, which didn't take multiple dimensions into account when describing `unsqueeze` (at the same time as introducing similar language in `torch.mean`).
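For example:
```python
import torch

a = torch.randn(2, 3, 4)
print(a.mean(dim=(0, 2)).shape)                # torch.Size([3])
print(a.mean(dim=(0, 2), keepdim=True).shape)  # torch.Size([1, 3, 1])
# mean over several dims matches sum over the same dims divided by their size
print(torch.allclose(a.mean(dim=(0, 2)), a.sum(dim=(0, 2)) / (2 * 4)))  # True
```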
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14252
Differential Revision:
D13161157
Pulled By: umanwizard
fbshipit-source-id:
c45da692ba83c0ec80815200c5543302128da75c
Francisco Massa [Wed, 28 Nov 2018 14:11:08 +0000 (06:11 -0800)]
Fix half tensor printing plus speedup large tensor printing (#14418)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/14344 and https://github.com/pytorch/pytorch/issues/6863
The slowdown was due to the fact that we were only summarizing the tensor (for computing the number of digits to print) if its first dimension was larger than the threshold. It now goes over all the dimensions.
Some quick runtime analysis:
Before this PR:
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50)
In [2]: %timeit str(a)
13.6 s ± 84.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
After this PR
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50)
In [2]: %timeit str(a)
2.08 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [3]: b = a.cuda()
In [4]: %timeit str(b)
8.39 ms ± 45.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14418
Reviewed By: weiyangfb
Differential Revision:
D13226950
Pulled By: soumith
fbshipit-source-id:
19eb4b855db4c8f891d0925a9c56ae8a2824bb23
Wei Yang [Wed, 28 Nov 2018 10:16:56 +0000 (02:16 -0800)]
torch.sparse.sum() (#12430)
Summary:
- to fix #12241
- add `_sparse_sum()` to ATen and expose it as `torch.sparse.sum()`; `SparseTensor.sum()` is not supported currently (see the usage sketch after the benchmarks below)
- this PR depends on #11253 and will need to be updated once it lands
- [x] implement forward
- [x] implement backward
- performance [benchmark script](https://gist.github.com/weiyangfb/f4c55c88b6092ef8f7e348f6b9ad8946#file-sparse_sum_benchmark-py):
- sum all dims is fastest for sparse tensor
- when the input is sparse enough (nnz = 0.1%), sum of a sparse tensor is faster than dense on CPU, but not necessarily on CUDA
- CUDA backward is comparable (<2x) between `sum several dims` vs `sum all dims` in sparse
- CPU backward, which uses binary search, is still slow in sparse; `sum [0, 2, 3] dims` takes `5x` the time of `sum all dims`
- optimize CUDA backward for now
- using thrust for sort and binary search, but runtime not improved
- both CPU and CUDA forward are slow in sparse (`sum several dims` vs `sum all dims`), at most `20x` slower on CPU and `10x` on CUDA
- improve CPU and CUDA forward kernels
(nnz, sizes, sum_dims, keepdim, sum all or dims, bk=backward) | CPU (sparse vs dense) | CUDA(sparse vs dense)
-- | -- | --
(1000, [1000, 1000, 2, 2], [0, 1], False, sumAll) | 8.77 µs vs 72.9 µs | 42.5 µs vs 108 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumD) | 112 µs vs 4.47 ms | 484 µs vs 407 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumAll, bk) | 141 µs vs 148 µs | 647 µs vs 231 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumD, bk) | 235 µs vs 1.23 ms | 781 µs vs 213 µs
(1000, [1000, 1000, 2, 2], [2, 3], False, sumD) | 48.5 µs vs 360 µs | 160 µs vs 2.03 ms
(1000, [1000, 1000, 2, 2], [2, 3], False, sumD, bk) | 258 µs vs 1.22 ms | 798 µs vs 224 µs
(1000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD) | 204 µs vs 882 µs | 443 µs vs 133 µs
(1000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD, bk) | 709 µs vs 1.15 ms | 893 µs vs 202 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumAll) | 39.8 µs vs 81 µs | 42.4 µs vs 113 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumD) | 747 µs vs 4.7 ms | 2.4 ms vs 414 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumAll, bk) | 1.04 ms vs 126 µs | 5.03 ms vs 231 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumD, bk) | 1.12 ms vs 1.24 ms | 5.99 ms vs 213 µs
(10000, [1000, 1000, 2, 2], [2, 3], False, sumD) | 133 µs vs 366 µs | 463 µs vs 2.03 ms
(10000, [1000, 1000, 2, 2], [2, 3], False, sumD, bk) | 1.56 ms vs 1.22 ms | 6.11 ms vs 229 µs
(10000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD) | 1.53 ms vs 799 µs | 824 µs vs 134 µs
(10000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD, bk) | 5.15 ms vs 1.09 ms | 7.02 ms vs 205 µs
- after improving CPU and CUDA forward kernels
- in the `(1000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD)` forward, CPU takes ~~`171 µs`~~, of which `130 µs` is spent in `coalesce()`; for CUDA, the total time is ~~`331 µs`~~, of which `141 µs` is spent in `coalesce()`. We need to reduce time at places outside `coalesce()`.
- after a few simple tweaks, the forward is now at most `10x` slower on CPU and `7x` on CUDA. The time taken by `sum dense dims only [2, 3]` is `~2x` that of `sum all dims`, and the speed of `sum all sparse dims [0, 1]` is on par with `sum all dims`
(nnz, sizes, sum_dims, keepdim, sum all or dims, bk=backward) | CPU (sparse vs dense) | CUDA(sparse vs dense)
-- | -- | --
(1000, [1000, 1000, 2, 2], [0, 1], False, sumAll) | 7 µs vs 69.5 µs | 31.5 µs vs 61.6 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumD) | 11.3 µs vs 4.72 ms | 35.2 µs vs 285 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumAll, bk) | 197 µs vs 124 µs | 857 µs vs 134 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumD, bk) | 124 µs vs 833 µs | 796 µs vs 106 µs
(1000, [1000, 1000, 2, 2], [2, 3], False, sumD) | 20.5 µs vs 213 µs | 39.4 µs vs 1.24 ms
(1000, [1000, 1000, 2, 2], [2, 3], False, sumD, bk) | 131 µs vs 830 µs | 881 µs vs 132 µs
(1000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD) | 95.8 µs vs 409 µs | 246 µs vs 87.2 µs
(1000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD, bk) | 624 µs vs 820 µs | 953 µs vs 124 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumAll) | 45.3 µs vs 72.9 µs | 33.9 µs vs 57.2 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumD) | 81.4 µs vs 4.49 ms | 39.7 µs vs 280 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumAll, bk) | 984 µs vs 111 µs | 6.41 ms vs 121 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumD, bk) | 1.45 ms vs 828 µs | 6.77 ms vs 113 µs
(10000, [1000, 1000, 2, 2], [2, 3], False, sumD) | 74.9 µs vs 209 µs | 37.7 µs vs 1.23 ms
(10000, [1000, 1000, 2, 2], [2, 3], False, sumD, bk) | 1.48 ms vs 845 µs | 6.96 ms vs 132 µs
(10000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD) | 1.14 ms vs 411 µs | 252 µs vs 87.8 µs
(10000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD, bk) | 4.53 ms vs 851 µs | 7.12 ms vs 128 µs
- the time taken in CUDA backward of sparse is very long, with large variance (for nnz=10000, it normally takes 6-7ms). To improve backward of sparse ops, we will need to debug at places other than the CUDA kernels. Here is a benchmark of `torch.copy_()`:
```
>>> d = [1000, 1000, 2, 2]
>>> nnz = 10000
>>> I = torch.cat([torch.randint(0, d[0], size=(nnz,)),
torch.randint(0, d[1], size=(nnz,))], 0).reshape(2, nnz)
>>> V = torch.randn(nnz, d[2], d[3])
>>> size = torch.Size(d)
>>> S = torch.sparse_coo_tensor(I, V, size).coalesce().cuda()
>>> S2 = torch.sparse_coo_tensor(I, V, size).coalesce().cuda().requires_grad_()
>>> data = S2.clone()
>>> S.copy_(S2)
>>> y = S * 2
>>> torch.cuda.synchronize()
>>> %timeit y.backward(data, retain_graph=True); torch.cuda.synchronize()
7.07 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
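For reference, a minimal usage sketch of the new API on a small sparse COO tensor:
```python
import torch

i = torch.tensor([[0, 1, 1],
                  [2, 0, 2]])
v = torch.tensor([3.0, 4.0, 5.0])
s = torch.sparse_coo_tensor(i, v, (2, 3)).coalesce()

print(torch.sparse.sum(s))         # sum over all dims -> 0-dim dense tensor
print(torch.sparse.sum(s, dim=0))  # sum over a sparse dim -> sparse tensor
```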
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12430
Differential Revision:
D12878313
Pulled By: weiyangfb
fbshipit-source-id:
e16dc7681ba41fdabf4838cf05e491ca9108c6fe
Jiyan Yang [Wed, 28 Nov 2018 10:13:21 +0000 (02:13 -0800)]
Ensure FP16 rowwise Adagrad can be run
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12317
Reviewed By: hyuen
Differential Revision:
D10190778
fbshipit-source-id:
720a9aaa4e6b1736023d8c6326a613e4ea592b31
Jongsoo Park [Wed, 28 Nov 2018 09:11:19 +0000 (01:11 -0800)]
use fbgemm's im2col fusion and thread partitioning (#14350)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14350
acc32 for now. Will have a separate diff for acc16, but that will need another output processing that does sparse convolution without im2col.
Reviewed By: dskhudia
Differential Revision:
D13188595
fbshipit-source-id:
e8faee46c7ea43e4a600aecb8b8e93e6c860a8c8
Teng Li [Wed, 28 Nov 2018 08:31:34 +0000 (00:31 -0800)]
PT1 Stable Release Distributed Documentation (#14444)
Summary:
The doc covers pretty much everything we have for distributed in the PT1 stable release, tracked in https://github.com/pytorch/pytorch/issues/14080
Tested by previewing the Sphinx-generated webpages. All look good.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14444
Differential Revision:
D13227675
Pulled By: teng-li
fbshipit-source-id:
752f00df096af38dd36e4a337ea2120ffea79f86
David Riazati [Wed, 28 Nov 2018 08:21:01 +0000 (00:21 -0800)]
Revert
D13192230: [pytorch][PR] [jit] Use nn module tests in test_jit
Differential Revision:
D13192230
Original commit changeset:
36488960b6c9
fbshipit-source-id:
63b68bd909b9ef0548f52c986c84f549aecb8909
Teng Li [Wed, 28 Nov 2018 05:56:25 +0000 (21:56 -0800)]
Fixed SyncParam/QueueReduction/SyncReduction test for 2+ GPUs (#14452)
Summary:
Fixed: https://github.com/pytorch/pytorch/issues/14445
Also bumped up timeout to 30 seconds, since on 8-GPU machines, DDP test will take more than 15 seconds sometimes.
Tested on 8 GPU machines:
```
tengli@learnfair062:~/pytorch/test$ python test_c10d.py --verbose
test_dist_broadcast_coalesced_gloo (__main__.DistributedDataParallelTest) ... ok
test_dist_broadcast_coalesced_nccl (__main__.DistributedDataParallelTest) ... skipped 'Test skipped due to known issues'
test_fp16 (__main__.DistributedDataParallelTest) ... ok
test_gloo_backend (__main__.DistributedDataParallelTest) ... ok
test_nccl_backend (__main__.DistributedDataParallelTest) ... ok
test_queue_reduction (__main__.DistributedDataParallelTest) ... ok
test_sync_params_no_buffers (__main__.DistributedDataParallelTest) ... ok
test_sync_params_with_buffers (__main__.DistributedDataParallelTest) ... ok
test_sync_reduction (__main__.DistributedDataParallelTest) ... ok
test_set_get (__main__.FileStoreTest) ... ok
test_set_get (__main__.PrefixFileStoreTest) ... ok
test_set_get (__main__.PrefixTCPStoreTest) ... ok
test_allgather_basics (__main__.ProcessGroupGlooTest) ... ok
test_allgather_checks (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_basics (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_basics_cuda (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_checks (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_stress (__main__.ProcessGroupGlooTest) ... ok
test_allreduce_stress_cuda (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_basics (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_basics_cuda (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_checks (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_stress (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_stress_cuda (__main__.ProcessGroupGlooTest) ... ok
test_gather_basics (__main__.ProcessGroupGlooTest) ... ok
test_gather_checks (__main__.ProcessGroupGlooTest) ... ok
test_reduce_basics (__main__.ProcessGroupGlooTest) ... ok
test_reduce_checks (__main__.ProcessGroupGlooTest) ... ok
test_scatter_basics (__main__.ProcessGroupGlooTest) ... ok
test_scatter_checks (__main__.ProcessGroupGlooTest) ... ok
test_send_recv_all_to_all (__main__.ProcessGroupGlooTest) ... ok
test_timeout_kwarg (__main__.ProcessGroupGlooTest) ... ok
test_allgather_ops (__main__.ProcessGroupNCCLTest) ... ok
test_allreduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_barrier (__main__.ProcessGroupNCCLTest) ... ok
test_broadcast_ops (__main__.ProcessGroupNCCLTest) ... ok
test_reduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_common_errors (__main__.RendezvousEnvTest) ... ok
test_nominal (__main__.RendezvousEnvTest) ... ok
test_common_errors (__main__.RendezvousFileTest) ... ok
test_nominal (__main__.RendezvousFileTest) ... ok
test_common_errors (__main__.RendezvousTCPTest) ... ok
test_nominal (__main__.RendezvousTCPTest) ... ok
test_unknown_handler (__main__.RendezvousTest) ... ok
test_address_already_in_use (__main__.TCPStoreTest) ... ok
test_set_get (__main__.TCPStoreTest) ... ok
----------------------------------------------------------------------
Ran 46 tests in 162.980s
OK (skipped=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14452
Differential Revision:
D13230652
Pulled By: teng-li
fbshipit-source-id:
88580fe55b3a4fbc7a499ca3b591958f11623bf8
David Riazati [Wed, 28 Nov 2018 05:17:51 +0000 (21:17 -0800)]
Use nn module tests in test_jit (#14238)
Summary:
This PR adds weak modules for all activation modules and uses the `test_nn` module tests to exercise weak modules that have been annotated with `weak_module` and are therefore in `torch._jit_internal._weak_types`.
Also depends on #14379
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14238
Differential Revision:
D13192230
Pulled By: driazati
fbshipit-source-id:
36488960b6c91448b38c0fa65422539a93af8c5e
Brian Vaughan [Wed, 28 Nov 2018 04:31:18 +0000 (20:31 -0800)]
check for invalid ranges in torch.arange
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13915
Differential Revision:
D13222110
Pulled By: nairbv
fbshipit-source-id:
fcff1ad058fbf792d0fdf4aa75d77f22e3b7483b
Brian Vaughan [Wed, 28 Nov 2018 04:28:11 +0000 (20:28 -0800)]
roll along multiple dimensions
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13874
Differential Revision:
D13223669
Pulled By: nairbv
fbshipit-source-id:
1678d52529c326fa4a0614d0994b1820ad12bc04
David Riazati [Wed, 28 Nov 2018 03:37:20 +0000 (19:37 -0800)]
Add poisson_nll_loss to script
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14420
Differential Revision:
D13220726
Pulled By: driazati
fbshipit-source-id:
6c08a0050075beafcc8ba413c9603b273870c70c
David Riazati [Wed, 28 Nov 2018 03:33:47 +0000 (19:33 -0800)]
Add boolean dispatch for function overloading (#14425)
Summary:
This PR allows overloading functions based on the value of a parameter (so long as it is a constant). See max_pool1d for an example usage.
This is the first step in enabling the use of max_pool functions for the standard library that can return `Tensor` or `Tuple[Tensor, Tensor]` based on the `return_indices` flag. This will give the JIT identical results to the Python versions of the functions.
Fixes #14081
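For context, the eager-mode behavior the JIT needs to reproduce:
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 4, 8)

out = F.max_pool1d(x, kernel_size=2)                            # Tensor
out, idx = F.max_pool1d(x, kernel_size=2, return_indices=True)  # Tuple[Tensor, Tensor]
print(out.shape, idx.shape)
```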
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14425
Differential Revision:
D13222104
Pulled By: driazati
fbshipit-source-id:
8cb676b8b13ebcec3262234698edf4a7d7dcbbe1
Zachary DeVito [Wed, 28 Nov 2018 03:11:47 +0000 (19:11 -0800)]
fix enable_cpu_fuser
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14440
Differential Revision:
D13226354
Pulled By: zdevito
fbshipit-source-id:
e4ed023eece8b5b670a4a27d24a8688907b36b90
Elias Ellison [Wed, 28 Nov 2018 02:36:05 +0000 (18:36 -0800)]
Move Affine grid to C++ (#14392)
Summary:
Port AffineGrid to C++, because script does not support compiling Function classes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14392
Differential Revision:
D13219698
Pulled By: eellison
fbshipit-source-id:
3ddad8a84c72010b5a6c6f7f9712be614202faa6
Peter Goldsborough [Wed, 28 Nov 2018 01:33:54 +0000 (17:33 -0800)]
Allow building libraries with setuptools that dont have abi suffix (#14130)
Summary:
When using `setuptools` to build a Python extension, setuptools will automatically add an ABI suffix like `cpython-37m-x86_64-linux-gnu` to the shared library name when using Python 3. This is required for extensions meant to be imported as Python modules. When we use setuptools to build shared libraries not meant as Python modules, for example libraries that define and register TorchScript custom ops, having your library called `my_ops.cpython-37m-x86_64-linux-gnu.so` is a bit annoying compared to just `my_ops.so`, especially since you have to reference the library name when loading it with `torch.ops.load_library` in Python.
This PR fixes this by adding a `with_options` class method to the `torch.utils.cpp_extension.BuildExtension` which allows configuring the `BuildExtension`. In this case, the first option we add is `no_python_abi_suffix`, which we then use in `get_ext_filename` (override from `setuptools.build_ext`) to throw away the ABI suffix.
I've added a test `setup.py` in a `no_python_abi_suffix_test` folder.
Fixes https://github.com/pytorch/pytorch/issues/14188
t-vi fmassa soumith
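A sketch of the resulting `setup.py` usage for a TorchScript custom-op library (module and file names are placeholders):
```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_ops",
    ext_modules=[CppExtension("my_ops", ["my_ops.cpp"])],
    cmdclass={
        # Drop the cpython-*-gnu ABI tag so the output is just my_ops.so
        "build_ext": BuildExtension.with_options(no_python_abi_suffix=True)
    },
)
```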
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14130
Differential Revision:
D13216575
Pulled By: goldsborough
fbshipit-source-id:
67dc345c1278a1a4ee4ca907d848bc1fb4956cfa
Wanchao Liang [Wed, 28 Nov 2018 01:28:55 +0000 (17:28 -0800)]
Fix clang tidy errors
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14427
Differential Revision:
D13222381
Pulled By: wanchaol
fbshipit-source-id:
d90d210a810e95bf0eb404f9c1c304f4e6a3f61e
Zachary DeVito [Wed, 28 Nov 2018 01:08:09 +0000 (17:08 -0800)]
Handling of pretty-printing methods (#14378)
Summary:
Stacked on #14176, review only the last commit.
* Print parameters to methods as self.weight rather than as extra inputs.
* Print entire set of methods out as a single string
* Update test code to test the module-at-a-time export/import
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14378
Differential Revision:
D13198463
Pulled By: zdevito
fbshipit-source-id:
3fab02e8239cfd6f40d6ab6399047bd02cf0a8c8
Edward Yang [Wed, 28 Nov 2018 00:36:09 +0000 (16:36 -0800)]
Eliminate necessity of HIPify on AccumulateType.h (#14412)
Summary:
I'd like to NOT HIPify files that are not in a cuda/
directory, so hand-HIPify AccumulateType.h
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14412
Differential Revision:
D13221801
Pulled By: ezyang
fbshipit-source-id:
d1927cfc956e50a6a5e67168ac0e1ce56ecd1e0b
andersj [Tue, 27 Nov 2018 23:51:17 +0000 (15:51 -0800)]
when BUILD_CAFFE2_OPS is OFF, torch-python needs a direct dep on nccl (#14430)
Summary:
https://github.com/pytorch/pytorch/issues/14431 tracks supporting this with CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14430
Differential Revision:
D13224079
Pulled By: anderspapitto
fbshipit-source-id:
47d7900d25910ed61585b93f9003acd1b2630a9f
Sam Gross [Tue, 27 Nov 2018 23:18:39 +0000 (15:18 -0800)]
Speed-up "advanced" indexing operations (#13420)
Summary:
This speeds-up "advanced" indexing (indexing a tensor by a tensor)
on CPU and GPU. There's still a bunch of work to do, including
speeding up indexing by a byte (boolean) mask and speeding up the derivative
calculation for advanced indexing.
Here are some speed comparisons to indexing on master using a little [benchmark script](https://gist.github.com/colesbury/c369db72aad594e5e032c8fda557d909) with 16 OpenMP threads and on a P100. The test cases are listed as (input shape -> output shape).
| Test case | CPU (old vs. new) | CUDA (old vs. new) |
|-----------------------|---------------------|------------------------|
| 1024x1024 -> 512x1024 | 225 us vs. **57 us** | 297 us vs. **47 us** |
| 1024x1024 -> 1024x512 | 208 us vs. **153 us** | 335 us vs. **54 us** |
| 50x50 -> 20000x50 | 617 us vs. **77 us** | 239 us vs. **54 us** |
| 50x50 -> 50x20000 | 575 us vs. **236 us** | 262 us vs. **58 us** |
| 2x5x10 -> 10 | 65 us vs. **18 us** | 612 us vs. **93 us** |
See #11647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13420
Reviewed By: soumith
Differential Revision:
D13088936
Pulled By: colesbury
fbshipit-source-id:
0a5c2ee9aa54e15f96d06692d1694c3b24b924e2
Jiyan Yang [Tue, 27 Nov 2018 22:49:28 +0000 (14:49 -0800)]
Resubmit: Set the correct engine name for position weighted pooling when fp16 is used for training
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13768
Reviewed By: xianjiec
Differential Revision:
D12996103
fbshipit-source-id:
5ca4cda4210f68ece2b5d6eced8cf52ee91fb36f
Will Feng [Tue, 27 Nov 2018 22:13:48 +0000 (14:13 -0800)]
Windows local build: restore original working dir after activating VC environment (#14416)
Summary:
`call "C:\\Program Files (x86)\\Microsoft Visual Studio\\2017\\Community\\VC\\Auxiliary\\Build\\vcvarsall.bat" x64` seems to change the working dir to `C:\Users\Administrator\source`, and we need to cd back to the PyTorch directory before running `git submodule update --init --recursive`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14416
Differential Revision:
D13222269
Pulled By: yf225
fbshipit-source-id:
a0eb3311fb11713b1bb8f52cd13e2c21d5ca9c7b
Jerry Zhang [Tue, 27 Nov 2018 22:10:41 +0000 (14:10 -0800)]
condition blob in while_op test changes data type (#14279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14279
att
Reviewed By: smessmer
Differential Revision:
D13144472
fbshipit-source-id:
af4d920a3148c648d1a428a5bcd56da19ea8c38c
zrphercule [Tue, 27 Nov 2018 21:49:21 +0000 (13:49 -0800)]
Add test of ONNX_ATEN (#14259)
Summary:
In #14239 we fixed ONNX_ATEN.
To make sure of its correctness in the future, we should add a related test case.
We use torch.fmod() to test ONNX_ATEN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14259
Differential Revision:
D13204610
Pulled By: zrphercule
fbshipit-source-id:
e4660c346e5edd201f1458b7d74d7dfac49b94c7
Hassan Eslami [Tue, 27 Nov 2018 21:31:59 +0000 (13:31 -0800)]
Allowing TaskGroups to carry remote nets (#14342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14342
Sometimes, when we are creating a TaskGroup, we are in fact creating a TaskGroup for a distributed job. In some cases, we may want to register a few nets as "remote" to a TaskGroup. A remote net should have sufficient attributes describing where it should be executed later on.
This diff adds the remote net attribute to the TaskGroup class. It exposes two minimal functionalities: adding a remote net, and getting all remote nets added to a TaskGroup.
Reviewed By: d4l3k
Differential Revision:
D13188320
fbshipit-source-id:
efe947aec30817e9512a5e18be985713b9356bdc
Edward Yang [Tue, 27 Nov 2018 21:14:12 +0000 (13:14 -0800)]
Add scaffolding for HIP backend in ATen/core. (#14285)
Summary:
This code doesn't actually do anything, but it will be the
groundwork necessary to change PyTorch's HIPIFY pass from reusing
CUDA identifiers directly, to actually switching to using HIP
identifiers (moving us closer to a world where we can compile
both HIP and CUDA PyTorch side-by-side.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14285
Differential Revision:
D13158851
Pulled By: ezyang
fbshipit-source-id:
df2462daa5d0d4112455b67bd3067d60ba55cda5
Edward Yang [Tue, 27 Nov 2018 21:12:25 +0000 (13:12 -0800)]
Document device_guard in native_functions.yaml (#14235)
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14235
Differential Revision:
D13145780
Pulled By: ezyang
fbshipit-source-id:
0e93bf009ad492551bcdcada0357f2fef529e67d
David Riazati [Tue, 27 Nov 2018 21:12:14 +0000 (13:12 -0800)]
Revert
D13192228: [pytorch][PR] [jit] Add boolean dispatch for function overloading
Differential Revision:
D13192228
Original commit changeset:
fce33c400c1f
fbshipit-source-id:
75c9991dc7097f9513c6c89d16eff2de6e287c3b
Sebastian Messmer [Tue, 27 Nov 2018 20:43:24 +0000 (12:43 -0800)]
Remove fake dependencies from TensorImpl to caffe2 (#14141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14141
These includes weren't actually used, let's remove them.
Reviewed By: ezyang
Differential Revision:
D13113129
fbshipit-source-id:
816995e280b81bf99002772ea8aea458bdfcd2c7
Sebastian Messmer [Tue, 27 Nov 2018 20:43:24 +0000 (12:43 -0800)]
Fix include paths for TensorTypeId.h and TensorTypeIdRegistration.h
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14070
Reviewed By: ezyang
Differential Revision:
D13081610
fbshipit-source-id:
685994a15a2cd15e9e5447cf77671343de5dd278
Sebastian Messmer [Tue, 27 Nov 2018 20:43:24 +0000 (12:43 -0800)]
Move TensorTypeId to c10/core
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14327
Reviewed By: ezyang
Differential Revision:
D13131338
fbshipit-source-id:
c4682cb6ed6fe4cd1636e09d918eef6e90c836f1
Sebastian Messmer [Tue, 27 Nov 2018 20:43:24 +0000 (12:43 -0800)]
Fix include paths for Storage.h and StorageImpl.h
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14062
Reviewed By: ezyang
Differential Revision:
D13081603
fbshipit-source-id:
c272b715ef2f513d21d1c3f34fbf79eec6946441
Sebastian Messmer [Tue, 27 Nov 2018 20:43:24 +0000 (12:43 -0800)]
Move Storage and StorageImpl to c10
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14061
Reviewed By: ezyang
Differential Revision:
D13081608
fbshipit-source-id:
1ea2d32e9ec9293b6ffa4b9e76c674cca55d5a1c
Sebastian Messmer [Tue, 27 Nov 2018 20:43:24 +0000 (12:43 -0800)]
Fix include paths for Allocator.h
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14060
Reviewed By: ezyang
Differential Revision:
D13081605
fbshipit-source-id:
02f23af174c0f0c38fb0163c2dfef3873ff5635d
Sebastian Messmer [Tue, 27 Nov 2018 20:43:24 +0000 (12:43 -0800)]
Move Allocator.h to c10
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14059
Reviewed By: ezyang
Differential Revision:
D13081606
fbshipit-source-id:
d6ad59ad4e3d363268cd4307b6c999a168681246
Sebastian Messmer [Tue, 27 Nov 2018 20:43:22 +0000 (12:43 -0800)]
Move UniqueVoidPtr to c10
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14058
Reviewed By: dzhulgakov
Differential Revision:
D13081602
fbshipit-source-id:
e91ccf9fba9a7a02f99ed90b7a3a0fe7afd56832
Sebastian Messmer [Tue, 27 Nov 2018 20:43:22 +0000 (12:43 -0800)]
Move ScalarTypeUtils.h to c10
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14024
Reviewed By: ezyang
Differential Revision:
D13081604
fbshipit-source-id:
d7a09610f64eb2e9dd831bbb3c85f20691251594
Sebastian Messmer [Tue, 27 Nov 2018 20:43:22 +0000 (12:43 -0800)]
Fix include paths for Scalar.h and ScalarType.h
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14023
Reviewed By: ezyang
Differential Revision:
D13081609
fbshipit-source-id:
c27eeafa381b39e043f0261ea7f6f634ee8bc238
Sebastian Messmer [Tue, 27 Nov 2018 20:43:22 +0000 (12:43 -0800)]
Move Scalar and ScalarType to c10/core
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14022
Reviewed By: ezyang
Differential Revision:
D13015236
fbshipit-source-id:
92aac4e342d85f75a31837b2943fa5b80f0c35c9
Michael Suo [Tue, 27 Nov 2018 20:38:28 +0000 (12:38 -0800)]
Trace in-place ops (#14254)
Summary:
This PR adds a `try_outplace` option to the tracer. When `try_outplace` is true, the tracer will attempt to emit out-of-place ops (similar to how things are done today). When it's false, the correct in-place op is emitted.
I made `try_outplace` false by default, but flipped it to true for ONNX export utils. zdevito jamesr66a, anywhere else I should preserve the existing behavior?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14254
Reviewed By: eellison
Differential Revision:
D13166691
Pulled By: suo
fbshipit-source-id:
ce39fdf73ac39811c55100e567466d53108e856b
Teng Li [Tue, 27 Nov 2018 20:32:56 +0000 (12:32 -0800)]
Fixed torch.multiprocessing.spawn for not being able to spawn like dataloader workers (#14391)
Summary:
Should fix: https://github.com/pytorch/pytorch/issues/14390
Now the ImageNet example works fine with multiprocessing and more than one dataloader worker.
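The usage pattern being fixed looks roughly like this (a hedged sketch; each spawned process builds its own DataLoader with more than one worker):
```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

def worker(rank, world_size):
    loader = DataLoader(TensorDataset(torch.randn(64, 3)),
                        batch_size=8, num_workers=2)
    for (batch,) in loader:
        pass  # training step would go here

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```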
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14391
Reviewed By: calebho
Differential Revision:
D13209800
Pulled By: teng-li
fbshipit-source-id:
e8abc0fb38d4436cf3474dcbba0e28f4290e4d29
Jerry Zhang [Tue, 27 Nov 2018 20:31:17 +0000 (12:31 -0800)]
Tensor construction: combine Resize+mutable_data - 4/4 (#13856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13856
Codemod generated with clangr shard mode, 25 files per diff,
motivation: https://github.com/pytorch/pytorch/pull/12407
Reviewed By: smessmer
Differential Revision:
D13007310
fbshipit-source-id:
941f064ef8934bb17fbfb706e6ed3db173b5d268
Zachary DeVito [Tue, 27 Nov 2018 19:46:17 +0000 (11:46 -0800)]
Print default values and introduce ir view classes (#14176)
Summary:
[Stacked commit, only review the last commit]
This PR adds support for printing default values in python printing as well as the logic
for parsing default values back in using the parser. For simplicity, this PR simply
creates a subgraph of the constant expressions and then runs that graph to generate the defaults.
A more lightweight approach should be possible later, but would require more machinery.
To make reading code in the printer easier, this also add ir_views.h.
Similar to tree_views.h these classes can provide views of some commonly used IR nodes
that have complicated structure and common operations on that structure.
Currently it has only read-only views for prim::If and prim::Loop,
but we should eventually add helpers to manipulate If/Loop nodes as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14176
Differential Revision:
D13198455
Pulled By: zdevito
fbshipit-source-id:
dc99ab9692804ccaedb60a55040c0b89ac7a6a6d
Thomas Viehmann [Tue, 27 Nov 2018 19:30:41 +0000 (11:30 -0800)]
Add Type support to the fuser, fuse more (#14336)
Summary:
This adds scalar type support to the fuser, both internally (instead of auto / assuming float) and for the inputs/outputs.
We can now fuse things with inputs/outputs of arbitrary scalar type; in particular, comparisons and `where` work well. This fixes #13384 by returning a tensor of the right type (and adds a test where byte and double tensors are returned).
The type inference is done by re-calling PropagateTensorShapeOnNode in the compilation, I would venture that it isn't prohibitively expensive compared to the actual compilation. (Propagation was fixed for where to return the second argument's type and amended to handle FusedConcat.)
I'm not sure how to add a check for the code generated by the fuser, but I am not sure we absolutely need to (we'd see if it is invalid / produces wrong results).
Thanks in particular to apaszke, fmassa, mruberry for advice and encouragement! All the errors are my own.
I have discussed order of PRs briefly with mruberry, if this goes in before he submits the PR, he graciously agreed to rebasing his, but I'd happily rebase, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14336
Differential Revision:
D13202620
Pulled By: soumith
fbshipit-source-id:
855159e261fa15f21aca3053bfc05fb3f720a8ef
svcscm [Tue, 27 Nov 2018 19:20:46 +0000 (11:20 -0800)]
Updating submodules
Reviewed By: yns88
fbshipit-source-id:
e63160e97550942931bacaa860d91d591d2e1712
David Riazati [Tue, 27 Nov 2018 18:49:14 +0000 (10:49 -0800)]
Add boolean dispatch for function overloading (#14081)
Summary:
This PR allows overloading functions based on the value of a parameter (so long as it is a constant). See `max_pool1d` for an example usage.
This is the first step in enabling the use of `max_pool` functions for the standard library that can return `Tensor` or `Tuple[Tensor, Tensor]` based on the `return_indices` flag. This will give the JIT identical results to the Python versions of the functions.
Depends on #14232 for `Optional[BroadcastingList[T]]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14081
Differential Revision:
D13192228
Pulled By: driazati
fbshipit-source-id:
fce33c400c1fd06e59747d98507c5fdcd8d4c113
Pieter Noordhuis [Tue, 27 Nov 2018 18:41:06 +0000 (10:41 -0800)]
Barrier synchronizes with prior work before completing (#14386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14386
See #13573, #14142, and #14271 for discussion.
This change updates ProcessGroupGloo to ensure that all prior
operations have completed before executing the barrier.
Reviewed By: manojkris
Differential Revision:
D13205022
fbshipit-source-id:
673e7e6ca357dc843874d6dd8da590832e1de7fa
Pieter Noordhuis [Tue, 27 Nov 2018 18:41:06 +0000 (10:41 -0800)]
Make ProcessGroup::Work::wait() throw (#14298)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14298
This is a breaking API change for users of the C++ c10d API. The work
object defined wait() to return a boolean. If the work completed
successfully it would return true, if it didn't it would return false.
It was then up to the user to call the exception() function to figure
out what went wrong. This has proven suboptimal as it allows users to
forget about failure handling and errors may be ignored.
The work class is semantically very similar to std::future, where a
call to get() may throw if the underlying std::promise has set an
exception. This commit changes the semantic of the work class to be
similar to this and turns wait() into a void function that throws if
the work completes with an exception.
The exception() function can still be used to retrieve the exception
if isSuccess() returns false, but now returns an std::exception_ptr
instead of a reference to a std::exception.
Reviewed By: manojkris
Differential Revision:
D13158475
fbshipit-source-id:
9cd8569b9e7cbddc867a5f34c6fd0b7be85581b8
Pieter Noordhuis [Tue, 27 Nov 2018 18:41:04 +0000 (10:41 -0800)]
Add option structs and timeout field (#14297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14297
Adds option structs for allgather and barrier such that we have one
for every collective. Add timeout member field to every one of these
such that we can support per operation timeouts.
Use default constructed options struct for every collective process
group function exposed to Python.
Reviewed By: manojkris
Differential Revision:
D13158474
fbshipit-source-id:
3d28977de2f2bd6fc2f42ba3108b63a429338906
Pieter Noordhuis [Tue, 27 Nov 2018 18:41:04 +0000 (10:41 -0800)]
Refer to all work with ProcessGroup prefix (#14296)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14296
There was mixed usage of "ProcessGroup::Work" and just "Work".
Adding prefix for readability/consistency.
Reviewed By: manojkris
Differential Revision:
D13128977
fbshipit-source-id:
a54a8784fa91cd6023c723cb83e9f626fb896a30
Pieter Noordhuis [Tue, 27 Nov 2018 18:41:04 +0000 (10:41 -0800)]
Remove algorithm caching in ProcessGroupGloo (#14295)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14295
This is no longer used after moving to Gloo new style algorithms.
Closes #11912.
Reviewed By: manojkris
Differential Revision:
D13111781
fbshipit-source-id:
53e347080e29d847cd9da36f2d93af047930690c
Pieter Noordhuis [Tue, 27 Nov 2018 18:41:04 +0000 (10:41 -0800)]
Use new style barrier support in c10d/gloo (#14294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14294
This is the final collective to be ported to the new style where there
is no longer a need to keep a cached algorithm instance around. There
is a follow up change incoming to remove the algorithm caching
functionality in ProcessGroupGloo.
Reviewed By: manojkris
Differential Revision:
D13111509
fbshipit-source-id:
f3ea0d955a62029fc4e7cfc09055e4957e0943ac
Wei Yang [Tue, 27 Nov 2018 18:22:24 +0000 (10:22 -0800)]
fix doc for sparse.addmm (#14403)
Summary:
- fixing the doc issue in sparse.addmm
================ before change ==================
![image](https://user-images.githubusercontent.com/38509346/49063994-2f10fe80-f1ce-11e8-9ccc-54241bc45f0b.png)
![image](https://user-images.githubusercontent.com/38509346/49064064-641d5100-f1ce-11e8-865a-7227be7156ef.png)
================ post change ==================
![image](https://user-images.githubusercontent.com/38509346/49064078-76978a80-f1ce-11e8-8f38-f1f8ac9ce63b.png)
![image](https://user-images.githubusercontent.com/38509346/49064085-7bf4d500-f1ce-11e8-8a0d-bf9e5460d21f.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14403
Differential Revision:
D13216582
Pulled By: weiyangfb
fbshipit-source-id:
52e0a20c6b341c37cfb31f281be3afe2a52ca532
Jongsoo Park [Tue, 27 Nov 2018 18:05:28 +0000 (10:05 -0800)]
per-group and per-channel quantization (#14340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14340
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/25
Per-group and per-channel quantization in fbgemm
This diff also cleans up explicit template instantiation using macro expansion
This diff also changes the randFill interface, which made it easy to mistakenly generate integer random numbers for floating-point vectors.
Using this in DNNLOWP operators will be done in a separate diff.
Reviewed By: dskhudia
Differential Revision:
D13176386
fbshipit-source-id:
e46c53e31e21520bded71b8ed86e8b19e010e2dd
Peter Goldsborough [Tue, 27 Nov 2018 18:04:57 +0000 (10:04 -0800)]
Add variable_factories.h to cppdocs (#14381)
Summary:
This will document `torch::from_blob` and such.
soumith ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14381
Differential Revision:
D13216560
Pulled By: goldsborough
fbshipit-source-id:
112f60e45e4d38a8a9983fa71e9cc56bc1a73465
Jan Schlüter [Tue, 27 Nov 2018 17:36:11 +0000 (09:36 -0800)]
Use integer math to compute output size of pooling operations (#14405)
Summary:
As reported in #13386, the pooling operations can return wrong results for large inputs. The root of the problem is that while the output shape is initially being computed with integer operations, it is converted to float32 for division by the stride and applying either a `ceil` or a `floor` depending on the `ceil_mode`. Since even moderately large integers (the smallest being 16,777,217) cannot be expressed exactly in float32, this leads to wrong result shapes.
This PR relies purely on integer operations to perform the shape computation, including the ceil/floor distinction. Since I could not stand all that duplicated code, I pulled it out into a `pooling_shape.h` header, similar to the existing `linear_upsampling.h` header. I hope this is acceptable, let me know if you'd like to see it solved differently. I've also added tests to `test_nn.py` that fail without my changes and pass with my changes. They cover `{max,avg}_pool{1,2,3}d()` for CPU and GPU.
Fixes #13386.
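A simplified sketch of the integer-only shape computation (assumes dilation 1 and ignores the rule that drops a trailing window starting entirely in the padding):
```python
def pooled_size(in_size, kernel, stride, pad, ceil_mode):
    # classic formula: (in + 2*pad - kernel) / stride + 1
    numer = in_size + 2 * pad - kernel
    if ceil_mode:
        # integer ceiling division instead of float32 ceil()
        return (numer + stride - 1) // stride + 1
    return numer // stride + 1

# 16,777,217 is the smallest integer not representable in float32, which is
# where the old float-based computation could start producing wrong shapes.
print(pooled_size(16_777_217, 3, 2, 0, ceil_mode=False))  # 8388608
```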
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14405
Differential Revision:
D13215260
Pulled By: soumith
fbshipit-source-id:
802588ce6cba8db6c346448c3b3c0dac14d12b2d
Edward Yang [Tue, 27 Nov 2018 16:23:34 +0000 (08:23 -0800)]
Delete legacy THCStream (long live THCStream). (#14246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14246
This commit systematically eliminates THCStream entirely from THC, replacing it
with at::cuda::CUDAStream. In places where the previous pointer type showed up
in a public API signature, those functions are now only available to C++
clients. (It would not be too difficult to make a C-compatible version of
CUDAStream, as it's really just a simple struct, but we leave this for
future work.)
All functions in THC that referred to THCStream were expunged in favor of their
modern counterparts.
One annoyance was that I didn't feel like redoing how the torch.cuda.Stream
binding code worked, but I really wanted to get rid of the stored THCStream*
pointer. So I repurposed the bit-packing code I implemented for Stream hashing,
and used that to (reversibly) store streams in a uint64_t cdata field. A perhaps
more future proof solution would be to get rid of cdata entirely, and store the
device and stream ID directly.
Billing of changes:
- All CUDAStream_ pointer API functions are now hidden and anonymously
namespaced (instead of being in the impl namespace). All use sites
rewritten to use the modern C++ API. Since CUDAStreamInternals is no
longer part of the public API, the CUDAStreamInternals constructor and
internals() method have been removed, and replaced with anonymous
functions in the C++ file.
- device_index() returns DeviceIndex rather than int64_t now
- Stream and CUDAStream now have pack/unpack methods. (CUDAStream checks
that the unpacked bit-pattern is for a CUDA device.)
- THCStream.h header is removed entirely
- Most THCStream handling functions in THC API are removed
Reviewed By: gchanan
Differential Revision:
D13121531
fbshipit-source-id:
48873262cc0a37c3eec75a7ba1c93c800da40222
Edward Yang [Tue, 27 Nov 2018 16:23:34 +0000 (08:23 -0800)]
Add hash functions for Stream, CUDAStream; fix Device hash function (#14191)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14191
Previously, Device's hash function only worked for CPU and CUDA. Now
it works for everything.
Implementing the bit concatenation was a bit tricky, and I got it wrong the
first time. See Note [Hazard when concatenating signed integers]
Reviewed By: smessmer
Differential Revision:
D13119624
fbshipit-source-id:
36bfa139cfc739bb0624f52aaf466438c2428207
Owen Anderson [Tue, 27 Nov 2018 06:41:56 +0000 (22:41 -0800)]
Implement NaN-propagating max/min on Vec256.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13399
Differential Revision:
D13199957
Pulled By: resistor
fbshipit-source-id:
1565e079b13c5d4f42f2033830a7c997b7d824bc
svcscm [Tue, 27 Nov 2018 03:35:44 +0000 (19:35 -0800)]
Updating submodules
Reviewed By: yns88
fbshipit-source-id:
210f7eec65bea5e31817fb56dec27b0ab8af797a
Ilia Cherniavskii [Tue, 27 Nov 2018 03:07:07 +0000 (19:07 -0800)]
Remove unused executors, part 3 (#14199)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14199
Remove legacy code for dag, async_dag
Reviewed By: salexspb
Differential Revision:
D13019102
fbshipit-source-id:
ff07e45304d9af4be0375215f4b642c4b0edb12d
Ilia Cherniavskii [Tue, 27 Nov 2018 03:07:07 +0000 (19:07 -0800)]
Remove unused executors, part 2 (#14115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14115
Remove legacy implementation of prof_dag
Reviewed By: salexspb
Differential Revision:
D13019096
fbshipit-source-id:
4f2bf676444d84eaa2cc1effcc3ebdc764e0a016
Ilia Cherniavskii [Tue, 27 Nov 2018 03:07:06 +0000 (19:07 -0800)]
Remove unused executors, part 1 (#14117)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14117
Removing unused legacy executors (htrace)
Reviewed By: salexspb
Differential Revision:
D13019078
fbshipit-source-id:
19d0ed1b47a22cc17c27fdd15d748ced54806132
Edward Yang [Tue, 27 Nov 2018 03:06:06 +0000 (19:06 -0800)]
Delete OPENMP_STUB translation. (#14286)
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14286
Differential Revision:
D13205356
Pulled By: ezyang
fbshipit-source-id:
08e9821e4b32f8d7f3c41906e481f280ee6cf2e3
Wei Yang [Tue, 27 Nov 2018 01:43:21 +0000 (17:43 -0800)]
backward for sparse.addmm(D, S, D, alpha, beta) -> D (#13345)
Summary:
- introduce `sparse.addmm()` with backward support for sparse matrix input, for https://github.com/pytorch/pytorch/issues/12308
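A minimal usage sketch (dense + sparse @ dense, with gradients flowing back to the sparse matrix):
```python
import torch

i = torch.tensor([[0, 1], [1, 0]])
v = torch.tensor([2.0, 3.0])
S = torch.sparse_coo_tensor(i, v, (2, 2)).requires_grad_()  # sparse matrix
D1 = torch.randn(2, 3)
D2 = torch.randn(2, 3, requires_grad=True)

out = torch.sparse.addmm(D1, S, D2)  # computes D1 + S @ D2
out.sum().backward()
print(S.grad)         # sparse gradient w.r.t. the sparse input
print(D2.grad.shape)  # torch.Size([2, 3])
```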
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13345
Differential Revision:
D13094070
Pulled By: weiyangfb
fbshipit-source-id:
136c08c3ca9bafb20577b60dd43d31c3e5cd5461
Marat Dukhan [Tue, 27 Nov 2018 01:41:13 +0000 (17:41 -0800)]
Switch Int8ChannelShuffle operator to QNNPACK (#14362)
Summary:
1.8-2.2X better performance on ARM devices
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14362
Reviewed By: jerryzh168
Differential Revision:
D13192312
Pulled By: Maratyszcza
fbshipit-source-id:
0d3dff067e300c7d741c42615b61246cbf09a829
Teng Li [Tue, 27 Nov 2018 01:05:17 +0000 (17:05 -0800)]
Fixed file init_method write/read race (#14388)
Summary:
This should fix the race among multiple processes: https://github.com/pytorch/pytorch/issues/13750
Essentially, the reader tries to open the file and will error out if it doesn't exist. Here we factor in the timeout option of FileStore to apply a timeout to creating the file (it should always be created anyway unless something is wrong) and, more importantly, to waiting for the file to be created.
Tested on both NFS and a local drive; the race disappears when 8 concurrent processes do distributed training.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14388
Differential Revision:
D13207178
Pulled By: teng-li
fbshipit-source-id:
d3d5d62c4c8f01c0522bf1653c8986155c54ff80
Peter Goldsborough [Tue, 27 Nov 2018 01:04:51 +0000 (17:04 -0800)]
Fix dataloader iterator test (#14045)
Summary:
I noticed the test `DataLoaderTest.CanDereferenceIteratorMultipleTimes` doesn't test proper progression of the iterator. I also added a test for using `std::copy`.
Fixes https://github.com/pytorch/pytorch/issues/14276
ebetica ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14045
Differential Revision:
D13092187
Pulled By: goldsborough
fbshipit-source-id:
57698ec00fa7b914b159677a4ab38b6b25c2860b
Teng Li [Tue, 27 Nov 2018 00:44:11 +0000 (16:44 -0800)]
Fixed c10d test (#14389)
Summary:
Most likely a typo.
Tested on 8-GPU machine
```
tengli@learnfair062:~/pytorch/test$ python test_c10d.py ProcessGroupNCCLTest.test_barrier
.
----------------------------------------------------------------------
Ran 1 test in 29.341s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14389
Differential Revision:
D13207207
Pulled By: teng-li
fbshipit-source-id:
aaffe14237076fe19d94e2fa4d9c093397f07bb9
Brennan Vincent [Tue, 27 Nov 2018 00:34:47 +0000 (16:34 -0800)]
fix typo in `torch.sum` documentation (#14250)
Summary:
Notice that an extra colon was added to `:attr:`, so in https://pytorch.org/docs/stable/torch.html#torch.sum , `dim` shows up as ":attr::_dim_". This patch fixes the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14250
Reviewed By: soumith
Differential Revision:
D13146363
Pulled By: umanwizard
fbshipit-source-id:
f7d03dcb0973aae248b56ab407ba8489f2b1fe36
Wanchao Liang [Tue, 27 Nov 2018 00:21:08 +0000 (16:21 -0800)]
More JIT type hierarchy refinement (#14127)
Summary:
JIT type system hierarchy refinement and refactors:
1. Make NumberType the base type of IntType and FloatType
2. Make single-element containers like OptionalType and FutureType share the SingleElementType base type
3. Some refactors to make it more robust, e.g. adding python_str() for some types so that we have a proper python_print serialization format
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14127
Differential Revision:
D13112657
Pulled By: wanchaol
fbshipit-source-id:
335c5b25977be2e0a462c7e4a6649c1b653ccb4f