Victor Quach [Thu, 12 Aug 2021 19:36:38 +0000 (12:36 -0700)]
Forbid inplace modification of a saved tensor's pack_hook input (#62717)
Summary:
When using saved tensors hooks (especially default hooks),
if the user defines a `pack_hook` that modifies its input,
it can cause some surprising behavior.
The goal of this PR is to prevent future user headaches by catching
inplace modifications of the input of `pack_hook` and raising an error if
applicable.
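A minimal sketch of the pattern this guards against, assuming the `torch.autograd.graph.saved_tensors_hooks` API from the saved-tensor-hooks work referenced above; the hypothetical `bad_pack` hook mutates its input in place and should now trigger the new error:
```
import torch

def bad_pack(x):
    return x.mul_(2)  # in-place modification of the tensor being saved

def unpack(x):
    return x

a = torch.randn(3, requires_grad=True)
c = a * 2              # non-leaf intermediate that will be saved for backward
with torch.autograd.graph.saved_tensors_hooks(bad_pack, unpack):
    b = c * c          # saving `c` invokes bad_pack and should now raise an error
```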
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62717
Reviewed By: albanD
Differential Revision:
D30255243
Pulled By: Varal7
fbshipit-source-id:
8d73f1e1b50b697a59a2849b5e21cf0aa7493b76
Howard Huang [Thu, 12 Aug 2021 19:22:06 +0000 (12:22 -0700)]
Update CONTRIBUTING.md to remove ProcessGroupAgent (#63160)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63160
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision:
D30284439
Pulled By: H-Huang
fbshipit-source-id:
53c31b6917ef5e2125e146fb0ed73ae3d76a8cf9
Edward Wang (EcoF) [Thu, 12 Aug 2021 19:10:50 +0000 (12:10 -0700)]
add use_strict_trace to tensorboard add_graph method (#63120)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63120
FAIM returns dictionaries as the model output, which throws an error when trying to trace using add_graph. Pass in `strict` to the tracer to make this user configurable.
User post: https://fb.workplace.com/groups/pytorchLightning/permalink/1510194972650369/?comment_id=1510252919311241&reply_comment_id=1510281112641755
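A hedged usage sketch of the new flag (assuming the keyword is `use_strict_trace`, per the title), for a model whose forward returns a dict:
```
import torch
from torch.utils.tensorboard import SummaryWriter

class DictModel(torch.nn.Module):
    def forward(self, x):
        return {"out": x * 2}  # dict outputs break strict tracing

writer = SummaryWriter()
# use_strict_trace=False is forwarded to the tracer's `strict` argument
writer.add_graph(DictModel(), torch.randn(2, 3), use_strict_trace=False)
writer.close()
```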
Test Plan: unit test
Reviewed By: Reubend
Differential Revision:
D30265890
fbshipit-source-id:
58b25d9500b875a29a664aa9ef4c1e7f13631fa1
Shen Li [Thu, 12 Aug 2021 18:39:31 +0000 (11:39 -0700)]
Revert
D30279364: [codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: revert-hammer
Differential Revision:
D30279364 (https://github.com/pytorch/pytorch/commit/b0043072529b81276a69df29e00555333117646c)
Original commit changeset:
c1ed77dfe43a
fbshipit-source-id:
eab50857675c51e0088391af06ec0ecb14e2347e
jiej [Thu, 12 Aug 2021 18:03:32 +0000 (11:03 -0700)]
LayerNorm Support in autodiff: (#50467)
Summary:
1. Extend autodiff by adding an entry for layer_norm in symbolic script; we now use native_layer_norm_backward.
2. Added a backward function `layernorm_double_backward` for `native_layer_norm_backward`, preserving double backward support for LayerNorm in autodiff/ScriptModule.
3. Added a Python test to verify autodiff on layer_norm with various configurations of optional tensors (verifies the fix in https://github.com/pytorch/pytorch/issues/49430).
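A hedged sketch of the behavior targeted by items 1-2: layer_norm inside a scripted function goes through the symbolic-script autodiff entry, and double backward still works:
```
import torch

@torch.jit.script
def fn(x, w, b):
    return torch.nn.functional.layer_norm(x, [4], w, b)

x = torch.randn(2, 4, requires_grad=True)
w = torch.randn(4, requires_grad=True)
b = torch.randn(4, requires_grad=True)
(gx,) = torch.autograd.grad(fn(x, w, b).sum(), x, create_graph=True)
gx.sum().backward()  # double backward through the layer_norm backward
```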
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50467
Reviewed By: eellison
Differential Revision:
D30232864
Pulled By: jansel
fbshipit-source-id:
b9c33075386aff96afff7415df9f94388bfb474a
Co-authored-by: Ryan Spring <rspring@nvidia.com>
Co-authored-by: Jie <jiej@nvidia.com>
Zsolt Dollenstein [Thu, 12 Aug 2021 17:56:55 +0000 (10:56 -0700)]
[codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: manual inspection & sandcastle
Reviewed By: zertosh
Differential Revision:
D30279364
fbshipit-source-id:
c1ed77dfe43a3bde358f92737cd5535ae5d13c9a
Kushashwa Ravi Shrimali [Thu, 12 Aug 2021 16:45:17 +0000 (09:45 -0700)]
[reland] OpInfo: `adaptive_avg_pool2d` (#62935)
Summary:
This PR is an attempt to reland https://github.com/pytorch/pytorch/pull/62704.
**What has changed?**
The op has non-deterministic behavior, hence an appropriate `gradcheck` wrapper had to be added.
cc: mruberry zou3519 heitorschueroff kshitij12345
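A hedged sketch of the kind of wrapper mentioned above: `gradcheck` on the CUDA backward of `adaptive_avg_pool2d` needs a non-zero `nondet_tol` because the kernel is non-deterministic (the exact tolerance here is illustrative):
```
import torch
from torch.autograd import gradcheck

x = torch.randn(1, 3, 8, 8, dtype=torch.double, device="cuda", requires_grad=True)
fn = lambda t: torch.nn.functional.adaptive_avg_pool2d(t, (2, 2))
# nondet_tol allows small run-to-run differences in the backward result
assert gradcheck(fn, (x,), nondet_tol=1e-6)
```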
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62935
Reviewed By: anjali411
Differential Revision:
D30225095
Pulled By: zou3519
fbshipit-source-id:
644873cc21d44b19c8b68f9edff691913778de0e
Rong Rong (AI Infra) [Thu, 12 Aug 2021 15:13:01 +0000 (08:13 -0700)]
[BE] shorten CI name part2 (#63030)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62357
There's no need to specify the cuDNN version, since it is already implied by the CUDA version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63030
Reviewed By: zhouzhuojie, driazati
Differential Revision:
D30226354
Pulled By: walterddr
fbshipit-source-id:
7e2dc577810e0ce80ee27569c25a814566250ab1
Rohan Varma [Thu, 12 Aug 2021 07:37:30 +0000 (00:37 -0700)]
Skip zero test on windows (#63087)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63087
The test failed unexpectedly on Windows, see
https://github.com/pytorch/pytorch/issues/63086. Skip for now while we
investigate.
ghstack-source-id:
135631811
Test Plan: CI
Reviewed By: ngimel
Differential Revision:
D30251300
fbshipit-source-id:
8acb1ea8863c654c171fe989ac24446c321c085d
Peter Bell [Thu, 12 Aug 2021 06:46:12 +0000 (23:46 -0700)]
BatchNorm: Use resize_output and empty, instead of empty_like (#63084)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62967
This lets each of the three implementations choose which memory format
to use for the output, meaning channels_last can be used in more cases.
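A hedged sketch of the case this enables: with the output allocated via `resize_output`/`empty`, a channels_last input can now yield a channels_last output (whether it does depends on which of the three implementations is picked):
```
import torch

bn = torch.nn.BatchNorm2d(3)
x = torch.randn(2, 3, 8, 8).to(memory_format=torch.channels_last)
out = bn(x)
print(out.is_contiguous(memory_format=torch.channels_last))
```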
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63084
Reviewed By: saketh-are
Differential Revision:
D30255740
Pulled By: ngimel
fbshipit-source-id:
48d42850952ec910b29521a1c4e530eb2b29df5e
Supriya Rao [Thu, 12 Aug 2021 05:05:30 +0000 (22:05 -0700)]
[quant] Make version 1 the default for get_default_qat_qconfig (#63043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63043
In version 1 we use the fused module/operator during QAT. Making this the default for all QAT runs going forward.
Older models saved after prepare_qat_fx can still load their state_dict into a model prepared using version 1.
The state_dict will still have the same attribute for the observer/fake_quant modules.
There may be some numerics difference between the old observer code in observer.py and the new fused module that was
re-written in C++/CUDA to perform observe + fake_quantize.
This PR also updates the test to check for the new module instead of the default FakeQuantize module.
Note: there are also some changes to make the operator work for multi-dim per-channel quantization + updated the test for that.
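A hedged usage sketch (assuming the `version` keyword exposed by the earlier fused-module PR): version 1 is now what you get by default, and the old behavior remains selectable:
```
import torch

default_qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")            # now version 1 (fused observer + fake_quant)
legacy_qconfig = torch.quantization.get_default_qat_qconfig("fbgemm", version=0)  # old observer + FakeQuantize modules
```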
Test Plan:
python test/test_quantization.py TestSerialization.test_default_qat_qconfig
Imported from OSS
Reviewed By: raghuramank100
Differential Revision:
D30232222
fbshipit-source-id:
f3553a1926ab7c663bbeed6d574e30a7e90dfb5b
Pritam Damania [Thu, 12 Aug 2021 04:41:31 +0000 (21:41 -0700)]
Fix sharded tensor tests. (#63054)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63054
1) Ensure these tests are skipped in environments without any GPUs.
2) Add the test to run_test.py
ghstack-source-id:
135595698
Test Plan: waitforbuildbot
Reviewed By: wanchaol
Differential Revision:
D30239159
fbshipit-source-id:
21b543ba72e8d10182bc77e7ae1fd34fd4096509
Meghan Lele [Thu, 12 Aug 2021 04:01:28 +0000 (21:01 -0700)]
Port `log_softmax_backward_data` to structured kernel (#62372)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62372
Test Plan: Imported from OSS
Reviewed By: saketh-are
Differential Revision:
D30240242
Pulled By: SplitInfinity
fbshipit-source-id:
67d5e4b1543c2e43675e905ce18ca49c11e33748
Meghan Lele [Thu, 12 Aug 2021 04:01:28 +0000 (21:01 -0700)]
Port `log_softmax` to structured kernel (#57374)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57374
Test Plan: Imported from OSS
Reviewed By: saketh-are
Differential Revision:
D30240243
Pulled By: SplitInfinity
fbshipit-source-id:
de6617c75d16e26d607a884c25b8752b7b561737
zhouzhuojie [Thu, 12 Aug 2021 00:09:02 +0000 (17:09 -0700)]
Add ciflow_ruleset.json generator along with gha ci (#63097)
Summary:
- Add `.github/generated-ciflow-ruleset.json` for ciflow-bot (so that we can generate better comments)
- The lint job also checks git dirty to make sure that the file is always in sync with ciflow configs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63097
Reviewed By: saketh-are
Differential Revision:
D30263278
Pulled By: zhouzhuojie
fbshipit-source-id:
bad68105a228e892ba071b29ecfdf433e1038054
Jiewen Tan [Wed, 11 Aug 2021 23:42:34 +0000 (16:42 -0700)]
Improve IMethod::getArgumentNames to deal with empty argument names list (#62947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62947
This diff improves IMethod::getArgumentNames to deal with an empty argument names list.
Test Plan:
buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- PyTorchDeployPredictor.GetEmptyArgumentNamesValidationMode
buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- PyTorchDeployPredictor.GetEmptyArgumentNamesRealMode
Reviewed By: wconstab
Differential Revision:
D30179974
fbshipit-source-id:
c7aec35c360a73318867c5b77ebfec3affee47e3
Amy He [Wed, 11 Aug 2021 21:24:06 +0000 (14:24 -0700)]
Fix Nnapi backend execute's dangling pointer (#63092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63092
Bug discovered while testing NNAPI Delegate on SparkAR.
Using
```
c10::IntArrayRef order = {0, 2, 3, 1};
fixed_inputs.push_back(tensorInp.get(i).permute(order).contiguous());
```
results in a garbage value for order in `permute()`.
Moving order inside the call to `permute()` fixes this issue. Problem is seemingly related to https://github.com/pytorch/pytorch/issues/44409, but luckily the solution in this case is simple.
Bug wasn't caught earlier, since regular unit tests weren't affected by the dangling pointer, and address sanitizer NNAPI tests are turned off due to there being a different failure (T95764916).
ghstack-source-id:
135526129
Test Plan:
Run Unit tests: `python test/test_jit.py`
Build and run SparkAR on an Android phone at the top of this diff stack (D30173959): `buck build --show-output arstudioplayer_arm64_debug -c pt.enable_nnapi=1`
Reviewed By: raziel, iseeyuan
Differential Revision:
D30237504
fbshipit-source-id:
c946d81feefc453b43d9295d8d6f509cafdcec03
Nikita Shulga [Wed, 11 Aug 2021 21:05:55 +0000 (14:05 -0700)]
Fix warnings (#62930)
Summary:
Add `-Wno-writable-strings` (which is clang's flavor of `-Wwrite-strings`) to the list of warnings ignored while compiling torch_python.
Avoid unnecessary copies in range loops.
Fix a number of signed-unsigned comparisons.
Found while building locally on M1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62930
Reviewed By: albanD
Differential Revision:
D30171981
Pulled By: malfet
fbshipit-source-id:
25bd43dab5675f927ca707e32737ed178b04651e
Tao Xu [Wed, 11 Aug 2021 20:28:09 +0000 (13:28 -0700)]
[iOS][GPU] Consolidate array and non-array kernel for upsampling_nearest2d (#63061)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63061
Cleanup the redundant shader code for the upsampling nearest kernel.
ghstack-source-id:
135524349
Test Plan:
- `buck test pp-macos`
- Op tests in PyTorchPlayground app
Reviewed By: husthyc
Differential Revision:
D30236905
fbshipit-source-id:
e1e001b446452b077e6db719b0519c9070f3300b
Richard Barnes [Wed, 11 Aug 2021 20:12:16 +0000 (13:12 -0700)]
irange-ify 13b (#62476)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62476
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision:
D30001445
fbshipit-source-id:
6f4525338c80e9f929695f47f36ca9c72d96a75d
CaoE [Wed, 11 Aug 2021 19:51:28 +0000 (12:51 -0700)]
Add BFloat16 support for unique and unique_consecutive on CPU (#62559)
Summary:
Add BFloat16 support for unique and unique_consecutive on CPU.
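A short sketch of what now works on CPU:
```
import torch

x = torch.tensor([1.0, 2.0, 2.0, 3.0, 3.0], dtype=torch.bfloat16)
print(torch.unique(x))
print(torch.unique_consecutive(x, return_counts=True))
```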
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62559
Reviewed By: saketh-are
Differential Revision:
D30250675
Pulled By: ngimel
fbshipit-source-id:
26e48f971d87f3b86db237e8ad3a4b74eb3c1def
Alexander Grund [Wed, 11 Aug 2021 19:42:32 +0000 (12:42 -0700)]
Add Github action to upload full source releases (#63022)
Summary:
Those release tarballs include the submodules.
The action runs on every tag and master-branch push but will not upload anything; this makes sure nothing is broken when an actual release happens.
On created releases, the action runs and uploads the tarball.
Fixes https://github.com/pytorch/pytorch/issues/62708
As I don't have access rights here and testing is obviously hard (as a new release needs to be published), I set up a test at https://github.com/Flamefire/pytorch/releases/tag/testtag
See also the run(s) at https://github.com/Flamefire/pytorch/actions/workflows/create_release.yml
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63022
Reviewed By: saketh-are
Differential Revision:
D30256253
Pulled By: seemethere
fbshipit-source-id:
ab5fe131452de14ae3768b91c221e68c536cb3aa
Xiang Gao [Wed, 11 Aug 2021 19:34:58 +0000 (12:34 -0700)]
Embedding thrust->cub: unique (#63042)
Summary:
Followup of https://github.com/pytorch/pytorch/pull/62495
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63042
Reviewed By: saketh-are
Differential Revision:
D30231084
Pulled By: ngimel
fbshipit-source-id:
03b0a88107e8a2aee3570881d81bf2b676f525cd
Howard Cheng [Wed, 11 Aug 2021 19:32:10 +0000 (12:32 -0700)]
[PyTorch] Add flop count for addmm (#61895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61895
* Add FLOP count for addmm, should be `2*m*n*k`.
Share the same code path for `addmm` and `mm`.
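A hedged sketch of checking the reported count with the profiler's `with_flops` option (the exact table formatting is profiler-version dependent):
```
import torch

m, k, n = 8, 16, 32
bias, a, b = torch.randn(m, n), torch.randn(m, k), torch.randn(k, n)
with torch.autograd.profiler.profile(with_flops=True) as prof:
    torch.addmm(bias, a, b)
print(2 * m * n * k)  # expected FLOP count for aten::addmm
print(prof.key_averages().table(sort_by="cpu_time_total"))
```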
Test Plan:
Imported from OSS
`python test/test_profiler.py`
Run a sample profile and check that FLOPS for `aten::addmm` is correct.
`[chowar@devbig053.frc2 ~/local/pytorch/build] ninja bin/test_jit`
`[chowar@devbig053.frc2 ~/local/pytorch/build] ./bin/test_jit --gtest_filter='ComputeFlopsTest*'`
Reviewed By: dskhudia
Differential Revision:
D29785671
fbshipit-source-id:
d1512036202d7234a981bda897af1f75808ccbfe
Salil Desai [Wed, 11 Aug 2021 18:51:58 +0000 (11:51 -0700)]
XNNPack Input Pointer Caching Comment (#62818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62818
Added a comment to explain why we no longer need to manually cache pointers/parameters for convolution, as removed in D29777605 (https://github.com/pytorch/pytorch/commit/f5c6c3947e4618d30ebd68a414f1cfcda27bdcd4)
Test Plan: Sandcastle tests (no code changed)
Reviewed By: kimishpatel
Differential Revision:
D30113489
fbshipit-source-id:
d697f05816acbd367d59a4aced1925303c683d40
rusty1s [Wed, 11 Aug 2021 18:35:53 +0000 (11:35 -0700)]
`_convert_coo_to_csr` CPP and CUDA functionality (#61838)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57381 and improves https://github.com/pytorch/pytorch/pull/61340 via dedicated `coo_to_csr` functionalities.
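A minimal Python sketch of the row-pointer computation that a dedicated COO-to-CSR conversion performs (a hypothetical helper, not the actual `_convert_coo_to_csr` kernel): count nonzeros per row, then prefix-sum into `crow_indices`.
```
import torch

def coo_rows_to_crow_indices(row_indices, num_rows):
    counts = torch.bincount(row_indices, minlength=num_rows)
    crow_indices = torch.zeros(num_rows + 1, dtype=torch.int64)
    crow_indices[1:] = torch.cumsum(counts, dim=0)
    return crow_indices

# rows [0, 0, 2] of a 3-row matrix -> crow_indices [0, 2, 2, 3]
print(coo_rows_to_crow_indices(torch.tensor([0, 0, 2]), 3))
```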
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61838
Reviewed By: ezyang
Differential Revision:
D30132736
Pulled By: cpuhrsch
fbshipit-source-id:
a1fd074c0d70366a524d219a620b94f8bed71d7c
Pritam Damania [Wed, 11 Aug 2021 18:22:48 +0000 (11:22 -0700)]
Add a _RemoteDevice structure for ShardedTensor/ShardingSpec. (#62927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62927
As part of the ShardedTensor work, we realized we do need some sort of
_RemoteDevice structure that deals with our format of "workername/device" so
that users don't have to worry about parsing this string directly.
Right now this structure is just the bare minimum and is mostly a container for
describing a remote device. It is currently only used in ShardedTensor,
ShardingSpec and RemoteModule.
Once we actually have a consolidated remote device proposal, this class can be
extended appropriately if needed.
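A minimal sketch (hypothetical names, not the actual class) of the "workername/device" parsing that _RemoteDevice encapsulates so users don't have to:
```
def parse_remote_device(remote_device: str):
    # "trainer0/cuda:0" -> ("trainer0", "cuda:0"); a bare "trainer0" defaults to cpu
    worker_name, _, device = remote_device.partition("/")
    return worker_name, device or "cpu"

print(parse_remote_device("trainer0/cuda:0"))
```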
ghstack-source-id:
135534086
Test Plan:
1) unit tests
2) waitforbuildbot
Reviewed By: SciPioneer
Differential Revision:
D30170689
fbshipit-source-id:
1ac2e81c7a597dc40bf3fbf2c1168c382c66649f
Jacob Szwejbka [Wed, 11 Aug 2021 18:14:25 +0000 (11:14 -0700)]
[Pytorch Edge] Move RuntimeCompatibilityInfo Factory Method (#63005)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63005
Realized I forgot to move the Runtime half of these functions to be within the struct.
Test Plan: ci
Reviewed By: pavithranrao
Differential Revision:
D30205521
fbshipit-source-id:
ccd87d7d78450dd0dd23ba493bbb9d87be4640a5
Stephen Macke [Wed, 11 Aug 2021 18:09:02 +0000 (11:09 -0700)]
[easy] add an `inplace` argument to MutableNetProto.to_net() and core.Net() constructor (#63068)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63068
The caffe2 core.Net constructor can accept a caffe2_pb2.NetDef proto, but it always creates a copy. This is wasteful when we can prove that the proto being passed to it will not be used anywhere else. So we add an "inplace" argument to the `core.Net` constructor that allows clients to give away ownership of the passed proto without copying. We default this argument to `False`, ensuring that behavior does not change unless explicitly requested.
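A hedged usage sketch based on the description above (caffe2's `core.Net`):
```
from caffe2.proto import caffe2_pb2
from caffe2.python import core

proto = caffe2_pb2.NetDef()
proto.name = "my_net"

copied = core.Net(proto)                # default: the proto is copied
owned = core.Net(proto, inplace=True)   # caller gives away ownership, no copy
```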
Test Plan: Let CI run.
Differential Revision:
D29976510
fbshipit-source-id:
26e13ca76f3431b8ef0de51f08bbf263491d323e
zhouzhuojie [Wed, 11 Aug 2021 16:42:15 +0000 (09:42 -0700)]
Fix gha render-test-result mixed failure passthrough (#63056)
Summary:
To fix something like https://github.com/pytorch/pytorch/actions/runs/1114555082
![image](https://user-images.githubusercontent.com/658840/128956528-86997457-5e18-4ae1-83cc-aa7d0ca03c0e.png)
Not sure why `needs.test.result` doesn't capture the `failure` case before, so changed it to `needs.test.result != 'skipped' || failure()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63056
Reviewed By: walterddr, tktrungna
Differential Revision:
D30240112
Pulled By: zhouzhuojie
fbshipit-source-id:
d159cc3f79ed5d604ae12583736b37ac28e8d87c
Yida Wang [Wed, 11 Aug 2021 16:36:49 +0000 (09:36 -0700)]
Fix issues with printing certain torch modules (#62447)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54420
When I tested on master with the testing code, there were multiple objects on the garbage collector that could not be printed.
Testing code:
```
import torch
import gc
import os
import sys
print(torch.__version__)
a = torch.rand(10)
print(a)
objects = gc.get_objects()
for i in range(len(objects)):
    print(objects[i])
```
### 1
```
print(torch.classes)
```
Like SplitInfinity has mentioned in the GitHub issue, the solution here is to set `__file__` for `torch.classes` to something. Similar to [_ops.py](https://github.com/pytorch/pytorch/blob/master/torch/_ops.py#L69), where `__file__` is set to `_ops.py`, we could set `__file__` for torch.classes to `_classes.py`.
### 2
```
print(torch._ops.ops.quantized)
print(torch._ops.ops.atan)
```
When we try to print these two modules, it will call `_OpNamespace::__getattr__`, but the `op_name` is `__file__`. This becomes a problem when `torch._C._jit_get_operation(qualified_op_name)` [(link)](https://github.com/pytorch/pytorch/blob/master/torch/_ops.py#L60) tries to look for an actual op on the native C++ side.
Only when we get the attribute for an actual op, e.g. `print(torch._ops.ops.quantized.elu)`, the `op_name` becomes proper (e.g. `elu`).
My current solution is to return a hardcoded string (i.e. “torch.ops”) if `op_name` is `"__file__"`.
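A minimal sketch (not the exact `_ops.py` code) of the guard described above:
```
class _OpNamespace:
    def __init__(self, name):
        self.name = name

    def __getattr__(self, op_name):
        # printing the namespace probes `__file__`; short-circuit it instead of
        # asking the C++ side for an operator literally named "__file__"
        if op_name == "__file__":
            return "torch.ops"
        # ... otherwise resolve a real op via torch._C._jit_get_operation(...)
        raise AttributeError(op_name)
```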
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62447
Reviewed By: saketh-are
Differential Revision:
D30234654
Pulled By: yidawang-oss
fbshipit-source-id:
de43a8f599739c749fb3307eea015cc61f1da60e
Peter Bell [Wed, 11 Aug 2021 15:44:08 +0000 (08:44 -0700)]
Shard python_functions.cpp (#62186)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62186
This file takes 6 minutes on its own to compile and is the limiting factor for
building `libtorch_python` on a 32-core threadripper. This splits the file into
5 shards which take around 50 seconds each to compile.
Test Plan: Imported from OSS
Reviewed By: bdhirsh
Differential Revision:
D29962046
Pulled By: albanD
fbshipit-source-id:
df13cfaebd54296f10609f67ae74a850c329bd37
Sze Wai Celeste Yuen [Wed, 11 Aug 2021 15:38:13 +0000 (08:38 -0700)]
Fix inconsisteny between Python and JIT power operation (#62842)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62842
Test Plan:
Wrote unit test TestAtenPow to test behavior of aten::pow when:
1. base is int, exponent is int
2. base is int, exponent is float
3. base is float, exponent is int
4. base is float, exponent is float
Specifically, we test that when the base is zero and the exponent is negative, we raise an error. In all other cases, we expect the behavior to be the same as the result returned by Python.
Because the C++ code relies on overloading, we need to make sure all combinations of types give us the expected result.
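A hedged sketch of the consistency being tested: the scripted result should match plain Python, and a zero base with a negative exponent should raise, just like Python's `ZeroDivisionError`:
```
import torch

@torch.jit.script
def pow_fn(base: int, exp: int):
    return base ** exp

print(pow_fn(2, -1))  # expected 0.5, matching Python's 2 ** -1
# pow_fn(0, -1) should raise, mirroring Python's ZeroDivisionError
```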
Reviewed By: zhxchen17
Differential Revision:
D30146115
Pulled By: szewaiyuen7
fbshipit-source-id:
dc661897ad38da286ee454120fbe41314b7f2995
Dmytro Dzhulgakov [Wed, 11 Aug 2021 08:08:45 +0000 (01:08 -0700)]
Fix CUDA_KERNEL_ASSERT ambiguous symbol in NDEBUG mode (#62527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62527
If NDEBUG is applied inconsistently in compilation we might get an 'ambiguous declaration' error. Let's make sure that the forward declaration matches glibc, including all specifiers.
Test Plan: sandcastle
Reviewed By: mdschatz
Differential Revision:
D30030051
fbshipit-source-id:
9f4d5f1d4e74f0a4eaeeaaaad76b93ee485d8bcd
Pritam Damania [Wed, 11 Aug 2021 05:37:14 +0000 (22:37 -0700)]
[4/N] Enable opt-asan for distributed unit tests. (#62051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62051
The goal here is to enable opt-asan for "spawn" based unit tests since
this works for "spawn" unlike "dev-asan". As a result, we can run ASAN for
"spawn" unit tests as well.
This means we can completely remove fork unit tests from the code base since
the only purpose for these tests was to run ASAN.
ghstack-source-id:
135523770
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision:
D29854514
fbshipit-source-id:
02a5bfcfae2afc21badecff77082c7a6ad83636b
Lu Fang [Wed, 11 Aug 2021 04:56:41 +0000 (21:56 -0700)]
Back out "[fx] store Tracer class on Graph and GraphModule for package deserialization" (#63053)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63053
Original commit changeset:
eca09424ad30
The original diff -
D30019214 (https://github.com/pytorch/pytorch/commit/6286d338785c48a3e7a9b969e2bc3bd4d502851d) breaks the publish flow in model saving.
Test Plan: ci
Differential Revision:
D30236517
fbshipit-source-id:
3e05db02fc1cbbc2ed262c83bf56d555277abb34
Rishi Puri [Wed, 11 Aug 2021 03:02:07 +0000 (20:02 -0700)]
rebase for autocast updates to include device_type and dtype flags (#61002)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55374
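A hedged usage sketch of the updated API shape (device_type plus an explicit dtype):
```
import torch

with torch.autocast(device_type="cuda", dtype=torch.float16):
    a = torch.randn(8, 8, device="cuda")
    b = torch.randn(8, 8, device="cuda")
    c = a @ b  # matmul runs in float16 under autocast
```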
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61002
Reviewed By: malfet, mruberry
Differential Revision:
D30016812
Pulled By: ngimel
fbshipit-source-id:
6e09a29f539d28e9aea5cd9489b1e633cc588033
Wei-Sheng Chin [Wed, 11 Aug 2021 02:46:46 +0000 (19:46 -0700)]
Fix missing element types and shapes when autograd.Function has multiple tensor outputs (#57966)
Summary:
When generating IR for autograd.Function, if the function has multiple outputs, a TupleUnpack may be inserted after the original function node, and Pytorch only assigns proper information (tensor element type and shape) to the TupleUnpack and forgets the original function node. In contrast, if autograd.Function only produces one output, the original function node may have tensor
element type and shape in its output schema.
Before this PR:
- (simplified) IR for autograd.Function with one output: input (tensor, dtype=float32, shape=[2, 3]) -> PythonOp -> output (tensor, dtype=float32, shape=[4, 5])
- (simplified) IR for autograd.Function with multiple outputs: input (tensor, dtype=float32, shape=[2, 3]) -> PythonOp -> output_0 **(tensor)**, output_1 **(tensor)** -> TupleUnpack -> output_2 (tensor, dtype=float32, shape=[4, 5]), output_3 (tensor, dtype=float32, shape=[6, 7])
After this PR:
- (simplified) IR for autograd.Function with one output: input (tensor, dtype=float32, shape=[2, 3]) -> PythonOp -> output (tensor, dtype=float32, shape=[4, 5])
- (simplified) IR for autograd.Function with multiple outputs: input (tensor, dtype=float32, shape=[2, 3]) -> PythonOp -> output_0 **(tensor, dtype=float32, shape=[4, 5])**, output_1 **(tensor, dtype=float32, shape=[6, 7])** -> TupleUnpack -> output_2 (tensor, dtype=float32, shape=[4, 5]), output_3 (tensor, dtype=float32, shape=[6, 7])
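A minimal sketch of the multi-output case this targets:
```
import torch

class TwoOutputs(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x + 1, x * 2   # two tensor outputs -> TupleUnpack in the traced IR

    @staticmethod
    def backward(ctx, g1, g2):
        return g1 + 2 * g2

x = torch.randn(2, 3, requires_grad=True)
a, b = TwoOutputs.apply(x)
```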
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57966
Reviewed By: zhxchen17
Differential Revision:
D30208207
Pulled By: gmagogsfm
fbshipit-source-id:
42a3d1f9c0932133112a85df0c49cf4ea0afa175
Natalia Gimelshein [Wed, 11 Aug 2021 01:39:45 +0000 (18:39 -0700)]
remove dead code (#63031)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63031
Reviewed By: mruberry
Differential Revision:
D30225094
Pulled By: ngimel
fbshipit-source-id:
3666a0fa120bea85225cd3ee04f89d64952d2862
Natalia Gimelshein [Wed, 11 Aug 2021 01:23:00 +0000 (18:23 -0700)]
Revert
D30199482: [pytorch][PR] Add BFloat16 support for unique and unique_consecutive on CPU
Test Plan: revert-hammer
Differential Revision:
D30199482 (https://github.com/pytorch/pytorch/commit/fc0b8e60337ae46b90ed5d2f6d1f623f0f8d6581)
Original commit changeset:
6f2d9cc1a528
fbshipit-source-id:
39e9f202bcbd978525f792173d4f97b5b329b5b1
Richard Barnes [Wed, 11 Aug 2021 00:57:22 +0000 (17:57 -0700)]
Use `const auto` with irange (#62990)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62990
Test Plan: Sandcastle
Reviewed By: zhouzhuojie
Differential Revision:
D30199748
fbshipit-source-id:
284b208ffa3c6c4749e5ac9b1fccb28914590f2c
Eddie Yan [Wed, 11 Aug 2021 00:44:40 +0000 (17:44 -0700)]
change nccl version reporting (#62916)
Summary:
https://github.com/pytorch/pytorch/issues/62295
Previously the packing and unpacking of the NCCL version "integer" was done to have parity with the upstream NCCL version encoding. However, there doesn't seem to be any place where this integer is directly compared with a version integer sourced from upstream NCCL, and syncing the encoding seems to be error-prone (e.g., a recent change where a special case was added for minor versions >= 10: https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/nccl.h.in#L22).
This patch changes the reporting to return a tuple of version numbers instead (to preserve ease-of-use for comparisons) and tweaks the passing between C/Python to avoid the digit overflow problem.
CC ngimel mcarilli
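After this change the version is reported as a tuple (sketch; the exact numbers depend on the local NCCL build):
```
import torch

print(torch.cuda.nccl.version())  # e.g. (2, 10, 3) instead of a packed integer
```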
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62916
Reviewed By: anjali411
Differential Revision:
D30201069
Pulled By: mrshenli
fbshipit-source-id:
2e4e7c69f001c3f22bd04aa6df6a992e538bea45
tktrungna [Tue, 10 Aug 2021 23:24:57 +0000 (16:24 -0700)]
Update test_torch_deploy (#62838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62838
Fixes #62380
* update test functions to call the wheel install folder {sitepackages}/torch instead of the build/ folder
* add symbolic links for shared libraries which are called by the tests (this is a bit hacky and should be fixed by setting the rpath before compiling -- similar to https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/test.sh#L204-L208).
### Test plan
check if all ci workflows pass
Test Plan: Imported from OSS
Reviewed By: walterddr
Differential Revision:
D30193141
Pulled By: tktrungna
fbshipit-source-id:
72c2bd3a740fca0f72e4803df505240193692c44
tktrungna [Tue, 10 Aug 2021 23:24:57 +0000 (16:24 -0700)]
update test_libtorch (#62797)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62797
Fixes #62380
* update test functions to call the wheel install folder {sitepackages}/torch instead of the build/ folder
* add symbolic links for shared libraries which are called by the tests (this is a bit hacky and should be fixed by setting the rpath before compiling -- similar to https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/test.sh#L204-L208).
### Test plan
check if all ci workflows pass
Test Plan: Imported from OSS
Reviewed By: walterddr
Differential Revision:
D30193140
Pulled By: tktrungna
fbshipit-source-id:
d8e54c403f42abbbbe4556abf40c22a7955df737
tktrungna [Tue, 10 Aug 2021 23:24:57 +0000 (16:24 -0700)]
update test distributed (#62796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62796
Fixes #62380
* update test functions to call the wheel install folder {sitepackages}/torch instead of the build/ folder
* add symbolic links for shared libraries which are called by the tests (this is a bit hacky and should be fixed by setting the rpath before compiling -- similar to https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/test.sh#L204-L208).
### Test plan
check if all ci workflows pass
Test Plan: Imported from OSS
Reviewed By: driazati
Differential Revision:
D30193142
Pulled By: tktrungna
fbshipit-source-id:
1247f9eda1c11c763c31c7383c77545b1ead1a60
tktrungna [Tue, 10 Aug 2021 23:24:57 +0000 (16:24 -0700)]
update test_vulkan (#62795)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62795
Test Plan: Imported from OSS
Reviewed By: driazati
Differential Revision:
D30124421
Pulled By: tktrungna
fbshipit-source-id:
235ba166b02f7334e89cb2493024067851bf5b9b
tktrungna [Tue, 10 Aug 2021 23:24:57 +0000 (16:24 -0700)]
update test_rpc (#62781)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62781
Test Plan: Imported from OSS
Reviewed By: walterddr, zhouzhuojie
Differential Revision:
D30124391
Pulled By: tktrungna
fbshipit-source-id:
99c275d6c9f23b4f274fd0ca19a16879ed27afd5
Matej Sladek [Tue, 10 Aug 2021 23:19:39 +0000 (16:19 -0700)]
[ONNX] add support for prim::Uninitialized in lower_tuples pass (#56912)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56911
The code from the issue generates this TorchScript:
```
graph(%self : __torch__.MyModule,
%t.1 : Tensor):
%12 : None = prim::Constant()
%7 : str = prim::Constant[value="Negative input"]() # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:11:28
%3 : int = prim::Constant[value=0]() # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:10:15
%9 : int = prim::Constant[value=5]() # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:13:31
%33 : (Tensor, Tensor) = prim::Uninitialized()
%4 : Tensor = aten::lt(%t.1, %3) # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:10:11
%6 : bool = aten::Bool(%4) # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:10:11
%34 : (Tensor, Tensor) = prim::If(%6) # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:10:8
block0():
= prim::RaiseException(%7) # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:11:12
-> (%33)
block1():
%11 : int[] = prim::ListConstruct(%9)
%16 : Tensor = aten::zeros(%11, %12, %12, %12, %12) # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:13:19
%18 : int[] = prim::ListConstruct(%9)
%23 : Tensor = aten::zeros(%18, %12, %12, %12, %12) # /mnt/nvdl/usr/msladek/notes/python_code/unitialized.py:13:35
%24 : (Tensor, Tensor) = prim::TupleConstruct(%16, %23)
-> (%24)
return (%34)
```
The problem is that the ONNX exporter's lower_tuples pass doesn't support forwarding of tuples through prim::Uninitialized.
Solution is:
1. add prim::Uninitialized to the supported ops in the lower_tuples pass
2. as prim::Uninitialized now has multiple outputs, we should call giveFreshAlias for every output
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56912
Reviewed By: nikithamalgifb
Differential Revision:
D29837200
Pulled By: SplitInfinity
fbshipit-source-id:
321fae6fe52b1523df5653dbb9ea73b998ef1cda
Howard Huang [Tue, 10 Aug 2021 22:56:18 +0000 (15:56 -0700)]
Remove process_group_agent and faulty_process_group_agent files (#62985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62985
Remove the process_group_agent and faulty_process_group_agent code now that PROCESS_GROUP backend has been deprecated for RPC (https://github.com/pytorch/pytorch/issues/55615). Discussed with xush6528 that it was okay to remove ProcessGroupAgentTest and ProcessGroupAgentBench which depended on process_group_agent.
Test Plan: CI tests
Reviewed By: pritamdamania87
Differential Revision:
D30195576
fbshipit-source-id:
8b4381cffadb868b19d481198015d0a67b205811
Natalia Gimelshein [Tue, 10 Aug 2021 22:44:09 +0000 (15:44 -0700)]
fix sort and topk with discontiguous out (#63029)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62645 and https://github.com/pytorch/pytorch/issues/62940. The root cause of those bugs is in the bad interaction between `collapseDims` and setting the size of sorting/topK dimension to 1. If all other dimensions happen to be 1, `collapseDims` thinks that that `1` dimension is collapsible (even though it was specifically marked to be preserved) and loses its stride information. If dimension was really of size 1, the stride information would be unimportant, but since in reality that dimension is not 1 and was set to 1 for convenience, the loss of stride information results in incorrect outputs.
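A hedged repro sketch of the failure mode (based on the linked issues): writing topk results into non-contiguous `out` tensors on CUDA previously produced wrong values, and should now match the contiguous path:
```
import torch

x = torch.randn(1, 5, device="cuda")
values = torch.empty(1, 10, device="cuda")[:, ::5]                     # non-contiguous out, shape (1, 2)
indices = torch.empty(1, 10, dtype=torch.long, device="cuda")[:, ::5]
torch.topk(x, k=2, dim=1, out=(values, indices))
print(torch.equal(values, x.topk(2, dim=1).values))  # expected True after the fix
```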
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63029
Reviewed By: heitorschueroff
Differential Revision:
D30224925
Pulled By: ngimel
fbshipit-source-id:
269dd375c5cd57c6007fe91f729f8c60a2e7a264
Hanton Yang [Tue, 10 Aug 2021 22:15:23 +0000 (15:15 -0700)]
[iOS] enable Metal in the nightly build (#62855)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62855
Test Plan: Test on Private Pod with the [HelloWorld](https://fburl.com/3hiwkkhm) demo
Reviewed By: xta0
Differential Revision:
D30174151
Pulled By: hanton
fbshipit-source-id:
22cd8663ac239811bf8ed1c3b6301460d798dbfa
Christian Puhrsch [Tue, 10 Aug 2021 22:14:00 +0000 (15:14 -0700)]
test_cudnn_convolution_relu skipCUDAIfRocm
Summary: skip rocm test for test_cudnn_convolution_relu
Test Plan: This skips a test
Reviewed By: ngimel
Differential Revision:
D30233620
fbshipit-source-id:
31eab8b03c3f15674e0d262a8f55965c1aa6b809
Victor Quach [Tue, 10 Aug 2021 21:58:16 +0000 (14:58 -0700)]
Add docstring for saved tensors default hooks (#62361)
Summary:
Add documentation for the saved tensors default hooks introduced in https://github.com/pytorch/pytorch/issues/61834 / https://github.com/pytorch/pytorch/issues/62563
Sister PR: https://github.com/pytorch/pytorch/issues/62362 (will add a link from autograd.rst to notes/autograd in whatever PR does not land first)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62361
Reviewed By: zou3519
Differential Revision:
D30081997
Pulled By: Varal7
fbshipit-source-id:
cb923e943e1d96db9669c1d863d693af30910c62
Tao Xu [Tue, 10 Aug 2021 21:32:11 +0000 (14:32 -0700)]
[iOS][CI] Store every version of nightlies in S3 (#63039)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63039
Test Plan: Imported from OSS
Reviewed By: hanton
Differential Revision:
D30229385
Pulled By: xta0
fbshipit-source-id:
15b438a6326159258803ab97e67dc9ec5db50d59
Jerry Zhang [Tue, 10 Aug 2021 20:57:14 +0000 (13:57 -0700)]
[quant][graphmode] Reference pattern support for elu (#62607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62607
Removing the quantize handler for elu since it can be covered by DefaultNodeQuantizeHandler
Test Plan:
python test/test_quantization.py TestQuantizeFxOps
Imported from OSS
Reviewed By: iramazanli
Differential Revision:
D30053977
fbshipit-source-id:
426789443e928bb01a88907de616cbda5866f621
kshitij12345 [Tue, 10 Aug 2021 20:55:37 +0000 (13:55 -0700)]
[fix] TestMultiThreadAutograd: propagate exception from child thread to main thread (#63018)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63018
Reviewed By: anjali411
Differential Revision:
D30225856
Pulled By: Varal7
fbshipit-source-id:
b5dd7999de5060e06f8958ea3ce49e0b74110971
Amy He [Tue, 10 Aug 2021 20:36:02 +0000 (13:36 -0700)]
[1/N] Nnapi backend execute and compile (#62272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62272
Added Android NNAPI delegate implementation of runtime initialization (compilation) and execution.
The delegate's preprocess step was [previously implemented](https://github.com/pytorch/pytorch/pull/62225). Now the rest of the delegate, which implements client-side execution, is added.
**nnapi_backend_lib.cpp**:
Implementation of delegate's compile and execute.
`execute()` is essentially a C++ implementation of [`NnapiModule`](https://github.com/pytorch/pytorch/blob/master/torch/backends/_nnapi/prepare.py), which wraps an NNAPI Compilation and handles preparation of weights, inputs, and outputs.
- Any steps that can be done before execution are moved to `compile()`.
- `init()` cannot be moved to `compile()` because it requires real inputs for dynamic shaping.
- `shape_compute_module` cannot currently be deserialized in `compile()`, since mobile::Module has no IValue conversion.
- Processed arguments that are modified by `init()` must be kept as member variables. Any other processed arguments are passed through a dictionary, `handles`.
**nnapi_bind.cpp & nnapi_bind.h**:
Created a header file for `nnapi_bind.cpp`, so that its NnapiCompilation class can be used by `nnapi_backend_lib.cpp`.
**test_backend_nnapi.py**:
Enabled execution testing.
ghstack-source-id:
135432844
Test Plan:
Imported from OSS
Tested on devserver.
1. Load and unpack a special devserver build of NNAPI: `jf download GICWmAAzUR0eo20TAPasVts8ObhobsIXAAAz --file "nnapi-host-linux.tar.xz"`
2. `export LIBNEURALNETWORKS_PATH=/path/to/libneuralnetworks.so`
3. Run unittests: `python test/test_jit.py TestNnapiBackend` and `python test/test_nnapi.py`
TODO: test with lite interpreter runtime
Reviewed By: raziel, iseeyuan
Differential Revision:
D29944873
fbshipit-source-id:
48967d873e79ef2cce9bcba2aeea3c52f7a18c07
CaoE [Tue, 10 Aug 2021 20:21:22 +0000 (13:21 -0700)]
Add BFloat16 support for unique and unique_consecutive on CPU (#62559)
Summary:
Add BFloat16 support for unique and unique_consecutive on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62559
Reviewed By: anjali411
Differential Revision:
D30199482
Pulled By: ngimel
fbshipit-source-id:
6f2d9cc1a528bea7c723139a4f1b14e4b2213601
Jerry Zhang [Tue, 10 Aug 2021 19:16:00 +0000 (12:16 -0700)]
[quant][refactor] Checking activation_dtype instead of activation_post_process (#62489)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62489
Addressing comment from previous PR: https://github.com/pytorch/pytorch/pull/62374#discussion_r679354145
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
Imported from OSS
Reviewed By: iramazanli
Differential Revision:
D30053980
fbshipit-source-id:
79c216410282eccd6f0a8f24e38c55c4d18ec0d0
Raghav Kansal [Tue, 10 Aug 2021 17:59:43 +0000 (10:59 -0700)]
LU solve uses cuBLAS and cuSOLVER for matrices with dim > 1024 (#61815)
Summary:
This PR builds off of https://github.com/pytorch/pytorch/issues/59148 and modifies the `lu_solve` routine to avoid MAGMA for `b` or `lu_data` matrices with any dimension > 1024, since MAGMA has a bug when dealing with such matrices (https://bitbucket.org/icl/magma/issues/19/dgesv_batched-dgetrs_batched-fails-for).
Fixes https://github.com/pytorch/pytorch/issues/36921
Fixes https://github.com/pytorch/pytorch/issues/61929
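A small usage sketch (tiny sizes for illustration; the new path kicks in when `b` or `lu_data` has any dimension > 1024 on CUDA):
```
import torch

A = torch.randn(3, 3, device="cuda")
B = torch.randn(3, 2, device="cuda")
LU, pivots = torch.lu(A)
X = torch.lu_solve(B, LU, pivots)
print(torch.allclose(A @ X, B, atol=1e-4))
```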
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61815
Reviewed By: anjali411
Differential Revision:
D30199618
Pulled By: ngimel
fbshipit-source-id:
06870793f697e9c35aaaa8254b8a8b1a38bd3aa9
Wanchao Liang [Tue, 10 Aug 2021 17:56:41 +0000 (10:56 -0700)]
[sharded_tensor] add default fields to ShardedTensorMetadata (#62867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62867
This adds default fields to ShardedTensorMetadata, to allow easy construction and modification afterwards.
ghstack-source-id:
135284133
Test Plan: ShardedTensorMetadata validity should be guarded with `init_from_local_shards` API and its tests.
Reviewed By: pritamdamania87
Differential Revision:
D30148481
fbshipit-source-id:
0d99f41f23dbeb4201a36109556ba23b9a6c6fb1
Rohan Varma [Tue, 10 Aug 2021 17:46:50 +0000 (10:46 -0700)]
[DDP] Dont set thread local state in reducer autograd hook. (#62996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62996
No need to set this because autograd engine already propagates TLS
states.
ghstack-source-id:
135438220
Test Plan: CI
Reviewed By: albanD
Differential Revision:
D30202078
fbshipit-source-id:
e5e917269a03afd7a6b8e61f28b45cdb71ac3e64
Pyre Bot Jr [Tue, 10 Aug 2021 17:22:43 +0000 (10:22 -0700)]
[typing] suppress errors in `fbcode/caffe2` - batch 2
Test Plan: Sandcastle
Differential Revision:
D30222378
fbshipit-source-id:
6a0a5d210266f19de63273240a080365c9143eb0
Elias Ellison [Tue, 10 Aug 2021 16:40:41 +0000 (09:40 -0700)]
Test shape analysis with opinfos (#59814)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59814
Using OpInfos to test shape analysis. By default, we just check that we don't give incorrect answers, and then if `assert_jit_shape_analysis` is true, test that we correctly propagate the full shape. And it found a couple of bugs 😃
Test Plan: Imported from OSS
Reviewed By: Krovatkin
Differential Revision:
D30200058
Pulled By: eellison
fbshipit-source-id:
6226be87f5390277cfa5a1fffaa1b072d4bc8803
Elias Ellison [Tue, 10 Aug 2021 16:40:41 +0000 (09:40 -0700)]
add support for a few more opinfos in jit (#59812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59812
This is sort of a half measure: we can successfully trace through OpInfos which are registered as lambdas, we just can't script them. This tests if the op is a lambda, in which case it bails... see the next PR to get resize_ to work; maybe this should be consolidated with that...
Test Plan: Imported from OSS
Reviewed By: pbelevich, zhxchen17
Differential Revision:
D30200061
Pulled By: eellison
fbshipit-source-id:
7e3c9b0be746b16f0f57ece49f6fbe20bf6535ec
Elias Ellison [Tue, 10 Aug 2021 16:40:41 +0000 (09:40 -0700)]
Don't substitute in symbolic shapes to shape compute graph (#59811)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59811
We don't want to actually substitute in symbolic shapes, because it invalidates the partially evaluated graph for further use.
Test Plan: Imported from OSS
Reviewed By: pbelevich, zhxchen17
Differential Revision:
D30200059
Pulled By: eellison
fbshipit-source-id:
267ed97d8421fe480dec494cdf0dec9cf9ed3ba2
Elias Ellison [Tue, 10 Aug 2021 16:40:41 +0000 (09:40 -0700)]
small cleanups (#59810)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59810
Rephrasings and cleanup of dead code
Test Plan: Imported from OSS
Reviewed By: pbelevich, zhxchen17
Differential Revision:
D30200062
Pulled By: eellison
fbshipit-source-id:
b03e5adb928aa46bee6685667cad43333b6e6016
Elias Ellison [Tue, 10 Aug 2021 16:40:41 +0000 (09:40 -0700)]
Only optimize after change (redo) (#59809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59809
Somehow this didn't get landed previously in a ghstack mixup.
Test Plan: Imported from OSS
Reviewed By: pbelevich, zhxchen17
Differential Revision:
D30200060
Pulled By: eellison
fbshipit-source-id:
47f256421a1fe1a005cd11fcc4d7f023b5990834
Michael Suo [Tue, 10 Aug 2021 16:21:24 +0000 (09:21 -0700)]
[jit] warn if _check_overload_body fails to find source
Summary:
Under certain conditions (particularly if a module is frozen, like with
PyInstaller or torch::deploy), we will not have source code available for
functions. `import torch` should still work in this case, but this check is
currently causing it to raise an exception.
Since this is an initial check (if an overload is actually exercised there will
be a hard failure), raise a warning and move on.
Test Plan: unit tests
Reviewed By: eellison
Differential Revision:
D30214271
fbshipit-source-id:
eb021503e416268e8585e0708d6271c1e7b91e95
Supriya Rao [Tue, 10 Aug 2021 15:40:53 +0000 (08:40 -0700)]
[quant] Update get_default_qat_qconfig to return the fused observer+fake_quant module (#62702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62702
Expose the qconfig to the user to speed up training by leveraging the fused module.
The module currently supports per-tensor/per-channel moving avg observer and fake-quantize.
For details on perf benefits, refer to https://github.com/pytorch/pytorch/pull/61691
Test Plan: Imported from OSS
Reviewed By: raghuramank100
Differential Revision:
D30093719
fbshipit-source-id:
b78deb7810f5b597474b9b9a0395d361d04eb46a
Supriya Rao [Tue, 10 Aug 2021 15:40:53 +0000 (08:40 -0700)]
[quant] add reduce_range option to FusedMovingAvgFakeQuantize module (#62863)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62863
To make this consistent with other observers, add reduce_range option that can be used to update quant_min/max
Test Plan:
python test/test_quantization.py test_fused_mod_reduce_range
Imported from OSS
Reviewed By: raghuramank100
Differential Revision:
D30146602
fbshipit-source-id:
a2015f095766f9c884611e9ab6942528bc9bc972
Peter Bell [Tue, 10 Aug 2021 14:57:04 +0000 (07:57 -0700)]
Codegen: Fix operator::name on windows (#62278)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62278
In `Operators.h` we're using `str(BaseOperatorName)`, while in
`OperatorsEverything.cpp` we're using `str(OperatorName)`. e.g.
```
STATIC_CONSTEXPR_STR_INL_EXCEPT_WIN_CUDA(name, "aten::abs")
```
vs
```
STATIC_CONST_STR_OUT_OF_LINE_FOR_WIN_CUDA(abs_out, name, "aten::abs.out")
```
Test Plan: Imported from OSS
Reviewed By: bdhirsh
Differential Revision:
D29962047
Pulled By: albanD
fbshipit-source-id:
5a05b898fc734a4751c2b0187e4eeea4efb0502b
Edward Yang [Tue, 10 Aug 2021 14:13:24 +0000 (07:13 -0700)]
Reject kwonly arguments passed positionally in torch.ops (#62981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62981
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: Chillee
Differential Revision:
D30211030
Pulled By: ezyang
fbshipit-source-id:
aae426592e92bf3a50076f470e153a4ae7d6f101
Sameer Deshmukh [Tue, 10 Aug 2021 13:53:43 +0000 (06:53 -0700)]
Allow LocalResponseNorm to accept 0 dim batch sizes (#62801)
Summary:
This issue fixes a part of https://github.com/pytorch/pytorch/issues/12013, which is summarized concretely in https://github.com/pytorch/pytorch/issues/38115.
This PR allows `LocalResponseNorm` to accept tensors with 0 dimensional batch size.
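A short sketch of what is now accepted:
```
import torch

lrn = torch.nn.LocalResponseNorm(size=2)
x = torch.empty(0, 4, 8, 8)   # zero-sized batch dimension
print(lrn(x).shape)           # torch.Size([0, 4, 8, 8])
```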
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62801
Reviewed By: zou3519
Differential Revision:
D30165282
Pulled By: jbschlosser
fbshipit-source-id:
cce0b2d12dbf47dc8ed6247c267bf2f2305f858a
Luca Wehrstedt [Tue, 10 Aug 2021 12:44:50 +0000 (05:44 -0700)]
Update TensorPipe submodule
Test Plan: CI ran as part of https://github.com/pytorch/pytorch/pull/60938.
Reviewed By: beauby
Differential Revision:
D30219343
fbshipit-source-id:
531338f912fee488d312d23da8bda63ceb862aa9
Rohan Varma [Tue, 10 Aug 2021 05:27:49 +0000 (22:27 -0700)]
[Reland][DDP] Support not all outputs used in loss calculation (#61753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61753
Reland of https://github.com/pytorch/pytorch/pull/57081.
Main difference is that the former diff moved `prepare_for_backward` check into `DDPSink` backward, but that resulted in issues due to potential autograd engine races. The original diff moved `prepare_for_backward` into `DDPSink` as part of a long-term plan to always call it within `DDPSink`.
In particular this doesn't work because `prepare_for_backward` sets `expect_autograd_hooks=true` which enables autograd hooks to fire, but there were several use cases internally where autograd hooks were called before DDPSink called `prepare_for_backward`, resulting in errors/regression.
We instead keep the call to `prepare_for_backward` in the forward pass, but still run outputs through `DDPSink` when find_unused_parameters=True. As a result, outputs that are not used when computing loss have `None` gradients and we don't touch them if they are globally `None`. Note that the hooks still fire with a undefined gradient which is how we avoid the Reducer erroring out with the message that some hooks did not fire.
Added the unittests that were part of the reverted diff.
ghstack-source-id:
135388925
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision:
D29726179
fbshipit-source-id:
54c8819e0aa72c61554104723a5b9c936501e719
Ilqar Ramazanli [Tue, 10 Aug 2021 00:53:11 +0000 (17:53 -0700)]
To fix variance computation for complex Adam (#62946)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59998
It has been discussed in the issue that the variance term of the Adam optimizer is currently not computed correctly for the complex domain. As stated in the Generalization to Complex numbers section in https://en.wikipedia.org/wiki/Variance, the variance of a complex random variable X is computed as E[(X - mu)(X - mu)*] (where mu = E[X] and * stands for the conjugate).
However, the current Adam implementation computes E[(X - mu)(X - mu)], which does not return the right variance value; in particular, it returns a complex number. Variance is defined to be a real number even when the underlying random variable is complex.
We fix this issue here, and test that the resulting variance is indeed a real number.
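A small sketch of the corrected second-moment term for a complex gradient (illustrative, not the exact optimizer code): multiplying by the conjugate yields a real-valued variance estimate.
```
import torch

grad = torch.randn(4, dtype=torch.cfloat)
wrong = grad * grad           # E[(X - mu)(X - mu)]: stays complex
right = grad * grad.conj()    # E[(X - mu)(X - mu)*]: imaginary part is zero
print(wrong.dtype, right.imag.abs().max())
```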
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62946
Reviewed By: albanD
Differential Revision:
D30196038
Pulled By: iramazanli
fbshipit-source-id:
ab0a6f31658aeb56bdcb211ff86eaa29f3f0d718
Jerry Zhang [Mon, 9 Aug 2021 23:46:45 +0000 (16:46 -0700)]
[quant][graphmode][fx] Attach a weight qparam dict to linear and conv in reference quantized model (#62488)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62488
Instead of attaching weight observer/fake_quant to the float linear and conv, we can
compute the quantization parameters and attach that as a dictionary to these modules so
that we can reduce the model size and make the reference module clearer
TODO: the numerics for linear and conv in the reference quantized model are still not correct since we did not quantize the weight; we may explore things like parameterization to implement this support.
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
Imported from OSS
Reviewed By: vkuzo
Differential Revision:
D30053979
fbshipit-source-id:
b5f8497cf6cf65eec924df2d8fb10a9e154b8cab
zhouzhuojie [Mon, 9 Aug 2021 23:42:38 +0000 (16:42 -0700)]
Simplify the logic of running ci workflow codegen (#62853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62853
wanted to simplify the logic in `__post_init__`, and delegate the settings back to individual workflows; this gives us more flexibility in changing individual workflows, as well as reducing the complexity of understanding the mutation conditions.
Test Plan: Imported from OSS
Reviewed By: walterddr, seemethere
Differential Revision:
D30149190
Pulled By: zhouzhuojie
fbshipit-source-id:
44df5b1e14184f3a81cb8004151525d0e0fb20d9
Richard Barnes [Mon, 9 Aug 2021 23:39:32 +0000 (16:39 -0700)]
irange-ify 12b (#62484)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62484
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision:
D30015528
fbshipit-source-id:
c4e1a5425a73f100102a97dcec1579f1049c9c1d
Peter Bell [Mon, 9 Aug 2021 23:15:54 +0000 (16:15 -0700)]
Shard Operators.cpp (#62185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62185
This file can take 5 minutes on its own to compile, and is the single limiting
factor for compile time of `libtorch_cpu` on a 32-core threadripper. Instead,
sharding into 5 files that take around 1 minute each cuts a full minute off the
overall build time.
This also factors out the `.findSchemaOrThrow(...).typed` step so the code can
be shared between `call` and `redispatch`.
Test Plan: Imported from OSS
Reviewed By: bdhirsh
Differential Revision:
D29962049
Pulled By: albanD
fbshipit-source-id:
be5df05fbea09ada0d825855f1618c25a11abbd8
Richard Barnes [Mon, 9 Aug 2021 23:14:35 +0000 (16:14 -0700)]
irange-ify 13d (#62477)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62477
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision:
D30001499
fbshipit-source-id:
993eb2b39f332ff0ae6c663792bd04734cfc262b
peterjc123 [Mon, 9 Aug 2021 22:54:17 +0000 (15:54 -0700)]
Enable rebuilds for Ninja on Windows (#62948)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59859.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62948
Reviewed By: seemethere, tktrungna
Differential Revision:
D30192246
Pulled By: janeyx99
fbshipit-source-id:
af25cc4bf0db67a1304d9971cfa0ff6831bb3b48
Marjan Fariborz [Mon, 9 Aug 2021 22:45:43 +0000 (15:45 -0700)]
BFP16 quantization/dequantization (#62974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62974
Testing the functionality of the `tensor.to` approach.
Comparing the `tensor.to` and `torch.ops.fb.FloatToBfloat16Quantized` approaches and testing whether they match for 2D tensors.
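A hedged sketch of the `tensor.to` round trip being exercised (the `torch.ops.fb.*` op is internal and only referenced above by name):
```
import torch

x = torch.randn(4, 8)
q = x.to(torch.bfloat16)      # quantize to bfloat16
dq = q.to(torch.float32)      # dequantize back
print((x - dq).abs().max())   # small error from the reduced mantissa
```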
Test Plan: buck test //torchrec/fb/distributed/tests:test_quantized_comms
Reviewed By: wanchaol
Differential Revision:
D30079121
fbshipit-source-id:
612e92baeb2245449637faa9bc31686353d67033
Xiang Gao [Mon, 9 Aug 2021 22:28:35 +0000 (15:28 -0700)]
Migrate Embedding thrust sort to cub sort (#62495)
Summary:
This PR only migrates sort. Other thrust operations will be migrated in followup PRs
Benchmark `num_embeddings` pulled from https://github.com/huggingface/transformers/tree/master/examples by
```
grep -P 'vocab_size.*(=|:)\s*[0-9]+' -r transformers/examples/
grep -P 'hidden_size.*(=|:)\s*[0-9]+' -r transformers/examples/
```
to get `vocab_size = 119547, 50265, 32000, 8000, 3052` (similar size omitted) and `hidden_size = 512, 768`
Code:
```python
import torch
import itertools
num_embeddings = (119547, 50265, 32000, 8000, 3052)
num_tokens = (4096, 16384)
hidden_sizes = (512, 768)
for ne, nt, nh in itertools.product(num_embeddings, num_tokens, hidden_sizes):
    print(f"Embedding size: {ne}, Tokens: {nt}, Hidden size: {nh}")
    embedding = torch.nn.Embedding(ne, nh).cuda()
    input_ = torch.randint(ne, (nt,), device='cuda')
    out = embedding(input_)
    torch.cuda.synchronize()
    %timeit out.backward(out, retain_graph=True); torch.cuda.synchronize()
```
## On CUDA 11.3.1
Before:
```
Embedding size: 119547, Tokens: 4096, Hidden size: 512
1.43 ms ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 4096, Hidden size: 768
2.07 ms ± 56.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 512
1.61 ms ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 768
2.32 ms ± 8.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 512
738 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 768
1.02 ms ± 1.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 512
913 µs ± 3.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 768
1.27 ms ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 512
559 µs ± 860 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 768
743 µs ± 630 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 512
713 µs ± 969 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 768
977 µs ± 884 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 512
301 µs ± 8.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 768
383 µs ± 4.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 512
409 µs ± 1.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 768
515 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 512
215 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 768
250 µs ± 320 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 512
271 µs ± 888 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 768
325 µs ± 1.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
After:
```
Embedding size: 119547, Tokens: 4096, Hidden size: 512
1.42 ms ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 4096, Hidden size: 768
2.05 ms ± 9.93 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 512
1.6 ms ± 3.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 768
2.3 ms ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 512
730 µs ± 811 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 768
1.01 ms ± 2.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 512
887 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 768
1.25 ms ± 2.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 512
556 µs ± 1.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 768
744 µs ± 4.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 512
691 µs ± 570 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 768
957 µs ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 512
309 µs ± 2.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 768
376 µs ± 2.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 512
381 µs ± 1.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 768
487 µs ± 2.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 512
202 µs ± 383 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 768
239 µs ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 512
243 µs ± 1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 768
340 µs ± 2.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
## On CUDA 11.1
Before:
```
Embedding size: 119547, Tokens: 4096, Hidden size: 512
1.41 ms ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 4096, Hidden size: 768
2.05 ms ± 7.61 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 512
1.61 ms ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 768
2.32 ms ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 512
743 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 768
1.02 ms ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 512
912 µs ± 5.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 768
1.28 ms ± 6.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 512
555 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 768
743 µs ± 655 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 512
714 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 768
980 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 512
312 µs ± 396 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 768
386 µs ± 2.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 512
413 µs ± 3.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 768
512 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 512
209 µs ± 585 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 768
271 µs ± 776 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 512
297 µs ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 768
377 µs ± 3.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
After:
```
Embedding size: 119547, Tokens: 4096, Hidden size: 512
1.46 ms ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 4096, Hidden size: 768
2.09 ms ± 4.31 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 512
1.64 ms ± 4.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 119547, Tokens: 16384, Hidden size: 768
2.35 ms ± 2.54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 512
782 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 4096, Hidden size: 768
1.06 ms ± 596 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 512
945 µs ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 50265, Tokens: 16384, Hidden size: 768
1.31 ms ± 553 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 512
603 µs ± 856 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 4096, Hidden size: 768
789 µs ± 500 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 512
752 µs ± 7.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 32000, Tokens: 16384, Hidden size: 768
1.01 ms ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 512
323 µs ± 7.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 4096, Hidden size: 768
398 µs ± 765 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 512
412 µs ± 544 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 8000, Tokens: 16384, Hidden size: 768
519 µs ± 614 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 512
229 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 4096, Hidden size: 768
263 µs ± 417 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 512
274 µs ± 576 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Embedding size: 3052, Tokens: 16384, Hidden size: 768
354 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62495
Reviewed By: gchanan
Differential Revision:
D30176833
Pulled By: ngimel
fbshipit-source-id:
44148ebb53a0abfc1e5ab8b986865555bf326ad1
= [Mon, 9 Aug 2021 22:28:00 +0000 (15:28 -0700)]
Use output memory format based on input for cudnn_convolution_relu (#62482)
Summary:
Currently, when cudnn_convolution_relu is passed a channels-last Tensor, it returns a contiguous Tensor. This PR changes that behavior and bases the output memory format on the input's memory format.
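A hedged Python-level illustration of the memory-format convention involved (using an ordinary Conv2d + ReLU rather than calling the internal `cudnn_convolution_relu` entry point directly, whose binding is not shown here):
```python
import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).cuda().to(memory_format=torch.channels_last)
x = torch.randn(2, 3, 32, 32, device="cuda").to(memory_format=torch.channels_last)

out = torch.relu(conv(x))
# The eager conv+relu already preserves channels_last; with this PR the fused
# cudnn_convolution_relu op follows the same convention instead of always
# returning an NCHW-contiguous tensor.
print(out.is_contiguous(memory_format=torch.channels_last))  # expected: True
```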
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62482
Reviewed By: ngimel
Differential Revision:
D30049905
Pulled By: cpuhrsch
fbshipit-source-id:
98521d14ee03466e7128a1912b9f754ffe10b448
Richard Barnes [Mon, 9 Aug 2021 22:27:14 +0000 (15:27 -0700)]
irange-ify 12 (#62120)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62120
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision:
D29879713
fbshipit-source-id:
3084a5eacb722f7fb0a630d47bf694f4d6831136
Richard Barnes [Mon, 9 Aug 2021 22:26:54 +0000 (15:26 -0700)]
irange-ify 1 (#62193)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62193
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision:
D29879504
fbshipit-source-id:
adc86adcd1e7dcdfa2d7adf4d576f081430d52ec
zhouzhuojie [Mon, 9 Aug 2021 22:25:59 +0000 (15:25 -0700)]
Fix render_test_results if condition on always() (#62997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62997
Fixes #62979: changed the condition to check that the previous job's result is either 'success' or 'failure'. Note that a 'skipped' result will also skip this job, which is what we want.
Test Plan: Imported from OSS
Reviewed By: driazati, seemethere
Differential Revision:
D30202598
Pulled By: zhouzhuojie
fbshipit-source-id:
f3c0f715c39a5c8119b528b66e45f594a54b49d1
Rohan Varma [Mon, 9 Aug 2021 21:39:19 +0000 (14:39 -0700)]
[reland] Gate DistributedOptimizers on RPC availability (#62937)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62937
Reland: the previous attempt hit a Windows + CUDA failure; fixed by running the test on Gloo on Windows even when CUDA is available.
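The availability gate is roughly of the following shape (a sketch only; the exact helper used in the PR is not reproduced here, and the check below is an assumption):
```python
import torch.distributed as dist

def rpc_is_available() -> bool:
    # Assumption: treat RPC as usable when the distributed package is built
    # into this PyTorch install and the rpc submodule imports cleanly.
    if not dist.is_available():
        return False
    try:
        from torch.distributed import rpc  # noqa: F401
        return True
    except ImportError:
        return False

# Optimizers that require RPC can then be exposed conditionally:
if rpc_is_available():
    pass  # e.g. register the RPC-backed DistributedOptimizer variants
```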
ghstack-source-id:
135306176
Test Plan: ci
Reviewed By: mrshenli
Differential Revision:
D30177734
fbshipit-source-id:
7625746984c8f858648c1b3632394b98bd4518d2
Richard Barnes [Mon, 9 Aug 2021 20:13:14 +0000 (13:13 -0700)]
irange-ify 8d (#62505)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62505
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision:
D29971891
fbshipit-source-id:
7dcbe27221788695f320c7238f5fe81e32823802
Bradley Davis [Mon, 9 Aug 2021 20:06:05 +0000 (13:06 -0700)]
[fx] store Tracer class on Graph and GraphModule for package deserialization (#62497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62497
Previously named: add support for custom tracer in __reduce_package__
Stores the Tracer class on a Graph created by a Tracer, and copies that Tracer class into the GraphModule's state, so that when a GraphModule is packaged by torch.package it can be reconstructed with the same Tracer and GraphModule class name.
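For context, a custom tracer subclasses `torch.fx.Tracer`. The sketch below (hypothetical `MyTracer`/`MySubmodule`, not code from this diff) shows the kind of Tracer/GraphModule pair whose class information this change preserves across a torch.package round-trip:
```python
import torch
import torch.fx as fx

class MySubmodule(torch.nn.Module):
    def forward(self, x):
        return torch.sigmoid(x) * x

class MyTracer(fx.Tracer):
    # Example customization: keep MySubmodule as a single call_module node
    # instead of tracing into its forward().
    def is_leaf_module(self, m, module_qualified_name):
        return isinstance(m, MySubmodule) or super().is_leaf_module(m, module_qualified_name)

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.sub = MySubmodule()

    def forward(self, x):
        return self.sub(x) + 1

model = Model()
graph = MyTracer().trace(model)
gm = fx.GraphModule(model, graph)
# With this change, the Graph/GraphModule remember which Tracer class produced
# them, so a packaged GraphModule can be re-created with the same Tracer.
```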
Reviewed By: suo
Differential Revision:
D30019214
fbshipit-source-id:
eca09424ad30feb93524d481268b066ea55b892a
Nikita Shulga [Mon, 9 Aug 2021 19:57:05 +0000 (12:57 -0700)]
Mark unused functions with `C10_UNUSED` (#62929)
Summary:
This fixes a number of warnings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62929
Reviewed By: walterddr, albanD
Differential Revision:
D30171953
Pulled By: malfet
fbshipit-source-id:
f82475289ff4aebb0c97794114e94a24d00d2ff4
peterjc123 [Mon, 9 Aug 2021 19:50:52 +0000 (12:50 -0700)]
Stop exporting symbols in anonymous namespaces (#62952)
Summary:
These cases were found by compiling with clang on Windows. Without this change, such functions are still exported, which wastes space in the symbol table.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62952
Reviewed By: gchanan
Differential Revision:
D30191291
Pulled By: ezyang
fbshipit-source-id:
3319b0ec4f5fb02e0fe1b81dbbcedcf12a0c795e
Mike Iovine [Mon, 9 Aug 2021 19:07:55 +0000 (12:07 -0700)]
[Static Runtime] Add tests for all aten ops (#62347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62347
This diff includes tests for all `aten` ops that did not already have test coverage.
Test Plan: `buck test //caffe2/benchmarks/static_runtime/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision:
D29968280
fbshipit-source-id:
768655ca535f9e37422711673168dce193de45d2
Zeina Migeed [Mon, 9 Aug 2021 18:45:34 +0000 (11:45 -0700)]
handle get_attr operations in typechecker (#62682)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62682
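For context, `get_attr` nodes are the FX nodes produced when a traced module reads a parameter or buffer; a minimal sketch of a graph containing one (the typechecker API itself is not shown here):
```python
import torch
import torch.fx as fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(3, 3))

    def forward(self, x):
        return x + self.weight  # reading self.weight yields a get_attr node

gm = fx.symbolic_trace(M())
print([node.op for node in gm.graph.nodes])
# ['placeholder', 'get_attr', 'call_function', 'output']
```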
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision:
D30107789
Pulled By: migeed-z
fbshipit-source-id:
0b21b2893e2dc7cfaf5b5f5990f662e051a981b4
Bert Maher [Mon, 9 Aug 2021 18:22:24 +0000 (11:22 -0700)]
Linker version script to hide LLVM symbols (#62906)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62906
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision:
D30193893
Pulled By: bertmaher
fbshipit-source-id:
9b189bfd8d4c52e8dc4296a4bed517ff44994ba0
Andrew Gu [Mon, 9 Aug 2021 18:15:35 +0000 (11:15 -0700)]
Add ``allow_empty_param_list`` to functional optimizers (#62522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62522
Addresses https://github.com/pytorch/pytorch/issues/62481
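A hedged usage sketch; the import path and exact keyword name below are assumptions based on the commit title, not code taken from this diff:
```python
# Assumption: the functional SGD optimizer lives at this private module path
# and accepts the new flag named in the title.
from torch.distributed.optim.functional_sgd import _FunctionalSGD

# Previously, constructing a functional optimizer with an empty parameter list
# was rejected; with the new flag it is allowed, so wrappers can create the
# optimizer first and attach parameters later.
opt = _FunctionalSGD([], lr=0.1, allow_empty_param_list=True)
```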
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision:
D30072074
Pulled By: andwgu
fbshipit-source-id:
1a5da21f9636b8d74a6b00c0f029427f0edff0e3
Sangbaek Park [Mon, 9 Aug 2021 17:48:39 +0000 (10:48 -0700)]
[Vulkan] Added Hardshrink op (#62870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62870
Added Hardshrink operator for Vulkan
Added tests for Hardshrink op
Reference: [Hardshrink](https://pytorch.org/docs/stable/generated/torch.nn.Hardshrink.html#torch.nn.Hardshrink)
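For reference, the operator's semantics (shown on CPU; running on the Vulkan backend assumes a build with Vulkan support):
```python
import torch

m = torch.nn.Hardshrink(lambd=0.5)
x = torch.tensor([-1.0, -0.3, 0.0, 0.4, 2.0])
print(m(x))  # tensor([-1., 0., 0., 0., 2.]) -- values with |x| <= lambd are zeroed
# On a Vulkan-enabled build, the same op can now run on the Vulkan backend,
# e.g. m(x.to("vulkan")).cpu()
```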
Test Plan: Imported from OSS
Reviewed By: SS-JIA
Differential Revision:
D30174950
Pulled By: beback4u
fbshipit-source-id:
3e192390eb9f92abecae966e84bbfae356bfd7c8
Zeina Migeed [Mon, 9 Aug 2021 17:45:47 +0000 (10:45 -0700)]
Change output node handling for typechecker to deal with tuples (#62582)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62582
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision:
D30050004
Pulled By: migeed-z
fbshipit-source-id:
9b81b10d24e1e8165cdc18c820ea314349b463cb