James Reed [Wed, 18 Aug 2021 20:16:01 +0000 (13:16 -0700)]
[FX] Fix GraphModule deepcopy to use deepcopied graph (#63090)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63090
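A minimal sketch (not from the PR) of the behavior being fixed, assuming a trivially traceable module; after the fix, the deepcopied `GraphModule` owns a deepcopied `Graph` rather than sharing the original:
```
import copy
import torch
from torch import fx

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

gm = fx.symbolic_trace(M())
gm_copy = copy.deepcopy(gm)

# With the fix, the copy's graph is itself a deep copy, not a shared reference.
assert gm_copy.graph is not gm.graph
```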
Test Plan: Imported from OSS
Reviewed By: ansley
Differential Revision: D30252471
Pulled By: jamesr66a
fbshipit-source-id: cafd7d7917935a5ea6ffa2a7fe9e9b2a9578b3e3
Basil Hosmer [Wed, 18 Aug 2021 19:06:53 +0000 (12:06 -0700)]
MaybeOwned page for dev wiki (#63450)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63450
Brief guide to understanding `MaybeOwned<Tensor>`, aimed at C++ PT devs who are obliged to interact with existing uses of it, rather than encouraging new usage.
For reviewers: I haven't yet added a link to this page from anywhere. I'm thinking the right place is the [dev wiki main page C++ section](https://github.com/pytorch/pytorch/wiki#c) but happy to put it wherever makes sense, suggestions welcome.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30402313
Pulled By: bhosmer
fbshipit-source-id: 69b15909ecafcd8d88e44f664f88c3ad4eb26d84
peterjc123 [Wed, 18 Aug 2021 18:41:42 +0000 (11:41 -0700)]
Disable RDYNAMIC check with MSVC (#62949)
Summary:
When testing with clang-cl, the flag is added even though it is unsupported, which generates a few warnings. Tried a few alternatives like https://cmake.org/cmake/help/latest/module/CheckLinkerFlag.html, but they just don't work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62949
Reviewed By: zhouzhuojie, driazati
Differential Revision: D30359206
Pulled By: malfet
fbshipit-source-id: 1bd27ad5772fe6757fa8c3a4bddf904f88d70b7b
Michael Dagitses [Wed, 18 Aug 2021 18:39:12 +0000 (11:39 -0700)]
document why wrappers exist in `torch.functional` (#62847)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62844.
These wrappers are not super obvious, but ultimately stem from the lack of support for functions with variadic args in native_functions.yaml. https://github.com/pytorch/pytorch/issues/62845 tracks that issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62847
Reviewed By: VitalyFedyunin
Differential Revision: D30305016
Pulled By: dagitses
fbshipit-source-id: 716fcecb0417b770bc92cfd8c54f7ead89070896
Rohan Varma [Wed, 18 Aug 2021 18:38:11 +0000 (11:38 -0700)]
[DDP] Add a debug check in cpp fp16 compress (#63379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63379
This codepath has been prone to bugs, as seen in the diff below. This check guards against changes/refactors that touch it, as a basic sanity check. It is enabled only in debug builds so as not to affect perf.
ghstack-source-id: 136056093
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D30358440
fbshipit-source-id: e1b3893a223722c2593ceed8696a09c7d07d47c1
Rohan Varma [Wed, 18 Aug 2021 18:38:11 +0000 (11:38 -0700)]
[DDP][Grad compression] Fix fp16 cpp hook (#63375)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63375
I think `tensor.copy_(tensor.to(torch::kFloat16));` will keep it as float32.
Tested by adding the following line:
```
LOG(INFO) << "Type is: " << compressed_tensor.scalar_type();
```
before:
```
I0816 17:03:09.823688 364141 default_comm_hooks.cpp:21] Type is: Float
```
after:
```
I0816 17:01:16.779052 353924 default_comm_hooks.cpp:21] Type is: Half
```
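For reference, a Python analog of the C++ behavior above (a sketch, not the PR's code): `copy_` casts the source to the destination's dtype, so copying a half tensor into a float32 buffer leaves the buffer float32.
```
import torch

buf = torch.zeros(4, dtype=torch.float32)
buf.copy_(torch.randn(4).to(torch.float16))  # values are cast back to float32
print(buf.dtype)  # torch.float32 -- the destination dtype wins, as in the C++ bug
```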
ghstack-source-id: 136056092
Test Plan: ci
Reviewed By: SciPioneer
Differential Revision: D30356256
fbshipit-source-id: 8208a705acd7628541cd43c8bf61d007dfdd2435
Stas Bekman [Wed, 18 Aug 2021 18:37:07 +0000 (11:37 -0700)]
[doc] pre-commit fix instructions (#61717)
Summary:
Fix an invalid instruction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61717
Reviewed By: zhouzhuojie, driazati
Differential Revision: D30359218
Pulled By: malfet
fbshipit-source-id: 61771babeac4d34425a61ce49f38a7099b521eec
Heitor Schueroff [Wed, 18 Aug 2021 18:30:44 +0000 (11:30 -0700)]
Make SkipInfo with expected_failure an XFAIL (#63481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63481
This PR changes the SkipInfo decorators to use unittest.expectedFailure so that the test reports as XFAIL as opposed to PASSED.
Note that changing the expectedFailure here https://github.com/pytorch/pytorch/blob/30e1c74dc19ae2b622b46ebcdb7972c42775ac80/torch/testing/_internal/common_device_type.py#L879 to an XFAIL is not possible because the decision of whether to decorate is delayed until the wrapper function is called.
fixes https://github.com/pytorch/pytorch/issues/63363
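A minimal sketch of the XFAIL mechanism this PR switches to (plain unittest, not the PR's decorator code):
```
import unittest

class TestExample(unittest.TestCase):
    @unittest.expectedFailure
    def test_known_bug(self):
        self.assertEqual(1, 2)  # reported as an expected failure (XFAIL), not PASSED

if __name__ == "__main__":
    unittest.main()
```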
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D30397154
Pulled By: heitorschueroff
fbshipit-source-id: c5e4911969ad8667763eec4203dbbc6a51178592
soulitzer [Wed, 18 Aug 2021 18:29:51 +0000 (11:29 -0700)]
Improve custom function docs (#60312)
Summary:
- Adds some code examples for `ctx` methods and makes the requirements of arguments more clear (see the sketch after this list)
- Type annotations for `save_for_backward`, `mark_dirty`, `mark_non_differentiable`, and `set_materialize_grads` (BC-breaking?)
- Refactor `torch.autograd.Function` doc
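A minimal sketch of the `ctx` usage the docs cover (a standard custom Function, not taken from the PR):
```
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # tensors must be saved via save_for_backward
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output

x = torch.randn(3, requires_grad=True)
Square.apply(x).sum().backward()
print(x.grad)
```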
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60312
Reviewed By: VitalyFedyunin
Differential Revision: D30314961
Pulled By: soulitzer
fbshipit-source-id: a284314b65662e26390417bd2b6b12cd85e68dc8
Pritam Damania [Wed, 18 Aug 2021 17:46:09 +0000 (10:46 -0700)]
[6/N] Enable opt-asan for elastic and launcher tests. (#63442)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63442
Continuation of https://github.com/pytorch/pytorch/pull/62051, I've
enabled elastic and launcher tests to run in opt-asan mode which is supported
with spawn multiprocessing.
This allows us to completely get rid of fork based tests from torch.distributed
and have all tests run in spawn mode.
ghstack-source-id: 136057123
Test Plan: waitforbuildbot
Reviewed By: cbalioglu
Differential Revision: D30384267
fbshipit-source-id: ad3447cfb9d6e31e7ec8332d64c8ff1054858dcb
Shirong Wu [Wed, 18 Aug 2021 17:39:53 +0000 (10:39 -0700)]
Add validation check in fx2trt interpreter (#63424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63424
Add validation check in fx2trt for missing converter operators. If any op is missing, interpreter init will report the missing operators.
Test Plan:
for call_function and call_method:
manual test with feeds benchmark and verify init fails with the expected message.
{F642390780}
for call_module:
Specify a module as a leaf node and make acc_tracer trace it as a node; then in fx2trt.py, in the CONVERTER initialization stage, make it skip recording all modules. Initialize the interpreter and call the validator function; verify the output includes the missing module name. The return value prints as in the screenshot below.
{F643458718}
Reviewed By: 842974287
Differential Revision: D30294832
fbshipit-source-id: 243dca3fdfc6a174ded65248938e2a234aec19c6
John Shen [Wed, 18 Aug 2021 17:35:55 +0000 (10:35 -0700)]
[pytorch] Make qconv forward() thread safe (#63432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63432
There's a race condition in quantized models when multiple threads call forward() due to qnnpack packing the weights the first time the operator is called. This locks the entire apply_impl function.
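An illustrative Python sketch of the locking pattern (the actual fix is in C++ inside apply_impl; `pack` and `run` below are hypothetical stand-ins for qnnpack weight prepacking and the conv kernel):
```
import threading

def pack(weight):    # hypothetical stand-in for qnnpack weight prepacking
    return weight

def run(packed, x):  # hypothetical stand-in for the packed conv kernel
    return packed + x

class LazilyPackedOp:
    def __init__(self, weight):
        self.weight = weight
        self._packed = None
        self._lock = threading.Lock()

    def forward(self, x):
        # Serialize the whole apply path so the one-time packing cannot race.
        with self._lock:
            if self._packed is None:
                self._packed = pack(self.weight)
            return run(self._packed, x)
```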
Test Plan:
https://github.com/pytorch/pytorch/issues/58055
Ran the script before and after, original crashes went away
Reviewed By: kimishpatel
Differential Revision: D30229520
fbshipit-source-id: d06cabe24199a80325cd57f24a7fd60624be2cf7
Masaki Kozuki [Wed, 18 Aug 2021 16:42:14 +0000 (09:42 -0700)]
Use `fastAtomicAdd` in EmbeddingBag (mode "max") backward (#63298)
Summary:
Rel: https://github.com/pytorch/pytorch/issues/62695
### This PR
| n_tokens | num_embeddings | embedding_dim | mode | bwd_fp32 | bwd_fp16 |
|-----------:|-----------------:|----------------:|:-------|------------:|------------:|
| 4096 | 4096 | 4096 | max | 0.000326228 | 0.000181448 |
| 4096 | 4096 | 16384 | max | 0.00102805 | 0.000618136 |
| 4096 | 16384 | 4096 | max | 0.000907326 | 0.000530422 |
| 4096 | 16384 | 16384 | max | 0.00334988 | 0.00264645 |
| 16384 | 4096 | 4096 | max | 0.000366449 | 0.000320232 |
| 16384 | 4096 | 16384 | max | 0.00126421 | 0.00104183 |
| 16384 | 16384 | 4096 | max | 0.00087738 | 0.00065068 |
| 16384 | 16384 | 16384 | max | 0.00379229 | 0.00298201 |
### Original
| n_tokens | num_embeddings | embedding_dim | mode | bwd_fp32 | bwd_fp16 |
|-----------:|-----------------:|----------------:|:-------|------------:|------------:|
| 4096 | 4096 | 4096 | max | 0.00032407 | 0.000188231 |
| 4096 | 4096 | 16384 | max | 0.00104356 | 0.000624001 |
| 4096 | 16384 | 4096 | max | 0.000902069 | 0.000527382 |
| 4096 | 16384 | 16384 | max | 0.00302202 | 0.00255153 |
| 16384 | 4096 | 4096 | max | 0.000384343 | 0.000403249 |
| 16384 | 4096 | 16384 | max | 0.00126445 | 0.00135069 |
| 16384 | 16384 | 4096 | max | 0.000880814 | 0.000825679 |
| 16384 | 16384 | 16384 | max | 0.00337611 | 0.00319515 |
cc xwang233 ptrblck ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63298
Reviewed By: mruberry
Differential Revision: D30383583
Pulled By: ngimel
fbshipit-source-id: 14dd9d67002c53a153721812709033c198f68c1e
Rishi Puri [Wed, 18 Aug 2021 16:41:37 +0000 (09:41 -0700)]
Reverting launch bounds change in topK that induced a regression in perf (#63431)
Summary:
[topkwsyncs.zip](https://github.com/pytorch/pytorch/files/7003077/topkwsyncs.zip)
Running this script on nvidia containers 21.08 vs 21.07 we see the following perf drops:
topk(input=(dtype=torch.float16,shape=[60, 201600]), k=2000, dim=1, sorted=True) - 0.63
topk(input=(dtype=torch.float32,shape=[120000]), k=12000, dim=0, sorted=False) - 0.55
topk(input=(dtype=torch.float16,shape=[5, 201600]), k=2000, dim=1, sorted=True) - 0.55
topk(input=(dtype=torch.float32,shape=[1, 10000]), k=1000, dim=1, sorted=False) - 0.33
The relative perf drop is reported as (21.08_time - 21.07_time) / 21.07_time
I narrowed down the source of the regression to this commit: https://github.com/pytorch/pytorch/pull/60314
which reduced launch bounds from 1024 to 512.
The perf did not seem to regress in the original evidence used to justify changing 1024 to 512, because the input shapes in that benchmark were much smaller than the input shapes of the tensors in which I am seeing the regression. I suggest reverting to 1024: with 512 there was no considerable perf improvement for small inputs and a major perf regression for large tensors.
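A timing sketch along the lines of the attached script (assuming a CUDA device; not the original benchmark):
```
import time
import torch

x = torch.randn(60, 201600, dtype=torch.float16, device="cuda")
torch.cuda.synchronize()
start = time.time()
for _ in range(10):
    torch.topk(x, k=2000, dim=1, sorted=True)
torch.cuda.synchronize()
print((time.time() - start) / 10)  # compare across launch-bound settings
```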
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63431
Reviewed By: mruberry
Differential Revision: D30384087
Pulled By: ngimel
fbshipit-source-id: 11eecbba82a069b1d4579d674c3f644ab8060ad2
Erjia Guan [Wed, 18 Aug 2021 15:47:27 +0000 (08:47 -0700)]
Make DataChunk support list in-place ops (#63422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63422
Fixes #63095
Make `DataChunk` delegate to list methods. Then it will support in-place operations (see the sketch after the list below):
- `sort`
- `reverse`
- `append`
- `extend`
- `random.shuffle`
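A usage sketch of the in-place operations (the import path below is an assumption for illustration):
```
import random
from torch.utils.data import DataChunk  # import path assumed for this sketch

chunk = DataChunk([3, 1, 2])
chunk.sort()          # in-place, as on a plain list
chunk.append(4)
chunk.reverse()
random.shuffle(chunk)
```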
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D30379027
Pulled By: ejguan
fbshipit-source-id: d176bd0cc8b89b915c7bb184ff243ab1f605616d
cyy [Wed, 18 Aug 2021 15:04:08 +0000 (08:04 -0700)]
A tiny fix in MT19937RNGEngine (#63219)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63219
Reviewed By: VitalyFedyunin
Differential Revision: D30341484
Pulled By: ezyang
fbshipit-source-id: 0ff4499d0f4a3dfeb991c0f10fe3248c6ca1c992
Edward Yang [Wed, 18 Aug 2021 14:45:45 +0000 (07:45 -0700)]
Implement subclass priority for __torch_dispatch__ (#63411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63411
In order to get this behavior, you have to use append_overloaded,
which I forgot to use in the previous implementation. I exposed
an internal helper function which is more appropriate for dispatch
to Python where we know that an argument is definitely a Tensor (and
this test no longer needs to be done).
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D30374489
Pulled By: ezyang
fbshipit-source-id: 43b08c00d1958c9b26d82a025d19f0b67bb85590
Jerry Zhang [Wed, 18 Aug 2021 14:36:47 +0000 (07:36 -0700)]
[fx2trt] Add dequantize support (#63448)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63448
Only available after TensorRT 8.0
Test Plan: buck run mode/opt caffe2/torch/fb/fx2trt:test_dequantize
Reviewed By: 842974287
Differential Revision: D30296863
fbshipit-source-id: 44b9630ef0d210e7f20e650dc81c519f7e41f5f3
Philip Meier [Wed, 18 Aug 2021 14:36:22 +0000 (07:36 -0700)]
add `OpInfo` for `torch.linalg.tensorinv` (#62326)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53739.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62326
Reviewed By: H-Huang
Differential Revision: D30136376
Pulled By: zou3519
fbshipit-source-id: 04ec9450e8866667649af401c7559b96ddc91491
JackCaoG [Wed, 18 Aug 2021 13:42:51 +0000 (06:42 -0700)]
Update cuda amp to also check xla device (#63413)
Summary:
Fixes https://github.com/pytorch/xla/issues/3086. PyTorch/XLA:GPU also uses cuda amp. I verified the pt/xla `test_autocast` with this fix and all tests passed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63413
Reviewed By: ngimel
Differential Revision: D30380785
Pulled By: bdhirsh
fbshipit-source-id: fd1a1de7d224c616fc3fa90b80a688a21f6b1ecc
CodemodService FBSourceClangFormatLinterBot [Wed, 18 Aug 2021 11:18:47 +0000 (04:18 -0700)]
[AutoAccept][Codemod][FBSourceClangFormatLinter] Daily `arc lint --take CLANGFORMAT`
Reviewed By: zertosh
Differential Revision: D30391472
fbshipit-source-id: d4eb1e7debea8905e7fee5f026c082bee65e78f3
Michael Dagitses [Wed, 18 Aug 2021 11:04:43 +0000 (04:04 -0700)]
enhance comparison tests for c10::optional (#62887)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62887
Reviewed By: VitalyFedyunin
Differential Revision: D30305044
Pulled By: dagitses
fbshipit-source-id: d0a3a9e4ea186915ef087543aaf81a606f943380
Michael Dagitses [Wed, 18 Aug 2021 10:59:51 +0000 (03:59 -0700)]
clarify the documentation of `torch.meshgrid` (#62977)
Summary:
Also warn about the behavior differences from `numpy.meshgrid`.
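A short sketch of the key behavior difference being documented: `torch.meshgrid` uses matrix ("ij") indexing, while `numpy.meshgrid` defaults to Cartesian ("xy") indexing.
```
import numpy as np
import torch

gx, gy = torch.meshgrid(torch.arange(3), torch.arange(2))
nx, ny = np.meshgrid(np.arange(3), np.arange(2))

print(gx.shape)  # torch.Size([3, 2]) -- "ij" indexing
print(nx.shape)  # (2, 3) -- numpy's default "xy" indexing transposes the grid
```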
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62977
Reviewed By: mruberry, ngimel
Differential Revision: D30220930
Pulled By: dagitses
fbshipit-source-id: ae6587b41792721cae2135376c58121b4634e296
Pritam Damania [Wed, 18 Aug 2021 08:58:05 +0000 (01:58 -0700)]
[5/N] Run opt-asan with detect_leaks=0 (#63361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63361
Python multiprocessing doesn't support LSAN and causes false positives instead. As a result, we disable LSAN for these tests so that we can still run with opt-asan.
ghstack-source-id: 135962489
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D30352269
fbshipit-source-id: f6ab5abce7bdef00cd5e1f5977424d2b151174af
Wanchao Liang [Wed, 18 Aug 2021 06:10:48 +0000 (23:10 -0700)]
[sharded_tensor] fix typing issue for placement (#63426)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63426
Placement should be either a string or a _remote_device; this fixes the type to match the behavior.
ghstack-source-id: 136041125
Reviewed By: pritamdamania87
Differential Revision: D30379702
fbshipit-source-id: 34e226494240923b433e3a39cc08c84d42cdad6b
Pavithran Ramachandran [Wed, 18 Aug 2021 05:26:22 +0000 (22:26 -0700)]
[easy][PyTorchEdge] print error message when failing to load model file (#63404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63404
# Context
Loading a model file using `fopen` might error out for multiple reasons. Repro'ing the error on devices takes time and effort. Logging the errno will help in debugging and fixing the error quickly.
# Mitigation
Print out the error message from `fopen` to help users debug the issue.
Test Plan:
```
(base) [pavithran@devvm1803.vll0 /data/users/pavithran/fbsource] buck run xplat/caffe2/fb/lite_predictor:lite_predictor -- --model=/home/pavithran/models/prod/GAaNhAoTIV6cIvgJAHn30m8NR1QgbmQwAAAA.ptl --use_bundled_input=0
Building: finished in 0.5 sec (100%) 354/354 jobs, 0/354 updated
Total time: 0.6 sec
Run with 24 threads
Run with 24 threads
Loading model...
terminate called after throwing an instance of 'c10::Error'
what(): open file failed because of errno 2 on fopen: No such file or directory, file path: /home/pavithran/models/prod/GAaNhAoTIV6cIvgJAHn30m8NR1QgbmQwAAAA.ptl
Exception raised from RAIIFile at xplat/caffe2/caffe2/serialize/file_adapter.cc:15 (most recent call first):
(no backtrace available)
```
Reviewed By: dhruvbird
Differential Revision: D30372308
fbshipit-source-id: 5346e828f53f6bc5d871b403586566a3332a389a
Jerry Zhang [Wed, 18 Aug 2021 04:35:55 +0000 (21:35 -0700)]
[fx2trt] Add quantize_per_tensor support (#63447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63447
Only available in TRT 8.0 and above
Test Plan: buck run mode/opt caffe2/torch/fb/fx2trt:test_quantize_per_tensor
Reviewed By: 842974287
Differential Revision: D30322844
fbshipit-source-id: dfd925e3432de128f2925b1aa55d6125e63359af
Shen Li [Wed, 18 Aug 2021 03:12:51 +0000 (20:12 -0700)]
Fix RPC Python User Function Error Handling (#63406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63406
The `RemoteException` will be thrown on the caller side when converting
the response message to IValue. Since it is a Python error, the error
message needs to be extracted explicitly and the `PyErr` cleared.
Test Plan: Imported from OSS
Reviewed By: rohan-varma, ngimel
Differential Revision: D30372741
Pulled By: mrshenli
fbshipit-source-id: 1f72a7ee0c39cc2ef070f99884c142f7b3e0543d
Aliaksandr Ivanou [Wed, 18 Aug 2021 02:54:30 +0000 (19:54 -0700)]
[torch] Set default log level for torch elastic (#63214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63214
The default log level in fb and oss is different: in oss we use WARNING and in fb we use INFO.
Test Plan: unittests, f291441502
Reviewed By: cbalioglu
Differential Revision: D30296298
fbshipit-source-id: 89067352be767255fbc66e790ec333582de64c6c
Rohan Varma [Wed, 18 Aug 2021 00:12:32 +0000 (17:12 -0700)]
[BE] remove _SUPPORTED_OPTIM_MAP from tests (#63383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63383
Per title
ghstack-source-id: 135966157
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D30358921
fbshipit-source-id: 965e054e525194b1ee55980340df275bab355c9b
Rohan Varma [Wed, 18 Aug 2021 00:12:32 +0000 (17:12 -0700)]
[DDP] Support step_param for AdamW (#63382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63382
Per title
ghstack-source-id: 135966156
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D30255446
fbshipit-source-id: e6ffbf339db0bc5b4702d02b74a462309df07c75
Jerry Zhang [Tue, 17 Aug 2021 23:54:09 +0000 (16:54 -0700)]
[quant][graphmode][fx][fix] Fix quantization for tuple arguments (#63376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63376
Previously, when a tuple was an argument to a quantizable op, it would be transformed to a list by mistake; this PR fixes that.
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_preserve_tuple
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D30357642
fbshipit-source-id: 82d10805d9c00c003cc99983dca68b6455ff7b2e
zhouzhuojie [Tue, 17 Aug 2021 23:53:08 +0000 (16:53 -0700)]
Add more ciflow labels for more workflows (#63410)
Summary:
- Add more ciflow labels and enable it for more workflows.
- Only the 'ciflow/default' workflows are run by default at pull_request time
- Other labels can be manually triggered by adding the labels and unassigning pytorchbot, OR by waiting for pytorchbot's comment opt-in rollout
- The label design is a logical `OR`, i.e. adding ('ciflow/cuda' + 'ciflow/win') will trigger the union of them. (design feedback is needed here)
Typical default workflows for normal PRs.
<details>
<summary>Generated label rules</summary>
![image](https://user-images.githubusercontent.com/658840/129779905-eb5e56dd-a696-4040-9eb6-71ecb6487dc1.png)
```
{
"label_rules": {
"ciflow/all": [
"libtorch-linux-xenial-cuda10.2-py3.6-gcc7",
"libtorch-linux-xenial-cuda11.1-py3.6-gcc7",
"linux-bionic-cuda10.2-py3.9-gcc7",
"linux-bionic-py3.8-gcc9-coverage",
"linux-xenial-cuda10.2-py3.6-gcc7",
"linux-xenial-cuda11.1-py3.6-gcc7",
"linux-xenial-py3.6-gcc5.4",
"linux-xenial-py3.6-gcc7-bazel-test",
"periodic-libtorch-linux-xenial-cuda11.3-py3.6-gcc7",
"periodic-linux-xenial-cuda11.3-py3.6-gcc7",
"periodic-win-vs2019-cuda11.3-py3",
"win-vs2019-cpu-py3",
"win-vs2019-cuda10.1-py3",
"win-vs2019-cuda11.1-py3"
],
"ciflow/bazel": [
"linux-xenial-py3.6-gcc7-bazel-test"
],
"ciflow/coverage": [
"linux-bionic-py3.8-gcc9-coverage"
],
"ciflow/cpu": [
"linux-bionic-py3.8-gcc9-coverage",
"linux-xenial-py3.6-gcc5.4",
"linux-xenial-py3.6-gcc7-bazel-test",
"win-vs2019-cpu-py3"
],
"ciflow/cuda": [
"libtorch-linux-xenial-cuda10.2-py3.6-gcc7",
"libtorch-linux-xenial-cuda11.1-py3.6-gcc7",
"linux-bionic-cuda10.2-py3.9-gcc7",
"linux-xenial-cuda10.2-py3.6-gcc7",
"linux-xenial-cuda11.1-py3.6-gcc7",
"periodic-libtorch-linux-xenial-cuda11.3-py3.6-gcc7",
"periodic-linux-xenial-cuda11.3-py3.6-gcc7",
"periodic-win-vs2019-cuda11.3-py3",
"win-vs2019-cuda10.1-py3",
"win-vs2019-cuda11.1-py3"
],
"ciflow/default": [
"linux-bionic-py3.8-gcc9-coverage",
"linux-xenial-cuda11.1-py3.6-gcc7",
"linux-xenial-py3.6-gcc5.4",
"linux-xenial-py3.6-gcc7-bazel-test",
"win-vs2019-cpu-py3",
"win-vs2019-cuda10.1-py3"
],
"ciflow/libtorch": [
"libtorch-linux-xenial-cuda10.2-py3.6-gcc7",
"libtorch-linux-xenial-cuda11.1-py3.6-gcc7",
"periodic-libtorch-linux-xenial-cuda11.3-py3.6-gcc7"
],
"ciflow/linux": [
"libtorch-linux-xenial-cuda10.2-py3.6-gcc7",
"libtorch-linux-xenial-cuda11.1-py3.6-gcc7",
"linux-bionic-cuda10.2-py3.9-gcc7",
"linux-bionic-py3.8-gcc9-coverage",
"linux-xenial-cuda10.2-py3.6-gcc7",
"linux-xenial-cuda11.1-py3.6-gcc7",
"linux-xenial-py3.6-gcc5.4",
"linux-xenial-py3.6-gcc7-bazel-test",
"periodic-libtorch-linux-xenial-cuda11.3-py3.6-gcc7",
"periodic-linux-xenial-cuda11.3-py3.6-gcc7"
],
"ciflow/scheduled": [
"periodic-libtorch-linux-xenial-cuda11.3-py3.6-gcc7",
"periodic-linux-xenial-cuda11.3-py3.6-gcc7",
"periodic-win-vs2019-cuda11.3-py3"
],
"ciflow/slow": [
"linux-bionic-cuda10.2-py3.9-gcc7",
"linux-xenial-cuda10.2-py3.6-gcc7"
],
"ciflow/win": [
"periodic-win-vs2019-cuda11.3-py3",
"win-vs2019-cpu-py3",
"win-vs2019-cuda10.1-py3",
"win-vs2019-cuda11.1-py3"
]
},
"version": "v1"
}
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63410
Reviewed By: ngimel
Differential Revision: D30378553
Pulled By: zhouzhuojie
fbshipit-source-id: 4e0953740793e5e72b95018f8ab2ce4a6a364c38
Masaki Kozuki [Tue, 17 Aug 2021 23:51:34 +0000 (16:51 -0700)]
`F.avg_pool3d` CUDA backward: gpuAtomicAddNoReturn -> fastAtomicAdd (#63387)
Summary:
Rel: https://github.com/pytorch/pytorch/issues/62695
In the following two tables, I set `kernel_size` to 3 and `stride` to 2.
In benchmark, input tensors have the shape of (N, C, n_features, n_features, n_features).
Tested on RTX3080 w/ CUDA11.4 Update 1.
## This PR
| N | C | n_features | dtype | time |
|----:|----:|-------------:|:--------------|------------:|
| 32 | 3 | 8 | torch.float16 | 7.46846e-05 |
| 32 | 3 | 8 | torch.float32 | 8.18968e-05 |
| 32 | 3 | 32 | torch.float16 | 0.000156748 |
| 32 | 3 | 32 | torch.float32 | 0.000165236 |
| 32 | 3 | 128 | torch.float16 | 0.00549854 |
| 32 | 3 | 128 | torch.float32 | 0.008926 |
## master (6acd87f)
| N | C | n_features | dtype | time |
|----:|----:|-------------:|:--------------|------------:|
| 32 | 3 | 8 | torch.float16 | 7.60436e-05 |
| 32 | 3 | 8 | torch.float32 | 7.55072e-05 |
| 32 | 3 | 32 | torch.float16 | 0.000189292 |
| 32 | 3 | 32 | torch.float32 | 0.000168645 |
| 32 | 3 | 128 | torch.float16 | 0.00699538 |
| 32 | 3 | 128 | torch.float32 | 0.00890226 |
master's time divided by PR's time is as follows:
| N | C | n_features | master / PR |
|---:|---:|---------------:|----------------:|
| 32 | 3 | 8 | 1.018 |
| 32 | 3 | 32 | 1.208 |
| 32 | 3 | 128 | 1.272 |
cc: xwang233 ptrblck ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63387
Reviewed By: mruberry
Differential Revision: D30381434
Pulled By: ngimel
fbshipit-source-id: 3b97aee4b0d457a0277a0d31ac56d4151134c099
Nikita Shulga [Tue, 17 Aug 2021 22:28:45 +0000 (15:28 -0700)]
Add pocketfft as submodule (#62841)
Summary:
Using https://github.com/mreineck/pocketfft
Also delete explicit installation of pocketfft during the build as it will be available via submodule
Limit PocketFFT support to cmake-3.10 or newer, as `set_source_files_properties` does not seem to work as expected with cmake-3.5
Partially addresses https://github.com/pytorch/pytorch/issues/62821
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62841
Reviewed By: seemethere
Differential Revision: D30140441
Pulled By: malfet
fbshipit-source-id: d1a1cf1b43375321f5ec5b3d0b538f58082f7825
Rohan Varma [Tue, 17 Aug 2021 22:01:21 +0000 (15:01 -0700)]
[wip] Move smallest bucket to end after rebuild buckets (#62279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62279
Before rebuild buckets, `kDefaultFirstBucketBytes` is actually misleading because we reverse the parameter indices when initializing the reducer, so it is actually the size of the last bucket.
Currently rebuild buckets sets this to be the first bucket size, but seeing if keeping it as last can help perf.
This is currently experimental only and don't plan to land it unless experiments show a clear win.
ghstack-source-id: 135966897
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29927931
fbshipit-source-id: 55b949986fa2c3bade6fcb4bf5b513461bf0f490
Kevin Tse [Tue, 17 Aug 2021 21:46:22 +0000 (14:46 -0700)]
adding a note to the documentation of polar (#63259)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63259
Fix #52919
Test Plan: Imported from OSS
Reviewed By: anjali411
Differential Revision: D30342536
Pulled By: NivekT
fbshipit-source-id: 4c61a86f96a6370cc64652bf652c4ae25c9f4601
Jerry Zhang [Tue, 17 Aug 2021 21:40:19 +0000 (14:40 -0700)]
[quant][graphmode][fx][bc-breaking] Support for reference pattern for fixqparam ops in eval mode (#62608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62608
Insert an extra fixed-qparam fake quant at the output of fixed qparam ops in fbgemm (e.g. sigmoid) so that we can produce reference patterns for these ops.
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
Imported from OSS
Reviewed By: iramazanli
Differential Revision: D30053978
fbshipit-source-id: c527944b6e791bb4d45ebe96265af52794203695
Dhruv Matani [Tue, 17 Aug 2021 21:39:04 +0000 (14:39 -0700)]
Revert D30281388: [PyTorch] Avoid using std::regex for device string parsing in Device.cpp
Test Plan: revert-hammer
Differential Revision: D30281388 (https://github.com/pytorch/pytorch/commit/4d6f98ecada2d85b2474b023838debad4305316d)
Original commit changeset: 4d998e9f313e
fbshipit-source-id: 11134b3400cc3e851155c9c1b6fb59308ff1567b
Richard Zou [Tue, 17 Aug 2021 20:39:52 +0000 (13:39 -0700)]
Fix zero-dim handling in torch.matmul (#63359)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63359
Fixes #63352. The problem was that in e.g. `torch.matmul(A, B)` with A,
B having shapes [3, 2, 0] and [0, 2], the code attempts to call
`A.view(-1, 0)` which fails due to "-1 being ambiguous". The solution is
to manually compute what we want the shape of the view to be.
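A sketch of the case from the issue (after the fix this returns an empty-batch result instead of raising):
```
import torch

A = torch.randn(3, 2, 0)
B = torch.randn(0, 2)
out = torch.matmul(A, B)  # previously failed in A.view(-1, 0): "-1 is ambiguous"
print(out.shape)          # torch.Size([3, 2, 2])
```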
Test Plan: - new tests
Reviewed By: ngimel
Differential Revision: D30351583
Pulled By: zou3519
fbshipit-source-id: 7625691fe8b85d96a4073409596a932c303e3e8c
Mikhail Zolotukhin [Tue, 17 Aug 2021 20:39:36 +0000 (13:39 -0700)]
[TensorExpr] Add a wrapper for all expr and stmt pointers. (#63195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195
This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.
The changes are mechanical and should not affect any functionality.
With this PR, we're changing the following:
* `Add*` --> `AddPtr`
* `new Add(...)` --> `alloc<Add>(...)`
* `dynamic_cast<Add*>` --> `to<Add>`
* `static_cast<Add*>` --> `static_to<Add>`
Due to some complications with args forwarding, some places became more
verbose, e.g.:
* `new Block({})` --> `new Block(std::vector<ExprPtr>())`
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292779
Pulled By: ZolotukhinM
fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
Kushashwa Ravi Shrimali [Tue, 17 Aug 2021 20:35:32 +0000 (13:35 -0700)]
OpInfo fix: `conv_transpose2d` (#63389)
Summary:
Addresses comment: https://github.com/pytorch/pytorch/pull/62882#issuecomment-899679606.
cc: mruberry ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63389
Reviewed By: mruberry
Differential Revision: D30377481
Pulled By: ngimel
fbshipit-source-id: 0fa21acc3503c259c9b27463e8555247c43d9e2e
Mike Iovine [Tue, 17 Aug 2021 20:34:44 +0000 (13:34 -0700)]
[Static Runtime] Implement aten::append (#63350)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63350
Add a native implementation for `aten::append`, the list append op.
Test Plan: New unit test: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Append`
Reviewed By: hlu1
Differential Revision: D30326461
fbshipit-source-id: 0dbdf6cc82e78c7c36db39583256f6b87385e3d3
Ivan Kobzarev [Tue, 17 Aug 2021 20:34:20 +0000 (13:34 -0700)]
[vulkan] Add log_softmax (#63193)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63193
Test Plan: Imported from OSS
Reviewed By: SS-JIA
Differential Revision: D30291987
fbshipit-source-id: 89c6560274e5a841e5af249f6963b67ef6826f4c
Supriya Rao [Tue, 17 Aug 2021 18:39:16 +0000 (11:39 -0700)]
[quant][fx] Ensure qconfig works for QAT with multiple modules (#63343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63343
The previous implementation had a bug where we were trying to modify an ordered dict value while iterating through it.
This fixes it by creating a copy before modifying it.
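The general pattern, as a sketch (names are hypothetical): iterate over a copy so insertions into the original dict cannot invalidate the iteration.
```
# Hypothetical qconfig mapping for illustration only.
qconfig_dict = {"Linear": "qconfig_a", "Conv2d": "qconfig_b"}

# Iterating qconfig_dict directly while inserting raises RuntimeError;
# iterating a copy is safe.
for module_type, qconfig in dict(qconfig_dict).items():
    qconfig_dict["QAT" + module_type] = qconfig
```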
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_qconfig_qat_module_type
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D30346116
fbshipit-source-id: 0e33dad1163e8bff3fd363bfd04de8f7114d7a3a
Yi Wang [Tue, 17 Aug 2021 18:28:43 +0000 (11:28 -0700)]
Add return type hint and improve the docstring of consume_prefix_in_state_dict_if_present method (#63388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63388
Context: https://discuss.pytorch.org/t/how-to-use-the-helper-function-consume-prefix-in-state-dict-if-present/129505/3
Make it clear that this method strips the prefix in place rather than returns a new value.
Additional reformatting is also applied.
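A usage sketch (assuming the documented helper under torch.nn.modules.utils):
```
import torch
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present

model = torch.nn.Linear(2, 2)
state_dict = {"module." + k: v for k, v in model.state_dict().items()}

# Strips the prefix in place and returns None -- do not use the return value.
consume_prefix_in_state_dict_if_present(state_dict, "module.")
model.load_state_dict(state_dict)
```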
ghstack-source-id: 135973393
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D30360931
fbshipit-source-id: 1a0c7967a4c86f729e3c810686c21dec43d1dd7a
Elias Ellison [Tue, 17 Aug 2021 18:21:50 +0000 (11:21 -0700)]
Add handling of ifs to shape propagation (#62914)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62914
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D30196945
Pulled By: eellison
fbshipit-source-id: 1c0c7f938c4547330fd1dba8ab7dd0b99a79b6a9
Elias Ellison [Tue, 17 Aug 2021 18:21:50 +0000 (11:21 -0700)]
Small shape analysis changes (#62911)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62911
Test Plan: Imported from OSS
Reviewed By: anjali411
Differential Revision: D30196946
Pulled By: eellison
fbshipit-source-id: 2562bab323088d9c1440ae0431e533f9bcc513d3
Elias Ellison [Tue, 17 Aug 2021 18:21:50 +0000 (11:21 -0700)]
Add a few peepholes (#62910)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62910
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D30196947
Pulled By: eellison
fbshipit-source-id: d88c92616d4de4f47ff4fcf5c1994e629ca20395
Elias Ellison [Tue, 17 Aug 2021 18:21:50 +0000 (11:21 -0700)]
Propagate symbolic dimensions through idioms like x.view(y.size()) (#61975)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61975
Propagate symbolic dimensions through size calls. We did this by associating SymbolicSizes with integer inputs by looking through their constructors for `x.size(1)` or `x.size()` nodes.
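A sketch of the idiom this propagates through (scripted so shape analysis applies):
```
import torch

@torch.jit.script
def reshape_like(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # y's symbolic dimensions should now flow through size() into view().
    return x.view(y.size())
```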
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D30196948
Pulled By: eellison
fbshipit-source-id: 377fc1d2f6d396c52dc0e87fa814b15720f1414e
Jerry Zhang [Tue, 17 Aug 2021 17:41:38 +0000 (10:41 -0700)]
[fx2trt] Refactor linear op to use mm + add
Summary:
Previously, linear was translated to fully_connected, which only works when the weight is a constant. This diff changes that to mm + add so that the weight can be an ITensor, allowing the weight - quantize - dequantize pattern in the produced TensorRT network.
Test Plan: buck run mode/opt caffe2/torch/fb/fx2trt:test_linear
Reviewed By: 842974287
Differential Revision: D30294751
fbshipit-source-id: 596fbd4c81caef8df41a002a2e14fbf22d9d2a80
Mike Ruberry [Tue, 17 Aug 2021 17:37:57 +0000 (10:37 -0700)]
Updates set_default_dtype documentation (#63233)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60560.
The description of set_default_dtype is updated to clarify that it affects the interpretation of Python numbers as either float32 (complex64) or float64 (complex128) and that default (floating) dtypes other than float32 or float64 are unsupported.
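A sketch of the clarified behavior:
```
import torch

torch.set_default_dtype(torch.float64)
print(torch.tensor(1.5).dtype)   # torch.float64: Python floats become float64
print(torch.tensor(1.5j).dtype)  # torch.complex128: complex numbers follow along

torch.set_default_dtype(torch.float32)  # restore the usual default
```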
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63233
Reviewed By: VitalyFedyunin
Differential Revision: D30306396
Pulled By: mruberry
fbshipit-source-id: bbee62f323c773b23b2fa45cb99122bc28197432
Amy He [Tue, 17 Aug 2021 17:31:02 +0000 (10:31 -0700)]
Remove backend_debug from torch_core srcs and replace with library dependency (#63111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63111
### Problem:
Buck contains at least two libraries which have `backend_debug_info.cpp` as a source, `torch_core` and `backend_interface_lib`. `backend_debug_info.cpp` registers BackendDebugInfo as a class. If targets contain both libraries (e.g. sparkAR debug build with NNAPI delegation), then BackendDebugInfo is registered twice, causing a runtime error.
### Solution:
These changes remove `backend_debug_info.cpp` and `backend_interface.cpp` as a source in `torch_core` and adds backend_interface_lib as a dependency instead.
**build_variables.bzl:**
- Added a list that excludes `backend_debug_info.cpp` and `backend_interface.cpp` ( both srcs already included by `backend_interface_lib`)
**buck:**
- torch_core: Removed `backend_debug_info.cpp` from srcs and added `backend_interface_lib` deps
- backend_interface_lib: Replaced `torch_mobile_core` dep with more specific deps
- to avoid an indirect dep between `torch_core` and `torch_mobile_core`
ghstack-source-id: 135981061
Test Plan: Build and run SparkAR internally with Android NNAPI Delegation (`buck build --show-output arstudioplayer_arm64_debug`) and internal tests.
Reviewed By: iseeyuan
Differential Revision: D30259034
fbshipit-source-id: 0c14c827732f07fb9b9bd25a999828b51793cdcc
Amy He [Tue, 17 Aug 2021 17:31:02 +0000 (10:31 -0700)]
Move Android Nnapi srcs from aten_native_cpu to aten_cpu (#62919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62919
Move Android NNAPI srcs (nnapi_bind.cpp, nnapi_wrapper.cpp, nnapi_model_loader.cpp) from aten_native_cpu to aten_cpu, so that later the NNAPI delegate's execution library can depend on it.
aten_native_cpu is built selectively per app, but the srcs have no selective components and are required for the NNAPI delegate library in D30259033.
See Buck Dependencies: https://docs.google.com/document/d/17RuWkqWKCO6sc5fKzIDkGeNhhvMk7BvJOqeSnGsHZ8o/edit?usp=sharing
ghstack-source-id: 135981062
Test Plan: `buck build --show-output arstudioplayer_arm64_debug` and internal tests
Reviewed By: iseeyuan
Differential Revision: D30164867
fbshipit-source-id: 0beff481ff250e75664ce8393beabbeb9db66770
Ivan Kobzarev [Tue, 17 Aug 2021 17:12:11 +0000 (10:12 -0700)]
[android][vulkan] Fix model loading for Vulkan backend (#63402)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63402
Test Plan: Imported from OSS
Reviewed By: SS-JIA
Differential Revision: D30370692
Pulled By: IvanKobzarev
fbshipit-source-id: 73311b9b767fe9ed3ae390db59d6aa2c4a98f06d
Peter Bell [Tue, 17 Aug 2021 17:11:05 +0000 (10:11 -0700)]
Advertise USE_PRECOMPILED_HEADERS in CONTRIBUTING.md (#62827)
Summary:
This option was added in https://github.com/pytorch/pytorch/issues/61940 and fits with this section's theme of improving build times.
I've also changed it to a `cmake_dependent_option` instead of `FATAL_ERROR`ing for older CMake versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62827
Reviewed By: astaff
Differential Revision: D30342102
Pulled By: malfet
fbshipit-source-id: 3095b44b7085aee8a884ec95cba9f8998d4442e7
Bradley Davis [Tue, 17 Aug 2021 16:55:25 +0000 (09:55 -0700)]
[fx] persist `tracer_cls` on `fx.Graph` when deep copying (#63353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63353
Custom deepcopy method copies all nodes but does not copy the tracer_cls attribute
Reviewed By: houseroad
Differential Revision: D30349424
fbshipit-source-id: 3e98bdac8a8a992eb0b4ec67fe80bb2e5cf3884d
Dhruv Matani [Tue, 17 Aug 2021 16:20:49 +0000 (09:20 -0700)]
[PyTorch] Avoid using std::regex for device string parsing in Device.cpp (#63204)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63204
Currently, `std::regex` is used for parsing device strings. This is undesirable for a few reasons.
1. Increases binary size
2. Slows down model loading
3. Potentially uses more memory at runtime
4. Takes marginally longer to build code that uses std::regex vs. not using it
This change avoids the use of `std::regex` for parsing the device string since we don't need to.
ghstack-source-id: 136006963
Test Plan:
### AI Bench Runs
**Before this change:**
1. Model Load time: [252ms](https://www.internalfb.com/intern/aibench/details/332471502816548)
2. Model unload time: 3.5ms
**After this change:**
1. Model Load time: [240ms](https://www.internalfb.com/intern/aibench/details/652195589031318), which is an approx 5% reduction for the current model. I suspect percentage wise, it will be larger for smaller models since this is a fixed cost reduction.
2. Model unload time: 3.3ms (probably too small to be meaningfully impactful to an end user).
### BSB Results
```
D30281388-V1 (https://www.internalfb.com/intern/diff/D30281388/?dest_number=135713848)
messenger-pika-optimized-device: Succeeded
Change in Download Size for arm64 + 3x assets variation: -7.1 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -17.6 KiB
Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:551399955987465@base/bsb:551399955987465@diff/
```
Reviewed By: raziel
Differential Revision: D30281388
fbshipit-source-id: 4d998e9f313e6366d9d89a6a73cd090ddfb059fc
Dhruv Matani [Tue, 17 Aug 2021 16:20:49 +0000 (09:20 -0700)]
[PyTorch] Add Device_test.cpp (#63203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63203
Currently, `c10::Device` isn't being tested - i.e. there's no test to ensure that the device string parsing works as expected. This diff adds very basic tests to assert that the stuff we expect to work works, and the stuff that we don't expect to work doesn't work.
ghstack-source-id: 136006962
Test Plan:
New test. Ran as:
```
cd fbsource/fbcode/
buck test //caffe2/c10:c10_test_0 -- -r '.*DeviceTest.*'
```
Reviewed By: dreiss, raziel
Differential Revision: D30286910
fbshipit-source-id: b5699068dcbba89d5d224dbaf74b175f3f785a00
Taylor Robie [Tue, 17 Aug 2021 16:09:59 +0000 (09:09 -0700)]
change with_callable_args to return a fresh _PartialWrapper (#63374)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63326
Currently `get_callable_args` has the side effect of mutating the input _PartialWrapper. When that input is one of the global defaults, there are all sorts of lifetime issues that crop up. (Details in the linked issue.) So far as I can tell, we only need to make a constructor which is module (and by extension device) aware, so making a fresh one should have the same effect without leaking the last call's module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63374
Test Plan: the repro in https://github.com/pytorch/pytorch/issues/63326 now reports no leaked Tensors, and all quantization tests pass locally.
Reviewed By: HDCharles
Differential Revision: D30359360
Pulled By: robieta
fbshipit-source-id: aef33261ac49952d8d90da868a57ab063dfc456e
Victor Quach [Tue, 17 Aug 2021 15:55:25 +0000 (08:55 -0700)]
Fix flaky test for dp saved tensor hooks (#63324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63324
Fix for https://www.internalfb.com/tasks/?t=98258963
`catch_warnings` seems to only trigger once in certain cases where it should trigger twice.
This test is only meant to check whether hooks are triggered or not, so changing it to self.assertGreater is OK.
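A minimal sketch of the deduplication that makes the count flaky (plain warnings module, not the test itself): repeated warnings from the same call site can be recorded only once.
```
import warnings

def fire():
    warnings.warn("hook fired")

with warnings.catch_warnings(record=True) as caught:
    fire()
    fire()

# The second warning from the same line may be deduplicated, so the count
# can be 1 rather than 2 -- hence assertGreater instead of assertEqual.
print(len(caught))
```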
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D30340833
Pulled By: Varal7
fbshipit-source-id: 1bfb9437befe9e8ab8f95efe5f513337fa9bdc5c
Erjia Guan [Tue, 17 Aug 2021 14:26:08 +0000 (07:26 -0700)]
Add mode to TarArchiveReader (#63332)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63332
Add a corresponding PR from [torchdata](https://github.com/facebookexternal/torchdata/pull/101)
Test Plan: Imported from OSS
Reviewed By: astaff
Differential Revision: D30350151
Pulled By: ejguan
fbshipit-source-id: bced4a1ee1ce89d4e91e678327342e1c095dbb9e
Michael Dagitses [Tue, 17 Aug 2021 11:03:02 +0000 (04:03 -0700)]
add torch.meshgrid() OpInfo (#62720)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62719
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62720
Reviewed By: astaff
Differential Revision: D30344574
Pulled By: dagitses
fbshipit-source-id: ed42d9fe20741df98018efb08e640fca370583fb
Mike Ruberry [Tue, 17 Aug 2021 05:22:15 +0000 (22:22 -0700)]
Extends warning on norm docs (#63310)
Summary:
torch.norm has a couple documentation issues, like https://github.com/pytorch/pytorch/issues/44552 and https://github.com/pytorch/pytorch/issues/38595, but since it's deprecated this PR simply clarifies that the documentation (and implementation) of torch.norm may be incorrect. This should be additional encouragement for users to migrate to torch.linalg.vector_norm and torch.linalg.matrix_norm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63310
Reviewed By: ngimel
Differential Revision: D30337997
Pulled By: mruberry
fbshipit-source-id: 0fdcc438f36e4ab29e21e0a64709e4f35a2467ba
Peter Bell [Tue, 17 Aug 2021 05:09:25 +0000 (22:09 -0700)]
Cleanup dead code (#63328)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63328
This code supported the old `at::_fft_with_size` operator which no longer exists.
Test Plan: Imported from OSS
Reviewed By: astaff
Differential Revision: D30343557
Pulled By: mruberry
fbshipit-source-id: 7a71585e013acb46c98f14fd40e15bdfbf026bac
Peter Bell [Tue, 17 Aug 2021 05:09:25 +0000 (22:09 -0700)]
Workaround for cuFFT bug (#63327)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63327
Fixes #63152
Test Plan: Imported from OSS
Reviewed By: astaff
Differential Revision: D30343558
Pulled By: mruberry
fbshipit-source-id: 68e17a07650f65f397e26efc417e97e2ab302f82
Nikita Shulga [Tue, 17 Aug 2021 03:35:12 +0000 (20:35 -0700)]
Add step to report code coverage from GHA (#63373)
Summary:
Similar to the logic provided in https://github.com/pytorch/pytorch/blob/b2069e7d01814d776c417042e28133c6b0e5082f/.circleci/verbatim-sources/job-specs/pytorch-job-specs.yml#L197-L201
Fixes https://github.com/pytorch/pytorch/issues/63366
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63373
Reviewed By: walterddr
Differential Revision: D30357737
Pulled By: malfet
fbshipit-source-id: 20b115eb4d6412bd9895680308a9097742d2ae7b
Mikhail Zolotukhin [Tue, 17 Aug 2021 03:34:49 +0000 (20:34 -0700)]
[TensorExpr] Remove test_train from tensorexpr tests. (#63194)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63194
This test implements functionality that is used nowhere, and the author no longer works on it. This PR also adds test_approx to CMakeLists, where it had been missing.
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D30292777
Pulled By: ZolotukhinM
fbshipit-source-id: ab6d98e729320a16f1b02ea0c69734f5e7fb2554
Don Jang [Tue, 17 Aug 2021 00:30:26 +0000 (17:30 -0700)]
[JIT] Set future's error to current exception as is when `--torch_jit_enable_rethrow_caught_exception=true` (#63348)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63348
This change addresses singlaiiit's comment on D30241792 (https://github.com/pytorch/pytorch/commit/61b49c8e41a2faf7fd40278ca72616c5d92963cb), which makes the JIT interpreter's behavior consistent between when `future` is set and when it is not.
Test Plan: Enhanced `EnableRethrowCaughtExceptionTest.EnableRethrowCaughtExceptionTestRethrowsCaughtException` to cover the modified code path.
Reviewed By: singlaiiit
Differential Revision: D30347782
fbshipit-source-id: 79ce57283154ca4372e5341217d942398db21ac8
Don Jang [Mon, 16 Aug 2021 23:50:30 +0000 (16:50 -0700)]
[Static Runtime] Fix a bug that assigns multiple outputs to single storage (#63012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63012
This change fixes a bug that the static runtime's memory optimizer assigns multiple outputs of a node to the same storage. Fixing this bug enables the static runtime to run `inline_cvr` with its memory optimizer enabled.
A problematic line from `inline_cvr` was as follows:
```
%7767 : Tensor, %getitem_6419.1 : Tensor = fb::gather_ranges(%tensor74.1, %7764)
```
where enabling the memory optimizer assigns `%7767` and `%getitem_6419.1` to the same storage, which made their data corrupted during the 2nd iteration.
This change fixed the aforementioned bug by marking all inputs & outputs of a node as `alive` during our liveness analysis. By doing that, no inputs / outputs will collide with each other. I believe this is a fair assumption that most ops' implementations already satisfy, but it was missing from our analysis before this change.
Test Plan: - Added a unittest `StaticRuntime.ValuesShareSameStorageDoesNotContainOutputsFromSameNode` to cover the new code.
Reviewed By: hlu1
Differential Revision: D30202018
fbshipit-source-id: 10287a1bee9e86be16a5201e9a7cd7c7f046bab9
Yi Wang [Mon, 16 Aug 2021 23:33:21 +0000 (16:33 -0700)]
[Model Averaging] Add a few member methods of PostLocalSGDOptimizer (#63340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63340
Some methods are needed such as accessing optimizer states. These are necessary for integration with PyTorch Lightning.
Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 135912246
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD
Reviewed By: rohan-varma
Differential Revision: D30328794
fbshipit-source-id: e585b874313bd266fdc7c79936e2af98700c7bad
Hao Lu [Mon, 16 Aug 2021 23:30:53 +0000 (16:30 -0700)]
[PyPer] Skip printing out per node time when do_profile is on (#63256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63256
This suppresses printing out the per-node time, which is very long when the net has too many ops. It can be easily turned on by setting `--pt_sr_print_per_node_time=1`.
Reviewed By: ajyu, mikeiovine
Differential Revision: D30298331
fbshipit-source-id: 32b3f93b3fe19d335654168311fda93331a1e706
Amy He [Mon, 16 Aug 2021 22:42:14 +0000 (15:42 -0700)]
Refactor NnapiCompilation registration into its own file (#63183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63183
Move registration of NnapiCompilation into its own file, so that `nnapi_bind.cpp` (which contains the implementation of NnapiCompilation) can be moved to `aten_cpu`, while maintaining the selectiveness for registration.
`nnapi_bind.cpp` is moved to `aten_cpu` in https://github.com/pytorch/pytorch/pull/62919. See the PR for more details on why it's needed.
ghstack-source-id: 135900318
Test Plan: Nnapi unit tests: `python test/test_nnapi.py`
Reviewed By: iseeyuan
Differential Revision: D30288708
fbshipit-source-id: 6ed5967fa6bd018075469d18e68f844d413cf265
Richard Zou [Mon, 16 Aug 2021 22:35:05 +0000 (15:35 -0700)]
Add section to CONTRIBUTING.md explaining developer docs (#63228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63228
It is a quick summary and links to a page on the Developer Wiki that has
more detail.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D30347109
Pulled By: zou3519
fbshipit-source-id: a6242986d275e5279ca3f61ade2294a132d268c4
Eli Uriegas [Mon, 16 Aug 2021 22:30:24 +0000 (15:30 -0700)]
test: Add ability to set CONTINUE_THROUGH_ERROR (#63357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63357
Adds the ability to set CONTINUE_THROUGH_ERROR as an environment
variable so that we can easily set it without having to add the flag
directly
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: astaff
Differential Revision: D30351108
Pulled By: seemethere
fbshipit-source-id: 767fa9bd24e1399f359eb24d16f6cc985a2d7173
Bo Wang [Mon, 16 Aug 2021 22:18:01 +0000 (15:18 -0700)]
Add driver function to run test_sharded_tensor.py and test_sharding_spec.py (#63189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63189
Add main --> run_tests func in test file which is needed to launch the real test cases in OSS flow.
Test Plan:
before:
$ python test/distributed/_sharding_spec/test_sharding_spec.py --v ==> nothing happened
$ python test/distributed/_sharded_tensor/test_sharded_tensor.py --v ==> nothing happened
after:
$ python test/distributed/_sharding_spec/test_sharding_spec.py --v ==>
test_chunked_sharding_spec (__main__.TestShardingSpec) ... ok
test_device_placement (__main__.TestShardingSpec) ... ok
test_enumerable_sharding_spec (__main__.TestShardingSpec) ... ok
$ python test/distributed/_sharded_tensor/test_sharded_tensor.py --v
test_complete_world_size (__main__.TestShardedTensorChunked) ... ok
test_insufficient_sharding_dims (__main__.TestShardedTensorChunked) ... ok
test_invalid_pg_rpc_ranks (__main__.TestShardedTensorChunked) ... [W tensorpipe_agent.cpp:699] RPC agent for worker2 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
ok
test_invalid_sharding (__main__.TestShardedTensorChunked) ... ok
test_load_state_dict_errors (__main__.TestShardedTensorChunked) ... ok
test_multiple_local_shards (__main__.TestShardedTensorChunked) ... ok
test_new_group (__main__.TestShardedTensorChunked) ... ok
test_partial_world_size (__main__.TestShardedTensorChunked) ... ok
test_sharded_tensor_metadata (__main__.TestShardedTensorChunked) ... ok
test_sharded_tensor_sizes (__main__.TestShardedTensorChunked) ... ok
test_sharding_columns (__main__.TestShardedTensorChunked) ... ok
test_state_dict (__main__.TestShardedTensorChunked) ... ok
test_state_dict_new_group (__main__.TestShardedTensorChunked) ... ok
test_state_dict_no_sharded_tensors (__main__.TestShardedTensorChunked) ... ok
test_grid_sharding (__main__.TestShardedTensorEnumerable) ... ok
test_multiple_local_shards (__main__.TestShardedTensorEnumerable) ... ok
test_new_group (__main__.TestShardedTensorEnumerable) ... ok
test_partial_world_size (__main__.TestShardedTensorEnumerable) ... ok
test_sharded_tensor_metadata (__main__.TestShardedTensorEnumerable) ... ok
test_uneven_shards (__main__.TestShardedTensorEnumerable) ... ok
test_with_rpc_names (__main__.TestShardedTensorEnumerable) ... ok
test_init_from_local_shards (__main__.TestShardedTensorFromLocalShards) ... ok
test_init_from_local_shards_invalid_shards (__main__.TestShardedTensorFromLocalShards) ... ok
test_init_from_local_shards_invalid_shards_gaps (__main__.TestShardedTensorFromLocalShards) ...
Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D30294094
fbshipit-source-id: 08f0431a12ea854abe00dc920205b10ba43ae6b6
Shiyan Deng [Mon, 16 Aug 2021 22:16:51 +0000 (15:16 -0700)]
[fx2trt] add unsqueeze converter (#63355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63355
Added converter for acc_ops.unsqueeze. Needed for ig model.
Didn't add support for inputs that have more than one dynamic dim. This is not needed right now and I feel it would be a rare case.
Test Plan: unit test
Reviewed By: yinghai
Differential Revision: D30138293
fbshipit-source-id: 899fe8eb68387de83195a2f6e199618d96f09a9e
Mike Iovine [Mon, 16 Aug 2021 21:50:27 +0000 (14:50 -0700)]
[Static Runtime] Implement prim::TupleUnpack (#63243)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63243
Add `prim::TupleUnpack` native op to static runtime.
Test Plan: Unit test: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D30306955
fbshipit-source-id: 21923d6cbd5545c144ac051b3d48b37ec6e610cf
Jerry Zhang [Mon, 16 Aug 2021 21:07:43 +0000 (14:07 -0700)]
[fx2trt] Factor out add_matrix_multiply_layer
Summary: Factor out the function so that it can be reused in future diffs
Test Plan: buck run mode/opt caffe2/torch/fb/fx2trt:test_matmul
Reviewed By: 842974287
Differential Revision: D30322823
fbshipit-source-id: 069b945de2c744cdbcca1618b62827692dfb4174
MY_ [Mon, 16 Aug 2021 21:07:06 +0000 (14:07 -0700)]
A re-open PR: Avoid re-creating the random number generator in RandomSampler (#63026)
Summary:
More details can be found in the old pr: https://github.com/pytorch/pytorch/pull/53085
ejguan Thanks for your guidance. I tried to reopen this PR following your instructions.
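A simplified sketch of the idea (hypothetical class, not the PR's code): construct the generator once and reuse it across iterations.
```
import torch

class CachedGeneratorSampler:
    """Reuses one torch.Generator instead of re-creating it per __iter__."""

    def __init__(self, num_samples: int, seed: int = 0):
        self.num_samples = num_samples
        self.generator = torch.Generator()
        self.generator.manual_seed(seed)

    def __iter__(self):
        yield from torch.randperm(self.num_samples, generator=self.generator).tolist()
```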
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63026
Reviewed By: anjali411
Differential Revision: D30224920
Pulled By: ejguan
fbshipit-source-id: 2fa83bd4a2661485e553447fe3e57ce723f2716d
Nikita Shulga [Mon, 16 Aug 2021 20:50:44 +0000 (13:50 -0700)]
Improve pip package determination (#63321)
Summary:
Invoking `pip` or `pip3` yields the list of packages for whichever `pip` alias is on the path, rather than for the interpreter currently being executed. Changed `get_pip_packages` to use `sys.executable + '-mpip'`.
Also, add mypy to the list of packages of interest.
Discovered while looking at https://github.com/pytorch/pytorch/issues/63279
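A sketch of the approach (an assumed equivalent of the change, not the exact helper):
```
import subprocess
import sys

# Query packages for the interpreter actually running this script,
# not whatever `pip` happens to be first on PATH.
output = subprocess.run(
    [sys.executable, "-mpip", "list", "--format=freeze"],
    capture_output=True, text=True, check=True,
).stdout
print(output)
```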
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63321
Reviewed By: walterddr
Differential Revision: D30342099
Pulled By: malfet
fbshipit-source-id: fc8d17cf2ddcf18236cfde5c1b9edb4e72804ee0
Lucas Kabela [Mon, 16 Aug 2021 20:34:56 +0000 (13:34 -0700)]
[Profiler] Change FLOP/s to Total FLOPs (#62779)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62779
Change from floating point operations per second to total floating point operations. This requires removing the division by execution time from the Kineto-computed FLOPs and updating the necessary documentation.
Test Plan:
Running the following script:
```
import torch
from torch.profiler import profile
import torchvision.models as models
model = models.resnet18().eval()
inputs = torch.randn(5, 3, 224, 224)
with torch.no_grad():
with profile(record_shapes=True, with_flops=True) as prof:
model(inputs)
print(prof.key_averages().table(sort_by="cpu_time_total"))
```
Before diff results in:
{F636640118}
And after the diff it should be about `(27.78 * 10^9) FLOP/s * 0.652838 seconds = 18135839640 FLOP = 18.136 GFLOP`. Running the script again yields this answer:
{F636655686}
Reviewed By: gdankel
Differential Revision: D29972997
fbshipit-source-id: 0f8d9f264b7d9f8f6bb3f10ab7c2c9794291e28b
zhouzhuojie [Mon, 16 Aug 2021 20:32:40 +0000 (13:32 -0700)]
Fix triage workflow when the card already exists in project (#63347)
Summary:
Fixes issues like https://github.com/pytorch/pytorch/runs/3336787242
```
RequestError [HttpError]: Validation Failed: {"resource":"ProjectCard","code":"unprocessable","field":"data","message":"Project already has the associated issue"}
Error: Unhandled error: HttpError: Validation Failed: {"resource":"ProjectCard","code":"unprocessable","field":"data","message":"Project already has the associated issue"}
at /home/runner/work/_actions/actions/github-script/v2/dist/index.js:7531:23
at processTicksAndRejections (internal/process/task_queues.js:93:5)
at async eval (eval at callAsyncFunction (/home/runner/work/_actions/actions/github-script/v2/dist/index.js:7985:56), <anonymous>:63:1)
at async main (/home/runner/work/_actions/actions/github-script/v2/dist/index.js:8011:20) {
name: 'HttpError',
status: 422,
...
```
The card may already exist, so a `422` status code is expected and can be safely swallowed; any other error is re-thrown.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63347
Reviewed By: malfet
Differential Revision:
D30348529
Pulled By: zhouzhuojie
fbshipit-source-id:
36647837bfccad43ce01eb5dfe6642e685615037
kshitij12345 [Mon, 16 Aug 2021 20:26:46 +0000 (13:26 -0700)]
[opinfo] nn.functional.pad (#62814)
Summary:
Reference: https://github.com/facebookresearch/functorch/issues/78
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62814
Reviewed By: VitalyFedyunin
Differential Revision:
D30307492
Pulled By: zou3519
fbshipit-source-id:
4f6062eb4a3c91ed1795df1f82846afa0abafcdc
Sam Estep [Mon, 16 Aug 2021 20:20:59 +0000 (13:20 -0700)]
Add expecttest to requirements.txt (#63320)
Summary:
This PR closes the developer environment gap left by https://github.com/pytorch/pytorch/issues/60658 by adding [expecttest](https://github.com/ezyang/expecttest) to `requirements.txt`. Thus it provides a solution to one of the short-term problems that https://github.com/pytorch/pytorch/issues/60697 tries to solve, but does not provide a long-term solution to https://github.com/pytorch/pytorch/issues/61375.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63320
Reviewed By: malfet
Differential Revision:
D30340654
Pulled By: samestep
fbshipit-source-id:
26c8f8c9889cce4a94fafb1bf2f0d6df4c70503f
kyshel [Mon, 16 Aug 2021 19:12:45 +0000 (12:12 -0700)]
add comma to prevent syntax errors (#62492)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62492
Reviewed By: VitalyFedyunin
Differential Revision:
D30304684
Pulled By: ezyang
fbshipit-source-id:
db08ca39bcecbfd79ea50df18536bf4e87f51e15
Bert Maher [Mon, 16 Aug 2021 19:10:50 +0000 (12:10 -0700)]
Retry apt-get during setup_ci_workspace (#63319)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63319
Test Plan: Imported from OSS
Reviewed By: anjali411
Differential Revision:
D30346067
Pulled By: bertmaher
fbshipit-source-id:
2aafa97e78f9297553d772b2524d6f1c0ebaa46e
Nikita Vedeneev [Mon, 16 Aug 2021 18:39:04 +0000 (11:39 -0700)]
Make `torch.lu` differentiable for wide/tall inputs + jit (#61564)
Summary:
As per title.
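A quick illustration of the new behavior, assuming the `torch.lu` API as of this PR (shapes are illustrative):
```python
import torch

A = torch.randn(4, 6, dtype=torch.float64, requires_grad=True)  # wide (4 x 6) input
LU, pivots = torch.lu(A)  # LU factorization with partial pivoting
LU.sum().backward()       # now differentiable for non-square inputs
print(A.grad.shape)       # torch.Size([4, 6])
```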
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61564
Reviewed By: astaff
Differential Revision:
D30338136
Pulled By: mruberry
fbshipit-source-id:
f01436fc90980544cdfa270feee16bb3dda21b93
Yi Wang [Mon, 16 Aug 2021 17:05:47 +0000 (10:05 -0700)]
[Model Averaging] Allow subgroup to be None in PostLocalSGDState (#63277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63277
`PostLocalSGDState` requires a subgroup. To initialize this subgroup, a global process group must first be initialized. However, this imposes the restriction that a hook state can only be provided after the distributed environment is initialized, which is incompatible with the lightning DDP plugin setup, where the hook state should be provided before distributed environment initialization.
Proposal: https://github.com/pytorch/pytorch/issues/59699
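A hedged sketch of the intended usage (argument values are assumptions):
```python
from torch.distributed.algorithms.ddp_comm_hooks.post_localSGD_hook import (
    PostLocalSGDState,
    post_localSGD_hook,
)

# With subgroup=None the state can be constructed *before*
# dist.init_process_group(); the subgroup is resolved lazily.
state = PostLocalSGDState(process_group=None, subgroup=None, start_localSGD_iter=100)

# Later, once the distributed environment and the DDP model exist:
# ddp_model.register_comm_hook(state, post_localSGD_hook)
```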
ghstack-source-id:
135848575
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD
Reviewed By: cbalioglu
Differential Revision:
D30325041
fbshipit-source-id:
7b870166d096d306c3f2f7c69816a705cec0bebd
Meghan Lele [Mon, 16 Aug 2021 16:12:57 +0000 (09:12 -0700)]
Revert "[docs] Update docs for NegativeBinomial (#45693)" (#63192)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63192
**Summary**
This reverts commit 402caaeba513929dcfe12df183c764b0ef43f688. As per the discussion in #62178, this commit was not needed.
**Test Plan**
Continuous integration.
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision:
D30293202
Pulled By: SplitInfinity
fbshipit-source-id:
91ee7ad0523a9880605d83fe9712c39df67384a8
Erjia Guan [Mon, 16 Aug 2021 13:39:56 +0000 (06:39 -0700)]
Refactor BucketBatch (#63185)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63185
Test Plan: Imported from OSS
Reviewed By: bdhirsh
Differential Revision:
D30288893
Pulled By: ejguan
fbshipit-source-id:
b88b792d12a83c99d8ea9e516e3b4c54a82100f6
Erjia Guan [Mon, 16 Aug 2021 13:39:56 +0000 (06:39 -0700)]
Replace str by repr for DataChunk (#63184)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63184
Test Plan: Imported from OSS
Reviewed By: bdhirsh
Differential Revision:
D30288892
Pulled By: ejguan
fbshipit-source-id:
45c88fdd3987e234f2c22ebbbfd8d5044983c34c
Raghavan Raman [Mon, 16 Aug 2021 07:07:51 +0000 (00:07 -0700)]
[nnc] Updated IRMutator and IRSimplifier to perform in-place mutations. (#63246)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63246
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision:
D30309636
Pulled By: navahgar
fbshipit-source-id:
409ea8d6982888cfee9127e6248044dd2ed9d8d4
Supriya Rao [Mon, 16 Aug 2021 05:44:44 +0000 (22:44 -0700)]
[docs][ao] Add overload information for fake_quantize_per_tensor_affine (#63258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63258
This function supports scalar and tensor qparams
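For illustration, a brief sketch of the two overloads (values and the tensor `zero_point` dtype are assumptions):
```python
import torch

x = torch.randn(4)

# scalar qparams
y1 = torch.fake_quantize_per_tensor_affine(x, 0.1, 0, 0, 255)

# tensor qparams (useful when scale / zero_point are computed tensors)
scale = torch.tensor(0.1)
zero_point = torch.tensor(0, dtype=torch.int32)
y2 = torch.fake_quantize_per_tensor_affine(x, scale, zero_point, 0, 255)
```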
Test Plan:
CI
Imported from OSS
Reviewed By: jerryzh168
Differential Revision:
D30316432
fbshipit-source-id:
8b2f5582e7e095fdda22c17d178abcbc89a2d1fc
Supriya Rao [Mon, 16 Aug 2021 05:44:44 +0000 (22:44 -0700)]
[docs][ao] Add missing docstrings for quantized_max_pool1d and quantized_max_pool2d (#63242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63242
These functions are part of the native functions namespace as well as the quantized namespace
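As a hedged usage sketch (shapes and keyword arguments are illustrative):
```python
import torch

xq = torch.quantize_per_tensor(torch.randn(1, 3, 8, 8), scale=0.1, zero_point=0,
                               dtype=torch.quint8)
out = torch.quantized_max_pool2d(xq, kernel_size=[2, 2], stride=[2, 2])
print(out.shape)  # torch.Size([1, 3, 4, 4])
```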
Test Plan:
CI
Imported from OSS
Reviewed By: jerryzh168
Differential Revision:
D30316430
fbshipit-source-id:
cd9c839e5c1a961e3c6944e514c16fbc256a2f0c
Supriya Rao [Mon, 16 Aug 2021 05:44:44 +0000 (22:44 -0700)]
[docs][ao] Add missing documentation for torch.quantized_batch_norm (#63240)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63240
Op is exposed via torch.quantized_batch_norm to the end user without any existing documentation
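A hedged sketch of calling the op directly (the argument order follows the native signature as best understood; treat it as an assumption):
```python
import torch

C = 3
xq = torch.quantize_per_tensor(torch.randn(2, C, 4, 4), scale=0.1, zero_point=0,
                               dtype=torch.quint8)
out = torch.quantized_batch_norm(
    xq,
    torch.ones(C),   # weight
    torch.zeros(C),  # bias
    torch.zeros(C),  # running mean
    torch.ones(C),   # running var
    1e-5,            # eps
    0.1,             # output scale
    0,               # output zero point
)
```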
Test Plan:
CI
Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision:
D30316431
fbshipit-source-id:
bf2dc8b7b6f497cf73528eaa2bedef9f65029d84
Heitor Schueroff [Mon, 16 Aug 2021 01:06:41 +0000 (18:06 -0700)]
[OpInfo] Add expected_failure kwarg to SkipInfo (#62963)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62963
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision:
D30327199
Pulled By: heitorschueroff
fbshipit-source-id:
45231eca11d1697a4449d79849fb17264d128a6b
Heitor Schueroff [Mon, 16 Aug 2021 01:06:41 +0000 (18:06 -0700)]
Small refactor for OpInfo decorators (#62713)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62713
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision:
D30327200
Pulled By: heitorschueroff
fbshipit-source-id:
1899293990c8c0a66da88646714b38f1aae9179d
Kimish Patel [Sun, 15 Aug 2021 23:12:47 +0000 (16:12 -0700)]
[Pytorch Edge] Fix broken test post changes in error reporting format. (#63287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63287
Recent changes in https://github.com/pytorch/pytorch/pull/62419 changed
the way module hierarchy is reported. Now it includes information about
function names as well.
Test Plan:
python test/mobile/test_lite_script_module.py
TestLiteScriptModule.test_save_mobile_module_with_debug_info_with_trace
Imported from OSS
Reviewed By: iseeyuan
Differential Revision:
D30328512
fbshipit-source-id:
ddd6b11b9ab01cc725f4568a35eff7a92f17204b
Ilqar Ramazanli [Sun, 15 Aug 2021 19:30:18 +0000 (12:30 -0700)]
To add warm-up scheduler to optim (#60836)
Summary:
Warm-up of learning rate scheduling was initially discussed by Priya Goyal et al. in the paper: https://arxiv.org/pdf/1706.02677.pdf .
In Section 2.2 of the paper they discuss and propose the idea of warming up the learning rate in order to prevent large variance / noise early in training. The idea has been discussed further in the following papers:
* Akilesh Gotmare et al.: https://arxiv.org/abs/1810.13243
* Bernstein et al.: http://proceedings.mlr.press/v80/bernstein18a/bernstein18a.pdf
* Liyuan Liu et al.: https://arxiv.org/pdf/1908.03265.pdf
There are two popular types of learning rate warm-up:
* Constant warm-up (start with a very small constant learning rate)
* Linear warm-up (start with a small learning rate and gradually increase it)
In this PR we add warm-up as a learning rate scheduler. Note that learning rate schedulers are chainable, which means we can combine the warm-up scheduler with any other scheduler to build a more sophisticated schedule.
## Linear Warmup
Linear warm-up multiplies the learning rate by a pre-defined constant, warmup_factor, in the first epoch (epoch 0), and then increases the multiplier linearly until it reaches one after warmup_iters epochs. Hence at the i-th step the multiplier equals:
warmup_factor + (1 - warmup_factor) * i / warmup_iters
Moreover, the ratio of this quantity at step i to step i-1 is
1 + (1 - warmup_factor) / [warmup_iters * warmup_factor + (i - 1) * (1 - warmup_factor)]
which is what the get_lr() method uses in our implementation. For example, with warmup_factor=0.1 and warmup_iters=10, the multiplier at step 1 is 0.1 + 0.9 * 1 / 10 = 0.19, matching the output below for a base lr of 0.1. Below we provide an example of how to use the linear warm-up scheduler and what it produces.
```python
import torch
from torch.nn import Parameter
from torch.optim import SGD
from torch.optim.lr_scheduler import WarmUpLR

model = [Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, 0.1)
scheduler = WarmUpLR(optimizer, warmup_factor=0.1, warmup_iters=10, warmup_method="linear")

for epoch in range(15):
    print(epoch, scheduler.get_last_lr()[0])
    optimizer.step()
    scheduler.step()
```
```
0 0.010000000000000002
1 0.019000000000000003
2 0.028000000000000008
3 0.03700000000000001
4 0.04600000000000001
5 0.055000000000000014
6 0.06400000000000002
7 0.07300000000000002
8 0.08200000000000003
9 0.09100000000000004
10 0.10000000000000005
11 0.10000000000000005
12 0.10000000000000005
13 0.10000000000000005
14 0.10000000000000005
```
## Constant Warmup
Constant warm-up has a straightforward idea: multiply the learning rate by warmup_factor until epoch warmup_iters is reached, then do nothing for the following epochs.
```python
import torch
from torch.nn import Parameter
from torch.optim import SGD
from torch.optim.lr_scheduler import WarmUpLR

model = [Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, 0.1)
scheduler = WarmUpLR(optimizer, warmup_factor=0.1, warmup_iters=5, warmup_method="constant")

for epoch in range(10):
    print(epoch, scheduler.get_last_lr()[0])
    optimizer.step()
    scheduler.step()
```
```
0 0.010000000000000002
1 0.010000000000000002
2 0.010000000000000002
3 0.010000000000000002
4 0.010000000000000002
5 0.10000000000000002
6 0.10000000000000002
7 0.10000000000000002
8 0.10000000000000002
9 0.10000000000000002
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60836
Reviewed By: saketh-are
Differential Revision:
D29537615
Pulled By: iramazanli
fbshipit-source-id:
d910946027acc52663b301f9c56ade686e62cb69