Jordan Fix [Wed, 8 Sep 2021 22:30:28 +0000 (15:30 -0700)]
[acc_normalizer] Improve error when kwarg normalization fails (#64408)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64408
att
Test Plan: NFC
Reviewed By: protonu
Differential Revision: D30716392
fbshipit-source-id: e1c3bb1afcd5363a9d502549d8a46b90226be40c
Hector Yuen [Wed, 8 Sep 2021 22:22:21 +0000 (15:22 -0700)]
Update breakpad to an existing commit: 7d188f6 (#64666)
Summary:
Fixes issue https://github.com/pytorch/pytorch/issues/64561
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64666
Reviewed By: driazati
Differential Revision: D30814127
Pulled By: hyuen
fbshipit-source-id: 511a30fc26153569b1cd39f34e4a1a6bb99cc5e4
Ilqar Ramazanli [Wed, 8 Sep 2021 22:20:52 +0000 (15:20 -0700)]
To add Stochastic Gradient Descent to Documentation (#63805)
Summary:
It has been discussed before that adding descriptions of optimization algorithms to the PyTorch Core documentation may result in a nice optimization research tutorial. In the following tracking issue we list all the necessary algorithms and links to the originally published papers: https://github.com/pytorch/pytorch/issues/63236.
In this PR we are adding a description of Stochastic Gradient Descent to the documentation.
<img width="466" alt="SGDalgo" src="https://user-images.githubusercontent.com/
73658284/
132585881-
b351a6d4-ece0-4825-b9c0-
126d7303ed53.png">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63805
Reviewed By: albanD
Differential Revision: D30818947
Pulled By: iramazanli
fbshipit-source-id: 3812028e322c8a64f4343552b0c8c4582ea382f3
Eli Uriegas [Wed, 8 Sep 2021 21:40:03 +0000 (14:40 -0700)]
.github: Upgrade windows CUDA 10.1 -> 10.2 (#64658)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64658
We don't release 10.1 anymore so let's bump to 10.2
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: malfet, janeyx99
Differential Revision: D30811178
Pulled By: seemethere
fbshipit-source-id: c504ebf7f0d4c0d6229319d774f808b4ba0facd9
Shirong Wu [Wed, 8 Sep 2021 21:29:33 +0000 (14:29 -0700)]
Add plugin for linalg norm operation (#64611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64611
Add plugin for torch.linalg.norm. This plugin only supports norm operations without a batch_size change, so vector inputs or matrix inputs with dims including '0' are not supported by this plugin.
Test Plan: Unit test
Reviewed By: 842974287
Differential Revision: D30525958
fbshipit-source-id: 0d66b60a390bb6235166e5a80390090d0acf691a
Natalia Gimelshein [Wed, 8 Sep 2021 21:25:42 +0000 (14:25 -0700)]
Revert D30735341: Migrate uses of THCReduceApplyUtils to cuda_utils::BlockReduce
Test Plan: revert-hammer
Differential Revision: D30735341 (https://github.com/pytorch/pytorch/commit/a5ad08ec704a3f765814eacf5c393e871c0174e1)
Original commit changeset: 3cb58bed8f1f
fbshipit-source-id: 874dd0f93b24a99694db42a15714834069d402bc
Yinghai Lu [Wed, 8 Sep 2021 20:50:46 +0000 (13:50 -0700)]
[fx] make const fold code more pythonic (#64451)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64451
No functional change.
Test Plan:
```
buck test caffe2/test:fx_const_fold
```
Reviewed By: jfix71, RoshanPAN, houseroad
Differential Revision: D30718255
fbshipit-source-id: 95f98561c7f33fcc6c839db68683c85eb152c949
Zafar Takhirov [Wed, 8 Sep 2021 20:32:29 +0000 (13:32 -0700)]
[quant] Enable jit tracing on quantizable LSTM (resubmission) (#64638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64638
The quantizable LSTM didn't support jit tracing because it had several non-traceable paths. We sacrifice some of the user experience to enable tracing.
The main UX feature removed is a user-friendly message when trying to access the backwards path in a bidirectional LSTM: when the bidirectional flag is False, we used to throw a nice error message when the user tried accessing backwards weights. Now the message is the default one (the properties were removed).
Test Plan: `buck test mode/dev //caffe2/test:quantization -- test_custom_module_lstm`
Reviewed By: HDCharles
Differential Revision: D30803753
fbshipit-source-id: a639955a96cee22538d9436f1c952a5d121f50f9
Peter Bell [Wed, 8 Sep 2021 20:25:42 +0000 (13:25 -0700)]
Factor out TensorBase that doesn't depend on native operators (#63612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63612
This makes Tensor inherit from a new class TensorBase, that provides a subset of Tensor that doesn't
directly depend on native_functions.yaml. Code that only includes TensorBase.h will thus not need to
be rebuilt every time someone changes an operator signature.
Making `Tensor` inherit from this class means that `const TensorBase&` parameters will be callable
with an ordinary `Tensor`. I've also made `Tensor` constructible and assignable from `TensorBase` to
minimize friction in code mixing the two types.
To help enforce that `Tensor.h` and `Functions.h` aren't accidentally included, I've added an error
into `Operators.h` if `TORCH_ASSERT_NO_OPERATORS` is defined. We can either set this in the build
system for certain folders, or just define it at the top of any file.
I've also included an example of manually special-casing the commonly used `contiguous` operator.
The inline function's slow path defers to `TensorBase::__dispatch_contiguous` which is defined in
`Tensor.cpp`. I've made it so `OptionalTensorRef` is constructible from `TensorBase`, so I can
materialize a `Tensor` for use in dispatch without actually increasing its refcount.
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D30728580
Pulled By: ezyang
fbshipit-source-id: 2cbc8eee08043382ee6904ea8e743b1286921c03
David Riazati [Wed, 8 Sep 2021 18:35:42 +0000 (11:35 -0700)]
Make doc previews use its own S3 bucket (#64594)
Summary:
We had been using the gha-artifacts bucket (which previously only stored workflow artifacts) to keep the docs around. This makes it hard to see how our storage for artifacts vs docs is trending.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64594
Reviewed By: seemethere
Differential Revision: D30794328
Pulled By: driazati
fbshipit-source-id: 6b2721a3d76e8a273bde055783d56551f8409edd
Thomas J. Fan [Wed, 8 Sep 2021 18:00:11 +0000 (11:00 -0700)]
TST Adds inplace checks to module_info (#63739)
Summary:
Follow up to https://github.com/pytorch/pytorch/pull/61935
This PR adds inplace checks to `test_modules`. This version checks the constructor for `inplace` and performs the check automatically.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63739
Reviewed By: saketh-are
Differential Revision: D30737774
Pulled By: jbschlosser
fbshipit-source-id: 8813534511e9296c8424d1ca878412726ddd4043
Peter Bell [Wed, 8 Sep 2021 17:57:30 +0000 (10:57 -0700)]
Migrate uses of THCReduceApplyUtils to cuda_utils::BlockReduce (#64442)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64442
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D30735341
Pulled By: ngimel
fbshipit-source-id: 3cb58bed8f1f5aa32fd49fd37b10c8490bcc645a
Eli Uriegas [Wed, 8 Sep 2021 17:51:29 +0000 (10:51 -0700)]
.github: Run docker containers in detach mode (#64459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64459
Should allow users to exec into the docker container if using with-ssh,
even if the build / test command has finished executing
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D30742797
Pulled By: seemethere
fbshipit-source-id: 969ed8799216c6051439c7d41ab709b2d40938ac
Animesh Jain [Wed, 8 Sep 2021 17:48:09 +0000 (10:48 -0700)]
[NNC] Add Softplus operator (#64589)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64589
Adding a softplus operator lowering for NNC, and enabling element-wise fusion as well.
Test Plan: Added a test in test_jit_fuser.py
Reviewed By: bertmaher
Differential Revision: D30736449
fbshipit-source-id: 6c5fc3bceb5cef2322ecd4449f827e4af018ea93
Horace He [Wed, 8 Sep 2021 16:59:04 +0000 (09:59 -0700)]
Add `__matmul__` to the magic methods for FX tracing (#64512)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64483
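A minimal sketch of what now traces (the module is illustrative):
```python
import torch
from torch import fx

class MatMul(torch.nn.Module):
    def forward(self, a, b):
        return a @ b  # __matmul__ is now handled like the other magic methods

traced = fx.symbolic_trace(MatMul())
print(traced.graph)  # shows a call_function node targeting operator.matmul
```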
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64512
Reviewed By: mrshenli
Differential Revision: D30797265
Pulled By: Chillee
fbshipit-source-id: 7630e048a960e0b27c4309d04d85301abe325189
kshitij12345 [Wed, 8 Sep 2021 16:52:53 +0000 (09:52 -0700)]
update scatter formula (#64546)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63430
Already tested OpInfo gradient tests
https://github.com/pytorch/pytorch/blob/544c8e6a5d26efdf1cf679b313893fe119825930/torch/testing/_internal/common_methods_invocations.py#L8575-L8577
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64546
Reviewed By: saketh-are
Differential Revision: D30768759
Pulled By: albanD
fbshipit-source-id: 27d144971c51a956a232fc7d02df5c9d2706d565
Kevin Tse [Wed, 8 Sep 2021 16:42:22 +0000 (09:42 -0700)]
fixing trapezoid() comments for clarity (#64592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64592
cc mruberry rgommers heitorschueroff
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D30785663
Pulled By: NivekT
fbshipit-source-id: e968687fbb83a59bb46ce6858c6caafa5aa04412
Ivan Yashchuk [Wed, 8 Sep 2021 16:34:46 +0000 (09:34 -0700)]
Add forward mode differentiation for torch.linalg.cholesky and transpose (#62159)
Summary:
This PR adds forward mode differentiation for `torch.linalg.cholesky`, `torch.linalg.cholesky_ex`, and `transpose` functions.
Complex tests for Cholesky fail because for some reason the gradcheck sends matrices full of zeros to `cholesky_jvp` function.
cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7 jianyuh mruberry heitorschueroff walterddr IvanYashchuk xwang233
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62159
Reviewed By: mrshenli
Differential Revision: D30776829
Pulled By: albanD
fbshipit-source-id: 32e5539ed6423eed8c18cce16271330ab0ea8d5e
Hojin Lee [Wed, 8 Sep 2021 16:33:23 +0000 (09:33 -0700)]
Fix typo embedding_renorm_cuda_ (#64542)
Summary:
Fixes #{issue number}
cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7 ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64542
Reviewed By: mrshenli
Differential Revision: D30792842
Pulled By: ngimel
fbshipit-source-id: c9a548256d02b3ce6fb77dd9fb058084f2c91608
Rohan Varma [Wed, 8 Sep 2021 16:17:49 +0000 (09:17 -0700)]
[c10d] Provide failure reason from ProcessGroup when aborting NCCL comm (#64241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64241
When things go wrong, PG NCCL aborts nccl communicators via `ncclCommAbort`, but one issue is that the error is often set to `ncclSystemError` (see https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L176) when that might not be the true cause of the issue; the actual issue can be that some prior work timed out, the communicator was aborted on another rank, etc.
This results in a lot of confusion when debugging jobs with a large no. of processes as the current message for ncclSystemError is not very informative: https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L22
The fix here is to pass in a string exception message from PG NCCL down to `NCCLUtils` which will aim to raise that as the actual issue and not the confusing `ncclSystemError` message.
Test Plan: CI
Reviewed By: pallab-zz, cbalioglu
Differential Revision: D30658855
fbshipit-source-id: 17661dbe0a1bb8cc5b87b637c47634b1f52f54e1
Sameer Deshmukh [Wed, 8 Sep 2021 15:40:01 +0000 (08:40 -0700)]
Change MaxUnpool to accept tensors with 0-dim batch sizes. (#64082)
Summary:
Part of the fix for https://github.com/pytorch/pytorch/issues/38115.
Changes the `MaxUnpool` module to work with 0-dimension batch sizes.
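A small sketch of the newly supported case (shapes illustrative; this assumes the pooling side already accepts empty batches):
```python
import torch

pool = torch.nn.MaxPool2d(2, return_indices=True)
unpool = torch.nn.MaxUnpool2d(2)

x = torch.randn(0, 3, 8, 8)  # batch size 0
out, indices = pool(x)
y = unpool(out, indices)     # previously errored; now yields shape (0, 3, 8, 8)
```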
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64082
Reviewed By: mrshenli
Differential Revision: D30793907
Pulled By: jbschlosser
fbshipit-source-id: d21aa665be5aa18f592b39ef7b4e3cbc632e21ed
johnlu [Wed, 8 Sep 2021 15:24:46 +0000 (08:24 -0700)]
Add Half conversion of bit cast for SYCL kernel (#64340)
Summary:
## Motivation
Enhance the performance of Half/float conversion in SYCL kernels.
## Solution
Add the native SYCL half type to help convert the half from/to float in the kernel code.
## Additional Context
`__SYCL_DEVICE_ONLY__` is a MACRO only valid when compiling the kernel code for SYCL backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64340
Reviewed By: gchanan
Differential Revision: D30720823
Pulled By: ezyang
fbshipit-source-id: e7e770d02df5b2d45da61d2fed3ba59383b3dc3a
Bert Maher [Wed, 8 Sep 2021 15:07:19 +0000 (08:07 -0700)]
[nnc] Provide helpful error messages about turning off the fuser (#64516)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64516
If fuser compilation fails due to a bug (which should be highly
unlikely at this point) we want to direct the user how to unblock themselves by
disabling fusion, in addition to requesting that they report a bug.
ghstack-source-id: 137398537
Test Plan: existing tests
Reviewed By: ZolotukhinM
Differential Revision: D30758051
fbshipit-source-id: 98be89f1b1d4fb3bc816f5b2634c618b9297930e
leslie-fang-intel [Wed, 8 Sep 2021 14:45:12 +0000 (07:45 -0700)]
Allow disabling cache in autocast (automatic mixed precision) (#63552)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63552
In this PR, we want to exclude these 2 cases in the `Autocast` weight cache usages:
- Using `torch.jit.trace` under the `Autocast`
As reported in https://github.com/pytorch/pytorch/issues/50231 and several other discussions, when using `torch.jit.trace` under `Autocast`, the trace process hits Autocast's weight cache and fails. So we should disable the weight cache under the trace process.
- Using `Autocast` with `Grad mode`
- Usually we use `Grad mode` for training. Since the weight changes at every step during training, we don't need to cache the weight.
- For the recommended `Autocast` training case in the [doc](https://pytorch.org/docs/stable/amp.html), `Autocast` clears the cache every step when leaving the context. We should disable the cache to save these clear operations.
```
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)
for input, target in data:
    optimizer.zero_grad()
    with autocast():
        output = model(input)
        loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
```
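A hedged sketch of the tracing case, assuming the `cache_enabled` knob this change plumbs through (the exact spelling may differ across versions):
```python
import torch

model = torch.nn.Linear(4, 4).cuda()
example = torch.randn(2, 4, device="cuda")

# disable the weight cache so torch.jit.trace records real casts each step
with torch.cuda.amp.autocast(cache_enabled=False):
    traced = torch.jit.trace(model, example)
```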
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D30644913
Pulled By: ezyang
fbshipit-source-id: ad7bc87372e554e7aa1aa0795e9676871b3974e7
Protonu Basu [Wed, 8 Sep 2021 14:11:38 +0000 (07:11 -0700)]
Adding support for lowering 4Bit EmbeddingBag Operator (#5806)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/5806
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64001
Add 4 bit embeddingbag operator in acc_ops.
Test Plan: Let CI run.
Reviewed By: jfix71
Differential Revision: D30532824
fbshipit-source-id: bf476c9710477792aae202dacf64e23539c33bd9
Freey0 [Wed, 8 Sep 2021 13:40:54 +0000 (06:40 -0700)]
restore test_inplace_comparison_ops_require_inputs_have_same_dtype Expected behavior (#64267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64267
This test expects every operation to throw a runtime error.
It also reinserts the in-place operation test and fixes a bug in the comparison operations.
Fixes #64018
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D30720915
Pulled By: ezyang
fbshipit-source-id: 215a6556d20770f70f4ced1c1f9a9753933f1d37
Zafar Takhirov [Wed, 8 Sep 2021 11:57:28 +0000 (04:57 -0700)]
[quant] AO migration of the `quantize.py` (resubmission) (#64445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64445
AO Team is migrating the existing torch.quantization into torch.ao.quantization. We are doing it one file at a time to make sure that the internal callsites are updated properly.
This migrates the quantize.py from torch.quantization to torch.ao.quantization.
At this point both locations will be supported. Eventually the torch.quantization will be deprecated.
Test Plan: `buck test mode/dev //caffe2/test:quantization`
Reviewed By: HDCharles
Differential Revision: D30734870
fbshipit-source-id: dc204f3cc46bff2cc81c95159eab9d333b43bb4b
Mikhail Zolotukhin [Wed, 8 Sep 2021 07:22:05 +0000 (00:22 -0700)]
[TensorExpr] Don't rely on exceptions in Vectorizer. (#64609)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64609
We've been using exceptions to indicate whether vectorization succeeded
or not, but that posed some problems (e.g., we spent too much time
symbolicating these exceptions). This change converts this mechanism to
a standard error return code.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D30795342
Pulled By: ZolotukhinM
fbshipit-source-id: 16e38b37bcdd78ceb438ac814cc377f35b058e17
Jordan Fix [Wed, 8 Sep 2021 05:43:04 +0000 (22:43 -0700)]
[fx_const_fold] Fix constant folding for attrs in submodule hierarchies (#64342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64342
Previously we weren't handling the case where an attribute was in a module that wasn't the root.
Test Plan: Added unit test coverage.
Reviewed By: yinghai
Differential Revision: D30691730
fbshipit-source-id: b39b5cf748c4c882f315a4f32b51ad88cc7a43ed
Hendrik Schröter [Wed, 8 Sep 2021 03:14:08 +0000 (20:14 -0700)]
Add __ge__ to TorchVersion (#64565)
Summary:
This PR adds a greater-equal comparison so that the base class's (str) comparison method is not used.
This is necessary for a correct comparison with a version string.
Previously the following was the case:
```py
>>> torch.__version__
'1.10.0.dev20210830+cpu'
>>> torch.__version__>"1.9"
True
>>> torch.__version__>="1.9"
False # Wrong output since the base class (str) was used for __ge__ comparison
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64565
Reviewed By: raghuramank100
Differential Revision: D30790463
Pulled By: mrshenli
fbshipit-source-id: 79c680f8b448001b34d3e5d5332124a78bea4e34
Maksim Levental [Wed, 8 Sep 2021 02:57:12 +0000 (19:57 -0700)]
add out variant of linear (#61801)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61801
Resubmitting because the previous PR became unrecoverable after changes were made incorrectly in the stack.
Test Plan: Imported from OSS
Reviewed By: desertfire
Differential Revision: D29812510
Pulled By: makslevental
fbshipit-source-id: ba9685dc81b6699724104d5ff3211db5852370a6
Steven Jin [Wed, 8 Sep 2021 02:00:18 +0000 (19:00 -0700)]
Fix building docs instructions (#64508)
Summary:
Fixes #64507
Removed a duplicate instruction and linted the file a bit (consistent spacing around codeblocks/headers, adding code types in codeblocks, removing `$` from bash code blocks when unnecessary).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64508
Reviewed By: raghuramank100
Differential Revision: D30791164
Pulled By: mrshenli
fbshipit-source-id: a00db32dcfdd1ecc194c836f31174c806062eb6d
Nikita Shulga [Wed, 8 Sep 2021 01:46:58 +0000 (18:46 -0700)]
Fix quicklint (#64612)
Summary:
Fixes land-race introduced by https://github.com/pytorch/pytorch/commit/a22c936b6398f5cfd959b3e09622db4d90d61050
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64612
Reviewed By: ngimel
Differential Revision: D30798648
Pulled By: malfet
fbshipit-source-id: ca546f68141d44493deba7bbf840e5f9662e8558
Natalia Gimelshein [Wed, 8 Sep 2021 01:43:49 +0000 (18:43 -0700)]
Revert D29998114: [pytorch][PR] enable bf16 mkldnn path for gemm
Test Plan: revert-hammer
Differential Revision: D29998114 (https://github.com/pytorch/pytorch/commit/acc9f9afc8f2be70d7f5d3248ca1760e0336b3b8)
Original commit changeset: 459dc5874c63
fbshipit-source-id: 1994623a3afc22a94bd0cf5de766b023185f5238
Don Jang [Wed, 8 Sep 2021 01:21:41 +0000 (18:21 -0700)]
[JIT] Fix a bug of rejecting ops with AliasAnalysisKind::CONSERVATIVE (#64336)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64336
Currently AliasDB rejects any user-defined ops with `AliasAnalysisKind::CONSERVATIVE` if they do not have special treatment for alias analysis. For example, the following alias schema gets rejected:
```
m.def(torch::schema(
"namescope::my_op(...) -> ...",
c10::AliasAnalysisKind::CONSERVATIVE));
```
This rejection condition is contradictory: AliasDB can handle ops with `CONSERVATIVE` in a general way without any special casing at https://fburl.com/diffusion/op5u72sk calling https://fburl.com/diffusion/h3aws5dd which seems very appropriate to be conservative for alias analysis.
This change corrects the rejection condition to trigger for ops that *have* special casing but are also marked `CONSERVATIVE`, since the two cannot be used simultaneously.
Test Plan:
Confirmed that
```
m.def(torch::schema(
"namescope::my_op(...) -> ...",
c10::AliasAnalysisKind::CONSERVATIVE));
```
gets accepted and `my_op`'s all inputs and outputs are put to point to wildcard(*) by AliasDB.
Reviewed By: eellison
Differential Revision: D30690121
fbshipit-source-id: 431cc1a84edd5227f52b44a0fd85d5eb16f3c288
Elias Ellison [Wed, 8 Sep 2021 01:19:14 +0000 (18:19 -0700)]
Add symbolic shape comparison optimization (#64300)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64300
Test Plan: Imported from OSS
Reviewed By: driazati
Differential Revision: D30738146
Pulled By: eellison
fbshipit-source-id: 96287798535b367f23d3e9430d70fc02c59744ab
Elias Ellison [Wed, 8 Sep 2021 01:19:14 +0000 (18:19 -0700)]
Refactor to use shape arguments (#64299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64299
Test Plan: Imported from OSS
Reviewed By: driazati
Differential Revision: D30738141
Pulled By: eellison
fbshipit-source-id: 37ca30de81349ecf23d8656291863737b6ad6d96
Elias Ellison [Wed, 8 Sep 2021 01:19:14 +0000 (18:19 -0700)]
Add view with negative dim (#63516)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63516
How to review: pretty much just check that the inputs generated are a good representation of the op semantics; that should be sufficient for correctness. As a bonus, you can also double-check the op size semantics by going to https://codebrowser.bddppq.com/pytorch/pytorch/, typing in native::{op_name}, and looking at the op implementation.
Test Plan: Imported from OSS
Reviewed By: driazati
Differential Revision: D30738143
Pulled By: eellison
fbshipit-source-id: c7cd01cb2c8a13cb2664415f3d98aedec19a8e07
Elias Ellison [Wed, 8 Sep 2021 01:19:14 +0000 (18:19 -0700)]
Generalize expand logic (#63615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63615
How to review: pretty much just check that the inputs generated are a good representation of the op semantics; that should be sufficient for correctness. As a bonus, you can also double-check the op size semantics by going to https://codebrowser.bddppq.com/pytorch/pytorch/, typing in native::{op_name}, and looking at the op implementation.
Test Plan: Imported from OSS
Reviewed By: driazati
Differential Revision: D30738148
Pulled By: eellison
fbshipit-source-id: 4ef74a9c9b39c0beb73949e63aa844c46ab637eb
Elias Ellison [Wed, 8 Sep 2021 01:19:14 +0000 (18:19 -0700)]
Add permute, arange (#63407)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63407
Test Plan: Imported from OSS
Reviewed By: driazati
Differential Revision: D30738149
Pulled By: eellison
fbshipit-source-id: 36d572488408d38b0643aa93cb08aab5c45218ad
Elias Ellison [Wed, 8 Sep 2021 01:19:14 +0000 (18:19 -0700)]
Add support for slice, select with int, index_select (#63365)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63365
Test Plan: Imported from OSS
Reviewed By: driazati
Differential Revision: D30738144
Pulled By: eellison
fbshipit-source-id: 7e0c572209bdc6e62ecb4fd1f06f80291de69803
Elias Ellison [Wed, 8 Sep 2021 01:19:14 +0000 (18:19 -0700)]
Add squeeze, unsqueeze, transpose shape functions (#63099)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63099
These are checked by OpInfos, which represent all of the inputs and semantics of the operators, so it should be an easy stamp.
Test Plan: Imported from OSS
Reviewed By: desertfire, astaff
Differential Revision: D30347514
Pulled By: eellison
fbshipit-source-id: 37b4c9ecd8c222cc12bf39166181464b43218830
Elias Ellison [Wed, 8 Sep 2021 01:19:14 +0000 (18:19 -0700)]
Add batch of unary functions (#63050)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63050
Test Plan: Imported from OSS
Reviewed By: priyaramani, astaff
Differential Revision: D30347513
Pulled By: eellison
fbshipit-source-id: abaf641778671d17df87a2b7b47bad7501a91b5a
Yanli Zhao [Wed, 8 Sep 2021 00:59:13 +0000 (17:59 -0700)]
Back out "update rpc tensorpipe logic for sparse tensors" (#64575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64575
Original commit changeset: daee9a567645
Test Plan: unit test
Reviewed By: gcramer23
Differential Revision: D30778736
fbshipit-source-id: 8d9386158fb6a3d025c149cdc37558d57c615e9f
lezcano [Wed, 8 Sep 2021 00:22:49 +0000 (17:22 -0700)]
Use trsm for triangular_solve in CPU (#63567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63567
The current implementation called trtrs for CPU and trsm for CUDA.
See https://github.com/pytorch/pytorch/issues/56326#issuecomment-825496115 for a discussion on the differences between
these two functions and why we prefer trsm vs trtrs on CUDA.
This PR also exposes the `side` argument of this function which is used
in the second PR of this stack to optimise the number of copies one needs to make
when preparing the arguments to be sent to the backends.
It also changes the use of `bool`s to a common enum type to represent
whether a matrix is transposed / conj transposed, etc. This makes the API
consistent, as before, the behaviour of these functions with `transpose=True`
and `conjugate_transpose=True` it was not well defined.
Functions to transform this type into the specific types / chars for the different
libraries are provided under the names `to_blas`, `to_lapack`, `to_magma`, etc.
This is the first of a stack of PRs that aim to improve the performance of
`linalg.solve_triangular`. `trsm` has an extra parameter (`side`), which allows to
ellide the copy of the triangular matrix in many cases.
Fixes https://github.com/pytorch/pytorch/issues/56326
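A quick correctness sketch of the user-facing op whose CPU path this switches to trsm (values illustrative):
```python
import torch

A = torch.randn(3, 3).tril()
A.diagonal().add_(3.0)  # keep the triangular system well conditioned
b = torch.randn(3, 2)

x = torch.triangular_solve(b, A, upper=False).solution
assert torch.allclose(A @ x, b, atol=1e-5)
```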
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D30566479
Pulled By: mruberry
fbshipit-source-id: 3831af9b51e09fbfe272c17c88c21ecf45413212
Tao Xu [Tue, 7 Sep 2021 22:36:11 +0000 (15:36 -0700)]
[iOS][Metal] Add aten::hardswish (#64588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64588
Add `aten::hardswish` to run the mobilenetv3 model from torchvision.
ghstack-source-id: 137479323
Test Plan:
- buck test pp-macos
- circleCI
Reviewed By: beback4u
Differential Revision: D30781008
fbshipit-source-id: 83454869195ef4ab50570ea9b3bf2a55f32a3e86
kshitij12345 [Tue, 7 Sep 2021 22:24:08 +0000 (15:24 -0700)]
[special] Alias igamma, igammac to special.gammainc, special.gammaincc (#61902)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345
Also added relevant OpInfo
TODO:
* [x] Check rendered docs gammainc : https://docs-preview.pytorch.org/61902/special.html#torch.special.gammainc
* [x] Check rendered docs gammaincc: https://docs-preview.pytorch.org/61902/special.html#torch.special.gammaincc
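A small sanity sketch of the new aliases (values illustrative):
```python
import torch

a = torch.tensor([1.0, 2.0])
x = torch.tensor([0.5, 1.5])

# the special-namespace names alias the existing ops
assert torch.equal(torch.special.gammainc(a, x), torch.igamma(a, x))
assert torch.equal(torch.special.gammaincc(a, x), torch.igammac(a, x))
```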
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61902
Reviewed By: ngimel
Differential Revision: D30761428
Pulled By: mruberry
fbshipit-source-id: 06a16432873357958d53364f12a4e91c29779d26
Mike Ruberry [Tue, 7 Sep 2021 22:21:05 +0000 (15:21 -0700)]
Disables four failing distributions tests on windows (#64596)
Summary:
Per title. Unblocks CI. See https://github.com/pytorch/pytorch/issues/64595.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64596
Reviewed By: mrshenli
Differential Revision: D30787296
Pulled By: mruberry
fbshipit-source-id: 84b90cb25c0185f1851db02425ea40aa13d3e598
driazati [Tue, 7 Sep 2021 22:15:32 +0000 (15:15 -0700)]
Add lint to ensure .github/ pypi dependencies are pinned (#64463)
Summary:
Example failing run: https://github.com/pytorch/pytorch/pull/64463/checks?check_run_id=3501249102
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64463
Reviewed By: janeyx99
Differential Revision: D30744930
Pulled By: driazati
fbshipit-source-id: 4dd97054db1d4c776a4512bc3d664987cd7b6d23
David Riazati [Tue, 7 Sep 2021 22:14:05 +0000 (15:14 -0700)]
Update explicit_ci_jobs to work with GHA (#64598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64598
This adds a filter option rather than an all-or-nothing so it's easier to iterate on a specific job.
```bash
python tools/testing/explicit_ci_jobs.py --filter-gha '*generated-linux-*gcc5.4*'
```
See #64600 for an example usage
NB: If you regenerate the workflows you will need to re-run that command to re-delete everything.
Test Plan: Imported from OSS
Reviewed By: janeyx99
Differential Revision: D30788850
Pulled By: driazati
fbshipit-source-id: a32c266bbd876c396665bceef9a0a961b4586564
Nikita Shulga [Tue, 7 Sep 2021 22:09:39 +0000 (15:09 -0700)]
Move ParallelTBB to GHA (take 2) (#64193)
Summary:
2nd attempt to do the same
Skip failing `TestTensorCreationCPU.test_trilu_indices_cpu`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64193
Reviewed By: mrshenli
Differential Revision: D30779469
Pulled By: malfet
fbshipit-source-id: 5c51fcbb383d0823d0e953d7af181b5f22eda9ab
Mike Iovine [Tue, 7 Sep 2021 21:58:34 +0000 (14:58 -0700)]
[Static Runtime] Add first iter metric (#64457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64457
The first iteration is special since it initializes the memory planner. This change logs and reports first iteration time during benchmarking. It also generates a FAI-PEP output when `generate_ai_pep_output` is set.
Test Plan:
Run any benchmark, and observe:
```
I0902 15:19:32.528977 2492358 impl.cpp:948] PyTorchObserver {"value":6.415958881378174,"unit":"ms","metric":"latency","type":"static_runtime_first_iter"}
...
First iter time: 6.41596 ms
```
Note that this metric is likely to have significantly more noise than the others since we don't have as many data points.
Unit tests: `buck test //caffe2/test:static_runtime`
Reviewed By: d1jang
Differential Revision: D30740619
fbshipit-source-id: 4dcfccd5629f4fa34254fd355073ef19e151245a
Wenliang Zhao [Tue, 7 Sep 2021 21:09:31 +0000 (14:09 -0700)]
add bundle input into AIBench (#64557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64557
MaskRCNN speed depends on how many people are detected in the detection stage. A random input from the dataloader doesn't satisfy this. In order to standardize the benchmarking, we use 2 standard images for benchmarking, with 2/3 people.
Test Plan: AIBench result: https://www.internalfb.com/intern/aibench/details/945883114818980
Reviewed By: axitkhurana
Differential Revision: D30446049
fbshipit-source-id: a2826fdb69e9f840c0afc566c4cbbcde1c2fba89
Facebook Community Bot [Tue, 7 Sep 2021 21:08:25 +0000 (14:08 -0700)]
Automated submodule update: FBGEMM (#64582)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: https://github.com/pytorch/FBGEMM/commit/3ce04fc664beaa1cba1ae0a072c8db99c4ac91de
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64582
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: mrshenli
Differential Revision: D30779695
fbshipit-source-id: 22460a4047e2462e672eb4931e44648ae6bde627
haozhe.zhu [Tue, 7 Sep 2021 19:59:00 +0000 (12:59 -0700)]
enable bf16 mkldnn path for gemm (#61891)
Summary:
# Goal: Integrate mkldnn bf16 Gemm to pytorch
## BF16 Support for mm, addmm, bmm, addbmm, baddbmm, mv, addmv, dot (with the mkldnn matmul primitive):
https://oneapi-src.github.io/oneDNN/group__dnnl__api__matmul.html
For gemm-related ops, we keep all inputs in plain format, so we will not introduce opaque tensors for these ops, saving memory copies here.
![mkldnn bf16 gemm integration](https://user-images.githubusercontent.com/54701539/126263077-4b5134e1-52a7-4fad-94fb-19e13a0377f6.png)
The minimal integration would only dispatch to mkldnn in addmm, but for gemm with 3-D input (with an additional dim for "batch") this would call mkldnn gemm "batch" times. Since mkldnn matmul supports inputs with multiple dims, we directly dispatch to mkldnn gemm in {bmm, addbmm, baddbmm} to reduce the time spent creating the mkldnn memory desc, primitive, etc.
To bridge the different definitions of "bias" between mkldnn (which must be of shape (1, N)) and pytorch (which can have the same shape as the gemm result (M, N)), we use a fused sum to handle it.
## User Case:
The use case is exactly the same as before because no opaque tensor is introduced. Since pytorch already supports the bf16 data type for CPU tensors, we can leverage the existing bf16 gemm UT.
## Gemm performance gain on CPX 28Cores/Socket:
Note: data is collected using PyTorch operator benchmarks: https://github.com/pytorch/pytorch/tree/master/benchmarks/operator_benchmark (with adding bfloat16 dtype)
### use 1 thread on 1 core
### torch.addmm (M, N) * (N, K) + (M, K)
| impl |16x16x16|32x32x32| 64x64x64 | 128x128x128| 256x256x256| 512x512x512|1024x1024x1024|
|:---:|:---:| :---: | :---: | :---: | :---: | :---: | :---: |
| aten-fp32| 4.115us|4.583us|8.230us|26.972us|211.857us|1.458ms|11.258ms|
| aten-bf16 | 15.812us| 105.087us|801.787us|3.767ms|20.274ms|122.440ms|836.453ms|
| mkldnn-bf16 |20.561us |22.510us|24.551us|37.709us|143.571us|0.835ms|5.76ms|
We can see mkldnn-bf16 is better than aten-bf16, but for smaller shapes mkldnn-bf16 is not better than aten-fp32. This is because of oneDNN overhead, which behaves like a "constant" overhead that can be ignored as problems get larger. We also continue to optimize the kernel efficiency and decrease the overhead.
More shapes
| impl |1x2048x2048|2048x1x2048| 2048x2048x1 |
|:---:|:---:| :---: | :---: |
| aten-fp32| 0.640ms|3.794ms|0.641ms|
| aten-bf16 | 2.924ms| 3.868ms|23.413ms|
| mkldnn-bf16 |0.335ms |4.490ms|0.368ms|
### use 1 socket (28 thread, 28 core)
| impl | 256x256x256| 512x512x512|1024x1024x1024| 2048x2048x2048|4096x4096x4096|
|:---:| :---: | :---: | :---: | :---: | :---: |
| aten-fp32| 35.943us |140.315us|643.510us|5.827ms|41.761ms|
| mkldnn-bf16 |53.432us|114.716us|421.858us|2.863ms|23.029ms|
More shapes
| impl |128x2048x2048|2048x128x2048| 2048x2048x128 |
|:---:|:---:| :---: | :---: |
| aten-fp32| 0.561ms|0.458ms|0.406ms|
| mkldnn-bf16 |0.369ms |0.331ms|0.239ms|
We do not show aten-bf16 for this case since aten-bf16 always computes single-threaded and the performance is extremely poor. The trend for this case is similar to 1 thread on 1 core.
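As a usage sketch (shapes illustrative): since inputs stay in plain format, the bf16 CPU gemm path is exercised through the ordinary ops:
```python
import torch

# bf16 CPU gemm; with this change these calls dispatch to the mkldnn
# matmul primitive instead of the slow single-threaded aten-bf16 path
a = torch.randn(256, 256).bfloat16()
b = torch.randn(256, 256).bfloat16()
bias = torch.randn(256, 256).bfloat16()

out = torch.addmm(bias, a, b)
```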
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61891
Reviewed By: iramazanli
Differential Revision: D29998114
Pulled By: VitalyFedyunin
fbshipit-source-id: 459dc5874c638d62f290c96684ca0a694ded4b5a
Anirudh Dagar [Tue, 7 Sep 2021 19:34:15 +0000 (12:34 -0700)]
Array API: Add `torch.linalg.matmul` alias to `torch.matmul` (#63227)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62811
Add `torch.linalg.matmul` alias to `torch.matmul`. Note that the `linalg.matmul` doesn't have a `method` variant.
Also cleaning up `torch/_torch_docs.py` when formatting is not needed.
cc IvanYashchuk Lezcano mruberry rgommers
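A minimal sketch of the alias in use (shapes illustrative):
```python
import torch

a = torch.randn(2, 3)
b = torch.randn(3, 4)

# the alias dispatches to the same op; note there is no method variant
assert torch.equal(torch.linalg.matmul(a, b), torch.matmul(a, b))
```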
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63227
Reviewed By: mrshenli
Differential Revision: D30770235
Pulled By: mruberry
fbshipit-source-id: bfba77dfcbb61fcd44f22ba41bd8d84c21132403
Jane Xu [Tue, 7 Sep 2021 19:30:16 +0000 (12:30 -0700)]
[small BE] .github: refactor concurrency into a common macro (#64587)
Summary:
By using a macro for these concurrency groups, we can edit just one place for the linux and windows workflows (vs 2).
I wanted to loop all the other workflow files in as well, but since those aren't generated, the macros won't work the same way.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64587
Reviewed By: mrshenli
Differential Revision: D30783224
Pulled By: janeyx99
fbshipit-source-id: ae16ebb12d2d63a563d28f0ce88e280f68ed4b9b
Kevin Tse [Tue, 7 Sep 2021 18:34:27 +0000 (11:34 -0700)]
Fixes issue related torch.trapezoid broadcasting behavior and documentation (#64054)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64054
Fixes #63608
cc mruberry rgommers heitorschueroff
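For context, a quick usage sketch of the op whose broadcasting behavior and docs this fixes (values illustrative):
```python
import torch

y = torch.tensor([1.0, 2.0, 3.0])
x = torch.tensor([0.0, 1.0, 3.0])

torch.trapezoid(y, x)  # tensor(6.5000): trapezoidal rule over non-uniform x
```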
Test Plan: Imported from OSS
Reviewed By: saketh-are
Differential Revision: D30617078
Pulled By: NivekT
fbshipit-source-id: 815896ec56d447562790df4d662e94fd13457e2a
Danielle Pintz [Tue, 7 Sep 2021 18:34:08 +0000 (11:34 -0700)]
Add space in Feature Request issue template (#64563)
Summary:
Add space between emoji and text in Feature Request issue template
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64563
Reviewed By: janeyx99
Differential Revision: D30779429
Pulled By: seemethere
fbshipit-source-id: 3625299923a7022fa66473633524a6620d58188b
Lu Fang [Tue, 7 Sep 2021 18:23:52 +0000 (11:23 -0700)]
Clean up op BC check list (#64584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64584
It has been a while since the last clean-up. The list is really long.
Test Plan: ci
Reviewed By: hl475
Differential Revision: D30779350
fbshipit-source-id: 908b47d0b9a16b784aad6a34c5c87f923500c247
Ilqar Ramazanli [Tue, 7 Sep 2021 18:02:11 +0000 (11:02 -0700)]
[doc][hackathon] To add Adam Optimizer to the documentation (#63251)
Summary:
It has been discussed before that adding descriptions of optimization algorithms to the PyTorch Core documentation may result in a nice optimization research tutorial. In the following tracking issue we list all the necessary algorithms and links to the originally published papers: https://github.com/pytorch/pytorch/issues/63236.
In this PR we are adding a description of the Adam algorithm to the documentation. For more details, we refer to the paper https://arxiv.org/abs/1412.6980.
<img width="442" alt="Screen Shot 2021-08-27 at 6 37 54 PM" src="https://user-images.githubusercontent.com/
73658284/
131195297-
35fce613-3691-4fed-b42d-
db234d4fcd7c.png">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63251
Reviewed By: albanD
Differential Revision: D30779163
Pulled By: iramazanli
fbshipit-source-id: 319a80fc3952793b0d064d0e641ddc1de3c05a86
Yanli Zhao [Tue, 7 Sep 2021 16:28:30 +0000 (09:28 -0700)]
minor fix for elastic doc (#64531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64531
fix #64530
Test Plan: unit test
Reviewed By: mrshenli
Differential Revision: D30760879
fbshipit-source-id: 94ed1476e886513427d928a36f5be6b9bfff0826
Philip Meier [Tue, 7 Sep 2021 15:57:43 +0000 (08:57 -0700)]
deprecate dtype getters from `torch.testing` namespace (#63554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63554
Following https://github.com/pytorch/pytorch/pull/61840#issuecomment-884087809, this deprecates all the dtype getters publicly exposed in the `torch.testing` namespace. The reason for this is twofold:
1. If someone is not familiar with the C++ dispatch macros PyTorch uses, the names are misleading. For example `torch.testing.floating_types()` will only give you `float32` and `float64` skipping `float16` and `bfloat16`.
2. The dtype getters provide very minimal functionality that can be easily emulated by downstream libraries.
We thought about [providing a replacement](https://gist.github.com/pmeier/3dfd2e105842ad0de4505068a1a0270a), but ultimately decided against it. The major problem is BC: by keeping it, either the namespace is getting messy again after a new dtype is added or we need to somehow version the return values of the getters.
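As a hedged illustration of point 2, a downstream library can emulate the getters in a few lines (the names here are hypothetical, not a torch API):
```python
import torch

# hypothetical downstream replacement for torch.testing.floating_types();
# note the deprecated getter only covered float32/float64
FLOATING_TYPES = (torch.float32, torch.float64)
FLOATING_TYPES_AND_HALF = FLOATING_TYPES + (torch.float16, torch.bfloat16)
```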
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D30662206
Pulled By: mruberry
fbshipit-source-id: a2bdb10ab02ae665df1b5b76e8afa9af043bbf56
Ilqar Ramazanli [Tue, 7 Sep 2021 15:41:09 +0000 (08:41 -0700)]
To change WarmUp Scheduler with ConstantLR and LinearLR (#64395)
Summary:
Partially unblocks https://github.com/pytorch/vision/issues/4281
Previously we added WarmUp schedulers to PyTorch Core in PR https://github.com/pytorch/pytorch/pull/60836, which had two modes of execution - linear and constant, depending on the warmup function.
In this PR we change this interface to a more direct form, separating the linear and constant modes into separate schedulers. In particular,
```Python
scheduler1 = WarmUpLR(optimizer, warmup_factor=0.1, warmup_iters=5, warmup_method="constant")
scheduler2 = WarmUpLR(optimizer, warmup_factor=0.1, warmup_iters=5, warmup_method="linear")
```
will look like
```Python
scheduler1 = ConstantLR(optimizer, warmup_factor=0.1, warmup_iters=5)
scheduler2 = LinearLR(optimizer, warmup_factor=0.1, warmup_iters=5)
```
correspondingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64395
Reviewed By: datumbox
Differential Revision: D30753688
Pulled By: iramazanli
fbshipit-source-id: e47f86d12033f80982ddf1faf5b46873adb4f324
Mike Iovine [Tue, 7 Sep 2021 15:04:50 +0000 (08:04 -0700)]
[JIT] Freeze unrolls constant loops (#63614)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63614
There are a number of optimizations (`RemoveListMutation` in particular) that are tied to loop unrolling in `runOptimizations`. However, these were not invoked from `freeze_module` since the freezing pass should be idempotent.
This diff makes `runOptimizations` run `UnrollConstantLoops` instead of `UnrollLoops`. `freeze_module` is then able to run these optimizations.
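A minimal sketch of the kind of module that benefits (the module is illustrative): the constant trip-count loop is unrolled during freezing, which in turn lets `RemoveListMutation` fire:
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        outs = []
        for i in range(3):  # constant trip count: unrolled by freezing
            outs.append(x + i)
        return torch.stack(outs)

frozen = torch.jit.freeze(torch.jit.script(M()).eval())
```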
Test Plan: Observed that `freeze_module` applies `RemoveListMutation`
Reviewed By: eellison
Differential Revision: D30437356
fbshipit-source-id: cba04bd958a48ad51b151aa3264f3d5bbb1fc2a4
Kefei Lu [Tue, 7 Sep 2021 11:00:49 +0000 (04:00 -0700)]
Fix fx2trt SplitterBase non_tensor_input logic (#64286)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64286
During graph splitting, `_SplitterBase` supports taking into consideration whether the subnet boundary nodes
produces "supported" outputs that will cross the acc/non-acc boundary. Specifically, if the backend only
supports Tensor-based data passing cross boundary, then we cannot split the graph at a place where the node
output is a non-Tensor type (e.g., `Tuple[Tensor]`).
There's currently a bug in this logic: it does not correctly detect the output type of a Node. Instead of
using `Node.meta['tensor_meta']`, we should check `Node.meta['type']`.
`Node.meta['tensor_meta']` is not appropriate because this key will exist if the node output is an iterable
and one of the elements is of type `Tensor`. So `Tuple[Tensor]` will be wrongly considered "supported".
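A hedged sketch of the corrected check (the helper name is illustrative, not the actual `_SplitterBase` code):
```python
import torch

def produces_tensor_output(node) -> bool:
    # Node.meta['tensor_meta'] may be present even for Tuple[Tensor] outputs,
    # so read the real output type from Node.meta['type'] instead
    return node.meta.get("type") is torch.Tensor
```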
Test Plan:
arc lint
run CI tests
Reviewed By: yinghai, 842974287
Differential Revision: D30617147
fbshipit-source-id: e8ba70dfaddc05cafb8037d58fca73b7ccbb1a49
Ivan Yashchuk [Tue, 7 Sep 2021 07:04:14 +0000 (00:04 -0700)]
Update error messages that use LAPACK error codes (#63864)
Summary:
This PR updates the `batchCheckErrors` and `singleCheckErrors` functions so that the error messages are defined only once.
The `batchCheckErrors` function now reuses `singleCheckErrors`.
Fixes https://github.com/pytorch/pytorch/issues/63220, fixes https://github.com/pytorch/pytorch/issues/59779
cc jianyuh nikitaved pearu mruberry heitorschueroff walterddr IvanYashchuk xwang233 Lezcano
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63864
Reviewed By: ngimel
Differential Revision: D30672933
Pulled By: mruberry
fbshipit-source-id: 0ba37ff98ef278efdb12c3890aa07d687047da7a
Anirudh Dagar [Tue, 7 Sep 2021 06:55:53 +0000 (23:55 -0700)]
Support `torch.concat` alias, add `cat` OpInfo & remove OpInfo test_out skips {cat, stack, hstack, vtack, dstack} (#62560)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61767
## Changes
- [x] Add `torch.concat` alias to `torch.cat`
- [x] Add OpInfo for `cat`/`concat`
- [x] Fix `test_out` skips (Use `at::native::resize_output` or `at::native::resize_output_check`)
- [x] `cat`/`concat`
- [x] `stack`
- [x] `hstack`
- [x] `dstack`
- [x] `vstack`/`row_stack`
- [x] Remove redundant tests for `cat`/`stack`
~I've not added `cat`/`concat` to OpInfo `op_db` yet, since cat is a little more tricky than other OpInfos (should have a lot of tests) and currently there are no OpInfos for that. I can try to add that in a subsequent PR or maybe here itself, whatever is suggested.~
**Edit**: cat/concat OpInfo has been added.
**Note**: I've added the named tensor support for `concat` alias as well, maybe that's out of spec in `array-api` but it is still useful for consistency in PyTorch.
Thanks to krshrimali for guidance on my first PR :))
cc mruberry rgommers pmeier asmeurer leofang AnirudhDagar asi1024 emcastillo kmaehashi heitorschueroff krshrimali
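A minimal sketch of the alias in use (shapes illustrative):
```python
import torch

xs = [torch.ones(2, 2), torch.zeros(2, 2)]

# concat is a pure alias of cat, so the results match exactly
assert torch.equal(torch.concat(xs, dim=0), torch.cat(xs, dim=0))
```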
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62560
Reviewed By: saketh-are
Differential Revision: D30762069
Pulled By: mruberry
fbshipit-source-id: 6985159d1d9756238890488a0ab3ae7699d94337
Natalia Gimelshein [Tue, 7 Sep 2021 04:24:38 +0000 (21:24 -0700)]
Remove dead code from THC (THCApply.cuh) (#64559)
Summary:
cc peterbell10
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64559
Reviewed By: mruberry
Differential Revision: D30769526
Pulled By: ngimel
fbshipit-source-id: 034a5c778a2b902cffa57b76511fa0dcdea26825
Nikita Shulga [Mon, 6 Sep 2021 18:37:39 +0000 (11:37 -0700)]
Move ParallelNative and PureTorch to GHA (#64452)
Summary:
Separate ParallelTBB move to https://github.com/pytorch/pytorch/pull/64193 as it requires some further investigation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64452
Reviewed By: seemethere, janeyx99
Differential Revision: D30738337
Pulled By: malfet
fbshipit-source-id: 81c46423e903058bd1a3e8553e8a10ce978eeefd
Shen Xu [Sun, 5 Sep 2021 23:44:13 +0000 (16:44 -0700)]
Mark functions in backend header as inline to suppress warning (#64098)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64098
Reviewed By: kimishpatel, iseeyuan
Differential Revision: D30593104
fbshipit-source-id: 328196b9bc4a89a28ad89bede7e337107976c303
Bert Maher [Sun, 5 Sep 2021 23:06:09 +0000 (16:06 -0700)]
Revert D30745610: [nnc] Make our exceptions c10::Errors, get C++ stacktraces
Test Plan: revert-hammer
Differential Revision: D30745610 (https://github.com/pytorch/pytorch/commit/18b2751ea143374adbb690889427e06a9334da05)
Original commit changeset: a1cfaa7364ef
fbshipit-source-id: 9b716053b96a65745240ddef1c456c44d5d09671
Sangbaek Park [Sun, 5 Sep 2021 19:52:46 +0000 (12:52 -0700)]
[Vulkan] Code Quality: Remove duplicate code for hardshrink and leaky_relu functions (#64405)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64405
Code quality improvement: removed duplicate code for hardshrink and leaky_relu functions.
ghstack-source-id: 137319378
Test Plan:
```
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
```
Reviewed By: SS-JIA
Differential Revision: D30690251
fbshipit-source-id: 5729d1f32946e42f41df77756a8313f297dd822f
Mike Ruberry [Sun, 5 Sep 2021 09:23:31 +0000 (02:23 -0700)]
Back out "nn.functional.linear OpInfo" (#64517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64517
Original commit changeset: ca41dbd98176
Test Plan: PyTorch CI
Reviewed By: ngimel
Differential Revision: D30758201
fbshipit-source-id: 2d3274293d340373b8af86083336607818019619
Chris Cai [Sun, 5 Sep 2021 03:54:29 +0000 (20:54 -0700)]
Back out "
D30740897 Add fusion enabled apis" (#64500)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64500
D30740897 (https://github.com/pytorch/pytorch/commit/39aeb3bf63f61664bc6c4a929a80a660365c2a5e) broke caffe2/torch/fb/module_factory/optimizers/tests:test_full_sync_optimizer_needed_coverage (https://fburl.com/test/mb46jxon) and blocked training_platform_unit_tests
{F660271297}
multisect results confirm
```
multisect --config FBCODE_TEST bisect 844424966128796 --workers 16 revisions --begin 09629edc --end fc86b434
D30740897 (https://github.com/pytorch/pytorch/commit/39aeb3bf63f61664bc6c4a929a80a660365c2a5e)
```
{F660271232}
Test Plan:
```
buck test mode/opt //caffe2/torch/fb/module_factory/optimizers/tests:test_full_sync_optimizer_needed_coverage
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/4785074671474181
✓ Pass: caffe2/torch/fb/module_factory/optimizers/tests:test_full_sync_optimizer_needed_coverage - main (3.729)
Summary
Pass: 1
```
Differential Revision: D30753916
fbshipit-source-id: 302fd4113ef1f3069846be03edc2300d82b66719
Bert Maher [Sun, 5 Sep 2021 03:29:44 +0000 (20:29 -0700)]
[nnc] Make our exceptions c10::Errors, get C++ stacktraces (#64332)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64332
With this diff, if a compiler bug occurs (unlikely, I know!) we'll be able to get a c++ stacktrace leading to the exception, rather than just a terse message. E.g.,
```
RuntimeError: UNSUPPORTED DTYPE
Exception raised from compilation_error at ../torch/csrc/jit/tensorexpr/exceptions.h:32 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f966659b2eb in /fsx/users/bertrand/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x376f099 (0x7f966a195099 in /fsx/users/bertrand/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x3763bf5 (0x7f966a189bf5 in /fsx/users/bertrand/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: torch::jit::tensorexpr::CudaCodeGen::Initialize() + 0xdd8 (0x7f966a193368 in /fsx/users/bertrand/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
```
Test Plan: Imported from OSS
Reviewed By: huiguoo
Differential Revision: D30745610
Pulled By: bertmaher
fbshipit-source-id: a1cfaa7364ef4120de834e9cbe57ced1d082ab4e
Peter Bell [Sat, 4 Sep 2021 19:37:09 +0000 (12:37 -0700)]
Ensure num_threads is initialized in get_num_threads (#64486)
Summary:
Possible source of the recent layernorm CI failures. `lazy_init_num_threads` appears at the top of `parallel_for` and can change the number of threads set. So, we need to ensure `num_threads` is initialized during `get_num_threads` calls as well. It's already done this way for OpenMP, but is missing from other parallel backends.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64486
Reviewed By: mruberry
Differential Revision: D30752615
Pulled By: ngimel
fbshipit-source-id: 085873ce312edbee1254c0aaae30dec7fcfe2c57
Facebook Community Bot [Sat, 4 Sep 2021 07:43:25 +0000 (00:43 -0700)]
Automated submodule update: FBGEMM (#64338)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: https://github.com/pytorch/FBGEMM/commit/9ccb2714a93e8324119676f6b3dc1c26eef0a703
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64338
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: jspark1105
Differential Revision: D30690319
fbshipit-source-id: 884d1f950cd1f7d2a77b79affb9215f285d5d0da
Ivan Yashchuk [Sat, 4 Sep 2021 01:48:41 +0000 (18:48 -0700)]
Fix `copy_transpose_valid` condition for `copy_same_type_transpose_` (#64425)
Summary:
Thanks to ngimel for the hint where the problem might be (https://github.com/pytorch/pytorch/issues/64358#issuecomment-910868849)!
I added a test that fails on master to verify the fix. The shape `(60, 60)` was chosen because of `MIN_SZ = 60 * 60` in `copy_transpose_valid`.
Fixes https://github.com/pytorch/pytorch/issues/64358
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64425
Reviewed By: mruberry
Differential Revision: D30752725
Pulled By: ngimel
fbshipit-source-id: f40370ea8365c94e30f8e8a3dcab5f3b3462464a
Michael Carilli [Fri, 3 Sep 2021 20:21:23 +0000 (13:21 -0700)]
[CUDA graphs] Error if attempting to capture uncapturable nccl (#64440)
Summary:
NCCL < 2.9.6 is not capturable. Attempting to capture it can cause nasty behavior (for example, I've seen capture succeed, but replay silently hang). Pytorch should preempt this with a friendlier error.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64440
Reviewed By: mruberry
Differential Revision: D30733884
Pulled By: ngimel
fbshipit-source-id: 5f2df3cf5cc0e5e68f49bf22a80d9f58064dc7ec
Nikita Shulga [Fri, 3 Sep 2021 17:21:01 +0000 (10:21 -0700)]
Fix logical typo in _compare_trilu_indices (#64468)
Summary:
I'm pretty sure that repeating the same call twice is meaningless, and the intent was to call `tril`/`tril_indices` in the first case and `triu`/`triu_indices` in the other.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64468
Reviewed By: mruberry
Differential Revision: D30744978
Pulled By: malfet
fbshipit-source-id: 7cd36789a7ebf1cc263fb2d875e479c05e7588a4
Ansley Ussery [Fri, 3 Sep 2021 13:10:37 +0000 (06:10 -0700)]
Support Union in TorchScript (#64234)
Summary:
This PR is created to replace the https://github.com/pytorch/pytorch/pull/53180 PR stack, which has all the review discussions. A replacement is needed due to a messy Sandcastle issue.
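A minimal sketch of the newly supported annotation (the function is illustrative):
```python
from typing import Union

import torch

@torch.jit.script
def f(x: Union[int, torch.Tensor]) -> torch.Tensor:
    if isinstance(x, int):  # isinstance refines Union[int, Tensor] to int
        return torch.tensor(x)
    return x
```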
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64234
Reviewed By: gmagogsfm
Differential Revision: D30656444
Pulled By: ansley
fbshipit-source-id: 77536c8bcc88162e2c72636026ca3c16891d669a
Kefei Lu [Fri, 3 Sep 2021 06:03:02 +0000 (23:03 -0700)]
Add fx2trt pass for removing duplicate output args (#64461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64461
Fx2TRT does not support duplicate nodes in the output args tuple.
This pass removes duplicate output args from the target subnets and fixes their uses in the top level module where the subnets are called. This pass must be called after acc split on the top-level net and subsequent calls to the acc trace on the subnets.
This pass will change both the subnets and top level module.
Test Plan:
Run:
```
buck run mode/opt -c python.package_style=inplace //caffe2/torch/fb/fx2trt/tests/passes/:test_remove_duplicate_output_args
```
Reviewed By: yinghai
Differential Revision: D30740499
fbshipit-source-id: 98459f7677980b21c7bffda918158001285572db
Elias Ellison [Fri, 3 Sep 2021 05:16:22 +0000 (22:16 -0700)]
Add fusion enabled apis (#64429)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64429
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D30740897
Pulled By: eellison
fbshipit-source-id: 446aa63b5d763f1cfffea62547db7294368e3438
Elias Ellison [Fri, 3 Sep 2021 05:16:22 +0000 (22:16 -0700)]
update optimize_for_inference docs (#64428)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64428
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D30740898
Pulled By: eellison
fbshipit-source-id: b94d2c3deb661a6ba048f19e8c1d5e1799667eeb
James Reed [Fri, 3 Sep 2021 04:11:57 +0000 (21:11 -0700)]
[resubmit][FX] Prototype for guarding against mutable operations in tracing (#64467)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64467
Test Plan: Imported from OSS
Reviewed By: driazati
Differential Revision: D30744870
Pulled By: jamesr66a
fbshipit-source-id: fc652f8b17748f90dbeb83fabf3bd5bb57d6ff1a
Mike Ruberry [Fri, 3 Sep 2021 03:51:38 +0000 (20:51 -0700)]
Skips layer norm OpInfo on tbb platform (#64469)
Summary:
The OpInfo tests appear to be discovering a layer norm x tbb issue that requires investigation. Skipping tests on that platform for now to restore CI signal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64469
Reviewed By: ngimel
Differential Revision: D30745746
Pulled By: mruberry
fbshipit-source-id: 282484cc00b867fac85b7df61430d64277da6421
Peter Bell [Fri, 3 Sep 2021 00:43:59 +0000 (17:43 -0700)]
THC: Cleanup dead code (#64441)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64441
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D30735342
Pulled By: ngimel
fbshipit-source-id: 84ab36f7aec6b8cd7f1f34c19a58a382c06ad68d
driazati [Fri, 3 Sep 2021 00:09:48 +0000 (17:09 -0700)]
Regenerate generated github workflows (#64465)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64465
These were out of date and causing master failures
Test Plan: Imported from OSS
Reviewed By: zhouzhuojie
Differential Revision: D30744594
Pulled By: driazati
fbshipit-source-id: 09a21c3c5d9bc83b368d66cabbafd1ba83302dd3
David Riazati [Thu, 2 Sep 2021 23:58:59 +0000 (16:58 -0700)]
Revert D30732630: [quant] Enable jit tracing on quantizable LSTM
Test Plan: revert-hammer
Differential Revision: D30732630 (https://github.com/pytorch/pytorch/commit/116142143cc2d66c7e582d9f96e00862456fd736)
Original commit changeset: 443e351ebb0e
fbshipit-source-id: 49001392f01366f3b1ccc31139f824c80b86cd40
Zafar Takhirov [Thu, 2 Sep 2021 23:58:36 +0000 (16:58 -0700)]
Revert D30055886: [quant] AO migration of the `quantize.py`
Test Plan: revert-hammer
Differential Revision: D30055886 (https://github.com/pytorch/pytorch/commit/44e3ed88c9a1bd9ee6b0168ba5271a2c6b006cc8)
Original commit changeset: 8ef7470f9fa6
fbshipit-source-id: c5bd3ead43a2d44b9e56872ec5bd7a195bdac725
Jane Xu [Thu, 2 Sep 2021 23:21:52 +0000 (16:21 -0700)]
[POC] .github: Add event name to concurrency (#64402)
Summary:
This would ensure that manually/API-triggered workflows do not cancel other triggered workflows. For example, the manually triggered periodic 11.1 Linux job cancelled the scheduled one here, which we may not want:
![image](https://user-images.githubusercontent.com/31798555/131752175-1c99d56e-d344-46e1-b8ac-9c12bba0569a.png)
This would be helpful later as we use more dispatched workflows (e.g., for bisect functionality).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64402
Reviewed By: malfet
Differential Revision: D30734860
Pulled By: janeyx99
fbshipit-source-id: 220016716094666e9af836fcd716dd529cf23d8a
Garrett Cramer [Thu, 2 Sep 2021 23:11:10 +0000 (16:11 -0700)]
update rpc tensorpipe logic for sparse tensors (#62960)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62960
A bug was filed a few years ago about sending sparse tensors over RPC (#30807).
This PR updates the rpc/tensorpipe logic for CUDA sparse tensors. During serialization, the pickler.cpp implementation breaks a sparse tensor down into two tensors plus metadata. torch/csrc/distributed/rpc/tensorpipe_agent.cpp needs to be updated because it has no logic for sparse tensors: it pushes a single device for a sparse tensor, which is wrong because after serialization there are two tensors, and the second tensor would have no device and hence end up on the wrong target device. tensorpipe_utils.cpp needs to be updated because deserialization happens after the data is received on the target pipe; it takes the two tensors and the metadata and rebuilds the sparse tensor. There are two tpDescriptors but only one tensor after deserialization, so the logic is updated to verify that the sparse tensor is on the correct device using the first tpDescriptor.
This PR also updates ivalue.cpp and ivalue.h to support more paths for sparse COO tensors.
I tested these changes by adding sparse tests to rpc_test.py and dist_autograd_test.py.
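To make the serialization shape concrete, here is a minimal sketch of what "two tensors plus metadata" means for a sparse COO tensor (assumptions about the layout, not code from this diff):
```
import torch

# A sparse COO tensor is effectively two dense tensors plus metadata,
# which is why per-tensor device bookkeeping must account for both.
i = torch.tensor([[0, 1, 1],
                  [2, 0, 2]])             # 2 x nnz coordinate tensor
v = torch.tensor([3.0, 4.0, 5.0])         # nnz values
s = torch.sparse_coo_tensor(i, v, size=(2, 3))
print(s._indices())  # the first tensor that travels over the wire
print(s._values())   # the second tensor
print(s.size())      # part of the metadata used to rebuild the tensor
```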
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D30717285
Pulled By: gcramer23
fbshipit-source-id: daee9a56764550f56b131f9dd8e74e23113d6714
Eli Uriegas [Thu, 2 Sep 2021 23:06:17 +0000 (16:06 -0700)]
Revert D30675780: [FX] Prototype for guarding against mutable operations in tracing
Test Plan: revert-hammer
Differential Revision: D30675780 (https://github.com/pytorch/pytorch/commit/795387477fe90e03cb598f3077a32222896e65dd)
Original commit changeset: b2116b51dcc8
fbshipit-source-id: d4f1173f4989556ea54974f4c2739ef85a705fae
Zafar Takhirov [Thu, 2 Sep 2021 22:56:54 +0000 (15:56 -0700)]
[quant] Enable jit tracing on quantizable LSTM (#64438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64438
The quantizable LSTM didn't support jit tracing because it had several non-traceable paths. We sacrifice some of the user experience to enable tracing.
The main UX feature removed is the user-friendly message for accessing the backward path: when the bidirectional flag is `False`, we used to throw a nice error message if the user tried to access the backward weights. Now the message is the default one (the properties were removed).
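As a rough illustration of what this enables (a minimal sketch; the `torch.nn.quantizable.LSTM` path and shapes are assumptions based on this era of the codebase, not taken from this diff):
```
import torch

# Sketch: trace the quantizable LSTM end to end. Before this change,
# the non-traceable property paths made torch.jit.trace fail.
lstm = torch.nn.quantizable.LSTM(input_size=4, hidden_size=8)
x = torch.randn(5, 1, 4)  # (seq_len, batch, input_size)
traced = torch.jit.trace(lstm, (x,))
out, (h, c) = traced(x)
print(out.shape)  # torch.Size([5, 1, 8])
```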
Test Plan: `buck test mode/dev //caffe2/test:quantization -- test_custom_module_lstm`
Reviewed By: mtl67
Differential Revision: D30732630
fbshipit-source-id: 443e351ebb0e2b636c86dea9691b9bf42ffe618f
James Reed [Thu, 2 Sep 2021 22:15:24 +0000 (15:15 -0700)]
[FX] Prototype for guarding against mutable operations in tracing (#64295)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64295
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D30675780
Pulled By: jamesr66a
fbshipit-source-id: b2116b51dcc87357f0c84192c4c336680875e27a
Eli Uriegas [Thu, 2 Sep 2021 21:49:47 +0000 (14:49 -0700)]
.github: Migrate pytorch_linux_bionic_py_3_6_clang9 to GHA (#64218)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64218
Relies on https://github.com/fairinternal/pytorch-gha-infra/pull/11
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
cc ezyang seemethere malfet walterddr lg20987 pytorch/pytorch-dev-infra bdhirsh
Test Plan: Imported from OSS
Reviewed By: malfet, H-Huang, janeyx99
Differential Revision: D30651516
Pulled By: seemethere
fbshipit-source-id: e5843dfe84f096f2872d88f2e53e9408ad2fe399
Erjia Guan [Thu, 2 Sep 2021 20:35:05 +0000 (13:35 -0700)]
Switch Shuffler to use iter-local buffer (#64195)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64195
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D30642947
Pulled By: ejguan
fbshipit-source-id: d4b52479b4ae37ad693388b9cdb8eed83a136474
Nikita Shulga [Thu, 2 Sep 2021 20:30:51 +0000 (13:30 -0700)]
Disable CircleCI ROCm build (#64434)
Summary:
Per jithunnair-amd's suggestion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64434
Reviewed By: seemethere, janeyx99
Differential Revision: D30732289
Pulled By: malfet
fbshipit-source-id: 1932d0a7d1e648006f8030c8237b187d0709f688
Kevin Tse [Thu, 2 Sep 2021 20:06:18 +0000 (13:06 -0700)]
[DataPipe] removing filter's inheritance from map (#64404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64404
This PR removes `filter`'s inheritance from `map`. This allows `filter` to not have a `__len__` function, which is the behavior we want.
cc VitalyFedyunin ejguan
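A minimal sketch of the resulting behavior (the source pipe here is hypothetical, and the functional `.filter` registration is assumed from this era's datapipes):
```
from torch.utils.data import IterDataPipe

# Sketch: the source pipe has a known length...
class Numbers(IterDataPipe):
    def __iter__(self):
        yield from range(10)

    def __len__(self):
        return 10

# ...but how many elements survive the filter is unknowable without
# iterating, so the filter pipe should not expose __len__ at all.
dp = Numbers().filter(lambda x: x % 2 == 0)
print(list(dp))  # [0, 2, 4, 6, 8]
len(dp)          # now raises TypeError instead of reporting a bogus length
```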
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D30713120
Pulled By: NivekT
fbshipit-source-id: 4d5d07555297ee2bd4b49842c0d26cdc00638f6c