platform/upstream/pytorch.git
Peter Goldsborough [Mon, 26 Nov 2018 17:37:04 +0000 (09:37 -0800)]
Allow torch.utils.cpp_extension.load to load shared libraries that aren't Python modules (#13941)

Summary:
For custom TorchScript operators, `torch.ops.load_library` must be used and passed the path to the shared library containing the custom ops. Our C++ extension machinery is generally meant to build a Python module and import it. This PR gives `torch.utils.cpp_extension.load` an option to just return the shared library path instead of importing it as a Python module, so you can then pass it to `torch.ops.load_library`. This means folks can reuse `torch.utils.cpp_extension.load` and `torch.utils.cpp_extension.load_inline` to write even their custom ops inline. I think t-vi and fmassa will appreciate this.

soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13941

Differential Revision: D13110592

Pulled By: goldsborough

fbshipit-source-id: 37756307dbf80a81d2ed550e67c8743dca01dc20

Adam Paszke [Mon, 26 Nov 2018 17:18:43 +0000 (09:18 -0800)]
Batch more matrix multiplies (#13456)

Summary:
This handles the input pre-multiplication in RNNs, yielding pretty significant speedups in backward times. This pass depends on loop unrolling, so we'll batch only as many elements as the unrolling factor allows.
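A pure-Python toy of the idea (not the actual JIT pass): pre-multiplying all timestep inputs with the shared weight in one batched matmul is mathematically equivalent to one matmul per timestep, but amortizes per-op overhead. All names here are illustrative.

```python
# Toy sketch: batching several matmuls against a shared weight matrix.

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

W = [[1, 2], [3, 4]]                        # (2 x 2) weight, shared across timesteps
xs = [[[1], [0]], [[0], [1]], [[2], [3]]]   # three (2 x 1) timestep inputs

# Unbatched: one matmul per timestep.
per_step = [matmul(W, x) for x in xs]

# Batched: concatenate timesteps along columns, do a single matmul, split.
concat = [[x[i][0] for x in xs] for i in range(2)]          # (2 x 3)
batched = matmul(W, concat)
split = [[[batched[i][j]] for i in range(2)] for j in range(len(xs))]

assert split == per_step
```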

cc mruberry ngimel zou3519 zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13456

Differential Revision: D12920339

Pulled By: zou3519

fbshipit-source-id: 5bcd6d259c054a6dea02ae09a9fdf9f030856443

Gregory Chanan [Mon, 26 Nov 2018 15:56:43 +0000 (07:56 -0800)]
Enable native wrappers for the remainder of nn functions.

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14290

Differential Revision: D13162562

Pulled By: gchanan

fbshipit-source-id: 615e1727988bfeeade48f9b38162333a2e298f7b

Huan Gui [Sat, 24 Nov 2018 10:41:25 +0000 (02:41 -0800)]
Add Recency Weighted into SparseLookup (#14291)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14291

Add RecencyWeighted into SparseLookup.

Reviewed By: Wakeupbuddy

Differential Revision: D13147738

fbshipit-source-id: de5dc3aaee8ce7d41c6d30d2ff47e9786a7fa4da

Shuichi KITAGUCHI [Sat, 24 Nov 2018 05:32:10 +0000 (21:32 -0800)]
quote NUMPY_INCLUDE_DIR (#14341)

Summary:
When NUMPY_INCLUDE_DIR contains a space character (e.g. "C:\Program Files (x86)\Microsoft Visual Studio\..."), cmake cannot receive the correct path name.
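An illustration of the underlying problem (in Python, not the CMake fix itself): an unquoted path containing spaces splits into several arguments, while a quoted path survives as a single token.

```python
import shlex

# Shell-style lexing of a flag built from a path with spaces.
path = "C:/Program Files (x86)/Microsoft Visual Studio/include"

unquoted = shlex.split("-I" + path)
quoted = shlex.split("-I" + shlex.quote(path))

assert len(unquoted) == 5      # the path broke apart at the spaces
assert quoted == ["-I" + path]  # quoting keeps it one argument
```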
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14341

Differential Revision: D13188408

Pulled By: soumith

fbshipit-source-id: b62127d90e53da94fe6af5d3bdd2ea4fd6546210

Michael Suo [Fri, 23 Nov 2018 19:22:22 +0000 (11:22 -0800)]
shape analysis fix (#14325)

Summary:
This PR is deceptively large because of an indenting change. The actual change is small; I will highlight it inline
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14325

Differential Revision: D13183296

Pulled By: suo

fbshipit-source-id: fcbf6d5317954694ec83e6b8cc1c989f2d8ac298

peter [Fri, 23 Nov 2018 16:15:28 +0000 (08:15 -0800)]
Some minor fixes for Windows build script (#14218)

Summary:
1. Fix execution failure when some of the paths are not defined
2. Users can now optionally override install dir by setting `CMAKE_INSTALL_PREFIX`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14218

Differential Revision: D13180350

Pulled By: soumith

fbshipit-source-id: 8c9680d1285dbf08b49380af1ebfa43ede99babc

Michael Carilli [Fri, 23 Nov 2018 16:08:35 +0000 (08:08 -0800)]
Allow dataloader to accept a custom memory pinning function (#14171)

Summary:
Currently, the `pin_memory_batch` function in the dataloader will return a batch of any unrecognized type without pinning the data, because it doesn't know how.

This behavior was preventing us from overlapping data prefetching in Mask-RCNN, whose custom `collate_fn` returns a custom batch type.

The present PR adds the ability for the user to pass a `pin_fn` alongside any custom `collate_fn` to handle such custom types.
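A simplified pure-Python sketch of the dispatch logic described above (the real `pin_memory_batch` recurses over tensors, dicts, and sequences; the class and function names here are invented for illustration). A custom batch type is opaque to the default logic, so without a `pin_fn` it comes back unpinned.

```python
class CustomBatch:
    """Stand-in for a custom batch type produced by a custom collate_fn."""
    def __init__(self, data):
        self.data = data
        self.pinned = False

def pin_memory_batch(batch, pin_fn=None):
    # Recurse into known containers; defer unknown types to pin_fn if given.
    if isinstance(batch, (list, tuple)):
        return type(batch)(pin_memory_batch(b, pin_fn) for b in batch)
    if pin_fn is not None:
        return pin_fn(batch)
    return batch  # unrecognized type: returned without pinning

def my_pin_fn(batch):
    batch.pinned = True  # a real pin_fn would call .pin_memory() on its tensors
    return batch

default = pin_memory_batch(CustomBatch([1, 2]))
custom = pin_memory_batch(CustomBatch([1, 2]), pin_fn=my_pin_fn)
batch_list = pin_memory_batch([CustomBatch([1]), CustomBatch([2])], pin_fn=my_pin_fn)
assert not default.pinned and custom.pinned
```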
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14171

Differential Revision: D13166669

Pulled By: soumith

fbshipit-source-id: ca965f9841d4a259b3ca4413c8bd0d8743d433ab

Michael Carilli [Fri, 23 Nov 2018 16:07:51 +0000 (08:07 -0800)]
Option to preserve bitwise accuracy of gradient checkpointed vs non-checkpointed dropout (#14253)

Summary:
This issue was noticed, and fix proposed, by raulpuric.

Checkpointing is implemented by rerunning a forward-pass segment for each checkpointed segment during backward.  This can result in the RNG state advancing more than it would without checkpointing, which can cause checkpoints that include dropout invocations to lose end-to-end bitwise accuracy as compared to non-checkpointed passes.

The present PR contains optional logic to juggle the RNG states such that checkpointed passes containing dropout achieve bitwise accuracy with non-checkpointed equivalents.**  The user requests this behavior by supplying `preserve_rng_state=True` to `torch.utils.checkpoint` or `torch.utils.checkpoint_sequential`.

Currently, `preserve_rng_state=True` may incur a moderate performance hit because restoring MTGP states can be expensive.  However, restoring Philox states is dirt cheap, so syed-ahmed's [RNG refactor](https://github.com/pytorch/pytorch/pull/13070#discussion_r235179882), once merged, will make this option more or less free.
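The RNG-juggling idea can be shown with Python's stdlib `random` module (a conceptual analog only; the real code stashes and restores CUDA/CPU generator states): saving the RNG state before the forward segment and restoring it before the recomputation makes the recomputed dropout bitwise identical.

```python
import random

def dropout_mask(n):
    # Stand-in for a dropout invocation that consumes RNG state.
    return [random.random() < 0.5 for _ in range(n)]

random.seed(0)

# Forward pass: stash the RNG state before running the segment.
saved_state = random.getstate()
forward_mask = dropout_mask(8)

# ... other work advances the RNG further ...
random.random()

# Backward pass: rerun the segment. Restoring the stashed state makes
# the recomputed dropout bitwise identical to the forward pass.
random.setstate(saved_state)
recomputed_mask = dropout_mask(8)

assert recomputed_mask == forward_mask
```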

I'm a little wary of the [def checkpoint(function, *args, preserve_rng_state=False):](https://github.com/pytorch/pytorch/pull/14253/files#diff-58da227fc9b1d56752b7dfad90428fe0R75) argument-passing method (specifically, putting a kwarg after a variable argument list).  Python 3 seems happy with it.
Edit: It appears Python 2.7 is NOT happy with a [kwarg after *args](https://travis-ci.org/pytorch/pytorch/builds/457706518?utm_source=github_status&utm_medium=notification).  `preserve_rng_state` also needs to be communicated in a way that doesn't break any existing usage.  I'm open to suggestions (a global flag, perhaps?).

**Batchnorm may still be an issue, but that's a battle for another day.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14253

Differential Revision: D13166665

Pulled By: soumith

fbshipit-source-id: 240cddab57ceaccba038b0276151342344eeecd7

svcscm [Fri, 23 Nov 2018 05:58:28 +0000 (21:58 -0800)]
Updating submodules

Reviewed By: yns88

fbshipit-source-id: e92b0c24a56b588dcf30542692cb4bdc2d474825

Sebastian Messmer [Thu, 22 Nov 2018 19:55:07 +0000 (11:55 -0800)]
Remove individual "using c10:xxx" statements (#13168)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13168

We now have a "using namespace c10" in the at and caffe2 namespaces, so we don't need the individual ones anymore.

Reviewed By: ezyang

Differential Revision: D11669870

fbshipit-source-id: fc2bb1008e533906914188da4b6eb30e7db6acc1

Yinghai Lu [Thu, 22 Nov 2018 08:28:51 +0000 (00:28 -0800)]
Make sure we bind input/output of Onnxifi op positionally (#14214)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14214

This picks up the residual task of T36325466 to make sure that the input/output binding of the c2 Onnxifi op is positional.

Reviewed By: dzhulgakov

Differential Revision: D13134470

fbshipit-source-id: d1b916dade65c79133b86507cd54ea5166fa6810

Wanchao Liang [Thu, 22 Nov 2018 07:42:24 +0000 (23:42 -0800)]
Convert gumbel_softmax, lp pooling weak functions and modules (#14232)

Summary:
1. Support `Optional[BroadcastingList1[int]]`-style type annotations that accept an int or a list[int]
2. Convert the gumbel_softmax and lp pooling weak functions and modules
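The broadcasting behavior item 1 enables can be sketched in plain Python (the helper name is invented; `BroadcastingList1` itself is a JIT annotation): a bare int is accepted wherever a one-element list is expected.

```python
from typing import List, Optional, Union

def normalize_size(size: Optional[Union[int, List[int]]]) -> Optional[List[int]]:
    """Accept an int or a list[int], broadcasting a bare int to a list."""
    if size is None or isinstance(size, list):
        return size
    return [size]

assert normalize_size(3) == [3]
assert normalize_size([3]) == [3]
```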
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14232

Differential Revision: D13164506

Pulled By: wanchaol

fbshipit-source-id: 6c2a2b9a0613bfe907dbb5934122656ce2b05700

Sebastian Messmer [Thu, 22 Nov 2018 07:04:43 +0000 (23:04 -0800)]
Use ADL to find toString (#14021)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14021

I'm planning to move at::Scalar to c10, and there's an at::toString(Scalar) defined.
Unfortunately, we call it by specifying at::toString() explicitly instead of relying on ADL.
This diff changes that to prepare for the actual move.

Reviewed By: ezyang

Differential Revision: D13015239

fbshipit-source-id: f2a09f43a96bc5ef20ec2c4c88f7790fd5a04870

Sebastian Messmer [Thu, 22 Nov 2018 07:04:42 +0000 (23:04 -0800)]
Fix include paths for intrusive_ptr (#13692)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13692

This now lives in c10/util, not ATen/core anymore.

Reviewed By: ezyang

Differential Revision: D12937091

fbshipit-source-id: ea2d420a15e7941a38d0b4c75e20ca18437c73f8

Sebastian Messmer [Thu, 22 Nov 2018 07:04:42 +0000 (23:04 -0800)]
Move intrusive_ptr to c10/util

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13691

Reviewed By: ezyang

Differential Revision: D12937090

fbshipit-source-id: fe9d21d5f7ea4e78e7e38ac60db13814a9971ed9

Joel Marcey [Thu, 22 Nov 2018 06:28:20 +0000 (22:28 -0800)]
ignore generated caffe2 docs and virtualenvs

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14309

Reviewed By: soumith

Differential Revision: D13166626

Pulled By: JoelMarcey

fbshipit-source-id: 4f11228d8b5da85cec222bf11282722a7319581b

svcscm [Thu, 22 Nov 2018 05:59:40 +0000 (21:59 -0800)]
Updating submodules

Reviewed By: yns88

fbshipit-source-id: 20976d595e68a08d746d8806fd0205d810656366

Jongsoo Park [Thu, 22 Nov 2018 05:36:16 +0000 (21:36 -0800)]
removing quantization utility functions moved to fbgemm (#14301)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14301

This diff removes quantization utility functions copied to fbgemm

Reviewed By: Maratyszcza

Differential Revision: D13159299

fbshipit-source-id: a7f3cd2af0aa241a8578d532a70a157da70d9289

Achal Shah [Thu, 22 Nov 2018 05:00:22 +0000 (21:00 -0800)]
Cuda version comparison with CUDA_VERSION_STRING (#14302)

Summary:
CUDA headers include the CUDA version in major.minor form, but when we do find_package(CUDA), the CUDA_VERSION variable includes the patch number as well, which fails the following condition.

`if(NOT ${cuda_version_from_header} STREQUAL ${CUDA_VERSION})`

**For example:**
I have CUDA 10.0 installed. My nvcc output looks like this:
`Cuda compilation tools, release 10.0, V10.0.130`

If I compile my application with caffe2, it gives me the following error:

```
CMake Error at /usr/share/cmake/Caffe2/public/cuda.cmake:59 (message):
  FindCUDA says CUDA version is (usually determined by nvcc), but the CUDA
  headers say the version is 10.0.  This often occurs when you set both
  CUDA_HOME and CUDA_NVCC_EXECUTABLE to non-standard locations, without also
  setting PATH to point to the correct nvcc.  Perhaps, try re-running this
  command again with PATH=/usr/local/cuda/bin:$PATH.  See above log messages
  for more diagnostics, and see
  https://github.com/pytorch/pytorch/issues/8092 for more details.
```

**In this case, it failed because:**
cuda_version_from_header = 10.0
CUDA_VERSION = 10.0.130 (came from nvcc)

`if(NOT ${cuda_version_from_header} STREQUAL ${CUDA_VERSION})`

**Fix:**
We should compare the header version in **major.minor format**, which is given by CUDA_VERSION_STRING.
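The mismatch and the fix can be sketched in a few lines of Python (version strings taken from the example above):

```python
def major_minor(version):
    """Reduce a version string to its major.minor prefix."""
    return ".".join(version.split(".")[:2])

cuda_version_from_header = "10.0"    # major.minor from the headers
cuda_version_from_nvcc = "10.0.130"  # CUDA_VERSION includes the patch number

# Naive full-string comparison spuriously fails:
assert cuda_version_from_header != cuda_version_from_nvcc

# Comparing major.minor (what CUDA_VERSION_STRING provides) succeeds:
assert major_minor(cuda_version_from_nvcc) == cuda_version_from_header
```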
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14302

Differential Revision: D13166485

Pulled By: soumith

fbshipit-source-id: 1b74e756a76c4cc5aa09978f5850f763ed5469b6

svcscm [Thu, 22 Nov 2018 04:51:26 +0000 (20:51 -0800)]
Updating submodules

Reviewed By: yns88

fbshipit-source-id: ee60b4dddf688608ef80043b1dc336d120a045d0

svcscm [Thu, 22 Nov 2018 04:29:22 +0000 (20:29 -0800)]
Updating submodules

Reviewed By: yns88

fbshipit-source-id: 366c29d09bec53459e2a4890c7fe8d10f45ff5c3

Teng Li [Thu, 22 Nov 2018 02:21:55 +0000 (18:21 -0800)]
Robust NCCL barrier improvement to cover all devices combinations (#14271)

Summary:
This covers the edge case where we run the same NCCL process group with multiple GPU combinations instead of only the last GPU combination. We always keep track of which GPUs have been used previously in the NCCL process group, and barrier() itself will synchronize on each GPU's NCCL stream.

A test is included as well, run on an 8-GPU machine.
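The bookkeeping described above, as a pure-Python toy (the class and its method behavior are invented for illustration; the real code synchronizes CUDA streams): the process group remembers every device combination it has used, and barrier() covers all of them, not just the last.

```python
class FakeProcessGroup:
    """Toy stand-in for an NCCL process group's device tracking."""
    def __init__(self):
        self.used_devices = set()
        self.synced = []

    def allreduce(self, devices):
        # Record every GPU combination this group has run an op on.
        self.used_devices.update(devices)

    def barrier(self):
        # Synchronize each previously used device's stream.
        self.synced = sorted(self.used_devices)

pg = FakeProcessGroup()
pg.allreduce([0, 1])
pg.allreduce([2, 3])   # a different GPU combination
pg.barrier()
assert pg.synced == [0, 1, 2, 3]
```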
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14271

Differential Revision: D13164993

Pulled By: teng-li

fbshipit-source-id: 81e04352740ea50b5e943369e74cfcba40bb61c1

Michael Suo [Thu, 22 Nov 2018 01:46:46 +0000 (17:46 -0800)]
alias analysis (#14018)

Summary:
First draft of an alias analysis pass. It's a big PR unfortunately; a rough table of contents/suggested order of review:
1. `AliasAnalysis` pass, which traverses the graph and builds an `AliasDb`. The basic strategy is to assign alias information to every value of mutable type (list/tuple/tensor), and use the alias annotations of each node's schema to assign alias info to the outputs based on the alias info of the inputs. Nodes that aren't explicitly schematized have hand-written analysis rules.

2. Integration of aliasing information into `moveBefore/AfterTopologicallyValid()`. Basically, we pass in an alias DB when we ask for moveBefore/After. Similar to how we can boil down dependency analysis to "what nodes use this node", we can boil down mutability analysis to "what nodes write to an alias set input/output'd by this node".

3. Integration of alias analysis to optimization passes that need it. Right now, it is `GraphFuser`, `CreateAutodiffSubgraphs`, constant prop, and CSE. Not sure if any others need it.

- Testing; still figuring out the best way to do this.
- Eventually we want to integrate the alias db into the graph, but we shouldn't do that until we can guarantee that the information can stay up to date with mutations.
- Do the same thing `python_printer` did for operators and force people to register alias analyzers if they can't schematize their op.
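A heavily simplified model of the "what nodes write to an alias set used by this node" query (all names invented; the real `AliasDb` is C++ and far richer):

```python
class AliasDb:
    """Toy alias database: values map to alias set ids, nodes to the sets they write."""
    def __init__(self):
        self.alias_set = {}   # value name -> alias set id
        self.writes = {}      # node name -> set of alias set ids it writes to

    def may_move_past(self, node, other_reads):
        """May `node` be reordered past a node reading alias sets `other_reads`?"""
        return not (self.writes.get(node, set()) & other_reads)

db = AliasDb()
db.alias_set["x"] = 0
db.alias_set["y"] = 0          # y aliases x (same alias set)
db.writes["append_node"] = {0}  # a node that mutates x's alias set

# A writer to x's alias set cannot be reordered past a reader of y:
assert not db.may_move_past("append_node", {db.alias_set["y"]})
# But it can move past a node that only reads an unrelated alias set:
assert db.may_move_past("append_node", {1})
```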
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14018

Differential Revision: D13144906

Pulled By: suo

fbshipit-source-id: 1bc964f9121a504c237cef6dfeea6b233694de6a

Ilia Cherniavskii [Thu, 22 Nov 2018 01:19:37 +0000 (17:19 -0800)]
Remove extra include

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14206

Reviewed By: dzhulgakov

Differential Revision: D13131318

fbshipit-source-id: 559b55b8d98cdf6b7d1d3e31237c5473edc5e462

Teng Li [Thu, 22 Nov 2018 00:54:36 +0000 (16:54 -0800)]
Removed redundant allreduce options in DDP (#14208)

Summary:
This somehow was not cleaned up after the C++ migration. It is unused and can be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14208

Differential Revision: D13132492

Pulled By: teng-li

fbshipit-source-id: 0f05b6368174664ebb2560c037347c8eb45f7c38

David Riazati [Thu, 22 Nov 2018 00:30:43 +0000 (16:30 -0800)]
Add list inequality operator (#14129)

Summary:
This PR adds `aten::neq` for list inequality comparisons and converts
`nll_loss` to weak script.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14129

Differential Revision: D13123894

Pulled By: driazati

fbshipit-source-id: 8c1edf7c163217ec00eb653f95d196db3998613f

Yinghai Lu [Wed, 21 Nov 2018 23:43:10 +0000 (15:43 -0800)]
Add onnxifi support to SparseLengthsWeightedSum (#14210)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14210

We left `SparseLengthsWeightedSum` out, as the benchmark was not testing it due to an fp16 filler issue. It was flushed out by unit tests. Hence we add the support here.

Reviewed By: bddppq

Differential Revision: D13132320

fbshipit-source-id: b21c30c185c9e1fbf3980641bc3cdc39e85af2e1

Gu, Jinghui [Wed, 21 Nov 2018 23:42:29 +0000 (15:42 -0800)]
Add "axis" and "axis_w" arguments in FC to support a customized axis to reduce dim. (#12971)

Summary:
Add "axis" and "axis_w" arguments in FC to support a customized axis to reduce dim.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12971

Reviewed By: bddppq

Differential Revision: D12850675

Pulled By: yinghai

fbshipit-source-id: f1cde163201bd7add53b8475329db1f038a73019

Viswanath Sivakumar [Wed, 21 Nov 2018 21:42:04 +0000 (13:42 -0800)]
IDEEP fallback for ResizeNearest op (#14212)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14212

TSIA

Reviewed By: yinghai

Differential Revision: D13134134

fbshipit-source-id: e3c5c9c8756d6e25b213f8dde9d809a44373d7a3

zrphercule [Wed, 21 Nov 2018 21:12:18 +0000 (13:12 -0800)]
Fix ONNX_ATEN mode (#14239)

Summary:
Fix ONNX_ATEN mode by adding it to the validateBlock method.
Before this PR, validateBlock would throw an exception when using this mode.

I will add related test cases for ONNX_ATEN mode in a different PR once this is merged, since we don't have any currently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14239

Differential Revision: D13145443

Pulled By: zrphercule

fbshipit-source-id: 60e7942aa126acfe67bdb428ef231ac3066234b1

Pieter Noordhuis [Wed, 21 Nov 2018 19:25:42 +0000 (11:25 -0800)]
Bump gloo (#14281)

Summary:
Includes more robust error handling and timeout support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14281

Differential Revision: D13158232

Pulled By: pietern

fbshipit-source-id: e80432799a020576d5abdcd9a21d66b629479caf

Jongsoo Park [Wed, 21 Nov 2018 17:37:58 +0000 (09:37 -0800)]
fix comment on dnnlowp op arguments (#14265)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14265

Fix comment

Reviewed By: hx89

Differential Revision: D13152106

fbshipit-source-id: fbe98906963cbd5cb20a583a737a792fbc38292e

Gregory Chanan [Wed, 21 Nov 2018 17:04:59 +0000 (09:04 -0800)]
native NN wrappers, including with buffers.

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14256

Differential Revision: D13148783

Pulled By: gchanan

fbshipit-source-id: 4b6179033cf1df26061b6731eaaa4e008692e592

Pieter Noordhuis [Wed, 21 Nov 2018 16:43:14 +0000 (08:43 -0800)]
Remove header generated at configuration time (#14244)

Summary:
The build was picking up the empty stub header instead of the generated
one. Because of the large number of include paths we end up passing to
the compiler, it is brittle to have both an empty stub file and a
generated file and expect the compiler to pick up the right one.

With the recent change to compile everything from a single CMake run, we
can now use native CMake facilities to propagate macros that indicate
backend support. The `target_compile_definitions` stanzas with the
INTERFACE flag ensure that these macros are set only for downstream
consumers of the c10d target.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14244

Reviewed By: teng-li

Differential Revision: D13144293

Pulled By: pietern

fbshipit-source-id: f49324220db689c68c126b159f4f00a8b9bc1252

Zachary DeVito [Wed, 21 Nov 2018 14:36:26 +0000 (06:36 -0800)]
Address jittering issues in python_print (#14064)

Summary:
export - print a method with python_print
import - import a method with import_method

We want to ensure:

    export(g) == export(import(export(g)))

That is, after exporting/importing once, the graph will stay exactly
the same. This is less strict than g == import(export(g)), which would
require us to maintain a lot more information about the structure of the
IR and about the names of debug symbols.

This PR addresses this with the following fixes:
* print out double-precision numbers with high enough precision such
  that they always parse in the same way
* when creating loop-carried dependencies, sort them
  by variable name, ensuring a consistent order
* parse nan correctly
* DCE: remove unused outputs of if statements, and loop-carried dependencies
  in loops that are dead both after the loop and inside the body of the
  loop.
* Do not set uniqueName for variables whose names are _[0-9]+, these
  are probably rare in user code, and we need a way to communicate
  that we do not care about a variable name when re-parsing the graph.
  Otherwise temporary variable names will jitter around.
* Expand the definition of a constant in printing code to None,
  and family.
* Allow re-treeing to work as long as the only thing in its way is a
  constant node. These do not have side effects but are sometimes
  inserted in a different order when tracing compared to how we print them.
* Print all constant nodes out first in the order in which they are used
  (or, if they are inlined, ensure they get assigned CONSTANT.cX numbers
  in a consistent order). Clean up tuples (this is done in the compiler,
  but not in the tracer, leading to some tuple indexing jitter if not
  done).
* use strtod_l, not std::stod which can throw exceptions

Other:
* Add REL_WITH_DEB_INFO to setup.py. It already existed for the
  cmake files. Threading it into setup.py allows us to turn on
  debug symbols with optimization everywhere.
* enable round trip testing for all generated graphs. This only adds
  ~6 seconds to total build time but tests printing for every graph.
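The double-precision point above can be illustrated in Python: printing a float with too few digits does not parse back to the same value, while `repr()` emits enough digits for an exact round-trip. (This is an analogy; the C++ code uses its own formatting and strtod_l.)

```python
x = 0.1 + 0.2  # 0.30000000000000004

lossy = float("%.6g" % x)  # truncated printing: the value drifts
exact = float(repr(x))     # enough precision: bitwise round-trip

assert lossy != x
assert exact == x

# The export/import property being enforced is idempotence:
#     export(g) == export(import(export(g)))
# i.e. once a graph has round-tripped, further round-trips are stable.
```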
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14064

Differential Revision: D13094637

Pulled By: zdevito

fbshipit-source-id: 0a1c6912194d965f15d6b0c6cf838ccc551f161d

svcscm [Wed, 21 Nov 2018 10:16:29 +0000 (02:16 -0800)]
Updating submodules

Reviewed By: cdelahousse

fbshipit-source-id: 27838fb2dad82c78906faf3cc2d124557c30e88f

svcscm [Wed, 21 Nov 2018 08:25:17 +0000 (00:25 -0800)]
Updating submodules

Reviewed By: cdelahousse

fbshipit-source-id: 3c17e12a579245a84e9a56b1d8a1641232150675

Lu Fang [Wed, 21 Nov 2018 07:33:30 +0000 (23:33 -0800)]
Add tensor table in ModelDef and use it for jit script serialization and deserialization (#13861)

Summary:
As we discussed, the tensors in the TorchScript module will be associated with the tensor data in the serialized file. So let's add a table of tensors (actually a repeated TensorProto field) in the ModelDef. TensorProto.name will be the id.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/13861

Reviewed By: dzhulgakov

Differential Revision: D13036940

Pulled By: zrphercule

fbshipit-source-id: ecb91b062ac4bc26af2a8d6d12c91d5614efd559

Tongzhou Wang [Wed, 21 Nov 2018 07:27:16 +0000 (23:27 -0800)]
c10d Automatically retry on EINTR (#14180)

Summary:
Probably fixes https://github.com/pytorch/pytorch/issues/14170

Actually I probably shouldn't retry all `SYSCHECK` calls. I'll leave it to the reviewers to decide.
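The retry idea in a pure-Python sketch (the real SYSCHECK is a C++ macro, so this is only an illustration): a system call interrupted by a signal fails with EINTR, and the usual remedy is simply to retry it.

```python
import errno

def retry_on_eintr(syscall):
    """Call syscall(), retrying as long as it fails with EINTR."""
    while True:
        try:
            return syscall()
        except OSError as e:
            if e.errno != errno.EINTR:
                raise  # only EINTR is retried; real errors propagate

attempts = []

def flaky_recv():
    # Fails with EINTR twice, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError(errno.EINTR, "Interrupted system call")
    return b"payload"

result = retry_on_eintr(flaky_recv)
assert result == b"payload" and len(attempts) == 3
```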
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14180

Reviewed By: pietern

Differential Revision: D13144741

Pulled By: SsnL

fbshipit-source-id: d73288f76b18cae14b1b43dad4e5e8d010a96d95

Teng Li [Wed, 21 Nov 2018 05:10:18 +0000 (21:10 -0800)]
Make NCCL backend support barrier op (#14142)

Summary:
This is a feature request from: https://github.com/pytorch/pytorch/issues/13573

As the title says, this PR makes NCCL backend support barrier op.

There are a couple of scenarios that need to be addressed:
(1) When a NCCL op has already happened, we need to record which GPU device(s) the previous op happened on and queue the allreduce barrier op on the same GPU device(s).
(2) When there is no NCCL op yet, we will try to use a single GPU, assigning a separate GPU to each process as a best effort.

As for the async work, during wait, we would like to not just wait on the NCCL kernel to be completed, but also block the thread until the current stream and the NCCL stream return.

`test_distributed` should cover the test. I also manually tested both scenarios.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14142

Differential Revision: D13113391

Pulled By: teng-li

fbshipit-source-id: 96c33d4d129e2977e6892d85d0fc449424c35499

Yinghai Lu [Wed, 21 Nov 2018 02:00:14 +0000 (18:00 -0800)]
Fix memory leakage in onnxifi transformer (#14245)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14245

TSIA

Reviewed By: bddppq, rdzhabarov

Differential Revision: D13144783

fbshipit-source-id: 5e07bb7ab883ba1af68547a26272cd320967b9e3

David Riazati [Wed, 21 Nov 2018 00:42:00 +0000 (16:42 -0800)]
Allow undefined tensors as constants (#14120)

Summary:
This PR inserts `prim::None` constants for undefined tensors. This comes up in the standard library when an `Optional[Tensor]` is statically determined to be `None`:

```python
@torch.jit.script
def fn(x=None):
    # type: (Optional[Tensor]) -> Tensor
    return torch.jit._unwrap_optional(x)

@torch.jit.script
def fn2():
    # type: () -> Tensor
    return fn()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14120

Differential Revision: D13124625

Pulled By: driazati

fbshipit-source-id: 9eaa82e478c49c503f68ed89d8c770e8273ea569

Wanchao Liang [Tue, 20 Nov 2018 22:09:27 +0000 (14:09 -0800)]
Export BatchNorm functional and module, add necessary JIT support (#14016)

Summary:
This PR does three things:

1. It exports the BatchNorm functional and module, and rewrites some of the components to stay aligned with the currently supported JIT features
2. In the process of exporting, it adds the necessary compiler support for in-place op augmented assignment
3. It changes the test_jit behavior in add_module_test to utilize a single rng state during module initialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14016

Differential Revision: D13112064

Pulled By: wanchaol

fbshipit-source-id: 31e3aee5fbb509673c781e7dbb6d8884cfa55d91

Thomas Viehmann [Tue, 20 Nov 2018 20:43:23 +0000 (12:43 -0800)]
Have PYTORCH_FUSION_DEBUG print C kernel source (#14213)

Summary:
- Move handling of the environment variable up from CPU only to all backends
- Introduce two levels to be enabled with PYTORCH_FUSION_DEBUG=n:
  1: print C source
  2: also print CPU assembly (the previous effect of PYTORCH_FUSION_DEBUG)
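The two-level switch can be sketched in Python (illustrative only; the actual parsing happens in C++, and the helper name here is invented):

```python
import os

def fusion_debug_level(env=os.environ):
    """Parse PYTORCH_FUSION_DEBUG as an integer level, defaulting to 0."""
    try:
        return int(env.get("PYTORCH_FUSION_DEBUG", "0"))
    except ValueError:
        return 0

fake_env = {"PYTORCH_FUSION_DEBUG": "2"}
level = fusion_debug_level(fake_env)

dump_c_source = level >= 1      # level 1: print C source
dump_cpu_assembly = level >= 2  # level 2: also print CPU assembly

assert dump_c_source and dump_cpu_assembly
```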

apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14213

Differential Revision: D13135393

Pulled By: soumith

fbshipit-source-id: befa4ebea3b3c97e471393a9f6402b93a6b24031

Tugrul Ates [Tue, 20 Nov 2018 20:23:14 +0000 (12:23 -0800)]
Delete backwards compatibility StorageImpl.h and TensorImpl.h (#14230)

Summary:
Since they directly include the real ones in core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14230

Differential Revision: D13140323

Pulled By: tugrulates

fbshipit-source-id: d7e3b94e891b2d7fa273d01c0b7edfebdbd7e368

Jongsoo Park [Tue, 20 Nov 2018 08:53:29 +0000 (00:53 -0800)]
remove unused parameters from caffe2_dnnlowp_utils.cc (#14164)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14164

See title

Reviewed By: csummersea

Differential Revision: D13115470

fbshipit-source-id: d754f558cd06e5f4c1cd00315e912cdb7b50731a

Jongsoo Park [Tue, 20 Nov 2018 08:53:29 +0000 (00:53 -0800)]
use pragma once (#14163)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14163

Some of the names we were using to guard the header files were too short (e.g. DYNAMIC_HISTOGRAM_H).

Reviewed By: csummersea

Differential Revision: D13115451

fbshipit-source-id: cef8c84c62922616ceea17effff7bdf8d67302a2

Jongsoo Park [Tue, 20 Nov 2018 08:53:29 +0000 (00:53 -0800)]
format python files (#14161)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14161

Formatting using Nuclide

Reviewed By: hx89

Differential Revision: D13115348

fbshipit-source-id: 7432ce6072a1822d7287b4ebcfcb6309282e15ac

Jongsoo Park [Tue, 20 Nov 2018 08:53:29 +0000 (00:53 -0800)]
clang-format (#14160)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14160

clang-format of C++ files

Reviewed By: hx89

Differential Revision: D13115201

fbshipit-source-id: d2ad65f66209e00578ef90f87f41272de2d24aa9

Hui Wu [Tue, 20 Nov 2018 06:54:19 +0000 (22:54 -0800)]
Add sigmoid op based on MKL-DNN

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13097

Differential Revision: D13105366

Pulled By: yinghai

fbshipit-source-id: d156e8fd519baeecf61c25dcd8fa2c2fa7351ef4

Daya S Khudia [Tue, 20 Nov 2018 06:45:00 +0000 (22:45 -0800)]
OSS build fix (#14192)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14192

We can only use C10_* macros in OSS. The build is only broken if built with USE_FBGEMM=ON.

Reviewed By: jianyuh

Differential Revision: D13121781

fbshipit-source-id: f0ee9a75997766e63e1da8a53de7ddb98296a171

Lu Fang [Tue, 20 Nov 2018 06:12:16 +0000 (22:12 -0800)]
Make EncodeMethod in jit script serialization return a string (#14167)

Summary:
Nit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/14167

Reviewed By: ezyang

Differential Revision: D13116584

Pulled By: dzhulgakov

fbshipit-source-id: c0e7e71a81004031564bd2fc59f393041e1283d5

Jongsoo Park [Tue, 20 Nov 2018 05:44:29 +0000 (21:44 -0800)]
Create README.md of caffe2/quantization/server

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14217

Reviewed By: csummersea

Differential Revision: D13135086

Pulled By: jspark1105

fbshipit-source-id: bddf4f1c2dc5ec8ea6ebe9e265956f367e082d52

5 years agoCircleCI: fix NCCL install (#14172)
Will Feng [Tue, 20 Nov 2018 05:28:29 +0000 (21:28 -0800)]
CircleCI: fix NCCL install (#14172)

Summary:
The `$BUILD_ENVIRONMENT` checks work in `test.sh` but not in `build.sh`; this PR fixes the issue.

This replaces https://github.com/pytorch/pytorch/pull/14124.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14172

Differential Revision: D13135087

Pulled By: yf225

fbshipit-source-id: 42fff3926734778713d483d74ba0a89e5502dd9e

5 years agoFix a bug in test case of onnx::If
zrphercule [Tue, 20 Nov 2018 02:43:58 +0000 (18:43 -0800)]
Fix a bug in test case of onnx::If

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14209

Differential Revision: D13132607

Pulled By: zrphercule

fbshipit-source-id: b7f7ccc6a6cbdeb57a7f88a1971d15dd81e6fc81

5 years agoTensor type checking and informative error messages for torch.distributed (#14204)
Teng Li [Tue, 20 Nov 2018 02:25:00 +0000 (18:25 -0800)]
Tensor type checking and informative error messages for torch.distributed (#14204)

Summary:
This will address https://github.com/pytorch/pytorch/issues/13574

This error message should be more informative to the user for all the non-multi-GPU ops, since we always python-bind to the multi-GPU ops.

test_distributed should cover all of them. Also tested both RuntimeErrors by hand:

```
>>> a = torch.ByteTensor([])
>>> b = [a, a]
>>> dist.all_reduce(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 809, in all_reduce
    _check_single_tensor(tensor, "tensor")
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 207, in _check_single_tensor
    "to be a torch.Tensor type".format(param_name))
RuntimeError: Invalid function argument. Expecting parameter: tensor to be a torch.Tensor type

>>> b = ["b"]
>>> dist.all_gather(b, a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 1006, in all_gather
    _check_tensor_list(tensor_list, "tensor_list")
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 225, in _check_tensor_list
    "to be a List[torch.Tensor] type".format(param_name))
RuntimeError: Invalid function argument. Expecting parameter: tensor_list to be a List[torch.Tensor] type
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14204

Differential Revision: D13131526

Pulled By: teng-li

fbshipit-source-id: bca3d881e41044a013a6b90fa187e722b9dd45f2

5 years agoMove stream functions from CUDAContext to CUDAStream (#14110)
Edward Yang [Tue, 20 Nov 2018 01:01:34 +0000 (17:01 -0800)]
Move stream functions from CUDAContext to CUDAStream (#14110)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14110

I'm planning to move CUDAStream to c10/cuda, without also moving
CUDAContext, and so it's most convenient if these definitions
are in the actual header file in question.

Reviewed By: smessmer

Differential Revision: D13104693

fbshipit-source-id: 23ce492003091adadaa5ca6a17124213005046c2

5 years agoMove CUDAStreamInternals inside detail namespace. (#14109)
Edward Yang [Tue, 20 Nov 2018 01:01:34 +0000 (17:01 -0800)]
Move CUDAStreamInternals inside detail namespace. (#14109)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14109

Previously it was at the top level, because the author was under
the impression that you could only refer to top-level C++ names
from C, but this is not true; you just need to make a stub struct
conditioned on __cplusplus.

Reviewed By: smessmer

Differential Revision: D13104694

fbshipit-source-id: ecb7ae6dcfa4ab4e062aad7a886937dca15fd1b2

5 years agoDelete dependencies from CUDAStream; remove synchronize_with (#13920)
Edward Yang [Tue, 20 Nov 2018 01:01:33 +0000 (17:01 -0800)]
Delete dependencies from CUDAStream; remove synchronize_with (#13920)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13920

I want to move CUDAStream and CUDAGuard to c10_cuda without also
bringing along CUDAContext or CUDAEvent for the ride (at least for
now).  To do this, I need to eliminate those dependencies.

There's a few functions in CUDAContext.h which don't really need
THCState, so they're separated out and put in general
purpose c10/cuda/CUDAFunctions.h

Reviewed By: smessmer

Differential Revision: D13047468

fbshipit-source-id: 7ed9d5e660f95805ab39d7af25892327edae050e

5 years agoFix race in AtomicFetchAdd. (#13479)
Yavuz Yetim [Mon, 19 Nov 2018 23:57:28 +0000 (15:57 -0800)]
Fix race in AtomicFetchAdd. (#13479)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13479

Increases the lock scope to above Output() calls.

These calls potentially allocate the underlying blob/tensor
objects and multiple invocations race each other over the
same output blobs/tensors.

Reviewed By: bwasti

Differential Revision: D12891629

fbshipit-source-id: a6015cfdb08e352521a1f062eb9d94a971cfbdb0

5 years agoRemove API macros from intrusive_ptr (#14137)
Sebastian Messmer [Mon, 19 Nov 2018 23:35:18 +0000 (15:35 -0800)]
Remove API macros from intrusive_ptr (#14137)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14137

This is a templated header-only class and shouldn't need export/import macros.

Reviewed By: ezyang

Differential Revision: D13111712

fbshipit-source-id: c8c958e75b090d011d25156af22f37f9ca605196

5 years agoTensor construction: combine Resize+mutable_data - 1/4 (#13942)
Jerry Zhang [Mon, 19 Nov 2018 23:29:45 +0000 (15:29 -0800)]
Tensor construction: combine Resize+mutable_data - 1/4 (#13942)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13942

Codemod generated with clangr shard mode, 25 files per diff,
motivation: https://github.com/pytorch/pytorch/pull/12407

Reviewed By: smessmer

Differential Revision: D13054770

fbshipit-source-id: a9e86e5dfcb4f7cebf5243e1d359fad064561bed

5 years agoTensor construction: combine Resize+mutable_data - 3/4 (#13944)
Jerry Zhang [Mon, 19 Nov 2018 23:25:43 +0000 (15:25 -0800)]
Tensor construction: combine Resize+mutable_data - 3/4 (#13944)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/13854

Codemod generated with clangr shard mode, 25 files per diff,
motivation: https://github.com/pytorch/pytorch/pull/12407

Reviewed By: ezyang

Differential Revision: D13054836

fbshipit-source-id: 5de07a156687f1ee607d0450410881d9176a87a7

5 years agoStore the optimize flag in module (#14166)
Lu Fang [Mon, 19 Nov 2018 22:29:31 +0000 (14:29 -0800)]
Store the optimize flag in module (#14166)

Summary:
When saving/loading a script module, we store the optimize flag in the module instead of encoding it in each method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/14166

Reviewed By: ezyang

Differential Revision: D13117577

Pulled By: dzhulgakov

fbshipit-source-id: dc322948bda0ac5809d8ef9a345497ebb8f33a61

5 years agoCleanup caffe2 hipify exclude patterns (#14198)
Junjie Bai [Mon, 19 Nov 2018 22:21:20 +0000 (14:21 -0800)]
Cleanup caffe2 hipify exclude patterns (#14198)

Summary:
depthwise_3x3_conv_op.cu does not exist
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14198

Differential Revision: D13127479

Pulled By: bddppq

fbshipit-source-id: ec6bd434055a49ea405c4b399bde8c074114f955

5 years agoSupport 'python_module' of 'nn' in native functions. (#14126)
Gregory Chanan [Mon, 19 Nov 2018 22:10:47 +0000 (14:10 -0800)]
Support 'python_module' of 'nn' in native functions. (#14126)

Summary:
Also move mse_loss, binary_cross_entropy, l1_loss to use this functionality.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14126

Reviewed By: ezyang

Differential Revision: D13109975

Pulled By: gchanan

fbshipit-source-id: 0b29dc8cf222d25db14da7532d8dc096a988a0ec

5 years agoUse onnx proto_utils to support using protobuf-lite
Junjie Bai [Mon, 19 Nov 2018 21:25:32 +0000 (13:25 -0800)]
Use onnx proto_utils to support using protobuf-lite

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14150

Differential Revision: D13115586

Pulled By: bddppq

fbshipit-source-id: d6b6935a8deac60f6f58d62a71f6840182a72a51

5 years agoUse fbgemm revision file added by shipit (#14105)
Daya S Khudia [Mon, 19 Nov 2018 20:08:35 +0000 (12:08 -0800)]
Use fbgemm revision file added by shipit (#14105)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14105

Pull Request resolved: https://github.com/facebook/fbshipit/pull/62

Use the fbgemm revision file created by ShipIt to update the fbgemm revision for pytorch. We no longer have to update the submodule manually.

Reviewed By: yns88

Differential Revision: D13072074

fbshipit-source-id: bef9eabad50f7140179c370a60bd9ca73067b9b5

5 years agoSetup sccache for PyTorch ROCm CI (#14153)
Your Name [Mon, 19 Nov 2018 19:26:38 +0000 (11:26 -0800)]
Setup sccache for PyTorch ROCm CI (#14153)

Summary:
Discovered a huge build time difference between the caffe2 rocm build and the pytorch rocm build (6min vs. 30min); it turns out the sccache setup present in the caffe2 docker images was missing from the pytorch build script.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14153

Differential Revision: D13115097

Pulled By: bddppq

fbshipit-source-id: 88414f164b980f0e667c8e138479b4a75ab7692e

5 years agoallow empty index for scatter_* methods (#14077)
Ailing Zhang [Mon, 19 Nov 2018 17:45:28 +0000 (09:45 -0800)]
allow empty index for scatter_* methods (#14077)

Summary:
Fixes #2027
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14077
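The intended semantics can be sketched with plain lists (this models the behavior only, not the real tensor kernel): scatter along a dimension writes `src[i]` to `dest[index[i]]`, so an empty index should simply be a no-op instead of an error.

```python
def scatter_1d(dest, index, src):
    # Simplified 1-D model of Tensor.scatter_(0, index, src).
    for i, idx in enumerate(index):
        dest[idx] = src[i]
    return dest

print(scatter_1d([0, 0, 0], [2, 0], [7, 9]))  # [9, 0, 7]
print(scatter_1d([0, 0, 0], [], []))          # [0, 0, 0]: empty index is a no-op
```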

Differential Revision: D13095788

Pulled By: ailzhang

fbshipit-source-id: ad2c8bbf83d36e07940782b9206fbdcde8905fd3

5 years agouse at::Device throughout JIT (#14181)
ArmenAg [Mon, 19 Nov 2018 17:18:45 +0000 (09:18 -0800)]
use at::Device throughout JIT (#14181)

Summary:
zdevito soumith

Sorry about the previous PR, had some git issues. This is the same exact code as the previous PR but updated w.r.t pytorch/master.

fixes #13254
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14181

Differential Revision: D13117688

Pulled By: soumith

fbshipit-source-id: 044840b2c7a0101ef43dd16655fd9a0f9981f53f

5 years agoSupport named return arguments in native_functions. (#14100)
Gregory Chanan [Mon, 19 Nov 2018 16:18:47 +0000 (08:18 -0800)]
Support named return arguments in native_functions. (#14100)

Summary:
Note there was a hacky way of doing this before by specifying "return:" lists manually; this makes the
return names part of the function declaration itself.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14100

Differential Revision: D13101810

Pulled By: gchanan

fbshipit-source-id: 1c80574cd4e8263764fc65126427b122fe36df35

5 years agoSplit out CUDAMultiStreamGuard from CUDAGuard (#13912)
Edward Yang [Mon, 19 Nov 2018 16:13:08 +0000 (08:13 -0800)]
Split out CUDAMultiStreamGuard from CUDAGuard (#13912)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13912

The implementation and API of CUDAMultiStreamGuard are less mature,
and it cannot be implemented generically (yet) in c10_cuda.  This
might be a reasonable thing to do eventually, but not for now.

Reviewed By: smessmer

Differential Revision: D13046500

fbshipit-source-id: 4ea39ca1344f1ad5ae7c82c98617aa348c327848

5 years agoMove AT_CUDA_CHECK to c10
Edward Yang [Mon, 19 Nov 2018 16:13:08 +0000 (08:13 -0800)]
Move AT_CUDA_CHECK to c10

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13910

Reviewed By: smessmer

Differential Revision: D13046201

fbshipit-source-id: 8d360a0e4d6c2edf070d130e600c6b04f0ee0058

5 years agoAdd c10 cuda library. (#13900)
Edward Yang [Mon, 19 Nov 2018 16:13:07 +0000 (08:13 -0800)]
Add c10 cuda library. (#13900)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13900

Add c10 cuda library.

Right now, this is not used by anything, and only tests if the CUDA
headers are available (and not, e.g., that linking works.)

Extra changes:
- cmake/public/cuda.cmake now is correctly include guarded, so you
  can include it multiple times without trouble.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Reviewed By: smessmer

Differential Revision: D13025313

fbshipit-source-id: fda85b4c35783ffb48ddd6bbb98dbd9154119d86

5 years agoSwitch Int8Add operator to QNNPACK (#14089)
Marat Dukhan [Mon, 19 Nov 2018 07:55:01 +0000 (23:55 -0800)]
Switch Int8Add operator to QNNPACK (#14089)

Summary:
- Improved single-threaded performance due to optimized low-level micro-kernels
- Improved parallelization (previously was parallelized across images in a batch and pixels only, now within channels as well)
- Slightly different results due to a different implementation of fixed-point arithmetic (no accuracy loss expected)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14089

Differential Revision: D13110135

Pulled By: Maratyszcza

fbshipit-source-id: 1f149394af5c16940f79a3fd36e183bba1be2497

5 years agoNo more -werror for c10d (#14155)
Teng Li [Sun, 18 Nov 2018 21:51:15 +0000 (13:51 -0800)]
No more -werror for c10d (#14155)

Summary:
As the title says
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14155

Differential Revision: D13115769

Pulled By: teng-li

fbshipit-source-id: 278deba090364544d92fa603621604ce37fa974e

5 years agoAdd ultra low precision options (#14133)
Summer Deng [Sun, 18 Nov 2018 20:49:39 +0000 (12:49 -0800)]
Add ultra low precision options (#14133)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14133

Experiment with ultra low precisions on the Resnext-101 URU trunk model

Reviewed By: jspark1105

Differential Revision: D10108518

fbshipit-source-id: f04d74fbe1c9e75efafcd9845719bdb2efbbfe9c

5 years agoAdds symbolic diff for THNN Conv2d and aten native BatchNorm (#13888)
Soumith Chintala [Sun, 18 Nov 2018 17:20:29 +0000 (09:20 -0800)]
Adds symbolic diff for THNN Conv2d and aten native BatchNorm (#13888)

Summary:
Adds symbolic diff and tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13888

Differential Revision: D13115548

Pulled By: soumith

fbshipit-source-id: ba75b01a95a5715a7761724dda018168b6188917

5 years agoPrint warning when ROCm memory leaking is detected in pytorch tests (#14151)
Your Name [Sun, 18 Nov 2018 08:09:25 +0000 (00:09 -0800)]
Print warning when ROCm memory leaking is detected in pytorch tests (#14151)

Summary:
We keep seeing random failures in CI because of ROCm memory leaks, e.g.:

https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/3102//console
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/3080//console

To make the CI more stable, turn it into a warning instead of a failure.

iotamudelta please help investigate the memory leak
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14151

Differential Revision: D13115096

Pulled By: bddppq

fbshipit-source-id: a13b68274ecba363d9d8436aa6a62ac40a77d78c

5 years agoRemove debugging code in test_cholesky_batched (#14156)
vishwakftw [Sun, 18 Nov 2018 06:25:39 +0000 (22:25 -0800)]
Remove debugging code in test_cholesky_batched (#14156)

Summary:
The debug statements didn't turn up in my tests because I use pytest, which doesn't print debug output if the tests pass.

Differential Revision: D13115227

Pulled By: soumith

fbshipit-source-id: 46a7d47da7412d6b071158a23ab21e7fb0c6e11b

5 years agoBack out "[reland][codemod][caffe2] Tensor construction: combine Resize+mutable_data...
Jerry Zhang [Sun, 18 Nov 2018 03:42:42 +0000 (19:42 -0800)]
Back out "[reland][codemod][caffe2] Tensor construction: combine Resize+mutable_data - 2/4" (#14154)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14154

Original commit changeset: e89c2e692178

Reviewed By: amateurcoffee

Differential Revision: D13115023

fbshipit-source-id: 8f9fb55842ae6c8139d5cd88ec6d0abb0c5cc5e7

5 years agoCostInference for 1D conv (#14009)
Martin Schatz [Sun, 18 Nov 2018 01:26:09 +0000 (17:26 -0800)]
CostInference for 1D conv (#14009)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14009

As title

Reviewed By: yinghai

Differential Revision: D13078718

fbshipit-source-id: 081e7b13ad6741c635ef413915b555f10f93bd33

5 years agoBatched cholesky decomposition (#14017)
vishwakftw [Sat, 17 Nov 2018 18:47:17 +0000 (10:47 -0800)]
Batched cholesky decomposition (#14017)

Summary:
Implements batching for the Cholesky decomposition.

Performance could be improved with dedicated batched `tril` and `triu` ops; their absence also impedes the autograd support.

Changes made:
- batching code
- tests in `test_torch.py`, `test_cuda.py` and `test_autograd.py`.
- doc string modification
- autograd modification
- removal of `_batch_potrf` in `MultivariateNormal`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14017
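What "batched" means here, in miniature: the factorization is applied independently over a leading batch dimension. Below is a hand-rolled 2x2 Cholesky mapped over a batch, purely as a sketch of the semantics (the real kernels handle arbitrary sizes and run on GPU):

```python
import math

def chol2x2(m):
    # Cholesky of a symmetric positive-definite [[a, b], [b, c]].
    (a, b), (_, c) = m
    l00 = math.sqrt(a)
    l10 = b / l00
    l11 = math.sqrt(c - l10 * l10)
    return [[l00, 0.0], [l10, l11]]

def batched_chol2x2(batch):
    # Batching: map the per-matrix routine over the leading dimension.
    return [chol2x2(m) for m in batch]

print(batched_chol2x2([[[4.0, 2.0], [2.0, 3.0]]]))
# [[[2.0, 0.0], [1.0, 1.4142135623730951]]]
```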

Differential Revision: D13087945

Pulled By: ezyang

fbshipit-source-id: 2386db887140295475ffc247742d5e9562a42f6e

5 years agoremove unnecessary file from avx2 list (#14012)
Jongsoo Park [Sat, 17 Nov 2018 18:26:56 +0000 (10:26 -0800)]
remove unnecessary file from avx2 list (#14012)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14012

conv_dnnlowp_op.cc doesn't need avx2 anymore.

Reviewed By: dskhudia

Differential Revision: D13079665

fbshipit-source-id: dbfe8d2213de4969b6334d54de81d51149268cbd

5 years agoChange from using enum to int to store data_type
Your Name [Sat, 17 Nov 2018 17:22:09 +0000 (09:22 -0800)]
Change from using enum to int to store data_type

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14140

Differential Revision: D13112937

Pulled By: bddppq

fbshipit-source-id: 124d9546bfbd1f9c207a21e40eb3646f7739bd58

5 years agoRevert "CircleCI: fix NCCL install (#14124)" (#14146)
Junjie Bai [Sat, 17 Nov 2018 08:20:44 +0000 (00:20 -0800)]
Revert "CircleCI: fix NCCL install (#14124)" (#14146)

Summary:
This reverts commit a1fa9d8cf9b2b0e7373ec420c2487d4dfd0e587c.

[pytorch_linux_trusty_py2_7_9_build](https://circleci.com/gh/pytorch/pytorch/270206?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link/console):
```
Nov 17 07:37:27 + sudo apt-get -qq update
Nov 17 07:37:30 W: Ignoring Provides line with DepCompareOp for package gdb-minimal
Nov 17 07:37:30 W: You may want to run apt-get update to correct these problems
Nov 17 07:37:30 + sudo apt-get -qq install --allow-downgrades --allow-change-held-packages openmpi-bin libopenmpi-dev
Nov 17 07:37:30 E: Command line option --allow-downgrades is not understood
Nov 17 07:37:30 + cleanup
Nov 17 07:37:30 + retcode=100
Nov 17 07:37:30 + set +x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14146

Differential Revision: D13113912

Pulled By: bddppq

fbshipit-source-id: cd9d371cf72159f03d12a8b56ed5bd2060ebbe59

5 years agoRevert D10428917: [Caffe2] Add cost into profile observer
Junjie Bai [Sat, 17 Nov 2018 07:26:12 +0000 (23:26 -0800)]
Revert D10428917: [Caffe2] Add cost into profile observer

Differential Revision:
D10428917

Original commit changeset: 7c100e551bdd

fbshipit-source-id: 5164d9ba61cc103eccfdeb91a5cc140cea31a819

5 years agoRevert D10439558: Add cost for non-linear ops
Junjie Bai [Sat, 17 Nov 2018 07:26:12 +0000 (23:26 -0800)]
Revert D10439558: Add cost for non-linear ops

Differential Revision:
D10439558

Original commit changeset: 9aeb05bac8b5

fbshipit-source-id: f00977b4f95bdd500d254eb44fb5b0c816506ee4

5 years agoUpdate FXdiv submodule (#14128)
Marat Dukhan [Sat, 17 Nov 2018 05:57:42 +0000 (21:57 -0800)]
Update FXdiv submodule (#14128)

Summary:
Use the most recent version that disables inline assembly.
I suspect inline assembly causes miscompilation on some versions of gcc7.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14128

Reviewed By: bddppq

Differential Revision: D13112370

Pulled By: Maratyszcza

fbshipit-source-id: 36cc95dc51390a293b72c18ae982c3a515a11981

5 years agoRename neon2sse.h to NEON_2_SSE.h to match upstream repo
Marat Dukhan [Sat, 17 Nov 2018 05:21:40 +0000 (21:21 -0800)]
Rename neon2sse.h to NEON_2_SSE.h to match upstream repo

Summary:
- NEON2SSE is a header that implements NEON intrinsics on top of SSE intrinsics
- The upstream repo provides the NEON_2_SSE.h header, but internally it was imported as neon2sse.h
- This patch fixes incompatibilities between the internal and upstream versions

Reviewed By: hlu1

Differential Revision: D13096755

fbshipit-source-id: 65e1df9a2a5e74bd52c9aee9be27469ba938cd8c

5 years agoDisable QNNPACK for multi-architecture iOS builds (#14125)
Marat Dukhan [Sat, 17 Nov 2018 05:02:37 +0000 (21:02 -0800)]
Disable QNNPACK for multi-architecture iOS builds (#14125)

Summary:
QNNPACK contains assembly files, and CMake tries to build them for the wrong architectures in multi-arch builds. This patch has two effects:
- Disables QNNPACK in multi-arch iOS builds
- Specifies a single `IOS_ARCH=arm64` by default (covers most iPhones/iPads on the market)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14125

Differential Revision: D13112366

Pulled By: Maratyszcza

fbshipit-source-id: b369083045b440e41d506667a92e41139c11a971

5 years agoRegister caffe2 layer norm with c10 dispatcher (#13693)
Sebastian Messmer [Sat, 17 Nov 2018 04:10:31 +0000 (20:10 -0800)]
Register caffe2 layer norm with c10 dispatcher (#13693)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13693

We can't call the caffe2::Operator class directly from c10 because that class hasn't been de-protobuffed yet.
Instead, we factor out the kernel into a reusable static method, call it from the caffe2::Operator, and
also register it with c10.

Reviewed By: ezyang

Differential Revision: D12912242

fbshipit-source-id: c57502f14cea7a8be281f9787b175bb6e402d00c

5 years agoAdd c10/core/ to cmake build (#14111)
Sebastian Messmer [Sat, 17 Nov 2018 04:10:30 +0000 (20:10 -0800)]
Add c10/core/ to cmake build (#14111)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14111

It was already in TARGETs, but we forgot it in cmake.

Reviewed By: ezyang

Differential Revision: D13105166

fbshipit-source-id: f09549e98ebca751339b5ada1150e00cc4cd9540

5 years agoUpdate atol scale in dnnlowp test (#14135)
Haixin Liu [Sat, 17 Nov 2018 03:08:49 +0000 (19:08 -0800)]
Update atol scale in dnnlowp test (#14135)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14135

Update the atol scale of the dnnlowp test. Can't reproduce the flaky test error locally, even after setting the same seed value, but according to the comments in check_quantized_results_close(), atol_scale should be 1/1.9 = 0.526315789473684, which is larger than the current value of 0.51. So increase atol_scale to 0.53.

Reviewed By: jspark1105

Differential Revision: D13108415

fbshipit-source-id: 1e8840659fdf0092f51b439cf499858795f9706a

5 years agofix sparse_adagrad param_size overflow error (#14049)
Jongsoo Park [Sat, 17 Nov 2018 02:49:08 +0000 (18:49 -0800)]
fix sparse_adagrad param_size overflow error (#14049)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14049

param_size should be passed as int64_t

Reviewed By: hyuen

Differential Revision: D13090511

fbshipit-source-id: 7892d315d7c82c7d7ca103fb36d30cdf1fe24785

5 years agoAdd cost for non-linear ops (#13327)
Haixin Liu [Sat, 17 Nov 2018 02:30:49 +0000 (18:30 -0800)]
Add cost for non-linear ops (#13327)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13327

Add a cost inference function to the non-linear ops. Since the actual flops of a non-linear operator depend on the implementation, we use the number of non-linear operations as a proxy for the analytical flops of non-linear operators.

Reviewed By: jspark1105

Differential Revision: D10439558

fbshipit-source-id: 9aeb05bac8b5c7ae5d351ebf365e0a81cf4fc227

5 years agoAdd cost into profile observer (#12793)
Haixin Liu [Sat, 17 Nov 2018 02:30:49 +0000 (18:30 -0800)]
Add cost into profile observer (#12793)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12793

Add analytical cost into the profile observer. It includes op-level cost information for each op run and net-level cost information aggregated by op type.

It outputs the following information:
1. analytical flops
2. analytical bytes_read
3. analytical bytes_written

Example output at op level:
```I1017 14:58:14.245978 3686541 profile_observer_gpu.cc:26] --------- Starting operator FC op#24 ---------
I1017 14:58:14.246049 3686541 profile_observer_gpu.cc:33] Input 0: Tensor model1/embedded_encoder_inputs of type float. Dims: (17,1,256,):
I1017 14:58:14.246109 3686541 profile_observer_gpu.cc:33] Input 1: Tensor model1/encoder/layer0/fw/milstm/i2h_w of type float. Dims: (2048,256,):
I1017 14:58:14.246176 3686541 profile_observer_gpu.cc:33] Input 2: Tensor model1/encoder/layer0/fw/milstm/i2h_b of type float. Dims: (2048,):
I1017 14:58:14.246217 3686541 profile_observer_gpu.cc:44] Argument 0: name: "use_cudnn" i: 1
I1017 14:58:14.246271 3686541 profile_observer_gpu.cc:44] Argument 1: name: "cudnn_exhaustive_search" i: 0
I1017 14:58:14.246338 3686541 profile_observer_gpu.cc:44] Argument 2: name: "order" s: "NHWC"
I1017 14:58:14.246372 3686541 profile_observer_gpu.cc:44] Argument 3: name: "axis" i: 2
I1017 14:58:14.246418 3686541 profile_observer_gpu.cc:44] Argument 4: name: "quantization_scheme" i: 1
I1017 14:58:14.246470 3686541 profile_observer_gpu.cc:53] Output 0: Tensor model1/encoder/layer0/fw/milstm/i2h of type float. Dims: (17,1,2048,):
I1017 14:58:14.246596 3686541 profile_observer_gpu.cc:61] Cost (flops, bytes_read, bytes_written):
I1017 14:58:14.246649 3686541 profile_observer_gpu.cc:62]        17860608 2122752 139264
I1017 14:58:14.246677 3686541 profile_observer_gpu.cc:64] --------- Finished operator FC in 0.764221 ms ---------
```
Example output at net level:
```
I1017 11:13:44.675585 3146691 profile_observer_gpu.cc:165] ================ Detailed stats for net model0/encoder/layer0/bw/milstm ================
I1017 11:13:44.675662 3146691 profile_observer_gpu.cc:167] Cost (flops, bytes_read, bytes_written) per operator type:
I1017 11:13:44.675706 3146691 profile_observer_gpu.cc:169]        20992000 42045440 81920 FC
I1017 11:13:44.675745 3146691 profile_observer_gpu.cc:169]           20480 163840 81920 Mul
I1017 11:13:44.675824 3146691 profile_observer_gpu.cc:169]           20480 163840 81920 Sum
I1017 11:13:44.675878 3146691 profile_observer_gpu.cc:169]               0 0 0 ElementwiseLinear
I1017 11:13:44.675909 3146691 profile_observer_gpu.cc:169]               0 0 0 LSTMUnit
I1017 11:13:44.675958 3146691 profile_observer_gpu.cc:169]               0 0 0 rnn_internal_apply_link
```
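The op-level numbers above are reproducible from the shapes in the log. Below is a sketch of the analytical FC cost, assuming float32 and a multiply-add-plus-bias flop count; the formulas are inferred from the example output, not copied from the source:

```python
def fc_cost(M, N, K, itemsize=4):
    # FC: (M, K) input x (N, K) weight + (N,) bias -> (M, N) output.
    flops = M * N * (2 * K + 1)                  # mul+add per K, plus bias add
    bytes_read = (M * K + N * K + N) * itemsize  # input + weight + bias
    bytes_written = M * N * itemsize             # output
    return flops, bytes_read, bytes_written

# Input (17, 1, 256) with axis=2 flattens to M=17, K=256; weight is (2048, 256).
print(fc_cost(17, 2048, 256))  # (17860608, 2122752, 139264), matching the log
```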

Reviewed By: mdschatz

Differential Revision: D10428917

fbshipit-source-id: 7c100e551bdd3ac8d7c09be12c72d70a2d67cae1

5 years agoCircleCI: fix NCCL install (#14124)
Will Feng [Sat, 17 Nov 2018 02:28:55 +0000 (18:28 -0800)]
CircleCI: fix NCCL install (#14124)

Summary:
The `$BUILD_ENVIRONMENT` checks work in `test.sh` but not in `build.sh`; this PR is trying to figure out why.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14124

Reviewed By: teng-li

Differential Revision: D13112483

Pulled By: yf225

fbshipit-source-id: 5f65997586648805cf52217a261389625b5535e1