nvfuser update (#63745)
author jiej <jiej@nvidia.com>
Wed, 15 Sep 2021 21:40:18 +0000 (14:40 -0700)
committer Facebook GitHub Bot <facebook-github-bot@users.noreply.github.com>
Wed, 15 Sep 2021 21:42:55 +0000 (14:42 -0700)
commit cfaecaf40bd6cabd3f4e0ef0d8c7252655349b61
tree 3a6ef7ffbc07672434dfc650ad1dddb2c5c66a5e
parent 59988f81bda9b0fd3db5cf61992a3b2ec8f4f147

Summary:
Syncing the nvfuser code base from the devel branch. A few of our developments since the last sync:

- Extends support to normalization and reduction kernels.
- Multiple kernel launches for a single `CudaFusionGroup`. The hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalars into compile-time constants, which are required by the codegen (e.g. reduction axes).
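
The idea behind the last bullet can be illustrated with a toy, pure-Python sketch. It is not nvfuser code and does not use any PyTorch API: `make_sum_kernel` and `ProfiledSum` are hypothetical names invented for this illustration. The sketch assumes a "codegen" step that needs the reduction axes up front, so the wrapper records the axes seen at call time, builds a kernel specialized to them, and caches it so repeated calls with the same axes skip the build step.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def make_sum_kernel(axes: Tuple[int, ...]) -> Callable[[Matrix], object]:
    """Hypothetical stand-in for codegen: `axes` is fixed when the kernel is built."""
    if axes == (0,):
        # Reduce over rows: one sum per column.
        return lambda m: [sum(col) for col in zip(*m)]
    if axes == (1,):
        # Reduce over columns: one sum per row.
        return lambda m: [sum(row) for row in m]
    if axes == (0, 1):
        # Reduce over both axes: a single scalar.
        return lambda m: sum(sum(row) for row in m)
    raise ValueError(f"unsupported reduction axes: {axes}")


class ProfiledSum:
    """Hypothetical wrapper: treats the runtime `axes` value as a build-time constant."""

    def __init__(self) -> None:
        self.cache: Dict[Tuple[int, ...], Callable[[Matrix], object]] = {}
        self.compile_count = 0

    def __call__(self, m: Matrix, axes: Sequence[int]) -> object:
        # The observed scalar value becomes the cache key (the "guard").
        key = tuple(sorted(axes))
        kernel = self.cache.get(key)
        if kernel is None:
            # First time these axes are seen: build a specialized kernel.
            kernel = make_sum_kernel(key)
            self.cache[key] = kernel
            self.compile_count += 1
        return kernel(m)


if __name__ == "__main__":
    fused = ProfiledSum()
    m = [[1.0, 2.0], [3.0, 4.0]]
    print(fused(m, [1]), fused(m, [1]), fused.compile_count)
```

In this sketch a call with previously unseen axes simply triggers another build; how nvfuser itself handles a profiled value that later changes is not described in this summary.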

To keep this PR simple and relatively review-free, we stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.

Internal updates are in files located in:
1. updates to nvfuser codegen: `torch/csrc/jit/codegen/cuda`
2. added nvfuser-specific benchmarks: `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests: `test/cpp/jit/test_gpu.cpp`, `test/cpp/jit/test_gpu_shift.cpp`, `test/cpp/jit/test_gpu_validator.h`

Updates affecting integration:

1. profile_ivalue is enabled for nvfuser; related changes are in `torch/csrc/jit/runtime/*`.
2. exposed a few more symbols in `aten/src/ATen/core/*` that are used by codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745

Reviewed By: saketh-are

Differential Revision: D30752939

Pulled By: malfet

fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c
181 files changed:
CMakeLists.txt
aten/src/ATen/core/aten_interned_strings.h
aten/src/ATen/core/interned_strings.h
benchmarks/cpp/nvfuser/CMakeLists.txt [new file with mode: 0644]
benchmarks/cpp/nvfuser/batch_norm.cpp [new file with mode: 0644]
benchmarks/cpp/nvfuser/bert.cpp [new file with mode: 0644]
benchmarks/cpp/nvfuser/broadcast.cpp [new file with mode: 0644]
benchmarks/cpp/nvfuser/gelu_backward.cpp [new file with mode: 0644]
benchmarks/cpp/nvfuser/heuristic_cache.cpp [new file with mode: 0644]
benchmarks/cpp/nvfuser/heuristic_lookup.cpp [new file with mode: 0644]
benchmarks/cpp/nvfuser/instance_norm.cpp [new file with mode: 0644]
benchmarks/cpp/nvfuser/layer_norm.cpp [new file with mode: 0644]
benchmarks/cpp/nvfuser/lstm_cell.cpp [new file with mode: 0644]
benchmarks/cpp/nvfuser/main.cpp [new file with mode: 0644]
benchmarks/cpp/nvfuser/reduction.cpp [new file with mode: 0644]
benchmarks/cpp/nvfuser/scale_bias_relu.cpp [new file with mode: 0644]
benchmarks/cpp/nvfuser/softmax.cpp [new file with mode: 0644]
benchmarks/cpp/nvfuser/utils.cpp [new file with mode: 0644]
benchmarks/cpp/nvfuser/utils.h [new file with mode: 0644]
caffe2/CMakeLists.txt
cmake/Summary.cmake
test/cpp/jit/CMakeLists.txt
test/cpp/jit/test_gpu.cpp
test/cpp/jit/test_gpu_shift.cpp [new file with mode: 0644]
test/cpp/jit/test_gpu_validator.h [new file with mode: 0644]
test/test_jit_cuda_fuser.py
tools/build_variables.bzl
torch/csrc/jit/codegen/cuda/arith.cpp
torch/csrc/jit/codegen/cuda/arith.h
torch/csrc/jit/codegen/cuda/codegen.cpp
torch/csrc/jit/codegen/cuda/codegen.h
torch/csrc/jit/codegen/cuda/compute_at.cpp
torch/csrc/jit/codegen/cuda/compute_at.h
torch/csrc/jit/codegen/cuda/compute_at_map.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/compute_at_map.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/disjoint_set.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/dispatch.cpp
torch/csrc/jit/codegen/cuda/dispatch.h
torch/csrc/jit/codegen/cuda/executor.cpp
torch/csrc/jit/codegen/cuda/executor.h
torch/csrc/jit/codegen/cuda/executor_kernel_arg.cpp
torch/csrc/jit/codegen/cuda/executor_kernel_arg.h
torch/csrc/jit/codegen/cuda/executor_launch_params.cpp
torch/csrc/jit/codegen/cuda/executor_launch_params.h
torch/csrc/jit/codegen/cuda/executor_utils.cpp
torch/csrc/jit/codegen/cuda/executor_utils.h
torch/csrc/jit/codegen/cuda/expr_evaluator.cpp
torch/csrc/jit/codegen/cuda/expr_evaluator.h
torch/csrc/jit/codegen/cuda/fusion.cpp
torch/csrc/jit/codegen/cuda/fusion.h
torch/csrc/jit/codegen/cuda/fusion_segmenter.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/fusion_segmenter.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/graph_fuser.cpp
torch/csrc/jit/codegen/cuda/index_compute.cpp
torch/csrc/jit/codegen/cuda/index_compute.h
torch/csrc/jit/codegen/cuda/index_reference_replay.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/index_reference_replay.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/instrumentation.cpp
torch/csrc/jit/codegen/cuda/instrumentation.h
torch/csrc/jit/codegen/cuda/interface.cpp
torch/csrc/jit/codegen/cuda/interface.h
torch/csrc/jit/codegen/cuda/ir_base_nodes.cpp
torch/csrc/jit/codegen/cuda/ir_base_nodes.h
torch/csrc/jit/codegen/cuda/ir_cloner.cpp
torch/csrc/jit/codegen/cuda/ir_cloner.h
torch/csrc/jit/codegen/cuda/ir_graphviz.cpp
torch/csrc/jit/codegen/cuda/ir_graphviz.h
torch/csrc/jit/codegen/cuda/ir_interface_nodes.h
torch/csrc/jit/codegen/cuda/ir_internal_nodes.h
torch/csrc/jit/codegen/cuda/ir_iostream.cpp
torch/csrc/jit/codegen/cuda/ir_iostream.h
torch/csrc/jit/codegen/cuda/ir_nodes.cpp
torch/csrc/jit/codegen/cuda/ir_printer.h
torch/csrc/jit/codegen/cuda/ir_utils.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/ir_utils.h
torch/csrc/jit/codegen/cuda/iter_visitor.cpp
torch/csrc/jit/codegen/cuda/iter_visitor.h
torch/csrc/jit/codegen/cuda/kernel.cpp
torch/csrc/jit/codegen/cuda/kernel.h
torch/csrc/jit/codegen/cuda/kernel_cache.cpp
torch/csrc/jit/codegen/cuda/kernel_cache.h
torch/csrc/jit/codegen/cuda/kernel_expr_evaluator.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/kernel_expr_evaluator.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/kernel_ir.cpp
torch/csrc/jit/codegen/cuda/kernel_ir.h
torch/csrc/jit/codegen/cuda/kernel_ir_builder.cpp
torch/csrc/jit/codegen/cuda/kernel_ir_builder.h
torch/csrc/jit/codegen/cuda/kernel_ir_printer.cpp
torch/csrc/jit/codegen/cuda/kernel_ir_printer.h
torch/csrc/jit/codegen/cuda/lower2device.cpp
torch/csrc/jit/codegen/cuda/lower2device.h
torch/csrc/jit/codegen/cuda/lower_alias_memory.cpp
torch/csrc/jit/codegen/cuda/lower_alias_memory.h
torch/csrc/jit/codegen/cuda/lower_allocation.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/lower_allocation.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/lower_expr_sort.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/lower_expr_sort.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/lower_index.cpp
torch/csrc/jit/codegen/cuda/lower_index.h
torch/csrc/jit/codegen/cuda/lower_insert_syncs.cpp
torch/csrc/jit/codegen/cuda/lower_insert_syncs.h
torch/csrc/jit/codegen/cuda/lower_loops.cpp
torch/csrc/jit/codegen/cuda/lower_loops.h
torch/csrc/jit/codegen/cuda/lower_magic_zero.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/lower_magic_zero.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/lower_misaligned_vectorization.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/lower_misaligned_vectorization.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/lower_predicate.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/lower_predicate.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/lower_shift.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/lower_shift.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/lower_thread_predicate.cpp
torch/csrc/jit/codegen/cuda/lower_thread_predicate.h
torch/csrc/jit/codegen/cuda/lower_trivial_reductions.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/lower_trivial_reductions.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/lower_unroll.cpp
torch/csrc/jit/codegen/cuda/lower_unroll.h
torch/csrc/jit/codegen/cuda/lower_utils.cpp
torch/csrc/jit/codegen/cuda/lower_utils.h
torch/csrc/jit/codegen/cuda/lower_validation.cpp
torch/csrc/jit/codegen/cuda/lower_validation.h
torch/csrc/jit/codegen/cuda/manager.cpp
torch/csrc/jit/codegen/cuda/mutator.cpp
torch/csrc/jit/codegen/cuda/ops/all_ops.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/ops/composite.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/ops/composite.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/ops/normalization.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/ops/normalization.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/parallel_dimension_map.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/parallel_dimension_map.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/parallel_type_bitmap.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/parallel_type_bitmap.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/parser.cpp
torch/csrc/jit/codegen/cuda/parser.h
torch/csrc/jit/codegen/cuda/partition.cpp
torch/csrc/jit/codegen/cuda/partition.h
torch/csrc/jit/codegen/cuda/predicate_compute.cpp
torch/csrc/jit/codegen/cuda/predicate_compute.h
torch/csrc/jit/codegen/cuda/register_interface.cpp
torch/csrc/jit/codegen/cuda/root_domain_map.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/root_domain_map.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/runtime/block_reduction.cu
torch/csrc/jit/codegen/cuda/runtime/block_sync_atomic.cu [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/runtime/block_sync_default.cu [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/runtime/broadcast.cu
torch/csrc/jit/codegen/cuda/runtime/fp16_support.cu
torch/csrc/jit/codegen/cuda/runtime/grid_reduction.cu
torch/csrc/jit/codegen/cuda/runtime/helpers.cu
torch/csrc/jit/codegen/cuda/runtime/random_numbers.cu
torch/csrc/jit/codegen/cuda/runtime/tensor.cu
torch/csrc/jit/codegen/cuda/runtime/welford.cu [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/scheduler.cpp [deleted file]
torch/csrc/jit/codegen/cuda/scheduler.h [deleted file]
torch/csrc/jit/codegen/cuda/scheduler/all_schedulers.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/scheduler/normalization.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/scheduler/normalization.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/scheduler/pointwise.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/scheduler/pointwise.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/scheduler/pointwise_heuristic.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/scheduler/reduction.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/scheduler/reduction.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/scheduler/reduction_heuristic.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/scheduler/registry.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/scheduler/registry.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/scheduler/utils.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/scheduler/utils.h [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/shape_inference.cpp
torch/csrc/jit/codegen/cuda/tensor_view.cpp
torch/csrc/jit/codegen/cuda/transform_iter.cpp
torch/csrc/jit/codegen/cuda/transform_iter.h
torch/csrc/jit/codegen/cuda/transform_replay.cpp
torch/csrc/jit/codegen/cuda/transform_replay.h
torch/csrc/jit/codegen/cuda/transform_rfactor.cpp
torch/csrc/jit/codegen/cuda/type.cpp
torch/csrc/jit/codegen/cuda/type.h
torch/csrc/jit/codegen/cuda/utils.cpp [new file with mode: 0644]
torch/csrc/jit/codegen/cuda/utils.h
torch/csrc/jit/runtime/autodiff.cpp
torch/csrc/jit/runtime/profiling_graph_executor_impl.cpp
torch/csrc/jit/runtime/profiling_record.cpp
torch/testing/_internal/jit_utils.py