Implementing cuda kernel for tril_indices and triu_indices (#15203)
Summary:
Followup PR of #14904, and the stretch goal of #12653.
Directly calculate coordinates in the original tensor using column index in the result tensor. Every GPU thread takes care of a column (two numbers) in the output tensor.
The implementation detects and handles precision loss during calculating the square root of a `int64_t` variable, and supports tensors with up to `row * column = 2 ^ 59` numbers.
Algorithm details are describe in [comments of TensorFactories.cu](https://github.com/pytorch/pytorch/blob/
23ddb6f58a1c8a7a660a793f174cf014230176c6/aten/src/ATen/native/cuda/TensorFactories.cu#L109-L255).
zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15203
Reviewed By: zou3519
Differential Revision:
D13517695
Pulled By: mrshenli
fbshipit-source-id:
86b305d22cac08c8962a3b0cf8e9e620b7ec33ea