Summary:
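This PR aligns the Array struct to 16 bytes (alignas(16)) so that the compiler can emit the same vectorized memory instructions it uses for the CUDA vector types. The alignment is applied only when __HIP_PLATFORM_HCC__ is not defined.
I tested this on our Philox header. The links below show that the vector store instruction is emitted for the CUDA vector types and for Array with alignas, but not for Array without alignas.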
With CUDA vector types (uint4, uint2, float4): https://godbolt.org/z/UaWOmR
With alignas: https://godbolt.org/z/Eeh0t5
Without alignas: https://godbolt.org/z/QT63gq
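As a host-side illustration of what alignas changes, here is a minimal sketch that compares two hypothetical structs, PlainArray and AlignedArray. Neither is the real at::cuda::Array; both names are made up for this example. The assumption is that the compiler can only use a 16-byte vector store when it can prove the destination is 16-byte aligned, and that alignas(16) is what provides this guarantee.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for illustration only; not the actual at::cuda::Array.
template <typename T, int size>
struct PlainArray {
  T data[size];
};

template <typename T, int size>
struct alignas(16) AlignedArray {
  T data[size];
};

// Both variants carry the same 16 bytes of payload for four 32-bit words.
static_assert(sizeof(PlainArray<std::uint32_t, 4>) == 16, "payload is 16 bytes");
static_assert(sizeof(AlignedArray<std::uint32_t, 4>) == 16, "payload is 16 bytes");

// Without alignas, the struct only inherits the alignment of its element type.
static_assert(alignof(PlainArray<std::uint32_t, 4>) == alignof(std::uint32_t),
              "plain struct is aligned like its element type");

// With alignas(16), every instance is guaranteed to start on a 16-byte boundary.
static_assert(alignof(AlignedArray<std::uint32_t, 4>) == 16,
              "aligned struct starts on a 16-byte boundary");

// Trade-off: sizeof is rounded up to a multiple of the alignment,
// so a two-element array of 32-bit words grows from 8 to 16 bytes.
static_assert(sizeof(PlainArray<std::uint32_t, 2>) == 8, "no padding");
static_assert(sizeof(AlignedArray<std::uint32_t, 2>) == 16, "padded to 16 bytes");
```

The last two assertions show the cost of the change: instantiations whose payload is not a multiple of 16 bytes are padded.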
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14920
Differential Revision: D13406751
Pulled By: soumith
fbshipit-source-id: 685b1010ef1f576dde30c278b1e9b642f87c843d
namespace at { namespace cuda {
template <typename T, int size>
+#ifndef __HIP_PLATFORM_HCC__
+struct alignas(16) Array {
+#else
struct Array {
+#endif
T data[size];
C10_HOST_DEVICE T operator[](int i) const {