Add GPU support for float16 batched matmul (#18436)
author     Ben Barsdell <benbarsdell@gmail.com>
           Thu, 10 May 2018 18:06:01 +0000 (11:06 -0700)
committer  Jonathan Hseu <vomjom@vomjom.net>
           Thu, 10 May 2018 18:06:01 +0000 (11:06 -0700)
commit     f08f24cd559b5824a1874a0e76d339875e43f366
tree       ade423df2e77815bcc246064124fbd0ecbe8e286
parent     9201e2c002667047b1807745c4a7d6a8e5f2e9da
Add GPU support for float16 batched matmul (#18436)

* Add GPU support for float16 batched matmul

- Uses cublasGemmBatchedEx, introduced in CUDA 9.1.
- Includes support for Tensor Op math.
- Falls back to a loop over non-batched gemm calls on older CUDA
  versions or GPU architectures (see the sketches below).
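
A minimal sketch of what the CUDA 9.1+ path boils down to (this is not the actual stream_executor code; HalfGemmBatched and its parameters are hypothetical, and column-major NN layout with fp32 accumulation is assumed):

    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    cublasStatus_t HalfGemmBatched(cublasHandle_t handle, int m, int n, int k,
                                   const __half** a_ptrs,  // device array of A pointers
                                   const __half** b_ptrs,  // device array of B pointers
                                   __half** c_ptrs,        // device array of C pointers
                                   int batch_count, bool use_tensor_ops) {
      const float alpha = 1.0f, beta = 0.0f;
      // Opt in to Tensor Core kernels on Volta+; a no-op on older parts.
      cublasSetMathMode(handle, use_tensor_ops ? CUBLAS_TENSOR_OP_MATH
                                               : CUBLAS_DEFAULT_MATH);
      // fp16 storage with fp32 accumulation. CUDA 9.x types computeType as
      // cudaDataType; cuBLAS 11+ uses cublasComputeType_t (CUBLAS_COMPUTE_32F).
      return cublasGemmBatchedEx(
          handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
          reinterpret_cast<const void**>(a_ptrs), CUDA_R_16F, /*lda=*/m,
          reinterpret_cast<const void**>(b_ptrs), CUDA_R_16F, /*ldb=*/k,
          &beta, reinterpret_cast<void**>(c_ptrs), CUDA_R_16F, /*ldc=*/m,
          batch_count, CUDA_R_32F,
          use_tensor_ops ? CUBLAS_GEMM_DEFAULT_TENSOR_OP : CUBLAS_GEMM_DEFAULT);
    }

Note that cublasGemmBatchedEx expects the per-batch pointer arrays themselves to reside in device memory.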

* Refactor GPU batched gemm into one internal function (fallback path sketched below)
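
A matching sketch of the fallback loop that the refactor folds into the same internal function (again with hypothetical names; cublasSgemmEx is one pre-9.1 routine that accepts fp16 inputs and outputs with fp32 accumulation, and its pointer arguments, unlike the batched call above, are ordinary host-visible pointers to device matrices):

    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    cublasStatus_t HalfGemmLoop(cublasHandle_t handle, int m, int n, int k,
                                const __half* const* a, const __half* const* b,
                                __half* const* c, int batch_count) {
      const float alpha = 1.0f, beta = 0.0f;
      // One non-batched GEMM per batch entry; stop on the first failure.
      for (int i = 0; i < batch_count; ++i) {
        cublasStatus_t s = cublasSgemmEx(
            handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
            a[i], CUDA_R_16F, /*lda=*/m,
            b[i], CUDA_R_16F, /*ldb=*/k,
            &beta, c[i], CUDA_R_16F, /*ldc=*/m);
        if (s != CUBLAS_STATUS_SUCCESS) return s;
      }
      return CUBLAS_STATUS_SUCCESS;
    }
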
tensorflow/core/kernels/batch_matmul_op_impl.h
tensorflow/core/kernels/batch_matmul_op_real.cc
tensorflow/stream_executor/blas.h
tensorflow/stream_executor/cuda/cuda_blas.cc
tensorflow/stream_executor/cuda/cuda_blas.h
tensorflow/stream_executor/stream.cc
tensorflow/stream_executor/stream.h