Fix sparse mm for ROCm (#18985)
Summary:
* Also annotate the two-pass reduction with launch bounds
* ifdef around some ROCm shortcomings w.r.t. short-circuit returns (internal tickets filed)
* while there, plug a memory leak by destroying the matrix descriptor after the sparse call (also applicable to cuSPARSE)
* while there, fix the argument types for cusparseXcoo2csr to match the cuSPARSE documentation
* enable test_dsmm in test_sparse, which now passes
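
For context on the cusparseXcoo2csr type fix: the cuSPARSE prototype takes plain `int` for the row indices, `nnz`, `m`, and the output row pointers. The following is a minimal CPU sketch of the same COO-to-CSR row compression, not the code from this PR. The name `coo2csr_reference` is hypothetical, and the sketch assumes zero-based row indices sorted by row.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU reference for the row-pointer compression that
// cusparseXcoo2csr performs on the device. Assumes cooRowInd is
// zero-based and sorted by row. All indices are plain int, mirroring
// the types in the cuSPARSE prototype.
std::vector<int> coo2csr_reference(const std::vector<int>& cooRowInd, int m) {
  std::vector<int> csrRowPtr(m + 1, 0);
  // Count the nonzeros in each row, shifted by one slot.
  for (int row : cooRowInd) {
    assert(row >= 0 && row < m);
    csrRowPtr[row + 1] += 1;
  }
  // Prefix-sum the counts so csrRowPtr[i] is the offset of row i
  // and csrRowPtr[m] equals nnz.
  for (int i = 0; i < m; ++i) {
    csrRowPtr[i + 1] += csrRowPtr[i];
  }
  return csrRowPtr;
}
```

Passing index buffers of a different integer width than the prototype expects would make the device routine misread them, which is the kind of mismatch the type fix guards against.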
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18985
Differential Revision: D14822009
Pulled By: bddppq
fbshipit-source-id: 757267a47a63ee56ef396c33059f7eca099f4833