s390x: Use new sgemm kernel also for strmm on Z14 and newer
Employ the newly added GEMM kernel also for STRMM on Z14. The
implementation in C with vector intrinsics exploits FP32 SIMD operations
and thereby gains performance over the existing assembly code. Extend
the implementation for handling triangular matrix multiplication,
accordingly. As added benefit, the more flexible C code enables us to
adjust register blocking in the subsequent commit.
Tested via make -C test / ctest / utest and by a couple of additional
unit tests that exercise blocking.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>