We recently changed the register blocking for SGEMM on s390x to 16x4.
However, we did not adjust Q to a multiple of 16 and thus needlessly
fell back to the 8x4 kernel at each block's margin. Adjust P and Q to
multiples of 16 so that full-sized blocks are handled entirely by the
faster 16x4 kernel.
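
For illustration, the arithmetic behind the new values can be sketched
as below. This is a simplified model, not code from the kernel: it
assumes the margin of a block is just the remainder of the block size
divided by the 16-wide unroll, and leftover() is a hypothetical helper
introduced only for this example.

```c
/* Hypothetical helper (not part of OpenBLAS): number of elements left
 * over after tiling `block` elements with tiles of width `unroll`.
 * Under the simplifying assumption above, a nonzero result means the
 * tail of the block cannot use the full-width kernel. */
static int leftover(int block, int unroll)
{
    return block % unroll;
}

/* Values taken from this change; the unroll width 16 is the wide
 * dimension of the 16x4 register blocking. */
enum {
    UNROLL_WIDE = 16,
    OLD_P = 456, NEW_P = 480,
    OLD_Q = 488, NEW_Q = 512
};
```

With these inputs, leftover(OLD_P, UNROLL_WIDE) and
leftover(OLD_Q, UNROLL_WIDE) both evaluate to 8, while the new values
leave a remainder of 0.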

Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
-#define SGEMM_DEFAULT_P 456
+#define SGEMM_DEFAULT_P 480
#define DGEMM_DEFAULT_P 320
#define CGEMM_DEFAULT_P 480
#define ZGEMM_DEFAULT_P 224
-#define SGEMM_DEFAULT_Q 488
+#define SGEMM_DEFAULT_Q 512
#define DGEMM_DEFAULT_Q 384
#define CGEMM_DEFAULT_Q 128
#define ZGEMM_DEFAULT_Q 352