param.h defines a per-platform SWITCH_RATIO, which is used as a measure for how fine
grained the blocks for gemm need to be split up. Many platforms define this to 4.
The reality is that the gemm low level implementation for SkylakeX likes bigger blocks
due to the nature of SIMD... by tuning the SWITCH_RATIO to 32 the threading performance
improves significantly:
Before
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7%
64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0%
65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3%
80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6%
96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4%
112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6%
128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3%
After
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10666.3 10.6 0.4% 18236.9 6.2 -1.4%
64 x 64 20410.1 13.0 1.8% 39925.8 6.6 1.7%
65 x 65 34983.0 7.9 -30.2% 51494.6 5.4 2.0%
80 x 80 39769.1 13.0 -4.4% 63805.2 8.1 12.0%
96 x 96 45169.6 19.7 26.7% 80065.8 11.1 29.8%
112 x 112 57026.1 24.7 38.7% 99535.5 14.2 44.1%
128 x 128 64789.8 32.5 51.3% 117407.2 17.9 54.6%
With this change, threading starts to be a win already at 96x96
#define SYMV_P 8
-#define SWITCH_RATIO 4
+#define SWITCH_RATIO 32
#ifdef ARCH_X86