As with the SKYLAKEX patch, a SWITCH_RATIO of 32 seems to work best
here (much better than 4 or 16).

Before (SWITCH_RATIO 4)
Matrix      SGEMM cycles    MPC            DGEMM cycles    MPC
48 x 48          15554.3    7.2      0.2%       30353.8    3.7     0.3%
64 x 64          30346.8    8.7      1.6%       63495.0    4.1    -0.1%
65 x 65          81668.1    3.4   -123.3%       82705.2    3.3   -21.2%
80 x 80         105045.9    4.9    -95.5%      115226.0    4.5    -2.2%
96 x 96         152461.2    5.8    -74.3%      148156.3    6.0    16.4%
112 x 112       188505.2    7.5    -42.2%      171187.3    8.2    36.4%
128 x 128       257884.0    8.1    -39.5%      224764.8    9.3    46.0%

Intermediate (SWITCH_RATIO 16)
Matrix      SGEMM cycles    MPC            DGEMM cycles    MPC
48 x 48          15565.7    7.2      0.2%       30378.9    3.7     0.2%
64 x 64          30430.2    8.7      1.3%       63046.4    4.2     0.6%
65 x 65          27306.0   10.1     25.3%       38879.2    7.1    43.0%
80 x 80          51008.7   10.1      5.1%       61007.6    8.4    45.9%
96 x 96          70856.7   12.5     19.0%       83403.1   10.6    53.0%
112 x 112        84769.9   16.6     36.0%       99920.1   14.1    62.9%
128 x 128        84213.2   25.0     54.5%      113024.2   18.6    72.8%

After (SWITCH_RATIO 32)
Matrix      SGEMM cycles    MPC            DGEMM cycles    MPC
48 x 48          15537.3    7.2      0.3%       30537.0    3.6    -0.3%
64 x 64          30352.7    8.7      1.6%       62597.8    4.2     1.3%
65 x 65          36857.0    7.5     -0.8%       56167.6    4.9    17.7%
80 x 80          42552.6   12.1     20.8%       69536.7    7.4    38.3%
96 x 96          52101.5   17.1     40.5%       91016.1    9.7    48.7%
112 x 112        63853.7   22.1     51.8%      110507.4   12.7    58.9%
128 x 128        73966.1   28.4     60.0%      163146.4   12.9    60.8%
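
For readers unfamiliar with this kind of tunable, here is a minimal sketch of how a switch ratio could gate multithreading. It is an illustration under stated assumptions, not the library's actual dispatch code: `pick_threads` and its parameters are hypothetical names, and the rule (stay single-threaded unless each thread gets at least `ratio` rows and columns) is only one plausible way such a constant might be used.

```c
/* Hypothetical helper, for illustration only. It assumes the switch
 * ratio is the minimum number of rows/columns each thread should get
 * before threading is considered worthwhile. */
static int pick_threads(long m, long n, int max_threads, int ratio)
{
    if (max_threads <= 1)
        return 1;

    /* Too little work per thread in either dimension: run on one thread. */
    if (m < (long)max_threads * ratio || n < (long)max_threads * ratio)
        return 1;

    return max_threads;
}
```

Under these assumptions, with 4 threads available, `pick_threads(65, 65, 4, 4)` returns 4 while `pick_threads(65, 65, 4, 32)` returns 1, so a larger ratio keeps small matrices on a single thread for longer.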

 #define SYMV_P 8
-#define SWITCH_RATIO 4
+#define SWITCH_RATIO 32
 #ifdef ARCH_X86