Only initialize the part of the jobs array that will get used
The jobs array is initialized in O(N^2) time, where N is the number of
CPUs compiled in (NUM_THREADS), not the number of threads actually used.
Distros and people with bigger systems will set this value fairly high
(128, 256 or more), which leads to a steep drop in performance as soon
as threading kicks in.
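The difference between the two initialization strategies can be sketched as follows. This is only an illustration under assumed names: `MAX_CPUS`, `SLOTS`, `job_t`, `init_jobs_full` and `init_jobs_used` are hypothetical and are not taken from the OpenBLAS source. Both functions return the number of entries they write so the cost is easy to compare.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical; this is not the OpenBLAS source. */
#define MAX_CPUS 128   /* compile-time limit, like NUM_THREADS=128 */
#define SLOTS    2     /* assumed number of flags per thread pair */

typedef struct {
    volatile long working[MAX_CPUS][SLOTS];
} job_t;

job_t jobs[MAX_CPUS];

/* Clears every entry: MAX_CPUS * MAX_CPUS * SLOTS writes on each call,
 * no matter how many threads will actually run.
 * Returns the number of writes performed. */
size_t init_jobs_full(void)
{
    size_t writes = 0;
    for (int i = 0; i < MAX_CPUS; i++)
        for (int j = 0; j < MAX_CPUS; j++)
            for (int k = 0; k < SLOTS; k++) {
                jobs[i].working[j][k] = 0;
                writes++;
            }
    return writes;
}

/* Clears only the nthreads x nthreads corner that the running threads
 * can touch: nthreads * nthreads * SLOTS writes.
 * Returns the number of writes performed. */
size_t init_jobs_used(int nthreads)
{
    assert(nthreads >= 0 && nthreads <= MAX_CPUS);
    size_t writes = 0;
    for (int i = 0; i < nthreads; i++)
        for (int j = 0; j < nthreads; j++)
            for (int k = 0; k < SLOTS; k++) {
                jobs[i].working[j][k] = 0;
                writes++;
            }
    return writes;
}
```

With these assumed sizes, a run on 4 threads needs 32 writes in the partial version, while the full version performs 32768 writes on every call.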
The single-threaded baseline reaches roughly 13 to 15 multiplications per
cycle (MPC) for SGEMM in the interesting range (threading kicks in at
65x65 multiplied by 65x65).
In theory the hardware is capable of 32 multiplications per cycle.
Matrix       SGEMM cycles    MPC    vs base   DGEMM cycles    MPC    vs base
48 x 48           10703.9   10.6       0.0%        17990.6    6.3       0.0%
64 x 64           20778.4   12.8       0.0%        40629.2    6.5       0.0%
65 x 65           26869.9   10.3       0.0%        52545.7    5.3       0.0%
80 x 80           38104.5   13.5       0.0%        72492.7    7.1       0.0%
96 x 96           61626.4   14.4       0.0%       113983.8    7.8       0.0%
112 x 112         91803.8   15.3       0.0%       180987.3    7.8       0.0%
128 x 128        133161.4   15.8       0.0%       258374.3    8.1       0.0%
When threading is turned on:
TARGET=SKYLAKEX F_COMPILER=GFORTRAN SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0 NUM_THREADS=128
Matrix       SGEMM cycles    MPC    vs base   DGEMM cycles    MPC    vs base
48 x 48           10725.9   10.5      -0.2%        18134.9    6.2      -0.8%
64 x 64           20500.6   12.9       1.3%        40929.1    6.5      -0.7%
65 x 65         2040832.1    0.1   -7495.2%      2097633.6    0.1   -3892.0%
80 x 80         2063129.1    0.2   -5314.4%      2119925.2    0.2   -2824.3%
96 x 96         2070374.5    0.4   -3259.6%      2173604.4    0.4   -1806.9%
112 x 112       2111721.5    0.7   -2169.6%      2263330.8    0.6   -1170.0%
128 x 128       2276181.5    0.9   -1609.3%      2377228.9    0.9    -820.1%
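There is a very deep cliff once you hit 65x65: SGEMM takes about 76 times
as many cycles as the single-threaded baseline, and DGEMM about 40 times.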
With this patch:
Matrix       SGEMM cycles    MPC    vs base   DGEMM cycles    MPC    vs base
48 x 48           10630.0   10.6       0.7%        18112.8    6.2      -0.7%
64 x 64           20374.8   13.0       1.9%        40487.0    6.5       0.4%
65 x 65          141955.2    1.9    -428.3%       146708.8    1.9    -179.2%
80 x 80          178921.1    2.9    -369.6%       186032.7    2.8    -156.6%
96 x 96          205436.2    4.3    -233.4%       224513.1    3.9     -97.0%
112 x 112        244408.2    5.8    -162.7%       262158.7    5.4     -47.1%
128 x 128        321334.5    6.5    -141.3%       333829.0    6.3     -29.2%
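The cliff is very significantly reduced: at 65x65, SGEMM drops from about
2.04 million cycles to about 142 thousand. That is still roughly 5 times
slower than the single-threaded baseline.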
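The percentage printed after each MPC value is consistent with the relative change in cycles against the single-threaded baseline, where a negative value means slower. A minimal sketch of that arithmetic, with `change_vs_baseline` as a hypothetical helper name that is not part of the benchmark code:

```c
#include <assert.h>

/* Hypothetical helper: relative change in cycles versus the baseline,
 * in percent. Negative means the measured run needed more cycles. */
double change_vs_baseline(double base_cycles, double cycles)
{
    assert(base_cycles > 0.0);
    return (base_cycles - cycles) / base_cycles * 100.0;
}
```

For example, `change_vs_baseline(26869.9, 2040832.1)` gives about -7495.2, the value shown for threaded SGEMM at 65 x 65, and `change_vs_baseline(26869.9, 141955.2)` gives about -428.3, the value for the same case with this patch.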
(more to follow)