Don't use _Atomic for jobs sometimes...
The use of _Atomic leads to really bad code generation in the compiler
(on x86, you get 2 "mfence" memory barriers around each access with gcc8, despite
x86 being ordered and cache coherent). But there's a fallback in the code that
just uses volatile which is more than plenty in practice.
If we're nervous about cross thread synchronization for these variables, we should
make the YIELD function be a compiler/memory barrier instead.
performance before (after last commit)
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10630.0 10.6 0.7% 18112.8 6.2 -0.7%
64 x 64 20374.8 13.0 1.9% 40487.0 6.5 0.4%
65 x 65 141955.2 1.9 -428.3% 146708.8 1.9 -179.2%
80 x 80 178921.1 2.9 -369.6% 186032.7 2.8 -156.6%
96 x 96 205436.2 4.3 -233.4% 224513.1 3.9 -97.0%
112 x 112 244408.2 5.8 -162.7% 262158.7 5.4 -47.1%
128 x 128 321334.5 6.5 -141.3% 333829.0 6.3 -29.2%
Performance with this patch (roughly a 2x improvement):
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7%
64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0%
65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3%
80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6%
96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4%
112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6%
128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3%