Martin Kroeker [Mon, 25 Jun 2018 18:48:10 +0000 (20:48 +0200)]
Merge pull request #2 from martin-frbg/develop
merge develop
Martin Kroeker [Mon, 25 Jun 2018 18:45:56 +0000 (20:45 +0200)]
Merge pull request #1 from xianyi/develop
Merge xianyi:develop into develop
Martin Kroeker [Mon, 25 Jun 2018 17:23:40 +0000 (19:23 +0200)]
Merge pull request #1642 from oon3m0oo/develop
Rewrite &= -> = and simplify the initial blocking phase.
Craig Donner [Mon, 25 Jun 2018 12:53:11 +0000 (13:53 +0100)]
Rewrite &= -> = and simplify the initial blocking phase.
Martin Kroeker [Sat, 23 Jun 2018 17:42:15 +0000 (19:42 +0200)]
Add support for a user-defined list of dynamic targets
Martin Kroeker [Sat, 23 Jun 2018 17:41:32 +0000 (19:41 +0200)]
Add support for a user-defined list of dynamic targets
Martin Kroeker [Sat, 23 Jun 2018 13:01:02 +0000 (15:01 +0200)]
Merge pull request #1638 from martin-frbg/issue1637
Expose the CBLAS interface to the IxAMIN functions and have make build it
Martin Kroeker [Sat, 23 Jun 2018 11:31:09 +0000 (13:31 +0200)]
Expose CBLAS interface to BLAS extensions iXamin
Martin Kroeker [Sat, 23 Jun 2018 11:27:30 +0000 (13:27 +0200)]
Build cblas_iXamin interfaces
Martin Kroeker [Thu, 21 Jun 2018 19:01:03 +0000 (21:01 +0200)]
Merge pull request #1634 from oon3m0oo/develop
Fix data races reported by TSAN.
oon3m0oo [Thu, 21 Jun 2018 16:47:45 +0000 (17:47 +0100)]
Use BLAS rather than CBLAS in test_fork.c (#1626)
This is handy for people not using lapack.
Craig Donner [Thu, 21 Jun 2018 10:13:57 +0000 (11:13 +0100)]
Fix data races reported by TSAN.
oon3m0oo [Wed, 20 Jun 2018 20:04:03 +0000 (21:04 +0100)]
Further improvements to memory.c. (#1625)
- Compiler TLS is now only used when the compiler supports it
- If compiler TLS is unsupported, we use platform-specific TLS
- Only one variable (an index) is now in TLS
- We only access TLS once per alloc, and never when freeing
- Allocation / release info is now stored within the allocation itself, by
over-allocating; this saves having external structures do the bookkeeping, and
reduces some of the redundant data that was being stored (such as addresses)
- We never hit the alloc lock when not using SMP or when using OpenMP (that was
my fault)
- Now that there are fewer tracking structures I think this is a bit easier to
read than before
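The over-allocation idea from the list above can be sketched as follows. This is a minimal illustration and not the code in memory.c: `alloc_header`, `tracked_alloc`, `tracked_size`, `tracked_slot`, `tracked_free` and `HEADER_BYTES` are hypothetical names, and plain `malloc` stands in for whatever backing allocator is really used.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bookkeeping record stored in front of every user buffer. */
typedef struct {
    size_t size;  /* bytes requested by the caller */
    int    slot;  /* illustrative tracking index */
} alloc_header;

/* Assumed header size: a multiple of common alignments so the pointer
   handed to the caller stays as aligned as malloc's result. */
#define HEADER_BYTES 64
_Static_assert(sizeof(alloc_header) <= HEADER_BYTES, "header must fit");

/* Recover the header that sits just in front of a user pointer. */
static alloc_header *header_of(void *user) {
    return (alloc_header *)((unsigned char *)user - HEADER_BYTES);
}

/* Over-allocate by HEADER_BYTES and keep the bookkeeping in the block itself. */
void *tracked_alloc(size_t size, int slot) {
    if (size > SIZE_MAX - HEADER_BYTES) return NULL;  /* avoid overflow */
    unsigned char *raw = malloc(HEADER_BYTES + size);
    if (raw == NULL) return NULL;
    alloc_header *h = (alloc_header *)raw;
    h->size = size;
    h->slot = slot;
    return raw + HEADER_BYTES;
}

size_t tracked_size(void *user) { return header_of(user)->size; }
int    tracked_slot(void *user) { return header_of(user)->slot; }

/* Freeing needs no external table: the header is found from the pointer. */
void tracked_free(void *user) {
    if (user != NULL) free(header_of(user));
}
```

Because the record travels with the buffer, releasing it does not need a lookup in any separate tracking structure.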
Martin Kroeker [Wed, 20 Jun 2018 19:51:57 +0000 (21:51 +0200)]
Merge pull request #1630 from martin-frbg/x86-march
Add -march=skylake-avx512 to flags if target is skylake x
Martin Kroeker [Wed, 20 Jun 2018 19:51:38 +0000 (21:51 +0200)]
Merge pull request #1631 from oon3m0oo/stack
Avoid declaring arrays of size 0 when making large stack allocations.
Craig Donner [Wed, 20 Jun 2018 16:03:18 +0000 (17:03 +0100)]
Avoid declaring arrays of size 0 when making large stack allocations.
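A sketch of the pattern this guards against, under assumed names (`STACK_LIMIT_DEMO` and `sum_via_scratch` are hypothetical and not taken from the source): when a request is too large for the stack, the stack array is still declared, so its length must be clamped to at least 1 because a zero-length array is not valid ISO C.

```c
#include <stdlib.h>
#include <string.h>

#define STACK_LIMIT_DEMO 256  /* assumed cutoff, in elements */

/* Copies x into scratch space and returns the sum of the copy.
   Returns 0.0 for n <= 0 or if the heap allocation fails (sketch only). */
double sum_via_scratch(const double *x, int n) {
    if (n <= 0) return 0.0;

    /* 0 means "too big for the stack, use the heap instead". */
    int stack_n = (n <= STACK_LIMIT_DEMO) ? n : 0;

    /* Never declare a zero-length array: fall back to length 1. */
    double stack_buf[stack_n ? stack_n : 1];
    double *buf = stack_n ? stack_buf : malloc((size_t)n * sizeof(double));
    if (buf == NULL) return 0.0;

    memcpy(buf, x, (size_t)n * sizeof(double));
    double total = 0.0;
    for (int i = 0; i < n; i++) total += buf[i];

    if (!stack_n) free(buf);  /* only the heap path owns memory */
    return total;
}
```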
Martin Kroeker [Wed, 20 Jun 2018 14:41:13 +0000 (16:41 +0200)]
Merge pull request #1629 from martin-frbg/issue1628
Make gfortran link libomp for clang in the tests; avoid two typical gotchas with NOFORTRAN
Martin Kroeker [Wed, 20 Jun 2018 13:16:19 +0000 (15:16 +0200)]
Add -march=skylake-avx512 to flags if target is skylake x
Martin Kroeker [Wed, 20 Jun 2018 11:20:30 +0000 (13:20 +0200)]
Need to use filter-out to handle NOFORTRAN not set
Martin Kroeker [Tue, 19 Jun 2018 21:28:06 +0000 (23:28 +0200)]
Modify NOFORTRAN tests to always check the value; fix rewriting of NO_FORTRAN
Martin Kroeker [Tue, 19 Jun 2018 18:53:19 +0000 (20:53 +0200)]
Handle erroneous user settings NOFORTRAN=0 and NO_FORTRAN
Martin Kroeker [Tue, 19 Jun 2018 18:47:33 +0000 (20:47 +0200)]
Handle special case of gfortran+clang+OpenMP
Martin Kroeker [Tue, 19 Jun 2018 18:46:36 +0000 (20:46 +0200)]
Handle special case of gfortran+clang+OpenMP
Martin Kroeker [Mon, 18 Jun 2018 07:02:40 +0000 (09:02 +0200)]
Merge pull request #1623 from fenrus75/fast-thread
Initialize only the required subset of the jobs array, fix barriers and improve switch ratio on SkylakeX and Haswell. For issue #1622
Martin Kroeker [Sun, 17 Jun 2018 21:38:14 +0000 (23:38 +0200)]
Support upcoming Intel Cannon Lake CPUs as Skylake X (#1621)
* Support upcoming Cannon Lake as Skylake X
Arjan van de Ven [Sun, 17 Jun 2018 18:06:24 +0000 (18:06 +0000)]
make WMB / MB safer on x86-64
make it so that
    if (foo)
        RMB;
    else
        MB;
is always done correctly and without syntax surprises
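The usual way to get this property from a macro is the `do { ... } while (0)` wrapper. The sketch below is illustrative only: `DEMO_MB`, `DEMO_RMB`, `barrier_count` and `pick_barrier` are hypothetical names, and the counter update is a stand-in for a real barrier.

```c
/* Records which demo "barrier" ran; stands in for a real barrier body. */
static int barrier_count = 0;

/* A bare `{ ... }` body followed by `;` would terminate the `if` early and
   turn a following `else` into a syntax error. The do/while(0) form expands
   to a single statement that consumes the trailing semicolon. */
#define DEMO_MB  do { barrier_count += 1;  } while (0)
#define DEMO_RMB do { barrier_count += 10; } while (0)

/* Uses the macros in an unbraced if/else, exactly the shape that must work. */
int pick_barrier(int read_only) {
    if (read_only)
        DEMO_RMB;
    else
        DEMO_MB;
    return barrier_count;
}
```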
Arjan van de Ven [Sun, 17 Jun 2018 17:53:15 +0000 (17:53 +0000)]
On x86-64, make MB/WMB compiler barriers
While on x86(64) one does not normally need full memory barriers, it's
good practice to at least use compiler barriers for places where on other
architectures memory barriers are used; this prevents the compiler
from over-optimizing.
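In GCC/Clang syntax, a compiler-only barrier can be written as an empty `asm` statement with a `"memory"` clobber. The sketch below uses hypothetical names (`COMPILER_BARRIER`, `publish`, `consume`, `payload`, `ready`) and only shows where such a barrier would sit; it emits no instruction and says nothing about hardware ordering on other architectures.

```c
/* Emits no machine instruction, but the compiler may not move or cache
   memory accesses across it (GCC/Clang extended asm). */
#define COMPILER_BARRIER() __asm__ __volatile__("" : : : "memory")

static int payload;
static int ready;

/* Writer side: store the data, then the flag, in that program order. */
void publish(int value) {
    payload = value;
    COMPILER_BARRIER();  /* keep the payload store ahead of the flag store */
    ready = 1;
}

/* Reader side: returns -1 until something has been published. */
int consume(void) {
    if (!ready) return -1;
    COMPILER_BARRIER();  /* keep the payload load after the flag load */
    return payload;
}
```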
Arjan van de Ven [Sun, 17 Jun 2018 17:50:43 +0000 (17:50 +0000)]
Add missing barriers in gemm scheduler
a few places in the gemm scheduler code were missing barriers;
the code likely worked OK due to heavy use of volatile / _Atomic
but there's no reason to get this incorrect
Arjan van de Ven [Sun, 17 Jun 2018 17:05:04 +0000 (17:05 +0000)]
Tune HASWELL SWITCH_RATIO as well
Similar to the SKYLAKEX patch, 32 seems to work best
(much better than 4 or 16)
Before (4)
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 15554.3 7.2 0.2% 30353.8 3.7 0.3%
64 x 64 30346.8 8.7 1.6% 63495.0 4.1 -0.1%
65 x 65 81668.1 3.4 -123.3% 82705.2 3.3 -21.2%
80 x 80 105045.9 4.9 -95.5% 115226.0 4.5 -2.2%
96 x 96 152461.2 5.8 -74.3% 148156.3 6.0 16.4%
112 x 112 188505.2 7.5 -42.2% 171187.3 8.2 36.4%
128 x 128 257884.0 8.1 -39.5% 224764.8 9.3 46.0%
Intermediate (16)
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 15565.7 7.2 0.2% 30378.9 3.7 0.2%
64 x 64 30430.2 8.7 1.3% 63046.4 4.2 0.6%
65 x 65 27306.0 10.1 25.3% 38879.2 7.1 43.0%
80 x 80 51008.7 10.1 5.1% 61007.6 8.4 45.9%
96 x 96 70856.7 12.5 19.0% 83403.1 10.6 53.0%
112 x 112 84769.9 16.6 36.0% 99920.1 14.1 62.9%
128 x 128 84213.2 25.0 54.5% 113024.2 18.6 72.8%
After (32)
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 15537.3 7.2 0.3% 30537.0 3.6 -0.3%
64 x 64 30352.7 8.7 1.6% 62597.8 4.2 1.3%
65 x 65 36857.0 7.5 -0.8% 56167.6 4.9 17.7%
80 x 80 42552.6 12.1 20.8% 69536.7 7.4 38.3%
96 x 96 52101.5 17.1 40.5% 91016.1 9.7 48.7%
112 x 112 63853.7 22.1 51.8% 110507.4 12.7 58.9%
128 x 128 73966.1 28.4 60.0% 163146.4 12.9 60.8%
Arjan van de Ven [Sun, 17 Jun 2018 15:47:50 +0000 (15:47 +0000)]
Tune param.h for SkylakeX
param.h defines a per-platform SWITCH_RATIO, which is used as a measure for how fine
grained the blocks for gemm need to be split up. Many platforms define this to 4.
The reality is that the gemm low level implementation for SkylakeX likes bigger blocks
due to the nature of SIMD... by tuning the SWITCH_RATIO to 32 the threading performance
improves significantly:
Before
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7%
64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0%
65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3%
80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6%
96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4%
112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6%
128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3%
After
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10666.3 10.6 0.4% 18236.9 6.2 -1.4%
64 x 64 20410.1 13.0 1.8% 39925.8 6.6 1.7%
65 x 65 34983.0 7.9 -30.2% 51494.6 5.4 2.0%
80 x 80 39769.1 13.0 -4.4% 63805.2 8.1 12.0%
96 x 96 45169.6 19.7 26.7% 80065.8 11.1 29.8%
112 x 112 57026.1 24.7 38.7% 99535.5 14.2 44.1%
128 x 128 64789.8 32.5 51.3% 117407.2 17.9 54.6%
With this change, threading starts to be a win already at 96x96
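One way a ratio like this can limit how finely work is split is sketched below. The halving rule and the name `threads_for_rows` are assumptions made for illustration, not a transcription of the gemm driver: the thread count is reduced until every thread would get at least `switch_ratio` rows.

```c
/* Returns how many threads to use for m rows so that each thread
   gets at least switch_ratio rows (illustrative rule only). */
int threads_for_rows(long m, int nthreads, int switch_ratio) {
    if (m < 2L * switch_ratio)
        return 1;  /* too small to be worth splitting at all */
    int t = nthreads;
    while (t > 1 && m < (long)t * switch_ratio)
        t /= 2;    /* halve until the blocks are big enough */
    return t;
}
```

Under this assumed rule and 8 available threads, a 65-row problem is split 8 ways with a ratio of 4 but only 2 ways with a ratio of 32, which gives each thread a larger block.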
Arjan van de Ven [Sun, 17 Jun 2018 15:39:15 +0000 (15:39 +0000)]
Don't use _Atomic for jobs sometimes...
The use of _Atomic leads to really bad code generation in the compiler
(on x86, you get 2 "mfence" memory barriers around each access with gcc8, despite
x86 being ordered and cache coherent). But there's a fallback in the code that
just uses volatile which is more than plenty in practice.
If we're nervous about cross thread synchronization for these variables, we should
make the YIELD function be a compiler/memory barrier instead.
performance before (after last commit)
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10630.0 10.6 0.7% 18112.8 6.2 -0.7%
64 x 64 20374.8 13.0 1.9% 40487.0 6.5 0.4%
65 x 65 141955.2 1.9 -428.3% 146708.8 1.9 -179.2%
80 x 80 178921.1 2.9 -369.6% 186032.7 2.8 -156.6%
96 x 96 205436.2 4.3 -233.4% 224513.1 3.9 -97.0%
112 x 112 244408.2 5.8 -162.7% 262158.7 5.4 -47.1%
128 x 128 321334.5 6.5 -141.3% 333829.0 6.3 -29.2%
Performance with this patch (roughly a 2x improvement):
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7%
64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0%
65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3%
80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6%
96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4%
112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6%
128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3%
Arjan van de Ven [Sun, 17 Jun 2018 15:32:03 +0000 (15:32 +0000)]
Only initialize the part of the jobs array that will get used
The jobs array is getting initialized in O(compiled cpus^2) complexity.
Distros and people with bigger systems will use pretty high values
(128 or 256 or more) for this value, leading to interesting bubbles
in performance.
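The difference can be sketched as below. This is an illustration rather than the actual scheduler code: `MAX_CPU_DEMO`, `job_demo_t` and `init_jobs` are hypothetical, with `MAX_CPU_DEMO` standing in for the compile-time CPU limit.

```c
#include <stddef.h>

#define MAX_CPU_DEMO 256  /* stand-in for the compiled-in CPU count */

/* One row of per-thread flags, sized for the compile-time maximum. */
typedef struct {
    volatile long working[MAX_CPU_DEMO];
} job_demo_t;

/* Clears only the nthreads x nthreads corner that will actually be used
   and returns the number of entries touched. */
size_t init_jobs(job_demo_t *jobs, int nthreads) {
    size_t touched = 0;
    for (int i = 0; i < nthreads; i++) {
        for (int j = 0; j < nthreads; j++) {
            jobs[i].working[j] = 0;
            touched++;
        }
    }
    return touched;
}
```

With 4 threads running on a build sized for 256 CPUs, this touches 16 entries instead of 65536.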
Baseline (single threaded performance) gets roughly 13 - 15 multiplications per cycle
in the interesting range (threading kicks in at 65x65 mult by 65x65).
The hardware is capable of 32 multiplications per cycle theoretically.
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10703.9 10.6 0.0% 17990.6 6.3 0.0%
64 x 64 20778.4 12.8 0.0% 40629.2 6.5 0.0%
65 x 65 26869.9 10.3 0.0% 52545.7 5.3 0.0%
80 x 80 38104.5 13.5 0.0% 72492.7 7.1 0.0%
96 x 96 61626.4 14.4 0.0% 113983.8 7.8 0.0%
112 x 112 91803.8 15.3 0.0% 180987.3 7.8 0.0%
128 x 128 133161.4 15.8 0.0% 258374.3 8.1 0.0%
When threading is turned on
TARGET=SKYLAKEX F_COMPILER=GFORTRAN SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0 NUM_THREADS=128
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10725.9 10.5 -0.2% 18134.9 6.2 -0.8%
64 x 64 20500.6 12.9 1.3% 40929.1 6.5 -0.7%
65 x 65 2040832.1 0.1 -7495.2% 2097633.6 0.1 -3892.0%
80 x 80 2063129.1 0.2 -5314.4% 2119925.2 0.2 -2824.3%
96 x 96 2070374.5 0.4 -3259.6% 2173604.4 0.4 -1806.9%
112 x 112 2111721.5 0.7 -2169.6% 2263330.8 0.6 -1170.0%
128 x 128 2276181.5 0.9 -1609.3% 2377228.9 0.9 -820.1%
There is a deep deep cliff once you hit 65x65
With this patch
Matrix SGEMM cycles MPC DGEMM cycles MPC
48 x 48 10630.0 10.6 0.7% 18112.8 6.2 -0.7%
64 x 64 20374.8 13.0 1.9% 40487.0 6.5 0.4%
65 x 65 141955.2 1.9 -428.3% 146708.8 1.9 -179.2%
80 x 80 178921.1 2.9 -369.6% 186032.7 2.8 -156.6%
96 x 96 205436.2 4.3 -233.4% 224513.1 3.9 -97.0%
112 x 112 244408.2 5.8 -162.7% 262158.7 5.4 -47.1%
128 x 128 321334.5 6.5 -141.3% 333829.0 6.3 -29.2%
The cliff is very significantly reduced.
(more to follow)
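The unlabeled percentage column in these tables is consistent with the relative change in cycles against the single-threaded baseline row of the same size. That reading is inferred from the numbers and is not stated in the commit; `relative_gain_percent` is a hypothetical helper.

```c
/* Positive: fewer cycles than the baseline. Negative: more cycles. */
double relative_gain_percent(double baseline_cycles, double cycles) {
    return (baseline_cycles - cycles) / baseline_cycles * 100.0;
}
```

For the 65 x 65 SGEMM row, `relative_gain_percent(26869.9, 2040832.1)` evaluates to roughly -7495.2, matching the threaded table above.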
Martin Kroeker [Fri, 15 Jun 2018 09:25:05 +0000 (11:25 +0200)]
Add build-time option for OMP scheduler; document MULTITHREAD_THRESHOLD range (#1620)
* Allow choosing the OpenMP scheduler and add range hint for GEMM_MULTITHREAD_THRESHOLD
* Amended description of GEMM_MULTITHREAD_THRESHOLD
to reflect #742 making it track floating point operations rather than matrix size
Martin Kroeker [Thu, 14 Jun 2018 22:10:29 +0000 (00:10 +0200)]
Merge pull request #1618 from oon3m0oo/less_locking
Remove the need for most locking in memory.c.
Craig Donner [Thu, 14 Jun 2018 11:18:04 +0000 (12:18 +0100)]
Remove the need for most locking in memory.c.
Using thread local storage for tracking memory allocations means that threads
no longer have to lock at all when doing memory allocations / frees. This
particularly helps the gemm driver since it does an allocation per invocation.
Even without threading at all, this helps, since even calling a lock with
no contention has a cost:
Before this change, no threading:
```
----------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------
BM_SGEMM/4 102 ns 102 ns 13504412
BM_SGEMM/6 175 ns 175 ns 7997580
BM_SGEMM/8 205 ns 205 ns 6842073
BM_SGEMM/10 266 ns 266 ns 5294919
BM_SGEMM/16 478 ns 478 ns 2963441
BM_SGEMM/20 690 ns 690 ns 2144755
BM_SGEMM/32 1906 ns 1906 ns 716981
BM_SGEMM/40 2983 ns 2983 ns 473218
BM_SGEMM/64 9421 ns 9422 ns 148450
BM_SGEMM/72 12630 ns 12631 ns 112105
BM_SGEMM/80 15845 ns 15846 ns 89118
BM_SGEMM/90 25675 ns 25676 ns 54332
BM_SGEMM/100 29864 ns 29865 ns 47120
BM_SGEMM/112 37841 ns 37842 ns 36717
BM_SGEMM/128 56531 ns 56532 ns 25361
BM_SGEMM/140 75886 ns 75888 ns 18143
BM_SGEMM/150 98493 ns 98496 ns 14299
BM_SGEMM/160 102620 ns 102622 ns 13381
BM_SGEMM/170 135169 ns 135173 ns 10231
BM_SGEMM/180 146170 ns 146172 ns 9535
BM_SGEMM/189 190226 ns 190231 ns 7397
BM_SGEMM/200 194513 ns 194519 ns 7210
BM_SGEMM/256 396561 ns 396573 ns 3531
```
with this change:
```
----------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------
BM_SGEMM/4 95 ns 95 ns 14500387
BM_SGEMM/6 166 ns 166 ns 8381763
BM_SGEMM/8 196 ns 196 ns 7277044
BM_SGEMM/10 256 ns 256 ns 5515721
BM_SGEMM/16 463 ns 463 ns 3025197
BM_SGEMM/20 636 ns 636 ns 2070213
BM_SGEMM/32 1885 ns 1885 ns 739444
BM_SGEMM/40 2969 ns 2969 ns 472152
BM_SGEMM/64 9371 ns 9372 ns 148932
BM_SGEMM/72 12431 ns 12431 ns 112919
BM_SGEMM/80 15615 ns 15616 ns 89978
BM_SGEMM/90 25397 ns 25398 ns 55041
BM_SGEMM/100 29445 ns 29446 ns 47540
BM_SGEMM/112 37530 ns 37531 ns 37286
BM_SGEMM/128 55373 ns 55375 ns 25277
BM_SGEMM/140 76241 ns 76241 ns 18259
BM_SGEMM/150 102196 ns 102200 ns 13736
BM_SGEMM/160 101521 ns 101525 ns 13556
BM_SGEMM/170 136182 ns 136184 ns 10567
BM_SGEMM/180 146861 ns 146864 ns 9035
BM_SGEMM/189 192632 ns 192632 ns 7231
BM_SGEMM/200 198547 ns 198555 ns 6995
BM_SGEMM/256 392316 ns 392330 ns 3539
```
Before, when built with USE_THREAD=1, GEMM_MULTITHREAD_THRESHOLD = 4, the cost
of small matrix operations was overshadowed by thread locking (see the sizes
below 32) even when not explicitly spawning threads:
```
----------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------
BM_SGEMM/4 328 ns 328 ns 4170562
BM_SGEMM/6 396 ns 396 ns 3536400
BM_SGEMM/8 418 ns 418 ns 3330102
BM_SGEMM/10 491 ns 491 ns 2863047
BM_SGEMM/16 710 ns 710 ns 2028314
BM_SGEMM/20 871 ns 871 ns 1581546
BM_SGEMM/32 2132 ns 2132 ns 657089
BM_SGEMM/40 3197 ns 3196 ns 437969
BM_SGEMM/64 9645 ns 9645 ns 144987
BM_SGEMM/72 35064 ns 32881 ns 50264
BM_SGEMM/80 37661 ns 35787 ns 42080
BM_SGEMM/90 36507 ns 36077 ns 40091
BM_SGEMM/100 32513 ns 31850 ns 48607
BM_SGEMM/112 41742 ns 41207 ns 37273
BM_SGEMM/128 67211 ns 65095 ns 21933
BM_SGEMM/140 68263 ns 67943 ns 19245
BM_SGEMM/150 121854 ns 115439 ns 10660
BM_SGEMM/160 116826 ns 115539 ns 10000
BM_SGEMM/170 126566 ns 122798 ns 11960
BM_SGEMM/180 130088 ns 127292 ns 11503
BM_SGEMM/189 120309 ns 116634 ns 13162
BM_SGEMM/200 114559 ns 110993 ns 10000
BM_SGEMM/256 217063 ns 207806 ns 6417
```
and after, it's gone (note this includes my other change which reduces calls
to num_cpu_avail):
```
----------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------
BM_SGEMM/4 95 ns 95 ns 12347650
BM_SGEMM/6 166 ns 166 ns 8259683
BM_SGEMM/8 193 ns 193 ns 7162210
BM_SGEMM/10 258 ns 258 ns 5415657
BM_SGEMM/16 471 ns 471 ns 2981009
BM_SGEMM/20 666 ns 666 ns 2148002
BM_SGEMM/32 1903 ns 1903 ns 738245
BM_SGEMM/40 2969 ns 2969 ns 473239
BM_SGEMM/64 9440 ns 9440 ns 148442
BM_SGEMM/72 37239 ns 33330 ns 46813
BM_SGEMM/80 57350 ns 55949 ns 32251
BM_SGEMM/90 36275 ns 36249 ns 42259
BM_SGEMM/100 31111 ns 31008 ns 45270
BM_SGEMM/112 43782 ns 40912 ns 34749
BM_SGEMM/128 67375 ns 64406 ns 22443
BM_SGEMM/140 76389 ns 67003 ns 21430
BM_SGEMM/150 72952 ns 71830 ns 19793
BM_SGEMM/160 97039 ns 96858 ns 11498
BM_SGEMM/170 123272 ns 122007 ns 11855
BM_SGEMM/180 126828 ns 126505 ns 11567
BM_SGEMM/189 115179 ns 114665 ns 11044
BM_SGEMM/200 89289 ns 87259 ns 16147
BM_SGEMM/256 226252 ns 222677 ns 7375
```
I've also tested this with ThreadSanitizer and found no data races during
execution. I'm not sure why 200 is always faster than its neighbors; we must
be hitting some optimal cache size or something.
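The core idea, per-thread bookkeeping that needs no lock, can be sketched with C11 `_Thread_local`. The names `local_slots`, `claim_slot` and `release_slot` are hypothetical, and the fixed-size table is an assumption made to keep the example short; it is not the layout used in memory.c.

```c
#include <stddef.h>

#define SLOTS_PER_THREAD 8  /* assumed table size for the sketch */

/* Every thread gets its own copy of this table, so updating it
   never needs a mutex. */
static _Thread_local void *local_slots[SLOTS_PER_THREAD];

/* Records a buffer in the calling thread's table.
   Returns the slot index, or -1 if the table is full. */
int claim_slot(void *buffer) {
    for (int i = 0; i < SLOTS_PER_THREAD; i++) {
        if (local_slots[i] == NULL) {
            local_slots[i] = buffer;
            return i;
        }
    }
    return -1;
}

/* Clears a slot and returns the buffer that was stored there. */
void *release_slot(int slot) {
    if (slot < 0 || slot >= SLOTS_PER_THREAD) return NULL;
    void *buffer = local_slots[slot];
    local_slots[slot] = NULL;
    return buffer;
}
```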
Martin Kroeker [Thu, 14 Jun 2018 15:48:51 +0000 (17:48 +0200)]
Merge pull request #1619 from martin-frbg/issue1580
Update OSX deployment target to 10.8
Martin Kroeker [Thu, 14 Jun 2018 14:57:58 +0000 (16:57 +0200)]
Update OSX deployment target to 10.8
fixes #1580
Martin Kroeker [Thu, 14 Jun 2018 14:52:55 +0000 (16:52 +0200)]
Merge pull request #1607 from martin-frbg/dynarch
Move some x86_64 DYNAMIC_ARCH targets to new DYNAMIC_OLDER option
Martin Kroeker [Thu, 14 Jun 2018 14:51:31 +0000 (16:51 +0200)]
Merge pull request #1612 from oon3m0oo/cpus
Fixed a few more unnecessary calls to num_cpu_avail.
Martin Kroeker [Tue, 12 Jun 2018 21:00:24 +0000 (23:00 +0200)]
Merge pull request #1609 from martin-frbg/issue1529
Create OpenBLASConfig.cmake in cmake builds as well
Martin Kroeker [Mon, 11 Jun 2018 15:14:49 +0000 (17:14 +0200)]
Merge pull request #1613 from xianyi/revert-1600-noyield
Revert "Use usleep instead of sched_yield by default"
Martin Kroeker [Mon, 11 Jun 2018 15:05:27 +0000 (17:05 +0200)]
Revert "Use usleep instead of sched_yield by default"
Martin Kroeker [Mon, 11 Jun 2018 11:26:19 +0000 (13:26 +0200)]
Return a somewhat sane default value for L2 cache size if cpuid retur… (#1611)
* Return a somewhat sane default value for L2 cache size if cpuid returned something unexpected
Fixes #1610, the KVM hypervisor on Google Chromebooks returning zero for CPUID 0x80000006, causing DYNAMIC_ARCH
builds of OpenBLAS to hang
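A guard of this kind can be sketched as below. `sane_l2_size_kb` and `DEFAULT_L2_KB` are hypothetical, and the 256 KB fallback is an assumed value chosen for the example, not the one used in the fix.

```c
#define DEFAULT_L2_KB 256  /* assumed fallback, not taken from the source */

/* Returns the reported L2 size, or a default when the hardware
   (or a hypervisor) reports zero or a negative value. */
int sane_l2_size_kb(int reported_kb) {
    if (reported_kb <= 0)
        return DEFAULT_L2_KB;
    return reported_kb;
}
```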
Craig Donner [Mon, 11 Jun 2018 09:13:09 +0000 (10:13 +0100)]
Fixed a few more unnecessary calls to num_cpu_avail.
I don't have as many benchmarks for these as for gemm, but it should still
make a difference for small matrices.
Martin Kroeker [Sun, 10 Jun 2018 13:09:43 +0000 (15:09 +0200)]
include CMakePackageConfigHelpers
Martin Kroeker [Sun, 10 Jun 2018 07:25:46 +0000 (09:25 +0200)]
Add template for OpenBLASConfig.cmake
Martin Kroeker [Sun, 10 Jun 2018 07:24:37 +0000 (09:24 +0200)]
Create OpenBLASConfig.cmake from cmake as well
Martin Kroeker [Sat, 9 Jun 2018 17:57:33 +0000 (19:57 +0200)]
Merge pull request #1608 from martin-frbg/issue874
Enable parallel make on MS Windows by default
Martin Kroeker [Sat, 9 Jun 2018 15:54:36 +0000 (17:54 +0200)]
Enable parallel make on MS Windows by default
fixes #874
Martin Kroeker [Sat, 9 Jun 2018 14:31:38 +0000 (16:31 +0200)]
Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option
Martin Kroeker [Sat, 9 Jun 2018 14:30:46 +0000 (16:30 +0200)]
Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option
Martin Kroeker [Sat, 9 Jun 2018 14:29:17 +0000 (16:29 +0200)]
Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option
Martin Kroeker [Sat, 9 Jun 2018 10:42:34 +0000 (12:42 +0200)]
Merge pull request #1605 from oon3m0oo/develop
Improve performance of GEMM for small matrices when SMP is defined.
Craig Donner [Thu, 7 Jun 2018 13:54:42 +0000 (14:54 +0100)]
Improve performance of GEMM for small matrices when SMP is defined.
Always checking num_cpu_avail() regardless of whether threading will actually
be used adds noticeable overhead for small matrices. Most other uses of
num_cpu_avail() do so only if threading will be used, so do the same here.
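The shape of the change can be sketched as follows. `query_cpu_count` is a hypothetical stand-in for the CPU-count query, and `SMALL_WORK_LIMIT` is an assumed cutoff; neither is taken from the source.

```c
static int cpu_queries = 0;

/* Stand-in for a comparatively expensive CPU-count query (hypothetical). */
static int query_cpu_count(void) {
    cpu_queries++;
    return 8;
}

/* Assumed cutoff: problems with m*n*k at or below this stay single-threaded. */
#define SMALL_WORK_LIMIT 262144.0

/* Decide on the thread count, asking for the CPU count only
   when threading will actually be used. */
int choose_threads(long m, long n, long k) {
    double work = (double)m * (double)n * (double)k;
    if (work <= SMALL_WORK_LIMIT)
        return 1;  /* small problem: skip the query entirely */
    return query_cpu_count();
}

/* Exposes how often the query ran, for checking the sketch. */
int cpu_query_count(void) { return cpu_queries; }
```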
Martin Kroeker [Thu, 7 Jun 2018 12:09:58 +0000 (14:09 +0200)]
Merge pull request #1601 from martin-frbg/zaxpy
Use a single thread for small input size in zaxpy
Martin Kroeker [Thu, 7 Jun 2018 10:42:00 +0000 (12:42 +0200)]
Merge pull request #1600 from martin-frbg/noyield
Use usleep instead of sched_yield by default
Martin Kroeker [Thu, 7 Jun 2018 08:26:55 +0000 (10:26 +0200)]
Use a single thread for small input size
copies daxpy improvement from #27, see #1560
Martin Kroeker [Thu, 7 Jun 2018 08:18:26 +0000 (10:18 +0200)]
Use usleep instead of sched_yield by default
sched_yield only burns cpu cycles, fixes #900, see also #923, #1560
Martin Kroeker [Wed, 6 Jun 2018 20:07:09 +0000 (22:07 +0200)]
Merge pull request #1589 from fenrus75/skylakex
Initial support for SkylakeX / AVX512
Martin Kroeker [Wed, 6 Jun 2018 16:42:42 +0000 (18:42 +0200)]
Merge pull request #1599 from martin-frbg/c_check_avx512
Improved AVX512 test case for c_check
Martin Kroeker [Wed, 6 Jun 2018 14:51:30 +0000 (16:51 +0200)]
Better AVX512 test case
Martin Kroeker [Wed, 6 Jun 2018 14:49:00 +0000 (16:49 +0200)]
Improve AVX512 testcase
clang 3.4 managed to accept the original test code, only to fail on the actual Skylake asm later
Martin Kroeker [Wed, 6 Jun 2018 10:48:26 +0000 (12:48 +0200)]
Merge pull request #1598 from martin-frbg/issue1593-2
Restore _Atomic define before stdatomic.h for old gcc
Martin Kroeker [Wed, 6 Jun 2018 07:27:49 +0000 (09:27 +0200)]
Update common.h
Martin Kroeker [Wed, 6 Jun 2018 07:21:41 +0000 (09:21 +0200)]
Merge branch 'develop' into issue1593-2
Martin Kroeker [Wed, 6 Jun 2018 07:18:10 +0000 (09:18 +0200)]
Restore _Atomic define before stdatomic.h for old gcc
see #1593
Martin Kroeker [Wed, 6 Jun 2018 05:22:20 +0000 (07:22 +0200)]
Merge pull request #1597 from martin-frbg/cmake-avx512
Check build system support for AVX512 instructions
Martin Kroeker [Tue, 5 Jun 2018 21:29:33 +0000 (23:29 +0200)]
Check build system support for AVX512 instructions
Martin Kroeker [Tue, 5 Jun 2018 17:09:38 +0000 (19:09 +0200)]
Re-enable QUIET_MAKE
Martin Kroeker [Tue, 5 Jun 2018 16:23:01 +0000 (18:23 +0200)]
disable quiet_make for the moment
Martin Kroeker [Tue, 5 Jun 2018 14:02:51 +0000 (16:02 +0200)]
Merge pull request #1594 from martin-frbg/issue1593
Fix inverted condition in _Atomic declaration
Martin Kroeker [Tue, 5 Jun 2018 13:58:34 +0000 (15:58 +0200)]
export NO_AVX512 setting
Martin Kroeker [Tue, 5 Jun 2018 08:31:34 +0000 (10:31 +0200)]
Fix inverted condition in _Atomic declaration
fixes #1593
Martin Kroeker [Tue, 5 Jun 2018 08:26:49 +0000 (10:26 +0200)]
Extend loop range to find SkylakeX in force_coretype
Martin Kroeker [Tue, 5 Jun 2018 08:24:05 +0000 (10:24 +0200)]
Propagate NO_AVX512 via CCOMMON_OPT
Martin Kroeker [Mon, 4 Jun 2018 15:10:19 +0000 (17:10 +0200)]
Update cpuid_x86.c
Martin Kroeker [Mon, 4 Jun 2018 12:36:39 +0000 (14:36 +0200)]
Update dynamic.c
Martin Kroeker [Mon, 4 Jun 2018 06:23:40 +0000 (08:23 +0200)]
Fix misplaced endif
Martin Kroeker [Mon, 4 Jun 2018 06:18:38 +0000 (08:18 +0200)]
Merge pull request #1590 from martin-frbg/avx512_check
Disable AVX512 (Skylake X) support if the build system is too old
Arjan van de Ven [Sun, 3 Jun 2018 22:15:09 +0000 (22:15 +0000)]
Use AVX512 also for DGEMM
this required switching to the generic gemm_beta code (which is faster anyway on SKX)
for both DGEMM and SGEMM
The performance gain for the not-yet-retuned version is in the 30% range
Martin Kroeker [Sun, 3 Jun 2018 22:13:19 +0000 (00:13 +0200)]
typo fix
Martin Kroeker [Sun, 3 Jun 2018 22:01:11 +0000 (00:01 +0200)]
Disable AVX512 (Skylake X) support if the build system is too old
Martin Kroeker [Sun, 3 Jun 2018 21:41:33 +0000 (23:41 +0200)]
Separate Skylake X from Skylake
Martin Kroeker [Sun, 3 Jun 2018 21:29:07 +0000 (23:29 +0200)]
Separate Skylake X from Skylake
Martin Kroeker [Sun, 3 Jun 2018 21:13:25 +0000 (23:13 +0200)]
Add SKYLAKEX to DYNAMIC_CORE list only if AVX512 is available
Martin Kroeker [Sun, 3 Jun 2018 11:48:27 +0000 (13:48 +0200)]
Propagate NO_AVX512 if needed
Martin Kroeker [Sun, 3 Jun 2018 11:22:59 +0000 (13:22 +0200)]
Typo fix (misplaced parenthesis)
Arjan van de Ven [Sun, 3 Jun 2018 07:24:29 +0000 (07:24 +0000)]
Initial support for SkylakeX / AVX512
This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server)
target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set,
which brings 2 basic things:
1) 512 bit wide SIMD (2x width of AVX2)
2) 32 SIMD registers (2x the number on AVX2)
This initial patch only contains a trivial transformation of the Haswell SGEMM kernel
to AVX512VL; more will follow later but this patch aims to get the infrastructure
in place for this "later".
Full performance tuning has not been done yet; with more registers and wider SIMD
it's in theory possible to retune the kernels but even without that there's an
interesting enough performance increase (30-40% range) with just this change.
Martin Kroeker [Sat, 2 Jun 2018 08:02:38 +0000 (10:02 +0200)]
Merge pull request #1587 from matthew-brett/fix-compile-error-early-glibc
Revert "take out unused variables"
Matthew Brett [Fri, 1 Jun 2018 22:20:00 +0000 (23:20 +0100)]
Revert "take out unused variables"
This reverts commit e5752ff9b322c665a7393d6109c2da7ad6ee2523.
The variables i and n are used in the `#if !__GLIBC_PREREQ(2, 7)`
branch.
Closes gh-1586.
Martin Kroeker [Fri, 1 Jun 2018 16:59:33 +0000 (18:59 +0200)]
Merge pull request #1585 from martin-frbg/lapack-253
Fixes from Lapack-Reference PR 253
Martin Kroeker [Fri, 1 Jun 2018 13:14:45 +0000 (15:14 +0200)]
Fixes from netlib PR 253
Martin Kroeker [Fri, 1 Jun 2018 13:12:59 +0000 (15:12 +0200)]
Fixes from netlib PR 253
No error is now raised when only the minimal workspace is given in ?hesv_aa, ?sysv_aa, ?hesv_aa_2stage, ?sysv_aa_2stage
Quick return for ?laqr1
Martin Kroeker [Fri, 1 Jun 2018 13:08:14 +0000 (15:08 +0200)]
Fixes from netlib PR253
LAPACKE interfaces for Aasen's functions now call ?sytrf_aa and ?hetrf_aa instead of ?sytrf and ?hetrf
Martin Kroeker [Thu, 31 May 2018 19:56:04 +0000 (21:56 +0200)]
Merge pull request #1584 from martin-frbg/issue1503
Work around name clash with Windows10's winnt.h
Martin Kroeker [Thu, 31 May 2018 19:55:26 +0000 (21:55 +0200)]
Merge pull request #1583 from martin-frbg/issue1575
Handle INCX=0,INCY=0 case
Martin Kroeker [Thu, 31 May 2018 19:55:07 +0000 (21:55 +0200)]
Merge pull request #1582 from martin-frbg/develop-031
Update version number on the develop branch to 0.3.1.dev
Martin Kroeker [Thu, 31 May 2018 19:54:45 +0000 (21:54 +0200)]
Merge pull request #1581 from martin-frbg/issue1574-2
Fix paths to LIN and EIG tests
Martin Kroeker [Thu, 31 May 2018 15:23:08 +0000 (17:23 +0200)]
typo fix
Martin Kroeker [Thu, 31 May 2018 11:41:12 +0000 (13:41 +0200)]
Restore optimized swap kernel now that we have a proper fix