platform/upstream/openblas.git
Martin Kroeker [Mon, 25 Jun 2018 18:48:10 +0000 (20:48 +0200)]
Merge pull request #2 from martin-frbg/develop

merge develop

Martin Kroeker [Mon, 25 Jun 2018 18:45:56 +0000 (20:45 +0200)]
Merge pull request #1 from xianyi/develop

Merge xianyi:develop into develop

Martin Kroeker [Mon, 25 Jun 2018 17:23:40 +0000 (19:23 +0200)]
Merge pull request #1642 from oon3m0oo/develop

Rewrite &= -> = and simplify the initial blocking phase.

Craig Donner [Mon, 25 Jun 2018 12:53:11 +0000 (13:53 +0100)]
Rewrite &= -> = and simplify the initial blocking phase.

Martin Kroeker [Sat, 23 Jun 2018 17:42:15 +0000 (19:42 +0200)]
Add support for a user-defined list of dynamic targets

Martin Kroeker [Sat, 23 Jun 2018 17:41:32 +0000 (19:41 +0200)]
Add support for a user-defined list of dynamic targets

Martin Kroeker [Sat, 23 Jun 2018 13:01:02 +0000 (15:01 +0200)]
Merge pull request #1638 from martin-frbg/issue1637

Expose the CBLAS interface to the IxAMIN functions and have make build it

Martin Kroeker [Sat, 23 Jun 2018 11:31:09 +0000 (13:31 +0200)]
Expose CBLAS interface to BLAS extensions iXamin

Martin Kroeker [Sat, 23 Jun 2018 11:27:30 +0000 (13:27 +0200)]
Build cblas_iXamin interfaces

Martin Kroeker [Thu, 21 Jun 2018 19:01:03 +0000 (21:01 +0200)]
Merge pull request #1634 from oon3m0oo/develop

Fix data races reported by TSAN.

oon3m0oo [Thu, 21 Jun 2018 16:47:45 +0000 (17:47 +0100)]
Use BLAS rather than CBLAS in test_fork.c (#1626)

This is handy for people not using lapack.

Craig Donner [Thu, 21 Jun 2018 10:13:57 +0000 (11:13 +0100)]
Fix data races reported by TSAN.

oon3m0oo [Wed, 20 Jun 2018 20:04:03 +0000 (21:04 +0100)]
Further improvements to memory.c. (#1625)

- Compiler TLS is now used only when the compiler supports it
- If compiler TLS is unsupported, we use platform-specific TLS
- Only one variable (an index) is now in TLS
- We only access TLS once per alloc, and never when freeing
- Allocation / release info is now stored within the allocation itself, by
  over-allocating; this saves having external structures do the bookkeeping, and
  reduces some of the redundant data that was being stored (such as addresses)
- We never hit the alloc lock when not using SMP or when using OpenMP (that was
  my fault)
- Now that there are fewer tracking structures I think this is a bit easier to
  read than before
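
The over-allocation idea from the list above can be sketched as follows. This is a minimal illustration, not the code in memory.c: the `alloc_header` struct, its fields, and the `tracked_*` function names are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bookkeeping record stored directly in front of the payload. */
typedef struct {
    size_t payload_size;  /* size the caller asked for */
    int    slot;          /* illustrative per-thread slot index */
} alloc_header;

/* Header size rounded up so the payload keeps malloc's alignment. */
#define HEADER_BYTES \
    ((sizeof(alloc_header) + _Alignof(max_align_t) - 1) \
     / _Alignof(max_align_t) * _Alignof(max_align_t))

/* Allocate header + payload in one block and return the payload pointer. */
void *tracked_alloc(size_t size, int slot) {
    if (size > SIZE_MAX - HEADER_BYTES) return NULL;  /* overflow guard */
    unsigned char *raw = malloc(HEADER_BYTES + size);
    if (raw == NULL) return NULL;
    alloc_header *h = (alloc_header *)raw;
    h->payload_size = size;
    h->slot = slot;
    return raw + HEADER_BYTES;
}

/* Recover the header from a payload pointer; no external table is needed. */
static alloc_header *header_of(void *payload) {
    return (alloc_header *)((unsigned char *)payload - HEADER_BYTES);
}

size_t tracked_size(void *payload) { return header_of(payload)->payload_size; }
int    tracked_slot(void *payload) { return header_of(payload)->slot; }

void tracked_free(void *payload) {
    if (payload != NULL) free(header_of(payload));
}
```

Because the release path finds everything it needs by stepping back from the payload pointer, it does not have to consult thread-local state or a shared lookup structure.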

Martin Kroeker [Wed, 20 Jun 2018 19:51:57 +0000 (21:51 +0200)]
Merge pull request #1630 from martin-frbg/x86-march

Add -march=skylake-avx512 to flags if target is skylake x

Martin Kroeker [Wed, 20 Jun 2018 19:51:38 +0000 (21:51 +0200)]
Merge pull request #1631 from oon3m0oo/stack

Avoid declaring arrays of size 0 when making large stack allocations.

Craig Donner [Wed, 20 Jun 2018 16:03:18 +0000 (17:03 +0100)]
Avoid declaring arrays of size 0 when making large stack allocations.

Martin Kroeker [Wed, 20 Jun 2018 14:41:13 +0000 (16:41 +0200)]
Merge pull request #1629 from martin-frbg/issue1628

Make gfortran link libomp for clang in the tests; avoid two typical gotchas with NOFORTRAN

Martin Kroeker [Wed, 20 Jun 2018 13:16:19 +0000 (15:16 +0200)]
Add -march=skylake-avx512 to flags if target is skylake x

Martin Kroeker [Wed, 20 Jun 2018 11:20:30 +0000 (13:20 +0200)]
Need to use filter-out to handle NOFORTRAN not set

Martin Kroeker [Tue, 19 Jun 2018 21:28:06 +0000 (23:28 +0200)]
Modify NOFORTRAN tests to always check the value; fix rewriting of NO_FORTRAN

Martin Kroeker [Tue, 19 Jun 2018 18:53:19 +0000 (20:53 +0200)]
Handle erroneous user settings NOFORTRAN=0 and NO_FORTRAN

Martin Kroeker [Tue, 19 Jun 2018 18:47:33 +0000 (20:47 +0200)]
Handle special case of gfortran+clang+OpenMP

Martin Kroeker [Tue, 19 Jun 2018 18:46:36 +0000 (20:46 +0200)]
Handle special case of gfortran+clang+OpenMP

Martin Kroeker [Mon, 18 Jun 2018 07:02:40 +0000 (09:02 +0200)]
Merge pull request #1623 from fenrus75/fast-thread

Initialize only the required subset of the jobs array, fix barriers and improve switch ratio on SkylakeX and Haswell. For issue #1622

Martin Kroeker [Sun, 17 Jun 2018 21:38:14 +0000 (23:38 +0200)]
Support upcoming Intel Cannon Lake CPUs as Skylake X (#1621)

* Support upcoming Cannon Lake as Skylake X

Arjan van de Ven [Sun, 17 Jun 2018 18:06:24 +0000 (18:06 +0000)]
make WMB / MB safer on x86-64

make it so that

    if (foo)
        RMB;
    else
        MB;

is always done correctly and without syntax surprises
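
One common way to get this property is the `do { ... } while (0)` idiom, sketched below. The `DEMO_RMB` / `DEMO_MB` macros and the counters are hypothetical stand-ins; the commit text does not show the actual macro bodies.

```c
/* Hypothetical counters standing in for real barrier instructions. */
static int rmb_calls = 0;
static int mb_calls = 0;

/* Wrapping the body in do { ... } while (0) makes each macro expand to a
 * single statement that still requires the trailing semicolon, so it
 * composes with an unbraced if/else. A bare { ... } body would instead
 * leave a stray ';' before 'else' and fail to compile. */
#define DEMO_RMB do { rmb_calls++; } while (0)
#define DEMO_MB  do { mb_calls++;  } while (0)

void demo_barrier(int foo) {
    if (foo)
        DEMO_RMB;
    else
        DEMO_MB;
}
```

Calling `demo_barrier(1)` bumps only `rmb_calls`, and `demo_barrier(0)` bumps only `mb_calls`, which is the behavior the unbraced `if`/`else` is expected to have.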

Arjan van de Ven [Sun, 17 Jun 2018 17:53:15 +0000 (17:53 +0000)]
On x86-64, make MB/WMB compiler barriers

While on x86(64) one does not normally need full memory barriers, it's
good practice to at least use compiler barriers for places where on other
architectures memory barriers are used; this prevents the compiler
from over-optimizing.
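
A compiler barrier in GCC/Clang syntax can be sketched like this. The `COMPILER_BARRIER` macro name and the `publish` / `try_consume` helpers are hypothetical, and the example only shows where such a barrier would sit. It constrains the compiler, not the CPU, and it is not a substitute for proper atomics in portable code.

```c
/* An empty asm statement with a "memory" clobber emits no instruction, but
 * the compiler may not reorder or cache memory accesses across it. */
#define COMPILER_BARRIER() __asm__ __volatile__("" : : : "memory")

static int payload;
static int ready;

/* Write the data first, then raise the flag; the barrier keeps the compiler
 * from swapping the two stores. */
void publish(int value) {
    payload = value;
    COMPILER_BARRIER();
    ready = 1;
}

/* Check the flag first, then read the data; the barrier keeps the compiler
 * from hoisting the payload load above the flag check. */
int try_consume(int *out) {
    if (!ready) return 0;
    COMPILER_BARRIER();
    *out = payload;
    return 1;
}
```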

Arjan van de Ven [Sun, 17 Jun 2018 17:50:43 +0000 (17:50 +0000)]
Add missing barriers in gemm scheduler

a few places in the gemm scheduler code were missing barriers;
the code likely worked OK due to heavy use of volatile / _Atomic
but there's no reason to get this incorrect

Arjan van de Ven [Sun, 17 Jun 2018 17:05:04 +0000 (17:05 +0000)]
Tune HASWELL SWITCH_RATIO as well

Similar to the SKYLAKEX patch, 32 seems to work best
(much better than 4 or 16)

Before (4)

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               15554.3    7.2       0.2%                             30353.8      3.7       0.3%
  64 x 64               30346.8    8.7       1.6%                             63495.0      4.1      -0.1%
  65 x 65               81668.1    3.4    -123.3%                             82705.2      3.3     -21.2%
  80 x 80              105045.9    4.9     -95.5%                            115226.0      4.5      -2.2%
  96 x 96              152461.2    5.8     -74.3%                            148156.3      6.0      16.4%
 112 x 112             188505.2    7.5     -42.2%                            171187.3      8.2      36.4%
 128 x 128             257884.0    8.1     -39.5%                            224764.8      9.3      46.0%

Intermediate (16)

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               15565.7    7.2       0.2%                             30378.9      3.7       0.2%
  64 x 64               30430.2    8.7       1.3%                             63046.4      4.2       0.6%
  65 x 65               27306.0   10.1      25.3%                             38879.2      7.1      43.0%
  80 x 80               51008.7   10.1       5.1%                             61007.6      8.4      45.9%
  96 x 96               70856.7   12.5      19.0%                             83403.1     10.6      53.0%
 112 x 112              84769.9   16.6      36.0%                             99920.1     14.1      62.9%
 128 x 128              84213.2   25.0      54.5%                            113024.2     18.6      72.8%

After (32)

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               15537.3    7.2       0.3%                             30537.0      3.6      -0.3%
  64 x 64               30352.7    8.7       1.6%                             62597.8      4.2       1.3%
  65 x 65               36857.0    7.5      -0.8%                             56167.6      4.9      17.7%
  80 x 80               42552.6   12.1      20.8%                             69536.7      7.4      38.3%
  96 x 96               52101.5   17.1      40.5%                             91016.1      9.7      48.7%
 112 x 112              63853.7   22.1      51.8%                            110507.4     12.7      58.9%
 128 x 128              73966.1   28.4      60.0%                            163146.4     12.9      60.8%

Arjan van de Ven [Sun, 17 Jun 2018 15:47:50 +0000 (15:47 +0000)]
Tune param.h for SkylakeX

param.h defines a per-platform SWITCH_RATIO, which is used as a measure for how fine
grained the blocks for gemm need to be split up. Many platforms define this to 4.

The reality is that the gemm low level implementation for SkylakeX likes bigger blocks
due to the nature of SIMD... by tuning the SWITCH_RATIO to 32 the threading performance
improves significantly:

Before
   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10756.0   10.5      -0.5%                             18296.7      6.1      -1.7%
  64 x 64               20490.0   12.9       1.4%                             40615.0      6.5       0.0%
  65 x 65               83528.3    3.3    -210.9%                             96319.0      2.9     -83.3%
  80 x 80              101453.5    5.1    -166.3%                            128021.7      4.0     -76.6%
  96 x 96              149795.1    5.9    -143.1%                            168059.4      5.3     -47.4%
 112 x 112             191481.2    7.3    -105.8%                            204165.0      6.9     -14.6%
 128 x 128             265019.2    7.9     -99.0%                            272006.4      7.7      -5.3%

After
   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10666.3   10.6       0.4%                             18236.9      6.2      -1.4%
  64 x 64               20410.1   13.0       1.8%                             39925.8      6.6       1.7%
  65 x 65               34983.0    7.9     -30.2%                             51494.6      5.4       2.0%
  80 x 80               39769.1   13.0      -4.4%                             63805.2      8.1      12.0%
  96 x 96               45169.6   19.7      26.7%                             80065.8     11.1      29.8%
 112 x 112              57026.1   24.7      38.7%                             99535.5     14.2      44.1%
 128 x 128              64789.8   32.5      51.3%                            117407.2     17.9      54.6%

With this change, threading starts to be a win already at 96x96
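
To make the effect of a larger ratio concrete, here is a rough sketch of one way a switch ratio could limit how finely a dimension is split. The `demo_split_threads` function and its policy are assumptions for illustration only; the actual partitioning logic in the gemm driver is not shown in this log and may differ.

```c
/* Hypothetical policy: use at most one thread per `switch_ratio` rows, so a
 * larger ratio yields fewer, bigger blocks for the same matrix. */
int demo_split_threads(long m, int max_threads, long switch_ratio) {
    long by_size = m / switch_ratio;
    if (by_size < 1) by_size = 1;            /* tiny inputs stay single-threaded */
    return by_size < max_threads ? (int)by_size : max_threads;
}
```

Under this assumed policy, a 128-row problem with 16 available threads would be split 16 ways at a ratio of 4 but only 4 ways at a ratio of 32.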

Arjan van de Ven [Sun, 17 Jun 2018 15:39:15 +0000 (15:39 +0000)]
Don't use _Atomic for jobs sometimes...

The use of _Atomic leads to really bad code generation in the compiler
(on x86, you get 2 "mfence" memory barriers around each access with gcc8, despite
x86 being ordered and cache coherent). But there's a fallback in the code that
just uses volatile which is more than plenty in practice.

If we're nervous about cross thread synchronization for these variables, we should
make the YIELD function be a compiler/memory barrier instead.

performance before (after last commit)

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10630.0   10.6       0.7%                             18112.8      6.2      -0.7%
  64 x 64               20374.8   13.0       1.9%                             40487.0      6.5       0.4%
  65 x 65              141955.2    1.9    -428.3%                            146708.8      1.9    -179.2%
  80 x 80              178921.1    2.9    -369.6%                            186032.7      2.8    -156.6%
  96 x 96              205436.2    4.3    -233.4%                            224513.1      3.9     -97.0%
 112 x 112             244408.2    5.8    -162.7%                            262158.7      5.4     -47.1%
 128 x 128             321334.5    6.5    -141.3%                            333829.0      6.3     -29.2%

Performance with this patch (roughly a 2x improvement):

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10756.0   10.5      -0.5%                             18296.7      6.1      -1.7%
  64 x 64               20490.0   12.9       1.4%                             40615.0      6.5       0.0%
  65 x 65               83528.3    3.3    -210.9%                             96319.0      2.9     -83.3%
  80 x 80              101453.5    5.1    -166.3%                            128021.7      4.0     -76.6%
  96 x 96              149795.1    5.9    -143.1%                            168059.4      5.3     -47.4%
 112 x 112             191481.2    7.3    -105.8%                            204165.0      6.9     -14.6%
 128 x 128             265019.2    7.9     -99.0%                            272006.4      7.7      -5.3%

Arjan van de Ven [Sun, 17 Jun 2018 15:32:03 +0000 (15:32 +0000)]
Only initialize the part of the jobs array that will get used

The jobs array is getting initialized in O(compiled cpus^2) complexity.
Distros and people with bigger systems will use pretty high values
(128 or 256 or more) for this value, leading to interesting bubbles
in performance.
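
The difference between the two initialization strategies can be sketched as below. The `demo_job` layout and `DEMO_MAX_CPUS` are simplified, hypothetical stand-ins for the real jobs array.

```c
#include <stddef.h>

#define DEMO_MAX_CPUS 256  /* hypothetical compile-time CPU limit */

/* Simplified job record: one status word per potential peer thread. */
typedef struct {
    volatile long working[DEMO_MAX_CPUS];
} demo_job;

/* Clear only the nthreads x nthreads corner that will actually be used and
 * return how many entries were touched. Passing DEMO_MAX_CPUS reproduces
 * the cost of clearing the whole array. */
size_t init_jobs_used(demo_job *jobs, int nthreads) {
    size_t touched = 0;
    for (int i = 0; i < nthreads; i++) {
        for (int j = 0; j < nthreads; j++) {
            jobs[i].working[j] = 0;
            touched++;
        }
    }
    return touched;
}
```

With these assumed sizes, a run on 8 threads touches 64 entries instead of 65536, which is why the per-call setup cost stops depending on the compiled-in CPU limit.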

Baseline (single threaded performance) gets roughly 13 - 15 multiplications per cycle
in the interesting range (threading kicks in at 65x65 mult by 65x65).
The hardware is capable of 32 multiplications per cycle theoretically.

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10703.9   10.6       0.0%                             17990.6      6.3       0.0%
  64 x 64               20778.4   12.8       0.0%                             40629.2      6.5       0.0%
  65 x 65               26869.9   10.3       0.0%                             52545.7      5.3       0.0%
  80 x 80               38104.5   13.5       0.0%                             72492.7      7.1       0.0%
  96 x 96               61626.4   14.4       0.0%                            113983.8      7.8       0.0%
 112 x 112              91803.8   15.3       0.0%                            180987.3      7.8       0.0%
 128 x 128             133161.4   15.8       0.0%                            258374.3      8.1       0.0%

When threading is turned on
TARGET=SKYLAKEX F_COMPILER=GFORTRAN  SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0  NUM_THREADS=128

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10725.9   10.5      -0.2%                             18134.9      6.2      -0.8%
  64 x 64               20500.6   12.9       1.3%                             40929.1      6.5      -0.7%
  65 x 65             2040832.1    0.1   -7495.2%                           2097633.6      0.1   -3892.0%
  80 x 80             2063129.1    0.2   -5314.4%                           2119925.2      0.2   -2824.3%
  96 x 96             2070374.5    0.4   -3259.6%                           2173604.4      0.4   -1806.9%
 112 x 112            2111721.5    0.7   -2169.6%                           2263330.8      0.6   -1170.0%
 128 x 128            2276181.5    0.9   -1609.3%                           2377228.9      0.9    -820.1%

There is a deep deep cliff once you hit 65x65
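
The unlabeled percentage column in these tables appears to be the cycle savings relative to the single-threaded baseline. That reading is an inference from the numbers rather than something the commit states; it reproduces the 48x48 and 65x65 SGEMM rows above. A small helper (hypothetical name) makes the arithmetic explicit:

```c
/* Relative change versus a baseline, in percent. Negative values mean the
 * measured run needed more cycles than the baseline. */
double relative_gain_percent(double baseline_cycles, double cycles) {
    return (baseline_cycles - cycles) / baseline_cycles * 100.0;
}
```

For the 65x65 SGEMM row, `relative_gain_percent(26869.9, 2040832.1)` comes out at roughly -7495.2, matching the table.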

With this patch

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10630.0   10.6       0.7%                             18112.8      6.2      -0.7%
  64 x 64               20374.8   13.0       1.9%                             40487.0      6.5       0.4%
  65 x 65              141955.2    1.9    -428.3%                            146708.8      1.9    -179.2%
  80 x 80              178921.1    2.9    -369.6%                            186032.7      2.8    -156.6%
  96 x 96              205436.2    4.3    -233.4%                            224513.1      3.9     -97.0%
 112 x 112             244408.2    5.8    -162.7%                            262158.7      5.4     -47.1%
 128 x 128             321334.5    6.5    -141.3%                            333829.0      6.3     -29.2%

The cliff is very significantly reduced.
(more to follow)

Martin Kroeker [Fri, 15 Jun 2018 09:25:05 +0000 (11:25 +0200)]
Add build-time option for OMP scheduler; document MULTITHREAD_THRESHOLD range (#1620)

* Allow choosing the OpenMP scheduler and add range hint for GEMM_MULTITHREAD_THRESHOLD
* Amended description of GEMM_MULTITHREAD_THRESHOLD
to reflect #742 making it track floating point operations rather than matrix size

Martin Kroeker [Thu, 14 Jun 2018 22:10:29 +0000 (00:10 +0200)]
Merge pull request #1618 from oon3m0oo/less_locking

Remove the need for most locking in memory.c.

Craig Donner [Thu, 14 Jun 2018 11:18:04 +0000 (12:18 +0100)]
Remove the need for most locking in memory.c.

Using thread local storage for tracking memory allocations means that threads
no longer have to lock at all when doing memory allocations / frees. This
particularly helps the gemm driver since it does an allocation per invocation.
Even without threading at all, this helps, since even calling a lock with
no contention has a cost:

Before this change, no threading:
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4          102 ns        102 ns   13504412
BM_SGEMM/6          175 ns        175 ns    7997580
BM_SGEMM/8          205 ns        205 ns    6842073
BM_SGEMM/10         266 ns        266 ns    5294919
BM_SGEMM/16         478 ns        478 ns    2963441
BM_SGEMM/20         690 ns        690 ns    2144755
BM_SGEMM/32        1906 ns       1906 ns     716981
BM_SGEMM/40        2983 ns       2983 ns     473218
BM_SGEMM/64        9421 ns       9422 ns     148450
BM_SGEMM/72       12630 ns      12631 ns     112105
BM_SGEMM/80       15845 ns      15846 ns      89118
BM_SGEMM/90       25675 ns      25676 ns      54332
BM_SGEMM/100      29864 ns      29865 ns      47120
BM_SGEMM/112      37841 ns      37842 ns      36717
BM_SGEMM/128      56531 ns      56532 ns      25361
BM_SGEMM/140      75886 ns      75888 ns      18143
BM_SGEMM/150      98493 ns      98496 ns      14299
BM_SGEMM/160     102620 ns     102622 ns      13381
BM_SGEMM/170     135169 ns     135173 ns      10231
BM_SGEMM/180     146170 ns     146172 ns       9535
BM_SGEMM/189     190226 ns     190231 ns       7397
BM_SGEMM/200     194513 ns     194519 ns       7210
BM_SGEMM/256     396561 ns     396573 ns       3531
```
with this change:
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4           95 ns         95 ns   14500387
BM_SGEMM/6          166 ns        166 ns    8381763
BM_SGEMM/8          196 ns        196 ns    7277044
BM_SGEMM/10         256 ns        256 ns    5515721
BM_SGEMM/16         463 ns        463 ns    3025197
BM_SGEMM/20         636 ns        636 ns    2070213
BM_SGEMM/32        1885 ns       1885 ns     739444
BM_SGEMM/40        2969 ns       2969 ns     472152
BM_SGEMM/64        9371 ns       9372 ns     148932
BM_SGEMM/72       12431 ns      12431 ns     112919
BM_SGEMM/80       15615 ns      15616 ns      89978
BM_SGEMM/90       25397 ns      25398 ns      55041
BM_SGEMM/100      29445 ns      29446 ns      47540
BM_SGEMM/112      37530 ns      37531 ns      37286
BM_SGEMM/128      55373 ns      55375 ns      25277
BM_SGEMM/140      76241 ns      76241 ns      18259
BM_SGEMM/150     102196 ns     102200 ns      13736
BM_SGEMM/160     101521 ns     101525 ns      13556
BM_SGEMM/170     136182 ns     136184 ns      10567
BM_SGEMM/180     146861 ns     146864 ns       9035
BM_SGEMM/189     192632 ns     192632 ns       7231
BM_SGEMM/200     198547 ns     198555 ns       6995
BM_SGEMM/256     392316 ns     392330 ns       3539
```

Before, when built with USE_THREAD=1, GEMM_MULTITHREAD_THRESHOLD = 4, the cost
of small matrix operations was overshadowed by thread locking (see the sizes
below 32) even when not explicitly spawning threads:
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4          328 ns        328 ns    4170562
BM_SGEMM/6          396 ns        396 ns    3536400
BM_SGEMM/8          418 ns        418 ns    3330102
BM_SGEMM/10         491 ns        491 ns    2863047
BM_SGEMM/16         710 ns        710 ns    2028314
BM_SGEMM/20         871 ns        871 ns    1581546
BM_SGEMM/32        2132 ns       2132 ns     657089
BM_SGEMM/40        3197 ns       3196 ns     437969
BM_SGEMM/64        9645 ns       9645 ns     144987
BM_SGEMM/72       35064 ns      32881 ns      50264
BM_SGEMM/80       37661 ns      35787 ns      42080
BM_SGEMM/90       36507 ns      36077 ns      40091
BM_SGEMM/100      32513 ns      31850 ns      48607
BM_SGEMM/112      41742 ns      41207 ns      37273
BM_SGEMM/128      67211 ns      65095 ns      21933
BM_SGEMM/140      68263 ns      67943 ns      19245
BM_SGEMM/150     121854 ns     115439 ns      10660
BM_SGEMM/160     116826 ns     115539 ns      10000
BM_SGEMM/170     126566 ns     122798 ns      11960
BM_SGEMM/180     130088 ns     127292 ns      11503
BM_SGEMM/189     120309 ns     116634 ns      13162
BM_SGEMM/200     114559 ns     110993 ns      10000
BM_SGEMM/256     217063 ns     207806 ns       6417
```
and after, it's gone (note this includes my other change which reduces calls
to num_cpu_avail):
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4           95 ns         95 ns   12347650
BM_SGEMM/6          166 ns        166 ns    8259683
BM_SGEMM/8          193 ns        193 ns    7162210
BM_SGEMM/10         258 ns        258 ns    5415657
BM_SGEMM/16         471 ns        471 ns    2981009
BM_SGEMM/20         666 ns        666 ns    2148002
BM_SGEMM/32        1903 ns       1903 ns     738245
BM_SGEMM/40        2969 ns       2969 ns     473239
BM_SGEMM/64        9440 ns       9440 ns     148442
BM_SGEMM/72       37239 ns      33330 ns      46813
BM_SGEMM/80       57350 ns      55949 ns      32251
BM_SGEMM/90       36275 ns      36249 ns      42259
BM_SGEMM/100      31111 ns      31008 ns      45270
BM_SGEMM/112      43782 ns      40912 ns      34749
BM_SGEMM/128      67375 ns      64406 ns      22443
BM_SGEMM/140      76389 ns      67003 ns      21430
BM_SGEMM/150      72952 ns      71830 ns      19793
BM_SGEMM/160      97039 ns      96858 ns      11498
BM_SGEMM/170     123272 ns     122007 ns      11855
BM_SGEMM/180     126828 ns     126505 ns      11567
BM_SGEMM/189     115179 ns     114665 ns      11044
BM_SGEMM/200      89289 ns      87259 ns      16147
BM_SGEMM/256     226252 ns     222677 ns       7375
```

I've also tested this with ThreadSanitizer and found no data races during
execution.  I'm not sure why 200 is always faster than its neighbors; we must
be hitting some optimal cache size or something.
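
The lock-free tracking described in this commit can be sketched with C11 thread-local storage. This is a minimal illustration under stated assumptions, not the memory.c implementation: the `demo_slot` table, `DEMO_SLOTS`, and the function names are hypothetical, and the sketch never frees its buffers at thread exit.

```c
#include <stddef.h>
#include <stdlib.h>

#define DEMO_SLOTS 4  /* hypothetical number of buffers cached per thread */

typedef struct {
    void  *buf;
    size_t size;
    int    in_use;
} demo_slot;

/* Each thread sees its own copy of this table, so no lock is required. */
static _Thread_local demo_slot demo_table[DEMO_SLOTS];

/* Hand out the first free slot, growing its buffer if it is too small. */
void *demo_acquire(size_t size) {
    if (size == 0) return NULL;
    for (int i = 0; i < DEMO_SLOTS; i++) {
        demo_slot *s = &demo_table[i];
        if (s->in_use) continue;
        if (s->buf == NULL || s->size < size) {
            void *grown = realloc(s->buf, size);
            if (grown == NULL) return NULL;
            s->buf = grown;
            s->size = size;
        }
        s->in_use = 1;
        return s->buf;
    }
    return NULL;  /* every slot of this thread is busy */
}

/* Mark the slot free again; the buffer stays cached for the next call. */
void demo_release(void *buf) {
    for (int i = 0; i < DEMO_SLOTS; i++) {
        if (demo_table[i].in_use && demo_table[i].buf == buf) {
            demo_table[i].in_use = 0;
            return;
        }
    }
}
```

A release followed by an acquire of an equal or smaller size on the same thread returns the cached buffer without touching any shared state, which is the path a per-invocation gemm allocation would take.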

Martin Kroeker [Thu, 14 Jun 2018 15:48:51 +0000 (17:48 +0200)]
Merge pull request #1619 from martin-frbg/issue1580

Update OSX deployment target to 10.8

Martin Kroeker [Thu, 14 Jun 2018 14:57:58 +0000 (16:57 +0200)]
Update OSX deployment target to 10.8

fixes #1580

Martin Kroeker [Thu, 14 Jun 2018 14:52:55 +0000 (16:52 +0200)]
Merge pull request #1607 from martin-frbg/dynarch

Move some x86_64 DYNAMIC_ARCH targets to new DYNAMIC_OLDER option

Martin Kroeker [Thu, 14 Jun 2018 14:51:31 +0000 (16:51 +0200)]
Merge pull request #1612 from oon3m0oo/cpus

Fixed a few more unnecessary calls to num_cpu_avail.

Martin Kroeker [Tue, 12 Jun 2018 21:00:24 +0000 (23:00 +0200)]
Merge pull request #1609 from martin-frbg/issue1529

Create OpenBLASConfig.cmake in cmake builds as well

Martin Kroeker [Mon, 11 Jun 2018 15:14:49 +0000 (17:14 +0200)]
Merge pull request #1613 from xianyi/revert-1600-noyield

Revert "Use usleep instead of sched_yield by default"

Martin Kroeker [Mon, 11 Jun 2018 15:05:27 +0000 (17:05 +0200)]
Revert "Use usleep instead of sched_yield by default"

Martin Kroeker [Mon, 11 Jun 2018 11:26:19 +0000 (13:26 +0200)]
Return a somewhat sane default value for L2 cache size if cpuid retur… (#1611)

* Return a somewhat sane default value for L2 cache size if cpuid returned something unexpected

Fixes #1610, the KVM hypervisor on Google Chromebooks returning zero for CPUID  0x80000006, causing DYNAMIC_ARCH
builds of OpenBLAS to hang
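
The kind of guard this fix describes can be sketched as follows. The function name and the 256 KB fallback are assumptions for illustration; the commit text does not state which default value was chosen.

```c
#define DEMO_DEFAULT_L2_KB 256  /* assumed fallback, not taken from the commit */

/* raw_l2_kb is the L2 size decoded from CPUID 0x80000006 (hypothetical
 * input). Zero or negative values are replaced so that later block-size
 * calculations never work from an empty cache. */
int demo_sanitize_l2_kb(int raw_l2_kb) {
    return raw_l2_kb > 0 ? raw_l2_kb : DEMO_DEFAULT_L2_KB;
}
```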

Craig Donner [Mon, 11 Jun 2018 09:13:09 +0000 (10:13 +0100)]
Fixed a few more unnecessary calls to num_cpu_avail.

I don't have as many benchmarks for these as for gemm, but it should still
make a difference for small matrices.

Martin Kroeker [Sun, 10 Jun 2018 13:09:43 +0000 (15:09 +0200)]
include CMakePackageConfigHelpers

Martin Kroeker [Sun, 10 Jun 2018 07:25:46 +0000 (09:25 +0200)]
Add template for OpenBLASConfig.cmake

Martin Kroeker [Sun, 10 Jun 2018 07:24:37 +0000 (09:24 +0200)]
Create OpenBLASConfig.cmake from cmake as well

Martin Kroeker [Sat, 9 Jun 2018 17:57:33 +0000 (19:57 +0200)]
Merge pull request #1608 from martin-frbg/issue874

Enable parallel make on MS Windows by default

Martin Kroeker [Sat, 9 Jun 2018 15:54:36 +0000 (17:54 +0200)]
Enable parallel make on MS Windows by default

fixes #874

Martin Kroeker [Sat, 9 Jun 2018 14:31:38 +0000 (16:31 +0200)]
Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option

Martin Kroeker [Sat, 9 Jun 2018 14:30:46 +0000 (16:30 +0200)]
Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option

Martin Kroeker [Sat, 9 Jun 2018 14:29:17 +0000 (16:29 +0200)]
Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option

Martin Kroeker [Sat, 9 Jun 2018 10:42:34 +0000 (12:42 +0200)]
Merge pull request #1605 from oon3m0oo/develop

Improve performance of GEMM for small matrices when SMP is defined.

Craig Donner [Thu, 7 Jun 2018 13:54:42 +0000 (14:54 +0100)]
Improve performance of GEMM for small matrices when SMP is defined.

Always checking num_cpu_avail() regardless of whether threading will actually
be used adds noticeable overhead for small matrices.  Most other uses of
num_cpu_avail() do so only if threading will be used, so do the same here.
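
The reordering described here, deciding whether to thread before asking how many CPUs are available, can be sketched as below. `demo_num_cpu_avail`, the call counter, and the threshold parameter are hypothetical stand-ins rather than the library's actual interface.

```c
static int demo_cpu_queries = 0;

/* Stand-in for a comparatively expensive CPU-count query (hypothetical). */
static int demo_num_cpu_avail(void) {
    demo_cpu_queries++;
    return 8;
}

/* Check the problem size first; only pay for the CPU query when the work
 * is large enough that threading will actually be used. */
int demo_choose_threads(long m, long n, long k, double threshold) {
    double work = (double)m * (double)n * (double)k;
    if (work <= threshold) return 1;  /* small problem: skip the query */
    return demo_num_cpu_avail();
}
```

With this ordering a 4x4x4 product never reaches the query at all, so the small-matrix path carries no threading overhead.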

Martin Kroeker [Thu, 7 Jun 2018 12:09:58 +0000 (14:09 +0200)]
Merge pull request #1601 from martin-frbg/zaxpy

Use a single thread for small input size in zaxpy

Martin Kroeker [Thu, 7 Jun 2018 10:42:00 +0000 (12:42 +0200)]
Merge pull request #1600 from martin-frbg/noyield

Use usleep instead of sched_yield by default

Martin Kroeker [Thu, 7 Jun 2018 08:26:55 +0000 (10:26 +0200)]
Use a single thread for small input size

copies daxpy improvement from #27, see #1560

Martin Kroeker [Thu, 7 Jun 2018 08:18:26 +0000 (10:18 +0200)]
Use usleep instead of sched_yield by default

sched_yield only burns cpu cycles, fixes #900,  see also #923, #1560

Martin Kroeker [Wed, 6 Jun 2018 20:07:09 +0000 (22:07 +0200)]
Merge pull request #1589 from fenrus75/skylakex

Initial support for SkylakeX / AVX512

Martin Kroeker [Wed, 6 Jun 2018 16:42:42 +0000 (18:42 +0200)]
Merge pull request #1599 from martin-frbg/c_check_avx512

Improved AVX512 test case for c_check

Martin Kroeker [Wed, 6 Jun 2018 14:51:30 +0000 (16:51 +0200)]
Better AVX512 test case

Martin Kroeker [Wed, 6 Jun 2018 14:49:00 +0000 (16:49 +0200)]
Improve AVX512 testcase

clang 3.4 managed to accept the original test code, only to fail on the actual Skylake asm later

Martin Kroeker [Wed, 6 Jun 2018 10:48:26 +0000 (12:48 +0200)]
Merge pull request #1598 from martin-frbg/issue1593-2

Restore _Atomic define before stdatomic.h for old gcc

Martin Kroeker [Wed, 6 Jun 2018 07:27:49 +0000 (09:27 +0200)]
Update common.h

Martin Kroeker [Wed, 6 Jun 2018 07:21:41 +0000 (09:21 +0200)]
Merge branch 'develop' into issue1593-2

Martin Kroeker [Wed, 6 Jun 2018 07:18:10 +0000 (09:18 +0200)]
Restore _Atomic define before stdatomic.h for old gcc

see #1593

Martin Kroeker [Wed, 6 Jun 2018 05:22:20 +0000 (07:22 +0200)]
Merge pull request #1597 from martin-frbg/cmake-avx512

Check build system support for AVX512 instructions

Martin Kroeker [Tue, 5 Jun 2018 21:29:33 +0000 (23:29 +0200)]
Check build system support for AVX512 instructions

Martin Kroeker [Tue, 5 Jun 2018 17:09:38 +0000 (19:09 +0200)]
Re-enable QUIET_MAKE

Martin Kroeker [Tue, 5 Jun 2018 16:23:01 +0000 (18:23 +0200)]
disable quiet_make for the moment

6 years ago Merge pull request #1594 from martin-frbg/issue1593
Martin Kroeker [Tue, 5 Jun 2018 14:02:51 +0000 (16:02 +0200)]
Merge pull request #1594 from martin-frbg/issue1593

Fix inverted condition in _Atomic declaration

6 years ago export NO_AVX512 setting
Martin Kroeker [Tue, 5 Jun 2018 13:58:34 +0000 (15:58 +0200)]
export NO_AVX512 setting

6 years ago Fix inverted condition in _Atomic declaration
Martin Kroeker [Tue, 5 Jun 2018 08:31:34 +0000 (10:31 +0200)]
Fix inverted condition in _Atomic declaration

fixes #1593

6 years ago Extend loop range to find SkylakeX in force_coretype
Martin Kroeker [Tue, 5 Jun 2018 08:26:49 +0000 (10:26 +0200)]
Extend loop range to find SkylakeX in force_coretype

6 years ago Propagate NO_AVX512 via CCOMMON_OPT
Martin Kroeker [Tue, 5 Jun 2018 08:24:05 +0000 (10:24 +0200)]
Propagate NO_AVX512 via CCOMMON_OPT

6 years ago Update cpuid_x86.c
Martin Kroeker [Mon, 4 Jun 2018 15:10:19 +0000 (17:10 +0200)]
Update cpuid_x86.c

6 years ago Update dynamic.c
Martin Kroeker [Mon, 4 Jun 2018 12:36:39 +0000 (14:36 +0200)]
Update dynamic.c

6 years ago Fix misplaced endif
Martin Kroeker [Mon, 4 Jun 2018 06:23:40 +0000 (08:23 +0200)]
Fix misplaced endif

6 years ago Merge pull request #1590 from martin-frbg/avx512_check
Martin Kroeker [Mon, 4 Jun 2018 06:18:38 +0000 (08:18 +0200)]
Merge pull request #1590 from martin-frbg/avx512_check

Disable AVX512 (Skylake X) support if the build system is too old

6 years ago Use AVX512 also for DGEMM
Arjan van de Ven [Sun, 3 Jun 2018 22:15:09 +0000 (22:15 +0000)]
Use AVX512 also for DGEMM

this required switching to the generic gemm_beta code (which is faster anyway on SKX)
for both DGEMM and SGEMM

The performance gain for the not-yet-retuned version is in the 30% range

6 years ago typo fix
Martin Kroeker [Sun, 3 Jun 2018 22:13:19 +0000 (00:13 +0200)]
typo fix

6 years ago Disable AVX512 (Skylake X) support if the build system is too old
Martin Kroeker [Sun, 3 Jun 2018 22:01:11 +0000 (00:01 +0200)]
Disable AVX512 (Skylake X) support if the build system is too old

6 years ago Separate Skylake X from Skylake
Martin Kroeker [Sun, 3 Jun 2018 21:41:33 +0000 (23:41 +0200)]
Separate Skylake X from Skylake

6 years ago Separate Skylake X from Skylake
Martin Kroeker [Sun, 3 Jun 2018 21:29:07 +0000 (23:29 +0200)]
Separate Skylake X from Skylake

6 years ago Add SKYLAKEX to DYNAMIC_CORE list only if AVX512 is available
Martin Kroeker [Sun, 3 Jun 2018 21:13:25 +0000 (23:13 +0200)]
Add SKYLAKEX to DYNAMIC_CORE list only if AVX512 is available

6 years ago Propagate NO_AVX512 if needed
Martin Kroeker [Sun, 3 Jun 2018 11:48:27 +0000 (13:48 +0200)]
Propagate NO_AVX512 if needed

6 years ago Typo fix (misplaced parenthesis)
Martin Kroeker [Sun, 3 Jun 2018 11:22:59 +0000 (13:22 +0200)]
Typo fix (misplaced parenthesis)

6 years ago Initial support for SkylakeX / AVX512
Arjan van de Ven [Sun, 3 Jun 2018 07:24:29 +0000 (07:24 +0000)]
Initial support for SkylakeX / AVX512

This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server)
target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set,
which brings 2 basic things:
1) 512 bit wide SIMD (2x width of AVX2)
2) 32 SIMD registers (2x the number on AVX2)

This initial patch only contains a trivial transformation of the Haswell SGEMM kernel
to AVX512VL; more will follow later but this patch aims to get the infrastructure
in place for this "later".

Full performance tuning has not been done yet; with more registers and wider SIMD
it's in theory possible to retune the kernels but even without that there's an
interesting enough performance increase (30-40% range) with just this change.
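
The two numbers in the commit message above can be worked through with a little arithmetic. The sketch below is illustrative only and is not code from OpenBLAS; the names `lanes` and `register_file_bytes` are hypothetical. It assumes AVX2 offers 16 SIMD registers of 256 bits and AVX512 offers 32 registers of 512 bits, and it says nothing about the measured 30-40% speedup.

```python
# Illustrative arithmetic only; not part of OpenBLAS.
# Assumed figures: AVX2 = 16 registers x 256 bits, AVX512 = 32 registers x 512 bits.

AVX2 = {"register_bits": 256, "registers": 16}
AVX512 = {"register_bits": 512, "registers": 32}


def lanes(isa, element_bits):
    """Number of elements of the given size that fit in one SIMD register."""
    return isa["register_bits"] // element_bits


def register_file_bytes(isa):
    """Total bytes held by all SIMD registers of the ISA."""
    return isa["registers"] * isa["register_bits"] // 8


if __name__ == "__main__":
    for name, isa in (("AVX2", AVX2), ("AVX512", AVX512)):
        print(name,
              "float32 lanes:", lanes(isa, 32),
              "float64 lanes:", lanes(isa, 64),
              "register file bytes:", register_file_bytes(isa))
```

Under these assumptions one AVX512 register holds 16 single-precision or 8 double-precision values, against 8 and 4 for AVX2, and the register file as a whole is four times as large. That extra room is what the message refers to when it says the kernels could in theory be retuned later.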

6 years ago Merge pull request #1587 from matthew-brett/fix-compile-error-early-glibc
Martin Kroeker [Sat, 2 Jun 2018 08:02:38 +0000 (10:02 +0200)]
Merge pull request #1587 from matthew-brett/fix-compile-error-early-glibc

Revert "take out unused variables"

6 years ago Revert "take out unused variables"
Matthew Brett [Fri, 1 Jun 2018 22:20:00 +0000 (23:20 +0100)]
Revert "take out unused variables"

This reverts commit e5752ff9b322c665a7393d6109c2da7ad6ee2523.

The variables i and n are used in the `#if !__GLIBC_PREREQ(2, 7)`
branch.

Closes gh-1586.

6 years ago Merge pull request #1585 from martin-frbg/lapack-253
Martin Kroeker [Fri, 1 Jun 2018 16:59:33 +0000 (18:59 +0200)]
Merge pull request #1585 from martin-frbg/lapack-253

Fixes from Lapack-Reference PR 253

6 years ago Fixes from netlib PR 253
Martin Kroeker [Fri, 1 Jun 2018 13:14:45 +0000 (15:14 +0200)]
Fixes from netlib PR 253

6 years ago Fixes from netlib PR 253
Martin Kroeker [Fri, 1 Jun 2018 13:12:59 +0000 (15:12 +0200)]
Fixes from netlib PR 253

When minimal workspace is given in ?hesv_aa, ?sysv_aa, ?hesv_aa_2stage, ?sysv_aa_2stage, now no error is given
Quick return for ?laqr1

6 years ago Fixes from netlib PR253
Martin Kroeker [Fri, 1 Jun 2018 13:08:14 +0000 (15:08 +0200)]
Fixes from netlib PR253

LAPACKE interfaces for Aasen's functions now call ?sytrf_aa and ?hetrf_aa instead of ?sytrf and ?hetrf

6 years ago Merge pull request #1584 from martin-frbg/issue1503
Martin Kroeker [Thu, 31 May 2018 19:56:04 +0000 (21:56 +0200)]
Merge pull request #1584 from martin-frbg/issue1503

Work around name clash with Windows10's winnt.h

6 years ago Merge pull request #1583 from martin-frbg/issue1575
Martin Kroeker [Thu, 31 May 2018 19:55:26 +0000 (21:55 +0200)]
Merge pull request #1583 from martin-frbg/issue1575

Handle INCX=0,INCY=0 case

6 years ago Merge pull request #1582 from martin-frbg/develop-031
Martin Kroeker [Thu, 31 May 2018 19:55:07 +0000 (21:55 +0200)]
Merge pull request #1582 from martin-frbg/develop-031

Update version number on the develop branch to 0.3.1.dev

6 years ago Merge pull request #1581 from martin-frbg/issue1574-2
Martin Kroeker [Thu, 31 May 2018 19:54:45 +0000 (21:54 +0200)]
Merge pull request #1581 from martin-frbg/issue1574-2

Fix paths to LIN and EIG tests

6 years ago typo fix
Martin Kroeker [Thu, 31 May 2018 15:23:08 +0000 (17:23 +0200)]
typo fix

6 years ago Restore optimized swap kernel now that we have a proper fix
Martin Kroeker [Thu, 31 May 2018 11:41:12 +0000 (13:41 +0200)]
Restore optimized swap kernel now that we have a proper fix