review.tizen.org Git - platform/upstream/openblas.git/log

Further improvements to memory.c. (#1625)

- Compiler TLS is now used only when the compiler supports it
- If compiler TLS is unsupported, we use platform-specific TLS
- Only one variable (an index) is now in TLS
- We only access TLS once per alloc, and never when freeing
- Allocation / release info is now stored within the allocation itself, by
  over-allocating; this saves having external structures do the bookkeeping, and
  reduces some of the redundant data that was being stored (such as addresses)
- We never hit the alloc lock when not using SMP or when using OpenMP (that was
  my fault)
- Now that there are fewer tracking structures I think this is a bit easier to
  read than before
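
As an illustration of the over-allocation idea in the list above, here is a minimal sketch. It is not the code from memory.c; the header layout and the names `alloc_header_t`, `tracked_alloc`, `tracked_slot` and `tracked_free` are hypothetical. It shows why freeing needs no external lookup: the bookkeeping sits directly in front of the pointer the caller holds.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical header placed in front of every buffer handed to the caller.
   The max_align_t member keeps the user pointer suitably aligned. */
typedef union {
    struct {
        int slot;      /* index of the bookkeeping slot this buffer belongs to */
        size_t size;   /* usable size requested by the caller */
    } info;
    max_align_t align;
} alloc_header_t;

/* Over-allocate by one header and return the address just past it. */
void *tracked_alloc(size_t size, int slot) {
    alloc_header_t *h = malloc(sizeof(alloc_header_t) + size);
    if (h == NULL) return NULL;
    h->info.slot = slot;
    h->info.size = size;
    return h + 1;
}

/* Step back from the user pointer to recover the header. */
static alloc_header_t *header_of(void *p) {
    return (alloc_header_t *)p - 1;
}

int tracked_slot(void *p) {
    return header_of(p)->info.slot;
}

/* Releasing needs only the pointer itself: no table walk, no TLS access. */
void tracked_free(void *p) {
    if (p != NULL) free(header_of(p));
}
```

The cost of this layout is a few extra bytes per allocation in exchange for dropping the separate tracking structures.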

Martin Kroeker [Wed, 20 Jun 2018 19:51:57 +0000 (21:51 +0200)]

Merge pull request #1630 from martin-frbg/x86-march

Add -march=skylake-avx512 to flags if target is skylake x

Martin Kroeker [Wed, 20 Jun 2018 19:51:38 +0000 (21:51 +0200)]

Merge pull request #1631 from oon3m0oo/stack

Avoid declaring arrays of size 0 when making large stack allocations.

Craig Donner [Wed, 20 Jun 2018 16:03:18 +0000 (17:03 +0100)]

Avoid declaring arrays of size 0 when making large stack allocations.
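
A minimal sketch of the problem this commit title describes, assuming the usual pattern of a stack buffer for small inputs and a heap buffer for large ones. The function `scaled_sum` and the limit `MAX_STACK_BYTES` are hypothetical, not taken from the patch. When the heap path is chosen, the stack array would otherwise be declared with zero elements, which C does not allow for variable-length arrays, so the size is clamped to 1.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_STACK_BYTES 2048   /* hypothetical cut-off for using the stack */

/* Scales x by alpha into a temporary buffer and returns the sum of the result. */
double scaled_sum(const double *x, size_t n, double alpha) {
    /* 0 means "too large for the stack, use the heap instead". */
    size_t stack_n = (n * sizeof(double) <= MAX_STACK_BYTES) ? n : 0;

    /* Clamp to 1 so the array never has zero elements. */
    double stack_buf[stack_n ? stack_n : 1];
    double *buf = stack_n ? stack_buf : malloc(n * sizeof(double));
    double sum = 0.0;

    if (buf == NULL) return 0.0;
    for (size_t i = 0; i < n; i++) {
        buf[i] = alpha * x[i];
        sum += buf[i];
    }
    if (!stack_n) free(buf);   /* only the heap path owns memory to release */
    return sum;
}
```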

Martin Kroeker [Wed, 20 Jun 2018 14:41:13 +0000 (16:41 +0200)]

Merge pull request #1629 from martin-frbg/issue1628

Make gfortran link libomp for clang in the tests; avoid two typical gotchas with NOFORTRAN

Martin Kroeker [Wed, 20 Jun 2018 13:16:19 +0000 (15:16 +0200)]

Add -march=skylake-avx512 to flags if target is skylake x

Martin Kroeker [Wed, 20 Jun 2018 11:20:30 +0000 (13:20 +0200)]

Need to use filter-out to handle NOFORTRAN not set

Martin Kroeker [Tue, 19 Jun 2018 21:28:06 +0000 (23:28 +0200)]

Modify NOFORTRAN tests to always check the value; fix rewriting of NO_FORTRAN

Martin Kroeker [Tue, 19 Jun 2018 18:53:19 +0000 (20:53 +0200)]

Handle erroneous user settings NOFORTRAN=0 and NO_FORTRAN

Martin Kroeker [Tue, 19 Jun 2018 18:47:33 +0000 (20:47 +0200)]

Handle special case of gfortran+clang+OpenMP

Martin Kroeker [Tue, 19 Jun 2018 18:46:36 +0000 (20:46 +0200)]

Handle special case of gfortran+clang+OpenMP

Martin Kroeker [Mon, 18 Jun 2018 07:02:40 +0000 (09:02 +0200)]

Merge pull request #1623 from fenrus75/fast-thread

Initialize only the required subset of the jobs array, fix barriers and improve switch ratio on SkylakeX and Haswell. For issue #1622

Martin Kroeker [Sun, 17 Jun 2018 21:38:14 +0000 (23:38 +0200)]

Support upcoming Intel Cannon Lake CPUs as Skylake X (#1621)

* Support upcoming Cannon Lake as Skylake X

Arjan van de Ven [Sun, 17 Jun 2018 18:06:24 +0000 (18:06 +0000)]

make WMB / MB safer on x86-64

make it so that

    if (foo)
        RMB;
    else
        MB;

is always done correctly and without syntax surprises
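
One common way to get this property is the `do { ... } while (0)` idiom, sketched below. The commit message does not show the macro bodies, so the definitions here are an assumption (GCC/Clang inline-asm syntax, compiler-only barrier), and `publish` is a hypothetical caller. A macro defined as a bare `{ ... }` block would turn `RMB;` into a block followed by an empty statement, leaving the `else` without a matching `if`; wrapping the body in `do { } while (0)` makes each macro exactly one statement that consumes the trailing semicolon.

```c
/* Assumed definitions: each macro expands to a single statement. */
#define MB  do { __asm__ __volatile__("" : : : "memory"); } while (0)
#define WMB do { __asm__ __volatile__("" : : : "memory"); } while (0)
#define RMB do { __asm__ __volatile__("" : : : "memory"); } while (0)

/* Hypothetical caller using the exact if/else shape from the commit message. */
int publish(int *slot, int value, int readers_only) {
    *slot = value;
    if (readers_only)
        RMB;
    else
        MB;
    return *slot;
}
```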

Arjan van de Ven [Sun, 17 Jun 2018 17:53:15 +0000 (17:53 +0000)]

On x86-64, make MB/WMB compiler barriers

While on x86(64) one does not normally need full memory barriers, it's
good practice to at least use compiler barriers for places where on other
architectures memory barriers are used; this prevents the compiler
from over-optimizing.
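
To illustrate what a compiler barrier buys, here is a small sketch in GCC/Clang syntax. The names `COMPILER_BARRIER`, `producer` and `consumer_try` are hypothetical and not from the patch. The empty asm statement with a "memory" clobber emits no instruction, but the compiler may not move or cache memory accesses across it, so the payload store stays ahead of the flag store in the generated code.

```c
/* Emits no machine instruction; only constrains the compiler. */
#define COMPILER_BARRIER() __asm__ __volatile__("" : : : "memory")

static int payload;
static int ready;

void producer(int value) {
    payload = value;
    COMPILER_BARRIER();   /* keep the payload store before the flag store */
    ready = 1;
}

/* Returns 1 and fills *out if the payload has been published, else 0. */
int consumer_try(int *out) {
    if (!ready) return 0;
    COMPILER_BARRIER();   /* do not read the payload before checking the flag */
    *out = payload;
    return 1;
}
```

On architectures with weaker hardware ordering, the same positions would need real memory barrier instructions, which is where the MB/WMB macros come in.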

Arjan van de Ven [Sun, 17 Jun 2018 17:50:43 +0000 (17:50 +0000)]

Add missing barriers in gemm scheduler

A few places in the gemm scheduler code were missing barriers;
the code likely worked OK due to heavy use of volatile / _Atomic,
but there's no reason to leave this incorrect.

Arjan van de Ven [Sun, 17 Jun 2018 17:05:04 +0000 (17:05 +0000)]

Tune HASWELL SWITCH_RATIO as well

Similar to the SKYLAKEX patch, 32 seems to work best
(much better than 4 or 16)

Before (4)

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               15554.3    7.2       0.2%                             30353.8      3.7       0.3%
  64 x 64               30346.8    8.7       1.6%                             63495.0      4.1      -0.1%
  65 x 65               81668.1    3.4    -123.3%                             82705.2      3.3     -21.2%
  80 x 80              105045.9    4.9     -95.5%                            115226.0      4.5      -2.2%
  96 x 96              152461.2    5.8     -74.3%                            148156.3      6.0      16.4%
112 x 112             188505.2    7.5     -42.2%                            171187.3      8.2      36.4%
128 x 128             257884.0    8.1     -39.5%                            224764.8      9.3      46.0%

Intermediate (16)

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               15565.7    7.2       0.2%                             30378.9      3.7       0.2%
  64 x 64               30430.2    8.7       1.3%                             63046.4      4.2       0.6%
  65 x 65               27306.0   10.1      25.3%                             38879.2      7.1      43.0%
  80 x 80               51008.7   10.1       5.1%                             61007.6      8.4      45.9%
  96 x 96               70856.7   12.5      19.0%                             83403.1     10.6      53.0%
112 x 112              84769.9   16.6      36.0%                             99920.1     14.1      62.9%
128 x 128              84213.2   25.0      54.5%                            113024.2     18.6      72.8%

After (32)

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               15537.3    7.2       0.3%                             30537.0      3.6      -0.3%
  64 x 64               30352.7    8.7       1.6%                             62597.8      4.2       1.3%
  65 x 65               36857.0    7.5      -0.8%                             56167.6      4.9      17.7%
  80 x 80               42552.6   12.1      20.8%                             69536.7      7.4      38.3%
  96 x 96               52101.5   17.1      40.5%                             91016.1      9.7      48.7%
112 x 112              63853.7   22.1      51.8%                            110507.4     12.7      58.9%
128 x 128              73966.1   28.4      60.0%                            163146.4     12.9      60.8%

Arjan van de Ven [Sun, 17 Jun 2018 15:47:50 +0000 (15:47 +0000)]

Tune param.h for SkylakeX

param.h defines a per-platform SWITCH_RATIO, which is used as a measure for how fine
grained the blocks for gemm need to be split up. Many platforms define this to 4.

The reality is that the gemm low level implementation for SkylakeX likes bigger blocks
due to the nature of SIMD... by tuning the SWITCH_RATIO to 32 the threading performance
improves significantly:

Before
   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10756.0   10.5      -0.5%                             18296.7      6.1      -1.7%
  64 x 64               20490.0   12.9       1.4%                             40615.0      6.5       0.0%
  65 x 65               83528.3    3.3    -210.9%                             96319.0      2.9     -83.3%
  80 x 80              101453.5    5.1    -166.3%                            128021.7      4.0     -76.6%
  96 x 96              149795.1    5.9    -143.1%                            168059.4      5.3     -47.4%
112 x 112             191481.2    7.3    -105.8%                            204165.0      6.9     -14.6%
128 x 128             265019.2    7.9     -99.0%                            272006.4      7.7      -5.3%

After
   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10666.3   10.6       0.4%                             18236.9      6.2      -1.4%
  64 x 64               20410.1   13.0       1.8%                             39925.8      6.6       1.7%
  65 x 65               34983.0    7.9     -30.2%                             51494.6      5.4       2.0%
  80 x 80               39769.1   13.0      -4.4%                             63805.2      8.1      12.0%
  96 x 96               45169.6   19.7      26.7%                             80065.8     11.1      29.8%
112 x 112              57026.1   24.7      38.7%                             99535.5     14.2      44.1%
128 x 128              64789.8   32.5      51.3%                            117407.2     17.9      54.6%

With this change, threading starts to be a win already at 96x96
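
A rough sketch of why a larger SWITCH_RATIO leads to fewer, bigger blocks. This is not the library's scheduling code; `threads_for_rows` is a hypothetical helper that assumes the ratio acts as a minimum number of rows per thread.

```c
#define SWITCH_RATIO 32   /* value this commit reports for SkylakeX */

/* Hypothetical helper: cap the thread count so that every thread gets at
   least SWITCH_RATIO rows of an m-row problem. */
int threads_for_rows(long m, int max_threads) {
    long by_ratio = m / SWITCH_RATIO;
    if (by_ratio < 1) by_ratio = 1;
    return (by_ratio < max_threads) ? (int)by_ratio : max_threads;
}
```

Under this assumption a 96-row problem on 8 threads is split 3 ways with a ratio of 32, whereas a ratio of 4 would spread it over all 8 threads in blocks of 12 rows.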

Arjan van de Ven [Sun, 17 Jun 2018 15:39:15 +0000 (15:39 +0000)]

Don't use _Atomic for jobs sometimes...

The use of _Atomic leads to really bad code generation in the compiler
(on x86, you get 2 "mfence" memory barriers around each access with gcc8, despite
x86 being ordered and cache coherent). But there's a fallback in the code that
just uses volatile which is more than plenty in practice.

If we're nervous about cross thread synchronization for these variables, we should
make the YIELD function be a compiler/memory barrier instead.
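
A minimal sketch of selecting between the two qualifiers at compile time. The switch `JOBS_USE_C11_ATOMIC`, the macro `JOB_QUALIFIER` and the shrunken `job_t` are hypothetical; the real structure and its condition are not shown in this log.

```c
/* Hypothetical opt-in: without it, the plain volatile fallback is used. */
#if defined(JOBS_USE_C11_ATOMIC) && __STDC_VERSION__ >= 201112L && \
    !defined(__STDC_NO_ATOMICS__)
#define JOB_QUALIFIER _Atomic
#else
#define JOB_QUALIFIER volatile
#endif

/* Tiny stand-in for the per-thread job flags. */
typedef struct {
    JOB_QUALIFIER long working[4];
} job_t;

/* With volatile these are plain loads and stores that the compiler must not
   elide; with _Atomic they become sequentially consistent accesses, which is
   where the extra fences described above would come from. */
void post(job_t *job, int i, long value) {
    job->working[i] = value;
}

long peek(job_t *job, int i) {
    return job->working[i];
}
```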

performance before (after last commit)

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10630.0   10.6       0.7%                             18112.8      6.2      -0.7%
  64 x 64               20374.8   13.0       1.9%                             40487.0      6.5       0.4%
  65 x 65              141955.2    1.9    -428.3%                            146708.8      1.9    -179.2%
  80 x 80              178921.1    2.9    -369.6%                            186032.7      2.8    -156.6%
  96 x 96              205436.2    4.3    -233.4%                            224513.1      3.9     -97.0%
112 x 112             244408.2    5.8    -162.7%                            262158.7      5.4     -47.1%
128 x 128             321334.5    6.5    -141.3%                            333829.0      6.3     -29.2%

Performance with this patch (roughly a 2x improvement):

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10756.0   10.5      -0.5%                             18296.7      6.1      -1.7%
  64 x 64               20490.0   12.9       1.4%                             40615.0      6.5       0.0%
  65 x 65               83528.3    3.3    -210.9%                             96319.0      2.9     -83.3%
  80 x 80              101453.5    5.1    -166.3%                            128021.7      4.0     -76.6%
  96 x 96              149795.1    5.9    -143.1%                            168059.4      5.3     -47.4%
112 x 112             191481.2    7.3    -105.8%                            204165.0      6.9     -14.6%
128 x 128             265019.2    7.9     -99.0%                            272006.4      7.7      -5.3%

Arjan van de Ven [Sun, 17 Jun 2018 15:32:03 +0000 (15:32 +0000)]

Only initialize the part of the jobs array that will get used

The jobs array is getting initialized in O(compiled cpus^2) complexity.
Distros and people with bigger systems will use pretty high values
(128 or 256 or more) for this value, leading to interesting bubbles
in performance.
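
The idea can be sketched as follows. The names `MAX_CPU_NUMBER`, `DIVIDE_RATE`, `job_t` and `init_jobs` and the array shape are assumptions for illustration; only the principle comes from the commit message: loop up to the number of threads actually in use instead of the compile-time maximum.

```c
#define MAX_CPU_NUMBER 128   /* assumed compile-time limit (NUM_THREADS=128) */
#define DIVIDE_RATE 2        /* hypothetical inner dimension */

typedef struct {
    volatile long working[MAX_CPU_NUMBER][DIVIDE_RATE];
} job_t;

/* Clears only the nthreads x nthreads corner this call will use and
   returns the number of entries written. */
long init_jobs(job_t *job, int nthreads) {
    long written = 0;
    for (int i = 0; i < nthreads; i++) {
        for (int j = 0; j < nthreads; j++) {
            for (int k = 0; k < DIVIDE_RATE; k++) {
                job[i].working[j][k] = 0;
                written++;
            }
        }
    }
    return written;
}
```

With these sizes, running 4 threads touches 32 entries per call instead of 32768, which is the quadratic cost the message refers to.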

Baseline (single threaded performance) gets roughly 13 - 15 multiplications per cycle
in the interesting range (threading kicks in at 65x65 mult by 65x65).
The hardware is capable of 32 multiplications per cycle theoretically.

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10703.9   10.6       0.0%                             17990.6      6.3       0.0%
  64 x 64               20778.4   12.8       0.0%                             40629.2      6.5       0.0%
  65 x 65               26869.9   10.3       0.0%                             52545.7      5.3       0.0%
  80 x 80               38104.5   13.5       0.0%                             72492.7      7.1       0.0%
  96 x 96               61626.4   14.4       0.0%                            113983.8      7.8       0.0%
112 x 112              91803.8   15.3       0.0%                            180987.3      7.8       0.0%
128 x 128             133161.4   15.8       0.0%                            258374.3      8.1       0.0%

When threading is turned on
TARGET=SKYLAKEX F_COMPILER=GFORTRAN  SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0  NUM_THREADS=128

  Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10725.9   10.5      -0.2%                             18134.9      6.2      -0.8%
  64 x 64               20500.6   12.9       1.3%                             40929.1      6.5      -0.7%
  65 x 65             2040832.1    0.1   -7495.2%                           2097633.6      0.1   -3892.0%
  80 x 80             2063129.1    0.2   -5314.4%                           2119925.2      0.2   -2824.3%
  96 x 96             2070374.5    0.4   -3259.6%                           2173604.4      0.4   -1806.9%
112 x 112            2111721.5    0.7   -2169.6%                           2263330.8      0.6   -1170.0%
128 x 128            2276181.5    0.9   -1609.3%                           2377228.9      0.9    -820.1%

There is a deep deep cliff once you hit 65x65

With this patch

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10630.0   10.6       0.7%                             18112.8      6.2      -0.7%
  64 x 64               20374.8   13.0       1.9%                             40487.0      6.5       0.4%
  65 x 65              141955.2    1.9    -428.3%                            146708.8      1.9    -179.2%
  80 x 80              178921.1    2.9    -369.6%                            186032.7      2.8    -156.6%
  96 x 96              205436.2    4.3    -233.4%                            224513.1      3.9     -97.0%
112 x 112             244408.2    5.8    -162.7%                            262158.7      5.4     -47.1%
128 x 128             321334.5    6.5    -141.3%                            333829.0      6.3     -29.2%

The cliff is very significantly reduced.
(more to follow)
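
The commit message does not define the percentage column. The figures appear consistent with the relative cycle count against the single-threaded baseline table, which is the assumption behind this small sketch (`relative_gain_percent` is a hypothetical name).

```c
/* Relative change versus the single-threaded baseline, in percent.
   Positive means fewer cycles than the baseline, negative means more. */
double relative_gain_percent(double baseline_cycles, double cycles) {
    return (baseline_cycles - cycles) / baseline_cycles * 100.0;
}
```

For the 65 x 65 SGEMM rows, the baseline of 26869.9 cycles against 2040832.1 cycles with threading gives about -7495.2%, and against 141955.2 cycles with the patch about -428.3%, matching the tables.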

Martin Kroeker [Fri, 15 Jun 2018 09:25:05 +0000 (11:25 +0200)]

Add build-time option for OMP scheduler; document MULTITHREAD_THRESHOLD range (#1620)

* Allow choosing the OpenMP scheduler and add range hint for GEMM_MULTITHREAD_THRESHOLD
* Amended description of GEMM_MULTITHREAD_THRESHOLD
to reflect #742 making it track floating point operations rather than matrix size

Martin Kroeker [Thu, 14 Jun 2018 22:10:29 +0000 (00:10 +0200)]

Merge pull request #1618 from oon3m0oo/less_locking

Remove the need for most locking in memory.c.

Craig Donner [Thu, 14 Jun 2018 11:18:04 +0000 (12:18 +0100)]

Remove the need for most locking in memory.c.

Using thread local storage for tracking memory allocations means that threads
no longer have to lock at all when doing memory allocations / frees. This
particularly helps the gemm driver since it does an allocation per invocation.
Even without threading at all, this helps, since even calling a lock with
no contention has a cost:

Before this change, no threading:
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4          102 ns        102 ns   13504412
BM_SGEMM/6          175 ns        175 ns    7997580
BM_SGEMM/8          205 ns        205 ns    6842073
BM_SGEMM/10         266 ns        266 ns    5294919
BM_SGEMM/16         478 ns        478 ns    2963441
BM_SGEMM/20         690 ns        690 ns    2144755
BM_SGEMM/32        1906 ns       1906 ns     716981
BM_SGEMM/40        2983 ns       2983 ns     473218
BM_SGEMM/64        9421 ns       9422 ns     148450
BM_SGEMM/72       12630 ns      12631 ns     112105
BM_SGEMM/80       15845 ns      15846 ns      89118
BM_SGEMM/90       25675 ns      25676 ns      54332
BM_SGEMM/100      29864 ns      29865 ns      47120
BM_SGEMM/112      37841 ns      37842 ns      36717
BM_SGEMM/128      56531 ns      56532 ns      25361
BM_SGEMM/140      75886 ns      75888 ns      18143
BM_SGEMM/150      98493 ns      98496 ns      14299
BM_SGEMM/160     102620 ns     102622 ns      13381
BM_SGEMM/170     135169 ns     135173 ns      10231
BM_SGEMM/180     146170 ns     146172 ns       9535
BM_SGEMM/189     190226 ns     190231 ns       7397
BM_SGEMM/200     194513 ns     194519 ns       7210
BM_SGEMM/256     396561 ns     396573 ns       3531
```
with this change:
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4           95 ns         95 ns   14500387
BM_SGEMM/6          166 ns        166 ns    8381763
BM_SGEMM/8          196 ns        196 ns    7277044
BM_SGEMM/10         256 ns        256 ns    5515721
BM_SGEMM/16         463 ns        463 ns    3025197
BM_SGEMM/20         636 ns        636 ns    2070213
BM_SGEMM/32        1885 ns       1885 ns     739444
BM_SGEMM/40        2969 ns       2969 ns     472152
BM_SGEMM/64        9371 ns       9372 ns     148932
BM_SGEMM/72       12431 ns      12431 ns     112919
BM_SGEMM/80       15615 ns      15616 ns      89978
BM_SGEMM/90       25397 ns      25398 ns      55041
BM_SGEMM/100      29445 ns      29446 ns      47540
BM_SGEMM/112      37530 ns      37531 ns      37286
BM_SGEMM/128      55373 ns      55375 ns      25277
BM_SGEMM/140      76241 ns      76241 ns      18259
BM_SGEMM/150     102196 ns     102200 ns      13736
BM_SGEMM/160     101521 ns     101525 ns      13556
BM_SGEMM/170     136182 ns     136184 ns      10567
BM_SGEMM/180     146861 ns     146864 ns       9035
BM_SGEMM/189     192632 ns     192632 ns       7231
BM_SGEMM/200     198547 ns     198555 ns       6995
BM_SGEMM/256     392316 ns     392330 ns       3539
```

Before, when built with USE_THREAD=1, GEMM_MULTITHREAD_THRESHOLD = 4, the cost
of small matrix operations was overshadowed by thread locking (look at sizes
smaller than 32) even when not explicitly spawning threads:
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4          328 ns        328 ns    4170562
BM_SGEMM/6          396 ns        396 ns    3536400
BM_SGEMM/8          418 ns        418 ns    3330102
BM_SGEMM/10         491 ns        491 ns    2863047
BM_SGEMM/16         710 ns        710 ns    2028314
BM_SGEMM/20         871 ns        871 ns    1581546
BM_SGEMM/32        2132 ns       2132 ns     657089
BM_SGEMM/40        3197 ns       3196 ns     437969
BM_SGEMM/64        9645 ns       9645 ns     144987
BM_SGEMM/72       35064 ns      32881 ns      50264
BM_SGEMM/80       37661 ns      35787 ns      42080
BM_SGEMM/90       36507 ns      36077 ns      40091
BM_SGEMM/100      32513 ns      31850 ns      48607
BM_SGEMM/112      41742 ns      41207 ns      37273
BM_SGEMM/128      67211 ns      65095 ns      21933
BM_SGEMM/140      68263 ns      67943 ns      19245
BM_SGEMM/150     121854 ns     115439 ns      10660
BM_SGEMM/160     116826 ns     115539 ns      10000
BM_SGEMM/170     126566 ns     122798 ns      11960
BM_SGEMM/180     130088 ns     127292 ns      11503
BM_SGEMM/189     120309 ns     116634 ns      13162
BM_SGEMM/200     114559 ns     110993 ns      10000
BM_SGEMM/256     217063 ns     207806 ns       6417
```
and after, it's gone (note this includes my other change which reduces calls
to num_cpu_avail):
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4           95 ns         95 ns   12347650
BM_SGEMM/6          166 ns        166 ns    8259683
BM_SGEMM/8          193 ns        193 ns    7162210
BM_SGEMM/10         258 ns        258 ns    5415657
BM_SGEMM/16         471 ns        471 ns    2981009
BM_SGEMM/20         666 ns        666 ns    2148002
BM_SGEMM/32        1903 ns       1903 ns     738245
BM_SGEMM/40        2969 ns       2969 ns     473239
BM_SGEMM/64        9440 ns       9440 ns     148442
BM_SGEMM/72       37239 ns      33330 ns      46813
BM_SGEMM/80       57350 ns      55949 ns      32251
BM_SGEMM/90       36275 ns      36249 ns      42259
BM_SGEMM/100      31111 ns      31008 ns      45270
BM_SGEMM/112      43782 ns      40912 ns      34749
BM_SGEMM/128      67375 ns      64406 ns      22443
BM_SGEMM/140      76389 ns      67003 ns      21430
BM_SGEMM/150      72952 ns      71830 ns      19793
BM_SGEMM/160      97039 ns      96858 ns      11498
BM_SGEMM/170     123272 ns     122007 ns      11855
BM_SGEMM/180     126828 ns     126505 ns      11567
BM_SGEMM/189     115179 ns     114665 ns      11044
BM_SGEMM/200      89289 ns      87259 ns      16147
BM_SGEMM/256     226252 ns     222677 ns       7375
```

I've also tested this with ThreadSanitizer and found no data races during
execution.  I'm not sure why 200 is always faster than its neighbors; we must
be hitting some optimal cache size or something.
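
A minimal sketch of the thread-local approach described in this commit message. It is not the code from memory.c; the table layout, the sizes, and the names `tls_alloc` and `tls_free` are hypothetical. Because every thread has its own table, no other thread can observe or modify it, so neither path takes a lock.

```c
#include <stddef.h>
#include <stdlib.h>

#define SLOTS_PER_THREAD 4   /* hypothetical table size */
#define SLOT_BYTES 4096      /* hypothetical buffer size */

/* One private table per thread (C11 thread-local storage). */
static _Thread_local struct {
    void *buffer;
    int in_use;
} slots[SLOTS_PER_THREAD];

/* Hands out a free buffer from the calling thread's table, creating it on
   first use. Returns NULL if the table is full or malloc fails. */
void *tls_alloc(void) {
    for (int i = 0; i < SLOTS_PER_THREAD; i++) {
        if (!slots[i].in_use) {
            if (slots[i].buffer == NULL)
                slots[i].buffer = malloc(SLOT_BYTES);
            if (slots[i].buffer == NULL) return NULL;
            slots[i].in_use = 1;
            return slots[i].buffer;
        }
    }
    return NULL;
}

/* Marks the buffer as reusable; returns 1 if it belonged to this thread. */
int tls_free(void *p) {
    for (int i = 0; i < SLOTS_PER_THREAD; i++) {
        if (slots[i].in_use && slots[i].buffer == p) {
            slots[i].in_use = 0;
            return 1;
        }
    }
    return 0;
}
```

In this sketch the buffers are kept for reuse by the same thread rather than returned to the system, which suits a caller that allocates once per invocation.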

Martin Kroeker [Thu, 14 Jun 2018 15:48:51 +0000 (17:48 +0200)]

Merge pull request #1619 from martin-frbg/issue1580

Update OSX deployment target to 10.8

Martin Kroeker [Thu, 14 Jun 2018 14:57:58 +0000 (16:57 +0200)]

Update OSX deployment target to 10.8

fixes #1580

Martin Kroeker [Thu, 14 Jun 2018 14:52:55 +0000 (16:52 +0200)]

Merge pull request #1607 from martin-frbg/dynarch

Move some x86_64 DYNAMIC_ARCH targets to new DYNAMIC_OLDER option

Martin Kroeker [Thu, 14 Jun 2018 14:51:31 +0000 (16:51 +0200)]

Merge pull request #1612 from oon3m0oo/cpus

Fixed a few more unnecessary calls to num_cpu_avail.

Martin Kroeker [Tue, 12 Jun 2018 21:00:24 +0000 (23:00 +0200)]

Merge pull request #1609 from martin-frbg/issue1529

Create OpenBLASConfig.cmake in cmake builds as well

Martin Kroeker [Mon, 11 Jun 2018 15:14:49 +0000 (17:14 +0200)]

Merge pull request #1613 from xianyi/revert-1600-noyield

Revert "Use usleep instead of sched_yield by default"

Martin Kroeker [Mon, 11 Jun 2018 15:05:27 +0000 (17:05 +0200)]

Revert "Use usleep instead of sched_yield by default"

Martin Kroeker [Mon, 11 Jun 2018 11:26:19 +0000 (13:26 +0200)]

Return a somewhat sane default value for L2 cache size if cpuid retur… (#1611)

* Return a somewhat sane default value for L2 cache size if cpuid returned something unexpected

Fixes #1610, the KVM hypervisor on Google Chromebooks returning zero for CPUID 0x80000006, causing DYNAMIC_ARCH
builds of OpenBLAS to hang
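
A small sketch of the kind of guard this describes. In CPUID leaf 0x80000006, bits 31:16 of ECX hold the L2 cache size in KB; the fallback value `DEFAULT_L2_KB` and the function name `l2_size_kb` are hypothetical, since the log does not say which default the fix uses.

```c
#define DEFAULT_L2_KB 256   /* hypothetical fallback, not taken from the patch */

/* Decodes the L2 size from the ECX value returned by CPUID 0x80000006 and
   substitutes a default when the hypervisor reports zero. */
int l2_size_kb(unsigned int ecx) {
    int kb = (int)(ecx >> 16);
    return (kb > 0) ? kb : DEFAULT_L2_KB;
}
```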

Craig Donner [Mon, 11 Jun 2018 09:13:09 +0000 (10:13 +0100)]

Fixed a few more unnecessary calls to num_cpu_avail.

I don't have as many benchmarks for these as for gemm, but it should still
make a difference for small matrices.

Martin Kroeker [Sun, 10 Jun 2018 13:09:43 +0000 (15:09 +0200)]

include CMakePackageConfigHelpers

Martin Kroeker [Sun, 10 Jun 2018 07:25:46 +0000 (09:25 +0200)]

Add template for OpenBLASConfig.cmake

Martin Kroeker [Sun, 10 Jun 2018 07:24:37 +0000 (09:24 +0200)]

Create OpenBLASConfig.cmake from cmake as well

Martin Kroeker [Sat, 9 Jun 2018 17:57:33 +0000 (19:57 +0200)]

Merge pull request #1608 from martin-frbg/issue874

Enable parallel make on MS Windows by default

Martin Kroeker [Sat, 9 Jun 2018 15:54:36 +0000 (17:54 +0200)]

Enable parallel make on MS Windows by default

fixes #874

Martin Kroeker [Sat, 9 Jun 2018 14:31:38 +0000 (16:31 +0200)]

Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option

Martin Kroeker [Sat, 9 Jun 2018 14:30:46 +0000 (16:30 +0200)]

Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option

Martin Kroeker [Sat, 9 Jun 2018 14:29:17 +0000 (16:29 +0200)]

Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option

Martin Kroeker [Sat, 9 Jun 2018 10:42:34 +0000 (12:42 +0200)]

Merge pull request #1605 from oon3m0oo/develop

Improve performance of GEMM for small matrices when SMP is defined.

Craig Donner [Thu, 7 Jun 2018 13:54:42 +0000 (14:54 +0100)]

Improve performance of GEMM for small matrices when SMP is defined.

Always checking num_cpu_avail() regardless of whether threading will actually
be used adds noticeable overhead for small matrices. Most other uses of
num_cpu_avail() do so only if threading will be used, so do the same here.
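
The pattern can be sketched like this. `query_cpu_count` is a stand-in for num_cpu_avail() that counts how often it runs; `choose_threads` and `FLOP_THRESHOLD` are hypothetical. The threshold value is an assumption picked so that the cut-over falls between 64 and 65, in line with the remark elsewhere in this log that threading kicks in at 65x65.

```c
/* Counts how often the comparatively costly query is executed. */
static int cpu_queries;

/* Stand-in for num_cpu_avail(); returns a fixed count for illustration. */
static int query_cpu_count(void) {
    cpu_queries++;
    return 8;
}

#define FLOP_THRESHOLD (4.0 * 65536.0)   /* hypothetical: equals 64^3 */

/* Decide on single-threaded execution first, and ask for the CPU count
   only when threading will actually be used. */
int choose_threads(long m, long n, long k) {
    double mnk = (double)m * (double)n * (double)k;
    if (mnk <= FLOP_THRESHOLD) return 1;   /* small problem: skip the query */
    return query_cpu_count();
}
```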

Martin Kroeker [Thu, 7 Jun 2018 12:09:58 +0000 (14:09 +0200)]

Merge pull request #1601 from martin-frbg/zaxpy

Use a single thread for small input size in zaxpy

Martin Kroeker [Thu, 7 Jun 2018 10:42:00 +0000 (12:42 +0200)]

Merge pull request #1600 from martin-frbg/noyield

Use usleep instead of sched_yield by default

Martin Kroeker [Thu, 7 Jun 2018 08:26:55 +0000 (10:26 +0200)]

Use a single thread for small input size

copies daxpy improvement from #27, see #1560

Martin Kroeker [Thu, 7 Jun 2018 08:18:26 +0000 (10:18 +0200)]

Use usleep instead of sched_yield by default

sched_yield only burns cpu cycles, fixes #900, see also #923, #1560