platform/upstream/openblas.git
Martin Kroeker [Sat, 4 Aug 2018 18:14:51 +0000 (20:14 +0200)]
fabs -> fabsl

Steven G. Johnson [Fri, 3 Aug 2018 17:00:10 +0000 (13:00 -0400)]
fabs -> fabsl

Fixes two calls that used `fabs` on a `long double` argument instead of `fabsl`, which appears to truncate the value to `double` precision unintentionally.
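
A minimal sketch of the effect described above, not the patched OpenBLAS code: the helper names `abs_via_fabs` and `abs_via_fabsl` are hypothetical. With `<math.h>`, `fabs` takes a `double`, so a `long double` argument is converted first and any extra precision is lost; `fabsl` keeps it.

```c
#include <math.h>

/* Buggy pattern: x is implicitly converted to double before fabs() runs,
 * so bits beyond double precision are dropped. */
static long double abs_via_fabs(long double x) {
    return fabs(x);
}

/* Fixed pattern: fabsl() operates on long double directly. */
static long double abs_via_fabsl(long double x) {
    return fabsl(x);
}
```

On platforms where `long double` has the same representation as `double`, both functions return the same value; the difference only shows up where `long double` is wider.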

Martin Kroeker [Thu, 2 Aug 2018 21:48:42 +0000 (23:48 +0200)]
Merge pull request #1703 from wsttiger/cmake_fix

Set EXPORT_NAME to match OpenBLASConfig.cmake

Martin Kroeker [Thu, 2 Aug 2018 20:27:00 +0000 (22:27 +0200)]
Merge pull request #1707 from extrowerk/haiku_support

Haiku supporting patches

Scott Thornton [Thu, 2 Aug 2018 19:58:52 +0000 (14:58 -0500)]
Added target_include_directories()

Zoltán Mizsei [Thu, 2 Aug 2018 18:49:14 +0000 (20:49 +0200)]
Haiku supporting patches

Martin Kroeker [Thu, 2 Aug 2018 16:53:34 +0000 (18:53 +0200)]
Merge pull request #1706 from oon3m0oo/develop

Fix #1705 where we incorrectly calculate page locations.

Craig Donner [Thu, 2 Aug 2018 15:21:19 +0000 (16:21 +0100)]
Fix #1705 where we incorrectly calculate page locations.

Since we now use an allocation size that isn't a multiple of PAGESIZE, finding
the pages for run_bench wasn't terminating properly.  Now we detect if we've
found enough pages for the allocation and terminate the loop.
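
The actual loop is not shown in this log, so the following is only an illustration of the general failure mode, with a hypothetical helper `pages_needed`. A loop that stops only when the covered byte count exactly equals the allocation size never stops when the size is not a multiple of the page size; stopping as soon as enough pages are covered avoids that.

```c
#include <stddef.h>

/* Hypothetical helper: number of pages needed to cover `size` bytes.
 * A condition such as `covered != size` would step past `size` and keep
 * going when size is not a multiple of page_size; `covered < size` ends
 * the loop once enough pages have been found. */
static size_t pages_needed(size_t size, size_t page_size) {
    size_t pages = 0;
    for (size_t covered = 0; covered < size; covered += page_size) {
        pages++;
    }
    return pages;
}
```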

Scott Thornton [Mon, 30 Jul 2018 20:18:29 +0000 (15:18 -0500)]
Set EXPORT_NAME to match OpenBLASConfig.cmake

Martin Kroeker [Mon, 30 Jul 2018 06:23:13 +0000 (08:23 +0200)]
Set version to 0.3.3.dev

Martin Kroeker [Mon, 30 Jul 2018 06:22:38 +0000 (08:22 +0200)]
Set version to 0.3.3.dev

Martin Kroeker [Sun, 29 Jul 2018 20:37:09 +0000 (22:37 +0200)]
Merge branch 'release-0.3.0' into develop

Martin Kroeker [Wed, 25 Jul 2018 17:55:29 +0000 (19:55 +0200)]
Merge pull request #1697 from martin-frbg/issue1696

Do not treat Windows UWP builds as cross-compiling

Martin Kroeker [Tue, 24 Jul 2018 15:46:33 +0000 (17:46 +0200)]
Do not treat Windows UWP builds as cross-compiling

Martin Kroeker [Sun, 22 Jul 2018 14:34:09 +0000 (16:34 +0200)]
Merge pull request #1695 from martin-frbg/issue1692

Unset memory table entry, not just the local pointer to it on shutdown

Martin Kroeker [Sun, 22 Jul 2018 07:19:19 +0000 (09:19 +0200)]
Unset memory table entry, not just the temporary pointer to it on shutdown

to fix crash with multiple instances of OpenBLAS, #1692
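
A small sketch of the class of bug named in the title, using a hypothetical `memory_table` rather than the real OpenBLAS structures: clearing a local copy of a pointer leaves the table entry untouched, so a later user of the table still sees a stale address.

```c
#include <stddef.h>

#define TABLE_SIZE 4                    /* hypothetical size, for illustration */

static void *memory_table[TABLE_SIZE];  /* hypothetical global table */

/* Buggy variant: only the temporary copy is cleared, so memory_table[i]
 * still holds the old address after shutdown. */
static void shutdown_buggy(void) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        void *entry = memory_table[i];
        entry = NULL;                   /* no effect on memory_table[i] */
        (void)entry;
    }
}

/* Fixed variant: the table entry itself is reset. */
static void shutdown_fixed(void) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        memory_table[i] = NULL;
    }
}
```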

Martin Kroeker [Thu, 19 Jul 2018 17:03:45 +0000 (19:03 +0200)]
Merge pull request #1688 from martin-frbg/issue1673

Temporarily disable special handling of OPENMP thread memory allocation

Martin Kroeker [Thu, 19 Jul 2018 06:57:56 +0000 (08:57 +0200)]
Temporarily disable special handling of OPENMP thread memory allocation

for issue #1673

Martin Kroeker [Mon, 16 Jul 2018 20:47:05 +0000 (22:47 +0200)]
Merge pull request #1681 from martin-frbg/issue1671

Add cpu identification via mfpvr call for the BSDs

Martin Kroeker [Mon, 16 Jul 2018 20:46:49 +0000 (22:46 +0200)]
Merge pull request #1684 from martin-frbg/issue1672

Work around utest failures in the MIPS64 SICORTEX target

Martin Kroeker [Mon, 16 Jul 2018 10:56:39 +0000 (12:56 +0200)]
typo fix

Martin Kroeker [Sun, 15 Jul 2018 15:11:40 +0000 (17:11 +0200)]
Fix precision problem in DSDOT
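
This commit has no body, so the specific defect is not stated here. For context, DSDOT takes single-precision vectors but accumulates and returns the dot product in double precision. The function below is a reference-style sketch of that contract (unit strides only, hypothetical name `dsdot_ref`), not the OpenBLAS kernel.

```c
#include <stddef.h>

/* Reference-style DSDOT for unit strides: inputs are float, but each
 * product and the running sum are kept in double. */
static double dsdot_ref(size_t n, const float *x, const float *y) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += (double)x[i] * (double)y[i];
    }
    return sum;
}
```

With x = {16777216, 1} and y = {1, 1}, the exact result 16777217 is not representable as a float but is preserved by the double accumulator.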

Martin Kroeker [Sun, 15 Jul 2018 15:09:55 +0000 (17:09 +0200)]
Use C kernels for default c/zAXPY, xROT, c/zSWAP

Martin Kroeker [Thu, 12 Jul 2018 21:39:00 +0000 (23:39 +0200)]
Add cpu identification via mfpvr call for the BSDs

fixes #1671

Martin Kroeker [Thu, 12 Jul 2018 12:05:13 +0000 (14:05 +0200)]
Merge pull request #1680 from martin-frbg/snprint

Fix wrong redefinitions of snprintf for older MSVC

Martin Kroeker [Thu, 12 Jul 2018 09:47:52 +0000 (11:47 +0200)]
Fix declaration of snprintf for older MSVC

_snprintf_s takes an additional (size) argument, so it is not a direct replacement.
(Note that this code is currently unused - the two instances of snprintf here are within ifdef blocks that are not compiled for MSVC)
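
A hedged sketch of why the two functions are not interchangeable; it is not the definition used in the patch. `_snprintf_s` takes `(buffer, sizeOfBuffer, count, format, ...)`, whereas `_snprintf` takes `(buffer, count, format, ...)` and therefore matches the argument shape of `snprintf`. The sketch assumes that Visual Studio 2015 (`_MSC_VER` 1900) is the first MSVC release with a C99 `snprintf`; `format_version` is a hypothetical helper.

```c
#include <stdio.h>

#if defined(_MSC_VER) && _MSC_VER < 1900
/* Same argument order as snprintf; _snprintf_s would need an extra
 * buffer-size argument and cannot be substituted by a plain rename. */
#define snprintf _snprintf
#endif

/* Hypothetical helper: formats "major.minor.patch" into buf. */
static int format_version(char *buf, size_t len, int major, int minor, int patch) {
    if (len == 0) {
        return -1;
    }
    int n = snprintf(buf, len, "%d.%d.%d", major, minor, patch);
    buf[len - 1] = '\0';  /* _snprintf does not terminate on truncation */
    return n;
}
```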

Martin Kroeker [Thu, 12 Jul 2018 09:42:25 +0000 (11:42 +0200)]
Fix definition of snprintf for MSVC

MS _snprintf_s takes an additional argument for the size of the buffer, so is not a direct replacement (utest/ctest.h from which I copied was wrong)

Martin Kroeker [Thu, 12 Jul 2018 07:21:34 +0000 (09:21 +0200)]
Merge pull request #1678 from martin-frbg/issue1677

Define snprintf for older versions of MSVC

Martin Kroeker [Thu, 12 Jul 2018 05:30:58 +0000 (07:30 +0200)]
Define snprintf for older versions of MSVC

for #1677

Martin Kroeker [Wed, 4 Jul 2018 06:27:21 +0000 (08:27 +0200)]
Merge pull request #1667 from xianyi/revert-1642-develop

Revert "Rewrite &= -> = and simplify the initial blocking phase."

Martin Kroeker [Wed, 4 Jul 2018 06:19:40 +0000 (08:19 +0200)]
Merge pull request #1665 from martin-frbg/cpuid-ryzen2

Add cpuid for AMD Ryzen 2

Martin Kroeker [Wed, 4 Jul 2018 06:19:11 +0000 (08:19 +0200)]
Merge pull request #1663 from martin-frbg/issue1641

Double MAX_ALLOCATING_THREADS to fix segfaults with Go and Octave

Martin Kroeker [Tue, 3 Jul 2018 19:42:28 +0000 (21:42 +0200)]
Revert "Rewrite &= -> = and simplify the initial blocking phase."

Martin Kroeker [Tue, 3 Jul 2018 19:03:24 +0000 (21:03 +0200)]
Add cpuid for AMD Ryzen 2

Martin Kroeker [Tue, 3 Jul 2018 19:01:35 +0000 (21:01 +0200)]
Add cpuid for AMD Ryzen 2

for #1664

Martin Kroeker [Tue, 3 Jul 2018 15:40:09 +0000 (17:40 +0200)]
Merge pull request #1662 from martin-frbg/cmake-avx512

Add -march=skylake-avx512 to AVX512 compile check and suppress its ou…

Martin Kroeker [Tue, 3 Jul 2018 15:35:54 +0000 (17:35 +0200)]
Double MAX_ALLOCATING_THREADS to fix segfaults with Go and Octave

for #1641

Martin Kroeker [Tue, 3 Jul 2018 12:41:44 +0000 (14:41 +0200)]
Add -march=skylake-avx512 to AVX512 compile check and suppress its output

Martin Kroeker [Mon, 2 Jul 2018 15:48:19 +0000 (17:48 +0200)]
Merge pull request #1660 from martin-frbg/issue1659

Fix typo that broke compilation with DYNAMIC_ARCH and NO_AVX2

Martin Kroeker [Mon, 2 Jul 2018 12:40:41 +0000 (14:40 +0200)]
Fix typo that broke compilation with DYNAMIC_ARCH and NO_AVX2

fixes #1659

Martin Kroeker [Sun, 1 Jul 2018 10:03:07 +0000 (12:03 +0200)]
Merge pull request #1657 from martin-frbg/release-0.3.0 (tag: v0.3.1)

Release 0.3.1

Martin Kroeker [Sun, 1 Jul 2018 10:01:51 +0000 (12:01 +0200)]
set version number to 0.3.2.dev

Martin Kroeker [Sun, 1 Jul 2018 10:01:16 +0000 (12:01 +0200)]
set version number to 0.3.2.dev

Martin Kroeker [Sun, 1 Jul 2018 09:59:47 +0000 (11:59 +0200)]
remove dev suffix from version number

Martin Kroeker [Sun, 1 Jul 2018 09:58:57 +0000 (11:58 +0200)]
remove dev suffix from version number

Martin Kroeker [Sun, 1 Jul 2018 09:56:40 +0000 (11:56 +0200)]
Merge pull request #1648 from martin-frbg/nofort

Handle NOFORTRAN=0

Martin Kroeker [Sun, 1 Jul 2018 09:55:21 +0000 (11:55 +0200)]
Merge pull request #1656 from xianyi/develop

Update the 0.3 branch from develop

Martin Kroeker [Sun, 1 Jul 2018 06:41:22 +0000 (08:41 +0200)]
Merge pull request #1655 from martin-frbg/issue1641

Fix apparent off-by-one error in calculation of MAX_ALLOCATING_THREADS

Martin Kroeker [Sat, 30 Jun 2018 23:17:03 +0000 (01:17 +0200)]
Merge pull request #1654 from martin-frbg/avx512check

Add compiler option to avx512 test and hide test output

Martin Kroeker [Sat, 30 Jun 2018 21:57:50 +0000 (23:57 +0200)]
Fix apparent off-by-one error in calculation of MAX_ALLOCATING_THREADS

fixes #1641

Martin Kroeker [Sat, 30 Jun 2018 21:47:44 +0000 (23:47 +0200)]
Add compiler option to avx512 test and hide test output

Martin Kroeker [Sat, 30 Jun 2018 15:48:03 +0000 (17:48 +0200)]
Merge pull request #1651 from martin-frbg/avx512-nodgemm

Disable the 16x2 DTRMM kernel on SkylakeX as well

Martin Kroeker [Sat, 30 Jun 2018 15:31:06 +0000 (17:31 +0200)]
Disable the 16x2 DTRMM kernel on SkylakeX as well

Martin Kroeker [Sat, 30 Jun 2018 11:05:46 +0000 (13:05 +0200)]
Merge pull request #1650 from martin-frbg/avx512-nodgemm

Disable the AVX512 DGEMM kernel for now

Martin Kroeker [Sat, 30 Jun 2018 11:05:30 +0000 (13:05 +0200)]
Merge pull request #1639 from martin-frbg/dyn_list

Add DYNAMIC_LIST option for user-defined list of dynamic targets

Martin Kroeker [Sat, 30 Jun 2018 09:34:48 +0000 (11:34 +0200)]
Disable the AVX512 DGEMM kernel for now

due to #1643

Martin Kroeker [Tue, 26 Jun 2018 22:09:21 +0000 (00:09 +0200)]
Update Makefile

Martin Kroeker [Tue, 26 Jun 2018 22:07:32 +0000 (00:07 +0200)]
Merge branch 'develop' into nofort

Martin Kroeker [Tue, 26 Jun 2018 22:00:27 +0000 (00:00 +0200)]
Handle NOFORTRAN=0

Martin Kroeker [Tue, 26 Jun 2018 20:27:30 +0000 (22:27 +0200)]
Merge pull request #1647 from martin-frbg/armv7-dot

Remove premature exits from ARMV7 xdot codes

Martin Kroeker [Tue, 26 Jun 2018 18:46:42 +0000 (20:46 +0200)]
Remove premature exit for INC_X or INC_Y zero

Martin Kroeker [Tue, 26 Jun 2018 18:45:57 +0000 (20:45 +0200)]
Remove premature exit for INC_X or INC_Y zero

Martin Kroeker [Tue, 26 Jun 2018 18:45:00 +0000 (20:45 +0200)]
Remove premature exit for INC_X or INC_Y zero

Martin Kroeker [Tue, 26 Jun 2018 18:44:13 +0000 (20:44 +0200)]
Remove premature exit for INC_X or INC_Y zero

Martin Kroeker [Tue, 26 Jun 2018 08:15:15 +0000 (10:15 +0200)]
Merge pull request #1644 from martin-frbg/revert-filterout

Revert changes to NOFORTRAN handling in Makefile

Martin Kroeker [Tue, 26 Jun 2018 06:09:52 +0000 (08:09 +0200)]
Revert changes to NOFORTRAN handling from 952541e

Martin Kroeker [Mon, 25 Jun 2018 19:02:31 +0000 (21:02 +0200)]
Try gradual fallback for cores not in the dynamic core list

Martin Kroeker [Mon, 25 Jun 2018 18:48:10 +0000 (20:48 +0200)]
Merge pull request #2 from martin-frbg/develop

merge develop

Martin Kroeker [Mon, 25 Jun 2018 18:45:56 +0000 (20:45 +0200)]
Merge pull request #1 from xianyi/develop

Merge xianyi:develop into develop

Martin Kroeker [Mon, 25 Jun 2018 17:23:40 +0000 (19:23 +0200)]
Merge pull request #1642 from oon3m0oo/develop

Rewrite &= -> = and simplify the initial blocking phase.

Craig Donner [Mon, 25 Jun 2018 12:53:11 +0000 (13:53 +0100)]
Rewrite &= -> = and simplify the initial blocking phase.

Martin Kroeker [Sat, 23 Jun 2018 17:42:15 +0000 (19:42 +0200)]
Add support for a user-defined list of dynamic targets

Martin Kroeker [Sat, 23 Jun 2018 17:41:32 +0000 (19:41 +0200)]
Add support for a user-defined list of dynamic targets

Martin Kroeker [Sat, 23 Jun 2018 13:01:02 +0000 (15:01 +0200)]
Merge pull request #1638 from martin-frbg/issue1637

Expose the CBLAS interface to the IxAMIN functions and have make build it

Martin Kroeker [Sat, 23 Jun 2018 11:31:09 +0000 (13:31 +0200)]
Expose CBLAS interface to BLAS extensions iXamin

Martin Kroeker [Sat, 23 Jun 2018 11:27:30 +0000 (13:27 +0200)]
Build cblas_iXamin interfaces

Martin Kroeker [Thu, 21 Jun 2018 19:01:03 +0000 (21:01 +0200)]
Merge pull request #1634 from oon3m0oo/develop

Fix data races reported by TSAN.

oon3m0oo [Thu, 21 Jun 2018 16:47:45 +0000 (17:47 +0100)]
Use BLAS rather than CBLAS in test_fork.c (#1626)

This is handy for people not using lapack.

Craig Donner [Thu, 21 Jun 2018 10:13:57 +0000 (11:13 +0100)]
Fix data races reported by TSAN.

oon3m0oo [Wed, 20 Jun 2018 20:04:03 +0000 (21:04 +0100)]
Further improvements to memory.c. (#1625)

- Compiler TLS is now only used when the compiler supports it
- If compiler TLS is unsupported, we use platform-specific TLS
- Only one variable (an index) is now in TLS
- We only access TLS once per alloc, and never when freeing
- Allocation / release info is now stored within the allocation itself, by
  over-allocating; this saves having external structures do the bookkeeping, and
  reduces some of the redundant data that was being stored (such as addresses)
- We never hit the alloc lock when not using SMP or when using OpenMP (that was
  my fault)
- Now that there are fewer tracking structures I think this is a bit easier to
  read than before
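
The over-allocation idea from the list above can be sketched as follows. This is an illustration under assumptions, not the memory.c implementation: the header layout, the `slot` field, and the names `tracked_alloc`, `header_of` and `tracked_free` are all hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping header placed in front of each user buffer.
 * The union with max_align_t keeps the user pointer suitably aligned. */
typedef union {
    struct {
        size_t size;  /* bytes requested by the caller */
        int slot;     /* hypothetical index of the owning slot */
    } info;
    max_align_t align;
} alloc_header;

/* Allocate header + payload in one block and hand out the payload. */
static void *tracked_alloc(size_t size, int slot) {
    alloc_header *h = malloc(sizeof(alloc_header) + size);
    if (h == NULL) {
        return NULL;
    }
    h->info.size = size;
    h->info.slot = slot;
    return h + 1;
}

/* Recover the header from a payload pointer; no external table needed. */
static alloc_header *header_of(void *ptr) {
    return (alloc_header *)ptr - 1;
}

static void tracked_free(void *ptr) {
    if (ptr != NULL) {
        free(header_of(ptr));
    }
}
```

Because the header travels with the buffer, releasing it needs neither a lookup in a separate structure nor a stored copy of the address.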

Martin Kroeker [Wed, 20 Jun 2018 19:51:57 +0000 (21:51 +0200)]
Merge pull request #1630 from martin-frbg/x86-march

Add -march=skylake-avx512 to flags if target is skylake x

Martin Kroeker [Wed, 20 Jun 2018 19:51:38 +0000 (21:51 +0200)]
Merge pull request #1631 from oon3m0oo/stack

Avoid declaring arrays of size 0 when making large stack allocations.

Craig Donner [Wed, 20 Jun 2018 16:03:18 +0000 (17:03 +0100)]
Avoid declaring arrays of size 0 when making large stack allocations.
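
One way to avoid a zero-length array when a large request is redirected to the heap is sketched below. It assumes a scheme in which a stack length of 0 means "use the heap"; the threshold `MAX_STACK_DOUBLES` and the function `sum_of_ones` are hypothetical, and the real OpenBLAS macros are not shown in this log. A variable-length array whose size evaluates to 0 is undefined behaviour in ISO C, hence the one-element fallback.

```c
#include <stdlib.h>

#define MAX_STACK_DOUBLES 256  /* hypothetical stack threshold */

static double sum_of_ones(size_t n) {
    if (n == 0) {
        return 0.0;
    }
    /* 0 means "too large for the stack, use the heap instead". */
    size_t stack_len = (n <= MAX_STACK_DOUBLES) ? n : 0;
    /* Never declare the array with length 0: fall back to one element. */
    double stack_buf[stack_len ? stack_len : 1];
    double *buf = stack_len ? stack_buf : (double *)malloc(n * sizeof(double));
    if (buf == NULL) {
        return -1.0;  /* allocation failure */
    }
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        buf[i] = 1.0;
        sum += buf[i];
    }
    if (!stack_len) {
        free(buf);
    }
    return sum;
}
```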

Martin Kroeker [Wed, 20 Jun 2018 14:41:13 +0000 (16:41 +0200)]
Merge pull request #1629 from martin-frbg/issue1628

Make gfortran link libomp for clang in the tests; avoid two typical gotchas with NOFORTRAN

Martin Kroeker [Wed, 20 Jun 2018 13:16:19 +0000 (15:16 +0200)]
Add -march=skylake-avx512 to flags if target is skylake x

Martin Kroeker [Wed, 20 Jun 2018 11:20:30 +0000 (13:20 +0200)]
Need to use filter-out to handle NOFORTRAN not set

Martin Kroeker [Tue, 19 Jun 2018 21:28:06 +0000 (23:28 +0200)]
Modify NOFORTRAN tests to always check the value; fix rewriting of NO_FORTRAN

Martin Kroeker [Tue, 19 Jun 2018 18:53:19 +0000 (20:53 +0200)]
Handle erroneous user settings NOFORTRAN=0 and NO_FORTRAN

Martin Kroeker [Tue, 19 Jun 2018 18:47:33 +0000 (20:47 +0200)]
Handle special case of gfortran+clang+OpenMP

Martin Kroeker [Tue, 19 Jun 2018 18:46:36 +0000 (20:46 +0200)]
Handle special case of gfortran+clang+OpenMP

Martin Kroeker [Mon, 18 Jun 2018 07:02:40 +0000 (09:02 +0200)]
Merge pull request #1623 from fenrus75/fast-thread

Initialize only the required subset of the jobs array, fix barriers and improve switch ratio on SkylakeX and Haswell. For issue #1622

Martin Kroeker [Sun, 17 Jun 2018 21:38:14 +0000 (23:38 +0200)]
Support upcoming Intel Cannon Lake CPUs as Skylake X (#1621)

* Support upcoming Cannon Lake as Skylake X

Arjan van de Ven [Sun, 17 Jun 2018 18:06:24 +0000 (18:06 +0000)]
make WMB / MB safer on x86-64

make it so that

if (foo)
RMB;
else
MB;

is always done correctly and without syntax surprises
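
A common way to make statement-like macros safe in this position is the `do { ... } while (0)` idiom, sketched below. The actual macro definitions from the patch are not shown in this log; `SAFE_MB`, `SAFE_RMB`, `barrier_calls` and `pick_barrier` are hypothetical stand-ins.

```c
/* Hypothetical counter standing in for a real barrier instruction. */
static int barrier_calls = 0;

/* Fragile form, for contrast: the trailing semicolon is part of the macro,
 * so `if (c) FRAGILE_MB; else ...` expands to two statements before the
 * `else` and does not compile.
 *
 *   #define FRAGILE_MB barrier_calls += 10;
 */

/* Safe form: each macro expands to exactly one statement that still
 * requires the caller's semicolon. */
#define SAFE_MB  do { barrier_calls += 10; } while (0)
#define SAFE_RMB do { barrier_calls += 1; } while (0)

static void pick_barrier(int read_only) {
    if (read_only)
        SAFE_RMB;
    else
        SAFE_MB;
}
```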

Arjan van de Ven [Sun, 17 Jun 2018 17:53:15 +0000 (17:53 +0000)]
On x86-64, make MB/WMB compiler barriers

While on x86(64) one does not normally need full memory barriers, it's
good practice to at least use compiler barriers for places where on other
architectures memory barriers are used; this prevents the compiler
from over-optimizing.
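
For readers unfamiliar with the term, a compiler barrier in GCC/Clang syntax looks like the sketch below. It emits no instruction and is not a CPU fence; it only stops the compiler from reordering or caching memory accesses across it. `COMPILER_BARRIER`, `payload`, `ready` and `publish` are hypothetical names, not the OpenBLAS macros.

```c
/* GCC/Clang extension: an empty asm statement with a "memory" clobber. */
#define COMPILER_BARRIER() __asm__ __volatile__("" : : : "memory")

static int payload;
static int ready;

/* The compiler may not move the store to `ready` above the store to
 * `payload`, because the barrier sits between them. */
static void publish(int value) {
    payload = value;
    COMPILER_BARRIER();
    ready = 1;
}
```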

Arjan van de Ven [Sun, 17 Jun 2018 17:50:43 +0000 (17:50 +0000)]
Add missing barriers in gemm scheduler

A few places in the gemm scheduler code were missing barriers.
The code likely worked OK due to heavy use of volatile / _Atomic,
but there's no reason to leave this incorrect.

Arjan van de Ven [Sun, 17 Jun 2018 17:05:04 +0000 (17:05 +0000)]
Tune HASWELL SWITCH_RATIO as well

Similar to the SKYLAKEX patch, 32 seems to work best
(much better than 4 or 16)

Before (4)

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               15554.3    7.2       0.2%                             30353.8      3.7       0.3%
  64 x 64               30346.8    8.7       1.6%                             63495.0      4.1      -0.1%
  65 x 65               81668.1    3.4    -123.3%                             82705.2      3.3     -21.2%
  80 x 80              105045.9    4.9     -95.5%                            115226.0      4.5      -2.2%
  96 x 96              152461.2    5.8     -74.3%                            148156.3      6.0      16.4%
 112 x 112             188505.2    7.5     -42.2%                            171187.3      8.2      36.4%
 128 x 128             257884.0    8.1     -39.5%                            224764.8      9.3      46.0%

Intermediate (16)

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               15565.7    7.2       0.2%                             30378.9      3.7       0.2%
  64 x 64               30430.2    8.7       1.3%                             63046.4      4.2       0.6%
  65 x 65               27306.0   10.1      25.3%                             38879.2      7.1      43.0%
  80 x 80               51008.7   10.1       5.1%                             61007.6      8.4      45.9%
  96 x 96               70856.7   12.5      19.0%                             83403.1     10.6      53.0%
 112 x 112              84769.9   16.6      36.0%                             99920.1     14.1      62.9%
 128 x 128              84213.2   25.0      54.5%                            113024.2     18.6      72.8%

After (32)

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               15537.3    7.2       0.3%                             30537.0      3.6      -0.3%
  64 x 64               30352.7    8.7       1.6%                             62597.8      4.2       1.3%
  65 x 65               36857.0    7.5      -0.8%                             56167.6      4.9      17.7%
  80 x 80               42552.6   12.1      20.8%                             69536.7      7.4      38.3%
  96 x 96               52101.5   17.1      40.5%                             91016.1      9.7      48.7%
 112 x 112              63853.7   22.1      51.8%                            110507.4     12.7      58.9%
 128 x 128              73966.1   28.4      60.0%                            163146.4     12.9      60.8%

Arjan van de Ven [Sun, 17 Jun 2018 15:47:50 +0000 (15:47 +0000)]
Tune param.h for SkylakeX

param.h defines a per-platform SWITCH_RATIO, which is used as a measure for how fine
grained the blocks for gemm need to be split up. Many platforms define this to 4.

The reality is that the gemm low level implementation for SkylakeX likes bigger blocks
due to the nature of SIMD... by tuning the SWITCH_RATIO to 32 the threading performance
improves significantly:

Before
   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10756.0   10.5      -0.5%                             18296.7      6.1      -1.7%
  64 x 64               20490.0   12.9       1.4%                             40615.0      6.5       0.0%
  65 x 65               83528.3    3.3    -210.9%                             96319.0      2.9     -83.3%
  80 x 80              101453.5    5.1    -166.3%                            128021.7      4.0     -76.6%
  96 x 96              149795.1    5.9    -143.1%                            168059.4      5.3     -47.4%
 112 x 112             191481.2    7.3    -105.8%                            204165.0      6.9     -14.6%
 128 x 128             265019.2    7.9     -99.0%                            272006.4      7.7      -5.3%

After
   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10666.3   10.6       0.4%                             18236.9      6.2      -1.4%
  64 x 64               20410.1   13.0       1.8%                             39925.8      6.6       1.7%
  65 x 65               34983.0    7.9     -30.2%                             51494.6      5.4       2.0%
  80 x 80               39769.1   13.0      -4.4%                             63805.2      8.1      12.0%
  96 x 96               45169.6   19.7      26.7%                             80065.8     11.1      29.8%
 112 x 112              57026.1   24.7      38.7%                             99535.5     14.2      44.1%
 128 x 128              64789.8   32.5      51.3%                            117407.2     17.9      54.6%

With this change, threading starts to be a win already at 96x96

Arjan van de Ven [Sun, 17 Jun 2018 15:39:15 +0000 (15:39 +0000)]
Don't use _Atomic for jobs sometimes...

The use of _Atomic leads to really bad code generation in the compiler
(on x86, you get 2 "mfence" memory barriers around each access with gcc8, despite
x86 being ordered and cache coherent). But there's a fallback in the code that
just uses volatile which is more than plenty in practice.

If we're nervous about cross thread synchronization for these variables, we should
make the YIELD function be a compiler/memory barrier instead.

performance before (after last commit)

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10630.0   10.6       0.7%                             18112.8      6.2      -0.7%
  64 x 64               20374.8   13.0       1.9%                             40487.0      6.5       0.4%
  65 x 65              141955.2    1.9    -428.3%                            146708.8      1.9    -179.2%
  80 x 80              178921.1    2.9    -369.6%                            186032.7      2.8    -156.6%
  96 x 96              205436.2    4.3    -233.4%                            224513.1      3.9     -97.0%
 112 x 112             244408.2    5.8    -162.7%                            262158.7      5.4     -47.1%
 128 x 128             321334.5    6.5    -141.3%                            333829.0      6.3     -29.2%

Performance with this patch (roughly a 2x improvement):

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10756.0   10.5      -0.5%                             18296.7      6.1      -1.7%
  64 x 64               20490.0   12.9       1.4%                             40615.0      6.5       0.0%
  65 x 65               83528.3    3.3    -210.9%                             96319.0      2.9     -83.3%
  80 x 80              101453.5    5.1    -166.3%                            128021.7      4.0     -76.6%
  96 x 96              149795.1    5.9    -143.1%                            168059.4      5.3     -47.4%
 112 x 112             191481.2    7.3    -105.8%                            204165.0      6.9     -14.6%
 128 x 128             265019.2    7.9     -99.0%                            272006.4      7.7      -5.3%

Arjan van de Ven [Sun, 17 Jun 2018 15:32:03 +0000 (15:32 +0000)]
Only initialize the part of the jobs array that will get used

The jobs array is getting initialized in O(compiled cpus^2) complexity.
Distros and people with bigger systems will use pretty high values
(128 or 256 or more) for this value, leading to interesting bubbles
in performance.
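
The cost difference can be sketched as follows, assuming a square table indexed by thread pairs. The names `MAX_CPU`, `working` and `init_jobs` are hypothetical and the real jobs structure is not reproduced here. Clearing the full table costs MAX_CPU^2 writes per call regardless of how many threads run, while clearing only the used corner costs nthreads^2.

```c
#define MAX_CPU 256  /* hypothetical compile-time thread limit */

static long working[MAX_CPU][MAX_CPU];  /* hypothetical per-pair state */

/* Clear only the nthreads x nthreads corner that will actually be used.
 * Returns the number of entries written, to make the cost visible. */
static unsigned long init_jobs(int nthreads) {
    unsigned long touched = 0;
    if (nthreads > MAX_CPU) {
        nthreads = MAX_CPU;
    }
    for (int i = 0; i < nthreads; i++) {
        for (int j = 0; j < nthreads; j++) {
            working[i][j] = 0;
            touched++;
        }
    }
    return touched;
}
```

With 4 threads this writes 16 entries, whereas initializing the whole table with a limit of 256 writes 65536.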

Baseline (single threaded performance) gets roughly 13 - 15 multiplications per cycle
in the interesting range (threading kicks in at 65x65 mult by 65x65).
The hardware is capable of 32 multiplications per cycle theoretically.

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10703.9   10.6       0.0%                             17990.6      6.3       0.0%
  64 x 64               20778.4   12.8       0.0%                             40629.2      6.5       0.0%
  65 x 65               26869.9   10.3       0.0%                             52545.7      5.3       0.0%
  80 x 80               38104.5   13.5       0.0%                             72492.7      7.1       0.0%
  96 x 96               61626.4   14.4       0.0%                            113983.8      7.8       0.0%
 112 x 112              91803.8   15.3       0.0%                            180987.3      7.8       0.0%
 128 x 128             133161.4   15.8       0.0%                            258374.3      8.1       0.0%

When threading is turned on
TARGET=SKYLAKEX F_COMPILER=GFORTRAN  SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0  NUM_THREADS=128

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10725.9   10.5      -0.2%                             18134.9      6.2      -0.8%
  64 x 64               20500.6   12.9       1.3%                             40929.1      6.5      -0.7%
  65 x 65             2040832.1    0.1   -7495.2%                           2097633.6      0.1   -3892.0%
  80 x 80             2063129.1    0.2   -5314.4%                           2119925.2      0.2   -2824.3%
  96 x 96             2070374.5    0.4   -3259.6%                           2173604.4      0.4   -1806.9%
 112 x 112            2111721.5    0.7   -2169.6%                           2263330.8      0.6   -1170.0%
 128 x 128            2276181.5    0.9   -1609.3%                           2377228.9      0.9    -820.1%

There is a deep deep cliff once you hit 65x65

With this patch

   Matrix          SGEMM cycles    MPC                                   DGEMM cycles      MPC
  48 x 48               10630.0   10.6       0.7%                             18112.8      6.2      -0.7%
  64 x 64               20374.8   13.0       1.9%                             40487.0      6.5       0.4%
  65 x 65              141955.2    1.9    -428.3%                            146708.8      1.9    -179.2%
  80 x 80              178921.1    2.9    -369.6%                            186032.7      2.8    -156.6%
  96 x 96              205436.2    4.3    -233.4%                            224513.1      3.9     -97.0%
 112 x 112             244408.2    5.8    -162.7%                            262158.7      5.4     -47.1%
 128 x 128             321334.5    6.5    -141.3%                            333829.0      6.3     -29.2%

The cliff is very significantly reduced.
(more to follow)

Martin Kroeker [Fri, 15 Jun 2018 09:25:05 +0000 (11:25 +0200)]
Add build-time option for OMP scheduler; document MULTITHREAD_THRESHOLD range (#1620)

* Allow choosing the OpenMP scheduler and add range hint for GEMM_MULTITHREAD_THRESHOLD
* Amended description of GEMM_MULTITHREAD_THRESHOLD
to reflect #742 making it track floating point operations rather than matrix size