Martin Kroeker [Tue, 3 Dec 2019 07:22:40 +0000 (08:22 +0100)]
Merge pull request #20 from xianyi/develop
Rebase
Martin Kroeker [Mon, 2 Dec 2019 07:30:43 +0000 (08:30 +0100)]
Merge pull request #2329 from isuruf/patch-1
Workaround an ICE in clang 9.0.0
Isuru Fernando [Sun, 1 Dec 2019 17:55:49 +0000 (11:55 -0600)]
Workaround an ICE in clang 9.0.0
This bug is not present in 8.x or in the 9.0 daily snapshot.
Martin Kroeker [Sat, 30 Nov 2019 11:23:57 +0000 (12:23 +0100)]
Merge pull request #2328 from martin-frbg/ppc9
Fix precompiled kernels on POWER9 and make their use conditional on (old) gcc version
Martin Kroeker [Fri, 29 Nov 2019 23:03:42 +0000 (00:03 +0100)]
Merge pull request #2324 from antonblanchard/power9_segv
Fix SEGV in cdot_power9
Martin Kroeker [Fri, 29 Nov 2019 22:56:57 +0000 (23:56 +0100)]
Fix caxpy/caxpyc naming in localentry
Martin Kroeker [Fri, 29 Nov 2019 22:54:15 +0000 (23:54 +0100)]
Fix caxpy/caxpyc naming in localentry
Martin Kroeker [Fri, 29 Nov 2019 22:49:50 +0000 (23:49 +0100)]
Substitute precompiled gcc7 codes only when gcc is older than 9.x
Martin Kroeker [Fri, 29 Nov 2019 22:47:23 +0000 (23:47 +0100)]
Add variable for gcc >=9 test
used in KERNEL.POWER9
Martin Kroeker [Fri, 29 Nov 2019 22:44:09 +0000 (23:44 +0100)]
Merge pull request #19 from xianyi/develop
rebase
Martin Kroeker [Thu, 28 Nov 2019 19:55:16 +0000 (20:55 +0100)]
Merge pull request #2323 from wjc404/develop
some optimizations of AVX512 DGEMM
wjc404 [Thu, 28 Nov 2019 11:57:50 +0000 (19:57 +0800)]
Update param.h
wjc404 [Thu, 28 Nov 2019 11:56:35 +0000 (19:56 +0800)]
Update dgemm_kernel_4x8_skylakex_2.c
Martin Kroeker [Thu, 28 Nov 2019 08:30:24 +0000 (09:30 +0100)]
Merge pull request #2321 from martin-frbg/issue2319
Fix race conditions in multithreaded GEMM3M
Martin Kroeker [Thu, 28 Nov 2019 07:43:45 +0000 (08:43 +0100)]
Merge pull request #2327 from martin-frbg/travisosx
Cleanup IOS build and disable FORTRAN on 32bit and ios builds for now
Martin Kroeker [Wed, 27 Nov 2019 23:17:19 +0000 (00:17 +0100)]
Merge pull request #2326 from xianyi/revert-2325-travisosx
Revert "Cleanup Travis IOS xbuild and disable FORTRAN on 32bit and ios builds for now"
Martin Kroeker [Wed, 27 Nov 2019 23:15:36 +0000 (00:15 +0100)]
Cleanup IOS build and disable FORTRAN on 32bit and ios builds for now
Travis recently appears unable to find a matching homebrew package for 32bit gfortran,
and the IOS crossbuild suffered from excessive output due to the known problem with "ASMNAME redefined"
warnings when CFLAGS is set in the environment
Martin Kroeker [Wed, 27 Nov 2019 23:09:06 +0000 (00:09 +0100)]
Revert "Cleanup Travis IOS xbuild and disable FORTRAN on 32bit and ios builds for now"
Martin Kroeker [Wed, 27 Nov 2019 20:59:36 +0000 (21:59 +0100)]
Merge pull request #2325 from martin-frbg/travisosx
Cleanup Travis IOS xbuild and disable FORTRAN on 32bit and ios builds for now
Martin Kroeker [Wed, 27 Nov 2019 14:10:57 +0000 (15:10 +0100)]
Cleanup IOS build and disable FORTRAN on 32bit and ios builds for now
Travis recently appears unable to find a matching homebrew package for 32bit gfortran,
and the IOS crossbuild suffered from excessive output due to the known problem with "ASMNAME redefined"
warnings when CFLAGS is set in the environment
Anton Blanchard [Wed, 27 Nov 2019 04:55:04 +0000 (21:55 -0700)]
Fix SEGV in cdot_power9
We were corrupting r2 because the local entry wasn't being
set up correctly.
wjc404 [Tue, 26 Nov 2019 06:12:20 +0000 (14:12 +0800)]
some optimizations
Martin Kroeker [Sat, 23 Nov 2019 21:38:07 +0000 (22:38 +0100)]
Fix AVX512 capability test (always returning zero)
from #2322
Martin Kroeker [Sat, 23 Nov 2019 18:54:56 +0000 (19:54 +0100)]
Fix race conditions in multithreaded GEMM3M
by adding barriers (and a mutex lock for the non-OpenMP case), as was already done for GEMM in level3_thread.c some time ago
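The general pattern behind this fix can be illustrated with POSIX threads: every thread waits at a barrier before reading buffers that other threads write, and shared bookkeeping is updated under a mutex. This is a minimal sketch, not the level3_thread.c or GEMM3M code; `NUM_THREADS`, `worker`, `partial`, `totals` and `run_two_phase` are hypothetical names, and it assumes a platform that provides `pthread_barrier_t` (e.g. Linux with glibc).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* pthread_barrier_t is a POSIX.1-2001 feature */
#endif
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_THREADS 4  /* hypothetical thread count for the sketch */

static double partial[NUM_THREADS];   /* phase 1 output: one slot per thread */
static double totals[NUM_THREADS];    /* phase 2 output: each thread's combined sum */
static int finished_count;            /* shared counter, protected by count_lock */
static pthread_barrier_t phase_barrier;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    size_t id = (size_t)(uintptr_t)arg;

    /* Phase 1: each thread writes only its own slot. */
    partial[id] = (double)(id + 1);

    /* Wait until every thread has finished phase 1. Without this barrier a
       fast thread could read slots that slower threads have not written yet. */
    pthread_barrier_wait(&phase_barrier);

    /* Phase 2: all slots are now complete and safe to read. */
    double sum = 0.0;
    for (size_t i = 0; i < NUM_THREADS; i++)
        sum += partial[i];
    totals[id] = sum;

    /* Updates to state shared by all threads are serialized by the mutex. */
    pthread_mutex_lock(&count_lock);
    finished_count++;
    pthread_mutex_unlock(&count_lock);
    return NULL;
}

/* Runs both phases on NUM_THREADS threads; returns 0 on success, -1 if the
   barrier cannot be created. pthread_create error handling is omitted for
   brevity: a missing thread would leave the others blocked at the barrier. */
int run_two_phase(void) {
    pthread_t threads[NUM_THREADS];
    finished_count = 0;  /* reset before any worker starts */
    if (pthread_barrier_init(&phase_barrier, NULL, NUM_THREADS) != 0)
        return -1;
    for (size_t i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)(uintptr_t)i);
    for (size_t i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    pthread_barrier_destroy(&phase_barrier);
    return 0;
}
```

With the barrier in place every thread sees all four slots filled, so each `totals[i]` is 1+2+3+4. Removing the `pthread_barrier_wait` call would make the result timing-dependent, which is the class of bug the commit addresses.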
Martin Kroeker [Thu, 21 Nov 2019 17:14:29 +0000 (18:14 +0100)]
Add the cpuid of the business/rackmount version of z15 as well
Martin Kroeker [Thu, 21 Nov 2019 17:03:00 +0000 (18:03 +0100)]
Merge pull request #2316 from sharkcz/s390x
zarch: treat z15 as z14 instead of generic
Martin Kroeker [Thu, 21 Nov 2019 16:59:21 +0000 (17:59 +0100)]
Merge pull request #2317 from aarnez/develop
Change bad usage of "asum" to "sum" in ZARCH versions of ?sum
Andreas Arnez [Fri, 20 Sep 2019 16:32:47 +0000 (18:32 +0200)]
Change bad usage of "asum" to "sum" in ZARCH versions of ?sum
The ZARCH implementations of ?sum contain a cut & paste-error: An inline
assembly argument is named "sum", but the assembly references "asum"
instead. The mismatch causes a build error. This is fixed.
Dan Horák [Thu, 21 Nov 2019 11:49:54 +0000 (12:49 +0100)]
zarch: treat z15 as z14 instead of generic
Signed-off-by: Dan Horák <dan@danny.cz>
Martin Kroeker [Thu, 21 Nov 2019 04:06:44 +0000 (05:06 +0100)]
Merge pull request #2315 from ewanglong/develop
Revised Windows-compatibility fix for #2313
Wang, Long [Thu, 21 Nov 2019 02:19:40 +0000 (10:19 +0800)]
Revised Windows-compatibility fix for #2313
Signed-off-by: Wang, Long <long1.wang@intel.com>
Martin Kroeker [Wed, 20 Nov 2019 15:16:35 +0000 (16:16 +0100)]
Merge pull request #2314 from Jehan/wip/Jehan/fix-openblas-crash
Fix usage of TerminateThread() causing critical section corruption.
Martin Kroeker [Wed, 20 Nov 2019 14:12:06 +0000 (15:12 +0100)]
Merge pull request #2312 from martin-frbg/power8be
Further Power8 big-endian corrections
Martin Kroeker [Wed, 20 Nov 2019 13:49:15 +0000 (14:49 +0100)]
Merge pull request #2313 from ewanglong/develop
Fix the integer overflow issue for large matrix size
Wang, Long [Wed, 20 Nov 2019 13:30:16 +0000 (21:30 +0800)]
For Windows compatibility, use "unsigned long long" to guarantee a 64-bit type (long is only 32 bits on 64-bit Windows)
Signed-off-by: Wang, Long <long1.wang@intel.com>
Jehan [Wed, 20 Nov 2019 11:21:35 +0000 (12:21 +0100)]
Fix usage of TerminateThread() causing critical section corruption.
This patch was submitted to the GIMP project by a publisher wishing to
keep confidentiality (hence anonymously). I am just passing the patch along.
Here is the explanation that came with the patch:
First they remind us what Microsoft documentation says about
TerminateThread:
> TerminateThread is a dangerous function that should only be used in
> the most extreme cases. You should call TerminateThread only if you
> know exactly what the target thread is doing, and you control all of
> the code that the target thread could possibly be running at the time
> of the termination.
(https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-terminatethread)
Then they say that a 5-millisecond time-out might not be long enough for
the thread to exit gracefully. They propose setting it to a much higher
value (here, for instance, 5 seconds).
And finally you should always check the return value of
WaitForSingleObject(). In particular you want to run TerminateThread()
only if WaitForSingleObject() failed, never in the success case.
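The shutdown sequence recommended above (a longer grace period, then TerminateThread only when the wait did not return WAIT_OBJECT_0) can be sketched in C. This is an illustration of the quoted advice, not the actual patch; `needs_forced_termination`, `shutdown_worker` and `SHUTDOWN_TIMEOUT_MS` are hypothetical names, and the non-Windows fallback constants (copied from the Windows SDK values) exist only so the decision logic compiles off Windows.

```c
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Outside Windows, mirror the Windows SDK values so the decision
   logic below can be compiled and tested anywhere. */
typedef uint32_t DWORD;
#define WAIT_OBJECT_0 0x00000000u
#define WAIT_TIMEOUT  0x00000102u
#define WAIT_FAILED   0xFFFFFFFFu
#endif

/* Hypothetical grace period: 5 seconds instead of 5 milliseconds. */
#define SHUTDOWN_TIMEOUT_MS 5000u

/* Forced termination is a last resort: only when the wait did not report
   that the thread exited (i.e. it timed out or the wait itself failed). */
int needs_forced_termination(DWORD wait_result) {
    return wait_result != WAIT_OBJECT_0;
}

#ifdef _WIN32
/* Hypothetical helper; the caller is assumed to have already told the
   worker thread to exit before calling this. */
void shutdown_worker(HANDLE thread) {
    DWORD rc = WaitForSingleObject(thread, SHUTDOWN_TIMEOUT_MS);
    if (needs_forced_termination(rc)) {
        TerminateThread(thread, 0);  /* dangerous; reached only on timeout/failure */
    }
    CloseHandle(thread);
}
#endif
```

Keeping the decision in a separate function makes the "never terminate on success" rule explicit and easy to check independently of the Win32 calls.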
Wang, Long [Wed, 20 Nov 2019 03:50:37 +0000 (11:50 +0800)]
Fix the integer overflow issue for large matrix size
For large matrices, e.g. M=N=K with M>1290, int mnk=M*N*K overflows.
This leads to wrongly branching to single-threading, and performance
is degraded significantly.
Signed-off-by: Wang, Long <long1.wang@intel.com>
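The overflow threshold and the fix (widening to "unsigned long long" before multiplying) can be checked with a small C sketch. The helper names `mnk_product` and `fits_in_int` are hypothetical and not taken from the OpenBLAS sources.

```c
#include <limits.h>

/* Widen each factor before multiplying so the whole product is evaluated
   in at least 64 bits. The buggy pattern, int mnk = M*N*K, overflows a
   32-bit int (undefined behaviour for signed types) once the product
   exceeds INT_MAX. */
unsigned long long mnk_product(int m, int n, int k) {
    return (unsigned long long)m * (unsigned long long)n * (unsigned long long)k;
}

/* Reports whether a product would still have fit in a plain int. */
int fits_in_int(unsigned long long v) {
    return v <= (unsigned long long)INT_MAX;
}
```

For M=N=K=1290 the product is 2,146,689,000, just below INT_MAX (2,147,483,647); for 1291 it is 2,151,685,171, which no longer fits, matching the M>1290 bound in the commit message.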
Martin Kroeker [Sun, 17 Nov 2019 22:19:48 +0000 (23:19 +0100)]
Merge pull request #2310 from martin-frbg/ppc440
Fix PPC440 big-endian support and disable the QCDOC qalloc routine by default
Martin Kroeker [Sun, 17 Nov 2019 22:12:10 +0000 (23:12 +0100)]
Define alternate kernels for big-endian POWER8
Martin Kroeker [Sun, 17 Nov 2019 21:58:32 +0000 (22:58 +0100)]
Fix compilation for big-endian POWER8
Martin Kroeker [Sun, 17 Nov 2019 18:25:08 +0000 (19:25 +0100)]
Define alternate kernels for big-endian PPC440
Martin Kroeker [Sun, 17 Nov 2019 18:22:04 +0000 (19:22 +0100)]
Disable the old QCDOC qalloc by default and copy utility functions from memory.c
1. qalloc() appears to have been a special routine written for the PPC440-based QCDOC supercomputer(s) from around 2005; its source does not seem to be readily available. So switch the #if 1 in the code to rely on standard malloc() by default.
2. Utility functions like get_num_procs, get_num_threads that were added to the "normally" used memory.c in the meantime were still missing here.
Martin Kroeker [Sun, 17 Nov 2019 18:09:49 +0000 (19:09 +0100)]
Merge pull request #17 from xianyi/develop
rebase
Martin Kroeker [Sun, 17 Nov 2019 17:22:24 +0000 (18:22 +0100)]
Merge pull request #2309 from martin-frbg/ppc970-be
Fix PPC970 big-endian support
Martin Kroeker [Sun, 17 Nov 2019 14:19:39 +0000 (15:19 +0100)]
Define alternate kernels for big-endian PPC970
The altivec versions of SGEMM and CGEMM fail most tests in LAPACK-TESTING when compiled for big endian; STRSM/CTRSM even cause segfaults. The rot kernels either fail the corresponding utest or lead to failures in LAPACK-TESTING.
Martin Kroeker [Sun, 17 Nov 2019 14:10:26 +0000 (15:10 +0100)]
Use "generic" S/CGEMM unroll M on big-endian PPC970
as the respective PPC970 "altivec" kernels give wrong results when compiled for big endian
Martin Kroeker [Fri, 15 Nov 2019 07:33:17 +0000 (08:33 +0100)]
Merge pull request #2308 from martin-frbg/ctestfix
Fix potential issue in the c/z blas3 ctests
Martin Kroeker [Thu, 14 Nov 2019 23:20:36 +0000 (00:20 +0100)]
Fix potential spurious failure from uninitialized variable
Martin Kroeker [Thu, 14 Nov 2019 23:19:24 +0000 (00:19 +0100)]
Fix potential spurious failure from uninitialized variable
Martin Kroeker [Tue, 12 Nov 2019 06:38:37 +0000 (07:38 +0100)]
Merge pull request #2305 from wjc404/develop
AVX512 CGEMM & ZGEMM kernels
wjc404 [Mon, 11 Nov 2019 12:04:52 +0000 (20:04 +0800)]
AVX512 CGEMM & ZGEMM kernels
96-99% of MKL 2018 single-thread performance
Martin Kroeker [Sat, 9 Nov 2019 17:52:08 +0000 (18:52 +0100)]
Merge pull request #15 from xianyi/develop
rebase
Martin Kroeker [Wed, 6 Nov 2019 06:27:33 +0000 (07:27 +0100)]
Merge pull request #2300 from wjc404/develop
Optimize SGEMM on SKYLAKEX CPUs
wjc404 [Tue, 5 Nov 2019 05:36:56 +0000 (13:36 +0800)]
optimizations of software prefetching
Martin Kroeker [Mon, 4 Nov 2019 21:55:05 +0000 (22:55 +0100)]
Merge pull request #2302 from martin-frbg/ppc970
Disable three-operand DCBT on PPC970 regardless of operating system
Martin Kroeker [Mon, 4 Nov 2019 21:54:28 +0000 (22:54 +0100)]
Merge pull request #2301 from martin-frbg/ppc8be
Disable IDAMIN/MAX and IZAMIN/MAX optimizations on big-endian POWER8
Martin Kroeker [Mon, 4 Nov 2019 21:53:58 +0000 (22:53 +0100)]
Merge pull request #2294 from martin-frbg/ios-cleanup
Remove obsolete workarounds for IOS on ARMV8
wjc404 [Mon, 4 Nov 2019 12:10:12 +0000 (20:10 +0800)]
Add files via upload
wjc404 [Mon, 4 Nov 2019 11:37:19 +0000 (19:37 +0800)]
optimizations via software prefetches
Martin Kroeker [Sun, 3 Nov 2019 21:55:31 +0000 (22:55 +0100)]
Use the two-operand form of DCBT on all PPC970 regardless of OS
There seems to be no advantage to the three-operand form used in the earliest GotoBLAS kernels, and it also causes compilation problems on platforms other than the previously special-cased ones
Martin Kroeker [Sun, 3 Nov 2019 21:42:46 +0000 (22:42 +0100)]
The assembly microkernel is not safe to use on ELFv1
Martin Kroeker [Sun, 3 Nov 2019 21:41:19 +0000 (22:41 +0100)]
The assembly microkernel is not safe to use on ELFv1
Martin Kroeker [Sun, 3 Nov 2019 21:39:06 +0000 (22:39 +0100)]
The assembly microkernel is not safe to use on ELFv1
Martin Kroeker [Sun, 3 Nov 2019 21:37:27 +0000 (22:37 +0100)]
The assembly microkernel is not safe to use on ELFv1
Martin Kroeker [Sun, 3 Nov 2019 21:33:31 +0000 (22:33 +0100)]
Merge pull request #13 from xianyi/develop
resync with upstream
wjc404 [Sat, 2 Nov 2019 02:09:19 +0000 (10:09 +0800)]
Add files via upload
wjc404 [Sat, 2 Nov 2019 02:06:13 +0000 (10:06 +0800)]
Add files via upload
wjc404 [Fri, 1 Nov 2019 16:00:48 +0000 (00:00 +0800)]
new sgemm kernel for skylakex
wjc404 [Fri, 1 Nov 2019 15:59:18 +0000 (23:59 +0800)]
update sgemm_q on skylakex cpus
Martin Kroeker [Mon, 28 Oct 2019 12:24:18 +0000 (13:24 +0100)]
Merge pull request #2296 from kdunee/develop
Fixed a minor cmake problem, occurring when DYNAMIC_ARCH=ON and CMAKE_C_FLAGS was empty
k.dunikowski [Mon, 28 Oct 2019 07:51:05 +0000 (08:51 +0100)]
Fixed a minor cmake problem, occurring when DYNAMIC_CORE=ON and CMAKE_C_FLAGS was empty
Martin Kroeker [Fri, 25 Oct 2019 21:46:39 +0000 (23:46 +0200)]
Merge pull request #2293 from martin-frbg/pr2288
Add support for NetBSD by adding it to the existing xBSD conditionals
Martin Kroeker [Fri, 25 Oct 2019 21:07:00 +0000 (23:07 +0200)]
Remove special parameter set for obsolete IOS/ARMV8 workaround
Martin Kroeker [Fri, 25 Oct 2019 21:02:37 +0000 (23:02 +0200)]
Remove the IOS fallbacks to generic C kernels
Martin Kroeker [Fri, 25 Oct 2019 20:52:30 +0000 (22:52 +0200)]
Fix regex to parse -R options with and without whitespace
Both forms are seen on NetBSD (#2288)
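f_check itself is a Perl script and its actual regex is not reproduced here; the C function below is only a sketch of the two spellings such a parser has to accept, "-R/dir" and "-R /dir". The name `extract_rpath_dirs` is hypothetical.

```c
#include <string.h>

/* Collects the directory arguments of -R options from a whitespace-separated
   linker command line, accepting both "-R/dir" and "-R /dir". Stores up to
   max_dirs pointers (into the modified cmdline buffer) in dirs and returns
   how many were stored. Note: strtok modifies cmdline and is not reentrant. */
size_t extract_rpath_dirs(char *cmdline, char **dirs, size_t max_dirs) {
    size_t count = 0;
    int expect_dir = 0;  /* set when the previous token was a bare "-R" */

    for (char *tok = strtok(cmdline, " \t"); tok != NULL; tok = strtok(NULL, " \t")) {
        if (expect_dir) {
            /* "-R /dir": the directory is the token after "-R". */
            if (count < max_dirs) dirs[count++] = tok;
            expect_dir = 0;
        } else if (strcmp(tok, "-R") == 0) {
            expect_dir = 1;
        } else if (strncmp(tok, "-R", 2) == 0) {
            /* "-R/dir": the directory is attached to the option. */
            if (count < max_dirs) dirs[count++] = tok + 2;
        }
    }
    return count;
}
```

For example, "-L/usr/lib -R/usr/pkg/lib -R /opt/lib -lgfortran" yields the two directories "/usr/pkg/lib" and "/opt/lib", while the -L and -l options are ignored.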
Martin Kroeker [Fri, 25 Oct 2019 10:52:49 +0000 (12:52 +0200)]
Add NetBSD to the xBSD conditionals
Martin Kroeker [Fri, 25 Oct 2019 10:51:06 +0000 (12:51 +0200)]
Add NetBSD
Martin Kroeker [Fri, 25 Oct 2019 08:35:17 +0000 (10:35 +0200)]
Merge pull request #2292 from martin-frbg/g95fixes
Improve support for g95 and non-GNU ld
Martin Kroeker [Fri, 25 Oct 2019 08:34:50 +0000 (10:34 +0200)]
Merge pull request #2291 from martin-frbg/gensymbol
Fix netlib 3.7/3.8 function enumeration for linktest
Martin Kroeker [Fri, 25 Oct 2019 07:56:30 +0000 (09:56 +0200)]
Merge pull request #2282 from martin-frbg/issue2281
Optimize RPCC function on ARM64
Martin Kroeker [Thu, 24 Oct 2019 20:52:15 +0000 (22:52 +0200)]
Merge pull request #2290 from martin-frbg/cpuidfixes
Fixup x86 cpuid changes from #2283
Martin Kroeker [Thu, 24 Oct 2019 20:43:27 +0000 (22:43 +0200)]
Improve support for g95 and non-GNU ld
Auto-add "-fno-second-underscore" option to make LAPACKE compile (as it calls LAPACK functions that may have gotten a second underscore added otherwise). Also support -R for rpath when parsing compiler directives in f_check
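The "second underscore" problem comes from the f2c/g77-style naming convention: an external name that already contains an underscore gets two trailing underscores instead of one, unless the compiler is told otherwise with -fno-second-underscore. The C helper below is a hypothetical sketch of that mangling rule (`fortran_symbol` is not an OpenBLAS function), useful for seeing why a C caller expecting one trailing underscore fails to link.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Builds the linker symbol that a compiler using the f2c/g77-style
   convention would emit for a routine name: lower-case the name, append one
   underscore and, if second_underscore is enabled and the name already
   contains an underscore, append a second one.
   Returns 0 on success, -1 if out is too small. */
int fortran_symbol(const char *name, int second_underscore, char *out, size_t out_size) {
    size_t len = strlen(name);
    int extra = second_underscore && strchr(name, '_') != NULL;
    size_t needed = len + 1 + (size_t)extra + 1;  /* name + "_" + optional "_" + NUL */

    if (needed > out_size)
        return -1;
    for (size_t i = 0; i < len; i++)
        out[i] = (char)tolower((unsigned char)name[i]);
    out[len] = '_';
    if (extra)
        out[len + 1] = '_';
    out[len + 1 + (size_t)extra] = '\0';
    return 0;
}
```

A name such as "dgemm" maps to "dgemm_" either way, but "dsytrf_rk" becomes "dsytrf_rk__" with the second underscore enabled and "dsytrf_rk_" without it, which is why names with embedded underscores are the ones affected.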
Martin Kroeker [Thu, 24 Oct 2019 19:26:20 +0000 (21:26 +0200)]
Move most lapack 3.7/3.8 additions to the embedded_underscores list
to allow linktest to pass with a compiler that adds a second underscore to such names
Martin Kroeker [Thu, 24 Oct 2019 19:18:17 +0000 (21:18 +0200)]
Disable direct clock register access on IOS and Android
as I find conflicting information on accessibility from unprivileged processes
luzpaz [Thu, 24 Oct 2019 16:56:53 +0000 (12:56 -0400)]
Remove prototype of unused, unimplemented function (#2274)
* Fix source typo
Found via `codespell -q 3 -L amin,als,ba,dum,mone,nd,nto,orign -S Changelog.txt,./lapack*`
* Remove beta-thread function per request
Martin Kroeker [Thu, 24 Oct 2019 16:45:27 +0000 (18:45 +0200)]
Restore Goldmont ID and improve QEMU support
#2283 had inadvertently removed Goldmont+, and cpuid was reporting a mix of Core2 and Pentium2 for some QEMU configurations
Martin Kroeker [Thu, 24 Oct 2019 16:40:13 +0000 (18:40 +0200)]
Merge pull request #12 from xianyi/develop
resync with upstream
Martin Kroeker [Sun, 20 Oct 2019 10:44:19 +0000 (12:44 +0200)]
Merge pull request #2286 from wjc404/develop
AVX512 DGEMM kernel
wjc404 [Fri, 18 Oct 2019 19:54:44 +0000 (03:54 +0800)]
native support for icopy_4
90% of MKL single-thread performance.
wjc404 [Fri, 18 Oct 2019 07:00:17 +0000 (15:00 +0800)]
Update dgemm_kernel_8x8_skylakex.c
wjc404 [Fri, 18 Oct 2019 06:58:07 +0000 (14:58 +0800)]
some correction
wjc404 [Fri, 18 Oct 2019 02:47:31 +0000 (10:47 +0800)]
make further changes to icopy_8 easier
wjc404 [Wed, 16 Oct 2019 11:23:36 +0000 (19:23 +0800)]
Add files via upload
wjc404 [Wed, 16 Oct 2019 02:14:51 +0000 (10:14 +0800)]
Update dgemm_kernel_8x8_skylakex.c
wjc404 [Tue, 15 Oct 2019 19:20:08 +0000 (03:20 +0800)]
Update dgemm_kernel_8x8_skylakex.c
wjc404 [Tue, 15 Oct 2019 18:01:13 +0000 (02:01 +0800)]
Add files via upload
wjc404 [Tue, 15 Oct 2019 18:00:34 +0000 (02:00 +0800)]
Add files via upload
Martin Kroeker [Wed, 9 Oct 2019 20:06:09 +0000 (22:06 +0200)]
Merge pull request #2283 from martin-frbg/issue2176
Support QEMU virtual cpu in 64bit mode as CORE2 or BARCELONA
Martin Kroeker [Wed, 9 Oct 2019 16:24:13 +0000 (18:24 +0200)]
Support QEMU cpu calling itself 64bit AMD Athlon as well
Some QEMU instances pretend to be "AuthenticAMD" with the same family 6/model 6 even when running on an Intel host
(this could be related to the qemu or libvirt version and/or kvm availability). Also fix the define to depend on __x86_64__, which is set by the
compiler; the defines using __64BIT__ only work for getarch_2nd.
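The point about macros is that __x86_64__ is predefined by GCC and Clang whenever they target 64-bit x86, whereas a project macro such as __64BIT__ exists only where the build passes it explicitly. A minimal sketch follows; the enum, `qemu_fallback_core` and the Intel-to-CORE2 / AMD-to-BARCELONA mapping are hypothetical, inferred from the "CORE2 or BARCELONA" merge title rather than copied from the cpuid code.

```c
/* Hypothetical core identifiers for the sketch. */
enum core_type { CORE_UNKNOWN = 0, CORE_CORE2, CORE_BARCELONA };

/* Returns 1 when this translation unit is compiled for 64-bit x86.
   __x86_64__ comes from the compiler itself, so no build flag is needed. */
int built_for_x86_64(void) {
#if defined(__x86_64__)
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical fallback for a QEMU virtual CPU that reports family 6/model 6:
   treat it as CORE2 when it claims to be Intel and as BARCELONA when it
   claims to be AMD, but only in a 64-bit x86 build. */
enum core_type qemu_fallback_core(int reports_amd) {
#if defined(__x86_64__)
    return reports_amd ? CORE_BARCELONA : CORE_CORE2;
#else
    (void)reports_amd;
    return CORE_UNKNOWN;
#endif
}
```

Because the test is on a compiler-provided macro, the same source gives the right answer in every build stage, not just in one where an extra define happens to be set.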
Martin Kroeker [Tue, 8 Oct 2019 20:30:02 +0000 (22:30 +0200)]
Support QEMU virtual cpu as CORE2
qemu itself claims it is a 64bit P6, which does not exist in the wild.