Martin Kroeker [Wed, 2 Sep 2020 20:48:49 +0000 (22:48 +0200)]
Add Apple Silicon
Martin Kroeker [Wed, 2 Sep 2020 20:47:38 +0000 (22:47 +0200)]
Detect AppleSilicon cpu on OSX
Martin Kroeker [Wed, 2 Sep 2020 20:16:41 +0000 (22:16 +0200)]
Merge pull request #80 from xianyi/develop
rebase
Martin Kroeker [Wed, 2 Sep 2020 14:56:01 +0000 (16:56 +0200)]
Merge pull request #2815 from mhillenibm/clang_s390x
Fix build with clang on s390x
Marius Hillenbrand [Tue, 1 Sep 2020 14:16:53 +0000 (16:16 +0200)]
s390x: enable S/DGEMM block with explicit loop unrolling + interleaving with clang
The code for SGEMM 16x4 and DGEMM 8x4 blocks on z14 and z15 uses
explicit unrolling and interleaving to improve performance. The code
employs an empty inline asm statement with operands that constrain the
compiler's instruction scheduling and thereby enforce proper overlapping
of load and compute phases. Fix an ifdef to apply that for clang builds,
as well.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
Marius Hillenbrand [Tue, 1 Sep 2020 13:09:32 +0000 (15:09 +0200)]
s390x: allow clang to emit fused multiply-adds (replicates gcc's default behavior)
gcc's default setting for floating-point expression contraction is
"fast", which allows the compiler to emit fused multiply adds instead of
separate multiplies and adds (amongst others). Fused multiply-adds,
which assembly kernels typically apply, also bring a significant
performance advantage to the C implementation for matrix-matrix
multiplication on s390x. To enable that performance advantage for builds
with clang, add -ffp-contract=fast to the compiler options.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
Marius Hillenbrand [Tue, 1 Sep 2020 10:08:05 +0000 (12:08 +0200)]
s390x: avoid variable-length arrays in struct for asm operands
... since it is not required and clang does not support that gcc
extension. Instead, use a variable-length array directly for these
operands.
Note that, while the actual inline assembly code does not directly use
these memory operands, they serve to inform the compiler that it cannot
reorder reads or writes to/from the input and output data across the
inline asm statements.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
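The memory-operand technique described in the commit above can be illustrated with a minimal C sketch. This is not the s390x kernel code: the names fence_array and scale_then_fence are hypothetical, and the asm body is left empty so that the snippet is not tied to one instruction set. The cast to a pointer-to-variable-length-array follows the idiom given in the GCC inline assembly documentation.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not taken from OpenBLAS): declare to the compiler
 * that the asm statement may read and write all n doubles at data.
 * The asm body is empty, so no instruction is emitted; only the operand
 * matters. Because the operand is "+m", stores to data[0..n) issued
 * before this point cannot be moved past it, and later reads cannot be
 * moved before it.
 */
static void fence_array(double *data, size_t n)
{
    if (n == 0)
        return; /* a zero-length array type is not valid C */
    __asm__ __volatile__("" : "+m"(*(double (*)[n])data));
}

/* Hypothetical caller: scale the array, then fence it. */
void scale_then_fence(double *data, size_t n, double alpha)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= alpha;
    fence_array(data, n);
}
```

The array operand is passed directly rather than wrapped in a struct, which is the form the commit message says clang accepts.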
Marius Hillenbrand [Tue, 1 Sep 2020 10:04:28 +0000 (12:04 +0200)]
s390x: avoid inline assembly for vector loads for clang
... since clang does not support the instruction format for inline
assembly, and it is also not required for current versions of clang.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
Marius Hillenbrand [Tue, 1 Sep 2020 09:58:48 +0000 (11:58 +0200)]
s390x: replace nop with "nop 0" in inline assembly
... as a bandaid for building with clang until LLVM's internal assembler
supports nops without operand.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
Marius Hillenbrand [Tue, 1 Sep 2020 11:59:06 +0000 (13:59 +0200)]
s390x: use "lghi" for immediate values to fix build with clang
Some of the kernels written in assembly utilize a "load address"
instruction for loading an immediate value into a register. That is
unnecessarily complex, and LLVM's assembler does not understand that
specific syntax. Thus, replace it with the appropriate "load immediate"
instruction, which is also clearer to read.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
Martin Kroeker [Tue, 1 Sep 2020 21:39:46 +0000 (23:39 +0200)]
Merge pull request #2813 from martin-frbg/issue2804-2
Fix for c_check misinterpreting arm64 in uname -m output as armv7
Martin Kroeker [Tue, 1 Sep 2020 17:54:08 +0000 (19:54 +0200)]
Fix c_check misinterpreting arm64 in uname output to mean armv7
additional fix for upcoming OSX on ARM64 related to #2804, as suggested by fxcoudert in #2805
Martin Kroeker [Tue, 1 Sep 2020 15:19:14 +0000 (17:19 +0200)]
Merge pull request #2811 from martin-frbg/issue2806
Make NO_AVX512 option override the AVX512 compile test in CMAKE builds as well
Martin Kroeker [Tue, 1 Sep 2020 14:04:03 +0000 (16:04 +0200)]
Merge pull request #2797 from martin-frbg/relafixes1
ReLAPACK fixes
Martin Kroeker [Tue, 1 Sep 2020 10:03:53 +0000 (12:03 +0200)]
Merge pull request #79 from xianyi/develop
rebase
Martin Kroeker [Tue, 1 Sep 2020 08:44:48 +0000 (10:44 +0200)]
Fix misnaming of LAPACK_?ggsvp function prototypes as LAPACKE_ (#2808)
* Fix misnaming of LAPACK_?ggsvp and ?ggsvd function prototypes as LAPACKE_
* Drop the LAPACKE matrix_layout parameter from the argument lists, change ints to pointers and add missing work arguments.
Martin Kroeker [Mon, 31 Aug 2020 21:44:56 +0000 (23:44 +0200)]
Merge pull request #2807 from martin-frbg/issue2804
Work around ARMV8 build-time cpu detection problems on non-Linux systems
Martin Kroeker [Mon, 31 Aug 2020 18:03:21 +0000 (20:03 +0200)]
Report cpu as ARMV8 instead of just giving up on non-Linux hosts
Martin Kroeker [Mon, 31 Aug 2020 18:02:08 +0000 (20:02 +0200)]
Handle Apple labeling armv8 as arm64 rather than aarch64
Martin Kroeker [Fri, 28 Aug 2020 20:52:11 +0000 (22:52 +0200)]
Merge pull request #2799 from RajalakshmiSR/p10_ger
POWER10: Avoid setting accumulators to zero in gemm kernels
Rajalakshmi Srinivasaraghavan [Fri, 28 Aug 2020 15:42:54 +0000 (10:42 -0500)]
POWER10: Avoid setting accumulators to zero in gemm kernels
For the first iteration, it is better to use the xvf*ger builtins instead
of xvf*gerpp, which avoids having to set the accumulators to zero first.
This saves a few instructions.
Martin Kroeker [Fri, 28 Aug 2020 06:30:59 +0000 (08:30 +0200)]
Merge pull request #2798 from kadler/aix-cpuid
Fix compile error on AIX cpuid detection
Kevin Adler [Fri, 28 Aug 2020 04:08:33 +0000 (23:08 -0500)]
Fix compile error on AIX cpuid detection
In 589c74a the cpuid detection was changed to use systemcfg, but a copy
and paste error was introduced during some refactoring that caused
POWER7 detection to reference CPUTYPE_POWER7 (which doesn't exist)
instead of CPUTYPE_POWER6.
Martin Kroeker [Thu, 27 Aug 2020 09:25:18 +0000 (11:25 +0200)]
Add early returns and fix sign errors in workspace calculations
Martin Kroeker [Thu, 27 Aug 2020 09:22:50 +0000 (11:22 +0200)]
Add early returns
Martin Kroeker [Thu, 27 Aug 2020 09:20:31 +0000 (11:20 +0200)]
Add early returns
Martin Kroeker [Thu, 27 Aug 2020 09:15:12 +0000 (11:15 +0200)]
Add early returns
Martin Kroeker [Thu, 27 Aug 2020 08:59:08 +0000 (10:59 +0200)]
Make ILAENV and xGETRF2 functions available
Martin Kroeker [Mon, 24 Aug 2020 18:18:09 +0000 (20:18 +0200)]
Merge pull request #2775 from Guobing-Chen/Fix_OMP_threads_specify
Fix OMP num specify issue
Martin Kroeker [Mon, 24 Aug 2020 06:03:39 +0000 (08:03 +0200)]
Merge pull request #2792 from pkubaj/patch-1
Add aliases for armv6, armv7
pkubaj [Sun, 23 Aug 2020 18:50:19 +0000 (18:50 +0000)]
Add aliases for armv6, armv7
FreeBSD uses those names for 32-bit ARM variants.
Chen, Guobing [Tue, 11 Aug 2020 19:28:25 +0000 (03:28 +0800)]
Fix OMP num specify issue
In the current code, no matter how many threads are specified, all
available CPUs are used when invoking OMP. This leads to very bad
performance when the workload is small and the number of available CPUs
is large, because a lot of time is wasted on inter-thread sync. Fix this
by actually using the thread count passed in the variable 'num' by the
calling API.
Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
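A minimal sketch of the idea behind this fix, assuming a plain OpenMP loop: the requested thread count is passed to the num_threads clause instead of letting the runtime use every available CPU. The function scale_with_threads is hypothetical and is not the OpenBLAS driver code. If the file is compiled without OpenMP support, the pragma is ignored and the loop runs serially.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not the OpenBLAS thread server): scale x[0..n)
 * by alpha using at most `num` OpenMP threads.
 */
void scale_with_threads(double *x, size_t n, double alpha, int num)
{
    long i;

    if (num < 1)
        num = 1; /* guard against a nonsensical request */

    /* Honor the caller's thread count rather than the runtime default. */
    #pragma omp parallel for num_threads(num)
    for (i = 0; i < (long)n; i++)
        x[i] *= alpha;
}
```

For a small workload, a call such as scale_with_threads(x, n, 2.0, 2) limits the parallel region to two threads even on a machine with many cores.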
Martin Kroeker [Sun, 23 Aug 2020 17:33:03 +0000 (19:33 +0200)]
Merge pull request #2791 from martin-frbg/issue2787
Fix crashes in parallelized x86_64 ZDOT particularly on Windows
Martin Kroeker [Sun, 23 Aug 2020 13:08:16 +0000 (15:08 +0200)]
Fix missing dummy parameter (imag part of alpha) of zdot_thread_function
Martin Kroeker [Sun, 23 Aug 2020 12:42:35 +0000 (14:42 +0200)]
Merge pull request #2790 from martin-frbg/issue2789
Add OpenMP dependency to pkgconfig information if needed
Martin Kroeker [Sat, 22 Aug 2020 11:55:18 +0000 (13:55 +0200)]
Add OpenMP dependency to pkgconfig file if needed
Martin Kroeker [Sat, 22 Aug 2020 11:53:44 +0000 (13:53 +0200)]
Add OpenMP dependency to pkgconfig file if needed
Martin Kroeker [Sat, 22 Aug 2020 11:52:29 +0000 (13:52 +0200)]
Merge pull request #78 from xianyi/develop
rebase
Martin Kroeker [Thu, 20 Aug 2020 17:54:29 +0000 (19:54 +0200)]
Merge pull request #2780 from Guobing-Chen/CPL_build_support
Enable COOPERLAKE build target
Martin Kroeker [Wed, 19 Aug 2020 20:51:10 +0000 (22:51 +0200)]
Update system.cmake
Martin Kroeker [Wed, 19 Aug 2020 20:30:19 +0000 (22:30 +0200)]
Update system.cmake
Martin Kroeker [Wed, 19 Aug 2020 18:48:39 +0000 (20:48 +0200)]
fallback from cooperlake to skylake if gcc<10
Martin Kroeker [Wed, 19 Aug 2020 15:44:23 +0000 (17:44 +0200)]
Typo fix
Martin Kroeker [Wed, 19 Aug 2020 15:22:12 +0000 (17:22 +0200)]
-march=cooperlake requires gcc10
Martin Kroeker [Wed, 19 Aug 2020 15:17:53 +0000 (17:17 +0200)]
-march=cooperlake requires gcc10
Martin Kroeker [Wed, 19 Aug 2020 14:36:55 +0000 (16:36 +0200)]
Fix typo
Martin Kroeker [Wed, 19 Aug 2020 14:10:15 +0000 (16:10 +0200)]
-march=cooperlake only available in gcc >= 10
Martin Kroeker [Wed, 19 Aug 2020 13:06:30 +0000 (15:06 +0200)]
make march=cooperlake option conditional on gcc >= 10.1
Martin Kroeker [Wed, 19 Aug 2020 12:51:09 +0000 (14:51 +0200)]
[WIP] Refactor the driver code for direct SGEMM (#2782)
Move "direct SGEMM" functionality out of the SkylakeX SGEMM kernel and make it available
(on x86_64 targets only for now) in DYNAMIC_ARCH builds
* Add sgemm_direct targets in the kernel Makefile.L3 and CMakeLists.txt
* Add direct_sgemm functions to the gotoblas struct in common_param.h
* Move sgemm_direct_performant helper to separate file
* Update gemm.c to use macros for sgemm_direct to support dynamic_arch naming via common_s.h
* (Conditionally) add sgemm_direct functions in setparam-ref.c
Martin Kroeker [Wed, 19 Aug 2020 12:42:58 +0000 (14:42 +0200)]
Merge pull request #2785 from albertziegenhagel/always-generate-pkg-config
Do not require pkg-config to generate the *.pc file
Albert Ziegenhagel [Tue, 18 Aug 2020 06:48:48 +0000 (08:48 +0200)]
Do not require pkg-config to generate the *.pc file
Generating the pkg-config file does not actually depend on pkg-config being available.
Martin Kroeker [Mon, 17 Aug 2020 17:06:13 +0000 (19:06 +0200)]
Merge pull request #2784 from martin-frbg/issue2783
Add fallback typedef for bfloat16 to openblas_config.h template
Martin Kroeker [Mon, 17 Aug 2020 13:32:14 +0000 (15:32 +0200)]
Add typedef for bfloat16 if needed
Martin Kroeker [Mon, 17 Aug 2020 13:28:15 +0000 (15:28 +0200)]
Merge pull request #77 from xianyi/develop
rebase
Martin Kroeker [Mon, 17 Aug 2020 13:20:41 +0000 (15:20 +0200)]
revert
Martin Kroeker [Mon, 17 Aug 2020 13:20:16 +0000 (15:20 +0200)]
revert
Martin Kroeker [Mon, 17 Aug 2020 13:19:40 +0000 (15:19 +0200)]
revert
Martin Kroeker [Sat, 15 Aug 2020 13:46:18 +0000 (15:46 +0200)]
Update .drone.yml
Martin Kroeker [Sat, 15 Aug 2020 12:46:26 +0000 (14:46 +0200)]
Update Makefile
Martin Kroeker [Sat, 15 Aug 2020 11:38:05 +0000 (13:38 +0200)]
Add simple MT sgemm precision test and INTERFACE64 build
Martin Kroeker [Sat, 15 Aug 2020 11:33:52 +0000 (13:33 +0200)]
Add simple sgemm precision test
Martin Kroeker [Sat, 15 Aug 2020 11:31:28 +0000 (13:31 +0200)]
Update gemm64.cpp
Martin Kroeker [Sat, 15 Aug 2020 11:30:29 +0000 (13:30 +0200)]
Add trivial gemm test for multithread consistency
Chen, Guobing [Wed, 12 Aug 2020 22:17:34 +0000 (06:17 +0800)]
Enable COOPERLAKE build target
Enable new build target platform -- COOPERLAKE. This target platform
supports all the SKYLAKEX supported ISAs + avx512bf16. So all the
SKYLAKEX specific kernels/drivers and related code are now extended
to be also active on COOPERLAKE. Besides, new BF16 related kernels
are active under this target.
Martin Kroeker [Wed, 12 Aug 2020 21:08:38 +0000 (23:08 +0200)]
Add a dedicated POWER9 build to the Travis CI (#2774)
* Add dedicated POWER9 build (using new syntax to ensure it runs as a P9-only containerized job rather than a VM that
might end up on P8 hardware half of the time)
* Bump gcc version for POWER9 build
Martin Kroeker [Tue, 11 Aug 2020 20:40:17 +0000 (22:40 +0200)]
Merge pull request #2765 from martin-frbg/issue2760
Add memory barrier to the PPC blas_lock implementation for Linux
Martin Kroeker [Tue, 11 Aug 2020 19:02:55 +0000 (21:02 +0200)]
Merge pull request #2773 from martin-frbg/issue2770
Fix Makefiles still mishandling NO_CBLAS=0 and NO_LAPACKE=0
Martin Kroeker [Tue, 11 Aug 2020 16:14:09 +0000 (18:14 +0200)]
Merge pull request #2772 from mhillenibm/s390x_gemm_tuning
s390x: GEMM tuning for z14
Martin Kroeker [Tue, 11 Aug 2020 11:40:40 +0000 (13:40 +0200)]
Fix mishandling of NO_CBLAS=0 and NO_LAPACKE=0
Martin Kroeker [Tue, 11 Aug 2020 11:27:19 +0000 (13:27 +0200)]
fix another source of NO_CBLAS=0 surprise
Martin Kroeker [Tue, 11 Aug 2020 11:25:12 +0000 (13:25 +0200)]
Merge pull request #76 from xianyi/develop
rebase
Marius Hillenbrand [Tue, 11 Aug 2020 10:55:59 +0000 (12:55 +0200)]
s390x/SGEMM: adjust default P and Q to multiples of M
We recently changed the register blocking for SGEMM on s390x to 16x4.
However, we did not adjust Q to a multiple of 16 and thus fell back to
the 8x4 kernel at each block's margin, without need. Adjust P and Q to
multiples of 16 to employ the faster 16x4 kernel for complete full-sized
blocks.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
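The arithmetic behind this adjustment can be sketched in a few lines of C. The helper names and the sample sizes mentioned below are hypothetical; they only illustrate why a block size that is not a multiple of the register blocking leaves rows for a narrower kernel.

```c
/*
 * Hypothetical helpers (not from OpenBLAS).
 * unroll is the number of rows the widest kernel handles per step,
 * for example 16 for a 16x4 register blocking. Both helpers assume
 * value >= 0 and unroll > 0.
 */

/* Largest multiple of unroll that does not exceed value. */
long round_down_to_multiple(long value, long unroll)
{
    return value - value % unroll;
}

/* Rows at the block margin that the widest kernel cannot cover. */
long leftover_rows(long rows, long unroll)
{
    return rows % unroll;
}
```

With an unroll of 16, a block of 328 rows would leave 8 rows for the fallback kernel, while a block of 320 rows leaves none.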
Marius Hillenbrand [Tue, 11 Aug 2020 10:55:53 +0000 (12:55 +0200)]
s390x: Factor out small block sizes for SGEMM/DGEMM on z14
For small register blockings that are too small to fill up vector
registers with column vectors, we currently use a generic code block.
Replace that with instantiations of the generic code as individual
functions, so that the compiler can optimize each one separately.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
Marius Hillenbrand [Tue, 11 Aug 2020 10:55:42 +0000 (12:55 +0200)]
s390x: Optimize SGEMM/DGEMM blocks for z14 with explicit loop unrolling/interleaving
Improve performance of SGEMM and DGEMM on z14 and z15 by unrolling and
interleaving the inner loop of the SGEMM 16x4 and DGEMM 8x4 blocks.
Specifically, we explicitly interleave vector register loads and
computation of two iterations.
Note that this change only adds one C function, since SGEMM 16x4 and
DGEMM 8x4 actually map to the same C code: they both hold intermediate
results in a 4x4 grid of vector registers, and the C implementation is
built around that.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
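The interleaving of loads and computation across two iterations can be sketched in portable C with a dot product. This is only an illustration of the general pattern under simplifying assumptions (scalar doubles instead of vector registers); dot_interleaved is a hypothetical name and the code is not the z14 GEMM block.

```c
#include <stddef.h>

/*
 * Hypothetical sketch: dot product unrolled by two, with the loads for
 * the next pair of elements issued before the multiply-adds of the
 * current pair, so that loading and computing can overlap.
 */
double dot_interleaved(const double *a, const double *b, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0;
    size_t i = 0;

    if (n >= 2) {
        /* Prologue: load the first pair. */
        double a0 = a[0], b0 = b[0];
        double a1 = a[1], b1 = b[1];

        for (i = 2; i + 2 <= n; i += 2) {
            /* Load phase for the next pair ... */
            double na0 = a[i], nb0 = b[i];
            double na1 = a[i + 1], nb1 = b[i + 1];
            /* ... interleaved with the compute phase of the current one. */
            acc0 += a0 * b0;
            acc1 += a1 * b1;
            a0 = na0; b0 = nb0;
            a1 = na1; b1 = nb1;
        }

        /* Epilogue: compute on the last loaded pair. */
        acc0 += a0 * b0;
        acc1 += a1 * b1;
    }

    /* Handle one trailing element when n is odd. */
    if (i < n)
        acc0 += a[i] * b[i];

    return acc0 + acc1;
}
```

A compiler is free to reschedule this code, which is why the commits above additionally constrain the scheduling with inline asm operands.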
Martin Kroeker [Mon, 10 Aug 2020 11:27:51 +0000 (13:27 +0200)]
Merge pull request #2764 from martin-frbg/lapacktests
Fix array overruns in the LIN part of the LAPACK testsuite
Martin Kroeker [Sun, 9 Aug 2020 17:17:04 +0000 (19:17 +0200)]
Add memory barrier to the blas_lock implementation for Linux
as recommended by cparrott73 in #2760
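For readers unfamiliar with why a lock needs a memory barrier, here is a minimal sketch using C11 atomics. It is not the PPC blas_lock implementation, which this log does not show; the names spin_acquire and spin_release are hypothetical. The acquire and release orderings play the role of the barrier: they keep accesses to the protected data from being reordered outside the locked region.

```c
#include <stdatomic.h>

/*
 * Hypothetical spin lock sketch (not the OpenBLAS blas_lock code).
 * Acquire ordering on the successful test-and-set keeps later reads and
 * writes from moving before the lock is taken.
 */
void spin_acquire(atomic_flag *flag)
{
    while (atomic_flag_test_and_set_explicit(flag, memory_order_acquire)) {
        /* busy-wait until the holder clears the flag */
    }
}

/*
 * Release ordering makes all writes done while holding the lock visible
 * before the flag is seen as clear by the next acquirer.
 */
void spin_release(atomic_flag *flag)
{
    atomic_flag_clear_explicit(flag, memory_order_release);
}
```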
Martin Kroeker [Sun, 9 Aug 2020 11:02:27 +0000 (13:02 +0200)]
Fix use of unallocated array in workspace query and wrong type of argument to xSCAL
Martin Kroeker [Sun, 9 Aug 2020 10:59:20 +0000 (12:59 +0200)]
Expand TAU array as SGEMQR/DGEMQR read elements 2 and 3
Martin Kroeker [Sat, 8 Aug 2020 16:05:20 +0000 (18:05 +0200)]
Create Jenkinsfile for OSUOSL PowerCI
Martin Kroeker [Sat, 8 Aug 2020 10:20:04 +0000 (12:20 +0200)]
Merge pull request #2761 from RajalakshmiSR/Makefile_err
Remove extra symbol in Makefile
Rajalakshmi Srinivasaraghavan [Fri, 7 Aug 2020 20:27:44 +0000 (15:27 -0500)]
Remove extra symbol in Makefile
While trying out different unroll values, noted that
make failed due to this extra symbol.
Martin Kroeker [Mon, 3 Aug 2020 21:30:26 +0000 (23:30 +0200)]
Merge pull request #2758 from martin-frbg/undef_shift
Fix GCC ubsan warnings in x86_64 complex dot and gemv_t kernels
Martin Kroeker [Sun, 2 Aug 2020 21:05:21 +0000 (23:05 +0200)]
Merge pull request #2757 from martin-frbg/cmake64
Fix lapack-tests linking to a suffixed libopenblas in cmake builds
Martin Kroeker [Sun, 2 Aug 2020 16:29:56 +0000 (18:29 +0200)]
Multiply by 2 instead of left-shifting a potentially negative number
fixes GCC ubsan warning in the BLAS tests
Martin Kroeker [Sun, 2 Aug 2020 16:27:40 +0000 (18:27 +0200)]
Multiply instead of doing a left shift of a potentially negative number
fixes GCC ubsan report in the BLAS tests
Martin Kroeker [Sun, 2 Aug 2020 16:25:09 +0000 (18:25 +0200)]
Multiply by two instead of left-shifting one place
fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests
Martin Kroeker [Sun, 2 Aug 2020 16:22:31 +0000 (18:22 +0200)]
Multiply by two rather than left shift by one place
fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests
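The reason for these four changes can be shown with a short C sketch. In C, left-shifting a negative signed value is undefined behavior, which is what ubsan reports, while multiplying by two is well defined as long as the result does not overflow. The helper name complex_stride is hypothetical and assumes the value being doubled is a (possibly negative) vector increment.

```c
/*
 * Hypothetical helper (not from OpenBLAS): complex vectors store
 * (real, imaginary) pairs, so an increment counted in complex elements
 * corresponds to twice as many scalars. The increment may be negative.
 */
long complex_stride(long inc)
{
    /* inc << 1 would be undefined for inc < 0; inc * 2 is not. */
    return inc * 2;
}
```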
Martin Kroeker [Sun, 2 Aug 2020 15:58:33 +0000 (17:58 +0200)]
Apply current library name suffix
Martin Kroeker [Sun, 2 Aug 2020 15:57:12 +0000 (17:57 +0200)]
Apply library name suffix to openblas if any
Martin Kroeker [Sun, 2 Aug 2020 15:50:06 +0000 (17:50 +0200)]
Merge pull request #75 from xianyi/develop
rebase
Martin Kroeker [Sun, 2 Aug 2020 13:32:46 +0000 (15:32 +0200)]
Merge pull request #2753 from martin-frbg/issue2751
Add SYMBOLPREFIX and/or SYMBOLSUFFIX to cblas prototypes
Martin Kroeker [Sun, 2 Aug 2020 09:20:08 +0000 (11:20 +0200)]
Add SYMBOLPREFIX and/or -SUFFIX to cblas.h if needed
Martin Kroeker [Sat, 1 Aug 2020 15:06:03 +0000 (17:06 +0200)]
Improve substitution rules for SYMBOLPREFIX and -SUFFIX addition
Martin Kroeker [Sat, 1 Aug 2020 13:19:02 +0000 (15:19 +0200)]
Merge pull request #2756 from martin-frbg/issue2755
Protect against inadvertent activation of USE_CUDA
Martin Kroeker [Sat, 1 Aug 2020 10:31:39 +0000 (12:31 +0200)]
Protect against inadvertent activation of USE_CUDA
Martin Kroeker [Fri, 31 Jul 2020 14:03:33 +0000 (16:03 +0200)]
Add SYMBOLPREFIX and/or SYMBOLSUFFIX to cblas prototypes
Martin Kroeker [Fri, 31 Jul 2020 10:52:24 +0000 (12:52 +0200)]
Merge pull request #2752 from kadler/cpuid_aix
Use systemcfg APIs for CPU detection on AIX
Kevin Adler [Fri, 31 Jul 2020 01:52:16 +0000 (20:52 -0500)]
Use systemcfg APIs for CPU detection on AIX
AIX libc already provides ready access to an integer that contains a bit
identifying the CPU it's running on, so there's no need to call a
program and grep its output. Additionally, prtconf is not available in
the PASE runtime, which provides an AIX emulation layer on the IBM i
operating system.
The AIX systemcfg.h also provides macro definitions like POWER_8,
POWER_9, etc for all the bits defining the CPUs as well as macros like
__power_8(), __power_9_andup() that return booleans, but I did not use
them. Since these macros depend on the level of the OS on which the code
is built, they may not be defined, so the associated hex literals are
used directly instead.
Martin Kroeker [Thu, 30 Jul 2020 09:40:52 +0000 (11:40 +0200)]
Fix inadvertent version number reversal to 0.3.9.dev caused by #2710
Martin Kroeker [Thu, 30 Jul 2020 09:35:53 +0000 (11:35 +0200)]
Merge pull request #2749 from martin-frbg/make_ppc
Reorganize OpenMP build options for POWER and allow compiling for POWER9 with old gcc