OpenBLAS ChangeLog
====================================================================
+Version 0.3.7
+11-Aug-2019
+
+common:
+ * having the gmake special variables TARGET_ARCH or TARGET_MACH
+ defined no longer causes build failures in ctest or utest
+ * defining NO_AFFINITY or USE_TLS to 0 in gmake builds no longer
+ has the same effect as setting them to 1
+ * a new test program was added to allow checking the library for
+ thread safety
+ * a new option USE_LOCKING was added to ensure thread safety when
+ OpenBLAS itself is built without multithreading but will be
+ called from multiple threads.
+ * a build failure on Linux with glibc versions earlier than 2.5
+ was fixed
+ * a runtime error with CPU enumeration (and NO_AFFINITY not set)
+ on glibc 2.6 was fixed
+ * NO_AFFINITY was added to the CMAKE options (and defaults to being
+ active on Linux, as in the gmake builds)
+
+x86_64:
+ * the build-time logic for detection of AVX512 availability in
+ the processor and compiler was fixed
+ * gmake builds on OSX now set the internal name of the library to
+ libopenblas.0.dylib (consistent with CMAKE)
+ * the Haswell DGEMM kernel received a significant speedup through
+ improved prefetch and load instructions
+ * performance of DGEMM, DTRMM, DTRSM and ZDOT on Zen/Zen2 was markedly
+ increased by avoiding vpermpd instructions
+ * the SKYLAKEX (AVX512) DGEMM helper functions have now been disabled
+ to fix remaining errors in DGEMM, DSYMM and DTRMM
+
+POWER:
+ * added support for building on FreeBSD/powerpc64 and FreeBSD/ppc970
+ * added optimized kernels for POWER9 single and double precision complex BLAS3
+ * added optimized kernels for POWER9 SGEMM and STRMM
+
+ARMV7:
+ * fixed the softfp implementations of xAMAX and IxAMAX
+ * removed the predefined -march= flags on both ARMV5 and ARMV6 as
+ they were appropriate for only a subset of platforms
+
+====================================================================
+Version 0.3.6
+29-Apr-2019
+
+common:
+ * the build tools now check that a given cpu TARGET is actually valid
+ * the build-time check of system features (c_check) has been made
+ less dependent on particular perl features (this should mainly
+ benefit building on Windows)
+ * several problems with the ReLAPACK integration were fixed,
+ including INTERFACE64 support and building a shared library
+ * building with CMAKE on BSD systems was improved
+ * a non-absolute SUM function was added based on the
+ existing optimized code for ASUM
+ * CBLAS interfaces to the IxMIN and IxMAX functions were added
+ * a name clash between LAPACKE and BOOST headers was resolved
+ * CMAKE builds with OpenMP now include the appropriate getrf_parallel
+ kernels
+ * a crash on thread (key) deletion with the USE_TLS=1 memory management
+ option was fixed
+ * restored several earlier fixes, in particular for OpenMP performance,
+ building on BSD, and calling fork on CYGWIN, which had inadvertently
+ been dropped in the 0.3.3 rewrite of the memory management code.
+
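To illustrate how the non-absolute SUM differs from the existing ASUM, here is a minimal reference sketch in Python. The helper names `asum` and `plain_sum` are hypothetical and are not OpenBLAS entry points. The sketch only mirrors the mathematical definitions for real vectors (ASUM adds absolute values, SUM adds the signed values), not the optimized kernels.

```python
def asum(x):
    """Sum of absolute values, as defined for the real-valued BLAS ASUM functions."""
    return sum(abs(v) for v in x)


def plain_sum(x):
    """Sum of the signed values, i.e. a non-absolute counterpart of ASUM."""
    return sum(x)


x = [1.0, -2.0, 3.0, -4.0]
print(asum(x))       # absolute values are added
print(plain_sum(x))  # signs are kept, so terms can cancel
```

The two results agree only when no element of the vector is negative.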
+x86_64:
+ * the AVX512 DGEMM kernel has been disabled again due to unsolved problems
+ * building with old versions of MSVC was fixed
+ * it is now possible to build a static library on Windows with CMAKE
+ * accessing environment variables on CYGWIN at run time was fixed
+ * the CMAKE build system now recognizes 32bit userspace on 64bit hardware
+ * Intel "Denverton" atom and Hygon "Dhyana" zen CPUs are now autodetected
+ * building for DYNAMIC_ARCH with a DYNAMIC_LIST of targets is now supported
+ with CMAKE as well
+ * building for DYNAMIC_ARCH with GENERIC as the default target is now supported
+ * a buffer overflow in the SSE GEMM kernel for Intel Nano targets was fixed
+ * assembly bugs involving undeclared modification of input operands were fixed
+ in the AXPY, DOT, GEMV, GER, SCAL, SYMV and TRSM microkernels for Nehalem,
+ Sandybridge, Haswell, Bulldozer and Piledriver. These would typically cause
+ test failures or segfaults when compiled with recent versions of gcc from 8 onward.
+ * a similar bug was fixed in the blas_quickdivide code used to split workloads
+ in most functions
+ * a bug in the IxMIN implementation for the GENERIC target made it return the result of IxMAX
+ * fixed building on SkylakeX systems when either the compiler or the (emulated) operating
+ environment does not support AVX512
+ * improved GEMM performance on ZEN targets
+
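The IxMIN/IxMAX mix-up mentioned above can be pictured with a small reference sketch in Python. It assumes that IxMIN and IxMAX report the position of the smallest and largest element value (not absolute value), counted from 1 in the Fortran style, with 0 returned for an empty vector. The function names `imin_reference` and `imax_reference` are hypothetical and do not reproduce the OpenBLAS kernels.

```python
def imax_reference(x):
    """1-based index of the first largest element; 0 for an empty vector."""
    if not x:
        return 0
    best = 0
    for i, v in enumerate(x):
        if v > x[best]:
            best = i
    return best + 1


def imin_reference(x):
    """1-based index of the first smallest element; 0 for an empty vector."""
    if not x:
        return 0
    best = 0
    for i, v in enumerate(x):
        if v < x[best]:
            best = i
    return best + 1


x = [3.0, -7.0, 5.0, 1.0]
print(imin_reference(x), imax_reference(x))
```

A test vector whose minimum and maximum sit at different positions, as here, makes such a mix-up visible, because the two functions must then return different indices.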
+x86:
+ * build failures caused by the recently added checks for AVX512 were fixed
+ * an inline assembly bug involving undeclared modification of an input argument was
+ fixed in the blas_quickdivide code used to split workloads in most functions
+ * a bug in the IMIN implementation for the GENERIC target made it return the result of IMAX
+
+MIPS32:
+ * a bug in the IMIN implementation made it return the result of IMAX
+
+POWER:
+ * single precision BLAS1/2 functions have received optimized POWER8 kernels
+ * POWER9 is now a separate target, with an optimized DGEMM/DTRMM kernel
+ * building on PPC970 systems under OSX Leopard or Tiger is now supported
+ * out-of-bounds memory accesses in the gemm_beta microkernels were fixed
+ * building a shared library on AIX is now supported for POWER6
+ * DYNAMIC_ARCH support has been added for POWER6 and newer
+
+ARMv7:
+ * corrected xDOT behaviour with zero INC_X or INC_Y
+ * a bug in the IMIN implementation made it return the result of IMAX
+
+ARMv8:
+ * added support for HiSilicon TSV110 cpus
+ * the CMAKE build system now recognizes 32bit userspace on 64bit hardware
+ * cross-compilation with CMAKE now works again
+ * a bug in the IMIN implementation made it return the result of IMAX
+ * ARMV8 builds with the BINARY=32 option are now automatically handled as ARMV7
+
+IBM Z:
+ * optimized microkernels for single precision BLAS1/2 functions have been added
+ for both Z13 and Z14
+
+====================================================================
+Version 0.3.5
+31-Dec-2018
+
+common:
+ * loop unrolling in TRMV has been enabled again.
+ * A domain error in the thread workload distribution for SYRK
+ has been fixed.
+ * gmake builds will now automatically add -fPIC to the build
+ options if the platform requires it.
+ * a pthreads key leakage (and associated crash on dlclose) in
+ the USE_TLS codepath was fixed.
+ * building of the utest cases on systems that do not provide
+ an implementation of complex.h was fixed.
+
+x86_64:
+ * the SkylakeX code was changed to compile on OSX.
+ * unwanted application of the -march=skylake-avx512 option
+ to the common code parts of a DYNAMIC_ARCH build was fixed.
+ * improved performance of SGEMM for small workloads on Skylake X.
+ * performance of SGEMM and DGEMM was improved on Haswell.
+
+ARMV8:
+ * a configuration error that broke the CNRM2 kernel was corrected.
+ * compilation of the GEMM kernels with CMAKE was fixed.
+ * DYNAMIC_ARCH builds are now available with CMAKE as well.
+ * using CMAKE for cross-compilation to the new cpu TARGETs
+ introduced in 0.3.4 now works.
+
+POWER:
+ * a problem in cpu autodetection for AIX has been corrected.
+
+====================================================================
+Version 0.3.4
+02-Dec-2018
+
+common:
+ * the new, experimental thread-local memory allocation had
+ inadvertently been left enabled for gmake builds in 0.3.3
+ despite the announcement. It is now disabled by default, and
+ single-threaded builds will keep using the old allocator even
+ if the USE_TLS option is turned on.
+ * OpenBLAS will now provide enough buffer space for at least 50
+ threads by default.
+ * The output of openblas_get_config() now contains the version
+ number.
+ * A serious thread safety bug in GEMV operation with small M and
+ large N size has been fixed.
+ * The code will now automatically call blas_thread_init after a
+ fork if needed before handling a call to openblas_set_num_threads
+ * Accesses to parallelized level3 functions from multiple callers
+ are now serialized to avoid thread races (unless using OpenMP).
+ This should provide better performance than the known-threadsafe
+ (but non-default) USE_SIMPLE_THREADED_LEVEL3 option.
+ * When building LAPACK with gfortran, -frecursive is now (again)
+ enabled by default to ensure correct behaviour.
+ * The OpenBLAS version cblas.h now supports both CBLAS_ORDER and
+ CBLAS_LAYOUT as the name of the matrix row/column order option.
+ * Externally set LDFLAGS are now passed through to the final compile/link
+ steps to facilitate setting platform-specific linker flags.
+ * A potential race condition during the build of LAPACK (that would
+ usually manifest itself as a failure to build TESTING/MATGEN) has been
+ fixed.
+ * xHEMV has been changed to stay single-threaded for small input sizes
+ where the overhead of multithreading exceeds any possible gains
+ * CSWAP and ZSWAP have been limited to a single thread except on ARMV8 or
+ ThunderX hardware with sizable input.
+ * Linker flags for the PGI compiler have been updated
+ * Behaviour of AXPY with zero increments is now handled in the C interface,
+ correcting the result on at least Intel Atom.
+ * The result matrix from calling SGELSS with an all-zero input matrix is
+ now zeroed completely.
+
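As a reminder of what "zero increments" means for AXPY, the following Python sketch mirrors the strided loop of the reference (netlib) AXPY. It is an illustration under that assumption, not the OpenBLAS interface code, and the name `axpy_reference` is hypothetical.

```python
def axpy_reference(n, alpha, x, incx, y, incy):
    """Reference-style AXPY loop: y := alpha * x + y with explicit strides."""
    if n <= 0 or alpha == 0:
        return y
    # Negative strides start from the far end, as in the reference BLAS.
    ix = (1 - n) * incx if incx < 0 else 0
    iy = (1 - n) * incy if incy < 0 else 0
    for _ in range(n):
        y[iy] += alpha * x[ix]
        ix += incx
        iy += incy
    return y


# incx == 0: the same element of x is added to every element of y.
print(axpy_reference(3, 2.0, [5.0], 0, [1.0, 1.0, 1.0], 1))

# incy == 0: every product is accumulated into the same element of y.
print(axpy_reference(3, 2.0, [1.0, 2.0, 3.0], 1, [0.0], 0))
```

With a zero increment the loop keeps revisiting one element, so an optimized kernel that assumes distinct elements can produce a different result from this loop.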
+x86_64:
+ * Autodetection of AMD Ryzen2 has been fixed (again).
+ * CMAKE builds now support labeling of an INTERFACE64=1 build of
+ the library with the _64 suffix.
+ * AVX512 version of DGEMM has been added and the AVX512 SGEMM kernel
+ has been sped up by rewriting with C intrinsics
+ * Fixed compilation on RHEL5/CENTOS5 (issue with typename __WAIT_STATUS)
+
+POWER:
+ * added support for building on AIX (with gcc and GNU tools from AIX Toolbox).
+ * CPU type detection has been implemented for AIX.
+ * CPU type detection has been fixed for NETBSD.
+
+MIPS64:
+ * AXPY on LOONGSON3A has been corrected to pass "zero increment" utest.
+ * DSDOT on LOONGSON3A has been fixed.
+ * the SGEMM microkernel has been hardened against potential data loss.
+
+ARMV8:
+ * DYNAMIC_ARCH support is now available for 64bit ARM
+ * cross-compiling for ARMV8 under iOS now works.
+ * cpu-specific code has been rearranged to make better use of both
+ hardware commonalities and model-specific compiler optimizations.
+ * XGENE1 has been removed as a TARGET, superseded by the improved generic
+ ARMV8 support.
+
+ARMV7:
+ * Older assembly mnemonics have been converted to UAL form to allow
+ building with clang 7.0
+ * Cross compiling LAPACKE for Android has been fixed again (broken by
+ update to LAPACK 3.7.0 some while ago).
+
+====================================================================
+Version 0.3.3
+31-Aug-2018
+
+common:
+ * thread memory allocation has been switched back to the method
+ used before version 0.3.1 due to unexpected problems caused by
+ the new code under some circumstances. A new compile-time option
+ USE_TLS has been added to enable the new code, and it is hoped
+ that this can become the default again in the next version.
+ * LAPACK PR272 has been integrated, which fixes spurious errors
+ in DSYEVR and related functions caused by missing conversion
+ from ILAENV to ILAENV_2STAGE in several _2stage routines.
+ * the cmake-generated OpenBLASConfig.cmake now uses correct case
+ for the name of the library
+ * added support for Haiku OS
+
+x86_64:
+ * added AVX512 implementations of SDOT, DDOT, SAXPY, DAXPY,
+ DSCAL, DGEMVN and DSYMVL
+ * added a workaround for a cygwin issue that prevented compilation
+ of AVX512 code
+
+IBM Z:
+ * added autodetection of Z14
+ * fixed TRMM errors in the generic target
+
+====================================================================
+Version 0.3.2
+30-Jul-2018
+
+common:
+ * fixes for regressions caused by the rewrite of the thread
+ initialization code in 0.3.1
+
+POWER:
+ * fixed cpu autodetection for the BSDs
+
+MIPS64:
+ * fixed utest errors in AXPY, DSDOT, ROT and SWAP
+
+x86_64:
+ * added autodetection of AMD Ryzen 2
+ * fixed build with older versions of MSVC
+
+====================================================================
+Version 0.3.1
+01-Jul-2018
+
+common:
+ * rewritten thread initialization code with significantly reduced overhead
+ * added CBLAS interfaces to the IxAMIN BLAS extension functions
+ * fixed the lapack-test target
+ * CMAKE builds now create an OpenBLASConfig.cmake file
+ * ZAXPY now uses a single thread for small input sizes
+ * the LAPACK code was updated from Reference-LAPACK/lapack#253
+ (fixing LAPACKE interfaces to Aasen's functions)
+
+POWER:
+ * corrected CROT and ZROT behaviour with zero INC_X
+
+ARMV7:
+ * corrected xDOT behaviour with zero INC_X or INC_Y
+
+x86_64:
+ * retired some older targets of DYNAMIC_ARCH builds to a new option DYNAMIC_OLDER,
+ this affects PENRYN,DUNNINGTON,OPTERON,OPTERON_SSE3,BOBCAT,ATOM and NANO
+ (which will still be supported via the slower PRESCOTT kernels when this option is not set)
+ * added an option DYNAMIC_LIST that (used in conjunction with DYNAMIC_ARCH) allows
+ specifying the list of x86_64 targets to include. Any target not on the list will be supported
+ by the Sandybridge or Nehalem kernels if available, or by Prescott.
+ * improved SWITCH_RATIO on Haswell for increased GEMM throughput
+ * added initial support for Intel Skylake X, including an AVX512 SGEMM kernel
+ * added autodetection of Intel Cannon Lake series as Skylake X
+ * added a default L2 cache size for hypervisors that return zero here (Chromebook)
+ * fixed a name clash with recent Windows10 headers that broke the build with (at least)
+ recent mingw from MSYS2
+ * fixed a link error in mixed clang/gfortran builds with OpenMP
+ * updated the OSX deployment target to 10.8
+ * switched on parallel make for builds on MS Windows by default
+
+x86:
+ * fixed SSWAP and DSWAP behaviour with zero INC_X and INC_Y
+
+====================================================================
+Version 0.3.0
+23-May-2018
+
+common:
+ * fixed some more thread race and locking bugs
+ * added preliminary support for calling an OpenMP build of the library from multiple threads
+ * removed performance impact of thread locks added in 0.2.20 on OpenMP code
+ * general code cleanup
+ * optimized DSDOT implementation
+ * improved thread distribution for GEMM
+ * corrected IMATCOPY/OMATCOPY implementation
+ * fixed out-of-bounds accesses in the multithreaded xBMV/xPMV and SYMV implementations
+ * cmake build improvements
+ * pkgconfig file now contains build options
+ * openblas_get_config() now reports USE_OPENMP and NUM_THREADS settings used for the build
+ * corrections and improvements for systems with more than 64 cpus
+ * LAPACK code updated to 3.8.0 including later fixes
+ * added ReLAPACK, a recursive implementation of several LAPACK functions
+ * Rewrote ROTMG to handle cases that the netlib code failed to address
+ * Disabled (broken) multithreading code for xTRMV
+ * corrected prototypes of complex CBLAS functions to make our cblas.h match the generally accepted standard
+ * shared memory access failures on startup are now handled more gracefully
+ * restored utests from earlier releases (and made them pass on all affected systems)
+
+SPARC:
+ * several fixes for cpu autodetection
+
+POWER:
+ * corrected vector register overwriting in several Power8 kernels
+ * optimized additional BLAS functions
+
+ARM:
+ * added support for CortexA53 and A72
+ * added autodetection for ThunderX2T99
+ * made most optimized kernels the default for generic ARMv8 targets
+
+x86_64:
+ * parallelized DDOT kernel for Haswell
+ * changed alignment directives in assembly kernels to boost performance on OSX
+ * fixed register handling in the GEMV microkernels (bug exposed by gcc7)
+ * added support for building on OpenBSD and Dragonfly
+ * updated compiler options to work with Intel release 2018
+ * support fully optimized build with clang/flang on Microsoft Windows
+ * fixed building on AIX
+
+IBM Z:
+ * added optimized BLAS 1/2 functions
+
+MIPS:
+ * fixed cpu autodetection helper code
+ * added mips32 1004K cpu (Mediatek MT7621 and similar SoC)
+ * added mips64 I6500 cpu
+
+====================================================================
+Version 0.2.20
+24-Jul-2017
+
+common:
+ * Improved CMake support
+ * Fixed several thread race and locking bugs
+ * Fixed default LAPACK optimization level
+ * Updated LAPACK to 3.7.0
+ * Added ReLAPACK (https://github.com/HPAC/ReLAPACK, make BUILD_RELAPACK=1)
+
+POWER:
+ * Optimizations for Power9
+ * Fixed several Power8 assembly bugs
+
+ARM:
+ * New optimized Vulcan and ThunderX2T99 targets
+ * Support for ARMV7 SOFT_FP ABI (make ARM_SOFTFP_ABI=1)
+ * Detect all cpu cores including offline ones
+ * Fix compilation with CLANG
+ * Support building a shared library for Android
+
+MIPS:
+ * Fixed several threading issues
+ * Fix compilation with CLANG
+
+x86_64:
+ * Detect Intel Bay Trail and Apollo Lake
+ * Detect Intel Sky Lake and Kaby Lake
+ * Detect Intel Knights Landing
+ * Detect AMD A8, A10, A12 and Ryzen
+ * Support 64bit builds with Visual Studio
+ * Fix building with Intel and PGI compilers
+ * Fix building with MINGW and TDM-GCC
+ * Fix cmake builds for Haswell and related cpus
+ * Fix building for Sandybridge with CLANG 3.9
+ * Add support for the FLANG compiler
+
+IBM Z:
+ * New target z13 with BLAS3 optimizations
+
+====================================================================
+Version 0.2.19
+1-Sep-2016
+common:
+ * Improved cross compiling.
+ * Fix the bug on musl libc.
+
+POWER:
+ * Optimize BLAS on Power8
+ * Fixed Julia+OpenBLAS bugs on Power8
+
+MIPS:
+ * Optimize BLAS on MIPS P5600 and I6400 (Thanks, Shivraj Patil, Kaustubh Raste)
+
+ARM:
+ * Improved on ARM Cortex-A57. (Thanks, Ashwin Sekhar T K)
+
+
+====================================================================
+Version 0.2.18
+12-Apr-2016
+common:
+ * If the MAKE_NB_JOBS flag is set to a value less than or equal to zero,
+ make will run without -j.
+
+x86/x86_64:
+ * Support building Visual Studio static library. (#813, Thanks, theoractice)
+ * Fix bugs to pass buildbot CI tests (http://build.openblas.net)
+
+ARM:
+ * Provide DGEMM 8x4 kernel for Cortex-A57 (Thanks, Ashwin Sekhar T K)
+
+POWER:
+ * Optimize S and C BLAS3 on Power8
+ * Optimize BLAS2/1 on Power8
+
+====================================================================
Version 0.2.17
20-Mar-2016
common: