[X86] Add support for "light" AVX
author Ilya Tokar <tokarip@google.com>
Thu, 15 Dec 2022 20:00:27 +0000 (15:00 -0500)
committer Ilya Tokar <tokarip@google.com>
Tue, 24 Jan 2023 22:02:46 +0000 (17:02 -0500)
commit d7043e8c41bb74a31c9790616c1536596814567b
tree 0b8e78e21ee1febe1c437b402640a7bb94a92402
parent 7e89420116c91647db340a4457b8ad0d60be1d5e

AVX/AVX512 instructions may cause a frequency drop on e.g. Skylake.
The magnitude of the frequency/performance drop depends on the
instruction (multiplication vs. load/store) and the vector width.
Currently, users who want to avoid this drop can specify
-mprefer-vector-width=128. However, this also prevents generation of
256-bit wide instructions that have no associated frequency drop
(mainly loads/stores).

Add a tuning flag that allows generation of 256-bit AVX loads/stores,
even when -mprefer-vector-width=128 is set, to speed up memcpy & co.
Verified that running a memcpy loop on all cores has no frequency
impact and leaves the CORE_POWER:LVL[12]_TURBO_LICENSE perf counters at
zero.
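
As a rough illustration of the kind of code this targets, the sketch
below is a constant-size copy that the compiler may expand inline into
vector loads and stores. The Block type and copy_block name are
hypothetical and do not come from this patch. Assuming a build along
the lines of clang++ -O2 -mavx2 -mprefer-vector-width=128, whether
256-bit loads/stores are actually emitted depends on the tuning of the
selected target CPU.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical fixed-size block; the name and the 256-byte size are
// chosen only for illustration.
struct Block {
  unsigned char bytes[256];
};

// A memcpy with a compile-time-constant size. The compiler is free to
// lower this to a short sequence of vector loads and stores rather than
// a call to the library memcpy.
void copy_block(Block *dst, const Block *src) {
  std::memcpy(dst, src, sizeof(Block));
}
```

Inspecting the generated assembly (e.g. with -S) for ymm versus xmm
register moves is one way to see which width was chosen.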

Makes copying memory faster, e.g.:
BM_memcpy_aligned/256 80.7GB/s ± 3% 96.3GB/s ± 9% +19.33% (p=0.000 n=9+9)
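
For readers who want to reproduce a similar measurement, here is a
minimal sketch of a copy-throughput loop. It is not the BM_memcpy_aligned
benchmark quoted above; measure_copy_gbps, the buffer sizes, and the
64-byte alignment are assumptions made for illustration. Because the
size is a runtime value here, the copy may go through the library
memcpy rather than inline vector code.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>

// Hypothetical helper, not the benchmark from this commit: copies `size`
// bytes (clamped to 4096) between two 64-byte-aligned buffers `iters`
// times and returns the observed throughput in GB/s.
double measure_copy_gbps(std::size_t size, std::size_t iters) {
  alignas(64) static unsigned char src[4096];
  alignas(64) static unsigned char dst[4096];
  if (size > sizeof(src)) size = sizeof(src);
  std::memset(src, 0xAB, sizeof(src));

  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iters; ++i) {
    std::memcpy(dst, src, size);
    // Compiler barrier (GCC/Clang extension) so the copy is not
    // optimized away or hoisted out of the loop.
    asm volatile("" : : "r"(dst) : "memory");
  }
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;

  const double bytes = static_cast<double>(size) * static_cast<double>(iters);
  return bytes / elapsed.count() / 1e9;
}
```

A call such as measure_copy_gbps(256, 100000000) gives a single-core
figure; results vary with the CPU, the compiler flags, and system load.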

Differential Revision: https://reviews.llvm.org/D134982
llvm/lib/Target/X86/X86.td
llvm/lib/Target/X86/X86ISelLowering.cpp
llvm/lib/Target/X86/X86Subtarget.h
llvm/lib/Target/X86/X86TargetTransformInfo.h
llvm/test/CodeGen/X86/memcpy-light-avx.ll [new file with mode: 0644]
llvm/test/CodeGen/X86/vector-width-store-merge.ll