review.tizen.org Git - platform/upstream/llvm.git/log

[NFC] Test Commit

Support dwarf fission for wasm object files

Initial support for dwarf fission sections (-gsplit-dwarf) on wasm.
The most interesting change is support for writing 2 files (.o and .dwo) in the
wasm object writer. My approach moves object-writing logic into its own function
and calls it twice, swapping out the endian::Writer (W) in between calls.
It also splits the import-preparation step into its own function (and skips it when writing a dwo).

Differential Revision: https://reviews.llvm.org/D85685

[MemorySSA] Be more conservative when traversing MemoryPhis.

I think we need to be even more conservative when traversing memory
phis, to make sure we catch any loop carried dependences.

This approach updates fillInCurrentPair to use unknown sizes for
locations when we walk over a phi, unless the location is guaranteed to
be loop-invariant for any possible loop. Using an unknown size for
locations should ensure we catch all memory accesses to locations after
the given memory location, which includes loop-carried dependences.

Reviewed By: asbirlea

Differential Revision: https://reviews.llvm.org/D87778

[NewPM] Fix pr45927.ll under NPM

[llvm-install-name-tool] Update the command-line guide

[InstCombine] Canonicalize SPF_ABS to abs intrinc

Enable canonicalization of SPF_ABS and SPF_NABS to the abs intrinsic.

To be conservative, the one-use check on the comparison is retained,
this may be relaxed if all goes well.

It's pretty likely that this will uncover places that missing
handling for the abs() intrinsic. Please report any seen performance
regressions.

Differential Revision: https://reviews.llvm.org/D87188

[LoopUnrollAndJam] Allow unroll and jam loops forced by user.

Summary: Allow unroll and jam loops forced by user.
LoopUnrollAndJamPass is still disabled by default in the NPM pipeline,
and can be controlled by -enable-npm-unroll-and-jam.

Reviewed By: Meinersbur, dmgreen

Differential Revision: https://reviews.llvm.org/D87786

[GVN] Use that assume(!X) implies X==false (PR47496)

We already use that assume(X) implies X==true, do the same for
assume(!X) implying X==false. This fixes PR47496.

[GVN] Add additional assume tests (NFC)

The other assume tests seem to be dealing with equalities in
particular. Test implication for the condition itself, especially
the negated case from PR47496.

[SCEV] Add test cases for max BTC with loop guard info.

This adds test cases for PR40961 and PR47247. They illustrate cases in
which the max backedge-taken count can be improved by information from
the loop guards.

Disable hoisting MI to hotter basic blocks when using pgo

This is a follow up patch for https://reviews.llvm.org/D63676 to
enable the feature when using pgo.

Differential Revision: https://reviews.llvm.org/D85240

[Lsan] Use fp registers to search for pointers

X86 can use xmm registers for pointers operations. e.g. for std::swap.
I don't know yet if it's possible on other platforms.

NT_X86_XSTATE includes all registers from NT_FPREGSET so
the latter used only if the former is not available. I am not sure how
reasonable to expect that but LLD has such fallback in
NativeRegisterContextLinux_x86_64::ReadFPR.

Reviewed By: morehouse

Differential Revision: https://reviews.llvm.org/D87754

AArch64::ArchKind's underlying type is uint64_t

[gn build] Port 7e4c6fb8546

[IRSim] Adding IR Instruction Mapper

This introduces the IRInstructionMapper, and the associated wrapper for
instructions, IRInstructionData, that maps IR level Instructions to
unsigned integers.

Mapping is done mainly by using the "isSameOperationAs" comparison
between two instructions.  If they return true, the opcode, result type,
and operand types of the instruction are used to hash the instruction
with an unsigned integer.  The mapper accepts instruction ranges, and
adds each resulting integer to a list, and each wrapped instruction to
a separate list.

At present, branches, phi nodes are not mapping and exception handling
is illegal.  Debug instructions are not considered.

The different mapping schemes are tested in
unittests/Analysis/IRSimilarityIdentifierTest.cpp

Recommit of: b04c1a9d3127730c05e8a22a0e931a12a39528df

Differential Revision: https://reviews.llvm.org/D86968

[SVE][WIP] Implement lowering for fixed length VSELECT to Scalable

Map fixed length VSELECT to its Scalable equivalent.

Differential Revision: https://reviews.llvm.org/D85364

[PDB] Split TypeServerSource and extend type index map lifetime

Extending the lifetime of these type index mappings does increase memory
usage (+2% in my case), but it decouples type merging from symbol
merging. This is a pre-requisite for two changes that I have in mind:
- parallel type merging: speeds up slow type merging
- defered symbol merging: avoid heap allocating (relocating) all symbols

This eliminates CVIndexMap and moves its data into TpiSource. The maps
are also split into a SmallVector and ArrayRef component, so that the
ipiMap can alias the tpiMap for /Z7 object files, and so that both maps
can simply alias the PDB type server maps for /Zi files.

Splitting TypeServerSource establishes that all input types to be merged
can be identified with two 32-bit indices:
- The index of the TpiSource object
- The type index of the record
This is useful, because this information can be stored in a single
64-bit atomic word to enable concurrent hashtable insertion.

One last change is that now all object files with debugChunks get a
TpiSource, even if they have no type info. This avoids some null checks
and special cases.

Differential Revision: https://reviews.llvm.org/D87736

[AArch64][GlobalISel] Widen G_EXTRACT_VECTOR_ELT element types if < 8b.

In order to not unnecessarily promote the source vector to greater than our
native vector size of 128b, I've added some cascading rules to widen based on
the number of elements.

[AArch64][GlobalISel] Make <8 x s16> and <16 x s8> legal for shifts.

[VectorCombine] limit load+insert transform to one-use

As discussed in:
https://llvm.org/PR47558
...there are several potential fixes/follow-ups visible
in the test case, but this is the quickest and safest
fix of the perf regression.

[X86] Don't match x87 register inline asm constraints unless the VT is floating point or its a clobber

The register class picked will be the RFP80 register class which has a f80 VT. The code in SelectionDAGBuilder that generates copies around inline assembly doesn't know how to handle an integer and floating point type of different bit widths.

The test case is derived from this https://godbolt.org/z/sEa659 which gcc accepts but clang crashes on. This patch just gives a more graceful error. I'm not sure if the single element struct case is special in gcc. Adding another field to the struct makes gcc reject it. If we want to support this correctly I think we need a change in the frontend to give us the true element type. Right now the frontend just realizes the constraint can take a memory argument so creates an integer type of the same size and bitcasts.

Differential Revision: https://reviews.llvm.org/D87485

[MLIR][Affine] Add parametric tile size support for affine.for tiling

Add support to tile affine.for ops with parametric sizes (i.e., SSA
values). Currently supports hyper-rectangular loop nests with constant
lower bounds only. Move methods

  - moveLoopBody(*)
  - getTileableBands(*)
  - checkTilingLegality(*)
  - tilePerfectlyNested(*)
  - constructTiledIndexSetHyperRect(*)

to allow reuse with constant tile size API. Add a test pass -test-affine
-parametric-tile to test parametric tiling.

Differential Revision: https://reviews.llvm.org/D87353

[MLIR] Support for return values in Affine.For yield

Add support for return values in affine.for yield along the same lines
as scf.for and affine.parallel.

Signed-off-by: Abhishek Varma <abhishek.varma@polymagelabs.com>
Differential Revision: https://reviews.llvm.org/D87437

Revert "[NFC] Refactor DiagnosticBuilder and PartialDiagnostic"

This reverts commit ee5519d323571c4a9a7d92cb817023c9b95334cd.

Revert "[CUDA][HIP] Defer overloading resolution diagnostics for host device functions"

This reverts commit 7f1f89ec8d9944559042bb6d3b1132eabe3409de.

This reverts commit 40df06cdafc010002fc9cfe1dda73d689b7d27a6.

[VectorCombine] rearrange bailouts for load insert for efficiency; NFC

[VectorCombine] add test for multi-use load (PR47558); NFC

[PowerPC][AIX] Don't hardcode python invoke command line

We shouldn't assume python exists, we should let lit
to decide whether it is python or python3 and expand the path.

Add missing include

[AMDGPU] Fix ROCm unit test memref initialization

[Sema] Introduce BuiltinAttr, per-declaration builtin-ness

Instead of relying on whether a certain identifier is a builtin, introduce BuiltinAttr to specify a declaration as having builtin semantics.

This fixes incompatible redeclarations of builtins, as reverting the identifier as being builtin due to one incompatible redeclaration would have broken rest of the builtin calls.
Mostly-compatible redeclarations of builtins also no longer have builtin semantics. They don't call the builtin nor inherit their attributes.
A long-standing FIXME regarding builtins inside a namespace enclosed in extern "C" not being recognized is also addressed.

Due to the more correct handling attributes for builtin functions are added in more places, resulting in more useful warnings.
Tests are updated to reflect that.

Intrinsics without an inline definition in intrin.h had `inline` and `static` removed as they had no effect and caused them to no longer be recognized as builtins otherwise.

A pthread_create() related test is XFAIL-ed, as it relied on it being recognized as a builtin based on its name.
The builtin declaration syntax is too restrictive and doesn't allow custom structs, function pointers, etc.
It seems to be the only case and fixing this would require reworking the current builtin syntax, so this seems acceptable.

Fixes PR45410.

Reviewed By: rsmith, yutsumi

Differential Revision: https://reviews.llvm.org/D77491

[DFSan] Add bcmp wrapper.

Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D87801

[SyntaxTree][Synthesis] Fix allocation in `createTree` for more general use

Prior to this change `createTree` could not create arbitrary syntax
trees. Now it dispatches to the constructor of the concrete syntax tree
according to the `NodeKind` passed as argument. This allows reuse inside
the Synthesis API. # Please enter the commit message for your changes.
Lines starting

Differential Revision: https://reviews.llvm.org/D87820

[amdgpu] Compilation fix for Release

Reviewed By: bkramer

Differential Revision: https://reviews.llvm.org/D87838

[InstSimplify] add tests for FP constant miscompile; NFC (PR43907)

[ARM] Expand distributing increments to also handle existing pre/post inc instructions.

This extends the distributing postinc code in load/store optimizer to
also handle the case where there is an existing pre/post inc instruction,
where subsequent instructions can be modified to use the adjusted
offset from the increment. This can save us having to keep the old
register live past the increment instruction.

Differential Revision: https://reviews.llvm.org/D83377

[AArch64][GlobalISel] Fix bug in fewVectorElts action while legalizing oversize G_FPTRUNC vectors.

For <8 x s32> = fptrunc <8 x s64> the fewerElementsVector action tries to break
down the source vector into the final source vectors of <2 x s64> using unmerge.
This fixes a crash due to using the wrong number of elements for the breakdown
type.

Also add some legalizer tests for explicitly G_FPTRUNC which we didn't have.

Differential Revision: https://reviews.llvm.org/D87814

[mlir][Vector] Add a folder for vector.broadcast

Fold the operation if the source is a scalar constant or splat constant.

Update transform-patterns-matmul-to-vector.mlir because the broadcast ops are folded in the conversion.

Reviewed By: aartbik

Differential Revision: https://reviews.llvm.org/D87703

Fix build failure in clangd

ModuloSchedule.cpp - remove unnecessary includes. NFCI.

Already included in ModuloSchedule.h

Revert "[DFSan] Add bcmp wrapper."

This reverts commit 559f9198125392bfa8e7d462aa8e87fcf5030185 due to bot
failure.

[Test] Add tests showing that IndVars cannot prove (X + 1 > X)

[flang][openacc] Lower clauses on loop construct to OpenACC dialect

Lower OpenACCLoopConstruct and most of the clauses to the OpenACC acc.loop operation in MLIR.
This patch refelcts what can be upstream from PR flang-compiler/f18-llvm-project#419

Reviewed By: SouraVX

Differential Revision: https://reviews.llvm.org/D87389

[mlir][openacc] Change operand type from index to AnyInteger in parallel op

This patch change the type of operands async, wait, numGangs, numWorkers and vectorLength from index
to AnyInteger to fit with acc.loop and the OpenACC specification.

Reviewed By: ftynse

Differential Revision: https://reviews.llvm.org/D87712

[ARM] Add more MVE postinc distribution tests. NFC

[CUDA][HIP] Defer overloading resolution diagnostics for host device functions

In CUDA/HIP a function may become implicit host device function by
pragma or constexpr. A host device function is checked in both
host and device compilation. However it may be emitted only
on host or device side, therefore the diagnostics should be
deferred until it is known to be emitted.

Currently clang is only able to defer certain diagnostics. This causes
false alarms and limits the usefulness of host device functions.

This patch lets clang defer all overloading resolution diagnostics for host device functions.

An option -fgpu-defer-diag is added to control this behavior. By default
it is off.

It is NFC for other languages.

Differential Revision: https://reviews.llvm.org/D84364

[AArch64] Match pairwise add/fadd pattern

D75689 turns the faddp pattern into a shuffle with vector add.

Match this new pattern in target-specific DAG combine, rather than ISel,
because legalization (for v2f32) turns it into a bit of a mess.

- extended to cover f16, f32, f64 and i64

Precommit test updates

[DFSan] Add bcmp wrapper.

Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D87801

[OpenMP 5.0] Fix user-defined mapper privatization in tasks

This patch fixes the problem that user-defined mapper array is not correctly privatized inside a task. This problem causes openmp/libomptarget/test/offloading/target_depend_nowait.cpp fails.

Differential Revision: https://reviews.llvm.org/D84470

[Coroutine] Fix a bug where Coroutine incorrectly spills phi and invoke defs before CoroBegin

When a spill definition is before CoroBegin, we cannot spill it to the frame immediately after the definition. We have to spill it after the frame is ready.
The current implementation handles it properly for any other kinds of instructions except for PhINode and InvokeInst, which could also be defined before CoroBegin.
This patch fixes it by moving the CoroBegin dominance check earlier, so that it covers all cases.
Added a test.

Differential Revision: https://reviews.llvm.org/D87810

[libc++] Remove some workarounds for missing variadic templates

We don't support GCC in C++03 mode, and Clang provides variadic templates
even in C++03 mode. So there's effectively no supported compiler that
doesn't support variadic templates.

This effectively gets rid of all uses of _LIBCPP_HAS_NO_VARIADICS, but
some workarounds for the lack of variadics remain.

[amdgpu] Lower SGPR-to-VGPR copy in the final phase of ISel.

- Need to lower COPY from SGPR to VGPR to a real instruction as the
  standard COPY is used where the source and destination are from the
  same register bank so that we potentially coalesc them together and
  save one COPY. Considering that, backend optimizations, such as CSE,
  won't handle them. However, the copy from SGPR to VGPR always needs
  materializing to a native instruction, it should be lowered into a
  real one before other backend optimizations.

Differential Revision: https://reviews.llvm.org/D87556

[ARM] Sink splats to MVE intrinsics

The predicated MVE intrinsics are generated as, for example,
llvm.arm.mve.add.predicated(x, splat(y). p). We need to sink the splat
value back into the loop, like we do for other instructions, so we can
re-select qr variants.

Differential Revision: https://reviews.llvm.org/D87693

[compiler-rt] [scudo] Fix typo in function attribute

Fixes the build after landing https://reviews.llvm.org/D87562

[mlir][Standard] Canonicalize chains of tensor_cast operations

Adds a pattern that replaces a chain of two tensor_cast operations by a single tensor_cast operation if doing so will not remove constraints on the shapes.

[compiler-rt] [hwasan] Replace INLINE with inline

Fixes the build after landing D87562.

[compiler-rt] [netbsd] Include <sys/dkbad.h>

Fixes build on NetBSD/sparc64.

[AMDGPU] should expand ROTL i16 to shifts.

Instruction combining pass turns library rotl implementation to llvm.fshl.i16.
In the selection dag the intrinsic is turned to ISD::ROTL node that cannot be selected.
Need to expand it to shifts again.

Reviewed By: rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D87618

[compiler-rt] [tsan] [netbsd] Catch unsupported LONG_JMP_SP_ENV_SLOT

Error out during build for unsupported CPU.

Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D87602

[compiler-rt] Replace INLINE with inline

This fixes the clash with BSD headers.

Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D87562

LiveDebugVariables.cpp - remove unnecessary Compiler.h include. NFCI.

Already included in LiveDebugVariables.h

DwarfExpression.cpp - remove unnecessary includes. NFCI.

Already included in DwarfExpression.h

ValueList.cpp - remove unnecessary includes. NFCI.

Already included in ValueList.h

[compiler-rt] Avoid pulling libatomic to sanitizer tests

Avoid fallbacking to software emulated compiler atomics, that are usually
provided by libatomic, which is not always present.

This fixes the test on NetBSD, which does not provide libatomic in base.

Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D87568

SafeStackLayout.cpp - remove unnecessary StackLifetime.h include. NFCI.

Already included in SafeStackLayout.h

[AMDGPU] Bump to ROCm 3.7 dependency hip_hcc->amdhip64

Differential Revision: https://reviews.llvm.org/D87773

InstCombiner.h - remove unnecessary KnownBits.h include. NFCI.

Move the include down to cpp files with an implicit dependency.

[ARM][MachineOutliner] Add missing testcase for calls.

[MemorySSA] Add another loop clobber test case.

[SVE][CodeGen] Lower floating point -> integer conversions

This patch adds new ISD nodes, FCVTZS_MERGE_PASSTHRU &
FCVTZU_MERGE_PASSTHRU, which are used to lower scalable vector
FP_TO_SINT/FP_TO_UINT operations and the following intrinsics:
- llvm.aarch64.sve.fcvtzu
- llvm.aarch64.sve.fcvtzs

Reviewed By: efriedma, paulwalker-arm

Differential Revision: https://reviews.llvm.org/D87232

[obj2yaml] - Don't emit EM_NONE.

When ELF header's `e_machine == 0`, we emit:

```
Machine: EM_NONE
```

We can avoid doing this, because yaml2obj sets the
`e_machine` field to `EM_NONE` by default.

Differential revision: https://reviews.llvm.org/D87829

[llvm-readelf/obj][test] - Document what we print in various places for unnamed section symbols.

We have an issue with `ELFDumper<ELFT>::getSymbolSectionName`:
1) It is used deeply for both LLVM/GNU styles and might return LLVM-style only
   values to describe symbols: "Undefined", "Processor Specific", "Absolute", etc.

2) `getSymbolSectionName` is used by `getFullSymbolName` and these special values
   might appear in instead of symbol names in many places.
   This occurs for unnamed section symbols.

It was not noticed because for most cases I've found it is unexpected to have an
unnamed section symbol. This patch documents the existent behavior, adds tests and FIXMEs.

Differential revision: https://reviews.llvm.org/D87763

[SLP] sort candidates to increase chance of optimal compare reduction

This is one (small) part of improving PR41312:
https://llvm.org/PR41312

As shown there and in the smaller tests here, if we have some member of the
reduction values that does not match the others, we want to push it to the
end (bring the matching members forward and together).

In the regression tests, we have 5 candidates for the 4 slots of the reduction.
If the one "wrong" compare is grouped with the others, it prevents forming the
ideal v4i1 compare reduction.

Differential Revision: https://reviews.llvm.org/D87772

[clang][docs] Fix documentation of -O

D79916 changed the behaviour from -O2 to -O1 but the documentation was
not updated to reflect this.

Remove unnecessary forward declarations. NFCI.

All of these forward declarations are fully defined in headers that are directly included.

[ConstraintSystem] Remove local variable that is set but not read [NFC]

gcc 7.4 warns about it.

[clang-format][regression][PR47461] ifdef causes catch to be seen as a function

https://bugs.llvm.org/show_bug.cgi?id=47461

The following change {D80940} caused a regression in code which ifdef's around the try and catch block cause incorrect brace placement around the catch

```
  try
  {
  }
  catch (...) {
    // This is not a small function
    bar = 1;
  }
}
```

The brace after the catch will be placed on a newline

Reviewed By: curdeius

Differential Revision: https://reviews.llvm.org/D87291

MetadataLoader.cpp - remove unnecessary StringRef include. NFCI.

Already included in MetadataLoader.h

SymbolizableObjectFile.h - remove unnecessary includes. NFCI.

Use forward declarations where possible, move includes down to SymbolizableObjectFile.cpp and avoid duplicate includes.

[NFC][ARM] Tail fold test changes

Run update script on one test and add another.

Revert "[lldb] Don't send invalid region addresses to lldb server"

This reverts commit c687af0c30b4dbdc9f614d5e061c888238e0f9c5
due to a test failure on Windows.

[ARM] Additional tests for qr intrinsics in loops. NFC

DwarfStringPool.cpp - remove unnecessary StringRef include. NFCI.

Already included in DwarfStringPool.h

DwarfFile.h - remove unnecessary includes. NFCI.

Use forward declarations where possible, move includes down to DwarfFile.cpp and avoid duplicate includes.

[ARM] Extra fp16 bitcast tests. NFC

[mlir] turn clang-format back on in C API test

C API test uses FileCheck comments inside C code and needs to
temporarily switch off clang-format to prevent it from messing with
FileCheck directives. A recently landed commit forgot to turn it back on
after a block of FileCheck comments. Fix that.

[gn build] (manually) port c9af34027bc

[MLIR] Turns swapId into a FlatAffineConstraints member func

`swapId` used to be a static function in `AffineStructures.cpp`. This diff makes it accessible from the external world by turning it into a member function of `FlatAffineConstraints`. This will be very helpful for other projects that need to manipulate the content of `FlatAffineConstraints`.

Differential Revision: https://reviews.llvm.org/D87766

[AsmPrinter] DwarfDebug - use DebugLoc const references where possible. NFC.

Avoid unnecessary copies.

[AMDGPU] Remove orphan SITargetLowering::LowerINT_TO_FP declaration. NFCI.

Method implementation no longer exists.

[AsmPrinter] Remove orphan DwarfUnit::shareAcrossDWOCUs declaration. NFCI.

Method implementation no longer exists.

[mlir][Linalg] Convolution tiling added to ConvOp vectorization pass

ConvOp vectorization supports now only convolutions of static shapes with dimensions
of size either 3(vectorized) or 1(not) as underlying vectors have to be of static
shape as well. In this commit we add support for convolutions of any size as well as
dynamic shapes by leveraging existing matmul infrastructure for tiling of both input
and kernel to sizes accepted by the previous version of ConvOp vectorization.
In the future this pass can be extended to take "tiling mask" as a user input which
will enable vectorization of user specified dimensions.

Differential Revision: https://reviews.llvm.org/D87676

[clang][aarch64] ACLE: Support implicit casts between GNU and SVE vectors

This patch adds support for implicit casting between GNU vectors and SVE
vectors when `__ARM_FEATURE_SVE_BITS==N`, as defined by the Arm C
Language Extensions (ACLE, version 00bet5, section 3.7.3.3) for SVE [1].

This behavior makes it possible to use GNU vectors with ACLE functions
that operate on VLAT. For example:

  typedef int8_t vec __attribute__((vector_size(32)));
  vec f(vec x) { return svasrd_x(svptrue_b8(), x, 1); }

Tests are also added for implicit casting between GNU and fixed-length
SVE vectors created by the 'arm_sve_vector_bits' attribute. This
behavior makes it possible to use VLST with existing interfaces that
operate on GNUT. For example:

  typedef int8_t vec1 __attribute__((vector_size(32)));
  void f(vec1);
  #if __ARM_FEATURE_SVE_BITS==256 && __ARM_FEATURE_SVE_VECTOR_OPERATORS
  typedef svint8_t vec2 __attribute__((arm_sve_vector_bits(256)));
  void g(vec2 x) { f(x); } // OK
  #endif

The `__ARM_FEATURE_SVE_VECTOR_OPERATORS` feature macro indicates
interoperability with the GNU vector extension. This is the first patch
providing support for this feature, which once complete will be enabled
by the `-msve-vector-bits` flag, as the `__ARM_FEATURE_SVE_BITS` feature
currently is.

[1] https://developer.arm.com/documentation/100987/latest

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D87607

[lldb] Don't send invalid region addresses to lldb server

Previously when <addr> in "memory region <addr>" didn't
parse correctly, we'd print an error then also ask lldb-server
for a region containing LLDB_INVALID_ADDRESS.

(lldb) memory region not_an_address
error: invalid address argument "not_an_address"...
error: Server returned invalid range

Only send the command to lldb-server if the address
parsed correctly.

(lldb) memory region not_an_address
error: invalid address argument "not_an_address"...

Reviewed By: labath

Differential Revision: https://reviews.llvm.org/D87694

[X86] Fix stack alignment on 32-bit Solaris/x86

On Solaris/x86, several hundred 32-bit tests `FAIL`, all in the same way:

  env ASAN_OPTIONS=halt_on_error=false ./halt_on_error_suppress_equal_pcs.cpp.tmp
  Segmentation Fault (core dumped)

They segfault during startup:

  Thread 2 received signal SIGSEGV, Segmentation fault.
  [Switching to Thread 1 (LWP 1)]
  0x080f21f0 in __sanitizer::internal_mmap(void*, unsigned long, int, int, int, unsigned long long) () at /vol/llvm/src/llvm-project/dist/compiler-rt/lib/sanitizer_common/sanitizer_solaris.cpp:65
  65                              int prot, int flags, int fd, OFF_T offset) {
  1: x/i $pc
  => 0x80f21f0 <_ZN11__sanitizer13internal_mmapEPvmiiiy+16>: movaps 0x30(%esp),%xmm0
  (gdb) p/x $esp
  $3 = 0xfeffd488

The problem is that `movaps` expects 16-byte alignment, while 32-bit Solaris/x86
only guarantees 4-byte alignment following the i386 psABI.

This patch updates `X86Subtarget::initSubtargetFeatures` accordingly,
handles Solaris/x86 in the corresponding testcase, and allows for some
variation in address alignment in
`compiler-rt/test/ubsan/TestCases/TypeCheck/vptr.cpp`.

Tested on `amd64-pc-solaris2.11` and `x86_64-pc-linux-gnu`.

Differential Revision: https://reviews.llvm.org/D87615

Revert "Re-land: Add new hidden option -print-changed which only reports changes to IR"

The test added in this commit is failing on Windows bots:

http://lab.llvm.org:8011/builders/llvm-clang-win-x-armv7l/builds/1269

This reverts commit f9e6d1edc0dad9afb26e773aa125ed62c58f7080 and follow-up commit 6859d95ea2d0f3fe0de2923a3f642170e66a1a14.

[NFC] EliminateDuplicatePHINodes(): small-size optimization: if there are <= 32 PHI's, O(n^2) algo is faster (geomean -0.08%)

This is functionally equivalent to the old implementation.

As per https://llvm-compile-time-tracker.com/compare.php?from=5f4e9bf6416e45eba483a4e5e263749989fdb3b3&to=4739e6e4eb54d3736e6457249c0919b30f6c855a&stat=instructions
this is a clear geomean compile-time regression-free win with overall geomean of `-0.08%`

32 PHI's appears to be the sweet spot; both the 16 and 64 performed worse:
https://llvm-compile-time-tracker.com/compare.php?from=5f4e9bf6416e45eba483a4e5e263749989fdb3b3&to=c4efe1fbbfdf0305ac26cd19eacb0c7774cdf60e&stat=instructions
https://llvm-compile-time-tracker.com/compare.php?from=5f4e9bf6416e45eba483a4e5e263749989fdb3b3&to=e4989d1c67010d3339d1a40ff5286a31f10cfe82&stat=instructions

If we have more PHI's than that, we fall-back to the original DenseSet-based implementation,
so the not-so-fast cases will still be handled.

However compile-time isn't the main motivation here.
I can name at least 3 limitations of this CSE:
1. Assumes that all PHI nodes have incoming basic blocks in the same order (can be fixed while keeping the DenseMap)
2. Does not special-handle `undef` incoming values (i don't see how we can do this with hashing)
3. Does not special-handle backedge incoming values (maybe can be fixed by hashing backedge as some magical value)

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D87408

[SplitKit] Only copy live lanes

When splitting a live interval with subranges, only insert copies for
the lanes that are live at the point of the split. This avoids some
unnecessary copies and fixes a problem where copying dead lanes was
generating MIR that failed verification. The test case for this is
test/CodeGen/AMDGPU/splitkit-copy-live-lanes.mir.

Without this fix, some earlier live range splitting would create %430:

%430 [256r,848r:0)[848r,2584r:1)  0@256r 1@848r L0000000000000003 [848r,2584r:0)  0@848r L0000000000000030 [256r,2584r:0)  0@256r weight:1.480938e-03
...
256B     undef %430.sub2:vreg_128 = V_LSHRREV_B32_e32 16, %20.sub1:vreg_128, implicit $exec
...
848B     %430.sub0:vreg_128 = V_AND_B32_e32 %92:sreg_32, %20.sub1:vreg_128, implicit $exec
...
2584B    %431:vreg_128 = COPY %430:vreg_128

Then RAGreedy::tryLocalSplit would split %430 into %432 and %433 just
before 848B giving:

%432 [256r,844r:0)  0@256r L0000000000000030 [256r,844r:0)  0@256r weight:3.066802e-03
%433 [844r,848r:0)[848r,2584r:1)  0@844r 1@848r L0000000000000030 [844r,2584r:0)  0@844r L0000000000000003 [844r,844d:0)[848r,2584r:1)  0@844r 1@848r weight:2.831776e-03
...
256B     undef %432.sub2:vreg_128 = V_LSHRREV_B32_e32 16, %20.sub1:vreg_128, implicit $exec
...
844B     undef %433.sub0:vreg_128 = COPY %432.sub0:vreg_128 {
           internal %433.sub2:vreg_128 = COPY %432.sub2:vreg_128
848B     }
  %433.sub0:vreg_128 = V_AND_B32_e32 %92:sreg_32, %20.sub1:vreg_128, implicit $exec
...
2584B    %431:vreg_128 = COPY %433:vreg_128

Note that the copy from %432 to %433 at 844B is a curious
bundle-without-a-BUNDLE-instruction that SplitKit creates deliberately,
and it includes a copy of .sub0 which is not live at this point, and
that causes it to fail verification:

*** Bad machine code: No live subrange at use ***
- function:    zextload_global_v64i16_to_v64i64
- basic block: %bb.0  (0x7faed48) [0B;2848B)
- instruction: 844B    undef %433.sub0:vreg_128 = COPY %432.sub0:vreg_128
- operand 1:   %432.sub0:vreg_128
- interval:    %432 [256r,844r:0)  0@256r L0000000000000030 [256r,844r:0)  0@256r weight:3.066802e-03
- at:          844B

Using real bundles with a BUNDLE instruction might also fix this
problem, but the current fix is less invasive and also avoids some
unnecessary copies.

https://bugs.llvm.org/show_bug.cgi?id=47492

Differential Revision: https://reviews.llvm.org/D87757

[AMDGPU] Generate test checks for splitkit-copy-bundle.mir

This is a pre-commit for D87757 "[SplitKit] Only copy live lanes".