aarch64: Restore vectorisation of vld1 inputs [PR109072]
authorRichard Sandiford <richard.sandiford@arm.com>
Tue, 28 Mar 2023 11:34:51 +0000 (12:34 +0100)
committerRichard Sandiford <richard.sandiford@arm.com>
Tue, 28 Mar 2023 11:34:51 +0000 (12:34 +0100)
commitfcb411564a655a01f759eea3bb16bfd1bc879bfd
tree1d25efc3e95e69b3965bfec817b53102c7743b0e
parent75cda3be0232f745cda4e177d514f6900390af0b
aarch64: Restore vectorisation of vld1 inputs [PR109072]

Before GCC 12, we would vectorize:

  int32_t arr[] = { x, x, x, x };

at -O3.  Vectorizing the store on its own is often a loss, particularly
for integers, so g:4963079769c99c4073adfd799885410ad484cbbe suppressed it.
This was necessary to fix regressions from enabling vectorisation at -O2,

However, the vectorisation is important if the code subsequently loads
from the array using vld1:

  return vld1q_s32 (arr);

This approach of initialising an array and loading from it is the
recommend endian-agnostic way of constructing an ACLE vector.

As discussed in the PR notes, the general fix would be to fold the
store and load-back to a constructor (preferably before vectorisation).
But that's clearly not stage 4 material.

This patch instead delays folding vld1 until after inlining and
records which decls a vld1 loads from.  It then treats vector
stores to those decls as free, on the optimistic assumption that
they will be removed later.  The patch also brute-forces
vectorization of plain constructor+store sequences, since some
of the CPU costs make that (dubiously) expensive even when the
store is discounted.

Delaying folding showed that we were failing to update the vops.
The patch fixes that too.

Thanks to Tamar for discussion & help with testing.

gcc/
PR target/109072
* config/aarch64/aarch64-protos.h (aarch64_vector_load_decl): Declare.
* config/aarch64/aarch64.h (machine_function::vector_load_decls): New
variable.
* config/aarch64/aarch64-builtins.cc (aarch64_record_vector_load_arg):
New function.
(aarch64_general_gimple_fold_builtin): Delay folding of vld1 until
after inlining.  Record which decls are loaded from.  Fix handling
of vops for loads and stores.
* config/aarch64/aarch64.cc (aarch64_vector_load_decl): New function.
(aarch64_accesses_vector_load_decl_p): Likewise.
(aarch64_vector_costs::m_stores_to_vector_load_decl): New member
variable.
(aarch64_vector_costs::add_stmt_cost): If the function has a vld1
that loads from a decl, treat vector stores to those decls as
zero cost.
(aarch64_vector_costs::finish_cost): ...and in that case,
if the vector code does nothing more than a store, give the
prologue a zero cost as well.

gcc/testsuite/
PR target/109072
* gcc.target/aarch64/pr109072_1.c: New test.
* gcc.target/aarch64/pr109072_2.c: Likewise.
gcc/config/aarch64/aarch64-builtins.cc
gcc/config/aarch64/aarch64-protos.h
gcc/config/aarch64/aarch64.cc
gcc/config/aarch64/aarch64.h
gcc/testsuite/gcc.target/aarch64/pr109072_1.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/pr109072_2.c [new file with mode: 0644]