vect+aarch64: Fix ldp_stp_* regressions
author: Richard Sandiford <richard.sandiford@arm.com>
Tue, 15 Feb 2022 18:09:33 +0000 (18:09 +0000)
committer: Richard Sandiford <richard.sandiford@arm.com>
Tue, 15 Feb 2022 18:09:33 +0000 (18:09 +0000)
commit: 4963079769c99c4073adfd799885410ad484cbbe
tree: ec53951399724d809c6296a30e7a06e81d3e72a8
parent: 63a9328cb8c601377fe73e214b708c4ae0441847

ldp_stp_1.c, ldp_stp_4.c and ldp_stp_5.c have been failing since
vectorisation was enabled at -O2.  In all three cases SLP is
generating vector code when scalar code would be better.

The problem is that the target costs do not model whether STP could
be used for the scalar or vector code, so the normal latency-based
costs for store-heavy code can be way off.  It would be good to fix
that “properly” at some point, but it isn't easy; see the existing
discussion in aarch64_sve_adjust_stmt_cost for more details.
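
As an illustration of the kind of code in question, here is a minimal
sketch of a "set-up+stores" function.  The function name and body are
hypothetical and are not taken from the ldp_stp_* tests.  On AArch64,
the scalar form can plausibly be implemented with two STPs of X
registers, whereas a vector form would first have to build { a, b }
from general registers before storing it, so a purely latency-based
comparison can miss that the scalar stores pair up for free.

```c
/* Hypothetical example: nothing but stores of incoming values to
   adjacent memory locations.  */
void
store4 (long *ptr, long a, long b)
{
  ptr[0] = a;
  ptr[1] = b;
  ptr[2] = a;
  ptr[3] = b;
}
```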

This patch therefore adds an on-the-side check for whether the
code is doing nothing more than set-up+stores.  It then applies
STP-based costs to those cases only, in addition to the normal
latency-based costs.  (That is, the vector code has to win on
both counts rather than on one count individually.)
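
The "win on both counts" rule can be sketched as follows.  This is
only an illustration: the struct, its fields and the function name
are hypothetical, and treating a tie as a loss for the vector code
is an assumption of the sketch rather than something stated above.

```c
#include <stdbool.h>

/* Hypothetical cost summary for one version of the code.  */
struct cost_pair
{
  unsigned latency_cost; /* Normal latency-based cost.  */
  unsigned stp_cost;     /* STP-based cost; only used for set-up+stores.  */
};

/* Return true if the vector code should be preferred.  STORES_ONLY says
   whether the code does nothing more than set-up+stores.  */
static bool
prefer_vector (struct cost_pair scalar, struct cost_pair vector,
	       bool stores_only)
{
  /* The latency-based comparison always applies.  */
  if (vector.latency_cost >= scalar.latency_cost)
    return false;

  /* The STP-based comparison is an additional hurdle for the
     set-up+stores case only (assumption: ties go to scalar code).  */
  if (stores_only && vector.stp_cost >= scalar.stp_cost)
    return false;

  return true;
}
```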

However, at the moment, SLP costs one vector set-up instruction
for every vector in an SLP node, even if the contents are the
same as a previous vector in the same node.  Fixing the STP costs
without fixing that would regress other cases, tested in the patch.

The patch therefore makes the SLP costing code check for duplicates
within a node.  Ideally we'd check for duplicates more globally,
but that would require a more global approach to costs: the cost
of an initialisation should be amortised across all trees that
use the initialisation, rather than fully counted against one
arbitrarily-chosen subtree.
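
The per-node duplicate check can be sketched as below.  The patch
itself uses a hash over vect_scalar_ops_slice; this sketch swaps in
a simple pairwise comparison to stay short, and represents scalar
operands as plain integers.  All names here are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if vectors A and B of the node have the same contents.
   Each vector covers LANES consecutive entries of OPS.  */
static bool
same_vector (const int *ops, size_t lanes, size_t a, size_t b)
{
  for (size_t i = 0; i < lanes; ++i)
    if (ops[a * lanes + i] != ops[b * lanes + i])
      return false;
  return true;
}

/* Count how many of the NVECTORS vectors in a node need their own
   set-up instruction, i.e. are not duplicates of an earlier vector
   in the same node.  */
static size_t
count_setup_vectors (const int *ops, size_t lanes, size_t nvectors)
{
  size_t count = 0;
  for (size_t v = 0; v < nvectors; ++v)
    {
      bool seen = false;
      for (size_t prev = 0; prev < v && !seen; ++prev)
	seen = same_vector (ops, lanes, prev, v);
      if (!seen)
	++count;
    }
  return count;
}
```

For a node whose four two-lane vectors are { 1, 2 }, { 1, 2 },
{ 3, 4 } and { 1, 2 }, this counts two set-up instructions rather
than four.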

Back on aarch64: an earlier version of the patch tried to apply
the new heuristic to constant stores.  However, that didn't work
too well in practice; see the comments for details.  The patch
therefore just tests the status quo for constant cases, leaving out
a match if the current choice is dubious.

ldp_stp_5.c was affected by the same thing.  The test would be
worth vectorising if we generated better vector code, but:

(1) We do a bad job of moving the { -1, 1 } constant, given that
    we have { -1, -1 } and { 1, 1 } to hand.

(2) The vector code has 6 pairable stores to misaligned offsets.
    We have peephole patterns to handle such misalignment for
    4 pairable stores, but not 6.

So the SLP decision isn't wrong as such.  It's just being let
down by later codegen.

The patch therefore adds -mstrict-align to preserve the original
intention of the test while adding ldp_stp_19.c to check for the
preferred vector code (XFAILed for now).

gcc/
* tree-vectorizer.h (vect_scalar_ops_slice): New struct.
(vect_scalar_ops_slice_hash): Likewise.
(vect_scalar_ops_slice::op): New function.
* tree-vect-slp.cc (vect_scalar_ops_slice::all_same_p): New function.
(vect_scalar_ops_slice_hash::hash): Likewise.
(vect_scalar_ops_slice_hash::equal): Likewise.
(vect_prologue_cost_for_slp): Check for duplicate vectors.
* config/aarch64/aarch64.cc
(aarch64_vector_costs::m_stp_sequence_cost): New member variable.
(aarch64_aligned_constant_offset_p): New function.
(aarch64_stp_sequence_cost): Likewise.
(aarch64_vector_costs::add_stmt_cost): Handle new STP heuristic.
(aarch64_vector_costs::finish_cost): Likewise.

gcc/testsuite/
* gcc.target/aarch64/ldp_stp_5.c: Require -mstrict-align.
* gcc.target/aarch64/ldp_stp_14.h,
* gcc.target/aarch64/ldp_stp_14.c: New test.
* gcc.target/aarch64/ldp_stp_15.c: Likewise.
* gcc.target/aarch64/ldp_stp_16.c: Likewise.
* gcc.target/aarch64/ldp_stp_17.c: Likewise.
* gcc.target/aarch64/ldp_stp_18.c: Likewise.
* gcc.target/aarch64/ldp_stp_19.c: Likewise.
gcc/config/aarch64/aarch64.cc
gcc/testsuite/gcc.target/aarch64/ldp_stp_14.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/ldp_stp_14.h [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/ldp_stp_19.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/ldp_stp_5.c
gcc/tree-vect-slp.cc
gcc/tree-vectorizer.h