aarch64: Tweak SVE load/store costs
author Richard Sandiford <richard.sandiford@arm.com>
Tue, 14 Apr 2020 20:04:03 +0000 (21:04 +0100)
committer Richard Sandiford <richard.sandiford@arm.com>
Fri, 17 Apr 2020 15:09:38 +0000 (16:09 +0100)
commit 8b50d7a47624030d87645237c60bd8f7ac78b2ec
tree 00f4b13286baea1366f15755050f0a7f8e6f7913
parent 2e3897490e0f99b22a2813cfb34d59a1ea71ff68

We were seeing performance regressions on 256-bit SVE with code like:

  for (int i = 0; i < count; ++i)
  #pragma GCC unroll 128
    for (int j = 0; j < 128; ++j)
      *dst++ = 1;

(derived from lmbench).
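
For reference, a self-contained version of the loop might look like the
following.  This is only a sketch: the function name fill_ones and the
unsigned int element type are assumptions, since the fragment above does
not say what dst points to.

```c
/* Hypothetical wrapper around the loop above.  The name fill_ones and the
   unsigned int element type are assumptions; the original fragment does
   not specify either.  */
void
fill_ones (unsigned int *dst, int count)
{
  for (int i = 0; i < count; ++i)
    {
      /* Ask GCC to fully unroll the 128-iteration inner loop, so that each
         outer iteration becomes 128 consecutive scalar stores.  */
#pragma GCC unroll 128
      for (int j = 0; j < 128; ++j)
        *dst++ = 1;
    }
}
```

Compiling it with options such as -O3 -march=armv8.2-a+sve
-msve-vector-bits=256 and inspecting the assembly is one way to see which
form the vectorizer chose.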

For 128-bit SVE, it's clearly better to use Advanced SIMD STPs here,
since they can store 256 bits at a time.  We already do this for
-msve-vector-bits=128 because in that case Advanced SIMD comes first
in autovectorize_vector_modes.

If we handled full-loop predication well for this kind of loop,
the choice between Advanced SIMD and 256-bit SVE would be mostly
a wash, since both of them could store 256 bits at a time.  However,
SVE would still have the extra prologue overhead of setting up the
predicate, so Advanced SIMD would still be the natural choice.

As things stand, though, we don't handle full-loop predication well
for this kind of loop, so the 256-bit SVE code is significantly worse.
That is something to fix for GCC 11 (hopefully).  However, even though we
account for the overhead of predication in the cost model, the SVE
version (wrongly) appeared to need half the number of stores.
That was enough to drown out the predication overhead and meant
that we'd pick the SVE code over the Advanced SIMD code.

512-bit SVE has a clear advantage over Advanced SIMD, so we should
continue using SVE there.
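
The store counts behind this comparison can be worked through with a
little arithmetic.  The sketch below assumes 32-bit elements (the element
type is not given above) and uses a hypothetical helper, stores_needed;
it illustrates the counting only and is not code from the patch.

```c
#include <assert.h>

/* Assumed element width; the loop above does not state the type of *dst.  */
#define ELT_BITS 32u

/* Scalar stores per outer iteration after full unrolling.  */
#define ELTS_PER_ITER 128u

/* Hypothetical helper: number of store instructions needed to write
   TOTAL_BITS of data when each instruction stores BITS_PER_INSN bits.  */
static unsigned int
stores_needed (unsigned int total_bits, unsigned int bits_per_insn)
{
  assert (bits_per_insn != 0 && total_bits % bits_per_insn == 0);
  return total_bits / bits_per_insn;
}

/* Total bits written by one outer iteration under the assumptions above.  */
static unsigned int
bits_per_iteration (void)
{
  return ELTS_PER_ITER * ELT_BITS;
}
```

With 128 elements of 32 bits each (4096 bits per outer iteration), this
gives 32 stores when counting single 128-bit Advanced SIMD vectors, 16 when
those are paired into STPs or when using 256-bit SVE, and 8 for 512-bit
SVE.  A comparison that counts one store per vector and does not model STP
pairing would see 32 versus 16, which is consistent with the "half the
number of stores" effect described above.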

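This patch tries to account for this in the cost model by doubling the
cost of SVE loads and stores when one loop iteration has enough scalar
elements to use an Advanced SIMD LDP or STP.  It's a bit of a
compromise; see the comment in the patch for more details.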
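
As a rough illustration of the rule summarized in the ChangeLog entry
(double the cost of SVE loads and stores if one loop iteration has enough
scalar elements to use an Advanced SIMD LDP or STP), here is a minimal
standalone sketch.  It is not the code from aarch64.c: the enum, the
function names and the parameters are all hypothetical, and it assumes
that "enough" means a whole multiple of the 256 bits that one LDP or STP
of two 128-bit registers can move.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the vectorizer's statement kinds.  */
enum stmt_kind { KIND_LOAD, KIND_STORE, KIND_OTHER };

/* Bits moved by one Advanced SIMD LDP or STP of two 128-bit registers.  */
#define ADVSIMD_PAIR_BITS 256u

/* Return true if a load or store that accesses ELTS_PER_ITER elements of
   ELT_BITS bits per scalar iteration could be implemented with a whole
   number of Advanced SIMD LDPs or STPs.  The "whole multiple of 256 bits"
   test is an assumption made for this sketch.  */
static bool
could_use_advsimd_pair (enum stmt_kind kind, unsigned int elts_per_iter,
                        unsigned int elt_bits)
{
  if (kind != KIND_LOAD && kind != KIND_STORE)
    return false;
  unsigned int bits = elts_per_iter * elt_bits;
  return bits != 0 && bits % ADVSIMD_PAIR_BITS == 0;
}

/* Return the adjusted SVE cost of a statement: doubled for loads and
   stores that Advanced SIMD could pair, unchanged otherwise.  */
static unsigned int
adjust_sve_stmt_cost (enum stmt_kind kind, unsigned int elts_per_iter,
                      unsigned int elt_bits, unsigned int base_cost)
{
  assert (elt_bits != 0);
  if (could_use_advsimd_pair (kind, elts_per_iter, elt_bits))
    return base_cost * 2;
  return base_cost;
}
```

Under these assumptions, a store of 128 32-bit elements per iteration has
its cost doubled, while a store of a single 32-bit element per iteration,
or any statement that is not a load or store, keeps its original cost.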

2020-04-17  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* config/aarch64/aarch64.c (aarch64_advsimd_ldp_stp_p): New function.
	(aarch64_sve_adjust_stmt_cost): Add a vectype parameter.  Double the
	cost of load and store insns if one loop iteration has enough scalar
	elements to use an Advanced SIMD LDP or STP.
	(aarch64_add_stmt_cost): Update call accordingly.

gcc/testsuite/
	* gcc.target/aarch64/sve/cost_model_2.c: New test.
	* gcc.target/aarch64/sve/cost_model_3.c: Likewise.
	* gcc.target/aarch64/sve/cost_model_4.c: Likewise.
	* gcc.target/aarch64/sve/cost_model_5.c: Likewise.
	* gcc.target/aarch64/sve/cost_model_6.c: Likewise.
	* gcc.target/aarch64/sve/cost_model_7.c: Likewise.
gcc/ChangeLog
gcc/config/aarch64/aarch64.c
gcc/testsuite/ChangeLog
gcc/testsuite/gcc.target/aarch64/sve/cost_model_2.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/sve/cost_model_3.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/sve/cost_model_4.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/sve/cost_model_5.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/sve/cost_model_6.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/sve/cost_model_7.c [new file with mode: 0644]