gallium/indices: Use "__restrict" to help the compiler
author Douglas Anderson <dianders@chromium.org>
Tue, 9 Mar 2021 19:10:43 +0000 (11:10 -0800)
committer Marge Bot <eric+marge@anholt.net>
Thu, 11 Mar 2021 03:14:31 +0000 (03:14 +0000)
commit 217d6594dec934b4b34f5c7e0a0cd978339a5ba0
tree f1ec242009f08cfca1f9c4ca203c715e1b0bad1b
parent e7e297732ed56ce4869b2a0e2b5f0533be69f32e

In a perf trace, translate_quads_uint2uint_last2last_prdisable() was
showing up as a huge hot spot. Digging through the assembly on arm64
showed that the compiler wasn't doing any read caching. Specifically,
the generated code looked roughly like this:

  out[j+0] = in[i+0];
  out[j+1] = in[i+1];
  out[j+2] = in[i+3];
  out[j+3] = in[i+1];
  out[j+4] = in[i+2];
  out[j+5] = in[i+3];

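...and the compiler was loading in[i+1] and in[i+3] from memory twice
instead of keeping them in registers. Without more information it has
to assume that "out" might alias "in", so any store to out[] could
change a value it had already loaded.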

If we sprinkle generous amounts of the `__restrict` keyword then the
compiler is able to be much smarter. Not only does it avoid
double-loading but it also generates better instructions. It uses two
LDP (load pair) instructions instead of 6 LDR instructions and uses
some STP (store pair) too.
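
As a rough illustration of the idea, here is a minimal sketch of what
a `__restrict`-qualified translate loop could look like. The function
name translate_quads_sketch() and its simplified signature are
hypothetical; the real code is generated by u_indices_gen.py and may
look different.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for a generated translate function.
 * Marking both pointers __restrict promises the compiler that "in" and
 * "out" never overlap, so a store to out[] cannot change in[]. That
 * lets it load each in[] element once per quad and reuse the value. */
static void
translate_quads_sketch(const uint32_t *__restrict in,
                       uint32_t *__restrict out,
                       unsigned start, unsigned out_nr)
{
   unsigned i, j;

   /* Each quad (4 input indices) becomes two triangles (6 output
    * indices). out_nr is assumed to be a multiple of 6. */
   for (i = start, j = 0; j < out_nr; j += 6, i += 4) {
      out[j+0] = in[i+0];
      out[j+1] = in[i+1];
      out[j+2] = in[i+3];
      out[j+3] = in[i+1];
      out[j+4] = in[i+2];
      out[j+5] = in[i+3];
   }
}
```

Note that the promise is on the caller: passing overlapping buffers to
a function like this would be undefined behavior.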

In one example test this increased FPS from ~25.7 to ~34.5.

Change-Id: I88bf8bd9ac421fe48a7d6961e224425c3ae7beee
Reported-by: Rob Clark <robdclark@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9485>
src/gallium/auxiliary/indices/u_indices_gen.py