gallium/indices: Use "__restrict" to help the compiler
In a perf trace translate_quads_uint2uint_last2last_prdisable() was
showing up as a huge hot spot. Digging through the assembly on arm64
found that the compiler wasn't doing any read caching. Specifically,
the generated code looked roughly like this:
out[j+0] = in[i+0];
out[j+1] = in[i+1];
out[j+2] = in[i+3];
out[j+3] = in[i+1];
out[j+4] = in[i+2];
out[j+5] = in[i+3];
...and the compiler was loading "i+1" and "i+3" from memory twice for
no reason (instead of caching it).
If we sprinkle generous amounts of the `__restrict` keyword then the
compiler is able to be much smarter. Not only does it avoid
double-loading but it also generates better instructions. It uses two
LDRD instructions instead of 6 LDR instructions and uses some STRD
too.
In one example test this increased FPS from ~25.7 to ~34.5.
Change-Id: I88bf8bd9ac421fe48a7d6961e224425c3ae7beee
Reported-by: Rob Clark <robdclark@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9485>