ARM: make use of UQADD8 instruction even in generic C code paths
ARMv6 has the UQADD8 instruction, which implements unsigned saturated
addition for 8-bit values packed in 32-bit registers. It is very useful
for the UN8x4_ADD_UN8x4, UN8_rb_ADD_UN8_rb and ADD_UN8 macros (which
would otherwise need a lot of arithmetic operations to simulate this
instruction).
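To illustrate the cost of the simulation, here is a minimal sketch of a
portable per-byte saturated add on a packed 32-bit value. The function
names are hypothetical and this is not the pixman macro code itself; it
only shows the kind of mask, add and carry-fixup sequence that a single
UQADD8 replaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not a pixman identifier: saturated addition of
 * four unsigned 8-bit lanes packed in 32-bit words, in plain C. */
static inline uint32_t
sat_add_un8x4_portable (uint32_t x, uint32_t y)
{
    const uint32_t mask = 0x00ff00ffu;

    /* Add bytes 0/2 and bytes 1/3 separately, so that every lane has
     * 8 spare bits above it to hold the carry. */
    uint32_t even = (x & mask) + (y & mask);
    uint32_t odd  = ((x >> 8) & mask) + ((y >> 8) & mask);

    /* A lane that overflowed has its carry in bit 8 of that lane.
     * Subtracting the carries from 0x01000100 yields 0xff in the low
     * 8 bits of each overflowed lane, which is then OR-ed in. */
    even |= 0x01000100u - ((even >> 8) & mask);
    odd  |= 0x01000100u - ((odd >> 8) & mask);

    /* Drop the carry bits and interleave the lanes again. */
    return (even & mask) | ((odd & mask) << 8);
}

/* Small usage example: the first case has no overflow, the second
 * saturates three of the four lanes. */
static inline void
sat_add_un8x4_portable_example (void)
{
    assert (sat_add_un8x4_portable (0x10203040u, 0x01020304u) == 0x11223344u);
    assert (sat_add_un8x4_portable (0x7f80fe01u, 0x017f02ffu) == 0x80ffffffu);
}
```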
Since most of the major ARM Linux distros are built for ARMv7, we are
much less dependent on runtime CPU detection, and a lot of users can get
practical benefits from selecting UQADD8 by conditional compilation.
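The conditional compilation could look roughly like the sketch below.
It is an assumption-laden illustration, not the code of this patch: the
names sat_add_un8x4 and SAT_ADD_USES_UQADD8 are hypothetical, and the
guard uses the ACLE macro __ARM_FEATURE_SIMD32, which older compilers
may not define (they may need checks on macros such as __ARM_ARCH_6__
or __ARM_ARCH_7A__ instead). On any other target the plain C fallback
is compiled.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__GNUC__) && defined(__arm__) && defined(__ARM_FEATURE_SIMD32)

/* Hypothetical marker macro: the UQADD8 path was selected. */
#define SAT_ADD_USES_UQADD8 1

/* One instruction: per-byte unsigned saturated add of x and y. */
static inline uint32_t
sat_add_un8x4 (uint32_t x, uint32_t y)
{
    uint32_t t;
    __asm__ ("uqadd8 %0, %1, %2" : "=r" (t) : "r" (x), "r" (y));
    return t;
}

#else

#define SAT_ADD_USES_UQADD8 0

/* Generic fallback: handle each of the four bytes in turn and clamp
 * the sum to 0xff. */
static inline uint32_t
sat_add_un8x4 (uint32_t x, uint32_t y)
{
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t sum = ((x >> shift) & 0xffu) + ((y >> shift) & 0xffu);

        if (sum > 0xffu)
            sum = 0xffu;
        result |= sum << shift;
    }
    return result;
}

#endif

/* Both variants are expected to agree on every input, for example: */
static inline void
sat_add_un8x4_example (void)
{
    assert (sat_add_un8x4 (0xff000001u, 0x01000001u) == 0xff000002u);
}
```

Because the choice is made at build time, no runtime CPU detection is
involved; the trade-off is that a binary built this way needs a CPU
that has the instruction.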
Results of the cairo-perf-trace benchmark on ARM Cortex-A15, with pixman
compiled by gcc 4.7.2 and PIXMAN_DISABLE set to "arm-simd arm-neon" (so
that the ARM assembly fast paths are disabled):
Speedups
========
image firefox-talos-gfx (29938.22 0.12%) -> (27814.76 0.51%) : 1.08x speedup
image firefox-asteroids (23241.11 0.07%) -> (21795.19 0.07%) : 1.07x speedup
image firefox-canvas-alpha (174519.85 0.08%) -> (164788.64 0.20%) : 1.06x speedup
image poppler (9464.46 1.61%) -> (8991.53 0.14%) : 1.05x speedup