ARM: a bit faster NEON bilinear scaling for r5g6b5 source images
Instruction scheduling is improved in the code responsible for fetching r5g6b5
pixels and converting them to the intermediate x8r8g8b8 color format used in
the interpolation part of the code. Many NEON stalls still remain; they can
be resolved later by the use of pipelining.
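
For reference, the per-pixel conversion mentioned above can be sketched in
scalar C as follows. This is only an illustration, assuming the usual bit
replication approach to widening 5-bit and 6-bit channels; the function name
is hypothetical, and the NEON code works on several pixels per instruction
and may differ in detail.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the patch: expand one r5g6b5 pixel
 * to x8r8g8b8, leaving the unused top byte as zero. */
static uint32_t
r5g6b5_to_x8r8g8b8 (uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1f;  /* 5 bits of red   */
    uint32_t g = (p >> 5) & 0x3f;   /* 6 bits of green */
    uint32_t b = p & 0x1f;          /* 5 bits of blue  */

    /* Replicate the high bits into the low bits so that the maximum
     * channel value maps to 0xff and zero stays zero. */
    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);

    return (r << 16) | (g << 8) | b;
}
```

Bit replication is assumed here because it keeps full white at 0x00ffffff
without needing a multiply or a lookup table.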
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=10020565, dst=10020565, speed=32.29 MPix/s
          op=1, src=10020565, dst=20020888, speed=36.82 MPix/s
  after:  op=1, src=10020565, dst=10020565, speed=41.35 MPix/s
          op=1, src=10020565, dst=20020888, speed=49.16 MPix/s