ARM: pipelined NEON implementation of bilinear scaled 'src_8888_8888'
Performance of the inner loop when working with the data in L1 cache:
ARM Cortex-A8: 41 cycles per 4 pixels (no stalls and partial dual issue)
ARM Cortex-A9: 48 cycles per 4 pixels (no stalls)
It might be still possible to improve performance even more on ARM Cortex-A8
with a better use of dual issue.
Benchmark on ARM Cortex-A8 r1p3 @600MHz, 32-bit LPDDR @166MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=
20028888, dst=
20028888, speed=40.38 MPix/s
after: op=1, src=
20028888, dst=
20028888, speed=48.47 MPix/s
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=
20028888, dst=
20028888, speed=79.68 MPix/s
after: op=1, src=
20028888, dst=
20028888, speed=93.11 MPix/s