ARM: optimization for scaled src_0565_0565 with nearest filter
The performance improvement is only about 5% when compared against
C code built with a reasonably good compiler (gcc 4.5.1). But
gcc 4.4 produces approximately 30% slower code here, so assembly
optimization makes sense to avoid depending on compiler quality
and/or optimization options.
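For context, the operation being optimized is a per-scanline
nearest-neighbour copy of r5g6b5 pixels. The following is only a
minimal sketch, assuming 16.16 fixed-point source coordinates; the
function name and signature are hypothetical and are not taken from
the actual C or assembly code:

```c
#include <stdint.h>

/* Hypothetical helper (illustration only): write `width` r5g6b5 pixels
 * to dst, sampling src with nearest-neighbour scaling.
 * vx is the starting source x coordinate and unit_x the per-pixel step,
 * both assumed to be 16.16 fixed-point and non-negative. */
static void
scale_nearest_0565_scanline (uint16_t       *dst,
                             const uint16_t *src,
                             int             width,
                             int32_t         vx,
                             int32_t         unit_x)
{
    while (width-- > 0)
    {
        /* The integer part of vx selects the source pixel. */
        *dst++ = src[vx >> 16];
        vx += unit_x;
    }
}
```

A loop of this shape leaves the compiler to schedule the shift, load
and store on its own, which may be part of why the generated code
quality varies between compiler versions.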
Benchmark from ARM11:
== before ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=34.86 MPix/s
== after ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=36.62 MPix/s
Benchmark from ARM Cortex-A8:
== before ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=89.55 MPix/s
== after ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=94.91 MPix/s