Use more unrolling for scaled src_0565_0565 with nearest filter
Benchmark from Intel Core i7 860:
== before ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=1335.29 MPix/s
== after ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=1550.96 MPix/s
== performance of nonscaled src_0565_0565 operation as a reference ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=2401.31 MPix/s
Benchmark from ARM Cortex-A8:
== before ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=81.79 MPix/s
== after ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=89.55 MPix/s
== performance of nonscaled src_0565_0565 operation as a reference ==
op=1, src_fmt=10020565, dst_fmt=10020565, speed=197.44 MPix/s
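
For illustration only, a minimal sketch of what an unrolled nearest-neighbor
scanline copy for r5g6b5 pixels can look like. This is not the code from this
commit: the function name, the unroll factor of 4 and the 16.16 fixed-point
source coordinate (vx advanced by unit_x per destination pixel) are all
assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical helper, not the actual implementation from this commit.
 * Copies w destination pixels, picking the nearest source pixel for each.
 * vx is the source x coordinate in 16.16 fixed point, unit_x is its step. */
static void
scaled_nearest_scanline_0565 (uint16_t       *dst,
                              const uint16_t *src,
                              int             w,
                              int32_t         vx,
                              int32_t         unit_x)
{
    /* Main loop: 4 pixels per iteration. All loads are issued before the
     * stores so the fetches do not have to wait on each other. */
    while (w >= 4)
    {
        uint16_t s0 = src[vx >> 16]; vx += unit_x;
        uint16_t s1 = src[vx >> 16]; vx += unit_x;
        uint16_t s2 = src[vx >> 16]; vx += unit_x;
        uint16_t s3 = src[vx >> 16]; vx += unit_x;

        dst[0] = s0;
        dst[1] = s1;
        dst[2] = s2;
        dst[3] = s3;

        dst += 4;
        w -= 4;
    }

    /* Tail: 0 to 3 leftover pixels, handled without a loop. */
    if (w & 2)
    {
        uint16_t s0 = src[vx >> 16]; vx += unit_x;
        uint16_t s1 = src[vx >> 16]; vx += unit_x;

        *dst++ = s0;
        *dst++ = s1;
    }
    if (w & 1)
        *dst = src[vx >> 16];
}
```

With vx = 0 and unit_x = 0x8000 (a step of 0.5 source pixels), each source
pixel is written to two consecutive destination pixels, i.e. a 2x upscale.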