SSE2 optimization for scaled over_8888_8888 operation with nearest filter
This is a first demo implementation. It should be possible to
generalize it later to cover more operations with fewer lines of code.
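
For illustration only, here is a minimal sketch of what the core of such
a routine might look like for four pixels at a time. It is not the code
from this patch: the names 'mul_div_255', 'inverse_alpha' and
'over_4_pixels_nearest' are hypothetical, pixels are assumed to be
premultiplied a8r8g8b8, and 'vx' / 'unit_x' are assumed to be
non-negative 16.16 fixed-point values that stay within the source
scanline.

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Per 16-bit lane: (a * b) / 255 with rounding, for a and b in 0..255. */
static inline __m128i
mul_div_255 (__m128i a, __m128i b)
{
    __m128i t = _mm_add_epi16 (_mm_mullo_epi16 (a, b), _mm_set1_epi16 (0x80));
    return _mm_srli_epi16 (_mm_add_epi16 (t, _mm_srli_epi16 (t, 8)), 8);
}

/* px holds two pixels unpacked to 16 bits per channel.  Broadcast each
 * pixel's alpha to its four lanes and return 255 - alpha. */
static inline __m128i
inverse_alpha (__m128i px)
{
    __m128i a = _mm_shufflelo_epi16 (px, _MM_SHUFFLE (3, 3, 3, 3));
    a = _mm_shufflehi_epi16 (a, _MM_SHUFFLE (3, 3, 3, 3));
    return _mm_xor_si128 (a, _mm_set1_epi16 (0xff));
}

/* dst[i] = src[nearest(i)] OVER dst[i] for i = 0..3. */
static void
over_4_pixels_nearest (uint32_t *dst, const uint32_t *src,
                       int32_t vx, int32_t unit_x)
{
    const __m128i zero = _mm_setzero_si128 ();

    /* Nearest filter: the integer part of vx selects the source pixel. */
    uint32_t s0 = src[vx >> 16]; vx += unit_x;
    uint32_t s1 = src[vx >> 16]; vx += unit_x;
    uint32_t s2 = src[vx >> 16]; vx += unit_x;
    uint32_t s3 = src[vx >> 16];

    __m128i s = _mm_set_epi32 ((int) s3, (int) s2, (int) s1, (int) s0);
    __m128i d = _mm_loadu_si128 ((const __m128i *) dst);

    /* Widen 8-bit channels to 16 bits, two pixels per register. */
    __m128i s_lo = _mm_unpacklo_epi8 (s, zero);
    __m128i s_hi = _mm_unpackhi_epi8 (s, zero);
    __m128i d_lo = _mm_unpacklo_epi8 (d, zero);
    __m128i d_hi = _mm_unpackhi_epi8 (d, zero);

    /* dst * (255 - src_alpha) / 255 */
    d_lo = mul_div_255 (d_lo, inverse_alpha (s_lo));
    d_hi = mul_div_255 (d_hi, inverse_alpha (s_hi));

    /* Repack to 8 bits and add the source with unsigned saturation. */
    d = _mm_adds_epu8 (s, _mm_packus_epi16 (d_lo, d_hi));
    _mm_storeu_si128 ((__m128i *) dst, d);
}
```

A real scanline function would also need to handle leading and trailing
pixels that do not fill a group of four, and could skip groups whose
source pixels are fully transparent or fully opaque.
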
It should also be possible to use the '__builtin_constant_p' GCC
builtin to check efficiently whether 'unit_x' is known to be zero at
compile time (when processing padding pixels for NONE or PAD repeat).
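
A rough scalar sketch of that idea follows. It is an assumption about
how such a check could be written, not code from this patch: the names
'over_pixel' and 'scanline_nearest_over' are hypothetical, pixels are
assumed to be premultiplied a8r8g8b8, and 'vx' / 'unit_x' are assumed to
be non-negative 16.16 fixed-point values. The special case relies on the
function being inlined into a caller that passes a literal 0; otherwise
'__builtin_constant_p' evaluates to 0 and the generic loop is used,
which produces the same result.

```c
#include <stdint.h>

/* Premultiplied OVER for one pixel: src + dst * (255 - src_alpha) / 255. */
static inline uint32_t
over_pixel (uint32_t src, uint32_t dst)
{
    uint32_t ia = 255 - (src >> 24);
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t t = ((dst >> shift) & 0xff) * ia + 0x80;
        t = ((t + (t >> 8)) >> 8) + ((src >> shift) & 0xff);
        if (t > 255)
            t = 255;                    /* saturate the channel */
        result |= t << shift;
    }
    return result;
}

static inline __attribute__ ((always_inline)) void
scanline_nearest_over (uint32_t *dst, const uint32_t *src,
                       int32_t w, int32_t vx, int32_t unit_x)
{
    if (__builtin_constant_p (unit_x) && unit_x == 0)
    {
        /* Padding pixels: every destination pixel reads the same source
         * pixel, so fetch it once and drop the vx bookkeeping. */
        uint32_t s = src[vx >> 16];

        while (w-- > 0)
        {
            *dst = over_pixel (s, *dst);
            dst++;
        }
        return;
    }

    /* Generic path: step through the source in 16.16 fixed point. */
    while (w-- > 0)
    {
        uint32_t s = src[vx >> 16];

        vx += unit_x;
        *dst = over_pixel (s, *dst);
        dst++;
    }
}
```

With this shape, a caller handling the padding area can pass a literal 0
for 'unit_x' and let the compiler discard the generic loop, while other
callers pay no run-time cost for the check.
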
Benchmarks from Intel Core i7 860:
== before (nearest OVER) ==
op=3, src_fmt=20028888, dst_fmt=20028888, speed=142.01 MPix/s
== after (nearest OVER) ==
op=3, src_fmt=20028888, dst_fmt=20028888, speed=314.99 MPix/s
== performance of nonscaled operation as a reference ==
op=3, src_fmt=20028888, dst_fmt=20028888, speed=652.09 MPix/s