Faster fetch for the C variant of r5g6b5 src/dest iterator
Processing two pixels at once is used to reduce the number of
arithmetic operations.
The speedup relative to the generic fetch_scanline_r5g6b5() from
"pixman-access.c" (pixman was compiled with gcc 4.7.2):
MIPS 74K 480MHz : 20.32 MPix/s -> 26.47 MPix/s
ARM11 700MHz : 34.95 MPix/s -> 38.22 MPix/s
ARM Cortex-A8 1000MHz : 87.44 MPix/s -> 100.92 MPix/s
ARM Cortex-A9 1700MHz : 150.95 MPix/s -> 158.13 MPix/s
ARM Cortex-A15 1700MHz : 148.91 MPix/s -> 155.42 MPix/s
IBM Cell PPU 3200MHz : 75.29 MPix/s -> 98.33 MPix/s
Intel Core i7 2800MHz : 257.02 MPix/s -> 376.93 MPix/s
That's the performance for C code (SIMD and assembly optimizations
are disabled via PIXMAN_DISABLE environment variable).