sse2: faster bilinear scaling (use _mm_loadl_epi64)
Using _mm_loadl_epi64() to load two horizontally adjacent pixels at
once (one pair from the top scanline and one pair from the bottom
scanline) is faster than loading each pixel separately and combining
them with _mm_set_epi32().
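As a rough illustration of the difference (a minimal sketch, not the
actual patch), the two ways of loading a pixel pair might look like the
code below. The helper names load_pair_set_epi32(),
load_pair_loadl_epi64() and pairs_equal() are hypothetical, and the
sketch assumes 32-bit pixels stored contiguously in a scanline on a
little-endian x86 target.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: load each 32-bit pixel separately, then combine.
 * The last argument of _mm_set_epi32() becomes the lowest lane. */
static __m128i
load_pair_set_epi32 (const uint32_t *row, int x)
{
    return _mm_set_epi32 (0, 0, (int) row[x + 1], (int) row[x]);
}

/* Hypothetical helper: a single 64-bit load fetches both adjacent
 * pixels into the low half of the register; the upper half is zeroed. */
static __m128i
load_pair_loadl_epi64 (const uint32_t *row, int x)
{
    return _mm_loadl_epi64 ((const __m128i *) &row[x]);
}

/* Hypothetical helper: returns nonzero if all four 32-bit lanes match. */
static int
pairs_equal (__m128i a, __m128i b)
{
    return _mm_movemask_epi8 (_mm_cmpeq_epi32 (a, b)) == 0xFFFF;
}
```

Both helpers produce the same register contents, so the bilinear code
would call one of them for the top scanline and again for the bottom
scanline. The second form replaces two scalar loads and a combine step
with one load instruction.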
=== cairo-perf-trace ===

before: image firefox-fishtank 66.912 66.931 0.13% 3/3
after:  image firefox-fishtank 57.584 58.349 0.74% 3/3

=== lowlevel-blt-bench ===

before: src_8888_8888 = L1: 181.10 L2: 179.14 M:178.08 ( 11.02%) HT:153.22 VT:133.45 R:142.24 RT: 95.32
after:  src_8888_8888 = L1: 228.68 L2: 225.75 M:223.98 ( 14.23%) HT:185.32 VT:155.06 R:162.73 RT:102.52
This improvement was suggested by Matt Turner on IRC.