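sse2: _mm_madd_epi16 for faster bilinear scaling with 7-bit precision

Reducing the interpolation precision from 8 to 7 bits allows the use of
the PMADDWD instruction. PMADDWD multiplies signed 16-bit values, and with
7-bit weights the vertically interpolated intermediates are at most
255 * 128 = 32640, which still fits in that range. This makes bilinear
scaling much faster (measured on an Intel Core i7):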

8-bit: image firefox-fishtank 57.584 58.349 0.74% 3/3
7-bit: image firefox-fishtank 51.139 51.229 0.30% 3/3

8-bit: src_8888_8888 = L1: 228.71 L2: 226.52 M:224.82 ( 14.95%) HT:183.22 VT:154.02 R:171.72 RT:109.36
7-bit: src_8888_8888 = L1: 320.45 L2: 317.43 M:314.38 ( 20.77%) HT:215.13 VT:177.35 R:204.46 RT:121.93
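
For illustration only, here is a minimal sketch of how a single a8r8g8b8
pixel could be interpolated with 7-bit weights and _mm_madd_epi16. It is
not the code from this patch: the function name bilinear_7bit_sse2, the
BILINEAR_BITS macro and the argument layout are assumptions made for the
example. The vertical pass uses 16-bit multiplies, and the horizontal pass
is a single PMADDWD per pixel because every intermediate stays at or below
32640.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical names, chosen for this sketch only. */
#define BILINEAR_BITS 7
#define BILINEAR_ONE  (1 << BILINEAR_BITS)

/*
 * Interpolate one 32-bit pixel from its four neighbours.
 * wx and wy are the right and bottom weights in the range 0..127.
 */
static uint32_t
bilinear_7bit_sse2 (uint32_t tl, uint32_t tr,
                    uint32_t bl, uint32_t br,
                    int wx, int wy)
{
    const __m128i zero = _mm_setzero_si128 ();

    /* Interleave left/right bytes: tl0 tr0 tl1 tr1 tl2 tr2 tl3 tr3 */
    __m128i top = _mm_unpacklo_epi8 (_mm_cvtsi32_si128 ((int) tl),
                                     _mm_cvtsi32_si128 ((int) tr));
    __m128i bot = _mm_unpacklo_epi8 (_mm_cvtsi32_si128 ((int) bl),
                                     _mm_cvtsi32_si128 ((int) br));

    /* Widen the eight bytes to eight 16-bit lanes */
    top = _mm_unpacklo_epi8 (top, zero);
    bot = _mm_unpacklo_epi8 (bot, zero);

    /* Vertical pass: top * (128 - wy) + bottom * wy, at most 255 * 128 */
    __m128i wt = _mm_set1_epi16 ((short) (BILINEAR_ONE - wy));
    __m128i wb = _mm_set1_epi16 ((short) wy);
    __m128i v  = _mm_add_epi16 (_mm_mullo_epi16 (top, wt),
                                _mm_mullo_epi16 (bot, wb));

    /*
     * Horizontal pass: each 32-bit lane becomes
     * left * (128 - wx) + right * wx via one PMADDWD.
     * This is only valid because v fits in a signed 16-bit lane.
     */
    __m128i w = _mm_set1_epi32 ((wx << 16) | (BILINEAR_ONE - wx));
    __m128i h = _mm_madd_epi16 (v, w);

    /* Drop the 2 * 7 fractional bits and pack back to 8-bit channels */
    h = _mm_srli_epi32 (h, 2 * BILINEAR_BITS);
    h = _mm_packs_epi32 (h, h);
    h = _mm_packus_epi16 (h, h);

    return (uint32_t) _mm_cvtsi128_si32 (h);
}
```

With 8-bit weights the vertical pass could produce values up to
255 * 256 = 65280, which PMADDWD would treat as negative, so the same
horizontal step would not be usable without extra work.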