sse2: add src_x888_0565
Port of
2ddd1c498b to SSE2.
Uses the pmadd technique described in
http://software.intel.com/sites/landingpage/legacy/mmx/MMX_App_24-16_Bit_Conversion.pdf
Works around lack of packusdw instruction by first sign extending the
values.
fast: src_8888_0565 = L1: 681.40 L2: 689.20 M: 644.76 ( 25.51%) HT:404.42 VT:288.04 R:306.07 RT:150.80 (1619Kops/s)
mmx: src_8888_0565 = L1:2056.03 L2:1985.44 M:1574.91 ( 61.87%) HT:533.10 VT:376.35 R:416.10 RT:178.79 (1833Kops/s)
sse2: src_8888_0565 = L1:3793.42 L2:3653.44 M:1878.83 ( 73.94%) HT:535.03 VT:407.96 R:421.46 RT:163.31 (1727Kops/s)
and for reference, using packusdw
sse4: src_8888_0565 = L1:4396.18 L2:4229.25 M:1904.04 ( 75.18%) HT:559.79 VT:427.96 R:440.06 RT:165.71 (1744Kops/s)
Notice that MMX is faster in the RT case because it can operate on
8-bytes instead of the current 16-bytes for SSE2.