MIPS: DSPr2: Added several bilinear fast paths with a8 mask
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench -b
Referent (before):
src_8888_8_8888 = L1: 6.37 L2: 6.08 M: 5.46 ( 32.57%) HT: 4.64 VT: 4.61 R: 4.52 RT: 2.85 ( 23Kops/s)
src_8888_8_0565 = L1: 5.89 L2: 5.66 M: 5.11 ( 23.71%) HT: 4.36 VT: 4.34 R: 4.26 RT: 2.71 ( 22Kops/s)
src_0565_8_x888 = L1: 3.32 L2: 3.27 M: 3.17 ( 14.71%) HT: 2.86 VT: 2.84 R: 2.81 RT: 2.07 ( 19Kops/s)
src_0565_8_0565 = L1: 3.19 L2: 3.15 M: 3.05 ( 10.11%) HT: 2.75 VT: 2.74 R: 2.71 RT: 2.00 ( 18Kops/s)
over_8888_8_8888 = L1: 4.99 L2: 4.71 M: 4.11 ( 27.22%) HT: 3.59 VT: 3.58 R: 3.50 RT: 2.36 ( 21Kops/s)
add_8888_8_8888 = L1: 5.60 L2: 5.26 M: 4.52 ( 29.95%) HT: 3.92 VT: 3.89 R: 3.80 RT: 2.49 ( 21Kops/s)
Optimized:
src_8888_8_8888 = L1: 13.19 L2: 12.13 M: 9.75 ( 58.22%) HT: 8.60 VT: 8.44 R: 7.90 RT: 5.06 ( 33Kops/s)
src_8888_8_0565 = L1: 11.64 L2: 10.81 M: 9.18 ( 42.63%) HT: 8.04 VT: 7.90 R: 7.57 RT: 5.02 ( 32Kops/s)
src_0565_8_x888 = L1: 8.34 L2: 7.95 M: 7.29 ( 33.85%) HT: 6.55 VT: 6.48 R: 6.25 RT: 4.35 ( 30Kops/s)
src_0565_8_0565 = L1: 7.71 L2: 7.35 M: 6.90 ( 22.90%) HT: 6.14 VT: 6.10 R: 5.94 RT: 4.07 ( 29Kops/s)
over_8888_8_8888 = L1: 9.73 L2: 8.99 M: 7.15 ( 47.41%) HT: 6.40 VT: 6.30 R: 6.11 RT: 4.28 ( 30Kops/s)
add_8888_8_8888 = L1: 13.01 L2: 11.72 M: 8.70 ( 57.68%) HT: 7.59 VT: 7.46 R: 7.20 RT: 4.74 ( 32Kops/s)