ARM: NEON: added forgotten cache preload for over_n_8888/over_n_0565
Prefetching improves performance by up to 40-50% when working
with large images and/or when there are many L2 cache misses.
Benchmark results on ARM Cortex-A8 @ 720MHz:
== before ==
over_n_8888 = L1: 225.83  L2: 181.02  M:  55.57 ( 41.41%)
              HT:  38.96  VT:  36.92  R:  32.84  RT: 14.15 ( 123Kops/s)
over_n_0565 = L1: 153.91  L2: 149.69  M:  83.17 ( 30.95%)
              HT:  50.41  VT:  49.15  R:  40.56  RT: 15.45 ( 131Kops/s)
== after ==
over_n_8888 = L1: 222.39  L2: 170.95  M:  76.86 ( 57.27%)
              HT:  58.80  VT:  53.03  R:  45.51  RT: 14.13 ( 124Kops/s)
over_n_0565 = L1: 151.87  L2: 149.54  M: 125.63 ( 46.80%)
              HT:  67.85  VT:  57.54  R:  50.21  RT: 15.32 ( 130Kops/s)
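
For readers unfamiliar with the technique, here is a rough scalar C sketch of
what "cache preload" means for a solid-source OVER operation on a 32-bit
destination. It is not the NEON code touched by this change. The names
`over_n_8888_scanline`, `over_8888`, `mul_un8` and `PREFETCH_DISTANCE_PIXELS`
are hypothetical, and the prefetch distance and the 64-byte cache line are
assumptions, not tuned values. `__builtin_prefetch` is the GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning value: how many pixels ahead of the current one to
 * prefetch. A real implementation would tune this per CPU. */
#define PREFETCH_DISTANCE_PIXELS 64

/* Multiply two 8-bit values and divide by 255 with rounding. */
static inline uint32_t mul_un8(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 0x80;
    return (t + (t >> 8)) >> 8;
}

/* OVER for premultiplied a8r8g8b8 pixels:
 * result = src + dst * (255 - src_alpha) / 255, saturated per channel. */
static inline uint32_t over_8888(uint32_t src, uint32_t dst)
{
    uint32_t inv_alpha = 255 - (src >> 24);
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xff;
        uint32_t d = (dst >> shift) & 0xff;
        uint32_t c = s + mul_un8(d, inv_alpha);
        if (c > 255)
            c = 255;
        result |= c << shift;
    }
    return result;
}

/* Composite a solid color over one scanline. The destination is both read
 * and written, so it is the only buffer worth prefetching here. */
void over_n_8888_scanline(uint32_t *dst, uint32_t src, size_t width)
{
    for (size_t i = 0; i < width; i++) {
        /* Issue one prefetch per 16 pixels (64 bytes, the assumed cache
         * line size) and never for addresses past the end of the line. */
        if ((i & 15) == 0 && i + PREFETCH_DISTANCE_PIXELS < width)
            __builtin_prefetch(&dst[i + PREFETCH_DISTANCE_PIXELS], 1, 3);
        dst[i] = over_8888(src, dst[i]);
    }
}
```

The prefetch only changes when data arrives in the cache, not the result, which
is consistent with the numbers above: the L1 and L2 cases are roughly
unchanged, while the cases that miss the cache (M, HT, VT, R) improve.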