ARM: Add 'neon_composite_over_n_8888_0565_ca' fast path
This speeds up the firefox-talos-gfx benchmark with the image16 backend
by about 1.43x (median time down roughly 30%). Benchmark results on an
800 MHz ARM Cortex-A8:
Before:
[ # ]  backend  test               min(s)   median(s)  stddev.  count
[ 0]   image16  firefox-talos-gfx  121.773  122.218    0.15%    6/6
After:
[ # ]  backend  test               min(s)   median(s)  stddev.  count
[ 0]   image16  firefox-talos-gfx  85.247   85.563     0.22%    6/6
V2: Slightly improve instruction scheduling, based on comments from Taekyun Kim.
V3: Eliminate all stalls from the inner loop, also based on comments from Taekyun Kim.