Use ABD and UDOT to implement Neon sad_4d functions

Implementing sad16_neon with ABD and UDOT instead of ABAL and ABAL2
saves a cycle and removes resource contention for a single SIMD pipe
on modern out-of-order Arm CPUs. Because UDOT accumulates into 32-bit
elements, it also allows for a faster reduction at the end of each
SAD function.
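
For illustration only, a minimal sketch of the ABD plus UDOT pattern
for a single 16-pixel-wide block. This is not the code in this change:
the name sad16xh_sketch, its signature and the scalar fallback are
hypothetical, and the Neon path is only compiled when the toolchain
defines __ARM_FEATURE_DOTPROD for an AArch64 target.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
#define SAD_SKETCH_USE_UDOT 1
#else
#define SAD_SKETCH_USE_UDOT 0
#endif

/* Hypothetical helper (not a libvpx function): sum of absolute
 * differences over a block that is 16 pixels wide and h rows tall. */
static uint32_t sad16xh_sketch(const uint8_t *src, int src_stride,
                               const uint8_t *ref, int ref_stride, int h) {
  assert(h > 0);
#if SAD_SKETCH_USE_UDOT
  const uint8x16_t ones = vdupq_n_u8(1);
  uint32x4_t sum = vdupq_n_u32(0);
  for (int i = 0; i < h; ++i) {
    const uint8x16_t s = vld1q_u8(src + i * src_stride);
    const uint8x16_t r = vld1q_u8(ref + i * ref_stride);
    /* ABD: per-byte absolute difference, result stays 8 bits wide. */
    const uint8x16_t diff = vabdq_u8(s, r);
    /* UDOT against a vector of ones: each 32-bit lane accumulates the
     * sum of four adjacent byte differences. */
    sum = vdotq_u32(sum, diff, ones);
  }
  /* Only four 32-bit lanes remain, so one across-vector add finishes. */
  return vaddvq_u32(sum);
#else
  /* Portable scalar fallback so the sketch builds on any target. */
  uint32_t sum = 0;
  for (int i = 0; i < h; ++i) {
    const uint8_t *s = src + i * src_stride;
    const uint8_t *r = ref + i * ref_stride;
    for (int j = 0; j < 16; ++j) {
      const int d = (int)s[j] - (int)r[j];
      sum += (uint32_t)(d < 0 ? -d : d);
    }
  }
  return sum;
#endif
}
```

In this sketch the accumulator is already in 32-bit lanes when the
loop ends, so the final reduction is a single across-vector add. An
ABAL-based loop instead accumulates into 16-bit lanes, which need
widening before they can be summed.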

The existing implementation is retained for CPUs that do not
implement the Armv8.4-A UDOT instruction, and for CPUs executing in
AArch32 mode.

Bug: b/181236880
Change-Id: Ibd0da46e86751d2f808c7b1e424f82b046a1aa6f