Implement d153_predictor using Neon

Add Neon implementations of the d153 predictor for 4x4, 8x8, 16x16 and
32x32 block sizes. Also add the corresponding test cases.

Speedups over the C code (higher is better):

Microarch.  | Compiler | Block | Speedup
Neoverse N1 |  LLVM 15 |   4x4 |    1.59
Neoverse N1 |  LLVM 15 |   8x8 |    4.46
Neoverse N1 |  LLVM 15 | 16x16 |    8.77
Neoverse N1 |  LLVM 15 | 32x32 |   15.21
Neoverse N1 |   GCC 12 |   4x4 |    1.90
Neoverse N1 |   GCC 12 |   8x8 |    4.70
Neoverse N1 |   GCC 12 | 16x16 |    9.55
Neoverse N1 |   GCC 12 | 32x32 |    5.95
Neoverse V1 |  LLVM 15 |   4x4 |    2.89
Neoverse V1 |  LLVM 15 |   8x8 |    6.94
Neoverse V1 |  LLVM 15 | 16x16 |   10.20
Neoverse V1 |  LLVM 15 | 32x32 |   15.63
Neoverse V1 |   GCC 12 |   4x4 |    4.45
Neoverse V1 |   GCC 12 |   8x8 |    7.71
Neoverse V1 |   GCC 12 | 16x16 |    9.08
Neoverse V1 |   GCC 12 | 32x32 |    7.93

Change-Id: I910692b14917cde8a8952fab5b9c78bed7f7c6ad