Use sum_neon.h helpers in Neon DC predictors
Use sum_neon.h helpers for horizontal reductions in Neon DC predictors,
enabling use of dedicated Neon reduction instructions on AArch64. Some
of the surrounding code is also optimized to remove redundant broadcast
instructions in the dc_store helpers.
Performance is largely unchanged on both the standard as well as the
high bit-depth predictors. The main improvement appears to be the 16x16
standard-bitdepth dc predictor, which improves by 10-15% when
benchmarked on Neoverse N1.
Change-Id: Ibfcc6ecf4b1b2f87ce1e1f63c314d0cc35a0c76f