svolume: Fix ARM alignment issues
As Peter Meerwald <p.meerwald@bct-electronic.com> discovered, our ARM
svolume code performance is quite terrible when the incoming samples are
not word-aligned. This can very easily be the case, since the
architecture only requires that the samples be 16-bit aligned, and we
might end up running the innermost loop after processing modulo-4
samples. The performance degradation was ~50x on a Cortex A9
(Pandaboard).
This reworks the svolume logic to first consume enough samples to make
sure the rest is word aligned, and reordering the processing to work
with 4 samples at a time first, and then finally deal with the
remainder.
With this, performance is comparable for arbitrary alignments (~3x
faster than the C code).