Improve convolve AVX2 intrinsic for speed
This CL refactors the code related to convolve function.
Furthermore, improved the AVX2 intrinsic to compute
convolve vertical for w = 4 case, and convolve horiz for
w = 16 case.
Please note the module level scaling w.r.t C function
(timer based) for existing (AVX2) and new AVX2 intrinsics:
Block Scaling
Size AVX2 AVX2
(existing) (New)
4x4 5.34x 5.91x
4x8 7.10x 7.79x
16x8 23.52x 25.63x
16x16 29.47x 30.22x
16x32 33.42x 33.44x
This is a bit exact change.
Change-Id: If130183bc12faab9ca2bcec0ceeaa8d0af05e413