Improve convolve AVX2 intrinsic for speed
author Anupam Pandey <anupam.pandey@ittiam.com>
Fri, 12 May 2023 05:26:45 +0000 (10:56 +0530)
committer Anupam Pandey <anupam.pandey@ittiam.com>
Wed, 17 May 2023 08:54:34 +0000 (14:24 +0530)
commit e6b9a8d667bb43c58437bb1d6204ffc8047252ac
tree de1339461b2aa820da080e337b15670e1517f08c
parent 9e0fc37f6f68685066f3e71e1cd0605d6ee2205e

This CL refactors the convolve code and improves the AVX2
intrinsics for two cases: vertical convolve with w = 4 and
horizontal convolve with w = 16.

Module-level speedup relative to the C function (timer based)
for the existing and new AVX2 intrinsics:

Block    Scaling w.r.t. C
size     AVX2 (existing)   AVX2 (new)
4x4       5.34x             5.91x
4x8       7.10x             7.79x
16x8     23.52x            25.63x
16x16    29.47x            30.22x
16x32    33.42x            33.44x

This is a bit-exact change.
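
For readers unfamiliar with the term, bit-exact means the optimized
path produces output identical, byte for byte, to the reference path.
The sketch below is illustrative only and is not the code in this CL:
convolve8_vert_ref, clip_pixel and is_bit_exact are hypothetical names,
and the convention that src points at the first of the eight input rows
is an assumption. FILTER_BITS = 7 reflects the 7-bit filter precision
(taps summing to 128) used by the libvpx subpixel filters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILTER_BITS 7 /* filter taps are assumed to sum to 1 << 7 = 128 */

/* Clamp an intermediate value to the 8-bit pixel range. */
static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical scalar 8-tap vertical filter. Assumes src points at the
 * first of the 8 rows that contribute to output row 0. The right shift
 * of a negative sum relies on arithmetic shift, as on common compilers. */
static void convolve8_vert_ref(const uint8_t *src, ptrdiff_t src_stride,
                               uint8_t *dst, ptrdiff_t dst_stride,
                               const int16_t *filter, int w, int h) {
  assert(w > 0 && h > 0);
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      int sum = 0;
      for (int k = 0; k < 8; ++k)
        sum += src[(y + k) * src_stride + x] * filter[k];
      /* Round to nearest, then drop the filter precision bits. */
      dst[y * dst_stride + x] =
          clip_pixel((sum + (1 << (FILTER_BITS - 1))) >> FILTER_BITS);
    }
  }
}

/* Returns 1 when two w x h blocks with the same stride match exactly. */
static int is_bit_exact(const uint8_t *a, const uint8_t *b, ptrdiff_t stride,
                        int w, int h) {
  for (int y = 0; y < h; ++y)
    if (memcmp(a + y * stride, b + y * stride, (size_t)w) != 0) return 0;
  return 1;
}
```

A check along these lines would run the reference and the optimized
function on the same input and filter, then require is_bit_exact to
return 1 for every block size of interest.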

Change-Id: If130183bc12faab9ca2bcec0ceeaa8d0af05e413
vpx_dsp/x86/vpx_subpixel_8t_intrin_avx2.c