[NEON] Optimize highbd 32x32 DCT
author Konstantinos Margaritis <konstantinos@vectorcamp.gr>
Wed, 26 Oct 2022 22:09:32 +0000 (22:09 +0000)
committer Konstantinos Margaritis <konstantinos@vectorcamp.gr>
Thu, 3 Nov 2022 17:55:13 +0000 (17:55 +0000)
commit 3f08aa0d0b2828b670073f808ae079acb35902a4
tree c5559ff1d0b40d35a36392b76ffef8247bc128d6
parent f02a1191004e6190cfbb6efc38363f9f166d0256

For --best quality, the resulting function
vpx_highbd_fdct32x32_rd_neon takes 0.27% of CPU time in
profiling, vs. 6.27% for the sum of the scalar rd functions
(vpx_fdct32, vpx_fdct32.constprop.0 and vpx_fdct32x32_rd_c).
For --rt quality, the NEON function takes 0.19% vs. 4.57% for the
scalar version.
Overall, this improves highbd encoding time by ~6% for --best
and ~9% for --rt.

Change-Id: I1ce4bbef6e364bbadc76264056aa3f86b1a8edc5
vpx_dsp/arm/fdct32x32_neon.c
vpx_dsp/arm/fdct32x32_neon.h
vpx_dsp/arm/fdct_neon.h
vpx_dsp/vpx_dsp_rtcd_defs.pl