James Zern [Wed, 29 Mar 2023 20:52:46 +0000 (20:52 +0000)]
Merge changes Ie4ffa298,If5ec220a,I670dc379 into main
* changes:
Avoid LD2/ST2 instructions in highbd v predictors in Neon
Avoid interleaving loads/stores in Neon for highbd dc predictor
Avoid LD2/ST2 instructions in vpx_dc_predictor_32x32_neon
Jerome Jiang [Wed, 29 Mar 2023 18:24:18 +0000 (18:24 +0000)]
Merge "svc: Fix a case where target bandwidth is 0" into main
Jerome Jiang [Wed, 29 Mar 2023 17:06:19 +0000 (13:06 -0400)]
svc: Fix a case where target bandwidth is 0
Bug: webrtc:15033
Change-Id: Iea2997c2ce8982f106a1eed3ec4f7dd1c6e83666
George Steed [Wed, 22 Mar 2023 11:49:33 +0000 (11:49 +0000)]
Avoid LD2/ST2 instructions in highbd v predictors in Neon
The interleaving load/store instructions (LD2/LD3/LD4 and ST2/ST3/ST4)
are useful if we are dealing with interleaved data (e.g. real/imag
components of complex numbers), but for simply loading or storing larger
quantities of data it is preferable to simply use the normal load/store
instructions.
This patch replaces such occurrences in the two larger block sizes:
vpx_highbd_v_predictor_16x16_neon and vpx_highbd_v_predictor_32x32_neon.
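Illustration only, not code from this patch: the scalar C model below shows what the two kinds of load/store do to 16 values of uint16_t. All function names (model_ld2, model_st2, model_ld1x2, model_st1x2, copies_match) are hypothetical. For a plain copy, both routes write back identical data, so the de-interleave and re-interleave steps of the LD2/ST2 route do no useful work.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of an LD2-style load of 16 uint16_t values: elements are
 * de-interleaved, so r0 gets the even indices and r1 the odd indices. */
static void model_ld2(const uint16_t *src, uint16_t r0[8], uint16_t r1[8]) {
  int i;
  assert(src != NULL);
  for (i = 0; i < 8; ++i) {
    r0[i] = src[2 * i];
    r1[i] = src[2 * i + 1];
  }
}

/* Scalar model of the matching ST2-style store: re-interleaves r0 and r1. */
static void model_st2(uint16_t *dst, const uint16_t r0[8],
                      const uint16_t r1[8]) {
  int i;
  for (i = 0; i < 8; ++i) {
    dst[2 * i] = r0[i];
    dst[2 * i + 1] = r1[i];
  }
}

/* Scalar model of two plain loads: each register holds 8 consecutive
 * values, with no shuffling. */
static void model_ld1x2(const uint16_t *src, uint16_t r0[8], uint16_t r1[8]) {
  assert(src != NULL);
  memcpy(r0, src, 8 * sizeof(*src));
  memcpy(r1, src + 8, 8 * sizeof(*src));
}

/* Scalar model of two plain stores. */
static void model_st1x2(uint16_t *dst, const uint16_t r0[8],
                        const uint16_t r1[8]) {
  memcpy(dst, r0, 8 * sizeof(*dst));
  memcpy(dst + 8, r1, 8 * sizeof(*dst));
}

/* Copies 16 values both ways and returns 1 if the results are identical. */
static int copies_match(const uint16_t *src) {
  uint16_t r0[8], r1[8], a[16], b[16];
  model_ld2(src, r0, r1);
  model_st2(a, r0, r1);
  model_ld1x2(src, r0, r1);
  model_st1x2(b, r0, r1);
  return memcmp(a, b, sizeof(a)) == 0;
}
```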
Change-Id: Ie4ffa298a2466ceaf893566fd0aefe3f66f439e4
George Steed [Wed, 22 Mar 2023 08:44:26 +0000 (08:44 +0000)]
Avoid interleaving loads/stores in Neon for highbd dc predictor
The interleaving load/store instructions (LD2/LD3/LD4 and ST2/ST3/ST4)
are useful if we are dealing with interleaved data (e.g. real/imag
components of complex numbers), but for simply loading or storing larger
quantities of data it is preferable to simply use two or more of the
normal load/store instructions.
This patch replaces such occurrences in the two larger block sizes:
vpx_highbd_dc_predictor_16x16_neon, vpx_highbd_dc_predictor_32x32_neon,
and related helper functions.
Speedups over the original Neon code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 16x16 | 1.25
Neoverse N1 | LLVM 15 | 32x32 | 1.13
Neoverse N1 | GCC 12 | 16x16 | 1.56
Neoverse N1 | GCC 12 | 32x32 | 1.52
Neoverse V1 | LLVM 15 | 16x16 | 1.63
Neoverse V1 | LLVM 15 | 32x32 | 1.08
Neoverse V1 | GCC 12 | 16x16 | 1.59
Neoverse V1 | GCC 12 | 32x32 | 1.37
Change-Id: If5ec220aba9dd19785454eabb0f3d6affec0cc8b
George Steed [Tue, 21 Mar 2023 14:31:50 +0000 (14:31 +0000)]
Avoid LD2/ST2 instructions in vpx_dc_predictor_32x32_neon
The LD2 and ST2 instructions are useful if we are dealing with
interleaved data (e.g. real/imag components of complex numbers), but for
simply loading or storing larger quantities of data it is preferable to
simply use two of the normal load/store instructions.
This patch replaces such occurrences in vpx_dc_predictor_32x32_neon and
related functions.
With Clang-15 this speeds up the function by 10-30%, and with GCC-12 by
40-60%, depending on the micro-architecture benchmarked.
Change-Id: I670dc37908aa238f360104efd74d6c2108ecf945
Yunqing Wang [Tue, 28 Mar 2023 22:14:51 +0000 (22:14 +0000)]
Merge "Add AVX2 for convolve vertical filter for block width 4" into main
James Zern [Tue, 28 Mar 2023 20:14:12 +0000 (20:14 +0000)]
Merge changes If83ff1ad,I8fb00a15,Iaad58e77,Iac166d60 into main
* changes:
Randomize second half of above_row_ in intrapred tests for Neon
Allow non-uniform above array in d63 predictor Neon impl
Allow non-uniform above array in d45 predictor Neon impl
Allow non-uniform above array in highbd d45 predictor Neon impl
James Zern [Tue, 28 Mar 2023 18:36:01 +0000 (18:36 +0000)]
Merge "update libwebm to libwebm-1.0.0.29-9-g1930e3c" into main
Jerome Jiang [Tue, 28 Mar 2023 14:09:16 +0000 (10:09 -0400)]
svc: Fix a case where target bandwidth is 0
Bug: webrtc:15033
Change-Id: I28636de66842671b03284408186c4c18254109a5
George Steed [Fri, 17 Mar 2023 20:00:24 +0000 (20:00 +0000)]
Randomize second half of above_row_ in intrapred tests for Neon
The existing tests duplicate `above_row_[block_size - 1]` after the
first `block_size` elements, which can lead to tests incorrectly passing
due to differing behaviour when calculating the average for the last
elements of the output.
This change adjusts the above array setup to be fully random instead,
allowing us to catch such issues here rather than in other larger tests
like the external MD5 tests.
It doesn't appear that other architectures are fully clean with this
change so restrict it to just Neon for now until they are fixed.
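Illustration only, not code from the tests: the sketch below shows how a duplicated second half can hide a mismatch. avg3_shortcut is a hypothetical implementation that wrongly assumes every element from index `bs - 1` onwards equals `above[bs - 1]`; avg3_ref reads the array as given. AVG3 is written here as the usual rounded 3-tap average. The two agree on a padded row and differ on a fully random one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define AVG3(a, b, c) (((a) + 2 * (b) + (c) + 2) >> 2)

/* Reference: 3-tap average over the row as given; reads above[0 .. n + 1]. */
static void avg3_ref(const uint8_t *above, int n, uint8_t *out) {
  int i;
  for (i = 0; i < n; ++i) {
    out[i] = (uint8_t)AVG3(above[i], above[i + 1], above[i + 2]);
  }
}

/* Hypothetical shortcut: clamps every index to bs - 1, i.e. it assumes the
 * second half of the row only repeats above[bs - 1]. */
static void avg3_shortcut(const uint8_t *above, int bs, int n, uint8_t *out) {
  int i;
  for (i = 0; i < n; ++i) {
    const int i0 = i < bs ? i : bs - 1;
    const int i1 = i + 1 < bs ? i + 1 : bs - 1;
    const int i2 = i + 2 < bs ? i + 2 : bs - 1;
    out[i] = (uint8_t)AVG3(above[i0], above[i1], above[i2]);
  }
}

/* Returns 1 if both variants produce the same 2 * bs - 2 outputs.
 * `above` must hold 2 * bs elements. */
static int outputs_match(const uint8_t *above, int bs) {
  uint8_t ref[64], opt[64];
  const int n = 2 * bs - 2;
  assert(bs >= 2 && bs <= 32);
  avg3_ref(above, n, ref);
  avg3_shortcut(above, bs, n, opt);
  return memcmp(ref, opt, (size_t)n) == 0;
}
```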
Bug: webm:1797
Change-Id: If83ff1adbf1e8d30f2a92474d7186c65840a5d0b
George Steed [Fri, 17 Mar 2023 19:55:17 +0000 (19:55 +0000)]
Allow non-uniform above array in d63 predictor Neon impl
The existing standard bitdepth implementation doesn't appear to manifest
as a failure in any of the predictor or MD5 tests, but it does rely on
the predictor tests filling the second `bs` elements of the `above`
input array with copies of `above[bs - 1]` in order to match the C
implementation.
This patch adjusts the Neon implementation to correctly match the C
implementation in the case where the elements of the `above` array all
differ.
The geomean of performance for the predictor is approximately a 2%
slowdown compared to the previous vectorized implementation. This is
still considerably faster than the unspecialized naive C implementation.
Bug: webm:1797
Change-Id: I8fb00a154288d54b24a72a7ff63c816bdcf3aca3
George Steed [Fri, 17 Mar 2023 17:59:26 +0000 (17:59 +0000)]
Allow non-uniform above array in d45 predictor Neon impl
The existing implementation doesn't appear to manifest as a failure in
any of the predictor or MD5 tests, but it does rely on the predictor
tests filling the second `bs` elements of the `above` input array with
copies of `above[bs - 1]` in order to match the C implementation.
This patch adjusts the Neon implementation to correctly match the C
implementation in the case where the elements of the `above` array all
differ.
Performance of the predictor is mostly unchanged, except for the 32x32
block size where it appears to have gotten about 40% faster when
compiled with clang-15.
Bug: webm:1797
Change-Id: Iaad58e77c5467307a3c80d6989b7cf2988e09311
George Steed [Thu, 9 Mar 2023 23:46:31 +0000 (23:46 +0000)]
Allow non-uniform above array in highbd d45 predictor Neon impl
The existing implementation doesn't appear to manifest as a failure in
any of the predictor or MD5 tests, but it does rely on the predictor
tests filling the second `bs` elements of the `above` input array with
copies of `above[bs - 1]` in order to match the C implementation.
This patch adjusts the Neon implementation to correctly match the C
implementation in the case where the elements of the `above` array all
differ.
Performance of the predictor is mostly unchanged, except for the 16x16
block size where it appears to have gotten marginally faster across most
compiler/micro-architecture combinations.
Bug: webm:1797
Change-Id: Iac166d6047316c0382e0f2790ce780fc99674b43
Anupam Pandey [Tue, 21 Mar 2023 07:30:25 +0000 (13:00 +0530)]
Add AVX2 for convolve vertical filter for block width 4
Introduced AVX2 intrinsic to compute convolve vertical for
w = 4 case. This is a bit-exact change.
      Instruction Count
cpu   Resolution   Reduction(%)
 0    LOWRES2      0.364
 0    MIDRES2      0.236
 0    HDRES2       0.162
 0    Average      0.254
Change-Id: I413f58aa6333a6f2421d4c10d49dec01e55b2098
James Zern [Tue, 7 Mar 2023 23:29:37 +0000 (15:29 -0800)]
vp9_rdopt,block_rd_txfm: fix clang-tidy warning
argument name 'recon' in comment does not match parameter name
'out_recon'.
https://clang.llvm.org/extra/clang-tidy/checks/bugprone/argument-comment.html
+ normalize similar calls, using /*var=*/NULL to better match the style
guidelines
https://google.github.io/styleguide/cppguide.html#Function_Argument_Comments
Change-Id: I089591317f7138965735f737c1536a8b16fcd4e4
James Zern [Fri, 24 Mar 2023 18:04:19 +0000 (18:04 +0000)]
Merge changes Ide512788,I77c7abae into main
* changes:
vp9_scan.h: rename scan_order struct to ScanOrder
vp9_encodeframe.c: clear -Wshadow warnings
James Zern [Wed, 22 Feb 2023 23:16:43 +0000 (15:16 -0800)]
vp9_scan.h: rename scan_order struct to ScanOrder
This matches the style guide and fixes some -Wshadow warnings related to
variables with the same name. Something similar was done in libaom in:
03f6fdcfca Fix warnings reported by -Wshadow: Part1b: scan_order struct
and variable
Bug: webm:1793
Change-Id: Ide5127886b7fd7778e6d8a983bfba6edda21ff28
James Zern [Wed, 22 Feb 2023 21:53:49 +0000 (13:53 -0800)]
vp9_encodeframe.c: clear -Wshadow warnings
Bug: webm:1793
Change-Id: I77c7abae7bbb1e1f4972cd31e3a67d62477b896e
James Zern [Fri, 24 Mar 2023 02:02:12 +0000 (19:02 -0700)]
update libwebm to libwebm-1.0.0.29-9-g1930e3c
changelog:
https://chromium.googlesource.com/webm/libwebm/+log/ee0bab576..1930e3ca2
Bug: webm:1792
Change-Id: I5c5c30c767d357528f102ff38957655e2ec0c645
Wan-Teh Chang [Mon, 20 Mar 2023 23:05:11 +0000 (16:05 -0700)]
Fix comment typos (likely copy-and-paste errors)
Fix comment typos for vpx_codec_destroy() and vpx_codec_enc_init_ver().
Based on the change made in libaom:
https://aomedia.googlesource.com/aom/+/365a968684
365a968684 Fix comment typos (likely copy-and-paste errors)
Change-Id: I39edae835ed0752b569e8e7328d0709c59724ac2
James Zern [Thu, 23 Mar 2023 21:40:13 +0000 (21:40 +0000)]
Merge "Add Neon implementations of vpx_highbd_avg_<w>x<h>_c" into main
James Zern [Thu, 23 Mar 2023 17:22:28 +0000 (17:22 +0000)]
Merge "test.mk: use CONFIG_VP(8|9)_ENCODER for vp8/vp9-only tests" into main
James Zern [Thu, 23 Mar 2023 17:21:57 +0000 (17:21 +0000)]
Merge "svc_encodeframe.c: fix -Wstringop-truncation" into main
Jerome Jiang [Wed, 22 Mar 2023 20:48:44 +0000 (20:48 +0000)]
Merge "Revert "Add codec control to get tpl stats"" into main
Jerome Jiang [Wed, 22 Mar 2023 20:18:39 +0000 (20:18 +0000)]
Revert "Add codec control to get tpl stats"
This reverts commit 9c15fb62b3dfe1c698dc28f9efedb022b0ef8eb8.
Reason for revert:
vpxenc should only use public interface
Original change's description:
> Add codec control to get tpl stats
>
> Add command line flag to vpxenc to export tpl stats
>
> Bug: b/273736974
> Change-Id: I6980096531b0c12fbf7a307fdef4c562d0c29e32
Bug: b/273736974
Change-Id: Ifa8951bb34e5936bbfc33086b22e9fc36d379bc9
Wan-Teh Chang [Wed, 22 Mar 2023 16:09:24 +0000 (16:09 +0000)]
Merge "Change UpdateRateControl() to return bool" into main
Salome Thirot [Fri, 10 Mar 2023 16:30:36 +0000 (16:30 +0000)]
Add Neon implementations of vpx_highbd_avg_<w>x<h>_c
Add Neon implementation of vpx_highbd_avg_4x4_c and vpx_highbd_avg_8x8_c
as well as the corresponding tests.
Change-Id: Ib1b06af5206774347690c9c56e194b76aa409c91
James Zern [Wed, 22 Mar 2023 02:14:12 +0000 (02:14 +0000)]
Merge changes I8abac3c9,If678fc19 into main
* changes:
vp9_bitstream.c: clear -Wshadow warnings
vp9_setup_mask: clear -Wshadow warnings
James Zern [Tue, 21 Mar 2023 20:20:51 +0000 (20:20 +0000)]
Merge changes I650b305c,If3e4cf37,I4c791e3a into main
* changes:
sixtappredict_neon.c: remove redundant returns
sixtappredict_neon.c,cosmetics: fix a typo
vp8_sixtap_predict16x16_neon: fix overread
Jerome Jiang [Tue, 21 Mar 2023 18:34:34 +0000 (18:34 +0000)]
Merge "Add codec control to get tpl stats" into main
James Zern [Tue, 21 Mar 2023 00:33:00 +0000 (00:33 +0000)]
Merge "Reland "quantize: use scan_order instead of passing scan/iscan"" into main
James Zern [Tue, 21 Mar 2023 00:28:11 +0000 (17:28 -0700)]
test.mk: use CONFIG_VP(8|9)_ENCODER for vp8/vp9-only tests
fixes some uninstantiated test failures when configured with
--disable-vp8 or --disable-vp9
Change-Id: If9a6705bd070edee02306e89da103ed474688ec8
James Zern [Tue, 21 Mar 2023 00:09:42 +0000 (17:09 -0700)]
svc_encodeframe.c: fix -Wstringop-truncation
use sizeof(buf) - 1 with strncpy.
fixes:
examples/svc_encodeframe.c:282:3: warning: ‘strncpy’ specified bound
1024 equals destination size [-Wstringop-truncation]
282 | strncpy(si->options, options, sizeof(si->options));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
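Illustration only, not the patched code: a minimal sketch of the `sizeof(buf) - 1` pattern. SvcOptionsSketch and set_options are hypothetical names, and the buffer is 16 bytes here rather than 1024 to keep the example short. The explicit terminator write is an assumption of this sketch; it is needed whenever the destination is not already zero-filled.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the struct that owns the options buffer. */
typedef struct {
  char options[16];
} SvcOptionsSketch;

/* Copies `options` into si->options, truncating if needed, and always
 * leaves the result NUL-terminated. */
static void set_options(SvcOptionsSketch *si, const char *options) {
  assert(si != NULL && options != NULL);
  /* Bound the copy to size - 1 so the last byte is reserved for '\0'. */
  strncpy(si->options, options, sizeof(si->options) - 1);
  /* strncpy does not terminate on truncation, so do it explicitly. */
  si->options[sizeof(si->options) - 1] = '\0';
}
```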
Change-Id: I46980872f9865ae1dc2b56330c3a65d8bc6cf1f7
James Zern [Mon, 20 Mar 2023 23:58:28 +0000 (16:58 -0700)]
sixtappredict_neon.c: remove redundant returns
Change-Id: I650b305c2599fc32353daba030e6241d330796a7
James Zern [Mon, 20 Mar 2023 23:56:58 +0000 (16:56 -0700)]
sixtappredict_neon.c,cosmetics: fix a typo
Change-Id: If3e4cf372fc6ed076f0d42c435a72262494aab68
James Zern [Mon, 20 Mar 2023 23:43:47 +0000 (16:43 -0700)]
vp8_sixtap_predict16x16_neon: fix overread
Shift the final read from the source by 3 to avoid breaking the
assumption that the 6-tap filter needs only 5 pixels outside of the
macroblock; this matches the sse2 and ssse3 implementations.
It's possible this restriction could be removed if the source buffers
are assumed to be padded.
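Illustration only: the arithmetic behind the 5-pixel assumption, sketched with hypothetical helper names. It assumes the 6 taps sit at offsets -2..+3 around each output pixel, so a row of `width` outputs reads 2 pixels before the block and 3 after it. tail_load_start gives the start index of a final vector load that ends exactly on the last pixel the filter needs. For width 16 and an 8-pixel load that index is 11, which is 3 less than `width - 2`; the commit message does not confirm that this is exactly how the patch applies its shift.

```c
#include <assert.h>

/* Assumed tap layout of the 6-tap filter: offsets -2..+3. */
enum { kTapsBefore = 2, kTapsAfter = 3 };

/* Index (relative to the first pixel of the row) of the last source pixel
 * read when filtering `width` outputs. */
static int last_src_index(int width) {
  assert(width > 0);
  return (width - 1) + kTapsAfter;
}

/* Number of source pixels read outside [0, width). */
static int pixels_outside(void) { return kTapsBefore + kTapsAfter; }

/* Start index for a final load of `load_size` pixels that ends exactly on
 * the last pixel the filter needs, so nothing beyond it is read. */
static int tail_load_start(int width, int load_size) {
  assert(load_size > 0);
  return last_src_index(width) + 1 - load_size;
}
```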
Bug: webm:1795
Change-Id: I4c791e3a214898a503c78f4cedca154c75cdbaef
Fixed: webm:1795
Yunqing Wang [Mon, 20 Mar 2023 16:35:44 +0000 (16:35 +0000)]
Merge "Skip trellis coeff opt based on tx block properties" into main
Yunqing Wang [Mon, 20 Mar 2023 16:27:53 +0000 (16:27 +0000)]
Merge "Refactor logic of skipping trellis coeff opt" into main
Jerome Jiang [Fri, 17 Mar 2023 18:34:42 +0000 (14:34 -0400)]
Add codec control to get tpl stats
Add command line flag to vpxenc to export tpl stats
Bug: b/273736974
Change-Id: I6980096531b0c12fbf7a307fdef4c562d0c29e32
Deepa K G [Thu, 2 Mar 2023 08:09:55 +0000 (13:39 +0530)]
Skip trellis coeff opt based on tx block properties
The trellis coefficient optimization is skipped for blocks
with larger residual mse.
                   Instruction Count   BD-Rate Loss(%)
cpu   Resolution   Reduction(%)        avg.psnr   ovr.psnr   ssim
 0    LOWRES2      9.467                0.0921     0.1057     0.0362
 0    MIDRES2      4.328               -0.0155     0.0694     0.0178
 0    HDRES2       1.858                0.0231     0.0214    -0.0034
 0    Average      5.218                0.0332     0.0655     0.0169
STATS_CHANGED
Change-Id: I321a9b1a34ebb59b7b6a065b5b2d717c8767a4a5
Deepa K G [Thu, 2 Mar 2023 08:09:55 +0000 (13:39 +0530)]
Refactor logic of skipping trellis coeff opt
The code to enable trellis coefficient optimization
is refactored using the sf 'trellis_opt_tx_rd'. This
change facilitates adaptive skipping of trellis
optimization based on block properties.
Change-Id: Ia1ff7cbbe5acf86414410f62655d46c099387847
James Zern [Wed, 22 Feb 2023 21:29:20 +0000 (13:29 -0800)]
vp9_bitstream.c: clear -Wshadow warnings
Bug: webm:1793
Change-Id: I8abac3c901ad24b642b39ea6e6081d8ba626853d
James Zern [Wed, 22 Feb 2023 21:22:08 +0000 (13:22 -0800)]
vp9_setup_mask: clear -Wshadow warnings
Bug: webm:1793
Change-Id: If678fc195ef87cc634d31fb7b24e0c844a5cb7b0
Johann [Mon, 14 Nov 2022 07:47:33 +0000 (16:47 +0900)]
Reland "quantize: use scan_order instead of passing scan/iscan"
This is a reland of commit 14fc40040ff30486c45111056db44ee18590a24a
Parent change fixed in crrev.com/c/webm/libvpx/+/4305500
Original change's description:
> quantize: use scan_order instead of passing scan/iscan
>
> further reduces the arguments for the 32x32. This will be applied to the base
> version as well.
>
> Change-Id: I25a162b5248b14af53d9e20c6a7fa2a77028a6d1
Change-Id: I2a7654558eaddd68bd09336bf317b297f18559d2
James Zern [Fri, 17 Mar 2023 20:35:24 +0000 (20:35 +0000)]
Merge changes I5d9444a2,I1f127df9 into main
* changes:
Add Neon implementation of vpx_highbd_minmax_8x8_c
Add tests for vpx_highbd_minmax_8x8_c
James Zern [Fri, 17 Mar 2023 20:32:11 +0000 (20:32 +0000)]
Merge "Reland "quantize: simplifly highbd 32x32_b args"" into main
Salome Thirot [Thu, 9 Mar 2023 13:58:16 +0000 (13:58 +0000)]
Add Neon implementation of vpx_highbd_minmax_8x8_c
Add Neon implementation of vpx_highbd_minmax_8x8_c as well as the
corresponding tests.
Change-Id: I5d9444a239fb1baa53634c1bdb5292b44067d90c
Salome Thirot [Thu, 9 Mar 2023 21:04:07 +0000 (21:04 +0000)]
Add tests for vpx_highbd_minmax_8x8_c
Write tests for vpx_highbd_minmax_8x8_c, and fix initial value of min in
vpx_highbd_minmax_8x8_c.
Change-Id: I1f127df945bbb8c7d373c5430ff5f94f28575968
Johann [Fri, 11 Nov 2022 23:23:17 +0000 (08:23 +0900)]
Reland "quantize: simplifly highbd 32x32_b args"
This is a reland of commit 573f5e662b544dbc553d73fa2b61055c30dfe8cc
Alignment issue with tests fixed in crrev.com/c/webm/libvpx/+/4305500
Original change's description:
> quantize: simplify highbd 32x32_b args
>
> Change-Id: I431a41279c4c4193bc70cfe819da6ea7e1d2fba1
Change-Id: Ic868b6f987c99d88672858fedd092fa49c125e19
Wan-Teh Chang [Thu, 16 Mar 2023 20:30:01 +0000 (13:30 -0700)]
Change UpdateRateControl() to return bool
Change the VP9RateControlRtcConfig constructor to initialize
ss_number_layers (to 1).
Change UpdateRateControl() to return bool so that it can report failure
(due to invalid configuration).
Also change InitRateControl() to return bool to propagate the return
value of UpdateRateControl().
Note: This is a port of the libaom CL
https://aomedia-review.googlesource.com/c/aom/+/172042.
Change-Id: I90b60353b5f15692dba5d89e7b1a9c81bb2fdd89
Wan-Teh Chang [Fri, 17 Mar 2023 02:54:21 +0000 (02:54 +0000)]
Merge "Set oxcf->ts_rate_decimator[tl] only once" into main
Wan-Teh Chang [Fri, 17 Mar 2023 01:36:13 +0000 (18:36 -0700)]
Set oxcf->ts_rate_decimator[tl] only once
The code that sets oxcf->ts_rate_decimator[tl] does not need to be
inside a loop that iterates over sl. Move the code out of the sl loop so
that oxcf->ts_rate_decimator[tl] is set only once.
Change-Id: I22f6c117d200ec38a757b749a8700660d15436c1
Wan-Teh Chang [Thu, 16 Mar 2023 22:21:49 +0000 (15:21 -0700)]
Remove repeated field from VP9RateControlRtcConfig
Remove the `ts_number_layers` field from VP9RateControlRtcConfig because
the base class VpxRateControlRtcConfig already has that field.
Note: In commit 65a1751e5b98bf7f1d21bcbfdef352af34fb205d,
`ts_number_layers` was moved to the newly created base class
VpxRateControlRtcConfig but was inadvertently left in
VP9RateControlRtcConfig:
https://chromium-review.googlesource.com/c/webm/libvpx/+/3140048
Change-Id: I98d48e152683ec2e5e62efffb56b7f010c5d0695
Wan-Teh Chang [Thu, 16 Mar 2023 21:40:14 +0000 (21:40 +0000)]
Merge "Update the sample code for VP9RateControlRTC" into main
Yunqing Wang [Thu, 16 Mar 2023 20:44:11 +0000 (20:44 +0000)]
Merge "Add AVX2 for convolve horizontal filter for block width 4" into main
Wan-Teh Chang [Thu, 16 Mar 2023 20:37:56 +0000 (13:37 -0700)]
Update the sample code for VP9RateControlRTC
Update the sample code to the current VP9RateControlRTC interface.
Change-Id: I30b0712c897f93fd62ebce51ce39afce3cac1fd7
Anupam Pandey [Tue, 14 Mar 2023 11:20:31 +0000 (16:50 +0530)]
Add AVX2 for convolve horizontal filter for block width 4
Introduced AVX2 intrinsic to compute convolve horizontal for
w = 4 case. This is a bit-exact change.
      Instruction Count
cpu   Resolution   Reduction(%)
 0    LOWRES2      0.763
 0    MIDRES2      0.466
 0    HDRES2       0.317
 0    Average      0.516
Change-Id: I124f3f8e994c24461812f4963b113819466db44f
Salome Thirot [Wed, 8 Mar 2023 14:08:23 +0000 (14:08 +0000)]
Optimize vpx_minmax_8x8_neon for aarch64
Optimize vpx_minmax_8x8_neon on AArch64 targets by using the UMAXV and
UMINV instructions - computing the maximum and minimum elements in a
Neon vector.
Change-Id: I54c3a3a087d266f6774e6113e5947253df288a64
James Zern [Tue, 14 Mar 2023 19:38:04 +0000 (19:38 +0000)]
Merge "Add Neon implementation of vpx_highbd_satd_c" into main
James Zern [Tue, 14 Mar 2023 19:32:32 +0000 (19:32 +0000)]
Merge "Optimize vpx_satd_neon" into main
James Zern [Tue, 14 Mar 2023 19:31:02 +0000 (19:31 +0000)]
Merge "Add Neon implementation of vp9_highbd_block_error_c" into main
Salome Thirot [Wed, 8 Mar 2023 12:01:04 +0000 (12:01 +0000)]
Add Neon implementation of vpx_highbd_satd_c
Add Neon implementation of vpx_highbd_satd_c as well as the
corresponding tests.
Change-Id: I3d50e6abdf168fb13743e7d8da9364f072308b7f
Salome Thirot [Tue, 7 Mar 2023 17:04:31 +0000 (17:04 +0000)]
Optimize vpx_satd_neon
Optimize Neon implementation of vpx_satd by using ABD and UADALP instead
of ABAL and ABAL2, splitting the accumulator and using a dedicated
helper function to perform the final reduction.
Change-Id: Idcfa49e001b68b1dcd87c13fd9acc317a208cd2a
Salome Thirot [Tue, 7 Mar 2023 15:13:17 +0000 (15:13 +0000)]
Add Neon implementation of vp9_highbd_block_error_c
Add Neon implementation of vp9_highbd_block_error_c as well as the
corresponding tests.
Change-Id: Ibe0eb077f959ced0dcd7d0d8d9d529d3b5bc1874
Konstantinos Margaritis [Wed, 1 Mar 2023 23:37:32 +0000 (23:37 +0000)]
[NEON] Add temporal filter functions, 8-bit and highbd
Both are around 3x faster than original C version. 8-bit gives a
small 0.5% speed increase, whereas highbd gives ~2.5%.
Change-Id: I71d75ddd2757b19aa201e879fd9fa8f3a25431ad
James Zern [Tue, 14 Mar 2023 00:22:31 +0000 (00:22 +0000)]
Merge "Fix buffer overrun in highbd Neon subpel variance filters" into main
James Zern [Fri, 10 Mar 2023 21:40:59 +0000 (21:40 +0000)]
Merge "reland: quantize: simplify 32x32_b args" into main
Yunqing Wang [Fri, 10 Mar 2023 01:02:25 +0000 (01:02 +0000)]
Merge "Add AVX2 for vpx_filter_block1d8_v8() function" into main
Anupam Pandey [Mon, 6 Mar 2023 05:08:20 +0000 (10:38 +0530)]
Add AVX2 for vpx_filter_block1d8_v8() function
Introduced AVX2 intrinsic to compute convolve vertical for
w = 8 case. This is a bit-exact change.
      Instruction Count
cpu   Resolution   Reduction(%)
 0    LOWRES2      1.347
 0    MIDRES2      1.046
 0    HDRES2       0.805
 0    Average      1.066
Change-Id: Idf77fff054beaf2c985b9bf2335591bda47e811f
Neeraj Gadgil [Thu, 9 Mar 2023 09:21:44 +0000 (14:51 +0530)]
Rename function 'model_rd_for_sb_earlyterm'
Function renamed as 'build_inter_pred_model_rd_earlyterm' and
added a comment to explain its behavior.
Change-Id: I804e6273558ba36241232f62cf18ea754b85e369
Jonathan Wright [Wed, 8 Mar 2023 16:34:20 +0000 (16:34 +0000)]
Fix buffer overrun in highbd Neon subpel variance filters
The high bitdepth Neon code applying the first pass of the bilinear
filter for subpixel variance on blocks of width 4 processed two rows
at a time. This resulted in a source buffer overread, attempting to
produce two rows of padding for the second (vertical) pass of the
bilinear filter.
This patch modifies highbd_var_filter_block2d_bil_w4 and
highbd_avg_pred_var_filter_block2d_bil_w4 such that they only process
a single row per iteration, and only require a single row of padding
for the second pass. This prevents the buffer overread.
Since all block sizes are now processed one row at a time, there is
no need for a "padding" macro parameter - the value is always 1, with
no special case for 4xh blocks. As well as re-enabling the Neon paths
and their associated tests, we remove the now-redundant 'padding'
macro parameter.
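Illustration only, in scalar C rather than Neon: a sketch of the two-pass bilinear layout described above. The function names are hypothetical, and the 7-bit filter precision (taps summing to 128) is an assumption of the sketch. The first pass handles one row per iteration and is asked for h + 1 rows; the second pass blends row r with row r + 1, so exactly one extra row is needed. Processing two rows per iteration would round h + 1 up to h + 2 rows and read one source row too many.

```c
#include <assert.h>
#include <stdint.h>

/* First (horizontal) pass: one row per iteration. Reads w + 1 pixels from
 * each of `rows` source rows and writes `rows` rows of w values to dst. */
static void first_pass_rows(const uint16_t *src, int src_stride,
                            uint16_t *dst, int w, int rows, int f0, int f1) {
  int r, c;
  assert(f0 + f1 == 128);
  for (r = 0; r < rows; ++r) {
    for (c = 0; c < w; ++c) {
      dst[r * w + c] = (uint16_t)((src[c] * f0 + src[c + 1] * f1 + 64) >> 7);
    }
    src += src_stride;
  }
}

/* Second (vertical) pass: blends row r with row r + 1 of the intermediate
 * buffer, so tmp must hold h + 1 rows. */
static void second_pass(const uint16_t *tmp, uint16_t *dst, int w, int h,
                        int f0, int f1) {
  int r, c;
  assert(f0 + f1 == 128);
  for (r = 0; r < h; ++r) {
    for (c = 0; c < w; ++c) {
      dst[r * w + c] = (uint16_t)(
          (tmp[r * w + c] * f0 + tmp[(r + 1) * w + c] * f1 + 64) >> 7);
    }
  }
}
```

For a 4x4 block the caller would request `first_pass_rows(src, stride, tmp, 4, 4 + 1, ...)` with a 5-row `tmp`, then `second_pass(tmp, out, 4, 4, ...)`.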
Bug: webm:1796
Change-Id: Icd6076b38eb4476139795bb1734ca800c9edf079
James Zern [Wed, 8 Mar 2023 23:05:08 +0000 (23:05 +0000)]
Merge "disable vpx_highbd_*_sub_pixel_avg_variance4x{4,8}_neon" into main
James Zern [Wed, 8 Mar 2023 21:54:30 +0000 (21:54 +0000)]
Merge "Optimize vpx_sum_squares_2d_i16_neon" into main
James Zern [Wed, 8 Mar 2023 21:17:17 +0000 (13:17 -0800)]
disable vpx_highbd_*_sub_pixel_avg_variance4x{4,8}_neon
vpx_highbd_8_sub_pixel_avg_variance4x4_neon
vpx_highbd_8_sub_pixel_avg_variance4x8_neon
vpx_highbd_10_sub_pixel_avg_variance4x4_neon
vpx_highbd_10_sub_pixel_avg_variance4x8_neon
vpx_highbd_12_sub_pixel_avg_variance4x4_neon
vpx_highbd_12_sub_pixel_avg_variance4x8_neon
all cause heap overflows of the form:
[ RUN ] NEON/VpxHBDSubpelAvgVarianceTest.Ref/33
=================================================================
==535205==ERROR: AddressSanitizer: heap-buffer-overflow on address
0xffff95bb0b89 at pc 0x00000116dabc bp 0xffffd09f6430 sp 0xffffd09f6428
READ of size 8 at 0xffff95bb0b89 thread T0
#0 0x116dab8 in load_unaligned_u16q vpx_dsp/arm/mem_neon.h:176:3
#1 0x116dab8 in highbd_var_filter_block2d_bil_w4
vpx_dsp/arm/highbd_subpel_variance_neon.c:49:21
#2 0x116dab8 in vpx_highbd_8_sub_pixel_avg_variance4x4_neon
vpx_dsp/arm/highbd_subpel_variance_neon.c:543:1
...
0xffff95bb0b89 is located 0 bytes to the right of 73-byte region
[0xffff95bb0b40,0xffff95bb0b89)
allocated by thread T0 here:
#0 0x5f18b0 in malloc (test_libvpx+0x5f18b0)
#1 0xce4a40 in vpx_memalign vpx_mem/vpx_mem.c:62:10
#2 0xce4a40 in vpx_malloc vpx_mem/vpx_mem.c:70:40
#3 0xa52238 in (anonymous namespace)::SubpelVarianceTest<unsigned
int (*)(unsigned char const*, int, int, int, unsigned char
const*, int, unsigned int*, unsigned char
const*)>::SetUp()
test/variance_test.cc:586:14
...
This is the same issue as:
e33d4c276 disable vpx_highbd_*_sub_pixel_variance4x{4,8}_neon
They have highbd_var_filter_block2d_bil_w4 in common.
Bug: webm:1796
Change-Id: I3ed70d0ba22e127720542612ea9f6665948eedfc
James Zern [Wed, 8 Mar 2023 06:09:37 +0000 (22:09 -0800)]
disable vpx_highbd_*_sub_pixel_variance4x{4,8}_neon
vpx_highbd_8_sub_pixel_variance4x4_neon
vpx_highbd_8_sub_pixel_variance4x8_neon
vpx_highbd_10_sub_pixel_variance4x4_neon
vpx_highbd_10_sub_pixel_variance4x8_neon
vpx_highbd_12_sub_pixel_variance4x4_neon
vpx_highbd_12_sub_pixel_variance4x8_neon
all cause heap overflows of the form:
[ RUN ] NEON/VpxHBDSubpelVarianceTest.Ref/24
=================================================================
==450528==ERROR: AddressSanitizer: heap-buffer-overflow on address
0xffff8311a571 at pc 0x0000010ca52c bp 0xffffc63e96b0 sp 0xffffc63e96a8
READ of size 8 at 0xffff8311a571 thread T0
#0 0x10ca528 in load_unaligned_u16q vpx_dsp/arm/mem_neon.h:176:3
#1 0x10ca528 in highbd_var_filter_block2d_bil_w4
vpx_dsp/arm/highbd_subpel_variance_neon.c:49:21
#2 0x10ca528 in vpx_highbd_10_sub_pixel_variance4x8_neon
vpx_dsp/arm/highbd_subpel_variance_neon.c:257:1
...
0xffff8311a571 is located 0 bytes to the right of 113-byte region
[0xffff8311a500,0xffff8311a571)
allocated by thread T0 here:
#0 0x5f18b0 in malloc (test_libvpx+0x5f18b0)
#1 0xce4f90 in vpx_memalign vpx_mem/vpx_mem.c:62:10
#2 0xce4f90 in vpx_malloc vpx_mem/vpx_mem.c:70:40
#3 0xa4ad44 in (anonymous namespace)::SubpelVarianceTest<unsigned
int (*)(unsigned char const*, int, int, int, unsigned char
const*, int, unsigned int*)>::SetUp() test/variance_test.cc:586:14
Bug: webm:1796
Change-Id: I39f7f936bae2bcbbe1f803fb10375ec02d1c1277
James Zern [Tue, 7 Mar 2023 23:40:10 +0000 (23:40 +0000)]
Merge "[SSE4_1] Fix overflow in highbd temporal_filter" into main
James Zern [Tue, 7 Mar 2023 23:00:19 +0000 (23:00 +0000)]
Merge changes I79247b5a,Ic6016cf8,Ibab7ec5f into main
* changes:
Add Neon implementation of vp9_block_error_c
Fix return type of horizontal_add_int64x2 helper
Optimize vp9_block_error_fp_neon
James Zern [Tue, 7 Mar 2023 22:48:54 +0000 (22:48 +0000)]
Merge changes Ic021e82e,I2bce6f19,I250ab56e,I910692b1,Iefaa774d into main
* changes:
Implement highbd_d207_predictor using Neon
Implement highbd_d153_predictor using Neon
Implement d207_predictor using Neon
Implement d153_predictor using Neon
Implement highbd_d63_predictor using Neon
Yunqing Wang [Tue, 7 Mar 2023 16:40:52 +0000 (16:40 +0000)]
Merge "Add AVX2 for vpx_filter_block1d8_h8() function" into main
Yunqing Wang [Tue, 7 Mar 2023 16:37:22 +0000 (16:37 +0000)]
Merge "Use cb pattern for interp eval when filter is not switchable" into main
Yunqing Wang [Tue, 7 Mar 2023 16:35:18 +0000 (16:35 +0000)]
Merge "Early terminate interp filt search based on best RD cost" into main
Anupam Pandey [Thu, 2 Mar 2023 05:28:27 +0000 (10:58 +0530)]
Add AVX2 for vpx_filter_block1d8_h8() function
Introduced AVX2 intrinsic to compute convolve horizontal for
w = 8 case. This is a bit-exact change.
      Instruction Count
cpu   Resolution   Reduction(%)
 0    LOWRES2      1.509
 0    MIDRES2      1.165
 0    HDRES2       0.898
 0    Average      1.191
Change-Id: I699c94aa3d7ea74c58f901df906eed0b81b4ee79
Salome Thirot [Mon, 6 Mar 2023 11:37:26 +0000 (11:37 +0000)]
Add Neon implementation of vp9_block_error_c
Add Neon implementation of vp9_block_error_c as well as the
corresponding tests.
Change-Id: I79247b5ae24f51b7b55fc5e517d5e403dc86367a
Salome Thirot [Fri, 3 Mar 2023 10:53:27 +0000 (10:53 +0000)]
Fix return type of horizontal_add_int64x2 helper
horizontal_add_int64x2 was incorrectly returning a uint64_t instead of
an int64_t. This patch fixes that.
Change-Id: Ic6016cf87aebfc6a14f540b784d6648757e12b49
Salome Thirot [Wed, 1 Mar 2023 10:06:01 +0000 (10:06 +0000)]
Optimize vp9_block_error_fp_neon
Currently vp9_block_error_fp_neon is only used when
CONFIG_VP9_HIGHBITDEPTH is set to false. This patch optimizes the
implementation and uses tran_low_t instead of int16_t so that the
function can also be used in builds where vp9_highbitdepth is enabled.
Change-Id: Ibab7ec5f74b7652fa2ae5edf328f9ec587088fd3
Neeraj Gadgil [Wed, 1 Mar 2023 15:11:37 +0000 (20:41 +0530)]
Use cb pattern for interp eval when filter is not switchable
This CL uses a checkerboard pattern for interp filter eval when
the filter is not switchable.
                   Instruction Count   BD-Rate Loss(%)
cpu   Resolution   Reduction(%)        avg.psnr   ovr.psnr   ssim
 0    LOWRES2      0.725                0.0017    -0.0000     0.0192
 0    MIDRES2      0.968                0.0004     0.0504     0.0810
 0    HDRES2       1.135                0.0089     0.0130     0.0113
 0    Average      0.943                0.0037     0.0211     0.0372
STATS_CHANGED
Change-Id: Ia713e5170101302f264ffaa2350bc0ab15c27090
Neeraj Gadgil [Wed, 1 Mar 2023 10:05:57 +0000 (15:35 +0530)]
Early terminate interp filt search based on best RD cost
The CL prunes interpolation filter search based on rdcost of
individual planes.
                   Instruction Count   BD-Rate Loss(%)
cpu   Resolution   Reduction(%)        avg.psnr   ovr.psnr   ssim
 0    LOWRES2      1.613                0.0143     0.0208     0.0146
 0    MIDRES2      1.637                0.0214    -0.0316     0.0036
 0    HDRES2       1.369                0.0171     0.0178     0.1222
 0    Average      1.539                0.0176     0.0023     0.0468
STATS_CHANGED
Change-Id: I4be30bd1c7bbbc93c6bbc840565893a97d2598a4
James Zern [Tue, 7 Mar 2023 05:45:53 +0000 (05:45 +0000)]
Merge "Fix heap buffer overrun in vpx_get4x4sse_cs_neon" into main
James Zern [Tue, 7 Mar 2023 01:28:15 +0000 (01:28 +0000)]
Merge changes I05dc4d43,Ia0977ff0 into main
* changes:
Fix potential buffer over-read in highbd d117 predictor Neon
Implement d117_predictor using Neon
Jonathan Wright [Fri, 3 Mar 2023 23:42:50 +0000 (23:42 +0000)]
Fix heap buffer overrun in vpx_get4x4sse_cs_neon
Use a mem_neon.h helper to do strided 4-byte loads instead of Neon
8-byte loads - where the last 4 bytes are out of bounds.
Re-enable the Neon code path and the tests.
Bug: webm:1794
Change-Id: I69ccff730f4a5cbf585dd6a9aa0f3eb13e150074
James Zern [Mon, 6 Mar 2023 21:56:17 +0000 (13:56 -0800)]
vpx_convolve_copy_neon: fix unaligned loads w/w==4
Fixes a -fsanitize=undefined warning:
vpx_dsp/arm/vpx_convolve_copy_neon.c:29:26: runtime error: load of
misaligned address 0xffffa8242bea for type 'const uint32_t' (aka 'const
unsigned int'), which requires 4 byte alignment
0xffffa8242bea: note: pointer points here
88 81 7d 7d 7d 7d 7d 81 81 7d 81 80 87 97 a8 ab a0 91 ...
^
#0 0xb0447c in vpx_convolve_copy_neon
vpx_dsp/arm/vpx_convolve_copy_neon.c:29:26
#1 0x12285c8 in inter_predictor vp9/common/vp9_reconinter.h:29:3
#2 0x1228430 in dec_build_inter_predictors
vp9/decoder/vp9_decodeframe.c
...
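Illustration only, not the patched code: one common way to avoid this class of warning in C is to move each 4-byte row through memcpy instead of dereferencing a `const uint32_t *` that may be misaligned. copy_w4 is a hypothetical name; whether the patch uses memcpy or a Neon helper is not stated in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies an h-row block that is 4 pixels wide. Each row goes through a
 * local uint32_t via memcpy, which is valid for any source or destination
 * alignment. */
static void copy_w4(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst,
                    ptrdiff_t dst_stride, int h) {
  int i;
  assert(src != NULL && dst != NULL && h >= 0);
  for (i = 0; i < h; ++i) {
    uint32_t row;
    memcpy(&row, src, sizeof(row)); /* no alignment requirement on src */
    memcpy(dst, &row, sizeof(row)); /* no alignment requirement on dst */
    src += src_stride;
    dst += dst_stride;
  }
}
```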
Change-Id: Iaec4ac2a400b6e6db72d12e5a7acb316262b12a7
Jonathan Wright [Mon, 6 Mar 2023 17:52:13 +0000 (17:52 +0000)]
Optimize vpx_sum_squares_2d_i16_neon
Add an additional 32-bit vector accumulator to allow parallel
processing on CPUs that have more than one Neon multiply-accumulate
pipeline. Also use sum_neon.h horizontal-add helpers for reduction.
Change-Id: Ibcb48a738f5dee1430c3ebcd305b5ea8ea344c40
George Steed [Thu, 23 Feb 2023 16:25:38 +0000 (16:25 +0000)]
Implement highbd_d207_predictor using Neon
Add Neon implementations of the highbd d207 predictor for 4x4, 8x8,
16x16 and 32x32 block sizes. Also update tests to add new corresponding
cases.
Speedups over the C code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 4x4 | 1.61
Neoverse N1 | LLVM 15 | 8x8 | 5.30
Neoverse N1 | LLVM 15 | 16x16 | 8.93
Neoverse N1 | LLVM 15 | 32x32 | 8.35
Neoverse N1 | GCC 12 | 4x4 | 2.16
Neoverse N1 | GCC 12 | 8x8 | 5.75
Neoverse N1 | GCC 12 | 16x16 | 7.28
Neoverse N1 | GCC 12 | 32x32 | 3.31
Neoverse V1 | LLVM 15 | 4x4 | 1.71
Neoverse V1 | LLVM 15 | 8x8 | 7.46
Neoverse V1 | LLVM 15 | 16x16 | 10.09
Neoverse V1 | LLVM 15 | 32x32 | 8.10
Neoverse V1 | GCC 12 | 4x4 | 1.99
Neoverse V1 | GCC 12 | 8x8 | 7.81
Neoverse V1 | GCC 12 | 16x16 | 8.34
Neoverse V1 | GCC 12 | 32x32 | 5.74
Change-Id: Ic021e82eed0c7bc8263eb68606411354eb5e4870
George Steed [Wed, 22 Feb 2023 15:33:37 +0000 (15:33 +0000)]
Implement highbd_d153_predictor using Neon
Add Neon implementations of the highbd d153 predictor for 4x4, 8x8,
16x16 and 32x32 block sizes. Also update tests to add new corresponding
cases.
Speedups over the C code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 4x4 | 1.71
Neoverse N1 | LLVM 15 | 8x8 | 4.05
Neoverse N1 | LLVM 15 | 16x16 | 7.04
Neoverse N1 | LLVM 15 | 32x32 | 7.71
Neoverse N1 | GCC 12 | 4x4 | 1.84
Neoverse N1 | GCC 12 | 8x8 | 4.19
Neoverse N1 | GCC 12 | 16x16 | 6.07
Neoverse N1 | GCC 12 | 32x32 | 3.14
Neoverse V1 | LLVM 15 | 4x4 | 3.19
Neoverse V1 | LLVM 15 | 8x8 | 5.51
Neoverse V1 | LLVM 15 | 16x16 | 7.73
Neoverse V1 | LLVM 15 | 32x32 | 7.72
Neoverse V1 | GCC 12 | 4x4 | 3.97
Neoverse V1 | GCC 12 | 8x8 | 5.52
Neoverse V1 | GCC 12 | 16x16 | 6.31
Neoverse V1 | GCC 12 | 32x32 | 5.36
Change-Id: I2bce6f1921d76d1c10d163e0cd4f395b40799184
George Steed [Mon, 6 Mar 2023 13:24:47 +0000 (13:24 +0000)]
Fix potential buffer over-read in highbd d117 predictor Neon
The load of `left[bs]` in the standard bitdepth d117 Neon implementation
triggered an address-sanitizer failure.
The highbd equivalent does not appear to trigger any asan failures when
running the VP9/ExternalFrameBufferMD5Test or
VP9/TestVectorTest.MD5Match tests, but for consistency with the standard
bitdepth implementation we adjust it to avoid the over-read.
Performance is roughly identical, with an average improvement of 0.8%
over the previous optimised code.
Change-Id: I05dc4d43f244f4915c0ccc52cc0af999bbacb018
George Steed [Tue, 14 Feb 2023 14:56:25 +0000 (14:56 +0000)]
Implement d207_predictor using Neon
Add Neon implementations of the d207 predictor for 4x4, 8x8, 16x16 and
32x32 block sizes. Also update tests to add new corresponding cases.
Speedups over the C code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 4x4 | 1.72
Neoverse N1 | LLVM 15 | 8x8 | 5.68
Neoverse N1 | LLVM 15 | 16x16 | 12.30
Neoverse N1 | LLVM 15 | 32x32 | 16.70
Neoverse N1 | GCC 12 | 4x4 | 1.71
Neoverse N1 | GCC 12 | 8x8 | 6.01
Neoverse N1 | GCC 12 | 16x16 | 12.40
Neoverse N1 | GCC 12 | 32x32 | 6.71
Neoverse V1 | LLVM 15 | 4x4 | 1.99
Neoverse V1 | LLVM 15 | 8x8 | 8.28
Neoverse V1 | LLVM 15 | 16x16 | 14.36
Neoverse V1 | LLVM 15 | 32x32 | 17.55
Neoverse V1 | GCC 12 | 4x4 | 1.99
Neoverse V1 | GCC 12 | 8x8 | 8.43
Neoverse V1 | GCC 12 | 16x16 | 14.41
Neoverse V1 | GCC 12 | 32x32 | 7.82
Change-Id: I250ab56edab3390b0bac9dc96995a4bf9a4da641
George Steed [Mon, 6 Mar 2023 09:27:41 +0000 (09:27 +0000)]
Implement d117_predictor using Neon
Add Neon implementations of the d117 predictor for 4x4, 8x8, 16x16 and
32x32 block sizes. Also update tests to add new corresponding cases.
This re-lands commit
360e9069b6cc1dd3a004728b876fb923413f4b11,
previously reverted in commit
394de691a0ef570fc49943f565ad53ee0d22a7f3.
The implementation is mostly identical to the original but with an
adjustment to how data is loaded from the `left` array. In particular
the left array cannot be guaranteed to be larger than the block size, so
the read of e.g. `left[32]` in the `bs=32` case is not valid. This turns
out not to be a problem, since the last lane loaded in this case is
unused. I have added comments in the code to explain why this is the
case.
Since we cannot load the last element directly, we instead construct it
from the previous aligned read. This seems to have an inconsistent
effect on performance, improving by up to 10% in some cases and
regressing by up to 10% in others. Either way it is still significantly
faster than the original C code.
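A scalar sketch of the idea, under the assumption of an 8-element vector
(the names and the width are hypothetical; the Neon code may build the
vector differently): the "shifted by one" vector is formed entirely from
elements inside `left[0..bs-1]`, and the final, unused lane is filled
with a repeat of `left[bs - 1]` rather than with `left[bs]`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { kLanes = 8 }; /* hypothetical vector width, in elements */

/* Build the vector that a load from &left[bs - kLanes + 1] would have
 * produced, without touching left[bs]. Lanes 0..kLanes-2 come from the
 * in-bounds block; the last lane repeats left[bs - 1]. Its value does not
 * matter to the caller because that lane is never used. */
static void shifted_last_block(const uint8_t *left, int bs,
                               uint8_t out[kLanes]) {
  const uint8_t *last_block;
  assert(bs >= kLanes);
  last_block = left + bs - kLanes; /* in-bounds read of the final block */
  memcpy(out, last_block + 1, kLanes - 1);
  out[kLanes - 1] = last_block[kLanes - 1];
}
```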
Speedups over the C code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 4x4 | 1.88
Neoverse N1 | LLVM 15 | 8x8 | 5.19
Neoverse N1 | LLVM 15 | 16x16 | 9.63
Neoverse N1 | LLVM 15 | 32x32 | 13.85
Neoverse N1 | GCC 12 | 4x4 | 2.04
Neoverse N1 | GCC 12 | 8x8 | 4.62
Neoverse N1 | GCC 12 | 16x16 | 9.79
Neoverse N1 | GCC 12 | 32x32 | 4.69
Neoverse V1 | LLVM 15 | 4x4 | 1.75
Neoverse V1 | LLVM 15 | 8x8 | 6.71
Neoverse V1 | LLVM 15 | 16x16 | 9.62
Neoverse V1 | LLVM 15 | 32x32 | 13.81
Neoverse V1 | GCC 12 | 4x4 | 1.75
Neoverse V1 | GCC 12 | 8x8 | 6.01
Neoverse V1 | GCC 12 | 16x16 | 6.91
Neoverse V1 | GCC 12 | 32x32 | 4.39
Change-Id: Ia0977ff0b0eba2c41c7884b64e7c22ff9bc9549d
George Steed [Thu, 9 Feb 2023 16:12:59 +0000 (16:12 +0000)]
Implement d153_predictor using Neon
Add Neon implementations of the d153 predictor for 4x4, 8x8, 16x16 and
32x32 block sizes. Also update tests to add new corresponding cases.
Speedups over the C code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 4x4 | 1.59
Neoverse N1 | LLVM 15 | 8x8 | 4.46
Neoverse N1 | LLVM 15 | 16x16 | 8.77
Neoverse N1 | LLVM 15 | 32x32 | 15.21
Neoverse N1 | GCC 12 | 4x4 | 1.90
Neoverse N1 | GCC 12 | 8x8 | 4.70
Neoverse N1 | GCC 12 | 16x16 | 9.55
Neoverse N1 | GCC 12 | 32x32 | 5.95
Neoverse V1 | LLVM 15 | 4x4 | 2.89
Neoverse V1 | LLVM 15 | 8x8 | 6.94
Neoverse V1 | LLVM 15 | 16x16 | 10.20
Neoverse V1 | LLVM 15 | 32x32 | 15.63
Neoverse V1 | GCC 12 | 4x4 | 4.45
Neoverse V1 | GCC 12 | 8x8 | 7.71
Neoverse V1 | GCC 12 | 16x16 | 9.08
Neoverse V1 | GCC 12 | 32x32 | 7.93
Change-Id: I910692b14917cde8a8952fab5b9c78bed7f7c6ad
George Steed [Wed, 1 Mar 2023 22:44:38 +0000 (22:44 +0000)]
Implement highbd_d63_predictor using Neon
Add Neon implementations of the highbd d63 predictor for 4x4, 8x8, 16x16
and 32x32 block sizes. Also update tests to add new corresponding cases.
This re-lands commit
7cdf139e3d6237386e0f93bdb0bdc1b459c663bf,
previously reverted in
7478b7e4e481562a4a13f233acb66a60462e1934.
Compared to the previous implementation attempt we now correctly match
the behaviour of the C code when handling the final element loaded from
the 'above' input array. In particular:
- The C code for a 4x4 block performs a full average of the last element
  rather than duplicating the final element from the input 'above'
  array.
- The C code for other block sizes performs a full average for
  stride=0 and stride=1, and otherwise shifts in duplicates of the final
  element from the input 'above' array. Notably this shifting for later
  strides _replaces_ the final element which we previously performed an
  average on (see {d0,d1}_ext in the code).
It is worth noting that this difference is not caught by the existing
VP9HighbdIntraPredTest test cases since the test vector initialisation
contains this loop:
    for (int x = block_size; x < 2 * block_size; x++) {
      above_row_[x] = above_row_[block_size - 1];
    }
Since AVG2(a, a) and AVG3(a, a, a) are simply 'a', such differences in
behaviour for the final element are not observed.
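A small check of that identity, assuming the usual rounded-average
definitions AVG2(a, b) = (a + b + 1) >> 1 and
AVG3(a, b, c) = (a + 2 * b + c + 2) >> 2. The function names are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded two-tap average (assumed definition, see above). */
static uint16_t avg2(uint16_t a, uint16_t b) {
  return (uint16_t)((a + b + 1) >> 1);
}

/* Rounded three-tap average with weights 1:2:1 (assumed definition). */
static uint16_t avg3(uint16_t a, uint16_t b, uint16_t c) {
  return (uint16_t)((a + 2 * b + c + 2) >> 2);
}

/* Returns 1 if averaging identical inputs gives back the input for every
 * value in [0, max_value], and 0 otherwise. */
static int avg_is_identity_on_equal_inputs(uint16_t max_value) {
  uint32_t a;
  for (a = 0; a <= max_value; ++a) {
    const uint16_t v = (uint16_t)a;
    if (avg2(v, v) != v) return 0;
    if (avg3(v, v, v) != v) return 0;
  }
  return 1;
}
```

With a padded row of identical values, averaging and duplicating are
indistinguishable; a row whose padding differs from the last real
element (e.g. avg2(10, 20) is 15, not 10) would expose the difference.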
Tested on AArch64 with:
- ./test_libvpx --gtest_filter="*VP9HighbdIntraPredTest*"
- ./test_libvpx --gtest_filter="*VP9/TestVectorTest.MD5Match*"
- ./test_libvpx --gtest_filter="*VP9/ExternalFrameBufferMD5Test*"
Speedups over the C code (higher is better):
Microarch. | Compiler | Block | Speedup
Neoverse N1 | LLVM 15 | 4x4 | 2.43
Neoverse N1 | LLVM 15 | 8x8 | 3.92
Neoverse N1 | LLVM 15 | 16x16 | 3.19
Neoverse N1 | LLVM 15 | 32x32 | 4.13
Neoverse N1 | GCC 12 | 4x4 | 2.92
Neoverse N1 | GCC 12 | 8x8 | 6.51
Neoverse N1 | GCC 12 | 16x16 | 4.55
Neoverse N1 | GCC 12 | 32x32 | 3.18
Neoverse V1 | LLVM 15 | 4x4 | 1.99
Neoverse V1 | LLVM 15 | 8x8 | 3.65
Neoverse V1 | LLVM 15 | 16x16 | 3.72
Neoverse V1 | LLVM 15 | 32x32 | 3.26
Neoverse V1 | GCC 12 | 4x4 | 2.39
Neoverse V1 | GCC 12 | 8x8 | 4.76
Neoverse V1 | GCC 12 | 16x16 | 3.24
Neoverse V1 | GCC 12 | 32x32 | 2.44
Change-Id: Iefaa774d6a20388b523eaa7f5df6bc5f5cf249e4