platform/upstream/libvpx.git
Johann [Mon, 14 Nov 2022 07:47:33 +0000 (16:47 +0900)]
Reland "quantize: use scan_order instead of passing scan/iscan"

This is a reland of commit 14fc40040ff30486c45111056db44ee18590a24a

Parent change fixed in crrev.com/c/webm/libvpx/+/4305500

Original change's description:
> quantize: use scan_order instead of passing scan/iscan
>
> further reduces the arguments for the 32x32. This will be applied to the base
> version as well.
>
> Change-Id: I25a162b5248b14af53d9e20c6a7fa2a77028a6d1

Change-Id: I2a7654558eaddd68bd09336bf317b297f18559d2

James Zern [Fri, 17 Mar 2023 20:35:24 +0000 (20:35 +0000)]
Merge changes I5d9444a2,I1f127df9 into main

* changes:
  Add Neon implementation of vpx_highbd_minmax_8x8_c
  Add tests for vpx_highbd_minmax_8x8_c

James Zern [Fri, 17 Mar 2023 20:32:11 +0000 (20:32 +0000)]
Merge "Reland "quantize: simplify highbd 32x32_b args"" into main

Salome Thirot [Thu, 9 Mar 2023 13:58:16 +0000 (13:58 +0000)]
Add Neon implementation of vpx_highbd_minmax_8x8_c

Add Neon implementation of vpx_highbd_minmax_8x8_c as well as the
corresponding tests.

Change-Id: I5d9444a239fb1baa53634c1bdb5292b44067d90c

Salome Thirot [Thu, 9 Mar 2023 21:04:07 +0000 (21:04 +0000)]
Add tests for vpx_highbd_minmax_8x8_c

Write tests for vpx_highbd_minmax_8x8_c, and fix initial value of min in
vpx_highbd_minmax_8x8_c.
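
The effect of the seed value can be sketched in scalar C. This is an
illustration only, not the libvpx source: the function name is hypothetical
and the seed of 65535 is an assumption (the largest difference two 16-bit
samples can have). A seed sized for 8-bit data, such as 255, would be
returned unchanged for any block whose differences are all larger than it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar sketch: min/max absolute difference over an 8x8 block
 * of 16-bit samples. p and dp are strides in samples. */
static void highbd_minmax_8x8_sketch(const uint16_t *s, int p,
                                     const uint16_t *d, int dp, int *min,
                                     int *max) {
  assert(s != NULL && d != NULL && min != NULL && max != NULL);
  *min = 65535; /* assumed seed: covers the full 16-bit sample range */
  *max = 0;
  for (int i = 0; i < 8; ++i, s += p, d += dp) {
    for (int j = 0; j < 8; ++j) {
      const int diff = abs(s[j] - d[j]);
      if (diff < *min) *min = diff;
      if (diff > *max) *max = diff;
    }
  }
}
```

With a block where every difference is at least 600, the sketch reports a
minimum of 600, which a seed of 255 could never produce.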

Change-Id: I1f127df945bbb8c7d373c5430ff5f94f28575968

Johann [Fri, 11 Nov 2022 23:23:17 +0000 (08:23 +0900)]
Reland "quantize: simplify highbd 32x32_b args"

This is a reland of commit 573f5e662b544dbc553d73fa2b61055c30dfe8cc

Alignment issue with tests fixed in crrev.com/c/webm/libvpx/+/4305500

Original change's description:
> quantize: simplify highbd 32x32_b args
>
> Change-Id: I431a41279c4c4193bc70cfe819da6ea7e1d2fba1

Change-Id: Ic868b6f987c99d88672858fedd092fa49c125e19

Wan-Teh Chang [Fri, 17 Mar 2023 02:54:21 +0000 (02:54 +0000)]
Merge "Set oxcf->ts_rate_decimator[tl] only once" into main

Wan-Teh Chang [Fri, 17 Mar 2023 01:36:13 +0000 (18:36 -0700)]
Set oxcf->ts_rate_decimator[tl] only once

The code that sets oxcf->ts_rate_decimator[tl] does not need to be
inside a loop that iterates over sl. Move the code out of the sl loop so
that oxcf->ts_rate_decimator[tl] is set only once.
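
The loop restructuring can be sketched as follows. Only the field name
ts_rate_decimator comes from this change; the struct, the second field, the
layer indexing and the function name are assumptions made for illustration.

```c
#include <assert.h>

#define MAX_SL_SKETCH 3
#define MAX_TL_SKETCH 3

/* Hypothetical stand-in for the encoder config. */
typedef struct {
  int ts_rate_decimator[MAX_TL_SKETCH];
  int layer_target_bitrate[MAX_SL_SKETCH * MAX_TL_SKETCH];
} LayerConfigSketch;

static void set_layer_config_sketch(LayerConfigSketch *oxcf,
                                    const int *decimator, const int *bitrate,
                                    int num_sl, int num_tl) {
  assert(num_sl <= MAX_SL_SKETCH && num_tl <= MAX_TL_SKETCH);
  /* Depends only on tl, so it is written once, outside the sl loop. */
  for (int tl = 0; tl < num_tl; ++tl) {
    oxcf->ts_rate_decimator[tl] = decimator[tl];
  }
  /* Values indexed by (sl, tl) still need the nested loop. */
  for (int sl = 0; sl < num_sl; ++sl) {
    for (int tl = 0; tl < num_tl; ++tl) {
      const int layer = sl * num_tl + tl; /* assumed layer indexing */
      oxcf->layer_target_bitrate[layer] = bitrate[layer];
    }
  }
}
```

The result is the same as with the assignment inside the sl loop; the write
simply happens num_tl times instead of num_sl * num_tl times.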

Change-Id: I22f6c117d200ec38a757b749a8700660d15436c1

Wan-Teh Chang [Thu, 16 Mar 2023 22:21:49 +0000 (15:21 -0700)]
Remove repeated field from VP9RateControlRtcConfig

Remove the `ts_number_layers` field from VP9RateControlRtcConfig because
the base class VpxRateControlRtcConfig already has that field.

Note: In commit 65a1751e5b98bf7f1d21bcbfdef352af34fb205d,
`ts_number_layers` was moved to the newly created base class
VpxRateControlRtcConfig but was inadvertently left in
VP9RateControlRtcConfig:
https://chromium-review.googlesource.com/c/webm/libvpx/+/3140048

Change-Id: I98d48e152683ec2e5e62efffb56b7f010c5d0695

Wan-Teh Chang [Thu, 16 Mar 2023 21:40:14 +0000 (21:40 +0000)]
Merge "Update the sample code for VP9RateControlRTC" into main

Yunqing Wang [Thu, 16 Mar 2023 20:44:11 +0000 (20:44 +0000)]
Merge "Add AVX2 for convolve horizontal filter for block width 4" into main

Wan-Teh Chang [Thu, 16 Mar 2023 20:37:56 +0000 (13:37 -0700)]
Update the sample code for VP9RateControlRTC

Update the sample code to the current VP9RateControlRTC interface.

Change-Id: I30b0712c897f93fd62ebce51ce39afce3cac1fd7

Anupam Pandey [Tue, 14 Mar 2023 11:20:31 +0000 (16:50 +0530)]
Add AVX2 for convolve horizontal filter for block width 4

Introduce AVX2 intrinsics to compute the horizontal convolution for the
w = 4 case. This is a bit-exact change.

                 Instruction Count
cpu   Resolution   Reduction(%)
 0       LOWRES2      0.763
 0       MIDRES2      0.466
 0        HDRES2      0.317
 0       Average      0.516

Change-Id: I124f3f8e994c24461812f4963b113819466db44f

Salome Thirot [Wed, 8 Mar 2023 14:08:23 +0000 (14:08 +0000)]
Optimize vpx_minmax_8x8_neon for aarch64

Optimize vpx_minmax_8x8_neon on AArch64 targets by using the UMAXV and
UMINV instructions - computing the maximum and minimum elements in a
Neon vector.
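
A minimal sketch of the across-vector reduction, not the libvpx function
itself. On AArch64 the vmaxvq_u8 and vminvq_u8 intrinsics are expected to
map to UMAXV and UMINV; elsewhere a scalar loop stands in so the sketch
stays portable. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: min and max of 16 bytes. */
static void minmax_u8x16_sketch(const uint8_t *v, int *min, int *max) {
  assert(v != 0 && min != 0 && max != 0);
#if defined(__aarch64__)
  const uint8x16_t x = vld1q_u8(v);
  *max = vmaxvq_u8(x); /* single across-vector maximum */
  *min = vminvq_u8(x); /* single across-vector minimum */
#else
  /* Scalar fallback with the same result. */
  *min = 255;
  *max = 0;
  for (int i = 0; i < 16; ++i) {
    if (v[i] < *min) *min = v[i];
    if (v[i] > *max) *max = v[i];
  }
#endif
}
```

Without the across-vector instructions, the same reduction needs a chain of
pairwise min/max steps before the result can be extracted from one lane.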

Change-Id: I54c3a3a087d266f6774e6113e5947253df288a64

James Zern [Tue, 14 Mar 2023 19:38:04 +0000 (19:38 +0000)]
Merge "Add Neon implementation of vpx_highbd_satd_c" into main

James Zern [Tue, 14 Mar 2023 19:32:32 +0000 (19:32 +0000)]
Merge "Optimize vpx_satd_neon" into main

James Zern [Tue, 14 Mar 2023 19:31:02 +0000 (19:31 +0000)]
Merge "Add Neon implementation of vp9_highbd_block_error_c" into main

Salome Thirot [Wed, 8 Mar 2023 12:01:04 +0000 (12:01 +0000)]
Add Neon implementation of vpx_highbd_satd_c

Add Neon implementation of vpx_highbd_satd_c as well as the
corresponding tests.

Change-Id: I3d50e6abdf168fb13743e7d8da9364f072308b7f

Salome Thirot [Tue, 7 Mar 2023 17:04:31 +0000 (17:04 +0000)]
Optimize vpx_satd_neon

Optimize Neon implementation of vpx_satd by using ABD and UADALP instead
of ABAL and ABAL2, splitting the accumulator and using a dedicated
helper function to perform the final reduction.

Change-Id: Idcfa49e001b68b1dcd87c13fd9acc317a208cd2a

Salome Thirot [Tue, 7 Mar 2023 15:13:17 +0000 (15:13 +0000)]
Add Neon implementation of vp9_highbd_block_error_c

Add Neon implementation of vp9_highbd_block_error_c as well as the
corresponding tests.

Change-Id: Ibe0eb077f959ced0dcd7d0d8d9d529d3b5bc1874

Konstantinos Margaritis [Wed, 1 Mar 2023 23:37:32 +0000 (23:37 +0000)]
[NEON] Add temporal filter functions, 8-bit and highbd

Both are around 3x faster than the original C version. 8-bit gives a
small 0.5% speed increase, whereas highbd gives ~2.5%.

Change-Id: I71d75ddd2757b19aa201e879fd9fa8f3a25431ad

James Zern [Tue, 14 Mar 2023 00:22:31 +0000 (00:22 +0000)]
Merge "Fix buffer overrun in highbd Neon subpel variance filters" into main

James Zern [Fri, 10 Mar 2023 21:40:59 +0000 (21:40 +0000)]
Merge "reland: quantize: simplify 32x32_b args" into main

Yunqing Wang [Fri, 10 Mar 2023 01:02:25 +0000 (01:02 +0000)]
Merge "Add AVX2 for vpx_filter_block1d8_v8() function" into main

Anupam Pandey [Mon, 6 Mar 2023 05:08:20 +0000 (10:38 +0530)]
Add AVX2 for vpx_filter_block1d8_v8() function

Introduce AVX2 intrinsics to compute the vertical convolution for the
w = 8 case. This is a bit-exact change.

                 Instruction Count
cpu   Resolution   Reduction(%)
 0       LOWRES2      1.347
 0       MIDRES2      1.046
 0        HDRES2      0.805
 0       Average      1.066

Change-Id: Idf77fff054beaf2c985b9bf2335591bda47e811f

Neeraj Gadgil [Thu, 9 Mar 2023 09:21:44 +0000 (14:51 +0530)]
Rename function 'model_rd_for_sb_earlyterm'

Rename the function to 'build_inter_pred_model_rd_earlyterm' and
add a comment explaining its behavior.

Change-Id: I804e6273558ba36241232f62cf18ea754b85e369

Jonathan Wright [Wed, 8 Mar 2023 16:34:20 +0000 (16:34 +0000)]
Fix buffer overrun in highbd Neon subpel variance filters

The high bitdepth Neon code applying the first pass of the bilinear
filter for subpixel variance on blocks of width 4 processed two rows
at a time. This resulted in a source buffer overread, attempting to
produce two rows of padding for the second (vertical) pass of the
bilinear filter.

This patch modifies highbd_var_filter_block2d_bil_w4 and
highbd_avg_pred_var_filter_block2d_bil_w4 such that they only process
a single row per iteration, and only require a single row of padding
for the second pass. This prevents the buffer overread.
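
The row arithmetic can be sketched in scalar C. Everything here is
illustrative: the function names are hypothetical, and the filter is assumed
to use two taps that sum to 128 with 7-bit rounding. The first pass must
produce h + 1 rows (h block rows plus one row of padding for the vertical
pass), so an unroll of two rows rounds an odd row count up by one.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical first (horizontal) bilinear pass for a 4-wide block, one row
 * per iteration. rows is the block height plus one padding row. Each row
 * reads src[0..4]. */
static void bil_first_pass_w4_sketch(const uint16_t *src, int stride,
                                     int rows, int f0, int f1,
                                     uint16_t *dst) {
  assert(f0 + f1 == 128); /* assumed tap normalisation */
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < 4; ++c) {
      dst[c] = (uint16_t)((src[c] * f0 + src[c + 1] * f1 + 64) >> 7);
    }
    src += stride;
    dst += 4;
  }
}

/* Source rows touched when the row loop is unrolled by rows_per_iter. */
static int rows_touched_sketch(int rows_needed, int rows_per_iter) {
  return (rows_needed + rows_per_iter - 1) / rows_per_iter * rows_per_iter;
}
```

For a 4x4 block, rows_touched_sketch(5, 2) is 6 while
rows_touched_sketch(5, 1) is 5: the two-row loop reads one source row more
than a buffer sized for h + 1 rows provides.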

Since all block sizes are now processed one row at a time, there is
no need for a "padding" macro parameter - the value is always 1, with
no special case for 4xh blocks. As well as re-enabling the Neon paths
and their associated tests, we remove the now-redundant 'padding'
macro parameter.

Bug: webm:1796
Change-Id: Icd6076b38eb4476139795bb1734ca800c9edf079

James Zern [Wed, 8 Mar 2023 23:05:08 +0000 (23:05 +0000)]
Merge "disable vpx_highbd_*_sub_pixel_avg_variance4x{4,8}_neon" into main

James Zern [Wed, 8 Mar 2023 21:54:30 +0000 (21:54 +0000)]
Merge "Optimize vpx_sum_squares_2d_i16_neon" into main

James Zern [Wed, 8 Mar 2023 21:17:17 +0000 (13:17 -0800)]
disable vpx_highbd_*_sub_pixel_avg_variance4x{4,8}_neon

vpx_highbd_8_sub_pixel_avg_variance4x4_neon
vpx_highbd_8_sub_pixel_avg_variance4x8_neon
vpx_highbd_10_sub_pixel_avg_variance4x4_neon
vpx_highbd_10_sub_pixel_avg_variance4x8_neon
vpx_highbd_12_sub_pixel_avg_variance4x4_neon
vpx_highbd_12_sub_pixel_avg_variance4x8_neon

all cause heap overflows of the form:

[ RUN      ] NEON/VpxHBDSubpelAvgVarianceTest.Ref/33
=================================================================
==535205==ERROR: AddressSanitizer: heap-buffer-overflow on address
0xffff95bb0b89 at pc 0x00000116dabc bp 0xffffd09f6430 sp 0xffffd09f6428
READ of size 8 at 0xffff95bb0b89 thread T0
    #0 0x116dab8 in load_unaligned_u16q vpx_dsp/arm/mem_neon.h:176:3
    #1 0x116dab8 in highbd_var_filter_block2d_bil_w4
       vpx_dsp/arm/highbd_subpel_variance_neon.c:49:21
    #2 0x116dab8 in vpx_highbd_8_sub_pixel_avg_variance4x4_neon
       vpx_dsp/arm/highbd_subpel_variance_neon.c:543:1
    ...

0xffff95bb0b89 is located 0 bytes to the right of 73-byte region
[0xffff95bb0b40,0xffff95bb0b89)
allocated by thread T0 here:
    #0 0x5f18b0 in malloc (test_libvpx+0x5f18b0)
    #1 0xce4a40 in vpx_memalign vpx_mem/vpx_mem.c:62:10
    #2 0xce4a40 in vpx_malloc vpx_mem/vpx_mem.c:70:40
    #3 0xa52238 in (anonymous namespace)::SubpelVarianceTest<unsigned
       int (*)(unsigned char const*, int, int, int, unsigned char
               const*, int, unsigned int*, unsigned char
               const*)>::SetUp()
       test/variance_test.cc:586:14
    ...

This is the same issue as:
  e33d4c276 disable vpx_highbd_*_sub_pixel_variance4x{4,8}_neon
They have highbd_var_filter_block2d_bil_w4 in common.

Bug: webm:1796
Change-Id: I3ed70d0ba22e127720542612ea9f6665948eedfc

James Zern [Wed, 8 Mar 2023 06:09:37 +0000 (22:09 -0800)]
disable vpx_highbd_*_sub_pixel_variance4x{4,8}_neon

vpx_highbd_8_sub_pixel_variance4x4_neon
vpx_highbd_8_sub_pixel_variance4x8_neon
vpx_highbd_10_sub_pixel_variance4x4_neon
vpx_highbd_10_sub_pixel_variance4x8_neon
vpx_highbd_12_sub_pixel_variance4x4_neon
vpx_highbd_12_sub_pixel_variance4x8_neon

all cause heap overflows of the form:

[ RUN      ] NEON/VpxHBDSubpelVarianceTest.Ref/24
=================================================================
==450528==ERROR: AddressSanitizer: heap-buffer-overflow on address
0xffff8311a571 at pc 0x0000010ca52c bp 0xffffc63e96b0 sp 0xffffc63e96a8
READ of size 8 at 0xffff8311a571 thread T0
    #0 0x10ca528 in load_unaligned_u16q vpx_dsp/arm/mem_neon.h:176:3
    #1 0x10ca528 in highbd_var_filter_block2d_bil_w4
       vpx_dsp/arm/highbd_subpel_variance_neon.c:49:21
    #2 0x10ca528 in vpx_highbd_10_sub_pixel_variance4x8_neon
       vpx_dsp/arm/highbd_subpel_variance_neon.c:257:1
    ...

0xffff8311a571 is located 0 bytes to the right of 113-byte region
[0xffff8311a500,0xffff8311a571)
allocated by thread T0 here:
    #0 0x5f18b0 in malloc (test_libvpx+0x5f18b0)
    #1 0xce4f90 in vpx_memalign vpx_mem/vpx_mem.c:62:10
    #2 0xce4f90 in vpx_malloc vpx_mem/vpx_mem.c:70:40
    #3 0xa4ad44 in (anonymous namespace)::SubpelVarianceTest<unsigned
       int (*)(unsigned char const*, int, int, int, unsigned char
       const*, int, unsigned int*)>::SetUp() test/variance_test.cc:586:14

Bug: webm:1796
Change-Id: I39f7f936bae2bcbbe1f803fb10375ec02d1c1277

James Zern [Tue, 7 Mar 2023 23:40:10 +0000 (23:40 +0000)]
Merge "[SSE4_1] Fix overflow in highbd temporal_filter" into main

James Zern [Tue, 7 Mar 2023 23:00:19 +0000 (23:00 +0000)]
Merge changes I79247b5a,Ic6016cf8,Ibab7ec5f into main

* changes:
  Add Neon implementation of vp9_block_error_c
  Fix return type of horizontal_add_int64x2 helper
  Optimize vp9_block_error_fp_neon

James Zern [Tue, 7 Mar 2023 22:48:54 +0000 (22:48 +0000)]
Merge changes Ic021e82e,I2bce6f19,I250ab56e,I910692b1,Iefaa774d into main

* changes:
  Implement highbd_d207_predictor using Neon
  Implement highbd_d153_predictor using Neon
  Implement d207_predictor using Neon
  Implement d153_predictor using Neon
  Implement highbd_d63_predictor using Neon

Yunqing Wang [Tue, 7 Mar 2023 16:40:52 +0000 (16:40 +0000)]
Merge "Add AVX2 for vpx_filter_block1d8_h8() function" into main

Yunqing Wang [Tue, 7 Mar 2023 16:37:22 +0000 (16:37 +0000)]
Merge "Use cb pattern for interp eval when filter is not switchable" into main

Yunqing Wang [Tue, 7 Mar 2023 16:35:18 +0000 (16:35 +0000)]
Merge "Early terminate interp filt search based on best RD cost" into main

Anupam Pandey [Thu, 2 Mar 2023 05:28:27 +0000 (10:58 +0530)]
Add AVX2 for vpx_filter_block1d8_h8() function

Introduce AVX2 intrinsics to compute the horizontal convolution for the
w = 8 case. This is a bit-exact change.

                 Instruction Count
cpu   Resolution   Reduction(%)
 0       LOWRES2      1.509
 0       MIDRES2      1.165
 0        HDRES2      0.898
 0       Average      1.191

Change-Id: I699c94aa3d7ea74c58f901df906eed0b81b4ee79

Salome Thirot [Mon, 6 Mar 2023 11:37:26 +0000 (11:37 +0000)]
Add Neon implementation of vp9_block_error_c

Add Neon implementation of vp9_block_error_c as well as the
corresponding tests.

Change-Id: I79247b5ae24f51b7b55fc5e517d5e403dc86367a

Salome Thirot [Fri, 3 Mar 2023 10:53:27 +0000 (10:53 +0000)]
Fix return type of horizontal_add_int64x2 helper

horizontal_add_int64x2 was incorrectly returning a uint64_t instead of
an int64_t. This patch fixes that.
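
Why the return type matters can be shown with a scalar stand-in for the
two-lane vector. The struct and both function names are hypothetical; the
second function exists only to show what an unsigned return type does to a
negative sum.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for a two-lane signed 64-bit vector. */
typedef struct {
  int64_t lane[2];
} int64x2_sketch;

/* Signed return type: a negative total stays negative. */
static int64_t horizontal_add_int64x2_sketch(int64x2_sketch v) {
  return v.lane[0] + v.lane[1];
}

/* Same sum through an unsigned return type, for comparison only. */
static uint64_t horizontal_add_as_u64_sketch(int64x2_sketch v) {
  return (uint64_t)(v.lane[0] + v.lane[1]);
}
```

For lanes {-5, 2} the signed helper returns -3, while the unsigned variant
returns UINT64_MAX - 2, which a caller that compares, divides or converts to
floating point would treat as a very large positive value.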

Change-Id: Ic6016cf87aebfc6a14f540b784d6648757e12b49

Salome Thirot [Wed, 1 Mar 2023 10:06:01 +0000 (10:06 +0000)]
Optimize vp9_block_error_fp_neon

Currently vp9_block_error_fp_neon is only used when
CONFIG_VP9_HIGHBITDEPTH is set to false. This patch optimizes the
implementation and uses tran_low_t instead of int16_t so that the
function can also be used in builds where vp9_highbitdepth is enabled.
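
A scalar sketch of why the coefficient type is the deciding factor. The
typedef is assumed to follow the usual libvpx convention (32-bit
coefficients when high bitdepth is enabled, 16-bit otherwise); the function
name and the default chosen for the macro are assumptions for this example.

```c
#include <assert.h>
#include <stdint.h>

#ifndef CONFIG_VP9_HIGHBITDEPTH
#define CONFIG_VP9_HIGHBITDEPTH 1 /* assumed default for this sketch */
#endif

#if CONFIG_VP9_HIGHBITDEPTH
typedef int32_t tran_low_t; /* wider coefficients in high-bitdepth builds */
#else
typedef int16_t tran_low_t;
#endif

/* Hypothetical scalar reference: sum of squared coefficient differences.
 * Written against tran_low_t, it compiles unchanged in both build modes. */
static int64_t block_error_fp_sketch(const tran_low_t *coeff,
                                     const tran_low_t *dqcoeff,
                                     int block_size) {
  assert(block_size >= 0);
  int64_t error = 0;
  for (int i = 0; i < block_size; ++i) {
    const int64_t diff = (int64_t)coeff[i] - dqcoeff[i];
    error += diff * diff;
  }
  return error;
}
```

A version taking const int16_t * could not be called with the 32-bit
coefficient buffers of a high-bitdepth build, which is the restriction this
patch removes.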

Change-Id: Ibab7ec5f74b7652fa2ae5edf328f9ec587088fd3

Neeraj Gadgil [Wed, 1 Mar 2023 15:11:37 +0000 (20:41 +0530)]
Use cb pattern for interp eval when filter is not switchable

This CL uses a checkerboard pattern for interp filter eval when
the filter is not switchable.

                 Instruction Count        BD-Rate Loss(%)
cpu   Resolution   Reduction(%)    avg.psnr   ovr.psnr    ssim
 0       LOWRES2      0.725         0.0017    -0.0000    0.0192
 0       MIDRES2      0.968         0.0004     0.0504    0.0810
 0        HDRES2      1.135         0.0089     0.0130    0.0113
 0       Average      0.943         0.0037     0.0211    0.0372

STATS_CHANGED

Change-Id: Ia713e5170101302f264ffaa2350bc0ab15c27090

Neeraj Gadgil [Wed, 1 Mar 2023 10:05:57 +0000 (15:35 +0530)]
Early terminate interp filt search based on best RD cost

The CL prunes interpolation filter search based on rdcost of
individual planes.

                 Instruction Count        BD-Rate Loss(%)
cpu   Resolution   Reduction(%)    avg.psnr   ovr.psnr    ssim
 0       LOWRES2      1.613         0.0143     0.0208    0.0146
 0       MIDRES2      1.637         0.0214    -0.0316    0.0036
 0        HDRES2      1.369         0.0171     0.0178    0.1222
 0       Average      1.539         0.0176     0.0023    0.0468

STATS_CHANGED

Change-Id: I4be30bd1c7bbbc93c6bbc840565893a97d2598a4

James Zern [Tue, 7 Mar 2023 05:45:53 +0000 (05:45 +0000)]
Merge "Fix heap buffer overrun in vpx_get4x4sse_cs_neon" into main

James Zern [Tue, 7 Mar 2023 01:28:15 +0000 (01:28 +0000)]
Merge changes I05dc4d43,Ia0977ff0 into main

* changes:
  Fix potential buffer over-read in highbd d117 predictor Neon
  Implement d117_predictor using Neon

Jonathan Wright [Fri, 3 Mar 2023 23:42:50 +0000 (23:42 +0000)]
Fix heap buffer overrun in vpx_get4x4sse_cs_neon

Use a mem_neon.h helper to do strided 4-byte loads instead of Neon
8-byte loads - where the last 4 bytes are out of bounds.
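
The idea can be sketched portably: read exactly 4 bytes per row, so a 4x4
block stored with a stride of 4 never touches memory past its 16 bytes. The
function names are hypothetical and memcpy stands in for the mem_neon.h
helper, whose exact signature is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical gather: four strided 4-byte reads into 16 contiguous bytes. */
static void load_4x4_sketch(const uint8_t *src, int stride, uint8_t out[16]) {
  assert(src != NULL);
  for (int r = 0; r < 4; ++r) {
    memcpy(out + 4 * r, src + r * stride, 4); /* never reads past column 3 */
  }
}

/* Hypothetical scalar reference: sum of squared differences of two 4x4
 * blocks. */
static unsigned int get4x4sse_sketch(const uint8_t *a, int a_stride,
                                     const uint8_t *b, int b_stride) {
  uint8_t x[16], y[16];
  unsigned int sse = 0;
  load_4x4_sketch(a, a_stride, x);
  load_4x4_sketch(b, b_stride, y);
  for (int i = 0; i < 16; ++i) {
    const int d = x[i] - y[i];
    sse += (unsigned int)(d * d);
  }
  return sse;
}
```

An 8-byte load on the last row of a 16-byte buffer would start at offset 12
and end at offset 20, 4 bytes past the end.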

Re-enable the Neon code path and the tests.

Bug: webm:1794
Change-Id: I69ccff730f4a5cbf585dd6a9aa0f3eb13e150074

James Zern [Mon, 6 Mar 2023 21:56:17 +0000 (13:56 -0800)]
vpx_convolve_copy_neon: fix unaligned loads w/w==4

Fixes a -fsanitize=undefined warning:

vpx_dsp/arm/vpx_convolve_copy_neon.c:29:26: runtime error: load of
misaligned address 0xffffa8242bea for type 'const uint32_t' (aka 'const
unsigned int'), which requires 4 byte alignment
0xffffa8242bea: note: pointer points here
 88 81  7d 7d 7d 7d 7d 81 81 7d  81 80 87 97 a8 ab a0 91 ...
              ^
    #0 0xb0447c in vpx_convolve_copy_neon
       vpx_dsp/arm/vpx_convolve_copy_neon.c:29:26
    #1 0x12285c8 in inter_predictor vp9/common/vp9_reconinter.h:29:3
    #2 0x1228430 in dec_build_inter_predictors
       vp9/decoder/vp9_decodeframe.c
    ...
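
A common way to avoid this class of warning, sketched in portable C:
copy the 4 bytes with memcpy instead of dereferencing a uint32_t pointer
that may not be 4-byte aligned. This is not necessarily how the Neon code
was changed; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 4 bytes from an address with any alignment. */
static uint32_t load_u32_unaligned_sketch(const uint8_t *p) {
  uint32_t v;
  assert(p != NULL);
  memcpy(&v, p, sizeof(v)); /* no alignment requirement on p */
  return v;
}

/* Hypothetical width-4 copy built on the helper above. */
static void copy_w4_sketch(const uint8_t *src, int src_stride, uint8_t *dst,
                           int dst_stride, int h) {
  for (int y = 0; y < h; ++y) {
    const uint32_t row = load_u32_unaligned_sketch(src);
    memcpy(dst, &row, sizeof(row));
    src += src_stride;
    dst += dst_stride;
  }
}
```

Compilers typically lower a fixed-size 4-byte memcpy to a single load, so
the defined-behaviour version need not cost more than the pointer cast.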

Change-Id: Iaec4ac2a400b6e6db72d12e5a7acb316262b12a7

Jonathan Wright [Mon, 6 Mar 2023 17:52:13 +0000 (17:52 +0000)]
Optimize vpx_sum_squares_2d_i16_neon

Add an additional 32-bit vector accumulator to allow parallel
processing on CPUs that have more than one Neon multiply-accumulate
pipeline. Also use sum_neon.h horizontal-add helpers for reduction.
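
The accumulator split can be illustrated in scalar C. This is a sketch of
the general technique only: the function name is hypothetical, and 64-bit
scalar accumulators are used here for simplicity instead of the 32-bit
vector accumulators the patch describes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar sketch: sum of squares with two independent
 * accumulators, so consecutive multiply-accumulates do not form one long
 * dependency chain. */
static uint64_t sum_squares_i16_sketch(const int16_t *src, int n) {
  uint64_t acc0 = 0, acc1 = 0;
  int i = 0;
  assert(src != 0 && n >= 0);
  for (; i + 1 < n; i += 2) {
    acc0 += (uint64_t)((int32_t)src[i] * src[i]);
    acc1 += (uint64_t)((int32_t)src[i + 1] * src[i + 1]);
  }
  if (i < n) acc0 += (uint64_t)((int32_t)src[i] * src[i]); /* odd tail */
  return acc0 + acc1; /* final reduction combines the partial sums */
}
```

Each accumulator only depends on its own previous value, which is what lets
a core with two multiply-accumulate pipelines work on both at once.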

Change-Id: Ibcb48a738f5dee1430c3ebcd305b5ea8ea344c40

George Steed [Thu, 23 Feb 2023 16:25:38 +0000 (16:25 +0000)]
Implement highbd_d207_predictor using Neon

Add Neon implementations of the highbd d207 predictor for 4x4, 8x8,
16x16 and 32x32 block sizes. Also update tests to add new corresponding
cases.

Speedups over the C code (higher is better):

Microarch.  | Compiler | Block | Speedup
Neoverse N1 |  LLVM 15 |   4x4 |    1.61
Neoverse N1 |  LLVM 15 |   8x8 |    5.30
Neoverse N1 |  LLVM 15 | 16x16 |    8.93
Neoverse N1 |  LLVM 15 | 32x32 |    8.35
Neoverse N1 |   GCC 12 |   4x4 |    2.16
Neoverse N1 |   GCC 12 |   8x8 |    5.75
Neoverse N1 |   GCC 12 | 16x16 |    7.28
Neoverse N1 |   GCC 12 | 32x32 |    3.31
Neoverse V1 |  LLVM 15 |   4x4 |    1.71
Neoverse V1 |  LLVM 15 |   8x8 |    7.46
Neoverse V1 |  LLVM 15 | 16x16 |   10.09
Neoverse V1 |  LLVM 15 | 32x32 |    8.10
Neoverse V1 |   GCC 12 |   4x4 |    1.99
Neoverse V1 |   GCC 12 |   8x8 |    7.81
Neoverse V1 |   GCC 12 | 16x16 |    8.34
Neoverse V1 |   GCC 12 | 32x32 |    5.74

Change-Id: Ic021e82eed0c7bc8263eb68606411354eb5e4870

George Steed [Wed, 22 Feb 2023 15:33:37 +0000 (15:33 +0000)]
Implement highbd_d153_predictor using Neon

Add Neon implementations of the highbd d153 predictor for 4x4, 8x8,
16x16 and 32x32 block sizes. Also update tests to add new corresponding
cases.

Speedups over the C code (higher is better):

Microarch.  | Compiler | Block | Speedup
Neoverse N1 |  LLVM 15 |   4x4 |    1.71
Neoverse N1 |  LLVM 15 |   8x8 |    4.05
Neoverse N1 |  LLVM 15 | 16x16 |    7.04
Neoverse N1 |  LLVM 15 | 32x32 |    7.71
Neoverse N1 |   GCC 12 |   4x4 |    1.84
Neoverse N1 |   GCC 12 |   8x8 |    4.19
Neoverse N1 |   GCC 12 | 16x16 |    6.07
Neoverse N1 |   GCC 12 | 32x32 |    3.14
Neoverse V1 |  LLVM 15 |   4x4 |    3.19
Neoverse V1 |  LLVM 15 |   8x8 |    5.51
Neoverse V1 |  LLVM 15 | 16x16 |    7.73
Neoverse V1 |  LLVM 15 | 32x32 |    7.72
Neoverse V1 |   GCC 12 |   4x4 |    3.97
Neoverse V1 |   GCC 12 |   8x8 |    5.52
Neoverse V1 |   GCC 12 | 16x16 |    6.31
Neoverse V1 |   GCC 12 | 32x32 |    5.36

Change-Id: I2bce6f1921d76d1c10d163e0cd4f395b40799184

George Steed [Mon, 6 Mar 2023 13:24:47 +0000 (13:24 +0000)]
Fix potential buffer over-read in highbd d117 predictor Neon

The load of `left[bs]` in the standard bitdepth d117 Neon implementation
triggered an address-sanitizer failure.

The highbd equivalent does not appear to trigger any asan failures when
running the VP9/ExternalFrameBufferMD5Test or
VP9/TestVectorTest.MD5Match tests, but for consistency with the standard
bitdepth implementation we adjust it to avoid the over-read.

Performance is roughly identical, with a 0.8% performance improvement on
average over the previous optimised code.

Change-Id: I05dc4d43f244f4915c0ccc52cc0af999bbacb018

George Steed [Tue, 14 Feb 2023 14:56:25 +0000 (14:56 +0000)]
Implement d207_predictor using Neon

Add Neon implementations of the d207 predictor for 4x4, 8x8, 16x16 and
32x32 block sizes. Also update tests to add new corresponding cases.

Speedups over the C code (higher is better):

Microarch.  | Compiler | Block | Speedup
Neoverse N1 |  LLVM 15 |   4x4 |    1.72
Neoverse N1 |  LLVM 15 |   8x8 |    5.68
Neoverse N1 |  LLVM 15 | 16x16 |   12.30
Neoverse N1 |  LLVM 15 | 32x32 |   16.70
Neoverse N1 |   GCC 12 |   4x4 |    1.71
Neoverse N1 |   GCC 12 |   8x8 |    6.01
Neoverse N1 |   GCC 12 | 16x16 |   12.40
Neoverse N1 |   GCC 12 | 32x32 |    6.71
Neoverse V1 |  LLVM 15 |   4x4 |    1.99
Neoverse V1 |  LLVM 15 |   8x8 |    8.28
Neoverse V1 |  LLVM 15 | 16x16 |   14.36
Neoverse V1 |  LLVM 15 | 32x32 |   17.55
Neoverse V1 |   GCC 12 |   4x4 |    1.99
Neoverse V1 |   GCC 12 |   8x8 |    8.43
Neoverse V1 |   GCC 12 | 16x16 |   14.41
Neoverse V1 |   GCC 12 | 32x32 |    7.82

Change-Id: I250ab56edab3390b0bac9dc96995a4bf9a4da641

George Steed [Mon, 6 Mar 2023 09:27:41 +0000 (09:27 +0000)]
Implement d117_predictor using Neon

Add Neon implementations of the d117 predictor for 4x4, 8x8, 16x16 and
32x32 block sizes. Also update tests to add new corresponding cases.

This re-lands commit 360e9069b6cc1dd3a004728b876fb923413f4b11,
previously reverted in commit 394de691a0ef570fc49943f565ad53ee0d22a7f3.

The implementation is mostly identical to the original but with an
adjustment to how data is loaded from the `left` array. In particular
the left array cannot be guaranteed to be larger than the block size, so
the read of e.g. `left[32]` in the `bs=32` case is not valid. This turns
out not to be a problem since the last lane loaded in this case is
unused. I have added comments in the code to explain why this is the
case.

Since we cannot load the last element directly, we instead construct it
from the previous aligned read. This seems to have an inconsistent
effect on performance, improving by up to 10% in some cases and
regressing by up to 10% in others. Either way it is still significantly
faster than the original C code.

Speedups over the C code (higher is better):

Microarch.  | Compiler | Block | Speedup
Neoverse N1 |  LLVM 15 |   4x4 |    1.88
Neoverse N1 |  LLVM 15 |   8x8 |    5.19
Neoverse N1 |  LLVM 15 | 16x16 |    9.63
Neoverse N1 |  LLVM 15 | 32x32 |   13.85
Neoverse N1 |   GCC 12 |   4x4 |    2.04
Neoverse N1 |   GCC 12 |   8x8 |    4.62
Neoverse N1 |   GCC 12 | 16x16 |    9.79
Neoverse N1 |   GCC 12 | 32x32 |    4.69
Neoverse V1 |  LLVM 15 |   4x4 |    1.75
Neoverse V1 |  LLVM 15 |   8x8 |    6.71
Neoverse V1 |  LLVM 15 | 16x16 |    9.62
Neoverse V1 |  LLVM 15 | 32x32 |   13.81
Neoverse V1 |   GCC 12 |   4x4 |    1.75
Neoverse V1 |   GCC 12 |   8x8 |    6.01
Neoverse V1 |   GCC 12 | 16x16 |    6.91
Neoverse V1 |   GCC 12 | 32x32 |    4.39

Change-Id: Ia0977ff0b0eba2c41c7884b64e7c22ff9bc9549d

George Steed [Thu, 9 Feb 2023 16:12:59 +0000 (16:12 +0000)]
Implement d153_predictor using Neon

Add Neon implementations of the d153 predictor for 4x4, 8x8, 16x16 and
32x32 block sizes. Also update tests to add new corresponding cases.

Speedups over the C code (higher is better):

Microarch.  | Compiler | Block | Speedup
Neoverse N1 |  LLVM 15 |   4x4 |    1.59
Neoverse N1 |  LLVM 15 |   8x8 |    4.46
Neoverse N1 |  LLVM 15 | 16x16 |    8.77
Neoverse N1 |  LLVM 15 | 32x32 |   15.21
Neoverse N1 |   GCC 12 |   4x4 |    1.90
Neoverse N1 |   GCC 12 |   8x8 |    4.70
Neoverse N1 |   GCC 12 | 16x16 |    9.55
Neoverse N1 |   GCC 12 | 32x32 |    5.95
Neoverse V1 |  LLVM 15 |   4x4 |    2.89
Neoverse V1 |  LLVM 15 |   8x8 |    6.94
Neoverse V1 |  LLVM 15 | 16x16 |   10.20
Neoverse V1 |  LLVM 15 | 32x32 |   15.63
Neoverse V1 |   GCC 12 |   4x4 |    4.45
Neoverse V1 |   GCC 12 |   8x8 |    7.71
Neoverse V1 |   GCC 12 | 16x16 |    9.08
Neoverse V1 |   GCC 12 | 32x32 |    7.93

Change-Id: I910692b14917cde8a8952fab5b9c78bed7f7c6ad

George Steed [Wed, 1 Mar 2023 22:44:38 +0000 (22:44 +0000)]
Implement highbd_d63_predictor using Neon

Add Neon implementations of the highbd d63 predictor for 4x4, 8x8, 16x16
and 32x32 block sizes. Also update tests to add new corresponding cases.

This re-lands commit 7cdf139e3d6237386e0f93bdb0bdc1b459c663bf,
previously reverted in 7478b7e4e481562a4a13f233acb66a60462e1934.

Compared to the previous implementation attempt we now correctly match
the behaviour of the C code when handling the final element loaded from
the 'above' input array. In particular:

- The C code for a 4x4 block performs a full average of the last element
  rather than duplicating the final element from the input 'above'
  array.

- The C code for other block sizes performs a full average for the
  stride=0 and stride=1, and otherwise shifts in duplicates of the final
  element from the input 'above' array. Notably this shifting for later
  strides _replaces_ the final element which we previously performed an
  average on (see {d0,d1}_ext in the code).

It is worth noting that this difference is not caught by the existing
VP9HighbdIntraPredTest test cases since the test vector initialisation
contains this loop:

    for (int x = block_size; x < 2 * block_size; x++) {
        above_row_[x] = above_row_[block_size - 1];
    }

Since AVG2(a, a) and AVG3(a, a, a) are simply 'a', such differences in
behaviour for the final element are not observed.
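
The identity can be checked directly. The macro bodies below are assumed to
be the usual rounding averages; the checking function is hypothetical and
only illustrates the argument above.

```c
#include <assert.h>

/* Assumed rounding-average definitions. */
#define AVG2(a, b) (((a) + (b) + 1) >> 1)
#define AVG3(a, b, c) (((a) + 2 * (b) + (c) + 2) >> 2)

/* Returns 1 if averaging equal inputs gives the input back for every value
 * in [0, max_value]. */
static int avg_identity_holds_sketch(int max_value) {
  assert(max_value >= 0);
  for (int a = 0; a <= max_value; ++a) {
    if (AVG2(a, a) != a || AVG3(a, a, a) != a) return 0;
  }
  return 1;
}
```

With the padding filled by copies of the last element, a full average and a
plain duplicate give the same output, so the two behaviours are
indistinguishable. With distinct padding values, e.g. AVG2(10, 13) == 12,
they would differ.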

Tested on AArch64 with:

- ./test_libvpx --gtest_filter="*VP9HighbdIntraPredTest*"
- ./test_libvpx --gtest_filter="*VP9/TestVectorTest.MD5Match*"
- ./test_libvpx --gtest_filter="*VP9/ExternalFrameBufferMD5Test*"

Speedups over the C code (higher is better):

Microarch.  | Compiler | Block | Speedup
Neoverse N1 |  LLVM 15 |   4x4 |    2.43
Neoverse N1 |  LLVM 15 |   8x8 |    3.92
Neoverse N1 |  LLVM 15 | 16x16 |    3.19
Neoverse N1 |  LLVM 15 | 32x32 |    4.13
Neoverse N1 |   GCC 12 |   4x4 |    2.92
Neoverse N1 |   GCC 12 |   8x8 |    6.51
Neoverse N1 |   GCC 12 | 16x16 |    4.55
Neoverse N1 |   GCC 12 | 32x32 |    3.18
Neoverse V1 |  LLVM 15 |   4x4 |    1.99
Neoverse V1 |  LLVM 15 |   8x8 |    3.65
Neoverse V1 |  LLVM 15 | 16x16 |    3.72
Neoverse V1 |  LLVM 15 | 32x32 |    3.26
Neoverse V1 |   GCC 12 |   4x4 |    2.39
Neoverse V1 |   GCC 12 |   8x8 |    4.76
Neoverse V1 |   GCC 12 | 16x16 |    3.24
Neoverse V1 |   GCC 12 | 32x32 |    2.44

Change-Id: Iefaa774d6a20388b523eaa7f5df6bc5f5cf249e4

Johann [Sat, 5 Nov 2022 00:53:07 +0000 (09:53 +0900)]
reland: quantize: simplify 32x32_b args

Allocate mb_plane_ on the heap to ensure src is aligned.

Now that all the implementations of the 32x32 quantize are in
intrinsics we can reference struct members directly. Saves
pushing them to the stack.

n_coeffs is not used at all for this function.

Change-Id: Ib551f7f583977602504d962b72063bc6eda9dda9

James Zern [Fri, 3 Mar 2023 23:33:16 +0000 (15:33 -0800)]
disable vp8_sixtap_predict16x16_neon

This causes various buffer overflows in the tests:

[ RUN      ] NEON/SixtapPredictTest.TestWithPresetData/0
=================================================================
==22346==ERROR: AddressSanitizer: global-buffer-overflow on address
0x0000012b4a5b at pc 0x000000df0f60 bp 0xffffcf6e64b0 sp 0xffffcf6e64a8
READ of size 8 at 0x0000012b4a5b thread T0
    #0 0xdf0f5c in vp8_sixtap_predict16x16_neon
       vp8/common/arm/neon/sixtappredict_neon.c:1507:13
    #1 0x8819e4 in (anonymous
        namespace)::SixtapPredictTest_TestWithPresetData_Test::TestBody()
       test/predict_test.cc:293:3
    ...

0x0000012b4a5b is located 2 bytes to the right of global variable
'kTestData' defined in '../test/predict_test.cc:237:24' (0x12b48a0) of
size 441

[ RUN      ] NEON/SixtapPredictTest.TestWithRandomData/0
=================================================================
==22338==ERROR: AddressSanitizer: heap-buffer-overflow on address
0xffff8b5321fb at pc 0x000000df0f60 bp 0xfffff7e0cf30 sp 0xfffff7e0cf28
READ of size 8 at 0xffff8b5321fb thread T0
    #0 0xdf0f5c in vp8_sixtap_predict16x16_neon
       vp8/common/arm/neon/sixtappredict_neon.c:1507:13
    #1 0x87d4c0 in (anonymous
       namespace)::PredictTestBase::TestWithRandomData(void (*)(unsigned
       char*, int, int, int, unsigned char*, int))
       test/predict_test.cc:170:9
    ...

0xffff8b5321fb is located 2 bytes to the right of 441-byte region
[0xffff8b532040,0xffff8b5321f9)
allocated by thread T0 here:
    #0 0x5fd4f0 in operator new[](unsigned long) (test_libvpx+0x5fd4f0)
    #1 0x87c2e0 in (anonymous namespace)::PredictTestBase::SetUp()
       test/predict_test.cc:47:12
    #2 0x87d074 in non-virtual thunk to (anonymous
       namespace)::PredictTestBase::SetUp() test/predict_test.cc
    ...

Bug: webm:1795
Change-Id: I32213a381eef91547d00f88acf90f1cf2ec2ea75

16 months agodisable vpx_get4x4sse_cs_neon
James Zern [Fri, 3 Mar 2023 20:56:29 +0000 (20:56 +0000)]
disable vpx_get4x4sse_cs_neon

This function causes a heap overflow in the tests:
[ RUN      ] NEON/VpxSseTest.RefSse/0
=================================================================
==876922==ERROR: AddressSanitizer: heap-buffer-overflow on address
0xffff8949d903 at pc 0x000000dd95d4 bp 0xfffffdd7f260 sp 0xfffffdd7f258
READ of size 8 at 0xffff8949d903 thread T0
    #0 0xdd95d0 in vpx_get4x4sse_cs_neon
       vpx_dsp/arm/variance_neon.c:556:10
    #1 0x9d4894 in (anonymous namespace)::MainTestClass<unsigned int
       (*)(unsigned char const*, int, unsigned char const*,
           int)>::RefTestSse() test/variance_test.cc:531:5
    #2 0x9d4894 in (anonymous
       namespace)::VpxSseTest_RefSse_Test::TestBody()
           test/variance_test.cc:772:30
    ...

0xffff8949d903 is located 3 bytes to the right of 16-byte region
[0xffff8949d8f0,0xffff8949d900)
allocated by thread T0 here:
    #0 0x5fd050 in operator new[](unsigned long) (test_libvpx+0x5fd050)
    #1 0x9d3e04 in (anonymous namespace)::MainTestClass<unsigned int
       (*)(unsigned char const*, int, unsigned char const*,
           int)>::SetUp() test/variance_test.cc:299:12

Bug: webm:1794
Change-Id: I4bc681eb9a436743ef8bfe2a2abae59ce754309c

16 months agoRevert "Implement d117_predictor using Neon"
James Zern [Fri, 3 Mar 2023 20:34:36 +0000 (12:34 -0800)]
Revert "Implement d117_predictor using Neon"

This reverts commit 360e9069b6cc1dd3a004728b876fb923413f4b11.

This causes ASan errors:
[ RUN      ] VP9/TestVectorTest.MD5Match/1
=================================================================
==837858==ERROR: AddressSanitizer: stack-buffer-overflow on address
0xffff82ecad40 at pc 0x000000c494d4 bp 0xffffe1695800 sp 0xffffe16957f8
READ of size 16 at 0xffff82ecad40 thread T0
    #0 0xc494d0 in vpx_d117_predictor_32x32_neon (test_libvpx+0xc494d0)
    #1 0x1040b34 in vp9_predict_intra_block (test_libvpx+0x1040b34)
    #2 0xf8feec in decode_block (test_libvpx+0xf8feec)
    #3 0xf8f588 in decode_partition (test_libvpx+0xf8f588)
    #4 0xf7be5c in vp9_decode_frame (test_libvpx+0xf7be5c)
    ...
Address 0xffff82ecad40 is located in stack of thread T0 at offset 64 in
frame
    #0 0x103fd3c in vp9_predict_intra_block (test_libvpx+0x103fd3c)

  This frame has 2 object(s):
    [32, 64) 'left_col.i' <== Memory access at offset 64 overflows this
                              variable
    [96, 176) 'above_data.i'

Change-Id: I058213364617dfe1036126c33a3307f8288d9ae0

16 months agoRevert "Allow macroblock_plane to have its own rounding buffer"
Johann [Fri, 3 Mar 2023 03:46:01 +0000 (12:46 +0900)]
Revert "Allow macroblock_plane to have its own rounding buffer"

This reverts commit 5359ae810cdbb974060297ecf935183baf7b009b.

Reason for revert: Blocks quantize cleanups

Original change's description:
> Allow macroblock_plane to have its own rounding buffer
>
> Add 8 bytes buffer to macroblock_plane to support rounding factor.
>
> Change-Id: I3751689e4449c0caea28d3acf6cd17d7f39508ed

Change-Id: Ia2424d2114207370f0b45350313a5ff8521d25a8

16 months ago[SSE4_1] Fix overflow in highbd temporal_filter
Konstantinos Margaritis [Wed, 1 Mar 2023 23:54:51 +0000 (23:54 +0000)]
[SSE4_1] Fix overflow in highbd temporal_filter

While porting this function to NEON, using the SSE4_1 implementation
as a base, I noticed that both produced files whose checksums differ
from those of the C reference implementation. After investigating
further I found that the saturating pack was the culprit. Doing the
multiplication on the 32-bit values instead produces results that
match the C implementation.
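
As a rough illustration of why the pack matters, the scalar C sketch
below models a single lane. It is not the libvpx code: the names
pack_saturate_u16, scale_after_pack and scale_in_32bit are hypothetical
and the input values are made up. It only shows that clamping an
intermediate to 16 bits before the multiply changes the result once the
intermediate exceeds 65535.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of an unsigned saturating pack from 32 to 16 bits, i.e. the
 * per-lane behaviour of an instruction such as _mm_packus_epi32.
 * Hypothetical helper, not taken from libvpx. */
static uint16_t pack_saturate_u16(int32_t v) {
  if (v < 0) return 0;
  if (v > UINT16_MAX) return UINT16_MAX;
  return (uint16_t)v;
}

/* Pack first, multiply second: anything above 65535 is clamped before it is
 * scaled, so information is lost. */
static uint32_t scale_after_pack(int32_t sum, uint32_t factor) {
  return (uint32_t)pack_saturate_u16(sum) * factor;
}

/* Multiply the full 32-bit value; no clamping happens before the scale. */
static uint32_t scale_in_32bit(int32_t sum, uint32_t factor) {
  return (uint32_t)sum * factor;
}
```

With an intermediate of 70000 and a factor of 3, the first helper
scales the clamped value 65535 while the second scales 70000. For
intermediates that fit in 16 bits the two agree.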

Change-Id: I40c2a36551b2db363a58ea9aa19ef327f2676de3

16 months agoRevert "quantize: simplify 32x32_b args"
James Zern [Wed, 1 Mar 2023 23:53:18 +0000 (15:53 -0800)]
Revert "quantize: simplify 32x32_b args"

This reverts commit 848f6e733789c627b6606baf1c85e32be997e36f.

This has alignment issues, causing crashes in the tests:
SSSE3/VP9QuantizeTest.EOBCheck/*

Change-Id: Ic12014ab0a78ed3cde02d642509061552cdc8fc9

16 months agoRevert "quantize: simplifly highbd 32x32_b args"
James Zern [Wed, 1 Mar 2023 23:53:14 +0000 (15:53 -0800)]
Revert "quantize: simplifly highbd 32x32_b args"

This reverts commit 573f5e662b544dbc553d73fa2b61055c30dfe8cc.

This has alignment issues, causing crashes in the tests:
SSSE3/VP9QuantizeTest.EOBCheck/*

Change-Id: Ibf05e6b116c46f6e2c11187b3e3578bbd2d2c227

16 months agoRevert "quantize: use scan_order instead of passing scan/iscan"
James Zern [Wed, 1 Mar 2023 23:52:20 +0000 (15:52 -0800)]
Revert "quantize: use scan_order instead of passing scan/iscan"

This reverts commit 14fc40040ff30486c45111056db44ee18590a24a.

This has alignment issues, causing crashes in the tests:
SSSE3/VP9QuantizeTest.EOBCheck/*

Change-Id: I934f9a4c3ce3db33058a65180fa645c8649c3670

16 months agoMerge "Optimize Neon implementation of high bitdepth MSE functions" into main
James Zern [Wed, 1 Mar 2023 23:13:34 +0000 (23:13 +0000)]
Merge "Optimize Neon implementation of high bitdepth MSE functions" into main

16 months agoRevert "Implement highbd_d63_predictor using Neon"
James Zern [Wed, 1 Mar 2023 20:14:51 +0000 (12:14 -0800)]
Revert "Implement highbd_d63_predictor using Neon"

This reverts commit 7cdf139e3d6237386e0f93bdb0bdc1b459c663bf.

This causes failures in the VP9/ExternalFrameBufferMD5Test and
VP9/TestVectorTest.MD5Match tests in both armv7 and aarch64 builds.

Change-Id: I7ac4ba0ddc70e7e7860df9f962e6658defe1cdd5

16 months agoOptimize Neon implementation of high bitdepth MSE functions
Salome Thirot [Mon, 27 Feb 2023 17:58:18 +0000 (17:58 +0000)]
Optimize Neon implementation of high bitdepth MSE functions

Currently the MSE functions just call the variance helpers but don't
actually use the computed sum. This patch adds dedicated helpers that
compute only the sse.

Add the corresponding tests as well.
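
The difference between the two kinds of helper can be sketched in plain
C. This is a scalar model under assumptions, not the Neon code from the
patch: block_sse_and_sum and block_sse_only are hypothetical names and
the signatures are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Variance-style helper: accumulates both the sum of differences and the
 * sum of squared differences over a w x h block. Hypothetical name. */
static void block_sse_and_sum(const uint16_t *src, int src_stride,
                              const uint16_t *ref, int ref_stride, int w,
                              int h, uint64_t *sse, int64_t *sum) {
  *sse = 0;
  *sum = 0;
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      const int diff = src[x] - ref[x];
      *sum += diff;
      /* Widen before squaring so the product cannot overflow int. */
      *sse += (uint64_t)((int64_t)diff * diff);
    }
    src += src_stride;
    ref += ref_stride;
  }
}

/* MSE-style helper: only the sum of squared differences is accumulated, so
 * the work spent on the running sum is avoided. Hypothetical name. */
static uint64_t block_sse_only(const uint16_t *src, int src_stride,
                               const uint16_t *ref, int ref_stride, int w,
                               int h) {
  uint64_t sse = 0;
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      const int diff = src[x] - ref[x];
      sse += (uint64_t)((int64_t)diff * diff);
    }
    src += src_stride;
    ref += ref_stride;
  }
  return sse;
}
```

Both helpers return the same sse for the same input; the second simply
never computes the sum that an MSE caller would discard.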

Change-Id: I96a8590e3410e84d77f7187344688e02efe03902

16 months agoquantize: use scan_order instead of passing scan/iscan
Johann [Mon, 14 Nov 2022 07:47:33 +0000 (16:47 +0900)]
quantize: use scan_order instead of passing scan/iscan

further reduces the arguments for the 32x32. This will be applied to the base
version as well.

Change-Id: I25a162b5248b14af53d9e20c6a7fa2a77028a6d1

16 months agoquantize: simplifly highbd 32x32_b args
Johann [Fri, 11 Nov 2022 23:23:17 +0000 (08:23 +0900)]
quantize: simplifly highbd 32x32_b args

Change-Id: I431a41279c4c4193bc70cfe819da6ea7e1d2fba1

16 months agoMerge changes I892fbd2c,Ic59df16c,I7228327b,Ib4a1a2cb into main
James Zern [Tue, 28 Feb 2023 21:50:11 +0000 (21:50 +0000)]
Merge changes I892fbd2c,Ic59df16c,I7228327b,Ib4a1a2cb into main

* changes:
  Implement highbd_d117_predictor using Neon
  Implement highbd_d63_predictor using Neon
  Implement d117_predictor using Neon
  Implement d63_predictor using Neon

16 months agoMerge "quantize: simplify 32x32_b args" into main
James Zern [Tue, 28 Feb 2023 21:40:26 +0000 (21:40 +0000)]
Merge "quantize: simplify 32x32_b args" into main

16 months agoImplement highbd_d117_predictor using Neon
George Steed [Tue, 21 Feb 2023 11:17:10 +0000 (11:17 +0000)]
Implement highbd_d117_predictor using Neon

Add Neon implementations of the highbd d117 predictor for 4x4, 8x8,
16x16 and 32x32 block sizes. Also update tests to add new corresponding
cases.

An explanation of the general implementation strategy is given in the
8x8 implementation body, and is mostly identical to the non-highbd
version.

Speedups over the C code (higher is better):

Microarch.  | Compiler | Block | Speedup
Neoverse N1 |  LLVM 15 |   4x4 |    1.99
Neoverse N1 |  LLVM 15 |   8x8 |    4.37
Neoverse N1 |  LLVM 15 | 16x16 |    6.81
Neoverse N1 |  LLVM 15 | 32x32 |    6.49
Neoverse N1 |   GCC 12 |   4x4 |    2.49
Neoverse N1 |   GCC 12 |   8x8 |    4.10
Neoverse N1 |   GCC 12 | 16x16 |    5.58
Neoverse N1 |   GCC 12 | 32x32 |    2.16
Neoverse V1 |  LLVM 15 |   4x4 |    1.99
Neoverse V1 |  LLVM 15 |   8x8 |    5.03
Neoverse V1 |  LLVM 15 | 16x16 |    6.61
Neoverse V1 |  LLVM 15 | 32x32 |    6.01
Neoverse V1 |   GCC 12 |   4x4 |    2.09
Neoverse V1 |   GCC 12 |   8x8 |    4.52
Neoverse V1 |   GCC 12 | 16x16 |    4.23
Neoverse V1 |   GCC 12 | 32x32 |    2.70

Change-Id: I892fbd2c17ac527ddc22b91acca907ffc84c5cd2

16 months agoImplement highbd_d63_predictor using Neon
George Steed [Mon, 20 Feb 2023 11:41:40 +0000 (11:41 +0000)]
Implement highbd_d63_predictor using Neon

Add Neon implementations of the highbd d63 predictor for 4x4, 8x8, 16x16
and 32x32 block sizes. Also update tests to add new corresponding cases.

Speedups over the C code (higher is better):

Microarch.  | Compiler | Block | Speedup
Neoverse N1 |  LLVM 15 |   4x4 |    2.43
Neoverse N1 |  LLVM 15 |   8x8 |    4.03
Neoverse N1 |  LLVM 15 | 16x16 |    3.07
Neoverse N1 |  LLVM 15 | 32x32 |    4.11
Neoverse N1 |   GCC 12 |   4x4 |    2.92
Neoverse N1 |   GCC 12 |   8x8 |    7.20
Neoverse N1 |   GCC 12 | 16x16 |    4.43
Neoverse N1 |   GCC 12 | 32x32 |    3.18
Neoverse V1 |  LLVM 15 |   4x4 |    1.99
Neoverse V1 |  LLVM 15 |   8x8 |    3.66
Neoverse V1 |  LLVM 15 | 16x16 |    3.60
Neoverse V1 |  LLVM 15 | 32x32 |    3.29
Neoverse V1 |   GCC 12 |   4x4 |    2.39
Neoverse V1 |   GCC 12 |   8x8 |    4.76
Neoverse V1 |   GCC 12 | 16x16 |    3.29
Neoverse V1 |   GCC 12 | 32x32 |    2.43

Change-Id: Ic59df16ceeb468003754b4374be2f4d9af6589e4

16 months agoImplement d117_predictor using Neon
George Steed [Tue, 7 Feb 2023 12:16:00 +0000 (12:16 +0000)]
Implement d117_predictor using Neon

Add Neon implementations of the d117 predictor for 4x4, 8x8, 16x16 and
32x32 block sizes. Also update tests to add new corresponding cases.

An explanation of the general implementation strategy is given in the
8x8 implementation body.

Speedups over the C code (higher is better):

Microarch.  | Compiler | Block | Speedup
Neoverse N1 |  LLVM 15 |   4x4 |    1.73
Neoverse N1 |  LLVM 15 |   8x8 |    5.24
Neoverse N1 |  LLVM 15 | 16x16 |    9.77
Neoverse N1 |  LLVM 15 | 32x32 |   14.13
Neoverse N1 |   GCC 12 |   4x4 |    2.04
Neoverse N1 |   GCC 12 |   8x8 |    4.70
Neoverse N1 |   GCC 12 | 16x16 |    8.64
Neoverse N1 |   GCC 12 | 32x32 |    4.57
Neoverse V1 |  LLVM 15 |   4x4 |    1.75
Neoverse V1 |  LLVM 15 |   8x8 |    6.79
Neoverse V1 |  LLVM 15 | 16x16 |    9.16
Neoverse V1 |  LLVM 15 | 32x32 |   14.47
Neoverse V1 |   GCC 12 |   4x4 |    1.75
Neoverse V1 |   GCC 12 |   8x8 |    6.00
Neoverse V1 |   GCC 12 | 16x16 |    7.63
Neoverse V1 |   GCC 12 | 32x32 |    4.32

Change-Id: I7228327b5be27ee7a68deecafa05be0bd2a40ff4

16 months agoImplement d63_predictor using Neon
George Steed [Fri, 3 Feb 2023 17:12:46 +0000 (17:12 +0000)]
Implement d63_predictor using Neon

Add Neon implementations of the d63 predictor for 4x4, 8x8, 16x16 and
32x32 block sizes. Also update tests to add new corresponding cases.

Speedups over the C code (higher is better):

Microarch.  | Compiler | Block | Speedup
Neoverse N1 |  LLVM 15 |   4x4 |    2.10
Neoverse N1 |  LLVM 15 |   8x8 |    4.45
Neoverse N1 |  LLVM 15 | 16x16 |    4.74
Neoverse N1 |  LLVM 15 | 32x32 |    2.27
Neoverse N1 |   GCC 12 |   4x4 |    2.46
Neoverse N1 |   GCC 12 |   8x8 |   10.37
Neoverse N1 |   GCC 12 | 16x16 |   11.46
Neoverse N1 |   GCC 12 | 32x32 |    6.57
Neoverse V1 |  LLVM 15 |   4x4 |    2.24
Neoverse V1 |  LLVM 15 |   8x8 |    3.53
Neoverse V1 |  LLVM 15 | 16x16 |    4.44
Neoverse V1 |  LLVM 15 | 32x32 |    2.17
Neoverse V1 |   GCC 12 |   4x4 |    2.25
Neoverse V1 |   GCC 12 |   8x8 |    7.67
Neoverse V1 |   GCC 12 | 16x16 |    8.97
Neoverse V1 |   GCC 12 | 32x32 |    4.77

Change-Id: Ib4a1a2cb5a5c4495ae329529f8847664cbd0dfe0

16 months agoquantize: simplify 32x32_b args
Johann [Sat, 5 Nov 2022 00:53:07 +0000 (09:53 +0900)]
quantize: simplify 32x32_b args

Now that all the implementations of the 32x32 quantize are in
intrinsics we can reference struct members directly. Saves
pushing them to the stack.

n_coeffs is not used at all for this function.

Change-Id: I2104fea3fa20c455087e21b347d6abd7ea1f3e1e

16 months agoMerge "Add Neon implementations of standard bitdepth MSE functions" into main
James Zern [Tue, 28 Feb 2023 02:44:28 +0000 (02:44 +0000)]
Merge "Add Neon implementations of standard bitdepth MSE functions" into main

16 months agoMerge "Optimize transpose_neon.h helper functions" into main
James Zern [Tue, 28 Feb 2023 02:36:41 +0000 (02:36 +0000)]
Merge "Optimize transpose_neon.h helper functions" into main

16 months agotools_common,VpxInterface: remove unneeded const
James Zern [Mon, 27 Feb 2023 21:48:47 +0000 (13:48 -0800)]
tools_common,VpxInterface: remove unneeded const

Change-Id: Ic309aab2ff1750bdbcc36e8aafe05d52930ba694

16 months agoMerge "tools_common,VpxInterface: fix interface fn ptr proto" into main
James Zern [Mon, 27 Feb 2023 19:52:18 +0000 (19:52 +0000)]
Merge "tools_common,VpxInterface: fix interface fn ptr proto" into main

16 months agoAdd Neon implementations of standard bitdepth MSE functions
Salome Thirot [Fri, 24 Feb 2023 18:05:43 +0000 (18:05 +0000)]
Add Neon implementations of standard bitdepth MSE functions

Currently only vpx_mse16x16 has a Neon implementation. This patch adds
optimized Armv8.0 and Armv8.4 dot-product paths for all block sizes:
8x8, 8x16, 16x8 and 16x16.

Add the corresponding tests as well.

Change-Id: Ib0357fdcdeb05860385fec89633386e34395e260

16 months agoOptimize transpose_neon.h helper functions
Jonathan Wright [Sat, 25 Feb 2023 00:43:46 +0000 (00:43 +0000)]
Optimize transpose_neon.h helper functions

1) Use vtrn[12]q_[su]64 in vpx_vtrnq_[su]64* helpers on AArch64
   targets. This produces half as many TRN1/2 instructions compared to
   the number of MOVs that result from vcombine.

2) Use vpx_vtrnq_[su]64* helpers wherever applicable.

3) Refactor transpose_4x8_s16 to operate on 128-bit vectors.

Change-Id: I9a8b1c1fe2a98a429e0c5f39def5eb2f65759127

16 months agotools_common,VpxInterface: fix interface fn ptr proto
James Zern [Sat, 25 Feb 2023 03:25:39 +0000 (19:25 -0800)]
tools_common,VpxInterface: fix interface fn ptr proto

Use (void) to indicate an empty parameter list and match the declaration
of vpx_codec_vp[89]_[cd]x. This fixes a cfi sanitizer error.
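
A minimal sketch of the prototype mismatch, with hypothetical names
(demo_iface_t, demo_codec_iface, DemoInterface) standing in for the
real types. It assumes the usual C rule that, before C23, empty
parentheses leave the parameter list unspecified, which makes the
pointer type differ from that of a function declared with (void).
Checks that compare the two types on an indirect call, such as Clang's
cfi-icall, can then reject the call.

```c
#include <assert.h>

/* Hypothetical stand-in for a codec interface object. */
typedef struct {
  const char *name;
} demo_iface_t;

static demo_iface_t demo_iface = { "demo" };

/* The getter is declared with (void): it takes no arguments. */
static demo_iface_t *demo_codec_iface(void) { return &demo_iface; }

typedef struct {
  const char *name;
  /* Written as `demo_iface_t *(*codec_interface)();` the parameter list
   * would be unspecified (before C23) and the pointer type would not match
   * the getter's type. With (void) the two types agree exactly. */
  demo_iface_t *(*codec_interface)(void);
} DemoInterface;

static const DemoInterface demo_entry = { "demo", demo_codec_iface };
```

Calling demo_entry.codec_interface() then goes through a pointer whose
type is identical to the declaration of the function it points to.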

Change-Id: I190f432eea4d1765afffd84c7458ec44d863f90c

16 months agoMerge changes I65d86038,If3299fe5,I3ef1ff19 into main
James Zern [Fri, 24 Feb 2023 17:58:15 +0000 (17:58 +0000)]
Merge changes I65d86038,If3299fe5,I3ef1ff19 into main

* changes:
  Add Neon implementation of high bitdepth 32x32 hadamard transform
  Add Neon implementation of high bitdepth 16x16 hadamard transform
  Add Neon implementation of high bitdepth 8x8 hadamard transform

16 months agoMerge changes Ia64d175a,Ie4ea8f0a into main
James Zern [Fri, 24 Feb 2023 17:49:25 +0000 (17:49 +0000)]
Merge changes Ia64d175a,Ie4ea8f0a into main

* changes:
  vp9_loop_filter_alloc: clear -Wshadow warnings
  vp9_adapt_mode_probs: clear -Wshadow warning

16 months agoAdd Neon implementation of high bitdepth 32x32 hadamard transform
Salome Thirot [Thu, 23 Feb 2023 12:05:30 +0000 (12:05 +0000)]
Add Neon implementation of high bitdepth 32x32 hadamard transform

Add Neon implementation of vpx_highbd_hadamard_32x32 as well as the
corresponding tests.

Change-Id: I65d8603896649de1996b353aa79eee54824b4708

16 months agoAdd Neon implementation of high bitdepth 16x16 hadamard transform
Salome Thirot [Wed, 22 Feb 2023 17:27:56 +0000 (17:27 +0000)]
Add Neon implementation of high bitdepth 16x16 hadamard transform

Add Neon implementation of vpx_highbd_hadamard_16x16 as well as the
corresponding tests.

Change-Id: If3299fe556351dfe3db994ac171d83a95ea1504b

16 months agoMerge "vp9 rc test: change param type to bool" into main
Jerome Jiang [Fri, 24 Feb 2023 01:45:54 +0000 (01:45 +0000)]
Merge "vp9 rc test: change param type to bool" into main

16 months agovp9 rc test: change param type to bool
Jerome Jiang [Thu, 23 Feb 2023 19:28:30 +0000 (14:28 -0500)]
vp9 rc test: change param type to bool

Change-Id: Ib45522e32d9137678da9062830044e9dd87537e5

16 months agoMerge "Disable some intra modes for TX_32X32" into main
Chi Yo Tsai [Thu, 23 Feb 2023 18:01:05 +0000 (18:01 +0000)]
Merge "Disable some intra modes for TX_32X32" into main

16 months agoAdd Neon implementation of high bitdepth 8x8 hadamard transform
Salome Thirot [Tue, 21 Feb 2023 17:40:20 +0000 (17:40 +0000)]
Add Neon implementation of high bitdepth 8x8 hadamard transform

Add Neon implementation of vpx_highbd_hadamard_8x8 as well as the
corresponding tests.

Change-Id: I3ef1ff199d76b6b010591ef15a81b0f36c9ded03

16 months agovp9_loop_filter_alloc: clear -Wshadow warnings
James Zern [Wed, 22 Feb 2023 21:25:29 +0000 (13:25 -0800)]
vp9_loop_filter_alloc: clear -Wshadow warnings

Bug: webm:1793
Change-Id: Ia64d175aa69dc2ecde2babf64bde04f02b32795b

16 months agovp9_adapt_mode_probs: clear -Wshadow warning
James Zern [Wed, 22 Feb 2023 21:21:27 +0000 (13:21 -0800)]
vp9_adapt_mode_probs: clear -Wshadow warning

Bug: webm:1793
Change-Id: Ie4ea8f0a3295e6f58dc6f7d5c61d46700c539d40

16 months agoMerge "vp9_block.h: rename diff struct to Diff" into main
James Zern [Thu, 23 Feb 2023 06:07:25 +0000 (06:07 +0000)]
Merge "vp9_block.h: rename diff struct to Diff" into main

16 months agoDisable some intra modes for TX_32X32
chiyotsai [Wed, 22 Feb 2023 20:44:47 +0000 (12:44 -0800)]
Disable some intra modes for TX_32X32

Performance:
| SPD_SET | TESTSET | AVG_PSNR | OVR_PSNR |  SSIM   | ENC_T |
|---------|---------|----------|----------|---------|-------|
|    0    | hdres2  | +0.036%  | +0.032%  | +0.014% | -3.9% |
|    0    | lowres2 | -0.002%  | -0.011%  | +0.020% | -3.6% |
|    0    | midres2 | +0.045%  | +0.025%  | -0.007% | -4.0% |

STATS_CHANGED

Change-Id: I75a927333d26f2a37f0dda57a641b455b845f5b9

16 months agovpx_subpixel_8t_intrin_avx2: clear -Wshadow warnings
James Zern [Wed, 22 Feb 2023 20:54:21 +0000 (12:54 -0800)]
vpx_subpixel_8t_intrin_avx2: clear -Wshadow warnings

no changes to assembly

Bug: webm:1793
Change-Id: I6a82290cafee7f4a7909d497ccfdefd5a78fb8ed

16 months agovp9_block.h: rename diff struct to Diff
James Zern [Wed, 22 Feb 2023 19:34:30 +0000 (11:34 -0800)]
vp9_block.h: rename diff struct to Diff

This matches the style guide and fixes some -Wshadow warnings related to
variables with the same name. Something similar was done in libaom in:
863b04994b Fix warnings reported by -Wshadow: Part2: av1 directory
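
A small sketch of the naming clash, using a hypothetical struct whose
fields are illustrative only and are not copied from vp9_block.h. Once
the type is spelled Diff, a local variable called diff no longer hides
a type name.

```c
#include <assert.h>

/* Hypothetical struct for illustration. Naming the type Diff leaves the
 * lowercase name free for local variables. */
typedef struct {
  unsigned int sse;
  int sum;
} Diff;

static Diff make_diff(int a, int b) {
  /* If the typedef were still called `diff`, this local would hide the type
   * name, which is the kind of declaration -Wshadow is meant to flag. */
  const int diff = a - b;
  Diff d;
  d.sse = (unsigned int)(diff * diff);
  d.sum = diff;
  return d;
}
```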

Bug: webm:1793
Change-Id: I4df1bbc8d079a3174d75f0d35d54c200ffdbb677

16 months agoMerge "Skip redundant iterations in joint motion search " into main
Yunqing Wang [Wed, 22 Feb 2023 19:28:17 +0000 (19:28 +0000)]
Merge "Skip redundant iterations in joint motion search " into main

16 months agoMerge "vp9 rc: Make it work for SVC parallel encoding" into main
Jerome Jiang [Wed, 22 Feb 2023 14:59:49 +0000 (14:59 +0000)]
Merge "vp9 rc: Make it work for SVC parallel encoding" into main

17 months agoOptimize Neon implementation of high bitdepth variance functions
Salome Thirot [Mon, 13 Feb 2023 16:11:31 +0000 (16:11 +0000)]
Optimize Neon implementation of high bitdepth variance functions

Specialize implementation of high bitdepth variance functions such that
we only widen data processing element types when absolutely necessary.

Change-Id: If4cc3fea7b5ab0821e3129ebd79ff63706a512bf