The testcase has morphed in a way that it no longer tests what it was
originally supposed to, and slightly altering it shows the original issue
isn't fixed (anymore).
The limit set as a result of PR91403 (and its duplicates) prevents the issue
for larger arrays, but the testcase has
double a[128][128];
which results in a group size of "just" 512 (the limit is 4096).  Avoiding
the 'BB vectorization with gaps at the end of a load is not supported'
bail-out by altering it to
void foo(void)
{
  b[0] = a[0][0];
  b[1] = a[1][0];
  b[2] = a[2][0];
  b[3] = a[3][127];
}
shows that costing has improved further and no longer accounts for the dead
loads that made the previous test inefficient.  In fact the underlying issue
isn't fixed (we do code-generate dead loads); a standalone sketch of the
altered variant is included below for reference.
Indeed the vector permute load is even profitable; only the excessive
code-generation issue remains (and is "fixed" by capping it at a constant
boundary, which is just too high for this particular testcase).
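The following is merely a sketch of that altered variant as a self-contained
file, reusing the array declarations from the removed testcase; the compile
command in the comment is only an assumption, combining the removed file's
dg-additional-options with the usual tree pass dump flag:

/* altered.c - hypothetical standalone version of the altered snippet.
   Compile e.g. with
     gcc -O3 -fvect-cost-model=dynamic -fdump-tree-slp2-details -S altered.c
   and inspect the slp2 dump for the costing decision and the loads that
   end up code-generated.  */
double a[128][128];
double b[128];

void foo (void)
{
  b[0] = a[0][0];
  b[1] = a[1][0];
  b[2] = a[2][0];
  b[3] = a[3][127];
}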
The testcase now has duplicates, so I'll simply remove it.
2021-01-15  Richard Biener  <rguenther@suse.de>

	PR testsuite/96098
	* gcc.dg/vect/bb-slp-pr68892.c: Remove.
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-pr68892.c
+++ /dev/null
-/* { dg-do compile } */
-/* { dg-additional-options "-fvect-cost-model=dynamic" } */
-/* { dg-require-effective-target vect_double } */
-
-double a[128][128];
-double b[128];
-
-void foo(void)
-{
- b[0] = a[0][0];
- b[1] = a[1][0];
- b[2] = a[2][0];
- b[3] = a[3][0];
-}
-
-/* ??? Due to the gaps we fall back to scalar loads which makes the
- vectorization profitable. */
-/* { dg-final { scan-tree-dump "not profitable" "slp2" { xfail { ! aarch64*-*-* } } } } */
-/* { dg-final { scan-tree-dump "BB vectorization with gaps at the end of a load is not supported" "slp2" } } */
-/* { dg-final { scan-tree-dump-times "Basic block will be vectorized" 1 "slp2" { xfail aarch64*-*-* } } } */