review.tizen.org Git - platform/upstream/gcc.git/log

aarch64: Use memcpy to copy structures in bfloat vst* intrinsics

Use __builtin_memcpy to copy vector structures instead of using a
union - or constructing a new opaque structure one vector at a time -
in each of the vst[234][q] and vst1[q]_x[234] bfloat Neon intrinsics
in arm_neon.h.

Add new code generation tests to verify that superfluous move
instructions are not generated for the vst[234]q or vst1q_x[234]
bfloat intrinsics.

gcc/ChangeLog:

2021-07-30 Jonathan Wright <jonathan.wright@arm.com>

* config/aarch64/arm_neon.h (vst1_bf16_x2): Use
__builtin_memcpy instead of constructing an additional
__builtin_aarch64_simd_oi one vector at a time.
(vst1q_bf16_x2): Likewise.
(vst1_bf16_x3): Use __builtin_memcpy instead of constructing
an additional __builtin_aarch64_simd_ci one vector at a time.
(vst1q_bf16_x3): Likewise.
(vst1_bf16_x4): Use __builtin_memcpy instead of a union.
(vst1q_bf16_x4): Likewise.
(vst2_bf16): Use __builtin_memcpy instead of constructing an
additional __builtin_aarch64_simd_oi one vector at a time.
(vst2q_bf16): Likewise.
(vst3_bf16): Use __builtin_memcpy instead of constructing an
additional __builtin_aarch64_simd_ci mode one vector at a
time.
(vst3q_bf16): Likewise.
(vst4_bf16): Use __builtin_memcpy instead of constructing an
additional __builtin_aarch64_simd_xi one vector at a time.
(vst4q_bf16): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: Add new
tests.

aarch64: Use memcpy to copy structures in vst2[q]_lane intrinsics

Use __builtin_memcpy to copy vector structures instead of using a
union - or constructing a new opaque structure one vector at a time -
in each of the vst2[q]_lane Neon intrinsics in arm_neon.h.

Add new code generation tests to verify that superfluous move
instructions are not generated for the vst2q_lane intrinsics.

gcc/ChangeLog:

2021-07-30 Jonathan Wright <jonathan.wright@arm.com>

* config/aarch64/arm_neon.h (__ST2_LANE_FUNC): Delete.
(__ST2Q_LANE_FUNC): Delete.
(vst2_lane_f16): Use __builtin_memcpy to copy vector
structure instead of constructing __builtin_aarch64_simd_oi
one vector at a time.
(vst2_lane_f32): Likewise.
(vst2_lane_f64): Likewise.
(vst2_lane_p8): Likewise.
(vst2_lane_p16): Likewise.
(vst2_lane_p64): Likewise.
(vst2_lane_s8): Likewise.
(vst2_lane_s16): Likewise.
(vst2_lane_s32): Likewise.
(vst2_lane_s64): Likewise.
(vst2_lane_u8): Likewise.
(vst2_lane_u16): Likewise.
(vst2_lane_u32): Likewise.
(vst2_lane_u64): Likewise.
(vst2_lane_bf16): Likewise.
(vst2q_lane_f16): Use __builtin_memcpy to copy vector
structure instead of using a union.
(vst2q_lane_f32): Likewise.
(vst2q_lane_f64): Likewise.
(vst2q_lane_p8): Likewise.
(vst2q_lane_p16): Likewise.
(vst2q_lane_p64): Likewise.
(vst2q_lane_s8): Likewise.
(vst2q_lane_s16): Likewise.
(vst2q_lane_s32): Likewise.
(vst2q_lane_s64): Likewise.
(vst2q_lane_u8): Likewise.
(vst2q_lane_u16): Likewise.
(vst2q_lane_u32): Likewise.
(vst2q_lane_u64): Likewise.
(vst2q_lane_bf16): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: Add new
tests.

aarch64: Use memcpy to copy structures in vst3[q]_lane intrinsics

Use __builtin_memcpy to copy vector structures instead of using a
union - or constructing a new opaque structure one vector at a time -
in each of the vst3[q]_lane Neon intrinsics in arm_neon.h.

Add new code generation tests to verify that superfluous move
instructions are not generated for the vst3q_lane intrinsics.

gcc/ChangeLog:

2021-07-30 Jonathan Wright <jonathan.wright@arm.com>

* config/aarch64/arm_neon.h (__ST3_LANE_FUNC): Delete.
(__ST3Q_LANE_FUNC): Delete.
(vst3_lane_f16): Use __builtin_memcpy to copy vector
structure instead of constructing __builtin_aarch64_simd_ci
one vector at a time.
(vst3_lane_f32): Likewise.
(vst3_lane_f64): Likewise.
(vst3_lane_p8): Likewise.
(vst3_lane_p16): Likewise.
(vst3_lane_p64): Likewise.
(vst3_lane_s8): Likewise.
(vst3_lane_s16): Likewise.
(vst3_lane_s32): Likewise.
(vst3_lane_s64): Likewise.
(vst3_lane_u8): Likewise.
(vst3_lane_u16): Likewise.
(vst3_lane_u32): Likewise.
(vst3_lane_u64): Likewise.
(vst3_lane_bf16): Likewise.
(vst3q_lane_f16): Use __builtin_memcpy to copy vector
structure instead of using a union.
(vst3q_lane_f32): Likewise.
(vst3q_lane_f64): Likewise.
(vst3q_lane_p8): Likewise.
(vst3q_lane_p16): Likewise.
(vst3q_lane_p64): Likewise.
(vst3q_lane_s8): Likewise.
(vst3q_lane_s16): Likewise.
(vst3q_lane_s32): Likewise.
(vst3q_lane_s64): Likewise.
(vst3q_lane_u8): Likewise.
(vst3q_lane_u16): Likewise.
(vst3q_lane_u32): Likewise.
(vst3q_lane_u64): Likewise.
(vst3q_lane_bf16): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: Add new
tests.

aarch64: Use memcpy to copy structures in vst4[q]_lane intrinsics

Use __builtin_memcpy to copy vector structures instead of using a
union - or constructing a new opaque structure one vector at a time -
in each of the vst4[q]_lane Neon intrinsics in arm_neon.h.

Add new code generation tests to verify that superfluous move
instructions are not generated for the vst4q_lane intrinsics.

gcc/ChangeLog:

2021-07-29 Jonathan Wright <jonathan.wright@arm.com>

* config/aarch64/arm_neon.h (__ST4_LANE_FUNC): Delete.
(__ST4Q_LANE_FUNC): Delete.
(vst4_lane_f16): Use __builtin_memcpy to copy vector
structure instead of constructing __builtin_aarch64_simd_xi
one vector at a time.
(vst4_lane_f32): Likewise.
(vst4_lane_f64): Likewise.
(vst4_lane_p8): Likewise.
(vst4_lane_p16): Likewise.
(vst4_lane_p64): Likewise.
(vst4_lane_s8): Likewise.
(vst4_lane_s16): Likewise.
(vst4_lane_s32): Likewise.
(vst4_lane_s64): Likewise.
(vst4_lane_u8): Likewise.
(vst4_lane_u16): Likewise.
(vst4_lane_u32): Likewise.
(vst4_lane_u64): Likewise.
(vst4_lane_bf16): Likewise.
(vst4q_lane_f16): Use __builtin_memcpy to copy vector
structure instead of using a union.
(vst4q_lane_f32): Likewise.
(vst4q_lane_f64): Likewise.
(vst4q_lane_p8): Likewise.
(vst4q_lane_p16): Likewise.
(vst4q_lane_p64): Likewise.
(vst4q_lane_s8): Likewise.
(vst4q_lane_s16): Likewise.
(vst4q_lane_s32): Likewise.
(vst4q_lane_s64): Likewise.
(vst4q_lane_u8): Likewise.
(vst4q_lane_u16): Likewise.
(vst4q_lane_u32): Likewise.
(vst4q_lane_u64): Likewise.
(vst4q_lane_bf16): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_structure_intrinsics.c: Add new
tests.

rs6000: Fix restored rs6000_long_double_type_size

As mentioned in the "Fallout: save/restore target options in handle_optimize_attribute"
thread, we need to support target option restore
of rs6000_long_double_type_size == FLOAT_PRECISION_TFmode.

gcc/ChangeLog:

* config/rs6000/rs6000.c (rs6000_option_override_internal): When
a target option is restored, it can have
rs6000_long_double_type_size set to FLOAT_PRECISION_TFmode
and error should not be emitted.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/pragma-optimize.c: New test.

Fixup gfortran.dg/vect/vect-8.f90 for aarch64

With the emulated gather changes we now consistently vectorize
for aarch64 and we can remove the SVE special-casing.

2021-08-06 Richard Biener <rguenther@suse.de>

* gfortran.dg/vect/vect-8.f90: Simplify aarch64 scanning.

gcov: Add __gcov_info_to_gdca()

Add __gcov_info_to_gcda() to libgcov to get the gcda data for a gcda info in a
freestanding environment.  It is intended to be used with the
-fprofile-info-section option.  A crude test program which doesn't use a linker
script is (use "gcc -coverage -fprofile-info-section -lgcov test.c" to compile
it):

  #include <gcov.h>
  #include <stdio.h>
  #include <stdlib.h>

  extern const struct gcov_info *my_info;

  static void
  filename (const char *f, void *arg)
  {
    printf("filename: %s\n", f);
  }

  static void
  dump (const void *d, unsigned n, void *arg)
  {
    const unsigned char *c = d;

    for (unsigned i = 0; i < n; ++i)
      printf ("%02x", c[i]);
  }

  static void *
  allocate (unsigned length, void *arg)
  {
    return malloc (length);
  }

  int main()
  {
    __asm__ volatile (".set my_info, .LPBX2");
    __gcov_info_to_gcda (my_info, filename, dump, allocate, NULL);
    return 0;
  }

With this patch, <stdint.h> is included in libgcov-driver.c even if
inhibit_libc is defined.  This header file should be also available for
freestanding environments.  If this is not the case, then we have to define
intptr_t somehow.

The patch removes one use of memset() which makes the <string.h> include
superfluous.

gcc/

* gcov-io.h (gcov_write): Declare.
* gcov-io.c (gcov_write): New.
(gcov_write_counter): Remove.
(gcov_write_tag_length): Likewise.
(gcov_write_summary): Replace gcov_write_tag_length() with calls to
gcov_write_unsigned().
* doc/invoke.texi (fprofile-info-section): Mention
__gcov_info_to_gdca().

gcc/testsuite/

* gcc.dg/gcov-info-to-gcda.c: New test.

libgcc/

* Makefile.in (LIBGCOV_DRIVER): Add _gcov_info_to_gcda.
* gcov.h (gcov_info): Declare.
(__gcov_info_to_gdca): Likewise.
* libgcov.h (gcov_write_counter): Remove.
(gcov_write_tag_length): Likewise.
* libgcov-driver.c (#include <stdint.h>): New.
(#include <string.h>): Remove.
(NEED_L_GCOV): Conditionally define.
(NEED_L_GCOV_INFO_TO_GCDA): Likewise.
(are_all_counters_zero): New.
(gcov_dump_handler): Likewise.
(gcov_allocate_handler): Likewise.
(dump_unsigned): Likewise.
(dump_counter): Likewise.
(write_topn_counters): Add dump_fn, allocate_fn, and arg parameters.
Use dump_unsigned() and dump_counter().
(write_one_data): Add dump_fn, allocate_fn, and arg parameters.  Use
dump_unsigned(), dump_counter(), and are_all_counters_zero().
(__gcov_info_to_gcda): New.

Adjust by-value function vec arguments to by-reference.

gcc/c/ChangeLog:

* c-parser.c (c_parser_declaration_or_fndef): Adjust by-value function
vec arguments to by-reference.
(c_finish_omp_declare_simd): Same.
(c_parser_compound_statement_nostart): Same.
(c_parser_for_statement): Same.
(c_parser_objc_methodprotolist): Same.
(c_parser_oacc_routine): Same.
(c_parser_omp_for_loop): Same.
(c_parser_omp_declare_simd): Same.

gcc/ChangeLog:

* dominance.c (prune_bbs_to_update_dominators): Adjust by-value vec
arguments to by-reference.
(iterate_fix_dominators): Same.
* dominance.h (iterate_fix_dominators): Same.
* ipa-prop.h: Call auto_vec::to_vec_legacy.
* tree-data-ref.c (dump_data_dependence_relation): Adjust by-value vec
arguments to by-reference.
(debug_data_dependence_relation): Same.
(dump_data_dependence_relations): Same.
* tree-data-ref.h (debug_data_dependence_relation): Same.
(dump_data_dependence_relations): Same.
* tree-predcom.c (dump_chains): Same.
(initialize_root_vars_lm): Same.
(determine_unroll_factor): Same.
(replace_phis_by_defined_names): Same.
(insert_init_seqs): Same.
(pcom_worker::tree_predictive_commoning_loop): Call
auto_vec::to_vec_legacy.
* tree-ssa-pre.c (insert_into_preds_of_block): Adjust by-value vec
arguments to by-reference.
* tree-ssa-threadbackward.c (populate_worklist): Same.
(back_threader::resolve_def): Same.
* tree-vect-data-refs.c (vect_check_nonzero_value): Same.
(vect_enhance_data_refs_alignment): Same.
(vect_check_lower_bound): Same.
(vect_prune_runtime_alias_test_list): Same.
(vect_permute_store_chain): Same.
* tree-vect-slp-patterns.c (vect_normalize_conj_loc): Same.
* tree-vect-stmts.c (vect_create_vectorized_demotion_stmts): Same.
* tree-vectorizer.h (vect_permute_store_chain): Same.
* vec.c (test_init): New function.
(vec_c_tests): Call new function.
* vec.h (vec): Declare ctors, dtor, and assignment.
(auto_vec::vec_to_legacy): New function.
(vec::copy): Adjust initialization.

Daily bump.

runtime: extend internal atomics to comply with sync/atomic

This is the gofrontend version of https://golang.org/cl/289152.

Reviewed-on: https://go-review.googlesource.com/c/gofrontend/+/339690

libstdc++: Move [[nodiscard]] attributes again [PR101782]

Where I moved these nodiscard attributes to made them apply to the
function type, not to the function. This meant they no longer generated
the desired -Wunused-result warnings, and were ill-formed with Clang
(but only a pedwarn with GCC).

Clang also detected ill-formed attributes in <queue> which this fixes.

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:

PR libstdc++/101782
* include/bits/ranges_base.h (ranges::begin, ranges::end)
(ranges::rbegin, ranges::rend, ranges::size, ranges::ssize)
(ranges::empty, ranges::data): Move attribute after the
declarator-id instead of at the end of the declarator.
* include/bits/stl_iterator.h (__gnu_cxx::__normal_iterator):
Move attributes back to the start of the function declarator,
but move the requires-clause to the end.
(common_iterator): Move attribute after the declarator-id.
* include/bits/stl_queue.h (queue): Remove ill-formed attributes
from friend declaration that are not definitions.
* include/std/ranges (views::all, views::filter)
(views::transform, views::take, views::take_while,
views::drop) (views::drop_while, views::join,
views::lazy_split) (views::split, views::counted,
views::common, views::reverse) (views::elements): Move
attributes after the declarator-id.

libcpp: Regenerate ucnid.h using Unicode 13.0.0 files [PR100977]

The following patch (incremental to the makeucnid.c fix) regenerates
ucnid.h with https://www.unicode.org/Public/13.0.0/ucd/ files.

2021-08-05 Jakub Jelinek <jakub@redhat.com>

PR c++/100977
* ucnid.h: Regenerated using Unicode 13.0.0 files.

libcpp: Fix makeucnid bug with combining values [PR100977]

I've noticed in ucnid.h two adjacent lines that had all flags and combine
values identical and as such were supposed to be merged.

This is due to a bug in makeucnid.c, which records last_flag,
last_combine and really_safe of what has just been printed, but
because of a typo mishandles it for last_combine, always compares against
the combining_value[0] which is 0.

This has two effects on the table, one is that often the table is
unnecessarily large, as for non-zero .combine every character has its own
record instead of adjacent characters with the same flags and combine
being merged. This means larger tables.
The other is that sometimes the last char that has combine set doesn't
actually have it in the tables, because the code is printing entries only
upon seeing the next character and if that character does have
combining_value of 0 and flags are otherwise the same as previously printed,
it will not print anything.

The following patch fixes that, for clarity what exactly it affects
I've regenerated with the same Unicode files as last time it has
been regenerated.

2021-08-05 Jakub Jelinek <jakub@redhat.com>

PR c++/100977
* makeucnid.c (write_table): Fix computation of last_combine.
* ucnid.h: Regenerated using Unicode 6.3.0 files.

libgcc: Honor LDFLAGS_FOR_TARGET when linking libgcc_s

When building gcc with some specific LDFLAGS_FOR_TARGET, e.g.
LDFLAGS_FOR_TARGET=-Wl,-z,relro,-z,now
those flags propagate info linking of target shared libraries,
e.g. lib{ubsan,tsan,stdc++,quadmath,objc,lsan,itm,gphobos,gdruntime,gomp,go,gfortran,atomic,asan}.so.*
but there is one important exception, libgcc_s.so.* linking ignores it.

The following patch fixes that.

Bootstrapped/regtested on x86_64-linux with LDFLAGS_FOR_TARGET=-Wl,-z,relro,-z,now
and verified that libgcc_s.so.* is BIND_NOW when it previously wasn't, and
without any LDFLAGS_FOR_TARGET on x86_64-linux and i686-linux.
There on x86_64-linux I've verified that the libgcc_s.so.1 linking command
line for -m64 is identical except for whitespace to one without the patch,
and for -m32 multilib $(LDFLAGS) actually do supply there an extra -m32
that also repeats later in the @multilib_flags@, which should be harmless.

2021-08-04 Jakub Jelinek <jakub@redhat.com>

* config/t-slibgcc (SHLIB_LINK): Add $(LDFLAGS).
* config/t-slibgcc-darwin (SHLIB_LINK): Likewise.
* config/t-slibgcc-vms (SHLIB_LINK): Likewise.
* config/t-slibgcc-fuchsia (SHLIB_LDFLAGS): Remove $(LDFLAGS).

openmp: Implement omp_get_device_num routine

This patch implements the omp_get_device_num library routine, specified in
OpenMP 5.0.

GOMP_DEVICE_NUM_VAR is a macro symbol which defines name of a "device number"
variable, is defined on the device-side libgomp, has it's address returned to
host-side libgomp during device initialization, and the host libgomp then
sets its value to the designated device number.

libgomp/ChangeLog:

* icv-device.c (omp_get_device_num): New API function, host side.
* fortran.c (omp_get_device_num_): New interface function.
* libgomp-plugin.h (GOMP_DEVICE_NUM_VAR): Define macro symbol.
* libgomp.map (OMP_5.0.2): New version space with omp_get_device_num,
omp_get_device_num_.
* libgomp.texi (omp_get_device_num): Add documentation for new API
function.
* omp.h.in (omp_get_device_num): Add declaration.
* omp_lib.f90.in (omp_get_device_num): Likewise.
* omp_lib.h.in (omp_get_device_num): Likewise.
* target.c (gomp_load_image_to_device): If additional entry for device
number exists at end of returned entries from 'load_image_func' hook,
copy the assigned device number over to the device variable.

* config/gcn/icv-device.c (GOMP_DEVICE_NUM_VAR): Define static global.
(omp_get_device_num): New API function, device side.
* plugin/plugin-gcn.c ("symcat.h"): Add include.
(GOMP_OFFLOAD_load_image): Add addresses of device GOMP_DEVICE_NUM_VAR
at end of returned 'target_table' entries.

* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Define static global.
(omp_get_device_num): New API function, device side.
* plugin/plugin-nvptx.c ("symcat.h"): Add include.
(GOMP_OFFLOAD_load_image): Add addresses of device GOMP_DEVICE_NUM_VAR
at end of returned 'target_table' entries.

* testsuite/lib/libgomp.exp
(check_effective_target_offload_target_intelmic): New function for
testing for intelmic offloading.
* testsuite/libgomp.c-c++-common/target-45.c: New test.
* testsuite/libgomp.fortran/target10.f90: New test.

libstdc++: Add [[nodiscard]] to <compare>

This adds the [[nodiscard]] attribute to all conversion operators,
comparison operators, call operators and non-member functions in
<compare>. Nothing in this header except constructors has side effects.

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:

* libsupc++/compare (partial_ordering, weak_ordering)
(strong_ordering, is_eq, is_neq, is_lt, is_lteq, is_gt, is_gteq)
(compare_three_way, strong_order, weak_order, partial_order)
(compare_strong_order_fallback, compare_weak_order_fallback)
(compare_partial_order_fallback, __detail::__synth3way): Add
nodiscard attribute.
* testsuite/18_support/comparisons/categories/zero_neg.cc: Add
-Wno-unused-result to options.

testsuite: Fix warning introduced by nodiscard in libstdc++

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
gcc/testsuite/ChangeLog:

* g++.old-deja/g++.other/inline7.C: Cast nodiscard call to void.

libstdc++: Move attributes that follow requires-clauses [PR101782]

As explained in the PR, the grammar in the Concepts TS means that a [
token following a requires-clause is parsed as part of the
logical-or-expression rather than the start of an attribute. That makes
the following ill-formed when using -fconcepts-ts:

template<typename T> requires foo<T> [[nodiscard]] int f(T);

This change moves all attributes that follow a requires-clause to the
end of the function declarator.

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:

PR libstdc++/101782
* include/bits/ranges_base.h (ranges::begin, ranges::end)
(ranges::rbegin, ranges::rend, ranges::size, ranges::ssize)
(ranges::empty, ranges::data): Move attribute to the end of
the declarator.
* include/bits/stl_iterator.h (__gnu_cxx::__normal_iterator)
(common_iterator): Likewise for non-member operator functions.
* include/std/ranges (views::all, views::filter)
(views::transform, views::take, views::take_while, views::drop)
(views::drop_while, views::join, views::lazy_split)
(views::split, views::counted, views::common, views::reverse)
(views::elements): Likewise.
* testsuite/std/ranges/access/101782.cc: New test.

<x86gprintrin.h>: Add pragma GCC target("general-regs-only")

1. Intrinsics in <x86gprintrin.h> only require GPR ISAs. Add

#if defined __MMX__ || defined __SSE__
#pragma GCC push_options
#pragma GCC target("general-regs-only")
#define __DISABLE_GENERAL_REGS_ONLY__
#endif

and

#ifdef __DISABLE_GENERAL_REGS_ONLY__
#undef __DISABLE_GENERAL_REGS_ONLY__
#pragma GCC pop_options
#endif /* __DISABLE_GENERAL_REGS_ONLY__ */

to <x86gprintrin.h> to disable non-GPR ISAs so that they can be used in
functions with __attribute__ ((target("general-regs-only"))).
2. When checking always_inline attribute, if callee only uses GPRs,
ignore MASK_80387 since enable MASK_80387 in caller has no impact on
callee inline.

gcc/

PR target/99744
* config/i386/i386.c (ix86_can_inline_p): Ignore MASK_80387 if
callee only uses GPRs.
* config/i386/ia32intrin.h: Revert commit 5463cee2770.
* config/i386/serializeintrin.h: Revert commit 71958f740f1.
* config/i386/x86gprintrin.h: Add
#pragma GCC target("general-regs-only") and #pragma GCC pop_options
to disable non-GPR ISAs.

gcc/testsuite/

PR target/99744
* gcc.target/i386/pr99744-3.c: New test.
* gcc.target/i386/pr99744-4.c: Likewise.
* gcc.target/i386/pr99744-5.c: Likewise.
* gcc.target/i386/pr99744-6.c: Likewise.
* gcc.target/i386/pr99744-7.c: Likewise.
* gcc.target/i386/pr99744-8.c: Likewise.

doc: Document cond_* shift optabs in md.texi

gcc/
PR middle-end/101787
* doc/md.texi (cond_ashl, cond_ashr, cond_lshr): Document.

vect: Move costing helpers from aarch64 code

aarch64.c has various routines to test for specific kinds of
vector statement cost. The routines aren't really target-specific,
so following a suggestion from Richi, this patch moves them to a new
section of tree-vectorizer.h.

gcc/
* tree-vectorizer.h (vect_is_store_elt_extraction, vect_is_reduction)
(vect_reduc_type, vect_embedded_comparison_type, vect_comparison_type)
(vect_is_extending_load, vect_is_integer_truncation): New functions,
moved from aarch64.c but given different names.
* config/aarch64/aarch64.c (aarch64_is_store_elt_extraction)
(aarch64_is_reduction, aarch64_reduc_type)
(aarch64_embedded_comparison_type, aarch64_comparison_type)
(aarch64_extending_load_p, aarch64_integer_truncation_p): Delete
in favor of the above. Update callers accordingly.

arm: reorder assembler architecture directives [PR101723]

A change to the way gas interprets the .fpu directive in binutils-2.34
means that issuing .fpu will clear any features set by .arch_extension
that apply to the floating point or simd units.  This unfortunately
causes problems for more recent versions of the architecture because
we currently emit .arch, .arch_extension and .fpu directives at
different times and try to suppress redundant changes.

This change addresses this by firstly unifying all the places where we
emit these directives to a single block of code and secondly
(re)emitting all the directives if any changes have been made to the
target options.  Whilst this is slightly more than the strict minimum
it should be enough to catch all cases where a change could have
happened.  The new code also emits the directives in the order: .arch,
.fpu, .arch_extension.  This ensures that the additional architectural
extensions are not removed by a later .fpu directive.

Whilst writing this patch I also noticed that in the corner case where
the last function to be compiled had a non-standard set of
architecture flags, the assembler would add an incorrect set of
derived attributes for the file as a whole.  Instead of reflecting the
command-line options it would reflect the flags from the last file in
the function.  To address this I've also added a call to re-emit the
flags from the asm_file_end callback so the assembler will be in the
correct state when it finishes processing the intput.

There's some slight churn to the testsuite as a consequence of this,
because previously we had a hack to suppress emitting a .fpu directive
for one specific case, but with the new order this is no-longer
necessary.

gcc/ChangeLog:

PR target/101723
* config/arm/arm-cpus.in (generic-armv7-a): Add quirk to suppress
writing .cpu directive in asm output.
* config/arm/arm.c (arm_identify_fpu_from_isa): New variable.
(arm_last_printed_arch_string): Delete.
(arm_last-printed_fpu_string): Delete.
(arm_configure_build_target): If use of floating-point/SIMD is
disabled, remove all fp/simd related features from the target ISA.
(last_arm_targ_options): New variable.
(arm_print_asm_arch_directives): Add new parameters.  Change order
of emitted directives and handle all cases here.
(arm_file_start): Always call arm_print_asm_arch_directives, move
all generation of .arch/.arch_extension here.
(arm_file_end): Call arm_print_asm_arch.
(arm_declare_function_name): Call arm_print_asm_arch_directives
instead of printing .arch/.fpu directives directly.

gcc/testsuite/ChangeLog:

PR target/101723
* gcc.target/arm/cortex-m55-nofp-flag-hard.c: Update expected output.
* gcc.target/arm/cortex-m55-nofp-flag-softfp.c: Likewise.
* gcc.target/arm/cortex-m55-nofp-nomve-flag-softfp.c: Likewise.
* gcc.target/arm/mve/intrinsics/mve_fpu1.c: Convert to dg-do assemble.
Add a non-no-op function body.
* gcc.target/arm/mve/intrinsics/mve_fpu2.c: Likewise.
* gcc.target/arm/pr98636.c (dg-options): Add -mfloat-abi=softfp.
* gcc.target/arm/attr-neon.c: Tighten scan-assembler tests.
* gcc.target/arm/attr-neon2.c: Use -Ofast, convert test to use
check-function-bodies.
* gcc.target/arm/attr-neon3.c: Likewise.
* gcc.target/arm/pr69245.c: Tighten scan-assembler match, but allow
multiple instances.
* gcc.target/arm/pragma_fpu_attribute.c: Likewise.
* gcc.target/arm/pragma_fpu_attribute_2.c: Likewise.

arm: Don't reconfigure globals in arm_configure_build_target

arm_configure_build_target is usually used to reconfigure the
arm_active_target structure, which is then used to reconfigure a
number of other global variables describing the current target.
Occasionally, however, we need to use arm_configure_build_target to
construct a temporary target structure and in that case it is wrong to
try to reconfigure the global variables (although probably harmless,
since arm_option_reconfigure_globals() only looks at
arm_active_target). At the very least, however, this is wasted work,
so it is best not to do it unless needed. What's more, several
callers of arm_configure_build target call
arm_option_reconfigure_globals themselves within a few lines, making
the call from within arm_configure_build_target completely redundant.

So this patch moves the responsibility of calling of
arm_configure_build_target to its callers (only two places needed
updating).

gcc:
* config/arm/arm.c (arm_configure_build_target): Don't call
arm_option_reconfigure_globals.
(arm_option_restore): Call arm_option_reconfigure_globals after
reconfiguring the target.
* config/arm/arm-c.c (arm_pragma_target_parse): Likewise.

arm: ensure the arch_name is always set for the build target

This should never happen now if GCC is invoked by the driver, but in
the unusual case of calling cc1 (or its ilk) directly from the command
line the build target's arch_name string can remain NULL. This can
complicate later processing meaning that we need to check for this
case explicitly in some circumstances. Nothing should rely on this
behaviour, so it's simpler to always set the arch_name when
configuring the build target and be done with it.

gcc:

* config/arm/arm.c (arm_configure_build_target): Ensure the target's
arch_name is always set.

aarch64: Don't include vec_select high-half in SIMD subtract cost

The Neon subtract-long/subract-widen instructions can select the top
or bottom half of the operand registers. This selection does not
change the cost of the underlying instruction and this should be
reflected by the RTL cost function.

This patch adds RTL tree traversal in the Neon subtract cost function
to match vec_select high-half of its operands. This traversal
prevents the cost of the vec_select from being added into the cost of
the subtract - meaning that these instructions can now be emitted in
the combine pass as they are no longer deemed prohibitively
expensive.

gcc/ChangeLog:

2021-07-28 Jonathan Wright <jonathan.wright@arm.com>

* config/aarch64/aarch64.c: Traverse RTL tree to prevent cost
of vec_select high-half from being added into Neon subtract
cost.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vsubX_high_cost.c: New test.

aarch64: Don't include vec_select high-half in SIMD add cost

The Neon add-long/add-widen instructions can select the top or bottom
half of the operand registers. This selection does not change the
cost of the underlying instruction and this should be reflected by
the RTL cost function.

This patch adds RTL tree traversal in the Neon add cost function to
match vec_select high-half of its operands. This traversal prevents
the cost of the vec_select from being added into the cost of the
subtract - meaning that these instructions can now be emitted in the
combine pass as they are no longer deemed prohibitively expensive.

gcc/ChangeLog:

2021-07-28 Jonathan Wright <jonathan.wright@arm.com>

* config/aarch64/aarch64.c: Traverse RTL tree to prevent cost
of vec_select high-half from being added into Neon add cost.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vaddX_high_cost.c: New test.

Adjust gcc.dg/vect/bb-slp-pr101756.c

This adjusts the testcase for excess diagnostics emitted by some
targets because of the attribute simd usage like

warning: GCC does not currently support mixed size types for 'simd' functions

on aarch64.

2021-08-05 Richard Biener <rguenther@suse.de>

* gcc.dg/vect/bb-slp-pr101756.c: Add -w.

cfgloop: Make loops_list support an optional loop_p root

This patch follows Richi's suggestion to add one optional
argument class loop* root to loops_list's CTOR, it can
provide the ability to construct a visiting list starting
from the given class loop* ROOT rather than the default
tree_root of loops_for_fn (FN), for visiting a subset of
the loop tree.

It unifies all orders of walkings into walk_loop_tree, but
it still uses linear search for LI_ONLY_INNERMOST when
looking at the whole loop tree since it has a more stable
bound.

gcc/ChangeLog:

* cfgloop.h (loops_list::loops_list): Add one optional argument
root and adjust accordingly, update loop tree walking and factor
out to ...
* cfgloop.c (loops_list::walk_loop_tree): ... this. New function.

Fix oversight in handling of reverse SSO in SRA pass

The scalar storage order does not apply to pointer and vector components.

gcc/
PR tree-optimization/101626
* tree-sra.c (propagate_subaccesses_from_rhs): Do not set the
reverse scalar storage order on a pointer or vector component.

gcc/testsuite/
* gcc.dg/sso-15.c: New test.

compiler: make escape analysis more robust about builtin functions

In the places where we handle builtin functions, list all
supported ones, and fail if an unexpected one is seen. So if a
new builtin function is added in the future we can detect it,
instead of silently treating it as nonescaping.

Reviewed-on: https://go-review.googlesource.com/c/gofrontend/+/339992

Support cond_{xor,ior,and} for vector integer mode under AVX512.

gcc/ChangeLog:

* config/i386/sse.md (cond_<code><mode>): New expander.

gcc/testsuite/ChangeLog:

* gcc.target/i386/cond_op_anylogic_d-1.c: New test.
* gcc.target/i386/cond_op_anylogic_d-2.c: New test.
* gcc.target/i386/cond_op_anylogic_q-1.c: New test.
* gcc.target/i386/cond_op_anylogic_q-2.c: New test.

Support cond_{smax,smin} for vector float/double modes under AVX512.

gcc/ChangeLog:

* config/i386/sse.md (cond_<code><mode>): New expander.

gcc/testsuite/ChangeLog:

* gcc.target/i386/cond_op_maxmin_double-1.c: New test.
* gcc.target/i386/cond_op_maxmin_double-2.c: New test.
* gcc.target/i386/cond_op_maxmin_float-1.c: New test.
* gcc.target/i386/cond_op_maxmin_float-2.c: New test.

Support cond_{smax,smin,umax,umin} for vector integer modes under AVX512.

gcc/ChangeLog:

* config/i386/sse.md (cond_<code><mode>): New expander.

gcc/testsuite/ChangeLog:

* gcc.target/i386/cond_op_maxmin_b-1.c: New test.
* gcc.target/i386/cond_op_maxmin_b-2.c: New test.
* gcc.target/i386/cond_op_maxmin_d-1.c: New test.
* gcc.target/i386/cond_op_maxmin_d-2.c: New test.
* gcc.target/i386/cond_op_maxmin_q-1.c: New test.
* gcc.target/i386/cond_op_maxmin_q-2.c: New test.
* gcc.target/i386/cond_op_maxmin_ub-1.c: New test.
* gcc.target/i386/cond_op_maxmin_ub-2.c: New test.
* gcc.target/i386/cond_op_maxmin_ud-1.c: New test.
* gcc.target/i386/cond_op_maxmin_ud-2.c: New test.
* gcc.target/i386/cond_op_maxmin_uq-1.c: New test.
* gcc.target/i386/cond_op_maxmin_uq-2.c: New test.
* gcc.target/i386/cond_op_maxmin_uw-1.c: New test.
* gcc.target/i386/cond_op_maxmin_uw-2.c: New test.
* gcc.target/i386/cond_op_maxmin_w-1.c: New test.
* gcc.target/i386/cond_op_maxmin_w-2.c: New test.

Daily bump.

analyzer: initial implementation of asm support [PR101570]

gcc/ChangeLog:
PR analyzer/101570
* Makefile.in (ANALYZER_OBJS): Add analyzer/region-model-asm.o.

gcc/analyzer/ChangeLog:
PR analyzer/101570
* analyzer.cc (maybe_reconstruct_from_def_stmt): Add GIMPLE_ASM
case.
* analyzer.h (class asm_output_svalue): New forward decl.
(class reachable_regions): New forward decl.
* complexity.cc (complexity::from_vec_svalue): New.
* complexity.h (complexity::from_vec_svalue): New decl.
* engine.cc (feasibility_state::maybe_update_for_edge): Handle
asm stmts by calling on_asm_stmt.
* region-model-asm.cc: New file.
* region-model-manager.cc
(region_model_manager::maybe_fold_asm_output_svalue): New.
(region_model_manager::get_or_create_asm_output_svalue): New.
(region_model_manager::log_stats): Log m_asm_output_values_map.
* region-model.cc (region_model::on_stmt_pre): Handle GIMPLE_ASM.
* region-model.h (visitor::visit_asm_output_svalue): New.
(region_model_manager::get_or_create_asm_output_svalue): New decl.
(region_model_manager::maybe_fold_asm_output_svalue): New decl.
(region_model_manager::asm_output_values_map_t): New typedef.
(region_model_manager::m_asm_output_values_map): New field.
(region_model::on_asm_stmt): New.
* store.cc (binding_cluster::on_asm): New.
* store.h (binding_cluster::on_asm): New decl.
* svalue.cc (svalue::cmp_ptr): Handle SK_ASM_OUTPUT.
(asm_output_svalue::dump_to_pp): New.
(asm_output_svalue::dump_input): New.
(asm_output_svalue::input_idx_to_asm_idx): New.
(asm_output_svalue::accept): New.
* svalue.h (enum svalue_kind): Add SK_ASM_OUTPUT.
(svalue::dyn_cast_asm_output_svalue): New.
(class asm_output_svalue): New.
(is_a_helper <const asm_output_svalue *>::test): New.
(struct default_hash_traits<asm_output_svalue::key_t>): New.

gcc/testsuite/ChangeLog:
PR analyzer/101570
* gcc.dg/analyzer/asm-x86-1.c: New test.
* gcc.dg/analyzer/asm-x86-lp64-1.c: New test.
* gcc.dg/analyzer/asm-x86-lp64-2.c: New test.
* gcc.dg/analyzer/pr101570.c: New test.
* gcc.dg/analyzer/torture/asm-x86-linux-array_index_mask_nospec.c:
New test.
* gcc.dg/analyzer/torture/asm-x86-linux-cpuid-paravirt-1.c: New
test.
* gcc.dg/analyzer/torture/asm-x86-linux-cpuid-paravirt-2.c: New
test.
* gcc.dg/analyzer/torture/asm-x86-linux-cpuid.c: New test.
* gcc.dg/analyzer/torture/asm-x86-linux-rdmsr-paravirt.c: New
test.
* gcc.dg/analyzer/torture/asm-x86-linux-rdmsr.c: New test.
* gcc.dg/analyzer/torture/asm-x86-linux-wfx_get_ps_timeout-full.c:
New test.
* gcc.dg/analyzer/torture/asm-x86-linux-wfx_get_ps_timeout-reduced.c:
New test.

Signed-off-by: David Malcolm <dmalcolm@redhat.com>

x86: Update STORE_MAX_PIECES

Update STORE_MAX_PIECES to allow 16/32/64 bytes only if inter-unit move
is enabled since vec_duplicate enabled by inter-unit move is used to
implement store_by_pieces of 16/32/64 bytes.

gcc/

PR target/101742
* config/i386/i386.h (STORE_MAX_PIECES): Allow 16/32/64 bytes
only if TARGET_INTER_UNIT_MOVES_TO_VEC is true.

gcc/testsuite/

PR target/101742
* gcc.target/i386/pr101742a.c: New test.
* gcc.target/i386/pr101742b.c: Likewise.

x86: Avoid stack realignment when copying data with SSE register

To avoid stack realignment, call ix86_gen_scratch_sse_rtx to get a
scratch SSE register to copy data with with SSE register from one
memory location to another.

gcc/

PR target/101772
* config/i386/i386-expand.c (ix86_expand_vector_move): Call
ix86_gen_scratch_sse_rtx to get a scratch SSE register to copy
data with SSE register from one memory location to another.

gcc/testsuite/

PR target/101772
* gcc.target/i386/eh_return-2.c: New test.

IBM Z: Implement TARGET_VECTORIZE_VEC_PERM_CONST for vpdi

This patch makes use of the vector permute double immediate
instruction for constant permute vectors.

gcc/ChangeLog:

* config/s390/s390.c (expand_perm_with_vpdi): New function.
(vectorize_vec_perm_const_1): Call expand_perm_with_vpdi.
* config/s390/vector.md (*vpdi1<mode>, @vpdi1<mode>): Enable a
parameterized expander.
(*vpdi4<mode>, @vpdi4<mode>): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/s390/vector/perm-vpdi.c: New test.

IBM Z: Implement TARGET_VECTORIZE_VEC_PERM_CONST for vector merge

This patch implements the TARGET_VECTORIZE_VEC_PERM_CONST in the IBM Z
backend. The initial implementation only exploits the vector merge
instruction but there is more to come.

gcc/ChangeLog:

* config/s390/s390.c (MAX_VECT_LEN): Define macro.
(struct expand_vec_perm_d): Define struct.
(expand_perm_with_merge): New function.
(vectorize_vec_perm_const_1): New function.
(s390_vectorize_vec_perm_const): New function.
(TARGET_VECTORIZE_VEC_PERM_CONST): Define target macro.

gcc/testsuite/ChangeLog:

* gcc.target/s390/vector/perm-merge.c: New test.
* gcc.target/s390/vector/vec-types.h: New test.

IBM Z: Remove redundant V_HW_64 mode iterator.

gcc/ChangeLog:

* config/s390/vector.md (V_HW_64): Remove mode iterator.
(*vec_load_pair<mode>): Use V_HW_2 instead of V_HW_64.
* config/s390/vx-builtins.md
(vec_scatter_element<V_HW_2:mode>_SI): Use V_HW_2 instead of
V_HW_64.

IBM Z: Get rid of vpdi unspec

The patch gets rid of the unspec used for the vector permute double
immediate instruction and replaces it with generic rtx.

gcc/ChangeLog:

* config/s390/s390.md (UNSPEC_VEC_PERMI): Remove constant
definition.
* config/s390/vector.md (*vpdi1<mode>, *vpdi4<mode>): New pattern
definitions.
* config/s390/vx-builtins.md (*vec_permi<mode>): Emit generic rtx
instead of an unspec.

gcc/testsuite/ChangeLog:

* gcc.target/s390/zvector/vec-permi.c: Removed.
* gcc.target/s390/zvector/vec_permi.c: New test.

IBM Z: Get rid of vec merge unspec

This patch gets rid of the unspecs we were using for the vector merge
instruction and replaces it with generic rtx.

gcc/ChangeLog:

* config/s390/s390-modes.def: Add more vector modes to support
concatenation of two vectors.
* config/s390/s390-protos.h (s390_expand_merge_perm_const): Add
prototype.
(s390_expand_merge): Likewise.
* config/s390/s390.c (s390_expand_merge_perm_const): New function.
(s390_expand_merge): New function.
* config/s390/s390.md (UNSPEC_VEC_MERGEH, UNSPEC_VEC_MERGEL):
Remove constant definitions.
* config/s390/vector.md (V_HW_2): Add mode iterators.
(VI_HW_4, V_HW_4): Rename VI_HW_4 to V_HW_4.
(vec_2x_nelts, vec_2x_wide): New mode attributes.
(*vmrhb, *vmrlb, *vmrhh, *vmrlh, *vmrhf, *vmrlf, *vmrhg, *vmrlg):
New pattern definitions.
(vec_widen_umult_lo_<mode>, vec_widen_umult_hi_<mode>)
(vec_widen_smult_lo_<mode>, vec_widen_smult_hi_<mode>)
(vec_unpacks_lo_v4sf, vec_unpacks_hi_v4sf, vec_unpacks_lo_v2df)
(vec_unpacks_hi_v2df): Adjust expanders to emit non-unspec RTX for
vec merge.
* config/s390/vx-builtins.md (V_HW_4): Remove mode iterator. Now
in vector.md.
(vec_mergeh<mode>, vec_mergel<mode>): Use s390_expand_merge to
emit vec merge pattern.

gcc/testsuite/ChangeLog:

* gcc.target/s390/vector/long-double-asm-in-out-hard-fp-reg.c:
Instead of vpdi with 0 and 5 vmrlg and vmrhg are used now.
* gcc.target/s390/vector/long-double-asm-inout-hard-fp-reg.c: Likewise.
* gcc.target/s390/zvector/vec-types.h: New test.
* gcc.target/s390/zvector/vec_merge.c: New test.

aarch64: Don't include vec_select high-half in SIMD multiply cost

The Neon multiply/multiply-accumulate/multiply-subtract instructions
can select the top or bottom half of the operand registers. This
selection does not change the cost of the underlying instruction and
this should be reflected by the RTL cost function.

This patch adds RTL tree traversal in the Neon multiply cost function
to match vec_select high-half of its operands. This traversal
prevents the cost of the vec_select from being added into the cost of
the multiply - meaning that these instructions can now be emitted in
the combine pass as they are no longer deemed prohibitively
expensive.

gcc/ChangeLog:

2021-07-19 Jonathan Wright <jonathan.wright@arm.com>

* config/aarch64/aarch64.c (aarch64_strip_extend_vec_half):
Define.
(aarch64_rtx_mult_cost): Traverse RTL tree to prevent cost of
vec_select high-half from being added into Neon multiply
cost.
* rtlanal.c (vec_series_highpart_p): Define.
* rtlanal.h (vec_series_highpart_p): Declare.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vmul_high_cost.c: New test.

aarch64: Don't include vec_select element in SIMD multiply cost

The Neon multiply/multiply-accumulate/multiply-subtract instructions
can take various forms - multiplying full vector registers of values
or multiplying one vector by a single element of another. Regardless
of the form used, these instructions have the same cost, and this
should be reflected by the RTL cost function.

This patch adds RTL tree traversal in the Neon multiply cost function
to match the vec_select used by the lane-referencing forms of the
instructions already mentioned. This traversal prevents the cost of
the vec_select from being added into the cost of the multiply -
meaning that these instructions can now be emitted in the combine
pass as they are no longer deemed prohibitively expensive.

gcc/ChangeLog:

2021-07-19 Jonathan Wright <jonathan.wright@arm.com>

* config/aarch64/aarch64.c (aarch64_strip_duplicate_vec_elt):
Define.
(aarch64_rtx_mult_cost): Traverse RTL tree to prevent
vec_select cost from being added into Neon multiply cost.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vmul_element_cost.c: New test.

vect: Tweak comparisons with existing epilogue loops

This patch uses a more accurate scalar iteration estimate when
comparing the epilogue of a constant-iteration loop with a candidate
replacement epilogue.

In the testcase, the patch prevents a 1-to-3-element SVE epilogue
from seeming better than a 64-bit Advanced SIMD epilogue.

gcc/
* tree-vect-loop.c (vect_better_loop_vinfo_p): Detect cases in
which old_loop_vinfo is an epilogue loop that handles a constant
number of iterations.

gcc/testsuite/
* gcc.target/aarch64/sve/cost_model_12.c: New test.

vect: Tweak dump messages for vector mode choice

After vect_analyze_loop has successfully analysed a loop for
one base vector mode B1, it considers using following base vector
modes to vectorise an epilogue.  However, for VECT_COMPARE_COSTS,
a later mode B2 might turn out to be better than B1 was.  Initially
this comparison will be between an epilogue loop (for B2) and a main
loop (for B1).  However, in r11-6458 I'd added code to reanalyse the
B2 epilogue loop as a main loop, partly for correctness and partly
for better costing.

This can lead to a situation in which we think that the B2 epilogue
loop was better than the B1 main loop, but that the B2 main loop is
not better than the B1 main loop.  There was no dump message to say
that this had happened, which made it look like B2 had still won.

gcc/
* tree-vect-loop.c (vect_analyze_loop): Print a dump message
when a reanalyzed loop fails to be cheaper than the current
main loop.

aarch64: Fix a typo

gcc/
* config/aarch64/aarch64.c: Fix a typo.

gcov: check return code of a fclose

gcc/ChangeLog:

PR gcov-profile/101773
* gcov-io.c (gcov_close): Check return code of a fclose.

Fix debug info for ignored decls at start of assembly

Ignored functions decls that are compiled at the start of
the assembly have bogus line numbers until the first .file
directive, as reported in PR101575.

The corresponding binutils bug report is
https://sourceware.org/bugzilla/show_bug.cgi?id=28149

The work around for this issue is to emit a dummy .file
directive before the first function is compiled, unless
another .file directive was already emitted previously.

2021-08-04 Bernd Edlinger <bernd.edlinger@hotmail.de>

PR ada/101575
* dwarf2out.c (dwarf2out_assembly_start): Emit a dummy
.file statement when needed.

[testsuite] Fix trapping access in test PR101750

I believe PR101750 to be a testism. Fix it by giving the class a name.

gcc/testsuite/ChangeLog:

PR tree-optimization/101750
* g++.dg/vect/pr99149.cc: Name class.

Add emulated gather capability to the vectorizer

This adds a gather vectorization capability to the vectorizer
without target support by decomposing the offset vector, doing
sclar loads and then building a vector from the result.  This
is aimed mainly at cases where vectorizing the rest of the loop
offsets the cost of vectorizing the gather.

Note it's difficult to avoid vectorizing the offset load, but in
some cases later passes can turn the vector load + extract into
scalar loads, see the followup patch.

On SPEC CPU 2017 510.parest_r this improves runtime from 250s
to 219s on a Zen2 CPU which has its native gather instructions
disabled (using those the runtime instead increases to 254s)
using -Ofast -march=znver2 [-flto].  It turns out the critical
loops in this benchmark all perform gather operations.

2021-07-30  Richard Biener  <rguenther@suse.de>

* tree-vect-data-refs.c (vect_check_gather_scatter):
Include widening conversions only when the result is
still handed by native gather or the current offset
size not already matches the data size.
Also succeed analysis in case there's no native support,
noted by a IFN_LAST ifn and a NULL decl.
(vect_analyze_data_refs): Always consider gathers.
* tree-vect-patterns.c (vect_recog_gather_scatter_pattern):
Test for no IFN gather rather than decl gather.
* tree-vect-stmts.c (vect_model_load_cost): Pass in the
gather-scatter info and cost emulated gathers accordingly.
(vect_truncate_gather_scatter_offset): Properly test for
no IFN gather.
(vect_use_strided_gather_scatters_p): Likewise.
(get_load_store_type): Handle emulated gathers and its
restrictions.
(vectorizable_load): Likewise.  Emulate them by extracting
scalar offsets, doing scalar loads and a vector construct.

* gcc.target/i386/vect-gather-1.c: New testcase.
* gfortran.dg/vect/vect-8.f90: Adjust.

by_pieces: Pass MAX_PIECES to op_by_pieces_d

Pass MAX_PIECES to op_by_pieces_d::op_by_pieces_d for move, store and
compare.

PR target/101742
* expr.c (op_by_pieces_d::op_by_pieces_d): Add a max_pieces
argument to set m_max_size.
(move_by_pieces_d): Pass MOVE_MAX_PIECES to op_by_pieces_d.
(store_by_pieces_d): Pass STORE_MAX_PIECES to op_by_pieces_d.
(compare_by_pieces_d): Pass COMPARE_MAX_PIECES to op_by_pieces_d.

Fold (X<<C1)^(X<<C2) to a multiplication when possible.

The easiest way to motivate these additions to match.pd is with the
following example:

unsigned int foo(unsigned char i) {
  return i | (i<<8) | (i<<16) | (i<<24);
}

which mainline with -O2 on x86_64 currently generates:
foo: movzbl  %dil, %edi
movl    %edi, %eax
movl    %edi, %edx
sall    $8, %eax
sall    $16, %edx
orl     %edx, %eax
orl     %edi, %eax
sall    $24, %edi
orl     %edi, %eax
ret

but with this patch now becomes:
foo: movzbl  %dil, %eax
        imull   $16843009, %eax, %eax
        ret

Interestingly, this transformation is already applied when using
addition, allowing synth_mult to select an optimal sequence, but
not when using the equivalent bit-wise ior or xor operators.

The solution is to use tree_nonzero_bits to check that the
potentially non-zero bits of each operand don't overlap, which
ensures that BIT_IOR_EXPR and BIT_XOR_EXPR produce the same
results as PLUS_EXPR, which effectively generalizes the old
fold_plusminus_mult_expr.  Technically, the transformation
is to canonicalize (X*C1)|(X*C2) and (X*C1)^(X*C2) to
X*(C1+C2) where X and X<<C are considered special cases.

2021-08-04  Roger Sayle  <roger@nextmovesoftware.com>
    Marc Glisse  <marc.glisse@inria.fr>

gcc/ChangeLog
* match.pd (bit_ior, bit_xor): Canonicalize (X*C1)|(X*C2) and
(X*C1)^(X*C2) as X*(C1+C2), and related variants, using
tree_nonzero_bits to ensure that operands are bit-wise disjoint.

gcc/testsuite/ChangeLog
* gcc.dg/fold-ior-4.c: New test.

libstdc++: Add [[nodiscard]] to sequence containers

... and container adaptors.

This adds the [[nodiscard]] attribute to functions with no side-effects
for the sequence containers and their iterators, and the debug versions
of those containers, and the container adaptors,

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:

* include/bits/forward_list.h: Add [[nodiscard]] to functions
with no side-effects.
* include/bits/stl_bvector.h: Likewise.
* include/bits/stl_deque.h: Likewise.
* include/bits/stl_list.h: Likewise.
* include/bits/stl_queue.h: Likewise.
* include/bits/stl_stack.h: Likewise.
* include/bits/stl_vector.h: Likewise.
* include/debug/deque: Likewise.
* include/debug/forward_list: Likewise.
* include/debug/list: Likewise.
* include/debug/safe_iterator.h: Likewise.
* include/debug/vector: Likewise.
* include/std/array: Likewise.
* testsuite/23_containers/array/creation/3_neg.cc: Use
-Wno-unused-result.
* testsuite/23_containers/array/debug/back1_neg.cc: Cast result
to void.
* testsuite/23_containers/array/debug/back2_neg.cc: Likewise.
* testsuite/23_containers/array/debug/front1_neg.cc: Likewise.
* testsuite/23_containers/array/debug/front2_neg.cc: Likewise.
* testsuite/23_containers/array/debug/square_brackets_operator1_neg.cc:
Likewise.
* testsuite/23_containers/array/debug/square_brackets_operator2_neg.cc:
Likewise.
* testsuite/23_containers/array/tuple_interface/get_neg.cc:
Adjust dg-error line numbers.
* testsuite/23_containers/deque/cons/clear_allocator.cc: Cast
result to void.
* testsuite/23_containers/deque/debug/invalidation/4.cc:
Likewise.
* testsuite/23_containers/deque/types/1.cc: Use
-Wno-unused-result.
* testsuite/23_containers/list/types/1.cc: Cast result to void.
* testsuite/23_containers/priority_queue/members/7161.cc:
Likewise.
* testsuite/23_containers/queue/members/7157.cc: Likewise.
* testsuite/23_containers/vector/59829.cc: Likewise.
* testsuite/23_containers/vector/ext_pointer/types/1.cc:
Likewise.
* testsuite/23_containers/vector/ext_pointer/types/2.cc:
Likewise.
* testsuite/23_containers/vector/types/1.cc: Use
-Wno-unused-result.

libstdc++: Add [[nodiscard]] to iterators and related utilities

This adds [[nodiscard]] throughout <iterator>, as proposed by P2377R0
(with some minor corrections).

The attribute is added for all modes from C++11 up, using
[[__nodiscard__]] or _GLIBCXX_NODISCARD where C++17 [[nodiscard]] can't
be used directly.

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:

* include/bits/iterator_concepts.h (iter_move): Add
[[nodiscard]].
* include/bits/range_access.h (begin, end, cbegin, cend)
(rbegin, rend, crbegin, crend, size, data, ssize): Likewise.
* include/bits/ranges_base.h (ranges::begin, ranges::end)
(ranges::cbegin, ranges::cend, ranges::rbegin, ranges::rend)
(ranges::crbegin, ranges::crend, ranges::size, ranges::ssize)
(ranges::empty, ranges::data, ranges::cdata): Likewise.
* include/bits/stl_iterator.h (reverse_iterator, __normal_iterator)
(back_insert_iterator, front_insert_iterator, insert_iterator)
(move_iterator, move_sentinel, common_iterator)
(counted_iterator): Likewise.
* include/bits/stl_iterator_base_funcs.h (distance, next, prev):
Likewise.
* include/bits/stream_iterator.h (istream_iterator)
(ostream_iterartor): Likewise.
* include/bits/streambuf_iterator.h (istreambuf_iterator)
(ostreambuf_iterator): Likewise.
* include/std/ranges (views::single, views::iota, views::all)
(views::filter, views::transform, views::take, views::take_while)
(views::drop, views::drop_while, views::join, views::lazy_split)
(views::split, views::counted, views::common, views::reverse)
(views::elements): Likewise.
* testsuite/20_util/rel_ops.cc: Use -Wno-unused-result.
* testsuite/24_iterators/move_iterator/greedy_ops.cc: Likewise.
* testsuite/24_iterators/normal_iterator/greedy_ops.cc:
Likewise.
* testsuite/24_iterators/reverse_iterator/2.cc: Likewise.
* testsuite/24_iterators/reverse_iterator/greedy_ops.cc:
Likewise.
* testsuite/21_strings/basic_string/range_access/char/1.cc:
Cast result to void.
* testsuite/21_strings/basic_string/range_access/wchar_t/1.cc:
Likewise.
* testsuite/21_strings/basic_string_view/range_access/char/1.cc:
Likewise.
* testsuite/21_strings/basic_string_view/range_access/wchar_t/1.cc:
Likewise.
* testsuite/23_containers/array/range_access.cc: Likewise.
* testsuite/23_containers/deque/range_access.cc: Likewise.
* testsuite/23_containers/forward_list/range_access.cc:
Likewise.
* testsuite/23_containers/list/range_access.cc: Likewise.
* testsuite/23_containers/map/range_access.cc: Likewise.
* testsuite/23_containers/multimap/range_access.cc: Likewise.
* testsuite/23_containers/multiset/range_access.cc: Likewise.
* testsuite/23_containers/set/range_access.cc: Likewise.
* testsuite/23_containers/unordered_map/range_access.cc:
Likewise.
* testsuite/23_containers/unordered_multimap/range_access.cc:
Likewise.
* testsuite/23_containers/unordered_multiset/range_access.cc:
Likewise.
* testsuite/23_containers/unordered_set/range_access.cc:
Likewise.
* testsuite/23_containers/vector/range_access.cc: Likewise.
* testsuite/24_iterators/customization_points/iter_move.cc:
Likewise.
* testsuite/24_iterators/istream_iterator/sentinel.cc:
Likewise.
* testsuite/24_iterators/istreambuf_iterator/sentinel.cc:
Likewise.
* testsuite/24_iterators/move_iterator/dr2061.cc: Likewise.
* testsuite/24_iterators/operations/prev_neg.cc: Likewise.
* testsuite/24_iterators/ostreambuf_iterator/2.cc: Likewise.
* testsuite/24_iterators/range_access/range_access.cc:
Likewise.
* testsuite/24_iterators/range_operations/100768.cc: Likewise.
* testsuite/26_numerics/valarray/range_access2.cc: Likewise.
* testsuite/28_regex/range_access.cc: Likewise.
* testsuite/experimental/string_view/range_access/char/1.cc:
Likewise.
* testsuite/experimental/string_view/range_access/wchar_t/1.cc:
Likewise.
* testsuite/ext/vstring/range_access.cc: Likewise.
* testsuite/std/ranges/adaptors/take.cc: Likewise.
* testsuite/std/ranges/p2259.cc: Likewise.

Rewrite more vector loads to scalar loads

This teaches forwprop to rewrite more vector loads that are only
used in BIT_FIELD_REFs as scalar loads.  This provides the
remaining uplift to SPEC CPU 2017 510.parest_r on Zen 2 which
has CPU gathers disabled.

In particular vector load + vec_unpack + bit-field-ref is turned
into (extending) scalar loads which avoids costly XMM/GPR
transitions.  To not conflict with vector load + bit-field-ref
+ vector constructor matching to vector load + shuffle the
extended transform is only done after vector lowering.

2021-07-30  Richard Biener  <rguenther@suse.de>

* tree-ssa-forwprop.c (pass_forwprop::execute): Split
out code to decompose vector loads ...
(optimize_vector_load): ... here.  Generalize it to
handle intermediate widening and TARGET_MEM_REF loads
and apply it to loads with a supported vector mode as well.

tree-optimization/101756 - avoid vectorizing boolean MAX reductions

The following avoids vectorizing MIN/MAX reductions on bools which,
when ending up as vector(2) <signed-boolean:64> would need to be
adjusted because of the sign change. The fix instead avoids any
reduction vectorization where the result isn't compatible
to the original scalar type since we don't compensate for that
either.

2021-08-04 Richard Biener <rguenther@suse.de>

PR tree-optimization/101756
* tree-vect-slp.c (vectorizable_bb_reduc_epilogue): Make sure
the result of the reduction epilogue is compatible to the original
scalar result.

* gcc.dg/vect/bb-slp-pr101756.c: New testcase.

c++: Fix up #pragma omp declare {simd,variant} and acc routine parsing

When parsing default arguments, we need to temporarily clear parser->omp_declare_simd
and parser->oacc_routine, otherwise it can clash with further declarations
inside of e.g. lambdas inside of those default arguments.

2021-08-04 Jakub Jelinek <jakub@redhat.com>

PR c++/101759
* parser.c (cp_parser_default_argument): Temporarily override
parser->omp_declare_simd and parser->oacc_routine to NULL.

* g++.dg/gomp/pr101759.C: New test.
* g++.dg/goacc/pr101759.C: New test.

testsuite: Fix duplicated content of gcc.c-torture/execute/ieee/pr29302-1.x

The file has two identical halves, seems like twice applied patch.

2021-08-04 Jakub Jelinek <jakub@redhat.com>

* gcc.c-torture/execute/ieee/pr29302-1.x: Undo doubly applied patch.

Refine predicate of peephole2 to general_reg_operand. [PR target/101743]

The define_peephole2 which is added by r12-2640-gf7bf03cf69ccb7dc
should only work on general registers, considering that x86 also
supports mov instructions between gpr, sse reg, mask reg, limiting the
peephole2 predicate to general_reg_operand.

gcc/ChangeLog:

PR target/101743
* config/i386/i386.md (peephole2): Refine predicate from
register_operand to general_reg_operand.

libgcc: Fix duplicated content of config/t-slibgcc-fuchsia

The file has two identical halves, seems like twice applied patch.

2021-08-04 Jakub Jelinek <jakub@redhat.com>

* config/t-slibgcc-fuchsia: Undo doubly applied patch.

Mark path_range_query::dump as override.

gcc/ChangeLog:

* gimple-range-path.h (path_range_query::dump): Mark override.

tree-optimization/101769 - tail recursion creates possibly infinite loop

This makes tail recursion optimization produce a loop structure
manually rather than relying on loop fixup.  That also allows the
loop to be marked as finite (it would eventually blow the stack
if it were not).

2021-08-04  Richard Biener  <rguenther@suse.de>

PR tree-optimization/101769
* tree-tailcall.c (eliminate_tail_call): Add the created loop
for the first recursion and return it via the new output parameter.
(optimize_tail_call): Pass through new output param.
(tree_optimize_tail_calls_1): After creating all latches,
add the created loop to the loop tree.  Do not mark loops for fixup.

* g++.dg/tree-ssa/pr101769.C: New testcase.

docs: document threader-mode param

gcc/ChangeLog:

* doc/invoke.texi: Document threader-mode param.

Add dg-require-effective-target for testcases.

gcc/testsuite/ChangeLog:

* gcc.target/i386/cond_op_addsubmul_d-2.c: Add
dg-require-effective-target for avx512.
* gcc.target/i386/cond_op_addsubmul_q-2.c: Ditto.
* gcc.target/i386/cond_op_addsubmul_w-2.c: Ditto.
* gcc.target/i386/cond_op_addsubmuldiv_double-2.c: Ditto.
* gcc.target/i386/cond_op_addsubmuldiv_float-2.c: Ditto.
* gcc.target/i386/cond_op_fma_double-2.c: Ditto.
* gcc.target/i386/cond_op_fma_float-2.c: Ditto.

Support cond_{fma,fms,fnma,fnms} for vector float/double under AVX512.

gcc/ChangeLog:

* config/i386/sse.md (cond_fma<mode>): New expander.
(cond_fms<mode>): Ditto.
(cond_fnma<mode>): Ditto.
(cond_fnms<mode>): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/cond_op_fma_double-1.c: New test.
* gcc.target/i386/cond_op_fma_double-2.c: New test.
* gcc.target/i386/cond_op_fma_float-1.c: New test.
* gcc.target/i386/cond_op_fma_float-2.c: New test.

compiler: support new language constructs in escape analysis

Previous CLs add new language constructs in Go 1.17, specifically,
unsafe.Add, unsafe.Slice, and conversion from a slice to a pointer
to an array. This CL handles them in the escape analysis.

At the point of the escape analysis, unsafe.Add and unsafe.Slice
are still builtin calls, so just handle them in data flow.
Conversion from a slice to a pointer to an array has already been
lowered to a combination of compound expression, conditional
expression and slice info expressions, so handle them in the
escape analysis.

Reviewed-on: https://go-review.googlesource.com/c/gofrontend/+/339671

Daily bump.

compile, runtime: make selectnbrecv return two values

The only different between selectnbrecv and selectnbrecv2 is the later
set the input pointer value by second return value from chanrecv.

So by making selectnbrecv return two values from chanrecv, we can get
rid of selectnbrecv2, the compiler can now call only selectnbrecv and
generate simpler code.

This is the gofrontend version of https://golang.org/cl/292890.

Reviewed-on: https://go-review.googlesource.com/c/gofrontend/+/339529

compiler: check slice to pointer-to-array conversion element type

When checking a slice to pointer-to-array conversion, I forgot to
verify that the elements types are identical.

For golang/go#395

Reviewed-on: https://go-review.googlesource.com/c/gofrontend/+/339329

rs6000: Replace & by &&

2021-08-03 Segher Boessenkool <segher@kernel.crashing.org>

* config/rs6000/vsx.md (*vsx_le_perm_store_<mode>): Use && instead of &.

rs6000: "e" is not a free constraint letter

It is the prefix of the "es" and "eI" constraints.

2021-08-03 Segher Boessenkool <segher@kernel.crashing.org>

* config/rs6000/constraints.md: Remove "e" from the list of available
constraint characters.

Fix indirect call inlining with AutoFDO

The histogram value for indirect calls was incorrectly set up.
That is fixed now.

With this change the tree-prof tests checking indirect call inlining with AutoFDO
in gcc.dg and g++.dg are passing.

Resolves:
PR gcov-profile/71672 - inlining indirect calls does not work with autofdo

gcc/ChangeLog:
PR gcov-profile/71672
* auto-profile.c (afdo_indirect_call): Fix setup of the historgram value for indirect calls.

Fixes for AutoFDO testing

* create_gcov tool doesn't currently support dwarf 5 so I made a change in profopt.exp
  to pass -gdwarf-4 when compiling the binary to profile.

* I updated the invocation of create_gcov in profopt.exp to pass -gcov_version=2.
  I recently made a change to create_gcov to support version 2:
  https://github.com/google/autofdo/pull/117 .

* I removed useless -o perf.data from the invocation of gcc-auto-profile in
  target-supports.exp.

These changes contribute to fixing PR gcov-profile/71672.

gcc/testsuite/ChangeLog:

* lib/profopt.exp: Pass gdwarf-4 when compiling test to profile; pass -gcov_version=2.
* lib/target-supports.exp: Remove unnecessary -o perf.data passed to gcc-auto-profile.

Fix indir-call-prof-2.c with AutoFDO

indir-call-prof-2.c has -fno-early-inlining but AutoFDO can't work without
early inlining (it needs to match the inlining of the profiled binary).
I changed profopt.exp to always pass -fearly-inlining for AutoFDO.
With that change the indirect call inlining in indir-call-prof-2.c happens in the early inliner
so I changed the dg-final-use-autofdo.

Contributes to fixing PR gcov-profile/71672

gcc/testsuite/ChangeLog:

* gcc.dg/tree-prof/indir-call-prof-2.c: Fix dg-final-use-autofdo.
* lib/profopt.exp: Pass -fearly-inlining when compiling with AutoFDO.

Fixes for AutoFDO tests

* Changed several tests to use -fdump-ipa-afdo-optimized instead of -fdump-ipa-afdo
in dg-options so that the expected output can be found

* Increased the number of iterations in several tests so that perf can have
enough sampling events

Contributes to fixing PR gcov-profile/71672.

gcc/testsuite/ChangeLog:

* g++.dg/tree-prof/indir-call-prof.C: Fix options, increase the number of iterations.
* g++.dg/tree-prof/morefunc.C: Fix options, increase the number of iterations.
* g++.dg/tree-prof/reorder.C: Fix options, increase the number of iterations.
* gcc.dg/tree-prof/indir-call-prof-2.c: Fix options, increase the number of iterations.
* gcc.dg/tree-prof/indir-call-prof.c: Fix options.

Disable a test case in ILP32 [PR101688].

Resolves:
PR testsuite/101688 - g++.dg/warn/Wstringop-overflow-4.C fails on 32-bit archs with new jump threader

gcc/testsuite:
PR testsuite/101688
* g++.dg/warn/Wstringop-overflow-4.C: Disable a test case in ILP32.

rs6000: Add test for _mm_minpos_epu16

Copy the test for _mm_minpos_epu16 from
gcc/testsuite/gcc.target/i386/sse4_1-phminposuw.c, with
a few adjustments:

- Adjust the dejagnu directives for powerpc platform.
- Make the data not be monotonically increasing,
  such that some of the returned values are not
  always the first value (index 0).
- Create a list of input data testing various scenarios
  including more than one minimum value and different
  orders and indices of the minimum value.
- Fix a masking issue where the index was being truncated
  to 2 bits instead of 3 bits, which wasn't found because
  all of the returned indices were 0 with the original
  generated data.
- Support big-endian.

2021-08-03  Paul A. Clarke  <pc@us.ibm.com>

gcc/testsuite
* gcc.target/powerpc/sse4_1-phminposuw.c: Copy from
gcc/testsuite/gcc.target/i386, adjust dg directives to suit,
make more robust.

rs6000: Add support for _mm_minpos_epu16

Add a naive implementation of the subject x86 intrinsic to
ease porting.

2021-08-03 Paul A. Clarke <pc@us.ibm.com>

gcc
* config/rs6000/smmintrin.h (_mm_minpos_epu16): New.

libstdc++: Suppress redundant definitions of inline variables

In C++17 the out-of-class definitions for static constexpr variables are
redundant, because they are implicitly inline. This change avoids
"redundant redeclaration" warnings from -Wsystem-headers -Wdeprecated.

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:

* include/bits/random.tcc (linear_congruential_engine): Do not
define static constexpr members when they are implicitly inline.
* include/std/ratio (ratio, __ratio_multiply, __ratio_divide)
(__ratio_add, __ratio_subtract): Likewise.
* include/std/type_traits (integral_constant): Likewise.
* testsuite/26_numerics/random/pr60037-neg.cc: Adjust dg-error
line number.

libstdc++: Replace TR1 components with C++11 ones in test utils

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:

* testsuite/util/testsuite_common_types.h: Replace uses of
tr1::unordered_map and tr1::unordered_set with their C++11
equivalents.
* testsuite/29_atomics/atomic/cons/assign_neg.cc: Adjust
dg-error line number.
* testsuite/29_atomics/atomic/cons/copy_neg.cc: Likewise.
* testsuite/29_atomics/atomic_integral/cons/assign_neg.cc:
Likewise.
* testsuite/29_atomics/atomic_integral/cons/copy_neg.cc:
Likewise.
* testsuite/29_atomics/atomic_integral/operators/bitwise_neg.cc:
Likewise.
* testsuite/29_atomics/atomic_integral/operators/decrement_neg.cc:
Likewise.
* testsuite/29_atomics/atomic_integral/operators/increment_neg.cc:
Likewise.

libstdc++: Specialize allocator_traits<pmr::polymorphic_allocator<T>>

This adds a partial specialization of allocator_traits, similar to what
was already done for std::allocator. This means that most uses of
polymorphic_allocator via the traits can avoid the metaprogramming
overhead needed to deduce the properties from polymorphic_allocator.

In addition, I'm changing polymorphic_allocator::delete_object to invoke
the destructor (or pseudo-destructor) directly, rather than calling
allocator_traits::destroy, which calls polymorphic_allocator::destroy
(which is deprecated). This is observable if a user has specialized
allocator_traits<polymorphic_allocator<Foo>> and expects to see its
destroy member function called. I consider explicit specializations of
allocator_traits to be wrong-headed, and this use case seems unnecessary
to support. So delete_object just invokes the destructor directly.

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:

* include/std/memory_resource (polymorphic_allocator::delete_object):
Call destructor directly instead of using destroy.
(allocator_traits<polymorphic_allocator<T>>): Define partial
specialization.

libstdc++: Remove trailing whitespace in some tests

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:

* testsuite/20_util/function_objects/binders/3113.cc: Remove
trailing whitespace.
* testsuite/20_util/shared_ptr/assign/auto_ptr.cc: Likewise.
* testsuite/20_util/shared_ptr/assign/auto_ptr_neg.cc: Likewise.
* testsuite/20_util/shared_ptr/assign/auto_ptr_rvalue.cc:
Likewise.
* testsuite/20_util/shared_ptr/creation/dr925.cc: Likewise.
* testsuite/25_algorithms/headers/algorithm/synopsis.cc:
Likewise.
* testsuite/25_algorithms/random_shuffle/requirements/explicit_instantiation/2.cc:
Likewise.
* testsuite/25_algorithms/random_shuffle/requirements/explicit_instantiation/pod.cc:
Likewise.

libstdc++: Deprecate std::random_shuffle for C++14

The std::random_shuffle algorithm was removed in C++14 (without
deprecation). This adds the deprecated attribute for C++14 and later, so
that users are warned they should not be using it in those dialects.

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:

* doc/xml/manual/evolution.xml: Document deprecation.
* doc/html/*: Regenerate.
* include/bits/c++config (_GLIBCXX14_DEPRECATED): Define.
(_GLIBCXX14_DEPRECATED_SUGGEST): Define.
* include/bits/stl_algo.h (random_shuffle): Deprecate for C++14
and later.
* testsuite/25_algorithms/headers/algorithm/synopsis.cc: Adjust
for C++11 and C++14 changes to std::random_shuffle and
std::shuffle.
* testsuite/25_algorithms/random_shuffle/1.cc: Add options to
use deprecated algorithms.
* testsuite/25_algorithms/random_shuffle/59603.cc: Likewise.
* testsuite/25_algorithms/random_shuffle/moveable.cc: Likewise.
* testsuite/25_algorithms/random_shuffle/requirements/explicit_instantiation/2.cc:
Likewise.
* testsuite/25_algorithms/random_shuffle/requirements/explicit_instantiation/pod.cc:
Likewise.

libstdc++: Add testsuite proc for testing deprecated features

This change adds options to tests that explicitly use deprecated
features, so that -D_GLIBCXX_USE_DEPRECATED=0 can be used to run the
rest of the testsuite. The tests that explicitly/intentionally use
deprecated features will still be able to use them, but they can be
disabled for the majority of tests.

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:

* testsuite/23_containers/forward_list/operations/3.cc:
Use lambda instead of std::bind2nd.
* testsuite/20_util/function_objects/binders/3113.cc: Add
options for testing deprecated features.
* testsuite/20_util/pair/cons/99957.cc: Likewise.
* testsuite/20_util/shared_ptr/assign/auto_ptr.cc: Likewise.
* testsuite/20_util/shared_ptr/assign/auto_ptr_neg.cc: Likewise.
* testsuite/20_util/shared_ptr/assign/auto_ptr_rvalue.cc:
Likewise.
* testsuite/20_util/shared_ptr/cons/43820_neg.cc: Likewise.
* testsuite/20_util/shared_ptr/cons/auto_ptr.cc: Likewise.
* testsuite/20_util/shared_ptr/cons/auto_ptr_neg.cc: Likewise.
* testsuite/20_util/shared_ptr/creation/dr925.cc: Likewise.
* testsuite/20_util/unique_ptr/cons/auto_ptr.cc: Likewise.
* testsuite/20_util/unique_ptr/cons/auto_ptr_neg.cc: Likewise.
* testsuite/ext/pb_ds/example/priority_queue_erase_if.cc:
Likewise.
* testsuite/ext/pb_ds/example/priority_queue_split_join.cc:
Likewise.
* testsuite/lib/dg-options.exp (dg_add_options_using-deprecated):
New proc.

libstdc++: Reduce header dependencies in <regex>

This reduces the size of <regex> a little. This is one of the largest
and slowest headers in the library.

By using <bits/stl_algobase.h> and <bits/stl_algo.h> instead of
<algorithm> we don't need to parse all the parallel algorithms and
std::ranges:: algorithms that are not needed by <regex>. Similarly, by
using <bits/stl_tree.h> and <bits/stl_map.h> instead of <map> we don't
need to parse the definition of std::multimap.

The _State_info type is not movable or copyable, so doesn't need to use
std::unique_ptr<bool[]> to manage a bitset, we can just delete it in the
destructor. It would use a lot less space if we used a bitset instead,
but that would be an ABI break. We could do it for the versioned
namespace, but this patch doesn't do so. For future reference, using
vector<bool> would work, but would increase sizeof(_State_info) by two
pointers, because it's three times as large as unique_ptr<bool[]>. We
can't use std::bitset because the length isn't constant. We want a
bitset with a non-constant but fixed length.

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:

* include/bits/regex_executor.h (_State_info): Replace
unique_ptr<bool[]> with array of bool.
* include/bits/regex_executor.tcc: Likewise.
* include/bits/regex_scanner.tcc: Replace std::strchr with
__builtin_strchr.
* include/std/regex: Replace standard headers with smaller
internal ones.
* testsuite/28_regex/traits/char/lookup_classname.cc: Include
<string.h> for strlen.
* testsuite/28_regex/traits/char/lookup_collatename.cc:
Likewise.

x86: Use XMM31 for scratch SSE register

In 64-bit mode, use XMM31 for scratch SSE register to avoid vzeroupper
if possible.

gcc/

* config/i386/i386.c (ix86_gen_scratch_sse_rtx): In 64-bit mode,
try XMM31 to avoid vzeroupper.

gcc/testsuite/

* gcc.target/i386/avx-vzeroupper-14.c: Pass -mno-avx512f to
disable XMM31.
* gcc.target/i386/avx-vzeroupper-15.c: Likewise.
* gcc.target/i386/pr82941-1.c: Updated. Check for vzeroupper.
* gcc.target/i386/pr82942-1.c: Likewise.
* gcc.target/i386/pr82990-1.c: Likewise.
* gcc.target/i386/pr82990-3.c: Likewise.
* gcc.target/i386/pr82990-5.c: Likewise.
* gcc.target/i386/pr100865-4b.c: Likewise.
* gcc.target/i386/pr100865-6b.c: Likewise.
* gcc.target/i386/pr100865-7b.c: Likewise.
* gcc.target/i386/pr100865-10b.c: Likewise.
* gcc.target/i386/pr100865-8b.c: Updated.
* gcc.target/i386/pr100865-9b.c: Likewise.
* gcc.target/i386/pr100865-11b.c: Likewise.
* gcc.target/i386/pr100865-12b.c: Likewise.

libstdc++: Avoid using std::unique_ptr in <locale>

std::wstring_convert and std::wbuffer_convert types are not copyable or
movable, and store a plain pointer without a deleter. That means a much
simpler type that just uses delete in its destructor can be used instead
of std::unique_ptr.

That avoids including and parsing all of <bits/unique_ptr.h> in every
header that includes <locale>. It also avoids instantiating
unique_ptr<C> and std::tuple<C*, default_delete<C>> when the conversion
utilities are used.

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
libstdc++-v3/ChangeLog:

* include/bits/locale_conv.h (__detail::_Scoped_ptr): Define new
RAII class template.
(wstring_convert, wbuffer_convert): Use __detail::_Scoped_ptr
instead of unique_ptr.

aarch64: Add -mtune=neoverse-512tvb

This patch adds an option to tune for Neoverse cores that have
a total vector bandwidth of 512 bits (4x128 for Advanced SIMD
and a vector-length-dependent equivalent for SVE). This is intended
to be a compromise between tuning aggressively for a single core like
Neoverse V1 (which can be too narrow) and tuning for AArch64 cores
in general (which can be too wide).

-mcpu=neoverse-512tvb is equivalent to -mcpu=neoverse-v1
-mtune=neoverse-512tvb.

gcc/
* doc/invoke.texi: Document -mtune=neoverse-512tvb and
-mcpu=neoverse-512tvb.
* config/aarch64/aarch64-cores.def (neoverse-512tvb): New entry.
* config/aarch64/aarch64-tune.md: Regenerate.
* config/aarch64/aarch64.c (neoverse512tvb_sve_vector_cost)
(neoverse512tvb_sve_issue_info, neoverse512tvb_vec_issue_info)
(neoverse512tvb_vector_cost, neoverse512tvb_tunings): New structures.
(aarch64_adjust_body_cost_sve): Handle -mtune=neoverse-512tvb.
(aarch64_adjust_body_cost): Likewise.

aarch64: Restrict issue heuristics to inner vector loop

The AArch64 vector costs try to take issue rates into account.
However, when vectorising an outer loop, we lumped the inner
and outer operations together, which is somewhat meaningless.
This patch restricts the heuristic to the inner loop.

gcc/
* config/aarch64/aarch64.c (aarch64_add_stmt_cost): Only
record issue information for operations that occur in the
innermost loop.

aarch64: Tweak MLA vector costs

The issue-based vector costs currently assume that a multiply-add
sequence can be implemented using a single instruction.  This is
generally true for scalars (which have a 4-operand instruction)
and SVE (which allows the output to be tied to any input).
However, for Advanced SIMD, multiplying two values and adding
an invariant will end up being a move and an MLA.

The only target to use the issue-based vector costs is Neoverse V1,
which would generally prefer SVE in this case anyway.  I therefore
don't have a self-contained testcase.  However, the distinction
becomes more important with a later patch.

gcc/
* config/aarch64/aarch64.c (aarch64_multiply_add_p): Add a vec_flags
parameter.  Detect cases in which an Advanced SIMD MLA would almost
certainly require a MOV.
(aarch64_count_ops): Update accordingly.

aarch64: Tweak the cost of elementwise stores

When the vectoriser scalarises a strided store, it counts one
scalar_store for each element plus one vec_to_scalar extraction
for each element. However, extracting element 0 is free on AArch64,
so it should have zero cost.

I don't have a testcase that requires this for existing -mtune
options, but it becomes more important with a later patch.

gcc/
* config/aarch64/aarch64.c (aarch64_is_store_elt_extraction): New
function, split out from...
(aarch64_detect_vector_stmt_subtype): ...here.
(aarch64_add_stmt_cost): Treat extracting element 0 as free.

aarch64: Add gather_load_xNN_cost tuning fields

This patch adds tuning fields for the total cost of a gather load
instruction. Until now, we've costed them as one scalar load
per element instead. Those scalar_load-based values are also
what the patch uses to fill in the new fields for existing
cost structures.

gcc/
* config/aarch64/aarch64-protos.h (sve_vec_cost):
Add gather_load_x32_cost and gather_load_x64_cost.
* config/aarch64/aarch64.c (generic_sve_vector_cost)
(a64fx_sve_vector_cost, neoversev1_sve_vector_cost): Update
accordingly, using the values given by the scalar_load * number
of elements calculation that we used previously.
(aarch64_detect_vector_stmt_subtype): Use the new fields.

aarch64: Split out aarch64_adjust_body_cost_sve

This patch splits the SVE-specific part of aarch64_adjust_body_cost
out into its own subroutine, so that a future patch can call it
more than once. I wondered about using a lambda to avoid having
to pass all the arguments, but in the end this way seemed clearer.

gcc/
* config/aarch64/aarch64.c (aarch64_adjust_body_cost_sve): New
function, split out from...
(aarch64_adjust_body_cost): ...here.

aarch64: Add a simple fixed-point class for costing

This patch adds a simple fixed-point class for holding fractional
cost values.  It can exactly represent the reciprocal of any
single-vector SVE element count (including the non-power-of-2 ones).
This means that it can also hold 1/N for all N in [1, 16], which should
be enough for the various *_per_cycle fields.

For now the assumption is that the number of possible reciprocals
is fixed at compile time and so the class should always be able
to hold an exact value.

The class uses a uint64_t to hold the fixed-point value, which means
that it can hold any scaled uint32_t cost.  Normally we don't worry
about overflow when manipulating raw uint32_t costs, but just to be
on the safe side, the class uses saturating arithmetic for all
operations.

As far as the changes to the cost routines themselves go:

- The changes to aarch64_add_stmt_cost and its subroutines are
  just laying groundwork for future patches; no functional change
  intended.

- The changes to aarch64_adjust_body_cost mean that we now
  take fractional differences into account.

gcc/
* config/aarch64/fractional-cost.h: New file.
* config/aarch64/aarch64.c: Include <algorithm> (indirectly)
and cost_fraction.h.
(vec_cost_fraction): New typedef.
(aarch64_detect_scalar_stmt_subtype): Use it for statement costs.
(aarch64_detect_vector_stmt_subtype): Likewise.
(aarch64_sve_adjust_stmt_cost, aarch64_adjust_stmt_cost): Likewise.
(aarch64_estimate_min_cycles_per_iter): Use vec_cost_fraction
for cycle counts.
(aarch64_adjust_body_cost): Likewise.
(aarch64_test_cost_fraction): New function.
(aarch64_run_selftests): Call it.

aarch64: Turn sve_width tuning field into a bitmask

The tuning structures have an sve_width field that specifies the
number of bits in an SVE vector (or SVE_NOT_IMPLEMENTED if not
applicable).  This patch turns the field into a bitmask so that
it can specify multiple widths at the same time.  For now we
always treat the mininum width as the likely width.

An alternative would have been to add extra fields, which would
have coped correctly with non-power-of-2 widths.  However,
we're very far from supporting constant non-power-of-2 vectors
in GCC, so I think the non-power-of-2 case will in reality always
have to be hidden behind VLA.

gcc/
* config/aarch64/aarch64-protos.h (tune_params::sve_width): Turn
into a bitmask.
* config/aarch64/aarch64.c (aarch64_cmp_autovec_modes): Update
accordingly.
(aarch64_estimated_poly_value): Likewise.  Use the least significant
set bit for the minimum and likely values.  Use the most significant
set bit for the maximum value.

Add cond_add/sub/mul for vector integer modes.

gcc/ChangeLog:

* config/i386/sse.md (cond_<insn><mode>): New expander.
(cond_mul<mode>): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/cond_op_addsubmul_d-1.c: New test.
* gcc.target/i386/cond_op_addsubmul_d-2.c: New test.
* gcc.target/i386/cond_op_addsubmul_q-1.c: New test.
* gcc.target/i386/cond_op_addsubmul_q-2.c: New test.
* gcc.target/i386/cond_op_addsubmul_w-1.c: New test.
* gcc.target/i386/cond_op_addsubmul_w-2.c: New test.

Fix bashism in `libsanitizer/configure.tgt'

Appending to a string variable with `+=' is a bashism and does not work in
strict POSIX shells like dash. This results in the extra compilation flags not
to be set correctly. This patch replaces the `+=' syntax with a simple string
interpolation to append to the `EXTRA_CXXFLAGS' variable.

libsanitizer/ChangeLog

PR sanitizer/101111
* configure.tgt: Fix bashism in setting of `EXTRA_CXXFLAGS'.

analyzer: Fix ICE on MD builtin [PR101721]

The following testcase ICEs because DECL_FUNCTION_CODE asserts the builtin
is BUILT_IN_NORMAL, but it sees a backend (MD) builtin instead.
The FE, normal and MD builtin numbers overlap, so one should always
check what kind of builtin it is before looking at specific codes.

On the other side, region-model.cc has:
      if (fndecl_built_in_p (callee_fndecl, BUILT_IN_NORMAL)
          && gimple_builtin_call_types_compatible_p (call, callee_fndecl))
        switch (DECL_UNCHECKED_FUNCTION_CODE (callee_fndecl))
which IMO should use DECL_FUNCTION_CODE instead, it checked first it is
a normal builtin...

2021-08-03  Jakub Jelinek  <jakub@redhat.com>

PR analyzer/101721
* sm-malloc.cc (known_allocator_p): Only check DECL_FUNCTION_CODE on
BUILT_IN_NORMAL builtins.

* gcc.dg/analyzer/pr101721.c: New test.

ChangeLog: add problematic commit 2e96b5f14e4025691b57d2301d71aa6092ed44bc.

gcc/ChangeLog:

* ChangeLog: Add manually.

libgomp/ChangeLog:

* ChangeLog: Add manually.

gcc/testsuite/ChangeLog:

* ChangeLog: Add manually.