[nvptx] Fix reduction lock
Running the libgomp test-case reduction-cplx-dbl.c on an nvptx accelerator
(T400, driver version 470.86) fails with:
...
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c \
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 \
execution test
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c \
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O2 \
execution test
...
The problem is in this code generated for a gang reduction:
...
$L39:
atom.global.cas.b32 %r59, [__reduction_lock], 0, 1;
setp.ne.u32 %r116, %r59, 0;
@%r116 bra $L39;
ld.f64 %r60, [%r44];
ld.f64 %r61, [%r44+8];
ld.f64 %r64, [%r44];
ld.f64 %r65, [%r44+8];
add.f64 %r117, %r64, %r22;
add.f64 %r118, %r65, %r41;
st.f64 [%r44], %r117;
st.f64 [%r44+8], %r118;
atom.global.cas.b32 %r119, [__reduction_lock], 1, 0;
...
which takes and releases a lock, but is missing the memory barriers needed to
order the loads and stores inside the lock with respect to the lock operations.
Fix this by adding membar.gl barriers.
Likewise, add membar.cta barriers if we protect shared memory loads and
stores (even though the worker-partitioning part of the test-case is not
failing).
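As an illustration, here is a hand-written sketch (not actual compiler output)
of the gang reduction sequence with barriers.  It reuses the registers from the
listing above and assumes the barriers are placed directly after the lock is
taken and directly before it is released:
...
$L39:
atom.global.cas.b32 %r59, [__reduction_lock], 0, 1;  // try to take the lock
setp.ne.u32 %r116, %r59, 0;
@%r116 bra $L39;                                     // spin until taken
membar.gl;               // assumed placement: order the loads after the lock
ld.f64 %r64, [%r44];
ld.f64 %r65, [%r44+8];
add.f64 %r117, %r64, %r22;
add.f64 %r118, %r65, %r41;
st.f64 [%r44], %r117;
st.f64 [%r44+8], %r118;
membar.gl;               // assumed placement: order the stores before release
atom.global.cas.b32 %r119, [__reduction_lock], 1, 0; // release the lock
...
For a lock protecting shared memory, the same shape would apply with
membar.cta in place of membar.gl.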
Tested on x86_64 with nvptx accelerator.
gcc/ChangeLog:
2022-01-27 Tom de Vries <tdevries@suse.de>
* config/nvptx/nvptx.cc (enum nvptx_builtins): Add
NVPTX_BUILTIN_MEMBAR_GL and NVPTX_BUILTIN_MEMBAR_CTA.
(VOID): New macro.
(nvptx_init_builtins): Add MEMBAR_GL and MEMBAR_CTA.
(nvptx_expand_builtin): Handle NVPTX_BUILTIN_MEMBAR_GL and
NVPTX_BUILTIN_MEMBAR_CTA.
(nvptx_lockfull_update): Add level parameter. Emit barriers.
(nvptx_reduction_update, nvptx_goacc_reduction_fini): Update call to
nvptx_lockfull_update.
* config/nvptx/nvptx.md (define_c_enum "unspecv"): Add
UNSPECV_MEMBAR_GL.
(define_expand "nvptx_membar_gl"): New expand.
(define_insn "*nvptx_membar_gl"): New insn.