radeonsi: remove the GDS variants of compute-based primitive discard
author    Marek Olšák <marek.olsak@amd.com>
          Mon, 31 May 2021 01:28:53 +0000 (21:28 -0400)
committer Marge Bot <eric+marge@anholt.net>
          Mon, 28 Jun 2021 13:23:14 +0000 (13:23 +0000)
The GDS ordered append variant is unstable due to kernel and firmware bugs.
The unordered GDS variant isn't faster than the memory-based variant.

Only the memory-based variant is kept.

Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11510>
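
For reference, the surviving memory-based scheme works like this: each compute
wave atomically adds its accepted vertex count to the "count" word inside the
draw packet, and the gfx queue's REWIND packet waits for a signal before the
draw consumes that count. A minimal CPU-side sketch of the handshake follows
(illustrative only, not radeonsi code; the sequential loop stands in for the
CP polling):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define SIGNAL_BIT 0x80000000u /* same value as REWIND_SIGNAL_BIT in the patch */

/* The "count" word inside the draw packet and the word the REWIND packet
 * polls. On the GPU these live in the gfx IB; plain memory here. */
static _Atomic uint32_t draw_packet_count;
static _Atomic uint32_t rewind_word;

/* One compute wave reserves index-buffer space by atomically bumping the
 * vertex count in the draw packet (the memory-based codepath this commit
 * keeps; the removed GDS variants used GDS counters instead). */
static uint32_t wave_append(uint32_t accepted_prims, uint32_t verts_per_prim)
{
   return atomic_fetch_add(&draw_packet_count, accepted_prims * verts_per_prim);
}

int main(void)
{
   uint32_t accepted[] = {10, 0, 5}; /* triangles accepted per wave */
   for (int i = 0; i < 3; i++)
      wave_append(accepted[i], 3);

   /* Stand-in for the CS_DONE event: signal the REWIND packet. */
   atomic_store(&rewind_word, SIGNAL_BIT);

   /* CP side: REWIND doesn't continue until its word is 0x80000000, then
    * the draw uses whatever vertex count the shaders produced. */
   while (atomic_load(&rewind_word) != SIGNAL_BIT)
      ; /* spin (exits immediately in this sequential simulation) */
   printf("draw %u vertices\n", atomic_load(&draw_packet_count));
   return 0;
}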

src/gallium/drivers/radeonsi/si_compute_prim_discard.c
src/gallium/drivers/radeonsi/si_gfx_cs.c
src/gallium/drivers/radeonsi/si_pipe.c
src/gallium/drivers/radeonsi/si_pipe.h
src/gallium/drivers/radeonsi/si_shader.c
src/gallium/drivers/radeonsi/si_shader.h
src/gallium/drivers/radeonsi/si_state_draw.cpp

src/gallium/drivers/radeonsi/si_compute_prim_discard.c
index 939423e..ff875c1 100644
@@ -38,7 +38,6 @@
  */
 
 /* This file implements primitive culling using asynchronous compute.
- * It's written to be GL conformant.
  *
  * It takes a monolithic VS in LLVM IR returning gl_Position and invokes it
  * in a compute shader. The shader processes 1 primitive/thread, performs
  * the culling tests listed below, and generates a new index buffer that doesn't
  * contain culled primitives.
  *
- * The index buffer is generated using the Ordered Count feature of GDS,
- * which is an atomic counter that is incremented in the wavefront launch
- * order, so that the original primitive order is preserved.
- *
- * Another GDS ordered counter is used to eliminate primitive restart indices.
- * If a restart index lands on an even thread ID, the compute shader has to flip
- * the primitive orientation of the whole following triangle strip. The primitive
- * orientation has to be correct after strip and fan decomposition for two-sided
- * shading to behave correctly. The decomposition also needs to be aware of
- * which vertex is the provoking vertex for flat shading to behave correctly.
+ * There is no primitive ordering. The generated index buffer will contain
+ * primitives in a random order.
  *
  * IB = a GPU command buffer
  *
  * Both the compute and gfx IBs run in parallel sort of like CE and DE.
  * The gfx IB has a CP barrier (REWIND packet) before a draw packet. REWIND
- * doesn't continue if its word isn't 0x80000000. Once compute shaders are
- * finished culling, the last wave will write the final primitive count from
- * GDS directly into the count word of the draw packet in the gfx IB, and
- * a CS_DONE event will signal the REWIND packet to continue. It's really
- * a direct draw with command buffer patching from the compute queue.
+ * doesn't continue if its word isn't 0x80000000. The compute shader atomically
+ * increments the vertex count directly within the draw packet. A CS_DONE event
+ * will signal the REWIND packet to continue. It's really a direct draw with
+ * command buffer patching from the compute queue.
  *
  * The compute IB doesn't have to start when its corresponding gfx IB starts,
  * but can start sooner. The compute IB is signaled to start after the last
  *   The decomposition differs based on the provoking vertex state.
  * - Instanced draws are converted into non-instanced draws for 16-bit indices.
  *   (InstanceID is stored in the high bits of VertexID and unpacked by VS)
- * - Primitive restart is fully supported with triangle strips, including
- *   correct primitive orientation across multiple waves. (restart indices
- *   reset primitive orientation)
  * - W<0 culling (W<0 is behind the viewer, sort of like near Z culling).
  * - Back face culling, incl. culling zero-area / degenerate primitives.
  * - View XY culling.
- * - View Z culling (disabled due to limited impact with perspective projection).
  * - Small primitive culling for all MSAA modes and all quant modes.
  *
  * The following are not implemented:
@@ -97,8 +83,7 @@
  *
  * Limitations (and unimplemented features that may be possible to implement):
  * - Only triangles and triangle strips are supported.
- * - Primitive restart is only supported with triangle strips.
- * - Instancing and primitive restart can't be used together.
+ * - Primitive restart is not supported.
  * - Instancing is only supported with 16-bit indices and instance count <= 2^16.
  * - The instance divisor buffer is unavailable, so all divisors must be
  *   either 0 or 1.
  *     0..3: input index buffer - typed buffer view
  *     4..7: output index buffer - typed buffer view
  *     8..11: viewport state - scale.xy, translate.xy
- *   VERTEX_COUNTER: counter address or first primitive ID
- *     - If unordered memory counter: address of "count" in the draw packet
- *       and is incremented atomically by the shader.
- *     - If unordered GDS counter: address of "count" in GDS starting from 0,
- *       must be initialized to 0 before the dispatch.
- *     - If ordered GDS counter: the primitive ID that should reset the vertex
- *       counter to 0 in GDS
- *   LAST_WAVE_PRIM_ID: the primitive ID that should write the final vertex
- *       count to memory if using GDS ordered append
- *   VERTEX_COUNT_ADDR: where the last wave should write the vertex count if
- *       using GDS ordered append
+ *   VERTEX_COUNTER: address of the "count" word in the draw packet, which
+ *       the shader increments atomically.
  *   VS.VERTEX_BUFFERS:           same value as VS
  *   VS.CONST_AND_SHADER_BUFFERS: same value as VS
  *   VS.SAMPLERS_AND_IMAGES:      same value as VS
  *   NUM_PRIMS_UDIV_TERMS:
  *     - Bits [0:4]: "post_shift" for fast 31-bit division for instancing.
  *     - Bits [5:31]: The number of primitives per instance for computing the remainder.
- *   PRIMITIVE_RESTART_INDEX
  *   SMALL_PRIM_CULLING_PRECISION: Scale the primitive bounding box by this number.
  *
- *
- * The code contains 3 codepaths:
- * - Unordered memory counter (for debugging, random primitive order, no primitive restart)
- * - Unordered GDS counter (for debugging, random primitive order, no primitive restart)
- * - Ordered GDS counter (it preserves the primitive order)
- *
  * How to test primitive restart (the most complicated part because it needs
  * to get the primitive orientation right):
  *   Set THREADGROUP_SIZE to 2 to exercise both intra-wave and inter-wave
 #define THREADGROUPS_PER_CU  1   /* TGs to launch on 1 CU before going onto the next, max 8 */
 #define MAX_WAVES_PER_SH     0   /* no limit */
 #define INDEX_STORES_USE_SLC 1   /* don't cache indices if L2 is full */
-/* 0 = unordered memory counter, 1 = unordered GDS counter, 2 = ordered GDS counter */
-#define VERTEX_COUNTER_GDS_MODE 2
-#define GDS_SIZE_UNORDERED      (4 * 1024) /* only for the unordered GDS counter */
 
 /* Grouping compute dispatches for small draw calls: How many primitives from multiple
  * draw calls to process by compute before signaling the gfx IB. This reduces the number
  * of EOP events + REWIND packets, because they decrease performance. */
 #define PRIMS_PER_BATCH (512 * 1024)
 /* Draw call splitting at the packet level. This allows signaling the gfx IB
- * for big draw calls sooner, but doesn't allow context flushes between packets.
- * Primitive restart is supported. Only implemented for ordered append. */
+ * for big draw calls sooner, but doesn't allow context flushes between packets. */
 #define SPLIT_PRIMS_PACKET_LEVEL_VALUE PRIMS_PER_BATCH
 /* If there is not enough ring buffer space for the current IB, split draw calls into
  * this number of primitives, so that we can flush the context and get free ring space. */
 /* Derived values. */
 #define WAVES_PER_TG DIV_ROUND_UP(THREADGROUP_SIZE, 64)
 #define SPLIT_PRIMS_PACKET_LEVEL                                                                   \
-   (VERTEX_COUNTER_GDS_MODE == 2 ? SPLIT_PRIMS_PACKET_LEVEL_VALUE                                  \
+   (false /* TODO */ ? SPLIT_PRIMS_PACKET_LEVEL_VALUE                                              \
                                  : UINT_MAX & ~(THREADGROUP_SIZE - 1))
 
 #define REWIND_SIGNAL_BIT 0x80000000
 
+static LLVMValueRef si_expand_32bit_pointer(struct si_shader_context *ctx, LLVMValueRef ptr);
+
 void si_initialize_prim_discard_tunables(struct si_screen *sscreen, bool is_aux_context,
                                          unsigned *prim_discard_vertex_count_threshold,
                                          unsigned *index_ring_size_per_ib)
@@ -188,10 +155,10 @@ void si_initialize_prim_discard_tunables(struct si_screen *sscreen, bool is_aux_
    *prim_discard_vertex_count_threshold = UINT_MAX; /* disable */
 
    if (sscreen->info.chip_class <= GFX7 || /* SI-CI support is not implemented */
-       !sscreen->info.has_gds_ordered_append || sscreen->debug_flags & DBG(NO_PD) || is_aux_context)
+       sscreen->debug_flags & DBG(NO_PD) || is_aux_context)
       return;
 
-   /* TODO: enable this after the GDS kernel memory management is fixed */
+   /* TODO: enable this by default */
    bool enable_on_pro_graphics_by_default = false;
 
    if (sscreen->debug_flags & DBG(ALWAYS_PD) || sscreen->debug_flags & DBG(PD) ||
@@ -220,30 +187,6 @@ void si_initialize_prim_discard_tunables(struct si_screen *sscreen, bool is_aux_
    }
 }
 
-/* Opcode can be "add" or "swap". */
-static LLVMValueRef si_build_ds_ordered_op(struct si_shader_context *ctx, const char *opcode,
-                                           LLVMValueRef m0, LLVMValueRef value,
-                                           unsigned ordered_count_index, bool release, bool done)
-{
-   if (ctx->screen->info.chip_class >= GFX10)
-      ordered_count_index |= 1 << 24; /* number of dwords == 1 */
-
-   LLVMValueRef args[] = {
-      LLVMBuildIntToPtr(ctx->ac.builder, m0, LLVMPointerType(ctx->ac.i32, AC_ADDR_SPACE_GDS), ""),
-      value,
-      LLVMConstInt(ctx->ac.i32, LLVMAtomicOrderingMonotonic, 0), /* ordering */
-      ctx->ac.i32_0,                                             /* scope */
-      ctx->ac.i1false,                                           /* volatile */
-      LLVMConstInt(ctx->ac.i32, ordered_count_index, 0),
-      LLVMConstInt(ctx->ac.i1, release, 0),
-      LLVMConstInt(ctx->ac.i1, done, 0),
-   };
-
-   char intrinsic[64];
-   snprintf(intrinsic, sizeof(intrinsic), "llvm.amdgcn.ds.ordered.%s", opcode);
-   return ac_build_intrinsic(&ctx->ac, intrinsic, ctx->ac.i32, args, ARRAY_SIZE(args), 0);
-}
-
 static LLVMValueRef si_expand_32bit_pointer(struct si_shader_context *ctx, LLVMValueRef ptr)
 {
    uint64_t hi = (uint64_t)ctx->screen->info.address32_hi << 32;
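
si_expand_32bit_pointer rebuilds a full 64-bit GPU address by OR-ing in the
high dword that all 32-bit-addressable allocations share. A standalone sketch
of the same computation (the address32_hi value below is made up; the driver
reads the real one from the kernel):

#include <stdint.h>
#include <stdio.h>

/* Expand a 32-bit GPU virtual address to 64 bits, given the high dword
 * shared by all 32-bit-addressable allocations (address32_hi). */
static uint64_t expand_32bit_pointer(uint32_t ptr32, uint32_t address32_hi)
{
   return ((uint64_t)address32_hi << 32) | ptr32;
}

int main(void)
{
   uint64_t va = expand_32bit_pointer(0x1000, 0xffff8000u);
   printf("0x%016llx\n", (unsigned long long)va);
   return 0;
}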
@@ -321,16 +264,14 @@ void si_build_prim_discard_compute_shader(struct si_shader_context *ctx)
    struct ac_arg param_index_buffers_and_constants, param_vertex_counter;
    struct ac_arg param_vb_desc, param_const_desc;
    struct ac_arg param_base_vertex, param_start_instance;
-   struct ac_arg param_block_id, param_local_id, param_ordered_wave_id;
-   struct ac_arg param_restart_index, param_smallprim_precision;
+   struct ac_arg param_block_id, param_local_id;
+   struct ac_arg param_smallprim_precision;
    struct ac_arg param_num_prims_udiv_multiplier, param_num_prims_udiv_terms;
-   struct ac_arg param_sampler_desc, param_last_wave_prim_id, param_vertex_count_addr;
+   struct ac_arg param_sampler_desc;
 
    ac_add_arg(&ctx->args, AC_ARG_SGPR, 1, AC_ARG_CONST_DESC_PTR,
               &param_index_buffers_and_constants);
    ac_add_arg(&ctx->args, AC_ARG_SGPR, 1, AC_ARG_INT, &param_vertex_counter);
-   ac_add_arg(&ctx->args, AC_ARG_SGPR, 1, AC_ARG_INT, &param_last_wave_prim_id);
-   ac_add_arg(&ctx->args, AC_ARG_SGPR, 1, AC_ARG_INT, &param_vertex_count_addr);
    ac_add_arg(&ctx->args, AC_ARG_SGPR, 1, AC_ARG_CONST_DESC_PTR, &param_vb_desc);
    ac_add_arg(&ctx->args, AC_ARG_SGPR, 1, const_desc_type, &param_const_desc);
    ac_add_arg(&ctx->args, AC_ARG_SGPR, 1, AC_ARG_CONST_IMAGE_PTR, &param_sampler_desc);
@@ -338,13 +279,10 @@ void si_build_prim_discard_compute_shader(struct si_shader_context *ctx)
    ac_add_arg(&ctx->args, AC_ARG_SGPR, 1, AC_ARG_INT, &param_start_instance);
    ac_add_arg(&ctx->args, AC_ARG_SGPR, 1, AC_ARG_INT, &param_num_prims_udiv_multiplier);
    ac_add_arg(&ctx->args, AC_ARG_SGPR, 1, AC_ARG_INT, &param_num_prims_udiv_terms);
-   ac_add_arg(&ctx->args, AC_ARG_SGPR, 1, AC_ARG_INT, &param_restart_index);
    ac_add_arg(&ctx->args, AC_ARG_SGPR, 1, AC_ARG_FLOAT, &param_smallprim_precision);
 
    /* Block ID and thread ID inputs. */
    ac_add_arg(&ctx->args, AC_ARG_SGPR, 1, AC_ARG_INT, &param_block_id);
-   if (VERTEX_COUNTER_GDS_MODE == 2)
-      ac_add_arg(&ctx->args, AC_ARG_SGPR, 1, AC_ARG_INT, &param_ordered_wave_id);
    ac_add_arg(&ctx->args, AC_ARG_VGPR, 1, AC_ARG_INT, &param_local_id);
 
    /* Create the compute shader function. */
@@ -353,12 +291,6 @@ void si_build_prim_discard_compute_shader(struct si_shader_context *ctx)
    si_llvm_create_func(ctx, "prim_discard_cs", NULL, 0, THREADGROUP_SIZE);
    ctx->stage = old_stage;
 
-   if (VERTEX_COUNTER_GDS_MODE == 2) {
-      ac_llvm_add_target_dep_function_attr(ctx->main_fn, "amdgpu-gds-size", 256);
-   } else if (VERTEX_COUNTER_GDS_MODE == 1) {
-      ac_llvm_add_target_dep_function_attr(ctx->main_fn, "amdgpu-gds-size", GDS_SIZE_UNORDERED);
-   }
-
    /* Assemble parameters for VS. */
    LLVMValueRef vs_params[16];
    unsigned num_vs_params = 0;
@@ -451,16 +383,6 @@ void si_build_prim_discard_compute_shader(struct si_shader_context *ctx)
       }
    }
 
-   LLVMValueRef ordered_wave_id = NULL;
-
-   /* Extract the ordered wave ID. */
-   if (VERTEX_COUNTER_GDS_MODE == 2) {
-      ordered_wave_id = ac_get_arg(&ctx->ac, param_ordered_wave_id);
-      ordered_wave_id =
-         LLVMBuildLShr(builder, ordered_wave_id, LLVMConstInt(ctx->ac.i32, 6, 0), "");
-      ordered_wave_id =
-         LLVMBuildAnd(builder, ordered_wave_id, LLVMConstInt(ctx->ac.i32, 0xfff, 0), "");
-   }
    LLVMValueRef thread_id = LLVMBuildAnd(builder, ac_get_arg(&ctx->ac, param_local_id),
                                          LLVMConstInt(ctx->ac.i32, 63, 0), "");
 
@@ -477,140 +399,9 @@ void si_build_prim_discard_compute_shader(struct si_shader_context *ctx)
        * Only primitive restart can flip it with respect to the first vertex
        * of the draw call.
        */
-      LLVMValueRef first_is_odd = ctx->ac.i1false;
-
-      /* Handle primitive restart. */
-      if (key->opt.cs_primitive_restart) {
-         /* Get the GDS primitive restart continue flag and clear
-          * the flag in vertex_counter. This flag is used when the draw
-          * call was split and we need to load the primitive orientation
-          * flag from GDS for the first wave too.
-          */
-         LLVMValueRef gds_prim_restart_continue =
-            LLVMBuildLShr(builder, vertex_counter, LLVMConstInt(ctx->ac.i32, 31, 0), "");
-         gds_prim_restart_continue =
-            LLVMBuildTrunc(builder, gds_prim_restart_continue, ctx->ac.i1, "");
-         vertex_counter =
-            LLVMBuildAnd(builder, vertex_counter, LLVMConstInt(ctx->ac.i32, 0x7fffffff, 0), "");
-
-         LLVMValueRef index0_is_reset;
-
-         for (unsigned i = 0; i < 3; i++) {
-            LLVMValueRef not_reset = LLVMBuildICmp(builder, LLVMIntNE, index[i],
-                                                   ac_get_arg(&ctx->ac, param_restart_index), "");
-            if (i == 0)
-               index0_is_reset = LLVMBuildNot(builder, not_reset, "");
-            prim_restart_accepted = LLVMBuildAnd(builder, prim_restart_accepted, not_reset, "");
-         }
-
-         /* If the previous waves flip the primitive orientation
-          * of the current triangle strip, it will be stored in GDS.
-          *
-          * Sometimes the correct orientation is not needed, in which case
-          * we don't need to execute this.
-          */
-         if (key->opt.cs_need_correct_orientation && VERTEX_COUNTER_GDS_MODE == 2) {
-            /* If there are reset indices in this wave, get the thread index
-             * where the most recent strip starts relative to each thread.
-             */
-            LLVMValueRef preceding_threads_mask =
-               LLVMBuildSub(builder,
-                            LLVMBuildShl(builder, ctx->ac.i64_1,
-                                         LLVMBuildZExt(builder, thread_id, ctx->ac.i64, ""), ""),
-                            ctx->ac.i64_1, "");
-
-            LLVMValueRef reset_threadmask = ac_get_i1_sgpr_mask(&ctx->ac, index0_is_reset);
-            LLVMValueRef preceding_reset_threadmask =
-               LLVMBuildAnd(builder, reset_threadmask, preceding_threads_mask, "");
-            LLVMValueRef strip_start = ac_build_umsb(&ctx->ac, preceding_reset_threadmask, NULL);
-            strip_start = LLVMBuildAdd(builder, strip_start, ctx->ac.i32_1, "");
-
-            /* This flips the orientation based on reset indices within this wave only. */
-            first_is_odd = LLVMBuildTrunc(builder, strip_start, ctx->ac.i1, "");
-
-            LLVMValueRef last_strip_start, prev_wave_state, ret, tmp;
-            LLVMValueRef is_first_wave, current_wave_resets_index;
-
-            /* Get the thread index where the last strip starts in this wave.
-             *
-             * If the last strip doesn't start in this wave, the thread index
-             * will be 0.
-             *
-             * If the last strip starts in the next wave, the thread index will
-             * be 64.
-             */
-            last_strip_start = ac_build_umsb(&ctx->ac, reset_threadmask, NULL);
-            last_strip_start = LLVMBuildAdd(builder, last_strip_start, ctx->ac.i32_1, "");
-
-            struct si_thread0_section section;
-            si_enter_thread0_section(ctx, &section, thread_id, NULL);
-
-            /* This must be done in the thread 0 section, because
-             * we expect PrimID to be 0 for the whole first wave
-             * in this expression.
-             *
-             * NOTE: This will need to be different if we wanna support
-             * instancing with primitive restart.
-             */
-            is_first_wave = LLVMBuildICmp(builder, LLVMIntEQ, prim_id, ctx->ac.i32_0, "");
-            is_first_wave = LLVMBuildAnd(builder, is_first_wave,
-                                         LLVMBuildNot(builder, gds_prim_restart_continue, ""), "");
-            current_wave_resets_index =
-               LLVMBuildICmp(builder, LLVMIntNE, last_strip_start, ctx->ac.i32_0, "");
-
-            ret = ac_build_alloca_undef(&ctx->ac, ctx->ac.i32, "prev_state");
-
-            /* Save the last strip start primitive index in GDS and read
-             * the value that previous waves stored.
-             *
-             * if (is_first_wave || current_wave_resets_strip)
-             *    // Read the value that previous waves stored and store a new one.
-             *    first_is_odd = ds.ordered.swap(last_strip_start);
-             * else
-             *    // Just read the value that previous waves stored.
-             *    first_is_odd = ds.ordered.add(0);
-             */
-            ac_build_ifcc(
-               &ctx->ac, LLVMBuildOr(builder, is_first_wave, current_wave_resets_index, ""), 12602);
-            {
-               /* The GDS address is always 0 with ordered append. */
-               tmp = si_build_ds_ordered_op(ctx, "swap", ordered_wave_id, last_strip_start, 1, true,
-                                            false);
-               LLVMBuildStore(builder, tmp, ret);
-            }
-            ac_build_else(&ctx->ac, 12603);
-            {
-               /* Just read the value from GDS. */
-               tmp = si_build_ds_ordered_op(ctx, "add", ordered_wave_id, ctx->ac.i32_0, 1, true,
-                                            false);
-               LLVMBuildStore(builder, tmp, ret);
-            }
-            ac_build_endif(&ctx->ac, 12602);
-
-            prev_wave_state = LLVMBuildLoad(builder, ret, "");
-            /* Ignore the return value if this is the first wave. */
-            prev_wave_state =
-               LLVMBuildSelect(builder, is_first_wave, ctx->ac.i32_0, prev_wave_state, "");
-            si_exit_thread0_section(&section, &prev_wave_state);
-            prev_wave_state = LLVMBuildTrunc(builder, prev_wave_state, ctx->ac.i1, "");
-
-            /* If the strip start appears to be on thread 0 for the current primitive
-             * (meaning the reset index is not present in this wave and might have
-             * appeared in previous waves), use the value from GDS to determine
-             * primitive orientation.
-             *
-             * If the strip start is in this wave for the current primitive, use
-             * the value from the current wave to determine primitive orientation.
-             */
-            LLVMValueRef strip_start_is0 =
-               LLVMBuildICmp(builder, LLVMIntEQ, strip_start, ctx->ac.i32_0, "");
-            first_is_odd =
-               LLVMBuildSelect(builder, strip_start_is0, prev_wave_state, first_is_odd, "");
-         }
-      }
-      /* prim_is_odd = (first_is_odd + current_is_odd) % 2. */
+      /* prim_is_odd = current_is_odd (first_is_odd is always false now). */
       LLVMValueRef prim_is_odd = LLVMBuildXor(
-         builder, first_is_odd, LLVMBuildTrunc(builder, thread_id, ctx->ac.i1, ""), "");
+         builder, ctx->ac.i1false, LLVMBuildTrunc(builder, thread_id, ctx->ac.i1, ""), "");
 
       /* Convert triangle strip indices to triangle indices. */
       ac_build_triangle_strip_indices_to_triangle(
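
The strip decomposition follows the usual rule: triangle i of a strip uses
vertices (i, i+1, i+2), with the first two swapped for odd i so that every
triangle keeps the strip's winding; that is why prim_is_odd above reduces to
the thread parity once primitive restart is gone. A standalone sketch of that
rule (illustrative, not the shader code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Triangle i of a strip uses vertices (i, i+1, i+2); for odd i the first
 * two are swapped so every triangle keeps the strip's winding. */
static void strip_to_triangle_list(const uint32_t *strip, unsigned num_indices,
                                   uint32_t *out /* 3 * (num_indices - 2) */)
{
   for (unsigned i = 0; i + 2 < num_indices; i++) {
      bool odd = i & 1;
      out[i * 3 + 0] = strip[i + (odd ? 1 : 0)];
      out[i * 3 + 1] = strip[i + (odd ? 0 : 1)];
      out[i * 3 + 2] = strip[i + 2];
   }
}

int main(void)
{
   uint32_t strip[] = {0, 1, 2, 3, 4};
   uint32_t tris[9];
   strip_to_triangle_list(strip, 5, tris);
   for (int t = 0; t < 3; t++) /* prints 0 1 2, 2 1 3, 2 3 4 */
      printf("%u %u %u\n", tris[t * 3], tris[t * 3 + 1], tris[t * 3 + 2]);
   return 0;
}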
@@ -672,106 +463,16 @@ void si_build_prim_discard_compute_shader(struct si_shader_context *ctx)
    struct si_thread0_section section;
    si_enter_thread0_section(ctx, &section, thread_id, num_prims_accepted);
    {
-      if (VERTEX_COUNTER_GDS_MODE == 0) {
-         LLVMValueRef num_indices = LLVMBuildMul(
-            builder, num_prims_accepted, LLVMConstInt(ctx->ac.i32, vertices_per_prim, 0), "");
-         vertex_counter = si_expand_32bit_pointer(ctx, vertex_counter);
-         start = LLVMBuildAtomicRMW(builder, LLVMAtomicRMWBinOpAdd, vertex_counter, num_indices,
-                                    LLVMAtomicOrderingMonotonic, false);
-      } else if (VERTEX_COUNTER_GDS_MODE == 1) {
-         LLVMValueRef num_indices = LLVMBuildMul(
-            builder, num_prims_accepted, LLVMConstInt(ctx->ac.i32, vertices_per_prim, 0), "");
-         vertex_counter = LLVMBuildIntToPtr(builder, vertex_counter,
-                                            LLVMPointerType(ctx->ac.i32, AC_ADDR_SPACE_GDS), "");
-         start = LLVMBuildAtomicRMW(builder, LLVMAtomicRMWBinOpAdd, vertex_counter, num_indices,
-                                    LLVMAtomicOrderingMonotonic, false);
-      } else if (VERTEX_COUNTER_GDS_MODE == 2) {
-         LLVMValueRef tmp_store = ac_build_alloca_undef(&ctx->ac, ctx->ac.i32, "");
-
-         /* If the draw call was split into multiple subdraws, each using
-          * a separate draw packet, we need to start counting from 0 for
-          * the first compute wave of the subdraw.
-          *
-          * vertex_counter contains the primitive ID of the first thread
-          * in the first wave.
-          *
-          * This is only correct with VERTEX_COUNTER_GDS_MODE == 2:
-          */
-         LLVMValueRef is_first_wave =
-            LLVMBuildICmp(builder, LLVMIntEQ, global_thread_id, vertex_counter, "");
-
-         /* Store the primitive count for ordered append, not vertex count.
-          * The idea is to avoid GDS initialization via CP DMA. The shader
-          * effectively stores the first count using "swap".
-          *
-          * if (first_wave) {
-          *    ds.ordered.swap(num_prims_accepted); // store the first primitive count
-          *    previous = 0;
-          * } else {
-          *    previous = ds.ordered.add(num_prims_accepted) // add the primitive count
-          * }
-          */
-         ac_build_ifcc(&ctx->ac, is_first_wave, 12604);
-         {
-            /* The GDS address is always 0 with ordered append. */
-            si_build_ds_ordered_op(ctx, "swap", ordered_wave_id, num_prims_accepted, 0, true, true);
-            LLVMBuildStore(builder, ctx->ac.i32_0, tmp_store);
-         }
-         ac_build_else(&ctx->ac, 12605);
-         {
-            LLVMBuildStore(builder,
-                           si_build_ds_ordered_op(ctx, "add", ordered_wave_id, num_prims_accepted,
-                                                  0, true, true),
-                           tmp_store);
-         }
-         ac_build_endif(&ctx->ac, 12604);
-
-         start = LLVMBuildLoad(builder, tmp_store, "");
-      }
+      LLVMValueRef num_indices = LLVMBuildMul(
+         builder, num_prims_accepted, LLVMConstInt(ctx->ac.i32, vertices_per_prim, 0), "");
+      vertex_counter = si_expand_32bit_pointer(ctx, vertex_counter);
+      start = LLVMBuildAtomicRMW(builder, LLVMAtomicRMWBinOpAdd, vertex_counter, num_indices,
+                                 LLVMAtomicOrderingMonotonic, false);
    }
    si_exit_thread0_section(&section, &start);
 
-   /* Write the final vertex count to memory. An EOS/EOP event could do this,
-    * but those events are super slow and should be avoided if performance
-    * is a concern. Thanks to GDS ordered append, we can emulate a CS_DONE
-    * event like this.
-    */
-   if (VERTEX_COUNTER_GDS_MODE == 2) {
-      ac_build_ifcc(&ctx->ac,
-                    LLVMBuildICmp(builder, LLVMIntEQ, global_thread_id,
-                                  ac_get_arg(&ctx->ac, param_last_wave_prim_id), ""),
-                    12606);
-      LLVMValueRef count = LLVMBuildAdd(builder, start, num_prims_accepted, "");
-      count = LLVMBuildMul(builder, count, LLVMConstInt(ctx->ac.i32, vertices_per_prim, 0), "");
-
-      /* GFX8 needs to disable caching, so that the CP can see the stored value.
-       * MTYPE=3 bypasses TC L2.
-       */
-      if (ctx->screen->info.chip_class <= GFX8) {
-         LLVMValueRef desc[] = {
-            ac_get_arg(&ctx->ac, param_vertex_count_addr),
-            LLVMConstInt(ctx->ac.i32, S_008F04_BASE_ADDRESS_HI(ctx->screen->info.address32_hi), 0),
-            LLVMConstInt(ctx->ac.i32, 4, 0),
-            LLVMConstInt(
-               ctx->ac.i32,
-               S_008F0C_DATA_FORMAT(V_008F0C_BUF_DATA_FORMAT_32) | S_008F0C_MTYPE(3 /* uncached */),
-               0),
-         };
-         LLVMValueRef rsrc = ac_build_gather_values(&ctx->ac, desc, 4);
-         ac_build_buffer_store_dword(&ctx->ac, rsrc, count, 1, ctx->ac.i32_0, ctx->ac.i32_0, 0,
-                                     ac_glc | ac_slc);
-      } else {
-         LLVMBuildStore(
-            builder, count,
-            si_expand_32bit_pointer(ctx, ac_get_arg(&ctx->ac, param_vertex_count_addr)));
-      }
-      ac_build_endif(&ctx->ac, 12606);
-   } else {
-      /* For unordered modes that increment a vertex count instead of
-       * primitive count, convert it into the primitive index.
-       */
-      start = LLVMBuildUDiv(builder, start, LLVMConstInt(ctx->ac.i32, vertices_per_prim, 0), "");
-   }
+   /* Convert the returned vertex count into the starting primitive index. */
+   start = LLVMBuildUDiv(builder, start, LLVMConstInt(ctx->ac.i32, vertices_per_prim, 0), "");
 
    /* Now we need to store the indices of accepted primitives into
     * the output index buffer.
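
The thread0 section above scalarizes the atomic: the wave-wide accepted count
is added to the vertex counter once, by thread 0, and the returned base offset
is then handed back to every lane. A CPU-side analogue (illustrative; the lane
loop stands in for wave-wide execution):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define WAVE_SIZE 64

static _Atomic uint32_t vertex_counter;

/* One atomic per wave instead of one per thread: sum the lanes' accepted
 * index counts, let "thread 0" do a single atomic_fetch_add, and broadcast
 * the returned base offset to every lane (what the thread0 section does). */
static uint32_t wave_reserve(const uint32_t lane_indices[WAVE_SIZE])
{
   uint32_t total = 0;
   for (int lane = 0; lane < WAVE_SIZE; lane++) /* wave-wide reduction */
      total += lane_indices[lane];

   return atomic_fetch_add(&vertex_counter, total); /* thread 0 only */
}

int main(void)
{
   uint32_t lanes[WAVE_SIZE];
   for (int i = 0; i < WAVE_SIZE; i++)
      lanes[i] = 3; /* every lane accepted one triangle = 3 indices */

   printf("wave 0 writes at %u\n", wave_reserve(lanes)); /* 0 */
   printf("wave 1 writes at %u\n", wave_reserve(lanes)); /* 192 */
   return 0;
}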
@@ -789,18 +490,6 @@ void si_build_prim_discard_compute_shader(struct si_shader_context *ctx)
             index[i] = LLVMBuildOr(builder, index[i], instance_id, "");
       }
 
-      if (VERTEX_COUNTER_GDS_MODE == 2) {
-         /* vertex_counter contains the first primitive ID
-          * for this dispatch. If the draw call was split into
-          * multiple subdraws, the first primitive ID is > 0
-          * for subsequent subdraws. Each subdraw uses a different
-          * portion of the output index buffer. Offset the store
-          * vindex by the first primitive ID to get the correct
-          * store address for the subdraw.
-          */
-         start = LLVMBuildAdd(builder, start, vertex_counter, "");
-      }
-
       /* Write indices for accepted primitives. */
       LLVMValueRef vindex = LLVMBuildAdd(builder, start, prim_index, "");
       LLVMValueRef vdata = ac_build_gather_values(&ctx->ac, index, 3);
@@ -818,16 +507,11 @@ void si_build_prim_discard_compute_shader(struct si_shader_context *ctx)
 
 /* Return false if the shader isn't ready. */
 static bool si_shader_select_prim_discard_cs(struct si_context *sctx,
-                                             const struct pipe_draw_info *info,
-                                             bool primitive_restart)
+                                             const struct pipe_draw_info *info)
 {
    struct si_state_rasterizer *rs = sctx->queued.named.rasterizer;
    struct si_shader_key key;
 
-   /* Primitive restart needs ordered counters. */
-   assert(!primitive_restart || VERTEX_COUNTER_GDS_MODE == 2);
-   assert(!primitive_restart || info->instance_count == 1);
-
    memset(&key, 0, sizeof(key));
    si_shader_selector_key_vs(sctx, sctx->shader.vs.cso, &key, &key.part.vs.prolog);
    assert(!key.part.vs.prolog.instance_divisor_is_fetched);
@@ -837,20 +521,8 @@ static bool si_shader_select_prim_discard_cs(struct si_context *sctx,
    key.opt.cs_prim_type = info->mode;
    key.opt.cs_indexed = info->index_size != 0;
    key.opt.cs_instancing = info->instance_count > 1;
-   key.opt.cs_primitive_restart = primitive_restart;
    key.opt.cs_provoking_vertex_first = rs->provoking_vertex_first;
 
-   /* Primitive restart with triangle strips needs to preserve primitive
-    * orientation for cases where front and back primitive orientation matters.
-    */
-   if (primitive_restart) {
-      struct si_shader_selector *ps = sctx->shader.ps.cso;
-
-      key.opt.cs_need_correct_orientation = rs->cull_front != rs->cull_back ||
-                                            ps->info.uses_frontface ||
-                                            (rs->two_side && ps->info.colors_read);
-   }
-
    if (rs->rasterizer_discard) {
       /* Just for performance testing and analysis of trivial bottlenecks.
        * This should result in a very short compute shader. */
@@ -885,30 +557,9 @@ static bool si_initialize_prim_discard_cmdbuf(struct si_context *sctx)
 
    if (!sctx->prim_discard_compute_cs.priv) {
       struct radeon_winsys *ws = sctx->ws;
-      unsigned gds_size =
-         VERTEX_COUNTER_GDS_MODE == 1 ? GDS_SIZE_UNORDERED : VERTEX_COUNTER_GDS_MODE == 2 ? 8 : 0;
-      unsigned num_oa_counters = VERTEX_COUNTER_GDS_MODE == 2 ? 2 : 0;
-
-      if (gds_size) {
-         sctx->gds = ws->buffer_create(ws, gds_size, 4, RADEON_DOMAIN_GDS,
-                                       RADEON_FLAG_DRIVER_INTERNAL);
-         if (!sctx->gds)
-            return false;
-
-         ws->cs_add_buffer(&sctx->gfx_cs, sctx->gds, RADEON_USAGE_READWRITE, 0, 0);
-      }
-      if (num_oa_counters) {
-         assert(gds_size);
-         sctx->gds_oa = ws->buffer_create(ws, num_oa_counters, 1, RADEON_DOMAIN_OA,
-                                          RADEON_FLAG_DRIVER_INTERNAL);
-         if (!sctx->gds_oa)
-            return false;
-
-         ws->cs_add_buffer(&sctx->gfx_cs, sctx->gds_oa, RADEON_USAGE_READWRITE, 0, 0);
-      }
 
       if (!ws->cs_add_parallel_compute_ib(&sctx->prim_discard_compute_cs,
-                                          &sctx->gfx_cs, num_oa_counters > 0))
+                                          &sctx->gfx_cs, false))
          return false;
    }
 
@@ -934,11 +585,10 @@ enum si_prim_discard_outcome
 si_prepare_prim_discard_or_split_draw(struct si_context *sctx, const struct pipe_draw_info *info,
                                       unsigned drawid_offset,
                                       const struct pipe_draw_start_count_bias *draws,
-                                      unsigned num_draws, bool primitive_restart,
-                                      unsigned total_count)
+                                      unsigned num_draws, unsigned total_count)
 {
    /* If the compute shader compilation isn't finished, this returns false. */
-   if (!si_shader_select_prim_discard_cs(sctx, info, primitive_restart))
+   if (!si_shader_select_prim_discard_cs(sctx, info))
       return SI_PRIM_DISCARD_DISABLED;
 
    if (!si_initialize_prim_discard_cmdbuf(sctx))
@@ -1006,8 +656,6 @@ si_prepare_prim_discard_or_split_draw(struct si_context *sctx, const struct pipe
       struct pipe_draw_start_count_bias split_draw_range = draws[0];
       unsigned base_start = split_draw_range.start;
 
-      split_draw.primitive_restart = primitive_restart;
-
       if (prim == PIPE_PRIM_TRIANGLES) {
          assert(vert_count_per_subdraw < count);
 
@@ -1027,12 +675,7 @@ si_prepare_prim_discard_or_split_draw(struct si_context *sctx, const struct pipe
             split_draw_range.count = MIN2(count - start, vert_count_per_subdraw + 2);
 
             sctx->b.draw_vbo(&sctx->b, &split_draw, drawid_offset, NULL, &split_draw_range, 1);
-
-            if (start == 0 && primitive_restart &&
-                sctx->cs_prim_discard_state.current->key.opt.cs_need_correct_orientation)
-               sctx->preserve_prim_restart_gds_at_flush = true;
          }
-         sctx->preserve_prim_restart_gds_at_flush = false;
       }
 
       return SI_PRIM_DISCARD_DRAW_SPLIT;
@@ -1053,7 +696,6 @@ si_prepare_prim_discard_or_split_draw(struct si_context *sctx, const struct pipe
                           num_subdraws * 8; /* use REWIND(2) + DRAW(6) */
 
    if (ring_full ||
-       (VERTEX_COUNTER_GDS_MODE == 1 && sctx->compute_gds_offset + 8 > GDS_SIZE_UNORDERED) ||
        !sctx->ws->cs_check_space(gfx_cs, need_gfx_dw, false)) {
       /* If the current IB is empty but the size is too small, add a NOP
        * packet to force a flush and get a bigger IB.
@@ -1083,7 +725,7 @@ void si_compute_signal_gfx(struct si_context *sctx)
    unsigned writeback_L2_flags = 0;
 
    /* GFX8 needs to flush L2 for CP to see the updated vertex count. */
-   if (sctx->chip_class == GFX8 && VERTEX_COUNTER_GDS_MODE == 0)
+   if (sctx->chip_class == GFX8)
       writeback_L2_flags = EVENT_TC_WB_ACTION_ENA | EVENT_TC_NC_ACTION_ENA;
 
    if (!sctx->compute_num_prims_in_batch)
@@ -1138,12 +780,10 @@ void si_dispatch_prim_discard_cs_and_draw(struct si_context *sctx,
 
    unsigned out_indexbuf_offset;
    uint64_t output_indexbuf_size = num_prims * vertices_per_prim * 4;
-   bool first_dispatch = !sctx->prim_discard_compute_ib_initialized;
 
    /* Initialize the compute IB if it's empty. */
    if (!sctx->prim_discard_compute_ib_initialized) {
       /* 1) State initialization. */
-      sctx->compute_gds_offset = 0;
       sctx->compute_ib_last_shader = NULL;
 
       if (sctx->last_ib_barrier_fence) {
@@ -1179,12 +819,6 @@ void si_dispatch_prim_discard_cs_and_draw(struct si_context *sctx,
                                  S_0085F0_SH_KCACHE_ACTION_ENA(1));
       }
 
-      /* Restore the GDS prim restart counter if needed. */
-      if (sctx->preserve_prim_restart_gds_at_flush) {
-         si_cp_copy_data(sctx, cs, COPY_DATA_GDS, NULL, 4, COPY_DATA_SRC_MEM,
-                         sctx->wait_mem_scratch, 4);
-      }
-
       si_emit_initial_compute_regs(sctx, cs);
 
       radeon_begin(cs);
@@ -1200,14 +834,6 @@ void si_dispatch_prim_discard_cs_and_draw(struct si_context *sctx,
       radeon_set_sh_reg_seq(cs, R_00B814_COMPUTE_START_Y, 2);
       radeon_emit(cs, 0);
       radeon_emit(cs, 0);
-
-      /* Disable ordered alloc for OA resources. */
-      for (unsigned i = 0; i < 2; i++) {
-         radeon_set_uconfig_reg_seq(cs, R_031074_GDS_OA_CNTL, 3, false);
-         radeon_emit(cs, S_031074_INDEX(i));
-         radeon_emit(cs, 0);
-         radeon_emit(cs, S_03107C_ENABLE(0));
-      }
       radeon_end();
 
       if (sctx->last_ib_barrier_buf) {
@@ -1292,8 +918,8 @@ void si_dispatch_prim_discard_cs_and_draw(struct si_context *sctx,
    desc[11] = fui(cull_info.translate[1]);
 
    /* Set user data SGPRs. */
-   /* This can't be greater than 14 if we want the fastest launch rate. */
-   unsigned user_sgprs = 13;
+   /* This can't be >= 16 if we want the fastest launch rate. */
+   unsigned user_sgprs = 10;
 
    uint64_t index_buffers_va = indexbuf_desc->gpu_address + indexbuf_desc_offset;
    unsigned vs_const_desc = si_const_and_shader_buffer_descriptors_idx(PIPE_SHADER_VERTEX);
@@ -1303,7 +929,6 @@ void si_dispatch_prim_discard_cs_and_draw(struct si_context *sctx,
    uint64_t vb_desc_va = sctx->vb_descriptors_buffer
                             ? sctx->vb_descriptors_buffer->gpu_address + sctx->vb_descriptors_offset
                             : 0;
-   unsigned gds_offset, gds_size;
    struct si_fast_udiv_info32 num_prims_udiv = {};
 
    if (info->instance_count > 1)
@@ -1315,30 +940,6 @@ void si_dispatch_prim_discard_cs_and_draw(struct si_context *sctx,
 
    si_resource_reference(&indexbuf_desc, NULL);
 
-   bool primitive_restart = sctx->cs_prim_discard_state.current->key.opt.cs_primitive_restart;
-
-   if (VERTEX_COUNTER_GDS_MODE == 1) {
-      gds_offset = sctx->compute_gds_offset;
-      gds_size = primitive_restart ? 8 : 4;
-      sctx->compute_gds_offset += gds_size;
-
-      /* Reset the counters in GDS for the first dispatch using WRITE_DATA.
-       * The remainder of the GDS will be cleared after the dispatch packet
-       * in parallel with compute shaders.
-       */
-      if (first_dispatch) {
-         radeon_begin(cs);
-         radeon_emit(cs, PKT3(PKT3_WRITE_DATA, 2 + gds_size / 4, 0));
-         radeon_emit(cs, S_370_DST_SEL(V_370_GDS) | S_370_WR_CONFIRM(1));
-         radeon_emit(cs, gds_offset);
-         radeon_emit(cs, 0);
-         radeon_emit(cs, 0); /* value to write */
-         if (gds_size == 8)
-            radeon_emit(cs, 0);
-         radeon_end();
-      }
-   }
-
    /* Set shader registers. */
    struct si_shader *shader = sctx->cs_prim_discard_state.current;
 
@@ -1364,7 +965,6 @@ void si_dispatch_prim_discard_cs_and_draw(struct si_context *sctx,
                 S_00B848_WGP_MODE(sctx->chip_class >= GFX10));
       radeon_emit(cs, S_00B84C_SCRATCH_EN(0 /* no scratch */) | S_00B84C_USER_SGPR(user_sgprs) |
                          S_00B84C_TGID_X_EN(1 /* only blockID.x is used */) |
-                         S_00B84C_TG_SIZE_EN(VERTEX_COUNTER_GDS_MODE == 2 /* need the wave ID */) |
                          S_00B84C_TIDIG_COMP_CNT(0 /* only threadID.x is used */) |
                          S_00B84C_LDS_SIZE(shader->config.lds_size));
 
@@ -1423,22 +1023,8 @@ void si_dispatch_prim_discard_cs_and_draw(struct si_context *sctx,
 
       /* Continue with the compute IB. */
       if (start_prim == 0) {
-         uint32_t gds_prim_restart_continue_bit = 0;
-
-         if (sctx->preserve_prim_restart_gds_at_flush) {
-            assert(primitive_restart && info->mode == PIPE_PRIM_TRIANGLE_STRIP);
-            assert(start_prim < 1 << 31);
-            gds_prim_restart_continue_bit = 1 << 31;
-         }
-
          radeon_set_sh_reg_seq(cs, R_00B900_COMPUTE_USER_DATA_0, user_sgprs);
          radeon_emit(cs, index_buffers_va);
-         radeon_emit(cs, VERTEX_COUNTER_GDS_MODE == 0
-                            ? count_va
-                            : VERTEX_COUNTER_GDS_MODE == 1
-                                 ? gds_offset
-                                 : start_prim | gds_prim_restart_continue_bit);
-         radeon_emit(cs, start_prim + num_subdraw_prims - 1);
          radeon_emit(cs, count_va);
          radeon_emit(cs, vb_desc_va);
          radeon_emit(cs, vs_const_desc_va);
@@ -1447,16 +1033,14 @@ void si_dispatch_prim_discard_cs_and_draw(struct si_context *sctx,
          radeon_emit(cs, info->start_instance);
          radeon_emit(cs, num_prims_udiv.multiplier);
          radeon_emit(cs, num_prims_udiv.post_shift | (num_prims_per_instance << 5));
-         radeon_emit(cs, info->restart_index);
          /* small-prim culling precision (same as rasterizer precision = QUANT_MODE) */
          radeon_emit(cs, fui(cull_info.small_prim_precision));
       } else {
-         assert(VERTEX_COUNTER_GDS_MODE == 2);
+#if 0 /* TODO: draw splitting could be enabled */
          /* Only update the SGPRs that changed. */
-         radeon_set_sh_reg_seq(cs, R_00B904_COMPUTE_USER_DATA_1, 3);
-         radeon_emit(cs, start_prim);
-         radeon_emit(cs, start_prim + num_subdraw_prims - 1);
+         radeon_set_sh_reg_seq(cs, R_00B904_COMPUTE_USER_DATA_1, 1);
          radeon_emit(cs, count_va);
+#endif
       }
 
       /* Set grid dimensions. */
@@ -1474,33 +1058,9 @@ void si_dispatch_prim_discard_cs_and_draw(struct si_context *sctx,
       radeon_emit(cs, 1);
       radeon_emit(cs, 1);
       radeon_emit(cs, S_00B800_COMPUTE_SHADER_EN(1) | S_00B800_PARTIAL_TG_EN(!!partial_block_size) |
-                         S_00B800_ORDERED_APPEND_ENBL(VERTEX_COUNTER_GDS_MODE == 2) |
-                         S_00B800_ORDER_MODE(0 /* launch in order */));
+                      S_00B800_ORDER_MODE(0 /* launch in order */));
       radeon_end();
 
-      /* This is only for unordered append. Ordered append writes this from
-       * the shader.
-       *
-       * Note that EOP and EOS events are super slow, so emulating the event
-       * in a shader is an important optimization.
-       */
-      if (VERTEX_COUNTER_GDS_MODE == 1) {
-         si_cp_release_mem(sctx, cs, V_028A90_CS_DONE, 0,
-                           sctx->chip_class <= GFX8 ? EOP_DST_SEL_MEM : EOP_DST_SEL_TC_L2,
-                           EOP_INT_SEL_NONE, EOP_DATA_SEL_GDS, NULL,
-                           count_va | ((uint64_t)sctx->screen->info.address32_hi << 32),
-                           EOP_DATA_GDS(gds_offset / 4, 1), SI_NOT_QUERY);
-
-         /* Now that compute shaders are running, clear the remainder of GDS. */
-         if (first_dispatch) {
-            unsigned offset = gds_offset + gds_size;
-            si_cp_dma_clear_buffer(
-               sctx, cs, NULL, offset, GDS_SIZE_UNORDERED - offset, 0,
-               SI_OP_CPDMA_SKIP_CHECK_CS_SPACE, SI_COHERENCY_NONE, L2_BYPASS);
-         }
-      }
-      first_dispatch = false;
-
       assert(cs->current.cdw <= cs->current.max_dw);
       assert(gfx_cs->current.cdw <= gfx_cs->current.max_dw);
    }
src/gallium/drivers/radeonsi/si_gfx_cs.c
index baf4abf..6cb7c93 100644
@@ -115,24 +115,9 @@ void si_flush_gfx_cs(struct si_context *ctx, unsigned flags, struct pipe_fence_h
 
    ctx->gfx_flush_in_progress = true;
 
-   if (radeon_emitted(&ctx->prim_discard_compute_cs, 0)) {
-      struct radeon_cmdbuf *compute_cs = &ctx->prim_discard_compute_cs;
+   if (radeon_emitted(&ctx->prim_discard_compute_cs, 0))
       si_compute_signal_gfx(ctx);
 
-      /* Make sure compute shaders are idle before leaving the IB, so that
-       * the next IB doesn't overwrite GDS that might be in use. */
-      radeon_begin(compute_cs);
-      radeon_emit(compute_cs, PKT3(PKT3_EVENT_WRITE, 0, 0));
-      radeon_emit(compute_cs, EVENT_TYPE(V_028A90_CS_PARTIAL_FLUSH) | EVENT_INDEX(4));
-      radeon_end();
-
-      /* Save the GDS prim restart counter if needed. */
-      if (ctx->preserve_prim_restart_gds_at_flush) {
-         si_cp_copy_data(ctx, compute_cs, COPY_DATA_DST_MEM, ctx->wait_mem_scratch, 4,
-                         COPY_DATA_GDS, NULL, 4);
-      }
-   }
-
    if (ctx->has_graphics) {
       if (!list_is_empty(&ctx->active_queries))
          si_suspend_queries(ctx);
src/gallium/drivers/radeonsi/si_pipe.c
index 3003055..e34abd6 100644
@@ -488,7 +488,7 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen, unsign
    /* Initialize private allocators. */
    u_suballocator_init(&sctx->allocator_zeroed_memory, &sctx->b, 128 * 1024, 0,
                        PIPE_USAGE_DEFAULT,
-                       SI_RESOURCE_FLAG_UNMAPPABLE | SI_RESOURCE_FLAG_CLEAR, false);
+                       SI_RESOURCE_FLAG_CLEAR | SI_RESOURCE_FLAG_32BIT, false);
 
    sctx->cached_gtt_allocator = u_upload_create(&sctx->b, 16 * 1024, 0, PIPE_USAGE_STAGING, 0);
    if (!sctx->cached_gtt_allocator)
src/gallium/drivers/radeonsi/si_pipe.h
index 127b694..b34427c 100644
@@ -989,16 +989,15 @@ struct si_context {
    uint32_t vram_kb;
    uint32_t gtt_kb;
 
-   /* Compute-based primitive discard. */
-   unsigned prim_discard_vertex_count_threshold;
+   /* NGG streamout. */
    struct pb_buffer *gds;
    struct pb_buffer *gds_oa;
+   /* Compute-based primitive discard. */
+   unsigned prim_discard_vertex_count_threshold;
    struct radeon_cmdbuf prim_discard_compute_cs;
-   unsigned compute_gds_offset;
    struct si_shader *compute_ib_last_shader;
    uint32_t compute_rewind_va;
    unsigned compute_num_prims_in_batch;
-   bool preserve_prim_restart_gds_at_flush;
    /* index_ring is divided into 2 halves for doublebuffering. */
    struct si_resource *index_ring;
    unsigned index_ring_base;        /* offset of a per-IB portion */
@@ -1514,8 +1513,7 @@ enum si_prim_discard_outcome
 si_prepare_prim_discard_or_split_draw(struct si_context *sctx, const struct pipe_draw_info *info,
                                       unsigned drawid_offset,
                                       const struct pipe_draw_start_count_bias *draws,
-                                      unsigned num_draws, bool primitive_restart,
-                                      unsigned total_count);
+                                      unsigned num_draws, unsigned total_count);
 void si_compute_signal_gfx(struct si_context *sctx);
 void si_dispatch_prim_discard_cs_and_draw(struct si_context *sctx,
                                           const struct pipe_draw_info *info,
src/gallium/drivers/radeonsi/si_shader.c
index d4482b7..0cdd3a9 100644
@@ -1187,9 +1187,7 @@ static void si_dump_shader_key(const struct si_shader *shader, FILE *f)
       fprintf(f, "  opt.cs_prim_type = %s\n", tgsi_primitive_names[key->opt.cs_prim_type]);
       fprintf(f, "  opt.cs_indexed = %u\n", key->opt.cs_indexed);
       fprintf(f, "  opt.cs_instancing = %u\n", key->opt.cs_instancing);
-      fprintf(f, "  opt.cs_primitive_restart = %u\n", key->opt.cs_primitive_restart);
       fprintf(f, "  opt.cs_provoking_vertex_first = %u\n", key->opt.cs_provoking_vertex_first);
-      fprintf(f, "  opt.cs_need_correct_orientation = %u\n", key->opt.cs_need_correct_orientation);
       fprintf(f, "  opt.cs_cull_front = %u\n", key->opt.cs_cull_front);
       fprintf(f, "  opt.cs_cull_back = %u\n", key->opt.cs_cull_back);
       break;
src/gallium/drivers/radeonsi/si_shader.h
index f6681f4..b3847db 100644
@@ -689,9 +689,7 @@ struct si_shader_key {
       unsigned cs_prim_type : 4;
       unsigned cs_indexed : 1;
       unsigned cs_instancing : 1;
-      unsigned cs_primitive_restart : 1;
       unsigned cs_provoking_vertex_first : 1;
-      unsigned cs_need_correct_orientation : 1;
       unsigned cs_cull_front : 1;
       unsigned cs_cull_back : 1;
 
src/gallium/drivers/radeonsi/si_state_draw.cpp
index ce72a53..1a525f8 100644
@@ -1920,26 +1920,21 @@ static void si_draw_vbo(struct pipe_context *ctx,
    unsigned original_index_size = index_size;
 
    /* Determine if we can use the primitive discard compute shader. */
+   /* TODO: this requires that primitives can be drawn out of order, so check
+    * depth/stencil/blend states. */
    if (ALLOW_PRIM_DISCARD_CS &&
        (total_direct_count > sctx->prim_discard_vertex_count_threshold
            ? (sctx->compute_num_verts_rejected += total_direct_count, true)
            : /* Add, then return true. */
            (sctx->compute_num_verts_ineligible += total_direct_count,
             false)) && /* Add, then return false. */
-       (primitive_restart ?
-                          /* Supported prim types with primitive restart: */
-           (prim == PIPE_PRIM_TRIANGLE_STRIP || pd_msg("bad prim type with primitive restart")) &&
-              /* Disallow instancing with primitive restart: */
-              (instance_count == 1 || pd_msg("instance_count > 1 with primitive restart"))
-                          :
-                          /* Supported prim types without primitive restart + allow instancing: */
-           (1 << prim) & ((1 << PIPE_PRIM_TRIANGLES) | (1 << PIPE_PRIM_TRIANGLE_STRIP)) &&
-              /* Instancing is limited to 16-bit indices, because InstanceID is packed into
-                 VertexID. */
-              /* Instanced index_size == 0 requires that start + count < USHRT_MAX, so just reject it. */
-              (instance_count == 1 ||
-               (instance_count <= USHRT_MAX && index_size && index_size <= 2) ||
-               pd_msg("instance_count too large or index_size == 4 or DrawArraysInstanced"))) &&
+       (!primitive_restart || pd_msg("primitive restart")) &&
+       /* Supported prim types. */
+       (1 << prim) & ((1 << PIPE_PRIM_TRIANGLES) | (1 << PIPE_PRIM_TRIANGLE_STRIP)) &&
+       /* Instancing is limited to 16-bit indices, because InstanceID is packed into VertexID. */
+       /* Instanced index_size == 0 requires that start + count < USHRT_MAX, so just reject it. */
+       (instance_count == 1 ||
+        (instance_count <= USHRT_MAX && index_size && index_size <= 2) ||
+        pd_msg("instance_count too large or index_size == 4 or DrawArraysInstanced")) &&
        ((drawid_offset == 0 && (num_draws == 1 || !info->increment_draw_id)) ||
         !sctx->shader.vs.cso->info.uses_drawid || pd_msg("draw_id > 0")) &&
        (!sctx->render_cond || pd_msg("render condition")) &&
@@ -1966,7 +1961,7 @@ static void si_draw_vbo(struct pipe_context *ctx,
        (si_all_vs_resources_read_only(sctx, index_size ? indexbuf : NULL) ||
         pd_msg("write reference"))) {
       switch (si_prepare_prim_discard_or_split_draw(sctx, info, drawid_offset, draws, num_draws,
-                                                    primitive_restart, total_direct_count)) {
+                                                    total_direct_count)) {
       case SI_PRIM_DISCARD_ENABLED:
          original_index_size = index_size;
          prim_discard_cs_instancing = instance_count > 1;
@@ -1976,7 +1971,6 @@ static void si_draw_vbo(struct pipe_context *ctx,
          prim = PIPE_PRIM_TRIANGLES;
          index_size = 4;
          instance_count = 1;
-         primitive_restart = false;
          sctx->compute_num_verts_rejected -= total_direct_count;
          sctx->compute_num_verts_accepted += total_direct_count;
          break;