This was 4th on the list of things to try in 3ee2e84c608 ("nir:
Rematerialize compare instructions"). This is implemented as a separate
subpass that tries to find ALU instructions (with restrictions) that are
only used by comparisons with zero that are in turn only used as
conditions for bcsel or if-statements.
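
As a source-level analogy only (this is plain C, not NIR, and every name
in it is hypothetical), the shape being targeted looks roughly like the
sketch below: an ALU result with one constant source whose only use is a
compare with zero in a different block, with the compare feeding a select.
After the pass, the ALU operation sits next to its compare.

```c
#include <assert.h>
#include <stdbool.h>

/* Before: the add lives in the entry block, but its only use is a
 * compare with zero inside the if-body.
 */
static float
select_before(float a, bool cond, float p, float q)
{
   const float t = a + 1.0f;     /* ALU op with one constant source */
   float x = q;

   if (cond)
      x = (t < 0.0f) ? p : q;    /* compare with zero feeding a select */

   return x;
}

/* After: the add is rematerialized immediately before the compare, so a
 * backend could fold the compare into the add's conditional modifier.
 */
static float
select_after(float a, bool cond, float p, float q)
{
   float x = q;

   if (cond) {
      const float t = a + 1.0f;  /* copy placed next to its use */
      x = (t < 0.0f) ? p : q;
   }

   return x;
}

/* Both forms must compute the same value. */
static void
check_equivalent(float a, bool cond, float p, float q)
{
   assert(select_before(a, cond, p, q) == select_after(a, cond, p, q));
}
```

The two functions are equivalent; only the placement of the add differs,
which is what lets the compare and the add end up adjacent.
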
There are two restrictions implemented. One of the sources must be a
constant. This is done in an attempt to prevent increasing register
pressure. Additionally, the opcode of the instruction must be one that
has a high probability of getting a conditional modifier on Intel
GPUs. Not all instructions can have a conditional modifier (e.g., min
and max), so I don't think there is any benefit to moving these
instructions.
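
The two restrictions can be modelled in isolation as a minimal sketch.
The toy_* types and functions below are hypothetical stand-ins, not the
NIR API, and the opcode list is abbreviated for illustration; the real
list is in the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for NIR opcodes and ALU instrs. */
enum toy_op { TOY_FADD, TOY_IAND, TOY_FNEG, TOY_FMIN, TOY_FMAX };

struct toy_alu {
   enum toy_op op;
   unsigned num_inputs;       /* 1 or 2 in this model */
   bool src_is_const[2];
};

/* Restriction 2: the opcode must be likely to take a conditional
 * modifier.  min and max are deliberately excluded.
 */
static bool
toy_op_can_take_cmod(enum toy_op op)
{
   switch (op) {
   case TOY_FADD:
   case TOY_IAND:
   case TOY_FNEG:
      return true;
   default:
      return false;
   }
}

static bool
toy_is_remat_candidate(const struct toy_alu *alu)
{
   assert(alu->num_inputs >= 1 && alu->num_inputs <= 2);

   if (!toy_op_can_take_cmod(alu->op))
      return false;

   /* Restriction 1: a two-source op needs at least one constant source
    * so that moving it does not extend two live ranges.
    */
   if (alu->num_inputs == 2 &&
       !alu->src_is_const[0] && !alu->src_is_const[1])
      return false;

   return true;
}
```

In this model, as in the diff, the constant-source rule is applied only
to two-source opcodes; single-source opcodes such as fneg pass on the
opcode check alone.
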
v2: Rebase on many, many recent NIR infrastructure changes.
v3: Make data in commit message more clear. Suggested by Matt. Rebase on
b5d6b7c402a ("nir: Drop most uses if nir_instr_rewrite_src()").
All of the affected shaders on ILK and G45 are in CS:GO. There is some
brief analysis of the changes in the MR.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Shader-db results:
DG2
total instructions in shared programs: 22824637 -> 22824258 (<.01%)
instructions in affected programs: 365742 -> 365363 (-0.10%)
helped: 190 / HURT: 97
total cycles in shared programs: 832186193 -> 832157290 (<.01%)
cycles in affected programs: 41245259 -> 41216356 (-0.07%)
helped: 208 / HURT: 117
total spills in shared programs: 4072 -> 4060 (-0.29%)
spills in affected programs: 366 -> 354 (-3.28%)
helped: 4 / HURT: 2
total fills in shared programs: 3601 -> 3607 (0.17%)
fills in affected programs: 708 -> 714 (0.85%)
helped: 4 / HURT: 2
LOST: 0
GAINED: 1
Tiger Lake and Ice Lake had similar results. (Ice Lake shown)
total instructions in shared programs: 20320934 -> 20320689 (<.01%)
instructions in affected programs: 236592 -> 236347 (-0.10%)
helped: 176 / HURT: 29
total cycles in shared programs: 849846341 -> 849843856 (<.01%)
cycles in affected programs: 41277336 -> 41274851 (<.01%)
helped: 195 / HURT: 110
LOST: 0
GAINED: 1
Skylake
total instructions in shared programs: 18550811 -> 18550470 (<.01%)
instructions in affected programs: 233908 -> 233567 (-0.15%)
helped: 182 / HURT: 25
total cycles in shared programs: 835910983 -> 835889167 (<.01%)
cycles in affected programs: 38764359 -> 38742543 (-0.06%)
helped: 207 / HURT: 94
total spills in shared programs: 4522 -> 4506 (-0.35%)
spills in affected programs: 324 -> 308 (-4.94%)
helped: 4 / HURT: 0
total fills in shared programs: 5296 -> 5280 (-0.30%)
fills in affected programs: 324 -> 308 (-4.94%)
helped: 4 / HURT: 0
LOST: 0
GAINED: 1
Broadwell
total instructions in shared programs: 18199130 -> 18197920 (<.01%)
instructions in affected programs: 214664 -> 213454 (-0.56%)
helped: 191 / HURT: 0
total cycles in shared programs: 935131908 -> 934870248 (-0.03%)
cycles in affected programs: 75770568 -> 75508908 (-0.35%)
helped: 203 / HURT: 84
total spills in shared programs: 13896 -> 13734 (-1.17%)
spills in affected programs: 162 -> 0
helped: 3 / HURT: 0
total fills in shared programs: 16989 -> 16761 (-1.34%)
fills in affected programs: 228 -> 0
helped: 3 / HURT: 0
Haswell
total instructions in shared programs: 16969502 -> 16969085 (<.01%)
instructions in affected programs: 185498 -> 185081 (-0.22%)
helped: 121 / HURT: 1
total cycles in shared programs: 925290863 -> 924806827 (-0.05%)
cycles in affected programs: 30200863 -> 29716827 (-1.60%)
helped: 100 / HURT: 85
total spills in shared programs: 13565 -> 13533 (-0.24%)
spills in affected programs: 736 -> 704 (-4.35%)
helped: 8 / HURT: 0
total fills in shared programs: 15468 -> 15436 (-0.21%)
fills in affected programs: 740 -> 708 (-4.32%)
helped: 8 / HURT: 0
LOST: 0
GAINED: 1
Ivy Bridge
total instructions in shared programs: 15839127 -> 15838947 (<.01%)
instructions in affected programs: 77776 -> 77596 (-0.23%)
helped: 58 / HURT: 0
total cycles in shared programs: 459852774 -> 459739770 (-0.02%)
cycles in affected programs: 11970210 -> 11857206 (-0.94%)
helped: 79 / HURT: 53
Sandy Bridge
total instructions in shared programs: 14106847 -> 14106831 (<.01%)
instructions in affected programs: 1611 -> 1595 (-0.99%)
helped: 10 / HURT: 0
total cycles in shared programs: 775004024 -> 775007516 (<.01%)
cycles in affected programs: 2530686 -> 2534178 (0.14%)
helped: 55 / HURT: 48
Iron Lake
total cycles in shared programs: 257753356 -> 257754900 (<.01%)
cycles in affected programs: 2977374 -> 2978918 (0.05%)
helped: 12 / HURT: 106
GM45
total cycles in shared programs: 169711382 -> 169712816 (<.01%)
cycles in affected programs: 2402070 -> 2403504 (0.06%)
helped: 12 / HURT: 57
Fossil-db results:
All Intel platforms had similar results. (DG2 shown)
Totals:
Instrs: 193884596 -> 193465896 (-0.22%); split: -0.25%, +0.03%
Cycles: 14050193354 -> 14048194826 (-0.01%); split: -0.34%, +0.33%
Spill count: 114944 -> 100449 (-12.61%); split: -13.59%, +0.98%
Fill count: 201525 -> 179534 (-10.91%); split: -11.22%, +0.31%
Scratch Memory Size: 10028032 -> 8468480 (-15.55%)
Totals from 16912 (2.59% of 653124) affected shaders:
Instrs: 34173709 -> 33755009 (-1.23%); split: -1.41%, +0.19%
Cycles: 2945969110 -> 2943970582 (-0.07%); split: -1.62%, +1.55%
Spill count: 97753 -> 83258 (-14.83%); split: -15.98%, +1.15%
Fill count: 176355 -> 154364 (-12.47%); split: -12.82%, +0.35%
Scratch Memory Size: 8619008 -> 7059456 (-18.09%)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20176>
}
}
+static inline bool
+is_zero(const nir_alu_instr *instr, unsigned src, unsigned num_components,
+ const uint8_t *swizzle)
+{
+ /* only constant srcs: */
+ if (!nir_src_is_const(instr->src[src].src))
+ return false;
+
+ for (unsigned i = 0; i < num_components; i++) {
+ nir_alu_type type = nir_op_infos[instr->op].input_types[src];
+ switch (nir_alu_type_get_base_type(type)) {
+ case nir_type_int:
+ case nir_type_uint: {
+ if (nir_src_comp_as_int(instr->src[src].src, swizzle[i]) != 0)
+ return false;
+ break;
+ }
+ case nir_type_float: {
+ if (nir_src_comp_as_float(instr->src[src].src, swizzle[i]) != 0)
+ return false;
+ break;
+ }
+ default:
+ return false;
+ }
+ }
+
+ return true;
+}
+
static bool
all_uses_are_bcsel(const nir_alu_instr *instr)
{
}
static bool
+all_uses_are_compare_with_zero(const nir_alu_instr *instr)
+{
+ nir_foreach_use(use, &instr->def) {
+ if (use->parent_instr->type != nir_instr_type_alu)
+ return false;
+
+ nir_alu_instr *const alu = nir_instr_as_alu(use->parent_instr);
+ if (!is_two_src_comparison(alu))
+ return false;
+
+ if (!is_zero(alu, 0, 1, alu->src[0].swizzle) &&
+ !is_zero(alu, 1, 1, alu->src[1].swizzle))
+ return false;
+
+ if (!all_uses_are_bcsel(alu))
+ return false;
+ }
+
+ return true;
+}
+
+static bool
nir_opt_rematerialize_compares_impl(nir_shader *shader, nir_function_impl *impl)
{
bool progress = false;
return progress;
}
+static bool
+nir_opt_rematerialize_alu_impl(nir_shader *shader, nir_function_impl *impl)
+{
+ bool progress = false;
+
+ nir_foreach_block(block, impl) {
+ nir_foreach_instr(instr, block) {
+ if (instr->type != nir_instr_type_alu)
+ continue;
+
+ nir_alu_instr *const alu = nir_instr_as_alu(instr);
+
+ /* This list only includes ALU ops that are likely to be able to have
+ * cmod propagation on Intel GPUs.
+ */
+ switch (alu->op) {
+ case nir_op_ineg:
+ case nir_op_iabs:
+ case nir_op_fneg:
+ case nir_op_fabs:
+ case nir_op_fadd:
+ case nir_op_iadd:
+ case nir_op_iadd_sat:
+ case nir_op_uadd_sat:
+ case nir_op_isub_sat:
+ case nir_op_usub_sat:
+ case nir_op_irhadd:
+ case nir_op_urhadd:
+ case nir_op_fmul:
+ case nir_op_inot:
+ case nir_op_iand:
+ case nir_op_ior:
+ case nir_op_ixor:
+ case nir_op_ffloor:
+ case nir_op_ffract:
+ case nir_op_uclz:
+ case nir_op_ishl:
+ case nir_op_ishr:
+ case nir_op_ushr:
+ case nir_op_urol:
+ case nir_op_uror:
+ break; /* ... from switch. */
+ default:
+ continue; /* ... with loop. */
+ }
+
+ /* To help prevent increasing live ranges, require that one of the
+ * sources be a constant.
+ */
+ if (nir_op_infos[alu->op].num_inputs == 2 &&
+ !nir_src_is_const(alu->src[0].src) &&
+ !nir_src_is_const(alu->src[1].src))
+ continue;
+
+ if (!all_uses_are_compare_with_zero(alu))
+ continue;
+
+ /* At this point it is known that the alu is only used by a
+ * comparison with zero that is used by nir_op_bcsel and possibly by
+ * if-statements (though the latter has not been explicitly checked).
+ *
+ * Iterate through each use of the ALU. For every use that is in a
+ * different block, emit a copy of the ALU. Care must be taken here.
+ * The original instruction must be duplicated only once in each
+ * block because CSE cannot be run after this pass.
+ */
+ nir_foreach_use_safe(use, &alu->def) {
+ nir_instr *const use_instr = use->parent_instr;
+
+ /* If the use is in the same block as the def, don't
+ * rematerialize.
+ */
+ if (use_instr->block == alu->instr.block)
+ continue;
+
+ nir_alu_instr *clone = nir_alu_instr_clone(shader, alu);
+
+ nir_instr_insert_before(use_instr, &clone->instr);
+
+ nir_alu_instr *const use_alu = nir_instr_as_alu(use_instr);
+ for (unsigned i = 0; i < nir_op_infos[use_alu->op].num_inputs; i++) {
+ if (use_alu->src[i].src.ssa == &alu->def) {
+ nir_src_rewrite(&use_alu->src[i].src, &clone->def);
+ progress = true;
+ }
+ }
+ }
+ }
+ }
+
+ if (progress) {
+ nir_metadata_preserve(impl, nir_metadata_block_index |
+ nir_metadata_dominance);
+ } else {
+ nir_metadata_preserve(impl, nir_metadata_all);
+ }
+
+ return progress;
+}
+
bool
nir_opt_rematerialize_compares(nir_shader *shader)
{
nir_foreach_function_impl(impl, shader) {
progress = nir_opt_rematerialize_compares_impl(shader, impl) || progress;
+
+ progress = nir_opt_rematerialize_alu_impl(shader, impl) || progress;
}
return progress;