From 22d98280dd8ee70064899eefb973a1c020605874 Mon Sep 17 00:00:00 2001
From: JackAKirk
Date: Fri, 20 Jan 2023 16:51:11 +0000
Subject: [PATCH] [NVPTX] Increase inline threshold multiplier to 11 in nvptx
 backend.

I used https://github.com/zjin-lcf/HeCBench (with nvcc usage swapped to
clang++), an adaptation of the classic Rodinia benchmarks aimed at the CUDA
and SYCL programming models, to compare different values of the multiplier
using both the clang++ CUDA and clang++ SYCL NVPTX backends. I find that the
current value is too low in both cases.

Qualitatively (and in most cases with very close quantitative agreement), the
change in code execution time for multiplier values ranging from 5 to 1000
matches in both variations (CUDA clang++ vs. SYCL with the CUDA backend,
using the intel/llvm clang++ compiler) of the HeCBench samples. A value of 11
is optimal for clang++ CUDA in all cases I have investigated, and I have not
found a single case where performance regresses when the value changes from 5
to 11. For one sample the SYCL CUDA backend preferred a higher value, but we
are happy to prioritize clang++ CUDA, and we find that this value is close to
ideal for both cases anyway.

It would be good to do some further investigation using clang++ OpenMP CUDA
offload. However, since I do not know of an appropriate set of benchmarks for
that case, and since we now receive complaints about register spills related
to insufficient inlining on a weekly basis, we have decided to propose this
change and potentially seek more input from someone with more expertise in
the OpenMP case.

Incidentally, this value coincides with the value used for the amd-gcn
backend. We have also been able to use the AMD backend of the intel/llvm
"dpc++" compiler to compare the inlining behavior of identical code when
targeting AMD (compared to NVPTX).
Unsurprisingly, the AMD backend with a multiplier value of 11 inlined more
effectively than the NVPTX backend did with the value of 5. When the two
backends use the same multiplier value, their inlining behaviors appear to
align closely.

This also considerably improves the performance of at least one of the most
popular HPC applications: NWCHEMX.

Signed-off-by: JackAKirk
Reviewed by: tra

Differential Revision: https://reviews.llvm.org/D142232
---
 llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
index 0b1195e..b7bc0f2 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
@@ -90,9 +90,9 @@ public:
     return true;
   }
 
-  // Increase the inlining cost threshold by a factor of 5, reflecting that
+  // Increase the inlining cost threshold by a factor of 11, reflecting that
   // calls are particularly expensive in NVPTX.
-  unsigned getInliningThresholdMultiplier() { return 5; }
+  unsigned getInliningThresholdMultiplier() { return 11; }
 
   InstructionCost getArithmeticInstrCost(
       unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
-- 
2.7.4