From 22d98280dd8ee70064899eefb973a1c020605874 Mon Sep 17 00:00:00 2001
From: JackAKirk
Date: Fri, 20 Jan 2023 16:51:11 +0000
Subject: [PATCH] [NVPTX] Increase inline threshold multiplier to 11 in nvptx
 backend.

I used https://github.com/zjin-lcf/HeCBench (with nvcc usage swapped to
clang++), an adaptation of the classic Rodinia benchmarks aimed at the CUDA
and SYCL programming models, to compare different values of the multiplier
using both the clang++ CUDA and clang++ SYCL NVPTX backends. I find that the
current value is too low in both cases.

Qualitatively (and in most cases with very close quantitative agreement), the
change in code execution time for multiplier values ranging from 5 to 1000
matches in both variations (CUDA clang++ vs. SYCL with the CUDA backend,
using the intel/llvm clang++ compiler) of the HeCBench samples. A value of 11
is optimal for clang++ CUDA in all cases I have investigated, and I have not
found a single case where performance regresses when the value changes from 5
to 11. For one sample the SYCL CUDA backend preferred a higher value, but we
are happy to prioritize clang++ CUDA, and we find that this value is close to
ideal for both cases anyway.

It would be good to do some further investigation using clang++ OpenMP CUDA
offload. However, since I do not know of an appropriate set of benchmarks for
that case, and since we now receive complaints about register spills related
to insufficient inlining on a weekly basis, we have decided to propose this
change and potentially seek more input from someone with more expertise in
the OpenMP case.

Incidentally, this value coincides with the value used for the amd-gcn
backend. We have also been able to use the AMD backend of the intel/llvm
"dpc++" compiler to compare the inlining behavior of identical code when
targeting AMD (compared to NVPTX).
Unsurprisingly, the AMD backend with a multiplier value of 11 inlined more
effectively than the NVPTX backend did with the value of 5. When the two
backends use the same multiplier value, their inlining behaviors appear to
align closely.

This also considerably improves the performance of at least one of the most
popular HPC applications: NWCHEMX.

Signed-off-by: JackAKirk
Reviewed by: tra

Differential Revision: https://reviews.llvm.org/D142232
---
 llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
index 0b1195e..b7bc0f2 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
@@ -90,9 +90,9 @@ public:
     return true;
   }
 
-  // Increase the inlining cost threshold by a factor of 5, reflecting that
+  // Increase the inlining cost threshold by a factor of 11, reflecting that
   // calls are particularly expensive in NVPTX.
-  unsigned getInliningThresholdMultiplier() { return 5; }
+  unsigned getInliningThresholdMultiplier() { return 11; }
 
   InstructionCost getArithmeticInstrCost(
       unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
-- 
2.7.4