From 66feaf9c3cef152da61b189ad85300ea26919794 Mon Sep 17 00:00:00 2001
From: Justin Lebar
Date: Wed, 7 Sep 2016 21:46:53 +0000
Subject: [PATCH] [CUDA] Rework "optimizations" and "publication" section in
 CompileCudaWithLLVM.rst.

llvm-svn: 280869
---
 llvm/docs/CompileCudaWithLLVM.rst | 97 ++++++++++++++++++---------------
 1 file changed, 45 insertions(+), 52 deletions(-)

diff --git a/llvm/docs/CompileCudaWithLLVM.rst b/llvm/docs/CompileCudaWithLLVM.rst
index 1751bfbd..890204f 100644
--- a/llvm/docs/CompileCudaWithLLVM.rst
+++ b/llvm/docs/CompileCudaWithLLVM.rst
@@ -158,67 +158,60 @@ detect NVCC specifically by looking for ``__NVCC__``.
 
 Optimizations
 =============
 
-CPU and GPU have different design philosophies and architectures. For example, a
-typical CPU has branch prediction, out-of-order execution, and is superscalar,
-whereas a typical GPU has none of these. Due to such differences, an
-optimization pipeline well-tuned for CPUs may be not suitable for GPUs.
-
-LLVM performs several general and CUDA-specific optimizations for GPUs. The
-list below shows some of the more important optimizations for GPUs. Most of
-them have been upstreamed to ``lib/Transforms/Scalar`` and
-``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a
-customizable target-independent optimization pipeline.
-
-* **Straight-line scalar optimizations**. These optimizations reduce redundancy
-  in straight-line code. Details can be found in the `design document for
-  straight-line scalar optimizations `_.
-
-* **Inferring memory spaces**. `This optimization
-  `_
-  infers the memory space of an address so that the backend can emit faster
-  special loads and stores from it.
-
-* **Aggressive loop unrooling and function inlining**. Loop unrolling and
+Modern CPUs and GPUs are architecturally quite different, so code that's fast
+on a CPU isn't necessarily fast on a GPU. We've made a number of changes to
+LLVM to make it generate good GPU code. Among these changes are:
+
+* `Straight-line scalar optimizations `_ -- These
+  reduce redundancy within straight-line code.
+
+* `Aggressive speculative execution
+  `_
+  -- This is mainly for promoting straight-line scalar optimizations, which are
+  most effective on code along dominator paths.
+
+* `Memory space inference
+  `_ --
+  In PTX, we can operate on pointers that are in a particular "address space"
+  (global, shared, constant, or local), or we can operate on pointers in the
+  "generic" address space, which can point to anything. Operations in a
+  non-generic address space are faster, but pointers in CUDA are not explicitly
+  annotated with their address space, so it's up to LLVM to infer it where
+  possible (an example appears at the end of this section).
+
+* `Bypassing 64-bit divides
+  `_ --
+  This was an existing optimization that we enabled for the PTX backend.
+
+  64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
+  Many of the 64-bit divides in our benchmarks have a divisor and dividend
+  which fit in 32 bits at runtime. This optimization provides a fast path for
+  this common case.
+
+* Aggressive loop unrolling and function inlining -- Loop unrolling and
   function inlining need to be more aggressive for GPUs than for CPUs because
-  control flow transfer in GPU is more expensive. They also promote other
-  optimizations such as constant propagation and SROA which sometimes speed up
-  code by over 10x. An empirical inline threshold for GPUs is 1100. This
-  configuration has yet to be upstreamed with a target-specific optimization
-  pipeline. LLVM also provides `loop unrolling pragmas
+  control flow transfer on a GPU is more expensive. More aggressive unrolling
+  and inlining also promote other optimizations, such as constant propagation
+  and SROA, which sometimes speed up code by over 10x.
+
+  (Programmers can force unrolling and inlining using clang's `loop unrolling pragmas
   `_
-  and ``__attribute__((always_inline))`` for programmers to force unrolling and
-  inling.
-
-* **Aggressive speculative execution**. `This transformation
-  `_ is
-  mainly for promoting straight-line scalar optimizations which are most
-  effective on code along dominator paths.
-
-* **Memory-space alias analysis**. `This alias analysis
-  `_ infers that two pointers in different
-  special memory spaces do not alias. It has yet to be integrated to the new
-  alias analysis infrastructure; the new infrastructure does not run
-  target-specific alias analysis.
-
-* **Bypassing 64-bit divides**. `An existing optimization
-  `_
-  enabled in the NVPTX backend. 64-bit integer divides are much slower than
-  32-bit ones on NVIDIA GPUs due to lack of a divide unit. Many of the 64-bit
-  divides in our benchmarks have a divisor and dividend which fit in 32-bits at
-  runtime. This optimization provides a fast path for this common case.
+  and ``__attribute__((always_inline))``.)
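+
+  For instance, a kernel along the following lines uses both of these. (This
+  is a minimal sketch; the kernel, the names, and the unroll factor are purely
+  illustrative.)
+
+  .. code-block:: c++
+
+    // always_inline forces this helper to be inlined into its callers.
+    __attribute__((always_inline))
+    __device__ float axpy(float a, float x, float y) {
+      return a * x + y;
+    }
+
+    __global__ void saxpy4(float a, const float *x, float *y, int n) {
+      int base = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
+      // The trip count is a compile-time constant, so the pragma asks the
+      // compiler to unroll the loop completely, removing per-iteration branches.
+      #pragma unroll
+      for (int j = 0; j < 4; ++j) {
+        int i = base + j;
+        if (i < n)
+          y[i] = axpy(a, x[i], y[i]);
+      }
+    }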
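+
+To see what memory space inference buys us, consider the following sketch (a
+hypothetical kernel; the function names and the block size of 256 are purely
+illustrative). The parameter ``p`` carries no address-space annotation, but
+once ``sum`` is inlined into ``block_sum``, LLVM can see that ``p`` points
+into ``__shared__`` memory and can use the faster non-generic loads:
+
+.. code-block:: c++
+
+  // p is a plain ("generic") pointer; its address space is not annotated.
+  __device__ float sum(const float *p, int n) {
+    float s = 0;
+    for (int i = 0; i < n; ++i)
+      s += p[i];
+    return s;
+  }
+
+  __global__ void block_sum(float *out) {  // Launch with 256 threads per block.
+    __shared__ float buf[256];             // buf is known to be in shared memory.
+    buf[threadIdx.x] = threadIdx.x;
+    __syncthreads();
+    if (threadIdx.x == 0)
+      // After inlining, the loads inside sum() provably read from buf, so the
+      // backend can emit shared-memory loads instead of generic ones.
+      *out = sum(buf, 256);
+  }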
 
 Publication
 ===========
 
+The team at Google published a paper in CGO 2016 detailing the optimizations
+they'd made to clang/LLVM. Note that "gpucc" is no longer a meaningful name:
+The relevant tools are now just vanilla clang/LLVM.
+
 | `gpucc: An Open-Source GPGPU Compiler `_
 | Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
 | *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
-| `Slides for the CGO talk `_
-
-Tutorial
-========
-
-`CGO 2016 gpucc tutorial `_
+|
+| `Slides from the CGO talk `_
+|
+| `Tutorial given at CGO `_
 
 Obtaining Help
 ==============
-- 
2.7.4