6 There are four distinct types of LLVM/OpenMP runtimes
8 LLVM/OpenMP Host Runtime (``libomp``)
9 -------------------------------------
11 An `early (2015) design document <https://openmp.llvm.org/Reference.pdf>`_ for
12 the LLVM/OpenMP host runtime, aka. `libomp.so`, is available as a `pdf
13 <https://openmp.llvm.org/Reference.pdf>`_.
16 LLVM/OpenMP Target Host Runtime (``libomptarget``)
17 --------------------------------------------------
19 .. _libopenmptarget_environment_vars:
24 ``libomptarget`` uses environment variables to control different features of the
25 library at runtime. This allows the user to obtain useful runtime information as
26 well as enable or disable certain features. A full list of supported environment
27 variables is defined below.
29 * ``LIBOMPTARGET_DEBUG=<Num>``
30 * ``LIBOMPTARGET_PROFILE=<Filename>``
31 * ``LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD=<Num>``
32 * ``LIBOMPTARGET_INFO=<Num>``
33 * ``LIBOMPTARGET_HEAP_SIZE=<Num>``
34 * ``LIBOMPTARGET_STACK_SIZE=<Num>``
35 * ``LIBOMPTARGET_SHARED_MEMORY_SIZE=<Num>``
40 ``LIBOMPTARGET_DEBUG`` controls whether or not debugging information will be
41 displayed. This feature is only availible if ``libomptarget`` was built with
42 ``-DOMPTARGET_DEBUG``. The debugging output provided is intended for use by
43 ``libomptarget`` developers. More user-friendly output is presented when using
44 ``LIBOMPTARGET_INFO``.
48 ``LIBOMPTARGET_PROFILE`` allows ``libomptarget`` to generate time profile output
49 similar to Clang's ``-ftime-trace`` option. This generates a JSON file based on
50 `Chrome Tracing`_ that can be viewed with ``chrome://tracing`` or the
51 `Speedscope App`_. Building this feature depends on the `LLVM Support Library`_
52 for time trace output. Using this library is enabled by default when building
53 using the CMake option ``OPENMP_ENABLE_LIBOMPTARGET_PROFILING``. The output will
54 be saved to the filename specified by the environment variable. For multi-threaded
55 applications, profiling in ``libomp`` is also needed. Setting the CMake option
56 ``OPENMP_ENABLE_LIBOMP_PROFILING=ON`` to enable the feature. Note that this will
57 turn ``libomp`` into a C++ library.
59 .. _`Chrome Tracing`: https://www.chromium.org/developers/how-tos/trace-event-profiling-tool
61 .. _`Speedscope App`: https://www.speedscope.app/
63 .. _`LLVM Support Library`: https://llvm.org/docs/SupportLibrary.html
65 LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD
66 """""""""""""""""""""""""""""""""""""
68 ``LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD`` sets the threshold size for which the
69 ``libomptarget`` memory manager will handle the allocation. Any allocations
70 larger than this threshold will not use the memory manager and be freed after
71 the device kernel exits. The default threshold value is ``8KB``. If
72 ``LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD`` is set to ``0`` the memory manager
73 will be completely disabled.
78 ``LIBOMPTARGET_INFO`` allows the user to request different types of runtime
79 information from ``libomptarget``. ``LIBOMPTARGET_INFO`` uses a 32-bit field to
80 enable or disable different types of information. This includes information
81 about data-mappings and kernel execution. It is recommended to build your
82 application with debugging information enabled, this will enable filenames and
83 variable declarations in the information messages. OpenMP Debugging information
84 is enabled at any level of debugging so a full debug runtime is not required.
85 For minimal debugging information compile with `-gline-tables-only`, or compile
86 with `-g` for full debug information. A full list of flags supported by
87 ``LIBOMPTARGET_INFO`` is given below.
89 * Print all data arguments upon entering an OpenMP device kernel: ``0x01``
90 * Indicate when a mapped address already exists in the device mapping table:
92 * Dump the contents of the device pointer map at kernel exit: ``0x04``
93 * Indicate when an entry is changed in the device mapping table: ``0x08``
94 * Print OpenMP kernel information from device plugins: ``0x10``
95 * Indicate when data is copied to and from the device: ``0x20``
97 Any combination of these flags can be used by setting the appropriate bits. For
98 example, to enable printing all data active in an OpenMP target region along
99 with ``CUDA`` information, run the following ``bash`` command.
101 .. code-block:: console
103 $ env LIBOMPTARGET_INFO=$((1 << 0x1 | 1 << 0x10)) ./your-application
105 Or, to enable every flag run with every bit set.
107 .. code-block:: console
109 $ env LIBOMPTARGET_INFO=-1 ./your-application
111 For example, given a small application implementing the ``ZAXPY`` BLAS routine,
112 ``Libomptarget`` can provide useful information about data mappings and thread
119 using complex = std::complex<double>;
121 void zaxpy(complex *X, complex *Y, complex D, std::size_t N) {
122 #pragma omp target teams distribute parallel for
123 for (std::size_t i = 0; i < N; ++i)
124 Y[i] = D * X[i] + Y[i];
128 const std::size_t N = 1024;
129 complex X[N], Y[N], D;
130 #pragma omp target data map(to:X[0 : N]) map(tofrom:Y[0 : N])
134 Compiling this code targeting ``nvptx64`` with all information enabled will
135 provide the following output from the runtime library.
137 .. code-block:: console
139 $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -gline-tables-only zaxpy.cpp -o zaxpy
140 $ env LIBOMPTARGET_INFO=-1 ./zaxpy
144 Info: Entering OpenMP data region at zaxpy.cpp:14:1 with 2 arguments:
145 Info: to(X[0:N])[16384]
146 Info: tofrom(Y[0:N])[16384]
147 Info: Creating new map entry with HstPtrBegin=0x00007fff0d259a40,
148 TgtPtrBegin=0x00007fdba5800000, Size=16384, RefCount=1, Name=X[0:N]
149 Info: Copying data from host to device, HstPtr=0x00007fff0d259a40,
150 TgtPtr=0x00007fdba5800000, Size=16384, Name=X[0:N]
151 Info: Creating new map entry with HstPtrBegin=0x00007fff0d255a40,
152 TgtPtrBegin=0x00007fdba5804000, Size=16384, RefCount=1, Name=Y[0:N]
153 Info: Copying data from host to device, HstPtr=0x00007fff0d255a40,
154 TgtPtr=0x00007fdba5804000, Size=16384, Name=Y[0:N]
155 Info: OpenMP Host-Device pointer mappings after block at zaxpy.cpp:14:1:
156 Info: Host Ptr Target Ptr Size (B) RefCount Declaration
157 Info: 0x00007fff0d255a40 0x00007fdba5804000 16384 1 Y[0:N] at zaxpy.cpp:13:17
158 Info: 0x00007fff0d259a40 0x00007fdba5800000 16384 1 X[0:N] at zaxpy.cpp:13:11
159 Info: Entering OpenMP kernel at zaxpy.cpp:6:1 with 4 arguments:
160 Info: firstprivate(N)[8] (implicit)
161 Info: use_address(Y)[0] (implicit)
162 Info: tofrom(D)[16] (implicit)
163 Info: use_address(X)[0] (implicit)
164 Info: Mapping exists (implicit) with HstPtrBegin=0x00007fff0d255a40,
165 TgtPtrBegin=0x00007fdba5804000, Size=0, RefCount=2 (incremented), Name=Y
166 Info: Creating new map entry with HstPtrBegin=0x00007fff0d2559f0,
167 TgtPtrBegin=0x00007fdba5808000, Size=16, RefCount=1, Name=D
168 Info: Copying data from host to device, HstPtr=0x00007fff0d2559f0,
169 TgtPtr=0x00007fdba5808000, Size=16, Name=D
170 Info: Mapping exists (implicit) with HstPtrBegin=0x00007fff0d259a40,
171 TgtPtrBegin=0x00007fdba5800000, Size=0, RefCount=2 (incremented), Name=X
172 Info: Mapping exists with HstPtrBegin=0x00007fff0d255a40,
173 TgtPtrBegin=0x00007fdba5804000, Size=0, RefCount=2 (update suppressed)
174 Info: Mapping exists with HstPtrBegin=0x00007fff0d2559f0,
175 TgtPtrBegin=0x00007fdba5808000, Size=16, RefCount=1 (update suppressed)
176 Info: Mapping exists with HstPtrBegin=0x00007fff0d259a40,
177 TgtPtrBegin=0x00007fdba5800000, Size=0, RefCount=2 (update suppressed)
178 Info: Launching kernel __omp_offloading_10305_c08c86__Z5zaxpyPSt7complexIdES1_S0_m_l6
179 with 8 blocks and 128 threads in SPMD mode
180 Info: Mapping exists with HstPtrBegin=0x00007fff0d259a40,
181 TgtPtrBegin=0x00007fdba5800000, Size=0, RefCount=1 (decremented)
182 Info: Mapping exists with HstPtrBegin=0x00007fff0d2559f0,
183 TgtPtrBegin=0x00007fdba5808000, Size=16, RefCount=1 (deferred final decrement)
184 Info: Copying data from device to host, TgtPtr=0x00007fdba5808000,
185 HstPtr=0x00007fff0d2559f0, Size=16, Name=D
186 Info: Mapping exists with HstPtrBegin=0x00007fff0d255a40,
187 TgtPtrBegin=0x00007fdba5804000, Size=0, RefCount=1 (decremented)
188 Info: Removing map entry with HstPtrBegin=0x00007fff0d2559f0,
189 TgtPtrBegin=0x00007fdba5808000, Size=16, Name=D
190 Info: OpenMP Host-Device pointer mappings after block at zaxpy.cpp:6:1:
191 Info: Host Ptr Target Ptr Size (B) RefCount Declaration
192 Info: 0x00007fff0d255a40 0x00007fdba5804000 16384 1 Y[0:N] at zaxpy.cpp:13:17
193 Info: 0x00007fff0d259a40 0x00007fdba5800000 16384 1 X[0:N] at zaxpy.cpp:13:11
194 Info: Exiting OpenMP data region at zaxpy.cpp:14:1 with 2 arguments:
195 Info: to(X[0:N])[16384]
196 Info: tofrom(Y[0:N])[16384]
197 Info: Mapping exists with HstPtrBegin=0x00007fff0d255a40,
198 TgtPtrBegin=0x00007fdba5804000, Size=16384, RefCount=1 (deferred final decrement)
199 Info: Copying data from device to host, TgtPtr=0x00007fdba5804000,
200 HstPtr=0x00007fff0d255a40, Size=16384, Name=Y[0:N]
201 Info: Mapping exists with HstPtrBegin=0x00007fff0d259a40,
202 TgtPtrBegin=0x00007fdba5800000, Size=16384, RefCount=1 (deferred final decrement)
203 Info: Removing map entry with HstPtrBegin=0x00007fff0d255a40,
204 TgtPtrBegin=0x00007fdba5804000, Size=16384, Name=Y[0:N]
205 Info: Removing map entry with HstPtrBegin=0x00007fff0d259a40,
206 TgtPtrBegin=0x00007fdba5800000, Size=16384, Name=X[0:N]
208 From this information, we can see the OpenMP kernel being launched on the CUDA
209 device with enough threads and blocks for all ``1024`` iterations of the loop in
210 simplified :doc:`SPMD Mode <Offloading>`. The information from the OpenMP data
211 region shows the two arrays ``X`` and ``Y`` being copied from the host to the
212 device. This creates an entry in the host-device mapping table associating the
213 host pointers to the newly created device data. The data mappings in the OpenMP
214 device kernel show the default mappings being used for all the variables used
215 implicitly on the device. Because ``X`` and ``Y`` are already mapped in the
216 device's table, no new entries are created. Additionally, the default mapping
217 shows that ``D`` will be copied back from the device once the OpenMP device
218 kernel region ends even though it isn't written to. Finally, at the end of the
219 OpenMP data region the entries for ``X`` and ``Y`` are removed from the table.
221 The information level can be controlled at runtime using an internal
222 libomptarget library call ``__tgt_set_info_flag``. This allows for different
223 levels of information to be enabled or disabled for certain regions of code.
224 Using this requires declaring the function signature as an external function so
225 it can be linked with the runtime library.
229 extern "C" void __tgt_set_info_flag(uint32_t);
234 __tgt_set_info_flag(0x10);
239 .. _libopenmptarget_errors:
244 ``libomptarget`` provides error messages when the program fails inside the
245 OpenMP target region. Common causes of failure could be an invalid pointer
246 access, running out of device memory, or trying to offload when the device is
247 busy. If the application was built with debugging symbols the error messages
248 will additionally provide the source location of the OpenMP target region.
250 For example, consider the following code that implements a simple parallel
251 reduction on the GPU. This code has a bug that causes it to fail in the
258 double sum(double *A, std::size_t N) {
260 #pragma omp target teams distribute parallel for reduction(+:sum)
261 for (int i = 0; i < N; ++i)
273 If this code is compiled and run, there will be an error message indicating what is
276 .. code-block:: console
278 $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -gline-tables-only sum.cpp -o sum
283 CUDA error: an illegal memory access was encountered
284 Libomptarget error: Copying data from device failed.
285 Libomptarget error: Call to targetDataEnd failed, abort target.
286 Libomptarget error: Failed to process data after launching the kernel.
287 Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
288 sum.cpp:5:1: Libomptarget error 1: failure of target construct while offloading is mandatory
290 This shows that there is an illegal memory access occuring inside the OpenMP
291 target region once execution has moved to the CUDA device, suggesting a
292 segmentation fault. This then causes a chain reaction of failures in
293 ``libomptarget``. Another message suggests using the ``LIBOMPTARGET_INFO``
294 environment variable as described in :ref:`libopenmptarget_environment_vars`. If
295 we do this it will print the sate of the host-target pointer mappings at the
298 .. code-block:: console
300 $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -gline-tables-only sum.cpp -o sum
301 $ env LIBOMPTARGET_INFO=4 ./sum
305 info: OpenMP Host-Device pointer mappings after block at sum.cpp:5:1:
306 info: Host Ptr Target Ptr Size (B) RefCount Declaration
307 info: 0x00007ffc058280f8 0x00007f4186600000 8 1 sum at sum.cpp:4:10
309 This tells us that the only data mapped between the host and the device is the
310 ``sum`` variable that will be copied back from the device once the reduction has
311 ended. There is no entry mapping the host array ``A`` to the device. In this
312 situation, the compiler cannot determine the size of the array at compile time
313 so it will simply assume that the pointer is mapped on the device already by
314 default. The solution is to add an explicit map clause in the target region.
318 double sum(double *A, std::size_t N) {
320 #pragma omp target teams distribute parallel for reduction(+:sum) map(to:A[0 : N])
321 for (int i = 0; i < N; ++i)
327 LIBOMPTARGET_STACK_SIZE
328 """""""""""""""""""""""
330 This environment variable sets the stack size in bytes for the CUDA plugin. This
331 can be used to increase or decrease the standard amount of memory reserved for
334 LIBOMPTARGET_HEAP_SIZE
335 """""""""""""""""""""""
337 This environment variable sets the amount of memory in bytes that can be
338 allocated using ``malloc`` and ``free`` for the CUDA plugin. This is necessary
339 for some applications that allocate too much memory either through the user or
342 LIBOMPTARGET_SHARED_MEMORY_SIZE
343 """""""""""""""""""""""""""""""
345 This environment variable sets the amount of dynamic shared memory in bytes used
346 by the kernel once it is launched. A pointer to the dynamic memory buffer can
347 currently only be accessed using the ``__kmpc_get_dynamic_shared`` device
356 LLVM/OpenMP Target Host Runtime Plugins (``libomptarget.rtl.XXXX``)
357 -------------------------------------------------------------------
362 .. _remote_offloading_plugin:
364 Remote Offloading Plugin:
365 ^^^^^^^^^^^^^^^^^^^^^^^^^
367 The remote offloading plugin permits the execution of OpenMP target regions
368 on devices in remote hosts in addition to the devices connected to the local
369 host. All target devices on the remote host will be exposed to the
370 application as if they were local devices, that is, the remote host CPU or
371 its GPUs can be offloaded to with the appropriate device number. If the
372 server is running on the same host, each device may be identified twice:
373 once through the device plugins and once through the device plugins that the
374 server application has access to.
376 This plugin consists of ``libomptarget.rtl.rpc.so`` and
377 ``openmp-offloading-server`` which should be running on the (remote) host. The
378 server application does not have to be running on a remote host, and can
379 instead be used on the same host in order to debug memory mapping during offloading.
380 These are implemented via gRPC/protobuf so these libraries are required to
381 build and use this plugin. The server must also have access to the necessary
382 target-specific plugins in order to perform the offloading.
384 Due to the experimental nature of this plugin, the CMake variable
385 ``LIBOMPTARGET_ENABLE_EXPERIMENTAL_REMOTE_PLUGIN`` must be set in order to
386 build this plugin. For example, the rpc plugin is not designed to be
387 thread-safe, the server cannot concurrently handle offloading from multiple
388 applications at once (it is synchronous) and will terminate after a single
389 execution. Note that ``openmp-offloading-server`` is unable to
390 remote offload onto a remote host itself and will error out if this is attempted.
392 Remote offloading is configured via environment variables at runtime of the OpenMP application:
393 * ``LIBOMPTARGET_RPC_ADDRESS=<Address>:<Port>``
394 * ``LIBOMPTARGET_RPC_ALLOCATOR_MAX=<NumBytes>``
395 * ``LIBOMPTARGET_BLOCK_SIZE=<NumBytes>``
396 * ``LIBOMPTARGET_RPC_LATENCY=<Seconds>``
398 LIBOMPTARGET_RPC_ADDRESS
399 """"""""""""""""""""""""
400 The address and port at which the server is running. This needs to be set for
401 the server and the application, the default is ``0.0.0.0:50051``. A single
402 OpenMP executable can offload onto multiple remote hosts by setting this to
403 comma-seperated values of the addresses.
405 LIBOMPTARGET_RPC_ALLOCATOR_MAX
406 """"""""""""""""""""""""""""""
407 After allocating this size, the protobuf allocator will clear. This can be set for both endpoints.
409 LIBOMPTARGET_BLOCK_SIZE
410 """""""""""""""""""""""
411 This is the maximum size of a single message while streaming data transfers between the two endpoints and can be set for both endpoints.
413 LIBOMPTARGET_RPC_LATENCY
414 """"""""""""""""""""""""
415 This is the maximum amount of time the client will wait for a response from the server.
417 LLVM/OpenMP Target Device Runtime (``libomptarget-ARCH-SUBARCH.bc``)
418 --------------------------------------------------------------------
420 The target device runtime is an LLVM bitcode library that implements OpenMP
421 runtime functions on the target device. It is linked with the device code's LLVM
422 IR during compilation.
427 The device runtime supports debugging in the runtime itself. This is configured
428 at compile-time using the flag ``-fopenmp-target-debug=<N>`` rather than using a
429 separate debugging build. If debugging is not enabled, the debugging paths will
430 be considered trivially dead and removed by the compiler with zero overhead.
431 Debugging is enabled at runtime by running with the environment variable
432 ``LIBOMPTARGET_DEVICE_RTL_DEBUG=<N>`` set. The number set is a 32-bit field used
433 to selectively enable and disable different features. Currently, the following
434 debugging features are supported.
436 * Enable debugging assertions in the device. ``0x01``
437 * Enable OpenMP runtime function traces in the device. ``0x2``
441 void copy(double *X, double *Y) {
442 #pragma omp target teams distribute parallel for
443 for (std::size_t i = 0; i < N; ++i)
447 Compiling this code targeting ``nvptx64`` with debugging enabled will
448 provide the following output from the device runtime library.
450 .. code-block:: console
452 $ clang++ -fopenmp -fopenmp-targets=nvptx64 -fopenmp-target-new-runtime \
453 -fopenmp-target-debug=3
454 $ env LIBOMPTARGET_DEVICE_RTL_DEBUG=3 ./zaxpy
458 Kernel.cpp:70: Thread 0 Entering int32_t __kmpc_target_init()
459 Parallelism.cpp:196: Thread 0 Entering int32_t __kmpc_global_thread_num()
460 Mapping.cpp:239: Thread 0 Entering uint32_t __kmpc_get_hardware_num_threads_in_block()
461 Workshare.cpp:616: Thread 0 Entering void __kmpc_distribute_static_init_4()
462 Parallelism.cpp:85: Thread 0 Entering void __kmpc_parallel_51()
463 Parallelism.cpp:69: Thread 0 Entering <OpenMP Outlined Function>
464 Workshare.cpp:575: Thread 0 Entering void __kmpc_for_static_init_4()
465 Workshare.cpp:660: Thread 0 Entering void __kmpc_distribute_static_fini()
466 Workshare.cpp:660: Thread 0 Entering void __kmpc_distribute_static_fini()
467 Kernel.cpp:103: Thread 0 Entering void __kmpc_target_deinit()