According to the NVIDIA documentation on `cp.async.wait_group`
(https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-wait-group-cp-async-wait-all),
the `numGroups` attribute in `nvgpu.device_async_wait` should give an upper
bound on the number of pending async groups (instead of a lower bound) at the
point where the executing thread can be unblocked.
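
To make the upper-bound reading concrete, here is a small Python model of the
unblock condition. It is only a sketch of the arithmetic, not the nvgpu or PTX
API: the helper names `can_unblock` and `min_completed_groups` are hypothetical
and exist purely for illustration.

```python
# Toy model of the "upper bound" semantics; not part of MLIR or CUDA.

def can_unblock(pending_groups: int, num_groups: int) -> bool:
    """The waiting thread may proceed once at most num_groups are still pending."""
    return pending_groups <= num_groups


def min_completed_groups(pushed_groups: int, num_groups: int) -> int:
    """Fewest groups that must have completed before the wait can return."""
    return max(0, pushed_groups - num_groups)


if __name__ == "__main__":
    pushed, num_groups = 16, 12
    print(min_completed_groups(pushed, num_groups))
    print(can_unblock(12, num_groups), can_unblock(13, num_groups))
```

With 16 groups pushed and `numGroups` set to 12, the model requires at least 4
completed groups, and a thread with 13 groups still pending stays blocked.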
Reviewed By: christopherbate
Differential Revision: https://reviews.llvm.org/D154046
The `nvgpu.device_async_wait` op will block the execution thread until the group
associated with the source token is fully completed.
- The optional `$numGroup` attribute gives a lower bound of the number of
- groups uncompleted when the wait can unblock the thread.
+ The optional `$numGroups` attribute gives an upper bound of the number of
+ groups uncompleted when the wait can unblock the thread. For example, if
+ 16 async groups are pushed and `$numGroups` is set to 12, then the thread
+ will unblock when 12 groups or fewer are in flight (4 groups have
+ completed).
Example: