GBE: use soft mask to handle the barrier call.
Because the GPU runs under predication control, the following IR
may cause a single barrier to be executed twice at runtime.
A:
barrier()
instructions after barrier()
B:
...
BR(cond) A
C:
...
BR A
When execution reaches B's BR instruction and any of the condition bits
is true, it jumps to block A and executes the barrier. Later, if any of
the condition bits is false, it continues into block C, and at the end
of block C it jumps to A and executes the barrier again.
If, on another thread, all the condition bits are true, that thread
executes the barrier only once, so the barrier counts differ between
threads and this triggers a hang.
Even if all the threads execute the barrier the same number of times,
the result may still be incorrect, because the instructions after
barrier() in block A are executed before all the work items have hit
the barrier point.
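A minimal sketch of the double execution described above, in Python
rather than GBE code. It assumes an 8-lane hardware thread and models
only one pass through blocks B and C; `naive_barrier_calls`,
`SIMD_WIDTH` and `ALL_LANES` are hypothetical names used for
illustration.

```python
SIMD_WIDTH = 8                      # assumed lane count, for illustration only
ALL_LANES = (1 << SIMD_WIDTH) - 1


def naive_barrier_calls(cond_mask):
    """Count how often one hardware thread reaches barrier() in block A.

    cond_mask holds one condition bit per lane. Lanes whose bit is set
    take B's BR(cond) to A; the remaining lanes fall through to C and
    reach A via the unconditional BR at the end of C.
    """
    calls = 0
    via_b = cond_mask & ALL_LANES       # lanes jumping from B to A
    via_c = ~cond_mask & ALL_LANES      # lanes reaching A through C
    if via_b:
        calls += 1                      # block A runs for the B lanes
    if via_c:
        calls += 1                      # block A runs again for the C lanes
    return calls
```

A thread with uniform condition bits reaches the barrier once, while a
thread with mixed condition bits reaches it twice, which is the
mismatch that leads to the hang.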
The fix is to use a soft mask register. The register is shared by all
barrier calls. We initialize it to !emask at the beginning of the
program.
barrierMask = !emask.
When execution reaches the barrier call, we set the current predication
bits in the mask register and check whether all the lanes are set. If
any lane is still not set, we simply jump to the next basic block.
Later, when execution reaches the barrier again, more bits/lanes are
set to 1 and we check again. Once all the bits are 1, we set the
predication flag 0,0 to all 1s and execute the barrier call. After the
wait, we reinitialize barrierMask to !emask and run the remaining
instructions after barrier() in block A with all lanes enabled.
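The soft mask logic can be sketched as follows. This is a Python model
under stated assumptions, not the generated GEN code: an 8-lane thread
is assumed, and `SoftBarrier`, `arrive` and `hw_barriers` are
hypothetical names. Only `barrierMask` and `emask` come from the
description above.

```python
SIMD_WIDTH = 8                      # assumed lane count, for illustration only
ALL_LANES = (1 << SIMD_WIDTH) - 1


class SoftBarrier:
    """Model of one hardware thread's shared soft mask register."""

    def __init__(self, emask):
        self.emask = emask & ALL_LANES
        # barrierMask = !emask: lanes that never run count as arrived.
        self.barrier_mask = ~self.emask & ALL_LANES
        self.hw_barriers = 0        # number of real barrier calls issued

    def arrive(self, pred):
        """Lanes in pred reach barrier(); return True if it is executed."""
        self.barrier_mask |= pred & ALL_LANES
        if self.barrier_mask != ALL_LANES:
            return False            # some lanes missing: skip to next block
        self.hw_barriers += 1       # all lanes set: issue barrier and wait
        self.barrier_mask = ~self.emask & ALL_LANES   # reinitialize
        return True
```

For example, with `emask = 0xFF`, `arrive(0x0F)` returns False and
`arrive(0xF0)` then returns True, so the two partial arrivals result in
a single real barrier.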
This patch fixes the hang seen when running OpenCV's transpose test
cases.
v2:
1. If some lanes have not yet reached the barrier, we need to set the
block IP of all the finished lanes to FFFF and also clear flag0 to
zero. This avoids executing the instructions after the barrier too
early.
2. Fix some typos.
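The v2 change in item 1 can be sketched like this. It is again a
Python model with hypothetical names (`arrive_v2`, `NO_BLOCK_IP`), and
it assumes an 8-lane thread with per-lane block IPs kept in a list.
How block IPs are restored once every lane has arrived is not covered
by this message, so the sketch leaves them unchanged in that case.

```python
SIMD_WIDTH = 8                      # assumed lane count, for illustration only
ALL_LANES = (1 << SIMD_WIDTH) - 1
NO_BLOCK_IP = 0xFFFF                # block IP given to finished lanes


def arrive_v2(barrier_mask, pred, emask, block_ips):
    """Return (barrier_mask, flag0, block_ips) after lanes in pred arrive."""
    barrier_mask = (barrier_mask | pred) & ALL_LANES
    if barrier_mask == ALL_LANES:
        # Every lane has arrived: enable all lanes for the real barrier.
        return barrier_mask, ALL_LANES, list(block_ips)
    # Some lanes are still missing: park the finished lanes and clear
    # flag0 so that the instructions after the barrier do not run yet.
    finished = barrier_mask & emask
    parked = [NO_BLOCK_IP if (finished >> lane) & 1 else ip
              for lane, ip in enumerate(block_ips)]
    return barrier_mask, 0, parked
```

Parking the finished lanes keeps them out of the code that follows the
barrier until the remaining lanes have set their bits in the mask.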
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>