GBE: Optimize the bool register allocation/processing.
Previously, we had a global flag allocation implementation.
After some analysis, I found that global flag allocation is not
the best solution here.
For a cross-block reference of a bool value, we have to
combine it with the current emask, so there is no obvious advantage in
allocating a dedicated physical flag register for those cross-block usages.
We just need to allocate physical flags within each BB, handling
the following cases:
1. The bool's liveness never goes beyond this BB, and the bool is only used as
a dst register or a pred register. Such a bool value can be
allocated to a physical flag, provided there are enough physical flags.
We already identify those bools at the instruction selection stage and
put them in the flagBooleans set.
2. The bool is defined in another BB and used in this BB; then we need
to prepend a validation instruction at the position where we use it.
3. The bool is defined in this BB but is also used as some instruction's
source register rather than the pred register. We have to keep the normal
GRF (UW8/UW16) register for this bool. For some CMP instructions, we need to
append a SEL instruction to convert the flag to the GRF register.
4. Even for a spilled flag, if there is only one spilled flag, we will also
try to reuse the temporary flag register later. This requires that all
instructions get their flag at the instruction selection stage and do
not use the physical flag number directly at the gen_context stage;
otherwise it may break the algorithm here.
We track all the validated bool values to avoid any redundant
validation of the same flag. If there are not enough physical flags,
we have to spill a previously allocated physical flag. The spill
policy is to spill the allocated flag whose live range ends last.
Let's look at a real example of the improvement from this patch.
Take compiler_vect_compare as an example; before this patch, the
instructions are as below:
( 24) cmp.g.f1.1(8) null g110<8,8,1>D 0D { align1 WE_normal 1Q };
( 26) cmp.g.f1.1(8) null g111<8,8,1>D 0D { align1 WE_normal 2Q };
( 28) (+f1.1) sel(16) g109<1>UW g1.2<0,1,0>UW g1<0,1,0>UW { align1 WE_normal 1H };
( 30) cmp.g.f1.1(8) null g112<8,8,1>D 0D { align1 WE_normal 1Q };
( 32) cmp.g.f1.1(8) null g113<8,8,1>D 0D { align1 WE_normal 2Q };
( 34) (+f1.1) sel(16) g108<1>UW g1.2<0,1,0>UW g1<0,1,0>UW { align1 WE_normal 1H };
( 36) cmp.g.f1.1(8) null g114<8,8,1>D 0D { align1 WE_normal 1Q };
( 38) cmp.g.f1.1(8) null g115<8,8,1>D 0D { align1 WE_normal 2Q };
( 40) (+f1.1) sel(16) g107<1>UW g1.2<0,1,0>UW g1<0,1,0>UW { align1 WE_normal 1H };
( 42) cmp.g.f1.1(8) null g116<8,8,1>D 0D { align1 WE_normal 1Q };
( 44) cmp.g.f1.1(8) null g117<8,8,1>D 0D { align1 WE_normal 2Q };
( 46) (+f1.1) sel(16) g106<1>UW g1.2<0,1,0>UW g1<0,1,0>UW { align1 WE_normal 1H };
( 48) mov(16) g104<1>F -nanF { align1 WE_normal 1H };
( 50) cmp.ne.f1.1(16) null g109<8,8,1>UW 0x0UW { align1 WE_normal 1H switch };
( 52) (+f1.1) sel(16) g96<1>D g104<8,8,1>D 0D { align1 WE_normal 1H };
( 54) cmp.ne.f1.1(16) null g108<8,8,1>UW 0x0UW { align1 WE_normal 1H switch };
( 56) (+f1.1) sel(16) g98<1>D g104<8,8,1>D 0D { align1 WE_normal 1H };
( 58) cmp.ne.f1.1(16) null g107<8,8,1>UW 0x0UW { align1 WE_normal 1H switch };
( 60) (+f1.1) sel(16) g100<1>D g104<8,8,1>D 0D { align1 WE_normal 1H };
( 62) cmp.ne.f1.1(16) null g106<8,8,1>UW 0x0UW { align1 WE_normal 1H switch };
( 64) (+f1.1) sel(16) g102<1>D g104<8,8,1>D 0D { align1 WE_normal 1H };
( 66) add(16) g94<1>D g1.3<0,1,0>D g120<8,8,1>D { align1 WE_normal 1H };
( 68) send(16) null g94<8,8,1>UD
data (bti: 1, rgba: 0, SIMD16, legacy, Untyped Surface Write) mlen 10 rlen 0 { align1 WE_normal 1H };
( 70) mov(16) g2<1>UW 0x1UW { align1 WE_normal 1H };
( 72) endif(16) 2 null { align1 WE_normal 1H };
After this patch, it becomes:
( 24) cmp.g(8) null g110<8,8,1>D 0D { align1 WE_normal 1Q };
( 26) cmp.g(8) null g111<8,8,1>D 0D { align1 WE_normal 2Q };
( 28) cmp.g.f1.1(8) null g112<8,8,1>D 0D { align1 WE_normal 1Q };
( 30) cmp.g.f1.1(8) null g113<8,8,1>D 0D { align1 WE_normal 2Q };
( 32) cmp.g.f0.1(8) null g114<8,8,1>D 0D { align1 WE_normal 1Q };
( 34) cmp.g.f0.1(8) null g115<8,8,1>D 0D { align1 WE_normal 2Q };
( 36) (+f0.1) sel(16) g109<1>UW g1.2<0,1,0>UW g1<0,1,0>UW { align1 WE_normal 1H };
( 38) cmp.g.f1.0(8) null g116<8,8,1>D 0D { align1 WE_normal 1Q };
( 40) cmp.g.f1.0(8) null g117<8,8,1>D 0D { align1 WE_normal 2Q };
( 42) mov(16) g106<1>F -nanF { align1 WE_normal 1H };
( 44) (+f0) sel(16) g98<1>D g106<8,8,1>D 0D { align1 WE_normal 1H };
( 46) (+f1.1) sel(16) g100<1>D g106<8,8,1>D 0D { align1 WE_normal 1H };
( 48) (+f0.1) sel(16) g102<1>D g106<8,8,1>D 0D { align1 WE_normal 1H };
( 50) (+f1) sel(16) g104<1>D g106<8,8,1>D 0D { align1 WE_normal 1H };
( 52) add(16) g96<1>D g1.3<0,1,0>D g120<8,8,1>D { align1 WE_normal 1H };
( 54) send(16) null g96<8,8,1>UD
data (bti: 1, rgba: 0, SIMD16, legacy, Untyped Surface Write) mlen 10 rlen 0 { align1 WE_normal 1H };
( 56) mov(16) g2<1>UW 0x1UW { align1 WE_normal 1H };
( 58) endif(16) 2 null { align1 WE_normal 1H };
It reduces the instruction count from 25 to 18, saving about 28% of the instructions.
v2:
Fix some minor bugs.
Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>