Zhigang Gong [Tue, 6 May 2014 03:20:41 +0000 (11:20 +0800)]
GBE: fix one potential bug in UnsignedI64ToFloat.
Set exp to a proper value to make sure all the inactive lanes'
flag bits are 1s, which satisfies the requirement of the following
ALL16/ALL8H condition check.
v2:
enable the first JMPI's optimization rather than the second,
as it has a higher probability.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Chuanbo Weng [Tue, 6 May 2014 10:48:26 +0000 (18:48 +0800)]
GBE: Fix one build error of friend declaration for a class.
If g++ is older than 4.7.0, the class-key of the
elaborated-type-specifier is required in a friend declaration
for a class. So modify the code to make it compatible with
old g++ versions.
Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Zhigang Gong [Sun, 4 May 2014 01:16:05 +0000 (09:16 +0800)]
GBE: remove some useless code.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Sun, 4 May 2014 01:14:08 +0000 (09:14 +0800)]
GBE: increase the global memory size to 1GB.
Also increase the global memory to 1GB.
v2: change the max memory size to 256MB
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Sun, 4 May 2014 00:59:41 +0000 (08:59 +0800)]
GBE: fixed a regression in "long" div/rem.
If GEN_PREDICATE_ALIGN1_ANY8H/ANY16H or ALL8H/ALL16H
are used, we must make sure the inactive lanes are initialized
correctly. For an "ANY" condition, all inactive lanes need to
be cleared to zero. For an "ALL" condition, all inactive lanes
need to be set to 1s. Otherwise, it may cause an infinite loop.
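The ANY/ALL lane-initialization requirement can be sketched with a toy 8-wide flag model. This is an illustration only; the names and the 8-bit encoding are mine, not the driver's actual code:

```c
#include <assert.h>
#include <stdint.h>

/* Bit i of 'flag' holds the per-lane condition; 'active' marks enabled lanes. */
static int any8h(uint8_t flag) { return flag != 0; }    /* ANY8H: any bit set  */
static int all8h(uint8_t flag) { return flag == 0xFF; } /* ALL8H: all bits set */

/* "ANY" condition: inactive lanes must be cleared to 0, or a stale 1 bit
 * makes ANY8H report true even though no active lane is set. */
static uint8_t prep_any(uint8_t flag, uint8_t active) {
  return flag & active;
}

/* "ALL" condition: inactive lanes must be set to 1, or a stale 0 bit
 * makes ALL8H report false even though every active lane is set. */
static uint8_t prep_all(uint8_t flag, uint8_t active) {
  return flag | (uint8_t)~active;
}
```

With a stale flag of 0xF0 and only lanes 0-3 active, ANY8H would wrongly fire unless the inactive bits are cleared first; symmetrically, ALL8H needs the inactive bits forced to 1.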
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Yi Sun [Mon, 28 Apr 2014 05:31:05 +0000 (13:31 +0800)]
Init Benchmark suite
The first benchmark case is named enqueue_copy_buf.
Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Zhigang Gong [Fri, 25 Apr 2014 13:52:34 +0000 (21:52 +0800)]
GBE: reserve flag0.0 for large basic block.
In a large basic block, there can be more than one IF instruction
that needs to use flag0.0. We have to reserve flag0.0 for those IF
instructions.
Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Fri, 25 Apr 2014 07:38:22 +0000 (15:38 +0800)]
GBE: fix the large if/endif block issue.
Some test cases have very large blocks which contain
more than 32768/2 instructions, which cannot fit into one
if/endif block.
This patch introduces an if/endif fix switch in the GenContext.
Once we encounter one of these errors, we set the switch on
and then recompile the kernel. When the switch is on, we
insert extra endif/if pairs into the block to split one large
if/endif block into multiple ones, fixing the large if/endif issue.
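The splitting arithmetic can be sketched as follows, assuming the 32768/2-instruction span limit stated above; the helper name and constant encoding are hypothetical, not the actual GenContext code:

```c
#include <assert.h>

/* The jump offset is a signed 16-bit word count, so one if/endif span
 * is limited to 32768/2 instructions (per the commit message above). */
enum { MAX_IF_SPAN = 32768 / 2 };

/* Number of extra endif/if pairs needed so that a block of n
 * instructions is split into spans that each fit the offset range. */
static int extra_endif_if_pairs(int n) {
  int spans = (n + MAX_IF_SPAN - 1) / MAX_IF_SPAN; /* ceil(n / MAX_IF_SPAN) */
  return spans > 1 ? spans - 1 : 0;
}
```

A block that already fits needs no extra pairs; each additional span beyond the first costs one inserted endif/if pair.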
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Fri, 25 Apr 2014 04:36:59 +0000 (12:36 +0800)]
GBE: fix the hard coded endif offset calculation.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Thu, 24 Apr 2014 07:24:07 +0000 (15:24 +0800)]
GBE: Avoid unnecessary dag/liveness computing at the backend.
We don't need to recompute dag/liveness at the backend when
we switch to a new code generation strategy.
For the unit test case, this patch could save 15% of the
overall execution time. For the luxmark with STRICT conformance
mode, it saves about 40% of the build time.
v3: fix some minor bugs.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Thu, 24 Apr 2014 10:18:40 +0000 (18:18 +0800)]
GBE: fixed a potential scalarize bug.
We need to append an extract instruction when doing a bitcast to
a vector. Otherwise, we may trigger an assert as the extract
instruction would use an undefined vector.
After this patch, it becomes safe to run many rounds of the
scalarize pass.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Song, Ruiling" <ruiling.song@intel.com>
Guo Yejun [Wed, 23 Apr 2014 18:18:00 +0000 (02:18 +0800)]
add support for cross compiler
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Zhigang Gong [Thu, 24 Apr 2014 02:09:13 +0000 (10:09 +0800)]
GBE: refine the gen program strategy.
The limitRegisterPressure option only affects the MAD pattern matching,
which does not bring a noticeable difference here, so I changed it to
always be false. Also add the reserved spill registers to the strategy
structure. Thus we can try to build a program with the following
strategies, in order:
1. SIMD16 without spilling.
2. SIMD16 with 10 spill registers and a default spill threshold of 16.
When more than 16 registers need to be spilled, we fall back to the
next strategy.
3. SIMD8 without spilling.
4. SIMD8 with 8 spill registers.
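The fallback chain above can be sketched as a table walk. All names, the struct layout, and the success criterion are illustrative assumptions, not the real gen program builder:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding of the four strategies listed above. */
struct strategy { int simd_width; int reserved_spill_regs; int spill_threshold; };

static const struct strategy strategies[] = {
  { 16,  0,  0 },  /* 1. SIMD16, no spilling                  */
  { 16, 10, 16 },  /* 2. SIMD16 with spilling, threshold 16   */
  {  8,  0,  0 },  /* 3. SIMD8, no spilling                   */
  {  8,  8, -1 },  /* 4. SIMD8 with spilling, no threshold    */
};

/* Stand-in for the real code generator: succeeds when the number of
 * registers the kernel must spill fits the strategy's budget. */
static int try_compile(const struct strategy *s, int regs_to_spill) {
  if (regs_to_spill == 0) return 1;
  if (s->reserved_spill_regs == 0) return 0;            /* spilling disabled */
  if (s->spill_threshold >= 0 && regs_to_spill > s->spill_threshold)
    return 0;                                           /* over threshold    */
  return 1;
}

/* Walk the strategies in order and return the SIMD width that sticks. */
static int build_program(int regs_to_spill) {
  for (size_t i = 0; i < sizeof(strategies) / sizeof(strategies[0]); i++)
    if (try_compile(&strategies[i], regs_to_spill))
      return strategies[i].simd_width;
  return 0;
}
```

A kernel that spills nothing compiles as SIMD16 immediately; one that would spill more than the threshold falls through to SIMD8.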
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Thu, 10 Apr 2014 06:33:48 +0000 (14:33 +0800)]
GBE: fixed the undefined phi value's liveness analysis.
If a phi component is undef coming from one of the predecessors,
we should not include it in that predecessor's liveout registers.
Otherwise, the phi register's liveness may be extended up to
basic block zero, which is not good.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Ruiling Song [Wed, 23 Apr 2014 06:31:29 +0000 (14:31 +0800)]
GBE: Try to expire some registers before register allocation
1. This frees unused registers as soon as possible, so it becomes
easier to allocate contiguous registers.
2. We previously met many hidden register liveness issues. Let's try
to reuse expired registers early; then wrong liveness may be
easier to find.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Ruiling Song [Wed, 23 Apr 2014 02:56:50 +0000 (10:56 +0800)]
GBE: Optimize byte gather read using untyped read.
Untyped reads perform better than byte gather reads.
Some performance tests in opencv doubled after this patch.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Guo Yejun [Fri, 18 Apr 2014 05:42:29 +0000 (13:42 +0800)]
add test for __gen_ocl_simd_any and __gen_ocl_simd_all
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Guo Yejun [Fri, 18 Apr 2014 05:42:16 +0000 (13:42 +0800)]
support __gen_ocl_simd_any and __gen_ocl_simd_all
short __gen_ocl_simd_any(short x):
if x is non-zero in any of the active threads in the same SIMD,
the return value for all these threads is non-zero; otherwise, zero is returned.
short __gen_ocl_simd_all(short x):
only if x is non-zero in all of the active threads in the same SIMD,
the return value for all these threads is non-zero; otherwise, zero is returned.
For example, to check whether a special value exists in a global
buffer, use one SIMD to do the search in parallel; the whole SIMD
can stop the task once the value is found. The key kernel code looks like:
for(; ; ) {
...
if (__gen_ocl_simd_any(...))
break; //the whole SIMD stop the searching
}
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Sun, Yi [Tue, 8 Apr 2014 02:53:40 +0000 (10:53 +0800)]
Delete the printing of dynamic statistics line.
summary:
---------------------
1. Delete the printing of the dynamic statistics line.
2. Add a function to catch signals (like CTRL+C, core dumps, ...);
if one is caught, remind the user of the signal name.
core dumped example:
...
displacement_map_element() [SUCCESS]
compiler_clod() Interrupt signal (SIGSEGV) received.
summary:
----------
total: 657
run: 297
pass: 271
fail: 26
pass rate: 0.960426
Signed-off-by: Yi Sun <yi.sun@intel.com>
Signed-off-by: Yangwei Shui <yangweix.shui@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Ruiling Song [Tue, 15 Apr 2014 08:53:17 +0000 (16:53 +0800)]
GBE: Implement instruction compact.
A native GEN instruction takes 2*64 bits, but GEN also supports compact
instructions which take only 64 bits. To make the code easily understood,
GenInstruction now only stands for 64 bits of memory, and
GenNativeInstruction & GenCompactInstruction are used to represent normal
(native) and compact instructions.
After this change, it is not easy to map a SelectionInstruction distance to
an ASM distance, as the instructions in between may be compacted. To avoid
introducing too much complexity, JMP, IF, ENDIF and NOP will NEVER be compacted.
Some experiments with LuxMark show it reduces instruction memory by about 20%,
but sadly no performance improvement was observed.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Zhigang Gong [Thu, 17 Apr 2014 09:41:58 +0000 (17:41 +0800)]
GBE: fix a Q64 spilling bug in non-simd8 mode.
For simd16 mode, the payload needs to occupy 2 GRFs, not the hard-coded 1 GRF.
This patch fixes the corresponding regression in piglit.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Thu, 17 Apr 2014 06:59:00 +0000 (14:59 +0800)]
GBE: work around baytrail-t hang issue.
There is an unknown issue with the baytrail-t platform. It hangs at
utest's compiler_global_constant case. After some investigation,
it turned out to be related to the DWORD GATHER READ send message
on the constant cache data port. Changing to the data cache data
port works around that hang issue.
Now we only fail one more case on baytrail-t compared to the IVB
desktop platform, which is:
profiling_exec()    [FAILED]
Error: Too large time from submit to start
That may be caused by a kernel-related issue, and that bug will not
cause serious problems for normal kernels. So after this patch, the
baytrail-t platform should be in pretty good shape with beignet.
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: He Junyan <junyan.he@inbox.com>
Zhigang Gong [Thu, 17 Apr 2014 06:56:08 +0000 (14:56 +0800)]
GBE/Runtime: pass the device id to the compiler backend.
For some reason, we need to know the current target device id
at the code generation stage. This patch introduces such
a mechanism. This is preparation for fixing the weird baytrail
hang issue.
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: He Junyan <junyan.he@inbox.com>
Zhigang Gong [Thu, 17 Apr 2014 05:11:50 +0000 (13:11 +0800)]
Runtime: increase the build log buffer size to 1000.
200 is too small sometimes.
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: He Junyan <junyan.he@inbox.com>
Chuanbo Weng [Thu, 10 Apr 2014 08:17:53 +0000 (16:17 +0800)]
Runtime: Add support for Bay Trail-T device.
According to the Bay Trail-T spec, it has 4 EUs and each
EU has 8 threads. So the compute unit count is 32 and the maximum
work group size is 32 * 8 = 256.
Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Jesper Pedersen [Sun, 13 Apr 2014 13:58:12 +0000 (09:58 -0400)]
Mark SandyBridge as unsupported
Signed-off-by: Jesper Pedersen <jesper.pedersen@comcast.net>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Zhenyu Wang [Thu, 10 Apr 2014 10:09:44 +0000 (18:09 +0800)]
Use pkg-config to check modules
Instead of using pre-defined paths for dependent modules, e.g. libdrm,
libdrm_intel, etc., use the pkg-config helper for cmake. This makes
it easy to work with a developer's own built versions of those dependencies.
Also remove the libGL dependency for 'gbe_bin_generator', which is not required.
libutest.so still requires libGL for now, but that might be fixed by checking
the real GL dependency.
v2: Fix build with mesa source (92e6260) and link the required EGL lib with utests too.
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Ruiling Song [Fri, 11 Apr 2014 06:48:18 +0000 (14:48 +0800)]
GBE: Enable CFG printer.
export OCL_OUTPUT_CFG=1
or export OCL_OUTPUT_CFG_ONLY=1
and it will output a .dot file of the CFG for each compiled kernel.
CFG_ONLY means the pure CFG without llvm IR.
You can use xdot to view the .dot files.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Ruiling Song [Fri, 11 Apr 2014 06:48:17 +0000 (14:48 +0800)]
Runtime: increase batch size to 8K.
We hit an assert on max_reloc in libdrm. We simply work around it by
increasing the batch size, so libdrm can allow more bo relocations.
This fixes the assert when running the ocl HaarFixture test under simd8.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Ruiling Song [Fri, 11 Apr 2014 06:48:16 +0000 (14:48 +0800)]
enable mad for mul+sub.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Zhigang Gong [Wed, 9 Apr 2014 16:05:26 +0000 (00:05 +0800)]
GBE: Enable register spilling for SIMD16.
Enable register spilling for SIMD16 mode. Introduce a
new environment variable OCL_SIMD16_SPILL_THRESHOLD to
control the threshold of simd16 register spilling. The default
value is 16, meaning that when more than 16 registers are spilled,
beignet will fall back to simd8.
Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Wed, 9 Apr 2014 10:25:22 +0000 (18:25 +0800)]
GBE: Optimize read_image performance for CL_ADDRESS_CLAMP..
The previous workaround (due to a hardware restriction) was to use
CL_ADDRESS_CLAMP_TO_EDGE to implement CL_ADDRESS_CLAMP, which is
not very efficient, especially because of the boundary checking
overhead: the root cause is that we need to check each pixel's coordinates.
Now we change to use the LD message to implement CL_ADDRESS_CLAMP. For
integer coordinates, we don't need to do any boundary checking. And for
float coordinates, we only need to check whether they are less than zero,
which is much simpler than before.
This patch brings about a 20% to 30% performance gain for luxmark's
medium and simple scenes.
v2:
simplify the READ_IMAGE0.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Tue, 8 Apr 2014 09:58:15 +0000 (17:58 +0800)]
GBE: fixed two 'long' related bugs.
Some hard-coded numbers were not modified correctly in a previous patch.
Now fix them. This passes the corresponding regressions in the
piglit test.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Wed, 2 Apr 2014 06:36:19 +0000 (14:36 +0800)]
GBE: fix the flag usage of those long/64 bit instruction.
Make the flag allocation aware that long/64-bit instructions
will use flag0.1, and don't hard-code f0.1 at the gen_context
stage.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Thu, 27 Mar 2014 16:38:29 +0000 (00:38 +0800)]
GBE: Optimize the bool register allocation/processing.
Previously, we had a global flag allocation implementation.
After some analysis, I found that global flag allocation is not
the best solution here.
As for cross-block references of bool values, we have to
combine them with the current emask. There is no obvious advantage to
allocating a dedicated physical flag register for those cross-block usages.
We just need to allocate physical flags within each BB. We need to handle
the following cases:
1. The bool's liveness never extends beyond this BB, and the bool is only
used as a dst register or a pred register. Such a bool value can be
allocated in a physical flag only if there are enough physical flags.
We already identify those bools at the instruction selection stage and
put them in the flagBooleans set.
2. The bool is defined in another BB and used in this BB; then we need
to prepend an instruction at the position where we use it.
3. The bool is defined in this BB but is also used as some instruction's
source register rather than the pred register. We have to keep the normal
grf (UW8/UW16) register for this bool. For some CMP instructions, we need to
append a SEL instruction to convert the flag to the grf register.
4. Even for the spilling flag, if there is only one spilling flag, we will
also try to reuse the temporary flag register later. This requires that all
instructions get their flag at the instruction selection stage and do not
use the physical flag number directly at the gen_context stage. Otherwise,
it may break the algorithm here.
We track all the validated bool values to avoid any redundant
validation of the same flag. But if there are not enough physical flags,
we have to spill a previously allocated physical flag. The spilling
policy is to spill the allocated flag whose live range ends last.
Let's see a real example of the improvement of this patch.
Take compiler_vect_compare as an example; before this patch, the
instructions are as below:
( 24) cmp.g.f1.1(8) null g110<8,8,1>D 0D { align1 WE_normal 1Q };
( 26) cmp.g.f1.1(8) null g111<8,8,1>D 0D { align1 WE_normal 2Q };
( 28) (+f1.1) sel(16) g109<1>UW g1.2<0,1,0>UW g1<0,1,0>UW { align1 WE_normal 1H };
( 30) cmp.g.f1.1(8) null g112<8,8,1>D 0D { align1 WE_normal 1Q };
( 32) cmp.g.f1.1(8) null g113<8,8,1>D 0D { align1 WE_normal 2Q };
( 34) (+f1.1) sel(16) g108<1>UW g1.2<0,1,0>UW g1<0,1,0>UW { align1 WE_normal 1H };
( 36) cmp.g.f1.1(8) null g114<8,8,1>D 0D { align1 WE_normal 1Q };
( 38) cmp.g.f1.1(8) null g115<8,8,1>D 0D { align1 WE_normal 2Q };
( 40) (+f1.1) sel(16) g107<1>UW g1.2<0,1,0>UW g1<0,1,0>UW { align1 WE_normal 1H };
( 42) cmp.g.f1.1(8) null g116<8,8,1>D 0D { align1 WE_normal 1Q };
( 44) cmp.g.f1.1(8) null g117<8,8,1>D 0D { align1 WE_normal 2Q };
( 46) (+f1.1) sel(16) g106<1>UW g1.2<0,1,0>UW g1<0,1,0>UW { align1 WE_normal 1H };
( 48) mov(16) g104<1>F -nanF { align1 WE_normal 1H };
( 50) cmp.ne.f1.1(16) null g109<8,8,1>UW 0x0UW { align1 WE_normal 1H switch };
( 52) (+f1.1) sel(16) g96<1>D g104<8,8,1>D 0D { align1 WE_normal 1H };
( 54) cmp.ne.f1.1(16) null g108<8,8,1>UW 0x0UW { align1 WE_normal 1H switch };
( 56) (+f1.1) sel(16) g98<1>D g104<8,8,1>D 0D { align1 WE_normal 1H };
( 58) cmp.ne.f1.1(16) null g107<8,8,1>UW 0x0UW { align1 WE_normal 1H switch };
( 60) (+f1.1) sel(16) g100<1>D g104<8,8,1>D 0D { align1 WE_normal 1H };
( 62) cmp.ne.f1.1(16) null g106<8,8,1>UW 0x0UW { align1 WE_normal 1H switch };
( 64) (+f1.1) sel(16) g102<1>D g104<8,8,1>D 0D { align1 WE_normal 1H };
( 66) add(16) g94<1>D g1.3<0,1,0>D g120<8,8,1>D { align1 WE_normal 1H };
( 68) send(16) null g94<8,8,1>UD
data (bti: 1, rgba: 0, SIMD16, legacy, Untyped Surface Write) mlen 10 rlen 0 { align1 WE_normal 1H };
( 70) mov(16) g2<1>UW 0x1UW { align1 WE_normal 1H };
( 72) endif(16) 2 null { align1 WE_normal 1H };
After this patch, it becomes:
( 24) cmp.g(8) null g110<8,8,1>D 0D { align1 WE_normal 1Q };
( 26) cmp.g(8) null g111<8,8,1>D 0D { align1 WE_normal 2Q };
( 28) cmp.g.f1.1(8) null g112<8,8,1>D 0D { align1 WE_normal 1Q };
( 30) cmp.g.f1.1(8) null g113<8,8,1>D 0D { align1 WE_normal 2Q };
( 32) cmp.g.f0.1(8) null g114<8,8,1>D 0D { align1 WE_normal 1Q };
( 34) cmp.g.f0.1(8) null g115<8,8,1>D 0D { align1 WE_normal 2Q };
( 36) (+f0.1) sel(16) g109<1>UW g1.2<0,1,0>UW g1<0,1,0>UW { align1 WE_normal 1H };
( 38) cmp.g.f1.0(8) null g116<8,8,1>D 0D { align1 WE_normal 1Q };
( 40) cmp.g.f1.0(8) null g117<8,8,1>D 0D { align1 WE_normal 2Q };
( 42) mov(16) g106<1>F -nanF { align1 WE_normal 1H };
( 44) (+f0) sel(16) g98<1>D g106<8,8,1>D 0D { align1 WE_normal 1H };
( 46) (+f1.1) sel(16) g100<1>D g106<8,8,1>D 0D { align1 WE_normal 1H };
( 48) (+f0.1) sel(16) g102<1>D g106<8,8,1>D 0D { align1 WE_normal 1H };
( 50) (+f1) sel(16) g104<1>D g106<8,8,1>D 0D { align1 WE_normal 1H };
( 52) add(16) g96<1>D g1.3<0,1,0>D g120<8,8,1>D { align1 WE_normal 1H };
( 54) send(16) null g96<8,8,1>UD
data (bti: 1, rgba: 0, SIMD16, legacy, Untyped Surface Write) mlen 10 rlen 0 { align1 WE_normal 1H };
( 56) mov(16) g2<1>UW 0x1UW { align1 WE_normal 1H };
( 58) endif(16) 2 null { align1 WE_normal 1H };
It reduces the instruction count from 25 to 18, saving about 28% of the
instructions.
v2:
Fix some minor bugs.
Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Fri, 28 Mar 2014 06:57:51 +0000 (14:57 +0800)]
Silence some compilation warnings.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Thu, 27 Mar 2014 08:27:18 +0000 (16:27 +0800)]
GBE: avoid using a temporary register in the CMP instruction.
Using one SEL instruction, we can easily transfer a flag
to a normal bool vector register with the correct mask.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Thu, 27 Mar 2014 07:54:49 +0000 (15:54 +0800)]
GBE: Add two helper scalar registers to hold 0 and all 1s.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Thu, 27 Mar 2014 05:30:00 +0000 (13:30 +0800)]
GBE: don't emit jmpi to next label.
As the following if will do the same thing, we don't need to
add the jmpi instruction.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Thu, 27 Mar 2014 02:05:56 +0000 (10:05 +0800)]
GBE: one instruction is enough for SEL_CMP now.
As we have if/endif, the SEL_CMP can now write to the
dst register directly with the correct emask.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Thu, 27 Mar 2014 02:00:57 +0000 (10:00 +0800)]
GBE: pass the OCL_STRICT_CONFORMANCE env to the backend.
Enable MAD pattern matching when strict conformance
is false.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Thu, 27 Mar 2014 01:36:46 +0000 (09:36 +0800)]
GBE: Only emit a long jump when jumping over many blocks.
In most cases, we don't need to emit a long jump at all.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Thu, 27 Mar 2014 06:54:15 +0000 (14:54 +0800)]
GBE: Don't need the emask/notemask/barriermask any more.
As we changed to use if/endif and changed the implementation of
the barrier, we don't need to maintain emask/notmask/barriermask
any more. Just remove them.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Tue, 18 Mar 2014 07:28:44 +0000 (15:28 +0800)]
GBE: Disable SPF and use JMPI + IF/ENDIF to handle each blocks.
When SPF (single program flow) is enabled, we always need to use f0
as the predicate of almost every instruction. This brings some
trouble when we want a two-level mask mechanism, for
example the SEL instruction and some BOOL operations. We
have to use more than one instruction to do that, which simply
introduces 100% overhead for those instructions.
v2:
fix the wrong assertion.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Mon, 17 Mar 2014 10:08:17 +0000 (18:08 +0800)]
GBE: Add if/endif/brc/brd instruction support.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Mon, 17 Mar 2014 10:01:03 +0000 (18:01 +0800)]
GBE: further optimize forward/backward jump.
We don't need to save f0 in the last part of the block.
Just use it directly.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Thu, 13 Mar 2014 10:54:31 +0000 (18:54 +0800)]
GBE: use S16 vector to represent bool.
The original purpose of using a flag or an S16 scalar to represent
a bool data type was to save register usage. But that brought too
much complexity to handle it correctly in every possible case. The
consequence is that we had to take too much care about bool
handling in many places in the instruction selection stage, and we
never handled all the cases correctly. The hardest part is
that we can't touch just part of the bits in an S16 scalar register;
there is no instruction to support that. So if a bool comes from
another BB, or even from the same BB but there is
a backward JMP and the bool is still a possible livein register,
we need extra instructions to keep the inactive lanes'
bits at their original values.
I changed to use an S16 vector to represent the bool type; then all
the complicated cases are gone. The only big side effect is
the register consumption. But considering that a real
application will not have many bools active concurrently, this
may not be a big issue.
I measured the performance impact using luxmark and only
observed a 2%-3% performance regression. Some easy
performance optimization opportunities remain, such as reducing
the unnecessary MOVs between flag and bool within the same
block. I think this performance regression is not a
big deal. Especially since this change makes the following if/endif
optimization a little bit easier.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Thu, 13 Mar 2014 04:33:39 +0000 (12:33 +0800)]
GBE: fix one misusage of flag in forward jump.
A forward jump instruction does not need the pred when comparing
the pcip with the next label. We should use the temporary flag
register.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Wed, 12 Mar 2014 09:08:15 +0000 (17:08 +0800)]
GBE: use a uniform style to calculate register size for curbe allocation.
Concentrate the register allocation in one place, and don't
use hard-coded sizes when doing curbe register allocation. All
register size allocation should use the same method.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Wed, 12 Mar 2014 08:51:57 +0000 (16:51 +0800)]
GBE: fix the wrong usage of stack pointer and stack buffer.
The stack pointer and the stack buffer should be two different virtual
registers; one is a vector and the other is a scalar. The reason the
previous implementation worked is that it searched the curbe offset
and made a new stack buffer register manually, which is not good.
Now fix it and remove that hack. We actually don't need
to use the curbe offset manually after the allocation.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Wed, 12 Mar 2014 06:43:08 +0000 (14:43 +0800)]
GBE: refine the "scalar" register handling.
The scalar register's actual meaning should be "uniform register";
a non-uniform register is a varying register. For further
uniform analysis and bool data optimization, this patch
makes uniformity a new register attribute. We
can mark each newly created register as a uniform or varying
register.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Wed, 26 Mar 2014 10:27:40 +0000 (18:27 +0800)]
GBE: Remove BBs if it only has a label instruction.
v2:
add an extra createCFGSimplificationPass right before the createGenPass.
And don't remove BB at GEN IR layer.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Wed, 26 Mar 2014 05:45:56 +0000 (13:45 +0800)]
GBE: Add a new pass to handle barrier function's noduplicate attribute correctly.
This pass removes or adds the noduplicate function attribute for barrier functions.
Basically, we want to set NoDuplicate on the __gen_barrier_xxx functions. But if
a sub-function calls those barrier functions, the sub-function will not be inlined
by llvm's inlining pass, which is what we don't want. As inlining such a function
into the caller is safe, we just don't want it to duplicate the call. So introduce
this pass to remove the NoDuplicate function attribute before the inlining pass and
restore it afterwards.
v2:
fix the module changed check.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Yi Sun [Tue, 1 Apr 2014 08:31:26 +0000 (16:31 +0800)]
Statistics of case running
summary:
-----------------
1. Add struct RStatistics to count the passed number (passCount), failed number (failCount), and finished run number (finishrun).
2. Print a statistics line; if the terminal is too narrow, don't print it:
......
test_load_program_from_bin() [SUCCESS]
profiling_exec() [SUCCESS]
enqueue_copy_buf() [SUCCESS]
[run/total: 656/656] pass: 629; fail: 25; pass rate: 0.961890
3. If a case crashes, count it as failed, and add a function to show the statistics summary.
4. When all cases have finished, list a summary like the following:
summary:
----------
total: 656
run: 656
pass: 629
fail: 25
pass rate: 0.961890
5. If you run ./utest_run &> log, the log will be a little messy; try the following command to analyse the log:
sed 's/\r/\n/g' log | egrep "\w*\(\)" | sed -e 's/\s//g'
After analysis:
-----------------
......
builtin_minmag_float2()[SUCCESS]
builtin_minmag_float4()[SUCCESS]
builtin_minmag_float8()[SUCCESS]
builtin_minmag_float16()[SUCCESS]
builtin_nextafter_float()[FAILED]
builtin_nextafter_float2()[FAILED]
builtin_nextafter_float4()[FAILED]
......
6. Fix one issue: print out the crashed case name.
7. Delete the debug line in utests/compiler_basic_arithmetic.cpp which
printed the kernel name.
8. Define function statistics() in struct UTest, which is called by "utest_run -a/-c/-n".
We just call this function to run each case and print the statistics line.
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Junyan He [Wed, 26 Mar 2014 10:28:02 +0000 (18:28 +0800)]
Add one test case specifically for unaligned buffer copy.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Junyan He [Wed, 26 Mar 2014 10:27:56 +0000 (18:27 +0800)]
Optimize the unaligned buffer copy logic
Because the byte-aligned read and write send instructions are
very slow, we optimize the code to avoid using them.
We separate the unaligned copy into three cases:
1. src and dst have the same %4 unaligned offset.
Then we just need to handle the first and last dword.
2. src has a bigger %4 unaligned offset than dst.
We need to do some shifting and merging between src[i]
and src[i+1].
3. The last case: src has a smaller %4 unaligned offset.
Then we need to do the same for src[i-1] and src[i].
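The three-way case split can be sketched as a classifier on the dword mis-alignment of src and dst. The enum names and function are illustrative, not the actual cl kernel code:

```c
#include <assert.h>

enum copy_case { ALIGNED, SAME_OFFSET, SRC_BIGGER, SRC_SMALLER };

/* Classify a copy by the %4 (dword) mis-alignment of src and dst,
 * mirroring the three unaligned cases described above. */
static enum copy_case classify(unsigned src_off, unsigned dst_off) {
  unsigned s = src_off % 4, d = dst_off % 4;
  if (s == 0 && d == 0) return ALIGNED;
  if (s == d) return SAME_OFFSET;  /* 1. only first/last dword special */
  if (s > d)  return SRC_BIGGER;   /* 2. merge src[i] and src[i+1]     */
  return SRC_SMALLER;              /* 3. merge src[i-1] and src[i]     */
}
```

Each case maps to one of the three cl files added in the following commit.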
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Junyan He [Wed, 26 Mar 2014 10:27:48 +0000 (18:27 +0800)]
Add three copy cl files for Enqueue Copy usage.
Add these three cl files:
one for when src and dst are not aligned but have the same %4 offset,
a second for when the src's %4 offset is bigger than the dst's,
a third for when the src's %4 offset is smaller than the dst's.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Yongjia Zhang [Tue, 1 Apr 2014 09:16:46 +0000 (17:16 +0800)]
Add kernels performance output
If the environment variable OCL_OUTPUT_KERNEL_PERF is set to a non-zero
value, then after the executable program exits, beignet will output the
timing information of each executed kernel.
v2: fixed the patch's trailing whitespace problem.
v3: if OCL_OUTPUT_KERNEL_PERF is 1, the output will only
contain the time summary; if it is 2, the output will contain both the
time summary and details. Add 'Ave' and 'Dev' to the output: 'Ave' is
the average time per kernel per execution round, and 'Dev' is the
standard deviation of a kernel's executions relative to 'Ave'.
Signed-off-by: Yongjia Zhang <yongjia.zhang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Ruiling Song [Mon, 24 Mar 2014 01:46:21 +0000 (09:46 +0800)]
GBE: Fix register liveness issue under simd mode.
As we run in SIMD mode with a predication mask to indicate active lanes,
if a vreg is defined in a loop and there are some uses of the vreg outside the loop,
the definition point may be executed several times under *different* predication masks.
For these kinds of vregs, we must extend the vreg's liveness to the whole loop.
If we don't do this, its liveness is killed before the def point inside the loop.
If the vreg's corresponding physical register is assigned to another vreg during the
killed period, and the instructions before the kill point are re-executed with a different predication,
the inactive lanes of the vreg may be overwritten. Then the out-of-loop use will get wrong data.
This patch fixes the HaarFixture case in opencv.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Zhigang Gong [Mon, 17 Mar 2014 10:02:37 +0000 (18:02 +0800)]
GBE: Optimize the forward jump instruction.
As we already check at each BB's beginning whether all channels are inactive,
we don't really need to duplicate this check at the end of a forward jump.
This patch gets about a 25% performance gain for luxmark's median scene.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Yang Rong [Mon, 24 Mar 2014 09:21:40 +0000 (17:21 +0800)]
Refine the FCMP_ORD and FCMP_UNO.
If either src0 or src1 of FCMP_ORD/FCMP_UNO is a constant, the constant
value must be ordered; otherwise, llvm would have optimized the instruction to true/false.
So discard the constant value and only compare the other src.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
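At the C level, the reduction can be illustrated like this: an ordered check against a non-NaN constant collapses to a NaN check on the single remaining operand (function names are illustrative):

```c
#include <assert.h>
#include <math.h>

/* FCMP_ORD(a, b) means "neither a nor b is NaN". */
static int fcmp_ord(float a, float b) { return !isnan(a) && !isnan(b); }

/* When one operand is a constant, it is ordered by construction, so only
 * the other operand needs checking -- the constant can be discarded. */
static int fcmp_ord_const(float a) { return !isnan(a); }
```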
Yang Rong [Mon, 24 Mar 2014 08:27:31 +0000 (16:27 +0800)]
Refined the fmax and fmin builtins.
Because GEN's select instruction with cmod .l and .ge handles the NaN case,
use the compare and select instructions in Gen IR for fmax and fmin; they will be
optimized to one sel_cmp, so there is no need to check isnan.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: "Zou, Nanhai" <nanhai.zou@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Junyan He [Mon, 24 Mar 2014 07:34:43 +0000 (15:34 +0800)]
Add one test case for profiling test.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Ruiling Song [Wed, 19 Mar 2014 03:41:54 +0000 (11:41 +0800)]
GBE: make byte/short vload/vstore process one element each time.
Per the OCL spec, the computed address (p+offset*n) is 8-bit aligned for char
and 16-bit aligned for short in vloadn & vstoren. That is, we cannot assume that
vload4 with a char pointer is 4-byte aligned. The previous implementation made
Clang generate a load or store with alignment 4 which is in fact only alignment 1.
We need to find another way to optimize vloadn.
But before that, let's keep vloadn and vstoren working correctly.
This could fix the regression issue caused by byte/short optimization.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
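A sketch of the per-element approach in plain C (the real change is in the OpenCL frontend; `vload4_char` is an illustrative stand-in, not the actual builtin):

```c
#include <assert.h>
#include <stddef.h>

/* vload4 with a char pointer: p + offset*4 is only 1-byte aligned per the
 * OpenCL spec, so read one element at a time rather than one 32-bit word. */
static void vload4_char(signed char out[4], const signed char *p, size_t offset) {
    for (int i = 0; i < 4; i++)
        out[i] = p[offset * 4 + i];   /* byte loads: no alignment assumption */
}
```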
Yang Rong [Tue, 11 Mar 2014 06:43:51 +0000 (14:43 +0800)]
Add SROA and GVN pass to default optLevel.
SROA and GVN may introduce some integer types not supported by the backend.
Remove this type assert in GenWrite; when these types are found, set the unit to
invalid. If the unit is invalid, use optLevel 0, which does not include SROA and GVN, and
try again.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Yi Sun [Mon, 10 Mar 2014 03:32:12 +0000 (11:32 +0800)]
utests: Refine cases for sinpi.
The general algorithm reduces x to the range [-0.5, 0.5] and then calculates the result.
v2: Correct the algorithm of sinpi.
Add some input data temporarily; we're going to design and implement an input data generator similar to what Conformance does.
Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Yi Sun [Wed, 5 Mar 2014 05:59:26 +0000 (13:59 +0800)]
Move the definition of union SF to the header file utest_helper.hpp
Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Chuanbo Weng [Wed, 5 Mar 2014 16:08:15 +0000 (00:08 +0800)]
Add clGetMemObjectFdIntel() api
Use this API to share buffers between OpenCL and v4l2. After importing
the fd of an OpenCL memory object into v4l2, v4l2 can directly read frames
into this memory object via DMABUF, without a memory copy.
v2:
Check return value of cl_buffer_get_fd
Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Guo Yejun [Thu, 6 Mar 2014 16:59:38 +0000 (00:59 +0800)]
merge some state buffers into one buffer
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Yang Rong [Mon, 3 Mar 2014 03:25:19 +0000 (11:25 +0800)]
Fix a convert float to long bug.
Converting some special float values, slightly larger than LONG_MAX, to long with
saturation gives a wrong result. Simply use LONG_MAX when the float value equals LONG_MAX.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
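A hedged C sketch of the underlying issue for the 64-bit case: the float nearest to the maximum is 2^63, which is already out of range for a plain cast, so the conversion must clamp at that rounded bound (the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* (float)INT64_MAX rounds up to 2^63, which no int64 can hold, so a plain
 * cast is undefined there; compare against the rounded bound and clamp. */
static int64_t sat_f2ll(float f) {
    if (f != f) return 0;                                  /* NaN -> 0     */
    if (f >= 9223372036854775808.0f) return INT64_MAX;     /* f >= 2^63    */
    if (f <= -9223372036854775808.0f) return INT64_MIN;    /* f <= -2^63   */
    return (int64_t)f;                                     /* in-range cast */
}
```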
Ruiling Song [Fri, 7 Mar 2014 05:48:48 +0000 (13:48 +0800)]
GBE: Optimize byte/short load/store using untyped read/write
Scatter/gather are much worse than untyped read/write. So if we can pack
loads/stores of char/short to use the untyped message, just do it.
v2:
add some assert in splitReg()
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Ruiling Song [Fri, 7 Mar 2014 05:48:47 +0000 (13:48 +0800)]
GBE: Fix a potential issue if increase srcNum.
If we increase MAX_SRC_NUM for ir::Instruction, unpredictable behaviour may happen.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Ruiling Song [Fri, 7 Mar 2014 05:48:46 +0000 (13:48 +0800)]
GBE: make vload3 only read 3 elements.
Clang will align the vec3 load to vec4, so we have to do it in the frontend.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Ruiling Song [Fri, 28 Feb 2014 02:16:45 +0000 (10:16 +0800)]
GBE: Optimize scratch memory usage using register interval
As scratch memory is a limited resource in HW, and different
registers have the opportunity to share the same scratch memory,
I introduce an allocator for scratch memory management.
v2:
In order to reuse the registerFilePartitioner, I rename it to
SimpleAllocator, and derive ScratchAllocator & RegisterAllocator
from it.
v3:
fix a typo; the scratch size is 12KB.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
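A very reduced sketch of the reuse idea, assuming fixed-size slots; the real SimpleAllocator-derived code handles variable sizes and register intervals, so everything here (names, slot size) is illustrative:

```c
#include <assert.h>

/* Scratch reuse sketch: a bump pointer plus a free list of fixed-size
 * slots. A register whose interval has ended returns its slot, and a
 * later register reuses it instead of growing the 12KB scratch space. */
#define SLOT 64        /* illustrative slot size in bytes */
#define MAX_SLOTS 192  /* 12KB / 64 */

static int free_list[MAX_SLOTS], free_top = 0, bump = 0;

static int scratch_alloc(void) {   /* returns byte offset, -1 when full */
    if (free_top > 0) return free_list[--free_top];  /* reuse freed slot */
    if (bump >= MAX_SLOTS) return -1;                /* out of scratch   */
    return bump++ * SLOT;
}

static void scratch_free(int offset) { free_list[free_top++] = offset; }
```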
Guo Yejun [Thu, 27 Feb 2014 17:58:20 +0000 (01:58 +0800)]
GBE: show correct line number in build log
Sometimes we insert some code into the kernel, which
makes the line numbers reported in the build log
mismatch the line numbers in the kernel from the
programmer's view; use #line to correct it.
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
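A small C illustration of the #line mechanism the commit relies on (the file name and line value are made up):

```c
#include <assert.h>

/* After "#line N", the compiler numbers the *next* source line as N, so
 * diagnostics in injected code can point back at the user's own source. */
#line 100 "user_kernel.cl"
static int reported_line(void) { return __LINE__; }   /* reports 100 */
```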
Guo Yejun [Wed, 26 Feb 2014 22:54:26 +0000 (06:54 +0800)]
GBE: support getelementptr with ConstantExpr operand
Add support, during the LLVM IR -> Gen IR phase, for the case where the
first operand of getelementptr is a ConstantExpr.
A utest is also added.
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Guo Yejun [Thu, 20 Feb 2014 21:51:33 +0000 (05:51 +0800)]
GBE: add fast path for more math functions
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Zhigang Gong [Fri, 21 Feb 2014 05:09:20 +0000 (13:09 +0800)]
GBE: remove the useless get sampler info function.
We don't need to get the sampler info dynamically, so
remove the corresponding instruction.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Fri, 21 Feb 2014 04:50:55 +0000 (12:50 +0800)]
GBE: optimize read_image to avoid get sampler info dynamically.
Most of the time, the user uses a const sampler value in the kernel
directly, so we don't need to get the sampler value through a function
call. This way, the compiler front end can do much better optimization
than with dynamic sampler information retrieval. For luxmark's
median/simple case, this patch gets about a 30-45% performance gain.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Fri, 21 Feb 2014 02:40:08 +0000 (10:40 +0800)]
GBE: don't put a long live register to a selection vector.
If an element has a very long interval, we don't want to put it into a
vector, as that adds more pressure to register allocation.
With this patch, spilled registers are reduced by more than 20% for luxmark's
median scene benchmark (from 288 to 224).
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Wed, 19 Feb 2014 02:16:48 +0000 (10:16 +0800)]
GBE: prepare to optimize generic selection vector allocation.
Move the selection vector allocation after the register interval
calculation.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Wed, 19 Feb 2014 02:47:46 +0000 (10:47 +0800)]
GBE: fixed a potential bug in 64 bit instruction.
Current selection vector handling requires the dst/src
vector is starting at dst(0) or src(0).
v2:
fix an assertion.
v3:
fix a bug in gen_context.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Wed, 19 Feb 2014 08:36:33 +0000 (16:36 +0800)]
GBE: fix the overflow bug in register spilling.
Change to use int32 to represent the maxID.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Tue, 18 Feb 2014 10:32:33 +0000 (18:32 +0800)]
GBE: code cleanup for read_image/write_image.
Remove some useless instructions and make the read/write_image
more readable.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Tue, 18 Feb 2014 09:41:05 +0000 (17:41 +0800)]
GBE: fixed the incorrect max_dst_num and max_src_num.
Some I64 instructions use more than 11 dst registers;
this patch changes the max src number to 16 and adds an assertion
to check whether we run into this type of issue again.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Tue, 18 Feb 2014 09:19:41 +0000 (17:19 +0800)]
GBE: Optimize write_image instruction for simd8 mode.
In simd8 mode, we can put the u,v,w,x,r,g,b,a into
a selection vector directly and don't need to
assign those values again.
As an example, the following code, which does a simple image
copy, is generated without this patch:
(26 ) (+f0) mov(8) g113<1>F g114<8,8,1>D { align1 WE_normal 1Q };
(28 ) (+f0) send(8) g108<1>UD g112<8,8,1>F
sampler (3, 0, 0, 1) mlen 2 rlen 4 { align1 WE_normal 1Q };
(30 ) mov(8) g99<1>UD 0x0UD { align1 WE_all 1Q };
(32 ) mov(1) g99.7<1>UD 0xffffUD { align1 WE_all };
(34 ) mov(8) g103<1>UD 0x0UD { align1 WE_all 1Q };
(36 ) (+f0) mov(8) g100<1>UD g117<8,8,1>UD { align1 WE_normal 1Q };
(38 ) (+f0) mov(8) g101<1>UD g114<8,8,1>UD { align1 WE_normal 1Q };
(40 ) (+f0) mov(8) g104<1>UD g108<8,8,1>UD { align1 WE_normal 1Q };
(42 ) (+f0) mov(8) g105<1>UD g109<8,8,1>UD { align1 WE_normal 1Q };
(44 ) (+f0) mov(8) g106<1>UD g110<8,8,1>UD { align1 WE_normal 1Q };
(46 ) (+f0) mov(8) g107<1>UD g111<8,8,1>UD { align1 WE_normal 1Q };
(48 ) (+f0) send(8) null g99<8,8,1>UD
renderunsupported target 5 mlen 9 rlen 0 { align1 WE_normal 1Q };
(50 ) (+f0) mov(8) g1<1>UW 0x1UW { align1 WE_normal 1Q };
L1:
(52 ) mov(8) g112<1>UD g0<8,8,1>UD { align1 WE_all 1Q };
(54 ) send(8) null g112<8,8,1>UD
thread_spawnerunsupported target 7 mlen 1 rlen 0 { align1 WE_normal 1Q EOT };
With this patch, we can optimize it as below:
(26 ) (+f0) mov(8) g106<1>F g111<8,8,1>D { align1 WE_normal 1Q };
(28 ) (+f0) send(8) g114<1>UD g105<8,8,1>F
sampler (3, 0, 0, 1) mlen 2 rlen 4 { align1 WE_normal 1Q };
(30 ) mov(8) g109<1>UD 0x0UD { align1 WE_all 1Q };
(32 ) mov(1) g109.7<1>UD 0xffffUD { align1 WE_all };
(34 ) mov(8) g113<1>UD 0x0UD { align1 WE_all 1Q };
(36 ) (+f0) send(8) null g109<8,8,1>UD
renderunsupported target 5 mlen 9 rlen 0 { align1 WE_normal 1Q };
(38 ) (+f0) mov(8) g1<1>UW 0x1UW { align1 WE_normal 1Q };
L1:
(40 ) mov(8) g112<1>UD g0<8,8,1>UD { align1 WE_all 1Q };
(42 ) send(8) null g112<8,8,1>UD
thread_spawnerunsupported target 7 mlen 1 rlen 0 { align1 WE_normal 1Q EOT };
This patch could save about 8 instructions per write_image.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Tue, 18 Feb 2014 06:40:59 +0000 (14:40 +0800)]
GBE: optimize sample instruction.
The U,V,W registers could be allocated to a selection vector directly.
Then we can save some MOV instructions for the read_image functions.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
xiuli pan [Fri, 21 Feb 2014 08:25:20 +0000 (16:25 +0800)]
Change the order of the code
Fix the 66K problem in the OpenCV testing.
The bug was caused by the incorrect order
of the code, which made beignet
calculate the whole localsize of the kernel
file. Now the OpenCV test can pass.
Reviewed-by: Zhigang Gong <zhigang.gong@intel.com>
Yang Rong [Fri, 21 Feb 2014 08:54:39 +0000 (16:54 +0800)]
Fix a long DIV/REM hang.
There is a jmpi in long DIV/REM whose predication is any16/any8, so
we MUST AND the predication register with the emask; otherwise it
may loop forever.
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
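The hazard can be illustrated in plain C, with a 16-bit integer standing in for the flag register (names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* ANY16H looks at all 16 lane bits of the predicate, so stale bits in
 * lanes outside the execution mask (emask) could keep voting "loop
 * again" forever. ANDing with the emask removes the inactive lanes. */
static int any16h(uint16_t pred, uint16_t emask) {
    return (pred & emask) != 0;   /* only active lanes may keep the loop alive */
}
```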
Lv Meng [Tue, 14 Jan 2014 03:04:57 +0000 (11:04 +0800)]
GBE: improve precision of rootn
Signed-off-by: Lv Meng <meng.lv@intel.com>
Yi Sun [Thu, 20 Feb 2014 01:32:32 +0000 (09:32 +0800)]
Remove some unreasonable input values for rootn
The manual for the function pow() gives the following description:
"If x is a finite value less than 0,
and y is a finite noninteger,
a domain error occurs, and a NaN is returned."
That means we can't calculate rootn on the CPU as pow(x, 1.0/y), the form mentioned in the OpenCL spec.
E.g., when y=3 and x=-8, rootn should return -2, but pow(x, 1.0/y) returns a NaN.
I didn't find a multi-root math function in glibc.
Signed-off-by: Yi Sun <yi.sun@intel.com>
Yi Sun [Wed, 19 Feb 2014 06:12:03 +0000 (14:12 +0800)]
utests: add subnormal check by fpclassify.
Signed-off-by: Yi Sun <yi.sun@intel.com>
Signed-off-by: Shui yangwei <yangweix.shui@intel.com>
Yi Sun [Wed, 19 Feb 2014 06:04:52 +0000 (14:04 +0800)]
Change %.20f to %e.
This can make the error information more readable.
Signed-off-by: Yi Sun <yi.sun@intel.com>
Guo Yejun [Mon, 17 Feb 2014 21:30:27 +0000 (05:30 +0800)]
GBE: add param to switch the behavior of math func
Add OCL_STRICT_CONFORMANCE to switch the behavior of math funcs:
the funcs will be high precision with perf drops if it is 1, and a fast
path with good-enough precision will be selected if it is 0.
This change adds the code basis, with 'sin' and 'cos' implemented
as examples; support for other math functions will be added later.
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Yi Sun [Mon, 17 Feb 2014 03:32:47 +0000 (11:32 +0800)]
utests: Remove test cases for function 'tgamma' 'erf' and 'erfc'
Since the OpenCL conformance suite doesn't cover these functions at the moment,
we remove them temporarily.
Signed-off-by: Yi Sun <yi.sun@intel.com>
Ruiling Song [Mon, 17 Feb 2014 08:54:20 +0000 (16:54 +0800)]
Improve precision of sinpi/cospi
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Boqun Feng [Mon, 17 Feb 2014 01:49:26 +0000 (09:49 +0800)]
GBE: fix terminfo library linkage
In some distros, the terminal libraries are divided into two
libraries, one is tinfo and the other is ncurses; however, in
other distros there is only one single ncurses library with
all the functions.
To link the proper terminal library for LLVM, the find_library
macro in cmake can be used. In this patch, tinfo is preferred,
so that linkage behavior is not affected in distros with tinfo.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Boqun Feng [Sat, 15 Feb 2014 06:52:44 +0000 (14:52 +0800)]
utests: define python interpreter via cmake variable
The reason for this fix is in commit
5b64170ef5e3e78d038186fb1132b11a8fec308e.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Zhigang Gong [Fri, 14 Feb 2014 08:11:36 +0000 (16:11 +0800)]
CL: make the scratch size as a device resource attribute.
Actually, the scratch size is much like the local memory size,
which is device-dependent information.
This patch puts the scratch mem size into the device attribute
structure. And when a kernel needs more than the maximum scratch
memory, we just return an out-of-resource error rather than triggering
an assertion.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Song, Ruiling <ruiling.song@intel.com>
Guo Yejun [Thu, 13 Feb 2014 03:59:48 +0000 (11:59 +0800)]
fix typo: blobTempName is assigned but not used
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>