Zhigang Gong [Wed, 19 Feb 2014 02:47:46 +0000 (10:47 +0800)]
GBE: fixed a potential bug in 64 bit instruction.
Current selection vector handling requires that the dst/src
vector start at dst(0) or src(0).
v2:
fix an assertion.
v3:
fix a bug in gen_context.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Wed, 19 Feb 2014 08:36:33 +0000 (16:36 +0800)]
GBE: fix the overflow bug in register spilling.
Change to use int32 to represent the maxID.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Tue, 18 Feb 2014 10:32:33 +0000 (18:32 +0800)]
GBE: code cleanup for read_image/write_image.
Remove some useless instructions and make the read/write_image
more readable.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Tue, 18 Feb 2014 09:41:05 +0000 (17:41 +0800)]
GBE: fixed the incorrect max_dst_num and max_src_num.
Some I64 instructions use more than 11 dst registers, so
this patch changes the max src number to 16 and adds an assertion
to catch this type of issue if we run into it again.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Tue, 18 Feb 2014 09:19:41 +0000 (17:19 +0800)]
GBE: Optimize write_image instruction for simd8 mode.
In simd8 mode, we can put u,v,w,x,r,g,b,a into
a selection vector directly and don't need to
assign those values again.
As an example, the following code, which does a simple image copy,
is generated without this patch:
(26 ) (+f0) mov(8) g113<1>F g114<8,8,1>D { align1 WE_normal 1Q };
(28 ) (+f0) send(8) g108<1>UD g112<8,8,1>F
sampler (3, 0, 0, 1) mlen 2 rlen 4 { align1 WE_normal 1Q };
(30 ) mov(8) g99<1>UD 0x0UD { align1 WE_all 1Q };
(32 ) mov(1) g99.7<1>UD 0xffffUD { align1 WE_all };
(34 ) mov(8) g103<1>UD 0x0UD { align1 WE_all 1Q };
(36 ) (+f0) mov(8) g100<1>UD g117<8,8,1>UD { align1 WE_normal 1Q };
(38 ) (+f0) mov(8) g101<1>UD g114<8,8,1>UD { align1 WE_normal 1Q };
(40 ) (+f0) mov(8) g104<1>UD g108<8,8,1>UD { align1 WE_normal 1Q };
(42 ) (+f0) mov(8) g105<1>UD g109<8,8,1>UD { align1 WE_normal 1Q };
(44 ) (+f0) mov(8) g106<1>UD g110<8,8,1>UD { align1 WE_normal 1Q };
(46 ) (+f0) mov(8) g107<1>UD g111<8,8,1>UD { align1 WE_normal 1Q };
(48 ) (+f0) send(8) null g99<8,8,1>UD
renderunsupported target 5 mlen 9 rlen 0 { align1 WE_normal 1Q };
(50 ) (+f0) mov(8) g1<1>UW 0x1UW { align1 WE_normal 1Q };
L1:
(52 ) mov(8) g112<1>UD g0<8,8,1>UD { align1 WE_all 1Q };
(54 ) send(8) null g112<8,8,1>UD
thread_spawnerunsupported target 7 mlen 1 rlen 0 { align1 WE_normal 1Q EOT };
With this patch, we can optimize it as below:
(26 ) (+f0) mov(8) g106<1>F g111<8,8,1>D { align1 WE_normal 1Q };
(28 ) (+f0) send(8) g114<1>UD g105<8,8,1>F
sampler (3, 0, 0, 1) mlen 2 rlen 4 { align1 WE_normal 1Q };
(30 ) mov(8) g109<1>UD 0x0UD { align1 WE_all 1Q };
(32 ) mov(1) g109.7<1>UD 0xffffUD { align1 WE_all };
(34 ) mov(8) g113<1>UD 0x0UD { align1 WE_all 1Q };
(36 ) (+f0) send(8) null g109<8,8,1>UD
renderunsupported target 5 mlen 9 rlen 0 { align1 WE_normal 1Q };
(38 ) (+f0) mov(8) g1<1>UW 0x1UW { align1 WE_normal 1Q };
L1:
(40 ) mov(8) g112<1>UD g0<8,8,1>UD { align1 WE_all 1Q };
(42 ) send(8) null g112<8,8,1>UD
thread_spawnerunsupported target 7 mlen 1 rlen 0 { align1 WE_normal 1Q EOT };
This patch could save about 8 instructions per write_image.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Tue, 18 Feb 2014 06:40:59 +0000 (14:40 +0800)]
GBE: optimize sample instruction.
The U,V,W registers could be allocated to a selection vector directly.
Then we can save some MOV instructions for the read_image functions.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
xiuli pan [Fri, 21 Feb 2014 08:25:20 +0000 (16:25 +0800)]
Change the order of the code
Fix the 66K problem in the OpenCV testing.
The bug was caused by the incorrect order
of the code, which made beignet
calculate the whole localsize of the kernel
file. Now the OpenCV test can pass.
Reviewed-by: Zhigang Gong <zhigang.gong@intel.com>
Yang Rong [Fri, 21 Feb 2014 08:54:39 +0000 (16:54 +0800)]
Fix a long DIV/REM hang.
There is a jmpi in long DIV/REM whose predication is any16/any8, so we
must AND the predication register with the emask; otherwise it may loop forever.
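A minimal Python sketch of why the masking matters, not the actual Gen code. An any16/any8 predicate is true if any flag bit is set, so a stale bit in a lane that is not active can keep the branch taken. The names any_lane, loop_again, flag and emask are hypothetical.

```python
def any_lane(flag, width=16):
    """Model an any16/any8 predicate: true if any of the low `width` bits is set."""
    return (flag & ((1 << width) - 1)) != 0


def loop_again(flag, emask, width=16):
    """Branch decision after ANDing the flag with the execution mask."""
    return any_lane(flag & emask, width)
```

With flag = 0xFF00 and emask = 0x00FF, the unmasked predicate is still true although no active lane asks for another iteration, while the masked one is false.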
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Lv Meng [Tue, 14 Jan 2014 03:04:57 +0000 (11:04 +0800)]
GBE: improve precision of rootn
Signed-off-by: Lv Meng <meng.lv@intel.com>
Yi Sun [Thu, 20 Feb 2014 01:32:32 +0000 (09:32 +0800)]
Remove some unreasonable input values for rootn
The manual for pow() contains the following description:
"If x is a finite value less than 0,
and y is a finite noninteger,
a domain error occurs, and a NaN is returned."
That means we can't calculate rootn on the CPU as pow(x,1.0/y), which is what the OpenCL spec mentions.
E.g. when y=3 and x=-8, rootn should return -2. But pow(x, 1.0/y) returns a NaN.
I didn't find a multi-root math function in glibc.
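For illustration only, a CPU reference that avoids the domain error could look like the Python sketch below. It is not the code used by the utests. The name rootn_ref is hypothetical, and zero and infinite inputs are not handled. Note that Python's math.pow raises ValueError for a negative base with a non-integer exponent, where C's pow() returns NaN.

```python
import math


def rootn_ref(x, n):
    """Reference n-th root that accepts negative x when n is odd."""
    if n == 0:
        return float("nan")  # avoid dividing by zero in 1.0 / n
    if x < 0:
        if n % 2 == 0:
            return float("nan")  # even root of a negative value has no real result
        # Odd root: take the root of the magnitude and restore the sign.
        return -math.pow(-x, 1.0 / n)
    return math.pow(x, 1.0 / n)
```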
Signed-off-by: Yi Sun <yi.sun@intel.com>
Yi Sun [Wed, 19 Feb 2014 06:12:03 +0000 (14:12 +0800)]
utests:add subnormal check by fpclassify.
Signed-off-by: Yi Sun <yi.sun@intel.com>
Signed-off-by: Shui yangwei <yangweix.shui@intel.com>
Yi Sun [Wed, 19 Feb 2014 06:04:52 +0000 (14:04 +0800)]
Change %.20f to %e.
This can make the error information more readable.
Signed-off-by: Yi Sun <yi.sun@intel.com>
Guo Yejun [Mon, 17 Feb 2014 21:30:27 +0000 (05:30 +0800)]
GBE: add param to switch the behavior of math func
Add OCL_STRICT_CONFORMANCE to switch the behavior of math funcs.
If it is 1, the funcs will be high precision at a performance cost. If
it is 0, a fast path with good enough precision will be selected.
This change adds the code basis, with 'sin' and 'cos' implemented
as examples. Support for other math functions will be added later.
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Yi Sun [Mon, 17 Feb 2014 03:32:47 +0000 (11:32 +0800)]
utests: Remove test cases for function 'tgamma' 'erf' and 'erfc'
Since OpenCL conformance doesn't cover these functions at the moment,
we remove them temporarily.
Signed-off-by: Yi Sun <yi.sun@intel.com>
Ruiling Song [Mon, 17 Feb 2014 08:54:20 +0000 (16:54 +0800)]
Improve precision of sinpi/cospi
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Boqun Feng [Mon, 17 Feb 2014 01:49:26 +0000 (09:49 +0800)]
GBE: fix terminfo library linkage
In some distros, the terminal libraries are divided into two
libraries, one is tinfo and the other is ncurses, however, for
other distros, there is only one single ncurses library with
all functions.
To link the proper terminal library for LLVM, the find_library
macro in cmake can be used. In this patch, tinfo is preferred,
so that linkage behavior in distros with tinfo is not affected.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Boqun Feng [Sat, 15 Feb 2014 06:52:44 +0000 (14:52 +0800)]
utests: define python interpreter via cmake variable
The reason for this fix is in commit
5b64170ef5e3e78d038186fb1132b11a8fec308e.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Zhigang Gong [Fri, 14 Feb 2014 08:11:36 +0000 (16:11 +0800)]
CL: make the scratch size as a device resource attribute.
Actually, the scratch size is much like the local memory size
which should be a device dependent information.
This patch puts the scratch mem size into the device attribute
structure. And when the kernel needs more than the maximum scratch
memory, we just return an out-of-resource error rather than triggering
an assertion.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Song, Ruiling <ruiling.song@intel.com>
Guo Yejun [Thu, 13 Feb 2014 03:59:48 +0000 (11:59 +0800)]
fix typo: blobTempName is assigned but not used
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Ruiling Song [Fri, 14 Feb 2014 07:04:26 +0000 (15:04 +0800)]
GBE: Support 64Bit register spill.
Now we support DWORD & QWORD register spill/fill.
v2:
only increase poolOffset by 1 when we meet a QWord register and poolOffset is 1.
v3:
allocate the reserved register pool in a unified way for src and dst registers.
when spilling a qword register, the payload register should be retyped as dword per the bottom/top logic.
put a limit on the scratch space memory size.
v4:
fix a typo.
increase the reserved registers from 6 to 8 for some complex instructions.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Igor Gnatenko [Thu, 13 Feb 2014 07:16:35 +0000 (11:16 +0400)]
cmake: Fix linking with LLVM/Terminfo
DEBUG: [ 9%] Building CXX object backend/src/CMakeFiles/gbe_bin_generater.dir/gbe_bin_generater.cpp.o
DEBUG: Linking CXX executable gbe_bin_generater
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x717): undefined reference to `setupterm'
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x727): undefined reference to `tigetnum'
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x730): undefined reference to `set_curterm'
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x738): undefined reference to `del_curterm'
Signed-off-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Zhigang Gong [Mon, 10 Feb 2014 08:28:37 +0000 (16:28 +0800)]
Bump to version 0.8.0.
This version brings many improvements compared to the last released version 0.3,
so we decided to bump the version to 0.8.0 directly. Before 1.0.0, we
have two steps left. One is performance optimization and the other is to
support OpenCL 1.2 by default.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Wed, 12 Feb 2014 07:20:45 +0000 (15:20 +0800)]
Docs: fix some markdown errors and add some new info.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Yang Rong [Wed, 12 Feb 2014 15:41:26 +0000 (23:41 +0800)]
Fix build errors in llvm3.5 only system.
Some header files are missing on a system that has only llvm3.5. If a previous llvm was
installed, these header files remain on the system even after uninstalling it, so the errors are not triggered there.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Zhigang Gong [Tue, 11 Feb 2014 09:51:50 +0000 (17:51 +0800)]
Fix the cmake problem in FindLLVM.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Zhigang Gong [Mon, 10 Feb 2014 08:28:36 +0000 (16:28 +0800)]
Update document for LLVM/Clang 3.5.
Also change the README.md to link to Beignet.mdw rather than to point to the wiki page.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Sat, 8 Feb 2014 06:12:03 +0000 (14:12 +0800)]
GBE: fixed the unsafe tmpnam_r.
Use mkstemps instead.
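The reason is that tmpnam-style functions only return a name, which leaves a window between choosing the name and creating the file, while mkstemps creates and opens the file in one step. As an analogy only (this is not the GBE code, which is C++), Python's tempfile.mkstemp follows the same pattern. The helper name make_temp_file is hypothetical.

```python
import os
import tempfile


def make_temp_file(suffix):
    """Create a uniquely named temporary file and return its path."""
    # mkstemp creates and opens the file atomically and returns (fd, path).
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    return path
```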
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Sat, 8 Feb 2014 06:12:02 +0000 (14:12 +0800)]
Silence compilation warning in sampler functions.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Sat, 8 Feb 2014 03:16:43 +0000 (11:16 +0800)]
Add clang/LLVM 3.5svn support.
clang/llvm 3.3 has some minor bugs, such as the vector ++/-- bug, which
was fixed in 3.4. But the 3.4 version introduces more severe OCL bugs, listed
below:
http://llvm.org/bugs/show_bug.cgi?id=18119
http://llvm.org/bugs/show_bug.cgi?id=18120
It seems that the community will only fix these bugs in the ToT version
rather than in the llvm 3.4 branch. I think we'd better enable clang/llvm
3.5 in beignet. Currently, 18120 is fixed in ToT, but 18119 still
breaks us. When 18119 gets fixed, I will switch the preferred version to
3.5.
Please note that when you build clang/llvm 3.5, you need to enable
cxx11 to make it compatible with beignet.
--enable-cxx11
v2:
fix the llvm3.4 issue.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Jon Nordby [Thu, 6 Feb 2014 18:50:59 +0000 (19:50 +0100)]
Make build compatible with Python 2.6
Implicit numbering of format specifiers ("{}") can only be used on Py2.7+,
and Py2.6 is still in use on, for instance, CentOS 6.5 and similar.
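A small illustration of the portable form (the function name describe is made up for this example):

```python
def describe(name, value):
    # "{0}={1}" works on Python 2.6 and later; "{}={}" needs Python 2.7+.
    return "{0}={1}".format(name, value)
```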
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Junyan He [Sun, 26 Jan 2014 10:16:12 +0000 (18:16 +0800)]
Fix the kernel file open problem in utest
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Tested-by: "Sun, Yi" <yi.sun@intel.com>
Zhigang Gong [Mon, 20 Jan 2014 10:44:03 +0000 (18:44 +0800)]
Update documents.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Zhigang Gong [Mon, 27 Jan 2014 01:26:21 +0000 (09:26 +0800)]
GBE: fixed the out-of-range JMPI.
For a conditional jump whose distance is out of the S15 range [-32768, 32767],
we need to use an inverted jmp followed by an add ip, ip, distance
to implement it. This is a little hacky, as we need to change the nop instruction
to an add instruction manually.
An alternative optimization would be to insert the
ADD instruction on demand. But that would need some extra analysis
of all the branching instructions, and we would need to adjust the distance
of every branch instruction whose start point and end point span
the inserted instruction.
After this patch, the luxrender's slg4 could render the scene "alloy"
correctly.
v2:
fix the unconditional branch too.
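The range check and the conditional-branch lowering can be sketched in Python as below. This is an illustration under assumptions, not the encoder code: the tuple mnemonics and the names fits_s15 and lower_conditional_branch are hypothetical, and the ip-relative adjustment of the distance is not modeled.

```python
S15_MIN, S15_MAX = -32768, 32767


def fits_s15(distance):
    """True if a jump distance can be encoded directly in a JMPI."""
    return S15_MIN <= distance <= S15_MAX


def lower_conditional_branch(distance):
    """Return a symbolic instruction list for one conditional branch."""
    if fits_s15(distance):
        return [("jmpi", distance)]
    # Out of range: the inverted jump skips the add when the original
    # condition is false; otherwise the add moves ip by the full distance.
    return [("jmpi.inverted", "skip_add"), ("add ip, ip", distance)]
```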
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang, Rong R <rong.r.yang@intel.com>
Yang Rong [Sun, 26 Jan 2014 08:36:58 +0000 (16:36 +0800)]
When local_work_size is null, try to choose a local_work_size.
After fixing all the failures found when local_work_size is not 1, re-enable it to
improve performance.
V2: refine to skip some useless loops.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Yang Rong [Tue, 28 Jan 2014 03:03:15 +0000 (11:03 +0800)]
Multiply register's hstride in suboffset.
When a register's hstride is not 0 or 1, suboffset gets the wrong element.
Also change some offsets that were already multiplied by hstride in hard-coded form.
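The arithmetic behind the fix, as a minimal Python sketch. It assumes hstride is expressed in elements (1, 2, 4) and does not model the hstride 0 case. The name element_byte_offset is hypothetical and this is not the GenRegister code.

```python
def element_byte_offset(base, index, hstride, type_size):
    """Byte offset of element `index` in a region with horizontal stride `hstride`."""
    # Without the hstride factor, a stride-2 region would address the
    # wrong element for every index greater than 0.
    return base + index * hstride * type_size
```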
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Zhigang Gong [Sun, 26 Jan 2014 06:07:14 +0000 (14:07 +0800)]
GBE: Implement complete register spill policy.
This patch implements a complete register spill policy.
When we need to spill a register, we always choose the
register which is in the spill candidate map and has the
maximum endpoint. One trick I used here is to merge both
the register's endpoint value and the register itself
into one single key. Then I can use one map to implement a
descending order map according to its value (the instruction
endpoint value). This patch supports spilling both vectors
and non-vectors.
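The key-merging trick can be sketched in Python as follows. This is an illustration only: the 32-bit split (REG_BITS) and the function names are assumptions, and max() stands in for the ordered map used by the patch.

```python
REG_BITS = 32  # assumed number of low bits reserved for the register id


def make_key(end_point, reg):
    """Pack the interval end point (high bits) and the register id (low bits)."""
    return (end_point << REG_BITS) | reg


def pick_spill_candidate(candidates):
    """candidates: iterable of (reg, end_point). Return the reg with the largest end point."""
    key = max(make_key(end, reg) for reg, end in candidates)
    return key & ((1 << REG_BITS) - 1)
```

Because the end point sits in the high bits, ordering by key orders by end point first, and the register id only breaks ties.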
I also moved the scratch memory allocation from
instruction selection to register allocation. We may later
use the internal interval information to reduce the scratch
memory consumption.
Another big change is that I don't perform the real
spill on the fly. Instead, I move the real spill to the end of
all register allocation, then spill all the registers that are
in the spillSet in one pass. This has the following advantages:
1. It only needs to loop over all instructions once.
2. When spilling one instruction, we know all the registers' status.
Then it's easy to know the correct scratch id for each register.
Actually, the previous implementation has a bug here.
The last part is to avoid the spill instruction restriction.
As ruiling pointed out, the spill instruction (scratch read/write)
doesn't support predication correctly for non-DW data types.
This patch avoids spilling any register of an unsupported type.
After this patch, both luxrender and opencv examples work fine on
my machine.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang, Rong R <rong.r.yang@intel.com>
Zhigang Gong [Fri, 24 Jan 2014 09:31:29 +0000 (17:31 +0800)]
GBE: prepare to optimize the register spilling policy.
It's better to choose the proper register to spill
rather than always spilling the current register. This patch
prepares for a better spilling policy.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang, Rong R <rong.r.yang@intel.com>
Zhigang Gong [Fri, 24 Jan 2014 04:33:10 +0000 (12:33 +0800)]
GBE: refine register allocation output.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Junyan He [Tue, 14 Jan 2014 08:43:42 +0000 (16:43 +0800)]
Add the device id for haswell GT.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Junyan He [Wed, 22 Jan 2014 06:02:30 +0000 (14:02 +0800)]
Fix the bug in removeLOADIs function.
The logic for replacing the dst of the instruction was
using the src number and getSrc. Fix this problem.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Zhigang Gong [Thu, 23 Jan 2014 06:25:55 +0000 (14:25 +0800)]
GBE: allow the bool registers to be expired.
After the previous patch's extra liveness analysis, we can allow bool
registers to be expired now.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
Zhigang Gong [Thu, 23 Jan 2014 06:15:05 +0000 (14:15 +0800)]
GBE: Implement an extra liveness analysis for the Gen backend.
Consider the following scenario. %100's normal liveness will start from Ln-1's
position. In normal analysis, Ln-1 is not Ln's predecessor, thus the liveness
of %100 will be passed to Ln and then will not be passed to L0.
But consider that we are running on a multi-lane vector machine with predication.
The unconditional BR in Ln-1 may be removed, and execution will enter Ln with a subset of
the inverse of Ln-1's predication. For example, if the active lanes when running Ln-1
are 0-7, then at Ln the active lanes are 8-15. Then at the end of Ln, a subset of 8-15
will jump to L0. If a register %10 is allocated the same GRF as %100, given the fact
that their normal liveness doesn't overlap, a subset of the 8-15 lanes will be
modified. If %10 and %100 are the same vector data type, then we are fine. But if
%100 is a float vector and %10 is a bool or short vector, then we hit a bug here.
L0:
...
%10 = 5
...
Ln-1:
%100 = 2
BR Ln+1
Ln:
...
BR(%xxx) L0
Ln+1:
%101 = %100 + 2;
...
The solution is to build an extra liveness analysis. We start with
those BBs with a backward jump, pass all the liveOut registers as extra liveIn
of the current BB, and then forward this extra liveIn to all the blocks. This is very similar
to the normal liveness analysis, just in the reverse direction.
Thanks to Yang Rong, who found this bug.
v2:
Don't remove livein when initializing the extra livein.
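One possible reading of the extra analysis, as a minimal Python sketch. It is not the actual Beignet implementation. It assumes that blocks are listed in layout order, that the normal liveOut sets are already computed, and that a backward jump goes from a later block to an earlier one. The function name and data layout are hypothetical.

```python
def extra_live_in(blocks, live_out, backward_edges):
    """blocks: labels in layout order.
    live_out: label -> set of registers live out of that block.
    backward_edges: list of (source, target) with target at or before source.
    Returns label -> set of extra live-in registers."""
    index = {label: i for i, label in enumerate(blocks)}
    extra = {label: set() for label in blocks}
    for src, target in backward_edges:
        assert index[target] <= index[src], "not a backward jump"
        # Everything live out of the jumping block is kept alive in every
        # block from the jump target up to the jumping block itself.
        for label in blocks[index[target]:index[src] + 1]:
            extra[label] |= live_out[src]
    return extra
```

In the scenario above, %100 is live out of Ln, so it becomes extra live-in for L0 through Ln and can no longer share a GRF with %10.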
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
Zhigang Gong [Wed, 22 Jan 2014 02:32:08 +0000 (10:32 +0800)]
GBE: increase the disassembly output's readability.
Add label information and the instruction address
prefix. Make the address consistent with fulsim.
And also make the register allocation output a little
bit prettier.
Now the disassembly output is as below:
compiler_ceil's disassemble begin:
L0:
(0 ) mov(1) f0<1>UW 0x0UW { align1 WE_all };
....
(32 ) (+f0) mov(16) g1<1>UW 0x1UW { align1 WE_normal 1H };
L1:
(34 ) mov(16) g112<1>UD g0<8,8,1>UD { align1 WE_all 1H };
...
compiler_ceil's disassemble end.
The register allocation output is as below:
%26 g2 .8 4 B [0 -> 0 ]
%28 g2 .12 4 B [0 -> 6 ]
%29 g2 .16 4 B [0 -> 9 ]
%30 g126.0 64 B [2 -> 3 ]
%31 g124.0 64 B [3 -> 4 ]
Please note that the register allocation output is not correct
when the register is a pure scalar (bool) register which is allocated
at the backend instruction selection stage. To be fixed.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
Zhigang Gong [Tue, 21 Jan 2014 05:15:39 +0000 (13:15 +0800)]
GBE: fixed a bug in sample instruction.
The sample instruction only has 3 source operands now, not 4.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
Zhigang Gong [Tue, 21 Jan 2014 04:13:04 +0000 (12:13 +0800)]
GBE: fix some incorrect gen ir output messages.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Zhigang Gong [Tue, 21 Jan 2014 00:34:29 +0000 (08:34 +0800)]
GBE: don't allocate grf for those bools which map to flag.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
Zhigang Gong [Mon, 20 Jan 2014 09:14:48 +0000 (17:14 +0800)]
build: work around an old version cmake bug.
On fedora core 15 with cmake 2.8.4, Yi experienced a build error.
It turns out that cmake may handle file directories with double
slashes incorrectly when the file is on a target's dependency list and
is an output file name of a custom command.
This small patch works around that issue.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Sun, Yi" <yi.sun@intel.com>
Guo Yejun [Mon, 20 Jan 2014 00:38:23 +0000 (08:38 +0800)]
GBE: use native exp instruction when precision is enough
For input data with enough precision, use the native exp instruction;
otherwise, use the software path to emulate the exp function.
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Junyan He [Mon, 20 Jan 2014 03:28:43 +0000 (11:28 +0800)]
Fix the bug of deleting a load instruction multiple times in lowering
When the load instruction has multi-value destinations, the load
instruction in the buildConstantPush function is replaced many
times, which can cause problems.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Yongjia Zhang [Fri, 17 Jan 2014 08:20:02 +0000 (16:20 +0800)]
Add utest compiler_private_data_overflow
utests: compiler_private_data_overflow aims to hit a stack larger than
1KB. It will fail with the old beignet, which allocates a 1KB stack
no matter how much stack the kernel actually uses.
Signed-off-by: Yongjia Zhang<zhang_yong_jia@126.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Yang Rong [Fri, 17 Jan 2014 08:22:56 +0000 (16:22 +0800)]
Add some native functions vector proto.
Native functions were defined as normal functions before, so they didn't need
vector protos. Now only native_exp2 and native_sqrt are defined as exp2 and sqrt,
so enable the vector protos for the others.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Yi Sun [Thu, 9 Jan 2014 07:56:04 +0000 (15:56 +0800)]
Remove builtin function fma from utest_math_gen.py.
Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Tue, 14 Jan 2014 03:10:00 +0000 (11:10 +0800)]
utests: Put all the generated kernel files to .gitignore at runtime.
As there are so many generated kernel files, it's annoying when I use
git status to check the modified and newly added files. This patch
puts all of them into the gitignore file, which makes things easier.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Fri, 17 Jan 2014 05:05:20 +0000 (13:05 +0800)]
GBE: fixed the hacky code of 3D image read/write.
The previous implementation used a magic virtual register(0) to
indicate that this is a 2D read/write. This is too hacky and may hide
bugs in the future. Now fix it without creating any dummy virtual
register.
Also clean up some useless enums.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Fri, 17 Jan 2014 04:26:47 +0000 (12:26 +0800)]
GBE: fix the hacky code of sampler offset handling.
The previous implementation used a virtual register to pass the offset
to the back end, which is too hacky. Now fix it.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Fri, 17 Jan 2014 02:42:25 +0000 (10:42 +0800)]
GBE: fixed the stack allocation.
Yongjia wrote a case that hit the previous 1KB limitation. I took a look at
the stack pointer related code and found that the implementation does not
comply with the OCL spec.
According to OpenCL spec, section 6.9:
d. Variable length arrays and structures with flexible (or unsized) arrays are not supported.
Thus all local variable sizes are constant, so we can
manipulate the stack pointer more easily: there is no need to do the alignment
calculation at runtime, and we can get the exact stack size and then
allocate the stack on demand. I still put a limitation there, which
is 64KB.
v2:
don't add the step if the step is zero.
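Because every size is a compile-time constant, the whole stack layout can be computed ahead of time. A minimal Python sketch of that idea follows. It is not the GBE code, the names are hypothetical, and alignments are assumed to be powers of two.

```python
STACK_LIMIT = 64 * 1024  # the 64KB limit mentioned above


def align_up(offset, alignment):
    """Round offset up to a multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)


def layout_stack(variables):
    """variables: list of (size, alignment). Return (offsets, total_size)."""
    offsets, top = [], 0
    for size, alignment in variables:
        top = align_up(top, alignment)  # alignment resolved ahead of time
        offsets.append(top)
        top += size
    if top > STACK_LIMIT:
        raise ValueError("private stack exceeds the limit")
    return offsets, top
```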
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Thu, 16 Jan 2014 03:56:15 +0000 (11:56 +0800)]
GBE: move the image info register allocation to GEN IR stage.
If we allocate the image info register at the code generation stage,
we miss the liveness calculation. Thus there is a potential risk
that some image information register's liveness data is incorrect, which
may cause very subtle bugs. Now fix it.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Thu, 16 Jan 2014 02:16:36 +0000 (10:16 +0800)]
GBE: move the image allocation to the GEN IR stage.
The image register should be translated to a const at the GEN IR
stage so that the register allocator does not allocate an unnecessary
register for the image id.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Wed, 15 Jan 2014 11:50:55 +0000 (19:50 +0800)]
GBE/Sampler: Simplify the sampler handling.
Move the sampler allocation to the Gen stage. Then we don't need to
maintain a fake key register, which may also confuse the later
register allocation phase.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Wed, 15 Jan 2014 07:26:07 +0000 (15:26 +0800)]
GBE: fixed a register liveness bug for getsamplerinfo instruction.
The previous implementation inserted the ocl::samplerinfo into the
instruction after the liveness calculation stage, so the liveness
information was not correct for that register and could cause some
test cases to fail. Now fix it.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Igor Gnatenko [Mon, 13 Jan 2014 21:31:39 +0000 (01:31 +0400)]
typo: bsically to basically
Signed-off-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Igor Gnatenko [Thu, 16 Jan 2014 07:19:53 +0000 (11:19 +0400)]
cmake: use libdir macros
Don't hardcode ${prefix}/lib. It is better to let the maintainer choose where to install libs.
We will use ${LIB_INSTALL_DIR}, which by default will point to
${CMAKE_INSTALL_PREFIX}/lib. But the maintainer can redefine it with
-DLIB_INSTALL_DIR=/usr/lib64 or similar.
Let's use libdir macros.
Signed-off-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Yang Rong [Wed, 15 Jan 2014 08:31:06 +0000 (16:31 +0800)]
Change compiler_function_argument3 to cover llvm.memcpy.
We found that clang would emit llvm.memcpy when assigning a struct to another
if sizeof(struct) > 64. Add an assignment to produce llvm.memcpy.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Yang Rong [Thu, 16 Jan 2014 07:38:30 +0000 (15:38 +0800)]
Add llvm intrinsic function llvm.memset and llvm.memcpy support.
SPIR 1.2 requires llvm.memcpy support, and llvm will emit llvm.memset sometimes.
So add a pass to lower these two intrinsic functions, and then inline them.
In the intrinsic lowering pass, find all llvm.memset and llvm.memcpy calls and replace
them with calls to __gen_memset_x and __gen_memcpy_xx, where x and xx stand for the address space.
Because this pass runs after clang, and unused functions seem to be stripped after clang,
implement the __gen_memset_x and __gen_memcpy_xx functions in the precompiled module and then link
them.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Yang Rong [Wed, 15 Jan 2014 08:31:04 +0000 (16:31 +0800)]
Use OCL_USE_PCH to control whether pch is used.
Junyan added the environment variable OCL_USE_PCH, but it was not used.
Enable it.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Lv Meng [Mon, 13 Jan 2014 05:50:25 +0000 (13:50 +0800)]
GBE: improve precision of remquo
Signed-off-by: Lv Meng <meng.lv@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Lv Meng [Mon, 13 Jan 2014 01:17:35 +0000 (09:17 +0800)]
GBE: improve precision of hypot
Signed-off-by: Lv Meng <meng.lv@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Lv Meng [Mon, 13 Jan 2014 00:54:02 +0000 (08:54 +0800)]
GBE: improve precision of exp10
Signed-off-by: Lv Meng <meng.lv@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Ruiling Song [Fri, 10 Jan 2014 05:39:43 +0000 (13:39 +0800)]
GBE: Improve precision of cbrt
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Ruiling Song [Fri, 10 Jan 2014 05:39:42 +0000 (13:39 +0800)]
GBE: Improve precision of atan2
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Ruiling Song [Fri, 10 Jan 2014 05:39:41 +0000 (13:39 +0800)]
GBE: Improve atan precision
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Ruiling Song [Fri, 10 Jan 2014 05:39:40 +0000 (13:39 +0800)]
GBE: improve precision of tan
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Ruiling Song [Fri, 10 Jan 2014 05:39:39 +0000 (13:39 +0800)]
GBE: Improve precision of sin/cos/sincos
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Junyan He [Wed, 15 Jan 2014 07:34:12 +0000 (15:34 +0800)]
Add -cl-fast-relaxed-math into incompatible opts and fix the PreprocessorOptions bug
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Zhigang Gong [Thu, 9 Jan 2014 09:36:37 +0000 (17:36 +0800)]
Refine the method to find pch and pcm files.
When compiling user kernels, we need to find the precompiled header
(pch) file and the precompiled module (pcm) file. The previous
implementation searched the build directory first and then the
system directory. That is not appropriate once beignet is packaged
for a distro, where the build directory need not be searched at all.
So the default search path is now the system directory only. For
developers, the build script now sets an environment variable so
that the gbe bin generator and the utests find the local pch and
pcm files first.
The only visible change is that, after building and before running
the utests, the user needs to set up the environment first. Just
invoke
. utest/setenv.sh
Then everything should work as before. setenv.sh also sets
OCL_KERNEL_PATH, so you no longer need to set it manually.
This patch also updates the documentation.
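A minimal sketch of the lookup order described above, not the actual
beignet code. The environment variable name HYPOTHETICAL_PCH_DIR, the
default system directory and the helper find_pch are all assumptions
made for illustration; the message does not name them.

```python
import os

def find_pch(filename, env=None,
             system_dir="/usr/lib/beignet",        # hypothetical install path
             override_var="HYPOTHETICAL_PCH_DIR",  # hypothetical variable name
             exists=os.path.isfile):
    """Return the first existing candidate path, or None.

    The developer override (set by something like utest/setenv.sh) is
    tried first; otherwise only the system directory is searched.
    """
    env = os.environ if env is None else env
    candidates = []
    override = env.get(override_var)
    if override:
        candidates.append(os.path.join(override, filename))
    candidates.append(os.path.join(system_dir, filename))
    for path in candidates:
        if exists(path):
            return path
    return None
```

The build directory never appears in the default candidate list, so an
installed package does not depend on it.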
v2:
add the missing setenv.sh.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Thu, 9 Jan 2014 06:20:29 +0000 (14:20 +0800)]
GBE: enable relocatable pch files.
By default, when a pch file is included, clang needs to make sure
the original header file is untouched. That is impossible when
we want to distribute a pch file to a new system, so we need to
use the relocatable pch feature provided by clang.
We now create two pch files. One is a relocatable pch file that
is installed to the system directory. The other is a local pch
file that is used at build time. We need both pch files because,
at build time, there is no ocl_stdlib.h in the system directory.
The local pch file is used only for beignet's own build and the
utests. All other applications use the installed pch/pcm files.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Wed, 8 Jan 2014 10:57:31 +0000 (18:57 +0800)]
CL: prepare to support ICD if the system has ocl-icd..
v2:
Only install the intel-beignet.icd if the system has ocl-icd
support.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Signed-off-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Tested-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Wed, 8 Jan 2014 11:10:53 +0000 (19:10 +0800)]
CL: back port ICD support to 1.1 branch.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Song, Ruiling" <ruiling.song@intel.com>
Zhigang Gong [Fri, 10 Jan 2014 09:49:12 +0000 (17:49 +0800)]
GBE: fixed a long related bug.
We need to handle the case where a 64-bit virtual register
spans two GRFs.
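A small sketch of the situation, assuming 32-byte GRFs. The helper
grfs_touched is hypothetical and only shows which physical registers
a value covers; it is not the code changed by this patch.

```python
GRF_SIZE = 32  # bytes per GRF (assumption: 256-bit registers)

def grfs_touched(byte_offset, byte_size):
    """Return the indices of the GRFs covered by a value that starts at
    byte_offset in the register file and is byte_size bytes long."""
    first = byte_offset // GRF_SIZE
    last = (byte_offset + byte_size - 1) // GRF_SIZE
    return list(range(first, last + 1))
```

With 8 lanes of 8 bytes each, a 64-bit virtual register needs 64
bytes, so grfs_touched(0, 64) covers two GRFs even when it starts on
a register boundary.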
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Tue, 14 Jan 2014 01:33:00 +0000 (09:33 +0800)]
Revert faulty pushed patchset
This reverts:
Revert "GBE: fixed a long related bug."
Revert "Refine the method to find pch and pcm files."
Revert "GBE: enable relocatable pch files."
Revert "CL: prepare to support ICD if the system has ocl-icd.."
Revert "CL: back port ICD support to 1.1 branch."
The above patches were merged by accident, without review comments,
and are broken. Revert them now.
Zhigang Gong [Fri, 10 Jan 2014 09:49:12 +0000 (17:49 +0800)]
GBE: fixed a long related bug.
We need to handle the case where a 64-bit virtual register
spans two GRFs.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Zhigang Gong [Thu, 9 Jan 2014 09:36:37 +0000 (17:36 +0800)]
Refine the method to find pch and pcm files.
When compiling user kernels, we need to find the precompiled header
(pch) file and the precompiled module (pcm) file. The previous
implementation searched the build directory first and then the
system directory. That is not appropriate once beignet is packaged
for a distro, where the build directory need not be searched at all.
So the default search path is now the system directory only. For
developers, the build script now sets an environment variable so
that the gbe bin generator and the utests find the local pch and
pcm files first.
The only visible change is that, after building and before running
the utests, the user needs to set up the environment first. Just
invoke
. utest/setenv.sh
Then everything should work as before. setenv.sh also sets
OCL_KERNEL_PATH, so you no longer need to set it manually.
This patch also updates the documentation.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Zhigang Gong [Thu, 9 Jan 2014 06:20:29 +0000 (14:20 +0800)]
GBE: enable relocatable pch files.
By default, when a pch file is included, clang needs to make sure
the original header file is untouched. That is impossible when
we want to distribute a pch file to a new system, so we need to
use the relocatable pch feature provided by clang.
We now create two pch files. One is a relocatable pch file that
is installed to the system directory. The other is a local pch
file that is used at build time. We need both pch files because,
at build time, there is no ocl_stdlib.h in the system directory.
The local pch file is used only for beignet's own build and the
utests. All other applications use the installed pch/pcm files.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Zhigang Gong [Wed, 8 Jan 2014 10:57:31 +0000 (18:57 +0800)]
CL: prepare to support ICD if the system has ocl-icd..
v2:
Only install the intel-beignet.icd if the system has ocl-icd
support.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Zhigang Gong [Wed, 8 Jan 2014 11:10:53 +0000 (19:10 +0800)]
CL: back port ICD support to 1.1 branch.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Ruiling Song [Wed, 8 Jan 2014 06:58:07 +0000 (14:58 +0800)]
GBE: Remove some noduplicate attributes to let inlining work
The LLVM inliner does not seem to inline a function that contains calls to noduplicate functions.
So we keep noduplicate only for the barrier itself, and barrier() can still be inlined.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Yang Rong [Tue, 7 Jan 2014 03:30:54 +0000 (11:30 +0800)]
Move the memory allocate size check to the callee.
Because of the image's alignment, the alloc size may exceed CL_DEVICE_MAX_MEM_ALLOC_SIZE if the
image's size is calculated from it. So move the size check from cl_mem_allocate to the callee, and
slightly enlarge the size limit when checking in the image allocation path.
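A sketch of why alignment can push an image past the limit. The
alignment values and the helper names below are illustrative
assumptions, not the values beignet uses.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment

def image_alloc_size(width, height, bytes_per_pixel,
                     pitch_align, height_align):
    """Bytes actually allocated once each row and the row count are
    padded to the (hypothetical) alignment requirements."""
    pitch = align_up(width * bytes_per_pixel, pitch_align)
    return pitch * align_up(height, height_align)
```

An application that sizes an image so that width * height *
bytes_per_pixel equals the reported maximum can still request more
than that maximum after padding, which is why the image path needs a
slightly larger limit.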
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Simon Richter [Mon, 2 Dec 2013 13:27:46 +0000 (14:27 +0100)]
Start looking for LLVM from version 3.3 then higher version.
When different LLVM versions are installed, look for 3.5, 3.4 and 3.3 in
order, then try the system default.
As configuring for 3.1 and 3.2 gives an error now, drop these versions from
the search.
v2:
change to use llvm 3.3 as the preferred version.
update the document accordingly.
Signed-off-by: Simon Richter <Simon.Richter@hogyros.de>
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Yi Sun [Thu, 2 Jan 2014 06:17:03 +0000 (14:17 +0800)]
utests/CMakeList.txt: Remove kernel files which generated by utest_generator.py.
v1. Remove all files that are generated automatically.
v2. Refine the dependencies of the generated test cases.
v3. Fix an error that occurs when building the project outside of the source folder.
Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Junyan He [Mon, 6 Jan 2014 09:06:59 +0000 (17:06 +0800)]
Fix the multi-thread crash problem of batch buffer release.
The crash happens like this:
our thread still holds a reference to the batch buffer, but
cl_driver_delete has already been called to delete the bufmgr.
So when we release the buffer object the next time, the bufmgr's
function pointer is invalid and causes the crash.
We now release the batch buffer before every call to
cl_set_thread_batch_buf.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Tested-by: "Yang, Rong R" <rong.r.yang@intel.com>
Yi Sun [Mon, 6 Jan 2014 08:51:52 +0000 (16:51 +0800)]
Refine calculation for ULP.
Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Tue, 7 Jan 2014 04:14:54 +0000 (12:14 +0800)]
GBE: handle the first index of GEP correctly.
The first index of a GEP instruction steps over the pointer itself,
from pointer[0] to pointer[index]. We just need to calculate the size
of *pointer and step over that size * index to reach the position of
the data structure. Then we start to iterate over the composite data
type.
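A minimal sketch of the offset calculation described above. The tuple
encoding of types and the packed layout (no padding) are assumptions
made to keep the example short; the real backend works on LLVM types
and their actual layout.

```python
# Types are encoded as ("scalar", size), ("array", elem_type, count)
# or ("struct", [field_types]). Layout is assumed packed.

def size_of(ty):
    kind = ty[0]
    if kind == "scalar":
        return ty[1]
    if kind == "array":
        return ty[2] * size_of(ty[1])
    if kind == "struct":
        return sum(size_of(f) for f in ty[1])
    raise ValueError("unknown type kind: %r" % (kind,))

def gep_offset(pointee, indices):
    """Byte offset selected by a GEP on a pointer to `pointee`."""
    # The first index steps over whole pointee-sized elements.
    offset = indices[0] * size_of(pointee)
    ty = pointee
    # The remaining indices walk into the composite type.
    for idx in indices[1:]:
        if ty[0] == "array":
            ty = ty[1]
            offset += idx * size_of(ty)
        elif ty[0] == "struct":
            offset += sum(size_of(f) for f in ty[1][:idx])
            ty = ty[1][idx]
        else:
            raise ValueError("cannot index into a scalar")
    return offset
```

For a struct holding a 4-byte int followed by an array of four 4-byte
floats (20 bytes in this packed model), indices [2, 1, 3] give
2 * 20 + 4 + 3 * 4 = 56.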
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Tue, 7 Jan 2014 02:37:55 +0000 (10:37 +0800)]
GBE: Fix a bug at constant GEP processing.
We need to initialize the offset to zero for each new operand.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Mon, 6 Jan 2014 08:37:36 +0000 (16:37 +0800)]
GBE: clang's FE doesn't support static, we just ignore it.
Although the OpenCL spec does support static for global variables
and non-kernel functions, clang doesn't support it currently.
We simply ignore it for now.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Fri, 3 Jan 2014 09:15:58 +0000 (17:15 +0800)]
GBE: optimize JMP instruction.
If the pred register is not in the liveIn set, it means this register
is defined in this block. Then we don't need to validate it.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Fri, 3 Jan 2014 09:03:09 +0000 (17:03 +0800)]
GBE: optimize the CMP instruction.
If the dst bool value is not in the liveIn set, then we don't need
to care about those inactive lanes as they don't hold any active data.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Fri, 3 Jan 2014 04:54:15 +0000 (12:54 +0800)]
GBE: validate active bool value in the branching instruction.
As one bool value may be used in multiple basic blocks, we have to
validate its value by ANDing it with the current flag register.
This patch is not fully optimized: we can avoid the validation if
we know the bool value was already validated in the same basic
block. I will write another patch to do this optimization.
After this patch, all of OpenCV's filter/blur and filter/filter2D
tests pass.
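A sketch of the validation step, modelling lanes as bits of a 16-bit
mask. The function name and the lane width are assumptions for
illustration only.

```python
SIMD_MASK = 0xFFFF  # one bit per lane, assuming 16 lanes

def validate_bool(bool_bits, active_flag):
    """Keep only the lanes that are active in the current block, so
    stale bits left in inactive lanes cannot steer the branch."""
    return bool_bits & active_flag & SIMD_MASK
```

For example, a bool holding 0xA535 whose upper byte is stale,
validated against an active flag of 0x0F0F, yields 0x0505.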
v2:
The compare instruction should not touch the inactive lanes of
the bool value. The previous implementation cleared those
channels to zero by default.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Zhigang Gong [Mon, 30 Dec 2013 10:26:42 +0000 (18:26 +0800)]
GBE: use soft mask to handle the barrier call.
As the GPU is running under predication control, the following IR
may lead to a single barrier being called twice at runtime.
A:
barrier()
instructions after barrier()
B:
...
BR(cond) A
C:
...
BR A
When it runs to B's BR instruction and any of the condition bits is
true, it jumps to block A to execute the barrier. Later, if any of
the condition bits is false, it continues with block C's code and,
at the end of block C, jumps to A to execute the barrier again.
If, on another thread, all the condition bits are true, this
triggers a hang.
And even if all the threads execute the same number of barriers, the
result may be incorrect, because the instructions after barrier() in
block A are executed before all the work items hit the barrier point.
The solution is to use a soft mask register. The register is shared
by all barrier calls. We initialize it to !emask at the beginning
of the program.
barrierMask = !emask.
When execution reaches a barrier call, we set the current predication
bits in the mask register and check whether all the lanes are set. If
any lane is still disabled, we simply jump to the next basic block.
Later, when execution reaches the barrier again, more bits/lanes get
set to 1 and we check again. Once all the bits are 1, we set the
predication flag 0,0 to all 1s and execute the barrier call. After
the wait, we reinitialize barrierMask to !emask and run all the
other instructions after barrier() in block A with all lanes enabled.
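The soft mask bookkeeping above can be sketched as follows. This is a
model under stated assumptions (16 lanes, masks as integers, a
hypothetical BarrierState class), not the generated Gen code.

```python
ALL_LANES = 0xFFFF  # assumption: 16 lanes, one bit per lane

class BarrierState:
    def __init__(self, emask):
        self.emask = emask & ALL_LANES
        # Lanes outside emask never arrive, so they start as "done".
        self.barrier_mask = ~self.emask & ALL_LANES

    def arrive(self, pred):
        """Record the lanes that reached the barrier under predicate
        `pred`. Returns True when every lane has arrived and the real
        barrier may be executed; the mask is then reset to !emask."""
        self.barrier_mask |= pred & ALL_LANES
        if self.barrier_mask == ALL_LANES:
            self.barrier_mask = ~self.emask & ALL_LANES
            return True
        # Some lanes are still missing: skip to the next basic block.
        return False
```

With emask = 0xFFFF, arriving first with lanes 0x00FF and later with
lanes 0xFF00 executes the barrier exactly once, on the second arrival.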
After this patch, the hang seen when testing OpenCV's transpose
test cases is fixed.
v2:
1. If some lanes have not yet reached the barrier, we need to set the
block ip of all the finished lanes to FFFF, and we also need to clear
flag0 to zero. This avoids executing the instructions after the
barrier too early.
2. Fix some typos.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
Yang Rong [Thu, 26 Dec 2013 01:55:54 +0000 (09:55 +0800)]
Move the llvm optimize pass from clang to backend.
Call the LLVM opt passes in llvmToGen. Remove the SROA pass and call the GVN pass with NoLoads set to
true to avoid large integers. Also handle the opt level in llvmToGen: 0 is equivalent to clang -O1, and
1 is equivalent to clang -O2.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Yang Rong [Tue, 31 Dec 2013 07:20:52 +0000 (15:20 +0800)]
Fix utest compiler_function_argument3 error after move -O2 to backend.
After moving optimization from clang to the backend, some passes are removed and some use different
parameters, which triggers a bug in building the pushmap and causes compiler_function_argument3 to fail.
One loadImm/add instruction may be used by different loads, in set seq. So do not add an entry to the
pushmap if the same argID/offset has already been added, and do not delete a loadImm/add instruction
again if it has already been deleted.
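A sketch of the two guards described above. The data shapes (pairs of
argID and offset, instruction ids in a set) and the function names are
hypothetical; they only illustrate the "add once, delete once" rule.

```python
def build_push_map(loads):
    """loads: iterable of (arg_id, offset) pairs, possibly repeated.
    Each distinct pair gets exactly one pushmap slot."""
    push_map = {}
    for arg_id, offset in loads:
        key = (arg_id, offset)
        if key not in push_map:  # skip pairs that were already added
            push_map[key] = len(push_map)
    return push_map

def delete_once(insn_id, deleted):
    """Delete a shared loadImm/add instruction only the first time.
    Returns True if this call performed the deletion."""
    if insn_id in deleted:
        return False
    deleted.add(insn_id)
    return True
```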
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>