contrib/beignet.git
10 years agoGBE: Add a new pass to handle barrier function's noduplicate attribute correctly.
Zhigang Gong [Wed, 26 Mar 2014 05:45:56 +0000 (13:45 +0800)]
GBE: Add a new pass to handle barrier function's noduplicate attribute correctly.

This pass removes or adds the noduplicate function attribute for barrier functions.
Basically, we want to set NoDuplicate on the __gen_barrier_xxx functions. But if
a sub-function calls one of those barrier functions, llvm's inlining pass will not
inline that sub-function, which is not what we want. Inlining such a function into
its caller is safe; we only want to prevent the call from being duplicated. So
introduce this pass to remove the NoDuplicate function attribute before the inlining
pass and restore it afterwards.

v2:
fix the module changed check.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoStatistics of case running
Yi Sun [Tue, 1 Apr 2014 08:31:26 +0000 (16:31 +0800)]
Statistics of case running

summary:
-----------------
1. Add struct RStatistics to count the number of passed cases (passCount), failed cases (failCount), and finished runs (finishrun).

2. Print a statistics line; if the terminal is too narrow, it is not printed:
  ......
  test_load_program_from_bin()    [SUCCESS]
  profiling_exec()    [SUCCESS]
  enqueue_copy_buf()    [SUCCESS]
   [run/total: 656/656]      pass: 629; fail: 25; pass rate: 0.961890

3. If a case crashes, count it as failed; add a function to show the statistics summary.

4. When all cases have finished, print a summary as follows:
summary:
----------
  total: 656
  run: 656
  pass: 629
  fail: 25
  pass rate: 0.961890

5. If you run ./utest_run &> log, the log will be a little messy; try the following command to analyse it:

  sed 's/\r/\n/g' log | egrep "\w*\(\)" | sed -e 's/\s//g'

  After analysis:
  -----------------
......
builtin_minmag_float2()[SUCCESS]
builtin_minmag_float4()[SUCCESS]
builtin_minmag_float8()[SUCCESS]
builtin_minmag_float16()[SUCCESS]
builtin_nextafter_float()[FAILED]
builtin_nextafter_float2()[FAILED]
builtin_nextafter_float4()[FAILED]
......

6. Fix one issue: print out the name of the crashed case.

7. Delete the debug line in utests/compiler_basic_arithmetic.cpp, which
   outputs the kernel name.

8. Define function statistics() in struct UTest, which is called by "utest_run -a/-c/-n".
   We just call this function to run each case and print the statistics line.
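
The log clean-up command from item 5 can also be sketched in Python. This is
only an illustration of what the sed/egrep pipeline does; the function name and
the sample log text are made up for the example and are not part of utest_run.

```python
import re

def extract_case_results(raw_log):
    # Step 1: turn carriage returns into newlines (the first sed).
    lines = raw_log.replace("\r", "\n").split("\n")
    # Step 2: keep only lines that contain a case name followed by "()" (the egrep).
    kept = [ln for ln in lines if re.search(r"\w*\(\)", ln)]
    # Step 3: remove all whitespace inside each kept line (the second sed).
    return [re.sub(r"\s", "", ln) for ln in kept]

# Hypothetical log fragment: case lines and a statistics line separated by "\r".
sample = ("builtin_minmag_float2()    [SUCCESS]\r"
          " [run/total: 1/2]  pass: 1\r"
          "builtin_nextafter_float()    [FAILED]\n")
results = extract_case_results(sample)
print("\n".join(results))
```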

Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd one tests case specific for unaligned buffer copy.
Junyan He [Wed, 26 Mar 2014 10:28:02 +0000 (18:28 +0800)]
Add one tests case specific for unaligned buffer copy.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoOptimize the unaligned buffer copy logic
Junyan He [Wed, 26 Mar 2014 10:27:56 +0000 (18:27 +0800)]
Optimize the unaligned buffer copy logic

Because the byte-aligned read and write send instructions are
very slow, we optimize to avoid using them.
We separate the unaligned case into three cases:
   1. The src and dst have the same %4 unaligned offset.
      Then we just need to handle the first and last dword.
   2. The src has a bigger %4 unaligned offset than the dst.
      We need to do some shift and montage between src[i]
      and src[i+1].
   3. The last case: the src has a smaller %4 unaligned offset.
      Then we need to do the same for src[i-1] and src[i].
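
The three cases and the "shift and montage" step can be sketched as follows.
This is a host-side Python model under the assumption of little-endian dwords;
the helper names are hypothetical and the real work is done in the copy kernels.

```python
import struct

def classify(src_offset, dst_offset):
    """Pick one of the three unaligned-copy cases from the byte offsets."""
    s, d = src_offset % 4, dst_offset % 4
    if s == d:
        return "same"
    return "src_bigger" if s > d else "src_smaller"

def merge_dwords(lo, hi, shift_bytes):
    """Build one output dword from two neighbouring little-endian dwords.

    Drops the lowest shift_bytes bytes of lo and fills the top bytes from hi.
    """
    assert 0 < shift_bytes < 4
    bits = 8 * shift_bytes
    return ((lo >> bits) | (hi << (32 - bits))) & 0xFFFFFFFF

# Sixteen bytes 0x00..0x0f viewed as four dwords.
data = bytes(range(16))
words = struct.unpack("<4I", data)
# Reading a dword that starts one byte into the buffer, using only aligned dwords.
merged = merge_dwords(words[0], words[1], 1)
expected = struct.unpack("<I", data[1:5])[0]
```

With a shift of one byte, merged equals the dword that a byte-granular read at
offset 1 would have returned, without issuing any byte-aligned access.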

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd three copy cl files for Enqueue Copy usage.
Junyan He [Wed, 26 Mar 2014 10:27:48 +0000 (18:27 +0800)]
Add three copy cl files for Enqueue Copy usage.

Add these three cl files:
the first for when src and dst are not aligned but have the same %4 offset,
the second for when the src's %4 offset is bigger than the dst's,
the third for when the src's %4 offset is smaller than the dst's.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd kernels performance output
Yongjia Zhang [Tue, 1 Apr 2014 09:16:46 +0000 (17:16 +0800)]
Add kernels performance output

If the environment variable OCL_OUTPUT_KERNEL_PERF is set to a non-zero
value, then after the executable program exits, beignet outputs the
time information of each kernel executed.

v2:fixed the patch's trailing whitespace problem.

v3: if OCL_OUTPUT_KERNEL_PERF is 1, the output contains only the time
summary; if it is 2, the output contains the time summary and details.
Add the outputs 'Ave' and 'Dev': 'Ave' is the average time per kernel per
execution round, and 'Dev' is the quotient of 'Ave' and the standard
deviation over all executions of a kernel.

Signed-off-by: Yongjia Zhang <yongjia.zhang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Fix register liveness issue under simd mode.
Ruiling Song [Mon, 24 Mar 2014 01:46:21 +0000 (09:46 +0800)]
GBE: Fix register liveness issue under simd mode.

As we run in SIMD mode with a predication mask to indicate active lanes,
if a vreg is defined in a loop and there are some uses of the vreg outside the loop,
the define point may be run several times under *different* predication masks.
For these kinds of vreg, we must extend the vreg liveness to the whole loop.
If we don't do this, its liveness is killed before the def point inside the loop.
If the vreg's corresponding physical reg is assigned to another vreg during the
killed period, and the instructions before the kill point are re-executed with a different
predication, the inactive lanes of the vreg may be overwritten. Then the out-of-loop use will get wrong data.

This patch fixes the HaarFixture case in opencv.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Optimize the forward jump instruction.
Zhigang Gong [Mon, 17 Mar 2014 10:02:37 +0000 (18:02 +0800)]
GBE: Optimize the forward jump instruction.

As we already check at the beginning of each BB whether all channels are inactive,
we don't really need to duplicate this check at the end of a forward jump.

This patch gets about a 25% performance gain for luxmark's median scene.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoRefine the FCMP_ORD and FCMP_UNO.
Yang Rong [Mon, 24 Mar 2014 09:21:40 +0000 (17:21 +0800)]
Refine the FCMP_ORD and FCMP_UNO.

If one of src0 and src1 of FCMP_ORD/FCMP_UNO is a constant, the constant
value must be ordered; otherwise llvm would have optimized the instruction to true/false.
So discard the constant value and compare only the other src.
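
The reasoning can be checked with a small model. This is a sketch in Python,
not the Gen IR code: "ordered" means neither operand is NaN, so with an ordered
constant the result depends only on the other operand. The function names are
made up for the illustration.

```python
import math

def fcmp_ord(a, b):
    """Ordered compare: true when neither operand is NaN."""
    return not (math.isnan(a) or math.isnan(b))

def fcmp_uno(a, b):
    """Unordered compare: true when at least one operand is NaN."""
    return math.isnan(a) or math.isnan(b)

def fcmp_ord_one_src(x):
    """Single-source form: x == x is false only for NaN."""
    return x == x

def fcmp_uno_one_src(x):
    """Single-source form: x != x is true only for NaN."""
    return x != x

CONSTANT = 3.0  # any ordered (non-NaN) constant
samples = [0.0, -1.5, float("inf"), float("nan")]
same_ord = all(fcmp_ord(x, CONSTANT) == fcmp_ord_one_src(x) for x in samples)
same_uno = all(fcmp_uno(x, CONSTANT) == fcmp_uno_one_src(x) for x in samples)
```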

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoRefined the fmax and fmin builtins.
Yang Rong [Mon, 24 Mar 2014 08:27:31 +0000 (16:27 +0800)]
Refined the fmax and fmin builtins.

Because GEN's select instruction with cmod .l and .ge handles the NaN case,
use the compare and select instructions in gen ir for fmax and fmin. They will be
optimized into one sel_cmp, so there is no need to check isnan.
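
For reference, the NaN behaviour that OpenCL requires from fmax and fmin (if
one argument is NaN, return the other) can be written out as below. This is a
Python sketch of the required semantics only; it does not model the Gen select
instruction, and the function names are chosen for the example.

```python
import math

def fmax_ref(x, y):
    """OpenCL-style fmax: a single NaN argument is ignored."""
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return y if x < y else x

def fmin_ref(x, y):
    """OpenCL-style fmin: a single NaN argument is ignored."""
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return y if y < x else x

nan = float("nan")
```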

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: "Zou, Nanhai" <nanhai.zou@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd one test case for profiling test.
Junyan He [Mon, 24 Mar 2014 07:34:43 +0000 (15:34 +0800)]
Add one test case for profiling test.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: make byte/short vload/vstore process one element each time.
Ruiling Song [Wed, 19 Mar 2014 03:41:54 +0000 (11:41 +0800)]
GBE: make byte/short vload/vstore process one element each time.

Per the OCL Spec, the computed address (p+offset*n) is 8-bit aligned for char
and 16-bit aligned for short in vloadn & vstoren. That is, we cannot assume that
vload4 with a char pointer is 4-byte aligned. The previous implementation makes
Clang generate a load or store with alignment 4 when the alignment is in fact only 1.

We need to find another way to optimize vloadn.
But before that, let's keep vloadn and vstoren working correctly.
This should fix the regression caused by the byte/short optimization.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd SROA and GVN pass to default optLevel.
Yang Rong [Tue, 11 Mar 2014 06:43:51 +0000 (14:43 +0800)]
Add SROA and GVN pass to default optLevel.

SROA and GVN may introduce some integer types not supported by the backend.
Remove the type assert in GenWrite; when such types are found, set the unit to
invalid. If the unit is invalid, use optLevel 0, which does not include SROA and GVN,
and try again.
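
The retry logic described above follows a common fallback pattern, sketched
here in Python. Everything in the sketch is hypothetical: the exception type,
the compile callback, and the default level of 1 are assumptions made for the
illustration, not the names or values used in the backend.

```python
class UnsupportedType(Exception):
    """Raised by the (hypothetical) backend when it meets an unsupported type."""

def build_with_fallback(compile_fn, source, default_level=1):
    """Try the default optimization level first, then retry at level 0."""
    try:
        return default_level, compile_fn(source, default_level)
    except UnsupportedType:
        # Level 0 is assumed to skip the passes that introduced the odd types.
        return 0, compile_fn(source, 0)

def fake_compile(source, opt_level):
    """Stand-in compiler: pretend optimization creates an unsupported i96."""
    if opt_level > 0 and "i96" in source:
        raise UnsupportedType("i96")
    return "ok@{0}".format(opt_level)
```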

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoutests: Refine cases for sinpi.
Yi Sun [Mon, 10 Mar 2014 03:32:12 +0000 (11:32 +0800)]
utests: Refine cases for sinpi.

The general algorithm is to reduce x to the range [-0.5,0.5] and then calculate the result.

v2. Correct the algorithm of sinpi.
    Add some input data temporarily; we're going to design and implement an input data generator similar to what Conformance does.
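
One way to do such a range reduction is sketched below in Python. It is not
the code from the utest; it assumes a finite input and uses the identities
sin(pi*(1-r)) = sin(pi*r) and sin(pi*(-1-r)) = sin(pi*r) to fold the argument.

```python
import math

def sinpi_ref(x):
    """Reference sin(pi*x) for finite x via reduction to [-0.5, 0.5]."""
    r = math.fmod(x, 2.0)      # sin(pi*x) has period 2 in x; r is in (-2, 2)
    if r > 1.0:
        r -= 2.0
    elif r < -1.0:
        r += 2.0               # r is now in [-1, 1]
    if r > 0.5:
        r = 1.0 - r
    elif r < -0.5:
        r = -1.0 - r           # r is now in [-0.5, 0.5]
    return math.sin(math.pi * r)
```

Reducing first keeps the argument passed to sin() small, so integer inputs
such as 1.0 give an exact zero instead of a rounding residue.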

Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoMove the defination union SF to header file utest_helper.hpp
Yi Sun [Wed, 5 Mar 2014 05:59:26 +0000 (13:59 +0800)]
Move the defination union SF to header file utest_helper.hpp

Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoAdd clGetMemObjectFdIntel() api
Chuanbo Weng [Wed, 5 Mar 2014 16:08:15 +0000 (00:08 +0800)]
Add clGetMemObjectFdIntel() api

Use this api to share a buffer between OpenCL and v4l2. After importing
the fd of an OpenCL memory object into v4l2, v4l2 can read frames directly
into this memory object by way of DMABUF, without a memory copy.

v2:
Check return value of cl_buffer_get_fd

Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agomerge some state buffers into one buffer
Guo Yejun [Thu, 6 Mar 2014 16:59:38 +0000 (00:59 +0800)]
merge some state buffers into one buffer

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoFix a convert float to long bug.
Yang Rong [Mon, 3 Mar 2014 03:25:19 +0000 (11:25 +0800)]
Fix a convert float to long bug.

Converting some special float values, slightly larger than LONG_MAX, to long with sat
gives a wrong result. Simply use LONG_MAX when the float value is equal to LONG_MAX.
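
The corner case comes from rounding: LONG_MAX (2^63 - 1) is not representable
as a float, so it rounds up to 2^63, which is already out of range for long. A
Python sketch of a saturating conversion is shown below; it assumes a non-NaN
input and is only a model of the intended behaviour, not the builtin's code.

```python
LONG_MAX = 2**63 - 1
LONG_MIN = -2**63

def convert_long_sat(f):
    """Saturating float-to-long conversion for non-NaN inputs."""
    if f >= float(LONG_MAX):   # float(LONG_MAX) rounds up to 2**63
        return LONG_MAX
    if f <= float(LONG_MIN):
        return LONG_MIN
    return int(f)              # truncates toward zero
```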

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Optimize byte/short load/store using untyped read/write
Ruiling Song [Fri, 7 Mar 2014 05:48:48 +0000 (13:48 +0800)]
GBE: Optimize byte/short load/store using untyped read/write

Scatter/gather perform much worse than untyped read/write. So if we can pack
the load/store of char/short to use an untyped message, just do it.

v2:
add some assert in splitReg()

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: Fix a potential issue if increase srcNum.
Ruiling Song [Fri, 7 Mar 2014 05:48:47 +0000 (13:48 +0800)]
GBE: Fix a potential issue if increase srcNum.

If MAX_SRC_NUM for ir::Instruction is increased, unpredictable behaviour may occur.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: make vload3 only read 3 elements.
Ruiling Song [Fri, 7 Mar 2014 05:48:46 +0000 (13:48 +0800)]
GBE: make vload3 only read 3 elements.

Clang will widen a vec3 load into a vec4 load, so we have to handle it in the frontend.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: Optimize scratch memory usage using register interval
Ruiling Song [Fri, 28 Feb 2014 02:16:45 +0000 (10:16 +0800)]
GBE: Optimize scratch memory usage using register interval

Scratch memory is a limited resource in HW, and different
registers have the opportunity to share the same scratch memory. So
I introduce an allocator for scratch memory management.
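
The idea of sharing scratch space between registers whose intervals do not
overlap can be sketched with a small greedy allocator. This Python version is
only an illustration; the function name, the slot model, and the sample
intervals are assumptions and do not mirror the ScratchAllocator interface.

```python
def assign_scratch_slots(intervals):
    """Map each register to a scratch slot, reusing slots when intervals are disjoint.

    intervals: {reg_name: (start, end)} with inclusive bounds.
    """
    slot_busy_until = []   # slot_busy_until[i] = end of the last interval in slot i
    assignment = {}
    # Visit registers in order of interval start.
    for reg, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        for slot, busy_until in enumerate(slot_busy_until):
            if busy_until < start:          # slot is free again: reuse it
                slot_busy_until[slot] = end
                assignment[reg] = slot
                break
        else:                               # no free slot: open a new one
            slot_busy_until.append(end)
            assignment[reg] = len(slot_busy_until) - 1
    return assignment

slots = assign_scratch_slots({"r0": (0, 4), "r1": (2, 6), "r2": (5, 9)})
```

Here r0 and r2 never live at the same time, so they end up in the same slot
and only two slots are needed for three spilled registers.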

v2:
In order to reuse the registerFilePartitioner, I rename it as
SimpleAllocator, and derive ScratchAllocator & RegisterAllocator
from it.

v3:
fix a typo, scratch size is 12KB.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: show correct line number in build log
Guo Yejun [Thu, 27 Feb 2014 17:58:20 +0000 (01:58 +0800)]
GBE: show correct line number in build log

Sometimes we insert some code into the kernel, which
makes the line number reported in the build log
mismatch the line number in the kernel from the
programmer's view. Use #line to correct it.
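
The effect of the #line directive can be modelled as below. This is a Python
sketch with hypothetical helper names and a made-up header; it only counts
lines the way a C-like preprocessor would and does not call the compiler.

```python
def with_injected_header(header, user_source):
    """Prepend driver code, then reset line numbering for the user's kernel."""
    return header.rstrip("\n") + "\n#line 1\n" + user_source

def reported_line(source, physical_line):
    """Line number that would be reported for a 1-based physical line."""
    reported = 0
    for idx, text in enumerate(source.split("\n"), start=1):
        if text.startswith("#line "):
            reported = int(text.split()[1]) - 1   # the next line gets this number
        else:
            reported += 1
        if idx == physical_line:
            return reported
    raise ValueError("physical_line is past the end of the source")

header = "#define A 1\n#define B 2"
kernel = "__kernel void f() {}\nerror here"
combined = with_injected_header(header, kernel)
```

The second line of the user's kernel sits on physical line 5 of the combined
source but is reported as line 2, which matches the programmer's view.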

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: support getelementptr with ConstantExpr operand
Guo Yejun [Wed, 26 Feb 2014 22:54:26 +0000 (06:54 +0800)]
GBE: support getelementptr with ConstantExpr operand

Add support, during the LLVM IR -> Gen IR translation, for the case where the
first operand of getelementptr is a ConstantExpr.

A utest is also added.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: add fast path for more math functions
Guo Yejun [Thu, 20 Feb 2014 21:51:33 +0000 (05:51 +0800)]
GBE: add fast path for more math functions

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: remove the useless get sampler info function.
Zhigang Gong [Fri, 21 Feb 2014 05:09:20 +0000 (13:09 +0800)]
GBE: remove the useless get sampler info function.

We don't need to get the sampler info dynamically, so
remove the corresponding instruction.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: optimize read_image to avoid get sampler info dynamically.
Zhigang Gong [Fri, 21 Feb 2014 04:50:55 +0000 (12:50 +0800)]
GBE: optimize read_image to avoid get sampler info dynamically.

Most of the time, the user uses a const sampler value in the kernel
directly. Thus we don't need to get the sampler value through a function
call. This way, the compiler front end can optimize much better
than when the sampler information is fetched dynamically. For luxmark's
median/simple cases, this patch gives about a 30-45% performance gain.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: don't put a long live register to a selection vector.
Zhigang Gong [Fri, 21 Feb 2014 02:40:08 +0000 (10:40 +0800)]
GBE: don't put a long live register to a selection vector.

If an element has a very long interval, we don't want to put it into a
vector, as it will add more pressure to the register allocation.

This patch reduces spill registers by more than 20% for luxmark's
median scene benchmark (from 288 to 224).

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: prepare to optimize generic selection vector allocation.
Zhigang Gong [Wed, 19 Feb 2014 02:16:48 +0000 (10:16 +0800)]
GBE: prepare to optimize generic selection vector allocation.

Move the selection vector allocation after the register interval
calculation.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: fixed a potential bug in 64 bit instruction.
Zhigang Gong [Wed, 19 Feb 2014 02:47:46 +0000 (10:47 +0800)]
GBE: fixed a potential bug in 64 bit instruction.

The current selection vector handling requires that the dst/src
vector start at dst(0) or src(0).

v2:
fix an assertion.
v3:
fix a bug in gen_context.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fix the overflow bug in register spilling.
Zhigang Gong [Wed, 19 Feb 2014 08:36:33 +0000 (16:36 +0800)]
GBE: fix the overflow bug in register spilling.

Change to use int32 to represent the maxID.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: code cleanup for read_image/write_image.
Zhigang Gong [Tue, 18 Feb 2014 10:32:33 +0000 (18:32 +0800)]
GBE: code cleanup for read_image/write_image.

Remove some useless instructions and make the read/write_image
more readable.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fixed the incorrect max_dst_num and max_src_num.
Zhigang Gong [Tue, 18 Feb 2014 09:41:05 +0000 (17:41 +0800)]
GBE: fixed the incorrect max_dst_num and max_src_num.

Some I64 instructions use more than 11 dst registers;
this patch changes the max src number to 16 and adds an assertion
to check whether we run into this type of issue again.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: Optimize write_image instruction for simd8 mode.
Zhigang Gong [Tue, 18 Feb 2014 09:19:41 +0000 (17:19 +0800)]
GBE: Optimize write_image instruction for simd8 mode.

In simd8 mode, we can put u,v,w,x,r,g,b,a into
a selection vector directly and don't need to
assign those values again.

For example, the following code, which does a simple image copy,
is generated without this patch:

    (26      )  (+f0) mov(8)    g113<1>F        g114<8,8,1>D                    { align1 WE_normal 1Q };
    (28      )  (+f0) send(8)   g108<1>UD       g112<8,8,1>F
                sampler (3, 0, 0, 1) mlen 2 rlen 4              { align1 WE_normal 1Q };
    (30      )  mov(8)          g99<1>UD        0x0UD                           { align1 WE_all 1Q };
    (32      )  mov(1)          g99.7<1>UD      0xffffUD                        { align1 WE_all };
    (34      )  mov(8)          g103<1>UD       0x0UD                           { align1 WE_all 1Q };
    (36      )  (+f0) mov(8)    g100<1>UD       g117<8,8,1>UD                   { align1 WE_normal 1Q };
    (38      )  (+f0) mov(8)    g101<1>UD       g114<8,8,1>UD                   { align1 WE_normal 1Q };
    (40      )  (+f0) mov(8)    g104<1>UD       g108<8,8,1>UD                   { align1 WE_normal 1Q };
    (42      )  (+f0) mov(8)    g105<1>UD       g109<8,8,1>UD                   { align1 WE_normal 1Q };
    (44      )  (+f0) mov(8)    g106<1>UD       g110<8,8,1>UD                   { align1 WE_normal 1Q };
    (46      )  (+f0) mov(8)    g107<1>UD       g111<8,8,1>UD                   { align1 WE_normal 1Q };
    (48      )  (+f0) send(8)   null            g99<8,8,1>UD
                renderunsupported target 5 mlen 9 rlen 0        { align1 WE_normal 1Q };
    (50      )  (+f0) mov(8)    g1<1>UW         0x1UW                           { align1 WE_normal 1Q };
  L1:
    (52      )  mov(8)          g112<1>UD       g0<8,8,1>UD                     { align1 WE_all 1Q };
    (54      )  send(8)         null            g112<8,8,1>UD
                thread_spawnerunsupported target 7 mlen 1 rlen 0 { align1 WE_normal 1Q EOT };

With this patch, we can optimize it as below:

    (26      )  (+f0) mov(8)    g106<1>F        g111<8,8,1>D                    { align1 WE_normal 1Q };
    (28      )  (+f0) send(8)   g114<1>UD       g105<8,8,1>F
                sampler (3, 0, 0, 1) mlen 2 rlen 4              { align1 WE_normal 1Q };
    (30      )  mov(8)          g109<1>UD       0x0UD                           { align1 WE_all 1Q };
    (32      )  mov(1)          g109.7<1>UD     0xffffUD                        { align1 WE_all };
    (34      )  mov(8)          g113<1>UD       0x0UD                           { align1 WE_all 1Q };
    (36      )  (+f0) send(8)   null            g109<8,8,1>UD
                renderunsupported target 5 mlen 9 rlen 0        { align1 WE_normal 1Q };
    (38      )  (+f0) mov(8)    g1<1>UW         0x1UW                           { align1 WE_normal 1Q };
  L1:
    (40      )  mov(8)          g112<1>UD       g0<8,8,1>UD                     { align1 WE_all 1Q };
    (42      )  send(8)         null            g112<8,8,1>UD
                thread_spawnerunsupported target 7 mlen 1 rlen 0 { align1 WE_normal 1Q EOT };

This patch could save about 8 instructions per write_image.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: optimize sample instruction.
Zhigang Gong [Tue, 18 Feb 2014 06:40:59 +0000 (14:40 +0800)]
GBE: optimize sample instruction.

The U,V,W registers could be allocated to a selection vector directly.
Then we can save some MOV instructions for the read_image functions.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoChange the order of the code
xiuli pan [Fri, 21 Feb 2014 08:25:20 +0000 (16:25 +0800)]
Change the order of the code

Fix the 66K problem in the OpenCV testing.
The bug was caused by the incorrect order
of the code, which made beignet
calculate the whole localsize of the kernel
file. Now the OpenCV test can pass.

Reviewed-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoFix a long DIV/REM hang.
Yang Rong [Fri, 21 Feb 2014 08:54:39 +0000 (16:54 +0800)]
Fix a long DIV/REM hang.

There is a jmpi in long DIV/REM whose predication is any16/any8. So we
MUST AND the predication register with emask; otherwise it may loop forever.
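
Why the AND matters can be shown with a toy model. The Python sketch below is
a simplification with made-up names and bit patterns, not Gen semantics: an
"any" predicate looks at every flag bit, so stale bits in inactive lanes keep
the backward jump alive unless they are masked off with emask.

```python
def any_lane_set(bits):
    """Model of an any8/any16 predicate: true if any flag bit is set."""
    return bits != 0

def loop_back_taken(flag, emask, and_with_emask):
    """Decide whether the backward jump is taken in this toy model."""
    effective = flag & emask if and_with_emask else flag
    return any_lane_set(effective)

# Lanes 4-7 are inactive but still carry stale "keep looping" flag bits,
# while the active lanes 0-3 have all finished.
flag = 0b11110000
emask = 0b00001111
without_and = loop_back_taken(flag, emask, and_with_emask=False)
with_and = loop_back_taken(flag, emask, and_with_emask=True)
```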

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: improve precision of rootn
Lv Meng [Tue, 14 Jan 2014 03:04:57 +0000 (11:04 +0800)]
GBE: improve precision of rootn

Signed-off-by: Lv Meng <meng.lv@intel.com>
10 years agoRemove some unreasonable input values for rootn
Yi Sun [Thu, 20 Feb 2014 01:32:32 +0000 (09:32 +0800)]
Remove some unreasonable input values for rootn

The manual for the function pow() contains the following description:
"If x is a finite value less than 0,
and y is a finite noninteger,
a domain error occurs, and a NaN is returned."
That means we can't calculate rootn on the cpu as pow(x,1.0/y), which is what the OpenCL spec mentions.
E.g. when y=3 and x=-8, rootn should return -2. But when we calculate pow(x, 1.0/y), it returns a NaN.
I didn't find a multi-root math function in glibc.
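
The problem and one possible workaround can be sketched in Python. Note that
Python's math.pow raises ValueError for the same domain error where C's pow
returns NaN. The rootn_ref helper is hypothetical; it assumes a nonzero n and
handles only the case of x >= 0 or odd n.

```python
import math

def rootn_ref(x, n):
    """Reference n-th root for x >= 0, or for negative x with odd n."""
    if x < 0 and n % 2 != 0:
        # An odd root of a negative number is the negated root of its magnitude.
        return -math.pow(-x, 1.0 / n)
    return math.pow(x, 1.0 / n)

try:
    math.pow(-8.0, 1.0 / 3.0)   # negative base with a non-integer exponent
    domain_error = False
except ValueError:
    domain_error = True
```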

Signed-off-by: Yi Sun <yi.sun@intel.com>
10 years agoutests:add subnormal check by fpclassify.
Yi Sun [Wed, 19 Feb 2014 06:12:03 +0000 (14:12 +0800)]
utests:add subnormal check by fpclassify.

Signed-off-by: Yi Sun <yi.sun@intel.com>
Signed-off-by: Shui yangwei <yangweix.shui@intel.com>
10 years agoChange %.20f to %e.
Yi Sun [Wed, 19 Feb 2014 06:04:52 +0000 (14:04 +0800)]
Change %.20f to %e.

This can make the error information more readable.

Signed-off-by: Yi Sun <yi.sun@intel.com>
10 years agoGBE: add param to switch the behavior of math func
Guo Yejun [Mon, 17 Feb 2014 21:30:27 +0000 (05:30 +0800)]
GBE: add param to switch the behavior of math func

Add OCL_STRICT_CONFORMANCE to switch the behavior of the math funcs.
If it is 1, the funcs are high precision at the cost of performance; if it is 0,
a fast path with good enough precision is selected.

This change adds the code basis, with 'sin' and 'cos' implemented
as examples; support for other math functions will be added later.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
10 years agoutests: Remove test cases for function 'tgamma' 'erf' and 'erfc'
Yi Sun [Mon, 17 Feb 2014 03:32:47 +0000 (11:32 +0800)]
utests: Remove test cases for function 'tgamma' 'erf' and 'erfc'

Since OpenCL conformance doesn't cover these functions at the moment,
we remove them temporarily.

Signed-off-by: Yi Sun <yi.sun@intel.com>
10 years agoImprove precision of sinpi/cospi
Ruiling Song [Mon, 17 Feb 2014 08:54:20 +0000 (16:54 +0800)]
Improve precision of sinpi/cospi

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: fix terminfo library linkage
Boqun Feng [Mon, 17 Feb 2014 01:49:26 +0000 (09:49 +0800)]
GBE: fix terminfo library linkage

In some distros, the terminal libraries are split into two
libraries, tinfo and ncurses, while in
other distros there is only a single ncurses library with
all functions.
To link the proper terminal library for LLVM, the find_library
macro in cmake can be used. In this patch, tinfo is preferred,
so that linkage behavior in distros with tinfo is not affected.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
10 years agoutests: define python interpreter via cmake variable
Boqun Feng [Sat, 15 Feb 2014 06:52:44 +0000 (14:52 +0800)]
utests: define python interpreter via cmake variable

The reason for this fix is in commit
5b64170ef5e3e78d038186fb1132b11a8fec308e.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoCL: make the scratch size as a device resource attribute.
Zhigang Gong [Fri, 14 Feb 2014 08:11:36 +0000 (16:11 +0800)]
CL: make the scratch size as a device resource attribute.

Actually, the scratch size is much like the local memory size,
which should be device-dependent information.

This patch puts the scratch mem size into the device attribute
structure. When the kernel needs more than the maximum scratch
memory, we just return an out-of-resource error rather than trigger
an assertion.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Song, Ruiling <ruiling.song@intel.com>
10 years agofix typo: blobTempName is assigned but not used
Guo Yejun [Thu, 13 Feb 2014 03:59:48 +0000 (11:59 +0800)]
fix typo: blobTempName is assigned but not used

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Support 64Bit register spill.
Ruiling Song [Fri, 14 Feb 2014 07:04:26 +0000 (15:04 +0800)]
GBE: Support 64Bit register spill.

Now we support DWORD & QWORD register spill/fill.

v2:
  only increase poolOffset by 1 when we meet a QWord register and poolOffset is 1.

v3:
  allocate the reserved register pool in a unified way for src and dst registers.
  when spilling a qword register, the payload register should be retyped as dword per the bottom/top logic.
  put a limit on the scratch space memory size.

v4:
  fix a typo.
  increase the reserved registers from 6 to 8 for some complex instructions.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agocmake: Fix linking with LLVM/Terminfo
Igor Gnatenko [Thu, 13 Feb 2014 07:16:35 +0000 (11:16 +0400)]
cmake: Fix linking with LLVM/Terminfo

DEBUG: [  9%] Building CXX object backend/src/CMakeFiles/gbe_bin_generater.dir/gbe_bin_generater.cpp.o
DEBUG: Linking CXX executable gbe_bin_generater
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x717): undefined reference to `setupterm'
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x727): undefined reference to `tigetnum'
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x730): undefined reference to `set_curterm'
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x738): undefined reference to `del_curterm'

Signed-off-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoBump to version 0.8.0.
Zhigang Gong [Mon, 10 Feb 2014 08:28:37 +0000 (16:28 +0800)]
Bump to version 0.8.0.

This version brings many improvements compared to the last released version 0.3,
so we decided to bump the version to 0.8.0 directly. Before 1.0.0, we
have two steps left. One is performance optimization and the other is to
support OpenCL 1.2 by default.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoDocs: fix some markdown errors and add some new info.
Zhigang Gong [Wed, 12 Feb 2014 07:20:45 +0000 (15:20 +0800)]
Docs: fix some markdown errors and add some new info.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoFix build errors in llvm3.5 only system.
Yang Rong [Wed, 12 Feb 2014 15:41:26 +0000 (23:41 +0800)]
Fix build errors in llvm3.5 only system.

Some header files are missing on a system that has only llvm3.5. If a previous llvm was
ever installed, these header files remain in the system even after it is uninstalled, so the error is not triggered.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoFix the cmake problem in FindLLVM.
Zhigang Gong [Tue, 11 Feb 2014 09:51:50 +0000 (17:51 +0800)]
Fix the cmake problem in FindLLVM.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoUpdate document for LLVM/Clang 3.5.
Zhigang Gong [Mon, 10 Feb 2014 08:28:36 +0000 (16:28 +0800)]
Update document for LLVM/Clang 3.5.

Also change the README.md to link to Beignet.mdw rather than to point to the wiki page.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fixed the unsafe tmpnam_r.
Zhigang Gong [Sat, 8 Feb 2014 06:12:03 +0000 (14:12 +0800)]
GBE: fixed the unsafe tmpnam_r.

Use mkstemps instead.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoSilent compilation warning in sampler functions.
Zhigang Gong [Sat, 8 Feb 2014 06:12:02 +0000 (14:12 +0800)]
Silent compilation warning in sampler functions.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoAdd clang/LLVM 3.5svn support.
Zhigang Gong [Sat, 8 Feb 2014 03:16:43 +0000 (11:16 +0800)]
Add clang/LLVM 3.5svn support.

clang/llvm 3.3 has some minor bugs, such as the vector ++/-- bug, which
was fixed in 3.4. But the 3.4 version introduces more severe OCL bugs, as
below:
http://llvm.org/bugs/show_bug.cgi?id=18119
http://llvm.org/bugs/show_bug.cgi?id=18120

It seems that the community will only fix these bugs in the ToT version
rather than in the llvm 3.4 branch. I think we had better enable clang/llvm
3.5 in beignet. Currently, 18120 is fixed in ToT, but 18119 still
breaks us. When 18119 gets fixed, I will switch the preferred version to
3.5.

Please note that when you build clang/llvm 3.5, you need to enable
cxx11 to make it compatible with beignet.

--enable-cxx11

v2:
fix the llvm3.4 issue.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoMake build compatible with Python 2.6
Jon Nordby [Thu, 6 Feb 2014 18:50:59 +0000 (19:50 +0100)]
Make build compatible with Python 2.6

Implicit numbers for format specifiers "{}" can only be used on Py2.7+,
and Py2.6 is still in use on, for instance, CentOS 6.5 and similar.
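
A minimal illustration of the portable form: explicit positional indices work
on Py2.6 as well as on later versions. The function name and strings here are
examples only, not taken from the build scripts.

```python
def describe(name, value):
    # "{0}" and "{1}" are explicit indices; a bare "{}" needs Py2.7 or newer.
    return "{0} = {1}".format(name, value)

line = describe("pass", 629)
```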

Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoFix the problem by kernel file open in utest
Junyan He [Sun, 26 Jan 2014 10:16:12 +0000 (18:16 +0800)]
Fix the problem by kernel file open in utest

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Tested-by: "Sun, Yi" <yi.sun@intel.com>
10 years agoUpdate documents.
Zhigang Gong [Mon, 20 Jan 2014 10:44:03 +0000 (18:44 +0800)]
Update documents.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoGBE: fixed the out-of-range JMPI.
Zhigang Gong [Mon, 27 Jan 2014 01:26:21 +0000 (09:26 +0800)]
GBE: fixed the out-of-range JMPI.

For a conditional jump whose distance is outside the S15 range [-32768, 32767],
we need to use an inverted jmp followed by an "add ip, ip, distance"
to implement it. This is a little hacky, as we need to change the nop
instruction to an add instruction manually.
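
A rough sketch of the decision described above, in Python rather than the backend's C++. The tuple-based pseudo-instructions and the names fits_s15, lower_cond_jump and "skip_add" are hypothetical; the real encoding and the exact distance adjustments are not modeled:

```python
S15_MIN, S15_MAX = -32768, 32767  # signed 15-bit-plus-sign range quoted above

def fits_s15(distance):
    """True if the jump distance can be encoded directly."""
    return S15_MIN <= distance <= S15_MAX

def lower_cond_jump(pred, distance):
    """Return a list of pseudo-instructions for a conditional jump."""
    if fits_s15(distance):
        return [("jmpi", pred, distance)]
    # Out of range: jump over the add when the predicate is false,
    # otherwise fall through and add the long distance to ip.
    return [("jmpi", "!" + pred, "skip_add"),
            ("add", "ip", "ip", distance)]

print(lower_cond_jump("f0", 100))
print(lower_cond_jump("f0", 40000))
```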

There is an optimization in which we insert the ADD instruction
only on demand. But that would need some extra analysis
of all the branching instructions, and we would need to adjust the
distance of every branch instruction whose start point and end point
span the inserted instruction.

After this patch, luxrender's slg4 can render the scene "alloy"
correctly.

v2:
fix the unconditional branch too.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang, Rong R <rong.r.yang@intel.com>
10 years agoWhen local_work_size is null, try to choose a local_work_size.
Yang Rong [Sun, 26 Jan 2014 08:36:58 +0000 (16:36 +0800)]
When local_work_size is null, try to choose a local_work_size.

After fixing all the failures found when local_work_size is not 1, re-enable
it to improve performance.

V2: refine to skip some useless loops.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoMultiply by the register's hstride in suboffset.
Yang Rong [Tue, 28 Jan 2014 03:03:15 +0000 (11:03 +0800)]
Multiply by the register's hstride in suboffset.

When a register's hstride is not 0 or 1, suboffset will get the wrong element.
Also change some offsets that were already multiplied by hstride in hard-coded form.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Implement complete register spill policy.
Zhigang Gong [Sun, 26 Jan 2014 06:07:14 +0000 (14:07 +0800)]
GBE: Implement complete register spill policy.

This patch implements a complete register spill policy.

When a register needs to be spilled, we always choose the
register which is in the spill candidate map and has the
maximum endpoint. One trick I used here is to merge both
the register's endpoint value and the register itself
into one single key. Then I can use one map to implement a
map in descending order of its value (the instruction
endpoint value). This patch supports spilling both vectors
and non-vectors.
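
The key-merging trick can be sketched as follows. This is an illustration in Python, not the backend's C++ code; the 32-bit register field and the names make_key, split_key and pick_spill_victim are assumptions made for the example:

```python
REG_BITS = 32  # assumed width of the register-id field

def make_key(end_point, reg):
    """Pack the endpoint (high bits) and register id (low bits) into one key."""
    return (end_point << REG_BITS) | reg

def split_key(key):
    """Recover (end_point, reg) from a packed key."""
    return key >> REG_BITS, key & ((1 << REG_BITS) - 1)

def pick_spill_victim(candidates):
    """candidates: iterable of (reg, end_point) pairs.

    Sorting the packed keys in descending order puts the register with
    the largest endpoint first, which is the one chosen for spilling.
    """
    keys = sorted((make_key(ep, reg) for reg, ep in candidates), reverse=True)
    end_point, reg = split_key(keys[0])
    return reg, end_point

print(pick_spill_victim([(10, 5), (100, 42), (7, 42)]))
```

Because the endpoint occupies the high bits, ordering by key is ordering by endpoint first, with the register id only breaking ties.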

I also moved the scratch memory allocation from
instruction selection to register allocation. We may later
use the internal interval information to reduce the scratch
memory consumption.

Another big change is that I don't perform the real
spill on the fly. Instead, I moved the real spill to the end of
register allocation, and then spill all the registers
in the spillSet in one pass. This has the following advantages:
1. It only needs to loop over all instructions once.
2. When spilling one instruction, we know all the registers' status,
   so it's easy to know the correct scratch id for each register.
   Actually, the previous implementation had a bug here.

The last part is to work around the spill instruction restriction.
As Ruiling pointed out, the spill instruction (scratch read/write)
doesn't support predication correctly for non-DW data types.

This patch avoids spilling any register of an unsupported type.

After this patch, both the luxrender and opencv examples work fine on
my machine.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang, Rong R <rong.r.yang@intel.com>
10 years agoGBE: prepare to optimize the register spilling policy.
Zhigang Gong [Fri, 24 Jan 2014 09:31:29 +0000 (17:31 +0800)]
GBE: prepare to optimize the register spilling policy.

It's better to choose the proper register to spill
rather than always spilling the current register. This patch
is a preparation for a better spilling policy.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang, Rong R <rong.r.yang@intel.com>
10 years agoGBE: refine register allocation output.
Zhigang Gong [Fri, 24 Jan 2014 04:33:10 +0000 (12:33 +0800)]
GBE: refine register allocation output.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoAdd the device id for haswell GT.
Junyan He [Tue, 14 Jan 2014 08:43:42 +0000 (16:43 +0800)]
Add the device id for haswell GT.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoFix the bug in removeLOADIs function.
Junyan He [Wed, 22 Jan 2014 06:02:30 +0000 (14:02 +0800)]
Fix the bug in removeLOADIs function.

The logic for replacing the dst of the instruction
was using the src number and getSrc. Fix this problem.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: allow the bool registers to be expired.
Zhigang Gong [Thu, 23 Jan 2014 06:25:55 +0000 (14:25 +0800)]
GBE: allow the bool registers to be expired.

After the previous patch's extra liveness analysis, we can allow bool
registers to be expired now.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agoGBE: Implement an extra liveness analysis for the Gen backend.
Zhigang Gong [Thu, 23 Jan 2014 06:15:05 +0000 (14:15 +0800)]
GBE: Implement an extra liveness analysis for the Gen backend.

  Consider the following scenario: %100's normal liveness starts at Ln-1's
  position. In the normal analysis, Ln-1 is not Ln's predecessor, so the liveness
  of %100 is passed to Ln but is not then passed on to L0.

  But consider that we are running on a multi-lane vector machine with predication.
  The unconditional BR in Ln-1 may be removed, and execution will then enter Ln with
  a subset of the inverse of Ln-1's predication set. For example, if the active lanes
  are 0-7 when running Ln-1, then the active lanes at Ln are 8-15. At the end of Ln,
  a subset of lanes 8-15 will jump to L0. If a register %10 is allocated to the same
  GRF as %100, given that their normal liveness doesn't overlap, then a subset of
  lanes 8-15 will be modified. If %10 and %100 are the same vector data type, we are
  fine. But if %100 is a float vector and %10 is a bool or short vector, then we hit
  a bug here.

L0:
  ...
  %10 = 5
  ...
Ln-1:
  %100 = 2
  BR Ln+1

Ln:
  ...
  BR(%xxx) L0

Ln+1:
  %101 = %100 + 2;
  ...

  The solution to this issue is to build an extra liveness analysis. We start with
  those BBs that have a backward jump, pass all their liveOut registers as the extra
  liveIn of the current BB, and then forward this extra liveIn to all the blocks. This
  is very similar to the normal liveness analysis, just in the reverse direction.
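
One possible reading of this extra pass, sketched in Python. This is not the beignet implementation: the function name, the data layout, and in particular the choice to seed the target of each backward jump with the jumping block's liveOut are assumptions made for illustration. The block graph is a simplified version of the scenario above, with the path into Ln elided.

```python
def extra_liveness(order, succs, live_out):
    """order: block names in layout order; succs: block -> successor list;
    live_out: block -> set of registers from the normal analysis.
    Returns block -> set of extra live-in registers."""
    pos = {b: i for i, b in enumerate(order)}
    extra_in = {b: set() for b in order}
    work = []
    # Seed: a backward edge src -> dst passes src's liveOut to dst.
    for src in order:
        for dst in succs[src]:
            if pos[dst] <= pos[src]:
                extra_in[dst] |= live_out[src]
                work.append(dst)
    # Forward propagation along successors until nothing changes.
    while work:
        b = work.pop()
        for s in succs[b]:
            if not extra_in[b] <= extra_in[s]:
                extra_in[s] |= extra_in[b]
                work.append(s)
    return extra_in

order = ["L0", "Ln-1", "Ln", "Ln+1"]
succs = {"L0": ["Ln-1"], "Ln-1": ["Ln+1"], "Ln": ["L0", "Ln+1"], "Ln+1": []}
live_out = {"L0": set(), "Ln-1": {"%100"}, "Ln": {"%100"}, "Ln+1": set()}
result = extra_liveness(order, succs, live_out)
print(result["L0"])
```

With %100 recorded as extra live-in for L0, a register allocator consulting this set would no longer let %10 share a GRF with %100.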

  Thanks to Yang Rong, who found this bug.

v2:
  Don't remove livein when initializing the extra livein.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agoGBE: increase the disassembly output's readability.
Zhigang Gong [Wed, 22 Jan 2014 02:32:08 +0000 (10:32 +0800)]
GBE: increase the disassembly output's readability.

Add label information and the instruction address
prefix. Make the address consistent with fulsim.
And also make the register allocation output a little
bit prettier.

Now the disassembly output is as below:
compiler_ceil's disassemble begin:
  L0:
    (0       )  mov(1)          f0<1>UW         0x0UW                           { align1 WE_all };
    ....
    (32      )  (+f0) mov(16)   g1<1>UW         0x1UW                           { align1 WE_normal 1H };
  L1:
    (34      )  mov(16)         g112<1>UD       g0<8,8,1>UD                     { align1 WE_all 1H };
    ...
compiler_ceil's disassemble end.

The register allocation output is as below:
%26      g2  .8   4  B  [0        -> 0       ]
%28      g2  .12  4  B  [0        -> 6       ]
%29      g2  .16  4  B  [0        -> 9       ]
%30      g126.0   64 B  [2        -> 3       ]
%31      g124.0   64 B  [3        -> 4       ]

Please note that the register allocation output is not correct
when the register is a pure scalar (bool) register which is allocated
at the backend instruction selection stage. To be fixed.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agoGBE: fixed a bug in sample instruction.
Zhigang Gong [Tue, 21 Jan 2014 05:15:39 +0000 (13:15 +0800)]
GBE: fixed a bug in sample instruction.

The sample instruction only has 3 source operands now, not 4.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agoGBE: fix some incorrect gen ir output messages.
Zhigang Gong [Tue, 21 Jan 2014 04:13:04 +0000 (12:13 +0800)]
GBE: fix some incorrect gen ir output messages.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoGBE: don't allocate grf for those bools which map to flag.
Zhigang Gong [Tue, 21 Jan 2014 00:34:29 +0000 (08:34 +0800)]
GBE: don't allocate grf for those bools which map to flag.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agobuild: work around an old version cmake bug.
Zhigang Gong [Mon, 20 Jan 2014 09:14:48 +0000 (17:14 +0800)]
build: work around an old version cmake bug.

On Fedora Core 15 with cmake 2.8.4, Yi experienced a build error.
It turns out that cmake may handle file paths with double
slashes incorrectly when the file is on a target's dependency list and
is also the output file name of a custom command.

This small patch works around that issue.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Sun, Yi" <yi.sun@intel.com>
10 years agoGBE: use native exp instruction when enough precision
Guo Yejun [Mon, 20 Jan 2014 00:38:23 +0000 (08:38 +0800)]
GBE: use native exp instruction when enough precision

For input data with enough precision, use the native exp instruction;
otherwise, use the software path to emulate the exp function.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoFix the bug of deleting a load instruction multiple times in lowering
Junyan He [Mon, 20 Jan 2014 03:28:43 +0000 (11:28 +0800)]
Fix the bug of deleting a load instruction multiple times in lowering

When the load instruction has multi-value destinations, the load
instruction in the buildConstantPush function will be replaced many
times, which can cause potential problems.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd utest compiler_private_data_overflow
Yongjia Zhang [Fri, 17 Jan 2014 08:20:02 +0000 (16:20 +0800)]
Add utest compiler_private_data_overflow

utests: compiler_private_data_overflow aims to hit a stack larger than
1KB. It will fail with the old beignet, which allocates a 1KB stack
no matter what the actual stack usage of the kernel is.

Signed-off-by: Yongjia Zhang <zhang_yong_jia@126.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd some native functions vector proto.
Yang Rong [Fri, 17 Jan 2014 08:22:56 +0000 (16:22 +0800)]
Add some native functions vector proto.

Native functions were previously just defined as the normal functions, so they
didn't need vector prototypes. Now only native_exp2 and native_sqrt are defined
as exp2 and sqrt, so enable the vector prototypes for the others.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoRemove builtin function fma from utest_math_gen.py.
Yi Sun [Thu, 9 Jan 2014 07:56:04 +0000 (15:56 +0800)]
Remove builtin function fma from utest_math_gen.py.

Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoutests: Put all the generated kernel files to .gitignore at runtime.
Zhigang Gong [Tue, 14 Jan 2014 03:10:00 +0000 (11:10 +0800)]
utests: Put all the generated kernel files to .gitignore at runtime.

As there are so many generated kernel files, it's annoying when I use
git status to check the modified files and newly added files. This patch
puts all of them into the .gitignore file, which makes things easier.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fixed the hacky code of 3D image read/write.
Zhigang Gong [Fri, 17 Jan 2014 05:05:20 +0000 (13:05 +0800)]
GBE: fixed the hacky code of 3D image read/write.

The previous implementation used a magic virtual register (0) to
indicate that this is a 2D read/write. This is too hacky and may hide
bugs in the future. Now fix it without creating any dummy virtual
register.

Also clean up some useless enums.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fix the hack code of sampler offset handling.
Zhigang Gong [Fri, 17 Jan 2014 04:26:47 +0000 (12:26 +0800)]
GBE: fix the hack code of sampler offset handling.

The previous implementation used a virtual register to pass the offset
to the back end, which is too hacky. Now fix it.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fixed the stack allocation.
Zhigang Gong [Fri, 17 Jan 2014 02:42:25 +0000 (10:42 +0800)]
GBE: fixed the stack allocation.

Yongjia wrote a case that hit the previous 1KB limitation. I took a look at
the stack pointer related code and found that the implementation does not
comply with the OCL spec.

According to OpenCL spec, section 6.9:

d. Variable length arrays and structures with flexible (or unsized) arrays are not supported.

Thus all the local variable sizes should be constant, and we can
manipulate the stack pointer more easily: there is no need to do the alignment
calculation at runtime, and we can get the exact stack size and then
allocate the stack on demand. I still put a limitation there, which
is 64KB.
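
To illustrate why constant sizes help, here is a small Python sketch of a compile-time stack layout. It is not the GBE code; the function names and the (name, size, alignment) input format are hypothetical, and only the 64KB cap comes from the text above:

```python
STACK_LIMIT = 64 * 1024  # the 64KB limitation mentioned above

def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment

def layout_stack(variables):
    """variables: list of (name, size, alignment), all known at compile time.
    Returns (offsets, total_size); no runtime alignment math is needed."""
    offsets = {}
    sp = 0
    for name, size, alignment in variables:
        sp = align_up(sp, alignment)
        offsets[name] = sp
        sp += size
    if sp > STACK_LIMIT:
        raise ValueError("stack size %d exceeds the 64KB limit" % sp)
    return offsets, sp

offsets, total = layout_stack([("a", 4, 4), ("buf", 1500, 16), ("c", 1, 1)])
print(offsets, total)
```

The total here is larger than 1KB, which is the kind of kernel the old fixed 1KB allocation could not handle.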

v2:
don't add the step if the step is zero.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: move the image info register allocation to GEN IR stage.
Zhigang Gong [Thu, 16 Jan 2014 03:56:15 +0000 (11:56 +0800)]
GBE: move the image info register allocation to GEN IR stage.

If we allocate the image info register at the code generation stage,
we miss the liveness calculation. Thus there is a potential risk
that some image information register's liveness data is incorrect, which
may cause very subtle bugs. Now fix it.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: move the image allocation to the GEN IR stage.
Zhigang Gong [Thu, 16 Jan 2014 02:16:36 +0000 (10:16 +0800)]
GBE: move the image allocation to the GEN IR stage.

The image register should be translated to a constant at the GEN IR
stage, so that the register allocator does not allocate an unnecessary
register for the image id.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE/Sampler: Simplify the sampler handling.
Zhigang Gong [Wed, 15 Jan 2014 11:50:55 +0000 (19:50 +0800)]
GBE/Sampler: Simplify the sampler handling.

Move the sampler allocation to the Gen stage. Then we don't need to
maintain a fake key register, which may also confuse the later
register allocation phase.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fixed a register liveness bug for getsamplerinfo instruction.
Zhigang Gong [Wed, 15 Jan 2014 07:26:07 +0000 (15:26 +0800)]
GBE: fixed a register liveness bug for getsamplerinfo instruction.

The previous implementation inserted the ocl::samplerinfo into the
instruction after the liveness calculation stage, so the liveness
information was not correct for that register, which may cause some
test cases to fail. Now fix it.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agotypo: bsically to basically
Igor Gnatenko [Mon, 13 Jan 2014 21:31:39 +0000 (01:31 +0400)]
typo: bsically to basically

Signed-off-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agocmake: use libdir macros
Igor Gnatenko [Thu, 16 Jan 2014 07:19:53 +0000 (11:19 +0400)]
cmake: use libdir macros

Don't hardcode ${prefix}/lib. It is better to let the maintainer choose where to install libs.
We will use ${LIB_INSTALL_DIR}, which by default points to
${CMAKE_INSTALL_PREFIX}/lib. The maintainer can redefine it with
-DLIB_INSTALL_DIR=/usr/lib64 or similar.
Let's use the libdir macros.

Signed-off-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoChange compiler_function_argument3 to cover llvm.memcpy.
Yang Rong [Wed, 15 Jan 2014 08:31:06 +0000 (16:31 +0800)]
Change compiler_function_argument3 to cover llvm.memcpy.

We found that clang emits llvm.memcpy when assigning one struct to another
if sizeof(struct) > 64. Add an assignment to produce llvm.memcpy.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd llvm intrinsic function llvm.memset and llvm.memcpy support.
Yang Rong [Thu, 16 Jan 2014 07:38:30 +0000 (15:38 +0800)]
Add llvm intrinsic function llvm.memset and llvm.memcpy support.

SPIR 1.2 requires llvm.memcpy support, and llvm will sometimes emit llvm.memset.
So add a pass to lower these two intrinsic functions, and then inline them.

In the intrinsic lowering pass, find all llvm.memset and llvm.memcpy calls and replace
them with calls to the functions __gen_memset_x and __gen_memcpy_xx, where x and xx stand for the address space.

Because this pass runs after clang, and clang seems to strip unused functions,
implement the __gen_memset_x and __gen_memcpy_xx functions in the precompiled module, then link
them.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoUse OCL_USE_PCH to control whether the pch is used.
Yang Rong [Wed, 15 Jan 2014 08:31:04 +0000 (16:31 +0800)]
Use OCL_USE_PCH to control whether the pch is used.

Junyan added the environment variable OCL_USE_PCH, but it was not being used.
Enable it.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: improve precision of remquo
Lv Meng [Mon, 13 Jan 2014 05:50:25 +0000 (13:50 +0800)]
GBE: improve precision of remquo

Signed-off-by: Lv Meng <meng.lv@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: improve precision of hypot
Lv Meng [Mon, 13 Jan 2014 01:17:35 +0000 (09:17 +0800)]
GBE: improve precision of hypot

Signed-off-by: Lv Meng <meng.lv@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: improve precision of exp10
Lv Meng [Mon, 13 Jan 2014 00:54:02 +0000 (08:54 +0800)]
GBE: improve precision of exp10

Signed-off-by: Lv Meng <meng.lv@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Improve precision of cbrt
Ruiling Song [Fri, 10 Jan 2014 05:39:43 +0000 (13:39 +0800)]
GBE: Improve precision of cbrt

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Improve precision of atan2
Ruiling Song [Fri, 10 Jan 2014 05:39:42 +0000 (13:39 +0800)]
GBE: Improve precision of atan2

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Improve atan precision
Ruiling Song [Fri, 10 Jan 2014 05:39:41 +0000 (13:39 +0800)]
GBE: Improve atan precision

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>