review.tizen.org Git - contrib/beignet.git/log

GBE: Add a new pass to handle barrier function's noduplicate attribute correctly.

This pass removes or adds the noduplicate function attribute for barrier functions.
Basically, we want to set NoDuplicate for those __gen_barrier_xxx functions. But if
a sub function calls those barrier functions, the sub function will not be inlined
by llvm's inlining pass, which is not what we want. Inlining such a function into
the caller is safe; we just don't want the call to be duplicated. So introduce this
pass to remove the NoDuplicate function attribute before the inlining pass and restore
it afterwards.

v2:
fix the module changed check.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

Statistics of case running

summary:
-----------------
1. Add struct RStatistics to count the number of passed cases (passCount), failed cases (failCount), and finished runs (finishrun).

2. Print a statistics line; if the terminal is too narrow, don't print it:
  ......
  test_load_program_from_bin()    [SUCCESS]
  profiling_exec()    [SUCCESS]
  enqueue_copy_buf()    [SUCCESS]
   [run/total: 656/656]      pass: 629; fail: 25; pass rate: 0.961890

3. If a case crashes, count it as failed; add a function to show the statistics summary.

4. When all cases have finished, list a summary as follows:
summary:
----------
  total: 656
  run: 656
  pass: 629
  fail: 25
  pass rate: 0.961890

5. If you run ./utest_run &> log, the log will be a little messy; try the following command to analyse the log:

  sed 's/\r/\n/g' log | egrep "\w*" | sed -e 's/\s//g'

  After analysis:
  -----------------
......
builtin_minmag_float2()[SUCCESS]
builtin_minmag_float4()[SUCCESS]
builtin_minmag_float8()[SUCCESS]
builtin_minmag_float16()[SUCCESS]
builtin_nextafter_float()[FAILED]
builtin_nextafter_float2()[FAILED]
builtin_nextafter_float4()[FAILED]
......

6. Fix one issue: print out the crashed case name.

7. Delete the debug line in utests/compiler_basic_arithmetic.cpp, which
   outputs the kernel name.

8. Define function statistics() in struct UTest, which is called by "utest_run -a/-c/-n".
   We just call this function to run each case and print the statistics line.

Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Add one test case specific to unaligned buffer copy.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Optimize the unaligned buffer copy logic

Because the byte-aligned read and write send instructions are
very slow, we optimize to avoid using them.
We separate the unaligned case into three cases:
   1. The src and dst have the same %4 unaligned offset.
      Then we just need to handle the first and last dword.
   2. The src has a bigger %4 unaligned offset than the dst.
      We need to do some shift and montage between src[i]
      and src[i+1].
   3. The last case: the src has a smaller %4 unaligned offset.
      Then we need to do the same for src[i-1] and src[i].
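
The "shift and montage" step in cases 2 and 3 can be sketched in Python as below. This is only an illustration under assumptions: the helper names (classify, read_dword, merge) are hypothetical, little-endian dwords are assumed, and the real kernels are OpenCL C files, not Python.

```python
import struct

def classify(src_offset, dst_offset):
    """Pick one of the three cases by comparing the %4 offsets."""
    s, d = src_offset % 4, dst_offset % 4
    if s == d:
        return "same"  # only the first and last dword need special care
    return "src_bigger" if s > d else "src_smaller"

def read_dword(buf, index):
    """Aligned little-endian dword read; index counts dwords, not bytes."""
    return struct.unpack_from("<I", buf, index * 4)[0]

def merge(lo, hi, shift_bytes):
    """Build one dword from two neighbouring aligned dwords.

    Keeps the upper (4 - shift_bytes) bytes of `lo` and the lower
    shift_bytes bytes of `hi`; shift_bytes must be 1, 2 or 3.
    """
    bits = 8 * shift_bytes
    return ((lo >> bits) | (hi << (32 - bits))) & 0xFFFFFFFF

buf = bytes(range(16))
# Bytes 1..4 of buf, obtained from the aligned dwords 0 and 1 only.
value = merge(read_dword(buf, 0), read_dword(buf, 1), 1)
```

Here value holds the same dword an unaligned read at byte offset 1 would return, but only dword-aligned reads were issued.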

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Add three copy cl files for Enqueue Copy usage.

Add these three cl files:
one for src and dst that are not aligned but have the same %4 offset,
a second for when the src's %4 offset is bigger than the dst's,
a third for when the src's %4 offset is smaller than the dst's.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Add kernels performance output

if environment variable OCL_OUTPUT_KERNEL_PERF is set non-zero,
then after the executable program exits, beignet will output the
time information of each kernel executed.

v2:fixed the patch's trailing whitespace problem.

v3: if OCL_OUTPUT_KERNEL_PERF is 1, the output will only
contain the time summary; if it is 2, the output will contain
the time summary and details. Add outputs 'Ave' and 'Dev'. 'Ave' is
the average time per kernel per execution round, 'Dev' is the
result of 'Ave' divide a kernel's all executions' standard deviation.

Signed-off-by: Yongjia Zhang <yongjia.zhang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Fix register liveness issue under simd mode.

As we run in SIMD mode with a predication mask to indicate active lanes,
if a vreg is defined in a loop and there are some uses of the vreg outside the loop,
the define point may be run several times under *different* predication masks.
For these kinds of vreg, we must extend the vreg liveness into the whole loop.
If we don't do this, its liveness is killed before the def point inside the loop.
If the vreg's corresponding physical reg is assigned to another vreg during the
killed period, and the instructions before the kill point are re-executed with a different predication,
the inactive lanes of the vreg may be overwritten. Then the out-of-loop use will get wrong data.

This patch fixes the HaarFixture case in opencv.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Optimize the forward jump instruction.

As we already check at the beginning of each BB whether all channels are inactive,
we don't really need to do this duplicate check at the end of a forward jump.

This patch gets about 25% performance gain for the luxmark's median scene.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

Refine the FCMP_ORD and FCMP_UNO.

If one of src0 and src1 of FCMP_ORD/FCMP_UNO is a constant, the constant
value must be ordered; otherwise, llvm would have optimized the instruction to true/false.
So discard this constant value and only compare the other src.
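
A small Python model of why the constant can be dropped, assuming the usual meaning of ordered (neither operand is NaN) and unordered (at least one operand is NaN). The function names are illustrative and not taken from the backend.

```python
import math

def fcmp_ord(a, b):
    """Ordered: true when neither operand is NaN."""
    return not (math.isnan(a) or math.isnan(b))

def fcmp_uno(a, b):
    """Unordered: true when at least one operand is NaN."""
    return math.isnan(a) or math.isnan(b)

def fcmp_ord_one_src(x):
    # The constant operand is known not to be NaN, so only x matters.
    # x == x is false only when x is NaN.
    return x == x

def fcmp_uno_one_src(x):
    # x != x is true only when x is NaN.
    return x != x

samples = [0.0, -1.5, float("inf"), float("nan")]
agree = all(
    fcmp_ord(x, 2.0) == fcmp_ord_one_src(x)
    and fcmp_uno(x, 2.0) == fcmp_uno_one_src(x)
    for x in samples
)
```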

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Refined the fmax and fmin builtins.

Because GEN's select instruction with cmod .l and .ge handles the NaN case,
use the compare and select instructions in gen ir for fmax and fmin. They will be
optimized to one sel_cmp, with no need to check isnan.
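
The behaviour being relied on can be modelled in Python as follows. This is a sketch under an assumption: the commit only says the select "handles the NaN case", and the model below takes that to mean that the non-NaN operand is returned, which is what fmax/fmin require. The explicit isnan calls exist only in this software model of the instruction; sel_ge and sel_l are hypothetical names.

```python
import math

def sel_ge(a, b):
    """Model of a select with a .ge condition (assumed NaN behaviour)."""
    if math.isnan(b):
        return a
    if math.isnan(a):
        return b
    return a if a >= b else b

def sel_l(a, b):
    """Model of a select with a .l condition (assumed NaN behaviour)."""
    if math.isnan(b):
        return a
    if math.isnan(a):
        return b
    return a if a < b else b

# In this model fmax and fmin each map to a single select.
fmax_sketch = sel_ge
fmin_sketch = sel_l
```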

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: "Zou, Nanhai" <nanhai.zou@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Add one test case for profiling test.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: make byte/short vload/vstore process one element each time.

Per OCL Spec, the computed address (p+offset*n) is 8-bit aligned for char,
and 16-bit aligned for short in vloadn & vstoren. That is, we cannot assume that
vload4 with a char pointer is 4-byte aligned. The previous implementation made
Clang generate a load or store with alignment 4 which in fact only has alignment 1.

We need to find another way to optimize vloadn.
But before that, let's keep vloadn and vstoren working correctly.
This could fix the regression issue caused by the byte/short optimization.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Add SROA and GVN pass to default optLevel.

SROA and GVN may introduce some integer types not supported by the backend.
Remove this type assert in GenWrite, and when these types are found, set the unit to
invalid. If the unit is invalid, use optLevel 0, which does not include SROA and GVN, and
try again.
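
The retry logic described above, as a minimal Python sketch. Everything here is hypothetical: build_unit stands in for the real compile step, the dict stands in for the unit, and "i33" is just a made-up marker for an unsupported integer type.

```python
def compile_with_fallback(build_unit, source, default_level=1):
    """Try the default optLevel first; retry at optLevel 0 if the unit is invalid."""
    unit = build_unit(source, default_level)
    if not unit["valid"]:
        # optLevel 0 skips SROA and GVN, so the unsupported types do not appear.
        unit = build_unit(source, 0)
    return unit

def fake_build_unit(source, opt_level):
    # Hypothetical stand-in for the compiler: pretend that the optimized
    # pipeline produces an unsupported type for sources containing "i33".
    unsupported = opt_level > 0 and "i33" in source
    return {"valid": not unsupported, "opt_level": opt_level}

plain = compile_with_fallback(fake_build_unit, "kernel_a")
retried = compile_with_fallback(fake_build_unit, "kernel_b i33")
```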

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

utests: Refine cases for sinpi.

The general algorithm is to reduce x to the range [-0.5,0.5] and then calculate the result.

v2. Correct the algorithm of sinpi.
Add some input data temporarily; we're going to design and implement an input data generator similar to what Conformance does.
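
One way to do the range reduction mentioned above, sketched in Python. This is an assumed reference computation, not the code in the utest; sinpi_reference is a hypothetical name.

```python
import math

def sinpi_reference(x):
    """Compute sin(pi * x) after reducing x to [-0.5, 0.5]."""
    # Reduce to r in [-1, 1] using the period 2: sin(pi*(x - 2k)) == sin(pi*x).
    r = x - 2.0 * round(x / 2.0)
    # Fold into [-0.5, 0.5] using sin(pi*(1 - r)) == sin(pi*r)
    # and sin(pi*(-1 - r)) == sin(pi*r).
    if r > 0.5:
        r = 1.0 - r
    elif r < -0.5:
        r = -1.0 - r
    return math.sin(math.pi * r)
```

After the reduction, integer inputs map to r == 0, so the result is exactly zero instead of the small rounding residue that math.sin(math.pi * x) would give.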

Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

Move the definition of union SF to the header file utest_helper.hpp

Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

Add clGetMemObjectFdIntel() api

Use this api to share a buffer between OpenCL and v4l2. After importing
the fd of the OpenCL memory object into v4l2, v4l2 can directly read frames
into this memory object by way of DMABUF, without a memory copy.

v2:
Check return value of cl_buffer_get_fd

Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

merge some state buffers into one buffer

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

Fix a convert float to long bug.

Converting some special float values, slightly larger than LONG_MAX, to long with sat
gives a wrong result. Simply use LONG_MAX when the float value is equal to LONG_MAX.
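
The corner case comes from LONG_MAX (2^63 - 1) not being representable as a float: it rounds up to 2^63. A Python sketch of a saturating conversion that treats this value as LONG_MAX is shown below. It assumes a non-NaN input, and convert_long_sat is an illustrative name, not the builtin's implementation.

```python
LONG_MAX = 2**63 - 1
LONG_MIN = -(2**63)

def convert_long_sat(f):
    """Saturating float -> 64-bit signed conversion (input assumed not NaN)."""
    # float(LONG_MAX) rounds to 2.0**63, so ">=" also catches the value
    # that compares equal to LONG_MAX after rounding.
    if f >= float(LONG_MAX):
        return LONG_MAX
    if f <= float(LONG_MIN):
        return LONG_MIN
    return int(f)  # truncates toward zero
```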

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Optimize byte/short load/store using untyped read/write

Scatter/gather are much worse than untyped read/write. So if we can pack
the load/store of char/short to use an untyped message, just do it.

v2:
add some assert in splitReg()

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: Fix a potential issue if increase srcNum.

If MAX_SRC_NUM is increased for ir::Instruction, unpredictable behaviour may happen.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: make vload3 only read 3 elements.

Clang will align the vec3 load into vec4, so we have to do it in the frontend.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: Optimize scratch memory usage using register interval

Scratch memory is a limited resource in HW, and different
registers have the opportunity to share the same scratch memory. So
I introduce an allocator for scratch memory management.

v2:
In order to reuse the registerFilePartitioner, I rename it as
SimpleAllocator, and derive ScratchAllocator & RegisterAllocator
from it.

v3:
fix a typo, scratch size is 12KB.
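
The idea of sharing scratch memory between registers whose intervals do not overlap can be sketched in Python as below. This is an illustration only: fixed-size slots, the 32-byte slot size, and the name assign_scratch are assumptions, and the real ScratchAllocator is C++ code derived from SimpleAllocator.

```python
def assign_scratch(intervals, slot_size=32):
    """Assign scratch offsets, reusing slots of registers whose interval has ended.

    intervals: list of (reg, start, end) tuples.
    Returns (offsets, total_bytes).
    """
    free = []        # offsets released by expired registers
    active = []      # (end, offset) of registers still alive
    offsets = {}
    next_offset = 0
    for reg, start, end in sorted(intervals, key=lambda t: t[1]):
        # Release the slots of every register that ended before this one starts.
        still_active = []
        for other_end, off in active:
            if other_end < start:
                free.append(off)
            else:
                still_active.append((other_end, off))
        active = still_active
        if free:
            off = free.pop()
        else:
            off = next_offset
            next_offset += slot_size
        offsets[reg] = off
        active.append((end, off))
    return offsets, next_offset

offsets, total = assign_scratch([("a", 0, 4), ("b", 2, 6), ("c", 5, 9)])
```

In this example "c" starts after "a" has ended, so both share offset 0 and only two slots are needed for three registers.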

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: show correct line number in build log

Sometimes we insert some code into the kernel, which
makes the line number reported in the build log
mismatch the line number in the kernel from the
programmer's view. Use #line to correct it.
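
A minimal Python sketch of the idea: after any inserted header, a "#line 1" directive makes the compiler number the following line as line 1 again. The function name and the header text are hypothetical and not taken from the patch.

```python
def wrap_kernel_source(header, user_source):
    """Prepend inserted code, then reset line numbering for the user's code."""
    # "#line 1" tells the preprocessor that the next source line is line 1,
    # so diagnostics match what the programmer sees in the kernel file.
    return header.rstrip("\n") + "\n#line 1\n" + user_source

wrapped = wrap_kernel_source(
    "#define HYPOTHETICAL_A 1\n#define HYPOTHETICAL_B 2",
    "__kernel void k() {}\n",
)
wrapped_lines = wrapped.split("\n")
```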

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: support getelementptr with ConstantExpr operand

Add support during LLVM IR -> Gen IR period when the
first operand of getelementptr is ConstantExpr.

utest is also added.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: add fast path for more math functions

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: remove the useless get sampler info function.

We don't need to get the sampler info dynamically, so
remove the corresponding instruction.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: optimize read_image to avoid get sampler info dynamically.

Most of the time, the user is using a const sampler value in the kernel
directly. Thus we don't need to get the sampler value through a function
call. And this way, the compiler front end could do much better optimization
than using the dynamic get sampler information. For the luxmark's
median/simple case, this patch could get about 30-45% performance gain.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: don't put a long live register to a selection vector.

If an element has very long interval, we don't want to put it into a
vector as it will add more pressure to the register allocation.

This patch reduces the spill registers by more than 20% for luxmark's
median scene benchmark (from 288 to 224).

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: prepare to optimize generic selection vector allocation.

Move the selection vector allocation after the register interval
calculation.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: fixed a potential bug in 64 bit instruction.

The current selection vector handling requires that the dst/src
vector starts at dst(0) or src(0).

v2:
fix an assertion.
v3:
fix a bug in gen_context.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fix the overflow bug in register spilling.

Change to use int32 to represent the maxID.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: code cleanup for read_image/write_image.

Remove some useless instructions and make the read/write_image
more readable.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fixed the incorrect max_dst_num and max_src_num.

Some I64 instructions are using more than 11 dst registers;
this patch changes the max src number to 16 and adds an assertion
to check whether we run into this type of issue again.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: Optimize write_image instruction for simd8 mode.

On simd8 mode, we can put the u,v,w,x,r,g,b,a to
a selection vector directly and don't need to
assign those values again.

Let's see an example, the following code is generated without this
patch which is doing a simple image copy:

    (26      )  (+f0) mov(8)    g113<1>F        g114<8,8,1>D                    { align1 WE_normal 1Q };
    (28      )  (+f0) send(8)   g108<1>UD       g112<8,8,1>F
                sampler (3, 0, 0, 1) mlen 2 rlen 4              { align1 WE_normal 1Q };
    (30      )  mov(8)          g99<1>UD        0x0UD                           { align1 WE_all 1Q };
    (32      )  mov(1)          g99.7<1>UD      0xffffUD                        { align1 WE_all };
    (34      )  mov(8)          g103<1>UD       0x0UD                           { align1 WE_all 1Q };
    (36      )  (+f0) mov(8)    g100<1>UD       g117<8,8,1>UD                   { align1 WE_normal 1Q };
    (38      )  (+f0) mov(8)    g101<1>UD       g114<8,8,1>UD                   { align1 WE_normal 1Q };
    (40      )  (+f0) mov(8)    g104<1>UD       g108<8,8,1>UD                   { align1 WE_normal 1Q };
    (42      )  (+f0) mov(8)    g105<1>UD       g109<8,8,1>UD                   { align1 WE_normal 1Q };
    (44      )  (+f0) mov(8)    g106<1>UD       g110<8,8,1>UD                   { align1 WE_normal 1Q };
    (46      )  (+f0) mov(8)    g107<1>UD       g111<8,8,1>UD                   { align1 WE_normal 1Q };
    (48      )  (+f0) send(8)   null            g99<8,8,1>UD
                renderunsupported target 5 mlen 9 rlen 0        { align1 WE_normal 1Q };
    (50      )  (+f0) mov(8)    g1<1>UW         0x1UW                           { align1 WE_normal 1Q };
  L1:
    (52      )  mov(8)          g112<1>UD       g0<8,8,1>UD                     { align1 WE_all 1Q };
    (54      )  send(8)         null            g112<8,8,1>UD
                thread_spawnerunsupported target 7 mlen 1 rlen 0 { align1 WE_normal 1Q EOT };

With this patch, we can optimize it as below:

    (26      )  (+f0) mov(8)    g106<1>F        g111<8,8,1>D                    { align1 WE_normal 1Q };
    (28      )  (+f0) send(8)   g114<1>UD       g105<8,8,1>F
                sampler (3, 0, 0, 1) mlen 2 rlen 4              { align1 WE_normal 1Q };
    (30      )  mov(8)          g109<1>UD       0x0UD                           { align1 WE_all 1Q };
    (32      )  mov(1)          g109.7<1>UD     0xffffUD                        { align1 WE_all };
    (34      )  mov(8)          g113<1>UD       0x0UD                           { align1 WE_all 1Q };
    (36      )  (+f0) send(8)   null            g109<8,8,1>UD
                renderunsupported target 5 mlen 9 rlen 0        { align1 WE_normal 1Q };
    (38      )  (+f0) mov(8)    g1<1>UW         0x1UW                           { align1 WE_normal 1Q };
  L1:
    (40      )  mov(8)          g112<1>UD       g0<8,8,1>UD                     { align1 WE_all 1Q };
    (42      )  send(8)         null            g112<8,8,1>UD
                thread_spawnerunsupported target 7 mlen 1 rlen 0 { align1 WE_normal 1Q EOT };

This patch could save about 8 instructions per write_image.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: optimize sample instruction.

The U,V,W registers could be allocated to a selection vector directly.
Then we can save some MOV instructions for the read_image functions.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

Change the order of the code

Fix the 66K problem in the OpenCV testing.
The bug was caused by the incorrect order
of the code, which made beignet
calculate the whole localsize of the kernel
file. Now the OpenCV test can pass.

Reviewed-by: Zhigang Gong <zhigang.gong@intel.com>

Fix a long DIV/REM hang.

There is a jumpi in long DIV/REM whose predication is any16/any8. So we
MUST AND the predication register with emask; otherwise it may loop forever.
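
Why the AND matters, modelled with plain bit masks in Python. This is a simplified model, assuming one flag bit per lane; any_lane_set is a hypothetical helper and not a backend function.

```python
def any_lane_set(flag, emask=None):
    """Model of an 'any' predicate: true if any (optionally masked) lane bit is set."""
    if emask is not None:
        flag &= emask
    return flag != 0

flag = 0xFF00   # stale bits left over in the inactive lanes 8-15
emask = 0x00FF  # only lanes 0-7 are active

loops_without_mask = any_lane_set(flag)      # stale bits keep the loop alive
loops_with_mask = any_lane_set(flag, emask)  # active lanes are all done
```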

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: improve precision of rootn

Signed-off-by: Lv Meng <meng.lv@intel.com>

Remove some unreasonable input values for rootn

In manual for function pow(), there's following description:
"If x is a finite value less than 0,
and y is a finite noninteger,
a domain error occurs, and a NaN is returned."
That means we can't calculate rootn on the CPU as pow(x,1.0/y), which is what the OpenCL spec mentions.
E.g. when y=3 and x=-8, rootn should return -2. But when we calculate pow(x, 1.0/y), it will return a NaN.
I didn't find a multi-root math function in glibc.
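
The same limitation shows up in Python, where math.pow raises ValueError for a negative base with a non-integer exponent instead of returning NaN. A possible CPU-side reference, sketched under the assumption that n is a non-zero integer (rootn_reference is a hypothetical name, not the utest's code):

```python
import math

def rootn_reference(x, n):
    """Reference for rootn(x, n) that handles negative x with odd n."""
    if x < 0 and n % 2 != 0:
        # An odd root of a negative number is the negated root of its magnitude.
        return -math.pow(-x, 1.0 / n)
    # For negative x with even n, math.pow raises ValueError here;
    # the C pow() would return NaN in that case.
    return math.pow(x, 1.0 / n)

try:
    math.pow(-8.0, 1.0 / 3)
    naive_failed = False
except ValueError:
    naive_failed = True
```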

Signed-off-by: Yi Sun <yi.sun@intel.com>

utests:add subnormal check by fpclassify.

Signed-off-by: Yi Sun <yi.sun@intel.com>
Signed-off-by: Shui yangwei <yangweix.shui@intel.com>

Change %.20f to %e.

This can make the error information more readable.

Signed-off-by: Yi Sun <yi.sun@intel.com>

GBE: add param to switch the behavior of math func

Add OCL_STRICT_CONFORMANCE to switch the behavior of math funcs.
If it is 1, the funcs will be high precision with a perf drop; if it is 0,
the fast path with good enough precision will be selected.

This change is to add the code basis, with 'sin' and 'cos' implemented
as examples, other math functions support will be added later.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>

utests: Remove test cases for function 'tgamma' 'erf' and 'erfc'

Since OpenCL conformance doesn't cover these functions at the moment,
we remove them temporarily.

Signed-off-by: Yi Sun <yi.sun@intel.com>

Improve precision of sinpi/cospi

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: fix terminfo library linkage

In some distros, the terminal libraries are divided into two
libraries, one is tinfo and the other is ncurses; in
other distros, there is only one single ncurses library with
all functions.
In order to link the proper terminal library for LLVM, the find_library
macro in cmake can be used. In this patch, tinfo is preferred,
so that it won't affect linkage behavior in distros with tinfo.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>

utests: define python interpreter via cmake variable

The reason for this fix is in commit
5b64170ef5e3e78d038186fb1132b11a8fec308e.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

CL: make the scratch size as a device resource attribute.

Actually, the scratch size is much like the local memory size
which should be a device dependent information.

This patch puts the scratch mem size into the device attribute
structure. And when the kernel needs more than the maximum scratch
memory, we just return an out-of-resource error rather than trigger
an assertion.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Song, Ruiling <ruiling.song@intel.com>

fix typo: blobTempName is assigned but not used

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Support 64Bit register spill.

Now we support DWORD & QWORD register spill/fill.

v2:
  only add poolOffset by 1 when we meet QWord register and poolOffset is 1.

v3:
  allocate reserved register pool unifiedly for src and dst register.
  when it spill a qword register, payload register should be retyped as dword per bottom/top logic.
  put a limit on the scratch space memory size.

v4:
  fix a typo.
  increase the reserved register from 6 to 8 for some complex instruction.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

cmake: Fix linking with LLVM/Terminfo

DEBUG: [ 9%] Building CXX object backend/src/CMakeFiles/gbe_bin_generater.dir/gbe_bin_generater.cpp.o
DEBUG: Linking CXX executable gbe_bin_generater
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x717): undefined reference to `setupterm'
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x727): undefined reference to `tigetnum'
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x730): undefined reference to `set_curterm'
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x738): undefined reference to `del_curterm'

Signed-off-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Bump to version 0.8.0.

This version brings many improvements compared to the last released version 0.3,
so we decided to bump the version to 0.8.0 directly. Before 1.0.0, we
have two steps left. One is performance optimization and the other is to
support OpenCL 1.2 by default.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

Docs: fix some markdown errors and add some new info.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

Fix build errors in llvm3.5 only system.

Some header files are missing if only llvm3.5 is installed. If a previous llvm was installed, these header files
remain in the system even after uninstalling it, so the error is not triggered.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix the cmake problem in FindLLVM.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

Update document for LLVM/Clang 3.5.

Also change the README.md to link to Beignet.mdw rather than to point to the wiki page.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fixed the unsafe tmpnam_r.

Use mkstemps instead.
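
For comparison, Python's tempfile.mkstemp follows the same pattern as mkstemps: the file is created and opened atomically and the name keeps a caller-chosen suffix, so there is no window between picking a name and creating the file. The ".ll" suffix below is a hypothetical choice for illustration.

```python
import os
import tempfile

# mkstemp creates and opens the file in one step and returns (fd, path).
fd, path = tempfile.mkstemp(suffix=".ll")
try:
    with os.fdopen(fd, "w") as f:
        f.write("; temporary file\n")
    created = os.path.exists(path)
finally:
    os.remove(path)
```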

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

Silence compilation warning in sampler functions.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

Add clang/LLVM 3.5svn support.

The clang/llvm 3.3 has some minor bugs such as the vector ++/-- which
was fixed in 3.4. But the 3.4 version introduces more severe OCL bugs as
below:
http://llvm.org/bugs/show_bug.cgi?id=18119
http://llvm.org/bugs/show_bug.cgi?id=18120

It seems that the community will only fix these bugs in the ToT version
rather than the llvm 3.4 branch. I think we'd better enable clang/llvm
3.5 in beignet. Currently, 18120 is fixed in ToT, but 18119 still
breaks us. When 18119 gets fixed, I will switch the preferred version to
3.5.

Please note, when you build clang/llvm 3.5, you need to enable
cxx11 to make it compatible with beignet.

--enable-cxx11

v2:
fix the llvm3.4 issue.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

Make build compatible with Python 2.6

Implicit numbers for format specifiers "{}" can only be used on Py2.7+,
and Py2.6 is still in use on for instance CentOS 6.5 and similar.
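
For example, the explicitly numbered form below works on Python 2.6 as well as later versions, while the auto-numbered "{}: {} cases" form needs 2.7+. The variable names and the string are illustrative only.

```python
name, count = "utest", 3

# Explicit field indices are accepted by Python 2.6 and later.
portable = "{0}: {1} cases".format(name, count)
```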

Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix the problem by kernel file open in utest

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Tested-by: "Sun, Yi" <yi.sun@intel.com>

Update documents.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

GBE: fixed the out-of-range JMPI.

For a conditional jump distance outside the S15 range [-32768, 32767],
we need to use an inverted jmp followed by an add ip, ip, distance
to implement it. It is a little hacky, as we need to change the nop instruction
to an add instruction manually.
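
The lowering described above, as a Python sketch that emits symbolic tuples instead of real Gen instructions. The tuple encoding, the "skip_add" placeholder for the short jump over the ADD, and the name lower_jump are assumptions for illustration.

```python
S15_MIN, S15_MAX = -32768, 32767

def lower_jump(predicate, distance):
    """Emit a jmpi if the distance fits in S15, else an inverted jmpi plus an add to ip."""
    if S15_MIN <= distance <= S15_MAX:
        return [("jmpi", predicate, distance)]
    # Out of range: the inverted jump skips the ADD when the branch is not
    # taken; otherwise the ADD moves ip by the full distance.
    inverted = predicate[1:] if predicate.startswith("!") else "!" + predicate
    return [("jmpi", inverted, "skip_add"), ("add", "ip", "ip", distance)]

near = lower_jump("f0", 100)
far = lower_jump("f0", 40000)
```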

There is an optimization in which we insert the
ADD instruction on demand. But that would need some extra analysis
of all the branching instructions, and we would need to adjust the distance
of every branch instruction whose start point and end point span
this instruction.

After this patch, the luxrender's slg4 could render the scene "alloy"
correctly.

v2:
fix the unconditional branch too.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang, Rong R <rong.r.yang@intel.com>

When local_work_size is null, try to choose a local_work_size.

After fixing all the failures found when local_work_size is not 1, re-enable it to
improve performance.

V2: refine to skip some useless loop.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Multiply by the register's hstride in suboffset.

When the register's hstride is not 0 or 1, suboffset will get the wrong element.
Also change some offsets that were already multiplied by hstride in hard-coded form.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Implement complete register spill policy.

This patch implements a complete register spill policy.

When it needs to spill a register, we always choose the
register which is in the spill candidate map and has the
maximum endpoint. One trick I used here is to merge both
the register's endpoint value and the register itself
into one single key. Then I can use one map to implement a
descending order map according to its value (the instruction
endpoint value). This patch supports spilling both vectors
and non-vectors.
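
The key-merging trick can be illustrated in Python as follows. The 32-bit split between endpoint and register id, and the helper names, are assumptions made for this sketch; the patch itself uses a C++ map.

```python
def make_key(end_point, reg):
    """Pack the endpoint into the high bits and the register id into the low bits."""
    return (end_point << 32) | reg

def split_key(key):
    """Recover (end_point, reg) from a packed key."""
    return key >> 32, key & 0xFFFFFFFF

# Spill candidates as (register id, interval endpoint) pairs.
candidates = [(10, 40), (11, 95), (12, 60)]
keys = sorted((make_key(end, reg) for reg, end in candidates), reverse=True)

# Because the endpoint occupies the high bits, the largest key belongs to
# the register with the maximum endpoint, which is the one to spill first.
victim_end, victim_reg = split_key(keys[0])
```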

And I move the scratch memory allocation from
instruction selection to register allocation. We may later
use the internal interval information to reduce the scratch
memory consumption.

Another big change is that I don't perform the real
spill on the fly. Instead, I move the real spill to the end of
all register allocation, then spill all the registers which are
in the spillSet in one pass. This has the following advantages:
1. It only needs to loop over all instructions once.
2. When spilling one instruction, we know all the registers' status.
Then it's easy to know the correct scratch id for each register.
Actually, the previous implementation has a bug here.

The last part is to avoid the spill instruction restriction.
As Ruiling pointed out, the spill instruction (scratch read/write)
doesn't support predication correctly for non-DW data types.

This patch avoids spilling any register of a non-supported type.

After this patch, both luxrender and opencv examples work fine on
my machine.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang, Rong R <rong.r.yang@intel.com>

GBE: prepare to optimize the register spilling policy.

It's better to choose the proper register to spill
rather than always spill current register. This patch
is a preparation of a better spilling policy.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang, Rong R <rong.r.yang@intel.com>

GBE: refine register allocation output.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

Add the device id for haswell GT.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix the bug in removeLOADIs function.

The logic for replacing the dst of the instruction
using the src number and getSrc. Fix this problem.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: allow the bool registers to be expired.

After the previous extra liveness analysis, we can allow bool
registers to be expired now.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

GBE: Implement an extra liveness analysis for the Gen backend.

  Consider the following scenario: %100's normal liveness will start from Ln-1's
  position. In normal analysis, Ln-1 is not Ln's predecessor, thus the liveness
  of %100 will be passed to Ln and then will not be passed to L0.

  But consider that we are running on a multi-lane vector machine with predication.
  The unconditional BR in Ln-1 may be removed and it will enter Ln with a subset of
  the inverse set of Ln-1's predication. For example, when running Ln-1, the active lanes
  are 0-7, then at Ln the active lanes are 8-15. Then at the end of Ln, a subset of 8-15
  will jump to L0. If a register %10 is allocated the same GRF as %100, given the fact
  that their normal liveness doesn't overlap, then a subset of the 8-15 lanes will be
  modified. If %10 and %100 are the same vector data type, then we are fine. But if
  %100 is a float vector, and %10 is a bool or short vector, then we hit a bug here.

L0:
  ...
  %10 = 5
  ...
Ln-1:
  %100 = 2
  BR Ln+1

Ln:
  ...
  BR(%xxx) L0

Ln+1:
  %101 = %100 + 2;
  ...

  The solution to fix this issue is to build an extra liveness analysis. We will start with
  those BBs with backward jump. Then pass all the liveOut register as extra liveIn
  of current BB and then forward this extra liveIn to all the blocks. This is very similar
  to the normal liveness analysis just with reverse direction.

  Thanks to Yang Rong, who found this bug.

v2:
  Don't remove livein when initialize the extra livein.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

GBE: increase the disassembly output's readability.

Add label information and the instruction address
prefix. Make the address consistent with fulsim.
And also make the register allocation output a little
bit prettier.

Now the disassembly output is as below:
compiler_ceil's disassemble begin:
  L0:
    (0       )  mov(1)          f0<1>UW         0x0UW                           { align1 WE_all };
    ....
    (32      )  (+f0) mov(16)   g1<1>UW         0x1UW                           { align1 WE_normal 1H };
  L1:
    (34      )  mov(16)         g112<1>UD       g0<8,8,1>UD                     { align1 WE_all 1H };
    ...
compiler_ceil's disassemble end.

The register allocation output is as below:
%26      g2  .8   4  B  [0        -> 0       ]
%28      g2  .12  4  B  [0        -> 6       ]
%29      g2  .16  4  B  [0        -> 9       ]
%30      g126.0   64 B  [2        -> 3       ]
%31      g124.0   64 B  [3        -> 4       ]

Please note, the register allocation's output is not correct
when the register is a pure scalar (bool) register which is allocated
at the backend instruction selection stage. To be fixed.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

GBE: fixed a bug in sample instruction.

The sample instruction has only 3 source operands now, not 4.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

GBE: fix some incorrect gen ir output messages.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

GBE: don't allocate grf for those bools which map to flag.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

build: work around an old version cmake bug.

On Fedora Core 15 with cmake 2.8.4, Yi experienced a build error.
It turns out that cmake may handle file directories with double
slashes incorrectly when the file is on a target's dependency list and
is also the output file name of a custom command.

This small patch works around that issue.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Sun, Yi" <yi.sun@intel.com>

GBE: use native exp instruction when enough precision

For input data with enough precision, use the native exp instruction;
otherwise, use the software path to emulate the exp function.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

Fix the bug of deleting a load instruction multiple times in lowering

When a load instruction has multiple destination values, the load
instruction in the buildConstantPush function is replaced multiple
times, which can cause problems.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Add utest compiler_private_data_overflow

utests: compiler_private_data_overflow aims to use a stack larger than
1KB. It fails with the old beignet, which allocates a 1KB stack
regardless of the kernel's actual stack usage.

Signed-off-by: Yongjia Zhang<zhang_yong_jia@126.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Add vector prototypes for some native functions.

Native functions were previously just defined as the normal functions, so they
didn't need vector prototypes. Now only native_exp2 and native_sqrt are defined
as exp2 and sqrt, so enable the vector prototypes for the others.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Remove builtin function fma from utest_math_gen.py.

Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

utests: Put all the generated kernel files to .gitignore at runtime.

As there are so many generated kernel files, it's annoying when I use
git status to check the modified files and newly added files. This patch
puts all of them into the .gitignore file, which makes things easier.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fixed the hacky code of 3D image read/write.

The previous implementation used a magic virtual register (0) to
indicate a 2D read/write. This is too hacky and may hide
bugs in the future. Now fix it without creating any dummy virtual
register.

Also clean up some useless enums.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fix the hacky code of sampler offset handling.

The previous implementation used a virtual register to pass the offset
to the back end, which is too hacky. Now fix it.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fixed the stack allocation.

Yongjia wrote a case that hit the previous 1KB limitation. I took a look at
the stack pointer related code and found that the implementation does
not comply with the OCL spec.

According to OpenCL spec, section 6.9:

d. Variable length arrays and structures with flexible (or unsized) arrays are not supported.

Thus all local variable sizes should be constant, so we can
manipulate the stack pointer more easily: there is no need to do the
alignment calculation at runtime, and we can get the exact stack size and
then allocate the stack on demand. I still put a limit there, which
is 64KB.
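
As a rough illustration of why constant sizes help, the following Python sketch
lays out a stack entirely ahead of time. It is not the code from this patch;
the function names and the (size, alignment) input format are assumptions made
for the example, and only the 64KB limit comes from the text above.

```python
MAX_STACK_SIZE = 64 * 1024  # the 64KB limit mentioned above


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def layout_stack(variables):
    """Assign a stack offset to each (size, alignment) pair.

    All sizes are compile-time constants, so both the offsets and the
    total stack size are known before the kernel runs.
    """
    offsets = []
    top = 0
    for size, alignment in variables:
        top = align_up(top, alignment)  # alignment resolved ahead of time
        offsets.append(top)
        top += size
    if top > MAX_STACK_SIZE:
        raise ValueError("stack size %d exceeds the %d byte limit"
                         % (top, MAX_STACK_SIZE))
    return offsets, top


# Hypothetical kernel with a char, an int and a 2KB float array.
offsets, total = layout_stack([(1, 1), (4, 4), (2048, 4)])
```

Here the total is 2056 bytes, which is more than the old fixed 1KB stack but
well under the 64KB limit.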

v2:
don't add the step if the step is zero.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: move the image info register allocation to GEN IR stage.

If we allocate the image info register at the code generation stage,
we miss the liveness calculation. Thus there is a potential risk
that some image information register's liveness data is incorrect, which
may cause very subtle bugs. Now fix it.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: move the image allocation to the GEN IR stage.

The image register should be translated to a constant at the GEN IR
stage so that the register allocator does not allocate an unnecessary
register for the image id.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE/Sampler: Simplify the sampler handling.

Move the sampler allocation to the Gen stage. Then we don't need to
maintain a fake key register, which may also confuse the later
register allocation phase.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fixed a register liveness bug for getsamplerinfo instruction.

The previous implementation inserted the ocl::samplerinfo into the
instruction after the liveness calculation stage, so the liveness
information is not correct for that register and may cause some
test cases to fail. Now fix it.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

typo: bsically to basically

Signed-off-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

cmake: use libdir macros

Don't hardcode ${prefix}/lib. It is better to let the maintainer choose where to install libs.
We will use ${LIB_INSTALL_DIR}, which by default points to
${CMAKE_INSTALL_PREFIX}/lib, but the maintainer can redefine it with
-DLIB_INSTALL_DIR=/usr/lib64 or similar.
Let's use libdir macros.

Signed-off-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Change compiler_function_argument3 to cover llvm.memcpy.

We found that clang would emit llvm.memcpy when assigning one struct to another
if sizeof(struct) > 64. Add an assignment to produce llvm.memcpy.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Add llvm intrinsic function llvm.memset and llvm.memcpy support.

SPIR 1.2 requires llvm.memcpy support, and llvm sometimes emits llvm.memset.
So add a pass to lower these two intrinsic functions, and then inline them.

In the intrinsic lowering pass, find all llvm.memset and llvm.memcpy calls and replace
them with calls to __gen_memset_x and __gen_memcpy_xx, where x and xx stand for the address space.

Because this pass runs after clang, and unused functions seem to be stripped after clang,
implement the __gen_memset_x and __gen_memcpy_xx functions in a precompiled module, then link
them.
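
The renaming step can be pictured with the small Python sketch below. It is only
an illustration of the naming scheme described above, not the pass itself: the
one-letter address space suffixes and the helper name are assumptions, since
the message does not spell them out.

```python
# Hypothetical mapping from address space to the suffix letter used in
# the replacement function name; the real letters may differ.
SPACE_SUFFIX = {"private": "p", "global": "g", "local": "l"}


def lowered_name(intrinsic, dst_space, src_space=None):
    """Return the replacement function name for an intrinsic call.

    memset takes one suffix (destination address space); memcpy takes
    two (destination, then source). Other intrinsics are left alone.
    """
    if intrinsic.startswith("llvm.memset"):
        return "__gen_memset_" + SPACE_SUFFIX[dst_space]
    if intrinsic.startswith("llvm.memcpy"):
        return ("__gen_memcpy_" + SPACE_SUFFIX[dst_space]
                + SPACE_SUFFIX[src_space])
    return None


memset_name = lowered_name("llvm.memset", "global")
memcpy_name = lowered_name("llvm.memcpy", "private", "global")
```

Because each name encodes the address spaces, one small function per
combination can live in the precompiled module and be linked in and inlined.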

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Use OCL_USE_PCH to control whether to use pch or not.

Junyan added the environment variable OCL_USE_PCH, but it is not used.
Enable it.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: improve precision of remquo

Signed-off-by: Lv Meng <meng.lv@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: improve precision of hypot

Signed-off-by: Lv Meng <meng.lv@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: improve precision of exp10

Signed-off-by: Lv Meng <meng.lv@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Improve precision of cbrt

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Improve precision of atan2

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Improve atan precision

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>