contrib/beignet.git
10 years ago  GBE: fixed a regression at "Long" div/rem.
Zhigang Gong [Sun, 4 May 2014 00:59:41 +0000 (08:59 +0800)]
GBE: fixed a regression at "Long" div/rem.

If GEN_PREDICATE_ALIGN1_ANY8H/ANY16H or ALL8H/ALL16H
is used, we must make sure the inactive lanes are initialized
correctly. For the "ANY" conditions, all the inactive lanes need to
be cleared to zero. For the "ALL" conditions, all the inactive lanes
need to be set to all 1s. Otherwise, it may cause an infinite loop.
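The lane-initialization rule can be sketched with a small model (illustrative Python; the real ANY/ALL predication is evaluated by the EU hardware, not by code like this):

```python
def pred_any(flag_bits, active_mask):
    # ANY8H/ANY16H: the predicate fires if any lane's flag bit is set.
    # Inactive lanes must be cleared to 0; a stale 1 in an inactive lane
    # would keep the predicate true forever (infinite loop).
    lanes = [f if a else 0 for f, a in zip(flag_bits, active_mask)]
    return any(lanes)

def pred_all(flag_bits, active_mask):
    # ALL8H/ALL16H: the predicate fires only if every lane's bit is set.
    # Inactive lanes must be set to 1; a stale 0 in an inactive lane
    # would keep the predicate false forever.
    lanes = [f if a else 1 for f, a in zip(flag_bits, active_mask)]
    return all(lanes)
```

With the inactive fourth lane forced to the right value, the stale flag bit no longer affects the result.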

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  Init Benchmark suite
Yi Sun [Mon, 28 Apr 2014 05:31:05 +0000 (13:31 +0800)]
Init Benchmark suite

The first benchmark case is named enqueue_copy_buf.

Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years ago  GBE: reserve flag0.0 for large basic block.
Zhigang Gong [Fri, 25 Apr 2014 13:52:34 +0000 (21:52 +0800)]
GBE: reserve flag0.0 for large basic block.

In a large basic block, there can be more than one IF instruction
that needs to use flag0.0. We have to reserve flag0.0 for those IF
instructions.

Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years ago  GBE: fix the large if/endif block issue.
Zhigang Gong [Fri, 25 Apr 2014 07:38:22 +0000 (15:38 +0800)]
GBE: fix the large if/endif block issue.

Some test cases have very large basic blocks which contain
more than 32768/2 instructions, more than could fit into one
if/endif block.

This patch introduces an if/endif fix switch in the GenContext.
Once we encounter such an error, we set the switch on
and then recompile the kernel. When the switch is on, we
insert extra endif/if pairs into the block to split one if/endif
block into multiple ones, fixing the large if/endif issue.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years ago  GBE: fix the hard coded endif offset calculation.
Zhigang Gong [Fri, 25 Apr 2014 04:36:59 +0000 (12:36 +0800)]
GBE: fix the hard coded endif offset calculation.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years ago  GBE: Avoid unnecessary dag/liveness computing at backend.
Zhigang Gong [Thu, 24 Apr 2014 07:24:07 +0000 (15:24 +0800)]
GBE: Avoid unnecessary dag/liveness computing at backend.

We don't need to compute dag/liveness at the backend when
we switch to a new code generation strategy.
For the unit test cases, this patch saves about 15% of the
overall execution time. For luxmark with STRICT conformance
mode, it saves about 40% of the build time.

v3: fix some minor bugs.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years ago  GBE: fixed a potential scalarize bug.
Zhigang Gong [Thu, 24 Apr 2014 10:18:40 +0000 (18:18 +0800)]
GBE: fixed a potential scalarize bug.

We need to append an extract instruction when doing a bitcast to
a vector. Otherwise, we may trigger an assert because the extract
instruction uses an undefined vector.

After this patch, it is safe to run many rounds of the scalarize
pass.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  add support for cross compiler
Guo Yejun [Wed, 23 Apr 2014 18:18:00 +0000 (02:18 +0800)]
add support for cross compiler

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years ago  GBE: refine the gen program strategy.
Zhigang Gong [Thu, 24 Apr 2014 02:09:13 +0000 (10:09 +0800)]
GBE: refine the gen program strategy.

The limitRegisterPressure option only affects the MAD pattern matching,
which does not bring a noticeable difference here, so I change it to
always be false. Also add the reserved registers for spilling to the
strategy structure. Thus we can try to build a program with the
following strategy:

1. SIMD16 without spilling
2. SIMD16 with 10 spilling registers and a default spilling threshold
   of 16. When we need to spill more than 16 registers, we fall back to
   the next method.
3. SIMD8 without spilling
4. SIMD8 with 8 spilling registers.
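The fallback sequence above can be sketched as follows (a minimal model with illustrative names; the real strategy loop lives in the GBE backend and is not shaped like this):

```python
# Hypothetical strategy table mirroring the four steps in the message:
# (simd_width, reserved_spill_regs, spill_threshold or None).
STRATEGIES = [
    (16, 0,  0),     # 1. SIMD16 without spilling
    (16, 10, 16),    # 2. SIMD16, 10 reserved regs, threshold 16
    (8,  0,  0),     # 3. SIMD8 without spilling
    (8,  8,  None),  # 4. SIMD8, 8 reserved regs (last resort)
]

def build_program(compile_fn):
    """Try each strategy in order; compile_fn returns a program
    object on success, or None when it must fall back."""
    for simd, spill_regs, threshold in STRATEGIES:
        program = compile_fn(simd, spill_regs, threshold)
        if program is not None:
            return program
    raise RuntimeError("all code generation strategies failed")
```

For example, a kernel whose register pressure only fits SIMD8 would be retried until strategy 3 succeeds.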

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: fixed the undefined phi value's liveness analysis.
Zhigang Gong [Thu, 10 Apr 2014 06:33:48 +0000 (14:33 +0800)]
GBE: fixed the undefined phi value's liveness analysis.

If a phi component is undef coming from one of the predecessors,
we should not include it in that predecessor's liveout registers.
Otherwise, that phi register's liveness may be extended to
basic block zero, which is not good.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years ago  GBE: Try to expire some registers before register allocation
Ruiling Song [Wed, 23 Apr 2014 06:31:29 +0000 (14:31 +0800)]
GBE: Try to expire some registers before register allocation

1. This frees unused registers as soon as possible, so it becomes easier
   to allocate contiguous registers.

2. We previously met many hidden register liveness issues. Let's try
   to reuse the expired registers early; then wrong liveness may be
   easier to find.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
10 years ago  GBE: Optimize byte gather read using untyped read.
Ruiling Song [Wed, 23 Apr 2014 02:56:50 +0000 (10:56 +0800)]
GBE: Optimize byte gather read using untyped read.

Untyped read seems better than byte gather read.
Some performance tests in OpenCV doubled after this patch.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
10 years ago  add test for __gen_ocl_simd_any and __gen_ocl_simd_all
Guo Yejun [Fri, 18 Apr 2014 05:42:29 +0000 (13:42 +0800)]
add test for __gen_ocl_simd_any and __gen_ocl_simd_all

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years ago  support __gen_ocl_simd_any and __gen_ocl_simd_all
Guo Yejun [Fri, 18 Apr 2014 05:42:16 +0000 (13:42 +0800)]
support __gen_ocl_simd_any and __gen_ocl_simd_all

short __gen_ocl_simd_any(short x):
if x is non-zero in any of the active threads in the same SIMD,
the return value for all these threads is non-zero; otherwise, zero is returned.

short __gen_ocl_simd_all(short x):
only if x is non-zero in all of the active threads in the same SIMD,
the return value for all these threads is non-zero; otherwise, zero is returned.

for example:
to check whether a special value exists in a global buffer, use one SIMD
to do the search in parallel; the whole SIMD can stop the task
once the value is found. The key kernel code looks like:

for (;;) {
  ...
  if (__gen_ocl_simd_any(...))
    break;   // the whole SIMD stops the search
}
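A scalar model of the two built-ins' semantics, assuming x is inspected across all active lanes of the SIMD (function and parameter names below are illustrative only; -1 stands in for any non-zero short, and inactive lanes produce no result):

```python
def simd_any(lane_values, active_mask):
    # Non-zero for every active lane if any active lane's x is non-zero.
    hit = any(x != 0 for x, a in zip(lane_values, active_mask) if a)
    return [(-1 if hit else 0) if a else None for a in active_mask]

def simd_all(lane_values, active_mask):
    # Non-zero for every active lane only if x is non-zero in all
    # active lanes; inactive lanes do not participate.
    hit = all(x != 0 for x, a in zip(lane_values, active_mask) if a)
    return [(-1 if hit else 0) if a else None for a in active_mask]
```

In the search-loop example, one lane finding the value makes simd_any return non-zero in every active lane, so the whole SIMD takes the break together.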

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years ago  Delete the printing of dynamic statistics line.
Sun, Yi [Tue, 8 Apr 2014 02:53:40 +0000 (10:53 +0800)]
Delete the printing of dynamic statistics line.

summary:
---------------------
  1. Delete the printing of dynamic statistics line.
  2. Add a function to catch signals (like CTRL+C, core dumped ...);
     if one is caught, remind the user of the signal name.
     core dumped example:
...
displacement_map_element()    [SUCCESS]
compiler_clod()    Interrupt signal (SIGSEGV) received.
summary:
----------
  total: 657
  run: 297
  pass: 271
  fail: 26
  pass rate: 0.960426

Signed-off-by: Yi Sun <yi.sun@intel.com>
Signed-off-by: Yangwei Shui <yangweix.shui@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years ago  GBE: Implement instruction compaction.
Ruiling Song [Tue, 15 Apr 2014 08:53:17 +0000 (16:53 +0800)]
GBE: Implement instruction compaction.

A native GEN ASM instruction takes 2*64 bits, but GEN also supports compact
instructions which take only 64 bits. To make the code easily understood,
GenInstruction now only stands for 64 bits of memory, and
GenNativeInstruction & GenCompactInstruction are used to represent normal
(native) and compact instructions.

After this change, it is not easy to map a SelectionInstruction distance to
an ASM distance, as the instructions within the distance may be compacted.
To not introduce too much complexity, JMP, IF, ENDIF and NOP will NEVER be
compacted.

Some experiments in luxMark show it reduces instruction memory by about 20%.
But it is sad that no performance improvement was observed.
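The distance problem can be seen in a small model: a jump's byte offset depends on how many of the intervening instructions are compacted, which is why the control-flow opcodes stay native (an illustrative sketch, not the real encoder):

```python
# Opcodes that the commit keeps at native size so their jump
# distances stay predictable.
NEVER_COMPACT = {"JMP", "IF", "ENDIF", "NOP"}

def byte_offsets(instructions):
    """instructions: list of (opcode, compactable) pairs.
    A native GEN instruction takes 16 bytes (2*64 bits), a
    compact one 8 bytes (64 bits). Returns each instruction's
    byte offset, which shifts as earlier ones compact."""
    offsets, pos = [], 0
    for opcode, compactable in instructions:
        offsets.append(pos)
        size = 8 if compactable and opcode not in NEVER_COMPACT else 16
        pos += size
    return offsets
```

If the MOV below were not compacted, the ADD would sit at offset 32 instead of 24; any branch over it would need a different encoded distance, which is exactly the bookkeeping the never-compact rule avoids for control flow.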

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years ago  GBE: fix a Q64 spilling bug in non-simd8 mode.
Zhigang Gong [Thu, 17 Apr 2014 09:41:58 +0000 (17:41 +0800)]
GBE: fix a Q64 spilling bug in non-simd8 mode.

For simd16 mode, the payload needs 2 GRFs, not the hard coded 1 GRF.
This patch fixes the corresponding regression in piglit.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: work around baytrail-t hang issue.
Zhigang Gong [Thu, 17 Apr 2014 06:59:00 +0000 (14:59 +0800)]
GBE: work around baytrail-t hang issue.

There is an unknown issue with the baytrail-t platform. It hangs at
utest's compiler_global_constant case. After some investigation,
it turns out to be related to the DWORD GATHER READ send message
on the constant cache data port. Changing to the data cache data
port works around that hang issue.

Now we only fail one more case on baytrail-t compared to the IVB
desktop platform, which is:

profiling_exec()    [FAILED]
   Error: Too large time from submit to start

That may be caused by a kernel related issue, and that bug will not
cause serious issues for a normal kernel. So after this patch, the
baytrail-t platform should be in pretty good shape with beignet.

Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: He Junyan <junyan.he@inbox.com>
10 years ago  GBE/Runtime: pass the device id to the compiler backend.
Zhigang Gong [Thu, 17 Apr 2014 06:56:08 +0000 (14:56 +0800)]
GBE/Runtime: pass the device id to the compiler backend.

For some reasons, we need to know the current target device id
at the code generation stage. This patch introduces such
a mechanism. This is the preparation for fixing the weird
baytrail hang issue.

Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: He Junyan <junyan.he@inbox.com>
10 years ago  Runtime: increase the build log buffer size to 1000.
Zhigang Gong [Thu, 17 Apr 2014 05:11:50 +0000 (13:11 +0800)]
Runtime: increase the build log buffer size to 1000.

200 is too small sometimes.

Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: He Junyan <junyan.he@inbox.com>
10 years ago  Runtime: Add support for Bay Trail-T device.
Chuanbo Weng [Thu, 10 Apr 2014 08:17:53 +0000 (16:17 +0800)]
Runtime: Add support for Bay Trail-T device.

According to the baytrail-t spec, baytrail-t has 4 EUs and each
EU has 8 threads. So the compute unit count is 32 and the maximum
work group size is 32 * 8, which is 256.

Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years ago  Mark SandyBridge as unsupported
Jesper Pedersen [Sun, 13 Apr 2014 13:58:12 +0000 (09:58 -0400)]
Mark SandyBridge as unsupported

Signed-off-by: Jesper Pedersen <jesper.pedersen@comcast.net>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years ago  Use pkg-config to check modules
Zhenyu Wang [Thu, 10 Apr 2014 10:09:44 +0000 (18:09 +0800)]
Use pkg-config to check modules

Instead of using pre-defined paths for dependent modules, e.g. libdrm,
libdrm_intel, etc., use the pkg-config helper for cmake. This makes
it easy to work with a developer's own built version of those dependencies.

Also remove the libGL dependence for 'gbe_bin_generator', which is not
required. libutest.so still requires libGL for now, but that might be
fixed by checking the real GL dependence.

v2: Fix build with mesa source (92e6260) and link required EGL lib with utests too.

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

10 years ago  GBE: Enable CFG printer.
Ruiling Song [Fri, 11 Apr 2014 06:48:18 +0000 (14:48 +0800)]
GBE: Enable CFG printer.

export OCL_OUTPUT_CFG=1
or export OCL_OUTPUT_CFG_ONLY=1
then it will output a .dot file of the CFG for each compiled kernel.

CFG_ONLY means the pure CFG without llvm IR.
You can use xdot to view the .dot file.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years ago  Runtime: increase batch size to 8K.
Ruiling Song [Fri, 11 Apr 2014 06:48:17 +0000 (14:48 +0800)]
Runtime: increase batch size to 8K.

We met an assert on max_reloc in libdrm. So we simply work around it by
increasing the batch size; then libdrm can allow more bo relocations.
This fixes the assert when running the ocl HaarFixture test under simd8.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years ago  enable mad for mul+sub.
Ruiling Song [Fri, 11 Apr 2014 06:48:16 +0000 (14:48 +0800)]
enable mad for mul+sub.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years ago  GBE: Enable register spilling for SIMD16.
Zhigang Gong [Wed, 9 Apr 2014 16:05:26 +0000 (00:05 +0800)]
GBE: Enable register spilling for SIMD16.

Enable register spilling for SIMD16 mode. Introduce a
new environment variable OCL_SIMD16_SPILL_THRESHOLD to
control the threshold of SIMD16 register spilling. The default
value is 16, meaning that when more than 16 registers are spilled,
beignet will fall back to simd8.

Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: Optimize read_image performance for CL_ADDRESS_CLAMP.
Zhigang Gong [Wed, 9 Apr 2014 10:25:22 +0000 (18:25 +0800)]
GBE: Optimize read_image performance for CL_ADDRESS_CLAMP.

The previous workaround (due to a hardware restriction) was to use
CL_ADDRESS_CLAMP_TO_EDGE to implement CL_ADDRESS_CLAMP, which is
not very efficient, especially because of the boundary checking overhead.
The root cause is that we need to check each pixel's coordinate.

Now we change to use the LD message to implement CL_ADDRESS_CLAMP. For
integer coordinates, we don't need to do any boundary checking. And for
float coordinates, we only need to check whether a coordinate is less
than zero, which is much simpler than before.

This patch brings about a 20% to 30% performance gain for luxmark's
medium and simple scenes.

v2:
simplify the READ_IMAGE0.
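The difference between the two addressing modes can be modeled on a 1-D image (a scalar Python sketch with illustrative helper names; the border color for CL_ADDRESS_CLAMP is assumed to be zero here):

```python
def read_clamp_to_edge(pixels, x):
    # Old workaround: clamp the coordinate into range, paying a
    # boundary check against both ends for every pixel.
    x = min(max(x, 0), len(pixels) - 1)
    return pixels[x]

def read_clamp(pixels, x):
    # CL_ADDRESS_CLAMP: out-of-range reads return the border color (0).
    # With the LD message, the hardware already handles the upper bound
    # for integer coordinates; only float coordinates below zero still
    # need an explicit check in the kernel.
    if x < 0 or x >= len(pixels):
        return 0
    return pixels[x]
```

The two modes only differ outside the image: CLAMP_TO_EDGE repeats the edge pixel, while CLAMP returns the border color, which is what the LD message provides almost for free.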

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years ago  GBE: fixed two 'long' related bugs.
Zhigang Gong [Tue, 8 Apr 2014 09:58:15 +0000 (17:58 +0800)]
GBE: fixed two 'long' related bugs.

Some hard coded numbers were not modified correctly in the previous
patch. Now fix them. This passes the corresponding regressions in the
piglit test.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years ago  GBE: fix the flag usage of those long/64-bit instructions.
Zhigang Gong [Wed, 2 Apr 2014 06:36:19 +0000 (14:36 +0800)]
GBE: fix the flag usage of those long/64-bit instructions.

Make the flag allocation aware that the long/64-bit instructions
will use flag0.1, and don't hard code f0.1 at the gen_context
stage.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: Optimize the bool register allocation/processing.
Zhigang Gong [Thu, 27 Mar 2014 16:38:29 +0000 (00:38 +0800)]
GBE: Optimize the bool register allocation/processing.

Previously, we had a global flag allocation implementation.
After some analysis, I found the global flag allocation is not
the best solution here.
As for a cross block reference of a bool value, we have to
combine it with the current emask, so there is no obvious advantage to
allocating a dedicated physical flag register for those cross block usages.
We just need to allocate physical flags within each BB. We need to handle
the following cases:

1. The bool's liveness never goes beyond this BB, and the bool is only used
   as a dst register or a pred register. This bool value can be
   allocated in a physical flag only if there are enough physical flags.
   We already identified those bools at the instruction selection stage, and
   put them in the flagBooleans set.
2. The bool is defined in another BB and used in this BB; then we need
   to prepend an instruction at the position where we use it.
3. The bool is defined in this BB but is also used as some instruction's
   source register rather than the pred register. We have to keep the normal
   grf (UW8/UW16) register for this bool. For some CMP instructions, we need
   to append a SEL instruction to convert the flag to the grf register.
4. Even for the spilling flag, if there is only one spilling flag, we will
   also try to reuse the temporary flag register later. This requires that all
   the instructions get their flag at the instruction selection stage and do
   not use the physical flag number directly at the gen_context stage.
   Otherwise, it may break the algorithm here.

We will track all the validated bool values to avoid any redundant
validation for the same flag. But if there are not enough physical flags,
we have to spill a previously allocated physical flag, and the spilling
policy is to spill the allocated flag which lives to the furthest end point.
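The spilling policy in the paragraph above is a furthest-end-point heuristic; a minimal sketch, assuming the allocator tracks each flag's liveness end point (the dict below is a hypothetical stand-in for the actual GBE allocator state):

```python
def pick_flag_to_spill(allocated):
    """allocated: dict mapping a physical flag name to the point
    where its live range ends. Spill the flag whose live range
    ends last, since it blocks the register for the longest time."""
    return max(allocated, key=allocated.get)
```

For instance, with f1.0 live until point 40 and the others ending earlier, f1.0 is the one spilled.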

Let's see a real example of the improvement from this patch.
I take compiler_vect_compare as an example; before this patch, the
instructions are as below:
    (      24)  cmp.g.f1.1(8)   null            g110<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      26)  cmp.g.f1.1(8)   null            g111<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      28)  (+f1.1) sel(16) g109<1>UW       g1.2<0,1,0>UW   g1<0,1,0>UW     { align1 WE_normal 1H };
    (      30)  cmp.g.f1.1(8)   null            g112<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      32)  cmp.g.f1.1(8)   null            g113<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      34)  (+f1.1) sel(16) g108<1>UW       g1.2<0,1,0>UW   g1<0,1,0>UW     { align1 WE_normal 1H };
    (      36)  cmp.g.f1.1(8)   null            g114<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      38)  cmp.g.f1.1(8)   null            g115<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      40)  (+f1.1) sel(16) g107<1>UW       g1.2<0,1,0>UW   g1<0,1,0>UW     { align1 WE_normal 1H };
    (      42)  cmp.g.f1.1(8)   null            g116<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      44)  cmp.g.f1.1(8)   null            g117<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      46)  (+f1.1) sel(16) g106<1>UW       g1.2<0,1,0>UW   g1<0,1,0>UW     { align1 WE_normal 1H };
    (      48)  mov(16)         g104<1>F        -nanF                           { align1 WE_normal 1H };
    (      50)  cmp.ne.f1.1(16) null            g109<8,8,1>UW   0x0UW           { align1 WE_normal 1H switch };
    (      52)  (+f1.1) sel(16) g96<1>D         g104<8,8,1>D    0D              { align1 WE_normal 1H };
    (      54)  cmp.ne.f1.1(16) null            g108<8,8,1>UW   0x0UW           { align1 WE_normal 1H switch };
    (      56)  (+f1.1) sel(16) g98<1>D         g104<8,8,1>D    0D              { align1 WE_normal 1H };
    (      58)  cmp.ne.f1.1(16) null            g107<8,8,1>UW   0x0UW           { align1 WE_normal 1H switch };
    (      60)  (+f1.1) sel(16) g100<1>D        g104<8,8,1>D    0D              { align1 WE_normal 1H };
    (      62)  cmp.ne.f1.1(16) null            g106<8,8,1>UW   0x0UW           { align1 WE_normal 1H switch };
    (      64)  (+f1.1) sel(16) g102<1>D        g104<8,8,1>D    0D              { align1 WE_normal 1H };
    (      66)  add(16)         g94<1>D         g1.3<0,1,0>D    g120<8,8,1>D    { align1 WE_normal 1H };
    (      68)  send(16)        null            g94<8,8,1>UD
                data (bti: 1, rgba: 0, SIMD16, legacy, Untyped Surface Write) mlen 10 rlen 0 { align1 WE_normal 1H };
    (      70)  mov(16)         g2<1>UW         0x1UW                           { align1 WE_normal 1H };
    (      72)  endif(16) 2                     null                            { align1 WE_normal 1H };

After this patch, it becomes:

    (      24)  cmp.g(8)        null            g110<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      26)  cmp.g(8)        null            g111<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      28)  cmp.g.f1.1(8)   null            g112<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      30)  cmp.g.f1.1(8)   null            g113<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      32)  cmp.g.f0.1(8)   null            g114<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      34)  cmp.g.f0.1(8)   null            g115<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      36)  (+f0.1) sel(16) g109<1>UW       g1.2<0,1,0>UW   g1<0,1,0>UW     { align1 WE_normal 1H };
    (      38)  cmp.g.f1.0(8)   null            g116<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      40)  cmp.g.f1.0(8)   null            g117<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      42)  mov(16)         g106<1>F        -nanF                           { align1 WE_normal 1H };
    (      44)  (+f0) sel(16)   g98<1>D         g106<8,8,1>D    0D              { align1 WE_normal 1H };
    (      46)  (+f1.1) sel(16) g100<1>D        g106<8,8,1>D    0D              { align1 WE_normal 1H };
    (      48)  (+f0.1) sel(16) g102<1>D        g106<8,8,1>D    0D              { align1 WE_normal 1H };
    (      50)  (+f1) sel(16)   g104<1>D        g106<8,8,1>D    0D              { align1 WE_normal 1H };
    (      52)  add(16)         g96<1>D         g1.3<0,1,0>D    g120<8,8,1>D    { align1 WE_normal 1H };
    (      54)  send(16)        null            g96<8,8,1>UD
                data (bti: 1, rgba: 0, SIMD16, legacy, Untyped Surface Write) mlen 10 rlen 0 { align1 WE_normal 1H };
    (      56)  mov(16)         g2<1>UW         0x1UW                           { align1 WE_normal 1H };
    (      58)  endif(16) 2                     null                            { align1 WE_normal 1H };

It reduces the instruction count from 25 to 18, saving about 28% of the instructions.

v2:
Fix some minor bugs.

Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  Silence some compilation warnings.
Zhigang Gong [Fri, 28 Mar 2014 06:57:51 +0000 (14:57 +0800)]
Silence some compilation warnings.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: avoid using a temporary register at the CMP instruction.
Zhigang Gong [Thu, 27 Mar 2014 08:27:18 +0000 (16:27 +0800)]
GBE: avoid using a temporary register at the CMP instruction.

Using one SEL instruction, we can easily transfer a flag
to a normal bool vector register with the correct mask.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: Add two helper scalar registers to hold 0 and all 1s.
Zhigang Gong [Thu, 27 Mar 2014 07:54:49 +0000 (15:54 +0800)]
GBE: Add two helper scalar registers to hold 0 and all 1s.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: don't emit jmpi to next label.
Zhigang Gong [Thu, 27 Mar 2014 05:30:00 +0000 (13:30 +0800)]
GBE: don't emit jmpi to next label.

As the following if will do the same thing, we don't need to
add the jmpi instruction.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: one instruction is enough for SEL_CMP now.
Zhigang Gong [Thu, 27 Mar 2014 02:05:56 +0000 (10:05 +0800)]
GBE: one instruction is enough for SEL_CMP now.

As we have if/endif now, the SEL_CMP can write to the
dst register directly with the correct emask.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: pass the OCL_STRICT_CONFORMANCE env to the backend.
Zhigang Gong [Thu, 27 Mar 2014 02:00:57 +0000 (10:00 +0800)]
GBE: pass the OCL_STRICT_CONFORMANCE env to the backend.

Enable mad pattern matching if strict conformance
is false.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: Only emit a long jump when jumping over many blocks
Zhigang Gong [Thu, 27 Mar 2014 01:36:46 +0000 (09:36 +0800)]
GBE: Only emit a long jump when jumping over many blocks

In most cases, we don't need to emit a long jump at all.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: Don't need the emask/notemask/barriermask any more.
Zhigang Gong [Thu, 27 Mar 2014 06:54:15 +0000 (14:54 +0800)]
GBE: Don't need the emask/notemask/barriermask any more.

As we changed to use if/endif and changed the implementation of
the barrier, we don't need to maintain emask/notmask/barriermask
any more. Just remove them.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: Disable SPF and use JMPI + IF/ENDIF to handle each block.
Zhigang Gong [Tue, 18 Mar 2014 07:28:44 +0000 (15:28 +0800)]
GBE: Disable SPF and use JMPI + IF/ENDIF to handle each block.

When SPF (single program flow) is enabled, we always need to use f0
as the predication of almost every instruction. This brings some
trouble when we want a two-level mask mechanism, for
example for the SEL instruction and some BOOL operations. We
have to use more than one instruction to do that, which simply
introduces 100% overhead for those instructions.

v2:
fix the wrong assertion.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: Add if/endif/brc/brd instruction support.
Zhigang Gong [Mon, 17 Mar 2014 10:08:17 +0000 (18:08 +0800)]
GBE: Add if/endif/brc/brd instruction support.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: further optimize forward/backward jump.
Zhigang Gong [Mon, 17 Mar 2014 10:01:03 +0000 (18:01 +0800)]
GBE: further optimize forward/backward jump.

We don't need to save f0 at the last part of the block.
Just use it directly.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: use S16 vector to represent bool.
Zhigang Gong [Thu, 13 Mar 2014 10:54:31 +0000 (18:54 +0800)]
GBE: use S16 vector to represent bool.

The original purpose of using a flag or a S16 scalar to represent
the bool data type was to save register usage. But that brought too
much complexity to handle it correctly in every possible case, and
the consequence is that we had to take too much care about the bool's
handling in many places in the instruction selection stage. We
never even handled all the cases correctly. The hardest part is
that we can't just touch part of the bits in a S16 scalar register;
there is no instruction to support that. So if a bool comes from
another BB, or even from the same BB but there is
a backward JMP and the bool is still a possible livein register,
we need some extra instructions to keep the inactive lanes'
bits at their original values.

I changed to use a S16 vector to represent the bool type; then all
the complicated cases are gone. The only big side effect is
the register consumption. But considering that a real
application will not have many bools active concurrently, this
may not be a big issue.

I measured the performance impact with luxmark and only
observed a 2%-3% performance regression. There are some easy
performance optimization opportunities remaining, such as reducing
the unnecessary MOVs between flag and bool within the same
block. I think this performance regression is not a
big deal. Especially, this change will make the following if/endif
optimization a little bit easier.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: fix one misusage of flag in forward jump.
Zhigang Gong [Thu, 13 Mar 2014 04:33:39 +0000 (12:33 +0800)]
GBE: fix one misusage of flag in forward jump.

A forward jump instruction does not need the pred when comparing
the pcip with the next label. We should use the temporary flag
register.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: use a uniform style to calculate register size for curbe allocation.
Zhigang Gong [Wed, 12 Mar 2014 09:08:15 +0000 (17:08 +0800)]
GBE: use a uniform style to calculate register size for curbe allocation.

Concentrate the register size calculation in one place, and don't
use hard coded sizes when doing curbe register allocation. All
register size allocation should use the same method.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: fix the wrong usage of stack pointer and stack buffer.
Zhigang Gong [Wed, 12 Mar 2014 08:51:57 +0000 (16:51 +0800)]
GBE: fix the wrong usage of stack pointer and stack buffer.

Stack pointer and stack buffer should be two different virtual
registers: one is a vector and the other is a scalar. The reason the
previous implementation could work is that it searched the curbe offset
and made a new stack buffer register manually, which is not good.
Now fix it and remove that hacking code. We actually don't need
to use the curbe offset manually after the allocation.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: refine the "scalar" register handling.
Zhigang Gong [Wed, 12 Mar 2014 06:43:08 +0000 (14:43 +0800)]
GBE: refine the "scalar" register handling.

The scalar register's actual meaning should be "uniform register";
a non-uniform register is a varying register. For further
uniform analysis and bool data optimization, this patch
makes uniform a new register data attribute. We
can set each newly created register as a uniform or varying
register.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: Remove BBs that only have a label instruction.
Zhigang Gong [Wed, 26 Mar 2014 10:27:40 +0000 (18:27 +0800)]
GBE: Remove BBs that only have a label instruction.

v2:
add an extra createCFGSimplificationPass right before the createGenPass,
and don't remove BBs at the GEN IR layer.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years ago  GBE: Add a new pass to handle barrier function's noduplicate attribute correctly.
Zhigang Gong [Wed, 26 Mar 2014 05:45:56 +0000 (13:45 +0800)]
GBE: Add a new pass to handle barrier function's noduplicate attribute correctly.

This pass removes or adds the noduplicate function attribute for barrier
functions. Basically, we want to set NoDuplicate on the __gen_barrier_xxx
functions. But if a sub-function calls those barrier functions, the
sub-function will not be inlined by LLVM's inlining pass, which is not what
we want. Since inlining such a function into its caller is safe, and we only
want to prevent duplicating the call itself, this pass removes the
NoDuplicate attribute before the inlining pass and restores it afterwards.

v2:
fix the module changed check.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoStatistics of case running
Yi Sun [Tue, 1 Apr 2014 08:31:26 +0000 (16:31 +0800)]
Statistics of case running

summary:
-----------------
1. Add struct RStatistics to count the passed number (passCount), failed number (failCount), and finished-run number (finishrun).

2. Print a statistics line; if the terminal is too narrow, don't print it:
  ......
  test_load_program_from_bin()    [SUCCESS]
  profiling_exec()    [SUCCESS]
  enqueue_copy_buf()    [SUCCESS]
   [run/total: 656/656]      pass: 629; fail: 25; pass rate: 0.961890

3. If a case crashes, count it as failed; add a function to show the statistics summary.

4. When all cases have finished, list a summary like the following:
summary:
----------
  total: 656
  run: 656
  pass: 629
  fail: 25
  pass rate: 0.961890

5. If run as ./utest_run &> log, the log will be a little messy; try the following command to analyse it:

  sed 's/\r/\n/g' log | egrep "\w*\(\)" | sed -e 's/\s//g'

  After analysis:
  -----------------
......
builtin_minmag_float2()[SUCCESS]
builtin_minmag_float4()[SUCCESS]
builtin_minmag_float8()[SUCCESS]
builtin_minmag_float16()[SUCCESS]
builtin_nextafter_float()[FAILED]
builtin_nextafter_float2()[FAILED]
builtin_nextafter_float4()[FAILED]
......

6. Fix one issue: print out the crashed case's name.

7. Delete the debug line in utests/compiler_basic_arithmetic.cpp, which
   outputs the kernel name.

8. Define the function statistics() in struct UTest, which is called by "utest_run -a/-c/-n".
   We just call this function to run each case and print the statistics line.

Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd one test case specifically for unaligned buffer copy.
Junyan He [Wed, 26 Mar 2014 10:28:02 +0000 (18:28 +0800)]
Add one test case specifically for unaligned buffer copy.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoOptimize the unaligned buffer copy logic
Junyan He [Wed, 26 Mar 2014 10:27:56 +0000 (18:27 +0800)]
Optimize the unaligned buffer copy logic

Because the byte-aligned read and write send instructions are
very slow, we optimize to avoid using them.
We separate the unaligned case into three cases:
   1. src and dst have the same offset modulo 4.
      Then we only need to handle the first and last dword.
   2. src has a bigger offset modulo 4 than dst.
      We need to do some shifting and merging between src[i]
      and src[i+1].
   3. src has a smaller offset modulo 4 than dst.
      Then we need to do the same for src[i-1] and src[i].

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd three copy cl files for Enqueue Copy usage.
Junyan He [Wed, 26 Mar 2014 10:27:48 +0000 (18:27 +0800)]
Add three copy cl files for Enqueue Copy usage.

Add these three cl files:
the first for when src and dst are unaligned but have the same offset modulo 4,
the second for when src's offset modulo 4 is bigger than dst's,
the third for when src's offset modulo 4 is smaller than dst's.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd kernel performance output
Yongjia Zhang [Tue, 1 Apr 2014 09:16:46 +0000 (17:16 +0800)]
Add kernel performance output

If the environment variable OCL_OUTPUT_KERNEL_PERF is set to a non-zero
value, then after the program exits, beignet will output the timing
information of each kernel executed.

v2: fixed the patch's trailing-whitespace problem.

v3: if OCL_OUTPUT_KERNEL_PERF is 1, the output will only contain
the time summary; if it is 2, the output will contain both the time
summary and the details. Add 'Ave' and 'Dev' to the output: 'Ave' is
the average time per kernel per execution round, and 'Dev' is 'Ave'
divided by the standard deviation of all of the kernel's executions.

Signed-off-by: Yongjia Zhang <yongjia.zhang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Fix register liveness issue under simd mode.
Ruiling Song [Mon, 24 Mar 2014 01:46:21 +0000 (09:46 +0800)]
GBE: Fix register liveness issue under simd mode.

As we run in SIMD mode with a predication mask indicating the active lanes,
if a vreg is defined in a loop and there are some uses of the vreg outside
the loop, the definition point may be executed several times under *different*
predication masks. For such vregs, we must extend the liveness across the whole loop.
If we don't, the liveness is killed before the definition point inside the loop.
If the vreg's corresponding physical register is assigned to another vreg during
the killed period, and the instructions before the kill point are re-executed with
a different predication, the inactive lanes of the vreg may be overwritten, so the
out-of-loop use would get wrong data.

This patch fixes the HaarFixture case in opencv.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Optimize the forward jump instruction.
Zhigang Gong [Mon, 17 Mar 2014 10:02:37 +0000 (18:02 +0800)]
GBE: Optimize the forward jump instruction.

Since at each BB's beginning we already check whether all channels are
inactive, we don't need to duplicate this check at the end of a forward jump.

This patch gets about a 25% performance gain for luxmark's median scene.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoRefine the FCMP_ORD and FCMP_UNO.
Yang Rong [Mon, 24 Mar 2014 09:21:40 +0000 (17:21 +0800)]
Refine the FCMP_ORD and FCMP_UNO.

If either src0 or src1 of FCMP_ORD/FCMP_UNO is a constant, the constant
value must be ordered; otherwise, LLVM would have optimized the instruction
to true/false. So discard the constant value and only compare the other src.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoRefined the fmax and fmin builtins.
Yang Rong [Mon, 24 Mar 2014 08:27:31 +0000 (16:27 +0800)]
Refined the fmax and fmin builtins.

Because GEN's select instruction with cmod .l and .ge handles the NaN
case, use compare-and-select instructions in GEN IR for fmax and fmin;
they will be optimized into one sel_cmp, so no isnan check is needed.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: "Zou, Nanhai" <nanhai.zou@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd one test case for profiling test.
Junyan He [Mon, 24 Mar 2014 07:34:43 +0000 (15:34 +0800)]
Add one test case for profiling test.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: make byte/short vload/vstore process one element each time.
Ruiling Song [Wed, 19 Mar 2014 03:41:54 +0000 (11:41 +0800)]
GBE: make byte/short vload/vstore process one element each time.

Per the OpenCL spec, the computed address (p + offset*n) in vloadn &
vstoren is only 8-bit aligned for char and 16-bit aligned for short.
That is, we cannot assume that a vload4 through a char pointer is 4-byte
aligned. The previous implementation made Clang generate a load or store
with alignment 4 that is in fact only alignment 1.

We need to find another way to optimize vloadn.
But before that, let's keep vloadn and vstoren working correctly.
This could fix the regression issue caused by byte/short optimization.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd SROA and GVN pass to default optLevel.
Yang Rong [Tue, 11 Mar 2014 06:43:51 +0000 (14:43 +0800)]
Add SROA and GVN pass to default optLevel.

SROA and GVN may introduce integer types not supported by the backend.
Remove the type assert in GenWrite; when these types are found, set the
unit to invalid. If the unit is invalid, fall back to optLevel 0, which
does not include SROA and GVN, and try again.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoutests: Refine cases for sinpi.
Yi Sun [Mon, 10 Mar 2014 03:32:12 +0000 (11:32 +0800)]
utests: Refine cases for sinpi.

The general algorithm is to reduce x to the range [-0.5, 0.5] and then calculate the result.

v2. Correct the algorithm of sinpi.
    Add some input data temporarily; we're going to design and implement an input data generator similar to what Conformance does.

Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoMove the definition of union SF to the header file utest_helper.hpp
Yi Sun [Wed, 5 Mar 2014 05:59:26 +0000 (13:59 +0800)]
Move the definition of union SF to the header file utest_helper.hpp

Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoAdd clGetMemObjectFdIntel() api
Chuanbo Weng [Wed, 5 Mar 2014 16:08:15 +0000 (00:08 +0800)]
Add clGetMemObjectFdIntel() api

Use this API to share buffers between OpenCL and v4l2. After importing
the fd of an OpenCL memory object into v4l2, v4l2 can read frames
directly into this memory object via DMABUF, without a memory copy.

v2:
Check return value of cl_buffer_get_fd

Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agomerge some state buffers into one buffer
Guo Yejun [Thu, 6 Mar 2014 16:59:38 +0000 (00:59 +0800)]
merge some state buffers into one buffer

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoFix a convert float to long bug.
Yang Rong [Mon, 3 Mar 2014 03:25:19 +0000 (11:25 +0800)]
Fix a convert float to long bug.

Converting some special float values, slightly larger than LONG_MAX, to
long with saturation gives a wrong result. Simply use LONG_MAX when the
float value equals (float)LONG_MAX.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Optimize byte/short load/store using untyped read/write
Ruiling Song [Fri, 7 Mar 2014 05:48:48 +0000 (13:48 +0800)]
GBE: Optimize byte/short load/store using untyped read/write

Scatter/gather messages are much slower than untyped read/write. So if we
can pack char/short loads/stores to use an untyped message, just do it.

v2:
add some assert in splitReg()

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: Fix a potential issue if increase srcNum.
Ruiling Song [Fri, 7 Mar 2014 05:48:47 +0000 (13:48 +0800)]
GBE: Fix a potential issue if increase srcNum.

If MAX_SRC_NUM of ir::Instruction is increased, unpredictable behaviour may occur.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: make vload3 only read 3 elements.
Ruiling Song [Fri, 7 Mar 2014 05:48:46 +0000 (13:48 +0800)]
GBE: make vload3 only read 3 elements.

Clang will align a vec3 load up to vec4, so we have to handle it in the frontend.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: Optimize scratch memory usage using register interval
Ruiling Song [Fri, 28 Feb 2014 02:16:45 +0000 (10:16 +0800)]
GBE: Optimize scratch memory usage using register interval

As scratch memory is a limited hardware resource, and different
registers have the opportunity to share the same scratch memory,
I introduce an allocator for scratch memory management.

v2:
In order to reuse the registerFilePartitioner, I rename it as
SimpleAllocator, and derive ScratchAllocator & RegisterAllocator
from it.

v3:
fix a typo, scratch size is 12KB.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: show correct line number in build log
Guo Yejun [Thu, 27 Feb 2014 17:58:20 +0000 (01:58 +0800)]
GBE: show correct line number in build log

Sometimes we insert some code into the kernel, which makes the line
numbers reported in the build log mismatch the line numbers in the
kernel from the programmer's view; use #line to correct it.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: support getelementptr with ConstantExpr operand
Guo Yejun [Wed, 26 Feb 2014 22:54:26 +0000 (06:54 +0800)]
GBE: support getelementptr with ConstantExpr operand

Add support during LLVM IR -> Gen IR period when the
first operand of getelementptr is ConstantExpr.

utest is also added.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: add fast path for more math functions
Guo Yejun [Thu, 20 Feb 2014 21:51:33 +0000 (05:51 +0800)]
GBE: add fast path for more math functions

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: remove the useless get sampler info function.
Zhigang Gong [Fri, 21 Feb 2014 05:09:20 +0000 (13:09 +0800)]
GBE: remove the useless get sampler info function.

We don't need to get the sampler info dynamically, so
remove the corresponding instruction.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: optimize read_image to avoid get sampler info dynamically.
Zhigang Gong [Fri, 21 Feb 2014 04:50:55 +0000 (12:50 +0800)]
GBE: optimize read_image to avoid get sampler info dynamically.

Most of the time, the user uses a constant sampler value in the kernel
directly, so we don't need to get the sampler value through a function
call. This way, the compiler front end can optimize much better than
with dynamic sampler-info retrieval. For luxmark's median/simple cases,
this patch gets about a 30-45% performance gain.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: don't put a long live register to a selection vector.
Zhigang Gong [Fri, 21 Feb 2014 02:40:08 +0000 (10:40 +0800)]
GBE: don't put a long live register to a selection vector.

If an element has a very long live interval, we don't want to put it
into a vector, as that adds more pressure on register allocation.

With this patch, spilled registers are reduced by more than 20% for
luxmark's median scene benchmark (from 288 to 224).

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: prepare to optimize generic selection vector allocation.
Zhigang Gong [Wed, 19 Feb 2014 02:16:48 +0000 (10:16 +0800)]
GBE: prepare to optimize generic selection vector allocation.

Move the selection vector allocation after the register interval
calculation.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: fixed a potential bug in 64 bit instruction.
Zhigang Gong [Wed, 19 Feb 2014 02:47:46 +0000 (10:47 +0800)]
GBE: fixed a potential bug in 64 bit instruction.

The current selection vector handling requires that the dst/src
vector start at dst(0) or src(0).

v2:
fix an assertion.
v3:
fix a bug in gen_context.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fix the overflow bug in register spilling.
Zhigang Gong [Wed, 19 Feb 2014 08:36:33 +0000 (16:36 +0800)]
GBE: fix the overflow bug in register spilling.

Change to use int32 to represent the maxID.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: code cleanup for read_image/write_image.
Zhigang Gong [Tue, 18 Feb 2014 10:32:33 +0000 (18:32 +0800)]
GBE: code cleanup for read_image/write_image.

Remove some useless instructions and make the read/write_image
more readable.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fixed the incorrect max_dst_num and max_src_num.
Zhigang Gong [Tue, 18 Feb 2014 09:41:05 +0000 (17:41 +0800)]
GBE: fixed the incorrect max_dst_num and max_src_num.

Some I64 instructions use more than 11 dst registers; this patch
changes the max src number to 16 and adds an assertion to check
whether we run into this type of issue again.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: Optimize write_image instruction for simd8 mode.
Zhigang Gong [Tue, 18 Feb 2014 09:19:41 +0000 (17:19 +0800)]
GBE: Optimize write_image instruction for simd8 mode.

In simd8 mode, we can put u,v,w,x,r,g,b,a into a
selection vector directly and don't need to
assign those values again.

Let's see an example. The following code, which does a simple image
copy, is generated without this patch:

    (26      )  (+f0) mov(8)    g113<1>F        g114<8,8,1>D                    { align1 WE_normal 1Q };
    (28      )  (+f0) send(8)   g108<1>UD       g112<8,8,1>F
                sampler (3, 0, 0, 1) mlen 2 rlen 4              { align1 WE_normal 1Q };
    (30      )  mov(8)          g99<1>UD        0x0UD                           { align1 WE_all 1Q };
    (32      )  mov(1)          g99.7<1>UD      0xffffUD                        { align1 WE_all };
    (34      )  mov(8)          g103<1>UD       0x0UD                           { align1 WE_all 1Q };
    (36      )  (+f0) mov(8)    g100<1>UD       g117<8,8,1>UD                   { align1 WE_normal 1Q };
    (38      )  (+f0) mov(8)    g101<1>UD       g114<8,8,1>UD                   { align1 WE_normal 1Q };
    (40      )  (+f0) mov(8)    g104<1>UD       g108<8,8,1>UD                   { align1 WE_normal 1Q };
    (42      )  (+f0) mov(8)    g105<1>UD       g109<8,8,1>UD                   { align1 WE_normal 1Q };
    (44      )  (+f0) mov(8)    g106<1>UD       g110<8,8,1>UD                   { align1 WE_normal 1Q };
    (46      )  (+f0) mov(8)    g107<1>UD       g111<8,8,1>UD                   { align1 WE_normal 1Q };
    (48      )  (+f0) send(8)   null            g99<8,8,1>UD
                renderunsupported target 5 mlen 9 rlen 0        { align1 WE_normal 1Q };
    (50      )  (+f0) mov(8)    g1<1>UW         0x1UW                           { align1 WE_normal 1Q };
  L1:
    (52      )  mov(8)          g112<1>UD       g0<8,8,1>UD                     { align1 WE_all 1Q };
    (54      )  send(8)         null            g112<8,8,1>UD
                thread_spawnerunsupported target 7 mlen 1 rlen 0 { align1 WE_normal 1Q EOT };

With this patch, we can optimize it as below:

    (26      )  (+f0) mov(8)    g106<1>F        g111<8,8,1>D                    { align1 WE_normal 1Q };
    (28      )  (+f0) send(8)   g114<1>UD       g105<8,8,1>F
                sampler (3, 0, 0, 1) mlen 2 rlen 4              { align1 WE_normal 1Q };
    (30      )  mov(8)          g109<1>UD       0x0UD                           { align1 WE_all 1Q };
    (32      )  mov(1)          g109.7<1>UD     0xffffUD                        { align1 WE_all };
    (34      )  mov(8)          g113<1>UD       0x0UD                           { align1 WE_all 1Q };
    (36      )  (+f0) send(8)   null            g109<8,8,1>UD
                renderunsupported target 5 mlen 9 rlen 0        { align1 WE_normal 1Q };
    (38      )  (+f0) mov(8)    g1<1>UW         0x1UW                           { align1 WE_normal 1Q };
  L1:
    (40      )  mov(8)          g112<1>UD       g0<8,8,1>UD                     { align1 WE_all 1Q };
    (42      )  send(8)         null            g112<8,8,1>UD
                thread_spawnerunsupported target 7 mlen 1 rlen 0 { align1 WE_normal 1Q EOT };

This patch could save about 8 instructions per write_image.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: optimize sample instruction.
Zhigang Gong [Tue, 18 Feb 2014 06:40:59 +0000 (14:40 +0800)]
GBE: optimize sample instruction.

The U,V,W registers could be allocated to a selection vector directly.
Then we can save some MOV instructions for the read_image functions.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoChange the order of the code
xiuli pan [Fri, 21 Feb 2014 08:25:20 +0000 (16:25 +0800)]
Change the order of the code

Fix the 66K problem in OpenCV testing.
The bug was caused by the incorrect order
of the code, which made beignet
calculate the whole local size of the kernel
file. Now the OpenCV test can pass.

Reviewed-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoFix a long DIV/REM hang.
Yang Rong [Fri, 21 Feb 2014 08:54:39 +0000 (16:54 +0800)]
Fix a long DIV/REM hang.

There is a jumpi in long DIV/REM whose predication is any16/any8. We
MUST AND the predication register with the emask, otherwise it may loop
forever.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: improve precision of rootn
Lv Meng [Tue, 14 Jan 2014 03:04:57 +0000 (11:04 +0800)]
GBE: improve precision of rootn

Signed-off-by: Lv Meng <meng.lv@intel.com>
10 years agoRemove some unreasonable input values for rootn
Yi Sun [Thu, 20 Feb 2014 01:32:32 +0000 (09:32 +0800)]
Remove some unreasonable input values for rootn

The manual for the function pow() contains the following description:
"If x is a finite value less than 0,
and y is a finite noninteger,
a domain error occurs, and a NaN is returned."
That means we can't calculate rootn on the CPU as pow(x, 1.0/y), which is what the OpenCL spec mentions.
E.g. when y=3 and x=-8, rootn should return -2, but pow(x, 1.0/y) returns a NaN.
I didn't find a multi-root math function in glibc.

Signed-off-by: Yi Sun <yi.sun@intel.com>
10 years agoutests:add subnormal check by fpclassify.
Yi Sun [Wed, 19 Feb 2014 06:12:03 +0000 (14:12 +0800)]
utests:add subnormal check by fpclassify.

Signed-off-by: Yi Sun <yi.sun@intel.com>
Signed-off-by: Shui yangwei <yangweix.shui@intel.com>
10 years agoChange %.20f to %e.
Yi Sun [Wed, 19 Feb 2014 06:04:52 +0000 (14:04 +0800)]
Change %.20f to %e.

This can make the error information more readable.

Signed-off-by: Yi Sun <yi.sun@intel.com>
10 years agoGBE: add param to switch the behavior of math func
Guo Yejun [Mon, 17 Feb 2014 21:30:27 +0000 (05:30 +0800)]
GBE: add param to switch the behavior of math func

Add OCL_STRICT_CONFORMANCE to switch the behavior of math funcs:
the funcs will be high precision with a performance drop if it is 1,
and a fast path with good-enough precision will be selected if it is 0.

This change adds the code basis, with 'sin' and 'cos' implemented
as examples; support for other math functions will be added later.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
10 years agoutests: Remove test cases for functions 'tgamma', 'erf' and 'erfc'
Yi Sun [Mon, 17 Feb 2014 03:32:47 +0000 (11:32 +0800)]
utests: Remove test cases for functions 'tgamma', 'erf' and 'erfc'

Since OpenCL conformance doesn't cover these functions at the moment,
we remove them temporarily.

Signed-off-by: Yi Sun <yi.sun@intel.com>
10 years agoImprove precision of sinpi/cospi
Ruiling Song [Mon, 17 Feb 2014 08:54:20 +0000 (16:54 +0800)]
Improve precision of sinpi/cospi

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: fix terminfo library linkage
Boqun Feng [Mon, 17 Feb 2014 01:49:26 +0000 (09:49 +0800)]
GBE: fix terminfo library linkage

In some distros, the terminal libraries are divided into two
libraries: one is tinfo and the other is ncurses. In other
distros, there is a single ncurses library with all the functions.
To link the proper terminal library for LLVM, cmake's find_library
macro can be used. In this patch, tinfo is preferred, so that
linkage behavior in distros with tinfo is not affected.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
10 years agoutests: define python interpreter via cmake variable
Boqun Feng [Sat, 15 Feb 2014 06:52:44 +0000 (14:52 +0800)]
utests: define python interpreter via cmake variable

The reason for this fix is in commit
5b64170ef5e3e78d038186fb1132b11a8fec308e.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoCL: make the scratch size as a device resource attribute.
Zhigang Gong [Fri, 14 Feb 2014 08:11:36 +0000 (16:11 +0800)]
CL: make the scratch size as a device resource attribute.

Actually, the scratch size is much like the local memory size,
which should be device-dependent information.

This patch puts the scratch memory size into the device attribute
structure. And when a kernel needs more than the maximum scratch
memory, we just return an out-of-resource error rather than
triggering an assertion.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Song, Ruiling <ruiling.song@intel.com>
10 years agofix typo: blobTempName is assigned but not used
Guo Yejun [Thu, 13 Feb 2014 03:59:48 +0000 (11:59 +0800)]
fix typo: blobTempName is assigned but not used

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Support 64Bit register spill.
Ruiling Song [Fri, 14 Feb 2014 07:04:26 +0000 (15:04 +0800)]
GBE: Support 64Bit register spill.

Now we support DWORD & QWORD register spill/fill.

v2:
  only increase poolOffset by 1 when we meet a QWord register and poolOffset is 1.

v3:
  allocate the reserved register pool uniformly for src and dst registers.
  when spilling a qword register, the payload register should be retyped as dword per the bottom/top logic.
  put a limit on the scratch space memory size.

v4:
  fix a typo.
  increase the reserved registers from 6 to 8 for some complex instructions.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agocmake: Fix linking with LLVM/Terminfo
Igor Gnatenko [Thu, 13 Feb 2014 07:16:35 +0000 (11:16 +0400)]
cmake: Fix linking with LLVM/Terminfo

DEBUG: [  9%] Building CXX object backend/src/CMakeFiles/gbe_bin_generater.dir/gbe_bin_generater.cpp.o
DEBUG: Linking CXX executable gbe_bin_generater
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x717): undefined reference to `setupterm'
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x727): undefined reference to `tigetnum'
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x730): undefined reference to `set_curterm'
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x738): undefined reference to `del_curterm'

Signed-off-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoBump to version 0.8.0.
Zhigang Gong [Mon, 10 Feb 2014 08:28:37 +0000 (16:28 +0800)]
Bump to version 0.8.0.

This version brings many improvements compared to the last released version 0.3,
so we decided to bump the version to 0.8.0 directly. Before 1.0.0, we have
two steps left: one is performance optimization and the other is to
support OpenCL 1.2 by default.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoDocs: fix some markdown errors and add some new info.
Zhigang Gong [Wed, 12 Feb 2014 07:20:45 +0000 (15:20 +0800)]
Docs: fix some markdown errors and add some new info.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>