contrib/beignet.git
10 years agoenable mad for mul+sub.
Ruiling Song [Fri, 11 Apr 2014 06:48:16 +0000 (14:48 +0800)]
enable mad for mul+sub.
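
A hypothetical OpenCL C illustration (kernel name and arguments made up, not from the patch): with strict conformance off, a multiply feeding a subtract like the one below can now be selected as a single mad.

    __kernel void mul_sub(__global const float *a, __global const float *b,
                          __global const float *c, __global float *out) {
      int i = get_global_id(0);
      out[i] = a[i] * b[i] - c[i];   /* mul + sub, a candidate for mad */
    }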

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Enable register spilling for SIMD16.
Zhigang Gong [Wed, 9 Apr 2014 16:05:26 +0000 (00:05 +0800)]
GBE: Enable register spilling for SIMD16.

Enable register spilling for SIMD16 mode. Introduce a
new environment variable OCL_SIMD16_SPILL_THRESHOLD to
control the threshold for SIMD16 register spilling. The default
value is 16, meaning that when more than 16 registers are
spilled, beignet falls back to SIMD8.
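
A rough sketch (hypothetical names, not the actual beignet code) of how the threshold could be read and applied in the backend:

    /* Hypothetical sketch: read OCL_SIMD16_SPILL_THRESHOLD (default 16) and
       decide whether a SIMD16 kernel should fall back to SIMD8. */
    #include <stdlib.h>

    static int simd16_spill_threshold(void) {
      const char *env = getenv("OCL_SIMD16_SPILL_THRESHOLD");
      return env ? atoi(env) : 16;   /* default threshold is 16 */
    }

    static int should_fallback_to_simd8(int spilled_reg_count) {
      return spilled_reg_count > simd16_spill_threshold();
    }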

Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Optimize read_image performance for CL_ADDRESS_CLAMP.
Zhigang Gong [Wed, 9 Apr 2014 10:25:22 +0000 (18:25 +0800)]
GBE: Optimize read_image performance for CL_ADDRESS_CLAMP.

The previous workaround (due to a hardware restriction) was to use
CL_ADDRESS_CLAMP_TO_EDGE to implement CL_ADDRESS_CLAMP, which is
not very efficient, mainly because of the boundary-checking overhead:
we need to check each pixel's coordinate.

Now we change to use the LD message to implement CL_ADDRESS_CLAMP. For
integer coordinates, we don't need to do any boundary checking, and for
float coordinates we only need to check whether the coordinate is less
than zero, which is much simpler than before.

This patch brings about a 20% to 30% performance gain for LuxMark's
medium and simple scenes.
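
For reference, a hedged example of the kind of kernel that benefits (hypothetical, not from the patch): a const CLK_ADDRESS_CLAMP sampler with unnormalized integer coordinates no longer needs per-pixel boundary checks.

    __constant sampler_t clamp_sampler = CLK_NORMALIZED_COORDS_FALSE |
                                         CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

    __kernel void copy_clamped(__read_only image2d_t src,
                               __write_only image2d_t dst) {
      int2 coord = (int2)(get_global_id(0), get_global_id(1));
      float4 v = read_imagef(src, clamp_sampler, coord); /* integer coords */
      write_imagef(dst, coord, v);
    }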

v2:
simplify the READ_IMAGE0.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fixed two 'long' related bugs.
Zhigang Gong [Tue, 8 Apr 2014 09:58:15 +0000 (17:58 +0800)]
GBE: fixed two 'long' related bugs.

A previous patch left some hard-coded numbers unmodified.
Fix them now. With this, the corresponding regression tests in
piglit pass.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fix the flag usage of long/64-bit instructions.
Zhigang Gong [Wed, 2 Apr 2014 06:36:19 +0000 (14:36 +0800)]
GBE: fix the flag usage of long/64-bit instructions.

Make the flag allocation aware that long/64-bit instructions
will use flag0.1, and don't hard-code f0.1 at the gen_context
stage.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Optimize the bool register allocation/processing.
Zhigang Gong [Thu, 27 Mar 2014 16:38:29 +0000 (00:38 +0800)]
GBE: Optimize the bool register allocation/processing.

Previously, we had a global flag allocation implementation.
After some analysis, I found that global flag allocation is not
the best solution here.
For a cross-block reference of a bool value, we have to
combine it with the current emask, so there is no obvious advantage
to allocating a dedicated physical flag register for such cross-block usage.
We just need to allocate physical flags within each BB. We need to handle
the following cases:

1. The bool's liveness never goes beyond this BB, and the bool is only used
   as a dst register or a pred register. Such a bool value can be
   allocated to a physical flag only if there are enough physical flags.
   We already identify those bools at the instruction selection stage and
   put them into the flagBooleans set.
2. The bool is defined in another BB and used in this BB; then we need
   to prepend an instruction at the position where we use it.
3. The bool is defined in this BB but is also used as some instruction's
   source register rather than as the pred register. We have to keep the normal
   GRF (UW8/UW16) register for this bool. For some CMP instructions, we need to
   append a SEL instruction to convert the flag into the GRF register.
4. Even for spilled flags, if there is only one spilled flag, we will also
   try to reuse the temporary flag register later. This requires that all
   instructions get their flag at the instruction selection stage and do not
   use a physical flag number directly at the gen_context stage; otherwise
   the algorithm here may break.
We track all the validated bool values to avoid any redundant
validation of the same flag. But if there are not enough physical flags,
we have to spill a previously allocated physical flag, and the spilling
policy is to spill the allocated flag whose liveness extends to the latest end point.

Let's see a real example of the improvement from this patch.
Take compiler_vect_compare as an example; before this patch, the
instructions are as below:
    (      24)  cmp.g.f1.1(8)   null            g110<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      26)  cmp.g.f1.1(8)   null            g111<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      28)  (+f1.1) sel(16) g109<1>UW       g1.2<0,1,0>UW   g1<0,1,0>UW     { align1 WE_normal 1H };
    (      30)  cmp.g.f1.1(8)   null            g112<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      32)  cmp.g.f1.1(8)   null            g113<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      34)  (+f1.1) sel(16) g108<1>UW       g1.2<0,1,0>UW   g1<0,1,0>UW     { align1 WE_normal 1H };
    (      36)  cmp.g.f1.1(8)   null            g114<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      38)  cmp.g.f1.1(8)   null            g115<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      40)  (+f1.1) sel(16) g107<1>UW       g1.2<0,1,0>UW   g1<0,1,0>UW     { align1 WE_normal 1H };
    (      42)  cmp.g.f1.1(8)   null            g116<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      44)  cmp.g.f1.1(8)   null            g117<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      46)  (+f1.1) sel(16) g106<1>UW       g1.2<0,1,0>UW   g1<0,1,0>UW     { align1 WE_normal 1H };
    (      48)  mov(16)         g104<1>F        -nanF                           { align1 WE_normal 1H };
    (      50)  cmp.ne.f1.1(16) null            g109<8,8,1>UW   0x0UW           { align1 WE_normal 1H switch };
    (      52)  (+f1.1) sel(16) g96<1>D         g104<8,8,1>D    0D              { align1 WE_normal 1H };
    (      54)  cmp.ne.f1.1(16) null            g108<8,8,1>UW   0x0UW           { align1 WE_normal 1H switch };
    (      56)  (+f1.1) sel(16) g98<1>D         g104<8,8,1>D    0D              { align1 WE_normal 1H };
    (      58)  cmp.ne.f1.1(16) null            g107<8,8,1>UW   0x0UW           { align1 WE_normal 1H switch };
    (      60)  (+f1.1) sel(16) g100<1>D        g104<8,8,1>D    0D              { align1 WE_normal 1H };
    (      62)  cmp.ne.f1.1(16) null            g106<8,8,1>UW   0x0UW           { align1 WE_normal 1H switch };
    (      64)  (+f1.1) sel(16) g102<1>D        g104<8,8,1>D    0D              { align1 WE_normal 1H };
    (      66)  add(16)         g94<1>D         g1.3<0,1,0>D    g120<8,8,1>D    { align1 WE_normal 1H };
    (      68)  send(16)        null            g94<8,8,1>UD
                data (bti: 1, rgba: 0, SIMD16, legacy, Untyped Surface Write) mlen 10 rlen 0 { align1 WE_normal 1H };
    (      70)  mov(16)         g2<1>UW         0x1UW                           { align1 WE_normal 1H };
    (      72)  endif(16) 2                     null                            { align1 WE_normal 1H };

After this patch, it becomes:

    (      24)  cmp.g(8)        null            g110<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      26)  cmp.g(8)        null            g111<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      28)  cmp.g.f1.1(8)   null            g112<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      30)  cmp.g.f1.1(8)   null            g113<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      32)  cmp.g.f0.1(8)   null            g114<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      34)  cmp.g.f0.1(8)   null            g115<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      36)  (+f0.1) sel(16) g109<1>UW       g1.2<0,1,0>UW   g1<0,1,0>UW     { align1 WE_normal 1H };
    (      38)  cmp.g.f1.0(8)   null            g116<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      40)  cmp.g.f1.0(8)   null            g117<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      42)  mov(16)         g106<1>F        -nanF                           { align1 WE_normal 1H };
    (      44)  (+f0) sel(16)   g98<1>D         g106<8,8,1>D    0D              { align1 WE_normal 1H };
    (      46)  (+f1.1) sel(16) g100<1>D        g106<8,8,1>D    0D              { align1 WE_normal 1H };
    (      48)  (+f0.1) sel(16) g102<1>D        g106<8,8,1>D    0D              { align1 WE_normal 1H };
    (      50)  (+f1) sel(16)   g104<1>D        g106<8,8,1>D    0D              { align1 WE_normal 1H };
    (      52)  add(16)         g96<1>D         g1.3<0,1,0>D    g120<8,8,1>D    { align1 WE_normal 1H };
    (      54)  send(16)        null            g96<8,8,1>UD
                data (bti: 1, rgba: 0, SIMD16, legacy, Untyped Surface Write) mlen 10 rlen 0 { align1 WE_normal 1H };
    (      56)  mov(16)         g2<1>UW         0x1UW                           { align1 WE_normal 1H };
    (      58)  endif(16) 2                     null                            { align1 WE_normal 1H };

It reduces the instruction count from 25 to 18, saving about 28% of the instructions.

v2:
Fix some minor bugs.

Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoSilence some compilation warnings.
Zhigang Gong [Fri, 28 Mar 2014 06:57:51 +0000 (14:57 +0800)]
Silence some compilation warnings.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: avoid using a temporary register for the CMP instruction.
Zhigang Gong [Thu, 27 Mar 2014 08:27:18 +0000 (16:27 +0800)]
GBE: avoid using a temporary register for the CMP instruction.

With one SEL instruction, we can easily transfer a flag
to a normal bool vector register with the correct mask.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Add two helper scalar registers to hold 0 and all 1s.
Zhigang Gong [Thu, 27 Mar 2014 07:54:49 +0000 (15:54 +0800)]
GBE: Add two helper scalar registers to hold 0 and all 1s.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: don't emit jmpi to next label.
Zhigang Gong [Thu, 27 Mar 2014 05:30:00 +0000 (13:30 +0800)]
GBE: don't emit jmpi to next label.

As the following IF will do the same thing, we don't need to
add the jmpi instruction.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: one instruction is enough for SEL_CMP now.
Zhigang Gong [Thu, 27 Mar 2014 02:05:56 +0000 (10:05 +0800)]
GBE: one instruction is enough for SEL_CMP now.

As we now have if/endif, SEL_CMP can write to the
dst register directly with the correct emask.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: pass the OCL_STRICT_CONFORMANCE env to the backend.
Zhigang Gong [Thu, 27 Mar 2014 02:00:57 +0000 (10:00 +0800)]
GBE: pass the OCL_STRICT_CONFORMANCE env to the backend.

Enable the mad pattern matching if the strict conformance
is false.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Only emit a long jump when jumping over many blocks
Zhigang Gong [Thu, 27 Mar 2014 01:36:46 +0000 (09:36 +0800)]
GBE: Only emit a long jump when jumping over many blocks

In most cases, we don't need to emit a long jump at all.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Don't need the emask/notemask/barriermask any more.
Zhigang Gong [Thu, 27 Mar 2014 06:54:15 +0000 (14:54 +0800)]
GBE: Don't need the emask/notemask/barriermask any more.

As we changed to use if/endif and changed the implementation of
the barrier, we no longer need to maintain emask/notemask/barriermask.
Just remove them.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Disable SPF and use JMPI + IF/ENDIF to handle each block.
Zhigang Gong [Tue, 18 Mar 2014 07:28:44 +0000 (15:28 +0800)]
GBE: Disable SPF and use JMPI + IF/ENDIF to handle each block.

With SPF (single program flow) enabled, we always need to use f0
as the predicate of almost every instruction. This brings some
trouble when we want a two-level mask mechanism, for example
for the SEL instruction and some BOOL operations. We
have to use more than one instruction to do that, which simply
introduces 100% overhead for those instructions.

v2:
fix the wrong assertion.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Add if/endif/brc/brd instruction support.
Zhigang Gong [Mon, 17 Mar 2014 10:08:17 +0000 (18:08 +0800)]
GBE: Add if/endif/brc/brd instruction support.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: further optimize forward/backward jump.
Zhigang Gong [Mon, 17 Mar 2014 10:01:03 +0000 (18:01 +0800)]
GBE: further optimize forward/backward jump.

We don't need to save f0 in the last part of the block;
just use it directly.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: use S16 vector to represent bool.
Zhigang Gong [Thu, 13 Mar 2014 10:54:31 +0000 (18:54 +0800)]
GBE: use S16 vector to represent bool.

The original purpose of using a flag or an S16 scalar to represent
a bool data type was to save register usage. But that brings too
much complexity to handle it correctly in every possible case, and
the consequence is that we have to take too much care about bool
handling in many places in the instruction selection stage. We
never even handled all the cases correctly. The hardest part is
that we can't touch only part of the bits in an S16 scalar register;
there is no instruction to support that. So if a bool comes from
another BB, or even from the same BB but there is
a backward JMP and the bool is still a possible livein register,
we need to emit some instructions to keep the inactive lanes'
bits at their original values.

I changed to use an S16 vector to represent the bool type, and all
the complicated cases are gone. The only big side effect is
the register consumption. But considering that a real
application will not have many bools live concurrently, this
may not be a big issue.

I measured the performance impact using LuxMark and only
observed a 2%-3% performance regression. Some easy
performance optimization opportunities remain, such as reducing
the unnecessary MOVs between flag and bool within the same
block. I think this performance regression is not a
big deal. Especially, this change will make the following if/endif
optimization a little bit easier.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: fix one misusage of flag in forward jump.
Zhigang Gong [Thu, 13 Mar 2014 04:33:39 +0000 (12:33 +0800)]
GBE: fix one misusage of flag in forward jump.

The forward jump instruction does not need the pred when comparing
the pcip with the next label. We should use the temporary flag
register.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: use a uniform style to calculate register size for curbe allocation.
Zhigang Gong [Wed, 12 Mar 2014 09:08:15 +0000 (17:08 +0800)]
GBE: use a uniform style to calculate register size for curbe allocation.

Concentrate the register allocation in one place, and don't
use hard-coded sizes when doing curbe register allocation. All
register size allocation should use the same method.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: fix the wrong usage of stack pointer and stack buffer.
Zhigang Gong [Wed, 12 Mar 2014 08:51:57 +0000 (16:51 +0800)]
GBE: fix the wrong usage of stack pointer and stack buffer.

The stack pointer and the stack buffer should be two different virtual
registers: one is a vector and the other is a scalar. The reason
the previous implementation could work is that it searched the curbe offset
and made a new stack buffer register manually, which is not good.
Now fix it and remove that hack. We actually don't need
to use the curbe offset manually after the allocation.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: refine the "scalar" register handling.
Zhigang Gong [Wed, 12 Mar 2014 06:43:08 +0000 (14:43 +0800)]
GBE: refine the "scalar" register handling.

The "scalar" register's actual meaning should be a uniform register,
and a non-uniform register is a varying register. For further
uniform analysis and bool data optimization, this patch
makes uniformity a new register attribute. We
can mark each newly created register as a uniform or varying
register.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Remove BBs that only have a label instruction.
Zhigang Gong [Wed, 26 Mar 2014 10:27:40 +0000 (18:27 +0800)]
GBE: Remove BBs that only have a label instruction.

v2:
add an extra createCFGSimplificationPass right before createGenPass,
and don't remove BBs at the GEN IR layer.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Add a new pass to handle barrier function's noduplicate attribute correctly.
Zhigang Gong [Wed, 26 Mar 2014 05:45:56 +0000 (13:45 +0800)]
GBE: Add a new pass to handle barrier function's noduplicate attribute correctly.

This pass removes or adds the noduplicate function attribute for barrier functions.
Basically, we want to set NoDuplicate on the __gen_barrier_xxx functions. But if
a sub-function calls those barrier functions, the sub-function will not be inlined
by LLVM's inlining pass, which is not what we want. As inlining such a function into
the caller is safe and we only want to avoid duplicating the call, introduce this
pass to remove the NoDuplicate function attribute before the inlining pass and restore
it afterwards.
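
As a hypothetical OpenCL C illustration of the motivation (not a test from this patch): the helper below calls barrier(); if NoDuplicate stayed on the barrier function during LLVM's inliner, the helper would not be inlined into the kernel.

    /* hypothetical helper: must be inlined into the kernel even though it
       calls barrier() (a __gen_barrier_xxx function underneath) */
    void store_and_sync(__local int *buf, int lid, int v) {
      buf[lid] = v;
      barrier(CLK_LOCAL_MEM_FENCE);
    }

    __kernel void rotate_left(__global int *out, __local int *tmp) {
      int lid = get_local_id(0);
      store_and_sync(tmp, lid, lid);
      out[get_global_id(0)] = tmp[(lid + 1) % get_local_size(0)];
    }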

v2:
fix the module changed check.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoStatistics of case running
Yi Sun [Tue, 1 Apr 2014 08:31:26 +0000 (16:31 +0800)]
Statistics of case running

summary:
-----------------
1. Add struct RStatistics to count the number of passed cases (passCount), failed cases (failCount), and finished runs (finishrun).

2. Print a statistics line; if the terminal is too narrow, don't print it:
  ......
  test_load_program_from_bin()    [SUCCESS]
  profiling_exec()    [SUCCESS]
  enqueue_copy_buf()    [SUCCESS]
   [run/total: 656/656]      pass: 629; fail: 25; pass rate: 0.961890

3. If a case crashes, count it as failed; add a function to show the statistics summary.

4. When all cases have finished, list a summary like the following:
summary:
----------
  total: 656
  run: 656
  pass: 629
  fail: 25
  pass rate: 0.961890

5. If ./utest_run &> log is used, the log will be a little messy; try the following command to analyse the log:

  sed 's/\r/\n/g' log | egrep "\w*\(\)" | sed -e 's/\s//g'

  After analysis:
  -----------------
......
builtin_minmag_float2()[SUCCESS]
builtin_minmag_float4()[SUCCESS]
builtin_minmag_float8()[SUCCESS]
builtin_minmag_float16()[SUCCESS]
builtin_nextafter_float()[FAILED]
builtin_nextafter_float2()[FAILED]
builtin_nextafter_float4()[FAILED]
......

6. Fix one issue: print out the crashed case's name.

7. Delete the debug line in utests/compiler_basic_arithmetic.cpp, which
   outputs the kernel name.

8. Define function statistics() in struct UTest, which is called by "utest_run -a/-c/-n".
   We just call this function to run each case and print the statistics line.

Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd one test case specifically for unaligned buffer copy.
Junyan He [Wed, 26 Mar 2014 10:28:02 +0000 (18:28 +0800)]
Add one test case specifically for unaligned buffer copy.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoOptimize the unaligned buffer copy logic
Junyan He [Wed, 26 Mar 2014 10:27:56 +0000 (18:27 +0800)]
Optimize the unaligned buffer copy logic

Because the byte-aligned read and write send instructions are
very slow, we optimize to avoid using them.
We separate the unaligned copy into three cases (a small
classification sketch follows the list):
   1. src and dst have the same offset modulo 4.
      Then we just need to handle the first and last dword.
   2. src has a bigger offset modulo 4 than dst.
      We need to do some shifting and merging between src[i]
      and src[i+1].
   3. src has a smaller offset modulo 4 than dst.
      Then we need to do the same for src[i-1] and src[i].
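
A hedged host-side sketch (names invented, not part of the patch) of how the three cases above can be told apart from the src/dst byte offsets:

    #include <stddef.h>

    enum copy_case { SAME_MOD4, SRC_BIGGER_MOD4, SRC_SMALLER_MOD4 };

    /* Hypothetical helper: classify an unaligned copy by the offsets modulo 4. */
    static enum copy_case classify_copy(size_t src_off, size_t dst_off) {
      size_t s = src_off % 4, d = dst_off % 4;
      if (s == d) return SAME_MOD4;        /* only first/last dword need byte handling */
      if (s > d)  return SRC_BIGGER_MOD4;  /* merge src[i] and src[i+1] */
      return SRC_SMALLER_MOD4;             /* merge src[i-1] and src[i] */
    }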

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd three copy cl files for Enqueue Copy usage.
Junyan He [Wed, 26 Mar 2014 10:27:48 +0000 (18:27 +0800)]
Add three copy cl files for Enqueue Copy usage.

Add these three cl files:
one for when src and dst are not aligned but have the same offset modulo 4,
a second for when src's offset modulo 4 is bigger than dst's,
and a third for when src's offset modulo 4 is smaller than dst's.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd kernels performance output
Yongjia Zhang [Tue, 1 Apr 2014 09:16:46 +0000 (17:16 +0800)]
Add kernels performance output

If the environment variable OCL_OUTPUT_KERNEL_PERF is set to a non-zero value,
then after the program exits, beignet will output the
timing information of each kernel executed.

v2: fixed the patch's trailing whitespace problem.

v3: if OCL_OUTPUT_KERNEL_PERF is 1, the output will only
contain the time summary; if it is 2, the output will contain
both the time summary and the details. Add 'Ave' and 'Dev' to the output:
'Ave' is the average time per kernel per execution round, and 'Dev' is the
result of dividing 'Ave' by the standard deviation over all of the kernel's executions.

Signed-off-by: Yongjia Zhang <yongjia.zhang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Fix register liveness issue under simd mode.
Ruiling Song [Mon, 24 Mar 2014 01:46:21 +0000 (09:46 +0800)]
GBE: Fix register liveness issue under simd mode.

As we run in SIMD mode with a predication mask indicating the active lanes,
if a vreg is defined in a loop and there are some uses of the vreg outside the loop,
the definition point may be executed several times under *different* predication masks.
For these kinds of vregs, we must extend the vreg's liveness over the whole loop.
If we don't do this, its liveness is killed before the def point inside the loop.
If the vreg's physical register is assigned to another vreg during the
killed period, and the instructions before the kill point are re-executed with a different predication,
the inactive lanes of the vreg may be over-written. Then the out-of-loop use will get wrong data.

This patch fixes the HaarFixture case in opencv.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Optimize the forward jump instruction.
Zhigang Gong [Mon, 17 Mar 2014 10:02:37 +0000 (18:02 +0800)]
GBE: Optimize the forward jump instruction.

As at each BB's beginning we already check whether all channels are inactive,
we don't really need to do this duplicate check at the end of a forward jump.

This patch gets about a 25% performance gain for LuxMark's medium scene.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoRefine the FCMP_ORD and FCMP_UNO.
Yang Rong [Mon, 24 Mar 2014 09:21:40 +0000 (17:21 +0800)]
Refine the FCMP_ORD and FCMP_UNO.

If either src0 or src1 of FCMP_ORD/FCMP_UNO is a constant, the constant
value must be ordered; otherwise, llvm would have optimized the instruction to true/false.
So discard the constant value and only compare the other src.
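
For illustration (hypothetical kernel, not from the patch): with one ordered constant operand, an ordered/unordered compare only depends on whether the other operand is a NaN.

    __kernel void ord_demo(__global const float *in, __global int *out) {
      int i = get_global_id(0);
      /* isordered(x, 1.0f) behaves like !isnan(x); isunordered(x, 1.0f) like isnan(x) */
      out[i] = isordered(in[i], 1.0f);
    }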

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoRefined the fmax and fmin builtins.
Yang Rong [Mon, 24 Mar 2014 08:27:31 +0000 (16:27 +0800)]
Refined the fmax and fmin builtins.

Because GEN's select instruction with cmod .l and .ge handles the NaN case,
use the compare-and-select instructions in GEN IR for fmax and fmin; they will be
optimized into one sel_cmp, and no isnan check is needed.
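
For reference, a minimal C sketch of the NaN behavior fmax has to provide (the point above is that the .l/.ge select gives this without an explicit isnan check):

    #include <math.h>

    /* Reference semantics: if exactly one argument is NaN, return the other. */
    static float ref_fmax(float a, float b) {
      if (isnan(a)) return b;
      if (isnan(b)) return a;
      return a > b ? a : b;
    }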

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: "Zou, Nanhai" <nanhai.zou@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd one test case for profiling test.
Junyan He [Mon, 24 Mar 2014 07:34:43 +0000 (15:34 +0800)]
Add one test case for profiling test.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: make byte/short vload/vstore process one element each time.
Ruiling Song [Wed, 19 Mar 2014 03:41:54 +0000 (11:41 +0800)]
GBE: make byte/short vload/vstore process one element each time.

Per the OCL spec, the computed address (p + offset * n) is 8-bit aligned for char
and 16-bit aligned for short in vloadn & vstoren. That is, we cannot assume that
a vload4 with a char pointer is 4-byte aligned. The previous implementation made
Clang generate a load or store with alignment 4 which is in fact only alignment 1.

We need to find another way to optimize vloadn.
But before that, let's keep vloadn and vstoren working correctly.
This fixes the regression caused by the byte/short optimization.
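
A small illustration (hypothetical kernel): the pointer passed to vload4/vstore4 with char only has to be 1-byte aligned, so a load like the one below must not be lowered to a 4-byte-aligned access.

    __kernel void copy_char4_unaligned(__global const char *src,
                                       __global char *dst) {
      int i = get_global_id(0);
      char4 v = vload4(i, src + 1);   /* 1-byte-aligned source is legal */
      vstore4(v, i, dst + 1);
    }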

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd SROA and GVN pass to default optLevel.
Yang Rong [Tue, 11 Mar 2014 06:43:51 +0000 (14:43 +0800)]
Add SROA and GVN pass to default optLevel.

SROA and GVN may introduce some integer types not supported by the backend.
Remove this type assert in GenWrite; when such types are found, set the unit to
invalid. If the unit is invalid, use optLevel 0, which does not include SROA and GVN,
and try again.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoutests: Refine cases for sinpi.
Yi Sun [Mon, 10 Mar 2014 03:32:12 +0000 (11:32 +0800)]
utests: Refine cases for sinpi.

The general algorithm is to reduce x to the range [-0.5, 0.5] and then calculate the result.

v2. Correct the algorithm of sinpi.
    Add some input data temporarily; we're going to design and implement an input data generator similar to what the Conformance suite does.
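
A minimal CPU reference sketch of that reduction (assumes M_PI is available from <math.h> and that |x| is small enough for the cast to long):

    #include <math.h>

    /* sin(pi*x) = (-1)^n * sin(pi*r), where x = n + r, n integer, r in [-0.5, 0.5] */
    static double ref_sinpi(double x) {
      double n = nearbyint(x);
      double r = x - n;
      double s = sin(M_PI * r);
      return ((long)n & 1) ? -s : s;
    }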

Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoMove the definition of union SF to the header file utest_helper.hpp
Yi Sun [Wed, 5 Mar 2014 05:59:26 +0000 (13:59 +0800)]
Move the definition of union SF to the header file utest_helper.hpp

Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoAdd clGetMemObjectFdIntel() api
Chuanbo Weng [Wed, 5 Mar 2014 16:08:15 +0000 (00:08 +0800)]
Add clGetMemObjectFdIntel() api

Use this API to share a buffer between OpenCL and V4L2. After importing
the fd of an OpenCL memory object into V4L2, V4L2 can directly read frames
into this memory object via DMABUF, without a memory copy.
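
A hedged host-side usage sketch; the prototype (context, mem object, int *fd) is an assumption taken from beignet's CL/cl_intel.h and should be checked against the installed header:

    #include <CL/cl.h>
    #include <CL/cl_intel.h>   /* assumed location of the extension prototype */

    /* Hypothetical helper: export the DMABUF fd of an existing cl_mem so a
       V4L2 buffer can be imported from it (error handling trimmed). */
    static int export_mem_fd(cl_context ctx, cl_mem buf) {
      int fd = -1;
      cl_int err = clGetMemObjectFdIntel(ctx, buf, &fd);   /* assumed signature */
      return err == CL_SUCCESS ? fd : -1;
    }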

v2:
Check return value of cl_buffer_get_fd

Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agomerge some state buffers into one buffer
Guo Yejun [Thu, 6 Mar 2014 16:59:38 +0000 (00:59 +0800)]
merge some state buffers into one buffer

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoFix a convert float to long bug.
Yang Rong [Mon, 3 Mar 2014 03:25:19 +0000 (11:25 +0800)]
Fix a convert float to long bug.

When converting some special float values, slightly larger than LONG_MAX, to long with saturation,
the result is wrong. Simply use LONG_MAX when the float value equals LONG_MAX.
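
An illustrative kernel (hypothetical) for the case being fixed: floats at or just above LONG_MAX must saturate to LONG_MAX.

    __kernel void f2l_sat(__global const float *in, __global long *out) {
      int i = get_global_id(0);
      /* (float)LONG_MAX rounds up to 2^63, which is already out of long range,
         so e.g. 9.3e18f must clamp to LONG_MAX */
      out[i] = convert_long_sat(in[i]);
    }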

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Optimize byte/short load/store using untyped read/write
Ruiling Song [Fri, 7 Mar 2014 05:48:48 +0000 (13:48 +0800)]
GBE: Optimize byte/short load/store using untyped read/write

Scatter/gather is much worse than untyped read/write. So if we can pack
loads/stores of char/short to use the untyped message, just do it.

v2:
add some assert in splitReg()

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: Fix a potential issue when increasing srcNum.
Ruiling Song [Fri, 7 Mar 2014 05:48:47 +0000 (13:48 +0800)]
GBE: Fix a potential issue when increasing srcNum.

If MAX_SRC_NUM for ir::Instruction is increased, unpredictable behaviour may happen.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: make vload3 only read 3 elements.
Ruiling Song [Fri, 7 Mar 2014 05:48:46 +0000 (13:48 +0800)]
GBE: make vload3 only read 3 elements.

Clang will widen a vec3 load into a vec4 load; we have to handle it in the frontend.
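
An illustrative (hypothetical) kernel: vload3 must touch exactly three elements, since reading a padding fourth element could fault at the very end of a buffer.

    __kernel void scale3(__global const float *in, __global float *out) {
      int i = get_global_id(0);
      float3 v = vload3(i, in);
      vstore3(v * 2.0f, i, out);
    }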

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: Optimize scratch memory usage using register interval
Ruiling Song [Fri, 28 Feb 2014 02:16:45 +0000 (10:16 +0800)]
GBE: Optimize scratch memory usage using register interval

Scratch memory is a limited resource in HW, and different
registers have the opportunity to share the same scratch memory,
so I introduce an allocator for scratch memory management.

v2:
In order to reuse the registerFilePartitioner, I rename it as
SimpleAllocator, and derive ScratchAllocator & RegisterAllocator
from it.

v3:
fix a typo; the scratch size is 12KB.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: show correct line number in build log
Guo Yejun [Thu, 27 Feb 2014 17:58:20 +0000 (01:58 +0800)]
GBE: show correct line number in build log

Sometimes we insert some code into the kernel;
this makes the line numbers reported in the build log
mismatch the line numbers in the kernel from the
programmer's view. Use #line to correct it.
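
A small illustration of the mechanism (file name made up): after beignet's injected prologue, a #line directive resets the reported location so diagnostics match the user's source.

    /* ... code injected by beignet ... */
    #line 1 "user_kernel.cl"
    __kernel void user_kernel(__global int *out) {
      out[get_global_id(0)] = 42;   /* an error here is reported at user_kernel.cl:2 */
    }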

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: support getelementptr with ConstantExpr operand
Guo Yejun [Wed, 26 Feb 2014 22:54:26 +0000 (06:54 +0800)]
GBE: support getelementptr with ConstantExpr operand

Add support in the LLVM IR -> Gen IR translation for the case where the
first operand of getelementptr is a ConstantExpr.

A utest is also added.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: add fast path for more math functions
Guo Yejun [Thu, 20 Feb 2014 21:51:33 +0000 (05:51 +0800)]
GBE: add fast path for more math functions

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: remove the useless get sampler info function.
Zhigang Gong [Fri, 21 Feb 2014 05:09:20 +0000 (13:09 +0800)]
GBE: remove the useless get sampler info function.

We don't need to get the sampler info dynamically, so
remove the corresponding instruction.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: optimize read_image to avoid getting sampler info dynamically.
Zhigang Gong [Fri, 21 Feb 2014 04:50:55 +0000 (12:50 +0800)]
GBE: optimize read_image to avoid getting sampler info dynamically.

Most of the time, the user uses a const sampler value in the kernel
directly, so we don't need to get the sampler value through a function
call. This way, the compiler front end can do much better optimization
than with the dynamic sampler-info query. For LuxMark's
medium/simple scenes, this patch gives about a 30-45% performance gain.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: don't put a long-lived register into a selection vector.
Zhigang Gong [Fri, 21 Feb 2014 02:40:08 +0000 (10:40 +0800)]
GBE: don't put a long-lived register into a selection vector.

If an element has a very long interval, we don't want to put it into a
vector, as it will add more pressure to the register allocation.

With this patch, spilled registers are reduced by more than 20% for LuxMark's
medium scene benchmark (from 288 to 224).

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: prepare to optimize generic selection vector allocation.
Zhigang Gong [Wed, 19 Feb 2014 02:16:48 +0000 (10:16 +0800)]
GBE: prepare to optimize generic selection vector allocation.

Move the selection vector allocation after the register interval
calculation.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: fixed a potential bug in 64 bit instruction.
Zhigang Gong [Wed, 19 Feb 2014 02:47:46 +0000 (10:47 +0800)]
GBE: fixed a potential bug in 64 bit instruction.

Current selection vector handling requires that the dst/src
vector starts at dst(0) or src(0).

v2:
fix an assertion.
v3:
fix a bug in gen_context.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fix the overflow bug in register spilling.
Zhigang Gong [Wed, 19 Feb 2014 08:36:33 +0000 (16:36 +0800)]
GBE: fix the overflow bug in register spilling.

Change to use int32 to represent the maxID.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: code cleanup for read_image/write_image.
Zhigang Gong [Tue, 18 Feb 2014 10:32:33 +0000 (18:32 +0800)]
GBE: code cleanup for read_image/write_image.

Remove some useless instructions and make the read/write_image
more readable.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fixed the incorrect max_dst_num and max_src_num.
Zhigang Gong [Tue, 18 Feb 2014 09:41:05 +0000 (17:41 +0800)]
GBE: fixed the incorrect max_dst_num and max_src_num.

Some I64 instructions use more than 11 dst registers;
this patch changes the max src number to 16 and adds an assertion
to check whether we run into this type of issue again.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: Optimize write_image instruction for simd8 mode.
Zhigang Gong [Tue, 18 Feb 2014 09:19:41 +0000 (17:19 +0800)]
GBE: Optimize write_image instruction for simd8 mode.

In simd8 mode, we can put u, v, w, x, r, g, b, a into
a selection vector directly and don't need to
assign those values again.

Let's see an example. The following code, which does a simple image
copy, is generated without this patch:

    (26      )  (+f0) mov(8)    g113<1>F        g114<8,8,1>D                    { align1 WE_normal 1Q };
    (28      )  (+f0) send(8)   g108<1>UD       g112<8,8,1>F
                sampler (3, 0, 0, 1) mlen 2 rlen 4              { align1 WE_normal 1Q };
    (30      )  mov(8)          g99<1>UD        0x0UD                           { align1 WE_all 1Q };
    (32      )  mov(1)          g99.7<1>UD      0xffffUD                        { align1 WE_all };
    (34      )  mov(8)          g103<1>UD       0x0UD                           { align1 WE_all 1Q };
    (36      )  (+f0) mov(8)    g100<1>UD       g117<8,8,1>UD                   { align1 WE_normal 1Q };
    (38      )  (+f0) mov(8)    g101<1>UD       g114<8,8,1>UD                   { align1 WE_normal 1Q };
    (40      )  (+f0) mov(8)    g104<1>UD       g108<8,8,1>UD                   { align1 WE_normal 1Q };
    (42      )  (+f0) mov(8)    g105<1>UD       g109<8,8,1>UD                   { align1 WE_normal 1Q };
    (44      )  (+f0) mov(8)    g106<1>UD       g110<8,8,1>UD                   { align1 WE_normal 1Q };
    (46      )  (+f0) mov(8)    g107<1>UD       g111<8,8,1>UD                   { align1 WE_normal 1Q };
    (48      )  (+f0) send(8)   null            g99<8,8,1>UD
                renderunsupported target 5 mlen 9 rlen 0        { align1 WE_normal 1Q };
    (50      )  (+f0) mov(8)    g1<1>UW         0x1UW                           { align1 WE_normal 1Q };
  L1:
    (52      )  mov(8)          g112<1>UD       g0<8,8,1>UD                     { align1 WE_all 1Q };
    (54      )  send(8)         null            g112<8,8,1>UD
                thread_spawnerunsupported target 7 mlen 1 rlen 0 { align1 WE_normal 1Q EOT };

With this patch, we can optimize it as below:

    (26      )  (+f0) mov(8)    g106<1>F        g111<8,8,1>D                    { align1 WE_normal 1Q };
    (28      )  (+f0) send(8)   g114<1>UD       g105<8,8,1>F
                sampler (3, 0, 0, 1) mlen 2 rlen 4              { align1 WE_normal 1Q };
    (30      )  mov(8)          g109<1>UD       0x0UD                           { align1 WE_all 1Q };
    (32      )  mov(1)          g109.7<1>UD     0xffffUD                        { align1 WE_all };
    (34      )  mov(8)          g113<1>UD       0x0UD                           { align1 WE_all 1Q };
    (36      )  (+f0) send(8)   null            g109<8,8,1>UD
                renderunsupported target 5 mlen 9 rlen 0        { align1 WE_normal 1Q };
    (38      )  (+f0) mov(8)    g1<1>UW         0x1UW                           { align1 WE_normal 1Q };
  L1:
    (40      )  mov(8)          g112<1>UD       g0<8,8,1>UD                     { align1 WE_all 1Q };
    (42      )  send(8)         null            g112<8,8,1>UD
                thread_spawnerunsupported target 7 mlen 1 rlen 0 { align1 WE_normal 1Q EOT };

This patch could save about 8 instructions per write_image.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: optimize sample instruction.
Zhigang Gong [Tue, 18 Feb 2014 06:40:59 +0000 (14:40 +0800)]
GBE: optimize sample instruction.

The U,V,W registers could be allocated to a selection vector directly.
Then we can save some MOV instructions for the read_image functions.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoChange the order of the code
xiuli pan [Fri, 21 Feb 2014 08:25:20 +0000 (16:25 +0800)]
Change the order of the code

Fix the 66K problem in the OpenCV testing.
The bug was caused by the incorrect order
of the code, which made beignet
calculate the whole local size of the kernel
file. Now the OpenCV test can pass.

Reviewed-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoFix a long DIV/REM hang.
Yang Rong [Fri, 21 Feb 2014 08:54:39 +0000 (16:54 +0800)]
Fix a long DIV/REM hang.

There is a jmpi in long DIV/REM whose predication is any16/any8, so we
MUST AND the predication register with the emask, otherwise it may loop forever.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: improve precision of rootn
Lv Meng [Tue, 14 Jan 2014 03:04:57 +0000 (11:04 +0800)]
GBE: improve precision of rootn

Signed-off-by: Lv Meng <meng.lv@intel.com>
10 years agoRemove some unreasonable input values for rootn
Yi Sun [Thu, 20 Feb 2014 01:32:32 +0000 (09:32 +0800)]
Remove some unreasonable input values for rootn

In the manual for the function pow(), there's the following description:
"If x is a finite value less than 0,
and y is a finite noninteger,
a domain error occurs, and a NaN is returned."
That means we can't calculate rootn on the CPU as pow(x, 1.0/y), which is what the OpenCL spec mentions.
E.g. when y=3 and x=-8, rootn should return -2, but when we calculate pow(x, 1.0/y) it returns a NaN.
I didn't find a multi-root math function in glibc.
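
A hedged CPU reference sketch of rootn that handles the negative-base/odd-exponent case that pow(x, 1.0/y) cannot:

    #include <math.h>

    /* Reference only: rootn(-8, 3) == -2; for x < 0 and even n, the NaN
       returned by pow() is the required result. */
    static double ref_rootn(double x, int n) {
      if (x < 0 && (n & 1))
        return -pow(-x, 1.0 / (double)n);
      return pow(x, 1.0 / (double)n);
    }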

Signed-off-by: Yi Sun <yi.sun@intel.com>
10 years agoutests: add subnormal check by fpclassify.
Yi Sun [Wed, 19 Feb 2014 06:12:03 +0000 (14:12 +0800)]
utests: add subnormal check by fpclassify.

Signed-off-by: Yi Sun <yi.sun@intel.com>
Signed-off-by: Shui yangwei <yangweix.shui@intel.com>
10 years agoChange %.20f to %e.
Yi Sun [Wed, 19 Feb 2014 06:04:52 +0000 (14:04 +0800)]
Change %.20f to %e.

This can make the error information more readable.

Signed-off-by: Yi Sun <yi.sun@intel.com>
10 years agoGBE: add param to switch the behavior of math func
Guo Yejun [Mon, 17 Feb 2014 21:30:27 +0000 (05:30 +0800)]
GBE: add param to switch the behavior of math func

Add OCL_STRICT_CONFORMANCE to switch the behavior of the math functions.
The functions will be high precision, with a performance drop, if it is 1; a fast
path with good enough precision will be selected if it is 0.

This change adds the code basis, with 'sin' and 'cos' implemented
as examples; support for other math functions will be added later.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
10 years agoutests: Remove test cases for function 'tgamma' 'erf' and 'erfc'
Yi Sun [Mon, 17 Feb 2014 03:32:47 +0000 (11:32 +0800)]
utests: Remove test cases for function 'tgamma' 'erf' and 'erfc'

Since the OpenCL conformance suite doesn't cover these functions at the moment,
we remove them temporarily.

Signed-off-by: Yi Sun <yi.sun@intel.com>
10 years agoImprove precision of sinpi/cospi
Ruiling Song [Mon, 17 Feb 2014 08:54:20 +0000 (16:54 +0800)]
Improve precision of sinpi/cospi

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: fix terminfo library linkage
Boqun Feng [Mon, 17 Feb 2014 01:49:26 +0000 (09:49 +0800)]
GBE: fix terminfo library linkage

In some distros, the terminal libraries are divided into two
libraries, one being tinfo and the other ncurses; however, in
other distros there is only a single ncurses library with
all the functions.
In order to link the proper terminal library for LLVM, the find_library
macro in cmake can be used. In this patch, tinfo is preferred,
so that the linkage behavior in distros with tinfo is not affected.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
10 years agoutests: define python interpreter via cmake variable
Boqun Feng [Sat, 15 Feb 2014 06:52:44 +0000 (14:52 +0800)]
utests: define python interpreter via cmake variable

The reason for this fix is in commit
5b64170ef5e3e78d038186fb1132b11a8fec308e.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoCL: make the scratch size a device resource attribute.
Zhigang Gong [Fri, 14 Feb 2014 08:11:36 +0000 (16:11 +0800)]
CL: make the scratch size a device resource attribute.

Actually, the scratch size is much like the local memory size,
which should be device-dependent information.

This patch puts the scratch memory size into the device attribute
structure. And when a kernel needs more than the maximum scratch
memory, we just return an out-of-resource error rather than trigger
an assertion.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Song, Ruiling <ruiling.song@intel.com>
10 years agofix typo: blobTempName is assigned but not used
Guo Yejun [Thu, 13 Feb 2014 03:59:48 +0000 (11:59 +0800)]
fix typo: blobTempName is assigned but not used

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Support 64Bit register spill.
Ruiling Song [Fri, 14 Feb 2014 07:04:26 +0000 (15:04 +0800)]
GBE: Support 64Bit register spill.

Now we support DWORD & QWORD register spill/fill.

v2:
  only increase poolOffset by 1 when we meet a QWord register and poolOffset is 1.

v3:
  allocate the reserved register pool uniformly for the src and dst registers.
  when spilling a qword register, the payload register should be retyped as dword per the bottom/top logic.
  put a limit on the scratch space memory size.

v4:
  fix a typo.
  increase the reserved registers from 6 to 8 for some complex instructions.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agocmake: Fix linking with LLVM/Terminfo
Igor Gnatenko [Thu, 13 Feb 2014 07:16:35 +0000 (11:16 +0400)]
cmake: Fix linking with LLVM/Terminfo

DEBUG: [  9%] Building CXX object backend/src/CMakeFiles/gbe_bin_generater.dir/gbe_bin_generater.cpp.o
DEBUG: Linking CXX executable gbe_bin_generater
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x717): undefined reference to `setupterm'
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x727): undefined reference to `tigetnum'
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x730): undefined reference to `set_curterm'
DEBUG: /usr/lib64/llvm/libLLVMSupport.a(Process.o): In function `llvm::sys::Process::FileDescriptorHasColors(int)':
DEBUG: (.text+0x738): undefined reference to `del_curterm'

Signed-off-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoBump to version 0.8.0.
Zhigang Gong [Mon, 10 Feb 2014 08:28:37 +0000 (16:28 +0800)]
Bump to version 0.8.0.

This version brings many improvements compared to the last released version 0.3,
so we decided to bump the version to 0.8.0 directly. Before 1.0.0, we
have two steps left: one is performance optimization and the other is to
support OpenCL 1.2 by default.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoDocs: fix some markdown errors and add some new info.
Zhigang Gong [Wed, 12 Feb 2014 07:20:45 +0000 (15:20 +0800)]
Docs: fix some markdown errors and add some new info.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoFix build errors in llvm3.5 only system.
Yang Rong [Wed, 12 Feb 2014 15:41:26 +0000 (23:41 +0800)]
Fix build errors in llvm3.5 only system.

Some header files are missing if only llvm3.5 is installed. With a previous llvm, even after
uninstalling it, these header files remain in the system, so the issue can't be triggered.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoFix the cmake problem in FindLLVM.
Zhigang Gong [Tue, 11 Feb 2014 09:51:50 +0000 (17:51 +0800)]
Fix the cmake problem in FindLLVM.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoUpdate document for LLVM/Clang 3.5.
Zhigang Gong [Mon, 10 Feb 2014 08:28:36 +0000 (16:28 +0800)]
Update document for LLVM/Clang 3.5.

Also change the README.md to link to Beignet.mdw rather than to point to the wiki page.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fixed the unsafe tmpnam_r.
Zhigang Gong [Sat, 8 Feb 2014 06:12:03 +0000 (14:12 +0800)]
GBE: fixed the unsafe tmpnam_r.

Use mkstemps instead.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoSilence a compilation warning in sampler functions.
Zhigang Gong [Sat, 8 Feb 2014 06:12:02 +0000 (14:12 +0800)]
Silence a compilation warning in sampler functions.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoAdd clang/LLVM 3.5svn support.
Zhigang Gong [Sat, 8 Feb 2014 03:16:43 +0000 (11:16 +0800)]
Add clang/LLVM 3.5svn support.

clang/llvm 3.3 has some minor bugs, such as the vector ++/-- bug, which
were fixed in 3.4. But the 3.4 version introduces more severe OCL bugs, as
below:
http://llvm.org/bugs/show_bug.cgi?id=18119
http://llvm.org/bugs/show_bug.cgi?id=18120

It seems that the community will only fix these bugs in the ToT version
rather than in the llvm 3.4 branch. I think we'd better enable clang/llvm
3.5 in beignet. Currently, 18120 is fixed in ToT, but 18119 still
breaks us. When 18119 gets fixed, I will switch the preferred version to
3.5.

Please note that when you build clang/llvm 3.5, you need to enable
cxx11 to make it compatible with beignet:

--enable-cxx11

v2:
fix the llvm3.4 issue.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoMake build compatible with Python 2.6
Jon Nordby [Thu, 6 Feb 2014 18:50:59 +0000 (19:50 +0100)]
Make build compatible with Python 2.6

Implicit numbering in format specifiers ("{}") can only be used on Py2.7+,
and Py2.6 is still in use on, for instance, CentOS 6.5 and similar.

Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoFix the kernel file open problem in utest
Junyan He [Sun, 26 Jan 2014 10:16:12 +0000 (18:16 +0800)]
Fix the kernel file open problem in utest

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Tested-by: "Sun, Yi" <yi.sun@intel.com>
10 years agoUpdate documents.
Zhigang Gong [Mon, 20 Jan 2014 10:44:03 +0000 (18:44 +0800)]
Update documents.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoGBE: fixed the out-of-range JMPI.
Zhigang Gong [Mon, 27 Jan 2014 01:26:21 +0000 (09:26 +0800)]
GBE: fixed the out-of-range JMPI.

For a conditional jump distance out of the S15 range [-32768, 32767],
we need to use an inverted jmp followed by an 'add ip, ip, distance'
to implement it. A little hacky, as we need to change the nop instruction
into an add instruction manually.

There is a possible optimization where we insert the
ADD instruction only on demand. But that would need some extra analysis
of all the branching instructions, and the distance would need to be adjusted
for those branch instructions whose start point and end point contain
this instruction.

After this patch, luxrender's slg4 renders the scene "alloy"
correctly.

v2:
fix the unconditional branch too.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang, Rong R <rong.r.yang@intel.com>
10 years agoWhen local_work_size is null, try to choose a local_work_size.
Yang Rong [Sun, 26 Jan 2014 08:36:58 +0000 (16:36 +0800)]
When local_work_size is null, try to choose a local_work_size.

After fixing all the failures found when local_work_size is not 1, re-enable it to
improve performance.

V2: refine to skip some useless loops.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoMultiply the register's hstride into the suboffset.
Yang Rong [Tue, 28 Jan 2014 03:03:15 +0000 (11:03 +0800)]
Multiply the register's hstride into the suboffset.

When a register's hstride is not 0 or 1, the suboffset will get the wrong element.
Also change some offsets that already multiplied the hstride by hard-coded values.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Implement complete register spill policy.
Zhigang Gong [Sun, 26 Jan 2014 06:07:14 +0000 (14:07 +0800)]
GBE: Implement complete register spill policy.

This patch implements a complete register spill policy.

When we need to spill a register, we always choose the
register which is in the spill candidate map and has the
maximum endpoint. One trick I used here is to merge both
the register's endpoint value and the register itself
into one single key. Then I can use one map to implement a
map ordered descending by its value (the instruction
endpoint value). This patch supports spilling both vectors
and non-vectors.

And I moved the scratch memory allocation from
instruction selection to register allocation. We may later
use the internal interval information to reduce the scratch
memory consumption.

Another big change is that I don't perform the real
spill on the fly. Instead, I moved the real spill to the end of
the whole register allocation, then spill all the registers which
are in the spillSet in one pass. This has the following advantages:
1. It only needs to loop over all instructions once.
2. When spilling one instruction, we know all the registers' status.
   Then it's easy to know the correct scratch id for each register.
   Actually, the previous implementation had a bug here.

The last part is to avoid the spill instruction restriction.
As Ruiling pointed out, the spill instructions (scratch read/write)
don't support predication correctly for non-DW data types.

This patch avoids spilling any register of an unsupported type.

After this patch, both luxrender and the opencv examples work fine on
my machine.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang, Rong R <rong.r.yang@intel.com>
10 years agoGBE: prepare to optimize the register spilling policy.
Zhigang Gong [Fri, 24 Jan 2014 09:31:29 +0000 (17:31 +0800)]
GBE: prepare to optimize the register spilling policy.

It's better to choose the proper register to spill
rather than always spilling the current register. This patch
is a preparation for a better spilling policy.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang, Rong R <rong.r.yang@intel.com>
10 years agoGBE: refine register allocation output.
Zhigang Gong [Fri, 24 Jan 2014 04:33:10 +0000 (12:33 +0800)]
GBE: refine register allocation output.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoAdd the device id for haswell GT.
Junyan He [Tue, 14 Jan 2014 08:43:42 +0000 (16:43 +0800)]
Add the device id for haswell GT.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoFix the bug in removeLOADIs function.
Junyan He [Wed, 22 Jan 2014 06:02:30 +0000 (14:02 +0800)]
Fix the bug in removeLOADIs function.

The logic for replacing the dst of the instruction was using the src
number and getSrc. Fix this problem.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: allow the bool registers to be expired.
Zhigang Gong [Thu, 23 Jan 2014 06:25:55 +0000 (14:25 +0800)]
GBE: allow the bool registers to be expired.

After the previous patch's extra liveness analysis, we can allow bool
registers to expire now.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agoGBE: Implement an extra liveness analysis for the Gen backend.
Zhigang Gong [Thu, 23 Jan 2014 06:15:05 +0000 (14:15 +0800)]
GBE: Implement an extra liveness analysis for the Gen backend.

  Consider the following scenario: %100's normal liveness starts from its
  definition in Ln-1. In the normal analysis, Ln-1 is not Ln's predecessor, so the liveness
  of %100 is passed to Ln but is not passed on to L0.

  But consider that we are running on a multi-lane vector machine with predication.
  The unconditional BR in Ln-1 may be removed, and we may enter Ln with a subset of
  the complement of Ln-1's predication. For example, when running Ln-1 the active lanes
  are 0-7, and at Ln the active lanes are 8-15. Then at the end of Ln, a subset of lanes 8-15
  will jump to L0. If a register %10 is allocated the same GRF as %100 (given that
  their normal liveness doesn't overlap), a subset of lanes 8-15 will be
  modified. If %10 and %100 are the same vector data type, then we are fine. But if
  %100 is a float vector and %10 is a bool or short vector, then we hit a bug here.

L0:
  ...
  %10 = 5
  ...
Ln-1:
  %100 = 2
  BR Ln+1

Ln:
  ...
  BR(%xxx) L0

Ln+1:
  %101 = %100 + 2;
  ...

  The solution to this issue is to build an extra liveness analysis. We start with
  those BBs that have a backward jump, then pass all the liveOut registers as extra liveIn
  of the current BB and forward this extra liveIn to all the blocks. This is very similar
  to the normal liveness analysis, just in the reverse direction.

  Thanks to Yang Rong, who found this bug.

v2:
  Don't remove livein when initialize the extra livein.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agoGBE: increase the disassembly output's readability.
Zhigang Gong [Wed, 22 Jan 2014 02:32:08 +0000 (10:32 +0800)]
GBE: increase the disassembly output's readability.

Add label information and the instruction address
prefix. Make the address consistent with fulsim.
And also make the register allocation output a little
bit prettier.

Now the disassembly output is as below:
compiler_ceil's disassemble begin:
  L0:
    (0       )  mov(1)          f0<1>UW         0x0UW                           { align1 WE_all };
    ....
    (32      )  (+f0) mov(16)   g1<1>UW         0x1UW                           { align1 WE_normal 1H };
  L1:
    (34      )  mov(16)         g112<1>UD       g0<8,8,1>UD                     { align1 WE_all 1H };
    ...
compiler_ceil's disassemble end.

The register allocation output is as below:
%26      g2  .8   4  B  [0        -> 0       ]
%28      g2  .12  4  B  [0        -> 6       ]
%29      g2  .16  4  B  [0        -> 9       ]
%30      g126.0   64 B  [2        -> 3       ]
%31      g124.0   64 B  [3        -> 4       ]

Please note that the register allocation output is not correct
when the register is a pure scalar (bool) register allocated
at the backend instruction selection stage. To be fixed.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agoGBE: fixed a bug in sample instruction.
Zhigang Gong [Tue, 21 Jan 2014 05:15:39 +0000 (13:15 +0800)]
GBE: fixed a bug in sample instruction.

The sample instruction only has 3 source operands now, not 4.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agoGBE: fix some incorrect gen ir output messages.
Zhigang Gong [Tue, 21 Jan 2014 04:13:04 +0000 (12:13 +0800)]
GBE: fix some incorrect gen ir output messages.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoGBE: don't allocate grf for those bools which map to flag.
Zhigang Gong [Tue, 21 Jan 2014 00:34:29 +0000 (08:34 +0800)]
GBE: don't allocate grf for those bools which map to flag.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agobuild: work around an old version cmake bug.
Zhigang Gong [Mon, 20 Jan 2014 09:14:48 +0000 (17:14 +0800)]
build: work around an old version cmake bug.

On Fedora Core 15 with cmake 2.8.4, Yi experienced a build error.
It turns out that cmake may handle file directories with double
slashes incorrectly when the file is on a target's dependency list and
is an output file name of a custom command.

This small patch works around that issue.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Sun, Yi" <yi.sun@intel.com>
10 years agoGBE: use native exp instruction when enough precision
Guo Yejun [Mon, 20 Jan 2014 00:38:23 +0000 (08:38 +0800)]
GBE: use native exp instruction when enough precision

For input data where the precision is sufficient, use the native exp instruction;
otherwise, use the software path to emulate the exp function.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>