review.tizen.org Git - contrib/beignet.git/log

GBE: fix disassembly bug.

Those math attributes are not valid on Gen7/7.5.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

utests: replace the nodistriutable picture.

According to https://bugs.debian.org/758442, we should not use
Len(n)a standard test image in our package. I just select a picture
took by myself. Thanks Rebecca for pointing this out.

v2:
forgot to add sample.bmp.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

utest: change the box_blur_image to be identical to box_blur.

Change box_blur_image to read integer type surface thus
it could be totally identical to the box_blur thus they
can share the same reference image.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

add utest function bswap.

this llvm instrincs bswap function is generated by calling
__builtin_bswap.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

enable llvm intrinsic call bswap function.

this intrinsic call is implemented at the GEN IR level currently,
should be optimazed later.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Fix live range for temporary register in replaceReg

previously it is simply assigned as [insnID, insnID], But it is used in 2 instruction:
[1] MOV tmp, replacedReg
[2] send null, addr, tmp,...

As minID maxID is equal, it will be treated as temporary register during spill,
and no scratch memory allocated. But scratch register is allocated per instruction,
if tmp in [1] is assigned g100 from spill register pool, instruction [2] would have no idea of that,
it will assign another register like g102 from spill register pool. As no scratch memory allocated,
we cannnot do any spill/unspill operation between them.

To fix this issue, I change the liveness range according to isSrc, so spill/unspill
could work as normally.

This patch fix the cos16 test case failure in piglit under strict conformance.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: workaround register allocation fail caused by custom loop unroll.

As this issue only occurs under strict math, we disable custom
loop unroll if strict math is enabled.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

add utest for llvm intrinsic call usub_with_overflow funtion.

as llvm couldn't recognize the pattern of usub overflow, this usub
with is generated by calling the intrinsic function __builtin_usub_overflow;
also this type of uadd intrinsic funtion couldn't support short/byte type
overflow, we choose another way for the uadd kernel to generate short/byte
overflow.
will send patch to llvm later to fix the 2 issues.

v2: split the patch.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

enable llvm intrinsic call usub_with_overflow funtion.

implement usub_with_overflow without any optimization.

v2:
fix the wrong implementation:
change ctx.LT(dst0Type, overflow, dst0, src1)
to ctx.GT(unsignedType, overflow, dst0, src0)

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

use env to set environment variables for GBE_BIN_GENERATER

cmake interprets OCL_PCM_PATH=... as a command and will enclose it in
quotes in case it contains characters requiring protection, e.g. ~
a quoted "FOO=bar" is interpreted by /bin/sh as a command (that does not
exist), not a variable setting for a following command

use env to set the variables unambiguously

Signed-off-by: Andreas Beckmann <anbe@debian.org>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

fix some typos

Signed-off-by: Andreas Beckmann <anbe@debian.org>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Remove intel_gpgpu_check_binded_buf_address()

On recent kernel with full PPGTT support, we can possibly bind buffer
offset with 0, but intel_gpgpu_check_binded_buf_address() always thinks
it's invalid, which is not true. So simply remove the check.

v2: Add comment on why this was added and leads to failure. Use FIXME tag
on issue that needs to be fixed otherwise.

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

IVB/HSW/BYT: Revert the Dynamic state Base Addr and relative buffers address setting.

I have changed the setting of the curbe/sampler state/idrt buffer address, whith set
the Dynamic State Base Addr in intel_gpgpu_set_base_address and then set the relative
offset of these address when set these buffer address.
But it may cause some some image tests random fail in IVB/HSW/BYT.
Revert to the previous manner to set these address which always set absolute address.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

BDW: Fix bwd 32*32 scalar multiplication bug.

When scalar multiplication, must disable predicate and don't need specail handle.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Tested-by: Zhu, BingbingX <bingbingx.zhu@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix a size assert when setup bti.

Global constant buffer size is not align to 4 byte, will cause assert in BDW when set bti.
Per spec, the low two bits of surface state's width must be 11 if SURFACE_BUFFER's format is RAW.
Align the global constant buffer size to 4.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Refine the intel gpgpu delete.

The intel gpgpu struct is destroyed when a new gpgpu struct needed. But in that time,
the command batch relative with the destroyed gpgpu may have not finish, and the resource
in gpgpu still used by gpgpu, can't be destroyed.
So, when delete a gpgpu, check the batch status, if have not complete, insert to list in intel driver,
and delete all finished gpgpu in that list.

V2: V1 assume all the gpgpu list would complete in order, but it is not true. The insert order in is the
gpgpu delete order, not the flush command batch order. So visit and delete all complete gpgpu in the list.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix the compare_image_2d_and_1d_array test case bug

The test case use OCL_MAP_BUFFER_GTT to map the image
buffers and then do the result comparison, which may
cause problems.
On IVB and HSW, the slice pitch is equal but on BDW,
because we change the slice pitch of image array, it
cause this bug.
Modify it by using the standard clEnqueueReadImage API.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

add utest compiler_overflow for llvm intrinsic function.

this case only runs for uadd_with_over_flow function so far.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

add llvm Intrinsic call support.

so far this patch only support uadd_with_overflow.
sadd_with_overflow, usub_with_overflow, smul_with_overflow, umul_with_overflow, bswap are not supported yet.
this funtion should be implemented by carrier flag later.

v2: update comments.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: fix a wrong type of cl_device_info.

Per OpenCL spec 1.2:
CL_DEVICE_IMAGE_MAX_BUFFER_SIZE should be size_t type rather
than cl_ulong.

This bug will cause problems on i386 platform.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>

GBE: set default address space to -1 to avoid incorrect unroll hint.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: Meng Mengmeng <mengmeng.meng@intel.com>

Fix AUX buffer for page alignment

Apply ALIGN() for aux buffer size from beginning has no effect.
Move to the end of all state offsets set for page alignment.

v2: Update comments

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: "Guo, Yejun" <yejun.guo@intel.com>

Use pread/pwrite for buffer enqueue read/write

Instead of mmap, use pread/write interface for bo read/write with
optimized operations.

Result on one mem bandwidth benchmark for buffer enqueue read/write on HSW.

**** Host to device copy - workgroup_size=512

                     old code with mmap       new code with pread/pwrite

WG=512  SZ=  1 KiB     2325 MB/s               3336 MB/s
WG=512  SZ=  2 KiB     4479 MB/s        6270 MB/s
WG=512  SZ=  4 KiB     8215 MB/s       11808 MB/s
WG=512  SZ=  8 KiB    14271 MB/s       19493 MB/s
WG=512  SZ= 16 KiB    16456 MB/s       20079 MB/s
WG=512  SZ= 32 KiB    21136 MB/s       22334 MB/s
WG=512  SZ= 64 KiB    24785 MB/s       24792 MB/s
WG=512  SZ=128 KiB    24590 MB/s       24908 MB/s
WG=512  SZ=256 KiB    17928 MB/s       21435 MB/s
WG=512  SZ=512 KiB    18346 MB/s       20583 MB/s
WG=512  SZ=  1 MiB    18558 MB/s       20808 MB/s
WG=512  SZ=  2 MiB    18582 MB/s       20939 MB/s
WG=512  SZ=  4 MiB    15382 MB/s       18230 MB/s
WG=512  SZ=  8 MiB     7737 MB/s       11558 MB/s
WG=512  SZ= 16 MiB     7073 MB/s        8962 MB/s
WG=512  SZ= 32 MiB     6984 MB/s        8302 MB/s
WG=512  SZ= 64 MiB     6938 MB/s        8308 MB/s
WG=512  SZ=128 MiB     6950 MB/s        8271 MB/s
WG=512  SZ=256 MiB     6941 MB/s        8034 MB/s

**** Device to host copy - workgroup_size=512

                     old code with mmap       new code with pread/pwrite

WG=512  SZ=  1 KiB     2165 MB/s               2596 MB/s
WG=512  SZ=  2 KiB     4453 MB/s        6154 MB/s
WG=512  SZ=  4 KiB     8211 MB/s       11198 MB/s
WG=512  SZ=  8 KiB    14001 MB/s       11319 MB/s
WG=512  SZ= 16 KiB    16218 MB/s       14394 MB/s
WG=512  SZ= 32 KiB    21310 MB/s       22438 MB/s
WG=512  SZ= 64 KiB    24738 MB/s       25237 MB/s
WG=512  SZ=128 KiB    22784 MB/s       25113 MB/s
WG=512  SZ=256 KiB    18018 MB/s       22265 MB/s
WG=512  SZ=512 KiB    18429 MB/s       21687 MB/s
WG=512  SZ=  1 MiB    18654 MB/s       21856 MB/s
WG=512  SZ=  2 MiB    18655 MB/s       21998 MB/s
WG=512  SZ=  4 MiB    15098 MB/s       17563 MB/s
WG=512  SZ=  8 MiB     7737 MB/s       11170 MB/s
WG=512  SZ= 16 MiB     6989 MB/s        8812 MB/s
WG=512  SZ= 32 MiB     6891 MB/s        7597 MB/s
WG=512  SZ= 64 MiB     6868 MB/s        7798 MB/s
WG=512  SZ=128 MiB     6838 MB/s        7630 MB/s
WG=512  SZ=256 MiB     6842 MB/s        7466 MB/s

The benchmark can be got from http://cgit.freedesktop.org/~zhen/MPBenchmarks/.

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Clear batch buffer pointer after unmap

Fix libdrm warning about unmap the unmapped buffer, as when
batch terminate we will try to do unmap again, but batch has
already been unmapped.

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Make use of write enable flag for mem bo map

Use drm/intel optimization for mem bo mapping in case of read or write.
So we could be possibly waiting less.

This also adds 'map_flags' check in clEnqueueMapBuffer/clEnqueueMapImage
for actual read or write mapping.

But currently leave clMapBufferIntel untouched which might break ABI/API.

v2: Fix write_map flag in clEnqueueMapBuffer/clEnqueueMapImage.

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: "Guo, Yejun" <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: handle dead loop BBs in liveness analysis.

Considering the following CFG, our previous liveness analysis will only
back traverse from the exit point BB, thus the BB 6 and 7 will not be handled.

    4---
    |   |
    |   |
    5 --|---> 10 ---> ret
    |   |
    |   |
    6<--
    |
    |<--
    7   |
    |   |
     ---

Although the CFG looks not a normal application as once it goes to
block 6, it will enter a dead loop and will never return, we still need
to compile it successfully.

This patch is to fix the bug at:
https://bugs.freedesktop.org/show_bug.cgi?id=85362

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

create GIT_SHA1 without any dependency

Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Add the disasm support for Gen8

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fix regression caused by simple block optimization.

Almost all 64bit related instructions and unaligned load
instruction should be complex instruction. We need to exclude
them from simple block.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Xionghu Luo <xionghu.luo@intel.com>

add beignet GIT_HAL1 if there is .git directory

Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: add basic PHINode support in legalize pass.

Lack of the incomplete PHINode support currently.
After this patch, we could fix all the crash cases in piglit.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>

GBE: increase maximum src/dst operands to 32.

As we may bitcast a <16 * i64> to/from <32 * i32> due to
the legalize pass, we have to increase the maximum operands
number to 32 and fix some assertions accordingly.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>

GBE: add Selection instruction handler at legalize pass.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>

Add the test case for image 2d array fill

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Add the test case for image 1d array fill

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix the bug of 1D array slice pitch

For BDW, the vertical align is 4 at least.
This cause the slice pitch twice as big as
the Gen7 for 1D buffer array.
Because the buffer tiling alignment may change
for different GENs, we move it from run time to
intel driver.

V2:
  Fix all the bugs about 1d and 2d image array.
  And delete the tile align size which is useless.
  Also integrate two image array test cases into
  this patch set.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

use global flag 0.0 to control unstructured simple block.

filter the simple block out and replace the if/endif with global flag
to control.

v2: fix the luxmark sala performance degression due to extern flag in a
BRA instruction.

v3: fix compiler_switch regression, LOAD/STORE instruction could
call replaceSrc/replaceDst to generate 2 extra MOV instruction; exclude
the scalar instructions since they don't have prediction.

this patch is somewhat dangerous to change the instruction structure of block,
will add sanity check after emitInstructionStream to assert if illegally
modified.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@intel.com>

GBE: disable custom loop unroll for LLVM 3.3/3.4.

To fix a build error with LLVM 3.3/3.4.

v2:
should include llvm-config before check llvm version.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: Add a customized loop unrolling handling mechanism.

By default, the unrolling threshold is relatively small.
Thus some relative large loops which access private array
will not be unrolled, thus those private array can't
be scalarized latter. And the private array is allocated
in stack which is extreme slow for Gen backend currently.

To increase the unrolling threshold for all loops is not
a good idea, as most of the loops don't need to do unrolling
for this purpose and a large unrolling threshold will cause
a big code size and unecessary big register pressure which
may lead to register spilling.

So this patch introduce a trade-off pass to identify those
loops which still have private load/store in the outer most
of the loop. Then add a metadata to it to indicate aggresive
unrolling on those loops. Then do another round loop unrolling.

This patch with the previous small patch, can bring significant
performance improvement for some cases. I just tested with some
opencv test cases, and observed it can bring 2x to 10x improvement.

v2:
refine the parent loop unroll analysis method.

v3:
disable this pass for LLVM 3.3/3.4.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fix a legalize pass bug when bitcast wide integer to incompaitble vector.

Our wide integer legalize pass will assume the source type of a wide
integer must be the same as the final use type. But this is not always
true. The following case is a real example:

      %conv.i.i.14 = sext i8 %usrc0.i.sroa.0.14.extract.trunc to i32
      %call.i.i2.14 = tail call i32 @__gen_ocl_abs(i32 %conv.i.i.14) #5
      %conv1.i.i.14.mask = and i32 %call.i.i2.14, 255
      %uret.i.sroa.0.14.insert.ext = zext i32 %conv1.i.i.14.mask to i128
      %uret.i.sroa.0.14.insert.shift = shl nuw nsw i128 %uret.i.sroa.0.14.insert.ext, 112
      ......
      %uret.i.sroa.0.15.insert.mask = or i128 %uret.i.sroa.0.14.insert.mask.masked, %uret.i.sroa.0.14.insert.shift
      %uret.i.sroa.0.15.insert.insert = or i128 %uret.i.sroa.0.15.insert.mask, %uret.i.sroa.0.15.insert.shift
      %2 = bitcast i128 %uret.i.sroa.0.15.insert.insert to <16 x i8>

The wide integer i128 %uret.i.sroa.0.16.insert.insert is from an
i32 integer %conv1.i.i.14.mask. But the use of it is i8 vector
which breaks our assumption.

According to ruiling's good suggestion, we always bitcast the wide integer to
a compatible vector, take the above example:
  %3 = bitcast i128 %uret.i.sroa.0.15.insert.insert to <4 x i32>
Then insert a bit cast instruction to convert it to the original destination
  %2 = bitcast <4 x i32> %3 to <16 x i8>

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

Refine the the error handling in function cl_command_queue_ND_range_gen7.

Return error code when work group error, and remove printf and exit(-1) when internal error..

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix a HSW regression.

slmoffset use as DWORD register, but declare WORD, fix it.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: do intrinsics lowering pass earlier.

After the intrinsics lowering pass, it may introduce some new loops
which could be optimized nicely with the previous loop unrolling and
SROA PASS which especially may elminiate private data access.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

Fix a upsample regression.

In GenEncoder, unpack is not good, so move the upsample int/short from GenEncoder to instruction selection.
Still handle upsample long in GenContext.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

BDW: Also need set Shader Channel Select for constant buffer in BDW.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

BDW: Fix load/store half error.

BDW support HF data type, so use mov directly to convert between Half Float/Float.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

BDW: Change BDW's max work group size to 512.

Opencv only query and use device max work group size, when SLM/Barrier enable, BDW
can't fill 1024 work group in one subslice, even in SIMD16. Change device's max work
group size temp.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix a HSW constant buffer regression.

HSW's constant buffer read use ld message, also need set Shader Channel Select in surface state.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>

GBE: fix a bug in legalize pass.

The Shr/Shl llvm instructions require to have identical type for
the two operands, so we can't set the shift count to i32 always.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: Zhu, BingbingX <bingbingx.zhu@intel.com>

Fit the printf bug in loop

The static analyse for printf can not totally work
when the printf inst is within the loop and the loop
can not be unrolled. This causes the printf just to
print one info for a loop and to lose all the others.
We now increment the exec number every time the printf
inst is triggered. The number is stored for output all
the message later.
The problem is that we can not caculate the exact loops
number for each printf inst. The wrong loop number will
cause the data overwritten. We now assume all the printf
inst are in loop and store the data like this:
| PRINTF1_DATA PRINTF2_DATA ... | PRINTF1_DATA PRINTF2_DATA ... | ...
| DATA_LOOP_ONE | DATA_LOOP_TWO | ...
Although this may cause some space wasted.

Another problem is that we need to decide the size of printf buffer
because the loop upbound can not be caculated. We just set
it yo 1M for small info slot request and 4M for big one.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix HSW thread_n <= 64 assert.

In function cl_get_kernel_max_wg_sz, hsw's thread count may large than 64,
add a max limit.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

add utest popcount for all types.

v2: add all types to test.
v3: fix signed type count bits error.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

add opencl-1.2 builtin function popcount.

the popcount function returns the number of non-zero bits in input.
use GEN instruction cbit(Count Bits Set) to implement it.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Let the failure of intel_drm lib's check as a FATAL_ERROR

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix memcpy and memset bug.

In ocl_memcpy.ll and ocl_memset.ll, index+4 should be less than size when use int in
memcpy and memset, and need consider alignment.

V3: For performance, provide two versions of memcpy and memset, decide call which one when lowering intrinsic.
V4: add these new functions in the bitcode link filter list.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

Add the libdrm version check.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Refine the shared function ID define.

The old sfid define is obsolete and confusing.
Correct the names and delete the invalid ones.
The math shared function is no longer included in sfid,
and some modification for disasm to print the math
related infomation correctly.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix a HSW regression.

HSW's JMPI instruction has 32bits JIP, can't merge JIP and UIP as other jump instruction.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Re-apply "improve the build performance of vector type built-in function."

This reverts commit c65c0087166a2194ece457d8739d06e86a857dbe.

As we have handled wide integers, we can enable it now.
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: add legalize pass to handle wide integers

This legalize pass will break wider integers like i128/i256/... into shorter ones.
The problem is how to choose the shorter type? From my observation,
wide integer type always comes from shorter ones through 'zext' on small type
or 'bitcast' on vectors, so we simply choose the type where it comes from.
Then we can split wide integer operations into operations on shorter interger.

v2:
  add an assert on the wide integer bit-width, should be power of 2.
  use rpo_iterator to make sure traverse Value def before its use.

v3:
  drop all references before erase processed instruction.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Fix a bug when setting flag register

we should use simd1, instead of simd8/simd16.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix a segment fault.

llvm::CallInst::CallInst::getCalledFunction may return NULL, can't call getIntrinsicID
directly.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

BDW: Add gen8 into intel_driver_init

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

Modify the bind sampler logic for gen8

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

Add sampler state and tile define for gen8.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

BDW: Correct stack setting in BDW.

Remove special fftid handle for HSW in Gen8Context, and change stack buffer address
to QWORD, for curbe. Because it only waste 4 bytes register in other platform, change
to QWORD for all platform.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Need not restore SLM setting in BDW.

Restore SLM setting may cause some test random fail, remove it.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Fix a scaler int 32*32 bug.

Seems BDW's scalar mul need QWROD dst, otherwise will touch the dst's follow register

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Correct BDW device name.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Forgot to set UIP of else in BDW.

GEN_OPCODE_ELSE also need set UIP, add it.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Correct scratch buffer of BDW.

BDW's scratch buffer change to power 2 alignment from 1024.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Add device's sub slice field, for cl_get_kernel_max_wg_sz.

When SLM enable, get kernal max workgroup size should return the a sub slice's max thread * simdwidth.
So need the sub slice information.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Add BDW Device id to gen binary generater and binary serialize in backend.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: BDW don't need add slm offset, remove it.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Refine BDW's int 32*32 multiply.

BDW support int32 * int32 directly. So add a flag to selection for it.
BDW use int32*int16 when use acc. Because int32*int16 also work in IVB,
change to int32*int16 when use acc.
Need refine int32*int32 to long later.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Fix unsample bug.

When set the hstride to 2, also need set vstride to 16.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: enable SLM in BDW.

BDW's SLM control register change to L3CNTLREG, offset is 0x7034.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Fix Pointer argument curbe alloce size.

Because kernel will write 64bits address when reloc, so when reloc argument
in the curbe bo, the pointer need 8 byte curbe.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: add some BDW function.

Add intel_gpgpu_load_vfe_state_gen8, intel_gpgpu_walker_gen8, intel_gpgpu_build_idrt_gen8.
Reloc Dynamic State Base Address in gen7's intel_gpgpu_set_base_address, to unify intel_gpgpu_load_curbe_buffer
and intel_gpgpu_load_idrt.
Now can pass part of utest builtin_global_id.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Refine intel_gpgpu_setup_bti and add intel_gpgpu_set_base_address for BDW.

Because the sizeof struct surface state change in BDW, remove gen6_surface_state, and
use gen_surface_state as the unoin of gen7_surface_state and gen8_surface_state.
Use gen_surface_state in surface_heap_t.
Reloc the Dynamic State Base and Instruction Address in intel_gpgpu_set_base_address_gen8.
BDW use 48 bits GPU address, so when reloc address, remember that kernel will reloc 64 bits in
command batch, so make sure there are 64 bits address, the high 64bits follow by low 32bits in command batch.

v2:
remove binary .swp file.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Pass Jip and Uip when patchJMPI.

Do not like GEN7, BDW's Jip is in bits4 and Uip is in bits3, so should set Jip
and Uip independently.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Add function intel_gpgpu_bind_buf for gen8.

Must call cl_bind_buf instead of intel_gpgpu_bind_buf directly in intel_gpgpu.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Correct surface base address set in setup bti.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Add function intel_gpgpu_setup_bti for gen8.

Also set the correct surface cache control.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: refine the gen8_surface_state_t.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

BDW: Add gen8 surface state struct.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

BDW: Add class Gen8Context.

Now Gen8Context is almost same as Gen75Context, but still derive Gen8Context from GenContext for clearly.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Add Gen8Encoder and Gen7Encoder.

Class Gen8Encoder and Gen7Encoder derive from GenEncoder, and Gen75Encoder derive from Gen7Encode.
GenNativeInstruction is handled in class GenEncoder, Gen7NativeInstruction is handled in class
Gen7Encoder and Gen75Encoder, and Gen8NativeInstruction is handled in classe Gen8Encoder.
Disable Gen8's instruction compact temporary, should add compact and disassemble later.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Add BDW instruction define.

Seperate GEN7 instruction and GEN8 instrunction. GenNativeInstruction will become a union of
Gen7NativeInstruction and Gen8NativeInstruction.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

BDW: Add BDW pci ids and BDW device struct.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

Avoid use GenNativeInstruction directly out of GenEncode and gen_insn_compact.

Use the void* instead of when do instruction compact/decompact.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

GBE: structurized loop exit need an extra branching instruction when do reordering.

When we want to reorder the BBs and move the unstructured BB out-of the
structured block, we need to add a BRA to the block. If the exit of the
structured block is a loop, we need to append a unconditional BRA right
after the predicated BRA. Otherwise, we may lost the correct successor
if an unstructured BB is moved next to this BB.

After this patch, with loop optimization enabled, there is no regression
on both utests and piglit. But there are still a few regressions in opencv
test suite:
[----------] Global test environment tear-down
[==========] 8 tests from 2 test cases ran. (40041 ms total)
[  PASSED  ] 2 tests.
[  FAILED  ] 6 tests, listed below:
[  FAILED  ] OCL_Photo/FastNlMeansDenoising.Mat/2, where GetParam() = (Channels(2), false)
[  FAILED  ] OCL_Photo/FastNlMeansDenoising.Mat/3, where GetParam() = (Channels(2), true)
[  FAILED  ] OCL_Photo/FastNlMeansDenoisingColored.Mat/0, where GetParam() = (Channels(3), false)
[  FAILED  ] OCL_Photo/FastNlMeansDenoisingColored.Mat/1, where GetParam() = (Channels(3), true)
[  FAILED  ] OCL_Photo/FastNlMeansDenoisingColored.Mat/2, where GetParam() = (Channels(4), false)
[  FAILED  ] OCL_Photo/FastNlMeansDenoisingColored.Mat/3, where GetParam() = (Channels(4), true)

So let's keep this optimizaion disabled. Will enable it when I fixed all
the known issues.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Luo <xionghu.luo@intel.com>

GBE: fix a loop header file including bug.

function.hpp doesn't need to include the structural_analysis.hpp.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Luo <xionghu.luo@intel.com>

Use instruction WHILE to manipulate structure.

1. WHILE instruction should be non-schedulable.
2. if this WHILE instruction jumps to an ELSE instruction, the distance
need add 2.

v2:
We also need to take care of HSW for while instruction.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

add handleSelfLoopNode to insert while instruction on Gen IR level.

v2:
disable loop optimization by default due to still buggy.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

Add Gen IR WHILE.

Add Gen IR WHILE to mark the strucutred region.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE/libocl: Add __gen_ocl_get_timestamp() to get timestamp.

Gen provide tm0 register for intra-kernel profiling.
Here we provide an API __gen_ocl_get_timestamp() to return
the timestamp in TM.

The return type is defined as:
struct time_stamp {
ulong tick;
uint event;
};

'tick' is a 64bit time tick. 'event' stores a value which means
whether a tmEvent has occured (non-zero) or not (0). tmEvent includes
time-impacting event such as context switch or frequency change
since last time tm0 was read.

I add a sample in the kernels/compiler_time_stamp.cl. Hope it
would help you understand how to use it.

V2:
Introduce ir::ARFRegister to avoid directly use of nr/subnr in Gen IR.
Rename __gen_ocl_extract_reg to __gen_ocl_region.
Rename beignet_get_time_stamp to __gen_ocl_get_timestamp.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE/libocl: fix build dependency issue.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: He Junyan <junyan.he@inbox.com>

Add long support for printf

V2:
Replace all the long and ulong to int64_t

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Output linkModules's error message.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>