review.tizen.org Git - contrib/beignet.git/log

Added requires on ocl-icd

updated version to 1.0

add smack manifest

initial packaging

Bump to 1.0.0.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

GBE: work around error reporting for unresolved symbols

Beignet currently doesn't have a good error reporting mechanism
for internal passes. Currently, almost all error will cause an
user unfriendly assert. This is ok for most of the cases, but
it is really not good for unresolve symbols error, as it is not
a real compiler internal bug and we should not assert. We should
report it to help user to identify the error and fix that in the
cl kernel.

This patch is just a work around. We will implement a better
error handling in the future to consolidate all this type of
error into the normal error log buffer.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

Fix the bug of multi-thread crash

The cl_thread has a potential problem.
If the threads are created and destroyed very fast,
while the queue remain avaible, the resource of
destroyed thread will not be free correctly and will
be wrongly reused by later created thread.

V2:
   Use a easy way to handle this case. We do not clear
   the resource and just keep it. The later thread will
   not wrongly reuse it. The thread number will not be
   very huge, so it is reasonable to clear all the
   resource when the command queue is destroyed.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

runtime: disable userptr due to random fail.

We get a random fail when enable userptr.
Disable it before we root cause and fix it.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Guo, Yejun" <yejun.guo@intel.com>

runtime: fix bug in cl_enqueue_read_buffer.

If the buffer is a userptr buffer, we should copy it directly.
Otherwise, it fails in libdrm. As drm_intel_gem_bo_subdata() refuses
to read a userptr buffer object.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Guo, Yejun" <yejun.guo@intel.com>

runtime: refine version handling.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

Update documents.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

runtime: fix one bug in BDW image.

As we still have the image 1d array workaround, we need to
fix it for BDW as well.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

update some documents.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

GBE: fix one double related bugs for post register scheduling.

We need to set the temporary register to U64 type, otherwise
latter post register scheduling will do some bad things.

Although we don't support double currently, this bug still could
be triggerred easily if you use printf("%f", foo).

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

Revert "BDW: Change the default tiling mode to TILING_Y on BDW."

This reverts commit f2c57a46de4f51fa5d4c8e02cc751fce7ff417c8.

GBE: fix relocatable issue for pch file.

When we use pch file, we need to provide the orignal header files.
Otherwise, clang may complain errors. Clang provide a compile option
--relocatable-pch to enable pch/header files relocation. We have to
use this option when we want to install the header files/pch file
into system diretory.

v2:
exchange beignet.pch and beignet.local.pch.

v3:
fix comment.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>

License: adjust all license version to LGPL v2.1+.

To make the license statement consistent to each other, adjust
all license versions to v2.1+. Thus beignet should have a pure
LGPL v2.1+ license.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

utests: remove all shader toy test cases.

As we can't find the original license of these test cases, we
have to remove them from beignet's unit test cases.
Reported by "Rebecca N. Palmer" <rebecca_palmer@zoho.com>.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

docs: update mixed_buffer_pointer document.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Revert "fix issue to create cl image from libva with non-zero offset"

We found this patch cause some serious regressions. Considering it is not
part of the OCL standard API, we choose to revert it for 1.0 release.

This reverts commit b6660fa343e4e80231123695834cc24e3fc5487b.

Revert "add test for clCreateImageFromLibvaIntel"

This reverts commit 9e236b18542f2564e399bf13d4d1fbcc48a5ec9f.

use posix_memalign instead of aligned_alloc to be more compatible

At some systems, function aligned_alloc is not supported.
From Linux Programmer's Manual:
The function aligned_alloc() was added to glibc in version 2.16.
The function posix_memalign() is available since glibc 2.1.91.

V2: add check for return value of posix_memalign
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

BDW: Change the default tiling mode to TILING_Y on BDW.

TILING_Y's performance is better than TILING_X'S on BDW, but almost same
on IVB/HSW. Using the TILING_Y as default tiling mode temporary, still need
to find out the root cause why different behavior between BDW and IVB/HSW.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

add test for clCreateImageFromLibvaIntel

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>

fix issue to create cl image from libva with non-zero offset

Beignet accepts buffer object name to share data between libva,
it is supposed to support to create cl image from the bo name
with a non-zero offset, but it does not work at some platforms.

The driver calls intel_bo_gem_create_from_name to retrieve the
dri_bo, and the offset of dri_bo is changed by the non-zero offset.
At some platforms, the change of the offset has side effect when
the kernel is executed again and so intel_bo_gem_create_from_name
is called for the second time.

So, do not change the offset of dri_bo, but maintain the non-zero
offset in cl_image, and use the non-zero offset until we fill the
surface state.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>

fix a bug in clCompileProgram().

passing a binary program to clCompileProgram() should return
CL_INVALID_OPERATION.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

fix piglit clCreateProgramWithBinary fail.

the program should be deserialized and loaded when created from a
EXECUTABLE binary.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Do topological sorting of basicblocks.

Toplogical sorting have two big advantages:
1. Sorted basicblocks will reduce unneccesary register pressure.
2. Sorted basicblocks will make liveness analysis easier.

This patch fix opencv failures:
./opencv_test_video --gtest_filter=OCL_OCL_Video/Mog2_Update.Accuracy/1
./opencv_test_imgproc --gtest_filter=OCL_ImageProc/Filter2D.Mat/482

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Fix a bitcast from float vector to wide interger issue in legalize pass.

When bitcast from <4 x float> to i128, we should not use extractelement directly.
Instead, we cast <4 x float> to <4 x i32>, then use extractelement to get
individual element.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

add test for cl buffer created with CL_MEM_USE_HOST_PTR

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Zhenyu Wang <zhenyuw@linux.intel.com>

enable CL_DEVICE_HOST_UNIFIED_MEMORY when userptr is supported

userptr is firstly checked at compile time with libdrm version, but
it does not ensure the system has such capability (for exmaple, with
old linux kernel), so also take a check at run time for the device info.

V2: add runtime check to see if userptr is really supported
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Zhenyu Wang <zhenyuw@linux.intel.com>

support CL_MEM_USE_HOST_PTR with userptr for cl buffer

userptr is used to wrap a memory pointer (page aligned) supplied
by user space into a buffer object accessed by GPU, and so no extra
copy is needed. It is supported starting from linux kernel 3.16
and libdrm 2.4.58.

This patch is originally finished by Zhenyu Wang <zhenyuw@linux.intel.com>,
I did a little change and some code clean.

No regression issue found on IVB+Ubuntu14.10 with libdrm upgraded with tests:
beignet/utests, piglit, OpenCV/test&perf, conformance/basic&mem_host_flags&buffers

V2: add page align limit for data size, add comments for kernel without MMU_NOTIFIER
V3: add runtime check with host_unified_memory, return CL_MEM_OBJECT_ALLOCATION_FAILURE if failed
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Zhenyu Wang <zhenyuw@linux.intel.com>

BDW: Set the URB/REST size to 384K/384K when SLM disable.

If application switch between SLM enable and disable, will cause random fail.
The fail occure only when URB/REST partition changed when enable and disable SLM.
Set the same REST size when disable SLM to workaround.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Tested-by: Meng Mengmeng <mengmeng.meng@intel.com>

utests: add a test to trigger cl_float3 bug in clSetKernelArg.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Fix kernel argument size for vector3

per OpenCL Spec 1.2, see 6.1.5 Alignment of Types.
For 3-component vector data types, the size of the data type
is 4 * sizeof(component).

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: remove useless debug info.

This debug information is not useful now. Remove them to
avoid confusing.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

utests: fix bugs in builtin_tgamma().

This patch is based on Rebecca's patch at:
https://bugs.debian.org/cgi-bin/bugreport.cgi?msg=5;filename=Fix-pow-erf-tgamma.patch;att=3;bug=768090.

And fixed another bug which we should not use an absolute error checking.
We should use ULP and considering the strict conformance or non strict
conformance state.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Signed-off-by: Rebecca Palmer <rebecca_palmer@zoho.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

utests: fix bugs in builtin_pow().

This patch is based on Rebecca's patch at:
https://bugs.debian.org/cgi-bin/bugreport.cgi?msg=5;filename=Fix-pow-erf-tgamma.patch;att=3;bug=768090.

And fixed another bug which we should not use an absolute error checking.
We should use ULP and considering the strict conformance or non strict
conformance state.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Signed-off-by: Rebecca Palmer <rebecca_palmer@zoho.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: fix bug in tgamma().

tgamma is actually lgamma, a related but very different function.

This patch is from:
https://bugs.debian.org/cgi-bin/bugreport.cgi?msg=5;filename=Fix-pow-erf-tgamma.patch;att=3;bug=768090

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Signed-off-by: Rebecca Palmer <rebecca_palmer@zoho.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: fix bug in erf()/erfc().

erf/erfc diverge (instead of converging to 1 or 0) for arguments above
about 2.

This patch is from:
https://bugs.debian.org/cgi-bin/bugreport.cgi?msg=5;filename=Fix-pow-erf-tgamma.patch;att=3;bug=768090

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Signed-off-by: Rebecca Palmer <rebecca_palmer@zoho.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: fix bug in pow()/pown().

pow/pown ignore the sign of their first argument (e.g. pow(-2,3) gives
8 instead of -8)

This patch is from:
https://bugs.debian.org/cgi-bin/bugreport.cgi?msg=5;filename=Fix-pow-erf-tgamma.patch;att=3;bug=768090

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Signed-off-by: Rebecca Palmer <rebecca_palmer@zoho.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: Support more instructions for constant expression handling.

Add support for the following OPs:
FCmp/ICmp/FPToSI/FPToUI/SIToFP/UIToFP.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: fix a bool handling bug when SEL on a uniform bool variable.

If a SEL uses a bool variable which is a uniform bool, even
we can get a dag node within the same BB, we still need to
set the externFlag bit. The reason is that we don't know how
to generate a scalar physical flag.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

utest: add new test that trigger an assignment operation bug in if.

This test case shows that assignment operation in if block seems
does not affect lvalue.

Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

add building dependency GIT_SHA1

Signed-off-by: Meng Mengmeng <mengmeng.meng@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

fix bswap kernel function type issue.

use MACRO to define the corresponding function.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: fix disassembly bug.

Those math attributes are not valid on Gen7/7.5.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

utests: replace the nodistriutable picture.

According to https://bugs.debian.org/758442, we should not use
Len(n)a standard test image in our package. I just select a picture
took by myself. Thanks Rebecca for pointing this out.

v2:
forgot to add sample.bmp.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

utest: change the box_blur_image to be identical to box_blur.

Change box_blur_image to read integer type surface thus
it could be totally identical to the box_blur thus they
can share the same reference image.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

add utest function bswap.

this llvm instrincs bswap function is generated by calling
__builtin_bswap.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

enable llvm intrinsic call bswap function.

this intrinsic call is implemented at the GEN IR level currently,
should be optimazed later.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Fix live range for temporary register in replaceReg

previously it is simply assigned as [insnID, insnID], But it is used in 2 instruction:
[1] MOV tmp, replacedReg
[2] send null, addr, tmp,...

As minID maxID is equal, it will be treated as temporary register during spill,
and no scratch memory allocated. But scratch register is allocated per instruction,
if tmp in [1] is assigned g100 from spill register pool, instruction [2] would have no idea of that,
it will assign another register like g102 from spill register pool. As no scratch memory allocated,
we cannnot do any spill/unspill operation between them.

To fix this issue, I change the liveness range according to isSrc, so spill/unspill
could work as normally.

This patch fix the cos16 test case failure in piglit under strict conformance.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: workaround register allocation fail caused by custom loop unroll.

As this issue only occurs under strict math, we disable custom
loop unroll if strict math is enabled.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

add utest for llvm intrinsic call usub_with_overflow funtion.

as llvm couldn't recognize the pattern of usub overflow, this usub
with is generated by calling the intrinsic function __builtin_usub_overflow;
also this type of uadd intrinsic funtion couldn't support short/byte type
overflow, we choose another way for the uadd kernel to generate short/byte
overflow.
will send patch to llvm later to fix the 2 issues.

v2: split the patch.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

enable llvm intrinsic call usub_with_overflow funtion.

implement usub_with_overflow without any optimization.

v2:
fix the wrong implementation:
change ctx.LT(dst0Type, overflow, dst0, src1)
to ctx.GT(unsignedType, overflow, dst0, src0)

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

use env to set environment variables for GBE_BIN_GENERATER

cmake interprets OCL_PCM_PATH=... as a command and will enclose it in
quotes in case it contains characters requiring protection, e.g. ~
a quoted "FOO=bar" is interpreted by /bin/sh as a command (that does not
exist), not a variable setting for a following command

use env to set the variables unambiguously

Signed-off-by: Andreas Beckmann <anbe@debian.org>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

fix some typos

Signed-off-by: Andreas Beckmann <anbe@debian.org>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Remove intel_gpgpu_check_binded_buf_address()

On recent kernel with full PPGTT support, we can possibly bind buffer
offset with 0, but intel_gpgpu_check_binded_buf_address() always thinks
it's invalid, which is not true. So simply remove the check.

v2: Add comment on why this was added and leads to failure. Use FIXME tag
on issue that needs to be fixed otherwise.

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

IVB/HSW/BYT: Revert the Dynamic state Base Addr and relative buffers address setting.

I have changed the setting of the curbe/sampler state/idrt buffer address, whith set
the Dynamic State Base Addr in intel_gpgpu_set_base_address and then set the relative
offset of these address when set these buffer address.
But it may cause some some image tests random fail in IVB/HSW/BYT.
Revert to the previous manner to set these address which always set absolute address.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

BDW: Fix bwd 32*32 scalar multiplication bug.

When scalar multiplication, must disable predicate and don't need specail handle.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Tested-by: Zhu, BingbingX <bingbingx.zhu@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix a size assert when setup bti.

Global constant buffer size is not align to 4 byte, will cause assert in BDW when set bti.
Per spec, the low two bits of surface state's width must be 11 if SURFACE_BUFFER's format is RAW.
Align the global constant buffer size to 4.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Refine the intel gpgpu delete.

The intel gpgpu struct is destroyed when a new gpgpu struct needed. But in that time,
the command batch relative with the destroyed gpgpu may have not finish, and the resource
in gpgpu still used by gpgpu, can't be destroyed.
So, when delete a gpgpu, check the batch status, if have not complete, insert to list in intel driver,
and delete all finished gpgpu in that list.

V2: V1 assume all the gpgpu list would complete in order, but it is not true. The insert order in is the
gpgpu delete order, not the flush command batch order. So visit and delete all complete gpgpu in the list.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix the compare_image_2d_and_1d_array test case bug

The test case use OCL_MAP_BUFFER_GTT to map the image
buffers and then do the result comparison, which may
cause problems.
On IVB and HSW, the slice pitch is equal but on BDW,
because we change the slice pitch of image array, it
cause this bug.
Modify it by using the standard clEnqueueReadImage API.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

add utest compiler_overflow for llvm intrinsic function.

this case only runs for uadd_with_over_flow function so far.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

add llvm Intrinsic call support.

so far this patch only support uadd_with_overflow.
sadd_with_overflow, usub_with_overflow, smul_with_overflow, umul_with_overflow, bswap are not supported yet.
this funtion should be implemented by carrier flag later.

v2: update comments.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: fix a wrong type of cl_device_info.

Per OpenCL spec 1.2:
CL_DEVICE_IMAGE_MAX_BUFFER_SIZE should be size_t type rather
than cl_ulong.

This bug will cause problems on i386 platform.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>

GBE: set default address space to -1 to avoid incorrect unroll hint.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: Meng Mengmeng <mengmeng.meng@intel.com>

Fix AUX buffer for page alignment

Apply ALIGN() for aux buffer size from beginning has no effect.
Move to the end of all state offsets set for page alignment.

v2: Update comments

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: "Guo, Yejun" <yejun.guo@intel.com>

Use pread/pwrite for buffer enqueue read/write

Instead of mmap, use pread/write interface for bo read/write with
optimized operations.

Result on one mem bandwidth benchmark for buffer enqueue read/write on HSW.

**** Host to device copy - workgroup_size=512

                     old code with mmap       new code with pread/pwrite

WG=512  SZ=  1 KiB     2325 MB/s               3336 MB/s
WG=512  SZ=  2 KiB     4479 MB/s        6270 MB/s
WG=512  SZ=  4 KiB     8215 MB/s       11808 MB/s
WG=512  SZ=  8 KiB    14271 MB/s       19493 MB/s
WG=512  SZ= 16 KiB    16456 MB/s       20079 MB/s
WG=512  SZ= 32 KiB    21136 MB/s       22334 MB/s
WG=512  SZ= 64 KiB    24785 MB/s       24792 MB/s
WG=512  SZ=128 KiB    24590 MB/s       24908 MB/s
WG=512  SZ=256 KiB    17928 MB/s       21435 MB/s
WG=512  SZ=512 KiB    18346 MB/s       20583 MB/s
WG=512  SZ=  1 MiB    18558 MB/s       20808 MB/s
WG=512  SZ=  2 MiB    18582 MB/s       20939 MB/s
WG=512  SZ=  4 MiB    15382 MB/s       18230 MB/s
WG=512  SZ=  8 MiB     7737 MB/s       11558 MB/s
WG=512  SZ= 16 MiB     7073 MB/s        8962 MB/s
WG=512  SZ= 32 MiB     6984 MB/s        8302 MB/s
WG=512  SZ= 64 MiB     6938 MB/s        8308 MB/s
WG=512  SZ=128 MiB     6950 MB/s        8271 MB/s
WG=512  SZ=256 MiB     6941 MB/s        8034 MB/s

**** Device to host copy - workgroup_size=512

                     old code with mmap       new code with pread/pwrite

WG=512  SZ=  1 KiB     2165 MB/s               2596 MB/s
WG=512  SZ=  2 KiB     4453 MB/s        6154 MB/s
WG=512  SZ=  4 KiB     8211 MB/s       11198 MB/s
WG=512  SZ=  8 KiB    14001 MB/s       11319 MB/s
WG=512  SZ= 16 KiB    16218 MB/s       14394 MB/s
WG=512  SZ= 32 KiB    21310 MB/s       22438 MB/s
WG=512  SZ= 64 KiB    24738 MB/s       25237 MB/s
WG=512  SZ=128 KiB    22784 MB/s       25113 MB/s
WG=512  SZ=256 KiB    18018 MB/s       22265 MB/s
WG=512  SZ=512 KiB    18429 MB/s       21687 MB/s
WG=512  SZ=  1 MiB    18654 MB/s       21856 MB/s
WG=512  SZ=  2 MiB    18655 MB/s       21998 MB/s
WG=512  SZ=  4 MiB    15098 MB/s       17563 MB/s
WG=512  SZ=  8 MiB     7737 MB/s       11170 MB/s
WG=512  SZ= 16 MiB     6989 MB/s        8812 MB/s
WG=512  SZ= 32 MiB     6891 MB/s        7597 MB/s
WG=512  SZ= 64 MiB     6868 MB/s        7798 MB/s
WG=512  SZ=128 MiB     6838 MB/s        7630 MB/s
WG=512  SZ=256 MiB     6842 MB/s        7466 MB/s

The benchmark can be got from http://cgit.freedesktop.org/~zhen/MPBenchmarks/.

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Clear batch buffer pointer after unmap

Fix libdrm warning about unmap the unmapped buffer, as when
batch terminate we will try to do unmap again, but batch has
already been unmapped.

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Make use of write enable flag for mem bo map

Use drm/intel optimization for mem bo mapping in case of read or write.
So we could be possibly waiting less.

This also adds 'map_flags' check in clEnqueueMapBuffer/clEnqueueMapImage
for actual read or write mapping.

But currently leave clMapBufferIntel untouched which might break ABI/API.

v2: Fix write_map flag in clEnqueueMapBuffer/clEnqueueMapImage.

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: "Guo, Yejun" <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: handle dead loop BBs in liveness analysis.

Considering the following CFG, our previous liveness analysis will only
back traverse from the exit point BB, thus the BB 6 and 7 will not be handled.

    4---
    |   |
    |   |
    5 --|---> 10 ---> ret
    |   |
    |   |
    6<--
    |
    |<--
    7   |
    |   |
     ---

Although the CFG looks not a normal application as once it goes to
block 6, it will enter a dead loop and will never return, we still need
to compile it successfully.

This patch is to fix the bug at:
https://bugs.freedesktop.org/show_bug.cgi?id=85362

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

create GIT_SHA1 without any dependency

Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Add the disasm support for Gen8

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fix regression caused by simple block optimization.

Almost all 64bit related instructions and unaligned load
instruction should be complex instruction. We need to exclude
them from simple block.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Xionghu Luo <xionghu.luo@intel.com>

add beignet GIT_HAL1 if there is .git directory

Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: add basic PHINode support in legalize pass.

Lack of the incomplete PHINode support currently.
After this patch, we could fix all the crash cases in piglit.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>

GBE: increase maximum src/dst operands to 32.

As we may bitcast a <16 * i64> to/from <32 * i32> due to
the legalize pass, we have to increase the maximum operands
number to 32 and fix some assertions accordingly.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>

GBE: add Selection instruction handler at legalize pass.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>

Add the test case for image 2d array fill

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Add the test case for image 1d array fill

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix the bug of 1D array slice pitch

For BDW, the vertical align is 4 at least.
This cause the slice pitch twice as big as
the Gen7 for 1D buffer array.
Because the buffer tiling alignment may change
for different GENs, we move it from run time to
intel driver.

V2:
  Fix all the bugs about 1d and 2d image array.
  And delete the tile align size which is useless.
  Also integrate two image array test cases into
  this patch set.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

use global flag 0.0 to control unstructured simple block.

filter the simple block out and replace the if/endif with global flag
to control.

v2: fix the luxmark sala performance degression due to extern flag in a
BRA instruction.

v3: fix compiler_switch regression, LOAD/STORE instruction could
call replaceSrc/replaceDst to generate 2 extra MOV instruction; exclude
the scalar instructions since they don't have prediction.

this patch is somewhat dangerous to change the instruction structure of block,
will add sanity check after emitInstructionStream to assert if illegally
modified.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@intel.com>

GBE: disable custom loop unroll for LLVM 3.3/3.4.

To fix a build error with LLVM 3.3/3.4.

v2:
should include llvm-config before check llvm version.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: Add a customized loop unrolling handling mechanism.

By default, the unrolling threshold is relatively small.
Thus some relative large loops which access private array
will not be unrolled, thus those private array can't
be scalarized latter. And the private array is allocated
in stack which is extreme slow for Gen backend currently.

To increase the unrolling threshold for all loops is not
a good idea, as most of the loops don't need to do unrolling
for this purpose and a large unrolling threshold will cause
a big code size and unecessary big register pressure which
may lead to register spilling.

So this patch introduce a trade-off pass to identify those
loops which still have private load/store in the outer most
of the loop. Then add a metadata to it to indicate aggresive
unrolling on those loops. Then do another round loop unrolling.

This patch with the previous small patch, can bring significant
performance improvement for some cases. I just tested with some
opencv test cases, and observed it can bring 2x to 10x improvement.

v2:
refine the parent loop unroll analysis method.

v3:
disable this pass for LLVM 3.3/3.4.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fix a legalize pass bug when bitcast wide integer to incompaitble vector.

Our wide integer legalize pass will assume the source type of a wide
integer must be the same as the final use type. But this is not always
true. The following case is a real example:

      %conv.i.i.14 = sext i8 %usrc0.i.sroa.0.14.extract.trunc to i32
      %call.i.i2.14 = tail call i32 @__gen_ocl_abs(i32 %conv.i.i.14) #5
      %conv1.i.i.14.mask = and i32 %call.i.i2.14, 255
      %uret.i.sroa.0.14.insert.ext = zext i32 %conv1.i.i.14.mask to i128
      %uret.i.sroa.0.14.insert.shift = shl nuw nsw i128 %uret.i.sroa.0.14.insert.ext, 112
      ......
      %uret.i.sroa.0.15.insert.mask = or i128 %uret.i.sroa.0.14.insert.mask.masked, %uret.i.sroa.0.14.insert.shift
      %uret.i.sroa.0.15.insert.insert = or i128 %uret.i.sroa.0.15.insert.mask, %uret.i.sroa.0.15.insert.shift
      %2 = bitcast i128 %uret.i.sroa.0.15.insert.insert to <16 x i8>

The wide integer i128 %uret.i.sroa.0.16.insert.insert is from an
i32 integer %conv1.i.i.14.mask. But the use of it is i8 vector
which breaks our assumption.

According to ruiling's good suggestion, we always bitcast the wide integer to
a compatible vector, take the above example:
  %3 = bitcast i128 %uret.i.sroa.0.15.insert.insert to <4 x i32>
Then insert a bit cast instruction to convert it to the original destination
  %2 = bitcast <4 x i32> %3 to <16 x i8>

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

Refine the the error handling in function cl_command_queue_ND_range_gen7.

Return error code when work group error, and remove printf and exit(-1) when internal error..

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix a HSW regression.

slmoffset use as DWORD register, but declare WORD, fix it.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: do intrinsics lowering pass earlier.

After the intrinsics lowering pass, it may introduce some new loops
which could be optimized nicely with the previous loop unrolling and
SROA PASS which especially may elminiate private data access.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

Fix a upsample regression.

In GenEncoder, unpack is not good, so move the upsample int/short from GenEncoder to instruction selection.
Still handle upsample long in GenContext.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

BDW: Also need set Shader Channel Select for constant buffer in BDW.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

BDW: Fix load/store half error.

BDW support HF data type, so use mov directly to convert between Half Float/Float.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

BDW: Change BDW's max work group size to 512.

Opencv only query and use device max work group size, when SLM/Barrier enable, BDW
can't fill 1024 work group in one subslice, even in SIMD16. Change device's max work
group size temp.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix a HSW constant buffer regression.

HSW's constant buffer read use ld message, also need set Shader Channel Select in surface state.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>

GBE: fix a bug in legalize pass.

The Shr/Shl llvm instructions require to have identical type for
the two operands, so we can't set the shift count to i32 always.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: Zhu, BingbingX <bingbingx.zhu@intel.com>

Fit the printf bug in loop

The static analyse for printf can not totally work
when the printf inst is within the loop and the loop
can not be unrolled. This causes the printf just to
print one info for a loop and to lose all the others.
We now increment the exec number every time the printf
inst is triggered. The number is stored for output all
the message later.
The problem is that we can not caculate the exact loops
number for each printf inst. The wrong loop number will
cause the data overwritten. We now assume all the printf
inst are in loop and store the data like this:
| PRINTF1_DATA PRINTF2_DATA ... | PRINTF1_DATA PRINTF2_DATA ... | ...
| DATA_LOOP_ONE | DATA_LOOP_TWO | ...
Although this may cause some space wasted.

Another problem is that we need to decide the size of printf buffer
because the loop upbound can not be caculated. We just set
it yo 1M for small info slot request and 4M for big one.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix HSW thread_n <= 64 assert.

In function cl_get_kernel_max_wg_sz, hsw's thread count may large than 64,
add a max limit.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

add utest popcount for all types.

v2: add all types to test.
v3: fix signed type count bits error.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

add opencl-1.2 builtin function popcount.

the popcount function returns the number of non-zero bits in input.
use GEN instruction cbit(Count Bits Set) to implement it.

Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Let the failure of intel_drm lib's check as a FATAL_ERROR

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>