contrib/beignet.git
10 years agoAdd test cases for 1d image fill and copy
Junyan He [Fri, 13 Jun 2014 07:07:44 +0000 (15:07 +0800)]
Add test cases for 1d image fill and copy

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd the support for 1D image in backend
Junyan He [Fri, 13 Jun 2014 07:07:31 +0000 (15:07 +0800)]
Add the support for 1D image in backend

1. Delete the is3D member of the instruction class, because we need more
than 1 bit to represent 1D, 2D and 3D. We now add an invalid register
to the ir profile and compare the coords against it to determine the dimension.
2. Rename all the xxx_image to xxx_image2D to make their meaning clear.
3. Update the corresponding Sampler and Typed_Write instructions in selection
and Gen IR generation.
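
As a rough illustration of item 1, the dimension can be derived by checking which coordinate registers are set to an invalid marker. This is only a sketch: reg_t, REG_INVALID and image_dim are hypothetical names for illustration, not the Beignet IR types.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an IR register index. */
typedef uint16_t reg_t;

/* Hypothetical sentinel playing the role of ir::ocl::invalid. */
#define REG_INVALID ((reg_t)0xFFFF)

/* Return 1, 2 or 3 depending on how many coordinates are valid.
 * The first coordinate must always be valid. */
static int image_dim(reg_t u, reg_t v, reg_t w)
{
    assert(u != REG_INVALID);
    if (v == REG_INVALID)
        return 1;
    if (w == REG_INVALID)
        return 2;
    return 3;
}
```

A single boolean such as is3D cannot express three cases, which is why a sentinel comparison per coordinate is used in this sketch.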

v2:
fix the use of InvalidRegister. Use ir::ocl::invalid only.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoAdd checks for clCreateImage and add 1d image creating logic
Junyan He [Fri, 13 Jun 2014 07:07:10 +0000 (15:07 +0800)]
Add checks for clCreateImage and add 1d image creating logic

Add more checks for image creation according to the spec.
Update the corresponding image utest cases so that they pass.
1D image creation is also added.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoadd[opencl-1.2] test case for API clCreateProgramWithBuiltInKernels.
Luo [Fri, 13 Jun 2014 00:58:17 +0000 (08:58 +0800)]
add[opencl-1.2] test case for API clCreateProgramWithBuiltInKernels.

Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoadd [opencl-1.2] API clCreateProgramWithBuiltInKernels.
Luo [Fri, 13 Jun 2014 00:58:16 +0000 (08:58 +0800)]
add [opencl-1.2] API clCreateProgramWithBuiltInKernels.

This API creates a built-in program object for a context, and loads the
built-in kernels into this program object.

v2:
fix the image base index handling issue.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoadd [opencl 1.2] API clEnqueueMarkerWithWaitList.
Luo [Fri, 13 Jun 2014 00:58:15 +0000 (08:58 +0800)]
add [opencl 1.2] API clEnqueueMarkerWithWaitList.

Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoAdd the test case for clEnqueueFillBuffer
Junyan He [Fri, 13 Jun 2014 05:30:49 +0000 (13:30 +0800)]
Add the test case for clEnqueueFillBuffer

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoImplement the clEnqueueFillBuffer API.
Junyan He [Fri, 13 Jun 2014 05:30:42 +0000 (13:30 +0800)]
Implement the clEnqueueFillBuffer API.

We use floatn assignment to do the copy.
The 128-byte pattern size corresponds to double16, but because of
the double problem on our platform, we use float16
to handle it.
Unaligned cases are not optimized yet; they just use char
assignment.
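
For reference, the observable effect of a pattern fill can be sketched on the host as below. This is not the kernel added by this series (which uses vector assignment on the GPU); fill_pattern is a hypothetical helper that only models the expected result, assuming the fill size is a multiple of the pattern size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-side reference for the semantics of a pattern fill:
 * repeat `pattern` (pattern_size bytes) over `size` bytes of `dst`.
 * `size` must be a multiple of `pattern_size`. */
static void fill_pattern(void *dst, size_t size,
                         const void *pattern, size_t pattern_size)
{
    unsigned char *out = dst;
    size_t off;

    assert(pattern_size > 0 && size % pattern_size == 0);
    for (off = 0; off < size; off += pattern_size)
        memcpy(out + off, pattern, pattern_size);
}
```

A reference like this is mainly useful in a utest, to compare against the buffer contents read back from the device.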

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd the kernels used by clEnqueueBufferFill API
Junyan He [Fri, 13 Jun 2014 05:30:30 +0000 (13:30 +0800)]
Add the kernels used by clEnqueueBufferFill API

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: switch to ocl-1.2 header files.
Zhigang Gong [Thu, 12 Jun 2014 06:31:00 +0000 (14:31 +0800)]
GBE: switch to ocl-1.2 header files.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agorelax the build dependency on Gen GPU
Guo Yejun [Mon, 26 May 2014 22:13:12 +0000 (06:13 +0800)]
relax the build dependency on Gen GPU

Currently, the Gen GPU pciid of the underlying system is queried
and then passed to gbe_bin_generater as the target option.

This does not work when building the driver on another system with
a non-Intel GPU. This patch relaxes the dependency by exporting the
pciid setting at the CMake level, so the pciid can be given
as a CMake option in addition to the current real-time query method.

This patch also removes the redundant code in utest/CMake by setting
PARENT_SCOPE in src/CMake.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoFix the same kernel name issue of OCL_OUTPUT_KERNEL_PERF
Yongjia Zhang [Mon, 23 Jun 2014 15:09:33 +0000 (23:09 +0800)]
Fix the same kernel name issue of OCL_OUTPUT_KERNEL_PERF

Kernels with the same name but different build options are now
treated separately. When OCL_OUTPUT_KERNEL_PERF==1, it outputs the
time summary as before; when OCL_OUTPUT_KERNEL_PERF==2, it
outputs the time details, including the kernel build options, and
kernels with the same name but different build options are
reported separately.

v2: use strncmp and strncpy instead of strcmp and strcpy.
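
A minimal sketch of the keying idea, assuming a small fixed-size table: an entry matches only when both the kernel name and the build options match. The struct, the limits and perf_record are hypothetical and only illustrate the use of strncmp/strncpy with bounded buffers.

```c
#include <assert.h>
#include <string.h>

#define MAX_NAME 64     /* assumed limits, for illustration only */
#define MAX_OPTS 256
#define MAX_ENTRIES 16

struct perf_entry {
    char name[MAX_NAME];
    char opts[MAX_OPTS];
    int calls;
};

static struct perf_entry table[MAX_ENTRIES];
static int table_len;

/* Record one launch; entries match only if name AND options match. */
static struct perf_entry *perf_record(const char *name, const char *opts)
{
    int i;

    for (i = 0; i < table_len; i++) {
        if (strncmp(table[i].name, name, MAX_NAME - 1) == 0 &&
            strncmp(table[i].opts, opts, MAX_OPTS - 1) == 0) {
            table[i].calls++;
            return &table[i];
        }
    }
    assert(table_len < MAX_ENTRIES);
    /* Bounded copies, always NUL-terminated. */
    strncpy(table[i].name, name, MAX_NAME - 1);
    table[i].name[MAX_NAME - 1] = '\0';
    strncpy(table[i].opts, opts, MAX_OPTS - 1);
    table[i].opts[MAX_OPTS - 1] = '\0';
    table[i].calls = 1;
    table_len++;
    return &table[i];
}
```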

Signed-off-by: Yongjia Zhang <yongjia.zhang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoutest: reduce group size to fit into baytrail platform.
Zhigang Gong [Thu, 12 Jun 2014 08:45:19 +0000 (16:45 +0800)]
utest: reduce group size to fit into baytrail platform.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoHSW: Remove the jmpi distance limit of HSW.
Yang Rong [Thu, 12 Jun 2014 15:22:15 +0000 (23:22 +0800)]
HSW: Remove the jmpi distance limit of HSW.

Because the unit of HSW's jmpi distance is the byte, the distance in the JMPI instruction should
be S31, so remove the S16 restriction.
This fixes the luxmark failure when OCL_STRICT_CONFORMANCE=1.
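
To see why a byte-based distance overflows a narrow field quickly, here is a small range check. It assumes that S16 and S31 denote a 16-bit and a 32-bit signed field respectively, and that a native instruction is 16 bytes; both are assumptions made for this sketch, and fits_signed is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* True if `value` fits in a two's-complement field of `bits` bits. */
static int fits_signed(int64_t value, int bits)
{
    int64_t lo, hi;

    assert(bits >= 2 && bits <= 63);
    hi = ((int64_t)1 << (bits - 1)) - 1;
    lo = -hi - 1;
    return value >= lo && value <= hi;
}
```

Under these assumptions, a jump over 4096 instructions is 65536 bytes, which no longer fits in 16 bits but fits easily in 32.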

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Tested-by: Li, Peng <peng.li@intel.com>
10 years agoGBE: fix some bugs in 64bit bitcast.
Ruiling Song [Thu, 12 Jun 2014 07:11:52 +0000 (15:11 +0800)]
GBE: fix some bugs in 64bit bitcast.

1. Set the correct vstride when doing an int64 bitcast.
2. The condition to offset to the next half should be (i%multiple) >= multiple/2.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoHSW: Fix potential issue of GT3 when calc stack address.
Yang Rong [Thu, 12 Jun 2014 11:42:12 +0000 (19:42 +0800)]
HSW: Fix potential issue of GT3 when calc stack address.

GT3 has 4 half slices, so we should shift left by 2 bits, and we should also enlarge the stack buffer size;
otherwise, if thread generation is unbalanced, accesses may go out of bounds.
Per bspec, the scratch size needs to be set to 2X the desired size.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoHandle the difference timestamp count, got from drm_intel_reg_read.
Yang Rong [Thu, 12 Jun 2014 11:04:27 +0000 (19:04 +0800)]
Handle the difference timestamp count, got from drm_intel_reg_read.

On HSW and IVB with an x86_64 system, the low 32 bits of the timestamp count are stored in the high 32 bits of the result
returned by drm_intel_reg_read, and bits 32-35 are lost; on an i386 system, the timestamp count matches bspec.
This seems to be a kernel readq bug. So shift by 32 bits on x86_64, and keep only 32 bits of data on i386.

V2: Baytrail does not have this issue, but bits 32-35 need to be cleared.
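
The three cases described above can be sketched as a single normalization step. The enum and normalize_timestamp are hypothetical names; how the real code selects the case (by architecture and device) is not shown here.

```c
#include <assert.h>
#include <stdint.h>

enum ts_quirk {
    TS_HIGH_HALF,   /* low 32 bits of the count arrive in bits 63:32 */
    TS_LOW_32_ONLY, /* keep only bits 31:0 */
    TS_CLEAR_32_35  /* value is in place, but bits 35:32 are cleared */
};

/* Reduce a raw register read to a comparable 32-bit count. */
static uint64_t normalize_timestamp(uint64_t raw, enum ts_quirk quirk)
{
    switch (quirk) {
    case TS_HIGH_HALF:
        return raw >> 32;
    case TS_LOW_32_ONLY:
        return raw & 0xFFFFFFFFull;
    case TS_CLEAR_32_35:
        return raw & ~(0xFull << 32);
    }
    assert(!"unknown quirk");
    return 0;
}
```
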
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoremove RTLD_DEEPBIND to avoid stdc++ issues
Guo Yejun [Wed, 11 Jun 2014 18:38:22 +0000 (02:38 +0800)]
remove RTLD_DEEPBIND to avoid stdc++ issues

There are weird issues with stdc++ when a .so file is opened with dlopen and the flag
RTLD_DEEPBIND. Remove the flag by renaming the function pointers.
The new names in the runtime begin with interp_*, meaning that they finally
go into libgbeinterp.so to interpret the metadata of the binary kernel.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>
10 years agofix utest simd_any for simd width 8 and 16
Guo Yejun [Tue, 10 Jun 2014 21:27:26 +0000 (05:27 +0800)]
fix utest simd_any for simd width 8 and 16

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: ignoring some debug related intrinsics.
Zhigang Gong [Fri, 6 Jun 2014 07:34:18 +0000 (15:34 +0800)]
GBE: ignoring some debug related intrinsics.

We don't need to assert when the kernel contains some
debug related intrinsics. Just ignore them.

This patch makes beignet work well with Debug
mode clBLAS.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: output compact flag when output asm.
Ruiling Song [Wed, 11 Jun 2014 03:14:52 +0000 (11:14 +0800)]
GBE: output compact flag when output asm.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agofix issue when create cl image from libva with offset
Guo Yejun [Mon, 9 Jun 2014 00:39:33 +0000 (08:39 +0800)]
fix issue when create cl image from libva with offset

To share data between libva and ocl (at the drm level), it is acceptable
to create a cl image from libva with an offset (into the drm object). Correct
the bo offset, whose value finally goes to ss1.base_addr.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd the utest case for printf
Junyan He [Tue, 10 Jun 2014 04:53:22 +0000 (12:53 +0800)]
Add the utest case for printf

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd the printf logic into the run time.
Junyan He [Tue, 10 Jun 2014 04:53:12 +0000 (12:53 +0800)]
Add the printf logic into the run time.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd the printfSet into the kernel Class and add misc helper functions
Junyan He [Tue, 10 Jun 2014 04:53:04 +0000 (12:53 +0800)]
Add the printfSet into the kernel Class and add misc helper functions

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd the PrintfParser llvm parser into the llvm backend.
Junyan He [Tue, 10 Jun 2014 04:52:54 +0000 (12:52 +0800)]
Add the PrintfParser llvm parser into the llvm backend.

The PrintfParser runs before the llvm gen backend.
It filters out all printf function calls. When
a printf call is found, we analyse the print format
and the % placeholders here, and replace the print call with
a STORE or CONV+STORE instruction if needed.
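
As a simplified picture of the format analysis step, the sketch below counts the % placeholders in a format string, treating "%%" as a literal percent sign. count_placeholders is a hypothetical helper; the actual parser also has to decode flags, widths and conversion types, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Count conversion specifiers in a printf-style format string.
 * "%%" is a literal percent sign and is not counted. */
static size_t count_placeholders(const char *fmt)
{
    size_t n = 0;

    assert(fmt != NULL);
    while (*fmt) {
        if (fmt[0] == '%') {
            if (fmt[1] == '%') {
                fmt += 2;       /* skip the escaped percent */
                continue;
            }
            if (fmt[1] != '\0')
                n++;            /* start of a real placeholder */
        }
        fmt++;
    }
    return n;
}
```

Each counted placeholder would correspond to one value that has to be stored for later output.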

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd the PrintfSet class into the ir
Junyan He [Tue, 10 Jun 2014 04:52:45 +0000 (12:52 +0800)]
Add the PrintfSet class into the ir

The PrintfSet is used to collect all the information in
the kernel. After the kernel has executed, it is used
to generate the corresponding printf output.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd two special register for printf output buffer usage
Junyan He [Tue, 10 Jun 2014 04:52:37 +0000 (12:52 +0800)]
Add two special register for printf output buffer usage

printfiptr for printf index buffer pointer in curbe
and printfbptr for printf output buffer pointer in curbe.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: support SLM bool load and store.
Zhigang Gong [Tue, 10 Jun 2014 02:45:56 +0000 (10:45 +0800)]
GBE: support SLM bool load and store.

The OCL spec does allow the use of an i1/BOOL SLM
variable, so we have to support loading and storing
it. To keep things simple, I chose to use S16 to represent
an i1 value.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
10 years agoGBE: increase batch size to relax the max reloc restriction.
Zhigang Gong [Mon, 9 Jun 2014 10:39:00 +0000 (18:39 +0800)]
GBE: increase batch size to relax the max reloc restriction.

The drm restricts the max reloc count to (batch size)/8.
The current batch buffer size is 8K, so the max reloc count is 1024.
As the max workgroup size is 1024, if it uses simd16
then thread_n will be 1024/16 = 64. If it needs to bind
32 buffers, the reloc count will be 64*32, which is 2048
and exceeds the current limit. Let's increase the batch size to
16K to relax this restriction to 2048 relocs.
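
The arithmetic above can be checked with two small helpers. max_relocs and relocs_needed are hypothetical names; the formulas are taken directly from the description (limit = batch size / 8, need = threads * buffers).

```c
#include <assert.h>

/* Upper bound on relocations for a batch buffer: (batch size)/8. */
static unsigned max_relocs(unsigned batch_bytes)
{
    return batch_bytes / 8;
}

/* Relocations needed: one per bound buffer per hardware thread,
 * where the thread count is group_size / simd_width. */
static unsigned relocs_needed(unsigned group_size, unsigned simd_width,
                              unsigned buffers)
{
    assert(simd_width > 0 && group_size % simd_width == 0);
    return (group_size / simd_width) * buffers;
}
```

With an 8K batch the limit is 1024, below the 2048 relocs needed for a 1024-item SIMD16 group with 32 buffers; a 16K batch raises the limit to exactly 2048.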

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoremove the code of saving the llvm bitcode to file, replace it with llvm::Module
Luo [Fri, 6 Jun 2014 06:17:31 +0000 (14:17 +0800)]
remove the code of saving the llvm bitcode to file, replace it with llvm::Module

Save the global LLVMContext and module pointer to GenProgram, delete the
module pointer in the destructor.

Signed-off-by: Luo <xionghu.luo@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoHandle server IVB GT2.
Abrahm Scully [Fri, 6 Jun 2014 18:16:25 +0000 (14:16 -0400)]
Handle server IVB GT2.

Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Junyan He <junyan.he@aim.com>
10 years agoGBE: Fix an assert on bitcast long to char8
Ruiling Song [Mon, 9 Jun 2014 08:14:29 +0000 (16:14 +0800)]
GBE: Fix an assert on bitcast long to char8

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd some lost pci id into GetGenID.sh
Junyan He [Mon, 9 Jun 2014 08:31:03 +0000 (16:31 +0800)]
Add some lost pci id into GetGenID.sh

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoHSW: Restore L3 control register to disable SLM mode.
Yang Rong [Mon, 9 Jun 2014 15:29:50 +0000 (23:29 +0800)]
HSW: Restore L3 control register to disable SLM mode.

It seems the L3 control register is per context on IVB, but not per context
on HSW, so we need to restore the L3 control register; otherwise, it may cause screen flicker.
This patch may hurt performance somewhat.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoHSW: enable the surface's cache in HSW.
Yang Rong [Mon, 9 Jun 2014 15:29:49 +0000 (23:29 +0800)]
HSW: enable the surface's cache in HSW.

HSW's surface cache control has changed; correct it. Also correct the scratch size calculation,
and disable the exec flag for slm. When the kernel cmd parser is finished, this needs to be removed entirely.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoHSW: Set correct max work group size for GT2 and GT3.
Yang Rong [Mon, 9 Jun 2014 15:29:48 +0000 (23:29 +0800)]
HSW: Set correct max work group size for GT2 and GT3.

v2: Return an error when can't fit work group to a single half slice.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoHSW: add data port 1 support in disassemble.
Yang Rong [Mon, 9 Jun 2014 15:29:47 +0000 (23:29 +0800)]
HSW: add data port 1 support in disassemble.

HSW adds a new data port; add support for it in the disassembler.

V2: separate the send msg function tables of HSW and IVB, so deviceID needs to be passed to gen_disasm.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: fix one illegal instruction.
Zhigang Gong [Fri, 6 Jun 2014 10:05:09 +0000 (18:05 +0800)]
GBE: fix one illegal instruction.

When the destination is a scalar and the execution width
is 1, we should use a scalar vec instead.

This patch fixes the following illegal instruction:
  (38      )  mov(1)          g124.3<1>:F     acc0<8,8,1>:F
to the correct one:
  (38      )  mov(1)          g124.3<1>:F     acc0<0,1,0>:F

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: Fix a jump issue in int64 to float conversion
Ruiling Song [Fri, 6 Jun 2014 06:57:18 +0000 (14:57 +0800)]
GBE: Fix a jump issue in int64 to float conversion

The inactive lanes should use 32, so that a later jump can jump
as desired.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: fix a bug in int64 to float conversion.
Zhigang Gong [Fri, 6 Jun 2014 06:26:48 +0000 (14:26 +0800)]
GBE: fix a bug in int64 to float conversion.

When copying those pure 32-bit ints to the float destination, we
should enable the mask. Otherwise, we may destroy the
values in inactive lanes.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agoGBE: fix a typo in utests.
Zhigang Gong [Mon, 9 Jun 2014 07:04:46 +0000 (15:04 +0800)]
GBE: fix a typo in utests.

sub_bufffer_check ==> sub_buffer_check.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoutests: add a double precision check test case.
Zhigang Gong [Fri, 6 Jun 2014 02:37:19 +0000 (10:37 +0800)]
utests: add a double precision check test case.

v2:
fix some bugs in test case.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agoGBE: Add support double to float conversion.
Zhigang Gong [Fri, 30 May 2014 10:19:04 +0000 (18:19 +0800)]
GBE: Add support double to float conversion.

Previously, double to float conversion incorrectly went down the
int64 to float code path, and gen_encoder did not really
have double to float conversion support.
This patch fixes the above issues.

v2:
fix some bugs on the HSW platform.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agoGBE: optimize a special case of convert INT64 to float.
Zhigang Gong [Thu, 5 Jun 2014 08:16:10 +0000 (16:16 +0800)]
GBE: optimize a special case of convert INT64 to float.

We found the following instruction sequence is common
in luxmark:
CVT.int64.uint32 %75 %74
LOADI.int64 %537 16777215
AND.int64 %76 %75 %537
CVT.float.uint64 %77 %76

Actually, the immediate value is a pure 32-bit value,
and %74 is also a uint32 value. The AND instruction
will not touch the high 32 bits either. So we can simply optimize
the above instruction series to the following:
AND.uint32 %tmp %74 16777215
MOV.float  %77 %tmp

This way, it saves about 55 instructions for each
occurrence of the above case. This patch brings about an 8% performance
gain with the sala scene in luxmark.
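
The equivalence behind this rewrite can be checked on the host. The sketch below models both sequences in plain C, assuming MOV.float acts as a converting move from uint32; cvt_slow, cvt_fast and check_equivalent are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Original sequence: widen to 64 bits, mask, then convert from uint64. */
static float cvt_slow(uint32_t x)
{
    uint64_t wide = (uint64_t)x;           /* CVT to int64      */
    uint64_t masked = wide & 16777215ull;  /* LOADI + AND.int64 */
    return (float)masked;                  /* CVT to float      */
}

/* Optimized sequence: mask in 32 bits, then convert once. */
static float cvt_fast(uint32_t x)
{
    uint32_t tmp = x & 16777215u;          /* AND.uint32 */
    return (float)tmp;                     /* MOV.float  */
}

/* Both paths must agree for any 32-bit input. */
static void check_equivalent(uint32_t x)
{
    assert(cvt_slow(x) == cvt_fast(x));
}
```

Since 16777215 is 2^24 - 1, every masked value is exactly representable as a float, so no rounding difference can appear between the two paths.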

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoadd DRM_LIBDIR path into link directory list
Li Peng [Wed, 4 Jun 2014 06:21:44 +0000 (14:21 +0800)]
add DRM_LIBDIR path into link directory list

Then beignet can link to the user's preferred drm library rather than the default one.

Signed-off-by: Li Peng <peng.li@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoHSW: Fix a compact assert.
Yang Rong [Thu, 29 May 2014 16:37:30 +0000 (00:37 +0800)]
HSW: Fix a compact assert.

Also use const static int instead of const int to avoid a build error
with some gcc versions.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Optmize phi elimination
Ruiling Song [Tue, 3 Jun 2014 05:53:15 +0000 (13:53 +0800)]
GBE: Optmize phi elimination

During phi elimination, we simply insert 3 MOVs for one phi instruction
to avoid the lost copy issue. But in fact, only two of them are needed
most of the time. This patch checks whether the move from phiCopy
to phi can be avoided.

The patch basically checks whether the phiCopy and phi have live range
interference. If not, they can be coalesced, and thus one instruction
can be eliminated.
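
A heavily simplified sketch of the interference test, modelling each live range as a single half-open interval. Real live ranges in the backend are not single intervals, and struct live_range, interferes and movs_for_phi are hypothetical names used only to illustrate the decision.

```c
#include <assert.h>

/* Simplified live range: half-open interval [start, end). */
struct live_range {
    int start;
    int end;
};

/* Two ranges interfere if they overlap at any point. */
static int interferes(struct live_range a, struct live_range b)
{
    assert(a.start <= a.end && b.start <= b.end);
    return a.start < b.end && b.start < a.end;
}

/* MOVs emitted for one phi: 3 by default, 2 when phi and phiCopy
 * do not interfere and can therefore be coalesced. */
static int movs_for_phi(struct live_range phi, struct live_range phi_copy)
{
    return interferes(phi, phi_copy) ? 3 : 2;
}
```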

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoRevert "GBE: No need to compute liveout again in value.cpp."
Ruiling Song [Tue, 3 Jun 2014 05:53:14 +0000 (13:53 +0800)]
Revert "GBE: No need to compute liveout again in value.cpp."

We need to transfer ValueDef from predecessors to their successors.
Consider a register defined in BB0 and used in BB3. We need to
iterate over liveout to pass the def in BB0 to BB3, so that the use
in BB3 gets the correct def. Otherwise, the UD/DU graph is incomplete.

This reverts commit 89b490b5a17cfda2d9816dc1c246ce5bbff12648.
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agorefine code for the usage of set_image_base_index
Guo Yejun [Mon, 2 Jun 2014 18:13:54 +0000 (02:13 +0800)]
refine code for the usage of set_image_base_index

In libgbe.so and libgbeinterp.so, the same function pointer name
gbe_set_image_base_index is used, to keep the source code unified.

In libcl.so, function pointer names beginning with compiler_* point to
functions from libgbe.so, and function pointer names beginning with
gbe_* point to functions from libgbeinterp.so.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Fix bitcast between long and other type.
Ruiling Song [Fri, 30 May 2014 08:22:30 +0000 (16:22 +0800)]
GBE: Fix bitcast between long and other type.

As we store the low/high 32 bits of a long separately, when we do a bitcast
like int64 --> int16, the horizontal stride of the int64's low/high
half should be set to 2 instead of 4.

This fixes a regression in the opencv test:
Imgproc/Threshold.Mat/40, where GetParam() = (16SC1, 0, 0, false)

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoMake utest pass rate reach 100%.
Yi Sun [Fri, 30 May 2014 03:22:33 +0000 (11:22 +0800)]
Make utest pass rate reach 100%.

1. Add more input values
2. Remove the pow(0,0) case
3. Remove the negative value tests in powr && pown

Signed-off-by: Yi Sun <yi.sun@intel.com>
Signed-off-by: Yangweix Shui <yangweix.shui@intel.com>
10 years agoRefine some test for math function
Yi Sun [Fri, 30 May 2014 03:22:32 +0000 (11:22 +0800)]
Refine some test for math function

1. nextafter: we originally used nextafter to compute the CPU reference result. Its return value is a double, so change it to nextafterf.
2. sinpi: add a check to reduce the input data range from [-2pi,2pi] to [-pi,pi]
3. cospi: define the cospi function.
4. tanpi: define the tanpi function using sinpi/cospi.

Signed-off-by: Yi Sun <yi.sun@intel.com>
Signed-off-by: YangweiX Shui <yangweix.shui@intel.com>
10 years agoRefine the cl thread implement for queue.
Junyan He [Fri, 30 May 2014 06:28:30 +0000 (14:28 +0800)]
Refine the cl thread implement for queue.

Because a cl_command_queue can be used in several threads simultaneously
without adding a ref to it, we now handle it like this:
Keep one threads_slot_array; every time a thread gets a gpgpu or batch buffer, if it
does not have a slot, assign one.
The resources are kept in the queue's private data, which is resized if needed.
When the thread exits, the slot is set invalid.
When the queue is released, all the resources are released. If the user still enqueues to, flushes
or finishes the queue after it has been released, the behavior is undefined.
TODO: Need to shrink the slot map.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoFix timestamp on HASWELL
Li Peng [Mon, 26 May 2014 11:25:59 +0000 (19:25 +0800)]
Fix timestamp on HASWELL

The GPU timestamp should be the lower 36 bits on HASWELL

Signed-off-by: Li Peng <peng.li@intel.com>
Reviewed-by: He Junyan <junyan.he@inbox.com>
10 years agoextract libgbeinterp.so from runtime (libcl.so)
Guo Yejun [Thu, 29 May 2014 22:29:09 +0000 (06:29 +0800)]
extract libgbeinterp.so from runtime (libcl.so)

Currently, there are identical symbol names in libinterp.a (inside
libcl.so) and libgbe.so (compiler), so we have to dlopen libgbe.so
with RTLD_DEEPBIND, and this flag makes std::cerr inside libgbe crash.

Extract the interp part from libcl.so as libgbeinterp.so; then
first dlopen libgbe.so without RTLD_DEEPBIND, and then dlopen libgbeinterp.so
with RTLD_DEEPBIND, to fix the std::cerr crash issue.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: fix one illegal instruction when replace a uniform dst.
Zhigang Gong [Thu, 29 May 2014 04:27:02 +0000 (12:27 +0800)]
GBE: fix one illegal instruction when replace a uniform dst.

When the dst is a uniform value, we replace it with a vector value; then
copying the vector value back may generate an illegal instruction, as below
at address 18:

    (14      )  mov(16)         g124<1>:F       g127.7<0,1,0>:F                 { align1 WE_all 1H };
    (16      )  send(16)        g122<1>:UW      g124<8,8,1>:UD
                data (bti: 1, rgba: 14, SIMD16, legacy, Untyped Surface Read) mlen 2 rlen 2 { align1 WE_all 1H };
    (18      )  mov(1)          g127.6<1>:F     g122<8,8,1>:F                   { align1 WE_all };

This patch fixes this issue and generates the correct instructions, as below:

    (      14)  mov(16)         g124<1>:UD      g127.7<0,1,0>:UD                { align1 WE_all 1H };
    (      16)  send(16)        g122<1>:UW      g124<8,8,1>:UD
                data (bti: 1, rgba: 14, SIMD16, legacy, Untyped Surface Read) mlen 2 rlen 2 { align1 WE_all 1H };
    (      18)  mov(1)          g127.6<1>:UD    g122<0,1,0>:UD                  { align1 WE_all };

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoutests: disable double test case.
Ruiling Song [Thu, 29 May 2014 02:29:35 +0000 (10:29 +0800)]
utests: disable double test case.

As we cannot provide full support for double now,
and my patch to refine long support breaks double load/store,
we disable all double test cases.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: Pass correct register type when replaceReg
Ruiling Song [Thu, 29 May 2014 02:29:34 +0000 (10:29 +0800)]
GBE: Pass correct register type when replaceReg

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: Change 64bit integer storage in register
Ruiling Song [Thu, 29 May 2014 02:29:33 +0000 (10:29 +0800)]
GBE: Change 64bit integer storage in register

Previously, we stored the low/high halves of a 64-bit value together, which needed several
32-bit instructions to do one 64-bit instruction. Now we simply change its
storage in the register: the low 32 bits of all lanes are stored together, followed by the
high 32 bits of all lanes. This makes long support cleaner and requires fewer
32-bit instructions.
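
The new layout can be pictured with a host-side model: for N lanes, dwords 0..N-1 hold the low halves and dwords N..2N-1 hold the high halves. pack_split and unpack_lane are hypothetical helpers, and a plain uint32_t array stands in for the register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store `lanes` 64-bit values so that all low halves come first,
 * followed by all high halves. `out` must hold 2 * lanes dwords. */
static void pack_split(const uint64_t *in, size_t lanes, uint32_t *out)
{
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; i < lanes; i++) {
        out[i]         = (uint32_t)(in[i] & 0xFFFFFFFFu);
        out[lanes + i] = (uint32_t)(in[i] >> 32);
    }
}

/* Rebuild the 64-bit value of lane `i` from the split layout. */
static uint64_t unpack_lane(const uint32_t *reg, size_t lanes, size_t i)
{
    assert(reg != NULL && i < lanes);
    return ((uint64_t)reg[lanes + i] << 32) | reg[i];
}
```

With this layout, an operation on all low halves (or all high halves) touches one contiguous run of dwords instead of every second dword.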

v2:
fix a typo when getRegAtrrib().
Refine SelectionVector alignment.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: optimize scalar data type conversion.
Zhigang Gong [Thu, 29 May 2014 01:26:39 +0000 (09:26 +0800)]
GBE: optimize scalar data type conversion.

If the dst is a scalar, the register region restriction is relaxed,
and we can save one instruction, as below:

    (12      )  mov.sat(1)      g127.24<4>:B    g1.3<0,1,0>:D              { align1 WE_all };
    (14      )  mov(1)          g127.28<1>:B    g127.24<0,1,4>:D           { align1 WE_all };

Optimized to:

    (12      )  mov.sat(1)      g128.28<4>:B    g1.3<0,1,0>:D              { align1 WE_all };

No need to create a temporary register g127.24.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: fix uniform/scalar related bugs.
Zhigang Gong [Thu, 29 May 2014 01:26:24 +0000 (09:26 +0800)]
GBE: fix uniform/scalar related bugs.

One major fix is that even if a register is a scalar, when
we move a scalar Dword to a scalar Byte, we have to set
the hstride to 4; otherwise, it breaks the following
register restriction:
  B. When the Execution Data Type is wider than the destination data type,
     the destination must be aligned as required by the wider execution data
     type and specify a HorzStride equal to the ratio in sizes of the two data
     types. For example, a mov with a D source and B destination must use a
     4-byte aligned destination and a Dst.HorzStride of 4.

The following instruction may not take effect:
mov.sat(1)  g127.4<1>:B  g126<0,1,0>:D
We have to change it to
mov.sat(1)  g127.4<4>:B  g126<0,1,0>:D

v2: keep the instruction selection stage unchanged; we fix this restriction
    in setDst only.
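
The quoted restriction boils down to a ratio of type sizes. A minimal sketch, with dst_hstride as a hypothetical helper (the real fix lives in setDst and works on encoder register descriptions, not on byte counts):

```c
#include <assert.h>

/* Destination horizontal stride (in elements) required when the
 * execution type is wider than the destination type: the ratio of
 * their sizes. Returns 1 when no widening is involved. */
static unsigned dst_hstride(unsigned exec_bytes, unsigned dst_bytes)
{
    assert(exec_bytes > 0 && dst_bytes > 0);
    if (exec_bytes <= dst_bytes)
        return 1;
    assert(exec_bytes % dst_bytes == 0);
    return exec_bytes / dst_bytes;
}
```

For a mov with a D (4-byte) source and a B (1-byte) destination this gives 4, matching the <4> stride in the corrected instruction.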

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: fix a regression for piglit test.
Zhigang Gong [Wed, 28 May 2014 09:02:16 +0000 (17:02 +0800)]
GBE: fix a regression for piglit test.

Accessing this->store[insnID+2] is not always safe, as it may
not exist.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoReturn CL_IMAGE_FORMAT_NOT_SUPPORTED if image_format is not supported.
Yang Rong [Wed, 28 May 2014 09:02:26 +0000 (17:02 +0800)]
Return CL_IMAGE_FORMAT_NOT_SUPPORTED if image_format is not supported.

Also move the cl_image_byte_per_pixel call before cl_image_get_supported_fmt
to return the correct error code when the format is invalid.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoSilence compilation warnings when release build.
Yang Rong [Wed, 28 May 2014 07:37:49 +0000 (15:37 +0800)]
Silence compilation warnings when release build.

Also silence warnings on 32-bit systems.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoSilence some compilation warnings.
Zhigang Gong [Wed, 28 May 2014 02:29:22 +0000 (10:29 +0800)]
Silence some compilation warnings.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agoGBE: Consolidate all read/write instruction's bti handling.
Zhigang Gong [Wed, 28 May 2014 02:27:36 +0000 (10:27 +0800)]
GBE: Consolidate all read/write instruction's bti handling.

Previously, the bti handling for each read/write instruction was
slightly different from the others. There are two major bugs.
First, OP_ATOMIC stores the bti in a different position, so the
post scheduling for the ATOMIC instruction is buggy.
Second, the DWORD_GATHER instruction is not in
the isRead list, which may cause potential bugs.

This patch fixes both of them.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agoseparate runtime(libcl.so) and compiler(libgbe.so)
Guo Yejun [Mon, 26 May 2014 23:10:04 +0000 (07:10 +0800)]
separate runtime(libcl.so) and compiler(libgbe.so)

On embedded/handheld devices, storage and memory are scarce, so it is
necessary to provide only a small OpenCL runtime library;
only executable binary kernels are supported on such devices.

At the beginning of the process (before function main), the OpenCL runtime
(libcl.so) tries to load the compiler (libgbe.so). If it loads successfully,
the system behaves the same as before; otherwise,
the runtime assumes there is no OpenCL compiler in the system, the device
info changes to CL_DEVICE_COMPILER_AVAILABLE=false and
CL_DEVICE_PROFILE="EMBEDDED_PROFILE", and clBuildProgram returns
CL_COMPILER_NOT_AVAILABLE if the program was created with
clCreateProgramWithSource, following the OpenCL spec.

To simulate the case without an OpenCL compiler, just delete the file
libgbe.so, or export OCL_NON_COMPILER=1.

Some explanation of the binary kernel interpreter (libinterp.a):

libinterp.a is used to interpret the binary kernel inside runtime,
and the runtime library libcl.so is built against libinterp.a.

Since the code to interpret binary kernels is tightly integrated inside
the compiler, a new file gbe_bin_interpreter.cpp is created that includes
some other .cpp files, which avoids code duplication. To keep libinterp.a
small (and therefore libcl.so small), the macro GBE_COMPILER_AVAILABLE
is used to activate only the needed code when building libinterp.a.

V2: the code base now calls the function gbe_set_image_base_index in
gbe_bin_generater, but this patch changes that function to
gbe_set_image_base_index_compiler; fix the call accordingly.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: fix baytrail L3 cache configuration.
Zhigang Gong [Tue, 27 May 2014 08:58:28 +0000 (16:58 +0800)]
GBE: fix baytrail L3 cache configuration.

Reducing the URB from 128KB to 64KB causes rendering artifacts in X window.
I have to change it to a 96KB URB and also change the RO and DC to 16KB
to satisfy the total 192KB L3 size limitation.

With this fix, the artifacts are gone and the utests have no new failures.

Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Guo Yejun <yejun.guo@intel.com>
10 years agoGBE: Make compatible with old gcc version.
Ruiling Song [Mon, 26 May 2014 02:07:15 +0000 (10:07 +0800)]
GBE: Make compatible with old gcc version.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoRefine pci id detecting.
Junyan He [Mon, 26 May 2014 15:57:32 +0000 (23:57 +0800)]
Refine pci id detecting.

Some platforms do not print the keyword "Gen" or "Graphic" when running the
lspci command, so we failed to get the pci id in such cases.
We now just use the 8086 keyword to get the sub pci id, and compare
it to all the known gen pci ids. This is safe on all platforms.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: fix post scheduling related bug for spill/unspill.
Zhigang Gong [Wed, 21 May 2014 08:27:55 +0000 (16:27 +0800)]
GBE: fix post scheduling related bug for spill/unspill.

spill/unspill instructions touch some registers directly which
are not in dst/src. This breaks the post scheduling. Simply
work around it by adding all the reserved registers to the dst
array.

The scratch memory is not correctly indexed and the barrier is
not handled properly.

After this patch, the post scheduling will be enabled by default.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: optimize post reg allocation instruction scheduling.
Zhigang Gong [Tue, 20 May 2014 10:38:05 +0000 (18:38 +0800)]
GBE: optimize post reg allocation instruction scheduling.

To make the post scheduling work better, I relax the frequency of
calls to expireGRF during register allocation. Thus we can
reduce physical register conflicts, which helps the post scheduling.

Another optimization is to insert a pre-retire for the instruction
to release WRITE_AFTER_READ dependencies. Write after read will
not bring any hazard, so we can release those registers as soon as
the instruction is scheduled.

The pre register allocation scheduling is quite different from post
scheduling; for now, just disable it.

The whole patch could get about 10% performance gain with luxmark.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fix one post register allocation instruction scheduling bug.
Zhigang Gong [Mon, 19 May 2014 09:51:14 +0000 (17:51 +0800)]
GBE: fix one post register allocation instruction scheduling bug.

The instruction has modFlag 1, indicating it will modify the flag.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: disable mad for some cases.
Zhigang Gong [Thu, 15 May 2014 05:35:00 +0000 (13:35 +0800)]
GBE: disable mad for some cases.

One case is when one operand is an imm value. Then using mad
saves one instruction but adds an extra LOADI instruction,
so we don't need to bother to use mad in this case.
And considering when we optimize the simd16 under simd8
mode, we

The other case is under simd16 mode. As mad is a 3-src instruction,
which only supports simd8, one mad(16) instruction is converted to
two mad(8) instructions. So we don't need to use mad there either.
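
The two cases above amount to a small predicate. The sketch below is a
hypothetical restatement of that rule, not the selection code from this
patch; the function name and its parameters are made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical predicate: should a multiply+add pair be fused into mad?
 * - An immediate operand would need an extra LOADI, so nothing is saved.
 * - In simd16 mode a mad(16) is split into two mad(8), so nothing is
 *   saved either. */
bool sketch_should_use_mad(unsigned simd_width, bool has_imm_operand)
{
    if (has_imm_operand)
        return false;
    if (simd_width == 16)
        return false;
    return true;
}
```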

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fix a uniform analysis bug.
Zhigang Gong [Thu, 22 May 2014 16:19:25 +0000 (00:19 +0800)]
GBE: fix a uniform analysis bug.

If a value is defined in a loop and is used outside of the
loop, that value cannot be a uniform (scalar) value.
The reason is that the value may be assigned different
scalar values on different lanes when the loop is re-entered with
different lanes active.
Thanks to Yang Rong for reporting this bug.

Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: don't allocate/modify flag if it is not used in current BB.
Zhigang Gong [Tue, 13 May 2014 09:51:26 +0000 (17:51 +0800)]
GBE: don't allocate/modify flag if it is not used in current BB.

If a flag is not used in current BB, we don't need to
set the modFlag bit on that instruction. Thus the register
allocation stage will not allocate a flag register for it.

No performance impact, as the previous implementation will
expire that flag register immediately.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: optimize IMM handling for SEL/SEL_CMP/CMP.
Zhigang Gong [Wed, 14 May 2014 06:58:32 +0000 (14:58 +0800)]
GBE: optimize IMM handling for SEL/SEL_CMP/CMP.

Actually, all of the above 3 instructions could avoid
one LOADI instruction by switching operands position.

This patch implements this optimization and consolidates
all optimizations of the same type into one place.

No obvious performance impact on luxmark.

v2:
fix some wrong indent.
v3:
fix the OP_ORD issue. OP_ORD use both src0/src1 as both src0/src1
so can't use this IMM optimization.
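
As a general illustration of operand switching for CMP: when the two
sources trade places, an ordered condition has to be mirrored so the
result stays the same. The enum and function names below are
hypothetical and are not the Gen IR names used in this patch.

```c
/* Hypothetical comparison conditions. */
enum sketch_cond { SK_EQ, SK_NE, SK_LT, SK_LE, SK_GT, SK_GE };

/* Condition to use after swapping src0 and src1:
 * (a < b) is the same as (b > a), and so on. EQ and NE are symmetric. */
enum sketch_cond sketch_swap_operands(enum sketch_cond c)
{
    switch (c) {
    case SK_LT: return SK_GT;
    case SK_LE: return SK_GE;
    case SK_GT: return SK_LT;
    case SK_GE: return SK_LE;
    default:    return c;
    }
}

/* Reference evaluation, used to check that the swap preserves the result. */
int sketch_eval(enum sketch_cond c, int a, int b)
{
    switch (c) {
    case SK_EQ: return a == b;
    case SK_NE: return a != b;
    case SK_LT: return a <  b;
    case SK_LE: return a <= b;
    case SK_GT: return a >  b;
    default:    return a >= b;  /* SK_GE */
    }
}
```

With this, a compare whose immediate sits in src0 can be rewritten with
the immediate in src1 without changing its result.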

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: optimize SUB dst, imm, src1 instruction.
Zhigang Gong [Wed, 14 May 2014 03:21:13 +0000 (11:21 +0800)]
GBE: optimize SUB dst, imm, src1 instruction.

We could easily convert it to SUB dst, -src1, -imm.
Thus we can avoid one LOADI instruction eventually.
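
The rewrite relies on the identity imm - src1 == (-src1) - (-imm). A
small check in C, using unsigned 32-bit wraparound arithmetic so the
identity holds for every input (the function names are illustrative
only):

```c
#include <stdint.h>

/* Original form: SUB dst, imm, src1 */
uint32_t sketch_sub_imm_first(uint32_t imm, uint32_t src1)
{
    return imm - src1;
}

/* Rewritten form: SUB dst, -src1, -imm.
 * The immediate now sits in the second source position. */
uint32_t sketch_sub_negated(uint32_t imm, uint32_t src1)
{
    uint32_t neg_src1 = 0u - src1;  /* negated register source */
    uint32_t neg_imm  = 0u - imm;   /* negated immediate */
    return neg_src1 - neg_imm;
}
```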

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: optimize CMP instruction encoding.
Zhigang Gong [Fri, 16 May 2014 11:06:08 +0000 (19:06 +0800)]
GBE: optimize CMP instruction encoding.

This patch fixes the following two things.
1. Use a temporary register as dst register for the CMP
instruction in the middle of a block.
2. Fix the switch flag for the CMP instruction at the beginning
of each block. The compact instruction handling handles
the cmp instruction directly and ignores the switch
flag, which is incorrect.

This patch could get about 2-3% performance gain for luxmark.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: refine disassembly code to show null register's type.
Zhigang Gong [Fri, 16 May 2014 07:57:24 +0000 (15:57 +0800)]
GBE: refine disassembly code to show null register's type.

We should show the null register's type in the assembly output.
If a null register uses a wrong type, as in the following
instruction:

cmp.le(8)      null:UW         g2<8,8,1>:F    0.1F

it is a fatal error from the hardware point of view. We should
output that information.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agogbe_bin_generater: fix two bugs.
Zhigang Gong [Fri, 23 May 2014 10:21:04 +0000 (18:21 +0800)]
gbe_bin_generater: fix two bugs.

The pci id detecting method is broken on some systems.
And the gen pci id parsing in gbe_bin_generater is incorrect when
the pci id contains an a-f hex digit.
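
The message does not say how the parsing went wrong. One common way for
this kind of bug to appear is reading the id as a decimal number, which
stops at the first a-f digit. A minimal sketch of base-16 parsing with
strtoul, using a hypothetical helper name and sample id strings:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a 16-bit pci device id written in hex
 * (for example "0a16"). Returns the id, or -1 if the text is empty,
 * has trailing characters, or does not fit in 16 bits. */
long sketch_parse_pci_id(const char *text)
{
    char *end = NULL;
    unsigned long value;

    if (text == NULL || *text == '\0')
        return -1;

    errno = 0;
    value = strtoul(text, &end, 16);  /* base 16 accepts 0-9, a-f, A-F */
    if (errno != 0 || end == text || *end != '\0' || value > 0xFFFFul)
        return -1;
    return (long)value;
}
```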

v2:
Add VGA to filter out some non-VGA devices.
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agocorrect L3 cache settings for baytrail
Guo Yejun [Thu, 22 May 2014 17:24:20 +0000 (01:24 +0800)]
correct L3 cache settings for baytrail

baytrail and ivb have different register bits layout for L3 cache,
so, add a special path for baytrail.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

10 years agomove enqueue_copy_image kernels outside of runtime code.
Luo [Mon, 12 May 2014 04:56:26 +0000 (12:56 +0800)]
move enqueue_copy_image kernels outside of runtime code.

Separate the kernel code from the host code to make it clean; build the
kernels offline with gbe_bin_generater to improve performance.

v2:
fix the image base issue with the standalone compiler.

Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agofix event related bugs.
Luo [Mon, 12 May 2014 04:56:25 +0000 (12:56 +0800)]
fix event related bugs.

1. remove repeated user events in list.
2. missed braces in loops.
3. fix barrier event reference not increased.

Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: optimize builtin atan2.
Ruiling Song [Mon, 19 May 2014 08:43:03 +0000 (16:43 +0800)]
GBE: optimize builtin atan2.

clang will generate extra stores for the implementation.
So, put the data in __constant address space.
This will improve opencv test PhaseFixture_Phase by 3x.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoFix the bug of forgetting release sampler in utest.
Junyan He [Fri, 16 May 2014 07:13:34 +0000 (15:13 +0800)]
Fix the bug of forgetting release sampler in utest.

The utest helper does not free the sampler resource for us
as it does for buffers and kernels. So we need to release it ourselves.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: fix unpacked_uw/ub on uniform registers.
Zhigang Gong [Wed, 14 May 2014 02:47:50 +0000 (10:47 +0800)]
GBE: fix unpacked_uw/ub on uniform registers.

The unpacked_uw/ub macros hard-code the register's width to 8,
which is bad for uniform registers. This patch fixes that issue.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoAdd the pci id support for gbe_generate
Junyan He [Tue, 20 May 2014 07:07:29 +0000 (15:07 +0800)]
Add the pci id support for gbe_generate

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoFix map gtt fail when memory object size is too large.
Yang Rong [Tue, 20 May 2014 02:46:19 +0000 (10:46 +0800)]
Fix map gtt fail when memory object size is too large.

After the max allocation size was changed to 256M, mapping a large memory object
through the gtt fails on some systems. So when the image size is larger than 128M,
disable tiling and use the normal map. But clEnqueueMapBuffer/Image may still fail
because of the unsync map.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoHSW: Correct the scratch buffer size calc and set the correct index in vfe state.
Yang Rong [Mon, 19 May 2014 05:52:25 +0000 (13:52 +0800)]
HSW: Correct the scratch buffer size calc and set the correct index in vfe state.

HSW's scratch buffer alignment and the index set in vfe state are different from IVB's.
Also, calculating the per-thread stack offset uses R0.0's FFTID, and the definition of
FFTID has changed in HSW as well.
With this patch, all utests pass.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoHSW: Fix the atomic msg type typo.
Yang Rong [Mon, 19 May 2014 05:52:24 +0000 (13:52 +0800)]
HSW: Fix the atomic msg type typo.

The atomic msg type should be GEN75_P1_UNTYPED_ATOMIC_OP. Correct it.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoCorrect the double bug in HSW.
Yang Rong [Mon, 19 May 2014 05:52:23 +0000 (13:52 +0800)]
Correct the double bug in HSW.

We should set nomask in mov_df_imm and need to handle the exec_width=4 case in setHeader.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoHSW: Use the drm flag I915_EXEC_ENABLE_SLM to set L3 control config.
Yang Rong [Mon, 19 May 2014 05:52:22 +0000 (13:52 +0800)]
HSW: Use the drm flag I915_EXEC_ENABLE_SLM to set L3 control config.

Because LRI commands will be converted to NOOP, add the I915_EXEC_ENABLE_SLM
flag to the drm kernel driver to enable SLM in the L3. Set the flag when the
application uses SLM. Still keep the L3 config in the batch buffer for fulsim.
Also create and use OpenCL's own context when executing, to avoid affecting other contexts.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoHSW: Workaround the slm address issue.
Yang Rong [Mon, 19 May 2014 05:52:21 +0000 (13:52 +0800)]
HSW: Workaround the slm address issue.

Each work group has its own slm offset, and when dispatching threads,
TSG handles it automatically in IVB. But this fails in HSW.
On inspection, all work groups' slm offsets are 0, even though the slm index is
correct in R0.0. So calculate the slm offset from the slm index, and add it
to the slm address.
TODO: need to find the root cause.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoEnable pipe control.
Yang Rong [Mon, 19 May 2014 05:52:20 +0000 (13:52 +0800)]
Enable pipe control.

The previous pipe control doesn't work, because it doesn't advance the batch buffer.
So the value set in function intel_gpgpu_pipe_control will be flushed later. Fix it.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoFix a crash when clSetKernelArg of parameter point to NULL value.
Yang Rong [Mon, 19 May 2014 05:52:19 +0000 (13:52 +0800)]
Fix a crash when clSetKernelArg of parameter point to NULL value.

Per the OCL spec, if the arg_value of clSetKernelArg is a memory object, it can be
NULL or point to NULL. The driver only handles the NULL case and crashes if it points
to NULL. Correct it.
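
A minimal sketch of the two cases, with a stand-in type instead of the
real cl_mem and a hypothetical helper name. It only illustrates the
check and is not the driver code from this patch.

```c
#include <stddef.h>

/* Stand-in for the driver's memory object; cl_mem is a pointer type. */
struct sketch_mem { int dummy; };
typedef struct sketch_mem *sketch_mem_t;

/* Hypothetical helper: resolve the memory object behind arg_value.
 * Returns NULL both when arg_value itself is NULL and when it points to
 * a NULL handle, so the caller can treat the two cases the same way and
 * never dereferences a NULL handle. */
sketch_mem_t sketch_resolve_mem_arg(const void *arg_value)
{
    if (arg_value == NULL)
        return NULL;
    return *(const sketch_mem_t *)arg_value;  /* may itself be NULL */
}
```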

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoHSW: align buffer's size to DWORD.
Yang Rong [Mon, 19 May 2014 05:52:18 +0000 (13:52 +0800)]
HSW: align buffer's size to DWORD.

HSW: Byte scattered Read/Write requires that the buffer size be a multiple of 4 bytes.
     So simply align all buffer sizes to 4. This passes utest compiler_function_constant0.

Because it is a very light workaround, apply the alignment without checking the device.
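
Rounding a size up to a multiple of 4 bytes is usually written as below.
The helper name is hypothetical, and sizes within 3 bytes of SIZE_MAX
(where the addition would wrap) are ignored in this sketch.

```c
#include <stddef.h>

/* Round size up to the next multiple of 4 bytes (one DWORD). */
size_t sketch_align_to_dword(size_t size)
{
    return (size + 3u) & ~(size_t)3u;
}
```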

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoModify the GenContext and GenEncoder's destructor to virtual
Junyan He [Thu, 15 May 2014 09:38:53 +0000 (17:38 +0800)]
Modify the GenContext and GenEncoder's destructor to virtual

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoRuntime: Fix a bug in L3 configuration.
Ruiling Song [Fri, 16 May 2014 03:26:30 +0000 (11:26 +0800)]
Runtime: Fix a bug in L3 configuration.

We forgot to set L3SQCREG1 register.
And also add a more suitable configuration.
This patch improves the Luxmark score by more than 50%.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>