contrib/beignet.git
10 years agoGBE: fix a uniform analysis bug.
Zhigang Gong [Thu, 22 May 2014 16:19:25 +0000 (00:19 +0800)]
GBE: fix a uniform analysis bug.

If a value is defined inside a loop and used outside the
loop, it cannot be a uniform (scalar) value. The reason is
that the value may be assigned a different scalar value on
different lanes when the loop re-enters with different
lanes active.
Thanks to Yang Rong for reporting this bug.

Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: don't allocate/modify flag if it is not used in current BB.
Zhigang Gong [Tue, 13 May 2014 09:51:26 +0000 (17:51 +0800)]
GBE: don't allocate/modify flag if it is not used in current BB.

If a flag is not used in the current BB, we don't need to
set the modFlag bit on that instruction. Thus the register
allocation stage will not allocate a flag register for it.

No performance impact, as the previous implementation would
expire that flag register immediately.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: optimize IMM handling for SEL/SEL_CMP/CMP.
Zhigang Gong [Wed, 14 May 2014 06:58:32 +0000 (14:58 +0800)]
GBE: optimize IMM handling for SEL/SEL_CMP/CMP.

Actually, all of the above 3 instructions can avoid
one LOADI instruction by switching operand positions.

This patch implements this optimization and consolidates
all optimizations of this type into one place.

No obvious performance impact on luxmark.

v2:
fix some incorrect indentation.
v3:
fix the OP_ORD issue. OP_ORD uses src0/src1 in both operand
positions, so it can't use this IMM optimization.
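As an illustration of the operand-swap idea (a sketch only; the function and condition names are hypothetical, not beignet's actual IR): GEN encodes an immediate only in the src1 position, so a compare with the immediate in src0 would need a LOADI unless we swap the operands and mirror the condition.

```python
# Conditions and their mirror under operand swap: (a op b) == (b op' a).
SWAPPED = {"LT": "GT", "GT": "LT", "LE": "GE", "GE": "LE", "EQ": "EQ", "NE": "NE"}

def fold_imm_cmp(cond, src0, src1):
    """Return (cond, src0, src1) with any immediate moved to src1."""
    if isinstance(src0, int) and not isinstance(src1, int):
        # Swap operands and mirror the condition; no LOADI needed now.
        return SWAPPED[cond], src1, src0
    return cond, src0, src1

# (5 < x) becomes (x > 5): same predicate, immediate now in src1.
print(fold_imm_cmp("LT", 5, "x"))  # ('GT', 'x', 5)
```

The same swap works for SEL and SEL_CMP; OP_ORD is the exception noted above because it reads both operands in both positions.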

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: optimize SUB dst, imm, src1 instruction.
Zhigang Gong [Wed, 14 May 2014 03:21:13 +0000 (11:21 +0800)]
GBE: optimize SUB dst, imm, src1 instruction.

We could easily convert it to SUB dst, -src1, -imm.
Thus we can avoid one LOADI instruction eventually.
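The rewrite relies on a simple arithmetic identity; a minimal check (illustrative only, `sub` is a stand-in for the instruction semantics, not driver code): GEN can negate a register source with a modifier, and the negated immediate is just another immediate, so no LOADI is needed.

```python
def sub(a, b):
    # Stand-in for SUB dst, a, b -> dst = a - b
    return a - b

imm, src1 = 7, 3
# SUB dst, imm, src1 computes imm - src1, which equals (-src1) - (-imm).
assert sub(imm, src1) == sub(-src1, -imm)
print(sub(-src1, -imm))  # 4
```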

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: optimize CMP instruction encoding.
Zhigang Gong [Fri, 16 May 2014 11:06:08 +0000 (19:06 +0800)]
GBE: optimize CMP instruction encoding.

This patch fixes the following two things:
1. Use a temporary register as the dst register for a CMP
instruction in the middle of a block.
2. Fix the switch flag for a CMP instruction at the beginning
of each block, as the compact instruction handling encodes
the cmp instruction directly and ignores the switch flag,
which is incorrect.

This patch could get about 2-3% performance gain for luxmark.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: refine disassembly code to show null register's type.
Zhigang Gong [Fri, 16 May 2014 07:57:24 +0000 (15:57 +0800)]
GBE: refine disassembly code to show null register's type.

We should show the null register's type in the assembly output,
because a null register with a wrong type, as in the following
instruction:

cmp.le(8)      null:UW         g2<8,8,1>:F    0.1F

is a fatal error from the hardware's point of view. We should
output that information.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agogbe_bin_generater: fix two bugs.
Zhigang Gong [Fri, 23 May 2014 10:21:04 +0000 (18:21 +0800)]
gbe_bin_generater: fix two bugs.

The pci id detection method is broken on some systems.
And the gen pci id parsing in gbe_bin_generater is incorrect when
the pci id contains an a-f hex digit.

v2:
Add VGA to filter out some non-VGA devices.
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agocorrect L3 cache settings for baytrail
Guo Yejun [Thu, 22 May 2014 17:24:20 +0000 (01:24 +0800)]
correct L3 cache settings for baytrail

baytrail and ivb have different register bit layouts for the L3 cache,
so add a special path for baytrail.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

10 years agomove enqueue_copy_image kernels outside of runtime code.
Luo [Mon, 12 May 2014 04:56:26 +0000 (12:56 +0800)]
move enqueue_copy_image kernels outside of runtime code.

Separate the kernel code from the host code to make it clean; build the
kernels offline with gbe_bin_generator to improve performance.

v2:
fix the image base issue with the standalone compiler.

Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agofix event related bugs.
Luo [Mon, 12 May 2014 04:56:25 +0000 (12:56 +0800)]
fix event related bugs.

1. Remove repeated user events in the list.
2. Add missed braces in loops.
3. Fix barrier event references not being increased.

Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: optimize builtin atan2.
Ruiling Song [Mon, 19 May 2014 08:43:03 +0000 (16:43 +0800)]
GBE: optimize builtin atan2.

clang generates extra stores for the implementation,
so put the data in the __constant address space.
This improves the opencv test PhaseFixture_Phase by 3x.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoFix the bug of forgetting release sampler in utest.
Junyan He [Fri, 16 May 2014 07:13:34 +0000 (15:13 +0800)]
Fix the bug of forgetting release sampler in utest.

The utest helper will not free the sampler resource for us
as it does for buffers and kernels, so we need to release it ourselves.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: fix unpacked_uw/ub on uniform registers.
Zhigang Gong [Wed, 14 May 2014 02:47:50 +0000 (10:47 +0800)]
GBE: fix unpacked_uw/ub on uniform registers.

The unpacked_uw/ub macros hard coded the register's width to 8,
which is bad for uniform registers. This patch fixes that issue.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoAdd the pci id support for gbe_generate
Junyan He [Tue, 20 May 2014 07:07:29 +0000 (15:07 +0800)]
Add the pci id support for gbe_generate

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoFix map gtt fail when memory object size is too large.
Yang Rong [Tue, 20 May 2014 02:46:19 +0000 (10:46 +0800)]
Fix map gtt fail when memory object size is too large.

After the max allocation size was changed to 256M, mapping the GTT for a large
memory object can fail on some systems. So when the image size is larger than
128M, disable tiling and use a normal map. But clEnqueueMapBuffer/Image may
still fail because of the unsynchronized map.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoHSW: Correct the scratch buffer size calc and set the correct index in vfe state.
Yang Rong [Mon, 19 May 2014 05:52:25 +0000 (13:52 +0800)]
HSW: Correct the scratch buffer size calc and set the correct index in vfe state.

HSW's scratch buffer alignment and the index set in the vfe state differ from IVB's.
Also, when calculating each thread's stack offset, R0.0's FFTID is used, and the
definition of FFTID changed in HSW as well.
With this patch, all utests pass.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoHSW: Fix the atomic msg type typo.
Yang Rong [Mon, 19 May 2014 05:52:24 +0000 (13:52 +0800)]
HSW: Fix the atomic msg type typo.

The atomic msg type should be GEN75_P1_UNTYPED_ATOMIC_OP. Correct it.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoCorrect the double bug in HSW.
Yang Rong [Mon, 19 May 2014 05:52:23 +0000 (13:52 +0800)]
Correct the double bug in HSW.

We should set NoMask in mov_df_imm and handle the exec_width=4 case in setHeader.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoHSW: Use the drm flag I915_EXEC_ENABLE_SLM to set L3 control config.
Yang Rong [Mon, 19 May 2014 05:52:22 +0000 (13:52 +0800)]
HSW: Use the drm flag I915_EXEC_ENABLE_SLM to set L3 control config.

Because LRI commands will be converted to NOOP, add the I915_EXEC_ENABLE_SLM
flag to the drm kernel driver to enable SLM in the L3. Set the flag when the
application uses slm. Still keep the L3 config in the batch buffer for fulsim.
Also create and use OpenCL's own context when executing, to avoid affecting other contexts.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoHSW: Workaround the slm address issue.
Yang Rong [Mon, 19 May 2014 05:52:21 +0000 (13:52 +0800)]
HSW: Workaround the slm address issue.

Each work group has its own slm offset, and when dispatching threads,
the TSG handles it automatically on IVB. But it fails on HSW.
After checking, all work groups' slm offsets are 0, even though the slm
index in R0.0 is correct. So calculate the slm offset from the slm index
and add it to the slm address.
TODO: need to find the root cause.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoEnable pipe control.
Yang Rong [Mon, 19 May 2014 05:52:20 +0000 (13:52 +0800)]
Enable pipe control.

The previous pipe control didn't work because it didn't advance the batch buffer,
so the values set in intel_gpgpu_pipe_control would be flushed later. Fix it.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoFix a crash when clSetKernelArg of parameter point to NULL value.
Yang Rong [Mon, 19 May 2014 05:52:19 +0000 (13:52 +0800)]
Fix a crash when clSetKernelArg of parameter point to NULL value.

Per the OCL spec, if the arg_value of clSetKernelArg is a memory object, it can be
NULL or point to NULL. The driver only handled the NULL case and would crash on a
pointer to NULL. Correct it.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoHSW: align buffer's size to DWORD.
Yang Rong [Mon, 19 May 2014 05:52:18 +0000 (13:52 +0800)]
HSW: align buffer's size to DWORD.

HSW: Byte scattered read/write requires that the buffer size be a multiple of 4 bytes,
     so simply align all buffer sizes to 4. This passes utest compiler_function_constant0.

Because it is a very light workaround, apply the alignment without checking the device.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
10 years agoModify the GenContext and GenEncoder's destructor to virtual
Junyan He [Thu, 15 May 2014 09:38:53 +0000 (17:38 +0800)]
Modify the GenContext and GenEncoder's destructor to virtual

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoRuntime: Fix a bug in L3 configuration.
Ruiling Song [Fri, 16 May 2014 03:26:30 +0000 (11:26 +0800)]
Runtime: Fix a bug in L3 configuration.

We forgot to set the L3SQCREG1 register.
Also add a more suitable configuration.
This patch improves the Luxmark score by more than 50%.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: fix one regression caused by uniform analysis.
Zhigang Gong [Tue, 13 May 2014 10:29:18 +0000 (18:29 +0800)]
GBE: fix one regression caused by uniform analysis.

Some instructions handle simd1 incorrectly. Disable them
for now.

v2:
add addsat into the unsupported list.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: fix the legacy use of isScalarOrBool.
Zhigang Gong [Mon, 12 May 2014 02:27:28 +0000 (10:27 +0800)]
GBE: fix the legacy use of isScalarOrBool.

isScalarOrBool is a legacy function which was used when bool
was treated as a scalar register by default. Now that we use a
normal vector word register to represent bool, there is no need
to keep this function. Replace all of its uses with isScalarReg.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
10 years agoGBE: enable uniform analysis for bool data type.
Zhigang Gong [Wed, 7 May 2014 05:12:36 +0000 (13:12 +0800)]
GBE: enable uniform analysis for bool data type.

v2:
refine the flag allocation implementation.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
10 years agoGBE: enable uniform for load instruction.
Zhigang Gong [Wed, 7 May 2014 07:46:59 +0000 (15:46 +0800)]
GBE: enable uniform for load instruction.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
10 years agoGBE: implement uniform analysis.
Zhigang Gong [Tue, 6 May 2014 10:31:13 +0000 (18:31 +0800)]
GBE: implement uniform analysis.

We have many uniform (scalar) input values, including
the kernel input arguments and some special registers.

All variables derived only from uniform values are also
uniform. This patch analyzes this type of register at the
liveness analysis stage and changes uniform registers'
type to scalar. Later on, these registers need less
register space.
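The propagation rule can be sketched as a small fixed-point loop (hypothetical names, not beignet's code): a value is uniform exactly when all of its sources are uniform.

```python
def propagate_uniform(seeds, defs):
    """seeds: registers known uniform (kernel args, special regs).
    defs: reg -> list of source regs it is computed from.
    Returns the full set of uniform registers."""
    uniform = set(seeds)
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for reg, srcs in defs.items():
            if reg not in uniform and srcs and all(s in uniform for s in srcs):
                uniform.add(reg)
                changed = True
    return uniform

defs = {"t0": ["arg0", "arg1"],   # derived only from uniforms -> uniform
        "t1": ["t0", "lane_id"]}  # mixes in a per-lane value -> not uniform
print(sorted(propagate_uniform({"arg0", "arg1"}, defs)))  # ['arg0', 'arg1', 't0']
```

The loop-carried case fixed in the commit at the top of this log is the extra wrinkle the real analysis must handle: a value defined in a loop and used outside it must be demoted even if its sources look uniform.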

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
10 years agoGBE: change scalar byte size to 2 from 1.
Zhigang Gong [Wed, 7 May 2014 01:39:50 +0000 (09:39 +0800)]
GBE: change scalar byte size to 2 from 1.

Because the exec size is always greater than or equal to 2,
we need to change the scalar byte size to 2 rather than
1. Otherwise, it may generate the following illegal instruction:

(17      )  mov(1)          g127.31<1>UB    0x2UW                           { align1 WE_all };

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
10 years agoGBE: No need to compute liveout again in value.cpp.
Zhigang Gong [Tue, 29 Apr 2014 03:26:14 +0000 (11:26 +0800)]
GBE: No need to compute liveout again in value.cpp.

We already did a complete liveness analysis in liveness.cpp.
There is no need to do it again. This saves about 10% of the compile time.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
10 years agoFix double bugs for hsw
Junyan He [Wed, 7 May 2014 10:03:18 +0000 (18:03 +0800)]
Fix double bugs for hsw

Per the bspec, IVB should use SIMD8 for double ops, but HSW should use SIMD4.
TODO: The long ops may also need changes.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
10 years agoMake the surface typed write work for HSW
Junyan He [Wed, 7 May 2014 10:03:10 +0000 (18:03 +0800)]
Make the surface typed write work for HSW

1. Modify the typed write for the state write using GEN_SFID_DATAPORT_DATA_CACHE.
2. Add the channel select for the surface state setting.
3. Correct the send message for setting the slot in the send description.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agocorrect jump distance of hsw's jmpi.
Junyan He [Wed, 7 May 2014 10:03:04 +0000 (18:03 +0800)]
correct jump distance of hsw's jmpi.

Gen5+ bspec: the jump distance is in number of eight-byte units.
Gen7.5+: the offset is in units of 8 bits for JMPI, 64 bits for other flow control instructions.
So we need to multiply all jump distances by 8 for jmpi.
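The unit conversion amounts to the following (hedged sketch; `hsw_jmpi_offset` is an illustrative name, not a driver function): a distance counted in eight-byte units becomes a byte offset for Gen7.5 JMPI.

```python
def hsw_jmpi_offset(distance_in_qwords):
    """Convert a jump distance in eight-byte (qword) units, as the
    pre-Gen7.5 spec counts it, into the byte offset Gen7.5 JMPI takes."""
    return distance_in_qwords * 8

print(hsw_jmpi_offset(3))  # 24
```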

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>
10 years agoUsing a correct DATAPORT and SFID for some send message of haswell.
Junyan He [Wed, 7 May 2014 10:02:56 +0000 (18:02 +0800)]
Using a correct DATAPORT and SFID for some send message of haswell.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
10 years agoAdd Gen75Context and Gen75Encoder class for hsw
Junyan He [Wed, 7 May 2014 10:02:50 +0000 (18:02 +0800)]
Add Gen75Context and Gen75Encoder class for hsw

We will create the Gen75Context and Gen75Encoder
dynamically based on the device ID, which is the same
as the PCI ID.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agoUpdate the device info description for HSW
Junyan He [Fri, 9 May 2014 07:38:57 +0000 (15:38 +0800)]
Update the device info description for HSW

Split the cl_device_id description for HSW into
GT1, GT2 and GT3, with different parameters.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
10 years agoRuntime: change default tiling mode to TILE_X from TILE_Y.
Zhigang Gong [Fri, 9 May 2014 05:12:28 +0000 (13:12 +0800)]
Runtime: change default tiling mode to TILE_X from TILE_Y.

Nanhai found that the tiling mode matters a lot for performance
in some cases. So make the tiling mode configurable at runtime
and make the default tiling mode TILE_X, which is much
better than the original TILE_Y for many cases.

At runtime, it is easy to change the default tiling mode as below:
export OCL_TILING=0 # enable NO_TILE
export OCL_TILING=1 # enable TILE_X
export OCL_TILING=2 # enable TILE_Y

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoGBE: Merge successive load/store together for better performance.
Ruiling Song [Thu, 8 May 2014 02:18:22 +0000 (10:18 +0800)]
GBE: Merge successive load/store together for better performance.

Gen supports at most 4 DWORDs of read/write in one single instruction.
So we merge successive reads/writes for fewer instructions and better performance.
This improves the LuxMark medium scene by about 10%.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Refine logic of finding where the local variable is defined.
Ruiling Song [Fri, 9 May 2014 04:54:25 +0000 (12:54 +0800)]
GBE: Refine logic of finding where the local variable is defined.

Traverse all uses of the local variable, as there may be some dead uses.
Most of the time, the function returns quickly, as the use tree is not deep.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agodo not serialize zero image/sampler info into binary
Guo Yejun [Tue, 6 May 2014 19:34:37 +0000 (03:34 +0800)]
do not serialize zero image/sampler info into binary

If there is no image/sampler used in the kernel source, it is not
necessary to serialize the zero image/sampler info into the kernel binary.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoGBE: fix one potential bug in UnsignedI64ToFloat.
Zhigang Gong [Tue, 6 May 2014 03:20:41 +0000 (11:20 +0800)]
GBE: fix one potential bug in UnsignedI64ToFloat.

Set exp to a proper value to make sure all the inactive lanes'
flag bits are 1s, which satisfies the requirement of the following
ALL16/ALL8H condition check.

v2:
enable the first JMPI's optimization rather than the second,
as it has a higher probability.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: Fix one build error of friend declaration for a class.
Chuanbo Weng [Tue, 6 May 2014 10:48:26 +0000 (18:48 +0800)]
GBE: Fix one build error of friend declaration for a class.

If g++ is older than 4.7.0, the class-key of the
elaborated-type-specifier is required in a friend declaration
for a class. So modify the code to make it compatible with old
g++ versions.

Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: remove some useless code.
Zhigang Gong [Sun, 4 May 2014 01:16:05 +0000 (09:16 +0800)]
GBE: remove some useless code.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: increase the global memory size to 1GB.
Zhigang Gong [Sun, 4 May 2014 01:14:08 +0000 (09:14 +0800)]
GBE: increase the global memory size to 1GB.

Also increase the global memory to 1GB.

v2: change the max memory size to 256MB

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: fixed a regression at "Long" div/rem.
Zhigang Gong [Sun, 4 May 2014 00:59:41 +0000 (08:59 +0800)]
GBE: fixed a regression at "Long" div/rem.

If GEN_PREDICATE_ALIGN1_ANY8H/ANY16H or ALL8H/ALL16H
are used, we must make sure the inactive lanes are initialized
correctly. For the "ANY" condition, all the inactive lanes need to
be cleared to zero. For the "ALL" condition, all the inactive lanes
need to be set to 1s. Otherwise, it may cause an infinite loop.
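A toy model of why the inactive lanes matter (plain Python, not driver code): the ANY/ALL predication reduces the flag bits of all lanes, active or not, so the padding chosen for inactive lanes must be the identity of the reduction.

```python
def any8h(flags):
    """Model of ANY8H: true if any of the 8 lanes' flag bits is set."""
    return any(flags)

def all8h(flags):
    """Model of ALL8H: true only if all 8 lanes' flag bits are set."""
    return all(flags)

active = [True, False, True]           # flag bits of 3 active lanes
# "ANY": inactive lanes must be padded with 0 so they can't assert ANY.
assert any8h(active + [False] * 5)
# "ALL": inactive lanes must be padded with 1 so they can't break ALL.
assert all8h([True, True, True] + [True] * 5)
# Padding the wrong way makes ALL fail even when every active lane is set:
assert not all8h([True, True, True] + [False] * 5)
```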

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoInit Benchmark suite
Yi Sun [Mon, 28 Apr 2014 05:31:05 +0000 (13:31 +0800)]
Init Benchmark suite

The first benchmark case is named enqueue_copy_buf.

Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: reserve flag0.0 for large basic block.
Zhigang Gong [Fri, 25 Apr 2014 13:52:34 +0000 (21:52 +0800)]
GBE: reserve flag0.0 for large basic block.

In a large basic block, there can be more than one IF instruction
that needs to use flag0.0. We have to reserve flag0.0 for those IF
instructions.

Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fix the large if/endif block issue.
Zhigang Gong [Fri, 25 Apr 2014 07:38:22 +0000 (15:38 +0800)]
GBE: fix the large if/endif block issue.

Some test cases have very large blocks which contain
more than 32768/2 instructions, which cannot fit into one
if/endif block.

This patch introduces an if/endif fix switch in the GenContext.
Once we encounter such an error, we set the switch on
and then recompile the kernel. When the switch is on, we
insert extra endif/if pairs into the block to split one if/endif
block into multiple ones, fixing the large if/endif issue.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fix the hard coded endif offset calculation.
Zhigang Gong [Fri, 25 Apr 2014 04:36:59 +0000 (12:36 +0800)]
GBE: fix the hard coded endif offset calculation.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: Avoid unnecessary dag/liveness computing at backend.
Zhigang Gong [Thu, 24 Apr 2014 07:24:07 +0000 (15:24 +0800)]
GBE: Avoid unnecessary dag/liveness computing at backend.

We don't need to recompute dag/liveness at the backend when
we switch to a new code gen strategy.
For the unit test cases, this patch saves about 15% of the
overall execution time. For luxmark in STRICT conformance
mode, it saves about 40% of the build time.

v3: fix some minor bugs.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fixed a potential scalarize bug.
Zhigang Gong [Thu, 24 Apr 2014 10:18:40 +0000 (18:18 +0800)]
GBE: fixed a potential scalarize bug.

We need to append an extract instruction when doing a bitcast to
a vector. Otherwise, we may trigger an assert, as the extract
instruction would use an undefined vector.

After this patch, it becomes safe to run many rounds of the scalarize
pass.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoadd support for cross compiler
Guo Yejun [Wed, 23 Apr 2014 18:18:00 +0000 (02:18 +0800)]
add support for cross compiler

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: refine the gen program strategy.
Zhigang Gong [Thu, 24 Apr 2014 02:09:13 +0000 (10:09 +0800)]
GBE: refine the gen program strategy.

limitRegisterPressure only affects the MAD pattern matching,
which does not bring a noticeable difference here. I changed it to
always be false, and added the reserved registers for spilling to the
strategy structure. Thus we can try to build a program with the
following strategy:

1. SIMD16 without spilling
2. SIMD16 with 10 spilling registers and a default spilling threshold
   of 16. When more than 16 registers need to be spilled, we fall back
   to the next method.
3. SIMD8 without spilling
4. SIMD8 with 8 spilling registers.
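The fallback chain above can be sketched as a simple loop over strategies (the tuple layout and function names are illustrative, not beignet's structures): try each configuration in order and take the first one the compiler accepts.

```python
# (simd_width, reserved_spill_regs, spill_threshold) per the list above;
# None means "no threshold, last resort".
STRATEGIES = [
    (16, 0, 0),    # SIMD16, no spilling
    (16, 10, 16),  # SIMD16, 10 spill regs, give up past 16 spills
    (8, 0, 0),     # SIMD8, no spilling
    (8, 8, None),  # SIMD8, 8 spill regs
]

def build_program(compile_fn):
    """Return the first (strategy, program) that compile_fn accepts."""
    for strategy in STRATEGIES:
        program = compile_fn(*strategy)
        if program is not None:
            return strategy, program
    raise RuntimeError("all codegen strategies failed")

# e.g. a toy compiler that only succeeds at SIMD8 without spilling:
strategy, _ = build_program(lambda w, r, t: "bin" if (w, r) == (8, 0) else None)
print(strategy)  # (8, 0, 0)
```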

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: fixed the undefined phi value's liveness analysis.
Zhigang Gong [Thu, 10 Apr 2014 06:33:48 +0000 (14:33 +0800)]
GBE: fixed the undefined phi value's liveness analysis.

If a phi component is undef from one of the predecessors,
we should not include it in that predecessor's liveout registers.
Otherwise, the phi register's liveness may be extended to
basic block zero, which is not good.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
10 years agoGBE: Try expire some register before register allocation
Ruiling Song [Wed, 23 Apr 2014 06:31:29 +0000 (14:31 +0800)]
GBE: Try expire some register before register allocation

1. This frees unused registers as soon as possible, so it becomes
   easier to allocate contiguous registers.

2. We previously met many hidden register liveness issues. Let's try
   to reuse expired registers early; then wrong liveness should be
   easier to find.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
10 years agoGBE: Optimize byte gather read using untyped read.
Ruiling Song [Wed, 23 Apr 2014 02:56:50 +0000 (10:56 +0800)]
GBE: Optimize byte gather read using untyped read.

Untyped read seems better than byte gather read.
Some performance tests in opencv doubled after this patch.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
10 years agoadd test for __gen_ocl_simd_any and __gen_ocl_simd_all
Guo Yejun [Fri, 18 Apr 2014 05:42:29 +0000 (13:42 +0800)]
add test for __gen_ocl_simd_any and __gen_ocl_simd_all

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agosupport __gen_ocl_simd_any and __gen_ocl_simd_all
Guo Yejun [Fri, 18 Apr 2014 05:42:16 +0000 (13:42 +0800)]
support __gen_ocl_simd_any and __gen_ocl_simd_all

short __gen_ocl_simd_any(short x):
if x is non-zero in any of the active threads in the same SIMD,
the return value for all these threads is non-zero; otherwise, zero is returned.

short __gen_ocl_simd_all(short x):
only if x is non-zero in all of the active threads in the same SIMD
is the return value for all these threads non-zero; otherwise, zero is returned.

For example:
to check whether a special value exists in a global buffer, use one SIMD
to do the search in parallel; the whole SIMD can stop the task
once the value is found. The key kernel code looks like:

for(; ; ) {
  ...
  if (__gen_ocl_simd_any(...))
    break;   //the whole SIMD stop the searching
}

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoDelete the printing of dynamic statistics line.
Sun, Yi [Tue, 8 Apr 2014 02:53:40 +0000 (10:53 +0800)]
Delete the printing of dynamic statistics line.

summary:
---------------------
  1. Delete the printing of the dynamic statistics line.
  2. Add a function to catch signals (like CTRL+C, core dumps ...);
     if one is caught, remind the user of the signal name.
     Core dump example:
...
displacement_map_element()    [SUCCESS]
compiler_clod()    Interrupt signal (SIGSEGV) received.
summary:
----------
  total: 657
  run: 297
  pass: 271
  fail: 26
  pass rate: 0.960426

Signed-off-by: Yi Sun <yi.sun@intel.com>
Signed-off-by: Yangwei Shui <yangweix.shui@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Implement instruction compact.
Ruiling Song [Tue, 15 Apr 2014 08:53:17 +0000 (16:53 +0800)]
GBE: Implement instruction compact.

A native GEN ASM instruction takes 2*64 bits, but GEN also supports compact
instructions which take only 64 bits. To make the code easily understood,
GenInstruction now only stands for 64 bits of memory, and GenNativeInstruction
& GenCompactInstruction are used to represent normal (native) and compact instructions.

After this change, it is not easy to map a SelectionInstruction distance to an ASM
distance, as the instructions within the distance may be compacted. To avoid
introducing too much complexity, JMP, IF, ENDIF and NOP will NEVER be compacted.

Some experiments in luxMark show it reduces instruction memory by about 20%.
But sadly, no performance improvement was observed.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: fix a Q64 spilling bug in non-simd8 mode.
Zhigang Gong [Thu, 17 Apr 2014 09:41:58 +0000 (17:41 +0800)]
GBE: fix a Q64 spilling bug in non-simd8 mode.

For simd16 mode, the payload needs to have 2 GRFs, not the hard-coded 1 GRF.
This patch fixes the corresponding regression on piglit.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: work around baytrail-t hang issue.
Zhigang Gong [Thu, 17 Apr 2014 06:59:00 +0000 (14:59 +0800)]
GBE: work around baytrail-t hang issue.

There is an unknown issue with the baytrail-t platform. It hangs at
utest's compiler_global_constant case. After some investigation,
it turns out to be related to the DWORD GATHER READ send message
on the constant cache data port. Changing to the data cache data
port works around that hang issue.

Now we only fail one more case on baytrail-t compared to the IVB
desktop platform, which is:

profiling_exec()    [FAILED]
   Error: Too large time from submit to start

That may be caused by a kernel-related issue, and that bug will not
cause serious problems for normal kernels. So after this patch, the
baytrail-t platform should be in pretty good shape with beignet.

Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: He Junyan <junyan.he@inbox.com>
10 years agoGBE/Runtime: pass the device id to the compiler backend.
Zhigang Gong [Thu, 17 Apr 2014 06:56:08 +0000 (14:56 +0800)]
GBE/Runtime: pass the device id to the compiler backend.

For some reason, we need to know the current target device id
at the code generation stage. This patch introduces such
a mechanism. This is preparation for fixing the weird
baytrail hang issue.

Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: He Junyan <junyan.he@inbox.com>
10 years agoRuntime: increase the build log buffer size to 1000.
Zhigang Gong [Thu, 17 Apr 2014 05:11:50 +0000 (13:11 +0800)]
Runtime: increase the build log buffer size to 1000.

200 is too small sometimes.

Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: He Junyan <junyan.he@inbox.com>
10 years agoRuntime: Add support for Bay Trail-T device.
Chuanbo Weng [Thu, 10 Apr 2014 08:17:53 +0000 (16:17 +0800)]
Runtime: Add support for Bay Trail-T device.

According to the baytrail-t spec, baytrail-t has 4 EUs and each
EU has 8 threads. So the compute unit count is 32 and the maximum
work group size is 32 * 8, which is 256.

Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoMark SandyBridge as unsupported
Jesper Pedersen [Sun, 13 Apr 2014 13:58:12 +0000 (09:58 -0400)]
Mark SandyBridge as unsupported

Signed-off-by: Jesper Pedersen <jesper.pedersen@comcast.net>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoUse pkg-config to check modules
Zhenyu Wang [Thu, 10 Apr 2014 10:09:44 +0000 (18:09 +0800)]
Use pkg-config to check modules

Instead of using pre-defined paths for dependent modules, e.g. libdrm,
libdrm_intel, etc., use the pkg-config helper for cmake. This makes
it easy to work with a developer's own built versions of those dependencies.

Also remove the libGL dependency for 'gbe_bin_generator', which is not required.
libutest.so still requires libGL for now, but that might be fixed by checking
the real GL dependency.

v2: Fix build with mesa source (92e6260) and link the required EGL lib with utests too.

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

10 years agoGBE: Enable CFG printer.
Ruiling Song [Fri, 11 Apr 2014 06:48:18 +0000 (14:48 +0800)]
GBE: Enable CFG printer.

Set export OCL_OUTPUT_CFG=1
or export OCL_OUTPUT_CFG_ONLY=1,
and beignet will output a .dot file of the CFG for each compiled kernel.

CFG_ONLY means the pure CFG without LLVM IR.
You can use xdot to view the .dot files.
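A typical session might look like this (the host program and .dot file names are illustrative; the actual names depend on your kernels):

```shell
# Dump the CFG (annotated with LLVM IR) for every kernel the program compiles:
export OCL_OUTPUT_CFG=1
# ...or dump the pure CFG without LLVM IR instead:
# export OCL_OUTPUT_CFG_ONLY=1
./my_ocl_program       # hypothetical host program that builds OpenCL kernels
xdot my_kernel.dot     # inspect the generated control-flow graph
```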

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoRuntime: increase batch size to 8K.
Ruiling Song [Fri, 11 Apr 2014 06:48:17 +0000 (14:48 +0800)]
Runtime: increase batch size to 8K.

We hit an assert on max_reloc in libdrm, so we simply work around it by
increasing the batch size; libdrm can then allow more bo relocations.
This fixes the assert when running the ocl HaarFixture test under simd8.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoenable mad for mul+sub.
Ruiling Song [Fri, 11 Apr 2014 06:48:16 +0000 (14:48 +0800)]
enable mad for mul+sub.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoGBE: Enable register spilling for SIMD16.
Zhigang Gong [Wed, 9 Apr 2014 16:05:26 +0000 (00:05 +0800)]
GBE: Enable register spilling for SIMD16.

Enable register spilling for SIMD16 mode. Introduce a
new environment variable, OCL_SIMD16_SPILL_THRESHOLD, to
control the threshold of SIMD16 register spilling. The default
value is 16, meaning that when more than 16 registers are
spilled, beignet will fall back to simd8.
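For example, to let SIMD16 kernels spill more registers before falling back (a usage sketch; the variable is read by beignet when the kernel is compiled):

```shell
# Raise the SIMD16 spill threshold from the default of 16 to 32, so
# beignet only falls back to simd8 beyond 32 spilled registers:
export OCL_SIMD16_SPILL_THRESHOLD=32
```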

Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Optimize read_image performance for CL_ADDRESS_CLAMP.
Zhigang Gong [Wed, 9 Apr 2014 10:25:22 +0000 (18:25 +0800)]
GBE: Optimize read_image performance for CL_ADDRESS_CLAMP.

The previous workaround (due to a hardware restriction) was to use
CL_ADDRESS_CLAMP_TO_EDGE to implement CL_ADDRESS_CLAMP, which is
not very efficient, especially because of the boundary checking overhead.
The root cause is that we need to check each pixel's coordinate.

Now we change to use the LD message to implement CL_ADDRESS_CLAMP. For
integer coordinates, we don't need to do the boundary checking at all. And
for float coordinates, we only need to check whether the coordinate is less
than zero, which is much simpler than before.
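The difference between the two addressing modes can be sketched in one dimension (illustrative Python, not beignet code; per this commit, the hardware LD message handles the upper-bound case itself):

```python
def read_clamp_to_edge(img, x):
    # CL_ADDRESS_CLAMP_TO_EDGE: the coordinate is clamped into range,
    # so both bounds have to be checked before sampling.
    return img[min(max(x, 0), len(img) - 1)]

def read_clamp(img, x, border=0):
    # CL_ADDRESS_CLAMP: out-of-range reads return the border color.
    # With the LD message the upper-bound case comes for free, leaving
    # only the "less than zero" test for float coordinates.
    if x < 0 or x >= len(img):
        return border
    return img[x]
```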

This patch could bring about 20% to 30% performance gain for luxmark's
medium and simple scene.

v2:
simplify the READ_IMAGE0.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fixed two 'long' related bugs.
Zhigang Gong [Tue, 8 Apr 2014 09:58:15 +0000 (17:58 +0800)]
GBE: fixed two 'long' related bugs.

The previous patch didn't modify some hard-coded numbers correctly.
Now fix them. This passes the corresponding regression tests in
piglit.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
10 years agoGBE: fix the flag usage of those long/64 bit instruction.
Zhigang Gong [Wed, 2 Apr 2014 06:36:19 +0000 (14:36 +0800)]
GBE: fix the flag usage of those long/64 bit instruction.

Make the flag allocation aware that the long/64-bit instructions
will use flag f0.1. And don't hard-code f0.1 at the gen_context
stage.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Optimize the bool register allocation/processing.
Zhigang Gong [Thu, 27 Mar 2014 16:38:29 +0000 (00:38 +0800)]
GBE: Optimize the bool register allocation/processing.

Previously, we had a global flag allocation implementation.
After some analysis, I found that global flag allocation is not
the best solution here.
For a cross-block reference of a bool value, we have to
combine it with the current emask, so there is no obvious advantage to
allocating a dedicated physical flag register for that cross-block usage.
We just need to allocate physical flags within each BB, and handle
the following cases:

1. The bool's liveness never extends beyond this BB, and the bool is only
   used as a dst register or a pred register. Such a bool value can be
   allocated in a physical flag only if there are enough physical flags.
   We already identify those bools at the instruction selection stage, and
   put them in the flagBooleans set.
2. The bool is defined in another BB and used in this BB; then we need
   to prepend an instruction at the position where we use it.
3. The bool is defined in this BB but is also used as some instruction's
   source register rather than the pred register. We have to keep the normal
   grf (UW8/UW16) register for this bool. For some CMP instructions, we need to
   append a SEL instruction to convert the flag to the grf register.
4. Even for spilled flags, if there is only one spilled flag, we will also
   try to reuse the temporary flag register later. This requires that all
   instructions get their flag at the instruction selection stage, and that
   the physical flag number is not used directly at the gen_context stage.
   Otherwise, this may break the algorithm here.
We track every validated bool value to avoid any redundant
validation of the same flag. But if there are not enough physical flags,
we have to spill a previously allocated physical flag. The spilling
policy is to spill the allocated flag whose liveness extends to the latest
end point.

Let's see a real example of the improvement from this patch.
Take compiler_vect_compare as an example; before this patch, the
instructions are as below:
    (      24)  cmp.g.f1.1(8)   null            g110<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      26)  cmp.g.f1.1(8)   null            g111<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      28)  (+f1.1) sel(16) g109<1>UW       g1.2<0,1,0>UW   g1<0,1,0>UW     { align1 WE_normal 1H };
    (      30)  cmp.g.f1.1(8)   null            g112<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      32)  cmp.g.f1.1(8)   null            g113<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      34)  (+f1.1) sel(16) g108<1>UW       g1.2<0,1,0>UW   g1<0,1,0>UW     { align1 WE_normal 1H };
    (      36)  cmp.g.f1.1(8)   null            g114<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      38)  cmp.g.f1.1(8)   null            g115<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      40)  (+f1.1) sel(16) g107<1>UW       g1.2<0,1,0>UW   g1<0,1,0>UW     { align1 WE_normal 1H };
    (      42)  cmp.g.f1.1(8)   null            g116<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      44)  cmp.g.f1.1(8)   null            g117<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      46)  (+f1.1) sel(16) g106<1>UW       g1.2<0,1,0>UW   g1<0,1,0>UW     { align1 WE_normal 1H };
    (      48)  mov(16)         g104<1>F        -nanF                           { align1 WE_normal 1H };
    (      50)  cmp.ne.f1.1(16) null            g109<8,8,1>UW   0x0UW           { align1 WE_normal 1H switch };
    (      52)  (+f1.1) sel(16) g96<1>D         g104<8,8,1>D    0D              { align1 WE_normal 1H };
    (      54)  cmp.ne.f1.1(16) null            g108<8,8,1>UW   0x0UW           { align1 WE_normal 1H switch };
    (      56)  (+f1.1) sel(16) g98<1>D         g104<8,8,1>D    0D              { align1 WE_normal 1H };
    (      58)  cmp.ne.f1.1(16) null            g107<8,8,1>UW   0x0UW           { align1 WE_normal 1H switch };
    (      60)  (+f1.1) sel(16) g100<1>D        g104<8,8,1>D    0D              { align1 WE_normal 1H };
    (      62)  cmp.ne.f1.1(16) null            g106<8,8,1>UW   0x0UW           { align1 WE_normal 1H switch };
    (      64)  (+f1.1) sel(16) g102<1>D        g104<8,8,1>D    0D              { align1 WE_normal 1H };
    (      66)  add(16)         g94<1>D         g1.3<0,1,0>D    g120<8,8,1>D    { align1 WE_normal 1H };
    (      68)  send(16)        null            g94<8,8,1>UD
                data (bti: 1, rgba: 0, SIMD16, legacy, Untyped Surface Write) mlen 10 rlen 0 { align1 WE_normal 1H };
    (      70)  mov(16)         g2<1>UW         0x1UW                           { align1 WE_normal 1H };
    (      72)  endif(16) 2                     null                            { align1 WE_normal 1H };

After this patch, it becomes:

    (      24)  cmp.g(8)        null            g110<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      26)  cmp.g(8)        null            g111<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      28)  cmp.g.f1.1(8)   null            g112<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      30)  cmp.g.f1.1(8)   null            g113<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      32)  cmp.g.f0.1(8)   null            g114<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      34)  cmp.g.f0.1(8)   null            g115<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      36)  (+f0.1) sel(16) g109<1>UW       g1.2<0,1,0>UW   g1<0,1,0>UW     { align1 WE_normal 1H };
    (      38)  cmp.g.f1.0(8)   null            g116<8,8,1>D    0D              { align1 WE_normal 1Q };
    (      40)  cmp.g.f1.0(8)   null            g117<8,8,1>D    0D              { align1 WE_normal 2Q };
    (      42)  mov(16)         g106<1>F        -nanF                           { align1 WE_normal 1H };
    (      44)  (+f0) sel(16)   g98<1>D         g106<8,8,1>D    0D              { align1 WE_normal 1H };
    (      46)  (+f1.1) sel(16) g100<1>D        g106<8,8,1>D    0D              { align1 WE_normal 1H };
    (      48)  (+f0.1) sel(16) g102<1>D        g106<8,8,1>D    0D              { align1 WE_normal 1H };
    (      50)  (+f1) sel(16)   g104<1>D        g106<8,8,1>D    0D              { align1 WE_normal 1H };
    (      52)  add(16)         g96<1>D         g1.3<0,1,0>D    g120<8,8,1>D    { align1 WE_normal 1H };
    (      54)  send(16)        null            g96<8,8,1>UD
                data (bti: 1, rgba: 0, SIMD16, legacy, Untyped Surface Write) mlen 10 rlen 0 { align1 WE_normal 1H };
    (      56)  mov(16)         g2<1>UW         0x1UW                           { align1 WE_normal 1H };
    (      58)  endif(16) 2                     null                            { align1 WE_normal 1H };

It reduces the instruction count from 25 to 18, saving about 28% of the instructions.

v2:
Fix some minor bugs.

Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoSilence some compilation warnings.
Zhigang Gong [Fri, 28 Mar 2014 06:57:51 +0000 (14:57 +0800)]
Silence some compilation warnings.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: avoid using a temporary register in the CMP instruction.
Zhigang Gong [Thu, 27 Mar 2014 08:27:18 +0000 (16:27 +0800)]
GBE: avoid using a temporary register in the CMP instruction.

With one SEL instruction, we can easily transfer a flag
to a normal bool vector register with the correct mask.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Add two helper scalar registers to hold 0 and all 1s.
Zhigang Gong [Thu, 27 Mar 2014 07:54:49 +0000 (15:54 +0800)]
GBE: Add two helper scalar registers to hold 0 and all 1s.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: don't emit jmpi to next label.
Zhigang Gong [Thu, 27 Mar 2014 05:30:00 +0000 (13:30 +0800)]
GBE: don't emit jmpi to next label.

As the following if will do the same thing, we don't need to
add the jmpi instruction.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: one instruction is enough for SEL_CMP now.
Zhigang Gong [Thu, 27 Mar 2014 02:05:56 +0000 (10:05 +0800)]
GBE: one instruction is enough for SEL_CMP now.

As we now have if/endif, SEL_CMP can write to the
dst register directly with the correct emask.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: pass the OCL_STRICT_CONFORMANCE env to the backend.
Zhigang Gong [Thu, 27 Mar 2014 02:00:57 +0000 (10:00 +0800)]
GBE: pass the OCL_STRICT_CONFORMANCE env to the backend.

Enable the mad pattern matching if strict conformance
is disabled.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Only emit a long jump when jumping over many blocks
Zhigang Gong [Thu, 27 Mar 2014 01:36:46 +0000 (09:36 +0800)]
GBE: Only emit a long jump when jumping over many blocks

In most cases, we don't need to emit a long jump at all.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Don't need the emask/notemask/barriermask any more.
Zhigang Gong [Thu, 27 Mar 2014 06:54:15 +0000 (14:54 +0800)]
GBE: Don't need the emask/notemask/barriermask any more.

As we changed to use if/endif and changed the implementation of
the barrier, we don't need to maintain emask/notemask/barriermask
any more. Just remove them.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Disable SPF and use JMPI + IF/ENDIF to handle each blocks.
Zhigang Gong [Tue, 18 Mar 2014 07:28:44 +0000 (15:28 +0800)]
GBE: Disable SPF and use JMPI + IF/ENDIF to handle each blocks.

When SPF (single program flow) is enabled, we always need to use f0
as the predicate of almost every instruction. This brings some
trouble when we want a two-level mask mechanism, for
example for the SEL instruction and some BOOL operations. We
have to use more than one instruction to do that, which simply
introduces 100% overhead for those instructions.

v2:
fix the wrong assertion.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Add if/endif/brc/brd instruction support.
Zhigang Gong [Mon, 17 Mar 2014 10:08:17 +0000 (18:08 +0800)]
GBE: Add if/endif/brc/brd instruction support.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: further optimize forward/backward jump.
Zhigang Gong [Mon, 17 Mar 2014 10:01:03 +0000 (18:01 +0800)]
GBE: further optimize forward/backward jump.

We don't need to save f0 at the last part of the block.
Just use it directly.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: use S16 vector to represent bool.
Zhigang Gong [Thu, 13 Mar 2014 10:54:31 +0000 (18:54 +0800)]
GBE: use S16 vector to represent bool.

The original purpose of using a flag or an S16 scalar to represent
a bool data type was to save register usage. But that brings too
much complexity to handle it correctly in every possible case. The
consequence is that we have to take too much care with bool
handling in many places in the instruction selection stage, and we
never handled all the cases correctly. The hardest part is
that we can't touch just part of the bits in an S16 scalar register;
there is no instruction to support that. So if a bool comes from
another BB, or even from the same BB when there is
a backward JMP and the bool is still a possible livein register,
we need to emit some instructions to keep the inactive lanes'
bits at their original values.

I changed to use an S16 vector to represent the bool type; then all
the complicated cases are gone. The only big side effect is
the register consumption. But considering that a real
application will not have many bools active concurrently, this
may not be a big issue.

I measured the performance impact using luxmark, and only
observed a 2%-3% performance regression. Some easy
performance optimization opportunities remain, such as reducing
the unnecessary MOVs between flag and bool within the same
block. I think this performance regression is not a big deal,
especially as this change will make the following if/endif
optimization a little bit easier.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: fix one misusage of flag in forward jump.
Zhigang Gong [Thu, 13 Mar 2014 04:33:39 +0000 (12:33 +0800)]
GBE: fix one misusage of flag in forward jump.

The forward jump instruction does not need the pred when comparing
the pcip with the next label. We should use the temporary flag
register instead.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: use a uniform style to calculate register size for curbe allocation.
Zhigang Gong [Wed, 12 Mar 2014 09:08:15 +0000 (17:08 +0800)]
GBE: use a uniform style to calculate register size for curbe allocation.

Concentrate the register allocation in one place, and don't
use hard-coded sizes when doing curbe register allocation. All
register size allocation should use the same method.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: fix the wrong usage of stack pointer and stack buffer.
Zhigang Gong [Wed, 12 Mar 2014 08:51:57 +0000 (16:51 +0800)]
GBE: fix the wrong usage of stack pointer and stack buffer.

The stack pointer and stack buffer should be two different virtual
registers: one is a vector and the other is a scalar. The reason the
previous implementation could work is that it searched the curbe offset
and made a new stack buffer register manually, which is not good.
Now fix it and remove that hacking code. We actually don't need
to use the curbe offset manually after the allocation.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: refine the "scalar" register handling.
Zhigang Gong [Wed, 12 Mar 2014 06:43:08 +0000 (14:43 +0800)]
GBE: refine the "scalar" register handling.

The "scalar" register's actual meaning is a uniform register,
and a non-uniform register is a varying register. For further
uniform analysis and bool data optimization, this patch
makes uniformity a new register data attribute. We
can mark each newly created register as a uniform or varying
register.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Remove BBs that have only a label instruction.
Zhigang Gong [Wed, 26 Mar 2014 10:27:40 +0000 (18:27 +0800)]
GBE: Remove BBs that have only a label instruction.

v2:
add an extra createCFGSimplificationPass right before the createGenPass,
and don't remove BBs at the GEN IR layer.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoGBE: Add a new pass to handle barrier function's noduplicate attribute correctly.
Zhigang Gong [Wed, 26 Mar 2014 05:45:56 +0000 (13:45 +0800)]
GBE: Add a new pass to handle barrier function's noduplicate attribute correctly.

This pass removes or adds the noduplicate function attribute for barrier functions.
Basically, we want to set NoDuplicate for the __gen_barrier_xxx functions. But if
a sub-function calls those barrier functions, the sub-function will not be inlined
in llvm's inlining pass, which is not what we want. As inlining such a function into
the caller is safe, and we just don't want the call duplicated, introduce this
pass to remove the NoDuplicate function attribute before the inlining pass and restore
it afterwards.

v2:
fix the module changed check.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
10 years agoStatistics of case running
Yi Sun [Tue, 1 Apr 2014 08:31:26 +0000 (16:31 +0800)]
Statistics of case running

summary:
-----------------
1. Add struct RStatistics to count the passed number (passCount), failed number (failCount), and finished run number (finishrun).

2. Print a statistics line; if the terminal is too narrow, don't print it:
  ......
  test_load_program_from_bin()    [SUCCESS]
  profiling_exec()    [SUCCESS]
  enqueue_copy_buf()    [SUCCESS]
   [run/total: 656/656]      pass: 629; fail: 25; pass rate: 0.961890

3. If a case crashes, count it as failed; add a function to show the statistics summary.

4. When all cases have finished, list a summary like the following:
summary:
----------
  total: 656
  run: 656
  pass: 629
  fail: 25
  pass rate: 0.961890

5. If you run ./utest_run &> log, the log will be a little messy; try the following command to analyze the log:

  sed 's/\r/\n/g' log | egrep "\w*\(\)" | sed -e 's/\s//g'

  After analysis:
  -----------------
......
builtin_minmag_float2()[SUCCESS]
builtin_minmag_float4()[SUCCESS]
builtin_minmag_float8()[SUCCESS]
builtin_minmag_float16()[SUCCESS]
builtin_nextafter_float()[FAILED]
builtin_nextafter_float2()[FAILED]
builtin_nextafter_float4()[FAILED]
......

6. Fix one issue: print out the crashed case's name.

7. Delete the debug line in utests/compiler_basic_arithmetic.cpp that
   printed the kernel name.

8. Define the function statistics() in struct UTest, which is called by "utest_run -a/-c/-n".
   We just call this function to run each case and print the statistics line.

Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd one test case specifically for unaligned buffer copy.
Junyan He [Wed, 26 Mar 2014 10:28:02 +0000 (18:28 +0800)]
Add one test case specifically for unaligned buffer copy.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoOptimize the unaligned buffer copy logic
Junyan He [Wed, 26 Mar 2014 10:27:56 +0000 (18:27 +0800)]
Optimize the unaligned buffer copy logic

Because the byte-aligned read and write send instructions are
very slow, we optimize to avoid using them.
We separate the unaligned copy into three cases:
   1. The src and dst have the same %4 unaligned offset.
      Then we just need to handle the first and last dword.
   2. The src has a bigger %4 unaligned offset than the dst.
      We need to do some shifting and merging between src[i]
      and src[i+1].
   3. In the last case, the src has a smaller %4 unaligned offset.
      Then we need to do the same for src[i-1] and src[i].
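The shift-and-merge idea behind cases 2 and 3 can be sketched in Python (an illustrative model, not the actual OpenCL kernels; it reads only aligned dwords and reassembles the unaligned bytes with shifts):

```python
import struct

def unaligned_read(src: bytes, src_off: int, n: int) -> bytes:
    """Read n bytes at an unaligned src_off using only aligned 32-bit
    (dword) loads, merging each pair of adjacent dwords with shifts."""
    shift = (src_off % 4) * 8             # bit offset inside the first dword
    base = src_off - src_off % 4          # aligned starting address
    ndw = (src_off % 4 + n + 3) // 4 + 1  # aligned dwords covering the range
    padded = src + b"\x00" * max(0, base + ndw * 4 - len(src))
    dwords = struct.unpack_from("<%dI" % ndw, padded, base)
    out = bytearray()
    for i in range(ndw - 1):
        # High bytes of dwords[i] joined with low bytes of dwords[i+1].
        merged = (dwords[i] >> shift) | (dwords[i + 1] << (32 - shift))
        out += struct.pack("<I", merged & 0xFFFFFFFF)
    return bytes(out[:n])
```

Case 1 in the list above needs no merging for the middle dwords; only the first and last dwords are partial.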

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd three copy cl files for Enqueue Copy usage.
Junyan He [Wed, 26 Mar 2014 10:27:48 +0000 (18:27 +0800)]
Add three copy cl files for Enqueue Copy usage.

Add these three cl files:
one for when src and dst are not aligned but have the same offset modulo 4,
a second for when src's %4 offset is bigger than the dst's, and
a third for when src's %4 offset is smaller than the dst's.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
10 years agoAdd kernel performance output
Yongjia Zhang [Tue, 1 Apr 2014 09:16:46 +0000 (17:16 +0800)]
Add kernel performance output

If the environment variable OCL_OUTPUT_KERNEL_PERF is set to a non-zero
value, then after the executable program exits, beignet will output the
timing information of each kernel executed.

v2: fixed the patch's trailing whitespace problem.

v3: if OCL_OUTPUT_KERNEL_PERF is 1, then the output only
contains the time summary; if it is 2, then the output contains the
time summary and details. Also output 'Ave' and 'Dev': 'Ave' is
the average time per kernel per execution round, and 'Dev' is
'Ave' divided by the standard deviation of all of that kernel's executions.
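Typical usage might look like this (the host program name is illustrative):

```shell
# Print only the per-kernel time summary when the program exits:
export OCL_OUTPUT_KERNEL_PERF=1
# ...or print the summary plus per-execution detail:
# export OCL_OUTPUT_KERNEL_PERF=2
./my_ocl_program
```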

Signed-off-by: Yongjia Zhang <yongjia.zhang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>