review.tizen.org Git - contrib/beignet.git/log

GBE: Add support for double to float conversion.

Previously, double to float conversion incorrectly went down the
int64 to float code path, and gen_encoder had no real double to
float conversion support. This patch fixes both issues.

v2:
fix some bugs on the HSW platform.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

GBE: optimize a special case of INT64 to float conversion.

We found the following instruction sequence is common
in luxmark:
CVT.int64.uint32 %75 %74
LOADI.int64 %537 16777215
AND.int64 %76 %75 %537
CVT.float.uint64 %77 %76

Actually, the immediate value is a pure 32-bit value,
and %74 is also a uint32 value. The AND instruction
will not touch the high 32 bits either. So we can simply optimize
the above instruction sequence to the following:
AND.uint32 %tmp %74 16777215
MOV.float %77 %tmp

This saves about 55 instructions for each occurrence of the
above case. This patch brings about an 8% performance
gain with the sala scene in luxmark.
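
A minimal sketch of why the two sequences agree, assuming plain C++ integer
and float semantics stand in for the Gen IR. The function names are
hypothetical and are not taken from the patch.

```cpp
#include <cstdint>

// Original path: widen to 64 bits, mask, then convert uint64 -> float.
float maskThenConvert64(uint32_t v) {
  uint64_t wide = v;                        // CVT.int64.uint32
  uint64_t masked = wide & 16777215ull;     // LOADI.int64 + AND.int64
  return static_cast<float>(masked);        // CVT.float.uint64
}

// Optimized path: mask in 32 bits, then convert uint32 -> float directly.
float maskThenConvert32(uint32_t v) {
  uint32_t tmp = v & 16777215u;             // AND.uint32
  return static_cast<float>(tmp);           // MOV.float (converting move)
}
```

The masked value always fits in 24 bits, so the conversion to float is exact
on both paths and the results are identical.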

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

add DRM_LIBDIR path into link directory list

Then beignet can link to the user's preferred drm library rather than the default one.

Signed-off-by: Li Peng <peng.li@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

HSW: Fix a compact assert.

Also use const static int instead of const int to avoid a build error
with some gcc versions.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Optimize phi elimination

During phi elimination, we simply insert 3 MOVs for one phi instruction
to avoid the lost copy issue. But in fact, only two of them are needed
most of the time. This patch tries to see whether the move from phiCopy
to phi can be avoided.

The patch basically checks whether the phiCopy and phi have live range
interference. If not, they can be coalesced, and one instruction
can be eliminated.
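
A rough sketch of the interference check, assuming live ranges can be
modelled as closed intervals of instruction indices. The real backend works
on liveness information per basic block, so LiveRange and both functions
here are hypothetical simplifications.

```cpp
#include <cassert>

// Hypothetical simplified live range: [start, end] in instruction indices.
struct LiveRange {
  int start;
  int end;
};

// Two ranges interfere when they overlap at any program point.
bool interferes(const LiveRange &a, const LiveRange &b) {
  assert(a.start <= a.end && b.start <= b.end);
  return a.start <= b.end && b.start <= a.end;
}

// MOVs needed for one phi: 3 by default, 2 when phiCopy and phi can be
// coalesced because their live ranges do not interfere.
int movsNeededForPhi(const LiveRange &phiCopy, const LiveRange &phi) {
  return interferes(phiCopy, phi) ? 3 : 2;
}
```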

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Revert "GBE: No need to compute liveout again in value.cpp."

We need to transfer ValueDef from predecessors to their successors.
Consider a register defined in BB0 and used in BB3. We need to
iterate over liveout to pass the def in BB0 to BB3, so the use
in BB3 can get the correct def. Otherwise, the UD/DU graph is incomplete.

This reverts commit 89b490b5a17cfda2d9816dc1c246ce5bbff12648.
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

refine code for the usage of set_image_base_index

In libgbe.so and libgbeinterp.so, the same function pointer name
gbe_set_image_base_index is used for a unified source code.

In libcl.so, function pointer names beginning with compiler_* point to
the functions from libgbe.so, and function pointer names beginning with
gbe_* point to the functions from libgbeinterp.so.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Fix bitcast between long and other type.

As we store long low/high 32bits separately, when we do bitcast
like int64 --> int16, the horizontal stride of the int64's low/high
half should be set as 2 instead of 4.

This fixes a regression in the opencv test:
Imgproc/Threshold.Mat/40, where GetParam() = (16SC1, 0, 0, false)

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Make utest pass rate reach 100%.

1. Add more input values
2. remove case pow(0,0)
3. remove negative value tests for powr and pown

Signed-off-by: Yi Sun <yi.sun@intel.com>
Signed-off-by: Yangweix Shui <yangweix.shui@intel.com>

Refine some test for math function

1. nextafter: we originally used nextafter to compute the CPU reference result. Its return value is double, so change it to nextafterf.
2. sinpi: add a check to reduce the input data range from [-2pi,2pi] to [-pi,pi].
3. cospi: define the cospi function.
4. tanpi: define the tanpi function using sinpi/cospi.
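
A minimal sketch of CPU-side reference functions of this kind, assuming
sinpi(x) means sin(pi * x) as in OpenCL C. The names sinpiRef, cospiRef,
tanpiRef and nextFloat are hypothetical and are not the utest helpers.

```cpp
#include <cassert>
#include <cmath>

const double kPi = 3.14159265358979323846;

// Compute in double, then round once to float for comparison with the GPU.
float sinpiRef(float x) { return static_cast<float>(std::sin(kPi * x)); }
float cospiRef(float x) { return static_cast<float>(std::cos(kPi * x)); }

// tanpi built from sinpi/cospi, as described in item 4.
float tanpiRef(float x) {
  const float c = cospiRef(x);
  assert(c != 0.0f);  // the reference is undefined where cospi is zero
  return sinpiRef(x) / c;
}

// Float-precision neighbour. The double version steps by a double ulp,
// which disappears when the result is rounded back to float.
float nextFloat(float x, float toward) { return std::nextafterf(x, toward); }
```

For example, nextFloat(1.0f, 2.0f) is strictly greater than 1.0f, while
std::nextafter(1.0, 2.0) cast to float rounds back to exactly 1.0f.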

Signed-off-by: Yi Sun <yi.sun@intel.com>
Signed-off-by: YangweiX Shui <yangweix.shui@intel.com>

Refine the cl thread implementation for queue.

Because the cl_command_queue can be used in several threads simultaneously
without adding a reference to it, we now handle it like this:
Keep one threads_slot_array. Every time a thread gets the gpgpu or batch buffer, if it
does not have a slot, assign one.
The resources are kept in the queue's private data, which is resized if needed.
When the thread exits, the slot is set invalid.
When the queue is released, all the resources are released. If the user still enqueues, flushes
or finishes the queue after it has been released, the behavior is undefined.
TODO: Need to shrink the slot map.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix timestamp on HASWELL

The GPU timestamp should be the lower 36 bits on HASWELL
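
A small sketch of the masking this implies, assuming the raw value is read
as a 64-bit integer. hswTimestamp and kHswTimestampBits are hypothetical
names, not the driver's.

```cpp
#include <cstdint>

// Number of valid low bits in the raw timestamp on Haswell.
constexpr unsigned kHswTimestampBits = 36;

// Keep only the lower 36 bits of the raw 64-bit value.
uint64_t hswTimestamp(uint64_t raw) {
  const uint64_t mask = (uint64_t{1} << kHswTimestampBits) - 1;
  return raw & mask;
}
```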

Signed-off-by: Li Peng <peng.li@intel.com>
Reviewed-by: He Junyan <junyan.he@inbox.com>

extract libgbeinterp.so from runtime (libcl.so)

Currently, there are identical symbol names in libinterp.a (inside
libcl.so) and libgbe.so (compiler), so we have to dlopen libgbe.so
with RTLD_DEEPBIND. This flag makes std::cerr inside libgbe crash.

Extract the interp part from libcl.so as libgbeinterp.so. Then
first dlopen libgbe.so without RTLD_DEEPBIND, and then dlopen libgbeinterp.so
with RTLD_DEEPBIND, to fix the std::cerr crash issue.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: fix one illegal instruction when replacing a uniform dst.

When the dst is a uniform value, we replace it with a vector value. Copying
the vector value back may then generate an illegal instruction, as below
at address 18:

    (14      )  mov(16)         g124<1>:F       g127.7<0,1,0>:F                 { align1 WE_all 1H };
    (16      )  send(16)        g122<1>:UW      g124<8,8,1>:UD
                data (bti: 1, rgba: 14, SIMD16, legacy, Untyped Surface Read) mlen 2 rlen 2 { align1 WE_all 1H };
    (18      )  mov(1)          g127.6<1>:F     g122<8,8,1>:F                   { align1 WE_all };

This patch fixes the issue and generates the correct instruction, as below:

    (      14)  mov(16)         g124<1>:UD      g127.7<0,1,0>:UD                { align1 WE_all 1H };
    (      16)  send(16)        g122<1>:UW      g124<8,8,1>:UD
                data (bti: 1, rgba: 14, SIMD16, legacy, Untyped Surface Read) mlen 2 rlen 2 { align1 WE_all 1H };
    (      18)  mov(1)          g127.6<1>:UD    g122<0,1,0>:UD                  { align1 WE_all };

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

utests: disable double test case.

As we cannot provide full support for double now,
and my patch to refine long support breaks double load/store,
we disable all double test cases.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: Pass correct register type when replaceReg

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: Change 64bit integer storage in register

Previously, we stored the low/high halves of a 64-bit value together, which needed several
32-bit instructions to do one 64-bit instruction. Now we simply change its
storage in registers: the low 32 bits of all lanes are stored together, followed by the
high 32 bits of all lanes. This makes long support cleaner and requires fewer
32-bit instructions.

v2:
fix a typo when getRegAtrrib().
Refine SelectionVector alignment.
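
A sketch of the two layouts as byte offsets from the start of the 64-bit
value's register storage, assuming one 32-bit half per lane is packed
contiguously. HalfOffsets, interleavedLayout and splitLayout are
hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

struct HalfOffsets {
  std::size_t low;   // byte offset of the lane's low 32 bits
  std::size_t high;  // byte offset of the lane's high 32 bits
};

// Old layout: each lane's low and high dwords sit side by side.
HalfOffsets interleavedLayout(std::size_t lane) {
  return {lane * 8, lane * 8 + 4};
}

// New layout: low dwords of all lanes first, then high dwords of all lanes.
HalfOffsets splitLayout(std::size_t lane, std::size_t simdWidth) {
  assert(lane < simdWidth);
  return {lane * 4, simdWidth * 4 + lane * 4};
}
```

With the split layout, each half is a contiguous run of dwords, so one
32-bit instruction can process the low (or high) half of all lanes at once.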

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: optimize scalar data type conversion.

If the dst is scalar, the register region restriction is relaxed.
We can save one instruction, as below:

    (12      )  mov.sat(1)      g127.24<4>:B    g1.3<0,1,0>:D              { align1 WE_all };
    (14      )  mov(1)          g127.28<1>:B    g127.24<0,1,4>:D           { align1 WE_all };

Optimized to:

    (12      )  mov.sat(1)      g128.28<4>:B    g1.3<0,1,0>:D              { align1 WE_all };

No need to create a temporary register g127.24.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: fix uniform/scalar related bugs.

One major fix is that even if a register is a scalar, when
we move a scalar Dword to a scalar Byte, we have to set
the hstride to 4. Otherwise, it breaks the following
register restriction:
  B. When the Execution Data Type is wider than the destination data type,
     the destination must be aligned as required by the wider execution data
     type and specify a HorzStride equal to the ratio in sizes of the two data
     types. For example, a mov with a D source and B destination must use a
     4-byte aligned destination and a Dst.HorzStride of 4.

The following instruction may not take effect:
mov.sat(1)  g127.4<1>:B  g126<0,1,0>:D
We have to change it to:
mov.sat(1)  g127.4<4>:B  g126<0,1,0>:D

v2: keep the instruction selection stage unchanged; we fix this restriction
    in setDst only.
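
The quoted restriction can be sketched as a small helper. This assumes type
sizes are given in bytes, and returning 1 (packed) when the execution type
is not wider is an assumption of this sketch, not something the restriction
states. requiredDstHStride is a hypothetical name, not the setDst code.

```cpp
#include <cassert>

// Destination horizontal stride required when the execution data type is
// wider than the destination type: the ratio of the two sizes.
int requiredDstHStride(int execTypeSize, int dstTypeSize) {
  assert(execTypeSize > 0 && dstTypeSize > 0);
  if (execTypeSize > dstTypeSize)
    return execTypeSize / dstTypeSize;
  return 1;  // assumption: no widening needed, keep a packed destination
}
```

For a D source (4 bytes) and a B destination (1 byte) this gives 4, which
matches the corrected mov.sat above.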

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: fix a regression for piglit test.

Accessing this->store[insnID+2] is not always safe, as it may
not exist.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

Return CL_IMAGE_FORMAT_NOT_SUPPORTED if image_format is not supported.

Also move the cl_image_byte_per_pixel call before cl_image_get_supported_fmt
to return the correct error code when the format is invalid.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Silence compilation warnings in release builds.

Also silence warnings on 32-bit systems.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Silence some compilation warnings.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

GBE: Consolidate all read/write instruction's bti handling.

The previous bti handling for each read/write instruction was
slightly different from the others. There are two major bugs.
First, OP_ATOMIC stores the bti in a different position, so the
post scheduling for ATOMIC instructions is buggy.
Second, the DWORD_GATHER instruction is not in
the isRead list, which may cause bugs.

This patch fixes both of them.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

separate runtime(libcl.so) and compiler(libgbe.so)

On embedded/handheld devices, storage and memory are scarce, so it is
necessary to provide only a small OpenCL runtime library,
and only executable binary kernels will be supported on such devices.

At the beginning of the process (before function main), the OpenCL runtime
(libcl.so) tries to load the compiler (libgbe.so). If it loads successfully,
the system's behavior is the same as before. Otherwise,
the runtime assumes there is no OpenCL compiler in the system, the device
info is changed to CL_DEVICE_COMPILER_AVAILABLE=false and
CL_DEVICE_PROFILE="EMBEDDED_PROFILE", and clBuildProgram returns
CL_COMPILER_NOT_AVAILABLE if the program is created with
clCreateProgramWithSource, following the OpenCL spec.

To simulate the case without OpenCL compiler, just delete the file
libgbe.so, or export OCL_NON_COMPILER=1.

Some explanation of the binary kernel interpreter (libinterp.a):

libinterp.a is used to interpret the binary kernel inside runtime,
and the runtime library libcl.so is built against libinterp.a.

Since the code to interpret binary kernels is tightly integrated inside
the compiler, to avoid code duplication, a new file gbe_bin_interpreter.cpp
is created to include some other .cpp files. To keep libinterp.a small
(and thereby keep libcl.so small), the macro GBE_COMPILER_AVAILABLE
is used to make only the needed code active when building libinterp.a.

V2: code base is changed to call function gbe_set_image_base_index in
gbe_bin_generater, while this function is modified in this patch as
gbe_set_image_base_index_compiler, fix it accordingly.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: fix baytrail L3 cache configuration.

Reducing URB from 128KB to 64KB causes rendering artifacts in X windows.
I have to change it to 96KB URB and also change the RO and DC to 16KB
to satisfy the total 192KB L3 size limitation.

With this fix, the artifacts are gone and utests have no new failures.

Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Guo Yejun <yejun.guo@intel.com>

GBE: Make compatible with old gcc version.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

Refine pci id detecting.

Some platforms do not have the key word "Gen" or "Graphic" in the output of the
lspci command, so we failed to get the pci id in such cases.
We now just use the 8086 key word to get the sub pci id, and compare
it to all the known gen pci ids. This should be safe on all platforms.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: fix post scheduling related bug for spill/unspill.

spill/unspill instructions touch some registers directly which
are not in dst/src. This breaks the post scheduling. Simply
work around it by adding all the reserved registers to the dst
array.

The scratch memory is not correctly indexed and the barrier is
not handled properly.

After this patch, the post scheduling will be enabled by default.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: optimize post reg allocation instruction scheduling.

To make the post scheduling work better, I relax the frequency of
calls to expireGRF during register allocation. Thus we can
reduce physical register conflicts and do the post scheduling.

Another optimization is to insert a pre retire for the instruction
to release WRITE_AFTER_READ dependencies. Write after read will
not bring any hazard, so we can release those registers as soon as
the instruction is scheduled.

The pre register allocation scheduling is quite different from post
scheduling; for now, just disable it.

The whole patch brings about a 10% performance gain with luxmark.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fix one post register allocation instruction scheduling bug.

The instruction has modFlag 1, indicating it will modify the flag.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: disable mad for some cases.

One case is when one operand is an imm value. Then mad
saves one instruction but adds an extra LOADI instruction,
so we don't need to bother to use mad in this case.
And considering when we optimize the simd16 under simd8
mode, we

The other case is under simd16 mode. As mad is a 3-src instruction,
which only supports simd8, one mad(16) instruction is converted to
two mad(8) instructions. So we don't need to use mad there either.
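
The two cases can be summarised as a predicate. This is only a sketch:
shouldUseMad is a hypothetical name, and the real pattern matcher takes
more context into account than these two inputs.

```cpp
#include <cassert>

// Decide whether a multiply-add pair is worth fusing into mad.
bool shouldUseMad(bool anyOperandIsImm, int simdWidth) {
  assert(simdWidth == 8 || simdWidth == 16);
  if (anyOperandIsImm)
    return false;  // the saved instruction is paid back by an extra LOADI
  if (simdWidth == 16)
    return false;  // mad(16) is split into two mad(8), so nothing is saved
  return true;
}
```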

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fix a uniform analysis bug.

If a value is defined in a loop and is used outside of the
loop, that value cannot be a uniform (scalar) value.
The reason is that the value may be assigned different
scalar values on different lanes when the loop is re-entered with
different lanes active.
Thanks to yang rong for reporting this bug.

Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: don't allocate/modify flag if it is not used in current BB.

If a flag is not used in current BB, we don't need to
set the modFlag bit on that instruction. Thus the register
allocation stage will not allocate a flag register for it.

No performance impact, as the previous implementation will
expire that flag register immediately.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: optimize IMM handling for SEL/SEL_CMP/CMP.

Actually, all of the above 3 instructions could avoid
one LOADI instruction by switching operands position.

This patch implements this optimization and consolidates
all optimizations of the same type into one place.

No obvious performance impact on luxmark.

v2:
fix some wrong indent.
v3:
fix the OP_ORD issue. OP_ORD uses both src0/src1 as both src0/src1,
so it can't use this IMM optimization.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: optimize SUB dst, imm, src1 instruction.

We could easily convert it to SUB dst, -src1, -imm.
Thus we can avoid one LOADI instruction eventually.
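
A quick check of the identity behind this rewrite, assuming 32-bit
wraparound integer arithmetic so that negation is always defined. Both
function names are hypothetical.

```cpp
#include <cstdint>

// Original form: dst = imm - src1.
uint32_t subImmSrc(uint32_t imm, uint32_t src1) { return imm - src1; }

// Rewritten form: dst = (-src1) - (-imm), with the immediate as src1.
uint32_t subNegated(uint32_t imm, uint32_t src1) {
  const uint32_t negSrc1 = 0u - src1;
  const uint32_t negImm = 0u - imm;
  return negSrc1 - negImm;
}
```

Since (-src1) - (-imm) equals imm - src1 modulo 2^32, the two forms produce
the same value, and the immediate ends up in the second source position.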

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: optimize CMP instruction encoding.

This patch fixes the following two things.
1. Use a temporary register as dst register for the CMP
instruction in the middle of a block.
2. fix the switch flag for the CMP instruction at the beginning
of each block. The compact instruction handling handles
the cmp instruction directly and ignores the switch
flag, which is incorrect.

This patch could get about 2-3% performance gain for luxmark.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: refine disassembly code to show null register's type.

We should show the null register's type in the assembly output, because
a null register may use a wrong type, as in the following
instruction:

cmp.le(8) null:UW g2<8,8,1>:F 0.1F

It is a fatal error from the hardware point of view. We should
output that information.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

gbe_bin_generater: fix two bugs.

The pci id detecting method is broken on some systems.
And the gen pci id parsing in gbe_bin_generater is incorrect when
the pci id contains a-f hex digits.

v2:
Add VGA to filter out some nonVGA devices.
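
A minimal sketch of hex-aware id parsing for the a-f issue described above,
assuming the id arrives as text. parsePciId is a hypothetical helper, not
the code in gbe_bin_generater, and the ids in the usage note are only
examples.

```cpp
#include <string>

// Parse a PCI device id written in hex, with or without a 0x prefix.
// std::stoul throws std::invalid_argument if no digits can be parsed.
unsigned parsePciId(const std::string &text) {
  return static_cast<unsigned>(std::stoul(text, nullptr, 16));
}
```

With base 16, parsePciId("0a2e") yields 0x0a2e, whereas a decimal parse
would stop at the first a-f digit.
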
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>

correct L3 cache settings for baytrail

baytrail and ivb have different register bits layout for L3 cache,
so, add a special path for baytrail.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

move enqueue_copy_image kernels outside of runtime code.

Separate the kernel code from host code to make it clean; build the
kernels offline with gbe_bin_generator to improve performance.

v2:
fix the image base issue with the standalone compiler.

Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

fix event related bugs.

1. remove repeated user events in list.
2. add missing braces in loops.
3. fix barrier event reference not being increased.

Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: optimize builtin atan2.

clang will generate extra stores for the implementation.
So, put the data in __constant address space.
This will improve opencv test PhaseFixture_Phase by 3x.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Fix the bug of forgetting to release the sampler in utest.

The utest helper will not free the sampler resource for us
as it does for buffers and kernels. So we need to release it ourselves.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: fix unpacked_uw/ub on uniform registers.

unpacked_uw/ub macros hard coded the register's width to 8
which is bad for uniform registers. This patch fixes that issue.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

Add the pci id support for gbe_generate

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

Fix map gtt fail when memory object size is too large.

After the max allocation size was changed to 256M, mapping a large memory object through gtt
fails on some systems. So when the image size is larger than 128M, disable tiling and
use the normal map. But clEnqueueMapBuffer/Image may still fail because of the
unsync map.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

HSW: Correct the scratch buffer size calc and set the correct index in vfe state.

HSW's scratch buffer alignment and the index set in vfe state are different from IVB.
And when calculating the per-thread stack offset, R0.0's FFTID is used; the definition of
FFTID also changed in HSW.
With this patch, all utests pass.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>

HSW: Fix the atomic msg type typo.

The atomic msg type should be GEN75_P1_UNTYPED_ATOMIC_OP. Correct it.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>

Correct the double bug in HSW.

Should set the nomask in mov_df_imm and need to handle the exec_width=4 case in setHeader.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>

HSW: Use the drm flag I915_EXEC_ENABLE_SLM to set L3 control config.

Because LRI commands will be converted to NOOP, add the I915_EXEC_ENABLE_SLM
flag to the drm kernel driver to enable SLM in the L3. Set the flag when the
application uses slm. Still keep the L3 config in the batch buffer for fulsim.
Also create and use OpenCL's own context when executing, to avoid affecting other contexts.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>

HSW: Workaround the slm address issue.

Each work group has its own slm offset, and when dispatching threads,
TSG handles it automatically on IVB. But it fails on HSW.
After checking, all work groups' slm offsets are 0, even though the slm index is
correct in R0.0. So calculate the slm offset from the slm index, and add it
to the slm address.
TODO: need to find the root cause.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>

Enable pipe control.

The previous pipe control doesn't work, because it doesn't advance the batch buffer.
So the value set in function intel_gpgpu_pipe_control will be flushed later. Fix it.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>

Fix a crash when the clSetKernelArg parameter points to a NULL value.

Per the OCL spec, if the arg_value of clSetKernelArg is a memory object, it can be
NULL or point to NULL. The driver only handles the NULL case and crashes if it points to NULL.
Correct it.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>

HSW: align buffer's size to DWORD.

HSW: Byte scattered Read/Write requires that the buffer size be a multiple of 4 bytes.
So simply align all buffer sizes to 4. Pass utest compiler_function_constant0.

Because it is a very light workaround, align without checking the device.
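
The size rounding can be sketched as follows, assuming a power-of-two
alignment. alignUp is a hypothetical helper name, not the runtime's.

```cpp
#include <cassert>
#include <cstddef>

// Round size up to the next multiple of alignment (a power of two).
std::size_t alignUp(std::size_t size, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (size + alignment - 1) & ~(alignment - 1);
}
```

A 5-byte buffer request would then be allocated as alignUp(5, 4), which is
8 bytes, while sizes that are already DWORD multiples are unchanged.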

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>

Modify the GenContext and GenEncoder's destructor to virtual

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Runtime: Fix a bug in L3 configuration.

We forgot to set L3SQCREG1 register.
And also add a more suitable configuration.
This patch improves the Luxmark score by more than 50%.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: fix one regression caused by uniform analysis.

Some instructions handle simd1 incorrectly. Disable them
currently.

v2:
add addsat into the unsupported list.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: fix the legacy use of isScalarOrBool.

isScalarOrBool is a legacy function which was used when bool
was treated as a scalar register by default. Now that we use a
normal vector word register to represent bool, we no longer need to
keep this macro. Replace all of its uses with isScalarReg.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>

GBE: enable uniform analysis for bool data type.

v2:
refine the flag allocation implementation.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>

GBE: enable uniform for load instruction.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>

GBE: implement uniform analysis.

We have many uniform (scalar) input values which include
the kernel input argument and some special registers.

And all variables derived only from uniform values are
also uniform values. This patch analyzes this type of register
at the liveness analysis stage, and changes the uniform register's
type to scalar type. Then later, these registers need
less register space.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>

GBE: change scalar byte size to 2 from 1.

Because the exec size is always larger than or equal to 2,
we need to change the scalar byte size to 2 rather than
1. Otherwise, it may generate the following illegal instruction:

(17 ) mov(1) g127.31<1>UB 0x2UW { align1 WE_all };

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>

GBE: No need to compute liveout again in value.cpp.

We already did a complete liveness analysis at the liveness.cpp.
Don't need to do that again. Save about 10% of the compile time.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>

Fix double bugs for hsw

In bspec, IVB should use SIMD8 for double ops, but HSW should use SIMD4.
TODO: The long ops may also need to change.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Signed-off-by: Junyan He <junyan.he@linux.intel.com>

Make the surface typed write work for HSW

1. Modify the typed write for state write using GEN_SFID_DATAPORT_DATA_CACHE.
2. Add the channel select for surface state setting.
3. Correct the send message for setting slot in send description.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

correct jump distance of hsw's jmpi.

Gen5+ bspec: the jump distance is in number of eight-byte units.
Gen7.5+: the offset is in units of 8 bits for JMPI, 64 bits for other flow control instructions.
So we need to multiply all jump distances by 8 for jmpi.
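
The unit change can be sketched as a one-line conversion, assuming the
distance is first computed in eight-byte units as on earlier Gen.
jmpiDistance and the isGen75 flag are hypothetical names.

```cpp
// Convert a jump distance in eight-byte units to the value JMPI expects.
// On Gen7.5 (Haswell) JMPI counts bytes, so the distance is scaled by 8.
int jmpiDistance(int distanceInQwords, bool isGen75) {
  return isGen75 ? distanceInQwords * 8 : distanceInQwords;
}
```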

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>

Using a correct DATAPORT and SFID for some send message of haswell.

Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Signed-off-by: Junyan He <junyan.he@linux.intel.com>

Add Gen75Context and Gen75Encoder class for hsw

We will create the Gen75Context and Gen75Encoder
dynamically based on the vendor ID, which is the same
as the PCI ID.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

Update the device info description for HSW

Split the cl_device_id description for HSW into
GT1, GT2 and GT3, with different parameters.

Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>

Runtime: change default tiling mode to TILE_X from TILE_Y.

Nanhai found that the tiling mode matters a lot for performance
in some cases. So make the tiling mode configurable at runtime
and make the default tiling mode TILE_X, which is much
better than the original TILE_Y in many cases.

At runtime, it is easy to change the default tiling mode, as below:
export OCL_TILING=0 # enable NO_TILE
export OCL_TILING=1 # enable TILE_X
export OCL_TILING=2 # enable TILE_Y
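
A sketch of how a runtime might map this variable to a tiling mode. The
enum and function names are hypothetical, and falling back to TILE_X for
unset or unrecognised values is an assumption of this sketch.

```cpp
#include <cstdlib>

enum class Tiling { NoTile = 0, TileX = 1, TileY = 2 };

// Map the text of OCL_TILING to a tiling mode, defaulting to TILE_X.
Tiling parseTiling(const char *value) {
  if (value == nullptr)
    return Tiling::TileX;  // variable not set
  char *end = nullptr;
  const long n = std::strtol(value, &end, 10);
  if (end == value)
    return Tiling::TileX;  // not a number (assumed fallback)
  switch (n) {
    case 0: return Tiling::NoTile;
    case 1: return Tiling::TileX;
    case 2: return Tiling::TileY;
    default: return Tiling::TileX;  // out of range (assumed fallback)
  }
}

// Read the mode from the process environment.
Tiling tilingFromEnvironment() {
  return parseTiling(std::getenv("OCL_TILING"));
}
```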

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

GBE: Merge successive load/store together for better performance.

Gen supports at most 4 DWORD read/write in one single instruction.
So we merge successive reads/writes for fewer instructions and better performance.
This improves performance by about 10% for the LuxMark medium scene.
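
A simplified sketch of the merging idea, assuming accesses are given as
DWORD offsets in program order and only neighbours in that order are
merged. MergedAccess and mergeSuccessive are hypothetical names; the real
pass works on IR load/store instructions and must also check aliasing.

```cpp
#include <cstdint>
#include <vector>

struct MergedAccess {
  uint32_t firstDword;  // DWORD offset of the first element
  unsigned count;       // number of consecutive DWORDs, 1 to 4
};

// Greedily merge consecutive DWORD accesses into groups of at most 4.
std::vector<MergedAccess> mergeSuccessive(
    const std::vector<uint32_t> &dwordOffsets) {
  const unsigned kMaxDwords = 4;
  std::vector<MergedAccess> merged;
  for (uint32_t off : dwordOffsets) {
    if (!merged.empty()) {
      MergedAccess &last = merged.back();
      if (last.count < kMaxDwords && off == last.firstDword + last.count) {
        ++last.count;  // extends the current group
        continue;
      }
    }
    merged.push_back({off, 1});  // start a new group
  }
  return merged;
}
```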

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Refine logic of finding where the local variable is defined.

Traverse all uses of the local variable, as there may be some dead uses.
Most of the time, the function will return fast, as the use tree is not deep.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

do not serialize zero image/sampler info into binary

if there is no image/sampler used in kernel source, it is not
necessary to serialize the zero image/sampler info into kernel binary.

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

GBE: fix one potential bug in UnsignedI64ToFloat.

Set exp to a proper value to make sure all the inactive lanes
flag bits are 1s which satisfy the requirement of the following
ALL16/ALL8H condition check.

v2:
enable the first JMPI's optimization rather than the second,
as it has higher probability.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: Fix one build error of friend declaration for a class.

If g++ is older than 4.7.0, the class-key of the
elaborated-type-specifier is required in a friend declaration
for a class. So modify the code to make it compatible with old
g++ versions.

Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: remove some useless code.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: increase the global memory size to 1GB.

Also increase the global memory to 1GB.

v2: change the max memory size to 256MB

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: fixed a regression at "Long" div/rem.

If the GEN_PREDICATE_ALIGN1_ANY8H/ANY16H or ALL8H/ALL16H
are used, we must make sure the inactive lanes are initialized
correctly. For the "ANY" condition, all the inactive lanes need to
be cleared to zero. For the "ALL" condition, all the inactive lanes
need to be set to 1s. Otherwise, it may cause an infinite loop.
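
A sketch of the required initialization on a 16-bit flag value, assuming
one flag bit per lane and an activeMask with 1s for the active lanes. All
four function names are hypothetical.

```cpp
#include <cstdint>

// For an ANY test, inactive lanes must read as 0 so they cannot make
// the condition true.
uint16_t prepareFlagForAny(uint16_t flag, uint16_t activeMask) {
  return static_cast<uint16_t>(flag & activeMask);
}

// For an ALL test, inactive lanes must read as 1 so they cannot make
// the condition false.
uint16_t prepareFlagForAll(uint16_t flag, uint16_t activeMask) {
  return static_cast<uint16_t>(flag | static_cast<uint16_t>(~activeMask));
}

// Models of the 16-lane ANY and ALL predicates.
bool any16(uint16_t flag) { return flag != 0; }
bool all16(uint16_t flag) { return flag == 0xFFFF; }
```

Without this, stale bits in inactive lanes can keep an ANY-predicated loop
branch taken, or keep an ALL-predicated exit from ever being taken.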

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

Init Benchmark suite

The first benchmark case is named enqueue_copy_buf.

Signed-off-by: Yi Sun <yi.sun@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: reserve flag0.0 for large basic block.

In a large basic block, there is more than one IF instruction which
needs to use flag0.0. We have to reserve flag0.0 for those IF
instructions.

Signed-off-by: Zhigang Gong <zhigang.gong@gmail.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fix the large if/endif block issue.

Some test cases have very large blocks which contain
more than 32768/2 instructions, which cannot fit into one
if/endif block.

This patch introduces an ifendif fix switch in the GenContext.
Once we encounter such an error, we set the switch on
and then recompile the kernel. When the switch is on, we
insert extra endif/if pairs into the block to split one if/endif
block into multiple ones to fix the large if/endif issue.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fix the hard coded endif offset calculation.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: Avoid unnecessary dag/liveness computing at backend.

We don't need to compute dag/liveness at the backend when
we switch to a new code gen strategy.
For the unit test case, this patch could save 15% of the
overall execution time. For the luxmark with STRICT conformance
mode, it saves about 40% of the build time.

v3: fix some minor bugs.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>

GBE: fixed a potential scalarize bug.

We need to append an extract instruction when doing a bitcast to
a vector. Otherwise, we may trigger an assert as the extract
instruction uses an undefined vector.

After this patch, it becomes safe to do many rounds of scalarize
pass.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Song, Ruiling" <ruiling.song@intel.com>

add support for cross compiler

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: refine the gen program strategy.

The limitRegisterPressure only affects the MAD pattern matching,
which does not make a noticeable difference here, so I changed it to
always be false. I also added the reserved registers for spilling to the
strategy structure. Thus we can try to build a program with the following
strategy:

1. SIMD16 without spilling
2. SIMD16 with 10 spilling registers and with a default spilling threshold
value of 16. When more than 16 registers need to be spilled, we fall back
to the next method.
3. SIMD8 without spilling
4. SIMD8 with 8 spilling registers.
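
The fallback order above can be sketched as a table walked from top to
bottom. The struct layout, the names, and the modelBuild() success rule are
assumptions for illustration only; they are not the actual strategy
structure of the backend.

```cpp
// Hypothetical strategy record; field names are illustrative.
struct CodeGenStrategy {
  unsigned simdWidth;
  unsigned reservedSpillRegs;  // 0 means spilling is disabled
  unsigned spillThreshold;     // 0 means no threshold is stated
};

// The four steps listed above, in fallback order.
static const CodeGenStrategy kStrategies[] = {
  {16, 0, 0},    // 1. SIMD16 without spilling
  {16, 10, 16},  // 2. SIMD16, 10 spilling registers, threshold 16
  {8, 0, 0},     // 3. SIMD8 without spilling
  {8, 8, 0},     // 4. SIMD8 with 8 spilling registers
};

// Toy model of a build attempt: the caller states how many registers the
// kernel would need to spill at each SIMD width (an assumed input).
bool modelBuild(const CodeGenStrategy &s, unsigned spillNeeded16,
                unsigned spillNeeded8) {
  const unsigned need = (s.simdWidth == 16) ? spillNeeded16 : spillNeeded8;
  if (need == 0) return true;                  // fits without spilling
  if (s.reservedSpillRegs == 0) return false;  // spilling not allowed
  if (s.spillThreshold != 0 && need > s.spillThreshold) return false;
  return true;
}

// Returns the first strategy that builds, or nullptr if none does.
const CodeGenStrategy *pickStrategy(unsigned spillNeeded16,
                                    unsigned spillNeeded8) {
  for (const CodeGenStrategy &s : kStrategies)
    if (modelBuild(s, spillNeeded16, spillNeeded8))
      return &s;
  return nullptr;
}
```

In this toy model a kernel that needs 20 spilled registers at SIMD16 skips
step 2 because of the threshold and lands on one of the SIMD8 strategies.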

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: fixed the undefined phi value's liveness analysis.

If a phi component is undef from one of the predecessors,
we should not pass it as the predecessor's liveout registers.
Otherwise, that phi register's liveness may be extended to
basic block zero, which is not good.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>

GBE: Try to expire some registers before register allocation

1. This frees unused registers as soon as possible, so it becomes easier
   to allocate contiguous registers.

2. We previously met many hidden register liveness issues. Let's try
   to reuse the expired registers early. Then I think wrong liveness may
   be easier to find.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>

GBE: Optimize byte gather read using untyped read.

Untyped read seems to perform better than byte gather read.
Some performance tests in OpenCV doubled their results after the patch.
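
As a rough illustration of replacing a byte read with a dword read, the
sketch below loads the aligned 32-bit word that contains the byte and then
shifts the byte out. It is a host-side model under the assumption of
little-endian byte order inside each dword; the function name is
hypothetical and this is not the code generated by the backend.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Read the byte at byteAddr from memory modelled as an array of dwords.
// mem.at() throws std::out_of_range if the address is outside the buffer.
inline uint8_t readByteViaDword(const std::vector<uint32_t> &mem,
                                std::size_t byteAddr) {
  const uint32_t dword = mem.at(byteAddr / 4);  // aligned dword load
  const unsigned shift = static_cast<unsigned>(8 * (byteAddr % 4));
  return static_cast<uint8_t>((dword >> shift) & 0xFFu);
}
```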

Signed-off-by: Ruiling Song <ruiling.song@intel.com>

add test for __gen_ocl_simd_any and __gen_ocl_simd_all

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

support __gen_ocl_simd_any and __gen_ocl_simd_all

short __gen_ocl_simd_any(short x):
if x is not zero in any of the active threads in the same SIMD,
the return value for all these threads is not zero; otherwise, zero is returned.

short __gen_ocl_simd_all(short x):
only if x is not zero in all of the active threads in the same SIMD,
the return value for all these threads is not zero; otherwise, zero is returned.

for example:
to check if a special value exists in a global buffer, use one SIMD
to do the searching in parallel; the whole SIMD can stop the task
once the value is found. The key kernel code looks like:

for(; ; ) {
  ...
  if (__gen_ocl_simd_any(...))
    break;   // the whole SIMD stops the searching
}
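
The search loop above can be modelled on the host as follows. This is only
a sketch of the described semantics: simdAny() and searchIterations() are
hypothetical names, the non-zero return value is assumed to be 1, and each
lane is assumed to test one buffer element per iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Model of __gen_ocl_simd_any: non-zero if x is non-zero in any active lane.
short simdAny(const std::vector<short> &x, const std::vector<bool> &active) {
  assert(x.size() == active.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    if (active[i] && x[i] != 0)
      return 1;
  return 0;
}

// Returns the loop iteration at which the whole SIMD breaks out,
// or -1 if the needle is never found.
int searchIterations(const std::vector<int> &buf, int needle,
                     std::size_t simdWidth) {
  assert(simdWidth > 0);
  int iter = 0;
  for (std::size_t base = 0; base < buf.size(); base += simdWidth, ++iter) {
    std::vector<short> found(simdWidth, 0);
    std::vector<bool> active(simdWidth, false);
    for (std::size_t lane = 0; lane < simdWidth; ++lane) {
      const std::size_t idx = base + lane;
      if (idx < buf.size()) {  // lanes past the end stay inactive
        active[lane] = true;
        found[lane] = (buf[idx] == needle) ? 1 : 0;
      }
    }
    if (simdAny(found, active))
      return iter;  // every lane stops together
  }
  return -1;
}
```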

Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Delete the printing of dynamic statistics line.

summary:
---------------------
  1. Delete the printing of dynamic statistics line.
  2. Add a function to catch signals (like CTRL+C, core dumped ...);
     if one is caught, remind the user of the signal name.
     core dumped example:
...
displacement_map_element()    [SUCCESS]
compiler_clod()    Interrupt signal (SIGSEGV) received.
summary:
----------
  total: 657
  run: 297
  pass: 271
  fail: 26
  pass rate: 0.960426
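
A minimal sketch of catching a signal and reporting its name, using only
the standard <csignal> facilities. The function names and the set of
handled signals are assumptions; the message format follows the example
output above. It is not the utest implementation.

```cpp
#include <csignal>
#include <string>

// Last signal seen by the handler; sig_atomic_t is safe to write there.
volatile std::sig_atomic_t g_lastSignal = 0;

extern "C" void onSignal(int sig) { g_lastSignal = sig; }

// Map a signal number to a readable name (only a few signals covered).
const char *signalName(int sig) {
  switch (sig) {
    case SIGINT:  return "SIGINT";
    case SIGSEGV: return "SIGSEGV";
    case SIGTERM: return "SIGTERM";
    default:      return "UNKNOWN";
  }
}

// Build the reminder line in the style of the example output above.
std::string describeSignal(int sig) {
  return std::string("Interrupt signal (") + signalName(sig) + ") received.";
}

// Register the handler for the signals this sketch cares about.
void installHandlers() {
  std::signal(SIGINT, onSignal);
  std::signal(SIGSEGV, onSignal);
  std::signal(SIGTERM, onSignal);
}
```

The handler only records the signal number; the reminder is formatted
outside the handler, since few functions are safe to call inside one.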

Signed-off-by: Yi Sun <yi.sun@intel.com>
Signed-off-by: Yangwei Shui <yangweix.shui@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: Implement instruction compact.

A native GEN ASM instruction takes 2*64bit, but GEN also supports compact
instructions which only take 64bit. To make the code easy to understand,
GenInstruction now only stands for 64bit of memory, and GenNativeInstruction &
GenCompactInstruction are used to represent normal (native) and compact instructions.

After this change, it is not easy to map a SelectionInstruction distance to an ASM
distance, as the instructions within the distance may be compacted. To avoid
introducing too much complexity, JMP, IF, ENDIF, NOP will NEVER be compacted.
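
To illustrate why the mapping is no longer a simple multiplication, the
sketch below counts the ASM distance in 64bit units over a range of
instructions. The function name and the bool-per-instruction
representation are assumptions for illustration, not the encoder's data
structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Distance in 64bit units covered by instructions [from, to).
// compacted[i] is true if instruction i was emitted in compact form.
std::size_t asmDistance(const std::vector<bool> &compacted,
                        std::size_t from, std::size_t to) {
  assert(from <= to && to <= compacted.size());
  std::size_t units = 0;
  for (std::size_t i = from; i < to; ++i)
    units += compacted[i] ? 1 : 2;  // compact: 64bit, native: 2*64bit
  return units;
}
```

Without compaction the distance over n instructions would always be 2*n
units; with it, the result depends on which instructions were compacted.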

Some experiments in luxMark show it could reduce instruction memory by about 20%.
But sadly, no performance improvement was observed.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>

GBE: fix a Q64 spilling bug in non-simd8 mode.

For simd16 mode, the payload needs to have 2 GRFs, not the hard coded 1 GRF.
This patch fixes the corresponding regression on piglit.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>

GBE: work around baytrail-t hang issue.

There is an unknown issue with the baytrail-t platform. It hangs at
utest's compiler_global_constant case. After some investigation,
it turns out to be related to the DWORD GATHER READ send message
on the constant cache data port. Changing to use the data cache data
port works around that hang issue.

Now we only fail one more case on baytrail-t compared to the IVB
desktop platform, which is:

profiling_exec() [FAILED]
Error: Too large time from submit to start

That may be caused by a kernel related issue, and that bug will not
cause serious issues for a normal kernel. So after this patch, the
baytrail-t platform should be in pretty good shape with beignet.

Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: He Junyan <junyan.he@inbox.com>

GBE/Runtime: pass the device id to the compiler backend.

For some reason, we need to know the current target device id
at the code generation stage. This patch introduces such
a mechanism. This is the preparation for fixing the baytrail weird
hang issue.

Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: He Junyan <junyan.he@inbox.com>

Runtime: increase the build log buffer size to 1000.

200 is too small sometimes.

Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: He Junyan <junyan.he@inbox.com>

Runtime: Add support for Bay Trail-T device.

According to the baytrail-t spec, baytrail-t has 4 EUs and each
EU has 8 threads. So the number of compute units is 32 and the maximum
work group size is 32 * 8 which is 256.

Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>

Mark SandyBridge as unsupported

Signed-off-by: Jesper Pedersen <jesper.pedersen@comcast.net>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>