+++ /dev/null
-<h1>Beignet</h1>
-
-<p>Beignet is an open source implementation of the OpenCL specification - a generic
-compute oriented API. This code base contains the code to run OpenCL programs on
-Intel GPUs: it defines and implements the OpenCL host functions required to
-initialize the device, create the command queues, the kernels and the programs,
-and run them on the GPU. The code base also contains the compiler part of the
-stack, which is included in <code>backend/</code>. For more specific information
-about the compiler, please refer to <code>backend/README.md</code>.</p>
-
-<h2>How to build</h2>
-
-<p>The project uses CMake with three profiles:</p>
-
-<ol>
-<li>Debug (-g)</li>
-<li>RelWithDebInfo (-g with optimizations)</li>
-<li>Release (only optimizations)</li>
-</ol>
-
-<p>Basically, from the root directory of the project</p>
-
-<p><code>> mkdir build</code></p>
-
-<p><code>> cd build</code></p>
-
-<p><code>> cmake ../ # to configure</code></p>
-
-<p>Choose whichever build profile you want.</p>
-
-<p>Then press 'c' to configure and 'g' to generate the code.</p>
-
-<p><code>> make</code></p>
-
-<p>The project depends on several external libraries:</p>
-
-<ul>
-<li>Several X components (XLib, Xfixes, Xext)</li>
-<li>libdrm libraries (libdrm and libdrm_intel)</li>
-<li>Various LLVM components</li>
-<li>The compiler backend itself (libgbe)</li>
-<li>Mesa git master version built with gbm enabled to support extension cl_khr_gl_sharing.</li>
-</ul>
-
-<p>CMake will check the dependencies and will complain if it does not find them.</p>
-
-<p>CMake will also build the backend project. Please refer to
-<a href="backend/README.html">OpenCL Gen Backend</a> for its additional dependencies.</p>
-
-<p>Once built, the run-time produces a shared object, libcl.so, which implements
-the OpenCL API directly. A set of tests is also produced; they may
-be found in <code>utests/</code>.</p>
-
-<p>Note that the compiler depends on LLVM (Low-Level Virtual Machine project).
-Right now, the code has been compiled with LLVM 3.1/3.2. It will not compile
-with anything older.</p>
-
-<p><a href="http://llvm.org/releases/">http://llvm.org/releases/</a></p>
-
-<p>LLVM 3.1 and 3.2 are supported.</p>
-
-<p>Also note that the code was compiled with GCC 4.6 and GCC 4.7. Since the code
-uses fairly recent C++11 features, you may expect problems with older compilers.
-Last time I tried, the code broke ICC 12 and Clang with internal compiler errors
-while compiling anonymous nested lambda functions.</p>
-
-<h2>How to run</h2>
-
-<p>Apart from the OpenCL library itself, which can be used by any OpenCL
-application, this code also produces various tests to check the consistency of
-the compiler and the run-time. This small test framework uses a simple C++
-registration system to register all the unit tests.</p>
-
-<p>You need to set the variable <code>OCL_KERNEL_PATH</code> to locate the OCL kernels. They
-ship with the run-time in <code>./kernels</code>.</p>
-
-<p>Then in <code>utests/</code>:</p>
-
-<p><code>> ./utest_run</code></p>
-
-<p>will run all the unit tests one after the other</p>
-
-<p><code>> ./utest_run some_unit_test0 some_unit_test1</code></p>
-
-<p>will run only the <code>some_unit_test0</code> and <code>some_unit_test1</code> tests</p>
-
-<h2>Supported Hardware</h2>
-
-<p>The code was tested on IVB GT2 with the Ubuntu and Fedora distributions.
-Only IVB is supported right now; in fact, the code was only run on IVB GT2, so you
-may expect some issues with IVB GT1.</p>
-
-<h2>TODO</h2>
-
-<p>The run-time is far from being complete. Most of the pieces have been put
-together to test and develop the OpenCL compiler. A partial list of things to
-do:</p>
-
-<ul>
-<li><p>Complete cl_khr_gl_sharing support. Implementations of some APIs are
-missing, such as clCreateFromGLBuffer, clCreateFromGLRenderbuffer,
-clGetGLObjectInfo... Currently, the working APIs are clCreateFromGLTexture and
-clCreateFromGLTexture2D.</p></li>
-<li><p>Support for events.</p></li>
-<li><p>Check that NDRangeKernels can be pushed into <em>different</em> queues from several
-threads.</p></li>
-<li><p>Support for non-blocking Enqueue*Buffer calls. For now we only use the map
-extension to implement those Enqueue*Buffer functions.</p></li>
-<li><p>No state tracking at all. One batch buffer is created for each "draw call"
-(i.e. for each NDRangeKernel). This is really inefficient since some
-expensive pipe controls are issued for each batch buffer.</p></li>
-<li><p>Valgrind reports some leaks in libdrm. They look like false positives, but
-this has to be checked. The same goes for LLVM: there is one leak there to check.</p></li>
-</ul>
-
-<p>More generally, anything in the run-time that triggers the "FATAL" macro means
-that something that must be supported is not implemented properly (either it
-does not comply with the standard or it is simply missing).</p>
-
-<h2>Project repository</h2>
-
-<p>Right now, we host our project on fdo at: git://anongit.freedesktop.org/beignet.</p>
-
-<h2>The team</h2>
-
-<p>This project was created by Ben Segovia when he was working for Intel. Now a
-team in the China OTC graphics department continues to work on it.
-The official contact for this project is: Zou Nanhai (<a href="mailto:nanhai.zou@intel.com">nanhai.zou@intel.com</a>).</p>
-
-<h2>How to contribute</h2>
-
-<p>You are always welcome to contribute to this project: just subscribe
-to the Beignet mailing list and send patches there for review.
-The official mailing list is:
-http://lists.freedesktop.org/mailman/listinfo/beignet</p>
-Beignet
-=======
-
-Beignet is an open source implementation of the OpenCL specification - a generic
-compute oriented API. This code base contains the code to run OpenCL programs on
-Intel GPUs: it defines and implements the OpenCL host functions required to
-initialize the device, create the command queues, the kernels and the programs,
-and run them on the GPU. The code base also contains the compiler part of the
-stack, which is included in `backend/`. For more specific information
-about the compiler, please refer to `backend/README.md`.
-
-How to build
-------------
-
-The project uses CMake with three profiles:
-
-1. Debug (-g)
-2. RelWithDebInfo (-g with optimizations)
-3. Release (only optimizations)
-
-Basically, from the root directory of the project
-
-`> mkdir build`
-
-`> cd build`
-
-`> cmake ../ # to configure`
-
-Choose whichever build profile you want.
-
-Then press 'c' to configure and 'g' to generate the code.
-
-`> make`
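
The build steps above can be collapsed into one scripted pass. This is only a
sketch: `RelWithDebInfo` is just one of the three profiles, and the guard exists
solely so the snippet is safe to run outside a real checkout.

```shell
# Out-of-tree CMake build, as described above.
mkdir -p build
# Configure and build only when invoked from a real Beignet checkout.
if [ -f CMakeLists.txt ] && command -v cmake >/dev/null 2>&1; then
    (cd build && cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ../ && make)
fi
```

`Debug` and `Release` can be substituted for `RelWithDebInfo` to select the
other two profiles.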
-
-The project depends on several external libraries:
-
-- Several X components (XLib, Xfixes, Xext)
-- libdrm libraries (libdrm and libdrm\_intel)
-- Various LLVM components
-- The compiler backend itself (libgbe)
-- Mesa git master version built with gbm enabled to support extension cl\_khr\_gl\_sharing.
-
-CMake will check the dependencies and will complain if it does not find them.
-
-CMake will also build the backend project. Please refer to
-[OpenCL Gen Backend](backend/README.html) for its additional dependencies.
-
-Once built, the run-time produces a shared object, libcl.so, which implements
-the OpenCL API directly. A set of tests is also produced; they may
-be found in `utests/`.
-
-Note that the compiler depends on LLVM (Low-Level Virtual Machine project).
-Right now, the code has been compiled with LLVM 3.1/3.2. It will not compile
-with anything older.
-
-[http://llvm.org/releases/](http://llvm.org/releases/)
-
-LLVM 3.1 and 3.2 are supported.
-
-Also note that the code was compiled with GCC 4.6 and GCC 4.7. Since the code
-uses fairly recent C++11 features, you may expect problems with older compilers.
-Last time I tried, the code broke ICC 12 and Clang with internal compiler errors
-while compiling anonymous nested lambda functions.
-
-How to run
-----------
-
-Apart from the OpenCL library itself, which can be used by any OpenCL
-application, this code also produces various tests to check the consistency of
-the compiler and the run-time. This small test framework uses a simple C++
-registration system to register all the unit tests.
-
-You need to set the variable `OCL_KERNEL_PATH` to locate the OCL kernels. They
-ship with the run-time in `./kernels`.
-
-Then in `utests/`:
-
-`> ./utest_run`
-
-will run all the unit tests one after the other
-
-`> ./utest_run some_unit_test0 some_unit_test1`
-
-will run only the `some_unit_test0` and `some_unit_test1` tests
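
Putting the run steps together, a session might look like the sketch below. The
kernel path and the test names are assumptions, and the directory guard only
makes the snippet safe to paste outside a build tree.

```shell
# Tell the run-time where the OCL kernels live (path is an assumption
# about your build tree layout).
export OCL_KERNEL_PATH="$PWD/kernels"

# Run every registered unit test, or only the named ones.
if [ -d utests ]; then
    (cd utests && ./utest_run)
    (cd utests && ./utest_run some_unit_test0 some_unit_test1)
fi
```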
-
-Supported Hardware
-------------------
-
-The code was tested on IVB GT2 with the Ubuntu and Fedora distributions.
-Only IVB is supported right now; in fact, the code was only run on IVB GT2, so you
-may expect some issues with IVB GT1.
-
-TODO
-----
-
-The run-time is far from being complete. Most of the pieces have been put
-together to test and develop the OpenCL compiler. A partial list of things to
-do:
-
-- Complete cl\_khr\_gl\_sharing support. Implementations of some APIs are missing,
- such as clCreateFromGLBuffer, clCreateFromGLRenderbuffer, clGetGLObjectInfo...
- Currently, the working APIs are clCreateFromGLTexture and clCreateFromGLTexture2D.
-
-- Support for events.
-
-- Check that NDRangeKernels can be pushed into _different_ queues from several
- threads.
-
-- Support for non-blocking Enqueue\*Buffer calls. For now we only use the map
- extension to implement those Enqueue\*Buffer functions.
-
-- No state tracking at all. One batch buffer is created for each "draw call"
- (i.e. for each NDRangeKernel). This is really inefficient since some
- expensive pipe controls are issued for each batch buffer.
-
-- Valgrind reports some leaks in libdrm. They look like false positives, but this
- has to be checked. The same goes for LLVM: there is one leak there to check.
-
-More generally, anything in the run-time that triggers the "FATAL" macro means
-that something that must be supported is not implemented properly (either it
-does not comply with the standard or it is simply missing).
-
-Project repository
-------------------
-Right now, we host our project on fdo at: git://anongit.freedesktop.org/beignet.
-
-The team
---------
-This project was created by Ben Segovia when he was working for Intel. Now a
-team in the China OTC graphics department continues to work on it.
-The official contact for this project is: Zou Nanhai (<nanhai.zou@intel.com>).
-
-How to contribute
------------------
-You are always welcome to contribute to this project: just subscribe
-to the Beignet mailing list and send patches there for review.
-The official mailing list is:
-http://lists.freedesktop.org/mailman/listinfo/beignet
+We host documents at the following wiki page:
+[http://wiki.freedesktop.org/www/Software/Beignet](http://wiki.freedesktop.org/www/Software/Beignet)
+++ /dev/null
-<h1>Beignet Compiler</h1>
-
-<p>This code base contains the compiler part of the Beignet OpenCL stack. The
-compiler is responsible for taking an OpenCL source string and compiling it into
-a binary that can be executed on Intel integrated GPUs.</p>
-
-<h2>Limitations</h2>
-
-<p>Today, the compiler is far from complete. See <a href="doc/TODO.html">here</a> for an
-(incomplete) list of things to do.</p>
-
-<h2>Interface with the run-time</h2>
-
-<p>Even if the compiler makes a very liberal use of C++ (templates, variadic
-templates, macros), we really tried hard to make a very simple interface with
-the run-time. The interface is therefore a pure C99 interface and it is defined
-in <code>src/backend/program.h</code>.</p>
-
-<p>The goal is to hide the complexity of the inner data structures and to enable
-simple run-time implementation using straightforward C99.</p>
-
-<p>Note that the data structures are fully opaque: this allows us to use both the
-C++ simulator or the real Gen program in a relatively non-intrusive way.</p>
-
-<h2>Various environment variables</h2>
-
-<p>Environment variables are used all over the code. Most important ones are:</p>
-
-<ul>
-<li><p><code>OCL_SIMD_WIDTH</code> <code>(8 or 16)</code>. Change the number of lanes per hardware thread</p></li>
-<li><p><code>OCL_OUTPUT_GEN_IR</code> <code>(0 or 1)</code>. Output Gen IR (scalar intermediate
-representation) code</p></li>
-<li><p><code>OCL_OUTPUT_LLVM</code> <code>(0 or 1)</code>. Output LLVM code after the lowering passes</p></li>
-<li><p><code>OCL_OUTPUT_LLVM_BEFORE_EXTRA_PASS</code> <code>(0 or 1)</code>. Output LLVM code before the
-lowering passes</p></li>
-<li><p><code>OCL_OUTPUT_ASM</code> <code>(0 or 1)</code>. Output Gen ISA</p></li>
-<li><p><code>OCL_OUTPUT_REG_ALLOC</code> <code>(0 or 1)</code>. Output Gen register allocations</p></li>
-</ul>
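
For example, to dump both the scalar Gen IR and the final ISA for a 16-wide
compile, the knobs above can be set in the environment before launching the
application. A sketch: the variable names are the ones listed, the values are
illustrative.

```shell
# Debug output knobs read from the environment by the compiler.
export OCL_SIMD_WIDTH=16     # 8 or 16 lanes per hardware thread
export OCL_OUTPUT_GEN_IR=1   # print the scalar intermediate representation
export OCL_OUTPUT_ASM=1      # print the final Gen ISA
```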
-
-<h2>Implementation details</h2>
-
-<p>Several key decisions may use the hardware in an unusual way. See the following
-documents for the technical details about the compiler implementation:</p>
-
-<ul>
-<li><a href="doc/flat_address_space.html">Flat address space</a></li>
-<li><a href="doc/unstructured_branches.html">Unstructured branches</a></li>
-<li><a href="doc/gen_ir.html">Scalar intermediate representation</a></li>
-<li><a href="doc/compiler_backend.html">Clean backend implementation</a></li>
-</ul>
-
-<p>Ben Segovia. </p>
+++ /dev/null
-<h1>TODO</h1>
-
-<p>The compiler is far from complete. Even if the skeleton is now done and should
-be solid, there are a <em>lot</em> of things to do, from trivial to complex.</p>
-
-<h2>OpenCL standard library</h2>
-
-<p>Today we define the OpenCL API in the header file <code>src/ocl_stdlib.h</code>. This file
-is far from complete.</p>
-
-<p>By the way, one question remains: do we want to implement
-the high-precision functions as <em>inline</em> functions or as external functions to
-call? Indeed, inlining all functions may lead to severe code bloat while
-calling functions will require implementing a proper ABI. Actually, we certainly
-want to do both.</p>
-
-<h2>LLVM front-end</h2>
-
-<p>The code is defined in <code>src/llvm</code>. We used the PTX ABI and the OpenCL profile
-to compile the code. Therefore, a good part of the job is already done. However,
-many things must be implemented:</p>
-
-<ul>
-<li><p>Lowering down of various intrinsics like <code>llvm.memcpy</code></p></li>
-<li><p>Conformance test for all OpenCL built-ins (<code>native_cos</code>, <code>native_sin</code>,
-<code>mad</code>, atomic operations, barriers...).</p></li>
-<li><p>Lowering down of int16 / int8 / float16 / char16 / char8 / char4 loads and
-stores into the supported loads and stores</p></li>
-<li><p>Support for local declaration of local arrays (the OpenCL profile will properly
-declare them as global arrays)</p></li>
-<li><p>Support for doubles</p></li>
-<li><p>Support atomic extensions.</p></li>
-<li><p>Better resolution of PHI functions. Today, we always generate MOV
-instructions at the end of each basic block. They can easily be optimized.</p></li>
-</ul>
-
-<h2>Gen IR</h2>
-
-<p>The code is defined in <code>src/ir</code>. Main things to do are:</p>
-
-<ul>
-<li><p>Bringing support for doubles</p></li>
-<li><p>Adding support for atomic extensions.</p></li>
-<li><p>Finishing the handling of function arguments (see the <a href="gen_ir.html">IR
-description</a> for more details)</p></li>
-<li><p>Adding support for linking IR units together. OpenCL indeed allows creating
-programs from several sources</p></li>
-<li><p>Uniform analysis. This is a major performance improvement. A "uniform" value
-is basically a value that is identical in all the active lanes, regardless of the
-control flow. Trivial examples are immediate values and function
-arguments. Also, operations on uniform values produce uniform values, and so
-on...</p></li>
-<li><p>Merging of independent uniform loads (and samples). This is a major
-performance improvement once the uniform analysis is done. Basically, several
-uniform loads may be collapsed into one load if no write happens in between.
-This will obviously impact both instruction selection and register
-allocation.</p></li>
-</ul>
-
-<h2>Backend</h2>
-
-<p>The code is defined in <code>src/backend</code>. Main things to do are:</p>
-
-<ul>
-<li><p>Implementing support for doubles</p></li>
-<li><p>Implementing atomic extensions.</p></li>
-<li><p>Implementing register spilling (see the <a href="./compiler_backend.html">compiler backend
-description</a> for more details)</p></li>
-<li><p>Implementing proper instruction selection. A "simple" tree matching algorithm
-should provide good results for Gen</p></li>
-<li><p>Improving the instruction scheduling pass</p></li>
-</ul>
-
-<h2>General plumbing</h2>
-
-<p>I tried to keep the code clean, well, as far as C++ can be really clean. There
-are some header cleaning steps required though, in particular in the backend
-code.</p>
-
-<p>The context used in the IR code generation (see <code>src/ir/context.*pp</code>) should be
-split up and cleaned up too.</p>
-
-<p>I also purely and simply copied and pasted the Gen ISA disassembler from Mesa.
-This leads to code duplication. Also some messages used by OpenCL (untyped reads
-and writes) are not properly decoded yet.</p>
-
-<p>There are also some quick and dirty hacks, like the use of the <code>system(...)</code>
-function call. This should be cleanly replaced by popen and friends. I also directly
-call the LLVM compiler executable instead of using the Clang library. All of this
-should be improved and cleaned up. Track "XXX" comments in the code.</p>
-
-<p>Parts of the code leak memory when exceptions are used. There are some pointers
-to track and replace with std::unique_ptr. Note that we also added a custom memory
-debugger that nicely complements Valgrind (i.e. it is fast).</p>
+++ /dev/null
-<h1>Compiler Back End</h1>
-
-<p>Well, the complete code base is somehow a compiler backend for LLVM. Here, we
-really speak about the final code generation passes that you may find in
-<code>src/backend</code>.</p>
-
-<p>As explained in <a href="./gen_ir.html">the scalar IR presentation</a>, we bet on a very
-simple scalar IR to make it easy to parse and modify. The idea is to fix the
-unrelated (and very Gen specific) problems where we can, i.e. when the code is
-generated.</p>
-
-<p>The code generation in the compiler backend is classically divided into four
-steps:</p>
-
-<ul>
-<li><p>Instruction selection (defined in <code>src/backend/gen_insn_selection.*pp</code>). We
-expose an interface for the instruction selection engine. We implemented a
-very simple selection (called <code>SimpleSelection</code>) that does a quick and dirty
-one-to-many instruction generation.</p></li>
-<li><p>Register allocation (defined in <code>src/backend/gen_reg_allocation.*pp</code>). The
-code implements a linear scan allocator on the code selected in the previous
-pass. See below for more details about register vector allocations.</p></li>
-<li><p>Instruction scheduling. This one is not done yet. We just output the same
-instruction order as the program order. Note that we plan to implement an
-adaptive scheduling between register allocation and instruction selection (to
-avoid spilling as much as possible)</p></li>
-<li><p>Instruction encoding. This is the final step that encodes the program into Gen
-ISA.</p></li>
-</ul>
-
-<h2>Instruction selection</h2>
-
-<p>Usually, the instruction selection consists in mapping <code>p</code> instructions to <code>q</code>
-ISA instructions under a cost driven model. Each basic block is therefore <em>tiled</em>
-into some numbers of groups of ISA instructions such that the final cost is
-minimized.</p>
-
-<p>The literature is particularly dense on the subject. Today, compilers usually use
-either tree matching methods or selection DAG techniques (as LLVM backends do).</p>
-
-<p>The instruction selection is still a work in progress in our compiler, and we
-only implement the most naive (and inefficient) technique: we simply generate
-as many instructions as we need for each <em>individual</em> IR instruction. Since we
-do not support immediate sources, this in particular leads to really ugly
-looking code such as <code>mov (16) r2:f 1.f</code>.</p>
-
-<p>Other than that, the instruction selection is really a book keeping structure.
-We basically output <code>SelectionInstruction</code> objects which are the 1-to-1 mapping
-of Gen ISA encoding functions defined in <code>src/backend/gen_encoder.*pp</code>.</p>
-
-<p>However, the <code>SelectionInstruction</code> still uses unallocated virtual registers and
-does <em>not</em> use vectors but simply tuples of virtual registers.</p>
-
-<h2>Register allocation</h2>
-
-<p>The register allocation actually consists in two steps:</p>
-
-<ol>
-<li><p>Handling the vector for all the instructions that require them</p></li>
-<li><p>Performing the register allocation itself</p></li>
-</ol>
-
-<p>Step 1 consists in scanning all the vectors required by sends. Obviously, the
-same register may be used in different vectors and that may lead to
-interferences. We simply sort the vectors from the largest to the smallest and
-allocate them in that order. As an optimization we also identify sub-vectors,
-i.e. vectors included in larger ones, and do not allocate them.</p>
-
-<p>The code may be largely improved in particular if we take into account liveness
-interferences as well. Basically, a register may be part of several vectors if the
-registers that are not in both vectors at the same location are not alive at the
-same time.</p>
-
-<p>This is still a work in progress. Code is right now handled by method
-<code>GenRegAllocator::allocateVector</code>.</p>
-
-<p>Step 2 performs the register allocation i.e. it associates each virtual register
-to one (or several) physical registers. The first thing is that the Gen register
-file is very flexible i.e. it can (almost) be freely partitioned. To handle this
-peculiarity, we simply implemented a free list based generic memory allocator as
-done with <code>RegisterFilePartitioner</code> in <code>src/backend/context.cpp</code>.</p>
-
-<p>We then simply implemented a linear scan allocator (see
-<code>gen_reg_allocation.cpp</code>). The spilling is not implemented and is still a work
-in progress. The thing is that spilling must be specifically handled with Gen.
-Indeed:</p>
-
-<ol>
-<li><p>Bad point. Spilling is expensive and requires assembling messages for it</p></li>
-<li><p>Good point. Gen is able to spill up to 256 <em>contiguous</em> bytes in one message.
-This must be used for high performance spilling, and it may require properly
-reordering the registers to spill.</p></li>
-</ol>
-
-<h2>Instruction scheduling</h2>
-
-<p>Intra-basic block instruction scheduling is relatively simple. It is not
-implemented yet.</p>
-
-<h2>Instruction encoding</h2>
-
-<p>This is mostly done in <code>src/backend/gen_context.cpp</code> and
-<code>src/backend/gen_encoder.*pp</code>. This is mostly glue code and it is pretty
-straightforward. We just forward the selection code using the physically
-allocated registers. There is nothing special here. Just boilerplate.</p>
-
-<p><a href="../README.html">Up</a></p>
+++ /dev/null
-<h1>Flat Address Space</h1>
-
-<h2>Segmented address space...</h2>
-
-<p>The first challenge with OpenCL is its very liberal use of pointers. The memory
-is segmented into several address spaces:</p>
-
-<ul>
-<li><p>private. This is the memory for each work item</p></li>
-<li><p>global. These are buffers in memory shared by all work items and work groups</p></li>
-<li><p>constant. These are constant buffers in memory shared by all work items and
-work groups as well</p></li>
-<li><p>local. This is memory shared by all work items in the <em>same</em> work group</p></li>
-</ul>
-
-<h2>... But with no restriction inside each address space</h2>
-
-<p>The challenge is that there is no restriction in OpenCL inside each address
-space i.e. the full C semantic applies in particular regarding pointer
-arithmetic.</p>
-
-<p>Therefore the following code is valid:</p>
-
-<pre><code>__kernel void example(__global int *dst, __global int *src0, __global int *src1)
-{
-  __global int *from;
-  if (get_global_id(0) % 2)
-    from = src0;
-  else
-    from = src1;
-  dst[get_global_id(0)] = from[get_global_id(0)];
-}
-</code></pre>
-
-<p>As one may see, the load done in the last line actually mixes pointers from both
-sources src0 and src1. This typically makes the use of binding table indices
-pretty hard. If we use binding table 0 for dst, 1 for src0 and 2 for src1 (for
-example), we are not able to express the load in the last line with one send
-only.</p>
-
-<h2>No support for stateless in required messages</h2>
-
-<p>Furthermore, on IVB, we are going to use four types of messages to implement the
-loads and the stores:</p>
-
-<ul>
-<li><p>Byte scattered reads. They are used to read bytes/shorts/integers that are not
-aligned on 4 bytes. This is a gather message i.e. the user provides up to 16
-addresses</p></li>
-<li><p>Byte scattered writes. They are used to write bytes/shorts/integers that are not
-aligned on 4 bytes. This is a scatter message i.e. the user provides up to 16
-addresses</p></li>
-<li><p>Untyped reads. They allow reading from 1 to 4 double words (i.e. 4 bytes each) per
-lane. This is also a gather message, i.e. up to 16 addresses are provided per
-message.</p></li>
-<li><p>Untyped writes. They are the counterpart of the untyped reads.</p></li>
-</ul>
-
-<p>The problem is that IVB does not support stateless accesses for these messages, so
-surfaces are required. Secondly, stateless messages are not that interesting anyway,
-since all of them require a header, which is slow to assemble.</p>
-
-<h2>Implemented solution</h2>
-
-<p>The solution is actually quite simple. Even with no stateless support, it is
-actually possible to simulate it with a surface. As one may see in the run-time
-code in <code>intel/intel_gpgpu.c</code>, we simply create a surface:</p>
-
-<ul>
-<li><p>2GB big</p></li>
-<li><p>Which starts at offset 0</p></li>
-</ul>
-
-<p>Surprisingly, this surface can actually map the complete GTT address space which
-is 2GB big. One may look at <code>flat_address_space</code> unit test in the run-time code
-that creates and copies buffers in such a way that the complete GTT address
-space is traversed.</p>
-
-<p>This solution brings a pretty simple implementation on the compiler side.
-Basically, there is nothing to do when translating from LLVM to Gen ISA. A
-pointer to <code>__global</code> or <code>__constant</code> memory is simply a 32-bit offset in that
-surface.</p>
-
-<h2>Related problems</h2>
-
-<p>There is one drawback for this approach. Since we use a 2GB surface that maps
-the complete GTT space, there is no protection at all. Each write can therefore
-potentially modify any buffer including the command buffer, the frame buffer or
-the kernel code. There is <em>no</em> protection at all in the hardware to prevent
-that.</p>
-
-<p><a href="../README.html">Up</a></p>
+++ /dev/null
-<h1>Scalar Intermediate Representation</h1>
-
-<p>The IR code is included in <code>src/ir/</code> of the compiler code base.
-The IR as designed in this compiler is the fruit of a long reflection I mostly
-had with Thomas Raoux. Note that I usually call it "Gen IR".</p>
-
-<h2>Scalar vs vector IR</h2>
-
-<p>This is actually the major question: do we need a vector IR or a scalar IR? On
-the LLVM side, we have both. LLVM IR can manipulate vectors and scalars (and
-even generalized values, but we can ignore those for now).</p>
-
-<p>For that reason, the Clang front-end generates both scalar and vector code.
-Typically, a <code>uint4</code> variable will output a vector of 4 integers. Arithmetic
-computations will be directly done on vector variables.</p>
-
-<p>On the HW side, the situation is completely different:</p>
-
-<ul>
-<li><p>We are going to use the parallel mode (align1) i.e. the struct-of-array mode
-for the EU. This is a SIMD scalar mode.</p></li>
-<li><p>The only source of vectors we are going to have is on the sends instructions
-(and marginally for some other instructions like the div_rem math instruction)</p></li>
-</ul>
-
-<p>One may therefore argue that we need vector instructions to handle the sends.
-Send will indeed require both vector destinations and sources. This may be a
-strong argument <em>for</em> vectors in the IR. However, the situation is not that
-good.</p>
-
-<p>Indeed, if we look carefully at the send instructions we see that they will
-require vectors that are <em>not</em> vectors in LLVM IR. This code for example:</p>
-
-<pre><code>__global uint4 *src;
-uint4 x = src[get_global_id(0)];
-</code></pre>
-
-<p>will be translated into an untyped write in the Gen ISA. Unfortunately, the
-address and the values to write are in the <em>same</em> vector. However, LLVM IR will
-output a store like:</p>
-
-<p><code>store(%addr, %value)</code></p>
-
-<p>which basically uses one scalar (the address) and one value (the vector to
-write). Therefore even if we handle vectors in the IR, that will not directly
-solve the problem we have at the end for the send instructions.</p>
-
-<p>We therefore decided to go the other direction:</p>
-
-<ul>
-<li><p>We have a purely scalar IR</p></li>
-<li><p>To replace vectors, we simply use multiple sources and destinations</p></li>
-<li><p>Real vectors required by send instructions are handled at the very bottom of
-the stack in the register allocation passes.</p></li>
-</ul>
-
-<p>This leads to a very simple intermediate representation which is mostly a pure
-scalar RISC machine.</p>
-
-<h2>Very limited IR</h2>
-
-<p>The other major question, in particular when you look similar stacks like NVidia
-PTX, is:</p>
-
-<p>do we need to encode in the IR register modifiers (abs, negate...) and immediate
-registers (like in add.f x y 1.0)?</p>
-
-<p>Contrary to other IRs (PTX and even LLVM both support immediates), we
-chose to have a very simple IR, much simpler than the final ISA, and to merge
-back what we need at the instruction selection pass. Since we need instruction
-selection, let us keep the IR simple.</p>
-
-<p>Also, there are a lot of major issues that cannot be covered in the IR and
-must be handled specifically at the very end of the code generation:</p>
-
-<ul>
-<li><p>send vectors (see previous section)</p></li>
-<li><p>send headers (value and register allocation) which are also part of the vector
-problem</p></li>
-<li><p>SIMD8 mode in SIMD16 code. Some send messages do not support SIMD16 encoding
-and require SIMD8. Typical examples are typed writes, i.e. scatters to textures.
-This cannot easily be encoded in a regular scalar IR.</p></li>
-</ul>
-
-<p>For these reasons, most of the problems directly related to Gen naturally find
-their solutions in either the instruction selection or the register allocator.</p>
-
-<p>This leads to the following strategy:</p>
-
-<ul>
-<li><p>Keep the IR very simple and limited</p></li>
-<li><p>Use all the analysis tools you need in the IR before the final code generation
-to build any information you need. This is pure "book-keeping".</p></li>
-<li><p>Use any previous analysis and finish the job at the very end</p></li>
-</ul>
-
-<p>This classical approach limits the complexity of the IR while forcing us
-to write the proper tools in the final stages.</p>
-
-<h2>Why not use LLVM IR directly?</h2>
-
-<p>We hesitated a long time between writing a dedicated IR (as we did) and just
-using LLVM IR. Indeed, LLVM comes with a large set of tools that are part of
-"LLVM backends". LLVM provides a lot of tools to perform instruction
-selection (<code>SelectionDAG</code>) and register allocation. Two things however
-prevented us from choosing this path:</p>
-
-<ul>
-<li><p>We only have a limited experience with LLVM and no experience at all with the
-LLVM backends</p></li>
-<li><p>LLVM register allocators do not handle at all the peculiarities of Gen:</p>
-
-<ul>
-<li><p>flexible register file. Gen registers are more like memory than registers
-and can be freely allocated and aliased. LLVM register allocators only
-support partial aliasing like x86 machines do (rax -> eax -> ax)</p></li>
-<li><p>no proper tools to handle vectors in the register allocator as we need for
-sends</p></li>
-</ul></li>
-</ul>
-
-<p>Since we will need to do some significant work anyway, this led us to choose a
-more hard-coded path with an in-house IR. Note that this will not prevent us from
-implementing a LLVM backend "by the book" later, as NVIDIA does today with PTX
-(using a LLVM backend to do the LLVM IR -> PTX conversion).</p>
-
-<h2>SSA or no SSA</h2>
-
-<p>Since we have a purely scalar IR, implementing a SSA transformation on the IR
-may be convenient. However, most of the literature about compiler back-ends uses
-non-SSA representations of the code. Since the primary goal is to write a
-compiler <em>back-end</em> (instruction selection, register allocation and instruction
-scheduling), we keep the code in non-SSA form, leaving the higher-level
-optimizations to LLVM.</p>
-
-<h2>Types, registers, instructions, functions and units</h2>
-
-<p>The IR is organized as follows:</p>
-
-<ul>
-<li><p>Types (defined in <code>src/ir/type.*pp</code>). These are scalar types only. Since the
-code is completely lowered down, there is no more reference to structures,
-pointers or vectors. Everything is scalar values and when "vectors" or
-"structures" would be needed, we use instead multiple scalar sources or
-destinations.</p></li>
-<li><p>Registers (defined in <code>src/ir/register.*pp</code>). They are untyped (since Gen
-registers are untyped) and we have 65,535 of them per function</p></li>
-<li><p>Instructions (defined in <code>src/ir/instruction.*pp</code>). They are typed (to
-distinguish integer and FP adds for example) and possibly support multiple
-destinations and sources. We also provide a convenient framework to introspect
-the instruction in a simple (and memory efficient) way</p></li>
-<li><p>Functions (defined in <code>src/ir/function.*pp</code>). They are basically the
-counterpart of LLVM functions or OpenCL kernels. Note that function arguments are a
-problem. We actually use the PTX ABI. Everything smaller than the machine word
-size (i.e. 32 bits for Gen) is passed by value in a register. Everything
-bigger than that is passed by pointer with a ByVal attribute.
-Note that this requires some special treatment in the IR (see below) to make the
-code faster by replacing function argument loads by "pushed constants". We
-also define one "register file" per function i.e. the registers are defined
-relative to the function that uses them. Each function is made of basic
-blocks i.e. sequences of instructions that are executed linearly.</p></li>
-<li><p>Units (defined in <code>src/ir/unit.*pp</code>). Units are just a collection of
-functions and constants (not supported yet).</p></li>
-</ul>
-
-<h2>Function arguments and pushed constants</h2>
-
-<p>Gen can push values into the register file i.e. some registers are preset when
-the kernel starts to run. As detailed previously, the PTX ABI is convenient
-since every argument is either one register or one pointer to load from or to
-store to.</p>
-
-<p>However, when a pointer is used for an argument, loads are issued which may be
-avoided by using constant pushes.</p>
-
-<p>Once again, OCL makes the task a bit harder than expected. Indeed, the C
-semantics apply to function arguments as well.</p>
-
-<p>Look at these three examples:</p>
-
-<h3>Case 1. Direct loads -> constant push can be used</h3>
-
-<p><code>
-struct foo { int x; int y; }; <br/>
-__kernel void case1(__global int *dst, struct foo bar) <br/>
-{<br/>
- dst[get_global_id(0)] = bar.x + bar.y;<br/>
-}
-</code></p>
-
-<p>We use a <em>direct</em> <em>load</em> for <code>bar</code> with <code>bar.x</code> and <code>bar.y</code>. Values can be
-pushed into registers and we can replace the loads by register reads.</p>
-
-<h3>Case 2. Indirect loads -> we need to load the values from memory</h3>
-
-<p><code>
-struct foo { int x[16]; }; <br/>
-__kernel void case2(__global int *dst, struct foo bar) <br/>
-{<br/>
- dst[get_global_id(0)] = bar.x[get_local_id(0)];<br/>
-}
-</code></p>
-
-<p>We use an indirect load with <code>bar.x[get_local_id(0)]</code>. Here we need to issue a
-load from memory (well, actually, we could do a gather from registers, but it is
-not supported yet).</p>
-
-<h3>Case 3. Writes to arguments -> we need to spill the values to memory first</h3>
-
-<p><code>
-struct foo { int x[16]; }; <br/>
-__kernel void case3(__global int *dst, struct foo bar) <br/>
-{<br/>
- bar.x[0] = get_global_id(1);<br/>
- dst[get_global_id(0)] = bar.x[get_local_id(0)];<br/>
-}
-</code></p>
-
-<p>Here the values are written before being read. This causes some trouble since
-we are running in SIMD mode. Indeed, we only have <em>one</em> instance of the
-function arguments in memory, while <em>many</em> SIMD lanes and actually <em>many</em> hardware
-threads are running at the same time. This means that we cannot write the data
-to memory directly. We need to allocate a private area for each SIMD lane.</p>
-
-<p>In that case, we need to spill back the function arguments into memory. We spill
-once per SIMD lane. Then, we read from this private area rather than the
-function arguments directly.</p>
-
-<p>This analysis is partially done today in <code>src/ir/lowering.*pp</code>. We identify all
-the cases but only the case with constant pushing is fully implemented.
-Actually, the last two cases are easy to implement but this requires one or two
-days of work.</p>
-
-<h2>Value and liveness analysis tools</h2>
-
-<p>You may also notice that we provide a complete framework for value analysis
-(i.e. to figure out when a value or instruction destination is used and where the
-instruction sources come from). The code is in <code>src/ir/value.*pp</code>. Well, today,
-this code burns a crazy amount of memory (it uses std::set all over the
-place) but it at least provides the analysis required by many other passes.
-Compacting the data structures and using O(n) algorithms instead of the O(ln(n))
-ones are on the TODO list for sure :-)</p>
-
-<p>Finally, we also provide a liveness analysis tool which simply figures out which
-registers are alive at the end of each block (classically "live out" sets).</p>
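To make the "live out" computation concrete, here is a minimal sketch of the classical backward dataflow iteration (illustrative Python only, not the actual C++ code in the code base; the block names and register names are hypothetical):

```python
# Classical liveness analysis sketch (illustration only, not Beignet code).
# liveOut[b] = union of liveIn[s] over the successors s of b
# liveIn[b]  = use[b] | (liveOut[b] - def[b]), iterated to a fixed point.
def liveness(blocks, succ):
    """blocks maps a block name to (use_set, def_set); succ maps a block
    name to its successor names. Returns (live_in, live_out) dicts."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, (use, defs) in blocks.items():
            out = set().union(*(live_in[s] for s in succ[b])) if succ[b] else set()
            new_in = use | (out - defs)
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out

# Hypothetical CFG: a loop block reading "b" (defined in the entry block)
# keeps "b" alive at the end of the entry block.
blocks = {"entry": ({"a"}, {"b"}), "loop": ({"b"}, {"c"}), "exit": ({"c"}, set())}
succ = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
live_in, live_out = liveness(blocks, succ)
```

The fixed-point loop is the textbook formulation; the actual implementation only needs the live-out sets per block.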
-
-<p><a href="../README.html">Up</a></p>
+++ /dev/null
-<h1>Unstructured Branches</h1>
-
-<p>A major challenge in writing an OpenCL compiler is certainly to handle any kind of
-branches. Indeed, LLVM does not make any distinction between structured and
-unstructured branches. See <a href="http://llvm.org/docs/LangRef.html">here</a> for a complete description of
-the LLVM assembly specification.</p>
-
-<p>The C branching code is simply lowered down to the following instructions:</p>
-
-<ul>
-<li><code>ret</code> to return from the current function</li>
-<li><code>br</code> that, if conditional, jumps to one of two destinations (one for the
-taken branch and one for the other).</li>
-<li><code>switch</code> that implements the C switch/case construct.</li>
-<li><code>indirectbr</code> that implements a jump table</li>
-<li><code>invoke</code> and <code>resume</code> mostly used to handle exceptions</li>
-</ul>
-
-<p>Exceptions and jump tables are not supported in OpenCL. Switch cases can be
-lowered down to a sequence of if/else statements (using a divide and conquer
-approach, a switch/case can be dispatched in log(n) complexity where n is the
-number of targets).</p>
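As a rough illustration of this divide-and-conquer lowering (a hypothetical sketch, not the actual compiler code; case values and labels are invented), a sorted list of case values can be turned into a balanced if/else tree so each dispatch costs O(log n) comparisons:

```python
# Hypothetical sketch of switch lowering (not the actual compiler code):
# build a balanced if/else tree over sorted case values.
def lower_switch(cases, default):
    """cases: sorted list of (value, target_label). Returns either a target
    label (str) or a tuple (pivot, eq_target, lt_subtree, gt_subtree)."""
    if not cases:
        return default
    mid = len(cases) // 2
    val, target = cases[mid]
    return (val, target,
            lower_switch(cases[:mid], default),
            lower_switch(cases[mid + 1:], default))

def dispatch(tree, x):
    """Walk the if/else tree like the generated code would; also count the
    number of comparisons performed."""
    steps = 0
    while isinstance(tree, tuple):
        val, eq_target, lt, gt = tree
        steps += 1
        if x == val:
            return eq_target, steps
        tree = lt if x < val else gt
    return tree, steps

tree = lower_switch([(1, "B1"), (4, "B4"), (9, "B9"), (16, "B16")], "Bdefault")
```

With 4 targets, any dispatch takes at most 2-3 comparisons instead of a linear scan.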
-
-<p>This leaves us with the <code>br</code> and <code>ret</code> instructions to implement properly.</p>
-
-<h2>Solution 1 - Using Gen structured branches</h2>
-
-<p>Gen structured branches are the following instructions:</p>
-
-<p><code>if</code> <code>else</code> <code>endif</code> <code>break</code> <code>continue</code> <code>while</code> <code>brd</code> <code>brc</code></p>
-
-<p>Transforming the LLVM IR code into structured code basically amounts to
-reverse-engineering the LLVM code back into the original C code.
-Unfortunately, there are several key problems:</p>
-
-<ul>
-<li>OpenCL supports the <code>goto</code> keyword that may jump to an arbitrary location</li>
-<li>LLVM can transform the control flow graph in any kind of form</li>
-<li>Worse, a reducible control flow graph can be turned into an irreducible
-one by the optimizer.</li>
-</ul>
-
-<p>This can lead to complicated code transformations and basic block duplication. The
-specification allows the compiler to abort if an irreducible control flow is
-detected, but as an implementor, it is quite awkward to abort the compilation
-because the optimizer turned a reducible CFG into an irreducible one. Using
-structured branches opens the door to many corner cases.</p>
-
-<p>The thing is, there exists a pretty elegant solution that can be almost seamlessly
-supported by Gen. This is the solution we retained.</p>
-
-<h2>Solution 2 - Linearizing the control flow graph</h2>
-
-<p>The general problem is to map a general control flow graph to a SIMD machine.
-The problem is fairly well understood today. A recent research paper
-dedicated to OpenCL-like languages which use the "SPMD" (single program multiple
-data) programming model presents interesting insights about how to map such
-languages to SIMD architectures (see
-<a href="http://www.cdl.uni-saarland.de/papers/karrenberg_opencl.pdf">here</a>).</p>
-
-<h3>Core idea</h3>
-
-<ul>
-<li><p>Linearizing the CFG initially consists in removing all forward branches and
-replacing them with predication. Indeed, the program will still be correct if you
-predicate instructions instead of using forward jumps. This is basically
-a control flow to data flow conversion.</p></li>
-<li><p>Of course, removing all forward branches is inefficient. To improve that, we
-simply introduce "if conditions" at the head of basic blocks to know if we run
-the basic block. If no lane is going to be activated in the basic block, we
-jump to another basic block where <em>potentially</em> some lanes are going to be
-reactivated.</p></li>
-</ul>
-
-<p>Consider the following CFG:</p>
-
-<pre>
-o-------o
-| |
-| 1 |---->-----o
-| | |
-o-------o |
- | |
- | |
-o-------o |
-| | |
-| 2 |---->-----------o
-| | | |
-o-------o | |
- | | |
- | | |
- | o------o | |
- | | | | |
- | v | | |
-o-------o | | |
-| | | | |
-| 3 | | | |
-| | | | |
-o-------o | | |
- | | | | |
- | o------o | |
- | | |
-o-------o | |
-| | | |
-| 4 |<---------o |
-| | |
-o-------o |
- | |
- | |
-o-------o |
-| | |
-| 5 |<----------------o
-| |
-o-------o
-</pre>
-
-<p>Mapping it to a SIMD machine may seem challenging. Actually it is not too
-complicated. The problem is with the 2->5 jump. Indeed, we have to be sure that
-we are not missing any computation done in block 4.</p>
-
-<p>To do so:</p>
-
-<ul>
-<li>Instead of jumping from block 2 to block 5, we jump from block 2 to block 4.</li>
-<li>We implement a <code>JOIN</code> point on top of block 4. We check if any lane is going
-to be reactivated for block 4. If not, we jump to block 5.</li>
-</ul>
-
-<p>This leads to the following linearized CFG:</p>
-
-<pre>
-o-------o
-| |
-| 1 |---->-----o
-| | |
-o-------o |
- | |
- | |
-o-------o |
-| | |
-| 2 |---->-----------o
-| | | |
-o-------o | |
- | | |
- | | |
- | o--<---o | |
- | | | | |
- | v | | |
-o-------o | | |
-| | | | |
-| 3 | ^ | |
-| | | | |
-o-------o | | |
- | | | | |
- | o-->---o | |
- | | |
-o-------o | |
-| |==========|=====|====O
-| 4 |<---------|-----o |
-| |<---------o |
-o-------o |
- | |
- | |
-o-------o |
-| | |
-| 5 |<====================O
-| |
-o-------o
-</pre>
-
-<p>There is a new jump from block 4 to block 5.</p>
-
-<h3>Implementation on Gen</h3>
-
-<p>When using structured branches, Gen supports auto-masking i.e. based on the
-branches which are taken, the control flow is properly handled and masks are
-automatically applied on all instructions.</p>
-
-<p>However, there is no similar support for unstructured branches. We therefore
-decided to mask instructions manually and use a single program flow. This is
-actually quite easy to do since Gen is able to predicate any branch.</p>
-
-<p>Now, how to evaluate the if conditions in an efficient way?</p>
-
-<p>The choice we made is to use <em>per-lane block IPs</em>: for each SIMD lane, we store a
-short (16 bits) in a regular 256-bit GPR (general purpose
-register). This "blockIP" register is used in the following way:</p>
-
-<p>At the beginning of each block, we compare the blockIP register with the ID of
-the block. The lane is going to be <em>activated</em> if its blockIP is <em>smaller than or
-equal to</em> the ID of the block. Otherwise, the lane is deactivated.</p>
-
-<p>Therefore, we build a flag register at the entry of each basic block with a
-single 16-wide uint16_t compare. If no lane is activated, a jump is performed to
-the next block where some lanes are going to be activated.</p>
-
-<p>Since these are regular jumps, we just use the <code>jmpi</code> instruction. With the help of
-predication, we can express all the different possibilities:</p>
-
-<ul>
-<li>backward branches are always taken if <em>any</em> of the lanes in the predicate is true.
-We just use <code><+f0.0.anyh></code> predication.</li>
-<li>forward branches are <em>not</em> taken if some of the lanes are going to be activated in
-the next block. We therefore compare the blockIP with the ID of the <em>next</em>
-block. If all of them are strictly greater than the ID of the next block, we
-jump. We therefore use the <code><+f0.0.allh></code> predicate in that case.</li>
-<li><code>JOIN</code> points are even simpler. We simply jump if none of the lanes is activated.
-We therefore use the <code><-f0.0.anyh></code> predicate.</li>
-</ul>
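The predicate rules above can be made concrete with a small simulation (a hypothetical Python sketch, not Gen code). Here a lane counts as active in a block when its blockIP is less than or equal to the block ID, which matches the forward-branch test ("all strictly greater" means no lane will run the next block):

```python
# Hypothetical simulation of the per-lane blockIP scheme (not Gen code).
def active_mask(block_ips, block_id):
    """A lane runs the block when its blockIP <= block ID."""
    return [ip <= block_id for ip in block_ips]

def take_backward_jump(block_ips, target_id):
    """Backward branches are taken if *any* lane is active (anyh-style)."""
    return any(active_mask(block_ips, target_id))

def take_forward_jump(block_ips, next_id):
    """A forward branch is taken when *all* blockIPs are strictly greater
    than the next block ID (allh-style), i.e. no lane will run it."""
    return all(ip > next_id for ip in block_ips)

lane_ips = [3, 5, 5, 9]  # one 16-bit blockIP per SIMD lane (4 lanes here)
```

In hardware the mask is a flag register produced by one 16-wide compare; the booleans here just mimic that flag per lane.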
-
-<p>The complete encoding is done in <code>src/backend/gen_insn_selection.cpp</code>. Forward
-branches are handled by <code>SimpleSelection::emitForwardBranch</code>. Backward branches
-are handled by <code>SimpleSelection::emitBackwardBranch</code>. Finally, since <code>JOIN</code> points
-are at the top of each basic block, they are handled by
-<code>SimpleSelection::emitLabelInstruction</code>.</p>
-
-<h3>Computing <code>JOIN</code> points</h3>
-
-<p>The last problem is to compute the <code>JOIN</code> points i.e. we need to know whether we need
-to jump at the beginning of each block and, if we do, what the target of the
-branch is. The code is relatively straightforward and can be found in
-<code>src/backend/context.cpp</code>, in <code>Context::buildJIPs</code>.
-<br/>
-Actually, the current implementation is not that elegant. A colleague, Thomas
-Raoux, has a simpler and better idea to handle it.</p>
-
-<h3>Advantages and drawbacks of the method</h3>
-
-<ul>
-<li>The method has one decisive advantage: it is simple and extremely robust. It can
-handle any kind of CFG (reducible or not) and does not require any
-transformation. The use of shorts is also not random: a 16-wide compare on shorts
-is issued in 2 cycles (so it is twice as fast as a 16-wide 32-bit compare).</li>
-<li>The main drawback is performance. Even if it is not so bad, we still need
-more instructions than if we used structured branches. Mostly:
-<ul>
-<li>one or two instructions for <code>JOIN</code> points</li>
-<li>three instructions for backward and forward jumps (two more than structured
-branches, which just require the branch instruction itself)</li>
-</ul></li>
-</ul>
-
-<p>Note that all extra instructions are 16-bit instructions (i.e. they use shorts)
-so they will only cost 2 cycles anyway.</p>
-
-<p>The last point is that the Gen encoding restricts the conditional modifier and the
-predicate to be the same in an instruction. This requires copying or recomputing
-the flag register for compares and selects, so one more instruction is required
-for these two instructions. Once again, this only costs 2 cycles.</p>
-
-<h2>Remarks on <code>ret</code> instructions</h2>
-
-<p>Since we can handle any kind of CFG, handling the return statements is
-relatively straightforward. We first create one return block at the end of the
-program. Then we replace all other returns by an unconditional jump to this
-block. The CFG linearization will take care of the rest.
-We then simply encode the (only) return instruction as an End-Of-Thread
-message (EOT).</p>
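The return lowering described above can be sketched as a simple CFG rewrite (hypothetical Python, not the actual pass; block and instruction names are invented):

```python
# Hypothetical sketch of return lowering (not the actual Beignet pass):
# create one exit block and replace every `ret` with a jump to it, so only
# one return remains (later encoded as an EOT message).
def lower_returns(blocks):
    """blocks: {name: [instruction strings]}. Returns a new CFG with a
    single 'exit' block holding the only `ret`."""
    lowered = {}
    for name, insns in blocks.items():
        lowered[name] = ["br exit" if insn == "ret" else insn for insn in insns]
    lowered["exit"] = ["ret"]
    return lowered

cfg = {"b0": ["add", "ret"], "b1": ["mul", "ret"]}
lowered = lower_returns(cfg)
```

After this rewrite the CFG linearization treats the new jumps like any other unconditional branch.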
-
-<h2>Code examples</h2>
-
-<p>Some tests were written to assert the correctness of the CFG linearization and the
-code generation. They can be found in the <em>run-time</em> code base here:</p>
-
-<p><code>utest/compiler_if_else.cpp</code></p>
-
-<p><code>utest/compiler_lower_return0.cpp</code></p>
-
-<p><code>utest/compiler_lower_return1.cpp</code></p>
-
-<p><code>utest/compiler_lower_return2.cpp</code></p>
-
-<p><code>utest/compiler_short_scatter.cpp</code></p>
-
-<p><code>utest/compiler_unstructured_branch0.cpp</code></p>
-
-<p><code>utest/compiler_unstructured_branch1.cpp</code></p>
-
-<p><code>utest/compiler_unstructured_branch2.cpp</code></p>
-
-<p><code>utest/compiler_unstructured_branch3.cpp</code></p>
-
-<p><a href="../README.html">Up</a></p>
--- /dev/null
+Beignet
+=======
+
+Beignet is an open source implementation of the OpenCL specification - a generic
+compute oriented API. This code base contains the code to run OpenCL programs on
+Intel GPUs. It basically defines and implements the OpenCL host functions
+required to initialize the device, create the command queues, the kernels and
+the programs, and run them on the GPU. The code base also contains the compiler
+part of the stack, which is included in `backend/`. For more specific information
+about the compiler, please refer to `backend/README.md`.
+
+How to build
+------------
+
+The project uses CMake with three profiles:
+
+1. Debug (-g)
+2. RelWithDebInfo (-g with optimizations)
+3. Release (only optimizations)
+
+Basically, from the root directory of the project:
+
+`> mkdir build`
+
+`> cd build`
+
+`> cmake ../ # to configure`
+
+Choose whatever you want for the build.
+
+If you use the interactive `ccmake` instead, press 'c' to configure and 'g' to
+generate the build files.
+
+`> make`
+
+The project depends on several external libraries:
+
+- Several X components (XLib, Xfixes, Xext)
+- libdrm libraries (libdrm and libdrm\_intel)
+- Various LLVM components
+- The compiler backend itself (libgbe)
+- Mesa git master version built with gbm enabled to support extension cl\_khr\_gl\_sharing.
+
+CMake will check the dependencies and will complain if it does not find them.
+
+CMake will also build the backend project. Please refer to
+[[OpenCL Gen Backend|Beignet/Backend]] for more dependencies.
+
+Once built, the run-time produces a shared object, libcl.so, which basically
+implements the OpenCL API directly. A set of tests is also produced. They may
+be found in `utests/`.
+
+Note that the compiler depends on LLVM (Low-Level Virtual Machine project).
+Right now, the code has been compiled with LLVM 3.1, 3.2 and 3.3. It will not
+compile with anything older.
+
+[http://llvm.org/releases/](http://llvm.org/releases/)
+
+Also note that the code was compiled with GCC 4.6 and GCC 4.7. Since the code uses
+fairly recent C++11 features, you may expect problems with older compilers. Last
+time I tried, the code broke ICC 12 and Clang with internal compiler errors
+while compiling anonymous nested lambda functions.
+
+How to run
+----------
+
+Apart from the OpenCL library itself, which can be used by any OpenCL application,
+this code also produces various tests to ensure the compiler and the run-time
+consistency. This small test framework uses a simple C++ registration system to
+register all the unit tests.
+
+You need to set the variable `OCL_KERNEL_PATH` to locate the OCL kernels. They
+are shipped with the run-time in `./kernels`.
+
+Then in `utests/`:
+
+`> ./utest_run`
+
+will run all the unit tests one after the other.
+
+`> ./utest_run some_unit_test0 some_unit_test1`
+
+will only run the `some_unit_test0` and `some_unit_test1` tests.
+
+Supported Hardware
+------------------
+
+The code was tested on IVB GT2 with the Ubuntu and Fedora distributions.
+Only IVB is supported right now. Actually, the code was only run on IVB GT2, so you
+may expect some issues with IVB GT1.
+
+TODO
+----
+
+The run-time is far from complete. Most of the pieces have been put
+together to test and develop the OpenCL compiler. A partial list of things to
+do:
+
+- Complete cl\_khr\_gl\_sharing support. We still lack some API implementations such
+  as clCreateFromGLBuffer, clCreateFromGLRenderbuffer, clGetGLObjectInfo... Currently,
+  the working APIs are clCreateFromGLTexture and clCreateFromGLTexture2D.
+
+- Support for events.
+
+- Check that NDRangeKernels can be pushed into _different_ queues from several
+ threads.
+
+- Support for non-blocking Enqueue\*Buffer calls. Right now we only use the map
+  extension to implement those Enqueue\*Buffer functions.
+
+- No state tracking at all. One batch buffer is created for each "draw call"
+  (i.e. for each NDRangeKernels). This is really inefficient since some
+  expensive pipe controls are issued for each batch buffer.
+
+- Valgrind reports some leaks in libdrm. It sounds like a false positive but it
+  has to be checked. The same goes for LLVM: there is one leak there to check.
+
+More generally, everything in the run-time that triggers the "FATAL" macro means
+that something that must be supported is not implemented properly (either it
+does not comply with the standard or it is just missing).
+
+Project repository
+------------------
+Right now, we host our project on fdo at: git://anongit.freedesktop.org/beignet.
+
+The team
+--------
+This project was created by Ben Segovia when he was working for Intel. Now a
+team in the China OTC graphics department continues to work on this project.
+The official contact for this project is: Zou Nanhai (<nanhai.zou@intel.com>).
+
+How to contribute
+-----------------
+You are always welcome to contribute to this project; you just need to subscribe
+to the Beignet mailing list and send patches to it for review.
+The official mailing list is:
+http://lists.freedesktop.org/mailman/listinfo/beignet
Limitations
-----------
-Today, the compiler is far from complete. See [here](doc/TODO.html) for a
+Today, the compiler is far from complete. See [[here|Backend/TODO]] for an
(incomplete) list of things to do.
Interface with the run-time
Several key decisions may use the hardware in an unusual way. See the following
documents for the technical details about the compiler implementation:
-- [Flat address space](doc/flat\_address\_space.html)
-- [Unstructured branches](doc/unstructured\_branches.html)
-- [Scalar intermediate representation](doc/gen\_ir.html)
-- [Clean backend implementation](doc/compiler_backend.html)
-
-Ben Segovia.
+- [[Flat address space|flat_address_space]]
+- [[Unstructured branches|unstructured_branches]]
+- [[Scalar intermediate representation|gen_ir]]
+- [[Clean backend implementation|compiler_backend]]
+Ben Segovia.
- Adding support for atomic extensions.
-- Finishing the handling of function arguments (see the [IR
- description](gen_ir.html) for more details)
+- Finishing the handling of function arguments (see the [[IR
+ description|gen_ir]] for more details)
- Adding support for linking IR units together. OpenCL indeed allows creating
  programs from several sources
- Implementing atomic extensions.
-- Implementing register spilling (see the [compiler backend
- description](./compiler_backend.html) for more details)
+- Implementing register spilling (see the [[compiler backend
+ description|compiler_backend]] for more details)
- Implementing proper instruction selection. A "simple" tree matching algorithm
should provide good results for Gen
Parts of the code leak memory when exceptions are used. There are some pointers
to track and replace with std::unique_ptr. Note that we also added a custom memory
debugger that nicely complements Valgrind (i.e. it is fast).
-
really speak about the final code generation passes that you may find in
`src/backend`.
-As explained in [the scalar IR presentation](./gen_ir.html), we bet on a very
+As explained in [[the scalar IR presentation|gen_ir]], we bet on a very
simple scalar IR to make it easy to parse and modify. The idea is to fix the
unrelated, very Gen-specific problems where we can, i.e. when the code is
generated.
straightforward. We just forward the selection code using the physically
allocated registers. There is nothing special here. Just boilerplate.
-[Up](../README.html)
-
potentially modify any buffer including the command buffer, the frame buffer or
the kernel code. There is *no* protection at all in the hardware to prevent
that.
-
-[Up](../README.html)
-
implementing a LLVM backend later "by the book", as NVIDIA does today with PTX
(using a LLVM backend to do the LLVM IR -> PTX conversion)
+
SSA or no SSA
-------------
Finally, we also provide a liveness analysis tool which simply figures out which
registers are alive at the end of each block (classically "live out" sets).
-
-[Up](../README.html)
-
we are not missing any computation done in block 4.
To do so:
-- Instead of jumping from block 2 to block 5, we jump from block 2 to block 4.
+- Instead of jumping from block 2 to block 5, we jump from block 2 to block 4.
- We implement a `JOIN` point on top of block 4. We check if any lane is going
to be reactivated for the block 4. If not, we jump to block 5.
block. The CFG linearization will take care of the rest.
We then simply encode the (only one) return instruction as a End-Of-Thread
message (EOT).
-
Code examples
-------------
`utest/compiler_unstructured_branch3.cpp`
-[Up](../README.html)
-