--- /dev/null
+Optimization Guide
+====================
+
+All the SIMD optimization principle also apply to Beignet optimization.
+Furthermore, there are some special tips for Beignet optimization.
+
+1. It is recommended to choose multiple of 16 work group size. Too much SLM usage may reduce parallelism at group level.
+ If kernel uses large amount SLM, it's better to choose large work group size. Please refer the following table for recommendations
+ with some SLM usage.
+| Amount of SLM | 0 | 4K | 8K | 16K | 32K |
+| WorkGroup size| 16 | 64 | 128 | 256 | 512 |
+
+2. GEN7's read/write on global memory with DWORD and DWORD4 are significantly faster than read/write on BYTE/WORD.
+ Use DWORD or DWORD4 to access data in global memory if possible. If you cannot avoid the byte/word access, try to do it on SLM.
+
+3. Use float data type as much as possible.
+
+4. Avoid using long. GEN7's performance for long integer is poor.
+
+5. If there is a small constant buffer, define it in the kernel instead of using the constant buffer argument if possible.
+ The compiler may optimize it if the buffer is defined inside kernel.
+
+6. Avoid unnecessary synchronizations, both in the runtime and in the kernel. For examples, clFinish and clWaitForEvents in runtime
+ and barrier() in the kernel.
+
+7. Consider native version of math built-ins, such as native\_sin, native\_cos, if your kernel is not precision sensitive.
+
+8. Try to eliminate branching as much as possible. For example using min, max, clamp or select built-ins instead of if/else if possible.