From cac02034f6f8d312f2eb793f9db134a6eca6a2c0 Mon Sep 17 00:00:00 2001
From: Edward Yang
Date: Thu, 20 Dec 2018 11:14:21 -0800
Subject: [PATCH] Extend README for ATen/native/cpu (#15437)

Summary:
Signed-off-by: Edward Z. Yang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/15437

Differential Revision: D13529436

Pulled By: ezyang

fbshipit-source-id: 2e2193d54ea7f7626fe7392e4d0c130c2f87a76f
---
 aten/src/ATen/native/cpu/README    | 30 ---------------
 aten/src/ATen/native/cpu/README.md | 78 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 78 insertions(+), 30 deletions(-)
 delete mode 100644 aten/src/ATen/native/cpu/README
 create mode 100644 aten/src/ATen/native/cpu/README.md

diff --git a/aten/src/ATen/native/cpu/README b/aten/src/ATen/native/cpu/README
deleted file mode 100644
index 338692e..0000000
--- a/aten/src/ATen/native/cpu/README
+++ /dev/null
@@ -1,30 +0,0 @@
-TODO: Clarify and add more documentation all around.
-
-All of the *.cpp files in this folder will be compiled under all compiler
-flags specified by CPU_CAPABILITY_FLAGS in aten/src/ATen/CMakeLists.txt.
-
-The purpose of this is to allow the compilation with various compiler
-flags to enable features such as AVX instructions, while using runtime
-dispatch, which makes sure only valid instructions will be used on any
-given platform.
-
-Vec256.h provides a generic implementation of a vec256 type that allows
-the programmer to write code packing various primitives (such as floats)
-within 256bit registers. vec256 defines various operators such as + and *
-and provides functions to allow operations such as max, min, etc.
-
-As an example ReduceOpsKernel.cpp implements a generic kernel_ that reduces
-an entire array using a given associative binary operation such as +.
-
-More explicity, calling kernel_ with template argument std::plus will cause
-it to sum up the entire array into a single value.
-
-ReduceOpsKernel.cpp uses the CPU_CAPABILITY_* macros to "know" under which
-compiler flags it is currently compiled. This allows the programmer to write
-generic code, which will be compiled under multipled compilation settings.
-
-../ReduceOps.cpp now includes the header ReduceOpsKernel.h, which contains
-a generic definition of sumImplAll. This function allows the user to reduce
-over a dimension or all dimensions. The appropiate capability is chosen at
-runtime using cpuinfo. If the current platform has avx, sumImpl will be set
-to umImplAll.
diff --git a/aten/src/ATen/native/cpu/README.md b/aten/src/ATen/native/cpu/README.md
new file mode 100644
index 0000000..d084e0f
--- /dev/null
+++ b/aten/src/ATen/native/cpu/README.md
@@ -0,0 +1,78 @@
+The most important things to know:
+
+**Don't add a kernel to this folder unless you want it to be
+compiled multiple times for different instruction sets.** Yes,
+this folder is named `cpu`, but that doesn't mean you should put
+any old CPU kernel in it. Only put CPU kernels here that need to be
+compiled multiple times to take advantage of AVX/SSE instructions
+on processors that support them.
+
+**Ensure that all implementations in this folder are put in an
+anonymous namespace.** The files in this folder are compiled multiple
+times with different headers. It's important that these functions have
+internal linkage so that kernels for different architectures don't get
+combined during linking. It's sufficient to label free functions `static`,
+but class methods must be placed in an anonymous namespace to have internal
+linkage (since `static` means something different in the context of classes).
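+
+For example, a kernel file in this folder might be laid out roughly like
+the sketch below. The names are hypothetical, and the dispatch
+registration described in the recipe that follows is omitted:
+
+```cpp
+#include <ATen/ATen.h>
+
+namespace at { namespace native {
+namespace {
+
+// Internal linkage: each CPU-capability build of this file keeps its own
+// copy of the kernel, so the copies cannot collide at link time.
+void my_op_kernel(Tensor& result, const Tensor& self) {
+  // ... vectorized implementation for the current CPU capability ...
+}
+
+// Helper types also belong inside the anonymous namespace; marking their
+// methods `static` would not give them internal linkage.
+struct MyOpHelper {
+  void run(Tensor& result, const Tensor& self) { /* ... */ }
+};
+
+} // anonymous namespace
+}} // namespace at::native
+```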
+
+**The basic recipe is to define your kernel, and then register
+it using DECLARE/REGISTER_DISPATCH.** Writing a kernel requires
+four steps:
+
+1. Declare your dispatch in a header file using
+   `DECLARE_DISPATCH(fn_type, fnNameImpl);`
+   where `fn_type` is the function pointer type of the kernel (e.g.,
+   defined as `using fn_type = void(*)(Tensor&, const Tensor&)`)
+   and `fnNameImpl` is the name of your dispatch registry.
+   (It doesn't really matter where you put this declaration.)
+
+2. Define your dispatch in a C++ file that is NOT in the cpu
+   directory (the dispatch must be defined exactly once) using
+   `DEFINE_DISPATCH(fnNameImpl)` (matching the name in your declaration).
+   Include the header file that declares the dispatch in this C++
+   file. Conventionally, we define the dispatch in the same file
+   where we define our native function.
+
+3. Define a native function which calls into the dispatch using
+   `fnNameImpl(kCPU, arguments...)`, where the arguments are
+   the arguments according to the `fn_type` you defined in the
+   declaration.
+
+4. Write your actual kernel (e.g., `your_kernel`) in the cpu
+   directory, and register it to the dispatch using
+   `REGISTER_DISPATCH(fnNameImpl, &your_kernel)`.
+
+There are plenty of existing examples; look at them for more details.
+
+----
+
+TODO: Clarify and add more documentation all around.
+
+All of the `*.cpp` files in this folder will be compiled under all compiler
+flags specified by `CPU_CAPABILITY_FLAGS` in `aten/src/ATen/CMakeLists.txt`.
+
+The purpose of this is to allow the compilation with various compiler
+flags to enable features such as AVX instructions, while using runtime
+dispatch, which makes sure only valid instructions will be used on any
+given platform.
+
+`Vec256.h` provides a generic implementation of a `Vec256` type that allows
+the programmer to write code packing various primitives (such as floats)
+within 256-bit registers. `Vec256` defines various operators such as `+` and `*`
+and provides functions to allow operations such as max, min, etc.
+
+As an example, `ReduceOpsKernel.cpp` implements a generic `kernel_` that reduces
+an entire array using a given associative binary operation such as `+`.
+
+More explicitly, calling `kernel_` with template argument `std::plus` will cause
+it to sum up the entire array into a single value.
+
+`ReduceOpsKernel.cpp` uses the `CPU_CAPABILITY_*` macros to "know" under which
+compiler flags it is currently compiled. This allows the programmer to write
+generic code, which will be compiled under multiple compilation settings.
+
+`../ReduceOps.cpp` now includes the header `ReduceOpsKernel.h`, which contains
+a generic definition of `sumImplAll`. This function allows the user to reduce
+over a dimension or all dimensions. The appropriate capability is chosen at
+runtime using cpuinfo. If the current platform has AVX, `sumImpl` will be set
+to `sumImplAll`.
--
2.7.4
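For illustration, the registration recipe described in the new README might be wired together roughly as follows. This is only a sketch: the file, function, and stub names (`MyOp.h`, `my_op`, `my_op_stub`, `my_op_kernel`) are hypothetical placeholders and are not part of this patch.

```cpp
// MyOp.h (hypothetical header) -- step 1: declare the dispatch registry.
#include <ATen/ATen.h>
#include <ATen/native/DispatchStub.h>  // provides DECLARE/DEFINE/REGISTER_DISPATCH

namespace at { namespace native {
using my_op_fn = void (*)(Tensor&, const Tensor&);
DECLARE_DISPATCH(my_op_fn, my_op_stub);
}} // namespace at::native

// MyOp.cpp (hypothetical, NOT in native/cpu/) -- steps 2 and 3.
namespace at { namespace native {
DEFINE_DISPATCH(my_op_stub);          // the dispatch is defined exactly once

Tensor my_op(const Tensor& self) {    // the native function calls into the dispatch
  Tensor result = at::empty_like(self);
  my_op_stub(kCPU, result, self);
  return result;
}
}} // namespace at::native

// native/cpu/MyOpKernel.cpp (hypothetical) -- step 4: the kernel and its registration.
namespace at { namespace native {
namespace {
void my_op_kernel(Tensor& result, const Tensor& self) {
  // ... implementation, compiled once per CPU capability ...
}
} // anonymous namespace

REGISTER_DISPATCH(my_op_stub, &my_op_kernel);
}} // namespace at::native
```

At runtime the stub selects an appropriate registered kernel for the host CPU, so `my_op` itself stays capability-agnostic.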