third_party/highway/README.md

   1 # Efficient and performance-portable vector software
   2
   3 [//]: # (placeholder, do not remove)
   4
   5 Highway is a C++ library that provides portable SIMD/vector intrinsics.
   6
   7 ## Why
   8
   9 We are passionate about high-performance software. We see major untapped
  10 potential in CPUs (servers, mobile, desktops). Highway is for engineers who want
  11 to reliably and economically push the boundaries of what is possible in
  12 software.
  13
  14 ## How
  15
  16 CPUs provide SIMD/vector instructions that apply the same operation to multiple
  17 data items. This can reduce energy usage e.g. *fivefold* because fewer
  18 instructions are executed. We also often see *5-10x* speedups.
  19
  20 Highway makes SIMD/vector programming practical and workable according to these
  21 guiding principles:
  22
  23 **Does what you expect**: Highway is a C++ library with carefully-chosen
  24 functions that map well to CPU instructions without extensive compiler
  25 transformations. The resulting code is more predictable and robust to code
  26 changes/compiler updates than autovectorization.
  27
  28 **Works on widely-used platforms**: Highway supports four architectures; the
  29 same application code can target eight instruction sets, including those with
  30 'scalable' vectors (size unknown at compile time). Highway only requires C++11
  31 and supports four families of compilers. If you would like to use Highway on
  32 other platforms, please raise an issue.
  33
  34 **Flexible to deploy**: Applications using Highway can run on heterogeneous
  35 clouds or client devices, choosing the best available instruction set at
  36 runtime. Alternatively, developers may choose to target a single instruction set
  37 without any runtime overhead. In both cases, the application code is the same
  38 except for swapping `HWY_STATIC_DISPATCH` with `HWY_DYNAMIC_DISPATCH` plus one
  39 line of code.
  40
  41 **Suitable for a variety of domains**: Highway provides an extensive set of
  42 operations, used for image processing (floating-point), compression, video
  43 analysis, linear algebra, cryptography, sorting and random generation. We
  44 recognise that new use-cases may require additional ops and are happy to add
  45 them where it makes sense (e.g. no performance cliffs on some architectures). If
  46 you would like to discuss, please file an issue.
  47
  48 **Rewards data-parallel design**: Highway provides tools such as Gather,
  49 MaskedLoad, and FixedTag to enable speedups for legacy data structures. However,
  50 the biggest gains are unlocked by designing algorithms and data structures for
  51 scalable vectors. Helpful techniques include batching, structure-of-array
  52 layouts, and aligned/padded allocations.
  53
  54 ## Examples
  55
  56 Online demos using Compiler Explorer:
  57
  58 -   [generating code for multiple targets](https://gcc.godbolt.org/z/n6rx6xK5h) (recommended)
  59 -   [single target using -m flags](https://gcc.godbolt.org/z/rGnjMevKG)
  60
  61 Projects using Highway: (to add yours, feel free to raise an issue or contact us
  62 via the below email)
  63
  64 *   [iresearch database index](https://github.com/iresearch-toolkit/iresearch/blob/e7638e7a4b99136ca41f82be6edccf01351a7223/core/utils/simd_utils.hpp)
  65 *   [JPEG XL image codec](https://github.com/libjxl/libjxl)
  66 *   [Grok JPEG 2000 image codec](https://github.com/GrokImageCompression/grok)
  67 *   [vectorized Quicksort](https://github.com/google/highway/tree/master/hwy/contrib/sort) ([paper](https://arxiv.org/abs/2205.05982))
  68
  69 ## Current status
  70
  71 ### Targets
  72
  73 Supported targets: scalar, S-SSE3, SSE4, AVX2, AVX-512, AVX3_DL (~Icelake,
  74 requires opt-in by defining `HWY_WANT_AVX3_DL`), NEON (ARMv7 and v8), SVE, SVE2,
  75 WASM SIMD, RISC-V V.
  76
  77 SVE was initially tested using farm_sve (see acknowledgments).
  78
  79 ### Versioning
  80
  81 Highway releases aim to follow the semver.org system (MAJOR.MINOR.PATCH),
  82 incrementing MINOR after backward-compatible additions and PATCH after
  83 backward-compatible fixes. We recommend using releases (rather than the Git tip)
  84 because they are tested more extensively, see below.
  85
  86 The current version 1.0 signals an increased focus on backwards compatibility.
  87 Applications using documented functionality will remain compatible with future
  88 updates that have the same major version number.
  89
  90 ### Testing
  91
  92 Continuous integration tests build with a recent version of Clang (running on
  93 native x86, or QEMU for RVV and ARM) and MSVC 2019 (v19.28, running on native
  94 x86).
  95
  96 Before releases, we also test on x86 with Clang and GCC, and ARMv7/8 via GCC
  97 cross-compile. See the [testing process](g3doc/release_testing_process.md) for
  98 details.
  99
 100 ### Related modules
 101
 102 The `contrib` directory contains SIMD-related utilities: an image class with
 103 aligned rows, a math library (16 functions already implemented, mostly
 104 trigonometry), and functions for computing dot products and sorting.
 105
 106 ## Installation
 107
 108 This project uses CMake to generate and build. In a Debian-based system you can
 109 install it via:
 110
 111 ```bash
 112 sudo apt install cmake
 113 ```
 114
 115 Highway's unit tests use [googletest](https://github.com/google/googletest).
 116 By default, Highway's CMake downloads this dependency at configuration time.
 117 You can disable this by setting the `HWY_SYSTEM_GTEST` CMake variable to ON and
 118 installing gtest separately:
 119
 120 ```bash
 121 sudo apt install libgtest-dev
 122 ```
 123
 124 To build Highway as a shared or static library (depending on BUILD_SHARED_LIBS),
 125 the standard CMake workflow can be used:
 126
 127 ```bash
 128 mkdir -p build && cd build
 129 cmake ..
 130 make -j && make test
 131 ```
 132
 133 Or you can run `run_tests.sh` (`run_tests.bat` on Windows).
 134
 135 Bazel is also supported for building, but it is not as widely used/tested.
 136
 137 ## Quick start
 138
 139 You can use the `benchmark` inside examples/ as a starting point.
 140
 141 A [quick-reference page](g3doc/quick_reference.md) briefly lists all operations
 142 and their parameters, and the [instruction_matrix](g3doc/instruction_matrix.pdf)
 143 indicates the number of instructions per operation.
 144
 145 We recommend using full SIMD vectors whenever possible for maximum performance
 146 portability. To obtain them, pass a `ScalableTag<float>` (or equivalently
 147 `HWY_FULL(float)`) tag to functions such as `Zero/Set/Load`. There are two
 148 alternatives for use-cases requiring an upper bound on the lanes:
 149
 150 -   For up to `N` lanes, specify `CappedTag<T, N>` or the equivalent
 151     `HWY_CAPPED(T, N)`. The actual number of lanes will be `N` rounded down to
 152     the nearest power of two, such as 4 if `N` is 5, or 8 if `N` is 8. This is
 153     useful for data structures such as a narrow matrix. A loop is still required
 154     because vectors may actually have fewer than `N` lanes.
 155
 156 -   For exactly a power of two `N` lanes, specify `FixedTag<T, N>`. The largest
 157     supported `N` depends on the target, but is guaranteed to be at least
 158     `16/sizeof(T)`.
 159
 160 Due to ADL restrictions, user code calling Highway ops must either:
 161 *   Reside inside `namespace hwy { namespace HWY_NAMESPACE {`; or
 162 *   prefix each op with an alias such as `namespace hn = hwy::HWY_NAMESPACE;
 163     hn::Add()`; or
 164 *   add using-declarations for each op used: `using hwy::HWY_NAMESPACE::Add;`.
 165
 166 Additionally, each function that calls Highway ops must either be prefixed with
 167 `HWY_ATTR`, OR reside between `HWY_BEFORE_NAMESPACE()` and
 168 `HWY_AFTER_NAMESPACE()`. Lambda functions currently require `HWY_ATTR` before
 169 their opening brace.
 170
 171 The entry points into code using Highway differ slightly depending on whether
 172 they use static or dynamic dispatch.
 173
 174 *   For static dispatch, `HWY_TARGET` will be the best available target among
 175     `HWY_BASELINE_TARGETS`, i.e. those allowed for use by the compiler (see
 176     [quick-reference](g3doc/quick_reference.md)). Functions inside
 177     `HWY_NAMESPACE` can be called using `HWY_STATIC_DISPATCH(func)(args)` within
 178     the same module they are defined in. You can call the function from other
 179     modules by wrapping it in a regular function and declaring the regular
 180     function in a header.
 181
 182 *   For dynamic dispatch, a table of function pointers is generated via the
 183     `HWY_EXPORT` macro that is used by `HWY_DYNAMIC_DISPATCH(func)(args)` to
 184     call the best function pointer for the current CPU's supported targets. A
 185     module is automatically compiled for each target in `HWY_TARGETS` (see
 186     [quick-reference](g3doc/quick_reference.md)) if `HWY_TARGET_INCLUDE` is
 187     defined and `foreach_target.h` is included.
 188
 189 ## Compiler flags
 190
 191 Applications should be compiled with optimizations enabled - without inlining,
 192 SIMD code may slow down by factors of 10 to 100. For clang and GCC, `-O2` is
 193 generally sufficient.
 194
 195 For MSVC, we recommend compiling with `/Gv` to allow non-inlined functions to
 196 pass vector arguments in registers. If intending to use the AVX2 target together
 197 with half-width vectors (e.g. for `PromoteTo`), it is also important to compile
 198 with `/arch:AVX2`. This seems to be the only way to generate VEX-encoded SSE4
 199 instructions on MSVC. Otherwise, mixing VEX-encoded AVX2 instructions and
 200 non-VEX SSE4 may cause severe performance degradation. Unfortunately, the
 201 resulting binary will then require AVX2. Note that no such flag is needed for
 202 clang and GCC because they support target-specific attributes, which we use to
 203 ensure proper VEX code generation for AVX2 targets.
 204
 205 ## Strip-mining loops
 206
 207 To vectorize a loop, "strip-mining" transforms it into an outer loop and inner
 208 loop with number of iterations matching the preferred vector width.
 209
 210 In this section, let `T` denote the element type, `d = ScalableTag<T>`, `count`
 211 the number of elements to process, and `N = Lanes(d)` the number of lanes in a
 212 full vector. Assume the loop body is given as a function `template<bool partial,
 213 class D> void LoopBody(D d, size_t index, size_t max_n)`.
 214
 215 Highway offers several ways to express loops where `N` need not divide `count`:
 216
 217 *   Ensure all inputs/outputs are padded. Then the loop is simply
 218
 219     ```
 220     for (size_t i = 0; i < count; i += N) LoopBody<false>(d, i, 0);
 221     ```
 222     Here, the template parameter and second function argument are not needed.
 223
 224     This is the preferred option, unless `N` is in the thousands and vector
 225     operations are pipelined with long latencies. This was the case for
 226     supercomputers in the 90s, but nowadays ALUs are cheap and we see most
 227     implementations split vectors into 1, 2 or 4 parts, so there is little cost
 228     to processing entire vectors even if we do not need all their lanes. Indeed
 229     this avoids the (potentially large) cost of predication or partial
 230     loads/stores on older targets, and does not duplicate code.
 231
 232 *   Use the `Transform*` functions in hwy/contrib/algo/transform-inl.h. This
 233     takes care of the loop and remainder handling and you simply define a
 234     generic lambda function (C++14) or functor which receives the current vector
 235     from the input/output array, plus optionally vectors from up to two extra
 236     input arrays, and returns the value to write to the input/output array.
 237
 238     Here is an example implementing the BLAS function SAXPY (`alpha * x + y`):
 239
 240     ```
 241     Transform1(d, x, n, y, [](auto d, const auto v, const auto v1) HWY_ATTR {
 242       return MulAdd(Set(d, alpha), v, v1);
 243     });
 244     ```
 245
 246 *   Process whole vectors as above, followed by a scalar loop:
 247
 248     ```
 249     size_t i = 0;
 250     for (; i + N <= count; i += N) LoopBody<false>(d, i, 0);
 251     for (; i < count; ++i) LoopBody<false>(CappedTag<T, 1>(), i, 0);
 252     ```
 253     The template parameter and second function arguments are again not needed.
 254
 255     This avoids duplicating code, and is reasonable if `count` is large.
 256     If `count` is small, the second loop may be slower than the next option.
 257
 258 *   Process whole vectors as above, followed by a single call to a modified
 259     `LoopBody` with masking:
 260
 261     ```
 262     size_t i = 0;
 263     for (; i + N <= count; i += N) {
 264       LoopBody<false>(d, i, 0);
 265     }
 266     if (i < count) {
 267       LoopBody<true>(d, i, count - i);
 268     }
 269     ```
 270     Now the template parameter and third function argument can be used inside
 271     `LoopBody` to non-atomically 'blend' the first `num_remaining` lanes of `v`
 272     with the previous contents of memory at subsequent locations:
 273     `BlendedStore(v, FirstN(d, num_remaining), d, pointer);`. Similarly,
 274     `MaskedLoad(FirstN(d, num_remaining), d, pointer)` loads the first
 275     `num_remaining` elements and returns zero in other lanes.
 276
 277     This is a good default when it is infeasible to ensure vectors are padded,
 278     but is only safe `#if !HWY_MEM_OPS_MIGHT_FAULT`!
 279     In contrast to the scalar loop, only a single final iteration is needed.
 280     The increased code size from two loop bodies is expected to be worthwhile
 281     because it avoids the cost of masking in all but the final iteration.
 282
 283 ## Additional resources
 284
 285 *   [Highway introduction (slides)](g3doc/highway_intro.pdf)
 286 *   [Overview of instructions per operation on different architectures](g3doc/instruction_matrix.pdf)
 287 *   [Design philosophy and comparison](g3doc/design_philosophy.md)
 288 *   [Implementation details](g3doc/impl_details.md)
 289
 290 ## Acknowledgments
 291
 292 We have used [farm-sve](https://gitlab.inria.fr/bramas/farm-sve) by Berenger
 293 Bramas; it has proved useful for checking the SVE port on an x86 development
 294 machine.
 295
 296 This is not an officially supported Google product.
 297 Contact: janwas@google.com