Add vectorization to improve CRC32 performance (#83321)
* Add x86 intrinsics to improve CRC32 performance
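This significantly improves performance for System.IO.Hashing.Crc32 for
cases where the source span is 64 bytes or larger on Intel x86/x64
architectures. In the benchmark below, Append takes about 8% of the
baseline time at 128 bytes and about 3% at 1024 bytes.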
The change only applies to .NET 7 and later targets of System.IO.Hashing
because it uses some Vector128 APIs added in .NET 7.
BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22000.1641/21H2)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.1.23115.2
[Host] : .NET 8.0.0 (8.0.23.11008), X64 RyuJIT AVX2
Job-PBKTIR : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-TVEBLV : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
| Method | Job | BufferSize | Mean | Error | StdDev | Median | Min | Max | Ratio |
|------- |----------- |----------- |------------:|----------:|----------:|------------:|------------:|------------:|------:|
| Append | Current | 128 | 228.20 ns | 2.366 ns | 2.213 ns | 228.07 ns | 225.54 ns | 232.75 ns | 1.00 |
| Append | Intrinsics | 128 | 17.62 ns | 0.096 ns | 0.075 ns | 17.59 ns | 17.56 ns | 17.80 ns | 0.08 |
| | | | | | | | | | |
| Append | Current | 1024 | 1,988.07 ns | 47.120 ns | 54.264 ns | 1,990.18 ns | 1,892.83 ns | 2,089.15 ns | 1.00 |
| Append | Intrinsics | 1024 | 64.71 ns | 0.794 ns | 0.704 ns | 64.67 ns | 63.13 ns | 65.96 ns | 0.03 |
* Use vector operator overloads and ref byte indexing
* Fix error and remove ref ROS
* Drop aggressive inlining and legibility improvements
* Don't overcheck intrinsics
* First pass at ARM support
* ARM tweaks
* A bit of cleanup for legibility
* A little more cleanup
* Add license notices
* Move vector shift right to helper function
* A bit of cleanup
* Use System.Diagnostics.UnreachableException
* Use ReadUnaligned for ARM CRC
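
For reference, a minimal usage sketch of the public Crc32 API that this change speeds up. It is not taken from the PR. It assumes a project that references the System.IO.Hashing package and uses top-level statements. The buffer contents and the seed are hypothetical, and the 1024-byte size simply mirrors the benchmark above. Which internal code path runs is an implementation detail and cannot be observed through this API; the sketch only shows that results are the same whichever way the data is fed in.

```csharp
using System;
using System.IO.Hashing;
using System.Text;

// 9 bytes: shorter than the 64-byte threshold mentioned above.
byte[] small = Encoding.ASCII.GetBytes("123456789");

// 1024 bytes: long enough to be eligible for the vectorized path
// on supported hardware (hypothetical test data, fixed seed).
byte[] large = new byte[1024];
new Random(42).NextBytes(large);

// One-shot hash. The result is emitted in little-endian byte order,
// so the standard CRC-32 check value 0xCBF43926 appears as 26 39 F4 CB.
byte[] check = Crc32.Hash(small);
Console.WriteLine(Convert.ToHexString(check)); // 2639F4CB

// Incremental hashing: appending the buffer in two pieces must give
// the same result as hashing it in one call.
var crc = new Crc32();
crc.Append(large.AsSpan(0, 512));
crc.Append(large.AsSpan(512));
byte[] incremental = crc.GetCurrentHash();
byte[] oneShot = Crc32.Hash(large);
Console.WriteLine(incremental.AsSpan().SequenceEqual(oneShot)); // True
```

Callers do not need to change anything to benefit: the same Append and Hash calls are used regardless of input length or target hardware.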