Optimize vectorized sorting - reduce code size, improve speed for large heaps (#40613)
* Improved vectorized sort - smaller bitonic sorters, dynamic packing/unpacking.
There are two optimizations in this PR:
- reduction of code size in the bitonic sorters: by limiting the amount of inlining in this code, we can reduce overall code size in coreclr.dll by about 180 kB.
- dynamic packing: during sorting, we can switch to 32-bit sorting as soon as the address range in a partition is less 32 GB. This will only have an impact on large heaps or machines with many processors, because we already have a similar, but static optimization where we use 32-bit sorting if the overall address range in the ephemeral region is less than 32 GB. So this additional optimization will give improvements if the overall address range is greater than 32 GB initially, but becomes less during the sort. In this case, we get about a 1.6x improvement in sorting speed.
30 files changed: