Improvements to allocation performance on machines with many processors (#32539)
* Improvements to allocation performance on machines with many processors. In particular:
- Make sure we reserve memory on the correct NUMA node. Otherwise the OS will pick a NUMA node for us on first touch, which is sometimes correct, sometimes not.
- Examine only a subset of the available heaps in balance_heaps, and a different subset on each call. The subset includes heaps both inside and outside the allocating processor's NUMA node, but preference goes to heaps within that node and, in particular, to the heap associated with the allocating processor.
- Fix an issue where our logic assumed that NUMA node numbers are non-decreasing as processor numbers increase, which does not hold on all machines.
- Fix an issue where we decreased the number of heaps to 1 in the special case of using large pages with the size for the pinned object heap set to 0.
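
The heap selection idea described above can be illustrated with a small standalone sketch. This is not the actual balance_heaps code: `HeapInfo`, `pick_heap`, the rotating offset, and the two penalty parameters are all hypothetical names and mechanisms chosen for illustration, and the real heuristics may weigh heaps quite differently. The sketch reads each heap's NUMA node from a per-heap field rather than inferring it from the heap index, so it makes no assumption about node numbers being ordered.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration only; none of these names come from the GC sources.
struct HeapInfo {
    int numa_node;          // NUMA node the heap's memory is assumed to live on
    std::size_t allocated;  // bytes allocated so far, used as a load proxy
};

// Choose a heap for an allocating thread while examining at most `subset_size`
// heaps. The scan starts at `rotation`, which advances on every call so that
// successive calls look at different subsets.
//
// The home heap is always a candidate and carries no penalty. Other heaps on
// the home NUMA node pay `local_penalty`; heaps on other nodes pay
// `remote_penalty`. A candidate replaces the current best only if its
// penalized load is strictly lower.
// Precondition: heaps is non-empty and home_heap < heaps.size().
std::size_t pick_heap(const std::vector<HeapInfo>& heaps,
                      std::size_t home_heap,
                      std::size_t subset_size,
                      std::size_t& rotation,
                      std::size_t local_penalty,
                      std::size_t remote_penalty)
{
    const std::size_t n = heaps.size();
    const int home_node = heaps[home_heap].numa_node;

    std::size_t best = home_heap;
    std::size_t best_cost = heaps[home_heap].allocated;

    for (std::size_t i = 0; i < subset_size && i < n; ++i) {
        const std::size_t idx = (rotation + i) % n;
        if (idx == home_heap) {
            continue;  // already accounted for as the starting candidate
        }
        // The node is looked up per heap, so heap indices and node numbers
        // may be interleaved in any order.
        const std::size_t penalty =
            (heaps[idx].numa_node == home_node) ? local_penalty : remote_penalty;
        const std::size_t cost = heaps[idx].allocated + penalty;
        if (cost < best_cost) {
            best = idx;
            best_cost = cost;
        }
    }

    // Advance the window so the next call examines a different subset.
    rotation = (rotation + subset_size) % n;
    return best;
}
```

In this sketch, setting `remote_penalty` larger than `local_penalty` means a heap on another node is chosen only when it is substantially less loaded, and the home heap wins all ties. Bounding the scan with `subset_size` keeps the cost of each call independent of the total heap count.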
Results for a pure allocation test on a 4 socket NUMA machine (128 cores, 256 processors) showed improvements ranging from 18% to 28%. A 2 socket Intel machine (56 cores, 56 processors) showed results ranging from a 3% regression to an 8% improvement.