Try changing alignment (#76451)
Looking at CPU traces for microbenchmarks, I noticed a hotspot in memset (the flavor that uses AVX2 instructions) for the instruction that clears the very last double quadword at the end of an allocation context. Also, the buffer being cleared is not aligned on a 32-byte boundary.
Two tiny changes address this:
1. Adding extra padding at the start of regions aligns the allocation context for the microbenchmark cases.
2. Increasing CLR_SIZE slightly ensures the end of an allocation context doesn't consistently fall on a page boundary.
Change 1 makes sure the first allocation context in a region begins at an aligned address.
Change 2 minimizes the number of movdqu instructions executed and makes sure we don't consistently hit a new page at the end of the memset range.
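For illustration only, the arithmetic behind both changes can be sketched as below. This is not the GC's code: `kVectorAlign`, `kPageSize`, `padding_to_align`, and `ends_on_page_boundary` are hypothetical names, and the 4 KiB page size is an assumption. The real padding and CLR_SIZE values live in the GC sources.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical constants for illustration; not the GC's actual values.
constexpr std::size_t kVectorAlign = 32;  // width of an AVX2 register in bytes
constexpr std::size_t kPageSize = 4096;   // assumed OS page size

// Bytes of padding needed so that (addr + padding) is a multiple of `align`.
// `align` must be a power of two. Returns 0 if `addr` is already aligned.
constexpr std::size_t padding_to_align(std::uintptr_t addr, std::size_t align) {
    return static_cast<std::size_t>((align - (addr & (align - 1))) & (align - 1));
}

// True if a range of `size` bytes starting at `start` ends exactly on a
// page boundary, so that touching the byte just past the range would land
// on a new page.
constexpr bool ends_on_page_boundary(std::uintptr_t start, std::size_t size) {
    return ((start + size) & (kPageSize - 1)) == 0;
}
```

With these helpers, a start address of 0x1008 needs 24 bytes of padding to reach a 32-byte boundary, and a clear size that is a whole multiple of the page size ends on a page boundary whenever the start is page-aligned, which is the pattern that a slightly larger clear size avoids.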