Parallelize card table copy/clear in minor GC. (#43167)
Currently, when using overlapped card table, each minor GC will do a copy/clear of all marked cards into a temporary card table used during minor scan. In order to find marked cards, major heap blocks + LOS is scanned. This is currently done serial and could take significant time of total minor GC depending on platform/hardware.
On hardware with multiple cores it is possible to parallelize the copy/clear of card table, decreasing the minor GC pause times. As an example, a major heap of ~3 GB and a LOS of 600 MB, went from 5 ms minor GC down to 3 ms where most objects didn't have any references (just byte arrays). Increasing the number of objects with references will benefit event more. Before switching
over LOS to a SgenArrayList, similar scenario took 8 ms and was reduced to 3 ms, and those numbers are close to what you would get if sample included more objects with references, since then, you will need to visit more objects, causing more cache misses. Parallelizing that will significantly reduce minor GC pause times.
As with most optimizations the gain is platform/hardware dependent, that's why the parallelization is currently an opt-in feature, that needs to be explicitly enabled (remset-copy-clear-par) on platforms/hardware where similar performance gains can
be identified.
Replacing mono/mono#17689
Co-authored-by: lateralusX <lateralusX@users.noreply.github.com>