vc4: Rework scheduling of thread switch to cut one more NOP.
Jonas's patch got us most of the benefit of scheduling instructions into
the delay slots of thread switch, but if there had been nothing to pair
the thrsw with, it would move the thrsw up and leave a NOP where the thrsw
was.
Instead, don't pair anything with thrsw through the normal scheduling
path, and have a separate helper function that inserts the thrsw earlier
if possible and inserts any necessary NOPs.
total instructions in shared programs: 93027 -> 92643 (-0.41%)
instructions in affected programs: 14952 -> 14568 (-2.57%)