[X86] Fix logic for optimizing movmsk(bitcast(shuffle(x))); PR67287
Prior logic would remove the shuffle iff all of the elements in `x`
were used. This is incorrect.
The issue is that `movmsk` only cares about the high bits, so if the
width of the elements in `x` is smaller than the width of the elements
for the `movmsk`, then the shuffle, even if it preserves all the
elements, may change which elements supply the high bits.
For example:
`movmsk64(bitcast(shuffle32(x, (1,0,3,2))))`
Even though the shuffle mask `(1,0,3,2)` preserves all the elements, it
flips which elements are relevant to the `movmsk64` (x[1] and x[3]
without the shuffle, x[0] and x[2] with it).
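
The effect can be reproduced with a small scalar model. This is only an
illustrative sketch, not code from the patch: `V4`, `shuffle32`,
`movmsk64`, `movmskWithoutShuffle` and `movmskWithShuffle` are
hypothetical names, and the model assumes little-endian lane order, so
the high half of 64-bit element i is 32-bit lane 2*i+1.

```cpp
#include <array>
#include <cstdint>

// Hypothetical model: a 128-bit vector as four 32-bit lanes, lane 0 lowest.
using V4 = std::array<uint32_t, 4>;

// Apply a 4 x 32-bit shuffle mask: result[i] = x[mask[i]].
V4 shuffle32(const V4 &x, const std::array<int, 4> &mask) {
  V4 r{};
  for (int i = 0; i < 4; ++i)
    r[i] = x[mask[i]];
  return r;
}

// Model of a 2 x 64-bit movmsk: bit i of the result is the sign bit of
// 64-bit element i, which lives in 32-bit lane 2*i+1.
unsigned movmsk64(const V4 &v) {
  unsigned m = 0;
  for (int i = 0; i < 2; ++i)
    m |= ((v[2 * i + 1] >> 31) & 1u) << i;
  return m;
}

// Sign bits are set only in x[0] and x[2].
const V4 kInput = {0x80000000u, 0u, 0x80000000u, 0u};

// What the old logic effectively computed after dropping the shuffle.
unsigned movmskWithoutShuffle() { return movmsk64(kInput); }

// What the original expression computes.
unsigned movmskWithShuffle() {
  return movmsk64(shuffle32(kInput, {1, 0, 3, 2}));
}
```

With this input the two functions return different masks, so removing
the shuffle changes the result even though every element of `x` is used.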
The fix here is to ensure that the shuffle mask can be scaled to the
element width of the `movmsk` instruction. This ensures that the
"high" elements stay "high". This is overly conservative, as it
misses cases like `(1,1,3,3)` where the "high" elements stay
intact despite the mask not being scalable, but for a relatively
edge-case optimization that should generally be handled during
simplifyDemandedBits, it seems okay.
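
A minimal sketch of what "can be scaled" means, assuming a standalone
helper rather than the in-tree implementation: `canScaleShuffleMask` is
a hypothetical name, and treating negative (undef) mask entries as not
scalable is an assumption made here to keep the check conservative.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true if `mask`, expressed in narrow
// elements, can be rewritten as a mask over elements `scale` times wider.
// Each group of `scale` entries must select one whole, aligned wide element.
bool canScaleShuffleMask(const std::vector<int> &mask, int scale) {
  if (scale <= 0 || mask.size() % scale != 0)
    return false;
  for (std::size_t i = 0; i < mask.size(); i += scale) {
    // The group must start on a wide-element boundary.
    // Assumption: undef (negative) entries are rejected outright.
    if (mask[i] < 0 || mask[i] % scale != 0)
      return false;
    // The rest of the group must be the consecutive narrow elements.
    for (int j = 1; j < scale; ++j)
      if (mask[i + j] != mask[i] + j)
        return false;
  }
  return true;
}
```

Under this check `(1,0,3,2)` is rejected for a scale of 2, `(2,3,0,1)` is
accepted, and `(1,1,3,3)` is also rejected even though it keeps the
"high" elements in place, which is the conservative case noted above.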
(cherry picked from commit 1684c65bc997a8ce0ecf96a493784fe39def75de)