mmx: don't pack and unpack src unnecessarily
The combine function was store8888'ing the result, and all consumers
were immediately load8888'ing it, causing lots of unnecessary pack and
unpack instructions.
It's a very straight forward conversion, except for mmx_combine_over_u
and mmx_combine_saturate_u. mmx_combine_over_u was testing the integer
result to skip pixels, so we use the is_* functions to test the __m64
data directly without loading it into an integer register.
For mmx_combine_saturate_u there's not a lot we can do, since it uses
DIV_UN8.