GBE: optimize unaligned char and short vector loads.
Gathering contiguous short/char loads into a single load instruction
gives us a good opportunity to use untyped loads to optimize them.
This patch enables short/char load gathering in the load/store
optimization pass. The backend then loads the corresponding DWORDs and
converts them to short/char values by applying shift and bitwise
operations.
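As an illustration only, the shift-and-mask conversion can be sketched
on the host in plain C. This is not the backend code from this patch:
the helper names (byte_at, extract_chars, extract_shorts) are
hypothetical, and the sketch assumes byte k of a DWORD sits at bits
8*k..8*k+7 (little-endian lane order).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the byte at absolute byte position `pos`
 * inside an array of already-loaded DWORDs. Assumes little-endian lane
 * order, i.e. byte k of a DWORD occupies bits 8*k .. 8*k+7. */
static uint8_t byte_at(const uint32_t *dwords, size_t pos)
{
    uint32_t dw = dwords[pos / 4];            /* DWORD covering the byte */
    unsigned shift = (unsigned)(pos % 4) * 8; /* bit offset of the byte  */
    return (uint8_t)((dw >> shift) & 0xFFu);  /* shift down, then mask   */
}

/* Extract `count` chars starting at an arbitrary (possibly unaligned)
 * byte offset from the gathered DWORDs. */
static void extract_chars(const uint32_t *dwords, size_t byte_off,
                          uint8_t *out, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        out[i] = byte_at(dwords, byte_off + i);
}

/* Extract `count` shorts starting at an arbitrary byte offset. Each
 * short is rebuilt from its two bytes, so a short that straddles two
 * DWORDs is handled the same way as one that does not. */
static void extract_shorts(const uint32_t *dwords, size_t byte_off,
                           uint16_t *out, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        uint16_t lo = byte_at(dwords, byte_off + 2 * i);
        uint16_t hi = byte_at(dwords, byte_off + 2 * i + 1);
        out[i] = (uint16_t)(lo | (uint16_t)(hi << 8));
    }
}
```

In this sketch, a 4-char load at byte offset 3 reads two DWORDs and
picks the last byte of the first and the first three bytes of the
second.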
The benchmark shows that for vload4/8/16 on char or vload2/4/8/16 on
short, this patch brings about an 80%-100% improvement.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>