GBE: Add a customized loop unrolling handling mechanism.
By default, the unrolling threshold is relatively small.
Thus some relative large loops which access private array
will not be unrolled, thus those private array can't
be scalarized latter. And the private array is allocated
in stack which is extreme slow for Gen backend currently.
To increase the unrolling threshold for all loops is not
a good idea, as most of the loops don't need to do unrolling
for this purpose and a large unrolling threshold will cause
a big code size and unecessary big register pressure which
may lead to register spilling.
So this patch introduce a trade-off pass to identify those
loops which still have private load/store in the outer most
of the loop. Then add a metadata to it to indicate aggresive
unrolling on those loops. Then do another round loop unrolling.
This patch with the previous small patch, can bring significant
performance improvement for some cases. I just tested with some
opencv test cases, and observed it can bring 2x to 10x improvement.
v2:
refine the parent loop unroll analysis method.
v3:
disable this pass for LLVM 3.3/3.4.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>