[ hgemm ] Generalize redundant micro hgemm kernel implementation
- Previous implementation naively used fixed-sized ukernels for the K-direction accumulation.
- Such kernels were excessively long, but had better performance than looping through single K-iteration.
- However, recent test results have shown that justing stacking 4 K iters, and looping through such ukernel preserved the performance with better code readability.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>