[ hgemm ] Generalize redundant micro hgemm kernel implementation
author skykongkong8 <ss.kong@samsung.com>
Wed, 7 Aug 2024 11:41:39 +0000 (20:41 +0900)
committer Jijoong Moon <jijoong.moon@samsung.com>
Fri, 9 Aug 2024 04:46:40 +0000 (13:46 +0900)
commit 99a2a3d3c0d3f9d78ebbfbe6ec0de9f06e9c3f7e
tree 0fe4190a539cad7ce57029192bd42e84d4885c23
parent 23c09837cb992d958a83cafdc7d6b7bff3d1eb15

- The previous implementation used fixed-size ukernels for accumulation along the K direction.
- These kernels were excessively long, but they outperformed looping over a single K iteration at a time.
- However, recent test results show that stacking just 4 K iterations in one ukernel and looping over that ukernel preserves performance while improving code readability.
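
A minimal sketch of the idea described above, not the code from this commit: a scalar 4x8 micro kernel whose loop body stacks 4 K iterations and is repeated until K is exhausted. The real kernels use fp16 NEON intrinsics; here `float` and plain loops stand in for portability. The names `k_step`, `ukernel_4x8`, `MR`, `NR`, and `K_UNROLL`, the packed layout of `a` and `b`, and the requirement that K be a multiple of 4 are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tile sizes: a 4x8 output tile, 4 K iterations per loop body.
constexpr std::size_t MR = 4, NR = 8, K_UNROLL = 4;

// One K iteration: rank-1 update of the 4x8 accumulator tile.
// `a` points at 4 packed values of A, `b` at 8 packed values of B.
inline void k_step(const float *a, const float *b, float *acc) {
  for (std::size_t i = 0; i < MR; ++i)
    for (std::size_t j = 0; j < NR; ++j)
      acc[i * NR + j] += a[i] * b[j];
}

// Computes C[4x8] += A[4xK] * B[Kx8], with A and B packed per K step.
// Assumption of this sketch: K is a multiple of K_UNROLL.
void ukernel_4x8(std::size_t K, const float *a, const float *b, float *c,
                 std::size_t ldc) {
  assert(K % K_UNROLL == 0);
  float acc[MR * NR] = {0.0f};

  for (std::size_t k = 0; k < K; k += K_UNROLL) {
    // Four stacked K iterations form the single reusable loop body,
    // instead of separate hand-written kernels for each fixed K.
    k_step(a + 0 * MR, b + 0 * NR, acc);
    k_step(a + 1 * MR, b + 1 * NR, acc);
    k_step(a + 2 * MR, b + 2 * NR, acc);
    k_step(a + 3 * MR, b + 3 * NR, acc);
    a += K_UNROLL * MR;
    b += K_UNROLL * NR;
  }

  // Write the accumulated tile back to C.
  for (std::size_t i = 0; i < MR; ++i)
    for (std::size_t j = 0; j < NR; ++j)
      c[i * ldc + j] += acc[i * NR + j];
}
```

In this form, supporting a larger K only changes the trip count of the loop, which is why one short body can replace several long fixed-K kernels.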

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
nntrainer/tensor/hgemm/hgemm_kernel/hgemm_kernel_4x8.cpp
nntrainer/tensor/hgemm/hgemm_kernel/hgemm_kernel_8x16.cpp
nntrainer/tensor/hgemm/hgemm_kernel/hgemm_kernel_8x16_experimental.cpp
nntrainer/tensor/hgemm/hgemm_kernel/hgemm_kernel_8x8.cpp