[ hgemm ] Implement 4x8 hgemm kernel
author skykongkong8 <ss.kong@samsung.com>
Wed, 3 Apr 2024 04:16:06 +0000 (13:16 +0900)
committer Jijoong Moon <jijoong.moon@samsung.com>
Wed, 3 Apr 2024 11:48:34 +0000 (20:48 +0900)
commit 08d02ec03f2d27d5339452216037ee67c0460379
tree 1279935d513b14b92ad2a14356b925e7f0400e5a
parent 5d38c09ace47e17b391a9eb84c681bec4814ffc0
[ hgemm ] Implement 4x8 hgemm kernel

- This commit introduces two types of 4x8 hgemm kernels:
        1. full-fp16
        2. fp16-fp32 partial accumulation
- Additionally, the 4x8 kernel has a macro kernel that can regulate the accuracy-latency tradeoff. By default it accumulates partial sums over blocks of up to 256 values. Other kernels will be refactored in this way ASAP.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <ss.kong@samsung.com>
nntrainer/tensor/hgemm/hgemm_kernel_4x8.h [new file with mode: 0644]