Debadri Samaddar [Tue, 7 May 2024 09:08:36 +0000 (14:38 +0530)]
[GPU/OpenCL] Initial version of FC Layer with OpenCL ops
Added a naive version of the OpenCL implementation for the FC Layer.
Incorporated separate kernels for the ops used.
Added a unit test for fc_layer_cl.
Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
skykongkong8 [Mon, 15 Apr 2024 04:11:24 +0000 (13:11 +0900)]
[ Trivial ] Remove redundant comments and format
- Due to adaptive macro kernel usage, the previous comment is no longer needed.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
skykongkong8 [Mon, 15 Apr 2024 04:01:04 +0000 (13:01 +0900)]
[ hgemm ] Refactor kernel init process
- I found repeated matrix initialization before the mul-add fused operations.
- With separate initialization code, we get:
1. Cleaner code that is reusable for both the f16 and f16-f32 kernels
2. A minimized redundant init process for the f16 kernel: better latency with the SAME accuracy.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
skykongkong8 [Mon, 15 Apr 2024 02:10:01 +0000 (11:10 +0900)]
[ hgemm/bugfix ] Adaptive macro kernel usage in 4x4 4x8 kernels
- To avoid the constraint of 4-8 divisibility w.r.t. K, loop adaptively in the K direction.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
skykongkong8 [Mon, 15 Apr 2024 01:34:29 +0000 (10:34 +0900)]
[ hgemm ] Apply acc16 partial sum strategy and adaptive macro use in 8x8 kernel
- Apply the same change made in commit 52a3c734, but in the 8x8 kernel
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
skykongkong8 [Mon, 15 Apr 2024 01:19:24 +0000 (10:19 +0900)]
[ hgemm ] Apply ACC16 partial sum strategy & adaptive macro use in 8x16 kernel
- With more values accumulated in fp16 (in this case 1024 -> 2048) I could observe a latency improvement at the cost of some accuracy loss (sketched below). However, according to the current accuracy measurement criteria, it is still acceptable. Note that it is highly desirable to verify this once more with model output.
- With a variety of partial sum kernels, we can adaptively apply internal macro kernels without being constrained by K-divisibility w.r.t. 4, 8, or 16.
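A minimal scalar sketch of the partial-sum idea described above (not the actual NEON kernel), assuming an accumulation window of 256 values and an arm toolchain that provides the __fp16 type; the window size and names are illustrative:
```
// Illustrative only: accumulate in fp16 and periodically flush the partial
// sum into an fp32 accumulator to bound the rounding error.
#include <cstddef>

float dot_fp16_partial(const __fp16 *a, const __fp16 *b, size_t K) {
  constexpr size_t PARTIAL_SUM_WINDOW = 256; // assumed window size
  float acc32 = 0.f;
  __fp16 acc16 = 0.f;
  for (size_t k = 0; k < K; ++k) {
    acc16 += a[k] * b[k]; // cheap fp16 multiply-add
    if ((k + 1) % PARTIAL_SUM_WINDOW == 0) {
      acc32 += static_cast<float>(acc16); // flush partial sum to fp32
      acc16 = 0.f;
    }
  }
  return acc32 + static_cast<float>(acc16);
}
```
Enlarging the window (e.g. 1024 -> 2048) means fewer flushes, which is exactly the latency/accuracy knob mentioned above.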
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
skykongkong8 [Fri, 12 Apr 2024 05:13:25 +0000 (14:13 +0900)]
[ hgemm ] Apply macro kernel in 4x4 noTrans
- With macro-defined code, the function is expected to be optimized by the compiler more easily
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
skykongkong8 [Fri, 12 Apr 2024 03:48:02 +0000 (12:48 +0900)]
[ hgemm ] Add 4x4 kernel-using f16-f32 hgemm_noTrans
- Now hgemm supports a 4x4 f16-f32 partial accumulation strategy
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
skykongkong8 [Fri, 12 Apr 2024 03:46:57 +0000 (12:46 +0900)]
[ hgemm ] Implement 4x4 f16-f32 kernel
- Implement a 4x4 GEMM kernel that performs f16-f32 partial accumulation
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
Udit Jain [Wed, 22 May 2024 07:39:01 +0000 (16:39 +0900)]
Edited build instructions for Resnet18 test
Edited build instructions for Resnet18 test
**Fixing the meson build option**
Resolves: an error when building the test example where it says
`-c is an un-recognized option`; the meson documentation uses -C, so it seems to be a typo.
**Self evaluation:**
1. Build test: [ ]Passed [ ]Failed [X]Skipped
2. Run test: [ ]Passed [ ]Failed [X]Skipped
Signed-off-by: Udit Jain <udit.jain@samsung.com>
Seungbaek Hong [Wed, 22 May 2024 04:29:18 +0000 (13:29 +0900)]
[Trivial] Update gitignore file
add ".idea/" in gitignore file
- For ignore jetbrain's IDE
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Donghak PARK <donghak.park@samsung.com>
Donghyeon Jeong [Tue, 21 May 2024 00:38:00 +0000 (09:38 +0900)]
[coverity] fix coverity issue
This PR resolves the coverity issue where the constructor may not initialize class members.
**Changes proposed in this PR:**
- initialize lora_idx and lora_scaling in class constructor.
**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Donghyeon Jeong <dhyeon.jeong@samsung.com>
Donghyeon Jeong [Mon, 20 May 2024 02:12:43 +0000 (11:12 +0900)]
[bugfix] Fix LoRA indices array size in the FC layer
This PR resolves an issue related to the incorrect array size for lora_idx in the fully connected layer.
Specifically, the fix has made the array size four elements long, corresponding to loraA, loraB, loraTmp, and loraOut.
**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Donghyeon Jeong <dhyeon.jeong@samsung.com>
Seungbaek Hong [Fri, 17 May 2024 08:38:52 +0000 (17:38 +0900)]
[Application] update yolo v2 python for building pre-training model
To train on a large dataset, the dataset is now loaded in real time during training instead of being loaded into memory in advance, and visualization code was added to check whether training proceeds well.
Signed-off-by: Seungbaek Hong <sb92.hong@samsung.com>
hyunil park [Fri, 17 May 2024 05:11:08 +0000 (14:11 +0900)]
[Nnstreamer-subplugin] Add save_path to setProperty
- Add save_path to setProperty to save the model for each epoch.
- Remove the model->save() call to avoid saving the current epoch result
to the model when the current epoch is interrupted
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: hyunil park <hyunil46.park@samsung.com>
Seungbaek Hong [Wed, 8 May 2024 12:21:40 +0000 (21:21 +0900)]
[Application] cuda support for example of pytorch yolo v2
- add cuda option to train yolo v2 model backbone
- preprocessing for input dataset
* unmatched paired dataset
* no annotation value
Signed-off-by: Seungbaek Hong <sb92.hong@samsung.com>
Seungbaek Hong [Wed, 8 May 2024 04:05:32 +0000 (13:05 +0900)]
[Application] Rename yolo -> yolo v2
To prevent confusion, the name of YOLOv2 implementation was changed from
YOLO to YOLOv2.
Signed-off-by: Seungbaek Hong <sb92.hong@samsung.com>
Debadri Samaddar [Thu, 9 May 2024 08:15:22 +0000 (13:45 +0530)]
[hgemm] Optimizing dimension checks using bitmask
Used bitmasks for dimension checks.
e.g. checking N % 8 is the same as checking N & 0x7 (for unsigned N)
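For reference, a small self-contained check of the equivalence used here (it holds for unsigned values because 8 is a power of two, so the low three bits are exactly the remainder):
```
#include <cassert>
#include <cstdint>

// N % 8 == 0  <=>  (N & 0x7) == 0, and N % 4 == 0  <=>  (N & 0x3) == 0.
inline bool divisible_by_8(uint32_t N) { return (N & 0x7) == 0; }
inline bool divisible_by_4(uint32_t N) { return (N & 0x3) == 0; }

int main() {
  for (uint32_t n = 0; n < 1024; ++n) {
    assert(divisible_by_8(n) == (n % 8 == 0));
    assert(divisible_by_4(n) == (n % 4 == 0));
  }
  return 0;
}
```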
Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
Debadri Samaddar [Wed, 8 May 2024 09:09:55 +0000 (14:39 +0530)]
[hgemm] Added K divisible condition for 1x8 and 1x4 kernels
Added condition for better accuracy while calling 1x4 and 1x8 kernels
Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
Debadri Samaddar [Wed, 8 May 2024 05:48:54 +0000 (11:18 +0530)]
[hgemm] Interchanged hgemm_noTrans_1x8 and hgemm_noTrans_4x4 calls
Moved the 1x8 kernel call after the 4x4 kernel call.
Added a couple of test cases.
Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
skykongkong8 [Wed, 24 Apr 2024 01:39:41 +0000 (10:39 +0900)]
[ hdot ] Use precision-enhanced hdot
- The previous hdot used full-fp16 accumulation.
- Since this is also a dimension-shrinking computation, it should use intermediate fp32 values to enhance precision (sketched below).
- This had not been detected because the unit tests used small-dimension Tensors. Add a higher-dimension test case accordingly.
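As a rough reference for the precision argument (not the actual NEON implementation; assumes an arm toolchain with __fp16), the accumulator is kept in fp32 while the inputs stay fp16:
```
// Hypothetical scalar reference for hdot: fp16 inputs, fp32 accumulator.
// Dimension-shrinking ops sum many terms, so the accumulator precision
// dominates the final error.
float hdot_ref(const __fp16 *x, const __fp16 *y, unsigned int n) {
  float sum = 0.f; // intermediate fp32 keeps rounding error bounded
  for (unsigned int i = 0; i < n; ++i)
    sum += static_cast<float>(x[i]) * static_cast<float>(y[i]);
  return sum;
}
```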
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
Donghak PARK [Wed, 8 May 2024 04:11:05 +0000 (13:11 +0900)]
[Trivial] Removing unnecessary files from the repo and adding an ignore file.
In an Android project, the ".gradle" and ".idea" directories are created locally and have nothing to do with the repository.
Therefore, it is common to delete them and add a ".gitignore" file so that they will not be uploaded again as development progresses.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Donghak PARK <donghak.park@samsung.com>
Donghak PARK [Thu, 9 May 2024 00:38:13 +0000 (09:38 +0900)]
[CI] Remove Pylinter in CI
Previously, since Pylinter did not exist as a GitHub Action, it was run directly in our workflow,
but there is no need to do the same task twice because pylint is included in static_check.scripts when importing CI from nnstreamer.
So delete the pylinter.yml file, since it keeps creating unnecessary CI errors.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Donghak PARK <donghak.park@samsung.com>
Seungbaek Hong [Tue, 7 May 2024 06:38:22 +0000 (15:38 +0900)]
[Application] fix LLaMA application example error
When running without the encoder, a problem has been fixed where invalid values were set during operation due to incorrect assignment of input data, causing word-index-related errors.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Seungbaek Hong <sb92.hong@samsung.com>
Seungbaek Hong [Fri, 3 May 2024 04:18:26 +0000 (13:18 +0900)]
[Application] Update weights_converter
The num_layer parameter is now set automatically through auto config
when converting weights from the pytorch format to the nntrainer format.
Signed-off-by: Seungbaek Hong <sb92.hong@samsung.com>
jijoong.moon [Fri, 26 Apr 2024 11:07:28 +0000 (20:07 +0900)]
[ NEURALNET ] change the loss scale property to Rigid Property
Loss scale is more of a rigid property of the model than a flexible
property.
Resolves:
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: jijoong.moon <jijoong.moon@samsung.com>
jijoong.moon [Fri, 26 Apr 2024 10:13:05 +0000 (19:13 +0900)]
[ Weight ] split variable dim and grad dim to set separately
This PR splits the Variable and Gradient dims in Var_Grad and Weight.
This way we can set different Variable and Gradient types in Weight.
. add dim_g for the gradient in WeightSpec.
. the manager needs to be updated to support the new WeightSpec.
. create Tensors according to dim_v and dim_g
. Weight creation changed in Weight.h
Resolves:
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: jijoong.moon <jijoong.moon@samsung.com>
jijoong.moon [Fri, 26 Apr 2024 05:48:26 +0000 (14:48 +0900)]
[ Weight ] Add Loss Scale factor in Weight
This PR enables the loss scale factor in Weight.
. Change the WeightSpec to include the loss factor
. Add the LossScaleForMixed property as a layer common property, so that
it can set the scale factor in initContext.
. Add Loss Scale in initContext
. Set the LossScaleForMixed property when the LossScale model property
is present
Resolves:
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: jijoong.moon <jijoong.moon@samsung.com>
Jiho Chu [Wed, 6 Mar 2024 00:58:18 +0000 (09:58 +0900)]
[Property] Add loss scale property
This adds the loss scale property as a model common property.
Signed-off-by: Jiho Chu <jiho.chu@samsung.com>
MyungJoo Ham [Fri, 26 Jan 2024 04:02:05 +0000 (13:02 +0900)]
meson: fix fp16 support conditions for arm/aarch64
According to the GCC documentation,
https://gcc.gnu.org/onlinedocs/gcc-9.1.0/gcc/Half-Precision.html
even if -mfp16-format=ieee is not given, aarch64 supports
ieee fp16. Thus, for aarch64, even if the option is not available,
try to build it with the __fp16 type.
Then, add a condition for arm: the final "else" is written for x64/x86
machines.
Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Seungbaek Hong [Fri, 5 Apr 2024 05:08:50 +0000 (14:08 +0900)]
[Wait for #2536][application] add generate_multiple_tokens for llm
Added a generate_multiple_tokens function for the first generation step of the llm.
This function takes one set of logits and generates multiple output tokens.
To meet the purpose of the target application,
even if the input contains multiple sets of logits,
only the first set is used to generate multiple output tokens.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Seungbaek Hong <sb92.hong@samsung.com>
kimhan0515 [Sat, 27 Apr 2024 16:13:16 +0000 (01:13 +0900)]
Add SELU activation function
- Now, users can use the SELU activation function like in torch or tensorflow (sketched below).
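For reference, a minimal sketch of SELU with the standard published constants (the actual layer implementation in nntrainer may be structured differently):
```
#include <cmath>

// selu(x) = lambda * x                   if x > 0
//         = lambda * alpha * (e^x - 1)   otherwise
inline float selu(float x) {
  constexpr float alpha = 1.6732632423543772f;  // standard SELU alpha
  constexpr float lambda = 1.0507009873554805f; // standard SELU scale
  return x > 0.f ? lambda * x : lambda * alpha * (std::exp(x) - 1.f);
}
```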
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: kimhan0515 <kimhan0515@gmail.com>
skykongkong8 [Thu, 25 Apr 2024 03:32:24 +0000 (12:32 +0900)]
[ hnrm2 ] Use precision-enhanced hscal
- The previous hnrm2 used full-fp16 accumulation.
- Since this is also a dimension-shrinking computation, it should use intermediate fp32 values to enhance precision.
- This had not been detected because the unit tests used small-dimension Tensors. Add a higher-dimension test case accordingly.
- Note that this function is responsible for Tensor::l2norm(), which is frequently used for mse loss computation.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
Jaeyun Jung [Fri, 26 Apr 2024 05:49:05 +0000 (14:49 +0900)]
[Build] dependency to api
Code cleanup: fix the cyclic dependency between nntrainer and ml-api.
The build dependency of nntrainer on ml-api is unnecessary.
Signed-off-by: Jaeyun Jung <jy1210.jung@samsung.com>
Eunju Yang [Tue, 23 Apr 2024 04:23:19 +0000 (13:23 +0900)]
[LLaMA] Bugfix in LLaMA application
- This commit fixes a bug in `applyTKP` function.
- It seems applying Top-K and Top-P to logits didn't work as intended
Signed-off-by: Eunju Yang <ej.yang@samsung.com>
Debadri Samaddar [Tue, 23 Apr 2024 06:30:16 +0000 (12:00 +0530)]
[hgemm] hgemm noTrans with 1x4 kernel
Added hgemm_kernel_1x4
Added hgemm_noTrans_1x4 calls
Added unittest dot_gemm_50_768_516
Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
Donghyeon Jeong [Thu, 25 Apr 2024 04:34:17 +0000 (13:34 +0900)]
[bugfix] Fix build issues when fp16 is enabled
This PR resolves build issues occur in acti_func.h when fp16 is enabled.
**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Donghyeon Jeong <dhyeon.jeong@samsung.com>
Eunju Yang [Fri, 5 Apr 2024 00:15:01 +0000 (09:15 +0900)]
[LoRA] add alpha parameter to LoRA
- This commit adds an `alpha` parameter to LoRA (fc)
- In the original paper, they adopted `alpha (int)` as a parameter to
derive the scaling factor internally, i.e., scaling = alpha / rank
- This commit takes `alpha` as a hyper-parameter and applies the scaling
factor to the LoRA layer (see the sketch after this list).
- This commit's updates are summarized as follows:
- `common_properties.h` : add LoraAlpha as a parameter.
- `fc_layer.cpp`: update forwarding / calcGradient /
calcDerivative func to apply scaling factor in LoRA computation
- `fc_layer.h`: update to take LoraAlpha as fc_props
- `node_exporter.cpp/h`: add LoraAlpha as a parameter in
tf.export format of fc layer (to pass the test code)
- fix the code lines which may cause coverity issue.
- LoRA initialization is updated:
- LoRA A : ZEROS
- LoRA B : Normal
- [TODO] update tf exporter of fc layer
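As mentioned in the scaling-factor item above, a minimal sketch of the alpha/rank relation, assuming the paper's convention; the struct and values below are illustrative, not the actual fc_props types:
```
// The LoRA contribution is scaled by alpha / rank before being added to the
// base output. LoraAlpha is the new hyper-parameter, LoraRank the existing one.
struct LoraScale {
  int rank;
  int alpha;
  float scaling() const { return static_cast<float>(alpha) / rank; }
};
// e.g. rank = 4, alpha = 8 gives a scaling factor of 2.0f (illustrative values).
```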
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Eunju Yang <ej.yang@samsung.com>
Boseong Seo [Sat, 20 Apr 2024 03:55:09 +0000 (12:55 +0900)]
Add Mish activation function
- Now, users can use the Mish activation function like in torch or tensorflow.
**Self evaluation**:
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Boseong Seo <suzy13549@snu.ac.kr>
Debadri Samaddar [Mon, 22 Apr 2024 04:01:44 +0000 (09:31 +0530)]
[hgemm] Removed unused header
Deleted unused header inclusion
Removed #include <iostream>
Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
Debadri Samaddar [Thu, 18 Apr 2024 09:02:29 +0000 (14:32 +0530)]
[hgemm] hgemm noTrans with kernel 1x8
Added 1x8 hgemm kernel, packing_A1, packing_B1 functions.
Incorporated hgemm_noTrans_1x8.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
MyungJoo Ham [Sat, 20 Apr 2024 04:22:44 +0000 (13:22 +0900)]
ci / remove cpp-linter's false positive reports.
It gives an error that "iostream" is not found.
Install libstdc++-dev for the possible compilers.
Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Seungbaek Hong [Fri, 15 Mar 2024 07:11:11 +0000 (16:11 +0900)]
[API] Add tensor&operations API structure for supporting autograd
Added a tensor & function (operation) API structure to support autograd.
Users can build a model graph using this API.
The operators will be supported one-to-one with ONNX.
Signed-off-by: Seungbaek Hong <sb92.hong@samsung.com>
heka1024 [Wed, 10 Apr 2024 05:54:26 +0000 (14:54 +0900)]
Add Softplus activation function
- Now, users can use the Softplus activation function like in torch or tensorflow.
- Furthermore, we can use this function to build Mish or other activation functions (sketched below)
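A minimal sketch of the definitions involved (softplus and, as noted above, Mish built on top of it); the actual acti_func implementation may differ:
```
#include <cmath>

// softplus(x) = ln(1 + e^x), a smooth approximation of ReLU.
inline float softplus(float x) { return std::log1p(std::exp(x)); }

// mish(x) = x * tanh(softplus(x)); softplus is reused as described above.
inline float mish(float x) { return x * std::tanh(softplus(x)); }
```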
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: heka1024 <heka1024@gmail.com>
hyeonseok lee [Wed, 17 Apr 2024 01:48:49 +0000 (10:48 +0900)]
[neuralnet] bugfix multi batch incremental inference
- This commit handles the case where the model activation data type is fp32
Signed-off-by: hyeonseok lee <hs89.lee@samsung.com>
Seungbaek Hong [Mon, 15 Apr 2024 12:39:17 +0000 (21:39 +0900)]
[Docs] add yolov3 readme file
added yolov3 readme file
Signed-off-by: Seungbaek Hong <sb92.hong@samsung.com>
Debadri Samaddar [Mon, 15 Apr 2024 07:40:44 +0000 (13:10 +0530)]
[OpenCL/GPU] Modified ifstream condition
Updated the ifstream object validity condition
Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
Debadri Samaddar [Thu, 4 Apr 2024 09:21:10 +0000 (14:51 +0530)]
[GPU/OpenCL] Create kernel utility with binaries
Added feature for reading kernel binaries.
Managing already created kernels.
Added static flag and bitmask to check existing kernels.
Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
heka1024 [Tue, 9 Apr 2024 13:50:02 +0000 (22:50 +0900)]
Add ELU activation function
- Now, users can use the ELU activation function like in torch or tensorflow.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Co-authored-by: Hanbyeol Kim <kimhan0515@snu.ac.kr>
Co-authored-by: Boseong Seo <suzy13549@snu.ac.kr>
Signed-off-by: heka1024 <heka1024@gmail.com>
Seungbaek Hong [Thu, 4 Apr 2024 11:24:54 +0000 (20:24 +0900)]
[application] add repetition_penalty to generate func
add some options to 'generate' function of llm
- add naive repetition_penalty option
- add bad_words option
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Seungbaek Hong <sb92.hong@samsung.com>
Boseong Seo [Tue, 9 Apr 2024 13:23:31 +0000 (22:23 +0900)]
Remove LSTM example in Applications/README.md
- the existing link to the LSTM example does not work
- users can find the LSTM example in the Layers dir
+ the LSTM dir was merged into the Layers dir (in PR nnstreamer#2107)
- delete the LSTM example from the Applications/README.md file in order to reduce confusion
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Boseong Seo <suzy13549@snu.ac.kr>
skykongkong8 [Thu, 4 Apr 2024 06:44:15 +0000 (15:44 +0900)]
[ hgemm ] Use macro kernel in 8x8 kernel
- Using the macro kernel, we can choose a point on the accuracy-latency tradeoff. Furthermore, it is easier to maintain this way.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
skykongkong8 [Thu, 4 Apr 2024 06:42:11 +0000 (15:42 +0900)]
[ hgemm ] Apply software prefetching in 4x8 kernel
- We can expect to minimize cache misses by using software prefetching (sketched below)
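A hedged illustration of the idea (the real kernel uses NEON intrinsics/assembly and its own prefetch distance; the names and distance below are assumptions):
```
// Sketch only: hint the upcoming elements of the packed panels into cache a
// fixed distance ahead of the multiply-add that will consume them.
void macc_with_prefetch(const __fp16 *a, const __fp16 *b, float *c, int K) {
  constexpr int PREFETCH_DISTANCE = 64; // assumed distance, in elements
  for (int k = 0; k < K; ++k) {
    __builtin_prefetch(a + k + PREFETCH_DISTANCE, /*rw=*/0, /*locality=*/1);
    __builtin_prefetch(b + k + PREFETCH_DISTANCE, /*rw=*/0, /*locality=*/1);
    *c += static_cast<float>(a[k]) * static_cast<float>(b[k]);
  }
}
```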
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
skykongkong8 [Wed, 3 Apr 2024 11:10:57 +0000 (20:10 +0900)]
[ hgemm ] Implement 8x16 hgemm kernel
- This commit introduces 2 types of 8x16 hgemm kernel
1. full-fp16
2. fp16-fp32 partial accumulation
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
hyeonseok lee [Fri, 5 Apr 2024 13:49:45 +0000 (22:49 +0900)]
[neuralnet] enable multi batch incremental inference
- The output did not account for multi-batch input in incremental inference.
Now it returns multi-batch output.
Signed-off-by: hyeonseok lee <hs89.lee@samsung.com>
Seungbaek Hong [Wed, 3 Apr 2024 11:10:13 +0000 (20:10 +0900)]
[application] update llm generate function
- fix "temperature" operation
- add "top-k, top-p" option
- support batch mode
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Seungbaek Hong <sb92.hong@samsung.com>
Boseong Seo [Thu, 4 Apr 2024 07:34:29 +0000 (16:34 +0900)]
Reformat code with .clang_format
**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Boseong Seo <suzy13549@snu.ac.kr>
Boseong Seo [Thu, 4 Apr 2024 04:19:55 +0000 (13:19 +0900)]
[ BugFix ] Modify the wrong input in `EXPECT_EQ`
- The `registerFactory` function returns the unsigned value of int_key when int_key is given as -1 (default), but this was not considered in the code.
- So, the second argument (expected value) of `EXPECT_EQ` was modified accordingly.
Signed-off-by: Boseong Seo <suzy13549@snu.ac.kr>
Boseong Seo [Tue, 2 Apr 2024 19:28:35 +0000 (04:28 +0900)]
Use parameterized test in unittest
Use parameterized test according to existing TODO comment.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Boseong Seo <suzy13549@snu.ac.kr>
kimhan0515 [Thu, 4 Apr 2024 16:09:57 +0000 (01:09 +0900)]
Fix typo in docs
Fix typos for some docs
- README.md and docs/configuration-ini.md: simple typo
- Applications/MNIST/README.md: typo and duplicate image
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Hanbyeol Kim <kimhan0515@snu.ac.kr>
Signed-off-by: kimhan0515 <kimhan0515@gmail.com>
hyeonseok lee [Thu, 4 Apr 2024 12:23:37 +0000 (21:23 +0900)]
[layer] multi batch incremental forwarding
- Enable multi batch incremental forwarding by looping batchwise
Signed-off-by: hyeonseok lee <hs89.lee@samsung.com>
Debadri Samaddar [Tue, 2 Apr 2024 12:51:18 +0000 (18:21 +0530)]
[OpenCL] Added stringification macro and kernel path
Added DEFAULT_KERNEL_PATH as a static member of the Program class.
Modified the macros for stringification (illustrated below).
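For context, the usual two-level pattern for stringifying a macro value (so that a -D definition expands before the # operator applies); the macro and path names here are illustrative, not necessarily the ones used in opencl_program:
```
// Two-level stringification: STR expands its argument first, then
// STR_HELPER turns the expansion into a string literal.
#define STR_HELPER(x) #x
#define STR(x) STR_HELPER(x)

// With e.g. -DOPENCL_KERNEL_PATH=/usr/share/nntrainer/kernels (illustrative),
// STR(OPENCL_KERNEL_PATH) expands to "/usr/share/nntrainer/kernels".
```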
Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
Debadri Samaddar [Fri, 15 Mar 2024 12:16:12 +0000 (17:46 +0530)]
[OpenCL] Added opencl kernel path as option
Added an opencl-kernel-path preprocessor directive and handled it inside opencl_program.
Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
Debadri Samaddar [Thu, 14 Mar 2024 07:55:40 +0000 (13:25 +0530)]
[OpenCL] Proper cleanup and readability
Used better C++ paradigm to enhance readability.
Added proper cleanup stub.
Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
Debadri Samaddar [Mon, 11 Mar 2024 11:14:19 +0000 (16:44 +0530)]
[OpenCL/GPU] Kernel binary caching
Added utilities for saving kernel as binary files.
Added wrapper for clCreateProgramWithBinary.
Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
Eunju Yang [Fri, 15 Mar 2024 06:35:28 +0000 (15:35 +0900)]
[LoRA] Apply Inception-LoRA
- updates the LoRA computation (applying Inception-LoRA)
- compute with LoRA vectors without matrix construction
- revise `forwarding()`
- revise `calcGradient()`
- revise `calcDerivative()`
Self evaluation:
Build test: [X]Passed [ ]Failed [ ]Skipped
Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Eunju Yang <ej.yang@samsung.com>
Eunju Yang [Tue, 12 Mar 2024 00:14:06 +0000 (09:14 +0900)]
[ Trivial ] apply clang-format to fc_layer.cpp
- clang-format re-apply to pass static checker
- `fc_layer.cpp`
Signed-off-by: Eunju Yang <ej.yang@samsung.com>
Eunju Yang [Fri, 8 Mar 2024 07:08:12 +0000 (16:08 +0900)]
[ trivial ] fix doxygen tag check error
- remove a redundant and incorrect block comment in
`nntrainer/layers/fc_layer.cpp`
Signed-off-by: Eunju Yang <ej.yang@samsung.com>
Eunju Yang [Fri, 8 Mar 2024 06:36:39 +0000 (15:36 +0900)]
[ trivial ] apply clang-format
- apply clang format to
- nntrainer/tensor/tensor_v2.cpp
- nntrainer/utils/node_exporter.cpp
Signed-off-by: Eunju Yang <ej.yang@samsung.com>
Eunju Yang [Wed, 6 Mar 2024 02:45:08 +0000 (11:45 +0900)]
[LoRA/Trivial] fix typo and edit comments
- Fix typos in the code
- edit comments to add some explanations
Signed-off-by: Eunju Yang <ej.yang@samsung.com>
Eunju Yang [Wed, 6 Mar 2024 02:13:28 +0000 (11:13 +0900)]
[LoRA] Revise LoRA implementation for fc_layer
- remove the `forwarding_lora()` function
- update the forwarding path with the LoRA option (see the sketch below)
- First, compute the forwarding logits of the base weight (W) and the lora weight (A @ B) respectively,
- then merge the logits to return
- [update] (W + A @ B)x -> Wx + (A @ B)x
- update `calcDerivative` to reflect the changes in the forwarding operation
- the implicit update of calcDerivative is applied accordingly.
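A rough sketch of the reorganized forward pass described above; the types and the orientation of A and B are assumptions for illustration, not the actual fc_layer code:
```
#include <vector>
using Mat = std::vector<std::vector<float>>;
using Vec = std::vector<float>;

// Plain matrix-vector product helper for the sketch.
static Vec matvec(const Mat &M, const Vec &v) {
  Vec out(M.size(), 0.f);
  for (size_t i = 0; i < M.size(); ++i)
    for (size_t j = 0; j < v.size(); ++j)
      out[i] += M[i][j] * v[j];
  return out;
}

// y = W x + (A @ B) x computed as W x + B (A x): the base path and the
// low-rank path stay separate, and the full matrix A @ B is never built.
Vec lora_forward(const Mat &W, const Mat &A, const Mat &B, const Vec &x,
                 float scaling) {
  Vec y = matvec(W, x); // base logits W x
  Vec h = matvec(A, x); // rank-r intermediate
  Vec d = matvec(B, h); // low-rank logits without forming A @ B
  for (size_t i = 0; i < y.size(); ++i)
    y[i] += scaling * d[i];
  return y;
}
```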
**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Eunju Yang <ej.yang@samsung.com>
Eunju Yang [Wed, 31 Jan 2024 07:07:31 +0000 (16:07 +0900)]
[LoRA] revise type of LoraRank property & fix error in fc_layer
- update type of LoraRank property : Property<int> -> PositiveIntegerProperty
- fix typo dot_batched_deriv_wrt_1 -> dot_deriv_wrt_1
- update code with add -> add_i
- apply clang-format
Signed-off-by: Eunju Yang <ej.yang@samsung.com>
Eunju Yang [Wed, 31 Jan 2024 07:05:50 +0000 (16:05 +0900)]
[LoRA] update node_exporter of fully connected layer
This commit updates the TfLite node exporter of the fully connected layer. It adds a new property (LoraRank) as an additional input property of the fully connected layer.
Signed-off-by: Eunju Yang <ej.yang@samsung.com>
Eunju Yang [Mon, 29 Jan 2024 06:57:15 +0000 (15:57 +0900)]
[LoRA] add a new feat(lora) to fc layer
This commit includes an implementation of LoRA only for the FC layer, which means it is not the generalized version. It needs to be written as a separate class in order to remove code duplication for other layers.
Signed-off-by: Eunju Yang <ej.yang@samsung.com>
Donghak PARK [Wed, 27 Mar 2024 05:27:19 +0000 (14:27 +0900)]
[Layer] Create Depthwise 2D Convolution
This pull request defines a header file for depthwise convolution.
It is a draft for a new layer, and any feedback or assistance is welcome.
This layer is necessary to support various applications such as SV.
- Depthwise convolution is a type of convolution in which each input channel is convolved with a different kernel (called a depthwise kernel).
- Unlike a regular 2D convolution, depthwise convolution does not mix information across different input channels.
**Changes proposed in this PR:**
- Add Depthwise Convolution 2D Layer
Resolves:
- #2520
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Donghak PARK <donghak.park@samsung.com>
skykongkong8 [Wed, 3 Apr 2024 07:02:16 +0000 (16:02 +0900)]
[ neon ] Apply kernel based hgemm
- Now hgemm subdirectory is included when neon fp16 is in use
- WIP : hgemm 8x16 kernel
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
skykongkong8 [Wed, 3 Apr 2024 04:29:06 +0000 (13:29 +0900)]
[ Trivial ] Fix typo
- The GEMM unittest for square 1024 was generating an improper dimension. Fix accordingly.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
skykongkong8 [Wed, 3 Apr 2024 04:23:42 +0000 (13:23 +0900)]
[ hgemm ] Use optimized hgemm if possible
- We can use the optimized version of hgemm under the following conditions (see the sketch after this list):
1. noTrans hgemm
2. M, N, K are divisible by 4 or 8
3. Row-major GEMM
4. alpha = 1.0, beta = 0.0 (will be patched soon)
- Otherwise, use the previous version as a fallback.
- Note that a few optimization strategies are still left for an optimal hgemm.
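A hedged sketch of the resulting dispatch predicate for the conditions above; the function name is a placeholder, not the exact nntrainer symbol:
```
// Placeholder predicate capturing the conditions above; the caller picks the
// kernel-based hgemm when it returns true and the previous fallback otherwise.
// Row-major layout (condition 3) is assumed by the caller.
bool can_use_optimized_hgemm(unsigned int M, unsigned int N, unsigned int K,
                             float alpha, float beta, bool transA,
                             bool transB) {
  const bool divisible = (M % 4 == 0) && (N % 4 == 0) && (K % 4 == 0);
  return !transA && !transB                 // 1. noTrans only
         && divisible                       // 2. M, N, K divisible by 4 (or 8)
         && alpha == 1.0f && beta == 0.0f;  // 4. restricted for now
}
```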
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
skykongkong8 [Wed, 3 Apr 2024 04:17:30 +0000 (13:17 +0900)]
[ hgemm ] Implement 8x8 hgemm kernel
- This commit introduces 2 types of 8x8 hgemm kernel
1. full-fp16
2. fp16-fp32 partial accumulation
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
skykongkong8 [Wed, 3 Apr 2024 04:16:06 +0000 (13:16 +0900)]
[ hgemm ] Implement 4x8 hgemm kernel
- This commit introduces 2 types of 4x8 hgemm kernel
1. full-fp16
2. fp16-fp32 partial accumulation
- Additionally, the 4x8 kernel has a macro kernel that can regulate the accuracy-latency tradeoff. By default it uses partial sums of up to 256 values. Other kernels will be refactored in this way ASAP.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
heka1024 [Tue, 2 Apr 2024 14:53:07 +0000 (23:53 +0900)]
Fix typo in test
Fix some typos in the test cases: `duing` -> `during`, `TSETS` -> `TESTS`. Also add doxygen for `nntrainer_LazyTensorOpsTest`
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: heka1024 <heka1024@gmail.com>
skykongkong8 [Fri, 29 Mar 2024 01:31:00 +0000 (10:31 +0900)]
[ HGEMM/draft ] Draft of kernel-based hgemm
- Previously, hgemm was implemented without taking packing / kernels into consideration.
- Here I would like to introduce kernel-based hgemm. It consists of:
1. packing the A / B matrices for the 4 / 8 divisible case
2. 4x4 and 8x8 hgemm kernels for the full-fp16 case
- More features like fine-grained packing strategies and kernels will be added in the near future.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
Donghyeon Jeong [Fri, 29 Mar 2024 06:22:09 +0000 (15:22 +0900)]
[Coverity] Fix coverity issues
This PR resolves coverity issues of overflow, use of auto that causes a copy, missing lock and thread lock.
**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Donghyeon Jeong <dhyeon.jeong@samsung.com>
Seungbaek Hong [Wed, 27 Mar 2024 09:44:06 +0000 (18:44 +0900)]
[svace] fix svace issues
fixed all svace issues on main branch
**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Seungbaek Hong <sb92.hong@samsung.com>
Donghyeon Jeong [Thu, 28 Mar 2024 04:20:52 +0000 (13:20 +0900)]
[Coverity] Fix coverity issues
This PR resolves coverity issues of use of auto that causes a copy and missing lock.
**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Donghyeon Jeong <dhyeon.jeong@samsung.com>
Donghak PARK [Wed, 27 Mar 2024 07:23:39 +0000 (16:23 +0900)]
[Trivial] Disable cpp-linter action's clang-format
We currently perform a Clang format check during our static checks.
The CPP-Linter we are using is from the Actions marketplace and occasionally produces different results even when the same version is specified.
This reduces efficiency for developers, so only the static check, which has more detailed logs, will be kept and the CPP-Linter check will be disabled.
However, the existing Linter function will remain.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Donghak PARK <donghak.park@samsung.com>
hyeonseok lee [Mon, 25 Mar 2024 10:40:50 +0000 (19:40 +0900)]
[coverity] fix coverity issues
- Added const auto & to avoid copying an object (illustrated below)
- Added a missing lock
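A generic illustration of the AUTO_CAUSES_COPY pattern being fixed above (not the actual nntrainer code):
```
#include <map>
#include <string>

void iterate(const std::map<std::string, std::string> &table) {
  // Before: plain `auto` deduces a value type, so every element is copied.
  // for (auto kv : table) { ... }

  // After: `const auto &` binds to each element without copying it.
  for (const auto &kv : table)
    (void)kv.first; // use the element; no copy is made
}
```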
Signed-off-by: hyeonseok lee <hs89.lee@samsung.com>
Donghak PARK [Fri, 22 Mar 2024 06:04:46 +0000 (15:04 +0900)]
[ coverity ] Fix Coverity issue
Fix Coverity issue on
- /test/unittest/layers/layers_golden_tests.cpp
- /test/unittest/models/unittest_models_recurrent.cpp
- /test/unittest/unittest_nntrainer_models.cpp
Resolves:
```
Use of auto that causes a copy (AUTO_CAUSES_COPY)
auto_causes_copy: This lambda has an unspecified return type
copy: This return statement creates a copy.
```
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Donghak PARK <donghak.park@samsung.com>
Eunju Yang [Thu, 21 Mar 2024 07:51:00 +0000 (16:51 +0900)]
[coverity] fix coverity issue
- This commit fixes the coverity issues
- AUTO_CAUSES_COPY
- MISSING_LOCK
Self-evaluation:
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Eunju Yang <ej.yang@samsung.com>
Seungbaek Hong [Mon, 25 Mar 2024 02:27:29 +0000 (11:27 +0900)]
[coverity] fix coverity issue
Fix coverity issue on
- /test/unittest/layers/layers_golden_recurrent.cpp
The other issues assigned have already been fixed.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Seungbaek Hong <sb92.hong@samsung.com>
Donghyeon Jeong [Thu, 21 Mar 2024 07:42:43 +0000 (16:42 +0900)]
[Coverity] Fix the coverity issue
This PR resolves the coverity issues of use of auto that causes a copy.
**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Donghyeon Jeong <dhyeon.jeong@samsung.com>
heka1024 [Sun, 17 Mar 2024 06:59:19 +0000 (15:59 +0900)]
Fix minor errors in github action
- `actions/setup-python@v1` is deprecated. So, bump version to v5.
- The step name says it uses Python 3.9, but it actually installs 3.10. Match the name with what it actually does.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: heka1024 <heka1024@gmail.com>
skykongkong8 [Thu, 21 Mar 2024 10:34:32 +0000 (19:34 +0900)]
[ coverity ] Fix coverity issue
- Fix coverity issues
1774230, 1774235, 1774238, 1774239, 1774243
Resolves:
```
Use of auto that causes a copy (AUTO_CAUSES_COPY)
auto_causes_copy: This lambda has an unspecified return type
copy: This return statement creates a copy.
```
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
heka1024 [Mon, 18 Mar 2024 16:35:10 +0000 (01:35 +0900)]
Use parameterized test in `NamePropertyTest`
To make code more readable, use parameterized test according to existing TODO comment.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: heka1024 <heka1024@gmail.com>
GyeongHoe Koo [Sun, 17 Mar 2024 06:44:18 +0000 (15:44 +0900)]
Bump actions/checkout in Ubuntu Meson build & test
Node 16 has reached end of life, so GitHub recommends transitioning to actions which use Node 20+. [Ref](https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/)
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: heka1024 <heka1024@gmail.com>
Eunju Yang [Mon, 18 Mar 2024 08:25:53 +0000 (17:25 +0900)]
[coverity] fix coverity issues
This commit fixes coverity issues of auto_causes_copy
- 1739360
- 1740106
Self-evaluation:
Build test: [X]Passed [ ]Failed [ ]Skipped
Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Eunju Yang <ej.yang@samsung.com>
Donghyeon Jeong [Mon, 18 Mar 2024 07:42:58 +0000 (16:42 +0900)]
[Coverity] Fix the coverity issue
This PR resolves the coverity issues of missing lock and use of auto that causes a copy.
**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Donghyeon Jeong <dhyeon.jeong@samsung.com>
Donghyeon Jeong [Thu, 7 Mar 2024 05:43:48 +0000 (14:43 +0900)]
[Layer] Remove Tensor setDataType() usage
In several layers, there are attempts to change the data type of a Tensor object after initializing it.
This is currently possible but can cause issues down the line (e.g., treating a FloatTensor object as a HalfTensor).
As such, the setDataType() method will be removed and should not be used in future updates.
Instead, users will need to provide the desired data type when creating a new tensor.
**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: Donghyeon Jeong <dhyeon.jeong@samsung.com>
skykongkong8 [Mon, 11 Mar 2024 05:05:00 +0000 (14:05 +0900)]
[ neon/trivial ] Use N8 for hgemm, and for starting index for the remaining Tensor area
- Like hgemv_transpose, use N8 for hgemm_noTrans as well
- we can re-use this value as the starting index for the remaining area (sketched below)
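For context, a hedged sketch of the pattern (variable names assumed): N8 is the largest multiple of 8 not exceeding N, so the wide loop covers [0, N8) and the scalar remainder starts at N8:
```
// Illustrative only: 8-wide main loop, with N8 reused as the starting index
// of the element-wise remainder loop.
float sum_with_n8(const float *x, unsigned int N) {
  const unsigned int N8 = (N >> 3) << 3; // largest multiple of 8 <= N
  float acc = 0.f;
  for (unsigned int i = 0; i < N8; i += 8) // vectorizable main body
    for (unsigned int j = 0; j < 8; ++j)
      acc += x[i + j];
  for (unsigned int i = N8; i < N; ++i)    // remainder starts at N8
    acc += x[i];
  return acc;
}
```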
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>
skykongkong8 [Mon, 11 Mar 2024 03:05:09 +0000 (12:05 +0900)]
[ neon ] Use bigger kernel in hgemv
- Using a kernel of up to 16x8 size shows the best latency. Apply accordingly.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: skykongkong8 <ss.kong@samsung.com>