Multiply only for Transform Matrix + NEON comment cleanup
If a 4x4 matrix is a transform matrix, we can optimize the matrix multiply function.
This reduces the time spent in Transform Update.
Below are some test results.
1. Issuing VLD1.F32 for each load is faster than using VLDM.
2. Transposing lhs -> multiplying -> transposing tmp is slower than the current logic.
3. Passing "+r"(temp) as an output operand is slower than passing "r"(temp) as an input operand with "%r0" (why?)
--> But when the output operand in the current Multiply is changed to an input operand, it slows down. (why?)
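For reference, the transform-only multiply described above can be sketched in plain C++ roughly as follows. This is only an illustration, not the NEON code in this change. It assumes a column-major float[16] layout and that both inputs have a bottom row of (0, 0, 0, 1); the names Mat4, MultiplyGeneral and MultiplyTransform are hypothetical.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 4x4 matrix type, column-major: element(row, col) = m[col * 4 + row].
using Mat4 = std::array<float, 16>;

// General product out = a * b (64 multiplies).
Mat4 MultiplyGeneral(const Mat4& a, const Mat4& b)
{
  Mat4 out{};
  for(std::size_t col = 0; col < 4; ++col)
  {
    for(std::size_t row = 0; row < 4; ++row)
    {
      float sum = 0.0f;
      for(std::size_t k = 0; k < 4; ++k)
      {
        sum += a[k * 4 + row] * b[col * 4 + k];
      }
      out[col * 4 + row] = sum;
    }
  }
  return out;
}

// Transform-only product out = a * b (36 multiplies).
// Assumption: the bottom row of both a and b is (0, 0, 0, 1),
// so those elements are never read and the result's bottom row is written directly.
Mat4 MultiplyTransform(const Mat4& a, const Mat4& b)
{
  Mat4 out{};
  for(std::size_t col = 0; col < 4; ++col)
  {
    for(std::size_t row = 0; row < 3; ++row)
    {
      // Only the upper 3x3 block of a contributes through multiplication.
      out[col * 4 + row] = a[0 * 4 + row] * b[col * 4 + 0] +
                           a[1 * 4 + row] * b[col * 4 + 1] +
                           a[2 * 4 + row] * b[col * 4 + 2];
    }
    out[col * 4 + 3] = 0.0f;
  }
  // b's last column ends in 1, so a's translation is added once.
  out[12] += a[12];
  out[13] += a[13];
  out[14] += a[14];
  out[15] = 1.0f;
  return out;
}
```

For inputs that satisfy the bottom-row assumption, both functions should produce the same matrix; MultiplyTransform must not be used for projection matrices.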
Change-Id: Ibc5e1c252ec200d356e649ed6448cd45b3a5d980
Signed-off-by: Eunki Hong <eunkiki.hong@samsung.com>