Optimize the unaligned buffer copy logic
Because the byte-aligned read and write send instructions are
very slow, we optimize the copy to avoid using them.
We separate the unaligned case into three cases:
1. The src and dst have the same %4 unaligned offset.
Then we just need to handle the first and last dword.
2. The src has a bigger %4 unaligned offset than the dst.
We need to do some shift and montage between src[i]
and src[i+1].
3. The last case, the src has a smaller %4 unaligned offset
than the dst. Then we need to do the same for src[i-1] and src[i].
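The three cases can be sketched in plain C as below. This is only an
illustrative model, not the kernel code from this patch: the names
copy_unaligned and load_word are hypothetical, offsets are given in bytes
from a dword-aligned base, and byte 0 of a dword is assumed to be its
least significant byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: load src[i] only if it overlaps the copied range.
   A skipped word contributes only bytes that the caller masks out. */
static uint32_t load_word(const uint32_t *src, ptrdiff_t i,
                          ptrdiff_t lo, ptrdiff_t hi)
{
    return (i >= lo && i <= hi) ? src[i] : 0;
}

/* Hypothetical sketch: copy n bytes from byte offset src_off of src to
   byte offset dst_off of dst, using only dword loads and stores. */
void copy_unaligned(uint32_t *dst, size_t dst_off,
                    const uint32_t *src, size_t src_off, size_t n)
{
    if (n == 0)
        return;

    ptrdiff_t src_lo = (ptrdiff_t)(src_off / 4);
    ptrdiff_t src_hi = (ptrdiff_t)((src_off + n - 1) / 4);
    size_t dst_lo = dst_off / 4;
    size_t dst_hi = (dst_off + n - 1) / 4;
    /* Difference of the two %4 offsets, in bytes (-3..3). */
    int shift = (int)(src_off % 4) - (int)(dst_off % 4);
    ptrdiff_t word_delta = src_lo - (ptrdiff_t)dst_lo;

    for (size_t j = dst_lo; j <= dst_hi; j++) {
        ptrdiff_t i = (ptrdiff_t)j + word_delta;
        uint32_t v;

        if (shift == 0) {
            /* Case 1: same %4 offset, dwords line up. */
            v = load_word(src, i, src_lo, src_hi);
        } else if (shift > 0) {
            /* Case 2: src offset is bigger, merge src[i] and src[i+1]. */
            v = (load_word(src, i, src_lo, src_hi) >> (8 * shift)) |
                (load_word(src, i + 1, src_lo, src_hi) << (32 - 8 * shift));
        } else {
            /* Case 3: src offset is smaller, merge src[i-1] and src[i]. */
            int t = -shift;
            v = (load_word(src, i - 1, src_lo, src_hi) >> (32 - 8 * t)) |
                (load_word(src, i, src_lo, src_hi) << (8 * t));
        }

        /* Only the first and last dst dwords are partially written. */
        uint32_t mask = 0xFFFFFFFFu;
        if (j == dst_lo)
            mask &= 0xFFFFFFFFu << (8 * (dst_off % 4));
        if (j == dst_hi)
            mask &= 0xFFFFFFFFu >> (8 * (3 - (dst_off + n - 1) % 4));
        dst[j] = (dst[j] & ~mask) | (v & mask);
    }
}
```

In this sketch every dst dword is written exactly once with a full dword
store, and byte granularity is only needed for the masks on the first and
last dword.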
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>