crypto: twofish-avx - tune assembler code for more performance
Patch replaces 'movb' instructions with 'movzbl' to break false register
dependencies and interleaves instructions better for out-of-order scheduling.
Tested on Intel Core i5-2450M and AMD FX-8100.
tcrypt ECB results:
Intel Core i5-2450M:
size old-vs-new new-vs-3way old-vs-3way
enc dec enc dec enc dec
256 1.12x 1.13x 1.36x 1.37x 1.21x 1.22x
1k 1.14x 1.14x 1.48x 1.49x 1.29x 1.31x
8k 1.14x 1.14x 1.50x 1.52x 1.32x 1.33x
AMD FX-8100:
size old-vs-new new-vs-3way old-vs-3way
enc dec enc dec enc dec
256 1.10x 1.11x 1.01x 1.01x 0.92x 0.91x
1k 1.11x 1.12x 1.08x 1.07x 0.97x 0.96x
8k 1.11x 1.13x 1.10x 1.08x 0.99x 0.97x
[v2]
- Do instruction interleaving another way to avoid adding new FPU<=>CPU
register moves as these cause performance drop on Bulldozer.
- Further interleaving improvements for better out-of-order scheduling.
Tested-by: Borislav Petkov <bp@alien8.de>
Cc: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>