NE10/DSP/RFFT: optimise RFFT for armv8

intrinsic, LLVM 3.5, -O2
on a57, juno, android

     | time in ms                | boost         |
     | NE10        | pffft       | pffft/NE10    |
size |  R2C |  C2R |  R2C | C2R* |   R2C |   C2R |
  32 |  145 |  185 |  279 |  319 | 1.92x | 1.72x |
  64 |  175 |  200 |  239 |  279 | 1.36x | 1.39x |
 128 |  166 |  185 |  237 |  262 | 1.42x | 1.41x |
 256 |  197 |  208 |  232 |  256 | 1.17x | 1.23x |
 512 |  208 |  216 |  254 |  270 | 1.22x | 1.25x |
1024 |  241 |  244 |  260 |  278 | 1.07x | 1.14x |
2048 |  258 |  263 |  332 |  322 | 1.28x | 1.22x |
4096 |  303 |  304 |  388 |  353 | 1.28x | 1.16x |
8192 |  339 |  334 |  424 |  426 | 1.25x | 1.27x |

intrinsic, GCC 4.9, -O2
on a57, juno, android

     | time in ms                | boost         |
     | NE10        | pffft       | pffft/NE10    |
size |  R2C |  C2R |  R2C | C2R* |   R2C |   C2R |
  32 |  174 |  181 |  328 |  410 | 1.88x | 2.26x |
  64 |  214 |  216 |  270 |  338 | 1.26x | 1.56x |
 128 |  210 |  197 |  259 |  310 | 1.23x | 1.57x |
 256 |  232 |  223 |  243 |  283 | 1.04x | 1.26x |
 512 |  250 |  222 |  263 |  307 | 1.04x | 1.38x |
1024 |  274 |  251 |  272 |  304 | 1.00x | 1.20x |
2048 |  288 |  277 |  314 |  353 | 1.08x | 1.27x |
4096 |  333 |  303 |  349 |  379 | 1.04x | 1.25x |
8192 |  370 |  342 |  424 |  452 | 1.14x | 1.31x |

* Ne10 can scale the output of the backward RFFT (C2R),
  while pffft cannot. To keep the comparison fair, a
  scale operation was added to the end of each pffft
  C2R call.
* pffft C2R takes 410 ms at size 32 but 338 ms at size
  64. The size-32 run loops more times than the size-64
  run, so the totals do not mean that pffft costs more
  time per transform on short inputs.
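
The scale operation mentioned in the first note could look
roughly like the sketch below. It assumes the usual 1/N
normalisation of an unscaled backward transform; the helper
name scale_backward_output is hypothetical and is not part of
pffft or Ne10, and the benchmark code itself may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a pffft or Ne10 function): multiply each
 * real output sample of an unscaled backward FFT by 1/fft_size, so
 * that forward followed by backward reproduces the input. */
static void scale_backward_output(float *out, size_t fft_size)
{
    assert(out != NULL && fft_size > 0);

    const float factor = 1.0f / (float)fft_size;
    for (size_t i = 0; i < fft_size; ++i)
        out[i] *= factor;
}
```

In a benchmark of this kind the helper would be called once
after every pffft C2R transform, so its cost is included in
the pffft C2R column.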

intrinsic, GCC 4.9, -O2
on a53, juno, android

     | time in ms   | boost      |
     | NE10 | pffft | pffft/NE10 |
size |  R2C |   R2C |        R2C |
  32 |  347 |   607 |      1.74x |
  64 |  389 |   489 |      1.25x |
 128 |  334 |   484 |      1.44x |
 256 |  401 |   456 |      1.13x |
 512 |  380 |   502 |      1.32x |
1024 |  460 |   512 |      1.11x |
2048 |  481 |   593 |      1.23x |
4096 |  605 |   709 |      1.17x |
8192 |  704 |   891 |      1.26x |

Change-Id: Ide0b974620ae8d06cfa862769004b2110abaaeff
12 files changed: