From: Christophe Leroy
Date: Tue, 22 Sep 2015 14:34:32 +0000 (+0200)
Subject: powerpc32: optimise csum_partial() loop

On the 8xx, load latency is 2 cycles and taking branches also takes
2 cycles. So let's unroll the loop.

This patch improves csum_partial() speed by around 10% on both:
* 8xx (single issue processor with parallel execution)
* 83xx (superscalar 6xx processor with dual instruction fetch
  and parallel execution)

Signed-off-by: Christophe Leroy
Signed-off-by: Scott Wood
---

diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
index 9c12602..0d34f47 100644
--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S
@@ -38,10 +38,24 @@ _GLOBAL(csum_partial)
 	srwi.	r6,r4,2		/* # words to do */
 	adde	r5,r5,r0
 	beq	3f
-1:	mtctr	r6
+1:	andi.	r6,r6,3		/* Prepare to handle words 4 by 4 */
+	beq	21f
+	mtctr	r6
 2:	lwzu	r0,4(r3)
 	adde	r5,r5,r0
 	bdnz	2b
+21:	srwi.	r6,r4,4		/* # blocks of 4 words to do */
+	beq	3f
+	mtctr	r6
+22:	lwz	r0,4(r3)
+	lwz	r6,8(r3)
+	lwz	r7,12(r3)
+	lwzu	r8,16(r3)
+	adde	r5,r5,r0
+	adde	r5,r5,r6
+	adde	r5,r5,r7
+	adde	r5,r5,r8
+	bdnz	22b
 3:	andi.	r0,r4,2
 	beq+	4f
 	lhz	r0,4(r3)
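
For readers less familiar with PowerPC assembly, the C below is a minimal
sketch of the same loop structure the patch introduces: the leftover words
(word count mod 4) are summed first, then the bulk of the buffer is handled
four 32-bit words per iteration so loads can overlap and the branch cost is
paid once per four words. The function name csum_words_unrolled and the
64-bit accumulator are illustrative assumptions for this sketch only; the
real routine is the assembly csum_partial above.

#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch only, not kernel code: sum the word-aligned body
 * of a buffer the way the patched csum_partial does it.
 */
uint64_t csum_words_unrolled(const uint32_t *p, size_t nwords, uint64_t sum)
{
	size_t i = 0;

	/* Leftover words first (mirrors "andi. r6,r6,3" and label 2:),
	 * so the main loop only sees a multiple of 4 words. */
	for (; i < (nwords & 3); i++)
		sum += p[i];

	/* Main loop: four words per iteration (mirrors labels 21:/22:). */
	for (; i + 4 <= nwords; i += 4) {
		sum += p[i];
		sum += p[i + 1];
		sum += p[i + 2];
		sum += p[i + 3];
	}

	/* A 64-bit accumulator keeps the carries that the assembly
	 * collects with adde; the caller folds them into 16 bits. */
	return sum;
}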