mpn/sparc64/README

Copyright 1997, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.





This directory contains mpn functions for 64-bit V9 SPARC.

RELEVANT OPTIMIZATION ISSUES

Notation:
  IANY = shift/add/sub/logical/sethi
  IADDLOG = add/sub/logical/sethi
  MEM = ld*/st*
  FA = fadd*/fsub*/f*to*/fmov*
  FM = fmul*

UltraSPARC can issue four instructions per cycle, with these restrictions:
* Two IANY instructions, but only one of these may be a shift.  If there is a
  shift and an IANY instruction, the shift must precede the IANY instruction.
* One FA.
* One FM.
* One branch.
* One MEM.
* IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle.  Taken branches
  should not be in slot 4, since that makes the delay insn come from a
  separate bundle.
* If two IANY/IADDLOG instructions are to be executed in the same cycle and one
  of these is setting the condition codes, that instruction must be the second
  one.

To summarize, ignoring branches, these are the bundles that can reach the peak
execution speed:

insn1   iany    iany    mem     iany    iany    mem     iany    iany    mem
insn2   iaddlog mem     iany    mem     iaddlog iany    mem     iaddlog iany
insn3   mem     iaddlog iaddlog fa      fa      fa      fm      fm      fm
insn4   fa/fm   fa/fm   fa/fm   fm      fm      fm      fa      fa      fa
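
The issue restrictions above can be modeled as a small checker.  This is only
a sketch of the rules as listed: the names insn_class, struct insn and
bundle_ok are hypothetical and not part of GMP, the advice about taken
branches in slot 4 is left out, and a bundle in which the first of two integer
instructions sets the condition codes is simply rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction classes from the Notation list.  IANY means SHIFT or IADDLOG. */
enum insn_class { SHIFT, IADDLOG, MEM, FA, FM, BRANCH };

struct insn
{
  enum insn_class cls;
  int sets_cc;                  /* nonzero if it writes the condition codes */
};

/* Return 1 if b[0..n-1] (n <= 4) obeys the listed restrictions, else 0.
   Hypothetical helper for illustration only.  */
int
bundle_ok (const struct insn *b, size_t n)
{
  int nint = 0, nshift = 0, nmem = 0, nfa = 0, nfm = 0, nbr = 0;
  int first_int_sets_cc = 0;

  assert (b != NULL && n <= 4);

  for (size_t i = 0; i < n; i++)
    {
      switch (b[i].cls)
        {
        case SHIFT:
        case IADDLOG:
          if (i >= 3)
            return 0;           /* integer insns only as insn 1, 2 or 3 */
          if (b[i].cls == SHIFT)
            {
              if (nint > 0)
                return 0;       /* a shift must precede the other IANY */
              nshift++;
            }
          if (nint == 0)
            first_int_sets_cc = b[i].sets_cc;
          else if (first_int_sets_cc)
            return 0;           /* the cc-setting insn must be the second */
          nint++;
          break;
        case MEM:
          if (i >= 3)
            return 0;           /* MEM only as insn 1, 2 or 3 */
          nmem++;
          break;
        case FA:
          nfa++;
          break;
        case FM:
          nfm++;
          break;
        case BRANCH:
          nbr++;
          break;
        }
    }
  return nint <= 2 && nshift <= 1 && nmem <= 1 && nfa <= 1 && nfm <= 1
         && nbr <= 1;
}
```

Under this model every column of the table is accepted once iany is replaced
by either a shift or an add/sub/logical/sethi, while for example
iaddlog, shift, mem, fm is rejected because the shift does not come first.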

The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
depending on the position of the most significant bit of the first source
operand.  When used for 32x32->64 multiplication, it needs 20 cycles.
Furthermore, it stalls the processor while executing.  We stay away from that
instruction, and instead use floating-point operations.
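
As an illustration of why floating-point operations can stand in for mulx,
here is a minimal C sketch of a 32x32->64 multiplication done with double
precision products.  The split of one operand into 16-bit pieces is an
assumption of this sketch, chosen so that each product stays below 2^48 and is
therefore exact in a 53-bit mantissa; the name mul32_via_fp is hypothetical
and the code is not a transcription of the assembly in this directory.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>

/* Multiply two 32-bit values using only fp multiplies plus integer
   recombination.  Hypothetical helper for illustration.  */
uint64_t
mul32_via_fp (uint32_t u, uint32_t v)
{
  /* The products below need 48 exact mantissa bits.  */
  assert (FLT_RADIX == 2 && DBL_MANT_DIG >= 48);

  double ud = (double) u;                       /* exact, below 2^32 */
  double p0 = ud * (double) (v & 0xffffu);      /* exact, below 2^48 */
  double p1 = ud * (double) (v >> 16);          /* exact, below 2^48 */

  /* Convert back to integers and add with the proper weights.  The sum
     equals u*v, which is below 2^64, so nothing overflows.  */
  return (uint64_t) p0 + ((uint64_t) p1 << 16);
}
```

The two multiplies are independent of each other, so on a pipelined fp
multiplier they can be issued in consecutive cycles.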

Floating-point add and multiply units are fully pipelined.  The latency for
UltraSPARC-1/2 is 3 cycles and for UltraSPARC-3 it is 4 cycles.

Integer conditional move instructions cannot dual-issue with other integer
instructions.  No conditional move can issue 1-5 cycles after a load.  (This
might have been fixed for UltraSPARC-3.)

The UltraSPARC-3 pipeline is very similar to that of UltraSPARC-1/2, but is
somewhat slower.  Branches execute slower, and there may be other new stalls.
On the other hand, integer multiply doesn't stall the entire CPU and has a
much lower latency.  It is still not pipelined, though, and thus useless for
our needs.

STATUS

* mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb on
  UltraSPARC-1/2 and 2.65 on UltraSPARC-3.  For UltraSPARC-1/2, the IEU0
  functional unit is saturated with shifts.

* mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb on
  UltraSPARC-1/2 and 4.5 cycles/limb on UltraSPARC-3.  The 4-instruction
  recurrence is the speed limiter.

* mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically on
  UltraSPARC-1/2 and 17.5 cycles/limb on UltraSPARC-3.  On UltraSPARC-1/2, the
  code sustains 4 instructions/cycle.  It might be possible to invent a better
  way of summing the intermediate 49-bit operands, but it is unlikely that it
  will save enough instructions to save an entire cycle.

  The load-use of the u operand is not scheduled far enough apart for good L2
  cache performance.  The UltraSPARC-1/2 L1 cache is direct mapped, and since
  we use temporary stack slots that will conflict with the u and r operands,
  we miss to L2 very often.  The load-use of the std/ldx pairs via the stack
  is perhaps over-scheduled.

  It would be possible to save two instructions: (1) The mov could be avoided
  if the std/ldx were less scheduled.  (2) The ldx of the r operand could be
  split into two ld instructions, saving the shifts/masks.

  It should be possible to reach 14 cycles/limb for UltraSPARC-3 if the fp
  operations were rescheduled for this processor's 4-cycle latency.

* mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
  code.  It would be possible to shave one or two cycles from it, with some
  labour.

* mpn_submul_1: Simpleminded code just calling mpn_mul_1 + mpn_sub_n.  This
  means that it runs at 18 cycles/limb on UltraSPARC-1/2 and 23 cycles/limb on
  UltraSPARC-3.  It would be possible to either match the mpn_addmul_1
  performance, or in the worst case use one more instruction group.

* US1/US2 cache conflict resolving.  The direct-mapped L1 data cache of
  US1/US2 is a problem for mul_1, addmul_1 (and a prospective submul_1).  We
  should allocate a larger cache area, and put the stack temp area in a place
  that doesn't cause cache conflicts.
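
For the mpn_lshift/mpn_rshift item in STATUS, the following C sketch shows the
work per limb.  Each result limb combines two shifted source limbs, so a limb
costs two shifts, and with at most one shift issued per cycle that matches the
2.0 cycles/limb figure.  The name ref_lshift is hypothetical; the code is a
plain reference loop, not the scheduled assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift {up,n} left by cnt bits (1 <= cnt <= 63), store at {rp,n} and
   return the bits shifted out of the top limb.  Works from the most
   significant limb down.  */
uint64_t
ref_lshift (uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
  assert (n >= 1 && cnt >= 1 && cnt <= 63);

  unsigned tnc = 64 - cnt;
  uint64_t high = up[n - 1];
  uint64_t ret = high >> tnc;

  for (size_t i = n - 1; i > 0; i--)
    {
      uint64_t low = up[i - 1];
      rp[i] = (high << cnt) | (low >> tnc);     /* two shifts, one logical */
      high = low;
    }
  rp[0] = high << cnt;
  return ret;
}
```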
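
For the mpn_add_n/mpn_sub_n item in STATUS, this C sketch makes the carry
recurrence visible: the carry out of limb i is needed before limb i+1 can be
finished, so the chain from one carry to the next bounds the speed regardless
of issue width.  Which four instructions form the chain in the assembly is not
stated in this README; the sketch only shows the data dependency, and the name
ref_add_n is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add {up,n} and {vp,n}, store the sum at {rp,n}, return the carry out.  */
uint64_t
ref_add_n (uint64_t *rp, const uint64_t *up, const uint64_t *vp, size_t n)
{
  uint64_t cy = 0;

  assert (n >= 1);

  for (size_t i = 0; i < n; i++)
    {
      uint64_t u = up[i];
      uint64_t s = u + vp[i];   /* independent of the carry chain */
      uint64_t c1 = s < u;      /* carry from u + v */
      uint64_t r = s + cy;      /* depends on the previous limb's carry */
      uint64_t c2 = r < s;      /* carry from adding the carry-in */
      rp[i] = r;
      cy = c1 | c2;             /* at most one of c1, c2 can be set */
    }
  return cy;
}
```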
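
For the mpn_addmul_1 item in STATUS, the sketch below shows one way 49-bit
intermediate operands can arise.  It assumes, and this is an assumption rather
than something this README states, that the multiplier limb is split into four
16-bit pieces and each source limb into two 32-bit halves.  Every partial
product is then below 2^48, and two products of equal weight sum to below
2^49, all exact in double precision.  The names acc_add and fp_addmul_1 are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add x * 2^sh into the 128-bit accumulator hi:lo.  x is below 2^49 and
   sh is one of 0, 16, 32, 48, 64, 80.  */
static void
acc_add (uint64_t *lo, uint64_t *hi, uint64_t x, unsigned sh)
{
  if (sh >= 64)
    *hi += x << (sh - 64);
  else
    {
      uint64_t l = x << sh;
      *hi += sh ? x >> (64 - sh) : 0;
      *lo += l;
      *hi += *lo < l;           /* carry out of the low word */
    }
}

/* {rp,n} += {up,n} * v, return the carry limb.  */
uint64_t
fp_addmul_1 (uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
  double vd[4];
  uint64_t cy = 0;

  assert (n >= 1);

  for (int k = 0; k < 4; k++)
    vd[k] = (double) ((v >> (16 * k)) & 0xffff);

  for (size_t i = 0; i < n; i++)
    {
      double u0 = (double) (up[i] & 0xffffffffu);
      double u1 = (double) (up[i] >> 32);
      uint64_t lo = 0, hi = 0;

      /* Products of equal weight are added in fp before conversion.  */
      acc_add (&lo, &hi, (uint64_t) (u0 * vd[0]), 0);
      acc_add (&lo, &hi, (uint64_t) (u0 * vd[1]), 16);
      acc_add (&lo, &hi, (uint64_t) (u0 * vd[2] + u1 * vd[0]), 32);
      acc_add (&lo, &hi, (uint64_t) (u0 * vd[3] + u1 * vd[1]), 48);
      acc_add (&lo, &hi, (uint64_t) (u1 * vd[2]), 64);
      acc_add (&lo, &hi, (uint64_t) (u1 * vd[3]), 80);

      lo += rp[i];
      hi += lo < rp[i];
      lo += cy;
      hi += lo < cy;
      rp[i] = lo;
      cy = hi;
    }
  return cy;
}
```

In this formulation the integer side has to convert, shift and add six values
per limb, which gives a feel for why the summing of the intermediate operands
dominates the instruction count.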
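
For the mpn_submul_1 item in STATUS, the composition "mpn_mul_1 + mpn_sub_n"
can be sketched as below.  The sketch uses 32-bit limbs so that it stays in
portable C, and it takes the scratch area tp as a parameter; both choices, and
all the ref_ names, are assumptions of the sketch rather than the interface of
the real routines.  The quoted 18 cycles/limb would be consistent with a
multiply pass at about the mpn_addmul_1 speed followed by a 4 cycles/limb
subtract pass, though the README does not give that breakdown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;        /* 32-bit limbs keep the sketch portable */

/* {rp,n} = {up,n} * v, return the high limb.  */
static limb_t
ref_mul_1 (limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
  limb_t cy = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint64_t p = (uint64_t) up[i] * v + cy;
      rp[i] = (limb_t) p;
      cy = (limb_t) (p >> 32);
    }
  return cy;
}

/* {rp,n} = {up,n} - {vp,n}, return the borrow out.  */
static limb_t
ref_sub_n (limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
  limb_t bw = 0;
  for (size_t i = 0; i < n; i++)
    {
      limb_t u = up[i], v = vp[i];
      limb_t d = u - v;
      limb_t r = d - bw;
      bw = (u < v) | (d < bw);  /* uses the incoming borrow on the right */
      rp[i] = r;
    }
  return bw;
}

/* {rp,n} -= {up,n} * v using n limbs of scratch at tp.  Returns the limb
   that would have to be subtracted from the next higher limb.  */
limb_t
ref_submul_1 (limb_t *rp, const limb_t *up, size_t n, limb_t v, limb_t *tp)
{
  assert (n >= 1 && tp != NULL);

  limb_t cy = ref_mul_1 (tp, up, n, v);         /* first pass: product */
  return cy + ref_sub_n (rp, rp, tp, n);        /* second pass: subtract */
}
```

The two passes each read and write n limbs, which is the extra memory traffic
a fused loop in the style of mpn_addmul_1 would avoid.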
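
For the cache conflict item in STATUS, the following C sketch shows the
condition under which two addresses compete for the same line of a
direct-mapped cache.  The cache and line sizes are parameters; the 16 KB and
32 byte values in the usage note are illustrative assumptions, since this
README does not state the L1 geometry.  The name dm_conflict is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if addresses a and b lie in different memory lines that map to
   the same line of a direct-mapped cache, else 0.  */
int
dm_conflict (uintptr_t a, uintptr_t b, size_t cache_bytes, size_t line_bytes)
{
  assert (line_bytes > 0 && cache_bytes >= line_bytes
          && cache_bytes % line_bytes == 0);

  size_t nlines = cache_bytes / line_bytes;
  uintptr_t la = a / line_bytes;        /* memory line number of a */
  uintptr_t lb = b / line_bytes;        /* memory line number of b */

  return la != lb && la % nlines == lb % nlines;
}
```

With an assumed 16384-byte cache and 32-byte lines, a stack slot at 0x14000
evicts operand data at 0x10000 because the two addresses differ by exactly the
cache size, while 0x10020 maps to a neighbouring line and does not conflict.
Placing the temporary area at an offset that avoids the lines used by the u
and r operands is the kind of fix the item asks for.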