mpn/alpha/README

   1 Copyright 1996, 1997, 1999, 2000, 2001, 2002, 2003, 2004, 2005 Free Software
   2 Foundation, Inc.
   3
   4 This file is part of the GNU MP Library.
   5
   6 The GNU MP Library is free software; you can redistribute it and/or modify it
   7 under the terms of the GNU Lesser General Public License as published by the
   8 Free Software Foundation; either version 3 of the License, or (at your
   9 option) any later version.
  10
  11 The GNU MP Library is distributed in the hope that it will be useful, but
  12 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
  13 FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
  14 for more details.
  15
  16 You should have received a copy of the GNU Lesser General Public License along
  17 with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.
  18
  19
  20
  21
  22
  23 This directory contains mpn functions optimized for DEC Alpha processors.
  24
  25 ALPHA ASSEMBLY RULES AND REGULATIONS
  26
  27 The `.prologue N' pseudo op marks the end of the instructions that need
  28 special handling by the unwinder.  It also says whether $27 is really needed
  29 for computing the gp.  The `.mask M' pseudo op says which registers are saved
  30 on the stack, and at what offset in the frame.
  31
  32 Cray T3 code is very very different...
  33
  34 "$6" / "$f6" etc is the usual syntax for registers, but on Unicos instead "r6"
  35 / "f6" is required.  We use the "r6" / "f6" forms, and have m4 defines expand
  36 them to "$6" or "$f6" where necessary.
  37
  38 "0x" introduces a hex constant in gas and DEC as, but on Unicos "^X" is
  39 required.  The X() macro accommodates this difference.
  40
  41 "cvttqc" is required by DEC as, "cvttq/c" is required by Unicos, and gas will
  42 accept either.  We use cvttqc and have an m4 define expand to cvttq/c where
  43 necessary.
  44
  45 "not" as an alias for "ornot r31, ..." is available in gas and DEC as, but not
  46 the Unicos assembler.  The full "ornot" must be used.
  47
  48 "unop" is not available in Unicos.  We make an m4 define to the usual "ldq_u
  49 r31,0(r30)", and in fact use that define on all systems since it comes out the
  50 same.
  51
  52 "!literal!123" etc explicit relocations as per Tru64 4.0 are apparently not
  53 available in older alpha assemblers (including gas prior to 2.12), according to
  54 the GCC manual, so the assembler macro forms must be used (e.g. ldgp).
  55
  56
  57
  58 RELEVANT OPTIMIZATION ISSUES
  59
  60 EV4
  61
  62 1. This chip has very limited store bandwidth.  The on-chip L1 cache is write-
  63    through, and a cache line is transferred from the store buffer to the off-
  64    chip L2 in as much as 15 cycles on most systems.  This delay hurts
  65    mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.
  66
  67 2. Pairing is possible between memory instructions and integer arithmetic
  68    instructions.
  69
  70 3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of these
  71    cycles are pipelined.  Thus, multiply instructions can be issued at a rate
  72    of one each 21st cycle.
  73
  74 EV5
  75
  76 1. The memory bandwidth of this chip is good, both for loads and stores.  The
  77    L1 cache can handle two loads or one store per cycle, but two cycles after a
  78    store, no ld can issue.
  79
  80 2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
  81    umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle.
  82    (Note that published documentation gets these numbers slightly wrong.)
  83
  84 3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
  85    are memory operations.  This will take at least
  86         ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles
  87    We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
  88    cache cycles, which should be completely hidden in the 19 issue cycles.
  89    The computation is inherently serial, with these dependencies:
  90
  91                ldq  ldq
  92                  \  /\
  93           (or)   addq |
  94            |\   /   \ |
  95            | addq  cmpult
  96             \  |     |
  97              cmpult  |
  98                  \  /
  99                   or
 100
 101    I.e., 3 operations are needed between carry-in and carry-out, making 12
 102    cycles the absolute minimum for the 4 limbs.  We could replace the `or' with
 103    a cmoveq/cmovne, which could issue one cycle earlier than the `or', but that
 104    might waste a cycle on EV4.  The total depth remains unaffected, since cmov
 105    has a latency of 2 cycles.
 106
 107      addq
 108      /   \
 109    addq  cmpult
 110      |      \
 111    cmpult -> cmovne
 112
 113    Montgomery has a slightly different way of computing carry that requires
 114    one fewer instruction, but has depth 4 (instead of the current 3).  Since
 115    the code is currently instruction issue bound, Montgomery's idea should save
 116    us 1/2 cycle per limb, or bring us down to a total of 17 cycles or 4.25
 117    cycles/limb.  Unfortunately, this method will not be good for the EV6.
 118
 119 4. addmul_1 and friends: We previously had a scheme for splitting the single-
 120    limb operand into 21-bit chunks and the multi-limb operand into 32-bit
 121    chunks, and then using FP operations for every other multiply and integer
 122    operations for the remaining multiplies.
 123
 124    But it seems much better to split the single-limb operand into 16-bit
 125    chunks, since we save many integer shifts and adds that way.  See
 126    powerpc64/README for some more details.
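
The 16-bit splitting idea in item 4 can be illustrated in C.  This is only a
sketch under stated assumptions, not the code in this directory: it assumes the
multi-limb operand is still split into 32-bit halves (as in the earlier
scheme), that partial products are formed in IEEE double precision (a 32-bit
by 16-bit product is below 2^48 and therefore exact in a 53-bit mantissa), and
that the pieces are recombined with integer adds.  The names
umul_fp16_sketch and add_shifted are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Add p * 2^s into the 128-bit accumulator (hi,lo).  p must be < 2^48 and
   s one of 0, 16, ..., 80, so the shifted value fits in 128 bits.  */
static void add_shifted(uint64_t *hi, uint64_t *lo, uint64_t p, unsigned s)
{
    uint64_t plo, phi, old;

    assert(s <= 80 && (p >> 48) == 0);
    if (s == 0) {
        plo = p;
        phi = 0;
    } else if (s < 64) {
        plo = p << s;
        phi = p >> (64 - s);
    } else {
        plo = 0;
        phi = p << (s - 64);
    }
    old = *lo;
    *lo += plo;
    *hi += phi + (*lo < old);   /* propagate the carry out of the low word */
}

/* Hypothetical helper: full 64x64 -> 128-bit product u * v.
   u is split into two 32-bit halves, v into four 16-bit chunks; each of the
   eight partial products is computed exactly as a double.  */
static void umul_fp16_sketch(uint64_t u, uint64_t v,
                             uint64_t *hi, uint64_t *lo)
{
    double ul = (double)(uint32_t)u;           /* low 32 bits of u */
    double uh = (double)(uint32_t)(u >> 32);   /* high 32 bits of u */
    unsigned i;

    *hi = 0;
    *lo = 0;
    for (i = 0; i < 4; i++) {
        double vi = (double)((v >> (16 * i)) & 0xffff);
        uint64_t pl = (uint64_t)(ul * vi);     /* weight 2^(16i) */
        uint64_t ph = (uint64_t)(uh * vi);     /* weight 2^(32+16i) */

        add_shifted(hi, lo, pl, 16 * i);
        add_shifted(hi, lo, ph, 32 + 16 * i);
    }
}
```

Every partial product lands at a bit position that is a multiple of 16, so
recombination needs only a few fixed shift amounts.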
 127
 128 EV6
 129
 130 Here we have a really parallel pipeline, capable of issuing up to 4 integer
 131 instructions per cycle.  In actual practice, it is never possible to sustain
 132 more than 3.5 integer insns/cycle due to rename register shortage.  One integer
 133 multiply instruction can issue each cycle.  To get optimal speed, we need to
 134 pretend we are vectorizing the code, i.e., minimize the depth of recurrences.
 135
 136 There are two dependencies to watch out for.  1) Address arithmetic
 137 dependencies, and 2) carry propagation dependencies.
 138
 139 We can avoid serializing due to address arithmetic by unrolling loops, so that
 140 addresses don't depend heavily on an index variable.  Avoiding serializing
 141 because of carry propagation is trickier; the ultimate performance of the code
 142 will be determined by the number of latency cycles it takes from accepting
 143 carry-in to a vector point until we can generate carry-out.
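
For mpn_add_n, the carry recurrence in question is the addq/cmpult/or chain
drawn in the EV5 section.  A minimal C sketch of it follows; add_n_sketch is a
hypothetical stand-in written for illustration, not the assembly in this
directory.  Only the statements marked "recurrence" depend on the incoming
carry; the first add and compare can be computed ahead of time for all limbs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mpn_add_n: rp[0..n-1] = up[] + vp[],
   returning the carry-out (0 or 1).  */
static uint64_t add_n_sketch(uint64_t *rp, const uint64_t *up,
                             const uint64_t *vp, size_t n)
{
    uint64_t cy = 0;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        uint64_t s  = up[i] + vp[i];  /* addq: independent of carry-in */
        uint64_t c1 = s < up[i];      /* cmpult: independent of carry-in */
        uint64_t r  = s + cy;         /* addq: recurrence step 1 */
        uint64_t c2 = r < cy;         /* cmpult: recurrence step 2 */

        cy = c1 | c2;                 /* or: recurrence step 3 */
        rp[i] = r;
    }
    return cy;
}
```

The three recurrence steps per limb correspond to the depth of 3 quoted for
EV5.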
 144
 145 Most integer instructions can execute in either the L0, U0, L1, or U1
 146 pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.
 147
 148 CMOV instructions split into two internal instructions, CMOV1 and CMOV2.  CMOV
 149 splits the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting that CMOV
 150 should always be placed as the last instruction of an aligned 4-instruction
 151 block, or perhaps simply avoided.
 152
 153 Perhaps the most important issue is the latency between the L0/U0 and L1/U1
 154 clusters; a result obtained on either cluster has an extra cycle of latency for
 155 consumers in the opposite cluster.  Because of the dynamic nature of the
 156 implementation, it is hard to predict where an instruction will execute.
 157
 158
 159
 160 REFERENCES
 161
 162 "Alpha Architecture Handbook", version 4, Compaq, October 1998, order number
 163 EC-QD2KC-TE.
 164
 165 "Alpha 21164 Microprocessor Hardware Reference Manual", Compaq, December 1998,
 166 order number EC-QP99C-TE.
 167
 168 "Alpha 21264/EV67 Microprocessor Hardware Reference Manual", revision 1.4,
 169 Compaq, September 2000, order number DS-0028B-TE.
 170
 171 "Compiler Writer's Guide for the Alpha 21264", Compaq, June 1999, order number
 172 EC-RJ66A-TE.
 173
 174 All of the above are available online from
 175
 176   http://ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html
 177   ftp://ftp.compaq.com/pub/products/alphaCPUdocs
 178
 179 "Tru64 Unix Assembly Language Programmer's Guide", Compaq, March 1996, part
 180 number AA-PS31D-TE.
 181
 182 "Digital UNIX Calling Standard for Alpha Systems", Digital Equipment Corp,
 183 March 1996, part number AA-PY8AC-TE.
 184
 185 The above are available online,
 186
 187   http://h30097.www3.hp.com/docs/pub_page/V40F_DOCS.HTM
 188
 189 (Dunno what h30097 means in this URL, but if it moves try searching for "tru64
 190 online documentation" from the main www.hp.com page.)
 191
 192
 193
 194 ----------------
 195 Local variables:
 196 mode: text
 197 fill-column: 79
 198 End: