mpn/ia64/README

Copyright 2000, 2001, 2002, 2003, 2004, 2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.



                      IA-64 MPN SUBROUTINES


This directory contains mpn functions for the IA-64 architecture.


CODE ORGANIZATION

        mpn/ia64          itanium-2, and generic ia64

The code here has been optimized primarily for Itanium 2.  Very few Itanium 1
chips were ever sold, and Itanium 2 is more powerful, so the latter is what
we concentrate on.



CHIP NOTES

The IA-64 ISA packs instructions in groups of three into 128-bit bundles.
Programmers/compilers need to put explicit breaks `;;' when there are WAW or
RAW dependencies, with some notable exceptions.  Such "breaks" are typically
at the end of a bundle, but can be put between operations within some bundle
types too.
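
As an illustration of the bundle layout, here is a minimal C sketch that
splits a 128-bit bundle into its 5-bit template field and three 41-bit
instruction slots.  The type and function names (ia64_bundle, bundle_pack,
bundle_template, bundle_slot) are made up for this example and are not part
of GMP; the bundle is assumed to be held as two 64-bit halves, low half
first.

```c
#include <stdint.h>

/* One 128-bit bundle as two 64-bit halves (lo = bits 0..63).
   Hypothetical helper type, for illustration only. */
typedef struct { uint64_t lo, hi; } ia64_bundle;

#define SLOT_MASK ((UINT64_C(1) << 41) - 1)

/* Assemble a bundle from a template and three 41-bit slots. */
static ia64_bundle bundle_pack(unsigned tmpl, uint64_t s0, uint64_t s1,
                               uint64_t s2)
{
    ia64_bundle b;
    s0 &= SLOT_MASK;
    s1 &= SLOT_MASK;
    s2 &= SLOT_MASK;
    b.lo = (uint64_t)(tmpl & 0x1f) | (s0 << 5) | (s1 << 46);
    b.hi = (s1 >> 18) | (s2 << 23);
    return b;
}

/* Bits 0..4: the template, which selects the unit types and stop positions. */
static unsigned bundle_template(ia64_bundle b)
{
    return (unsigned)(b.lo & 0x1f);
}

/* Slot 0 is bits 5..45, slot 1 is bits 46..86, slot 2 is bits 87..127. */
static uint64_t bundle_slot(ia64_bundle b, int n)
{
    switch (n) {
    case 0:  return (b.lo >> 5) & SLOT_MASK;
    case 1:  return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;
    default: return (b.hi >> 23) & SLOT_MASK;
    }
}
```

Slot 1 straddles the two halves, which is why it needs bits from both.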

The Itanium 1 and Itanium 2 implementations can under ideal conditions
execute two bundles, i.e. six instructions, per cycle.  The Itanium 1 allows
4 of these six to be integer operations, while the Itanium 2 allows all 6 to
be integer operations.

Taken cloop branches seem to insert a bubble into the pipeline most of the
time on Itanium 1.

Loads to the fp registers bypass the L1 cache and thus get extremely long
latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2.

Software pipelining with the br.ctop instruction causes delays, since
many issue slots are taken up by instructions with zero predicates, and
since many extra instructions are needed to set things up.  These features
are clearly designed for code density, not speed.

Misc pipeline limitations (Itanium 1):
* The getf.sig instruction can only execute in M0.
* At most four integer instructions/cycle.
* Nops take up resources like any plain instructions.

Misc pipeline limitations (Itanium 2):
* The getf.sig instruction can only execute in M0.
* Nops take up resources like any plain instructions.


ASSEMBLY SYNTAX

.align pads with nops in a text segment, but gas 2.14 and earlier
incorrectly byte-swaps its nop bundle in big endian mode (e.g. hpux), making
it come out as break instructions.  We use the ALIGN() macro in
mpn/ia64/ia64-defs.m4 wherever execution might fall through the padding.
That macro suppresses any .align if the problem is detected by configure.
Lack of alignment might hurt performance but will at least be correct.

foo:: to create a global symbol is not accepted by gas.  Use separate
".global foo" and "foo:" instead.

.global is the standard global directive.  gas accepts .globl, but hpux "as"
doesn't.

.proc / .endp generates the appropriate .type and .size information for ELF,
so the latter directives don't need to be given explicitly.

.pred.rel "mutex"... is standard for annotating predicate register
relationships.  gas also accepts .pred.rel.mutex, but hpux "as" doesn't.

.pred directives can't be put on a line with a label, like
".Lfoo: .pred ..."; the HP assembler on HP-UX 11.23 rejects that.
gas is happy with it, and past versions of the HP assembler seemed OK.

// is the standard comment sequence, but we prefer "C" since it inhibits m4
macro expansion.  See comments in ia64-defs.m4.


REGISTER USAGE

Special:
   r0: constant 0
   r1: global pointer (gp)
   r8: return value
   r12: stack pointer (sp)
   r13: thread pointer (tp)
Caller-saves: r8-r11 r14-r31 f6-f15 f32-f127
Caller-saves but rotating: r32-


================================================================
mpn_add_n, mpn_sub_n:

The current code runs at 1.25 c/l (cycles per limb) on Itanium 2.

================================================================
mpn_mul_1:

The current code runs at 2 c/l on Itanium 2.

Using a blocked approach, working on 4 separate places in the operands at
once, one could make use of the xma accumulation, and approach 1 c/l.

        ldf8 [up]
        xma.l
        xma.hu
        stf8  [wrp]
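
To make the "xma accumulation" concrete, here is a rough C model of the
loop body above.  It is only a sketch: xma_l, xma_hu and ref_mul_1 are
hypothetical names for this example, not GMP's code, and the 128-bit type
is a GCC/Clang extension assumed to be available.  xma.l and xma.hu are
modelled as the low and high 64 bits of a*b+c, so the high half of one
product can be fed straight in as the addend of the next.

```c
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* assumption: GCC/Clang extension */

/* Models of xma.l / xma.hu: low and high 64 bits of a*b + c.
   a*b + c always fits in 128 bits. */
static uint64_t xma_l(uint64_t a, uint64_t b, uint64_t c)
{
    return (uint64_t)((u128)a * b + c);
}

static uint64_t xma_hu(uint64_t a, uint64_t b, uint64_t c)
{
    return (uint64_t)(((u128)a * b + c) >> 64);
}

/* Hypothetical reference routine: rp[] = up[] * v, returns the top limb. */
static uint64_t ref_mul_1(uint64_t *rp, const uint64_t *up, size_t n,
                          uint64_t v)
{
    uint64_t hi = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t lo = xma_l(up[i], v, hi);   /* previous high limb added in */
        hi = xma_hu(up[i], v, hi);
        rp[i] = lo;
    }
    return hi;
}
```

A single chain like this is serialized on hi, so each step has to wait for
the previous xma to complete.  That is presumably why the text suggests
working on several blocks of the operands at once; joining the blocks
afterwards needs extra carry handling that is not shown here.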

================================================================
mpn_addmul_1:

The current code runs at 2 c/l on Itanium 2.

It seems possible to use a blocked approach, as with mpn_mul_1.  We should
read rp[] into integer registers, allowing for just one getf.sig per cycle.

        ld8  [rp]
        ldf8 [up]
        xma.l
        xma.hu
        getf.sig
        add+add+cmp+cmp
        st8  [wrp]

These 10 instructions can be scheduled to approach 1.667 c/l (10
instructions at 6 per cycle), and with the 4 cycle latency of xma, this
means we need at least 3 blocks.  Using ldfp8 we could approach 1.583 c/l.
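
The data flow of that instruction list can be sketched in C as below.  This
is an illustrative model only: ref_addmul_1 and the xma helpers are
hypothetical names, not GMP's code, and unsigned __int128 is a GCC/Clang
extension assumed to be available.  The product chain stays on the
"floating-point" side (the xma pair), while the addition into rp[] and its
carry chain happen on the integer side after the low product limb has been
moved over (the getf.sig step).

```c
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* assumption: GCC/Clang extension */

static uint64_t xma_l(uint64_t a, uint64_t b, uint64_t c)
{
    return (uint64_t)((u128)a * b + c);
}

static uint64_t xma_hu(uint64_t a, uint64_t b, uint64_t c)
{
    return (uint64_t)(((u128)a * b + c) >> 64);
}

/* Hypothetical reference routine: rp[] += up[] * v, returns carry-out limb. */
static uint64_t ref_addmul_1(uint64_t *rp, const uint64_t *up, size_t n,
                             uint64_t v)
{
    uint64_t hi = 0;   /* high product limb, carried through xma */
    uint64_t cy = 0;   /* carry of the integer additions */
    for (size_t i = 0; i < n; i++) {
        uint64_t lo = xma_l(up[i], v, hi);
        hi = xma_hu(up[i], v, hi);
        uint64_t s = rp[i] + lo;     /* integer add of the old rp limb */
        uint64_t c1 = s < lo;        /* carry out of that add */
        uint64_t r = s + cy;         /* add the incoming carry */
        uint64_t c2 = r < s;         /* carry out of the carry add */
        rp[i] = r;
        cy = c1 | c2;                /* at most one of the two can be set */
    }
    return hi + cy;                  /* cannot overflow for valid inputs */
}
```

The 1.583 figure is presumably the same count with the up[] load shared
between two limbs by ldfp8, i.e. 9.5 instructions per limb at 6 per cycle.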

================================================================
mpn_submul_1:

The current code runs at 2.25 c/l on Itanium 2.  Getting to 2 c/l requires
ldfp8 with all the alignment headaches that implies.

================================================================
mpn_addmul_N

For best speed, we need to give up using mpn_addmul_1 as the main multiply
building block, and instead take multiple v limbs per loop.  For the Itanium
1, we need to take about 8 limbs at a time for full speed.  For the Itanium
2, something like mpn_addmul_4 should be enough.

The add+cmp+cmp+add we use in the other routines is optimal for shortening
recurrences (1 cycle) but the sequence takes up 4 execution slots.  When
recurrence depth is not critical, a more standard 3-cycle add+cmp+add is
better.
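
The README does not spell out the two carry sequences, so the following C
sketch is only one plausible reading of them; add_select and add_chain are
names invented for this example.  In the first variant both possible sums
and both possible carry-outs are computed before the incoming carry is
known, so the carry only has to select between them (short recurrence, four
operations).  In the second the operations depend on each other in
sequence.

```c
#include <stdint.h>

/* Short-recurrence variant: two adds and two compares that do not depend
   on the incoming carry, followed by a selection. */
static uint64_t add_select(uint64_t a, uint64_t b, unsigned cy_in,
                           unsigned *cy_out)
{
    uint64_t s0 = a + b;        /* sum if the carry-in is 0 */
    uint64_t s1 = a + b + 1;    /* sum if the carry-in is 1 */
    unsigned c0 = s0 < a;       /* carry-out if the carry-in is 0 */
    unsigned c1 = s1 <= a;      /* carry-out if the carry-in is 1 */
    *cy_out = cy_in ? c1 : c0;
    return cy_in ? s1 : s0;
}

/* Plain chained variant: add, compare, then add the carry.  The final
   compare is needed here to give a complete carry-out. */
static uint64_t add_chain(uint64_t a, uint64_t b, unsigned cy_in,
                          unsigned *cy_out)
{
    uint64_t s = a + b;
    unsigned c = s < a;
    uint64_t r = s + cy_in;
    *cy_out = c | (r < s);
    return r;
}
```

Both functions return the same sum and carry for all inputs; they differ
only in how much work sits on the path from cy_in to cy_out.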

/* First load the 8 values from v */
        ldfp8           v0, v1 = [r35], 16;;
        ldfp8           v2, v3 = [r35], 16;;
        ldfp8           v4, v5 = [r35], 16;;
        ldfp8           v6, v7 = [r35], 16;;

/* In the inner loop, get a new U limb and store a result limb.
   rl0 holds the old result limb (r0 itself is hardwired to zero). */
        mov             lc = un
Loop:   ldf8            u0 = [r33], 8
        ld8             rl0 = [r32]
        xma.l           lp0 = v0, u0, hp0
        xma.hu          hp0 = v0, u0, hp0
        xma.l           lp1 = v1, u0, hp1
        xma.hu          hp1 = v1, u0, hp1
        xma.l           lp2 = v2, u0, hp2
        xma.hu          hp2 = v2, u0, hp2
        xma.l           lp3 = v3, u0, hp3
        xma.hu          hp3 = v3, u0, hp3
        xma.l           lp4 = v4, u0, hp4
        xma.hu          hp4 = v4, u0, hp4
        xma.l           lp5 = v5, u0, hp5
        xma.hu          hp5 = v5, u0, hp5
        xma.l           lp6 = v6, u0, hp6
        xma.hu          hp6 = v6, u0, hp6
        xma.l           lp7 = v7, u0, hp7
        xma.hu          hp7 = v7, u0, hp7
        getf.sig        l0 = lp0
        getf.sig        l1 = lp1
        getf.sig        l2 = lp2
        getf.sig        l3 = lp3
        getf.sig        l4 = lp4
        getf.sig        l5 = lp5
        getf.sig        l6 = lp6
        add+cmp+add     xx, l0, rl0
        add+cmp+add     acc0, acc1, l1
        add+cmp+add     acc1, acc2, l2
        add+cmp+add     acc2, acc3, l3
        add+cmp+add     acc3, acc4, l4
        add+cmp+add     acc4, acc5, l5
        add+cmp+add     acc5, acc6, l6
        getf.sig        acc6 = lp7
        st8             [r32] = xx, 8
        br.cloop Loop

        49 insn at max 6 insn/cycle:            8.167 cycles/limb8
        11 memops at max 2 memops/cycle:        5.5 cycles/limb8
        16 fpops at max 2 fpops/cycle:          8 cycles/limb8
        21 intops at max 4 intops/cycle:        5.25 cycles/limb8
        11+21 memops+intops at max 4/cycle:     8 cycles/limb8
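
The figures in the table above are each an instruction count divided by the
number of such instructions that can issue per cycle; the loop can run no
faster than the largest of them.  A small C sketch of that arithmetic
follows, using the counts from the table (struct resource,
addmul8_resources, cycles_needed and loop_bound are names made up for this
example).

```c
#include <stddef.h>

/* One row of the table: how many instructions of a kind the loop body
   contains, and how many of that kind can issue per cycle. */
struct resource {
    const char *name;
    int count;
    int per_cycle;
};

/* Counts taken from the table for the 8-limb loop. */
static const struct resource addmul8_resources[] = {
    { "insn",          49, 6 },
    { "memops",        11, 2 },
    { "fpops",         16, 2 },
    { "intops",        21, 4 },
    { "memops+intops", 32, 4 },
};

/* Cycles needed per loop iteration if only this resource were limiting. */
static double cycles_needed(struct resource r)
{
    return (double)r.count / r.per_cycle;
}

/* Lower bound for one iteration: the most constrained resource wins. */
static double loop_bound(const struct resource *r, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double c = cycles_needed(r[i]);
        if (c > worst)
            worst = c;
    }
    return worst;
}
```

With these counts the total issue width is the limiting row, at 49/6 cycles
per iteration, narrowly ahead of the fpops and memops+intops rows at 8.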

================================================================
mpn_lshift, mpn_rshift

The current code runs at 1 c/l on Itanium 2.

Using 63 separate loops, we could use the double-word shrp instruction.
That instruction has a plain single-cycle latency.  We need 63 loops since
this instruction only accepts an immediate count.  That would lead to a
somewhat silly code size, but the speed would be 0.75 c/l on Itanium 2 (by
using shrp each cycle plus shl/shr going down I1 for a further limb every
second cycle).
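
For illustration, here is how a left shift maps onto a double-word right
shift, as a C sketch.  shrp() below models the instruction as the low 64
bits of the 128-bit value hi:lo shifted right by count; ref_lshift is a
hypothetical reference routine, not GMP's code.  In C the count can be a
variable, so the need for one loop per count does not show up here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of shrp: low 64 bits of (hi:lo) >> count, for 0 < count < 64. */
static uint64_t shrp(uint64_t hi, uint64_t lo, unsigned count)
{
    return (hi << (64 - count)) | (lo >> count);
}

/* Hypothetical reference routine: shift {up,n} left by cnt bits
   (1 <= cnt <= 63), store the result at rp and return the bits shifted
   out of the top limb. */
static uint64_t ref_lshift(uint64_t *rp, const uint64_t *up, size_t n,
                           unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);
    uint64_t out = up[n - 1] >> (64 - cnt);
    /* Go from the high end down so that rp == up also works. */
    for (size_t i = n - 1; i > 0; i--)
        rp[i] = shrp(up[i], up[i - 1], 64 - cnt);   /* one shrp per limb */
    rp[0] = up[0] << cnt;
    return out;
}
```

Each result limb above the lowest takes a single shrp on two adjacent
source limbs, with the count 64-cnt fixed for the whole loop.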

================================================================
mpn_copyi, mpn_copyd

The current code runs at 0.5 c/l on Itanium 2, but only for L1 cache hits.
The 4-way unrolled loop takes just 2 cycles, and thus load-use scheduling
isn't great.  It might be best to actually use modulo scheduled loops, since
that will allow us to do better load-use scheduling without too much
unrolling.

Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium
2, according to tune/speed.  Cache bank conflicts?



REFERENCES

Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3,
Intel documents 245317-004, 245318-004, 245319-004, October 2002.  Volume 1
includes an Itanium optimization guide.

Intel Itanium Processor-specific Application Binary Interface (ABI), Intel
document 245370-003, May 2001.  Describes C type sizes, dynamic linking,
etc.

Intel Itanium Architecture Assembly Language Reference Guide, Intel document
248801-004, 2000-2002.  Describes assembly instruction syntax and other
directives.

Itanium Software Conventions and Runtime Architecture Guide, Intel document
245358-003, May 2001.  Describes calling conventions, including stack
unwinding requirements.

Intel Itanium Processor Reference Manual for Software Optimization, Intel
document 245473-003, November 2001.

Intel Itanium-2 Processor Reference Manual for Software Development and
Optimization, Intel document 251110-003, May 2004.

All the above documents can be found online at

    http://developer.intel.com/design/itanium/manuals.htm