Copyright 2000, 2001, 2002, 2003, 2004, 2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.
This directory contains mpn functions for the IA-64 architecture.

	mpn/ia64          itanium-2, and generic ia64

The code here has been optimized primarily for Itanium 2.  Very few Itanium 1
chips were ever sold, and Itanium 2 is more powerful, so the latter is what
The IA-64 ISA packs instructions three at a time into 128-bit bundles.
Programmers/compilers need to put explicit breaks `;;' when there are WAW or
RAW dependencies, with some notable exceptions.  Such "breaks" are typically
at the end of a bundle, but can be put between operations within some bundle
The Itanium 1 and Itanium 2 implementations can under ideal conditions
execute two bundles per cycle.  The Itanium 1 allows 4 of these instructions
to do integer operations, while the Itanium 2 allows all 6 to be integer
operations.
Taken cloop branches seem to insert a bubble into the pipeline most of the

Loads to the fp registers bypass the L1 cache and thus get extremely long
latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2.

Software pipelining with the br.ctop instruction causes delays, since
many issue slots are taken up by instructions with zero predicates, and
since many extra instructions are needed to set things up.  These features
are clearly designed for code density, not speed.
Misc pipeline limitations (Itanium 1):
* The getf.sig instruction can only execute in M0.
* At most four integer instructions/cycle.
* Nops take up resources like any plain instructions.

Misc pipeline limitations (Itanium 2):
* The getf.sig instruction can only execute in M0.
* Nops take up resources like any plain instructions.
.align pads with nops in a text segment, but gas 2.14 and earlier
incorrectly byte-swaps its nop bundle in big endian mode (e.g. hpux), making
it come out as break instructions.  We use the ALIGN() macro in
mpn/ia64/ia64-defs.m4 wherever the padding might be executed across.  That
macro suppresses any .align if the problem is detected by configure.  Lack of
alignment might hurt performance but will at least be correct.
foo:: to create a global symbol is not accepted by gas.  Use separate
".global foo" and "foo:" instead.

.global is the standard global directive.  gas accepts .globl, but hpux "as"
doesn't.

.proc / .endp generates the appropriate .type and .size information for ELF,
so the latter directives don't need to be given explicitly.

.pred.rel "mutex"... is standard for annotating predicate register
relationships.  gas also accepts .pred.rel.mutex, but hpux "as" doesn't.

.pred directives can't be put on a line with a label, like
".Lfoo: .pred ..."; the HP assembler on HP-UX 11.23 rejects that.
gas is happy with it, and past versions of the HP assembler seemed OK.

// is the standard comment sequence, but we prefer "C" since it inhibits m4
macro expansion.  See comments in ia64-defs.m4.
   r1: global pointer (gp)
   r12: stack pointer (sp)
   r13: thread pointer (tp)
Caller-saves: r8-r11 r14-r31 f6-f15 f32-f127
Caller-saves but rotating: r32-
================================================================
mpn_add_n, mpn_sub_n:

The current code runs at 1.25 c/l on Itanium 2.

================================================================

The current code runs at 2 c/l on Itanium 2.

Using a blocked approach, working off of 4 separate places in the operands,
one could make use of the xma accumulation, and approach 1 c/l.
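
As an illustration of the xma accumulation mentioned above, here is a minimal
C sketch of what the xma.l / xma.hu pair computes (low and high 64 bits of
a*b+c, unsigned).  The names xma_l, xma_hu and ref_mul_1 are hypothetical and
exist only for this sketch, which assumes a compiler with unsigned __int128
(gcc, clang).  It is a reference model, not the scheduled loop.

```c
#include <stdint.h>

/* Low 64 bits of a*b + c, modelling xma.l on unsigned operands. */
static uint64_t xma_l(uint64_t a, uint64_t b, uint64_t c)
{
    unsigned __int128 t = (unsigned __int128)a * b + c;  /* cannot overflow 128 bits */
    return (uint64_t)t;
}

/* High 64 bits of a*b + c, modelling xma.hu. */
static uint64_t xma_hu(uint64_t a, uint64_t b, uint64_t c)
{
    unsigned __int128 t = (unsigned __int128)a * b + c;
    return (uint64_t)(t >> 64);
}

/* Hypothetical reference: rp[0..n-1] = up[0..n-1] * v, returning the carry
   limb.  The carry out of limb i is fed in as the addend for limb i+1. */
static uint64_t ref_mul_1(uint64_t *rp, const uint64_t *up, long n, uint64_t v)
{
    uint64_t carry = 0;
    for (long i = 0; i < n; i++) {
        uint64_t lo = xma_l(up[i], v, carry);
        carry = xma_hu(up[i], v, carry);   /* next limb depends on this result */
        rp[i] = lo;
    }
    return carry;
}
```

In this single chain each step waits for the previous xma.hu result, which is
presumably why the blocked approach starts from several separate places in the
operands, so that independent chains can overlap.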
================================================================

The current code runs at 2 c/l on Itanium 2.

It seems possible to use a blocked approach, as with mpn_mul_1.  We should
read rp[] to integer registers, allowing for just one getf.sig per cycle.

These 10 instructions can be scheduled to approach 1.667 cycles, and with
the 4 cycle latency of xma, this means we need at least 3 blocks.  Using
ldfp8 we could approach 1.583 c/l.
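
The arithmetic behind these figures can be checked with a small C sketch.
The helper names issue_bound and blocks_needed are hypothetical.  The 6
instructions/cycle issue width comes from the two-bundles-per-cycle limit
described earlier; the figure of 9.5 instructions per limb for the ldfp8
variant is an assumption (one paired load replacing two single loads), chosen
because it reproduces the 1.583 quoted above.

```c
/* Issue-limited cycles per limb: instructions per limb over issue width. */
static double issue_bound(double insns_per_limb, double issue_width)
{
    return insns_per_limb / issue_width;
}

/* Number of independent blocks needed so that an operation with the given
   latency has its result ready by the time its block comes around again.
   This is ceil(latency / cycles_per_limb), computed without libm. */
static int blocks_needed(double latency_cycles, double cycles_per_limb)
{
    double q = latency_cycles / cycles_per_limb;
    int n = (int)q;
    if ((double)n < q)
        n++;
    return n;
}
```

With 10 instructions and 6 issue slots, issue_bound gives about 1.667 cycles,
and blocks_needed(4, 1.667) rounds 2.4 up to 3 blocks.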
================================================================

The current code runs at 2.25 c/l on Itanium 2.  Getting to 2 c/l requires
ldfp8 with all the alignment headaches that implies.

================================================================
For best speed, we need to give up using mpn_addmul_1 as the main multiply
building block, and instead take multiple v limbs per loop.  For the Itanium
1, we need to take about 8 limbs at a time for full speed.  For the Itanium
2, something like mpn_addmul_4 should be enough.

The add+cmp+cmp+add sequence we use in the other code is optimal for
shortening recurrences (1 cycle), but it takes up 4 execution slots.  When
recurrence depth is not critical, a more standard 3-cycle add+cmp+add is
  /* First load the 8 values from v */
	ldfp8		v0, v1 = [r35], 16;;
	ldfp8		v2, v3 = [r35], 16;;
	ldfp8		v4, v5 = [r35], 16;;
	ldfp8		v6, v7 = [r35], 16;;

  /* In the inner loop, get a new U limb and store a result limb. */
Loop:	ldf8		u0 = [r33], 8
	xma.l		lp0 = v0, u0, hp0
	xma.hu		hp0 = v0, u0, hp0
	xma.l		lp1 = v1, u0, hp1
	xma.hu		hp1 = v1, u0, hp1
	xma.l		lp2 = v2, u0, hp2
	xma.hu		hp2 = v2, u0, hp2
	xma.l		lp3 = v3, u0, hp3
	xma.hu		hp3 = v3, u0, hp3
	xma.l		lp4 = v4, u0, hp4
	xma.hu		hp4 = v4, u0, hp4
	xma.l		lp5 = v5, u0, hp5
	xma.hu		hp5 = v5, u0, hp5
	xma.l		lp6 = v6, u0, hp6
	xma.hu		hp6 = v6, u0, hp6
	xma.l		lp7 = v7, u0, hp7
	xma.hu		hp7 = v7, u0, hp7

	add+cmp+add	xx, l0, r0
	add+cmp+add	acc0, acc1, l1
	add+cmp+add	acc1, acc2, l2
	add+cmp+add	acc2, acc3, l3
	add+cmp+add	acc3, acc4, l4
	add+cmp+add	acc4, acc5, l5
	add+cmp+add	acc5, acc6, l6
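
The two carry-handling sequences named above can be sketched in C as follows.
This is one reading of the sequences, not taken from the assembly sources:
the function names are hypothetical, and the choice of the two compares in
the 4-operation form (an unsigned less-than and an all-ones test) is an
assumption.

```c
#include <stdint.h>

/* 3-operation form: add, compare for carry-out, add the carry into the
   next more significant word.  The carry passes through all three steps. */
static void add_cmp_add(uint64_t *lo, uint64_t *hi, uint64_t x)
{
    uint64_t s = *lo + x;        /* add */
    uint64_t cy = s < x;         /* cmp: wraparound means a carry-out */
    *hi += cy;                   /* add: propagate the carry */
    *lo = s;
}

/* 4-operation form: both compares depend only on a and b, so they can be
   evaluated before the incoming carry is known.  Only the final combine
   and the final add wait for cin. */
static uint64_t add_cmp_cmp_add(uint64_t *sum, uint64_t a, uint64_t b,
                                uint64_t cin)
{
    uint64_t s = a + b;                    /* add */
    uint64_t gen = s < a;                  /* cmp: carry-out regardless of cin */
    uint64_t prop = (s == UINT64_MAX);     /* cmp: carry-out only if cin is set */
    *sum = s + cin;                        /* add */
    return gen | (prop & cin);             /* carry-out for the next limb */
}
```

The second form spends an extra execution slot on the second compare, in
exchange for a carry chain that is only one cheap operation long per limb.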
	49 insn             at max 6 insn/cycle:    8.167 cycles/limb8
	11 memops           at max 2 memops/cycle:  5.5   cycles/limb8
	16 fpops            at max 2 fpops/cycle:   8     cycles/limb8
	21 intops           at max 4 intops/cycle:  5.25  cycles/limb8
	11+21 memops+intops at max 4/cycle:         8     cycles/limb8
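
The table above is a resource-bound estimate: each row divides an operation
count by a per-cycle limit, and the loop can run no faster than the largest
quotient.  A minimal C sketch of that calculation follows.  The struct and
function names are hypothetical, and the counts are simply copied from the
table.

```c
/* One row of the table: operations per loop iteration and per-cycle limit. */
struct resource {
    const char *name;
    int ops;
    int per_cycle;
};

/* Cycles per iteration if this resource were the only constraint. */
static double resource_bound(struct resource r)
{
    return (double)r.ops / r.per_cycle;
}

/* The loop is limited by its most heavily loaded resource. */
static double loop_bound(const struct resource *rs, int n)
{
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        double b = resource_bound(rs[i]);
        if (b > worst)
            worst = b;
    }
    return worst;
}

static const struct resource mul8_table[] = {
    { "insn",          49, 6 },
    { "memops",        11, 2 },
    { "fpops",         16, 2 },
    { "intops",        21, 4 },
    { "memops+intops", 32, 4 },
};
```

For these counts loop_bound(mul8_table, 5) is set by total instruction issue,
at about 8.167 cycles per 8 limbs, or roughly 1.02 cycles per limb product.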
================================================================
mpn_lshift, mpn_rshift

The current code runs at 1 cycle/limb on Itanium 2.

Using 63 separate loops, we could use the double-word shrp instruction.
That instruction has a plain single-cycle latency.  We need 63 loops since
this instruction accepts only an immediate count.  That would lead to a
somewhat silly code size, but the speed would be 0.75 c/l on Itanium 2 (by
using shrp each cycle plus shl/shr going down I1 for a further limb every
second cycle).
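
For reference, the double-word shift that shrp performs can be modelled in C
as below.  The names shrp and ref_rshift are hypothetical helpers for this
sketch.  The count is restricted to 1..63, which is also why 63 loop variants
would be needed when the count must be an immediate.

```c
#include <stdint.h>

/* Models shrp: the low 64 bits of the 128-bit value hi:lo shifted right
   by count, for 1 <= count <= 63. */
static uint64_t shrp(uint64_t hi, uint64_t lo, unsigned count)
{
    return (lo >> count) | (hi << (64 - count));
}

/* Hypothetical reference right shift of an n-limb number (n >= 1).
   Returns the bits shifted out, placed in the high end of a limb. */
static uint64_t ref_rshift(uint64_t *rp, const uint64_t *up, long n,
                           unsigned count)
{
    uint64_t out = up[0] << (64 - count);
    for (long i = 0; i < n - 1; i++)
        rp[i] = shrp(up[i + 1], up[i], count);   /* one shrp per result limb */
    rp[n - 1] = up[n - 1] >> count;
    return out;
}
```

Each result limb needs a single funnel shift here, instead of the separate
shl, shr and or that a plain two-shift formulation would use.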
================================================================

The current code runs at 0.5 c/l on Itanium 2.  But that is just for L1
cache hits.  The 4-way unrolled loop takes just 2 cycles, and thus load-use
scheduling isn't great.  It might be best to actually use modulo scheduled
loops, since that will allow us to do better load-use scheduling without too

Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium
2, according to tune/speed.  Cache bank conflicts?
Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3,
Intel document 245317-004, 245318-004, 245319-004 October 2002.  Volume 1
includes an Itanium optimization guide.

Intel Itanium Processor-specific Application Binary Interface (ABI), Intel
document 245370-003, May 2001.  Describes C type sizes, dynamic linking,

Intel Itanium Architecture Assembly Language Reference Guide, Intel document
248801-004, 2000-2002.  Describes assembly instruction syntax and other

Itanium Software Conventions and Runtime Architecture Guide, Intel document
245358-003, May 2001.  Describes calling conventions, including stack
unwinding requirements.

Intel Itanium Processor Reference Manual for Software Optimization, Intel
document 245473-003, November 2001.

Intel Itanium-2 Processor Reference Manual for Software Development and
Optimization, Intel document 251110-003, May 2004.

All the above documents can be found online at

    http://developer.intel.com/design/itanium/manuals.htm