mpn/x86/README

Copyright 1999, 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.





                      X86 MPN SUBROUTINES


This directory contains mpn functions for various 80x86 chips.


CODE ORGANIZATION

        x86               i386, generic
        x86/i486          i486
        x86/pentium       Intel Pentium (P5, P54)
        x86/pentium/mmx   Intel Pentium with MMX (P55)
        x86/p6            Intel Pentium Pro
        x86/p6/mmx        Intel Pentium II, III
        x86/p6/p3mmx      Intel Pentium III
        x86/k6            \ AMD K6
        x86/k6/mmx        /
        x86/k6/k62mmx     AMD K6-2
        x86/k7            \ AMD Athlon
        x86/k7/mmx        /
        x86/pentium4      \
        x86/pentium4/mmx  | Intel Pentium 4
        x86/pentium4/sse2 /


The top-level x86 directory contains blended style code, meant to be
reasonable on all x86s.



STATUS

The code is well-optimized for AMD and Intel chips, but there's nothing
specific for Cyrix chips, nor for actual 80386 and 80486 chips.



ASM FILES

The x86 .asm files are BSD style assembler code, first put through m4 for
macro processing.  The generic mpn/asm-defs.m4 is used, together with
mpn/x86/x86-defs.m4.  See comments in those files.

The code is meant for use with GNU "gas" or a system "as".  There's no
support for assemblers that demand Intel style code.



STACK FRAME

m4 macros are used to define the parameters passed on the stack, and these
act like comments on what the stack frame looks like too.  For example,
mpn_mul_1() has the following.

        defframe(PARAM_MULTIPLIER, 16)
        defframe(PARAM_SIZE,       12)
        defframe(PARAM_SRC,         8)
        defframe(PARAM_DST,         4)

PARAM_MULTIPLIER becomes `FRAME+16(%esp)', and the others similarly.  The
return address is at offset 0, but there's not normally any need to access
that.

FRAME is redefined as necessary through the code so it's the number of bytes
pushed on the stack, and hence the offsets in the parameter macros stay
correct.  At the start of a routine FRAME should be zero.

        deflit(`FRAME',0)
        ...
        deflit(`FRAME',4)
        ...
        deflit(`FRAME',8)
        ...

Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
and can be used instead of explicit definitions if preferred.
defframe_pushl() is a combination of FRAME_pushl() and defframe().

There's generally some slackness in redefining FRAME.  If new values aren't
going to get used then the redefinitions are omitted to keep from cluttering
up the code.  This happens for instance at the end of a routine, where there
might be just four pops and then a ret, so FRAME isn't getting used.

Local variables and saved registers can be similarly defined, with negative
offsets representing stack space below the initial stack pointer.  For
example,

        defframe(SAVE_ESI,   -4)
        defframe(SAVE_EDI,   -8)
        defframe(VAR_COUNTER,-12)

        deflit(STACK_SPACE, 12)

Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
space, and that instruction must be followed by a redefinition of FRAME
(setting it equal to STACK_SPACE) to reflect the change in %esp.
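
The FRAME bookkeeping can be checked with a small model.  The following is
only an illustrative Python sketch, not part of GMP: the Frame class and its
method names are hypothetical stand-ins for defframe() and the FRAME_*
helpers, and operand() prints the already-evaluated sum where the real
macros expand to an expression like `FRAME+16(%esp)'.

```python
# Toy model of the defframe()/FRAME bookkeeping.  The class and method
# names are hypothetical; they only mimic the effect of the m4 macros.

class Frame:
    def __init__(self):
        self.frame = 0      # bytes pushed since function entry (FRAME)
        self.slots = {}     # name -> offset relative to the entry %esp

    def defframe(self, name, offset):
        self.slots[name] = offset

    def pushl(self):        # models the effect of FRAME_pushl()
        self.frame += 4

    def popl(self):         # models the effect of FRAME_popl()
        self.frame -= 4

    def subl_esp(self, n):  # models the effect of FRAME_subl_esp(n)
        self.frame += n

    def addl_esp(self, n):  # models the effect of FRAME_addl_esp(n)
        self.frame -= n

    def operand(self, name):
        """Operand a slot resolves to at the current value of FRAME."""
        return f"{self.frame + self.slots[name]}(%esp)"


f = Frame()
f.defframe("PARAM_DST", 4)
f.defframe("SAVE_ESI", -4)
f.defframe("VAR_COUNTER", -12)

print(f.operand("PARAM_DST"))    # 4(%esp) at entry, FRAME is 0
f.subl_esp(12)                   # subl $STACK_SPACE, %esp
print(f.operand("PARAM_DST"))    # 16(%esp)
print(f.operand("SAVE_ESI"))     # 8(%esp)
print(f.operand("VAR_COUNTER"))  # 0(%esp)
```

The negative offsets only become valid operands once FRAME has been raised
to STACK_SPACE, which is why the redefinition after the subl matters.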

Definitions for pushed registers are only put in when they're going to be
used.  If registers are just saved and restored with pushes and pops then
definitions aren't made.



ASSEMBLER EXPRESSIONS

Only addition and subtraction seem to be universally available, certainly
that's all the Solaris 8 "as" seems to accept.  If expressions are wanted
then m4 eval() should be used.

In particular note that a "/" anywhere in a line starts a comment in Solaris
"as", and in some configurations of gas too.

        addl    $32/2, %eax           <-- wrong

        addl    $eval(32/2), %eax     <-- right

Binutils gas/config/tc-i386.c has a choice between "/" being a comment
anywhere in a line, or only at the start.  FreeBSD patches 2.9.1 to select
the latter, and from 2.9.5 it's the default for GNU/Linux too.



ASSEMBLER COMMENTS

Solaris "as" doesn't support "#" commenting, using /* */ instead.  For that
reason "C" commenting is used (see asm-defs.m4) and the intermediate ".s"
files have no comments.

Any comments before include(`../config.m4') must use m4 "dnl", since it's
only after the include that "C" is available.  By convention "dnl" is also
used for comments about m4 macros.



TEMPORARY LABELS

Temporary numbered labels like "1:" used as "1f" or "1b" are available in
"gas" and Solaris "as", but not in SCO "as".  Normal L() labels should be
used instead, possibly with a counter to make them unique, see jadcl0() in
x86-defs.m4 for instance.  A separate counter for each macro makes it
possible to nest them, for instance movl_text_address() can be used within
an ASSERT().

"1:" etc must be avoided in gcc __asm__ blocks too.  "%=" for generating a
unique number looks like a good alternative, but is that actually a
documented feature?  In any case this problem doesn't currently arise.
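
The per-macro counter idea can be sketched in a few lines of Python.  This
is only a model of the scheme, not the m4 in x86-defs.m4; unique_label() and
the "name_number" label format are assumptions made for the illustration.

```python
from collections import defaultdict
from itertools import count

# One independent counter per macro name, so expansions of different
# macros can nest without their labels colliding.
_counters = defaultdict(count)

def unique_label(macro):
    """Return a fresh label for one expansion of the given macro."""
    return f"L({macro}_{next(_counters[macro])})"

print(unique_label("jadcl0"))             # L(jadcl0_0)
print(unique_label("movl_text_address"))  # L(movl_text_address_0)
print(unique_label("jadcl0"))             # L(jadcl0_1)
```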



ZERO DISPLACEMENTS

In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
displacement are wanted, rather than (%ebx) with no displacement.  These are
either for computed jumps or to get desirable code alignment.  Explicit
.byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
(%ebx).  The Zdisp() macro in x86-defs.m4 is used for this.

Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
1.92.3 changes it.  In general changing would be the sort of "optimization"
an assembler might perform, hence explicit ".byte"s are used where
necessary.
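
To make the size difference concrete, the following Python sketch encodes a
32-bit load by hand, with and without the explicit zero displacement.  It is
an illustration only and not how Zdisp() is implemented; movl_load() is a
hypothetical helper, and %esp and %ebp are left out as base registers
because they need a SIB byte or a displacement respectively.

```python
# Register numbers as used in the mod/rm byte.  %esp and %ebp are omitted
# on purpose: as a base they have special encodings not handled here.
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3, "esi": 6, "edi": 7}

def movl_load(base, dst, force_disp8=False):
    """Encode "movl (base), dst", optionally as "movl 0(base), dst"."""
    mod = 0b01 if force_disp8 else 0b00      # 01 selects a disp8 form
    modrm = (mod << 6) | (REG[dst] << 3) | REG[base]
    code = bytes([0x8B, modrm])              # 8B /r is MOV r32, r/m32
    if force_disp8:
        code += b"\x00"                      # the explicit zero byte
    return code

print(movl_load("ebx", "eax").hex())                    # 8b03
print(movl_load("ebx", "eax", force_disp8=True).hex())  # 8b4300
```

The second form is one byte longer, which is what matters when each limb's
code in an unrolled loop must have a fixed size for a computed jump.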



SHLD/SHRD INSTRUCTIONS

The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
must be written "shldl %eax,%ebx" for some assemblers.  gas takes either,
Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is
gas), and omits %cl elsewhere.

For GMP an autoconf test GMP_ASM_X86_SHLDL_CL is used to determine whether
%cl should be used, and the macros shldl, shrdl, shldw and shrdw in
mpn/x86/x86-defs.m4 pass through or omit %cl as necessary.  See the comments
with those macros for usage.
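
The pass-through-or-omit behaviour amounts to the following, shown here as a
Python sketch rather than the actual m4.  double_shift() is a hypothetical
function, and its want_cl argument stands in for the result of the
GMP_ASM_X86_SHLDL_CL configure test.

```python
def double_shift(insn, count, src, dst, want_cl):
    """Format a %cl-count double shift for the assembler in use."""
    if count != "%cl":
        raise ValueError("only the %cl count form is modelled here")
    operands = [count, src, dst] if want_cl else [src, dst]
    return f"{insn} " + ", ".join(operands)

print(double_shift("shldl", "%cl", "%eax", "%ebx", want_cl=True))
print(double_shift("shldl", "%cl", "%eax", "%ebx", want_cl=False))
```

The first call yields "shldl %cl, %eax, %ebx" and the second
"shldl %eax, %ebx", so source files can always be written with %cl present.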



IMUL INSTRUCTION

GCC config/i386/i386.md (cvs rev 1.187, 21 Oct 00) under *mulsi3_1 notes
that the following two forms produce identical object code

        imul    $12, %eax
        imul    $12, %eax, %eax

but that the former isn't accepted by some assemblers, in particular the SCO
OSR5 COFF assembler.  GMP follows GCC and uses only the latter form.

(This applies only to immediate operands, the three operand form is only
valid with an immediate.)



DIRECTION FLAG

The x86 calling conventions say that the direction flag should be clear at
function entry and exit.  (See iBCS2 and SVR4 ABI books, references below.)
Although this has been so since the year dot, it's not absolutely clear
whether it's universally respected.  Since it's better to be safe than
sorry, GMP follows glibc and does a "cld" if it depends on the direction
flag being clear.  This happens only in a few places.



POSITION INDEPENDENT CODE

  Coding Style

    Defining the symbol PIC in m4 processing selects SVR4 / ELF style
    position independent code.  This is necessary for shared libraries
    because they can be mapped into different processes at different virtual
    addresses.  Actually, relocations are allowed but text pages with
    relocations aren't shared, defeating the purpose of a shared library.

    The GOT is used to access global data, and the PLT is used for
    functions.  The use of the PLT adds a fixed cost to every function call,
    and the GOT adds a cost to any function accessing global variables.
    These are small but might be noticeable when working with small
    operands.

  Scope

    It's intended, as a matter of policy, that references within libgmp are
    resolved within libgmp.  Certainly there's no need for an application to
    replace any internals, and we take the view that there's no value in an
    application subverting anything documented either.

    Resolving references within libgmp in theory means calls can be made with a
    plain PC-relative call instruction, which is faster and smaller than going
    through the PLT, and data references can be similarly PC-relative, saving a
    GOT entry and fetch from there.  Unfortunately the normal linker behaviour
    doesn't allow us to do this.

    By default an R_386_PC32 PC-relative reference, either for a call or for
    data, is left in libgmp.so by the linker so that it can be resolved at
    runtime to a location in the application or another shared library.  This
    means a text segment relocation which we don't want.

  -Bsymbolic

    Under the "-Bsymbolic" option, the linker resolves references to symbols
    within libgmp.so.  This gives us the desired effect for R_386_PC32,
    ie. it's resolved at link time.  It also resolves R_386_PLT32 calls
    directly to their target without creating a PLT entry (though if this is
    done to normal compiler-generated code it still leaves a setup of %ebx
    to _GLOBAL_OFFSET_TABLE_ which may then be unnecessary).

    Unfortunately -Bsymbolic does bad things to global variables defined in
    a shared library but accessed by non-PIC code from the mainline (or a
    static library).

    The problem is that the mainline needs a fixed data address to avoid
    text segment relocations, so space is allocated in its data segment and
    the value from the variable is copied from the shared library's data
    segment when the library is loaded.  Under -Bsymbolic, however,
    references in the shared library are then resolved still to the shared
    library data area.  Not surprisingly it bombs badly to have mainline
    code and library code accessing different locations for what should be
    one variable.

    Note that this -Bsymbolic effect for the shared library is not just for
    R_386_PC32 offsets which might have been cooked up in assembler, but is
    done also for the contents of GOT entries.  -Bsymbolic simply applies a
    general rule that symbols are resolved first from the local module.

  Visibility Attributes

    GCC __attribute__ ((visibility ("protected"))), which is available in
    recent versions, eg. 3.3, is probably what we'd like to use.  It makes
    gcc generate plain PC-relative calls to indicated functions, and directs
    the linker to resolve references to the given function within the link
    module.

    Unfortunately, as of debian binutils 2.13.90.0.16 at least, the
    resulting libgmp.so comes out with text segment relocations; references
    are not resolved at link time.  If the gcc description is to be believed
    this is not how it should work.  If a symbol cannot be overridden
    by another module then surely references within that module can be
    resolved immediately (ie. at link time).

  Present

    In any case, all this means that we have no optimizations we can
    usefully make to function or variable usages, neither for assembler nor
    C code.  Perhaps in the future the visibility attribute will work as
    we'd like.




GLOBAL OFFSET TABLE

The magic _GLOBAL_OFFSET_TABLE_ used by code establishing the address of the
GOT sometimes requires an extra underscore prefix.  SVR4 systems and NetBSD
don't need a prefix, OpenBSD does need one.  Note that NetBSD and OpenBSD
are both a.out underscore systems, so the prefix for _GLOBAL_OFFSET_TABLE_
is not simply the same as the prefix for ordinary globals.

In any case in the asm code we write _GLOBAL_OFFSET_TABLE_ and let a macro
in x86-defs.m4 add an extra underscore if required (according to a configure
test).

Old gas 1.92.3 which comes with FreeBSD 2.2.8 gets a segmentation fault when
asked to assemble the following,

        L1:
            addl  $_GLOBAL_OFFSET_TABLE_+[.-L1], %ebx

It seems that using the label in the same instruction it refers to is the
problem, since a nop in between works.  But the simplest workaround is to
follow gcc and omit the +[.-L1] since it does nothing,

            addl  $_GLOBAL_OFFSET_TABLE_, %ebx

Current gas 2.10 generates incorrect object code when %eax is used in such a
construction (with or without +[.-L1]),

            addl  $_GLOBAL_OFFSET_TABLE_, %eax

The R_386_GOTPC gets a displacement of 2 rather than the 1 appropriate for
the 1 byte opcode of "addl $n,%eax".  The best workaround is just to use any
other register, since then it's a two byte opcode+mod/rm.  GCC for example
always uses %ebx (which is needed for calls through the PLT).

A similar problem occurs in an leal (again with or without a +[.-L1]),

            leal  _GLOBAL_OFFSET_TABLE_(%edi), %ebx

This time the R_386_GOTPC gets a displacement of 0 rather than the 2
appropriate for the opcode and mod/rm, making this form unusable.




SIMPLE LOOPS

The overheads in setting up for an unrolled loop can mean that at small
sizes a simple loop is faster.  Making small sizes go fast is important,
even if it adds a cycle or two to bigger sizes.  To this end various
routines choose between a simple loop and an unrolled loop according to
operand size.  The path to the simple loop, or to special case code for
small sizes, is always as fast as possible.

Adding a simple loop requires a conditional jump to choose between the
simple and unrolled code.  The size of a branch misprediction penalty
affects whether a simple loop is worthwhile.

The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
UNROLL_THRESHOLD using the unrolled loop.  If position independent code adds
a couple of cycles to an unrolled loop setup, the threshold will vary with
PIC or non-PIC.  Something like the following is typical.

        deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))

There's no automated way to determine the threshold.  Setting it to a small
value and then to a big value makes it possible to measure the simple and
unrolled loops each over a range of sizes, from which the crossover point
can be determined.  Alternatively, just adjust the threshold up or down until
there are no more speedups.
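
Once both loops have been measured, picking the crossover is simple
arithmetic.  The Python sketch below shows one way to do it; crossover() is
a hypothetical helper and the cycle counts are invented for the example,
not measurements of any GMP routine.

```python
def crossover(simple, unrolled):
    """Smallest size from which the unrolled loop is never slower.

    Both arguments map operand size in limbs to a measured time.
    Returns None if the unrolled loop loses at the largest size.
    """
    threshold = None
    for size in sorted(simple, reverse=True):
        if unrolled[size] <= simple[size]:
            threshold = size
        else:
            break
    return threshold

# Invented timings: the simple loop has little setup but costs more per
# limb, the unrolled loop has more setup and a cheaper per-limb cost.
simple   = {n: 4 + 10 * n for n in range(1, 17)}
unrolled = {n: 30 + 6 * n for n in range(1, 17)}

print(crossover(simple, unrolled))   # 7
```

With these numbers the result would be used as UNROLL_THRESHOLD, so that
sizes below 7 take the simple loop.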



UNROLLED LOOP CODING

The x86 addressing modes allow a byte displacement of -128 to +127, making
it possible to access 256 bytes, which is 64 limbs, without adjusting
pointer registers within the loop.  Dword sized displacements can be used
too, but they increase code size, and unrolling to 64 ought to be enough.

When unrolling to the full 64 limbs/loop, the limb at the top of the loop
will have a displacement of -128, so pointers have to have a corresponding
+128 added before entering the loop.  When unrolling to 32 limbs/loop
displacements 0 to 127 can be used with 0 at the top of the loop and no
adjustment needed to the pointers.

Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
limbs/loop is selected.  Usually the gain in speed using 64 instead of 32 or
16 is small, so support for 64 limbs/loop is generally only for comparison.
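
The displacement ranges above follow from simple arithmetic, sketched here
in Python.  byte_displacements() is a hypothetical helper written for this
illustration; it applies the +128 pointer bias only when the loop body spans
more than 128 bytes, as described for the 64 limbs/loop case.

```python
BYTES_PER_LIMB = 4

def byte_displacements(limbs_per_loop):
    """Return (pointer_bias, displacements) for one pass of the loop."""
    span = limbs_per_loop * BYTES_PER_LIMB
    if span > 256:
        raise ValueError("does not fit in byte displacements")
    bias = 128 if span > 128 else 0       # added to pointers before the loop
    disps = [i * BYTES_PER_LIMB - bias for i in range(limbs_per_loop)]
    assert all(-128 <= d <= 127 for d in disps)
    return bias, disps

for n in (16, 32, 64):
    bias, disps = byte_displacements(n)
    print(n, bias, disps[0], disps[-1])
# 16 0 0 60
# 32 0 0 124
# 64 128 -128 124
```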



COMPUTED JUMPS

When working from least significant limb to most significant limb (most
routines) the computed jump and pointer calculations in preparation for an
unrolled loop are as follows.

        S = operand size in limbs
        N = number of limbs per loop (UNROLL_COUNT)
        L = log2 of unrolling (UNROLL_LOG2)
        M = mask for unrolling (UNROLL_MASK)
        C = code bytes per limb in the loop
        B = bytes per limb (4 for x86)

        computed jump            (-S & M) * C + entrypoint
        subtract from pointers   (-S & M) * B
        initial loop counter     (S-1) >> L
        displacements            0 to B*(N-1)

The loop counter is decremented at the end of each loop, and the looping
stops when the decrement takes the counter to -1.  The displacements are
those of the addressing modes used to access each limb, eg. a load with
"movl disp(%ebx), %eax".

Usually the multiply by "C" can be handled without an imul, using instead an
leal, or a shift and subtract.

When working from most significant to least significant limb (eg. mpn_lshift
and mpn_copyd), the calculations change as follows.

        add to pointers          (-S & M) * B
        displacements            0 to -B*(N-1)
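
The formulas for the least to most significant case can be checked with a
small simulation.  This Python sketch is a model of the arithmetic only;
unrolled_setup() and limbs_touched() are hypothetical names, byte addresses
are taken relative to the start of the operand, and S is assumed to be at
least 1.

```python
def unrolled_setup(S, L, C, B=4):
    """Values computed before entering an unrolled loop of N = 2**L limbs."""
    M = (1 << L) - 1                 # UNROLL_MASK
    skip = -S & M                    # limb slots skipped on the first pass
    return {
        "skip": skip,
        "jump_offset": skip * C,     # added to the loop entrypoint
        "pointer_adjust": skip * B,  # subtracted from the pointers
        "counter": (S - 1) >> L,     # initial loop counter
    }

def limbs_touched(S, L, B=4):
    """Simulate the loop and list the limb indices it accesses, in order."""
    setup = unrolled_setup(S, L, C=1, B=B)
    N = 1 << L
    ptr = -setup["pointer_adjust"]   # operand starts at byte address 0
    counter = setup["counter"]
    first = setup["skip"]            # the computed jump lands on this slot
    touched = []
    while counter >= 0:              # stops when the counter reaches -1
        for slot in range(first, N):
            touched.append((ptr + slot * B) // B)   # displacement slot*B
        first = 0
        ptr += N * B
        counter -= 1
    return touched

print(unrolled_setup(5, 2, C=15))
print(limbs_touched(5, 2))           # [0, 1, 2, 3, 4]
```

For S=5 and N=4 the jump skips three slots, the pointers are moved back by
12 bytes, and two passes cover exactly limbs 0 to 4.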



OLD GAS 1.92.3

This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
affect GMP code.

Firstly, an expression involving two forward references to labels comes out
as zero.  For example,

                addl    $bar-foo, %eax
        foo:
                nop
        bar:

This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
When only one forward reference is involved, it works correctly, as for
example,

        foo:
                addl    $bar-foo, %eax
                nop
        bar:

Secondly, an expression involving two labels can't be used as the
displacement for an leal.  For example,

        foo:
                nop
        bar:
                leal    bar-foo(%eax,%ebx,8), %ecx

A slightly cryptic error is given, "Unimplemented segment type 0 in
parse_operand".  When only one label is used it's ok, and the label can be a
forward reference too, as for example,

                leal    foo(%eax,%ebx,8), %ecx
                nop
        foo:

These problems only affect PIC computed jump calculations.  The workarounds
are just to do an leal without a displacement and then an addl, and to make
sure the code is placed so that there's at most one forward reference in the
addl.



REFERENCES

"Intel Architecture Software Developer's Manual", volumes 1, 2a, 2b, 3a, 3b,
2006, order numbers 253665 through 253669.  Available on-line,

        ftp://download.intel.com/design/Pentium4/manuals/25366518.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366618.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366718.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366818.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366918.pdf


"System V Application Binary Interface", Unix System Laboratories Inc, 1992,
published by Prentice Hall, ISBN 0-13-880410-9.  And the "Intel386 Processor
Supplement", AT&T, 1991, ISBN 0-13-877689-X.  These have details of calling
conventions and ELF shared library PIC coding.  Versions of both are
available on-line,

        http://www.sco.com/developer/devspecs

"Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
published by McGraw-Hill, 1991, ISBN 0-07-031219-2.  (Same as the above 386
ABI supplement.)



----------------
Local variables:
mode: text
fill-column: 76
End: