mpn/x86/pentium/README

Copyright 1996, 1999, 2000, 2001, 2003 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.





                   INTEL PENTIUM P5 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium (P5,P54)
processors.  The mmx subdirectory has additional code for Pentium with MMX
(P55).


STATUS

                                cycles/limb

        mpn_add_n/sub_n            2.375

        mpn_mul_1                 12.0
        mpn_add/submul_1          14.0

        mpn_mul_basecase          14.2 cycles/crossproduct (approx)

        mpn_sqr_basecase           8 cycles/crossproduct (approx)
                                   or 15.5 cycles/triangleproduct (approx)

        mpn_l/rshift               5.375 normal (6.0 on P54)
                                   1.875 special shift by 1 bit

        mpn_divrem_1              44.0
        mpn_mod_1                 28.0
        mpn_divexact_by3          15.0

        mpn_copyi/copyd            1.0

Pentium MMX gets the following improvements

        mpn_l/rshift               1.75

        mpn_mul_1                 12.0 normal, 7.0 for 16-bit multiplier


mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb.  Due to loop
overhead and other delays (cache refill?), they run at or near 2.5
cycles/limb.

mpn_mul_1, mpn_addmul_1 and mpn_submul_1 all run 1 cycle faster than they
should.  Intel documentation says a mul instruction is 10 cycles, but it
measures 9 and the routines using it run as if it were 9.
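
As a rough illustration of where the mul cost enters, here is a portable C
model of what mpn_mul_1 computes.  It is a sketch, not the assembler in this
directory: the name ref_mul_1 is hypothetical and 32-bit limbs are assumed.
There is one 32x32->64 multiply per limb, so the 12.0 c/l in the table
leaves roughly 3 cycles of other work around the 9 cycle mul.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model, not GMP's code: multiply the n-limb number
   at sp by v, store the low n limbs at rp, and return the carry-out limb.
   Limbs are assumed to be 32 bits, as on Pentium.  */
static uint32_t
ref_mul_1 (uint32_t *rp, const uint32_t *sp, size_t n, uint32_t v)
{
  uint32_t carry = 0;
  for (size_t i = 0; i < n; i++)
    {
      /* One 32x32->64 multiply per limb; the sum cannot overflow 64 bits.  */
      uint64_t prod = (uint64_t) sp[i] * v + carry;
      rp[i] = (uint32_t) prod;          /* low half is the result limb */
      carry = (uint32_t) (prod >> 32);  /* high half goes to the next limb */
    }
  return carry;
}
```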



P55 MMX AND X87

The cost of switching between MMX and x87 floating point on P55 is about 100
cycles (fld1/por/emms for instance).  To avoid that cost the two aren't
mixed, which currently means using MMX and not x87.

MMX offers a big speedup for lshift and rshift, and a nice speedup for
16-bit multipliers in mpn_mul_1.  If fast code using x87 is found then
perhaps the preference for MMX will be reversed.




P54 SHLDL

mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the
documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
or 5 cycles/limb asymptotically.  The P55 runs them at the expected speed.
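
For reference, the work done per limb can be modelled in portable C as
below.  This is only a sketch under assumptions: the name ref_lshift is
hypothetical, limbs are taken to be 32 bits, and the shift count is
restricted to 1..31.  Each result limb is formed from two adjacent source
limbs, which is the job a single shldl does in the assembler loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model, not GMP's code.  Shift the n-limb number at
   sp left by cnt bits (1 <= cnt <= 31, n >= 1), store n limbs at rp, and
   return the bits shifted out of the top limb.  */
static uint32_t
ref_lshift (uint32_t *rp, const uint32_t *sp, size_t n, unsigned cnt)
{
  unsigned tnc = 32 - cnt;
  uint32_t retval = sp[n - 1] >> tnc;

  /* Work from the top limb down, so the model also works with rp == sp.  */
  for (size_t i = n - 1; i > 0; i--)
    /* Two source limbs feed one result limb: a double-word shift.  */
    rp[i] = (sp[i] << cnt) | (sp[i - 1] >> tnc);

  rp[0] = sp[0] << cnt;
  return retval;
}
```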

It seems that on P54 a shldl or shrdl allows pairing in one following cycle,
but not two.  For example, back to back repetitions of the following

        shldl(  %cl, %eax, %ebx)
        xorl    %edx, %edx
        xorl    %esi, %esi

run at 5 cycles, as expected, but repetitions of the following run at 7
cycles, whereas 6 would be expected (and is achieved on P55),

        shldl(  %cl, %eax, %ebx)
        xorl    %edx, %edx
        xorl    %esi, %esi
        xorl    %edi, %edi
        xorl    %ebp, %ebp

Three xorls run at 7 cycles too, so it doesn't seem to be simply that
pairing is inhibited only in the second following cycle (or something like
that).
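
The "expected" figures above follow from a simple issue model, sketched here
in C.  The function name is hypothetical, and the model rests on two
assumptions that are merely consistent with the 5 and 6 cycle figures: a
shldl by %cl costs 4 cycles and pairs with nothing, and independent xorls
issue two per cycle.  It describes the documented behaviour only and does
not reproduce the 7 cycles measured on P54.

```c
/* Hypothetical model of the expected timing only.  Assumes shldl takes 4
   cycles unpaired and that independent xorl instructions pair, two per
   cycle.  */
static unsigned
expected_cycles (unsigned n_xorl)
{
  const unsigned shld_cycles = 4;          /* assumed, non-pairing */
  return shld_cycles + (n_xorl + 1) / 2;   /* ceil (n_xorl / 2) */
}
```

Under this model two xorls give 5 cycles, and three or four give 6, so the 7
cycles seen on P54 is one more than the model in both of the latter cases.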

Avoiding this problem would bring P54 shifts down from 6.0 c/l to 5.5 with a
pattern of shift, 2 loads, shift, 2 stores, shift, etc.  A start has been
made on something like that, but it's not yet complete.




OTHER NOTES

Prefetching Destinations

    Pentium doesn't allocate cache lines on writes, unlike most other modern
    processors.  Since the functions in the mpn class do array writes, we
    have to allocate the destination cache lines ourselves by reading a word
    from each line in the loops, to achieve the best performance.
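
    The idea can be sketched in C as a copy loop that reads one limb of each
    destination line before storing to it.  This is an illustration only:
    the function name is hypothetical, a 32-byte line and 4-byte limbs are
    assumed, and dst is assumed to start on a line boundary.  The real loops
    do the equivalent read in assembler.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_LIMBS 8   /* assumed: 32-byte cache line / 4-byte limb */

/* Hypothetical sketch, not GMP's code: copy n limbs, reading one limb of
   each destination line before storing to that line.  For simplicity dst
   is assumed to be aligned to a line boundary.  */
static void
copy_with_dst_touch (uint32_t *dst, const uint32_t *src, size_t n)
{
  for (size_t i = 0; i < n; i += LINE_LIMBS)
    {
      size_t end = (n - i < LINE_LIMBS) ? n : i + LINE_LIMBS;

      /* The volatile read stops the compiler from dropping the load.  On
         P5 it is this load which brings the destination line into L1.  */
      (void) *(volatile const uint32_t *) &dst[i];

      for (size_t j = i; j < end; j++)
        dst[j] = src[j];
    }
}
```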

Prefetching Sources

    Prefetching of sources is pointless since there are no out-of-order
    loads.  Any load instruction blocks until the line is brought to L1, so
    it may as well be the load that wants the data which blocks.

Data Cache Bank Clashes

    Pairing of memory operations requires that the two issued operations
    refer to different cache banks (ie. different addresses modulo 32
    bytes).  The simplest way to ensure this is to read/write two words from
    the same object.  If we operate on different objects, they might or
    might not be in the same cache bank.
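
    A small C helper makes the rule concrete.  It is a simplified sketch
    with hypothetical names.  The "modulo 32 bytes" part is from the text
    above, while the split into eight banks of 4 bytes each is an assumption
    of the sketch.

```c
#include <stdint.h>

/* Hypothetical helper for illustration.  Assumes eight banks of 4 bytes
   each, interleaved over 32 bytes.  */
static unsigned
cache_bank (uintptr_t addr)
{
  return (unsigned) ((addr % 32) / 4);
}

/* Nonzero if two accesses fall in the same bank and so could not pair.  */
static int
bank_clash (uintptr_t a, uintptr_t b)
{
  return cache_bank (a) == cache_bank (b);
}
```

    Two adjacent words of one object, say at offsets 0 and 4, are always in
    different banks.  Words of two separate objects clash whenever their
    addresses happen to agree in this bank index.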

PIC %eip Fetching

    A simple call $+5 and popl can be used to get %eip.  There's no need to
    balance calls and returns since P5 doesn't have any return stack branch
    prediction.

Float Multiplies

    fmul is pairable and can be issued every 2 cycles (with a 4 cycle
    latency for data ready to use).  This is a lot better than integer mull
    or imull at 9 cycles non-pairing.  Unfortunately the advantage is
    quickly eaten away by needing to throw data through memory back to the
    integer registers to adjust for fild and fist being signed, and to do
    things like propagating carry bits.
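
    The signedness adjustment can be pictured with the C sketch below.  It
    is only a conceptual model with a hypothetical name, not a rendering of
    any x87 code here: it assumes a limb is loaded as a signed 32-bit
    integer, so limbs with the top bit set come out negative and need 2^32
    added back.

```c
#include <stdint.h>

/* Hypothetical sketch: convert an unsigned 32-bit limb to double when only
   a signed 32-bit integer load is available.  */
static double
limb_to_double (uint32_t limb)
{
  /* What a signed load sees: the same bits read as two's complement.
     (The conversion is implementation-defined in older C standards, but
     wraps on all usual compilers.)  */
  int32_t as_signed = (int32_t) limb;
  double d = (double) as_signed;

  /* Limbs with the top bit set came out negative, so add 2^32.  */
  if (as_signed < 0)
    d += 4294967296.0;
  return d;
}
```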





REFERENCES

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is mostly about P5; the parts about P6 aren't relevant.  Available on-line:

        http://download.intel.com/design/PentiumII/manuals/242816.htm



----------------
Local variables:
mode: text
fill-column: 76
End: