mpn/x86/k7/README

Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.




                      AMD K7 MPN SUBROUTINES


This directory contains code optimized for the AMD Athlon CPU.

The mmx subdirectory has routines using MMX instructions.  All Athlons have
MMX; the separate directory exists only so that configure can omit it if the
assembler doesn't support MMX.



STATUS

Times for the loops, with all code and data in L1 cache.

                               cycles/limb
        mpn_add/sub_n             1.6

        mpn_copyi                 0.75 or 1.0   \ varying with data alignment
        mpn_copyd                 0.75 or 1.0   /

        mpn_divrem_1             17.0 integer part, 15.0 fractional part
        mpn_mod_1                17.0
        mpn_divexact_by3          8.0

        mpn_l/rshift              1.2

        mpn_mul_1                 3.4
        mpn_addmul/submul_1       3.9

        mpn_mul_basecase          4.42 cycles/crossproduct (approx)
        mpn_sqr_basecase          2.3 cycles/crossproduct (approx)
                                  or 4.55 cycles/triangleproduct (approx)
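
The figures above can be turned into rough cycle estimates by multiplying by
the operand size.  The following is only a sketch: the function names are
made up for illustration, loop setup overhead is ignored, and the count of
n*(n+1)/2 products for a triangle is an assumption, since the table doesn't
define the term.

```c
/* Rough cycle estimates from the STATUS table.  Hypothetical helpers,
   not part of GMP.  Setup overhead outside the loops is ignored. */

/* mpn_add_n or mpn_sub_n on n limbs, at 1.6 cycles/limb. */
double est_add_n(unsigned long n)
{
    return 1.6 * (double) n;
}

/* mpn_mul_basecase on n by m limbs: n*m crossproducts at 4.42 cycles. */
double est_mul_basecase(unsigned long n, unsigned long m)
{
    return 4.42 * (double) n * (double) m;
}

/* mpn_sqr_basecase on n limbs, counted as n*n crossproducts at 2.3. */
double est_sqr_by_cross(unsigned long n)
{
    return 2.3 * (double) n * (double) n;
}

/* The same routine counted as a triangle at 4.55 cycles/product.
   ASSUMPTION: the triangle has n*(n+1)/2 products. */
double est_sqr_by_triangle(unsigned long n)
{
    return 4.55 * ((double) n * (double) (n + 1) / 2.0);
}
```

For 20 limbs, est_sqr_by_cross gives about 920 cycles and
est_sqr_by_triangle about 955.5, so under that assumption the two
mpn_sqr_basecase figures describe roughly the same time.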

Prefetching of sources hasn't yet been tried.



NOTES

cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.

The write-allocate L1 data cache means prefetching of destinations is
unnecessary.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.

Unsigned "mul"s can be issued every 3 cycles.  This suggests 3 cycles/limb
is a limit on the speed of the multiplication routines.  The documentation
shows mul executing in IEU0 (or maybe in IEU0 and IEU1 together), so it
might be that, to get near 3 cycles, code has to be arranged so that nothing
else is issued to IEU0.  A busy IEU0 could explain why some code takes 4
cycles and other apparently equivalent code takes 5.
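
The 3 cycle figure applies per limb because a mul_1 style loop needs one
multiply for each limb.  A plain C reference shows this; it is only a sketch
assuming 32-bit limbs, the name ref_mul_1 is made up, and it is not the code
in this directory.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference for a 32-bit-limb mul_1, not GMP's code.
   dst[i] gets the low word of src[i]*v plus the incoming carry, and the
   high word becomes the carry into the next limb.  Returns the final
   carry limb. */
uint32_t ref_mul_1(uint32_t *dst, const uint32_t *src, size_t n, uint32_t v)
{
    uint32_t carry = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* One 32x32->64 multiply per limb.  The sum cannot overflow
           64 bits: (2^32-1)^2 + (2^32-1) = 2^64 - 2^32. */
        uint64_t p = (uint64_t) src[i] * v + carry;

        dst[i] = (uint32_t) p;          /* low word */
        carry = (uint32_t) (p >> 32);   /* high word */
    }
    return carry;
}
```

With one multiply in each iteration and a multiply issued at most every 3
cycles, the loop can't go below 3 cycles/limb, which is consistent with the
3.4 measured for mpn_mul_1 above.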



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines and up to 64 for some.
The K7 has a 64k L1 code cache, so quite big unrolling is allowable.

Computed jumps into the unrolling are used to handle sizes that are not a
multiple of the unrolling.  An attractive feature of this is that times
increase smoothly with operand size, but it may be that some routines should
just have simple loops to finish up, especially when PIC adds between 2 and
16 cycles to get %eip.
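
In C, a computed jump into an unrolled loop is usually written as a switch
that enters the loop body part-way through (often called Duff's device).
The sketch below uses an unrolling of 4 to stay short, and the name
copy_unrolled4 is made up; the assembler routines here compute the entry
address themselves rather than relying on a compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-way unrolled limb copy, not GMP's code.  The switch
   plays the role of the computed jump: it enters the unrolled body so
   that the first pass handles n % 4 limbs (or 4 if that is zero), and
   every later pass handles a full 4. */
void copy_unrolled4(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t passes;

    if (n == 0)
        return;

    passes = (n + 3) / 4;   /* trips through the unrolled body */

    switch (n % 4) {
    case 0: do { *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

No separate clean-up loop is needed, so the time grows by one limb's worth
for each extra limb, at the cost of working out the entry point once per
call.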

Position independent code uses a call to get %eip for the computed jumps.
The return is always done with a ret, rather than an addl $4,%esp or a popl,
so the CPU return address branch prediction stack stays synchronised with
the actual stack in memory.

Branch prediction, in the absence of any history, will guess forward jumps
are not taken and backward jumps are taken.  Where possible it's arranged
that the less likely or less important case is under a taken forward jump.



CODING

Instructions in general code have been shown grouped if they can execute
together, which means up to three direct-path instructions which have no
successive dependencies.  K7 always decodes three and has out-of-order
execution, but the groupings show what slots might be available and what
dependency chains exist.

When there are vector-path instructions, an effort is made to get triplets
of direct-path instructions in between them, even if there are dependencies,
since this maximizes decoding throughput and might save a cycle or two if
decoding is the limiting factor.



INSTRUCTIONS

adcl       direct
divl       39 cycles back-to-back
lodsl,etc  vector
loop       1 cycle vector (decl/jnz opens up one decode slot)
movd reg   vector
movd mem   direct
mull       issue every 3 cycles, latency 4 cycles low word, 6 cycles high word
popl       vector (use movl for more than one pop)
pushl      direct, will pair with a load
shrdl %cl  vector, 3 cycles, seems to be 3 decode too
xorl r,r   false read dependency recognised



REFERENCES

"AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
22007, revision K, February 2002.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
publication number 22466, revision D, March 2000.  This describes
instructions added in the Athlon processor, such as pswapd and the extra
prefetch forms.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22466.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general Athlon optimizations as well as
3DNow.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf




----------------
Local variables:
mode: text
fill-column: 76
End: