doc/projects.html

   1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
   2 <html>
   3 <head>
   4   <title>GMP Development Projects</title>
   5   <link rel="shortcut icon" href="favicon.ico">
   6   <link rel="stylesheet" href="gmp.css">
   7   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   8 </head>
   9
  10 <center>
  11   <h1>
  12     GMP Development Projects
  13   </h1>
  14 </center>
  15
  16 <font size=-1>
  17 <pre>
  18 Copyright 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2008, 2009 Free Software
  19 Foundation, Inc.
  20
  21 This file is part of the GNU MP Library.
  22
  23 The GNU MP Library is free software; you can redistribute it and/or modify
  24 it under the terms of the GNU Lesser General Public License as published
  25 by the Free Software Foundation; either version 3 of the License, or (at
  26 your option) any later version.
  27
  28 The GNU MP Library is distributed in the hope that it will be useful, but
  29 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
  30 or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
  31 License for more details.
  32
  33 You should have received a copy of the GNU Lesser General Public License
  34 along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.
  35 </pre>
  36 </font>
  37
  38 <hr>
  39 <!-- NB. timestamp updated automatically by emacs -->
  40   This file current as of 15 Nov 2009.  An up-to-date version is available at
  41   <a href="http://gmplib.org/projects.html">http://gmplib.org/projects.html</a>.
  42   Please send comments about this page to gmp-devel<font>@</font>gmplib.org.
  43
  44 <p> This file lists projects suitable for volunteers.  Please see the
  45     <a href="tasks.html">tasks file</a> for smaller tasks.
  46
  47 <p> If you want to work on any of the projects below, please let
  48     gmp-devel<font>@</font>gmplib.org know.  If you want to help with a project
  49     that already somebody else is working on, you will get in touch through
  50     gmp-devel<font>@</font>gmplib.org.  (There are no email addresses of
  51     volunteers below, due to spamming problems.)
  52
  53 <ul>
  54 <li> <strong>Faster multiplication</strong>
  55
  56   <p> The current multiplication code uses Karatsuba, 3-way and 4-way Toom, and
  57       Fermat FFT.  Several new developments are desirable:
  58
  59   <ol>
  60
  61     <li> Write more toom multiply functions for unbalanced operands.  We now have
  62          toom22, toom32, toom42, toom62, toom33, toom53, and toom44.  Most
  63          desirable is toom43, which will require a new toom_interpolate_6pts
  64          function.  Writing toom52 will then be straightforward.  See also
  65          <a href="http://bodrato.it/software/toom.html">Marco Bodrato's
  66          site</a>
  67
  68     <li> Perhaps consider N-way Toom, N > 4.  See Knuth's Seminumerical
  69          Algorithms for details on the method, as well as Bodrato's site.  Code
  70          implementing it exists.  This is asymptotically inferior to FFTs, but
  71          is finer grained.
  72
  73     <li> The mpn_mul call now (from GMP 4.3) uses toom22, toom32, and toom42
  74          for unbalanced operations.  We don't use any of the other new toom
  75          functions currently.  Write new clever code for choosing the best toom
  76          function from an m-limb and an n-limb operand.
  77
  78     <li> Implement an FFT variant computing the coefficients mod m different
  79          limb size primes of the form l*2^k+1. i.e., compute m separate FFTs.
  80          The wanted coefficients will at the end be found by lifting with CRT
  81          (Chinese Remainder Theorem).  If we let m = 3, i.e., use 3 primes, we
  82          can split the operands into coefficients at limb boundaries, and if
  83          our machine uses b-bit limbs, we can multiply numbers with close to
  84          2^b limbs without coefficient overflow.  For smaller multiplication,
  85          we might perhaps let m = 1, and instead of splitting our operands at
  86          limb boundaries, split them in much smaller pieces.  We might also use
  87          4 or more primes, and split operands into bigger than b-bit chunks.
  88          By using more primes, the gain in shorter transform length, but lose
  89          in having to do more FFTs, but that is a slight total save.  We then
  90          lose in more expensive CRT. <br><br>
  91
  92          <p> [We now have two implementations of this algorithm, one by Tommy
  93          Färnqvist and one by Niels Möller.]
  94
  95     <li> Add support for short products, either a given number of low limbs, a
  96          given number of high limbs, or perhaps the middle limbs of the result.
  97          High short product can be used by <code>mpf_mul</code>, by
  98          left-to-right Newton approximations, and for quotient approximation.
  99          Low half short product can be of use in sub-quadratic REDC and for
 100          right-to-left Newton approximations.  On small sizes a short product
 101          will be faster simply through fewer cross-products, similar to the way
 102          squaring is faster.  But work by Thom Mulders shows that for Karatsuba
 103          and higher order algorithms the advantage is progressively lost, so
 104          for large sizes shows products turn out to be no faster.
 105
 106   </ol>
 107
 108   <p> Another possibility would be an optimized cube.  In the basecase that
 109       should definitely be able to save cross-products in a similar fashion to
 110       squaring, but some investigation might be needed for how best to adapt
 111       the higher-order algorithms.  Not sure whether cubing or further small
 112       powers have any particularly important uses though.
 113
 114
 115 <li> <strong>Assembly routines</strong>
 116
 117   <p> Write new and improve existing assembly routines.  The tests/devel
 118       programs and the tune/speed.c and tune/many.pl programs are useful for
 119       testing and timing the routines you write.  See the README files in those
 120       directories for more information.
 121
 122   <p> Please make sure your new routines are fast for these three situations:
 123       <ol>
 124         <li> Operands that fit into the cache.
 125         <li> Small operands of less than, say, 10 limbs.
 126         <li> Huge operands that does not fit into the cache.
 127       </ol>
 128
 129   <p> The most important routines are mpn_addmul_1, mpn_mul_basecase and
 130       mpn_sqr_basecase.  The latter two don't exist for all machines, while
 131       mpn_addmul_1 exists for almost all machines.
 132
 133   <p> Standard techniques for these routines are unrolling, software
 134       pipelining, and specialization for common operand values.  For machines
 135       with poor integer multiplication, it is sometimes possible to remedy the
 136       situation using floating-point operations or SIMD operations such as MMX
 137       (x86) (x86), SSE (x86), VMX (PowerPC), VIS (Sparc).
 138
 139   <p> Using floating-point operations is interesting but somewhat tricky.
 140       Since IEEE double has 53 bit of mantissa, one has to split the operands
 141       in small pieces, so that no intermediates are greater than 2^53.  For
 142       32-bit computers, splitting one operand into 16-bit pieces works.  For
 143       64-bit machines, one operand can be split into 21-bit pieces and the
 144       other into 32-bit pieces.  (A 64-bit operand can be split into just three
 145       21-bit pieces if one allows the split operands to be negative!)
 146
 147
 148 <li> <strong>Math functions for the mpf layer</strong>
 149
 150   <p> Implement the functions of math.h for the GMP mpf layer!  Check the book
 151       "Pi and the AGM" by Borwein and Borwein for ideas how to do this.  These
 152       functions are desirable: acos, acosh, asin, asinh, atan, atanh, atan2,
 153       cos, cosh, exp, log, log10, pow, sin, sinh, tan, tanh.
 154
 155   <p> Note that the <a href="http://mpfr.org">mpfr</a> functions already
 156   provide these functions, and that we usually recommend new programs to use
 157   mpfr instead of mpf.
 158
 159
 160 <li> <strong>Faster sqrt</strong>
 161
 162   <p> The current code uses divisions, which are reasonably fast, but it'd be
 163       possible to use only multiplications by computing 1/sqrt(A) using this
 164       iteration:
 165       <pre>
 166                                     2
 167                    x   = x  (3 &minus; A x )/2
 168                     i+1   i         i  </pre>
 169       The square root can then be computed like this:
 170       <pre>
 171                      sqrt(A) = A x
 172                                   n  </pre>
 173   <p> That final multiply might be the full size of the input (though it might
 174       only need the high half of that), so there may or may not be any speedup
 175       overall.
 176
 177   <p> We should probably allow a special exponent-like parameter, to speed
 178       computations of a precise square root of a small number in mpf and mpfr.
 179
 180
 181 <li> <strong>Nth root</strong>
 182
 183   <p> Improve mpn_rootrem.  The current code is not too bad, but its average
 184       time complexity is a function of the input, while it is possible to
 185       make it a function of the output.
 186
 187
 188 <li> <strong>Exceptions</strong>
 189
 190   <p> Some sort of scheme for exceptions handling would be desirable.
 191       Presently the only thing documented is that divide by zero in GMP
 192       functions provokes a deliberate machine divide by zero (on those systems
 193       where such a thing exists at least).  The global <code>gmp_errno</code>
 194       is not actually documented, except for the old <code>gmp_randinit</code>
 195       function.  Being currently just a plain global means it's not
 196       thread-safe.
 197
 198   <p> The basic choices for exceptions are returning an error code or having a
 199       handler function to be called.  The disadvantage of error returns is they
 200       have to be checked, leading to tedious and rarely executed code, and
 201       strictly speaking such a scheme wouldn't be source or binary compatible.
 202       The disadvantage of a handler function is that a <code>longjmp</code> or
 203       similar recovery from it may be difficult.  A combination would be
 204       possible, for instance by allowing the handler to return an error code.
 205
 206   <p> Divide-by-zero, sqrt-of-negative, and similar operand range errors can
 207       normally be detected at the start of functions, so exception handling
 208       would have a clean state.  What's worth considering though is that the
 209       GMP function detecting the exception may have been called via some third
 210       party library or self contained application module, and hence have
 211       various bits of state to be cleaned up above it.  It'd be highly
 212       desirable for an exceptions scheme to allow for such cleanups.
 213
 214   <p> The C++ destructor mechanism could help with cleanups both internally and
 215       externally, but being a plain C library we don't want to depend on that.
 216
 217   <p> A C++ <code>throw</code> might be a good optional extra exceptions
 218       mechanism, perhaps under a build option.  For
 219       GCC <code>-fexceptions</code> will add the necessary frame information to
 220       plain C code, or GMP could be compiled as C++.
 221
 222   <p> Out-of-memory exceptions are expected to be handled by the
 223       <code>mp_set_memory_functions</code> routines, rather than being a
 224       prospective part of divide-by-zero etc.  Some similar considerations
 225       apply but what differs is that out-of-memory can arise deep within GMP
 226       internals.  Even fundamental routines like <code>mpn_add_n</code> and
 227       <code>mpn_addmul_1</code> can use temporary memory (for instance on Cray
 228       vector systems).  Allowing for an error code return would require an
 229       awful lot of checking internally.  Perhaps it'd still be worthwhile, but
 230       it'd be a lot of changes and the extra code would probably be rather
 231       rarely executed in normal usages.
 232
 233   <p> A <code>longjmp</code> recovery for out-of-memory will currently, in
 234       general, lead to memory leaks and may leave GMP variables operated on in
 235       inconsistent states.  Maybe it'd be possible to record recovery
 236       information for use by the relevant allocate or reallocate function, but
 237       that too would be a lot of changes.
 238
 239   <p> One scheme for out-of-memory would be to note that all GMP allocations go
 240       through the <code>mp_set_memory_functions</code> routines.  So if the
 241       application has an intended <code>setjmp</code> recovery point it can
 242       record memory activity by GMP and abandon space allocated and variables
 243       initialized after that point.  This might be as simple as directing the
 244       allocation functions to a separate pool, but in general would have the
 245       disadvantage of needing application-level bookkeeping on top of the
 246       normal system <code>malloc</code>.  An advantage however is that it needs
 247       nothing from GMP itself and on that basis doesn't burden applications not
 248       needing recovery.  Note that there's probably some details to be worked
 249       out here about reallocs of existing variables, and perhaps about copying
 250       or swapping between "permanent" and "temporary" variables.
 251
 252   <p> Applications desiring a fine-grained error control, for instance a
 253       language interpreter, would very possibly not be well served by a scheme
 254       requiring <code>longjmp</code>.  Wrapping every GMP function call with a
 255       <code>setjmp</code> would be very inconvenient.
 256
 257   <p> Another option would be to let <code>mpz_t</code> etc hold a sort of NaN,
 258       a special value indicating an out-of-memory or other failure.  This would
 259       be similar to NaNs in mpfr.  Unfortunately such a scheme could only be
 260       used by programs prepared to handle such special values, since for
 261       instance a program waiting for some condition to be satisfied could
 262       become an infinite loop if it wasn't also watching for NaNs.  The work to
 263       implement this would be significant too, lots of checking of inputs and
 264       intermediate results.  And if <code>mpn</code> routines were to
 265       participate in this (which they would have to internally) a lot of new
 266       return values would need to be added, since of course there's no
 267       <code>mpz_t</code> etc structure for them to indicate failure in.
 268
 269   <p> Stack overflow is another possible exception, but perhaps not one that
 270       can be easily detected in general.  On i386 GNU/Linux for instance GCC
 271       normally doesn't generate stack probes for an <code>alloca</code>, but
 272       merely adjusts <code>%esp</code>.  A big enough <code>alloca</code> can
 273       miss the stack redzone and hit arbitrary data.  GMP stack usage is
 274       normally a function of operand size, which might be enough for some
 275       applications to know they'll be safe.  Otherwise a fixed maximum usage
 276       can probably be obtained by building with
 277       <code>--enable-alloca=malloc-reentrant</code> (or
 278       <code>notreentrant</code>).  Arranging the default to be
 279       <code>alloca</code> only on blocks up to a certain size and
 280       <code>malloc</code> thereafter might be a better approach and would have
 281       the advantage of not having calculations limited by available stack.
 282
 283   <p> Actually recovering from stack overflow is of course another problem.  It
 284       might be possible to catch a <code>SIGSEGV</code> in the stack redzone
 285       and do something in a <code>sigaltstack</code>, on systems which have
 286       that, but recovery might otherwise not be possible.  This is worth
 287       bearing in mind because there's no point worrying about tight and careful
 288       out-of-memory recovery if an out-of-stack is fatal.
 289
 290   <p> Operand overflow is another exception to be addressed.  It's easy for
 291       instance to ask <code>mpz_pow_ui</code> for a result bigger than an
 292       <code>mpz_t</code> can possibly represent.  Currently overflows in limb
 293       or byte count calculations will go undetected.  Often they'll still end
 294       up asking the memory functions for blocks bigger than available memory,
 295       but that's by no means certain and results are unpredictable in general.
 296       It'd be desirable to tighten up such size calculations.  Probably only
 297       selected routines would need checks, if it's assumed say that no input
 298       will be more than half of all memory and hence size additions like say
 299       <code>mpz_mul</code> won't overflow.
 300
 301
 302 <li> <strong>Performance Tool</strong>
 303
 304   <p> It'd be nice to have some sort of tool for getting an overview of
 305       performance.  Clearly a great many things could be done, but some primary
 306       uses would be,
 307
 308       <ol>
 309         <li> Checking speed variations between compilers.
 310         <li> Checking relative performance between systems or CPUs.
 311       </ol>
 312
 313   <p> A combination of measuring some fundamental routines and some
 314       representative application routines might satisfy these.
 315
 316   <p> The tune/time.c routines would be the easiest way to get good accurate
 317       measurements on lots of different systems.  The high level
 318       <code>speed_measure</code> may or may not suit, but the basic
 319       <code>speed_starttime</code> and <code>speed_endtime</code> would cover
 320       lots of portability and accuracy questions.
 321
 322
 323 <li> <strong>Using <code>restrict</code></strong>
 324
 325   <p> There might be some value in judicious use of C99 style
 326       <code>restrict</code> on various pointers, but this would need some
 327       careful thought about what it implies for the various operand overlaps
 328       permitted in GMP.
 329
 330   <p> Rumour has it some pre-C99 compilers had <code>restrict</code>, but
 331       expressing tighter (or perhaps looser) requirements.  Might be worth
 332       investigating that before using <code>restrict</code> unconditionally.
 333
 334   <p> Loops are presumably where the greatest benefit would be had, by allowing
 335       the compiler to advance reads ahead of writes, perhaps as part of loop
 336       unrolling.  However critical loops are generally coded in assembler, so
 337       there might not be very much to gain.  And on Cray systems the explicit
 338       use of <code>_Pragma</code> gives an equivalent effect.
 339
 340   <p> One thing to note is that Microsoft C headers (on ia64 at least) contain
 341       <code>__declspec(restrict)</code>, so a <code>#define</code> of
 342       <code>restrict</code> should be avoided.  It might be wisest to setup a
 343       <code>gmp_restrict</code>.
 344
 345
 346 <li> <strong>Nx1 Division</strong>
 347
 348   <p> The limb-by-limb dependencies in the existing Nx1 division (and
 349       remainder) code means that chips with multiple execution units or
 350       pipelined multipliers are not fully utilized.
 351
 352   <p> One possibility is to follow the current preinv method but taking two
 353       limbs at a time.  That means a 2x2-&gt;4 and a 2x1-&gt;2 multiply for
 354       each two limbs processed, and because the 2x2 and 2x1 can each be done in
 355       parallel the latency will be not much more than 2 multiplies for two
 356       limbs, whereas the single limb method has a 2 multiply latency for just
 357       one limb.  A version of <code>mpn_divrem_1</code> doing this has been
 358       written in C, but not yet tested on likely chips.  Clearly this scheme
 359       would extend to 3x3-&gt;9 and 3x1-&gt;3 etc, though with diminishing
 360       returns.
 361
 362   <p> For <code>mpn_mod_1</code>, Peter L. Montgomery proposes the following
 363       scheme.  For a limb R=2^<code>bits_per_mp_limb</code>, pre-calculate
 364       values R mod N, R^2 mod N, R^3 mod N, R^4 mod N.  Then take dividend
 365       limbs and multiply them by those values, thereby reducing them (moving
 366       them down) by the corresponding factor.  The products can be added to
 367       produce an intermediate remainder of 2 or 3 limbs to be similarly
 368       included in the next step.  The point is that such multiplies can be done
 369       in parallel, meaning as little as 1 multiply worth of latency for 4
 370       limbs.  If the modulus N is less than R/4 (or is it R/5?) the summed
 371       products will fit in 2 limbs, otherwise 3 will be required, but with the
 372       high only being small.  Clearly this extends to as many factors of R as a
 373       chip can efficiently apply.
 374
 375   <p> The logical conclusion for powers R^i is a whole array "p[i] = R^i mod N"
 376       for i up to k, the size of the dividend.  This could then be applied at
 377       multiplier throughput speed like an inner product.  If the powers took
 378       roughly k divide steps to calculate then there'd be an advantage any time
 379       the same N was used three or more times.  Suggested by Victor Shoup in
 380       connection with chinese-remainder style decompositions, but perhaps with
 381       other uses.
 382
 383   <p> <code>mpn_modexact_1_odd</code> calculates an x in the range 0&lt;=x&lt;d
 384       satisfying a = q*d + x*b^n, where b=2^bits_per_limb.  The factor b^n
 385       needed to get the true remainder r could be calculated by a powering
 386       algorithm, allowing <code>mpn_modexact_1_odd</code> to be pressed into
 387       service for an <code>mpn_mod_1</code>.  <code>modexact_1</code> is
 388       simpler and on some chips can run noticeably faster than plain
 389       <code>mod_1</code>, on Athlon for instance 11 cycles/limb instead of 17.
 390       Such a difference could soon overcome the time to calculate b^n.  The
 391       requirement for an odd divisor in <code>modexact</code> can be handled by
 392       some shifting on-the-fly, or perhaps by an extra partial-limb step at the
 393       end.
 394
 395
 396 <li> <strong>Factorial</strong>
 397
 398   <p> The removal of twos in the current code could be extended to factors of 3
 399       or 5.  Taking this to its logical conclusion would be a complete
 400       decomposition into powers of primes.  The power for a prime p is of
 401       course floor(n/p)+floor(n/p^2)+...  Conrad Curry found this is quite fast
 402       (using simultaneous powering as per Handbook of Applied Cryptography
 403       algorithm 14.88).
 404
 405   <p> A difficulty with using all primes is that quite large n can be
 406       calculated on a system with enough memory, larger than we'd probably want
 407       for a table of primes, so some sort of sieving would be wanted.  Perhaps
 408       just taking out the factors of 3 and 5 would give most of the speedup
 409       that a prime decomposition can offer.
 410
 411
 412 <li> <strong>Binomial Coefficients</strong>
 413
 414   <p> An obvious improvement to the current code would be to strip factors of 2
 415       from each multiplier and divisor and count them separately, to be applied
 416       with a bit shift at the end.  Factors of 3 and perhaps 5 could even be
 417       handled similarly.
 418
 419   <p> Conrad Curry reports a big speedup for binomial coefficients using a
 420       prime powering scheme, at least for k near n/2.  Of course this is only
 421       practical for moderate size n since again it requires primes up to n.
 422
 423   <p> When k is small the current (n-k+1)...n/1...k will be fastest.  Some sort
 424       of rule would be needed for when to use this or when to use prime
 425       powering.  Such a rule will be a function of both n and k.  Some
 426       investigation is needed to see what sort of shape the crossover line will
 427       have, the usual parameter tuning can of course find machine dependent
 428       constants to fill in where necessary.
 429
 430   <p> An easier possibility also reported by Conrad Curry is that it may be
 431       faster not to divide out the denominator (1...k) one-limb at a time, but
 432       do one big division at the end.  Is this because a big divisor in
 433       <code>mpn_bdivmod</code> trades the latency of
 434       <code>mpn_divexact_1</code> for the throughput of
 435       <code>mpn_submul_1</code>?  Overheads must hurt though.
 436
 437   <p> Another reason a big divisor might help is that
 438       <code>mpn_divexact_1</code> won't be getting a full limb in
 439       <code>mpz_bin_uiui</code>.  It's called when the n accumulator is full
 440       but the k may be far from full.  Perhaps the two could be decoupled so k
 441       is applied when full.  It'd be necessary to delay consideration of k
 442       terms until the corresponding n terms had been applied though, since
 443       otherwise the division won't be exact.
 444
 445
 446 <li> <strong>Perfect Power Testing</strong>
 447
 448   <p> <code>mpz_perfect_power_p</code> could be improved in a number of ways,
 449       for instance p-adic arithmetic to find possible roots.
 450
 451   <p> Non-powers can be quickly identified by checking for Nth power residues
 452       modulo small primes, like <code>mpn_perfect_square_p</code> does for
 453       squares.  The residues to each power N for a given remainder could be
 454       grouped into a bit mask, the masks for the remainders to each divisor
 455       would then be "and"ed together to hopefully leave only a few candidate
 456       powers.  Need to think about how wide to make such masks, ie. how many
 457       powers to examine in this way.
 458
 459   <p> Any zero remainders found in residue testing reveal factors which can be
 460       divided out, with the multiplicity restricting the powers that need to be
 461       considered, as per the current code.  Further prime dividing should be
 462       grouped into limbs like <code>PP</code>.  Need to think about how much
 463       dividing to do like that, probably more for bigger inputs, less for
 464       smaller inputs.
 465
 466   <p> <code>mpn_gcd_1</code> would probably be better than the current private
 467       GCD routine.  The use it's put to isn't time-critical, and it might help
 468       ensure correctness to just use the main GCD routine.
 469
 470   <p> [There is work-in-progress with a very fast function.]
 471
 472
 473 <li> <strong>Prime Testing</strong>
 474
 475   <p> GMP is not really a number theory library and probably shouldn't have
 476       large amounts of code dedicated to sophisticated prime testing
 477       algorithms, but basic things well-implemented would suit.  Tests offering
 478       certainty are probably all too big or too slow (or both!) to justify
 479       inclusion in the main library.  Demo programs showing some possibilities
 480       would be good though.
 481
 482   <p> The present "repetitions" argument to <code>mpz_probab_prime_p</code> is
 483       rather specific to the Miller-Rabin tests of the current implementation.
 484       Better would be some sort of parameter asking perhaps for a maximum
 485       chance 1/2^x of a probable prime in fact being composite.  If
 486       applications follow the advice that the present reps gives 1/4^reps
 487       chance then perhaps such a change is unnecessary, but an explicitly
 488       described 1/2^x would allow for changes in the implementation or even for
 489       new proofs about the theory.
 490
 491   <p> <code>mpz_probab_prime_p</code> always initializes a new
 492       <code>gmp_randstate_t</code> for randomized tests, which unfortunately
 493       means it's not really very random and in particular always runs the same
 494       tests for a given input.  Perhaps a new interface could accept an rstate
 495       to use, so successive tests could increase confidence in the result.
 496
 497   <p> <code>mpn_mod_34lsub1</code> is an obvious and easy improvement to the
 498       trial divisions.  And since the various prime factors are constants, the
 499       remainder can be tested with something like
 500 <pre>
 501 #define MP_LIMB_DIVISIBLE_7_P(n) \
 502   ((n) * MODLIMB_INVERSE_7 &lt;= MP_LIMB_T_MAX/7)
 503 </pre>
 504       Which would help compilers that don't know how to optimize divisions by
 505       constants, and is even an improvement on current gcc 3.2 code.  This
 506       technique works for any modulus, see Granlund and Montgomery "Division by
 507       Invariant Integers" section 9.
 508
 509   <p> The trial divisions are done with primes generated and grouped at
 510       runtime.  This could instead be a table of data, with pre-calculated
 511       inverses too.  Storing deltas, ie. amounts to add, rather than actual
 512       primes would save space.  <code>udiv_qrnnd_preinv</code> style inverses
 513       can be made to exist by adding dummy factors of 2 if necessary.  Some
 514       thought needs to be given as to how big such a table should be, based on
 515       how much dividing would be profitable for what sort of size inputs.  The
 516       data could be shared by the perfect power testing.
 517
 518   <p> Jason Moxham points out that if a sqrt(-1) mod N exists then any factor
 519       of N must be == 1 mod 4, saving half the work in trial dividing.  (If
 520       x^2==-1 mod N then for a prime factor p we have x^2==-1 mod p and so the
 521       jacobi symbol (-1/p)=1.  But also (-1/p)=(-1)^((p-1)/2), hence must have
 522       p==1 mod 4.)  But knowing whether sqrt(-1) mod N exists is not too easy.
 523       A strong pseudoprime test can reveal one, so perhaps such a test could be
 524       inserted part way though the dividing.
 525
 526   <p> Jon Grantham "Frobenius Pseudoprimes" (www.pseudoprime.com) describes a
 527       quadratic pseudoprime test taking about 3x longer than a plain test, but
 528       with only a 1/7710 chance of error (whereas 3 plain Miller-Rabin tests
 529       would offer only (1/4)^3 == 1/64).  Such a test needs completely random
 530       parameters to satisfy the theory, though single-limb values would run
 531       faster.  It's probably best to do at least one plain Miller-Rabin before
 532       any quadratic tests, since that can identify composites in less total
 533       time.
 534
 535   <p> Some thought needs to be given to the structure of which tests (trial
 536       division, Miller-Rabin, quadratic) and how many are done, based on what
 537       sort of inputs we expect, with a view to minimizing average time.
 538
 539   <p> It might be a good idea to break out subroutines for the various tests,
 540       so that an application can combine them in ways it prefers, if sensible
 541       defaults in <code>mpz_probab_prime_p</code> don't suit.  In particular
 542       this would let applications skip tests it knew would be unprofitable,
 543       like trial dividing when an input is already known to have no small
 544       factors.
 545
 546   <p> For small inputs, combinations of theory and explicit search make it
 547       relatively easy to offer certainty.  For instance numbers up to 2^32
 548       could be handled with a strong pseudoprime test and table lookup.  But
 549       it's rather doubtful whether a smallnum prime test belongs in a bignum
 550       library.  Perhaps if it had other internal uses.
 551
 552   <p> An <code>mpz_nthprime</code> might be cute, but is almost certainly
 553       impractical for anything but small n.
 554
 555
 556 <li> <strong>Intra-Library Calls</strong>
 557
 558   <p> On various systems, calls within libgmp still go through the PLT, TOC or
 559       other mechanism, which makes the code bigger and slower than it needs to
 560       be.
 561
 562   <p> The theory would be to have all GMP intra-library calls resolved directly
 563       to the routines in the library.  An application wouldn't be able to
 564       replace a routine, the way it can normally, but there seems no good
 565       reason to do that, in normal circumstances.
 566
 567   <p> The <code>visibility</code> attribute in recent gcc is good for this,
 568       because it lets gcc omit unnecessary GOT pointer setups or whatever if it
 569       finds all calls are local and there's no global data references.
 570       Documented entrypoints would be <code>protected</code>, and purely
 571       internal things not wanted by test programs or anything can be
 572       <code>internal</code>.
 573
 574   <p> Unfortunately, on i386 it seems <code>protected</code> ends up causing
 575       text segment relocations within libgmp.so, meaning the library code can't
 576       be shared between processes, defeating the purpose of a shared library.
 577       Perhaps this is just a gremlin in binutils (debian packaged
 578       2.13.90.0.16-1).
 579
 580   <p> The linker can be told directly (with a link script, or options) to do
 581       the same sort of thing.  This doesn't change the code emitted by gcc of
 582       course, but it does mean calls are resolved directly to their targets,
 583       avoiding a PLT entry.
 584
 585   <p> Keeping symbols private to libgmp.so is probably a good thing in general
 586       too, to stop anyone even attempting to access them.  But some
 587       undocumented things will need or want to be kept visible, for use by
 588       mpfr, or the test and tune programs.  Libtool has a standard option for
 589       selecting public symbols (used now for libmp).
 590
 591
 592 </ul>
 593 <hr>
 594
 595 </body>
 596 </html>
 597
 598 <!--
 599 Local variables:
 600 eval: (add-hook 'write-file-hooks 'time-stamp)
 601 time-stamp-start: "This file current as of "
 602 time-stamp-format: "%:d %3b %:y"
 603 time-stamp-end: "\\."
 604 time-stamp-line-limit: 50
 605 End:
 606 -->