tune/README

   1 Copyright 2000, 2001, 2002, 2004 Free Software Foundation, Inc.
   2
   3 This file is part of the GNU MP Library.
   4
   5 The GNU MP Library is free software; you can redistribute it and/or modify
   6 it under the terms of the GNU Lesser General Public License as published by
   7 the Free Software Foundation; either version 3 of the License, or (at your
   8 option) any later version.
   9
  10 The GNU MP Library is distributed in the hope that it will be useful, but
  11 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
  12 or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
  13 License for more details.
  14
  15 You should have received a copy of the GNU Lesser General Public License
  16 along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.
  17
  18
  19
  20
  21
  22                GMP SPEED MEASURING AND PARAMETER TUNING
  23
  24
  25 The programs in this directory are for knowledgeable users who want to
  26 measure GMP routines on their machine, and perhaps tweak some settings or
  27 identify things that can be improved.
  28
  29 The programs here are tools, not ready to run solutions.  Nothing is built
  30 in a normal "make all", but various Makefile targets described below exist.
  31
  32 Relatively few systems and CPUs have been tested, so be sure to verify that
  33 results are sensible before relying on them.
  34
  35
  36
  37
  38 MISCELLANEOUS NOTES
  39
  40 --enable-assert
  41
  42     Don't configure with --enable-assert, since the extra code added by
  43     assertion checking may influence measurements.
  44
  45 Direct mapped caches
  46
  47     Some effort has been made to accommodate CPUs with direct mapped caches,
  48     by putting data blocks more or less contiguously on the stack.  But this
  49     will depend on TMP_ALLOC using alloca, and even then it may or may not
  50     be enough.
  51
  52 FreeBSD 4.2 i486 getrusage
  53
  54     This getrusage seems to be a bit doubtful, it looks like it's
  55     microsecond accurate, but sometimes ru_utime remains unchanged after a
  56     time of many microseconds has elapsed.  It'd be good to detect this in
  57     the time.c initializations, but for now the suggestion is to pretend it
  58     doesn't exist.
  59
  60         ./configure ac_cv_func_getrusage=no
  61
  62 NetBSD 1.4.1 m68k macintosh time base
  63
  64     On this system it's been found getrusage often goes backwards, making it
  65     unusable (time.c getrusage_backwards_p detects this).  gettimeofday
  66     sometimes doesn't update atomically when it crosses a 1 second boundary.
  67     Not sure what to do about this.  Expect possible intermittent failures.
  68
  69 SCO OpenUNIX 8 /etc/hw
  70
  71     /etc/hw takes about a second to return the cpu frequency, which suggests
  72     perhaps it's measuring each time it runs.  If this is annoying when
  73     running the speed program repeatedly then set a GMP_CPU_FREQUENCY
  74     environment variable (see TIME BASE section below).
  75
  76 Timing on GNU/Linux
  77
  78     On Linux, timing currently uses the cycle counter. This is unreliable,
  79     since the counter is not saved and restored at context switches (unlike
  80     FreeBSD and Solaris where the cycle counter is "virtualized").
  81
  82     Using the clock_gettime method with CLOCK_PROCESS_CPUTIME_ID (posix) or
  83     CLOCK_VIRTUAL (BSD) should be more reliable. To get clock_gettime
  84     with glibc, one has to link with -lrt (which also drags in the pthreads
  85     threading library). configure.in must be hacked to detect this and
  86     arrange proper linking. Something like
  87
  88       old_LIBS="$LIBS"
  89       AC_SEARCH_LIBS(clock_gettime, rt, [AC_DEFINE(HAVE_CLOCK_GETTIME)])
  90       TUNE_LIBS="$LIBS"
  91       LIBS="$old_LIBS"
  92
  93       AC_SUBST(TUNE_LIBS)
  94
  95     might work.
  96
  97 Low resolution timebase
  98
  99     Parameter tuning can be very time consuming if the only timebase
 100     available is a 10 millisecond clock tick, to the point of being
 101     unusable.  This is currently the case on VAX and ARM systems.
 102
 103
 104
 105
 106 PARAMETER TUNING
 107
 108 The "tuneup" program runs some tests designed to find the best settings for
 109 various thresholds, like MUL_TOOM22_THRESHOLD.  Its output can be put
 110 into gmp-mparam.h.  The program is built and run with
 111
 112         make tune
 113
 114 If the thresholds indicated are grossly different from the values in the
 115 selected gmp-mparam.h then there may be a performance boost in applicable
 116 size ranges by changing gmp-mparam.h accordingly.
 117
 118 Be sure to do a full reconfigure and rebuild to get any newly set thresholds
 119 to take effect.  A partial rebuild is enough sometimes, but a fresh
 120 configure and make is certain to be correct.
 121
 122 If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of
 123 the mpn subdirectories then the values from "make tune" should be similar.
 124 But check that the configured CPU is right and there are no machine specific
 125 effects causing a difference.
 126
 127 It's hoped the compiler and options used won't have too much effect on
 128 thresholds, since for most CPUs they ultimately come down to comparisons
 129 between assembler subroutines.  Missing out on the longlong.h macros by not
 130 using gcc will probably have an effect.
 131
 132 Some thresholds produced by the tune program are merely single values chosen
 133 from what's a range of sizes where two algorithms are pretty much the same
 134 speed.  When this happens the program is likely to give somewhat different
 135 values on successive runs.  This is noticeable on the toom3 thresholds for
 136 instance.
 137
 138
 139
 140
 141 SPEED PROGRAM
 142
 143 The "speed" program can be used for measuring and comparing various
 144 routines, and producing tables of data or gnuplot graphs.  Compile it with
 145
 146         make speed
 147
 148 (Or on DOS systems "make speed.exe".)
 149
 150 Here are some examples of how to use it.  Check the code for all the
 151 options.
 152
 153 Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05
 154 (whichever is greater).
 155
 156         ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
 157         gnuplot foo.gnuplot
 158
 159 Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and
 160 showing under mpn_lshift the difference between it and mpn_add_n.
 161
 162         ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1
 163
 164 Using option -c for times in cycles is interesting but normally only
 165 necessary when looking carefully at assembler subroutines.  You might think
 166 it would always give an integer value, but this doesn't happen in practice,
 167 probably due to overheads in the time measurements.
 168
 169 In the free-form output the "#" symbol against a measurement means the
 170 corresponding routine is fastest at that size.  This is a convenient visual
 171 cue when comparing different routines.  The graph data files <name>.data
 172 don't get this since it would upset gnuplot or other data viewers.
 173
 174
 175
 176
 177 TIME BASE
 178
 179 The time measuring method is determined in time.c, based on what the
 180 configured host has available.  A cycle counter is preferred, possibly
 181 supplemented by another method if the counter has a limited range.  A
 182 microsecond accurate getrusage() or gettimeofday() will work quite well too.
 183
 184 The cycle counters (except possibly on alpha) and gettimeofday() will depend
 185 on the machine being otherwise idle, or rather on other jobs not stealing
 186 CPU time from the measuring program.  Short routines (those that complete
 187 within a timeslice) should work even on a busy machine.
 188
 189 Some trouble is taken by speed_measure() in common.c to avoid ill effects
 190 from sporadic interrupts, or other intermittent things (like cron waking up
 191 every minute).  But generally an idle machine will be necessary to be
 192 certain of consistent results.
 193
 194 The CPU frequency is needed to convert between cycles and seconds, or for
 195 when a cycle counter is supplemented by getrusage() etc.  The speed program
 196 will convert as necessary according to the output format requested.  The
 197 tune program will work with either cycles or seconds.
 198
 199 freq.c knows how to get the frequency on some systems, or can measure a
 200 cycle counter against gettimeofday() or getrusage(), but when that fails, or
 201 needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be
 202 used (in Hertz).  For example in "bash" on a 650 MHz machine,
 203
 204         export GMP_CPU_FREQUENCY=650e6
 205
 206 A high precision time base makes it possible to get accurate measurements in
 207 a shorter time.
 208
 209
 210
 211
 212 EXAMPLE COMPARISONS - VARIOUS
 213
 214 Here are some ideas for things that can be done with the speed program.
 215
 216 There's always going to be a certain amount of overhead in the time
 217 measurements, due to reading the time base, and in the loop that runs a
 218 routine enough times to get a reading of the desired precision.  Noop
 219 functions taking various arguments are available to measure this.  The
 220 "overhead" printed by the speed program each time in its intro is the "noop"
 221 routine, but note that this is just for information, it isn't deducted from
 222 the times printed or anything.
 223
 224         ./speed -s 1 noop noop_wxs noop_wxys
 225
 226 To see how many cycles per limb a routine is taking, look at the time
 227 increase when the size increments, using option -D.  This avoids fixed
 228 overheads in the measuring.  Also, remember many of the assembler routines
 229 have unrolled loops, so it might be necessary to compare times at, say, 16,
 230 32, 48, 64 etc to see what the unrolled part is taking, as opposed to any
 231 finishing off.
 232
 233         ./speed -s 16-64 -t 16 -C -D mpn_add_n
 234
 235 The -C option on its own gives cycles per limb, but is really only useful at
 236 big sizes where fixed overheads are small compared to the code doing the
 237 real work.  Remember of course memory caching and/or page swapping will
 238 affect results at large sizes.
 239
 240         ./speed -s 500000 -C mpn_add_n
 241
 242 Once a calculation stops fitting in the CPU data cache, it's going to start
 243 taking longer.  Exactly where this happens depends on the cache priming in
 244 the measuring routines, and on what sort of "least recently used" the
 245 hardware does.  Here's an example for a CPU with a 16kbyte L1 data cache and
 246 32-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000
 247 limbs.
 248
 249         ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n
 250         gnuplot foo.gnuplot
 251
 252 When a routine has an unrolled loop for, say, multiples of 8 limbs and then
 253 an ordinary loop for the remainder, it can happen that it's actually faster
 254 to do an operation on, say, 8 limbs than it is on 7 limbs.  The following
 255 draws a graph of mpn_sub_n, to see whether times smoothly increase with
 256 size.
 257
 258         ./speed -s 1-100 -c -P foo mpn_sub_n
 259         gnuplot foo.gnuplot
 260
 261 If mpn_lshift and mpn_rshift have special case code for shifts by 1, it
 262 ought to be faster (or at least not slower) than shifting by, say, 2 bits.
 263
 264         ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2
 265
 266 An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and
 267 if the lshift isn't faster there's an obvious improvement that's possible.
 268
 269         ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self
 270
 271 On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the
 272 destination is one of the sources is faster than a separate destination.
 273 Here's an example to see this.  ".1" selects dst==src1 for mpn_add_n (and
 274 mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL.
 275
 276         ./speed -s 1-200 -c mpn_add_n mpn_add_n.1
 277
 278 The gmp manual points out that divisions by powers of two should be done
 279 using a right shift because it'll be significantly faster than an actual
 280 division.  The following shows by what factor mpn_rshift is faster than
 281 mpn_divrem_1, using division by 32 as an example.
 282
 283         ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32
 284
 285
 286
 287
 288 EXAMPLE COMPARISONS - MULTIPLICATION
 289
 290 mul_basecase takes a ".<r>" parameter which is the first (larger) size
 291 parameter.  For example to show speeds for 20x1 up to 20x15 in cycles,
 292
 293         ./speed -s 1-15 -c mpn_mul_basecase.20
 294
 295 mul_basecase with no parameter does an NxN multiply, so for example to show
 296 speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles,
 297
 298         ./speed -s 1-20 -c mpn_mul_basecase
 299
 300 sqr_basecase is implemented by a "triangular" method on most CPUs, making it
 301 up to twice as fast as mul_basecase.  In practice loop overheads and the
 302 products on the diagonal mean it falls short of this.  Here's an example
 303 running the two and showing by what factor an NxN mul_basecase is slower
 304 than an NxN sqr_basecase.  (Some versions of sqr_basecase only allow sizes
 305 below SQR_TOOM2_THRESHOLD, so if it crashes at that point don't worry.)
 306
 307         ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase
 308
 309 The technique described above with -CD for showing the time difference in
 310 cycles per limb between two size operations can be done on an NxN
 311 mul_basecase using -E to change the basis for the size increment to N*N.
 312 For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16
 313 doing 256 limbs.  The following therefore shows the per crossproduct speed
 314 of mul_basecase and sqr_basecase at around 20x20 limbs.
 315
 316         ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase
 317
 318 Of course sqr_basecase isn't really doing NxN crossproducts, but it can be
 319 interesting to compare it to mul_basecase as if it was.  For sqr_basecase
 320 the -F option can be used to base the deltas on N*(N+1)/2 operations, which
 321 is the triangular products sqr_basecase does.  For example,
 322
 323         ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase
 324
 325 Both -E and -F are preliminary and might change.  A consistent approach to
 326 using them when claiming certain per crossproduct or per triangularproduct
 327 speeds hasn't really been established, but the increment between speeds in
 328 the range karatsuba will call seems sensible, that being k to k/2.  For
 329 instance, if the karatsuba threshold was 20 for the multiply and 30 for the
 330 square,
 331
 332         ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
 333         ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase
 334
 335
 336
 337 EXAMPLE COMPARISONS - MALLOC
 338
 339 The gmp manual recommends application programs avoid excessive initializing
 340 and clearing of mpz_t variables (and mpq_t and mpf_t too).  Every new
 341 variable will at a minimum go through an init, a realloc for its first
 342 store, and finally a clear.  Quite how long that takes depends on the C
 343 library.  The following compares an mpz_init/realloc/clear to a 10 limb
 344 mpz_add.  Don't be surprised if the mallocing is quite slow.
 345
 346         ./speed -s 10 -c mpz_init_realloc_clear mpz_add
 347
 348 On some systems malloc and free are much slower when dynamic linked.  The
 349 speed-dynamic program can be used to see this.  For example the following
 350 measures malloc/free, first static then dynamic.
 351
 352         ./speed -s 10 -c malloc_free
 353         ./speed-dynamic -s 10 -c malloc_free
 354
 355 Of course a real world program has big problems if it's doing so many
 356 mallocs and frees that it gets slowed down by a dynamic linked malloc.
 357
 358
 359
 360
 361
 362 EXAMPLE COMPARISONS - STRING CONVERSIONS
 363
 364 mpn_get_str does a binary to string conversion.  The base is specified with
 365 a ".<r>" parameter, or decimal by default.  Power of 2 bases are much faster
 366 than general bases.  The following compares decimal and hex for instance.
 367
 368         ./speed -s 1-20 -c mpn_get_str mpn_get_str.16
 369
 370 Smaller bases need more divisions to split a given size number, and so are
 371 slower.  The following compares base 3 and base 9.  On small operands 9 will
 372 be nearly twice as fast, though at bigger sizes this reduces since in the
 373 current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit
 374 limbs) and those divisions come to dominate.
 375
 376         ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9
 377
 378 mpn_set_str does a string to binary conversion.  The base is specified with
 379 a ".<r>" parameter, or decimal by default.  Power of 2 bases are faster than
 380 general bases on large conversions.
 381
 382         ./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10
 383
 384 mpn_set_str also has some special case code for decimal which is a bit
 385 faster than the general case, basically by giving the compiler a chance to
 386 optimize some multiplications by 10.
 387
 388         ./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11
 389
 390
 391
 392
 393 EXAMPLE COMPARISONS - GCDs
 394
 395 mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both
 396 x and y are single limbs.  This isn't tuned currently, but a value can be
 397 established by a measurement like
 398
 399         ./speed -s 10-32 mpn_gcd_1.10
 400
 401 This runs src[0] from 10 to 32 bits, and y fixed at 10 bits.  If the div
 402 threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd
 403 is done by nibbling away at the 32-bit operands bit-by-bit.  When the
 404 threshold is small, say 1 bit, then an initial x%y is done to reduce it to a
 405 10x10 bit operation.
 406
 407 The threshold in mpn/generic/gcd_1.c or the various assembler
 408 implementations can be tweaked up or down until there's no more speedups on
 409 interesting combinations of sizes.  Note that this affects only a 1x1 limb
 410 operation and so isn't very important.  (An Nx1 limb operation always does
 411 an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.)
 412
 413
 414
 415
 416 SPEED PROGRAM EXTENSIONS
 417
 418 Potentially lots of things could be made available in the program, but it's
 419 been left at only the things that have actually been wanted and are likely
 420 to be reasonably useful in the future.
 421
 422 Extensions should be fairly easy to make though.  speed-ext.c is an example,
 423 in a style that should suit one-off tests, or new code fragments under
 424 development.
 425
 426 many.pl is a script for generating a new speed program supplemented with
 427 alternate versions of the standard routines.  It can be used for measuring
 428 experimental code, or for comparing different implementations that exist
 429 within a CPU family.
 430
 431
 432
 433
 434 THRESHOLD EXAMINING
 435
 436 The speed program can be used to examine the speeds of different algorithms
 437 to check the tune program has done the right thing.  For example to examine
 438 the karatsuba multiply threshold,
 439
 440         ./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n
 441
 442 When examining the toom3 threshold, remember it depends on the karatsuba
 443 threshold, so the right karatsuba threshold needs to be compiled into the
 444 library first.  The tune program uses specially recompiled versions of
 445 mpn/mul_n.c etc for this reason, but the speed program simply uses the
 446 normal libgmp.la.
 447
 448 Note further that the various routines may recurse into themselves on sizes
 449 far enough above applicable thresholds.  For example, mpn_kara_mul_n will
 450 recurse into itself on sizes greater than twice the compiled-in
 451 MUL_TOOM22_THRESHOLD.
 452
 453 When doing the above comparison between mul_basecase and kara_mul_n what's
 454 probably of interest is mul_basecase versus a kara_mul_n that does one level
 455 of Karatsuba then calls to mul_basecase, but this only happens on sizes less
 456 than twice the compiled MUL_TOOM22_THRESHOLD.  A larger value for that
 457 setting can be compiled-in to avoid the problem if necessary.  The same
 458 applies to toom3 and DC, though in a trickier fashion.
 459
 460 There are some upper limits on some of the thresholds, arising from arrays
 461 dimensioned according to a threshold (mpn_mul_n), or asm code with certain
 462 sized displacements (some x86 versions of sqr_basecase).  So putting huge
 463 values for the thresholds, even just for testing, may fail.
 464
 465
 466
 467
 468 FUTURE
 469
 470 Make a program to check the time base is working properly, for small and
 471 large measurements.  Make it able to test each available method, including
 472 perhaps the apparent resolution of each.
 473
 474 Make a general mechanism for specifying operand overlap, and a syntax like
 475 maybe "mpn_add_n.dst=src2" to select it.  Some measuring routines do this
 476 sort of thing with the "r" parameter currently.
 477
 478
 479
 480 ----------------
 481 Local variables:
 482 mode: text
 483 fill-column: 76
 484 End: