libs/math/doc/distributions/dist_tutorial.qbk

   1 [/ def names all end in distrib to avoid clashes with names of functions]
   2
   3 [def __binomial_distrib [link math_toolkit.dist_ref.dists.binomial_dist Binomial Distribution]]
   4 [def __chi_squared_distrib [link math_toolkit.dist_ref.dists.chi_squared_dist Chi Squared Distribution]]
   5 [def __normal_distrib [link math_toolkit.dist_ref.dists.normal_dist Normal Distribution]]
   6 [def __F_distrib [link math_toolkit.dist_ref.dists.f_dist Fisher F Distribution]]
   7 [def __students_t_distrib [link math_toolkit.dist_ref.dists.students_t_dist Students t Distribution]]
   8
   9 [def __handbook [@http://www.itl.nist.gov/div898/handbook/
  10 NIST/SEMATECH e-Handbook of Statistical Methods.]]
  11
  12 [section:stat_tut Statistical Distributions Tutorial]
  13 This library is centred around statistical distributions, this tutorial
  14 will give you an overview of what they are, how they can be used, and
  15 provides a few worked examples of applying the library to statistical tests.
  16
  17 [section:overview Overview of Statistical Distributions]
  18
  19 [section:headers Headers and Namespaces]
  20
  21 All the code in this library is inside `namespace boost::math`.
  22
  23 In order to use a distribution /my_distribution/ you will need to include
  24 either the header(s) `<boost/math/my_distribution.hpp>` (quicker compiles), or
  25 the "include all the distributions" header: `<boost/math/distributions.hpp>`.
  26
  27 For example, to use the Students-t distribution include either
  28 `<boost/math/students_t.hpp>` or
  29 `<boost/math/distributions.hpp>`
  30
  31 You also need to bring distribution names into scope,
  32 perhaps with a `using namespace boost::math;` declaration,
  33
  34 or specific  `using` declarations like `using boost::math::normal;` (*recommended*).
  35
  36 [caution Some math function names are also used in `namespace std` so including `<random>` could cause ambiguity!]
  37
  38 [endsect] [/ section:headers Headers and Namespaces]
  39
  40 [section:objects Distributions are Objects]
  41
  42 Each kind of distribution in this library is a class type - an object, with member functions.
  43
  44 [tip If you are familiar with statistics libraries using functions,
  45 and 'Distributions as Objects' seem alien, see
  46 [link math_toolkit.stat_tut.weg.nag_library the comparison to
  47 other statistics libraries.]
  48 ] [/tip]
  49
  50 [link policy Policies] provide optional fine-grained control
  51 of the behaviour of these classes, allowing the user to customise
  52 behaviour such as how errors are handled, or how the quantiles
  53 of discrete distributions behave.
  54
  55 Making distributions class types does two things:
  56
  57 * It encapsulates the kind of distribution in the C++ type system;
  58 so, for example, Students-t distributions are always a different C++ type from
  59 Chi-Squared distributions.
  60 * The distribution objects store any parameters associated with the
  61 distribution: for example, the Students-t distribution has a
  62 ['degrees of freedom] parameter that controls the shape of the distribution.
  63 This ['degrees of freedom] parameter has to be provided
  64 to the Students-t object when it is constructed.
  65
  66 Although the distribution classes in this library are templates, there
  67 are typedefs on type /double/ that mostly take the usual name of the
  68 distribution
  69 (except where there is a clash with a function of the same name: beta and gamma,
  70 in which case using the default template arguments - `RealType = double` -
  71 is nearly as convenient).
  72 Probably 95% of uses are covered by these typedefs:
  73
  74    // using namespace boost::math; // Avoid potential ambiguity with names in std <random>
  75    // Safer to declare specific functions with using statement(s):
  76
  77    using boost::math::beta_distribution;
  78    using boost::math::binomial_distribution;
  79    using boost::math::students_t;
  80
  81    // Construct a students_t distribution with 4 degrees of freedom:
  82    students_t d1(4);
  83
  84    // Construct a double-precision beta distribution
  85    // with parameters a = 10, b = 20
  86    beta_distribution<> d2(10, 20); // Note: _distribution<> suffix !
  87
  88 If you need to use the distributions with a type other than `double`,
  89 then you can instantiate the template directly: the names of the
  90 templates are the same as the `double` typedef but with `_distribution`
  91 appended, for example: __students_t_distrib or __binomial_distrib:
  92
  93    // Construct a students_t distribution, of float type,
  94    // with 4 degrees of freedom:
  95    students_t_distribution<float> d3(4);
  96
  97    // Construct a binomial distribution, of long double type,
  98    // with probability of success 0.3
  99    // and 20 trials in total:
 100    binomial_distribution<long double> d4(20, 0.3);
 101
 102 The parameters passed to the distributions can be accessed via getter member
 103 functions:
 104
 105    d1.degrees_of_freedom();  // returns 4.0
 106
 107 This is all well and good, but not very useful so far.  What we often want
 108 is to be able to calculate the /cumulative distribution functions/ and
 109 /quantiles/ etc for these distributions.
 110
 111 [endsect] [/section:objects Distributions are Objects]
 112
 113
 114 [section:generic Generic operations common to all distributions are non-member functions]
 115
 116 Want to calculate the PDF (Probability Density Function) of a distribution?
 117 No problem, just use:
 118
 119    pdf(my_dist, x);  // Returns PDF (density) at point x of distribution my_dist.
 120
 121 Or how about the CDF (Cumulative Distribution Function):
 122
 123    cdf(my_dist, x);  // Returns CDF (integral from -infinity to point x)
 124                      // of distribution my_dist.
 125
 126 And quantiles are just the same:
 127
 128    quantile(my_dist, p);  // Returns the value of the random variable x
 129                           // such that cdf(my_dist, x) == p.
 130
 131 If you're wondering why these aren't member functions, it's to
 132 make the library more easily extensible: if you want to add additional
 133 generic operations - let's say the /n'th moment/ - then all you have to
 134 do is add the appropriate non-member functions, overloaded for each
 135 implemented distribution type.
 136
 137 [tip
 138
 139 [*Random numbers that approximate Quantiles of Distributions]
 140
 141 If you want random numbers that are distributed in a specific way,
 142 for example in a uniform, normal or triangular,
 143 see [@http://www.boost.org/libs/random/ Boost.Random].
 144
 145 Whilst in principal there's nothing to prevent you from using the
 146 quantile function to convert a uniformly distributed random
 147 number to another distribution, in practice there are much more
 148 efficient algorithms available that are specific to random number generation.
 149 ] [/tip Random numbers that approximate Quantiles of Distributions]
 150
 151 For example, the binomial distribution has two parameters:
 152 n (the number of trials) and p (the probability of success on any one trial).
 153
 154 The `binomial_distribution` constructor therefore has two parameters:
 155
 156 `binomial_distribution(RealType n, RealType p);`
 157
 158 For this distribution the __random_variate is k: the number of successes observed.
 159 The probability density\/mass function (pdf) is therefore written as ['f(k; n, p)].
 160
 161 [note
 162
 163 [*Random Variates and Distribution Parameters]
 164
 165 The concept of a __random_variable is closely linked to the term __random_variate:
 166 a random variate is a particular value (outcome) of a random variable.
 167 and [@http://en.wikipedia.org/wiki/Parameter distribution parameters]
 168 are conventionally distinguished (for example in Wikipedia and Wolfram MathWorld)
 169 by placing a semi-colon or vertical bar)
 170 /after/ the __random_variable (whose value you 'choose'),
 171 to separate the variate from the parameter(s) that defines the shape of the distribution.
 172
 173 For example, the binomial distribution probability distribution function (PDF) is written as
 174 [role serif_italic ['f(k| n, p)] = Pr(K = k|n, p) = ] probability of observing k successes out of n trials.
 175 K is the __random_variable, k is the __random_variate,
 176 the parameters are n (trials) and p (probability).
 177 ] [/tip Random Variates and Distribution Parameters]
 178
 179 [note  By convention, __random_variate are lower case, usually k is integral, x if real, and
 180 __random_variable are upper case, K if integral, X if real.  But this implementation treats
 181 all as floating point values `RealType`, so if you really want an integral result,
 182 you must round: see note on Discrete Probability Distributions below for details.]
 183
 184 As noted above the non-member function `pdf` has one parameter for the distribution object,
 185 and a second for the random variate.  So taking our binomial distribution
 186 example, we would write:
 187
 188 `pdf(binomial_distribution<RealType>(n, p), k);`
 189
 190 The ranges of __random_variate values that are permitted and are supported can be
 191 tested by using two functions `range` and `support`.
 192
 193 The distribution (effectively the __random_variate) is said to be 'supported'
 194 over a range that is
 195 [@http://en.wikipedia.org/wiki/Probability_distribution
 196  "the smallest closed set whose complement has probability zero"].
 197 MathWorld uses the word 'defined' for this range.
 198 Non-mathematicians might say it means the 'interesting' smallest range
 199 of random variate x that has the cdf going from zero to unity.
 200 Outside are uninteresting zones where the pdf is zero, and the cdf zero or unity.
 201
 202 For most distributions, with probability distribution functions one might describe
 203 as 'well-behaved', we have decided that it is most useful for the supported range
 204 to *exclude* random variate values like exact zero *if the end point is discontinuous*.
 205 For example, the Weibull (scale 1, shape 1) distribution smoothly heads for unity
 206 as the random variate x declines towards zero.
 207 But at x = zero, the value of the pdf is suddenly exactly zero, by definition.
 208 If you are plotting the PDF, or otherwise calculating,
 209 zero is not the most useful value for the lower limit of supported, as we discovered.
 210 So for this, and similar distributions,
 211 we have decided it is most numerically useful to use
 212 the closest value to zero, min_value, for the limit of the supported range.
 213 (The `range` remains from zero, so you will still get `pdf(weibull, 0) == 0`).
 214 (Exponential and gamma distributions have similarly discontinuous functions).
 215
 216 Mathematically, the functions may make sense with an (+ or -) infinite value,
 217 but except for a few special cases (in the Normal and Cauchy distributions)
 218 this implementation limits random variates to finite values from the `max`
 219 to `min` for the `RealType`.
 220 (See [link math_toolkit.sf_implementation.handling_of_floating_point_infin
 221 Handling of Floating-Point Infinity] for rationale).
 222
 223
 224 [note
 225
 226 [*Discrete Probability Distributions]
 227
 228 Note that the [@http://en.wikipedia.org/wiki/Discrete_probability_distribution
 229 discrete distributions], including the binomial, negative binomial, Poisson & Bernoulli,
 230 are all mathematically defined as discrete functions:
 231 that is to say the functions `cdf` and `pdf` are only defined for integral values
 232 of the random variate.
 233
 234 However, because the method of calculation often uses continuous functions
 235 it is convenient to treat them as if they were continuous functions,
 236 and permit non-integral values of their parameters.
 237
 238 Users wanting to enforce a strict mathematical model may use `floor`
 239 or `ceil` functions on the random variate prior to calling the distribution
 240 function.
 241
 242 The quantile functions for these distributions are hard to specify
 243 in a manner that will satisfy everyone all of the time.  The default
 244 behaviour is to return an integer result, that has been rounded
 245 /outwards/: that is to say, lower quantiles - where the probablity
 246 is less than 0.5 are rounded down, while upper quantiles - where
 247 the probability is greater than 0.5 - are rounded up.  This behaviour
 248 ensures that if an X% quantile is requested, then /at least/ the requested
 249 coverage will be present in the central region, and /no more than/
 250 the requested coverage will be present in the tails.
 251
 252 This behaviour can be changed so that the quantile functions are rounded
 253 differently, or return a real-valued result using
 254 [link math_toolkit.pol_overview Policies].  It is strongly
 255 recommended that you read the tutorial
 256 [link math_toolkit.pol_tutorial.understand_dis_quant
 257 Understanding Quantiles of Discrete Distributions] before
 258 using the quantile function on a discrete distribtion.  The
 259 [link math_toolkit.pol_ref.discrete_quant_ref reference docs]
 260 describe how to change the rounding policy
 261 for these distributions.
 262
 263 For similar reasons continuous distributions with parameters like
 264 "degrees of freedom"
 265 that might appear to be integral, are treated as real values
 266 (and are promoted from integer to floating-point if necessary).
 267 In this case however, there are a small number of situations where non-integral
 268 degrees of freedom do have a genuine meaning.
 269 ]
 270
 271 [endsect] [/ section:generic Generic operations common to all distributions are non-member functions]
 272
 273 [section:complements Complements are supported too - and when to use them]
 274
 275 Often you don't want the value of the CDF, but its complement, which is
 276 to say `1-p` rather than `p`.  It is tempting to calculate the CDF and subtract
 277 it from `1`, but if `p` is very close to `1` then cancellation error
 278 will cause you to lose accuracy, perhaps totally.
 279
 280 [link why_complements See below ['"Why and when to use complements?"]]
 281
 282 In this library, whenever you want to receive a complement, just wrap
 283 all the function arguments in a call to `complement(...)`, for example:
 284
 285    students_t dist(5);
 286    cout << "CDF at t = 1 is " << cdf(dist, 1.0) << endl;
 287    cout << "Complement of CDF at t = 1 is " << cdf(complement(dist, 1.0)) << endl;
 288
 289 But wait, now that we have a complement, we have to be able to use it as well.
 290 Any function that accepts a probability as an argument can also accept a complement
 291 by wrapping all of its arguments in a call to `complement(...)`, for example:
 292
 293    students_t dist(5);
 294
 295    for(double i = 10; i < 1e10; i *= 10)
 296    {
 297       // Calculate the quantile for a 1 in i chance:
 298       double t = quantile(complement(dist, 1/i));
 299       // Print it out:
 300       cout << "Quantile of students-t with 5 degrees of freedom\n"
 301               "for a 1 in " << i << " chance is " << t << endl;
 302    }
 303
 304 [tip
 305
 306 [*Critical values are just quantiles]
 307
 308 Some texts talk about quantiles, or percentiles or fractiles,
 309 others about critical values, the basic rule is:
 310
 311 ['Lower critical values] are the same as the quantile.
 312
 313 ['Upper critical values] are the same as the quantile from the complement
 314 of the probability.
 315
 316 For example, suppose we have a Bernoulli process, giving rise to a binomial
 317 distribution with success ratio 0.1 and 100 trials in total.  The
 318 ['lower critical value] for a probability of 0.05 is given by:
 319
 320 `quantile(binomial(100, 0.1), 0.05)`
 321
 322 and the ['upper critical value] is given by:
 323
 324 `quantile(complement(binomial(100, 0.1), 0.05))`
 325
 326 which return 4.82 and 14.63 respectively.
 327 ]
 328
 329 [#why_complements]
 330 [tip
 331
 332 [*Why bother with complements anyway?]
 333
 334 It's very tempting to dispense with complements, and simply subtract
 335 the probability from 1 when required.  However, consider what happens when
 336 the probability is very close to 1: let's say the probability expressed at
 337 float precision is `0.999999940f`, then `1 - 0.999999940f = 5.96046448e-008`,
 338 but the result is actually accurate to just ['one single bit]: the only
 339 bit that didn't cancel out!
 340
 341 Or to look at this another way: consider that we want the risk of falsely
 342 rejecting the null-hypothesis in the Student's t test to be 1 in 1 billion,
 343 for a sample size of 10,000.
 344 This gives a probability of 1 - 10[super -9], which is exactly 1 when
 345 calculated at float precision.  In this case calculating the quantile from
 346 the complement neatly solves the problem, so for example:
 347
 348 `quantile(complement(students_t(10000), 1e-9))`
 349
 350 returns the expected t-statistic `6.00336`, where as:
 351
 352 `quantile(students_t(10000), 1-1e-9f)`
 353
 354 raises an overflow error, since it is the same as:
 355
 356 `quantile(students_t(10000), 1)`
 357
 358 Which has no finite result.
 359
 360 With all distributions, even for more reasonable probability
 361 (unless the value of p can be represented exactly in the floating-point type)
 362 the loss of accuracy quickly becomes significant if you simply calculate probability from 1 - p
 363 (because it will be mostly garbage digits for p ~ 1).
 364
 365 So always avoid, for example, using a probability near to unity like 0.99999
 366
 367 `quantile(my_distribution, 0.99999)`
 368
 369 and instead use
 370
 371 `quantile(complement(my_distribution, 0.00001))`
 372
 373 since 1 - 0.99999 is not exactly equal to 0.00001 when using floating-point arithmetic.
 374
 375 This assumes that the 0.00001 value is either a constant,
 376 or can be computed by some manner other than subtracting 0.99999 from 1.
 377
 378 ] [/ tip *Why bother with complements anyway?]
 379
 380 [endsect] [/ section:complements Complements are supported too - and why]
 381
 382 [section:parameters Parameters can be calculated]
 383
 384 Sometimes it's the parameters that define the distribution that you
 385 need to find.  Suppose, for example, you have conducted a Students-t test
 386 for equal means and the result is borderline.  Maybe your two samples
 387 differ from each other, or maybe they don't; based on the result
 388 of the test you can't be sure.  A legitimate question to ask then is
 389 "How many more measurements would I have to take before I would get
 390 an X% probability that the difference is real?"  Parameter finders
 391 can answer questions like this, and are necessarily different for
 392 each distribution.  They are implemented as static member functions
 393 of the distributions, for example:
 394
 395    students_t::find_degrees_of_freedom(
 396       1.3,        // difference from true mean to detect
 397       0.05,       // maximum risk of falsely rejecting the null-hypothesis.
 398       0.1,        // maximum risk of falsely failing to reject the null-hypothesis.
 399       0.13);      // sample standard deviation
 400
 401 Returns the number of degrees of freedom required to obtain a 95%
 402 probability that the observed differences in means is not down to
 403 chance alone.  In the case that a borderline Students-t test result
 404 was previously obtained, this can be used to estimate how large the sample size
 405 would have to become before the observed difference was considered
 406 significant.  It assumes, of course, that the sample mean and standard
 407 deviation are invariant with sample size.
 408
 409 [endsect] [/ section:parameters Parameters can be calculated]
 410
 411 [section:summary Summary]
 412
 413 * Distributions are objects, which are constructed from whatever
 414 parameters the distribution may have.
 415 * Member functions allow you to retrieve the parameters of a distribution.
 416 * Generic non-member functions provide access to the properties that
 417 are common to all the distributions (PDF, CDF, quantile etc).
 418 * Complements of probabilities are calculated by wrapping the function's
 419 arguments in a call to `complement(...)`.
 420 * Functions that accept a probability can accept a complement of the
 421 probability as well, by wrapping the function's
 422 arguments in a call to `complement(...)`.
 423 * Static member functions allow the parameters of a distribution
 424 to be found from other information.
 425
 426 Now that you have the basics, the next section looks at some worked examples.
 427
 428 [endsect] [/section:summary Summary]
 429 [endsect] [/section:overview Overview]
 430
 431 [section:weg Worked Examples]
 432 [include distribution_construction.qbk]
 433 [include students_t_examples.qbk]
 434 [include chi_squared_examples.qbk]
 435 [include f_dist_example.qbk]
 436 [include binomial_example.qbk]
 437 [include geometric_example.qbk]
 438 [include negative_binomial_example.qbk]
 439 [include normal_example.qbk]
 440 [/include inverse_gamma_example.qbk]
 441 [/include inverse_gaussian_example.qbk]
 442 [include inverse_chi_squared_eg.qbk]
 443 [include nc_chi_squared_example.qbk]
 444 [include error_handling_example.qbk]
 445 [include find_location_and_scale.qbk]
 446 [include nag_library.qbk]
 447 [include c_sharp.qbk]
 448 [endsect] [/section:weg Worked Examples]
 449
 450 [include background.qbk]
 451
 452 [endsect] [/ section:stat_tut Statistical Distributions Tutorial]
 453
 454 [/ dist_tutorial.qbk
 455   Copyright 2006, 2010, 2011 John Maddock and Paul A. Bristow.
 456   Distributed under the Boost Software License, Version 1.0.
 457   (See accompanying file LICENSE_1_0.txt or copy at
 458   http://www.boost.org/LICENSE_1_0.txt).
 459 ]
 460