libs/math/doc/statistics/univariate_statistics.qbk

[/
  Copyright 2018 Nick Thompson

  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt).
]

[section:univariate_statistics Univariate Statistics]

[heading Synopsis]

``
#include <boost/math/statistics/univariate_statistics.hpp>

namespace boost{ namespace math{ namespace statistics {

    template<class Container>
    auto mean(Container const & c);

    template<class ForwardIterator>
    auto mean(ForwardIterator first, ForwardIterator last);

    template<class Container>
    auto variance(Container const & c);

    template<class ForwardIterator>
    auto variance(ForwardIterator first, ForwardIterator last);

    template<class Container>
    auto sample_variance(Container const & c);

    template<class ForwardIterator>
    auto sample_variance(ForwardIterator first, ForwardIterator last);

    template<class Container>
    auto mean_and_sample_variance(Container const & c);

    template<class Container>
    auto skewness(Container const & c);

    template<class ForwardIterator>
    auto skewness(ForwardIterator first, ForwardIterator last);

    template<class Container>
    auto kurtosis(Container const & c);

    template<class ForwardIterator>
    auto kurtosis(ForwardIterator first, ForwardIterator last);

    template<class Container>
    auto excess_kurtosis(Container const & c);

    template<class ForwardIterator>
    auto excess_kurtosis(ForwardIterator first, ForwardIterator last);

    template<class Container>
    auto first_four_moments(Container const & c);

    template<class ForwardIterator>
    auto first_four_moments(ForwardIterator first, ForwardIterator last);

    template<class Container>
    auto median(Container & c);

    template<class RandomAccessIterator>
    auto median(RandomAccessIterator first, RandomAccessIterator last);

    template<class RandomAccessIterator>
    auto median_absolute_deviation(RandomAccessIterator first, RandomAccessIterator last, typename std::iterator_traits<RandomAccessIterator>::value_type center=std::numeric_limits<typename std::iterator_traits<RandomAccessIterator>::value_type>::quiet_NaN());

    template<class RandomAccessContainer>
    auto median_absolute_deviation(RandomAccessContainer & v, typename RandomAccessContainer::value_type center=std::numeric_limits<typename RandomAccessContainer::value_type>::quiet_NaN());

    template<class Container>
    auto gini_coefficient(Container & c);

    template<class RandomAccessIterator>
    auto gini_coefficient(RandomAccessIterator first, RandomAccessIterator last);

    template<class Container>
    auto sample_gini_coefficient(Container & c);

    template<class RandomAccessIterator>
    auto sample_gini_coefficient(RandomAccessIterator first, RandomAccessIterator last);

}}}
``

[heading Description]

The file `boost/math/statistics/univariate_statistics.hpp` is a set of facilities for computing scalar values from vectors.

Many of these functionals have trivial naive implementations, but experienced programmers will recognize that even trivial algorithms are easy to screw up, and that numerical instabilities often lurk in corner cases.
We have attempted to do our "due diligence" to root out these problems, scouring the literature for numerically stable algorithms for even the simplest of functionals.

/Nota bene/: Some similar functionality is provided in the [@https://www.boost.org/doc/libs/1_68_0/doc/html/accumulators/user_s_guide.html Boost Accumulators Framework].
These accumulators should be used in real-time applications; `univariate_statistics.hpp` should be used when CPU vectorization is needed.
Remember that to actually /get/ vectorization, you must compile with flags such as `-march=native -O3`.

We now describe each functional in detail.
Our examples use `std::vector<double>` to hold the data, but this is not required.
In general, you can store your data in an Eigen array, an Armadillo vector, a `std::array`, and, for many of the routines, a `std::forward_list`.
These routines are usable in `float`, `double`, `long double`, and Boost.Multiprecision precision, as well as their complex extensions whenever the computation is well-defined.
For certain operations (total variation, for example) integer inputs are supported.

[heading Mean]

    std::vector<double> v{1,2,3,4,5};
    double mu = boost::math::statistics::mean(v.cbegin(), v.cend());
    // Alternative syntax if you want to use the entire container:
    mu = boost::math::statistics::mean(v);

The implementation follows [@https://doi.org/10.1137/1.9780898718027 Higham 1.6a].
The data is not modified and must be forward iterable.
The routine works with real and integer data.
If the input is an integer type, the output is a double precision float.
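
To illustrate the kind of update the cited recurrence describes, here is a minimal standalone sketch of a running mean, `M_k = M_{k-1} + (x_k - M_{k-1})/k`. The function name `running_mean` is hypothetical and is not part of Boost.Math, and we assume, rather than claim, that the library's own loop has this shape.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Boost.Math): running mean computed with
// the update M_k = M_{k-1} + (x_k - M_{k-1}) / k.
double running_mean(std::vector<double> const & v)
{
    assert(!v.empty());  // the mean of an empty range is undefined
    double mu = 0;
    double k = 0;
    for (double x : v)
    {
        k += 1;
        mu += (x - mu) / k;  // the full sum of the data is never formed
    }
    return mu;
}
```

Because the accumulator always holds a value of the same magnitude as the data, this form avoids the overflow that a plain sum followed by a division can suffer on long inputs.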

[heading Variance]

    std::vector<double> v{1,2,3,4,5};
    double sigma_sq = boost::math::statistics::variance(v.cbegin(), v.cend());

If you don't need to calculate on a subset of the input, then the range call is more terse:

    std::vector<double> v{1,2,3,4,5};
    double sigma_sq = boost::math::statistics::variance(v);

The implementation follows [@https://doi.org/10.1137/1.9780898718027 Higham 1.6b].
The input data must be forward iterable and the range `[first, last)` must contain at least two elements.
It is /not/ in general sensible to pass complex numbers to this routine.
If integers are passed as input, then the output is a double precision float.

`boost::math::statistics::variance` returns the population variance.
If you want a sample variance, use

    std::vector<double> v{1,2,3,4,5};
    double sn_sq = boost::math::statistics::sample_variance(v);

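The difference between the population and sample variance is only the final divisor. The following standalone sketch makes that concrete with a one-pass update of the sum of squared deviations, `Q_k = Q_{k-1} + (k-1)(x_k - M_{k-1})^2/k`. All three function names are hypothetical, and we assume rather than assert that the library's loop is organized this way.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Boost.Math): one-pass sum of squared
// deviations from the mean, updated together with the running mean.
double sum_of_squared_deviations(std::vector<double> const & v)
{
    assert(!v.empty());
    double mean = 0, q = 0, k = 0;
    for (double x : v)
    {
        k += 1;
        double delta = x - mean;           // x_k - M_{k-1}
        q += (k - 1) * delta * delta / k;  // Q_k update
        mean += delta / k;                 // M_k update
    }
    return q;
}

// Population variance: divide by n.
double population_variance_sketch(std::vector<double> const & v)
{
    return sum_of_squared_deviations(v) / static_cast<double>(v.size());
}

// Sample variance: divide by n - 1, so at least two samples are needed.
double sample_variance_sketch(std::vector<double> const & v)
{
    assert(v.size() > 1);
    return sum_of_squared_deviations(v) / static_cast<double>(v.size() - 1);
}
```

For the data `{1,2,3,4,5}` the sum of squared deviations is 10, so the population variance is 2 and the sample variance is 5/2.
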

[heading Skewness]

Computes the skewness of a dataset:

    std::vector<double> v{1,2,3,4,5};
    double skewness = boost::math::statistics::skewness(v);
    // skewness = 0.

The input vector is not modified, and the routine works with integral and real data.
If the input data is integral, the output is a double precision float.

For a dataset consisting of a single constant value, we take the skewness to be zero by definition.

The implementation follows [@https://prod.sandia.gov/techlib-noauth/access-control.cgi/2008/086212.pdf Pébay].
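
As a reference point for checking results, here is a plain two-pass sketch of the textbook definition, `m3 / m2^(3/2)`, where `m2` and `m3` are central moments normalized by the number of samples. This is deliberately /not/ the one-pass method of the cited report; the name `skewness_sketch` is hypothetical and the normalization is our assumption.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical two-pass reference (not the library's one-pass algorithm):
// skewness = m3 / m2^(3/2), with central moments normalized by n.
double skewness_sketch(std::vector<double> const & v)
{
    assert(!v.empty());
    double n = static_cast<double>(v.size());
    double sum = 0;
    for (double x : v) sum += x;
    double mean = sum / n;

    double s2 = 0, s3 = 0;
    for (double x : v)
    {
        double d = x - mean;
        s2 += d * d;
        s3 += d * d * d;
    }
    double m2 = s2 / n;
    double m3 = s3 / n;
    if (m2 == 0) return 0;  // constant data: skewness taken to be zero
    return m3 / (m2 * std::sqrt(m2));
}
```

A symmetric dataset such as `{1,2,3,4,5}` gives zero, while a dataset with a long right tail such as `{1,1,1,5}` gives a positive value.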

[heading Kurtosis]

Computes the kurtosis of a dataset:

    std::vector<double> v{1,2,3,4,5};
    double kurtosis = boost::math::statistics::kurtosis(v);
    // kurtosis = 17/10

The implementation follows [@https://prod.sandia.gov/techlib-noauth/access-control.cgi/2008/086212.pdf Pébay].
The input data must be forward iterable and must consist of real or integral values.
If the input data is integral, the output is a double precision float.
Note that this is /not/ the excess kurtosis.
If you require the excess kurtosis, use `boost::math::statistics::excess_kurtosis`.
This function simply subtracts 3 from the kurtosis, but it makes our definition of kurtosis eminently clear.
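
The value 17/10 in the example can be reproduced by hand from the definition `m4 / m2^2`. The two-pass sketch below does exactly that; it is a reference calculation with hypothetical names, not the one-pass method of the cited report, and its treatment of constant data (an assertion) is our own choice.

```cpp
#include <cassert>
#include <vector>

// Hypothetical two-pass reference: kurtosis = m4 / m2^2, with central
// moments normalized by n.
double kurtosis_sketch(std::vector<double> const & v)
{
    assert(!v.empty());
    double n = static_cast<double>(v.size());
    double sum = 0;
    for (double x : v) sum += x;
    double mean = sum / n;

    double s2 = 0, s4 = 0;
    for (double x : v)
    {
        double d2 = (x - mean) * (x - mean);
        s2 += d2;
        s4 += d2 * d2;
    }
    assert(s2 > 0);  // constant data leaves the ratio undefined in this sketch
    double m2 = s2 / n;
    double m4 = s4 / n;
    return m4 / (m2 * m2);
}

// Excess kurtosis: the kurtosis minus 3, the kurtosis of a normal distribution.
double excess_kurtosis_sketch(std::vector<double> const & v)
{
    return kurtosis_sketch(v) - 3;
}
```

For `{1,2,3,4,5}` we have `m2 = 2` and `m4 = 34/5`, so the kurtosis is 17/10 and the excess kurtosis is -13/10.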
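
[heading First four moments]

Simultaneously computes the first four [@https://en.wikipedia.org/wiki/Central_moment central moments] in a single pass through the data:

    std::vector<double> v{1,2,3,4,5};
    auto [M1, M2, M3, M4] = boost::math::statistics::first_four_moments(v);

Here `M1` is the mean (the first central moment is identically zero), and `M2`, `M3`, and `M4` are the second through fourth central moments.
The structured binding used in this example requires C++17.
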
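To show what a single pass through the data can look like, here is a standalone sketch using the well-known single-sample update formulas of the kind derived in the Pébay report. The name `first_four_moments_sketch` is hypothetical, and the final division by the sample count is our assumption about the normalization.

```cpp
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical one-pass sketch: returns the mean and the second, third and
// fourth central moments, each normalized by n.
std::tuple<double, double, double, double>
first_four_moments_sketch(std::vector<double> const & v)
{
    assert(!v.empty());
    double mean = 0, s2 = 0, s3 = 0, s4 = 0, n = 0;
    for (double x : v)
    {
        n += 1;
        double delta = x - mean;
        double delta_n = delta / n;
        double delta_n2 = delta_n * delta_n;
        double term = delta * delta_n * (n - 1);
        mean += delta_n;
        // Order matters: s4 needs the old s2 and s3, and s3 needs the old s2.
        s4 += term * delta_n2 * (n * n - 3 * n + 3)
              + 6 * delta_n2 * s2 - 4 * delta_n * s3;
        s3 += term * delta_n * (n - 2) - 3 * delta_n * s2;
        s2 += term;
    }
    return std::make_tuple(mean, s2 / n, s3 / n, s4 / n);
}
```

For `{1,2,3,4,5}` the sketch yields a mean of 3 and central moments 2, 0, and 34/5, which agrees with the skewness and kurtosis examples above.
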

[heading Median]

Computes the median of a dataset:

    std::vector<double> v{1,2,3,4,5};
    double m = boost::math::statistics::median(v.begin(), v.end());

/Nota bene: The input vector is modified./
The calculation of the median is a thin wrapper around [@https://en.cppreference.com/w/cpp/algorithm/nth_element `std::nth_element`].
Therefore, all requirements of `std::nth_element` are inherited by the median calculation.
In particular, the container must allow random access.
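
The following standalone sketch shows why a median built on `std::nth_element` reorders its input and needs random access. The name `median_sketch` is hypothetical, and the convention for an even number of samples (averaging the two middle values) is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Boost.Math). The argument is taken by
// non-const reference because std::nth_element reorders it.
double median_sketch(std::vector<double> & v)
{
    assert(!v.empty());
    auto mid = v.begin() + v.size() / 2;  // iterator arithmetic needs random access
    std::nth_element(v.begin(), mid, v.end());
    double upper = *mid;
    if (v.size() % 2 == 1)
    {
        return upper;
    }
    // Even count: everything left of mid is <= *mid, so the lower middle
    // value is the largest element of that left part.
    double lower = *std::max_element(v.begin(), mid);
    return (lower + upper) / 2;
}
```

If the original ordering matters, pass a copy of the data.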

[heading Median Absolute Deviation]

Computes the [@https://en.wikipedia.org/wiki/Median_absolute_deviation median absolute deviation] of a dataset:

    std::vector<double> v{1,2,3,4,5};
    double mad = boost::math::statistics::median_absolute_deviation(v);

By default, the deviation from the median is used.
If you have some prior that the median is zero, or wish to compute the median absolute deviation from the mean,
use the following:

    // prior is that center is zero:
    double center = 0;
    double mad_zero = boost::math::statistics::median_absolute_deviation(v, center);

    // compute median absolute deviation from the mean:
    double mu = boost::math::statistics::mean(v);
    double mad_mean = boost::math::statistics::median_absolute_deviation(v, mu);

/Nota bene:/ The input vector is modified.
Again, the vector is passed into a call to [@https://en.cppreference.com/w/cpp/algorithm/nth_element `nth_element`].
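
To make the role of the `center` argument concrete, here is a standalone sketch of the definition: the median of the absolute deviations from a chosen center. The names `median_of` and `mad_sketch` are hypothetical. Unlike the library routine, this sketch copies the deviations into scratch storage and leaves its input untouched.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: median of a scratch vector (reorders its argument).
// Averaging the two middle values for an even count is an assumption.
double median_of(std::vector<double> & w)
{
    assert(!w.empty());
    auto mid = w.begin() + w.size() / 2;
    std::nth_element(w.begin(), mid, w.end());
    if (w.size() % 2 == 1) return *mid;
    return (*std::max_element(w.begin(), mid) + *mid) / 2;
}

// Median of |x - center| over the data.
double mad_sketch(std::vector<double> const & v, double center)
{
    assert(!v.empty());
    std::vector<double> deviations;
    deviations.reserve(v.size());
    for (double x : v) deviations.push_back(std::abs(x - center));
    return median_of(deviations);
}

// Default: measure deviations from the median of the data itself.
double mad_sketch(std::vector<double> const & v)
{
    std::vector<double> scratch(v);
    return mad_sketch(v, median_of(scratch));
}
```

For `{1,2,3,4,5}` the deviations from the median are `{2,1,0,1,2}`, whose median is 1; with a center of zero the deviations are the data themselves and the result is 3.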

[heading Gini Coefficient]

Computes the Gini coefficient of a dataset:

    std::vector<double> v{1,0,0,0};
    double gini = boost::math::statistics::gini_coefficient(v);
    // gini = 3/4
    double s_gini = boost::math::statistics::sample_gini_coefficient(v);
    // s_gini = 1.
    std::vector<double> w{1,1,1,1};
    gini = boost::math::statistics::gini_coefficient(w.begin(), w.end());
    // gini = 0, as all elements are now equal.

/Nota bene/: The input data is altered: in particular, it is sorted. The routine calls `std::sort`, and as such requires random access iterators.

The sample Gini coefficient lies in the range [0,1], whereas the population Gini coefficient is in the range [0, 1 - 1/ /n/].
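
The two ranges differ by a factor of `n/(n-1)`, which the standalone sketch below makes explicit. It uses a common sorted-data formula for the population coefficient, `2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1)/n` with the index `i` starting at 1. The function names are hypothetical, and returning zero for all-zero data is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Boost.Math). Sorts its argument in place.
double gini_sketch(std::vector<double> & v)
{
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    double weighted = 0, total = 0, i = 1;
    for (double x : v)
    {
        weighted += i * x;  // rank-weighted sum over the sorted data
        total += x;
        i += 1;
    }
    double n = static_cast<double>(v.size());
    if (total == 0) return 0;  // assumption: all-zero data is perfectly equal
    return 2 * weighted / (n * total) - (n + 1) / n;
}

// Sample coefficient: rescale by n / (n - 1) so the maximum becomes 1.
double sample_gini_sketch(std::vector<double> & v)
{
    assert(v.size() > 1);
    double n = static_cast<double>(v.size());
    return gini_sketch(v) * n / (n - 1);
}
```

For `{1,0,0,0}` the population value is 3/4, the upper bound `1 - 1/n` for four samples, and the rescaled sample value is 1.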

/Nota bene:/ There is essentially no reason to pass negative values to the Gini coefficient function.
However, a use case (measuring wealth inequality when some people have negative wealth) exists, so we do not throw an exception when negative values are encountered.
You should have /very/ good cause to pass negative values to the Gini coefficient calculator.
Another use case is found in signal processing, but there the sorting is by magnitude, and hence it has a different implementation.
See `absolute_gini_coefficient` for details.

[heading References]

* Higham, Nicholas J. ['Accuracy and Stability of Numerical Algorithms.] Vol. 80. SIAM, 2002.
* Pébay, Philippe P. ["Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments.] Technical Report SAND2008-6212, Sandia National Laboratories, September 2008.

[endsect]
[/section:univariate_statistics Univariate Statistics]