libs/math/doc/vector_functionals/univariate_statistics.qbk

   1 [/
   2   Copyright 2018 Nick Thompson
   3
   4   Distributed under the Boost Software License, Version 1.0.
   5   (See accompanying file LICENSE_1_0.txt or copy at
   6   http://www.boost.org/LICENSE_1_0.txt).
   7 ]
   8
   9 [section:univariate_statistics Univariate Statistics]
  10
  11 [heading Synopsis]
  12
  13 ``
  14 #include <boost/math/tools/univariate_statistics.hpp>
  15
  16 namespace boost{ namespace math{ namespace tools {
  17
  18     template<class Container>
  19     auto mean(Container const & c);
  20
  21     template<class ForwardIterator>
  22     auto mean(ForwardIterator first, ForwardIterator last);
  23
  24     template<class Container>
  25     auto variance(Container const & c);
  26
  27     template<class ForwardIterator>
  28     auto variance(ForwardIterator first, ForwardIterator last);
  29
  30     template<class Container>
  31     auto sample_variance(Container const & c);
  32
  33     template<class ForwardIterator>
  34     auto sample_variance(ForwardIterator first, ForwardIterator last);
  35
  36     template<class Container>
  37     auto skewness(Container const & c);
  38
  39     template<class ForwardIterator>
  40     auto skewness(ForwardIterator first, ForwardIterator last);
  41
  42     template<class Container>
  43     auto kurtosis(Container const & c);
  44
  45     template<class ForwardIterator>
  46     auto kurtosis(ForwardIterator first, ForwardIterator last);
  47
  48     template<class Container>
  49     auto excess_kurtosis(Container const & c);
  50
  51     template<class ForwardIterator>
  52     auto excess_kurtosis(ForwardIterator first, ForwardIterator last);
  53
  54     template<class Container>
  55     auto first_four_moments(Container const & c);
  56
  57     template<class ForwardIterator>
  58     auto first_four_moments(ForwardIterator first, ForwardIterator last);
  59
  60     template<class Container>
  61     auto median(Container & c);
  62
  63     template<class RandomAccessIterator>
  64     auto median(RandomAccessIterator first, RandomAccessIterator last);
  65
  66     template<class RandomAccessIterator>
  67     auto median_absolute_deviation(RandomAccessIterator first, RandomAccessIterator last, typename std::iterator_traits<RandomAccessIterator>::value_type center=std::numeric_limits<typename std::iterator_traits<RandomAccessIterator>::value_type>::quiet_NaN());
  68
  69     template<class RandomAccessContainer>
  70     auto median_absolute_deviation(RandomAccessContainer & v, typename RandomAccessContainer::value_type center=std::numeric_limits<typename RandomAccessContainer::value_type>::quiet_NaN());
  71
  72     template<class Container>
  73     auto gini_coefficient(Container & c);
  74
  75     template<class RandomAccessIterator>
  76     auto gini_coefficient(RandomAccessIterator first, RandomAccessIterator last);
  77
  78     template<class Container>
  79     auto sample_gini_coefficient(Container & c);
  80
  81     template<class RandomAccessIterator>
  82     auto sample_gini_coefficient(RandomAccessIterator first, RandomAccessIterator last);
  83
  84 }}}
  85 ``
  86
  87 [heading Description]
  88
  89 The file `boost/math/tools/univariate_statistics.hpp` is a set of facilities for computing scalar values from vectors.
  90
  91 Many of these functionals have trivial naive implementations, but experienced programmers will recognize that even trivial algorithms are easy to screw up, and that numerical instabilities often lurk in corner cases.
  92 We have attempted to do our "due diligence" to root out these problems, scouring the literature for numerically stable algorithms for even the simplest of functionals.
  93
  94 /Nota bene/: Some similar functionality is provided by the [@https://www.boost.org/doc/libs/1_68_0/doc/html/accumulators/user_s_guide.html Boost Accumulators Framework].
  95 These accumulators should be used in real-time applications; `univariate_statistics.hpp` should be used when CPU vectorization is needed.
  96 Remember that to actually /get/ vectorization, you must compile with the `-march=native -O3` flags.
  97
  98 We now describe each functional in detail.
  99 Our examples use `std::vector<double>` to hold the data, but this is not required.
 100 In general, you can store your data in an Eigen array, an Armadillo vector, a `std::array`, and, for many of the routines, a `std::forward_list`.
 101 These routines are usable in float, double, long double, and Boost.Multiprecision precision, as well as their complex extensions whenever the computation is well-defined.
 102 For certain operations (total variation, for example) integer inputs are supported.
 103
 104 [heading Mean]
 105
 106     std::vector<double> v{1,2,3,4,5};
 107     double mu = boost::math::tools::mean(v.cbegin(), v.cend());
 108     // Alternative syntax if you want to use entire container:
 109     mu = boost::math::tools::mean(v);
 110
 111 The implementation follows [@https://doi.org/10.1137/1.9780898718027 Higham 1.6a].
 112 The data is not modified and must be forward iterable.
 113 The routine works with real and integer data.
 114 If the input is an integer type, the output is a double precision float.
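
As a rough illustration of why a running update can be preferable to a naive sum, the following standard-library-only sketch applies the recurrence M_k = M_{k-1} + (x_k - M_{k-1})/k. The name `running_mean` is hypothetical and the code is not the Boost.Math implementation; it assumes the cited recurrence is this usual running-mean update.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration; not part of Boost.Math.
// Applies the running update M_k = M_{k-1} + (x_k - M_{k-1}) / k,
// so the accumulator is an average of the samples seen so far
// rather than a raw sum that could overflow.
template <class ForwardIterator>
double running_mean(ForwardIterator first, ForwardIterator last)
{
    assert(first != last); // at least one sample is required
    double mu = 0;
    std::size_t k = 0;
    for (auto it = first; it != last; ++it)
    {
        ++k;
        mu += (static_cast<double>(*it) - mu) / static_cast<double>(k);
    }
    return mu;
}
```

For the data `{1,2,3,4,5}` this returns 3. For two samples of `1e308` the result stays finite, whereas summing first and dividing afterwards would overflow.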
 115
 116 [heading Variance]
 117
 118     std::vector<double> v{1,2,3,4,5};
 119     double sigma_sq = boost::math::tools::variance(v.cbegin(), v.cend());
 120
 121 If you don't need to calculate on a subset of the input, then the container overload is more terse:
 122
 123     std::vector<double> v{1,2,3,4,5};
 124     double sigma_sq = boost::math::tools::variance(v);
 125
 126 The implementation follows [@https://doi.org/10.1137/1.9780898718027 Higham 1.6b].
 127 The input data must be forward iterable and the range `[first, last)` must contain at least two elements.
 128 It is /not/ in general sensible to pass complex numbers to this routine.
 129 If integers are passed as input, then the output is a double precision float.
 130
 131 `boost::math::tools::variance` returns the population variance.
 132 If you want a sample variance, use
 133
 134     std::vector<double> v{1,2,3,4,5};
 135     double sn_sq = boost::math::tools::sample_variance(v);
 136
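
The difference between the population and sample variance is only the final divisor (/n/ versus /n/ - 1). A minimal sketch of a one-pass update that yields both, assuming the cited recurrence is Q_k = Q_{k-1} + (k-1)(x_k - M_{k-1})^2/k, is shown here. The names `variance_pair` and `running_variance` are hypothetical and the code does not call Boost.Math.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for illustration; not part of Boost.Math.
struct variance_pair
{
    double population; // Q_n / n
    double sample;     // Q_n / (n - 1)
};

// One-pass update of the running mean M_k and the sum of squared
// deviations Q_k = Q_{k-1} + (k-1) (x_k - M_{k-1})^2 / k.
template <class ForwardIterator>
variance_pair running_variance(ForwardIterator first, ForwardIterator last)
{
    double mean = 0;
    double q = 0;
    std::size_t k = 0;
    for (auto it = first; it != last; ++it)
    {
        ++k;
        const double kd = static_cast<double>(k);
        const double delta = static_cast<double>(*it) - mean; // uses the previous mean
        q += (kd - 1) * delta * delta / kd;
        mean += delta / kd;
    }
    assert(k >= 2); // the sample variance needs at least two samples
    const double n = static_cast<double>(k);
    return {q / n, q / (n - 1)};
}
```

For the data `{1,2,3,4,5}` the population variance is 2 and the sample variance is 5/2.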
 137
 138 [heading Skewness]
 139
 140 Computes the skewness of a dataset:
 141
 142     std::vector<double> v{1,2,3,4,5};
 143     double skewness = boost::math::tools::skewness(v);
 144     // skewness = 0.
 145
 146 The input vector is not modified, and the routine works with integral and real data.
 147 If the input data is integral, the output is a double precision float.
 148
 149 For a dataset consisting of a single constant value, we take the skewness to be zero by definition.
 150
 151 The implementation follows [@https://prod.sandia.gov/techlib-noauth/access-control.cgi/2008/086212.pdf Pebay].
 152
 153 [heading Kurtosis]
 154
 155 Computes the kurtosis of a dataset:
 156
 157     std::vector<double> v{1,2,3,4,5};
 158     double kurtosis = boost::math::tools::kurtosis(v);
 159     // kurtosis = 17/10
 160
 161 The implementation follows [@https://prod.sandia.gov/techlib-noauth/access-control.cgi/2008/086212.pdf Pebay].
 162 The input data must be forward iterable and must consist of real or integral values.
 163 If the input data is integral, the output is a double precision float.
 164 Note that this is /not/ the excess kurtosis.
 165 If you require the excess kurtosis, use `boost::math::tools::excess_kurtosis`.
 166 This function simply subtracts 3 from the kurtosis, but it makes eminently clear our definition of kurtosis.
 167
 168 [heading First four moments]
 169
 170 Simultaneously computes the first four [@https://en.wikipedia.org/wiki/Central_moment central moments] in a single pass through the data:
 171
 172     std::vector<double> v{1,2,3,4,5};
 173     auto [M1, M2, M3, M4] = boost::math::tools::first_four_moments(v);
 174
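
To make the single-pass idea concrete, here is a sketch of the widely used one-pass update for the mean and the second, third, and fourth central sums, from which skewness and kurtosis follow. It uses only the standard library; the names `moments`, `one_pass_moments`, `skewness_of`, and `kurtosis_of` are hypothetical, and we do not claim this matches the library's code or return conventions line for line.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical result type: the mean plus the central moments
// m2, m3, m4 (each already divided by the number of samples).
struct moments
{
    double mean, m2, m3, m4;
};

template <class ForwardIterator>
moments one_pass_moments(ForwardIterator first, ForwardIterator last)
{
    assert(first != last);
    double M1 = 0, M2 = 0, M3 = 0, M4 = 0; // running mean and central sums
    double n = 0;
    for (auto it = first; it != last; ++it)
    {
        const double n1 = n; // sample count before this update
        n += 1;
        const double delta = static_cast<double>(*it) - M1;
        const double delta_n = delta / n;
        const double delta_n2 = delta_n * delta_n;
        const double term1 = delta * delta_n * n1;
        M1 += delta_n;
        // M4 and M3 must be updated before M2, since they use its old value.
        M4 += term1 * delta_n2 * (n * n - 3 * n + 3) + 6 * delta_n2 * M2 - 4 * delta_n * M3;
        M3 += term1 * delta_n * (n - 2) - 3 * delta_n * M2;
        M2 += term1;
    }
    return {M1, M2 / n, M3 / n, M4 / n};
}

// Skewness and (non-excess) kurtosis; both assume m2 > 0.
inline double skewness_of(const moments& m) { return m.m3 / std::pow(m.m2, 1.5); }
inline double kurtosis_of(const moments& m) { return m.m4 / (m.m2 * m.m2); }
```

For the data `{1,2,3,4,5}` this gives a mean of 3, a skewness of 0, and a kurtosis of 17/10, in agreement with the values quoted in the sections above; the excess kurtosis is then 17/10 - 3.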
 175
 176 [heading Median]
 177
 178 Computes the median of a dataset:
 179
 180     std::vector<double> v{1,2,3,4,5};
 181     double m = boost::math::tools::median(v.begin(), v.end());
 182
 183 /Nota bene: The input vector is modified./
 184 The calculation of the median is a thin wrapper around [@https://en.cppreference.com/w/cpp/algorithm/nth_element `std::nth_element`].
 185 Therefore, all requirements of `std::nth_element` are inherited by the median calculation.
 186 In particular, the container must allow random access.
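
The following sketch shows how a median can be built on `std::nth_element`, and why the input is reordered and must be random access. The name `median_sketch` is hypothetical, and the handling of even-length input (averaging the two middle values, found with `std::max_element`) is our own choice for illustration, not necessarily what the library does.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <iterator>
#include <vector>

// Hypothetical helper for illustration; not part of Boost.Math.
// Reorders [first, last) in place, exactly as std::nth_element does.
template <class RandomAccessIterator>
double median_sketch(RandomAccessIterator first, RandomAccessIterator last)
{
    const auto n = std::distance(first, last);
    assert(n > 0);
    RandomAccessIterator mid = first + n / 2;
    // After this call *mid is the element that would sit at position n/2
    // in sorted order, and everything before mid is <= *mid.
    std::nth_element(first, mid, last);
    if (n % 2 == 1)
    {
        return static_cast<double>(*mid);
    }
    // Even length: the lower middle value is the largest element in [first, mid).
    const auto lower = std::max_element(first, mid);
    return (static_cast<double>(*lower) + static_cast<double>(*mid)) / 2;
}
```

For `{5,1,4,2,3}` this returns 3, and for `{4,1,3,2}` it returns 5/2.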
 187
 188 [heading Median Absolute Deviation]
 189
 190 Computes the [@https://en.wikipedia.org/wiki/Median_absolute_deviation median absolute deviation] of a dataset:
 191
 192     std::vector<double> v{1,2,3,4,5};
 193     double mad = boost::math::tools::median_absolute_deviation(v);
 194
 195 By default, the deviation from the median is used.
 196 If you have some prior that the median is zero, or wish to compute the median absolute deviation from the mean,
 197 use the following:
 198
 199     // prior is that center is zero:
 200     double center = 0;
 201     double mad_about_zero = boost::math::tools::median_absolute_deviation(v, center);
 202
 203     // compute median absolute deviation from the mean:
 204     double mu = boost::math::tools::mean(v);
 205     double mad_about_mean = boost::math::tools::median_absolute_deviation(v, mu);
 206
 207 /Nota bene:/ The input vector is modified.
 208 Again the vector is passed into a call to [@https://en.cppreference.com/w/cpp/algorithm/nth_element `nth_element`].
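
A sketch of the idea, using only the standard library: if no center is supplied (signalled here by a NaN default, mirroring the synopsis), the median is computed first, and then the median of the absolute deviations is selected with `std::nth_element` and a custom comparison. The names `median_by` and `mad_sketch` are hypothetical, and the details are assumptions rather than the library's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <iterator>
#include <limits>
#include <vector>

// Hypothetical helper: median of key(x) over [first, last).
// The range is reordered, but the stored values are not changed.
template <class RandomAccessIterator, class Key>
double median_by(RandomAccessIterator first, RandomAccessIterator last, Key key)
{
    const auto n = std::distance(first, last);
    assert(n > 0);
    auto less = [&key](double a, double b) { return key(a) < key(b); };
    RandomAccessIterator mid = first + n / 2;
    std::nth_element(first, mid, last, less);
    const double upper = key(*mid);
    if (n % 2 == 1)
    {
        return upper;
    }
    // Even length: average the two middle keys.
    const double lower = key(*std::max_element(first, mid, less));
    return (lower + upper) / 2;
}

// Hypothetical helper: median absolute deviation about `center`,
// or about the median when `center` is NaN.
inline double mad_sketch(std::vector<double>& v,
                         double center = std::numeric_limits<double>::quiet_NaN())
{
    if (std::isnan(center))
    {
        center = median_by(v.begin(), v.end(), [](double x) { return x; });
    }
    return median_by(v.begin(), v.end(),
                     [center](double x) { return std::abs(x - center); });
}
```

For `{1,2,3,4,5}` the deviation about the median (3) is 1, while the deviation about a center of 0 is 3.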
 209
 210 [heading Gini Coefficient]
 211
 212 Compute the Gini coefficient of a dataset:
 213
 214     std::vector<double> v{1,0,0,0};
 215     double gini = boost::math::tools::gini_coefficient(v);
 216     // gini = 3/4
 217     double s_gini = boost::math::tools::sample_gini_coefficient(v);
 218     // s_gini = 1.
 219     std::vector<double> w{1,1,1,1};
 220     gini = boost::math::tools::gini_coefficient(w.begin(), w.end());
 221     // gini = 0, as all elements are now equal.
 222
 223 /Nota bene/: The input data is altered: in particular, it is sorted. The routine calls `std::sort`, and as such requires random access iterators.
 224
 225 The sample Gini coefficient lies in the range [0,1], whereas the population Gini coefficient is in the range [0, 1 - 1/ /n/].
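
The relationship between the two coefficients can be sketched as follows, assuming the common sorted-data formula G = sum_i (2i - n - 1) x_i / (n sum_i x_i) with 1-based index i, and a sample version obtained by multiplying by n/(n - 1). The names `gini_sketch` and `sample_gini_sketch` are hypothetical, and returning 0 for all-zero data is our assumption for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration; not part of Boost.Math.
// Sorts the input in place, then evaluates
// G = sum_i (2i - n - 1) x_i / (n * sum_i x_i), i = 1..n.
inline double gini_sketch(std::vector<double>& v)
{
    assert(v.size() > 1);
    std::sort(v.begin(), v.end());
    const double n = static_cast<double>(v.size());
    double numerator = 0;
    double total = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
    {
        numerator += (2.0 * static_cast<double>(i + 1) - n - 1.0) * v[i];
        total += v[i];
    }
    if (total == 0)
    {
        return 0; // assumption: all-zero data is treated as perfectly equal
    }
    return numerator / (n * total);
}

// Rescales the population value so that the maximum becomes 1.
inline double sample_gini_sketch(std::vector<double>& v)
{
    const double n = static_cast<double>(v.size());
    return gini_sketch(v) * n / (n - 1);
}
```

For `{1,0,0,0}` the population value is 3/4 = 1 - 1/n with n = 4, and the sample value is 1; for `{1,1,1,1}` both are 0.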
 226
 227 /Nota bene:/ There is essentially no reason to pass negative values to the Gini coefficient function.
 228 However, a use case (measuring wealth inequality when some people have negative wealth) exists, so we do not throw an exception when negative values are encountered.
 229 You should have /very/ good cause to pass negative values to the Gini coefficient calculator.
 230 Another use case is found in signal processing, but the sorting is by magnitude and hence has a different implementation.
 231 See `absolute_gini_coefficient` for details.
 232
 233 [heading References]
 234
 235 * Higham, Nicholas J. ['Accuracy and stability of numerical algorithms.] Vol. 80. SIAM, 2002.
 236 * Philippe P. Pébay: ["Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments.] Technical Report SAND2008-6212, Sandia National Laboratories, September 2008.
 237
 238 [endsect]
 239 [/section:univariate_statistics Univariate Statistics]