libs/histogram/doc/html/histogram/rationale.html

   1 <html>
   2 <head>
   3 <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
   4 <title>Rationale</title>
   5 <link rel="stylesheet" href="../../../../../doc/src/boostbook.css" type="text/css">
   6 <meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
   7 <link rel="home" href="../index.html" title="Chapter&#160;1.&#160;Boost.Histogram">
   8 <link rel="up" href="../index.html" title="Chapter&#160;1.&#160;Boost.Histogram">
   9 <link rel="prev" href="../boost/histogram/weight.html" title="Function template weight">
  10 <link rel="next" href="history.html" title="Revision history">
  11 </head>
  12 <body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
  13 <table cellpadding="2" width="100%"><tr>
  14 <td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../../../../boost.png"></td>
  15 <td align="center"><a href="../../../../../index.html">Home</a></td>
  16 <td align="center"><a href="../../../../libraries.htm">Libraries</a></td>
  17 <td align="center"><a href="http://www.boost.org/users/people.html">People</a></td>
  18 <td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td>
  19 <td align="center"><a href="../../../../../more/index.htm">More</a></td>
  20 </tr></table>
  21 <hr>
  22 <div class="spirit-nav">
  23 <a accesskey="p" href="../boost/histogram/weight.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="history.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a>
  24 </div>
  25 <div class="section">
  26 <div class="titlepage"><div><div><h2 class="title" style="clear: both">
  27 <a name="histogram.rationale"></a><a class="link" href="rationale.html" title="Rationale">Rationale</a>
  28 </h2></div></div></div>
  29 <div class="toc"><dl class="toc">
  30 <dt><span class="section"><a href="rationale.html#histogram.rationale.motivation">Motivation</a></span></dt>
  31 <dt><span class="section"><a href="rationale.html#histogram.rationale.guidelines">Guidelines</a></span></dt>
  32 <dt><span class="section"><a href="rationale.html#histogram.rationale.no_lambdas">No lambdas as axis types</a></span></dt>
  33 <dt><span class="section"><a href="rationale.html#histogram.rationale.uoflow">Under- and overflow bins</a></span></dt>
  34 <dt><span class="section"><a href="rationale.html#histogram.rationale.index_type">Size method of axis returns
  35       signed integer</a></span></dt>
  36 <dt><span class="section"><a href="rationale.html#histogram.rationale.real_index_type">Continuous axis
  37       accepts real-valued cell index</a></span></dt>
  38 <dt><span class="section"><a href="rationale.html#histogram.rationale.variance">On variance estimates</a></span></dt>
  39 <dt><span class="section"><a href="rationale.html#histogram.rationale.weights">Support of weighted fills</a></span></dt>
  40 <dt><span class="section"><a href="rationale.html#histogram.rationale.python_support">Python support</a></span></dt>
  41 <dt><span class="section"><a href="rationale.html#histogram.rationale.support_of_boost_accumulators">Support
  42       of Boost.Accumulators</a></span></dt>
  43 <dt><span class="section"><a href="rationale.html#histogram.rationale.support_of_boost_range">Support of
  44       Boost.Range</a></span></dt>
  45 <dt><span class="section"><a href="rationale.html#histogram.rationale.support_of_serialization">Support
  46       of serialization</a></span></dt>
  47 <dt><span class="section"><a href="rationale.html#histogram.rationale.comparison_to_boost_accumulators">Comparison
  48       to Boost.Accumulators</a></span></dt>
  49 <dt><span class="section"><a href="rationale.html#histogram.rationale.why_is_boost_histogram_not_built">Why
  50       is Boost.Histogram not built on top of Boost.MultiArray?</a></span></dt>
  51 </dl></div>
  52 <div class="section">
  53 <div class="titlepage"><div><div><h3 class="title">
  54 <a name="histogram.rationale.motivation"></a><a class="link" href="rationale.html#histogram.rationale.motivation" title="Motivation">Motivation</a>
  55 </h3></div></div></div>
  56 <p>
  57         C++ lacks a widely-used, free multi-dimensional histogram class. While it
  58         is easy to write a one-dimensional histogram, writing a general multi-dimensional
  59         histogram poses more of a challenge. If a few more features required by scientific
  60         professionals are added onto the wish-list, then the implementation becomes
  61         non-trivial and a well-tested library solution desirable.
  62       </p>
  63 <p>
  64         The <a href="https://www.gnu.org/software/gsl" target="_top">GNU Scientific Library
  65         (GSL)</a> and the <a href="https://root.cern.ch" target="_top">ROOT framework</a>
  66         from CERN have histogram implementations. The GSL has histograms for one
  67         and two dimensions in C. The implementations are not customizable. ROOT has
  68         well-tested implementations of histograms, but they are not customizable
  69         and they are not easy to use correctly. ROOT also has new implementations
  70         in beta-stage similar to this one, but they are still less flexible, not
  71         easy to use, and they cannot be used without the rest of ROOT, which is a
  72         huge library to install just to get histograms.
  73       </p>
  74 <p>
  75         The templated histogram class in this library has a minimal interface and
  76         focuses on the core task of creating histograms from input data. It is very
  77         customizable and extensible through user-provided classes. A single implementation
  78         is used for one and multi-dimensional histograms. While being safe, customizable,
  79         and convenient, the histogram is also very fast. The static version, which
  80         has an axis configuration that is hard-coded at compile-time, is faster than
  81         any tested competitor.
  82       </p>
  83 <p>
  84         One of the central design goals was to hide the implementation details of
  85         the internal counters of the histogram. The internal counting mechanism is
  86         encapsulated in a storage class, which can be switched out. The default storage
  87         uses an adaptive memory management which is safe to use, memory-efficient,
  88         and fast. The safety comes from the guarantee that counts cannot overflow
  89         or be capped. This is a rare guarantee, hardly found in other libraries.
  90         In the standard configuration, the histogram <span class="emphasis"><em>just works</em></span>
  91         under any circumstance. Yet, users with special requirements can implement
  92         their own custom storage class or use an alternative builtin array-based
  93         storage.
  94       </p>
  95 </div>
  96 <div class="section">
  97 <div class="titlepage"><div><div><h3 class="title">
  98 <a name="histogram.rationale.guidelines"></a><a class="link" href="rationale.html#histogram.rationale.guidelines" title="Guidelines">Guidelines</a>
  99 </h3></div></div></div>
 100 <p>
 101         This library was written based on a decade of experience collected in working
 102         with big data, more precisely in the field of particle physics and astroparticle
 103         physics. The design is guided by advice from people like Bjarne Stroustrup,
 104         Scott Meyers, Herb Sutter, Andrei Alexandrescu, and Chandler Carruth.
 105         The <a href="https://www.python.org/dev/peps/pep-0020" target="_top">Zen of Python</a>
 106         (which also applies to other languages) was an inspiration, as were ideas from
 107         the <a href="https://eigen.tuxfamily.org/" target="_top">Eigen library</a>. The
 108         feature set was designed to be a superset of what is offered by the <a href="https://root.cern.ch" target="_top">ROOT framework</a> and the <a href="https://www.gnu.org/software/gsl" target="_top">GNU
 109         Scientific Library (GSL)</a>.
 110       </p>
 111 <p>
 112         Design goals of the library:
 113       </p>
 114 <div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
 115 <li class="listitem">
 116             Provide a simple and convenient default behavior for the casual user,
 117             yet allow a maximum of customization for the power user. Follow the "Don't
 118             pay for what you don't use" principle. Features that you don't use
 119             should not affect your performance negatively.
 120           </li>
 121 <li class="listitem">
 122             Provide the same interface for one-dimensional and multi-dimensional
 123             histograms. This makes the interface easier to learn, and makes it easier
 124             to move a project from one-dimensional to multi-dimensional analysis.
 125           </li>
 126 <li class="listitem">
 127             Hide the details of how the bin counters work. This design allows for
 128             interesting implementations, such as the default storage that provides
 129             a no-overflow-guarantee, which no other library offers.
 130           </li>
 131 <li class="listitem">
 132             Minimalism, STL and Boost compatibility. Focus the library on the task
 133             of creating histograms. Functionality on top of that (drawing, further
 134             processing...) should come from other libraries. This gives users maximum
 135             flexibility to mix and match libraries. The histogram provides iterator
 136             ranges that give other libraries access to the histogram state, in
 137             particular its internal counters, which makes it compatible with STL
 138             algorithms and other Boost libraries. In addition,
 139             the library was made compatible with <a href="../../../../../libs/accumulators/index.html" target="_top">Boost.Accumulators</a>
 140             and <a href="../../../../../libs/range/index.html" target="_top">Boost.Range</a>.
 141           </li>
 142 </ul></div>
 143 </div>
 144 <div class="section">
 145 <div class="titlepage"><div><div><h3 class="title">
 146 <a name="histogram.rationale.no_lambdas"></a><a class="link" href="rationale.html#histogram.rationale.no_lambdas" title="No lambdas as axis types">No lambdas as axis types</a>
 147 </h3></div></div></div>
 148 <p>
 149         Lambdas were considered and rejected as a form of simple user-defined axis
 150         type, because they do not allow access to their state, such as the current
 151         axis size. Lambdas can be fully replaced by locally-defined structs. A local
 152         struct cannot be templated and cannot have templated methods, but this is
 153         not an issue. In the local context where the struct is created, all relevant
 154         types must be known already so that locally defined structs can simply use
 155         these concrete types and there is no need for templates.
 156       </p>
 157 </div>
 158 <div class="section">
 159 <div class="titlepage"><div><div><h3 class="title">
 160 <a name="histogram.rationale.uoflow"></a><a class="link" href="rationale.html#histogram.rationale.uoflow" title="Under- and overflow bins">Under- and overflow bins</a>
 161 </h3></div></div></div>
 162 <p>
 163         Axis instances by default add extra bins that count values which fall below
 164         or above the range covered by the axis (for those types where that makes
 165         sense). These extra bins are called under- and overflow bins, respectively.
 166         The extra bins can be turned off individually for each axis to conserve memory,
 167         but it is generally recommended to have them. The normal bins, excluding
 168         under- and overflow, are called <span class="bold"><strong>inner bins</strong></span>.
 169       </p>
 170 <p>
 171         Under- and overflow bins are useful in one-dimensional histograms, and nearly
 172         essential in multi-dimensional histograms. Here are the advantages:
 173       </p>
 174 <div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
 175 <li class="listitem">
 176             No loss: The total sum over all bin counts is strictly equal to the number
 177             of times the histogram was filled. Even NaN values are counted; they
 178             are put in the overflow bin by convention.
 179           </li>
 180 <li class="listitem">
 181             Diagnosis: Unexpected extreme values show up in the extra bins, which
 182             otherwise may be overlooked.
 183           </li>
 184 <li class="listitem">
 185             Ability to reduce histograms: In multi-dimensional histograms, an out-of-range
 186             value along one axis may be paired with an in-range value along another
 187             axis. If under- and overflow bins are missing, such a value pair is lost
 188             completely. If you apply a <code class="computeroutput"><span class="identifier">reduce</span></code>
 189             operation on a histogram, which removes some axes by summing all counts
 190             along that dimension, this would lead to distortions of the histogram
 191             along the remaining axes. When under- and overflow bins are present,
 192             the <code class="computeroutput"><span class="identifier">reduce</span></code> operation
 193             always produces a sub-histogram identical to the one that would be obtained
 194             by filling it directly with the original data.
 195           </li>
 196 </ul></div>
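<p>
        The reduction argument can be checked with a small sketch. It does not use
        the library; all names (<code class="computeroutput">toy_index</code>, <code class="computeroutput">fill_2d</code>,
        <code class="computeroutput">fill_1d_x</code>, <code class="computeroutput">reduce_to_x</code>) are
        hypothetical, and the storage layout, which shifts every index by one so that
        the underflow bin lands in slot 0, is an assumption made for illustration.
        Each toy axis has two inner bins on [0, 2) plus an underflow and an overflow bin.
      </p>

```cpp
#include <array>
#include <utility>
#include <vector>

// Hypothetical helper: index in [-1, 2] for two inner bins on [0, 2).
// NaN and values >= 2 go to the overflow bin (index 2), values < 0 to underflow (-1).
inline int toy_index(double x) {
  if (!(x < 2.0)) return 2;  // also true for NaN
  if (x < 0.0) return -1;
  return static_cast<int>(x);  // 0 or 1
}

// Slot i + 1 holds the bin with index i (assumed layout for this sketch).
using counts_1d = std::array<int, 4>;
using counts_2d = std::array<counts_1d, 4>;

// Fill a two-dimensional toy histogram from (x, y) pairs.
inline counts_2d fill_2d(const std::vector<std::pair<double, double>>& data) {
  counts_2d h{};
  for (const auto& p : data) ++h[toy_index(p.first) + 1][toy_index(p.second) + 1];
  return h;
}

// Fill a one-dimensional toy histogram from the x values only.
inline counts_1d fill_1d_x(const std::vector<std::pair<double, double>>& data) {
  counts_1d h{};
  for (const auto& p : data) ++h[toy_index(p.first) + 1];
  return h;
}

// Remove the y axis by summing all counts along it, extra bins included.
inline counts_1d reduce_to_x(const counts_2d& h) {
  counts_1d r{};
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j) r[i] += h[i][j];
  return r;
}
```

<p>
        Because every pair is counted somewhere along the y axis, including pairs whose
        y value is out of range, summing over y gives the same counts as filling a
        one-dimensional histogram with the x values directly.
      </p>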
 197 <p>
 198         The presence of the extra bins does not interfere with normal indexing. On
 199         an axis with <code class="computeroutput"><span class="identifier">n</span></code> bins, the
 200         first bin has the index <code class="computeroutput"><span class="number">0</span></code>, the
 201         last bin <code class="computeroutput"><span class="identifier">n</span><span class="special">-</span><span class="number">1</span></code>, while the under- and overflow bins are accessible
 202         at the indices <code class="computeroutput"><span class="special">-</span><span class="number">1</span></code>
 203         and <code class="computeroutput"><span class="identifier">n</span></code>, respectively. This
 204         choice is optimized for users who are unaware of the existence of these extra
 205         bins. They would find the other indexing scheme surprising, where you start
 206         with <code class="computeroutput"><span class="number">0</span></code> at the underflow bin and
 207         the first normal bin is at <code class="computeroutput"><span class="number">1</span></code>.
 208         Also, the chosen scheme allows one to turn off the extra bins in the code
 209         where the histogram is created, without changing any code downstream that
 210         addresses inner bins with indices.
 211       </p>
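<p>
        A minimal sketch of this indexing scheme follows. The class <code class="computeroutput">toy_axis</code>
        is hypothetical and is not one of the library's axis types; it only assumes
        equal-width inner bins on a half-open interval. Inner bins get the indices
        <code class="computeroutput">0</code> to <code class="computeroutput">n-1</code>, the underflow bin
        gets <code class="computeroutput">-1</code>, and the overflow bin gets <code class="computeroutput">n</code>.
      </p>

```cpp
#include <cassert>
#include <cmath>

// Hypothetical toy axis (not the library's class): n equal-width inner bins
// on [lo, hi), an underflow bin at index -1 and an overflow bin at index n.
struct toy_axis {
  int n;
  double lo, hi;

  toy_axis(int n_, double lo_, double hi_) : n(n_), lo(lo_), hi(hi_) {
    assert(n > 0 && lo < hi);
  }

  // Signed size: the returned value is itself a valid index (the overflow bin).
  int size() const { return n; }

  // Map a value to a bin index in [-1, n].
  int index(double x) const {
    if (std::isnan(x) || x >= hi) return n;  // NaN goes to overflow by convention
    if (x < lo) return -1;                   // underflow
    const int i = static_cast<int>((x - lo) / (hi - lo) * n);
    return i < n ? i : n - 1;  // guard against rounding up to n near the upper edge
  }
};
```

<p>
        Code that only loops over the inner bins, from <code class="computeroutput">0</code> to
        <code class="computeroutput">size() - 1</code>, does not need to know whether the extra bins exist.
      </p>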
 212 </div>
 213 <div class="section">
 214 <div class="titlepage"><div><div><h3 class="title">
 215 <a name="histogram.rationale.index_type"></a><a class="link" href="rationale.html#histogram.rationale.index_type" title="Size method of axis returns signed integer">Size method of axis returns
 216       signed integer</a>
 217 </h3></div></div></div>
 218 <p>
 219         The standard library returns a container size as an unsigned integer, because
 220         a container size cannot be negative. The <code class="computeroutput"><span class="identifier">size</span><span class="special">()</span></code> method of the histogram class follows this
 221         rule, but the <code class="computeroutput"><span class="identifier">size</span><span class="special">()</span></code>
 222         methods of axis types return a signed integral type. Why?
 223       </p>
 224 <p>
 225         As explained in the <a class="link" href="rationale.html#histogram.rationale.uoflow" title="Under- and overflow bins">section about
 226         under- and overflow</a>, a histogram axis may have an optional underflow
 227         bin, which is addressed by the index <code class="computeroutput"><span class="special">-</span><span class="number">1</span></code>. It follows that the index type must be signed
 228         integer for all axis types.
 229       </p>
 230 <p>
 231         The <code class="computeroutput"><span class="identifier">size</span><span class="special">()</span></code>
 232         method of any axis returns the same signed integer type. The size of an axis
 233         cannot be negative, but this choice has two advantages. Firstly, the value
 234         returned by <code class="computeroutput"><span class="identifier">size</span><span class="special">()</span></code>
 235         itself is guaranteed to be a valid index, which is good since it may address
 236         the overflow bin. Secondly, comparisons between an index and the value returned
 237         by <code class="computeroutput"><span class="identifier">size</span><span class="special">()</span></code>
 238         are frequent. If <code class="computeroutput"><span class="identifier">size</span><span class="special">()</span></code>
 239         returned an unsigned integral type, compilers would produce a warning for
 240         each comparison, and rightly so. <a href="https://www.youtube.com/watch?v=wvtFGa6XJDU" target="_top">Something
 241         awful happens</a> on most machines when you compare <code class="computeroutput"><span class="special">-</span><span class="number">1</span></code> with an unsigned integer, <code class="computeroutput"><span class="special">-</span><span class="number">1</span> <span class="special">&lt;</span> <span class="number">1u</span>
 242         <span class="special">==</span> <span class="keyword">false</span></code>,
 243         which causes a serious bug in the following innocent-looking loop:
 244       </p>
 245 <pre class="programlisting"><span class="keyword">auto</span> <span class="identifier">my_axis</span> <span class="special">=</span> <span class="comment">/* ... */</span><span class="special">;</span>
 246 <span class="comment">// naive loop to iterate over all bins, including underflow and overflow</span>
 247 <span class="keyword">for</span> <span class="special">(</span><span class="keyword">int</span> <span class="identifier">i</span> <span class="special">=</span> <span class="special">-</span><span class="number">1</span><span class="special">;</span> <span class="identifier">i</span> <span class="special">&lt;=</span> <span class="identifier">my_axis</span><span class="special">.</span><span class="identifier">size</span><span class="special">();</span> <span class="special">++</span><span class="identifier">i</span><span class="special">)</span> <span class="special">{</span>
 248   <span class="comment">// body is never executed if return value of my_axis.size() is an unsigned integral type</span>
 249 <span class="special">}</span>
 250 </pre>
 251 <p>
 252         The advantages clearly outweigh the disadvantages of this choice.
 253       </p>
 254 </div>
 255 <div class="section">
 256 <div class="titlepage"><div><div><h3 class="title">
 257 <a name="histogram.rationale.real_index_type"></a><a class="link" href="rationale.html#histogram.rationale.real_index_type" title="Continuous axis accepts real-valued cell index">Continuous axis
 258       accepts real-valued cell index</a>
 259 </h3></div></div></div>
 260 <p>
 261         Each axis has a method called <code class="computeroutput"><span class="identifier">value</span><span class="special">(</span><span class="identifier">index_type</span><span class="special">)</span></code> which converts an index into the equivalent
 262         value at that index. If the axis is continuous, there are many possible values
 263         in the interval between two adjacent integer indices. Users often want to
 264         access the center of such an interval. An easy and very efficient way to
 265         access the center value is for this method to accept real-valued indices.
 266         Then, the center of the bin between indices <code class="computeroutput"><span class="identifier">i</span></code>
 267         and <code class="computeroutput"><span class="identifier">i</span><span class="special">+</span><span class="number">1</span></code> is simply obtained by passing <code class="computeroutput"><span class="identifier">i</span><span class="special">+</span><span class="number">0.5</span></code>.
 268       </p>
 269 <p>
 270         This scheme is computationally efficient and intuitive. Each continuous axis
 271         is required to accept a real-valued index; in fact, internal library code
 272         relies on this to detect whether an axis is continuous or discrete.
 273       </p>
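<p>
        The idea can be sketched for an axis with equal-width bins. The class
        <code class="computeroutput">toy_regular_axis</code> below is hypothetical and only
        illustrates the principle; the library's own axis types may compute the value
        differently. An integer index maps to a bin edge, and adding
        <code class="computeroutput">0.5</code> yields the bin center.
      </p>

```cpp
#include <cassert>

// Hypothetical toy axis (not the library's class): n equal-width bins on [lo, hi).
struct toy_regular_axis {
  int n;
  double lo, hi;

  toy_regular_axis(int n_, double lo_, double hi_) : n(n_), lo(lo_), hi(hi_) {
    assert(n > 0 && lo < hi);
  }

  // Convert a real-valued index into a value on the axis:
  // i = 0 is the lower edge, i = n the upper edge, i + 0.5 the center of bin i.
  double value(double i) const { return lo + (hi - lo) * i / n; }
};
```

<p>
        With four bins on [0, 2), <code class="computeroutput">value(1)</code> is the lower edge of the
        second bin and <code class="computeroutput">value(1.5)</code> is its center.
      </p>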
 274 </div>
 275 <div class="section">
 276 <div class="titlepage"><div><div><h3 class="title">
 277 <a name="histogram.rationale.variance"></a><a class="link" href="rationale.html#histogram.rationale.variance" title="On variance estimates">On variance estimates</a>
 278 </h3></div></div></div>
 279 <p>
 280         Once a histogram is filled, the bin counter can be accessed with the <code class="computeroutput"><span class="identifier">at</span><span class="special">(...)</span></code>
 281         method. Some accumulators offer a <code class="computeroutput"><span class="identifier">value</span><span class="special">()</span></code> method to return the cell value <span class="emphasis"><em>k</em></span>
 282         and a <code class="computeroutput"><span class="identifier">variance</span><span class="special">()</span></code>
 283         method, which returns an estimate <span class="emphasis"><em>v</em></span> of the <a href="https://en.wikipedia.org/wiki/Variance" target="_top">variance</a>
 284         of that cell.
 285       </p>
 286 <p>
 287         If the input values for the histogram come from a <a href="https://en.wikipedia.org/wiki/Stochastic_process" target="_top">stochastic
 288         process</a>, the variance estimate provides useful additional information.
 289         Examples of a stochastic process are a physics experiment or a random person
 290         filling out a questionnaire <a href="#ftn.histogram.rationale.variance.f0" class="footnote" name="histogram.rationale.variance.f0"><sup class="footnote">[3]</sup></a>. The variance <span class="emphasis"><em>v</em></span> is the square of the <a href="https://en.wikipedia.org/wiki/Standard_deviation" target="_top">standard deviation</a>.
 291         The standard deviation is a number that tells us how much we can expect the
 292         observed value to fluctuate if we or someone else were to repeat our experiment
 293         with new random input.
 294       </p>
 295 <p>
 296         Variance estimates are useful in many ways:
 297       </p>
 298 <div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
 299 <li class="listitem">
 300             Error bars: Drawing an <a href="https://en.wikipedia.org/wiki/Error_bar" target="_top">error
 301             bar</a> over the interval <span class="emphasis"><em>(k - sqrt(v), k + sqrt(v))</em></span>
 302             is a simple visualization of the expected random scatter of the bin value
 303             <span class="emphasis"><em>k</em></span>, if the histogram was cleared and filled again
 304             with another independent sample of the same size (e.g. by repeating the
 305             physics experiment or asking more people to fill a questionnaire). If
 306             you compare the result with a fitted model (see next item), about 2/3
 307             of the error bars should overlap with the model, if the model is correct.
 308           </li>
 309 <li class="listitem">
 310             Least-squares fitting: Often you have a model of the expected number
 311             of counts <span class="emphasis"><em>lambda</em></span> per bin, which is a function of
 312             parameters with unknown values. A simple method to find good (sometimes
 313             the best) estimates for those parameter values is to vary them until
 314             the sum of squared residuals <span class="emphasis"><em>(k - lambda)^2/v</em></span> is
 315             minimized. This is the <a href="https://en.wikipedia.org/wiki/Least_squares" target="_top">method
 316             of least squares</a>, in which both the bin values <span class="emphasis"><em>k</em></span>
 317             and variance estimates <span class="emphasis"><em>v</em></span> enter.
 318           </li>
 319 <li class="listitem">
 320             Pull distributions: If you have two histograms filled with the same number
 321             of samples and you want to know whether they are in agreement, you can
 322             compare the so-called pull distribution. It is formed by subtracting
 323             the counts and dividing by the square root of the sum of their variances, <span class="emphasis"><em>(k1
 324             - k2)/sqrt(v1 + v2)</em></span>. If the histograms are identical, the
 325             pull distribution randomly scatters around zero, and about 2/3 of the
 326             values are in the interval <span class="emphasis"><em>[ -1, 1]</em></span>.
 327           </li>
 328 </ul></div>
 329 <p>
 330         Why return the variance <span class="emphasis"><em>v</em></span> and not the standard deviation
 331         <span class="emphasis"><em>s = sqrt(v)</em></span>? The reason is that variances can be trivially
 332         added and it is computationally more efficient to return the variance. <a href="https://en.wikipedia.org/wiki/Variance#Properties" target="_top">Variances of independent
 333         samples can be added</a> like normal numbers <span class="emphasis"><em>v3 = v1 + v2</em></span>.
 334         This is not true for standard deviations, where the addition law is more
 335         complex <span class="emphasis"><em>s3 = sqrt(s1^2 + s2^2)</em></span>. In that sense, the variance
 336         is more straight-forward to use during data processing. The user can take
 337         the square root at the end of the processing to obtain the standard deviation
 338         as needed.
 339       </p>
 340 <p>
 341         How is the variance estimate <span class="emphasis"><em>v</em></span> computed for a normal
 342         counting histogram? If we know the expected number of counts <span class="emphasis"><em>lambda</em></span>
 343         per bin, we could compute the variance as <span class="emphasis"><em>v = lambda</em></span>,
 344         because counts in a histogram follow the <a href="https://en.wikipedia.org/wiki/Poisson_distribution" target="_top">Poisson
 345         distribution</a> <a href="#ftn.histogram.rationale.variance.f1" class="footnote" name="histogram.rationale.variance.f1"><sup class="footnote">[4]</sup></a>. After filling a histogram, we do not know the expected number
 346         of counts <span class="emphasis"><em>lambda</em></span> for any particular bin, but we know
 347         the observed count <span class="emphasis"><em>k</em></span>, which is not too far from <span class="emphasis"><em>lambda</em></span>.
 348         We therefore might be tempted to just replace <span class="emphasis"><em>lambda</em></span>
 349         with <span class="emphasis"><em>k</em></span> in the formula <span class="emphasis"><em>v = lambda = k</em></span>.
 350         This is in fact the so-called non-parametric estimate for the variance based
 351         on the <a href="https://en.wikipedia.org/wiki/Plug-in_principle" target="_top">plug-in
 352         principle</a>. It is the best (and only) estimate for the variance, if
 353         we know nothing more about the underlying stochastic process which generated
 354         the inputs (or want to feign ignorance about it).
 355       </p>
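<p>
        As a small worked sketch, the plug-in estimate <span class="emphasis"><em>v = k</em></span>
        and the pull <span class="emphasis"><em>(k1 - k2)/sqrt(v1 + v2)</em></span> from the list above
        can be written as two free functions. The names <code class="computeroutput">plugin_variance</code>
        and <code class="computeroutput">pull</code> are hypothetical and not part of the library, and
        returning zero when both bins are empty is a convention assumed here, since the
        formula is undefined in that case.
      </p>

```cpp
#include <cmath>

// Plug-in variance estimate for a plain count: v = k.
inline double plugin_variance(double k) { return k; }

// Pull between two bin counts, (k1 - k2) / sqrt(v1 + v2).
// Variances of independent counts add, so no square roots are needed until the end.
inline double pull(double k1, double k2) {
  const double v = plugin_variance(k1) + plugin_variance(k2);
  if (v <= 0.0) return 0.0;  // assumed convention for two empty bins
  return (k1 - k2) / std::sqrt(v);
}
```

<p>
        For example, counts of 9 and 16 give a pull of <span class="emphasis"><em>-7/sqrt(25) = -1.4</em></span>.
      </p>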
 356 </div>
 357 <div class="section">
 358 <div class="titlepage"><div><div><h3 class="title">
 359 <a name="histogram.rationale.weights"></a><a class="link" href="rationale.html#histogram.rationale.weights" title="Support of weighted fills">Support of weighted fills</a>
 360 </h3></div></div></div>
 361 <p>
 362         A histogram sorts input values into bins and increments a bin counter if
 363         an input value falls into the range covered by that bin. The <code class="computeroutput"><a class="link" href="../boost/histogram/unlimited_storage.html" title="Class template unlimited_storage">standard
 364         storage</a></code> uses integer types to store these counts; see the <a class="link" href="overview.html#histogram.overview.structure.storage" title="Storage types">storage section</a> for how
 365         integer overflow is avoided. However, sometimes histograms need to be filled
 366         with values that have a weight <span class="emphasis"><em>w</em></span> attached to them. In
 367         this case, the corresponding bin counter is not increased by one, but by
 368         the weight value <span class="emphasis"><em>w</em></span>.
 369       </p>
 370 <div class="note"><table border="0" summary="Note">
 371 <tr>
 372 <td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../../../../doc/src/images/note.png"></td>
 373 <th align="left">Note</th>
 374 </tr>
 375 <tr><td align="left" valign="top"><p>
 376           There are several use-cases for weighted increments. The main use in particle
 377           physics is to adapt simulated data of an experiment to real data. Simulations
 378           are needed to determine various corrections and efficiencies, but a simulated
 379           experiment is almost never a perfect replica of the real experiment. In
 380           addition, simulations are expensive to do. So, when deviations in a simulated
 381           distribution of a variable are found, one typically does not rerun the
 382           simulations, but assigns weights to match the simulated distribution to
 383           the real one.
 384         </p></td></tr>
 385 </table></div>
 386 <p>
 387         When the <code class="computeroutput"><a class="link" href="reference.html#boost.histogram.weight_storage">weight_storage</a></code>
 388         is used, histograms may be filled with weighted value tuples. Two real numbers
 389         per bin are stored in this case. The first keeps track of the sum of weights.
 390         The second keeps track of the sum of weights squared, which is the variance
 391         estimate in this case. The former is accessed with the <code class="computeroutput"><span class="identifier">value</span><span class="special">()</span></code> method of the bin counter, and the latter
 392         with the <code class="computeroutput"><span class="identifier">variance</span><span class="special">()</span></code>
 393         method.
 394       </p>
 395 <p>
 396         Why the sum of weights squared is the variance estimate can be derived from
 397         the <a href="https://en.wikipedia.org/wiki/Variance#Properties" target="_top">mathematical
 398         properties of the variance</a>. Let us say a bin is filled <span class="emphasis"><em>k1</em></span>
 399         times with a fixed weight <span class="emphasis"><em>w1</em></span>. The sum of weights is
 400         then <span class="emphasis"><em>w1 k1</em></span>. It then follows from the variance properties
 401         that <span class="emphasis"><em>Var(w1 k1) = w1^2 Var(k1)</em></span>. Using the reasoning
 402         from before, the estimated variance of <span class="emphasis"><em>k1</em></span> is <span class="emphasis"><em>k1</em></span>,
 403         so that <span class="emphasis"><em>Var(w1 k1) = w1^2 Var(k1) = w1^2 k1</em></span>. Variances
 404         of independent samples are additive. If the bin is further filled <span class="emphasis"><em>k2</em></span>
 405         times with weight <span class="emphasis"><em>w2</em></span>, the sum of weights is <span class="emphasis"><em>w1
 406         k1 + w2 k2</em></span>, with variance <span class="emphasis"><em>w1^2 k1 + w2^2 k2</em></span>.
 407         This also holds for <span class="emphasis"><em>k1 = k2 = 1</em></span>. Therefore, the sum
 408         of weights <span class="emphasis"><em>w[i]</em></span> has variance sum of <span class="emphasis"><em>w[i]^2</em></span>.
 409         In other words, to incrementally keep track of the variance of the sum of
 410         weights, we need to keep track of the sum of weights squared.
 411       </p>
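<p>
        The bookkeeping derived above can be illustrated with a minimal sketch.
        The type <code class="computeroutput">weighted_cell</code> and the helper
        <code class="computeroutput">fill_two_weights</code> are hypothetical names
        chosen for this example; they are not the library's implementation. The sketch
        only shows that keeping the sum of weights and the sum of squared weights
        is enough to report the value and its variance estimate incrementally.
      </p>
<pre class="programlisting">#include &lt;cassert&gt;

// Hypothetical stand-in for one cell of a weighted storage (illustration only).
struct weighted_cell {
  double sum_w = 0.0;   // sum of weights: the bin content
  double sum_w2 = 0.0;  // sum of squared weights: the variance estimate

  // Add one entry with weight w.
  void fill(double w) {
    sum_w += w;
    sum_w2 += w * w;
  }

  double value() const { return sum_w; }
  double variance() const { return sum_w2; }
};

// Fill a cell k1 times with weight w1 and k2 times with weight w2.
inline weighted_cell fill_two_weights(int k1, double w1, int k2, double w2) {
  assert(k1 &gt;= 0 &amp;&amp; k2 &gt;= 0);
  weighted_cell c;
  for (int i = 0; i &lt; k1; ++i) c.fill(w1);
  for (int i = 0; i &lt; k2; ++i) c.fill(w2);
  return c;
}
</pre>
<p>
        With <span class="emphasis"><em>k1 = 3</em></span>, <span class="emphasis"><em>w1 = 2</em></span>,
        <span class="emphasis"><em>k2 = 2</em></span> and <span class="emphasis"><em>w2 = 0.5</em></span>,
        the value is <span class="emphasis"><em>w1 k1 + w2 k2 = 7</em></span> and the variance
        estimate is <span class="emphasis"><em>w1^2 k1 + w2^2 k2 = 12.5</em></span>.
      </p>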
 412 </div>
 413 <div class="section">
 414 <div class="titlepage"><div><div><h3 class="title">
 415 <a name="histogram.rationale.python_support"></a><a class="link" href="rationale.html#histogram.rationale.python_support" title="Python support">Python support</a>
 416 </h3></div></div></div>
 417 <p>
 418         Python is a popular scripting language in the data science community. Thus,
 419         the library must be designed to support Python bindings, which are developed
        separately. The histogram should be usable as an interface between a complex
 421         simulation or data-storage system written in C++ and data-analysis/plotting
 422         in Python. Users are able to define a histogram in Python, let it be filled
 423         on the C++ side, and then get it back for further data analysis or plotting.
 424       </p>
 425 <p>
        This is a major reason why a purely static design, in which the histogram
        must be fully configured at compile-time, was rejected. While a static design
        generates more efficient code, it does not work with Python, which requires
        histograms to be configured at run-time without recompiling the code.
 430       </p>
 431 </div>
 432 <div class="section">
 433 <div class="titlepage"><div><div><h3 class="title">
 434 <a name="histogram.rationale.support_of_boost_accumulators"></a><a class="link" href="rationale.html#histogram.rationale.support_of_boost_accumulators" title="Support of Boost.Accumulators">Support
 435       of Boost.Accumulators</a>
 436 </h3></div></div></div>
 437 <p>
 438         Boost.Histogram can be configured to use arbitrary accumulators as cells,
 439         in particular the accumulators from <a href="../../../../../libs/accumulators/index.html" target="_top">Boost.Accumulators</a>.
 440         Sample values can be passed to the cell accumulator, which it may use to
 441         compute the mean, median, variance or other statistics of the samples sorted
 442         into each cell.
 443       </p>
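<p>
        As a rough illustration of the idea, the sketch below uses a hand-written
        cell accumulator instead of one from Boost.Accumulators. The names
        <code class="computeroutput">mean_cell</code> and <code class="computeroutput">fill_cells</code>
        are hypothetical, the cell index of each sample is assumed to be computed
        elsewhere, and the running mean and variance are updated with Welford's
        method, which is a choice made for this example only.
      </p>
<pre class="programlisting">#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;vector&gt;

// Hypothetical cell accumulator (not a Boost.Accumulators type): tracks the
// mean and variance of the samples sorted into one cell with Welford's method.
struct mean_cell {
  std::size_t n = 0;  // number of samples
  double mean = 0.0;  // running mean
  double m2 = 0.0;    // running sum of squared deviations from the mean

  // Pass one sample value to the accumulator.
  void operator()(double x) {
    ++n;
    const double delta = x - mean;
    mean += delta / static_cast&lt;double&gt;(n);
    m2 += delta * (x - mean);
  }

  // Unbiased sample variance; needs at least two samples.
  double variance() const {
    assert(n &gt; 1);
    return m2 / static_cast&lt;double&gt;(n - 1);
  }
};

// Sort each sample into the cell selected by its precomputed cell index.
inline std::vector&lt;mean_cell&gt; fill_cells(std::size_t n_cells,
                                         const std::vector&lt;std::size_t&gt;&amp; cell,
                                         const std::vector&lt;double&gt;&amp; sample) {
  assert(cell.size() == sample.size());
  std::vector&lt;mean_cell&gt; cells(n_cells);
  for (std::size_t i = 0; i &lt; cell.size(); ++i) {
    assert(cell[i] &lt; n_cells);
    cells[cell[i]](sample[i]);
  }
  return cells;
}
</pre>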
 444 </div>
 445 <div class="section">
 446 <div class="titlepage"><div><div><h3 class="title">
 447 <a name="histogram.rationale.support_of_boost_range"></a><a class="link" href="rationale.html#histogram.rationale.support_of_boost_range" title="Support of Boost.Range">Support of
 448       Boost.Range</a>
 449 </h3></div></div></div>
 450 <p>
 451         The histogram class is a valid range and can be used with the <a href="../../../../../libs/range/index.html" target="_top">Boost.Range</a>
        library. This library provides a custom adaptor generator, <code class="computeroutput"><span class="identifier">indexed</span></code>, analogous to the corresponding adaptor
 453         generator in Boost.Range, but with a potentially multi-dimensional index.
 454       </p>
 455 </div>
 456 <div class="section">
 457 <div class="titlepage"><div><div><h3 class="title">
 458 <a name="histogram.rationale.support_of_serialization"></a><a class="link" href="rationale.html#histogram.rationale.support_of_serialization" title="Support of serialization">Support
 459       of serialization</a>
 460 </h3></div></div></div>
 461 <p>
 462         Serialization is implemented using <a href="../../../../../libs/serialization/index.html" target="_top">Boost.Serialization</a>.
        A portable binary archive with support for floating-point data would make
        it possible to store and retrieve histograms efficiently, but none is currently
        available. The library has to remain open to other serialization libraries.
 466       </p>
 467 </div>
 468 <div class="section">
 469 <div class="titlepage"><div><div><h3 class="title">
 470 <a name="histogram.rationale.comparison_to_boost_accumulators"></a><a class="link" href="rationale.html#histogram.rationale.comparison_to_boost_accumulators" title="Comparison to Boost.Accumulators">Comparison
 471       to Boost.Accumulators</a>
 472 </h3></div></div></div>
 473 <p>
 474         Boost.Histogram has a minor overlap with <a href="../../../../../libs/accumulators/index.html" target="_top">Boost.Accumulators</a>,
 475         but the scopes are rather different. The statistical accumulators <code class="computeroutput"><span class="identifier">density</span></code> and <code class="computeroutput"><span class="identifier">weighted_density</span></code>
 476         in Boost.Accumulators generate one-dimensional histograms. The axis range
 477         and the bin widths are determined automatically from a cached sample of initial
 478         values. They cannot be used for multi-dimensional data. Boost.Histogram focuses
 479         on multi-dimensional data and gives the user full control of how the binning
 480         should be done for each dimension.
 481       </p>
 482 <p>
 483         Automatic binning is not an option for Boost.Histogram, because it does not
 484         scale well to many dimensions. Because of the Curse of Dimensionality, a
 485         prohibitive number of samples would need to be collected.
 486       </p>
 487 <div class="note"><table border="0" summary="Note">
 488 <tr>
 489 <td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../../../../doc/src/images/note.png"></td>
 490 <th align="left">Note</th>
 491 </tr>
 492 <tr><td align="left" valign="top"><p>
          There is no scientific consensus on how to do automatic binning in an optimal
 494           way, mostly because there is no consensus over the cost function (there
 495           are many articles with different solutions in the literature). The problem
 496           is not solved for one-dimensional data, and even less so for multi-dimensional
 497           data.
 498         </p></td></tr>
 499 </table></div>
 500 <p>
 501         Recommendation:
 502       </p>
 503 <div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
 504 <li class="listitem">
 505             Boost.Accumulators
 506             <div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "><li class="listitem">
                  You have one-dimensional data about which you know nothing,
 508                   and you want a histogram quickly without worrying about binning
 509                   details.
 510                 </li></ul></div>
 511           </li>
 512 <li class="listitem">
 513             Boost.Histogram
 514             <div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; ">
 515 <li class="listitem">
 516                   You have multi-dimensional data or you suspect you will switch
 517                   to multi-dimensional data later.
 518                 </li>
 519 <li class="listitem">
 520                   You want to customize the binning by hand, for example, to make
 521                   bin edges coincide with special values or to handle special properties
 522                   of your values, like angles defined on a circle.
 523                 </li>
 524 </ul></div>
 525           </li>
 526 </ul></div>
 527 </div>
 528 <div class="section">
 529 <div class="titlepage"><div><div><h3 class="title">
 530 <a name="histogram.rationale.why_is_boost_histogram_not_built"></a><a class="link" href="rationale.html#histogram.rationale.why_is_boost_histogram_not_built" title="Why is Boost.Histogram not built on top of Boost.MultiArray?">Why
 531       is Boost.Histogram not built on top of Boost.MultiArray?</a>
 532 </h3></div></div></div>
 533 <p>
        Boost.MultiArray implements a multi-dimensional array and converts an
 535         index tuple into a global index that is used to access an element in the
 536         array. Boost.Histogram and Boost.MultiArray share this functionality, but
 537         Boost.Histogram cannot use Boost.MultiArray as a back-end. Boost.MultiArray
 538         makes the rank of the array a compile-time property, while this library needs
 539         the rank to be dynamic.
 540       </p>
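<p>
        The conversion of an index tuple into a global index with a rank that is
        only known at run-time can be sketched as follows. The function name
        <code class="computeroutput">linearize</code> and the choice to let the first
        dimension vary fastest are assumptions of this example, not a description
        of the library's internals.
      </p>
<pre class="programlisting">#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;vector&gt;

// Convert an index tuple into a global (linear) index when the rank is only
// known at run-time. extents[d] is the number of bins along dimension d.
// The first dimension varies fastest; that ordering is an assumption here.
inline std::size_t linearize(const std::vector&lt;std::size_t&gt;&amp; extents,
                             const std::vector&lt;std::size_t&gt;&amp; index) {
  assert(extents.size() == index.size());
  std::size_t global = 0;
  std::size_t stride = 1;
  for (std::size_t d = 0; d &lt; extents.size(); ++d) {
    assert(index[d] &lt; extents[d]);
    global += index[d] * stride;  // offset contributed by dimension d
    stride *= extents[d];         // cells spanned by one step in dimension d + 1
  }
  return global;
}
</pre>
<p>
        Because the extents are held in a run-time container, the same function
        serves one-, two- or higher-dimensional histograms without recompilation.
      </p>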
 541 <p>
        Boost.MultiArray also does not allow the element type to be changed dynamically.
 543         This is needed to implement the adaptive storage mentioned further up. Using
 544         a variant type as the element type of a Boost.MultiArray would not work,
 545         because it creates this wasteful layout:
 546       </p>
 547 <p>
 548         <code class="computeroutput"><span class="special">[</span><span class="identifier">type</span><span class="special">-</span><span class="identifier">index</span> <span class="number">1</span><span class="special">][</span><span class="identifier">value</span>
 549         <span class="number">1</span><span class="special">][</span><span class="identifier">type</span><span class="special">-</span><span class="identifier">index</span>
 550         <span class="number">2</span><span class="special">][</span><span class="identifier">value</span> <span class="number">2</span><span class="special">]...</span></code>
 551       </p>
 552 <p>
 553         A type index is stored for each cell. Moreover, the variant is always as
        large as the largest type in the union, so there is no way to save memory
        by using a smaller type when the bin count is low, as is done by the adaptive
 556         storage. The adaptive storage uses only one type-index for the whole array
 557         and allocates a homogeneous array of values of the same type that exactly
 558         matches their sizes, creating the following layout:
 559       </p>
 560 <p>
 561         <code class="computeroutput"><span class="special">[</span><span class="identifier">type</span><span class="special">-</span><span class="identifier">index</span><span class="special">][</span><span class="identifier">value</span> <span class="number">1</span><span class="special">][</span><span class="identifier">value</span>
 562         <span class="number">2</span><span class="special">][</span><span class="identifier">value</span> <span class="number">3</span><span class="special">]...</span></code>
 563       </p>
 564 <p>
        There is only one type index, and the number of bytes allocated for the array
        can be adapted dynamically to the size of the value type.
 567       </p>
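<p>
        The second layout can be sketched in a few lines. The class
        <code class="computeroutput">adaptive_counts</code> below is hypothetical and
        much simpler than the library's adaptive storage: it knows only two value
        types, 8-bit and 16-bit unsigned counters, and stops there for brevity.
        It is meant to show the single type index and the homogeneous array that
        is widened as a whole when one cell would overflow.
      </p>
<pre class="programlisting">#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;limits&gt;
#include &lt;vector&gt;

// Hypothetical sketch of an adaptive counter array (illustration only).
// One type index describes all cells; only the active vector holds data.
class adaptive_counts {
 public:
  explicit adaptive_counts(std::size_t n) : small_(n, 0) {}

  void increment(std::size_t i) {
    if (type_index_ == 0) {
      if (small_[i] &lt; std::numeric_limits&lt;std::uint8_t&gt;::max()) {
        ++small_[i];
        return;
      }
      grow();  // cell i would overflow: switch the whole array to 16 bits
    }
    assert(large_[i] &lt; std::numeric_limits&lt;std::uint16_t&gt;::max());
    ++large_[i];
  }

  unsigned value(std::size_t i) const {
    return type_index_ == 0 ? small_[i] : large_[i];
  }

  // Bytes used per cell by the currently active value type.
  std::size_t bytes_per_cell() const { return type_index_ == 0 ? 1 : 2; }

 private:
  void grow() {
    large_.assign(small_.begin(), small_.end());  // copy with widening
    small_.clear();
    small_.shrink_to_fit();
    type_index_ = 1;
  }

  int type_index_ = 0;                // 0: uint8_t cells, 1: uint16_t cells
  std::vector&lt;std::uint8_t&gt; small_;   // active while type_index_ == 0
  std::vector&lt;std::uint16_t&gt; large_;  // active while type_index_ == 1
};
</pre>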
 568 </div>
 569 <div class="footnotes">
 570 <br><hr style="width:100; text-align:left;margin-left: 0">
 571 <div id="ftn.histogram.rationale.variance.f0" class="footnote"><p><a href="#histogram.rationale.variance.f0" class="para"><sup class="para">[3] </sup></a>
 572           The choices of the person are most likely not random, but if we pick a
          random person from a group, we randomly sample from a pool of opinions.
 574         </p></div>
 575 <div id="ftn.histogram.rationale.variance.f1" class="footnote"><p><a href="#histogram.rationale.variance.f1" class="para"><sup class="para">[4] </sup></a>
 576           The Poisson distribution is correct as far as the counts <span class="emphasis"><em>k</em></span>
 577           themselves are of interest. If the fractions per bin <span class="emphasis"><em>p = k /
 578           N</em></span> are of interest, where <span class="emphasis"><em>N</em></span> is the total
 579           number of counts, then the correct distribution to describe the fractions
 580           is the <a href="https://en.wikipedia.org/wiki/Multinomial_distribution" target="_top">multinomial
 581           distribution</a>.
 582         </p></div>
 583 </div>
 584 </div>
 585 <table xmlns:rev="http://www.cs.rpi.edu/~gregod/boost/tools/doc/revision" width="100%"><tr>
 586 <td align="left"></td>
 587 <td align="right"><div class="copyright-footer">Copyright &#169; 2016-2019 Hans
 588       Dembinski<p>
 589         Distributed under the Boost Software License, Version 1.0. (See accompanying
 590         file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
 591       </p>
 592 </div></td>
 593 </tr></table>
 594 <hr>
 595 <div class="spirit-nav">
 596 <a accesskey="p" href="../boost/histogram/weight.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="history.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a>
 597 </div>
 598 </body>
 599 </html>