libs/histogram/doc/html/histogram/overview.html

   1 <html>
   2 <head>
   3 <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
   4 <title>Overview</title>
   5 <link rel="stylesheet" href="../../../../../doc/src/boostbook.css" type="text/css">
   6 <meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
   7 <link rel="home" href="../index.html" title="Chapter&#160;1.&#160;Boost.Histogram">
   8 <link rel="up" href="../index.html" title="Chapter&#160;1.&#160;Boost.Histogram">
   9 <link rel="prev" href="../index.html" title="Chapter&#160;1.&#160;Boost.Histogram">
  10 <link rel="next" href="getting_started.html" title="Getting started">
  11 </head>
  12 <body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
  13 <table cellpadding="2" width="100%"><tr>
  14 <td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../../../../boost.png"></td>
  15 <td align="center"><a href="../../../../../index.html">Home</a></td>
  16 <td align="center"><a href="../../../../libraries.htm">Libraries</a></td>
  17 <td align="center"><a href="http://www.boost.org/users/people.html">People</a></td>
  18 <td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td>
  19 <td align="center"><a href="../../../../../more/index.htm">More</a></td>
  20 </tr></table>
  21 <hr>
  22 <div class="spirit-nav">
  23 <a accesskey="p" href="../index.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="getting_started.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a>
  24 </div>
  25 <div class="section">
  26 <div class="titlepage"><div><div><h2 class="title" style="clear: both">
  27 <a name="histogram.overview"></a><a class="link" href="overview.html" title="Overview">Overview</a>
  28 </h2></div></div></div>
  29 <div class="toc"><dl class="toc">
  30 <dt><span class="section"><a href="overview.html#histogram.overview.introduction">Introduction</a></span></dt>
  31 <dt><span class="section"><a href="overview.html#histogram.overview.structure">Structure</a></span></dt>
  32 <dd><dl>
  33 <dt><span class="section"><a href="overview.html#histogram.overview.structure.host">Histogram host class</a></span></dt>
  34 <dt><span class="section"><a href="overview.html#histogram.overview.structure.axis">Axis types</a></span></dt>
  35 <dt><span class="section"><a href="overview.html#histogram.overview.structure.storage">Storage types</a></span></dt>
  36 </dl></dd>
  37 </dl></div>
  38 <div class="section">
  39 <div class="titlepage"><div><div><h3 class="title">
  40 <a name="histogram.overview.introduction"></a><a class="link" href="overview.html#histogram.overview.introduction" title="Introduction">Introduction</a>
  41 </h3></div></div></div>
  42 <p>
  43         <a href="https://en.wikipedia.org/wiki/Histogram" target="_top">Histograms</a> are
  44         a basic tool in statistical analysis. A histogram consists of a number of
  45         non-overlapping cells in data space. When an input value is passed to the
  46         histogram, the corresponding cell that envelopes the value is found and an
  47         associated counter is incremented.
  48       </p>
  49 <p>
  50         When analyzing a large low-dimensional data set, it is more convenient to
  51         work with a histogram of the input values than the original values. Keeping
  52         the cell counts in memory for analysis and/or processing the counts requires
  53         far fewer resources than keeping the original values in memory and processing
  54         them. Information present in the original can also be extracted from the
  55         histogram<a href="#ftn.histogram.overview.introduction.f0" class="footnote" name="histogram.overview.introduction.f0"><sup class="footnote">[1]</sup></a>. Some information is lost in this way, but if the cells are small
  56         enough<a href="#ftn.histogram.overview.introduction.f1" class="footnote" name="histogram.overview.introduction.f1"><sup class="footnote">[2]</sup></a>, the loss is negligible. A histogram is a kind of lossy data-compression.
  57         It is also often used as a simple estimator for the <a href="https://en.wikipedia.org/wiki/Probability_density_function" target="_top">probability
  58         density function</a> of the input data. More complex density estimators
  59         exist, but histograms remain attractive because they are easy to reason about.
  60       </p>
  61 <p>
  62         This library provides a histogram for multi-dimensional data. In the multi-dimensional
  63         case, the input consist of tuples of values which belong together and describing
  64         different aspects of the same entity. For example, when you make a digital
  65         image with a camera, photons hit a pixel detector. The photon is the entity
  66         and it has two coordinates values where it hit the detector. The camera only
  67         counts how often a photon hit each cell, so it is a real-life example of
  68         making a two-dimensional histogram. A two-dimensional histogram collects
  69         more information than two separate one-dimensional histograms, one for each
  70         coordinate. For example, if the two-dimensional image looks like a checker
  71         board, with high and low densities are alternating along each coordinate,
  72         then the one-dimensional histograms along each coordinate would look flat.
  73         There would be no hint that there is a complex structure in two dimensions.
  74       </p>
  75 <p>
  76         The term <span class="emphasis"><em>histogram</em></span> is usually strictly used for something
  77         with cells over discrete or continuous data. This histogram class can also
  78         process categorical variables and it even allows for non-consecutive cells
  79         if that is desired. There is no restriction to numbers as input either. Any
  80         C++ type can be fed into the histogram, if the user provides a specialized
  81         axis class that maps values of this type to a cell index. The only remaining
  82         restriction is that cells are non-overlapping, since there must be a unique
  83         mapping from input value to cell. The library is not able to automatically
  84         ensure this for user-provided axis classes, so the responsibly is on the
  85         user.
  86       </p>
  87 <p>
  88         Furthermore, the histogram can handle weighted input. Normally, the cell
  89         counter which is connected to an input tuple is incremented by one, but sometimes
  90         it is useful to increment by a weight, an integral or floating point number.
  91         Finally, the histogram can be configured to store any kind of accumulator
  92         in each cell. Arbitrary samples can be passed to this accumulator, which
  93         may compute the mean or other interesting quantities from the samples that
  94         are sorted into the cell. When the accumulator computes a mean, the result
  95         is called a <span class="emphasis"><em>profile</em></span>. The feature set is informed by
  96         popular libraries for scientific computing, notably <a href="https://root.cern.ch" target="_top">CERN's
  97         ROOT framework</a> and the <a href="https://www.gnu.org/software/gsl" target="_top">GNU
  98         Scientific Library</a>.
  99       </p>
 100 </div>
 101 <div class="section">
 102 <div class="titlepage"><div><div><h3 class="title">
 103 <a name="histogram.overview.structure"></a><a class="link" href="overview.html#histogram.overview.structure" title="Structure">Structure</a>
 104 </h3></div></div></div>
 105 <div class="toc"><dl class="toc">
 106 <dt><span class="section"><a href="overview.html#histogram.overview.structure.host">Histogram host class</a></span></dt>
 107 <dt><span class="section"><a href="overview.html#histogram.overview.structure.axis">Axis types</a></span></dt>
 108 <dt><span class="section"><a href="overview.html#histogram.overview.structure.storage">Storage types</a></span></dt>
 109 </dl></div>
 110 <p>
 111         The library consists of three orthogonal components:
 112       </p>
 113 <div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
 114 <li class="listitem">
 115             <a class="link" href="overview.html#histogram.overview.structure.host" title="Histogram host class">histogram host class</a>:
 116             The histogram host class defines the public user interface and holds
 117             axis objects (one for each dimension) and a storage object. The user
 118             can chose whether axis objects are stored in a static tuple, a vector,
 119             or a vector of variants.
 120           </li>
 121 <li class="listitem">
 122             <a class="link" href="overview.html#histogram.overview.structure.axis" title="Axis types">axis types</a>:
 123             Defines how input values are mapped to bins. Several axis types are provided
 124             which implement different specializations. Users can make their own axis
 125             types following the axis concept and use them with the library. A variant
 126             type is provided, which can hold one of several concrete axis types.
 127           </li>
 128 <li class="listitem">
 129             <a class="link" href="overview.html#histogram.overview.structure.storage" title="Storage types">storage types</a>:
 130             Manages a collection of bin counters. The requirements for a storage
 131             differ from those of an STL container, it needs to follow the storage
 132             concept. Two implementations are provided.
 133           </li>
 134 </ul></div>
 135 <div class="section">
 136 <div class="titlepage"><div><div><h4 class="title">
 137 <a name="histogram.overview.structure.host"></a><a class="link" href="overview.html#histogram.overview.structure.host" title="Histogram host class">Histogram host class</a>
 138 </h4></div></div></div>
 139 <p>
 140           Histograms store axis objects and a storage object. A one-dimensional histogram
 141           has one axis, a multi-dimensional histogram has several. When you pass
 142           an input tuple, say (v1, v2, v3), then the first axis will map v1 onto
 143           index i1, the second axis v2 onto i2, and so on, to generate the index
 144           tuple (i1, i2, i3). The histogram host class then converts these indices
 145           into a linear global index that is used to address bin counter in the storage
 146           object.
 147         </p>
 148 <div class="note"><table border="0" summary="Note">
 149 <tr>
 150 <td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../../../../doc/src/images/note.png"></td>
 151 <th align="left">Note</th>
 152 </tr>
 153 <tr><td align="left" valign="top"><p>
 154             To understand the need for multi-dimensional histograms, think of point
 155             coordinates. If all points that you consider lie on a line, you need
 156             only one value to describe the point. If all points lie in a plane, you
 157             need two values to describe the position. Three values are needed for
 158             a point in space. A histogram puts a discrete grid over the line, the
 159             plane or the space, and counts how many points lie in each cell of the
 160             grid. To approximate a point distribution on a line, a 1d-histogram is
 161             sufficient. To do the same in 3d-space, one needs a 3d-histogram.
 162           </p></td></tr>
 163 </table></div>
 164 <p>
 165           This library supports different axis types, so that the user can customize
 166           how the mapping is done exactly, see <a class="link" href="overview.html#histogram.overview.structure.axis" title="Axis types">axis
 167           types</a>. Users can furthermore chose between several ways of storing
 168           axis types in the histogram.
 169         </p>
 170 <p>
 171           When the number and types of the axes are known at compile-time, the histogram
 172           host class stores axis types in a <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">tuple</span></code>.
 173           We call this a <span class="emphasis"><em>static histogram</em></span>. To access a particular
 174           axis, one should use a compile-time number as index (a run-time index also
 175           works with some limitations). A static histogram is extremely fast (see
 176           <a class="link" href="benchmarks.html" title="Benchmarks">benchmark</a>), because there is
 177           no overhead and the compiler can inline code, unroll loops, and more. Also,
 178           many user errors are can be caught at compile-time rather than run-time.
 179         </p>
 180 <p>
 181           Static histograms are the best kind, but cannot be used when histograms
 182           are to be created with an axis configuration that is only known at run-time.
 183           This is the case, for example, when histograms are created from Python
 184           or from a graphical user interface. Therefore also more dynamic kinds of
 185           histograms are supported.
 186         </p>
 187 <p>
 188           There are two levels of dynamism. The histogram can hold instances of a
 189           single axis type in a <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">vector</span></code>.
 190           Now the number of axis instances per histogram can vary at run-time, but
 191           the axis type must be the same for all instances. We call this a <span class="emphasis"><em>semi-dynamic
 192           histogram</em></span>.
 193         </p>
 194 <p>
 195           If also the axis types need to vary at run-time, one can place <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">histogram</span><span class="special">::</span><span class="identifier">axis</span><span class="special">::</span><span class="identifier">variant</span></code> type in a <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">vector</span></code>,
 196           which can hold one of a set of different concrete axis types. We call this
 197           a <span class="emphasis"><em>dynamic histogram</em></span>. The dynamic histogram is a single
 198           type that can store arbitrary sequences of different axes types, which
 199           may be generated at run-time. The polymorphic behavior of the generic
 200           <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">histogram</span><span class="special">::</span><span class="identifier">axis</span><span class="special">::</span><span class="identifier">variant</span></code> type has a run-time cost, however.
 201           Typically, the performance is reduced by a factor of two compared to a
 202           static histogram.
 203         </p>
 204 <div class="note"><table border="0" summary="Note">
 205 <tr>
 206 <td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../../../../doc/src/images/note.png"></td>
 207 <th align="left">Note</th>
 208 </tr>
 209 <tr><td align="left" valign="top"><p>
 210             The design decision to store axis types in the variant-like type <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">histogram</span><span class="special">::</span><span class="identifier">axis</span><span class="special">::</span><span class="identifier">variant</span></code> has several advantages over
 211             forms of run-time polymorphism. Firstly, it guarantees that axis objects
 212             which belong to the same histogram are stored locally together in memory,
 213             which reduces cache misses when the histogram iterates over axis objects
 214             in a tight loop, which it often does. Secondly, each axis can accept
 215             a different value type in this scheme. Classic polymorphism with vtables
 216             requires that all classes share the same method call signatures, but
 217             we want different axis to allowed to accepted different types of arguments.
 218             Variants work well for this case.
 219           </p></td></tr>
 220 </table></div>
 221 </div>
 222 <div class="section">
 223 <div class="titlepage"><div><div><h4 class="title">
 224 <a name="histogram.overview.structure.axis"></a><a class="link" href="overview.html#histogram.overview.structure.axis" title="Axis types">Axis types</a>
 225 </h4></div></div></div>
 226 <p>
 227           An axis defines an injective mapping of (a range of) input values to a
 228           bin. The logic is encapsulated in an axis type. Users can create their
 229           own axis classes and use them with the library, by implementing the <a class="link" href="concepts.html#histogram.concepts.Axis" title="Axis"><span class="bold"><strong>Axis</strong></span>
 230           concept</a>. The library comes with four builtin types, which implement
 231           different specializations.
 232         </p>
 233 <div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
 234 <li class="listitem">
 235               <code class="computeroutput"><a class="link" href="../boost/histogram/axis/regular.html" title="Class template regular">boost::histogram::axis::regular</a></code>
 236               sorts real numbers into bins with equal width. The regular axis also
 237               supports monotonic transforms, which are applied when the input values
 238               are passed to the axis. This can be used to make a fast logarithmic
 239               axis, where the bins have equal width in the logarithm of the variable.
 240             </li>
 241 <li class="listitem">
 242               <code class="computeroutput"><a class="link" href="../boost/histogram/axis/variable.html" title="Class template variable">boost::histogram::axis::variable</a></code>
 243               sorts real numbers into bins with varying width.
 244             </li>
 245 <li class="listitem">
 246               <code class="computeroutput"><a class="link" href="../boost/histogram/axis/integer.html" title="Class template integer">boost::histogram::axis::integer</a></code>
 247               is a specialization of a regular axis for a range of integers with
 248               unit bin width. It is much faster than a regular axis.
 249             </li>
 250 <li class="listitem">
 251               <code class="computeroutput"><a class="link" href="../boost/histogram/axis/category.html" title="Class template category">boost::histogram::axis::category</a></code>
 252               is a bijective mapping of unique values onto bin indices and vice versa.
 253               This can be used with discrete categorical data, like "red",
 254               "green", "blue", for example.
 255             </li>
 256 </ul></div>
 257 <p>
 258           Each builtin axis type has a few compile-time options, which change its
 259           behavior.
 260         </p>
 261 <div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
 262 <li class="listitem">
 263               All axis types can have an optional overflow bin. When the overflow
 264               bin is enabled and an input value is above the range covered by the
 265               axis, it is not discarded but counted in the overflow bin.
 266             </li>
 267 <li class="listitem">
 268               All axis types except the category axis can have an optional underflow
 269               bin. When the underflow bin is enabled and an input value is below
 270               the range covered by the axis, it is not discarded but counted in the
 271               underflow bin.
 272             </li>
 273 <li class="listitem">
 274               All axis types except the category axis can be circular, meaning that
 275               the axis range is periodic. This is useful for periodic data like polar
 276               angles.
 277             </li>
 278 <li class="listitem">
 279               All axis types can optionally grow. When an input value is outside
 280               of the range of an axis which is configured to grow, the range of the
 281               axis is extended until the value is in range. This option is incompatible
 282               with the circular option, only either can be active.
 283             </li>
 284 </ul></div>
 285 </div>
 286 <div class="section">
 287 <div class="titlepage"><div><div><h4 class="title">
 288 <a name="histogram.overview.structure.storage"></a><a class="link" href="overview.html#histogram.overview.structure.storage" title="Storage types">Storage types</a>
 289 </h4></div></div></div>
 290 <p>
 291           A storage type holds the actual cell values. It uses a one-dimensional
 292           index for cell lookup, computed by the histogram host from the indices
 293           generated by its axes. The storage needs to know nothing about axes. Users
 294           can integrate their own storage classes with the library, by implementing
 295           the <a class="link" href="concepts.html#histogram.concepts.Storage" title="Storage">storage concept</a>.
 296           Standard containers can be used as storage backends, the library adapts
 297           them with the <code class="computeroutput"><a class="link" href="../boost/histogram/storage_adaptor.html" title="Class template storage_adaptor">boost::histogram::storage_adaptor</a></code>.
 298         </p>
 299 <p>
 300           Cell lookup is often happening in a tight loop and is random-access. A
 301           normal <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">vector</span></code> works well as a storage backend.
 302           Sometimes this is the best solution, but there are some caveats to this
 303           approach. The user has to decide which type should represents the cell
 304           counts and it is not an obvious choice. An integer type needs to be large
 305           enough to avoid counter overflow, but only a fraction of the bits are used
 306           if its capacity is too large. This is a waste of memory then. When memory
 307           is wasted, more cache misses occur and performance is degraded (see the
 308           benchmarks). The performance of modern CPUs depends a lot on effective
 309           utilization of the CPU cache, which is still small. Using floating point
 310           numbers instead of integers is also dangerous. They don't overflow, but
 311           cap the bin count when the bits in the mantissa are used up.
 312         </p>
 313 <p>
 314           The default storage used in the library is <code class="computeroutput"><a class="link" href="../boost/histogram/unlimited_storage.html" title="Class template unlimited_storage">boost::histogram::unlimited_storage</a></code>.
 315           It solves these issues with a dynamic counter type management, based on
 316           the following insight. The <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality" target="_top">curse
 317           of dimensionality</a> makes the total number of bins grow very fast
 318           as the dimension of the histogram grows. However, having many bins also
 319           reduces the typical number of counts per bin, since the input values are
 320           spread over many more bins now. This means a small counter is often sufficient
 321           for high-dimensional histograms.
 322         </p>
 323 <p>
 324           The default storage therefore starts with a minimum amount of memory per
 325           cell, it uses an 1 byte. If the count in any cell is about to overflow,
 326           all cells switch to the next larger integer type simultaneously. This goes
 327           on, the capacity per cell is always doubled when it is used up, until 8
 328           byte per bin are reached. The following images illustrate this progression
 329           for a storage of 3 bin counters. A new memory block is always allocated
 330           for all counters, when the first one of them hits its capacity limit.
 331         </p>
 332 <p>
 333           <span class="inlinemediaobject"><object type="image/svg+xml" data="../../storage_3_uint8.svg" width="65" height="23"></object></span>
 334         </p>
 335 <p>
 336           <span class="inlinemediaobject"><object type="image/svg+xml" data="../../storage_3_uint16.svg" width="129" height="23"></object></span>
 337         </p>
 338 <p>
 339           <span class="inlinemediaobject"><object type="image/svg+xml" data="../../storage_3_uint32.svg" width="256" height="23"></object></span>
 340         </p>
 341 <p>
 342           <span class="inlinemediaobject"><object type="image/svg+xml" data="../../storage_3_uint64.svg" width="511" height="23"></object></span>
 343         </p>
 344 <p>
 345           When even that is not enough, the default storage switches to a multiprecision
 346           type similar to those in <a href="../../../../../libs/multiprecision/index.html" target="_top">Boost.Multiprecision</a>,
 347           whose capacity is limited only by available memory. The following image
 348           is not to scale:
 349         </p>
 350 <p>
 351           <span class="inlinemediaobject"><object type="image/svg+xml" data="../../storage_3_cpp_int.svg" width="511" height="23"></object></span>
 352         </p>
 353 <p>
 354           This approach is not only memory conserving, but also provides the strong
 355           guarantee that bin counters cannot overflow.
 356         </p>
 357 <div class="note"><table border="0" summary="Note">
 358 <tr>
 359 <td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../../../../doc/src/images/note.png"></td>
 360 <th align="left">Note</th>
 361 </tr>
 362 <tr><td align="left" valign="top"><p>
 363             The no-overflow-guarantee only applies when the histogram is not using
 364             weighted fills or if all weights are integral numbers. When floating
 365             point weights are used, the default storage switches to a double counter
 366             per cell to store the sum of such weights. A double cannot provide the
 367             no-overflow-guarantee.
 368           </p></td></tr>
 369 </table></div>
 370 <p>
 371           The best part: this approach is even faster for a histogram with sufficient
 372           size despite the run-time overheads of handling the counter type dynamically.
 373           The benchmarks show that the gains from better cache usage outweigh the
 374           run-time overheads of dynamic dispatching to the right bin counter type
 375           and the occasional allocation costs. Doubling the size of the bin counters
 376           each time helps, because the allocations happen only O(logN) times for
 377           N increments.
 378         </p>
 379 </div>
 380 </div>
 381 <div class="footnotes">
 382 <br><hr style="width:100; text-align:left;margin-left: 0">
 383 <div id="ftn.histogram.overview.introduction.f0" class="footnote"><p><a href="#histogram.overview.introduction.f0" class="para"><sup class="para">[1] </sup></a>
 384           Parameters of interest, like the center of a distribution, can be extracted
 385           from the histogram instead of the original data set; likewise, statistical
 386           models can be fitted to histograms.
 387         </p></div>
 388 <div id="ftn.histogram.overview.introduction.f1" class="footnote"><p><a href="#histogram.overview.introduction.f1" class="para"><sup class="para">[2] </sup></a>
 389           What small enough means has to be decided case by case.
 390         </p></div>
 391 </div>
 392 </div>
 393 <table xmlns:rev="http://www.cs.rpi.edu/~gregod/boost/tools/doc/revision" width="100%"><tr>
 394 <td align="left"></td>
 395 <td align="right"><div class="copyright-footer">Copyright &#169; 2016-2019 Hans
 396       Dembinski<p>
 397         Distributed under the Boost Software License, Version 1.0. (See accompanying
 398         file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
 399       </p>
 400 </div></td>
 401 </tr></table>
 402 <hr>
 403 <div class="spirit-nav">
 404 <a accesskey="p" href="../index.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="getting_started.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a>
 405 </div>
 406 </body>
 407 </html>