Imported Upstream version 1.64.0
diff --git a/libs/fiber/doc/html/fiber/performance.html b/libs/fiber/doc/html/fiber/performance.html
index 2eefb33..66a1c32 100644
--- a/libs/fiber/doc/html/fiber/performance.html
+++ b/libs/fiber/doc/html/fiber/performance.html
@@ -3,7 +3,7 @@
 <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
 <title>Performance</title>
 <link rel="stylesheet" href="../../../../../doc/src/boostbook.css" type="text/css">
-<meta name="generator" content="DocBook XSL Stylesheets V1.75.2">
+<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
 <link rel="home" href="../index.html" title="Chapter&#160;1.&#160;Fiber">
 <link rel="up" href="../index.html" title="Chapter&#160;1.&#160;Fiber">
 <link rel="prev" href="integration/deeper_dive_into___boost_asio__.html" title="Deeper Dive into Boost.Asio">
 <div class="titlepage"><div><div><h2 class="title" style="clear: both">
 <a name="fiber.performance"></a><a class="link" href="performance.html" title="Performance">Performance</a>
 </h2></div></div></div>
-<div class="toc"><dl><dt><span class="section"><a href="performance/tweaking.html">Tweaking</a></span></dt></dl></div>
+<div class="toc"><dl class="toc"><dt><span class="section"><a href="performance/tweaking.html">Tweaking</a></span></dt></dl></div>
 <p>
       Performance measurements were taken using <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">chrono</span><span class="special">::</span><span class="identifier">high_resolution_clock</span></code>,
-      with overhead corrections. The code was compiled with gcc-6.20, using build
-      options: variant = release, optimization = speed. Tests were executed on Intel
-      Core i7-4770S 3.10GHz, 4 cores with 8 hyperthreads (4C/8T), running Linux (4.7.0/x86_64).
+      with overhead corrections. The code was compiled with gcc-6.3.1, using build
+      options: variant = release, optimization = speed. Tests were executed on
+      a dual-socket Intel Xeon E5-2620 2.2GHz, 16C/32T, running Linux (x86_64).
     </p>
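+<p>
+      The overhead correction can be sketched as follows (a minimal illustration
+      of the technique, not the benchmark&#8217;s actual timer code): the smallest
+      observable interval between two consecutive clock reads is estimated first
+      and later subtracted from each measured duration.
+    </p>
+<pre class="programlisting">#include &lt;algorithm&gt;
+#include &lt;chrono&gt;
+
+using clock_type = std::chrono::high_resolution_clock;
+
+// estimate the clock's own overhead: the smallest observable
+// interval between two consecutive now() calls
+clock_type::duration clock_overhead() {
+    clock_type::duration overhead = clock_type::duration::max();
+    for ( int i = 0; i &lt; 1000; ++i) {
+        auto start = clock_type::now();
+        auto stop = clock_type::now();
+        overhead = std::min( overhead, stop - start);
+    }
+    return overhead;
+}
+</pre>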
 <p>
       Measurements headed 1C/1T were run in a single-threaded process.
       The <a href="https://github.com/atemerev/skynet" target="_top">microbenchmark <span class="emphasis"><em>skynet</em></span></a>
       from Alexander Temerev was ported and used for performance measurements. At
       the root the test spawns 10 threads-of-execution (ToE), e.g. actor/goroutine/fiber
-      etc.. Each spawned ToE spawns additional 10 ToEs ... until 100000 ToEs are
-      created. ToEs return back ther ordinal numbers (0 ... 99999), which are summed
-      on the previous level and sent back upstream, until reaching the root. The
-      test was run 10-20 times, producing a range of values for each measurement.
+      etc. Each spawned ToE spawns 10 additional ToEs ... until <span class="bold"><strong>1,000,000</strong></span>
+      ToEs are created. ToEs return their ordinal numbers (0 ... 999,999), which
+      are summed on the previous level and sent back upstream, until reaching the
+      root. The test was run 10-20 times, producing a range of values for each measurement.
     </p>
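+<p>
+      A condensed sketch of the ported benchmark&#8217;s recursion, expressed with
+      Boost.Fiber primitives (the channel capacities and the use of
+      <code class="computeroutput"><span class="identifier">buffered_channel</span></code>
+      are illustrative assumptions, not taken verbatim from the benchmark sources):
+    </p>
+<pre class="programlisting">#include &lt;cstddef&gt;
+#include &lt;cstdint&gt;
+#include &lt;functional&gt;
+
+#include &lt;boost/fiber/all.hpp&gt;
+
+using channel_t = boost::fibers::buffered_channel&lt; std::uint64_t &gt;;
+
+// each ToE either reports its ordinal number (leaf) or spawns
+// `div` children and sends the sum of their results upstream
+void skynet( channel_t &amp; c, std::size_t num, std::size_t size, std::size_t div) {
+    if ( 1 == size) {
+        c.push( num);
+    } else {
+        channel_t rc{ 16 };
+        for ( std::size_t i = 0; i &lt; div; ++i) {
+            auto sub_num = num + i * size / div;
+            boost::fibers::fiber{ skynet, std::ref( rc), sub_num, size / div, div }.detach();
+        }
+        std::uint64_t sum = 0;
+        for ( std::size_t i = 0; i &lt; div; ++i) {
+            std::uint64_t v;
+            rc.pop( v);
+            sum += v;
+        }
+        c.push( sum);
+    }
+}
+
+int main() {
+    channel_t rc{ 2 };
+    boost::fibers::fiber{ skynet, std::ref( rc),
+                          std::size_t( 0), std::size_t( 1000000), std::size_t( 10) }.detach();
+    std::uint64_t result;
+    rc.pop( result); // sum of 0 ... 999,999 == 499999500000
+    return 0;
+}
+</pre>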
 <div class="table">
-<a name="fiber.performance.observed_time_to_run_100_000_actors_erlang_process__goroutines__other_languages_"></a><p class="title"><b>Table&#160;1.1.&#160;observed time to run 100,000 actors/erlang process'/goroutines (other
-      languages)</b></p>
-<div class="table-contents"><table class="table" summary="observed time to run 100,000 actors/erlang process'/goroutines (other
-      languages)">
+<a name="fiber.performance.time_per_actor_erlang_process_goroutine__other_languages___average_over_1_000_000_"></a><p class="title"><b>Table&#160;1.1.&#160;time per actor/erlang process/goroutine (other languages) (average over
+      1,000,000)</b></p>
+<div class="table-contents"><table class="table" summary="time per actor/erlang process/goroutine (other languages) (average over
+      1,000,000)">
 <colgroup>
 <col>
 <col>
 <col>
-<col>
 </colgroup>
 <thead><tr>
 <th>
               <p>
-                Haskell | stack-1.0.4
-              </p>
-            </th>
-<th>
-              <p>
-                Erlang | erts-7.0
+                Haskell | stack-1.4.0
               </p>
             </th>
 <th>
               <p>
-                Go | go1.6.1 (GOMAXPROCS == default)
+                Go | go1.8
               </p>
             </th>
 <th>
               <p>
-                Go | go1.6.1 (GOMAXPROCS == 8)
+                Erlang | erts-8.3
               </p>
             </th>
 </tr></thead>
 <tbody><tr>
 <td>
               <p>
-                0.32 &#181;s
+                0.05 &#181;s - 0.06 &#181;s
               </p>
             </td>
 <td>
               <p>
-                0.64 &#181;s - 1.21 &#181;s
+                0.45 &#181;s - 0.52 &#181;s
               </p>
             </td>
 <td>
               <p>
-                1.52 &#181;s - 1.64 &#181;s
+                0.63 &#181;s - 0.73 &#181;s
+              </p>
+            </td>
+</tr></tbody>
+</table></div>
+</div>
+<br class="table-break"><p>
+      Pthreads are created with a stack size of 8kB, while instances of <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">thread</span></code>
+      use the system default (1MB - 2MB). The microbenchmark could not be run with
+      1,000,000 threads because of resource exhaustion (for both pthread and std::thread).
+      Instead, the test spawns only <span class="bold"><strong>10,000</strong></span> threads.
+    </p>
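+<p>
+      The reduced pthread stack can be requested as sketched below; this is an
+      illustration, not the benchmark code. Note that <code class="computeroutput"><span class="identifier">pthread_attr_setstacksize</span><span class="special">()</span></code>
+      rejects values below <code class="computeroutput"><span class="identifier">PTHREAD_STACK_MIN</span></code>,
+      so the effective minimum may be larger than 8kB on some platforms.
+    </p>
+<pre class="programlisting">#include &lt;algorithm&gt;
+#include &lt;cstddef&gt;
+
+#include &lt;limits.h&gt;
+#include &lt;pthread.h&gt;
+
+void * worker( void *) { return nullptr; }
+
+int main() {
+    pthread_attr_t attr;
+    pthread_attr_init( &amp; attr);
+    // request a small stack; clamp to the platform's minimum
+    std::size_t stacksize = std::max&lt; std::size_t &gt;( 8 * 1024, PTHREAD_STACK_MIN);
+    pthread_attr_setstacksize( &amp; attr, stacksize);
+    pthread_t t;
+    if ( 0 == pthread_create( &amp; t, &amp; attr, worker, nullptr) ) {
+        pthread_join( t, nullptr);
+    }
+    pthread_attr_destroy( &amp; attr);
+    return 0;
+}
+</pre>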
+<div class="table">
+<a name="fiber.performance.time_per_thread__average_over__10_000____unable_to_spawn_1_000_000_threads_"></a><p class="title"><b>Table&#160;1.2.&#160;time per thread (average over *10,000* - unable to spawn 1,000,000 threads)</b></p>
+<div class="table-contents"><table class="table" summary="time per thread (average over *10,000* - unable to spawn 1,000,000 threads)">
+<colgroup>
+<col>
+<col>
+</colgroup>
+<thead><tr>
+<th>
+              <p>
+                pthread
+              </p>
+            </th>
+<th>
+              <p>
+                <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">thread</span></code>
+              </p>
+            </th>
+</tr></thead>
+<tbody><tr>
+<td>
+              <p>
+                54 &#181;s - 73 &#181;s
               </p>
             </td>
 <td>
               <p>
-                0.70 &#181;s - 0.98 &#181;s
+                52 &#181;s - 73 &#181;s
               </p>
             </td>
 </tr></tbody>
 </table></div>
 </div>
 <br class="table-break"><p>
-      The test utilizes 4 cores with Symmetric MultiThreading enabled (8 logical
+      The test utilizes 16 cores with Simultaneous MultiThreading enabled (32 logical
       CPUs). The fiber stacks are allocated by <a class="link" href="stack.html#class_fixedsize_stack"><code class="computeroutput">fixedsize_stack</code></a>.
     </p>
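+<p>
+      Allocating a fiber&#8217;s stack with <a class="link" href="stack.html#class_fixedsize_stack"><code class="computeroutput">fixedsize_stack</code></a>
+      looks like the following sketch; the benchmark&#8217;s actual fiber bodies
+      are not reproduced here.
+    </p>
+<pre class="programlisting">#include &lt;memory&gt;
+
+#include &lt;boost/fiber/all.hpp&gt;
+
+int main() {
+    // pass a stack allocator to the fiber constructor; a custom size in
+    // bytes could be given to fixedsize_stack instead of the default
+    boost::fibers::fiber f{
+        std::allocator_arg,
+        boost::fibers::fixedsize_stack{},
+        []{ /* benchmark body */ } };
+    f.join();
+    return 0;
+}
+</pre>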
 <p>
       As the benchmark shows, the memory allocation algorithm is significant for
       performance in a multithreaded environment. The tests use glibc&#8217;s memory allocation
       algorithm (based on ptmalloc2) as well as Google&#8217;s <a href="http://goog-perftools.sourceforge.net/doc/tcmalloc.html" target="_top">TCMalloc</a>
-      (via linkflags="-ltcmalloc").<sup>[<a name="fiber.performance.f0" href="#ftn.fiber.performance.f0" class="footnote">9</a>]</sup>
+      (via linkflags="-ltcmalloc").<a href="#ftn.fiber.performance.f0" class="footnote" name="fiber.performance.f0"><sup class="footnote">[7]</sup></a>
     </p>
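+<p>
+      Given the build options quoted earlier, linking the benchmark against
+      TCMalloc with b2 would look roughly like this (the exact invocation is an
+      assumption, not taken from the build scripts):
+    </p>
+<pre class="programlisting">b2 toolset=gcc variant=release optimization=speed linkflags="-ltcmalloc"
+</pre>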
 <p>
-      The <a class="link" href="scheduling.html#class_shared_work"><code class="computeroutput">shared_work</code></a> scheduling algorithm uses one global queue,
-      containing fibers ready to run, shared between all threads. The work is distributed
-      equally over all threads. In the <a class="link" href="scheduling.html#class_work_stealing"><code class="computeroutput">work_stealing</code></a> scheduling
-      algorithm, each thread has its own local queue. Fibers that are ready to run
-      are pushed to and popped from the local queue. If the queue runs out of ready
-      fibers, fibers are stolen from the local queues of other participating threads.
+      In the <a class="link" href="scheduling.html#class_work_stealing"><code class="computeroutput">work_stealing</code></a> scheduling algorithm, each thread has
+      its own local queue. Fibers that are ready to run are pushed to and popped
+      from the local queue. If the queue runs out of ready fibers, fibers are stolen
+      from the local queues of other participating threads.
     </p>
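+<p>
+      Installing the algorithm on every participating thread can be sketched as
+      follows (thread count and fiber workload are placeholders):
+    </p>
+<pre class="programlisting">#include &lt;cstdint&gt;
+#include &lt;thread&gt;
+#include &lt;vector&gt;
+
+#include &lt;boost/fiber/all.hpp&gt;
+
+// every thread that takes part in stealing installs the same algorithm,
+// parameterized with the total number of participating threads
+void worker( std::uint32_t thread_count) {
+    boost::fibers::use_scheduling_algorithm&lt;
+        boost::fibers::algo::work_stealing &gt;( thread_count);
+    // ... launch and/or steal fibers until all work is done ...
+}
+
+int main() {
+    std::uint32_t n = std::thread::hardware_concurrency();
+    std::vector&lt; std::thread &gt; threads;
+    for ( std::uint32_t i = 1; i &lt; n; ++i) {
+        threads.emplace_back( worker, n);
+    }
+    worker( n); // the main thread participates too
+    for ( auto &amp; t : threads) {
+        t.join();
+    }
+    return 0;
+}
+</pre>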
 <div class="table">
-<a name="fiber.performance.observed_time_to_run_100_000_fibers"></a><p class="title"><b>Table&#160;1.2.&#160;observed time to run 100,000 fibers</b></p>
-<div class="table-contents"><table class="table" summary="observed time to run 100,000 fibers">
+<a name="fiber.performance.time_per_fiber__average_over_1_000_000_"></a><p class="title"><b>Table&#160;1.3.&#160;time per fiber (average over 1.000.000)</b></p>
+<div class="table-contents"><table class="table" summary="time per fiber (average over 1.000.000)">
 <colgroup>
 <col>
 <col>
-<col>
-<col>
-<col>
-<col>
 </colgroup>
 <thead><tr>
 <th>
               <p>
-                fiber (1C/1T, round robin)
+                fiber (16C/32T, work stealing, tcmalloc)
               </p>
             </th>
 <th>
               <p>
                 fiber (1C/1T, round robin, tcmalloc)
               </p>
             </th>
-<th>
-              <p>
-                fiber (4C/8T, work sharing)
-              </p>
-            </th>
-<th>
-              <p>
-                fiber (4C/8T, work sharing, tcmalloc)
-              </p>
-            </th>
-<th>
-              <p>
-                fiber (4C/8T, work stealing)
-              </p>
-            </th>
-<th>
-              <p>
-                fiber (4C/8T, work stealing, tcmalloc)
-              </p>
-            </th>
 </tr></thead>
 <tbody><tr>
 <td>
               <p>
-                0.91 &#181;s - 1.28 &#181;s
-              </p>
-            </td>
-<td>
-              <p>
-                0.90 &#181;s - 1.03 &#181;s
-              </p>
-            </td>
-<td>
-              <p>
-                0.90 &#181;s - 1.11 &#181;s
-              </p>
-            </td>
-<td>
-              <p>
-                0.62 &#181;s - 0.80 &#181;s
-              </p>
-            </td>
-<td>
-              <p>
-                0.35 &#181;s - 0.66 &#181;s
+                0.05 &#181;s - 0.11 &#181;s
               </p>
             </td>
 <td>
               <p>
-                0.13 &#181;s - 0.26 &#181;s
+                1.69 &#181;s - 1.79 &#181;s
               </p>
             </td>
 </tr></tbody>
 </table></div>
 </div>
 <br class="table-break"><div class="footnotes">
-<br><hr width="100" align="left">
-<div class="footnote"><p><sup>[<a name="ftn.fiber.performance.f0" href="#fiber.performance.f0" class="para">9</a>] </sup>
+<br><hr style="width:100; text-align:left;margin-left: 0">
+<div id="ftn.fiber.performance.f0" class="footnote"><p><a href="#fiber.performance.f0" class="para"><sup class="para">[7] </sup></a>
         Tais B. Ferreira, Rivalino Matias, Autran Macedo, Lucio B. Araujo <span class="quote">&#8220;<span class="quote">An
         Experimental Study on Memory Allocators in Multicore and Multithreaded Applications</span>&#8221;</span>,
         PDCAT &#8217;11 Proceedings of the 2011 12th International Conference on Parallel