<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Performance</title>
<link rel="stylesheet" href="../../../../../doc/src/boostbook.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="../index.html" title="Chapter 1. Fiber">
<link rel="up" href="../index.html" title="Chapter 1. Fiber">
<link rel="prev" href="integration/deeper_dive_into___boost_asio__.html" title="Deeper Dive into Boost.Asio">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="fiber.performance"></a><a class="link" href="performance.html" title="Performance">Performance</a>
</h2></div></div></div>
<div class="toc"><dl class="toc"><dt><span class="section"><a href="performance/tweaking.html">Tweaking</a></span></dt></dl></div>
<p>
Performance measurements were taken using <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">chrono</span><span class="special">::</span><span class="identifier">highresolution_clock</span></code>,
      with overhead corrections. The code was compiled with gcc-6.3.1, using build
      options: variant = release, optimization = speed. Tests were executed on a dual
      socket Intel XEON E5 2620 2.2GHz, 16C/32T, running Linux (x86_64).
</p>
<p>
      Measurements labeled 1C/1T were taken in a single-threaded process.
      The <a href="https://github.com/atemerev/skynet" target="_top">microbenchmark <span class="emphasis"><em>skynet</em></span></a>
from Alexander Temerev was ported and used for performance measurements. At
      the root the test spawns 10 threads-of-execution (ToE), e.g. actor/goroutine/fiber
      etc. Each spawned ToE spawns 10 additional ToEs, and so on, until <span class="bold"><strong>1,000,000</strong></span>
      ToEs are created. Each ToE returns its ordinal number (0 ... 999,999); the
      numbers are summed on the previous level and sent back upstream, until reaching
      the root. The test was run 10-20 times, producing a range of values for each
      measurement.
</p>
<div class="table">
<a name="fiber.performance.time_per_actor_erlang_process_goroutine__other_languages___average_over_1_000_000_"></a><p class="title"><b>Table 1.1. time per actor/erlang process/goroutine (other languages) (average over
    1,000,000)</b></p>
<div class="table-contents"><table class="table" summary="time per actor/erlang process/goroutine (other languages) (average over
    1,000,000)">
<colgroup>
<col>
<col>
<col>
</colgroup>
<thead><tr>
<th>
<p>
                    Haskell | stack-1.4.0
</p>
</th>
<th>
<p>
                    Go | go1.8
</p>
</th>
<th>
<p>
                    Erlang | erts-8.3
</p>
</th>
</tr></thead>
<tbody><tr>
<td>
<p>
                    0.05 µs - 0.06 µs
</p>
</td>
<td>
<p>
                    0.45 µs - 0.52 µs
</p>
</td>
<td>
<p>
                    0.63 µs - 0.73 µs
                  </p>
                </td>
</tr></tbody>
</table></div>
</div>
<br class="table-break"><p>
      Pthreads are created with a stack size of 8kB while <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">thread</span></code>
      uses the system default (1MB - 2MB). The microbenchmark could not be run
      with 1,000,000 threads because of resource exhaustion (for both pthread and
      std::thread); instead the test was run with only <span class="bold"><strong>10,000</strong></span>
      threads.
    </p>
<div class="table">
<a name="fiber.performance.time_per_thread__average_over__10_000____unable_to_spawn_1_000_000_threads_"></a><p class="title"><b>Table 1.2. time per thread (average over 10,000 - unable to spawn 1,000,000 threads)</b></p>
<div class="table-contents"><table class="table" summary="time per thread (average over 10,000 - unable to spawn 1,000,000 threads)">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th>
                  <p>
                    pthread
                  </p>
                </th>
<th>
                  <p>
                    <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">thread</span></code>
                  </p>
                </th>
</tr></thead>
<tbody><tr>
<td>
                  <p>
                    54 µs - 73 µs
</p>
</td>
<td>
<p>
                    52 µs - 73 µs
</p>
</td>
</tr></tbody>
</table></div>
</div>
<br class="table-break"><p>
      The test utilizes 16 cores with Symmetric MultiThreading enabled (32 logical
CPUs). The fiber stacks are allocated by <a class="link" href="stack.html#class_fixedsize_stack"><code class="computeroutput">fixedsize_stack</code></a>.
</p>
<p>
As the benchmark shows, the memory allocation algorithm is significant for
performance in a multithreaded environment. The tests use glibc’s memory allocation
algorithm (based on ptmalloc2) as well as Google’s <a href="http://goog-perftools.sourceforge.net/doc/tcmalloc.html" target="_top">TCmalloc</a>
      (via linkflags="-ltcmalloc").<a href="#ftn.fiber.performance.f0" class="footnote" name="fiber.performance.f0"><sup class="footnote">[7]</sup></a>
</p>
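<p>
      The TCmalloc variant can be built by passing the link flag to Boost.Build;
      a sketch of the invocation (toolset and properties are assumptions):
    </p>

```shell
# Rebuild the benchmark against TCmalloc instead of glibc's ptmalloc2
# (sketch; the exact target name is an assumption):
b2 toolset=gcc variant=release optimization=speed linkflags="-ltcmalloc"
```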
<p>
      In the <a class="link" href="scheduling.html#class_work_stealing"><code class="computeroutput">work_stealing</code></a> scheduling algorithm, each thread has
      its own local queue. Fibers that are ready to run are pushed to and popped
      from the local queue. If the queue runs out of ready fibers, fibers are stolen
      from the local queues of other participating threads.
</p>
<div class="table">
<a name="fiber.performance.time_per_fiber__average_over_1_000_000_"></a><p class="title"><b>Table 1.3. time per fiber (average over 1,000,000)</b></p>
<div class="table-contents"><table class="table" summary="time per fiber (average over 1,000,000)">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th>
<p>
                    fiber (16C/32T, work stealing, tcmalloc)
</p>
</th>
<th>
                  <p>
                    fiber (1C/1T, round robin, tcmalloc)
</p>
</th>
</tr></thead>
<tbody><tr>
<td>
<p>
                    0.05 µs - 0.11 µs
</p>
</td>
<td>
<p>
                    1.69 µs - 1.79 µs
</p>
</td>
</tr></tbody>
</table></div>
</div>
<br class="table-break"><div class="footnotes">
<br><hr style="width:100; text-align:left;margin-left: 0">
<div id="ftn.fiber.performance.f0" class="footnote"><p><a href="#fiber.performance.f0" class="para"><sup class="para">[7] </sup></a>
Tais B. Ferreira, Rivalino Matias, Autran Macedo, Lucio B. Araujo <span class="quote">“<span class="quote">An
Experimental Study on Memory Allocators in Multicore and Multithreaded Applications</span>”</span>,
PDCAT ’11 Proceedings of the 2011 12th International Conference on Parallel