:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.
-HOW MCA WORKS
--------------
+HOW LLVM-MCA WORKS
+------------------
-MCA takes assembly code as input. The assembly code is parsed into a sequence
-of MCInst with the help of the existing LLVM target assembly parsers. The
-parsed sequence of MCInst is then analyzed by a ``Pipeline`` module to generate
-a performance report.
+:program:`llvm-mca` takes assembly code as input. The assembly code is parsed
+into a sequence of MCInst with the help of the existing LLVM target assembly
+parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
+to generate a performance report.
The Pipeline module simulates the execution of the machine code sequence in a
loop of iterations (default is 100). During this process, the pipeline collects
a number of execution related statistics. At the end of this process, the
pipeline generates and prints a report from the collected statistics.
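The shape of this simulation loop can be pictured with a small sketch (hypothetical Python with an assumed per-instruction ``latency`` attribute, not the tool's actual C++ implementation, which models dispatch, issue, execute, and retire in far more detail):

```python
# Hypothetical sketch of the iteration loop: the same instruction sequence
# is replayed a fixed number of times, and per-instruction events feed the
# collected statistics.

def simulate(instructions, iterations=100):
    """Replay the sequence `iterations` times, accumulating a naive
    cycle count from each instruction's assumed latency."""
    stats = {"cycles": 0, "instructions": 0}
    for _ in range(iterations):
        for inst in instructions:
            # In the real pipeline, latencies come from the target's
            # scheduling model; here they are invented for illustration.
            stats["cycles"] += inst["latency"]
            stats["instructions"] += 1
    return stats

# Two instructions with assumed latencies, default 100 iterations.
report = simulate([{"opcode": "vmulps", "latency": 2},
                   {"opcode": "vhaddps", "latency": 3}])
```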
-Here is an example of a performance report generated by MCA for a dot-product
-of two packed float vectors of four elements. The analysis is conducted for
-target x86, cpu btver2. The following result can be produced via the following
-command using the example located at
+Here is an example of a performance report generated by the tool for a
+dot-product of two packed float vectors of four elements. The analysis is
+conducted for target x86, cpu btver2. The following result can be produced via
+the following command using the example located at
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:
.. code-block:: bash
Timeline View
^^^^^^^^^^^^^
-MCA's timeline view produces a detailed report of each instruction's state
+The timeline view produces a detailed report of each instruction's state
transitions through an instruction pipeline. This view is enabled by the
command line option ``-timeline``. As instructions transition through the
various stages of the pipeline, their states are depicted in the view report.
Below is the timeline view for a subset of the dot-product example located in
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
-MCA using the following command:
+:program:`llvm-mca` using the following command:
.. code-block:: bash
2.     3     5.7    0.0    0.0    vhaddps %xmm3, %xmm3, %xmm4
The timeline view is interesting because it shows instruction state changes
-during execution. It also gives an idea of how MCA processes instructions
+during execution. It also gives an idea of how the tool processes instructions
executed on the target, and how their timing information might be calculated.
The timeline view is structured in two tables. The first table shows
Table *Average Wait times* helps diagnose performance issues that are caused by
the presence of long latency instructions and potentially long data dependencies
-which may limit the ILP. Note that MCA, by default, assumes at least 1cy
-between the dispatch event and the issue event.
+which may limit the ILP. Note that :program:`llvm-mca`, by default, assumes at
+least 1cy between the dispatch event and the issue event.
When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the *ready* state is expected
the target scheduling model.
Instructions that are dispatched to the schedulers consume scheduler buffer
-entries. MCA queries the scheduling model to determine the set of
-buffered resources consumed by an instruction. Buffered resources are treated
-like scheduler resources.
+entries. :program:`llvm-mca` queries the scheduling model to determine the set
+of buffered resources consumed by an instruction. Buffered resources are
+treated like scheduler resources.
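The buffer accounting at dispatch time can be sketched as follows (hypothetical Python; the buffer names and entry counts are illustrative and not taken from any real scheduling model):

```python
# Hypothetical sketch of scheduler-buffer accounting at dispatch time:
# each dispatched instruction consumes one entry of every buffered
# resource it uses, and dispatch stalls when a buffer is exhausted.

buffers = {"JALU01": 20, "JFPU01": 18}   # free entries per buffer (assumed)

def dispatch(inst, buffers):
    """Consume one entry of each buffered resource used by `inst`;
    return False (a dispatch stall) if any buffer has no free entry."""
    needed = inst["buffered_resources"]
    if any(buffers[name] == 0 for name in needed):
        return False
    for name in needed:
        buffers[name] -= 1
    return True

ok = dispatch({"buffered_resources": ["JFPU01"]}, buffers)
```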
Instruction Issue
"""""""""""""""""
has to wait in the scheduler's buffer until input register operands become
available. Only at that point does the instruction become eligible for
execution, and it may be issued (potentially out-of-order).
-Instruction latencies are computed by MCA with the help of the scheduling
-model.
-
-MCA's scheduler is designed to simulate multiple processor schedulers. The
-scheduler is responsible for tracking data dependencies, and dynamically
-selecting which processor resources are consumed by instructions.
-
-The scheduler delegates the management of processor resource units and resource
-groups to a resource manager. The resource manager is responsible for
-selecting resource units that are consumed by instructions. For example, if an
-instruction consumes 1cy of a resource group, the resource manager selects one
-of the available units from the group; by default, the resource manager uses a
+Instruction latencies are computed by :program:`llvm-mca` with the help of the
+scheduling model.
+
+:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
+schedulers. The scheduler is responsible for tracking data dependencies, and
+dynamically selecting which processor resources are consumed by instructions.
+It delegates the management of processor resource units and resource groups to a
+resource manager. The resource manager is responsible for selecting resource
+units that are consumed by instructions. For example, if an instruction
+consumes 1cy of a resource group, the resource manager selects one of the
+available units from the group; by default, the resource manager uses a
round-robin selector to guarantee that resource usage is uniformly distributed
between all units of a group.
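The default selection policy can be sketched like this (hypothetical Python; the unit names mirror btver2's FP pipes but the class is illustrative, not the tool's resource manager):

```python
# Hypothetical sketch of round-robin unit selection within a resource
# group, so that usage is uniformly distributed between all units.

class RoundRobinSelector:
    def __init__(self, units):
        self.units = list(units)
        self.next_index = 0

    def select(self):
        """Pick the next unit in the group, cycling through all units."""
        unit = self.units[self.next_index]
        self.next_index = (self.next_index + 1) % len(self.units)
        return unit

group = RoundRobinSelector(["JFPU0", "JFPU1"])
picks = [group.select() for _ in range(4)]
```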
-MCA's scheduler implements three instruction queues:
+:program:`llvm-mca`'s scheduler implements three instruction queues:
* WaitQueue: a queue of instructions whose operands are not ready.
* ReadyQueue: a queue of instructions ready to execute.
Every cycle, the scheduler checks if instructions can be moved from the
WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
-issued. The algorithm prioritizes older instructions over younger
-instructions.
+issued to the underlying pipelines. The algorithm prioritizes older instructions
+over younger instructions.
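The per-cycle queue transitions described above can be sketched as follows (hypothetical Python; the ``seq`` field stands in for instruction age, and ``issue_width`` is an assumed parameter):

```python
# Hypothetical sketch of one scheduler cycle: instructions whose input
# registers are all available move from the wait queue to the ready
# queue, and issue prioritizes older instructions (lower sequence number).

def cycle(wait_queue, ready_queue, ready_regs, issue_width=2):
    still_waiting = []
    for inst in wait_queue:
        if all(reg in ready_regs for reg in inst["uses"]):
            ready_queue.append(inst)
        else:
            still_waiting.append(inst)
    wait_queue[:] = still_waiting
    # Oldest-first issue order.
    ready_queue.sort(key=lambda inst: inst["seq"])
    issued = ready_queue[:issue_width]
    ready_queue[:] = ready_queue[issue_width:]
    return issued

wait = [{"seq": 1, "uses": ["xmm0"]}, {"seq": 2, "uses": ["xmm9"]}]
ready = [{"seq": 0, "uses": []}]
issued = cycle(wait, ready, ready_regs={"xmm0"})
```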
Write-Back and Retire Stage
"""""""""""""""""""""""""""
Load/Store Unit and Memory Consistency Model
""""""""""""""""""""""""""""""""""""""""""""
-To simulate an out-of-order execution of memory operations, MCA utilizes a
-simulated load/store unit (LSUnit) to simulate the speculative execution of
-loads and stores.
+To simulate an out-of-order execution of memory operations, :program:`llvm-mca`
+utilizes a simulated load/store unit (LSUnit) to simulate the speculative
+execution of loads and stores.
-Each load (or store) consumes an entry in the load (or store) queue. The
-number of slots in the load/store queues is unknown by MCA, since there is no
-mention of it in the scheduling model. In practice, users can specify flags
-``-lqueue`` and ``-squeue`` to limit the number of entries in the load and
-store queues respectively. The queues are unbounded by default.
+Each load (or store) consumes an entry in the load (or store) queue. Users can
+specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
+load and store queues respectively. The queues are unbounded by default.
The LSUnit implements a relaxed consistency model for memory loads and stores.
The rules are:
loads, the scheduling model provides an "optimistic" load-to-use latency (which
usually matches the load-to-use latency for when there is a hit in the L1D).
-MCA does not know about serializing operations or memory-barrier like
-instructions. The LSUnit conservatively assumes that an instruction which has
-both "MayLoad" and unmodeled side effects behaves like a "soft" load-barrier.
-That means, it serializes loads without forcing a flush of the load queue.
-Similarly, instructions that "MayStore" and have unmodeled side effects are
-treated like store barriers. A full memory barrier is a "MayLoad" and
-"MayStore" instruction with unmodeled side effects. This is inaccurate, but it
-is the best that we can do at the moment with the current information available
-in LLVM.
+:program:`llvm-mca` does not know about serializing operations or
+memory-barrier-like instructions. The LSUnit conservatively assumes that an
+instruction which has both "MayLoad" and unmodeled side effects behaves like a
+"soft" load-barrier; that is, it serializes loads without forcing a flush of
+the load queue. Similarly, instructions that "MayStore" and have unmodeled side
+effects are treated like store barriers. A full memory barrier is a "MayLoad"
+and "MayStore" instruction with unmodeled side effects. This is inaccurate, but
+it is the best that we can do at the moment with the current information
+available in LLVM.
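The conservative classification above can be summarized in a short sketch (hypothetical Python; the flag names mirror the "MayLoad"/"MayStore"/unmodeled-side-effects properties mentioned in the text, not an actual LLVM API):

```python
# Hypothetical sketch of the conservative barrier classification:
# unmodeled side effects turn a load into a "soft" load-barrier, a store
# into a store-barrier, and a load+store into a full memory barrier.

def classify_barrier(may_load, may_store, has_unmodeled_side_effects):
    if not has_unmodeled_side_effects:
        return None
    if may_load and may_store:
        return "full-memory-barrier"
    if may_load:
        return "load-barrier"   # serializes loads, no load-queue flush
    if may_store:
        return "store-barrier"  # serializes stores
    return None

kind = classify_barrier(may_load=True, may_store=True,
                        has_unmodeled_side_effects=True)
```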
A load/store barrier consumes one entry of the load/store queue. A load/store
barrier enforces ordering of loads/stores. A younger load cannot pass a load