--- /dev/null
+# Perf Score
+
+## What is it:
+A new JIT feature that estimates the dynamic execution cost of a method.
+
+## Who is the customer:
+Developers who work on the JIT code base.
+
+## What problem does it solve:
+Currently, JIT developers generate "Asm Diffs" when making a change to the JIT to see whether their change improves the JITted code.
+We have a tool called "jit-diffs" which allows a developer to quickly find methods that generate different assembly code for two different JITs.
+
+The "jit-analyze" tool will sort this information and provide a list of methods that have the largest decreases and increases in codesize.
+For many methods, smaller codesize usually means faster execution and larger codesize means slower execution.
+But for complex methods that have loops it is not a very good metric for measuring code quality.
+For such methods we want the JIT to improve the code generated for the innermost loop and are willing to pay extra costs outside the loop.
+The total codesize for the method does not properly classify regressions or improvements for changes to code inside of loops.
+
+Instead of using total codesize as the metric, it is better to have the JIT produce an estimate of the dynamic execution cost of a method.
+This new dynamic execution cost of the method is called the "Perf Score" of the method.
+The "jit-analyze" tool would also be changed so that instead of using codesize it would use the Perf Score to determine the list of methods to display.
+
+## Implementation details:
+Computing a highly accurate value for the actual dynamic execution cost of a method on a particular CPU is a very hard problem.
+The difficulty occurs because all modern hardware executes instructions using a superscalar architecture.
+This approach partitions the execution of each instruction into a series of steps called a pipeline, and the CPU manages multiple instruction pipelines, each executing concurrently.
+With modern hardware there can easily be twenty or more instructions executing at one time.
+In order to accurately compute the actual dynamic execution cost a software model of this process would need to be implemented and it would be large and complex.
+
+This current Perf Score feature does **NOT** propose to try to accurately model the actual dynamic execution cost.
+Instead we propose to return an estimate of the dynamic execution cost.
+We are willing to make several simplifying assumptions that allow us to more quickly and easily return a reasonable estimate to use as the Perf Score.
+
+### Modern CPU Hardware documentation:
+Hardware vendors, such as Intel and ARM, provide information for assembly language programmers and compiler writers.
+They typically provide the following details for instructions:
+ - Latency -- The number of clock cycles that are required for the execution to complete and produce a result after all inputs are available.
+ - Throughput -- The number of clock cycles required to wait before the issue ports are free to accept another instruction. This value is often less than one cycle.
+ - Pipeline Ports -- Each instruction uses certain pipeline functions, such as Load Memory, Store Memory, Integer ALU, Floating Point Multiply, etc...
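+
+As a rough sketch, the vendor-documented details above can be modeled with a small record type. The names and example values below are illustrative only; they are not the actual JIT type and not exact figures for any CPU:
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class InsExecutionCharacteristics:
+    """Per-instruction execution details, in the spirit of vendor tables."""
+    throughput: float  # cycles before the issue ports can accept another instruction
+    latency: float     # cycles until the result is ready once all inputs are available
+    ports: str         # pipeline function(s) used, e.g. "IntegerALU", "LoadMemory"
+
+# Illustrative entries: an integer add issues quickly with short latency,
+# while an integer divide ties up its pipeline port for many cycles.
+int_add = InsExecutionCharacteristics(throughput=0.25, latency=1.0, ports="IntegerALU")
+int_div = InsExecutionCharacteristics(throughput=20.0, latency=26.0, ports="IntegerDivide")
+```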
+
+For the Perf Score implementation we will add a method named getInsExecutionCharacteristics that returns the insExecutionCharacteristics for each instruction.
+We will use the information returned along with the BasicBlock weight to calculate the weighted execution cost for each instruction.
+For this implementation we won't try to model the actual def/use latencies and instead will use a simplified model to estimate the instruction latency.
+If in the future we decide to add an instruction scheduler phase to the JIT we can revisit this area.
+I believe that it will be straightforward to add it at that time.
+
+### Simplifying Assumptions
+1. We will assume that we won't have to pay for the latency of memory store operations. This means that we don't expect a memory read to occur that reads back the value being stored before it has finished its latency cost.
+2. We will assume that the hardware speculation will be able to hide one cycle of latency for each instruction.
+3. We won't model exact pipeline port usage and instead will model the worst case, such as issuing back-to-back divide instructions.
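+
+The assumptions above can be sketched as a simple cost function, combined with the BasicBlock weight described earlier. This is a minimal model under those assumptions; `per_instruction_cost` and `weighted_cost` are hypothetical helpers, and the exact formula used by the JIT may differ:
+
+```python
+def per_instruction_cost(latency, throughput, is_memory_store=False):
+    """Estimate the cycle cost of one instruction under the simplifying assumptions."""
+    # Assumption 1: memory stores do not charge their latency.
+    if is_memory_store:
+        latency = 0.0
+    # Assumption 2: hardware speculation hides one cycle of latency per instruction.
+    latency = max(latency - 1.0, 0.0)
+    # Assumption 3: model worst-case port pressure, so an instruction always
+    # costs at least its throughput, even when its latency is fully hidden.
+    return max(latency, throughput)
+
+def weighted_cost(block_weight, latency, throughput, is_memory_store=False):
+    """Scale the instruction cost by its BasicBlock weight, so that code
+    inside hot loops dominates the method's Perf Score."""
+    return block_weight * per_instruction_cost(latency, throughput, is_memory_store)
+```
+
+For example, a divide with latency 26 and throughput 20 in a block of weight 100 contributes 100 * max(26 - 1, 20) = 2500 to the score, while the same divide in a weight-1 cold block contributes only 25.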
+
+### Additional code size costs
+We also want to take into account the amount of code generated by the JIT for the method.
+We do this as follows:
+
+1. We add 0.1 for each byte of hot code generated.
+2. We add 0.01 for each byte of cold code generated.
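+
+The two additions above amount to a weighted byte count. A small worked example, using made-up byte counts:
+
+```python
+def code_size_cost(hot_bytes, cold_bytes):
+    """Code-size contribution to the Perf Score:
+    0.1 per byte of hot code plus 0.01 per byte of cold code."""
+    return 0.1 * hot_bytes + 0.01 * cold_bytes
+
+# A method with 200 bytes of hot code and 1000 bytes of cold code
+# contributes 0.1*200 + 0.01*1000 = 30.0 to its Perf Score.
+cost = code_size_cost(200, 1000)
+```
+
+Weighting hot bytes ten times more heavily than cold bytes keeps the size term from swamping the execution-cost term for methods with large but rarely executed cold sections.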
+
+### Follow on work
+I will modify the "jit-analyze" tool to use the Perf Score to identify the methods that have the largest regressions and improvements.
+