From f32bafd801f3ec615c109437ae7417b902d465e1 Mon Sep 17 00:00:00 2001
From: Brian Sullivan
Date: Mon, 23 Sep 2019 16:45:54 -0700
Subject: [PATCH] [skip -ci] Feedback on PerfScore design doc
 (dotnet/coreclr#26839)

* Feedback on PerfScore design doc

* Additional feedback and grammar changes

* Added note on Benchmarks

Commit migrated from https://github.com/dotnet/coreclr/commit/50d0ca5917603213948f1fc6a138f0d397e74898
---
 docs/coreclr/design-docs/Perf-Score.md | 38 +++++++++++++++-------------------
 1 file changed, 17 insertions(+), 21 deletions(-)

diff --git a/docs/coreclr/design-docs/Perf-Score.md b/docs/coreclr/design-docs/Perf-Score.md
index ea9cf99..793146f 100644
--- a/docs/coreclr/design-docs/Perf-Score.md
+++ b/docs/coreclr/design-docs/Perf-Score.md
@@ -8,17 +8,19 @@ Developers who work on the JIT code base.
 
 ## What problem does it solve:
 Currently JIT developers will generate "Asm Diffs" when making a change to the JIT to see if their change is improving the JIT-ted code.
-We have a tool called"jit-diffs" which allows a developer to quickly find methods that generate different assembly code for two different JITs.
+We have a tool called "jit-diffs" which allows a developer to quickly find methods that generate different assembly code for two different JITs.
 
-The jit-analyze" tool will sort this information and provide a list of methods that have the largest decreases and increases in codesize.
-For many methods smaller codesize usually meansfaster execution and larger codesize means slower execution.
+The "jit-analyze" tool will sort this information and provide a list of methods that have the largest decreases and increases in code size.
+For many methods smaller code size usually means faster execution and larger code size means slower execution.
 But for complex methods that have loops it is not a very good metric for measuring code quality.
 For such methods we want the JIT to improve the code generated for the innermost loop and are willing to pay extra costs outside the loop.
-The total codesize for the method does not properly classify regressions or improvements for changes to code inside of loops.
+The total code size for the method does not properly classify regressions or improvements for changes to code inside of loops.
 
-Instead of using total codesize as the metric it is better to have the JIT produce an estimate of the dynamic execution cost of a method.
+Instead of using total code size as the metric it is better to have the JIT produce an estimate of the dynamic execution cost of a method.
 This new dynamic execution cost of the method is called the "Perf Score" of the method.
-The "jit-analyze" tool would also be changed so that instead of using codeszie it would use the Perf Score to determine the list of methods to display.
+The "jit-analyze" tool would also be changed so that instead of using code size it would use the Perf Score to determine the list of methods to display.
+
+When making a performance change that impacts a benchmark, you should verify that both the computed Perf Score and the benchmark's execution time improve in a similar way.
 
 ## Implementation details:
 It is a very hard problem to get a highly accurate value for what the actual dynamic execution cost of a method is on a particular CPU.
@@ -29,28 +31,28 @@ In order to accurately compute the actual dynamic execution cost a software model
 This current Perf Score feature does **NOT** propose to try to accurately model the actual dynamic execution cost.
 Instead we propose to return an estimate of the dynamic execution cost.
-We are willing to make several simplfying assumptions that allow us to more quicly and easily return a reasonable estinmate to use as the Perf Score.
+We are willing to make several simplifying assumptions that allow us to more quickly and easily return a reasonable estimate to use as the Perf Score.
 
 ### Modern CPU Hardware documentation:
-Hardware vendors, such as Intel and ARM, provide information for assembly lanaguge programmers and compiler writers.
+Hardware vendors, such as Intel and ARM, provide information for assembly language programmers and compiler writers.
 They typically provide the following details for instructions:
 - Latency -- The number of clock cycles that are required for the execution to complete and produce a result after all inputs are available.
 - Throughput -- The number of clock cycles required to wait before the issue ports are free to accept another instruction. This value is often less than one cycle.
- - Pipeline Ports -- Each instruction uses certain pipeline functions, such as LOad Memory, Store Mmeory, Integer ALU, Floating Point Multiply, etc...
+ - Pipeline Ports -- Each instruction uses certain pipeline functions, such as Load Memory, Store Memory, Integer ALU, Floating Point Multiply, etc...
 
-For the Perf Score implementation we will call the method getInsExecutionCharacteristics and it will return the insExecutionCharacteristics for each instruction.
+For the Perf Score implementation, we will call the method getInsExecutionCharacteristics and it will return the insExecutionCharacteristics for each instruction.
 We will use the information returned along with the BasicBlock weight to calculate the weighted execution cost for each instruction.
 
 For this implementation we won't try to model the actual def/use latencies and instead will use a simplified model to estimate the instruction latency.
-If in the future we decide to add an instruction scheduler phase to the JIT we can revisit this area.
+If in the future we decide to add an instruction scheduler phase to the JIT, we can revisit this area.
 I believe that it will be straightforward to add it at that time.
 
-### Simplfying Assumptions
-1. We will assume that we won't have to pay for the latency for memory store operations.
-This means that we don't expect a memory read to occur that reads back the value being stored before it has finished it latency cost.
+### Simplifying Assumptions
+1. We will assume that we won't have to pay for the latency for memory store operations. This means that we don't expect a memory read to occur that reads back the value being stored before it has finished its latency cost.
 2. We will assume that the hardware speculation will be able to hide one cycle of latency for each instruction.
-3. We won't model exact pipeline port usage and instead will model the worst case, such issuing back to back divide intructions.
+3. We won't model exact pipeline port usage and instead will model the worst case, such as issuing back to back divide instructions.
 
 ### Additional code size costs
-We also want to take into account the anount of code generated by the JIT for the method.
+We also want to consider the amount of code generated by the JIT for the method.
 We do this as follows:
 1. We add 0.1 for each byte of hot code generated
 
@@ -58,9 +60,3 @@ We do this as follows:
 
 ### Follow on work
 I will modify the "jit-analyze" tool to use the Perf Score to identify the methods that have the largest regressions and improvements.
-
-
-
-
-
-
-- 
2.7.4
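As an illustration of the estimate the patch describes, here is a minimal sketch of a Perf Score calculation under the stated simplifying assumptions (stores pay no latency, speculation hides one cycle of latency, worst-case port usage). The class and function names are hypothetical stand-ins; this is not the JIT's actual `getInsExecutionCharacteristics` implementation, and the exact cost formula is an assumption:

```python
from dataclasses import dataclass

@dataclass
class InsExecutionCharacteristics:
    """Hypothetical mirror of the per-instruction data described above."""
    latency: float       # cycles until the result is available
    throughput: float    # cycles before the issue ports accept another instruction
    is_store: bool = False

def instruction_cost(ins: InsExecutionCharacteristics) -> float:
    # Assumption 1: memory store operations pay no latency cost.
    latency = 0.0 if ins.is_store else ins.latency
    # Assumption 2: hardware speculation hides one cycle of latency per instruction.
    latency = max(latency - 1.0, 0.0)
    # Assumption 3: worst-case port usage, so an instruction never retires
    # faster than its throughput allows (e.g. back-to-back divides).
    return max(latency, ins.throughput)

def perf_score(blocks, hot_code_bytes: int) -> float:
    """blocks: iterable of (basic_block_weight, [instructions])."""
    cycles = sum(weight * sum(instruction_cost(i) for i in instrs)
                 for weight, instrs in blocks)
    # Code-size term from the doc: add 0.1 for each byte of hot code generated.
    return cycles + 0.1 * hot_code_bytes
```

Because each block's cost is scaled by its weight, a change inside a heavily weighted inner loop dominates the score even when the method's total code size barely moves, which is exactly the regression/improvement classification the document argues plain code size cannot provide.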