# App is hanging
**IMPORTANT: This tutorial uses API/methods available in dotnet core preview 5. These API/methods are _subject to change._**
http://localhost:5000/api/diagscenario/deadlock
In this scenario, the endpoint will experience a hang and substantial thread accumulation. We'll show how you can use the existing tools to analyze the problem, as well as how to extend those tools to create a more automated debugging session.
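
To make the scenario concrete before we start debugging, the sketch below shows the general shape of code that produces this behavior: two threads acquire the same pair of locks in opposite order and deadlock, while the worker threads spawned afterwards pile up behind one of the held locks. This is a simplified, hypothetical illustration (type names, sleep times, and thread counts are made up), not the actual diagscenario.cs source.

```c#
// Hypothetical illustration only -- not the actual diagscenario.cs source.
using System.Threading;

public static class DeadlockSketch
{
    private static readonly object _lockA = new object();
    private static readonly object _lockB = new object();

    public static void Run()
    {
        // Thread 1 takes lock A then lock B; thread 2 takes them in the opposite
        // order. If the two interleave, each waits forever on the lock the other holds.
        var first = new Thread(() => { lock (_lockA) { Thread.Sleep(100); lock (_lockB) { } } });
        var second = new Thread(() => { lock (_lockB) { Thread.Sleep(100); lock (_lockA) { } } });
        first.Start();
        second.Start();

        // Later worker threads all block trying to take one of the held locks,
        // which is what produces the large thread build-up seen in this scenario.
        for (int i = 0; i < 300; i++)
        {
            new Thread(() => { lock (_lockA) { } }).Start();
        }
    }
}
```
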
### Thread and lock counters
As of preview 5, the lock contention and thread counters are not yet available; they are on the backlog to be included in future milestones. The general symptom of this issue is that one or more requests take a long time to process and ultimately time out.

### Core dump generation
In order to investigate hung applications, a memory dump is often beneficial, allowing us to inspect the state of all the threads as well as any locks that may have contention issues.

Run the webapi (dotnet run) and navigate to the following URL:

http://localhost:5000/api/diagscenario/deadlock

Let the request run for about 10-15 seconds. The first thing we need in order to generate a core dump is the process identifier of our webapi process, which we can find with dotnet-trace:
> ```bash
> $ dotnet-trace list-processes
> ...
> 80926 webapi /home/marioh/webapi/bin/Debug/netcoreapp3.0/webapi
> ```
Once we have the process identifier, we can use dotnet-dump collect (passing the process identifier) to generate the core dump:
> ```bash
> $ sudo ./dotnet-dump collect -p 80926
> ```
At this point, we have the core dump, and the next step is to analyze it to find the root cause of our hanging application.

### Analyzing the core dump
To start our investigation, let's open the core dump using dotnet-dump analyze:

> ```bash
> $ ./dotnet-dump analyze ~/.dotnet/tools/core_20190513_143916
> ```
Since we are looking at a potential hang, it's useful to first get an overall feel for the thread activity in the process. We can use the threads command as shown below:

> threads
> *0 0x1DBFF (121855)
> 1 0x1DC01 (121857)
> 2 0x1DC02 (121858)
> 3 0x1DC03 (121859)
> 4 0x1DC04 (121860)
> 5 0x1DC05 (121861)
> 6 0x1DC06 (121862)
> 7 0x1DC07 (121863)
> 8 0x1DC08 (121864)
> 9 0x1DC09 (121865)
> 10 0x1DC0A (121866)
> 11 0x1DC0D (121869)
> 12 0x1DC0E (121870)
> 13 0x1DC10 (121872)
> 14 0x1DC11 (121873)
> 15 0x1DC12 (121874)
> 16 0x1DC13 (121875)
> 17 0x1DC14 (121876)
> 18 0x1DC15 (121877)
> 19 0x1DC1C (121884)
> 20 0x1DC1D (121885)
> 21 0x1DC1E (121886)
> 22 0x1DC21 (121889)
> 23 0x1DC22 (121890)
> 24 0x1DC23 (121891)
> 25 0x1DC24 (121892)
> 26 0x1DC25 (121893)
>
> ...
>
> ...
>
> 317 0x1DD48 (122184)
> 318 0x1DD49 (122185)
> 319 0x1DD4A (122186)
> 320 0x1DD4B (122187)
> 321 0x1DD4C (122188)

The output shows all the threads currently running in the process, along with their associated debugger thread ID and operating system thread ID. Based on the output above, it looks like we have a little over 300 threads.

The next step is to get a better understanding of what the threads are currently doing by looking at their callstacks. The clrstack command outputs the callstack of the currently selected thread, and it also provides a shortcut to show the callstacks of all the threads. To perform the latter, run the following command:

```bash
> clrstack -all

 ...
 ...
 ...
 Child SP IP Call Site
00007F2AE37B5680 00007f305abc6360 [GCFrame: 00007f2ae37b5680]
00007F2AE37B5770 00007f305abc6360 [GCFrame: 00007f2ae37b5770]
00007F2AE37B57D0 00007f305abc6360 [HelperMethodFrame_1OBJ: 00007f2ae37b57d0] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007F2AE37B5920 00007F2FE392B31F testwebapi.Controllers.DiagScenarioController.<deadlock>b__3_1() [/home/marioh/webapi/Controllers/diagscenario.cs @ 36]
00007F2AE37B5950 00007F2FE392B46D System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
00007F2AE37B5CA0 00007f30593044af [GCFrame: 00007f2ae37b5ca0]
00007F2AE37B5D70 00007f30593044af [DebuggerU2MCatchHandlerFrame: 00007f2ae37b5d70]
OS Thread Id: 0x1dc82
 Child SP IP Call Site
00007F2AE2FB4680 00007f305abc6360 [GCFrame: 00007f2ae2fb4680]
00007F2AE2FB4770 00007f305abc6360 [GCFrame: 00007f2ae2fb4770]
00007F2AE2FB47D0 00007f305abc6360 [HelperMethodFrame_1OBJ: 00007f2ae2fb47d0] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007F2AE2FB4920 00007F2FE392B31F testwebapi.Controllers.DiagScenarioController.<deadlock>b__3_1() [/home/marioh/webapi/Controllers/diagscenario.cs @ 36]
00007F2AE2FB4950 00007F2FE392B46D System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
00007F2AE2FB4CA0 00007f30593044af [GCFrame: 00007f2ae2fb4ca0]
00007F2AE2FB4D70 00007f30593044af [DebuggerU2MCatchHandlerFrame: 00007f2ae2fb4d70]
OS Thread Id: 0x1dc83
 Child SP IP Call Site
00007F2AE27B3680 00007f305abc6360 [GCFrame: 00007f2ae27b3680]
00007F2AE27B3770 00007f305abc6360 [GCFrame: 00007f2ae27b3770]
00007F2AE27B37D0 00007f305abc6360 [HelperMethodFrame_1OBJ: 00007f2ae27b37d0] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007F2AE27B3920 00007F2FE392B31F testwebapi.Controllers.DiagScenarioController.<deadlock>b__3_1() [/home/marioh/webapi/Controllers/diagscenario.cs @ 36]
00007F2AE27B3950 00007F2FE392B46D System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
00007F2AE27B3CA0 00007f30593044af [GCFrame: 00007f2ae27b3ca0]
00007F2AE27B3D70 00007f30593044af [DebuggerU2MCatchHandlerFrame: 00007f2ae27b3d70]
OS Thread Id: 0x1dc84
 Child SP IP Call Site
00007F2AE1FB2680 00007f305abc6360 [GCFrame: 00007f2ae1fb2680]
00007F2AE1FB2770 00007f305abc6360 [GCFrame: 00007f2ae1fb2770]
00007F2AE1FB27D0 00007f305abc6360 [HelperMethodFrame_1OBJ: 00007f2ae1fb27d0] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007F2AE1FB2920 00007F2FE392B31F testwebapi.Controllers.DiagScenarioController.<deadlock>b__3_1() [/home/marioh/webapi/Controllers/diagscenario.cs @ 36]
00007F2AE1FB2950 00007F2FE392B46D System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
00007F2AE1FB2CA0 00007f30593044af [GCFrame: 00007f2ae1fb2ca0]
00007F2AE1FB2D70 00007f30593044af [DebuggerU2MCatchHandlerFrame: 00007f2ae1fb2d70]
OS Thread Id: 0x1dc85
 Child SP IP Call Site
00007F2AE17B1680 00007f305abc6360 [GCFrame: 00007f2ae17b1680]
00007F2AE17B1770 00007f305abc6360 [GCFrame: 00007f2ae17b1770]
00007F2AE17B17D0 00007f305abc6360 [HelperMethodFrame_1OBJ: 00007f2ae17b17d0] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007F2AE17B1920 00007F2FE392B31F testwebapi.Controllers.DiagScenarioController.<deadlock>b__3_1() [/home/marioh/webapi/Controllers/diagscenario.cs @ 36]
00007F2AE17B1950 00007F2FE392B46D System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
00007F2AE17B1CA0 00007f30593044af [GCFrame: 00007f2ae17b1ca0]
00007F2AE17B1D70 00007f30593044af [DebuggerU2MCatchHandlerFrame: 00007f2ae17b1d70]
OS Thread Id: 0x1dc86
 Child SP IP Call Site
00007F2AE0FB0680 00007f305abc6360 [GCFrame: 00007f2ae0fb0680]
00007F2AE0FB0770 00007f305abc6360 [GCFrame: 00007f2ae0fb0770]
00007F2AE0FB07D0 00007f305abc6360 [HelperMethodFrame_1OBJ: 00007f2ae0fb07d0] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007F2AE0FB0920 00007F2FE392B31F testwebapi.Controllers.DiagScenarioController.<deadlock>b__3_1() [/home/marioh/webapi/Controllers/diagscenario.cs @ 36]
00007F2AE0FB0950 00007F2FE392B46D System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
00007F2AE0FB0CA0 00007f30593044af [GCFrame: 00007f2ae0fb0ca0]
00007F2AE0FB0D70 00007f30593044af [DebuggerU2MCatchHandlerFrame: 00007f2ae0fb0d70]
OS Thread Id: 0x1dc87
 Child SP IP Call Site
00007F2AE07AF680 00007f305abc6360 [GCFrame: 00007f2ae07af680]
00007F2AE07AF770 00007f305abc6360 [GCFrame: 00007f2ae07af770]
00007F2AE07AF7D0 00007f305abc6360 [HelperMethodFrame_1OBJ: 00007f2ae07af7d0] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007F2AE07AF920 00007F2FE392B31F testwebapi.Controllers.DiagScenarioController.<deadlock>b__3_1() [/home/marioh/webapi/Controllers/diagscenario.cs @ 36]
00007F2AE07AF950 00007F2FE392B46D System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
00007F2AE07AFCA0 00007f30593044af [GCFrame: 00007f2ae07afca0]
00007F2AE07AFD70 00007f30593044af [DebuggerU2MCatchHandlerFrame: 00007f2ae07afd70]
OS Thread Id: 0x1dc88
 Child SP IP Call Site
00007F2ADFFAE680 00007f305abc6360 [GCFrame: 00007f2adffae680]
00007F2ADFFAE770 00007f305abc6360 [GCFrame: 00007f2adffae770]
00007F2ADFFAE7D0 00007f305abc6360 [HelperMethodFrame_1OBJ: 00007f2adffae7d0] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007F2ADFFAE920 00007F2FE392B31F testwebapi.Controllers.DiagScenarioController.<deadlock>b__3_1() [/home/marioh/webapi/Controllers/diagscenario.cs @ 36]
00007F2ADFFAE950 00007F2FE392B46D System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
00007F2ADFFAECA0 00007f30593044af [GCFrame: 00007f2adffaeca0]
00007F2ADFFAED70 00007f30593044af [DebuggerU2MCatchHandlerFrame: 00007f2adffaed70]
...
...
...
```
(output abbreviated)

Eyeballing the callstacks for all 300+ threads shows a pattern where a majority of the threads share a common callstack:

```bash
OS Thread Id: 0x1dc88
 Child SP IP Call Site
00007F2ADFFAE680 00007f305abc6360 [GCFrame: 00007f2adffae680]
00007F2ADFFAE770 00007f305abc6360 [GCFrame: 00007f2adffae770]
00007F2ADFFAE7D0 00007f305abc6360 [HelperMethodFrame_1OBJ: 00007f2adffae7d0] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007F2ADFFAE920 00007F2FE392B31F testwebapi.Controllers.DiagScenarioController.<deadlock>b__3_1() [/home/marioh/webapi/Controllers/diagscenario.cs @ 36]
00007F2ADFFAE950 00007F2FE392B46D System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
00007F2ADFFAECA0 00007f30593044af [GCFrame: 00007f2adffaeca0]
00007F2ADFFAED70 00007f30593044af [DebuggerU2MCatchHandlerFrame: 00007f2adffaed70]
```
The callstack shows that the request arrived in our deadlock method, which in turn made a call to Monitor.ReliableEnter. This is a good indication that the majority of the threads are trying to enter a monitor lock and are waiting for it to become available (i.e., it is currently held by a different thread). The next step is to find out which thread actually owns the monitor lock. Since monitors typically store lock information in the sync block table, we can use the syncblk command to get more information:

```bash
> syncblk
Index SyncBlock MonitorHeld Recursion Owning Thread Info SyncBlock Owner
 41 000000000143D038 603 1 00007F2B542D28C0 1dc1d 20 00007f2e90080fb8 System.Object
 42 000000000143D080 3 1 00007F2B400016A0 1dc1e 21 00007f2e90080fd0 System.Object
-----------------------------
Total 264
Free 0
```
The two interesting columns are MonitorHeld and Owning Thread Info. MonitorHeld shows whether a monitor lock is held and how contended it is: the owning thread contributes 1 to the count and each waiting thread contributes 2, so a MonitorHeld of 603 indicates one owner and roughly 300 waiting threads. Owning Thread Info shows which thread currently owns the monitor lock; it spans three columns, and the one of interest is the operating system thread ID (the second of the three).

At this point, we know that two different threads (0x1dc1d and 0x1dc1e) each hold a monitor lock. The next step is to take a look at what those threads are doing and whether there is any chance they are stuck indefinitely while holding their locks. Let's use the setthread and clrstack commands to switch to each thread and display its callstack:

```bash
> setthread 0x1dc1d
> clrstack
OS Thread Id: 0x1dc1d (20)
 Child SP IP Call Site
00007F2B862B9610 00007f305abc6360 [GCFrame: 00007f2b862b9610]
00007F2B862B9700 00007f305abc6360 [GCFrame: 00007f2b862b9700]
00007F2B862B9760 00007f305abc6360 [HelperMethodFrame_1OBJ: 00007f2b862b9760] System.Threading.Monitor.Enter(System.Object)
00007F2B862B98B0 00007F2FE392A9E5 testwebapi.Controllers.DiagScenarioController.DeadlockFunc() [/home/marioh/webapi/Controllers/diagscenario.cs @ 57]
00007F2B862B98F0 00007F2FE392A8FC testwebapi.Controllers.DiagScenarioController.<deadlock>b__3_0() [/home/marioh/webapi/Controllers/diagscenario.cs @ 27]
00007F2B862B9910 00007F2FE02B7BA2 System.Threading.ThreadHelper.ThreadStart_Context(System.Object) [/__w/3/s/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 51]
00007F2B862B9930 00007F2FE02C1021 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 172]
00007F2B862B9980 00007F2FE02B7CBE System.Threading.ThreadHelper.ThreadStart() [/__w/3/s/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 101]
00007F2B862B9CA0 00007f30593044af [GCFrame: 00007f2b862b9ca0]
00007F2B862B9D70 00007f30593044af [DebuggerU2MCatchHandlerFrame: 00007f2b862b9d70]
> setthread 0x1dc1e
> clrstack
OS Thread Id: 0x1dc1e (21)
 Child SP IP Call Site
00007F2B85AB8640 00007f305abc6360 [GCFrame: 00007f2b85ab8640]
00007F2B85AB8730 00007f305abc6360 [GCFrame: 00007f2b85ab8730]
00007F2B85AB8790 00007f305abc6360 [HelperMethodFrame_1OBJ: 00007f2b85ab8790] System.Threading.Monitor.Enter(System.Object)
00007F2B85AB88E0 00007F2FE392AAAE testwebapi.Controllers.DiagScenarioController.<DeadlockFunc>b__4_0() [/home/marioh/webapi/Controllers/diagscenario.cs @ 53]
00007F2B85AB8910 00007F2FE02B7BA2 System.Threading.ThreadHelper.ThreadStart_Context(System.Object) [/__w/3/s/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 51]
00007F2B85AB8930 00007F2FE02C1021 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/__w/3/s/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 172]
00007F2B85AB8980 00007F2FE02B7CBE System.Threading.ThreadHelper.ThreadStart() [/__w/3/s/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 101]
00007F2B85AB8CA0 00007f30593044af [GCFrame: 00007f2b85ab8ca0]
00007F2B85AB8D70 00007f30593044af [DebuggerU2MCatchHandlerFrame: 00007f2b85ab8d70]
```
Starting with the first thread, our deadlock method (which already owns a monitor lock) is trying to acquire another lock and is waiting for it to become available. Similarly, the second thread (which also owns a monitor lock) is trying to acquire a lock and waiting for it to become available.

To summarize our findings: we have two threads, each of which already owns a lock and is stuck waiting for the lock held by the other, which is a classic deadlock. The remaining 300+ waiting threads are most likely blocked on one of the locks involved in the deadlock.

### Simplifying the analysis process

One of the pain points in our analysis was having to eyeball hundreds of callstacks to spot a common pattern. It would be much easier if we had a command that output just the unique callstacks together with the thread IDs associated with each of them. In this section, we'll implement that command as a custom command for dotnet-dump analyze.

Before we get started, you need to do the following (steps 1, 3, and 4 are also shown as shell commands after the list):

1. Clone the dotnet/diagnostics repo (https://github.com/dotnet/diagnostics)
2. Make sure all the prerequisites are in place to build the repo (https://github.com/dotnet/diagnostics/blob/master/README.md)
3. Add the uniquestacks.cs file to the ~/diagnostics/src/Tools/dotnet-dump/Commands folder.
4. Build the repo using ~/diagnostics/build.sh
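
As a convenience, here is a minimal sketch of steps 1, 3, and 4 as shell commands, assuming the repo is cloned under the home directory and uniquestacks.cs sits in the current working directory:

```bash
# Step 1: clone the diagnostics repo (assumed location: ~/diagnostics)
git clone https://github.com/dotnet/diagnostics ~/diagnostics

# Step 3: copy the new command into the dotnet-dump commands folder
cp uniquestacks.cs ~/diagnostics/src/Tools/dotnet-dump/Commands/

# Step 4: build the repo
~/diagnostics/build.sh
```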

All the interesting command code is located in the InvokeAsync method (uniquestacks.cs) and uses ClrMD to enumerate all the threads, like so:

```c#
// 'threads' is a Dictionary<string, List<uint>> on the command class that maps
// a callstack (as a string) to the list of OS thread ids sharing that callstack.
foreach (ClrThread thread in AnalyzeContext.Runtime.Threads)
{
    // Skip threads that are no longer alive.
    if (!thread.IsAlive)
        continue;

    // Build a single string representing the thread's complete callstack.
    StringBuilder completeStack = new StringBuilder();
    foreach (ClrStackFrame frame in thread.StackTrace)
    {
        switch (frame.Kind)
        {
            case ClrStackFrameType.ManagedMethod:
                completeStack.Append(frame.DisplayString);
                break;

            case ClrStackFrameType.Runtime:
                completeStack.Append("[" + frame.DisplayString + "]");

                if (frame.Method != null)
                {
                    completeStack.Append("Special");
                    completeStack.Append(frame.Method);
                }

                break;
        }

        completeStack.Append("\n");
    }

    // Use the full callstack string as the dictionary key so that threads with
    // identical callstacks end up in the same bucket.
    string cStack = completeStack.ToString();

    if (threads.ContainsKey(cStack))
    {
        threads[cStack].Add(thread.OSThreadId);
    }
    else
    {
        List<uint> l = new List<uint>();
        l.Add(thread.OSThreadId);
        threads.Add(cStack, l);
    }
}
```
The code iterates over all the threads, builds each thread's callstack as a string, and adds it to a dictionary keyed by that string. Once the loop has finished, the dictionary contains all the unique callstacks and their associated thread IDs.

Lastly, the code prints out the dictionary, allowing us to spot the duplicated callstacks with ease.
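
The printing code itself isn't reproduced in this tutorial; a minimal sketch of what that last step could look like, assuming the threads dictionary built in the loop above (illustrative only, not the actual uniquestacks.cs source):

```c#
// Hypothetical sketch of the output step (not the actual uniquestacks.cs code).
// 'threads' is the Dictionary<string, List<uint>> populated in the loop above.
foreach (KeyValuePair<string, List<uint>> entry in threads)
{
    // Print the unique callstack followed by the OS thread ids that share it.
    Console.WriteLine(entry.Key);
    Console.WriteLine(string.Join(",", entry.Value));
    Console.WriteLine();
}
```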
Build the dotnet-dump tool (dotnet build) and run it from the following location on the same dump that we generated earlier:
```bash
~/diagnostics/artifacts/bin/dotnet-dump/Debug/netcoreapp2.1/publish/dotnet dotnet-dump.dll analyze ~/.dotnet/tools/core_20190513_143916
```
You can now use the 'uniquestacks' command and see the following output (partial):
```bash
...
...
...
[GCFrame]
[GCFrame]
[HelperMethodFrame_1OBJ]
testwebapi.Controllers.DiagScenarioController.<deadlock>b__3_1()
System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
[GCFrame]

121917,121918,121919,121920,121921,121922,121923,121924,121925,121926,121927,121928,121929,121930,121931,121932,121933,121934,121935,121936,121937,121938,121939,121940,121941,121942,121943,121944,121945,121946,121947,121948,121949,121950,121951,121952,121953,121954,121955,121956,121957,121958,121959,121960,121961,121962,121963,121964,121965,121966,121967,121968,121969,121970,121971,121972,121973,121974,121975,121976,121977,121978,121979,121980,121981,121982,121983,121984,121985,121986,121987,121988,121989,121990,121991,121992,121993,121994,121995,121996,121997,121998,121999,122000,122001,122002,122003,122004,122005,122006,122007,122008,122009,122010,122011,122012,122013,122014,122015,122016,122017,122018,122019,122020,122021,122022,122023,122024,122025,122026,122027,122028,122029,122030,122031,122032,122033,122034,122035,122036,122037,122038,122039,122040,122041,122042,122043,122044,122045,122046,122047,122048,122049,122050,122051,122052,122053,122054,122055,122056,122057,122058,122059,122060,122061,122062,122063,122064,122065,122066,122067,122068,122069,122070,122071,122072,122073,122074,122075,122076,122077,122078,122079,122080,122081,122082,122083,122084,122085,122086,122087,122088,122089,122090,122091,122092,122093,122094,122095,122096,122097,122098,122099,122100,122101,122102,122103,122104,122105,122106,122107,122108,122109,122110,122111,122112,122113,122114,122115,122116,122117,122118,122119,122120,122121,122122,122123,122124,122125,122126,122127,122128,122129,122130,122131,122132,122133,122134,122135,122136,122137,122138,122139,122140,122141,122142,122143,122144,122145,122146,122147,122148,122149,122150,122151,122152,122153,122154,122155,122156,122157,122158,122159,122160,122161,122162,122163,122164,122165,122166,122167,122168,122169,122170,122171,122172,122173,122174,122175,122176,122177,122178,122179,122180,122181,122182,122183,122184,122185,122186,122187,122188
```
The output shows a callstack shared by a large number of threads (with their associated thread IDs), which is commonly a good indicator that a threading issue is taking place.