15

A CPU sampling profiler in less than 200 lines

 3 years ago
source link: https://lowleveldesign.org/2020/10/13/a-cpu-sampling-profiler-in-less-than-200-lines/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

lowleveldesign.org

Software tracing, debugging, and security

While working on a new version of wtrace, I am analyzing the PerfView source code to learn how its various features work internally. One of such features is the call stack resolution for ETW events. This post will show you how to use the TraceEvent library to decode call stacks, and, as an exercise, we will write a sampling process profiler. Before we start, remember to set DisablePagingExecutive to 1. That is a requirement to make call stacks work for ETW sessions.

❗❗❗ Visit wtrace.net to receive updates on wtrace and my other troubleshooting tools. ❗ ❗ ❗

Collecting profiling data

First, we need to collect trace data that we will later analyze. Our profiler will rely on SampledProfile events. Enabling them makes the Kernel emit a profiling event every 1ms for every CPU in the system. The SampledProfile event has only two fields: InstructionPointer and ThreadId. InstructionPointer is an address in the memory, so to find a method name that matches it, we need information about the images loaded by a process (the ImageLoad event). Our code to start the kernel session could look as follows:

use session = new TraceEventSession(sessionName, etlFilePath)
let traceFlags = NtKeywords.Profile ||| NtKeywords.ImageLoad
let stackFlags = NtKeywords.Profile
session.EnableKernelProvider(traceFlags, stackFlags) |> ignore

That is enough to resolve call stacks from the native images (NGEN/R2R), where CPU executes the image’s compiled code. However, resolving managed call stacks requires some more data. For managed code, the instruction pointer points to a block of JIT-compiled code that resides in a process private memory and is valid only for the process lifetime. The Kernel does not know about the CLR internals, so we need to query the CLR ETW provider. Fortunately, the ClrTraceEventParser already has a flag which contains all the necessary keywords:

public sealed class ClrTraceEventParser : TraceEventParser
{
...
/// <summary>
///  Keywords are passed to TraceEventSession.EnableProvider to enable particular sets of
/// </summary>
[Flags]
public enum Keywords : long
{
...
/// <summary>
/// What is needed to get symbols for JIT compiled code. 
/// </summary>
JITSymbols = Jit | StopEnumeration | JittedMethodILToNativeMap | SupressNGen | Loader,
...
};
}

We are ready to enable the CLR provider in our ETW session:

let keywords = ClrKeywords.JITSymbols
session.EnableProvider(ClrTraceEventParser.ProviderGuid, TraceEventLevel.Always, uint64(keywords)) |> ignore

Our session is now open and saves ETW events to the .etl file we specified. It’s time to perform the action we want to profile, for example, start a new process. When done, we should call Stop or Dispose to stop the ETW session and close the .etl file. For .NET processes that didn’t start or finish when the session was running, we also need to collect CLR rundown events. Without this data, we will lack JIT mappings, and we won’t be able to resolve some of the managed stacks. Here is the code for creating a rundown session:

let collectClrRundownEvents () =
let rec waitForRundownEvents filePath =
// Poll until 2 seconds goes by without growth.
let len = FileInfo(filePath).Length
Thread.Sleep(2000)
if FileInfo(filePath).Length = len then
waitForRundownEvents filePath
else ()
printfn "Forcing rundown of JIT methods."
let rundownFileName = Path.ChangeExtension(etlFilePath, ".clrRundown.etl")
use rundownSession = new TraceEventSession(sessionName + "-rundown", rundownFileName)
rundownSession.EnableProvider(ClrRundownTraceEventParser.ProviderGuid,
TraceEventLevel.Verbose, uint64(ClrRundownTraceEventParser.Keywords.Default)) |> ignore
waitForRundownEvents rundownFileName
printfn "Done with rundown."

The final steps are to merge the .etl files, add necessary module information, and prepare symbol files for the pre-compiled assemblies. Fortunately, the TraceEvent library has one method that performs all the operations mentioned above: ZippedETLWriter.WriteArchive. Internally, this method uses KernelTraceControl.dll!CreateMergedTraceFile (from WPT) to merge the .etl files and generate ‘synthetic’ events for all the modules which were loaded by processes during the trace session. If you’d like to see the PInvoke signature of this method, decrypt the OSExtensions.cs file. For all the pre-compiled assemblies (ETWTraceEventSource.GetModulesNeedingSymbols) TraceEvent generates the .pdb files (SymbolReader.GenerateNGenSymbolsForModule) and packs them into the final .zip file.

Our final recording code looks as follows:

do
use session = new TraceEventSession(sessionName, etlFilePath)
let traceFlags = NtKeywords.Profile ||| NtKeywords.ImageLoad // NtKeywords.Process and NtKeywords.Thread are added automatically
let stackFlags = NtKeywords.Profile
session.EnableKernelProvider(traceFlags, stackFlags) |> ignore
// we need CLR to resolve managed stacks
let keywords = ClrKeywords.JITSymbols
session.EnableProvider(ClrTraceEventParser.ProviderGuid, TraceEventLevel.Always, uint64(keywords)) |> ignore
printfn "Press enter to stop the tracing session"
Console.ReadLine() |> ignore
session.Stop() |> ignore
collectClrRundownEvents ()
printfn "Collecting the data required for stack resolution..."
let writer = ZippedETLWriter(etlFilePath, log)
if not (writer.WriteArchive(Compression.CompressionLevel.NoCompression)) then
printfn "Error occured while merging the data"
else
printfn "Trace session completed"

It is very similar to the 40_SimpleTraceLog sample from the Perfview repository. So if you are looking for C# code, check it out.

Analyzing call stacks with TraceLog

Once we finished and post-processed the trace session, it’s time to analyze the collected events. Usually, when we want to process ETW events, we create the TraceEventSession instance and assign callbacks for interesting events. Using callbacks to the TraceEventSession source is the most efficient way of processing ETW events. However, this approach has some drawbacks. TraceEventParser instances are usually stateless, so if you need to work on aggregated data (processes, threads, modules), you need to implement the aggregation logic. For example, to list process lifetimes saved in the trace, we need to process the ProcessStartGroup and ProcessEndGroup events. Resolving call stacks is even more demanding and involves examining different groups of events. Fortunately, the TraceEvent library again gives us a hand and provides the TraceLog class. As stated in the documentation, TraceLog represents a higher-level event processing. It keeps a backlog of various session objects, including processes, threads, and call stacks. Our task is to instantiate the SymbolReader class, download or unpack the required symbol files, and process the event call stacks. The code below does all that:

type CallStackNode (callsCount : int32) =
inherit Dictionary<string, CallStackNode>(StringComparer.OrdinalIgnoreCase)
member val CallsCount = callsCount with get, set
let loadSymbols (traceLog : TraceLog) (proc : TraceProcess) =
use symbolReader = new SymbolReader(log)
proc.LoadedModules |> Seq.where (fun m -> not (isNull m.ModuleFile))
|> Seq.iter (fun m -> traceLog.CodeAddresses.LookupSymbolsForModule(symbolReader, m.ModuleFile))
let buildCallStacksTree (traceLog : TraceLog) pid tid =
let perfInfoTaskGuid = Guid(int32 0xce1dbfb4, int16 0x137e, int16 0x4da6, byte 0x87, byte 0xb0, byte 0x3f, byte 0x59, byte 0xaa, byte 0x10, byte 0x2c, byte 0xbc)
let perfInfoOpcode = 46
let callStacks = CallStackNode(0)
let processCallStack (callStack : TraceCallStack) =
let addOrUpdateChildNode (node : CallStackNode) (callStack : TraceCallStack) =
let decodedAddress = sprintf "%s!%s" callStack.CodeAddress.ModuleName callStack.CodeAddress.FullMethodName
match node.TryGetValue(decodedAddress) with
| (true, childNode) ->
childNode.CallsCount <- childNode.CallsCount + 1
childNode
| (false, _) ->
let childNode = CallStackNode(1)
node.Add(decodedAddress, childNode)
childNode
let rec processStackFrame (callStackNode : CallStackNode) (callStack : TraceCallStack) =
let caller = callStack.Caller
if isNull caller then // root node
callStackNode.CallsCount <- callStackNode.CallsCount + 1
let childNode =
if isNull caller then addOrUpdateChildNode callStackNode callStack
else processStackFrame callStackNode caller
addOrUpdateChildNode childNode callStack
processStackFrame callStacks callStack |> ignore
traceLog.Events
|> Seq.filter (fun ev -> ev.ProcessID = pid && ev.ThreadID = tid && ev.TaskGuid = perfInfoTaskGuid && (int32 ev.Opcode = perfInfoOpcode))
|> Seq.iter (fun ev -> processCallStack (ev.CallStack()))
callStacks
let tryFindProcess (traceLog : TraceLog) processNameOrPid =
let (|Pid|ProcessName|) (s : string) =
match Int32.TryParse(s) with
| (true, pid) -> Pid pid
| (false, _) -> ProcessName s
let filter =
match processNameOrPid with
| Pid pid -> fun (p : TraceProcess) -> p.ProcessID = pid
| ProcessName name -> fun (p : TraceProcess) -> String.Equals(p.Name, name, StringComparison.OrdinalIgnoreCase)
let processes = traceLog.Processes |> Seq.where filter |> Seq.toArray
if processes.Length = 0 then
None
elif processes.Length = 1 then
Some processes.[0]
else
processes |> Seq.iteri (fun i p -> printfn "[%d] %s (%d): '%s'" i p.Name p.ProcessID p.CommandLine)
printf "Which process to analyze []: "
match Int32.TryParse(Console.ReadLine()) with
| (false, _) -> None
| (true, n) -> processes |> Seq.tryItem n
let analyze processNameOrPid maxDepth =
let reader = ZippedETLReader(etlFilePath + ".zip", log)
reader.UnpackArchive()
let options = TraceLogOptions(ConversionLog = log)
let traceLog = TraceLog.OpenOrConvert(etlFilePath, options)
match tryFindProcess traceLog processNameOrPid with
| None -> printfn "No matching process found in the trace"
| Some proc ->
printfn "%s [%s] (%d)" proc.Name proc.CommandLine proc.ProcessID
let sw = Stopwatch.StartNew()
printfn "[%6d ms] Loading symbols for modules in process %s (%d)" sw.ElapsedMilliseconds proc.Name proc.ProcessID
loadSymbols traceLog proc
// usually, system process has PID 4, but TraceEvent attaches the drivers to the Idle process (0)
let systemProc = traceLog.Processes |> Seq.where (fun p -> p.ProcessID = 0) |> Seq.exactlyOne
printfn "[%6d ms] Loading symbols for system modules" sw.ElapsedMilliseconds
loadSymbols traceLog systemProc
printfn "[%6d ms] Starting call stack analysis" sw.ElapsedMilliseconds
let printThreadCallStacks (thread : TraceThread) =
let callStacks = buildCallStacksTree traceLog proc.ProcessID thread.ThreadID
let rec getCallStack depth name (callStack : CallStackNode) =
let folder frames (kv : KeyValuePair<string, CallStackNode>) =
if kv.Value.Count = 0 || (maxDepth > 0 && depth >= maxDepth) then frames
else getCallStack (depth + 1) kv.Key kv.Value |> List.append frames
callStack
|> Seq.fold folder [ (sprintf "%s├─ %s [%d]" ("│ " |> String.replicate depth) name callStack.CallsCount) ]
let stacks = getCallStack 0 (sprintf "Thread (%d) '%s'" thread.ThreadID thread.ThreadInfo) callStacks
stacks |> List.iter (fun s -> printfn "%s" s)
proc.Threads |> Seq.iter printThreadCallStacks
printfn "[%6d ms] Completed call stack analysis" sw.ElapsedMilliseconds

An example output might look as follows (I removed some lines for brevity):

PS netcoreapp3.1> .\etwprofiler.exe analyze testproc 10
testproc ["C:\Users\me\testproc.exe" ] (15416)
[     2 ms] Loading symbols for modules in process testproc (15416)
[   428 ms] Loading symbols for system modules
[ 21400 ms] Starting call stack analysis
├─ Thread (2412) 'Startup Thread' [121]
│ ├─ ntdll!LdrInitializeThunk [11]
│ │ ├─ ntdll!LdrInitializeThunk [11]
│ │ │ ├─ ntdll!LdrpInitialize [11]
│ │ │ │ ├─ ntdll!??_LdrpInitialize [11]
│ │ │ │ │ ├─ ntdll!LdrpInitializeProcess [11]
│ │ │ │ │ │ ├─ ntdll!LdrpInitializeTls [1]
│ │ │ │ │ │ ├─ ntdll!LdrLoadDll [2]
│ │ │ │ │ │ │ ├─ ntdll!LdrpLoadDll [2]
...
├─ Thread (24168) '' [1]
│ ├─ ntdll!RtlUserThreadStart [1]
│ │ ├─ ntdll!RtlUserThreadStart [1]
│ │ │ ├─ kernel32!BaseThreadInitThunk [1]
│ │ │ │ ├─ ntdll!TppWorkerThread [1]
│ │ │ │ │ ├─ ntdll!TppAlpcpExecuteCallback [1]
│ │ │ │ │ │ ├─ rpcrt4!LrpcIoComplete [1]
│ │ │ │ │ │ │ ├─ rpcrt4!LRPC_ADDRESS::ProcessIO [1]
├─ Thread (9808) '' [0]
├─ Thread (22856) '.NET ThreadPool Worker' [1]
│ ├─ ntdll!RtlUserThreadStart [1]
│ │ ├─ ntdll!RtlUserThreadStart [1]
│ │ │ ├─ kernel32!BaseThreadInitThunk [1]
│ │ │ │ ├─ coreclr!Thread::intermediateThreadProc [1]
│ │ │ │ │ ├─ coreclr!ThreadpoolMgr::WorkerThreadStart [1]
│ │ │ │ │ │ ├─ coreclr!UnManagedPerAppDomainTPCount::DispatchWorkItem [1]
│ │ │ │ │ │ │ ├─ coreclr!ThreadpoolMgr::AsyncTimerCallbackCompletion [1]
│ │ │ │ │ │ │ │ ├─ coreclr!AppDomainTimerCallback [1]
│ │ │ │ │ │ │ │ │ ├─ coreclr!ManagedThreadBase_DispatchOuter [1]
│ │ │ │ │ │ │ │ │ │ ├─ coreclr!ManagedThreadBase_DispatchMiddle [1]
├─ Thread (12352) '.NET ThreadPool Gate' [0]
[ 21835 ms] Completed call stack analysis

The full profiler code is available in my blog samples repository. For my Dotnetos session about tracing, I also prepared the C# version so you may check it too. And, of course, I invite you to watch the whole session! 🙂


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK