lowleveldesign.org

Software tracing, debugging, and security

While working on a new version of wtrace, I am analyzing the PerfView source code to learn how its various features work internally. One of such features is the call stack resolution for ETW events. This post will show you how to use the TraceEvent library to decode call stacks, and, as an exercise, we will write a sampling process profiler. Before we start, remember to set DisablePagingExecutive to 1. That is a requirement to make call stacks work for ETW sessions.

Visit wtrace.net to receive updates on wtrace and my other troubleshooting tools.

Collecting profiling data

First, we need to collect trace data that we will later analyze. Our profiler will rely on SampledProfile events. Enabling them makes the Kernel emit a profiling event every 1ms for every CPU in the system. The SampledProfile event has only two fields: InstructionPointer and ThreadId. InstructionPointer is an address in the memory, so to find a method name that matches it, we need information about the images loaded by a process (the ImageLoad event). Our code to start the kernel session could look as follows:

use session = new TraceEventSession(sessionName, etlFilePath)

let traceFlags = NtKeywords.Profile ||| NtKeywords.ImageLoad

let stackFlags = NtKeywords.Profile

session.EnableKernelProvider(traceFlags, stackFlags) |> ignore

That is enough to resolve call stacks from the native images (NGEN/R2R), where CPU executes the image’s compiled code. However, resolving managed call stacks requires some more data. For managed code, the instruction pointer points to a block of JIT-compiled code that resides in a process private memory and is valid only for the process lifetime. The Kernel does not know about the CLR internals, so we need to query the CLR ETW provider. Fortunately, the ClrTraceEventParser already has a flag which contains all the necessary keywords:

public sealed class ClrTraceEventParser : TraceEventParser

{

...

/// <summary>

/// Keywords are passed to TraceEventSession.EnableProvider to enable particular sets of

/// </summary>

[Flags]

public enum Keywords : long

{

...

/// <summary>

/// What is needed to get symbols for JIT compiled code.

/// </summary>

JITSymbols = Jit | StopEnumeration | JittedMethodILToNativeMap | SupressNGen | Loader,

...

};

}

We are ready to enable the CLR provider in our ETW session:

let keywords = ClrKeywords.JITSymbols

session.EnableProvider(ClrTraceEventParser.ProviderGuid, TraceEventLevel.Always, uint64(keywords)) |> ignore

Our session is now open and saves ETW events to the .etl file we specified. It’s time to perform the action we want to profile, for example, start a new process. When done, we should call Stop or Dispose to stop the ETW session and close the .etl file. For .NET processes that didn’t start or finish when the session was running, we also need to collect CLR rundown events. Without this data, we will lack JIT mappings, and we won’t be able to resolve some of the managed stacks. Here is the code for creating a rundown session:

let collectClrRundownEvents () =

let rec waitForRundownEvents filePath =

// Poll until 2 seconds goes by without growth.

let len = FileInfo(filePath).Length

Thread.Sleep(2000)

if FileInfo(filePath).Length = len then

waitForRundownEvents filePath

else ()

printfn "Forcing rundown of JIT methods."

let rundownFileName = Path.ChangeExtension(etlFilePath, ".clrRundown.etl")

use rundownSession = new TraceEventSession(sessionName + "-rundown", rundownFileName)

rundownSession.EnableProvider(ClrRundownTraceEventParser.ProviderGuid,

TraceEventLevel.Verbose, uint64(ClrRundownTraceEventParser.Keywords.Default)) |> ignore

waitForRundownEvents rundownFileName

printfn "Done with rundown."

The final steps are to merge the .etl files, add necessary module information, and prepare symbol files for the pre-compiled assemblies. Fortunately, the TraceEvent library has one method that performs all the operations mentioned above: ZippedETLWriter.WriteArchive. Internally, this method uses KernelTraceControl.dll!CreateMergedTraceFile (from WPT) to merge the .etl files and generate ‘synthetic’ events for all the modules which were loaded by processes during the trace session. If you’d like to see the PInvoke signature of this method, decrypt the OSExtensions.cs file. For all the pre-compiled assemblies (ETWTraceEventSource.GetModulesNeedingSymbols) TraceEvent generates the .pdb files (SymbolReader.GenerateNGenSymbolsForModule) and packs them into the final .zip file.

Our final recording code looks as follows:

do

use session = new TraceEventSession(sessionName, etlFilePath)

let traceFlags = NtKeywords.Profile ||| NtKeywords.ImageLoad // NtKeywords.Process and NtKeywords.Thread are added automatically

let stackFlags = NtKeywords.Profile

session.EnableKernelProvider(traceFlags, stackFlags) |> ignore

// we need CLR to resolve managed stacks

let keywords = ClrKeywords.JITSymbols

session.EnableProvider(ClrTraceEventParser.ProviderGuid, TraceEventLevel.Always, uint64(keywords)) |> ignore

printfn "Press enter to stop the tracing session"

Console.ReadLine() |> ignore

session.Stop() |> ignore

collectClrRundownEvents ()

printfn "Collecting the data required for stack resolution..."

let writer = ZippedETLWriter(etlFilePath, log)

if not (writer.WriteArchive(Compression.CompressionLevel.NoCompression)) then

printfn "Error occured while merging the data"

else

printfn "Trace session completed"

It is very similar to the 40_SimpleTraceLog sample from the Perfview repository. So if you are looking for C# code, check it out.

Analyzing call stacks with TraceLog

Once we finished and post-processed the trace session, it’s time to analyze the collected events. Usually, when we want to process ETW events, we create the TraceEventSession instance and assign callbacks for interesting events. Using callbacks to the TraceEventSession source is the most efficient way of processing ETW events. However, this approach has some drawbacks. TraceEventParser instances are usually stateless, so if you need to work on aggregated data (processes, threads, modules), you need to implement the aggregation logic. For example, to list process lifetimes saved in the trace, we need to process the ProcessStartGroup and ProcessEndGroup events. Resolving call stacks is even more demanding and involves examining different groups of events. Fortunately, the TraceEvent library again gives us a hand and provides the TraceLog class. As stated in the documentation, TraceLog represents a higher-level event processing. It keeps a backlog of various session objects, including processes, threads, and call stacks. Our task is to instantiate the SymbolReader class, download or unpack the required symbol files, and process the event call stacks. The code below does all that:

type CallStackNode (callsCount : int32) =

inherit Dictionary<string, CallStackNode>(StringComparer.OrdinalIgnoreCase)

member val CallsCount = callsCount with get, set

let loadSymbols (traceLog : TraceLog) (proc : TraceProcess) =

use symbolReader = new SymbolReader(log)

proc.LoadedModules |> Seq.where (fun m -> not (isNull m.ModuleFile))

|> Seq.iter (fun m -> traceLog.CodeAddresses.LookupSymbolsForModule(symbolReader, m.ModuleFile))

let buildCallStacksTree (traceLog : TraceLog) pid tid =

let

perfInfoTaskGuid = Guid(int32 0xce1dbfb4, int16 0x137e, int16 0x4da6, byte 0x87, byte 0xb0, byte 0x3f, byte 0x59, byte 0xaa, byte 0x10, byte 0x2c, byte 0xbc)

let perfInfoOpcode = 46

let callStacks = CallStackNode(0)

let processCallStack (callStack : TraceCallStack) =

let addOrUpdateChildNode (node : CallStackNode) (callStack : TraceCallStack) =

let decodedAddress = sprintf "%s!%s" callStack.CodeAddress.ModuleName callStack.CodeAddress.FullMethodName

match node.TryGetValue(decodedAddress) with

| (true, childNode) ->

childNode.CallsCount <- childNode.CallsCount + 1

childNode

| (false, _) ->

let childNode = CallStackNode(1)

node.Add(decodedAddress, childNode)

childNode

let rec processStackFrame (callStackNode : CallStackNode) (callStack : TraceCallStack) =

let caller = callStack.Caller

if isNull caller then // root node

callStackNode.CallsCount <- callStackNode.CallsCount + 1

let childNode =

if isNull caller then addOrUpdateChildNode callStackNode callStack

else processStackFrame callStackNode caller

addOrUpdateChildNode childNode callStack

processStackFrame callStacks callStack |> ignore

traceLog.Events

|> Seq.filter (fun ev -> ev.ProcessID = pid && ev.ThreadID = tid && ev.TaskGuid = perfInfoTaskGuid && (int32 ev.Opcode = perfInfoOpcode))

|> Seq.iter (fun ev -> processCallStack (ev.CallStack()))

callStacks

let tryFindProcess (traceLog : TraceLog) processNameOrPid =

let (|Pid|ProcessName|) (s : string) =

match Int32.TryParse(s) with

| (true, pid) -> Pid pid

| (false, _) -> ProcessName s

let filter =

match processNameOrPid with

| Pid pid -> fun (p : TraceProcess) -> p.ProcessID = pid

| ProcessName name -> fun (p : TraceProcess) -> String.Equals(p.Name, name, StringComparison.OrdinalIgnoreCase)

let processes = traceLog.Processes |> Seq.where filter |> Seq.toArray

if processes.Length = 0 then

None

elif processes.Length = 1 then

Some processes.[0]

else

processes |> Seq.iteri (fun i p -> printfn "[%d] %s (%d): '%s'" i p.Name p.ProcessID p.CommandLine)

printf "Which process to analyze []: "

match Int32.TryParse(Console.ReadLine()) with

| (false, _) -> None

| (true, n) -> processes |> Seq.tryItem n

let analyze processNameOrPid maxDepth =

let reader = ZippedETLReader(etlFilePath + ".zip", log)

reader.UnpackArchive()

let options = TraceLogOptions(ConversionLog = log)

let traceLog = TraceLog.OpenOrConvert(etlFilePath, options)

match tryFindProcess traceLog processNameOrPid with

| None -> printfn "No matching process found in the trace"

| Some proc ->

printfn "%s [%s] (%d)" proc.Name proc.CommandLine proc.ProcessID

let sw = Stopwatch.StartNew()

printfn "[%6d ms] Loading symbols for modules in process %s (%d)" sw.ElapsedMilliseconds proc.Name proc.ProcessID

loadSymbols traceLog proc

// usually, system process has PID 4, but TraceEvent attaches the drivers to the Idle process (0)

let systemProc = traceLog.Processes |> Seq.where (fun p -> p.ProcessID = 0) |> Seq.exactlyOne

printfn "[%6d ms] Loading symbols for system modules" sw.ElapsedMilliseconds

loadSymbols traceLog systemProc

printfn "[%6d ms] Starting call stack analysis" sw.ElapsedMilliseconds

let printThreadCallStacks (thread : TraceThread) =

let callStacks = buildCallStacksTree traceLog proc.ProcessID thread.ThreadID

let rec getCallStack depth name (callStack : CallStackNode) =

let folder frames (kv : KeyValuePair<string, CallStackNode>) =

if kv.Value.Count = 0 || (maxDepth > 0 && depth >= maxDepth) then frames

else getCallStack (depth + 1) kv.Key kv.Value |> List.append frames

callStack

|> Seq.fold folder [ (sprintf "%s├─ %s [%d]" ("│ " |> String.replicate depth) name callStack.CallsCount) ]

let stacks = getCallStack 0 (sprintf "Thread (%d) '%s'" thread.ThreadID thread.ThreadInfo) callStacks

stacks |> List.iter (fun s -> printfn "%s" s)

proc.Threads |> Seq.iter printThreadCallStacks

printfn "[%6d ms] Completed call stack analysis" sw.ElapsedMilliseconds

An example output might look as follows (I removed some lines for brevity):

PS netcoreapp3.1> .\etwprofiler.exe analyze testproc 10

testproc ["C:\Users\me\testproc.exe" ] (15416)

[ 2 ms] Loading symbols for modules in process testproc (15416)

[ 428 ms] Loading symbols for system modules

[ 21400 ms] Starting call stack analysis

├─ Thread (2412) 'Startup Thread' [121]

│ ├─ ntdll!LdrInitializeThunk [11]

│ │ ├─ ntdll!LdrInitializeThunk [11]

│ │ │ ├─ ntdll!LdrpInitialize [11]

│ │ │ │ ├─ ntdll!??_LdrpInitialize [11]

│ │ │ │ │ ├─ ntdll!LdrpInitializeProcess [11]

│ │ │ │ │ │ ├─ ntdll!LdrpInitializeTls [1]

│ │ │ │ │ │ ├─ ntdll!LdrLoadDll [2]

│ │ │ │ │ │ │ ├─ ntdll!LdrpLoadDll [2]

...

├─ Thread (24168) '' [1]

│ ├─ ntdll!RtlUserThreadStart [1]

│ │ ├─ ntdll!RtlUserThreadStart [1]

│ │ │ ├─ kernel32!BaseThreadInitThunk [1]

│ │ │ │ ├─ ntdll!TppWorkerThread [1]

│ │ │ │ │ ├─ ntdll!TppAlpcpExecuteCallback [1]

│ │ │ │ │ │ ├─ rpcrt4!LrpcIoComplete [1]

│ │ │ │ │ │ │ ├─ rpcrt4!LRPC_ADDRESS::ProcessIO [1]

├─ Thread (9808) '' [0]

├─ Thread (22856) '.NET ThreadPool Worker' [1]

│ ├─ ntdll!RtlUserThreadStart [1]

│ │ ├─ ntdll!RtlUserThreadStart [1]

│ │ │ ├─ kernel32!BaseThreadInitThunk [1]

│ │ │ │ ├─ coreclr!Thread::intermediateThreadProc [1]

│ │ │ │ │ ├─ coreclr!ThreadpoolMgr::WorkerThreadStart [1]

│ │ │ │ │ │ ├─ coreclr!UnManagedPerAppDomainTPCount::DispatchWorkItem [1]

│ │ │ │ │ │ │ ├─ coreclr!ThreadpoolMgr::AsyncTimerCallbackCompletion [1]

│ │ │ │ │ │ │ │ ├─ coreclr!AppDomainTimerCallback [1]

│ │ │ │ │ │ │ │ │ ├─ coreclr!ManagedThreadBase_DispatchOuter [1]

│ │ │ │ │ │ │ │ │ │ ├─ coreclr!ManagedThreadBase_DispatchMiddle [1]

├─ Thread (12352) '.NET ThreadPool Gate' [0]

[ 21835 ms] Completed call stack analysis

The full profiler code is available in my blog samples repository. For my Dotnetos session about tracing, I also prepared the C# version so you may check it too. And, of course, I invite you to watch the whole session!

A CPU sampling profiler in less than 200 lines

lowleveldesign.org

Software tracing, debugging, and security

Collecting profiling data

Analyzing call stacks with TraceLog

Recommend

GitHub - rbspy/rbspy: Sampling profiler for Ruby

GitHub - nokia/nperf: A sampling CPU profiler for Linux

GitHub - adsr/phpspy: Sampling profiler for PHP 7

Py-Spy: A sampling profiler for Python programs

Creating a blockchain with less than 100 code lines

How I Created a Landing page in Less than 100 lines of Code

GitHub - sj-i/php-profiler: a sampling profiler for PHP written in PHP

GitHub - javierhonduco/rbperf: Sampling profiler and tracer for Ruby (CRuby) whi...

GitHub - jvm-profiling-tools/async-profiler: Sampling CPU and HEAP profiler for...

Sampling Profiler & Rogue case of App Profiling

About Joyk