

A CPU sampling profiler in less than 200 lines
source link: https://lowleveldesign.org/2020/10/13/a-cpu-sampling-profiler-in-less-than-200-lines/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

lowleveldesign.org
Software tracing, debugging, and security
While working on a new version of wtrace, I am analyzing the PerfView source code to learn how its various features work internally. One of such features is the call stack resolution for ETW events. This post will show you how to use the TraceEvent library to decode call stacks, and, as an exercise, we will write a sampling process profiler. Before we start, remember to set DisablePagingExecutive to 1. That is a requirement to make call stacks work for ETW sessions.
Visit wtrace.net to receive updates on wtrace and my other troubleshooting tools.
Collecting profiling data
First, we need to collect trace data that we will later analyze. Our profiler will rely on SampledProfile events. Enabling them makes the Kernel emit a profiling event every 1ms for every CPU in the system. The SampledProfile event has only two fields: InstructionPointer and ThreadId. InstructionPointer is an address in the memory, so to find a method name that matches it, we need information about the images loaded by a process (the ImageLoad event). Our code to start the kernel session could look as follows:
use
session =
new
TraceEventSession(sessionName, etlFilePath)
let
traceFlags = NtKeywords.Profile ||| NtKeywords.ImageLoad
let
stackFlags = NtKeywords.Profile
session.EnableKernelProvider(traceFlags, stackFlags) |> ignore
That is enough to resolve call stacks from the native images (NGEN/R2R), where CPU executes the image’s compiled code. However, resolving managed call stacks requires some more data. For managed code, the instruction pointer points to a block of JIT-compiled code that resides in a process private memory and is valid only for the process lifetime. The Kernel does not know about the CLR internals, so we need to query the CLR ETW provider. Fortunately, the ClrTraceEventParser already has a flag which contains all the necessary keywords:
public
sealed
class
ClrTraceEventParser : TraceEventParser
{
...
/// <summary>
/// Keywords are passed to TraceEventSession.EnableProvider to enable particular sets of
/// </summary>
[Flags]
public
enum
Keywords :
long
{
...
/// <summary>
/// What is needed to get symbols for JIT compiled code.
/// </summary>
JITSymbols = Jit | StopEnumeration | JittedMethodILToNativeMap | SupressNGen | Loader,
...
};
}
We are ready to enable the CLR provider in our ETW session:
let
keywords = ClrKeywords.JITSymbols
session.EnableProvider(ClrTraceEventParser.ProviderGuid, TraceEventLevel.Always, uint64(keywords)) |> ignore
Our session is now open and saves ETW events to the .etl file we specified. It’s time to perform the action we want to profile, for example, start a new process. When done, we should call Stop or Dispose to stop the ETW session and close the .etl file. For .NET processes that didn’t start or finish when the session was running, we also need to collect CLR rundown events. Without this data, we will lack JIT mappings, and we won’t be able to resolve some of the managed stacks. Here is the code for creating a rundown session:
let
collectClrRundownEvents () =
let
rec
waitForRundownEvents filePath =
// Poll until 2 seconds goes by without growth.
let
len = FileInfo(filePath).Length
Thread.Sleep(2000)
if
FileInfo(filePath).Length = len
then
waitForRundownEvents filePath
else
()
printfn
"Forcing rundown of JIT methods."
let
rundownFileName = Path.ChangeExtension(etlFilePath,
".clrRundown.etl"
)
use
rundownSession =
new
TraceEventSession(sessionName +
"-rundown"
, rundownFileName)
rundownSession.EnableProvider(ClrRundownTraceEventParser.ProviderGuid,
TraceEventLevel.Verbose, uint64(ClrRundownTraceEventParser.Keywords.Default)) |> ignore
waitForRundownEvents rundownFileName
printfn
"Done with rundown."
The final steps are to merge the .etl files, add necessary module information, and prepare symbol files for the pre-compiled assemblies. Fortunately, the TraceEvent library has one method that performs all the operations mentioned above: ZippedETLWriter.WriteArchive. Internally, this method uses KernelTraceControl.dll!CreateMergedTraceFile (from WPT) to merge the .etl files and generate ‘synthetic’ events for all the modules which were loaded by processes during the trace session. If you’d like to see the PInvoke signature of this method, decrypt the OSExtensions.cs file. For all the pre-compiled assemblies (ETWTraceEventSource.GetModulesNeedingSymbols) TraceEvent generates the .pdb files (SymbolReader.GenerateNGenSymbolsForModule) and packs them into the final .zip file.
Our final recording code looks as follows:
do
use
session =
new
TraceEventSession(sessionName, etlFilePath)
let
traceFlags = NtKeywords.Profile ||| NtKeywords.ImageLoad
// NtKeywords.Process and NtKeywords.Thread are added automatically
let
stackFlags = NtKeywords.Profile
session.EnableKernelProvider(traceFlags, stackFlags) |> ignore
// we need CLR to resolve managed stacks
let
keywords = ClrKeywords.JITSymbols
session.EnableProvider(ClrTraceEventParser.ProviderGuid, TraceEventLevel.Always, uint64(keywords)) |> ignore
printfn
"Press enter to stop the tracing session"
Console.ReadLine() |> ignore
session.Stop() |> ignore
collectClrRundownEvents ()
printfn
"Collecting the data required for stack resolution..."
let
writer = ZippedETLWriter(etlFilePath, log)
if
not (writer.WriteArchive(Compression.CompressionLevel.NoCompression))
then
printfn
"Error occured while merging the data"
else
printfn
"Trace session completed"
It is very similar to the 40_SimpleTraceLog sample from the Perfview repository. So if you are looking for C# code, check it out.
Analyzing call stacks with TraceLog
Once we finished and post-processed the trace session, it’s time to analyze the collected events. Usually, when we want to process ETW events, we create the TraceEventSession instance and assign callbacks for interesting events. Using callbacks to the TraceEventSession source is the most efficient way of processing ETW events. However, this approach has some drawbacks. TraceEventParser instances are usually stateless, so if you need to work on aggregated data (processes, threads, modules), you need to implement the aggregation logic. For example, to list process lifetimes saved in the trace, we need to process the ProcessStartGroup and ProcessEndGroup events. Resolving call stacks is even more demanding and involves examining different groups of events. Fortunately, the TraceEvent library again gives us a hand and provides the TraceLog class. As stated in the documentation, TraceLog represents a higher-level event processing. It keeps a backlog of various session objects, including processes, threads, and call stacks. Our task is to instantiate the SymbolReader class, download or unpack the required symbol files, and process the event call stacks. The code below does all that:
type
CallStackNode (callsCount : int32) =
inherit
Dictionary<string, CallStackNode>(StringComparer.OrdinalIgnoreCase)
member
val
CallsCount = callsCount
with
get, set
let
loadSymbols (traceLog : TraceLog) (proc : TraceProcess) =
use
symbolReader =
new
SymbolReader(log)
proc.LoadedModules |> Seq.where (
fun
m -> not (isNull m.ModuleFile))
|> Seq.iter (
fun
m -> traceLog.CodeAddresses.LookupSymbolsForModule(symbolReader, m.ModuleFile))
let
buildCallStacksTree (traceLog : TraceLog) pid tid =
let
perfInfoTaskGuid = Guid(int32 0xce1dbfb4, int16 0x137e, int16 0x4da6, byte 0x87, byte 0xb0, byte 0x3f, byte 0x59, byte 0xaa, byte 0x10, byte 0x2c, byte 0xbc)
let
perfInfoOpcode = 46
let
callStacks = CallStackNode(0)
let
processCallStack (callStack : TraceCallStack) =
let
addOrUpdateChildNode (node : CallStackNode) (callStack : TraceCallStack) =
let
decodedAddress = sprintf
"%s!%s"
callStack.CodeAddress.ModuleName callStack.CodeAddress.FullMethodName
match
node.TryGetValue(decodedAddress)
with
| (
true
, childNode) ->
childNode.CallsCount <- childNode.CallsCount + 1
childNode
| (
false
, _) ->
let
childNode = CallStackNode(1)
node.Add(decodedAddress, childNode)
childNode
let
rec
processStackFrame (callStackNode : CallStackNode) (callStack : TraceCallStack) =
let
caller = callStack.Caller
if
isNull caller
then
// root node
callStackNode.CallsCount <- callStackNode.CallsCount + 1
let
childNode =
if
isNull caller
then
addOrUpdateChildNode callStackNode callStack
else
processStackFrame callStackNode caller
addOrUpdateChildNode childNode callStack
processStackFrame callStacks callStack |> ignore
traceLog.Events
|> Seq.filter (
fun
ev -> ev.ProcessID = pid && ev.ThreadID = tid && ev.TaskGuid = perfInfoTaskGuid && (int32 ev.Opcode = perfInfoOpcode))
|> Seq.iter (
fun
ev -> processCallStack (ev.CallStack()))
callStacks
let
tryFindProcess (traceLog : TraceLog) processNameOrPid =
let
(|Pid|ProcessName|) (s : string) =
match
Int32.TryParse(s)
with
| (
true
, pid) -> Pid pid
| (
false
, _) -> ProcessName s
let
filter =
match
processNameOrPid
with
| Pid pid ->
fun
(p : TraceProcess) -> p.ProcessID = pid
| ProcessName name ->
fun
(p : TraceProcess) -> String.Equals(p.Name, name, StringComparison.OrdinalIgnoreCase)
let
processes = traceLog.Processes |> Seq.where filter |> Seq.toArray
if
processes.Length = 0
then
None
elif
processes.Length = 1
then
Some processes.[0]
else
processes |> Seq.iteri (
fun
i p -> printfn
"[%d] %s (%d): '%s'"
i p.Name p.ProcessID p.CommandLine)
printf
"Which process to analyze []: "
match
Int32.TryParse(Console.ReadLine())
with
| (
false
, _) -> None
| (
true
, n) -> processes |> Seq.tryItem n
let
analyze processNameOrPid maxDepth =
let
reader = ZippedETLReader(etlFilePath +
".zip"
, log)
reader.UnpackArchive()
let
options = TraceLogOptions(ConversionLog = log)
let
traceLog = TraceLog.OpenOrConvert(etlFilePath, options)
match
tryFindProcess traceLog processNameOrPid
with
| None -> printfn
"No matching process found in the trace"
| Some proc ->
printfn
"%s [%s] (%d)"
proc.Name proc.CommandLine proc.ProcessID
let
sw = Stopwatch.StartNew()
printfn
"[%6d ms] Loading symbols for modules in process %s (%d)"
sw.ElapsedMilliseconds proc.Name proc.ProcessID
loadSymbols traceLog proc
// usually, system process has PID 4, but TraceEvent attaches the drivers to the Idle process (0)
let
systemProc = traceLog.Processes |> Seq.where (
fun
p -> p.ProcessID = 0) |> Seq.exactlyOne
printfn
"[%6d ms] Loading symbols for system modules"
sw.ElapsedMilliseconds
loadSymbols traceLog systemProc
printfn
"[%6d ms] Starting call stack analysis"
sw.ElapsedMilliseconds
let
printThreadCallStacks (thread : TraceThread) =
let
callStacks = buildCallStacksTree traceLog proc.ProcessID thread.ThreadID
let
rec
getCallStack depth name (callStack : CallStackNode) =
let
folder frames (kv : KeyValuePair<string, CallStackNode>) =
if
kv.Value.Count = 0 || (maxDepth > 0 && depth >= maxDepth)
then
frames
else
getCallStack (depth + 1) kv.Key kv.Value |> List.append frames
callStack
|> Seq.fold folder [ (sprintf
"%s├─ %s [%d]"
(
"│ "
|> String.replicate depth) name callStack.CallsCount) ]
let
stacks = getCallStack 0 (sprintf
"Thread (%d) '%s'"
thread.ThreadID thread.ThreadInfo) callStacks
stacks |> List.iter (
fun
s -> printfn
"%s"
s)
proc.Threads |> Seq.iter printThreadCallStacks
printfn
"[%6d ms] Completed call stack analysis"
sw.ElapsedMilliseconds
An example output might look as follows (I removed some lines for brevity):
PS netcoreapp3.1> .\etwprofiler.exe analyze testproc 10
testproc ["C:\Users\me\testproc.exe" ] (15416)
[ 2 ms] Loading symbols for modules in process testproc (15416)
[ 428 ms] Loading symbols for system modules
[ 21400 ms] Starting call stack analysis
├─ Thread (2412) 'Startup Thread' [121]
│ ├─ ntdll!LdrInitializeThunk [11]
│ │ ├─ ntdll!LdrInitializeThunk [11]
│ │ │ ├─ ntdll!LdrpInitialize [11]
│ │ │ │ ├─ ntdll!??_LdrpInitialize [11]
│ │ │ │ │ ├─ ntdll!LdrpInitializeProcess [11]
│ │ │ │ │ │ ├─ ntdll!LdrpInitializeTls [1]
│ │ │ │ │ │ ├─ ntdll!LdrLoadDll [2]
│ │ │ │ │ │ │ ├─ ntdll!LdrpLoadDll [2]
...
├─ Thread (24168) '' [1]
│ ├─ ntdll!RtlUserThreadStart [1]
│ │ ├─ ntdll!RtlUserThreadStart [1]
│ │ │ ├─ kernel32!BaseThreadInitThunk [1]
│ │ │ │ ├─ ntdll!TppWorkerThread [1]
│ │ │ │ │ ├─ ntdll!TppAlpcpExecuteCallback [1]
│ │ │ │ │ │ ├─ rpcrt4!LrpcIoComplete [1]
│ │ │ │ │ │ │ ├─ rpcrt4!LRPC_ADDRESS::ProcessIO [1]
├─ Thread (9808) '' [0]
├─ Thread (22856) '.NET ThreadPool Worker' [1]
│ ├─ ntdll!RtlUserThreadStart [1]
│ │ ├─ ntdll!RtlUserThreadStart [1]
│ │ │ ├─ kernel32!BaseThreadInitThunk [1]
│ │ │ │ ├─ coreclr!Thread::intermediateThreadProc [1]
│ │ │ │ │ ├─ coreclr!ThreadpoolMgr::WorkerThreadStart [1]
│ │ │ │ │ │ ├─ coreclr!UnManagedPerAppDomainTPCount::DispatchWorkItem [1]
│ │ │ │ │ │ │ ├─ coreclr!ThreadpoolMgr::AsyncTimerCallbackCompletion [1]
│ │ │ │ │ │ │ │ ├─ coreclr!AppDomainTimerCallback [1]
│ │ │ │ │ │ │ │ │ ├─ coreclr!ManagedThreadBase_DispatchOuter [1]
│ │ │ │ │ │ │ │ │ │ ├─ coreclr!ManagedThreadBase_DispatchMiddle [1]
├─ Thread (12352) '.NET ThreadPool Gate' [0]
[ 21835 ms] Completed call stack analysis
The full profiler code is available in my blog samples repository. For my Dotnetos session about tracing, I also prepared the C# version so you may check it too. And, of course, I invite you to watch the whole session!
Recommend
-
118
rbspy Have a running Ruby program that you want to profile without restarting it? Want to profile a Ruby command line program really easily? You want rbspy! rbspy can profile any Ruby program just by running 1 simple command....
-
97
README.md
-
63
Sampling profiler for PHP 7
-
69
Py-Spy: A sampling profiler for Python programs. Py-Spy is a sampling profiler for Python programs. It lets you visualize what your Python program is spending time on without restarting the program or modify...
-
23
Creating a blockchain with less than 100 code lines
-
7
How I Created a Landing page in Less than 100 lines of CodeJanuary 21st 2021 new story
-
10
php-profiler php-profiler is a sampling profiler (or a VM state inspector) written in PHP. It can read information about running PHP script from outside of the process. It's a stand alone CLI tool, so target programs don't need any m...
-
6
rbperf rbperf is a low-overhead sampling profiler and tracer for Ruby (CRuby) which runs in BPF Features The main goals for rbperf are On-CPU profiling support Low overhead Profi...
-
6
async-profiler This project is a low overhead sampling profiler for Java that does not suffer from Safepoint bias problem. It f...
-
4
Sampling Profiler & Rogue case of App Profiling
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK