
USE Method: Mac OS X Performance Checklist

source link: https://www.brendangregg.com/USEmethod/use-macosx.html

This is my example USE Method-based performance checklist for the Apple Mac OS X operating system, for identifying common bottlenecks and errors. This draws upon both command line and graphical tools for coverage, focusing where possible on those that are provided with the OS by default, or by Apple (eg, Instruments). Further notes about tools are provided after this table.

Some of the metrics are easy to find in various GUIs or from the command line (eg, using Terminal; if you've never used Terminal before, follow my instructions at the top of this post). Many metrics require some math, inference, or quite a bit of digging. This will hopefully get easier in the future, if tools add a USE method wizard or expose the metrics required to follow this checklist directly.
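Much of the "math" this checklist calls for is simple rate arithmetic. For example, interface utilization can be derived by sampling an interface's byte counter twice, dividing the delta by the interval, and comparing to the known link maximum. A minimal sketch, where the two counter readings and the 1 Gbit/s link speed are hypothetical values, not measurements:

```shell
# Hedged sketch: interface utilization from two byte-counter samples.
# b1/b2 are hypothetical "Ibytes" readings taken 10 seconds apart
# (eg, from netstat -ib); linkmax assumes a 1 Gbit/s link.
b1=1000000000      # bytes at t0
b2=1125000000      # bytes at t0 + 10s
interval=10        # seconds between samples
linkmax=125000000  # 1 Gbit/s = 125,000,000 bytes/s
util=$(awk -v a="$b1" -v b="$b2" -v t="$interval" -v m="$linkmax" \
    'BEGIN { printf "%.1f", (b - a) / t / m * 100 }')
echo "interface utilization: ${util}%"
```

Note this counts one direction only; input and output should each be compared to the link maximum.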

Physical Resources, Standard

CPU
  • utilization: system-wide: iostat 1, "us" + "sy"; per-cpu: DTrace [1]; Activity Monitor → CPU Usage or Floating CPU Window; per-process: top -o cpu, "%CPU"; Activity Monitor → Activity Monitor, "%CPU"; per-kernel-thread: DTrace profile stack()
  • saturation: system-wide: uptime, "load averages" > CPU count; latency, "SCHEDULER" and "INTERRUPTS"; per-cpu: dispqlen.d (DTT), non-zero "value"; runocc.d (DTT), non-zero "%runocc"; per-process: Instruments → Thread States, "On run queue"; DTrace [2]
  • errors: dmesg; /var/log/system.log; Instruments → Counters, for PMC and whatever error counters are supported (eg, thermal throttling)

Memory capacity
  • utilization: system-wide: vm_stat 1, main memory free = "free" + "inactive", in units of pages; Activity Monitor → Activity Monitor → System Memory, "Free" for main memory; per-process: top -o rsize, "RSIZE" is resident main memory size, "VSIZE" is virtual memory size; ps -alx, "RSS" is resident set size, "SZ" is virtual memory size; ps aux similar (legacy format)
  • saturation: system-wide: vm_stat 1, "pageout"; per-process: anonpgpid.d (DTT), DTrace vminfo:::anonpgin [3] (frequent anonpgin == pain); Instruments → Memory Monitor, high rate of "Page Ins" and "Page Outs"; sysctl vm.memory_pressure [4]
  • errors: System Information → Hardware → Memory, "Status" for physical failures; DTrace failed malloc()s

Network interfaces
  • utilization: system-wide: netstat -i 1, assume one very busy interface and use input/output "bytes" / known max (note: includes localhost traffic); per-interface: netstat -I interface 1, input/output "bytes" / known max; Activity Monitor → Activity Monitor → Network, "Data received/sec", "Data sent/sec" / known max (note: includes localhost traffic); atMonitor, interface percent
  • saturation: system-wide: netstat -s, for saturation-related metrics, eg netstat -s | egrep 'retrans|overflow|full|out of space|no bufs'; per-interface: DTrace
  • errors: system-wide: netstat -s | grep bad, for various metrics; per-interface: netstat -i, "Ierrs", "Oerrs" (eg, late collisions), "Colls" [5]

Storage device I/O
  • utilization: system-wide: iostat 1, "KB/t" and "tps" are rough usage stats [6]; DTrace could be used to calculate a percent busy, using io provider probes; atMonitor, "disk0" is percent busy; per-process: iosnoop (DTT), shows usage; iotop (DTT), has -P for percent I/O
  • saturation: system-wide: iopending (DTT)
  • errors: DTrace io:::done probe when /args[0]->b_error != 0/

Storage capacity
  • utilization: file systems: df -h; swap: sysctl vm.swapusage, for swap file usage; Activity Monitor → Activity Monitor → System Memory, "Swap used"
  • saturation: not sure this one makes sense; once it's full, ENOSPC
  • errors: DTrace; /var/log/system.log file system full messages
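The vm_stat figures above are reported in pages, so a conversion is needed for byte totals. A sketch using hypothetical page counts and the 4096-byte page size vm_stat reports on these systems:

```shell
# Hedged sketch: main memory "free" (free + inactive) in MB, from
# hypothetical vm_stat page counts; assumes the 4096-byte page size
# that vm_stat prints in its header.
pages_free=123456      # hypothetical "Pages free" from: vm_stat
pages_inactive=50000   # hypothetical "Pages inactive"
page_size=4096
free_mb=$(awk -v f="$pages_free" -v i="$pages_inactive" -v p="$page_size" \
    'BEGIN { printf "%.1f", (f + i) * p / 1048576 }')
echo "main memory free: ${free_mb} MB"
```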

  • [1] eg: dtrace -x aggsortkey -n 'profile-100 /!(curthread->state & 0x80)/ { @ = lquantize(cpu, 0, 1000, 1); } tick-1s { printa(@); clear(@); }'. Josh Clulow also wrote a simple C program to dig out per-CPU utilization: cpu_usage.c.
  • [2] Until there are sched:::enqueue/dequeue probes, I suspect this could be done using fbt tracing of thread_*(). I haven't tried yet. It might be worth seeing what Instruments uses for its "On run queue" thread state trace, and DTracing that.
  • [3] eg: dtrace -n 'vminfo:::anonpgin { printf("%Y %s", walltimestamp, execname); }'.
  • [4] the kernel source under bsd/vm/vm_unix.c describes this as "Memory pressure indicator", although I've yet to see this as non-zero.
  • [5] the netstat(1) man page reads: "BUGS: The notion of errors is ill-defined."
  • [6] it would be great if Mac OS X iostat added a -x option to include utilization, saturation, and error columns, like Solaris "iostat -xnze 1".
  • atMonitor is a 3rd party tool that provides various statistics; I'm running version 2.7b, although it crashes if you leave the "Top Window" open for more than 2 seconds.
  • Activity Monitor is a default Apple performance monitoring tool with a graphical interface.
  • Instruments is an Apple performance analysis product with a graphical interface. It is comprehensive, consuming performance data from multiple frameworks, including DTrace. Instruments also includes functionality that was provided by separate previous performance analysis products, like CHUD and Shark, making it a one stop shop. It'd be wonderful if it included latency heat maps as well :-).
  • Temperature Monitor: 3rd party software that can read various temperature probes.
  • PMC == Performance Monitor Counters, aka CPU Performance Counters (CPC), Performance Instrumentation Counters (PICs), and more. These are processor hardware counters that are read via programmable registers on each CPU.
  • DTT == DTraceToolkit scripts, many of which were ported by the Apple engineers and shipped by default with Mac OS X. ie, you should be able to run these immediately, eg, sudo runocc.d.
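The first CPU saturation check in the table (load averages greater than CPU count) is easy to script. A portable sketch; it uses getconf rather than macOS's sysctl hw.ncpu so it also runs elsewhere, and the sed/awk parsing is an assumption about uptime's output format ("load average:" on some systems, "load averages:" on Mac OS X):

```shell
# Hedged sketch: flag possible CPU saturation when the 1-minute load
# average exceeds the number of online CPUs.
ncpu=$(getconf _NPROCESSORS_ONLN)
load1=$(uptime | sed 's/.*load average[s]*: *//' | awk '{gsub(",", "", $1); print $1}')
awk -v l="$load1" -v n="$ncpu" 'BEGIN {
    printf "1-min load %.2f vs %d CPUs: %s\n", l, n,
        (l + 0 > n + 0) ? "possible CPU saturation" : "OK"
}'
```

Bear in mind load averages on Mac OS X are a coarse, damped metric; dispqlen.d gives a more direct view of run queue length.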

Physical Resources, Advanced

GPU
  • utilization: directly: DTrace [7]; atMonitor, "gpu"; indirectly: Temperature Monitor; atMonitor, "gput"
  • saturation: DTrace [7]; Instruments → OpenGL Driver, "Client GLWait Time" (maybe)
  • errors: DTrace [7]

Storage controller
  • utilization: iostat 1, compare to known IOPS/tput limits per card
  • saturation: DTrace and look for kernel queueing
  • errors: DTrace the driver

Network controller
  • utilization: system-wide: netstat -i 1, assume one busy controller and examine input/output "bytes" / known max (note: includes localhost traffic)
  • saturation: see network interface saturation
  • errors: see network interface errors

CPU interconnect
  • utilization: for multi-processor systems, try Instruments → Counters, and relevant PMCs for CPU interconnect port I/O, and measure throughput / max
  • saturation: Instruments → Counters, and relevant PMCs for stall cycles
  • errors: Instruments → Counters, and relevant PMCs for whatever is available

Memory interconnect
  • utilization: Instruments → Counters, and relevant PMCs for memory bus throughput / max, or measure CPI and treat, say, 5+ as high utilization; Shark had "Processor bandwidth analysis" as a feature, which either was or included memory bus throughput, but I never used it
  • saturation: Instruments → Counters, and relevant PMCs for stall cycles
  • errors: Instruments → Counters, and relevant PMCs for whatever is available

I/O interconnect
  • utilization: Instruments → Counters, and relevant PMCs for tput / max if available; inference via known tput from iostat/...
  • saturation: Instruments → Counters, and relevant PMCs for stall cycles
  • errors: Instruments → Counters, and relevant PMCs for whatever is available

  • [7] I haven't found a shipped tool to provide GPU statistics easily. I'd like a gpustat that behaved like mpstat, with at least the columns: utilization, saturation, errors. Until there is such a tool, you could trace GPU activity (at least the scheduling of activity) using DTrace on the graphics drivers. It won't be easy. I imagine Instruments will at some point add a GPU instrument set (other than the OpenGL instruments), otherwise, 3rd party tools can be used, like atMonitor.
  • CPI == Cycles Per Instruction (others use IPC == Instructions Per Cycle).
  • I/O interconnect: this includes the CPU to I/O controller busses, the I/O controller(s), and device busses (eg, PCIe).
  • Using PMCs is typically a lot of work. This involves researching the processor manuals to see what counters are available and what they mean, and then collecting and interpreting them. I've used them on other OSes, but haven't used them all under Instruments → Counters, so I don't know if there's a hitch with anything there. Good luck.
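Once PMC counts are collected, the CPI metric used above for memory interconnect utilization is just cycles divided by instructions over the same interval. A sketch with hypothetical counter values (the two PMC readings are assumptions for illustration):

```shell
# Hedged sketch: CPI from hypothetical PMC readings over one interval;
# the checklist treats a CPI of roughly 5+ as high memory interconnect
# utilization (memory-stall dominated).
cycles=30000000000       # hypothetical unhalted-cycles PMC count
instructions=6000000000  # hypothetical instructions-retired PMC count
cpi=$(awk -v c="$cycles" -v i="$instructions" 'BEGIN { printf "%.1f", c / i }')
echo "CPI: $cpi"
```

The 5+ threshold is workload-dependent; compute-bound code can sit well under 1.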

Software Resources

Kernel mutex
  • utilization: DTrace and lockstat provider for held times
  • saturation: DTrace and lockstat provider for contention times [8]
  • errors: DTrace and fbt provider for return probes and error status

User mutex
  • utilization: plockstat -H (held time); DTrace plockstat provider
  • saturation: plockstat -C (contention); DTrace plockstat provider
  • errors: DTrace plockstat and pid providers, for EDEADLK, EINVAL, ...; see pthread_mutex_lock(3C)

Process capacity
  • utilization: current/max using: ps -e | wc -l / sysctl kern.maxproc; top, "Processes:" also shows current
  • saturation: not sure this makes sense
  • errors: "can't fork()" messages

File descriptors
  • utilization: system-wide: sysctl kern.num_files / sysctl kern.maxfiles; per-process: can figure out using lsof and ulimit -n
  • saturation: I don't think this one makes sense, as if it can't allocate or expand the array, it errors; see fdalloc()
  • errors: dtruss or custom DTrace to look for errno == EMFILE on syscalls returning fds (eg, open(), accept(), ...)

  • [8] eg, showing adaptive lock block time totals (in nanoseconds) by calling function name: dtrace -n 'lockstat:::adaptive-block { @[caller] = sum(arg1); } END { printa("%40a%@16d ns\n", @); }'
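The system-wide file descriptor check above is a current/max ratio. A sketch using hypothetical values in place of the two sysctls named in the table:

```shell
# Hedged sketch: system-wide file descriptor utilization as a percent.
# The values are hypothetical stand-ins for the sysctl outputs.
num_files=12000   # hypothetical, from: sysctl -n kern.num_files
maxfiles=49152    # hypothetical, from: sysctl -n kern.maxfiles
fd_util=$(awk -v n="$num_files" -v m="$maxfiles" \
    'BEGIN { printf "%.1f", n * 100 / m }')
echo "file descriptor utilization: ${fd_util}%"
```

On a live system, substitute the sysctl outputs for the two hard-coded values.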

Other Tools

I didn't include fs_usage, sc_usage, sample, spindump, heap, vmmap, malloc_history, leaks, and other useful Mac OS X performance tools, as here I'm beginning with questions (the methodology) and only including tools that answer them, rather than the other way around: listing all the tools and trying to find a use for each. Those other tools are useful for other methodologies, which can be applied after this one.

What's Next

See the USE Method for the follow-up methodologies after identifying a possible bottleneck. If you complete this checklist but still have a performance issue, move on to other methodologies: drill-down analysis and latency analysis.

For more performance analysis, also see my earlier post on Top 10 DTrace Scripts for Mac OS X.

Acknowledgements

Filling in this checklist required a lot of research, testing, and experimentation. Please reference back to this post if it helps you develop related material.

It's quite possible I've missed something or included the wrong metric somewhere (sorry); I'll update the post to fix such issues as they become understood, and note the update date at the top.

Also see my USE method performance checklists for Solaris, SmartOS, Linux, and FreeBSD.

