24

GDI leaks and the importance of luck

 4 years ago
source link: https://randomascii.wordpress.com/2020/05/10/gdi-leaks-and-the-importance-of-luck/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

rUZRBzy.jpg!web In May 2019 I was asked to look at a potentially serious Chrome bug. I initially misdiagnosed it as unimportant, thus wasting two valuable weeks, and when I rejoined the investigation it was the number one browser-process crash in Chrome’s beta channel. Oops.

On June 6th, the same day I realized I had misinterpreted the crash data, the bug was marked as ReleaseBlock-Stable meaning that we couldn’t ship our new Chrome version to most of our users until we figured out what was going on.

The crash was happening because we were running out of GDI (Graphics Device Interface) objects , but we didn’t know what type of GDI objects, our crash data gave us no clues as to where the problem was happening, and we couldn’t reproduce the problem.

Several of us worked hard on the bug on June 6th and 7th, testing out theories but not making any clear progress. Then, on June 8th I went to check my email and Chrome immediately crashed. It was the crash .

The irony is rich. While I was searching through changes and crash reports and brainstorming about what could possibly be causing Chrome’s browser process to leak GDI objects the GDI object count in my browser was creeping inexorably upward, and on the morning of the 8th it tipped over the magic 10,000 number . At that point one of our GDI object allocations failed and we intentionally crashed the browser. This was an improbably lucky break.

Bugs that are reproducible are, inevitably, fixable. All I had to do was figure out how I had triggered the bug and we would be able to fix it.

First, some background

UvuAB3Z.png!web In most places in Chromium’s code when we try to allocate a GDI object we check to see if the allocation succeeded. If the allocation failed then we record some information on the stack and then intentionally crash, as you can see in this source code . We intentionally crash because if we can’t allocate GDI objects then we can’t render to the screen – it’s better to report the issue (if crash reporting is enabled) and restart the process rather than display a blank UI. The default limit is 10,000 GDI objects per process and we usually use just a few hundred so if we hit that limit then something has gone terribly wrong.

When we get one of the crash reports showing a GDI object allocation failure we have a call stack and all sorts of other useful information. Great! The problem is that these crash dumps aren’t necessarily related to the bug. That’s because the code that is leaking GDI objects and the code that reports the failure may not be the same code.

That is, roughly speaking we had these two types of code:

void GoodCode() {    auto x = AllocateGDIObject();    if (!x)      CollectGDIUsageAndDie ();    UseGDIObject(x);    FreeGDIObject(x);  }
void BadCode() {    auto x = AllocateGDIObject();    UseGDIObject(x);  } 

The good code notices when an allocation has failed and reports it, whereas the bad code both ignores failures and leaks the objects, thus setting up the good code to take the fall.

Chromium is a few million lines of code. We didn’t know which function was buggy, and we didn’t even know what type of GDI object was being leaked. One of my coworkers added code to troll through the Process Environment Block before crashing, to get counts by GDI object type, but for all of the enumerated types (DCs, regions, bitmaps, palettes, fonts, brushes, pens, and unknown) the counts totaled less than one hundred. Weird.

The misinterpretation

Our CollectGDIUsageAndDie function has a catchy name I think you’ll agree. Very evocative.

The problem is that it did too much. CollectGDIUsageAndDie checked for about a dozen different types of GDI object allocation failures and, due to inlining, they all ended up getting the same crash signature – they all crashed in the main function and got bucketed together. So one of my coworkers wisely landed a change to break out the different checks to different (non-inlined) functions. This meant that we could now tell at a glance which check was failing.

Unfortunately this meant that when we started getting crash reports from CrashIfExcessiveHandles I confidently said “this isn’t a crash spike – it’s just a signature change.”

That was a mistake. It was a crash spike and a signature change. It was a dessert topping and a floor wax . Oops. Sloppy analysis there Dawson. No cookies for you.

Back to our story

At this point I knew that something I had done on June 7th had used up almost 10,000 GDI objects in a day. If I could figure out what it was then I could crack the case.

MRjmmmu.png!web Windows’ Task Manager has an optional GDI objects column which can be used to look for leaks. On June 7th I was working from home, connecting to my machine at work, and I had that column enabled on my work machine as I ran tests and tried to come up with a repro scenario. But meanwhile it was the web browser on my home machine which was leaking GDI objects.

The main thing I had been using the browser on my home machine for was to connect to my work machine, using the Chrome Remote Desktop (CRD) app . So, I enabled the GDI objects column on my home machine and started experimenting. It didn’t take long to get results.

In fact, the bug timeline shows that it took just 35 minutes to go from “I’ve hit the crash” (2:00 pm) to “it’s something to do with CRD” to “it’s cursors.” Did I mention how much easier bug investigations are when you have a local repro?

It turned out that every time the CRD app (or any Chrome app?) changed cursors it would leak six GDI objects. Wiggling the mouse over the right part of the screen when using Chrome Remote Desktop could leak hundreds of GDI objects per minute, and thousands per hour.

After a month of no progress this problem had gone from intractable to a simple fix. I created a hack fix and then one of my coworkers (apparently I didn’t do any actual work on this bug) created the real fix . It was uploaded at 11:16 am on June 10th, and landed at 1:00 pm. A few merges later and the bug was vanquished.

That’s it?

Fixing bugs is nice, but what we really want is a way to make sure the bugs never come back. Using C++ objects to manage resources ( RAII FTW! ) is clearly the right thing to do, but in this case our WebCursor class had a bug.

When it comes to memory leaks there is a robust set of systems available. Microsoft hasheap snapshots, Chromium has heap profiling in the wild and leak sanitizer on test machines. But GDI object leak detection seems to have been neglected. The Process Information Block has incomplete information, some GDI objects can only be enumerated in kernel mode, and there isn’t a single chokepoint for allocating or freeing objects to allow easy tracing. This wasn’t the first GDI object leak I’ve dealt with and it won’t be the last, because there isn’t a robust way of tracking them. Some suggestions for Windows-Next:

  • Make it trivial to get counts for all GDI object types, without requiring arcane PEB reading (and without skipping cursors)
  • Have a supported way to intercept and trace all GDI object creations and destructions, for robust tracking
  • Make sure this is documented

That’s it. This sort of tracking wouldn’t even be that hard, since GDI objects are necessarily constrained in a way that memory isn’t. It would be great to make these quirky but unavoidable GDI objects safer to use. Please?


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK