
A Methodological Approach to Fixing Flaky Tests

This is the second part of a two-part series of blog posts on flaky tests. You’ll find the first part here: Flaky Tests are Not Random Failures. If you have time for a longer read, it is recommended you start over there.

In the previous post, we focused on what could cause flaky tests and insisted that they’re not the product of true randomness.

Having a good idea of how flaky tests can happen helps to avoid them, but it won’t completely prevent them from appearing.

Below are some suggestions on what to do when you do come across a flaky test in your suite:

Keep track of it

The first step is to make sure that the flaky test is logged somewhere!

After searching GitHub issues tagged with “Flaky spec” to see whether an issue already exists, the developer should open an issue containing at least:

  • file path and line number of the failing test in the title
  • a link to the failed build on CI
  • a copy-paste of the failure output
  • the “Flaky spec” tag

If an issue already exists, adding a comment with a link to the failing CI logs will help collect evidence and potentially better understand the problem.

Sometimes, circumstances won’t allow a developer to dig into a flaky spec as soon as they find it, and they’ll find themselves clicking the “rebuild” button and crossing their fingers.

That can be understandable, but keeping track of a flaky spec should be the bare minimum!

Run the failed test locally in isolation

The next step in assessing a flaky test is to run it in the development environment, to see how it behaves.

As the test is supposedly flaky, running it once might not be enough to reproduce the failure. Often, we’ll run the same spec 100 times by wrapping it in a loop:
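
A minimal sketch of what that can look like (the model and assertion here are placeholders for the real spec):

    RSpec.describe Order do
      # Running the same example 100 times greatly increases the chance of
      # reproducing an intermittent failure locally.
      100.times do |i|
        it "computes the total (run ##{i})" do
          order = Order.new(line_items: [LineItem.new(price: 10), LineItem.new(price: 5)])

          expect(order.total).to eq(15)
        end
      end
    end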

The test run should result in one of the three outcomes below:

The test fails constantly

We need to confirm that the test’s setup and assertions are correct.

If they are, then the exercised code is probably broken and may need a fix.

If the test was wrong, then it needs to either be fixed or deleted.

Either way, some “context” was allowing this normally failing spec to pass on CI; this might be a sign of leaking state and needs to be fixed.

The test never fails

If the test is correct, then some state leaking from other tests might sometimes prevent it from succeeding. This needs to be tracked down and fixed.

Note that, for example, running a test 100 times without a failure after an attempted fix is by no means proof of its success! A “non-deterministic” behaviour might exist and still not show within 100 runs.

The test sometimes fails

This is a purely non-deterministic test that might succeed or fail regardless of other tests. Maybe part of the code involved is, for example, relying on non-deterministic systems. The fix will focus on improving either the test’s code or the exercised code to make them deterministic.
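
For example (a hypothetical spec, assuming FactoryBot and ActiveRecord), a query with no explicit ordering may return rows in any order, so an assertion on the first row only passes by luck:

    it "lists the most recent article first" do
      create(:article, title: "Old", published_at: 1.week.ago)
      create(:article, title: "New", published_at: 1.day.ago)

      # Article.all has no ORDER BY, so the database is free to return the
      # rows in any order: this expectation will sometimes fail.
      expect(Article.all.first.title).to eq("New")
    end

The deterministic fix here is to order explicitly, either in the exercised code or in the test (for example Article.order(published_at: :desc)).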

Reproduce failure context

When a flaky test cannot be identified in isolation, the next step is to get closer to the conditions in which the test ran when it failed. At the very least:

  • the codebase should be the same (same Git SHA)
  • the same test cases should be run
  • the tests should run in the same order (same random seed)

(I wrote a small gem to accelerate the process: RSpec::OrderedCommandFormatter.)

If the test consistently fails in this reproducible context, then the failure might be caused by leaking state. The next step is to use a bisect tool (like rspec --bisect) to isolate the minimal set of examples that reproduces the same failure.
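
Concretely, that can look like the commands below (the seed and file paths are placeholders to be taken from the failed build):

    # Re-run the same specs in the same order as the failed CI build
    bundle exec rspec --seed 12345 spec/models/order_spec.rb spec/features/checkout_spec.rb

    # Let RSpec search for the minimal set of examples reproducing the failure
    bundle exec rspec --seed 12345 --bisect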

Play with timing

If a test is suspected to be flaky because of timing issues, one option is to play with Rails’ TimeHelpers, or even the sleep method.

We saw examples using TimeHelpers in the first article.
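
As a quick reminder, a typical use looks something like this (the Subscription model is a placeholder; the helpers come from ActiveSupport::Testing::TimeHelpers):

    RSpec.describe Subscription do
      include ActiveSupport::Testing::TimeHelpers

      it "expires after a month" do
        subscription = Subscription.create!(started_at: Time.current)

        # Travel to a known point in time instead of relying on the wall
        # clock, which can drift between the setup and the expectation.
        travel_to 32.days.from_now do
          expect(subscription).to be_expired
        end
      end
    end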

If we suspect a given piece of code is at times too slow to the point of making a test fail, it can be made artificially slow using the sleep method to confirm the hunch:
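
For example, here is a sketch (PaymentGateway and the page content are hypothetical) that stubs the suspected slow call with an artificial delay:

    it "shows a confirmation message after paying" do
      # Make the suspected slow operation reliably slow: if the expectation
      # below now fails every time, the test depends on this call finishing
      # faster than it sometimes does on CI.
      allow(PaymentGateway).to receive(:charge) do
        sleep 2
        double(success?: true)
      end

      visit new_payment_path
      click_button "Pay"

      expect(page).to have_content("Payment confirmed")
    end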

Using timing helpers can either be all that’s needed to fix a flaky test, or it can reveal that the exercised code suffers from a race condition.

Ruby debugger

It can be useful to enter a debugger, such as pry-byebug, right before the failure is expected to happen.

We can then examine the environment: for example, looking at whether any inserted data has the values we expected, whether we are on the right page in a browser, and so on.

Combining the debugger with the 100.times technique and a conditional can help to get it where we want it to be:
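
A sketch of that combination (the factory and attribute are placeholders):

    100.times do |i|
      it "applies the discount (run ##{i})" do
        order = create(:order, promo_code: "WELCOME")

        # Only drop into the debugger in the suspicious case, rather than on
        # every one of the 100 runs (requires pry-byebug in the test group).
        binding.pry if order.discount.nil?

        expect(order.discount).to eq(10)
      end
    end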

Visual check

When a feature test that renders a page seems to be flaky, it can be useful to take a look at the actual page.

The capybara-screenshot gem allows taking screenshots in tests that render a page (whether through the Rack driver or the Chrome driver via Selenium). It can even take a screenshot automatically when a test fails! This is very useful for looking at a failure that happened on CI but is difficult to reproduce locally. On CircleCI, for example, it is possible to store any kind of artifact for future reference.
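
Setting it up with RSpec is essentially a one-line require; a sketch of the kind of configuration involved (the save path is an assumption, any directory the CI job uploads as artifacts will do):

    # spec/rails_helper.rb
    require "capybara-screenshot/rspec" # screenshot taken automatically when a feature test fails

    # Save screenshots somewhere the CI job is configured to store as artifacts
    Capybara.save_path = "tmp/capybara"

The gem also provides helpers such as screenshot_and_save_page for taking a screenshot explicitly in the middle of a test.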

[Screenshot: capybara-screenshot automatically took a screenshot when the test failed]

[Screenshot: the screenshot can be retrieved from CircleCI artifacts]

Using screenshots or a headed (not headless) browser, together with well-placed sleep or binding.pry statements, it should be possible to see what the page looks like at almost any point in the test.

Screenshots mainly help us identify two types of failure:

  • race conditions where JavaScript (or CSS animations) didn’t have time to complete before an expectation runs
  • layout issues whose behaviour varies depending on the environment (Chrome version, OS, etc.)

Run manually inside CI environment

When a test’s failure cannot be reproduced locally, it might be helpful to follow the steps above one more time, but this time, running on the CI platform.

CircleCI, for example, offers a “Rerun job with SSH” option, which runs the same job again and gives SSH access to the containers once the job completes. Once in a container, it is possible to apply the various techniques above in an environment as close as possible to the original failure.

Note that, in the case of CircleCI, the SSH’able container is only available for two hours, so being prepared and taking notes can be helpful.

In the past, reproducing a flaky spec directly on CI has allowed us, for example, to identify problems triggered by a new version of Chrome before we had updated it locally.

Understand before fixing

Tackling flaky tests can seem daunting at first sight, but they’re nothing that methodical investigation cannot fix!

Even though they appear to happen randomly, they’re usually triggered by a very reproducible set of conditions.

Trying to come up with a fix before understanding the root cause will often result in further disappointment, when it turns out the test was only passing temporarily after the attempted “fix”.

Using the approaches above should help in getting to that understanding, and increase confidence that we have actually gotten to the root of the problem the next time we encounter a flaky test.

