The bane of flaky tests: Part 1

Today our software is covered by a vast interwoven set of automated tests. Through the wizardry of our DevOps staff, we keep hundreds of VMs busy night and day continually running all our tests – unit tests, functional tests, end-to-end tests, integration tests, performance tests, stability tests. Thanks to our amazing test automation systems, we’ve even been able to achieve high-quality monthly releases!

That said, every project I’ve worked on with automated testing suffers from the same irritant – flaky tests. Flaky tests are tests that fail sometimes… but not always. Unlike flakiness in a pastry, a flaky quality is detrimental when it comes to software testing! We’ve had tests that fail as rarely as once in every 300,000 runs. A review of the literature suggests that all major software companies struggle with flaky tests.

Flaky tests drive me nuts! They’re a constant distraction, they raise the worry that they might be hiding a big problem, and they’re hard to solve. Even worse, they tend to demoralize the team over time.

In the years I’ve been at Kinaxis, we’ve waged a battle against flaky tests that I think we will eventually win. Here are some of the things we’ve done:

Step 1: Identified and addressed the flakiest tests

Our first strategy to help us get a grip on the problem was to focus on the tests that failed the most. We augmented the test framework to run each flaky test three times. If a test passed two out of three times, we treated it as a pass.
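
A minimal sketch of the two-out-of-three rule, in Python, might look like the following. The function name `passes_by_majority` and its parameters are illustrative assumptions, not the framework described above, and the sketch simply runs the test every time and counts the passes.

```python
import itertools
from typing import Callable


def passes_by_majority(test: Callable[[], None], runs: int = 3, required: int = 2) -> bool:
    """Run `test` `runs` times; report a pass if at least `required` runs pass.

    A run counts as a failure if the test raises any exception
    (a failed assertion raises AssertionError).
    """
    passes = 0
    for _ in range(runs):
        try:
            test()
            passes += 1
        except Exception:
            # A failed run: only the tally matters for the final verdict.
            pass
    return passes >= required


# Usage: a hypothetical test that fails on its first run, then passes twice.
outcomes = itertools.cycle([False, True, True])


def sometimes_fails() -> None:
    if not next(outcomes):
        raise AssertionError("intermittent failure")


print(passes_by_majority(sometimes_fails))  # True: two of the three runs passed
```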

Next we addressed the flakiest tests. We considered the coverage they provided. In some cases, a flaky test covered the same functionality as another (non-flaky) test, so we could simply eliminate the flaky test. In other situations, we could make a small addition to an existing non-flaky test to cover the functionality of the flaky test, and then remove the flaky test.

Some flaky tests were old and needed to be rewritten. Some flaky tests covered very strange edge cases. We took some of these offline for the time being to remove noise.

Step 2: Generated test code

Investigation showed that, for a subset of our flaky tests, the cause of flakiness was threading mistakes in the test code. Multi-threaded code is complicated to write and easy to get wrong!

To address this, we wrote a test code generator that generates parts of a test. The tester just needed to fill in the other parts. This had two important benefits – first, it made it faster for our testers to create new tests; and second, the tests that used generated code were much less likely to be flaky.

Step 3: Pinpointed test interference causes

Tests sometimes adjust an aspect of the environment, like when they set a registry setting or create files on disk. Sometimes the test author forgets to restore the environment to its original state (such as by clearing the registry setting or deleting the files). As a result, a subsequent test fails because of these changes in the environment. This is known as test interference or, more casually, as tests that “don’t clean up after themselves.”

To detect these, we wrote tooling that checks every aspect of the environment after each test. We found this approach very effective, but it ultimately did not scale as the number of tests grew.
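
As a rough illustration of the idea (not the tooling described above), a checker can snapshot part of the environment before a test, snapshot it again afterwards, and report the differences. This Python sketch watches only environment variables and the files under one directory, as stand-ins for things like registry settings. The names `snapshot`, `leftovers`, `run_and_check`, and `HYPOTHETICAL_FLAG` are assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Dict, FrozenSet, List, Tuple

Snapshot = Tuple[Dict[str, str], FrozenSet[str]]


def snapshot(watched_dir: Path) -> Snapshot:
    """Capture the environment variables and the paths under `watched_dir`."""
    files = frozenset(str(p.relative_to(watched_dir)) for p in watched_dir.rglob("*"))
    return dict(os.environ), files


def leftovers(before: Snapshot, after: Snapshot) -> List[str]:
    """Describe every difference between two snapshots."""
    env_before, files_before = before
    env_after, files_after = after
    problems = []
    for name in sorted(set(env_before) | set(env_after)):
        if env_before.get(name) != env_after.get(name):
            problems.append(f"env var changed: {name}")
    for path in sorted(files_after - files_before):
        problems.append(f"file left behind: {path}")
    for path in sorted(files_before - files_after):
        problems.append(f"file removed: {path}")
    return problems


def run_and_check(test: Callable[[], None], watched_dir: Path) -> List[str]:
    """Run `test` and return a list of environment changes it did not undo."""
    before = snapshot(watched_dir)
    test()
    return leftovers(before, snapshot(watched_dir))


# Usage: a hypothetical test that forgets to clean up after itself.
with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp)

    def messy_test() -> None:
        (scratch / "output.txt").write_text("data")
        os.environ["HYPOTHETICAL_FLAG"] = "1"

    try:
        for problem in run_and_check(messy_test, scratch):
            print(problem)
    finally:
        # Undo the change so this example does not interfere with anything else.
        os.environ.pop("HYPOTHETICAL_FLAG", None)
```

Taking a full snapshot around every test is also where the cost comes from: the work grows with both the number of tests and the amount of state being watched.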

Step 4: Eliminated timeouts

Some of our tests had “timeouts” or “waits” that guessed how long another operation would take. When these tests failed intermittently, a common quick fix was to increase the timeouts. However, this made our test suites take longer.

We eliminated this problem by using call-backs to reliably indicate when an operation was complete. This also made our tests run more quickly.
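
One way to picture the call-back approach is the Python sketch below. `start_export` is a hypothetical asynchronous operation invented for the example. The test hands it a call-back that sets a `threading.Event`, then waits on that event instead of sleeping for a guessed duration. The 30-second value is an assumed safety ceiling so that a hung operation fails the test rather than blocking forever. It is not an estimate of how long the work takes, and the wait returns as soon as the call-back fires.

```python
import threading
from typing import Callable, List


def start_export(on_done: Callable[[str], None]) -> threading.Thread:
    """Hypothetical operation: works on a background thread, then calls `on_done`."""

    def work() -> None:
        result = "export complete"  # Stand-in for the real work.
        on_done(result)

    thread = threading.Thread(target=work)
    thread.start()
    return thread


def test_export_finishes() -> None:
    done = threading.Event()
    results: List[str] = []

    def on_done(result: str) -> None:
        results.append(result)
        done.set()  # Signal the waiting test that the operation has finished.

    worker = start_export(on_done)

    # Returns True as soon as the call-back fires, False only if the ceiling is hit.
    assert done.wait(timeout=30), "operation never signalled completion"
    worker.join()
    assert results == ["export complete"]


test_export_finishes()
```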

Thanks to these techniques, we found ourselves well on our way to eliminating flaky tests.

Read our next post (part 2) for details on our evolution to even more advanced techniques.

Discussions

Jason Rudolph

- August 23, 2021 at 8:10pm

It's great to see more teams talking about their experiences with flaky tests these days. Much of this post resonates with the experience we had developing the Atom text editor (atom.io). The combination of slow builds and flaky tests caused major slowdowns in our ability to ship quickly. 😬

We took a similar approach to what you've described: Identify the flakiest tests and address them first. By fixing the flakiest tests first, we were able to get some pretty significant improvements in reliability of the test suite with a fairly small amount of effort.

Honestly, much of the early effort was just trying to get an accurate inventory of flaky tests. It was a manual process of 1) noticing a failure and then 2) gathering enough info to file an issue for it (examples: https://github.com/search?q=org%3Aatom+involves%3Ajasonrudolph+flaky+test). But because it was a manual process, and because it was a fairly mundane task, it was always tempting to just click "rebuild" instead of taking the time to file an issue.

Ultimately, we found ourselves wishing for automated tooling to handle this for us. So, I set out to offer a way for teams to automatically detect, track, and rank flaky tests: https://buildpulse.io

Sharing here in case any other readers are in the same boat that we were in. 😅

Bob the builder

- August 26, 2021 at 8:14am

Tests that are applied after code is already written are doomed to fail eventually if the development team is not following a true TDD approach. I have seen entire suites of tests fail due to a single design change in which the decision makers never even talked to those writing the tests.

Tests are great for giving you confidence to deploy, but they are VERY expensive to maintain. In my experience, trying to get 100% coverage is a fool's errand. Just automate tests where it saves QA a lot of time, but don't try to automate tests in places where logic gets really fuzzy.