The bane of flaky tests: Part 1

Today our software is covered by a vast interwoven set of automated tests. Through the wizardry of our DevOps staff, we keep hundreds of VMs busy night and day continually running all our tests – unit tests, functional tests, end-to-end tests, integration tests, performance tests, stability tests. Thanks to our amazing test automation systems, we’ve even been able to achieve high-quality monthly releases!

That said, every project I’ve worked on with automated testing suffers from the same irritant – flaky tests. Flaky tests are tests that fail sometimes… but not always. Unlike flakiness in a pastry, a flaky quality is detrimental when it comes to software testing! We’ve had tests that fail as little as one in every 300,000 runs. A review of the literature shows evidence that all major software companies struggle with flaky tests.

Flaky tests drive me nuts! They’re a constant distraction; a worry that they might be hiding a big problem; and they’re hard to solve. Even worse, they tend to demoralize the team over time.

In the years I’ve been at Kinaxis, we’ve waged a battle against flaky tests that I think we will eventually win. Here are some of the things we’ve done:

Step 1: Identified and addressed the flakiest tests

Our first strategy to help us get a grip on the problem was to focus on the tests that failed the most. We augmented the test framework to run each flaky test three times. If a test passed two out of three times, we treated it as a pass.

Next we addressed the flakiest tests. We considered the coverage they provided. In some cases, a flaky test covered the same functionality as another (non-flaky) test. We could simply eliminate the flaky test. In other situations, we could make a small addition to an existing non-flaky test to cover the functionality of the flaky test. We could then remove the flaky test.

Some flaky tests were old and needed to be re-written. Some flaky tests covered very strange edge cases. We took some of these off-line for the time being to remove noise.

Step 2: Generated test code

Investigation shows that for a subset of our flaky tests, the cause of flakiness was threading mistakes in the test code. Multi-threaded code is complicated to write and easy to get wrong!

To address this, we wrote a test code generator that generates parts of a test. The tester just needed to fill in the other parts. This had two important benefits – first, it made it faster for our testers to create new tests; and second, the tests that used generated code were much less likely to be flaky.

Step 3: Pinpointed test interference causes

Tests sometimes adjust an aspect of the environment, like when they set a registry setting or create files on disk. Sometimes the test author forgets to restore the environment to its original state (such as clear the registry setting, delete files, etc.). As a result, a subsequent test fails because of these changes in the environment. This is known as test interference, or, more casually as tests that “don’t clean up after themselves.”

To detect these, we wrote tooling that checks every aspect of the environment after each test. We found this approach very effective, but it ultimately did not scale as the number of tests grew.

Step 4: Eliminated timeouts

Some of our tests had “timeouts” or “waits” that guessed how long another operation would take. When these tests failed intermittently, a common quick fix was to increase the timeouts. However, this made our test suites take longer.

We eliminated this problem by using call-backs to reliably indicate when an operation was complete. This also made our tests run more quickly.
Thanks to these techniques, we found ourselves well on our way along our path to eliminating flaky tests.

Read our next post (part 2) for details on our evolution to even more advanced techniques.

Trusted by the worlds most admired brands

Master uncertainty.
Tame complexity.

From a winning beginning to long-term success

The latest resources on supply chains

The bane of flaky tests: Part 1

Step 1: Identified and addressed the flakiest tests

Step 2: Generated test code

Step 3: Pinpointed test interference causes

Step 4: Eliminated timeouts

More from Christine Fulford

Dev and QA working together

Trusted by the worlds most admired brands

Master uncertainty. Tame complexity.

From a winning beginning to long-term success

The latest resources on supply chains

Step 1: Identified and addressed the flakiest tests

Step 2: Generated test code

Step 3: Pinpointed test interference causes

Step 4: Eliminated timeouts

More from Christine Fulford

Dev and QA working together

Master uncertainty.
Tame complexity.