The bane of flaky tests: Part 2

In our previous post, The bane of flaky tests: Part 1, we described some high-level techniques that we used to try to bring stability to our tests. Today in Part 2 we’ll dive into some implementation details about how we handle these problems at a lower level.

A server capable of handling terabytes of data has a multitude of complex tests. At Kinaxis we have over 110,000 tests that get run repeatedly under different conditions, and over 67,000 of them are on the pipeline that I manage. Developers and management alike rely on the results of these tests to validate the rapid changes being made to a large codebase by a growing organization. We also employ a continuous delivery (CD) model, constantly providing updates to customers. This means there is never an acceptable time for our product to be unhealthy. As Kinaxis grows this has become a tough challenge, but it is one we have met head-on with innovation.

The number of back-end tests we have has nearly doubled in the past three years, and as we move forward with deploying new technology, that test count is continuing to grow. Part of my responsibility here, in addition to running the tests in the most efficient manner possible, is to investigate test failures. Any tests that produce inconsistent results, even when there are no code or infrastructure changes, are referred to as “flaky tests.”

A flaky test is sometimes the result of an environmental hiccup, but it typically represents a problem in the code. After ruling out environmental factors, we need to determine whether the problem is in the test or in the server itself. As you can see, investigating inconsistent failures is a very time-consuming activity, so we rely heavily on automation.

Automated investigation diagnoses tests faster

Our first installment on flaky tests discussed several methods for addressing these tests; however, most of them rely on manual intervention from developers to identify the cause of the inconsistencies. Manual intervention for an ever-growing codebase is not a sustainable solution. Instead, I identify developer time-sinks in our testing workflow, and design and implement automation solutions.

By making use of a large resource pool, I’ve implemented several stages of automation tools to diagnose these issues without any human involvement. The first stage of automatic diagnosis is a second attempt at any test failure. When tests run in a particular order, a test’s outcome can be affected by tests that ran earlier and changed the environment in some way.

To catch these false positives caused by test interference, we do not run our tests in a static order. To isolate false negatives, any test that failed is run again at the end of a test run, after the environment has been reset to its default state. If a test fails its second attempt, it is treated as an ordinary failure because the failure was reproduced. However, if a test passes its second attempt, we call it a “suspicious test,” and that triggers the next stage of automated investigation.
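To make that flow concrete, here is a minimal sketch of the rerun-and-classify step. It’s in Python, and `run_test` and `reset_environment` are hypothetical placeholders standing in for our actual tooling, not real APIs:

```python
import random

def run_suite(tests, run_test, reset_environment):
    """Shuffle the test order, rerun any failures after a reset, and
    classify each failure as ordinary (reproduced) or suspicious."""
    random.shuffle(tests)                  # non-static order surfaces interference
    failures = [t for t in tests if not run_test(t)]

    reset_environment()                    # back to the default state
    ordinary, suspicious = [], []
    for test in failures:
        if run_test(test):                 # passes on retry -> suspicious
            suspicious.append(test)
        else:                              # fails again -> ordinary failure
            ordinary.append(test)
    return ordinary, suspicious
```

The key point is that a failure only counts as ordinary once it has been reproduced after an environment reset; everything else is handed to the next stage.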

The second stage of automation is called the suspicious-test-chaser. It performs a progressive search for the combination of tests that produced the suspicious result (a test that failed when run in order but passed on its own). If the chaser determines that the failure was caused by interference from another test, it isolates that combination and reports it. No user action is required beyond implementing the final fix.
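The exact search strategy the chaser uses isn’t covered here, so the sketch below shows one plausible approach: a binary-search-style narrowing over the tests that ran before the suspect. The `passes_after(prefix, suspect)` helper is hypothetical; assume it resets the environment, runs `prefix` in order, then runs `suspect` and reports whether it passed:

```python
def chase_suspicious(preceding, suspect, passes_after):
    """Progressively narrow the list of tests that ran before `suspect`
    until a small interfering combination remains."""
    candidates = list(preceding)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        if not passes_after(first, suspect):     # first half alone reproduces it
            candidates = first
        elif not passes_after(second, suspect):  # second half alone reproduces it
            candidates = second
        else:
            break  # interference spans both halves; stop narrowing here
    if not passes_after(candidates, suspect):
        return candidates   # report the interfering combination
    return None             # not reproduced -> likely a genuinely flaky test
```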

From suspicious-test-chaser to flaky-finder

If the suspicious-test-chaser cannot find a combination of tests that reproduces the failure, then the “suspicious test” is likely actually a “flaky test”: one that is unstable not because of other tests, but on its own. The third stage of automation is the “flaky-finder.” This automation runs a single test repeatedly across more than 100 virtual machines, executing it more times than it would normally run in an entire year. In the end it produces a report on the stability of the test. Any test that isn’t 100% stable is automatically marked as flaky, removing the reproduction step for developers. A developer can pick up the work knowing the failure isn’t environmental, with confidence in a reproducible case.
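Here is a minimal sketch of that fan-out, again with a hypothetical `run_on_vm(vm, test)` helper standing in for whatever actually dispatches a test to a machine:

```python
from concurrent.futures import ThreadPoolExecutor

def flaky_finder(test, run_on_vm, vms, repetitions_per_vm):
    """Run one test many times across a pool of machines and report
    its observed stability."""
    with ThreadPoolExecutor(max_workers=len(vms)) as pool:
        futures = [
            pool.submit(run_on_vm, vm, test)
            for vm in vms
            for _ in range(repetitions_per_vm)
        ]
        results = [f.result() for f in futures]  # True for each passing run

    pass_rate = sum(results) / len(results)
    return {
        "test": test,
        "runs": len(results),
        "pass_rate": pass_rate,
        "flaky": pass_rate < 1.0,   # anything short of 100% stable is flaky
    }
```

A real report would presumably capture failure output and timing as well; for the sketch, the pass rate alone is enough to decide whether the test gets flagged as flaky.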

These enhancements went a long way toward improving developer efficiency and saving time, ironically showing that to test software that handles big data, you need good automation tools. There is still a lot more to discuss on the topic of our tests. In a coming post, I’ll share how we were able to scale up to handle our test load while maintaining full resource utilization.
