Determinism is the cornerstone of automated testing: every test should run and deliver a predictable result. When this does not happen, we call it a flaky test. A flaky test is one that both passes and fails periodically for the same code configuration. In other words, it oscillates between passing and failing from run to run, even when the code under test remains unchanged.
According to Google engineer John Micco, flaky tests account for 84 percent of the transitions Google observes from passing to failing. Not only do flaky tests take time, money, and resources to fix; they also stall project progress and hide real defects. While some teams would rather avoid them, it is still important to run flaky tests because they can help find real bugs. As Micco concludes, “more than one in seven of the tests written by our world-class engineers occasionally fail in a way not caused by changes to the code or tests.”
Why They Are a Problem
Even if you employ the best engineers, flaky tests may still be inevitable; even Google has them. At Google, flaky tests are added roughly as fast as they are fixed, which means teams need to accept that some level of flakiness is unavoidable and work to alleviate the problems it causes. Some engineers need 10,000 tests to pass as part of a code commit, and 1,000 of those tests may be flaky. To have the commit accepted, engineers must run the test suite until every test passes. Even if each of those flaky tests fails only once in a thousand runs, the chance of all of them passing in a single attempt is only about 37 percent (0.999^1000 ≈ 0.37).
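The arithmetic behind that figure is worth making explicit. A minimal sketch, assuming each flaky test fails independently with the same probability:

```python
# Sketch: the odds that a suite containing flaky tests passes in one attempt.
# Assumes each flaky test fails independently with the same failure rate.

def suite_pass_probability(flaky_tests: int, failure_rate: float) -> float:
    """Probability that every flaky test passes in a single run."""
    return (1.0 - failure_rate) ** flaky_tests

# 1,000 flaky tests, each failing once in a thousand runs:
p = suite_pass_probability(1000, 0.001)
print(f"{p:.2%}")  # roughly 37%
```

Put another way, engineers in this situation should expect to re-run the full suite about three times, on average, before getting an all-green result.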
Moreover, flaky tests are difficult to catch, for two reasons. First, they can slip past undetected by failing either too frequently or not frequently enough. Second, a failure may not be caused by the test itself but by the environment in which it runs (e.g., a particular mobile device). Although fixing these flakes takes substantial time and the failures rarely affect end users directly, the fixes are still necessary: some time-consuming and inconvenient fixes can save a company millions per day.
How to Fix Them
Often, teams rely on one of two mitigations: fix the test or delete it. Deleting the test, however, is not recommended. Not only is it demoralizing to delete tests after you’ve spent time creating them, but the tests themselves still have value: they can still find new defects, and even their flakiness can be indicative of the overall health of the system. The flakiness may not even be caused by the test itself, but by the underlying infrastructure, such as your Docker container or the mobile device on which the tests run. In short, you could be needlessly deleting valuable information. Why delete the entire test if the only problem was the device on which it was run?
Three Recommended Methods
There are three primary methods currently employed by developers to detect flaky tests:
1. Manually move them into separate runs.
This is done to prevent the tests from breaking the build. While this method gives developers more control, manually moving tests in and out of different runs is practically a full-time job.
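As a rough sketch of what a separate run looks like in practice, known-flaky tests can be gated behind a flag so the main CI job skips them and a separate, non-blocking job runs them alone. This example uses the Python standard library’s `unittest`; the decorator name, environment variable, and test names are our own illustrations:

```python
# Sketch: quarantine known-flaky tests behind an environment flag so the main
# CI run skips them and a separate job runs them on their own.
import os
import unittest

RUN_QUARANTINED = os.environ.get("RUN_QUARANTINED") == "1"

def flaky_quarantine(test):
    """Mark a test as quarantined: skipped unless RUN_QUARANTINED=1."""
    return unittest.skipUnless(RUN_QUARANTINED, "quarantined flaky test")(test)

class CheckoutTests(unittest.TestCase):
    @flaky_quarantine
    def test_payment_gateway_timeout(self):  # hypothetical flaky test
        self.assertTrue(True)

    def test_order_total(self):  # stable test, always runs
        self.assertEqual(2 + 2, 4)
```

The main build runs the suite normally (the quarantined test is skipped), while a second job sets `RUN_QUARANTINED=1` to exercise the flaky tests without blocking the build. Test frameworks with tag or marker support (pytest markers, JUnit categories) make the same split more ergonomic.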
2. Automatically move them into separate runs.
Big companies like Facebook and Google automatically separate out flaky test failures. When a bot classifies a test as flaky, it quarantines it from the rest of the run, so your Continuous Integration (CI) process can complete without failing due to flaky tests. Developers still need to stay observant with this method: depending on how the system classifies flaky tests, real defects affecting users may be overlooked.
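One common classification signal (a simplified sketch, not any particular company’s implementation) is that a test has both passed and failed on the same revision: same code, different outcomes. All names below are illustrative:

```python
# Sketch: a minimal flaky-test classifier of the kind a quarantine bot might
# use. A test is treated as flaky when it has both passed and failed for the
# same commit.
from collections import defaultdict

def find_flaky_tests(history):
    """history: iterable of (test_name, commit_sha, passed) tuples."""
    outcomes = defaultdict(set)
    for test, sha, passed in history:
        outcomes[(test, sha)].add(passed)
    # Flaky: at least one commit where the test both passed and failed.
    return {test for (test, _), seen in outcomes.items() if len(seen) == 2}

history = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),   # same commit, different result
    ("test_search", "abc123", True),
    ("test_search", "def456", False),  # failed, but on different code
]
print(find_flaky_tests(history))  # {'test_login'}
```

Note how this heuristic embodies the risk described above: `test_search` genuinely broke between commits and is correctly left alone here, but a classifier tuned too aggressively could quarantine it and hide a real defect.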
3. Re-run the flaky tests.
Re-running tests that have failed is a strategy used by several companies. Some Google engineers have written scripts just to re-run large test suites on repeat until the tests all pass. While this can be effective, running tests takes time and resources. There are smart ways to re-run tests, and this is one of the problems we try to solve at Appsurify.
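The core of the re-run strategy fits in a few lines. This is a generic sketch (the simulated test is our own invention), and the key trade-off is visible in it: every retry costs another full test execution, which is why retrying only the failures already classified as flaky is smarter than retrying everything:

```python
# Sketch: re-run a failed test a bounded number of times before declaring a
# real failure.

def run_with_retries(test_fn, max_attempts=3):
    """Return True if the test passes within max_attempts runs."""
    for _ in range(max_attempts):
        if test_fn():
            return True
    return False

calls = {"n": 0}
def simulated_flaky_test():
    calls["n"] += 1
    return calls["n"] >= 3  # fails twice, then passes on the third run

result = run_with_retries(simulated_flaky_test)
print(result)  # True: passed on attempt 3
```

Off-the-shelf tooling exists for this pattern as well, such as the pytest-rerunfailures plugin, which retries failed tests via a `--reruns` option.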
A More Efficient Method
Appsurify TestBrain mitigates the effects of flaky tests by automating defect detection and quarantining flaky tests. We accomplish this in several ways:
- Automatically creating defects for each cause of test failure
- Allowing a defect to be classified, manually or automatically, as a “flaky defect”
- Prioritizing tests based on their likelihood of finding defects and their flakiness, and re-running only the tests that failed due to a flaky defect
- Running flaky tests while preventing flaky failures from causing your build to fail
With this process, our users still get value from their flaky tests, and we can monitor those tests for increases in flakiness, saving users valuable time and resources. Contact our team to learn more, or get started by trying TestBrain for free.