Testing is a crucial part of maintaining a code base, but not all tests validate what they’re testing for. Flaky tests—tests that fail sometimes but not always—are a universal problem, particularly in UI testing. In this blog post, we will discuss a new and simple approach we have taken to solve this problem. In particular, we found that a large fraction of most test code is setting up the conditions to test the actual business-logic we are interested in, and consequently a lot of the flakiness is due to errors in this setup phase. However, these errors don’t tell us anything about whether the primary test condition succeeded or failed, and so rather than marking the test as failing, we should mark it as “unable to reach test condition.”
We’ve operationalized this insight using 2 simple techniques:
- We developed a simple way to designate the relevant parts of a test as the actual business logic being tested.
- We modified our test framework behavior to treat failures outside these critical sections differently—as “unable to test,” rather than “failure”.
This has led to a significant reduction in flakiness, in turn reducing maintenance time and increasing code test coverage and developers’ trust in the testing process.
End-to-end tests are a powerful tool for verifying product correctness, by testing end-user functionality directly. Unlike unit tests (their better-known counterparts), which test individual components in isolation, E2E tests provide the closest approximation to the production environment, and their success is the best guarantee for sane functionality in the real world. E2E tests are relatively easy to write, as they require only knowledge of the product’s expected behavior and don’t require (almost) any knowledge of the implementation. Well-written E2E tests reflect and describe the application behavior better than any existing spec, and they allow us to undertake significant code refactoring by providing a safety net that the product continues to behave as expected.
At Dropbox, we use Selenium WebDriver for web E2E testing. Selenium sends commands to a web browser which simulate user interactions on a website. This allows the developer to verify a specific E2E functionality on the website. In Dropbox’s case, this includes actions such as adding, removing, or sharing a file; authentication flows; and user management. A typical test will describe a sequence of actions taken by a user in order to perform a certain operation, followed by a verification to ensure success. These actions can include navigating to pages, mouse clicks, or sending key strokes. Verifications are usually done by assertions of specific attributes of a web page that prove the success of the operation, such as success notifications or updates to the UI.
For example, let’s say we want to test whether a user can successfully share a file. An E2E test for that might specify a sequence of actions such as:
- Pick a file
- Click the “Share” button next to the file
- Specify the email address of a person to share with
- Click “Share”
The test would then check to see if the notification of a successful share was displayed.
(While the solution for flaky tests we will describe in this post can be applied to any testing framework, we will focus on Selenium tests, as that is where we have found the most use for it.)
Analyzing an E2E test
In order to better understand the problem of flaky tests and our solution for them, let’s look at a real E2E test: verifying that a Dropbox Team Admin can delete a Dropbox “group” (Dropbox teams can assign their users to groups).
[Animated GIF: screenshots of creating a group, adding a user to it, and then deleting the group]
The code to test this flow might look something like this:
def test_delete_group():
    # setup group and add member to it
    GROUP_NAME = 'test group'
    creator = setup_team()  # api operation; no screen interaction
    create_group_through_admin_console(creator, GROUP_NAME)
    team_member_email = get_other_team_member_email([creator.email])
    navigate_to_single_group_page(GROUP_NAME)
    single_group_page.add_member_by_email(team_member_email)

    # Delete the group
    single_group_page.delete_group()

    # Check that it no longer shows up
    group_row = get_group_row_info_by_name(GROUP_NAME)
    assert group_row is None
There’s a lot happening here, but most of it is setup: we’re creating a test team, creating a group within that team, adding members to it, and only then do we get to the main purpose of the test, which is deleting the group and checking that it got deleted. Errors anywhere prior to that last step don’t tell us anything about whether the delete_group() functionality is correct or not.
Furthermore, for this particular function (and many others that we would want to test), the amount of code executed during setup is much larger than the pieces being tested, and so if bugs were distributed evenly, we would expect failures of this test case to more likely be caused by irrelevant things rather than the delete functionality itself.
How do we deal with this issue, and in a general way?
The anatomy of a test
Let’s take a step back and think about tests in general. A test is an experiment that aims to validate proper functionality of the system by demonstrating that an expected outcome occurs when a particular factor is manipulated. You can think of it like this:
if A exists:
    Perform action X on A
    Verify that the output is O
In our case of test_delete_group:
- A = there is a group
- X = delete the group
- O = the group no longer shows up
Conventionally, the outputs of a test are either success or failure, based on whether any part of the test causes an error. But what happens if there’s an error even before we get to the test condition of "if A exists"?
Let’s look at an analogous situation: imagine an experiment to test if lightning transmits electricity. To do this, we’ll measure the current through a metal rod placed on top of a tall building during a storm. However, if lightning never strikes the rod, we cannot conclude either that lightning does or does not transmit electricity, since the conditions for the experiment weren’t satisfied.
We call this result fail to verify, leaving us with the following possibilities for the outcome of a test:
- success
- failure
- fail to verify
Adding semantics to tests
To implement this logic into our tests, we have to do two things:
- Designate “the relevant part” of the tests
- Modify the testing framework to use this designation to return our 3 different outcomes.
For the first step, we introduce a simple semantic addition to designate parts of a test as “under test.” In Python (our primary language), this can be implemented as a context manager which we call under_test. This is used for the critical sections of code and wraps raised exceptions as UnderTestFailure. Here’s how test_delete_group looks with this new code construct (the only new code is the with under_test() block):
def test_delete_group(self):
    # setup group and add member to it
    GROUP_NAME = 'test group'
    creator = setup_team()  # api operation; no screen interaction
    create_group_through_admin_console(creator, GROUP_NAME)
    team_member_email = get_other_team_member_email([creator.email])
    navigate_to_single_group_page(GROUP_NAME)
    single_group_page.add_member_by_email(team_member_email)

    with under_test():
        # Delete the group
        single_group_page.delete_group()

        # Check that it no longer shows up
        group_row = get_group_row_info_by_name(GROUP_NAME)
        assert group_row is None
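To make the construct concrete, here is a minimal sketch of what an under_test context manager could look like. The names under_test and UnderTestFailure come from the text above; everything else (the use of contextlib, passing already-wrapped errors through, keeping the original error as the cause) is an assumption for illustration and not necessarily how the real framework is written.

```python
from contextlib import contextmanager


class UnderTestFailure(Exception):
    """Signals that code inside an under_test() block failed."""


@contextmanager
def under_test():
    """Mark the enclosed statements as the behavior actually being tested."""
    try:
        yield
    except UnderTestFailure:
        # Assumption: an error already wrapped by a nested block passes through.
        raise
    except Exception as exc:
        # Wrap any other error, keeping the original one as __cause__.
        raise UnderTestFailure(str(exc)) from exc
```

Errors raised outside the block keep their original type, which is what lets a test runner tell the two kinds of failure apart.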
For the second step, let’s look at before and after versions of how our tests are evaluated.
Before:
try:
    run_test(test_function, *args, **kwargs)
except Exception as e:
    return Result.FAILURE
else:
    return Result.SUCCESS

After:
try:
    run_test(test_function, *args, **kwargs)
except Exception as e:
    if isinstance(e, UnderTestFailure):
        return Result.FAILURE
    else:
        return Result.FAIL_TO_VERIFY
else:
    return Result.SUCCESS
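The evaluation logic can be turned into a small self-contained function. This is only a sketch: the Result names mirror the fragment above, but the evaluate function, the enum values, and calling the test function directly (instead of going through run_test) are assumptions made so that the example runs on its own.

```python
from enum import Enum


class Result(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    FAIL_TO_VERIFY = "fail_to_verify"


class UnderTestFailure(Exception):
    """Stand-in for the exception raised from under_test() blocks."""


def evaluate(test_function, *args, **kwargs):
    """Run one test and classify its outcome into one of three results."""
    try:
        test_function(*args, **kwargs)
    except UnderTestFailure:
        # The failure happened inside a critical section: a real failure.
        return Result.FAILURE
    except Exception:
        # The failure happened during setup: the test condition was never reached.
        return Result.FAIL_TO_VERIFY
    return Result.SUCCESS


def passing_test():
    pass


def broken_setup_test():
    raise RuntimeError("could not create test team")


def broken_logic_test():
    raise UnderTestFailure("group still shows up")


print(evaluate(passing_test).value)
print(evaluate(broken_setup_test).value)
print(evaluate(broken_logic_test).value)
```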
Note that the changes in the test code are extremely minimal: the critical section has just been placed inside a with under_test() block, and the rest of the code remains the same. However, this has a big impact on failures. The original code had 7 significant lines of code in the test, of which the new version moved 2 into the critical section. If we assume failures are evenly distributed across lines of code, then 5/7 of the errors in the original code would have actually been irrelevant to the functionality we are testing. And in practice, some of the setup code (such as setup_team()) is far more complex, thus often resulting in an order of magnitude reduction in the number of failures that fall inside the critical section!
Success and failure scenarios
How does this simple change affect various scenarios? Let’s take a look at some common patterns:
# This is a test that passes.
# Test output: success
def test_pass(self):
    do_something()
    with under_test():
        assert True

# This test fails before validating business logic.
# Test output: fail_to_verify
def test_skip(self):
    assert False
    with under_test():
        do_something()

# This test fails while validating business logic.
# Test output: failure
def test_fail(self):
    with under_test():
        do_something()
        assert False

# This test fails while validating business logic (multiple under_test blocks)
# Test output: failure
def test_fail_2(self):
    do_some_setupstuff()
    with under_test():
        do_something()
    some_other_preliminary_stuff()
    with under_test():
        do_something()
        assert False
Real failures in skipped tests
Our methodology is very effective with flakiness but we’ve introduced the possibility of missing some real bugs. In particular, consider this example:
def test_skip(self):
    assert False  # real consistent bug
    with under_test():
        do_something()
The bug in the non-under_test() section will not be discovered by this test, since the test gets marked as fail to verify. But this is true only locally. When we consider the entire test suite, we would hope that another test would include the bug from this test inside an under_test() section, so that the bug is actually caught and eventually fixed. Thus, we must follow a new rule: every piece of code that is no longer inside the under_test() block must be covered with its own dedicated test where it is inside an under_test() block. In the test_delete_group example from above, the non-critical setup pieces such as team and group creation are, in fact, tested in other tests dedicated to those operations.
In order to best utilize tests, we run them as part of an automated deployment environment. Understanding the deployment process is crucial for understanding the environmental effects of flaky tests, and how the changes described above help remediate these effects.
Dropbox uses Continuous Integration (CI) across the organization. For each codebase, engineers commit code to the same mainline branch. Each commit (also called “build” at Dropbox) kicks off a suite of tests. Generally speaking, as a software organization grows, CI and well-written test suites are the first line of defense for automatically maintaining product quality. They document and enforce what the expected behavior of code is, which prevents one engineer (who may not know or regularly interact with everyone who commits code) from unknowingly mucking up another’s work—or from regressing their own features, for that matter.
We went into some detail on our CI system in previous blog posts: Accelerating Iteration Velocity on Dropbox’s Desktop Client, Part 1 and Part 2. Here we will briefly review a few pieces that are relevant to us right now.
Our test and build coordinator is a Dropbox open source project called Changes, which has an interface for each test suite that looks like this:
Each bar represents a commit, in reverse chronological order. The result could be totally passing (green) or have at least one test failure (red), with occasional system errors (black). The time it took to run the job is represented by the height of the bar. The test suite being run is quite extensive, and includes both unit tests and E2E tests. Thus, it runs on a separate cluster of machines, and currently takes tens of minutes to run.
At first, the workflow at Dropbox to add a new commit to the mainline branch was as follows:
- Ensure that the commit passes unit tests, which run on the developer’s machine.
- These run quite fast (under a few minutes, and often in a few seconds)
- Add the build to the main branch
- Changes would then run the full test suite on that build, eventually marking it as green or red.
However, we started to get cascading failures increasingly often with this system: notice the sequence of red builds on the left and right sides of the above screenshot. This happened because if one build had an error and thus failed the test suite, the next several would most likely fail as well, since in the time it took to run the full test suite, several other commits would have been added to the mainline branch, all of which include the failing code from the first build.
So we added an intermediate stage in the commit process: a “Commit Queue” (CQ). After passing unit tests, new commits now first have to go through the CQ, which runs the same suite of tests as on the main branch. Only builds that pass the CQ are submitted to the main branch, where they are again tested. This prevents cascading failures on the main branch, since each build has already been tested before being added. In the example above, the first bad build would have never been added to the mainline branch, since it would have been caught by the CQ. All subsequent builds would have gone through just fine (assuming they didn’t contain bugs of their own).
Flaky tests and the Commit Queue
Flakiness is the most common problem with E2E tests. What is so bad about flaky tests? Flaky tests are useless because they don’t provide a reliable signal on the particular code being tested. However, things get worse in the context of our Commit Queue (CQ), since a red build blocks engineers from landing code to the mainline, even if their code was fine and a flaky test falsely marked the build as bad. Excessive flakiness can cause engineers to start losing faith in the entire CI process and start pushing red builds anyway, in the hope that the build was red just due to flaky tests.
In part, flakiness is a logical price for trying to simulate a production environment that has a lot of nondeterministic variables, in contrast to unit tests that run in a sterile mocked environment. For example, in the production environment, a delay of a few seconds that occurs once per million operations might be tolerable. However, in our CI system, if we have 10,000 tests, each composed of 10 operations, this might result in a red build about 10% of the time. This is not specific to us; Google reports that 1.5% of all test runs report a “flaky” result.
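The 10% figure can be checked with a short calculation. The sketch below assumes that operations fail independently of each other and that a single failed operation is enough to turn the build red; neither assumption is stated in the text, and the function name is ours.

```python
def red_build_probability(failure_rate_per_operation, tests, operations_per_test):
    """Probability that at least one operation in the whole suite fails."""
    total_operations = tests * operations_per_test
    all_pass = (1 - failure_rate_per_operation) ** total_operations
    return 1 - all_pass


p = red_build_probability(1e-6, tests=10_000, operations_per_test=10)
print(f"{p:.1%}")
```

Under these assumptions the result is roughly 9.5%, which is where the "about 10%" comes from.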
In our CQ, we try to discover flaky tests by rerunning failing tests and seeing if they succeed on rerun after failure. In the old system, if a test failed on retry, we would mark the test (and build) as truly failed; whereas if it succeeded on retry, we would mark the test as flaky and wouldn’t include its results in evaluating the success of the build as a whole. We would then move the test into a quarantine, meaning it would not be evaluated on any future CQ builds until the test was fixed by its author.
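The old retry-and-quarantine policy can be sketched as follows. This is an illustration only: the Verdict enum, the function names, the single retry, and the use of a plain set as the quarantine are all assumptions, not a description of the actual CQ code.

```python
from enum import Enum


class Verdict(Enum):
    PASSED = "passed"
    FAILED = "failed"
    FLAKY = "flaky"


def _passes(test_function):
    """Return True if the test runs without raising."""
    try:
        test_function()
    except Exception:
        return False
    return True


def run_with_retry(test_function, quarantine):
    """Old policy: rerun a failing test once to separate real failures from flaky ones."""
    name = test_function.__name__
    if name in quarantine:
        return None  # quarantined tests are not evaluated at all
    if _passes(test_function):
        return Verdict.PASSED
    if _passes(test_function):
        # Failed, then passed on retry: flaky, so quarantine it.
        quarantine.add(name)
        return Verdict.FLAKY
    return Verdict.FAILED
```

Note how a quarantined test silently stops contributing any signal, which is the coverage loss described in the next paragraph.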
In practice, fixing flaky tests would often take quite a while, since by definition they contain issues that only surface occasionally. And for the entire duration of repair, our test coverage was reduced. Furthermore, we found a fairly high rate of these tests remaining flaky after a “fix”—25% by some internal estimates. Over time, the quarantine would grow quite large, as engineers struggled to fix flaky tests fast enough to keep up with the discovery of new flaky tests.
With the new under_test framework, only failures that are raised inside under_test() result in failure and block the commit queue. Failures outside the critical sections now return fail_to_verify and are skipped, meaning that they do not block the commit queue. There is no longer a quarantine; all future builds run all tests, including those that previously returned fail_to_verify. Of course, tests that frequently return this result are marked and investigated so they can be fixed permanently, but there is no longer any urgency to do so right away.
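As a rough sketch of this policy, a build is blocked only by a failure result, while fail_to_verify results are merely counted so that frequent offenders can be investigated later. The function names, the dictionary layout, and the 20% threshold are hypothetical choices for illustration.

```python
from collections import Counter


def build_is_blocked(results):
    """results maps test name -> 'success', 'failure', or 'fail_to_verify'."""
    return any(outcome == "failure" for outcome in results.values())


def frequent_fail_to_verify(history, threshold=0.2):
    """Return tests whose fail_to_verify rate over past builds exceeds threshold.

    history is a list of per-build results dictionaries; the threshold is an
    assumed value, not one taken from the text.
    """
    counts = Counter()
    for results in history:
        for name, outcome in results.items():
            if outcome == "fail_to_verify":
                counts[name] += 1
    return sorted(name for name, n in counts.items() if n / len(history) > threshold)
```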
Quantifying errors in the Commit Queue
What happens when we run an entire test suite? Let’s say we have 10,000 tests, 0.1% of which fail due to real bugs in the code. In addition, let’s assume a 1.5% rate of flakiness in the other tests. Due to the time it takes to fix flaky tests and their cascading failures in the old approach, we might have as many as 10% of tests in quarantine at one time. Finally, let’s assume that only 10% of test code is inside an under_test() critical section.
Let’s see what happens when we run this test suite through both old and new approaches:
| | old | new |
|---|---|---|
| Tests in quarantine | 1,000 | 0 |
| Tests resulting in failure (real bugs) | 9,000 x 0.001 = 9 | 10,000 x 0.001 = 10 |
| Tests resulting in failure (flakiness) | 9,000 x 0.015 = 135 | 10,000 x (0.015 x 0.1) = 15 |
| Tests resulting in fail_to_verify | 0 | 10,000 x (0.015 x 0.9) = 135 |
| Tests resulting in success | 9,000 - 9 - 135 = 8,856 | 10,000 - 10 - 15 - 135 = 9,840 |
Notice that with the new approach, we not only reduced the number of failures due to flakiness (from 135 to 15), we also increased both our test coverage in successful cases (from 8,856 to 9,840) and the number of real bugs caught!
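The numbers in this comparison can be reproduced with a few lines of arithmetic. The sketch below uses the same rates as the example (0.1% real bugs, 1.5% flakiness, 10% quarantined, 10% of code under test); the function and key names are ours.

```python
def old_approach(total, bug_rate, flaky_rate, quarantined_fraction):
    """Quarantined tests are not run, so the rates apply to the remaining tests."""
    quarantined = round(total * quarantined_fraction)
    active = total - quarantined
    real_failures = round(active * bug_rate)
    flaky_failures = round(active * flaky_rate)
    return {
        "quarantined": quarantined,
        "failure_real": real_failures,
        "failure_flaky": flaky_failures,
        "fail_to_verify": 0,
        "success": active - real_failures - flaky_failures,
    }


def new_approach(total, bug_rate, flaky_rate, under_test_fraction):
    """All tests run; only flakiness inside under_test() counts as failure."""
    real_failures = round(total * bug_rate)
    flaky_failures = round(total * flaky_rate * under_test_fraction)
    fail_to_verify = round(total * flaky_rate * (1 - under_test_fraction))
    return {
        "quarantined": 0,
        "failure_real": real_failures,
        "failure_flaky": flaky_failures,
        "fail_to_verify": fail_to_verify,
        "success": total - real_failures - flaky_failures - fail_to_verify,
    }


old = old_approach(10_000, 0.001, 0.015, quarantined_fraction=0.1)
new = new_approach(10_000, 0.001, 0.015, under_test_fraction=0.1)
print(old)
print(new)
```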
By introducing a framework of fewer than 20 lines of code, we expanded our testing outcomes to include a new fail_to_verify result. We could then remove the “quarantine” for flaky tests from our continuous integration system, resulting in an improvement in all test metrics. In particular, we reduced flakiness in our test suite by more than 90%, transforming it from a lethal disease into a chronic but treatable condition. We hope this approach will prove useful to others.