Mutation Testing: A Tale of Two Suites


In January of 2020 Etsy engineering embarked upon a highly anticipated initiative. For years our frontend engineers had been using a JavaScript test framework that was developed in house. It used Jasmine for assertions and syntax, PhantomJS for the test environment, and a custom, built-from-scratch test runner written in PHP. This setup no longer served our needs for a multitude of reasons.

It was time to reach for an industry standard tool with a strong community and a feature list JavaScript developers have come to expect. We settled on Jest because it met all of those criteria and naturally complemented the areas of our site that are built with React. Within a few months we had all of the necessary groundwork in place to begin a large-scale effort of migrating our legacy tests. This raised an important question: would our test suite be as effective at catching regressions if it were run by Jest as opposed to our legacy test runner?

At first this seemed like a simple question. Our legacy test said:

expect(a).toEqual(b);

And the migrated Jest test said:

expect(a).toEqual(b);

So weren’t they doing the same thing? Maybe. What if one checked shallow equality and the other checked deep equality? And what about assertions that had no counterparts in Jest? Our legacy suite relied on jasmine.ajax and jasmine-jquery, and we would need to find alternatives for both modules when migrating our tests. All of this opened the door for subtle variations to creep in and make the difference between catching and missing a bug. We could have spent our time poring over the source code of Jasmine and Jest to figure out if these differences really existed, but instead we decided to use Mutation Testing to find them for us.
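To make the concern concrete, here is a minimal sketch of how shallow and deep equality can disagree about the same pair of values. The shallowEqual and deepEqual helpers are hypothetical and exist only for illustration; they are not how Jasmine or Jest implement toEqual.

```javascript
// Hypothetical helpers for illustration only; neither Jasmine nor Jest is
// implemented this way.

// Shallow: compares own top-level properties with ===.
function shallowEqual(a, b) {
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  return keysA.length === keysB.length && keysA.every((key) => a[key] === b[key]);
}

// Deep: recurses into nested objects and arrays.
function deepEqual(a, b) {
  if (a === b) return true;
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) {
    return false;
  }
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  return (
    keysA.length === keysB.length &&
    keysA.every(
      (key) => Object.prototype.hasOwnProperty.call(b, key) && deepEqual(a[key], b[key])
    )
  );
}

const expected = { id: 1, tags: ['vintage'] };
const actual = { id: 1, tags: ['vintage'] };

console.log(shallowEqual(actual, expected)); // false: the two `tags` arrays are different objects
console.log(deepEqual(actual, expected)); // true: the nested contents match
```

If two frameworks disagreed in this way, the same assertion could pass under one and fail under the other without a single character of the test changing.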

What is Mutation Testing?

Mutation Testing allows developers to score their test suite based on how many potential bugs it can catch. Since we were testing our JavaScript suite we reached for Stryker, which works roughly the same as any other Mutation Testing framework. Stryker analyzes our source code, makes any number of copies of it, then mutates those copies by programmatically inserting bugs into them. It then runs our unit tests against each “mutant” and sees if the suite fails or passes. If all tests pass, then the mutant has survived. If one or more tests fail, then the mutant has been killed. The more mutants that are killed, the more confidence we have that the suite will catch regressions in our code. After testing all of these potential mutations, Stryker generates a score by dividing the number of mutants that were killed by the total number generated:

[Figure: Output from running Stryker on a single file]

Stryker’s default reporter even displays how it generated the mutants that survived so it’s easy to identify the gaps in the suite. In this case, two Conditional Expression mutants and a Logical Operator mutant survived. All together, Stryker supports roughly thirty possible mutation types, but that list can be whittled down for faster test runs.
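The scoring idea can be sketched in a few lines of plain JavaScript. Everything below is a made-up illustration: the function under test, the three hand-written mutants, and the tiny suite are assumptions, and a real tool such as Stryker generates its mutants automatically from the source.

```javascript
// Code under test (illustrative, not from Etsy's codebase).
function isEligibleForFreeShipping(total, isMember) {
  return total >= 35 && isMember;
}

// Hand-written "mutants", each mimicking one kind of change a mutation tool might make.
const mutants = [
  { name: 'Equality Operator (>= becomes >)', fn: (total, isMember) => total > 35 && isMember },
  { name: 'Logical Operator (&& becomes ||)', fn: (total, isMember) => total >= 35 || isMember },
  { name: 'Conditional Expression (always true)', fn: () => true },
];

// A small test suite: returns true when every assertion holds.
function suitePasses(fn) {
  return fn(50, true) === true && fn(10, false) === false;
}

// A mutant is "killed" when the suite fails against it and "survives" otherwise.
function scoreSuite(original, candidates) {
  if (!suitePasses(original)) throw new Error('Suite must pass on unmutated code');
  const results = candidates.map((m) => ({ name: m.name, killed: !suitePasses(m.fn) }));
  const killed = results.filter((r) => r.killed).length;
  return { results, score: (killed / candidates.length) * 100 };
}

const report = scoreSuite(isEligibleForFreeShipping, mutants);
report.results.forEach((r) => console.log(`${r.killed ? 'killed' : 'survived'}: ${r.name}`));
console.log(`Mutation score: ${report.score.toFixed(2)}%`); // 33.33% (1 of 3 killed)
```

In this sketch the first two mutants survive because the suite never checks the boundary value or a non-member with a large order. Adding assertions for fn(35, true) and fn(50, false) would kill both, which is exactly the kind of gap a surviving mutant points to.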

The Experiment

Since our hypothesis was that the implementation differences between Jasmine and Jest could affect the Mutation Score of our legacy and new test suites, we began by cataloging every bit of Jasmine-specific syntax in our legacy suite. We then compiled a list of roughly forty test files that we would target for Mutation Testing in order to cover the full syntax catalog. For each file we generated a Mutation Score for its legacy state, converted it to run in our new Jest setup, and generated a Mutation Score again. Our hope was that the new Jest framework would have a Mutation Score as good as or better than our legacy framework.
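Scoping a run to a handful of files, and optionally to fewer mutation types, might look something like the configuration sketch below. This is not Etsy's actual configuration: the file paths are placeholders, and the option names (mutate, mutator.excludedMutations, testRunner, reporters) follow recent Stryker releases and may differ in the version used at the time.

```javascript
// stryker.conf.js: a hypothetical configuration sketch.
// File paths are placeholders; option names assume a recent Stryker release.
const config = {
  // Assumes the Jest runner plugin for Stryker is installed.
  testRunner: 'jest',
  // Only mutate the source files under study rather than the whole codebase.
  mutate: ['src/cart/total.js', 'src/listing/price.js'],
  mutator: {
    // Skip mutation types that are not of interest to shorten the run.
    excludedMutations: ['ObjectLiteral', 'ArrayDeclaration'],
  },
  reporters: ['clear-text', 'progress'],
};

module.exports = config;
```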

By limiting the scope of our test to just a few dozen files, we were able to run all mutations Stryker had to offer within a reasonable timeframe. However, the sheer size of our codebase and the sprawling dependency trees in any given feature presented other challenges to this work. As I mentioned before, Stryker copies the source code to be mutated into separate sandbox directories. By default, it copies the entire project into each sandbox, but that was too much for Node.js to handle in our repository:

[Figure: Error when opening too many files]

Stryker allows users to configure an array of files to copy over instead of the entire codebase, but doing so would require us to know the full dependency tree of each file that we hoped to test ahead of time. Instead of figuring that out by hand, we wrote a custom Jest file resolver specifically for our Stryker testing environment. It would attempt to resolve source files from the local directory structure, but it wouldn’t fail immediately if they weren’t found. Instead, our new resolver would reach outside of the Stryker sandbox to find the file in the original directory structure, copy it into the sandbox, and re-initiate the resolution process. This method saved us time for files that had very expansive dependency trees. With that in hand, we pressed forth with our experiment.
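The resolver itself is not shown in this post, so the following is only a sketch of the fallback idea under stated assumptions. A custom Jest resolver is a function that receives the requested module path and an options object exposing defaultResolver and basedir. Everything else here (createFallbackResolver, sandboxRoot, originalRoot, and the restriction to relative requests) is hypothetical.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: builds a Jest-style resolver that falls back to the
// original checkout when a file is missing from the sandbox.
function createFallbackResolver({ sandboxRoot, originalRoot, extensions = ['', '.js'] }) {
  return function resolve(request, options) {
    try {
      // First attempt: normal resolution inside the sandbox.
      return options.defaultResolver(request, options);
    } catch (err) {
      // This sketch only handles relative requests such as './dep'.
      if (!request.startsWith('.')) throw err;

      // Work out where the file would live relative to the sandbox root.
      const wanted = path.resolve(options.basedir, request);
      const relative = path.relative(sandboxRoot, wanted);
      if (relative.startsWith('..') || path.isAbsolute(relative)) throw err;

      for (const ext of extensions) {
        const source = path.join(originalRoot, relative + ext);
        if (fs.existsSync(source) && fs.statSync(source).isFile()) {
          // Copy the missing file into the sandbox, then resolve again.
          const target = path.join(sandboxRoot, relative + ext);
          fs.mkdirSync(path.dirname(target), { recursive: true });
          fs.copyFileSync(source, target);
          return options.defaultResolver(request, options);
        }
      }
      // Not found in the original checkout either: surface the first error.
      throw err;
    }
  };
}
```

A resolver module would export the function returned by createFallbackResolver, and the Jest resolver option would point at that module. A production version would also need to handle non-relative imports, aliases, and whatever extensions the codebase uses.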

The Result

Ultimately we found that our new Jest framework had a worse Mutation Score than our legacy framework.

…Wait, What?

It’s true. On average, tests run by our legacy framework received a Mutation Score of 55.28%, whereas tests run by our new Jest framework received 54.35%. In one of our worst cases, the legacy test earned 35% while the Jest test picked up a measly 16%.

Analyzing The Result

Once we began seeing lower Mutation Scores on a multitude of files, we put a hold on the migration to investigate what sort of mutants were slipping past our new suite. It turned out that most of what our new Jest suite failed to catch were String Literal mutations in our Asynchronous Module Definitions:

[Figure: Mutant generated by replacing a dependency definition with an empty string]

We dug into these failures further and discovered that the real culprit was how the different test runners compiled our code. Our legacy test runner was custom built to handle Etsy’s unique codebase and was tightly coupled to the rest of our infrastructure. When we kicked off tests it would locate all relevant source and test files, run them through our actual webpack build process, then load the resulting code into PhantomJS to execute. When webpack encountered empty strings in the dependency definition it would throw an error and halt the test, effectively catching the bug even if there were no tests that actually relied on that dependency.

Jest, on the other hand, was able to bypass our build system using its file resolver and a handful of custom mappings and transformers. This was one of the big draws of the migration in the first place; decoupling the tests from the build process meant they could execute in a fraction of the time. However, the module we used in Jest to manage dependencies was much more lenient than our actual build system, and empty strings were simply ignored. This meant that unless a test actually relied on the dependency, our Jest setup had no way to alert the tester if it was accidentally left out. Ultimately we decided that this sort of bug was acceptable to let slide. While it would no longer be caught during the testing phase, the code would still be rejected by the build phase of our CI pipeline, thereby preventing the bug from reaching Production.
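The difference between the two behaviors can be sketched with a toy module loader. This is an illustration only: createLoader is a hypothetical stand-in, not webpack and not the module used in the Jest setup, and the cart and currency modules are invented for the example.

```javascript
// Toy module loader for illustration; it is neither webpack nor the module
// used with Jest.
function createLoader({ strict }) {
  const registry = new Map();
  function define(name, deps, factory) {
    const resolved = deps.map((dep) => {
      if (registry.has(dep)) return registry.get(dep);
      // Lenient mode quietly ignores an empty dependency string.
      if (dep === '' && !strict) return undefined;
      throw new Error(`Cannot resolve dependency "${dep}" required by "${name}"`);
    });
    registry.set(name, factory(...resolved));
  }
  return { define, get: (name) => registry.get(name) };
}

// Registers a `cart` module; `currencyDep` is 'currency' normally and '' in the mutant.
function loadCart(loader, currencyDep) {
  loader.define('currency', [], () => ({ format: (n) => `$${n.toFixed(2)}` }));
  loader.define('cart', [currencyDep], (currency) => ({
    count: (items) => items.length,
    total: (items) => currency.format(items.reduce((sum, item) => sum + item.price, 0)),
  }));
  return loader.get('cart');
}

// The only test never calls `total`, so it never touches the `currency` dependency.
function testPasses(loader, currencyDep) {
  try {
    return loadCart(loader, currencyDep).count([{ price: 5 }]) === 1;
  } catch (err) {
    return false; // a load-time error fails the run and kills the mutant
  }
}

console.log(testPasses(createLoader({ strict: true }), 'currency')); // true
console.log(testPasses(createLoader({ strict: false }), 'currency')); // true
console.log(testPasses(createLoader({ strict: true }), '')); // false: mutant killed
console.log(testPasses(createLoader({ strict: false }), '')); // true: mutant survives
```

The strict loader kills the mutant without any test exercising the dependency, which is how a build-backed runner can earn a higher Mutation Score from an identical set of assertions.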

As we proceeded with the migration we encountered a handful of other cases where the Mutation Scores were markedly different, one of which is particularly notable. We happened upon an asynchronous test that used a done() callback to signify when the test should exit. The test was malformed in that there were two done() callbacks with assertions between them. In Jest this was no big deal; it happily executed the additional assertions before ending the test. Jasmine was much stricter, though. It stopped the test immediately when it encountered the first callback. As a result, we saw a significant jump in Mutation Score once the test ran under Jest, because mutants were suddenly being caught by the dangling assertions. This validated our suspicion that implementation differences between Jasmine and Jest could affect which bugs were caught and which slipped through.
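A toy model makes the effect easier to see. The runner below is a hypothetical simplification written for this example and does not reflect how Jasmine or Jest are implemented; the stopAtFirstDone flag simply switches between the two behaviors described above.

```javascript
// Toy runner for illustration; this is not how Jasmine or Jest are implemented.
function runTest(testFn, { stopAtFirstDone }) {
  let finished = false;
  const failures = [];
  const expectEqual = (actual, expected) => {
    // A runner that ends the test at the first done() never sees later assertions.
    if (finished && stopAtFirstDone) return;
    if (actual !== expected) failures.push(`expected ${expected} but got ${actual}`);
  };
  const done = () => {
    finished = true;
  };
  testFn(expectEqual, done);
  return failures.length === 0 ? 'pass' : 'fail';
}

// Malformed test: two done() calls with an assertion stranded between them.
const makeTest = (applyDiscount) => (expectEqual, done) => {
  expectEqual(applyDiscount(100, 0), 100);
  done();
  expectEqual(applyDiscount(100, 10), 90); // dangling assertion
  done();
};

const original = (price, percent) => price - (price * percent) / 100;
// Arithmetic Operator mutant: the subtraction becomes an addition.
const mutant = (price, percent) => price + (price * percent) / 100;

console.log(runTest(makeTest(mutant), { stopAtFirstDone: true })); // pass: mutant survives
console.log(runTest(makeTest(mutant), { stopAtFirstDone: false })); // fail: mutant killed
```

The mutant only misbehaves for a non-zero discount, so it is caught only when the assertion after the first done() is actually evaluated.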

The Future of Mutation Testing at Etsy

Over the course of this experiment we learned a ton about our testing frameworks and Mutation Testing in general. Stryker generated more than 3,800 mutants for the forty or so files that were tested, which equates to roughly ninety-five test runs per file. In the interest of transparency, that number is likely to be artificially low, as we ruled out some of the files we had initially identified for testing when we realized they generated many hundreds of mutants. If we assume our calculated average is indicative of all files and account for how long it takes to run our entire Jest suite, then we can estimate that a single-threaded, full Mutation Test of our entire JavaScript codebase would take about five and a half years to complete. Granted, Stryker parallelizes test runs out of the box, and we could potentially see even more performance gains using Jest’s findRelatedTests feature to narrow down which tests are run based on which file was mutated. Even so, it’s difficult to imagine running a full Mutation Test on any regular cadence.
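The shape of that back-of-the-envelope estimate can be sketched as follows. Only the per-file average comes from the figures above; the file count and suite duration passed in at the bottom are placeholders chosen for illustration and are not Etsy's real numbers, so the printed result is not a reproduction of the five-and-a-half-year figure.

```javascript
// Average mutants per file from the figures in this post:
// more than 3,800 mutants over roughly 40 files.
const mutantsPerFile = 3800 / 40; // 95

// Rough single-threaded estimate: every mutant triggers one full suite run.
function estimateYears({ sourceFiles, mutantsPerFile, suiteMinutes }) {
  const totalRuns = sourceFiles * mutantsPerFile;
  const totalMinutes = totalRuns * suiteMinutes;
  return totalMinutes / (60 * 24 * 365);
}

// Placeholder inputs, NOT Etsy's real numbers.
const years = estimateYears({ sourceFiles: 1000, mutantsPerFile, suiteMinutes: 10 });
console.log(`${years.toFixed(2)} years`);
```

Because the estimate is a simple product, halving either the number of mutants or the time per run halves the total, which is why parallel runs and running only related tests matter so much.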

While it may not be feasible for Etsy to test every possible mutant in the codebase, we can still gain insights about our testing practices by applying Mutation Testing at a more granular level. A manageable approach would be to generate a Mutation Score automatically any time a pull request is opened, and focus the testing on only the files that changed. Posting that information on the pull request could help us understand what conditions will cause our unit tests to fail. It’s easy to write an overly lenient test that will pass no matter what, and in some ways that’s more dangerous than having no test at all. If we only look at Code Coverage, such a test boosts our numbers, giving us a false sense of security that bugs will be caught. Mutation Score forces us to confront the limitations of our suite and encourages us to test as effectively as possible.
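One way the pull request idea could be wired up is sketched below, as an assumption rather than a description of anything Etsy has built. The helper names are hypothetical, the test-file naming convention is a guess, and the sketch relies on the Stryker CLI's --mutate option accepting a comma-separated list of files.

```javascript
// Hypothetical helper: pick the files from a pull request that are worth mutating.
// Assumes test files are named *.test.js or *.spec.js.
function selectFilesToMutate(changedFiles) {
  return changedFiles.filter(
    (file) => /\.(js|jsx)$/.test(file) && !/\.(test|spec)\.(js|jsx)$/.test(file)
  );
}

// Builds the arguments for a Stryker run limited to those files.
// Assumes the Stryker CLI's `--mutate` option, which takes a comma-separated list.
function buildStrykerArgs(changedFiles) {
  const targets = selectFilesToMutate(changedFiles);
  return targets.length === 0 ? null : ['stryker', 'run', '--mutate', targets.join(',')];
}

// Example input, e.g. the output of `git diff --name-only` against the base branch.
const changed = ['src/cart/total.js', 'src/cart/total.test.js', 'README.md'];
console.log(buildStrykerArgs(changed));
// [ 'stryker', 'run', '--mutate', 'src/cart/total.js' ]
```

Returning null when no source files changed lets a CI step skip the mutation run entirely for documentation-only or test-only pull requests.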


