At first this seemed like a simple question. Our legacy test said:
And the migrated Jest test said:
So weren’t they doing the same thing? Maybe. What if one checked shallow equality and the other checked deep equality? And what about assertions that had no corollaries in Jest? Our legacy suite relied on jasmine.ajax and jasmine-jquery, and we would need to propose alternatives for both modules when migrating our tests. All of this opened the door for subtle variations to creep in and make the difference between catching and missing a bug. We could have spent our time poring through the source code of Jasmine and Jest to figure out if these differences really existed, but instead we decided to use Mutation Testing to find them for us.
What is Mutation Testing?
Stryker’s default reporter even displays how it generated the mutants that survived so it’s easy to identify the gaps in the suite. In this case, two Conditional Expression mutants and a Logical Operator mutant survived. All together, Stryker supports roughly thirty possible mutation types, but that list can be whittled down for faster test runs.
Since our hypothesis was that the implementation differences between Jasmine and Jest could affect the Mutation Score of our legacy and new test suites, we began by cataloging every bit of Jasmine-specific syntax in our legacy suite. We then compiled a list of roughly forty test files that we would target for Mutation Testing in order to cover the full syntax catalog. For each file we generated a Mutation Score for its legacy state, converted it to run in our new Jest setup, and generated a Mutation Score again. Our hope was that the new Jest framework would have a Mutation Score as good as or better than our legacy framework.
By limiting the scope of our test to just a few dozen files, we were able to run all mutations Stryker had to offer within a reasonable timeframe. However, the sheer size of our codebase and the sprawling dependency trees in any given feature presented other challenges to this work. As I mentioned before, Stryker copies the source code to be mutated into separate sandbox directories. By default, it copies the entire project into each sandbox, but that was too much for Node.js to handle in our repository:
Stryker allows users to configure an array of files to copy over instead of the entire codebase, but doing so would require us to know the full dependency tree of each file that we hoped to test ahead of time. Instead of figuring that out by hand, we wrote a custom Jest file resolver specifically for our Stryker testing environment. It would attempt to resolve source files from the local directory structure, but it wouldn’t fail immediately if they weren’t found. Instead, our new resolver would reach outside of the Stryker sandbox to find the file in the original directory structure, copy it into the sandbox, and re-initiate the resolution process. This method saved us time for files that had very expansive dependency trees. With that in hand, we pressed forth with our experiment.
Ultimately we found that our new Jest framework had a worse Mutation Score than our legacy framework.
It’s true. On average, tests run by our legacy framework received a 55.28% Mutation Score whereas tests run by our new Jest framework received a 54.35%. In one of our worst cases, the legacy test earned a 35% while the Jest test picked up a measly 16%.
Analyzing The Result
Once we began seeing lower Mutation Scores on a multitude of files, we put a hold on the migration to investigate what sort of mutants were slipping past our new suite. It turned out that most of what our new Jest suite failed to catch were String Literal mutations in our Asynchronous Module Definitions:
We dug into these failures further and discovered that the real culprit was how the different test runners compiled our code. Our legacy test runner was custom built to handle Etsy’s unique codebase and was tightly coupled to the rest of our infrastructure. When we kicked off tests it would locate all relevant source and test files, run them through our actual webpack build process, then load the resulting code into PhantomJS to execute. When webpack encountered empty strings in the dependency definition it would throw an error and halt the test, effectively catching the bug even if there were no tests that actually relied on that dependency.
Jest, on the other hand, was able to bypass our build system using its file resolver and a handful of custom mappings and transformers. This was one of the big draws of the migration in the first place; decoupling the tests from the build process meant they could execute in a fraction of the time. However, the module we used in Jest to manage dependencies was much more lenient than our actual build system, and empty strings were simply ignored. This meant that unless a test actually relied on the dependency, our Jest setup had no way to alert the tester if it was accidentally left out. Ultimately we decided that this sort of bug was acceptable to let slide. While it would no longer be caught during the testing phase, the code would still be rejected by the build phase of our CI pipeline, thereby preventing the bug from reaching Production.
As we proceeded with the migration we encountered a handful of other cases where the Mutation Scores were markedly different, one of which is particularly notable. We happened upon an asynchronous test that used a
done() callback to signify when the test should exit. The test was malformed in that there were two
done() callbacks with assertions between them. In Jest this was no big deal; it happily executed the additional assertions before ending the test. Jasmine was much more strict though. It stopped the test immediately when it encountered the first callback. As a result, we saw a significant jump in Mutation Score because mutants were suddenly being caught by the dangling assertions. This validated our suspicion that implementation differences between Jasmine and Jest could affect which bugs were caught and which slipped through.
The Future of Mutation Testing at Etsy
While it may not be feasible for Etsy to test every possible mutant in the codebase, we can still gain insights about our testing practices by applying Mutation Testing at a more granular level. A manageable approach would be to generate a Mutation Score automatically any time a pull request is opened, and focus the testing on only the files that changed. Posting that information on the pull request could help us understand what conditions will cause our unit tests to fail. It’s easy to write an overly-lenient test that will pass no matter what, and in some ways that’s more dangerous than having no test at all. If we only look at Code Coverage, such a test boosts our numbers giving us a false sense of security that bugs will be caught. Mutation Score forces us to confront the limitations of our suite and encourages us to test as effectively as possible.