LinkedOut: A Request-Level Failure Injection Framework

More recently, though, another mechanism was added to inject failures into requests, via the invocation context (IC). The IC is a LinkedIn-specific, internal component of the framework that allows keys and values to be passed into requests and propagated to all of the services involved in handling them. We built a new schema for disruption data that can be passed down through the IC, so that failures take effect immediately for that request.
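
To make the idea of disruption data carried as IC keys and values more concrete, here is a minimal sketch. The actual schema is internal and not described here, so the `Disruption` fields, the failure kinds, and the `IC_KEY` name are all assumptions made for illustration, and a plain dictionary stands in for the IC.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, Optional

# Hypothetical key name; the real IC key is not given in this article.
IC_KEY = "disruption"


@dataclass(frozen=True)
class Disruption:
    """Illustrative disruption record; field names are assumptions."""
    downstream: str      # name of the downstream call to disrupt
    kind: str            # assumed kinds: "error", "delay", "timeout"
    latency_ms: int = 0  # only meaningful for a "delay" in this sketch


def to_ic_entry(disruption: Disruption) -> Dict[str, str]:
    """Serialize a disruption into a string key/value pair for the IC."""
    return {IC_KEY: json.dumps(asdict(disruption), sort_keys=True)}


def from_ic(ic: Dict[str, str]) -> Optional[Disruption]:
    """Read the disruption back out of an IC-like dict, if one is present."""
    raw = ic.get(IC_KEY)
    if raw is None:
        return None
    return Disruption(**json.loads(raw))


def applies_to(ic: Dict[str, str], downstream: str) -> bool:
    """True if this request carries a disruption aimed at `downstream`."""
    disruption = from_ic(ic)
    return disruption is not None and disruption.downstream == downstream
```

A service handling the request could call something like `applies_to(ic, "profile-service")` just before making a downstream call and fail, delay, or time out that call accordingly.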

The LiX and IC methods are handy ways to trigger failures, but how do we get actual people inside the company to use them? The answer is building easy-to-use user interfaces that make it simple for anyone at LinkedIn to test the resiliency of their services.

Web application

Developed using our internal Flyer (Flask + Ember) framework and designed using the Art Deco patterns, our LinkedOut web application makes it easy to perform failure tests at a larger scale. It provides two main modes of operation: automated testing and feature targeting-based ramping.

Automated testing
As mentioned before, a single page on LinkedIn can depend on several downstreams in order to return the proper data to the member. Code changes quickly at LinkedIn, which can change both services’ dependency graphs and their ability to handle downstream failures, so we knew we needed a way to run failure tests automatically. However, there were several questions we had to ask ourselves in designing this feature:

  • Which user will be making the requests? This is especially important when considering access to paid features on LinkedIn (such as Sales Navigator), if we want to be able to failure test everything.

  • How do we run these failure tests at scale? Due to the number of downstream services involved in a given LinkedIn page, testing one at a time would take hours.

  • How do we determine success in an automated failure test? We have several frontends at LinkedIn where a 200 response code doesn’t necessarily denote total success, so we needed a different way to determine whether a page is gracefully degrading.

  • What is the most effective way to convey automated test results to the user? Users would probably be overwhelmed by raw failure data for every endpoint involved in a request, so we needed a better way to present it.

These questions, and the corresponding answers, led us to our current implementation of automated failure testing. We created a service account (not associated with a real member) and gave it access to all of our products. This way, we could be confident that engineers could run tests on almost every part of the LinkedIn experience.

As for running at scale, we devised a two-fold solution. We first needed to scale the automated testing across our LinkedOut web application hosts, for which we leveraged the Celery distributed task queue framework for Python. Using a Redis broker, we’re able to create tasks for testing each downstream (based on call tree data) and then distribute them evenly across the workers on our hosts.
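
The fan-out step (one task per downstream, spread evenly across workers) can be sketched without any infrastructure. In the real system Celery and its Redis broker handle the distribution; the round-robin `distribute` function below is only a stand-in for that, and the call tree shape (`name` and `calls` keys) and the task fields are assumptions made for this example.

```python
from itertools import cycle
from typing import Dict, List


def downstreams(call_tree: dict) -> List[str]:
    """Collect unique downstream names from a nested call tree, depth-first."""
    seen: List[str] = []

    def walk(node: dict) -> None:
        for child in node.get("calls", []):
            if child["name"] not in seen:
                seen.append(child["name"])
            walk(child)

    walk(call_tree)
    return seen


def build_tasks(url: str, call_tree: dict) -> List[dict]:
    """Create one failure-test task per downstream found in the call tree."""
    return [{"url": url, "downstream": name} for name in downstreams(call_tree)]


def distribute(tasks: List[dict], workers: List[str]) -> Dict[str, List[dict]]:
    """Assign tasks to workers round-robin (a stand-in for the task queue)."""
    assignment: Dict[str, List[dict]] = {worker: [] for worker in workers}
    for task, worker in zip(tasks, cycle(workers)):
        assignment[worker].append(task)
    return assignment
```

With a real task queue, each dictionary produced by `build_tasks` would instead become the arguments of one queued task, and the broker would decide which worker picks it up.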

For the actual testing of the pages, we leverage an internal framework at LinkedIn that allows for Selenium testing at scale. You can use a traditional Selenium WebDriver and point it at this framework’s control host, and it’ll run your commands on a remote host running your desired browser. We send commands to inject the disruption info into the invocation context via a cookie (which only functions on our internal network), authenticate the user, and then load the URL defined in the test.
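
The command sequence described above (set the disruption cookie, authenticate, load the test URL) might look roughly like the sketch below. It relies only on the standard WebDriver methods `get` and `add_cookie` and the `page_source` property, so any Selenium driver, local or remote, could be passed in. The cookie name, the payload encoding, and the `login` callback are assumptions; the real cookie format is internal.

```python
import json
from typing import Callable
from urllib.parse import quote, urlsplit

# Hypothetical cookie name; the real one is not given in this article.
COOKIE_NAME = "disruption"


def run_failure_test(driver, test_url: str, disruption: dict,
                     login: Callable[[object], None]) -> str:
    """Inject a disruption via cookie, log in, load the page, return its HTML.

    `driver` is any object exposing the WebDriver methods `get` and
    `add_cookie` and the `page_source` attribute.
    """
    parts = urlsplit(test_url)
    origin = f"{parts.scheme}://{parts.netloc}/"

    # WebDriver only sets cookies for the domain of the current page,
    # so visit the site's origin before adding the cookie.
    driver.get(origin)
    driver.add_cookie({
        "name": COOKIE_NAME,
        # URL-encode the JSON so the value is safe to carry in a cookie.
        "value": quote(json.dumps(disruption, sort_keys=True)),
    })

    login(driver)          # caller-supplied authentication steps
    driver.get(test_url)   # load the page under test with the failure injected
    return driver.page_source
```

Taking the driver and the login steps as parameters keeps the sequence testable with a stub driver and independent of where the browser actually runs.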

We considered a few ways to determine success after injecting failures (user-contributed DOM elements to look for, etc.), but, for our first iteration, we decided to simply provide default matchers for “oops” pages and blank pages. If the page loaded by Selenium matches one of these default patterns, we consider it to have failed to degrade gracefully. We definitely want to make this more extensible in the future, so that users can define how their pages should look when they load successfully.
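
A minimal sketch of what default matchers for blank and “oops” pages could look like is shown below. The text patterns and the regex-based tag stripping are assumptions chosen to keep the example short; a production matcher would likely inspect the rendered DOM instead.

```python
import re
from typing import Callable, Dict, List, Tuple

SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1>", re.I | re.S)
TAG_RE = re.compile(r"<[^>]+>")
# Assumed wording for error pages; the real patterns are not given here.
OOPS_RE = re.compile(r"\boops\b|something went wrong", re.I)


def visible_text(html: str) -> str:
    """Rough approximation of visible text: drop scripts, styles, and tags."""
    without_code = SCRIPT_STYLE_RE.sub(" ", html)
    return " ".join(TAG_RE.sub(" ", without_code).split())


def is_blank_page(html: str) -> bool:
    """A page with no visible text at all counts as blank."""
    return visible_text(html) == ""


def is_oops_page(html: str) -> bool:
    """A page whose visible text looks like a generic error page."""
    return bool(OOPS_RE.search(visible_text(html)))


DEFAULT_MATCHERS: Dict[str, Callable[[str], bool]] = {
    "blank": is_blank_page,
    "oops": is_oops_page,
}


def degraded_gracefully(html: str) -> Tuple[bool, List[str]]:
    """Return (passed, names of the failure matchers that fired)."""
    fired = [name for name, matches in DEFAULT_MATCHERS.items() if matches(html)]
    return (not fired, fired)
```

Keeping the matchers in a dictionary leaves room for the extensibility mentioned above: user-defined matchers could be registered alongside the defaults.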

Finally, we needed an effective way to present these test results to our users. We figured that some users would like to see the firehose of data (every failure for every endpoint), but others would want a simpler view of regressions and new failures for defined tests. So we built both views.
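
One way to derive the simpler view from the firehose is to compare the latest run of a test against the previous one. The sketch below assumes each run is a mapping from downstream name to whether the page degraded gracefully. It also assumes that a "regression" means a downstream that passed before and fails now, and that a "new failure" means a failing downstream that was not in the previous run. These definitions are illustrative, not necessarily the ones LinkedOut uses.

```python
from typing import Dict, List


def summarize(previous: Dict[str, bool],
              current: Dict[str, bool]) -> Dict[str, List[str]]:
    """Condense two runs of per-downstream results into a short summary.

    Each dict maps a downstream name to True (degraded gracefully)
    or False (did not).
    """
    failing = [name for name, passed in current.items() if not passed]
    return {
        # Passed in the previous run, fails in this one.
        "regressions": sorted(n for n in failing if previous.get(n) is True),
        # Fails now and was not part of the previous run at all.
        "new_failures": sorted(n for n in failing if n not in previous),
        # Failed in both runs.
        "still_failing": sorted(n for n in failing if previous.get(n) is False),
    }
```

The full `current` mapping remains available as the detailed view, while the summary surfaces only what changed since the last run.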
