Resilience Engineering at LinkedIn with Project Waterbear


Our latest home page depends on more than 550 different endpoints in its dependency tree. With this many endpoints, it is very difficult for developers to ensure that the home page degrades “gracefully” as expected in every failure scenario. With LinkedOut, we implemented a “disruptor” filter in the Rest.li client. This filter inspects the LiX context for a member ID and a treatment: the member ID decides whether the request should be disrupted, and the treatment determines which failure mechanism we introduce. To lower the learning curve, we also created a Chrome plugin that surfaces all downstream dependency information for any page on the LinkedIn.com domain and lets users click to select specific downstreams to fail by passing the LiX context into the request.
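
To make the mechanism concrete, here is a minimal, self-contained sketch of the kind of decision logic such a disruptor filter applies. All names in it (DisruptorSketch, LixContext, FailureMode) are hypothetical stand-ins for illustration, not the actual Rest.li or LiX APIs.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/**
 * Illustrative sketch only: the real LinkedOut disruptor is a filter in the
 * Rest.li client, and the types below are hypothetical stand-ins for internal
 * LinkedIn APIs (LiX, Rest.li request context, etc.).
 */
public final class DisruptorSketch {

  /** Failure mechanisms a treatment can request for a downstream call. */
  enum FailureMode { NONE, ERROR, TIMEOUT, DELAY }

  /** Minimal stand-in for the LiX context carried on the request. */
  record LixContext(long memberId, Map<String, FailureMode> treatmentsByDownstream) {}

  /** Members currently opted in to disruption (e.g., the engineer running the test). */
  private final Set<Long> targetedMembers;

  DisruptorSketch(Set<Long> targetedMembers) {
    this.targetedMembers = targetedMembers;
  }

  /**
   * Decide how (or whether) to disrupt a call to a given downstream service.
   * The member ID gates whether the request is disrupted at all; the treatment
   * picks the failure mechanism to inject for that downstream.
   */
  FailureMode decide(LixContext ctx, String downstreamService) {
    if (!targetedMembers.contains(ctx.memberId())) {
      return FailureMode.NONE;  // not a targeted member: pass the request through untouched
    }
    return Optional.ofNullable(ctx.treatmentsByDownstream().get(downstreamService))
        .orElse(FailureMode.NONE);  // no treatment for this downstream: no disruption
  }
}
```

In the real filter, a non-NONE decision translates into returning an error, injecting a delay, or letting the call time out, which keeps the blast radius limited to the member who asked for the disruption.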

A few interesting things about LinkedOut include the use of LiX as the targeting system for treatments and the decision to adopt client-side failure logic in Rest.li.

  • LiX treatment targeting: Using a well-known and internally popular system, instead of other methods like Zookeeper or invocation context (IC) headers, lowers the cost of onboarding and avoids duplicating systems or adding complexity to the LinkedIn stack.
  • Client-side failure logic: This approach increases the adaptability and scalability of the implementation. Integrating the testing framework is as simple as bumping up your Rest.li client version and enabling the testing mode via a config change.

Infrastructure failure induction (FireDrill)

FireDrill provides an automated, systematic way to trigger/simulate infrastructure failure in production, with the goal of helping build applications that are resistant to these failures.

For example, let’s look at a short list of how we classify various failure states:

  • Host half failure: DNS pollution, time out of sync, disk write failure, high network latency
  • Host offline: host power failure, network disconnected
  • Rack failure: whole rack offline
  • Data center failure: data center loses its data link

We started this project by creating host-level failures. The types of failures we chose to simulate initially are:

  • Network failure
  • Disk failure
  • CPU/Memory failure

We use modules written in SaltStack to simulate these failures. Our goal with FireDrill is to move to an automated setup where we can simulate these failures constantly in production. Once we are comfortable with simulating host-level failures, the next phase is to use FireDrill to create power, switch, and rack failures in our data centers.
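
The actual failure modules are written in SaltStack; as a rough illustration of what one of these failures reduces to on a single Linux host, the sketch below wraps tc/netem to add artificial network latency. It is not FireDrill code, and the interface name and delay values are placeholders.

```java
import java.io.IOException;
import java.util.List;

/**
 * Illustration only: FireDrill's real failure injection is done with SaltStack
 * modules. This sketch shows the kind of host-level command such a module
 * might wrap to simulate high network latency via tc/netem (requires root).
 */
public final class NetworkLatencyInjector {

  /** Add artificial latency (with jitter) to all egress traffic on an interface. */
  static void injectLatency(String iface, int delayMs, int jitterMs)
      throws IOException, InterruptedException {
    run(List.of("tc", "qdisc", "add", "dev", iface, "root", "netem",
        "delay", delayMs + "ms", jitterMs + "ms"));
  }

  /** Remove the injected latency, restoring normal traffic. */
  static void clearLatency(String iface) throws IOException, InterruptedException {
    run(List.of("tc", "qdisc", "del", "dev", iface, "root", "netem"));
  }

  private static void run(List<String> command) throws IOException, InterruptedException {
    Process p = new ProcessBuilder(command).inheritIO().start();
    if (p.waitFor() != 0) {
      throw new IOException("Command failed: " + String.join(" ", command));
    }
  }
}
```

Disk and CPU/memory failures can be simulated in the same spirit: apply a reversible change on the host, then undo it when the drill ends.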

Cultural changes

Graceful degradation

The term “graceful degradation” is almost self-explanatory. Part of the concept is that when a non-core dependency fails (for example, an ad failing to load on a page), the page still loads with reduced information, which impacts the user experience far less than sending the member to an error/“wiper” page would.

The process for thinking through a graceful degradation scenario follows a specific set of steps that includes SREs working directly with service owners and other teams:

  • Identify “core flows” that should NOT fail
  • Identify and reduce “core dependencies”
  • Creatively come up with ways to “gracefully degrade”
  • Determine best practices and set up continuous checking in production

In the event of a core dependency failure, there are several scenarios that could be activated to ensure a better user experience (a rough sketch of how these can be composed follows the list):

  • Trade with time: Assuming a core dependency in another datacenter or PoP didn’t fail, make a cross-colo call to get the dependency info.
  • Trade with complexity: Look into alternative sources for similar data or use cached data to mitigate the impact.
  • Trade with feature/relevance/content: For example, if complex search isn’t available, degrade to simple search. If personalization info isn’t available, degrade to a default experience within that feature for everyone.
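
As a rough, hypothetical sketch (not LinkedIn’s actual code) of how these trade-offs can be composed, the fallback chain below tries each option in order: the normal call, a cross-colo call, a cached or alternative source, and finally a default experience.

```java
import java.util.List;
import java.util.function.Supplier;

/**
 * Hypothetical sketch of a graceful-degradation fallback chain; the names here
 * do not come from LinkedIn's code. Each step trades something (time,
 * complexity, or relevance) for availability.
 */
public final class DegradationChain<T> {

  private final Supplier<T> primary;      // normal in-colo call to the core dependency
  private final Supplier<T> crossColo;    // trade with time: call another data center/PoP
  private final Supplier<T> cachedOrAlt;  // trade with complexity: cache or alternative source
  private final T defaultExperience;      // trade with relevance: generic default for everyone

  public DegradationChain(Supplier<T> primary, Supplier<T> crossColo,
                          Supplier<T> cachedOrAlt, T defaultExperience) {
    this.primary = primary;
    this.crossColo = crossColo;
    this.cachedOrAlt = cachedOrAlt;
    this.defaultExperience = defaultExperience;
  }

  public T fetch() {
    for (Supplier<T> step : List.of(primary, crossColo, cachedOrAlt)) {
      try {
        T result = step.get();
        if (result != null) {
          return result;  // first option that succeeds wins
        }
      } catch (RuntimeException e) {
        // This option failed; fall through to the next, more degraded one.
      }
    }
    return defaultExperience;  // last resort: reduced but still usable experience
  }
}
```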

Implementation notes

Cultural changes are notoriously among the hardest things to implement in large engineering organizations. To roll out these changes, we took a measured approach that showcased successes with pilot services to secure buy-in from other teams.

We also implemented a full-blown communications strategy to advocate for more teams to get involved in the Waterbear project. This included:

  • Roadshows to other eng teams and orgs
  • Internal tech talks and training videos
  • Adding resilience-focused sessions in new hire onboarding training (coming soon)
  • Designing competitive games/challenges to encourage people to discover/fix the resilience issues in their product (coming soon)
  • A blog post about Waterbear (the one you are reading!)

Finally, we are also considering whether to implement a “Waterbear certified” status for services, which would be a more formal version of the recognition we give out for projects rated as “A+” by service scorecard (SSC), an internal benchmarking tool we use to gamify operational excellence.

Rest.li client changes

Degrader tuning

Almost all LinkedIn inter-service communication is conducted over the Rest.li protocol. If we can make the protocol more failure-tolerant, it will help us cover a lot of common resilience oversights.

The Rest.li framework is very powerful, but it requires users to turn a lot of knobs to make it work properly. As a result, most service owners just use the default settings. However, this can be suboptimal, because different services have dramatically different profiles (QPS, latency, etc.).

We noticed that user-facing latency/error spikes were often caused by single-node failures in a relatively large cluster. In theory, the Rest.li degrader should act as an intelligent load balancer and quickly take the problematic node out of rotation. In reality, because the degrader algorithm runs on the default settings, it waits 10 seconds before marking a request as “degraded” and is too lenient/slow in punishing the badly-behaving node; for services that take thousands of QPS and expect less than 10ms latency, it almost never acts properly. As a result, a single-node failure bubbles up into user-facing errors.

D2Tuner is a project we’ve worked on that recommends D2 settings based on each service’s historical data. Ultimately, we want D2Tuner to let every service owner tune their degrader properly, so that the Rest.li framework can absorb most single-node turbulence in production and keep it from bubbling up to users.
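
To illustrate the idea only (this is not D2Tuner’s actual algorithm, nor D2’s real configuration surface), a recommendation could be as simple as deriving the degrader’s latency threshold from a service’s observed tail latency, so that a sub-10ms service is judged on a matching scale rather than a one-size-fits-all default:

```java
/**
 * Illustration only: a made-up sketch of deriving a per-service degrader
 * latency threshold from historical latency data, instead of relying on one
 * default value for every service. The multiplier and floor are placeholders
 * a real tuner would choose from data.
 */
public final class DegraderThresholdSketch {

  /** Recommend the latency (ms) above which calls should count against a host. */
  static long recommendHighLatencyMs(double historicalP99Ms) {
    double multiplier = 3.0;  // leave headroom above the service's normal tail latency
    double floorMs = 5.0;     // don't degrade on noise for ultra-fast services
    return Math.round(Math.max(historicalP99Ms * multiplier, floorMs));
  }

  public static void main(String[] args) {
    // A service with an ~8ms p99 gets a ~24ms threshold instead of a multi-second default.
    System.out.println(recommendHighLatencyMs(8.0));  // prints 24
  }
}
```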

Lessons learned

Reflecting on this project, we found two lessons that we think other organizations can benefit from.

Values and culture matter

There are many factors that go into problem-solving, but one that does not get discussed enough is the role a company’s values play in the process. LinkedIn Engineering is a values-driven organization, and we feel that these values can contribute positively to the way a company approaches a problem like resilience engineering. When we discussed our “hierarchy of needs” earlier in this post, we were actually talking about “craftsmanship,” a LinkedIn Engineering core company value.

At a company level, one of our core values is the concept of “members first.” Serving 500 million members in more than 200 countries leads to a huge potential impact for any site outage. This influenced our design thinking, because we needed to begin with the idea of minimizing the impact that any failures had on a member’s experience.

Every team can contribute to resilience engineering efforts

Finally, another common refrain in the LinkedIn SRE organization is, “attack the problem, not the person” (a reflection of our company values, “act like an owner” and “relationships matter”). Knowing that an individual service owner won’t be shamed or verbally attacked when a problem occurs creates a much more positive environment for problem-solving. This also makes it easier for the SRE organization to influence the decision-making process throughout the engineering teams in order to make changes to shared infrastructure (Rest.li, etc.) and get the rest of the company to think about resilience as part of the product design process. In many ways, Waterbear may not have been possible with a different company culture.

One additional pleasant surprise for us has been the general reaction from leadership and our development partners. We were expecting hard pushback, because introducing failures in production and requesting resilience features can conflict with product delivery timelines. However, the development and leadership teams have been very supportive. As long as we have scientific approaches to validate our hypotheses against failure scenarios, the ability to limit the blast radius of failures, the means to derive clear action items to improve system resilience, and proper tooling/systems that make running such tests extremely easy, every team at LinkedIn can contribute to resilience engineering efforts.

Acknowledgments

Our team would like to thank several individuals who provided invaluable help during the design, testing, and implementation of Waterbear. Specifically, we’d like to thank Anurag Bhatt, David Hoa, Anil Mallapur, Nina Mushiana, Ben Nied, Jaroslaw Odzga, Maksym Revutskyi, Logan Rosen, Sean Sheng, Ted Strzalkowski, Nishan Weragama, Brian Wilcox, Benson Wu, and Xu Zhang.

 


