LinkedIn started as a professional networking service in 2003, serving user requests out of a single data center. For any internet services company, availability is a key factor in its success. In any internet architecture, a lot of things can go wrong at any given time: network links can die, power fluctuations can knock out entire racks of servers, or you could release bad code. In the face of adversity, keeping LinkedIn up and running at all times is a challenging task.
In the initial days of LinkedIn, when our whole infrastructure was hosted in a single data center, if any disaster struck, LinkedIn would be offline for the duration of the disaster. These days, a lot has changed: it’s been 13 years, we now have more than 500 million members, and LinkedIn has successfully transitioned to a multi-colo architecture, distributing infrastructure across multiple data centers. LinkedIn currently serves user traffic out of four geographically-distributed data centers, thereby providing a stable infrastructure to serve members without downtime. Moving to multi-colo was a significant milestone in LinkedIn’s disaster recovery strategy. It enables us to serve members without interruption, even in the rare event that we lose a data center, by moving live production traffic between our data centers in minutes. In this post, we will discuss our disaster recovery strategy, including TrafficShift, the architecture that enables us to move live production traffic across data centers, and the processes we use every day to keep improving our availability.
LinkedIn’s traffic architecture
In order to understand how we move live traffic across data centers, we need to understand how a user’s request ends up in a data center in the first place. As shown in the figure above, when a user visits “https://www.linkedin.com” in their browser, the request is sent via GeoDNS to one of 13 PoPs (points of presence) distributed globally. Each PoP runs a service called ATS (Apache Traffic Server). ATS includes a custom plugin that verifies a signed cookie in the request headers to identify the data center the user’s request should be routed to, establishing a sticky session. If the cookie is missing or expired, or if the data center it names is not online, ATS communicates with the stickyrouting service to determine the primary colo the user’s request should be routed to and sets a cookie with that data center information for the user.
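The cookie-based routing decision described above can be sketched roughly as follows. This is a minimal illustration, not LinkedIn's actual plugin: the colo names, `COOKIE_TTL`, and `lookup_primary_colo` are all made-up stand-ins (and a real signed cookie would also be cryptographically verified, which is omitted here):

```python
import time

# Assumed cookie lifetime; the real TTL is not stated in the post.
COOKIE_TTL = 24 * 3600

# Hypothetical set of colos currently accepting traffic.
ONLINE_COLOS = {"colo-a", "colo-b", "colo-c", "colo-d"}

def lookup_primary_colo(member_id):
    # Stand-in for the stickyrouting lookup; here, a trivial hash-based pick.
    return sorted(ONLINE_COLOS)[hash(member_id) % len(ONLINE_COLOS)]

def route_request(member_id, cookie):
    """Return (colo, cookie): reuse a valid cookie for stickiness,
    otherwise consult stickyrouting and mint a fresh cookie."""
    now = time.time()
    if cookie and cookie["colo"] in ONLINE_COLOS and now < cookie["expires"]:
        return cookie["colo"], cookie  # sticky session: keep the assignment
    # Cookie missing, expired, or pointing at an offline colo: re-resolve.
    colo = lookup_primary_colo(member_id)
    return colo, {"colo": colo, "expires": now + COOKIE_TTL}
```

A request with no cookie (or a stale one) triggers the stickyrouting lookup; subsequent requests ride the cookie until it expires or its colo goes offline.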
The stickyrouting service is a key service that assigns a primary colo and a secondary colo to every member at LinkedIn. The assignment is computed by a Hadoop job that runs regularly, assigning each member a primary colo based on the geographical distance between the member and the various colos, while taking into account the capacity constraints of each colo. A member’s secondary colo is assigned such that the overall traffic is equally divided among the remaining two U.S. data centers.
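The assignment logic could look something like the sketch below. The colo coordinates, capacities, and distance metric are invented for illustration; the real Hadoop job operates over hundreds of millions of members with far more sophisticated inputs:

```python
import math

# Illustrative colos with made-up coordinates and tiny capacities.
COLOS = {
    "us-west":    {"lat": 37.4, "lon": -122.1, "capacity": 3},
    "us-central": {"lat": 32.8, "lon": -96.8,  "capacity": 3},
    "us-east":    {"lat": 39.0, "lon": -77.5,  "capacity": 3},
}

def distance(member, colo):
    # Simple Euclidean stand-in for great-circle distance.
    return math.hypot(member["lat"] - colo["lat"], member["lon"] - colo["lon"])

def assign(members):
    """Return {member_id: (primary, secondary)}: nearest colo with spare
    capacity as primary, least-loaded remaining colo as secondary."""
    primary_load = {name: 0 for name in COLOS}
    secondary_load = {name: 0 for name in COLOS}
    out = {}
    for mid, m in members.items():
        # Primary: nearest colo that still has capacity.
        by_distance = sorted(COLOS, key=lambda c: distance(m, COLOS[c]))
        primary = next(c for c in by_distance
                       if primary_load[c] < COLOS[c]["capacity"])
        primary_load[primary] += 1
        # Secondary: spread failover traffic evenly over the remaining colos.
        secondary = min((c for c in COLOS if c != primary),
                        key=lambda c: secondary_load[c])
        secondary_load[secondary] += 1
        out[mid] = (primary, secondary)
    return out
```

Balancing secondaries by current load approximates the post's goal of dividing failover traffic equally among the remaining data centers.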
Preventing disasters with TrafficShift
Graph showing traffic being failed out of data centers
When a service degrades or any disaster strikes one of our data centers, we redirect live traffic to another data center in order to mitigate member impact. We achieve this by marking the portions of a colo that are facing issues as “offline,” using the stickyrouting service. All requests destined for that colo are then routed to the appropriate secondary colo without any downtime. We developed an internal tool called TrafficShift to automate this process. A data center can be marked offline with the click of a button (called the “big red button”) in minutes. Whenever there is an issue or major maintenance scheduled, we move traffic out of that colo and mitigate the risks of service interruption for our members.
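A minimal sketch of this failover path, with hypothetical colo names and an in-memory table standing in for the stickyrouting state (the real system mutates state in the stickyrouting master, not a local set):

```python
# Colos currently in rotation (illustrative names).
online_colos = {"us-west", "us-central", "us-east"}

# member -> (primary, secondary), as computed by the stickyrouting batch job.
assignments = {
    "member-1": ("us-west", "us-east"),
    "member-2": ("us-west", "us-central"),
    "member-3": ("us-east", "us-central"),
}

def mark_offline(colo):
    """The 'big red button': take a whole colo out of rotation."""
    online_colos.discard(colo)

def resolve(member_id):
    """Route to the primary colo if online, else fail over to the secondary."""
    primary, secondary = assignments[member_id]
    return primary if primary in online_colos else secondary
```

Once `mark_offline` runs, every member whose primary was the failed colo resolves to their pre-assigned secondary, which is why traffic moves without downtime.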
As shown in the diagram above, the TrafficShift engine is a collection of two services: trafficshift-fe and trafficshift-be. Trafficshift-fe is a web application that provides a UI for creating plans for different operations, such as failing traffic out of a colo, failing back into a colo, scheduling automated load tests, or failing out of a PoP.
TrafficShift UI showing traffic distribution
Trafficshift-be is a backend service written in Python that executes the above-mentioned jobs by leveraging Salt modules and the Salt-API to communicate with the stickyrouting master service.
These services are deployed in multiple colos with multiple instances to provide resiliency and operational stability. Trafficshift-be runs as a single active master instance, with standby instances ready to assume the master role if something happens to the current master. This distributed master election is handled via Apache ZooKeeper.
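ZooKeeper's standard leader-election recipe uses ephemeral sequential znodes: each candidate creates a numbered node, the lowest number is the master, and losing a session deletes the node so a standby takes over. The simplified in-memory model below illustrates the recipe only; the real trafficshift-be would use an actual ZooKeeper client, and `Election` and its node paths are purely illustrative:

```python
import itertools

class Election:
    """Toy model of ZooKeeper ephemeral-sequential leader election."""

    def __init__(self):
        self._seq = itertools.count()
        self._nodes = {}  # znode path -> candidate instance name

    def join(self, instance):
        # Like creating an EPHEMERAL_SEQUENTIAL znode under the election path.
        path = f"/trafficshift/master-{next(self._seq):010d}"
        self._nodes[path] = instance
        return path

    def leader(self):
        # Lowest sequence number wins, exactly as in the ZooKeeper recipe.
        return self._nodes[min(self._nodes)]

    def leave(self, path):
        # Session loss deletes the ephemeral node; the next-lowest takes over.
        self._nodes.pop(path, None)
```

In production, each standby would also watch the node just below its own so that failover is detected immediately rather than by polling.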
Load testing: disaster recovery process
As part of our disaster recovery policy, we regularly stress test our data centers to help identify capacity bottlenecks, newly-introduced bugs that might surface during high loads, varying traffic patterns and their impact on key services, and, last but not least, to provide confidence in each data center’s capacity to handle extra load during outages. Load testing is a planned stress test that we execute three days per week during our peak traffic times. During load testing, we target a particular colo and stress it by routing extra traffic from other colos to this colo, up to a predetermined target QPS. The whole process of load testing is run via a continuously-monitored, automated TrafficShift service.
The purple space between 07:40 and 09:25 shows when we did a load test to one of our colos
The TrafficShift tool provides fully automated, scheduled load test execution: an engineer creates a plan with a customized stress percentage, selects the relevant partitions, and schedules the test to run at a particular time. Once we achieve the desired QPS, we maintain that traffic level for 45 minutes before rebalancing members back to their primary colos. The TrafficShift tool will abort and rebalance traffic if any alert triggers, thereby preventing any risk of disrupting members’ experience. If a load test fails, a blocker ticket is raised to the failing service owners, and they have a strict SLA to fix the issue.
Instead of being reactive in our disaster recovery approach, or only testing these processes once per year, we have incorporated load testing into our daily operations work. Although load testing is done in a live environment, the process is managed carefully to minimize any impact on members. We believe testing in a live environment is necessary in order to be fully prepared to best serve our members in the event of a real outage. By building the infrastructure and tooling needed to easily reroute traffic between colos, LinkedIn has shielded our members from potential disasters by making our service more resilient. Along with disaster recovery preparation, our regular load tests also provide an opportunity for individual service owners to test their services under high load using live production traffic.