How we reduced latency and cost-to-serve by merging two systems


Figure: Architecture of identity services

Motivation

As our identity applications’ footprint expanded and LinkedIn applications and their features grew, the team started to shift its focus to performance, cost-to-serve, and operational overheads. One example can be found in this article. In addition, we started to re-evaluate some of our assumptions and practices in developing the identity services.

We observed a couple of downsides of keeping the identity data service separate from the identity midtier:

  1. The design of having the data service separate from the midtier turned out to be less valuable than we initially thought. We discovered that most scalability challenges could be addressed at the storage layer, i.e., the Espresso data store. Furthermore, reads and writes were, in effect, passed straight through the identity data service to Espresso.
  2. Maintaining the data service as a standalone service incurred operational overheads and increased code complexity. We provisioned over 1,000 application instances in multiple data centers for it. Furthermore, we had to maintain an API in the data service whose only consumer was the midtier. This involved data modeling, API evolution, and security, to name a few concerns.
  3. The business logic in the data service was minimal, and the majority of it involved data validation.
  4. Keeping the midtier and backend services separate also incurred additional network hops for client applications.

With these considerations, we embarked upon the effort to combine the midtier and the data service into a single service, while keeping the APIs unchanged. This, at first glance, is counterintuitive, considering that we generally follow service-oriented architectures to tackle complexity by breaking big systems into smaller ones. However, we believe there is a right balance to strike in deconstructing systems. In the case of identity services, the potential gains in performance, cost-to-serve, and operational overheads outweighed the additional complexity of bundling everything into a single service.

Implementation

Thanks to the microservice architecture we employ at LinkedIn, we were able to merge two services with significant footprints into a single one without disrupting our clients. The plan was to fold the data service’s code into the midtier and enable the midtier to interact directly with the data store, while keeping the midtier’s interface unchanged. One important goal was to maintain feature and performance parity between the new and old architectures. We were also focused on managing the risks that came with merging two critical applications, and on keeping the development cost of the merger to a minimum.

Our implementation was completed in four steps.

Step 1. To seamlessly merge the two code bases and run them in a single service, there were two approaches we could take. The intuitive approach would be to copy select code from the data service into the midtier service so that the midtier could perform logic such as data validation and interact with the data store directly. While that was the cleanest approach, it required a significant amount of upfront development before we could validate the idea. Consequently, we opted for a creative “hack”: using the data service’s REST API code as a local library in the midtier. We would then have the option to clean up the tech debt once the idea was validated.
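To make this shortcut concrete, here is a minimal sketch of what calling the data service’s logic in-process, rather than over Rest.li, could look like. All class and method names are hypothetical and illustrative only; this is not LinkedIn’s actual identity code.

```java
// Illustrative sketch only; class and method names are hypothetical.
public final class LocalDataServiceExample {

  // Minimal stand-in for the data service's domain type.
  record Profile(long memberId, String firstName, String lastName) {}

  // The logic that previously backed the data service's Rest.li endpoint:
  // validation plus a read from the Espresso-backed storage layer.
  interface ProfilesResource {
    Profile get(long memberId);
  }

  // Midtier-side wrapper: before the merge this would have issued a
  // Rest.li request over the network to the data service cluster; after
  // the merge, the same resource class is invoked as a plain library
  // dependency inside the midtier's JVM.
  static final class IdentityBackendLocalClient {
    private final ProfilesResource resource;

    IdentityBackendLocalClient(ProfilesResource resource) {
      this.resource = resource;
    }

    Profile getProfile(long memberId) {
      // Same validation and storage access as before, minus one network hop.
      return resource.get(memberId);
    }
  }
}
```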

Step 2. We gradually ramped the change described in Step 1. At LinkedIn, we have a state-of-the-art A/B testing framework called T-REX. With T-REX, we can create a ramp schedule based on the level of risk and impact of a change, and generate statistical reports to measure top-tier metrics. This allowed us to ramp the change gradually while observing its impact, and gave us a fast rollback capability (within a few minutes) if needed. Since this was a high-risk, high-impact change to two critical services, we took extra caution with our ramp schedule. We ramped one data center after another, and within each data center, we ramped from small percentages of traffic to larger ones, with enough time in between to generate reports.
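As an illustration of this ramping pattern (and not the actual T-REX API), the sketch below gates each request between the old remote path and the new in-process path based on an experiment treatment. The experiment key and interfaces are assumptions made for the example.

```java
// Hypothetical sketch of gating the merged code path behind an experiment.
// The treatment-lookup interface below is illustrative, not the T-REX API.
public final class RampedProfileFetcher {

  interface ExperimentClient {
    // Returns the treatment (e.g., "control" or "enabled") for a member.
    String getTreatment(String experimentKey, long memberId);
  }

  interface ProfileSource {
    String fetchProfile(long memberId);
  }

  private final ExperimentClient experiments;
  private final ProfileSource remoteDataService; // old path: extra network hop
  private final ProfileSource localDataAccess;   // new path: in-process

  RampedProfileFetcher(ExperimentClient experiments,
                       ProfileSource remoteDataService,
                       ProfileSource localDataAccess) {
    this.experiments = experiments;
    this.remoteDataService = remoteDataService;
    this.localDataAccess = localDataAccess;
  }

  String getProfile(long memberId) {
    // Route the request based on the centrally managed ramp.
    boolean useLocalPath =
        "enabled".equals(experiments.getTreatment("identity-merged-service", memberId));
    return (useLocalPath ? localDataAccess : remoteDataService).fetchProfile(memberId);
  }
}
```

Because the ramp is driven by centrally managed experiment configuration, rolling back means flipping the treatment rather than redeploying either service, which is what makes a rollback within a few minutes possible.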

Step 3. We decommissioned the data service hosts. 

Step 4. Since we took a creative shortcut in Step 1 by embedding data service code that was developed for a REST service as a local library, we needed to clean this up because craftsmanship is an important tenet of our culture. We simplified the class hierarchy by removing the classes and interfaces that existed only to expose Rest.li services, and kept only the essential classes that interact with the data store.
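As a rough sketch of the post-cleanup shape, again with hypothetical names: the Rest.li resource scaffolding is gone, and what remains is a thin class that validates input and talks to the Espresso-backed store.

```java
// Illustrative post-cleanup shape; names and the store interface are
// hypothetical, not LinkedIn's actual code or the Espresso client API.
public final class ProfileStorageAccess {

  // Minimal stand-in for a key-value view of the storage layer.
  interface DocumentStore {
    String get(String table, String key);
    void put(String table, String key, String value);
  }

  private static final String TABLE = "MemberProfiles";
  private final DocumentStore store;

  ProfileStorageAccess(DocumentStore store) {
    this.store = store;
  }

  String readProfile(long memberId) {
    validateMemberId(memberId);
    return store.get(TABLE, Long.toString(memberId));
  }

  void writeProfile(long memberId, String profileJson) {
    validateMemberId(memberId);
    store.put(TABLE, Long.toString(memberId), profileJson);
  }

  // The bulk of the former data service's business logic: input validation.
  private static void validateMemberId(long memberId) {
    if (memberId <= 0) {
      throw new IllegalArgumentException("memberId must be positive: " + memberId);
    }
  }
}
```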

The diagram below shows the difference in the architecture before and after the change.


