We had two services and three different databases supporting contacts and calendar use cases. Address Book Service, written in Java, acted as a backend service that ingested contacts from the email import flow and mobile uploads. Connected Service, which originally came to LinkedIn through the acquisition of ConnectedHQ, was written in Python and periodically synced contacts from third-party sources. Address Book Service stored contact data in Oracle, while Connected Service used MySQL and Espresso to store both contact and calendar data. The Oracle and MySQL deployments had significant footprints, with close to 50 and 100 shards, respectively. Espresso was used to process mobile calendars and create notifications that helped members learn more about the people they were meeting. Because the data was distributed across these three databases, our Extract, Transform, and Load (ETL) process had to take periodic snapshots of all three and make them available in the Hadoop Distributed File System (HDFS) for offline analysis such as People You May Know (PYMK). These offline jobs processed the data, computed suggested connections for each member, and made those suggestions available to the frontends.
This fragmented architecture came with challenges: maintaining and syncing multiple services and databases, reconciling different data models across these systems when onboarding new use cases, manual DBA intervention to run ALTER TABLE scripts on every shard for any schema evolution, and large monetary costs in licensing and hardware provisioning to support multiple databases. On top of that, because our Python stack lacked LinkedIn's standard REST framework, it was difficult and hacky for other online services to interact with this system. For these reasons, we decided to build a single source of truth for all contact and calendar data.
Re-architecting the system
As we started this project of re-architecting the system, we had many challenges ahead of us.
We wanted to establish a single source of truth, which required migrating hundreds of terabytes of data and 40+ client services. On top of that, we had to migrate with zero downtime while keeping both systems in sync.
Deciding on the right technology
Before building the new system, we needed to decide which database platform best fit our needs. After careful evaluation, we felt that Espresso met all the requirements for our use cases. Espresso is a fault-tolerant, distributed NoSQL database that could scale with our ever-increasing demand. It also provides fast lookups of member data without the need for global indexes. Espresso's auto-purging ability reinforces our commitment to data privacy by purging all of a member's data if they delete their LinkedIn account. Finally, as an in-house technology, Espresso is well-integrated into the LinkedIn ecosystem and has strong support from engineers and SREs.
Building the right data model
Since we were migrating from two relational databases (Oracle and MySQL) to a document-based key-value store (Espresso), re-designing the schemas was a challenge in and of itself. We carefully examined all the tables and fields in our legacy systems, their existing relations, and how those relations actually impacted the system. We also studied the usage patterns of our data and optimized for performance over storage, de-normalizing some of the data where that bought us better read performance. We created secondary tables to match our query patterns and avoided a global index on the primary table.
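To make the secondary-table idea concrete, here is a minimal sketch of the pattern (the table names, fields, and keying are hypothetical illustrations, not our actual Espresso schema): a denormalized contact document keyed by member, plus a secondary table that mirrors a common query pattern so that a lookup by email stays scoped to one member and needs no global index.

```python
# Hypothetical sketch of the denormalized data model. Plain dicts stand in
# for Espresso tables; the key structure is what matters.

contacts = {}           # (member_id, contact_id) -> denormalized contact document
contacts_by_email = {}  # (member_id, email)      -> contact_id (secondary table)

def put_contact(member_id, contact_id, doc):
    """Write the document and keep the secondary table in sync with it."""
    contacts[(member_id, contact_id)] = doc
    for email in doc.get("emails", []):
        contacts_by_email[(member_id, email)] = contact_id

def find_by_email(member_id, email):
    """Email lookup is scoped to a single member -- no global index required."""
    contact_id = contacts_by_email.get((member_id, email))
    return contacts.get((member_id, contact_id)) if contact_id is not None else None
```

The trade-off is the classic one: writes do a little extra work to maintain the secondary table, in exchange for reads that hit a single partition.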
At LinkedIn, all the database and API schemas need to go through the Data Model Review Committee (DMRC) to make sure they meet common standards and to discover common pitfalls. DMRC consists of many senior engineers at LinkedIn and is often quite helpful in making sure you have designed your data model in the right way so that it can be easily evolved in the future.
Use of personal data routing
Typically at LinkedIn, we replicate member data in all of our data centers to provide quick access to member data anywhere around the globe. This also provides an easy way of switching traffic between data centers in case of failover.
The contacts dataset is one of the biggest datasets at LinkedIn and is growing rapidly. Replicating it under the same architecture would have cost a lot in hardware for our Espresso clusters. To combat this problem, we leveraged personal data routing (PDR), originally built by the Messaging team, for the contacts dataset. Instead of writing and replicating members' contacts data to every single data center, we decided to write each member's data only to their primary and secondary data centers; two data centers are sufficient for disaster recovery. Because each member's data lives in two data centers instead of all of them, this decision reduced the hardware cost by a factor of N/2, where N is the total number of data centers operated by LinkedIn.
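The write path above can be sketched as follows (a simplified illustration under assumed routing logic, not LinkedIn's actual PDR implementation; the data center names and the modulo assignment are made up):

```python
# Hypothetical sketch of personal data routing: each member's contacts are
# written only to a primary and a secondary data center, not to all N.

DATA_CENTERS = ["dc-a", "dc-b", "dc-c", "dc-d"]  # N = 4, illustrative names

def assigned_data_centers(member_id):
    """Deterministically pick a primary and a distinct secondary data center."""
    n = len(DATA_CENTERS)
    primary = member_id % n
    secondary = (primary + 1) % n
    return DATA_CENTERS[primary], DATA_CENTERS[secondary]

def write_contact(member_id, doc, clusters):
    """Write to exactly two of the N clusters -- storage cost scales with 2, not N."""
    for dc in assigned_data_centers(member_id):
        clusters[dc][member_id] = doc
```

A read served from any other data center has to be forwarded to one of the member's two assigned data centers, which is where the cross-colo cost discussed below comes from.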
One major downside of PDR is that we must make a cross-colo call whenever the traffic layer routes a member away from their primary data center. After careful evaluation, however, we found that this affects less than 10% of members, and we were able to optimize so that a cross-colo call adds only a few milliseconds to page load time. That seemed a manageable risk for going with PDR.