Expediting Data Fixes and Data Migrations

With over 630 million members, the LinkedIn platform delivers thousands of features that individually serve and store large amounts of diverse data. Protecting, maintaining, and serving data has always been of paramount importance for enriching the member experience and ensuring service reliability. In this blog post, we’ll address a critical part of data management: ad hoc operations in the form of data migrations and data fixes. More specifically, we’ll detail how we developed a centralized, scalable, self-service platform for implementing and executing both data fixes and migrations.

Let’s review the two primary types of data operations. Data migrations generally involve collecting, transforming, and transferring data from one field, table, or database to another. Data fixes, on the other hand, generally involve selecting and transforming data in place. To illustrate these types, let’s go through two examples:

  • Suppose member first names are not limited in length. A number of members have entered first names that are millions of characters long. A product decision is made to enforce a limit of 1,000 characters. A data fix might involve truncating the first names of all members with first names of length greater than 1,000 characters.

  • Suppose member profile data exists in a relational database that we would like to transfer to a NoSQL database. In the past, we migrated profile data from multiple Oracle databases to a NoSQL Espresso database in order to leverage existing technology developed internally by LinkedIn. This would be an example of a data migration.
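To make the distinction concrete, here is a minimal sketch of the two operation types, not the platform's actual code. The record layout and the function names (`fix_first_name`, `migrate_profile`) are hypothetical; only the 1,000-character limit comes from the example above.

```python
from typing import Dict

MAX_FIRST_NAME_LENGTH = 1_000  # limit taken from the data fix example above


def fix_first_name(record: Dict[str, str]) -> Dict[str, str]:
    """Data fix: select and transform data in place (truncate long names)."""
    name = record["first_name"]
    if len(name) > MAX_FIRST_NAME_LENGTH:
        record["first_name"] = name[:MAX_FIRST_NAME_LENGTH]
    return record


def migrate_profile(row: Dict[str, str]) -> Dict[str, object]:
    """Data migration: take a relational row and build the document that
    would be written to the target store (hypothetical document shape)."""
    return {
        "member_id": row["member_id"],
        "name": {"first": row["first_name"], "last": row["last_name"]},
    }
```

The fix mutates the existing record, while the migration produces a new representation destined for a different store.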

From cleaning invalid or corrupted data to migrating member data from legacy systems, these operation types cover use cases that we frequently encounter at LinkedIn.

As LinkedIn and its datasets continue to grow rapidly, executing these data operations becomes increasingly difficult at scale. Data fixes in particular can carry a weight of urgency as immediate solutions to production issues. For both types of data operation, reducing the engineering effort required provides immense member value by preserving data quality and expediting feature development, especially when data migrations can require several months to complete and verify. Historically, across the company, data operations have been conducted through a decentralized variety of mechanisms, including ad hoc scripts, use case-specific CLIs, newly deployed migration services, and many more.

To develop a generic platform in place of these mechanisms, we had to keep a few requirements and considerations in mind:

This new system should maintain a level of simplicity such that any data owner can write a data fix or migration without significant effort. Achieving this requires abstracting concepts away from users writing jobs through simple interfaces, as well as flexible support for multiple languages, frameworks, and data engines (e.g., Java, Python, Pig, Hive, Spark). To reduce the engineering effort required for data operations, we need to abstract away multiple aspects such as resource/topology management, request throttling/scaling, code deployment, iterating data operations on records, and many more. The goal of these abstractions and interfaces is to improve both the developer experience and the speed of implementing data operations as much as possible.

With any data operation, we must ensure and maintain data correctness. This entails keeping data in a valid state for use by features/products and preventing data from entering a state such that member features break. Any data platform should focus on preventing these outcomes wherever possible by allowing for strong validation. For any software or engineering company at the scale of LinkedIn, designing systems that have perfect validation and will preempt all future data quality issues is close to impossible. With a multitude of data stores, schemas, and services, not all corner cases and holes in validation can be fully prevented or protected against. Any client, system, or script can compromise data quality. Similarly, any data operation job could do the same, potentially exacerbating pre-existing issues. Therefore, any existing validation must be honored regardless of whether data changes occur organically via live traffic or inorganically through a data operation platform.

Validation comes in all shapes and sizes. Simple validation can include type checks, formatting checks, range checks, and many more. Complex rules, by contrast, can include checks across multiple tables within a database or cross-service calls to multiple sources of truth.

Oftentimes, these simple rules can be fully enforced at the database layer, but not all validation rules can or should be implemented this way. Enforcement is not always feasible at the database layer and may require an intermediate service, especially in the case of complex rules. To satisfy both simple and complex rules, it is imperative that a data platform be flexible enough to access any service or database that could contain necessary validation rules, ensuring that any fixes or migrations maintain high data quality.
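As an illustration of the simple rules mentioned above (type, format, and range checks), a validator might look like the following sketch. The field name, the no-digits format rule, and the length bounds are assumptions chosen for the example, not LinkedIn's actual validation rules.

```python
import re
from typing import Any, Dict, List

# Hypothetical format rule: a first name may not contain digits.
NAME_PATTERN = re.compile(r"[^\d]+")
MAX_NAME_LENGTH = 1_000  # hypothetical range limit


def validate_profile(record: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors: List[str] = []
    name = record.get("first_name")

    # Type check
    if not isinstance(name, str):
        errors.append("first_name must be a string")
        return errors

    # Range check on length
    if not 1 <= len(name) <= MAX_NAME_LENGTH:
        errors.append("first_name length out of range")

    # Format check
    if not NAME_PATTERN.fullmatch(name):
        errors.append("first_name has an invalid format")

    return errors
```

A complex rule would typically replace one of these local checks with a lookup against another table or a call to another service, which is why the platform needs access beyond the database layer.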

When modifying large amounts of records, any data migration must be able to iterate on records quickly. For example, we recently needed to purge deprecated profile fields, such as associations and specialties, from millions of members as part of a decision to remove the elements of the profile that no longer seemed relevant to the product. Generally, data migrations may have to apply across millions or hundreds of millions of profile records, which requires an immense scaling of this data operation process. If we were to modify all 630 million members’ records using a data migration job that can only modify 10 records per second, this migration would take almost two years to complete! Our data platform must be able to modify large amounts of records quickly across large datasets.
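The estimate above follows from simple arithmetic, which the short sketch below reproduces. The helper name `migration_days` is ours; the record count and rate come from the paragraph.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def migration_days(total_records: int, records_per_second: float) -> float:
    """Days needed to touch every record at a constant rate."""
    return total_records / records_per_second / SECONDS_PER_DAY


days = migration_days(630_000_000, 10)
print(f"{days:.0f} days, about {days / 365:.1f} years")
# prints "729 days, about 2.0 years"
```

At the same scale, a job sustaining 10,000 records per second would finish in about 17.5 hours, which is why throughput matters so much here.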

On the other hand, complex systems will inevitably have bottlenecks and capacity limits (database, service, and event consumption limitations). High query frequency and volume can easily take down under-provisioned services or exposed bottlenecks, so data operation rates must be carefully controlled. This system will require consistent throttling mechanisms to ensure that services do not take on an unmanageable load.
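One simple way to cap the rate of a data operation is a fixed-interval throttle, sketched below. This is an illustrative technique under our own assumptions, not a description of the platform's actual throttling mechanism; the `Throttle` class and its parameters are hypothetical.

```python
import time
from typing import Callable


class Throttle:
    """Spaces calls to wait() at least 1 / max_per_second seconds apart."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = clock()

    def wait(self) -> None:
        """Block until the next operation is allowed to proceed."""
        delay = self._next_allowed - self._clock()
        if delay > 0:
            self._sleep(delay)
        # Schedule the next slot from whichever is later: now or the old slot.
        self._next_allowed = max(self._next_allowed, self._clock()) + self._interval
```

A job would call `throttle.wait()` before each write to a downstream service. The clock and sleep functions are injectable so the behavior can be tested without real delays.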

In order to solve these problems, we implemented a flexible self-service platform for users to rapidly implement, test, and execute data operation jobs. At a high level, the platform filters down large datasets to records that need to be operated upon and then iterates through each of these records to conduct corresponding updates to data until the job is complete.
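The filter-then-iterate flow described above can be sketched as follows. The function `run_job` and its callback parameters are hypothetical names used for illustration; the real platform's interfaces are not shown in this post.

```python
from typing import Callable, Dict, Iterable

Record = Dict[str, str]


def run_job(
    records: Iterable[Record],
    needs_update: Callable[[Record], bool],
    apply_update: Callable[[Record], None],
) -> int:
    """Filter the dataset down to records that need work, update each one,
    and return how many records were updated."""
    updated = 0
    for record in records:
        if not needs_update(record):
            continue  # filtered out: nothing to do for this record
        apply_update(record)
        updated += 1
    return updated


# Usage with the first-name truncation example (in-memory stand-in data).
members = [
    {"first_name": "Ann"},
    {"first_name": "x" * 1500},
    {"first_name": "y" * 2000},
]


def truncate(record: Record) -> None:
    record["first_name"] = record["first_name"][:1000]


count = run_job(members, lambda r: len(r["first_name"]) > 1000, truncate)
print(f"{count} records updated")  # prints "2 records updated"
```

Keeping the filter and the update as separate callbacks means the job author supplies only the logic specific to their fix or migration, while iteration stays in shared code.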
