Migrating to Espresso | LinkedIn Engineering


[Figure: Babylonia makes direct writes to Espresso.]

Ensuring consistency
We’ve had three different processes writing data to our Espresso database: the bulk loader, the Databus listener, and Babylonia itself. One issue we needed to tackle was how to let these three writers operate without conflicting with one another.

Consider the system at this stage, where Babylonia was performing dual writes. After writing directly to Espresso, Babylonia would write to Oracle, which generated a Databus event. When this event reached the Databus listener, the listener would attempt to write the same record to Espresso again. If we allowed the Databus listener to overwrite the data from Babylonia, it could conceal any issues with the direct writes.

Complicating this further (not shown in the diagrams) is that in each colocation data center (colo) we have multiple instances of Babylonia running and multiple instances of the Databus listener running. Oracle and Espresso each have their own mechanisms for cross-colo replication. Once data is committed in one colo, those changes start propagating around the world. There’s a chance that, somewhere, the replicated Oracle data may reach a local Databus listener before the Espresso replication has updated the same data.

We have a similar problem with the LinkedIn Experimentation (LiX) Platform, which we use to control the ramp. When we change the state of a LiX, there is no way to ensure that all instances of Babylonia and the Databus listener see the new state simultaneously.

Essentially, the problem is that any scenario relying on timing or LiX states to ensure that only one process updates the record in Espresso will have some chance of dropped or duplicate writes, which could lead to inconsistencies between Oracle and Espresso or between Espresso databases in different colos.

MigrationControl
Our solution to this problem was to add an optional field to the Espresso schema, which we called MigrationControl. When a process writes to Espresso, it sets MigrationControl to indicate which type of process it is: bulk loader, Databus listener, or Babylonia.

In the write methods, we added logic that checks for an existing record in Espresso. If there is one, it examines the MigrationControl field. If the Databus listener finds that the record has been recently written by Babylonia, it aborts writing the record. That way the last write always comes from Babylonia.

If we find ourselves in a situation where we need to patch up corrupted data, we can redefine this logic to allow the Databus listener or bulk loader to overwrite Babylonia.
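The MigrationControl check described above can be sketched roughly as follows. This is a minimal illustration, not the actual Babylonia or Databus listener code: the names `Writer`, `Record`, `write`, `DEFAULT_POLICY`, and `REPAIR_POLICY` are hypothetical, a plain dictionary stands in for Espresso, and the sketch assumes the bulk loader is blocked from overwriting Babylonia in the same way as the Databus listener. It also ignores the "recently written" time window and any concurrency between writers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Set


class Writer(Enum):
    """The three kinds of process that write to Espresso."""
    BULK_LOADER = "bulk_loader"
    DATABUS_LISTENER = "databus_listener"
    BABYLONIA = "babylonia"


@dataclass
class Record:
    key: str
    value: str
    # Optional field recording which kind of process wrote the record.
    migration_control: Optional[Writer] = None


# Hypothetical policy: for each writer, the MigrationControl values
# it must NOT overwrite. Assumption: the bulk loader is blocked too.
DEFAULT_POLICY: Dict[Writer, Set[Writer]] = {
    Writer.BULK_LOADER: {Writer.BABYLONIA},
    Writer.DATABUS_LISTENER: {Writer.BABYLONIA},
    Writer.BABYLONIA: set(),
}

# Hypothetical repair policy: any writer may overwrite any record.
REPAIR_POLICY: Dict[Writer, Set[Writer]] = {w: set() for w in Writer}


def write(store: Dict[str, Record], record: Record, writer: Writer,
          policy: Dict[Writer, Set[Writer]] = DEFAULT_POLICY) -> bool:
    """Write the record unless the policy protects the existing one.

    Returns True if the record was written, False if the write was aborted.
    """
    existing = store.get(record.key)
    if existing is not None and existing.migration_control in policy[writer]:
        return False  # abort: the existing record takes precedence
    # Stamp the record with the type of process that wrote it.
    store[record.key] = Record(record.key, record.value, writer)
    return True


store: Dict[str, Record] = {}
write(store, Record("member:1", "direct"), Writer.BABYLONIA)
# The replayed Oracle change arrives later and is skipped.
write(store, Record("member:1", "replayed"), Writer.DATABUS_LISTENER)
```

Passing `REPAIR_POLICY` as the `policy` argument models the repair scenario, in which the Databus listener or bulk loader is allowed to overwrite data written by Babylonia.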

Current status
We are currently at this step in the migration process. Direct writes from Babylonia to Espresso are partially ramped, and we expect to complete that soon and begin the next step, which is establishing the Espresso database as the new SoT.

Declaring Espresso the new SoT
Once we have Babylonia writing to Espresso directly, and we have validated what we are writing to Espresso through shadow read validation, we will be ready to declare that Espresso is our new SoT. Babylonia will continue to write to both Oracle and Espresso, but it will then serve read requests from Espresso only.

Even though Babylonia will no longer be dependent on Oracle at this stage, we can’t shut off the writes to Oracle until all the other systems at LinkedIn that use the Oracle Databus and ETL snapshots have migrated to the Espresso-equivalent Brooklin and ETL data sources.


