Tumblr Engineering — Juggling Databases Between Datacenters

Tumblr is a heavy user of MySQL, and MySQL automation at Tumblr is centered around a tool we built called Jetpants. Jetpants makes risky operations safe and reliable, including fairly complex tasks like replacing a failed master server or splitting a shard.

While Jetpants is an effective and valuable tool for Tumblr’s day-to-day operation, building a meaningful testing framework for it has remained very difficult. Integration testing at this level is challenging. In this article I’ll go through these challenges and how we’ve tackled them at Tumblr.


Jetpants operates under the assumption you’re managing MySQL daemons on a fully functional host, and that it can:

  • ssh to the target system
  • manage processes via service or systemctl commands
  • copy data around between systems
  • allocate spare servers from the asset management system, Collins
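
To make the service-management assumption concrete, here is a minimal shell sketch of a preflight check that reports which of the two commands a host offers. The function name `service_manager` is hypothetical and this is not how Jetpants itself detects the service manager; it only illustrates the kind of host capability that a bare container often lacks.

```shell
# Hypothetical preflight check (not part of Jetpants): print the first
# service manager found on PATH, preferring systemctl over service.
service_manager() {
  if command -v systemctl >/dev/null 2>&1; then
    echo systemctl
  elif command -v service >/dev/null 2>&1; then
    echo service
  else
    # Neither command exists, so the host cannot be managed this way.
    return 1
  fi
}
```

A caller could run `service_manager || echo "unsupported host"` before attempting any process management.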

Right away this means we have some challenges with respect to infrastructure testing:

  • We need a Collins deployment
  • We need an environment with spare servers running MySQL
  • We need these spare servers to actually be servers, not light-weight Docker containers


For most of the life of Jetpants, these requirements were fulfilled using actual hardware in a testing pool in our datacenter. This wasn’t ideal, however. Running a test that allocated more replicas or exercised shard splitting meant tying up a large amount of real hardware that takes hours to reprovision. Testing changes to the Collins code meant talking to a real Collins deployment. What if we messed up?

This test strategy has all the hallmarks of manual testing. It doesn’t prevent regressions. Test coverage of our feature set is spotty, reflecting whatever was interesting at the time. Public contributors can’t run the tests.

For a new user, getting started with Jetpants and Collins can be very difficult. Jetpants requires Collins to be configured in certain ways that aren’t publicly documented. When I first built the testing environment, I had to regularly compare what I had to our actual deployment to figure out why Jetpants wasn’t working correctly.


During a Tumblr hackathon earlier this year, I devoted my time to developing an isolated, automatic testing system. We have since integrated this system directly into Jetpants and are using it in our day-to-day development and testing.

Our test framework is based on the NixOS test framework, the same framework NixOS uses to verify it is safe to release a new version. These tests use QEMU to start an isolated environment of at least one VM, and NixOS configuration to build the VMs.

Our testing framework adds lots of tooling on top to let us create robust tests. By default, a test has a running Collins instance, a master database server, and one replica. Simple options allow provisioning additional spares or additional replicas on that initial master.

Below is an example test we’ve written for performing a dead master promotion. This is where the current master database is dead, and we replace it with one of the existing replicas.

Here you can see what a test looks like, and how easily we can express the components and phases of our tests:

import ../make-test.nix ({ helpers, ... }:
{
  name = "shard-dead-master-promotion";
  starting-slave-dbs = 2;

  test-phases = with helpers; [
    (jetpants-phase "shutdown-master" ''
      ...
    '')
    (phase "jetpants-promotion" ''
      (
        echo "YES" # Approve for promotion
        echo "YES" # Approve after summary output. Confirmation.
      ) | jetpants promotion --demote= --promote=
    '')
    (assert-shard-master "POSTS-1-INFINITY" "")
    (assert-shard-slave "POSTS-1-INFINITY" "")
  ];
})

Running this test first provisions the base environment, by

  1. starting Collins
  2. starting 3 Linux systems running MySQL
  3. creating a master-replica relationship between one MySQL server as a master, and two MySQL servers as replicas, then loading in a default schema, and naming it the POSTS-1-INFINITY shard

Once all this preparation is done, our test phases begin.

First we shut down the current master to simulate a dead master situation. We then run the jetpants promotion command, which replaces the old master with a new master we have selected. The command prompts for confirmations, so we echo approvals to its stdin.
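
The stdin technique can be tried on its own, without Jetpants. In the sketch below, `fake_promotion` is a hypothetical stand-in for an interactive command: like the promotion command, it expects two confirmations, but it does nothing else. The point is only that grouping the `echo` calls in a subshell delivers both answers on a single stdin stream.

```shell
# Hypothetical stand-in for an interactive command: read two confirmation
# lines from stdin and proceed only if both are exactly "YES".
fake_promotion() {
  read -r first || return 1
  read -r second || return 1
  if [ "$first" = "YES" ] && [ "$second" = "YES" ]; then
    echo "promoted"
  else
    echo "aborted"
    return 1
  fi
}

# The subshell's combined output becomes the command's stdin, so each
# echo answers one prompt, in order.
run_with_approvals() {
  (
    echo "YES" # first prompt
    echo "YES" # second prompt
  ) | fake_promotion
}
```

Because the answers are consumed in order, a missing or different second line makes the stand-in abort rather than hang waiting on a terminal.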

We continue by validating that the jetpants command did what we expected, and verifying the master and slaves.
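
The implementation of the assertion helpers is not shown in this article, so the following is only a sketch of the general idea in shell: look up which host holds a role, compare it with the expected value, and fail the phase with a non-zero exit status on a mismatch. The topology format, the host names `db-b` and `db-c`, and the functions `hosts_with_role` and `assert_eq` are all assumptions made for illustration.

```shell
# Hypothetical lookup: read lines of "<shard> <role> <host>" on stdin and
# print the hosts that hold the given role in the given shard.
hosts_with_role() {
  awk -v s="$1" -v r="$2" '$1 == s && $2 == r { print $3 }'
}

# Hypothetical assertion: report success, or fail with a message on stderr
# and a non-zero status so the surrounding test phase fails.
assert_eq() {
  label=$1 expected=$2 actual=$3
  if [ "$expected" != "$actual" ]; then
    echo "FAIL: $label: expected '$expected', got '$actual'" >&2
    return 1
  fi
  echo "ok: $label"
}

# Made-up topology as it might look after a promotion.
topology='POSTS-1-INFINITY master db-b
POSTS-1-INFINITY slave db-c'

assert_eq "master" "db-b" \
  "$(printf '%s\n' "$topology" | hosts_with_role POSTS-1-INFINITY master)"
```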

Initial Results

Through this testing, we have already identified and fixed several race conditions and very old interface bugs. Because Nix is functional rather than a convergence-based configuration management tool, we can create and tear down test VMs in minutes. The stability of the test framework and the consistency of its results have allowed us to change the underlying code in Jetpants more aggressively while remaining confident that our tools will work correctly during day-to-day production maintenance.

Jetpants has been under continuous and vigorous development at Tumblr for many years now, and I’m excited about where the future will be taking MySQL automation at Tumblr.

