Building Migrations for Slack Enterprise Grid


by Eric Vierhaus and Todd Wirth

“The Raygun Gothic Rocketship at Pier 14 on the Embarcadero” by Tom Hilton (cc-by)

Slack Enterprise Grid lifted off in January 2017, allowing Slack to power the work behind even the largest and most complex companies in the world. To achieve this, our new product allows administrators to link multiple Slack teams together under one organization. When we set out to build the Enterprise product back in 2015, it was clear we’d need an entirely new data model while continuing to support all of our pre-existing teams.

Engine Room: Modeling Enterprise Data

Slack’s data model is based around collocating all the data for a Slack team on the same MySQL database shard. This works well for spreading load throughout the fleet, but presents challenges for data that transcends the boundary of a single Slack team. During initial team creation we generate a unique 64-bit unsigned integer that’s assigned to the team and provides a means to hash all team-centric data to the same location.
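To make the idea concrete, here is a minimal sketch of how an ID can pin all of a team's data to one shard. The shard count and the modulo hash are assumptions made for illustration; the post does not describe the actual hashing scheme.

```python
# Illustrative only: NUM_SHARDS and the modulo hash are assumptions,
# not a description of the production sharding scheme.
NUM_SHARDS = 256


def shard_for_id(entity_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Return the index of the shard that owns every row keyed by entity_id."""
    if not 0 <= entity_id < 2**64:
        raise ValueError("ID must fit in an unsigned 64-bit integer")
    return entity_id % num_shards


# Every lookup for the same team resolves to the same shard,
# so all of that team's tables end up collocated.
team_id = 513
messages_shard = shard_for_id(team_id)
files_shard = shard_for_id(team_id)
assert messages_shard == files_shard
```

Because the shard is a pure function of the ID, no lookup table is needed to find a team's data.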

With Slack Enterprise Grid you can connect many Slack teams together, even thousands, and communicate across those teams with shared channels. How do we fit the team-based database architecture into this design? It turns out, not very easily!

There were a number of problems we needed to address:

  • Where do “cross-team” channels like DMs live in a team-centric database model?
  • How do we migrate legacy data into the new Enterprise model both safely and quickly?
  • How do we notify other engineering teams when issues arise?
  • How do we monitor parallel asynchronous jobs?
  • How do we achieve all of this while also minimizing downtime for our customers?

We got to work.

Beam You Up: Transforming Your Slack Data

How do “cross-team” channels live in a team-centric database model? Making this as seamless as possible meant we needed a new data model for Slack Enterprise Grid. We chose to assign the Organization itself (the “umbrella” around all the various teams) to its own database shard. Since all teams in an organization store a pointer to this “parent” organization, we could easily route to the team’s local database shard or the organization shard depending on the type of data we wanted to access. Data stored on the organization’s shard is hashed with the organization’s 64-bit ID, just like data stored on the local team shards.
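The routing decision described above might look something like the following sketch. The `Team` class, the set of organization-scoped data kinds, and the modulo hash are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional

NUM_SHARDS = 256  # assumed shard count, for illustration only

# Assumed: the kinds of data that live on the organization's shard.
ORG_SCOPED_KINDS = {"dm", "group_dm"}


@dataclass(frozen=True)
class Team:
    team_id: int
    org_id: Optional[int] = None  # pointer to the parent organization, if any


def shard_for_data(team: Team, kind: str) -> int:
    """Pick the shard for a piece of data, based on its kind and the team's org."""
    if team.org_id is not None and kind in ORG_SCOPED_KINDS:
        owner_id = team.org_id   # cross-team data hashes on the organization ID
    else:
        owner_id = team.team_id  # everything else stays on the team's own shard
    return owner_id % NUM_SHARDS


grid_team = Team(team_id=513, org_id=1026)
print(shard_for_data(grid_team, "dm"))       # organization shard
print(shard_for_data(grid_team, "channel"))  # team shard
```

A team that has not joined an organization has no parent pointer, so all of its data keeps resolving to its own shard.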

Once we decided that data related to the organization would reside in a different physical place, we had two tasks: sending new messages and related data to the right DB host, and migrating all the old data to its new home. Any team IDs stored in database fields or within flat data blobs would need to be translated to the organization ID.

Additionally, in Enterprise Grid all user records are assigned new organization-wide 64-bit IDs as well, so that each user has a single ID value regardless of which team they access. Any references to old user IDs within the migrated data must be translated to the new ID as well. Since Direct Messages and Group Direct Messages (DMs and Group DMs, respectively) needed to be available regardless of which team you log into within the organization, this channel data needed to be moved from the team database shard to the organization database shard.
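A rough sketch of that translation step follows. The field names (`team_id`, `user_id`, `creator_id`) and the shape of the record are assumptions for the example; real records and blobs would look different.

```python
from typing import Any, Dict

# Assumed field names; a real schema would have its own list.
TEAM_FIELDS = {"team_id"}
USER_FIELDS = {"user_id", "creator_id"}


def translate_record(
    record: Dict[str, Any], org_id: int, user_id_map: Dict[int, int]
) -> Dict[str, Any]:
    """Return a copy of record with team IDs and user IDs rewritten for the org."""
    translated: Dict[str, Any] = {}
    for key, value in record.items():
        if key in TEAM_FIELDS:
            translated[key] = org_id
        elif key in USER_FIELDS:
            # A missing mapping raises KeyError, so bad data fails loudly.
            translated[key] = user_id_map[value]
        elif isinstance(value, dict):
            translated[key] = translate_record(value, org_id, user_id_map)
        elif isinstance(value, list):
            translated[key] = [
                translate_record(item, org_id, user_id_map)
                if isinstance(item, dict) else item
                for item in value
            ]
        else:
            translated[key] = value
    return translated


old = {"team_id": 11, "user_id": 1, "text": "hi",
       "reactions": [{"user_id": 2, "name": "tada"}]}
new = translate_record(old, org_id=99, user_id_map={1: 101, 2: 102})
print(new)
```

Returning a new dictionary rather than mutating the input keeps the source record intact, which makes it easier to compare old and new data during validation.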

Looking ahead, we knew all current and future features in Slack might need to go through this transformation process, so we wanted to build a robust and flexible framework. Instead of a set of one-off scripts we designed a generic set of data handlers that each exposed three separate interfaces: validation, transformation and insertion.

We set out to write data handlers for each piece of Slack data that needed migration. Through a careful audit of our backend code, we discovered quite a lot of Slack primitives, including files, channels, users, custom emoji and other channel-related data. We created a generic interface to each migration type containing sub-data handlers. When a new data type is added, the developer need only “fill in the blanks” for the handler. For example, when migrating custom emoji a developer adds a data handler to validate, transform, and insert the new emoji data on the new shard.
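One way to picture the three-interface handler is the sketch below. The class and method names are hypothetical, and a plain dictionary stands in for the organization's database shard.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable

Record = Dict[str, Any]


class DataHandler(ABC):
    """Generic handler: subclasses fill in validate, transform and insert."""

    @abstractmethod
    def validate(self, record: Record) -> bool: ...

    @abstractmethod
    def transform(self, record: Record) -> Record: ...

    @abstractmethod
    def insert(self, record: Record) -> None: ...

    def migrate(self, records: Iterable[Record]) -> int:
        """Run each record through validate, transform, insert; return the count."""
        migrated = 0
        for record in records:
            if not self.validate(record):
                raise ValueError(f"invalid record: {record!r}")
            self.insert(self.transform(record))
            migrated += 1
        return migrated


class EmojiHandler(DataHandler):
    """Hypothetical handler for custom emoji."""

    def __init__(self, org_id: int, destination: Dict[str, Record]) -> None:
        self.org_id = org_id
        self.destination = destination  # stand-in for the organization shard

    def validate(self, record: Record) -> bool:
        return bool(record.get("name")) and "team_id" in record

    def transform(self, record: Record) -> Record:
        return {**record, "team_id": self.org_id}

    def insert(self, record: Record) -> None:
        self.destination[record["name"]] = record  # keyed write, safe to repeat


org_shard: Dict[str, Record] = {}
handler = EmojiHandler(org_id=99, destination=org_shard)
handler.migrate([{"name": "party", "team_id": 11, "url": "https://example.com/p.png"}])
print(org_shard)
```

The shared `migrate` loop lives in the base class, so adding a new data type only requires the three small methods.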

The data handlers, including the master handler, use our asynchronous job queue system (we’ll get more into this in “The Bridge” section). Each and every handler is idempotent, ensuring that if an error occurs during migration we can safely restart an individual handler without fear of data loss or corruption.
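A small sketch of what idempotent insertion can mean in practice: writing the same row twice leaves the destination unchanged. This uses an in-memory SQLite database purely as a stand-in for a MySQL shard, and the `emoji` table and its columns are assumptions for the example.

```python
import sqlite3


def make_org_shard() -> sqlite3.Connection:
    """Create an in-memory database standing in for the organization shard."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE emoji ("
        " org_id INTEGER, name TEXT, url TEXT,"
        " PRIMARY KEY (org_id, name))"
    )
    return conn


def insert_emoji(conn: sqlite3.Connection, org_id: int, name: str, url: str) -> None:
    """Write one emoji row; repeating the call overwrites rather than duplicates."""
    conn.execute(
        "INSERT OR REPLACE INTO emoji (org_id, name, url) VALUES (?, ?, ?)",
        (org_id, name, url),
    )
    conn.commit()


def row_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM emoji").fetchone()[0]


conn = make_org_shard()
# Simulate a handler that is restarted and replays the same work.
for _attempt in range(2):
    insert_emoji(conn, 99, "party", "https://example.com/party.png")
print(row_count(conn))
```

Keying the write on a natural primary key is what makes a retry safe: the second attempt lands on the same row instead of creating a new one.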

Warp Drive: Moving Fast, but not Too Fast

After tackling the initial problem of migrating the data, we had to make it scale. How do we migrate legacy data into the new Enterprise model both safely and quickly? Some of our customers had millions of files and messages that needed to be migrated. As we expected, our application code ran much faster than our database hosts could reasonably sustain.

Databases have many performance characteristics; we focused on CPU utilization, replication lag, thread count and iowait. Although CPU utilization can be misleading, it provided a decent picture of how much load we were placing on the database.

How do we keep CPU utilization and replication lag down? Enter rate limiting! Slack uses a custom rate limiting system (that’s worthy of its own future blog post) to keep high volume external requests from overwhelming Slack’s servers. During Grid migration testing we observed that our own internal requests had the potential to overwhelm our systems as well. So we repurposed our rate limiting system to work with Grid migrations, allowing us to control our migrations so that they move at a steady, safe rate. This also kept CPU utilization and replication lag within acceptable limits.
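The post does not describe how the rate limiting system works internally. As a stand-in, here is a token bucket, one common way to hold a stream of work to a steady rate. The class name and parameters are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow roughly `rate` operations per second, with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate          # tokens added per second; can be changed at runtime
        self.capacity = capacity  # maximum tokens the bucket can hold
        self._clock = clock
        self._tokens = capacity
        self._updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False if the caller must wait."""
        now = self._clock()
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# A migration worker would check the bucket before each batch of writes.
bucket = TokenBucket(rate=50, capacity=100)
if bucket.try_acquire():
    pass  # run one batch against the database
else:
    pass  # back off and retry shortly
```

Because `rate` is a plain attribute, a tool outside the worker could adjust migration speed without restarting anything, which is the kind of control the next paragraph describes.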

The flexibility of our rate limit system also allowed us to build an internal GUI for our amazing Customer Experience team to manage migration speeds without needing an engineer available.

The Bridge: Commanding the Starship

For deferring work outside web requests, we have an asynchronous task queue which we call our “job queue” system. This is a set of dedicated hosts and application interfaces that allow engineers to enqueue tasks that can run in parallel and outside the bounds of a single HTTP request. Our job queue allows us to offload lengthy operations to specialized, dedicated hosts. When we began working on our Grid migration process, the job queue system was the obvious choice to perform the lengthy data migrations we needed. We worked with our operations team to create a new queue for Grid migrations utilizing a dedicated set of hosts. An isolated queue and worker pool was an ideal choice, so that we could monitor and scale the systems as needed, without affecting the latency or throughput of other jobs.

As part of Grid migrations, we added new capabilities to our job queue system to allow finer grained monitoring. We created a job queue inspector and used it to monitor and observe jobs as they progress through the job queue system. The inspector allows us to “tag” our migration tasks with a unique value, or values, which we can later use to observe jobs of a given tag or tags.

Tagging job tasks with a unique name allows us to observe the state of all migration tasks for a specific tag name or multiple named tags easily. We use this process to monitor the total work to be completed for each migration and to raise an alert if any of these tasks fail to complete successfully. The job queue inspector then powers our real-time progress bot and allows us to monitor the migration, moving automatically from one phase to the next. Building this system proved to be a crucial component of Grid migrations and other engineering teams have started employing the same application logic to monitor and observe their tasks in similar ways.
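To illustrate the tagging idea, here is a toy in-memory inspector. The class names, job states, and tag format are all assumptions; the real inspector sits on top of the job queue system.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass
class Job:
    job_id: str
    tags: FrozenSet[str]
    state: str = "queued"  # assumed states: queued, running, done, failed


class JobInspector:
    """Track jobs by tag so progress and failures can be queried per migration."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def track(self, job_id: str, *tags: str) -> None:
        self._jobs[job_id] = Job(job_id, frozenset(tags))

    def set_state(self, job_id: str, state: str) -> None:
        self._jobs[job_id].state = state

    def summary(self, tag: str) -> Counter:
        """Count jobs carrying `tag`, grouped by state."""
        return Counter(j.state for j in self._jobs.values() if tag in j.tags)

    def percent_done(self, tag: str) -> int:
        counts = self.summary(tag)
        total = sum(counts.values())
        return 100 * counts["done"] // total if total else 0

    def has_failures(self, tag: str) -> bool:
        return self.summary(tag)["failed"] > 0


inspector = JobInspector()
inspector.track("job-1", "migration:42", "files")
inspector.track("job-2", "migration:42", "emoji")
inspector.set_state("job-1", "done")
print(inspector.percent_done("migration:42"), inspector.has_failures("migration:42"))
```

A job can carry several tags, so the same data answers both "how far along is this migration?" and "how are all file handlers doing?".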

The job queue inspector has allowed us to run Grid migrations on full auto-pilot without the need for human operators. If an engineer wants to observe the process in real time, we built a special internal user interface for monitoring it.

The job queue inspector gave us the visibility into our Grid migrations process so that we could build automated tools to alert and inform our own teams of their progress.

Hailing Frequencies Open: Keeping Others at Slack Informed

Not long after our initial beta customers joined their first Grid organizations, we realized that we wanted — and very much needed — a rich set of tools to inform our own teams about the migration process from start to finish. With that in mind, our first thought was to create our own custom integrations that would share the scheduling, approval, migration and validation processes of our Grid migrations within public channels, for Slack employees to monitor and (of course) search. In fact, one of the aspects that makes working at Slack so great is that there are literally millions of messages and thousands of channels, cataloging the entirety of our company’s history, at each of our fingertips at any moment within Slack. We have a culture of transparency and openness, so we posted these messages in public channels, keeping the appropriate people and departments informed along the way.

We used Slack’s APIs and rich message attachment features to build and send real-time messages to a series of channels. For example, when one of our Customer Success Managers scheduled a new Grid migration, we sent a message to the relevant channel using our chat.postMessage API method.

One of the most beloved integrations has become our real-time migration progress bot, which monitors and reports on the data that we have migrated during the Grid migration process. After the first message is sent, we use our chat.update API to refresh the existing message with the latest progress every minute.
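The post-then-update flow can be sketched as follows. The `chat.postMessage` and `chat.update` method names and their `channel`, `text` and `ts` fields come from Slack's Web API; everything else here (the `ProgressMessage` class, the injected `call_api` function, the text progress bar) is hypothetical, and no real HTTP request is made.

```python
from typing import Any, Callable, Dict, Optional

# call_api(method_name, payload) -> parsed JSON response. Injected so that this
# sketch stays independent of any particular HTTP client or token handling.
ApiCall = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def render_progress(done: int, total: int, width: int = 20) -> str:
    """Render a plain-text progress bar for the message body."""
    filled = width * done // total if total else 0
    percent = 100 * done // total if total else 0
    return f"[{'#' * filled}{'.' * (width - filled)}] {percent}% ({done}/{total})"


class ProgressMessage:
    """Post one message, then keep editing it in place as progress changes."""

    def __init__(self, call_api: ApiCall, channel: str) -> None:
        self._call_api = call_api
        self._channel = channel
        self._ts: Optional[str] = None  # timestamp that identifies the posted message

    def report(self, done: int, total: int) -> None:
        text = render_progress(done, total)
        if self._ts is None:
            response = self._call_api(
                "chat.postMessage", {"channel": self._channel, "text": text}
            )
            self._ts = response["ts"]
            self._channel = response["channel"]  # chat.update expects the channel ID
        else:
            self._call_api(
                "chat.update",
                {"channel": self._channel, "ts": self._ts, "text": text},
            )
```

Updating a single message in place, rather than posting a new one each minute, keeps the channel readable while still giving a live view of the migration.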

These automated alerts have allowed our Sales and Customer Success teams to inform customers of any changes made during scheduling and provide minute-by-minute progress reports during the migration process itself.

At Slack we use collectd and StatsD compatible services throughout our ecosystem to record numerous types of host and application metrics quickly and easily. During development of our project, we added small bits of application logic to record and observe counting and timing data around important and interesting parts of our project code. Our application framework provides a few simple interfaces for collecting these types of metrics.
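As an illustration of recording counting and timing data, here is a minimal StatsD-style client. The wire format (`name:value|c` for counters, `name:value|ms` for timers, sent over UDP) is the standard StatsD protocol; the metric prefix, metric names, and class are assumptions, and this is not the interface our application framework provides.

```python
import socket
import time
from contextlib import contextmanager
from typing import Iterator

PREFIX = "grid_migration"  # hypothetical metric namespace


def counter_line(name: str, count: int = 1) -> str:
    """Format a StatsD counter datagram."""
    return f"{PREFIX}.{name}:{count}|c"


def timer_line(name: str, millis: int) -> str:
    """Format a StatsD timer datagram, in milliseconds."""
    return f"{PREFIX}.{name}:{millis}|ms"


class StatsClient:
    """Fire-and-forget UDP client for a StatsD-compatible service."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125) -> None:
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, line: str) -> None:
        self._sock.sendto(line.encode("ascii"), self._addr)

    def incr(self, name: str, count: int = 1) -> None:
        self._send(counter_line(name, count))

    @contextmanager
    def timed(self, name: str) -> Iterator[None]:
        """Time the enclosed block and emit the duration when it exits."""
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed_ms = int((time.monotonic() - start) * 1000)
            self._send(timer_line(name, elapsed_ms))
```

A handler could then wrap its insert step in `with stats.timed("emoji.insert"):` and call `stats.incr("emoji.migrated")` after each record, which is the sort of small addition the paragraph above describes.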

The collected host and application metric data is pushed through our data pipelines, aggregated and made available to all engineers and staff via a Graphite cluster. For our Grid migrations project, and most projects at Slack, we built a custom Grafana dashboard with both the metrics we recorded for our feature and other useful metrics pulled from the hosts which would execute our application code.

Visualizing the system in real time is often the best way to spot and diagnose application issues and performance bottlenecks before they become a customer issue. We use these tools on a daily basis to monitor the health and status of these systems and to optimize our code accordingly. We built one such dashboard specifically for our Grid migrations.

Building the tools and systems to track our Grid migrations from start to finish has allowed us to build confidence, identify issues and continue to improve the process of migrating our customers to Enterprise Grid.

Captain’s (Epi)log(ue)

After iterating on the above systems over many test migrations, we took the plunge and migrated Slack’s own internal teams to Grid. We perfected the systems and added automatic data validation and consistency checks post-migration. Since late 2016, we’ve successfully migrated many of Slack’s largest customers to Grid, all with minimal downtime and no data loss. Making Grid migrations happen continues to be a team effort, and many thanks go out to the hard-working folks on the Slack Enterprise team for pulling this off. You can find more information about Slack Enterprise Grid and the customers using it at https://slack.com/enterprise.


