Scaling is hard. Design decisions that initially seemed reasonable break down with little warning, and suddenly even the simplest parts of your data model need to go through a complex re-architecture.
We’re tackling this problem at Slack. A lot of our early design decisions made sense for small workspaces, but can be inefficient for large Enterprise Grid workspaces, where we can have thousands of users in one organization. For example, we have a MySQL table for workspaces, which has information like workspace names and URLs. When Slack was growing, it seemed perfectly reasonable to store workspace preferences (like default language and channels, and permissions on who can create or archive channels) as a JSON blob in this table.
But Slack has grown, and workspace preferences have grown with it. We now offer 165 preferences that you can use to customize your workspace. These are utilized by more than 150 Enterprise Grid organizations, and 70,000+ paid workspaces. And with all these new users and large workspaces, the workspace pref JSON blob can become larger than 55 kB 😱
Why prefs are a problem
The biggest problem with having such a large JSON blob is that it’s stored in the `workspaces` table, which we access a lot. Oftentimes, our code fetches one preference at a time — for example, we’ll check whether a workspace’s analytics page is only accessible to admins before loading the page for a non-admin user. But that means querying the entire `workspaces` table, which we’re already accessing for other critical information about workspaces. Sending too many queries to this table could overwhelm it, reducing reliability for all of those other places in our code where we need to use it.
Caching helps reduce queries to the database, but the sizes of our workspace pref blobs hurt the current caching scheme. Our code caches all of a workspace’s information in the `workspaces` table in one object. So accessing one workspace pref from the cache means getting tons of extra information, which is a large unnecessary load on our Memcached hosts. And changing exactly one workspace preference invalidates that entire cached object!
I’m a rising junior at Harvard studying Computer Science, and a software engineering intern on Slack’s Enterprise Architecture team this summer. My team is responsible for restructuring parts of our data model to make Slack more scalable. For my summer internship, I contributed to my team’s efforts by re-architecting workspace prefs.
So how did I re-architect workspace prefs? The model I created (which might help with your own re-architecture efforts) can be split into three buckets:
- Creating a new data model and database table
- Migrating existing data to the new table with double-writes and backfill scripts
- Accessing the data in the new table in existing code locations and in new, generated functions, as well as changing your caching scheme to reflect the new data model
Creating a new data model
There are many possible data models for storing workspace prefs. We could store a workspace ID with a JSON blob of all of those workspace’s preferences (but as previously discussed, lumping all preferences together made us retrieve a lot of unnecessary information when accessing a single preference). Another strategy would be to create a database table with a column for each preference, so each workspace would have a single row in the table. But as we continue to add more customization for workspaces, this would mean adding new columns to the table — an expensive procedure.
To split up workspace preferences and give us flexibility to add new preferences in the future, I chose to use an Entity/Attribute/Value (EAV) table. The table has three columns: the workspace ID, the name of the preference, and a JSON blob of the value for that preference. Each pref is stored as a new row in the table, meaning we don’t get any extra information when fetching only one pref. It also lets us add new preferences without needing to change the structure of the table.
| workspace_id | pref_name | pref_value |
| --- | --- | --- |
| 1 | stats_only_admin | true |
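Slack’s table lives in MySQL behind Vitess, but the shape is easy to sketch. Here is a minimal, illustrative version in SQLite; the column names come from the table above, while the types and composite primary key are my assumptions:

```python
import sqlite3

# Illustrative sketch only: Slack's real table lives in MySQL behind Vitess.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE workspace_prefs (
        workspace_id INTEGER NOT NULL,
        pref_name    TEXT    NOT NULL,
        pref_value   TEXT    NOT NULL,  -- JSON-encoded value
        PRIMARY KEY (workspace_id, pref_name)
    )
""")
conn.execute(
    "INSERT INTO workspace_prefs VALUES (?, ?, ?)",
    (1, "stats_only_admin", "true"),
)
# Fetching one pref touches exactly one small row, not a giant blob.
row = conn.execute(
    "SELECT pref_value FROM workspace_prefs "
    "WHERE workspace_id = ? AND pref_name = ?",
    (1, "stats_only_admin"),
).fetchone()
```

The `(workspace_id, pref_name)` primary key means each pref is addressable on its own, which is exactly what makes single-pref reads and per-pref cache invalidation cheap.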
Once the table’s schema was chosen, it had to be created in our Vitess cluster. Vitess is a tool that helps reduce the load on MySQL databases by creating multiple databases and routing each request to a specific database, like a hash table. Slack uses Vitess to ensure we can provide reliable access to our database as we continue scaling (check out a talk about using Vitess at Slack from one of our senior staff engineers, Mike Demmer, who also helped me by creating the new table in Vitess!).
Migrating existing data
Once the new table was created in Vitess, it was time to fill it with data. This process was complicated because of the new format of the data; while Slack’s infrastructure team has great support for migrating data from one table to another, I couldn’t use these tools, since I was splitting up one row in the `workspaces` table into many rows in the new workspace pref table. So I had to create functions that could split up the preferences into a list of key/value pairs, and turn each of those pairs into a row that could be inserted into the new table.
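That transform can be sketched roughly like this; the function name and blob format are illustrative, not Slack’s actual code:

```python
import json

def prefs_blob_to_rows(workspace_id, prefs_json):
    """Split one workspace's pref blob into (workspace_id, pref_name, pref_value)
    rows for the EAV table. Names here are illustrative."""
    prefs = json.loads(prefs_json)
    return [
        # Each value is re-encoded as its own small JSON blob.
        (workspace_id, name, json.dumps(value))
        for name, value in prefs.items()
    ]

rows = prefs_blob_to_rows(
    1, '{"stats_only_admin": true, "default_channels": ["general"]}'
)
```

One row in the old table becomes `len(prefs)` rows in the new one, each independently insertable.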
With the ability to transform the data into something writable, it was time to start writing the data. I started by doing “double writes,” meaning any time a workspace updated its preferences and updated data in the existing `workspaces` table, that data was also written to the new table. I also created a backfill script that went through every single workspace in our database and inserted their preferences into the new table. These two processes meant that in a matter of days, the new table contained a complete and up-to-date view of a workspace’s preferences.
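The interplay of double-writes and backfill might look something like this sketch, with plain dicts standing in for the two tables (all names here are hypothetical):

```python
import json

# In-memory stand-ins for the two stores.
legacy_blobs = {}   # workspace_id -> dict of all prefs (old schema)
eav_rows = {}       # (workspace_id, pref_name) -> JSON value (new schema)

def set_pref(workspace_id, name, value):
    """Double-write: every pref update hits both the old blob and the new table."""
    blob = legacy_blobs.setdefault(workspace_id, {})
    blob[name] = value
    eav_rows[(workspace_id, name)] = json.dumps(value)

def backfill(workspace_ids):
    """One-time script: copy every pre-existing pref into the new table."""
    for ws_id in workspace_ids:
        for name, value in legacy_blobs.get(ws_id, {}).items():
            eav_rows[(ws_id, name)] = json.dumps(value)

legacy_blobs[1] = {"stats_only_admin": True}   # data written before the migration
backfill([1])                                  # backfill catches old rows
set_pref(1, "default_language", "en-US")       # double-write catches new updates
```

Backfill covers everything written before the migration started; double-writes cover everything after, so together the new table converges on a complete view.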
After the table was filled with correct data, I started a gradual transition to read workspace prefs only from the new table. I started with “dark mode” reads, where any time we want to access workspace prefs, we fetch them from both the existing and the new table. The prefs are compared to catch any inconsistencies, and then the existing data is returned. This helps ensure the data is accurate before actually switching to the new table. I also added some logging to track how long it took to fetch prefs from the new table, which ensured that this change would not impose any performance penalties.
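A dark-mode read can be sketched as follows. The helper names and logging details are assumptions, but the shape (fetch both, compare, return the old data) is the point:

```python
import logging
import time

# Hypothetical stand-ins for the two read paths; Slack's real fetch code differs.
def read_legacy_blob(workspace_id):
    return {"stats_only_admin": True}   # prefs from the old workspaces table

def read_eav_rows(workspace_id):
    return {"stats_only_admin": True}   # prefs reassembled from the new table

def get_prefs_dark(workspace_id):
    """Dark-mode read: fetch from both tables, compare, log any mismatch and
    the new table's latency, but always return the existing table's data."""
    old = read_legacy_blob(workspace_id)
    start = time.monotonic()
    new = read_eav_rows(workspace_id)
    logging.info("EAV fetch took %.1f ms", (time.monotonic() - start) * 1000)
    if old != new:
        logging.warning("pref mismatch for workspace %s", workspace_id)
    return old  # callers still see legacy data until "light mode"

prefs = get_prefs_dark(1)
```

Because callers only ever see the legacy data, any bug in the new path shows up as a logged mismatch rather than as user-visible breakage.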
This dark mode helped uncover bugs that weren’t even caused by the re-architecture work. For example, when creating new workspaces, we called a function to set some initial preferences for that workspace. But this function took the workspace by value, and not by reference. So we would write these initial prefs to the workspaces table, but once that function returned, we didn’t update the workspace object in our code to have those new prefs. My new table was inconsistent because it actually persisted those initial prefs, whereas the existing code would overwrite the prefs. This bug also showed the importance of the re-architecture work; if you store workspace preferences separate from the rest of the workspace object, you don’t need to worry about having to pass a workspace by reference or updating the object after a function returns.
Once I fixed that bug, and created and ran a backfill script to make the new table consistent with the existing one, it was time for “light mode”: now we return the data from the new table instead of ignoring it. I rolled this out slowly, using feature flags to turn on light mode for a few test workspaces, then for our internal Slack, and finally for all of our users.
At this point, our code was reading from and writing to the new table, but the work wasn’t done. There was still the problem of fetching unnecessary information; to read a single pref, our code often calls a function that fetches an array of all of a workspace’s prefs, and then accesses the one that’s needed. To fix this problem, I wrote a script that generated getter functions for each individual pref. Not only would this allow developers to get only the information they need, it would also enforce data types for each of the preferences, since each of the functions would have a return type (something we’re trying to do to all of our code as part of our conversion to the Hack language).
To create these getter functions, the script iterates through a list of all the existing workspace prefs. This list specifies default values, data types, and permissions about who can change these prefs. The script grabs the name of each pref, as well as the data type, to create the return value. But some of the prefs didn’t have types specified! I specified the types of 12 prefs, and even found 14 prefs that were deprecated; they didn’t affect any behavior for the users, but were still floating around in the codebase and needed to be removed. This work helped create a cleaner codebase and will make it easier for other Slack engineers to understand what these prefs should store and how they should be accessed. We value craftsmanship at Slack, and believe that these small efforts can have a big impact in improving code quality and helping other engineers.
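The generation step can be sketched like this; the spec format and the emitted Hack signatures are simplified assumptions, not Slack’s real generator:

```python
# Sketch of the code-generation step: one typed getter per pref, emitted from
# a spec list. The spec shape and the Hack output below are assumptions.
PREF_SPECS = [
    {"name": "stats_only_admin", "type": "bool"},
    {"name": "default_language", "type": "string"},
]

TEMPLATE = (
    "function get_{name}(int $workspace_id): {hack_type} {{\n"
    "  return pref_fetch_{hack_type}($workspace_id, '{name}');\n"
    "}}\n"
)

def generate_getters(specs):
    # A pref with no declared type raises KeyError here, which is exactly how
    # the missing-type prefs mentioned above would surface during generation.
    hack_types = {"bool": "bool", "string": "string", "int": "int"}
    return "\n".join(
        TEMPLATE.format(name=s["name"], hack_type=hack_types[s["type"]])
        for s in specs
    )

generated = generate_getters(PREF_SPECS)
```

Each emitted function fetches exactly one pref and carries a return type, so callers get type-checking for free instead of digging through an untyped array.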
Finally, I needed to change our caching structure to reflect the EAV format. I switched from caching all of the prefs together to caching each one individually. Then, when prefs are updated, only the new prefs are invalidated in the cache, instead of the entire workspace object.
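Per-pref caching can be sketched as follows, with a dict standing in for Memcached and an assumed key format:

```python
# Sketch of per-pref caching: each pref gets its own cache key, so updating one
# pref evicts one entry instead of the whole workspace object.
cache = {}  # stand-in for Memcached; the key scheme below is an assumption

def pref_cache_key(workspace_id, pref_name):
    return f"workspace_pref:{workspace_id}:{pref_name}"

def cache_pref(workspace_id, pref_name, value):
    cache[pref_cache_key(workspace_id, pref_name)] = value

def invalidate_pref(workspace_id, pref_name):
    cache.pop(pref_cache_key(workspace_id, pref_name), None)

cache_pref(1, "stats_only_admin", True)
cache_pref(1, "default_language", "en-US")
invalidate_pref(1, "stats_only_admin")  # the other pref stays cached
```

Under the old scheme, that one update would have invalidated the entire cached workspace object; here it evicts a single small entry.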
With the new table, we’ve substantially reduced workspace cache invalidation, and enabled us to continue adding prefs. The getter functions have also allowed us to better validate workspace prefs and ensure type-safety in our code.
As a company grows, re-architecture work becomes inevitable. At Slack, we’ve seen this as we’ve grown to support more than 8 million daily active users and 70,000 paid teams. Seemingly sensible design decisions break down at scale; when we offer 160+ workspace preferences, the JSON blobs will grow and impact other work we do with workspaces. Slack’s workspace pref re-architecture can provide a model for how to change data models without affecting how people use Slack.
At a dinner with other Bay Area interns this summer, someone said “internships are great but you won’t learn that much; it’s not like you get to come in and re-architect the codebase.” Luckily, Slack let me do exactly that. I learned about sharded database management, data migration, and code generation — all while making our codebase more scalable and easier to navigate for future engineers.
Thank you to my manager Eric Vierhaus and my mentor Johan Oskarsson for their constant support and guidance during my internship!