Over 10 million users across the globe rely on Slack every day to collaborate with their colleagues. As our user base has grown, so has our focus on enhancing the performance of our features and ensuring their ability to perform under load. This was especially true for Enterprise Key Management (EKM), which launched earlier this year for Enterprise Grid customers.
EKM allows our most security-conscious customers to use their own keys to encrypt and decrypt their data in Slack, including fundamental aspects of Slack such as sending and receiving messages. Our goal was to ensure that it would work seamlessly at scale when we launched.
We’d like to share our strategy for load testing EKM, how our tooling reinforces service ownership, and how this tool (and others like it) can act as force-multipliers for other engineers.
6 Steps to load testing EKM
Context is key
Designing load tests for this feature required a meticulous understanding of EKM and a systems-level view of how it fits within the greater Slack architecture.
Slack has a (sometimes surprisingly) complex architecture with many services that work together to respond to requests as fast as possible. At a high level, there is a backend server, which we will refer to as webapp, that is responsible for handling web/API requests and federating database access.
Webapp talks to a set of real-time stateful services, which handle the fanning out of events to connected clients. This allows us to send messages in real time and know the moment our colleagues begin typing in a given channel.
EKM is a new service within Slack’s architecture that webapp hooks into. As a result, our native clients are unaware of the EKM service, and encryption and decryption are handled by webapp. We designed our tests with a focus on the server-side APIs, where we could easily measure EKM’s impact on real-time services.
1. Identifying critical paths
Our tests were built to place sustained load on our web APIs and observe the impact on database health and the real-time services.
The first step was to identify endpoints that serve message and file data, since those requests result in a downstream connection from webapp to EKM for encryption/decryption. To do this, we analyzed API usage patterns by our native clients (iOS, Android, desktop, web) and selected endpoints that were (1) most frequently called and (2) interacted with the EKM service.
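As a rough illustration of the two selection criteria above (not the analysis we actually ran), the filter-then-rank step could be sketched in Go as follows. The `EndpointStats` type, the method names, and the call counts are all hypothetical placeholders.

```go
package main

import (
	"fmt"
	"sort"
)

// EndpointStats is a hypothetical summary of one API method's usage.
type EndpointStats struct {
	Method     string // placeholder method name, not a real API method
	Calls      int    // calls observed in the sample window (made-up numbers)
	TouchesEKM bool   // true if serving the request requires encrypt/decrypt
}

// SelectCriticalPaths keeps only endpoints that touch EKM and returns
// up to n of them, most frequently called first.
func SelectCriticalPaths(stats []EndpointStats, n int) []EndpointStats {
	var candidates []EndpointStats
	for _, s := range stats {
		if s.TouchesEKM {
			candidates = append(candidates, s)
		}
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].Calls > candidates[j].Calls
	})
	if len(candidates) > n {
		candidates = candidates[:n]
	}
	return candidates
}

func main() {
	stats := []EndpointStats{
		{"example.readMessages", 9000, true},
		{"example.presencePing", 20000, false},
		{"example.fetchFile", 4000, true},
		{"example.sendMessage", 6000, true},
	}
	for _, s := range SelectCriticalPaths(stats, 2) {
		fmt.Println(s.Method, s.Calls)
	}
}
```

Filtering before ranking matters here: a very hot endpoint that never reaches EKM would otherwise crowd out the paths the test is meant to exercise.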
2. Gathering a baseline
It was crucial for us to gather baseline values to guide our analyses and better understand our test results. We created a target benchmark based on the average rate of requests for our largest enterprise customers during peak hours. We then compared the success rate and response timings of the load testing against the benchmark.
Our goal was to simulate double the load created by our largest customers during our tests, so that we could be confident that the service will perform for customers today and into the future as we continue to grow.
3. Creating consistent test environments
A critical component of load testing is creating environments that will mimic production users and workspaces. To make this process easier, we created a tool for quickly spinning up workspaces and users, then populating them with ample message and file data. Since we wanted to compare our experimental data against a baseline, we relied on this tool to create and populate two orgs, one control and one with the EKM feature enabled. This tool allowed us to create near-identical enterprises with identical data, thus reducing the noise in our analysis.
4. One tool to rule them all
Before we took on this project, there were many load testing tools available, scattered across different repos. Our first plan of action was to consolidate the different tools and make them more accessible, which led to the birth of a new loadtest service. This new service allowed us to generate API calls and execute them all within a single place.
At a high level, this service generates calls to a given endpoint and issues them at the specified rate, which lets us fine-tune and simulate load on specific APIs. The service is written entirely in Go and takes advantage of features like goroutines and channels for concurrency. We created a single CLI tool that provides fine-grained control over the rate and concurrency of API calls, as well as safety valves. This makes the tests more accessible, allowing any engineer at Slack to use them in their workflow.
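To make the rate, concurrency, and safety-valve ideas concrete, here is a minimal sketch of how such an executor might be structured with goroutines and channels. It is not the actual loadtest code: `Blast`, `CallFunc`, `Result`, the `maxFailures` threshold, and the `example.invalid` URLs are all assumptions for illustration, and the API call is injected as a function so the sketch runs without a network.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// ErrSimulated is returned by the fake call in main to mimic a failed request.
var ErrSimulated = errors.New("simulated failure")

// CallFunc performs one API call. A real tool would issue an HTTP request here.
type CallFunc func(target string) error

// Result summarizes one run.
type Result struct {
	Sent    int64
	Failed  int64
	Aborted bool // true if the safety valve stopped the run early
}

// Blast dispatches targets at roughly ratePerSec to `concurrency` workers.
// It stops dispatching once more than maxFailures calls have failed.
// If every worker is busy, dispatch blocks, so the achieved rate can fall
// below the requested one.
func Blast(targets []string, ratePerSec, concurrency int, maxFailures int64, call CallFunc) Result {
	if ratePerSec <= 0 || concurrency <= 0 {
		return Result{}
	}
	interval := time.Second / time.Duration(ratePerSec)
	if interval <= 0 {
		interval = time.Nanosecond
	}

	jobs := make(chan string)
	var sent, failed int64
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for target := range jobs {
				atomic.AddInt64(&sent, 1)
				if err := call(target); err != nil {
					atomic.AddInt64(&failed, 1)
				}
			}
		}()
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	aborted := false
	for _, target := range targets {
		if atomic.LoadInt64(&failed) > maxFailures {
			aborted = true // safety valve: too many errors, stop adding load
			break
		}
		<-ticker.C // pace dispatch to the requested rate
		jobs <- target
	}
	close(jobs)
	wg.Wait()
	return Result{Sent: atomic.LoadInt64(&sent), Failed: atomic.LoadInt64(&failed), Aborted: aborted}
}

func main() {
	targets := make([]string, 20)
	for i := range targets {
		targets[i] = fmt.Sprintf("https://example.invalid/api/method?n=%d", i)
	}
	fake := func(target string) error {
		if target == targets[3] {
			return ErrSimulated
		}
		return nil
	}
	res := Blast(targets, 200, 5, 10, fake)
	fmt.Printf("sent=%d failed=%d aborted=%v\n", res.Sent, res.Failed, res.Aborted)
}
```

The ticker bounds how fast requests are started, while the worker count bounds how many are in flight at once. Keeping the two knobs separate is what allows the rate and the concurrency to be tuned independently.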
5. Provisioning a machine
It is very easy for an engineer at Slack to spin up a development host within our infrastructure. We built on that provisioning service, adding tooling that automates the setup and improves diagnostics and error handling. To minimize time spent debugging manually, the script performs basic remediation steps on behalf of the engineer to work around common errors. An engineer who wants to run load tests can run slack loadtest --bootstrap, which executes the following steps:
- Creates a development server
- Installs the latest version of the
- Clones and installs the loadtest repository and its dependencies
- Performs a simple test to confirm the tooling is installed correctly
By decreasing the work required to get started, we make it easier for engineers to adopt this form of testing. If there’s a problem the tool doesn’t solve, engineers can copy and paste the command output, which is clear enough to help others assist them. In summary, a single CLI command provisions a bootstrapped machine with all required dependencies in under 5 minutes.
6. Tracking metrics and analyzing data
With the introduction of a new service, there are always uncertainties about how it will perform and integrate with the rest of the stack. Even while running tests, there are numerous things that could go wrong when certain parts of the system are put under too much load. During test runs, data was piped into dashboards that we monitored. The main metrics we monitored were API method error rates, EKM service latency, and database cluster health. For the database clusters, CPU idle, replication lag, and the number of threads connected were high-signal indicators of problem areas.
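As an illustration of how error rate and latency can be reduced to comparable numbers for a control org and an EKM-enabled org, here is a small Go sketch. It is not our dashboard pipeline: the `Sample` and `Summary` types, the nearest-rank percentile method, and the sample values are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Sample is one observed API call (hypothetical shape).
type Sample struct {
	Latency time.Duration
	OK      bool
}

// Summary holds the headline numbers for one test run.
type Summary struct {
	Count     int
	ErrorRate float64 // fraction of calls that failed, 0.0 to 1.0
	P50, P95  time.Duration
}

// percentile returns the nearest-rank percentile of an ascending slice.
func percentile(sorted []time.Duration, p int) time.Duration {
	if len(sorted) == 0 {
		return 0
	}
	rank := (p*len(sorted) + 99) / 100 // ceil(p/100 * n) in integer math
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// Summarize computes error rate and latency percentiles over all samples,
// including failed calls.
func Summarize(samples []Sample) Summary {
	if len(samples) == 0 {
		return Summary{}
	}
	latencies := make([]time.Duration, 0, len(samples))
	failures := 0
	for _, s := range samples {
		latencies = append(latencies, s.Latency)
		if !s.OK {
			failures++
		}
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	return Summary{
		Count:     len(samples),
		ErrorRate: float64(failures) / float64(len(samples)),
		P50:       percentile(latencies, 50),
		P95:       percentile(latencies, 95),
	}
}

func main() {
	// Made-up samples standing in for a control org and an EKM-enabled org.
	control := []Sample{
		{40 * time.Millisecond, true}, {55 * time.Millisecond, true},
		{48 * time.Millisecond, true}, {90 * time.Millisecond, true},
	}
	ekm := []Sample{
		{52 * time.Millisecond, true}, {61 * time.Millisecond, true},
		{70 * time.Millisecond, false}, {120 * time.Millisecond, true},
	}
	c, e := Summarize(control), Summarize(ekm)
	fmt.Printf("control: errors=%.2f p50=%v p95=%v\n", c.ErrorRate, c.P50, c.P95)
	fmt.Printf("ekm:     errors=%.2f p50=%v p95=%v\n", e.ErrorRate, e.P50, e.P95)
}
```

Percentiles are used rather than averages because a handful of slow requests under load can hide behind a healthy-looking mean.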
One of the main benefits from our testing was uncovering bugs and areas for improvement within Slack’s codebase and infrastructure. We uncovered several codepaths that generated excessive load on other systems or performed unnecessary expensive operations. We were able to ship fixes for these types of regressions and validate our fixes in a controlled environment through our tests before our customers ever experienced them.
Load testing allowed us to:
- Identify performance bottlenecks
- Gain confidence that the system would work in production
- Get a clear view into how EKM plugs into all the interconnected pieces of Slack’s backend under load
A look at how we tested posting messages
We tested the performance of sending messages using an organization with EKM enabled. The API that we use for sending messages for both internal and external clients is
Using the loadtest tool, we auto-generated a .txt file containing URLs pointing to our target endpoint. These URLs are generated programmatically with unique session tokens to simulate load from different users into randomly selected channels.
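The generation step might look something like the following Go sketch, which emits one URL per user token with a randomly chosen channel. This is an illustration only: the endpoint URL, the `token`, `channel`, and `text` parameter names, and the token and channel values are hypothetical, and the seed is fixed just to make runs repeatable.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/url"
)

// BuildTargets returns one URL per token. Each URL posts into a channel
// picked at random from channels, simulating load from different users.
func BuildTargets(endpoint string, tokens, channels []string, seed int64) []string {
	if len(channels) == 0 {
		return nil
	}
	rng := rand.New(rand.NewSource(seed))
	lines := make([]string, 0, len(tokens))
	for i, tok := range tokens {
		q := url.Values{}
		q.Set("token", tok) // hypothetical parameter names
		q.Set("channel", channels[rng.Intn(len(channels))])
		q.Set("text", fmt.Sprintf("load test message %d", i))
		lines = append(lines, endpoint+"?"+q.Encode())
	}
	return lines
}

func main() {
	tokens := []string{"token-user-1", "token-user-2", "token-user-3"}
	channels := []string{"C-general", "C-random"}
	// Prints one URL per line; redirect stdout to a .txt file to save them.
	for _, line := range BuildTargets("https://example.invalid/api/method", tokens, channels, 42) {
		fmt.Println(line)
	}
}
```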
These calls were piped into the loadtest executor, which allowed us to specify the length of test run, concurrency, and rate of requests.
go run api_blast.go -concurrency=50 -rate=100
With a simple invocation like the one shown above, we could simulate sending messages at a rate of 100 requests per second over a period of 5 minutes. We repeated this process for the identified API endpoints and incrementally ramped up the rate to sustain load for longer periods of time.
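For a sense of scale, 100 requests per second sustained for 5 minutes works out to 100 × 300 = 30,000 requests. A ramp like the one described can be planned as a list of stages, sketched below in Go. The `Stage` type, the `RampPlan` helper, and the specific start, step, and hold values are assumptions for illustration, not the schedule we actually used.

```go
package main

import (
	"fmt"
	"time"
)

// Stage is one step of a ramp: hold a fixed request rate for a duration.
type Stage struct {
	RatePerSec int
	Hold       time.Duration
}

// RampPlan builds stages from start up to target in increments of step.
// The final stage always runs at exactly the target rate.
func RampPlan(start, target, step int, hold time.Duration) []Stage {
	if start <= 0 || step <= 0 || target < start {
		return nil
	}
	var plan []Stage
	for r := start; r < target; r += step {
		plan = append(plan, Stage{RatePerSec: r, Hold: hold})
	}
	return append(plan, Stage{RatePerSec: target, Hold: hold})
}

// TotalRequests is the number of requests a stage issues at its full rate,
// counting whole seconds only.
func TotalRequests(s Stage) int {
	return s.RatePerSec * int(s.Hold/time.Second)
}

func main() {
	total := 0
	for _, s := range RampPlan(25, 100, 25, 5*time.Minute) {
		n := TotalRequests(s)
		total += n
		fmt.Printf("rate=%d/s hold=%v requests=%d\n", s.RatePerSec, s.Hold, n)
	}
	fmt.Println("total requests:", total)
}
```

Stepping the rate up in stages makes it easier to tell at which load level a metric first degrades, rather than only learning that the final rate was too much.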
One of the most exciting things about working at a growing company is being able to create tools that act as force multipliers of developer productivity. Using this standardized workflow, others can now take the tooling we created and apply the same process to their own features, enabling them to ship with the same confidence that we had.
However, even the best-designed tools are useless if people don’t know they exist or how to use them in their work. Onboarding engineers who are interested in the tools is as important as building them. To that end, we revamped documentation, presented at department-wide meetings, and held office hours where engineers could drop in to discuss how they could incorporate the load testing service into their workflows. As a result, we are already seeing people take ownership of the new service as they contribute improvements back to the tooling.
We knew that running load tests would be valuable for the scope of EKM. What we didn’t know was the impact this tool would have on workflows across the engineering organization.
A great bonus from this project was enabling our colleagues to easily grasp how their new feature will work end to end within the greater Slack ecosystem under load. Since releasing this new tooling, more and more teams are adopting load tests as a part of their feature development process to better understand how their feature performs at scale.
As a whole, this framework enabled us to see the comprehensive impact of a feature on an entire system under load. We were able to mimic harsh conditions to see how features behave in the extremes, something that is critical as some bugs only surface at scale. Most importantly, we were able to confidently ship EKM and know that it will perform well from day one.
Slack has 10M+ daily active users who rely on our service to do the best work of their lives. Our job as engineers is to make sure that the work we produce can support the scale of our largest customers so that the experience continues to be delightful as we grow. Load testing has become increasingly important in our ability to deliver this with confidence.
We believe that running load tests shouldn’t be hard — the hardest part of your work should be designing and implementing the feature. This project was built by engineers with minimal load testing experience, but we believe this allowed us to create a product that will enable anyone to participate in load testing whenever they need.