Engineering dive into Slack Enterprise Key Management

Over the years, customers have asked us for stricter control over and visibility into their data in Slack, all while maintaining the product’s most essential features, such as link unfurling, mentions, notifications, and search. We talked with many of these customers to get a better understanding of their threat models, how they’d like to assert that control, and what kind of visibility they were lacking.

That brings us to this most auspicious day: The release of Slack Enterprise Key Management, or Slack EKM for short. Slack EKM is a tailor-made solution for security-conscious customers. It allows organizations to have greater visibility into their data in Slack and more control over the keys used to encrypt and decrypt their data.

I’d like to talk a bit about how we designed and made Slack EKM an engineering reality. You can read more about EKM at

But first, a little background

Our initial research focused on a couple of potential solutions:

  1. End-to-End Encryption: This solution ensures that only the receiver of the message would be able to view it (as in iMessage).
  2. Policy-Based Controls: This solution means customers’ encryption keys would be stored by Slack. Customers would then create policies in the user interface to control Slack’s access to those keys.

These two solutions were on the extreme ends of possibility. End-to-end encryption would have some performance impact and make certain features unavailable; policy controls would allow the full use of Slack but not give customers full control over the encryption keys.

Neither option met all of our requirements: to maintain usage of Slack’s features, preserve performance, and support customers’ need for control over their encryption keys and visibility into their data.

Instead, we designed a solution for customers to bring their own encryption keys into the Slack service — and that’s what we now know as Slack EKM.

In our initial release, we support Amazon Web Services Key Management Service (AWS KMS) as our first third-party integration for storing keys. In the future, we may decide to expand Slack EKM to integrate with other key management services or hardware security modules (HSMs) based on customer demand.

OK, let’s get into it

Now that we have some background, it’s time to get into the details. What follows is a high-level diagram of Slack’s systems with EKM:

High-level design of Slack with EKM

Phew, there’s a lot there! Let’s take it piece by piece.

Before we had Slack EKM, this was the state of the world:

High-level design of Slack without EKM

At a high level, when the Webapp received a new message, we would immediately store the message in our databases, end of story. This is pretty straightforward, but even though our databases are encrypted at rest, customers do not have control over those keys.

Everyone loves a new service

When it came time to design how EKM would work in Slack, we decided very early on that any EKM-related encryption and decryption would run as a separate service. The following diagram depicts the interaction between our Webapp and our new EKM service:

Slack and the EKM service

There were a number of reasons to do this. One big reason was that we could add other security measures to this service, such as no one being allowed to log into the boxes running the service. This isolation helps ensure that the encryption keys sourced from AWS KMS are never leaked.

Also, building a new service meant we had the chance to pick a language. At Slack, we use a number of languages, including Hack, Java, Elixir and Go. We decided to use Go because it is already a part of our ecosystem, is well suited for doing CPU-intensive cryptographic operations, and has a great AWS software development kit. (We also had Go experts working on the team.)

Even though our focus was on building Slack EKM to integrate with AWS KMS, we also wanted to ensure that this foundation would scale as we continue to evolve EKM in the future. We introduced an abstraction layer between the EKM service and the key store, which made our interface not only cleaner but more future-proof too.

Granular control?

I previously mentioned that Slack EKM gives customers control of and visibility into their sensitive data, but I haven’t described how. To recap, here’s our revised diagram of Slack’s architecture with EKM:

High-level design of Slack with EKM

The customer’s key that is stored in AWS KMS is referred to as the Customer Master Key, or CMK. Other key management services may refer to this key as the Master Key or Key Encryption Key. Instead of encrypting every piece of data with that single key, the EKM service requests a data key to be generated based on a supplied scope, which KMS calls the encryption context. That data key is then used to encrypt a file or a small slice of messages. That’s a mouthful. What does it actually mean?
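To make the data key flow concrete, here is a minimal Python sketch. It is not the EKM service's code (which is written in Go) and it does not call AWS: `FakeKMS` is a hypothetical in-memory stand-in that only mimics the shape of the exchange, where a data key is requested for a scope and can later be recovered only by presenting the same scope. The scope key names and values are assumptions made for illustration.

```python
import os
import secrets


class FakeKMS:
    """Hypothetical in-memory stand-in for a key management service."""

    def __init__(self):
        # opaque handle -> (plaintext data key, scope it was requested for)
        self._wrapped = {}

    def generate_data_key(self, encryption_context):
        """Return a fresh data key plus an opaque handle bound to the scope."""
        plaintext_key = os.urandom(32)
        handle = secrets.token_hex(16)  # stands in for the encrypted data key
        self._wrapped[handle] = (plaintext_key, dict(encryption_context))
        return plaintext_key, handle

    def decrypt(self, handle, encryption_context):
        """Recover the data key only if the caller presents the same scope."""
        plaintext_key, bound_context = self._wrapped[handle]
        if bound_context != encryption_context:
            raise PermissionError("encryption context does not match")
        return plaintext_key


if __name__ == "__main__":
    kms = FakeKMS()
    scope = {"organization": "O1", "workspace": "W1",
             "channel": "C1", "hour": "2019-02-13T14"}
    key, handle = kms.generate_data_key(scope)
    # The handle is stored next to the encrypted data; the key itself is not.
    assert kms.decrypt(handle, scope) == key
```

The point of the sketch is the binding: the stored handle is useless without the matching scope, which is what lets a policy on the scope control access to the data.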

Scopes + policies = granular control

We have five scopes that we can use when encrypting message or file data: Organization, Workspace, Channel, Hour and File. If we were encrypting a message, we would use the Organization, Workspace, Channel and Hour scopes, which together reflect all the context of a message. The Hour scope is a bit special: it changes every time the hour changes. This means that we have built-in key rotation for our messages, since a new data key will be used every hour.
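As a rough illustration of how an hourly scope produces key rotation, the Python sketch below builds a message scope from its IDs and send time. The key names, the `YYYY-MM-DDTHH` hour format, and the use of UTC are assumptions for the example, not the actual format used by Slack.

```python
from datetime import datetime, timezone


def hour_bucket(sent_at):
    """Truncate a timezone-aware timestamp to its UTC hour."""
    return sent_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")


def message_scope(org_id, workspace_id, channel_id, sent_at):
    """Build the scope for a message from its IDs and send time."""
    return {
        "organization": org_id,
        "workspace": workspace_id,
        "channel": channel_id,
        "hour": hour_bucket(sent_at),  # changes every hour
    }


if __name__ == "__main__":
    first = message_scope("O1", "W1", "C1",
                          datetime(2019, 2, 13, 14, 5, tzinfo=timezone.utc))
    later = message_scope("O1", "W1", "C1",
                          datetime(2019, 2, 13, 15, 0, tzinfo=timezone.utc))
    # A different hour gives a different scope, hence a different data key.
    assert first != later
```

Two messages in the same channel and the same hour share a scope, and so can share a data key; once the hour rolls over, the scope changes and a new data key is requested.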

Another great property of our granular scopes is that they enable customers to create policies in AWS to control or limit access to their keys, and thus their data, in a more precise and targeted way. For example, customers can create sophisticated policies that revoke access for a channel for a particular period of time.

Slack’s access revoked to a channel for February 13, 2019
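A policy like this could be sketched as a Deny statement in the key policy, conditioned on the encryption context. The Python below only builds the JSON. `kms:EncryptionContext:<key>` is a real AWS KMS condition key, but the context key names (`channel`, `hour`), the hour format, the role ARN, and the channel ID are all hypothetical placeholders.

```python
import json
from datetime import date


def hours_of_day(day):
    """Return the 24 hour-scope values for one day, e.g. '2019-02-13T00'."""
    return [f"{day.isoformat()}T{hour:02d}" for hour in range(24)]


def deny_channel_for_day(principal_arn, channel_id, day):
    """Build a Deny statement for one channel on one day."""
    return {
        "Sid": "DenyChannelForOneDay",
        "Effect": "Deny",
        "Principal": {"AWS": principal_arn},
        "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
        "Resource": "*",
        "Condition": {
            # Keys under one operator are ANDed; values for one key are ORed.
            "StringEquals": {
                "kms:EncryptionContext:channel": channel_id,
                "kms:EncryptionContext:hour": hours_of_day(day),
            }
        },
    }


if __name__ == "__main__":
    statement = deny_channel_for_day(
        "arn:aws:iam::111122223333:role/ExampleSlackEKMRole",  # placeholder
        "C12345",                                              # placeholder
        date(2019, 2, 13),
    )
    print(json.dumps(statement, indent=2))
```

Because the Hour scope is part of every message's encryption context, a full day is simply the 24 hour values for that date.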

Files use just the Organization and File scopes. This means that an admin can set a policy in AWS to revoke access to a particular file.

Once we built support for granular scopes, we needed to make sure all Slack engineers used the right scopes when building new features or storing data. As part of our internal implementation, we created EKM encrypt and decrypt objects that help guide Slack engineers through the encryption and decryption process and creating the right scope. There’s a hefty amount of documentation, too.

After creating the objects and documentation, which helped make building EKM a lot easier, we found some fun problems, such as, “What should the scope be for a message if a customer moves a channel between workspaces?”

The answer to that one is simple. All we had to do was:

  1. Fetch the message
  2. Decrypt the message using its current scope
  3. Update the workspace ID for the message
  4. Update the Workspace in the scope for the message
  5. Re-encrypt the message

Yup. Easy peasy… 😉
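For the curious, the shape of that re-encryption can be sketched in a few lines of Python. Everything here is hypothetical: `ToyEKM` keeps one random key per scope in memory and "encrypts" by XOR-ing with a SHAKE-256 keystream, which is a toy stand-in and not real cryptography (no nonce, no authentication). The only point is that a message must be decrypted under its old scope before it can be encrypted under the new one.

```python
import hashlib
import os


class ToyEKM:
    """Toy scope-keyed cipher for illustration only. NOT secure."""

    def __init__(self):
        self._keys = {}  # scope -> random per-scope key

    def _keystream(self, scope, length):
        scope_id = tuple(sorted(scope.items()))
        key = self._keys.setdefault(scope_id, os.urandom(32))
        return hashlib.shake_256(key).digest(length)

    def encrypt(self, scope, plaintext):
        stream = self._keystream(scope, len(plaintext))
        return bytes(a ^ b for a, b in zip(plaintext, stream))

    def decrypt(self, scope, ciphertext):
        # XOR with the same keystream reverses the toy encryption.
        return self.encrypt(scope, ciphertext)


def rescope_message(ekm, message, **scope_changes):
    """Decrypt under the old scope, update the scope, then re-encrypt."""
    plaintext = ekm.decrypt(message["scope"], message["ciphertext"])
    new_scope = dict(message["scope"], **scope_changes)
    return {"scope": new_scope,
            "ciphertext": ekm.encrypt(new_scope, plaintext)}


if __name__ == "__main__":
    ekm = ToyEKM()
    scope = {"organization": "O1", "workspace": "W1",
             "channel": "C1", "hour": "2019-02-13T14"}
    message = {"scope": scope, "ciphertext": ekm.encrypt(scope, b"hello")}
    moved = rescope_message(ekm, message, workspace="W2")
    assert ekm.decrypt(moved["scope"], moved["ciphertext"]) == b"hello"
```

In a real system this would have to happen for every affected message, which is where the "easy peasy" gets its wink.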

Can’t use the same master key forever

So far, I’ve talked about encrypting new data and rotating keys, but I’d be remiss if I didn’t mention that Slack EKM also supports rekeying. This means that when a customer decides to rotate their Master Key, Slack will re-encrypt all of the existing messages and files with data keys encrypted by the new Master Key.

How frequently do you request data keys?

You may ask yourself, “Does Slack’s EKM service make a request to AWS KMS for every message and file?” Don’t worry; we don’t 😄. We’ve done extensive performance and load testing on the EKM service and Slack (be on the lookout for a future blog post on that). During our testing, it became very clear that we would need to cache the scoped data keys, so we implemented an in-memory cache with a TTL (time-to-live) of five minutes.

In order to address our customers’ threat model, we keep the data key caches in memory so they will not be stored persistently. Based on our testing, a five-minute cache has enough of a hit rate to minimize any performance impact during normal use, such as sending messages. Caching also helps us make sure we don’t hit the rate limits for AWS KMS.

EKM service key cache as we start to introduce more requests
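A cache along these lines can be sketched in Python as follows. This is a simplified illustration, not the EKM service's Go implementation: `DataKeyCache`, `fetch_key`, and the injectable `clock` are hypothetical names, and only the five-minute TTL comes from the text above.

```python
import time


class DataKeyCache:
    """In-memory, per-scope data key cache with a fixed TTL."""

    def __init__(self, fetch_key, ttl_seconds=300, clock=time.monotonic):
        self._fetch_key = fetch_key  # called on a miss, e.g. a KMS request
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}           # scope -> (key, expiry time)

    def get(self, scope):
        scope_id = tuple(sorted(scope.items()))
        now = self._clock()
        entry = self._entries.get(scope_id)
        if entry is not None and now < entry[1]:
            return entry[0]          # hit: no request to the key store
        key = self._fetch_key(scope)  # miss or expired: ask the key store
        self._entries[scope_id] = (key, now + self._ttl)
        return key


if __name__ == "__main__":
    requests = []

    def fetch(scope):
        requests.append(scope)
        return b"data-key-%d" % len(requests)

    cache = DataKeyCache(fetch)
    scope = {"organization": "O1", "channel": "C1", "hour": "2019-02-13T14"}
    cache.get(scope)
    cache.get(scope)
    assert len(requests) == 1  # the second lookup was served from memory
```

Keeping the entries in a plain in-process dictionary means nothing is written to disk, and an entry that has expired always triggers a fresh request, so a revoked key stops working within one TTL.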

Logs for days

In addition to empowering customers to control their data, Slack EKM also gives them visibility into when their KMS keys are requested. When customers set up EKM, they have the ability to set up AWS CloudTrail. CloudTrail provides customers with logs about key requests that are created directly from AWS KMS, thanks to the seamless integration between CloudTrail and KMS. Each log contains detailed information, including the action that was taken, the key scope and the timestamp. Customers can take these logs and ingest them into their security information and event management (SIEM) providers, where they can analyze the logs and use that information to gain insights.
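As an example of the kind of analysis this enables, the Python sketch below filters CloudTrail records for KMS requests that match a given scope. The field names (`eventSource`, `eventName`, `eventTime`, `requestParameters.encryptionContext`) follow the CloudTrail record format for KMS calls; the context key names and the sample records are made up for illustration.

```python
def key_requests_matching(records, **context):
    """Return (time, action) for KMS events whose scope matches context."""
    matches = []
    for record in records:
        if record.get("eventSource") != "kms.amazonaws.com":
            continue
        params = record.get("requestParameters") or {}
        seen = params.get("encryptionContext") or {}
        if all(seen.get(name) == value for name, value in context.items()):
            matches.append((record["eventTime"], record["eventName"]))
    return matches


if __name__ == "__main__":
    # Made-up sample records, trimmed to the fields used above.
    sample = [
        {"eventSource": "kms.amazonaws.com", "eventName": "Decrypt",
         "eventTime": "2019-02-13T14:05:00Z",
         "requestParameters": {"encryptionContext": {"channel": "C1"}}},
        {"eventSource": "kms.amazonaws.com", "eventName": "GenerateDataKey",
         "eventTime": "2019-02-13T15:00:00Z",
         "requestParameters": {"encryptionContext": {"channel": "C2"}}},
    ]
    for when, action in key_requests_matching(sample, channel="C1"):
        print(when, action)
```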

Since CloudTrail logs only actions taken through AWS KMS, we provide additional logs through AWS CloudWatch that record all activity, including in-memory data key cache hits. The CloudWatch logs also provide additional information, such as the reason we made the key requests.

What about search?

Searching over encrypted data is a hard problem, fraught with trade-offs around security and performance. We decided to extend the same model for search as we do for messages: Customers control the key used to encrypt their data.

Without getting into all the details of how search at Slack works, an EKM customer’s search index lives on an encrypted filesystem, and that filesystem is encrypted with a key requested under the customer’s Organization scope. This means each EKM customer’s search index is also built separately from everyone else’s.

When a search occurs, we check to see if we have access to the key needed to decrypt the index. If we have access to the key, we mount the index so users can perform searches. We keep the index mounted for five minutes (the same period we cache the data keys) and verify access with KMS at the end of that time period.
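The mount-and-verify cycle can be sketched like this in Python. It is a simplification under stated assumptions: `IndexMount` and `has_key_access` are hypothetical names, and the real mount and unmount of a filesystem are reduced to bookkeeping. Only the five-minute period comes from the description above.

```python
import time


class IndexMount:
    """Tracks whether an organization's search index may be used."""

    def __init__(self, has_key_access, lease_seconds=300,
                 clock=time.monotonic):
        self._has_key_access = has_key_access  # e.g. a check against KMS
        self._lease = lease_seconds
        self._clock = clock
        self._mounted_until = {}               # org ID -> lease expiry

    def search_allowed(self, org_id):
        now = self._clock()
        expires = self._mounted_until.get(org_id)
        if expires is not None and now < expires:
            return True                        # still mounted, no check
        if self._has_key_access(org_id):       # lease over: verify again
            self._mounted_until[org_id] = now + self._lease
            return True
        self._mounted_until.pop(org_id, None)  # access revoked: unmount
        return False


if __name__ == "__main__":
    mount = IndexMount(lambda org_id: True)
    assert mount.search_allowed("O1")
```

If a customer revokes access to the key, searches keep working only until the current lease runs out; the next check fails and the index is no longer available.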

What about everyone else?

What about the organizations and workspaces not using Slack EKM? While we are excited for all the customers who have Slack EKM, we needed a way to differentiate them from all of our non-EKM workspaces so we could tell if we needed to perform an EKM operation.

As you can imagine, with EKM encrypting all messages and files, the table where we store the EKM status would be a very popular one. Enter the EKM status cache: It not only keeps track of who has EKM enabled but also negatively caches the workspaces that don’t have EKM. This way, we can quickly tell if we need to perform an EKM operation without stressing a database table that doesn’t change often.
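Negative caching is the interesting part, so here is a small Python sketch of the idea. The class and its `lookup` callback are hypothetical, and expiry and cross-host invalidation are left out; the point is simply that a "no" answer is remembered just like a "yes".

```python
class EKMStatusCache:
    """Remembers both positive and negative EKM status lookups."""

    def __init__(self, lookup):
        self._lookup = lookup  # e.g. a query against the status table
        self._status = {}      # workspace ID -> True or False

    def is_enabled(self, workspace_id):
        if workspace_id not in self._status:
            # Store the answer even when it is False (negative caching).
            self._status[workspace_id] = bool(self._lookup(workspace_id))
        return self._status[workspace_id]

    def invalidate(self, workspace_id):
        """Forget a workspace, e.g. after its EKM status changes."""
        self._status.pop(workspace_id, None)


if __name__ == "__main__":
    queries = []

    def lookup(workspace_id):
        queries.append(workspace_id)
        return workspace_id == "W-ekm"

    cache = EKMStatusCache(lookup)
    assert cache.is_enabled("W-plain") is False
    assert cache.is_enabled("W-plain") is False
    assert queries == ["W-plain"]  # the second "no" never hit the table
```

Without the negative entries, every message sent in a non-EKM workspace would miss the cache and fall through to the database.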

Only the beginning

Even with all the changes made to Slack to support EKM, end users won’t experience any disruptions or notice anything different. And administrators have peace of mind knowing that with Slack EKM, they have full oversight over their data and can mitigate potential risks at the most granular level.

And we’re only getting started! Now that we have a framework for EKM within Slack, we have opportunities in the future to encrypt other pieces of data.

With that, you now have a solid introduction to Slack with EKM. We’ve covered a lot of information here, and this is just the tip of the EKM and Slack iceberg. If you’re curious about other problems we’ll solve with EKM, come join us and see what else we’re up to 😄!
