In April of 2016 Etsy launched Pattern, a new product that gives Etsy sellers the ability to create their own hosted e-commerce website. With an easy-setup experience, modern and stylish themes, and guest checkout, sellers can merchandise and manage their brand identity outside of the Etsy.com retail marketplace while leveraging all of Etsy’s e-commerce tools.
The ability to point a custom domain to a Pattern site is an especially popular feature; many Pattern sites use their own domain name, either registered directly on the Pattern dashboard, or linked to Pattern from a third-party registrar.
At launch, Pattern shops with custom domains were served over HTTP, while checkouts and other secure actions happened over secure connections with Etsy.com. This model isn’t ideal though; Google ranks pages with SSL slightly higher, and plans to increase the bump it gives to sites with SSL. That’s a big plus for our sellers. Plus, it’s the year 2017 – we don’t want to be serving pages over boring old HTTP if we can help it … and we really, really like little green lock icons.
In this post we’ll be looking at some interesting challenges you run into when building a system designed to serve HTTPS traffic for hundreds of thousands of domains.
How to HTTPS
First, let’s talk about how HTTPS works, and how we might set it up for a single domain.
Let’s not talk that much about how HTTPS works though, because that would take a long time. Very briefly: by HTTPS we mean HTTP over SSL, and by SSL we mean TLS. All of this is a way for your website to communicate securely with clients. When a client first begins communicating with your site, you provide them with a certificate, and the client verifies that your certificate comes from a trusted certificate authority. (This is a gross over-simplification, here’s a very interesting post with more detail.)
Suffice to say, if you want HTTPS, you need a certificate, and if you want a certificate you need to get one from a certificate authority. Until fairly recently, this is the point where you had to open up your wallet. There were a handful of certificate authorities, each with an array of confusing options and pricey up-sells.
You pick whichever feels right to you, you get out your credit card, and you wind up with a certificate that’s valid for one to three years. Some CA’s offer API’s, but three years is a relatively long time, so chances are you do this manually. Plus, who even knows which cell in which SSL pricing matrix gets you API access?
From here, if you’re managing a limited number of domains, typically what you do is upload your certificate and private key to your load balancer (or CDN) and it handles TLS termination for you. Clients communicate with your load balancer over HTTPS using your domain’s public certificate, and your load balancer makes internal requests to your application.
And that’s it! Your clients have little green padlocks in their address bars, and your application is serving payloads just like it used to.
More certificates, more problems
Manually issuing certificates works well enough if your application is served off a handful of top-level domains. If you need certificates for lots and lots of TLDs, it’s not going to work. Have we mentioned that Pattern is available for the very reasonable price of $15 per month? A real baseline requirement for us is we can’t pay more for a certificate than someone pays for a year of Pattern. And we certainly don’t have time to issue all those certificates by hand.
Until fairly recently, our only options at this point would have been 1) do a deal with one of the big certificate authorities that offers API access, or 2) become our own certificate authority. These are both expensive options (we don’t like expensive options).
Let’s Encrypt to the rescue
Luckily for us, in April of 2016, the Internet Security Research Group (ISRG), which includes founders from the Mozilla Foundation and the Electronic Frontier Foundation, launched a new certificate authority named Let’s Encrypt.
The ISRG has also designed a communication protocol, Automated Certificate Management Environment (ACME), that covers the process of issuing and renewing certificates for TLS termination. Let’s Encrypt provides its service for free, and exclusively via an API that implements the ACME protocol. (You can read about it in more detail here.)
This is great for our project: we found a certificate authority we feel really good about, and since it implements a freely published protocol, there are great open source libraries out there for interacting with it.
Building out certificate issuance
Let’s re-examine our problem now that we have a good way to get certificates:
- We have a large quantity of custom domains attached to existing Pattern sites; we need to generate certificates for all of those.
- We have a constant stream of new domain registrations for new Pattern sites; we’ll need to get new certificates on a rolling basis.
- Let’s Encrypt certificates last 90 days; we need to renew our SSL certificates fairly frequently.
All of these problems are solvable now! We chose an open source ACME client library for PHP called AcmePHP. Another exciting thing about Let’s Encrypt using an open protocol, there are open source server implementations as well as client implementations. So we were able to spin up an internal equivalent of Let’s Encrypt called boulder and do our development and testing inside our private network, without worrying about rate limits.
The service we built out to handle this ACME logic, store the certificates we receive, and keep track of which domains we have certificates for we named CertService. It’s a service that deals with certificates, you see. CertService communicates with Let’s Encrypt, and also exposes an API for other internal Etsy services. We’ll go into more detail on that later in this post.
TLS termination for lots and lots of domains
Now that we’ve build out a service that can have certificates issued, we need a place to put them, and we need to use them to do TLS termination for HTTPS requests to Pattern custom domains.
We handle this for *.etsy.com by putting our certificates directly onto our load balancers and giving them to our CDN’s. In planning this project, we thought about hundreds of thousands of certificates, each with a 90 day lifetimes. If we did a good job spacing out issuing and renewing our certificates, that would add up to thousands of daily write operations to our load balancers and CDN’s. That’s not a rate we were comfortable with; load balancer and CDN configuration changes are relatively high risk operations.
What we did instead, is create a pool of proxy servers, and use our load balancer to distribute HTTPS traffic to them. The proxy hosts handle the client SSL termination, and proxy internal requests to our web servers, much like the load balancer does for www.etsy.com.
Our proxy hosts run Apache and we leverage mod_ssl and mod_proxy to do TLS termination and proxy requests. To preserve client IP addresses, we make use of the PROXY protocol on our load balancer and mod_proxy_protocol on the proxy hosts. We’re also using mod_macro to avoid ever having to write out hundreds of thousands of virtual hosts declarations for hundreds of thousands of domains. All put together, it looks something like this:
ProxyPass / https://internal-web-vip/ ProxyPassReverse / https://internal-web-vip/ <Macro VHost $domain> <VirtualHost *:443> ProxyProtocol On ServerName $domain SSLEngine on SSLCertificateFile $domain.crt SSLCertificateKeyFile $domain.key SSLCertificateChainFile lets-encrypt-cross-signed.pem </VirtualHost> </Macro> Use VHost custom-domain-1.com ... Use VHost custom-domain-n.com
To connect all this together, our proxy hosts periodically query CertService for a list of recently modified custom domains. Each host then 1) fetches the new certificates from CertService, 2) writes them to disk, 3) regenerates a config like the one above, and 4) does a graceful restart of Apache. These restarts are staggered across our proxy pool so all but one of the hosts is available and receiving requests from the load balancer at any give time (fingers crossed).
How do we securely store lots of certificates?
Now that we’ve figured out how to programmatically request and renew SSL certificates via LetsEncrypt, we need to store these certificates in a secure way. To do this, there are some guarantees we need to make:
- Private keys are stored in a database segmented from other types of data
- Private keys encrypted at rest and never leave CertService in plaintext
- SSL key pair generation and LetsEncrypt communications take place only on trusted hosts
- Private keys can be retrieved only by the SSL terminating hosts
Guarantee #1 is a no-brainer. If an attacker were to compromise a datastore containing thousands of SSL private keys in plaintext, they would be be able to intercept critical data being sent to thousands of custom domains. Since security is about raising costs for an attacker – that is, making it harder for an attacker to succeed – we employ a number of techniques to secure our keys. Our first layer of defense is at an infrastructure level: private keys are stored in a MySQL database away from the rest of the network. We use iptables to limit who can connect to the MySQL server, and given that CertService is the only client that needs access, the scope is really narrow. This vastly reduces attack surface, especially in cases where an attacker is looking to pivot from another compromised server on the network. Iptables is also then used to lock down who can communicate to the CertService API; adding constraints to connectivity on top of a secure authentication scheme makes retrieving certificates that much more difficult. That addresses Guarantee #4.
Now that we’ve locked down access to the database, we need to make sure they’re stored encrypted. For this, we make use of a concept known as a hybrid cryptosystem. Hybrid cryptosystems, in a nutshell, combine asymmetric (public-key crypto) and symmetric cryptosystems. If you’re familiar with SSL, much of how we handle crypto here is analogous to Session Keys.
At the start of this process, we have two pieces of data: the SSL private key and its corresponding public key – the certificate. We don’t particularly care about the certificate since that is public by definition. We start by generating a domain specific AES-256 key and encrypt the SSL private key. This only technically addresses the issue of not having plaintext on disk; the encrypted SSL private key is stored right next to the AES key, which can be used to both encrypt and decrypt. An attacker who could steal the encrypted keys could also steal the AES key. To address this, we encrypt the AES key with a CertService public key. Now we have an encrypted SSL private key (encrypted with the AES-256 key) and an encrypted AES key (encrypted with CertService’s RSA-2048 public key). Now not only are keys truly stored encrypted on disk, they also cannot be decrypted at all on CertService. This means if an attacker were to break CertService’s authentication scheme, the most they would receive is an encrypted SSL private key; they would still need the CertService private key – available only on the SSL terminating hosts – to decrypt it. Now we’ve fully taken care of Guarantee #2.
The only Guarantee that remains is #3. If key generation were compromised, an attacker would be able to grab private keys before they were encrypted and stored. If LetsEncrypt communication were compromised, an attacker could use our keys to generate certificates for domains we’ve already authorized (they could technically authorize new ones, but that would be significantly more difficult) or even revoke certificates. Both of these cases would render the entire system untrustworthy. Instead, we limit this functionality to CertService and expose it as an API; that way, if the web server handling Pattern requests were broken into, the attacker would not be able to affect critical LetsEncrypt flows.
One of our stretch goals is to look into deploying HSMs. If there are bugs in the underlying software, the integrity of the entire system could be compromised thus voiding any guarantees we try to keep. While bugs are inevitable, moving critical cryptographic functions into secure hardware will mitigate their impact.
No cryptosystem is perfect, but we’ve reached our goal of significantly increasing attacker cost. On top of that, we’ve supplemented the cryptosystem with our usual host based alerting and monitoring. So not only will an attacker have to jump through several hoops to get those SSL key pairs, they will also have to do it without being detected.
After the build
With all of that wrapped up, we had a system to issue large numbers of certificates, securely store this data, and terminate TLS requests with it. At this point Etsy brought in a third-party security firm to do a round of penetration testing. This process finished up without finding any substantive security weaknesses, which gives us an added level of confidence in our system.
Once we’ve gained enough confidence, we will enable HSTS. This should be the final goal of any SSL rollout as it forces browsers to use encryption for all future communication. Without it, downgrade attacks could be used to intercept traffic and hijack session cookies.
Every Pattern site and linked domain now has a valid certificate stored in our system and ready to facilitate secure requests. This feature is rolled out across Pattern and all Pattern traffic is now pushed over HTTPS!
(This project was a collaboration between Ram Nadella, Andy Yaco-Mink and Nick Steele on the Pattern team; Omar Ahmed and Ken Lee from Security; and Will Gallego from Operations. Big thanks to Dennis Olvany, Keyur Govande and Mike Adler for all their help.)