Application DDoS In Microservice Architectures
We’d like to introduce you to one of the most devastating ways to cause service instability in modern microservice architectures: application DDoS. A specially crafted application DDoS attack can cause cascading system failures, often for a fraction of the resources needed to conduct a more traditional DDoS attack. This is due to the complex, interconnected relationships between applications. Traditional DDoS attacks focus on exhausting system resources at the network level. In contrast, application layer attacks focus on expensive API calls, using the interconnected relationships between services to cause the system to attack itself, sometimes with massive effect. In a modern microservice architecture this can be particularly harmful. A sophisticated attacker could craft malicious requests that model legitimate traffic and pass through edge protections such as a web application firewall (WAF).
In this blog post, we will discuss an effort at Netflix to identify, test, and remediate application layer DDoS attacks. We will begin with some background on the problem space. Next we will discuss the tools and methods we used to test our systems. Finally, we will discuss steps for making systems more resilient against application layer DDoS attacks. We are also presenting at DEF CON 25 today on the same topics, so if you are attending the conference please stop by.
According to Akamai’s Q1 2017 State of the Internet Security report, “less than 1% of all DDoS attacks are application layer”¹. However, this metric underrepresents the impact of these attacks. When an attacker takes the time to craft this style of attack, it can be highly effective. Keeping this in mind, defending against these types of attacks can help ensure that your organization does not suffer cascading failures if an application layer DDoS attack occurs.
Traditional application layer DDoS attacks focused on the asymmetry between the attacker’s work to generate an input and the responding system’s work to generate the resulting output. Attacks targeted expensive calls such as database queries or heavy disk I/O, with the goal of overutilizing the application until it could no longer service legitimate users. As application architectures have evolved into more complex and distributed systems, we now have additional vectors to focus on, such as service health checks, queuing/batching, and complex microservice dependencies that may result in failures if one key service becomes unstable.
Microservices and DDoS
In a modern microservice architecture, application DDoS can be a particularly effective opportunity to cause service instability. To understand why, let’s consider a sample microservice architecture that uses a gateway to interact with a variety of middle tier and backend microservices, as depicted in the figure below.
This diagram shows how a single request at the edge can fan out into thousands of requests for the middle tier and backend microservices. If an attacker can identify API calls that have this effect, then it may be possible to use this fan out architecture against the overall service. If the resulting computations are expensive enough, then certain middle tier services could stop working. Depending on the criticality of these services, this could result in an overall service outage.
All of this is made possible because the microservice architecture helps the attacker by massively amplifying the attack against internal systems. In summary, a single request in a microservices architecture may generate tens of thousands of complex middle tier and backend service calls.
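To make the amplification concrete, here is a minimal sketch of the fan-out arithmetic. The tier sizes are hypothetical numbers chosen for illustration and do not describe any real system; the function name `total_downstream_calls` is also just for this example.

```python
def total_downstream_calls(fan_out_per_tier):
    """Return how many downstream calls one edge request triggers.

    fan_out_per_tier[i] is the assumed number of calls that each request
    at tier i makes to tier i + 1.
    """
    total = 0
    calls_at_tier = 1  # the single request arriving at the API gateway
    for fan_out in fan_out_per_tier:
        calls_at_tier *= fan_out
        total += calls_at_tier
    return total


# Hypothetical example: the gateway calls 20 middle tier services, and each
# of those makes 50 backend calls.
print(total_downstream_calls([20, 50]))  # 20 + 20 * 50 = 1020
```

Under these assumptions, each request an attacker sends costs the internal systems roughly a thousand calls, which is why a modest request rate at the edge can translate into heavy load further in.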
This presents unique challenges for defenders. If your environment uses a common web application firewall deployment, where the firewall is positioned on the internet-facing systems only (such as the API gateway), the firewall may miss an opportunity to block requests that are specifically causing distress to the middle tier and backend services. In addition, this firewall may not know how much work one request to the API gateway will generate for middle tier services, and may not trigger a blacklist until the damage is done. As defenders, it is important that we understand how to identify these vulnerable application calls. The next section walks through a framework for discovery and validation.
A Framework to Identify and Validate Application DDoS
Intuitively, our goal as defenders is to identify which API calls may be vulnerable to a DDoS. These are the calls that require significant resources from the middle tier and backend services. Timing how long API calls take to complete is one way to identify such calls. The most basic and error-prone way to do this is to fingerprint API calls from a web browser. This can be done by opening Chrome Developer Tools, selecting the Network tab, enabling Preserve log, and then browsing to the site. After some period of time, sort by Time and look at the most latent calls. You will get a screen similar to the image below.
This technique may have false positives, including calls that cannot be modified to increase latency. Also, you may end up missing calls that could be manipulated to increase latency. A better technique to identify latent calls may be to monitor request times for middle tier services. Once you have found a latent middle tier service, you should work on reconstructing a request that could be made through the API gateway that would invoke the latent middle tier service.
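As a rough sketch of what monitoring middle tier request times could look like, the snippet below ranks services by 99th percentile latency using the nearest-rank method. The input format (pairs of service name and latency in seconds) and the service names are assumptions for illustration; in practice the samples would come from whatever metrics system you already run.

```python
import math
from collections import defaultdict


def rank_by_p99(samples):
    """Rank services by p99 latency, slowest first.

    samples: iterable of (service_name, latency_seconds) pairs.
    """
    by_service = defaultdict(list)
    for service, latency in samples:
        by_service[service].append(latency)

    ranked = []
    for service, latencies in by_service.items():
        latencies.sort()
        # Nearest-rank percentile: the smallest value with at least 99% of
        # the samples at or below it.
        index = max(0, math.ceil(0.99 * len(latencies)) - 1)
        ranked.append((service, latencies[index]))

    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical samples for two made-up services.
samples = [("recs", 0.2), ("recs", 4.0), ("profile", 0.05), ("profile", 0.07)]
for service, p99 in rank_by_p99(samples):
    print(service, p99)
```

The services at the top of this list are the candidates worth tracing back to a request that can be made through the API gateway.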
Once you have found some interesting API calls, the next step is to inspect their content. Your goal at this stage should be to find ways to make the calls more expensive. One technique for this is to increase the range of objects requested. For example, in the image below the from and to parameters could potentially be modified to increase the workload on middle tier services.
Digging a little deeper, you can often modify many different elements of a request to make it more expensive. The image below shows one example where you can potentially modify the object fields requested, the range, and even the image size.
You also want to build out a list of indicators on the health of your test. This will inform you if your test is working and where you may need to scale the test up or down. Typically these will be different HTTP status codes or latencies observed during testing, but they may also be specific headers, response text, stack traces, etc. The image below shows an example list of indicators.
Another useful test success indicator is increased latency (such as an HTTP 200 and a 10 second response). You may observe the latency yourself during the test, or other users browsing the application may notice it. Once you have a good understanding of the types of requests you can send to generate latency and how to measure the indicators of success, you’ll need to tune your test to operate under a web application firewall if one exists in your environment.
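One way to keep track of such indicators during a test is to bucket each response by status code and latency. The sketch below is only illustrative: the bucket names, the 10 second threshold, and the assumption that 403 or 429 means the request was blocked or throttled are ours, and should be replaced with the indicators that apply to your own application.

```python
from collections import Counter

SLOW_SECONDS = 10.0  # assumed threshold for a "slow but successful" response


def classify(status, elapsed_seconds):
    """Map one (status code, latency) observation to an indicator bucket."""
    if status == 200 and elapsed_seconds >= SLOW_SECONDS:
        return "slow_success"
    if status == 200:
        return "ok"
    if status in (502, 503, 504):
        return "upstream_failure"
    if status in (403, 429):  # assumption: how the edge signals a block
        return "blocked_or_throttled"
    return "other"


def summarize(results):
    """Count observations per bucket; results is an iterable of pairs."""
    return Counter(classify(status, elapsed) for status, elapsed in results)


# Hypothetical observations from a short test run.
print(summarize([(200, 0.3), (200, 12.0), (503, 1.0), (429, 0.1)]))
```

A rising share of slow successes or upstream failures suggests the test is having an effect, while a rising share of blocked responses suggests it needs to be scaled down.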
The ideal traffic flow will be just below the threshold at which the web application firewall starts blocking, but high enough that the number of requests and the work per request cause service instability.
To help facilitate this testing on a smaller scale, you can use the Repulsive Grizzly framework. Netflix is releasing this framework through our skunkworks open source program, which means that we are sharing it as a proof of concept but do not anticipate maintaining this code base long term. This framework is written in Python and leverages eventlet for high concurrency. It also supports the ability to round-robin authentication objects, which can be an effective technique to bypass certain web application firewalls.
Repulsive Grizzly does not help with identification of application DDoS vulnerabilities. As with all security testing tools, it is important to utilize this only on systems where you are authorized to perform this testing. On these systems, you will first need to identify potential issues as outlined above. Once you have some potential issues to test, Repulsive Grizzly will simplify the testing process.
For details on how to use the Repulsive Grizzly framework, see the project’s GitHub page for documentation.
After testing your hypothesis on a smaller scale, you can leverage Cloudy Kraken to scale your test. Cloudy Kraken is an AWS orchestration framework, specifically centered around helping you test your applications at a global scale. Similar to the Repulsive Grizzly framework, Netflix is releasing Cloudy Kraken as a skunkworks open source project.
Cloudy Kraken helps maintain a global fleet of test instances and the Repulsive Grizzly tests that run on those instances. It also builds and distributes the test configuration and leverages AWS EC2’s enhanced network drivers. Cloudy Kraken can also scale the test over multiple regions and provides time-synchronization so your test agents run in parallel. The diagram below provides a high level overview of Cloudy Kraken.
Cloudy Kraken orchestrates your tests in a developer friendly fashion. This starts with some configuration scripts that define the test. Cloudy Kraken will then create the AWS environment for the test and launch the instances. While the test is underway, Cloudy Kraken will collect data using AWS SNS. Finally, the AWS resources are destroyed at the conclusion of the test. These steps are shown in the diagram below.
A Netflix Case Study
At Netflix, we wanted to test our findings against a particular API call we identified as being latent. During a Chaos Kong exercise (where Netflix evacuates an entire AWS region while gracefully redirecting customer traffic to other regions), we tested against the production-scale environment in the evacuated region. This provided the rare opportunity to test an application layer DDoS scenario against a production-scale environment without any customer impact. Our unique culture encourages us to do what is best for Netflix and its customers, and we embraced that freedom to run the test in production to truly understand its impact.
The test we ran conducted two different attacks over two five-minute periods. The results of the test confirmed our theory and resulted in an 80% API gateway error rate for the specific region we targeted. Test users who were making requests against that API gateway observed site errors or other exceptions which prevented usage of the site. The graph below shows two spikes in HTTP 503 status codes (depicted in purple), which are correlated with the API gateway health.
Defending against Application DDoS Attacks
The best defense for application layer DDoS attacks comes from a collection of security controls and best practices.
First and foremost, it is critical to know your system. You should understand which microservices impact each aspect of the customer experience. Look for ways to reduce interdependencies on those services. If one service becomes unstable, the rest of your microservices should continue to operate, perhaps in a degraded state.
It is important to have a good understanding of how your services queue and service requests. It may be possible for your middle tier and backend services to limit the batch or object size requested. This can also be done in the client code, and potentially even enforced in the API gateway. Putting a limit on the allowable work per request can significantly reduce the likelihood of exploitation.
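A minimal sketch of capping the work per request, assuming a call that takes a from/to style range like the example earlier in this post. The parameter names, the cap of 50 objects, and the exception type are illustrative assumptions, not part of any specific service.

```python
MAX_RANGE = 50  # assumed cap on objects per request; tune per service


class RequestTooExpensive(ValueError):
    """Raised when a request asks for more work than the service allows."""


def validate_range(from_index, to_index, max_range=MAX_RANGE):
    """Reject ranges that are malformed or larger than the allowed cap."""
    if from_index < 0 or to_index < from_index:
        raise ValueError("invalid range")
    if to_index - from_index > max_range:
        raise RequestTooExpensive(
            f"range of {to_index - from_index} exceeds limit of {max_range}"
        )
    return from_index, to_index


# Hypothetical usage: a normal request passes, an inflated one is rejected.
print(validate_range(0, 50))
try:
    validate_range(0, 5000)
except RequestTooExpensive as error:
    print("rejected:", error)
```

Rejecting the request outright, rather than silently truncating it, makes the limit visible to legitimate clients and keeps the check cheap enough to run at the gateway as well as in the service itself.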
We also recommend enabling a feedback loop to provide alerts from the middle tier and backend services to your WAF. This will help the WAF know when to block these attacks. In many deployments, the WAF is only monitoring the edge and may not realize the impact of one single request to the API gateway. A WAF should also monitor the volume of cache misses. If an API gateway is constantly performing middle tier service calls due to cache misses, that suggests either that the cache is not configured correctly or that malicious behavior may be occurring.
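As one possible shape for the cache miss signal, the sketch below tracks the miss ratio over a sliding window of recent lookups and reports when it crosses a threshold. The class name, window size, and threshold are assumptions for illustration; how the alert reaches your WAF depends entirely on your own tooling.

```python
from collections import deque


class CacheMissMonitor:
    """Track the cache miss ratio over the most recent lookups."""

    def __init__(self, window=1000, threshold=0.5):
        self.events = deque(maxlen=window)  # True = hit, False = miss
        self.threshold = threshold

    def record(self, hit):
        self.events.append(bool(hit))

    def miss_ratio(self):
        if not self.events:
            return 0.0
        misses = sum(1 for hit in self.events if not hit)
        return misses / len(self.events)

    def should_alert(self):
        # Require a full window so a few early misses do not trigger an alert.
        window_full = len(self.events) == self.events.maxlen
        return window_full and self.miss_ratio() > self.threshold


# Hypothetical usage with a tiny window so the behavior is easy to see.
monitor = CacheMissMonitor(window=4, threshold=0.5)
for hit in (True, False, False, False):
    monitor.record(hit)
print(monitor.miss_ratio(), monitor.should_alert())  # 0.75 True
```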
API gateways and other microservices should prioritize authenticated traffic over unauthenticated traffic. It costs more for the attacker to use authenticated sessions. This can also help to mitigate the impact of an application layer DDoS event on your customers.
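A simple way to picture this prioritization is a queue that always serves authenticated requests first and preserves arrival order within each class. This is a sketch under that assumption only; the class name is hypothetical, and a real gateway would also need to guard against unauthenticated traffic being starved indefinitely.

```python
import heapq
import itertools


class PriorityRequestQueue:
    """Serve authenticated requests before unauthenticated ones."""

    AUTHENTICATED = 0
    UNAUTHENTICATED = 1

    def __init__(self):
        self._heap = []
        # The counter breaks ties so requests of equal priority stay FIFO.
        self._counter = itertools.count()

    def put(self, request, authenticated):
        priority = self.AUTHENTICATED if authenticated else self.UNAUTHENTICATED
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def get(self):
        _, _, request = heapq.heappop(self._heap)
        return request

    def __len__(self):
        return len(self._heap)


# Hypothetical usage: interleaved arrivals are served authenticated-first.
queue = PriorityRequestQueue()
queue.put("anon-1", authenticated=False)
queue.put("user-1", authenticated=True)
queue.put("anon-2", authenticated=False)
queue.put("user-2", authenticated=True)
print([queue.get() for _ in range(len(queue))])
# ['user-1', 'user-2', 'anon-1', 'anon-2']
```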
Finally, ensure you have reasonable client library timeouts and circuit breakers. With reasonable timeouts, and plenty of testing, you can help protect your middle tier services from application layer DDoS attacks.
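To illustrate the circuit breaker idea, here is a minimal sketch written for this post. It is not the implementation of any particular library, and the thresholds are placeholders. After a number of consecutive failures it stops calling the dependency and fails fast, then allows a single trial call once a cool-down has elapsed.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock  # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call. A single failure
            # from here reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

The wrapped function is expected to enforce its own timeout and raise when it is exceeded, so a slow dependency counts as a failure instead of tying up the caller.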