Details of the Cloudflare outage on July 2, 2019


Almost nine years ago, Cloudflare was a tiny company and I was a customer, not an employee. Cloudflare had launched a month earlier and one day alerting told me that my little site, jgc.org, didn’t seem to have working DNS any more. Cloudflare had pushed out a change to its use of Protocol Buffers and it had broken DNS.

I wrote to Matthew Prince directly with an email titled “Where’s my dns?” and he replied with a long, detailed, technical response (you can read the full email exchange here) to which I replied:

From: John Graham-Cumming
Date: Thu, Oct 7, 2010 at 9:14 AM
Subject: Re: Where's my dns?
To: Matthew Prince

Awesome report, thanks. I'll make sure to call you if there's a
problem.  At some point it would probably be good to write this up as
a blog post when you have all the technical details because I think
people really appreciate openness and honesty about these things.
Especially if you couple it with charts showing your post launch
traffic increase.

I have pretty robust monitoring of my sites so I get an SMS when
anything fails.  Monitoring shows I was down from 13:03:07 to
14:04:12.  Tests are made every five minutes.

It was a blip that I'm sure you'll get past.  But are you sure you
don't need someone in Europe? :-)

To which he replied:

From: Matthew Prince
Date: Thu, Oct 7, 2010 at 9:57 AM
Subject: Re: Where's my dns?
To: John Graham-Cumming

Thanks. We've written back to everyone who wrote in. I'm headed in to
the office now and we'll put something on the blog or pin an official
post to the top of our bulletin board system. I agree 100%    
transparency is best.

And so, today, as an employee of a much, much larger Cloudflare, I get to be the one who writes transparently about a mistake we made, its impact and what we are doing about it.

The events of July 2

On July 2, we deployed a new rule in our WAF Managed Rules that caused CPU exhaustion on every core that handles HTTP/HTTPS traffic on the Cloudflare network worldwide. We are constantly improving WAF Managed Rules to respond to new vulnerabilities and threats. In May, for example, we used the speed with which we can update the WAF to push a rule to protect against a serious SharePoint vulnerability. Being able to deploy rules quickly and globally is a critical feature of our WAF.

Unfortunately, last Tuesday’s update contained a regular expression that backtracked enormously and exhausted CPU used for HTTP/HTTPS serving. This brought down Cloudflare’s core proxying, CDN and WAF functionality. The following graph shows CPUs dedicated to serving HTTP/HTTPS traffic spiking to nearly 100% usage across the servers in our network.

CPU utilization in one of our PoPs during the incident

This resulted in our customers (and their customers) seeing a 502 error page when visiting any Cloudflare domain. The 502 errors were generated by the front line Cloudflare web servers that still had CPU cores available but were unable to reach the processes that serve HTTP/HTTPS traffic.

We know how much this hurt our customers. We’re ashamed it happened. It also had a negative impact on our own operations while we were dealing with the incident.

It must have been incredibly stressful, frustrating and frightening if you were one of our customers. It was even more upsetting because we haven’t had a global outage for six years.

The CPU exhaustion was caused by a single WAF rule that contained a poorly written regular expression that ended up creating excessive backtracking. The regular expression that was at the heart of the outage is (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

Although the regular expression itself is of interest to many people (and is discussed more below), the real story of how the Cloudflare service went down for 27 minutes is much more complex than “a regular expression went bad”. We’ve taken the time to write out the series of events that led to the outage and kept us from responding quickly. And, if you want to know more about regular expression backtracking and what to do about it, then you’ll find it in an appendix at the end of this post.

What happened

Let’s begin with the sequence of events. All times in this blog are UTC.

At 13:42 an engineer working on the firewall team deployed a minor change to the rules for XSS detection via an automatic process. This generated a Change Request ticket. We use Jira to manage these tickets and a screenshot is below.

Three minutes later the first PagerDuty page went out indicating a fault with the WAF. This was a synthetic test that checks the functionality of the WAF (we have hundreds of such tests) from outside Cloudflare to ensure that it is working correctly. This was rapidly followed by pages indicating many other end-to-end tests of Cloudflare services failing, a global traffic drop alert, widespread 502 errors and then many reports from our points-of-presence (PoPs) in cities worldwide indicating there was CPU exhaustion.

Some of these alerts hit my watch and I jumped out of the meeting I was in and was on my way back to my desk when a leader in our Solutions Engineering group told me we had lost 80% of our traffic. I ran over to SRE where the team was debugging the situation. In the initial moments of the outage there was speculation it was an attack of some type we’d never seen before.

Cloudflare’s SRE team is distributed around the world, with continuous, around-the-clock coverage. Alerts like these, the vast majority of which are noting very specific issues of limited scopes in localized areas, are monitored in internal dashboards and addressed many times every day. This pattern of pages and alerts, however, indicated that something gravely serious had happened, and SRE immediately declared a P0 incident and escalated to engineering leadership and systems engineering.

The London engineering team was at that moment in our main event space listening to an internal tech talk. The talk was interrupted and everyone assembled in a large conference room while others dialed in. This wasn’t a normal problem that SRE could handle alone; it needed every relevant team online at once.

At 14:00 the WAF was identified as the component causing the problem and an attack dismissed as a possibility. The Performance Team pulled live CPU data from a machine that clearly showed the WAF was responsible. Another team member used strace to confirm. Another team saw error logs indicating the WAF was in trouble. At 14:02 the entire team looked at me when it was proposed that we use a ‘global kill’, a mechanism built into Cloudflare to disable a single component worldwide.

But getting to the global WAF kill was another story. Things stood in our way. We use our own products and with our Access service down we couldn’t authenticate to our internal control panel (and once we were back we’d discover that some members of the team had lost access because of a security feature that disables their credentials if they don’t use the internal control panel frequently).

And we couldn’t get to other internal services like Jira or the build system. To get to them we had to use a bypass mechanism that wasn’t frequently used (another thing to drill on after the event). Eventually, a team member executed the global WAF kill at 14:07 and by 14:09 traffic levels and CPU were back to expected levels worldwide. The rest of Cloudflare’s protection mechanisms continued to operate.

Then we moved on to restoring the WAF functionality. Because of the sensitivity of the situation we performed both negative tests (asking ourselves “was it really that particular change that caused the problem?”) and positive tests (verifying the rollback worked) in a single city using a subset of traffic after removing our paying customers’ traffic from that location.

At 14:52 we were 100% satisfied that we understood the cause and had a fix in place and the WAF was re-enabled globally.

How Cloudflare operates

Cloudflare has a team of engineers who work on our WAF Managed Rules product; they are constantly working to improve detection rates, lower false positives, and respond rapidly to new threats as they emerge. In the last 60 days, 476 change requests have been handled for the WAF Managed Rules (averaging one every 3 hours).

This particular change was to be deployed in “simulate” mode where real customer traffic passes through the rule but nothing is blocked. We use that mode to test the effectiveness of a rule and measure its false positive and false negative rate. But even in the simulate mode the rules actually need to execute and in this case the rule contained a regular expression that consumed excessive CPU.
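To make the point about simulate mode concrete, here is a minimal Python sketch of a rule evaluator. The `Rule` class, its field names and the `"pass"`/`"log"`/`"block"` results are assumptions made for illustration; this is not the actual WAF code (which is written in Lua). What it shows is that the regular expression executes in both modes and only the action taken on a match differs.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    rule_id: str         # hypothetical identifier, for illustration only
    pattern: re.Pattern  # compiled regular expression for the rule
    mode: str            # "simulate" or "block"


def evaluate(rule: Rule, request_body: str) -> str:
    """Return "pass", "log" or "block" for a single request body."""
    # The expression runs regardless of mode, so its CPU cost is always paid.
    matched = rule.pattern.search(request_body) is not None
    if not matched:
        return "pass"
    # Only the action taken on a match depends on the mode.
    return "log" if rule.mode == "simulate" else "block"
```

A slow expression is therefore just as expensive in simulate mode as it would be when blocking.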

As can be seen from the Change Request above there’s a deployment plan, a rollback plan and a link to the internal Standard Operating Procedure (SOP) for this type of deployment. The SOP for a rule change specifically allows it to be pushed globally. This is very different from all the software we release at Cloudflare where the SOP first pushes software to an internal dogfooding network point of presence (PoP) (which our employees pass through), then to a small number of customers in an isolated location, followed by a push to a large number of customers and finally to the world.

The process for a software release looks like this: We use git internally via BitBucket. Engineers working on changes push code which is built by TeamCity and when the build passes, reviewers are assigned. Once a pull request is approved the code is built and the test suite runs (again).

If the build and tests pass then a Change Request Jira is generated and the change has to be approved by the relevant manager or technical lead. Once approved deployment to what we call the “animal PoPs” occurs: DOG, PIG, and the Canaries.

The DOG PoP is a Cloudflare PoP (just like any of our cities worldwide) but it is used only by Cloudflare employees. This dogfooding PoP enables us to catch problems early before any customer traffic has touched the code. And it frequently does.

If the DOG test passes successfully code goes to PIG (as in “Guinea Pig”). This is a Cloudflare PoP where a small subset of customer traffic from non-paying customers passes through the new code.

If that is successful the code moves to the Canaries. We have three Canary PoPs spread across the world, and we run both paying and non-paying customer traffic through them on the new code as a final check for errors.

Cloudflare software release process

Once successful in Canary the code is allowed to go live. The entire DOG, PIG, Canary, Global process can take hours or days to complete, depending on the type of code change. The diversity of Cloudflare’s network and customers allows us to test code thoroughly before a release is pushed to all our customers globally. But, by design, the WAF doesn’t use this process because of the need to respond rapidly to threats.
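The DOG, PIG, Canary, Global progression described above can be sketched as a simple loop. This is an illustrative Python sketch, not our deployment tooling: the `deploy` and `healthy` callables and the `emergency` flag are hypothetical names, and rollback of a failed stage is left out for brevity.

```python
from typing import Callable, List

STAGES = ["DOG", "PIG", "CANARY", "GLOBAL"]


def staged_rollout(
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    emergency: bool = False,
) -> List[str]:
    """Deploy stage by stage and stop at the first unhealthy stage.

    Returns the stages that received the change. With emergency=True the
    staged path is skipped and the change goes straight to GLOBAL.
    """
    stages = ["GLOBAL"] if emergency else STAGES
    deployed = []
    for stage in stages:
        deploy(stage)
        deployed.append(stage)
        if not healthy(stage):
            break  # do not promote a change that failed its health check
    return deployed
```

For example, `staged_rollout(push, check)` with a `check` that fails in PIG stops after `["DOG", "PIG"]`, so the change never reaches the Canaries or the rest of the world.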

WAF Threats

In the last few years we have seen a dramatic increase in vulnerabilities in common applications. This has happened due to the increased availability of software testing tools, like fuzzing for example (we just posted a new blog on fuzzing here).

Source: https://cvedetails.com/

What is commonly seen is that a Proof of Concept (PoC) is created and often published on GitHub quickly, so that teams running and maintaining applications can test to make sure they have adequate protections. Because of this, it’s imperative that Cloudflare is able to react as quickly as possible to new attacks to give our customers a chance to patch their software.

A great example of how Cloudflare proactively provided this protection was through the deployment of our protections against the SharePoint vulnerability in May (blog here). Within a short space of time from publicised announcements, we saw a huge spike in attempts to exploit our customers’ SharePoint installations. Our team continuously monitors for new threats and writes rules to mitigate them on behalf of our customers.

The specific rule that caused last Tuesday’s outage was targeting Cross-site scripting (XSS) attacks. These too have increased dramatically in recent years.

Source: https://cvedetails.com/

The standard procedure for a WAF Managed Rules change indicates that Continuous Integration (CI) tests must pass prior to a global deploy. That happened normally last Tuesday and the rules were deployed. At 13:31 an engineer on the team had merged a Pull Request containing the change after it was approved.

At 13:37 TeamCity built the rules and ran the tests, giving it the green light. The WAF test suite tests that the core functionality of the WAF works and consists of a large collection of unit tests for individual matching functions. After the unit tests run the individual WAF rules are tested by executing a huge collection of HTTP requests against the WAF. These HTTP requests are designed to test requests that should be blocked by the WAF (to make sure it catches attacks) and those that should be let through (to make sure it isn’t over-blocking and creating false positives). What it didn’t do was test for runaway CPU utilization by the WAF. Examining the log files from previous WAF builds shows that no increase in test suite run time was observed with the rule that would ultimately cause CPU exhaustion on our edge.
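As an illustration of the kind of check that was missing, here is a minimal Python sketch that times each rule against a set of sample inputs and reports the rules that exceed a budget. It uses Python’s `re` module rather than PCRE, and the function name, the samples and the budget value are all assumptions made for the example. Wall-clock timing is noisy, and a check like this only detects a slow match after it finishes, so treat it as a sketch of the idea and not a finished profiler.

```python
import re
import time


def rules_over_budget(patterns, samples, budget_seconds):
    """Return the patterns whose slowest match over `samples` exceeds the budget."""
    offenders = []
    for source in patterns:
        compiled = re.compile(source)
        worst = 0.0
        for sample in samples:
            start = time.perf_counter()
            compiled.search(sample)
            worst = max(worst, time.perf_counter() - start)
        if worst > budget_seconds:
            offenders.append(source)
    return offenders


# Hypothetical usage: include padded inputs that never match, since those
# tend to trigger the most backtracking.
samples = ["x=x", "x=" + "x" * 200, "x" * 200]
print(rules_over_budget([r".*.*=.*", r"<script"], samples, budget_seconds=0.05))
```

A test suite could fail the build whenever the returned list is non-empty.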

With the tests passing, TeamCity automatically began deploying the change at 13:42.

Quicksilver

Because WAF rules are required to address emergent threats they are deployed using our Quicksilver distributed key-value (KV) store that can push changes globally in seconds. This technology is used by all our customers when making configuration changes in our dashboard or via the API and is the backbone of our service’s ability to respond to changes very, very rapidly.

We haven’t really talked about Quicksilver much. We previously used Kyoto Tycoon as a globally distributed key-value store, but we ran into operational issues with it and wrote our own KV store that is replicated across our more than 180 cities. Quicksilver is how we push changes to customer configuration, update WAF rules, and distribute JavaScript code written by customers using Cloudflare Workers.

From clicking a button in the dashboard or making an API call to change configuration to that change coming into effect takes seconds, globally. Customers have come to love this high speed configurability. And with Workers they expect near instant, global software deployment. On average Quicksilver distributes about 350 changes per second.

And Quicksilver is very fast. On average we hit a p99 of 2.29s for a change to be distributed to every machine worldwide. Usually, this speed is a great thing. It means that when you enable a feature or purge your cache you know that it’ll be live globally nearly instantly. When you push code with Cloudflare Workers it’s pushed out at the same speed. This is part of the promise of Cloudflare: fast updates when you need them.

However, in this case, that speed meant that a change to the rules went global in seconds. You may notice that the WAF code uses Lua. Cloudflare makes use of Lua extensively in production and details of the Lua in the WAF have been discussed before. The Lua WAF uses PCRE internally and it uses backtracking for matching and has no mechanism to protect against a runaway expression. More on that and what we’re doing about it below.

Everything that occurred up to the point the rules were deployed was done “correctly”: a pull request was raised, it was approved, CI/CD built the code and tested it, a change request was submitted with an SOP detailing rollout and rollback, and the rollout was executed.

Cloudflare WAF deployment process

What went wrong

As noted, we deploy dozens of new rules to the WAF every week, and we have numerous systems in place to prevent any negative impact of that deployment. So when things do go wrong, it’s generally the unlikely convergence of multiple causes. Getting to a single root cause, while satisfying, may obscure the reality. Here are the multiple vulnerabilities that converged to get to the point where Cloudflare’s service for HTTP/HTTPS went offline.

  1. An engineer wrote a regular expression that could easily backtrack enormously.
  2. A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.
  3. The regular expression engine being used didn’t have complexity guarantees.
  4. The test suite didn’t have a way of identifying excessive CPU consumption.
  5. The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.
  6. The rollback plan required running the complete WAF build twice, which took too long.
  7. The first alert for the global traffic drop took too long to fire.
  8. We didn’t update our status page quickly enough.
  9. We had difficulty accessing our own systems because of the outage, and staff weren’t well trained on the bypass procedure.
  10. SREs had lost access to some systems because their credentials had been timed out for security reasons.
  11. Our customers were unable to access the Cloudflare Dashboard or API because they pass through the Cloudflare edge.

What’s happened since last Tuesday

Firstly, we stopped all release work on the WAF completely and are doing the following:

  1. Re-introducing the excessive CPU usage protection that got removed. (Done)
  2. Manually inspecting all 3,868 rules in the WAF Managed Rules to find and correct any other instances of possible excessive backtracking. (Inspection complete)
  3. Introducing performance profiling for all rules to the test suite. (ETA: July 19)
  4. Switching to either the re2 or Rust regex engine, which both have run-time guarantees. (ETA: July 31)
  5. Changing the SOP to do staged rollouts of rules in the same manner used for other software at Cloudflare while retaining the ability to do emergency global deployment for active attacks.
  6. Putting in place an emergency ability to take the Cloudflare Dashboard and API off Cloudflare’s edge.
  7. Automating update of the Cloudflare Status page.

In the longer term we are moving away from the Lua WAF that I wrote years ago. We are porting the WAF to use the new firewall engine. This will make the WAF both faster and add yet another layer of protection.

Conclusion

This was an upsetting outage for our customers and for the team. We responded quickly to correct the situation and are correcting the process deficiencies that allowed the outage to occur. We are also going deeper to protect against any further possible problems with the way we use regular expressions by replacing the underlying technology used.

We are ashamed of the outage and sorry for the impact on our customers. We believe the changes we’ve made mean such an outage will never recur.

Appendix: About Regular Expression Backtracking

To fully understand how (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*))) caused CPU exhaustion you need to understand a little about how a standard regular expression engine works. The critical part is .*(?:.*=.*). The (?: and matching ) are a non-capturing group (i.e. the expression inside the parentheses is grouped together as a single expression).

For the purposes of the discussion of why this pattern causes CPU exhaustion we can safely ignore it and treat the pattern as .*.*=.*. When reduced to this, the pattern obviously looks unnecessarily complex; but what’s important is that any “real-world” expression (like the complex ones in our WAF rules) that asks the engine to “match anything followed by anything” can lead to catastrophic backtracking. Here’s why.

In a regular expression, . means match a single character, .* means match zero or more characters greedily (i.e. match as much as possible) so .*.*=.* means match zero or more characters, then match zero or more characters, then find a literal = sign, then match zero or more characters.

Consider the test string x=x. This will match the expression .*.*=.*. The .*.* before the equal can match the first x (one of the .* matches the x, the other matches zero characters). The .* after the = matches the final x.

It takes 23 steps for this match to happen. The first .* in .*.*=.* acts greedily and matches the entire x=x string. The engine moves on to consider the next .*. There are no more characters left to match so the second .* matches zero characters (that’s allowed). Then the engine moves on to the =. As there are no characters left to match (the first .* having consumed all of x=x) the match fails.

At this point the regular expression engine backtracks. It returns to the first .* and matches it against x= (instead of x=x) and then moves onto the second .*. That .* matches the second x and now there are no more characters left to match. So when the engine tries to match the = in .*.*=.* the match fails. The engine backtracks again.

This time it backtracks so that the first .* is still matching x= but the second .* no longer matches x; it matches zero characters. The engine then moves on to try to find the literal = in the .*.*=.* pattern but it fails (because it was already matched against the first .*). The engine backtracks again.

This time the first .* matches just the first x. But the second .* acts greedily and matches =x. You can see what’s coming. When it tries to match the literal = it fails and backtracks again.

The first .* still matches just the first x. Now the second .* matches just =. But, you guessed it, the engine can’t match the literal = because the second .* matched it. So the engine backtracks again. Remember, this is all to match a three character string.

Finally, with the first .* matching just the first x, the second .* matching zero characters the engine is able to match the literal = in the expression with the = in the string. It moves on and the final .* matches the final x.

23 steps to match x=x. Here’s a short video of that using the Perl Regexp::Debugger showing the steps and backtracking as they occur.

That’s a lot of work but what happens if the string is changed from x=x to x=xx? This time it takes 33 steps to match. And if the input is x=xxx it takes 45. That’s not linear. Here’s a chart showing matching from x=x to x=xxxxxxxxxxxxxxxxxxxx (20 x’s after the =). With 20 x’s after the = the engine takes 555 steps to match! (Worse, if the x= was missing, so the string was just 20 x’s, the engine would take 4,067 steps to find the pattern doesn’t match).
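The growth is easy to reproduce with a toy matcher. The Python sketch below is a deliberately naive backtracking matcher for patterns built only from greedy .* and literal characters; it is an assumption-laden illustration, not PCRE or Perl’s engine. It counts one step per call, so its numbers differ from the Regexp::Debugger counts quoted above (what counts as a “step” is defined differently), but the super-linear shape is the same.

```python
def count_steps(pattern, text):
    """Match `pattern` at the start of `text` by naive backtracking.

    `pattern` is a list of tokens: ".*" means greedy "zero or more of any
    character"; any other token is a single literal character.
    Returns (matched, steps), where steps is the number of match attempts.
    """
    steps = 0

    def match(pi, ti):
        nonlocal steps
        steps += 1
        if pi == len(pattern):
            return True  # every token has been matched
        token = pattern[pi]
        if token == ".*":
            # Greedy: try consuming as much as possible first, then back off
            # one character at a time.
            for end in range(len(text), ti - 1, -1):
                if match(pi + 1, end):
                    return True
            return False
        if ti < len(text) and text[ti] == token:
            return match(pi + 1, ti + 1)
        return False

    return match(0, 0), steps


PATTERN = [".*", ".*", "=", ".*"]  # the reduced pattern .*.*=.*
print(count_steps(PATTERN, "x=x"))             # (True, 12)
print(count_steps(PATTERN, "x=" + "x" * 20))   # (True, 278)
```

In this sketch the input grows about seven times longer while the step count grows more than twenty-fold.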

This video shows all the backtracking necessary to match x=xxxxxxxxxxxxxxxxxxxx:

That’s bad because as the input size goes up the match time goes up super-linearly. But things could have been even worse with a slightly different regular expression. Suppose it had been .*.*=.*; (i.e. there’s a literal semicolon at the end of the pattern). This could easily have been written to try to match an expression like foo=bar;.

This time the backtracking would have been catastrophic. To match x=x takes 90 steps instead of 23. And the number of steps grows very quickly. Matching x= followed by 20 x’s takes 5,353 steps. Here’s the corresponding chart. Look carefully at the Y-axis values compared to the previous chart.

To complete the picture here are all 5,353 steps of failing to match x=xxxxxxxxxxxxxxxxxxxx against .*.*=.*;

Using lazy rather than greedy matches helps control the amount of backtracking that occurs in this case. If the original expression is changed to .*?.*?=.*? then matching x=x takes 11 steps (instead of 23) and so does matching x=xxxxxxxxxxxxxxxxxxxx. That’s because the ? after the .* instructs the engine to match the smallest number of characters first before moving on.
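To see what greedy and lazy matching each end up doing with x=x, capturing groups can be put around each .* so that the final assignment is visible. This small example uses Python’s `re` module (a backtracking engine, though not the PCRE engine used in the WAF), so it illustrates the semantics and says nothing about step counts.

```python
import re

text = "x=x"

# Greedy: each .* first grabs as much as it can and gives characters back
# only when the rest of the pattern fails to match.
greedy = re.match(r"(.*)(.*)=(.*)", text).groups()

# Lazy: each .*? first tries to match nothing and grows only when needed.
lazy = re.match(r"(.*?)(.*?)=(.*?)", text).groups()

print(greedy)  # ('x', '', 'x')
print(lazy)    # ('', 'x', '')
```

Both forms match, but they arrive at different splits of the string: the greedy version matches what the walkthrough above ends with, while the lazy version stops as soon as the literal = has been found.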

But laziness isn’t the total solution to this backtracking behaviour. Changing the catastrophic example .*.*=.*; to .*?.*?=.*?; doesn’t change its run time at all. x=x still takes 90 steps and x= followed by 20 x’s still takes 5,353 steps.

The only real solution, short of fully re-writing the pattern to be more specific, is to move away from a regular expression engine with this backtracking mechanism. Which we are doing within the next few weeks.

The solution to this problem has been known since 1968 when Ken Thompson wrote a paper titled “Programming Techniques: Regular expression search algorithm”. The paper describes a mechanism for converting a regular expression into an NFA (non-deterministic finite automaton) and then following the state transitions in the NFA using an algorithm that executes in time linear in the size of the string being matched against.

Thompson’s paper doesn’t actually talk about NFAs but the linear time algorithm is clearly explained and an ALGOL-60 program that generates assembly language code for the IBM 7094 is presented. The implementation may be arcane but the idea it presents is not.

Here’s what the .*.*=.* regular expression would look like when diagrammed in a similar manner to the pictures in Thompson’s paper.

Figure 0 has five states starting at 0. There are three loops which begin with the states 1, 2 and 3. These three loops correspond to the three .* in the regular expression. The three lozenges with dots in them match a single character. The lozenge with an = sign in it matches the literal = sign. State 4 is the ending state, if reached then the regular expression has matched.

To see how such a state diagram can be used to match the regular expression .*.*=.* we’ll examine matching the string x=x. The program starts in state 0 as shown in Figure 1.

The key to making this algorithm work is that the state machine is in multiple states at the same time. The NFA will take every transition it can, simultaneously.

Even before it reads any input, it immediately transitions to both states 1 and 2 as shown in Figure 2.

Looking at Figure 2 we can see what happens when it considers the first x in x=x. The x can match the top dot by transitioning from state 1 and back to state 1. Or the x can match the dot below it by transitioning from state 2 and back to state 2.

So after matching the first x in x=x the states are still 1 and 2. It’s not possible to reach state 3 or 4 because a literal = sign is needed.

Next the algorithm considers the = in x=x. Much like the x before it, it can be matched by either of the top two loops transitioning from state 1 to state 1 or state 2 to state 2, but additionally the literal = can be matched and the algorithm can transition from state 2 to state 3 (and immediately to state 4). That’s illustrated in Figure 3.

Next the algorithm reaches the final x in x=x. From states 1 and 2 the same transitions are possible back to states 1 and 2. From state 3 the x can match the dot on the right and transition back to state 3.

At that point every character of x=x has been considered and because state 4 has been reached the regular expression matches that string. Each character was processed once so the algorithm was linear in the length of the input string. And no backtracking was needed.

It might also be obvious that once state 4 was reached (after x= was matched) the regular expression had matched and the algorithm could terminate without considering the final x at all.

This algorithm is linear in the size of its input.
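The same idea can be sketched in a few lines of Python. This is a simplified illustration in the spirit of Thompson’s approach, not his ALGOL-60 program and not the implementation used by re2 or the Rust regex engine. It supports only greedy .* and literal characters, and its states are simply positions in the token list, so the state numbers do not line up exactly with the figures described above. The matcher keeps a set of all states it could be in and advances every one of them on each input character.

```python
def closure(pattern, states):
    """Add states reachable without consuming input (a .* may match nothing)."""
    seen = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        if s < len(pattern) and pattern[s] == ".*" and s + 1 not in seen:
            seen.add(s + 1)
            stack.append(s + 1)
    return seen


def nfa_match(pattern, text):
    """Simulate all possible states at once; returns (matched, steps)."""
    accept = len(pattern)
    states = closure(pattern, {0})
    steps = 0
    for ch in text:
        if accept in states:
            return True, steps  # already matched; remaining input is irrelevant
        next_states = set()
        for s in states:
            steps += 1
            token = pattern[s]
            if token == ".*":
                next_states.add(s)      # loop: .* consumes this character
            elif token == ch:
                next_states.add(s + 1)  # the literal matches, move forward
        states = closure(pattern, next_states)
        if not states:
            return False, steps
    return accept in states, steps


PATTERN = [".*", ".*", "=", ".*"]  # the reduced pattern .*.*=.*
print(nfa_match(PATTERN, "x=x"))        # (True, 6)
print(nfa_match(PATTERN, "x" * 1000))   # (False, 3000)
```

Each character is examined once per active state, and the number of states is bounded by the size of the pattern, so the work never exceeds (pattern length + 1) × (input length) and no backtracking occurs.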


