AMD is Cloudflare’s 10th-generation Edge server CPU


More than 1 billion unique IP addresses pass through the Cloudflare Network each day, serving on average 11 million HTTP requests per second and operating within 100ms of 95% of the Internet-connected population globally. Our network spans 200 cities in more than 90 countries, and our engineering teams have built an extremely fast and reliable infrastructure.

We’re extremely proud of our work and are determined to help make the Internet a better and more secure place. Cloudflare engineers who are involved with hardware get down to servers and their components to understand and select the best hardware to maximize the performance of our stack.

Our software stack is compute intensive and is very much CPU bound, driving our engineers to work continuously at optimizing Cloudflare’s performance and reliability at all layers of our stack. With the server, a straightforward solution for increasing computing power is to have more CPU cores. The more cores we can include in a server, the more output we can expect. This is important for us since the diversity of our products and customers has grown over time with increasing demand that requires our servers to do more. To help us drive compute performance, we needed to increase core density and that’s what we did. Below is the processor detail for servers we’ve deployed since 2015, including the core counts:

Gen 6Gen 7Gen 8Gen 9
Start of service2015201620172018
CPUIntel Xeon E5-2630 v3Intel Xeon E5-2630 v4Intel Xeon Silver 4116Intel Xeon Platinum 6162
Physical Cores2 x 82 x 102 x 122 x 24
TDP2 x 85W2 x 85W2 x 85W2 x 150W
TDP per Core10.65W8.50W7.08W6.25W

In 2018, we made a big jump in total number of cores per server with Gen 9. Our physical footprint was reduced by 33% compared to Gen 8, giving us increased capacity and computing power per rack. Thermal Design Power (TDP aka typical power usage) are mentioned above to highlight that we’ve also been more power efficient over time. Power efficiency is important to us: first, because we’d like to be as carbon friendly as we can; and second, so we can better utilize our provisioned power supplied by the data centers. But we know we can do better.

Our main defining metric is Requests per Watt. We can increase our Requests per Second number with more cores, but we have to stay within our power budget envelope. We are constrained by the data centers’ power infrastructure which, along with our selected power distribution units, leads us to power cap for each server rack. Adding servers to a rack obviously adds more power draw increasing power consumption at the rack level. Our Operational Costs significantly increase if we go over a rack’s power cap and have to provision another rack. What we need is more compute power inside the same power envelope which will drive a higher (better) Requests per Watt number – our key metric.

As you might imagine, we look at power consumption carefully in the design stage. From the above you can see that it’s not worth the time for us to deploy more power-hungry CPUs if TDP per Core is higher than our current generation which would hurt our Requests per Watt metric. As we started looking at production ready systems to power our Gen X solution, we took a long look at what is available to us in the market today and we’ve made our decision. We’re moving on from Gen 9’s 48-core setup of dual socket Intel® Xeon® Platinum 6162‘s to a 48-core single socket AMD EPYC™ 7642.

Gen X server setup with single socket 48-core AMD EPYC 7642

IntelAMD
CPUXeon Platinum 6162EPYC 7642
Microarchitecture“Skylake”“Zen 2”
Codename“Skylake SP”“Rome”
Process14nm7nm
Physical Cores2 x 2448
Frequency1.9 GHz2.4 GHz
L3 Cache / socket24 x 1.375MiB16 x 16MiB
Memory / socket6 channels, up to DDR4-24008 channels, up to DDR4-3200
TDP2 x 150W225W
PCIe / socket48 lanes128 lanes
ISAx86-64x86-64

From the specs, we see that with the AMD chip we get to keep the same amount of cores and lower TDP. Gen 9’s TDP per Core was 6.25W, Gen X’s will be 4.69W… That’s a 25% decrease. With higher frequency, and perhaps going to a more simplified setup of single socket, we can speculate that the AMD chip will perform better. We’re walking through a series of tests, simulations, and live production results in the rest of this blog to see how much better AMD performs.

As a side note before we go further, TDP is a simplified metric from the manufacturers’ datasheets that we use in the early stages of our server design and CPU selection process. A quick Google search leads to thoughts that AMD and Intel define TDP differently, which basically makes the spec unreliable. Actual CPU power draw, and more importantly server system power draw, are what we really factor in our final decisions.

Ecosystem Readiness

At the beginning of our journey to choose our next CPU, we got a variety of processors from different vendors that could fit well with our software stack and services, which are written in C, LuaJIT, and Go. More details about benchmarking for our stack were explained when we benchmarked Qualcomm’s ARM® chip in the past. We’re going to go through the same suite of tests as Vlad’s blog this time around since it is a quick and easy “sniff test”. This allows us to test a bunch of CPUs within a manageable time period before we commit to spend more engineering effort and need to apply our software stack.

We tried a variety of CPUs with different number of cores, sockets, and frequencies. Since we’re explaining how we chose the AMD EPYC 7642, all of the graphs in this blog focus on how AMD compares with our Gen 9’s Intel Xeon Platinum 6162 CPU as a baseline.

Our results correspond to server node for both CPUs tested; meaning the numbers pertain to 2x 24-core processors for Intel, and 1x 48-core processor for AMD – a two socket Intel based server and a one socket AMD EPYC powered server. Before we started our testing, we changed the Cloudflare lab test servers’ BIOS settings to match our production server settings. This gave us CPU frequencies yields for AMD at 3.03 Ghz and Intel at 2.50 Ghz on average with very little variation. With gross simplification, we expect that with the same amount of cores AMD would perform about 21% better than Intel. Let’s start with our crypto tests.

Cryptography

Looking promising for AMD. In public key cryptography, it does 18% better. Meanwhile for symmetric key, AMD loses on AES-128-GCM but it’s comparable overall.

Compression

We do a lot of compression at the edge to save bandwidth and help deliver content faster. We go through both zlib and brotli libraries written in C. All tests are done on blog.cloudflare.com HTML file in memory.

AMD wins by an average of 29% using gzip across all qualities. It does even better with brotli with tests lower than quality 7, which we use for dynamic compression. There’s a throughput cliff starting brotli-9 which Vlad’s explanation is that Brotli consumes lots of memory and thrashes cache. Nevertheless, AMD wins by a healthy margin.

A lot of our services are written in Go. In the following graphs we’re redoing the crypto and compression tests in Go along with RegExp on 32KB strings and the strings library.

Go Cryptography

Go Compression

Go Regexp

Go Strings

AMD performs better in all of our Go benchmarks except for ECDSA P256 Sign losing by 38%, which is peculiar since with the test in C it does 24% better. It’s worth investigating what’s going on here. Other than that, AMD doesn’t win by as much of a margin but it still proves to be better.

LuaJIT

We rely a lot on LuaJIT in our stack. As Vlad said, it’s the glue that holds Cloudflare together. We’re glad to show that AMD wins here as well.

Overall our tests show a single EPYC 7642 to be more competitive than two Xeon Platinum 6162. While there are a couple of tests where AMD loses out such as OpenSSL AES-128-GCM and Go OpenSSL ECDSA-P256 Sign, AMD wins in all the others. By scanning quickly and treating all tests equally, AMD does on average 25% better than Intel.

Performance Simulations

After our ‘sniff’ tests, we put our servers through another series of emulations which apply synthetic workloads simulating our edge software stack. Here we are simulating workloads of scenarios with different types of requests we see in production. Types of requests vary from asset size, whether they go through HTTP or HTTPS, WAF, Workers, or one of many additional variables. Below shows the throughput comparison between the two CPUs of the types of requests we see most typically.

The results above are ratios using Gen 9’s Intel CPUs as the baseline normalized at 1.0 on the X-axis. For example, looking at simple requests of 10KiB assets over HTTPS, we see that AMD does 1.50x better than Intel in Requests per Second. On average for the tests shown on the graph above, AMD performs 34% better than Intel. Considering that the TDP for the single AMD EPYC 7642 is 225W, when compared to two Intel’s being 300W, we’re looking at AMD delivering up to 2.0x better Requests per Watt vs. the Intel CPUs!

By this time, we were already leaning heavily toward a single socket setup with AMD EPYC 7642 as our CPU for Gen X. We were excited to see exactly how well AMD EPYC servers would do in production, so we immediately shipped a number of the servers out to some of our data centers.

Live Production

Step one of course was to get all our test servers set up for a production environment. All of our machines in the fleet are loaded with the same processes and services which makes for a great apples-to-apples comparison.  Like data centers everywhere, we have multiple generations of servers deployed and we deploy our servers in clusters such that each cluster is pretty homogeneous by server generation. In some environments this can lead to varying utilization curves between clusters.  This is not the case for us. Our engineers have optimized CPU utilization across all server generations so that no matter if the machine’s CPU has 8 cores or 24 cores, CPU usage is generally the same.

As you can see above and to illustrate our ‘similar CPU utilization’ comment, there is no significant difference in CPU usage between Gen X AMD powered servers and Gen 9 Intel based servers. This means both test and baseline servers are equally loaded. Good. This is exactly what we want to see with our setup, to have a fair comparison. The 2 graphs below show the comparative number of requests processed at the CPU single core and all core (server) level.

We see that AMD does on average about 23% more requests. That’s really good! We talked a lot about bringing more muscle in the Gen 9 blog. We have the same number of cores, yet AMD does more work, and does it with less power. Just by looking at the specs for number of cores and TDP in the beginning, it’s really nice to see that AMD also delivers significantly more performance with better power efficiency.

But as we mentioned earlier, TDP isn’t a standardized spec across manufacturers so let’s look at real power usage below. Measuring server power consumption along with requests per second (RPS) yields the graph below:

Observing our servers request rate over their power consumption, the AMD Gen X server performs 28% better. While we could have expected more out of AMD since its TDP is 25% lower, keep in mind that TDP is very ambiguous. In fact, we saw that AMD actual power draw ran nearly at spec TDP with its much higher than base frequency;  Intel was far from it. Another reason why TDP is becoming a less reliable estimate of power draw. Moreover, CPU is just one component contributing to the overall power of the system. Let’s remind that Intel CPUs are integrated in a multi-node system as described in the Gen 9 blog, while AMD is in a regular 1U form-factor machine. That actually doesn’t favor AMD since multi-node systems are designed for high density capabilities at lower power per node, yet it still outperformed the Intel system on a power per node basis anyway.

Through the majority of comparisons from the datasheets, test simulations, and live production performance, the 1P AMD EPYC 7642 configuration performed significantly better than the 2P Intel Xeon 6162. We’ve seen in some environments that AMD can do up to 36% better in live production and we believe we can achieve that consistently with some optimization on both our hardware and software.

So that’s it. AMD wins.

The additional graphs below show the median and p99 NGINX processing mostly on-CPU time latencies between the two CPUs throughout 24 hours. On average, AMD processes about 25% faster. At p99, it does about 20-50% depending on the time of day.

Conclusion

Hardware and Performance engineers at Cloudflare do significant research and testing to figure out the best server configuration for our customers. Solving big problems like this is why we love working here, and we’re also helping solving yours with our services like serverless edge compute and the array of security solutions such as Magic Transit, Argo Tunnel, and DDoS protection. All of our servers on the Cloudflare Network are designed to make our products work reliably, and we strive to make each new generation of our server design better than its predecessor. We believe the AMD EPYC 7642 is the answer for our Gen X’s processor question.

With Cloudflare Workers, developers have enjoyed deploying their applications to our Network, which is ever expanding across the globe. We’ve been proud to empower our customers by letting them focus on writing their code while we are managing the security and reliability in the cloud. We are now even more excited to say that their work will be deployed on our Gen X servers powered by 2nd Gen AMD EPYC processors.

Expanding Rome to a data center near you

Thanks to AMD, using the EPYC 7642 allows us to increase our capacity and expand into more cities easier. Rome wasn’t built in one day, but it will be very close to many of you.

In the last couple of years, we’ve been experimenting with many Intel and AMD x86 chips along with ARM CPUs. We look forward to having these CPU manufacturers partner with us for future generations so that together we can help build a better Internet.



Source link