How to discard 10 million packets per second

By Marek Majkowski.

Our in-house DDoS mitigation team is sometimes called "the packet droppers". While other teams build exciting products that do smart things with the traffic passing through our network, we take joy in discovering novel ways of discarding it.


The ability to quickly discard packets is critical to surviving DDoS attacks.

It sounds easy, but packets arriving at a server can be discarded at several stages, and each technique has its advantages and limitations. In this blog post, let's review all the techniques we have tried so far.

Test Benchmark

To visualize the relative performance of each technique, we will first look at some numbers. The benchmarks are synthetic tests, so they may differ somewhat from real-world numbers. For testing we will use an Intel server with a 10Gbps network card. Since this test is meant to show the limits of the operating system, not the hardware, we won't go into the hardware details.

The test settings are as follows:

  • Bulk delivery of small UDP packets to reach 14 Mpps (Mpps = one million packets per second)
  • This traffic is forwarded to a single CPU on the test server
  • Measuring the number of packets handled by the kernel on a single CPU

The goal of the test is not to maximize the packet-processing speed of a user-space application, but to find the bottlenecks in the kernel.

The synthetic traffic is crafted to put maximum load on conntrack: it uses randomized source IP and port fields. In tcpdump it looks like this:

  $ tcpdump -ni vlan100 -c 10 -t udp and dst port 1234
IP> UDP, length 16
IP> UDP, length 16
IP> UDP, length 16
IP> UDP, length 16
IP> UDP, length 16
IP> UDP, length 16
IP> UDP, length 16
IP> UDP, length 16
IP> UDP, length 16
IP> UDP, length 16

On the target server, all packets are steered to a single RX queue, so only one CPU is used. We set this up with hardware flow steering:

  ethtool -N ext0 flow-type udp4 dst-ip dst-port 1234 action 2

Benchmarking is always hard. While preparing these tests we learned that any active raw sockets hurt performance. It's obvious in hindsight, but easy to miss. Before starting a test, make sure no stray tcpdump process is running. You can find offending live processes with the following command:

  $ ss -A raw,packet_raw -l -p | cat
Netid  State  Recv-Q  Send-Q  Local Address:Port
p_raw  UNCONN 525157  0       *:vlan100     users:(("tcpdump",pid=23683,fd=3))

Finally, turn off Intel Turbo Boost on the server:

  echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

Turbo Boost is a great feature that increases performance by at least 20%, but it also makes the standard deviation in these tests terrible. With Turbo enabled we saw a ±1.5% deviation in the measurements; with it turned off, a much more manageable 0.25%.

Step 1: Dropping packets from the application

Let's start with the idea of delivering packets to an application and discarding them in user-space code. For the test setup, make sure iptables does not affect performance:

  iptables -I PREROUTING -t mangle -d -p udp --dport 1234 -j ACCEPT
iptables -I PREROUTING -t raw -d -p udp --dport 1234 -j ACCEPT
iptables -I INPUT -t filter -d -p udp --dport 1234 -j ACCEPT

The application code is a simple loop that receives data and immediately discards it in user space:

  s = socket.socket(AF_INET, SOCK_DGRAM)
s.bind(("", 1234))
while True:
    s.recvmmsg([...])

Once the code is ready, let's run it:

  $ ./dropping-packets/recvmmsg-loop
packets = 171261 bytes = 1940176
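For reference, the pseudocode above can be fleshed out into a runnable sketch. Python's standard library exposes recvmsg() but not recvmmsg(), so this hypothetical drop_loop() helper (not the actual benchmark tool) reads one datagram per syscall and just counts what it throws away:

```python
import socket

def drop_loop(sock, max_packets=None):
    # Receive datagrams from an already-bound UDP socket and discard them,
    # counting packets and bytes, mirroring the "packets = ... bytes = ..."
    # output above. max_packets=None loops forever, as in the benchmark.
    packets = 0
    nbytes = 0
    while max_packets is None or packets < max_packets:
        data = sock.recv(4096)  # pull one datagram out of the socket...
        packets += 1            # ...and simply drop it on the floor
        nbytes += len(data)
    return packets, nbytes
```

To mimic the benchmark, bind a UDP socket to port 1234 and call drop_loop() on it.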

As measured using ethtool, and with mmwatch, our simple tool, this test receives only about 175kpps at the hardware receive queue:

  $ mmwatch 'ethtool -S ext0 | grep rx_2'
 rx2_packets: 174.0k/s

The hardware is technically capable of receiving 14 Mpps off the wire, but not in a single RX queue handled by one CPU doing kernel work. mpstat confirms this:

  $ watch 'mpstat -u -I SUM -P ALL 1 1 | egrep -v Aver'
01:32:05 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:32:06 PM    0    0.00    0.00    0.00    2.94    0.00    3.92    0.00    0.00    0.00   93.14
01:32:06 PM    1    2.17    0.00   27.17    0.00    0.00    0.00    0.00    0.00    0.00   70.65
01:32:06 PM    2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
01:32:06 PM    3    0.95    0.00    1.90    0.95    0.00    3.81    0.00    0.00    0.00   92.38

CPU #1 spends 27% in system and 2% in user space, so the application code is not the bottleneck; CPU #2, however, spends 100% of its time in SOFTIRQ.

As a side note, it is important to use recvmmsg(2) in this setup. On Spectre-mitigated kernels, system calls have become more expensive because of KPTI and retpoline in kernel 4.14:

  $ tail -n +1 /sys/devices/system/cpu/vulnerabilities/*
==> /sys/devices/system/cpu/vulnerabilities/meltdown <==
Mitigation: PTI

==> /sys/devices/system/cpu/vulnerabilities/spectre_v1 <==
Mitigation: __user pointer sanitization

==> /sys/devices/system/cpu/vulnerabilities/spectre_v2 <==
Mitigation: Full generic retpoline, IBPB, IBRS_FW

Step 2: conntrack load

This test, with its randomized source IPs and ports, was designed specifically to load the conntrack layer. The number of conntrack entries shows that the table reached its maximum during the test:

  $ conntrack -C

$ sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 2097152

You can also see conntrack complaining in dmesg:

  [4029612.456673] nf_conntrack: nf_conntrack: table full, dropping packet
[4029612.465787] nf_conntrack: nf_conntrack: table full, dropping packet
[4029617.175957] net_ratelimit: 5731 callbacks suppressed

Let's turn this feature off to speed things up:

  iptables -t raw -I PREROUTING -d -p udp -m udp --dport 1234 -j NOTRACK

And let's test again:

  $ ./dropping-packets/recvmmsg-loop
packets = 331008 bytes = 5296128

This alone raised the application's receive performance to 333kpps. Hooray!

PS. With SO_BUSY_POLL you can push this to 470kpps, but that is beyond the scope of this post.

Step 3: BPF drop on a socket

Going further: why deliver packets to user space at all? Though not commonly used, you can attach a classic BPF filter to a SOCK_DGRAM socket with setsockopt(SO_ATTACH_FILTER), and program a filter that discards packets while still in kernel space.

The code is available here. Let's try it:

  $ ./bpf-drop
packets = 0 bytes = 0

With BPF drops (both classic BPF and extended eBPF show roughly the same performance here) we can handle approximately 512kpps. A BPF filter discards packets while still in software-interrupt mode, saving the CPU needed to wake the user-space application.
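As a minimal sketch of the SO_ATTACH_FILTER idea (hypothetical code, not the linked program): the classic BPF filter below is a single RET 0 instruction, which tells the kernel to drop every packet on the socket. SO_ATTACH_FILTER is Linux-only; a real filter would match ports and addresses instead of dropping everything:

```python
import ctypes
import socket
import struct

SO_ATTACH_FILTER = 26  # Linux-only socket option, from <asm-generic/socket.h>

def attach_drop_all(sock):
    # One classic-BPF instruction: BPF_RET | BPF_K (0x06) with k == 0,
    # i.e. "accept 0 bytes of this packet" -- drop everything.
    # struct sock_filter { __u16 code; __u8 jt; __u8 jf; __u32 k; }
    insns = struct.pack("HBBI", 0x06, 0, 0, 0)
    buf = ctypes.create_string_buffer(insns, len(insns))
    # struct sock_fprog { unsigned short len; struct sock_filter *filter; }
    prog = struct.pack("HL", 1, ctypes.addressof(buf))
    sock.setsockopt(socket.SOL_SOCKET, SO_ATTACH_FILTER, prog)
```

The kernel copies the filter during setsockopt(), so the instruction buffer does not need to outlive the call.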

Step 4: iptables DROP after routing

The next step is to add an iptables rule that drops packets in the firewall's INPUT chain:

  iptables -I INPUT -d -p udp --dport 1234 -j DROP

Remember that conntrack is already disabled with the -j NOTRACK rule. These two rules together reach 608kpps.

Here are the iptables counters:

  $ mmwatch 'iptables -L -v -n -x | head'

Chain INPUT (policy DROP 0 packets, 0 bytes)
    pkts      bytes target  prot opt in  out source  destination
605.9k/s    26.7m/s DROP    udp  --  *   *  udp dpt:1234

600kpps is not bad, but I can do better!

Step 5: iptables DROP in PREROUTING

An even faster technique is to discard packets before they are routed, using this rule:

  iptables -I PREROUTING -t raw -d -p udp --dport 1234 -j DROP

This will give you a whopping 1.688 Mpps.

This change improves performance dramatically, and we don't fully understand why. Either the routing layer is unusually complex, or there is a bug in our server configuration.

In any case: raw-table iptables rules are significantly faster.

Step 6: nftables DROP before CONNTRACK

These days iptables is considered legacy; the new trend is nftables. See this technical video for why nftables is superior. nftables is faster than the aging iptables for a number of reasons; one rumor is that retpolines (i.e., no speculation of indirect jumps) significantly hurt iptables performance.

This article is not a speed comparison of nftables versus iptables, so let's just try the fastest drop we can think of:

  nft add table netdev filter
nft -- add chain netdev filter input { type filter hook ingress device vlan100 priority -500 \; policy accept \; }
nft add rule netdev filter input ip daddr udp dport 1234 counter drop
nft add rule netdev filter input ip6 daddr fd00::/64 udp dport 1234 counter drop

Counters can be viewed through the following command:

  $ mmwatch 'nft --handle list chain netdev filter input'
table netdev filter {
    chain input {
        type filter hook ingress device vlan100 priority -500; policy accept;
        ip daddr udp dport 1234 counter packets 1.6m/s bytes 69.6m/s drop # handle 2
        ip6 daddr fd00::/64 udp dport 1234 counter packets 0 bytes 0 drop # handle 3
    }
}
The "ingress" hook of nftables achieved about 1.53mpps, slightly slower than the iptables PREROUTING layer. This is puzzling: technically "ingress" is processed before PREROUTING, so it should be faster.

In this test, nftables was slightly slower than iptables. Still, nftables would be better. : P

Step 7: tc ingress handler DROP

Somewhat surprisingly, the tc (traffic control) ingress hook runs even before PREROUTING. tc can select, and even drop, packets based on basic rules. The syntax is somewhat arcane, so we recommend using this script to set it up. We need a slightly more complex match, and the command line looks like this:

  tc qdisc add dev vlan100 ingress
tc filter add dev vlan100 parent ffff: prio 4 protocol ip u32 match ip protocol 17 0xff match ip dport 1234 0xffff match ip dst flowid 1:1 action drop
tc filter add dev vlan100 parent ffff: protocol ipv6 u32 match ip6 dport 1234 0xffff match ip6 dst fd00::/64 flowid 1:1 action drop

We can verify it:

  $ mmwatch 'tc -s filter show dev vlan100 ingress'
filter parent ffff: protocol ip pref 4 u32
filter parent ffff: protocol ip pref 4 u32 fh 800: ht divisor 1
filter parent ffff: protocol ip pref 4 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1  (rule hit 1.8m/s success 1.8m/s)
  match 00110000/00ff0000 at 8 (success 1.8m/s)
  match 000004d2/0000ffff at 20 (success 1.8m/s)
  match c612000c/ffffffff at 16 (success 1.8m/s)
        action order 1: gact action drop
         random type none pass val 0
         index 1 ref 1 bind 1 installed 1.0/s sec
        Action statistics:
        Sent 79.7m/s bytes 1.8m/s pkt (dropped 1.8m/s, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0

With u32 matching in the tc ingress hook we can drop 1.8mpps on a single CPU. This is great!

But it may be faster …

Step 8: XDP_DROP

Finally, the ultimate weapon: XDP, the eXpress Data Path.
XDP lets you run eBPF code at the network-driver level. Most importantly, this happens before skbuff memory allocation, which makes it very fast.

Typically, an XDP project consists of two parts:

  • eBPF code loaded into the kernel context
  • User space loader to load and manage code on the specified network card

Writing the loader is fairly painful, so instead we will simply load the code using the new iproute2 support:

  ip link set dev ext0 xdp obj xdp-drop-ebpf.o


The source code of the loaded eBPF XDP program is available here. The program parses packets looking for IP, then UDP, then for packets matching the specified subnet and port:

  if (h_proto == htons(ETH_P_IP)) {
    if (iph->protocol == IPPROTO_UDP
        && (htonl(iph->daddr) & 0xFFFFFF00) == 0xC6120000 /* */
        && udph->dest == htons(1234)) {
        return XDP_DROP;
    }
}

The XDP program must be compiled with a recent clang that can emit BPF bytecode. After compiling, we can load and verify the XDP program:

  $ ip link show dev ext0
4: ext0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc fq state UP mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:8a:59:8e brd ff:ff:ff:ff:ff:ff
    prog/xdp id 5 tag aedc195cc0471f51 jited

Now let's look at the network card statistics with ethtool -S:

  $ mmwatch 'ethtool -S ext0 | egrep "rx" | egrep -v ": 0" | egrep -v "cache|csum"'
     rx_out_of_buffer:     4.4m/s
     rx_xdp_drop:         10.1m/s
     rx2_xdp_drop:        10.1m/s

Wow! With XDP, you can drop 10 million packets per second on a single CPU.



We tested both IPv4 and IPv6; the results are summarized in the following chart:

IPv6 setups generally show slightly lower performance. Remember that IPv6 packets are slightly larger, so some performance difference is unavoidable.

Linux has several ways to filter packets, each with different performance and ease of use.

For DDoS mitigation purposes it may be sufficient to receive packets in user space and process them in the application. A properly tuned application can show pretty good numbers.

For DDoS attacks with random or spoofed source IPs, turn conntrack off to gain performance. But be careful: there are attacks against which conntrack is very helpful.

In other cases it may make sense to incorporate the Linux firewall into the DDoS mitigation pipeline. In that case, make sure you match in the -t raw PREROUTING layer, since it is significantly faster than the filter table.

For larger workloads there is always XDP. It is really powerful. Here is the same chart with XDP included:

If you want to reproduce these numbers, take a look at the README, where everything is well documented.

Here at Cloudflare we use almost all of these techniques. Some of the user-space tricks are used in our applications. The iptables layer is managed by our Gatebot DDoS mitigation pipeline. Finally, we are working on replacing our proprietary kernel offload solution with XDP.

Want to help us drop more packets? We are hiring for many roles, including packet droppers and systems engineers.

Special thanks to Jesper Dangaard Brouer for helping me with this work
