This article will show you how to receive 10 million packets per second without any third-party libraries such as Netmap, PF_RING, or DPDK. We will do it with a stock Linux kernel (version 3.16) and a certain amount of C and C++ code.
The very first version of this article was published on the 25th of June 2015.
You can find all source code from this article at GitHub.
As an introduction, I'd like to share a few words about how pcap, the well-known way of capturing packets, works. It is used in such popular utilities as iftop, tcpdump, and arpwatch.
To start, you open an interface using pcap and wait for packets from it using the usual bind/recv approach. The kernel, in turn, receives data from the network card and stores it in kernel space; then it detects that the user wants the data in user space and, through an argument of the recv call, gets the address of the buffer to put it in. The kernel dutifully copies the data (for the second time).
In addition, remember that recv is a system call, and we make it for every packet arriving on the interface. System calls are usually fast, but the speed of modern 10GE interfaces (up to 14.6 million packets per second) means that even a lightweight call becomes very costly to the system solely because of how often it is made.
It is also worth noting that we usually have more than two logical cores on a server, and data can arrive on any of them, while an application that receives data via pcap uses only one core. This is where kernel-side locks come in and drastically slow down the capture process: now we are not only copying memory and processing packets, but also waiting for locks held by other cores to be released. In the worst cases, locks can eat up to 90% of the processor resources of the entire server.
Quite a list of problems? Yes, and we will try to solve them all.
In this article we will be working with mirror ports (meaning the network switch sends us a copy of all the traffic of a certain server). On that traffic we observe a SYN flood of minimum-size packets at a rate of 14.6 mpps / 7.6 GE.
We will use an Intel 82599 NIC with the ixgbe driver (version 4.1.1 from SourceForge) on Debian 8 Jessie. Module configuration: modprobe ixgbe RSS=8,8. The processor is an Intel i7 3820 with 8 logical cores.
Distribute interrupts across available cores
Note that the packets arriving on the port have destination MAC addresses that do not match the MAC address of our network card.
Otherwise, the Linux TCP/IP stack would kick in and the machine would choke on the traffic. This point is very important: we are discussing only the capture of other machines' traffic, not the processing of traffic destined for this machine (although my method is suitable for that case too).
Now let’s check how much traffic we can accept if we start listening to all traffic.
Enable promisc mode on the network card: ifconfig eth6 promisc
After that, in htop we see a very unpleasant picture: one of the cores is completely overloaded while the rest sit idle:
2 [ 0.0%]
3 [ 0.0%]
4 [ 0.0%]
5 [ 0.0%]
6 [ 0.0%]
7 [ 0.0%]
8 [ 0.0%]
To measure the packet rate on the interface, we will use the special script pps.sh. For now the rate is quite modest, about 4 million packets per second:
RX eth6: 3 882 721 pkts/s
RX eth6: 3 745 027 pkts/s
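The pps.sh script itself is not reproduced in the text; a minimal sketch of what such a script does (my assumption, not the author's actual code) is to sample the kernel's rx_packets counter over one second:

```shell
#!/bin/sh
# Print the receive packet rate of an interface by sampling the kernel's
# rx_packets counter over one second. A sketch of what pps.sh does; run it
# in a loop (e.g. under watch) for continuous output. Defaults to lo so it
# runs anywhere; the article measures eth6.
IFACE="${1:-lo}"
STAT="/sys/class/net/$IFACE/statistics/rx_packets"
prev=$(cat "$STAT")
sleep 1
cur=$(cat "$STAT")
echo "RX $IFACE: $((cur - prev)) pkts/s"
```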
To solve this problem and distribute the load across all logical cores (I have 8 of them), you need to run a script that distributes interrupts from all 8 queues of the network card across all available logical cores.
Great, the speed immediately jumped to 12 mpps (but this is not capture yet; it only indicates that we can read traffic from the network at this rate):
TX eth6: 0 pkts/s RX eth6: 12 528 942 pkts/s
TX eth6: 0 pkts/s RX eth6: 12 491 898 pkts/s
TX eth6: 0 pkts/s RX eth6: 12 554 312 pkts/s
And the load on the cores has stabilized:
1 [||||| 7.4%]
2 [||||||| 9.7%]
3 [|||||| 8.9%]
4 [|| 2.8%]
5 [||| 4.1%]
6 [||| 3.9%]
7 [||| 4.1%]
8 [||||| 7.8%]
First attempt to run AF_PACKET capture without optimizations
Let's start with the simplest and slowest example app.
So, we launch the application for capturing traffic using AF_PACKET:
We process: 222 048 pps
We process: 186 315 pps
And we can see that all CPU cores are at peak load:
1 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||| 86.1%]
2 [|||||||||||||||||||||||||||||||||||||||||||||||||||||| 84.1%]
3 [|||||||||||||||||||||||||||||||||||||||||||||||||||| 79.8%]
4 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 88.3%]
5 [||||||||||||||||||||||||||||||||||||||||||||||||||||||| 83.7%]
6 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 86.7%]
The reason for such a high load is that the kernel has drowned in locks, on which it spends most of the processor time:
Samples: 303K of event 'cpu-clock', Event count (approx.): 53015222600
59.57% [kernel] [k] _raw_spin_lock
9.13% [kernel] [k] packet_rcv
7.23% [ixgbe] [k] ixgbe_clean_rx_irq
3.35% [kernel] [k] pvclock_clocksource_read
2.76% [kernel] [k] __netif_receive_skb_core
2.00% [kernel] [k] dev_gro_receive
1.98% [kernel] [k] consume_skb
1.94% [kernel] [k] build_skb
1.42% [kernel] [k] kmem_cache_alloc
1.39% [kernel] [k] kmem_cache_free
0.93% [kernel] [k] inet_gro_receive
0.89% [kernel] [k] __netdev_alloc_frag
0.79% [kernel] [k] tcp_gro_receive
Optimizing AF_PACKET capture with FANOUT
So what can we do? We know that locks occur when multiple processors try to use the same resource. In our case the cause is that we have a single socket consumed by a single application, which forces the other logical processors to wait constantly.
Here a great feature comes to our aid: FANOUT. With AF_PACKET we can run several processes, each with its own socket, that together share the incoming traffic. In our case the optimal number of processes equals the number of logical cores.
In addition, we can choose the algorithm by which data is distributed over these sockets. I chose the PACKET_FANOUT_CPU mode, since in my case the data is spread very evenly across the network card queues and this, in my opinion, is the least resource-intensive balancing option (though I can't vouch for that; I recommend checking the kernel code).
You can find fanout enabled example app here.
And we start the application again. Lo and behold, a breathtaking 10x acceleration:
We process: 2 250 709 pps
We process: 2 234 301 pps
We process: 2 266 138 pps
Processors, of course, are still fully loaded:
But the perf top output looks completely different: there are no more locks:
Samples: 1M of event 'cpu-clock', Event count (approx.): 110166379815
17.22% [ixgbe] [k] ixgbe_clean_rx_irq
7.07% [kernel] [k] pvclock_clocksource_read
6.04% [kernel] [k] __netif_receive_skb_core
4.88% [kernel] [k] build_skb
4.76% [kernel] [k] dev_gro_receive
4.28% [kernel] [k] kmem_cache_free
3.95% [kernel] [k] kmem_cache_alloc
3.04% [kernel] [k] packet_rcv
2.47% [kernel] [k] __netdev_alloc_frag
2.39% [kernel] [k] inet_gro_receive
2.29% [kernel] [k] copy_user_generic_string
2.11% [kernel] [k] tcp_gro_receive
2.03% [kernel] [k] _raw_spin_unlock_irqrestore
Optimizing AF_PACKET capture with RX_RING (ring buffer)
What should we do, and why is it still slow? The answer is in the build_skb function: it means that two memory copies are still being made inside the kernel.
Now let's try to deal with memory allocation by using RX_RING. Please use this code. Looking at the results, we are nearing 4 mpps:
We process: 3 582 498 pps
We process: 3 757 254 pps
We process: 3 669 876 pps
This speed increase comes from the fact that memory is now copied out of the network card buffer only once. When data passes from kernel space to user space there is no copy at all: a shared buffer is allocated in the kernel and mapped into user space.
The approach to processing also changes: we can no longer hang on a blocking call for each packet (remember, that is overhead). Instead, using the poll call, we wait for a signal that a whole block is filled, and then start processing it.
Optimizing AF_PACKET capture with RX_RING and FANOUT
But we still have problems with locking. How do we defeat them? The old method: turn on FANOUT and allocate a block of memory for each handler thread. You can find the app with all optimizations here.
Samples: 778K of event 'cpu-clock', Event count (approx.): 87039903833
74.26% [kernel] [k] _raw_spin_lock
4.55% [ixgbe] [k] ixgbe_clean_rx_irq
3.18% [kernel] [k] tpacket_rcv
2.50% [kernel] [k] pvclock_clocksource_read
1.78% [kernel] [k] __netif_receive_skb_core
1.55% [kernel] [k] sock_def_readable
1.20% [kernel] [k] build_skb
1.19% [kernel] [k] dev_gro_receive
0.95% [kernel] [k] kmem_cache_free
0.93% [kernel] [k] kmem_cache_alloc
0.60% [kernel] [k] inet_gro_receive
0.57% [kernel] [k] kfree_skb
0.52% [kernel] [k] tcp_gro_receive
0.52% [kernel] [k] __netdev_alloc_frag
So, we enable the FANOUT mode for the RX_RING version:
Wow, we got a breathtaking 9 mpps:
We process: 9 611 580 pps
We process: 8 912 556 pps
We process: 8 941 682 pps
We process: 8 854 304 pps
Samples: 224K of event 'cpu-clock', Event count (approx.): 42501395417
21.79% [ixgbe] [k] ixgbe_clean_rx_irq
9.96% [kernel] [k] tpacket_rcv
6.58% [kernel] [k] pvclock_clocksource_read
5.88% [kernel] [k] __netif_receive_skb_core
4.99% [kernel] [k] memcpy
4.91% [kernel] [k] dev_gro_receive
4.55% [kernel] [k] build_skb
3.10% [kernel] [k] kmem_cache_alloc
3.09% [kernel] [k] kmem_cache_free
2.63% [kernel] [k] prb_fill_curr_block.isra.57
Even more interesting, the load on the CPUs dropped a bit:
1 [||||||||||||||||||||||||||||||||||||| 55.1%]
2 [||||||||||||||||||||||||||||||||||| 52.5%]
3 [|||||||||||||||||||||||||||||||||||||||||| 62.5%]
4 [|||||||||||||||||||||||||||||||||||||||||| 62.5%]
5 [||||||||||||||||||||||||||||||||||||||| 57.7%]
6 [|||||||||||||||||||||||||||||||| 47.7%]
7 [||||||||||||||||||||||||||||||||||||||| 55.9%]
8 [||||||||||||||||||||||||||||||||||||||||| 61.4%]
In conclusion, I would like to add that Linux is simply an amazing platform for analyzing traffic, even in environments where you cannot build a specialized kernel module. That is very, very pleasing. There is hope that in upcoming kernel versions it will be possible to process 10GE at a full wire speed of 14.6 million packets per second using a 1800 MHz processor.
Recommended reading materials: