Rocky road towards ultimate UDP server with BPF based load balancing on Linux: part 1

This article will summarise decade long experience accumulated during development of FastNetMon, high performance lightning fast DDoS detection tool with open source core.

To detect DDoS attack in Telecom network we need to see traffic which traverses network and we have multiple protocols for that purpose:

  • Netflow v9
  • sFlow v5

All these protocols are UDP based and are single directional. Each of UDP packets may include information about 1-20 packets or flows observed by network equipment.

Fortunately, all these protocols use some kind of sampling (select 1 packet from 1000 and send it to collector) or aggregation (collect multiple packets which belong to same 5 tuple and report them as single multi-packet flow once per 15-30 seconds) and even for huge networks with Terabits of capacity we will see only few 10-100k packets per second of UDP traffic.

Mural Connectivity Matters, Shoreditch, London

Our initial UDP server implementation was very basic and used single thread to process all traffic from network equipment. I will use C++ for my examples but all described capabilities are part of C syscall API provided by Linux and can be used from any language with ability to run system calls. You can find complete source code for it here.

This implementation handles all UDP traffic using single thread:

Started capture
79903 packets / s
80942 packets / s
80880 packets / s
79562 packets / s  
80228 packets / s  
40% of single CPU core used only to receive traffic

As you can see even with our example UDP server which does not execute any packet parsing or data processing we used half of CPU core. Due to single thread approach we have only 50% of CPU on this core available for our tasks which is clearly not enough. For my tests I used Xeon E5-2697A v4.

What is the most obvious approach to increase amount of available CPU power for our tasks? We can run another thread on another port and switch part of network equipment to use another port. We used this approach for many years. It wasn't very user friendly as you need to guess amount of telemetry per device and do load balancing manually.

What is the best way to improve experience and process traffic towards same port using multiple threads? We can use capability called SO_REUSEPORT which is available on Linux starting from version 3.9.

How it works? It implemented in a very intuitive way. We spawn multiple threads and from each of them we create their own UDP server sockets and bind them on same host and port. The only difference is that we need to set special socket option called SO_REUSEPORT. Otherwise Linux will block such attempts with error: "errno:98 error: Address already in use".

Linux kernel distributes packets between these sockets using hash function created from 4 tuple (host IP and port + client IP and port). In a scenario when we have large number of network devices it will work just fine. You can find all source code in this repository. Majority of changes in compare with single thread implementation are about adding logic to spawn Linux threads and not related with network stack at all. The only change required for socket logic was following:

Enabling SO_REUSEPORT for socket

When we run this example multi threaded server it will distribute traffic towards same UDP port following way:

Thread ID UDP packets / second
Thread 0 10412
Thread 1 71115
Thread 0 10344
Thread 1 69908
Thread 0 10268
Thread 1 69760
Traffic was distributed between two threads but we're not very lucky to get decent distribution

Clearly such distribution is not perfect and it does not help in our particular case. Will it help in your case?

Very likely. Linux kernel uses load balancing which is based on IP addresses and source ports of clients and if you have large number of unique clients for your service then traffic will be distributed evenly over all available threads.

Why it does not work well in our particular case? Let's look on tcpdump:

22:15:45.039055 IP        > UDP, length 158
22:15:45.039056 IP > UDP, length 158
22:15:45.039066 IP > UDP, length 122
22:15:45.039070 IP > UDP, length 98
22:15:45.039098 IP > UDP, length 156

What is the issue here?

One of the issues that all network devices use same source port number for sending telemetry. We even checked vendor documentation but did not find option to change it anywhere. Apparently it's hardcoded which is not great but we still have unique device addresses to make 4 tuple hash different.

The second issue that majority of traffic is coming from single router and hash function will be the same and Linux kernel will feed all traffic from this router to specific thread and will overload it that way.

Telecom equipment is an example of cutting edge technology and you may replace dozen of old routers or switches by single device with incredible throughput on scale of tens Terabits.

That's exactly what happened in on this specific deployment. Can we make anything to spread traffic evenly over all available threads?

In Linux Kernel 4.5 we got new capability to replace logic used by Linux kernel to distribute packets between multiple threads in reuse port group by our own BPF based microcode using socket options called SO_ATTACH_REUSEPORT_CBPF and SO_ATTACH_REUSEPORT_EBPF.

In second part of this article we will provide complete example how we can implement BPF microcode which will distribute our traffic evenly between threads.

Subscribe to Pavel's blog about underlying Internet technologies

Don’t miss out on the latest issues. Sign up now to get access to the library of members-only issues.