Making 35 000 000 IP lookup operations per second with Patricia tree
In this article I’ll try to make some performance evaluation of Patricia tree for purpose of IP address lookup in list of prefixes in C and C++ languages.
Briefly, Patricia tree can help us to confirm that IP address 10.0.0.1 belongs or not to list of networks: 10.0.0.0/24, 10.1.1.0/22 and 10.2.4.0/32.
Naive implementation of IP lookup
We may try to do linear search and it may be fast enough when we have some reasonable number of prefixes which can fit in CPU cache. Maybe it will work fine until you have 20–30 prefixes. In real world scenarios we have hundreds of prefixes with different lengths and linear search will not scale well as we will need to check every single prefix in list and time complexity will be as bad as O(n).
It may get even more complicated if we have need to implement longest prefix match query. What is this? For example, you may have IP address 10.0.0.1 and you have two prefixes in list which both contain it 10.0.0.0/8 and 10.0.0.0/24. For many network and business applications you may need to select longest (24 is longer than 8).
For example we offer our customers ability to set different traffic levels for DDoS detection on prefix basis. So you may set some generic threshold for 10.0.0.0/8 but you know that network 10.0.0.0/24 has heavy loaded services you may set bigger values for it. During attack detection FastNetMon will use custom configuration for this specific /24 because we use longest prefix match to retrieve configuration.
What is Patricia tree?
It’s type of Radix tree which was introduced around 2005 in MRT aka Multi-threaded Routing Toolkit. It’s pretty old but it’s very mature and battle tested.
In FastNetMon DDoS Detection toolkit we use Patricia tree for two cases:
- To detect packet or flow direction: incoming or outgoing (belong or not belong check against list of prefixes operated by company)
- To find hostgroup which has configuration for specific IP address (longest prefix match)
As FastNetMon operates on really large networks and analyses many terabits of capacity we need to understand what are our expectation about IP lookup performance from Patricia.
Performance evaluation scenario
Patricia tree is a tree and it performance heavily depends on prefix lengths and their number. Having few /8 and thousands of /32 in list will show extremely different results. How can we test it? We decided to retrieve prefixes from real very large network which contains around 380 prefixes of length ranging from /22 to /15. Most popular prefix length in our data set was /16.
To test performance of tree lookup we need to lookup something. We can generate random IP addresses for lookup but it will not reflect real nature of networks because we have different performance for miss in tree and hit in tree. Both miss and hit may happen on different levels of tree and it will lead to different performance.
To get performance test result which are as possible close to our real workloads we need to retrieve some real traffic from network and we managed to do it. In our real traffic sample we had 300k+ pairs of source and destination IP addresses.
For our tests we used our own copy of Patricia tree and we used this app to load prefixes into tree and blast it with real traffic stored in JSON file. Our implementation is single threaded.
To get precise results we load source traffic into continuous memory region (pre-allocated std::vector) and run lookup operation patricia_search_best2(lookup_tree, &prefix, 1) on each source and destination IPs from input data set. And then we repeat last step 100000 times.
After running test app for more than half hour we got following results:
./patricia_performance_tests
Total time is 1781 seconds total ops: 1116028928
Million of ops per second: 17.5
As each iteration does one lookup for source IP and another one for destination IP we need to multiply 17.5 by two. Finally we got really great performance of 35 000 000 lookup / s which is more than enough for many applications.
Test server configuration
AMD Ryzen 7 5800X 8-Core Processor, 16 logical CPU cores, 32 GB DDR4 3000 Mhz, B450 TOMAHAWK, Noctua NH-D15.
Conclusions
35 000 000 IP lookup per second is really nice result for 16 year old library and pretty old algorithm. Such performance covers most of our needs for 10G bandwidth but we still need to use alternative approaches to improve performance for 40G and 100G cases. As good direction of research in this field I can point you towards DIR-24–8 and DXR algorithms.