Network Traffic Telemetry on modern routers: part 1
Hello! I’m Pavel, CTO and co-founder of FastNetMon LTD, London, 🇬🇧. We’re a cyber security software vendor and we develop a DDoS 🎯 detection and mitigation platform for Telecoms.
Please note that this article focuses only on traffic telemetry protocols that export information about network packets (either in parsed form or as just the first X bytes of the packet itself) observed or forwarded by network equipment. Protocols such as gNMI, NETCONF or SNMP are out of scope for this work.
My main field of interest is DDoS defence, so I'll provide my view on each protocol from the perspective of DDoS detection. If your main application of network traffic telemetry is traffic visibility, billing or law enforcement, then my experience may be less relevant for you.
The very first stable version of FastNetMon, released in 2014, supported only port mirror, as it was the only option supported by our network equipment. After getting initial feedback from the network community about the need to support more industry-adopted network traffic telemetry protocols, I started working on it. After 9 months of active development I introduced support for Netflow v5, Netflow v9, sFlow v5 and IPFIX in 2015.
Thanks to the very large number of FastNetMon users and their diversity, we had access to a large variety of network equipment models and were able to test our product's compatibility with the majority of products available on the market.
In this article I'll share our view on different network traffic telemetry protocols from both the operational (how a particular vendor implements it) and the implementation (how to write code to deal with a particular protocol) points of view.
What was the main motivation to add new state-of-the-art traffic telemetry protocols? To get faster attack detection (we can now detect a DDoS attack in as little as 1.5 seconds) and to get more information about the attack's traffic so we can build the most efficient mitigation.
Network telemetry on modern routers
There are multiple traffic telemetry protocols which are supported by modern Telco grade routers:
- Netflow v5
- Netflow v9
- IPFIX
- sFlow v5
- Port mirror
- Sampled port mirror (including GRE option)
- Raw headers over IPFIX or Netflow v9
Does it look scary? Clearly it does. It's very challenging to select the one which fits your needs. You have to consider dozens of theoretical characteristics for each protocol and then apply knowledge about the strong and weak points of each vendor's implementation of these protocols.
Let's dig into details about each of them.
Netflow v5
I would like to start with the oldest protocol, which is Netflow v5. Surprisingly, you can still find support for it even in very modern routers (personally I think it should be deprecated and removed).
It's a fixed-format protocol and it can export only a limited number of pre-defined fields which describe the observed traffic. Let's look at the protocol structure kindly provided by Cisco.
[Figure: Netflow v5 packet header format, from Cisco documentation]
This part does not carry any information about the traffic itself, but it is required for a Netflow collector implementation. For example, the sampling_interval field is absolutely necessary for our needs as it carries the number of skipped (sampled out) packets during observation. Let's look at the flow encoding format. Each Netflow v5 packet can carry multiple (usually 12-15) flows which describe traffic observed by the router. You can find the full list below.
[Figure: Netflow v5 flow record format, from Cisco documentation]
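To make the fixed format concrete, here is a minimal Python sketch of a Netflow v5 decoder. It is not FastNetMon's code; it assumes the commonly documented Cisco layout (a 24 byte header followed by 48 byte flow records, all in network byte order), and the function name and dictionary keys are my own choices for illustration.

```python
import socket
import struct

# Netflow v5 header, 24 bytes: version, count, sys_uptime, unix_secs,
# unix_nsecs, flow_sequence, engine_type, engine_id, sampling_interval.
V5_HEADER = struct.Struct("!HHIIIIBBH")
# Netflow v5 flow record, 48 bytes: srcaddr, dstaddr, nexthop, input, output,
# dPkts, dOctets, first, last, srcport, dstport, pad1, tcp_flags, prot, tos,
# src_as, dst_as, src_mask, dst_mask, pad2.
V5_RECORD = struct.Struct("!4s4s4sHHIIIIHHBBBBHHBBH")


def parse_netflow_v5(packet: bytes) -> dict:
    """Decode one Netflow v5 datagram (illustrative sketch, minimal validation)."""
    if len(packet) < V5_HEADER.size:
        raise ValueError("packet is shorter than the Netflow v5 header")
    (version, count, _uptime, unix_secs, _nsecs,
     sequence, _engine_type, _engine_id, sampling) = V5_HEADER.unpack_from(packet, 0)
    if version != 5:
        raise ValueError(f"not a Netflow v5 packet (version {version})")
    if len(packet) < V5_HEADER.size + count * V5_RECORD.size:
        raise ValueError("packet is truncated")

    flows = []
    for i in range(count):
        offset = V5_HEADER.size + i * V5_RECORD.size
        (src, dst, _nexthop, in_if, out_if, packets, octets, _first, _last,
         src_port, dst_port, _pad1, tcp_flags, protocol, _tos,
         src_as, dst_as, _src_mask, _dst_mask, _pad2) = V5_RECORD.unpack_from(packet, offset)
        flows.append({
            "src_ip": socket.inet_ntoa(src),
            "dst_ip": socket.inet_ntoa(dst),
            "src_port": src_port,
            "dst_port": dst_port,
            "protocol": protocol,
            "tcp_flags": tcp_flags,
            "packets": packets,
            "bytes": octets,
            "input_interface": in_if,
            "output_interface": out_if,
            "src_as": src_as,  # only 16 bits wide in this format
            "dst_as": dst_as,
        })

    return {
        "unix_secs": unix_secs,
        "flow_sequence": sequence,
        # The top 2 bits carry the sampling mode, the low 14 bits the interval.
        "sampling_mode": sampling >> 14,
        "sampling_interval": sampling & 0x3FFF,
        "flows": flows,
    }
```

Because every offset is fixed, the whole decoder is two `struct` layouts and a loop. The price is that nothing can be added without breaking every collector.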
Netflow and IPFIX flow aggregation
The Netflow and IPFIX protocol family assumes (with some small exceptions which I'll explain later) that the network equipment (usually a router) has a flow tracking subsystem which implements packet aggregation.
What is a flow? Usually, it's a 5-tuple which includes the following fields:
- Source IP
- Source port
- Destination IP
- Destination port
- Protocol
To implement flow tracking, the router creates a memory entry for each unique 5-tuple and then maintains multiple metrics for each flow. Usually, it counts the number of packets and bytes transferred by this flow.
As we started talking about memory, it is worth emphasising that memory is a finite resource, and the very essence of DDoS attacks is the exhaustion of a finite resource (usually network capacity, but it can also be the compute resources or memory of the router). This is one of the first disadvantages of this family of protocols for attack detection purposes, as the flow table itself can become a vector for a denial of service attack against network equipment.
What is packet aggregation? Instead of sending information about every single packet handled by our network device, we export information only for unique flows. If thousands of packets belong to the same flow, they will be exported only once, which saves a lot of resources. Basically, we aggregate multiple packets which belong to the same flow into a single flow record.
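The idea can be sketched in a few lines of Python. This is only a toy model of a flow table: the names `FlowKey`, `FlowTable` and `max_entries` are hypothetical, and real routers implement this in hardware or in optimised data plane code. What happens when the table is full also varies by platform; here I simply assume new flows are not tracked.

```python
from typing import NamedTuple


class FlowKey(NamedTuple):
    """The classic 5-tuple which identifies a flow."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int


class FlowTable:
    """Toy flow table with a hard limit on the number of tracked flows."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self.flows = {}   # FlowKey -> [packets, bytes]
        self.dropped = 0  # packets we could not account for (table was full)

    def observe(self, key: FlowKey, length: int) -> bool:
        """Account one packet; returns False when the table has no room for a new flow."""
        counters = self.flows.get(key)
        if counters is None:
            if len(self.flows) >= self.max_entries:
                # Assumption: a full table means new flows are simply not tracked.
                self.dropped += 1
                return False
            counters = self.flows[key] = [0, 0]
        counters[0] += 1
        counters[1] += length
        return True
```

A thousand packets of one TCP session end up as a single entry with two counters, which is the aggregation win. An attack which uses random source addresses and ports creates a new entry for every packet, and this is how the table limit gets hit.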
As part of the flow tracking implementation, we need to export information about the traffic transferred by each flow to the Netflow collector. Let's imagine a medium-sized network with 100G+ of traffic. Such networks easily have tens of millions of active flows at any moment. Can we send millions of flows every single second? Well, it's theoretically possible, but it will be extremely challenging for the router itself and it will then overload the Netflow collector.
What is the solution? The most common approach is to scan the table with all flows every 5, 15 or 60 seconds (this period is called the active flow timeout) and send the collector only information about flows which had at least one packet transferred during this period. This approach reduces the load on the router (as we have more time to scan the table) and sends far less data to the collector.
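A minimal sketch of one such scan, assuming the flow table is a plain Python dictionary with per-flow counters (the function name and table layout are hypothetical, and real implementations also expire idle entries, which I leave out):

```python
def scan_and_export(table: dict) -> list:
    """One periodic scan of the flow table.

    `table` maps a flow key to a dict with "packets" and "bytes" counters
    accumulated since the previous scan. Flows with no traffic in this
    period are skipped; exported flows get their counters reset.
    """
    exported = []
    for key, counters in table.items():
        if counters["packets"] > 0:
            exported.append((key, counters["packets"], counters["bytes"]))
            counters["packets"] = 0
            counters["bytes"] = 0
    return exported
```

The router would call this once per active flow timeout. Note what it means for the collector: a packet which arrives right after a scan is not reported until the next one, so the worst case export delay is equal to the timeout itself.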
What is the issue with this approach? It delays the export of information about packets observed by the network by X seconds, which is extremely dangerous in the case of a DDoS attack which can grow to hundreds of gigabits per second within 30 seconds. Such a delay will cost network downtime and will cause connectivity issues for customers. Many vendors limit the minimum value of this timeout to quite large values such as 15 or even 60 seconds to prevent control plane overload, and this makes fast traffic monitoring impossible.
Sampling
The main way to deal with flow table overload is to use sampling. Instead of sending all traffic to the flow tracking engine, we can discard 99% of it and pass on only a small fraction. This leads to far fewer flow tracking entries and allows a much faster flow table scan.
For example, for a 10G port, you can expect very accurate results when using 1:1024 sampling. From our experience, the vast majority of large Telco installations use sampling.
Sampling is a very powerful tool that has almost no drawbacks for DDoS attack detection purposes, but you need to be extremely careful when selecting the sampling rate.
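On the collector side, sampled counters have to be scaled back by the sampling rate before they can be compared with thresholds. A small sketch of this arithmetic (the function name and parameters are my own, for illustration only):

```python
def estimate_from_samples(sampled_packets: int, sampled_bytes: int,
                          sampling_rate: int, interval_seconds: float):
    """Scale sampled counters back to an estimate of the real traffic.

    With 1:N sampling every exported packet stands for roughly N packets
    on the wire, so both counters are multiplied by N.
    """
    packets_per_second = sampled_packets * sampling_rate / interval_seconds
    bits_per_second = sampled_bytes * 8 * sampling_rate / interval_seconds
    return packets_per_second, bits_per_second
```

For example, if a router with 1:1024 sampling reports 1000 packets and 1,200,000 bytes over one second, the estimate is 1,024,000 packets per second and about 9.83 Gbit/s. This is also why the sampling rate has to be known precisely: a collector which assumes the wrong rate is wrong by the same factor.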
Let's summarise our field experience with Netflow v5.
Benefits of Netflow v5
- Supported even by very old equipment
- Simple parser implementation due to static structures
- Simple sampling rate encoding (available in each packet)
Issues with Netflow v5
- Official standard does not exist
- Lack of IPv6 support
- Lack of 32 bit (4 byte) ASN support
- Sampling cannot exceed 1:16383 due to 14 bit field length
- Impossible to extend due to static structures
- Flow delays in range of 1-30 seconds before export
If you still use Netflow v5, please stop as soon as possible. This protocol does not meet the needs of modern networks and should not be used. What should we use instead? Please keep reading.
Netflow v9
It's one of the most widely adopted protocols in the industry. You can find it in basically all modern routers and it's supported by the majority of Netflow collectors (including FastNetMon).
It's a truly great protocol and, on the protocol side, it can carry basically any information from router to collector. In reality we're limited by the fields selected by the vendor, and below you can find examples of field lists for two leading Telco vendors.
You can see that the list of exported fields is way longer than the list of fields supported by Netflow v5. What is the best part? We can add new fields easily and the protocol allows it.
Sadly, such flexibility comes with a hidden cost for Netflow collector developers: the need to do template management. Each collector which supports Netflow v9 has to track all lists of fields (called "data templates") which routers announce from time to time. This information is needed to decode the data arriving from routers.
Certainly, it's way more tricky to decode Netflow v9 data compared with Netflow v5, as the structure is dynamic for each device and even a single device can use multiple formats.
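A simplified sketch of template management is shown below. It follows the layout described in RFC 3954 (template records are pairs of 16 bit field type and field length, template IDs start from 256), but the class and method names are hypothetical and it skips things a production collector needs, such as template expiry and options templates. Both methods take the FlowSet body without its 4 byte FlowSet header.

```python
import struct


class TemplateCache:
    """Remembers the field layouts announced by each exporter."""

    def __init__(self):
        # (exporter address, source_id, template_id) -> [(field_type, field_length), ...]
        self.templates = {}

    def learn(self, exporter: str, source_id: int, flowset_body: bytes) -> None:
        """Parse the body of a template FlowSet (FlowSet ID 0)."""
        offset = 0
        while offset + 4 <= len(flowset_body):
            template_id, field_count = struct.unpack_from("!HH", flowset_body, offset)
            offset += 4
            # IDs below 256 are reserved, so this is padding or garbage.
            if template_id < 256 or offset + 4 * field_count > len(flowset_body):
                break
            fields = [struct.unpack_from("!HH", flowset_body, offset + 4 * i)
                      for i in range(field_count)]
            offset += 4 * field_count
            self.templates[(exporter, source_id, template_id)] = fields

    def decode(self, exporter: str, source_id: int, template_id: int,
               flowset_body: bytes):
        """Decode a data FlowSet body; returns None if the template is not known yet."""
        fields = self.templates.get((exporter, source_id, template_id))
        if fields is None:
            return None
        record_length = sum(length for _, length in fields)
        records = []
        offset = 0
        # Trailing bytes shorter than one record are padding and are ignored.
        while record_length > 0 and offset + record_length <= len(flowset_body):
            record = {}
            for field_type, length in fields:
                record[field_type] = flowset_body[offset:offset + length]
                offset += length
            records.append(record)
        return records
```

Two details are easy to miss. Templates are scoped per exporter and per source ID, so two routers can use template 256 with completely different layouts. And until the router re-announces its templates, a freshly started collector cannot decode anything from it, which is why `decode` has to be able to say "not yet".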
Sampling encoding
Netflow v9 does flow aggregation, and all issues with this process covered in the section about Netflow v5 apply to Netflow v9 too. It supports sampling as well, but the logic for exporting the actual sampling rate is way more complicated. We basically have a special type of packets (options and options templates) which deliver this information from device to collector. Let's look at example formats for sampling encoding. Personally, I consider this part of the protocol the most over-engineered and exceedingly complicated.
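To give a feeling for what the collector has to do, here is a rough sketch which keeps the last known sampling rate per exporter. It assumes the options data record has already been decoded into a dictionary of field type to raw bytes (a hypothetical structure) and it checks only two field types from the Netflow v9 field list, SAMPLING_INTERVAL (34) and FLOW_SAMPLER_RANDOM_INTERVAL (50). Which field a particular vendor actually sends is exactly the vendor specific part, so treat the lookup order as an assumption.

```python
SAMPLING_INTERVAL = 34             # Netflow v9 field type
FLOW_SAMPLER_RANDOM_INTERVAL = 50  # Netflow v9 field type


class SamplingRates:
    """Tracks the sampling rate announced by each exporter."""

    def __init__(self, default_rate: int = 1):
        # Used until the exporter announces its rate (assumption: unsampled).
        self.default_rate = default_rate
        self.rates = {}  # exporter address -> sampling rate

    def update_from_options(self, exporter: str, record: dict) -> bool:
        """Look for a sampling rate in one decoded options data record."""
        for field_type in (SAMPLING_INTERVAL, FLOW_SAMPLER_RANDOM_INTERVAL):
            raw = record.get(field_type)
            if raw:
                rate = int.from_bytes(raw, "big")
                if rate > 0:
                    self.rates[exporter] = rate
                    return True
        return False

    def rate_for(self, exporter: str) -> int:
        return self.rates.get(exporter, self.default_rate)
```

Compare this with Netflow v5, where the rate is in the header of every packet. Here the rate arrives in separate packets on its own schedule, so flows received before the first options record can only be scaled by a guessed default.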
Let's summarise our feedback about Netflow v9
Benefits of Netflow v9
- Supported by almost all vendors
- IPv6 support
- Can carry sampling rate in any range
- Well documented and most of the implementations are reasonably close to original implementation
- Offers almost unlimited extensibility
- Some fields are documented as part of IPFIX RFCs
Issues with Netflow v9
- Complicated data encoding for collector
- Sampling encoding is complicated and vendor specific
- Issues with flow duration encoding on some vendors
IPFIX
It's the successor to and a further development of Netflow v9. On some rare occasions it is referred to as Netflow v10, which highlights the incredible number of similarities in protocol design.
It's basically the very first widely adopted network telemetry standard which was developed by the IETF and published as dozens of RFCs. From my own experience I would say that IPFIX is Netflow v9 with proper documentation.
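The similarity is visible already in the packet headers. The sketch below follows the header layouts from RFC 3954 (Netflow v9) and RFC 7011 (IPFIX); the function name and the returned dictionary are hypothetical.

```python
import struct

# Netflow v9 header, 20 bytes: version, count, sys_uptime, unix_secs,
# sequence_number, source_id.
V9_HEADER = struct.Struct("!HHIIII")
# IPFIX header, 16 bytes: version, length, export_time, sequence_number,
# observation_domain_id.
IPFIX_HEADER = struct.Struct("!HHIII")


def parse_export_header(packet: bytes) -> dict:
    """Tell Netflow v9 and IPFIX apart by the version field and decode the header."""
    if len(packet) < 2:
        raise ValueError("packet is too short")
    (version,) = struct.unpack_from("!H", packet, 0)
    if version == 9:
        if len(packet) < V9_HEADER.size:
            raise ValueError("truncated Netflow v9 header")
        _, count, _uptime, unix_secs, sequence, source_id = V9_HEADER.unpack_from(packet, 0)
        return {"protocol": "netflow_v9", "header_length": V9_HEADER.size,
                "record_count": count, "export_time": unix_secs,
                "sequence": sequence, "domain_id": source_id,
                "template_set_id": 0}  # templates travel in FlowSet ID 0
    if version == 10:
        if len(packet) < IPFIX_HEADER.size:
            raise ValueError("truncated IPFIX header")
        _, length, export_time, sequence, domain_id = IPFIX_HEADER.unpack_from(packet, 0)
        return {"protocol": "ipfix", "header_length": IPFIX_HEADER.size,
                "message_length": length, "export_time": export_time,
                "sequence": sequence, "domain_id": domain_id,
                "template_set_id": 2}  # templates travel in Set ID 2
    raise ValueError(f"unsupported version {version}")
```

The version number 10 in the IPFIX header is where the "Netflow v10" nickname comes from. After the header, both protocols continue with sets which have a 16 bit ID and a 16 bit length, so a collector can share most of its template handling code between the two.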
Should you migrate to IPFIX because Netflow v9 is old? The answer depends on the vendor and the particular model. Personally, I do recommend using IPFIX for all new deployments.
Let's look at an example list of IPFIX fields. It clearly looks very similar to the one used by Netflow v9.
Did it fix all the issues present in Netflow v9? Clearly, no. Sampling encoding got even more complicated and each vendor does it their own way.
Let's summarise our view on IPFIX.
Benefits of IPFIX
- Well documented RFC standard
- IPv6 support
- Unlimited flexibility
Issues of IPFIX
- Complicated encoding for collector
- Tricky encoding for traffic dropped by BGP Flow Spec (some vendors)
- Some vendors still do not support it
- Limited by subset of fields selected by vendor
What's next?
In the second part of my article I'll cover protocols such as sFlow v5, sampled port mirror, sampled port mirror over GRE, and the most innovative protocols which deliver raw headers over IPFIX or Netflow v9 (IPFIX 315 or "inline monitoring services"). Stay tuned!