In the tradition of my NIX4NetEng series I’m going to dive deep into the world of strategy, and specifically into how we look at and operate our networks, the data they generate, and the analytics that are available (and often overlooked) for managing networks both long term and day-to-day. So, in the spirit of visibility, let’s think about how typical networks are monitored. My guess is that you either already know or will soon realize that visibility and testing across disparate networks is hard. This is a big topic, so sit back, relax, get your feet up and prepare for a magical journey into the fun and fantastical world of network visibility!

I’m going to gloss over the low-hanging fruit of SNMP monitoring – you have all of that, right? Yes? Good. No?!?! Shame on you… I’ll do another post on some of my favorite tools soon and you can shamefully download them and set them up, head hung low (until they’re up, monitoring and alerting, at which point you can hold your head high!). Let’s make a very, very bold assumption that you totally understand how your internal network works, and that you have graphing, up/down statistics, traffic baselines and visibility into traffic flow and configurations. Don’t worry, if you don’t have those things you can link to a few options here or wait and I’ll circle back around to them in later topics.

“Visibility and testing across diverse networks is hard.”

OK, so visibility.

Any network engineer with some experience under their belt has gotten the problem report of “the internet is down” or “the internet is slow”. Yup, we all know them, we all love them. We even had an internal joke at a previous employer of mine that we could “ping the internet”, in that we created a CNAME of “theinternet” for a host that had high uptime.

(~) heelflip $ ping theinternet
PING theinternet (10.14.143.167): 56 data bytes
64 bytes from 10.14.143.167: icmp_seq=0 ttl=54 time=0.794 ms
64 bytes from 10.14.143.167: icmp_seq=1 ttl=54 time=0.768 ms
64 bytes from 10.14.143.167: icmp_seq=2 ttl=54 time=0.734 ms
64 bytes from 10.14.143.167: icmp_seq=3 ttl=54 time=0.732 ms
64 bytes from 10.14.143.167: icmp_seq=4 ttl=54 time=0.758 ms
64 bytes from 10.14.143.167: icmp_seq=5 ttl=54 time=0.761 ms

Right, so you get the internal network. What about when you get to the part that you don’t control and can’t see into? That’s harder, but rest easy – there are a number of ways to go about gathering the necessary details. What should those data sources be? Let me throw down what I think is important to track to really understand what the heck is going on outside of your AS or sphere of influence.

1. Paths to common destinations (Google, ServiceNow, Salesforce)
2. Route table for all peerings (if taking more than default and using eBGP)
3. Latency statistics from your site to common destinations (see 1.)
4. Latency statistics from outside of your network to your site
5. Latency (and possibly throughput) statistics to intermediary points across your typical paths
6. External route table and path statistics
7. Packet loss statistics
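
To make the latency and packet loss items above a little more concrete, here is a minimal Python sketch of how raw probe results could be boiled down into the numbers you would actually graph and alert on. Everything in it is an assumption for illustration: the function names, the convention that a lost probe is recorded as `None`, and the 1.5x baseline and 1% loss thresholds are mine, not something any of the tools mentioned in this post prescribe.

```python
from statistics import mean, median


def summarize_rtts(samples):
    """Summarize round-trip times in milliseconds; None marks a lost probe."""
    if not samples:
        raise ValueError("need at least one probe result")
    answered = [rtt for rtt in samples if rtt is not None]
    lost = len(samples) - len(answered)
    summary = {
        "sent": len(samples),
        "lost": lost,
        "loss_pct": 100.0 * lost / len(samples),
    }
    if answered:
        # Latency figures only make sense when something came back.
        summary["min_ms"] = min(answered)
        summary["avg_ms"] = mean(answered)
        summary["median_ms"] = median(answered)
        summary["max_ms"] = max(answered)
    return summary


def is_degraded(summary, baseline_ms, tolerance=1.5, max_loss_pct=1.0):
    """Flag a destination whose loss or median latency is well off baseline.

    The thresholds are hypothetical defaults; tune them to your own traffic.
    """
    if "median_ms" not in summary:
        return True  # nothing answered at all
    too_lossy = summary["loss_pct"] > max_loss_pct
    too_slow = summary["median_ms"] > baseline_ms * tolerance
    return too_lossy or too_slow
```

Run per destination (and per vantage point, if you have more than one), something like `is_degraded(summarize_rtts(samples), baseline_ms=20.0)` gives you a rough "is it the internet or is it us" signal. The median is used for the comparison because a single slow reply skews the average far more than it skews the median.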

That’s a decent amount of data. How can this be done? Well, let me tell you, there are a few ways, but drawing them all together can be a daunting task. You can piece it together from data produced by SmokePing or OWAMP, an SNMP graphing tool for interface stats, and BGPMon and Peermon for BGP information. An open source product called perfSONAR rolls a lot of this together, and there are commercial packages such as ThousandEyes that offer these types of statistics across a large swath of the internet as well. RIPE Atlas has a great deal of statistics that can be easily queried and has a large install base too. If you are a savvy coder you can grab some good information from the RIPE Atlas API.

If you don’t have the resources, capability, or time to do that, then there is the option of a turnkey solution. ThousandEyes has a strong offering and gathers a great deal of information. They also make information about their product readily available, most recently presenting at Network Field Day 17 (and historically at NFD 6, NFD 8, and NFD 12). I was a delegate at NFD 17 and was pleased to see another tool that provides visibility into BGP, a very often overlooked and yet unbelievably critical and useful viewpoint which has historically been difficult to see outside of tools like BGPMon (see my previous NIX4NetEng post about BGP visibility). NetBeez also has a reasonable offering, but the last time I looked it didn’t really do much outside of a network (admittedly I may be behind the curve with their product).
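
For the savvy coders, here is a rough sketch of what pulling ping data out of the RIPE Atlas API could look like. Treat it as a starting point under stated assumptions rather than a reference client: the results URL and the `prb_id`, `sent`, `rcvd` and `avg` field names reflect my understanding of the Atlas v2 API and its ping result format, so check them against the current RIPE Atlas documentation. The function names are my own, and the measurement ID is something you supply yourself.

```python
import json
from collections import defaultdict
from urllib.request import urlopen

# Assumed results endpoint for a public measurement (verify against the docs).
RESULTS_URL = "https://atlas.ripe.net/api/v2/measurements/{msm_id}/results/"


def fetch_ping_results(msm_id, timeout=30):
    """Download the results of a public measurement (needs network access)."""
    with urlopen(RESULTS_URL.format(msm_id=msm_id), timeout=timeout) as response:
        return json.load(response)


def summarize_by_probe(results):
    """Roll ping results up into loss and average RTT per probe."""
    totals = defaultdict(lambda: {"sent": 0, "rcvd": 0, "rtt_sum": 0.0})
    for entry in results:
        sent = entry.get("sent", 0)
        rcvd = entry.get("rcvd", 0)
        bucket = totals[entry["prb_id"]]
        bucket["sent"] += sent
        bucket["rcvd"] += rcvd
        if rcvd > 0:
            # Weight each reported average by the number of replies behind it.
            bucket["rtt_sum"] += entry["avg"] * rcvd
    summary = {}
    for probe_id, t in totals.items():
        summary[probe_id] = {
            "loss_pct": 100.0 * (t["sent"] - t["rcvd"]) / t["sent"] if t["sent"] else None,
            "avg_ms": t["rtt_sum"] / t["rcvd"] if t["rcvd"] else None,
        }
    return summary


if __name__ == "__main__":
    # Made-up sample entries so the sketch runs without touching the network.
    sample = [
        {"prb_id": 1, "sent": 3, "rcvd": 3, "avg": 10.0},
        {"prb_id": 1, "sent": 3, "rcvd": 0, "avg": -1},
        {"prb_id": 2, "sent": 3, "rcvd": 3, "avg": 40.0},
    ]
    print(summarize_by_probe(sample))
```

Against live data you would call `summarize_by_probe(fetch_ping_results(msm_id))` with the ID of a measurement that targets your site, which gets you a quick view of latency and loss from outside of your network looking in.
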
If you’re interested in seeing or hearing more about these products, I did a Packet Pushers podcast on perfSONAR a few years ago. It is dated as far as feeds and speeds go, but still very relevant today; you can read the show notes and listen here. For more info on ThousandEyes you can check out the latest NFD17 videos.

The real point is that to see the performance of your network and fully understand the true user experience, you need total visibility into the entire ecosystem, not just the pieces that you can control.