How I learned to stop worrying and trust my routes
If you’re a network engineer, should you be apprehensive? After all, you still trust a 30-year-old mechanism to send expensive data around the globe. Conceived in 1989 as a short-term solution, Border Gateway Protocol (BGP) was never replaced by a better one.
Jokingly called the “three-napkin-protocol” because the idea was first jotted down on the back of a napkin, BGP enabled the Internet to scale and continue its explosive growth to what it is today. This is something BGP’s predecessor, Exterior Gateway Protocol (EGP), was not capable of providing. Although attempts have been made to design alternatives, none have come close to replacing BGP, which remains the standard for inter-Autonomous System (AS) routing.
Worried yet?
Should you trust your route?
BGP’s design is based on trust, which may be its greatest strength and weakness. An AS will trust, accept, and propagate the routes advertised by its peers, no questions asked. It’s this quality that allows the whole of the Internet to be quickly updated about the most efficient path to a destination. But therein lies its weakness.
In 2008, a telecom provider in Pakistan brought down YouTube for a couple of hours. The telco advertised a more specific prefix to reach YouTube, which should have been rejected but was accepted and forwarded by an upstream provider.
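Why does a more specific prefix win? Routers forward on the longest matching prefix, so a narrower announcement overrides a broader one for the addresses it covers. Here is a minimal sketch of that rule. The prefixes come from the RFC 5737 documentation range, not from the actual incident, and `best_route` and the next-hop labels are hypothetical names used only for illustration.

```python
import ipaddress


def best_route(table, destination):
    """Return the (prefix, next_hop) entry with the longest prefix covering destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in table if addr in net]
    if not matches:
        return None
    # Longest-prefix match: the most specific covering prefix wins.
    return max(matches, key=lambda entry: entry[0].prefixlen)


# Illustrative prefixes only (RFC 5737 documentation range).
legitimate = (ipaddress.ip_network("198.51.100.0/24"), "legitimate-origin")
leaked = (ipaddress.ip_network("198.51.100.0/25"), "leaking-telco")

before = best_route([legitimate], "198.51.100.10")
after = best_route([legitimate, leaked], "198.51.100.10")
unaffected = best_route([legitimate, leaked], "198.51.100.200")
```

In this sketch, traffic to 198.51.100.10 shifts to the leaking network as soon as the /25 appears, while 198.51.100.200, which sits outside the /25, still follows the legitimate /24.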
Seven years later, nothing had changed. In 2015, a telecom major in Malaysia almost brought down the Internet. The telco advertised routes to 179,000 prefixes, which a U.S.-based Tier 1 provider accepted and forwarded to its peer providers and customers. As a result, the provider in Malaysia was inundated with most of the world’s Internet traffic—which it was not equipped to handle. This caused a considerable Internet slowdown.
Known as route leaks, these and other similar incidents occur because the inherent nature of BGP is to trust. It’s this same trust-based approach that abets BGP hijacking. Configure an edge router to announce more specific prefixes to a destination, and peer networks will accept and divert traffic intended for the destination to the attacker. Numerous incidents throughout Internet history, some accidental and some deliberate, have caused data to traverse the most confounding paths. In one incident, two Denver computers exchanged their traffic through Iceland, and in another, British traffic to the Atomic Weapons Establishment flowed through Ukraine!
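Incidents like these are what an ingress prefix filter is meant to catch. The sketch below shows one simplified way an upstream could decide whether to accept a customer's announcement: the prefix must fall inside a block the customer is allowed to originate, and it must not be more specific than a chosen limit. The function name, the allowed list, and the /24 limit are assumptions for illustration, and the check is IPv4-only. Production networks typically build such lists from routing registry data or use RPKI origin validation rather than a hand-written check.

```python
import ipaddress


def accept_announcement(prefix, allowed_blocks, max_len=24):
    """Accept an IPv4 prefix only if an allowed block covers it and it is not too specific."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > max_len:
        return False  # more specific than this (assumed) policy permits
    return any(net.subnet_of(block) for block in allowed_blocks)


# Hypothetical customer that may originate a single documentation prefix.
customer_blocks = [ipaddress.ip_network("203.0.113.0/24")]

own_prefix = accept_announcement("203.0.113.0/24", customer_blocks)
foreign_prefix = accept_announcement("198.51.100.0/24", customer_blocks)
too_specific = accept_announcement("203.0.113.0/25", customer_blocks)
```

Here only the customer's own /24 is accepted. The announcement for someone else's address space and the overly specific /25 are both dropped before they can propagate.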
Not just BGP—IGP too
The network has more to throw at you. Route flaps, link congestion, prefix changes, missing routes, and incorrectly configured filters and policies can all lead to routing issues. The causes can be anything from network hardware issues to misconfigurations or fat fingers. Whatever the reason for your routing troubles, as a provider of network services, the last thing you want to hear is that you violated your SLAs.
Routing issues can take a long time to resolve. Networks are large and complex, the problems are varied, and the root cause is often difficult to find. In the chaos, you need to find what is affecting data delivery while making sure you don’t violate SLAs. If you think “show ip bgp” or “debug ip ospf *” are all you need to solve your routing troubles, you may be in for a long night.
Because both inter- and intra-AS routing are complex and dynamic, traditional SNMP-based monitoring tools can’t explain why data traversed a slower or unknown route for five minutes a week ago. If your operational monitoring relies on such a tool, which gathers routing information via SNMP or probes, can you afford to wait until the next ‘SNMP get’ request is triggered and completed, or until a probe collects its data and sends it over the Internet to your server?
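To see why the polling gap matters, consider a short-lived path change that starts and ends between two polls. The sketch below simply lists which poll timestamps land inside the window of the change. The function name and all the timings are made up for illustration and do not come from any particular tool.

```python
def polls_that_see_change(change_start, change_end, poll_interval, horizon):
    """Return the poll times (in seconds) that fall inside [change_start, change_end)."""
    return [
        t for t in range(0, horizon, poll_interval)
        if change_start <= t < change_end
    ]


# Hypothetical 90-second path change that begins 100 seconds into the hour.
five_minute_polls = polls_that_see_change(100, 190, 300, 3600)
thirty_second_polls = polls_that_see_change(100, 190, 30, 3600)
```

With these numbers the five-minute poller never observes the change at all, while the 30-second poller sees it three times. Even then it only knows that something was different at those instants, not when the change began or ended.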
How to stop worrying
Help comes in the form of proactive monitoring that provides real-time visibility into the complete data path of the services you care about, a capability known as route analytics. Blue Planet Route Optimization and Assurance (ROA) is a route analytics solution that can help fast-forward your manual, time-consuming CLI-based troubleshooting by providing real-time and historical information about routing performance. ROA alerts on routing events such as changes to paths, prefixes, and route state, and lets you drill down on a path change to look at the underlying performance metrics. All of this reduces the mean time to identify and resolve routing issues and helps you achieve your SLA targets.
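The general idea of keeping a path history and flagging changes can be sketched in a few lines. This is not how ROA is implemented. It is a toy illustration with a hypothetical `PathHistory` class, documentation-range AS numbers and prefix, and the assumption that updates arrive in time order.

```python
from collections import defaultdict


class PathHistory:
    """Toy per-prefix record of AS paths over time (updates assumed to arrive in order)."""

    def __init__(self):
        self._history = defaultdict(list)  # prefix -> [(timestamp, as_path), ...]

    def record(self, timestamp, prefix, as_path):
        """Store an update and return True if it changed the known path for the prefix."""
        entries = self._history[prefix]
        path = tuple(as_path)
        changed = bool(entries) and entries[-1][1] != path
        if not entries or changed:
            entries.append((timestamp, path))
        return changed

    def path_at(self, prefix, timestamp):
        """Return the path that was in effect at the given time, or None if unknown."""
        current = None
        for ts, path in self._history.get(prefix, []):
            if ts > timestamp:
                break
            current = path
        return current


history = PathHistory()
first = history.record(0, "203.0.113.0/24", [64496, 64497])
repeat = history.record(60, "203.0.113.0/24", [64496, 64497])
detour = history.record(120, "203.0.113.0/24", [64496, 64500, 64497])
```

Only the third update is reported as a change, and `history.path_at("203.0.113.0/24", 90)` still returns the original path, which is the kind of "what was the route back then?" question a polling snapshot cannot answer.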
With route analytics, you can never go off-route!