When operating large, globally distributed networks, hardware failures, provider outages, and other changes in behavior are regular occurrences. Therefore, systems that can raise alarms at the first signs of trouble can alert human operators or automated systems and enable faster corrective action. To this end, we developed ShakeAlert, an alerting system built on publicly available external data to alert to sudden Internet changes.
ShakeAlert monitors streams of BGP (Border Gateway Protocol) updates observed from public collectors for paths originating from the CDN. When the volume of updates sharply increases, ShakeAlert raises an alarm, called a Shake, signaling a possible change in the Internet’s routing behavior, and in particular, how external networks route their traffic to the CDN. Using the contents of these updates, ShakeAlert further provides an estimate of the likely impacted PoPs (Points-of-Presence), providers, and prefixes.
The system is named for the United States Geological Survey's earthquake early warning system. That system functions by detecting faster moving, but less destructive, P-waves and alerting residents before the more destructive S-waves arrive. Here, we consider the control-plane signals of BGP as the early warning signal for potentially concerning changes to the data plane and end-user traffic.
To facilitate routing between Autonomous Systems (ASes) on the Internet, networks communicate which prefixes they have routes to via BGP. As part of this communication, when the reachability of a prefix changes, a network sends a series of messages called announcements that indicate the impacted prefixes and the path the network would use to reach them.
In the regular operation of the Internet, thousands of such messages are traded between networks as they update their routing tables. Each time these routes change, for example, as the result of network failures, new connectivity between networks, or planned maintenance, a new set of messages may be exchanged. These may include changes generated by the origin network (e.g., announcing a new IP block) or changes that occur downstream from the origin (e.g., the connectivity at a transit provider changes).
These messages necessarily offer a great deal of insight into the current state of the Internet, revealing the gains and losses of connectivity between networks. To take advantage of this information, many organizations1 run services called route collectors, which peer with many networks and make the collected update messages available publicly.
When connectivity changes, upstream networks send BGP updates, which eventually make their way to BGP collectors.
To develop a sense of these behaviors, we begin with some initial observations of the update feeds for a handful of different networks: two large CDN networks (CDN A, CDN B), a content network (Content), two large ISPs (ISP A, ISP B), and a DNS Root Letter. Each type of network is architected for different purposes and potentially features different peering policies. For each network, we group the updates into 1-minute buckets and consider the number of messages in each bucket over one hour in January 2021.
The above figure plots the size of each update bin during that period. Here, we see that the CDNs, which both feature large deployments and many peers and providers, generate the most updates for nearly the entire span, with the content network relatively close behind. The ISPs and the Root Letter generate significantly fewer updates. These dramatic differences in magnitude indicate that a network's structure, peers, and architecture likely have significant impacts on the corresponding message volumes. We therefore must ensure that our system is flexible to changing parameters, as we discuss in the next section.
ShakeAlert listens to live feeds from 21 route collectors which are part of RIPE NCC’s Routing Information System (RIS) project2. Data arrives from these feeds and is grouped into minute-long time bins. We then count the number of updates in each bin and use an outlier detection algorithm to determine if a bin has an abnormally large number of updates as compared to other recent minutes. In the event such a bin is observed, we generate a Shake.
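The binning step can be illustrated with a minimal sketch. It assumes each update has already been reduced to a Unix timestamp in seconds; the function names and the zero-filling helper are illustrative and not taken from the ShakeAlert implementation.

```python
from collections import Counter

BIN_SECONDS = 60  # one-minute bins, as described in the text


def bin_update_counts(timestamps, bin_seconds=BIN_SECONDS):
    """Count updates per time bin.

    timestamps: iterable of Unix timestamps in seconds (int or float).
    Returns a dict mapping each bin's start time (int seconds) to its count.
    """
    counts = Counter()
    for ts in timestamps:
        # Floor the timestamp to the start of its bin.
        bin_start = int(ts // bin_seconds) * bin_seconds
        counts[bin_start] += 1
    return dict(counts)


def dense_series(counts, start, end, bin_seconds=BIN_SECONDS):
    """Counts for every bin in [start, end), with quiet minutes filled as 0."""
    return [counts.get(t, 0) for t in range(start, end, bin_seconds)]


if __name__ == "__main__":
    # Hypothetical timestamps: three updates in minute 0, one each in minutes 1 and 2.
    counts = bin_update_counts([0, 10.5, 59.9, 60, 125])
    print(dense_series(counts, 0, 240))
```

Filling quiet minutes with zeros matters for the later outlier test, since a minute with no updates is still an observation in the window.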
To this end, ShakeAlert maintains a sliding window of the count of updates seen in the last w bins, allowing it to avoid storing information about updates more than w bins old. Once ShakeAlert has built a history of w bins, and the (w+1)th bin is complete, it considers the count of this new bin, b_{w+1}, relative to those in the previous window. While many potential anomaly detection mechanisms could be used (e.g., modified z-score and estimates of the standard deviation, static thresholds, and various change detection techniques), we employ a density-based detection mechanism3 4.
To perform density-based anomaly detection, we consider a radius R and a neighbor count k. We say that our new time bin b_{w+1} is an outlier if there are fewer than k other bins in the last w minutes with counts within radius R of the count of the new bin. Formally, an outlier has fewer than k bins b_i in the last w minutes such that | |b_{w+1}| - |b_i| | < R, where |b| denotes the number of updates in bin b. We refer to any such outliers as shake alerts, or simply shakes, and to the update count of a shake as its size.
The overall detection process of ShakeAlert
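The sliding window and density test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the production detector: the class name is hypothetical, the radius is fixed rather than derived from the data, and the sketch assumes that an outlier bin is still added to the history after it is tested.

```python
from collections import deque


class DensityDetector:
    """Flags a bin as a shake if fewer than k recent bins lie within radius R."""

    def __init__(self, window, k, radius):
        self.window = window    # w: number of past bins to keep
        self.k = k              # minimum neighbors for a bin to be "normal"
        self.radius = radius    # R: maximum count difference for a neighbor
        # deque with maxlen drops bins older than w automatically.
        self.history = deque(maxlen=window)

    def observe(self, count):
        """Feed the count of a completed bin; return True if it is a shake."""
        is_shake = False
        # Only test once a full history of w bins has been built.
        if len(self.history) == self.window:
            neighbors = sum(
                1 for past in self.history if abs(count - past) < self.radius
            )
            is_shake = neighbors < self.k
        # Assumption: the new bin joins the window whether or not it alerted.
        self.history.append(count)
        return is_shake


if __name__ == "__main__":
    detector = DensityDetector(window=5, k=2, radius=10)
    for count in [3, 4, 5, 4, 3, 100, 4]:
        print(count, detector.observe(count))
```

Only the last w counts are stored, so memory use stays constant regardless of how long the detector runs.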
Our underlying hypothesis is that large and disruptive Internet routing changes generate the largest of these events: routes that carry significant traffic are likely to be heard by many downstream networks and collectors. However, many changes that involve large update counts do not fall into this category, for example, regularly scheduled maintenance for which we withdraw Anycast announcements.
The contents of the updates in the bin can be further examined to reveal details about the nature of the network event. The prefixes in the updates can be used to determine which PoPs and Anycast regions are potentially impacted by the corresponding network changes. We can further examine the paths observed across the updates and estimate the upstream networks most likely to be impacted. Finally, we can rank alerts by the importance of the affected prefixes and paths to inbound CDN traffic.
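One way to picture this attribution step is the sketch below. Everything specific in it is an assumption made for illustration: the prefix-to-PoP table, the PoP names, the origin AS number (taken from the documentation range), and the representation of an update as a dict with "prefixes" and "path" keys.

```python
import ipaddress
from collections import Counter

# Hypothetical mapping from announced prefixes to PoP names.
POP_PREFIXES = {
    ipaddress.ip_network("192.0.2.0/24"): "pop-a",
    ipaddress.ip_network("198.51.100.0/24"): "pop-b",
}
ORIGIN_ASN = 64500  # hypothetical CDN origin AS


def upstream_of(path, origin=ORIGIN_ASN):
    """Return the AS adjacent to the origin, skipping any prepended repeats."""
    i = len(path) - 1
    if i < 0 or path[i] != origin:
        return None
    while i >= 0 and path[i] == origin:
        i -= 1
    return path[i] if i >= 0 else None


def summarize_bin(updates):
    """Return (pop_counts, upstream_counts) for the updates in one bin."""
    pops, upstreams = Counter(), Counter()
    for update in updates:
        for prefix in update["prefixes"]:
            net = ipaddress.ip_network(prefix)
            for known, pop in POP_PREFIXES.items():
                # subnet_of requires both networks to share an IP version.
                if net.version == known.version and net.subnet_of(known):
                    pops[pop] += 1
        upstream = upstream_of(update["path"])
        if upstream is not None:
            upstreams[upstream] += 1
    return pops, upstreams


if __name__ == "__main__":
    pops, upstreams = summarize_bin([
        {"prefixes": ["192.0.2.0/24"], "path": [64496, 64497, 64500]},
        {"prefixes": ["198.51.100.0/24"], "path": [64498, 64497, 64500, 64500]},
    ])
    print(pops.most_common(), upstreams.most_common())
```

The most frequent PoPs and upstream ASes in a shake's bin then give a rough estimate of where the change is concentrated.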
In our CDN deployment, we use a window w of 360 minutes and k of 5, allowing us to avoid alerting on commonly observed hourly behaviors. We also take R to be the distance between the 5th and 95th percentiles of bin sizes observed in the window. To increase the operational context, we further subdivide the bins into PoP-specific time series based on the prefixes and paths observed in the updates, alerting each individually. Finally, we consider a number of specific tunings, for example, setting minimum alert sizes based on our network observations.
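A sketch of these parameter choices follows. The window size and k come from the text; the use of `statistics.quantiles` with the inclusive method to estimate the percentiles, and the `min_size` parameter standing in for a minimum alert size, are assumptions made for illustration.

```python
import statistics

WINDOW = 360  # minutes of history, from the text
K = 5         # neighbor count, from the text


def adaptive_radius(window_counts):
    """Distance between the 5th and 95th percentiles of the window's bin sizes."""
    # n=20 yields 19 cut points at 5% steps: first is the 5th percentile,
    # last is the 95th.
    cuts = statistics.quantiles(window_counts, n=20, method="inclusive")
    return cuts[-1] - cuts[0]


def is_shake(window_counts, new_count, k=K, min_size=0):
    """Density test with a data-derived radius and a hypothetical size floor."""
    if new_count < min_size:
        return False
    radius = adaptive_radius(window_counts)
    neighbors = sum(1 for c in window_counts if abs(new_count - c) < radius)
    return neighbors < k


if __name__ == "__main__":
    # Hypothetical window: counts cycling through 0..9 for 360 minutes.
    window = [i % 10 for i in range(WINDOW)]
    print(adaptive_radius(window))
    print(is_shake(window, 500), is_shake(window, 5))
```

In this sketch, a completely flat window makes the radius collapse to zero, so every new bin would count as an outlier; a floor such as `min_size` is one way to keep very quiet series from alerting.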
Next, we consider a simple example demonstrating how shakes appear in the wild. In the above example, taken from September of 2022, we focus on a specific CDN PoP; note the linear y-axis and the much smaller update counts than in our earlier figure. During this time period, the update counts are almost entirely 0 until they suddenly increase, producing a large spike at 12:14 and a second spike shortly after at 12:20, both of which generated shakes. These updates were caused by an unexpected disruption in connectivity to a provider.
To measure whether the shakes represent interesting events, we perform the following analysis. For each shake generated over 30 days in the summer of 2022, we examine our internal metrics at the corresponding site to determine if we observed anomalous behavior within 10 minutes of the shake. As anomalies, we consider the following: router resets (e.g., a router restarted or otherwise went offline), BGP state changes on a provider link (i.e., a provider BGP session exited the ESTABLISHED state), changes in the set of announcements made from a site, and packet loss detected between the corresponding site and at least five other sites.
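The matching step in this analysis can be sketched as below. The data layout (lists of site and timestamp pairs), the function name, and the sample values are hypothetical; the sketch treats "within 10 minutes" as either before or after the shake, which is an assumption.

```python
from datetime import datetime, timedelta

MATCH_WINDOW = timedelta(minutes=10)  # from the text


def match_rate(shakes, events, window=MATCH_WINDOW):
    """Fraction of shakes with an internal event at the same site within `window`.

    shakes, events: lists of (site, datetime) tuples.
    """
    if not shakes:
        return 0.0
    matched = 0
    for site, shake_time in shakes:
        # A shake matches if any event at the same site is close enough in time.
        if any(s == site and abs(t - shake_time) <= window for s, t in events):
            matched += 1
    return matched / len(shakes)


if __name__ == "__main__":
    day = datetime(2022, 7, 1)  # hypothetical date
    shakes = [
        ("pop-a", day.replace(hour=12, minute=14)),
        ("pop-a", day.replace(hour=12, minute=20)),
        ("pop-b", day.replace(hour=15, minute=0)),
    ]
    events = [
        ("pop-a", day.replace(hour=12, minute=16)),  # e.g., a BGP session change
        ("pop-b", day.replace(hour=16, minute=0)),   # too far from the pop-b shake
    ]
    print(match_rate(shakes, events))
```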
The above figure shows the breakdown of these events over 30 days. Here, we see all buckets had corresponding events for at least 60% of shakes, and on average, 80% of shakes had matching events. These findings confirm that the largest shakes correspond with important and often traffic-impacting events. However, they further emphasize the breadth of events, ranging from routine maintenance to acute failures, that may generate such shakes.
ShakeAlert offers a new angle of visibility to our already rich CDN monitoring. By pulling its data from an external source, we know it offers fundamentally different insights into the behavior of the Internet. In our ongoing work on the system, we are exploring how the data can be further combined with internal monitoring to improve the accuracy of the alerts and enable automated corrective actions.
Special thanks to the Research Team, Networking Reliability Engineering Team, and all the internal engineering teams who made this work possible. Further thanks to many external experts on routing data, including Emile Aben, Stephen Strowes, and Mingwei Zhang, who provided helpful feedback and discussions.
1 E.g., RouteViews and RIPE RIS.
2 The system can fundamentally use any collectors. Here we simply focus on RIS due to the flexibility of its WebSocket interface.
3 M. Gupta, J. Gao, C. C. Aggarwal, and J. Han. Outlier detection for temporal data: A survey. IEEE Transactions on Knowledge and Data Engineering, 2014.
4 T. Kitabatake, R. Fontugne, and H. Esaki. BLT: A taxonomy and classification tool for mining BGP update messages. In Proc. of INFOCOM ’18, 2018.