At our CDN, we continuously develop new ways to help our infrastructure maintain high performance and resilience during sudden load increases. This often involves optimizing techniques we already use. Validating new approaches is an important part of this process, as it confirms that changes have the expected positive impact in production environments.
When content becomes popular, we replicate it on multiple servers in a data center/Point of Presence (PoP). This allows us to load balance better and handle such surges, reducing the risk of servers overloading. This replication mechanism is called ‘Hot Filing,’ and the popular content is called a ‘hot file’. We recently developed an adaptive load balancing (ALB) optimization that improves our traditional hot filing by making it more resilient to traffic surges. In particular, it dynamically decides the amount of replication needed by measuring server load instead of relying on static thresholds. This more evenly distributes the load across all servers in a PoP, increasing the overall request-handling capacity of any PoP and, thus, the CDN. Our previous article on evaluating an Adaptive Load Balancing System details the mechanics of this optimization.
In this post, we validate the positive impact of this optimization in production. We begin by defining the metrics we used to measure the impact. Next, we show the impact of the optimization on each of these metrics through examples. Finally, we show the aggregate impact of the optimization across all PoPs where it was deployed. In summary, we found that half of the PoPs saw a 70-95% reduction in the total time servers spent above a predefined load threshold, and almost all PoPs saw an improvement in this metric. This optimization is now globally enabled for all customer traffic.
Defining the metrics
To evaluate the impact of this new optimization in production, we defined and monitored three metrics:
Server skewness: This is the ratio of a server's load to the median load across all servers in the PoP.
Number of hot files: This is the number of files replicated by the Hot Filing mechanism at any given time. Adaptive Load Balancing is expected to increase the number of hot files, as hot filing is the underlying mechanism for its load distribution.
Time servers spend over target skewness: This metric evaluates the effectiveness of Adaptive Load Balancing. When we set our target skewness threshold to a particular value, we want to see that most servers’ skewness is usually smaller than that value.
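To make the first and third metrics concrete, here is a minimal sketch of how they could be computed from per-minute load samples. It is not our production code: the function names, the server names, the one-sample-per-minute layout, and the choice to count server-minutes (one minute per server over the target) are all assumptions made for illustration.

```python
from statistics import median


def server_skewness(loads):
    """Map each server to its load divided by the PoP's median load."""
    med = median(loads.values())
    if med <= 0:
        raise ValueError("median load must be positive")
    return {server: load / med for server, load in loads.items()}


def minutes_over_target(samples, target):
    """Count server-minutes above the target skewness.

    `samples` holds one {server: load} dict per minute (an assumed layout).
    """
    total = 0
    for loads in samples:
        skew = server_skewness(loads)
        total += sum(1 for value in skew.values() if value > target)
    return total


# Hypothetical per-minute loads in Mbps for a four-server PoP.
samples = [
    {"s1": 100, "s2": 100, "s3": 100, "s4": 250},
    {"s1": 100, "s2": 120, "s3": 100, "s4": 150},
]
print(server_skewness(samples[0]))
print(minutes_over_target(samples, target=1.8))
```

In the first sample, the fourth server carries 2.5 times the median and counts toward the total; in the second, no server exceeds the target.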
Impact of optimization on monitored metrics
ACHIEVING TARGET SKEWNESS
The figure below shows two snapshots of the load distribution of servers in a PoP. On the left (without Adaptive Load Balancing), we see a number of servers that exceed the skewness target (red box) with their load above 1.8 times the median of the PoP. On the right (with Adaptive Load Balancing), we see greater balance across servers, with no servers exhibiting a load higher than 1.8 times the median.
Change in load distribution with Adaptive Load Balancing. We notice a reduction in overloaded servers.
NUMBER OF HOT FILES AND LOAD DISTRIBUTION
Next, we examined how the number of hot files was impacted and the corresponding changes to the load on each server. The plot below shows that the number of hot files increased when ALB was enabled on a PoP. This behavior was expected since the mechanism selectively increases the chances of files being offloaded from servers with a higher load.
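The idea of offloading only from servers with a higher load can be sketched as follows. This is an illustrative simplification, not the production selection logic: `offload_candidates`, the server names, and the notion of "excess load" (how far a server sits above target times median) are assumptions introduced here.

```python
from statistics import median


def offload_candidates(loads, target):
    """Return {server: excess load} for servers above target * median.

    Servers at or below the limit are left alone, so no additional
    files would be replicated on their behalf.
    """
    limit = target * median(loads.values())
    return {s: load - limit for s, load in loads.items() if load > limit}


# Hypothetical snapshot of server loads in Mbps.
loads = {"s1": 100, "s2": 110, "s3": 90, "s4": 200, "s5": 100}
print(offload_candidates(loads, target=1.8))
```

Only the outlier server appears in the result, which mirrors why the number of additional hot files stays small relative to the total files served.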
Fundamentally, hot filing and ALB reduce the load on individual servers by increasing the number of servers that serve particular files. This increases the storage load on each server. However, the number of additional files chosen for replication at any given time is low compared to the total number of files served from the PoP, since they are selected only from outlier servers that require offloading. In most cases, the additional cache space used is very small compared to the total disk space. The trade-off is therefore worthwhile, but it is important to identify and be aware of. In our implementation, we included sanity checks to validate that cache usage is not negatively affected by this optimization.
Number of hot files. The number increases when Adaptive Load Balancing is turned on (red marker), since the mechanism attempts to offload outlier servers.
The second plot shows the volume of traffic delivered by each server over the same timeframe at that PoP. We observed that when Adaptive Load Balancing was enabled (dashed red line) the load across servers became more balanced. This made the servers more resilient to incoming traffic and reduced the risk of overloading servers.
Load distribution (Mbps) in a PoP between 05/02-05/04. When Adaptive Load Balancing is turned on, the distribution becomes smoother, with more servers delivering traffic closer to the median. This helps mitigate the risk of servers becoming overloaded with new traffic.
TIME SPENT OVER TARGET SKEWNESS
Here, we considered an experiment in which a PoP was set to maintain a target skewness of 1.6x. In the figure below, the orange line shows the distribution of “server skewness” over the experiment period. Comparing this distribution to the blue line, which is the respective distribution for the baseline period (no Adaptive Load Balancing), we saw a load shift toward the median. Notably, the “tail” of the distribution was also reduced significantly, with the 99th percentile dropping from 2.12 to 1.52, below the target skewness.
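A tail check like the 99th-percentile comparison above can be sketched with a simple nearest-rank percentile. The helper names and the sample values below are hypothetical, and the nearest-rank definition is one common choice among several; it is not necessarily the one used to produce the figure.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of
    the data at or below it. `p` is an integer in 1..100."""
    if not values or not 0 < p <= 100:
        raise ValueError("need data and 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def tail_within_target(skewness_samples, target, p=99):
    """True when the p-th percentile of skewness is at or below the target."""
    return percentile(skewness_samples, p) <= target


# Hypothetical skewness samples: mostly balanced, with one heavy outlier.
samples = [1.0] * 99 + [2.12]
print(percentile(samples, 99), percentile(samples, 100))
print(tail_within_target(samples, target=1.6))
```

Comparing the same percentile for the baseline and experiment periods gives a single number for how much the tail moved.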
Adaptive Load Balancing decreases the maximum server load and brings server loads closer to the median.
Reducing that “tail” in the distribution is the main goal of the optimization since servers in that tail, i.e., those with the highest load, run a greater risk of being overloaded with new traffic spikes. To further quantify this reduction, we also measured the number of minutes for which any server delivered traffic over the target during the experiment periods with/without Adaptive Load Balancing:
Adaptive Load Balancing reduces the time servers spend delivering a load over a target threshold.
In this PoP, we observed an 88% reduction in time spent over target skewness. This is a good indicator that Adaptive Load Balancing can keep the skewness of the load distribution around the desired value.
Results from global deployment
After testing the optimization on a handful of selected PoPs and seeing good results on the measured metrics, we deployed the system to every PoP to quantify the aggregated impact over time. As before, we measured the number of collective minutes servers in a PoP spent delivering traffic above our specified target skewness (set to 1.8x the median server load in a PoP). The next plot shows two distributions of minutes servers spent over that threshold for 75 PoPs. The blue line corresponds to 4 days of baseline data, and the orange line corresponds to 4 days of Adaptive Load Balancing data. The overall shift of the distribution to the left shows that servers in the PoPs running Adaptive Load Balancing spent fewer minutes over the threshold.
Aggregate minutes spent by servers delivering load higher than the specified threshold (1.8 * median) over 4 days for all PoPs. With Adaptive Load Balancing, the distribution is shifted to the left, showing that the mechanism maintains server loads below the target deviation from the median for longer times.
To further understand the impact on individual PoPs, we also recorded the percentage change in this metric for each PoP. The results showed that half of the PoPs saw a 70-95% reduction in the total time servers spent above target skewness, and almost all PoPs saw a reduction in the time spent above the threshold.
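The per-PoP comparison boils down to a percentage change between the baseline and Adaptive Load Balancing periods. The sketch below shows one way to compute and summarize it; the PoP names and minute counts are made up for illustration and are not our measured data, and `summarize` is a hypothetical helper.

```python
from statistics import median


def percent_reduction(baseline_minutes, alb_minutes):
    """Percentage drop in minutes over target; negative means a regression."""
    if baseline_minutes <= 0:
        raise ValueError("baseline minutes must be positive")
    return 100.0 * (baseline_minutes - alb_minutes) / baseline_minutes


def summarize(per_pop):
    """`per_pop` maps a PoP name to (baseline minutes, ALB minutes)."""
    reductions = {pop: percent_reduction(b, a) for pop, (b, a) in per_pop.items()}
    improved = sum(1 for r in reductions.values() if r > 0)
    return {
        "median_reduction_pct": median(reductions.values()),
        "share_improved": improved / len(reductions),
    }


# Hypothetical minutes over target for three PoPs (not measured data).
per_pop = {"pop_a": (200, 20), "pop_b": (100, 30), "pop_c": (50, 60)}
print(summarize(per_pop))
```

Reporting the median reduction alongside the share of PoPs that improved keeps a few unusual PoPs from dominating the summary.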
In our ongoing effort to continuously improve the performance and reliability of our CDN, we recently deployed and evaluated an optimization in the way we load-balance traffic within a PoP. This optimization identifies servers loaded over a specified threshold compared to the rest of the PoP and offloads those servers, particularly mitigating the risk of performance impact in case of new traffic spikes. The results from production demonstrate significant improvements that have been consistent with earlier simulation results in showing the efficiency of the optimization at maintaining the distribution of server load within a desired skewness. As a result, we have now enabled this optimization globally for all customer traffic.
Special thanks to Angela Chen for working on the implementation and deployment of this mechanism. Thanks also to Scott Yeager, Derek Shiell, Marcel Flores, Anant Shah, and Reed Morrison for helping with the general discussions, and to Colin Rasor, Richard Rymer, and Olexandr Rybak for helping with data collection and visualization.