On July 19th, a routine update from cybersecurity company CrowdStrike caused a global IT outage affecting over 8 million Windows machines. Computer systems crashed, causing disruption to airlines, banking, medical practices and care systems worldwide. It caused flight delays, halted operations for some businesses, and created a window for malicious actors to attempt phishing attacks.
In this article we look at what exactly happened in that brief 1.5-hour window, why it was so devastating, whether it was avoidable, and how we can protect ourselves from it happening again.
Root cause
As we now know, the outage was the result of a software update from CrowdStrike. The specifics of the software are largely irrelevant; we just need to know that an update was pushed out to many clients at the same time, and that it failed in a spectacular way. The update necessitated a reboot, and upon reboot the Windows operating system failed to load.
Because the operating system failed to load, there was no hope of fixing the issue remotely: without an operating system the machine cannot get online, and if it is not online, a fix cannot be distributed to it.
This was the biggest and most crippling aspect of the outage, but there was also a sting in the tail about to rear its head.
For now, let’s look at the mechanism of delivery.
For this software update to necessitate a restart, it must have changed components of the operating system, which tells me it was an integral part of the OS.
Secondly, the fact that this update could cripple the start-up of the OS further supports that assumption. Again, I’m not looking at the specifics of this particular piece of software, just at what was changed and how it was distributed.
Clearly a faulty component of the update was delivered to the endpoints, and the endpoints were then crippled. So we can look at two aspects of this:
1 - Why was a faulty component released in the first place?
2 - How was it left to be delivered, unattended, to so many endpoints?
Testing and Distribution
For the first point we need to turn to the distributor. I'm assuming that this update went through a testing process before release into the general domain.
In looking at the testing scenarios, we need to take into account important variables like the scope of the testing: how many machines was this tested against, across what variety of machines, what versions, what service packs were installed, what other software was loaded, and so on. Testing is not a black-and-white exercise. Test environments are a complex science, and if you want a 100% success rate you have to test against every variable and every variant of machine and software configuration out there. But then you look at the numbers: the greater the number of test variations, the better your coverage, yet the returns diminish as the variations multiply. Ultimately, testing takes time and money, so there is a break point - how many variations do you test against before you say, "you know what, that will cover most machines"? Sometimes software vendors will put out a list of the platforms they have tested against and, in doing so, hold their hands up and say: anything that's not on the list, we can't guarantee.
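To make that combinatorial break point concrete, here is a minimal sketch in Python. The platform names and variables are illustrative assumptions, not anyone's actual test matrix; the point is simply how quickly the number of configurations grows.

```python
# Hypothetical illustration of how quickly a test matrix grows.
# The configuration values below are examples, not a real vendor's test scope.
from itertools import product

windows_versions = ["10 21H2", "10 22H2", "11 22H2", "11 23H2", "Server 2019", "Server 2022"]
patch_levels = ["RTM", "latest cumulative", "previous cumulative"]
coexisting_software = ["none", "other AV", "disk encryption", "VPN client"]
hardware = ["physical", "VM (Hyper-V)", "VM (VMware)"]

combinations = list(product(windows_versions, patch_levels, coexisting_software, hardware))
print(f"{len(combinations)} configurations for just four variables")
# 6 * 3 * 4 * 3 = 216 test runs already; every extra variable multiplies the
# total, which is why vendors publish a supported-platform list and stop somewhere.
```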
% of machines impacted
While we don't know the number of endpoints the software is loaded on, we do know that the company has approximately 20k customers - a large customer base, but a small percentage of endpoints compared to the total number of Windows machines in the public domain. My guess is that this affected almost all Falcon endpoints that were up and available for update during that fateful 1.5-hour window, which draws me to the conclusion that the failure rate was spectacularly high and, in my view, should almost certainly have been exposed by even a rudimentary testing platform.
We're not privy to the specifics of the released update; safe to say, in my opinion, it was either not tested thoroughly or the wrong update was released. Without a confirming statement, though, both scenarios are just conjecture.
Governance
While it's convenient to look at the distributor and point fingers, we must also look at how the update was implemented on the receiving end.
Control
Most updates should be held at a choke point before release to a production environment. In a properly controlled production environment there is a production management framework in place, with the emphasis on management.
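As a sketch of what that management layer might look like in practice, a staged rollout gate can be very simple. The ring names, sizes and health thresholds below are assumptions for illustration, not any particular vendor's mechanism.

```python
# Illustrative sketch of a staged ("ring") rollout gate.
# Ring names, sizes and health thresholds are assumptions for the example.
from dataclasses import dataclass

@dataclass
class RolloutRing:
    name: str
    endpoints: int
    required_healthy_pct: float  # health bar the previous ring must clear

RINGS = [
    RolloutRing("canary", 50, 0.0),      # first ring has no predecessor to check
    RolloutRing("pilot", 500, 99.0),
    RolloutRing("broad", 50000, 99.5),
]

def may_promote(previous_healthy_pct: float, next_ring: RolloutRing) -> bool:
    """Only release to the next ring if the previous ring stayed healthy."""
    return previous_healthy_pct >= next_ring.required_healthy_pct

# Example: the canary ring reports only 80% of machines healthy after the
# update, so the gate blocks promotion instead of pushing to every endpoint.
print(may_promote(80.0, RINGS[1]))  # False
```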
During my time in banking, there was no likely scenario in which we would allow an endpoint to do anything automatically, on its own, without control. The risk of anything happening was far too great, and when you have hundreds of millions in trade value on the table, it is a risk that is simply not entertained.
Dev Cycle
In a large corporate environment, many apps and updates aren't loaded directly onto a machine, because the organisation needs to maintain control. They are first ring-fenced, analysed, packaged and then rolled out in a controlled manner. This can be a long and costly process, and while standalone apps and major updates warrant that amount of attention, smaller updates - like the antivirus definition-style update we saw from CrowdStrike - tend to sneak in through the back door.
Ideally, they should be trapped, tested and packaged too, but with the sheer number of updates arriving at such regular intervals, many organisations are happy just to let them flow through unchecked, and this is where we open ourselves up to risk.
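One way to trap even these high-frequency updates without hand-packaging each one is a lightweight soak-test gate. The sketch below is a hypothetical outline - the function bodies are stubs and the machine names are made up - but it shows the control flow: stage the vendor update on a small lab group, let it soak, and only then let it flow to the wider estate.

```python
# Hypothetical "trap, soak, then release" gate for high-frequency vendor
# content updates. Function bodies are stubs; the point is the control flow.
import time

SOAK_GROUP = ["test-vm-01", "test-vm-02", "test-vm-03"]  # assumed lab machines
SOAK_HOURS = 4

def stage_update(update_id: str, machines: list[str]) -> None:
    """Hold the vendor update in a staging channel and apply it to lab machines only."""
    ...

def machines_healthy(machines: list[str]) -> bool:
    """Check that the lab machines still boot and report in (stub)."""
    return True

def release_to_production(update_id: str) -> None:
    """Let the update flow to the wider estate once it has survived the soak."""
    ...

def gate(update_id: str) -> None:
    stage_update(update_id, SOAK_GROUP)
    time.sleep(SOAK_HOURS * 3600)          # soak period before any wider release
    if machines_healthy(SOAK_GROUP):
        release_to_production(update_id)
    else:
        print(f"{update_id}: held back, soak group unhealthy")
```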
We hope that the updates have been tested by the distributor, we hope that they work as they should, we cede our management over to a third party, and in doing so, we risk the lot.
Conclusion
Mistakes happen; we are human. I've put enough code into production to know that an errant comma or semicolon in the wrong place can cripple an entire working system, bringing servers to their knees and causing global outages. But there are traps along the way, safeguards that can, and should, be put in place to catch these slip-ups.
Maybe we've become so accustomed to flawless updates and streamlined deliveries that we have analysed the risk versus the cost and concluded that the risk is worth it. Or maybe it has been an unconscious decision, an "it'll be alright" muttered and accepted along the way, unchecked by unassuming heads.
Either way, the outages of the 19th July are a costly reminder of what can happen when we take our hands off the wheel. And while it's all too convenient to point fingers at CrowdStrike, we must first check our own practices around automated software delivery and ask ourselves: are we really in control of our production environment?