For most people in the world of IT operations, it wasn’t a fun time. The problem with the CrowdStrike Falcon software update created what most tell me is the worst single outage in history. I’ve gotten 78 comments from enterprise CIOs, operations and support managers, and development managers on the issue, and while all of them had managed to recover their own systems by Monday (July 22nd), they weren’t talking as much about the recovery as about the problem itself. All of them said that both CrowdStrike and Microsoft needed to look hard at their own practices, and most said that there were other broader issues to face.
Interestingly, the top comment in that “broader issues” category (cited by 51 of 78) was that the news that came out on the problem was incomplete at best and misleading at worst. This group said that while they were aware of the actual problem, the Falcon update flaw, the reporting described it as a Microsoft Windows problem, and they were bombarded with inquiries from employees (and in some cases, customers) who wanted to know what they should do to prevent crashes on their own systems. The actual problem was limited to systems running the Falcon Sensor for Windows software, and in fact to a specific sensor version set (7.11 and above) that downloaded the faulty update during the window when it was being distributed (around midnight on July 18/19). The support load created by the way the problem was described hampered recovery, this group said.
The next-most-discussed point (48 of 78) was the belief that everyone was too quick to dismiss malice as a factor. They were aware that CrowdStrike said it wasn’t an attack or hacking, but this group was concerned either that the position was taken prematurely or that it omitted the possibility that an insider had done it deliberately. Every user expressed concern that the problem illustrated a vulnerability that could be exploited, but 65 of 78 said CrowdStrike assured them it was working on process changes to prevent a recurrence, whatever the direct cause, and 36 said they were told that additional safeguards against hacking would be introduced. Four said they were told that CrowdStrike was working on a way to automatically remove or replace a faulty update, but of course the problem here wasn’t so much that the update was faulty as that the sensor logic that processed the security definitions, the “tables,” was faulty.
The problem, I’m told by multiple sources, wasn’t that CrowdStrike updated software, but that it updated what was essentially a threat definition, very similar to the definitions any virus scanner uses (a capability CrowdStrike’s product also includes). That update triggered a logic error that had been present in the software itself for some time. This isn’t unheard of; in a past incident, an anti-virus package declared a critical Windows system file to be a virus and deleted it, a similar class of problem.
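To make that failure mode concrete, here’s a minimal, purely hypothetical sketch of how a data-only update can trip a latent bug in the code that interprets it. None of this is CrowdStrike’s actual sensor logic or file format; the point is simply that the executable code never changed, and only a new definition record exposed the flaw.

```python
# Hypothetical illustration: a definition-file interpreter with a latent bug.
# This is NOT CrowdStrike's sensor code or file format.

def load_definitions(lines):
    """Parse 'name,field_index,pattern' records from a definition file."""
    return [tuple(line.strip().split(",")) for line in lines if line.strip()]

def evaluate(event_fields, definitions):
    """Check an event (a list of fields) against every loaded definition."""
    matches = []
    for name, field_index, pattern in definitions:
        # Latent bug: the field index from the definition is trusted blindly.
        # Every earlier definition happened to stay in range, so tests of the
        # *code* never exercised the failure.
        value = event_fields[int(field_index)]   # IndexError if out of range
        if pattern in value:
            matches.append(name)
    return matches

event = ["proc.exe", r"\\.\pipe\evil"]

# The old definitions work fine...
good_defs = load_definitions(["suspicious_pipe,1,\\evil"])
print(evaluate(event, good_defs))        # ['suspicious_pipe']

# ...but one new definition referencing a field that doesn't exist crashes
# the interpreter, even though no executable code was updated.
bad_defs = load_definitions(["new_rule,7,anything"])
try:
    evaluate(event, bad_defs)
except IndexError as err:
    print("interpreter crashed:", err)   # the data change exposed the code bug
```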
Of the 78, 44 said they were concerned that this was an indication of a broader flaw. They pointed out that security and monitoring software had caused outages or vulnerabilities in the past, including the anti-virus flaw and the SolarWinds attack that began in September 2019, the best-known and most successful example of a hacking attack launched via a third-party supplier (hence the name “supply chain attack”) rather than directly on target systems. This group said that security and monitoring software usually has a special relationship with the operating system of the systems that run it, a relationship needed for protection or accuracy. That relationship means these applications pose a very different risk level to users. They said that Apple, Microsoft, and the teams behind Linux distros need to consider this risk level and address it in their products, regardless of what the vendors supplying security and monitoring tools might do, but that those security/monitoring vendors need to be doubly hardened against this kind of bad update, whether or not it was malicious.
What was the problem’s source? Of the 78, 39 (exactly half) said bad or omitted testing, 22 said a procedure failure, and 17 said human error (which, of course, you could argue should have been caught if proper procedures were in play). All 78 said they had experienced a fault due to a bad change to a program, and all but 7 said they believed these sorts of problems were preventable. However, most of these comments came from sources who believed that a bad piece of software had been loaded, rather than that a change in a definition file had triggered a preexisting logic problem.
The root cause, the 39 who blamed testing say, is the need to update software quickly: the notion of “rapid development” or “CI/CD” (continuous integration/continuous delivery or deployment). This often requires a number of simultaneous development threads impacting different components, and often there will be an “emergency” request for a change that may or may not cut across multiple threads, but will surely put pressure on the process. Enterprises say that CrowdStrike was responding to reports of an exploit of inter-process communications on Windows, clearly something that would add pressure to make the change quickly. Pressure to hurry, they say, is a prescription for mistakes.
Testing is supposed to catch mistakes, of course, but users admit that complex CI/CD pipelines can let something slip through. One CIO said it was critical that a changed software component enter the “stage to production” repository only via a systems-testing task, and that a staged version never be updated without returning it to the testing phase. However, about half admitted that it would be possible for a development team to modify a staged component, perhaps for what was perceived as a simple change, in order to meet an important release schedule. It’s even more possible that, when a piece of software’s functionality is driven by a definition file, a small change to that file might create a problem that testing wouldn’t catch, either because the testing was incomplete or because it wasn’t seen as necessary at all.
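For what the CIO described, the gate can be mechanical. Here’s a minimal sketch, with hypothetical file names and no particular CI product’s API, of a promotion step that refuses to stage any artifact whose bytes differ from the version that passed systems testing:

```python
# Sketch of a "tested bytes only" promotion gate. The manifest file and paths
# are hypothetical; a real pipeline would use its artifact repository's own
# metadata and access controls.
import hashlib
import json
from pathlib import Path

TESTED_MANIFEST = Path("tested_artifacts.json")   # written only by the test stage

def digest(artifact: Path) -> str:
    return hashlib.sha256(artifact.read_bytes()).hexdigest()

def record_tested(artifact: Path) -> None:
    """Called by the systems-testing task after the artifact passes."""
    manifest = json.loads(TESTED_MANIFEST.read_text()) if TESTED_MANIFEST.exists() else {}
    manifest[artifact.name] = digest(artifact)
    TESTED_MANIFEST.write_text(json.dumps(manifest, indent=2))

def promote(artifact: Path, staging_dir: Path) -> None:
    """Refuse to stage anything whose bytes differ from what was tested."""
    manifest = json.loads(TESTED_MANIFEST.read_text()) if TESTED_MANIFEST.exists() else {}
    tested = manifest.get(artifact.name)
    if tested is None or tested != digest(artifact):
        raise RuntimeError(
            f"{artifact.name} was modified after testing (or never tested); "
            "send it back through the systems-testing task"
        )
    staging_dir.mkdir(parents=True, exist_ok=True)
    (staging_dir / artifact.name).write_bytes(artifact.read_bytes())
```

A gate like this only matters, of course, if definition files are treated as artifacts subject to the same rule as executable components; otherwise the very change most likely to slip past testing bypasses the gate entirely.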
Five enterprises were already considering establishing a “sandbox” or test system set that they could use to validate changes or updates to software tools with these special operating-system relationships, but they say there are issues with some tools because the software provider pushes updates directly to user systems. Some point to the prior problem, where an antivirus update inadvertently flagged and deleted a critical system file, as an indication that even routine updates made directly to user systems can create a major problem.
None of the enterprises reported catching the problem update before applying it, or significantly limiting its scope of impact. Several did say that they’d figured out they could remedy the problem by repeatedly rebooting, but admitted that was a “what else could we do” step. The five considering a sandbox approach to update verification admitted they were inclined to believe it wouldn’t be effective for security, monitoring, or even operating-system updates, because it’s difficult to control a software supplier’s ability to force an update onto a user system.
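For reference, the sandbox idea the five described can be sketched simply: apply a vendor update to an isolated host first, confirm the host stays healthy, and only then release it to the fleet. Everything here, the host name, install command, and health endpoint, is a hypothetical placeholder, and as those enterprises pointed out, the gate is useless when the vendor pushes updates straight to production systems.

```python
# Minimal sandbox/canary sketch for update validation. All names are placeholders.
import subprocess
import time
import urllib.request

SANDBOX_HOST = "sandbox01.example.internal"            # hypothetical test host
HEALTH_URL = f"http://{SANDBOX_HOST}:8080/healthz"     # hypothetical health probe

def apply_update_to_sandbox(package: str) -> None:
    # Placeholder for however the update would actually be installed.
    subprocess.run(["ssh", SANDBOX_HOST, "sudo", "install-update", package], check=True)

def sandbox_healthy(timeout_s: int = 600, interval_s: int = 15) -> bool:
    """Poll the sandbox after the update; require it to respond within the window."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass   # host rebooting, crashed, or not yet responsive
        time.sleep(interval_s)
    return False

def validate_update(package: str) -> bool:
    """Gate a fleet-wide rollout on the sandbox surviving the update."""
    apply_update_to_sandbox(package)
    if sandbox_healthy():
        print(f"{package}: sandbox survived; release to the fleet")
        return True
    print(f"{package}: sandbox did not recover; block the rollout")
    return False
```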
Finally, all the users said there was a hacking dimension here, even if hacking wasn’t the direct cause. Supply-chain attacks are now, if perhaps indirectly, proven to be a successful means of sowing chaos. Could this be something nation-state hackers are preparing? Could this be how critical infrastructure could be targeted? Users think this failure needs to be a wake-up call for vendors, users, and governments.
My views are consistent with those expressed to me and cited above, but there’s one other factor I want to add from my own experience. I think enterprises have, in general, failed to consider the risk associated with modern software commitments. We say we worry about security and reliability, and we say we take measures to improve both, but I don’t believe either of these assertions is fully accurate. You can’t take them at face value unless you presume there’s more than belief behind them, that enterprises have actually thought through the factors that put them at risk. Let me offer an example from the hybrid and multi-cloud spaces. Every single enterprise that used either or both of these technologies told me that the move improved reliability. The fact is that the move could do that, but based on what enterprises told me, in somewhere between a third and half of the cases, enterprise implementation of hybrid/multi-cloud has not only failed to improve reliability, it has actually reduced it. Why? Because the implementation required both the cloud(s) and the data center to be available for the applications to be available, rather than making one a backup for the other. Anyone who’s done reliability calculations knows that the probability of failure of a system with two essential elements is greater than the probability of failure of either element alone. Five enterprises said they encountered just this factor in their applications and had to change their strategy as a result. I think the rest simply haven’t encountered it yet, and they likely will eventually.
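The arithmetic is simple enough to show in a few lines. The availability figures below are arbitrary examples, not measurements from any of the enterprises cited; the point is only the comparison between “both must be up” and “either will do.”

```python
# Two essential elements in series vs. one acting as a backup for the other.
# The availability numbers are illustrative examples only.
dc_avail = 0.999      # example data-center availability
cloud_avail = 0.995   # example cloud availability

both_required = dc_avail * cloud_avail                      # series: both must be up
either_suffices = 1 - (1 - dc_avail) * (1 - cloud_avail)    # true backup

print(f"Data center alone:        {dc_avail:.6f}")
print(f"Cloud alone:              {cloud_avail:.6f}")
print(f"Both required (series):   {both_required:.6f}")     # 0.994005, worse than either
print(f"Either suffices (backup): {either_suffices:.6f}")   # 0.999995, better than either
```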
Our use of IT is getting more and more complex, and it’s outstripping our ability to deal with it even when we work hard to do so. I don’t think we’re working hard enough, and that’s an even bigger problem.