CrowdStrike-related Widespread System Outages
By CJ Dietzman, Alliant Cyber
On Friday, July 19, 2024, the business world and broader global community woke up to widespread technology outages across corporate and enterprise environments, along with significant disruption to business processes and services.
Initial reports indicated that an overnight update to CrowdStrike Security Software for Windows-based systems was at the root of the issue.
CrowdStrike is a security software and services firm with a significant footprint in organizations and enterprises across the globe. Its signature CrowdStrike Falcon product provides foundational security protection, detection and response capabilities focused on servers, desktop computers and laptops.
Specifically, the CrowdStrike Falcon solution includes next-generation antivirus (NGAV), endpoint detection and response (EDR), cyber threat intelligence and threat hunting capabilities.
To achieve their primary purpose of detecting, preventing and responding to cyber attacks and threats, solutions such as CrowdStrike Falcon typically require frequent software and configuration updates.
Managing software updates and configuration changes for security solutions like CrowdStrike Falcon has been a part of organizational computing environments for decades. These updates are commonplace for the global population of businesses and entities that rely on Microsoft Windows systems, platforms and architectures.
Understanding the Systemic Outages
What compounded this issue was the involvement of two significant software and solution providers: CrowdStrike and Microsoft. CrowdStrike is a common security solution that runs on a massive population of Windows servers, desktops and laptops. Software giant Microsoft’s solutions and software are nearly omnipresent across enterprises and institutions.
Early reports on this issue alluded to outages originating from Microsoft. Based on initial analysis of available information, all indications are that the root cause of the outage on Microsoft-based systems was a software update from CrowdStrike, specifically for CrowdStrike’s Falcon solution.
Key Observations From the Global Tech Outage
Organizations across industry sectors should take some early observations into account to learn from this incident and help prevent recurrence.
One key observation is that the CrowdStrike Falcon software update included a change to one or more “SYS” files within Windows servers, desktops and laptops. Specifically, CrowdStrike’s current technical documentation and suggested workarounds for this incident are focused on a specific SYS file, located in a critical file directory that Windows systems rely on for core system startup, configuration and interface functions.
In general, these SYS files are among various components that make up or impact what is known as the “kernel,” or the basic operating functions of a Windows operating system.
These critical Microsoft Windows components are key to maintaining the stable and reliable performance of these systems, and changes to these core components of the Windows operating system can be intricate and complicated. If not managed, tested and validated prior to deployment, they can have various downstream impacts.
Any change to Windows components, configuration or files demands thoughtful care and consideration of the risks involved.
In this instance, the IT outages many businesses are currently facing are likely linked to a system and/or configuration change from CrowdStrike. While software vendors and solution providers such as CrowdStrike typically have quality assurance, validation and testing standards and processes, no such program can be expected to deliver perfect results at all times.
Additionally, organizations that rely on solutions and software from vendors like CrowdStrike may have unique attributes in their environments, and these always carry some level of risk of conflict or error when software updates are applied.
For example, before accepting and deploying software updates from widespread core security software providers such as CrowdStrike, organizations should evaluate the following key risk considerations:
- The nature of the update, including whether or not it “touches” the Windows kernel or other critical system components, beyond the specific software solution that it is intended to update.
- The breadth and scope of the update, including how pervasive the potential impact of an error, corruption or outage could be, if the update caused an issue.
- If the update is deemed to be higher-risk, then the organization should ensure that it has a reasonable recovery or roll-back solution, if the update causes a major issue.
Based on the risk of the software update, the organization should also consider taking a phased or graduated approach in testing, piloting and rolling out the update.
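As an illustration only, the risk considerations above could be captured in a small triage helper. The sketch below is a minimal example under stated assumptions: the `VendorUpdate` fields, the scoring weights, the 50% breadth threshold and the recommendation strings are all hypothetical and are not drawn from CrowdStrike, Microsoft or any specific standard. Each organization would need to set its own criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VendorUpdate:
    """Hypothetical description of a pending vendor software update."""
    name: str
    touches_kernel: bool       # changes the kernel or other critical system components
    fleet_fraction: float      # share of systems that would receive it, 0.0 to 1.0
    rollback_available: bool   # a recovery or roll-back path exists


def classify_risk(update: VendorUpdate) -> str:
    """Return 'high', 'medium' or 'low' using illustrative weights (assumptions)."""
    if not 0.0 <= update.fleet_fraction <= 1.0:
        raise ValueError("fleet_fraction must be between 0.0 and 1.0")
    score = 0
    if update.touches_kernel:
        score += 2  # nature of the update
    if update.fleet_fraction >= 0.5:
        score += 1  # breadth and scope of the update
    if not update.rollback_available:
        score += 1  # no recovery or roll-back solution
    if score >= 3:
        return "high"
    if score >= 1:
        return "medium"
    return "low"


def recommended_rollout(update: VendorUpdate) -> str:
    """Map the risk level to an illustrative rollout approach."""
    risk = classify_risk(update)
    if risk == "high":
        if not update.rollback_available:
            return "hold until a roll-back path exists"
        return "test, pilot, then phased rollout"
    if risk == "medium":
        return "pilot, then phased rollout"
    return "standard rollout"
```

Under these assumed weights, an update that touches kernel-level components, reaches most of the fleet and has no roll-back path would be held until a recovery option exists, while a narrow, easily reversible update would follow the standard process.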
While allowing any software solution to download and apply “automatic updates” across a large population of systems may seem like a prudent move for security tools and solutions, this approach can have undesirable consequences.
This appears to be what happened in this incident. By the time anyone realized that an automated update was causing issues with Windows systems, it was too late for some organizations to act. The update had already been applied to a large population of computers, and this began causing widespread business process disruption and outages.
When considering security software vendor updates, patches or configuration changes, there is always a balance between an organization’s desire to roll out these security enhancements in an expedited manner, versus taking the time to thoughtfully evaluate and consider the risk that these changes and updates may pose to the stability of the organization’s systems.
Organizations can achieve both of these objectives, although there may be some compromise and process adjustment necessary.
How Businesses Can Act Now to Protect Themselves Against Future Outages
CrowdStrike is not the first software vendor to distribute a software update that caused a widespread outage, and it is possible that updates from vendors may cause these effects again in the future.
Organizations should seize this moment as an opportunity to take a fresh look at how they manage, govern and validate vendor software updates, patches and configuration changes, including critical security software, with a focus on the following:
- Organizational Vendor Software Update Standards: Review and enhance requirements, standards and processes for vendor software updates.
- Risk Analysis: Conduct accelerated risk analysis related to vendor software updates prior to widespread deployment.
- Testing and Validation: Implement a risk-based approach to reasonable testing and validation of vendor software updates prior to widespread rollout and implementation.
- Phased Implementation: Take a phased or “graduated” rollout approach when implementing vendor software changes, including initial deployment in “test” and smaller, lower-risk groups of systems.
- Roll-back Procedures: Require reasonable roll-back and recovery procedures for vendor software updates, aligned with risk.
- Vendor Governance: Vet, evaluate and monitor the technology and security controls and processes of key software vendors, including a focus on how these vendors develop, test and deploy software updates.
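To make the phased implementation and roll-back items more concrete, the following is a minimal sketch of a graduated rollout loop. It assumes the organization can supply its own `apply_update`, `is_healthy` and `roll_back` functions; those names, the ring structure and the `max_failure_rate` parameter are hypothetical and do not correspond to any vendor's actual tooling. In practice, many organizations would rely on the staging controls of their endpoint management or security platform rather than custom code.

```python
from typing import Callable, Dict, List, Sequence


def phased_rollout(
    rings: Sequence[Sequence[str]],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    roll_back: Callable[[str], None],
    max_failure_rate: float = 0.0,
) -> Dict[str, List[str]]:
    """Deploy ring by ring; on too many failures, roll back that ring and halt.

    `rings` is ordered from lowest risk (test systems) to highest risk.
    All callables are supplied by the caller and are hypothetical hooks.
    """
    updated: List[str] = []      # hosts that keep the update
    rolled_back: List[str] = []  # hosts in the ring that failed validation
    skipped: List[str] = []      # hosts in later rings that were never touched
    halted = False

    for ring in rings:
        if halted:
            skipped.extend(ring)
            continue

        for host in ring:
            apply_update(host)

        # Validate the whole ring before moving on to the next one.
        failures = [host for host in ring if not is_healthy(host)]
        if ring and len(failures) / len(ring) > max_failure_rate:
            for host in ring:
                roll_back(host)
            rolled_back.extend(ring)
            halted = True
        else:
            updated.extend(ring)

    return {"updated": updated, "rolled_back": rolled_back, "skipped": skipped}
```

With rings ordered as test, pilot and production, a failed health check in the pilot ring would roll that ring back and leave production systems untouched. The design choice here is to stop at the first unhealthy ring rather than continue, trading some deployment speed for a smaller potential impact.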
By addressing these critical considerations, organizations can learn from this global incident and emerge more cyber-resilient. For more information, contact the Alliant Cyber team.
Alliant note and disclaimer: This document is designed to provide general information and guidance. Please note that prior to implementation your legal counsel should review all details or policy information. Alliant Insurance Services does not provide legal advice or legal opinions. If a legal opinion is needed, please seek the services of your own legal advisor or ask Alliant Insurance Services for a referral. This document is provided on an “as is” basis without any warranty of any kind. Alliant Insurance Services disclaims any liability for any loss or damage from reliance on this document.