We've reflected on the support we've provided to our clients in the wake of the CrowdStrike/Microsoft outage on July 19, 2024. Below is ImagineX's point of view, along with suggestions for learning from this major incident and becoming more resilient.
What Happened
You may have been affected by recent cloud service outages linked to Microsoft and CrowdStrike. First off, this was not a cyberattack, but rather an issue with a software update. The incident had wide-ranging impacts on organizations worldwide and stemmed from two concurrent issues:
Microsoft made a configuration change in its Azure cloud platform that interrupted communications between a subset of its compute and storage resources, disrupting the Microsoft services that underpin Microsoft 365 offerings such as Exchange Online, Teams, SharePoint, and Defender. Microsoft has confirmed that it understands the root cause of the issue (see Incident 1K80-N_8 at https://azure.status.microsoft/en-us/status/history/) and has resolved it, allowing its systems to return to expected availability levels.
CrowdStrike introduced an update to its Falcon sensor that caused blue screen errors on Windows 10 and Windows 11 systems. These crashes put impacted hosts into a boot loop, taking them offline and disrupting the services that rely upon them. CrowdStrike has also determined the root cause of the Falcon sensor issue and has published remediation guidance (https://www.crowdstrike.com/blog/statement-on-falcon-content-update-for-windows-hosts/).
How Can You Learn From This Incident?
This event underscored the interconnected nature of today's digital infrastructure and the potential for a disruption in one component to cascade into others. As our clients emerge from their response to this major incident, we suggest several pragmatic considerations to manage future risk and remain operationally resilient when the next event occurs:
Evaluate automatic updates across your toolsets. In the rush to stay abreast of attacks, critical infrastructure is increasingly instrumented with security software that updates automatically. There is always a trade-off between speed of deployment and the testing of those updates. Organizations must ensure they understand which systems auto-update and what controls are in place to manage the underlying risk.
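As a starting point, even a simple pass over an existing asset inventory can show where auto-updating agents bypass any staging. The sketch below is a minimal Python illustration, assuming a hypothetical inventory export (a CSV with host, agent, auto_update, and update_ring columns); the field names and ring labels are assumptions to adapt to whatever your CMDB or endpoint management tooling actually provides.

```python
# Minimal sketch: flag hosts whose security agents auto-update with no staging
# ring. Assumes a hypothetical asset-inventory export (CSV with columns:
# host, agent, auto_update, update_ring); adjust field names to your tooling.
import csv
from collections import Counter

def audit_auto_updates(inventory_path: str) -> None:
    unmanaged = []           # host/agent pairs that take vendor updates immediately
    ring_counts = Counter()  # how the rest of the fleet is staged

    with open(inventory_path, newline="") as f:
        for row in csv.DictReader(f):
            auto = row["auto_update"].strip().lower() == "true"
            ring = row["update_ring"].strip() or "none"
            if auto and ring in ("none", "production"):
                unmanaged.append((row["host"], row["agent"]))
            else:
                ring_counts[ring] += 1

    print(f"{len(unmanaged)} host/agent pairs auto-update with no staging ring:")
    for host, agent in unmanaged:
        print(f"  {host}: {agent}")
    print("Staged fleet by ring:", dict(ring_counts))

if __name__ == "__main__":
    audit_auto_updates("asset_inventory.csv")  # hypothetical export path
```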
Release updates in a controlled manner with appropriate test environments. Organizations should evaluate a controlled release and rollout approach for patches and updates to their environment. While critical infrastructure may be the highest-value target for attackers, it also represents the greatest potential impact to business operations, as this incident demonstrated. Roll updates out first to a small subset of endpoints that represents the organization's various configurations and software, evaluate potential impacts, then gradually expand to a wider audience.
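To make the staged approach concrete, the sketch below shows one way to plan rollout rings: a canary wave that covers every distinct configuration in the fleet, followed by fixed-percentage waves. The fleet data, wave fractions, and function names are illustrative assumptions, not a vendor deployment API.

```python
# Minimal sketch of a ring-based rollout plan: pick a small canary set that
# covers every distinct configuration in the fleet, then expand in
# fixed-percentage waves. Fleet data and ring sizes are illustrative only.
import random
from collections import defaultdict

def plan_rollout(fleet, wave_fractions=(0.01, 0.10, 0.50, 1.0)):
    """fleet: list of dicts with 'host' and 'config' keys.
    Returns a list of waves (lists of host names), earliest first."""
    by_config = defaultdict(list)
    for endpoint in fleet:
        by_config[endpoint["config"]].append(endpoint["host"])

    # Wave 0: at least one host per distinct configuration (the canary ring).
    canaries = {random.choice(hosts) for hosts in by_config.values()}
    remaining = [e["host"] for e in fleet if e["host"] not in canaries]
    random.shuffle(remaining)

    waves = [sorted(canaries)]
    done, total = len(canaries), len(fleet)
    for frac in wave_fractions:
        take = max(0, int(total * frac) - done)
        waves.append(remaining[:take])
        remaining = remaining[take:]
        done += take
    if remaining:
        waves.append(remaining)
    return [w for w in waves if w]

# Example: a toy fleet with three configurations.
fleet = [{"host": f"host{i:03}",
          "config": random.choice(["win10-a", "win11-b", "win11-c"])}
         for i in range(200)]
for n, wave in enumerate(plan_rollout(fleet)):
    print(f"wave {n}: {len(wave)} hosts")
```

Pausing between waves to check endpoint health before expanding is the point of the exercise; the plan above only decides who gets the update when.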
Prepare and exercise incident response processes. This incident serves as a reminder of the inevitability of large-scale incidents and the importance of being prepared to respond to them rapidly and effectively. Firms must engage in incident response planning and tabletop exercises to build the “muscle memory” needed to respond efficiently.
Understand downstream business impacts of your systems. While the root causes of this incident at Microsoft and CrowdStrike are understood and have been addressed, the impact on downstream systems was largely unexpected and poorly understood. Companies must ensure they understand the failure modes of their systems and processes to prevent future disruptions.
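One lightweight way to build that understanding is to maintain a dependency map and query it for downstream impact. The sketch below assumes a hypothetical "service depends on component" mapping and walks it to list everything transitively affected when a single component fails; the graph and names are illustrative only.

```python
# Minimal sketch of downstream-impact mapping: given a dependency map
# ("service X depends on component Y"), list every service transitively
# affected when one component fails. The example graph is an assumption,
# not a real architecture.
from collections import defaultdict, deque

def downstream_impacts(depends_on: dict, failed: str) -> set:
    # Invert "service -> components it depends on" into "component -> dependents".
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    impacted, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for svc in dependents[node]:
            if svc not in impacted:
                impacted.add(svc)
                queue.append(svc)
    return impacted

# Illustrative dependency map.
depends_on = {
    "email": ["identity-provider", "endpoint-agent"],
    "payments": ["email", "erp"],
    "erp": ["endpoint-agent"],
    "help-desk": ["identity-provider"],
}
print(downstream_impacts(depends_on, "endpoint-agent"))
# -> {'email', 'erp', 'payments'}
```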
Review Business Continuity Plans (BCPs) and Disaster Recovery (DR) plans. Unfortunately, BCPs and DR plans are often evaluated and tested only in controlled, managed environments. Organizations should use this incident to review their plans and ensure they account for widespread outages originating with third parties. Utilize outside partners to run simulations and tabletops and to independently evaluate your organization’s readiness.
Evaluate and mature your DevOps practices. This incident illustrates the need for effective, tightly controlled systems development lifecycle processes governing software development, testing, and deployment. While this time externally developed software took down critical systems, internally developed systems could just as easily do the same in the absence of such measures.
Stay alert to social engineering attacks during major incidents. Bad actors see incidents of this sort as opportunities for opportunistic attacks, such as phishing that impersonates vendor support or remediation tooling. Organizations must be vigilant at these times, ensuring that they’re working with trusted vendors and partners.
Fully evaluate third-party risks. You are only as resilient as your weakest link, which may well be a third-party service or product. The risks these third parties introduce should be understood and paired with compensating controls so the organization can manage and mitigate them.
Reflecting on the lessons learned in your organization and implementing some of the suggestions above will help minimize the impact of the inevitable next incident and make you much more operationally adaptable.
Please contact your ImagineX team or reach out to info@imaginexdigital.com if there is anything we can do to support you or if you would like to discuss any of these ideas further.