The CrowdStrike Outage: What Actually Happened Inside 8.5 Million Machines

On 19 July 2024 at 04:09 UTC (Coordinated Universal Time), 8.5 million Windows machines crashed. Airlines grounded flights, hospitals diverted patients, banks locked customers out, and emergency dispatch systems went dark. CrowdStrike identified the cause and reverted the file within 78 minutes. But every affected machine was already stuck in a boot loop, crashing before it could reach the network to download the fix. Each one needed someone to physically sit at it, boot into Safe Mode, and delete a single file.
The fix took 78 minutes, but the recovery took 10 days.
Microsoft reported that less than 1% of all Windows machines were affected. That 1% included enough critical infrastructure to make global headlines for a week.
What everyone thinks happened
The media version was clean: CrowdStrike pushed a bad software update that crashed millions of computers. Some outlets went further and called it a cyberattack. Others blamed Microsoft for allowing third-party kernel drivers at all. On security forums, people decided it was a buffer overflow that could be weaponised for remote code execution.
All of those are wrong, and the distinction matters. What actually happened involves a five-month-old dormant bug, a configuration file that isn't even executable code, and a kernel architecture that every endpoint security product on the market uses. Understanding why it happened means understanding how Windows separates trusted code from everything else, and why your EDR (endpoint detection and response) vendor has more access to your machine than you do.
How Windows separates trusted code from everything else
Windows uses privilege rings to separate trusted code from untrusted code. Ring 0, at the centre, is kernel mode, where the operating system kernel and device drivers run with unrestricted access to hardware, memory, and every process on the machine.
Ring 3, at the outer edge, is user mode. Applications live here: browsers, email clients, Office. Ring 3 code can't directly touch hardware or other processes' memory. It requests access through the system call interface, and the kernel decides whether to grant it.
When a user-mode application crashes, Windows terminates that process while everything else keeps running. You lose the app and maybe some unsaved work, then restart it.
When kernel-mode code crashes, there is no safety net above it because the kernel is the safety net. If it tears, Windows can't guarantee data integrity, security boundaries, or predictable behaviour. The memory that was accessed could belong to another driver, to the filesystem, or to the kernel itself. Continuing to run risks silent data corruption or security boundary violations, so the only responsible action is to halt the entire system immediately.
That halt is a Blue Screen of Death (BSOD), and it is not a failure of error handling but the correct response to an unrecoverable situation. That distinction explains why 8.5 million machines didn't just crash and recover. They crashed, tried to reboot, and crashed again, because the faulty component loaded before anything that could intervene.
Why CrowdStrike runs at kernel level
CrowdStrike's Falcon sensor runs at ring 0 through its kernel driver, CSAgent.sys, which is classified as a boot-start driver using the ELAM (Early Launch Anti-Malware) architecture and loads before user-mode processes start.
That is by design because the threats Falcon is built to catch operate at that level. Bootkits modify the boot process before Windows fully loads. Rootkits hide inside the kernel to intercept system calls and conceal their presence from user-mode tools. If your security product only runs in user mode, it starts after these threats are already active. You can't detect kernel-level threats from user mode, which is why every EDR product on the market runs at kernel level.
SentinelOne, Microsoft Defender for Endpoint, Carbon Black, and Sophos all run kernel drivers in the same way. CrowdStrike uses Microsoft's documented APIs: Filter Manager for file system monitoring, Registry Filtering for registry changes, and Process Manager Callbacks for tracking what executes. CSAgent.sys has passed WHQL (Windows Hardware Quality Labs) certification, is compatible with HVCI (Hypervisor-Protected Code Integrity), and CrowdStrike is a Microsoft Virus Initiative member.
This is standard architecture for endpoint security and the only way to do the job properly. But it means that if CSAgent.sys crashes during boot, the machine can't get past it. The driver loads before the network stack, before user-mode services, and before anything that could download a fix.
What Channel File 291 actually is
The thing that triggered the crash was not a software update but a channel file.
Channel File 291 contains IPC (Inter-Process Communication) Template Instances: detection patterns that tell the Falcon sensor what to look for when monitoring named pipes. Named pipes are an inter-process communication mechanism on Windows, and specific C2 (command and control) frameworks create distinctive named pipe patterns. Those patterns are what these templates detect.
Despite the .sys extension, channel files are not executable code and can't allocate memory, access arbitrary memory locations, or create instructions. They are structured data, essentially a rulebook that tells the sensor what to look for, and the rulebook doesn't run anything itself.
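To make the data-not-code distinction concrete, a channel file can be pictured as a list of match rules that a separate engine interprets. This Python sketch is an analogy only: the real channel file format is proprietary and binary, and the rule fields and pipe patterns below are invented for illustration.

```python
import fnmatch

# Hypothetical rulebook: detection patterns as pure data, not executable code.
# The rules describe what to look for; they cannot run anything themselves.
RULES = [
    {"id": 1, "pipe_pattern": r"\\.\pipe\msagent_*"},  # invented C2-style pattern
    {"id": 2, "pipe_pattern": r"\\.\pipe\postex_*"},   # invented C2-style pattern
]

def match_pipe(pipe_name: str) -> list[int]:
    """Return the ids of rules whose pattern matches the named pipe."""
    return [r["id"] for r in RULES if fnmatch.fnmatch(pipe_name, r["pipe_pattern"])]
```

The engine that reads this data is where any bug lives; the data itself only describes patterns, which is why reverting the file rather than patching the sensor was the immediate fix.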
CrowdStrike protects the delivery of these files with TLS certificate pinning on the download channel, SHA256 checksums to verify integrity, and admin ACLs (Access Control Lists) to prevent local modification. The channel file was not tampered with and was entirely legitimate, but it exposed a bug that had been sitting dormant for five months.
The IPC Template Type in Channel File 291 defined 21 input parameter fields, but the sensor's Content Interpreter was coded to supply only 20 values. That single field-count mismatch is the entire incident.
The five-month dormant bug
That field-count mismatch had existed since February 2024, five months before the outage. It didn't crash anything because every template instance deployed in that period used wildcard matching on field 21. A wildcard doesn't need to read the value because it matches anything, so the Content Interpreter never tried to access the 21st element and the bug sat dormant.
On 19 July at 04:09 UTC, CrowdStrike deployed two new template instances that used a specific, non-wildcard match on field 21. For the first time, the Content Interpreter tried to read value number 21 from an array that only held 20 elements.
The result was an out-of-bounds memory read, where the code reached past the end of its own data. In user mode, that kills the process and nothing else, but in kernel mode at ring 0 it triggers an immediate Blue Screen with no warning, no error dialogue, and no graceful shutdown.
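The dormancy mechanism can be shown with a user-mode analogy, using invented names: the interpreter holds 20 values, a template defines 21 fields, and only a non-wildcard match on field 21 forces a read past the end of the array. In Python that read raises an IndexError; at ring 0 it is an unrecoverable memory access violation.

```python
WILDCARD = "*"

def evaluate(template_fields: list[str], input_values: list[str]) -> bool:
    """Compare each template field against its input value.

    A wildcard field matches without reading the value, which is why a
    20-element input array survived 21-field templates for five months.
    """
    for i, field in enumerate(template_fields):
        if field == WILDCARD:
            continue                    # value never read: bug stays dormant
        if input_values[i] != field:    # field 21 -> index 20 -> IndexError
            return False
    return True

values = [f"v{i}" for i in range(20)]            # interpreter supplies 20 values
template_feb = ["*"] * 21                        # wildcard on field 21: harmless
template_jul = ["*"] * 20 + ["specific-match"]   # reads index 20: out of bounds
```

Evaluating `template_feb` returns normally; evaluating `template_jul` raises an IndexError on the 21st field, the analogue of the kernel-mode crash.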
And because this is a boot-start driver, the machine reboots, CSAgent.sys loads again before user mode, reads the same cached channel file, hits the same out-of-bounds read, and crashes again. That is the boot loop, and that is why 78 minutes of exposure turned into 10 days of recovery.
Why the server-side fix didn't help
CrowdStrike reverted Channel File 291 on their servers at 05:27 UTC, 78 minutes after the faulty deployment. Any machine that was powered off or hadn't pulled the update in that window was fine.
But for machines already in a boot loop, the server-side revert made no difference. CSAgent.sys crashes before the machine gets far enough in the boot process to establish a network connection. It can't download the fix because it can't reach the network. The local cache still holds the faulty channel file, and the sensor reads from the cache every boot.
The manual fix was straightforward in theory: boot into Safe Mode, navigate to C:\Windows\System32\drivers\CrowdStrike\, delete the files matching C-00000291*.sys, and reboot normally.
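The deletion step can be sketched as a script, assuming the default install path quoted above. In practice it had to be performed by hand in Safe Mode on each machine, and the exact path can vary per deployment.

```python
from pathlib import Path

# Default Falcon driver directory from CrowdStrike's remediation guidance.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def remove_channel_file_291(driver_dir: Path = DRIVER_DIR) -> list[str]:
    """Delete every cached copy of Channel File 291 and report what was removed."""
    removed = []
    for f in driver_dir.glob("C-00000291*.sys"):
        f.unlink()
        removed.append(f.name)
    return removed
```

The glob matters: there can be several versioned copies of the file, and all of them must go, while every other channel file must be left intact.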
In practice, BitLocker created a circular dependency. If the drive is encrypted (and it should be for any business taking security seriously), you can't access Safe Mode without the BitLocker recovery key. That key is typically stored in Active Directory or Azure AD, and those systems were running on machines that were also stuck in the same boot loop.
The system you need to access the recovery keys was itself down because of the same bug, and nobody planned for that because nobody expected every machine to fail at once.
All 8.5 million affected machines needed physical access or pre-staged recovery keys. Microsoft deployed hundreds of engineers directly to affected customers and released a USB recovery tool for offline repair. Only Windows 10 and later running Falcon sensor v7.11 or above were affected, while Linux and macOS were entirely unaffected. But for the machines that were affected, the only path was manual intervention. It took 10 days to get 99% of them back online.
Myth vs fact
Myth: CrowdStrike pushed a bad software update.
Channel File 291 is configuration data rather than a software update, and the Falcon sensor binary didn't change at all. CrowdStrike deployed new detection patterns, new rules for the sensor to follow when monitoring named pipes. Those rules exposed a pre-existing bug in the Content Interpreter that had been dormant since February 2024. The distinction between a code change and a configuration change is not pedantic. A software update goes through a different testing pipeline than a configuration change. The fact that this was "just" configuration data is actually part of the problem. It suggests the validation for configuration changes wasn't as rigorous as the validation for code changes. That is a process gap, and it is the kind of gap that shows up in pen test findings regularly.
Myth: It was a buffer overflow that hackers could exploit.
It was an out-of-bounds read, not a write, meaning the code tried to read from an address beyond the array boundary rather than writing data past a buffer. CrowdStrike and independent third-party security vendors analysed the bug and confirmed that the out-of-bounds read cannot be leveraged to escalate privileges or achieve remote code execution. You can't turn this particular type of read violation into arbitrary code execution; it crashes the system rather than handing an attacker control.
Myth: It was a cyberattack.
There was no adversary and no malware involved. The Content Configuration System didn't verify that template instances matched the field count defined by the template type, which makes this a testing gap rather than an attack. But threat actors did exploit the chaos. CISA issued an advisory warning about phishing campaigns targeting affected organisations during the outage window. Malicious ZIP (compressed archive) files circulated pretending to be CrowdStrike fixes, and fraudulent domains impersonating CrowdStrike appeared within hours. A major outage creates confusion, urgency, and people looking for quick fixes. Those are exactly the conditions that make phishing effective.
Myth: Microsoft was responsible.
The crash occurred inside CrowdStrike's kernel driver rather than in Windows code, and Windows didn't push the channel file, write the Content Interpreter code, or deploy the template instances. Microsoft's architecture allows third-party kernel drivers because that is how every EDR product works, how hardware drivers work, and how the entire ecosystem of security tools operates. Microsoft's actual response was to deploy hundreds of engineers and release a USB recovery tool.
What would have stopped this
Three things would have caught this before it reached production environments.
Field-count validation in the Content Configuration System. The template type defined 21 fields. The Content Interpreter supplied 20. A compile-time or deployment-time check that field counts match would have caught the mismatch when it was introduced in February 2024, five months before it caused damage. This is basic input validation, the kind of thing that appears in OWASP guidelines for web applications. The same principle applies to kernel-level configuration systems.
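The missing check is simple to state: at build or deployment time, a template type must not declare more fields than the interpreter supplies values for. A minimal sketch, with invented names and the constant 20 taken from the incident details above:

```python
INTERPRETER_VALUE_COUNT = 20   # values the sensor's Content Interpreter supplies

def validate_template_type(name: str, declared_fields: int,
                           supplied_values: int = INTERPRETER_VALUE_COUNT) -> None:
    """Fail deployment if a template type declares more fields than the
    interpreter supplies values for -- the check that was missing."""
    if declared_fields > supplied_values:
        raise ValueError(
            f"{name}: declares {declared_fields} fields but the interpreter "
            f"supplies only {supplied_values} values"
        )
```

Run against the February 2024 template type, this check fails loudly at deployment time instead of five months later at ring 0.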
Staged rollouts for configuration changes. If the channel file had been deployed to 1% of sensors first, the boot loops would have appeared on a small number of machines. CrowdStrike could have reverted before the file reached the full fleet. Before this incident, channel file updates went to the entire fleet simultaneously. That approach assumed configuration changes couldn't cause crashes, and this incident proved otherwise.
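Staged rollout reduces blast radius mechanically: ship to a deterministic slice of the fleet, watch crash telemetry, then widen. A simplified sketch; the hashing scheme, sensor ids, and percentages are invented for illustration.

```python
import hashlib

def in_canary(sensor_id: str, percent: float) -> bool:
    """Deterministically place a sensor in the first `percent` of the fleet
    by hashing its id, so the same sensors canary every rollout stage."""
    h = int(hashlib.sha256(sensor_id.encode()).hexdigest(), 16)
    return (h % 10_000) < percent * 100

def rollout_targets(sensor_ids: list[str], percent: float) -> list[str]:
    """Sensors that should receive the update at this rollout stage."""
    return [s for s in sensor_ids if in_canary(s, percent)]
```

Because the hash is stable, widening from 1% to 10% to 100% is strictly cumulative: a fault that boot-loops the 1% canary slice never reaches the rest of the fleet.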
Independent recovery infrastructure. BitLocker recovery keys stored on systems that depend on the same software that crashed created a circular dependency. Recovery infrastructure needs to be independent of the infrastructure it recovers. If your BitLocker keys are in Active Directory and your domain controllers are affected by the same outage, you can't unlock the drives you need to fix. That's a disaster recovery gap that most organisations hadn't tested because simultaneous failure of the entire estate seemed implausible.
What changed after
CrowdStrike published a full root cause analysis on 6 August 2024. They committed to improving content validation to catch field-count mismatches before deployment and to staged rollouts for all channel file updates. A faulty configuration now affects a small percentage of sensors before reaching the full fleet. That is a significant process change that addresses the core validation gap.
The incident accelerated an industry-wide discussion about whether security tools need full kernel access for every function. Microsoft opened discussions about providing more user-mode capabilities to security vendors, reducing the blast radius when things go wrong. That is a long-term architectural shift, not a quick fix, but the conversation is happening because of this outage.
For businesses, the practical question is straightforward: where are your BitLocker recovery keys stored, and can you access them if Active Directory is down? Do you know how your EDR vendor validates configuration changes before pushing them to your estate? Do you have a documented process for what happens if a vendor update crashes everything at once? CrowdStrike's recovery took 10 days, and it is worth asking how long yours would take.