Don’t Look Back In Anger: How Cloudflare’s Outage Highlights the Need for Safer Rotations

Adam Ochayon

Solution Architect

Published on

March 27, 2025

Service availability is the lifeblood of today’s hybrid enterprises. Yet Cloudflare’s March 21, 2025 outage proved that even top-tier providers can stumble on a seemingly simple task: rotating credentials.

This global misstep, which took down major Cloudflare services, is a sharp reminder of what identity and security teams already know too well: key rotation can be trickier than it looks. When it goes wrong, the fallout can be expensive, disruptive, and damaging to trust.

In this post, we’ll break down the Cloudflare outage, revisit other high-profile incidents caused by secret mismanagement, and share best practices for safe, disruption-free credential rotations.

The Cloudflare Rotation Outage: A Deeper Look

On March 21, 2025, Cloudflare’s R2 Object Storage encountered an elevated error rate for 1 hour and 7 minutes, causing total write failures and partial read failures globally. The root cause stemmed from credential rotation errors in the R2 Gateway, the component responsible for authenticating Cloudflare’s gateway worker to the storage backend.

What Happened Technically

  1. Creation of new key pair: Cloudflare regularly rotates credentials as a security best practice, likely influenced by previous incidents caused by unrotated or compromised keys. The process was kicked off by generating an updated key pair.
  2. Deployment to the wrong environment: Due to an omitted “--env production” parameter in the wrangler CLI, the new credential was pushed to the default (dev) environment. Production was never updated.
  3. Premature deletion of old keys: With the assumption that production had already migrated to the new keys, the old credentials were deleted at the storage backend.
  4. Credential mismatch: Production was still using the old, now-invalid credentials, causing every R2 write operation and roughly 35% of read operations to fail.
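
The four steps above can be replayed in a small, self-contained model. The Python sketch below is illustrative only: `deploy_credential`, `can_delete_old_key`, and the in-memory `environments` dictionary are hypothetical names, and nothing here calls the real wrangler CLI or Cloudflare's storage backend. Under those assumptions, it shows how a deploy that silently falls back to a default environment leaves production on the old key, and how a pre-deletion check that reads back the production key ID would catch the mismatch.

```python
# Toy model of the rotation sequence. All names are hypothetical.
DEFAULT_ENV = "dev"  # assumption: the tool targets dev when no env is given


def deploy_credential(environments, key_id, env=None):
    """Record key_id as the active credential for env (default: dev)."""
    target = env if env is not None else DEFAULT_ENV
    environments[target] = key_id
    return target


def can_delete_old_key(environments, new_key_id, env="production"):
    """Allow deletion only if the given environment already uses the new key."""
    return environments.get(env) == new_key_id


# Both environments start on the old key.
environments = {"dev": "key-old", "production": "key-old"}

# Step 2 of the incident: the environment argument is omitted.
landed_in = deploy_credential(environments, "key-new")

# Step 3 would delete the old key at this point. The check refuses.
safe_before_fix = can_delete_old_key(environments, "key-new")

# A corrected deploy targets production explicitly; deletion is now safe.
deploy_credential(environments, "key-new", env="production")
safe_after_fix = can_delete_old_key(environments, "key-new")

print(landed_in, safe_before_fix, safe_after_fix)  # dev False True
```

The point of the check is that it inspects the state of the intended environment rather than trusting that the deploy command did what the operator expected.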

Why This Matters

Cloudflare’s team didn’t have real-time visibility into which credentials were actually in use. Even though they had a rotation process in place, it was missing a critical step: verification. As a result, the old keys were deleted without confirming whether they were still in use, which introduced serious risk and ultimately triggered the outage.

The incident report highlighted a manual process, which made it even easier for the lack of verification to turn into a full-blown outage. Without robust automation and guardrails, mistakes in DevOps pipelines could quickly escalate into major incidents.

The Production R2 Gateway also depended on multiple underlying services and credentials, but without a clear map of these dependencies, misconfigurations could go undetected until it was too late.
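
One way to make such dependencies explicit is a lookup from each credential to the services that consume it. The sketch below is a minimal illustration with invented names: `r2-gateway`, `storage-backend-key`, and the other entries are placeholders, not Cloudflare's actual topology. It inverts a service-to-credential inventory so that the affected services can be listed before a credential is touched.

```python
from collections import defaultdict

# Hypothetical inventory: which credentials each service reads at runtime.
SERVICE_CREDENTIALS = {
    "r2-gateway": ["storage-backend-key", "metrics-token"],
    "billing-worker": ["metrics-token"],
    "cache-purger": ["storage-backend-key"],
}


def consumers_by_credential(service_credentials):
    """Invert service -> credentials into credential -> sorted services."""
    index = defaultdict(set)
    for service, credentials in service_credentials.items():
        for credential in credentials:
            index[credential].add(service)
    return {cred: sorted(services) for cred, services in index.items()}


index = consumers_by_credential(SERVICE_CREDENTIALS)
print(index["storage-backend-key"])  # ['cache-purger', 'r2-gateway']
```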

Past Rotation Incidents

Mismanaged rotation isn’t unique to Cloudflare. Organizations across industries have grappled with breaches caused by exploited, exposed, or unrotated secrets, and with outages caused by unmonitored credentials expiring without warning.

Some recent examples include:

  • Dropbox Service Account Takeover (2024): A mismanaged AD service account with unrotated credentials granted an attacker entry into a production environment. When discovered, Dropbox had to forcibly rotate all associated tokens - causing a scramble and partial disruption for end-users.
  • Microsoft Exchange Key Incident (2023): A stolen signing key gave attackers the ability to forge tokens, partly because Microsoft’s rotation procedure had been paused after a previous outage. Engineers feared reintroducing downtime, which delayed the rotation further, ultimately leading to a large-scale compromise.
  • Microsoft Limiting Secret Expiration (2021): In April 2021, Microsoft eliminated the option to set secrets to “Never Expire”, enforcing a maximum expiration period of 2 years. Speculation ties this change to an outage related to unrotated credentials. Due to this change, many organizations are now required to create their own homegrown processes for continuous renewal of credentials to avoid outages.

Cloudflare’s team faced a fundamental challenge: operational complexity became the biggest barrier to security. Without a structured, automated rotation process and real-time visibility into which credentials were actively used in production, they introduced unnecessary risk - not just to security but to business continuity.

Key rotation and other NHI lifecycle tasks are vital not only for mitigating risk, maintaining compliance, and bolstering security posture, but also for ensuring overall system resilience.

Best Practices in Key Rotation & Account Lifecycle Management

  1. Maintain a Complete Inventory of Non-Human Identities (NHIs): Secrets are typically associated with non-human identities such as service accounts, service principals, and API tokens. Map and classify every identity and secret across clouds, vaults, CI/CD pipelines, and third-party integrations.
  2. Automate Verification of Deployed Credentials: Before deleting old credentials, confirm that the updated ones are indeed being used, and more importantly that the old ones are not. Automated checks and consumption context can catch environment mismatches like the one Cloudflare encountered.
  3. Adopt Rolling Rotations with Minimal Downtime: Rotate in phases: spin up new credentials, validate usage, then decommission the old ones. Phased rollouts ensure that if new secrets fail, production is still safe and revertible.
  4. Continuous Posture Assessment: Regularly scan for: (1) Stale secrets that remain unrotated. (2) Excessive permission sets on credentials. (3) Abandoned or unowned service accounts.
  5. Ownership and Contextual Logging: Tag each key or secret with its associated owner, system, or microservice. Cloudflare specifically highlighted the importance of logging credential ID usage in production to quickly detect and resolve mismatches.
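
Practices 2, 3, and 5 can be combined into a single gate that runs before the old key is decommissioned. The following Python sketch assumes that each request logs the credential ID it used as a `(timestamp, environment, key_id)` record. That log format, the `retirement_decision` function, and the 30-minute quiet period are all assumptions made for illustration, not a description of any particular product or of Cloudflare's tooling.

```python
from datetime import datetime, timedelta, timezone


def retirement_decision(usage_log, old_key_id, new_key_id, now, quiet_period):
    """Decide whether the old key can be retired.

    usage_log is a list of (timestamp, environment, key_id) tuples.
    Only production records inside the quiet period are considered.
    """
    window_start = now - quiet_period
    recent_prod = [
        key_id
        for ts, env, key_id in usage_log
        if env == "production" and ts >= window_start
    ]
    if old_key_id in recent_prod:
        return "hold: old key still in use in production"
    if new_key_id not in recent_prod:
        return "hold: new key not yet observed in production"
    return "retire"


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
quiet = timedelta(minutes=30)

# New key only reached dev; production is still on the old key.
log_mismatch = [
    (now - timedelta(minutes=5), "dev", "key-new"),
    (now - timedelta(minutes=2), "production", "key-old"),
]

# Production last used the old key 45 minutes ago and now uses the new one.
log_migrated = [
    (now - timedelta(minutes=45), "production", "key-old"),
    (now - timedelta(minutes=10), "production", "key-new"),
]

mismatch = retirement_decision(log_mismatch, "key-old", "key-new", now, quiet)
migrated = retirement_decision(log_migrated, "key-old", "key-new", now, quiet)
print(mismatch)
print(migrated)
```

The gate checks both directions: the old key must be absent from recent production traffic, and the new key must be present. Checking only one of the two would still let an environment mismatch slip through when production traffic is quiet.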

How Oasis Makes Rotation Predictable and Safe

At Oasis, we know managing non-human identities is about more than security - it’s about keeping systems running smoothly. Rotating secrets in a multi-cloud environment is risky when you lack full visibility, and no one wants a rotation that breaks production. Drawing on lessons from Cloudflare and customers facing similar challenges, Oasis has built an identity-centric approach that delivers continuous security without guesswork or disruptions.

Instead of relying on hope and manual processes, Oasis automatically discovers every secret, token, and identity across cloud and on-prem environments. But visibility alone isn’t enough. We map out exactly how these credentials interact - who owns them, where they’re used, and what they have access to, so teams aren’t caught off guard by dependencies they didn’t know existed.

When it comes to rotation, we take a policy-driven and automated approach. Whether it’s enforcing a strict 30-day cycle or triggering a rotation when an IT employee leaves, Oasis ensures that credentials aren’t just replaced, but verified. If an old key is still in use, we flag it immediately so nothing gets shut off before it’s safe to do so. We give you the context to decide - eliminating the “pull it and see who screams” approach. No guesswork, no surprises.

Going beyond just rotation, Oasis keeps an eye out for orphaned identities, expired secrets, and missing owners, so security teams can stay ahead of risks instead of scrambling to fix them later.

At the end of the day, managing NHIs isn’t just about locking things down - it’s about keeping businesses running without friction. Oasis makes sure security and operations work together, so teams can focus on building, not firefighting.

Conclusion

The Cloudflare outage on March 21, 2025 offers a sobering reminder that mismanaged key rotation can sabotage even the most resilient infrastructures. The misstep of deploying new credentials to a dev environment, then deleting the old credentials prematurely, caused over an hour of production disruption. And Cloudflare is hardly alone - rotation-driven mishaps have plagued Microsoft, Dropbox, and countless others.

But it doesn’t have to be this way. By embracing identity-centric rotation practices, automated discovery, contextual mapping, staged deployments, and robust validation, teams can keep keys fresh without taking the business offline. Oasis’s NHI Security Cloud is designed precisely for this mission: ensuring safer, more controlled rotations, even across sprawling multi-cloud topologies.

Ready to transform your rotation process from a high-stakes guesswork exercise into an automated, interruption-free workflow? Get in touch with Oasis today and leave the fear of rotation-induced outages behind.

Have questions or comments? Reach out to our team at Oasis Security - we’re here to help.