Responding to a Major CDN/Cloud Outage: An SRE Playbook

A practical SRE playbook for major CDN/cloud outages with step-by-step mitigations, runbook snippets, and postmortem guidance for 2026 incidents.

When Cloudflare, AWS, or another major network provider goes dark, teams lose productive time and customers lose trust. This playbook gives SREs and platform engineers a concise, battle-tested set of steps, runbook snippets, and fast mitigations you can execute during a large-scale CDN/cloud outage (like the Cloudflare-linked X outage in January 2026).

Executive summary (do this in the first 10 minutes)

  1. Detect & confirm — Check provider status pages, BGP/route viewers, and internal telemetry for correlated failures.
  2. Triage & scope — Identify the blast radius: which services, regions, and customers are affected.
  3. Quick mitigations — Bring traffic to origins, fail over to secondary CDN, or switch DNS to a backup provider with pre-approved low TTLs.
  4. Communicate — Open an incident channel, publish an initial status, assign roles, and post updates every 15 minutes.
  5. Contain & monitor — Stop cascading failures: reduce retries, feature-flag heavy consumers, and watch key health metrics.
  6. Post-incident — Produce a blameless postmortem, implement action items, and run failover drills.

1. Detection and rapid confirmation

Most major outages begin with a spike in errors or latency. Your incident clock starts the moment you see anomalous telemetry. Confirm fast using both provider and third-party signals.

Immediate checks (first 2–5 minutes)

  • Check provider status: status.cloudflare.com, status.aws.amazon.com, or your CDN's status page.
  • Check DNS: run dig and curl --resolve to detect DNS resolution or edge-network failures.
  • Check routing: use BGP/route viewers (bgp.he.net, RIPEstat) to detect any AS path changes or widespread route withdrawals.
  • Look at internal telemetry: 5xx rates, request rate per region, connection errors, and synthetic checks.
  • Search social/DownDetector signals for correlated reports (use as supporting evidence, not sole source).
# Quick curl bypass test to hit origin directly (bypass CDN DNS)
curl -v --resolve "www.example.com:443:203.0.113.42" https://www.example.com/

# dig to check whether the public hostname still CNAMEs to the CDN
dig +short www.example.com CNAME
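
A scripted status check is often faster than reloading the status page by hand. This is a minimal sketch, assuming the provider publishes a Statuspage-style API (Cloudflare's status page at www.cloudflarestatus.com does) and that jq is installed:

# Scripted provider status check (assumes a Statuspage-style /api/v2/status.json endpoint)
curl -s https://www.cloudflarestatus.com/api/v2/status.json \
  | jq -r '.status.indicator + ": " + .status.description'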

2. Triage: scope, impact, and roles

Define what is affected and who will act. Assign clear roles for the duration of the outage: Incident Commander (IC), Communications lead, Engineering lead, and SRE/network operators.

Essential questions to answer in the first 10 minutes

  • Which services are returning 5xx or timeouts?
  • Which regions or POPs show abnormal patterns?
  • Are control planes (CDN API, provider console) reachable?
  • Are origin servers healthy and reachable from the public Internet?
  • Do we have pre-warmed failover paths (secondary CDN, alternate DNS) ready?
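
A quick probe across your public hostnames helps answer the scope questions above. This is a rough sketch; the hostnames and health path are placeholders for your own synthetic targets:

# Rough blast-radius probe (hostnames and /healthz path are placeholders)
for host in www.example.com api.example.com assets.example.com; do
  code=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" "https://${host}/healthz")
  echo "${host}: ${code}"
done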

Incident channel & roles (runbook snippet)

# PagerDuty / Slack incident creation template (paste to create quickly)
Incident: MAJOR - CDN/Cloud Outage
Severity: P1
IC: @oncall_sre
Comm: @product_ops
Eng Lead: @backend_team
Channel: #incident-cdn-outage-YYYYMMDD
Initial message:
- Brief: High 5xx/timeout across global regions. Suspected CDN/cloud provider.
- Scope: list of affected services
- Next update: in 15m
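
If you prefer to script channel creation instead of clicking through Slack, the Web API's conversations.create method works; this sketch assumes a bot token with the channels:manage scope is exported as SLACK_TOKEN:

# Create the incident channel via the Slack Web API (SLACK_TOKEN is assumed to be
# a bot token with the channels:manage scope)
curl -s -X POST https://slack.com/api/conversations.create \
  -H "Authorization: Bearer ${SLACK_TOKEN}" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d '{"name": "incident-cdn-outage-20260131"}'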

3. Fast mitigation tactics (0–30 minutes)

When a provider like Cloudflare or AWS has a systemic failure, you need rapid, low-risk mitigations. Prioritize actions that restore customer functionality before perfectly resolving root cause.

Mitigation options, ordered by speed and risk

  1. Bypass the CDN edge and point traffic to origin

    When the CDN control plane is affected or the CDN is dropping traffic, route traffic directly to origin IPs or to an alternate domain that resolves to your origins.

    # Example: reduce the DNS TTL, then switch the CNAME to origin-host.example.net
    # (a CLI sketch for applying a DNS change batch follows this list; use --resolve
    # for quick curl checks as above)
    
  2. Failover to a secondary CDN or multi-CDN vendor

    Teams using multi-CDN with preconfigured provider records can flip weights or activate the secondary provider via DNS or global traffic manager.

    # Route53 change batch (JSON) to repoint www.example.com at the secondary CDN
    # (a CLI sketch for applying this change batch follows this list)
    {
      "Comment": "Failover to secondary CDN",
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "www.example.com",
          "Type": "CNAME",
          "TTL": 60,
          "ResourceRecords": [{"Value": "secondary-cdn.example.net"}]
        }
      }]
    }
    
  3. Use an alternate domain with different DNS provider

    Prepare an alternate FQDN (api-direct.example.net) that points directly at your load balancer/origin. During an outage, update clients or run temporary redirects to the alternate domain.

  4. Enable origin-level defenses

    Temporarily set stricter rate limits or require API keys to reduce load, then gradually open back up as stability returns.

  5. Coordinate with the provider

    Open a ticket with your provider's critical support using your escalation contacts and reference the public incident. While provider fixes are underway, continue mitigations locally.
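
To make the DNS switches in options 1 and 2 executable rather than copy/paste, the change batch above can be applied with the AWS CLI. A minimal sketch, assuming the JSON is saved as failover.json and HOSTED_ZONE_ID holds your Route53 hosted zone ID:

# Apply the failover change batch and wait for Route53 to report INSYNC
# (failover.json and HOSTED_ZONE_ID are assumptions about your setup)
CHANGE_ID=$(aws route53 change-resource-record-sets \
  --hosted-zone-id "${HOSTED_ZONE_ID}" \
  --change-batch file://failover.json \
  --query 'ChangeInfo.Id' --output text)

aws route53 wait resource-record-sets-changed --id "${CHANGE_ID}"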

Quick checklist for origin bypass

  • Ensure the origin is hardened: TLS certs are valid for direct access and the web server accepts the public hostname in the Host header.
  • Verify firewall and WAF rules allow non-CDN IP ranges or the specific backup provider's ranges.
  • Update DNS with very low TTLs (60s), pre-approved in the runbook, to minimize cache time.
  • Monitor origin CPU, memory, and connection limits to avoid overwhelming the origin.

Example: curl and TLS bypass to test origin

# Hit origin IP while preserving Host header and TLS
curl -v --resolve "www.example.com:443:203.0.113.42" https://www.example.com/ -H 'Host: www.example.com'

4. Communication — keep everyone aligned

Clear, frequent communication reduces duplicated work and calms downstream teams and customers. Use templates and cadence.

Internal cadence

  • Initial message: within 10 minutes — what we know, scope, next update time.
  • Operational updates: every 15 minutes while unstable.
  • Transition update: when moving from mitigation to monitoring.

Customer status template (public status page / email)

We are currently experiencing degraded service for [affected services] due to an upstream CDN/cloud provider incident. Our teams have initiated [mitigation, e.g. failover to a secondary CDN] and we will post updates every 15 minutes. Status page: https://status.example.com

5. Containment: prevent cascading failures

During a CDN outage, retries and aggressive clients can overload origins. Prioritize stopping harmful feedback loops.

Actions to contain damage

  • Adjust client retry policies and backoff in-flight (feature-flag or config rollouts).
  • Disable non-essential background jobs and bulk syncs.
  • Throttle or temporarily block heavy API consumers by rate limiting or requiring elevated auth.
  • Increase detailed-log sampling only for affected services, so you don't saturate I/O elsewhere.
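
If your background workloads run on Kubernetes, pausing them during containment is one command per workload. A sketch; the CronJob and Deployment names are placeholders for your own non-essential jobs:

# Pause non-essential workloads (resource names are placeholders)
kubectl patch cronjob nightly-bulk-sync -p '{"spec":{"suspend":true}}'
kubectl scale deployment report-exporter --replicas=0

# Re-enable once the system is stable
kubectl patch cronjob nightly-bulk-sync -p '{"spec":{"suspend":false}}'
kubectl scale deployment report-exporter --replicas=3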

6. Verification and telemetry

After mitigations, verify service health with synthetic checks and real-user metrics (RUM).

Key KPIs to verify

  • Availability — 200 vs 5xx ratios per region.
  • Latency — p50/p95/p99 end-to-end for user requests.
  • Error budget consumption — track SLO burn rates.
  • Capacity — origin connection counts, CPU, and memory.

Synthetic health check snippet

# Simple synthetic check via curl to the failover hostname
for i in {1..20}; do
  curl -s -o /dev/null -w "%{http_code} %{time_total}\n" https://api-direct.example.net/health
  sleep 3
done
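
If you run Prometheus, the availability KPI can be checked from the command line as well. A sketch assuming a reachable Prometheus at prometheus:9090 and http_requests_total-style metrics; adjust names to your instrumentation:

# 5xx ratio over the last 5 minutes (Prometheus address and metric names are assumptions)
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
  | jq -r '.data.result[0].value[1]'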

7. Rollback, stabilization, and reopening the system

When the upstream provider reports resolution, don't immediately revert everything. Gradually roll back mitigations while watching KPIs.

Safe rollback sequence

  1. Send a test percentage of traffic back to CDN edges (canary traffic; a weighted-records sketch follows this list).
  2. Monitor health for at least twice as long as it took failures to surface during the incident.
  3. Gradually increase traffic if stable; maintain readiness to flip back to failover.
  4. Post a status update with details and ETA for full reversion.
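
One way to implement the canary in step 1 is Route53 weighted records: send a small weight back to the primary CDN and raise it as KPIs hold. A sketch, assuming both weighted CNAME records already exist in the zone (names, weights, and HOSTED_ZONE_ID are placeholders):

# Shift 10% of www traffic back to the primary CDN
cat > canary.json <<'EOF'
{
  "Comment": "Canary 10% back to primary CDN",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com",
        "Type": "CNAME",
        "SetIdentifier": "primary-cdn",
        "Weight": 10,
        "TTL": 60,
        "ResourceRecords": [{"Value": "primary-cdn.example.net"}]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com",
        "Type": "CNAME",
        "SetIdentifier": "secondary-cdn",
        "Weight": 90,
        "TTL": 60,
        "ResourceRecords": [{"Value": "secondary-cdn.example.net"}]
      }
    }
  ]
}
EOF
aws route53 change-resource-record-sets --hosted-zone-id "${HOSTED_ZONE_ID}" --change-batch file://canary.json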

8. Post-incident: blameless postmortem and action items

A high-quality postmortem turns firefighting into long-term resilience improvements.

Postmortem structure (use in your runbook)

  1. Summary: what happened, impact, and timeline (concise).
  2. Timeline: minute-by-minute correlated events (telemetry, human actions, provider updates).
  3. Root causes: both proximate and systemic.
  4. Action items: owner, priority, ETA, and verification plan.
  5. Post-incident review: follow-up meeting scheduled, lessons learned shared widely.
Good postmortems are blameless and focus on systems, not people.

Example postmortem action items

  • Implement multi-CDN routing with automated health checks and documented failover playbooks (owner: infra team, ETA: 30 days).
  • Pre-warm secondary DNS and lower TTLs for critical records (owner: platform team, ETA: 14 days).
  • Run quarterly full failover drills (owner: SRE lead, ETA: recurring).
  • Improve synthetic coverage for global POPs and BGP anomalies (owner: Observability, ETA: 21 days).

9. Runbook snippets you can paste into your incident tooling

PagerDuty incident template (YAML)

title: MAJOR - CDN/Cloud Outage
severity: P1
services:
  - web-prod
teams:
  - platform
assignments:
  - user: oncall_sre
notes: |
  Steps:
  - Confirm provider status
  - Create incident channel
  - Execute failover to secondary CDN (if configured)
  - Notify customers
  - Monitor KPIs

Slack pinned message template for incident channel

Incident: CDN/Cloud outage
IC: @oncall_sre
Status page: https://status.example.com
Mitigations underway: origin-bypass + secondary CDN
Next update: [time]

10. What's changed for 2026: trends to build into your playbook

Late 2025 and early 2026 accelerated two trends relevant to outage resilience: increased adoption of multi-CDN and edge compute, and a stronger emphasis on automated, testable runbooks (GitOps for incidents). Here's how to align your SRE practice.

1) Multi-CDN with health-based steering

Multi-CDN is no longer optional for large-scale services. Adopt health-check steering with automated failover and capacity-aware traffic shaping. Invest in a global traffic manager (DNS-based or cloud-based) that supports programmable, API-driven failover.
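
Programmable steering can start as a simple watcher that trips failover after several consecutive failed health checks. A minimal sketch, assuming failover.sh wraps the DNS change from section 3 and the health URL is a placeholder; a real traffic manager replaces this loop but implements the same logic:

# Naive health-based failover watcher (failover.sh and the URL are placeholders)
FAILS=0
while true; do
  if curl -sf --max-time 5 https://www.example.com/health > /dev/null; then
    FAILS=0
  else
    FAILS=$((FAILS + 1))
  fi
  if [ "${FAILS}" -ge 3 ]; then
    ./failover.sh   # apply the secondary-CDN change batch, then alert the IC
    break
  fi
  sleep 10
done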

2) Resilience as code and GitOps for runbooks

Store playbooks, TTLs, and failover configs in version control. Use CI to validate changes and test failover flows in a staging environment. This reduces human error during outages; think of runbooks like other code artifacts that need lifecycle and access controls (collaborative playbook management).
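
A CI check can be as small as asserting that every failover change batch in the repo keeps critical TTLs low. A sketch, assuming the change batches live under dns/failover/ in the runbook repo and jq is available in CI:

# CI guard: failover change batches must use TTLs of 60s or less
# (the dns/failover/ layout is an assumption about your repo)
for f in dns/failover/*.json; do
  bad=$(jq '[.Changes[].ResourceRecordSet.TTL | select(. > 60)] | length' "$f")
  if [ "${bad}" -ne 0 ]; then
    echo "ERROR: ${f} contains records with TTL > 60s" >&2
    exit 1
  fi
done
echo "All failover change batches pass TTL checks"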

3) Edge compute and origin containment

Edge compute can serve critical assets during origin issues but can also complicate failover. Maintain fallbacks that allow origins to accept traffic directly without edge-layer assumptions. See tactical patterns for edge-first assets in edge-powered landing pages.

4) Observability & AI-assisted incident response

2026 toolchains increasingly include AI agents that propose runbook steps and summarize timelines. Use AI as an assistant, not an autopilot — validate suggested changes before execution. Read guidance on securing and operating AI assistants before you grant them control in incidents: How to Harden Desktop AI Agents.

5) Regular canary and chaos testing

Schedule regular failover drills and conduct targeted chaos tests that simulate CDN or BGP outages. Make drills part of SLO maintenance and team readiness. For structured adversarial testing approaches, see case studies like Red Teaming Supervised Pipelines.
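
A lightweight drill doesn't need a full chaos platform: from a test host, force the edge hostname to a blackhole address and confirm the documented fallback still serves. A sketch using the TEST-NET address 192.0.2.1 and the fallback hostname from earlier sections:

# Simulate "CDN edge unreachable" from a test host, then verify the fallback path
curl -s -o /dev/null --max-time 5 \
  --resolve "www.example.com:443:192.0.2.1" https://www.example.com/ \
  || echo "edge path failed (expected during the drill)"

curl -s -o /dev/null -w "fallback health: %{http_code}\n" https://api-direct.example.net/health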

11. Real-world example: lessons from the Jan 2026 Cloudflare-linked outage

In January 2026, a widespread outage affected X and other platforms due to Cloudflare edge failures. Teams that fared best shared three characteristics:

  • They had tested alternate DNS/hostnames that bypassed the CDN and could switch quickly.
  • They followed a strict communications cadence and published short, factual updates to users.
  • They ran regular failover drills and had automation to do DNS flips and weighted routing changes safely.

Takeaway: preparation and practiced automation turn a catastrophic outage into a controlled incident.

12. Checklist: the first 30 minutes

  1. Confirm incident & scope (status pages, telemetry).
  2. Create incident channel and assign IC.
  3. Notify stakeholders & publish public status.
  4. Execute rapid mitigation (origin bypass or secondary CDN).
  5. Contain (throttle heavy consumers; pause non-essential jobs).
  6. Verify metrics and begin stabilization steps.

13. Checklist: Post-incident 24–72 hours

  • Produce blameless postmortem and timeline.
  • Create prioritized action items with owners and verification tests.
  • Schedule a follow-up failover drill to validate improvements.
  • Share learned lessons with engineering and product teams to update onboarding and runbooks.

Final notes and advanced tips

  • Pre-provision emergency keys and contacts with your CDNs and cloud providers — human escalation paths save time.
  • Maintain an emergency domain and dedicated DNS provider with separate credentials for critical failover.
  • Keep DNS TTLs low for critical records only when you need fast failover; otherwise use higher TTLs to reduce DNS load.
  • Document and automate TLS/ALPN fallback handling so direct-origin access doesn't break TLS validation on clients.
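
For the last point, a quick way to confirm that direct-origin access presents a valid certificate and the expected ALPN protocol is openssl s_client against the origin IP with SNI set to the public hostname (203.0.113.42 is a documentation placeholder):

# Verify TLS and ALPN when hitting the origin directly
openssl s_client -connect 203.0.113.42:443 \
  -servername www.example.com -alpn h2,http/1.1 < /dev/null 2>/dev/null \
  | grep -E 'subject=|issuer=|ALPN'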

Call to action

Outages will continue to happen in 2026. The difference between chaos and confidence is a practiced runbook, automated failover, and a blameless post-incident culture. Start by adding two things to your next sprint: an automated failover test and a documented emergency domain. If you want, copy the runbook snippets above into your tooling and run a tabletop in the next two weeks — then iterate. Reach out to your SRE team, schedule a failover drill, and make your incident playbook executable before the next major provider outage.
