
The 15-Minute Disaster Plan: What to Do When Everything Goes Offline

Everything is down. Website, checkout, support inbox, internal tools, maybe even Slack. The worst part is not the outage itself. It is the silence. Silence makes people assume the most dramatic thing possible.

So here is the quick plan. It is not perfect. It is what you do when you need motion, fast. Two goals only:

  1. Tell customers what is happening in a calm, credible way.
  2. Restore the one system that reduces damage the fastest.

Print this. Paste it into your runbook. Put it somewhere you can reach when your brain is fried.

The first 2 minutes: pick one voice, one place, one message

1) Assign roles in one sentence (even if it is just you)

  • Customer voice: one person writes and posts updates.
  • Triage lead: one person decides what to fix first.
  • Hands on keyboard: whoever is actually doing the technical recovery.

If it is only you, you are all three. Still say it out loud. It helps.

2) Decide where customers will look for updates

Pick the most reliable external channel, in this order:

  1. Status page (hosted separately from your main stack)
  2. Pinned post on X or LinkedIn (where your customers actually follow you)
  3. A simple hosted page (GitHub Pages, Cloudflare Pages, even a published Google Doc)
  4. Email (only if your email system is unaffected and you can send at scale)

One place. Not five. Customers want a single source of truth.

3) Post the first customer update (template you can copy)

Post within 5 minutes. Even if you know nothing yet.

Status update 1 (short):

We are currently experiencing an outage affecting [product/service]. Our team is investigating and working on a fix. Next update in 15 minutes.

Time: [timezone]

If payments might be impacted, add one line:

If you attempted a purchase, please do not retry repeatedly. We will confirm order status as soon as systems are stable.

If you are a B2B tool, add one line:

Data integrity is our priority. We will share what we know as soon as we confirm it.

Do not guess the cause. Do not blame a vendor. Do not promise an ETA.
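The template is easier to reuse under stress if it lives in a tiny script, so whoever is on customer voice never writes from scratch and never forgets the next-update time. A minimal sketch (the wording mirrors the template above; the 15-minute interval and function name are just illustrative):

```python
from datetime import datetime, timedelta, timezone

TEMPLATE = (
    "We are currently experiencing an outage affecting {service}. "
    "Our team is investigating and working on a fix. "
    "Next update at {next_update} UTC."
)

def first_update(service, interval_minutes=15, now=None):
    """Render status update 1 with a concrete next-update time."""
    now = now or datetime.now(timezone.utc)
    next_update = (now + timedelta(minutes=interval_minutes)).strftime("%H:%M")
    return TEMPLATE.format(service=service, next_update=next_update)
```

Printing a concrete time instead of "in 15 minutes" also means the message stays accurate after it is reposted or screenshotted.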

What to tell customers (and what not to say)

What customers want, basically

They want to know:

  • Is it you or them?
  • Can they still do the thing they came to do?
  • Is their money or data at risk?
  • When will you speak again?

That is it.

Keep using this structure in every update

  1. What is affected (plain language)
  2. What we are doing (one sentence)
  3. What customers should do (one sentence)
  4. Next update time (specific)

Status update format:

Impact: [what is broken, who is affected]

Action: [what you are doing now]

Customer guidance: [what they should do now]

Next update: [time]

Things to avoid saying (because they backfire)

  • “We are aware” with no next update time.
  • “Everything is down” when only one feature is down.
  • “No data was impacted” unless you have actually confirmed.
  • “Should be back soon” unless you like screenshots being used against you later.

Customer guidance lines you can reuse

  • “You do not need to take any action right now.”
  • “Please do not retry checkout multiple times. We will verify and reconcile orders once stable.”
  • “If you see duplicate charges, they are typically authorization holds. We will confirm within [time].”
  • “We will post updates here every 15 minutes until resolved.”

Which system to fix first (decision tree you can use under stress)

When everything feels equally broken, this is how you choose what to fix first.

You are optimizing for one of these outcomes:

  • Stop revenue loss.
  • Stop trust loss.
  • Stop data loss.
  • Stop support explosion.

Here is the priority order that usually wins.

Priority 0: Safety and data integrity (always first)

Fix anything that could corrupt or permanently lose data.

Examples:

  • Database is in a bad state, failing writes, replication broken
  • Storage volumes unavailable and apps are retrying writes
  • Background jobs writing partial records

If you suspect data corruption risk, freeze writes. Put the app into read only mode if you can. Yes, it is painful. It is less painful than corrupted customer data.
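One lightweight way to implement that freeze is a process-wide read-only flag that every write path checks before touching storage. A sketch of the idea, assuming nothing about your framework (the names `READ_ONLY`, `WriteBlocked`, and `save_order` are illustrative):

```python
import threading

READ_ONLY = threading.Event()  # set during an incident to freeze writes

class WriteBlocked(Exception):
    """Raised when a write is attempted while the app is frozen."""

def guard_write(fn):
    """Decorator: reject writes while READ_ONLY is set; reads are unaffected."""
    def wrapper(*args, **kwargs):
        if READ_ONLY.is_set():
            raise WriteBlocked("Service is in read-only mode during an incident")
        return fn(*args, **kwargs)
    return wrapper

@guard_write
def save_order(order):
    # Real persistence would go here; this stand-in just confirms the call ran.
    return f"saved {order}"
```

Flipping `READ_ONLY.set()` from an ops shell or admin endpoint then blocks every guarded write without a deploy, which is exactly what you want when a deploy might be the problem.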

Priority 1: Authentication and core access (can customers even get in?)

If customers cannot log in, nothing else matters.

Fix first if:

  • Login is down
  • Session service is down
  • SSO integration is failing across the board

Reason: restoring access reduces panic, and it reduces support tickets instantly. Also, it is usually upstream of everything else.

Priority 2: Payments and checkout (stop the money bleeding)

If you sell online, checkout is often your fastest damage multiplier.

Fix first if:

  • Checkout errors are happening
  • Payment provider webhooks are failing
  • Orders are being created but not confirmed
  • Customers can be charged without getting confirmation

Two rules here:

  • Prevent duplicate charges: throttle retries, add a banner, temporarily disable one click purchases if they are causing repeats.
  • Preserve order events: if webhooks are delayed, queue them, do not drop them.
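For the duplicate-charge rule, the standard mechanism is an idempotency key: the client sends the same key on every retry, and the server charges at most once per key. A minimal in-memory sketch of the pattern (a real system would persist keys in durable storage, and `charge` stands in for your actual payment call):

```python
_seen = {}  # idempotency_key -> result; use durable storage in production

def charge(amount_cents):
    """Stand-in for a real payment-provider call."""
    return {"status": "charged", "amount": amount_cents}

def charge_once(idempotency_key, amount_cents):
    """Return the cached result for a repeated key instead of charging again."""
    if idempotency_key in _seen:
        return _seen[idempotency_key]
    result = charge(amount_cents)
    _seen[idempotency_key] = result
    return result
```

The same key-and-cache shape works for preserving order events: a webhook you have already processed is looked up, not reprocessed, and one you cannot process yet is queued, not dropped.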

Priority 3: The customer facing front door (marketing site, app shell, API gateway)

If customers cannot even reach you, fix the entry point.

Fix first if:

  • DNS issues
  • CDN down or misconfigured
  • Load balancer unhealthy
  • API gateway failing, returning 5xx for everything

Reason: once the front door works, you can deliver a banner or a degraded mode experience. Silence turns into a message.

Priority 4: The primary product workflow (the one job customers hire you for)

This is the main action. Create invoice. Ship order. Publish post. Send campaign. Whatever your product is.

Fix first if:

  • Customers can log in but cannot complete the main task
  • Work is stuck and customers are blocked right now

Tip: Do not try to restore every feature. Restore the spine first, then the limbs.

Priority 5: Internal tools and nice to haves

  • Analytics
  • Recommendation engines
  • Search
  • Non essential integrations
  • Admin dashboards

These matter. Just not first.

The “fix this first” cheat sheet (common outage patterns)

If it is DNS or CDN related

Fix order:

  1. DNS resolution and records (are you pointing to the right place)
  2. CDN configuration or rollback
  3. Origin health (is the app actually alive behind it)
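Steps 1 and 3 can be checked from any machine outside your own infrastructure with nothing but the standard library. A sketch, assuming you substitute your own hostname and origin URL (this confirms DNS resolution and origin reachability only, not CDN configuration):

```python
import socket
import urllib.error
import urllib.request

def resolve(hostname):
    """Return the IP addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

def origin_alive(url, timeout=5):
    """True if the origin answers with any non-5xx HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as e:
        return e.code < 500  # a 404 still proves the server is up
    except OSError:
        return False  # DNS failure, refused connection, timeout
```

Run it from a laptop on a phone hotspot if you suspect your office network or VPN is part of the problem.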

Customer message:

We are seeing connectivity issues reaching our service. We are working on restoring access. Next update in 15 minutes.

If it is a database outage

Fix order:

  1. Stop harmful retries and runaway write traffic
  2. Restore database availability (failover, restart, capacity)
  3. Validate read and write correctness
  4. Bring background jobs back slowly
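"Stop harmful retries" usually means replacing tight retry loops with capped exponential backoff plus jitter, so a recovering database is not flattened by a synchronized retry storm. A sketch of the delay schedule (the base and cap values here are illustrative):

```python
import random

def backoff_delays(attempts, base=0.5, cap=30.0):
    """Yield one jittered delay in seconds per retry attempt.

    Uses "full jitter": each delay is drawn uniformly between zero and an
    exponentially growing ceiling, so concurrent clients spread out.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield random.uniform(0, ceiling)
```

The jitter matters as much as the exponent: without it, every client that failed at the same moment retries at the same moment too.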

Customer message:

Some actions may fail or appear delayed. We are working to restore full functionality. Please avoid repeating the same action multiple times.

If it is a deploy gone wrong

Fix order:

  1. Roll back
  2. Confirm health checks
  3. Verify core workflow
  4. Then investigate what happened
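Step 2 is worth scripting so nobody is eyeballing a dashboard under stress: poll the health check until it succeeds several times in a row or a deadline passes. A sketch, assuming `probe` is any zero-argument callable (for example an HTTP GET against your health endpoint) so the logic is testable without a network:

```python
import time

def wait_healthy(probe, required=3, deadline_s=120, interval_s=2.0):
    """Return True once probe() succeeds `required` times in a row.

    A streak requirement avoids declaring victory on one lucky response
    while the rollback is still propagating. Gives up after deadline_s.
    """
    streak = 0
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        streak = streak + 1 if probe() else 0
        if streak >= required:
            return True
        time.sleep(interval_s)
    return False
```

If this returns False after a rollback, the last deploy probably was not your problem, and you have learned that in two minutes instead of twenty.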

Customer message:

We identified an issue related to a recent change and are rolling back to restore service.

If it is a third party dependency (payments, email, auth provider)

Fix order:

  1. Confirm it is external (status pages, synthetic checks)
  2. Implement fallback or graceful degradation
  3. Communicate workaround if one exists
  4. Monitor and keep customers updated
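Step 2's graceful degradation often looks like trying providers in order and queuing anything that cannot be delivered, so nothing is silently dropped. A minimal sketch of that shape (the names and the bare in-memory queue are illustrative; production would use a durable queue):

```python
import collections

deferred = collections.deque()  # messages to replay once providers recover

def send_with_fallback(message, providers):
    """Try each provider in order; queue the message if all of them fail."""
    for provider in providers:
        try:
            return provider(message)
        except Exception:
            continue  # this provider is down, try the next one
    deferred.append(message)  # do not drop it; replay after the incident
    return None
```

The replay step after recovery is where "delayed, but will be processed" messaging to customers comes from, so make sure someone owns draining that queue.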

Customer message:

We are experiencing disruptions due to an upstream provider issue. We are applying mitigations and will continue to post updates here.

Be careful with naming the vendor. You can say “upstream provider” unless you have a reason to be specific.

Customer communication checklist (copy and paste into your incident doc)

Post these in order

  • Update 1 within 5 minutes: acknowledge, impact, next update time
  • Update 2 within 15 minutes: what you know now, what is affected, what customers should do
  • Update every 15 to 30 minutes: even if there is no major change
  • Resolution message: what is back, what to do if still broken, where to report issues
  • Follow up within 24 to 72 hours: brief postmortem summary (customers love this more than you think)

Add these details if they apply

  • Orders may be delayed, but will be processed
  • Duplicate charge guidance
  • Data integrity status (only if confirmed)
  • Workaround steps (if safe and simple)

Where to route inbound customer questions during chaos

Pick one:

  • A single support email alias that works externally
  • A simple form (Typeform, Google Form)
  • A dedicated status page comment or incident email

Do not let messages scatter across personal inboxes, DMs, and random Slack channels if you can help it.

Technical triage checklist (the first 15 minutes)

This is the order that keeps you sane.

Minute 0 to 5: confirm and contain

  • Confirm outage is real (synthetic check, external ping)
  • Identify scope: all users or some regions, web or API, login or checkout
  • Freeze risky changes: stop deploys, pause pipelines
  • If data risk: switch to read only or disable writes where possible

Minute 5 to 10: find the choke point

  • Check DNS and CDN health
  • Check load balancer and gateway metrics (5xx, latency)
  • Check auth service
  • Check database health (connections, replication, storage)
  • Check recent deploys and config changes

Minute 10 to 15: restore the spine first

  • Roll back the last change if suspicious
  • Bring up the minimal working path (login plus core workflow)
  • Degrade safely (disable non essential features)
  • Confirm customer visible recovery with real user tests

The “we are back” message (template)

Service is restored and we are monitoring. If you still see issues, please [steps to retry, clear cache, re login] and contact [support channel].

Next update: [only if you expect more changes]

If payments were involved:

If you attempted a purchase during the outage, we are reconciling transactions now. You will receive confirmation shortly. If you see a duplicate charge after [time window], contact us at [support].

One last thing: the system to fix before all of this happens again

Not in the moment, but soon after.

Make sure you have:

  • A status page that is not hosted on the same infrastructure as your app
  • A prewritten incident template and a place to paste it
  • A clear decision on what “degraded mode” looks like
  • A rollback plan that is actually tested
  • Monitoring that tells you what customers feel, not just CPU graphs

That is the 15 minute disaster plan. Say something fast. Fix the spine first. Keep talking.

FAQs (Frequently Asked Questions)

What is the immediate action to take during a complete system outage?

Within the first 2 minutes, assign clear roles such as Customer Voice (updates), Triage Lead (decides what to fix first), and Hands on Keyboard (technical recovery). Then, choose one reliable external channel for customer updates and post the first status update within 5 minutes to keep communication transparent and calm.

Which channels are best for communicating outage updates to customers?

Use a single, most reliable external channel, prioritized as follows: 1) Status page hosted separately from your main stack; 2) Pinned post on social platforms like X or LinkedIn where customers follow you; 3) Simple hosted pages like GitHub Pages or a published Google Doc; 4) Email, only if your email system is unaffected and you can send at scale. Customers need one source of truth to avoid confusion.

What should be included in every customer status update during an outage?

Each update should clearly state: 1) What is affected (in plain language); 2) What the team is doing now; 3) What customers should do; and 4) When the next update will be provided. Avoid vague phrases and do not guess causes or promise ETAs.

What are some key phrases to avoid when communicating outages?

Avoid saying things like ‘We are aware’ without providing a next update time, ‘Everything is down’ when only part of the system is affected, ‘No data was impacted’ unless confirmed, and ‘Should be back soon’ as it may cause mistrust if timelines slip.

How do you prioritize which systems to fix first during a widespread outage?

Prioritize based on minimizing damage: Priority 0 – Safety and data integrity (freeze writes if corruption risk); Priority 1 – Authentication and core access (login issues); Priority 2 – Payments and checkout (stop revenue loss and prevent duplicate charges); Priority 3 – Customer facing front door (DNS, CDN, API gateway issues); Priority 4 – Primary product workflow (main customer tasks).

What guidance should be given to customers about their actions during an outage?

Provide clear instructions such as ‘You do not need to take any action right now,’ ‘Please do not retry checkout multiple times; we will reconcile orders once stable,’ ‘If you see duplicate charges, they are usually authorization holds which we will confirm shortly,’ and inform them about regular update intervals until resolution.
