
The 15-Minute Disaster Plan: What to Do When Everything Goes Offline

Everything is down. Website, checkout, support inbox, internal tools, maybe even Slack. The worst part is not the outage itself. It is the silence. Silence makes people assume the most dramatic thing possible.

So here is the quick plan. It is not perfect. It is what you do when you need motion, fast. Two goals only:

  1. Tell customers what is happening in a calm, credible way.
  2. Restore the one system that reduces damage the fastest.

Print this. Paste it into your runbook. Put it somewhere you can reach when your brain is fried.

The first 2 minutes: pick one voice, one place, one message

1) Assign roles in one sentence (even if it is just you)

  • Customer voice: one person writes and posts updates.
  • Triage lead: one person decides what to fix first.
  • Hands on keyboard: whoever is actually doing the technical recovery.

If it is only you, you are all three. Still say it out loud. It helps.

2) Decide where customers will look for updates

Pick the most reliable external channel, in this order:

  1. Status page (hosted separately from your main stack)
  2. Pinned post on X or LinkedIn (where your customers actually follow you)
  3. A simple hosted page (GitHub Pages, Cloudflare Pages, even a published Google Doc)
  4. Email (only if your email system is unaffected and you can send at scale)

One place. Not five. Customers want a single source of truth.

3) Post the first customer update (template you can copy)

Post within 5 minutes. Even if you know nothing yet.

Status update 1 (short):

We are currently experiencing an outage affecting [product/service]. Our team is investigating and working on a fix. Next update in 15 minutes.

Time: [timezone]

If payments might be impacted, add one line:

If you attempted a purchase, please do not retry repeatedly. We will confirm order status as soon as systems are stable.

If you are a B2B tool, add one line:

Data integrity is our priority. We will share what we know as soon as we confirm it.

Do not guess the cause. Do not blame a vendor. Do not promise an ETA.
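The template is easier to reuse under stress if it lives in a tiny script, so whoever is on customer voice never writes from scratch and never forgets the next-update time. A minimal sketch (the wording mirrors the template above; the 15-minute interval and function name are just illustrative):

```python
from datetime import datetime, timedelta, timezone

TEMPLATE = (
    "We are currently experiencing an outage affecting {service}. "
    "Our team is investigating and working on a fix. "
    "Next update at {next_update} UTC."
)

def first_update(service, interval_minutes=15, now=None):
    """Render status update 1 with a concrete next-update time."""
    now = now or datetime.now(timezone.utc)
    next_update = (now + timedelta(minutes=interval_minutes)).strftime("%H:%M")
    return TEMPLATE.format(service=service, next_update=next_update)
```

Printing a concrete time instead of "in 15 minutes" also means the message stays accurate after it is reposted or screenshotted.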

What to tell customers (and what not to say)

What customers want, basically

They want to know:

  • Is it you or them?
  • Can they still do the thing they came to do?
  • Is their money or data at risk?
  • When will you speak again?

That is it.

Keep using this structure in every update

  1. What is affected (plain language)
  2. What we are doing (one sentence)
  3. What customers should do (one sentence)
  4. Next update time (specific)

Status update format:

Impact: [what is broken, who is affected]

Action: [what you are doing now]

Customer guidance: [what they should do now]

Next update: [time]

Things to avoid saying (because they backfire)

  • “We are aware” with no next update time.
  • “Everything is down” when only one feature is down.
  • “No data was impacted” unless you have actually confirmed.
  • “Should be back soon” unless you like screenshots being used against you later.

Customer guidance lines you can reuse

  • “You do not need to take any action right now.”
  • “Please do not retry checkout multiple times. We will verify and reconcile orders once stable.”
  • “If you see duplicate charges, they are typically authorization holds. We will confirm within [time].”
  • “We will post updates here every 15 minutes until resolved.”

Which system to fix first (decision tree you can use under stress)

When everything feels equally broken, this is how you choose what to fix first.

You are optimizing for one of these outcomes:

  • Stop revenue loss.
  • Stop trust loss.
  • Stop data loss.
  • Stop support explosion.

Here is the priority order that usually wins.

Priority 0: Safety and data integrity (always first)

Fix anything that could corrupt or permanently lose data.

Examples:

  • Database is in a bad state, failing writes, replication broken
  • Storage volumes unavailable and apps are retrying writes
  • Background jobs writing partial records

If you suspect data corruption risk, freeze writes. Put the app into read only mode if you can. Yes, it is painful. It is less painful than corrupted customer data.
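One lightweight way to implement that freeze is a process-wide read-only flag that every write path checks before touching storage. A sketch of the idea, assuming nothing about your framework (the names `READ_ONLY`, `WriteBlocked`, and `save_order` are illustrative):

```python
import threading

READ_ONLY = threading.Event()  # set during an incident to freeze writes

class WriteBlocked(Exception):
    """Raised when a write is attempted while the app is frozen."""

def guard_write(fn):
    """Decorator: reject writes while READ_ONLY is set; reads are unaffected."""
    def wrapper(*args, **kwargs):
        if READ_ONLY.is_set():
            raise WriteBlocked("Service is in read-only mode during an incident")
        return fn(*args, **kwargs)
    return wrapper

@guard_write
def save_order(order):
    # Real persistence would go here; this stand-in just confirms the call ran.
    return f"saved {order}"
```

Flipping `READ_ONLY.set()` from an ops shell or admin endpoint then blocks every guarded write without a deploy, which is exactly what you want when a deploy might be the problem.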

Priority 1: Authentication and core access (can customers even get in?)

If customers cannot log in, nothing else matters.

Fix first if:

  • Login is down
  • Session service is down
  • SSO integration is failing across the board

Reason: restoring access reduces panic, and it reduces support tickets instantly. Also, it is usually upstream of everything else.

Priority 2: Payments and checkout (stop the money bleeding)

If you sell online, checkout is often your fastest damage multiplier.

Fix first if:

  • Checkout errors are happening
  • Payment provider webhooks are failing
  • Orders are being created but not confirmed
  • Customers can be charged without getting confirmation

Two rules here:

  • Prevent duplicate charges: throttle retries, add a banner, temporarily disable one click purchases if they are causing repeats.
  • Preserve order events: if webhooks are delayed, queue them, do not drop them.
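For the duplicate-charge rule, the standard mechanism is an idempotency key: the client sends the same key on every retry, and the server charges at most once per key. A minimal in-memory sketch of the pattern (a real system would persist keys in durable storage, and `charge` stands in for your actual payment call):

```python
_seen = {}  # idempotency_key -> result; use durable storage in production

def charge(amount_cents):
    """Stand-in for a real payment-provider call."""
    return {"status": "charged", "amount": amount_cents}

def charge_once(idempotency_key, amount_cents):
    """Return the cached result for a repeated key instead of charging again."""
    if idempotency_key in _seen:
        return _seen[idempotency_key]
    result = charge(amount_cents)
    _seen[idempotency_key] = result
    return result
```

The same key-and-cache shape works for preserving order events: a webhook you have already processed is looked up, not reprocessed, and one you cannot process yet is queued, not dropped.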

Priority 3: The customer facing front door (marketing site, app shell, API gateway)

If customers cannot even reach you, fix the entry point.

Fix first if:

  • DNS issues
  • CDN down or misconfigured
  • Load balancer unhealthy
  • API gateway failing, returning 5xx for everything

Reason: once the front door works, you can deliver a banner or a degraded mode experience. Silence turns into a message.

Priority 4: The primary product workflow (the one job customers hire you for)

This is the main action. Create invoice. Ship order. Publish post. Send campaign. Whatever your product is.

Fix first if:

  • Customers can log in but cannot complete the main task
  • Work is stuck and customers are blocked right now

Tip: Do not try to restore every feature. Restore the spine first, then the limbs.

Priority 5: Internal tools and nice to haves

  • Analytics
  • Recommendation engines
  • Search
  • Non essential integrations
  • Admin dashboards

These matter. Just not first.

The “fix this first” cheat sheet (common outage patterns)

If it is DNS or CDN related

Fix order:

  1. DNS resolution and records (are you pointing to the right place)
  2. CDN configuration or rollback
  3. Origin health (is the app actually alive behind it)
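Steps 1 and 3 can be checked from any machine outside your own infrastructure with nothing but the standard library. A sketch, assuming you substitute your own hostname and origin URL (this confirms DNS resolution and origin reachability only, not CDN configuration):

```python
import socket
import urllib.error
import urllib.request

def resolve(hostname):
    """Return the IP addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

def origin_alive(url, timeout=5):
    """True if the origin answers with any non-5xx HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as e:
        return e.code < 500  # a 404 still proves the server is up
    except OSError:
        return False  # DNS failure, refused connection, timeout
```

Run it from a laptop on a phone hotspot if you suspect your office network or VPN is part of the problem.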

Customer message:

We are seeing connectivity issues reaching our service. We are working on restoring access. Next update in 15 minutes.

If it is a database outage

Fix order:

  1. Stop harmful retries and runaway write traffic
  2. Restore database availability (failover, restart, capacity)
  3. Validate read and write correctness
  4. Bring background jobs back slowly
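"Stop harmful retries" usually means replacing tight retry loops with capped exponential backoff plus jitter, so a recovering database is not flattened by a synchronized retry storm. A sketch of the delay schedule (the base and cap values here are illustrative):

```python
import random

def backoff_delays(attempts, base=0.5, cap=30.0):
    """Yield one jittered delay in seconds per retry attempt.

    Uses "full jitter": each delay is drawn uniformly between zero and an
    exponentially growing ceiling, so concurrent clients spread out.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield random.uniform(0, ceiling)
```

The jitter matters as much as the exponent: without it, every client that failed at the same moment retries at the same moment too.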

Customer message:

Some actions may fail or appear delayed. We are working to restore full functionality. Please avoid repeating the same action multiple times.

If it is a deploy gone wrong

Fix order:

  1. Roll back
  2. Confirm health checks
  3. Verify core workflow
  4. Then investigate what happened
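Step 2 is worth scripting so nobody is eyeballing a dashboard under stress: poll the health check until it succeeds several times in a row or a deadline passes. A sketch, assuming `probe` is any zero-argument callable (for example an HTTP GET against your health endpoint) so the logic is testable without a network:

```python
import time

def wait_healthy(probe, required=3, deadline_s=120, interval_s=2.0):
    """Return True once probe() succeeds `required` times in a row.

    A streak requirement avoids declaring victory on one lucky response
    while the rollback is still propagating. Gives up after deadline_s.
    """
    streak = 0
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        streak = streak + 1 if probe() else 0
        if streak >= required:
            return True
        time.sleep(interval_s)
    return False
```

If this returns False after a rollback, the last deploy probably was not your problem, and you have learned that in two minutes instead of twenty.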

Customer message:

We identified an issue related to a recent change and are rolling back to restore service.

If it is a third party dependency (payments, email, auth provider)

Fix order:

  1. Confirm it is external (status pages, synthetic checks)
  2. Implement fallback or graceful degradation
  3. Communicate workaround if one exists
  4. Monitor and keep customers updated
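Step 2's graceful degradation often looks like trying providers in order and queuing anything that cannot be delivered, so nothing is silently dropped. A minimal sketch of that shape (the names and the bare in-memory queue are illustrative; production would use a durable queue):

```python
import collections

deferred = collections.deque()  # messages to replay once providers recover

def send_with_fallback(message, providers):
    """Try each provider in order; queue the message if all of them fail."""
    for provider in providers:
        try:
            return provider(message)
        except Exception:
            continue  # this provider is down, try the next one
    deferred.append(message)  # do not drop it; replay after the incident
    return None
```

The replay step after recovery is where "delayed, but will be processed" messaging to customers comes from, so make sure someone owns draining that queue.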

Customer message:

We are experiencing disruptions due to an upstream provider issue. We are applying mitigations and will continue to post updates here.

Be careful with naming the vendor. You can say “upstream provider” unless you have a reason to be specific.

Customer communication checklist (copy and paste into your incident doc)

Post these in order

  • Update 1 within 5 minutes: acknowledge, impact, next update time
  • Update 2 within 15 minutes: what you know now, what is affected, what customers should do
  • Update every 15 to 30 minutes: even if there is no major change
  • Resolution message: what is back, what to do if still broken, where to report issues
  • Follow up within 24 to 72 hours: brief postmortem summary (customers love this more than you think)

Add these details if they apply

  • Orders may be delayed, but will be processed
  • Duplicate charge guidance
  • Data integrity status (only if confirmed)
  • Workaround steps (if safe and simple)

Where to route inbound customer questions during chaos

Pick one:

  • A single support email alias that works externally
  • A simple form (Typeform, Google Form)
  • A dedicated status page comment or incident email

Do not let messages scatter across personal inboxes, DMs, and random Slack channels if you can help it.

Technical triage checklist (the first 15 minutes)

This is the order that keeps you sane.

Minute 0 to 5: confirm and contain

  • Confirm outage is real (synthetic check, external ping)
  • Identify scope: all users or some regions, web or API, login or checkout
  • Freeze risky changes: stop deploys, pause pipelines
  • If data risk: switch to read only or disable writes where possible

Minute 5 to 10: find the choke point

  • Check DNS and CDN health
  • Check load balancer and gateway metrics (5xx, latency)
  • Check auth service
  • Check database health (connections, replication, storage)
  • Check recent deploys and config changes

Minute 10 to 15: restore the spine first

  • Roll back the last change if suspicious
  • Bring up the minimal working path (login plus core workflow)
  • Degrade safely (disable non essential features)
  • Confirm customer visible recovery with real user tests

The “we are back” message (template)

Service is restored and we are monitoring. If you still see issues, please [steps to retry, clear cache, re login] and contact [support channel].

Next update: [only if you expect more changes]

If payments were involved:

If you attempted a purchase during the outage, we are reconciling transactions now. You will receive confirmation shortly. If you see a duplicate charge after [time window], contact us at [support].

One last thing: the system to fix before all of this happens again

Not in the moment, but soon after.

Make sure you have:

  • A status page that is not hosted on the same infrastructure as your app
  • A prewritten incident template and a place to paste it
  • A clear decision on what “degraded mode” looks like
  • A rollback plan that is actually tested
  • Monitoring that tells you what customers feel, not just CPU graphs

That is the 15 minute disaster plan. Say something fast. Fix the spine first. Keep talking.

FAQs (Frequently Asked Questions)

What is the immediate action to take during a complete system outage?

Within the first 2 minutes, assign clear roles such as Customer Voice (updates), Triage Lead (decides what to fix first), and Hands on Keyboard (technical recovery). Then, choose one reliable external channel for customer updates and post the first status update within 5 minutes to keep communication transparent and calm.

Which channels are best for communicating outage updates to customers?

Use a single, most reliable external channel, prioritized as follows: 1) Status page hosted separately from your main stack; 2) Pinned post on social platforms like X or LinkedIn where customers follow you; 3) Simple hosted pages like GitHub Pages or a published Google Doc; 4) Email, only if your email system is unaffected and you can send at scale. Customers need one source of truth to avoid confusion.

What should be included in every customer status update during an outage?

Each update should clearly state: 1) What is affected (in plain language); 2) What the team is doing now; 3) What customers should do; and 4) When the next update will be provided. Avoid vague phrases and do not guess causes or promise ETAs.

What are some key phrases to avoid when communicating outages?

Avoid saying things like ‘We are aware’ without providing a next update time, ‘Everything is down’ when only part of the system is affected, ‘No data was impacted’ unless confirmed, and ‘Should be back soon’ as it may cause mistrust if timelines slip.

How do you prioritize which systems to fix first during a widespread outage?

Prioritize based on minimizing damage: Priority 0 – Safety and data integrity (freeze writes if corruption risk); Priority 1 – Authentication and core access (login issues); Priority 2 – Payments and checkout (stop revenue loss and prevent duplicate charges); Priority 3 – Customer facing front door (DNS, CDN, API gateway issues); Priority 4 – Primary product workflow (main customer tasks).

What guidance should be given to customers about their actions during an outage?

Provide clear instructions such as ‘You do not need to take any action right now,’ ‘Please do not retry checkout multiple times; we will reconcile orders once stable,’ ‘If you see duplicate charges, they are usually authorization holds which we will confirm shortly,’ and inform them about regular update intervals until resolution.
