How to Run a DR Tabletop Exercise

A step-by-step facilitation guide with scenario scripts, inject timelines, and the exact questions to ask at every step.

Disaster Recovery Ransomware

A DR runbook that has never been tested is a hypothesis, and the first real incident will test it for you, conclusively and painfully. But most organizations treat DR testing as a full failover or nothing, which means it rarely happens. Tabletop exercises fill that gap: they test the people, not the systems, and they cost nothing but two to three hours in a conference room.

This is the facilitation guide we use at Shift7 Consulting. It includes ready-to-use scenario scripts, a timed inject timeline, the four questions you ask at every runbook step, and a post-exercise action framework. Whether you're running your first tabletop or your tenth, this guide works.

DR Tabletop Exercise overview slide
The complete walkthrough is also available as a slide deck.

Why Tabletop Exercises Matter

They test people, not systems. A failover test validates that SRM works and VMs power on. A tabletop validates whether your team knows what to do, who to call, in what order, and who has the authority to make decisions. Human readiness is the gap that actually kills you during a real event.

They find gaps nothing else does. Missing escalation contacts. Unclear declaration authority. A runbook step that references a server decommissioned six months ago. Two people who both think they're in charge. These only surface when you talk through the scenario step by step.

They're cheap and low-risk. No systems touched. No production impact. Two to three hours. Compare that to the cost of discovering these gaps during a real incident — where every minute of confusion is a minute of extended downtime.

They build muscle memory. The team that has practiced together handles pressure differently than the team seeing the runbook for the first time at 2 AM.

Before the Exercise (1-2 Weeks Out)

Preparation makes or breaks a tabletop. Here's the checklist:

  1. Pick a scenario and scope. Total site loss? Ransomware? Partial failure? Choose one. Don't try to cover everything in one session.
  2. Invite the right people. Facilitator, scribe, incident commander, VMware admin, network admin, and at least one business application owner. For ransomware scenarios, add the security lead and optionally comms/legal.
  3. Distribute the runbook in advance. Everyone should read it before they walk in. Most won't. That's a finding too.
  4. Prepare your inject timeline. Write 4-6 complications you'll introduce at specific points. These are the pressure tests.
  5. Assign the facilitator and scribe. These are separate roles. The facilitator runs the exercise. The scribe documents findings. Neither participates in the recovery discussion.

The scribe is non-negotiable. Without a scribe, you have no findings. Without findings, you have no follow-up. Without follow-up, you had a meeting, not an exercise.
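The invite list and the facilitator/scribe rule are easy to check mechanically before the invitations go out. Below is a minimal Python sketch of such a check. The role names come from the checklist above; the `roster_problems` function and the role-to-person dictionary are hypothetical conventions chosen for illustration, not part of any tool.

```python
# Roles the checklist above asks for in every session.
REQUIRED_ROLES = {
    "facilitator", "scribe", "incident commander",
    "VMware admin", "network admin", "business application owner",
}
# Added for ransomware scenarios (comms/legal are optional, so not enforced).
RANSOMWARE_ROLES = {"security lead"}
NEUTRAL_ROLES = ("facilitator", "scribe")


def roster_problems(roster, ransomware=False):
    """Return a list of problems with a {role: person} roster (empty if none)."""
    problems = []
    required = REQUIRED_ROLES | (RANSOMWARE_ROLES if ransomware else set())
    for role in sorted(required - set(roster)):
        problems.append(f"missing role: {role}")

    facilitator = roster.get("facilitator")
    scribe = roster.get("scribe")
    if facilitator and facilitator == scribe:
        problems.append("facilitator and scribe must be different people")

    # Facilitator and scribe stay out of the recovery discussion,
    # so neither should also hold a recovery role.
    neutral_people = {p for p in (facilitator, scribe) if p}
    for role, person in roster.items():
        if role not in NEUTRAL_ROLES and person in neutral_people:
            problems.append(f"{person} cannot be both neutral and {role}")
    return problems
```

An empty list means the roster satisfies the checklist. Anything else goes back to whoever is organizing the session.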

Setting Ground Rules (First 10 Minutes)

The first ten minutes set the tone. Here's the exact script we read at the start of every exercise:

Ground rules for DR tabletop exercises
Read this verbatim. The six ground rules establish psychological safety and set expectations.

The key messages to land: no wrong answers (if you don't know, say it — that's a finding), stay in your role (respond as you would at 2 AM, not with hindsight), and nothing leaves the room unowned.

Scenario 1: Total Site Loss

Facilitator reads:

"It's Tuesday at 9:15 AM. Facilities reports a major cooling system failure at our Phoenix data center. Multiple ESXi hosts are thermal-shutting down. vCenter is unresponsive. The MPLS link is still up but there's nothing to connect to on the other end. Your phone is ringing — the VP of Finance can't access SAP, and the customer portal is returning 502 errors. What do you do first?"
Total site loss scenario with key questions
The scenario script plus the key questions to ask at each runbook step.

Then walk through the runbook step by step. At each step, ask the four questions described in "The Four Questions" later in this guide.

Injects for Scenario 1

Introduce these complications at the appropriate points:

  ~20 min: The IC's phone goes straight to voicemail.
    Ask: Who's the backup IC? How long do you wait before escalating?
  ~35 min: SRM reports a replication error on one protection group.
    Ask: Which VMs are affected? Skip them or troubleshoot? Who decides?
  ~50 min: The SSL cert on the portal at DR expired yesterday.
    Ask: Who can renew it? Who has cert management access?
  ~60 min: Payment provider rejecting transactions from unknown IPs.
    Ask: Were DR egress IPs pre-registered? Who has the provider portal?
  ~75 min: The CEO calls wanting a board update.
    Ask: Who takes this call? What do you say? When's the next update?
  ~85 min: Failover complete. How do you get back to primary?
    Ask: Walk me through failback. When do you start? What has to be true first?
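If the facilitator keeps the inject timeline in a script rather than on paper, a small helper can show which injects are due at any point in the session. This is a minimal Python sketch using the Scenario 1 times and wording above; the `Inject` class and `due_injects` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inject:
    minute: int     # approximate minutes after the scenario is read
    event: str      # the complication the facilitator announces
    questions: str  # what the facilitator asks next


# Scenario 1 injects, copied from the timeline above.
SITE_LOSS_INJECTS = [
    Inject(20, "The IC's phone goes straight to voicemail.",
           "Who's the backup IC? How long do you wait before escalating?"),
    Inject(35, "SRM reports a replication error on one protection group.",
           "Which VMs are affected? Skip them or troubleshoot? Who decides?"),
    Inject(50, "The SSL cert on the portal at DR expired yesterday.",
           "Who can renew it? Who has cert management access?"),
    Inject(60, "Payment provider rejecting transactions from unknown IPs.",
           "Were DR egress IPs pre-registered? Who has the provider portal?"),
    Inject(75, "The CEO calls wanting a board update.",
           "Who takes this call? What do you say? When's the next update?"),
    Inject(85, "Failover complete. How do you get back to primary?",
           "Walk me through failback. When do you start? "
           "What has to be true first?"),
]


def due_injects(injects, elapsed_min, delivered):
    """Return injects whose time has passed and that were not yet delivered.

    `delivered` is a set of inject minutes the facilitator has already used.
    """
    return [i for i in injects
            if i.minute <= elapsed_min and i.minute not in delivered]


if __name__ == "__main__":
    # 40 minutes in, with the 20-minute inject already delivered.
    for inject in due_injects(SITE_LOSS_INJECTS, elapsed_min=40, delivered={20}):
        print(f"~{inject.minute} min: {inject.event}")
        print(f"  Ask: {inject.questions}")
```

The times are approximate by design. Treat the output as a prompt for the facilitator, who still decides when the room is ready for the next complication.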

Scenario 2: Ransomware Attack

Facilitator reads:

"It's Wednesday at 2:30 AM. Your monitoring system fires an alert: CPU and disk I/O have spiked to 100% across multiple production VMs simultaneously. Within minutes, the helpdesk starts receiving calls — file shares are showing encrypted files with a .locked extension. A ransom note appears on the domain controller console. The note demands 50 BTC and threatens to publish exfiltrated customer data within 72 hours. What do you do first?"
Ransomware scenario with key questions
The first question (what's your FIRST action?) reveals the most.

The ransomware scenario changes every assumption, so the questions are fundamentally different from the ones in a site-loss exercise.

Ransomware Injects

  ~15 min: The Veeam backup server console also shows a ransom note.
    Ask: Were backup creds the same as domain admin? Is the hardened repo intact?
  ~25 min: Attacker emails your CFO with a sample of exfiltrated customer data.
    Ask: Who handles external comms? Legal? PR? Is there a prepared statement?
  ~35 min: Cyber insurance wants forensic evidence before engaging.
    Ask: Did you preserve VM snapshots? Where are the logs?
  ~45 min: Security identifies the initial breach: a phished credential from 8 days ago.
    Ask: Your clean RPO might be 8 days old. Is your retention long enough?
  ~55 min: Law enforcement wants you to wait before restoring.
    Ask: Restore now for business continuity, or preserve evidence?
  ~70 min: Restored to clean room. How do you verify the VMs are clean?
    Ask: What IOCs are you scanning for? Who validates before connecting to production?

The 45-minute inject is the gut punch. If the attacker's dwell time was 8 days and your backup retention is 7 days, you have no clean copy. This single finding has changed the backup architecture at every organization where we've run this scenario.
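The retention arithmetic is worth working through before the exercise. The Python sketch below makes two simplifying assumptions that are not from any backup product: one restore point is taken per day, and a restore point counts as clean only if it was taken before the breach date. The function names and dates are hypothetical.

```python
from datetime import date, timedelta


def daily_restore_points(today, retention_days):
    """Assume one backup per day, keeping only the last `retention_days` days."""
    return [today - timedelta(days=n) for n in range(1, retention_days + 1)]


def newest_clean_restore_point(restore_points, breach_date):
    """Return the most recent restore point taken before the breach, or None."""
    clean = [p for p in restore_points if p < breach_date]
    return max(clean) if clean else None


if __name__ == "__main__":
    today = date(2024, 1, 10)             # hypothetical exercise date
    breach = today - timedelta(days=8)    # phished credential from 8 days ago

    for retention in (7, 14):
        points = daily_restore_points(today, retention)
        clean = newest_clean_restore_point(points, breach)
        print(f"{retention}-day retention -> newest clean copy: {clean}")
```

With 7-day retention every surviving restore point postdates the breach, so the function returns `None`. With 14 days, the newest clean copy is the one from the day before the breach. In a real incident, "clean" has to be confirmed by scanning, not inferred from dates alone.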

The Four Questions

These are the engine of the tabletop. Ask them at every single step of the runbook:

The four questions to ask at every runbook step
Ask these at every runbook step. The answers — or the silence — are your findings.
  1. Who does this? Not a role: a name. "The VMware admin" is not an answer. "S. Patel" is. If nobody knows the name, that's a finding.
  2. Do we have the access right now? Can you log into DR vCenter at this moment? Do you know the password? If you'd need to look it up, that's a finding.
  3. Is this step still accurate? Does the server name match? Is the IP current? When was this last verified? If it was written by someone who left, that's a finding.
  4. What could go wrong? What if this step fails? What's the fallback? At what point do you skip it? If nobody has an answer, that's a finding.
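For the scribe, the four questions map naturally onto one record per runbook step, where any unanswered question becomes a finding. Here is a minimal Python sketch of that idea. The `StepReview` fields and the `findings_for` function are hypothetical names, and the sample step is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepReview:
    step: str                   # the runbook step being discussed
    owner_name: Optional[str]   # Q1: a named person, not a role
    access_confirmed: bool      # Q2: can they log in right now?
    verified_accurate: bool     # Q3: server names, IPs, contacts still current?
    fallback: Optional[str]     # Q4: what happens if this step fails?


def findings_for(review):
    """Turn every unanswered question for a step into a finding."""
    findings = []
    if not review.owner_name:
        findings.append(f"{review.step}: no named owner")
    if not review.access_confirmed:
        findings.append(f"{review.step}: access not confirmed")
    if not review.verified_accurate:
        findings.append(f"{review.step}: step details not verified")
    if not review.fallback:
        findings.append(f"{review.step}: no fallback if the step fails")
    return findings


if __name__ == "__main__":
    review = StepReview(
        step="Log in to DR vCenter",   # invented example step
        owner_name="S. Patel",
        access_confirmed=False,        # had to look up the password
        verified_accurate=True,
        fallback=None,                 # nobody had an answer
    )
    for finding in findings_for(review):
        print(finding)
```

Recording silence as an explicit `None` or `False` keeps the "nobody knew" moments from disappearing out of the notes.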

After the Exercise: The 48-Hour Rule

If the after-action doesn't happen within 48 hours, it won't happen at all. Urgency fades fast. Here's the framework:

  1. Compile findings report — clean up the scribe's notes, group by severity, send to all participants. Day 1-2.
  2. Assign remediation owners — every finding gets a named person and a due date. No exceptions.
  3. Update the runbook immediately — wrong IP, wrong server name, wrong contact? Fix it today, not in 30 days.
  4. Log in the DR Test Log — date, type, scope, findings count, action items.
  5. Track remediation at 15 and 30 days — anything still open at 30 gets escalated to the IC.
  6. Feed into next quarter — unresolved findings become the focus of the next tabletop.
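The 15- and 30-day checkpoints can be tracked with very little tooling. The following Python sketch assumes findings are kept as simple records with a named owner and a due date, as the list above requires. The `Finding` class, the status strings, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Finding:
    summary: str
    owner: str            # a named person, per step 2
    due: date             # every finding gets a due date
    closed: bool = False


def remediation_status(finding, exercise_date, today):
    """Classify a finding by how long it has been open since the exercise."""
    if finding.closed:
        return "closed"
    age_days = (today - exercise_date).days
    if age_days >= 30:
        return "escalate to IC"   # still open at 30 days
    if age_days >= 15:
        return "15-day check-in"
    return "open"


if __name__ == "__main__":
    exercise = date(2024, 3, 1)   # hypothetical exercise date
    finding = Finding("Backup IC not documented", "S. Patel", date(2024, 3, 20))
    for today in (date(2024, 3, 5), date(2024, 3, 16), date(2024, 3, 31)):
        print(today, remediation_status(finding, exercise, today))
```

The thresholds are measured from the exercise date rather than from each finding's due date. That is one reasonable reading of the rule above, not the only one.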

Annual Scenario Rotation

Don't run the same scenario twice in a row. Here's the recommended annual rotation:

Annual tabletop scenario rotation by quarter
Q1: Total site loss. Q2: Ransomware. Q3: Partial failure. Q4: Wildcard.

  Q1: Total Site Loss. The classic DR scenario. Tests full failover procedures.
  Q2: Ransomware Attack. Tests isolation, forensics, clean room recovery. This consistently reveals the most gaps.
  Q3: Partial Failure. Storage controller outage, half the datastores. Tests gray-area decision making.
  Q4: Wildcard. Ransomware during a DR exercise, WAN failure plus insider threat. Keeps the team from getting complacent.
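If exercises are scheduled from a script or calendar automation, the rotation reduces to a lookup by quarter. This short Python sketch assumes calendar quarters (January to March is Q1); the `scenario_for` function is a hypothetical helper, and a fiscal-year calendar would need a different mapping.

```python
from datetime import date

# Rotation from the schedule above, keyed by calendar quarter.
ROTATION = {
    1: "Total site loss",
    2: "Ransomware attack",
    3: "Partial failure",
    4: "Wildcard",
}


def scenario_for(exercise_date):
    """Return the rotation scenario for the quarter containing the date."""
    quarter = (exercise_date.month - 1) // 3 + 1
    return ROTATION[quarter]


if __name__ == "__main__":
    print(scenario_for(date(2025, 5, 14)))
```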

Common Mistakes

Six mistakes we see in every other tabletop:

  1. No scribe assigned. Without a scribe, you have no findings. Non-negotiable.
  2. No app owners invited. If only IT is in the room, you're testing half the plan.
  3. Same scenario every time. Teams memorize responses instead of thinking.
  4. Skipping the after-action. A tabletop without follow-up is a meeting.
  5. Too gentle on the team. Don't softball the injects. Real incidents are chaotic.
  6. Facilitator participates. The facilitator runs the exercise, not the recovery.

Need Help Facilitating?

We run tabletop exercises for enterprise clients — including ransomware scenarios with realistic injects.
One session. Real findings. Named owners. 30-day remediation tracking.

Nate Sellers is a Principal Consultant at Shift7 Consulting LLC, specializing in enterprise infrastructure strategy, cloud architecture, and cyber resilience. 20+ years in enterprise infrastructure and disaster recovery.

contact@shift7az.com · (480) 243-5793 · shift7az.com