Chaos · Multi-engineer incidents

Production incidents aren't solved alone

Two engineers enter the same live incident. One broken system. Real-time coordination under pressure. Practice the communication, task splitting, and shared situational awareness that make on-call rotations work.

War Rooms · Pair Debugging · Team Training · Final Interviews
War Room Active
Live
Engineer A - IC Lead
engineer-1
Engineer B - Responder
engineer-2
shared terminal
engineer-a$ kubectl describe pod api-gateway -n edge
Observation: readiness failing after dependency timeout

engineer-b$ kubectl get svc,endpoints -n edge
Thread: endpoints missing for auth backend

engineer-a$ kubectl describe networkpolicy -n edge
Decision: isolate policy drift before restart

Both engineers see the same system.
The report shows who did what.
2–3 engineers: per incident room
Shared container: same system, same state
Live presence: real-time connection tracking
Per-engineer replay: see who did what and when

Solo tests miss the most important signal

Real outages are team events. The best SRE in the world is useless if they can't coordinate with another engineer under pressure.

Solo assessment

  • Tests individual knowledge, not team behaviour
  • No signal on communication or delegation
  • Can't observe leadership under pressure
  • Doesn't reflect how production incidents actually work

With Chaos Mode

  • See how engineers split the problem and coordinate
  • Who narrates findings vs. who stays silent
  • Who drives the room vs. who adds noise
  • Who verifies before declaring "fixed"

Lobby, readiness, incident, review

The room model mirrors real incident management, not a chat room with a shared terminal bolted on.

01

Create a room

Pick a scenario: Kubernetes cascading failure, Azure networking, GPU diagnostics. The system provisions a shared container and generates paired handoff tokens for Engineer A and Engineer B.

02

Both engineers join

Each engineer enters the lobby with their token. Presence indicators show who's connected. Engineer A starts the session when both are ready. It's a deliberate readiness gate, not an accidental race.

03

Debug together, review separately

Both engineers work the same live system. The platform tracks each engineer's commands independently. After the session, managers get per-engineer replays showing decision patterns, not just commands.
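The three steps above can be read as a small state machine. The following is a minimal sketch under stated assumptions, not the platform's actual implementation: the `Room` class, the `RoomState` names, and the token format are all hypothetical and chosen only to illustrate paired handoff tokens and the readiness gate.

```python
from dataclasses import dataclass, field
from enum import Enum
import secrets


class RoomState(Enum):
    LOBBY = "lobby"
    INCIDENT = "incident"
    REVIEW = "review"


@dataclass
class Room:
    """Hypothetical model of an incident room; all names here are illustrative."""
    scenario: str
    state: RoomState = RoomState.LOBBY
    tokens: dict = field(default_factory=dict)   # role -> handoff token
    connected: set = field(default_factory=set)  # roles currently present

    def __post_init__(self):
        # Generate one handoff token per engineer when the room is created.
        for role in ("A", "B"):
            self.tokens[role] = secrets.token_urlsafe(16)

    def join(self, token: str) -> str:
        """Admit an engineer to the lobby and return their role."""
        for role, expected in self.tokens.items():
            if secrets.compare_digest(token, expected):
                self.connected.add(role)
                return role
        raise PermissionError("unknown handoff token")

    def start(self, role: str) -> None:
        """Readiness gate: only Engineer A starts, and only with both present."""
        if self.state is not RoomState.LOBBY:
            raise RuntimeError("session already started")
        if role != "A":
            raise PermissionError("only Engineer A can start the session")
        if self.connected != {"A", "B"}:
            raise RuntimeError("both engineers must be connected")
        self.state = RoomState.INCIDENT

    def finish(self) -> None:
        """Close the incident and move the room to per-engineer review."""
        if self.state is not RoomState.INCIDENT:
            raise RuntimeError("no incident in progress")
        self.state = RoomState.REVIEW
```

Making `start` an explicit call that fails until both roles are connected is what turns readiness into a gate rather than a race: nothing begins just because the second engineer happened to arrive.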

The signals you can't get from a solo test

Chaos Mode exposes collaboration quality, the one dimension most technical assessments completely ignore.

Coordination Quality

Do they split the problem sensibly? One engineer narrows blast radius while the other validates dependencies. Or do they both run the same commands?

Leadership Under Pressure

Who takes ownership of the incident? Who proposes a plan, assigns threads, and keeps the room focused? Or does nobody step up?

Decision Discipline

Do they verify changes before moving on? Do they test the user path, not just the symptom? Or do they declare victory at the first green light?

Communication Clarity

How well do they narrate what they're finding? Can Engineer B understand what Engineer A discovered without asking? Silence is data too.

Delegation Patterns

"You take the network policy, I'll trace the service mesh." Clean delegation under pressure is a signal you cannot get from a multiple-choice quiz.

Recovery Sequencing

In cascading failures, the order of fixes matters. Do they understand dependency chains? Or do they fix the loudest symptom first?
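To make "order of fixes" concrete, here is a small sketch that derives a recovery order from a dependency map using a topological sort. The dependency edges below are hypothetical and exist only for illustration; a real scenario would define its own chain.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be healthy first.
DEPENDS_ON = {
    "api-gateway": {"auth-backend"},
    "auth-backend": {"cluster-dns"},
    "cluster-dns": set(),
}


def recovery_order(depends_on):
    """Order services so every dependency is restored before its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


print(recovery_order(DEPENDS_ON))
# → ['cluster-dns', 'auth-backend', 'api-gateway']
```

In this sketch the loudest symptom (the gateway) comes last, because restarting it before DNS and the auth backend recover would only fail again.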

Four ways teams use Chaos Mode

01

Final-stage interviews

Put your top two candidates in the same incident room. See who actually leads under pressure instead of who interviews better. One 20-minute session replaces hours of panel interviews.

02

Kubernetes adoption

Your team just migrated to Kubernetes. Instead of hoping they'll learn from runbooks, throw them into a cascading pod failure together. They'll learn faster under pressure, and a colleague is there to catch mistakes.

03

Onboarding war games

A new SRE joins the team. Pair them with a senior engineer in a Chaos Mode session. The senior engineer observes their methodology, the new hire builds confidence, and managers get a real read on readiness.

04

Team readiness checks

Before the next change freeze or on-call rotation, run your team through a shared incident. Same pressure, none of the customer impact.

Incidents that evolve

Our most advanced scenario doesn't end when you fix the first problem. Each resolution triggers the next hidden failure, just like production.

PHASE 01: Pod crash-loop ($15K/hr)
  api-gateway pods in CrashLoopBackOff. Diagnose and fix the liveness probe.
PHASE 02: Node goes down ($45K/hr)
  Worker node flips to NotReady. SSH in, diagnose kubelet, restore the node.
PHASE 03: DNS breaks ($120K/hr)
  Network policy blocks DNS cluster-wide. Services can't resolve each other.
PHASE 04: Memory surge ($180K/hr)
  Backed-up traffic floods recovered services. OOMKilled pods everywhere.
PHASE 05: Etcd split-brain ($350K/hr)
  Clock skew on control-plane-02 causes etcd leader election instability.
PHASE 06: Drain storm ($500K/hr)
  Autoscaler panic triggers aggressive cordon and drain across the fleet.
Each fix triggers the next failure
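The escalation can be sketched as an ordered list in which resolving one phase reveals the next. This is an illustrative model only: the `Phase` type and helper functions are hypothetical, and reading the per-hour figure as a simulated cost of downtime is an assumption. The names and figures themselves are taken from the phase list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    number: int
    name: str
    cost_per_hour: int  # USD per hour, as listed for the phase


PHASES = [
    Phase(1, "Pod crash-loop", 15_000),
    Phase(2, "Node goes down", 45_000),
    Phase(3, "DNS breaks", 120_000),
    Phase(4, "Memory surge", 180_000),
    Phase(5, "Etcd split-brain", 350_000),
    Phase(6, "Drain storm", 500_000),
]


def next_phase(resolved: int):
    """Return the phase revealed by fixing phase `resolved`, or None at the end."""
    for phase in PHASES:
        if phase.number == resolved + 1:
            return phase
    return None


def simulated_cost(minutes_per_phase):
    """Total simulated cost for the minutes spent in each phase, in order."""
    return sum(p.cost_per_hour * m / 60 for p, m in zip(PHASES, minutes_per_phase))
```

For example, `simulated_cost([4, 2])` charges four minutes at the Phase 01 rate and two minutes at the Phase 02 rate, which shows why lingering in later phases is far more expensive than lingering in early ones.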

CLI + Chaos Mode is the strongest combination

Both engineers can join through the browser or the CLI. Each person picks whichever interface they work fastest in, and both connect to the same shared container over the same WebSocket transport. Commands, presence, and replay work identically on either interface.

  • Browser and CLI attach to the same room container
  • Either engineer can use either interface, or switch mid-session
  • Per-engineer command replay works across both surfaces
  • Use Team Drills for structured internal training programs
chaos room lobby
Chaos Mode
  PARIUM / war room

  SCENARIO  K8s Cascading Failure
  ROOM      chaos-room-42
  STATUS    Waiting for start

  ────────────────────────────────
   Engineer A  connected (you)
   Engineer B  connected
  ────────────────────────────────

  Press S to start the session
  Press Tab to cycle themes
  Press Ctrl+C to leave

Common questions about Chaos Mode

How is Chaos Mode different from screen sharing?

Screen sharing lets one person drive while others watch. Chaos Mode gives both engineers full terminal access to the same live system. Both can run commands, both get tracked independently, and the report shows exactly who contributed what. It's the difference between watching someone cook and both being in the kitchen.

Can we use Chaos Mode for internal training?

Yes, and it's one of the best fits. Pair a new SRE with a senior engineer. Run your platform team through a Kubernetes failure before the next migration. Use it for on-call readiness checks. Same incident engine, different purpose.

How many engineers can join a room?

Currently optimised for two engineers, labelled Engineer A and Engineer B. This keeps the signal clean: you can clearly see who led, who investigated, and who verified. Two is enough to expose collaboration patterns without the noise of a large group.

Which scenarios work best in Chaos Mode?

Scenarios with multiple root causes or branching failure paths. The Kubernetes cascading failure (6 phases) is designed for exactly this. It rewards engineers who split threads and coordinate. Simple single-fix scenarios work better as solo assessments.

What does the report include?

Per-engineer command history with timestamps: who ran what, when, and in what order. Presence data such as connection times and disconnects. AI analysis of each engineer's approach. And the full session replay, so you can watch the collaboration unfold like a recording.
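As a rough illustration of per-engineer command history, the sketch below splits one shared session log into a time-ordered list per engineer and flags commands that both engineers ran. The `CommandEvent` type, the function names, and the timestamps are hypothetical; the commands are the ones from the shared terminal example earlier on this page.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CommandEvent:
    at: datetime
    engineer: str
    command: str


def per_engineer_replay(events):
    """Split a shared session log into one time-ordered list per engineer."""
    replay = defaultdict(list)
    for event in sorted(events, key=lambda e: e.at):
        replay[event.engineer].append(event)
    return dict(replay)


def duplicated_commands(replay):
    """Commands run by more than one engineer, a hint of overlapping work."""
    seen = defaultdict(set)
    for engineer, events in replay.items():
        for event in events:
            seen[event.command].add(engineer)
    return sorted(cmd for cmd, who in seen.items() if len(who) > 1)


# Illustrative log; the timestamps are made up for the example.
SESSION_LOG = [
    CommandEvent(datetime(2024, 1, 1, 12, 0, 40), "engineer-b",
                 "kubectl get svc,endpoints -n edge"),
    CommandEvent(datetime(2024, 1, 1, 12, 0, 5), "engineer-a",
                 "kubectl describe pod api-gateway -n edge"),
    CommandEvent(datetime(2024, 1, 1, 12, 1, 30), "engineer-a",
                 "kubectl describe networkpolicy -n edge"),
]

replay = per_engineer_replay(SESSION_LOG)
```

Keeping one shared log and deriving the per-engineer views from it means the combined session replay and the individual histories cannot drift apart.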

Get Started

Test how your team handles pressure together

Because production is a team sport, and your interviews should be too.
