
The flight simulator for
production incidents

See how candidates handle outages before something actually breaks.

SRE · DevOps · Platform · Linux Admin · Infrastructure · Data Center
Scenario Simulation
Incident: INC-7234 · Severity: SEV-1 · State: Active
System: k8s-prod-03 · Issue: Pod CrashLoop · Impact: API degraded · Duration: 3m 6s
candidate@gpu-node-01 - parium assessment (Active · 07:12)
# Candidate investigating GPU driver failure
root@gpu-node-01:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
 
root@gpu-node-01:~$ lsmod | grep -E 'nvidia|nouveau'
nouveau 2093056 1
 
root@gpu-node-01:~$ modprobe -r nouveau && modprobe nvidia
Loading nvidia driver...
 
root@gpu-node-01:~$ curl -s localhost:8080/health | jq
{ "status": "healthy", "gpus": 2 }
 
# ✓ Incident resolved in 08:42 - 0 hints used
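The modprobe swap in the transcript above lasts only until reboot. The standard persistent fix is a modprobe.d blacklist; the file name below is conventional, not mandated:

```conf
# /etc/modprobe.d/blacklist-nouveau.conf
# Keep the in-tree nouveau driver from grabbing the GPU at boot,
# so the proprietary nvidia module can bind instead.
blacklist nouveau
options nouveau modeset=0
```

After writing the file, regenerate the initramfs (update-initramfs -u on Debian/Ubuntu, dracut --force on RHEL-family) so nouveau stays out of early boot as well.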
  • Real Linux VMs - via the browser, or your own terminal via CLI
  • Under 15 minutes - start to scored result
  • AI-scored - zero manual review
  • Full session replay - every keystroke captured

What take-home tests miss

Respect your candidates' time - and your engineers' too.

The take-home test

  • 3-hour time commitment - the best candidates might not find the time
  • Another hour for your team to review each submission
  • Artificial tasks that don't test real incident response
  • Non-deterministic - two reviewers, two different scores
  • Hard to know if LLMs have been used

With Parium

  • 15 minutes. A real broken server. A real terminal.
  • AI analysis reads the session so your team doesn't have to
  • Tests exactly what they'll do on day one: debug production
  • Same scenario, every candidate. Clear pass/fail with data.
  • Built-in paste detection and tab-switch monitoring
  • Full behavioral picture: session replay shows pastes, tab switches, and every command

Up and running in 3 simple steps

Use our ready-made scenarios or let us build custom assessments for your stack.

01

Tell us what you need

Pick from our ready-made scenarios (GPU debugging, server performance, Kubernetes) or tell us your stack and we'll build custom assessments.

02

Send to candidates

Share a link. Candidates enter their details and drop straight into a live terminal. No downloads, no accounts, no friction.

03

Get real insights

See exactly how they debug: time to resolution, commands used, thought process. Make confident hiring decisions backed by data.

Built for serious technical hiring

Everything you need to assess real engineering skills

Real Terminal Environments

Full Linux VMs via the browser, or connect through our CLI for chaos room sessions. Not a sandbox — a real system to debug.

Time-to-Resolution Tracking

Automatic timing from first command to incident resolution. Compare candidates objectively against your team's benchmarks.

Runbook & Hint System

Real SOPs, just like the ones your team uses. Track whether candidates follow procedures independently or need guidance — and how much.

LLM Detection

Paste event tracking and behavioural pattern analysis to flag candidates using AI assistance during the assessment.

Full Session Replay

Every keystroke, every command, every pause — with timestamps. Replay the entire session or export the full log for review.

Multiple Scenarios

Azure networking, K8s cascading failures, GPU driver conflicts, and more. Match the scenario to the role you're hiring for.

See what candidates actually face

Each scenario is a carefully crafted incident with production-accurate logs, configs, and system state. Real kernel modules, real error messages, real tools — dmesg, journalctl, kubectl, nvidia-smi — with health check endpoints that validate the fix.

SEV-1 · Mid–Expert · $8,200/hr impact

Production Edge API unreachable through Azure Load Balancer

Three independent root causes have drifted into a cascading failure. Candidates must trace the request path through Azure networking layers and fix each misconfiguration in the right order.

Root causes to identify
01 NSG rule blocking port 8080 traffic
02 Load Balancer health probe pointing to wrong endpoint
03 Application bound to loopback instead of 0.0.0.0
Simulated tools
az network · systemctl · journalctl · ss · curl · ssh
INCIDENT.txt
═══════════════════════════════════════════════
          INCIDENT ALERT — SEV 1
═══════════════════════════════════════════════

INCIDENT ID:  INC-2026-0315-LB503
SEVERITY:     Critical — Production
AFFECTED:     edge-api.parium.internal
IMPACT:       $8,200/hr revenue at risk

───────────────────────────────────────────────

Production Edge API is returning HTTP 503 through
the Azure Load Balancer. The VM appears to be
running, but zero backend health probes succeed.

Active escalations:  3 customer tickets
Executive visibility: Yes — CTO notified

═══════════════════════════════════════════════
              YOUR TASK
═══════════════════════════════════════════════

1. Investigate why the LB returns 503
2. Identify all root causes (there may be more
   than one)
3. Apply fixes using approved remediation tools
4. Verify health check returns 200 OK
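Root cause 03 above (the app bound to loopback) is the kind of thing ss exposes in a single line. A minimal sketch of that check; the helper name and the sample ss output are illustrative, not taken from the scenario:

```shell
# Hypothetical helper: reads `ss -ltn`-style lines on stdin and flags a
# listener on the given port that is bound to 127.0.0.1 only.
listen_addr_check() {
  port="$1"
  if grep -q "127\.0\.0\.1:${port}"; then
    echo "loopback-only: rebind the app to 0.0.0.0:${port}"
  else
    echo "ok"
  fi
}

# Sample line as the broken VM might show it (illustrative):
printf 'LISTEN 0 128 127.0.0.1:8080 0.0.0.0:*\n' | listen_addr_check 8080
```

A healthy listener would show 0.0.0.0:8080 (or [::]:8080) in the local-address column, at which point the LB health probe and NSG rule become the remaining suspects.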
SEV-1 · Expert · Up to $500K/hr

Cascading cluster failure across 6 progressive phases

A war room scenario where every fix triggers the next hidden failure. Starts with a pod crash-loop and escalates to etcd split-brain and cascading drain storms. Tests crisis management, not just Kubernetes knowledge.

Cascade progression
01 Pod CrashLoopBackOff → fix liveness probe
02 Worker node goes NotReady → diagnose kubelet
03 DNS network policy breaks cluster-wide
04 Memory surge from backed-up traffic
05 Etcd split-brain from clock skew
06 Cascading cordon/drain storm
Simulated tools
kubectl · crictl · systemctl · journalctl · etcdctl · timedatectl
INCIDENT.txt
═══════════════════════════════════════════════
          INCIDENT ALERT — SEV 1
═══════════════════════════════════════════════

INCIDENT ID:  INC-2026-WAR-ROOM
SEVERITY:     Critical — Cascading
CLUSTER:      prod-us-east-1 (18 nodes)
IMPACT:       $15K/hr → escalating

───────────────────────────────────────────────

api-gateway pods are in CrashLoopBackOff.
Customer-facing traffic is failing. SLA budget
is burning. This incident has executive
visibility.

WARNING: This incident will escalate.
Each fix you apply may reveal the next failure.
Prioritise methodically.

SLA budget remaining: 47 minutes
Oncall team:          Platform Engineering
War room:             Active — you are IC

═══════════════════════════════════════════════
              YOUR TASK
═══════════════════════════════════════════════

1. Restore api-gateway service availability
2. Investigate and resolve cascading failures
3. Validate cluster health at each phase
4. Maintain SLA budget — time matters
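Phase 01's liveness-probe fix usually means loosening values like these. The path, port, and numbers are assumptions for illustration; the scenario's real manifest may differ:

```yaml
# Sketch of a corrected probe for the api-gateway container (illustrative values)
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # a too-short delay restarts slow-starting pods forever
  periodSeconds: 10
  failureThreshold: 5
```

Applied with kubectl edit or kubectl patch, then watched with kubectl rollout status until the crash-loop clears.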
SEV-2 · Mid-Level · $4,200/hr impact

GPU has fallen off the bus — 7 of 8 GPUs visible

An NVIDIA A100 GPU is reporting Xid 79 errors and has disappeared from the PCIe bus. ML training jobs expecting 8 GPUs are failing. Tests hardware diagnosis skills and — critically — whether candidates know when to escalate vs. fix.

Diagnostic path
01 Verify GPU count and identify missing device
02 Check kernel logs for Xid errors and PCIe faults
03 Run DCGM diagnostics to rule out ECC errors
04 Apply ASPM power management fix or escalate
Simulated tools
nvidia-smi · dcgmi · lspci · dmesg · lsmod · ipmitool
INCIDENT.txt
═══════════════════════════════════════════════
          INCIDENT ALERT — SEV 2
═══════════════════════════════════════════════

INCIDENT ID:  INC-2026-0119-GPU
SEVERITY:     High — Production ML
AFFECTED:     gpu-node-01.neocloud.internal
IMPACT:       $4,200/hr compute waste

───────────────────────────────────────────────

GPU compute jobs are failing on gpu-node-01.
The node has 8x NVIDIA A100 80GB GPUs but only
7 of 8 devices are detected by monitoring.

Queued jobs:     3 LLM fine-tuning runs
Last healthy:    08:00 UTC today
Kernel log:      Xid 79 — GPU fallen off bus

═══════════════════════════════════════════════
              YOUR TASK
═══════════════════════════════════════════════

1. Investigate why nvidia-smi shows fewer GPUs
2. Identify the root cause (driver vs hardware)
3. Restore GPU functionality if possible
4. Escalate to hardware team if necessary
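Step 04's escalate-vs-fix judgment can be sketched as a lookup. The mapping below is a deliberate simplification (Xid 79 is "GPU has fallen off the bus", as the incident text notes; Xid 13 and 31 are typically application-level per NVIDIA's Xid documentation) and is not Parium's scoring logic:

```shell
# Rough Xid triage table (simplified; real triage also weighs dmesg and PCIe
# state, and the ASPM fix from step 04 - e.g. pcie_aspm=off on the kernel
# command line - applies only when the logs implicate power management).
xid_action() {
  case "$1" in
    79)    echo "escalate: GPU fell off the bus - likely PCIe/hardware fault" ;;
    13|31) echo "fix: application-level fault - restart the offending job" ;;
    *)     echo "investigate: Xid $1 needs kernel-log context" ;;
  esac
}

xid_action 79
```

The point the scenario tests is exactly this branch: a candidate who spends twenty minutes reloading drivers on a bus-level fault scores worse than one who runs the diagnostics and escalates.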

From L1 support to senior SRE

Scenarios matched to every role on your team

Site Reliability Engineers

Test incident response, system debugging, and production troubleshooting skills with real-world scenarios.

GPU Driver Failure Kubernetes Performance Issues Service Outages

DevOps & Platform Engineers

Assess configuration management, CI/CD pipelines, container orchestration, and infrastructure automation.

API Gateway Config Container Issues Log Analysis CI/CD Pipelines

Data Center Engineers

Evaluate hardware diagnostics, bare metal troubleshooting, and GPU/accelerator management skills.

GPU Diagnostics IPMI/BMC Driver Conflicts Hardware Failures

Linux System Administrators

Test core Linux skills, process management, and filesystem troubleshooting abilities.

Runaway Process Disk Management Service Recovery System Boot

Incidents that evolve

Our most advanced scenario doesn't end when you fix the first problem. Each resolution triggers the next hidden failure — just like production.

War Room Mode

Multiple engineers connect to the same live incident simultaneously. Test how candidates collaborate under pressure — or use it for team training exercises.

engineer-1 (IC) · engineer-2 (responder)
PHASE 01 · Pod crash-loop: api-gateway pods in CrashLoopBackOff. Diagnose and fix the liveness probe. Impact: $15K/hr
PHASE 02 · Node goes down: Worker node flips to NotReady. SSH in, diagnose kubelet, restore the node. Impact: $45K/hr
PHASE 03 · DNS breaks: Network policy blocks DNS cluster-wide. Services can't resolve each other. Impact: $120K/hr
PHASE 04 · Memory surge: Backed-up traffic floods recovered services. OOMKilled pods everywhere. Impact: $180K/hr
PHASE 05 · Etcd split-brain: Clock skew on control-plane-02 causes etcd leader election instability. Impact: $350K/hr
PHASE 06 · Drain storm: Autoscaler panic triggers aggressive cordon and drain across the fleet. Impact: $500K/hr
Each fix triggers the next failure

Drop into incidents from your terminal

The Parium CLI connects you directly to war room sessions from your own terminal. No browser, no context switching — just parium open and you're in.

$ npm install -g @parium.ai/cli@preview
  • WebSocket terminal attach — real SSH-like sessions
  • Collaborative war room mode for team incidents
  • Dark, light, and mono themes — auto-detects your terminal
  • Browser-to-terminal handoff with secure tokens
Terminal — parium
Preview
$ parium open

  █▀█ ▄▀█ █▀█ █ █ █ █▀▄▀█
  █▀▀ █▀█ █▀▄ █ █▄█ █ ▀ █
  Chaos Terminal Client v0.1.0-alpha.2

Paste handoff token: ••••••••••••

 Token validated
 Session resolved — k8s-chaos-war-room
 Attaching to terminal...

──────────────────────────────────────
  SESSION  K8s Cascading Failure
  STATUS   ● LIVE
  PHASE    3 of 6 — DNS network policy
  IMPACT   $120K/hr
──────────────────────────────────────

candidate@prod-worker-07:~$ 

An assessment that respects engineers' time.

No unfamiliar IDEs. No artificial puzzles. Just a terminal and a real incident - the environment they work in every day.

  • Finish in under 20 minutes - not days
  • Real tools, real terminal - no unfamiliar IDEs
  • Reflects how your team actually works
  • Your engineers focus on building, not reviewing take-homes
  • AI-generated analysis - no more subjective scoring
  • Results ready to share with the hiring panel
Passed
Assessment Results
Feb 15, 2025 · 14:32 UTC
Candidate: Sarah Chen · Scenario: GPU Failure
Resolution: 07:38 · Time Limit: 20:00
Commands: 14 · Hints Used: 0 · LLM Risk: Low
Outcome
Root cause correctly identified
Production-safe fix applied
Service health verified
Timeline
00:00 Session started
01:12 Checked GPU state
03:44 Identified driver conflict
05:21 Applied fix
07:38 Health check passed
Behaviour
Time to root cause: 3:44 · Confidence: High
Command Log
00:12 $ nvidia-smi
NVIDIA-SMI has failed - driver not loaded
00:45 $ lsmod | grep nouveau
nouveau 2461696 1
01:12 $ dmesg | grep -i gpu
[10:14:32] NVRM: GPU has fallen off the bus
05:21 $ modprobe -r nouveau && modprobe nvidia
Loading nvidia driver...
06:30 $ nvidia-smi
GPU 0: NVIDIA A100 | 45C | 32W
07:38 $ curl -s localhost:8080/health
{"status":"healthy"}

Frequently Asked Questions

Everything you need to know about how Parium works.

What environment do candidates work in?

Candidates connect to a real, isolated Linux environment - not a browser simulation or multiple-choice sandbox. Each assessment spins up a fresh system with the incident pre-configured. They get full terminal access with real bash, real logs, and real system tools. It's the same experience as SSH'ing into a production server.

Which roles is Parium designed for?

Parium is built for any role that requires hands-on Linux troubleshooting: Site Reliability Engineers (SRE), DevOps Engineers, Platform Engineers, Data Center Technicians, Linux System Administrators, Cloud Engineers, and Infrastructure Engineers. Our scenarios range from L1 support tasks (config errors, disk space) to L4 senior-level incidents (GPU driver conflicts, kernel modules, PCIe issues).

How do you detect cheating or outside help?

We monitor for patterns that suggest external help - things like leaving the terminal for extended periods, large paste events, and unusual command timing. Suspicious activity gets flagged in the hiring manager report with enough context for you to make an informed judgment. We can't catch everything, but the patterns are usually pretty obvious.
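As a toy illustration of the "unusual command timing" signal (not Parium's actual detector), even a crude characters-per-second threshold separates typed commands from pasted ones:

```shell
# Flags lines whose "command" arrived faster than plausible typing speed.
# Input format (an assumption for this sketch): seconds_spent<TAB>command
flag_paste() {
  awk -F'\t' 'BEGIN { limit = 25 }                # >25 chars/sec ~ paste
    {
      secs = ($1 > 0) ? $1 : 0.1                  # guard against divide-by-zero
      if (length($2) / secs > limit) print "FLAG: " $2
    }'
}

printf '0.4\tkubectl get pods -n kube-system -o wide\n6.0\tls -la\n' | flag_paste
```

A real detector would combine several weak signals like this one with tab-switch and focus events before flagging anything.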

How is an assessment scored?

When the candidate clicks "Verify Fix," we run a health check against the scenario's success criteria (e.g., curl the API endpoint, check nvidia-smi output). If it passes, we record their time-to-resolution. The hiring manager gets a full report: every command with timestamps, hints used, suspicious activity flags, and an AI-generated analysis of their troubleshooting approach and methodology.
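The "Verify Fix" step described above boils down to asserting success criteria against live output. A self-contained sketch of the JSON-status half of such a check; the endpoint shape is assumed from the demo transcript, not Parium's actual API:

```shell
# In the real flow the body would come from e.g.:  curl -s localhost:8080/health
check_health() {
  # Pass only if the JSON body reports status "healthy".
  echo "$1" | grep -q '"status"[[:space:]]*:[[:space:]]*"healthy"' \
    && echo PASS || echo FAIL
}

check_health '{"status":"healthy","gpus":2}'
check_health '{"status":"degraded"}'
```

A production checker would also assert the secondary criteria (here, the GPU count) and the HTTP status code, not just the body text.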

How is this different from HackerRank or Codility?

HackerRank, Codility, and similar platforms test algorithmic coding in sandboxed editors. Parium tests operational skills in real Linux environments. Your SRE candidates don't need to reverse a linked list - they need to figure out why nginx won't start or why the GPU driver isn't loading. We measure how they investigate, not whether they memorised the answer.

Can you build custom scenarios for our stack?

Yes. We can build scenarios that mirror your actual production environment - your monitoring tools, your deployment setup, your common failure modes. Whether it's Kubernetes on EKS, GPU clusters with SLURM, or legacy systems with custom daemons, we'll create assessments that test exactly what your team deals with day-to-day. Get in touch to discuss.

What do we see beyond pass/fail?

Beyond pass/fail, we give you session replay - watch exactly how candidates approached the problem. You'll see every command they ran, when they pasted content (and what they pasted), when they switched tabs, how long they were away, and when they used hints. It's like watching over their shoulder, but asynchronously. You see how they think, not just whether they got the answer.

Consistent by design

Every candidate gets the same scenario, the same environment, the same success criteria. No more "it depends on who reviewed it." Structured evaluation that gives every candidate a fair shot.

01

Same scenario, every time

No variation between candidates. Everyone faces the same incident with the same tools available.

02

Objective criteria

Clear pass/fail based on whether the fix works — not on how well someone writes a README or formats their code.

03

Data-driven decisions

Time-to-resolution, commands used, hints requested. Compare candidates on the metrics that matter.

Get started with Parium

Whether you need a custom scenario for your stack, want to discuss enterprise pricing, or just have questions, we'd love to hear from you.

Request a callback

Ready to hire engineers you'd trust on call?

See real incident performance before you hire.

Run a Demo Incident Contact Sales