How is this different from hiring assessments?

The engine is the same, but the purpose is different. Hiring assessments filter external candidates, while Team Drills develop internal engineers and measure readiness.

Who typically owns the drill program?

Engineering managers, SRE leads, platform leaders, or enablement owners usually own the drill program.

Can this help with Kubernetes adoption?

Yes. Teams learn Kubernetes operations faster by repeatedly debugging realistic failures than by relying only on documentation or tutorials.

How does it connect to Chaos Mode and CLI?

Teams can start with solo drills, move into Chaos Mode for collaboration practice, and use the CLI for native terminal workflows across both.

Can we track improvement over time?

Yes. Each drill session records metrics like time to resolution, commands used, hints requested, and verification steps so progress can be measured over time.

Drills · Internal training

Stop guessing who's ready for on-call

Your team reads runbooks. But can they execute under pressure? Team Drills takes the same incident engine used for hiring and turns it into a training system for your existing engineers.

Onboarding · Upskilling · Readiness Checks · K8s Adoption

drill session - node pressure

root@node-17:~$ kubectl get nodes
NAME      STATUS                     ROLES    AGE
node-17   Ready,SchedulingDisabled   worker   47d
node-18   Ready                      worker   47d
node-19   Ready                      worker   47d

root@node-17:~$ kubectl describe node node-17 | grep Pressure
  MemoryPressure   True
  DiskPressure     False
  PIDPressure      False

root@node-17:~$ kubectl top pods -A --sort-by=memory | head -5
NAMESPACE   NAME                    CPU    MEMORY
monitoring  prometheus-0            120m   4.2Gi
edge        api-gateway-7d4f8       45m    1.1Gi
logging     fluentd-8k2x9           30m    890Mi

Goal: diagnose, cordon, drain, remediate, verify
Training: competence becomes specific and measurable
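
The goal line above names five phases. As a rough sketch, they could map onto a kubectl command sequence like the one below. The function name `node_pressure_plan`, the choice of drain flags, and the use of `kubectl uncordon` in the verify phase are assumptions for illustration, not something the drill prescribes. The script only prints the commands and does not run them.

```python
from __future__ import annotations

import shlex


def node_pressure_plan(node: str) -> list[list[str]]:
    """Return kubectl commands for the drill phases, in order.

    Nothing is executed here. The plan is only built so it can be
    printed or reviewed. The drain flags are a common choice and an
    assumption of this sketch.
    """
    return [
        # Diagnose: inspect the node's conditions (MemoryPressure etc.).
        ["kubectl", "describe", "node", node],
        # Cordon: stop new pods from being scheduled onto the node.
        ["kubectl", "cordon", node],
        # Drain: evict running pods, leaving DaemonSet pods in place.
        ["kubectl", "drain", node,
         "--ignore-daemonsets", "--delete-emptydir-data"],
        # Remediate is scenario-specific (for example, fixing the
        # workload that is consuming memory), so no command is listed.
        # Verify: return the node to service and check its status.
        ["kubectl", "uncordon", node],
        ["kubectl", "get", "nodes"],
    ]


if __name__ == "__main__":
    for argv in node_pressure_plan("node-17"):
        print(shlex.join(argv))
```

Keeping the plan as data rather than running it directly makes it easy to review the sequence before anyone touches a cluster.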

Why Drills

The difference between "I read the runbook" and "I've done this before"

Most teams discover gaps during real incidents. Team Drills surfaces those gaps in a controlled environment, before the pager goes off.

Traditional training

Slide decks and documentation reviews
"Shadow an on-call for a month" and hope they learn
No way to measure readiness objectively
Gaps only discovered during real outages

With Team Drills

Live incidents on real systems, not theory
Structured drill programs with tracked results
Session replay: see exactly how they debug
Readiness becomes measurable, not assumed

How It Works

Run a drill in three steps

Same incident engine as hiring. Different purpose: build capability instead of filtering candidates.

Pick the learning target

Choose from existing scenarios (Kubernetes node pressure, GPU diagnostics, Azure networking, Docker security) or request custom ones that mirror your production stack.

Run a live drill

Each engineer enters a real environment and works the incident. No mock data, no multiple choice. They investigate, remediate, and verify on a live system that behaves like production.

Review the evidence

Managers and leads coach from actual session data: command history, time to root cause, verification steps, hints used. Evidence replaces post-hoc storytelling.

Training Targets

What your team can drill on

Every scenario available for hiring is available for internal training. Pick the technology or skill gap you want to close.

KUBERNETES · Cluster operations
Pod failures, node pressure, network policies, etcd recovery, cascading drain storms. From L2 basics to L4 control-plane work.

LINUX · System fundamentals
Disk full, runaway processes, service recovery, performance tuning, kernel module debugging. The foundation everything else sits on.

GPU / AI INFRA · Accelerator operations
Xid errors, driver conflicts, PCIe failures, DCGM diagnostics. Critical for teams running ML training or inference at scale.

CLOUD · Azure & networking
Load balancer misconfigs, NSG rule tracing, health probe debugging, bind address issues. Multi-layer cloud networking scenarios.

DOCKER · Container security
Read-only filesystems, container escapes, privilege escalation, network isolation. For teams building container-native platforms.

INCIDENT POSTURE · Response methodology
Beyond technical skill: how engineers approach debugging. Systematic investigation, verification discipline, knowing when to escalate.

One Platform

Hire with assessments. Train with drills. Evolve into Chaos Mode.

The best results come from using all three. Assessments filter candidates. Drills build internal capability. Chaos Mode tests collaboration. Same scenarios, same engine, three different motions.

Start with solo drills for baseline validation
Graduate to Chaos Mode when team coordination matters
Use CLI for engineers who prefer native terminals
Track progress across sessions and scenarios

drill results

  PARIUM / drill summary

  ENGINEER   Alex Chen
  SCENARIO   K8s Node Pressure
  RESULT     ● RESOLVED
  TIME       08:42 / 20:00 limit

  ────────────────────────────────
  ✓ Root cause identified    03:12
  ✓ Remediation applied      06:45
  ✓ Health check verified    08:42
  ────────────────────────────────

  Commands: 18   Hints: 0   LLM risk: Low

  Manager note: Clean investigation path.
  Verified before declaring resolved.
  Ready for on-call rotation.
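
The milestone timestamps in the summary are cumulative, so the time spent in each phase is the difference between consecutive stamps. A small sketch of that arithmetic, using the values shown above (the helper names are hypothetical):

```python
from __future__ import annotations

# Cumulative milestone stamps copied from the drill summary above.
MILESTONES = [
    ("Root cause identified", "03:12"),
    ("Remediation applied", "06:45"),
    ("Health check verified", "08:42"),
]
TIME_LIMIT = "20:00"


def to_seconds(mmss: str) -> int:
    """Convert an 'MM:SS' stamp to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def fmt(seconds: int) -> str:
    """Format whole seconds back into 'MM:SS'."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


def phase_durations(milestones: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Return (label, seconds spent) for each phase, in order."""
    durations = []
    previous = 0
    for label, stamp in milestones:
        current = to_seconds(stamp)
        durations.append((label, current - previous))
        previous = current
    return durations


if __name__ == "__main__":
    for label, spent in phase_durations(MILESTONES):
        print(f"{label:<24}{fmt(spent)}")
    total = to_seconds(MILESTONES[-1][1])
    print(f"Share of limit used: {total / to_seconds(TIME_LIMIT):.1%}")
```

For the session above, this works out to 03:12 to find the root cause, 03:33 to remediate, and 01:57 to verify, with 43.5% of the 20:00 limit used.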

Use Cases

Four programs your team can run today

New engineer onboarding

Week one: run them through your core scenarios. Week four: run the same scenarios again. You'll have data on how fast they're ramping instead of vibes from a 1:1.

Technology migration readiness

Moving to Kubernetes? Adopting GPU workloads? Run your team through the relevant scenarios before the migration goes live. Find gaps when the cost of a mistake is zero.

On-call readiness validation

Before someone goes on the rotation, they should be able to handle the incidents your team actually sees. Drills give you evidence instead of "they seem ready" from a skip-level.

Quarterly team exercises

Run your entire SRE team through a drill each quarter. Track improvement. Identify who needs coaching. Build the kind of incident response culture that makes 3am pages less terrifying.

FAQ

Common questions about Team Drills

How is this different from hiring assessments?

Same engine, different purpose. Hiring assessments filter external candidates. Team Drills develop internal engineers. The runtime is identical (real containers, real scenarios, real scoring) but the goal shifts from "should we hire this person?" to "is this person ready for on-call?"

Who typically owns the drill program?

Engineering managers, SRE leads, platform team leads, or enablement owners. Anyone who needs to know whether their team can actually handle production incidents, and wants evidence instead of assumptions.

Can this help with Kubernetes adoption?

This is one of the best use cases. Teams learn Kubernetes operations far faster by repeatedly debugging realistic failures (pod crash-loops, node pressure, network policies, etcd issues) than by reading documentation or watching tutorials.

How does it connect to Chaos Mode and CLI?

Start with solo drills for individual baseline validation. When the team is ready, graduate into Chaos Mode for collaborative war room practice. Engineers who prefer native terminals can use the CLI for both. It's one platform with three training motions.

Can we track improvement over time?

Yes. Each drill session is recorded with full metrics: time to resolution, commands used, hints requested, verification steps. Run the same scenario at different points and compare. The data tells you if training is working.
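
One way to picture that comparison is a per-metric delta between two runs of the same scenario. The sketch below assumes a hypothetical `DrillSession` record holding the four metrics named above. The field names and sample numbers are made up for illustration and do not reflect the platform's actual export format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DrillSession:
    """Hypothetical record of one drill run (field names are assumptions)."""
    engineer: str
    scenario: str
    resolution_seconds: int
    commands_used: int
    hints_requested: int
    verification_steps: int


def compare(earlier: DrillSession, later: DrillSession) -> dict[str, int]:
    """Return later-minus-earlier deltas for each recorded metric."""
    if (earlier.engineer, earlier.scenario) != (later.engineer, later.scenario):
        raise ValueError("sessions must share the same engineer and scenario")
    metrics = ("resolution_seconds", "commands_used",
               "hints_requested", "verification_steps")
    return {m: getattr(later, m) - getattr(earlier, m) for m in metrics}


if __name__ == "__main__":
    # Illustrative numbers only, not real drill results.
    week_one = DrillSession("example-engineer", "K8s Node Pressure", 1100, 41, 3, 1)
    week_four = DrillSession("example-engineer", "K8s Node Pressure", 640, 22, 0, 2)
    for metric, delta in compare(week_one, week_four).items():
        print(f"{metric:<20}{delta:+d}")
```

Negative deltas for resolution time, commands, and hints, together with a steady or rising count of verification steps, would suggest the engineer is getting faster without skipping checks.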