The Four Conditions You Need To Separate

Not all node failures are equal. Treating them as one bucket is how incidents drag on.

Split quickly into:

  • DiskPressure
  • MemoryPressure
  • PIDPressure
  • NotReady
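The split can be read straight off each node's conditions. A minimal read-only sketch — the awk bucketing and the sample lines are illustrative stand-ins, not live cluster output:

```shell
# Read-only classification. On a live cluster, feed this awk filter with:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[*]}{.type}={.status}{";"}{end}{"\n"}{end}'
classify() {
  awk -F'\t' '{
    if ($2 ~ /DiskPressure=True/)         print $1 ": disk pressure"
    if ($2 ~ /MemoryPressure=True/)       print $1 ": memory pressure"
    if ($2 ~ /PIDPressure=True/)          print $1 ": PID pressure"
    if ($2 ~ /Ready=False|Ready=Unknown/) print $1 ": NotReady"
  }'
}

# Hypothetical sample lines standing in for real cluster output:
printf 'node-a\tReady=True;DiskPressure=False;\nnode-b\tReady=True;DiskPressure=True;\nnode-c\tReady=Unknown;\n' | classify
# Prints:
#   node-b: disk pressure
#   node-c: NotReady
```

Note that a `Ready` status of `Unknown` (kubelet stopped reporting) and `False` (kubelet reporting unhealthy) both land in the NotReady bucket but usually have different root causes.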

1. Get Fleet Shape In Two Commands

kubectl get nodes
kubectl top nodes

Note that top nodes requires the metrics-server add-on; if it errors, fall back to the per-node describe output below.

Then classify each impacted node with:

kubectl describe node <node-name>

2. Use Safe Maintenance Order

Before touching node internals:

kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets

Add --force only when bare (unmanaged) pods block the drain and losing them is acceptable: no controller will recreate a pod deleted that way. Pods using emptyDir volumes also need --delete-emptydir-data, which discards their local data.

Cordoning stops new pods from landing on the node; draining evicts the existing ones so workloads reschedule elsewhere while you repair.
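A successful cordon is visible in the node spec, and checking it before draining guards against operating on the wrong node. A grep-based sketch against a hypothetical manifest fragment; on a live cluster, read the field with kubectl as shown in the comment:

```shell
# Live cluster check: cordon sets .spec.unschedulable to true.
#   kubectl get node <node-name> -o jsonpath='{.spec.unschedulable}'   # expect: true
# Hypothetical sample fragment standing in for the real node object:
node_json='{"spec":{"unschedulable":true}}'
printf '%s' "$node_json" | grep -q '"unschedulable":true' && echo "cordoned"
# Prints: cordoned
```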

3. Apply Class-Specific Fixes

  • Disk pressure: remove image/log buildup and prune stale artifacts.
  • Memory pressure: kill runaway processes and verify allocation headroom.
  • PID pressure: identify fork storms and recover process table capacity.
  • NotReady: restore kubelet/runtime config and restart services deliberately.
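On the node itself, the class-specific fixes above map to commands along these lines. The crictl usage and the kubelet/containerd unit names assume a typical containerd-based Linux node; adjust for your runtime and distro:

```shell
# Disk pressure: reclaim image and log space.
crictl rmi --prune              # remove unused container images
journalctl --vacuum-size=500M   # cap systemd journal disk usage

# Memory pressure: find the biggest consumers by resident set size.
ps -eo pid,rss,comm --sort=-rss | head -n 10

# PID pressure: spot fork storms by counting processes per command.
ps -e -o comm= | sort | uniq -c | sort -rn | head -n 10

# NotReady: inspect kubelet/runtime state, then restart deliberately.
systemctl status kubelet containerd
journalctl -u kubelet --since "15 min ago" | tail -n 50
systemctl restart containerd && systemctl restart kubelet
```

Restart the runtime before the kubelet: the kubelet goes NotReady again immediately if its container runtime socket is down.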

4. Verify, Then Return Capacity

kubectl uncordon <node-name>
kubectl get nodes
kubectl get --raw='/readyz?verbose'

A node is not recovered until it is schedulable, stable, and verified in cluster health metrics.
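The Ready condition can gate the uncordon rather than leaving it to eyeballing. A sketch in which get_ready_status is a hypothetical stub standing in for the kubectl jsonpath query shown in the comment:

```shell
# Stub for: kubectl get node "$node" \
#   -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
get_ready_status() { echo "True"; }   # hypothetical; replace with the real query

# Poll until the node reports Ready, up to 5 attempts.
wait_for_ready() {
  for _ in 1 2 3 4 5; do
    [ "$(get_ready_status)" = "True" ] && return 0
    sleep 5
  done
  return 1
}

wait_for_ready && echo "node Ready; safe to uncordon"
# Prints: node Ready; safe to uncordon
```

With the stub swapped for the real query, a nonzero exit from wait_for_ready means the node never stabilized and should stay cordoned.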