The Four Conditions You Need To Separate
Not all node failures are equal. Treating them as one bucket is how incidents drag on.
Split quickly into:
- DiskPressure
- MemoryPressure
- PIDPressure
- NotReady
1. Get Fleet Shape In Two Commands
kubectl get nodes
kubectl top nodes
Then classify each impacted node with:
kubectl describe node <node-name>
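Scanning `describe` output by eye is slow across a fleet. A minimal classification sketch, assuming a helper named `classify_node` (hypothetical, not a kubectl feature) fed by a jsonpath query over node conditions:

```shell
# Hypothetical helper: reads "Type Status" pairs on stdin (one per line,
# as produced by the jsonpath query below) and prints the first failing
# class, or "healthy" if none of the four conditions is tripped.
classify_node() {
  while read -r type status; do
    case "$type $status" in
      "DiskPressure True")           echo "disk-pressure";   return ;;
      "MemoryPressure True")         echo "memory-pressure"; return ;;
      "PIDPressure True")            echo "pid-pressure";    return ;;
      "Ready False"|"Ready Unknown") echo "not-ready";       return ;;
    esac
  done
  echo "healthy"
}

# Against a live cluster, feed it a node's real conditions:
# kubectl get node <node-name> \
#   -o jsonpath='{range .status.conditions[*]}{.type} {.status}{"\n"}{end}' \
#   | classify_node
```

Note that `Ready` is the one condition where `True` is good; the pressure conditions are the inverse.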
2. Use Safe Maintenance Order
Before touching node internals:
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
Add --force only when you must evict pods that have no controller; those pods are deleted permanently, not rescheduled. Cordon stops new scheduling and drain evicts existing workloads, so repairs do not race live traffic.
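Because drain flags vary by policy, it can help to build the command as a reviewable string before running it. A sketch, assuming a hypothetical `drain_cmd` helper and a 300s timeout (both assumptions, adjust per policy):

```shell
# Hypothetical helper: emit the drain invocation as a string so it can be
# reviewed or logged before execution. The flags beyond --ignore-daemonsets
# are assumptions; tune them for your cluster's eviction policy.
drain_cmd() {
  echo "kubectl drain $1 --ignore-daemonsets --delete-emptydir-data --timeout=300s"
}

# Review, then execute:
#   drain_cmd worker-1          # inspect the command first
#   eval "$(drain_cmd worker-1)"
```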
3. Apply Class-Specific Fixes
- Disk pressure: remove image/log buildup and prune stale artifacts.
- Memory pressure: kill runaway processes and verify allocation headroom.
- PID pressure: identify fork storms and recover process table capacity.
- NotReady: restore kubelet/runtime config and restart services deliberately.
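The fixes above run on the node itself, not through kubectl. A sketch mapping each class to a first diagnostic command, assuming a containerd runtime with crictl available and a hypothetical `fix_hint` helper (paths and commands are assumptions; adjust for your runtime):

```shell
# Hypothetical helper: print the first on-node diagnostic for each class.
# Assumes containerd + crictl; swap in your runtime's equivalents.
fix_hint() {
  case "$1" in
    disk-pressure)   echo "df -h /var/lib/containerd && crictl rmi --prune" ;;
    memory-pressure) echo "ps -eo pid,rss,comm --sort=-rss | head -15" ;;
    pid-pressure)    echo "ps -e --no-headers | wc -l; cat /proc/sys/kernel/pid_max" ;;
    not-ready)       echo "systemctl status kubelet containerd" ;;
    *)               return 1 ;;
  esac
}
```

For PID pressure, comparing the live process count against `pid_max` shows how close the table is to exhaustion; for NotReady, check the kubelet and runtime services before restarting anything.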
4. Verify, Then Return Capacity
kubectl uncordon <node-name>
kubectl get nodes
kubectl get --raw='/readyz?verbose'
A node is not recovered until it is schedulable, stable, and verified in cluster health metrics.
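"Schedulable and stable" is worth automating rather than eyeballing. A polling sketch, assuming a hypothetical `wait_ready` wrapper and a 30-attempt / 10-second interval (both assumptions):

```shell
# Expects the Ready condition's status string, as returned by:
#   kubectl get node <name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
node_ready() {
  [ "$1" = "True" ]
}

# Hypothetical polling wrapper: succeed once the node reports Ready,
# fail after 30 attempts. Interval and attempt count are assumptions.
wait_ready() {
  for _ in $(seq 1 30); do
    status=$(kubectl get node "$1" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    node_ready "$status" && { echo "node $1 Ready"; return 0; }
    sleep 10
  done
  return 1
}
```

Run `wait_ready <node-name>` after uncordoning, and only then declare the incident closed.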