Start Wide, Then Go Deep

When an ingress starts returning 503s, teams usually lose time by diving straight into application logs.

A faster sequence is:

  • Validate the ingress -> Service -> Endpoints chain first.
  • Check Deployment readiness next.
  • Confirm the pod-level DNS path once routing is clean.

1. Validate The Data Path

Run the shortest high-signal checks first:

kubectl get ingress -n edge
kubectl get svc,endpoints -n edge
kubectl describe ingress edge-gateway -n edge

If the Endpoints object is empty, the ingress controller may be healthy while the Service selector simply matches no pods.
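As a hedged illustration of that failure mode, here is a hypothetical Service whose selector no longer matches the pod labels (the names, labels, and ports are assumptions, not values from this cluster):

```yaml
# Hypothetical example: the Service selects app=api-gateway, but if the
# Deployment's pod template labels the pods app=api-gw instead, the
# Endpoints object stays empty even though the pods are Running and Ready.
apiVersion: v1
kind: Service
metadata:
  name: api-gateway
  namespace: edge
spec:
  selector:
    app: api-gateway      # must match the pod template labels exactly
  ports:
    - port: 80
      targetPort: 8080
```

Compare `kubectl get svc api-gateway -n edge -o jsonpath='{.spec.selector}'` against the pod labels to confirm or rule this out.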

2. Confirm Readiness Drift

If selectors are fixed but endpoints still churn, inspect probe paths and events:

kubectl describe deployment api-gateway -n edge
kubectl describe pod -n edge <pod-name>
kubectl get events -n edge --sort-by=.metadata.creationTimestamp

Look for probe path mismatches like /readyz versus /healthz.
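A probe drift of that kind looks like the fragment below: the container answers on /healthz, but the probe was written for /readyz (paths, port, and timings are assumptions for illustration):

```yaml
# Hypothetical Deployment fragment: the app actually serves /healthz,
# but the probe requests /readyz, so the pod never becomes Ready and
# is never added to (or is removed from) the Endpoints list.
readinessProbe:
  httpGet:
    path: /readyz        # mismatch: app serves /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```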

3. Verify DNS Egress From The Pod

Once endpoints appear healthy, test resolution from workload context:

kubectl exec -n edge deploy/api-gateway -- nslookup internal-auth.edge.svc.cluster.local

If this fails, inspect NetworkPolicy egress rules for access to kube-dns/CoreDNS.
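A minimal sketch of an egress rule that restores DNS, assuming CoreDNS runs in kube-system with the conventional k8s-app=kube-dns label (verify the labels on your cluster before applying anything like this):

```yaml
# Hypothetical NetworkPolicy: allow DNS egress from all edge workloads
# to CoreDNS on port 53 (UDP and TCP). Labels assume a standard cluster;
# check with: kubectl get pods -n kube-system --show-labels
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: edge
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```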

4. Lock In Verification

A fix is complete only when the user-facing path recovers; re-check the endpoints and the health endpoint (here assuming a local port-forward exposes the service on 8080):

kubectl get endpoints api-gateway -n edge
curl -s http://localhost:8080/health | jq

Publish this workflow as your on-call runbook so engineers can execute it under pressure with no guesswork.