🚨 Incident Response
🚨 Incident Response Framework
›Why structured incident response matters
Unstructured incident response: multiple engineers doing duplicate work, no one communicating to stakeholders, random restarts without understanding root cause, same incident happening again next month.
Structured incident response: clear roles, focused investigation, stakeholders informed, postmortem prevents recurrence.
| Severity | Definition | Response time | Resolve within |
|---|---|---|---|
| SEV1 | All users affected, production down | Immediate (24/7) | 1 hour |
| SEV2 | Major features unavailable, significant impact | 15 minutes | 4 hours |
| SEV3 | Minor degradation, workaround available | 1 hour (business hours) | 24 hours |
| SEV4 | Cosmetic issues, no user impact | Next sprint | Sprint cycle |
Severity levels and team roles
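The severity table can also be kept as data so that tooling (a paging bot, a dashboard) reads the same targets the team does. The following is a minimal sketch, not a prescribed implementation: the names `SeverityPolicy`, `SEVERITY_POLICIES`, and `resolve_deadline` are hypothetical, and the sprint-based SEV4 targets are left as `None` because the table does not state a sprint length.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    definition: str
    respond_within: Optional[timedelta]  # None = picked up in the next sprint
    resolve_within: Optional[timedelta]  # None = resolved within a sprint cycle
    business_hours_only: bool = False    # applies to the response target


# Values copied from the severity table above.
SEVERITY_POLICIES = {
    "SEV1": SeverityPolicy("All users affected, production down",
                           timedelta(0), timedelta(hours=1)),
    "SEV2": SeverityPolicy("Major features unavailable, significant impact",
                           timedelta(minutes=15), timedelta(hours=4)),
    "SEV3": SeverityPolicy("Minor degradation, workaround available",
                           timedelta(hours=1), timedelta(hours=24),
                           business_hours_only=True),
    "SEV4": SeverityPolicy("Cosmetic issues, no user impact", None, None),
}


def resolve_deadline(severity: str, declared_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable resolution time, or None for sprint-scheduled work.

    Assumption: wall-clock time is used; business hours are not modelled here.
    """
    policy = SEVERITY_POLICIES[severity]
    if policy.resolve_within is None:
        return None
    return declared_at + policy.resolve_within
```

For example, `resolve_deadline("SEV2", declared_at)` gives a time four hours after the incident was declared, which an Incident Commander could post alongside the declaration.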
📋 Incident Runbook
›Detect → Assess → Mitigate → Resolve phases
📝 Blameless Postmortem
›Postmortem template + 5 Whys + action items
🔧 Tools & Commands
›Alerting tools + quick investigation commands + rollback
🎯 Interview Questions
INCIDENT · ENGINEER
What is the difference between an incident and a problem in ITSM?
In ITSM (IT Service Management, as defined by the ITIL framework), an Incident is an unplanned interruption or degradation of a service: something is broken right now. The goal is to restore service as fast as possible; root cause analysis can wait. A Problem is the underlying cause of one or more incidents, and Problem management investigates root causes to prevent future incidents.
Example: on Monday morning the payment service is down (Incident). The team restores service by restarting pods. Later that week, Problem management investigates why the pods crashed and discovers a memory leak in a new library version. Fixing the library prevents future incidents.
In DevOps practice we use simpler terminology: Incident (acute, restore now) and Postmortem (root cause analysis, prevent recurrence). The ITSM distinction still matters at enterprise accounts (banks, telcos, HPE-scale organizations) where formal ITSM processes are required for compliance and change management.
INCIDENT · ARCHITECT
How do you build a blameless postmortem culture?
Blameless postmortems require a top-down commitment that the goal is system improvement, not punishment. The foundational principle: engineers make the best decisions they can with the information available at the time. If the system is designed so that a single engineer's mistake can cause a major outage, that is a system design problem, not a human failure.
Practices:
- No names in the root cause analysis. Write about the system, not the person.
- Use passive voice: "the deployment was triggered", not "Vishnu triggered the deployment".
- Replace every "could have" or "should have" with "the system lacked".
- Focus on what information was available at decision time, not what we know now.
- Take the Five Whys five levels deep into system failures and never stop at human error.
Example from HPE: we had an incident where a developer accidentally deleted a production namespace. Blameless analysis found three gaps: no RBAC rule preventing namespace deletion, no confirmation prompt for delete operations, and no backup to restore from. That produced three system fixes. Had we blamed the developer, we would have fixed nothing, and the next developer in a stressful 2 a.m. situation would have made the same mistake with the same missing protections.
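The wording rule above (swap "could have" and "should have" for "the system lacked") is mechanical enough to check before a postmortem is published. The following is a small sketch of such a check, offered as one possible aid rather than an established tool: the function name `find_hindsight_phrases` is hypothetical, and only the two phrases named in the answer are matched, so contractions such as "should've" would slip through.

```python
import re

# The two hindsight phrases the answer says to replace with "the system lacked".
HINDSIGHT_PHRASES = re.compile(r"\b(could have|should have)\b", re.IGNORECASE)


def find_hindsight_phrases(draft: str) -> list[tuple[int, str]]:
    """Return (line_number, phrase) pairs for each hindsight phrase in the draft."""
    findings = []
    for line_number, line in enumerate(draft.splitlines(), start=1):
        for match in HINDSIGHT_PHRASES.finditer(line):
            findings.append((line_number, match.group(0).lower()))
    return findings
```

A reviewer could run this over the draft and rewrite each flagged sentence around the missing safeguard instead of the person.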
INCIDENT · PRODUCTION
Production is down. Walk through your first 15 minutes.
Structured response, no panic.
Minute 0-2: acknowledge the alert, which stops duplicate responses. Do a quick assessment: is this real or a monitoring glitch? Check whether multiple signals correlate: alert firing, elevated error rate in Grafana, and user reports in the support channel.
Minute 2-5: declare the incident in the incidents channel, stating what is happening, who is affected, and who the Incident Commander is. Start a shared document or Slack thread for the timeline.
Minute 5-10: understand the scope before acting. Run `kubectl get pods -A | grep -v Running` to list pods that are not in the Running state, and `kubectl get events -A --sort-by=.lastTimestamp` to see cluster events in time order. Check recent deployments: was there a deploy in the last hour? Check the monitoring dashboards: when exactly did the issue start? Correlate that with the deployment history.
Minute 10-15: identify the fastest path to service restoration, not the root cause. If there was a recent deployment, roll it back immediately, even if you are not sure it caused the issue. Rolling back is usually the lowest-risk action (the exception is a release with an irreversible change, such as a database schema migration), and continuing to investigate while users are affected costs more than a premature rollback. If there was no recent deployment, check pod health, scale up replicas, and check database connectivity.
Throughout: communicate status to stakeholders every 15 minutes, even if there is no update. Silence during an incident is worse than bad news.
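The decision points in this walkthrough can be written down as a few small functions, which makes the thresholds explicit and easy to argue about in a review. This is a sketch under stated assumptions: all function and constant names are hypothetical, and "at least two agreeing signals" is an assumed reading of "multiple signals correlate". The one-hour deploy window and the 15-minute update cadence come from the answer itself.

```python
from datetime import datetime, timedelta
from typing import Optional

RECENT_DEPLOY_WINDOW = timedelta(hours=1)       # "a deploy in the last hour"
STATUS_UPDATE_INTERVAL = timedelta(minutes=15)  # stakeholder update cadence


def is_probably_real(alert_firing: bool, error_rate_elevated: bool,
                     user_reports: bool) -> bool:
    """Treat the incident as real when at least two signals agree (assumed threshold)."""
    return sum([alert_firing, error_rate_elevated, user_reports]) >= 2


def first_mitigation(now: datetime, last_deploy_at: Optional[datetime]) -> str:
    """Pick the fastest restoration path, not the root cause."""
    if last_deploy_at is not None:
        age = now - last_deploy_at
        if timedelta(0) <= age <= RECENT_DEPLOY_WINDOW:
            return "rollback"
    return "check pod health, scale replicas, check database connectivity"


def next_status_update(last_update_at: datetime) -> datetime:
    """Next stakeholder update is due even if there is nothing new to report."""
    return last_update_at + STATUS_UPDATE_INTERVAL
```

Keeping the window and cadence as named constants means a team that prefers, say, a two-hour deploy window changes one line rather than a paragraph of runbook prose.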