Incident Management for Delivery Leaders
When production breaks, the Delivery Manager's job isn't to fix the code — it's to ensure the response is coordinated, communication is clear, and the organisation learns from every incident to prevent recurrence.
The Delivery Manager's Role in Incidents
You're not the on-call engineer. You're not debugging the root cause. Your role during an incident is to ensure:
1. The right people are engaged and coordinated
2. Stakeholders are informed at the right cadence
3. The response follows a structured process (not a panic)
4. Recovery is prioritised over root cause analysis (fix first, learn later)
5. After recovery, the organisation learns and improves
The 2026 SRE best practices emphasise a five-stage model: Prepare, Detect, Respond, Recover, Learn. The Delivery Manager owns the "Respond" coordination and the "Learn" follow-through.
The Incident Response Framework
Severity Classification
Define severity levels before incidents happen. When production is down, you don't want to debate whether it's a P1 or P2.
P1 — Critical: Service is completely unavailable or data integrity is compromised. All customers affected. Revenue impact immediate.
- Response: All-hands. War room activated. Stakeholder communication every 30 minutes.
- Target resolution: < 1 hour
P2 — Major: Significant degradation affecting many customers. Core functionality impaired but workarounds exist.
- Response: On-call team + relevant engineers. DM coordinates communication.
- Target resolution: < 4 hours
P3 — Minor: Limited impact. Small subset of customers affected. Non-critical functionality impaired.
- Response: On-call team handles. DM informed but not actively involved.
- Target resolution: < 24 hours
P4 — Low: Cosmetic issues, minor bugs, no customer impact.
- Response: Normal sprint work. No incident process needed.
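One way to settle severity before an incident happens is to encode the levels as data that tooling and people can both read. The following is a minimal Python sketch, not a prescribed implementation: the names `SeverityPolicy`, `POLICIES`, `policy_for`, and `target_exceeded` are hypothetical, and the values simply mirror the definitions above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    response: str
    target_resolution: Optional[timedelta]  # None means no incident target applies


# Values copied from the severity definitions in this guide.
POLICIES = {
    "P1": SeverityPolicy("Critical", "All-hands, war room activated", timedelta(hours=1)),
    "P2": SeverityPolicy("Major", "On-call team + relevant engineers", timedelta(hours=4)),
    "P3": SeverityPolicy("Minor", "On-call team handles, DM informed", timedelta(hours=24)),
    "P4": SeverityPolicy("Low", "Normal sprint work", None),
}


def policy_for(severity: str) -> SeverityPolicy:
    """Look up a policy, accepting 'p1' as well as 'P1'."""
    try:
        return POLICIES[severity.upper()]
    except KeyError:
        raise ValueError(f"Unknown severity: {severity!r}") from None


def target_exceeded(severity: str, elapsed: timedelta) -> bool:
    """True once an incident has run for at least its target resolution time."""
    target = policy_for(severity).target_resolution
    return target is not None and elapsed >= target
```

For example, `target_exceeded("P1", timedelta(minutes=61))` returns `True`, which could be the trigger for an escalation prompt to the Incident Commander.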
The Incident Commander Role
For P1 and P2 incidents, designate an Incident Commander (IC). This can be the Delivery Manager or a senior engineer — the key is that one person owns coordination:
- Declares the incident and severity
- Assembles the response team
- Coordinates workstreams (diagnosis, fix, communication, customer support)
- Makes decisions when the team disagrees on approach
- Declares resolution and initiates the post-mortem
Communication During Incidents
Internal communication:
- Dedicated Slack channel per incident (not the general channel)
- Status updates every 30 minutes for P1, every hour for P2
- Clear format: "Current status → What we're trying → Next update at [time]"
- Tag stakeholders who need to know — don't make them ask
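The update format and cadence above are easy to template so that nobody composes messages from scratch mid-incident. The sketch below is illustrative only: `UPDATE_INTERVAL` and `format_status_update` are hypothetical names, and posting the message to a chat channel is deliberately left out.

```python
from datetime import datetime, timedelta

# Cadence from the guidance above: every 30 minutes for P1, every hour for P2.
UPDATE_INTERVAL = {
    "P1": timedelta(minutes=30),
    "P2": timedelta(hours=1),
}


def format_status_update(severity: str, status: str, trying: str, now: datetime) -> str:
    """Render an update as 'Current status → What we're trying → Next update at [time]'."""
    next_update = now + UPDATE_INTERVAL[severity]
    return (
        f"[{severity}] Current status: {status} → "
        f"What we're trying: {trying} → "
        f"Next update at {next_update:%H:%M}"
    )


if __name__ == "__main__":
    print(format_status_update(
        "P1",
        "Checkout API returning errors",
        "Rolling back the last deployment",
        datetime.now(),
    ))
```

Committing to the next update time in every message is the design point: stakeholders know when to look again rather than asking in the channel.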
External communication (customers):
- Status page updated within 15 minutes of detection
- Honest about impact: "Some users are experiencing..." not "We're investigating an issue"
- Estimated resolution time (even if uncertain): "We expect to resolve within 2 hours"
- Resolution confirmation with brief explanation
Stakeholder communication:
- Executive summary within 30 minutes of P1 declaration
- Format: What happened → Customer impact → Current status → Expected resolution → What we need
- Don't wait for full understanding before communicating — share what you know
Recovery Over Root Cause
During an active incident, the priority is restoring service — not understanding why it broke. Common recovery actions:
- Roll back the last deployment
- Scale up infrastructure
- Fail over to backup systems
- Disable the problematic feature (feature flag)
- Redirect traffic away from the affected component
Root cause analysis happens after recovery, in the post-mortem. Never delay recovery to investigate cause.
Post-Incident Learning
The Blameless Post-Mortem
Within 48 hours of resolution, run a blameless post-mortem. The goal is learning, not blame.
Structure:
1. Timeline: What happened, when, in what order (facts only)
2. Impact: Who was affected, for how long, what was the business cost
3. Root cause: Why did it happen? (Use "5 Whys" to dig deeper)
4. Contributing factors: What made detection slow? What made recovery hard?
5. Action items: What will we change to prevent recurrence?
Blameless principles:
- Focus on the system, not the person ("The deployment pipeline didn't catch this" not "John deployed broken code")
- Assume everyone acted with the best information available at the time
- Ask "what" and "how" questions, not "who" and "why didn't you"
- Publish the post-mortem widely — transparency builds trust
Action Item Follow-Through
Post-mortem actions are worthless if they're not completed. The Delivery Manager owns follow-through:
- Every action has an owner and a deadline
- Actions are tracked in the team's backlog (not a separate document that gets forgotten)
- Review outstanding post-mortem actions in the weekly delivery review
- Escalate overdue actions: if the same root cause triggers a second incident, the follow-through process has failed
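If the backlog tool can export post-mortem actions, the weekly review can start from a generated list of overdue items. This is a small sketch under that assumption; `ActionItem` and `overdue_actions` are hypothetical names, not part of any particular tracker's API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    done: bool = False


def overdue_actions(actions, today):
    """Return open actions whose deadline has passed, oldest deadline first."""
    late = [a for a in actions if not a.done and a.due < today]
    return sorted(late, key=lambda a: a.due)


if __name__ == "__main__":
    backlog = [
        ActionItem("Add rollback runbook", "Priya", date(2025, 3, 1)),
        ActionItem("Alert on queue depth", "Sam", date(2025, 2, 15)),
        ActionItem("Fix flaky deploy check", "Lee", date(2025, 2, 1), done=True),
    ]
    for item in overdue_actions(backlog, today=date(2025, 3, 10)):
        print(f"OVERDUE: {item.title} (owner: {item.owner}, due {item.due})")
```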
Incident Metrics
Track over time:
- MTTR (mean time to recovery) by severity: Are we getting faster at recovery?
- Incident frequency: Are incidents becoming less common?
- Repeat incidents: Are the same root causes recurring? (Indicates failed follow-through)
- Detection time: How long between incident start and detection? (Indicates observability gaps)
- Post-mortem completion rate: Are post-mortems happening within 48 hours?
- Action completion rate: Are post-mortem actions being completed on time?
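The first few of these metrics can be computed from three timestamps per incident. The sketch below assumes MTTR is measured from detection to resolution (teams differ on whether to start the clock at incident start instead), and the `Incident` record and function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    severity: str
    started: datetime   # when the problem actually began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mttr_by_severity(incidents):
    """Mean time to recovery per severity (assumption: detection to resolution)."""
    durations = defaultdict(list)
    for inc in incidents:
        durations[inc.severity].append((inc.resolved - inc.detected).total_seconds())
    return {sev: timedelta(seconds=mean(vals)) for sev, vals in durations.items()}


def mean_detection_time(incidents):
    """Average gap between incident start and detection; needs at least one incident."""
    gaps = [(inc.detected - inc.started).total_seconds() for inc in incidents]
    return timedelta(seconds=mean(gaps))
```

Comparing these values quarter over quarter, rather than reading a single number in isolation, is what answers the questions in the list above.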
Building Incident Readiness
Don't wait for incidents to build your response capability:
Runbooks: Document recovery procedures for known failure modes. When production is down at 2am, engineers shouldn't be figuring out the rollback process from scratch.
On-call rotation: Ensure clear ownership of who responds first. Rotate fairly. Compensate appropriately.
Game Days: Periodically simulate incidents to practise the response process. Inject failures in staging and run through the full incident lifecycle.
Observability investment: You can't recover from what you can't detect. Invest in monitoring, alerting, and dashboards that surface problems before customers report them.
Communication templates: Pre-written templates for status page updates, stakeholder emails, and internal announcements. Fill in the specifics during the incident rather than composing from scratch under pressure.
---
Download the [Escalation Framework template](/templates) to define your incident severity levels and response procedures.