Delivery Manager 12 min

Incident Management for Delivery Leaders

When production breaks, the Delivery Manager's job isn't to fix the code — it's to ensure the response is coordinated, communication is clear, and the organisation learns from every incident to prevent recurrence.

The Delivery Manager's Role in Incidents

You're not the on-call engineer. You're not debugging the root cause. Your role during an incident is to ensure:

1. The right people are engaged and coordinated
2. Stakeholders are informed at the right cadence
3. The response follows a structured process (not a panic)
4. Recovery is prioritised over root cause analysis (fix first, learn later)
5. After recovery, the organisation learns and improves

SRE best practice, as commonly described in 2026, frames incident management as a five-stage model: Prepare, Detect, Respond, Recover, Learn. The Delivery Manager owns the "Respond" coordination and the "Learn" follow-through.

The Incident Response Framework

Severity Classification

Define severity levels before incidents happen. When production is down, you don't want to debate whether it's a P1 or P2.

P1 — Critical: Service is completely unavailable or data integrity is compromised. All customers affected. Revenue impact immediate.

Response: All-hands. War room activated. Stakeholder communication every 30 minutes.
Target resolution: < 1 hour

P2 — Major: Significant degradation affecting many customers. Core functionality impaired but workarounds exist.

Response: On-call team plus relevant engineers. The Delivery Manager (DM) coordinates communication.
Target resolution: < 4 hours

P3 — Minor: Limited impact. Small subset of customers affected. Non-critical functionality impaired.

Response: On-call team handles. DM informed but not actively involved.
Target resolution: < 24 hours

P4 — Low: Cosmetic issues, minor bugs, no customer impact.

Response: Normal sprint work. No incident process needed.
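
One way to make these levels unambiguous before an incident is to record them as data. The sketch below is illustrative only: the names `Severity`, `SeverityPolicy`, `POLICIES` and `classify` are hypothetical, the P2 update interval is taken from the communication cadence described later in this playbook, and the `classify` rules are one simplified reading of the definitions above, not a definitive triage procedure.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    P1 = "Critical"
    P2 = "Major"
    P3 = "Minor"
    P4 = "Low"


@dataclass(frozen=True)
class SeverityPolicy:
    response: str
    update_interval: Optional[timedelta]    # None: no scheduled status updates
    target_resolution: Optional[timedelta]  # None: no incident target


POLICIES = {
    Severity.P1: SeverityPolicy("All-hands, war room activated",
                                timedelta(minutes=30), timedelta(hours=1)),
    Severity.P2: SeverityPolicy("On-call team plus relevant engineers",
                                timedelta(hours=1), timedelta(hours=4)),
    Severity.P3: SeverityPolicy("On-call team handles; DM informed",
                                None, timedelta(hours=24)),
    Severity.P4: SeverityPolicy("Normal sprint work", None, None),
}


def classify(service_down: bool, data_integrity_at_risk: bool,
             customer_scope: str, core_impaired: bool) -> Severity:
    """Map observed impact to a severity level.

    customer_scope is an assumed vocabulary: "all", "many", "few" or "none".
    """
    if service_down or data_integrity_at_risk:
        return Severity.P1
    if core_impaired and customer_scope in ("all", "many"):
        return Severity.P2
    if customer_scope != "none":
        return Severity.P3
    return Severity.P4
```

For example, `classify(False, False, "many", True)` returns `Severity.P2`, and `POLICIES[Severity.P2].target_resolution` gives the four-hour target. Agreeing on the inputs in advance matters more than the exact code: the point is that nobody has to argue about the level while production is down.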

The Incident Commander Role

For P1 and P2 incidents, designate an Incident Commander (IC). This can be the Delivery Manager or a senior engineer — the key is that one person owns coordination:

Declares the incident and severity
Assembles the response team
Coordinates workstreams (diagnosis, fix, communication, customer support)
Makes decisions when the team disagrees on approach
Declares resolution and initiates the post-mortem

Communication During Incidents

Internal communication:

Dedicated Slack channel per incident (not the general channel)
Status updates every 30 minutes for P1, every hour for P2
Clear format: "Current status → What we're trying → Next update at [time]"
Tag stakeholders who need to know — don't make them ask
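
The update format and cadence above can be turned into a small helper so that the next update time is never left to memory. This is a minimal sketch: `UPDATE_INTERVALS` and `format_status_update` are hypothetical names, and posting the message to Slack is deliberately left out.

```python
from datetime import datetime, timedelta

# Cadence from the list above: every 30 minutes for P1, every hour for P2.
UPDATE_INTERVALS = {"P1": timedelta(minutes=30), "P2": timedelta(hours=1)}


def format_status_update(severity: str, current_status: str,
                         trying: str, now: datetime) -> str:
    """Build an update in the 'status → trying → next update' format."""
    next_update = now + UPDATE_INTERVALS[severity]
    return (
        f"[{severity}] Current status: {current_status} "
        f"→ What we're trying: {trying} "
        f"→ Next update at {next_update:%H:%M}"
    )


print(format_status_update(
    "P1",
    "Checkout API returning errors",
    "Rolling back the last deployment",
    datetime(2025, 1, 15, 14, 5),
))
```

With the sample values shown, the message ends with "Next update at 14:35". Committing to a time in every message is what stops stakeholders from having to ask.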

External communication (customers):

Status page updated within 15 minutes of detection
Honest about impact: "Some users are experiencing..." not "We're investigating an issue"
Estimated resolution time (even if uncertain): "We expect to resolve within 2 hours"
Resolution confirmation with brief explanation

Stakeholder communication:

Executive summary within 30 minutes of P1 declaration
Format: What happened → Customer impact → Current status → Expected resolution → What we need
Don't wait for full understanding before communicating — share what you know

Recovery Over Root Cause

During an active incident, the priority is restoring service — not understanding why it broke. Common recovery actions:

Roll back the last deployment
Scale up infrastructure
Fail over to backup systems
Disable the problematic feature (feature flag)
Redirect traffic away from the affected component

Root cause analysis happens after recovery, in the post-mortem. Never delay recovery to investigate the cause.

Post-Incident Learning

The Blameless Post-Mortem

Within 48 hours of resolution, run a blameless post-mortem. The goal is learning, not blame.

Structure:

1. Timeline: What happened, when, in what order (facts only)
2. Impact: Who was affected, for how long, what was the business cost
3. Root cause: Why did it happen? (Use "5 Whys" to dig deeper)
4. Contributing factors: What made detection slow? What made recovery hard?
5. Action items: What will we change to prevent recurrence?

Blameless principles:

Focus on the system, not the person ("The deployment pipeline didn't catch this" not "John deployed broken code")
Assume everyone acted with the best information available at the time
Ask "what" and "how" questions, not "who" and "why didn't you"
Publish the post-mortem widely — transparency builds trust

Action Item Follow-Through

Post-mortem actions are worthless if they're not completed. The Delivery Manager owns follow-through:

Every action has an owner and a deadline
Actions are tracked in the team's backlog (not a separate document that gets forgotten)
Review outstanding post-mortem actions in the weekly delivery review
Escalate overdue actions. If the same root cause triggers a second incident, the follow-through process has failed.
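
The weekly review is easier if overdue actions can be listed mechanically. The following is a rough sketch, assuming action items can be exported from the backlog into a simple structure; `ActionItem` and `overdue_actions` are hypothetical names, not part of any tracker's API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    done: bool = False


def overdue_actions(actions: List[ActionItem], today: date) -> List[ActionItem]:
    """Return open actions past their deadline, oldest deadline first."""
    late = [a for a in actions if not a.done and a.due < today]
    return sorted(late, key=lambda a: a.due)


# Illustrative data only.
actions = [
    ActionItem("Add certificate expiry alert", "Priya", date(2025, 1, 10)),
    ActionItem("Document rollback runbook", "Sam", date(2025, 1, 3)),
    ActionItem("Fix pipeline smoke test", "Alex", date(2025, 1, 5), done=True),
]

for item in overdue_actions(actions, today=date(2025, 1, 15)):
    print(f"OVERDUE: {item.title} (owner: {item.owner}, due {item.due})")
```

Completed items and items not yet due are excluded, so the list that reaches the review contains only what needs escalation.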

Incident Metrics

Track over time:

MTTR (mean time to recovery) by severity: Are we getting faster at recovery?
Incident frequency: Are incidents becoming less common?
Repeat incidents: Are the same root causes recurring? (Indicates failed follow-through)
Detection time: How long between incident start and detection? (Indicates observability gaps)
Post-mortem completion rate: Are post-mortems happening within 48 hours?
Action completion rate: Are post-mortem actions being completed on time?
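
Several of these metrics reduce to simple arithmetic over incident timestamps. The sketch below is one possible calculation, not a standard: the `Incident` record and function names are hypothetical, and MTTR is assumed here to run from incident start to resolution (some teams measure from detection instead).

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    severity: str
    started: datetime
    detected: datetime
    resolved: datetime
    root_cause: str


def mttr_by_severity(incidents):
    """Mean start-to-resolution time per severity level."""
    durations = defaultdict(list)
    for i in incidents:
        durations[i.severity].append((i.resolved - i.started).total_seconds())
    return {sev: timedelta(seconds=mean(v)) for sev, v in durations.items()}


def mean_detection_time(incidents):
    """Mean gap between incident start and detection."""
    gaps = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(gaps))


def repeat_root_causes(incidents):
    """Root causes recorded against more than one incident."""
    counts = Counter(i.root_cause for i in incidents)
    return {cause: n for cause, n in counts.items() if n > 1}
```

A repeat entry from `repeat_root_causes` is the signal described above: an earlier post-mortem action was either not completed or did not address the cause. The function assumes root causes are recorded with consistent labels, which in practice needs some agreed taxonomy.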

Building Incident Readiness

Don't wait for incidents to build your response capability:

Runbooks: Document recovery procedures for known failure modes. When production is down at 2am, engineers shouldn't be figuring out the rollback process from scratch.

On-call rotation: Ensure clear ownership of who responds first. Rotate fairly. Compensate appropriately.

Game Days: Periodically simulate incidents to practise the response process. Inject failures in staging and run through the full incident lifecycle.

Observability investment: You can't recover from what you can't detect. Invest in monitoring, alerting, and dashboards that surface problems before customers report them.

Communication templates: Pre-written templates for status page updates, stakeholder emails, and internal announcements. Fill in the specifics during the incident rather than composing from scratch under pressure.

---

Download the [Escalation Framework template](/templates) to define your incident severity levels and response procedures.
