PagerDuty
The on-call standard. Schedules, escalations, alerting integrations, incident response. The thing that wakes engineers up.
The plain-English version
PagerDuty is the category leader for on-call management. It defines who's on-call (rotation schedules), how alerts escalate (primary → secondary after N minutes unacknowledged), and how alerts arrive (push, SMS, voice call). Integrates with virtually every monitoring tool.
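The escalation rule described above (primary first, then the next responder after N minutes without an acknowledgement) can be sketched in a few lines. This is an illustrative model only, not PagerDuty's implementation: the `EscalationStep` type, the target names, and the 15/30-minute timings are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    target: str         # who gets notified at this step (hypothetical names)
    after_minutes: int  # minutes unacknowledged before this step fires


# Hypothetical policy, ordered by delay. Timings are illustrative only.
POLICY = [
    EscalationStep("primary", 0),
    EscalationStep("secondary", 15),
    EscalationStep("engineering-manager", 30),
]


def notified_targets(policy, minutes_unacked):
    """Return every target notified so far for an unacknowledged alert."""
    return [s.target for s in policy if minutes_unacked >= s.after_minutes]


def current_target(policy, minutes_unacked):
    """Return the most recently notified target, or None if nobody yet."""
    notified = notified_targets(policy, minutes_unacked)
    return notified[-1] if notified else None


print(notified_targets(POLICY, 20))
print(current_target(POLICY, 45))
```

Writing the policy down as data like this makes the timings easy to review before they are entered into the real tool.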
The problem it solves
Any team that takes on-call seriously needs four things: scheduling (who's covering what), routing (which alerts go to which team), escalation (what happens if no one responds), and history (incident records, postmortem support). PagerDuty does all of this and has been trusted with it for over a decade.
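To make the scheduling piece concrete, here is a minimal sketch of a fixed-length rotation: given a list of engineers and a start time, it works out who is on-call at any later moment. The function name, the engineer names, and the weekly cadence are assumptions for illustration; a real schedule also has overrides, layers, and time zones that this ignores.

```python
from datetime import datetime, timedelta, timezone


def on_call_at(engineers, rotation_start, shift_length, when):
    """Return who is on-call at `when` for a simple round-robin rotation."""
    if when < rotation_start:
        raise ValueError("time is before the rotation starts")
    # Whole shifts completed since the rotation began.
    shifts_elapsed = (when - rotation_start) // shift_length
    return engineers[shifts_elapsed % len(engineers)]


# Hypothetical team and cadence: weekly handoff, Mondays at 09:00 UTC.
team = ["alice", "bob", "carol"]
start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
week = timedelta(weeks=1)

print(on_call_at(team, start, week, datetime(2024, 1, 10, 3, 0, tzinfo=timezone.utc)))
```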
The words you'll hear
- Service: A thing that pages, usually one app or component.
- Schedule: Rotation of who's primary on-call.
- Escalation policy: Sequence of notifications: primary, then secondary, then manager, etc.
- Incident: An active issue. Acknowledged, resolved, with a timeline.
- Integration: Source of alerts. Datadog, Sentry, custom webhooks, etc.
- Runbook: What to do, linked from the incident.
- Postmortem: Post-incident write-up. Some teams use PagerDuty's; others use FireHydrant, Rootly, incident.io.
Bad vs. good prompt for PagerDuty
Why it works: Specifies team size, rotation cadence, exact escalation timings, integrations, and severity thresholds. The result is reviewable as a real on-call setup.
What bites real teams
Pages at 3am are different from pages at 3pm. Most teams should configure non-urgent severities to wait for business hours.
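One way to express that rule is a small decision function: critical alerts always page, and everything else waits unless it arrives during working hours. The 09:00 to 17:00 weekday window, the severity names, and the return labels below are assumptions for the sketch, not PagerDuty settings; in practice the rule would be configured in the tool rather than in code.

```python
from datetime import datetime, time

# Assumed working hours in the responder's local time, Monday to Friday.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def is_business_hours(local_time):
    """True on weekdays between BUSINESS_START (inclusive) and BUSINESS_END."""
    is_weekday = local_time.weekday() < 5  # Monday is 0
    return is_weekday and BUSINESS_START <= local_time.time() < BUSINESS_END


def notification_mode(severity, local_time):
    """Decide how an alert reaches a human: 'page', 'notify', or 'hold'."""
    if severity == "critical":
        return "page"    # wake someone up regardless of the hour
    if is_business_hours(local_time):
        return "notify"  # low-urgency notification during the workday
    return "hold"        # wait for the next business day


print(notification_mode("warning", datetime(2024, 1, 10, 3, 0)))
print(notification_mode("critical", datetime(2024, 1, 10, 3, 0)))
```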
Every false page erodes trust. Audit alerts ruthlessly: would a human do something useful in the next 15 minutes? If not, the alert should not page.
Without a written shift handoff, the next on-call starts blind. Set up a Slack channel or a template for it.