alerting Introduced in L8

PagerDuty

The on-call standard. Schedules, escalations, alerting integrations, incident response. The thing that wakes engineers up.

What it is

The plain-English version

PagerDuty is the category leader for on-call management. It defines who's on call (rotation schedules), how alerts escalate (primary → secondary after N minutes unacknowledged), and how alerts arrive (push, SMS, voice call). It integrates with virtually every monitoring tool.
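The escalation rule described above can be sketched in a few lines of plain Python. This is an illustrative model only, not PagerDuty's API or configuration format; the `EscalationStep` class, the `notified_so_far` helper, and the 10-minute value for N are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    target: str          # who gets notified at this step
    after_minutes: int   # minutes unacknowledged before this step fires


# Hypothetical policy: primary immediately, secondary after N = 10 minutes.
POLICY = [
    EscalationStep("primary", 0),
    EscalationStep("secondary", 10),
]


def notified_so_far(policy, minutes_unacknowledged):
    """Return every target that has been notified by the given elapsed time."""
    return [
        step.target
        for step in policy
        if step.after_minutes <= minutes_unacknowledged
    ]
```

Calling `notified_so_far(POLICY, 9)` returns only the primary, while any value of 10 or more also includes the secondary. Acknowledging the alert would stop the clock, which this sketch leaves out.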

Why it exists

The problem it solves

Any team that takes on-call seriously needs scheduling (who's covering what), routing (which alerts go to which team), escalation (what happens if no one responds), and history (incident records, postmortem support). PagerDuty does all of this and has been trusted with it for over a decade.

Vocabulary

The words you'll hear

Service
A thing that pages — usually one app or component.
Schedule
Rotation of who's primary on-call.
Escalation policy
Sequence: notify primary, then secondary, then manager, etc.
Incident
An active issue. It moves from triggered to acknowledged to resolved, with a timeline.
Integration
Source of alerts. Datadog, Sentry, custom webhooks, etc.
Runbook
Instructions for what to do, linked from the incident.
Postmortem
Post-incident write-up. Some teams use PagerDuty's; others use FireHydrant, Rootly, incident.io.
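For the "custom webhooks" kind of integration, alerts are typically sent as JSON events. The sketch below builds a trigger event in the shape of PagerDuty's Events API v2 as I understand it (`routing_key`, `event_action`, and a `payload` with `summary`, `source`, and `severity`); confirm the field names against the official docs before relying on them. The helper name `build_trigger_event`, the placeholder key, and the example summary are hypothetical, and nothing is sent over the network.

```python
import json

# Endpoint that Events API v2 events are POSTed to (shown for reference only).
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source,
                        severity="critical", dedup_key=None):
    """Build the JSON-serializable body for a 'trigger' event."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    event = {
        "routing_key": routing_key,   # the service's integration key
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    if dedup_key is not None:
        # Events sharing a dedup_key are grouped into one alert.
        event["dedup_key"] = dedup_key
    return event


# Placeholder values; a real integration key comes from the service's settings.
body = json.dumps(
    build_trigger_event("YOUR_INTEGRATION_KEY",
                        "Checkout error rate above 5%",
                        "tasklane-prod")
)
```

The resulting `body` string is what an HTTP client would POST to `EVENTS_API_URL` with a JSON content type.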
Prompting

Bad vs. good prompt for PagerDuty

✕ Bad prompt
set up paging
✓ Good prompt
Sketch a PagerDuty setup for a 4-engineer team: weekly rotation (Mon 9am handoff), one shared service 'tasklane-prod' that all engineers cover. Escalation policy: notify primary, escalate to secondary after 5 min, escalate to all-hands after 15 min. Integrate with Datadog and Sentry. Define what severity levels page (P1/P2 page, P3/P4 don't).

Why it works: Specifies team size, rotation cadence, exact escalation timings, integrations, and severity thresholds. The result is reviewable as a real on-call setup.
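To make the reviewability point concrete, here is a minimal sketch of two pieces of that prompt as plain Python: the weekly rotation with a Monday 9am handoff and the rule that only P1/P2 page. It is a model for reasoning about the setup, not PagerDuty configuration. The engineer names, the rotation start date, and the function names are assumptions, and times are naive datetimes in the team's local timezone.

```python
from datetime import datetime, timedelta

ENGINEERS = ["ana", "ben", "chi", "dev"]   # hypothetical 4-engineer team

# First handoff: Monday 2024-01-01 at 09:00 (an arbitrary assumed start).
ROTATION_START = datetime(2024, 1, 1, 9, 0)

PAGING_SEVERITIES = {"P1", "P2"}           # P3/P4 do not page


def primary_on_call(now):
    """Return the engineer who is primary at the given moment."""
    # Whole weeks elapsed since the first handoff (floors for earlier dates).
    weeks_elapsed = (now - ROTATION_START) // timedelta(weeks=1)
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]


def should_page(severity):
    """Only the severities listed in PAGING_SEVERITIES wake someone up."""
    return severity in PAGING_SEVERITIES
```

With these values, `primary_on_call(datetime(2024, 1, 8, 8, 59))` is still `"ana"`, and one minute later the shift hands off to `"ben"`.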

Pitfalls

What bites real teams

⚠ No quiet hours

Pages at 3am are different from pages at 3pm. Most teams should configure non-urgent severities to wait for business hours.
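The quiet-hours rule can be sketched as a small decision function. This is an illustrative model, not a PagerDuty feature or setting; the 09:00 to 17:00 weekday window and the names `is_business_hours` and `notify_now` are assumptions.

```python
from datetime import datetime, time

BUSINESS_START = time(9, 0)    # assumed business hours, local time
BUSINESS_END = time(17, 0)


def is_business_hours(now):
    """True on Monday to Friday between BUSINESS_START and BUSINESS_END."""
    is_weekday = now.weekday() < 5   # Monday is 0, Sunday is 6
    return is_weekday and BUSINESS_START <= now.time() < BUSINESS_END


def notify_now(urgent, now):
    """Urgent alerts always notify; non-urgent ones wait for business hours."""
    return urgent or is_business_hours(now)
```

A non-urgent alert at 3am on a Wednesday is held, while the same alert at 3pm goes through immediately.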

⚠ Page-fatigue erosion

Every false page erodes trust. Audit alerts ruthlessly: would a human do something useful in the next 15 min?

⚠ No handoff ritual

Without a written shift handoff, the next on-call starts blind. Set up a Slack channel or a template for it.
