alerting Introduced in L8

PagerDuty

The on-call standard. Schedules, escalations, alerting integrations, incident response. The thing that wakes engineers up.

What it is

The plain-English version

PagerDuty is the category leader for on-call management. It defines who is on call (rotation schedules), how alerts escalate (primary → secondary after N minutes unacknowledged), and how alerts arrive (push, SMS, voice call). It integrates with virtually every mainstream monitoring tool.
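To make the escalation idea concrete, here is a minimal sketch of "primary → secondary after N minutes unacknowledged". It is an illustrative model only, not PagerDuty's implementation or API: the `EscalationRule` class, the `notified_targets` function, and the 10-minute value for N are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRule:
    target: str         # who gets notified at this step (hypothetical label)
    after_minutes: int  # minutes the alert must stay unacknowledged first


def notified_targets(rules: list[EscalationRule], minutes_unacked: int) -> list[str]:
    """Return every target notified so far, in escalation order."""
    ordered = sorted(rules, key=lambda r: r.after_minutes)
    return [r.target for r in ordered if minutes_unacked >= r.after_minutes]


# N = 10 is an arbitrary example value, not a PagerDuty default.
policy = [
    EscalationRule("primary", 0),
    EscalationRule("secondary", 10),
]

print(notified_targets(policy, 3))   # ['primary']
print(notified_targets(policy, 12))  # ['primary', 'secondary']
```

Once the alert is acknowledged, a real policy stops escalating; this sketch only models the unacknowledged path.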

Why it exists

The problem it solves

Any team that takes on-call seriously needs scheduling (who's covering what), routing (which alerts go to which team), escalation (what happens if no one responds), and history (incident records, postmortem support). PagerDuty does all of this and has been trusted with it for over a decade.

What it competes with

Alternatives

Where it shows up in Shipyard

Deep links

L8 SysOps

Vocabulary

The words you'll hear

Service: A thing that pages — usually one app or component.
Schedule: Rotation of who's primary on-call.
Escalation policy: Sequence: notify primary, then secondary, then manager, etc.
Incident: An active issue. It is triggered, acknowledged, then resolved, and carries a timeline.
Integration: Source of alerts. Datadog, Sentry, custom webhooks, etc.
Runbook: What to do when the alert fires. Linked from the incident.
Postmortem: Post-incident write-up. Some teams use PagerDuty's; others use FireHydrant, Rootly, or incident.io.
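For the "custom webhooks" kind of integration, alerts are typically sent as JSON events. The sketch below builds a trigger event in the shape of PagerDuty's Events API v2 as I understand it (`routing_key`, `event_action`, `dedup_key`, and a `payload` with `summary`, `source`, `severity`); check the field list against the official docs before relying on it. The helper name `build_trigger_event` and all sample values are hypothetical, and nothing is actually sent over the network.

```python
import json

# Severity values accepted in the event payload, per the Events API v2 docs.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity, dedup_key=None):
    """Build (but do not send) a trigger event for a service integration."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    event = {
        "routing_key": routing_key,  # integration key of the target service
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key is not None:
        # Events sharing a dedup_key are grouped into the same incident.
        event["dedup_key"] = dedup_key
    return event


event = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",  # placeholder, not a real key
    summary="Checkout error rate above 5%",
    source="tasklane-prod",
    severity="critical",
    dedup_key="checkout-error-rate",
)
print(json.dumps(event, indent=2))
```

A real sender would POST this JSON to the Events API v2 enqueue endpoint; monitoring tools such as Datadog and Sentry do the equivalent through their built-in integrations.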

Prompting

Bad vs. good prompt for PagerDuty

✕ Bad prompt

set up paging

✓ Good prompt

Sketch a PagerDuty setup for a 4-engineer team: weekly rotation (Mon 9am handoff), one shared service 'tasklane-prod' that all engineers cover. Escalation policy: notify primary, escalate to secondary after 5 min, escalate to all-hands after 15 min. Integrate with Datadog and Sentry. Define what severity levels page (P1/P2 page, P3/P4 don't).

Why it works: Specifies team size, rotation cadence, exact escalation timings, integrations, and severity thresholds. The result is reviewable as a real on-call setup.
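One way to review the output of a prompt like this is to model the rules in a few lines and check edge cases such as the exact handoff moment. The sketch below is an assumption-heavy illustration, not PagerDuty configuration: the engineer names, the rotation start date, and the function names are hypothetical, and times are naive datetimes with no timezone handling.

```python
from datetime import datetime, timedelta

ENGINEERS = ["alice", "bob", "carol", "dave"]  # hypothetical 4-engineer team
# First handoff: Monday 2024-01-01 at 09:00 (arbitrary example start date).
ROTATION_START = datetime(2024, 1, 1, 9, 0)
PAGING_SEVERITIES = {"P1", "P2"}  # P3/P4 do not page


def primary_on_call(now: datetime) -> str:
    """Weekly rotation: the primary changes every Monday at 09:00."""
    weeks_elapsed = (now - ROTATION_START) // timedelta(weeks=1)
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]


def secondary_on_call(now: datetime) -> str:
    """Assumption: the secondary is the next engineer in the rotation."""
    weeks_elapsed = (now - ROTATION_START) // timedelta(weeks=1)
    return ENGINEERS[(weeks_elapsed + 1) % len(ENGINEERS)]


def should_page(severity: str) -> bool:
    return severity in PAGING_SEVERITIES


print(primary_on_call(datetime(2024, 1, 8, 8, 59)))  # alice (one minute before handoff)
print(primary_on_call(datetime(2024, 1, 8, 9, 0)))   # bob (handoff has happened)
print(should_page("P3"))                             # False
```

Checking the minute before and the minute of the handoff is the kind of boundary a reviewer would otherwise have to take on trust.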

Pitfalls

What bites real teams

⚠ No quiet hours

Pages at 3am are different from pages at 3pm. Most teams should configure non-urgent severities to wait for business hours.
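The quiet-hours rule can be sketched as a small decision function. This is an illustration under stated assumptions, not a PagerDuty feature reference: the 09:00 to 18:00 weekday window, the P1/P2 urgent set, and the action labels are all hypothetical choices.

```python
from datetime import datetime, time

BUSINESS_START = time(9, 0)   # assumed business hours, local time
BUSINESS_END = time(18, 0)
URGENT_SEVERITIES = {"P1", "P2"}


def delivery_action(severity: str, now: datetime) -> str:
    """Decide whether an alert interrupts someone now or waits."""
    if severity in URGENT_SEVERITIES:
        return "page_now"  # urgent alerts page at any hour
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    in_hours = BUSINESS_START <= now.time() < BUSINESS_END
    if is_weekday and in_hours:
        return "notify_now"
    return "hold_until_business_hours"


print(delivery_action("P3", datetime(2024, 1, 2, 3, 0)))   # hold_until_business_hours
print(delivery_action("P3", datetime(2024, 1, 2, 15, 0)))  # notify_now
print(delivery_action("P1", datetime(2024, 1, 2, 3, 0)))   # page_now
```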

⚠ Page-fatigue erosion

Every false page erodes trust. Audit alerts ruthlessly: would a human do something useful in the next 15 min?
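An audit like this can start from a simple tally: for each alert, what fraction of its pages led to a useful human action? The sketch below assumes you have labeled past pages by hand as actionable or not; the record format, the alert names, and the 50% threshold are hypothetical.

```python
from collections import defaultdict


def noisy_alerts(pages, min_actionable_ratio=0.5):
    """Return alert names whose actionable ratio falls below the threshold.

    `pages` is an iterable of (alert_name, was_actionable) pairs.
    """
    totals = defaultdict(int)
    actionable = defaultdict(int)
    for name, was_actionable in pages:
        totals[name] += 1
        if was_actionable:
            actionable[name] += 1
    return sorted(
        name for name in totals
        if actionable[name] / totals[name] < min_actionable_ratio
    )


history = [  # hand-labeled example data, not real incidents
    ("disk_full", True),
    ("cpu_spike", False),
    ("cpu_spike", False),
    ("cpu_spike", True),
    ("disk_full", True),
]
print(noisy_alerts(history))  # ['cpu_spike']
```

Alerts that land on this list are candidates for retuning, downgrading to a non-paging severity, or deletion.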

⚠ No handoff ritual

Without a written shift handoff, the next on-call starts blind. Set up a Slack channel or a template for it.

References

Official docs only

PagerDuty docs
