APM · Introduced in L6

Datadog

Comprehensive observability — metrics, logs, APM, RUM, all under one expensive roof.

What it is

The plain-English version

Datadog is an observability platform: infrastructure metrics, application performance monitoring (APM), log management, real user monitoring (RUM), synthetic checks, security tools — all integrated. The de facto enterprise choice. Expensive at scale; powerful when used well.

Why it exists

The problem it solves

Datadog's pitch is correlation: you see a CPU spike on a host, click through to the service running on it, click through to the slow trace, click through to the offending log line — all in one tool. Self-assembling this out of open-source pieces is real work; Datadog sells the integration.

What it competes with

Alternatives

Alternative | Type | When it wins
Sentry | errors | The error-tracking standard. Captures frontend and backend exceptions with full context. First tool teams add for production observability.
Prometheus | metrics | The open-source metrics standard. Pull-based scraping, time-series database, the basis of most cloud-native observability.
ELK Stack | log mgmt | Elasticsearch + Logstash + Kibana — the open-source log management trio. Now also "Elastic Stack" with Beats.
Where it shows up in Shipyard

Deep links

Vocabulary

The words you'll hear

Agent
Long-running process on each host. Collects metrics, logs, traces.
Metric
Time-series numerical data. Counters, gauges, histograms.
APM trace
Distributed trace across services. Same idea as Jaeger or X-Ray.
RUM
Real User Monitoring. JS snippet on the frontend captures real-user performance.
Synthetic
Scheduled checks against your endpoints from various locations.
Monitor / SLO
Alert rules and reliability targets, both first-class.
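The metric types above map directly onto Datadog's documented DogStatsD datagram format (`name:value|type|#tag:val`). A minimal sketch, assuming nothing beyond that wire format — the `dogstatsd` helper and the `tasklane.*` metric names are illustrative, not part of any SDK:

```javascript
// Format DogStatsD-style datagrams for counters (c), gauges (g), histograms (h).
// The datagram shape matches Datadog's documented DogStatsD protocol; the
// helper itself is a sketch, not a real client (real clients send these over UDP).
function dogstatsd(name, value, type, tags = {}) {
  const tagStr = Object.entries(tags).map(([k, v]) => `${k}:${v}`).join(",");
  return `${name}:${value}|${type}` + (tagStr ? `|#${tagStr}` : "");
}

console.log(dogstatsd("tasklane.requests", 1, "c", { env: "production" }));
// → tasklane.requests:1|c|#env:production   (counter increment)
console.log(dogstatsd("tasklane.queue_depth", 42, "g"));
// → tasklane.queue_depth:42|g               (gauge)
console.log(dogstatsd("tasklane.request_ms", 250, "h"));
// → tasklane.request_ms:250|h               (histogram sample)
```

In practice you'd use an official DogStatsD client rather than hand-rolling datagrams; the point is that each metric type is just a one-letter suffix on the same line format.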
Prompting

Bad vs. good prompt for Datadog

✕ Bad prompt
set up datadog
✓ Good prompt
Add Datadog APM to our Node.js Express app. Initialize dd-trace as the very first import. Tag traces with env=production and service=tasklane-api. Enable log injection (so logs include trace_id). Set DD_TRACE_SAMPLE_RATE=0.1 in production env. Show the require/import order — the order matters.

Why it works: Specifies the gotcha (dd-trace must be imported first), the tagging, the log-trace correlation, and the sample rate. Import order is the #1 cause of "it didn't work" reports.
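A minimal entrypoint sketch of the import order that prompt asks for. The init-first pattern and the `env`/`service`/`logInjection` options are dd-trace's documented API; the service name and route are placeholders carried over from the prompt:

```javascript
// server.js — dd-trace must be initialized before any other require,
// otherwise modules loaded earlier are never instrumented.
const tracer = require("dd-trace").init({
  env: "production",
  service: "tasklane-api",
  logInjection: true, // adds trace_id / span_id to log records
  // sample rate typically comes from DD_TRACE_SAMPLE_RATE=0.1 in the env
});

// Only now load the framework and the rest of the app.
const express = require("express");
const app = express();
app.get("/health", (_req, res) => res.send("ok"));
app.listen(3000);
```

If `express` (or `http`, `pg`, etc.) is required before `dd-trace.init()`, traces silently come up empty — which is exactly why the prompt spells out the order.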

Pitfalls

What bites real teams

⚠ Cost surprises

Custom-metric cardinality, log ingestion volume, and host count all drive cost. Read the pricing page carefully and set up budget alerts.

⚠ Agent versioning

The Datadog Agent releases frequently, and minor versions occasionally ship regressions. Pin the Agent version in production.

⚠ Tag explosion

High-cardinality tags (user_id, request_id) create millions of unique series. Avoid them as metric tags; put high-cardinality identifiers in logs or traces instead.
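Series count multiplies across tag values, which is why a single high-cardinality tag dominates the bill. A quick back-of-envelope sketch (the tag cardinalities are made-up numbers for illustration):

```javascript
// Unique time series per metric ≈ product of each tag's distinct values.
function seriesCount(tagCardinalities) {
  return Object.values(tagCardinalities).reduce((a, b) => a * b, 1);
}

// Safe: a handful of envs, services, and regions.
console.log(seriesCount({ env: 3, service: 20, region: 4 }));
// → 240 series

// Adding a user_id tag with 50,000 users multiplies that by 50,000.
console.log(seriesCount({ env: 3, service: 20, region: 4, user_id: 50000 }));
// → 12000000 series
```

Since Datadog bills custom metrics per unique series, the second line is the difference between a rounding error and a line item.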

References

Official docs only