APM · Introduced in L6

Datadog

Comprehensive observability — metrics, logs, APM, RUM, all under one expensive roof.


What it is

The plain-English version

Datadog is an observability platform that integrates infrastructure metrics, application performance monitoring (APM), log management, real user monitoring (RUM), synthetic checks, and security tools in one product. It is a common default for enterprises: expensive at scale, powerful when used well.

Why it exists

The problem it solves

Datadog's pitch is correlation: you see a CPU spike on a host, click through to the service running on it, click through to the slow trace, click through to the offending log line — all in one tool. Self-assembling this out of open-source pieces is real work; Datadog sells the integration.

What it competes with

Alternatives

Alternative	Type	When it wins
Sentry	errors	A widely used error-tracking tool. Captures frontend and backend exceptions with full context. Often the first tool teams add for production observability.
Prometheus	metrics	The open-source metrics standard. Pull-based scraping and a time-series database; the basis of most cloud-native observability.
ELK Stack	log mgmt	Elasticsearch + Logstash + Kibana, the open-source log management trio. Now branded as the "Elastic Stack", which also includes Beats.

Where it shows up in Shipyard

Deep links

L6: Production-Grade

Vocabulary

The words you'll hear

Agent: Long-running process on each host. Collects metrics, logs, traces.
Metric: Time-series numerical data. Counters, gauges, histograms.
APM trace: Distributed trace across services. Same idea as Jaeger or X-Ray.
RUM: Real User Monitoring. JS snippet on the frontend captures real-user performance.
Synthetic: Scheduled checks against your endpoints from various locations.
Monitor / SLO: Alert rules and reliability targets, both first-class.
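
To make the three metric types in the list above concrete, here is a minimal sketch of how each one behaves. The class names and the sample values are hypothetical, and this is a generic illustration rather than how the Datadog Agent or its client libraries implement metrics.

```javascript
// Generic illustrations of the three metric types; not Datadog's implementation.

// Counter: a running total that only ever goes up (e.g. requests served).
class Counter {
  constructor() { this.value = 0; }
  increment(by = 1) { this.value += by; }
}

// Gauge: a point-in-time reading; the most recent write wins (e.g. queue depth).
class Gauge {
  constructor() { this.value = null; }
  set(v) { this.value = v; }
}

// Histogram: keeps samples so the distribution can be summarized (e.g. latency).
class Histogram {
  constructor() { this.samples = []; }
  record(v) { this.samples.push(v); }
  summary() {
    if (this.samples.length === 0) return { count: 0 };
    const sorted = [...this.samples].sort((a, b) => a - b);
    const sum = sorted.reduce((s, v) => s + v, 0);
    return {
      count: sorted.length,
      min: sorted[0],
      max: sorted[sorted.length - 1],
      avg: sum / sorted.length,
    };
  }
}

const requests = new Counter();
requests.increment();
requests.increment(2);

const queueDepth = new Gauge();
queueDepth.set(40);
queueDepth.set(25);

const latencyMs = new Histogram();
[120, 80, 400].forEach((ms) => latencyMs.record(ms));

console.log(requests.value);      // 3
console.log(queueDepth.value);    // 25
console.log(latencyMs.summary()); // { count: 3, min: 80, max: 400, avg: 200 }
```

The practical difference is what survives aggregation: a counter gives a rate, a gauge gives the latest state, and a histogram gives the shape of the data, which is what latency alerts usually need.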

Prompting

Bad vs. good prompt for Datadog

✕ Bad prompt

set up datadog

✓ Good prompt

Add Datadog APM to our Node.js Express app. Initialize dd-trace as the very first import. Tag traces with env=production and service=tasklane-api. Enable log injection (so logs include trace_id). Set DD_TRACE_SAMPLE_RATE=0.1 in production env. Show the require/import order — the order matters.

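Why it works: It specifies the gotcha (dd-trace must be initialized before any other module is imported, because it instruments libraries by patching them as they load), the tagging, the log-trace correlation, and the sample rate. Import order is a common reason for 'it didn't work.'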
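
The import-order gotcha can be illustrated with a toy model of load-time instrumentation. Everything in this sketch (`registry`, `load`, `initTracer`, `boot`) is hypothetical and stands in for Node's module loader and a tracer; it is not dd-trace's code or API. It assumes only that the tracer wraps modules at the moment they are loaded.

```javascript
// Toy model of load-time instrumentation. All names are hypothetical.
const registry = {
  http: () => ({ handle: (path) => `200 ${path}` }),
};
const cache = new Map();
let hook = null; // set once the "tracer" is initialized

// Stand-in for require(): builds a module once, then serves it from cache.
function load(name) {
  if (!cache.has(name)) {
    const mod = registry[name]();
    cache.set(name, hook ? hook(mod) : mod);
  }
  return cache.get(name);
}

// Stand-in for tracer init: wraps only the modules loaded from now on.
function initTracer(spans) {
  hook = (mod) => ({
    ...mod,
    handle: (path) => {
      spans.push(path); // record a "span" for each call
      return mod.handle(path);
    },
  });
}

// Runs one app start-up in the given order and returns the recorded spans.
function boot(order) {
  cache.clear();
  hook = null;
  const spans = [];
  let http;
  for (const step of order) {
    if (step === "tracer") initTracer(spans);
    if (step === "http") http = load("http");
  }
  http.handle("/tasks");
  return spans;
}

console.log(boot(["tracer", "http"])); // [ '/tasks' ]
console.log(boot(["http", "tracer"])); // []
```

In the second start-up the module was already loaded and cached before the tracer existed, so nothing is recorded and nothing fails loudly. In a real Node.js app the counterpart of `initTracer` is the `require('dd-trace').init(...)` call, which is why it belongs on the first line of the entry file, ahead of Express.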

Pitfalls

What bites real teams

⚠ Cost surprises

Custom metric cardinality, log volume, and host count all drive cost. Read the pricing carefully and set budget alerts.

⚠ Agent versioning

The Datadog Agent is released often, and minor versions occasionally introduce regressions. Pin the Agent version in production.

⚠ Tag explosion

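High-cardinality tags (user_id, request_id) can create millions of unique time series, because every distinct combination of tag values is its own series. Avoid them on metrics; put those identifiers on logs or traces instead.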
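
The arithmetic behind tag explosion is a simple product. The sketch below computes the worst-case number of series for one metric; the tag names and value counts are made up for illustration, and real usage is usually lower because not every combination occurs.

```javascript
// Upper bound on time series for one metric: the product of the number of
// distinct values each tag can take. Tag names and counts are illustrative.
function maxSeries(tagCardinalities) {
  return Object.values(tagCardinalities).reduce((total, n) => total * n, 1);
}

const bounded = { env: 3, service: 20, status_code: 5 };
const withUserId = { ...bounded, user_id: 50000 };

console.log(maxSeries(bounded));    // 300
console.log(maxSeries(withUserId)); // 15000000
```

One unbounded tag turns a few hundred series into millions, which is why identifiers such as user_id are better carried on logs and traces.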

References

Official docs only

docs.datadoghq.com
