Datadog
Comprehensive observability — metrics, logs, APM, RUM, all under one expensive roof.
The plain-English version
Datadog is an observability platform that integrates infrastructure metrics, application performance monitoring (APM), log management, real user monitoring (RUM), synthetic checks, and security tools. It is a common default for enterprises: expensive at scale, powerful when used well.
The problem it solves
Datadog's pitch is correlation: you see a CPU spike on a host, click through to the service running on it, click through to the slow trace, click through to the offending log line — all in one tool. Self-assembling this out of open-source pieces is real work; Datadog sells the integration.
Alternatives
| Alternative | Type | When it wins |
|---|---|---|
| Sentry | errors | The error-tracking standard. Captures frontend and backend exceptions with full context. Often the first tool teams add for production observability. |
| Prometheus | metrics | The open-source metrics standard. Pull-based scraping, time-series database, the basis of most cloud-native observability. |
| ELK Stack | log mgmt | Elasticsearch + Logstash + Kibana, the open-source log management trio. Now called the Elastic Stack, which also includes Beats. |
The words you'll hear
- Agent: Long-running process on each host. Collects metrics, logs, and traces.
- Metric: Time-series numerical data. Counters, gauges, histograms.
- APM trace: Distributed trace across services. Same idea as Jaeger or X-Ray.
- RUM: Real User Monitoring. A JS snippet on the frontend captures real-user performance.
- Synthetic: Scheduled checks against your endpoints from various locations.
- Monitor / SLO: Alert rules and reliability targets, both first-class.
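To make the Agent and Metric terms concrete, here is a minimal sketch of how an application can hand a counter, gauge, or histogram to a local Agent over DogStatsD, the Agent's StatsD-style UDP listener (port 8125 by default). The metric names, tag values, and the `format_dogstatsd` and `send_metric` helpers are hypothetical; in practice you would normally use an official client library rather than building datagrams by hand.

```python
import socket


def format_dogstatsd(name, value, metric_type, tags=None, sample_rate=None):
    """Build one DogStatsD datagram: name:value|type[|@rate][|#tag1,tag2].

    metric_type is "c" (counter), "g" (gauge), or "h" (histogram).
    """
    parts = [f"{name}:{value}", metric_type]
    if sample_rate is not None:
        parts.append(f"@{sample_rate}")
    if tags:
        parts.append("#" + ",".join(tags))
    return "|".join(parts)


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a local Agent (assumed to be listening)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Hypothetical metric: one checkout request, tagged with low-cardinality tags.
datagram = format_dogstatsd(
    "checkout.requests", 1, "c", tags=["env:prod", "service:checkout"]
)
print(datagram)
```

Calling `send_metric(datagram)` would ship the line to the Agent. UDP keeps the application from blocking if the Agent is down, at the cost of silently dropped metrics.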
Bad vs. good prompt for Datadog
Why it works: The good prompt specifies the gotcha (dd-trace must be imported and initialized before any other module), the tagging, the log-trace correlation, and the sample rate. Import order is the most common reason for "it didn't work."
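Log-trace correlation comes down to each log line carrying the IDs of the trace and span that were active when it was written. The sketch below shows the shape of such a log line using only Python's standard library. It assumes the `dd.trace_id` and `dd.span_id` attribute names that Datadog uses for correlation; the `CorrelatedJsonFormatter` class and the ID values are hypothetical. In a real setup the tracing library can inject these fields for you when log injection is enabled, so treat this as an illustration of the output, not a replacement for the tracer.

```python
import json
import logging


class CorrelatedJsonFormatter(logging.Formatter):
    """Render each record as JSON with trace/span IDs for log-trace correlation."""

    def format(self, record):
        payload = {
            "message": record.getMessage(),
            "level": record.levelname,
            # IDs arrive via `extra=`; "0" stands for "no active trace".
            "dd.trace_id": str(getattr(record, "trace_id", 0)),
            "dd.span_id": str(getattr(record, "span_id", 0)),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(CorrelatedJsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Hypothetical IDs; a tracer would supply the real ones for the active span.
logger.info("payment failed", extra={"trace_id": 1234567890, "span_id": 987654321})
```

With both IDs present in the log line, the platform can link the log to its trace, which is what makes the click-through from a slow trace to the offending log line possible.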
What bites real teams
Custom metric cardinality, log volume, and host count all drive cost. Read the pricing page carefully and set up budget alerts.
The Datadog Agent is updated often, and minor versions are sometimes broken. Pin the version in production.
High-cardinality tags (user_id, request_id) create millions of unique series. Avoid them on metrics; put those identifiers in logs or traces instead.
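The arithmetic behind that warning is simple: the number of possible series for one metric is bounded by the product of the distinct values of each tag. A rough sketch with made-up tag counts (the `series_upper_bound` helper and all numbers are hypothetical):

```python
from math import prod


def series_upper_bound(distinct_values_per_tag):
    """Upper bound on unique time series for one metric.

    Each distinct combination of tag values is its own series, so the
    bound is the product of the per-tag distinct-value counts.
    """
    return prod(distinct_values_per_tag.values())


# Hypothetical tag sets for a single metric.
low_cardinality = {"env": 3, "service": 20, "region": 4}
with_user_id = dict(low_cardinality, user_id=50_000)

print(series_upper_bound(low_cardinality))  # 240
print(series_upper_bound(with_user_id))     # 12000000
```

The real count is the number of combinations actually observed, which is usually lower than the bound, but a single per-user tag is enough to move a metric from hundreds of series to millions.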