Prometheus
The open-source metrics standard. Pull-based scraping, time-series database, the basis of most cloud-native observability.
The plain-English version
Prometheus is an open-source monitoring and alerting toolkit, originally built at SoundCloud. It pulls metrics from applications by scraping an HTTP endpoint, stores them in a time-series database, and lets you query them with PromQL. It is usually paired with Grafana for dashboards and Alertmanager for routing alerts.
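To make "HTTP scraping" concrete, here is a minimal sketch of what a scrape target serves, using only the Python standard library. The `REQUESTS` store, the metric name, and the `serve` helper are hypothetical and exist only for illustration. A real service would normally use an official client library (for Python, `prometheus_client`) rather than formatting the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter store: {(method, status): count}.
REQUESTS = {("GET", "200"): 1027, ("POST", "500"): 3}


def render_metrics(requests):
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), count in sorted(requests.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counters at /metrics and 404s everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the endpoint; the port is an arbitrary choice for this sketch."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a Prometheus scrape job at that port would let Prometheus fetch `/metrics` on its configured interval. The application only exposes current values, and Prometheus does the collecting and storing.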
The problem it solves
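Prometheus is a CNCF graduated project and the de facto default for metrics in cloud-native systems. Most Kubernetes deployments run it or something compatible with it. The pull model is unusual but well suited to ephemeral containers: service discovery finds each new Pod and Prometheus starts scraping it automatically, with no per-instance configuration. It's the standard layer for self-hosted observability.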
Alternatives
| Alternative | Type | When it wins |
|---|---|---|
| Sentry | errors | The error-tracking standard. Captures frontend and backend exceptions with full context. First tool teams add for production observability. |
| Datadog | APM | Comprehensive observability — metrics, logs, APM, RUM, all under one expensive roof. |
| ELK Stack | log mgmt | Elasticsearch + Logstash + Kibana: the open-source log management trio. Now also called the "Elastic Stack", which adds Beats. |
The words you'll hear
- Exporter
- An agent that exposes metrics in Prometheus format. node_exporter, postgres_exporter, etc.
- Scrape
- Prometheus pulls /metrics from each target on an interval.
- Metric type
- Counter (only goes up, resets to zero on restart), gauge (can go up or down), histogram (observations counted into buckets), summary (quantiles computed in the client).
- PromQL
- The query language, for example rate(http_requests_total[5m]).
- Alertmanager
- Companion service that routes alerts to email, PagerDuty, Slack.
- Grafana
- Dashboard tool, often paired with Prometheus.
- Recording rule
- Pre-computed query stored as a new metric. Saves repeated work.
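As a rough illustration of what a PromQL expression like rate(http_requests_total[5m]) computes over a counter, the sketch below calculates a per-second increase from raw samples and treats any drop in value as a counter reset. The `simple_rate` function is hypothetical and deliberately simplified: the real `rate()` also extrapolates to the edges of the window, which this version skips.

```python
def simple_rate(samples):
    """Per-second increase across a list of (timestamp_seconds, value) samples.

    A value lower than the previous one is treated as a counter reset,
    meaning the counter is assumed to have restarted from zero.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            increase += value          # reset: everything since zero is new
        else:
            increase += value - prev   # normal monotonic growth
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Scrapes every 15 s; the process restarts between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
per_second = simple_rate(samples)  # 100 requests over 60 s
```

This is why counters are queried through `rate()` rather than read directly: the raw value depends on when the process last restarted, while the rate does not.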
Bad vs. good prompt for Prometheus
Why it works: Specifies the SDK, what metrics to expose, the scrape config and the actually-useful dashboard panels. Most 'set up Prometheus' answers stop at /metrics; this one finishes the job.
What bites real teams
Labels like user_id create millions of series and can make Prometheus run out of memory. Every distinct combination of label values is a separate time series, so keep label cardinality bounded.
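The arithmetic behind the cardinality warning is simple multiplication. The sketch below estimates the worst-case series count for one metric from the number of distinct values per label. The label names and counts are made-up illustrative figures, and `max_series` is a hypothetical helper, not part of any Prometheus tooling.

```python
from math import prod


def max_series(distinct_values_per_label):
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(distinct_values_per_label.values())


# Bounded labels: a manageable number of series.
bounded = {"method": 5, "status": 10, "endpoint": 50}
bounded_total = max_series(bounded)

# Adding an unbounded label multiplies everything by the number of users.
with_user_id = dict(bounded, user_id=1_000_000)
unbounded_total = max_series(with_user_id)
```

Here the bounded label set tops out at 2,500 series, while adding a user_id label with a million values raises the ceiling to 2.5 billion. Actual counts are usually lower because not every combination occurs, but the growth pattern is the problem.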
Prometheus's local storage isn't designed for years of data. Pair with Thanos, Cortex, or Mimir for long retention.
It's tempting to alert on every metric. Most alerts should be SLO-burn-rate alerts, not threshold alerts.
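To show how a burn-rate alert differs from a plain threshold, here is a minimal sketch of the calculation. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The two-window check and the 14.4 threshold follow a common convention for a 99.9% SLO measured over 30 days, and they are assumptions here rather than anything the source prescribes. The function names are hypothetical.

```python
def burn_rate(errors, total, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo
    return (errors / total) / error_budget


def should_page(long_window, short_window, slo, threshold=14.4):
    """Page only if both windows burn fast.

    Each window is an (errors, total) pair. The long window shows the
    problem is significant; the short window shows it is still happening.
    """
    return (
        burn_rate(*long_window, slo) >= threshold
        and burn_rate(*short_window, slo) >= threshold
    )


# 2% errors over the last hour and 3% over the last 5 minutes, 99.9% SLO.
page_now = should_page((20, 1000), (3, 100), slo=0.999)

# Same hour, but the last 5 minutes are clean: the incident has recovered.
page_recovered = should_page((20, 1000), (0, 100), slo=0.999)
```

In Prometheus itself, each window's error ratio would typically come from a `rate()` expression, often precomputed as a recording rule, with the comparison living in an alerting rule.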