Prometheus + Grafana Complete Guide

Prometheus + Grafana

BeginnerEngineerProductionArchitectTime-series monitoring — metrics, PromQL, alerts, Grafana dashboards

What is Prometheus Metric Types PromQL Alert Rules Alertmanager Troubleshoot Interview Q&A Roadmap

🔥 What is Prometheus + Grafana?

›

The Monitoring Problem

When a user reports slow responses at 3pm on Tuesday, you need to answer: was it slow before the last deployment? Was it one pod or all pods? Was CPU high or memory? Was the database slow? Without time-series monitoring, you cannot answer any of these retrospectively. Prometheus records every metric as a timestamped value, forever. Grafana lets you visualise and query that history.

How it Works

Step	What happens
1. Application exposes metrics	Your app has a GET /metrics endpoint returning Prometheus format text
2. Prometheus scrapes	Every 15-30 seconds, Prometheus calls /metrics on every target
3. Stored as time-series	Each metric value stored with timestamp + labels in local TSDB
4. Grafana queries	Grafana sends PromQL queries to Prometheus, renders charts
5. Alert evaluation	Prometheus evaluates alert rules every 30 seconds
6. Alertmanager routes	If rule fires, Alertmanager sends to Slack/PagerDuty/email

Prometheus vs Alternatives

	Prometheus	Datadog	Nagios
Cost	Free and open source	$15+/host/month	Free (complex)
Model	Pull (scrape)	Push (agent)	Check-based
Kubernetes	Native, ServiceMonitor CRD	DaemonSet agent	Plugins, not native
Query power	PromQL — very powerful	Good	None
Best for	K8s, on-prem, multi-cloud, cost-sensitive	SaaS, ease of use, APM	Legacy, simple checks

Install kube-prometheus-stack + access Grafana

📊 Metric Types — What They Mean

›

Why metric types matter

Using the wrong metric type or wrong PromQL function is the most common Prometheus mistake. A counter used with the wrong function looks like it is always zero or always at max. Understanding types prevents these bugs.

All four metric types with examples

🔍 PromQL — Essential Queries

›

PromQL — the query language you must know

PromQL is not SQL. It is designed for time-series data. Key operators: rate() for counter speed, histogram_quantile() for percentiles, by() for grouping, without() for removing labels, topk() for top N, avg_over_time() for rolling average.

Essential PromQL queries — CPU, memory, errors, K8s

🚨 Alert Rules — Writing Good Alerts

›

What makes a good alert?

Actionable — someone waking up at 3am must know exactly what to do
Has FOR duration — transient spikes should not fire alerts
Has runbook_url — link to documented response procedure
Tests symptoms not causes — alert on high error rate, not on "database connection timeout"
Right severity — CRITICAL wakes people up, WARNING can wait until morning

PrometheusRule with real production alerts

📬 Alertmanager — Routing Alerts

›

What is Alertmanager?

Alertmanager handles routing, deduplication, grouping, and silencing of alerts. Prometheus evaluates rules and fires alerts. Alertmanager decides who gets notified, how, and when. Without Alertmanager you would get a separate Slack message for every single pod that crashed — with it, 50 pod restarts become one grouped notification.

Alertmanager config — Slack, PagerDuty, inhibition

🔍 Troubleshooting

›

Systematic debugging approach

The Prometheus targets page at http://prometheus:9090/targets is your first debugging stop. Green = scraping successfully, Red = failing (check error message). From there you can determine if the problem is ServiceMonitor labels, network, or the application itself.

Common issues and fixes

🎯 Interview Questions

›

PROMETHEUS · ENGINEER

What is Prometheus and how does it differ from traditional monitoring tools like Nagios?

Prometheus is a pull-based, time-series monitoring system. It periodically scrapes metrics from HTTP endpoints on your applications and infrastructure. Nagios and Zabbix are check-based — they periodically run a check script (is the service responding? is disk below 90%?) and return OK/WARNING/CRITICAL. The fundamental difference: Prometheus stores actual metric values over time as numbers, enabling powerful queries like trends, rates, and percentile calculations. Nagios only knows current state. With Prometheus you can ask: what was the 99th percentile latency last Tuesday between 2pm-4pm? What is the rate of change in memory usage over the last 6 hours? Nagios cannot answer these. Prometheus is cloud-native — designed for containers where services appear and disappear. Service discovery automatically finds new pods. Nagios requires manual host registration.

PROMETHEUS · ENGINEER

Explain the four Prometheus metric types. When do you use each?

Counter: always increases, never decreases (resets on restart). Use for total requests, errors, bytes processed. Always use rate() or increase() in PromQL to get rate of change per second — the raw counter number is not useful. Gauge: current value, can go up or down. Use for memory usage, active connections, queue depth, temperature. Query directly without rate(). Histogram: distributes values into predefined buckets. Use for request latency and response sizes where you need percentile calculations. Creates three series: _bucket (counts), _count (total), _sum (sum of all values). Use histogram_quantile() to calculate P95/P99. Summary: similar to histogram but calculates quantiles in the application code. Less flexible than histogram because you cannot aggregate across multiple instances. Prefer histogram unless you have a specific reason for summary.

PROMETHEUS · ARCHITECT

An alert is firing in Prometheus but you see no issue in Grafana. What are the possible causes?

Five possible causes. One: time range mismatch — Grafana dashboard is showing last 1 hour but the alert triggered on a spike that is now outside the range. Expand time range. Two: alert has no FOR duration — fired on a single data point that immediately resolved. The alert appears in history but Grafana dashboard now shows normal. Add FOR 5m to the alert rule. Three: alert is silenced in Alertmanager — someone silenced it while investigating. Check Alertmanager UI silences. Four: the metric labels in the alert do not match the Grafana panel query — alert is for namespace=production but Grafana panel shows all namespaces combined. Look at the exact alert labels. Five: recording rules lag — if the alert uses a recording rule, there is a scrape interval delay. The alert fired on slightly stale data. Check the recording rule evaluation interval. At HPE: most of these issues come from copy-pasting alert rules without understanding the labels — the alert fires for a specific pod but the Grafana panel aggregates all pods.

PROMETHEUS · PRODUCTION

Prometheus is using too much memory and pods keep OOMKilling. How do you fix it?

Root cause: too many time series (high cardinality) or too long retention. Investigation: Prometheus UI → Status → TSDB Status shows top series by metric name and label name. Look for any label with millions of unique values — typically user_id, session_id, request_id, URL path with parameters. These are high-cardinality labels and each unique value = separate time series = memory. Immediate fix: increase memory limit and reduce retention from 15 days to 7 days. Medium-term fix: remove high-cardinality labels from metrics using metric_relabel_configs in the scrape config — drop the offending label. Long-term fix: code review of application metrics — every label must have bounded cardinality. Also check for metrics that are never queried and can be dropped entirely. Recording rules help too — pre-aggregate expensive queries into new lower-cardinality series. At HPE: a developer added request_path as a label with full URL paths including query parameters. 50 million unique series in 2 hours. Fix: drop path label in relabel config, add path_prefix with only the first URL segment.

PROMETHEUS · ENGINEER

What is the difference between Prometheus rate() and increase() functions?

Both calculate how much a counter changed over a time window, but express it differently. rate(counter[5m]) gives the per-second rate of increase averaged over 5 minutes. increase(counter[5m]) gives the total increase over 5 minutes. Mathematically: increase = rate × 300 (seconds in 5 minutes). Use rate() when you care about speed — requests per second, errors per second. Use increase() when you care about total count in a window — how many restarts in the last hour, how many deploys today. Important nuance: both handle counter resets (when a process restarts and counter goes to 0) by detecting the reset and not counting it as a decrease. This is why you must use rate()/increase() instead of subtracting counter values directly.

PROMETHEUS · ARCHITECT

How do you design Prometheus alerting for a Kubernetes production cluster to avoid alert fatigue?

Alert fatigue happens when too many alerts fire, operators stop paying attention, and real incidents get missed. Prevention strategy: only alert on symptoms (user-visible impact) not causes. Symptom alert: high error rate. Cause alert: database connection timeout — often too specific. Four golden signals to always alert on: Latency (P99 > SLA), Traffic (abnormal request rate), Errors (error rate > threshold), Saturation (CPU/memory/disk approaching limit). For each alert: add FOR duration (5 minutes for critical, 15 for warning) to prevent noise from transient spikes. Add runbook_url annotation pointing to documented response. Inhibition rules: if a node is down, suppress all pod alerts from that node — the root cause is the node, not the individual pods. Group related alerts in Alertmanager: group by namespace so 50 pod alerts from one bad deployment become one grouped notification. Route by team: platform team gets infra alerts, app team gets their service alerts. Review alerts monthly — if an alert fires more than twice a week for a non-incident, it is threshold too low or not important enough.

🗺️ Roadmap

›

Day 1

Install

Install kube-prometheus-stack

Access Grafana — import K8s dashboards

Run first PromQL query

Week 1

PromQL

Learn rate(), histogram_quantile()

CPU, memory, error rate queries

Instrument your own app with /metrics

Week 2

Alerting

Write PrometheusRule with FOR duration

Configure Alertmanager Slack route

Test alert end-to-end

Month 2

Production

Recording rules for expensive queries

Cardinality control

Thanos or VictoriaMetrics for long-term storage

OpenTelemetry integration

Continue Learning

📊 ELK Stack 📈 Datadog ☸️ Kubernetes 🏠 All Topics