LearnwithVishnu
Basics → Production → Architect
Prometheus + Grafana
Levels: Beginner · Engineer · Production · Architect
Time-series monitoring: metrics, PromQL, alerts, Grafana dashboards

🔥 What is Prometheus + Grafana?

The Monitoring Problem

When a user reports slow responses at 3pm on Tuesday, you need to answer: was it slow before the last deployment? Was it one pod or all pods? Was CPU high or memory? Was the database slow? Without time-series monitoring, you cannot answer any of these retrospectively. Prometheus records every metric as a timestamped value and keeps it for a configurable retention period (15 days by default). Grafana lets you visualise and query that history.

How it Works

| Step | What happens |
|---|---|
| 1. Application exposes metrics | Your app has a GET /metrics endpoint returning Prometheus format text |
| 2. Prometheus scrapes | Every 15-30 seconds, Prometheus calls /metrics on every target |
| 3. Stored as time-series | Each metric value stored with timestamp + labels in local TSDB |
| 4. Grafana queries | Grafana sends PromQL queries to Prometheus, renders charts |
| 5. Alert evaluation | Prometheus evaluates alert rules every 30 seconds |
| 6. Alertmanager routes | If rule fires, Alertmanager sends to Slack/PagerDuty/email |
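
To make step 1 concrete, here is a minimal Python sketch of the text a /metrics endpoint returns. The function name render_metrics and the sample metric http_requests_total are illustrative assumptions, not part of any library. The sketch assumes samples of the same metric arrive next to each other and skips label-value escaping, which a real client library handles for you.

```python
def render_metrics(samples):
    """Render samples in the Prometheus text exposition format.

    samples: list of (name, metric_type, help_text, labels, value) tuples.
    Hypothetical helper for illustration; real apps normally use a client library.
    """
    lines = []
    described = set()
    for name, metric_type, help_text, labels, value in samples:
        if name not in described:
            # HELP and TYPE are emitted once per metric name.
            lines.append(f"# HELP {name} {help_text}")
            lines.append(f"# TYPE {name} {metric_type}")
            described.add(name)
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Example payload: one counter with two label combinations (made-up values).
print(render_metrics([
    ("http_requests_total", "counter", "Total HTTP requests.",
     {"method": "GET", "status": "200"}, 1027),
    ("http_requests_total", "counter", "Total HTTP requests.",
     {"method": "GET", "status": "500"}, 3),
]))
```

Each line with a distinct label combination becomes its own time series once Prometheus scrapes it.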

Prometheus vs Alternatives

| | Prometheus | Datadog | Nagios |
|---|---|---|---|
| Cost | Free and open source | $15+/host/month | Free (complex) |
| Model | Pull (scrape) | Push (agent) | Check-based |
| Kubernetes | Native, ServiceMonitor CRD | DaemonSet agent | Plugins, not native |
| Query power | PromQL, very powerful | Good | None |
| Best for | K8s, on-prem, multi-cloud, cost-sensitive | SaaS, ease of use, APM | Legacy, simple checks |

Install kube-prometheus-stack + access Grafana

📊 Metric Types — What They Mean

Why metric types matter

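Using the wrong metric type or the wrong PromQL function is the most common Prometheus mistake. A counter queried with the wrong function can look like it is stuck at zero or permanently at its maximum; for example, graphing a raw counter without rate() just shows a line that climbs forever. Understanding types prevents these bugs.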
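
As a rough illustration of how the types behave, the Python sketch below models a counter, a gauge, and a histogram in plain classes. These classes are hypothetical stand-ins written for this page, not the API of any Prometheus client library, and the summary type is left out because its quantiles are computed inside the application.

```python
import math


class Counter:
    """Monotonic total: only goes up (a process restart starts it again at 0)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Current value: can go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class Histogram:
    """Cumulative buckets plus the _count and _sum series."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds) + [math.inf]   # +Inf bucket is always last
        self.bucket_counts = [0] * len(self.bounds)
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        self.count += 1
        self.sum += value
        # Buckets are cumulative: every bucket whose bound covers the value is bumped.
        for i, upper in enumerate(self.bounds):
            if value <= upper:
                self.bucket_counts[i] += 1


latency = Histogram([0.1, 0.5, 1.0])   # bucket bounds in seconds (made-up values)
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency.observe(seconds)
print(list(zip(latency.bounds, latency.bucket_counts)), latency.count)
```

The cumulative bucket layout is what lets histogram_quantile() estimate percentiles later on the server side.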

All four metric types with examples

🔍 PromQL — Essential Queries

PromQL — the query language you must know

PromQL is not SQL. It is designed for time-series data. Key functions and clauses: rate() for the per-second rate of a counter, histogram_quantile() for percentiles, by (...) for grouping an aggregation by labels, without (...) for aggregating away specific labels, topk() for top N, avg_over_time() for a rolling average.
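
To show what histogram_quantile() does with bucket data, here is a simplified Python sketch of the linear interpolation it performs. It assumes classic cumulative buckets with a final +Inf bucket and ignores several edge cases the real function handles, so treat it as an approximation of the idea rather than the server's exact behaviour.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Simplified sketch: needs 0 < q <= 1, at least two buckets, and +Inf last.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets ending with +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total                      # how many observations lie below the quantile
    prev_bound, prev_count = 0.0, 0       # the lowest bucket is assumed to start at 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count


buckets = [(0.1, 1), (0.5, 3), (1.0, 3), (math.inf, 4)]   # made-up latency buckets
print(histogram_quantile(0.5, buckets), histogram_quantile(0.99, buckets))
```

Because the result is interpolated inside a bucket, its accuracy depends on how well the bucket bounds match your real latency range.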

Essential PromQL queries — CPU, memory, errors, K8s

🚨 Alert Rules — Writing Good Alerts

What makes a good alert?

  • Actionable: someone waking up at 3am must know exactly what to do
  • Has a for duration: transient spikes should not fire alerts
  • Has runbook_url: a link to the documented response procedure
  • Tests symptoms, not causes: alert on high error rate, not on "database connection timeout"
  • Right severity: CRITICAL wakes people up, WARNING can wait until morning

PrometheusRule with real production alerts
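
The effect of a for duration can be sketched as a small state machine. The Python below is an illustrative model, not Prometheus code: the function name alert_states and the sample values are assumptions. It shows how a condition must stay true across evaluations before an alert moves from pending to firing.

```python
def alert_states(samples, threshold, for_seconds):
    """Return (timestamp, state) per evaluation for a 'value > threshold' rule.

    samples: list of (timestamp_seconds, value), one per rule evaluation.
    States follow the inactive -> pending -> firing progression.
    """
    states = []
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts            # condition just became true
            held = ts - active_since
            state = "firing" if held >= for_seconds else "pending"
        else:
            active_since = None              # any false evaluation resets the timer
            state = "inactive"
        states.append((ts, state))
    return states


# Error ratio evaluated every 60 s: one transient spike, then a sustained problem.
values = [0.01, 0.09, 0.01] + [0.09] * 7
samples = [(i * 60, v) for i, v in enumerate(values)]
for ts, state in alert_states(samples, threshold=0.05, for_seconds=300):
    print(ts, state)
```

In this run the single spike at 60 s only reaches pending and never notifies anyone, while the sustained breach starting at 180 s fires once it has held for 300 seconds.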

📬 Alertmanager — Routing Alerts

What is Alertmanager?

Alertmanager handles routing, deduplication, grouping, and silencing of alerts. Prometheus evaluates rules and fires alerts. Alertmanager decides who gets notified, how, and when. Without Alertmanager you would get a separate Slack message for every single pod that crashed — with it, 50 pod restarts become one grouped notification.
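
Grouping is easiest to see with data. The Python sketch below imitates what a group_by setting does conceptually: alerts that share the chosen label values end up in one notification. The function and alert names are made up for illustration, and real Alertmanager adds timing controls that are not modelled here.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# 50 crashing pods from one bad deployment (hypothetical alert and label names).
alerts = [
    {"alertname": "PodCrashLooping", "namespace": "production", "pod": f"api-{i}"}
    for i in range(50)
]
groups = group_alerts(alerts, group_by=["alertname", "namespace"])
for key, members in groups.items():
    print(key, "->", len(members), "alerts in one notification")
```

Adding pod to group_by would bring back one notification per pod, so the choice of grouping labels decides how noisy the channel gets.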

Alertmanager config — Slack, PagerDuty, inhibition

🔍 Troubleshooting

Systematic debugging approach

The Prometheus targets page at http://prometheus:9090/targets is your first debugging stop. Green = scraping successfully, Red = failing (check error message). From there you can determine if the problem is ServiceMonitor labels, network, or the application itself.
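
The same information is available from the Prometheus HTTP API at /api/v1/targets, which is handy when you cannot reach the UI. The Python sketch below lists unhealthy targets from that response. The base URL is whatever your environment uses, the sample payload is invented, and only the health, scrapeUrl, and lastError fields are read.

```python
import json
import urllib.request


def fetch_targets(base_url):
    """GET <base_url>/api/v1/targets and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)


def failing_targets(payload):
    """Return (scrape_url, last_error) for every active target that is not 'up'."""
    failing = []
    for target in payload["data"]["activeTargets"]:
        if target["health"] != "up":
            failing.append((target["scrapeUrl"], target["lastError"]))
    return failing


# Invented sample response so the sketch runs without a live Prometheus.
sample = {
    "status": "success",
    "data": {"activeTargets": [
        {"scrapeUrl": "http://10.0.0.5:8080/metrics", "health": "up", "lastError": ""},
        {"scrapeUrl": "http://10.0.0.6:8080/metrics", "health": "down",
         "lastError": "connection refused"},
    ]},
}
for url, error in failing_targets(sample):
    print(url, "->", error)
```

Against a live server you would call failing_targets(fetch_targets("http://prometheus:9090")). A target that is missing from the list entirely usually points to service discovery or ServiceMonitor label selection rather than the scrape itself.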

Common issues and fixes

🎯 Interview Questions

PROMETHEUS · ENGINEER
What is Prometheus and how does it differ from traditional monitoring tools like Nagios?
Prometheus is a pull-based, time-series monitoring system. It periodically scrapes metrics from HTTP endpoints on your applications and infrastructure. Nagios and Zabbix are check-based — they periodically run a check script (is the service responding? is disk below 90%?) and return OK/WARNING/CRITICAL. The fundamental difference: Prometheus stores actual metric values over time as numbers, enabling powerful queries like trends, rates, and percentile calculations. Nagios only knows current state. With Prometheus you can ask: what was the 99th percentile latency last Tuesday between 2pm-4pm? What is the rate of change in memory usage over the last 6 hours? Nagios cannot answer these. Prometheus is cloud-native — designed for containers where services appear and disappear. Service discovery automatically finds new pods. Nagios requires manual host registration.
PROMETHEUS · ENGINEER
Explain the four Prometheus metric types. When do you use each?
  • Counter: always increases, never decreases (resets on restart). Use for total requests, errors, bytes processed. Always query it through rate() (per-second rate) or increase() (total change over a window); the raw counter number is not useful.
  • Gauge: current value, can go up or down. Use for memory usage, active connections, queue depth, temperature. Query directly without rate().
  • Histogram: distributes values into predefined buckets. Use for request latency and response sizes where you need percentile calculations. Creates three series: _bucket (counts), _count (total), _sum (sum of all values). Use histogram_quantile() to calculate P95/P99.
  • Summary: similar to histogram but calculates quantiles in the application code. Less flexible than histogram because you cannot aggregate its quantiles across multiple instances. Prefer histogram unless you have a specific reason for summary.
PROMETHEUS · ARCHITECT
An alert is firing in Prometheus but you see no issue in Grafana. What are the possible causes?
Five possible causes:
1. Time range mismatch: the Grafana dashboard is showing the last 1 hour but the alert triggered on a spike that is now outside the range. Expand the time range.
2. Alert has no for duration: it fired on a single data point that immediately resolved. The alert appears in history but the Grafana dashboard now shows normal. Add for: 5m to the alert rule.
3. Alert is silenced in Alertmanager: someone silenced it while investigating. Check the silences in the Alertmanager UI.
4. Label mismatch: the metric labels in the alert do not match the Grafana panel query. The alert is for namespace=production but the Grafana panel shows all namespaces combined. Look at the exact alert labels.
5. Recording rule lag: if the alert uses a recording rule, its input is only as fresh as that rule's last evaluation, so the alert can fire on slightly stale data. Check the recording rule evaluation interval.
At HPE: most of these issues come from copy-pasting alert rules without understanding the labels. The alert fires for a specific pod but the Grafana panel aggregates all pods.
PROMETHEUS · PRODUCTION
Prometheus is using too much memory and pods keep OOMKilling. How do you fix it?
Root cause: too many time series (high cardinality) or too long retention.
Investigation: Prometheus UI → Status → TSDB Status shows top series by metric name and label name. Look for any label with millions of unique values, typically user_id, session_id, request_id, or a URL path with parameters. These are high-cardinality labels and each unique value = separate time series = memory.
Immediate fix: increase memory limit and reduce retention from 15 days to 7 days.
Medium-term fix: remove high-cardinality labels from metrics using metric_relabel_configs in the scrape config to drop the offending label.
Long-term fix: code review of application metrics, because every label must have bounded cardinality. Also check for metrics that are never queried and can be dropped entirely. Recording rules help too: pre-aggregate expensive queries into new lower-cardinality series.
At HPE: a developer added request_path as a label with full URL paths including query parameters. 50 million unique series in 2 hours. Fix: drop path label in relabel config, add path_prefix with only the first URL segment.
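
A short Python sketch can show why one unbounded label is so expensive. It counts distinct series for a set of label combinations before and after removing a label, which is roughly the bookkeeping behind a labeldrop rule. All names and values are invented. Note that in a real scrape, series that become identical after a drop still need to be aggregated at the source so they stay uniquely labeled.

```python
from collections import defaultdict


def series_count(label_sets):
    """Each distinct label combination is one time series."""
    return len({tuple(sorted(labels.items())) for labels in label_sets})


def label_cardinality(label_sets):
    """Number of distinct values seen per label name."""
    values = defaultdict(set)
    for labels in label_sets:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}


def drop_label(label_sets, label):
    """Remove one label from every series."""
    return [{k: v for k, v in labels.items() if k != label} for labels in label_sets]


# One endpoint, but the full URL with query string is used as a label value.
observed = [
    {"method": "GET", "status": "200", "request_path": f"/orders?id={i}"}
    for i in range(1000)
]
print(series_count(observed), label_cardinality(observed))
print(series_count(drop_label(observed, "request_path")))
```

Here the request_path label alone multiplies one logical series into a thousand, which is the pattern the TSDB Status page helps you spot.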
PROMETHEUS · ENGINEER
What is the difference between Prometheus rate() and increase() functions?
Both calculate how much a counter changed over a time window, but express it differently. rate(counter[5m]) gives the per-second rate of increase averaged over 5 minutes. increase(counter[5m]) gives the total increase over 5 minutes. Mathematically: increase = rate × 300 (seconds in 5 minutes). Use rate() when you care about speed — requests per second, errors per second. Use increase() when you care about total count in a window — how many restarts in the last hour, how many deploys today. Important nuance: both handle counter resets (when a process restarts and counter goes to 0) by detecting the reset and not counting it as a decrease. This is why you must use rate()/increase() instead of subtracting counter values directly.
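
The reset handling can be sketched in a few lines of Python. This is a simplified model written for illustration: it works on the raw samples inside a window and skips the extrapolation to window boundaries that real Prometheus applies, so its numbers will differ slightly from the server's.

```python
def increase(samples):
    """Total counter growth across (timestamp, value) samples, oldest first."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur < prev:
            total += cur          # reset detected: counter restarted from 0
        else:
            total += cur - prev
    return total


def rate(samples):
    """Per-second average growth over the time covered by the samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    duration = samples[-1][0] - samples[0][0]
    return increase(samples) / duration


# Scraped every 60 s; the process restarts between t=120 and t=180 (made-up data).
samples = [(0, 100), (60, 160), (120, 220), (180, 20), (240, 80), (300, 140)]
print("naive subtraction:", samples[-1][1] - samples[0][1])
print("increase:", increase(samples))
print("rate per second:", rate(samples))
```

Naive subtraction sees only the difference between the first and last values and misses most of the growth, while the reset-aware version adds up the growth on both sides of the restart.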
PROMETHEUS · ARCHITECT
How do you design Prometheus alerting for a Kubernetes production cluster to avoid alert fatigue?
Alert fatigue happens when too many alerts fire, operators stop paying attention, and real incidents get missed.
Prevention strategy: only alert on symptoms (user-visible impact), not causes. Symptom alert: high error rate. Cause alert: database connection timeout, which is often too specific.
Four golden signals to always alert on:
  • Latency (P99 > SLA)
  • Traffic (abnormal request rate)
  • Errors (error rate > threshold)
  • Saturation (CPU/memory/disk approaching limit)
For each alert: add a for duration (5 minutes for critical, 15 for warning) to prevent noise from transient spikes. Add a runbook_url annotation pointing to the documented response.
Inhibition rules: if a node is down, suppress all pod alerts from that node, because the root cause is the node, not the individual pods.
Group related alerts in Alertmanager: group by namespace so 50 pod alerts from one bad deployment become one grouped notification.
Route by team: platform team gets infra alerts, app team gets their service alerts.
Review alerts monthly: if an alert fires more than twice a week for a non-incident, its threshold is too low or it is not important enough to be an alert.
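
The inhibition idea can be illustrated with a small Python model. It is a sketch only: the function and alert names are hypothetical, and it matches source and target alerts on alertname alone, whereas Alertmanager inhibit rules use full label matchers.

```python
def apply_inhibition(alerts, source_name, target_names, equal):
    """Drop target alerts while a source alert with the same 'equal' labels exists."""
    sources = [a for a in alerts if a["alertname"] == source_name]
    kept = []
    for alert in alerts:
        inhibited = alert["alertname"] in target_names and any(
            all(alert.get(label) == src.get(label) for label in equal)
            for src in sources
        )
        if not inhibited:
            kept.append(alert)
    return kept


# node-1 is down, so its pod alerts are noise; node-2's pod alert is a separate problem.
alerts = [
    {"alertname": "NodeDown", "node": "node-1"},
    {"alertname": "PodNotReady", "node": "node-1", "pod": "api-0"},
    {"alertname": "PodNotReady", "node": "node-1", "pod": "api-1"},
    {"alertname": "PodNotReady", "node": "node-2", "pod": "web-0"},
]
for alert in apply_inhibition(alerts, "NodeDown", {"PodNotReady"}, equal=["node"]):
    print(alert)
```

Only the node-level alert and the unrelated pod alert remain, so the on-call engineer sees the root cause instead of a page per pod.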

🗺️ Roadmap

Day 1: Install
  • Install kube-prometheus-stack
  • Access Grafana and import K8s dashboards
  • Run first PromQL query

Week 1: PromQL
  • Learn rate(), histogram_quantile()
  • CPU, memory, error rate queries
  • Instrument your own app with /metrics

Week 2: Alerting
  • Write PrometheusRule with for duration
  • Configure Alertmanager Slack route
  • Test alert end-to-end

Month 2: Production
  • Recording rules for expensive queries
  • Cardinality control
  • Thanos or VictoriaMetrics for long-term storage
  • OpenTelemetry integration