Loki — LearnwithVishnu

📊 Loki

Beginner · Engineer · Architect
Log aggregation: Promtail → Loki → Grafana, LogQL queries, vs Elasticsearch

📊 What is Loki?

Loki is a log aggregation system built by Grafana Labs — designed as the logging equivalent of Prometheus. It collects logs from all your applications and infrastructure, stores them cost-efficiently, and lets you query them using LogQL — a language with the same feel as Prometheus's PromQL.

Loki is part of the PLG stack: Promtail (log collector on every node) → Loki (log storage and query engine) → Grafana (visualisation and alerting). All three integrate natively — you query metrics and logs side by side in the same Grafana dashboard.

Why Loki is different from Elasticsearch

The fundamental difference is the indexing strategy. Elasticsearch indexes the full text of every log, so every word becomes searchable. That makes arbitrary searches fast, but the index is large (20-30% of raw log volume) and Elasticsearch clusters are expensive to operate. Loki indexes only labels (metadata: namespace, app, pod, environment) and never the log content. Log content is stored as compressed text chunks in cheap object storage (S3, GCS, Azure Blob Storage).

	Loki	Elasticsearch (ELK)
What is indexed	Labels only (namespace, app, pod)	Full text of every log message
Storage cost	Very low — compressed chunks in S3/GCS	High — Elasticsearch index + shard storage
Query speed	Fast for label-based, slower for full-text scan	Fast for any text search
Setup complexity	Simple — 3 components, one Helm chart	Complex — ES cluster sizing, JVM tuning
Integration	Native Grafana data source	Kibana (separate UI)
Best for	K8s logs with consistent labels	Full-text search, unknown log patterns

🏗️ Architecture — Promtail, Loki, Grafana

Loki components and how they work together

Component	What it does
Promtail	Agent that runs on every node. Reads container logs from /var/log/pods/, attaches K8s labels (pod, namespace, app), sends to Loki.
Loki	The log aggregation server. Receives logs, indexes only labels (not content), stores compressed chunks in object storage (S3/GCS/Azure Blob).
Grafana	Query and visualise logs using Explore tab + LogQL. Create dashboards combining Prometheus metrics and Loki logs.
Ruler	Evaluates LogQL alerting and recording rules on a schedule and sends firing alerts to Alertmanager (the role Prometheus rule evaluation plays for metrics, but for logs).

Why Loki is different from ELK

	Loki	ELK (Elasticsearch)
Index	Labels only (like Prometheus) — small index	Full-text search index — large index
Storage cost	Very low — compressed chunks in object storage	High — Elasticsearch is expensive to store
Query speed	Fast for label-based queries, slower for full-text	Faster for arbitrary text search
Setup complexity	Simple — Helm chart, 3 components	Complex — ES cluster tuning required
Best for	Known log patterns, Kubernetes label-based queries	Full-text search, unknown log patterns

Install Loki stack with Helm

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Loki + Promtail + Grafana together (prometheus.enabled=true also deploys Prometheus)
helm install loki-stack grafana/loki-stack \
  --namespace monitoring --create-namespace \
  --set grafana.enabled=true \
  --set prometheus.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=50Gi

Promtail config — scrape K8s pod logs

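scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod               # discover pods through the Kubernetes API
  pipeline_stages:
  - docker: {}              # parse Docker json-file format (use `cri: {}` on containerd/CRI-O)
  - json:                   # parse JSON app logs into extracted fields (not labels)
      expressions:
        level: level
        msg: message
        duration: duration
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app]
    target_label: app
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  # Tell Promtail which log files to tail for each discovered pod
  - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
    separator: /
    target_label: __path__
    replacement: /var/log/pods/*$1/*.log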

🔍 LogQL — Querying Logs

LogQL syntax — label selector then pipeline filters

Every LogQL query starts with a log stream selector in curly braces (selects which logs using the label index — fast). Then optional pipeline stages filter or parse the content.

# Basic label selection
{namespace="production"}
{namespace="production", app="payment-api"}

# Text filters
{namespace="production"} |= "ERROR"            # contains ERROR
{namespace="production"} != "DEBUG"            # excludes DEBUG
{namespace="production"} |~ "timeout|refused"  # regex match

# JSON log parsing
{app="payment-api"} | json | level="error"
{app="payment-api"} | json | duration > 1000

# Metric queries: per-second rate of matching log lines, averaged over 5m
rate({namespace="production"} |= "ERROR" [5m])

# Error rate per service over time
sum by (app) (rate({namespace="production"} |= "ERROR" [5m]))

# Error ratio (errors / total)
sum(rate({namespace="production"} |= "ERROR" [5m])) /
sum(rate({namespace="production"} [5m]))

Common LogQL patterns

Use case	LogQL query
All errors in namespace	`{namespace="production"} \|= "ERROR"`
Specific pod logs	`{pod="payment-api-7d8f9-xyz"}`
JSON field filter	`{app="api"} \| json \| status_code >= 500`
Error rate per app	`sum by (app) (rate({namespace="production"} \|= "ERROR" [5m]))`
p99 request duration per endpoint	`quantile_over_time(0.99, {app="api"} \| json \| unwrap duration [5m]) by (endpoint)`

Using Loki in Grafana

In Grafana: go to Explore → select Loki as data source → paste LogQL query. The result shows matching log lines in a timeline. Switch to Metrics view to see the rate chart. Create Dashboard panels combining Prometheus error rate metrics with Loki log lines for the same service side by side — correlation without context switching.
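
The data source selected in Explore can also be declared as a Grafana provisioning file instead of being added by hand. The fragment below is a minimal sketch under assumptions: the file path, the Service name `loki-stack`, port 3100 and the `maxLines` value are illustrative and not taken from this page, and a Helm install with Grafana enabled may already provision an equivalent data source.

```yaml
# Hypothetical file: /etc/grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy                   # Grafana's backend proxies queries to Loki
    url: http://loki-stack:3100     # assumed in-cluster Service name and port
    jsonData:
      maxLines: 1000                # assumed cap on log lines returned per query
```

Keeping the data source in a file makes the Grafana setup reproducible across clusters.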

🚀 Promtail — Log Collection

Promtail config + drop health checks
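
One way to drop health check noise is a `drop` stage in the Promtail pipeline. This is a sketch rather than the configuration this page originally showed: the probe paths (`/healthz`, `/readyz`, `/livez`), the HTTP methods and the counter reason are assumed examples, and the `cri` stage assumes a containerd or CRI-O runtime.

```yaml
scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  pipeline_stages:
  - cri: {}                   # assumes containerd/CRI-O; use `docker: {}` on Docker
  - drop:
      # RE2 regex tested against the log line; the paths are assumed examples
      expression: '.*(GET|HEAD) /(healthz|readyz|livez).*'
      # Label value on the dropped-lines counter, to track how much is discarded
      drop_counter_reason: health_check
```

Dropping at the agent means the lines never reach Loki, so they cost neither ingestion rate nor storage.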

🚨 Alerting from Logs

Alert on log content — no need to instrument your app

Loki Ruler evaluates LogQL expressions on a schedule and fires alerts — just like Prometheus evaluates PromQL. Alert on ERROR rate, specific exception messages, or any log pattern.

Loki alerting rules + Grafana dashboard
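
Ruler rule files use the same group and rule layout as Prometheus alerting rules, with LogQL in `expr`. The sketch below is illustrative only: the rule names, thresholds, `for` duration, severity labels and the `OutOfMemoryError` pattern are assumptions, not values from this page.

```yaml
# Hypothetical rule file loaded by the Loki Ruler
groups:
  - name: production-log-alerts
    rules:
      - alert: HighErrorLogRate
        # Fires when an app logs more than 1 ERROR line per second,
        # averaged over 5 minutes (assumed threshold)
        expr: |
          sum by (app) (rate({namespace="production"} |= "ERROR" [5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High ERROR log rate for {{ $labels.app }}"
      - alert: PaymentApiOutOfMemory
        # Fires as soon as the pattern appears at least once in 5 minutes
        expr: |
          sum(count_over_time({namespace="production", app="payment-api"} |= "OutOfMemoryError" [5m])) > 0
        labels:
          severity: critical
        annotations:
          summary: "payment-api logged OutOfMemoryError"
```

The first rule alerts on a rate and the second on the presence of a specific message, which covers the two cases described above.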

🎯 Interview Questions

LOKI · ENGINEER

What is Loki and how does it differ from ELK?

Loki is a log aggregation system by Grafana Labs designed to be cost-effective and Kubernetes-native. The key architectural difference from ELK: Loki does not index the content of log messages. It only indexes the labels (namespace, pod, container, app). This is the same approach Prometheus uses for metrics: a small indexed label set pointing at large volumes of time-ordered data. ELK indexes every word in every log message. This makes ELK searches very fast for any search term but requires significant storage and compute to maintain the Lucene index. Loki typically needs far less storage for the same log volume (savings around 10x are often quoted, though the real figure depends on the workload) and requires no shard management. The tradeoff: searching for a specific string inside logs is slower in Loki because it must scan the log lines in the selected streams. Loki is a good fit for: teams already using Grafana for metrics who want logs in the same UI, Kubernetes environments where label-based filtering (show me all errors from namespace production) is the primary use case, and cost-sensitive environments. Choose ELK when you need fast full-text search across billions of logs or complex analytics.

LOKI · ARCHITECT

How do you design a Loki deployment for a production Kubernetes cluster handling 10GB logs/day?

Architecture for production Loki: use Loki in microservices mode (distributed components) rather than single binary for scale and resilience.

Components: Distributor (receives and validates log streams from Promtail), Ingester (buffers in memory, flushes chunks to object storage), Querier (runs LogQL queries), Query Frontend (splits, parallelises and caches queries).

Storage: use object storage (S3 or Azure Blob) for chunks, which is much cheaper than EBS/SSD. The index (TSDB or BoltDB Shipper) is shipped to object storage as well, so local SSD is only needed for the ingester write-ahead log and the local index cache.

Retention: set retention_period in limits_config, for example 720h for 30 days. It is only enforced when retention is enabled on the compactor.

Compression: chunks are compressed (gzip by default; Snappy is a common alternative that trades some compression ratio for faster queries). Assuming roughly 10:1, 10GB/day of raw logs becomes about 1GB/day stored.

Sizing: 10GB/day with 30-day retention is 300GB of raw logs, or about 30GB of compressed storage at 10:1. 3 ingesters with 1.5GB memory each handle the ingest.

Promtail tuning: drop health check logs, which can reduce volume by 20-30%. Add pipeline stages to parse JSON logs, but promote only low-cardinality fields (such as level) to labels; high-cardinality labels create many small streams and slow Loki down.

Query limits: set max_query_lookback to 720h (30 days) in limits_config so queries cannot reach further back than the retained data.
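
The retention and query-range settings from this answer can be sketched as a Loki configuration fragment. It assumes 30-day retention as in the sizing example; depending on the Loki version, the compactor may need additional fields (for instance a delete request store) before retention can be enabled, so treat this as a starting point rather than a complete config.

```yaml
compactor:
  retention_enabled: true      # retention is only applied when the compactor runs it
limits_config:
  retention_period: 720h       # 30 days, matching the sizing example
  max_query_lookback: 720h     # reject queries that reach further back than retention
```

Keeping the two durations equal avoids queries that scan time ranges where no data is retained.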

LOKI · ENGINEER

How does Loki differ from Elasticsearch for log management?

The fundamental difference is indexing strategy. Elasticsearch indexes the full text of every log message, so every word in every log is searchable. This makes arbitrary text search very fast, but the index is large (often 20-30% of raw log size) and Elasticsearch clusters are expensive to run. Loki only indexes labels (metadata like namespace, app, pod, environment), not the log content. Log content is stored as compressed text chunks in cheap object storage (S3, GCS, Azure Blob). Storage cost is dramatically lower. The query approach differs too: Loki queries use LogQL, where you first select logs by labels (fast, uses the index), then optionally filter by content (slower, scans the selected chunks). ELK allows full-text search across all logs without knowing labels upfront. When to choose Loki: Kubernetes environments where logs have consistent labels, cost-sensitive deployments, teams already using Prometheus and Grafana (logs and metrics in the same UI, with Loki as a native data source). When to choose ELK: complex full-text search requirements, log analytics with unknown patterns, teams needing powerful aggregations and visualisations. In practice: Loki is much cheaper and simpler to operate for Kubernetes log aggregation. ELK is more powerful for complex log analysis. At HPE we used Loki for Kubernetes application logs and a separate ELK for security/audit logs that needed full-text search.

LOKI · ENGINEER

Write a LogQL query to find all error logs in the production namespace and calculate error rate.

To find all error logs: {namespace="production"} |= "ERROR". This selects all log streams with the label namespace=production, then filters to lines containing ERROR.

For JSON structured logs, be more specific: {namespace="production"} | json | level="error". This parses each line as JSON and keeps lines where the level field equals error.

To calculate the error rate per app: sum by (app) (rate({namespace="production"} |= "ERROR" [5m])). rate returns error lines per second, averaged over a 5-minute window, and sum by (app) groups the result by the app label. Multiply by 60 for a per-minute figure.

To compare errors to total logs (error ratio): sum(rate({namespace="production"} |= "ERROR" [5m])) / sum(rate({namespace="production"} [5m])).

To find errors from a specific set of pods: {namespace="production", pod=~"payment-api.*"} |= "ERROR".

To format recent errors for reading: {namespace="production"} |= "ERROR" | json | line_format "{{.timestamp}} [{{.level}}] {{.msg}}". The number of lines returned (for example the last 100) is set by the query's line limit in Grafana, not in LogQL, and the template assumes the JSON logs carry timestamp, level and msg fields.

In Grafana: use the Explore tab, select Loki as data source, paste the LogQL query. Create a dashboard panel combining this with Prometheus metrics, for example an error rate chart beside a latency P99 chart for the same service.

LOKI · PRODUCTION

Loki is missing logs from some pods. How do you troubleshoot?

Step 1: check Promtail is running on the affected node. kubectl get pods -n monitoring -l app=promtail -o wide (the label selector depends on how Promtail was installed). If a node has no Promtail pod (DaemonSet issue), all pods on that node will have missing logs. kubectl describe daemonset promtail -n monitoring shows whether there are scheduling issues.

Step 2: check Promtail logs on the affected node. kubectl logs -n monitoring promtail-xyz | tail -50. Look for: permission denied (Promtail cannot read log files), target matching errors (pod labels not matching the scrape config), connection refused to Loki.

Step 3: verify the pod logs exist on disk. kubectl exec -n monitoring promtail-xyz -- ls /var/log/pods/. If the directory for the pod is missing, the container runtime has not created logs.

Step 4: check label matching. Promtail scrapes pods matching specific labels. If the pod does not have the expected labels (app, namespace), it may not be scraped. kubectl get pod mypod -n production --show-labels.

Step 5: check Loki ingestion. Loki has ingestion rate limits. If one noisy service is flooding logs, others may be rate-limited. kubectl logs -n monitoring loki-0 | grep -i rate.

Step 6: check Loki storage. If the persistent volume is full, Loki stops accepting new logs. kubectl exec -n monitoring loki-0 -- df -h.
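
If the ingestion check shows rate limiting, the relevant limits live in Loki's `limits_config`. The fragment below is a sketch with assumed values; suitable numbers depend on the cluster's log volume and are not taken from this page.

```yaml
limits_config:
  ingestion_rate_mb: 8               # per-tenant sustained ingest rate (assumed value)
  ingestion_burst_size_mb: 16        # per-tenant burst allowance (assumed value)
  per_stream_rate_limit: 5MB         # cap for a single stream, so one noisy pod is limited first
  per_stream_rate_limit_burst: 20MB  # short burst allowance per stream (assumed value)
```

Raising the tenant limits hides a noisy service rather than fixing it, so a per-stream cap or a Promtail drop stage for that service is usually the safer first step.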
