LearnwithVishnu
Basics → Production → Architect
🔄 HA & DR
High Availability and Disaster Recovery: RTO, RPO, Kubernetes HA, database failover

🔄 HA vs DR — Core Concepts

Two Different Problems

| Aspect | High Availability (HA) | Disaster Recovery (DR) |
|---|---|---|
| Scenario | Pod crashes, node fails, AZ down | Entire region unavailable |
| Goal | Zero downtime during partial failure | Recover from total failure |
| RTO | Seconds to minutes | Minutes to hours |
| Solution | Multiple replicas, anti-affinity, PDB | Multi-region, backups, runbooks |
| Cost | Medium (+50-100% infra) | High (+100-200% infra for active-passive) |
RTO/RPO/MTTR + CAP theorem

☸️ HA in Kubernetes

PDB, anti-affinity, topology spread, health probes

🗄️ Database HA

PostgreSQL HA, RDS Multi-AZ, backup strategy

🌍 Disaster Recovery

DR Tiers

| Strategy | RPO | RTO | Cost | Use when |
|---|---|---|---|---|
| Active-Active | ~0 | ~0 | 2x | RTO/RPO requirements are seconds |
| Active-Passive (warm) | Minutes | 5-15 min | 1.5x | Business-critical, can afford 15 min downtime |
| Backup + Restore | Hours | 1-4 hours | 1.1x | Non-critical, cost-sensitive |
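As a rough illustration of how the table can drive a decision, the sketch below picks the cheapest tier whose worst-case RPO and RTO still fit the targets. The `DrTier` class and `cheapest_tier` function are hypothetical names, and the numeric bounds for "Minutes" (5) and "Hours" (240) are assumptions, since the table gives no exact RPO figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrTier:
    name: str
    worst_rpo_min: float    # worst-case data loss, in minutes
    worst_rto_min: float    # worst-case downtime, in minutes
    cost_multiplier: float  # infra cost relative to a single-region setup


# RTO bounds and costs come from the table above.
# RPO bounds for "Minutes" (5) and "Hours" (240) are ASSUMED for illustration.
TIERS = [
    DrTier("Backup + Restore", worst_rpo_min=240, worst_rto_min=240, cost_multiplier=1.1),
    DrTier("Active-Passive (warm)", worst_rpo_min=5, worst_rto_min=15, cost_multiplier=1.5),
    DrTier("Active-Active", worst_rpo_min=0, worst_rto_min=0, cost_multiplier=2.0),
]


def cheapest_tier(rpo_target_min, rto_target_min):
    """Return the lowest-cost tier that satisfies both targets, or None."""
    candidates = [
        t for t in TIERS
        if t.worst_rpo_min <= rpo_target_min and t.worst_rto_min <= rto_target_min
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.cost_multiplier)


if __name__ == "__main__":
    for rpo, rto in [(10, 15), (0.5, 1), (24 * 60, 8 * 60)]:
        tier = cheapest_tier(rpo, rto)
        print(f"RPO<={rpo} min, RTO<={rto} min -> {tier.name}")
```

Real numbers should come from measured failover and restore drills rather than from these placeholder bounds.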
Velero K8s backup, DR runbook, chaos engineering

🎯 Interview Questions

HA/DR · ENGINEER
What is the difference between RTO and RPO? Give a concrete example.
RTO (Recovery Time Objective) is the maximum acceptable time your system can be down after a failure. It answers: how long until we must be back online? RPO (Recovery Point Objective) is the maximum acceptable amount of data loss, measured as a window of time. It answers: how old can our last good copy of the data be?

Concrete example: a payment processing system. The business decides it cannot afford to be down for more than 15 minutes (RTO = 15 min) and cannot lose more than 1 minute of transaction data (RPO = 1 min). These requirements drive architecture decisions. An RTO of 15 minutes means you need a warm standby that can be promoted quickly, not a cold backup that takes 2 hours to restore. An RPO of 1 minute means you need synchronous or near-synchronous replication; daily backups would give an RPO of up to 24 hours.

Lower RTO and RPO mean higher infrastructure cost. A system with RTO = 0 and RPO = 0 (no downtime, no data loss) requires an active-active multi-region architecture, which is very expensive. Choose RTO and RPO based on the business impact of downtime versus the cost of HA infrastructure.
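The payment example can be checked mechanically. The sketch below compares two candidate designs against RTO = 15 min and RPO = 1 min. `RecoveryDesign` and `objective_gaps` are hypothetical names; the 2-hour restore and 24-hour backup window come from the answer above, while the warm-standby figures (10 min promotion, 0.5 min replication lag) are assumed for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryDesign:
    name: str
    worst_data_loss_min: float  # age of the newest recoverable data, worst case
    recovery_time_min: float    # time to be back online, worst case


def objective_gaps(design, rto_min, rpo_min):
    """Return a list of human-readable objective violations (empty if none)."""
    problems = []
    if design.recovery_time_min > rto_min:
        problems.append(
            f"RTO missed: {design.recovery_time_min} min recovery > {rto_min} min target"
        )
    if design.worst_data_loss_min > rpo_min:
        problems.append(
            f"RPO missed: {design.worst_data_loss_min} min data loss > {rpo_min} min target"
        )
    return problems


# Daily cold backup: up to 24 h of data loss, 2 h restore (figures from the text).
daily_backup = RecoveryDesign("daily cold backup", worst_data_loss_min=24 * 60,
                              recovery_time_min=120)
# Warm standby: 0.5 min lag and 10 min promotion are ASSUMED values.
warm_standby = RecoveryDesign("warm standby", worst_data_loss_min=0.5,
                              recovery_time_min=10)

if __name__ == "__main__":
    for design in (daily_backup, warm_standby):
        gaps = objective_gaps(design, rto_min=15, rpo_min=1)
        print(design.name, "->", gaps or "meets RTO and RPO")
```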
HA/DR · ARCHITECT
How do you design a highly available application on Kubernetes?
HA in Kubernetes requires multiple layers working together.

Application layer: run a minimum of 3 replicas, never 1. Add a Pod Disruption Budget (PDB) that keeps at least 2 pods running during voluntary disruptions such as node drains. Configure liveness and readiness probes so Kubernetes knows when a pod is unhealthy. Use anti-affinity rules to spread pods across availability zones: if all pods are in AZ-A and AZ-A goes down, you have zero replicas. Topology spread constraints are the modern way to enforce this.

Infrastructure layer: multiple nodes across multiple AZs, with node auto-scaling through Cluster Autoscaler.

Database layer: PostgreSQL with streaming replication, plus connection pooling with PgBouncer for resilience during failovers.

Networking layer: Services with session affinity disabled (stateless pods), and graceful shutdown with terminationGracePeriodSeconds set at least as long as your request timeout.

The test: can you drain any single node without downtime? Run kubectl drain <node-name> --ignore-daemonsets. If this causes alerts or errors, your HA is incomplete. Run this test monthly in staging.
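The zone-spread argument can be made concrete with a small check: given the zone each replica landed in, does the workload still have enough pods after the most crowded zone disappears? This is a minimal sketch, not a Kubernetes API call. `survives_single_zone_loss` is a hypothetical helper, and `min_available` here simply reuses the "at least 2 pods" figure as the number of replicas needed to keep serving.

```python
from collections import Counter


def survives_single_zone_loss(pod_zones, min_available):
    """Check whether enough replicas remain after losing any one zone.

    pod_zones: list with one zone name per running replica.
    min_available: replicas required to keep serving traffic (assumed value).
    """
    per_zone = Counter(pod_zones)
    # The worst case is losing the zone that holds the most replicas.
    worst_loss = max(per_zone.values(), default=0)
    return len(pod_zones) - worst_loss >= min_available


if __name__ == "__main__":
    placements = {
        "all in one AZ": ["az-a", "az-a", "az-a"],
        "two AZs": ["az-a", "az-a", "az-b"],
        "three AZs": ["az-a", "az-b", "az-c"],
    }
    for label, zones in placements.items():
        ok = survives_single_zone_loss(zones, min_available=2)
        print(f"{label}: {'survives' if ok else 'does NOT survive'} a zone outage")
```

Note that three replicas over only two zones still fail the check with a requirement of 2, which is one reason to spread evenly across three zones rather than just "more than one".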
HA/DR · PRODUCTION
Production database just failed. Walk through your incident response.
Use a structured, time-boxed response.

First 2 minutes: assess, do not act. Is this a node failure (the standby should auto-promote), a network issue (routing problem), or true data loss? Check the database pod status with kubectl, the cloud monitoring service for the managed instance status (CloudWatch for RDS, Azure Monitor for Azure-hosted databases), and application error logs to understand when the issue started.

Minutes 2-5: confirm that automatic failover is in progress, and trigger it if it is not. For RDS Multi-AZ, failover is automatic and typically completes in about 60-120 seconds. Monitor progress with aws rds describe-events. For self-managed PostgreSQL with Patroni, run patronictl -c /etc/patroni/patroni.yml list; it shows the cluster state and should show the new leader.

Minutes 5-15: verify that applications have reconnected. Applications behind a connection pooler (PgBouncer) usually reconnect with little or no intervention. Applications with direct connections may need a restart. Check application health endpoints and update the incident status channel.

Minutes 15-30: if automatic failover did not happen, fail over manually. For RDS: aws rds reboot-db-instance --db-instance-identifier <instance-id> --force-failover.

Post-incident: run a postmortem. Was backup restoration tested recently? Did the failover time meet the RTO? Update the runbook based on what took longer than expected.
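For the postmortem question of whether the failover met the RTO, a small calculation from incident timestamps is usually enough. The sketch below is illustrative: `recovery_report` is a hypothetical helper, and the timestamps are made-up sample values, not data from a real incident.

```python
from datetime import datetime, timedelta


def recovery_report(failure_at, recovered_at, rto):
    """Compare measured downtime against the RTO.

    failure_at / recovered_at: datetime of first failure and of verified recovery.
    rto: timedelta with the agreed Recovery Time Objective.
    """
    downtime = recovered_at - failure_at
    return {
        "downtime": downtime,
        "met_rto": downtime <= rto,
        # How far past the objective the recovery ran (zero if it was met).
        "overrun": max(downtime - rto, timedelta(0)),
    }


if __name__ == "__main__":
    # Sample timestamps, assumed for illustration only.
    failure = datetime(2024, 1, 10, 10, 0, 0)
    recovered = datetime(2024, 1, 10, 10, 22, 0)
    report = recovery_report(failure, recovered, rto=timedelta(minutes=15))
    print(f"Downtime: {report['downtime']}, met RTO: {report['met_rto']}, "
          f"overrun: {report['overrun']}")
```

Recording the end of each time box from the runbook as its own timestamp makes it easier to see which phase consumed the budget.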