LearnwithVishnu
Basics → Production → Architect
AWS
Amazon Web Services: core services, VPC, IAM, EKS, cost optimisation
Levels: Beginner, Engineer, Production, Architect

🌍 AWS Global Infrastructure

What is a Region and Availability Zone?

AWS has 33+ Regions globally. Each Region is an independent geographic area with its own power, cooling, and networking. Each Region contains at least three Availability Zones (AZs), and the largest have six. An AZ is one or more physical data centres with independent power and connectivity, connected to the other AZs in the same Region via private, high-bandwidth, low-latency links.

Why this matters for architecture: Deploying across 2+ AZs gives you resilience, provided traffic is load-balanced and data is replicated across them. If one AZ goes down (fire, power failure), your application keeps running in the others. Deploying across 2+ Regions protects against regional failures and lets you serve users from a nearby Region with lower latency.
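The resilience argument above can be put into rough numbers. The sketch below assumes AZ failures are independent and that the application survives as long as one AZ is up; both are simplifications, and the 99% figure in the usage note is an illustrative input, not an AWS number.

```python
def multi_az_availability(single_az: float, az_count: int) -> float:
    """Probability that at least one of az_count independent AZs is up.

    single_az is the assumed availability of one AZ (0.0 to 1.0).
    """
    if not 0.0 <= single_az <= 1.0:
        raise ValueError("single_az must be between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    # All AZs must fail at once for the application to be down.
    return 1.0 - (1.0 - single_az) ** az_count
```

With an assumed 99% per-AZ availability, two AZs give roughly 99.99% and three give roughly 99.9999%, which is why the second AZ is the step that matters most.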

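| Region | Code | Common use |
|---|---|---|
| Mumbai | ap-south-1 | Indian companies, low latency for India |
| N. Virginia | us-east-1 | Common default; usually among the cheapest; new services tend to launch here first |
| Singapore | ap-southeast-1 | Southeast Asia |
| Ireland | eu-west-1 | Europe; keeps data inside the EU for GDPR-sensitive workloads |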

🔐 IAM — Identity and Access Management

The Most Important AWS Concept

IAM controls who can do what to which AWS resources. Every API call to AWS is authenticated (who are you?) and authorised (are you allowed to do this?). Misconfigured IAM is one of the most common causes of AWS security incidents.

IAM Concepts

| Concept | What it is | When to use |
|---|---|---|
| User | Long-term identity with permanent credentials | Only when roles are impossible. Always enforce MFA. |
| Role | Temporary credentials assumed by services or users | EC2, Lambda, EKS pods, cross-account: almost everything |
| Group | Collection of users | Organise humans by team. Attach policies to the group. |
| Policy | JSON document: what actions on what resources | Attach to user, group, or role |
| IRSA | IAM role for a Kubernetes service account (EKS) | Give pods AWS access without stored keys |

Policy Evaluation Logic

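By default, everything is denied (implicit deny). An explicit DENY always overrides an ALLOW, even one from another policy. Only an explicit ALLOW grants access. Within a single account, an ALLOW in either the identity policy or the resource policy is generally enough. For cross-account access, BOTH the resource policy and the identity policy must allow the action.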
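The evaluation order described above (implicit deny, explicit deny wins, otherwise an explicit allow is required) can be sketched as a small function. This is a simplified model for intuition only: it ignores conditions, principals, permission boundaries and SCPs, and it matches actions case-sensitively, which real IAM does not. The statement shape and the `evaluate` name are illustrative.

```python
from fnmatch import fnmatchcase


def evaluate(statements, action, resource):
    """Return "Allow" or "Deny" for one request against a list of statements.

    Each statement is a dict with "Effect", "Action" (list of patterns)
    and "Resource" (list of patterns); "*" and "?" act as wildcards.
    """
    allowed = False
    for stmt in statements:
        if not any(fnmatchcase(action, p) for p in stmt["Action"]):
            continue
        if not any(fnmatchcase(resource, p) for p in stmt["Resource"]):
            continue
        if stmt["Effect"] == "Deny":
            return "Deny"  # an explicit deny always wins
        allowed = True  # remember the allow, keep scanning for denies
    return "Allow" if allowed else "Deny"  # no match means implicit deny


statements = [
    {"Effect": "Allow", "Action": ["s3:*"], "Resource": ["arn:aws:s3:::logs/*"]},
    {"Effect": "Deny", "Action": ["s3:DeleteObject"], "Resource": ["*"]},
]
```

With these two statements, reading from the logs bucket is allowed, deleting from it is denied despite the broad allow, and any EC2 action is denied because nothing mentions it.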


🌐 VPC — Virtual Private Cloud

What is a VPC?

A VPC is your private network inside AWS. Resources such as EC2 instances, RDS databases, and EKS nodes run inside a VPC. You control IP address ranges, subnets, routing, and firewalls. Nothing in the VPC is reachable from the internet unless you explicitly allow it.

Key Components

| Component | Purpose |
|---|---|
| Subnet | Subdivision of a VPC in one AZ. Public = has an internet route. Private = no direct internet. |
| Internet Gateway | Allows public subnets to reach the internet. One per VPC. |
| NAT Gateway | Allows private subnets to initiate internet connections (for updates etc.) without being reachable from the internet. |
| Security Group | Stateful firewall at instance (ENI) level. Allow rules only. |
| NACL | Stateless firewall at subnet level. Allow and deny rules. |
| VPC Peering | Connects two VPCs privately (no internet). Non-transitive. |
| VPC Endpoint | Access AWS services (S3, DynamoDB) without internet traffic. Saves NAT cost. |

💻 EC2 — Compute

Choosing the Right Instance Type

The instance family tells you what it is optimised for. The size (small/medium/large/xlarge) tells you how much CPU and memory you get. A sensible default is to start with general purpose (m6i), then right-size based on metrics after about 2 weeks in production.
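To make the naming scheme concrete, here is a minimal sketch that splits a type name such as m6i.xlarge into family, generation, attribute letters and size. The regular expression is an assumption that covers the common pattern only; unusual names (for example the high-memory u- types) will not match.

```python
import re

# family letters, generation digits, optional attribute letters, then ".size"
_INSTANCE_TYPE = re.compile(r"^([a-z]+)(\d+)([a-z-]*)\.([a-z0-9-]+)$")


def parse_instance_type(name: str) -> dict:
    """Split an EC2 instance type name into its parts."""
    match = _INSTANCE_TYPE.match(name)
    if match is None:
        raise ValueError(f"unrecognised instance type: {name!r}")
    family, generation, attributes, size = match.groups()
    return {
        "family": family,             # e.g. "m" = general purpose
        "generation": int(generation),
        "attributes": attributes,     # e.g. "i" = Intel, "g" = Graviton
        "size": size,
    }
```

For example, `parse_instance_type("c7gn.2xlarge")` yields family `c`, generation 7, attributes `gn` and size `2xlarge`.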


🪣 S3 — Object Storage

S3 Fundamentals

S3 is object storage: you store objects (files) identified by a key (path). It is not a filesystem. It scales without a fixed capacity limit and is designed for 11 nines of durability (99.999999999%). Used for: backups, static websites, data lakes, application artifacts, logs, container images (via ECR), Terraform state.
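The 11 nines figure is easier to grasp as arithmetic. The sketch below treats durability as an independent annual loss probability per object, which is a simplifying assumption; the function names are illustrative.

```python
def expected_annual_object_loss(object_count: int, nines: int = 11) -> float:
    """Expected number of objects lost per year at the given durability."""
    annual_loss_probability = 10 ** -nines  # 11 nines -> 1e-11
    return object_count * annual_loss_probability


def years_per_lost_object(object_count: int, nines: int = 11) -> float:
    """Average number of years between single-object losses."""
    return 1.0 / expected_annual_object_loss(object_count, nines)
```

Storing 10 million objects gives an expected loss of about 0.0001 objects per year, or roughly one object every 10,000 years. Durability is about not losing data; it says nothing about availability or about accidental deletion, which is why versioning still matters.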

S3 vs EBS vs EFS

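| | S3 | EBS | EFS |
|---|---|---|---|
| Type | Object storage | Block storage | File system |
| Access | HTTP API | One EC2 instance at a time (io1/io2 Multi-Attach excepted) | Multiple instances |
| Use for | Backups, static files, data lake | OS disk, databases | Shared content across instances |
| Latency | ms (network) | Sub-ms to single-digit ms (network-attached) | ms (network) |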

☸️ EKS — Kubernetes on AWS

What is EKS?

EKS is AWS's managed Kubernetes service. AWS manages the control plane (API server, etcd, scheduler) — you never touch master nodes. You manage worker nodes (EC2 instances or Fargate). The same kubectl commands work on EKS as on any Kubernetes cluster.

EKS vs ECS vs Fargate

| | EKS | ECS | Fargate |
|---|---|---|---|
| What it is | Managed Kubernetes | AWS-native container orchestration | Serverless compute for containers (no node management); runs under ECS or EKS |
| Learning curve | High (K8s knowledge needed) | Medium (AWS-specific) | Low |
| Use when | Team knows K8s, multi-cloud, complex | AWS-only, simpler needs | No node management wanted |

💰 Cost Optimisation

Where AWS Cost Goes — and How to Reduce It

| Cost Driver | Savings Strategy | Typical Saving |
|---|---|---|
| EC2 (on-demand) | Spot for stateless, Savings Plans, Graviton instances | 40-70% |
| RDS | Reserved instances (1-year), right-size, stop dev instances overnight | 30-50% |
| NAT Gateway | VPC endpoints for S3/DynamoDB traffic | 20-40% |
| S3 | Lifecycle policies to Glacier, S3 Intelligent-Tiering | 50-80% |
| Data Transfer | CloudFront for CDN, same-region replication, compress payloads | 30-60% |

☸️ EKS — Deep Dive for Production

EKS architecture vs AKS — key differences to know

| | AWS EKS | Azure AKS |
|---|---|---|
| Control plane cost | $0.10/hr per cluster (about $73/month) | Free tier available; the paid Standard tier adds an uptime SLA |
| Pod identity | IAM Roles for Service Accounts (IRSA) | Workload Identity (Microsoft Entra ID, formerly Azure AD, federation) |
| Node types | Managed Node Groups, Self-managed, Fargate | Node Pools (system + user) |
| Networking | VPC CNI: pods get VPC IPs | Azure CNI: pods get VNet IPs |
| Load Balancer | AWS Load Balancer Controller creates ALB/NLB | AGIC creates Application Gateway |
| Storage | EBS CSI driver, EFS CSI driver | Azure Disk CSI, Azure Files CSI |

IRSA — IAM Roles for Service Accounts

IRSA is the AWS equivalent of Azure Workload Identity. It links a Kubernetes ServiceAccount to an IAM Role, allowing pods to access AWS services (S3, DynamoDB, Secrets Manager) without any stored credentials.

# 1. Create IAM OIDC provider for the EKS cluster
eksctl utils associate-iam-oidc-provider --cluster myeks --approve

# 2. Create an IAM role with a trust policy for the ServiceAccount
# (123456789012 is a placeholder 12-digit account ID)
eksctl create iamserviceaccount \
  --name payment-service-sa \
  --namespace production \
  --cluster myeks \
  --attach-policy-arn arn:aws:iam::123456789012:policy/PaymentServicePolicy \
  --approve

# 3. Pod uses the ServiceAccount — gets AWS credentials automatically
# No access keys stored anywhere
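For readers who want to see what the trust policy behind step 2 roughly looks like, the sketch below builds one as a Python dict. The account ID and OIDC issuer URL in the usage note are placeholders, and the helper name is illustrative; eksctl generates the real policy for you.

```python
import json


def irsa_trust_policy(account_id, oidc_issuer, namespace, service_account):
    """Build a trust policy that lets one ServiceAccount assume the role."""
    issuer = oidc_issuer
    if issuer.startswith("https://"):
        issuer = issuer[len("https://"):]  # IAM uses the issuer without scheme
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                # Only this namespace/ServiceAccount pair may assume the role.
                f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{issuer}:aud": "sts.amazonaws.com",
            }},
        }],
    }


policy = irsa_trust_policy(
    "123456789012",                                          # placeholder account
    "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",   # placeholder issuer
    "production",
    "payment-service-sa",
)
print(json.dumps(policy, indent=2))
```

The `sub` condition is what limits the blast radius: a pod in another namespace, or one using a different ServiceAccount, cannot assume this role.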

Fargate for EKS — serverless nodes

Fargate eliminates node management — each pod runs on a dedicated micro-VM. You define Fargate profiles: which namespaces/labels use Fargate vs managed node groups. Best for: burst workloads, batch jobs, dev/test environments where you don't want to manage nodes. Not suitable for: DaemonSets (cannot run on Fargate), privileged pods, GPU workloads.

EKS add-ons — managed cluster components

| Add-on | What it does |
|---|---|
| CoreDNS | DNS for service discovery inside the cluster |
| kube-proxy | Network rules on each node |
| VPC CNI | Pod networking with VPC IPs |
| EBS CSI Driver | Dynamic persistent volume provisioning with EBS |
| AWS Load Balancer Controller | Creates ALB for Ingress, NLB for Service type LoadBalancer |

⚡ AWS Lambda and Serverless Architecture

Lambda — the key concepts

Lambda runs your code in response to events without you managing any servers. You deploy a function (Python, Node.js, Java, Go, etc.), configure what triggers it, and AWS scales it from 0 to thousands of instances automatically.

| Concept | Explanation |
|---|---|
| Trigger | What invokes the function: API Gateway (HTTP), S3 event (file upload), SQS message, EventBridge schedule (cron), DynamoDB stream |
| Execution environment | Isolated sandbox, 128 MB-10 GB memory, up to 15 minutes runtime, ephemeral /tmp storage (512 MB by default, configurable up to 10 GB) |
| Cold start | First invocation after an idle period: the environment must be initialised, adding roughly 100ms-3s latency. Mitigate with Provisioned Concurrency. |
| Concurrency | Each concurrent request gets its own execution environment. Default limit: 1000 concurrent executions per account per region. |
| Pricing | Pay per invocation ($0.20 per million) + duration (per GB-second). First 1 million requests/month free. |
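The pricing row can be turned into a rough monthly estimate. The request price comes from the table above; the per-GB-second price and the 400,000 GB-second free allowance are assumptions based on commonly published x86 pricing and may differ by region and architecture, so check the current price list before relying on the numbers.

```python
REQUEST_PRICE_PER_MILLION = 0.20  # from the pricing row above
GB_SECOND_PRICE = 0.0000166667    # assumption: x86 price per GB-second
FREE_REQUESTS = 1_000_000         # from the pricing row above
FREE_GB_SECONDS = 400_000         # assumption: monthly free allowance


def monthly_lambda_cost(invocations: int, avg_duration_ms: float, memory_mb: int) -> float:
    """Estimate monthly Lambda cost in USD, rounded to cents."""
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    billable_requests = max(0, invocations - FREE_REQUESTS)
    billable_gb_seconds = max(0.0, gb_seconds - FREE_GB_SECONDS)
    cost = (billable_requests / 1_000_000) * REQUEST_PRICE_PER_MILLION
    cost += billable_gb_seconds * GB_SECOND_PRICE
    return round(cost, 2)
```

Under these assumptions, 5 million invocations a month at 200 ms and 512 MB use 500,000 GB-seconds and cost roughly $2.47, while a small function with 500,000 invocations stays inside the free allowance.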

Lambda in DevOps — common uses

  • Automated remediation — CloudWatch alarm triggers Lambda which restarts an ECS service or scales up capacity
  • CI/CD webhook processor — API Gateway receives GitHub webhook, Lambda triggers CodePipeline
  • Scheduled maintenance — EventBridge cron triggers Lambda to stop dev environments at night
  • Log processing — S3 event triggers Lambda to process and forward logs to Elasticsearch
  • Slack/Teams bot — API Gateway + Lambda handles slash commands from your ops chat
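As an illustration of the scheduled maintenance use case in the list above, here is a minimal handler sketch that stops running instances tagged as dev. The tag key and value are assumptions, pagination of `describe_instances` is ignored for brevity, and the EC2 client is passed in so the logic can be exercised without AWS access.

```python
def stop_dev_instances(ec2, tag_key="Environment", tag_value="dev"):
    """Stop all running instances carrying the given tag; return their IDs."""
    response = ec2.describe_instances(Filters=[
        {"Name": f"tag:{tag_key}", "Values": [tag_value]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:  # stop_instances rejects an empty ID list
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids


def handler(event, context):
    """Lambda entry point, intended to be triggered by an EventBridge schedule."""
    import boto3  # bundled with the Lambda Python runtime

    return {"stopped": stop_dev_instances(boto3.client("ec2"))}
```

The function's execution role would need `ec2:DescribeInstances` and `ec2:StopInstances`; scoping the stop permission to the same tag keeps the blast radius small.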

🔄 AWS CI/CD — CodePipeline and GitHub Actions

AWS native CI/CD stack

| Service | Role | Equivalent |
|---|---|---|
| CodeCommit | Managed Git repository | GitHub, Azure Repos |
| CodeBuild | Managed build service: run tests, build Docker images | Jenkins, GitHub Actions runner |
| CodeDeploy | Deployment service: rolling, canary, blue/green deployments to EC2, ECS, Lambda | Octopus Deploy, Spinnaker |
| CodePipeline | Orchestrates the full CI/CD workflow: source → build → test → deploy | Azure DevOps, Jenkins Pipeline |
| ECR | Private Docker image registry | ACR, Docker Hub |

GitHub Actions to EKS — OIDC authentication (no stored credentials)

name: Deploy to EKS
on:
  push:
    branches: [main]

permissions:
  id-token: write   # REQUIRED for OIDC
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    - name: Configure AWS credentials (OIDC, no secrets stored)
      uses: aws-actions/configure-aws-credentials@v4
      with:
        # 123456789012 is a placeholder 12-digit account ID
        role-to-assume: arn:aws:iam::123456789012:role/GitHubActionsEKSRole
        aws-region: us-east-1

    - name: Login to ECR
      run: aws ecr get-login-password | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

    - name: Build and push
      run: |
        # Tag the image with the full ECR repository URI so the push can find it
        docker build -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:${{ github.sha }} .
        docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:${{ github.sha }}

    - name: Deploy to EKS
      run: |
        aws eks update-kubeconfig --name myeks --region us-east-1
        helm upgrade --install myapp ./charts/myapp \
          --set image.tag=${{ github.sha }} --atomic --wait
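The workflow relies on an IAM role that trusts GitHub's OIDC provider, which is not shown on this page. A minimal sketch of such a trust policy follows, built as a Python dict; the account ID, the repository `my-org/my-repo`, and the helper name are hypothetical, and the role would additionally need ECR and EKS permissions.

```python
GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_actions_trust_policy(account_id, repository, branch="main"):
    """Trust policy allowing one repository branch to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                # Restrict to a single repo and branch, never "*".
                f"{GITHUB_OIDC_HOST}:sub": f"repo:{repository}:ref:refs/heads/{branch}",
            }},
        }],
    }


trust = github_actions_trust_policy("123456789012", "my-org/my-repo")
```

Pinning the `sub` claim to one repository and branch is the important design choice: without it, any GitHub workflow could request credentials for the role.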

🛡️ AWS High Availability and Disaster Recovery

HA design principles on AWS

| Pattern | What it means | AWS implementation |
|---|---|---|
| Multi-AZ | Run across multiple Availability Zones in one region | RDS Multi-AZ, ALB across AZs, EKS nodes in multiple AZs |
| Multi-Region Active-Passive | Primary region active, secondary on standby. Failover on disaster. | Route53 health checks + failover routing, RDS read replica in secondary region |
| Multi-Region Active-Active | Both regions serve traffic simultaneously | Route53 latency routing, DynamoDB Global Tables, S3 Cross-Region Replication |

RTO and RPO — the two DR metrics

  • RTO (Recovery Time Objective) — how long can the business be down? "We must be back online within 4 hours." Drives: how fast your failover automation must work.
  • RPO (Recovery Point Objective) — how much data can we lose? "We cannot lose more than 15 minutes of transactions." Drives: how frequently you must backup/replicate.

DR strategies by cost and RTO

| Strategy | RTO | Cost | How |
|---|---|---|---|
| Backup & Restore | Hours | Low | S3 backups, restore from scratch when disaster strikes |
| Pilot Light | 30-60 min | Medium | Core DB running in secondary region, scale out compute on failover |
| Warm Standby | Minutes | High | Scaled-down running copy in secondary, scale up on failover |
| Active-Active | Seconds | 2x | Full capacity in both regions, instant failover via DNS |
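The table reads as a decision rule: pick the cheapest strategy whose RTO still meets the business requirement. The sketch below encodes that rule; the numeric RTO values are illustrative assumptions chosen to match the table's ranges, not measured figures.

```python
# (strategy, assumed worst-case RTO in minutes, relative cost rank)
STRATEGIES = [
    ("Backup & Restore", 24 * 60, 1),
    ("Pilot Light", 60, 2),
    ("Warm Standby", 10, 3),
    ("Active-Active", 1, 4),
]


def cheapest_strategy(required_rto_minutes: float):
    """Return the lowest-cost strategy that meets the RTO, or None."""
    for name, rto_minutes, _cost in sorted(STRATEGIES, key=lambda s: s[2]):
        if rto_minutes <= required_rto_minutes:
            return name
    return None  # no listed strategy is fast enough
```

A 4-hour RTO lands on Pilot Light, while a 5-minute RTO forces Active-Active. RPO would be checked separately against the replication or backup interval.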

🎯 Interview Questions

AWS · ARCHITECT
Design a production-grade VPC architecture for a 3-tier application on AWS.
Three-tier VPC with public, private-app, and private-data subnets across 3 AZs. VPC CIDR 10.0.0.0/16 giving 65,536 IPs. Public subnets host: Application Load Balancer (inbound 443 from 0.0.0.0/0), NAT Gateways (one per AZ for HA), Bastion host (if needed). Private app subnets host EKS nodes and EC2 — they route outbound traffic through NAT Gateway. Private data subnets host RDS, ElastiCache — NO internet route at all, fully isolated. Security groups implement least-privilege: ALB-SG allows inbound 443 from anywhere. App-SG allows inbound only from ALB-SG. DB-SG allows inbound only from App-SG. VPC Flow Logs enabled for security audit. VPC endpoints for S3 and DynamoDB avoid NAT Gateway costs for AWS service traffic. At scale: inter-AZ data transfer costs money — keep app pods and their RDS AZ aligned.
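The subnet layout in this answer can be planned with Python's standard `ipaddress` module. The sketch assumes /20 subnets and AZ suffixes a, b and c; both are illustrative choices, and the function name is hypothetical.

```python
import ipaddress


def plan_subnets(vpc_cidr="10.0.0.0/16",
                 azs=("a", "b", "c"),
                 tiers=("public", "private-app", "private-data"),
                 new_prefix=20):
    """Assign one subnet per (tier, AZ) pair, carved in order from the VPC."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of /20 blocks
    # next() raises StopIteration if the VPC is too small for the layout.
    return {(tier, az): next(blocks) for tier in tiers for az in azs}


plan = plan_subnets()
for (tier, az), cidr in plan.items():
    print(f"{tier}-{az}: {cidr} ({cidr.num_addresses} addresses)")
```

Nine /20 subnets use just over half of the /16, leaving room for later tiers. AWS reserves 5 addresses in every subnet, so each /20 offers 4,091 usable IPs rather than 4,096.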
AWS · ENGINEER
What is the difference between Security Groups and NACLs in AWS?
Security Groups are stateful — if you allow inbound traffic, return traffic is automatically allowed. They operate at the instance/ENI level. You can only create ALLOW rules. Changes take effect immediately. NACLs (Network ACLs) are stateless — you must explicitly allow both inbound and outbound traffic for a connection to work. They operate at the subnet level and apply to all instances in the subnet. Rules are evaluated in order by rule number — first match wins. You can create both ALLOW and DENY rules. Use case for NACLs: blocking a specific IP range at the subnet level (DDoS mitigation), quick emergency block. Use case for Security Groups: fine-grained instance-level control. Best practice: Security Groups for normal operations. NACLs as an additional layer for subnet-level blocking only.
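The "first match wins" behaviour of NACLs is the part people most often get wrong, so here is a toy model of it. Rules are (rule number, port, action) tuples with "*" meaning any port; real NACLs also match protocol, CIDR and direction, which this sketch leaves out.

```python
def nacl_decision(rules, port):
    """Evaluate rules in ascending rule-number order; first match wins."""
    for _number, rule_port, action in sorted(rules, key=lambda r: r[0]):
        if rule_port == "*" or rule_port == port:
            return action
    return "deny"  # the implicit final rule denies everything


rules = [
    (200, "*", "allow"),  # broad allow, evaluated second
    (100, 22, "deny"),    # lower number, so it is evaluated first
]
```

Here SSH on port 22 is denied even though rule 200 allows all ports, because rule 100 matches first. Security Groups have no such ordering: they contain only allow rules, so any matching rule admits the traffic.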
AWS · ARCHITECT
Explain IRSA — IAM Roles for Service Accounts — and why it matters for EKS security.
Before IRSA: to give a pod AWS access, you stored access keys as Kubernetes secrets or gave the EC2 node role broad permissions (all pods on that node get all permissions). Both are security risks. IRSA uses OIDC federation: EKS cluster has an OIDC endpoint. You associate this with your AWS account. Create an IAM role with a trust policy allowing only the specific Kubernetes service account in a specific namespace. Pod uses that service account. At runtime, the pod automatically gets a temporary credential via the OIDC token. No keys stored anywhere. If the pod is compromised, credentials expire in 1 hour. Blast radius is limited to exactly the permissions in that role. Implementation: eksctl utils associate-iam-oidc-provider, then eksctl create iamserviceaccount. This is the production standard for EKS and what every AWS interview expects you to know for containerised workloads.
AWS · PRODUCTION
S3 bucket was accidentally made public. What do you do in the next 60 seconds?
First 60 seconds: block public access immediately at both levels, with all four flags set to true: aws s3api put-public-access-block for the bucket and aws s3control put-public-access-block for the account. This stops new exposure right away.
Next 5 minutes: check what was in the bucket using aws s3api list-objects-v2. Check CloudTrail for GetObject calls in the last hour to understand what was potentially accessed (this requires S3 data events to be enabled; otherwise fall back to S3 server access logs if they are on). Check the bucket policy and ACL that allowed public access.
Next 30 minutes: file a security incident. Notify your security team. If the bucket contained PII, initiate your GDPR/data breach notification process (72 hours under GDPR). Fix the root cause: identify which Terraform/IaC code missed the block_public_acls setting.
Prevention for the future: the AWS Config rule s3-bucket-public-read-prohibited to flag public buckets. An SCP at organization level that prevents public access blocks from being removed. In Terraform: always include an aws_s3_bucket_public_access_block resource with all four booleans set to true.
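The bucket-level lockdown from the first step can also be scripted. This sketch takes an S3 client as a parameter (in practice `boto3.client("s3")`) and uses the four flag names of the public access block configuration; the helper names are illustrative.

```python
FULL_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def lock_down_bucket(s3, bucket: str) -> None:
    """Apply all four public access block flags to one bucket."""
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration=dict(FULL_BLOCK),
    )


def is_fully_blocked(config: dict) -> bool:
    """True only if every one of the four flags is explicitly True."""
    return all(config.get(flag) is True for flag in FULL_BLOCK)
```

`is_fully_blocked` is meant for audits: feed it the configuration returned by `get_public_access_block` for each bucket and alert on any that return False.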
AWS · ENGINEER
What is the difference between EBS, EFS, and S3 storage on AWS?
EBS (Elastic Block Store): block storage, attached to one EC2 instance at a time (ReadWriteOnce). Like a hard drive. Use for OS volumes, databases (RDS uses EBS), single-instance app data. High IOPS, low latency. Types: gp3 (general), io2 (high performance databases), st1 (throughput — Kafka), sc1 (cold, infrequent access). EFS (Elastic File System): network file system, multiple EC2 instances can mount simultaneously (ReadWriteMany). NFS protocol. Use for shared content (web tier accessing same files), EKS pods needing shared storage across nodes. More expensive than EBS. S3: object storage, HTTP API (not mountable as filesystem natively). Unlimited scale. Use for: backups, static content, data lake, application artifacts, logs. Cannot run a database on S3. Differences in EKS context: EBS for databases in StatefulSets, EFS for shared config or content across pods, S3 for application data and backups via SDK.
AWS · PRODUCTION
How do you troubleshoot an EC2 instance that is unreachable via SSH?
Systematic approach. First: check EC2 console — is the instance state Running? Check system status checks and instance status checks. System check failure means AWS hardware issue — stop and start the instance (this migrates to new hardware). Instance check failure means OS-level issue. Second: check Security Group — does it allow inbound 22 from your IP? Connections silently drop without SG rule. Third: check NACLs — is there a DENY rule on port 22? Fourth: check instance system log — EC2 Console → Get system log — shows Linux boot messages and any panic/crash. Fifth: if no SSH key access, use AWS Systems Manager Session Manager — no SSH needed, works through Systems Manager agent. Sixth: for EBS-backed instances, detach the root volume, attach to a working instance as secondary volume, fix the issue (bad sshd_config, full disk), reattach. Production rule: disable SSH entirely and use SSM Session Manager — no inbound ports, full audit trail.
AWS · ARCHITECT
How does AWS Auto Scaling work with ALB for a production application?
Full flow: ALB receives traffic, distributes to target group. Target group contains EC2 instances or EKS pods. Auto Scaling Group manages the EC2 instances. When CPU/memory/custom metric crosses threshold, ASG launches new instances, registers them with the target group, ALB starts sending traffic once health check passes. Scale-down: after cooldown period (300 seconds default), underutilised instances are terminated, deregistered from target group first so in-flight requests complete. Key settings: min/desired/max capacity, health check grace period (give new instances time to start), cooldown (prevent rapid scale-up/down flapping), instance warm-up (how long before new instance counted in metrics). ALB health check vs EC2 health check: ALB health check tests HTTP endpoint. If it fails, ALB removes instance from rotation but ASG does not know. Configure ASG to use ELB health checks to replace unhealthy instances automatically.
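The scaling decision itself can be approximated with a proportional rule. The sketch below is not AWS's exact target tracking algorithm, which is not published in this form; it uses the same shape as the documented Kubernetes HPA formula, clamped to the group's min and max capacity.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     minimum: int, maximum: int) -> int:
    """Scale capacity in proportion to metric/target, within min and max."""
    if target <= 0:
        raise ValueError("target must be positive")
    proposed = math.ceil(current * metric / target)
    return max(minimum, min(maximum, proposed))
```

With 4 instances at 90% average CPU and a 60% target, the rule proposes 6 instances. At 30% CPU it proposes 2, and a spike to 300% is capped by the max capacity.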
AWS · ENGINEER
What is IRSA in EKS and how does it compare to Azure Workload Identity?
IRSA (IAM Roles for Service Accounts) is EKS's mechanism for giving pods access to AWS services without storing credentials. It works through OIDC federation: the EKS cluster has an OIDC issuer URL. An IAM Role is created with a trust policy that trusts tokens from that OIDC issuer for a specific Kubernetes ServiceAccount. The pod uses that ServiceAccount and gets temporary AWS credentials automatically via the AWS SDK credential chain. The trust policy specifies: "I trust tokens from cluster X for ServiceAccount Y in namespace Z." When the payment pod calls S3, boto3 sees the ServiceAccount token in the pod filesystem, exchanges it for temporary AWS credentials, and makes the S3 call. No access keys stored anywhere. Comparison with Azure Workload Identity: identical concept, different implementation. Both use OIDC federation between Kubernetes ServiceAccounts and the cloud identity system (AWS IAM vs Azure AD). Both eliminate stored credentials. The practical difference: IRSA configuration is done per-cluster with eksctl or Terraform. Azure Workload Identity requires the federated credential to be set up on the Managed Identity resource. Both are the current production standard for their respective platforms — never use static access keys in pods.
AWS · ENGINEER
What is the difference between ALB and NLB in AWS? When do you use each?
ALB (Application Load Balancer) operates at Layer 7 (HTTP/HTTPS). It understands the content of the request: URL path routing (/api → backend, /static → S3), host-based routing (api.example.com → API service, app.example.com → frontend), gRPC, WebSocket, SSL termination, content-based routing, and WAF integration. Use ALB for: web applications, microservices with path-based routing, HTTP API endpoints, WebSocket applications. NLB (Network Load Balancer) operates at Layer 4 (TCP/UDP/TLS). It routes based on IP and port only, no content inspection. Handles millions of requests per second with ultra-low latency (microseconds vs milliseconds for ALB). Preserves the source IP of the client (ALB changes source IP to the ALB IP). Use NLB for: TCP/UDP applications (gaming servers, IoT), high-frequency trading where microsecond latency matters, applications requiring source IP preservation, EKS Services with type LoadBalancer where you need a static IP (NLB supports static Elastic IPs, ALB does not). In EKS: use the AWS Load Balancer Controller (LBC). Annotate Service type LoadBalancer with service.beta.kubernetes.io/aws-load-balancer-type: external to get an NLB, or use Ingress with IngressClass alb to get an ALB for HTTP routing.
AWS · ARCHITECT
How do you design a highly available three-tier application on AWS?
Three-tier (presentation, application, data) deployed across two AZs minimum. Presentation tier: static assets in S3 with CloudFront CDN in front. Dynamic frontend in ECS/EKS. ALB distributes across AZs. Auto Scaling Group maintains minimum 2 instances across AZs. Application tier: EKS with node groups spanning two AZs. HPA scales pods. Cluster Autoscaler adds nodes. ALB routes to the EKS service. Security groups allow only the ALB to reach the application tier — no direct internet access. Data tier: RDS with Multi-AZ deployment — primary in AZ-1, standby in AZ-2. Automatic failover if primary fails (1-2 minutes). Read replicas for read-heavy workloads. ElastiCache (Redis) in cluster mode across AZs for session storage and caching. All data tier resources in private subnets — no public access. Supporting services: Route53 for DNS with health checks and automatic failover. CloudFront for global CDN and DDoS protection (absorbs layer 7 attacks at edge). WAF attached to CloudFront and ALB for OWASP Top 10 protection. KMS encrypts all data at rest. Secrets Manager for database credentials (rotate automatically). CloudWatch for monitoring, CloudTrail for audit. For disaster recovery: S3 Cross-Region Replication for static assets, RDS cross-region read replica that can be promoted, infrastructure as Terraform code so you can rebuild in a new region from code.
AWS · PRODUCTION
EKS pods cannot access S3. Walk through your troubleshooting steps.
Step 1: check the error. From inside the pod (assuming the image includes the AWS CLI): kubectl exec -it pod-name -- aws s3 ls s3://my-bucket. The error usually points at the cause. "Unable to locate credentials" means no IAM role is attached. "Access Denied" means wrong permissions. "NoSuchBucket" means wrong bucket name or region.
Step 2: verify the IRSA setup. kubectl describe serviceaccount my-sa -n namespace and check for the annotation eks.amazonaws.com/role-arn. If it is missing, the ServiceAccount is not linked to an IAM role.
Step 3: verify the pod is using the correct ServiceAccount. kubectl describe pod my-pod and check serviceAccountName. If it says "default", the deployment is not using the annotated ServiceAccount.
Step 4: verify the IAM role trust policy. Run aws iam get-role --role-name MyEKSRole and check that the trust policy allows the EKS cluster OIDC issuer and the specific ServiceAccount. Common mistake: the trust policy names the right cluster but the wrong namespace or ServiceAccount name.
Step 5: verify the IAM role has the right permissions. aws iam simulate-principal-policy checks whether the role can perform s3:GetObject on the bucket ARN without actually doing it.
Step 6: check the bucket policy. The S3 bucket might have a policy that denies access regardless of the IAM role. Also check whether the bucket is in a different region and you need the --region flag.
Step 7: token expiry. The projected ServiceAccount token is valid for 24 hours by default, and the kubelet rotates it before it expires. Recent AWS SDKs re-read the token file automatically. If the application uses an old SDK version that caches the token, upgrade the SDK or restart the pod to pick up fresh credentials.
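Step 1 is essentially a lookup from error text to likely cause, which can be captured in a small triage helper. The mapping below only covers the three errors named in this answer, and the suggested next steps are a summary of the steps above rather than an exhaustive diagnosis.

```python
# (substring to look for, likely cause and where to look next)
KNOWN_ERRORS = [
    ("unable to locate credentials",
     "No IAM role attached: check the ServiceAccount annotation (steps 2-3)"),
    ("accessdenied",
     "Wrong permissions: check trust policy, role policy, bucket policy (steps 4-6)"),
    ("access denied",
     "Wrong permissions: check trust policy, role policy, bucket policy (steps 4-6)"),
    ("nosuchbucket",
     "Wrong bucket name or region: check the name and the --region flag"),
]


def diagnose(error_message: str) -> str:
    """Map an AWS CLI error message to a likely cause."""
    lowered = error_message.lower()
    for needle, cause in KNOWN_ERRORS:
        if needle in lowered:
            return cause
    return "Unrecognised error: read the full message and check CloudTrail"
```

For example, passing the CLI output "An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied" returns the permissions hint.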

🗺️ Roadmap

Week 1: Foundations
  • Create AWS free account
  • Understand regions and AZs
  • IAM users, groups, policies
  • Launch first EC2, SSH in

Week 2: Networking
  • Create VPC from scratch (not default)
  • Public + private subnets
  • Security groups + NACLs
  • NAT Gateway

Week 3-4: Core Services
  • S3 with versioning + encryption
  • RDS Multi-AZ
  • ALB + Auto Scaling Group
  • CloudWatch alarms

Month 2: DevOps on AWS
  • EKS cluster with eksctl
  • IRSA for pod permissions
  • Terraform for all infrastructure
  • AWS Solutions Architect exam prep