LearnwithVishnu
🤖Ansible
Levels: Beginner, Engineer, Production, Architect
Agentless automation: configure servers, deploy apps, enforce compliance
What you will learn: What Ansible is → Inventory (static, dynamic, group_vars) → Playbooks (tasks, handlers, templates) → Roles (reusable automation) → Variables and precedence → Ansible Vault (secrets) → All CLI commands → AAP/Tower (enterprise) → Rolling deployments → Idempotency → CI/CD integration → 10 senior interview Q&As

🤖 What is Ansible?

Ansible is an agentless IT automation tool maintained by Red Hat. It connects to servers over SSH and runs tasks defined in YAML files called playbooks. Target servers need no agent, only SSH access and Python, and this is its biggest advantage over Chef and Puppet.

Ansible vs Chef vs Puppet

| Feature | Ansible | Puppet | Chef |
|---|---|---|---|
| Agent required | No, agentless (SSH only) | Yes, Puppet agent | Yes, Chef client |
| Language | YAML, readable by everyone | Puppet DSL (Ruby-based) | Ruby DSL (complex) |
| Model | Push: control node pushes | Pull: agents pull config | Pull: agents pull config |
| Learning curve | Low: write YAML in hours | High: weeks to master | High: weeks to master |
| Setup time | Minutes: pip install ansible | Days: install agents everywhere | Days: install agents everywhere |
| Best for | Ad-hoc automation, CI/CD integration | Continuous compliance enforcement | Complex enterprise config management |

Where Ansible fits in the DevOps toolchain

| Tool | What it does | Analogy |
|---|---|---|
| Terraform | Provisions servers, networks, databases in cloud | Builder: creates the house |
| Ansible | Configures what is inside the servers | Interior designer: furnishes the house |
| Jenkins/GitHub Actions | Deploys application code | Moving company: brings your stuff in |

Install Ansible and test connectivity
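The original code for this step is not present on the page, so the following is only a minimal sketch. It assumes Ansible is already installed on the control node (`pip install ansible`) and that an inventory file, here given the placeholder name `inventory.ini`, lists at least one reachable host. The ad-hoc equivalent is `ansible all -m ping`.

```yaml
# ping.yml: verify SSH login and a usable Python on every host.
# Run with: ansible-playbook -i inventory.ini ping.yml
# (inventory.ini is a placeholder name, not taken from the original page)
- name: Test connectivity to all managed hosts
  hosts: all
  gather_facts: false
  tasks:
    - name: Check that Ansible can log in and run a module
      ansible.builtin.ping:
```

Note that the ping module does not send ICMP packets; it logs in over SSH and runs a small Python module, so a success also confirms credentials and the interpreter.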

📋 Inventory — Static, Dynamic, Group Vars

The inventory tells Ansible which servers to manage and how to group them. Groups let you target subsets: run tasks on all webservers, or just databases, or just production servers.

Static inventory — INI and YAML formats
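A small sketch of a static inventory in YAML format. The host names, IP addresses and the `deploy` user are illustrative assumptions. The INI format expresses the same structure with section headers such as `[webservers]` and `[production:children]`.

```yaml
# inventory.yml: static inventory in YAML format (all names are illustrative)
all:
  children:
    webservers:
      hosts:
        web-01:
          ansible_host: 10.0.1.10
        web-02:
          ansible_host: 10.0.1.11
    databases:
      hosts:
        db-01:
          ansible_host: 10.0.2.10
    # A group of groups: "production" contains both groups above
    production:
      children:
        webservers:
        databases:
  vars:
    ansible_user: deploy   # assumed SSH user for every host
```

With this file, `ansible webservers -i inventory.yml -m ping` would target only web-01 and web-02, and `ansible-inventory -i inventory.yml --graph` prints the group tree.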

Group and host variables

group_vars and host_vars
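As a rough illustration, variables for a whole group live in a file named after the group, and per-host overrides live in a file named after the host. The variable names below reuse ones mentioned elsewhere on this page; the values are assumptions.

```yaml
# group_vars/webservers.yml: applies to every host in the webservers group
nginx_worker_processes: 4
app_port: 8080

# host_vars/web-01.yml would hold overrides for a single host, for example:
# nginx_worker_processes: 8
```

Ansible picks these files up automatically when the group_vars/ and host_vars/ directories sit next to the inventory file or the playbook.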

Dynamic Inventory — for cloud environments

Static inventory files are unmanageable for cloud environments where VMs are created and destroyed regularly. Dynamic inventory queries cloud APIs at runtime.

Dynamic inventory — AWS EC2 and Azure
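A hedged sketch of a dynamic inventory for AWS using the `amazon.aws.aws_ec2` inventory plugin. It assumes the `amazon.aws` collection and `boto3` are installed and that AWS credentials are available in the environment. The region and the `Environment` and `Role` tags are illustrative. Azure has a comparable plugin, `azure.azcollection.azure_rm`, configured in a file ending in `azure_rm.yml`.

```yaml
# inventory/aws_ec2.yml: the filename must end in aws_ec2.yml or aws_ec2.yaml
plugin: amazon.aws.aws_ec2
regions:
  - eu-west-1                     # illustrative region
filters:
  instance-state-name: running
  tag:Environment: production     # assumes instances carry an Environment tag
keyed_groups:
  # Builds groups such as role_webserver from a hypothetical "Role" tag
  - key: tags.Role
    prefix: role
compose:
  # Connect over the private IP instead of the public DNS name
  ansible_host: private_ip_address
```

Running `ansible-inventory -i inventory/aws_ec2.yml --graph` shows the hosts and groups the plugin discovers at that moment.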

📝 Playbooks — Tasks, Handlers, Templates

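A playbook is a YAML file that defines tasks to run on target servers. Each task uses an Ansible module (copy, template, service, package, command, etc.). Tasks run in order. By default, a host that fails a task is dropped from the rest of the play while the remaining hosts continue; the play stops only when every host has failed or when any_errors_fatal: true is set.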

Complete playbook — deploy application
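The page's original playbook is missing, so here is a minimal sketch of what a deployment playbook can look like. The application name `myapp`, the paths and the source file are hypothetical.

```yaml
# deploy.yml: illustrative only; package, paths and file names are assumptions
- name: Deploy application to web servers
  hosts: webservers
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Create application directory
      ansible.builtin.file:
        path: /opt/myapp
        state: directory
        mode: "0755"

    - name: Copy application config
      ansible.builtin.copy:
        src: files/app.conf
        dest: /opt/myapp/app.conf
        mode: "0644"

    - name: Ensure nginx is running and enabled at boot
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
```

Every task here declares a desired state rather than a command to run, which is what lets a second run report "ok" instead of "changed".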

Handlers — run once at the end

Handlers are tasks triggered by notify. A handler runs only if the notifying task reported a change, and it runs once at the end of the play no matter how many tasks notified it. Typical uses: restart a service, reload config, clear a cache.

Handlers example
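A short sketch of the notify pattern, assuming two hypothetical template files. Both tasks notify the same handler, yet nginx is reloaded at most once per play.

```yaml
- name: Configure nginx
  hosts: webservers
  become: true
  tasks:
    - name: Deploy main nginx config
      ansible.builtin.template:
        src: nginx.conf.j2          # hypothetical template
        dest: /etc/nginx/nginx.conf
      notify: Reload nginx

    - name: Deploy site config
      ansible.builtin.template:
        src: site.conf.j2           # hypothetical template
        dest: /etc/nginx/conf.d/site.conf
      notify: Reload nginx

  handlers:
    # Runs once at the end of the play, and only if a notifying task changed something
    - name: Reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
```

If neither config file changes, the handler does not run at all.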

Jinja2 Templates — dynamic config files

Jinja2 template and usage
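A sketch of a template and the task that renders it. The template body is shown as comments because the file itself is plain text with Jinja2 expressions. The file names, `app_port` value and the `databases` group are assumptions, and the target directory is assumed to exist.

```yaml
# templates/app.conf.j2 (the file itself is plain text plus Jinja2):
#   listen_port = {{ app_port }}
#   workers     = {{ ansible_facts['processor_vcpus'] }}
#   {% for host in groups['databases'] %}
#   db_host     = {{ hostvars[host]['ansible_host'] }}
#   {% endfor %}

- name: Render application config from a template
  hosts: webservers
  become: true
  vars:
    app_port: 8080
  tasks:
    - name: Write app.conf with per-host values
      ansible.builtin.template:
        src: templates/app.conf.j2
        dest: /etc/myapp/app.conf   # assumes /etc/myapp already exists
        mode: "0644"
```

The template module only reports "changed" when the rendered output differs from the file already on the host.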

📦 Roles — Reusable Automation

A role is a reusable, structured collection of tasks, vars, templates, and handlers. Instead of one giant playbook, roles make automation modular and shareable across projects and teams. This is what separates junior from senior Ansible usage.

Role directory structure

Role structure and usage
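A sketch of the conventional role layout and of a playbook that applies a role. The role name `nginx` and the override value are illustrative. The layout is an abridged version of what `ansible-galaxy role init nginx` creates.

```yaml
# Conventional layout of a role named "nginx" (abridged):
#   roles/nginx/defaults/main.yml   low-precedence defaults, meant to be overridden
#   roles/nginx/vars/main.yml       high-precedence role variables
#   roles/nginx/tasks/main.yml      entry point for the role's tasks
#   roles/nginx/handlers/main.yml   handlers the tasks can notify
#   roles/nginx/templates/          Jinja2 templates
#   roles/nginx/files/              static files
#   roles/nginx/meta/main.yml       role metadata and dependencies

# site.yml: apply the role to a group
- name: Configure web servers
  hosts: webservers
  become: true
  roles:
    - role: nginx
      vars:
        nginx_worker_processes: 8   # overrides the value in defaults/main.yml
```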

Complete role example — nginx with TLS

nginx role — tasks/main.yml
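One possible shape for the role's task file, offered as a sketch rather than the page's original. The variables `nginx_tls_cert_src` and `nginx_tls_key_src`, the template `tls-site.conf.j2`, and a `Reload nginx` handler in handlers/main.yml are all assumed to exist.

```yaml
# roles/nginx/tasks/main.yml
- name: Install nginx
  ansible.builtin.package:
    name: nginx
    state: present

- name: Create directory for TLS material
  ansible.builtin.file:
    path: /etc/nginx/tls
    state: directory
    owner: root
    group: root
    mode: "0700"

- name: Install TLS certificate
  ansible.builtin.copy:
    src: "{{ nginx_tls_cert_src }}"   # assumed variable
    dest: /etc/nginx/tls/server.crt
    mode: "0644"
  notify: Reload nginx

- name: Install TLS private key
  ansible.builtin.copy:
    src: "{{ nginx_tls_key_src }}"    # assumed variable
    dest: /etc/nginx/tls/server.key
    mode: "0600"
  no_log: true                        # keep key material out of logs
  notify: Reload nginx

- name: Render TLS server block
  ansible.builtin.template:
    src: tls-site.conf.j2             # assumed template in the role
    dest: /etc/nginx/conf.d/tls-site.conf
    mode: "0644"
  notify: Reload nginx

- name: Ensure nginx is started and enabled
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true
```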

🔧 Variables, Precedence & Vault

Variable precedence — most important concept

Ansible has 22 variable precedence levels. For interviews, know these 6 in order:

| Priority | Source | Example |
|---|---|---|
| Lowest | Role defaults | role/defaults/main.yml |
| 2 | Inventory file group vars | [webservers:vars] section in inventory.ini |
| 3 | Group vars | group_vars/webservers.yml |
| 4 | Host vars | host_vars/web-01.yml |
| 5 | Playbook vars | vars: section in playbook |
| Highest | Extra vars (-e flag) | ansible-playbook deploy.yml -e "env=prod" |

Variable precedence examples
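A small sketch that makes the ordering observable. It assumes a hypothetical role named `app` whose defaults set `log_level`, and a group_vars file that sets the same variable.

```yaml
# roles/app/defaults/main.yml would contain:   log_level: INFO
# group_vars/webservers.yml would contain:     log_level: WARNING
- name: Show which value of log_level wins
  hosts: webservers
  gather_facts: false
  roles:
    - app
  tasks:
    - name: Print the effective value
      ansible.builtin.debug:
        var: log_level
# Without -e, the group_vars value (WARNING) overrides the role default (INFO).
# With: ansible-playbook precedence.yml -e "log_level=DEBUG", the extra var wins.
```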

Ansible Vault — encrypt secrets

Vault encrypts sensitive data (passwords, API keys) so it can be stored safely in Git. An encrypted file contains only a $ANSIBLE_VAULT header followed by AES256 ciphertext, which is useless without the vault password.

Ansible Vault — encrypt and use secrets
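A sketch of the usual workflow, with the CLI steps shown as comments. The file path, the variable `vault_db_password` and the destination file are assumptions.

```yaml
# group_vars/production/vault.yml holds the secrets, e.g.  vault_db_password: "..."
# Encrypt it once with:   ansible-vault encrypt group_vars/production/vault.yml
# Edit it later with:     ansible-vault edit group_vars/production/vault.yml
- name: Configure database credentials
  hosts: production
  become: true
  tasks:
    - name: Write DB credentials file from the vaulted variable
      ansible.builtin.copy:
        content: "DB_PASSWORD={{ vault_db_password }}\n"
        dest: /etc/myapp/db.env     # assumes /etc/myapp already exists
        mode: "0600"
      no_log: true                  # keep the secret out of task output
# Run with: ansible-playbook site.yml --ask-vault-pass
#       or: ansible-playbook site.yml --vault-password-file ~/.vault_pass
```

Ansible decrypts the file in memory at run time; the plaintext is never written back to the repository.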

🖥️ CLI Commands — Complete Reference

ansible and ansible-playbook commands

🏢 Ansible Automation Platform (AAP / Tower)

AAP (Ansible Automation Platform) is enterprise Ansible. It adds: Web UI, REST API, RBAC, job scheduling, audit logs, and credential management on top of standard Ansible. At HPE/Vodafone scale you need AAP — CLI Ansible is unmanageable for teams.

What AAP adds over CLI Ansible

| Before AAP (CLI) | After AAP |
|---|---|
| Manual inventory.ini files | Dynamic inventory synced from AWS/Azure |
| SSH keys on engineer laptops | Credentials stored in AAP vault |
| No audit trail | Full job history: who, what, when, output |
| No access control | RBAC: Dev team cannot run prod playbooks |
| Cron jobs on control node | Schedules in AAP with Slack notifications |
| Manual playbook updates | Projects auto-sync from Git on every push |

AAP Core Objects

| Object | Purpose |
|---|---|
| Organization | Top-level tenant grouping (Telecom Org, Healthcare Org) |
| Inventory | Server lists, static or dynamic. Scoped to an org. |
| Credential | SSH keys, vault passwords, cloud creds, all stored encrypted |
| Project | Link to a Git repo containing playbooks. Auto-syncs on commit. |
| Job Template | Defines which playbook + inventory + credentials + vars. The "run button" |
| Workflow Template | Chains multiple Job Templates: backup → deploy → verify → notify |
| Schedule | Runs Job Templates on a cron schedule without Jenkins |

AAP RBAC — role levels

| Role | Permissions |
|---|---|
| Admin | Full control: create, edit, delete, execute |
| Execute | Can run Job Templates, cannot edit them |
| Use | Can reference Credential/Inventory, cannot view secret values |
| Update | Can sync Projects and Inventories |
| Read | View only, can see job history |

AAP REST API — automate from CI/CD
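A hedged sketch of launching a job template over the controller REST API, written as a playbook so a pipeline can run it with plain Ansible. The hostname, template ID, `app_version` variable and `AAP_TOKEN` environment variable are assumptions. The path shown is the `/api/v2/` form used by AWX and automation controller; newer AAP releases that route through a platform gateway may use a different prefix, so check your installation.

```yaml
- name: Launch an AAP job template from a pipeline
  hosts: localhost
  gather_facts: false
  vars:
    controller_url: https://aap.example.com      # hypothetical hostname
    job_template_id: 42                          # hypothetical template ID
    # Assumes the pipeline exports an OAuth token as AAP_TOKEN
    controller_token: "{{ lookup('ansible.builtin.env', 'AAP_TOKEN') }}"
  tasks:
    - name: POST to the job template launch endpoint
      ansible.builtin.uri:
        url: "{{ controller_url }}/api/v2/job_templates/{{ job_template_id }}/launch/"
        method: POST
        headers:
          Authorization: "Bearer {{ controller_token }}"
        body_format: json
        body:
          extra_vars:
            app_version: "1.2.3"                 # illustrative value
        status_code: 201
      register: launch

    - name: Show the ID of the job that was started
      ansible.builtin.debug:
        msg: "Started job {{ launch.json.job }}"
```

The `extra_vars` in the request body are only honored if the job template is configured to prompt for variables on launch.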

🚀 Production Patterns — CI/CD, Rolling, Idempotency

Idempotency — the most important Ansible concept

An idempotent playbook produces the same result whether run once or 100 times. Running a properly written Ansible playbook against an already-configured server should report every task as "ok" (no changes), not "changed" or "failed". This is the difference between professional and amateur Ansible usage.

Idempotent patterns
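A few patterns that keep repeated runs quiet, shown as a sketch. The sysctl setting and the script location are illustrative; `create_database.sh`, `/var/lib/db` and `getenforce` follow the examples used in the interview answers on this page.

```yaml
- name: Idempotent patterns
  hosts: webservers
  become: true
  tasks:
    # Non-idempotent: would append a duplicate line on every run
    # - ansible.builtin.shell: echo "vm.swappiness=10" >> /etc/sysctl.conf

    - name: Ensure the setting is present exactly once
      ansible.builtin.lineinfile:
        path: /etc/sysctl.conf
        regexp: '^vm\.swappiness='
        line: vm.swappiness=10

    - name: Run the DB bootstrap script only if its marker path is missing
      ansible.builtin.command:
        cmd: /usr/local/bin/create_database.sh   # assumed install location
        creates: /var/lib/db

    - name: Read-only command that should never report changed
      ansible.builtin.command:
        cmd: getenforce
      register: selinux_mode
      changed_when: false
```

On a second run against the same host, all three tasks should report "ok".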

Rolling deployment with serial

Rolling deployment — zero downtime
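A sketch of a batch-by-batch rollout using `serial`. The package and service name `myapp`, port 8080 and the `/health` endpoint are assumptions, and the health check assumes inventory host names resolve from the control node. Draining hosts from a load balancer before the update is left out for brevity.

```yaml
- name: Rolling deployment, one batch at a time
  hosts: webservers
  become: true
  serial: 2                  # update two hosts per batch
  max_fail_percentage: 0     # abort the rollout if any host in a batch fails
  tasks:
    - name: Deploy the new application package
      ansible.builtin.package:
        name: myapp          # hypothetical package
        state: latest
      notify: Restart myapp

    - name: Apply pending restarts before the health check
      ansible.builtin.meta: flush_handlers

    - name: Wait until the app answers on its health endpoint
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/health"
        status_code: 200
      register: health
      until: health.status == 200
      retries: 10
      delay: 5
      delegate_to: localhost
      become: false

  handlers:
    - name: Restart myapp
      ansible.builtin.service:
        name: myapp
        state: restarted
```

Because the play finishes a batch before starting the next, a failed health check stops the rollout while the untouched hosts keep serving the old version.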

Ansible in Jenkins CI/CD pipeline

Jenkins + Ansible pipeline

🔍 Troubleshooting — Common Issues

| Error | Cause | Fix |
|---|---|---|
| SSH connection refused | Wrong IP, firewall blocking port 22, wrong SSH user | Check ansible_host, ansible_user, ansible_port |
| Permission denied (publickey) | SSH key not added to authorized_keys | Copy SSH key: ssh-copy-id user@host |
| Python not found | Old server without Python, or wrong path | Set ansible_python_interpreter=/usr/bin/python3 |
| sudo password required | become: yes but no sudo password configured | Use --ask-become-pass or configure NOPASSWD in sudoers |
| Task not idempotent | Using shell/command module instead of dedicated module | Use package/service/file modules instead of shell |
| Variable undefined | Variable not set in inventory or vars files | Check variable precedence, use default() filter |

Debugging commands
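The original command list is not present, so this is only a sketch of a diagnostic playbook that covers several rows of the table above. The variable `app_port` is an assumed example of a required variable. On the command line, `-vvv`, `--syntax-check`, `--list-hosts` and `--check --diff` are the usual companions.

```yaml
- name: Inspect what Ansible sees on a host
  hosts: all
  gather_facts: true
  tasks:
    - name: Show connection settings resolved for this host
      ansible.builtin.debug:
        msg: "host={{ ansible_host | default(inventory_hostname) }} user={{ ansible_user | default('not set') }}"

    - name: Show OS family and the Python interpreter found by fact gathering
      ansible.builtin.debug:
        msg: "{{ ansible_facts['os_family'] }} / {{ ansible_facts['python']['executable'] }}"

    - name: Fail early with a clear message if a required variable is missing
      ansible.builtin.assert:
        that:
          - app_port is defined
        fail_msg: "app_port is not defined for {{ inventory_hostname }}"
```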

🎯 Interview Questions — Senior Level

ANSIBLE · ENGINEER
What is Ansible and how is it different from Chef and Puppet?
Ansible is agentless — it connects to servers over SSH and requires only Python on the target server. No agent daemon to install, maintain, or upgrade. Chef and Puppet require an agent running on every managed server — agent upgrades, agent authentication, agent failures become their own operational problem. Ansible uses YAML playbooks which any developer can read. Chef uses Ruby DSL which requires programming knowledge. Ansible is push-based — control node pushes tasks when you run ansible-playbook. Puppet and Chef are pull-based — agents periodically check for updates. The pull model is better for continuous compliance; the push model is better for on-demand deployments and CI/CD integration. At HPE: we chose Ansible specifically because the infrastructure team could write and understand playbooks without needing Ruby knowledge, and because we needed CI/CD integration that push-based Ansible makes natural.
ANSIBLE · ENGINEER
Explain Ansible variable precedence. Which wins?
Ansible has 22 precedence levels. For interviews, the 6 most important, from lowest to highest: role defaults (role/defaults/main.yml), which anyone can override; group variables defined in the inventory file; group_vars (files in the group_vars/ folder); host_vars (files in the host_vars/ folder); playbook vars (the vars: section); and extra vars (the -e flag), which always win and cannot be overridden. Practical implication: role defaults are the safety net. group_vars/production.yml overrides them for production. host_vars/critical-server.yml can further override for one specific server. And in an emergency, -e "log_level=DEBUG" overrides everything without touching any files. The most common mistake is setting variables in role vars/main.yml (high precedence) instead of defaults/main.yml (low precedence): then nobody can override them from group_vars, which breaks multi-environment playbooks.
ANSIBLE · ARCHITECT
How do you design Ansible roles to support both on-premise and cloud environments without duplicating code?
The key is parameterization and abstraction through variables. Design roles to be environment-agnostic by default, environment-specific through variable overrides. Example: my nginx role defines nginx_worker_processes in defaults/main.yml as 4. For cloud VMs with 8 cores, group_vars/cloud_webservers.yml sets it to 8. For on-prem servers with 16 cores, group_vars/onprem_webservers.yml sets it to 16. The role code never changes — only the variables differ. For genuinely different behavior (systemd vs init.d, different package managers), use when conditionals on ansible_os_family and ansible_distribution_major_version. For cloud-specific tasks (register with cloud load balancer, fetch secrets from Key Vault), use delegate_to: localhost to run cloud API calls from the control node. The role structure stays identical — cloud tasks are just enabled or disabled via variables like cloud_provider: azure or cloud_provider: none.
ANSIBLE · PRODUCTION
Your Ansible playbook runs successfully against dev but fails against production. What do you investigate?
Take a systematic approach and look for the differences between dev and prod that could cause failures. First, run with -vvv to see the exact SSH and task output. Most common causes, in order of frequency:
1) Variable values: prod group_vars has different values (db_host, app_port, credentials). Verify with ansible prod-servers -m debug -a "var=hostvars[inventory_hostname]".
2) Ansible Vault: prod uses a different vault password. Verify that decryption works: ansible-playbook site.yml --check --vault-password-file prod_vault.pass.
3) Network/firewall: target port not open, or package repository not reachable from the prod network. Test with ansible prod-server -m uri -a "url=https://registry.example.com".
4) Permissions: prod has stricter sudo rules or SELinux enforcing. Check with ansible prod-server -m shell -a "getenforce".
5) OS version differences: prod is RHEL 8, dev is RHEL 9. Some modules behave differently.
Finally, use --check --diff to preview exactly what would change on prod without making changes.
ANSIBLE · ARCHITECT
What is Ansible Automation Platform and when would you choose it over CLI Ansible?
AAP is enterprise Ansible with Web UI, RBAC, scheduling, audit logs, and centralized credential management. You need AAP when you have: more than 3 engineers running Ansible (SSH keys on laptops = security risk), any compliance requirement (PCI-DSS, SOC2 require audit trails of every change — CLI Ansible has none), production environments that need approval gates (AAP Workflow Templates support approval steps), and 24x7 operations (AAP schedules nightly compliance runs without a Jenkins dependency). Key AAP RBAC use case: Dev team gets Execute permission on dev Job Templates only. Ops team gets Execute on all. Nobody gets SSH key access to servers directly — all access goes through AAP with full logging. At Vodafone scale with 400+ servers across dev/staging/prod, CLI Ansible was a security and audit nightmare. AAP replaced it: every playbook run recorded, every credential centralized, every dev action approved by ops.
ANSIBLE · ENGINEER
What is idempotency in Ansible and why does it matter?
Idempotency means running a playbook once or 100 times produces the same result — the system ends up in the desired state either way, with no side effects from repeated runs. Why it matters: CI/CD pipelines run playbooks on every deployment. If a playbook is not idempotent, running it twice might install duplicate packages, create duplicate users, append duplicate config lines, or fail because a resource already exists. Ansible built-in modules (package, file, service, user, template, lineinfile) are idempotent. The shell and command modules are NOT idempotent by default — they run every time. If you must use shell, use creates or removes flags: shell: create_database.sh creates=/var/lib/db — this skips the command if the file already exists. The measure of a good playbook: run it against an already-configured server — all tasks should show "ok" (unchanged), zero "changed". If any task shows "changed" every time, it is not idempotent.
ANSIBLE · PRODUCTION
Production server configuration drifted from your Ansible playbooks. How do you detect and remediate this?
Configuration drift in Ansible is detected by running playbooks in check mode against production: ansible-playbook site.yml --check --diff -i inventory/prod. This shows every difference between current state and desired state without making any changes. The --diff flag shows exact file content changes. Anything showing "changed" in check mode = drift. Common drift sources: manual emergency fixes during incidents that were never formalized into playbooks, security patches applied manually, and configuration changes made directly on servers by application teams. Remediation decision: if the drift was an intentional improvement, update the playbook first, then apply. If the drift was incorrect, run ansible-playbook site.yml -i inventory/prod to revert to desired state. Prevention: run check mode as a nightly Jenkins job. Any drift detected = Slack alert to the team. At HPE: nightly drift detection on 50+ servers. Alert fires maybe twice per month, usually from manual emergency changes. Having the alert meant we always caught and formalized the change within 24 hours.
ANSIBLE · ENGINEER
What is the difference between include_tasks and import_tasks in Ansible?
import_tasks (static): the tasks file is read and included at parse time, before playbook execution starts, as if the tasks were written directly in the playbook. Result: --list-tasks shows all tasks before running, and tags applied to the import apply to all imported tasks. Limitation: the file path must be resolvable at parse time, so it cannot depend on host facts or inventory variables, and an import cannot be looped.
include_tasks (dynamic): the tasks file is loaded at runtime when that point in the playbook is reached. You can use variables in the file path: include_tasks: "{{ ansible_os_family }}_tasks.yml" loads a different file based on OS. Tags on the include_tasks do NOT automatically apply to the included tasks (use apply to pass them down). Limitation: --list-tasks does not show the included tasks before running.
Rule of thumb: use import_tasks for static includes where you always know what to include. Use include_tasks for conditional inclusion based on variables, or when you need to loop over multiple task files.
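The difference can be sketched in a single play. The file names `hardening.yml` and the per-OS task files are hypothetical.

```yaml
- name: Static import versus dynamic include
  hosts: all
  tasks:
    # Static: resolved at parse time, visible to --list-tasks,
    # and the tag is inherited by every imported task.
    - name: Import common hardening tasks
      ansible.builtin.import_tasks: hardening.yml
      tags: hardening

    # Dynamic: resolved at run time, so the path can depend on gathered facts.
    # `apply` is needed for the tag to reach the included tasks.
    - name: Include OS-specific tasks
      ansible.builtin.include_tasks:
        file: "{{ ansible_facts['os_family'] }}_tasks.yml"
        apply:
          tags: os_specific
      tags: os_specific
```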
ANSIBLE · ARCHITECT
How do you handle secrets in Ansible across a team of 20 engineers?
Three-layer secret management strategy. Layer 1: Ansible Vault for playbook secrets (database passwords, API keys in vars/secrets.yml). The vault password is stored in a password manager (HashiCorp Vault or 1Password for teams), never in Git. Each environment has a separate vault password. Layer 2: SSH keys managed in the AAP credential store. Engineers never see or hold SSH keys; AAP injects them at job execution time. This gives a complete audit of who connected to which server and when. Layer 3: For production secrets that rotate regularly (DB passwords, API tokens), fetch them from HashiCorp Vault at playbook runtime, for example with a lookup plugin such as community.hashi_vault.hashi_vault or with Vault Agent, and never hardcode them, even in vault files. (External Secrets Operator solves the same problem for Kubernetes workloads, not for Ansible runs.) Rotation: when a secret rotates, update it in HashiCorp Vault and all playbooks pick it up automatically on the next run without any code changes. At HPE: I implemented this three-layer approach. Result: no engineer has direct SSH access to production servers, every secret access is audited, and we passed SOC2 audit without any findings related to credential management.
ANSIBLE · PRODUCTION
A runaway Ansible playbook is running on production and making unintended changes. How do you stop it?
Immediate stop: press Ctrl+C in the terminal if you are watching it. This aborts the run on the control node, but a module that is already executing on a target may still run to completion. If it is running in Jenkins/AAP: cancel the job immediately in the UI. For SSH-based playbooks you can also kill the SSH sessions to the target hosts from the control node: pkill -f "ssh.*production-server". Assessment: read the job output (or the AAP job log) to see which tasks reported "changed" on which hosts; that is your blast radius. Ansible stores no rollback information, so if tasks already ran (files changed, services restarted, packages installed), you must manually reverse them or re-run an earlier version of the playbook. Once a fix is ready, --start-at-task lets you resume from a specific task instead of re-running everything. Prevention: always run --check --diff in CI before any prod apply. Use serial to limit blast radius. For high-risk plays, add a manual approval step in the AAP workflow before the actual execution stage. At HPE: we had a runaway playbook that restarted all telecom services simultaneously instead of serially. The fix took 2 hours. After this we added serial: 1 to all service-restart playbooks and mandatory --check in CI.

🗺️ Learning Roadmap

Week 1
Foundations
Install Ansible: pip install ansible
Create inventory with 3 hosts
Run: ansible all -m ping
Write first playbook: install nginx
Week 2
Core Concepts
Roles — create and reuse
Variables and precedence
Ansible Vault for secrets
Handlers and templates
Week 3-4
Production Patterns
Dynamic inventory (AWS/Azure)
Rolling deployments with serial
Idempotency — write it right
Ansible in CI/CD pipeline
Month 2
Enterprise — AAP
AAP installation and setup
RBAC — teams and job templates
Workflow templates
REST API integration