LearnwithVishnu
Basics → Production → Architect
Linux & Bash
Command line mastery: every DevOps engineer's foundation

🐧 Why Linux for DevOps?

Nearly every server, container, and Kubernetes node runs Linux. When something breaks at 2am, you debug it in a Linux terminal. When you write automation scripts, you use bash. When you tune performance, you read Linux metrics. Linux command line proficiency is non-negotiable for any DevOps role.

Linux in the DevOps World

Where you encounter Linux | What you need to know
Production servers (AWS EC2, Azure VM) | Files, processes, networking, services
Docker containers | Alpine/Debian/Ubuntu base images, shell debugging
Kubernetes nodes | kubelet runs on Linux, debug with node exec
CI/CD pipelines (GitHub Actions, Jenkins) | Pipeline steps run bash on Ubuntu runners
Ansible playbooks | SSH into Linux servers, execute Linux commands
Terraform remote-exec | Run shell scripts on provisioned Linux VMs
Linux navigation and distributions

⚙️ Processes & Services

What is a Process?

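A process is a running program with its own PID (Process ID), memory space, and file handles. When you start nginx, the OS creates a process. When nginx spawns worker processes, each gets its own PID. Every process has a parent. If the parent exits first, the orphaned child is adopted by init (PID 1, usually systemd), which reaps it when it exits. Zombies arise the other way round: a child has exited, but its still-running parent has not collected the exit status.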

Process States

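State | What it means
R Running | Actively using CPU or ready to use CPU
S Sleeping | Waiting for something (I/O, timer, signal); normal
D Uninterruptible sleep | Waiting on I/O (usually disk or network storage) and cannot be interrupted by signals; many D-state processes usually mean slow or stuck storage
Z Zombie | Process has exited but its parent has not yet called wait() to collect the exit status; harmless in small numbers, but a growing count means the parent is not reaping its children
T Stopped | Paused by a signal such as SIGSTOP or SIGTSTP (Ctrl+Z in a terminal)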
Process management commands
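To make the state letters in the table concrete, here is a minimal sketch that starts a throwaway `sleep` process, reads its state from `/proc`, pauses it, and then reaps it. The helper name `proc_state` and the variable names are illustrative assumptions, not part of any standard tool.

```shell
#!/usr/bin/env bash
# Print the one-letter state (R, S, D, Z, T, ...) of a PID, read from /proc.
# proc_state is a hypothetical helper name used only for this sketch.
proc_state() {
  awk '/^State:/ {print $2}' "/proc/$1/status"
}

sleep 300 &                 # throwaway background child to experiment on
child=$!
sleep 1                     # give it time to start and block in the kernel
state_started=$(proc_state "$child")   # normally S (sleeping)

kill -STOP "$child"         # pause it (Ctrl+Z sends the related SIGTSTP)
sleep 1
state_stopped=$(proc_state "$child")   # T (stopped)

kill -CONT "$child"         # resume it, then ask it to exit
kill -TERM "$child"
wait "$child" 2>/dev/null || true      # reap the child so it does not linger as a zombie

echo "started=$state_started stopped=$state_stopped"
```

The final `wait` is the step a misbehaving parent skips: without it, the exited child would stay in state Z until the parent itself exits.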

📊 Performance Troubleshooting

The USE Method — Systematic Performance Analysis

For every resource (CPU, memory, disk, network): check Utilisation (how busy is it?), Saturation (is there a queue forming?), Errors (are there failures?). Don't randomly check things — follow this framework every time.

Performance troubleshooting — full flow
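As one possible illustration of the USE method, the sketch below applies it to memory only: utilisation comes from `MemTotal` and `MemAvailable`, and swap in use serves as a rough saturation signal. The function name `mem_use` is an assumption for this example; tools such as `vmstat` and `iostat` give the equivalent view for CPU and disk.

```shell
#!/usr/bin/env bash
# Memory view of the USE method, read from /proc/meminfo (values are in kB).
# mem_use is a hypothetical helper name for this sketch.
mem_use() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    /^SwapTotal:/    { swap_total = $2 }
    /^SwapFree:/     { swap_free = $2 }
    END {
      # Utilisation: share of RAM that is not available to new workloads.
      # Saturation (rough proxy): swap in use means memory demand exceeded RAM.
      printf "mem_util_pct=%d swap_used_kb=%d\n",
             (total - avail) * 100 / total, swap_total - swap_free
    }' /proc/meminfo
}

mem_use
```

For the Errors part of the method, kernel messages about the OOM killer (for example via `journalctl -k`) are the usual place to look.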

📁 Files, Permissions & Text Processing

Understanding Linux Permissions

Every file has three permission sets: owner, group, others. Each set has read (r=4), write (w=2), execute (x=1). The number 755 means: owner=7(rwx), group=5(r-x), others=5(r-x).

Permission | Octal | Use case
rwxr-xr-x | 755 | Executables, directories with public access
rw-r--r-- | 644 | Regular config files, public readable
rw------- | 600 | SSH private keys, sensitive credentials
rwx------ | 700 | Directories with sensitive content
rwxrwxrwx | 777 | NEVER use in production: anyone can modify!
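The mapping between octal and symbolic permissions can be checked on a scratch file. This is a small sketch assuming GNU or BusyBox `stat` (the `-c` format option); `mode_of` and the file name `demo_key` are made up for the example.

```shell
#!/usr/bin/env bash
# Print a file's permissions as "octal symbolic", for example "600 -rw-------".
# mode_of is a hypothetical helper name for this sketch.
mode_of() {
  stat -c '%a %A' "$1"
}

scratch=$(mktemp -d)
touch "$scratch/demo_key"          # stand-in for a private key file

chmod 600 "$scratch/demo_key"      # owner read/write only
mode_of "$scratch/demo_key"        # prints: 600 -rw-------

chmod 644 "$scratch/demo_key"      # owner read/write, group and others read
mode_of "$scratch/demo_key"        # prints: 644 -rw-r--r--

rm -rf -- "$scratch"
```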
File operations + text processing

🌐 Networking Commands

Network Troubleshooting Mindset

Work through the OSI layers from bottom up: Physical → Network (ping, ip route) → Transport (ss, netstat, nc) → Application (curl, wget, nslookup). Most DevOps networking problems are at layers 3-7.

Complete networking commands

🖥️ Bash Scripting — Production Standard

Why Bash Matters

CI/CD pipeline steps are bash. Deployment scripts are bash. Cron jobs are bash. The difference between a good bash script and a dangerous one is error handling. A script that silently continues after an error can delete production data.

Critical rules: Always use set -euo pipefail. Always use logging functions. Always trap cleanup on exit. Never use rm -rf with an unquoted variable.

Production bash script template
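A minimal sketch of such a template, following the rules above (strict mode, a logging function, a cleanup trap, quoted variables). The names `log`, `cleanup`, `main`, and `WORKDIR`, and the placeholder work inside `main`, are illustrative assumptions rather than a fixed standard.

```shell
#!/usr/bin/env bash
set -euo pipefail   # exit on errors, on unset variables, and on failed pipelines

# log LEVEL MESSAGE...  writes a timestamped line to stderr.
log() {
  printf '%s [%s] %s\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$1" "${*:2}" >&2
}

WORKDIR=$(mktemp -d)

# Runs on every exit path, including failures triggered by set -e.
cleanup() {
  # ${WORKDIR:?} aborts instead of expanding to an empty string,
  # so this can never turn into "rm -rf /".
  rm -rf -- "${WORKDIR:?}"
  log INFO "cleaned up $WORKDIR"
}
trap cleanup EXIT

main() {
  log INFO "starting"
  echo "payload" > "$WORKDIR/artifact.txt"   # placeholder for the real work
  log INFO "finished"
}

main "$@"
```

Keeping the real work inside `main` means the strict-mode line and the trap are always set up before anything with side effects runs.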

🔍 Troubleshooting — Scenarios

These scenarios come up regularly in senior DevOps interviews. Know them cold.

Server troubleshooting — complete playbook

🎯 Interview Questions

LINUX · ENGINEER
Server is at high CPU. Walk through how you find the cause.
Start broad, then narrow. First: uptime to see the load average, and compare it to the number of CPUs (nproc). If load stays around 2× the number of CPUs or higher, work is queuing and something is wrong. Then: ps aux --sort=-%cpu to find the top consumer. Note the PID and process name. Check how long it has been running with ps -o pid,etime,cmd -p PID. If it is a known service (nginx, java): check its logs with journalctl -u nginx --since '30 min ago'. If it is a runaway process: check what it is doing with strace -p PID -e trace=all. A tight loop of repeated or failing syscalls is a strong hint. If strace shows almost nothing, the process is spinning in user space, and a profiler such as perf top -p PID is the next step. Example from HPE: a Kafka consumer stuck in a retry loop consuming 100% CPU. Fix: kill the process, find the poison message, add a retry limit with backoff in code.
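The first step of this answer, comparing load average to CPU count, can be sketched as follows. The function names `load_per_cpu` and `top_cpu` are assumptions for this example, and `top_cpu` assumes the procps version of `ps`.

```shell
#!/usr/bin/env bash
# Print the 1-minute load average divided by the number of online CPUs.
# Values persistently well above 1.00 mean runnable work is queuing.
load_per_cpu() {
  local load cpus
  read -r load _ < /proc/loadavg      # first field is the 1-minute average
  cpus=$(nproc)
  awk -v l="$load" -v c="$cpus" 'BEGIN { printf "%.2f\n", l / c }'
}

# Print a header plus the N busiest processes (procps ps assumed).
top_cpu() {
  ps -eo pid,pcpu,etime,comm --sort=-pcpu | head -n "$(( ${1:-5} + 1 ))"
}

echo "load per CPU: $(load_per_cpu)"
```

Calling `top_cpu 5` afterwards gives the PID to feed into the log and strace steps.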
LINUX · ENGINEER
What is the difference between a process and a thread in Linux?
A process is an independent program with its own memory space, file descriptors, and PID. A thread is a lightweight execution unit WITHIN a process: threads share the same memory space and file descriptors as the process that owns them. Creating a process with fork() costs more than creating a thread: the kernel duplicates the page tables and the file descriptor table, although the memory pages themselves are shared copy-on-write and only copied when one side writes to them. Creating a thread is cheaper because it reuses the existing address space. In Linux, both are implemented as tasks created with the clone() syscall: processes use clone() without the CLONE_VM flag (separate memory), threads use clone() with CLONE_VM (shared memory). For DevOps: ps aux shows processes. To see threads: ps -eLf or top -H. Important for troubleshooting: if a Java process has 200 threads and CPU is high, it might be a thread pool exhaustion issue. Use jstack PID to get a thread dump.
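Because threads are tasks that share one process, the kernel lists them under `/proc/PID/task`. A minimal sketch for counting them without `ps`; the helper name `thread_count` is an assumption.

```shell
#!/usr/bin/env bash
# Count the threads (kernel tasks) that belong to a process.
# Each entry in /proc/PID/task is one thread; thread_count is a hypothetical name.
thread_count() {
  ls "/proc/$1/task" | wc -l
}

echo "threads in this shell: $(thread_count $$)"
```

Running it against a JVM or a database PID instead of `$$` typically shows dozens or hundreds of tasks, which is the number `ps -eLf` expands into individual rows.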
LINUX · PRODUCTION
Your disk is 100% full on a production server. Walk through the fix without downtime.
Do NOT just delete random files. Systematic approach: First: df -h to confirm which partition is full. Second: du -sh /* to find the largest directories. Third: common culprits in order — /var/log (logs grew unbounded), /var/lib/docker (Docker images/containers), /tmp (someone wrote large temp files), /home (developer left large files). Safe immediate fixes: journalctl --vacuum-size=500M to trim journal logs. find /var/log -name '*.gz' -mtime +30 -delete to remove old compressed logs. docker system prune -f to remove unused Docker resources. For permanent fix: add logrotate config, add monitoring alert at 80% disk usage. At HPE: had this on a TeMIP server. /var/log/app filled up because log level was set to DEBUG in production. Fixed by changing log level to INFO and adding logrotate.
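The "find the largest directories" step can be wrapped in a small helper. This is a sketch, not a drop-in tool: `largest_dirs` is a made-up name, sizes are printed in KiB so they sort numerically, and hidden entries are skipped because of the `*` glob.

```shell
#!/usr/bin/env bash
# largest_dirs ROOT [COUNT]: list the biggest entries directly under ROOT.
# -x keeps du on one filesystem, -s summarises each entry, -k reports KiB.
largest_dirs() {
  local root=$1 count=${2:-5}
  du -xsk -- "$root"/* 2>/dev/null | sort -rn | head -n "$count"
}

largest_dirs /var 5
```

Starting at `/` and then repeating the call on the largest entry narrows down the culprit in a few steps.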
LINUX · ARCHITECT
Explain Linux file permissions. How do you secure a private key file?
Every file has three permission sets: owner, group, others. Each set has three bits: read (4), write (2), execute (1). Common values: 755 = owner can rwx, group and others can rx; good for executables. 644 = owner can rw, group and others can read; good for config files. 600 = only owner can rw, nobody else has any access; required for SSH private keys. 700 = only owner can rwx; good for directories with sensitive content. For an SSH private key: chmod 600 ~/.ssh/id_rsa. If the permissions are too open, ssh ignores the key and prints an "UNPROTECTED PRIVATE KEY FILE" warning, and the login then typically fails with Permission denied (publickey). For production: sensitive config files should be 640 (owner read-write, group read) and owned by the application user. Never use 777 in production: it means anyone can modify the file.
LINUX · PRODUCTION
How do you investigate a memory leak on a Linux server?
Memory leak = application allocates memory and never frees it. Symptoms: free -h shows available memory decreasing over hours/days, and the server eventually OOM-kills processes. Investigation: watch the specific process over time with watch -n 60 'ps -o pid,vsz,rss,comm -p PID'. A steadily growing RSS (resident memory) is the main signal of a leak; VSZ (virtual) usually grows with it. Check dmesg and journalctl -k for OOM killer messages: they show which process was killed and how much memory it had. For Java: jmap -histo PID shows object count by class, so you can see which class is growing. For Python: use tracemalloc or memory_profiler. For Go: use pprof. Immediate mitigation: restart the leaking service (a nightly cron restart if the fix takes time). Permanent fix: find the objects that are never released and fix the code. At HPE: a Python Kafka consumer cached every processed message ID in a dict without expiry. Fixed by bounding the cache: an OrderedDict that evicts the oldest entry once a size limit is reached.
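When `watch` is not convenient (for example, when the samples should go to a file), the same observation can be scripted directly against `/proc`. A minimal sketch; `rss_kb` and `sample_rss` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Print the resident set size (VmRSS, in kB) of a process.
rss_kb() {
  awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

# sample_rss PID COUNT INTERVAL: print "epoch_seconds rss_kb" COUNT times.
sample_rss() {
  local pid=$1 count=$2 interval=$3 i
  for (( i = 0; i < count; i++ )); do
    printf '%s %s\n' "$(date +%s)" "$(rss_kb "$pid")"
    sleep "$interval"
  done
}

sample_rss $$ 3 1   # three samples of this shell, one second apart
```

For a real investigation the interval would be minutes rather than seconds, and a leak shows up as an RSS column that rises and never comes back down.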
LINUX · ENGINEER
What is set -euo pipefail and why do you use it in bash scripts?
Three separate options: set -e makes the script exit immediately when a command returns a non-zero exit code (commands tested in an if, while, or && / || list are exempt). Without it, errors are silently ignored and the script continues, which is dangerous in deployment scripts. set -u makes the script exit when you reference an undefined variable. Without it, a typo in a variable name gives an empty string, a silent bug. Example: rm -rf $DIRECOTRY/ (typo) without -u expands to rm -rf /, which targets the root filesystem. set -o pipefail makes a pipeline fail if ANY command in the pipe fails. Without it, ls /nonexistent | sort returns exit code 0 because sort succeeded, so the ls failure is hidden. Together they make bash scripts behave more like proper programming languages: fail loudly on errors rather than silently continuing in a broken state. Every production bash script should start with these.
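Each option can be observed in isolation by running a one-line child shell with and without it. In this sketch `exit_status_of` is a made-up helper and `UNSET_DEMO_VAR` is assumed not to exist in the environment.

```shell
#!/usr/bin/env bash
# Run a command quietly and print its exit status.
exit_status_of() {
  "$@" >/dev/null 2>&1 && echo 0 || echo $?
}

# pipefail: "false | true" succeeds unless pipefail is on.
echo "pipeline, default:  $(exit_status_of bash -c 'false | true')"
echo "pipeline, pipefail: $(exit_status_of bash -o pipefail -c 'false | true')"

# -u: an unset variable is an empty string by default, a fatal error with -u.
echo "unset var, default: $(exit_status_of bash -c 'echo "$UNSET_DEMO_VAR"')"
echo "unset var, -u:      $(exit_status_of bash -uc 'echo "$UNSET_DEMO_VAR"')"

# -e: the script keeps going after "false" by default, and stops there with -e.
echo "after error, default: $(bash -c 'false; echo reached')"
echo "after error, -e:      $(bash -ec 'false; echo reached')"
```

The last two lines print `reached` only for the default shell, which is exactly the silent-continue behaviour that strict mode removes.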
LINUX · PRODUCTION
A service cannot connect to a database. Walk through network troubleshooting.
Layered investigation, working up from the network to the application. Step 1: can we reach the DB host at all? ping db-server from the app server. If ping fails, suspect routing or a firewall, but many networks block ICMP, so a failed ping alone is not conclusive. Step 2: is the DB port open? nc -zv db-server 5432 (PostgreSQL) or nc -zv db-server 3306 (MySQL). If this fails, the DB is not listening, a firewall is blocking, or the host/port is wrong. Step 3: is DNS resolving correctly? nslookup db-server and check that it resolves to the right IP. Step 4: is there a firewall rule? On the DB server: sudo iptables -L -n | grep 5432, and ss -tlnp | grep 5432 to confirm PostgreSQL is actually listening. On the app server: check that outbound traffic on 5432 is allowed. Step 5: test the actual connection with the DB client: psql -h db-server -U user -d dbname, which confirms credentials and SSL settings too. Step 6: check the application config for a wrong host name, port, or credentials.
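The DNS and port checks can be combined into one helper for hosts where `nc` or `nslookup` are not installed. This sketch assumes bash's built-in `/dev/tcp` support plus the `getent` and `timeout` utilities; the function names are made up for the example, and `localhost` stands in for the real database host.

```shell
#!/usr/bin/env bash
# Print the first address a host name resolves to (empty if it does not resolve).
resolve_host() {
  getent hosts "$1" | awk '{print $1; exit}'
}

# Succeed if a TCP connection to HOST:PORT can be opened within 3 seconds.
check_port() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# diagnose HOST PORT: run the DNS check, then the port check, and report each.
diagnose() {
  local host=$1 port=$2 ip
  ip=$(resolve_host "$host")
  if [ -z "$ip" ]; then
    echo "DNS: $host does not resolve"
    return 1
  fi
  echo "DNS: $host -> $ip"
  if check_port "$host" "$port"; then
    echo "TCP: $host:$port accepts connections"
  else
    echo "TCP: $host:$port is closed, filtered, or not listening"
    return 2
  fi
}

diagnose localhost 5432 || true
```

A distinct return code per step makes it easy to tell from a script or a pipeline log which layer failed.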
LINUX · ARCHITECT
What is the Linux /proc filesystem and how do you use it for troubleshooting?
/proc is a virtual filesystem: it exists only in memory, not on disk. It exposes kernel and process information as readable files. Every process has a directory /proc/PID containing: cmdline (full command), fd (open file descriptors), status (memory, state), net (network info). Key files: /proc/meminfo shows a detailed memory breakdown including cached, buffers, available. /proc/cpuinfo shows CPU details and core count. /proc/loadavg shows the 1/5/15 minute load average. /proc/net/tcp shows all TCP connections in kernel format. For troubleshooting: cat /proc/PID/status shows process state and memory usage (VmSize, VmRSS), and cat /proc/PID/oom_score shows the OOM score. ls /proc/PID/fd | wc -l counts open file descriptors; if this is very high, you have a file descriptor leak. cat /proc/PID/net/tcp shows the TCP sockets in that process's network namespace, not only its own connections; to see which sockets belong to the process, use ss -tnp. Treat /proc as read-only except for specific tunables such as /proc/sys/net/ipv4/tcp_fin_timeout or /proc/PID/oom_score_adj.
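A few of these per-process files, read directly. This is a minimal sketch with made-up helper names (`fd_count`, `proc_cmdline`, `oom_score`), run against the current shell (`$$`) so it works without root.

```shell
#!/usr/bin/env bash
# Number of open file descriptors held by a process.
fd_count() {
  ls "/proc/$1/fd" | wc -l
}

# Full command line; the kernel stores arguments separated by NUL bytes.
proc_cmdline() {
  tr '\0' ' ' < "/proc/$1/cmdline"
  echo
}

# Current OOM score: higher means more likely to be chosen by the OOM killer.
oom_score() {
  cat "/proc/$1/oom_score"
}

echo "cmdline:   $(proc_cmdline $$)"
echo "open fds:  $(fd_count $$)"
echo "oom score: $(oom_score $$)"
```

Reading another user's `/proc/PID/fd` requires root, so on a production host these helpers would usually be run with `sudo`.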

🗺️ Roadmap

Week 1
Navigation
Navigate filesystem without GUI
Understand file permissions
Manage files: cp, mv, rm, find, grep
Week 2
Processes & Services
ps, top, kill — find and manage processes
systemctl — manage services
journalctl — read system logs
Week 3
Networking & Troubleshooting
ss, netstat, nc — port checking
curl, dig — HTTP and DNS testing
iostat, vmstat — performance analysis
Month 2
Bash Scripting
set -euo pipefail in every script
Functions, loops, error handling
Write a deployment script from scratch
Understand cron, at, systemd timers