DevOps
A culture and set of practices that unify development (Dev) and operations (Ops) to deliver software with greater speed, quality, and reliability. It is not a role; it is a way of working.
DevOps is a cultural and technical movement that eliminates silos between development and operations. It was born from frustration with the traditional model where Dev "throws code over the wall" and Ops "keeps it alive" — with no shared responsibility.
What problem it solves
In the traditional model:
- Dev wants fast changes, Ops wants stability — permanent conflict
- Manual deploys every few weeks or months, so risk accumulates between releases
- "Works on my machine" — production problems
- Blame culture — nobody wants to deploy on Fridays
DevOps aligns incentives: the team that builds the software is responsible for operating it.
The three ways
Fundamental principles from The Phoenix Project:
1. Flow (systems thinking, left to right)
Optimize the flow of work from development to production:
- Make work visible (Kanban boards)
- Limit work in progress (WIP)
- Reduce batch sizes
- Eliminate handoffs and queues
- Automate everything repetitive
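One way to see why limiting WIP shortens delivery time is Little's Law, which for a stable system relates average lead time to average work in progress and average throughput. The sketch below is only an illustration: the function name and the team numbers are assumptions, not figures from any study.

```python
def average_lead_time_days(wip_items: float, throughput_per_day: float) -> float:
    """Little's Law: average lead time = average WIP / average throughput."""
    if throughput_per_day <= 0:
        raise ValueError("throughput_per_day must be positive")
    return wip_items / throughput_per_day


# Hypothetical team that finishes 2 work items per day on average.
print(average_lead_time_days(wip_items=20, throughput_per_day=2))  # 10.0
print(average_lead_time_days(wip_items=8, throughput_per_day=2))   # 4.0
```

With throughput unchanged, cutting the amount of work in flight from 20 items to 8 cuts the average wait from 10 days to 4.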
2. Feedback (right to left)
Create fast feedback loops:
- Monitoring and alerts in production
- Automated tests in CI
- Code review in PRs
- Blameless post-mortems
- User telemetry
3. Continual learning
Culture of experimentation and improvement:
- Blameless post-mortems
- Chaos engineering
- Game days (incident simulations)
- 20% time for technical improvements
- Knowledge sharing (tech talks, documentation)
CALMS framework
Model for evaluating DevOps adoption:
| Pillar | Meaning | Example |
|---|---|---|
| Culture | Collaboration over silos | Cross-functional teams |
| Automation | Eliminate manual work | CI/CD, IaC, auto-scaling |
| Lean | Eliminate waste | Limit WIP, reduce batch size |
| Measurement | Measure everything | DORA metrics, SLOs, error budgets |
| Sharing | Share knowledge | Post-mortems, runbooks, tech talks |
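To make the Measurement pillar concrete, here is a minimal sketch of how the four DORA metrics (deployment frequency, lead time for changes, change failure rate, time to restore) could be derived from deployment records. The `Deployment` record, the `dora_summary` helper, and the sample data are all hypothetical; a real setup would pull these values from the CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_summary(deploys: list[Deployment], period_days: int) -> dict:
    if not deploys or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    failures = [d for d in deploys if d.failed]
    restore_hours = [
        hours(d.restored_at - d.deployed_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time_hours": median(
            hours(d.deployed_at - d.committed_at) for d in deploys
        ),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore_hours": median(restore_hours) if restore_hours else None,
    }


# Made-up sample: three deployments in one week, one of which failed.
start = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deployment(start, start + timedelta(hours=2)),
    Deployment(
        start + timedelta(days=1),
        start + timedelta(days=1, hours=4),
        failed=True,
        restored_at=start + timedelta(days=1, hours=5),
    ),
    Deployment(start + timedelta(days=2), start + timedelta(days=2, hours=6)),
]
summary = dora_summary(deploys, period_days=7)
print(summary)
```

Reporting the failure rate and restore time next to deployment frequency keeps the speed numbers from being read in isolation.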
Essential practices
Infrastructure as Code (IaC)
Define infrastructure in versioned files:
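```hcl
# Terraform
resource "aws_lambda_function" "api" {
  function_name = "api-handler"
  runtime       = "nodejs20.x"
  handler       = "index.handler"
  filename      = "lambda.zip"
  # Required execution role; assumes an aws_iam_role named "lambda_exec" is defined elsewhere.
  role          = aws_iam_role.lambda_exec.arn
}
```
Benefits: reproducibility, auditing, rollback, review in PRs.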
Monitoring and observability
The three pillars:
- Logs — discrete events (what happened)
- Metrics — numerical values over time (how much)
- Traces — flow of a request through services (where)
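The difference between the three signals is easiest to see side by side. The toy sketch below uses only the Python standard library and hypothetical names (`log_event`, `span`, `handle_request`); it is not how a production system would be instrumented, where a dedicated telemetry library would normally do this work.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

request_counts = Counter()  # metric: a numeric value aggregated over time
spans = []                  # trace: timed steps that share one trace_id


def log_event(message: str, **fields) -> str:
    """Log: one discrete, structured event."""
    line = json.dumps({"ts": time.time(), "msg": message, **fields})
    print(line)
    return line


@contextmanager
def span(trace_id: str, name: str):
    """Record how long one step of a request took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(path: str) -> str:
    trace_id = uuid.uuid4().hex
    with span(trace_id, "handle_request"):
        with span(trace_id, "load_data"):
            pass  # placeholder for a call to a downstream service
        request_counts[path] += 1
        log_event("request handled", path=path, trace_id=trace_id)
    return trace_id


trace_id = handle_request("/orders")
```

The log line answers what happened, the counter answers how much, and the spans (linked by `trace_id`) answer where the time went.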
SLOs and error budgets
- SLI (Service Level Indicator) — measurable metric (p99 latency, availability)
- SLO (Service Level Objective) — internal target (99.9% availability)
- SLA (Service Level Agreement) — contractual commitment with consequences
- Error budget — allowed margin of failure (0.1% = 43 min/month of downtime)
If the error budget is exhausted, freeze features and prioritize stability.
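The arithmetic behind the error budget can be sketched as follows, assuming an availability SLO measured over a 30-day window. The function names and the freeze rule (freeze once the budget reaches zero) are illustrative assumptions; teams define their own policy.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the period."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1 - slo) * period_days * 24 * 60


def budget_remaining_minutes(slo: float, downtime_minutes: float,
                             period_days: int = 30) -> float:
    return error_budget_minutes(slo, period_days) - downtime_minutes


def should_freeze_features(slo: float, downtime_minutes: float,
                           period_days: int = 30) -> bool:
    """True once the error budget for the period is used up."""
    return budget_remaining_minutes(slo, downtime_minutes, period_days) <= 0


print(round(error_budget_minutes(0.999), 1))                # 43.2
print(should_freeze_features(0.999, downtime_minutes=10))   # False
print(should_freeze_features(0.999, downtime_minutes=50))   # True
```

A 99.9% target over 30 days leaves 0.1% of 43,200 minutes, which is where the figure of roughly 43 minutes comes from.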
Blameless post-mortems
After every incident:
- Timeline — what happened, when, who did what
- Root cause — 5 whys analysis
- Impact — affected users, duration, data lost
- Action items — concrete improvements with owners and deadlines
- Lessons learned — what worked well, what didn't
Cardinal rule: blame the system, not the people.
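The sections above can be captured as a small data structure so that an incomplete write-up is easy to spot. This is a hypothetical template, not a standard format; the class and field names are assumptions chosen to mirror the list.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str       # every action item needs a named owner
    deadline: date   # and a deadline


@dataclass
class PostMortem:
    title: str
    timeline: list[str]
    root_cause: str
    impact: str
    action_items: list[ActionItem] = field(default_factory=list)
    lessons_learned: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "timeline": self.timeline,
            "root_cause": self.root_cause,
            "impact": self.impact,
            "action_items": self.action_items,
            "lessons_learned": self.lessons_learned,
        }
        return [name for name, value in sections.items() if not value]


# Made-up draft that still lacks action items and lessons learned.
draft = PostMortem(
    title="Checkout outage",
    timeline=["14:02 alert fired", "14:10 rollback started", "14:25 service restored"],
    root_cause="Connection pool exhausted after a config change",
    impact="Checkout unavailable for 23 minutes",
)
print(draft.missing_sections())  # ['action_items', 'lessons_learned']
```

Note that the record describes systems and events, in line with the cardinal rule: it has no field for who was at fault.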
Chaos engineering
Deliberately inject failures to discover weaknesses:
- Kill random instances (Chaos Monkey)
- Inject network latency
- Fill disks
- Simulate dependency failures
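At the application level, fault injection can be as simple as wrapping a dependency call. The following is a minimal sketch with hypothetical names (`chaos`, `InjectedFailure`, `fetch_inventory`); it is not the mechanism Chaos Monkey uses, which terminates instances at the infrastructure level.

```python
import random
import time
from functools import wraps


class InjectedFailure(RuntimeError):
    """Raised when the chaos wrapper simulates a dependency failure."""


def chaos(failure_rate: float = 0.0, max_latency_s: float = 0.0, rng=None):
    """Decorator that randomly delays or fails the wrapped call."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if max_latency_s > 0:
                time.sleep(rng.uniform(0, max_latency_s))  # injected latency
            if rng.random() < failure_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaos(failure_rate=0.3, rng=random.Random(42))
def fetch_inventory(sku: str) -> dict:
    # Stand-in for a call to a real downstream service.
    return {"sku": sku, "in_stock": True}


def fetch_with_fallback(sku: str) -> dict:
    """The behavior under test: degrade gracefully instead of crashing."""
    try:
        return fetch_inventory(sku)
    except InjectedFailure:
        return {"sku": sku, "in_stock": None}


results = [fetch_with_fallback("abc-123") for _ in range(10)]
degraded = sum(1 for r in results if r["in_stock"] is None)
print(f"{degraded} of {len(results)} calls served the fallback")
```

The point of the experiment is the caller, not the wrapper: if `fetch_with_fallback` did not handle the failure, the injected fault would surface that weakness before a real outage does.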
DevOps vs SRE
| Aspect | DevOps | SRE |
|---|---|---|
| Origin | Community (2009) | Google (2003) |
| Focus | Culture + practices | Reliability engineering |
| Definition | Movement | Role/discipline |
| Relationship | Philosophy | Concrete implementation of DevOps through software engineering |
As Ben Treynor (SRE creator at Google) said: "SRE is what happens when you ask a software engineer to design an operations team."
Evolution: Platform Engineering
The natural evolution of DevOps in large organizations:
- DevOps — "you build it, you run it" (each team operates its software)
- Platform Engineering — one team builds the internal platform that other teams consume
The platform abstracts complexity: the developer runs git push, and the platform handles build, test, deploy, and monitoring.
Anti-patterns
- DevOps team — creating a team called "DevOps" that becomes the new silo
- Automation without culture — tools without cultural change solve nothing
- Heroism — depending on one person who "knows everything" instead of documenting
- Vanity metrics — measuring deploys/day without measuring quality or impact
- Tool obsession — switching tools every 6 months without solving root problems
Why it matters
DevOps is not a role or a tool; it is a cultural shift that removes the barrier between those who write code and those who operate it. Organizations that adopt it effectively deliver software faster, with fewer failures, and recover from incidents more quickly. Those that treat it as a job title miss the point.
References
- The Phoenix Project — Gene Kim, Kevin Behr & George Spafford, 2013. The novel that popularized DevOps.
- The DevOps Handbook — Gene Kim et al., 2021. Practical implementation guide (second edition).
- Accelerate — Nicole Forsgren, Jez Humble & Gene Kim, 2018. Scientific research on DORA metrics.
- Google SRE Books — Google, 2016-2024. Three free books on Site Reliability Engineering.
- State of DevOps Report — DORA/Google Cloud, 2024. Annual research on practices and performance.
- The Twelve-Factor App — Adam Wiggins, 2011. Methodology for building cloud-native applications.