DevOps

DevOps is a cultural and technical movement that eliminates silos between development and operations. It was born from frustration with the traditional model where Dev "throws code over the wall" and Ops "keeps it alive" — with no shared responsibility.

What problem it solves

In the traditional model:

Dev wants fast changes, Ops wants stability — permanent conflict
Manual deploys every few weeks or months — risk accumulation
"Works on my machine" — production problems
Blame culture — nobody wants to deploy on Fridays

DevOps aligns incentives: the team that builds the software is responsible for operating it.

The three ways

Fundamental principles from The Phoenix Project:

1. Flow (systems thinking, left to right)

Optimize the flow of work from development to production:

Make work visible (Kanban boards)
Limit work in progress (WIP)
Reduce batch sizes
Eliminate handoffs and queues
Automate everything repetitive
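One way to see why limiting WIP shortens delivery is Little's Law, which for a stable system relates average lead time to work in progress and throughput. The sketch below is only an illustration: the function name and the team numbers are assumptions, not figures from the source.

```python
def average_lead_time_days(wip_items: float, throughput_per_day: float) -> float:
    """Little's Law: average lead time = work in progress / throughput."""
    if throughput_per_day <= 0:
        raise ValueError("throughput must be positive")
    return wip_items / throughput_per_day


# Hypothetical team that finishes 2 items per day.
print(average_lead_time_days(wip_items=20, throughput_per_day=2))  # 10.0
print(average_lead_time_days(wip_items=6, throughput_per_day=2))   # 3.0
```

With throughput unchanged, cutting WIP from 20 items to 6 cuts the average wait from 10 days to 3.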

2. Feedback (right to left)

Create fast feedback loops:

Monitoring and alerts in production
Automated tests in CI
Code review in PRs
Blameless post-mortems
User telemetry

3. Continual learning

Culture of experimentation and improvement:

Blameless post-mortems
Chaos engineering
Game days (incident simulations)
20% time for technical improvements
Knowledge sharing (tech talks, documentation)

CALMS framework

Model for evaluating DevOps adoption:

Pillar	Meaning	Example
Culture	Collaboration over silos	Cross-functional teams
Automation	Eliminate manual work	CI/CD, IaC, auto-scaling
Lean	Eliminate waste	Limit WIP, reduce batch size
Measurement	Measure everything	DORA metrics, SLOs, error budgets
Sharing	Share knowledge	Post-mortems, runbooks, tech talks
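The Measurement pillar mentions DORA metrics. As a rough sketch of how the four metrics (deployment frequency, lead time for changes, change failure rate, time to restore) could be derived from deployment records, assuming a hypothetical `Deployment` record and made-up sample data:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four DORA metrics over a reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Made-up sample: four deployments in one week, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=4)),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=6),
               failed=True, restored_at=t0 + timedelta(days=2, hours=7)),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=8)),
]
result = dora_metrics(sample, period_days=7)
print(result)
```

Reporting the failure rate and restore time next to deployment frequency keeps the speed numbers from being read in isolation.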

Essential practices

Infrastructure as Code (IaC)

Define infrastructure in versioned files:

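# Terraform
resource "aws_lambda_function" "api" {
  function_name = "api-handler"
  runtime       = "nodejs20.x"
  handler       = "index.handler"
  filename      = "lambda.zip"
  # Required execution role; assumes an aws_iam_role named "lambda_exec" is defined elsewhere
  role          = aws_iam_role.lambda_exec.arn
}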

Benefits: reproducibility, auditing, rollback, review in PRs.

Monitoring and observability

The three pillars:

Logs — discrete events (what happened)
Metrics — numerical values over time (how much)
Traces — flow of a request through services (where)
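To make the distinction concrete, here is a minimal standard-library sketch in which a single request produces all three signals. The names (`handle_request`, `metrics`, `spans`) are hypothetical; a real service would normally use a dedicated telemetry library rather than in-memory structures.

```python
import json
import time
import uuid
from collections import Counter

metrics = Counter()  # metrics: numeric values aggregated over time
spans = []           # traces: timed steps that share one trace_id


def log(event: str, trace_id: str, **fields) -> str:
    """Logs: one discrete, structured event per line."""
    line = json.dumps({"event": event, "trace_id": trace_id, **fields})
    print(line)
    return line


def handle_request(path: str) -> str:
    trace_id = uuid.uuid4().hex  # correlates logs and spans for this request
    start = time.perf_counter()
    log("request_started", trace_id, path=path)
    # ... the real work of the request would happen here ...
    duration_ms = (time.perf_counter() - start) * 1000
    metrics["requests_total"] += 1
    spans.append({"trace_id": trace_id, "name": "handle_request",
                  "duration_ms": duration_ms})
    log("request_finished", trace_id, path=path)
    return trace_id


trace_id = handle_request("/orders")
```

The shared `trace_id` is what lets a log line be tied back to the span, and therefore to the request, that produced it.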

SLOs and error budgets

SLI (Service Level Indicator) — measurable metric (p99 latency, availability)
SLO (Service Level Objective) — internal target (99.9% availability)
SLA (Service Level Agreement) — contractual commitment with consequences
Error budget — allowed margin of failure (for a 99.9% SLO, 0.1% ≈ 43 min of downtime per 30-day month)

If the error budget is exhausted, freeze features and prioritize stability.
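The arithmetic behind the error budget can be sketched as follows. The function names and the 50 minutes of downtime are illustrative assumptions; the 30-day period is what yields the roughly 43-minute figure for a 99.9% SLO.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the period."""
    if not 0 < slo <= 1:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1 - slo) * period_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, period_days: int = 30) -> float:
    """Minutes of budget left; zero or negative means the budget is exhausted."""
    return error_budget_minutes(slo, period_days) - downtime_minutes


budget = error_budget_minutes(0.999)  # about 43.2 minutes in 30 days
freeze_features = budget_remaining(0.999, downtime_minutes=50) <= 0
print(round(budget, 1), freeze_features)  # 43.2 True
```

Here 50 minutes of downtime exceeds the 43.2-minute budget, so the policy above would call for a feature freeze.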

Blameless post-mortems

After every incident:

Timeline — what happened, when, who did what
Root cause — 5 whys analysis
Impact — affected users, duration, data lost
Action items — concrete improvements with owners and deadlines
Lessons learned — what worked well, what didn't

Cardinal rule: blame the system, not the people.

Chaos engineering

Deliberately inject failures to discover weaknesses:

Kill random instances (Chaos Monkey)
Inject network latency
Fill disks
Simulate dependency failures
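At the application level, two of these experiments (network latency and dependency failure) can be approximated with a small wrapper. This is a sketch under stated assumptions: `with_chaos`, `ChaosError`, and `fetch_price` are hypothetical names, and this is not how Chaos Monkey itself works, since that tool terminates whole instances.

```python
import random
import time


class ChaosError(RuntimeError):
    """Raised when the wrapper injects a simulated dependency failure."""


def with_chaos(func, failure_rate=0.1, max_delay_s=0.0, rng=None):
    """Wrap func so that calls are sometimes delayed or made to fail."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if max_delay_s > 0:
            time.sleep(rng.uniform(0, max_delay_s))  # injected latency
        if rng.random() < failure_rate:
            raise ChaosError("injected failure")     # simulated outage
        return func(*args, **kwargs)

    return wrapper


def fetch_price(item: str) -> float:
    return 9.99  # stand-in for a call to a real dependency


def price_with_fallback(fetch, item: str, default: float = 0.0) -> float:
    """The behaviour under test: degrade gracefully when the dependency fails."""
    try:
        return fetch(item)
    except ChaosError:
        return default


flaky_fetch = with_chaos(fetch_price, failure_rate=1.0)
print(price_with_fallback(flaky_fetch, "book"))  # 0.0
```

The point of the experiment is the caller: if `price_with_fallback` crashed instead of returning the default, the weakness would show up here rather than in production.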

DevOps vs SRE

Aspect	DevOps	SRE
Origin	Community (2009)	Google (2003)
Focus	Culture + practices	Reliability engineering
Definition	Movement	Role/discipline
Relationship	Philosophy	An engineering implementation of DevOps principles

As Ben Treynor (SRE creator at Google) said: "SRE is what happens when you ask a software engineer to design an operations team."

Evolution: Platform Engineering

The natural evolution of DevOps in large organizations:

DevOps — "you build it, you run it" (each team operates its software)
Platform Engineering — one team builds the internal platform that other teams consume

The platform abstracts complexity: the developer runs git push and the platform handles build, test, deploy, and monitoring.
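Conceptually, what the platform runs after a push is an ordered set of stages that stops at the first failure. The sketch below models only that control flow; the stage names come from the sentence above, while `run_pipeline` and the lambda stages are hypothetical stand-ins for real build and deploy tooling.

```python
from typing import Callable

Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that passed.
    """
    passed = []
    for name, step in stages:
        if not step():
            print(f"{name} failed, stopping")
            break
        passed.append(name)
    return passed


# Each lambda stands in for a real step provided by the platform.
stages = [
    ("build", lambda: True),
    ("test", lambda: True),
    ("deploy", lambda: True),
    ("monitor", lambda: True),
]
print(run_pipeline(stages))
```

Because a failing stage halts the run, a broken test never reaches the deploy step.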

Anti-patterns

DevOps team — creating a team called "DevOps" that becomes the new silo
Automation without culture — tools without cultural change solve nothing
Heroism — depending on one person who "knows everything" instead of documenting
Vanity metrics — measuring deploys/day without measuring quality or impact
Tool obsession — switching tools every 6 months without solving root problems

Why it matters

DevOps is not a role or a tool; it is a cultural shift that removes the barrier between those who write code and those who operate it. Organizations that adopt it effectively deliver software faster, with fewer failures, and recover from incidents more quickly. Those that treat it as a job title miss the point.

References

The Phoenix Project — Gene Kim, Kevin Behr & George Spafford, 2013. The novel that popularized DevOps.
The DevOps Handbook — Gene Kim et al., 2021. Practical implementation guide (second edition).
Accelerate — Nicole Forsgren, Jez Humble & Gene Kim, 2018. Scientific research on DORA metrics.
Google SRE Books — Google, 2016-2024. Three free books on Site Reliability Engineering.
State of DevOps Report — DORA/Google Cloud, 2024. Annual research on practices and performance.
The Twelve-Factor App — Adam Wiggins, 2011. Methodology for building cloud-native applications.
