Concepts

DevOps

A culture and set of practices that unify development (Dev) and operations (Ops) to deliver software with greater speed, quality, and reliability. It is not a role but a way of working.

Tags: evergreen, devops, culture, automation, sre

DevOps is a cultural and technical movement that eliminates silos between development and operations. It was born from frustration with the traditional model where Dev "throws code over the wall" and Ops "keeps it alive" — with no shared responsibility.

What problem it solves

In the traditional model:

  • Dev wants fast changes, Ops wants stability — permanent conflict
  • Manual deploys every few weeks or months — risk accumulation
  • "Works on my machine" — production problems
  • Blame culture — nobody wants to deploy on Fridays

DevOps aligns incentives: the team that builds the software is responsible for operating it.

The three ways

Fundamental principles from The Phoenix Project:

1. Flow (systems thinking, left to right)

Optimize the flow of work from development to production:

  • Make work visible (Kanban boards)
  • Limit work in progress (WIP)
  • Reduce batch sizes
  • Eliminate handoffs and queues
  • Automate everything repetitive
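Why limiting WIP shortens delivery time can be sketched with Little's Law, a queueing result that the text above does not name and that is used here only as an illustration: average lead time equals average work in progress divided by average throughput. The function name and the numbers are hypothetical.

```python
def average_lead_time_days(wip_items: float, throughput_per_day: float) -> float:
    """Little's Law: average lead time = average WIP / average throughput."""
    if throughput_per_day <= 0:
        raise ValueError("throughput_per_day must be positive")
    return wip_items / throughput_per_day


# Hypothetical team that finishes 2 items per day on average.
for wip in (20, 10, 4):
    days = average_lead_time_days(wip, 2)
    print(f"WIP={wip:>2} -> average lead time {days:.1f} days")
```

With throughput held constant, cutting WIP from 20 to 4 items cuts the average wait from 10 days to 2, which is the effect the practices above aim for.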

2. Feedback (right to left)

Create fast feedback loops:

  • Monitoring and alerts in production
  • Automated tests in CI
  • Code review in PRs
  • Blameless post-mortems
  • User telemetry

3. Continual learning

Culture of experimentation and improvement:

  • Blameless post-mortems
  • Chaos engineering
  • Game days (incident simulations)
  • 20% time for technical improvements
  • Knowledge sharing (tech talks, documentation)

CALMS framework

Model for evaluating DevOps adoption:

Pillar      | Meaning                  | Example
Culture     | Collaboration over silos | Cross-functional teams
Automation  | Eliminate manual work    | CI/CD, IaC, auto-scaling
Lean        | Eliminate waste          | Limit WIP, reduce batch size
Measurement | Measure everything       | DORA metrics, SLOs, error budgets
Sharing     | Share knowledge          | Post-mortems, runbooks, tech talks
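To make the Measurement pillar concrete, here is a minimal sketch that derives the four DORA metrics (deployment frequency, lead time for changes, change failure rate, time to restore) from a list of deploy records. The Deploy record, its field names, and the sample data are assumptions made for illustration; a real team would pull this data from its CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deploys: list[Deploy], window_days: int) -> dict[str, float]:
    """Compute the four DORA metrics over a window of deploy records."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deploy and a positive window")
    one_hour = timedelta(hours=1)
    failures = [d for d in deploys if d.failed]
    lead_times = [(d.deployed_at - d.committed_at) / one_hour for d in deploys]
    restores = [(d.restored_at - d.deployed_at) / one_hour
                for d in failures if d.restored_at is not None]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore_hours": median(restores) if restores else 0.0,
    }


# Hypothetical sample: four deploys in a two-day window, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deploy(t0, t0 + timedelta(hours=2)),
    Deploy(t0, t0 + timedelta(hours=4)),
    Deploy(t0, t0 + timedelta(hours=6), failed=True,
           restored_at=t0 + timedelta(hours=7)),
    Deploy(t0, t0 + timedelta(hours=8)),
]
print(dora_metrics(sample, window_days=2))
```

Reporting all four values together guards against optimizing deploy count alone, since a rising change failure rate shows up in the same result.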

Essential practices

Infrastructure as Code (IaC)

Define infrastructure in versioned files:

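# Terraform
# Assumption: the IAM execution role ARN is passed in through this
# illustrative variable; aws_lambda_function requires a role.
variable "lambda_role_arn" {
  type = string
}

resource "aws_lambda_function" "api" {
  function_name = "api-handler"
  role          = var.lambda_role_arn # IAM role the function runs as
  runtime       = "nodejs20.x"
  handler       = "index.handler"     # file "index", exported function "handler"
  filename      = "lambda.zip"        # local deployment package
}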

Benefits: reproducibility, auditing, rollback, review in PRs.

Monitoring and observability

The three pillars:

  • Logs — discrete events (what happened)
  • Metrics — numerical values over time (how much)
  • Traces — flow of a request through services (where)
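The three signals can be sketched for a single request using only the Python standard library. This is an illustration of how they differ, not a substitute for a real telemetry stack; the in-memory stores and the names handle_request, span, and log are hypothetical.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

metrics = Counter()  # metrics: numeric values aggregated over time
spans = []           # traces: timed steps that share one trace_id
logs = []            # logs: discrete structured events


def log(event: str, trace_id: str, **fields) -> None:
    """Record one discrete event as a JSON line."""
    line = json.dumps({"event": event, "trace_id": trace_id, **fields})
    logs.append(line)
    print(line)


@contextmanager
def span(name: str, trace_id: str):
    """Time one step of a request and attach it to the trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        spans.append({"trace_id": trace_id, "name": name,
                      "duration_ms": duration_ms})


def handle_request(user_id: str) -> str:
    """Handle one request, emitting a metric, a log line, and two spans."""
    trace_id = uuid.uuid4().hex
    metrics["requests_total"] += 1
    with span("handle_request", trace_id):
        with span("load_user", trace_id):
            log("user_loaded", trace_id, user_id=user_id)
    return trace_id


trace_id = handle_request("u-123")
print(metrics["requests_total"], [s["name"] for s in spans])
```

The shared trace_id is what ties the log line and the spans back to one request, so an engineer can move from "how much" to "what happened" to "where".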

SLOs and error budgets

  • SLI (Service Level Indicator) — measurable metric (p99 latency, availability)
  • SLO (Service Level Objective) — internal target (99.9% availability)
  • SLA (Service Level Agreement) — contractual commitment with consequences
  • Error budget — allowed margin of failure (0.1% = 43 min/month of downtime)

If the error budget is exhausted, freeze features and prioritize stability.
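The arithmetic behind the error budget can be checked with a short sketch. It assumes a 30-day period and an availability SLO expressed as a fraction; the function names and the freeze rule's threshold (budget fully spent) are illustrative choices, not a standard API.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a period."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1 - slo) * period_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float,
                     period_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once exhausted)."""
    budget = error_budget_minutes(slo, period_days)
    return (budget - downtime_minutes) / budget


def should_freeze_features(slo: float, downtime_minutes: float,
                           period_days: int = 30) -> bool:
    """Apply the rule above: freeze features once the budget is used up."""
    return budget_remaining(slo, downtime_minutes, period_days) <= 0


print(f"{error_budget_minutes(0.999):.1f} min")       # 99.9% over 30 days
print(should_freeze_features(0.999, downtime_minutes=20))
print(should_freeze_features(0.999, downtime_minutes=50))
```

For a 99.9% SLO the budget is 0.001 × 30 × 24 × 60 = 43.2 minutes, which matches the figure in the list above; 20 minutes of downtime leaves budget to spend, while 50 minutes triggers the freeze.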

Blameless post-mortems

After every incident:

  1. Timeline — what happened, when, who did what
  2. Root cause — 5 whys analysis
  3. Impact — affected users, duration, data lost
  4. Action items — concrete improvements with owners and deadlines
  5. Lessons learned — what worked well, what didn't

Cardinal rule: blame the system, not the people.

Chaos engineering

Deliberately inject failures to discover weaknesses:

  • Kill random instances (Chaos Monkey)
  • Inject network latency
  • Fill disks
  • Simulate dependency failures
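At application level, latency injection and dependency failure can be sketched as a wrapper around a dependency call. This is a toy illustration and not how Chaos Monkey or any specific tool is implemented; with_chaos, InjectedFault, fetch_price, and price_with_fallback are hypothetical names.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def with_chaos(func, *, failure_rate=0.0, max_latency_s=0.0, rng=None):
    """Wrap func so that calls are randomly delayed or failed."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if max_latency_s > 0:
            time.sleep(rng.uniform(0, max_latency_s))  # inject latency
        if rng.random() < failure_rate:                # inject a failure
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(item: str) -> float:
    """Hypothetical dependency call."""
    return 9.99


def price_with_fallback(fetch, item: str, default: float = 0.0) -> float:
    """Behavior under test: degrade gracefully when the dependency fails."""
    try:
        return fetch(item)
    except InjectedFault:
        return default


# Experiment: fail roughly half of the calls and check that callers still get a value.
flaky = with_chaos(fetch_price, failure_rate=0.5, rng=random.Random(42))
results = [price_with_fallback(flaky, "book", default=-1.0) for _ in range(100)]
print(f"{results.count(-1.0)} of {len(results)} calls used the fallback")
```

Passing a seeded random generator makes an experiment repeatable, which helps when a discovered weakness needs to be reproduced after the fix.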

DevOps vs SRE

Aspect       | DevOps               | SRE
Origin       | Community (2009)     | Google (2003)
Focus        | Culture + practices  | Reliability engineering
Definition   | Movement             | Role/discipline
Relationship | Philosophy           | DevOps implementation with engineering

As Ben Treynor, who founded SRE at Google, said: "SRE is what happens when you ask a software engineer to design an operations team."

Evolution: Platform Engineering

The natural evolution of DevOps in large organizations:

  • DevOps — "you build it, you run it" (each team operates its software)
  • Platform Engineering — one team builds the internal platform that other teams consume

The platform abstracts complexity: the developer runs git push and the platform handles build, test, deploy, and monitoring.
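What the platform does behind a push can be sketched as an ordered list of stages that stops at the first failure. The stage names follow the sentence above; run_pipeline and the stand-in lambdas are hypothetical, and a real platform would run each stage in its own isolated job.

```python
from typing import Callable

# A stage is a name plus a step that reports success or failure.
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order, stopping at the first failure; return a status log."""
    status = []
    for name, step in stages:
        ok = step()
        status.append(f"{name}: {'ok' if ok else 'failed'}")
        if not ok:
            break  # later stages never see a broken build
    return status


# Stand-in steps; each would invoke real tooling in practice.
stages = [
    ("build", lambda: True),
    ("test", lambda: True),
    ("deploy", lambda: True),
    ("monitor", lambda: True),
]
for line in run_pipeline(stages):
    print(line)
```

The developer only triggers the pipeline; the ordering, the gating between stages, and the tooling behind each step are the platform team's responsibility.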

Anti-patterns

  • DevOps team — creating a team called "DevOps" that becomes the new silo
  • Automation without culture — tools without cultural change solve nothing
  • Heroism — depending on one person who "knows everything" instead of documenting
  • Vanity metrics — measuring deploys/day without measuring quality or impact
  • Tool obsession — switching tools every 6 months without solving root problems

Why it matters

DevOps is not a role or a tool; it is a cultural shift that removes the barrier between those who write code and those who operate it. Organizations that adopt it effectively deliver software faster, with fewer failures, and recover from incidents more quickly. Those that treat it as a job title miss the point.

References

  • The Phoenix Project — Gene Kim, Kevin Behr & George Spafford, 2013. The novel that popularized DevOps.
  • The DevOps Handbook — Gene Kim et al., 2021. Practical implementation guide (second edition).
  • Accelerate — Nicole Forsgren, Jez Humble & Gene Kim, 2018. Scientific research on DORA metrics.
  • Google SRE Books — Google, 2016-2024. Three free books on Site Reliability Engineering.
  • State of DevOps Report — DORA/Google Cloud, 2024. Annual research on practices and performance.
  • The Twelve-Factor App — Adam Wiggins, 2011. Methodology for building cloud-native applications.