DevOps Practices

DevOps practices are the concrete implementations of the DevOps philosophy. While DevOps is the "what" and "why," these practices are the "how."

Infrastructure as Code (IaC)

Define and manage infrastructure through versioned configuration files.

Main tools

Tool	Focus	Language
Terraform	Multi-cloud, declarative	HCL
Pulumi	Multi-cloud, general-purpose languages (declarative engine)	TypeScript, Python, Go
AWS CDK	AWS, general-purpose languages (synthesizes CloudFormation)	TypeScript, Python, Java
CloudFormation	AWS, declarative	YAML/JSON
Ansible	Configuration, agentless	YAML

Terraform example

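# AWS provider v4 and later: versioning and encryption are separate
# resources; the inline blocks on aws_s3_bucket are deprecated.
resource "aws_s3_bucket" "data" {
  bucket = "my-data-bucket"
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}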

IaC principles

Idempotency — applying multiple times produces the same result
Versioned — all infra in Git with full history
Review — infra changes go through PR like code
Modules — reuse common patterns
State management — shared remote state (S3, Terraform Cloud)
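
To make the idempotency principle concrete, here is a minimal Python sketch. It is not how Terraform works internally: the `apply` function and the in-memory `actual` dictionary are hypothetical stand-ins for a real provider and real cloud state. The point is only that a second run against unchanged input performs no changes.

```python
# Hypothetical in-memory "infrastructure": resource name -> settings.
def apply(desired: dict[str, dict], actual: dict[str, dict]) -> list[str]:
    """Make `actual` match `desired` and return the changes performed."""
    changes = []
    for name, settings in desired.items():
        if name not in actual:
            actual[name] = dict(settings)
            changes.append(f"create {name}")
        elif actual[name] != settings:
            actual[name] = dict(settings)
            changes.append(f"update {name}")
    return changes


desired = {"my-data-bucket": {"versioning": True, "sse": "AES256"}}
actual: dict[str, dict] = {}

first = apply(desired, actual)   # creates the bucket
second = apply(desired, actual)  # state already matches, so no changes
```

Because the function compares before it acts, running it once or ten times leaves `actual` in the same state.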

Configuration Management

Keep servers in a desired state automatically.

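# Ansible playbook
- hosts: webservers
  become: true  # package installs and changes under /etc need root
  tasks:
    - name: Install nginx
      ansible.builtin.apt:
        name: nginx
        state: present
        update_cache: true

    - name: Copy config
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: restart nginx

  # Handlers run only when a task that notifies them reports a change
  handlers:
    - name: restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted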

Containerization

Package applications with all their dependencies.

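# Multi-stage build
# Stage 1: install all dependencies and compile the app
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: runtime image with production dependencies only
FROM node:20-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist
# Run as the unprivileged "node" user that ships with the official image
USER node
EXPOSE 3000
CMD ["node", "dist/index.js"]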

Docker best practices

Minimal base images (Alpine, distroless)
Multi-stage builds to reduce size
One process per container
Don't run as root
.dockerignore to exclude unnecessary files
Pin base image versions

GitOps

Use Git as the source of truth for infrastructure and deployments.

Principles

Declarative — desired state is in Git
Versioned — Git is the change history
Automatic — agents reconcile actual state with desired
Auditable — every change has author, timestamp, and reason
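
The "Automatic" principle comes down to a reconcile loop. The sketch below is a simplified illustration in Python, not the algorithm used by ArgoCD or Flux: the `reconcile` function, the resource names, and the string specs are all hypothetical. It compares the state declared in Git with the live state, converges the live state, and optionally prunes anything Git no longer declares.

```python
def reconcile(desired: dict[str, str], live: dict[str, str], prune: bool = True) -> dict[str, list[str]]:
    """Converge `live` toward `desired` and report what was changed."""
    plan: dict[str, list[str]] = {"create": [], "update": [], "delete": []}
    for name, spec in desired.items():
        if name not in live:
            plan["create"].append(name)
        elif live[name] != spec:
            plan["update"].append(name)  # drift or a new commit
        live[name] = spec
    if prune:
        # Resources that exist in the cluster but not in Git are removed.
        for name in sorted(set(live) - set(desired)):
            plan["delete"].append(name)
            del live[name]
    return plan


desired = {"deployment/my-app": "image: my-app:1.4", "service/my-app": "port: 80"}
live = {"deployment/my-app": "image: my-app:1.3", "configmap/old": "debug: true"}
plan = reconcile(desired, live)
```

A real controller runs this comparison continuously, so a manual change in the cluster shows up as drift on the next pass and is reverted.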

Tools

ArgoCD — Kubernetes GitOps controller
Flux — Kubernetes GitOps toolkit
Atlantis — Terraform pull request automation

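# ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd  # namespace where Argo CD itself runs in a default install
spec:
  project: default  # required field; "default" is the built-in project
  source:
    repoURL: https://github.com/org/repo
    path: k8s/
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # delete resources that were removed from Git
      selfHeal: true  # revert manual changes made in the cluster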

Feature Flags

Separate deployment from release: the code ships to production, but a flag controls whether users actually see the functionality.

// Example with LaunchDarkly / Unleash / custom
// featureFlags stands in for whichever flag client you use.
function Checkout({ userId }) {
  if (featureFlags.isEnabled('new-checkout', { userId })) {
    return <NewCheckout />;
  }
  return <LegacyCheckout />;
}

Use cases

Canary releases — enable for % of users
Beta testing — enable for specific users
Kill switch — disable problematic feature without deploy
A/B testing — compare variants
Trunk-based development — merge incomplete code
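
A percentage rollout such as the canary case above is commonly implemented by hashing the user into a stable bucket. The following Python sketch shows the idea; it is not the algorithm of LaunchDarkly or Unleash, and the `is_enabled` function and its parameters are illustrative assumptions.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if the user falls inside the rollout percentage for the flag."""
    # Hashing flag and user together gives each flag an independent split.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent


users = [f"user_{i}" for i in range(10_000)]
enabled = sum(is_enabled("new-checkout", u, 10) for u in users)  # roughly 10% of users
```

Since the bucket depends only on the flag name and user ID, a user keeps the same experience across requests, and raising the percentage only adds users without removing any.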

Observability

The three pillars for understanding production systems:

Logs

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "service": "api",
  "trace_id": "abc123",
  "message": "Payment failed",
  "user_id": "user_456",
  "error_code": "INSUFFICIENT_FUNDS"
}

Structured logging — parseable JSON, not free text.
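
One way to produce entries shaped like the one above is a custom formatter on Python's standard `logging` module. This is a minimal sketch: the `JsonFormatter` class and the choice of which extra fields to copy are assumptions for illustration, and production services often use a dedicated library instead.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("trace_id", "user_id", "error_code"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment failed",
    extra={"trace_id": "abc123", "user_id": "user_456", "error_code": "INSUFFICIENT_FUNDS"},
)
```

Carrying `trace_id` in every entry is what lets a log line be joined to the matching trace later.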

Metrics

# Prometheus format
http_requests_total{method="GET", status="200"} 1234
http_request_duration_seconds{quantile="0.99"} 0.25

Types: counters, gauges, histograms, summaries.
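
To illustrate the simplest of these types, the sketch below implements a labelled counter and renders it in the text format shown above. The `RequestCounter` class is a hypothetical teaching aid; real services would normally use an official Prometheus client library rather than formatting lines by hand.

```python
from collections import Counter


class RequestCounter:
    """A counter that only goes up, keyed by (method, status) labels."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._values: Counter = Counter()

    def inc(self, method: str, status: str) -> None:
        self._values[(method, status)] += 1

    def render(self) -> str:
        """Return the samples in Prometheus text exposition format."""
        lines = [f"# TYPE {self.name} counter"]
        for (method, status), value in sorted(self._values.items()):
            lines.append(f'{self.name}{{method="{method}",status="{status}"}} {value}')
        return "\n".join(lines)


requests_total = RequestCounter("http_requests_total")
requests_total.inc("GET", "200")
requests_total.inc("GET", "200")
requests_total.inc("POST", "500")
text = requests_total.render()
```

A gauge differs only in that its value can also decrease; histograms and summaries additionally track the distribution of observed values.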

Traces

Follow a request through multiple services:

[API Gateway] → [Auth Service] → [User Service] → [Database]
     2ms            5ms              3ms            10ms

Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM.
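
The value of a trace is that spans sharing a trace ID can be analysed together. The Python sketch below reuses the numbers from the diagram above and assumes each duration is the time spent in that service alone, which the diagram does not state. The `Span` class and `summarize` function are hypothetical and far simpler than what Jaeger or Zipkin record.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    service: str
    duration_ms: float  # assumed to be time spent in this service only


def summarize(spans: list[Span], trace_id: str) -> tuple[float, str]:
    """Return the total latency of a trace and its slowest service."""
    own = [s for s in spans if s.trace_id == trace_id]
    total = sum(s.duration_ms for s in own)
    slowest = max(own, key=lambda s: s.duration_ms)
    return total, slowest.service


spans = [
    Span("abc123", "API Gateway", 2),
    Span("abc123", "Auth Service", 5),
    Span("abc123", "User Service", 3),
    Span("abc123", "Database", 10),
    Span("other", "API Gateway", 4),  # belongs to a different request
]
total_ms, bottleneck = summarize(spans, "abc123")
```

Here the database accounts for half of the request's latency, which is the kind of finding that logs and metrics alone rarely expose.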

Incident Management

On-call

Defined rotations (PagerDuty, Opsgenie)
Runbooks for common incidents
Clear escalation paths
On-call compensation

Incident response

Detect — alerts, users, monitoring
Triage — severity, impact, who responds
Mitigate — restore service (rollback, scale, failover)
Resolve — permanent fix
Learn — post-mortem

Severities

Sev	Impact	Response time	Example
1	Service down	Immediate	Site won't load
2	Major degradation	< 30 min	Payments failing
3	Minor degradation	< 4 hours	Secondary feature broken
4	Low impact	Next business day	Cosmetic bug
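
A severity table like this is easier to enforce when it is encoded in tooling. The following Python sketch turns the response times above into a deadline. It assumes that "next business day" means the same clock time on the next weekday and ignores public holidays; the `response_deadline` function is illustrative, not part of PagerDuty or Opsgenie.

```python
from datetime import datetime, timedelta

# Response targets from the severity table above; Sev 1 means "immediate".
RESPONSE_TARGETS = {
    1: timedelta(0),
    2: timedelta(minutes=30),
    3: timedelta(hours=4),
}


def response_deadline(severity: int, detected_at: datetime) -> datetime:
    """Return the latest time by which someone must respond."""
    if severity in RESPONSE_TARGETS:
        return detected_at + RESPONSE_TARGETS[severity]
    if severity == 4:
        # Assumption: same clock time on the next weekday, holidays ignored.
        deadline = detected_at + timedelta(days=1)
        while deadline.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            deadline += timedelta(days=1)
        return deadline
    raise ValueError(f"unknown severity: {severity}")


friday = datetime(2024, 1, 19, 16, 0)
sev2_deadline = response_deadline(2, friday)  # 30 minutes later
sev4_deadline = response_deadline(4, friday)  # the following Monday
```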

Chaos Engineering

Inject controlled failures to discover weaknesses.

Principles

Define "steady state" (normal metrics)
Hypothesis: the system tolerates X failure
Introduce real-world variables (latency, failures, partitions)
Try to disprove the hypothesis
Minimize blast radius
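
The principles above map onto a small experiment structure. The sketch below is a toy simulation in Python, not a real fault injector: the `Cluster` model, the `run_experiment` function, and the 0.99 threshold are assumptions chosen to show the flow of verifying steady state, injecting a failure, and checking the hypothesis.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    replicas: int
    min_healthy: int  # replicas needed to keep serving traffic

    def success_rate(self, failed: int) -> float:
        """Toy model: requests succeed only while enough replicas are healthy."""
        healthy = self.replicas - failed
        return 1.0 if healthy >= self.min_healthy else 0.0


def run_experiment(cluster: Cluster, failures: int, threshold: float = 0.99) -> dict:
    # 1. Confirm steady state before injecting anything.
    steady = cluster.success_rate(failed=0)
    if steady < threshold:
        raise RuntimeError("system is not in steady state; abort the experiment")
    # 2. Inject the failure and measure the same metric again.
    observed = cluster.success_rate(failed=failures)
    # 3. The hypothesis holds only if the metric stays within the threshold.
    return {"steady_state": steady, "observed": observed, "hypothesis_holds": observed >= threshold}


cluster = Cluster(replicas=3, min_healthy=2)
one_down = run_experiment(cluster, failures=1)
two_down = run_experiment(cluster, failures=2)
```

Starting with a single failed replica before trying two is one way to keep the blast radius small.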

Tools

Chaos Monkey — terminates random instances
Gremlin — chaos engineering platform
Litmus — chaos engineering for Kubernetes
AWS Fault Injection Simulator — native chaos on AWS

Security Practices (DevSecOps)

Integrate security throughout the pipeline:

Shift left

SAST — static code analysis (SonarQube, Semgrep)
SCA — dependency analysis (Snyk, Dependabot)
Secret scanning — detect credentials in code
Container scanning — image vulnerabilities (Trivy)
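
Secret scanning is, at its core, pattern matching over source text. The Python sketch below checks for two well-known shapes: AWS access key IDs and PEM private key headers. The `scan` function and the `PATTERNS` table are illustrative; dedicated scanners cover many more credential types and add entropy checks to cut false positives.

```python
import re

# Two well-known credential shapes; real scanners ship hundreds of rules.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def scan(text: str) -> list[tuple[int, str]]:
    """Return (line number, rule name) for every suspected secret."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
sample = 'region = "us-east-1"\naws_key = "AKIAIOSFODNN7EXAMPLE"\n'
findings = scan(sample)
```

Running a check like this in a pre-commit hook or CI job catches a leaked key before it reaches the shared repository.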

Runtime

DAST — dynamic application testing
WAF — web application firewall
Runtime protection — detect anomalous behavior

Why it matters

These practices are not optional for teams operating software in production. Each one reduces a specific type of risk: IaC replaces manual, unrepeatable configuration with versioned and reviewed changes; feature flags decouple deploy from release; observability gives teams the data to turn incidents into learning. Adopting them incrementally is more effective than trying to implement everything at once.

