DevOps Practices
A set of technical and cultural practices that implement DevOps principles, from Infrastructure as Code to blameless post-mortems.
DevOps practices are the concrete implementations of the DevOps philosophy. While DevOps is the "what" and "why," these practices are the "how."
Infrastructure as Code (IaC)
Define and manage infrastructure through versioned configuration files.
Main tools
| Tool | Focus | Language |
|---|---|---|
| Terraform | Multi-cloud, declarative | HCL |
| Pulumi | Multi-cloud, imperative | TypeScript, Python, Go |
| AWS CDK | AWS, imperative | TypeScript, Python, Java |
| CloudFormation | AWS, declarative | YAML/JSON |
| Ansible | Configuration, agentless | YAML |
Terraform example
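```hcl
# Inline versioning/encryption blocks are deprecated since AWS provider v4;
# each setting is now its own resource.
resource "aws_s3_bucket" "data" {
  bucket = "my-data-bucket"
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
```
IaC principles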
- Idempotency — applying multiple times produces the same result
- Versioned — all infra in Git with full history
- Review — infra changes go through PR like code
- Modules — reuse common patterns
- State management — shared remote state (S3, Terraform Cloud)
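Idempotency is the principle the others lean on, so it is worth seeing in miniature. The sketch below assumes a hypothetical in-memory `cloud` dictionary standing in for a real provider API; `ensure_bucket` is an illustrative name, not part of any tool. It compares desired with actual state and changes only what differs, so a second run does nothing.

```python
def ensure_bucket(cloud: dict, name: str, versioning: bool) -> list[str]:
    """Converge bucket `name` to the desired settings; return the changes made.

    `cloud` is a hypothetical in-memory stand-in for a provider API.
    """
    changes = []
    bucket = cloud.get(name)
    if bucket is None:
        bucket = {"versioning": False}
        cloud[name] = bucket
        changes.append(f"create {name}")
    if bucket["versioning"] != versioning:
        bucket["versioning"] = versioning
        changes.append(f"set versioning={versioning} on {name}")
    return changes


cloud = {}
first_run = ensure_bucket(cloud, "my-data-bucket", versioning=True)
second_run = ensure_bucket(cloud, "my-data-bucket", versioning=True)
print(first_run)   # two changes: create the bucket, then enable versioning
print(second_run)  # [] because the actual state already matches
```

Returning the list of changes doubles as a plan: an empty list means the infrastructure already matches its definition.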
Configuration Management
Keep servers in a desired state automatically.
```yaml
# Ansible playbook
- hosts: webservers
  become: true  # installing packages and restarting services needs root
  tasks:
    - name: Install nginx
      apt:
        name: nginx
        state: present

    - name: Copy config
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: restart nginx

  handlers:
    - name: restart nginx
      service:
        name: nginx
        state: restarted
```
Containerization
Package applications with all their dependencies.
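```dockerfile
# Multi-stage build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies so only runtime packages reach the final image
RUN npm prune --omit=dev

FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
# The official node images ship an unprivileged "node" user
USER node
EXPOSE 3000
CMD ["node", "dist/index.js"]
```
Docker best practices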
- Minimal base images (Alpine, distroless)
- Multi-stage builds to reduce size
- One process per container
- Don't run as root
- `.dockerignore` to exclude unnecessary files
- Pin base image versions
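Two of these practices are easy to check mechanically. The following is a rough sketch of such a check, not a replacement for a real linter: `lint_dockerfile` is a hypothetical helper that flags base images without a tag or digest and a final stage that never switches away from root. It treats any explicit tag other than `latest` as pinned, although only a digest is truly immutable.

```python
import re

FROM_RE = re.compile(r"^FROM\s+(?:--platform=\S+\s+)?(\S+)", re.IGNORECASE)


def lint_dockerfile(text: str) -> list[str]:
    """Return warnings for unpinned base images and a root final stage."""
    warnings = []
    stages = set()    # names declared with "AS <name>"
    last_user = None  # USER seen in the current stage, if any
    for line in text.splitlines():
        line = line.strip()
        match = FROM_RE.match(line)
        if match:
            image = match.group(1)
            last_user = None  # every stage starts with the image's default user
            # Skip references to earlier build stages and the empty "scratch" image.
            if image not in stages and image != "scratch":
                name = image.rsplit("/", 1)[-1]  # ignore registry host and port
                if "@" not in image and (":" not in name or name.endswith(":latest")):
                    warnings.append(f"unpinned base image: {image}")
            alias = re.search(r"\sAS\s+(\S+)", line, re.IGNORECASE)
            if alias:
                stages.add(alias.group(1))
        elif line.upper().startswith("USER "):
            last_user = line.split(None, 1)[1]
    if last_user in (None, "root", "0"):
        warnings.append("final stage runs as root")
    return warnings


risky = "FROM node\nFROM ubuntu:latest\nCMD [\"node\"]"
for warning in lint_dockerfile(risky):
    print(warning)
```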
GitOps
Use Git as the source of truth for infrastructure and deployments.
Principles
- Declarative — desired state is in Git
- Versioned — Git is the change history
- Automatic — agents reconcile actual state with desired
- Auditable — every change has author, timestamp, and reason
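The "automatic" principle boils down to a reconciliation loop. As a minimal sketch, assume both states are plain dictionaries keyed by resource name (the data and the function names `plan` and `reconcile` are illustrative, not any controller's API): the agent diffs desired against actual, then creates, updates, or prunes until they match.

```python
def plan(desired: dict, actual: dict, prune: bool = True) -> list[tuple[str, str]]:
    """Compute the actions that move `actual` to `desired`."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    if prune:
        # Anything live that Git no longer declares gets removed.
        for name in sorted(actual.keys() - desired.keys()):
            actions.append(("delete", name))
    return actions


def reconcile(desired: dict, actual: dict, prune: bool = True) -> list[tuple[str, str]]:
    """Apply the plan to `actual` in place and return what was done."""
    actions = plan(desired, actual, prune)
    for action, name in actions:
        if action == "delete":
            del actual[name]
        else:
            actual[name] = dict(desired[name])
    return actions


# Desired state as it would be read from the Git repository (hypothetical data).
desired = {"web": {"image": "web:1.4", "replicas": 3}}
# Live state after a manual scale-up and a stray debugging resource.
actual = {"web": {"image": "web:1.4", "replicas": 5}, "debug-pod": {"image": "busybox:1.36"}}

print(reconcile(desired, actual))  # [('update', 'web'), ('delete', 'debug-pod')]
print(reconcile(desired, actual))  # [] once the states match
```

Reverting the manual scale-up corresponds to self-healing, and deleting the stray resource corresponds to pruning.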
Tools
- ArgoCD — Kubernetes GitOps controller
- Flux — Kubernetes GitOps toolkit
- Atlantis — Terraform pull request automation
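```yaml
# ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd  # assumes Argo CD runs in its default namespace
spec:
  project: default   # required field; "default" is the built-in project
  source:
    repoURL: https://github.com/org/repo
    path: k8s/
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # delete resources that were removed from Git
      selfHeal: true  # revert manual changes made in the cluster
```
Feature Flags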
Separate deployment from release: the code ships to production, but a flag controls whether the functionality is active.
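```jsx
// Example with LaunchDarkly / Unleash / custom
// featureFlags is a placeholder client; real SDK method names differ per vendor.
function Checkout({ userId }) {
  if (featureFlags.isEnabled('new-checkout', { userId })) {
    return <NewCheckout />;
  }
  return <LegacyCheckout />;
}
```
Use cases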
- Canary releases — enable for % of users
- Beta testing — enable for specific users
- Kill switch — disable problematic feature without deploy
- A/B testing — compare variants
- Trunk-based development — merge incomplete code
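Several of these use cases share one evaluation path. The sketch below assumes a flag is a plain dictionary with hypothetical `enabled`, `allow`, and `percent` fields. Hashing the flag name together with the user ID gives each user a stable bucket, so a user who sees the feature at 10% still sees it when the rollout grows to 50%.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: dict, user_id: str) -> bool:
    """Evaluate a flag: kill switch first, then allow-list, then percentage."""
    if not flag["enabled"]:       # kill switch
        return False
    if user_id in flag["allow"]:  # beta testers
        return True
    return bucket(flag["name"], user_id) < flag["percent"]  # canary rollout


flag = {"name": "new-checkout", "enabled": True, "allow": {"user_456"}, "percent": 10}
users = [f"user_{i}" for i in range(10_000)]
share = sum(is_enabled(flag, u) for u in users) / len(users)
print(f"{share:.1%} of users see the new checkout")  # close to 10%
```

Including the flag name in the hash keeps rollouts independent, so the same users are not always the first to receive every new feature.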
Observability
The three pillars for understanding production systems:
Logs
```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "service": "api",
  "trace_id": "abc123",
  "message": "Payment failed",
  "user_id": "user_456",
  "error_code": "INSUFFICIENT_FUNDS"
}
```
Structured logging: emit parseable JSON, not free text.
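One way to produce entries of this shape, sketched with Python's standard `logging` module. The `JsonFormatter` class and the `context` key passed through `extra=` are conventions chosen for this example, not part of the standard library; the logger name is used as the service name.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": record.name,
            "message": record.getMessage(),
        }
        # Keys passed via `extra=` become attributes on the record.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment failed",
    extra={"context": {"trace_id": "abc123", "user_id": "user_456",
                       "error_code": "INSUFFICIENT_FUNDS"}},
)
```

Carrying `trace_id` in every entry is what lets a log line be joined to the matching trace later.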
Metrics
```text
# Prometheus format
http_requests_total{method="GET", status="200"} 1234
http_request_duration_seconds{quantile="0.99"} 0.25
```
Types: counters, gauges, histograms, summaries.
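The difference between the types is easiest to see in code. The classes below are a simplified, standard-library-only sketch of three of them, not the API of any metrics client. Summaries are left out because they compute quantiles on the client and need a streaming estimator.

```python
import bisect


class Counter:
    """Monotonically increasing value, e.g. total requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Value that can go up and down, e.g. requests currently in flight."""

    def __init__(self):
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations into buckets defined by their upper bounds."""

    def __init__(self, bounds: list[float]):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound that is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self) -> list[tuple[str, int]]:
        """Return (upper bound, observations <= bound) pairs."""
        labels = [str(b) for b in self.bounds] + ["+Inf"]
        running, out = 0, []
        for label, count in zip(labels, self.counts):
            running += count
            out.append((label, running))
        return out


latency = Histogram([0.1, 0.5, 1.0])
for seconds in [0.05, 0.25, 0.25, 2.0]:
    latency.observe(seconds)
print(latency.cumulative())  # [('0.1', 1), ('0.5', 3), ('1.0', 3), ('+Inf', 4)]
```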
Traces
Follow a request through multiple services:
```text
[API Gateway] → [Auth Service] → [User Service] → [Database]
     2ms             5ms              3ms            10ms
```
Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM.
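Under the hood, a trace is a set of spans that share a trace ID and point at their parent. The `Tracer` class below is an in-process sketch of that data model only; the class and field names are invented for this example. Real tracers also propagate the IDs between services, typically in request headers, and the `time.sleep` calls merely simulate work.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Collects spans that share one trace ID (single-process sketch)."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []   # finished spans, in completion order
        self._stack = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name: str):
        record = {
            "trace_id": self.trace_id,
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": self._stack[-1]["span_id"] if self._stack else None,
            "name": name,
        }
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(record)


tracer = Tracer()
with tracer.span("API Gateway"):
    with tracer.span("Auth Service"):
        time.sleep(0.005)  # simulated work
    with tracer.span("User Service"):
        with tracer.span("Database"):
            time.sleep(0.010)  # simulated query

for s in tracer.spans:
    print(f'{s["name"]:<14} {s["duration_ms"]:6.1f} ms  parent={s["parent_id"]}')
```

Because each span records its parent, the flat list can be rebuilt into the call tree, and a slow request can be attributed to the specific hop that caused it.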
Incident Management
On-call
- Defined rotations (PagerDuty, Opsgenie)
- Runbooks for common incidents
- Clear escalation paths
- On-call compensation
Incident response
- Detect — alerts, users, monitoring
- Triage — severity, impact, who responds
- Mitigate — restore service (rollback, scale, failover)
- Resolve — permanent fix
- Learn — post-mortem
Severities
| Sev | Impact | Response time | Example |
|---|---|---|---|
| 1 | Service down | Immediate | Site won't load |
| 2 | Major degradation | < 30 min | Payments failing |
| 3 | Minor degradation | < 4 hours | Secondary feature broken |
| 4 | Low impact | Next business day | Cosmetic bug |
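Encoding the table as data lets alerting tools compute a response deadline instead of relying on memory. In this sketch, `response_deadline` is a hypothetical helper, and the definition of "next business day" (Monday to Friday, by 17:00) is an assumption the table does not state.

```python
from datetime import datetime, timedelta

# Response targets from the table above; None means "next business day".
RESPONSE_TARGET = {
    1: timedelta(0),
    2: timedelta(minutes=30),
    3: timedelta(hours=4),
    4: None,
}


def response_deadline(severity: int, detected_at: datetime) -> datetime:
    """Return the latest time by which someone should respond."""
    target = RESPONSE_TARGET[severity]
    if target is not None:
        return detected_at + target
    # Assumption: business days are Monday-Friday and the deadline is 17:00.
    day = detected_at + timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day.replace(hour=17, minute=0, second=0, microsecond=0)


detected = datetime(2024, 1, 12, 10, 30)  # a Friday
for sev in sorted(RESPONSE_TARGET):
    print(sev, response_deadline(sev, detected))
```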
Chaos Engineering
Inject controlled failures to discover weaknesses.
Principles
- Define "steady state" (normal metrics)
- Hypothesis: the system tolerates X failure
- Introduce real-world variables (latency, failures, partitions)
- Try to disprove the hypothesis
- Minimize blast radius
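These principles map onto a small experiment loop. What follows is a toy simulation, not a real fault injector: `Cluster` is a made-up service with round-robin routing and one retry, and the 99% success threshold is an arbitrary steady-state definition chosen for the example.

```python
import itertools


class Cluster:
    """Toy service: requests go round-robin to replicas, with one retry."""

    def __init__(self, replicas: int):
        self.healthy = [True] * replicas
        self._next = itertools.cycle(range(replicas))

    def request(self) -> bool:
        # Try the next replica; on failure, retry once on the one after it.
        return any(self.healthy[next(self._next)] for _ in range(2))


def success_rate(cluster: Cluster, requests: int = 1000) -> float:
    return sum(cluster.request() for _ in range(requests)) / requests


def run_experiment(cluster, inject, rollback, threshold: float = 0.99) -> bool:
    """Return True if the steady state holds while the fault is active."""
    if success_rate(cluster) < threshold:
        raise RuntimeError("not in steady state; do not start the experiment")
    inject(cluster)
    try:
        return success_rate(cluster) >= threshold
    finally:
        rollback(cluster)  # always restore, which limits the blast radius


def kill_one(cluster):
    cluster.healthy[0] = False


def kill_two(cluster):
    cluster.healthy[0] = cluster.healthy[1] = False


def restore(cluster):
    cluster.healthy = [True] * len(cluster.healthy)


print(run_experiment(Cluster(3), kill_one, restore))  # True: one lost replica is tolerated
print(run_experiment(Cluster(3), kill_two, restore))  # False: a single retry is not enough
```

The second run is the valuable one: it disproves the hypothesis and exposes the weakness (one retry against adjacent failures) before a real outage does.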
Tools
- Chaos Monkey — terminates random instances
- Gremlin — chaos engineering platform
- Litmus — chaos engineering for Kubernetes
- AWS Fault Injection Simulator — native chaos on AWS
Security Practices (DevSecOps)
Integrate security throughout the pipeline:
Shift left
- SAST — static code analysis (SonarQube, Semgrep)
- SCA — dependency analysis (Snyk, Dependabot)
- Secret scanning — detect credentials in code
- Container scanning — image vulnerabilities (Trivy)
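Secret scanning is the simplest of these checks to illustrate. The sketch below uses three illustrative regular expressions; real scanners ship far larger rule sets and add entropy checks. The sample key is the placeholder value used in AWS documentation, not a live credential.

```python
import re

# Illustrative patterns only, chosen for this example.
PATTERNS = {
    "AWS access key ID": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private key header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hard-coded password": re.compile(
        r"(?i)\b(?:password|passwd|secret)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan(text: str) -> list[tuple[int, str]]:
    """Return (line number, rule name) for every suspicious line."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, rule))
    return findings


sample = (
    'aws_key = "AKIAIOSFODNN7EXAMPLE"\n'
    'password = os.environ["DB_PASSWORD"]\n'
    'password = "correct-horse-battery"\n'
)
print(scan(sample))  # [(1, 'AWS access key ID'), (3, 'hard-coded password')]
```

Line 2 is not flagged because the value is read from the environment rather than written into the source, which is the pattern these checks are meant to encourage.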
Runtime
- DAST — dynamic application testing
- WAF — web application firewall
- Runtime protection — detect anomalous behavior
Why it matters
These practices are not optional for teams operating software in production. Each one reduces a specific type of risk: IaC replaces manual, unrepeatable configuration; feature flags decouple deploy from release; observability and post-mortems turn incidents into learning. Adopting them incrementally is more effective than trying to implement everything at once.