AWS serverless orchestration service that coordinates multiple services into visual workflows using Amazon States Language (ASL), with built-in error handling, retries, and parallel execution.
AWS Step Functions is a serverless orchestration service that coordinates multiple AWS services into visual workflows using Amazon States Language (ASL). It defines flows as declarative state machines with steps, conditions, parallelism, and built-in error handling.
Unlike orchestrating services with custom code in AWS Lambda, Step Functions separates business logic from workflow coordination. Task states can invoke AWS services, external APIs, or Lambda functions, while the service handles retries, timeouts, and state transitions.
The service uses JSON to define state machines that are both executable and visual documentation of the process. This facilitates debugging, auditing, and maintaining complex workflows in microservices architectures and event-driven systems.
Step Functions offers two workflow types with different characteristics and pricing:
| Feature | Standard | Express |
|---|---|---|
| Maximum duration | 1 year | 5 minutes |
| Pricing model | Per state transition | Per execution and duration |
| Execution history | Complete and persistent | Limited, optional |
| Execution guarantees | Exactly once | At least once (asynchronous), at most once (synchronous) |
| Use cases | Long, durable workflows | High volume, low latency |
| Typical cost | Higher for high volume | Lower for frequent executions |
| Execution start rate | Over 2,000 per second | Over 100,000 per second |
Standard workflows are ideal for business processes requiring complete auditing, such as approvals, ETL pipelines, or complex agentic workflows. Express workflows optimize for streaming cases, real-time data validation, or microservices requiring fast orchestration.
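The selection criteria in the table can be condensed into a small decision helper. This is only a sketch: the function name and its parameters are hypothetical, and it considers duration, execution guarantees, and history requirements but not cost.

```python
def choose_workflow_type(max_duration_seconds, needs_exactly_once=False,
                         needs_full_history=False):
    """Return "STANDARD" or "EXPRESS" using the criteria from the comparison table."""
    express_max_seconds = 5 * 60  # Express executions are capped at 5 minutes

    # Anything longer than the Express cap has to run as a Standard workflow.
    if max_duration_seconds > express_max_seconds:
        return "STANDARD"
    # Exactly-once execution and complete history are Standard-only features.
    if needs_exactly_once or needs_full_history:
        return "STANDARD"
    return "EXPRESS"


print(choose_workflow_type(3 * 24 * 3600))                  # multi-day approval flow
print(choose_workflow_type(2))                              # short, high-volume validation
print(choose_workflow_type(30, needs_full_history=True))    # short but fully audited
```

In practice the cost columns also matter, so a real decision would weigh expected execution volume alongside these hard constraints.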
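Amazon States Language defines eight state types for building workflows: Task, Choice, Parallel, Map, Wait, Pass, Succeed, and Fail. The following definition uses six of them: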
```json
{
  "Comment": "Order processing example",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ValidateOrder",
      "Next": "CheckInventory",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "Next": "OrderFailed"
        }
      ]
    },
    "CheckInventory": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.inventory.available",
          "BooleanEquals": true,
          "Next": "ProcessPayment"
        }
      ],
      "Default": "OutOfStock"
    },
    "ProcessPayment": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "ChargeCard",
          "States": {
            "ChargeCard": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ChargeCard",
              "End": true
            }
          }
        },
        {
          "StartAt": "SendConfirmation",
          "States": {
            "SendConfirmation": {
              "Type": "Task",
              "Resource": "arn:aws:states:::sns:publish",
              "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:order-confirmations",
                "Message.$": "$.confirmationMessage"
              },
              "End": true
            }
          }
        }
      ],
      "ResultPath": "$.paymentResults",
      "Next": "ProcessItems"
    },
    "ProcessItems": {
      "Type": "Map",
      "ItemsPath": "$.order.items",
      "MaxConcurrency": 5,
      "Iterator": {
        "StartAt": "ProcessItem",
        "States": {
          "ProcessItem": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessItem",
            "End": true
          }
        }
      },
      "Next": "OrderComplete"
    },
    "OutOfStock": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:inventory-alerts",
        "Message": "Item out of stock"
      },
      "Next": "OrderFailed"
    },
    "OrderComplete": {
      "Type": "Succeed"
    },
    "OrderFailed": {
      "Type": "Fail",
      "Cause": "Order processing failed"
    }
  }
}
```

This example demonstrates Task (invoke Lambda or publish to SNS), Choice (conditional branching), Parallel (concurrent execution), Map (array iteration), and Succeed/Fail (explicit termination). A Parallel state outputs an array of its branch results, so ProcessPayment sets ResultPath to keep the original input available; otherwise the Map state could not resolve `$.order.items`.
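Because every transition in a definition like the one above is a string reference to another state, a typo in a `Next` or `Default` value is an easy mistake to make. The following is a minimal sketch of a local check for dangling references; the function name is hypothetical, and it only inspects top-level states (it does not descend into Parallel branches or Map iterators).

```python
def missing_targets(definition):
    """Return (source, target) pairs whose target state is not defined."""
    states = definition["States"]
    missing = []

    if definition["StartAt"] not in states:
        missing.append(("StartAt", definition["StartAt"]))

    for name, state in states.items():
        targets = []
        if "Next" in state:
            targets.append(state["Next"])
        if "Default" in state:  # Choice fallback
            targets.append(state["Default"])
        targets += [rule["Next"] for rule in state.get("Choices", [])]
        targets += [catcher["Next"] for catcher in state.get("Catch", [])]
        missing += [(name, target) for target in targets if target not in states]

    return missing


# Small illustrative definition with one misspelled target.
definition = {
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ValidateOrder",
            "Catch": [{"ErrorEquals": ["States.TaskFailed"], "Next": "OrderFailed"}],
            "Next": "OrderComplet",  # typo on purpose
        },
        "OrderComplete": {"Type": "Succeed"},
        "OrderFailed": {"Type": "Fail", "Cause": "Order processing failed"},
    },
}
print(missing_targets(definition))
```

Step Functions validates definitions when a state machine is created, so a check like this is mainly useful for catching mistakes earlier, for example in unit tests.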
Step Functions includes robust patterns for handling failures:
Retry configures automatic retries with exponential backoff:
- ErrorEquals: error types to retry
- IntervalSeconds: initial time between retries
- MaxAttempts: maximum number of retries
- BackoffRate: multiplier for exponential backoff

Catch handles errors that retries do not resolve, routing execution to a fallback state.
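The Retry fields above imply a delay schedule: the first retry waits IntervalSeconds, and each later wait is multiplied by BackoffRate. A short sketch of that arithmetic follows; the helper name is hypothetical, and optional fields such as a maximum delay or jitter are ignored.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Seconds waited before each retry attempt, in order."""
    return [interval_seconds * backoff_rate ** attempt
            for attempt in range(max_attempts)]


# Values from the ValidateOrder retrier in the example definition.
delays = retry_delays(interval_seconds=2, max_attempts=3, backoff_rate=2.0)
print(delays)       # [2.0, 4.0, 8.0]
print(sum(delays))  # 14.0
```

With those settings the task is retried after 2, 4, and 8 seconds, so the retrier adds at most 14 seconds of waiting before the error falls through to Catch.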
Step Functions integrates natively with over 200 AWS services, so a Task state can call them without intermediate Lambda code.
This direct integration reduces latency, cost, and complexity compared to orchestrating services through Lambda wrapper functions.
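As an illustration of a direct integration, the sketch below builds a Task state that writes to DynamoDB without a Lambda function in between. The table name, attribute names, state names, and input paths are assumptions made up for this example; the definition is assembled as a Python dict and serialized to ASL JSON.

```python
import json

# Hypothetical state that persists an order record via the DynamoDB integration.
save_order_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
        "TableName": "Orders",  # assumed table name
        "Item": {
            # Keys ending in ".$" take their value from the state input.
            "orderId": {"S.$": "$.order.id"},
            "status": {"S": "VALIDATED"},
        },
    },
    "ResultPath": "$.saveResult",  # keep the original input alongside the result
    "Next": "CheckInventory",
}

print(json.dumps({"SaveOrder": save_order_state}, indent=2))
```

The state machine's execution role still needs permission for the DynamoDB action, which is where the Lambda wrapper's IAM policy would otherwise have lived.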
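Saga Pattern: For distributed transactions, each step is paired with a compensation action. Step Functions does not run compensations automatically; if a step fails, Catch transitions route execution through the compensations for the steps already completed, in reverse order.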
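One way to wire up a saga is to generate the Catch transitions programmatically. The sketch below is an assumption-laden illustration, not a canonical implementation: the builder function, the state names, and the Lambda ARN prefix are all hypothetical. Each step's Catch points at the compensation for the previously completed step, and compensations chain backwards to a Fail state.

```python
import json


def build_saga(steps, arn_prefix):
    """steps: ordered (action, compensation) pairs; the last compensation may be None."""
    states = {}
    for i, (action, _) in enumerate(steps):
        # On failure, start undoing from the most recently completed step.
        on_error = steps[i - 1][1] if i > 0 else "SagaFailed"
        states[action] = {
            "Type": "Task",
            "Resource": arn_prefix + action,
            "Catch": [{"ErrorEquals": ["States.ALL"],
                       "ResultPath": "$.error",
                       "Next": on_error}],
            "Next": steps[i + 1][0] if i + 1 < len(steps) else "SagaSucceeded",
        }
    # The final step has nothing after it that could fail, so it needs no compensation.
    for i, (_, compensation) in enumerate(steps[:-1]):
        states[compensation] = {
            "Type": "Task",
            "Resource": arn_prefix + compensation,
            "Next": steps[i - 1][1] if i > 0 else "SagaFailed",
        }
    states["SagaSucceeded"] = {"Type": "Succeed"}
    states["SagaFailed"] = {"Type": "Fail", "Cause": "Saga rolled back"}
    return {"StartAt": steps[0][0], "States": states}


saga = build_saga(
    [("ReserveInventory", "ReleaseInventory"),
     ("ChargeCard", "RefundCard"),
     ("ShipOrder", None)],
    arn_prefix="arn:aws:lambda:us-east-1:123456789012:function:",
)
print(json.dumps(saga, indent=2))
```

If ShipOrder fails here, execution runs RefundCard, then ReleaseInventory, then ends in SagaFailed. A production version would also add Retry to the compensations themselves, since a failed compensation leaves the system inconsistent.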
Human-in-the-loop: Workflows can pause awaiting human approval using callback tokens. Useful for expense approvals, content reviews, or decisions requiring human judgment.
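A callback-based approval step might look like the following sketch. The queue URL, message fields, and state names are hypothetical; the `.waitForTaskToken` suffix and the `$$.Task.Token` context path are the parts that make the workflow pause.

```python
import json

# Hypothetical approval step: send a request to a queue, then pause until
# someone reports back with the task token.
wait_for_approval = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/approval-requests",
        "MessageBody": {
            "expense.$": "$.expense",
            # The token the approver's system must send back to resume the workflow.
            "taskToken.$": "$$.Task.Token",
        },
    },
    "TimeoutSeconds": 86400,  # assumed policy: give up after one day without a decision
    "Next": "ApprovalDecision",
}

print(json.dumps({"WaitForApproval": wait_for_approval}, indent=2))
```

Whatever system presents the request to the approver then resumes the execution by calling the Step Functions SendTaskSuccess or SendTaskFailure API with that token.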
Fan-out/Fan-in: The Map state processes arrays in parallel with concurrency control. Ideal for processing data batches, validating multiple inputs, or executing independent tasks.
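The semantics of a Map state with MaxConcurrency can be approximated locally with a bounded thread pool. This is an analogy in plain Python rather than Step Functions code, and the item shape and `process_item` logic are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def process_item(item):
    """Stand-in for the per-item work a Map iteration would perform."""
    return {"sku": item["sku"], "total": item["quantity"] * item["unit_price"]}


def fan_out_fan_in(items, max_concurrency=5):
    # Fan out: at most max_concurrency items are processed at the same time.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # Fan in: results are collected in input order, like Map state output.
        return list(pool.map(process_item, items))


items = [
    {"sku": "A", "quantity": 2, "unit_price": 5},
    {"sku": "B", "quantity": 1, "unit_price": 7},
]
print(fan_out_fan_in(items))
```

As with the Map state, the caller sees a single ordered array of results regardless of which items finished first.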
Circuit Breaker: Combining Choice and Wait, you can implement circuit breakers that pause workflows when downstream services fail repeatedly.
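A rough sketch of such a breaker is shown below. It assumes the execution input carries a failure counter at `$.breaker.failures` (starting at 0); the builder function, state names, threshold, and cooldown are all hypothetical, and the loop has no upper bound on total attempts, which a production workflow would need.

```python
import json


def build_circuit_breaker(service_arn, failure_threshold=3, cooldown_seconds=60):
    """Return an ASL definition that pauses calls after repeated failures."""
    return {
        "StartAt": "CheckBreaker",
        "States": {
            "CheckBreaker": {
                "Type": "Choice",
                "Choices": [{
                    "Variable": "$.breaker.failures",
                    "NumericGreaterThanEquals": failure_threshold,
                    "Next": "CoolDown",  # breaker open: stop calling for a while
                }],
                "Default": "CallService",
            },
            "CoolDown": {"Type": "Wait", "Seconds": cooldown_seconds,
                         "Next": "ResetBreaker"},
            "ResetBreaker": {"Type": "Pass", "Result": {"failures": 0},
                             "ResultPath": "$.breaker", "Next": "CallService"},
            "CallService": {
                "Type": "Task",
                "Resource": service_arn,
                "ResultPath": "$.result",
                "Catch": [{"ErrorEquals": ["States.ALL"],
                           "ResultPath": "$.error",
                           "Next": "RecordFailure"}],
                "Next": "Done",
            },
            "RecordFailure": {
                "Type": "Pass",
                # Increment the counter with an ASL intrinsic function.
                "Parameters": {"failures.$": "States.MathAdd($.breaker.failures, 1)"},
                "ResultPath": "$.breaker",
                "Next": "CheckBreaker",
            },
            "Done": {"Type": "Succeed"},
        },
    }


breaker = build_circuit_breaker(
    "arn:aws:lambda:us-east-1:123456789012:function:CallDownstream")
print(json.dumps(breaker, indent=2))
```

Because the counter lives in the execution's own state, this breaker only protects a single execution; sharing breaker state across executions would require an external store.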
Step Functions transforms complex workflows from imperative code to auditable declarative definitions. Instead of handling coordination, retries, and error states in custom code, you define the flow once and the service handles reliable execution.
For teams building distributed systems, this means less orchestration code to maintain, better visibility into system state, and the ability to change a workflow by updating its definition rather than redeploying application code. The separation between business logic and coordination makes complex processes easier to test, debug, and evolve.
In serverless and microservices architectures, Step Functions acts as the "glue" that coordinates independent services into cohesive business processes, with the reliability and observability that production systems require.