Fully managed AWS message queue service that decouples distributed application components, guaranteeing at-least-once delivery with virtually unlimited scalability.
Amazon SQS (Simple Queue Service) is a fully managed message queue service that enables decoupling and scaling of microservices, distributed systems, and serverless applications. Unlike direct synchronous communication, SQS acts as a durable buffer between producers and consumers, guaranteeing at-least-once message delivery and keeping messages available until successfully processed.
SQS automatically handles the underlying infrastructure, including replication, encryption in transit and at rest, and dynamic scaling. Messages can contain up to 256 KB of text data in any format, and the service can handle virtually any traffic volume without prior capacity configuration.
The service integrates natively with other AWS services such as AWS Lambda, Amazon SNS, and AWS Step Functions, making it a fundamental component of event-driven architectures.
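As a concrete illustration of producing to a queue, the snippet below validates a payload against SQS's 256 KB body limit before sending. The helper name and queue URL are illustrative, not part of any AWS API:

```python
import json

MAX_SQS_BODY_BYTES = 256 * 1024  # SQS limit per message body

def build_order_message(order: dict) -> str:
    """Serialize a payload and enforce the 256 KB SQS body limit.
    (Illustrative helper, not an AWS API.)"""
    body = json.dumps(order)
    if len(body.encode("utf-8")) > MAX_SQS_BODY_BYTES:
        raise ValueError("SQS message bodies are limited to 256 KB")
    return body

# Sending with boto3 (requires AWS credentials; the queue URL is hypothetical):
# import boto3
# sqs = boto3.client("sqs")
# sqs.send_message(
#     QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders",
#     MessageBody=build_order_message({"order_id": 42, "total": 99.9}),
# )
```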
The choice between Standard and FIFO queues is a critical architectural decision that impacts performance, cost, and delivery guarantees:
| Aspect | Standard | FIFO |
|---|---|---|
| Throughput | Unlimited | 300 msg/s (no batching), 3,000 msg/s (with batching) |
| Order | Best-effort | Strictly preserved |
| Delivery | At-least-once | Exactly-once processing |
| Cost | $0.40 per million requests | $0.50 per million requests |
| Latency | Lower | Slightly higher |
| Use cases | Logs, metrics, notifications | Financial transactions, state commands |
Decision framework:
- Choose FIFO when strict ordering or exactly-once processing is a hard requirement (e.g., financial transactions, state-changing commands).
- Choose Standard otherwise: it offers higher throughput, lower cost, and lower latency, but consumers must be idempotent and tolerate out-of-order delivery.
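For FIFO queues, every message needs a `MessageGroupId` (the ordering scope) and, unless content-based deduplication is enabled on the queue, a `MessageDeduplicationId`. A sketch of building these parameters; the helper name and the choice to group by customer are illustrative:

```python
import hashlib
import json

def fifo_send_params(queue_url: str, order: dict) -> dict:
    """Build kwargs for sqs.send_message on a FIFO queue (illustrative helper)."""
    body = json.dumps(order, sort_keys=True)  # stable serialization for dedup
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        # All messages sharing a group ID are delivered in order
        "MessageGroupId": str(order["customer_id"]),
        # Duplicate sends within SQS's 5-minute dedup window are dropped
        "MessageDeduplicationId": hashlib.sha256(body.encode()).hexdigest(),
    }

# Usage with boto3 (queue URL hypothetical):
# sqs.send_message(**fifo_send_params(
#     "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo",
#     {"customer_id": 7, "order_id": 42},
# ))
```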
Visibility timeout is crucial for performance and reliability. When a consumer receives a message, it becomes invisible to other consumers during this period:
```python
import boto3
import json

def lambda_handler(event, context):
    sqs = boto3.client('sqs')
    queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'

    # Receive messages with long polling and a tuned visibility timeout
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # Batch processing
        VisibilityTimeout=300,   # 5 minutes for processing
        WaitTimeSeconds=20       # Long polling
    )

    for message in response.get('Messages', []):
        try:
            # Process message
            body = json.loads(message['Body'])
            process_business_logic(body)
            # Delete only after successful processing
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message['ReceiptHandle']
            )
        except Exception as e:
            # Extend the visibility timeout if more time is needed
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=message['ReceiptHandle'],
                VisibilityTimeout=600  # Extend to 10 minutes
            )
            print(f"Error processing message: {e}")
            # Otherwise the message automatically becomes available again

def process_business_logic(data):
    # Business logic that may take several minutes
    pass
```

Tuning rules:
- Set the visibility timeout to at least the maximum expected processing time per message: too short causes duplicate processing, too long delays retries after a consumer crash.
- Use long polling (`WaitTimeSeconds` up to 20) to reduce empty receives and cost.
- Prefer batch receives (`MaxNumberOfMessages=10`) when per-message latency requirements allow it.
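For Lambda consumers specifically, AWS documentation recommends setting the source queue's visibility timeout to at least six times the function timeout, plus any batching window. A small helper to make that rule explicit (the function name is ours, not an AWS API):

```python
def recommended_visibility_timeout(function_timeout_s: int,
                                   batch_window_s: int = 0) -> int:
    """Visibility-timeout heuristic for SQS-triggered Lambda functions.

    AWS docs suggest at least 6x the function timeout (plus the maximum
    batching window) so a message is not redelivered while Lambda may
    still be retrying the same batch.
    """
    return 6 * function_timeout_s + batch_window_s

# A 30 s function with a 5 s batching window:
# recommended_visibility_timeout(30, 5)  -> 185
```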
SQS supports batch processing to optimize throughput and reduce costs:
```python
import json

# Lambda consumer optimized for batches
def lambda_handler(event, context):
    # Lambda delivers up to BatchSize messages per invocation; reporting
    # failures per item means only failed messages are retried, not the batch
    batch_item_failures = []
    processed = 0

    for record in event['Records']:
        try:
            message_body = json.loads(record['body'])
            process_message(message_body)
            processed += 1
        except Exception as e:
            print(f"Failed to process message: {e}")
            batch_item_failures.append({'itemIdentifier': record['messageId']})

    # Metrics for observability
    print(f"Processed: {processed}, Failed: {len(batch_item_failures)}")

    # With ReportBatchItemFailures enabled, only these messages are retried
    return {'batchItemFailures': batch_item_failures}
```

Lambda trigger (event source mapping) configuration:

```json
{
  "BatchSize": 10,
  "MaximumBatchingWindowInSeconds": 5,
  "FunctionResponseTypes": ["ReportBatchItemFailures"]
}
```

DLQs (dead-letter queues) are essential for handling messages that repeatedly fail:
```yaml
# CloudFormation template for a DLQ
Resources:
  MainQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: orders-processing
      VisibilityTimeout: 300
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt DeadLetterQueue.Arn
        maxReceiveCount: 3

  DeadLetterQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: orders-processing-dlq
      MessageRetentionPeriod: 1209600  # 14 days (maximum)

  # Alarm to monitor the DLQ
  DLQAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: orders-dlq-messages
      MetricName: ApproximateNumberOfMessagesVisible
      Namespace: AWS/SQS
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: QueueName
          Value: !GetAtt DeadLetterQueue.QueueName
```

Redrive strategy:
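A DLQ redrive can be started programmatically with boto3's `start_message_move_task`. A sketch, where the ARNs are illustrative and the parameter-building helper is ours:

```python
def redrive_params(dlq_arn, destination_arn=None, max_rate=None):
    """Assemble kwargs for sqs.start_message_move_task (illustrative helper).

    SourceArn must be a DLQ; omitting DestinationArn moves messages back
    to their original source queues.
    """
    params = {"SourceArn": dlq_arn}
    if destination_arn:
        params["DestinationArn"] = destination_arn
    if max_rate:
        params["MaxNumberOfMessagesPerSecond"] = max_rate
    return params

# With boto3 (requires AWS credentials; ARN hypothetical):
# sqs = boto3.client("sqs")
# sqs.start_message_move_task(**redrive_params(
#     "arn:aws:sqs:us-east-1:123456789012:orders-processing-dlq",
#     max_rate=50,
# ))
```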
Once the root cause is fixed, the `StartMessageMoveTask` API moves messages from the DLQ back to the source queue. For production systems, SQS observability is critical:
```python
import boto3
from datetime import datetime, timedelta
from aws_lambda_powertools import Logger, Tracer, Metrics
from aws_lambda_powertools.metrics import MetricUnit

logger = Logger()
tracer = Tracer()
metrics = Metrics(namespace="OrdersService")  # namespace required for EMF output

@tracer.capture_lambda_handler
@logger.inject_lambda_context
@metrics.log_metrics
def lambda_handler(event, context):
    queue_depth = get_queue_depth()

    # Custom metrics
    metrics.add_metric(name="QueueDepth", unit=MetricUnit.Count, value=queue_depth)
    metrics.add_metric(name="MessagesProcessed", unit=MetricUnit.Count, value=len(event['Records']))

    # Structured logging
    logger.info("Processing batch", extra={
        "batch_size": len(event['Records']),
        "queue_depth": queue_depth
    })

    for record in event['Records']:
        with tracer.provider.in_subsegment("## process_message"):
            process_message_with_tracing(record)

def get_queue_depth():
    cloudwatch = boto3.client('cloudwatch')
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/SQS',
        MetricName='ApproximateNumberOfMessagesVisible',
        Dimensions=[{'Name': 'QueueName', 'Value': 'orders-processing'}],
        StartTime=datetime.utcnow() - timedelta(minutes=5),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=['Average']
    )
    datapoints = sorted(response['Datapoints'], key=lambda d: d['Timestamp'])
    return datapoints[-1]['Average'] if datapoints else 0
```

SQS is the most widely used messaging service in AWS and a fundamental component for building resilient distributed systems. At the staff engineer level, SQS solves three critical problems: temporal decoupling (producers and consumers operate independently), spike absorption (the queue acts as a buffer under variable load), and delivery guarantees (messages persist until successfully processed).
The choice between Standard and FIFO directly impacts system architecture: Standard allows unlimited horizontal scaling but requires idempotent logic, while FIFO guarantees order but limits throughput. In high-volume systems, this decision can determine whether you need additional sharding or more complex processing patterns.
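Because Standard queues can deliver a message more than once, consumer logic must be idempotent. A minimal sketch of the pattern, using an in-memory set where production code would use a conditional write to DynamoDB or similar; all names here are illustrative:

```python
import json

_processed_ids = set()  # in production: DynamoDB conditional write, Redis SETNX, etc.

def handle_once(message_id: str, body: str) -> bool:
    """Process a message only if its ID has not been seen before.

    Returns True if processed, False if skipped as a duplicate delivery.
    """
    if message_id in _processed_ids:
        return False  # duplicate: safe to delete from the queue and skip
    _processed_ids.add(message_id)
    apply_side_effects(json.loads(body))
    return True

def apply_side_effects(payload: dict) -> None:
    # Placeholder for the real business logic
    pass
```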
For platform teams, SQS significantly reduces operational complexity compared to self-managed solutions like Apache Kafka, eliminating the need to manage brokers, partitions, and consumer rebalancing.
**Serverless**: Cloud computing model where the provider manages infrastructure automatically, allowing code execution without provisioning or managing servers, paying only for actual usage.

**Event-driven architecture**: Architectural pattern where components communicate through asynchronous events, enabling decoupled, scalable, and reactive systems.

**AWS Lambda**: AWS serverless compute service that runs code in response to events without provisioning or managing servers, automatically scaling from zero to thousands of concurrent executions.

**Amazon SNS**: AWS pub/sub messaging service that distributes messages to multiple subscribers simultaneously, enabling fan-out patterns and notifications at scale.

**Observability**: Ability to understand a system's internal state from its external outputs (logs, metrics, and traces), enabling problem diagnosis without direct system access.
Collection of 13 Terraform modules published on the Terraform Registry for deploying serverless architectures on AWS, with 12 examples covering basic ECS to full-stack CRUD with DynamoDB and AgentCore with MCP.