Jonatan Mata · jonmatum.com
© 2026 Jonatan Mata. All rights reserved. v2.1.1
Concepts

AWS SQS

AWS's fully managed message queue service that decouples distributed application components, guaranteeing at-least-once message delivery with virtually unlimited scalability.

evergreen · #aws #sqs #messaging #queue #serverless #decoupling

What it is

Amazon SQS (Simple Queue Service) is a fully managed message queue service that enables decoupling and scaling of microservices, distributed systems, and serverless applications. Unlike direct synchronous communication, SQS acts as a durable buffer between producers and consumers, guaranteeing at-least-once message delivery and keeping messages available until successfully processed.

SQS automatically handles the underlying infrastructure, including replication, encryption in transit and at rest, and dynamic scaling. Messages can contain up to 256 KB of text data in any format, and the service can handle virtually any traffic volume without prior capacity configuration.
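The 256 KB limit covers the message body plus its attributes, so it is worth validating payload size on the producer side. A minimal sketch of such a guard (the helper name and constant are mine):

```python
MAX_SQS_BYTES = 256 * 1024  # SQS limit on body + attributes combined

def fits_in_sqs(body: str) -> bool:
    """Check whether a UTF-8 encoded message body fits within the SQS size limit.

    Note: message attributes also count toward the limit; this checks the body only.
    """
    return len(body.encode("utf-8")) <= MAX_SQS_BYTES
```

A common pattern for oversized payloads is to store the object in S3 and enqueue only a pointer to it.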

The service integrates natively with other AWS services such as AWS Lambda, AWS SNS, and AWS Step Functions, making it a fundamental component of event-driven architectures.

Queue types: Standard vs FIFO

The choice between Standard and FIFO queues is a critical architectural decision that impacts performance, cost, and delivery guarantees:

| Aspect | Standard | FIFO |
|---|---|---|
| Throughput | Nearly unlimited | 300 msg/s (no batching), 3,000 msg/s (with batching) |
| Order | Best-effort | Strictly preserved |
| Delivery | At-least-once | Exactly-once processing |
| Cost | $0.40 per million requests | $0.50 per million requests |
| Latency | Lower | Slightly higher |
| Use cases | Logs, metrics, notifications | Financial transactions, state commands |

Decision framework:

  • Use Standard when throughput is critical and you can handle duplicates or out-of-order processing
  • Use FIFO when order is essential and you cannot tolerate duplicates (e.g., inventory updates, payment processing)
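FIFO deduplication can be content-based, in which case SQS derives the deduplication ID from a SHA-256 hash of the message body, and two sends with the same body inside the 5-minute deduplication window are collapsed into one. The same hash can be computed locally to reason about which sends will be deduplicated (the function name is mine):

```python
import hashlib

def dedup_id(body: str) -> str:
    # Content-based deduplication hashes the message body with SHA-256;
    # identical bodies within the 5-minute window share this ID, so SQS
    # accepts only the first send.
    return hashlib.sha256(body.encode("utf-8")).hexdigest()
```

Alternatively, an explicit MessageDeduplicationId can be passed per message when bodies differ but represent the same logical event.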

Visibility timeout configuration

Visibility timeout is crucial for performance and reliability. When a consumer receives a message, that message becomes invisible to other consumers for the duration of the timeout:

import boto3
import json
 
def lambda_handler(event, context):
    sqs = boto3.client('sqs')
    queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'
    
    # Receive messages with optimized visibility timeout
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # Batch processing
        VisibilityTimeoutSeconds=300,  # 5 minutes for processing
        WaitTimeSeconds=20  # Long polling
    )
    
    messages = response.get('Messages', [])
    
    for message in messages:
        try:
            # Process message
            body = json.loads(message['Body'])
            process_business_logic(body)
            
            # Delete successful message
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message['ReceiptHandle']
            )
            
        except Exception as e:
            # Delay the retry by extending the visibility timeout (a simple backoff)
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=message['ReceiptHandle'],
                VisibilityTimeoutSeconds=600  # retry after 10 minutes
            )
            
            print(f"Error processing message: {e}")
            # Once the timeout expires, the message becomes visible again
 
def process_business_logic(data):
    # Business logic that may take several minutes
    pass

Tuning rules:

  • Visibility timeout = maximum processing time + 20% buffer
  • For Lambda: typically 30-300 seconds
  • For long processes: up to 12 hours (SQS maximum)
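The first rule above can be written down directly as a small helper (the name is mine); SQS caps the value at 12 hours, i.e. 43,200 seconds:

```python
SQS_MAX_VISIBILITY = 43_200  # 12 hours, the SQS maximum

def visibility_timeout(max_processing_seconds: int) -> int:
    """Maximum observed processing time plus a 20% buffer, capped at the SQS limit."""
    return min(max_processing_seconds + max_processing_seconds // 5, SQS_MAX_VISIBILITY)
```

For example, a worker that takes up to 5 minutes per message would get a 360-second timeout, while anything whose buffered value exceeds 12 hours is clamped to the maximum.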

Batch processing patterns

SQS supports batch processing to optimize throughput and reduce costs:

# Lambda consumer optimized for batches
import json
 
def lambda_handler(event, context):
    # Lambda can receive up to 10 messages per invocation
    # With ReportBatchItemFailures enabled, only failed messages are retried
    batch_item_failures = []
    
    for record in event['Records']:
        try:
            message_body = json.loads(record['body'])
            process_message(message_body)
        except Exception as e:
            print(f"Failed to process message {record['messageId']}: {e}")
            batch_item_failures.append({'itemIdentifier': record['messageId']})
    
    # Metrics for observability
    processed = len(event['Records']) - len(batch_item_failures)
    print(f"Processed: {processed}, Failed: {len(batch_item_failures)}")
    
    # Returning the failed IDs tells Lambda to retry only those messages,
    # instead of redelivering the entire batch
    return {'batchItemFailures': batch_item_failures}
 
# Lambda trigger configuration
{
    "BatchSize": 10,
    "MaximumBatchingWindowInSeconds": 5,
    "FunctionResponseTypes": ["ReportBatchItemFailures"]
}

Dead Letter Queues and redrive strategies

DLQs are essential for handling messages that repeatedly fail:

# CloudFormation template for DLQ
Resources:
  MainQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: orders-processing
      VisibilityTimeoutSeconds: 300
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt DeadLetterQueue.Arn
        maxReceiveCount: 3
      
  DeadLetterQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: orders-processing-dlq
      MessageRetentionPeriod: 1209600  # 14 days
      
  # Alarm to monitor DLQ
  DLQAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: orders-dlq-messages
      MetricName: ApproximateNumberOfVisibleMessages
      Namespace: AWS/SQS
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: QueueName
          Value: !GetAtt DeadLetterQueue.QueueName

Redrive strategy:

  1. Analysis: examine DLQ messages to identify failure patterns
  2. Correction: fix bugs in consumer code
  3. Redrive: move messages back to main queue using StartMessageMoveTask
  4. Monitoring: configure alerts to detect new failures quickly
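Step 3 can be driven programmatically with the StartMessageMoveTask API. A sketch, with the helper names mine and the ARNs as placeholders; omitting the destination redrives messages back to their original source queues:

```python
def move_task_kwargs(dlq_arn, destination_arn=None, rate=50):
    """Build the arguments for StartMessageMoveTask.

    Omitting DestinationArn redrives messages to their original source queues.
    """
    kwargs = {"SourceArn": dlq_arn, "MaxNumberOfMessagesPerSecond": rate}
    if destination_arn:
        kwargs["DestinationArn"] = destination_arn
    return kwargs

def redrive_dlq(dlq_arn, destination_arn=None, rate=50):
    # Deferred import so the helper above stays usable without AWS credentials
    import boto3
    sqs = boto3.client("sqs")
    return sqs.start_message_move_task(**move_task_kwargs(dlq_arn, destination_arn, rate))
```

The rate limit matters: redriving a large DLQ at full speed can overwhelm the very consumers that just recovered.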

Integration with observability

For production systems, SQS observability is critical:

import boto3
from datetime import datetime, timedelta
from aws_lambda_powertools import Logger, Tracer, Metrics
from aws_lambda_powertools.metrics import MetricUnit
 
logger = Logger()
tracer = Tracer()
metrics = Metrics()
 
@tracer.capture_lambda_handler
@logger.inject_lambda_context
@metrics.log_metrics
def lambda_handler(event, context):
    queue_depth = get_queue_depth()
    
    # Custom metrics
    metrics.add_metric(name="QueueDepth", unit=MetricUnit.Count, value=queue_depth)
    metrics.add_metric(name="MessagesProcessed", unit=MetricUnit.Count, value=len(event['Records']))
    
    # Structured logging
    logger.info("Processing batch", extra={
        "batch_size": len(event['Records']),
        "queue_depth": queue_depth
    })
    
    for record in event['Records']:
        with tracer.subsegment("process_message"):
            process_message_with_tracing(record)
 
def get_queue_depth():
    cloudwatch = boto3.client('cloudwatch')
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/SQS',
        MetricName='ApproximateNumberOfVisibleMessages',
        Dimensions=[{'Name': 'QueueName', 'Value': 'orders-processing'}],
        StartTime=datetime.utcnow() - timedelta(minutes=5),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=['Average']
    )
    # Datapoints are not guaranteed to be ordered; sort by timestamp first
    datapoints = sorted(response['Datapoints'], key=lambda d: d['Timestamp'])
    return datapoints[-1]['Average'] if datapoints else 0

Why it matters

SQS is one of the most widely used messaging services in AWS and a fundamental component for building resilient distributed systems. At the staff engineer level, SQS solves three critical problems: temporal decoupling (producers and consumers operate independently), spike absorption (the queue acts as a buffer under variable load), and delivery guarantees (messages persist until successfully processed).

The choice between Standard and FIFO directly impacts system architecture: Standard allows unlimited horizontal scaling but requires idempotent logic, while FIFO guarantees order but limits throughput. In high-volume systems, this decision can determine whether you need additional sharding or more complex processing patterns.
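The idempotency requirement for Standard queues is usually met by keying side effects on a stable identifier such as the message ID or a business key. An in-memory sketch of the pattern (class name mine; in production the seen-set would live in DynamoDB with a conditional write, or in Redis with a TTL):

```python
class IdempotentProcessor:
    """Apply each message's side effect at most once, tolerating redeliveries."""

    def __init__(self):
        self._seen = set()   # stand-in for a durable store of processed IDs
        self.applied = 0

    def handle(self, message_id: str, payload: dict) -> bool:
        if message_id in self._seen:
            return False     # duplicate delivery: skip the side effect
        self._seen.add(message_id)
        self.applied += 1    # stand-in for the real side effect
        return True
```

With this in place, at-least-once delivery becomes effectively exactly-once at the level of side effects, without paying the FIFO throughput penalty.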

For platform teams, SQS significantly reduces operational complexity compared to self-managed solutions like Apache Kafka, eliminating the need to manage brokers, partitions, and consumer rebalancing.

References

  • Amazon SQS Developer Guide — AWS, 2024. Complete developer guide.
  • SQS Best Practices for Performance — AWS, 2024. Performance and cost optimizations.
  • Event-Driven Architectures - Serverless Lens — AWS Well-Architected, 2024. Architectural patterns for serverless applications.
  • SQS Pricing and Cost Optimization — AWS, 2024. Detailed pricing model.
  • Monitoring Amazon SQS with CloudWatch — AWS, 2024. Recommended metrics and alerts.
  • AWS Lambda SQS Integration — AWS, 2024. Configuration and best practices for triggers.

Related content

  • Serverless

    Cloud computing model where the provider manages infrastructure automatically, allowing code execution without provisioning or managing servers, paying only for actual usage.

  • Event-Driven Architecture

    Architectural pattern where components communicate through asynchronous events, enabling decoupled, scalable, and reactive systems.

  • AWS Lambda

    AWS serverless compute service that runs code in response to events without provisioning or managing servers, automatically scaling from zero to thousands of concurrent executions.

  • AWS SNS

    AWS pub/sub messaging service that distributes messages to multiple subscribers simultaneously, enabling fan-out patterns and notifications at scale.

  • Observability

    Ability to understand a system's internal state from its external outputs: logs, metrics, and traces, enabling problem diagnosis without direct system access.

  • Terraform AWS Serverless Modules

    Collection of 13 Terraform modules published on the Terraform Registry for deploying serverless architectures on AWS, with 12 examples covering basic ECS to full-stack CRUD with DynamoDB and AgentCore with MCP.
