Our system uses Fifo SQS queues to drive lambdas. Here's from our SAM template:
EventParserTriggeringQueue: Type: AWS::SQS::Queue Properties: MessageRetentionPeriod: 1209600 # 14 Days (max) FifoQueue: true ContentBasedDeduplication: true VisibilityTimeout: 240 # Must be > EventParser Timeout Tags: - Key: "datadog" Value: "true" RedrivePolicy: deadLetterTargetArn: !GetAtt EventParserDeadLetters.Arn maxReceiveCount: 1 EventParser: Type: AWS::Serverless::Function Properties: CodeUri: lambdas/event_parser_lambda/ Handler: event_parser.lambda_handler Timeout: 120 Events: EventParserTriggeringQueueEvent: Type: SQS Properties: Queue: !GetAtt EventParserTriggeringQueue.Arn BatchSize: 1 ScalingConfig: MaximumConcurrency: 2 Policies: Statement: - Action: - ssm:GetParametersByPath - ssm:GetParameters - ssm:GetParameter Effect: Allow Resource: - Fn::Sub: "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/datadog/api_key" - Fn::Sub: "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/sentry/dsn" - Fn::Sub: "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/${AWS::StackName}/*" - Action: - sqs:DeleteMessage - sqs:GetQueueAttributes - sqs:ReceiveMessage Effect: Allow Resource: !GetAtt EventParserTriggeringQueue.Arn EventParserDeadLetters: Type: AWS::SQS::Queue Properties: MessageRetentionPeriod: 1209600 # 14 Days (max) FifoQueue: true ContentBasedDeduplication: true Tags: - Key: "datadog" Value: "true" - Key: "deadletter" Value: "true"
What I'm looking for is retry behavior that looks like:
- If a lambda fails, it gets to retry immediately
- If a lambda fails more than the maximum allowed failure count, its message goes on a dead-letter queue immediately and the next message can be tried immediately.
Instead, the behavior we're seeing is:
- If a lambda fails, it is retried only after the visibility timeout period. This period is necessarily longer than the lambda's typical runtime, so a lot of delay is imposed here.
- If a lambda fails more than the maximum allowed failure count, the message only goes on a dead-letter queue after the visibility timeout period.
First, let me check my understanding of how the system works, because it's not really documented in any one place:
- For an SQS-driven lambda, the lambda runtime calls
ReceiveMessage
on the SQS queue periodically. From our system, it looks like the default is once every 10 seconds. - If there's a message available, the queue returns it.
- When the queue returns a message, it starts the clock on the visibility timeout.
- Until the visibility timeout has elapsed,
ReceiveMessage
calls to the queue (for the same message group ID) come back empty. (This is a Fifo SQS feature. For non-FIFO queues, only the received messages are hidden.) - When the visibility timeout has elapsed, if the head message has been received at least the queue's
maxReceiveCount
, the queue gives up on the message, optionally placing it on a dead-letter queue.
- Until the visibility timeout has elapsed,
- The lambda runtime passes the message along to the lambda function.
- If the function succeeds, the runtime calls
DeleteMessage
on the queue. This removes the head message, and also makes the next message available (i.e. it clears the visibility timeout). - If the message fails, the runtime carries on as though nothing has happened:
- It polls the queue periodically, meaning it gets empty responses to
ReceiveMessage
until the visibility timeout has elapsed - Once the visibility timeout is passed, the queue returns the same message again. Or, if the message has been received at least its "max receive count," the queue will return the next message.
- It polls the queue periodically, meaning it gets empty responses to
One solution I have considered:
Basically, put the lambda in charge:
- Put retry logic in a loop in the lambda
- If the lambda gets through its loop without a success, have it explicitly enqueue the message to an SQS queue that we'll use for dead letters. This queue wouldn't be configured as a DLQ, only we'd use it that way.
- The lambda always returns successfully, so the lambda runtime always deletes the message from the Fifo input queue.
Is this the best I can do?
One serious issue with this approach is, lambda functions can't run longer than 15 minutes and I do worry that retrying 5 times could put us at risk.