
Fifo-SQS lambda triggering failure handling


Our system uses FIFO SQS queues to drive Lambda functions. Here is the relevant part of our SAM template:

  EventParserTriggeringQueue:
    Type: AWS::SQS::Queue
    Properties:
      MessageRetentionPeriod: 1209600  # 14 Days (max)
      FifoQueue: true
      ContentBasedDeduplication: true
      VisibilityTimeout: 240  # Must be > EventParser Timeout
      Tags:
        - Key: "datadog"
          Value: "true"
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt EventParserDeadLetters.Arn
        maxReceiveCount: 1
  EventParser:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: lambdas/event_parser_lambda/
      Handler: event_parser.lambda_handler
      Timeout: 120
      Events:
        EventParserTriggeringQueueEvent:
          Type: SQS
          Properties:
            Queue: !GetAtt EventParserTriggeringQueue.Arn
            BatchSize: 1
            ScalingConfig:
              MaximumConcurrency: 2
      Policies:
        Statement:
          - Action:
              - ssm:GetParametersByPath
              - ssm:GetParameters
              - ssm:GetParameter
            Effect: Allow
            Resource:
              - Fn::Sub: "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/datadog/api_key"
              - Fn::Sub: "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/sentry/dsn"
              - Fn::Sub: "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/${AWS::StackName}/*"
          - Action:
              - sqs:DeleteMessage
              - sqs:GetQueueAttributes
              - sqs:ReceiveMessage
            Effect: Allow
            Resource: !GetAtt EventParserTriggeringQueue.Arn
  EventParserDeadLetters:
    Type: AWS::SQS::Queue
    Properties:
      MessageRetentionPeriod: 1209600  # 14 Days (max)
      FifoQueue: true
      ContentBasedDeduplication: true
      Tags:
        - Key: "datadog"
          Value: "true"
        - Key: "deadletter"
          Value: "true"

What I'm looking for is retry behavior that looks like:
  • If a lambda fails, it gets to retry immediately
  • If a lambda fails more than the maximum allowed failure count, its message goes on a dead-letter queue immediately and the next message can be tried immediately.
Instead, the behavior we're seeing is:
  • If a lambda fails, it is retried only after the visibility timeout period. This period is necessarily longer than the lambda's typical runtime, so a lot of delay is imposed here.
  • If a lambda fails more than the maximum allowed failure count, the message only goes on a dead-letter queue after the visibility timeout period.
First, let me check my understanding of how the system works, because it's not really documented in any one place:
  • For an SQS-driven lambda, the lambda runtime calls ReceiveMessage on the SQS queue periodically. From our system, it looks like the default is once every 10 seconds.
  • If there's a message available, the queue returns it.
  • When the queue returns a message, it starts the clock on the visibility timeout.
    • Until the visibility timeout has elapsed, ReceiveMessage calls to the queue (for the same message group ID) come back empty. (This is a FIFO SQS feature. For non-FIFO queues, only the received messages are hidden.)
    • When the visibility timeout has elapsed, if the head message has been received at least maxReceiveCount times, the queue gives up on the message, optionally placing it on a dead-letter queue.
  • The lambda runtime passes the message along to the lambda function.
  • If the function succeeds, the runtime calls DeleteMessage on the queue. This removes the head message, and also makes the next message available (i.e. it clears the visibility timeout).
  • If the message fails, the runtime carries on as though nothing has happened:
    • It polls the queue periodically, meaning it gets empty responses to ReceiveMessage until the visibility timeout has elapsed
    • Once the visibility timeout has passed, the queue returns the same message again. Or, if the message has already been received at least its "max receive count" times, the queue returns the next message.
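
To make the model above concrete, here is a toy, in-memory simulation of one message group. It is not SQS and makes no claim about how the service is implemented; it only encodes the steps as I have listed them. ToyFifoGroup and its methods are made-up names, and time is passed in by hand as seconds.

```python
from collections import deque
from typing import Optional


class ToyFifoGroup:
    """Toy model of one FIFO message group, following the list above."""

    def __init__(self, messages, visibility_timeout, max_receive_count):
        self.pending = deque(messages)
        self.dead_letters = []
        self.visibility_timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.receive_count = 0   # receives of the current head message
        self.hidden_until = 0.0  # the group is blocked until this time

    def receive(self, now: float) -> Optional[str]:
        if now < self.hidden_until:
            return None  # head is in flight, so the whole group comes back empty
        # The head is visible again: give up on it if it has used up its receives.
        if self.pending and self.receive_count >= self.max_receive_count:
            self.dead_letters.append(self.pending.popleft())
            self.receive_count = 0
        if not self.pending:
            return None
        self.receive_count += 1
        self.hidden_until = now + self.visibility_timeout
        return self.pending[0]

    def delete(self) -> None:
        """Successful processing: drop the head and unblock the group."""
        self.pending.popleft()
        self.receive_count = 0
        self.hidden_until = 0.0


# Values from the template: VisibilityTimeout 240, maxReceiveCount 1.
group = ToyFifoGroup(["a", "b"], visibility_timeout=240, max_receive_count=1)
first = group.receive(0)     # "a"; assume the function fails, so nothing is deleted
blocked = group.receive(10)  # None; the group stays blocked until t = 240
second = group.receive(240)  # "b"; "a" has moved to group.dead_letters
```

In this model a single failure of "a" keeps "b" waiting for the full 240 seconds, which matches the delay we observe.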
One solution I have considered:

Basically, put the lambda in charge:

  • Put retry logic in a loop in the lambda
  • If the lambda gets through its loop without a success, have it explicitly enqueue the message to an SQS queue that we'll use for dead letters. This queue wouldn't be configured as a DLQ; we would just use it as one.
  • The lambda always returns successfully, so the lambda runtime always deletes the message from the FIFO input queue.
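
A rough sketch of what I mean, in Python. The names handle_record, make_sqs_sender, and MAX_ATTEMPTS are made up for illustration, and the "dead-letters" message group ID is an arbitrary placeholder. The boto3 sender assumes the target queue is FIFO with content-based deduplication enabled (otherwise a MessageDeduplicationId would also be needed), and the function's role would need sqs:SendMessage on that queue, which the template above does not grant.

```python
import json
from typing import Any, Callable, Dict

MAX_ATTEMPTS = 5  # assumption: the number of in-function retries we want


def handle_record(
    record: Dict[str, Any],
    process: Callable[[str], None],
    send_to_dead_letters: Callable[[str, str], None],
    max_attempts: int = MAX_ATTEMPTS,
) -> bool:
    """Try `process` up to max_attempts times; park the body if all attempts fail.

    Returns True on success and False if the message was parked. Processing
    errors are never re-raised, so the caller can return normally and the
    message is deleted from the input queue.
    """
    body = record["body"]  # SQS event records carry the payload under "body"
    last_error = None
    for _ in range(max_attempts):
        try:
            process(body)
            return True
        except Exception as exc:  # broad on purpose: any failure uses up an attempt
            last_error = exc
    send_to_dead_letters(body, repr(last_error))
    return False


def make_sqs_sender(queue_url: str) -> Callable[[str, str], None]:
    """Build a sender backed by boto3 (imported lazily; needs AWS credentials)."""
    import boto3

    sqs = boto3.client("sqs")

    def send(body: str, error: str) -> None:
        sqs.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps({"body": body, "error": error}),
            MessageGroupId="dead-letters",  # required for FIFO queues; placeholder value
        )

    return send
```

The handler would loop over event["Records"] and call handle_record for each one. Note that if the send to the dead-letter queue itself raises, the invocation still fails and we are back to waiting out the visibility timeout.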
Is this the best I can do?

One serious issue with this approach is that Lambda functions can't run longer than 15 minutes, and I worry that retrying 5 times could put us at risk of hitting that limit.
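
One way I could imagine guarding against that is to check the remaining invocation time before each attempt. The sketch below is only an idea under stated assumptions: retry_within_budget and reserve_millis are made-up names, and the 10-second reserve is an arbitrary figure. In a Python Lambda, the context object's get_remaining_time_in_millis method could be passed as remaining_millis.

```python
from typing import Callable


def retry_within_budget(
    attempt: Callable[[], None],
    remaining_millis: Callable[[], int],
    max_attempts: int = 5,
    reserve_millis: int = 10_000,  # arbitrary: time kept back for parking the message
) -> bool:
    """Retry until success, until attempts run out, or until time gets tight.

    Returns True on success, False if the caller should park the message.
    """
    for _ in range(max_attempts):
        if remaining_millis() <= reserve_millis:
            return False  # not enough time left to risk another attempt
        try:
            attempt()
            return True
        except Exception:
            continue  # failed attempt; loop around and re-check the time budget
    return False
```

Inside the handler this might be called as retry_within_budget(lambda: parse(body), context.get_remaining_time_in_millis), where parse is a stand-in for the real work. It does not bound a single attempt that hangs, so the reserve would have to be larger than the longest attempt we expect.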

