Serverless SQS consumer skips messages

I am using the Serverless Framework to consume messages from SQS. Some of the messages sent to the queue do not get consumed: they go straight to the in-flight SQS status and from there to my dead letter queue. When I look at the consumer's log, I can see that it consumed and successfully processed 9 out of 10 messages; one is always not consumed and ends up in the dead letter queue. I am setting reservedConcurrency to 1 so that only one consumer can run at a time. The consumer function timeout is set to 30 seconds. This is the consumer code:
module.exports.mySQSConsumer = async (event, context) => {
  context.callbackWaitsForEmptyEventLoop = false;
  console.log(event.Records);
  await new Promise((res, rej) => {
    setTimeout(() => {
      res();
    }, 100);
  });
  console.log('DONE');
  return true;
}
The consumer function configuration follows:
functions:
  mySQSConsumer:
    handler: handler.mySQSConsumer
    timeout: 30 # seconds
    reservedConcurrency: 1
    events:
      - sqs:
          arn: arn:aws:sqs:us-east-1:xyz:my-test-queue
          batchSize: 1
          enabled: true
If I remove the await, it will process all messages. If I increase the setTimeout delay to 200 ms, even more messages go straight to the in-flight status and from there to the dead letter queue. This code is very simple. Any ideas why it's skipping some messages? The messages that don't get consumed don't even show up in the log via the first console.log() statement. They seem entirely ignored.

I figured out the problem. The SQS-to-Lambda event triggering works differently than I thought: the messages get pushed into the Lambda function, not pulled by it. I think this could be engineered better by AWS, but it is what it is.
The issue was the Default Visibility Timeout set to 30 seconds together with Reserved Concurrency set to 1. When the SQS queue fills up quickly with thousands of records, AWS starts pushing messages to the Lambda function at a rate faster than the single function instance can process them. AWS "assumes" it can simply spin up more instances of the Lambda to keep up with the backpressure, but the concurrency limit doesn't let it spin up more instances, so the Lambda function is throttled. As a result, the function starts returning failure to the AWS backend for some messages, which consequently hides the failed messages for 30 seconds (the default setting) and puts them back into the queue after that period for reprocessing. Since there are so many records for the single instance to process, 30 seconds later the Lambda function is still busy and can't process those messages again, so the situation repeats itself and the messages go back into invisibility for another 30 seconds. This repeats a total of three times; after the third attempt, the messages go to the dead letter queue (we configured our SQS queue that way).
To resolve this issue, we increased the Default Visibility Timeout to 5 minutes. That's enough time for the Lambda function to process through most of the messages in the queue while the failed ones wait in invisibility. After 5 minutes, they get pushed back into the queue and since the Lambda function is no longer busy, it will process most of them. Some of them have to go to invisibility twice before being successfully processed.
So the remedy to this problem is either increasing the Default Visibility Timeout, like we did, or increasing the number of receive attempts allowed (the redrive policy's maxReceiveCount) before a message goes to the dead letter queue.
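For reference, this is roughly what those two knobs look like if the queue is declared as a CloudFormation resource inside serverless.yml (the DLQ ARN and the maxReceiveCount value here are illustrative, not our exact settings):
resources:
  Resources:
    MyTestQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: my-test-queue
        VisibilityTimeout: 300 # 5 minutes, well above the 30s function timeout
        RedrivePolicy:
          deadLetterTargetArn: arn:aws:sqs:us-east-1:xyz:my-test-queue-dlq
          maxReceiveCount: 5 # receive attempts allowed before dead-lettering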
I hope this helps someone.

Related

Unable to trigger Signal to stop block in Project Reactor

I want to try blockOptional in Project Reactor. According to the description, it is used to
"Subscribe to this Mono and block indefinitely until a next signal is received or the Mono completes empty"
I tried the following:
Mono<String> triggerSignal() {
    // intended to emit a next signal immediately
    return Mono.just(Signal.next("signal")).dematerialize();
}
triggerSignal()
    .delayElement(Duration.ofSeconds(30))
    .blockOptional();
System.out.println("Outside chain");
I expect to see "Outside chain" printed before the 30s delay; however, I have to wait 30s.
Isn't a signal received immediately, and shouldn't the block stop once a signal is received? Is this the correct way to send a signal?
triggerSignal() does indeed produce a signal immediately.
However, the next item in your chain is a delayElement() call which will always delay the element emitted by 30s (before it reaches the next operator in the reactive chain); hence blockOptional() will never "see" the signal before 30s has elapsed.
It sounds like you want to, instead, block for up to 30s while you wait for the signal. If that's the case, then you can pass a duration to blockOptional() instead of delaying the element, such as:
triggerSignal()
    .blockOptional(Duration.ofSeconds(30));
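For completeness, a minimal self-contained sketch of that difference (the class name and values are illustrative):
import java.time.Duration;
import java.util.Optional;
import reactor.core.publisher.Mono;

public class BlockDemo {
    public static void main(String[] args) {
        // Emits immediately, so blockOptional returns right away.
        Optional<String> value = Mono.just("signal")
                .blockOptional(Duration.ofSeconds(30));
        System.out.println(value); // Optional[signal], printed without a 30s wait

        // A Mono that never emits would instead throw an IllegalStateException
        // once the 30s timeout expires:
        // Mono.never().blockOptional(Duration.ofSeconds(30));
    }
}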

Azure service bus queue trigger function not picking up my queued message after running for a while

I have a service bus queue in Azure and a service bus queue trigger function. When I first publish the function and send a message to the service bus queue the function gets triggered and runs ok.
But if I leave it alone and don't send any messages to the queue for, say, ~1 hour and then send a message, the function doesn't get triggered. I have to manually run the function again in the portal by pressing 'run', or re-publish it to Azure.
How do I keep it running so I don't have to restart it every hour or so?
My app might not send a new message to the queue for a couple of hours or even days.
FYI - I read here that the function shuts down after 5 minutes. I can't use Functions if this is the case, and I don't want to use a timer trigger because then I'd be running the function more often than I'd want, wasting money. Right?
If my only/best choice here is to run a timer function every 30 minutes throughout the day, I might just have to suck it up and do it this way, but I'd prefer to have the function run when a message hits the queue.
FYI2 - ultimately I want to hide the message until a certain date when I first push it to the queue (e.g. set it to show 1 week from the time it's placed in the queue). What am I trying to accomplish? I want to send an email out to registered students after a class has ended, and the class might be scheduled 1-30 days in advance. So when the class is scheduled, I don't want the function to run until after the class ends, which might be 1 week, 2 weeks 2 days, 3 weeks 3 days, etc. after the class is initially scheduled in my app.
Here is the snippet from the function.json
{
  "generatedBy": "Microsoft.NET.Sdk.Functions-1.0.0.0",
  "configurationSource": "attributes",
  "bindings": [
    {
      "type": "serviceBusTrigger",
      "connection": "Yogabandy2017_RootManageSharedAccessKey_SERVICEBUS",
      "queueName": "yogabandy2017",
      "accessRights": "listen",
      "name": "myQueueItem"
    }
  ],
  "disabled": false,
  "scriptFile": "..\\bin\\PostEventEmailFunction.dll",
  "entryPoint": "PostEventEmailFunction.Function1.Run"
}
Here is the function
public static class Function1
{
    [FunctionName("Function1")]
    public static void Run([ServiceBusTrigger("yogabandy2017", AccessRights.Listen, Connection = "Yogabandy2017_RootManageSharedAccessKey_SERVICEBUS")]string myQueueItem, TraceWriter log)
    {
        log.Info($"C# ServiceBus queue trigger function processed message: {myQueueItem}");
    }
}
I've been asked to post my payment plan
I suspect you're using scenario 1 or 2 below - both will shut down after a period of no activity.
1. The Function App is on a Free/Shared App Service plan. Always On is not available there, so you'll have to upgrade your plan.
2. The Function App is on a Basic/Standard/Premium App Service plan. You'll need to ensure Always On is checked in the function's Application Settings.
3. If the Function App is using a Consumption plan, this should work: the platform wakes the function app on each request. For your scenario I would recommend this approach, as it only utilises resources when required.
FYI1 - if you get the above working, you won't need the timer.
Please check your service plan via:
1. Platform features
2. App Service plan
3. under Pricing Tier
Please provide your app name.
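Regarding FYI2: Service Bus supports scheduled delivery natively, so you don't need a timer for that part. A rough sketch using the same SDK generation as in the question (the connection string variable and the payload are assumptions):
// Schedule a message to become visible one week from now; the queue
// trigger fires only when the message is actually delivered.
var client = QueueClient.CreateFromConnectionString(connectionString, "yogabandy2017");
var message = new BrokeredMessage("post-event email payload")
{
    ScheduledEnqueueTimeUtc = DateTime.UtcNow.AddDays(7)
};
client.Send(message);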

Azure Worker Role reading from Queue

I've implemented a Web role that writes to a queue. This is working fine. Then I developed a Worker role to read from the queue. When I run it in debug mode from my local machine, it reads the messages from the queue fine, but when I deploy the Worker role it doesn't seem to be reading the queue, as the messages eventually end up in the dead letter queue. Anyone know what could be causing this behavior? Below are some bits that might be key to figuring this out:
queueClient = QueueClient.Create(queueName, ReceiveMode.PeekLock);

var queueDescription = new QueueDescription(QueueName)
{
    RequiresSession = false,
    DefaultMessageTimeToLive = TimeSpan.FromMinutes(2),
    EnableDeadLetteringOnMessageExpiration = true,
    MaxDeliveryCount = 20
};
Increase QueueDescription.DefaultMessageTimeToLive to ~10 minutes.
This property dictates how long a message may live in the queue before it is processed (i.e., before Message.Complete() is called on it). If it remains in the queue for more than 2 minutes, it is automatically moved to the dead letter queue (since you set EnableDeadLetteringOnMessageExpiration to true).
TTL is useful in messaging scenarios such as these:
- if a message has not been processed N minutes after it arrived, it might not be useful to process it any more
- if the message was attempted many times and never completed (the receiver never called msg.Complete()), it might need special processing
So, to be safe, use a somewhat higher value for the default message TTL.
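For example, the QueueDescription from the question with a roomier TTL might look like this (a sketch; 10 minutes is just a starting point):
var queueDescription = new QueueDescription(QueueName)
{
    RequiresSession = false,
    // Long enough for the worker to receive and Complete() the message.
    DefaultMessageTimeToLive = TimeSpan.FromMinutes(10),
    EnableDeadLetteringOnMessageExpiration = true,
    MaxDeliveryCount = 20
};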
Hope it Helps!
Sree

Celery task that communicates with Twitter

What is the right approach when writing Celery tasks that communicate with a service that has rate limits and is sometimes unavailable (not responding) for long periods of time?
Do I have to use task retries? What if the service is missing for too long? Is there a way to store these tasks for later execution after a long time period?
What if this is a subtask in a long task?
First, I suggest you set a socket timeout to avoid waiting endlessly for a response. Then you can catch the socket timeout exception and, in this particular case, retry with a large delay, like 15 minutes.
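A minimal sketch of that idea (the task body and names are illustrative, not from the question):
import socket
from celery import shared_task

socket.setdefaulttimeout(10)  # don't wait forever for the service to answer

@shared_task(bind=True, max_retries=50)
def call_service(self):
    try:
        pass  # the actual API call goes here
    except socket.timeout as exc:
        # Service unreachable: retry after a long pause (15 minutes).
        raise self.retry(exc=exc, countdown=15 * 60)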
Normally, though, I use an incremental retry with a percentage increase: the delay grows on every retry, which is useful for tasks that depend on external services that can be unavailable for a long time.
You can set a high number of retries on the task, like 50, and set the standard retry delay via:
# 20 seconds
self.default_retry_delay = 20
Then you can implement a method like this for your task:
def incrementalRetry(self, exc, perc=20, args=None):
    """By default the retry delay is increased by 20 percent."""
    if args:
        self.request.args = args
    delay = self.default_retry_delay
    if 'retry_delay' in self.request.kwargs:
        delay = self.request.kwargs['retry_delay']
    # Grow the delay by the given percentage on every retry.
    retry_delay = delay + round((delay * perc) / 100, 2)
    # Remember the increased delay so the next retry keeps growing it.
    self.request.kwargs.update({'retry_delay': retry_delay})
    self.retry(self.request.args, self.request.kwargs,
               exc=exc, countdown=retry_delay,
               max_retries=self.max_retries)
What if this is a subtask in a long task?
If you don't need the result, you can launch it asynchronously by using task.delay(args=[]).
A nice feature is also the task group, which lets you launch different tasks in parallel; after all of them have finished, you can do something else in your workflow.
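A quick sketch of the group pattern (the task and variable names are illustrative):
from celery import group

# Launch several tasks in parallel and collect all results at the end.
job = group(fetch_page.s(url) for url in urls)
result = job.apply_async()
print(result.get())  # blocks until every task in the group has finished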

QTP Recovery scenario used to "skip" consecutive FAILED steps with 0 timeout -- how can I restore original timeout value?

Suppose I use QTP's recovery scenario manager to set the playback synchronization timeout to 0, with the handler returning "continue with next statement".
I'd do that to make sure that any following playback statements don't waste their time waiting for the next non-existing/non-matching step before failing:
I have a lot of GUI tests that kind of get stuck because, let's say, if 10 controls are missing, their (consecutive) playback steps produce 10 timeout waits before failing. If the playback timeout is 30 seconds, I lose 10 x 30 seconds = 5 minutes of execution time, while it would really be sufficient to wait for 30 seconds ONCE (because the app does not change anymore -- we already waited a full timeout period).
Now if I have 100 test cases (= action iterations), this possibly happens 100 times, wasting 500 minutes of my test execution window.
That's why I came up with the idea of a recovery scenario function that sets the timeout to 0 after/upon the first failed playback step. This would accelerate execution while skipping the rightly-FAILED step, yet would not compromise the precision/reliability of identifying the next matching GUI context (which creates a PASSED step).
Then, of course, upon the next passed playback step, I would want to restore the original timeout value. How could I do that? This is my question.
One cannot define a recovery scenario function that is called for PASSED steps.
I am currently thinking about setting a method function for Reporter.ReportEvent and "sniffing" for PASSED log entries there. I'd install that method function in the recovery scenario function which sets the timeout to 0. Then, when the "sniffer" function senses a ReportEvent call with PASSED status during one of the following playback steps, I'd reset everything (i.e. restore the original timeout and uninstall the method function). (I am 99% sure, however, that .Click and .Set methods do not call ReportEvent to write their result status, so this option probably won't work.)
Better ideas? This really bugs me.
It sounds to me like your tests aren't designed correctly; if you fail to find an object, why do you continue?
One possible (non-recovery-scenario) solution would be to use RegisterUserFunc to override the methods you are using, in order to do an obj.Exist(0) check before running the required method.
Function MyClick(obj)
    If obj.Exist(1) Then
        obj.Click
    Else
        Reporter.ReportEvent micFail, "Click failed, no object", "Object does not exist"
    End If
End Function

RegisterUserFunc "Link", "Click", "MyClick"
RegisterUserFunc "WebButton", "Click", "MyClick"
''# etc
If you have many controls, some of which may be missing, and you know that after the 10 seconds you mentioned (when the first timeout occurs) nothing more will show up, then you can use the Exist method with a timeout parameter.
Something like this:
timeout = 10
For Each control In controls
    If control.Exist(timeout) Then
        ' do something with the control
    Else
        timeout = 0
    End If
Next
Now only the first wait will take up to 10 seconds. Every subsequent check in your collection of controls runs with the timeout set to 0, which saves your time.
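If you also want the save-and-restore behaviour asked about in the question, the playback synchronization timeout is exposed through the Setting object, so you can stash the original value and put it back afterwards (a sketch of the idea):
originalTimeout = Setting("DefaultTimeout") ' remember the configured value (in ms)
Setting("DefaultTimeout") = 0               ' fail fast while the app is known to be stuck
' ... run the steps that may rightly fail ...
Setting("DefaultTimeout") = originalTimeout ' restore for normal playback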
