Rebus with SQS - a failed message is handled by 2 consumers more times than the configured number of retries

I have a failing message in Rebus with the SQS transport. I have 2 Rebus consumers, each configured with 5 retries, and I can see that a failed message is handled anywhere from 5 up to 10 times. At first I thought it was a bug, but later realized it's probably expected behaviour. My understanding is that when a message fails for the first time in Rebus consumer 1, it's made visible in the queue again, and potentially received by Rebus consumer 2. Each consumer counts the number of retries independently, so the message can be handled up to 10 times in total. Am I correct in my understanding that this is expected behaviour?

Rebus tracks the caught exceptions in memory, so you are absolutely correct 😊
I've found no better way of tracking exceptions between delivery attempts.
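For reference, the SQS transport itself exposes a per-message ApproximateReceiveCount attribute that counts deliveries across all consumers, which is a different thing from Rebus's in-memory, per-consumer error tracking. A minimal boto3 sketch of reading it, with a hypothetical queue name and process() handler, just to illustrate the transport-level counter:

import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="work")["QueueUrl"]  # hypothetical queue name

resp = sqs.receive_message(
    QueueUrl=queue_url,
    AttributeNames=["ApproximateReceiveCount"],  # delivery count across all consumers
    MaxNumberOfMessages=1,
    WaitTimeSeconds=20,
)
for msg in resp.get("Messages", []):
    receives = int(msg["Attributes"]["ApproximateReceiveCount"])
    if receives > 5:
        # Give up after 5 deliveries regardless of which consumer saw them,
        # e.g. park the message somewhere and remove it from the work queue.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    else:
        process(msg["Body"])  # hypothetical handler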

Related

Getting error 504, Gateway Timeout, while fetching messages from the MS Graph API

For the past week or so, we've been experiencing 504, Gateway Timeout errors while fetching email messages from the MS Graph API. Prior to that, for over a month of running, the same application did not experience that error, at least not with any significant frequency.
We are using V1.0 of the MS Graph API
Our query is fairly simple:
$top=100&$orderBy=lastModifiedDateTime desc&$filter=lastModifiedDateTime lt 2019-09-09T19:27:55Z and parentFolderId ne 'JunkEmail'
We get the timeout for users who have large volumes of data (> 100K email messages), but occasionally get it for users with smaller volumes (around 18K email messages). Volume has not changed much between the time when the system was working and now, when we see many timeouts.
We've tried simplifying the query, reducing the number of messages we request etc., but that seems to have only limited and intermittent impact.
My question - What can we do to eliminate/significantly reduce the possibility of getting the 504, Gateway Timeout error from the MS Graph API?
I suspect that since we are asking for messages without a folder filter, it may be possible that we are stressing the query engine. It's just a hunch, and if anyone has real insight into the MS Graph API, I'd love to know if that may be the case. Also, any information that helps us better understand what is going on under the hood would be much appreciated.
Update 1 (2019-09-13 15:44:00 EST) - Here is a visualization of a set of fetch requests made by the app over a roughly 12 hour period. The pink bars are the number of successful fetches, and the light blue ones are the failed requests (all having 504, Gateway Timeout as the failure code). As you can see, when the app starts it has a number of failures, which eventually reduce and go away. Then from around 4:30AM to 9:30AM, there are a number of failures, which eventually subside. Almost all failures happen while fetching messages for one user, who has a very large mailbox (> 220K messages). I realize this is a small data set, and am happy to generate one that runs for a longer period of time if that helps. Also, the app in question is running on our Azure tenant, as part of an Azure Function app, in the "East US" location.
Update 2 (16th Sept 2019, 09:32:00 EST) - We ran the system for the last 3 days and here is a visualization of the fetch requests made by the app during that time. The blue bars are successful fetches, and the pink bars are failed fetches (all having 504, Gateway Timeout as the failure code). The summary is that, except for a small window from 11PM to 2AM on the first night, no request succeeded for this one particular user with a large mailbox. In effect, that means that in spite of retry logic etc., we are unable to process that user's data.
Microsoft Graph can be slow at times and will throttle occasionally.
I'd advise you to let the Graph SDK do the hard work to save you from writing code to handle all this yourself.
Use the Microsoft Graph client library version 1.17.0+, as it introduced automatic retry on 504 errors. It also handles throttling (HTTP 429) when it occurs.
The point I am trying to make is that you can retry when you get a 504 or 429 yourself, or delegate such responsibilities to an SDK.
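If you do roll the retry yourself, here is a minimal Python sketch (using the requests library rather than the Graph SDK) that retries on 504 and 429 and honours the Retry-After header when Graph returns one. The user id, access token and query values are placeholders taken from the question:

import time
import requests

GRAPH_MESSAGES_URL = "https://graph.microsoft.com/v1.0/users/{user_id}/messages"

def fetch_messages_page(session, user_id, params, max_attempts=5):
    # Retry transient failures (504 Gateway Timeout, 429 throttling) with backoff.
    url = GRAPH_MESSAGES_URL.format(user_id=user_id)
    for attempt in range(1, max_attempts + 1):
        resp = session.get(url, params=params, timeout=120)
        if resp.status_code in (429, 503, 504):
            # Honour Retry-After if present, otherwise back off exponentially.
            delay = int(resp.headers.get("Retry-After", 2 ** attempt))
            time.sleep(delay)
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("giving up after %d attempts" % max_attempts)

session = requests.Session()
session.headers["Authorization"] = "Bearer " + access_token  # token acquisition not shown
page = fetch_messages_page(session, user_id, {
    "$top": "100",
    "$orderBy": "lastModifiedDateTime desc",
    "$filter": "lastModifiedDateTime lt 2019-09-09T19:27:55Z and parentFolderId ne 'JunkEmail'",
})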
Good to hear that the retry is helping. I've got a couple of options to try:
1) Change your query and move the ordering responsibilities to the client. $orderBy=lastModifiedDateTime desc and the filter require indices to be created, which increases the load on the mailbox. Doing client-side ordering may be better for these large mailboxes.
2) Use delta query (with your filter) to sync and get incremental changes. You will have to add a folder hierarchy sync. You may be able to make parallel calls. I suspect that this will give you much better performance after the initial sync.
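If you go the delta-query route, the sync loop is driven by the @odata.nextLink and @odata.deltaLink values Graph returns. A rough sketch, assuming an already-authenticated requests session and the standard messages delta endpoint (the folder id is a placeholder):

import requests

def delta_sync(session, folder_id, delta_link=None):
    # First run: start from the delta endpoint. Later runs: resume from the
    # stored @odata.deltaLink to fetch only what changed since last time.
    url = delta_link or (
        "https://graph.microsoft.com/v1.0/me/mailFolders/%s/messages/delta" % folder_id
    )
    changes = []
    while url:
        resp = session.get(url, timeout=120)
        resp.raise_for_status()
        page = resp.json()
        changes.extend(page.get("value", []))
        url = page.get("@odata.nextLink")              # more pages in this round
        if url is None:
            delta_link = page.get("@odata.deltaLink")  # save for the next round
    return changes, delta_link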
I encountered the same issue: a 504 error while trying to get all messages. After a thorough inspection I figured out that in our case the problem was draft items; in some cases they were throwing errors. After adding the filter "isDraft eq false", the 504s stopped and we're getting all messages. It turns out that some drafts are broken: they won't show up in OWA or Outlook, and in our case the one that was messing with the query was stored under a parentFolderId that was non-existent, which is a huge problem in and of itself in my opinion.

Grails RabbitMQ losing messages when many messages at once

I have a Grails application using the grails RabbitMQ plugin to handle messages asynchronously. The queue is set to durable, the messages are persistent, and there are 20 concurrent consumers. Acknowledgement is turned on and is set to issue an ack/nack based on if the consumer returns normally or throws an exception. These consumers usually handle messages fine, but when the queue fills up very quickly (5,000 or so messages at once) some of the messages get lost.
The consumer logs when a message from Rabbit is received, and that logging event never occurs for the lost messages, so the consumer is not receiving them at all. Further, no exceptions appear in the logs.
I have tried increasing the prefetch value of the consumers to 5 (from 1), but that did not solve the problem. I have checked the RabbitMQ UI and there are no messages stuck in the queue and there are no unacknowledged messages.
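For comparison, a minimal pika (Python) sketch of the setup described (durable queue, manual ack/nack, prefetch of 5) can help reproduce the load outside the Grails plugin and tell a broker problem apart from a plugin problem. The queue name and the process() handler here are made up:

import pika

def handle(ch, method, properties, body):
    try:
        process(body)  # hypothetical business logic
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # Requeue on failure rather than silently dropping the message.
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="work", durable=True)
channel.basic_qos(prefetch_count=5)  # same prefetch as the experiment described above
channel.basic_consume(queue="work", on_message_callback=handle)
channel.start_consuming()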

Why is ApproximateAgeOfOldestMessage in SQS not bigger than approx 5 mins

I am utilising Spring Cloud AWS Messaging (2.0.1.RELEASE) in Java to consume from an SQS queue. If it's relevant, we use default settings, Java 10 and Spring Cloud Finchley.SR2.
We recently had an issue where a message could not be processed due to an application bug, leading to an exception and no confirmation (deletion) of the message. The message is later retried (this is desirable), presumably after the visibility timeout has elapsed (again, default values are in use; we have not customised the settings here).
We didn't spot the error above for a few days, meaning the message receive count was very high and the message had conceptually been on the queue for a while (several days by now). We considered creating a CloudWatch SQS alarm to alert us to a similar situation in future. The only suitable metric appeared to be ApproximateAgeOfOldestMessage.
Sadly, when observing this metric, the max age doesn't go much above 5 minutes, despite me knowing the message was several days old. If a message gets older each time a receive happens, assuming no acknowledgment comes and the message isn't deleted but instead becomes available again after the visibility timeout has elapsed, shouldn't this metric be much, much higher?
I don't know if this is something specific to the way that Spring Cloud AWS Messaging consumes the message or whether it's a general SQS quirk, but my expectation was that if a message was put on the queue 5 days ago and a consumer had not successfully consumed it, then the max age would be 5 days.
Is it in fact the case that if a message is received by a consumer but not ultimately deleted, the max age is actually the length of time between receive calls?
Can anyone confirm whether my expectation is incorrect, i.e. that this is indeed how SQS is expected to behave (it doesn't consider the age to be the duration since the message was first put on the queue, but instead the time between receive calls)?
Based on a similar question on AWS forums, this is apparently a bug with regular SQS queues where only a single message is affected.
In order to have a useful alarm for this issue, I would suggest setting up a dead-letter queue (to which messages are automatically moved after a configurable number of receives without a delete), and alarming on the size of the dead-letter queue (ApproximateNumberOfMessagesVisible), as sketched below.
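A minimal boto3 sketch of that setup: a dead-letter queue wired up via a redrive policy, plus a CloudWatch alarm on its ApproximateNumberOfMessagesVisible metric. Queue names, the receive-count threshold and the alarm period are hypothetical:

import json
import boto3

sqs = boto3.client("sqs")
cloudwatch = boto3.client("cloudwatch")

# Hypothetical queue names; adjust to your setup.
dlq_url = sqs.create_queue(QueueName="orders-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

main_url = sqs.create_queue(QueueName="orders")["QueueUrl"]
sqs.set_queue_attributes(
    QueueUrl=main_url,
    Attributes={
        # After 5 receives without a delete, SQS moves the message to the DLQ.
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)

# Alarm as soon as anything lands in the DLQ.
cloudwatch.put_metric_alarm(
    AlarmName="orders-dlq-not-empty",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "orders-dlq"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
)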
I think this might have to do with the poison pill handling by this metric. After 3+ tries, the message won't be included in the metric. From the AWS docs:
After a message is received three times (or more) and not processed, the message is moved to the back of the queue and the ApproximateAgeOfOldestMessage metric points at the second-oldest message that hasn't been received more than three times. This action occurs even if the queue has a redrive policy.

Deploying an SQS Consumer

I am looking to run a service that will be consuming messages that are placed into an SQS queue. What is the best way to structure the consumer application?
One thought would be to create a bunch of threads or processes that run this:
# VISIBILITY_TIMEOUT, MAX_WAIT_TIME_SECONDS, process, TransientError and
# log_exception are defined elsewhere in the application.
import socket

def run(q, delete_on_error=False):
    while True:
        try:
            # Long-poll the queue for a single message.
            m = q.read(VISIBILITY_TIMEOUT, wait_time_seconds=MAX_WAIT_TIME_SECONDS)
            if m is not None:
                try:
                    process(m.id, m.get_body())
                except TransientError:
                    # Leave the message in the queue; it becomes visible again
                    # once the visibility timeout expires.
                    continue
                except Exception as ex:
                    log_exception(ex)
                    if not delete_on_error:
                        continue
                # Processed (or deliberately discarding on error): remove it.
                q.delete_message(m)
        except StopIteration:
            break
        except socket.gaierror:
            continue
Am I missing anything else important? What other exceptions do I have to guard against in the queue read and delete calls? How do others run these consumers?
I did find this project, but it seems stalled and has some issues.
I am leaning toward separate processes rather than threads to avoid the GIL. Is there some container process that can be used to launch and monitor these separate running processes?
There are a few things:
The SQS API allows you to receive more than one message with a single API call (up to 10 messages, or up to 256k worth of messages, whichever limit is hit first). Taking advantage of this feature allows you to reduce costs, since you are charged per API call. It looks like you're using the boto library - have a look at get_messages.
In your code right now, if processing a message fails due to a transient error, the message won't be able to be processed again until the visibility timeout expires. You might want to consider returning the message to the queue straight away. You can do this by calling change_visibility with 0 on that message. The message will then be available for processing straight away. (It might seem that if you do this then the visibility timeout will be permanently changed on that message - this is actually not the case. The AWS docs state that "the visibility timeout for the message the next time it is received reverts to the original timeout value". See the docs for more information.)
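Putting both suggestions together, here is a rough sketch of the consumer loop using boto3 (the newer AWS SDK for Python) rather than the original boto: batch receive with long polling, an immediate visibility reset on transient errors, and a delete only after successful processing. The queue name is hypothetical, and process() and TransientError are assumed to be the question's own definitions:

import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="work")["QueueUrl"]  # hypothetical queue name

while True:
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # batch receive: fewer API calls, lower cost
        WaitTimeSeconds=20,      # long polling
    )
    for msg in resp.get("Messages", []):
        try:
            process(msg["Body"])      # hypothetical handler from the question
        except TransientError:        # hypothetical exception type from the question
            # Make the message visible again immediately instead of waiting
            # for the visibility timeout to expire.
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
                VisibilityTimeout=0,
            )
            continue
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])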
If you're after an example of a robust SQS message consumer, you might want to check out NServiceBus.AmazonSQS (of which I am the author). (C# - sorry, I couldn't find any python examples.)

How to guarantee that Amazon SQS will receive a message only once?

I'm using an Amazon SQS queue to send notifications to an external system.
If the HTTP request fails when using SQS's SendMessage, I don't know whether the message has been queued or not. My default policy would be to retry posting the message to the queue, but there's a risk of posting the message twice, which might not be acceptable depending on the use case.
Is there a way to have SQS refuse the message if there is a duplicate on the message body (or some kind of message metadata, such as a unique ID we could provide) so that we could retry until the message is accepted, and be confident that there won't be a duplicate if the first request had been already queued, but the response had been lost?
No, there's no such mechanism in SQS. Going further, it is also possible that a message will be delivered twice or more (at-least-once delivery semantics). So even if such a mechanism existed, you wouldn't be able to guarantee that the message isn't delivered multiple times.
See: http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/DistributedQueues.html
For exactly-once deliveries, you need some form of transactions (and HTTP isn't a transactional protocol) both on the sending and receiving end.
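To make the receiving end concrete, here is a minimal sketch of consumer-side idempotency: derive a deduplication key for each message, record processed keys, and skip anything already seen. The in-memory set and the process() call are placeholders; in practice the key would be a business identifier carried in the message and the store would be durable (a database table, DynamoDB, etc.):

import boto3

sqs = boto3.client("sqs")
processed_keys = set()  # placeholder for a durable store

def handle(msg, queue_url):
    # Use a business key from the message body/attributes, not MessageId,
    # so that two sends of the same logical message deduplicate correctly.
    dedup_key = extract_business_key(msg["Body"])  # hypothetical helper
    if dedup_key not in processed_keys:
        process(msg["Body"])                       # hypothetical business logic
        processed_keys.add(dedup_key)
    # Delete in either case; the work has already been done exactly once.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])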
AFAIK, right now SQS does support what was asked!
Please see the "What's new" post entitled Amazon SQS Introduces FIFO Queues with Exactly-Once Processing and Lower Prices for Standard Queues
According to the SQS FAQ:
FIFO queues provide exactly-once processing, which means that each message is delivered once and remains available until a consumer processes it and deletes it. Duplicates are not introduced into the queue.
There's also an AWS Blog post with a bit more insight on the subject:
These queues are designed to guarantee that messages are processed exactly once, in the order that they are sent, and without duplicates.
[...]
Exactly-once processing applies to both single-consumer and multiple-consumer scenarios. If you use FIFO queues in a multiple-consumer environment, you can configure your queue to make messages visible to other consumers only after the current message has been deleted or the visibility timeout expires. In this scenario, at most one consumer will actively process messages; the other consumers will be waiting until the first consumer finishes or fails.
Duplicate messages can sometimes occur when a networking issue outside of SQS prevents the message sender from learning the status of an action and causes the sender to retry the call. FIFO queues use multiple strategies to detect and eliminate duplicate messages. In addition to content-based deduplication, you can include a MessageDeduplicationId when you call SendMessage for a FIFO queue. The ID can be up to 128 characters long, and, if present, takes higher precedence than content-based deduplication.
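A minimal boto3 sketch of sending to a FIFO queue with an explicit MessageDeduplicationId (the queue name and payload are hypothetical); retrying the same call with the same deduplication ID within SQS's 5-minute deduplication window will not enqueue a second copy:

import boto3

sqs = boto3.client("sqs")

# Hypothetical FIFO queue; FIFO queue names must end in ".fifo".
queue_url = sqs.get_queue_url(QueueName="notifications.fifo")["QueueUrl"]

# If this call times out and is retried with the same MessageDeduplicationId
# within the deduplication window, SQS will not enqueue a duplicate.
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody='{"event": "order-created", "orderId": 42}',
    MessageGroupId="order-42",
    MessageDeduplicationId="order-created-42",
)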
