Why is ApproximateAgeOfOldestMessage in SQS not bigger than approx 5 mins - amazon-sqs

I am using Spring Cloud AWS Messaging (2.0.1.RELEASE) in Java to consume from an SQS queue. If it's relevant, we use default settings, Java 10, and Spring Cloud Finchley.SR2.
We recently had an issue where a message could not be processed due to an application bug, leading to an exception and no confirmation (deletion) of the message. The message is later retried (this is desirable), presumably after the visibility timeout has elapsed; again, default values are in use and we have not customised the settings here.
We didn't spot the error above for a few days, meaning the message receive count was very high and the message had conceptually been on the queue for a while (several days by now). We considered creating a CloudWatch SQS alarm to alert us to a similar situation in future. The only suitable metric appeared to be ApproximateAgeOfOldestMessage.
Sadly, when observing this metric I see this:
The max age doesn't go much above 5 minutes, despite my knowing the message was several days old. If a message keeps getting older each time it is received but never acknowledged (deleted), and instead becomes available again after the visibility timeout elapses, should this graph not be much, much higher?
I don't know if this is something specific to the way that Spring Cloud AWS Messaging consumes the message or whether it's a general SQS quirk, but my expectation was that if a message was put on the queue 5 days ago, and a consumer had not successfully consumed it, then the max age would be 5 days.
Is it in fact the case that if a message is received by a consumer, but not ultimately deleted that the max age is actually the length between consume calls?
Can anyone confirm whether my expectation is incorrect, i.e. whether this is indeed how SQS is expected to behave (it doesn't consider the age to be the duration since the message was first put on the queue, but instead the time between receive calls)?

Based on a similar question on AWS forums, this is apparently a bug with regular SQS queues where only a single message is affected.
In order to have a useful alarm for this issue, I would suggest setting up a dead-letter queue (messages are moved there automatically after a configurable number of receives without deletion) and alarming on the size of the dead-letter queue (ApproximateNumberOfMessagesVisible).
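A minimal sketch of that setup with boto3 (the AWS SDK for Python); the queue URL, ARNs, and `maxReceiveCount` of 5 below are placeholders, not values from the question:

```python
import json


def redrive_policy(dlq_arn, max_receive_count=5):
    # SQS expects the RedrivePolicy attribute as a JSON-encoded string.
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    })


def configure_dlq_and_alarm(queue_url, dlq_arn, dlq_name, alarm_topic_arn):
    import boto3  # AWS SDK; requires credentials when actually run
    sqs = boto3.client("sqs")
    cloudwatch = boto3.client("cloudwatch")

    # After max_receive_count failed receives, SQS moves the message
    # to the dead-letter queue automatically.
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"RedrivePolicy": redrive_policy(dlq_arn)},
    )

    # Alarm as soon as anything lands in the dead-letter queue.
    cloudwatch.put_metric_alarm(
        AlarmName=f"{dlq_name}-not-empty",
        Namespace="AWS/SQS",
        MetricName="ApproximateNumberOfMessagesVisible",
        Dimensions=[{"Name": "QueueName", "Value": dlq_name}],
        Statistic="Maximum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[alarm_topic_arn],
    )
```

This sidesteps the metric quirk entirely: a poisoned message trips the redrive policy and the DLQ alarm fires regardless of how ApproximateAgeOfOldestMessage behaves.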

I think this might have to do with the poison-pill handling built into this metric. After three or more receives, the message is no longer included in the metric. From the AWS docs:
After a message is received three times (or more) and not processed, the message is moved to the back of the queue and the ApproximateAgeOfOldestMessage metric points at the second-oldest message that hasn't been received more than three times. This action occurs even if the queue has a redrive policy.

Related

Rebus with SQS - a failed message is handled by 2 consumers more times than the number of configured retries

I have a failing message in Rebus with SQS transport. I have 2 Rebus consumers, each configured with 5 retries, and I can see that a failed message is handled anywhere from 5 up to 10 times. I thought it was a bug at first, but later realised it's probably expected behaviour. My understanding is that when a message fails for the first time in Rebus consumer 1, it's made visible in the queue again and potentially received by Rebus consumer 2. Each consumer counts the number of retries independently, thus handling the message up to 10 times in total. Am I correct in my understanding that this is correct and expected behaviour?
Rebus tracks the caught exceptions in memory, so you are absolutely correct 😊
I've found no better way of tracking exceptions between delivery attempts.

Setup alarm on SQS for messages automatically deleted after 14 days (retention period)

The question is this: SQS messages get deleted after the retention period, which is 14 days at most. I set up a CloudWatch monitor which sends me a mail if there is a message in the SQS queue, but what if I miss the mail and don't check the message before it gets deleted? Can I set up multiple reminders for one SQS queue message before it expires?
One possibility is to create a CloudWatch alarm on the ApproximateAgeOfOldestMessage metric. The threshold should be set at a value slightly less than 14 days to ensure the old message isn't missed.
I'd like to note that in general if you are anticipating messages lingering in SQS for 14 days, this suggests you are not using SQS for its intended purpose. For a message to linger for 14 days suggests that whatever is processing the messages is severely deficient in its capacity. Once a message is seen and processed, you delete it from the queue, and in most cases the design is for this to happen within seconds or minutes, sometimes hours, but rarely multiple days.
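The alarm described above could be sketched with boto3 as follows; the queue name, SNS topic ARN, and the 13-day threshold are illustrative assumptions (ApproximateAgeOfOldestMessage is reported in seconds):

```python
def age_alarm_threshold(days=13):
    # Alarm a day before the 14-day retention limit, leaving time to react.
    # ApproximateAgeOfOldestMessage is reported in seconds.
    return days * 24 * 60 * 60


def create_age_alarm(queue_name, alarm_topic_arn, days=13):
    import boto3  # AWS SDK; only needed when actually creating the alarm
    boto3.client("cloudwatch").put_metric_alarm(
        AlarmName=f"{queue_name}-message-near-retention-limit",
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        Statistic="Maximum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=age_alarm_threshold(days),
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[alarm_topic_arn],
    )
```

Note the caveat discussed in the question at the top of this page: a poison-pill message received more than three times may drop out of this metric, so a dead-letter-queue alarm is a more reliable backstop.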

Monitor Amazon SQS delayed processing

I have a series of applications that consume messages from SQS Queues. If for some reason one of these consumers fails and stop consuming messages I'd like to be notified. What's the best way to do this?
Note that some of these queues could only have one message placed into the queue every 2 - 3 days, so waiting for the # of messages in the queue to trigger a notification is not a good option for me.
What I'm looking for is something that can monitor an SQS queue and say "This message has been here for an hour and nothing has processed it ... let someone know."
Possible solution off the top of my head (possibly not the most elegant one) which does not require using CloudWatch at all (according to a comment from the OP, the required tracking cannot be implemented through CloudWatch alarms). Assume the queue is processed by a service, and that receiving is implemented through long polling.
Run a Lambda function (say, hourly) that reads messages from the queue but never deletes them (the service deletes messages once processed). On the queue, set Maximum Receives to any value you want, say 3. If the Lambda function ran 3 times and the message was present in the queue all three times, the message will be pushed to the dead-letter queue (automatically, if a redrive policy is set). Whenever a new message is pushed to the dead-letter queue, it is a good indicator that your service is either down or not handling requests fast enough. All variables can be changed to suit your needs.
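A rough sketch of that watchdog Lambda in Python with boto3; the `QUEUE_URL` environment variable is an assumption, and the key detail is that the function receives but never deletes:

```python
import os


def peek_without_delete(sqs_client, queue_url):
    # Receive up to a batch of messages but never delete them. Each call
    # increments their receive count, which is what eventually trips the
    # queue's redrive policy (Maximum Receives) into the dead-letter queue.
    resp = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=1,        # a short poll is enough for a watchdog
        VisibilityTimeout=0,      # leave messages visible to the real consumer
    )
    return len(resp.get("Messages", []))


def lambda_handler(event, context):
    import boto3  # AWS SDK, available in the Lambda runtime
    return peek_without_delete(boto3.client("sqs"), os.environ["QUEUE_URL"])
```

Receiving with `VisibilityTimeout=0` means the watchdog never hides messages from the real service; it only bumps their receive count.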

Deploying an SQS Consumer

I am looking to run a service that will be consuming messages that are placed into an SQS queue. What is the best way to structure the consumer application?
One thought would be to create a bunch of threads or processes that run this:
import socket  # needed for the socket.gaierror handler below

def run(q, delete_on_error=False):
    while True:
        try:
            m = q.read(VISIBILITY_TIMEOUT, wait_time_seconds=MAX_WAIT_TIME_SECONDS)
            if m is not None:
                try:
                    process(m.id, m.get_body())
                except TransientError:
                    continue
                except Exception as ex:
                    log_exception(ex)
                    if not delete_on_error:
                        continue
                q.delete_message(m)
        except StopIteration:
            break
        except socket.gaierror:
            continue
Am I missing anything else important? What other exceptions do I have to guard against in the queue read and delete calls? How do others run these consumers?
I did find this project, but it seems stalled and has some issues.
I am leaning toward separate processes rather than threads to avoid the GIL. Is there some container process that can be used to launch and monitor these separately running processes?
There are a few things:
The SQS API allows you to receive more than one message with a single API call (up to 10 messages, or up to 256k worth of messages, whichever limit is hit first). Taking advantage of this feature allows you to reduce costs, since you are charged per API call. It looks like you're using the boto library - have a look at get_messages.
In your code right now, if processing a message fails due to a transient error, the message won't be able to be processed again until the visibility timeout expires. You might want to consider returning the message to the queue straight away. You can do this by calling change_visibility with 0 on that message. The message will then be available for processing straight away. (It might seem that if you do this then the visibility timeout will be permanently changed on that message - this is actually not the case. The AWS docs state that "the visibility timeout for the message the next time it is received reverts to the original timeout value". See the docs for more information.)
If you're after an example of a robust SQS message consumer, you might want to check out NServiceBus.AmazonSQS (of which I am the author). (C# - sorry, I couldn't find any python examples.)

Which would be the better approach: SQS or SNS?

I am going to build a Rails application that integrates Amazon's cloud services.
I have explored Amazon's SNS service, which offers public subscriptions; that's not what I want. I want to notify only a particular subscriber.
For example, if I have 5 subscribers on one topic, then a notification should go only to a particular subscriber.
I have also explored Amazon's SQS, in which I have to write a poller that monitors the queue for messages. SQS also has a lock mechanism, but the problem is that it is distributed, so there is a chance of getting the same message again from another copy of the queue.
I want to know what the best approach would be.
SQS sounds like what you want.
You can run multiple "worker" processes that compete over messages in the queue. Each message is only consumed once. The logic behind the "lock" / timeout that you mention is as follows: if one of your workers were to die after downloading a message, but before processing it, then you want that message to eventually time out and be re-downloaded for processing on another node.
Yes, SQS is built on a polling model. For example, I have a number of use cases in which I use a minutely cron job to poll for new messages in the queue and take action on any messages found. This pattern is stupid simple to build and works wonders for a bunch of use cases -- a handy little "client" script that pushes a message into the queue, and the cron activated script that will process that message within a minute or so.
If your message pattern is extremely sparse -- eg, only a few messages a day -- it may seem wasteful to poll constantly while the queue is empty. It hardly matters.
My original calculation was that a minutely cron job would cost $0.04 (now $0.02) per month. Since then, SQS added a "Long-Polling" feature that lets you achieve sub-second latency on processing new messages by sending 1 "long-poll" message every 20 seconds to poll an idle queue. Plus, they dropped the price 50%. So per month, that's 131k messages (~$0.06), a little bit more expensive, but with near realtime request processing.
Keep in mind that a minutely cron job I described only costs ~$0.04 / month in request load (30d*24h*60m * 1c / 10k msgs). So at a minutely clip, cost shouldn't really be a concern here. Even polling every second, the price rises only to $2.59 / mo, not exactly a bank buster.
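The arithmetic above can be captured in a small helper (the $0.01 per 10k requests is the price assumed in this answer; check current SQS pricing):

```python
def monthly_polling_cost(polls_per_minute, price_per_10k_requests=0.01):
    # 30 days * 24 hours * 60 minutes of polling, charged per 10k requests.
    requests = 30 * 24 * 60 * polls_per_minute
    return requests * price_per_10k_requests / 10_000
```

`monthly_polling_cost(1)` gives roughly $0.04 for a minutely poll and `monthly_polling_cost(60)` roughly $2.59 for polling every second, matching the figures above.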
However, it is possible to avoid frequent polling using a webservice that takes an SNS HTTP message. Such an architecture would work as follows: client pushes message to SNS, which pushes message to SQS and routes an HTTP request to your webservice, triggering it to drain the queue. You'd still want to poll the queue hourly or daily, just in case an HTTP request was dropped. In the end though, I'm not sure I can think of any scenario which really justifies such complexity. I'd much rather pay $0.04 a month to have a dirt simple cron job polling my queue.
