I have an alarm for monitoring SQS.
The metric is NumberOfMessagesSent, and the alarm evaluates 288 out of 288 data points (288 × 5 minutes = 24 hours).
Statistic - Sum
Period - 5 Minutes
Unit - Count
The alarm is raised when NumberOfMessagesSent < 1 for the whole of that period.
The problem is that the queue is not always active, so I need to make the alarm more useful. As far as I know, CloudWatch alarms can only evaluate data points covering a single day. Is it possible to evaluate data points over a week or a couple of weeks?
I need help with this scenario as I am new to AWS.
SQS messages are deleted after a retention period of at most 14 days. I set up a CloudWatch alarm that sends me an email when there is a message in the queue, but what if I miss the email, never check the message, and it gets deleted? Can I set up multiple reminders for a single SQS message before it is deleted?
One possibility is to create a CloudWatch alarm on the ApproximateAgeOfOldestMessage metric. The threshold should be set at a value slightly less than 14 days to ensure the old message isn't missed.
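As a rough sketch of the alarm described above, the parameters could be assembled like this in Python. The function name, the alarm name, the 13-day default threshold, the 5-minute period, and the optional SNS topic ARN are all assumptions for illustration; only the metric, namespace, and the "slightly less than 14 days" idea come from the answer. Actually creating the alarm requires boto3 and AWS credentials.

```python
def build_oldest_message_alarm(queue_name, threshold_days=13.0, sns_topic_arn=None):
    """Build keyword arguments for CloudWatch's put_metric_alarm.

    threshold_days is an assumed default: slightly under the 14-day
    maximum retention, so the alarm fires before the message expires.
    """
    params = {
        "AlarmName": f"{queue_name}-oldest-message-age",  # hypothetical name
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Unit": "Seconds",
        "Period": 300,  # assumed 5-minute period
        "EvaluationPeriods": 1,
        "Threshold": threshold_days * 24 * 60 * 60,  # the metric is in seconds
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }
    if sns_topic_arn:
        # Notify this SNS topic (e.g. one with an email subscription).
        params["AlarmActions"] = [sns_topic_arn]
    return params


def create_alarm(params):
    """Create the alarm; needs boto3 installed and AWS credentials configured."""
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Calling `create_alarm(build_oldest_message_alarm("my-queue"))` would register the alarm; `"my-queue"` is a placeholder queue name.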
I'd like to note that, in general, if you anticipate messages lingering in SQS for 14 days, you are probably not using SQS for its intended purpose. A message lingering for 14 days suggests that whatever processes the messages is severely short of capacity. Once a message is seen and processed, you delete it from the queue, and in most cases the design is for this to happen within seconds or minutes, sometimes hours, but rarely multiple days.
I am using Spring Cloud AWS Messaging (2.0.1.RELEASE) in Java to consume from an SQS queue. In case it's relevant, we use the default settings, Java 10 and Spring Cloud Finchley.SR2.
We recently had an issue where a message could not be processed due to an application bug, leading to an exception and no confirmation (deletion) of the message. The message is later retried (this is desirable), presumably after the visibility timeout has elapsed. Again, default values are in use; we have not customised the settings here.
We didn't spot the error for a few days, so the message's receive count was very high and the message had conceptually been on the queue for several days. We considered creating a CloudWatch SQS alarm to alert us to a similar situation in future. The only suitable metric appeared to be ApproximateAgeOfOldestMessage.
Sadly, when observing this metric I see this:
The max age doesn't go much above 5 minutes, even though I know the message was several days old. If no acknowledgment comes and the message isn't deleted, but instead becomes available again after the visibility timeout has elapsed, the message keeps getting older with each receive. Shouldn't this graph be much, much higher?
I don't know whether this is specific to the way Spring Cloud AWS Messaging consumes the message or a general SQS quirk, but my expectation was that if a message was put on the queue 5 days ago and no consumer had successfully consumed it, the max age would be 5 days.
Is it in fact the case that if a message is received by a consumer but never deleted, the max age is actually the time between receive calls?
Can anyone confirm whether my expectation is incorrect, i.e. whether this is how SQS is expected to behave (it doesn't consider the age to be the time since the message was first put on the queue, but instead the time between receive calls)?
Based on a similar question on AWS forums, this is apparently a bug with regular SQS queues where only a single message is affected.
In order to have a useful alarm for this issue, I would suggest setting up a dead-letter-queue (where messages get automatically delivered after a configurable number of consume-without-deletes), and alarm on the size of the dead-letter-queue (ApproximateNumberOfMessagesVisible).
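A minimal sketch of that setup in Python, assuming a dead-letter queue already exists. The function names, the `max_receive_count` default of 5, the alarm name, and the 5-minute period are assumptions for illustration; `RedrivePolicy`, `deadLetterTargetArn`, and `maxReceiveCount` are the SQS queue attribute and keys used to configure redrive. Applying the result requires boto3 and AWS credentials.

```python
import json


def build_redrive_attributes(dlq_arn, max_receive_count=5):
    """Queue attributes that send a message to the dead-letter queue after
    max_receive_count receives without a delete (the default is an assumption)."""
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy as a JSON-encoded string.
    return {"RedrivePolicy": json.dumps(policy)}


def build_dlq_alarm(dlq_name):
    """put_metric_alarm arguments: alarm as soon as the DLQ holds any message."""
    return {
        "AlarmName": f"{dlq_name}-not-empty",  # hypothetical name
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 300,  # assumed 5-minute period
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def apply_setup(source_queue_url, dlq_arn, dlq_name):
    """Apply both pieces; needs boto3 installed and AWS credentials configured."""
    import boto3

    boto3.client("sqs").set_queue_attributes(
        QueueUrl=source_queue_url,
        Attributes=build_redrive_attributes(dlq_arn),
    )
    boto3.client("cloudwatch").put_metric_alarm(**build_dlq_alarm(dlq_name))
```

With this in place, a message that keeps failing ends up in the dead-letter queue instead of cycling through the source queue, and the alarm does not depend on ApproximateAgeOfOldestMessage at all.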
I think this might have to do with the poison pill handling by this metric. After 3+ tries, the message won't be included in the metric. From the AWS docs:
After a message is received three times (or more) and not processed,
the message is moved to the back of the queue and the
ApproximateAgeOfOldestMessage metric points at the second-oldest
message that hasn't been received more than three times. This action
occurs even if the queue has a redrive policy.
I'm using a Google Dataflow streaming pipeline with the default settings.
Thing is, it looks like the pipeline will start off at 1 worker, then scale down to 0 for 10-20 minutes, then up to 1 for 10-40 minutes, then back down.
This causes backups and surges in my PubSub topics, and sets off alerts based on unacknowledged messages. I've adjusted the alerting to accommodate these surges, but it's still odd behavior.
If the traffic through Dataflow is sufficiently low, but not zero, is it expected that the workers will scale to 0 until there is a backlog of work to do?
Can I clarify: is the number of connections shown at each point the maximum number of simultaneous connections reached during that particular hour, or the total number of connections made during that hour? Thanks.
The usage stats indeed show the maximum number of users that were connected at any one time during that hour.
It is not the total number of users that connected during that hour. If you want to know that, you can easily build it by having each user write an event to the database when it connects or by using a product like Firebase Analytics.
I am going to build a Rails application that integrates with Amazon's cloud services.
I have explored Amazon's SNS service, which offers publish/subscribe delivery to every subscriber of a topic, and that is not what I want. I want to notify only a particular subscriber.
For example, if I have 5 subscribers on one topic, the notification should go to one particular subscriber only.
I have also explored Amazon's SQS, for which I have to write a poller that monitors the queue for messages. SQS also has a lock mechanism, but the problem is that it is distributed, so there is a chance that the same message is picked up for processing from another copy of the queue.
I would like to know what approach to take.
SQS sounds like what you want.
You can run multiple "worker" processes that compete over messages in the queue. Each message is normally consumed only once, although standard queues guarantee at-least-once delivery, so a worker should tolerate the occasional duplicate. The logic behind the "lock" / timeout that you mention is as follows: if one of your workers were to die after downloading a message, but before processing it, then you want that message to eventually time out and be re-downloaded for processing on another node.
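The receive, process, then delete cycle behind that timeout can be sketched as follows. This is a minimal illustration in Python rather than Ruby; `process_batch` and `handler` are hypothetical names, and `sqs_client` is assumed to be a boto3 SQS client (or any object exposing the same `receive_message` and `delete_message` calls).

```python
def process_batch(sqs_client, queue_url, handler):
    """Receive up to 10 messages and delete only those handled successfully.

    Returns the number of messages deleted. A message whose handler raises
    is left alone, so it becomes visible again once the visibility timeout
    elapses and another worker can pick it up.
    """
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
    )
    deleted = 0
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception:
            # Not deleted: the message will be redelivered after the timeout.
            continue
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        deleted += 1
    return deleted
```

Deleting only after the handler returns is what makes a crashed worker harmless: the message it was holding is never acknowledged, so it times out and is delivered again.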
Yes, SQS is built on a polling model. For example, I have a number of use cases in which I use a minutely cron job to poll for new messages in the queue and take action on any messages found. This pattern is stupid simple to build and works wonders for a bunch of use cases -- a handy little "client" script that pushes a message into the queue, and the cron activated script that will process that message within a minute or so.
If your message pattern is extremely sparse -- e.g., only a few messages a day -- it may seem wasteful to poll constantly while the queue is empty. It hardly matters.
My original calculation was that a minutely cron job would cost $0.04 (now $0.02) per month. Since then, SQS added a "Long-Polling" feature that lets you achieve sub-second latency on processing new messages by sending 1 "long-poll" message every 20 seconds to poll an idle queue. Plus, they dropped the price 50%. So per month, that's 131k messages (~$0.06), a little bit more expensive, but with near realtime request processing.
Keep in mind that a minutely cron job I described only costs ~$0.04 / month in request load (30d*24h*60m * 1c / 10k msgs). So at a minutely clip, cost shouldn't really be a concern here. Even polling every second, the price rises only to $2.59 / mo, not exactly a bank buster.
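The arithmetic behind these figures can be reproduced with a few lines of Python. This sketch uses only the prices quoted above (1 cent per 10,000 requests originally, half that after the price drop) and assumes a 30-day month; the names are illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month

ORIGINAL_PRICE = 0.01 / 10_000      # dollars per request, as quoted above
REDUCED_PRICE = ORIGINAL_PRICE / 2  # after the 50% price drop


def monthly_request_cost(poll_interval_seconds, price_per_request):
    """Cost in dollars of polling once every poll_interval_seconds for a month."""
    requests = SECONDS_PER_MONTH / poll_interval_seconds
    return requests * price_per_request


print(f"{monthly_request_cost(60, ORIGINAL_PRICE):.2f}")  # minutely cron, old price: 0.04
print(f"{monthly_request_cost(60, REDUCED_PRICE):.2f}")   # minutely cron, new price: 0.02
print(f"{monthly_request_cost(20, REDUCED_PRICE):.2f}")   # 20-second long poll: 0.06
print(f"{monthly_request_cost(1, ORIGINAL_PRICE):.2f}")   # every second, old price: 2.59
```

A 30-day month gives 129,600 long-poll requests; the 131k figure quoted above corresponds to an average month of about 30.4 days, which does not change the result at this precision.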
However, it is possible to avoid frequent polling using a webservice that takes an SNS HTTP message. Such an architecture would work as follows: client pushes message to SNS, which pushes message to SQS and routes an HTTP request to your webservice, triggering it to drain the queue. You'd still want to poll the queue hourly or daily, just in case an HTTP request was dropped. In the end though, I'm not sure I can think of any scenario which really justifies such complexity. I'd much rather pay $0.04 a month to have a dirt simple cron job polling my queue.