Monitor Amazon SQS delayed processing - amazon-sqs

I have a series of applications that consume messages from SQS Queues. If for some reason one of these consumers fails and stops consuming messages, I'd like to be notified. What's the best way to do this?
Note that some of these queues may receive only one message every 2-3 days, so waiting for the number of messages in the queue to trigger a notification is not a good option for me.
What I'm looking for is something that can monitor an SQS queue and say "This message has been here for an hour and nothing has processed it ... let someone know."

One possible solution off the top of my head (possibly not the most elegant one) does not require CloudWatch at all (according to the OP's comment, the required tracking cannot be implemented through CloudWatch alarms). Assume the Queue is processed by a Service, and that the receiving side is implemented through long polling.
Run a Lambda function (say, hourly) that polls the Queue and reads messages but never deletes them (the Service deletes messages once they are processed). On the Queue, set Maximum Receives to any value you want, let's say 3. If the Lambda function has run 3 times and the message was present in the queue all three times, the message will be pushed to the Dead Letter Queue (automatically, if the redrive policy is set). A new message arriving in the dead letter queue is then a good indicator that your service is either down or not handling requests fast enough. All of these values can be changed to suit your needs.
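To see how the receive count drives the redrive in the approach above, here is a minimal in-memory sketch. It is not the AWS API: `ToyQueue` and its methods are hypothetical names, and the model assumes a message is moved to the dead-letter queue on the first receive attempt after its receive count has reached the configured maximum. It also ignores the visibility timeout, which in real SQS hides a received message for a while.

```ruby
# Toy model of a queue with a redrive policy (hypothetical, not the AWS SDK).
class ToyQueue
  attr_reader :dead_letters

  def initialize(max_receives:)
    @max_receives = max_receives
    @messages = {}        # id => { body:, receives: }
    @dead_letters = []    # bodies moved by the redrive policy
    @next_id = 0
  end

  def send_message(body)
    id = (@next_id += 1)
    @messages[id] = { body: body, receives: 0 }
    id
  end

  # Returns the messages still in the queue and bumps their receive count.
  # A message already received max_receives times is moved to the
  # dead-letter list instead of being returned.
  def receive
    exhausted, live = @messages.partition { |_, m| m[:receives] >= @max_receives }
    exhausted.each do |id, m|
      @dead_letters << m[:body]
      @messages.delete(id)
    end
    live.map do |id, m|
      m[:receives] += 1
      [id, m[:body]]
    end
  end

  # What the real consumer does after processing a message.
  def delete(id)
    @messages.delete(id)
  end
end

queue = ToyQueue.new(max_receives: 3)
queue.send_message("order-42")

# The hourly monitor peeks but never deletes. With the consumer down,
# the fourth peek finds the receive count exhausted.
4.times { queue.receive }
puts queue.dead_letters.inspect
```

Keep in mind that the Service's own receives count toward the same total, so the maximum should leave room for normal retries.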

Related

Is there a way to receive most messages out of the standard SQS Queue? [NOT FIFO]

I tried using parallel requests, but due to retention by AWS, it does not allow polling the same queue again unless previously polled messages are deleted.
I however achieved doing the same using the FIFO, but not the standard queue.
Thanks in Advance!
:)
When you say "it does not allow to poll back the same queue unless previously polled messages are deleted", I assume you're talking about the inflight messages per queue limit, which is pretty high at 120,000:
For most standard queues (depending on queue traffic and message backlog), there can be a maximum of approximately 120,000 inflight messages (received from a queue by a consumer, but not yet deleted from the queue). If you reach this limit, Amazon SQS returns the OverLimit error message. To avoid reaching the limit, you should delete messages from the queue after they're processed. You can also increase the number of queues you use to process your messages. To request a limit increase, file a support request.
The expected use case of SQS is to have workers that receive a message, do some work, then delete the message. If you're not following this pattern, I'd strongly recommend reevaluating whether SQS is the right tool for what you're trying to do.
However, if you really have a valid use case for having more than 120K messages inflight at once, you'll need to describe your use case to AWS and get their approval to increase that limit.
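The receive, work, delete cycle described above might look like the following sketch. `drain_once` and `FakeClient` are hypothetical names. The fake stands in for `Aws::SQS::Client` from the aws-sdk-sqs gem so the example runs without AWS credentials, and it mimics only the `receive_message` and `delete_message` calls used here.

```ruby
Message = Struct.new(:body, :receipt_handle)
Response = Struct.new(:messages)

# Stand-in for Aws::SQS::Client (assumption: only these two calls are used).
class FakeClient
  attr_reader :in_flight

  def initialize(bodies)
    @pending = bodies.each_with_index.map { |b, i| Message.new(b, "rh-#{i}") }
    @in_flight = {}   # receipt handle => message received but not yet deleted
  end

  def receive_message(queue_url:, max_number_of_messages: 1, wait_time_seconds: 0)
    batch = @pending.shift(max_number_of_messages)
    batch.each { |m| @in_flight[m.receipt_handle] = m }
    Response.new(batch)
  end

  def delete_message(queue_url:, receipt_handle:)
    @in_flight.delete(receipt_handle)
  end
end

# Receive one batch, process each message, and delete it only after the
# block returns, so in-flight messages do not pile up.
def drain_once(client, queue_url)
  resp = client.receive_message(queue_url: queue_url,
                                max_number_of_messages: 10,
                                wait_time_seconds: 20)
  resp.messages.each do |msg|
    yield msg.body
    client.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
  end
  resp.messages.size
end

client = FakeClient.new(%w[a b c])
seen = []
handled = drain_once(client, "https://example.invalid/queue") { |body| seen << body }
puts "handled #{handled}, still in flight: #{client.in_flight.size}"
```

With the real gem you would pass `Aws::SQS::Client.new` and your own queue URL instead of the fake; the URL above is a placeholder.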

AWS SQS - move unconsumed message to the dead letter queue

I am using AWS SQS with dead letter queues.
I can easily insert messages into the dead letter queue if I consume them, but I want messages that haven't been consumed for, say, an hour to be moved to the dead letter queue automatically.
Is this possible?
Am I missing a configuration option?
Regards,
Ido
Dead letter queues are not designed for this purpose; they specifically handle the case where a message can be successfully received but cannot be processed successfully (source). So in short, this is not currently possible. Alternatively, you could increase the message retention period so that messages stay on the queue longer, giving your consumer a chance to start listening again.
Hope that helps :)

how to retrieve nth item in a queue with amazon sqs and ruby

I am sending messages to a queue using the Amazon SQS queuing system in a Rails application. Since the queue follows a FIFO process, it returns the next items in that order. Suppose I have 100 items in a queue: how can I retrieve the 35th item and process it? As far as I know, Amazon SQS provides no method for doing this. Is there any other method or workaround that would achieve this functionality?
There is no method to do that; SQS does not guarantee the order of items in the queue because of its geographically redundant nature; it can't even guarantee FIFO. If you absolutely must process things in order, and need the ability to 'look ahead' in the queue, SQS may not be your best choice. Perhaps a custom-made queue in something like DynamoDB would work better.
SQS is designed to guarantee at-least-once delivery and does not take into account the order of messages. So the simple answer to your question on whether you can do that, is no.
A work around would depend on your use-case:
To split work among different processes handling queue messages while making sure no two of them process the same item - Different queues are one approach; another is prefixing every message with an identifier denoting which process is supposed to work on it. For example, if I have 4 daemons running, I could prefix every message in the queue with the ID of the process that should work on it - 1, 2, 3 or 4. Every process would only handle messages with the number corresponding to its ID.
Order of arrival is critical - In this case, you're better off not using SQS because it wasn't designed to be used this way. CloudAMQP is a cloud-based service built on RabbitMQ, which is a true FIFO queue and would suit this case better than SQS.
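The prefixing idea from the first workaround could be sketched like this. The `"<id>:<payload>"` message format and the helper names `parse_routed` and `mine?` are assumptions made for illustration, not anything SQS defines.

```ruby
# Hypothetical message format: "<worker_id>:<payload>", e.g. "3:resize photo".
def parse_routed(body)
  id, payload = body.split(":", 2)
  [Integer(id), payload]
end

# True when this worker should handle the message.
def mine?(body, worker_id)
  parse_routed(body).first == worker_id
end

bodies = ["1:send email", "3:resize photo", "1:charge card", "4:reindex"]
worker_1 = bodies.select { |b| mine?(b, 1) }.map { |b| parse_routed(b).last }
puts worker_1.inspect
```

A worker that receives a message meant for another process has to leave it undeleted so it becomes visible again, which adds latency; separate queues avoid that cost.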

How to lock Resque jobs to one server

I have a "cluster" of Resque servers in my infrastructure. They all have the same exact job priorities etc. I automagically scale the number of Resque servers up and down based on how many pending jobs there are and available resources on the servers to handle said jobs. I always have a minimum of two Resque servers up.
My issue is that when I do a quick, one off job, sometimes both the servers process that job. This is bad.
I've tried adding a lock to my job with something like the following:
require 'resque-lock-timeout'

class ExampleJob
  extend Resque::Plugins::LockTimeout

  def self.perform
    # some code
  end
end
This plugin works for longer running jobs. However, for these super tiny one-off jobs, processing happens right away. Neither Resque server sees the lock set by its sister server, so both set a lock, process the job, unlock, and are done.
I'm not entirely sure what to do at this point or what solutions there are except for having one dedicated server handle this type of job. That would be a serious pain to configure and scale. I really want both the servers to be able to handle it, but once one of them grabs it from the queue, ensure the other does not run it.
Can anyone suggest some viable solution(s)?
Write your lock interpreter to wait T milliseconds before it looks for a lock with a unique_id less than the value of the lock it made.
This will determine who won the race, and the loser will self-terminate.
T is the parallelism latency between all N servers in the pool of a given queue. You can determine this heuristically by scaling back from 1000 milliseconds until you again find the job happening in-duplicate. Give padding for latency variation.
This is called the Busy-Wait solution to mutex thread safety. It is considered one of the trade-offs acceptable given the various scenarios in which one must solve Mutex (e.g. Locking, etc)
I'll post some links when off mobile. Wikipedia entry on mutex should explain all this.
If this won't work for you, then:
1. Use a scheduler to control duplication.
2. Classify short-running jobs to a queue designed to run them in serial.
TL;DR there is no perfect solution, only good trade-off for your conditions.
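The wait-then-check idea from this answer can be sketched in memory as follows. `LockBoard` and `contend` are hypothetical names, the board stands in for whatever shared store holds the locks (Redis in a Resque setup), and the claim ids 7 and 3 are made-up values standing in for ids each server generates on its own.

```ruby
# In-memory stand-in for the shared lock store (an assumption for the demo).
class LockBoard
  def initialize
    @mutex = Mutex.new
    @claims = Hash.new { |h, k| h[k] = [] }  # job key => posted claim ids
  end

  def post(job_key, claim_id)
    @mutex.synchronize { @claims[job_key] << claim_id }
  end

  def lowest(job_key)
    @mutex.synchronize { @claims[job_key].min }
  end
end

# Post a claim, wait T seconds so rival claims become visible, then
# report whether ours is the lowest id. The loser should stop here.
def contend(board, job_key, claim_id, wait_seconds)
  board.post(job_key, claim_id)
  sleep(wait_seconds)
  board.lowest(job_key) == claim_id
end

board = LockBoard.new
workers = [7, 3].map do |claim_id|
  Thread.new { contend(board, "one-off-job", claim_id, 0.2) }
end
results = workers.map(&:value)
puts results.inspect
```

Without the wait, the first server could check the board before its rival has posted and wrongly conclude it is alone, which is the duplicate run described in the question.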
It should not be possible for two workers to get the same 'payload' because items are dequeued with an atomic Redis list pop (LPOP, or BLPOP where a blocking pop is used). Redis hands each queued item to exactly one client. It sounds like you are enqueueing the job more than once, and therefore two workers are able to acquire different payloads with the same arguments. The purpose of 'resque-lock-timeout' is to ensure that payloads with the same method and arguments do not run concurrently; it does not, however, stop the second payload from being worked if the first job releases the lock before the second job tries to acquire it.
It would make sense that this only happens to short running jobs. Here is what might be happening:
payload 1 is enqueued
payload 2 is enqueued
payload 1 is locked
payload 1 is worked
payload 1 is unlocked
payload 2 is locked
payload 2 is worked
payload 2 is unlocked
Whereas with long running jobs the following scenario might happen:
payload 1 is enqueued
payload 2 is enqueued
payload 1 is locked
payload 1 is worked
payload 2 fails to get lock
payload 1 is unlocked
Try turning off Resque and enqueueing your job. Take a look in redis at the list for your Resque queue (or monitor Redis using redis-cli monitor). See if Resque has queued more than one payload. If you still only see one payload then monitor the list to see if another one of your resque workers is calling recreate on failed jobs.
If you want to have 'resque-lock-timeout' hold the lock for longer than the duration it takes to process the job you can override the release_lock! method to set an expiry on the lock instead of just deleting it.
module Resque
  module Plugins
    module LockTimeout
      # Keep the lock key for 60 more seconds instead of deleting it.
      def release_lock!(*args)
        lock_redis.expire(redis_lock_key(*args), 60) # expire lock after 60 seconds
      end
    end
  end
end
https://github.com/lantins/resque-lock-timeout/blob/master/lib/resque/plugins/lock_timeout.rb#l153-155
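The effect of expiring the lock instead of deleting it can be shown with a small in-memory model. This is only a sketch: `ExpiringLock` is a hypothetical class, not part of resque-lock-timeout, and time is passed in as plain numbers so the example stays deterministic.

```ruby
# In-memory sketch of a lock that lingers after release (not Redis itself).
class ExpiringLock
  def initialize(linger_seconds)
    @linger = linger_seconds
    @held_until = {}   # key => time after which the key is free again
  end

  # Take the lock unless it is held or still lingering at time `now`.
  def acquire(key, now)
    return false if @held_until.key?(key) && now < @held_until[key]
    @held_until[key] = Float::INFINITY   # held until released
    true
  end

  # Instead of freeing the key, keep it for `linger` more seconds.
  def release(key, now)
    @held_until[key] = now + @linger
  end
end

lock = ExpiringLock.new(60)
key = "ExampleJob:args"

first = lock.acquire(key, 0)      # payload 1 takes the lock at t=0
lock.release(key, 1)              # the tiny job finishes at t=1
second = lock.acquire(key, 2)     # duplicate payload arrives at t=2
later = lock.acquire(key, 62)     # a fresh enqueue after the linger window
puts [first, second, later].inspect
```

The trade-off is that a legitimate re-enqueue of the same job with the same arguments is also refused until the linger window has passed.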

what would be the possible approach to go : SQS or SNS?

I am going to build a Rails application that integrates with Amazon's cloud services.
I have explored Amazon's SNS service, which offers public subscription, and that is not what I want. I want to notify only a particular subscriber.
For example, if I have 5 subscribers on one topic, the notification should go to one particular subscriber only.
I have also explored Amazon's SQS, for which I have to write a poller that monitors the queue for messages. SQS also has a lock mechanism, but the problem is that it is distributed, so there is a chance of getting the same message from another copy of the queue for processing.
I want to know which approach would be the best one to take.
SQS sounds like what you want.
You can run multiple "worker" processes that compete over messages in the queue. Each message is only consumed once. The logic behind the "lock" / timeout that you mention is as follows: if one of your workers were to die after downloading a message, but before processing it, then you want that message to eventually time out and be re-downloaded for processing on another node.
Yes, SQS is built on a polling model. For example, I have a number of use cases in which I use a minutely cron job to poll for new messages in the queue and take action on any messages found. This pattern is stupid simple to build and works wonders for a bunch of use cases -- a handy little "client" script that pushes a message into the queue, and the cron activated script that will process that message within a minute or so.
If your message pattern is extremely sparse (e.g., only a few messages a day), it may seem wasteful to poll constantly while the queue is empty. It hardly matters.
My original calculation was that a minutely cron job would cost $0.04 (now $0.02) per month. Since then, SQS added a "Long-Polling" feature that lets you achieve sub-second latency on processing new messages by sending 1 long-poll request every 20 seconds to poll an idle queue. Plus, they dropped the price 50%. So per month, that's 131k requests (~$0.06), a little bit more expensive, but with near-realtime request processing.
Keep in mind that the minutely cron job I described only costs ~$0.04 / month in request load at the original price (30d * 24h * 60m * 1c / 10k requests). So at a minutely clip, cost shouldn't really be a concern here. Even polling every second, the price rises only to $2.59 / mo, not exactly a bank buster.
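The arithmetic behind those figures can be checked with a few lines of Ruby. The prices are the ones quoted in this answer (1 cent per 10,000 requests, later halved), not current AWS pricing, and the helper names are made up for the sketch. A 30-day month is assumed, which gives 129,600 long-poll requests; the 131k quoted above presumably uses a slightly longer month.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Requests per month when polling once every `interval_seconds`.
def monthly_requests(interval_seconds, days: 30)
  days * SECONDS_PER_DAY / interval_seconds
end

# Cost in dollars, given a price per 10,000 requests.
def monthly_cost(interval_seconds, price_per_10k:, days: 30)
  monthly_requests(interval_seconds, days: days) * price_per_10k / 10_000.0
end

puts monthly_requests(60)                              # one poll per minute
puts monthly_cost(60, price_per_10k: 0.01).round(2)    # original price
puts monthly_cost(1,  price_per_10k: 0.01).round(2)    # one poll per second
puts monthly_cost(20, price_per_10k: 0.005).round(2)   # long poll, halved price
```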
However, it is possible to avoid frequent polling using a webservice that takes an SNS HTTP message. Such an architecture would work as follows: client pushes message to SNS, which pushes message to SQS and routes an HTTP request to your webservice, triggering it to drain the queue. You'd still want to poll the queue hourly or daily, just in case an HTTP request was dropped. In the end though, I'm not sure I can think of any scenario which really justifies such complexity. I'd much rather pay $0.04 a month to have a dirt simple cron job polling my queue.
