Blue green deployment with single queue - amazon-sqs

I'm currently dealing with a problem caused by the blue-green deployment pattern.
We have a single SQS queue whose messages can be consumed by both blue and green servers.
I'd like the green (new version) servers to consume only messages that originated from green servers.
I thought about passing a g/b value in the message and re-queuing the message if it's picked up by the wrong worker. But this may cause delays (multiple re-queues, etc.).
Is there a common practice for this problem?
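To make the tagging idea concrete, here is a minimal sketch in Python. It assumes each message carries a "color" field set by the producer; the field name, the route function, and the dict-shaped messages are hypothetical and only illustrate the decision each worker would make.

```python
def route(message, worker_color):
    """Return 'process' when the message color matches the worker, else 'requeue'."""
    # "color" is a hypothetical field the producer would set to "blue" or "green".
    if message.get("color") == worker_color:
        return "process"
    return "requeue"


messages = [
    {"id": 1, "color": "green"},
    {"id": 2, "color": "blue"},
    {"id": 3, "color": "green"},
]

# Decisions a green worker would make for this batch.
green_decisions = [route(m, "green") for m in messages]
print(green_decisions)
```

Every "requeue" decision is one more trip through the queue for that message, which is where the delay I'm worried about comes from.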

Related

Refusing to split GroupedShuffleRangeTracker proposed split position is out of range

I am sporadically getting the following errors:
W Refusing to split
at '\x00\x00\x00\x15\xbc\x19)b\x00\x01': proposed
split position is out of range
['\x00\x00\x00\x15\x00\xff\x00\xff\x00\xff\x00\xff\x00\x01',
'\x00\x00\x00\x15\xbc\x19)b\x00\x01'). Position of last group
processed was '\x00\x00\x00\x15\xbc\x19)a\x00\x01'.
When it happens, the error is logged every so often and the job never seems to end, although it otherwise appears to have completed its work.
In the last instance I am using 10 workers and have auto scaling disabled. I am using the Python implementation of Apache Beam.
This is not an error, it's part of normal operation of a pipeline. We should probably reduce its logging level to INFO and rephrase it, because it very frequently confuses people.
This message (rather obscurely) signals that Dataflow is trying to apply dynamic rebalancing, but there's no work that can be further subdivided.
I.e. your job is stuck doing something non-parallelizable on a small number of workers, while other workers are staying idle. To investigate this further, one would need to look at the code of your job and the Dataflow job id.

SQS traffic balancing

I have an app that reads events from Amazon SQS. The problem is that when I deploy a newer application version, it connects to the same queue, so there are two stacks, the old one and the new one, consuming messages.
I would like the old stack to keep consuming, say, 95% of the messages and the new one only 5%, so I can do a live test. Once I am confident the new version is fine, I shut down the old stack and let the new one consume 100% of the events.
The only solution I see right now is to implement some feature on the application side, for example a REST endpoint, to control how many SQS messages it should try to read.
However, maybe there are other solutions or tools for this problem. (In fact, there are several applications, so if I can solve this issue without touching all of them, it would be great.)
In general, how do you deal with new version deployments and reading from SQS?
thanks
Let's say there are two application stacks: S1, which needs to process 90% of the messages, and S2 10%.
This is what they can do:
They'll have two configurations: n_messages_to_get and n_messages_to_process. For S1, the values will be 10 and 9 respectively. For S2, 10 and 1 respectively.
Each will fetch n_messages_to_get from SQS, but only process n_messages_to_process out of them.
You can also think of having this configuration in a database like DynamoDB, so that you don't have to deploy your code in case you need to dial up or down.
Assumptions made:
Both S1 and S2 take approximately same time to process a message.
You can tolerate some deviation in the number of messages processed by both. For example, you'll be OK if S1 processes 87% of the messages and S2 13%.
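A minimal sketch of the fetch-versus-process split, in Python. It runs on an in-memory batch rather than a real queue, and the function name split_batch is hypothetical. What happens to the messages a stack skips is not specified above; this sketch assumes they are handed back so the other stack can receive them.

```python
import random


def split_batch(messages, n_messages_to_process, rng=random):
    """Pick which fetched messages this stack handles and which it hands back."""
    if n_messages_to_process >= len(messages):
        return list(messages), []
    chosen = set(rng.sample(range(len(messages)), n_messages_to_process))
    to_process = [m for i, m in enumerate(messages) if i in chosen]
    to_release = [m for i, m in enumerate(messages) if i not in chosen]
    return to_process, to_release


# Stand-in for one fetch of n_messages_to_get = 10 messages.
batch = [f"msg-{i}" for i in range(10)]

# S1: n_messages_to_process = 9, S2: n_messages_to_process = 1.
s1_process, s1_release = split_batch(batch, 9)
s2_process, s2_release = split_batch(batch, 1)

# Assumption: with a real SQS queue, released messages reappear once their
# visibility timeout expires, or sooner if the consumer resets the timeout to 0.
print(len(s1_process), len(s1_release), len(s2_process), len(s2_release))
```

Keeping the two numbers in configuration (or in a table such as DynamoDB, as mentioned) means the ratio can be changed without touching this logic.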

Monitor Amazon SQS delayed processing

I have a series of applications that consume messages from SQS queues. If for some reason one of these consumers fails and stops consuming messages, I'd like to be notified. What's the best way to do this?
Note that some of these queues could only have one message placed into the queue every 2 - 3 days, so waiting for the # of messages in the queue to trigger a notification is not a good option for me.
What I'm looking for is something that can monitor an SQS queue and say "This message has been here for an hour and nothing has processed it ... let someone know."
Here is a possible solution off the top of my head (possibly not the most elegant one) that does not require CloudWatch at all (according to the OP's comment, the required tracking cannot be implemented through CloudWatch alarms). Assume the queue is processed by a service, and the receiving side is implemented through long polling.
Run a Lambda function (say, hourly) that reads messages from the queue but never deletes them (the service deletes messages once they are processed). On the queue, set Maximum Receives to any value you want, let's say 3. If the Lambda function ran 3 times and the message was present in the queue all three times, the message will be pushed to the dead letter queue (automatically, if the redrive policy is set). Whenever a new message is pushed to the dead letter queue, it is a good indicator that your service is either down or not handling requests fast enough. All variables can be changed to suit your needs.
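As a rough sketch of the redrive setup, the Python snippet below builds the JSON value for the queue's RedrivePolicy attribute. The keys deadLetterTargetArn and maxReceiveCount are the SQS attribute names; the helper name build_redrive_policy and the ARN are placeholders chosen for illustration.

```python
import json


def build_redrive_policy(dead_letter_queue_arn, max_receive_count=3):
    """Return the JSON string to store in the queue attribute 'RedrivePolicy'."""
    return json.dumps({
        "deadLetterTargetArn": dead_letter_queue_arn,
        # Written as a string here, in the style of AWS examples.
        "maxReceiveCount": str(max_receive_count),
    })


# Placeholder ARN, not a real queue.
policy = build_redrive_policy("arn:aws:sqs:us-east-1:123456789012:service-dlq")
print(policy)
```

Keep in mind that receives by the service itself also count toward the limit, so the value may need to be higher than the number of Lambda runs you want to tolerate.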

Background Tasks in Spring (AMQP)

I need to handle a time-consuming and error-prone task (e.g., invoking a SOAP endpoint that will trigger the delivery of an SMS) whenever a given endpoint of my REST API is invoked, but I'd prefer not to make my users wait for that before sending a response back. Spring AMQP is already part of my stack, so I thought about leveraging it to establish a "work queue" and have a number of worker processes consuming from the queue and taking care of the "work units". I have, however, the following requirements:
A work unit is guaranteed to be delivered, and delivered to exactly one worker.
Should a work unit fail to be completed for any reason, it must be placed back in the queue so that another worker can pick it up later.
Work units survive server reboots and crashes. This is mandatory because I won't be using a DB of any kind to store them.
I know RabbitMQ and Spring AMQP can be configured in such a way that ensures these three requirements, but I've only ever used it to achieve RPC so I don't know much about anything other than that. Is there any example I might follow? What are some of the pitfalls to watch out for?
When declaring a queue, RabbitMQ gives you two options: transient or durable. A durable queue survives a broker restart, and messages published to it as persistent stay available until you acknowledge them. Messages won't expire unless you give the queue a TTL. For starters, you can enable the RabbitMQ management plugin and play around a little.
But if you really want to guarantee the safety of your messages against hard resets or hardware problems, I guess you need to use a RabbitMQ cluster.
See RabbitMQ Clustering; you can find the high availability topic on the right side of that page.
This guy explains how to cluster.
By the way, I like beanstalkd too. You can make it write messages to disk, and they will be safe except in the case of disk failure.

Executing large numbers of asynchronous IO-bound operations in Rails

I'm working on a Rails application that periodically needs to perform large numbers of IO-bound operations. These operations can be performed asynchronously. For example, once per day, for each user, the system needs to query Salesforce.com to fetch the user's current list of accounts (companies) that he's tracking. This results in huge numbers (potentially > 100k) of small queries.
Our current approach is to use ActiveMQ with ActiveMessaging. Each of our users is pushed onto a queue as a different message. Then, the consumer pulls the user off the queue, queries Salesforce.com, and processes the results. But this approach gives us horrible performance. Within a single poller process, we can only process a single user at a time. So, the Salesforce.com queries become serialized. Unless we run literally hundreds of poller processes, we can't come anywhere close to saturating the server running poller.
We're looking at EventMachine as an alternative. It has the advantage of allowing us to kickoff large numbers of Salesforce.com queries concurrently within a single EventMachine process. So, we get great parallelism and utilization of our server.
But there are two problems with EventMachine. 1) We lose the reliable message delivery we had with ActiveMQ/ActiveMessaging. 2) We can't easily restart our EventMachine processes periodically to lessen the impact of memory growth. For example, with ActiveMessaging, we have a cron job that restarts the poller once per day, and this can be done without worrying about losing any messages. But with EventMachine, if we restart the process, we could literally lose hundreds of messages that were in progress. The only way I can see around this is to build a persistence/reliable delivery layer on top of EventMachine.
Does anyone have a better approach? What's the best way to reliably execute large numbers of asynchronous IO-bound operations?
I maintain ActiveMessaging, and have also been thinking about the issues of a multi-threaded poller, though perhaps not at the same scale you guys are. I'll give you my thoughts here, but am also happy to discuss further on the ActiveMessaging list, or via email if you like.
One trick is that the poller is not the only serialized part of this. STOMP subscriptions, if you do client -> ack in order to prevent losing messages on interrupt, will only get sent a new message on a given connection when the prior message has been ack'd. Basically, you can only have one message being worked on at a time per connection.
So to keep using a broker, the trick will be to have many broker connections/subscriptions open at once. The current poller is pretty heavy for this, as it loads up a whole rails env per poller, and one poller is one connection. But there is nothing magical about the current poller, I could imagine writing a poller as an event machine client that is implemented to create new connections to the broker and get many messages at once.
In my own experiments lately, I have been thinking about using Ruby Enterprise Edition and having a master thread that forks many poller worker threads so as to get the benefit of the reduced memory footprint (much like passenger does), but I think the EM trick could work as well.
I am also an admirer of the Resque project, though I do not know that it would be any better at scaling to many workers - I think the workers might be lighter weight.
http://github.com/defunkt/resque
I've used AMQP with RabbitMQ in a way that would work for you. Since ActiveMQ implements AMQP, I imagine you can use it in a similar way. I have not used ActiveMessaging, which although it seems like an awesome package, I suspect may not be appropriate for this use case.
Here's how you could do it, using AMQP:
Have Rails process send a message saying "get info for user i".
The consumer pulls this off the message queue, making sure to specify that the message requires an 'ack' to be permanently removed from the queue. This means that if the message is not acknowledged as processed, it is returned to the queue for another worker eventually.
The worker then spins off the message into the thousands of small requests to SalesForce.
When all of these requests have successfully returned, another callback should be fired to ack the original message and return a "summary message" that has all the info germane to the original request. The key is using a message queue that lets you acknowledge successful processing of a given message, and making sure to do so only when relevant processing is complete.
Another worker pulls that message off the queue and performs whatever synchronous work is appropriate. Since all the latency-inducing bits have already been performed, I imagine this should be fine.
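The ack-only-when-done flow in the steps above can be sketched as follows. This is a Python asyncio illustration rather than Ruby/EventMachine code, and it uses an in-memory stand-in instead of a real broker; InMemoryQueue, fetch_account, and handle are hypothetical names, not part of any AMQP library.

```python
import asyncio


class InMemoryQueue:
    """Stand-in for a broker queue with explicit acknowledgements."""

    def __init__(self, messages):
        self.pending = list(messages)
        self.acked = []

    def get(self):
        return self.pending.pop(0) if self.pending else None

    def ack(self, message):
        self.acked.append(message)

    def requeue(self, message):
        self.pending.append(message)


async def fetch_account(user, account_id):
    # Hypothetical stand-in for one small remote query.
    await asyncio.sleep(0)
    return f"{user}:account-{account_id}"


async def handle(queue, message, n_requests=5):
    """Fan out the small requests, then ack only if all of them succeeded."""
    try:
        results = await asyncio.gather(
            *(fetch_account(message, i) for i in range(n_requests))
        )
    except Exception:
        # Any failure leaves the message unacknowledged and back on the queue.
        queue.requeue(message)
        return None
    queue.ack(message)
    # The "summary message" that a second worker would pick up.
    return {"user": message, "accounts": results}


async def main():
    queue = InMemoryQueue(["user-1", "user-2"])
    summaries = []
    while (message := queue.get()) is not None:
        summary = await handle(queue, message)
        if summary is not None:
            summaries.append(summary)
    return queue, summaries


queue, summaries = asyncio.run(main())
print(queue.acked, len(summaries))
```

The point of the sketch is the ordering: the ack happens after the gather completes, so a crash or restart in the middle leaves the original message on the queue for another worker.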
If you're using (C)Ruby, try to never combine synchronous and asynchronous stuff in a single process. A process should either do everything via Eventmachine, with no code blocking, or only talk to an Eventmachine process via a message queue.
Also, asynchronous code is incredibly useful, but it is also difficult to write, difficult to test, and bug-prone. Be careful. Investigate using another language or tool if appropriate.
Also check out "cramp" and "beanstalk".
Someone sent me the following link: http://github.com/mperham/evented/tree/master/qanat/. This is a system that's somewhat similar to ActiveMessaging except that it is built on top of EventMachine. It's almost exactly what we need. The only problem is that it seems to only work with Amazon's queue, not ActiveMQ.
