Sidekiq - Any side effect of a worker pushing to another queue? - ruby-on-rails

I'm processing many Sidekiq jobs (tens of millions) which return 1 string and 2 integers as a result. There is a central computer which is hosting the Redis server/storing the results, and multiple computers acting as workers, fetching data from this central computer.
Inside the worker, I do the processing logic, and once the result is ready, I just call:
Sidekiq::Client.push('class' => ResultsWorker, 'args' => [arg1, arg2, arg3])
and on the central computer, I have another worker which just checks this queue and stores the results in a database.
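Concretely, the shape of what I'm doing is roughly this (a sketch; ProcessingWorker, Result, and the 'results' queue are stand-ins for my real names):

# On the worker machines:
class ProcessingWorker
  include Sidekiq::Worker

  def perform(item_id)
    name, count, score = heavy_computation(item_id) # 1 string, 2 integers
    # Class name passed as a string, since ResultsWorker is only defined
    # on the central machine.
    Sidekiq::Client.push('class' => 'ResultsWorker',
                         'queue' => 'results',
                         'args'  => [name, count, score])
  end
end

# On the central computer:
class ResultsWorker
  include Sidekiq::Worker
  sidekiq_options queue: 'results'

  def perform(name, count, score)
    Result.create!(name: name, count: count, score: score)
  end
end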
I haven't seen this pattern of "two-way communication" mentioned anywhere in the Sidekiq community, so I was wondering: is there a reason for that? Is there some side effect of doing this? I tried putting a simple web app on the Redis server machine which accepts the results and stores them, but I figured that for such small data, why not re-use the same Redis server with a different queue?

I haven't seen this pattern of "two-way communication" mentioned anywhere in the Sidekiq community.
This doesn't need any special mentioning or discussion, because there's nothing special or unusual about this. A job can indeed enqueue some other jobs. This happens left and right in big apps.
That said, you should measure the overhead in your concrete situation. It might turn out that persisting to the database directly from the first job is much faster (no losses from the extra network round-trips, etc.).

Related

Multiple processes management

I just wanted to know what the best approach would be.
Let's say I have 3 processes. Each one does its own job, performs its calculations, and passes its data to a final process whose function is to take the data from the other processes and populate a DB.
The reason for leaving the final process by itself is that the 3 other processes may take a variable time to complete, so I want each one to pass its data to the final one as soon as it has finished, in order to avoid wasting time. I also don't want multiple processes writing to the DB at the same time.
But to do this, each process needs to know whether the final process is busy or not, and if it is available, send its data; otherwise, wait for it to finish before sending.
My idea is to use the 'whenever' gem and create 3 processes that run on their own, but I am puzzled by the last one, as I don't know much about daemons and the like, and I know I might be making all of this more complicated than it really is.
Any suggestion is welcome, thank you.
So I think I can provide some insight into your problem. My dev team uses a home-grown messaging queue that's backed by our database; messages (job metadata) are stored in a messages table.
Our Rails app then creates a daemon process using the daemons gem, which makes instantiating daemon processes much simpler. There's no need to be afraid of what daemon processes are; they are just Linux/Unix processes that run in the background.
You specifically mention that you don't want multiple processes to write to your DB. It really sounds like you are concerned about deadlock issues from multiple daemons trying to read/write to the same table (please correct me if you are not, so I can modify my answer).
In order to avoid this issue, you can use row-level locking for your messages table. That way a daemon doesn't have to lock the entire table every time it wants to see if there are any jobs to pick up.
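For example, a minimal sketch of claiming one message under a row-level lock (the Message model and its columns are assumptions; SKIP LOCKED needs PostgreSQL 9.5+ or MySQL 8+):

message = nil
Message.transaction do
  # Lock only the row we claim, so other daemons can keep scanning the table.
  message = Message.where(processed: false)
                   .order(:id)
                   .lock('FOR UPDATE SKIP LOCKED')
                   .first
  message.update!(processed: true) if message
end
process(message) if message # hypothetical handler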
You also mention using 3 processes (I call them daemons out of habit) to perform a task, then notifying another process once those three are done. You could implement this as a specific/unique message left by your 3 workers.
For example: worker A finishes its job, so it writes a custom message to the special_messages_table. Workers B and C finish their tasks and also write to this table. The entire time these daemons are processing, your final daemon polls the special_messages_table to see whether all three jobs have finished. Once it detects that they have, it can start.
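A rough sketch of that polling loop (table, job, and method names are stand-ins):

loop do
  finished = SpecialMessage.where(job_name: %w[job_a job_b job_c], status: 'finished')
  if finished.distinct.count(:job_name) == 3
    run_final_task      # hypothetical: kick off the DB-populating step
    finished.delete_all # clear the markers for the next round
  end
  sleep 5               # poll interval
end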
This is just a rough outline of how you can use daemon processes to accomplish what you are asking. If you provide more details I would be happy to refine my answer. Don't be afraid of daemons!

How do I create a worker daemon which waits for jobs and executes them?

I'm new to Rails and multithreading and am curious about how to achieve the following in the most elegant way.
I couldn't find any nice tutorials which explain in detail the best design decision for the following task:
I have a couple of HTTP requests which will be run for a user in the background: for example, parsing a couple of websites to get information like the HTTP response code and response time, then returning the results. For performance reasons, I decided to split the total number of URLs into batches of 25 each, execute each batch in a thread, join the threads, and write the result to a database.
I decided to use the following gem (http://rubygems.org/gems/thread) to ensure that there's a maximum number of threads that are run simultaneously. So far so good.
The problem is, if two users start their analyses in parallel, the number of running threads is twice the maximum of my thread pool.
My solution (imho) is to create a worker daemon which runs on its own and waits for jobs from the clients.
My question is, what's the best way to achieve this in Rails?
Maybe create a Rake task, and use it as a daemon (see: "Daemoninsing a rake task") and (how?) add jobs to it?
Thank you very much in advance!
I'd build a queue in a table in the database, and a bit of code that is periodically started by cron, which walks that table, passing requests to Typhoeus and Hydra.
Here's how the author summarizes the gem:
Like a modern code version of the mythical beast with 100 serpent heads, Typhoeus runs HTTP requests in parallel while cleanly encapsulating handling logic.
As users add requests, append them to the table. You'll want fields like the ones below (sketched as a migration after the list):
A "processed" field so you can tell which were handled in case the system goes down.
A "success" field so you can tell which requests were processed successfully, so you can retry if they failed.
A "retry_count" field so you can retry up to "n" times, then flag that URL as unreachable.
A "next_scan_time" field that says when the URL should be scanned again so you don't DOS a site by hitting it continuously.
Typhoeus and Hydra are easy to use, and do make it easy to handle multiple requests.
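For instance, a minimal Hydra sketch for draining such a table (the concurrency level and model name are assumptions):

require 'typhoeus'

hydra = Typhoeus::Hydra.new(max_concurrency: 20)

UrlRequest.where(processed: false).find_each do |row|
  request = Typhoeus::Request.new(row.url, followlocation: true)
  request.on_complete do |response|
    row.update(processed: true, success: response.success?)
  end
  hydra.queue(request)
end

hydra.run # blocks until every queued request has finished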
There are a bunch of libraries for Rails that can manage queues of long-running background jobs for you. Here are a few:
Sidekiq uses Redis for job storage and supports multiple worker threads.
Resque also uses Redis, with each worker processing one job at a time.
delayed_job manages a job queue through ActiveRecord (or Mongoid).
Once you've chosen one, I'd recommend using Foreman to simplify launching multiple daemons at once.
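For example, a Procfile for Foreman can be as small as this (the exact commands depend on which library you pick; Sidekiq is shown here as an assumption):

web: bundle exec rails server
worker: bundle exec sidekiq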

Speeding up web service by writing to redis first, disk after?

I have a web service that runs multiple DB queries and takes roughly ~500ms-1,000ms (depending on how much I/O EC2 decides to give me at the given moment of invocation). Users want stuff faster than 1,000ms, and understandably so. What I'm thinking of doing is taking the request parameters, stuffing them into a Redis queue without writing to disk, and then running a job in an asynchronous queue which does the disk writes. Does something like this happen normally in practice? Am I insane for suggesting it?
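In other words, something like this sketch (the job and model names are placeholders; I'm using Sidekiq here only as an example of an asynchronous queue):

class PersistRequestJob
  include Sidekiq::Worker

  def perform(params)
    # The slow disk write happens here, off the request path.
    Record.create!(params)
  end
end

# In the controller: enqueueing is a fast Redis write, so we respond immediately.
# (Arguments must be JSON-serializable.)
PersistRequestJob.perform_async(request_params)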
So long as your Redis instance is persisting to disk at regular intervals, this should work. You want to limit the number of scenarios in which you might lose data; a sufficiently aggressive persistence schedule for Redis should cover most cases.
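For example, one reasonably aggressive redis.conf persistence setup (the values are illustrative, not a recommendation from this answer):

appendonly yes        # append-only file: log every write
appendfsync everysec  # fsync at most once per second, bounding the loss window
save 60 1000          # also snapshot if 1000+ keys change within 60 seconds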
Try to give the user immediate feedback that their action has been received and is being processed. Nothing is more confusing than a slight delay before anything appears; it might prompt people to attempt the upload again.

What available message solutions are there for inter-process communication in ruby?

I have a rails app using delayed_job. I need my jobs to communicate with each other for things like "task 5 is done" or "this is the list of things that need to be processed for task 5".
Right now I have a special table just for this, and I always access it inside a transaction. It's working fine. I want to build a cleaner API/DSL for it, but first wanted to check whether existing solutions already cover this. Weirdly, I haven't found a single thing; I'm either googling completely wrong, or the task is so simple (setting and getting values inside a transaction) that no one has abstracted it out yet.
Am I missing something?
Clarification: I'm not looking for a new queueing system; I'm looking for a way for background tasks to communicate with one another. Basically just safely shared variables. Do the frameworks below offer this facility? It's a shame that delayed_job does not.
Use case: "do these 5 tasks in parallel, and when they are all done, do this 1 final task." So each of the 5 tasks checks whether it's the last one to finish, and if it is, it fires off the final task.
I use Resque. There are also lots of plugins, which should make inter-process communication easier.
Using Redis has another advantage: you can use its pub/sub channels for communication between workers/services.
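A minimal pub/sub sketch with the redis-rb gem (the channel name is illustrative):

require 'redis'

# Publisher, e.g. at the end of a worker:
Redis.new.publish('tasks', 'task 5 is done')

# Subscriber, in a listening process (subscribe blocks its thread):
Redis.new.subscribe('tasks') do |on|
  on.message do |_channel, message|
    puts "received: #{message}"
  end
end

Keep in mind that pub/sub is fire-and-forget: a subscriber that isn't connected at publish time misses the message.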
Another approach (untested by me) is http://www.zeromq.org/, which also has Ruby bindings. If you like to try new stuff, give ZeroMQ a look.
Update
To clarify/explain/extend my comments above:
The reason to switch from DelayedJob to Resque is the advantage mentioned above: the queue and the messages live in one system, because Redis offers both.
Further sources:
https://github.com/blog/542-introducing-resque
https://github.com/defunkt/resque#readme
If I had to stay on DJ, I would extend the worker classes with Redis or ZeroMQ/0MQ (only examples here) to get messaging into my existing background jobs.
I would not try messaging with ActiveRecord/MySQL (not even queueing, actually!) because a relational DB isn't the best-performing system for this use case, especially if the application has many background workers, huge queues, and countless message exchanges in short periods.
If it's a small app with fewer workers, you could also implement simple messaging via the DB, but even then I would prefer memcached; messages are short-lived chunks of data that can be handled entirely in memory.
Shared variables will never be a good solution. Think of multiple machines that your application and your workers can live on. How would you ensure safe variable transfer between them?
Okay, someone could mention DRb (distributed Ruby), but it doesn't seem to be used much anymore (I've never seen a real-world example so far).
If you want to play around with DRb however, read this short introduction.
My personal preference order: Messaging (real) > Database driven messaging > Variable sharing
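For the fan-in use case in the question (5 parallel tasks, then 1 final task), a minimal Redis sketch using an atomic counter (key and method names are my own):

require 'redis'

REDIS = Redis.new

# When enqueuing the batch, record how many tasks must finish first.
REDIS.set('batch:42:pending', 5)

# Called at the end of each of the 5 tasks:
def task_finished(batch_id)
  # DECR is atomic, so exactly one worker sees the counter reach zero.
  if REDIS.decr("batch:#{batch_id}:pending") == 0
    enqueue_final_task(batch_id) # hypothetical kick-off of the final task
  end
end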
memcached
rabbitmq
You can use Pipes:
require 'timeout'

reader, writer = IO.pipe

# Child process: writes a marshalled payload twice a second.
fork do
  loop do
    payload = { name: 'Kris' }
    writer.puts Marshal.dump(payload)
    sleep(0.5)
  end
end

# Parent process: unmarshals each message as it arrives.
loop do
  begin
    Timeout.timeout(1) do
      puts Marshal.load(reader.gets) # => { name: 'Kris' }
    end
  rescue Timeout::Error
    # no-op, no messages to receive
  end
end
Note that this is one-way communication, read as a byte stream (hence the Marshal round-trip). Pipes are expressed as a pair, a reader and a writer, so to get two-way communication you need two sets of pipes.

Executing large numbers of asynchronous IO-bound operations in Rails

I'm working on a Rails application that periodically needs to perform large numbers of IO-bound operations. These operations can be performed asynchronously. For example, once per day, for each user, the system needs to query Salesforce.com to fetch the user's current list of accounts (companies) that he's tracking. This results in huge numbers (potentially > 100k) of small queries.
Our current approach is to use ActiveMQ with ActiveMessaging. Each of our users is pushed onto a queue as a separate message. Then, the consumer pulls a user off the queue, queries Salesforce.com, and processes the results. But this approach gives us horrible performance: within a single poller process, we can only process a single user at a time, so the Salesforce.com queries become serialized. Unless we run literally hundreds of poller processes, we can't come anywhere close to saturating the server running the poller.
We're looking at EventMachine as an alternative. It has the advantage of allowing us to kick off large numbers of Salesforce.com queries concurrently within a single EventMachine process, so we get great parallelism and utilization of our server.
But there are two problems with EventMachine. 1) We lose the reliable message delivery we had with ActiveMQ/ActiveMessaging. 2) We can't easily restart our EventMachine processes periodically to lessen the impact of memory growth. For example, with ActiveMessaging, we have a cron job that restarts the poller once per day, and this can be done without worrying about losing any messages. But with EventMachine, if we restart the process, we could literally lose hundreds of messages that were in progress. The only way I can see around this is to build a persistence/reliable-delivery layer on top of EventMachine.
Does anyone have a better approach? What's the best way to reliably execute large numbers of asynchronous IO-bound operations?
I maintain ActiveMessaging and have been thinking about the issues of a multi-threaded poller as well, though perhaps not at the same scale you guys are. I'll give you my thoughts here, but am also happy to discuss further on the ActiveMessaging list, or via email if you like.
One trick is that the poller is not the only serialized part of this. With STOMP subscriptions, if you use client ack to prevent losing messages on interrupt, a given connection will only be sent a new message once the prior message has been ack'd. Basically, you can have only one message being worked on at a time per connection.
So to keep using a broker, the trick will be to have many broker connections/subscriptions open at once. The current poller is pretty heavy for this, as it loads a whole Rails env per poller, and one poller is one connection. But there is nothing magical about the current poller; I could imagine writing one as an EventMachine client that creates multiple connections to the broker and takes many messages at once.
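A sketch of that idea with the stomp gem (connection details and the process method are assumptions, and this is untested):

require 'stomp'

# Open several connections so several messages can be in flight at once;
# with client ack, each connection works on only one message at a time.
clients = 5.times.map { Stomp::Client.new('stomp://guest:guest@localhost:61613') }

clients.each do |client|
  client.subscribe('/queue/users', ack: 'client') do |msg|
    process(msg.body)       # hypothetical per-user work
    client.acknowledge(msg) # ack only after the work succeeded
  end
end

clients.each(&:join) # wait on the listener threads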
In my own experiments lately, I have been thinking about using Ruby Enterprise Edition with a master process that forks many poller workers, so as to get the benefit of the reduced memory footprint (much like Passenger does), but I think the EM trick could work as well.
I am also an admirer of the Resque project, though I do not know that it would be any better at scaling to many workers; I think the workers might just be lighter weight.
http://github.com/defunkt/resque
I've used AMQP with RabbitMQ in a way that would work for you. Since ActiveMQ implements AMQP, I imagine you can use it in a similar way. I have not used ActiveMessaging, which, although it seems like an awesome package, I suspect is not appropriate for this use case.
Here's how you could do it, using AMQP:
Have the Rails process send a message saying "get info for user i".
The consumer pulls this off the message queue, making sure to specify that the message requires an 'ack' to be permanently removed from the queue. This means that if the message is not acknowledged as processed, it is eventually returned to the queue for another worker.
The worker then fans the message out into the thousands of small requests to Salesforce.
When all of these requests have successfully returned, another callback should fire to ack the original message and return a "summary message" with all the info germane to the original request. The key is using a message queue that lets you acknowledge successful processing of a given message, and making sure to do so only when the relevant processing is complete.
Another worker pulls that message off the queue and performs whatever synchronous work is appropriate. Since all the latency-inducing bits have already been performed, I imagine this should be fine.
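Sketched with the Bunny gem against RabbitMQ (queue name and the work method are illustrative; ActiveMQ would need a different client library):

require 'bunny'

conn = Bunny.new
conn.start
channel = conn.create_channel
queue   = channel.queue('user_sync', durable: true)

# block: true keeps the process alive; manual_ack defers removal until we ack.
queue.subscribe(manual_ack: true, block: true) do |delivery_info, _props, body|
  fetch_salesforce_accounts(body.to_i)    # hypothetical: the many small queries
  channel.ack(delivery_info.delivery_tag) # ack only once processing succeeded
end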
If you're using (C)Ruby, try never to combine synchronous and asynchronous code in a single process. A process should either do everything via EventMachine, with no blocking code, or talk to an EventMachine process only via a message queue.
Also, asynchronous code is incredibly useful, but it is difficult to write, difficult to test, and bug-prone. Be careful. Investigate using another language or tool if appropriate.
Also check out "cramp" and "beanstalk".
Someone sent me the following link: http://github.com/mperham/evented/tree/master/qanat/. This is a system somewhat similar to ActiveMessaging, except that it is built on top of EventMachine. It's almost exactly what we need. The only problem is that it seems to work only with Amazon's queue, not ActiveMQ.
