How to make multiple parallel concurrent requests with Rails and Heroku - ruby-on-rails

I am currently developing a Rails application which takes a long list of links as input, scrapes them using a background worker (Resque), then serves the results to the user. However, in some cases there are numerous URLs, and I would like to make multiple requests in parallel/concurrently so that it takes much less time, rather than waiting for one request to complete, scraping the page, and moving on to the next one.
Is there a way to do this in Heroku/Rails? Where might I find more information?
I've come across resque-pool, but I'm not sure whether it would solve this issue and/or how to implement it. I've also read about using different types of servers to run Rails in order to make concurrency possible, but I don't know how to modify my current situation to take advantage of this.
Any help would be greatly appreciated.

Don't use Resque. Use Sidekiq instead.
Resque runs in a single-threaded process, meaning its workers run synchronously, while Sidekiq runs in a multithreaded process, meaning its workers run asynchronously/simultaneously in different threads.
Make sure you assign one URL to scrape per worker; it's no use having one worker scrape multiple URLs.
With Sidekiq, you can pass the link to a worker, e.g.
LINKS = [...]
LINKS.each do |link|
  ScrapeWorker.perform_async(link)
end
perform_async doesn't actually execute the job right away. Instead, the link is just put in a queue in Redis along with the worker class name, and later (possibly milliseconds later) a worker is assigned to execute each queued job in its own thread by running the perform instance method in ScrapeWorker. Sidekiq will make sure to retry if an exception occurs during the execution of a worker.
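For reference, a minimal sketch of such a worker (the scraping body is illustrative, not prescribed):

require 'sidekiq'
require 'net/http'

class ScrapeWorker
  include Sidekiq::Worker

  def perform(link)
    html = Net::HTTP.get(URI(link)) # fetch the page
    # ... parse the HTML and persist the results ...
  end
end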
PS: You don't have to pass a link to the worker. You can store the links in a table and then pass the ids of the records to the workers.
More info about sidekiq

Adding these two lines to your code will also let you wait until the last job is complete before proceeding:

# Wait until at least one job has been enqueued or is being worked on, so an
# empty queue isn't misread as "all jobs completed":
sleep(0.2) until Sidekiq::Queue.new.size > 0 || Sidekiq::Workers.new.size > 0

# Then wait until the queue is drained and no workers are busy:
sleep(0.5) until Sidekiq::Workers.new.size == 0 && Sidekiq::Queue.new.size == 0

Related

How to correctly use Resque workers?

I have the following tasks to do in a rails application:
Download a video
Trim the video with FFMPEG between a given duration (Eg.: 00:02 - 00:09)
Convert the video to a given format
Move the converted video to a folder
Since I wanted to make this happen in background jobs, I used 1 resque worker that processes a queue.
For the first job, I have created a queue like this
#queue = :download_video that does its task, and at the end of the task I move on to the next task by calling Resque.enqueue(ConvertVideo, name, itemId). In this way, I have created a chain of queues that are enqueued when one task is finished.
This is very wrong, since if the first job starts to enqueue the other jobs (one from another), then everything gets blocked with 1 worker until the first list of queued jobs is finished.
How should this be optimised? I tried adding more workers to this way of enqueueing jobs, but the results are wrong and unpredictable.
Another aspect is that each job is saving a status in the database and I need the jobs to be processed in the right order.
Should each worker do a single job from above and have at least 4 workers? If I double the amount to 8 workers, would it be an improvement?
Have you considered using Sidekiq?
As the Sidekiq documentation says:
resque uses redis for storage and processes messages in a single-threaded process. The redis requirement makes it a little more difficult to set up, compared to delayed_job, but redis is far better as a queue than a SQL database. Being single-threaded means that processing 20 jobs in parallel requires 20 processes, which can take a lot of memory.
sidekiq uses redis for storage and processes jobs in a multi-threaded process. It's just as easy to set up as resque but more efficient in terms of raw processing speed. Your worker code does need to be thread-safe.
So you should have two kinds of jobs: downloading videos and converting videos. Any download job can run in parallel with the others (you can limit that if you want), and each result is stored in one queue (the "in-between" queue) before being converted by multiple convert jobs, also in parallel.
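As a sketch, that two-stage chain could look like this with Sidekiq (class, queue, and argument names are assumptions):

class DownloadVideoWorker
  include Sidekiq::Worker
  sidekiq_options queue: :download_video

  def perform(video_id)
    # ... download the file ...
    ConvertVideoWorker.perform_async(video_id) # hand off to the next stage
  end
end

class ConvertVideoWorker
  include Sidekiq::Worker
  sidekiq_options queue: :convert_video

  def perform(video_id)
    # ... trim/convert with FFMPEG and move the output ...
  end
end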
I hope that helps, this link explains quite well the best practices in Sidekiq : https://github.com/mperham/sidekiq/wiki/Best-Practices
As @Ghislaindj noted, Sidekiq might be an alternative, largely because it offers plugins that control execution ordering.
See this list:
https://github.com/mperham/sidekiq/wiki/Related-Projects#execution-ordering
Nonetheless, yes, you should be using different queues and more workers which are specific to each queue. So you have a set of workers all working on the :download_video queue, and then your other workers attached to the :convert_video queue, etc.
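With plain Resque, dedicated workers per queue can be started with the standard rake invocation (queue names follow the question; run as many copies of each as you need):

$ QUEUE=download_video rake resque:work
$ QUEUE=convert_video rake resque:work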
If you want to continue using Resque, another approach would be to use delayed execution: when you enqueue your subsequent jobs, you specify a delay parameter.
Resque.enqueue_in(10.seconds, ConvertVideo, name, itemId)
The downside to using delayed execution in Resque is that it requires the resque-scheduler package, so you're introducing a new dependency:
https://github.com/resque/resque-scheduler
For comparison, Sidekiq has delayed execution available natively.
Have you considered merging all four tasks into just one? In that case you can have any number of workers and one will do the whole job. It will work very predictably; you can even know how much time it will take to finish the task. You also don't have the problem where one of the subtasks takes longer than all the others and piles up in the queue.

Block/Re-queue other sidekiq jobs from processing when existing sidekiq job is processing a particular resource

I have sidekiq jobs doing processing on many types of resources. However, for one particular type of resource, e.g. resource X, I need to ensure that only 1 sidekiq job can process that particular resource at any given time.
For example, if I have 3 sidekiq jobs that get queued simultaneously and want to interact with resource X, then only 1 sidekiq job can process resource X while the 2 remaining sidekiq jobs have to wait (or be re-queued) until the sidekiq job that is currently processing the resource finishes.
Currently, I am trying to add a record to a database table whenever a sidekiq job is processing the resource, and use that to stop other sidekiq jobs from processing the resource until that record is deleted by the sidekiq job that added it (when it finishes processing resource X) or until a certain amount of time has elapsed (e.g. if the record was created more than 5 minutes ago, it is considered to no longer hold exclusive access to resource X, and the next sidekiq job that wants to process resource X may alter that record and claim exclusive access).
Pseudocode of my current implementation:

def perform(res_id, res_type)
  # Only applies to "RESOURCE_X"
  if res_type == RESOURCE_X
    processor = ResourceProcessor.where(res_id: res_id).first
    if processor.nil? || (Time.now - processor.created_at) > 5.minutes
      ResourceProcessor.create(res_id: res_id)
      process_resource_x(res_id)
      # Letting other sidekiq jobs know they can now fight over who gets to process resource X
      ResourceProcessor.where(res_id: res_id).destroy_all
    else
      SidekiqWorker.perform_in(5.minutes, res_id, res_type) # Try again later
    end
  else
    process_other_resource(res_id)
  end
end
Unfortunately, my solution does not work. It works just fine if there is a delay between the sidekiq jobs that want to process resource X; however, if the jobs arrive simultaneously, my solution falls apart.
Is there any way I can enforce some sort of synchronization only when processing resource X?
By the way, my sidekiq jobs may be distributed across several machines (but they all access the same Redis server on a dedicated machine).
I did more research based on the comment provided by Thomas.
The link he provided was extremely useful. They implemented their own custom Lock class to achieve the results they wanted. However, I did not use their custom lock code because I needed a different behaviour.
The specific behaviour I was looking to implement is "requeue if locked", not "wait if locked".
There are alternative tools I could have used, such as redis-semaphore and the with_advisory_lock gem.
I tested redis-semaphore and found it buggy: it wasn't returning the lock state and resource count correctly. Also, after checking the issues on GitHub, I found that in some situations redis-semaphore may get itself into a deadlock, so I decided to abandon it. As a result, I also decided against with_advisory_lock due to its lower star count than redis-semaphore.
In the end I found a way to implement the locking pattern I described in my question, which is to block sidekiq jobs based on a value in my database. I dealt with the concurrency issue of multiple sidekiq jobs reading stale values by locking the entire database row with Rails' own pessimistic locking (ActiveRecord::Locking::Pessimistic). This ensures that only 1 sidekiq worker can access the database row holding the locking value at any given time. The locking period is kept to a minimum because only a read and, when applicable, a write are performed while the row is locked; subsequent operations such as requeueing and cleanup are done afterwards.
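A minimal sketch of that approach, assuming a hypothetical ResourceLock model with a locked_until column (none of these names are from the poster's actual code):

def perform(res_id, res_type)
  return process_other_resource(res_id) unless res_type == RESOURCE_X

  lock_row = ResourceLock.find_or_create_by!(res_id: res_id)
  acquired = false
  # with_lock opens a transaction and issues SELECT ... FOR UPDATE, so only
  # one worker at a time can read or update this row
  lock_row.with_lock do
    if lock_row.locked_until.nil? || lock_row.locked_until < Time.now
      lock_row.update!(locked_until: 5.minutes.from_now) # claim exclusive access
      acquired = true
    end
  end

  if acquired
    process_resource_x(res_id)
    lock_row.update!(locked_until: nil) # release the claim
  else
    self.class.perform_in(30.seconds, res_id, res_type) # requeue and retry later
  end
end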

How do I create a worker daemon which waits for jobs and executes them?

I'm new to Rails and multithreading and am curious about how to achieve the following in the most elegant way.
I couldn't find any nice tutorials which explain in detail the best design decision for the following task:
I have a couple of HTTP requests which will be run for a user in the background, for example parsing a couple of websites and getting some information like HTTP response code and response time, then returning the results. For performance reasons, I decided to split the total number of URLs into batches of 25 each, execute each batch in a thread, join the threads, and write the results to a database.
I decided to use the following gem (http://rubygems.org/gems/thread) to ensure that there's a maximum number of threads that are run simultaneously. So far so good.
The problem is, if two users start their analysis in parallel, the maximum number of threads is two times the maximum of my threadpool.
My solution (imho) is to create a worker daemon which runs on its own and waits for jobs from the clients.
My question is, what's the best way to achieve this in Rails?
Maybe create a Rake task and use it as a daemon (see: "Daemonising a rake task"), and (how?) add jobs to it?
Thank you very much in advance!
I'd build a queue in a table in the database, and a bit of code that is periodically started by cron, which walks that table, passing requests to Typhoeus and Hydra.
Here's how the author summarizes the gem:
Like a modern code version of the mythical beast with 100 serpent heads, Typhoeus runs HTTP requests in parallel while cleanly encapsulating handling logic.
As users add requests, append them to the table; a migration sketch follows the list. You'll want fields like:
A "processed" field so you can tell which were handled in case the system goes down.
A "success" field so you can tell which requests were processed successfully, so you can retry if they failed.
A "retry_count" field so you can retry up to "n" times, then flag that URL as unreachable.
A "next_scan_time" field that says when the URL should be scanned again so you don't DOS a site by hitting it continuously.
Typhoeus and Hydra are easy to use, and do make it easy to handle multiple requests.
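A sketch of the Typhoeus/Hydra side (the URL list and concurrency limit are illustrative):

require 'typhoeus'

hydra = Typhoeus::Hydra.new(max_concurrency: 20) # cap simultaneous requests
urls.each do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  request.on_complete do |response|
    # record response.code, response.total_time, etc. back to the table
  end
  hydra.queue(request)
end
hydra.run # blocks until every queued request has finished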
There are a bunch of libraries for Rails that can manage queues of long-running background jobs for you. Here are a few:
Sidekiq uses Redis for job storage and supports multiple worker threads.
Resque also uses Redis, with a single worker thread per process.
delayed_job manages a job queue through ActiveRecord (or Mongoid).
Once you've chosen one, I'd recommend using Foreman to simplify launching multiple daemons at once.

How to lock Resque jobs to one server

I have a "cluster" of Resque servers in my infrastructure. They all have the same exact job priorities etc. I automagically scale the number of Resque servers up and down based on how many pending jobs there are and available resources on the servers to handle said jobs. I always have a minimum of two Resque servers up.
My issue is that when I do a quick, one off job, sometimes both the servers process that job. This is bad.
I've tried adding a lock to my job with something like the following:
require 'resque-lock-timeout'

class ExampleJob
  extend Resque::Plugins::LockTimeout
  @queue = :example_queue # queue name required by Resque

  def self.perform
    # some code
  end
end
This plugin works for longer-running jobs. However, for these super tiny one-off jobs, processing happens right away. Neither Resque server sees the lock set by its sister server: both set a lock, process the job, unlock, and are done.
I'm not entirely sure what to do at this point or what solutions there are, except for having one dedicated server handle this type of job. That would be a serious pain to configure and scale. I really want both servers to be able to handle it, but once one of them grabs a job from the queue, ensure the other does not run it.
Can anyone suggest some viable solution(s)?
Write your lock interpreter to wait T milliseconds before it looks for a lock with a unique_id less than the value of the lock it made.
This will determine who won the race, and the loser will self-terminate.
T is the parallelism latency between all N servers in the pool of a given queue. You can determine it heuristically by scaling back from 1000 milliseconds until you again find the job running in duplicate. Give padding for latency variation.
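As a rough sketch of that race check with Redis (the key layout, the sorted-set approach, and the T value are all assumptions, not part of the answer):

require 'redis'

T_MS = 500 # parallelism latency, tuned heuristically as described above

# Every contender registers its unique_id for the job, waits T ms so all
# contenders have had time to register, then checks for a smaller id.
def won_race?(redis, job_key, my_unique_id)
  redis.zadd("locks:#{job_key}", my_unique_id, my_unique_id.to_s)
  sleep(T_MS / 1000.0)
  smallest = redis.zrange("locks:#{job_key}", 0, 0).first.to_i
  smallest == my_unique_id # losers self-terminate
end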
This is called the busy-wait solution to mutex thread safety. It is considered one of the acceptable trade-offs given the various scenarios in which one must solve mutual exclusion (e.g. locking, etc.).
I'll post some links when off mobile. The Wikipedia entry on mutex should explain all this.
If this won't work for you, then:
1. Use a scheduler to control duplication.
2. Classify short-running jobs to a queue designed to run them in serial.
TL;DR: there is no perfect solution, only a good trade-off for your conditions.
It should not be possible for two workers to get the same 'payload' because items are dequeued using BLPOP. Redis will only send the queued item to the first client that calls BLPOP. It sounds like you are enqueueing the job more than once and therefore two workers are able to acquire different payloads with the same arguments. The purpose of 'resque-lock-timeout' is to assure that payloads that have the same method and arguments do not run concurrently; it does not however stop the second payload from being worked if the first job releases the lock before the second job tries to acquire it.
That would explain why this only happens with short-running jobs. Here is what might be happening:
payload 1 is enqueued
payload 2 is enqueued
payload 1 is locked
payload 1 is worked
payload 1 is unlocked
payload 2 is locked
payload 2 is worked
payload 2 is unlocked
Whereas in long-running jobs the following scenario might happen:
payload 1 is enqueued
payload 2 is enqueued
payload 1 is locked
payload 1 is worked
payload 2 fails to get lock
payload 1 is unlocked
Try turning off Resque and enqueueing your job. Take a look in Redis at the list for your Resque queue (or monitor Redis using redis-cli monitor). See if Resque has queued more than one payload. If you still only see one payload, then monitor the list to see if another one of your Resque workers is calling recreate on failed jobs.
If you want to have 'resque-lock-timeout' hold the lock for longer than the duration it takes to process the job, you can override the release_lock! method to set an expiry on the lock instead of just deleting it.
module Resque
  module Plugins
    module LockTimeout
      def release_lock!(*args)
        lock_redis.expire(redis_lock_key(*args), 60) # expire lock after 60 seconds
      end
    end
  end
end
https://github.com/lantins/resque-lock-timeout/blob/master/lib/resque/plugins/lock_timeout.rb#l153-155

Running multiple background parallel jobs with Rails

On my Ruby on Rails application I need to execute 50 background jobs in parallel. Each job creates a TCP connection to a different server, fetches some data and updates an active record object.
I know different solutions for performing this task, but none of them runs the jobs in parallel. For example, delayed_job (DJ) could be a great solution if only it could execute all jobs in parallel.
Any ideas? Thanks.
It is actually possible to run multiple delayed_job workers.
From http://github.com/collectiveidea/delayed_job:
# Runs two workers in separate processes.
$ RAILS_ENV=production script/delayed_job -n 2 start
$ RAILS_ENV=production script/delayed_job stop
So, in theory, you could just execute:
$ RAILS_ENV=production script/delayed_job -n 50 start
This will spawn 50 processes; however, whether that's advisable depends on the resources of the system you're running this on.
An alternative option would be to use threads. Simply spawn a new thread for each of your jobs.
One thing to bear in mind with this method is that ActiveRecord is not thread-safe. You can make it thread-safe using the following setting:
ActiveRecord::Base.allow_concurrency = true
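A sketch of that thread-per-job approach (the jobs collection and its run method are placeholders); each thread checks out its own database connection so ActiveRecord stays safe:

threads = jobs.map do |job|
  Thread.new do
    # take a connection from the pool for the duration of the block
    ActiveRecord::Base.connection_pool.with_connection do
      job.run # placeholder for the actual TCP fetch + record update
    end
  end
end
threads.each(&:join) # wait for all jobs to finish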
Some thoughts...
Just because you need to read 50 sites and naturally want some parallelism does not mean that you need 50 processes or threads. You need to balance the slowdown against the overhead. How about having 10 or 20 processes, each reading a few sites?
Depending on which Ruby you are using, be careful about green threads; you may not get the parallel result you want.
You might want to structure it like a reverse, client-side inetd, and use connect_nonblock and IO.select to get the parallel connections you want by making all the servers respond in parallel. You don't really need parallel processing of the results; you just need to get in line at all the servers in parallel, because that is where the latency really is.
So, something like this from the socket library... extend it for multiple outstanding connections...
require 'socket'
include Socket::Constants

socket = Socket.new(AF_INET, SOCK_STREAM, 0)
sockaddr = Socket.sockaddr_in(80, 'www.google.com')

begin
  socket.connect_nonblock(sockaddr)
rescue Errno::EINPROGRESS
  IO.select(nil, [socket]) # wait until the socket is writable
  begin
    socket.connect_nonblock(sockaddr)
  rescue Errno::EISCONN
    # already connected; carry on
  end
end

socket.write("GET / HTTP/1.0\r\n\r\n")
# here perhaps insert IO.select. You may not need multiple threads OR multiple
# processes with this technique, but if you do, insert them here
results = socket.read
Since you're working with Rails, I would advise you to use delayed_job for this rather than splitting off into threads or forks. The reason being: dealing with timeouts and such while the browser is waiting can be a real pain. There are two approaches you can take with DJ.
The first is to spawn 50+ workers. Depending on your environment this may be a pretty memory-heavy solution, but it works great. Then when you need to run your job, just make sure you create 50 unique jobs. If there is too much memory bloat and you want to do things this way, make a separate environment that is stripped down, specifically for your workers.
The second way is to create a single job that uses Curl::Multi to run your 50 concurrent TCP requests. You can find out more about this here: http://curl-multi.rubyforge.org/ That way, you could have one background processor running all of your TCP requests in parallel.
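The linked curl-multi gem's API differs slightly; as an illustration, here is the same idea with the curb gem's Curl::Multi (URLs are placeholders):

require 'curb'

urls = ['http://example.com/a', 'http://example.com/b'] # placeholders

multi = Curl::Multi.new
urls.each do |url|
  easy = Curl::Easy.new(url)
  easy.on_complete { |curl| puts curl.response_code } # record the result here
  multi.add(easy)
end
multi.perform # runs all requests concurrently; returns when every one is done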
