Run scripts in parallel in ruby - ruby-on-rails

I need to convert videos in 4 threads
For example, I have ActiveRecord Video models with the titles Video1, Video2, Video3, Video4, Video5.
So I need to execute something like this:
bundle exec script/video_converter start
where the script processes unconverted videos in 4 threads, doing something like:
Video.where(state: 'unconverted').first.process
But when one of the 4 videos finishes converting, the next video should automatically be picked up by the freed thread.
What is the best solution for this? The Sidekiq gem? The Daemons gem plus Ruby threads managed by hand?
For now I am using this script:
THREAD_COUNT = 4
SLEEP_TIME = 5
logger = CONVERTATION_LOG
spawns = []

loop do
  videos = Video.where(state: 'unconverted').limit(THREAD_COUNT).reorder("ID DESC")
  videos.each do |video|
    spawns << Spawnling.new do
      result = video.process
      if result.nil?
        video.create_thumbnail!
      else
        video.failured!
      end
    end
  end
  Spawnling.wait(spawns)
  sleep(SLEEP_TIME)
end
But this script waits for all 4 videos to finish, and only then takes the next 4. I want the next video to be picked up automatically as soon as any one of the 4 threads becomes free.

If your goal is to keep processing videos using just 4 threads (or whatever Spawnling is configured to use, since it supports both fork and thread), you could push all your video records onto a Queue, spawn 4 workers, and let them keep processing records one by one until the queue is empty.
require "rails"
require "spawnling"
# In your case, videos are read from DB, below array is for illustration
videos = ["v1", "v2", "v3", "v4", "v5", "v6", "..."]
THREAD_COUNT = 4
spawns = []
q = Queue.new
videos.each {|i| q.push(i) }
THREAD_COUNT.times do
  spawns << Spawnling.new do
    loop do
      # Non-blocking pop: avoids the race between an empty? check and pop
      # when several workers drain the queue at once.
      v = begin
            q.pop(true)
          rescue ThreadError
            break
          end
      # simulate processing
      puts "Processing video #{v}"
      # simulate processing time
      sleep(rand(10))
    end
  end
end
Spawnling.wait(spawns)
This answer is inspired by this answer.
PS: I have added a few requires and defined the videos array to make the above code a self-contained, runnable example.
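As an aside, the Sidekiq option mentioned in the question would give the same "refill the free slot" behaviour out of the box, since the Sidekiq server keeps a fixed worker pool (for example started with sidekiq -c 4). A minimal sketch, with an illustrative worker class and queue name:
# Sketch only: worker/queue names are illustrative, not from the question.
class VideoConvertWorker
  include Sidekiq::Worker
  sidekiq_options queue: :video_conversion

  def perform(video_id)
    video = Video.find(video_id)
    result = video.process
    result.nil? ? video.create_thumbnail! : video.failured!
  end
end

# Enqueue everything that still needs converting; a free worker slot
# picks up the next job as soon as one finishes.
Video.where(state: 'unconverted').find_each do |video|
  VideoConvertWorker.perform_async(video.id)
end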

Related

Using limit and offset in rails together with updated_at and find_each - will that cause a problem?

I have a Ruby on Rails project with millions of products, each with a different URL. I have a method "test_response" that checks the URL and sets the Product attribute marked_as_broken to true or false; either way the Product is saved, which updates its "updated_at" attribute to the current timestamp.
Since this is a very slow process I have created a task which in turn starts 15 tasks, each with N/15 products to check. The first one should check, for example, the 1st to the 10,000th product, the second one the 10,000th to the 20,000th, and so on, using limit and offset.
The script starts fine and launches 15 processes, but one process after another completes far too early. They do not crash; each finishes with "Process exited with status 0".
My guess is that using find_each together with a condition on updated_at, while the script itself is updating "updated_at", changes the result set underneath the script, so it does not go through the 10,000 items as intended, but I can't verify this.
Is there something inherently wrong with what I'm doing here? For example, does "find_each" run a new SQL query every so often, providing different results each time than anticipated? I expect it to return the same 10,000 to 20,000 products, just split into batches.
task :big_response_launcher => :environment do
  nbr_of_fps = Product.where(:marked_as_broken => false).where("updated_at < '" + 1.year.ago.to_date.to_s + "'").size.to_i
  nbr_of_processes = 15
  batch_size = ((nbr_of_fps / nbr_of_processes)) - 2
  heroku = PlatformAPI.connect_oauth(auth_code_provided_elsewhere)
  (0..nbr_of_processes-1).each do |i|
    puts "Launching #{i.to_s}"
    current_offset = batch_size * i
    puts "rake big_response_tester[#{current_offset},#{batch_size}]"
    heroku.dyno.create('kopa', {
      :command => "rake big_response_tester[#{current_offset},#{batch_size}]",
      :attach => false
    })
  end
end

task :big_response_tester, [:current_offset, :batch_size] => :environment do |task, args|
  current_limit = args[:batch_size].to_i
  current_offset = args[:current_offset].to_i
  puts "Launching with offset #{current_offset.to_s} and limit #{current_limit.to_s}"
  Product.where(:marked_as_broken => false).where("updated_at < '" + 1.year.ago.to_date.to_s + "'").limit(current_limit).offset(current_offset).find_each do |fp|
    fp.test_response
  end
end
As many have noted in the comments, it seems that find_each ignores order and limit. I found this answer (ActiveRecord find_each combined with limit and order) that seems to be working for me. It's not working 100% but it is a definite improvement. The rest seems to be a memory issue, i.e. I cannot have too many processes running at the same time on Heroku.
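One alternative sketch (not from the linked answer; the numbers here are illustrative) is to partition the work by primary-key range instead of limit/offset, since find_each already batches by id and ignores both:
# Hypothetical sketch: give each of the 15 workers a disjoint id window
# so find_each's own id-based batching never conflicts with limit/offset.
scope = Product.where(:marked_as_broken => false)
               .where("updated_at < ?", 1.year.ago.to_date)

min_id = scope.minimum(:id)
max_id = scope.maximum(:id)
slice  = ((max_id - min_id) / 15) + 1

15.times do |i|
  lower = min_id + i * slice
  upper = lower + slice - 1
  # In the real task each id range would go to its own dyno/process.
  scope.where(:id => lower..upper).find_each { |fp| fp.test_response }
end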

Jobs update with Dashing and Ruby

I use Dashing to monitor trends and website statistics.
I created jobs to check Google News trends and Twitter trends.
The data displays correctly, but it only appears on the first load and never updates afterwards. Here is the code for twitter_trends.rb:
require 'nokogiri'
require 'open-uri'

url = 'http://trends24.in/france/~cloud'
data = Nokogiri::HTML(open(url))
list = data.xpath('//ol/li')
tags = list.collect do |tag|
  tag.xpath('a').text
end
tags = tags.take(10)

tag_counts = Hash.new({value: 0})

SCHEDULER.every '10s' do
  tag = tags.sample
  tag_counts[tag] = {label: tag}
  send_event('twitter_trends', {items: tag_counts.values})
end
I think I'm using "rufus-scheduler" incorrectly to schedule my jobs: https://gist.github.com/pushmatrix/3978821#file-sample_job-rb
How can I make the data update correctly on a regular basis?
Your scheduler looks fine, but it looks like you're making one call to the website:
data = Nokogiri::HTML(open(url))
But never calling it again. Is your intent to check that site only once, along with the initial processing of it?
I assume you'd really want to wrap more of your logic into the scheduler loop - only the things inside it will be rerun when the scheduled job fires.
When you moved everything into the scheduler, you are only taking one sample every 10 seconds (http://ruby-doc.org/core-2.2.0/Array.html#method-i-sample) and then adding it to tag_counts, which is recreated on every run. The thing to remember about schedulers is that each run is basically a clean slate. I'd recommend looping through tags and adding them all to tag_counts instead of sampling; sampling is unnecessary anyway, since you already reduce the list to 10 each time the scheduler runs (see the sketch below).
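A minimal sketch of that suggestion (assuming the same SCHEDULER and send_event helpers available in a Dashing job): re-fetch the page and emit all ten tags on every tick.
require 'nokogiri'
require 'open-uri'

url = 'http://trends24.in/france/~cloud'

SCHEDULER.every '10s' do
  # Re-fetch and re-parse the page on every tick so the widget actually updates.
  data = Nokogiri::HTML(open(url))
  tags = data.xpath('//ol/li').collect { |tag| tag.xpath('a').text }.take(10)

  # Send all ten tags instead of one random sample.
  items = tags.map { |t| { label: t } }
  send_event('twitter_trends', { items: items })
end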
If I move the SCHEDULER like this (right after url at the top), it works, but then only one random item appears every 10 seconds.
require 'nokogiri'
require 'open-uri'

url = 'http://trends24.in/france/~cloud'

SCHEDULER.every '10s' do
  data = Nokogiri::HTML(open(url))
  list = data.xpath('//ol/li')
  tags = list.collect do |tag|
    tag.xpath('a').text
  end
  tags = tags.take(10)
  tag_counts = Hash.new({value: 0})
  tag = tags.sample
  tag_counts[tag] = {label: tag}
  send_event('twitter_trends', {items: tag_counts.values})
end
How can I display a list of 10 items that is updated regularly?

Run external processes in non-blocking mode

I want to perform some actions in parallel periodically and, once they're all done, show the results to the user on a page. This happens approximately once per 5 minutes, depending on the users' activity.
These actions are performed by external, third-party applications (processes). There are about 4 of them now, so I have to run 4 external processes for each user request.
While they are running, I show the user a page with an ajax spinner and send ajax requests to the server to check whether everything is done. Once done, I show the results.
Here is a rough version of what I have
class MyController
  def my_action request_id
    res = external_apps_cmds_with_args.each do |x|
      # new process
      res = Open3.popen3 x do |stdin, stdout, stderr, wait_thr|
        exit_value = wait_thr.value.exitstatus
        if exit_value == 0 ....
      end
    end
    write_res_to_db res, request_id # each external app writes its own result to the db for each request_id
  end
end
The calculations CAN be done in parallel because there's NO overall result here, there are only the results from each tool. There is no race condition.
So I want them to run in non-blocking mode, obviously.
Is Open3.popen3 a non-blocking call? Or should I run the external processes in separate threads:
threads = []
external_apps_cmds_with_args.each do |x|
  # new threads
  threads << Thread.new do
    # new process
    res = Open3.popen3 x do |stdin, stdout, stderr, wait_thr|
      exit_value = wait_thr.value.exitstatus
      if exit_value == 0 ....
    end
  end
  write_res_to_db res, request_id # each external app writes its own result to the db for each request_id
end
threads.each &:join
Or should I create only one thread?
# only one new thread
thread = Thread.new do
  res = external_apps_cmds_with_args.each do |x|
    # new process
    res = Open3.popen3 x do |stdin, stdout, stderr, wait_thr|
      exit_value = wait_thr.value.exitstatus
      if exit_value == 0 ....
    end
  end
  write_res_to_db res, request_id # each external app writes its own result to the db for each request_id
end
thread.join
Or should I continue using the approach I'm using now: NO threads at all?
What I would suggest is that you have one action to load the page and then a separate ajax action for each process. As the processes finish they will return data to the user (presumably in different parts of the page) and you will take advantage of the multi-process/threading capabilities of your webserver.
This approach has some issues because, like your original ideas, it ties up some of your web processes while the external processes are running, and you may run into timeouts. If you want to avoid that, you could run them as background jobs (delayed_job, resque, etc.) and then display the data when the jobs have finished.
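A rough sketch of that background-job variant, using Resque as the example queue; the job class name is illustrative, and write_res_to_db is assumed to be reachable from the worker (for example, extracted into a model or service object):
require 'open3'

class ExternalToolJob
  @queue = :external_tools

  # Runs one external tool in a worker process; the web request only enqueues.
  def self.perform(cmd, request_id)
    output = nil
    Open3.popen3(cmd) do |stdin, stdout, stderr, wait_thr|
      output = stdout.read if wait_thr.value.exitstatus == 0
    end
    ResultWriter.write_res_to_db(output, request_id) # assumed helper
  end
end

# In the controller action: enqueue one job per tool and return immediately;
# the existing ajax polling then checks the db until all results are present.
external_apps_cmds_with_args.each do |cmd|
  Resque.enqueue(ExternalToolJob, cmd, request_id)
end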

Duplicated results on Ruby threading

I need to improve a rake task that builds clothing looks by fetching images from an external server.
When I try to create multiple threads, the results are duplicated.
But if I put sleep 0.1 before each Thread.new, the code works! Why?
new_looks = []
threads = []

for look in looks
  # sleep 0.1 - when I put it here, it works!
  threads << Thread.new do
    # an external http request is done here
    new_looks << Look.new(ref: look["look_ref"])
  end
end
puts 'waiting threads to finish...'
threads.each(&:join)
puts 'saving...'
new_looks.sort_by(&:ref).each(&:save)
Array is not generally thread safe. Switch to a thread-safe data structure such as Queue:
new_look_queue = Queue.new
threads = looks.map do |look|
  Thread.new do
    new_look_queue.enq Look.new(ref: look["look_ref"])
  end
end
puts 'waiting threads to finish...'
threads.each(&:join)
puts 'saving...'
new_looks = []
while !new_look_queue.empty?
  new_looks << new_look_queue.deq
end
new_looks.sort_by(&:ref).each(&:save)
Queue#enq puts a new entry in the queue; Queue#deq gets one out, blocking if there isn't one.
If you don't need the new_looks saved in order, the code gets simpler:
puts 'saving...'
while !new_look_queue.empty?
  new_look_queue.deq.save
end
Or, even simpler yet, just do the save inside the thread.
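A minimal version of that simplest variant (a sketch; note that each saving thread checks out its own ActiveRecord database connection from the pool):
threads = looks.map do |look|
  Thread.new do
    # Build and save inside the thread; no shared collection is needed.
    Look.new(ref: look["look_ref"]).save
  end
end
threads.each(&:join)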
If you have a great many looks, the above code will create more threads than is good. Too many threads make the requests take too long to process and consume excess memory. In that case, consider creating a fixed number of worker threads:
NUM_THREADS = 8
As before, there's a queue of finished work:
new_look_queue = Queue.new
But there's now also a queue of work to be done:
look_queue = Queue.new
looks.each do |look|
  look_queue.enq look
end
Each thread will live until it's out of work, so let's add some "out of work" symbols to the queue, one for each thread:
NUM_THREADS.times { look_queue.enq :done }
And now the threads:
threads = NUM_THREADS.times.map do
  Thread.new do
    while (look = look_queue.deq) != :done
      new_look_queue.enq Look.new(ref: look["look_ref"])
    end
  end
end
Processing the new_look_queue is the same as above.
Try updating your code to this:
for look in looks
  threads << Thread.new(look) do |lk|
    new_looks << Look.new(ref: lk["look_ref"])
  end
end
Passing look to Thread.new gives each thread its own copy of the block variable, so slow-starting threads no longer all read whatever look points to by the time they run. This should help you.
UPD: Forgot about Thread.new(args)

Server Side Timers with Juggernaut 2

I am writing a rails app with Juggernaut 2 for real-time push notifications and am not sure how to approach this problem. I have a number of users in a chat room and I would like to run a timer so that a push can go out to each browser in the chat room every 30 seconds. Juggernaut 2 is built on node.js, so I'm assuming I need to write this code there. I just have no idea where to start in terms of integrating this with Juggernaut 2.
I just browsed through Juggernaut briefly so take my answer with a grain of salt...
You might be interested in the Channel object (https://github.com/maccman/juggernaut/blob/master/lib/juggernaut/channel.js). You'll notice that Channel.channels is an object (think Ruby's Hash) of all the channels that exist. You can set a 30-second recurring timer (setInterval - http://nodejs.org/docs/v0.4.2/api/timers.html#setInterval) to do something with all your channels.
What to do in each loop iteration? Well, the link to the aforementioned Channel code has a publish method:
publish: function(message){
  var channels = message.getChannels();
  delete message.channels;
  for(var i=0, len = channels.length; i < len; i++) {
    message.channel = channels[i];
    var clients = this.find(channels[i]).clients;
    for(var x=0, len2 = clients.length; x < len2; x++) {
      clients[x].write(message);
    }
  }
}
So you basically have to create a Message object with message.channels set to Channel.channels, and if you pass that message to the publish method, it will be sent out to all your clients.
As to the contents of your message, I dunno what you are using client side (socket.io? a chat client someone already built for you off Juggernaut and socket.io?) so that's up to you.
As for where to put the code that creates the interval and fires off the callback to publish your message to all channels, you might want to look at the code that creates the actual server listening on the given port (https://github.com/maccman/juggernaut/blob/master/lib/juggernaut/server.js). If you attach the interval within init(), then as soon as you start the server it will publish your message to every channel every 30 seconds.
Here is a sample client, in Ruby, which publishes every 30 seconds.
Install Juggernaut with Redis and Node, install Ruby and RubyGems, then run gem install juggernaut and use:
#!/usr/bin/env ruby
require "rubygems"
require "juggernaut"

loop do
  Juggernaut.publish("channel1", "some Message")
  sleep 30
end
We implemented a quiz system which pushed out questions on a variable time interval. We did it as follows:
def start_quiz
  Rails.logger.info("*** Quiz starting at #{Time.now}")
  $redis.flushall # Clear all scores from database
  quiz = Quiz.find(params[:quizz] || 1)
  #quiz_master = quiz.user
  quiz_questions = quiz.quiz_questions.order("question_no ASC")

  spawn_block do
    quiz_questions.each { |q|
      Rails.logger.info("*** Publishing question #{q.question_no}.")
      time_alloc = q.question_time
      Juggernaut.publish( select_channel("/quiz_stream"), {:q_num => q.num, :q_txt => q.text, :time => time_alloc} )
      sleep(time_alloc)
      scoreboard = publish_scoreboard
      Juggernaut.publish( select_channel("/scoreboard"), {:scoreboard => scoreboard} )
    }
  end

  respond_to do |format|
    format.all { render :nothing => true, :status => 200 }
  end
end
The key in our case was using 'spawn' to run a background process for the quiz timing so that we could still process the incoming scores.
I have no idea how scalable this is.
