I have a simple rake task that executes some action per user. Something like this:
task users_actions: :environment do
  User.all.each do |u|
    # Some actions here
  end
end
The problem is that it doesn't start on the next user until it has finished the current one. What I want is to execute these in parallel. How can I do that? Is it even possible?
Thanks,
If there was a good library available, it would be better to use it rather than implementing everything from scratch. concurrent-ruby has all kinds of utility classes for writing concurrent code, but I'm not sure if they have something suitable for this use case; anyway, I'll show you how to do it from scratch.
First pull in the thread library:
require 'thread'
Make a thread-safe queue, and stick all the users on it:
queue = Queue.new
User.all.each { |user| queue << user }
Start some number of worker threads, and make them process items from the queue until all are done.
threads = 5.times.collect do
  Thread.new do
    while true
      # pop(true) is a non-blocking pop: it raises ThreadError once the
      # queue is empty, which the rescue turns into a break out of the loop
      user = queue.pop(true) rescue break
      # do something with the user
      # this had better be thread-safe, or you will live to regret it!
    end
  end
end
Then wait until all the threads finish:
threads.each(&:join)
Again, please make sure that the code which processes each user is thread-safe! The power of multi-threading is in your hands, don't abuse it!
NOTE: If your user-processing code is very CPU-intensive, you might consider running this on Rubinius or JRuby, so that more than one thread can run Ruby code at the same time.
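For completeness, one concurrent-ruby tool that does fit this case is its fixed-size thread pool. A minimal sketch, assuming the same per-user work as above:

require 'concurrent'

# A pool of 5 worker threads; posted jobs queue up internally
pool = Concurrent::FixedThreadPool.new(5)

User.all.each do |user|
  pool.post do
    # do something with the user -- this still has to be thread-safe!
  end
end

pool.shutdown             # stop accepting new jobs
pool.wait_for_termination # block until all queued jobs have run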
Sidekiq recommends that all jobs be idempotent (able to run multiple times without being an issue) as it cannot guarantee a job will only be run one time.
I am having trouble understanding the best way to achieve that in certain cases. For example, say you have the following table:
User
id
email
balance
The background job that runs simply adds some amount to their balance:
def perform(user_id, balance_adjustment)
  user = User.find(user_id)
  user.balance += balance_adjustment
  user.save
end
If this job is run more than once their balance will be incorrect. What is best practice for something like this?
Thinking about it, a potential solution I can come up with is to create a record before scheduling the job, something like:
PendingBalanceAdjustment
user_id
balance_adjustment
When the job runs, it will need to acquire a lock for this user so that there's no chance of a race condition between two workers, and it will need to both update the balance and delete the record from pending balance adjustments before releasing the lock.
The job then looks something like this?
def perform(user_id, balance_adjustment_id)
  user = User.find(user_id)
  pba = PendingBalanceAdjustment.where(:id => balance_adjustment_id).take
  if pba.present?
    $redis.lock("#{user_id}/balance_adjustment") do
      user.balance += pba.balance_adjustment
      user.save
      pba.delete
    end
  end
end
This seems to solve both
a) Race condition between two workers taking the job at the same time (though you'd think Sidekiq could guarantee this already?)
b) A job being run multiple times after running successfully
Is this pattern a good solution?
You're on the right track; you want to use a database transaction, not a redis lock.
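To sketch what that could look like (illustrative only, using ActiveRecord's pessimistic row locking with the PendingBalanceAdjustment model from the question):

def perform(user_id, pending_balance_adjustment_id)
  ActiveRecord::Base.transaction do
    # SELECT ... FOR UPDATE: a second worker blocks here until we commit
    user = User.lock.find(user_id)
    pba = PendingBalanceAdjustment.lock.where(:id => pending_balance_adjustment_id).first
    next if pba.nil? # already applied, so re-running the job is a no-op

    user.balance += pba.balance_adjustment
    user.save!
    pba.destroy # deleted inside the same transaction, which keeps the job idempotent
  end
end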
I think you're on the right track too, but your solution might be overkill; I don't have full knowledge of your application.
BUT, a simpler solution would be to have a flag on your User model, like balance_updated:datetime. You could then check that before updating.
As Mike mentions, using a transaction block should ensure it's thread-safe.
In any case, to answer your question more generally: having an updated_* column is usually good enough to start with, and if it gets complicated you can move this stuff to another model.
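A rough sketch of that flag idea (the balance_updated_at column and the scheduled_at argument are assumptions, not from the original code):

def perform(user_id, balance_adjustment, scheduled_at)
  user = User.find(user_id)
  # Skip the job if a run for this scheduling has already been applied
  return if user.balance_updated_at && user.balance_updated_at >= scheduled_at
  User.transaction do
    user.balance += balance_adjustment
    user.balance_updated_at = Time.now
    user.save!
  end
end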
I've written the following pseudo-Ruby to illustrate what I'm trying to do. I've got some computers, and I want to see if anything's connected to them. If nothing is connected, try again for another two attempts, and if that's still the case, shut it down.
This is for a big deployment, so this recursive timer could be running for hundreds of nodes. I just want to check: is this approach sound? Will it generate tonnes of threads and eat up lots of RAM while blocking the worker processes? (I expect it will be running as a delayed_job.)
check_status(0)

def check_status(i)
  return if instance.connected.true?
  if instance.connected.false? and i < 3
    wait.5.minutes
    instance.check_status(i + 1)
  else
    instance.shutdown
  end
end
This is not going to be a large problem when the maximum recursion depth is 3. It should be fine. Recursing into a method does not create threads, but each call does store more information on the call stack, and eventually the resources used for that storage could run out. Not after 3 calls, though; that depth is quite safe.
However, there is no need for recursion to solve your problem. The following loop should do just as well:
def check_status
  return if instance.connected.true?
  2.times do
    wait.5.minutes
    return if instance.connected.true?
  end
  instance.shutdown
end
You got answers from other users already. However, since you are waiting 5 minutes at least two times, you might consider using another language or changing the design.
Ruby (MRI) has a global interpreter lock, which restricts parallel execution of Ruby code. MRI is not parallel, so you risk being inefficient here.
Consider using threads (a reasonable number of thread pools might make sense), probably fed by a queue of tasks; see the sketch after this list.
Make sure you don't busy-wait for 5 minutes. Instead, put the threads to sleep for that time; this way other threads can execute while some are sleeping/waiting.
You could also consider using JRuby, since JRuby has true parallelism (MRI is restricted by the GIL, thus it is not truly parallel).
Consider using another programming language that might be more performant.
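A rough sketch of those suggestions combined, assuming instances is the collection of nodes to check; instance.connected? here stands in for the pseudo-code's instance.connected.true?:

require 'thread'

queue = Queue.new
instances.each { |i| queue << i }

workers = 4.times.map do
  Thread.new do
    loop do
      instance = queue.pop(true) rescue break # non-blocking pop; break when empty
      connected = instance.connected?
      2.times do
        break if connected
        sleep 5 * 60 # sleep instead of busy-waiting: other threads keep running
        connected = instance.connected?
      end
      instance.shutdown unless connected
    end
  end
end

workers.each(&:join)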
If it's running via delayed_job, why not use the gem's functionality to implement what you want? I, for one, would go for something like the following. No need to sleep the delayed jobs or anything.
class CheckStatusJob
  def before(job)
    @job = job
  end

  def perform
    return if instance.connected.true?
    if instance.connected.false? and @job.attempts < 3
      raise 'The job failed!' # raising makes delayed_job retry the job
    else
      instance.shutdown
    end
  end

  def max_attempts
    3
  end

  def reschedule_at(current_time, attempts)
    current_time + 5.minutes
  end
end
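The job can then be enqueued like any other delayed_job payload object (assuming it carries whatever reference to instance it needs), e.g.:

Delayed::Job.enqueue CheckStatusJob.new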
I have an application that makes thousands of requests to a web service API. Each request takes about 2 seconds, then the response creates a new record in the database. I want to fire off as many of those requests as I can simultaneously, and save each response to the database as soon as I get it.
Is this something I should be using a gem like sidekiq for, or the Ruby Thread class? I don't want the requests to be handled one at a time, synchronously.
Sounds like you need a thread pool for performing the operation, and a database thread to commit the results.
You can build one of these really simply:
require 'thread'

db_queue = Queue.new

Thread.new do
  # pop blocks until an item arrives; pushing nil ends the loop
  while (item = db_queue.pop)
    # ... deal with item in queue
  end
end

# Example of supplying a job
db_queue.push(api_response)

# When finished
db_queue.push(nil)
Due to the Global Interpreter Lock in the standard Ruby runtime, threads are only really useful for managing many lightly loaded tasks, such as IO-bound API calls. If you need something more heavy-duty, JRuby might be what you're looking for.
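Putting the two pieces together, a minimal sketch of the pool-plus-writer layout; fetch_api, urls, and Record are hypothetical stand-ins for your actual API call, work list, and model:

require 'thread'

work_queue = Queue.new
db_queue   = Queue.new

urls.each { |url| work_queue << url }

# A single writer thread owns all database writes
writer = Thread.new do
  while (response = db_queue.pop) # nil signals shutdown
    Record.create!(:body => response)
  end
end

# A pool of request threads; they overlap nicely while blocked on IO
workers = 10.times.map do
  Thread.new do
    loop do
      url = work_queue.pop(true) rescue break # non-blocking; break when empty
      db_queue << fetch_api(url)
    end
  end
end

workers.each(&:join)
db_queue << nil # tell the writer to finish
writer.join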
I have these two methods in my model. One looks up a single CatalogItem's facebook like count, and another loops through all active CatalogItems and finds their like counts using the former.
It takes a while to run through all the active items' facebook likes; it might loop over anywhere from 300-1000 objects, so I'd like to move this to some sort of cron, or whatever you guys suggest.
I was thinking I should add a column to CatalogItem called cached_fb_count, and adapt self.facebook_likes to write to that column whenever the task runs.
Is this the right approach? What would that task look like if it was running every 2 hours?
def self.facebook_likes
  self.active.each_with_index do |i, index|
    _likes = i.facebook_like_count
    i.update_attribute(:cached_likes, _likes)
    # puts "#{index+1} Likes: #{_likes} ########### ID: #{i.id} "
  end
end
def facebook_like_count
  url = "https://api.facebook.com/method/fql.query?query=select%20like_count%20from%20link_stat%20where%20url='https://www.foobar.com/catalog_items/#{self.id}'&format=json"
  item_like_count = JSON.parse(open(url).read).first.flatten[1]
  item_like_count += 1 if item_like_count > 0
  item_like_count # return the count explicitly, even when it is zero
end
delayed_job is a perfect tool for asynchronous tasks. It runs in a separate process and is relational-database-backed (Active Record), so it saves the execution context like a simple script invocation. It also has rich functionality, including task priorities and scheduling. If your workload involves huge queues, consider the Resque gem; it uses Redis as its task storage and deals much faster with long queues.
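For illustration, the two enqueue styles look like this (LikeRefreshJob is a hypothetical job class, and the options hash follows delayed_job 3.x):

# delayed_job: stored in your relational database, with priority and run_at
Delayed::Job.enqueue LikeRefreshJob.new(item.id), :priority => 10, :run_at => 2.hours.from_now

# Resque: Redis-backed; the job class implements a perform class method
Resque.enqueue(LikeRefreshJob, item.id)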
Use whenever; it's very easy to set up. Here is the link: https://github.com/javan/whenever
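A sketch of the every-2-hours setup with whenever; the task name catalog_items:cache_facebook_likes is illustrative:

# config/schedule.rb
every 2.hours do
  rake 'catalog_items:cache_facebook_likes'
end

# lib/tasks/catalog_items.rake
namespace :catalog_items do
  desc 'Refresh the cached facebook like counts'
  task cache_facebook_likes: :environment do
    CatalogItem.facebook_likes
  end
end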
I want to give my users the option to have a daily summary of their account statistics sent to them at a specific (user-given) time.
Let's say we have the following model:
class DailySummery < ActiveRecord::Base
  # attributes:
  #   send_at      => 10:00 (hour)
  #   last_sent_at => time of the last sent summary
end
Is there a best practice for sending these account summaries via email at the specified time?
At the moment I have an infinite rake task running which constantly checks whether emails are available for sending, and I would like to put the daily-summary generation and sending into this rake task.
I had the thought that I could solve this with the following pseudo-code:
while true
  User.all.each do |u|
    u.generate_and_deliver_dailysummery if u.last_sent_at < Time.now - 24.hours
  end
  sleep 60
end
But I'm not sure if this has some hidden caveats...
Notice: I don't want to use queues like Resque or Redis or something like that!
EDIT: Added sleep (I have it in my script already).
EDIT: It's a time-critical service (notification of trade rates), so it should be as fast as possible. That's the background to why I don't want to use a queue- or job-based system. And I use Monit to manage this rake task, which works really well.
There are really only two main ways you can do delayed execution. You run the script when a user on your site hits a page, which is inefficient and not entirely accurate. Or you use some sort of background process, whether it's a cron job or resque/delayed job/etc.
While your method of having a rake process run forever will work, it's inefficient because you start iterating over users again as soon as the previous pass finishes, around the clock. Something like:
while true
  User.where("last_sent_at <= ? OR last_sent_at IS NULL", 24.hours.ago).each do |u|
    u.generate_and_deliver_dailysummery
  end
  sleep 3600
end
This would run once an hour and only pull the users that need an email sent, which is a bit more efficient. The best practice, though, would be to use a cron job that runs your rake task.
Running a task periodically is what cron is for. The whenever gem (https://github.com/javan/whenever) makes it simple to configure cron definitions for your app.
As your app scales, you may find that the rake task takes too long to run and that the queue is useful on top of cron scheduling. You can use cron to control when deliveries are scheduled but have them actually executed by a worker pool.
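For instance, the cron-fired rake task could do nothing but enqueue, leaving delivery to the worker pool (a sketch assuming delayed_job; the task name is illustrative):

task deliver_daily_summaries: :environment do
  # Only enqueue here; the delayed_job workers do the actual sending in parallel
  User.where('last_sent_at <= ? OR last_sent_at IS NULL', 24.hours.ago).find_each do |u|
    u.delay.generate_and_deliver_dailysummery
  end
end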
I see two possibilities for running a task at a specific time.
Background process / Worker / ...
It's what you have already done. I refactored your example, because there were two problems:
Check conditions directly in your database; it's more efficient than loading potentially useless data.
Load users in batches. Imagine your database contains millions of users... I'm pretty sure you would be happy, but not Rails... not at all. :)
Besides your code, I see another problem: how are you going to manage this background job on your production server? If you don't want to use Resque or something similar, you should consider managing it another way. Monit and God are both process monitors.
while true
  # Check the condition directly in the database
  users = User.where(['last_sent_at < ? OR last_sent_at IS NULL', 24.hours.ago])
  # Load users in batches of 1000
  users.find_each(:batch_size => 1000) do |u|
    u.generate_and_deliver_dailysummery
  end
  sleep 60
end
Cron jobs / Scheduled task / ...
The second possibility is to schedule your task on a recurring basis, for instance each hour or half-hour. Correct me if I'm wrong, but do your users really need to schedule the delivery for 10:39am? I think letting them choose the hour is enough.
Applying this, I think a job fired each hour is better than an infinite task querying your database every single minute. Moreover, it's really easy to do, because you don't need to set anything up.
There is a good gem for managing cron tasks with Ruby syntax. More info here: Whenever
You can do that; you'll also need to check for the time you want to send at. Starting with your pseudo-code and adding to it:
while true
  User.all.each do |u|
    if u.last_sent_at < Time.now - 24.hours && Time.now.hour >= u.send_at
      u.generate_and_deliver_dailysummery
      # the next 2 lines are only needed if generate_and_deliver_dailysummery
      # doesn't set last_sent_at itself
      u.last_sent_at = Time.now
      u.save
    end
  end
  sleep 900
end
I've also added the sleep so you don't needlessly hammer your database. You might also want to look into limiting that loop to just the set of users you need to send to; a query similar to what Zachary suggests would be much more efficient than what you have.
If you don't want to use a queue, consider delayed_job (sort of a poor man's queue); it runs as a rake task similar to what you are doing:
https://github.com/collectiveidea/delayed_job
http://railscasts.com/episodes/171-delayed-job
It stores all tasks in a jobs table. When you add a task, it's usually queued to run as soon as possible, but you can override this to delay it until a specific time.
You could convert your DailySummary class to a DailySummaryJob, and once complete it could re-queue a new instance of itself for the next day's run.
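A sketch of that self-requeueing idea (the class shape and the delayed_job 3.x-style options are assumptions, not from the original code):

class DailySummaryJob < Struct.new(:user_id)
  def perform
    user = User.find(user_id)
    user.generate_and_deliver_dailysummery
    # Re-queue ourselves for the same time tomorrow
    Delayed::Job.enqueue DailySummaryJob.new(user_id), :run_at => 24.hours.from_now
  end
end

# Initial scheduling at the user's chosen hour, e.g.:
Delayed::Job.enqueue DailySummaryJob.new(user.id),
                     :run_at => Time.now.at_beginning_of_day + user.send_at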
How do you update the last_sent_at attribute?
If you use
last_sent_at += 24.hours
and initialize it with last_sent_at = Time.now.at_beginning_of_day + send_at,
it will all be OK.
Don't use last_sent_at = Time.now: there may be some delay before the job actually runs, which would make the last_sent_at attribute drift later and later.
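In code, that drift-free bookkeeping could look like this (assuming send_at is a duration past midnight, as in the question):

# first delivery: anchor to the user's chosen time, not to "now"
u.last_sent_at ||= Time.now.at_beginning_of_day + u.send_at
# ... generate and deliver the summary ...
u.last_sent_at += 24.hours # fixed increment, so processing delays never accumulate
u.save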