Sidekiq push_bulk to batch - ruby-on-rails

I'm trying to use Sidekiq batches to group a set of related jobs together. However, the batch fires its on_complete callbacks prematurely because the jobs method can't push all the jobs to Redis fast enough. The Sidekiq documentation at https://github.com/mperham/sidekiq/wiki/Batches says this can be resolved by using the Sidekiq::Client.push_bulk method, but the documentation is unclear on how to push to a batch in bulk. Can someone share an example of how to use push_bulk in the context of a batch?

Assume you want to process 100 users in a batch, one user ID per job.
b = Sidekiq::Batch.new
b.jobs do
  users = User.select(:id).limit(100).map(&:id) # users is [1, 2, 3, etc...]
  args = users.map { |uid| [uid] }              # args is [[1], [2], [3], etc...]
  Sidekiq::Client.push_bulk('class' => YourWorker, 'args' => args)
end
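For completeness, here is a minimal sketch of the worker side, assuming (as in the snippet above) that each job receives a single user id; YourWorker is just the placeholder name from the snippet, so adjust it to your own class:

class YourWorker
  include Sidekiq::Worker

  def perform(user_id)
    # process one user per job; the batch's callbacks fire once all
    # of the jobs pushed above have finished
    user = User.find(user_id)
    # ... your per-user work here ...
  end
end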

Related

Using limit and offset in rails together with updated_at and find_each - will that cause a problem?

I have a Ruby on Rails project with millions of products, each with a different URL. I have a method "test_response" that checks the URL and sets the Product attribute marked_as_broken to either true or false; either way, the Product is saved and its "updated_at" attribute is set to the current timestamp.
Since this is a very slow process, I have created a task which in turn starts 15 tasks, each with N/15 products to check. The first one should check, for example, the 1st to the 10,000th product, the second one the 10,000th to the 20,000th, and so on, using limit and offset.
The script starts the 15 processes fine, but each one completes far too early. They do not crash; each finishes with "Process exited with status 0".
My guess is that combining find_each with a condition on updated_at, while also updating "updated_at" as the script runs, changes the result set and stops each script from going through the 10,000 items it is supposed to, but I can't verify this.
Is there something inherently wrong with what I'm doing here? For example, does "find_each" periodically run a new SQL query, returning completely different results each time than anticipated? I expect it to return the same 10,000 -> 20,000 range, just split into pieces.
task :big_response_launcher => :environment do
  nbr_of_fps = Product.where(:marked_as_broken => false).where("updated_at < '" + 1.year.ago.to_date.to_s + "'").size.to_i
  nbr_of_processes = 15
  batch_size = ((nbr_of_fps / nbr_of_processes)) - 2
  heroku = PlatformAPI.connect_oauth(auth_code_provided_elsewhere)
  (0..nbr_of_processes-1).each do |i|
    puts "Launching #{i.to_s}"
    current_offset = batch_size * i
    puts "rake big_response_tester[#{current_offset},#{batch_size}]"
    heroku.dyno.create('kopa', {
      :command => "rake big_response_tester[#{current_offset},#{batch_size}]",
      :attach => false
    })
  end
end

task :big_response_tester, [:current_offset, :batch_size] => :environment do |task, args|
  current_limit = args[:batch_size].to_i
  current_offset = args[:current_offset].to_i
  puts "Launching with offset #{current_offset.to_s} and limit #{current_limit.to_s}"
  Product.where(:marked_as_broken => false).where("updated_at < '" + 1.year.ago.to_date.to_s + "'").limit(current_limit).offset(current_offset).find_each do |fp|
    fp.test_response
  end
end
As many have noted in the comments, it seems that find_each ignores the order and the limit. I found this answer (ActiveRecord find_each combined with limit and order) that seems to work for me. It's not working 100%, but it's a definite improvement. The rest seems to be a memory issue, i.e. I cannot have too many processes running at the same time on Heroku.
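For reference, one workaround along those lines (a sketch only, reusing the column names from the question): since find_each ignores limit and offset, pluck the ids for this process's slice up front and then iterate over that fixed id list.

ids = Product.where(:marked_as_broken => false)
             .where("updated_at < ?", 1.year.ago)
             .order(:id)
             .limit(current_limit)
             .offset(current_offset)
             .pluck(:id)

# iterate over a fixed list of ids, so updating updated_at mid-run
# can no longer change which records this process visits
ids.each_slice(1000) do |slice|
  Product.where(:id => slice).each(&:test_response)
end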

How to limit request rates in .map loop?

I'm querying the Amazon Product Advertising API with code like this:
products = asins.map do |asin|
  item = Amazon::Ecs.item_lookup(asin, response_group: :Large)
  json = { asin: item.get_element('Item').get('ASIN'),
           manufacturer: item.get_element('ItemAttributes').get('Manufacturer'),
           model: item.get_element('ItemAttributes').get('Model') }
end
And I get a 503 error: "You are submitting requests too quickly. Please retry your requests at a slower rate."
I found out that they allow 1 request per second.
What's the best way of handling this in my case?
Perhaps just slow things down by waiting a second between two iterations:
products = asins.map do |asin|
  sleep 1 # wait one second before doing the next API call
  item = Amazon::Ecs.item_lookup(asin, response_group: :Large)
  {
    asin: item.get_element('Item').get('ASIN'),
    manufacturer: item.get_element('ItemAttributes').get('Manufacturer'),
    model: item.get_element('ItemAttributes').get('Model')
  }
end
Using sleep is for sure the first solution that comes to mind. In my opinion it's not an elegant one, because it's not very manageable. I would think of some queueing system to do the work - maybe Sidekiq with a self-triggering worker?
Some simplified code:
# some kind of queueing logic, to fetch asins
asin = AsinQueue.fetch
# trigger first worker
LookupWorker.perform_async(asin)
# and the worker itself:
class LookupWorker
include Sidekiq::Worker
def perform(asin)
item = Amazon::Ecs.item_lookup(asin, response_group: :Large)
# all the domain logic
# queue next lookup
next_asin = AsinQueue.fetch
LookupWorker.perform_in(1.second, next_asin)
end
end
ItemLookup supports batch requests: you can look up up to 10 items at once.
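A rough sketch of that, assuming your amazon-ecs version accepts a comma-separated list of ASINs and exposes the returned items via response.items (check your gem version's API before relying on this):

products = asins.each_slice(10).flat_map do |slice|
  sleep 1 # still stay under the 1 request/second limit
  response = Amazon::Ecs.item_lookup(slice.join(','), response_group: :Large)
  response.items.map do |item|
    {
      asin: item.get('ASIN'),
      manufacturer: item.get_element('ItemAttributes').get('Manufacturer'),
      model: item.get_element('ItemAttributes').get('Model')
    }
  end
end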

Run scripts in parallel in ruby

I need to convert videos in 4 threads.
For example, I have Active Record Video models with titles Video1, Video2, Video3, Video4, Video5.
So I need to execute something like this:
bundle exec script/video_converter start
where the script processes unconverted videos across 4 threads, for example:
Video.where(state: 'unconverted').first.process
But when one of the 4 videos finishes converting, the next video must automatically be picked up by the freed thread.
What is the best solution for this? The Sidekiq gem? The Daemons gem plus Ruby threads managed manually?
For now I am using this script:
THREAD_COUNT = 4
SLEEP_TIME = 5
logger = CONVERTATION_LOG

spawns = []
loop do
  videos = Video.where(state: 'unconverted').limit(THREAD_COUNT).reorder("ID DESC")
  videos.each do |video|
    spawns << Spawnling.new do
      result = video.process
      if result.nil?
        video.create_thumbnail!
      else
        video.failured!
      end
    end
  end
  Spawnling.wait(spawns)
  sleep(SLEEP_TIME)
end
But this script waits for all 4 videos to finish and only then takes another 4. I want the next video to be picked up automatically as soon as one of the 4 threads becomes free.
If your goal is to keep processing videos with just 4 threads (or however many Spawnling is configured to use - it supports both fork and thread), you could push all of your video records onto a Queue, spawn 4 workers, and let them keep pulling records one by one until the queue is empty.
require "rails"
require "spawnling"
# In your case, videos are read from DB, below array is for illustration
videos = ["v1", "v2", "v3", "v4", "v5", "v6", "..."]
THREAD_COUNT = 4
spawns = []
q = Queue.new
videos.each {|i| q.push(i) }
THREAD_COUNT.times do
spawns << Spawnling.new do
until q.empty? do
v = q.pop
# simulate processing
puts "Processing video #{v}"
# simulate processing time
sleep(rand(10))
end
end
end
Spawnling.wait(spawns)
This answer was inspired by this answer.
PS: I have added a few requires and defined the videos array to make the above code a self-contained, runnable example.
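If you would rather go the Sidekiq route mentioned in the question, a rough sketch would be to enqueue one job per video and cap Sidekiq's concurrency at 4; VideoConvertWorker is a hypothetical name here, and the concurrency setting lives in config/sidekiq.yml.

class VideoConvertWorker
  include Sidekiq::Worker

  def perform(video_id)
    video = Video.find(video_id)
    result = video.process
    result.nil? ? video.create_thumbnail! : video.failured!
  end
end

# enqueue one job per unconverted video; with ":concurrency: 4" in
# config/sidekiq.yml, at most 4 conversions run at once and a free
# worker thread immediately picks up the next job
Video.where(state: 'unconverted').pluck(:id).each do |id|
  VideoConvertWorker.perform_async(id)
end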

Quickly adding multiple items (1000/sec) to a sidekiq queue?

I realize there is a push_bulk option for Sidekiq, but I'm currently limited by latency to Redis, so passing multiple items via push_bulk still isn't going quickly enough (only about 50/s).
I've tried to increase the number of redis connections like so:
redis_conn = proc {
  Redis.new({ :url => Rails.configuration.redis.url })
}

Sidekiq.configure_client do |config|
  config.redis = ConnectionPool.new(size: 50, &redis_conn)
  config.client_middleware do |chain|
    chain.add Sidekiq::Status::ClientMiddleware
  end
end
I then fire off separate threads (Thread.new) to actually call perform_async on the various objects. What is interesting is that jobs enqueued from any thread other than the first NEVER make it into the Sidekiq queue; it's like they're ignored entirely.
Does anyone know of a better way to do this?
Edit: Here is the push_bulk method I was trying which is actually slower:
user_ids = User.need_scraping.pluck(:id)
bar = ProgressBar.new(user_ids.count)

user_ids.in_groups_of(10000, false).each do |user_id_group|
  Sidekiq::Client.push_bulk(
    'args'  => user_id_group.map { |user_id| [user_id] },
    'class' => ScrapeUser,
    'queue' => 'scrape_user',
    'retry' => true
  )
end
Thanks!
You DO want to use push_bulk. You're limited by the latency/round-trip time required to write elements to the redis queue backing Sidekiq.
You're using multiple threads/connections to overcome a slow network, when you should really be removing extra network round trips.
Assuming you're trying to enqueue 20k UserWorker jobs that each take a user_id:
You would enqueue a single job via:
UserWorker.perform_async(user_id)
... which maps to:
Sidekiq::Client.push('class' => UserWorker, 'args' => [user_id] )
So the push_bulk version for 20k user_ids is:
# This example takes 20k user_ids in an array, chunks them into groups of 1000 ids,
# and batch-sends each group to redis.
User.need_scraping.select('id').find_in_batches do |user_group|
  sidekiq_items = user_group.map { |user| { 'class' => UserWorker, 'args' => [user.id] } }
  Sidekiq::Client.push_bulk(sidekiq_items)
end
This turns 20k redis calls into 20 redis calls. With an average round-trip time of 5ms (optimistic), that's roughly 0.1 seconds vs. 100 seconds. Your mileage may vary.
EDIT:
Commenters seem confused about the behavior of the Sidekiq/Redis client when bulk enqueuing data.
The Sidekiq::Client.push_bulk() method takes an array of jobs to be enqueued. It translates these into Sidekiq job payload hashes, and then calls Sidekiq::Client.raw_push() to deliver those payloads to redis. See source: https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/client.rb#L158
Sidekiq::Client.raw_push() takes a list of Sidekiq hash payloads, converts them to JSON, and then executes a redis MULTI block combining two redis commands. First, it adds the targeted queue to the set of active queues (redis SADD), then it pushes all of the job payloads onto the targeted queue's redis list (redis LPUSH). The two commands are executed together as a single atomic redis group.
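In other words, the per-batch round trip looks conceptually like this (a sketch using the redis-rb client, not the actual Sidekiq source; payloads stands in for the JSON-encoded job hashes):

redis.multi do |transaction|
  transaction.sadd('queues', 'scrape_user')        # register the queue in the set of known queues
  transaction.lpush('queue:scrape_user', payloads) # push every job payload in one command
end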
If this is still slow, you likely have other problems (slow network, overloaded redis server, etc.).
@Winfield's answer is correct, and he's absolutely right about latency. However, the correct syntax is actually as follows:
User.need_scraping.select('id').find_in_batches do |user_group|
  Sidekiq::Client.push_bulk({ 'class' => UserWorker, 'args' => user_group.map { |user| [user.id] } })
end
Maybe it changed in the most recent Sidekiq (I was too lazy to check), but this is the correct syntax now.
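For what it's worth, newer Sidekiq releases (6.3+, if I remember correctly) also add a perform_bulk shortcut on the worker class itself that wraps push_bulk, so the same thing can be written as follows; check your Sidekiq version before relying on it.

User.need_scraping.select('id').find_in_batches do |user_group|
  UserWorker.perform_bulk(user_group.map { |user| [user.id] })
end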

Programmatically get the number of jobs in a Resque queue

I am interested in setting up a monitoring service that will page me whenever there are too many jobs in the Resque queue (I have about 6 queues, and I'll have a different threshold for each queue). I also want to set up a very similar monitoring service that will alert me when I exceed a certain number of failed jobs in my queue.
My question is: there are a lot of keys associated with Resque on my redis server, and it's confusing. I don't see a straightforward way to get a count of jobs per queue or the number of failed jobs. Is there currently a trivial way to grab this data from redis?
Yes, it's quite easy, given that you're using the Resque gem:
require 'resque'
Resque.info
will return a hash, e.g.:
{
  :pending => 54338,
  :processed => 12772,
  :queues => 2,
  :workers => 0,
  :working => 0,
  :failed => 8761,
  :servers => [
    [0] "redis://192.168.1.10:6379/0"
  ],
  :environment => "development"
}
So to get the failed job count, simply use:
Resque.info[:failed]
which would give:
=> 8761 # in my example
To get the queues use:
Resque.queues
this returns an array, e.g.:
[
  [0] "superQ",
  [1] "anotherQ"
]
You may then find the number of jobs per queue:
Resque.size(queue_name)
e.g. Resque.size("superQ") or Resque.size(Resque.queues[0])
Here is a bash script which will monitor the total number of jobs queued and the number of failed jobs.
while :
do
  let sum=0
  let errors=$(redis-cli llen resque:failed)
  for s in $(redis-cli keys 'resque:queue:*') # quote the pattern so the shell doesn't expand it
  do
    let sum=$sum+$(redis-cli llen $s)
  done
  echo $sum jobs queued, with $errors errors
  sleep 1 # sleep 1 second, probably want to increase this
done
This is for Resque 1.X, 2.0 might have different key names.
There is also a method, Resque.queue_sizes, that returns a hash of queue names and their sizes:
Resque.queue_sizes
=> {"default"=>0, "slow"=>0}
