I have 200,000 users in my database and I need to iterate through each record to process something.
So I have a rake task that iterates over the users; the main logic lives in a worker. Now I want to know the limit on how many workers can run simultaneously. If the limit is 50,000, then I can divide my users into 4 sets and call the worker separately for each set.
task:
namespace :users do
  task data: :environment do
    confirmed_users = User.where('confirmed_at IS NOT NULL').where('id <= 50000')
    confirmed_users.each do |user|
      MyWorker.perform_async(user.id)
    end

    confirmed_users = User.where('confirmed_at IS NOT NULL').where('id > 50000 and id <= 100000')
    confirmed_users.each do |user|
      MyWorker.perform_async(user.id)
    end

    confirmed_users = User.where('confirmed_at IS NOT NULL').where('id > 100000 and id <= 150000')
    confirmed_users.each do |user|
      MyWorker.perform_async(user.id)
    end

    confirmed_users = User.where('confirmed_at IS NOT NULL').where('id > 150000')
    confirmed_users.each do |user|
      MyWorker.perform_async(user.id)
    end
  end
end
If I can find out Sidekiq's limit, I can build the user sets dynamically. I also want to know whether this is the correct way to complete the process in less time, or whether there is a better way to process all my records faster?
Sidekiq only processes as many jobs concurrently as you have worker threads; the rest are placed in the queue, and the queue is practically unlimited. There is no issue with 200k jobs.
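For reference, that concurrency is whatever you configured for the Sidekiq process (the -c flag or config/sidekiq.yml). On Sidekiq 6.x and earlier you can read it at runtime; a minimal sketch, noting that newer versions moved this onto a configuration object:

# Number of jobs one Sidekiq process will run in parallel (Sidekiq < 7)
Sidekiq.options[:concurrency]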
Your slowness more likely comes from loading 200k records with a single SQL query and keeping the whole result in memory while creating jobs from it.
Use find_each to tell Rails to find the records in batches and yield them one-by-one.
namespace :users do
  task data: :environment do
    User.where('confirmed_at IS NOT NULL').find_each do |user|
      MyWorker.perform_async(user.id)
    end
  end
end
However, since you only need the id, not the entire user object, we can also remove the object instantiation to speed it up further.
User.where('confirmed_at IS NOT NULL').in_batches.each do |batch|
  batch.pluck(:id).each do |id|
    MyWorker.perform_async(id)
  end
end
And if that is still not fast enough, there is Sidekiq::Client.push_bulk, which makes only one request to Redis per batch. You might need to adjust the batch size here.
User.where('confirmed_at IS NOT NULL').in_batches.each do |batch|
  args = batch.pluck(:id).map { |id| [id] } # args is [[1], [2], [3], etc...]
  Sidekiq::Client.push_bulk('class' => MyWorker, 'args' => args)
end
Related
I have a users table with 800,000 records. I created a new field called token in the users table. For all new users the token gets populated. To populate the token for existing users I wrote a rake task with the following code. I feel this will not work for this many records in a production environment. How can I rewrite these queries with batches or some other way of writing the queries?
users = User.all
users.each do |user|
  user.token = SecureRandom.urlsafe_base64(nil, false)
  user.save
end
How you want to proceed depends on different factors: is validation important for you when executing this? Is time an issue?
If you don't care about validations, you may generate raw SQL queries for each user and then execute them at once; otherwise you have options like ActiveRecord transactions:
User.transaction do
  users = User.all
  users.each do |user|
    user.update(token: SecureRandom.urlsafe_base64(nil, false))
  end
end
This would be quicker than your rake task, but it will still take some time, depending on how many users you update at once. If a single transaction over the whole table is too heavy, you can also walk the table in fixed id ranges:
lower_limit = User.first.id
upper_limit = 30000
while true
  users = User.where('id >= ? and id < ?', lower_limit, upper_limit)
  break if users.empty?
  users.each do |user|
    user.update(token: SecureRandom.urlsafe_base64(nil, false))
  end
  lower_limit += 30000
  upper_limit += 30000
end
I think that the best option for you is to use find_each or transactions.
Doc for find_each:
Looping through a collection of records from the database (using the ActiveRecord::Scoping::Named::ClassMethods#all method, for example) is very inefficient since it will try to instantiate all the objects at once.
In that case, batch processing methods allow you to work with the records in batches, thereby greatly reducing memory consumption.
The find_each method uses find_in_batches with a batch size of 1000 (or as specified by the :batch_size option).
Doc for transaction:
Transactions are protective blocks where SQL statements are only permanent if they can all succeed as one atomic action
In case you care about memory: because you are bringing all 800k users into memory, User.all.each will instantiate 800k objects and consume a lot of memory, so my approach would be:
User.find_each(batch_size: 500) do |user|
  user.token = SecureRandom.urlsafe_base64(nil, false)
  user.save
end
In this case it only instantiates 500 users at a time, instead of the default batch_size of 1000.
If you still want to do it in a single database transaction, you can use #Francesco's answer.
A common mistake is instantiating model instances when you don't need them, and ActiveRecord instantiation is not cheap.
You can try this naive code:
BATCH_SIZE = 1000

while true
  uids = User.where(token: nil).limit(BATCH_SIZE).pluck(:id)
  break if uids.empty?

  ApplicationRecord.transaction do
    uids.each do |uid|
      # def urlsafe_base64(n=nil, padding=false)
      User
        .where(id: uid)
        .update_all(token: SecureRandom.urlsafe_base64)
    end
  end
end
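The pluck/update_all combination never builds User objects and skips validations and callbacks entirely, which is where the time savings come from.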
The next option is to use your database's native analog of SecureRandom.urlsafe_base64 and run one query like:
UPDATE users SET token=db_specific_urlsafe_base64 WHERE token IS NULL
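For example, on PostgreSQL one possible incarnation of that query, assuming the pgcrypto extension is enabled for gen_random_bytes (an assumption, not part of the original answer), could look like this; translate() maps base64 onto the URL-safe alphabet and drops the '=' padding:

# Sketch: PostgreSQL-only, requires the pgcrypto extension
ActiveRecord::Base.connection.execute(<<~SQL)
  UPDATE users
  SET token = translate(encode(gen_random_bytes(16), 'base64'), '+/=', '-_')
  WHERE token IS NULL
SQL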
If you can't find an analog, you can prepopulate a temp table (e.g. with PostgreSQL's COPY command) from a precalculated CSV file of (id, token=SecureRandom.urlsafe_base64) pairs and run one query like:
UPDATE users SET token=temp_table.token
FROM temp_table
WHERE (users.token IS NULL) AND (users.id=temp_table.id)
But in fact you don't need to fill the token for existing users at all, because of:
i am using "token" for token based authentication in rails – John
You just have to check whether the user's token is NULL (or expired) and redirect them to the login form. That is the common way and it will save you time.
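A minimal sketch of that check, assuming a typical current_user helper and a login_path route (both hypothetical names, not from the question):

class ApplicationController < ActionController::Base
  before_action :require_token

  private

  # Hypothetical filter: users without a token are sent back to login,
  # where a fresh token can be generated, instead of backfilling 800k rows.
  def require_token
    return if current_user.nil? || current_user.token.present?
    redirect_to login_path
  end
end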
I have a module written in ruby which connects to a postgres table and then applies some logic and code.
Below is a sample code:
module SampleModuleHelper
def self.traverse_database
ProductTable.where(:column => value).find_each do |product|
#some logic here that takes a long time
end
end
end
ProductTable has more than 3 million records. I have used the where clause to reduce the number of records retrieved.
However, I need to make the code resilient to connection failures. There are times when the connection breaks and I have to start traversing the table from the very beginning. I don't want this; rather, it should start where it left off, since each record takes too long to process.
What is the best way to make the code start where it left off?
One way is to make a table in the database that records the primary key (id) where processing stopped, and start from there again. But I don't want to create tables in the database, as there are many such processes.
You could keep a counter of processed records and use the offset method to continue processing.
Something along the lines of:
MAX_RETRIES = 3

def self.traverse(query)
  counter = 0
  retries = 0
  begin
    query.offset(counter).find_each do |record|
      yield record
      counter += 1
    end
  rescue ActiveRecord::ConnectionNotEstablished => e # or whatever error you're expecting
    retries += 1
    retry unless retries > MAX_RETRIES
    raise
  end
end
def self.traverse_products
  traverse(ProductTable.where(column: value)) do |product|
    # do something with `product`
  end
end
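An alternative sketch, not from the answer above: remember the last processed id and resume with a where clause instead of an offset, which keeps the resume queries cheap on a large table and matches find_each's own id ordering:

MAX_RETRIES = 3

def self.traverse_by_id(query)
  last_id = 0
  retries = 0
  begin
    # find_each walks the table in primary-key order, so the highest id seen
    # is a safe resume point after a dropped connection
    query.where('id > ?', last_id).find_each do |record|
      yield record
      last_id = record.id
    end
  rescue ActiveRecord::ConnectionNotEstablished
    retries += 1
    retry unless retries > MAX_RETRIES
    raise
  end
end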
So I ran the command rake environment elasticsearch:import:model CLASS='AutoPartsMapper' FORCE=true to index documents in Elasticsearch. My database has 10,000,000 records, so I think it takes about a day to index them all. While the indexing was running my computer turned off (I had indexed 2,000,000 documents by then). Is it possible to continue indexing from where it stopped?
If you use Rails 4.2+ you can use ActiveJob to schedule the indexing and leave it running. First, generate the job with:
bin/rails generate job elastic_search_index
This will give you a class with a perform method:
class ElasticSearchIndexJob < ApplicationJob
  def perform
    # implement the indexing here
    AutoPartsMapper.__elasticsearch__.create_index! force: true
    AutoPartsMapper.__elasticsearch__.import
  end
end
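If Sidekiq is not already your Active Job adapter, that is a one-line change in the Rails configuration; a minimal sketch, where YourApp is a placeholder for your application module:

# config/application.rb -- route Active Job through Sidekiq
module YourApp
  class Application < Rails::Application
    config.active_job.queue_adapter = :sidekiq
  end
end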
Set Sidekiq as your Active Job adapter and, from the console, initiate the job with:
ElasticSearchIndexJob.perform_later
This will enqueue the job and execute it when a worker is free, while freeing up your console. You can leave it running and check the process in bash later:
ps aux | grep side
This will give you something like: sidekiq 4.1.2 app[1 of 12 busy]
Have a look at this post that explains the integration:
http://ruby-journal.com/how-to-integrate-sidekiq-with-activejob/
Hope it helps
There is no such functionality in elasticsearch-rails AFAIK, but you could write a simple task to do it.
namespace :es do
  task :populate, [:start_id] => :environment do |_, args|
    start_id = args[:start_id].to_i

    AutoPartsMapper.where('id > ?', start_id).order(:id).find_each do |record|
      puts "Processing record ##{record.id}"
      record.__elasticsearch__.index_document
    end
  end
end
Start it with bundle exec rake es:populate[<start_id>] passing the id of the record from which to start the next batch.
Note that this is a simplistic solution which will be much slower than batch indexing.
UPDATE
Here is a batch indexing task. It is much faster and automatically detects the record from which to continue. It does make an assumption that previously imported records were processed in increasing id order and without gaps. I haven't tested it but most of the code is from a production system.
namespace :es do
  task :populate_auto => :environment do
    start_id = get_max_indexed_id

    AutoPartsMapper.where('id > ?', start_id).find_in_batches(batch_size: 1000) do |records|
      elasticsearch_bulk_index(records)
    end
  end

  def get_max_indexed_id
    AutoPartsMapper.search(aggs: { max_id: { max: { field: :id } } }, size: 0).response[:aggregations][:max_id][:value].to_i
  end

  def elasticsearch_bulk_index(records)
    return if records.empty?

    klass = records.first.class
    klass.__elasticsearch__.client.bulk({
      index: klass.__elasticsearch__.index_name,
      type: klass.__elasticsearch__.document_type,
      body: elasticsearch_records_to_index(records)
    })
  end

  def elasticsearch_records_to_index(records)
    records.map do |record|
      payload = { _id: record.id, data: record.as_indexed_json }
      { index: payload }
    end
  end
end
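Run it with bundle exec rake es:populate_auto; because it derives the starting id from the highest id already present in the index, you can simply re-run it after an interruption.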
I have a rake task where, every day, I should load 10,000 users and process them, like this:
uptime = Sys::Uptime.days
batch = 10_000

Person.limit(batch).offset(batch * uptime).find_each do |p|
  Namespace::UserWorker.perform_async(p.to_global_id)
end
It turns out, while debugging, that find_each seems to ignore my limit and offset and loads more than 10,000 users. What should I do? Use each instead of find_each?
What you need is to group in batches and ActiveRecord has a helper for it:
http://apidock.com/rails/ActiveRecord/Batches/find_in_batches
batch_size = 1000
start = 0

Person.find_in_batches(batch_size: batch_size, start: start) do |batch|
  batch.each do |p|
    Namespace::UserWorker.perform_async(p.to_global_id)
  end
end
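Note that the start option of find_in_batches is the primary key value to begin from (inclusive), not a row offset.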
find_each overrides limit and offset conditions in order to load the data in batches. So, in your case, just don't specify your own limit and offset conditions:
Person.find_each(batch_size: 10_000) do |p|
  Namespace::UserWorker.perform_async(p.to_global_id)
end
I'm trying to fill my database with information that is downloaded from the internet on the fly. I already have a list of ids in a table. What I initially tried was to get all the ids and traverse each id in a loop, downloading the relevant information. It worked, but since I had more than 1,000 ids it took approximately 24 hours. To speed things up I tried to create threads, with each thread allotted some number of ids to download. The problem is that the interpreter suddenly stops and exits. I also want to ask whether the procedure I wrote will actually gain me some speedup in overall time. The code I wrote is something like this (I'm using Ruby):
def self.called_by_thread(start, limit = 50, retry_attempts = 5)
  last_id = start
  begin
    users = User.where('id > ?', last_id).limit(limit)
    users.each do |user|
      # call a method on the user object to download its information and store it
      last_id = user.id
    end
  rescue => msg
    puts "Something went wrong (#{msg})"
    if retry_attempts > 0
      retry_attempts -= 1
      limit -= last_id - start
      retry
    end
  end
end
In the above code start is the id from where to start.
I call the above function like this:
last_id = 1090
i = 1
limit = 50
workers = []

while i < num_workers
  # pass last_id into the thread so each thread gets its own starting id
  t = Thread.new(last_id) { |start_id| called_by_thread(start_id, limit, 5) }
  workers << t
  i += 1
  last_id += limit
end

workers.each do |t|
  t.join
end
All ids are incremental, so there is no harm in adding a positive number to an id. It is guaranteed that the user exists for a given id, provided it's below 10,000.