Custom bulk indexer for Searchkick: mapping options are ignored - ruby-on-rails

I'm using Searchkick 3.1.0.
I have to bulk index a certain collection of records. From what I've read in the docs and tried, I cannot pass a predefined array of ids to Searchkick's reindex method. I'm using the async mode.
If you call, for example, Klass.reindex(async: true), it enqueues jobs using the batch_size specified in your options. The problem is that it loops through the entire model's ids and only then determines whether each record has to be indexed. For example, if I have 10,000 records in my database and a batch size of 200, it will enqueue 50 jobs. Each job then loops over its ids and indexes a record only if the search_import conditions are met.
This step is useless; I would like to enqueue a pre-filtered array of ids to avoid looping through all the records.
I tried writing the following job to override the normal behavior:
def perform(class_name, batch_size = 100, offset = 0)
  model = class_name.constantize
  ids = model
        .joins(:user)
        .where(user: { active: true, id: $rollout.get(:searchkick).users })
        .where("#{class_name.downcase.pluralize}.id > ?", offset)
        .pluck(:id)

  until ids.empty?
    ids_to_enqueue = ids.shift(batch_size)
    Searchkick::BulkReindexJob.perform_later(
      class_name: model.name,
      record_ids: ids_to_enqueue
    )
  end
end
The problem: the Searchkick mapping options are completely ignored when the records are inserted into Elasticsearch, and I can't figure out why. It doesn't use the specified match (text_middle); instead it creates a mapping with the default 'keyword' type.
Is there any clean way to bulk reindex an array of records without enqueuing jobs that contain unwanted records?

You should be able to reindex records based on a condition.
From the Searchkick docs:
Reindex multiple records
Product.where(store_id: 1).reindex
You can put that in your own delayed job.
What I have done for some of our batch operations that already happen in a delayed job is wrap the code in the job in the bulk block, which is also in the Searchkick docs:
Searchkick.callbacks(:bulk) do
  # ... wrap some batch operations on a model instrumented with Searchkick.
  # The bulk block should be outside of any transaction block.
end
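Putting that together, a minimal sketch of such a job might look like this (the job name and filter conditions are illustrative, not part of Searchkick):
class FilteredReindexJob < ApplicationJob
  queue_as :searchkick

  def perform
    # Relation-based reindex, as in the Searchkick docs, applied to a
    # pre-filtered set instead of the whole table.
    Product.joins(:user).where(users: { active: true }).reindex
  end
end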

Related

Rails: Group By join table

I am trying to group by a delegated join table. I have a tasks table and each task has a project_id. The following code works well in my controller for me to group by project:
@task = Task.joins(:project).joins(:urgency).where(urgencies: { urgency_value: 7 }).group_by(&:project_id)
This returns a hash where the key is what I have grouped by and the value contains the tasks within that group. I can then loop through each task to retrieve its attributes.
However, each project belongs to a workspace (via a workspace_id). What I want is the same query but grouped by workspace. The final aim is to create a table which shows the workspace name in one column and the number of tasks for that workspace in the second column.
I have tried many combinations and searched many forums but after several hours still haven't been able to crack it.
If your only goal is to get the task counts per workspace, I think you want a different query.
@workspaces_with_task_counts =
  Workspace
    .joins(projects: :tasks)
    .select('workspaces.name, count(tasks.id) as task_count')
    .group('workspaces.id', 'workspaces.name')
Then you can access the count like this:
@workspaces_with_task_counts.each do |workspace|
  puts "#{workspace.name}: #{workspace.task_count}"
end
EDIT 1
I think this is what you want:
Workspace
  .joins(projects: { tasks: :urgencies })
  .where(urgencies: { urgency_value: 7 })
  .group(:name)
  .count
which results in a hash containing all of the workspaces with at least one task where the urgency_value is 7, by name, with the number of tasks in that workspace:
{"workspace1"=>4, "workspace2"=>1}
EDIT 2
SQL is not capable of returning both detail and summary information in a single query. But, we can get all the data, then summarize it in memory with Ruby's group_by method:
Task
  .joins(project: :workspace)
  .includes(project: :workspace)
  .group_by { |task| task.project.workspace.name }
This produces the following data structure:
{
  "workspace1": [task, task, task],
  "workspace2": [task, task],
  "workspace3": [task, task, task, task]
}
But, it does so at a cost. Grouping in memory is an expensive process. Running that query 10,000 times took ~15 seconds.
It turns out that executing two SQL queries is actually two orders of magnitude faster at ~0.2 seconds. Here are the queries:
tasks = Task.joins(project: :workspace).includes(project: :workspace)
counts = tasks.group('workspaces.name').count
The first query gets you all the tasks and preloads their associated project and workspace data. The second query uses ActiveRecord's group clause to construct the SQL statement to summarize the data. It returns this data structure:
{ "workspace1": 3, "workspace2": 2, "workspace3": 4 }
Databases are super efficient at set manipulation. It's almost always significantly faster to do that work in the database than in Ruby.
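For the table described in the question (workspace name plus task count), a usage sketch built on the second query might look like this (variable names are illustrative):
counts = Task.joins(project: :workspace).group('workspaces.name').count
# => { "workspace1" => 3, "workspace2" => 2, "workspace3" => 4 }

counts.each do |workspace_name, task_count|
  puts "#{workspace_name}: #{task_count}"
end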

Processing pgSQL query results in batches

I've written a rake task to perform a PostgreSQL query. The task returns an object of class PG::Result.
Here's my task:
task export_products: :environment do
  results = execute "SELECT smth IN somewhere"
  if results.present?
    results
  else
    nil
  end
end

def execute(sql)
  ActiveRecord::Base.connection.execute(sql)
end
My further plan is to split the output into batches and save these batches one by one into a .csv file.
Here I get stuck: I cannot see how to call the find_in_batches method of the ActiveRecord::Batches module on a PG::Result.
How should I proceed?
Edit: I have a legacy SQL query to a legacy database.
If you look at how find_in_batches is implemented, you'll see that the algorithm is essentially:
1. Force the query to be ordered by the primary key.
2. Add a LIMIT clause to the query to match the batch size.
3. Execute the modified query from (2) to get a batch.
4. Do whatever needs to be done with the batch.
5. If the batch is smaller than the batch size, then the unlimited query has been exhausted, so we're done.
6. Get the maximum primary key value (last_max) from the batch you got in (3).
7. Add primary_key_column > last_max to the WHERE clause of the query from (2), run the query again, and go to step (4).
Pretty straightforward, and it could be implemented with something like this:
def in_batches_of(batch_size)
  last_max = 0 # This should be safe for any normal integer primary key.
  query = %Q{
    select whatever
    from table
    where what_you_have_now
      and primary_key_column > %{last_max}
    order by primary_key_column
    limit #{batch_size}
  }

  results = execute(query % { last_max: last_max }).to_a
  while results.any?
    yield results
    break if results.length < batch_size

    last_max = results.last['primary_key_column']
    results = execute(query % { last_max: last_max }).to_a
  end
end

in_batches_of(1000) do |batch|
  # Do whatever needs to be done with the `batch` array here
end
Where, of course, primary_key_column and friends have been replaced with real values.
If you don't have a primary key in your query then you can use some other column that sorts nicely and is unique enough for your needs. You could also use an OFFSET clause instead of the primary key but that can get expensive with large result sets.
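For completeness, a rough sketch of the OFFSET variant could look like this (table and column names are placeholders, as above); note that each batch re-scans the rows skipped by OFFSET, which is why it gets slower as the result set grows:
def in_batches_with_offset(batch_size)
  offset = 0
  loop do
    # Reuses the execute helper from the question above.
    results = execute(%Q{
      select whatever
      from table
      where what_you_have_now
      order by some_sortable_column
      limit #{batch_size}
      offset #{offset}
    }).to_a
    break if results.empty?

    yield results
    offset += batch_size
  end
end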

Rails 4 - sum values grouped by external key

I know this must be simple but I'm really lost.
Three models: Job, Task and Operation, as follows:
Job
  has_many :tasks
Task
  belongs_to :job
  belongs_to :operation
Operation
  has_many :jobs
Job has an attribute, total_pieces, which tells me how many pieces you need.
For each Job, you can add a number of Tasks, which can belong to different Operations (cutting, drilling, etc.) and for every task you can set up a number of pieces.
I don't know in advance how many Operations will be needed for a single Job, but I need to alert the user of the number of pieces left for that Operation when a new Task is inserted.
Let us make an example:
Job 1: total_pieces=100
- Task 1: operation 1(cutting), pieces=20
- Task 2: operation 1(cutting), pieces=30
- Task 3: operation 2(drilling), pieces=20
I need to alert the user that they still need to cut 50 pieces and to drill 80.
Hypothetically, if I add:
- Task 4: operation 3(bending), pieces=20
I need to alert the user that they also still need to bend 80 pieces.
So far I've managed to list all kinds of Operations for each Job using map, but now I need to sum up the pieces of all Tasks with the same Operation type in a Job, and only for those Operations present in the Tasks belonging to that Job.
Is there any way to do this using map? Or do I need to write a query manually?
EDIT: this is what I've managed to patch up so far.
A method operations_applied in Job gives me a list of ids for all the Operations used in Tasks queued for the Job.
Then another method, pieces_remaining_for(operation), gives me the remaining pieces for a single operation.
Finally, in the Job views I need, I iterate through all operations_applied, printing all pieces_remaining_for.
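Simplified, those two methods look roughly like this (a sketch; the pieces attribute name on Task is assumed):
class Job < ActiveRecord::Base
  has_many :tasks

  # Ids of all Operations used by Tasks queued for this Job.
  def operations_applied
    tasks.pluck(:operation_id).uniq
  end

  # Pieces of this Job not yet covered by Tasks of the given Operation.
  def pieces_remaining_for(operation)
    total_pieces - tasks.where(operation_id: operation.id).sum(:pieces)
  end
end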
I know this is not particularly elegant, but so far it works; any ideas to improve this?
Thank you.
If I'm not misunderstanding, it is not possible to do what you want with map, since map always preserves the size of the array (arr.size == arr.map {...}.size) and you want to reduce your array.
What you could do is something like this:
job = Job.first
operation_pieces = {}

job.tasks.each do |task|
  operation_pieces[task.operation.id] ||= { operation: task.operation }
  operation_pieces[task.operation.id][:pieces] ||= 0
  operation_pieces[task.operation.id][:pieces] += task.pieces
end
Now operation_pieces contains the sum of pieces for each operation, keyed by the operation's id. But I'm sure there is a more elegant version of this ;)
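With the example data from the question (cutting: 20 + 30 pieces, drilling: 20), operation_pieces would look roughly like this (illustrative):
{
  1 => { operation: <Operation "cutting">,  pieces: 50 },
  2 => { operation: <Operation "drilling">, pieces: 20 }
}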
EDIT: changed the code example to a hash
EDIT: and here is the more elegant version:
job = Job.first
job.tasks
   .group_by(&:operation)
   .map { |op, tasks| { op => tasks.sum(&:pieces) } }
The group_by groups your array of tasks by the operation of each task (maybe you need to use group_by { |t| t.operation } instead; I'm not sure), and in the map afterwards the pieces of all tasks with the same operation are summed up. Finally, you end up with OPERATION => PIECES_SUM (INTEGER) pairs.
I assume the following attributes are required for your query:
Task: number_of_pieces
Job: name
Operation: name
Job.joins("LEFT JOIN tasks ON jobs.id = tasks.job_id")
   .joins("LEFT JOIN operations ON operations.id = tasks.operation_id")
   .select("SUM(tasks.number_of_pieces) AS number_of_pieces, operations.name, jobs.name")
   .group("operations.id, jobs.id")
This will list all the jobs, and the sum of pieces required for each operation under them.
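Each returned record responds to the selected columns. A hedged usage sketch (aliasing the two name columns to avoid a clash, which is an addition to the query above) could be:
rows = Job.joins("LEFT JOIN tasks ON jobs.id = tasks.job_id")
          .joins("LEFT JOIN operations ON operations.id = tasks.operation_id")
          .select("SUM(tasks.number_of_pieces) AS number_of_pieces, " \
                  "operations.name AS operation_name, jobs.name AS job_name")
          .group("operations.id, jobs.id")

rows.each do |row|
  puts "#{row.job_name} / #{row.operation_name}: #{row.number_of_pieces}"
end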
If you have the job_id for which you want the list of operations and pieces, then use the code below (note that joins must be called on a relation, not on the record returned by find):
Job.where(id: params[:job_id])
   .joins("LEFT JOIN tasks ON jobs.id = tasks.job_id")
   .joins("LEFT JOIN operations ON operations.id = tasks.operation_id")
   .select("SUM(tasks.number_of_pieces) AS number_of_pieces, operations.name, jobs.name")
   .group("operations.id, jobs.id")
Please comment if you need any explanation.

Rails find_in_batches with locking

I need to process a large number of records in batches, and each batch should be processed in its own transaction. Is there a way to wrap each batch in a transaction and lock all records in the batch at the same time?
Model.scheduled_now.lock.find_in_batches do |batch|
  model_ids = batch.map(&:id)
  Model.update_all({ status: 'delivering' }, ["id IN (?)", model_ids])
  # creates and updates other DB records
  # and triggers background job
  perform_delivery_actions(batch)
end
Does SELECT FOR UPDATE in this example commit the transaction after each batch?
Or do I need to put an internal transaction block and lock the records manually inside each batch (which means one more query)?
The reason I don't want to put an outer transaction block is that I want to commit each batch separately, not the whole thing at once.
I ended up implementing my own find_in_batches_with_lock:
def find_in_batches_with_lock(scope, user, batch_size: 1000)
  last_processed_id = 0
  records = []

  begin
    user.transaction do
      records = scope.where("id > ?", last_processed_id)
                     .order(:id).limit(batch_size).lock.all
      next if records.empty?

      yield records
      last_processed_id = records.last.id
    end
  end while !records.empty?
end
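Usage could then look something like this (Model.scheduled_now and the status value come from the question; current_user stands in for whatever object owns the transaction):
find_in_batches_with_lock(Model.scheduled_now, current_user, batch_size: 1000) do |batch|
  Model.where(id: batch.map(&:id)).update_all(status: "delivering")
  perform_delivery_actions(batch)
end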

Finding mongoDB records in batches (using mongoid ruby adapter)

Using Rails 3 and MongoDB with the Mongoid adapter, how can I batch finds to the Mongo DB? I need to grab all the records in a particular MongoDB collection and index them in Solr (initial index of the data for searching).
The problem I'm having is that doing Model.all grabs all the records and stores them in memory. Then, when I process them and index in Solr, my memory gets eaten up and the process dies.
What I'm trying to do is batch the find in Mongo so that I can iterate over 1,000 records at a time, pass them to Solr to index, and then process the next 1,000, etc.
The code I currently have does this:
Model.all.each do |r|
  Sunspot.index(r)
end
For a collection that has about 1.5 million records, this eats up 8+ GB of memory and kills the process. In ActiveRecord, there is a find_in_batches method that allows me to chunk the queries into manageable batches and keeps the memory from getting out of control. However, I can't seem to find anything like this for MongoDB/Mongoid.
I would LIKE to be able to do something like this:
Model.all.in_batches_of(1000) do |batch|
  Sunspot.index(batch)
end
That would alleviate my memory problems and query difficulties by only working on a manageable problem set each time. The documentation is sparse, however, on doing batch finds in MongoDB. I see lots of documentation on doing batch inserts but not batch finds.
With Mongoid, you don't need to manually batch the query.
In Mongoid, Model.all returns a Mongoid::Criteria instance. Upon calling #each on this Criteria, a Mongo driver cursor is instantiated and used to iterate over the records. This underlying Mongo driver cursor already batches all records. By default the batch_size is 100.
For more information on this topic, read this comment from the Mongoid author and maintainer.
In summary, you can just do this:
Model.all.each do |r|
  Sunspot.index(r)
end
If you are iterating over a collection where each record requires a lot of processing (e.g. querying an external API for each item), it is possible for the cursor to time out. In this case you need to perform multiple queries in order to not leave the cursor open.
require 'mongoid'

module Mongoid
  class Criteria
    def in_batches_of(count = 100)
      Enumerator.new do |y|
        total = 0

        loop do
          batch = 0

          self.limit(count).skip(total).each do |item|
            total += 1
            batch += 1
            y << item
          end

          break if batch == 0
        end
      end
    end
  end
end
The helper method above adds the batching functionality. It can be used like so:
Post.all.order_by(:id => 1).in_batches_of(7).each_with_index do |post, index|
  # call external slow API
end
Just make sure you ALWAYS have an order_by on your query, otherwise the paging might not do what you want it to. Also, I would stick with batches of 100 or less. As said in the accepted answer, Mongoid queries in batches of 100, so you never want to leave the cursor open while doing the processing.
It is faster to send batches to Sunspot as well.
This is how I do it:
records = []

Model.batch_size(1000).no_timeout.only(:your_text_field, :_id).all.each do |r|
  records << r

  if records.size > 1000
    Sunspot.index! records
    records.clear
  end
end

Sunspot.index! records
no_timeout: prevents the cursor from being closed (after 10 minutes, by default)
only: selects only the id and the fields that are actually indexed
batch_size: fetches 1,000 entries at a time instead of 100
I am not sure about the batch processing, but you can do it this way:
current_page = 0
item_count = Model.count

while item_count > 0
  Model.all.skip(current_page * 1000).limit(1000).each do |item|
    Sunspot.index(item)
  end

  item_count -= 1000
  current_page += 1
end
But if you are looking for a perfect long-term solution, I wouldn't recommend this. Let me explain how I handled the same scenario in my app. Instead of doing batch jobs,
I created a Resque job which updates the Solr index:
class SolrUpdator
  @queue = :solr_updator

  def self.perform(item_id)
    item = Model.find(item_id)
    # I have used RSolr; you can change the code below to use Sunspot
    solr = RSolr.connect :url => Rails.application.config.solr_path
    js = JSON.parse(item.to_json)
    solr.add js
  end
end
After adding the item, I just put an entry onto the Resque queue:
Resque.enqueue(SolrUpdator, item.id.to_s)
That's all; start the Resque workers and they will take care of everything.
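For example, the enqueue call could be hooked into the model with a callback (a sketch; the callback method name is an assumption):
class Model
  include Mongoid::Document

  after_save :enqueue_solr_update

  private

  def enqueue_solr_update
    # Push the id onto the Resque queue handled by SolrUpdator above.
    Resque.enqueue(SolrUpdator, id.to_s)
  end
end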
As @RyanMcGeary said, you don't need to worry about batching the query. However, indexing objects one at a time is much, much slower than batching them.
Model.all.to_a.in_groups_of(1000, false) do |records|
  Sunspot.index! records
end
The following will work for you; just try it:
Model.all.in_groups_of(1000, false) do |r|
  Sunspot.index! r
end
