I've written a rake task that performs a PostgreSQL query. The task returns a PG::Result object.
Here's my task:
task export_products: :environment do
  results = execute "SELECT smth IN somewhere"
  if results.present?
    results
  else
    nil
  end
end

def execute(sql)
  ActiveRecord::Base.connection.execute(sql)
end
My further plan is to split the output into batches and save these batches one by one into a .csv file.
This is where I get stuck: I can't see how to call the find_in_batches method of the ActiveRecord::Batches module on a PG::Result.
How should I proceed?
Edit: I have a legacy SQL query against a legacy database.
If you look at how find_in_batches is implemented, you'll see that the algorithm is essentially:
1. Force the query to be ordered by the primary key.
2. Add a LIMIT clause to the query to match the batch size.
3. Execute the modified query from (2) to get a batch.
4. Do whatever needs to be done with the batch.
5. If the batch is smaller than the batch size, then the unlimited query has been exhausted, so we're done.
6. Get the maximum primary key value (last_max) from the batch you got in (3).
7. Add primary_key_column > last_max to the WHERE clause of the query from (2), run the query again, and go to step (4).
Pretty straightforward, and it could be implemented with something like this:
def in_batches_of(batch_size)
  last_max = 0 # This should be safe for any normal integer primary key.
  query = %Q{
    select whatever
    from table
    where what_you_have_now
      and primary_key_column > %{last_max}
    order by primary_key_column
    limit #{batch_size}
  }

  results = execute(query % { last_max: last_max }).to_a
  while results.any?
    yield results
    break if results.length < batch_size
    last_max = results.last['primary_key_column']
    results = execute(query % { last_max: last_max }).to_a
  end
end
in_batches_of(1000) do |batch|
  # Do whatever needs to be done with the `batch` array here
end
Where, of course, primary_key_column and friends have been replaced with real values.
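To cover the CSV half of your plan, here's a minimal sketch built on the helper above, using Ruby's standard CSV library. The file name and the id/name/price columns are placeholders for whatever your query actually selects; each batch element is a Hash, since PG::Result#to_a returns hashes keyed by column name.

require 'csv'

CSV.open('products.csv', 'w') do |csv|
  csv << %w[id name price] # header row (placeholder column names)

  in_batches_of(1000) do |batch|
    batch.each do |row|
      csv << row.values_at('id', 'name', 'price')
    end
  end
end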
If you don't have a primary key in your query, then you can use some other column that sorts nicely and is unique enough for your needs. You could also use an OFFSET clause instead of the primary key, but that can get expensive with large result sets; a sketch of that variant follows.
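An OFFSET-based variant might look like this. It's simpler because no key column is threaded through the loop, but the database still has to scan and discard all of the skipped rows on every iteration, which is what makes it expensive on large result sets. Here some_stable_column stands in for any column with a consistent sort order:

def in_batches_by_offset(batch_size)
  offset = 0
  loop do
    results = execute(%Q{
      select whatever
      from table
      where what_you_have_now
      order by some_stable_column
      limit #{batch_size}
      offset #{offset}
    }).to_a
    break if results.empty?
    yield results
    offset += batch_size
  end
end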
I have an edge case where I want to use .first only after my SQL query has been executed.
My case is the following:
User.select("sum((type = 'foo')::int) as foo_count",
"sum((type = 'bar')::int) as bar_count")
.first
.yield_self { |r| r.bar_count / r.foo_count.to_f }
However, this would throw an SQL error saying that I should include my user_id in the GROUP BY clause. I've already found a hacky solution using to_a, but I really wonder if there is a proper way to force execution before my call to .first.
The error occurs because first adds an ORDER BY clause on the primary key (id).
"Find the first record (or first N records if a parameter is supplied). If no order is defined it will order by primary key."
Instead, try take:
"Gives a record (or N records if a parameter is supplied) without any implied order. The order will depend on the database implementation. If an order is supplied it will be respected."
So
User.select("sum((type = 'foo')::int) as foo_count",
"sum((type = 'bar')::int) as bar_count")
.take
.yield_self { |r| r.bar_count / r.foo_count.to_f }
should work appropriately; however, as stated, the order is indeterminate.
You may want to use pluck, which retrieves only the data, instead of select, which just alters which fields get loaded into the models:
User.pluck(
  "sum((type = 'foo')::int) as foo_count",
  "sum((type = 'bar')::int) as bar_count"
).map do |foo_count, bar_count|
  bar_count / foo_count.to_f
end
You can probably do the division in the query as well if necessary.
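For example, a sketch of pushing the division into the query itself; the nullif guard avoids division by zero, and on newer Rails versions the raw SQL string may need to be wrapped in Arel.sql:

ratio = User.pluck(
  "sum((type = 'bar')::int)::float / nullif(sum((type = 'foo')::int), 0)"
).first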
I am trying to group by a delegated join table. I have a tasks table, and each task has a project_id. The following code in my controller works well for grouping by project:
@task = Task.joins(:project).joins(:urgency).where(urgencies: { urgency_value: 7 }).group_by(&:project_id)
This returns a hash where the key is what I have grouped by, and the value contains the tasks within that group. I can then loop through each task to retrieve its attributes.
However, each project belongs to a workspace (via a workspace_id). What I want is to have the same query but to group by the workspace. The final aim is for me to create a table which shows the workspace name in one column and the number of tasks for that workspace in the second column.
I have tried many combinations and searched many forums but after several hours still haven't been able to crack it.
If your only goal is to get the task counts per workspace, I think you want a different query.
@workspaces_with_task_counts =
  Workspace
    .joins(projects: :tasks)
    .select('workspaces.id, workspaces.name, count(tasks.id) as task_count')
    .group('workspaces.id')
Then you can access the count like this:
@workspaces_with_task_counts.each do |workspace|
  puts "#{workspace.name}: #{workspace.task_count}"
end
EDIT 1
I think this is what you want:
Workspace
  .joins(projects: { tasks: :urgency })
  .where(urgencies: { urgency_value: 7 })
  .group(:name)
  .count
which results in a hash containing all of the workspaces with at least one task where the urgency_value is 7, by name, with the number of tasks in that workspace:
{"workspace1"=>4, "workspace2"=>1}
EDIT 2
A single SQL query is not well suited to returning both detail and summary information. But we can get all the data, then summarize it in memory with Ruby's group_by method:
Task
  .joins(project: :workspace)
  .includes(project: :workspace)
  .group_by { |task| task.project.workspace.name }
This produces the following data structure:
{
  "workspace1" => [task, task, task],
  "workspace2" => [task, task],
  "workspace3" => [task, task, task, task]
}
But, it does so at a cost. Grouping in memory is an expensive process. Running that query 10,000 times took ~15 seconds.
It turns out that executing two SQL queries is actually two orders of magnitude faster at ~0.2 seconds. Here are the queries:
tasks = Task.joins(project: :workspace).includes(project: :workspace)
counts = tasks.group('workspaces.name').count
The first query gets you all the tasks and preloads their associated project and workspace data. The second query uses ActiveRecord's group clause to construct the SQL statement to summarize the data. It returns this data structure:
{ "workspace1": 3, "workspace2": 2, "workspace3": 4 }
Databases are super efficient at set manipulation. It's almost always significantly faster to do that work in the database than in Ruby.
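If you want to reproduce the comparison, here's a rough sketch using Ruby's Benchmark module, timing a single run of each approach rather than the 10,000 iterations quoted above:

require 'benchmark'

in_memory = Benchmark.realtime do
  Task.joins(project: :workspace)
      .includes(project: :workspace)
      .group_by { |task| task.project.workspace.name }
end

in_database = Benchmark.realtime do
  Task.joins(project: :workspace).group('workspaces.name').count
end

puts "in memory: #{in_memory}s, in database: #{in_database}s"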
Rails: 4.1.2
Database: PostgreSQL
For one of my queries, I am using methods from both the textacular gem and Active Record. How can I chain some of the following queries with an "OR" instead of an "AND":
people = People.where(status: status_approved).fuzzy_search(first_name: "Test").where("last_name LIKE ?", "Test")
I want to chain the last two scopes (fuzzy_search and the where after it) together with an "OR" instead of an "AND." So I want to retrieve all People who are approved AND (whose first name is similar to "Test" OR whose last name contains "Test"). I've been struggling with this for quite a while, so any help would be greatly appreciated!
I dug into fuzzy_search and saw that it translates to something like:
SELECT "people".*, COALESCE(similarity("people"."first_name", 'test'), 0) AS "rankxxx"
FROM "people"
WHERE (("people"."first_name" % 'abc'))
ORDER BY "rankxxx" DESC
That says if you don't care about preserving order, it will just filter the result by WHERE (("people"."first_name" % 'abc'))
Knowing that, you can simply write a query with equivalent functionality:
People.where(status: status_approved)
      .where('(first_name % :key) OR (last_name LIKE :key)', key: 'Test')
If you want a particular order, you'll need to decide what it should be once the two conditions are combined.
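For instance, to keep a similarity-based ranking like the one fuzzy_search would have produced, a sketch along these lines should work. The search_rank alias is arbitrary, the LIKE pattern is widened to %Test% to match "contains", and on newer Rails versions the raw order string may need Arel.sql:

People.where(status: status_approved)
      .where('(first_name % :key) OR (last_name LIKE :key_like)',
             key: 'Test', key_like: '%Test%')
      .select("people.*, COALESCE(similarity(people.first_name, 'Test'), 0) AS search_rank")
      .order('search_rank DESC')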
After a few days, I came up with the solution! Here's what I did:
This is the query I wanted to chain together with an OR:
people = People.where(status: status_approved).fuzzy_search(first_name: "Test").where("last_name LIKE ?", "Test")
As Hoang Phan suggested, when you look in the console, this produces the following SQL:
SELECT "people".*, COALESCE(similarity("people"."first_name", 'test'), 0) AS "rank69146689305952314"
FROM "people"
WHERE "people"."status" = 1 AND (("people"."first_name" % 'Test')) AND (last_name LIKE 'Test') ORDER BY "rank69146689305952314" DESC
I then dug into the textacular gem and found out how the rank is generated. I found it in the textacular.rb file and then crafted the SQL query using it. I also replaced the "AND" that connected the last two conditions with an "OR":
# Generate a random number for the rank column alias
rank = rand(100000000000000000).to_s

# Create the SQL query
sql_query = "SELECT people.*, COALESCE(similarity(people.first_name, :query), 0)" +
            " AS rank#{rank} FROM people" +
            " WHERE (people.status = :status AND" +
            " ((people.first_name % :query) OR (last_name LIKE :query_like)))" +
            " ORDER BY rank#{rank} DESC"
I removed all of the quotation marks around table and field names in the SQL query, because they produced error messages when I kept them, even when I used single quotes.
Then, I used the find_by_sql method to retrieve the People IDs in an array. The placeholders (:status, :query, :query_like) protect against SQL injection, so I set their values accordingly:
# Retrieve all the IDs of People who are approved and whose first name and last name match the search query.
# The IDs are sorted in order of most relevant to the search query.
people_ids = People.find_by_sql([sql_query, query: "Test", query_like: "%Test%", status: 1]).map(&:id)
I get the IDs rather than the People objects in an array because find_by_sql returns an Array, not an ActiveRecord::Relation, so I cannot chain ActiveRecord query methods such as where onto the result. Using the IDs, we can execute another query to get a proper relation. However, there's one problem: if we were to simply run People.where(id: people_ids), the order of the IDs would not be preserved, so all the relevance ranking we did would be for nothing.
Fortunately, there's a nice gem called order_as_specified that will allow us to retrieve all People objects in the specific order of the IDs. Although the gem would work, I didn't use it and instead wrote a short line of code to craft conditions that would preserve the order.
order_by = people_ids.map { |id| "people.id='#{id}' DESC" }.join(", ")
If our people_ids array is [1, 12, 3], this creates the following ORDER BY clause:
"people.id='1' DESC, people.id='12' DESC, people.id='3' DESC"
I learned from this comment that writing an ORDER statement in this way would preserve the order.
Now, all that's left is to retrieve the People objects from ActiveRecord, making sure to specify the order.
people = People.where(id: people_ids).order(order_by)
And that did it! I didn't worry about removing any duplicate IDs because ActiveRecord does that automatically when you run the where command.
I understand that this code is not very portable and would require some changes if any of the people table's columns are modified, but it works perfectly and seems to execute only one query according to the console.
Is there a way to batch independent (i.e. not dependent on a previous value) queries into a single request to the database in order to prevent round trips?
I intend use this to read data from several unrelated data models, so joins and views don't suffice.
Here's a very rough idea of what I'm attempting:
queries = {
  # What         Relation or SQL
  top_duck:    Duck.limit(1),               # We can't use first directly
  cow_count:   Cow.select('count(*)'),      # We can't use count directly
  petals_lost: Flower.group('petals_lost').order('petals_lost')
                     .select('avg(petals_lost) as average_petals_lost'),
  avg_score:   "select avg(score), stddev(score) from games"
}

bq = BatchQuery.new(queries)
# And in one trip...
bq.execute # => Hash<DB::Result> suffices for now; next step: typecasting
Using Rails 3 and MongoDB with the Mongoid adapter, how can I batch finds to MongoDB? I need to grab all the records in a particular MongoDB collection and index them in Solr (initial index of data for searching).
The problem I'm having is that doing Model.all grabs all the records and stores them into memory. Then when I process over them and index in solr, my memory gets eaten up and the process dies.
What I'm trying to do is batch the find in mongo so that I can iterate over 1,000 records at a time, pass them to solr to index, and then process the next 1,000, etc...
The code I currently have does this:
Model.all.each do |r|
  Sunspot.index(r)
end
For a collection that has about 1.5 million records, this eats up 8+ GB of memory and kills the process. In ActiveRecord, there is a find_in_batches method that allows me to chunk up the queries into manageable batches that keeps the memory from getting out of control. However, I can't seem to find anything like this for mongoDB/mongoid.
I would LIKE to be able to do something like this:
Model.all.in_batches_of(1000) do |batch|
  Sunspot.index(batch)
end
That would alleviate my memory problems and query difficulties by only doing a manageable problem set each time. The documentation is sparse, however, on doing batch finds in mongoDB. I see lots of documentation on doing batch inserts but not batch finds.
With Mongoid, you don't need to manually batch the query.
In Mongoid, Model.all returns a Mongoid::Criteria instance. Upon calling #each on this Criteria, a Mongo driver cursor is instantiated and used to iterate over the records. The underlying driver cursor already fetches records in batches; by default the batch_size is 100.
For more information on this topic, read this comment from the Mongoid author and maintainer.
In summary, you can just do this:
Model.all.each do |r|
  Sunspot.index(r)
end
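If the default of 100 doesn't suit your workload, the driver batch size can be tuned on the criteria; the same batch_size option appears in one of the answers further down:

Model.all.batch_size(500).each do |r|
  Sunspot.index(r)
end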
If you are iterating over a collection where each record requires a lot of processing (e.g. querying an external API for each item), it is possible for the cursor to time out. In this case you need to perform multiple queries so as not to leave the cursor open.
require 'mongoid'

module Mongoid
  class Criteria
    def in_batches_of(count = 100)
      Enumerator.new do |y|
        total = 0
        loop do
          batch = 0
          # Each pass issues a fresh query, so no cursor is left open
          # while the caller processes the yielded items.
          self.limit(count).skip(total).each do |item|
            total += 1
            batch += 1
            y << item
          end
          break if batch == 0
        end
      end
    end
  end
end
The helper method above adds the batching functionality. It can be used like so:
Post.all.order_by(:id => 1).in_batches_of(7).each_with_index do |post, index|
  # call external slow API
end
Just make sure you ALWAYS have an order_by on your query; otherwise the paging might not do what you want it to. Also, I would stick with batches of 100 or less. As said in the accepted answer, Mongoid queries in batches of 100, so you never want to leave the cursor open while doing the processing.
It is faster to send batches to Sunspot as well.
This is how I do it:
records = []
Model.batch_size(1000).no_timeout.only(:your_text_field, :_id).all.each do |r|
  records << r
  if records.size > 1000
    Sunspot.index! records
    records.clear
  end
end
Sunspot.index! records
no_timeout: prevents the cursor from being closed (which happens after 10 minutes by default)
only: fetches only the id and the fields that are actually indexed
batch_size: fetches 1000 entries at a time instead of 100
I am not sure about the batch processing, but you can do it this way:
current_page = 0
item_count = Model.count
while item_count > 0
  Model.all.skip(current_page * 1000).limit(1000).each do |item|
    Sunspot.index(item)
  end
  item_count -= 1000
  current_page += 1
end
But if you are looking for a long-term solution, I wouldn't recommend this. Let me explain how I handled the same scenario in my app: instead of doing batch jobs, I created a Resque job which updates the Solr index.
class SolrUpdator
  @queue = :solr_updator

  def self.perform(item_id)
    item = Model.find(item_id)
    # I have used RSolr; you can change the code below to handle Sunspot
    solr = RSolr.connect :url => Rails.application.config.solr_path
    js = JSON.parse(item.to_json)
    solr.add js
  end
end
After adding an item, I just put an entry onto the Resque queue:
Resque.enqueue(SolrUpdator, item.id.to_s)
That's all. Start Resque and it will take care of everything.
As @RyanMcGeary said, you don't need to worry about batching the query. However, indexing objects one at a time is much, much slower than batching them.
# Note: to_a loads the entire collection into memory before grouping.
Model.all.to_a.in_groups_of(1000, false) do |records|
  Sunspot.index! records
end
The following will work for you; just try it:
Model.all.in_groups_of(1000, false) do |r|
  Sunspot.index! r
end