Batching Queries with ActiveRecord - ruby-on-rails

Is there a way to batch independent (i.e. not dependent on a previous value) queries into a single request to the database in order to prevent round trips?
I intend to use this to read data from several unrelated data models, so joins and views don't suffice.
Here's a very rough idea of what I'm attempting:
queries = {
  # name => Relation or raw SQL
  top_duck: Duck.limit(1),           # we can't use .first directly
  cow_count: Cow.select('count(*)'), # we can't use .count directly
  petals_lost: Flower.group('petals_lost').order('petals_lost')
                     .select('avg(petals_lost) as average_petals_lost'),
  avg_score: "select avg(score), stddev(score) from games"
}
bq = BatchQuery.new(queries)
# And in one trip...
bq.execute # => Hash<DB::Result> suffices for now; next step: typecasting
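For what it's worth, on PostgreSQL the simple query protocol already allows several semicolon-joined statements in one request: PG::Connection#send_query sends them all at once, and get_result then returns one result per statement. A rough sketch of what BatchQuery could look like on top of that (the class itself and the use of raw_connection are assumptions, not a tested library):

```ruby
# Sketch: batch several independent queries into one round trip on PostgreSQL.
# BatchQuery is hypothetical; only the PG simple-query-protocol calls are real.
class BatchQuery
  def initialize(queries)
    @queries = queries # Hash of name => ActiveRecord::Relation or raw SQL
  end

  # Join relations/strings into a single semicolon-separated SQL payload.
  def batch_sql
    @queries.values
            .map { |q| q.respond_to?(:to_sql) ? q.to_sql : q }
            .join('; ')
  end

  def execute
    conn = ActiveRecord::Base.connection.raw_connection # a PG::Connection
    conn.send_query(batch_sql) # all statements travel in one request
    results = []
    while (res = conn.get_result) # one PG::Result per statement
      results << res
    end
    @queries.keys.zip(results).to_h # => Hash of name => PG::Result
  end
end
```

Typecasting each PG::Result would then be the next step, as anticipated above.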

Related

Rails: Group By join table

I am trying to group by a delegated join table. I have a tasks table, and each task has a project_id. The following code works well in my controller for me to group by project:
@task = Task.joins(:project).joins(:urgency).where(urgencies: { urgency_value: 7 }).group_by(&:project_id)
This returns a hash where the key is what I have joined by and then the index contains each tasks within that group. I can then loop through each task to retrieve its attributes.
However, each project belongs to a workspace (via a workspace_id). What I want is to have the same query but to group by the workspace. The final aim is for me to create a table which shows the workspace name in one column and the number of tasks for that workspace in the second column.
I have tried many combinations and searched many forums but after several hours still haven't been able to crack it.
If your only goal is to get the task counts per workspace, I think you want a different query.
@workspaces_with_task_counts =
  Workspace
    .joins(projects: :tasks)
    .select('workspaces.name, count(tasks.id) as task_count')
    .group('workspaces.id')
Then you can access the count like this:
@workspaces_with_task_counts.each do |workspace|
  puts "#{workspace.name}: #{workspace.task_count}"
end
EDIT 1
I think this is what you want:
Workspace
  .joins(projects: { tasks: :urgencies })
  .where(urgencies: { urgency_value: 7 })
  .group(:name)
  .count
which results in a hash containing all of the workspaces with at least one task where the urgency_value is 7, by name, with the number of tasks in that workspace:
{"workspace1"=>4, "workspace2"=>1}
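That hash maps directly onto the two-column table the question asks for; formatting it is plain Ruby:

```ruby
# The hash shape returned by the grouped count above.
counts = { "workspace1" => 4, "workspace2" => 1 }

# One row per workspace: name column, then task count.
rows = counts.map { |name, n| format('%-12s | %d', name, n) }
puts rows
# workspace1   | 4
# workspace2   | 1
```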
EDIT 2
A plain aggregate query can't return both detail rows and summary counts at once (short of window functions). But we can get all the data, then summarize it in memory with Ruby's group_by method:
Task
  .joins(project: :workspace)
  .includes(project: :workspace)
  .group_by { |task| task.project.workspace.name }
This produces the following data structure:
{
  "workspace1" => [task, task, task],
  "workspace2" => [task, task],
  "workspace3" => [task, task, task, task]
}
But, it does so at a cost. Grouping in memory is an expensive process. Running that query 10,000 times took ~15 seconds.
It turns out that executing two SQL queries is actually two orders of magnitude faster at ~0.2 seconds. Here are the queries:
tasks = Task.joins(project: :workspace).includes(project: :workspace)
counts = tasks.group('workspaces.name').count
The first query gets you all the tasks and preloads their associated project and workspace data. The second query uses ActiveRecord's group clause to construct the SQL statement to summarize the data. It returns this data structure:
{ "workspace1" => 3, "workspace2" => 2, "workspace3" => 4 }
Databases are super efficient at set manipulation. It's almost always significantly faster to do that work in the database than in Ruby.
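To make the two shapes concrete without a database, here are plain-Ruby stand-ins showing that the in-memory group_by and the SQL-style grouped count agree:

```ruby
# Struct stand-ins for loaded Task records (attributes simplified).
Task = Struct.new(:name, :workspace_name)
tasks = [
  Task.new('t1', 'workspace1'),
  Task.new('t2', 'workspace1'),
  Task.new('t3', 'workspace2')
]

# Detail: what Enumerable#group_by produces in memory.
by_workspace = tasks.group_by(&:workspace_name)
# Summary: the shape the grouped count query returns.
counts = by_workspace.transform_values(&:size)

p counts # => {"workspace1"=>2, "workspace2"=>1}
```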

I need advice in speeding up this rails method that involves many queries

I'm trying to display a table that counts webhooks and arranges the various counts into cells by date_sent, sending_ip, and esp (email service provider). Within each cell, the controller needs to count the webhooks that are labelled with the "opened" event, and the "sent" event. Our database currently includes several million webhooks, and adds at least 100k per day. Already this process takes so long that running this index method is practically useless.
I was hoping that Rails could break down the enormous model into smaller lists using a line like this:
@today_hooks = @m_webhooks.where(date_sent: this_date)
I thought that the queries after this line would only look at the partial list, instead of the full model. Unfortunately, running this index method generates hundreds of SQL statements, and they all look like this:
SELECT COUNT(*) FROM "m_webhooks" WHERE "m_webhooks"."date_sent" = $1 AND "m_webhooks"."sending_ip" = $2 AND (m_webhooks.esp LIKE 'hotmail') AND (m_webhooks.event LIKE 'sent')
The date_sent condition appears in every one of these queries, which implies that each query searches through all of the ~1M records rather than a pre-filtered subset.
I've read over a dozen articles about increasing performance in Rails queries, but none of the tips that I've found there have reduced the time it takes to complete this method. Thank you in advance for any insight.
m_webhooks.controller.rb
def index
  def set_sub_count_hash(thip)
    {
      gmail_hooks:   { opened: a = thip.gmail.send(@event).size,   total_sent: b = thip.gmail.sent.size,   perc_opened: find_perc(a, b) },
      hotmail_hooks: { opened: a = thip.hotmail.send(@event).size, total_sent: b = thip.hotmail.sent.size, perc_opened: find_perc(a, b) },
      yahoo_hooks:   { opened: a = thip.yahoo.send(@event).size,   total_sent: b = thip.yahoo.sent.size,   perc_opened: find_perc(a, b) },
      other_hooks:   { opened: a = thip.other.send(@event).size,   total_sent: b = thip.other.sent.size,   perc_opened: find_perc(a, b) }
    }
  end

  @m_webhooks = MWebhook.select("date_sent", "sending_ip", "esp", "event", "email").all
  @event = params[:event] || "unique_opened"
  @m_list_of_ips = [ ... ] # list of three IP addresses
  end_date = Date.today
  start_date = Date.today - 10.days
  date_range = (end_date - start_date).to_i
  @count_array = []
  date_range.times do |n|
    this_date = end_date - n.days
    @today_hooks = @m_webhooks.where(date_sent: this_date)
    @count_array[n] = { this_date: this_date }
    @m_list_of_ips.each_with_index do |ip, index|
      thip = @today_hooks.where(sending_ip: ip) # stands for "Today Hooks ip"
      @count_array[n][index] = set_sub_count_hash(thip)
    end
  end
end
Well, your problem is actually very simple. You have to remember that when you use where(condition), the query is not executed against the DB right away.
Rails is smart enough to detect when you need a concrete result (a list, an object, or a count or #size like in your case) and keeps chaining your queries until you ask for one. In your code, you keep chaining conditions to the main query inside a loop (date_range). And it gets worse: you start another loop inside this one, adding conditions to each query created in the first loop.
Then you pass the query (not yet concrete: it has not been executed and has no results!) to the method set_sub_count_hash, which goes on to materialize that same query many times.
Therefore you have something like:
10 (date range) * 3 (IP list) * 8 (times the query is materialized in set_sub_count_hash) = 240 queries
and then you have a problem.
What you want to do is to do the whole query at once and group it by date, ip and email. You should have a hash structure after that, which you would pass to the #set_sub_count method and do some ruby gymnastics to get the counts you're looking for.
I imagine the query something like:
main_query = @m_webhooks.where('date_sent > ?', 10.days.ago.to_date)
                        .where(sending_ip: @m_list_of_ips)
Ok, now you have one query, which is nice, but I think you should separate the query into 4 (gmail, hotmail, yahoo and other), which gives you 4 queries (the first one, the main_query, will not be executed until you call for materialized results, don't forget it). Still, something like 100 times faster.
I think this is the result that should be grouped, mapped and passed to #set_sub_count instead of passing the raw query and calling methods on it every time and many times. It will be a little work to do the grouping, mapping and counting for sure, but hey, it's faster. =)
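A sketch of that grouping step: once the single query is materialized, one pass with Enumerable replaces all the repeated COUNT queries (the attributes here are stand-ins for the real webhook columns):

```ruby
# Struct stand-ins for materialized webhook rows.
Hook = Struct.new(:esp, :event)
hooks = [
  Hook.new('gmail', 'sent'),
  Hook.new('gmail', 'unique_opened'),
  Hook.new('hotmail', 'sent')
]

# One pass over the loaded rows instead of a COUNT query per cell.
sub_counts = hooks.group_by(&:esp).transform_values do |rows|
  {
    opened:     rows.count { |r| r.event == 'unique_opened' },
    total_sent: rows.count { |r| r.event == 'sent' }
  }
end

p sub_counts
# => {"gmail"=>{:opened=>1, :total_sent=>1}, "hotmail"=>{:opened=>0, :total_sent=>1}}
```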
In case this helps anybody else, I learned how to fill a hash with counts in a much simpler way. More importantly, this approach runs a single query (as opposed to the 240 queries that I was running before).
@count_array[esp_index][j] = MWebhook.where('date_sent > ?', start_date.to_date)
                                     .group('date_sent', 'sending_ip', 'event', 'esp').count
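That group(...).count returns a flat hash keyed by [date_sent, sending_ip, event, esp] arrays; nesting it for per-cell lookup is then plain Ruby (dates shown as strings for brevity; ActiveRecord would return Date objects):

```ruby
# Shape of the hash ActiveRecord returns for a multi-column group(...).count.
flat = {
  ['2023-01-01', '1.2.3.4', 'sent',          'gmail'] => 10,
  ['2023-01-01', '1.2.3.4', 'unique_opened', 'gmail'] => 4
}

# Reshape as date => ip => esp => event => count for the table cells.
nested = Hash.new { |h, k| h[k] = {} }
flat.each do |(date, ip, event, esp), count|
  ((nested[date][ip] ||= {})[esp] ||= {})[event] = count
end

p nested['2023-01-01']['1.2.3.4']['gmail']
# => {"sent"=>10, "unique_opened"=>4}
```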

Processing pgSQL query results in batches

I've written a rake task to perform a PostgreSQL query. The task returns a PG::Result object.
Here's my task:
task export_products: :environment do
  results = execute "SELECT smth IN somewhere"
  if results.present?
    results
  else
    nil
  end
end

def execute(sql)
  ActiveRecord::Base.connection.execute(sql)
end
My further plan is to split the output into batches and save them batch by batch into a .csv file.
Here I get stuck: I cannot see how to call the find_in_batches method of the ActiveRecord::Batches module on a PG::Result.
How should I proceed?
Edit: I have a legacy sql query to a legacy database
If you look at how find_in_batches is implemented, you'll see that the algorithm is essentially:
1. Force the query to be ordered by the primary key.
2. Add a LIMIT clause to the query to match the batch size.
3. Execute the modified query from (2) to get a batch.
4. Do whatever needs to be done with the batch.
5. If the batch is smaller than the batch size, the unlimited query has been exhausted, so we're done.
6. Get the maximum primary key value (last_max) from the batch you got in (3).
7. Add primary_key_column > last_max to the WHERE clause of the query from (2), run it again, and go to step (4).
Pretty straightforward, and it could be implemented with something like this:
def in_batches_of(batch_size)
  last_max = 0 # This should be safe for any normal integer primary key.
  query = %Q{
    select whatever
    from table
    where what_you_have_now
      and primary_key_column > %{last_max}
    order by primary_key_column
    limit #{batch_size}
  }
  results = execute(query % { last_max: last_max }).to_a
  while results.any?
    yield results
    break if results.length < batch_size
    last_max = results.last['primary_key_column']
    results = execute(query % { last_max: last_max }).to_a
  end
end

in_batches_of(1000) do |batch|
  # Do whatever needs to be done with the `batch` array here
end
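The body of that block is where the CSV writing from the question would go; a sketch using Ruby's stdlib CSV (the file name and row keys are assumptions):

```ruby
require 'csv'

# Append one batch of row-hashes to a CSV file, writing the header once.
def write_batch_to_csv(path, batch)
  write_header = !File.exist?(path)
  CSV.open(path, 'a') do |csv|
    csv << batch.first.keys if write_header
    batch.each { |row| csv << row.values }
  end
end

# in_batches_of(1000) { |batch| write_batch_to_csv('export.csv', batch) }
```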
Where, of course, primary_key_column and friends have been replaced with real values.
If you don't have a primary key in your query then you can use some other column that sorts nicely and is unique enough for your needs. You could also use an OFFSET clause instead of the primary key but that can get expensive with large result sets.
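For comparison, an OFFSET-based variant of the same loop might look like this (same placeholder query; each page makes the database re-read all the skipped rows, which is why it gets expensive on large result sets):

```ruby
# OFFSET batching: no key column needed, but cost grows with the offset
# because skipped rows are still scanned. Relies on the same execute helper
# defined above.
def in_batches_with_offset(batch_size)
  offset = 0
  loop do
    results = execute(
      "select whatever from table order by some_column " \
      "limit #{batch_size} offset #{offset}"
    ).to_a
    break if results.empty?
    yield results
    break if results.length < batch_size
    offset += batch_size
  end
end
```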

Can i write this Query in ActiveRecord

For a data analysis I need both results in one set.
a.follower_trackings.pluck(:date, :new_followers, :deleted_followers)
a.data_trackings.pluck(:date, :followed_by_count)
Instead of ugly-merging arrays (they can have different starting dates, and obviously I only need the values where the date exists in both), I thought about MySQL:
SELECT
  followers.new_followers,
  followers.deleted_followers,
  trackings.date,
  trackings.followed_by_count
FROM
  instagram_user_follower_trackings AS followers,
  instagram_data_trackings AS trackings
WHERE
  followers.date = trackings.date
  AND followers.user_id = 5
  AND trackings.user_id = 5
ORDER BY
  trackings.date DESC
This is working fine, but I wonder if I can write the same with ActiveRecord?
You can do the following, which should render the same query as your raw SQL, but it's also quite ugly:
a.follower_trackings.
  merge(a.data_trackings).
  from("instagram_user_follower_trackings, instagram_data_trackings").
  where("instagram_user_follower_trackings.date = instagram_data_trackings.date").
  order(:date => :desc).
  pluck("instagram_data_trackings.date",
        :new_followers, :deleted_followers, :followed_by_count)
There are a few tricks that turned out useful while playing with the scopes: the merge trick adds the data_trackings.user_id = a.id condition, but it does not join in data_trackings; that's why the from clause has to be added, which essentially performs the INNER JOIN. The rest is pretty straightforward and leverages the fact that the order and pluck clauses don't need the table name specified as long as the columns are either unique among the tables or specified in the SELECT (pluck).
Well, when looking again, I would probably rather define a scope for retrieving the data for a given user (a record) that would essentially use the raw SQL you have in your question. I might also define a helper instance method that would call the scope with self, something like:
class Model
  scope :tracking_info, ->(user) { ... }

  def tracking_info
    Model.tracking_info(self)
  end
end
Then one can use simply:
a = Model.find(1)
a.tracking_info
# => [[...], [...]]

includes/joins case in rails 4

I have a habtm relationship between my Product and Category model.
I'm trying to write a query that searches for products with minimum of 2 categories.
I got it working with the following code:
p = Product.joins(:categories).group("product_id").having("count(product_id) > 1")
p.length # 178
When iterating over it, though, every call to product.categories makes a new query to the database, which is not good. I want to prevent these calls and keep the same result. Doing more research, I've seen that I could include (includes) the categories so the association is loaded into memory up front and the database isn't hit again while iterating. So I got it working with the following code:
p2 = Product.includes(:categories).joins(:categories).group("product_id").having("count(product_id) > 1")
p2.length # 178 - I compared and the objects are the same as last query
Here comes what I am confused about:
p.first.eql? p2.first # true
p.first.categories.eql? p2.first.categories # false
p.first.categories.length # 2
p2.first.categories.length # 1
Why with the includes query I get the right objects but I don't get the categories relationship right?
It has something to do with the group method. Your p2 only contains the first category for each product.
You could break this up into two queries:
product_ids = Product.joins(:categories).group("product_id").having("count(product_id) > 1").pluck(:product_id)
result = Product.includes(:categories).find(product_ids)
Yeah, you hit the database twice, but at least you don't go to the database when you're iterating.
You must know that includes doesn't play well with joins (joins will just suppress the former).
Also, when you include an association, ActiveRecord figures out whether it'll use eager_load (with a LEFT JOIN) or preload (with a separate query); includes is just a wrapper for one of those two.
The thing is, preload plays well with joins! So you can do this:
products = Product.preload(:categories). # this will trigger a separate query
  joins(:categories).                    # this will build the relevant query
  group("products.id").
  having("count(product_id) > 1").
  select("products.*")
Note that this will also hit the database twice, but you will not have any O(n) query.
