ActiveRecord: Alternative to find_in_batches?

I have a query that loads thousands of objects and I want to tame it by using find_in_batches:
Car.includes(:member).where(:engine => "123").find_in_batches(batch_size: 500) ...
According to the docs, I can't have a custom sorting order: http://www.rubydoc.info/docs/rails/4.0.0/ActiveRecord/Batches:find_in_batches
However, I need a custom sort order of created_at DESC. Is there another method to run this query in chunks like it does in find_in_batches so that not so many objects live on the heap at once?

Hm, I've been thinking about a solution for this (I'm the person who asked the question). It makes sense that find_in_batches doesn't allow a custom order. Let's say you sort by created_at DESC and specify a batch_size of 500. The first loop covers rows 1-500, the second loop covers rows 501-1000, and so on. What if someone inserts a new record into the table before the second loop runs? It would land at the top of the query results, every row behind it would shift by one, and your second loop would re-process a record from the first batch.
You could argue that created_at ASC would be safe then, but even that isn't guaranteed if your app ever sets created_at explicitly.
UPDATE:
I wrote a gem for this problem: https://github.com/EdmundMai/batched_query
Since using it, the average memory of my application has HALVED. I highly suggest anyone having similar issues to check it out! And contribute if you want!

The slower, manual way to do this is something like:
batch_size = 500
count = Car.includes(:member).where(:engine => "123").count
batches = count / batch_size
batches += 1 if count % batch_size > 0
last_id = 0
batches.times do
  # Pluck just the ids for this batch, paging on id so no row is skipped
  # or repeated between batches
  ids = Car.where("engine = ? AND id > ?", "123", last_id)
           .order(:id)
           .limit(batch_size)
           .ids
  break if ids.empty?
  # Load the full records and apply the custom order within the batch
  cars = Car.includes(:member).where(:id => ids).order(created_at: :desc)
  # cars.each or cars.update_all -- do your updating here
  last_id = ids.last
end

Can you imagine how find_in_batches with sorting would work on 1M rows or more? It would sort all of the rows for every batch.
So I think it's better to reduce the number of sort operations. For example, with a batch size of 500, you can load just the IDs (with the sort applied) for N * 500 rows in one query, and then load each batch of objects by those IDs. That cuts the number of sorted queries sent to the DB by a factor of N.
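A minimal sketch of that idea (the batch size of 500 and N = 10 are illustrative values, not from the answer):
batch_size = 500
n = 10

# One sorted query fetches the ids for N batches' worth of rows
ids = Car.where(:engine => "123")
         .order(created_at: :desc)
         .limit(n * batch_size)
         .ids

ids.each_slice(batch_size) do |batch_ids|
  # where(id: ...) does not preserve order, so re-sort the batch in Ruby
  cars = Car.includes(:member)
            .where(:id => batch_ids)
            .sort_by { |car| batch_ids.index(car.id) }
  # process cars here
end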

Related

Speed up Active Record group by count query

How can I speed up the following query? I'm looking to find records whose fb_id occurs 6 or fewer times. The select doesn't seem to be adding much in terms of time; it's the group and count. Is there an alternate way to query? I added an index on fb_id and it only sped up the query by 50%.
FbGroupApplication.group(:fb_id).where.not(
  fb_id: _get_exclude_fb_group_ids
).order(
  "count_fb_id desc"
).count(
  "fb_id"
).select { |k, v| v <= 6 }
The query is looking for FbGroupApplications that have 6 or fewer applications to the same fb_id.
Passing a block to the select method makes Rails execute the SQL, convert the found rows into Ruby objects, and then filter the resulting collection with the block you gave. This whole round trip is costly, and Ruby is far slower at it than the database.
You can "delegate" the responsibility of comparing the count vs 6 to the database with a having clause:
FbGroupApplication
.group(:fb_id)
.where.not(fb_id: _get_exclude_fb_group_ids)
.having('count(fb_id) <= 6')
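If you still need the per-fb_id counts as a hash, you can chain count onto the having-filtered relation (a sketch):
counts = FbGroupApplication
  .group(:fb_id)
  .where.not(fb_id: _get_exclude_fb_group_ids)
  .having('count(fb_id) <= 6')
  .count('fb_id')
# => { fb_id => count, ... }, with the filtering done entirely in SQL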

I need advice on speeding up this Rails method that involves many queries

I'm trying to display a table that counts webhooks and arranges the various counts into cells by date_sent, sending_ip, and esp (email service provider). Within each cell, the controller needs to count the webhooks that are labelled with the "opened" event, and the "sent" event. Our database currently includes several million webhooks, and adds at least 100k per day. Already this process takes so long that running this index method is practically useless.
I was hoping that Rails could break down the enormous model into smaller lists using a line like this:
@today_hooks = @m_webhooks.where(:date_sent => this_date)
I thought that the queries after this line would only look at the partial list, instead of the full model. Unfortunately, running this index method generates hundreds of SQL statements, and they all look like this:
SELECT COUNT(*) FROM "m_webhooks" WHERE "m_webhooks"."date_sent" = $1 AND "m_webhooks"."sending_ip" = $2 AND (m_webhooks.esp LIKE 'hotmail') AND (m_webhooks.event LIKE 'sent')
It appears that the date_sent attribute is included in all of the queries, which implies that the SQL is searching through all 1M+ records on every single query.
I've read over a dozen articles about increasing performance in Rails queries, but none of the tips that I've found there have reduced the time it takes to complete this method. Thank you in advance for any insight.
m_webhooks_controller.rb
def index
  def set_sub_count_hash(thip)
    {
      gmail_hooks: { opened: a = thip.gmail.send(@event).size, total_sent: b = thip.gmail.sent.size, perc_opened: find_perc(a, b) },
      hotmail_hooks: { opened: a = thip.hotmail.send(@event).size, total_sent: b = thip.hotmail.sent.size, perc_opened: find_perc(a, b) },
      yahoo_hooks: { opened: a = thip.yahoo.send(@event).size, total_sent: b = thip.yahoo.sent.size, perc_opened: find_perc(a, b) },
      other_hooks: { opened: a = thip.other.send(@event).size, total_sent: b = thip.other.sent.size, perc_opened: find_perc(a, b) },
    }
  end

  @m_webhooks = MWebhook.select("date_sent", "sending_ip", "esp", "event", "email").all
  @event = params[:event] || "unique_opened"
  @m_list_of_ips = [...] # list of three IP addresses
  end_date = Date.today
  start_date = Date.today - 10.days
  date_range = (end_date - start_date).to_i
  @count_array = []
  date_range.times do |n|
    this_date = end_date - n.days
    @today_hooks = @m_webhooks.where(:date_sent => this_date)
    @count_array[n] = { :this_date => this_date }
    @m_list_of_ips.each_with_index do |ip, index|
      thip = @today_hooks.where(:sending_ip => ip) # stands for "Today Hooks ip"
      @count_array[n][index] = set_sub_count_hash(thip)
    end
  end
end
Well, your problem is actually very simple. You have to remember that when you use where(condition), the query is not immediately executed against the DB.
Rails is smart enough to detect when you need a concrete result (a list, an object, or a count or #size like in your case) and keeps chaining your queries until you do. In your code, you keep chaining conditions onto the main query inside a loop (date_range). And it gets worse: you start another loop inside that one, adding conditions to each query created in the first loop.
Then you pass the query (not concrete yet; it has not been executed and has no results!) to the method set_sub_count_hash, which goes on to execute that same query many times.
Therefore you have something like:
10 (date_range) * 3 (ip list) * 8 (times the query is materialized in #set_sub_count_hash) = 240 queries
and then you have a problem.
What you want is to run the whole query at once and group it by date, ip and email provider (esp). You should have a hash structure after that, which you would pass to the #set_sub_count_hash method and do some Ruby gymnastics to get the counts you're looking for.
I imagine the query would be something like:
main_query = @m_webhooks.where('date_sent > ?', 10.days.ago.to_date)
                        .where(sending_ip: @m_list_of_ips)
Ok, now you have one query, which is nice, but I think you should split it into 4 (gmail, hotmail, yahoo and other), which gives you 4 queries (the first one, main_query, will not be executed until you ask for materialized results; don't forget it). Still, something like 100 times faster.
I think this is the result that should be grouped, mapped and passed to #set_sub_count_hash, instead of passing the raw query and calling methods on it over and over. It will be a little work to do the grouping, mapping and counting for sure, but hey, it's faster. =)
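A sketch of that split, assuming the gmail/hotmail/yahoo/other scopes from the question's code exist on MWebhook:
group_cols = ['date_sent', 'sending_ip', 'event']

# Four grouped COUNT queries, one per provider, instead of 240 single counts
counts_by_esp = {
  gmail:   main_query.gmail.group(*group_cols).count,
  hotmail: main_query.hotmail.group(*group_cols).count,
  yahoo:   main_query.yahoo.group(*group_cols).count,
  other:   main_query.other.group(*group_cols).count,
}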
In case this helps anybody else, I learned how to fill a hash with counts in a much simpler way. More importantly, this approach runs a single query (as opposed to the 240 queries that I was running before).
@count_array[esp_index][j] = MWebhook.where('date_sent > ?', start_date.to_date)
  .group('date_sent', 'sending_ip', 'event', 'esp').count
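For reference, a grouped count keys its hash by arrays of the grouped column values, so reading one cell out of the result looks roughly like this (this_date and ip are assumed from the question's loop variables):
counts = MWebhook.where('date_sent > ?', start_date.to_date)
                 .group('date_sent', 'sending_ip', 'event', 'esp')
                 .count
# Keys are [date_sent, sending_ip, event, esp] tuples:
sent_via_hotmail = counts[[this_date, ip, 'sent', 'hotmail']] || 0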

Is there a way with Rails/Postgres to get records in a random order BUT place certain values first?

I have an app that returns a bunch of records in a list in a random order with some pagination. For this, I save a seed value (so that refreshing the page will return the list in the same order again) and then use .order('random()').
However, say that out of 100 records, I have 10 records that have a preferred_level = 1 while all the other 90 records have preferred_level = 0.
Is there some way that I can place the preferred_level = 1 records first but still keep everything randomized?
For example, given [A,1],[X,0],[Y,0],[Z,0],[B,1],[C,1],[W,0], I would hope to get back something like [B,1],[A,1],[C,1],[Z,0],[X,0],[Y,0],[W,0].
Note that even the preferred_level = 1 records are randomized among themselves; they just come before all the 0 records. In theory, the same solution would also place preferred_level = 2 records before the 1 records if I were ever to add them.
------------
I had hoped it would be as intuitively simple as Model.all.order('random()').order('preferred_level DESC'), but that doesn't seem to be the case. The second order never affects anything, since it only breaks ties left by random().
Any help would be appreciated!
This got the job done for me on a very similar problem.
select * from table order by preferred_level = 1 desc, random()
or I guess the Rails way
Model.order('preferred_level = 1 desc, random()')
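Two asides, not from the answer itself: ordering on the column value generalizes to the preferred_level = 2 case the question anticipates, and newer Rails (6+) requires raw SQL in order to be wrapped in Arel.sql:
# Higher preferred_level groups first, randomized within each group.
# Arel.sql marks the fragment as intentionally raw SQL.
Model.order(Arel.sql('preferred_level DESC, random()'))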

Optimize the retrieval of a small number of records from database?

I'm running the following query:
User.where("number > ?", 5).order(&:age).first(20)
I noticed that the speed of the query was about the same whether I replaced "first(20)" with "first(200)" or even just "first". This seems to imply that all records are retrieved by the server, no matter how many records I actually want in the array. Are there any ways to possibly expedite this process?
The performance may well be similar, because in general the database is going to have to identify all of the rows that match the conditions, then order them all, then read the first n rows from the sorted set. If n is 200 then obviously it will have to return more rows to the application, but the primary driver on database performance is probably not the quantity of rows returned but the quantity of rows to be ordered.
As others state:
User.where("number > ?", 5).order(:age).limit(20)
... or to get those with the highest age ...
User.where("number > ?", 5).order(:age => :desc).limit(20)
(Rails 4 syntax)
There are occasions when the database can use an index to provide the sort order, in which case you'd likely see a much larger performance difference between 20 or 200 rows.
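For example (an illustration, not from the answer), an index on the sort column can let the database read rows already in age order and stop as soon as it has 20 matches:
# Hypothetical migration; the table and column names come from the question.
class AddIndexUsersOnAge < ActiveRecord::Migration[7.0]
  def change
    # With an index on age, `WHERE number > 5 ORDER BY age LIMIT 20` can
    # walk the index in sorted order, filter as it goes, and stop at 20
    # rows instead of sorting the whole matching set.
    add_index :users, :age
  end
end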
You can perform the query with limit:
User.where("number > ?", 5).order(:age).limit(20)
Check this Rails Guides article for more examples.
Good luck!
You can use limit since you're ordering the results:
User.where("number > ?", 5).order('age desc').limit(20)

Can I add where clauses after putting limit on a scoped query?

I have a model called Game in which I build up a scoped query.
Something like:
games = Game.scoped
games = games.team(team_name) if team_name
games = games.opponent(opponent_name) if opponent_name
total_games = games
I then calculate several subsets like:
wins = games.where("team_score > opponent_score").count
losses = games.where("opponent_score > team_score").count
Everything is great. Then I decided that I want to limit the original scope to show the last X number of games.
total_games = games.limit(10)
If there are 100 games that match what I want for total_games, and then I add .limit(10) - it gets the last 10. Great. But now calling
total_games.where("team_score > opponent_score").count
will reach back beyond the last 10, and into results that aren't part of total_games. Since adding .limit(10), I'll always get 10 total games, but also 10 wins, and 10 losses.
After typing this all out, I've realized that the cases where I want to use limit are for showing a smaller set of results - so I'll probably end up just looping through the results to calculate things like wins and losses (instead of doing separate queries as in my subsets above).
I tried this out when total_games had hundreds or thousands of results, and it's significantly slower to loop through than it is to just do separate queries for the subsets.
So, now I must know: what is the best way to limit a scoped query, and then run further queries that restrict themselves to the results returned by the original .limit(x)?
I don't think you can do what you want without separating your query into two steps: first, get 10 games from total_games and make the DB query with all:
last_10_games = total_games.limit(10).all
then selecting from the resulting array and getting the size of the result:
wins = last_10_games.select { |g| g.team_score > g.opponent_score }.count
losses = last_10_games.select { |g| g.opponent_score > g.team_score }.count
I know this is not exactly what you asked for, but I think it's probably the most straightforward solution to the problem.
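One version note: Model.scoped and the array-returning all are Rails 3 idioms. On Rails 4 and later, where scoped is gone and all returns a relation, the same two-step approach would be:
last_10_games = total_games.limit(10).to_a # materialize the 10 rows once

wins   = last_10_games.count { |g| g.team_score > g.opponent_score }
losses = last_10_games.count { |g| g.opponent_score > g.team_score }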
