I have been struggling for a while with problems along these lines: performing efficient queries in Rails. I am currently querying a model with 500,000 records and then pulling out some descriptive statistics about the results.
As an overview:
I want to pull out a number of products which match a set of criteria. I would then like to...
Count the number of records (if there aren't any, I want to suppress certain actions)
Identify the max and min prices of the matching records and calculate the number of items falling between certain ranges
As it stands, this set of commands takes far longer than I was hoping for (26,000 ms running locally on my desktop) and involves 8 or 9 ActiveRecord actions, each of which takes around 3,000 ms.
Is there something I am doing wrong that makes this so slow? Any suggestions would be fantastic.
The code in my controller is:
filteredmatchingproducts = Allproduct.select("id, product_name, price")
.where('product_name LIKE ?
OR (product_name LIKE ? AND product_name NOT LIKE ? AND product_name NOT LIKE ? AND product_name NOT LIKE ? AND product_name NOT LIKE ? AND product_name NOT LIKE ?)
OR product_name LIKE ? OR product_name LIKE ? OR product_name LIKE ? OR product_name LIKE ? OR (product_name LIKE ? AND product_name NOT LIKE ?) OR product_name LIKE ?',
'%Bike Box', '%Bike Bag%', '%Pannier%', '%Shopper%', '%Shoulder%', '%Shopping%', '%Backpack%' , '%Wheel Bag%', '%Bike sack%', '%Wheel cover%', '%Wheel case%', '%Bike case%', '%Wahoo%', '%Bicycle Travel Case%')
.order('price ASC')
@selected_products = filteredmatchingproducts.paginate(:page => params[:page])
@productsfound = filteredmatchingproducts.count
@min_price = filteredmatchingproducts.first
@max_price = filteredmatchingproducts.last
@price_range = @max_price.price - @min_price.price
@max_pricerange1 = @min_price.price + @price_range/4
@max_pricerange2 = @min_price.price + @price_range/2
@max_pricerange3 = @min_price.price + 3*@price_range/4
@max_pricerange4 = @max_price.price
if @min_price == nil
  # don't do anything - just avoid an error
else
  @restricted_products_pricerange1 = filteredmatchingproducts.select("price").where('price BETWEEN ? and ?', 0, @max_pricerange1).count
  @restricted_products_pricerange2 = filteredmatchingproducts.select("price").where('price BETWEEN ? and ?', @max_pricerange1 + 0.01, @max_pricerange2).count
  @restricted_products_pricerange3 = filteredmatchingproducts.select("price").where('price BETWEEN ? and ?', @max_pricerange2 + 0.01, @max_pricerange3).count
  @restricted_products_pricerange4 = filteredmatchingproducts.select("price").where('price BETWEEN ? and ?', @max_pricerange3 + 0.01, @max_pricerange4).count
end
EDIT
For clarity, the fundamental question is: why does each of these queries have to be performed against the large Allproduct table? Is there no way to perform the later queries on the result of the earlier ones (i.e. use filteredmatchingproducts itself rather than recalculating it for each query)? In other programming languages I am used to being able to store values and perform operations on those stored values, rather than having to work them out again before each operation - is this not the mindset in Rails?
There are quite a few things wrong with the code snippet you have shared. Most importantly, this is not a Rails-specific optimisation problem, but a database structure and optimisation issue.
You are using LIKE queries with a wildcard (%) on both sides, which results in linear search time in SQLite, since no index can be applied. Ideally, you should not be searching with LIKE at all; instead, define a product_categories table, reference it from the Allproduct table via a product_category_id column, and define an index on that column.
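For instance, a sketch of that restructuring; the table, column, and category names here are illustrative assumptions, not taken from the question (and the migration superclass syntax varies by Rails version):

# Hypothetical migration: move the "category" knowledge out of product_name.
class AddProductCategories < ActiveRecord::Migration
  def change
    create_table :product_categories do |t|
      t.string :name
    end
    add_column :allproducts, :product_category_id, :integer
    add_index  :allproducts, :product_category_id  # the index a leading-wildcard LIKE can never use
  end
end

# The filter then becomes an indexed equality lookup instead of a table scan:
category_ids = ProductCategory.where(name: ['Bike Box', 'Bike Bag']).pluck(:id)
Allproduct.where(product_category_id: category_ids).order(:price)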
For initializing the @productsfound, @min_price, and @max_price variables, you can do the following:
filteredmatchingproductlist = filteredmatchingproducts.to_a
@productsfound = filteredmatchingproductlist.count
@min_price = filteredmatchingproductlist.first
@max_price = filteredmatchingproductlist.last
This avoids triggering a separate query for each of them, since you're performing these operations on an Array instead of an ActiveRecord::Relation.
Since the results are sorted by price, you can apply good old binary search to the filteredmatchingproductlist array and calculate the counts, achieving the same result as the last four lines of your code (a sketch follows the snippet):
@restricted_products_pricerange1 = filteredmatchingproducts.select("price").where('price BETWEEN ? and ?', 0, @max_pricerange1).count
@restricted_products_pricerange2 = filteredmatchingproducts.select("price").where('price BETWEEN ? and ?', @max_pricerange1 + 0.01, @max_pricerange2).count
@restricted_products_pricerange3 = filteredmatchingproducts.select("price").where('price BETWEEN ? and ?', @max_pricerange2 + 0.01, @max_pricerange3).count
@restricted_products_pricerange4 = filteredmatchingproducts.select("price").where('price BETWEEN ? and ?', @max_pricerange3 + 0.01, @max_pricerange4).count
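A minimal sketch of that binary-search idea, assuming Ruby 2.3+ for Array#bsearch_index, non-negative prices, and the @max_pricerange* boundaries computed above:

prices = filteredmatchingproductlist.map(&:price)  # already sorted ascending

# Index of the first price strictly greater than limit (array size if none),
# which equals the count of prices <= limit.
count_up_to = ->(limit) { prices.bsearch_index { |p| p > limit } || prices.size }

@restricted_products_pricerange1 = count_up_to.call(@max_pricerange1)
@restricted_products_pricerange2 = count_up_to.call(@max_pricerange2) - count_up_to.call(@max_pricerange1)
@restricted_products_pricerange3 = count_up_to.call(@max_pricerange3) - count_up_to.call(@max_pricerange2)
@restricted_products_pricerange4 = prices.size - count_up_to.call(@max_pricerange3)

Each count is then O(log n) on the in-memory array, with no further queries.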
Finally, if you really need counts and full-text searching, it would be best to integrate a search engine such as Sphinx or Solr. Check out http://pat.github.io/thinking-sphinx/searching.html as a reference for how to implement that.
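By way of illustration, a rough Thinking Sphinx (v3-style) sketch; the index definition and options below are assumptions about your setup, not a drop-in solution:

# app/indices/allproduct_index.rb
ThinkingSphinx::Index.define :allproduct, :with => :active_record do
  indexes product_name  # full-text field
  has price             # sortable/filterable attribute
end

# Full-text match, ordered by price; Sphinx reports the match count itself:
results = Allproduct.search 'bike box', :order => 'price ASC'
results.total_entries  # matching count without an extra COUNT query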
What is the product_name field? It seems like you could use the acts-as-taggable-on gem (https://github.com/mbleigh/acts-as-taggable-on). A LIKE statement forces the database to check every single record for matches, which is quite heavy. When you have 500k records, it has to take a while.
If all you're dealing with is prices, work on an array of prices rather than an ActiveRecord::Relation. So try something like:
filteredmatchingproducts = (...).map(&:price)
And then do all operations on that array. Also, try to load large result sets in batches wherever possible, and maintain your own counts, etc. if you can; this will keep the application from chewing up all the memory at once and slowing things down (a sketch follows the link):
http://guides.rubyonrails.org/active_record_querying.html#retrieving-multiple-objects-in-batches
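For example, a minimal sketch of the batching idea. Note that find_each deliberately ignores the relation's ORDER BY, so the running tallies below are written so as not to depend on sort order:

count = 0
min_price = nil
max_price = nil

filteredmatchingproducts.find_each(batch_size: 1000) do |product|
  count += 1
  min_price = product.price if min_price.nil? || product.price < min_price
  max_price = product.price if max_price.nil? || product.price > max_price
end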
The reason it's executing so many queries is that you're asking it to execute a lot of queries. (All of the LIKEs also tend to make things slow.) Here's your code with a comment added before each query that will be made (8 total).
filteredmatchingproducts = Allproduct.select("id, product_name, price")
.where('product_name LIKE ?
OR (product_name LIKE ? AND product_name NOT LIKE ? AND product_name NOT LIKE ? AND product_name NOT LIKE ? AND product_name NOT LIKE ? AND product_name NOT LIKE ?)
OR product_name LIKE ? OR product_name LIKE ? OR product_name LIKE ? OR product_name LIKE ? OR (product_name LIKE ? AND product_name NOT LIKE ?) OR product_name LIKE ?',
'%Bike Box', '%Bike Bag%', '%Pannier%', '%Shopper%', '%Shoulder%', '%Shopping%', '%Backpack%' , '%Wheel Bag%', '%Bike sack%', '%Wheel cover%', '%Wheel case%', '%Bike case%', '%Wahoo%', '%Bicycle Travel Case%')
.order('price ASC')
#!!!! this is a query "select ... offset x, limit y"
@selected_products = filteredmatchingproducts.paginate(:page => params[:page])
#!!!! this is a query "select count ..."
@productsfound = filteredmatchingproducts.count
#!!!! this is a query "select ... order price asc, limit 1"
@min_price = filteredmatchingproducts.first
#!!!! this is a query "select ... order price desc, limit 1"
@max_price = filteredmatchingproducts.last
@price_range = @max_price.price - @min_price.price
@max_pricerange1 = @min_price.price + @price_range/4
@max_pricerange2 = @min_price.price + @price_range/2
@max_pricerange3 = @min_price.price + 3*@price_range/4
@max_pricerange4 = @max_price.price
if @min_price == nil
  # don't do anything - just avoid an error
else
  #!!!! this is a query "select ... where price BETWEEN X and Y"
  @restricted_products_pricerange1 = filteredmatchingproducts.select("price").where('price BETWEEN ? and ?', 0, @max_pricerange1).count
  #!!!! this is a query "select ... where price BETWEEN X and Y"
  @restricted_products_pricerange2 = filteredmatchingproducts.select("price").where('price BETWEEN ? and ?', @max_pricerange1 + 0.01, @max_pricerange2).count
  #!!!! this is a query "select ... where price BETWEEN X and Y"
  @restricted_products_pricerange3 = filteredmatchingproducts.select("price").where('price BETWEEN ? and ?', @max_pricerange2 + 0.01, @max_pricerange3).count
  #!!!! this is a query "select ... where price BETWEEN X and Y"
  @restricted_products_pricerange4 = filteredmatchingproducts.select("price").where('price BETWEEN ? and ?', @max_pricerange3 + 0.01, @max_pricerange4).count
end
Related
I have a Schema like this:
User(:id)
Event(:id, :start_date, :end_date, :duration, :recurrence_pattern)
EventAssignment(:user_id, :event_id, :date)
recurrence_pattern is a rrule string (see https://icalendar.org/rrule-tool.html, https://github.com/square/ruby-rrule)
date is a date formatted like 'YYYY-MM-DD'
start_date and end_date are timestamps like 'YYYY-MM-DDTHH:MM:SS'
I want to find all users that have at least one assignment overlapping the window defined by two timestamps, say from and to.
So I wrote a bit of Postgres and Arel:
assignment_subquery = EventAssignment.joins(:event).where(
'"user_id" = users.id AND
(?) <= "event_assignments".date + (to_char("events".start_date, \'HH24:MI\'))::time + make_interval(mins => "events".duration) AND
(event_assignments.date) + (to_char(events.start_date, \'HH24:MI\'))::time <= (?)',
from, to).arel.exists
User.where(assignment_subquery)
Edit: some more Postgres (per @max):
assignment_subquery = EventAssignment.joins(:event).where(
'"user_id" = users.id AND
("event_assignments".date + (to_char("events".start_date, \'HH24:MI\'))::time + make_interval(mins => "events".duration),
(event_assignments.date) + (to_char(events.start_date, \'HH24:MI\'))::time)
OVERLAPS ((?), (?))'
, from, to).arel.exists
This works fine.
My question is: is there a better, more Rails way to do this?
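One possible tidy-up, offered as a sketch rather than a confirmed best practice: keep the raw SQL (ActiveRecord has no built-in OVERLAPS support) but package both sides as scopes so call sites read more idiomatically. The scope names here are invented for illustration:

class EventAssignment < ApplicationRecord
  belongs_to :event

  # Assignments whose computed (start, end) span overlaps the given window.
  scope :overlapping, ->(from, to) {
    joins(:event).where(
      "(event_assignments.date + (to_char(events.start_date, 'HH24:MI'))::time,
        event_assignments.date + (to_char(events.start_date, 'HH24:MI'))::time
          + make_interval(mins => events.duration))
       OVERLAPS ((?), (?))", from, to)
  }
end

class User < ApplicationRecord
  scope :with_overlapping_assignment, ->(from, to) {
    where(EventAssignment.overlapping(from, to)
                         .where('event_assignments.user_id = users.id')
                         .arel.exists)
  }
end

User.with_overlapping_assignment(from, to)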
I have a method that ranks users' response rates in our system, called ranked_users:
def ranked_users
User.joins(:responds).group(:id).select(
"users.*, SUM(CASE WHEN answers.response != 3 THEN 1 ELSE 0 END ) avg, RANK () OVER (
ORDER BY SUM(CASE WHEN answers.response != 3 THEN 1 ELSE 0 END ) DESC, CASE WHEN users.id = '#{
current_user.id
}' THEN 1 ELSE 0 END DESC
) rank"
)
.where('users.active = true')
.where('answers.created_at BETWEEN ? AND ?', Time.now - 12.months, Time.now)
end
result = ranked_users
I then take the top three with top_3 = ranked_users.limit(3)
If the user is not in the top 3, I want to append them with their rank to the list:
user_rank = result.find_by(id: current_user.id)
Whenever I call user_rank.rank it returns 1. I know this is because the find_by clause is applied first and the ranking is computed afterwards. Is there a way to make the find_by run only against the result of the first query? I tried result.load.find_by(...) but had the same issue. I could convert the entire result into an array, but I want the solution to be highly scalable.
If you expect lots of users with lots of answers and high load on your rating system, you can create a materialized view for the ranking query, with columns like (user_id, avg, rank), and refresh it periodically instead of calculating ranks every time (say, a few times per day or even less often). There's the scenic gem for this.
You can even put indexes on rank and user_id in the view, and your query becomes two simple, fast reads (a sketch follows).
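A sketch of that setup with scenic; the view name, columns, and refresh policy are assumptions for illustration (the ranking SELECT itself would live in db/views/user_rankings_v01.sql):

# migration
class CreateUserRankings < ActiveRecord::Migration[6.0]
  def change
    create_view :user_rankings, materialized: true
    add_index :user_rankings, :user_id, unique: true  # unique index enables concurrent refresh
    add_index :user_rankings, :rank
  end
end

# app/models/user_ranking.rb
class UserRanking < ApplicationRecord
  belongs_to :user

  def self.refresh
    Scenic.database.refresh_materialized_view(table_name, concurrently: true, cascade: false)
  end
end

# The two fast reads:
UserRanking.order(:rank).limit(3)              # top three
UserRanking.find_by(user_id: current_user.id)  # current user's precomputed rank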
Noob here. I'm trying to query my SQLite database for entries that have been made in the last 7 days and then return them.
This is my current attempt:
user.rb
def featuredfeed
  @x = []
  @s = []
  Recipe.all.each do |y|
    @x << "SELECT id FROM recipes WHERE id = #{y.id} AND created_at > datetime('now','-7 days')"
  end
  Recipe.all.each do |d|
    @t = "SELECT id FROM recipes where id = #{d.id}"
    @x.each do |p|
      if @t = p
        @s << d
      end
    end
  end
  @s
end
This code returns each recipe 6 times (the total number of objects in the DB), regardless of how old it is.
@x should only contain 3 ids:
@x = [13, 15, 16]
If I run
SELECT id FROM recipes WHERE id = 13 AND created_at > datetime('now','-7 days')
1 row is returned, with id 13.
But if I look for an id that is more than 7 days old, such as 12:
SELECT id FROM recipes WHERE id = 12 AND created_at > datetime('now','-7 days')
0 rows returned.
I'm probably overcomplicating this, but I've spent way too long on it at this point.
The return type has to be Recipe.
To return objects created within the last 7 days, just use a where clause:
Recipe.where('created_at >= ?', 1.week.ago)
Check out the docs for more info on querying the db.
Edit, following the comments:
Since you are using the acts_as_votable gem, add vote caching so that filtering by vote score is straightforward:
Recipe.where('cached_votes_total >= ?', 10)
Ruby is expressive. I would take the opportunity to use a scope. With Active Record scopes, this query can be represented in a meaningful way within your code, using syntactic sugar.
scope :from_recent_week, -> { where('created_at >= ?', Time.zone.now - 1.week) }
This allows you to chain your scoped query and enhance readability:
Recipe.from_recent_week.each do
something_more_meaningful_than_a_SQL_query
end
It looks to me like your problem is database abstraction, something Rails does for you. If you are looking for a method that returns the three ids you indicate, I think you want this:
@x = Recipe.from_recent_week.map(&:id)
No need for any of the other fluff; no declarations necessary. I would also encourage you to use a more descriptive variable name than @x, something like:
@ids_from_recent_week = Recipe.from_recent_week.map(&:id)
I have the following query within my model:
Post.where("created_utc > ? AND lower(category) = ?", 0, 'videos').group(:domain).order('count_all desc').page(1).per(25).count
I'm using the kaminari gem for pagination, but the problem is this: the query returns what appears to be a sorted hash, and I have no way of knowing what the total result count is.
If you prefer to leave kaminari out of consideration, you can look at the following query instead:
Post.where("created_utc > ? AND lower(category) = ?", min_time, subreddit.downcase).group(:domain).order('count_all desc').limit(limit).offset(start).count
Regardless, I have no way of figuring out the total number of results. How can I resolve this? Is there some way to figure out what the total result set size would be without the limit?
What you want is .length instead of .count, like this:
Post.where("created_utc > ? AND lower(category) = ?", 0, 'videos').group(:domain).order('count_all desc').page(1).per(25).length
.count performs a SQL COUNT
.length calculates the length of the resultant array
Courtesy of @BenHawker at https://stackoverflow.com/a/34541020/380607.
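To illustrate the difference on a simple relation (model names as in the question):

relation = Post.where("created_utc > ?", 0)
relation.count   # runs SELECT COUNT(*) ... and returns the number
relation.length  # loads the records into memory, then returns the Array size

With the GROUP BY in place, .count instead returns a hash of per-group counts, which is why .length on the loaded result gives the number of rows you actually want here.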
Aside from Ruby-based methods, I would also benchmark running a separate database-side count:
posts = Post.where("created_utc > ? AND lower(category) = ?", min_time, subreddit.downcase)
domain_count = posts.count("distinct domain")
result = posts.group(:domain).order('count_all desc').limit(limit).offset(start).count
Kaminari provides a total_count method. https://github.com/kaminari/kaminari/blob/master/README.md#the-per-scope
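For instance, a small sketch of that approach (total_count is Kaminari's documented method; the query mirrors the one from the question):

paged = Post.where("created_utc > ? AND lower(category) = ?", 0, 'videos')
            .group(:domain).order('count_all desc').page(1).per(25)
paged.total_count  # total number of results across all pages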
I would split it to multiple lines:
# 1. get the results
posts = Post.where("created_utc > ? AND lower(category) = ?", 0, 'videos').group(:domain).order('count_all desc')
# 2. count the total size
total_size = posts.count
# 3. use kaminari for pagination
posts_paged = posts.page(1).per(25)
So let's say I have the following in a Post model: each record has a num field holding a random number, plus a user_id.
So I make this:
#posts = Post.where(:user_id => 1)
Now let's say I want to limit my @posts array's records to those whose num values sum to 50 or more (with only the final record going over the limit). So it would be adding post.num + post2.num + post3.num etc., until the total reaches at least 50.
Is there a way to do this?
I would say to just grab all of the records like you already are:
#posts = Post.where(:user_id => 1)
and then use Ruby to do the rest:
@posts = @posts.to_a  # work on an Array so elements can be removed

sum = 0
until sum >= 50 || @posts.empty?
  post = @posts.shift  # remove the first post from the array
  sum += post.num
end
There's probably a more elegant way, but this will work. It removes posts in order until the sum is greater than or equal to 50; @posts is then left with the rest of the records. Hopefully I understood your question.
You need to use PostgreSQL window functions.
This gives you the rows whose running sum stays below 50 (note the subquery: a window alias cannot be referenced directly in WHERE):
SELECT t.id, t.num_sum
FROM (
  SELECT a.id, SUM(a.num) OVER (ORDER BY a.id) AS num_sum
  FROM posts a
  WHERE a.user_id = 1
) t
WHERE t.num_sum < 50
But your case is trickier, as you want to go over the limit by one row:
SELECT t.id, t.num_sum
FROM (
  SELECT a.id, SUM(a.num) OVER (ORDER BY a.id) AS num_sum
  FROM posts a
  WHERE a.user_id = 1
) t
WHERE t.num_sum <= (
  SELECT MIN(u.num_sum)
  FROM (
    SELECT SUM(b.num) OVER (ORDER BY b.id) AS num_sum
    FROM posts b
    WHERE b.user_id = 1
  ) u
  WHERE u.num_sum >= 50
)
You then have to convert this SQL to Arel (or see the sketch below for a simpler way to run it).
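If a full Arel translation feels heavy, one lower-ceremony option, sketched here, is to run the SQL through ActiveRecord's standard find_by_sql:

running_sum_sql = <<-SQL
  SELECT t.id, t.num_sum
  FROM (
    SELECT a.id, SUM(a.num) OVER (ORDER BY a.id) AS num_sum
    FROM posts a
    WHERE a.user_id = ?
  ) t
  WHERE t.num_sum < 50
SQL

# Returns Post objects carrying an extra num_sum attribute.
Post.find_by_sql([running_sum_sql, 1])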