Queries with include in Rails - ruby-on-rails

I have the following problem: I need to run a large query on the professionals table, and I need to optimize it because for each professional I also query the associated tables.
But I'm having trouble with two of those associations: comments and tariffs.
Comments:
I need to load 3 comments for each professional. I tried:
@professionals.includes(:comments).where(:comments => { type: 0 }).last(3)
The problem is that this only returns 3 professionals, not what I need: all the professionals, each with only three comments whose type equals zero.
And when I try:
@professionals.includes(:comments).where(:comments => { type: 0 })
The result is only the professionals that have comments (with all of their comments), when I need all the professionals, with or without comments. If a professional has comments, I only need the last three whose type equals zero.
Tariffs:
With tariffs I have a similar problem; in this case I need the last 4 tariffs for each professional. I tried:
@professionals.includes(:tariffs).last(4)
But that only returns the last 4 professionals.
Models:
class Comment < ActiveRecord::Base
  belongs_to :client
  belongs_to :professional
end

class Professional < ActiveRecord::Base
  has_many :comments
end

You can't use limit on the joined table in ActiveRecord. The limit is applied to the first relation, which in this case happens to be @professionals.
You have a few choices:
Preload all comments for each professional and limit them on output (reduces the number of queries needed but increases memory consumption since you are potentially preloading a lot of AR objects).
Lazy load the required number of comments (increases the number of queries by n+1, but reduces the potential memory consumption).
Write a custom query with raw SQL.
If you preload everything, then you don't have to change much. Just limit the number of comments while iterating through each professional.
@professionals.each do |professional|
  # first(3) works on the already-loaded collection; limit(3) would issue a new query
  professional.comments.first(3)
end
If you lazy load only what you need, then you would apply the limit scope to the comments relation.
@professionals.all
@professionals.each do |professional|
  professional.comments.where(type: 0).limit(3)
end
Writing a custom query is a bit more complex, and you may find it less performant depending on the number of joins you have to make and how well indexed your tables are; a sketch is below.
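As a rough illustration of the raw-SQL route, a window function can pick the latest three type-0 comments per professional in one query. This is only a sketch, not code from the answer: it assumes PostgreSQL or MySQL 8+, a comments.professional_id foreign key, ordering by created_at, and that Comment does not treat the integer type column as an STI column.
latest_comments = Comment.find_by_sql(<<-SQL)
  SELECT * FROM (
    SELECT comments.*,
           ROW_NUMBER() OVER (
             PARTITION BY comments.professional_id
             ORDER BY comments.created_at DESC
           ) AS rn
    FROM comments
    WHERE comments.type = 0
  ) ranked
  WHERE ranked.rn <= 3
SQL

# Group in Ruby so each professional's comments are a simple hash lookup while rendering
comments_by_professional = latest_comments.group_by(&:professional_id)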
I suggest you take approach two, and use query and fragment caching to improve performance. For example:
- cache @professionals do
  - @professionals.each do |professional|
    - cache professional do
      = professional.name
This approach will hit the database the first time, but after subsequent loads comments will be read from the cache, avoiding the DB hit. You can read more about caching in the Rails Guides.
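As a complement to fragment caching, the comments query itself can be memoized with the low-level cache. The helper below is only a sketch: the method name, cache key, and expiry are arbitrary choices, while Rails.cache.fetch is the standard low-level caching API.
def recent_comments_for(professional)
  # Cache the three most recent type-0 comments per professional for a short time
  Rails.cache.fetch([professional, 'recent_comments'], expires_in: 10.minutes) do
    professional.comments.where(type: 0).order(created_at: :desc).limit(3).to_a
  end
end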

Related

How to deal with memory leak in Ruby/Rails

I'm developing a Rails application that deals with huge amounts of
data, and it halts because it uses all the memory on my computer due to
a memory leak (allocated objects that are never released).
In my application, data is organized in a hierarchical way, as a tree,
where each node of level "X" contains the sum of data of level
"X+1". For example if the data of level "X+1" contains the amount of
people in cities, level "X" contains the amount of people in
states. In this way, level "X"'s data is obtained by summing up the
amount of data in level "X+1" (in this case, people).
For the sake of this question, consider a tree with four levels:
Country, State, City and Neighbourhood, and that each level is mapped
to an ActiveRecord table (countries, states, cities, neighbourhoods).
Data is read from a CSV file that fills the leaves of the tree, that is,
the neighbourhoods table.
After that, data flows from the bottom (neighbourhoods) to the top (countries) in the following sequence:
1) Neighbourhoods data is summed to Cities;
2) after step 1 is completed, Cities data is summed to States;
3) after step 2 is completed, States data is summed to Country;
The schematic code I'm using is as follows:
1 cities = City.all
2 cities.each do |city|
3   city.data = 0
4   city.neighbourhoods.each do |neighbourhood|
5     city.data = city.data + neighbourhood.data
6   end
7   city.save
8 end
The lowest level of the tree contains 3.8M records. Each time lines
2-8 are executed, one city is summed up; after line 8 runs, that
subtree is no longer needed, but it is never released (the memory
leak). After summing 50% of the cities, all 8 GB of my RAM are gone.
My question is: what can I do? Buying better hardware won't help,
since I'm working with a "small" prototype.
I know one way to make it work: restart the application for each City,
but I hope someone has a better idea. The "simplest" option would be to
force the garbage collector to free specific objects, but there seems
to be no way to do that
(https://www.ruby-forum.com/t/how-do-i-force-ruby-to-release-memory/195515).
From the following articles I understood that the developer should
organize the data in a way that "suggests" to the garbage collector what
should be freed. Maybe another approach will do the trick, but the only
alternative I see is a depth-first traversal instead of the reversed
breadth-first one I'm using, and I don't see why that would help.
What I read so far:
https://stackify.com/how-does-ruby-garbage-collection-work-a-simple-tutorial/
https://www.toptal.com/ruby/hunting-ruby-memory-issues
https://scoutapm.com/blog/ruby-garbage-collection
https://scoutapm.com/blog/manage-ruby-memory-usage
Thanks
This isn't really a memory leak. You're just indiscriminately loading data off the table, which will exhaust the available memory.
The solution is to load the data from the database in batches:
City.find_each do |city|
  city.update(data: city.neighbourhoods.sum(&:data))
end
If neighbourhoods.data is a simple integer you don't need to fetch the records in the first place:
City.update_all(
  'data = (SELECT SUM(neighbourhoods.data) FROM neighbourhoods WHERE neighbourhoods.city_id = cities.id)'
)
This will be an order of magnitude faster and have a trivial memory consumption as all the work is done in the database.
If you REALLY want to load a bunch of records into Rails, then make sure to select aggregates instead of instantiating all those nested records:
City.left_joins(:neighbourhoods)
    .group(:id)
    .select(:id, 'SUM(neighbourhoods.data) AS n_data')
    .find_each { |c| c.update(data: c.n_data) }
Depending on how your model associations are set up, you should be able to take advantage of preloading.
For example:
class City < ApplicationRecord
  has_many :neighborhoods
end

class Neighborhood < ApplicationRecord
  belongs_to :city
  belongs_to :state
end

class State < ApplicationRecord
  belongs_to :country
  has_many :neighborhoods
end

class Country < ApplicationRecord
  has_many :states
end

cities = City.all.includes(neighborhoods: { state: :country })
cities.each do |city|
  ...
end
You don't need Rails at all; pure SQL is good enough for what you're trying to do:
City.connection.execute(<<-SQL.squish)
  UPDATE cities SET data = (
    SELECT SUM(neighbourhoods.data)
    FROM neighbourhoods
    WHERE neighbourhoods.city_id = cities.id
  )
SQL

Rails subquery reduce amount of raw SQL

I have two ActiveRecord models: Post and Vote. I want to make a simple query:
SELECT *,
(SELECT COUNT(*)
FROM votes
WHERE votes.id = posts.id) AS vote_count
FROM posts
I am wondering what's the best way to do it in activerecord DSL. My goal is to minimize the amount of SQL I have to write.
I can do Post.select("(SELECT COUNT(*) FROM votes WHERE votes.id = posts.id) AS vote_count")
Two problems with this:
Raw SQL. Anyway to write this in DSL?
This returns only the vote_count attribute, not "*" plus vote_count. I can append .select("*"), but I would be repeating that every time. Is there a better/DRYer way to do this?
Thanks
Well, if you want to reduce the amount of SQL, you can split that query into two smaller ones and execute them separately. For instance, the vote-counting part could be extracted into this query:
SELECT votes.id, COUNT(*) FROM votes GROUP BY votes.id;
which you may write with ActiveRecord methods as:
Vote.group(:id).count
You can store the result for later use and access it directly from the Post model; for example, you may define #votes_count as a method:
class Post
  def votes_count
    @@votes_count_cache ||= Vote.group(:id).count
    @@votes_count_cache[id] || 0
  end
end
(Of course every use of cache raises a question about invalidating or updating it, but this is out of the scope of this topic.)
But I strongly encourage you to consider yet another approach.
I believe writing complicated queries like yours with ActiveRecord methods (even if it were possible) or splitting queries into two as I proposed earlier are both bad ideas. They result in extremely cluttered code, far less readable than raw SQL. Instead, I suggest introducing query objects. IMO there is nothing wrong with using raw, complicated SQL when it's hidden behind a nice interface. See: M. Fowler's P of EAA and Brynary's post on the Code Climate Blog.
How about doing this with no additional SQL at all - consider using the Rails counter_cache feature.
If you add an integer votes_count column to the posts table, you can get Rails to automatically increment and decrement that counter by changing the belongs_to declaration in Vote to:
belongs_to :post, counter_cache: true
Rails will then keep each Post updated with the number of votes it has. That way the count is already calculated and no sub-query is needed.
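If you go that route, you also need the column itself and a one-time backfill for existing posts. A rough sketch follows; the migration class name is illustrative, while reset_counters is the standard Rails helper for recomputing a counter cache.
class AddVotesCountToPosts < ActiveRecord::Migration
  def change
    add_column :posts, :votes_count, :integer, default: 0, null: false
  end
end

# One-off backfill for existing records, e.g. from a rake task or the console
Post.find_each { |post| Post.reset_counters(post.id, :votes) }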
Maybe you can create a MySQL view and map it to a new AR model. It works much like a table; you just need to point the model at it with set_table_name "your_view_name". At the DB level it may work faster, and it will be recalculated automatically.
Just stumbled upon postgres_ext gem which adds support for Common Table Expressions in Arel and ActiveRecord which is exactly what you asked. Gem is not for SQLite, but perhaps some portions could be extracted or serve as examples.

Rails, data structure and performance

Let's say I have a rails app with 3 tables, one for questions, one for options (possible answers to this question), and one for votes.
Currently, when requesting the statistics for a given question, I have to make a SQL query for each option, which looks in the "votes" table (around 1.5 million entries) and counts the number of times that option has been selected. It's slow and takes 4-5 seconds.
I was thinking of adding a column directly to the question table which would store the statistics and update them each time someone votes. Is that good practice? It seems redundant with the information that is already in the votes table, only it would be faster to load.
Or maybe I should create another table which would save these statistics for each question ?
Thanks for your advice!
Rails offers a feature called counter_cache which will serve your purpose.
Add the counter_cache option to the Vote model:
class Vote < AR::Base
  belongs_to :question, :counter_cache => true
end
and add the following migration:
add_column :questions, :votes_count, :integer, :default => 0
This will increment the votes_count field in the questions table for every new record in the votes table.
For more info: RailsCast
It would be a wise decision; ActiveRecord::CounterCache is made for exactly that purpose.
Also, there's a Railscast for that
You can probably do a "clever" SQL query using GROUP BY that gives you the expected result in one query, as sketched below. If your query is that slow, you'll probably also need to add some indexes to your tables.
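Something along these lines would collapse the per-option counting into a single GROUP BY query. It's only a sketch: the option_id foreign key on votes and the options association on Question are assumptions about your schema.
# Hypothetical schema: votes.option_id references options, and Question has_many :options
counts = Vote.where(option_id: question.options.select(:id))
             .group(:option_id)
             .count
# => { option_id => number_of_votes, ... } computed in a single query
question.options.each do |option|
  puts "#{option.id}: #{counts.fetch(option.id, 0)}"
end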

ActiveRecord: size vs count

In Rails, you can find the number of records using both Model.size and Model.count. If you're dealing with more complex queries, is there any advantage to using one method over the other? How are they different?
For instance, I have users with photos. If I want to show a table of users and how many photos they have, will running many instances of user.photos.size be faster or slower than user.photos.count?
Thanks!
You should read that, it's still valid.
You'll adapt the function you use depending on your needs.
Basically:
if you have already loaded all the entries, say with User.all, then you should use length to avoid another DB query
if you haven't loaded anything yet, use count to run a count query on your DB
if you don't want to bother with these considerations, use size, which will adapt
As the other answers state:
count will perform an SQL COUNT query
length will calculate the length of the resulting array
size will try to pick the most appropriate of the two to avoid excessive queries
But there is one more thing. We noticed a case where size acts differently from count/length altogether, and I thought I'd share it since it is rare enough to be overlooked.
If you use a :counter_cache on a has_many association, size will use the cached count directly, and not make an extra query at all.
class Image < ActiveRecord::Base
  belongs_to :product, counter_cache: true
end

class Product < ActiveRecord::Base
  has_many :images
end
> product = Product.first # query, load product into memory
> product.images.size # no query, reads the :images_count column
> product.images.count # query, SQL COUNT
> product.images.length # query, loads images into memory
This behaviour is documented in the Rails Guides, but I either missed it the first time or forgot about it.
tl;dr
If you know you won't be needing the data use count.
If you know you will use or have used the data use length.
If you don't know where it is used, or the speed difference is negligible, use size...
count
Resolves to sending a SELECT COUNT(*)... query to the DB. The way to go if you don't need the data, just the count.
Example: count of new messages, total elements when only a page is going to be displayed, etc.
length
Loads the required data, i.e. runs the query as needed, and then just counts it. The way to go if you are using the data.
Example: Summary of a fully loaded table, titles of displayed data, etc.
size
It checks whether the data is already loaded (i.e. already in Rails); if so, it just counts it, otherwise it calls count (plus the pitfalls already mentioned in other answers).
def size
  loaded? ? @records.length : count(:all)
end
What's the problem?
That you might hit the DB twice if you don't do things in the right order (e.g. if you render the number of elements above a rendered table, there will effectively be 2 calls sent to the DB).
Sometimes size "picks the wrong one" and returns a hash (which is what count would do)
In that case, use length to get an integer instead of a hash.
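For illustration, on a grouped relation (the model and column below reuse the photos example from the question, purely hypothetically):
Photo.group(:user_id).count  # => { 1 => 10, 2 => 3, ... } a hash, not an integer
Photo.group(:user_id).size   # => the same hash when nothing is loaded, since size delegates to count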
The following strategies all make a call to the database to perform a COUNT(*) query.
Model.count
Model.all.size
records = Model.all
records.count
The following is not as efficient, as it loads all the records from the database into Ruby and then counts the size of the collection.
records = Model.all
records.size
If your models have associations and you want to find the number of belonging objects (e.g. @customer.orders.size), you can avoid database queries (disk reads). Use a counter cache and Rails will keep the cache value up to date and return that value in response to the size method.
I recommend using the size method.
class Customer < ActiveRecord::Base
  has_many :customer_activities
end

class CustomerActivity < ActiveRecord::Base
  belongs_to :customer, counter_cache: true
end
Consider these two models. The customer has many customer activities.
If you use a :counter_cache on a has_many association, size will use the cached count directly, and not make an extra query at all.
Consider one example:
In my database, one customer has 20,000 customer activities, and I counted that customer's activities with each of the count, length, and size methods. Below is the benchmark report for all three.
             user     system      total        real
Count:   0.000000   0.000000   0.000000 (  0.006105)
Size:    0.010000   0.000000   0.010000 (  0.003797)
Length:  0.030000   0.000000   0.030000 (  0.026481)
So I found that, with :counter_cache, size is the best option for getting the number of records.
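For reference, a report like the one above can be produced with Ruby's Benchmark module. The snippet below is a sketch of that setup, not the exact script used for the numbers shown.
require 'benchmark'

customer = Customer.first  # assumes this is the customer with ~20,000 activities
Benchmark.bm(8) do |x|
  x.report('Count:')  { customer.customer_activities.count }
  x.report('Size:')   { customer.customer_activities.size }
  x.report('Length:') { customer.customer_activities.length }
end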
Here's a flowchart to simplify your decision-making process. Hope it helps.
Source: Difference Between the Length, Size, and Count Methods in Rails

A database design for variable column names

I have a situation that involves Companies, Projects, and Employees who write Reports on Projects.
A Company owns many projects, many reports, and many employees.
One report is written by one employee for one of the company's projects.
Companies each want different things in a report. Let's say one company wants to know about project performance and speed, while another wants to know about cost-effectiveness. There are 5-15 criteria, set differently by each company, which ALL apply to all of that company's project reports.
I was thinking about different ways to do this, but my current stalemate is this:
To the company table, add a text field criteria, which contains an array of the criteria desired in order.
In the report table, have a company_id and columns criterion1, criterion2, etc.
I am completely aware that this is typically considered horrible database design - inelegant and inflexible. So, I need your help! How can I build this better?
Conclusion
I decided to go with the serialized option in my case, for these reasons:
My requirements for the criteria are simple - no searching or sorting will be required of the reports once they are submitted by each employee.
I wanted to minimize database load - where these are going to be implemented, there is already a large page with overhead.
I want to avoid complicating my database structure for what I believe is a relatively simple need.
CouchDB and Mongo are not currently in my repertoire so I'll save them for a more needy day.
This would be a great opportunity to use NoSQL! Seems like the textbook use-case to me. So head over to CouchDB or Mongo and start hacking.
With conventional DBs you are slightly caught in the problem of how much to normalize your data:
A sort of "good" way (meaning very normalized) would look something like this:
class Company < AR::Base
  has_many :reports
  has_many :criteria
end

class Report < AR::Base
  belongs_to :company
  has_many :criteria_values
  has_many :criteria, :through => :criteria_values
end

class Criteria < AR::Base # should be Criterion but whatever
  belongs_to :company
  has_many :criteria_values
  # one attribute 'name' (or 'type' and you can mess with STI)
end

class CriteriaValues < AR::Base
  belongs_to :report
  belongs_to :criteria
  # one attribute 'value'
end
This turns something very simple and fast in NoSQL into a triple or quadruple join in SQL, and you end up with many models that do pretty much nothing.
Another way is to denormalize:
class Company < AR::Base
  has_many :reports
  serialize :criteria
end

class Report < AR::Base
  belongs_to :company
  serialize :criteria_values

  def criteria
    self.company.criteria
  end

  # custom code here to validate that criteria_values correspond to criteria etc.
end
Related to that, a rather clever way of serializing at least the criteria (and maybe the values too, if they were all boolean) is to use bit fields. This basically gives you more or less easy migrations (hard to delete and modify, but easy to add) and searchability without any overhead.
A good plugin that implements this is Flag Shih Tzu, which I've used on a few projects and can recommend.
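As a sketch of what that looks like (the flag names are made up for this question; the has_flags API is taken from the flag_shih_tzu README, so double-check it against the version you install):
class Report < ActiveRecord::Base
  include FlagShihTzu

  # All boolean criteria packed into a single integer column (named :flags here)
  has_flags 1 => :performance,
            2 => :speed,
            3 => :cost_effectiveness,
            :column => 'flags'
end

report.performance = true  # sets that bit on the flags column
Report.performance         # named scope for reports with the bit set (per the gem's README)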
Variable columns (e.g. crit1, crit2, etc.)
I'd strongly advise against it. You don't get much benefit (it's still not very searchable since you don't know in which column your info is) and it leads to maintainability nightmares. Imagine your db gets to a few million records and suddenly someone needs 16 criteria. What could have been a complete no-issue is suddenly a migration that adds a completely useless field to millions of records.
Another problem is that a lot of the ActiveRecord magic doesn't work with this - you'll have to figure out what crit1 means by yourself - and if you want to add validations on these fields, that adds a lot of pointless work.
So to summarize: Have a look at Mongo or CouchDB and if that seems impractical, go ahead and save your stuff serialized. If you need to do complex validation and don't care too much about DB load then normalize away and take option 1.
Well, when you say "To the company table, add a text field criteria, which contains an array of the criteria desired in order", that smells like the company table wants to be normalized: you might break each criterion out into one of 15 columns called "criterion1", ..., "criterion15", where any or all columns can default to null.
To me, you are on the right track with your report table. Each row in that table might represent one report; and might have corresponding columns "criterion1",...,"criterion15", as you say, where each cell says how well the company did on that column's criterion. There will be multiple reports per company, so you'll need a date (or report-number or similar) column in the report table. Then the date plus the company id can be a composite key; and the company id can be a non-unique index. As can the report date/number/some-identifier. And don't forget a column for the reporting-employee id.
Any and every criterion column in the report table can be null, meaning (maybe) that the employee did not report on this criterion; or that this criterion (column) did not apply in this report (row).
It seems like that would work fine. I don't see that you ever need to do a join. It looks perfectly straightforward, at least to these naive and ignorant eyes.
Create a criteria table that lists the criteria for each company (company 1 .. * criteria).
Then, create a report_criteria table (report 1 .. * report_criteria) that lists the criteria for that specific report based on the criteria table (criteria 1 .. * report_criteria).
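A rough migration sketch for that layout (table and column names are illustrative, not from the question):
create_table :criteria do |t|
  t.references :company
  t.string     :name       # e.g. "performance", "speed", "cost-effectiveness"
end

create_table :report_criteria do |t|
  t.references :report
  t.references :criterion  # points at a row in the criteria table
  t.string     :value      # the employee's answer for this criterion on this report
end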
