How to deal with a memory leak in Ruby/Rails

I'm developing a Rails application that deals with huge amounts of
data, and it halts because it exhausts all the memory on my computer due to a
memory leak (allocated objects that are never released).
In my application, data is organized hierarchically, as a tree,
where each node of level "X" contains the sum of the data of level
"X+1". For example, if the data of level "X+1" contains the number of
people in cities, level "X" contains the number of people in
states. In this way, level "X"'s data is obtained by summing up the
data in level "X+1" (in this case, people).
For the sake of this question, consider a tree with four levels:
Country, State, City and Neighbourhood, where each level is mapped
to an ActiveRecord table (countries, states, cities, neighbourhoods).
Data is read from a CSV file that fills the leaves of the tree, that is,
the neighbourhoods table.
After that, data flows from bottom (neighbourhoods) to top (countries) in the following sequence:
1) Neighbourhood data is summed into Cities;
2) after step 1 is completed, City data is summed into States;
3) after step 2 is completed, State data is summed into Countries.
The schematic code I'm using is as follows:
1 cities = City.all
2 cities.each do |city|
3   city.data = 0
4   city.neighbourhoods.each do |neighbourhood|
5     city.data = city.data + neighbourhood.data
6   end
7   city.save
8 end
The lowest level of the tree contains 3.8M records. Each time lines
2-8 are executed, one city is summed up, and after line 8 is executed
that subtree is no longer needed, but it is never released (memory
leak). After summing 50% of the cities, all 8 GB of my RAM are gone.
My question is: what can I do? Buying better hardware won't
do, since I'm working with a "small" prototype.
I know a way to make it work: restart the application for each City,
but I hope someone has a better idea. The "simplest" would be to force
the garbage collector to free specific objects, but there seems to be no way
to do that
(https://www.ruby-forum.com/t/how-do-i-force-ruby-to-release-memory/195515).
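To illustrate why forcing GC doesn't help here (a minimal sketch using Ruby's standard GC.start): the garbage collector only reclaims objects that nothing references any more, so as long as a live variable holds the loaded records, they cannot be freed.
cities = City.all.to_a   # every City instance stays referenced by this array
cities.each do |city|
  # ... sum the neighbourhoods for this city ...
end
GC.start                 # reclaims nothing useful: `cities` still references
                         # every record, including the subtrees already processed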
From the following articles I understood that the developer should
organize the data in a way that "suggests" to the garbage collector what
should be freed. Maybe another approach will do the trick, but the only
alternative I see is a depth-first traversal instead of the
reversed breadth-first one I'm using, and I don't see why that should work.
What I read so far:
https://stackify.com/how-does-ruby-garbage-collection-work-a-simple-tutorial/
https://www.toptal.com/ruby/hunting-ruby-memory-issues
https://scoutapm.com/blog/ruby-garbage-collection
https://scoutapm.com/blog/manage-ruby-memory-usage
Thanks

This isn't really a case of a memory leak. You're just indiscriminately loading data off the table, which will exhaust the available memory.
The solution is to load the data off the database in batches:
City.find_each do |city|
  city.update(data: city.neighbourhoods.sum(&:data))
end
If neighbourhoods.data is a simple integer, you don't need to fetch the records in the first place:
City.update_all(
  'data = (SELECT SUM(neighbourhoods.data) FROM neighbourhoods WHERE neighbourhoods.city_id = cities.id)'
)
This will be an order of magnitude faster and have trivial memory consumption, as all the work is done in the database.
If you REALLY want to load a bunch of records into Rails, then make sure to select aggregates instead of instantiating all those nested records:
City.left_joins(:neighbourhoods)
    .group(:id)
    .select(:id, 'SUM(neighbourhoods.data) AS n_data')
    .find_each { |city| city.update(data: city.n_data) }

Depending on how your model associations are set up, you should be able to take advantage of preloading.
For Example:
class City < ApplicationRecord
  has_many :neighborhoods
end

class Neighborhood < ApplicationRecord
  belongs_to :city
  belongs_to :state
end

class State < ApplicationRecord
  belongs_to :country
  has_many :neighborhoods
end

class Country < ApplicationRecord
  has_many :states
end

cities = City.all.includes(neighborhoods: { state: :country })
cities.each do |city|
  ...
end

You don't need Rails at all; pure SQL should be good enough to do what you're trying:
City.connection.execute(<<-SQL.squish)
  UPDATE cities SET data = (
    SELECT SUM(neighbourhoods.data)
    FROM neighbourhoods
    WHERE neighbourhoods.city_id = cities.id
  )
SQL

Related

Is it a good idea to serialize immutable data from an association?

Let's say we have a collection of products, each with their own specifics e.g. price.
We want to issue invoices that contain said products. Using a direct association from Invoice to Product via has_many is a no-go, since products may change and invoices must be immutable; it would result in an alteration of the invoice price, concept, etc.
I first thought of having an intermediate model like InvoiceProduct that would be associated with the Invoice and created from a Product. Each InvoiceProduct would be unique to its parent invoice and immutable. This option would increase the DB size significantly as more invoices get issued, though, so I think it is not a good option.
I'm now considering adding a serialized field to the invoice model with all the information of the products associated with it - a hash of the collection of items the invoice contains. This way we can keep them immutable even if the product gets modified in the future.
I'm not sure of possible mid or long term downsides to this approach, though. Would like to hear your thoughts about it.
Also, if there's some more obvious approach that I might have overlooked I'd love to hear about it too.
Cheers
In my experience, the main downside of a serialized field approach vs the InvoiceProducts approach described above is decreased flexibility in terms of how you can use your invoice data going forward.
In our case, we have Orders and OrderItems tables in our database and use this data to generate sales analytics reports as well as customer Invoices.
Querying the OrderItem data to generate the sales reports we need is much faster and easier with this approach than it would be if the same data was stored as serialized data in the db.
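For instance, a "revenue per product over the last month" report stays a single database query (a sketch; the subtotal column and the belongs_to :order association are assumptions, not from the post above):
OrderItem.joins(:order)
         .where(orders: { created_at: 1.month.ago..Time.current })
         .group(:product_id)
         .sum(:subtotal)
# With the same data serialized into a column on orders, every order would
# have to be loaded and deserialized in Ruby before it could be aggregated.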
No.
Serialized columns have no place in a modern application. They are an overused dirty hack from the days before native JSON/JSONB columns were widespread, and they have only downsides. The only exception to this rule is when you're using application-side encryption.
JSON/JSONB columns can be used for a limited number of tasks where the data defies being defined by a fixed schema, or if you're just storing raw JSON responses - but it should not be how you define your schema out of convenience, because you're just shooting yourself in the foot. It's a special tool for special jobs.
The better alternative is to actually use good relational database design and store the price at the time of sale and everything else in a separate table:
class Order < ApplicationRecord
  has_many :line_items
end

# rails g model line_item order:belongs_to product:belongs_to units:decimal unit_price:decimal subtotal:decimal
# The line item model is responsible for each item of an order
# and records the price at the time of order and any discounts applied to that line
class LineItem < ApplicationRecord
  belongs_to :order
  belongs_to :product
end

class Product < ApplicationRecord
  has_many :line_items
end
A serialized column is not immutable in any way - it's actually more prone to denormalization and corruption, as there are no database-side constraints to ensure its correctness.
Tables can actually be made immutable in many databases by using triggers.
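For example, a minimal sketch for PostgreSQL (table and function names are illustrative; EXECUTE FUNCTION requires PostgreSQL 11+, use EXECUTE PROCEDURE on older versions):
# Reject any UPDATE or DELETE on line_items at the database level.
class MakeLineItemsImmutable < ActiveRecord::Migration[7.0]
  def up
    execute <<~SQL
      CREATE FUNCTION reject_line_item_changes() RETURNS trigger AS $$
      BEGIN
        RAISE EXCEPTION 'line_items are immutable';
      END;
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER line_items_immutable
        BEFORE UPDATE OR DELETE ON line_items
        FOR EACH ROW EXECUTE FUNCTION reject_line_item_changes();
    SQL
  end

  def down
    execute <<~SQL
      DROP TRIGGER line_items_immutable ON line_items;
      DROP FUNCTION reject_line_item_changes();
    SQL
  end
end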
Advantages:
No violation of 1NF.
A normalized fixed data schema to work with - constraints ensure the validity of the data on the database level.
Joins are an extremely powerful tool and not as expensive as you might think.
You can actually access and make sense of the data outside of the application if needed.
DECIMAL data types - JSON/JSONB only has a single number type, which uses IEEE 754 floating point.
You have an actual model and associations instead of having to deal with raw hashes.
You can query the data with sane queries.
You can generate aggregates on the database level and use tools like materialized views.
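For example, with the LineItem model above, revenue per product can be computed entirely in the database (a sketch; returns a hash keyed by product_id):
LineItem.group(:product_id).sum(:subtotal)  # => { 1 => 99.0, 2 => 1050.0, ... }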

Queries with include in Rails

I have the following problem. I need to do a massive query on a table named professionals, but I need to optimize the query because for each professional I query the associated tables.
But I have a problem with two associated tables: comments and tariffs.
Comments:
I need to fetch 3 comments for each professional. I tried:
@professionals.includes(:comments).where(:comments => { type: 0 }).last(3)
The problem is that the query only brings back 3 professionals, not what I need: all the professionals, each with at most three comments whose type equals zero.
And when I try:
@professionals.includes(:comments).where(:comments => { type: 0 })
the result is only professionals with comments, when I need all the professionals, with or without comments. But if a professional has comments, I only need the last three comments whose type equals zero.
Tariffs:
With tariffs I have a similar problem; in this case I need the last 4 tariffs for each professional. I tried:
@professionals.includes(:tariffs).last(4)
But that only brings back the last 4 professionals.
Models:
class Comment < ActiveRecord::Base
  belongs_to :client
  belongs_to :professional
end

class Professionals < ActiveRecord::Base
  has_many :comments
end
You can't use limit on the joined table in ActiveRecord. The limit is applied to the first relation, which in this case happens to be @professionals.
You have a few choices:
Preload all comments for each professional and limit them on output (reduces the number of queries needed but increases memory consumption since you are potentially preloading a lot of AR objects).
Lazy load the required number of comments (increases the number of queries by n+1, but reduces the potential memory consumption).
Write a custom query with raw SQL.
If you preload everything, then you don't have to change much. Just limit the number of comments while iterating through each @professional.
@professionals.each do |professional|
  professional.comments.limit(3)
end
If you lazy load only what you need, then you would apply the limit scope to the comments relation.
@professionals.all
@professionals.each do |professional|
  professional.comments.where(type: 0).limit(3)
end
Writing a custom query is a bit more complex, and you might find it less performant depending on the number of joins you have to make and how well indexed your tables are.
I suggest you take approach two, and use query and fragment caching to improve performance. For example:
- cache @professionals do
  - @professionals.each do |professional|
    - cache professional do
      = professional.name
This approach will hit the database the first time, but on subsequent loads the comments will be read from the cache, avoiding the DB hit. You can read more about caching in the Rails Guides.

Horizontal database scaling in Ruby on Rails

I have a Ruby on Rails app with a PostgreSQL database which has this structure:
class A < ActiveRecord::Base
  has_many :B
end

class B < ActiveRecord::Base
  has_many :C
end

class C < ActiveRecord::Base
  attr_accessible :x, :y, :z
end
There are only a few A's, and they grow slowly (say 5 a month). Each A has thousands of B's, and each B has tens of thousands of C's (so each A has millions of C's).
A's are independent and B's and C's from different A's will never be needed together (i.e. in the same query).
My problem is that even now, with only a couple of A's, ActiveRecord queries take pretty long. When the table for C has tens of millions of rows, queries will take forever.
I am thinking about scaling the database horizontally (i.e. a table for A's, one table of B's and one table of C's for each A), but I don't know how to do it. It is a kind of sharding, I guess, but I can't figure out how to create DB tables dynamically and use ActiveRecord to access the data when the table depends on which A I'm working with.
Thank you very much.
If you have performance concerns with only a few rows, or even with several million rows, you need to take a step back before trying to over-engineer a solution. The problem you are describing is very easily solved by indexing; there is no advantage to creating additional physical tables, and you'd be introducing incredible complexity.
As @mu-is-too-short already stated: pay attention to your query plans. Use your tools to analyze performance.
That being said, you can use table partitioning to physically and transparently split the storage of data into different sharded tables, which is especially useful for data that grows very fast but is only useful within a given time box (like a month). You can also do this with an archive bit-flag column to shuttle old or deleted records onto slower storage (say, a standard RAID of spinning rust) while keeping active records on faster storage (like a RAID of SSDs).
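For the indexing fix, a minimal sketch (assuming conventional Rails table and foreign-key naming for the A/B/C models above - as, bs, cs - so adjust to your real names):
# Index the foreign keys used by the A -> B -> C associations so the
# lookups don't require full table scans.
class AddForeignKeyIndexes < ActiveRecord::Migration
  def change
    add_index :bs, :a_id
    add_index :cs, :b_id
  end
end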
So it seems you have a tree-like structure. If there is really no need to pull them out of the database in some kind of cross-referenced manner, then your A's have exactly the properties of a "document"; have a look at MongoDB. A's would be saved with all of their B's and their C's in a single record.
http://www.mongodb.org/
If you are looking for an ORM, check
http://mongoid.org/en/mongoid/index.html

ActiveRecord: size vs count

In Rails, you can find the number of records using both Model.size and Model.count. If you're dealing with more complex queries, is there any advantage to using one method over the other? How are they different?
For instance, I have users with photos. If I want to show a table of users and how many photos they have, will running many instances of user.photos.size be faster or slower than user.photos.count?
Thanks!
You should read that, it's still valid.
You'll adapt the method you use depending on your needs.
Basically:
if you have already loaded all the entries, say User.all, then you should use length to avoid another DB query
if you haven't loaded anything, use count to make a count query on your DB
if you don't want to bother with these considerations, use size, which will adapt
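A quick illustration (a sketch with a hypothetical User model):
users = User.all   # lazy relation, nothing loaded yet
users.count        # issues SELECT COUNT(*) FROM users
users.load         # loads the records into memory
users.length       # counts the in-memory array, no extra query
users.size         # relation is loaded, so this also counts in memory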
As the other answers state:
count will perform an SQL COUNT query
length will calculate the length of the resulting array
size will try to pick the most appropriate of the two to avoid excessive queries
But there is one more thing. We noticed a case where size acts differently to count/length altogether, and I thought I'd share it, since it is rare enough to be overlooked.
If you use a :counter_cache on a has_many association, size will use the cached count directly, and not make an extra query at all.
class Image < ActiveRecord::Base
  belongs_to :product, counter_cache: true
end

class Product < ActiveRecord::Base
  has_many :images
end
> product = Product.first # query, load product into memory
> product.images.size # no query, reads the :images_count column
> product.images.count # query, SQL COUNT
> product.images.length # query, loads images into memory
This behaviour is documented in the Rails Guides, but I either missed it the first time or forgot about it.
tl;dr
If you know you won't be needing the data, use count.
If you know you will use or have used the data, use length.
If you don't know where it will be used or whether the speed difference is negligible, use size...
count
Resolves to sending a SELECT COUNT(*)... query to the DB. The way to go if you don't need the data, but just the count.
Example: count of new messages, total elements when only a page is going to be displayed, etc.
length
Loads the required data, i.e. the query as required, and then just counts it. The way to go if you are using the data.
Example: Summary of a fully loaded table, titles of displayed data, etc.
size
It checks if the data is already loaded (i.e. already in Rails); if so, it just counts it, otherwise it calls count. (Plus the pitfalls already mentioned in other entries.)
def size
  loaded? ? @records.length : count(:all)
end
What's the problem?
That you might hit the DB twice if you don't do it in the right order (e.g. if you render the number of elements in a table on top of the rendered table, there will effectively be 2 calls sent to the DB).
Sometimes size "picks the wrong one" and returns a hash (which is what count would do).
In that case, use length to get an integer instead of a hash.
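This happens with grouped relations; for example (a sketch assuming a hypothetical Post model with has_many :comments):
relation = Post.joins(:comments).group('posts.id')

relation.count   # => { 1 => 3, 2 => 7 }  hash of post id => row count
relation.size    # nothing loaded yet, so it delegates to count: same hash
relation.length  # loads the records and returns an integer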
The following strategies all make a call to the database to perform a COUNT(*) query.
Model.count
Model.all.size
records = Model.all
records.count
The following is not as efficient as it will load all records from the database into Ruby, which then counts the size of the collection.
records = Model.all
records.size
If your models have associations and you want to find the number of belonging objects (e.g. @customer.orders.size), you can avoid database queries (disk reads). Use a counter cache and Rails will keep the cached value up to date, returning it in response to the size method.
I recommend using the size method.
class Customer < ActiveRecord::Base
  has_many :customer_activities
end

class CustomerActivity < ActiveRecord::Base
  belongs_to :customer, counter_cache: true
end
Consider these two models. The customer has many customer activities.
If you use a :counter_cache on a has_many association, size will use the cached count directly and not make an extra query at all.
Consider one example:
In my database, one customer has 20,000 customer activities, and I tried counting that customer's activities with each of the count, length and size methods. Below is the benchmark report for all of them:
              user     system      total        real
Count:    0.000000   0.000000   0.000000 (  0.006105)
Size:     0.010000   0.000000   0.010000 (  0.003797)
Length:   0.030000   0.000000   0.030000 (  0.026481)
So I found that, with :counter_cache, size is the best option for calculating the number of records.
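For reference, a sketch of how such a report could be produced (Benchmark is in the Ruby standard library; the models are the ones defined above):
require 'benchmark'

customer = Customer.first
Benchmark.bm(7) do |x|
  x.report('Count:')  { customer.customer_activities.count }   # SQL COUNT
  x.report('Size:')   { customer.customer_activities.size }    # counter cache read
  x.report('Length:') { customer.customer_activities.length }  # loads all records
end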
Here's a flowchart to simplify your decision-making process. Hope it helps.
Source: Difference Between the Length, Size, and Count Methods in Rails

A database design for variable column names

I have a situation that involves Companies, Projects, and Employees who write Reports on Projects.
A Company owns many projects, many reports, and many employees.
One report is written by one employee for one of the company's projects.
Companies each want different things in a report. Let's say one company wants to know about project performance and speed, while another wants to know about cost-effectiveness. There are 5-15 criteria, set differently by each company, which ALL apply to all of that company's project reports.
I was thinking about different ways to do this, but my current stalemate is this:
To the company table, add a text field criteria, which contains an array of the criteria desired, in order.
In the report table, have a company_id and columns criterion1, criterion2, etc.
I am completely aware that this is typically considered horrible database design - inelegant and inflexible. So, I need your help! How can I build this better?
Conclusion
I decided to go with the serialized option in my case, for these reasons:
My requirements for the criteria are simple - no searching or sorting will be required of the reports once they are submitted by each employee.
I wanted to minimize database load - where these are going to be implemented, there is already a large page with overhead.
I want to avoid complicating my database structure for what I believe is a relatively simple need.
CouchDB and Mongo are not currently in my repertoire so I'll save them for a more needy day.
This would be a great opportunity to use NoSQL! Seems like the textbook use-case to me. So head over to CouchDB or Mongo and start hacking.
With conventional DBs you are slightly caught in the problem of how much to normalize your data:
A sort of "good" way (meaning very normalized) would look something like this:
class Company < AR::Base
  has_many :reports
  has_many :criteria
end

class Report < AR::Base
  belongs_to :company
  has_many :criteria_values
  has_many :criteria, :through => :criteria_values
end

class Criteria < AR::Base # should be Criterion but whatever
  belongs_to :company
  has_many :criteria_values
  # one attribute 'name' (or 'type' and you can mess with STI)
end

class CriteriaValues < AR::Base
  belongs_to :report
  belongs_to :criteria
  # one attribute 'value'
end
This makes something that is very simple and fast in NoSQL into a triple or quadruple join in SQL, and you have many models that pretty much do nothing.
Another way is to denormalize:
class Company < AR::Base
  has_many :reports
  serialize :criteria
end

class Report < AR::Base
  belongs_to :company
  serialize :criteria_values

  def criteria
    self.company.criteria
  end

  # custom code here to validate that criteria_values correspond to criteria etc.
end
Related to that, a rather clever way of serializing at least the criteria (and maybe the values, if they were all boolean) is using bit fields. This basically gives you more or less easy migrations (hard to delete and modify, but easy to add) and searchability without any overhead.
A good plugin that implements this is Flag Shih Tzu, which I've used on a few projects and can recommend.
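A sketch of how that could look (from memory of the flag_shih_tzu gem's README; the flag names are made up, the table needs an integer flags column, and you should check the gem's docs for the exact API):
# Gemfile: gem 'flag_shih_tzu'
class Report < ActiveRecord::Base
  include FlagShihTzu
  # Each flag occupies one bit of the integer `flags` column.
  has_flags 1 => :performance,
            2 => :speed,
            3 => :cost_effectiveness
end

report = Report.new
report.performance = true
report.performance?   # => true
Report.performance    # scope: reports with the performance bit set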
Variable columns (e.g. crit1, crit2, etc.):
I'd strongly advise against them. You don't get much benefit (the data is still not very searchable, since you don't know which column your info is in), and it leads to maintainability nightmares. Imagine your DB gets to a few million records and suddenly someone needs 16 criteria. What could have been a complete non-issue is suddenly a migration that adds a completely useless field to millions of records.
Another problem is that a lot of the ActiveRecord magic doesn't work with this - you'll have to figure out what crit1 means by yourself, and if you want to add validations on these fields, that adds a lot of pointless work.
So to summarize: Have a look at Mongo or CouchDB and if that seems impractical, go ahead and save your stuff serialized. If you need to do complex validation and don't care too much about DB load then normalize away and take option 1.
Well, when you say "To company table, add text field criteria, which contains an array of the criteria desired in order", that smells like the company table wants to be normalized: you might break out each criterion into one of 15 columns called "criterion1", ..., "criterion15", where any or all columns can default to null.
To me, you are on the right track with your report table. Each row in that table might represent one report, with corresponding columns "criterion1", ..., "criterion15", as you say, where each cell says how well the company did on that column's criterion. There will be multiple reports per company, so you'll need a date (or report-number or similar) column in the report table. Then the date plus the company id can be a composite key, and the company id can be a non-unique index, as can the report date/number/some-identifier. And don't forget a column for the reporting-employee id.
Any and every criterion column in the report table can be null, meaning (maybe) that the employee did not report on this criterion; or that this criterion (column) did not apply in this report (row).
It seems like that would work fine. I don't see that you ever need to do a join. It looks perfectly straightforward, at least to these naive and ignorant eyes.
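A sketch of that report table as a migration (names assumed from the description above):
class CreateReports < ActiveRecord::Migration
  def change
    create_table :reports do |t|
      t.integer :company_id,  null: false  # non-unique index below
      t.integer :employee_id, null: false  # the reporting employee
      t.date    :report_date, null: false  # part of the composite key
      t.string  :criterion1                # all criterion columns nullable
      # ... criterion2 through criterion14 ...
      t.string  :criterion15
    end
    add_index :reports, :company_id
    add_index :reports, [:company_id, :report_date], unique: true
  end
end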
Create a criteria table that lists the criteria for each company (company 1..* criteria).
Then create a report_criteria table (report 1..* report_criteria) that lists the criteria for that specific report, based on the criteria table (criteria 1..* report_criteria).
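A sketch of that schema (a hypothetical migration; names are illustrative):
class CreateCriteriaTables < ActiveRecord::Migration
  def change
    create_table :criteria do |t|
      t.integer :company_id, null: false
      t.string  :name,       null: false
      t.integer :position                 # order of the criterion for the company
    end
    add_index :criteria, :company_id

    create_table :report_criteria do |t|
      t.integer :report_id,   null: false
      t.integer :criteria_id, null: false
      t.string  :value
    end
    add_index :report_criteria, :report_id
  end
end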
