Mongoid identity_map and memory usage, memory leaks - ruby-on-rails

When I execute this query:
Mymodel.all.each do |model|
  # ..do something
end
It uses a lot of memory, the amount of memory used keeps increasing, and in the end it crashes. I found out that to fix it I need to disable identity_map, but when I add identity_map_enabled: false to my mongoid.yml file I get this error:
Invalid configuration option: identity_map_enabled.
Summary:
A invalid configuration option was provided in your mongoid.yml, or a typo is potentially present. The valid configuration options are: :include_root_in_json, :include_type_for_serialization, :preload_models, :raise_not_found_error, :scope_overwrite_exception, :duplicate_fields_exception, :use_activesupport_time_zone, :use_utc.
Resolution:
Remove the invalid option or fix the typo. If you were expecting the option to be there, please consult the following page with repect to Mongoid's configuration:
I am using Rails 4 and Mongoid 4, Mymodel.all.count => 3202400
How can I fix it? Or does someone know another way to reduce the amount of memory used while executing .all.each?
Thank you very much for the help!!!!

I started with something just like yours: looping through millions of records while memory usage just kept increasing.
Original code:
@portal.listings.each do |listing|
  listing.do_something
end
I've gone through many forum answers and I tried them out.
1st attempt: I tried a combination of WeakRef and GC.start, but it did not help.
2nd attempt: I added listing = nil to the first attempt, and it still failed.
Successful attempt:
@start_date = 10.years.ago
@end_date = 1.day.ago
while @start_date < @end_date
  @portal.listings.where(created_at: @start_date..@start_date.next_month).each do |listing|
    listing.do_something
  end
  @start_date = @start_date.next_month
end
Conclusion
The memory allocated for the records is not released while a single query is being iterated. Processing a small number of records per query therefore does the job: memory stays in good condition because it is released after each query.

Your problem isn't the identity map. I don't think Mongoid 4 even has an identity map built in, hence the configuration error when you try to turn it off. Your problem is that you're using all. When you do this:
Mymodel.all.each
Mongoid will attempt to instantiate every single document in the db.mymodels collection as a Mymodel instance before it starts iterating. You say that you have about 3.2 million documents in the collection, that means that Mongoid will try to create 3.2 million model instances before it tries to iterate. Presumably you don't have enough memory to handle that many objects.
Your Mymodel.all.count works fine because that just sends a simple count call into the database and returns a number, it won't instantiate any models at all.
The solution is to not use all (and preferably forget that it exists). Depending on what "do something" does, you could:
Page through all the models so that you're only working with a reasonable number of them at a time.
Push the logic into the database using mapReduce or the aggregation framework.
Whenever you're working with real data (i.e. something other than a trivially small database), you should push as much work as possible into the database because databases are built to manage and manipulate big piles of data.
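Paging can be sketched in plain Ruby. This is only a minimal sketch, not Mongoid's API: FakeCollection and its find_after method are hypothetical stand-ins for a collection queried by ascending id with a limit, and each_in_batches is an illustrative helper. The point is that only one batch is held in memory at a time.

```ruby
# Hypothetical stand-in for a database collection; not a Mongoid class.
class FakeCollection
  def initialize(size)
    @docs = (1..size).map { |id| { id: id, name: "doc-#{id}" } }
  end

  # Returns up to `limit` documents whose id is greater than `last_id`,
  # in ascending id order (roughly what an indexed range query with a limit returns).
  def find_after(last_id, limit)
    @docs.select { |doc| doc[:id] > last_id }.first(limit)
  end
end

# Yields documents one at a time while only ever holding one batch in memory.
def each_in_batches(collection, batch_size)
  last_id = 0
  loop do
    batch = collection.find_after(last_id, batch_size)
    break if batch.empty?

    batch.each { |doc| yield doc }
    last_id = batch.last[:id]
  end
end

collection = FakeCollection.new(10)
seen = []
each_in_batches(collection, 3) { |doc| seen << doc[:id] }
puts seen.inspect
```

Paging by "id greater than the last one seen" rather than by offset keeps each query cheap even deep into a large collection, assuming the id field is indexed.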

Related

Request timeout in Rails

We are working on a data visualization problem right now. Our customer wants us to show the last 6 months data for a honeybee hive on a graph.
Clearly it's gonna be a huge dataset. By adding indexes we overcame the slow database loads, but we still have a problem visualizing the data on a graph.
Here is the related code:
def self.prepare_single_hive_messages_for_datatable_dygraph(messages, us_metric_enabled)
  data = []
  messages.each do |message|
    record = []
    record << message.occurance_time.to_s(:dygraph_format)
    record << weight_according_to_metric(message.weight, us_metric_enabled)
    record << temperature_according_to_metric(message.temperature, us_metric_enabled)
    record << (message.humidity.nil? ? nil : message.humidity.to_f)
    data << record
  end
  return data
end
The problem is that messages.each is very slow and takes more than 30 seconds. Is there any solution to overcome this?
Project Specification:
Rails Version: 4.1.9
Graph Library: Dygraph
Database: Postgres
There are two ways to attack a performance problem like this.
Find and correct the performance bottleneck
Break it into smaller pieces
Finding Performance issues
First, get a dataset large enough to reproduce the problem set up on your dev system. Then look at the logs so you can see how long the transaction is taking. You should be looking for a line like this:
Completed 200 OK in 432.1ms (Views: 367.7ms | ActiveRecord: 61.4ms)
Rerun the task a couple times since caching can cause variations. Write down your different times. Then remove everything in the loop and run it with just the loop. Do the numbers go back to looking reasonable? If that is the case then you know the problem is the work you are doing inside the loop. Next, add each line in the loop back on its own (or one at a time if they depend on each other). Figure out which line causes those numbers to jump the most.
This is the point where you should try to performance tune your code. Check for queries that could be smarter. Make sure you aren't querying the same data over and over. If you have a function in a model that computes something and you call it multiple times to get the same answer then use this to only compute once:
def something
  return @savedvalue if @savedvalue
  # really_complex_calculation is a placeholder for your expensive work
  @savedvalue = really_complex_calculation
end
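The memoization pattern above can be tried in isolation. The following is a minimal sketch with hypothetical names (Report, expensive_total, expensive_nil); it shows the common `||=` shorthand and a `defined?` guard, which matters when the computed result can legitimately be nil or false, since `||=` would recompute it on every call.

```ruby
class Report
  attr_reader :calls

  def initialize
    @calls = 0
  end

  # Memoized with ||=: the expensive work runs only on the first call.
  def total
    @total ||= expensive_total
  end

  # ||= would redo the work whenever the result is nil or false,
  # so check whether the variable has been assigned at all.
  def missing_value
    return @missing_value if defined?(@missing_value)

    @missing_value = expensive_nil
  end

  private

  # Stand-in for a slow computation; counts how often it runs.
  def expensive_total
    @calls += 1
    (1..1_000).reduce(:+)
  end

  # Stand-in for a slow computation whose result is nil.
  def expensive_nil
    @calls += 1
    nil
  end
end

report = Report.new
3.times { report.total }
3.times { report.missing_value }
puts "total=#{report.total} expensive calls=#{report.calls}"
```

Each expensive method runs once despite being requested three times.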
The goal is to find the worst offender so you can make changes that have the biggest impact. However, if you are working with a LOT of data this may only get you so far. It may be impossible to performance tune enough for all the data. In that case there is option 2.
Break it into smaller pieces
Write a second Rails action whose only job is to render a single record on a graph. It will do the inner part of your loop, but only on the message whose id was passed to it.
Call your original function to set up the view and pass the list of messages to the view. In the view, loop through the list of messages to set up jQuery ajax code that calls the above action once for each message. Have this run on document ready.
Then, the page will load with an empty graph... but as soon as it is up the individual processed records will be fed to it and appear one at a time on the page. It will still take just as long (or even a little longer because of overhead) to complete the graph... but it will no longer time out. Each ajax call will be its own quick hit to the server instead of one big long hit.
I just used this very technique to load a rather long report on a site I work on. Ideally we'd like to fix any underlying performance issues... but what we really wanted was to have a report working right away and then fix the performance issues as we had time.
Ok you said every person sees the same set of data, which is great, means we can cache without worrying about who's logged in, first here's your method, with tiny improvements
def self.prepare_single_hive_messages_for_datatable_dygraph(messages, us_metric_enabled)
  messages.inject([]) do |records, message|
    records << [].tap do |record|
      record << message.occurance_time.to_s(:dygraph_format)
      record << weight_according_to_metric(message.weight, us_metric_enabled)
      record << temperature_according_to_metric(message.temperature, us_metric_enabled)
      record << (message.humidity.nil? ? nil : message.humidity.to_f)
    end
  end
end
Then create a caching function, that runs this method and caches it
# some class constants
CACHE_KEY = 'some_cache_key'
EXPIRY_TIME = 15.minutes
# the methods
def self.write_single_hive_messages_to_cache(messages, us_metric_enabled)
  # Inside a class method, self is already the class, so call the method directly
  Rails.cache.write CACHE_KEY,
                    prepare_single_hive_messages_for_datatable_dygraph(messages, us_metric_enabled),
                    expires_in: EXPIRY_TIME
end
And a simple cache reading method
def self.read_single_hive_messages_from_cache
  Rails.cache.read CACHE_KEY
end
Then create a rake task that just fetches these messages and call the caching method, and rails will write the cache.
Create a cron job that calls this rake task, set the cron job to 5 minutes or so, the expiry time is longer just in case for some reason the cron job didn't run, the data will still be available for the next run.
This way your processing is run in the background, every 5 ( or whatever time you choose ) minutes, the page load should happen normally with no delay at all, since the array data will be loaded from the pre-calculated cache.
In case the cron stops working, the data will expire in the 15 minutes I've set, and then the read cache method will return nil, you could avoid this and set the data to never expire, but then the data will become stale and the old data will keep getting returned.
Another way to handle this is to tell the cache reading method how to generate the cache itself, so if it finds the cache empty it generates the data and caches it before returning. The method would look like this:
def self.read_single_hive_messages_from_cache(messages, us_metric_enabled)
  Rails.cache.fetch CACHE_KEY, expires_in: EXPIRY_TIME do
    # fetch stores the block's return value, so return the prepared data here
    prepare_single_hive_messages_for_datatable_dygraph(messages, us_metric_enabled)
  end
end
But then make sure that messages is an ActiveRecord::Relation and not an already-loaded array, because you don't want to query for 1+ million records only to find the cache already populated. An ActiveRecord::Relation does not touch the database until it is iterated (inside the caching block). If the cache exists, it is returned before the block is entered, so the data is never fetched, saving you that huge query.
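The read-or-compute behaviour can be illustrated without Rails. This is a minimal sketch under stated assumptions: TinyCache is a hypothetical in-memory class that only mimics the shape of fetch with an expiry, and a lambda stands in for a lazy relation. The `now:` parameter exists purely to make the example deterministic.

```ruby
# Hypothetical in-memory cache that mimics the read-or-compute shape of fetch.
class TinyCache
  def initialize
    @store = {}
  end

  # Returns the cached value if present and not expired; otherwise runs the
  # block, stores its return value, and returns it.
  def fetch(key, expires_in:, now: Time.now)
    entry = @store[key]
    return entry[:value] if entry && entry[:expires_at] > now

    value = yield
    @store[key] = { value: value, expires_at: now + expires_in }
    value
  end
end

cache = TinyCache.new
query_runs = 0
# A lambda stands in for a lazy relation: nothing is loaded until it is called.
lazy_messages = lambda do
  query_runs += 1
  [[1, 20.5], [2, 21.0]]
end

start = Time.at(0)
first  = cache.fetch('hive', expires_in: 900, now: start)       { lazy_messages.call }
second = cache.fetch('hive', expires_in: 900, now: start + 60)  { lazy_messages.call }
third  = cache.fetch('hive', expires_in: 900, now: start + 901) { lazy_messages.call }
puts "query ran #{query_runs} times"
```

The second call is served from the cache without running the "query"; the third call falls after the expiry, so the block runs again.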
I know the answer got long, if you need more help tell me.

Rails 4 Multithreaded App - ActiveRecord::ConnectionTimeoutError

I have a simple Rails app that scrapes JSON from a remote URL for each instance of a model (let's call it A). The app then creates a new data point under an associated model of the first. Let's call this middle model B and the data point model C. There's also a front end that lets users browse this data graphically/visually.
Thus the hierarchy is A has many -> B which has many -> C. I scrape a URL for each A which returns a few instances of B with new Cs that have data for the respective B.
While attempting to test/scale this app I have encountered a problem where rails will stop processing, hang for a while, and finally throw a "ActiveRecord::ConnectionTimeoutError could not obtain a database connection within 5.000 seconds" Obviously the 5 is just the default.
I can't understand why this is happening when 1) there are no DB calls being made explicitly, 2) the log doesn't show any under the hood DB calls happening when it does work 3) it works sometimes and not others.
What's going on with rails 4 AR and the connection pool?!
A couple of notes:
The general algorithm is to spawn a thread for each model A, scrape the data, create in memory new instances of model C, save all the C's in one transaction at the end.
Sometimes this works, other times it doesn't, and I can't figure out what causes it to fail. However, once it fails it seems to fail more and more.
I eager load all the model A's and B's to begin with.
I use a transaction at the end to insert all the newly created C instances.
I currently use resque and resque scheduler to do this work but I highly doubt they are the source of the problem as it persists even if I just do "rails runner Class.do_work"
Any suggestions and or thoughts greatly appreciated!
I believe I have found the cause of this problem. When you loop through an association via
model.association.each do |a|
  #work here
end
Rails does some behind the scenes work that "uses" a DB connection. I put uses in quotes because in my case I think the result is actually returned from memory. I eager loaded the association and thus the DB is never actually hit.
Preliminary testing of wrapping my block in a
ActiveRecord::Base.connection_pool.with_connection do
  #something me doing?
end
seems to have resolved the issue.
I uncovered this by adding a backtrace to my thread's error message that was printing out.
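Why returning connections matters can be shown with a plain-Ruby model. This is only a sketch of the checkout/checkin idea: TinyPool is hypothetical and is not ActiveRecord's connection pool. Ten threads share two "connections", and because each thread hands its connection back when its block finishes, all of them complete.

```ruby
# Hypothetical fixed-size pool; a sketch of the checkout/checkin idea only.
class TinyPool
  def initialize(size)
    @available = Queue.new
    size.times { |i| @available << "conn-#{i}" }
  end

  # Checks out a connection, yields it, and always returns it to the pool,
  # even if the block raises.
  def with_connection
    conn = @available.pop # blocks until a connection is free
    yield conn
  ensure
    @available << conn if conn
  end
end

pool = TinyPool.new(2)
lock = Mutex.new
in_use = 0
max_in_use = 0
finished = 0

threads = 10.times.map do
  Thread.new do
    pool.with_connection do |_conn|
      lock.synchronize do
        in_use += 1
        max_in_use = [max_in_use, in_use].max
      end
      sleep 0.01 # simulated work while holding the connection
      lock.synchronize do
        in_use -= 1
        finished += 1
      end
    end
  end
end
threads.each(&:join)
puts "finished=#{finished} max_in_use=#{max_in_use}"
```

If threads kept their connection after finishing instead of returning it, the third thread onward would wait forever in this model; a real pool gives up after a timeout, which matches the error in the question.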
-----For those using resque----
I also had to add a bit in my resque.rake file to get this fully working as intended.
task 'resque:setup' => :environment do
  Resque.after_fork do |job|
    ActiveRecord::Base.establish_connection
  end
end
If you are using
ActiveRecord::Base.transaction do
  ... code
end
to get faster writes in a thread, note that this locks the database for as long as the transaction runs. I had an app that did this for a hugely expensive process, in a thread, and it would lock the DB for over 5 seconds. It is faster, but it will lock your database.

How does Rails 4 Russian doll caching prevent stampedes?

I am looking to find information on how the caching mechanism in Rails 4 prevents against multiple users trying to regenerate cache keys at once, aka a cache stampede: http://en.wikipedia.org/wiki/Cache_stampede
I've not been able to find out much information via Googling. If I look at other systems (such as Drupal) cache stampede prevention is implemented via a semaphores table in the database.
Rails does not have a built-in mechanism to prevent cache stampedes.
According to the README for atomic_mem_cache_store (a replacement for ActiveSupport::Cache::MemCacheStore that mitigates cache stampedes):
Rails (and any framework relying on active support cache store) does
not offer any built-in solution to this problem
Unfortunately, I'm guessing that this gem won't solve your problem either. It supports fragment caching, but it only works with time-based expiration.
Read more about it here:
https://github.com/nel/atomic_mem_cache_store
Update and possible solution:
I thought about this a bit more and came up with what seems to me to be a plausible solution. I haven't verified that this works, and there are probably better ways to do it, but I was trying to think of the smallest change that would mitigate the majority of the problem.
I assume you're doing something like cache model do in your templates as described by DHH (http://37signals.com/svn/posts/3113-how-key-based-cache-expiration-works). The problem is that when the model's updated_at column changes, the cache_key likewise changes, and all your servers try to re-create the template at the same time. In order to prevent the servers from stampeding, you would need to retain the old cache_key for a brief time.
You might be able to do this by (dum da dum) caching the cache_key of the object with a short expiration (say, 1 second) and a race_condition_ttl.
You could create a module like this and include it in your models:
module StampedeAvoider
  def cache_key
    orig_cache_key = super
    Rails.cache.fetch("/cache-keys/#{self.class.table_name}/#{self.id}", expires_in: 1, race_condition_ttl: 2) { orig_cache_key }
  end
end
Let's review what would happen. There are a bunch of servers calling cache model. If your model includes StampedeAvoider, then its cache_key will now be fetching /cache-keys/models/1, and returning something like /models/1-111 (where 111 is the timestamp), which cache will use to fetch the compiled template fragment.
When you update the model, model.cache_key will begin returning /models/1-222 (assuming 222 is the new timestamp), but for the first second after that, cache will keep seeing /models/1-111, since that is what is returned by cache_key. Once 1 second passes, all of the servers will get a cache-miss on /cache-keys/models/1 and will try to regenerate it. If they all recreated it immediately, it would defeat the point of overriding cache_key. But because we set race_condition_ttl to 2, all of the servers except for the first will be delayed for 2 seconds, during which time they will continue to fetch the old cached template based on the old cache key. Once the 2 seconds have passed, fetch will begin returning the new cache key (which will have been updated by the first thread which tried to read/update /cache-keys/models/1) and they will get a cache hit, returning the template compiled by that first thread.
Ta-da! Stampede averted.
Note that if you did this, you would be doing twice as many cache reads, but depending on how common stampedes are, it could be worth it.
I haven't tested this. If you try it, please let me know how it goes :)
The :race_condition_ttl setting in ActiveSupport::Cache::Store#fetch should help avoid this problem. As the documentation says:
Setting :race_condition_ttl is very useful in situations where a cache entry is used very frequently and is under heavy load. If a cache expires and due to heavy load seven different processes will try to read data natively and then they all will try to write to cache. To avoid that case the first process to find an expired cache entry will bump the cache expiration time by the value set in :race_condition_ttl. Yes, this process is extending the time for a stale value by another few seconds. Because of extended life of the previous cache, other processes will continue to use slightly stale data for a just a bit longer. In the meantime that first process will go ahead and will write into cache the new value. After that all the processes will start getting new value. The key is to keep :race_condition_ttl small.
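The behaviour described in that passage can be modelled in a few lines. This is a simplified, single-process sketch and not ActiveSupport's implementation: StaleTolerantCache is hypothetical, time is passed in explicitly as `now:`, and a nested call inside the refresh block stands in for a second process arriving while the first is still recomputing.

```ruby
# Simplified model of the stale-extension idea; not ActiveSupport's code.
class StaleTolerantCache
  Entry = Struct.new(:value, :expires_at)

  def initialize
    @store = {}
  end

  def fetch(key, expires_in:, race_condition_ttl:, now:)
    entry = @store[key]
    return entry.value if entry && entry.expires_at > now

    if entry && now - entry.expires_at <= race_condition_ttl
      # The first caller to see the expired entry extends its life, so other
      # callers keep reading the stale value while the refresh runs.
      entry.expires_at = now + race_condition_ttl
    end

    value = yield
    @store[key] = Entry.new(value, now + expires_in)
    value
  end
end

cache = StaleTolerantCache.new
computations = 0
opts = { expires_in: 10, race_condition_ttl: 2 }

cache.fetch('k', now: 0, **opts) { computations += 1; 'v1' }

# At t=11 the entry has expired. The nested fetch simulates a second caller
# arriving while the first caller's refresh is still in progress.
seen_by_other = nil
refreshed = cache.fetch('k', now: 11, **opts) do
  seen_by_other = cache.fetch('k', now: 11.5, **opts) { computations += 1; 'other' }
  computations += 1
  'v2'
end
after = cache.fetch('k', now: 12, **opts) { computations += 1; 'unused' }
puts [seen_by_other, refreshed, after, computations].inspect
```

The second caller receives the slightly stale value instead of recomputing, so the value is computed only twice in total: once initially and once by the single refreshing caller.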
Great question. A partial answer that applies to single multi-threaded Rails servers but not multiprocess(or) environments (thanks to Nick Urban for drawing this distinction) is that the ActionView template compilation code blocks on a mutex that is per template. See line 230 in template.rb here. Notice there is a check for completed compilation both before grabbing the lock and after.
The effect is to serialize attempts to compile the same template, where only the first will actually do the compilation and the rest will get the already completed result.
Very interesting question. I searched on Google (you get more results if you search for "dog pile" instead of "stampede") but, like you, I did not get any answers, except this one blog post: protecting from dogpile using memcache.
Basically, it stores your fragment under two keys: key:timestamp (where timestamp would be updated_at for Active Record objects) and key:last.
def custom_write_dogpile(key, timestamp, fragment, options)
  Rails.cache.write(key + ':' + timestamp.to_s, fragment)
  Rails.cache.write(key + ':last', fragment)
  Rails.cache.delete(key + ':refresh-thread')
  fragment
end
Now, when reading from the cache and the requested entry does not exist, it tries to fetch the key:last fragment instead:
def custom_read_dogpile(key, timestamp, options)
  # Same key format as the one written by custom_write_dogpile
  result = Rails.cache.read(key + ':' + timestamp.to_s)
  if result.blank?
    Rails.cache.write(key + ':refresh-thread', 0, raw: true, unless_exist: true, expires_in: 5.seconds)
    if Rails.cache.increment(key + ':refresh-thread') == 1
      # The cache didn't exist; this caller regenerates it
      result = nil
    else
      # Fetch the last cache, as the new one has not been created yet
      result = Rails.cache.read(key + ':last')
    end
  end
  result
end
This is a simplified summary of the blog post by Moshe Bergman that I linked to above.
There is no protection against memcache stampedes. This is a real problem when multiple machines are involved and multiple processes on those multiple machines. -Ouch-.
The problem is compounded when one of the key processes has "died" leaving any "locking" ... locked.
In order to prevent stampedes you have to re-compute the data before it expires. So, if your data is valid for 10 minutes, you need to regenerate again at the 5th minute and re-set the data with a new expiration for 10 more minutes. Thus you don't wait until the data expires to set it again.
You should also not allow your data to expire at the 10 minute mark: re-compute it every 5 minutes, and it should never expire. :)
You can use wget & cron to periodically call the code.
I recommend using Redis, which will allow you to save the data and reload it in the event of a crash.
-daniel
A reasonable strategy would be to:
use a :race_condition_ttl with at least the expected time it takes to refresh the resource. Setting it to less time than expected to perform a refresh is not advisable as the angry mob will end up trying to refresh it, resulting in a stampede.
use an :expires_in time calculated as the maximum acceptable expiry time minus the :race_condition_ttl to allow for refreshing the resource by a single worker and avoiding a stampede.
Using the above strategy will ensure that you don't exceed your expiry/staleness deadline and also avoid a stampede. It works because only one worker gets through to refresh, whilst the angry mob are held off using the cache value with the race_condition_ttl extension time right up to the originally intended expiry time.
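The arithmetic behind this strategy is small enough to write down. The helper name cache_timings and the numbers below are illustrative assumptions, not values from the answer: data may be at most 600 seconds stale and a refresh is expected to take up to 20 seconds.

```ruby
# Hypothetical helper: derives fetch options from a staleness budget.
# max_staleness and expected_refresh are both in seconds.
def cache_timings(max_staleness, expected_refresh)
  unless expected_refresh < max_staleness
    raise ArgumentError, 'refresh must take less time than the staleness budget'
  end

  {
    # Expire early enough that the stale-extension window still ends
    # by the staleness deadline.
    expires_in: max_staleness - expected_refresh,
    # Give the single refreshing worker at least the time it needs.
    race_condition_ttl: expected_refresh
  }
end

timings = cache_timings(600, 20)
puts timings.inspect
```

The resulting hash could then be passed as options to a fetch call, so that expiry plus the extension window adds up to the original deadline.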

Updating a lot of records frequently

I have a Rails 3 app that has several hundred records in a MySQL database that need to be updated multiple times each hour. The actual updating is done through delayed_job, which is triggered in controller logic (it checks whether enough time has passed since the last update, and only then does something happen).
Each update is slow, it can take up to a second in some cases (although it averages at 3 - 5 updates/sec.).
Code looks like this:
class Thing < ActiveRecord::Base
  ...
  def self.scheduled_update
    Thing.all.each do |t|
      ...
      t.some_property = new_value
      t.save
    end
  end
end
I've observed that the execution stalls after 300 - 400 records and then the delayed job just seems to hang and times out eventually (entries in delayed_job.log). After a while the next one starts, also fails, and so forth, so not all records get updated.
What is the proper way to do this?
How does Rails handle database-connections when used like that? Could it be some timeout issue that is not detected/handled properly?
There must be a default way to do this, but couldn't find anything so far..
Any help is appreciated.
Another option is update_all.
Rails is a bad choice for mass data records. See if you can create a sql stored procedure or some other way that would avoid active record.
Use object.save(:validate => false) (object.save_with_validation(false) in Rails 2) if you are OK with skipping validations altogether.
When finding records, use :select => 'a,b,c,other_fields' to limit the fields you want ('a', 'b', 'c' and 'other' in this example).
Use :include for eager loading when you are initially selecting and joining across multiple tables.
So I solved my problem.
There was some issue with the rails-version I was using (3.0.3), the Timeout was caused by some bug I suspect. Updating to a later version of the 3.0.x branch solved it and everything runs perfectly now.

Rails - given an array of Users - how to get a output of just emails?

I have the following:
@users = User.all
User has several fields including email.
What I would like to be able to do is get a list of all the @users' emails.
I tried:
@users.email.all, but that fails with an undefined method error.
Ideas? Thanks
(by popular demand, posting as a real answer)
What I don't like about fl00r's solution is that it instantiates a new User object per record in the DB; which just doesn't scale. It's great for a table with just 10 emails in it, but once you start getting into the thousands you're going to run into problems, mostly with the memory consumption of Ruby.
One can get around this little problem by using connection.select_values on a model, and a little bit of ARel goodness:
User.connection.select_values(User.select("email").to_sql)
This will give you the straight strings of the email addresses from the database. No faffing about with user objects and will scale better than a straight User.select("email") query, but I wouldn't say it's the "best scale". There's probably better ways to do this that I am not aware of yet.
The point is: a String object will use way less memory than a User object and so you can have more of them. It's also a quicker query and doesn't go the long way about it (running the query, then mapping the values). Oh, and map would also take longer too.
If you're using Rails 2.3...
Then you'll have to construct the SQL manually, I'm sorry to say.
User.connection.select_values("SELECT email FROM users")
Just provides another example of the helpers that Rails 3 provides.
I still find the connection.select_values to be a valid way to go about this, but I recently found a default AR method that's built into Rails that will do this for you: pluck.
In your example, all that you would need to do is run:
User.pluck(:email)
The select_values approach can be faster on extremely large datasets, but that's because it doesn't typecast the returned values. E.g., boolean values will be returned how they are stored in the database (as 1's and 0's) and not as true | false.
The pluck method works with ARel, so you can daisy chain things:
User.order('created_at desc').limit(5).pluck(:email)
User.select(:email).map(&:email)
Just use:
User.select("email")
While I visit SO frequently, I only registered today. Unfortunately that means that I don't have enough of a reputation to leave comments on other people's answers.
Piggybacking on Ryan's answer above, you can extend ActiveRecord::Base to create a method that will allow you to use this throughout your code in a cleaner way.
Create a file in config/initializers (e.g., config/initializers/active_record.rb):
class ActiveRecord::Base
  def self.selected_to_array
    # Pass the SQL string, as in the select_values example above
    connection.select_values(scoped.to_sql)
  end
end
You can then chain this method at the end of your ARel declarations:
User.select('email').selected_to_array
User.select('email').where('id > ?', 5).limit(4).selected_to_array
Use this to get an array of all the e-mails:
@users.collect { |user| user.email }
# => ["test@example.com", "test2@example.com", ...]
Or a shorthand version:
@users.collect(&:email)
You should avoid using User.all.map(&:email) as it will create a lot of ActiveRecord objects which consume large amounts of memory, a good chunk of which will not be collected by Ruby's garbage collector. It's also CPU intensive.
If you simply want to collect only a few attributes from your database without sacrificing performance, high memory usage and cpu cycles, consider using Valium.
https://github.com/ernie/valium
Here's an example for getting all the emails from all the users in your database.
User.all[:email]
Or only for users that subscribed or whatever.
User.where(:subscribed => true)[:email].each do |email|
  puts "Do something with #{email}"
end
Using User.all.map(&:email) is considered bad practice for the reasons mentioned above.
