Updating a lot of records frequently - ruby-on-rails

I have a Rails 3 app with several hundred records in a MySQL DB that need to be updated multiple times each hour. The actual updating is done through delayed_job, which is triggered in controller logic (it checks whether enough time has passed since the last update; only then does anything happen).
Each update is slow; it can take up to a second in some cases (although it averages 3 - 5 updates/sec.).
Code looks like this:
class Thing < ActiveRecord::Base
  ...
  def self.scheduled_update
    Thing.all.each do |t|
      ...
      t.some_property = new_value
      t.save
    end
  end
end
I've observed that execution stalls after 300 - 400 records; the delayed job then seems to hang and eventually times out (there are entries in delayed_job.log). After a while the next one starts, also fails, and so forth, so not all records get updated.
What is the proper way to do this?
How does Rails handle database-connections when used like that? Could it be some timeout issue that is not detected/handled properly?
There must be a default way to do this, but I couldn't find anything so far.
Any help is appreciated.

Another option is update_all.
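For instance, if every record receives the same value, a single statement can replace the whole loop (a sketch reusing the question's placeholder names):

Thing.update_all(:some_property => new_value)

This issues one SQL UPDATE and never instantiates models, so validations and callbacks are skipped.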

Rails is a bad choice for mass-updating records. See if you can write a SQL stored procedure or find some other approach that avoids Active Record.
Use object.save(:validate => false) if you are OK with skipping validations altogether (in Rails 2 this was object.save_with_validation(false)).
When finding records, use :select => 'a,b,c,other_fields' to limit the columns to the ones you actually need ('a', 'b', 'c' and 'other_fields' in this example).
Use :include for eager loading when you are initially selecting and joining across multiple tables. A combined sketch follows below.
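Putting those tips together, a hedged sketch in Rails 3 finder syntax (the column, the association, and the compute_new_value helper are hypothetical):

things = Thing.all(:select => 'id, some_property', :include => :related_things)
things.each do |t|
  t.some_property = compute_new_value(t) # placeholder for your update logic
  t.save(:validate => false)             # skip validations, as noted above
end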

So I solved my problem.
There was an issue with the Rails version I was using (3.0.3); I suspect the timeout was caused by a bug. Updating to a later version of the 3.0.x branch solved it, and everything runs perfectly now.


Rails cache_key returning different values before and after model.reload

I'm on Rails 4.2.8.
Let's say I have a model Contact, being contact = Contact.new.
Calling contact.cache_key returns a timestamped key, like so:
"contacts/2615608-20180109154442000000000"
Where 2615608 is the ID and 20180109154442000000000 the timestamp.
I'm seeing a weird behavior in my controller.
After a contact.update(contact_params), calling contact.cache_key gives me a different timestamp than calling contact.reload.cache_key.
The weird thing is that I'm 100% sure no other update is happening to the model's updated_at (verified by checking the SQL statements in the console: there is only one call that changes updated_at).
It's so weird that if I do something like this in the controller:
@contact.update(contact_params)
Rails.logger.info "Updated_at before reload is #{@contact.updated_at} and its cache_key is #{@contact.cache_key}"
Rails.logger.info "Updated_at after reload is #{@contact.reload.updated_at} and its cache_key is #{@contact.reload.cache_key}"
The output reveals IDENTICAL updated_at values, but different cache_key values:
Updated_at before reload is 2018-01-09 14:01:58 -0200 and its cache_key is contacts/2615608-20180109160158143423000
Updated_at after reload is 2018-01-09 14:01:58 -0200 and its cache_key is contacts/2615608-20180109160158000000000
As you can see, same updated_at, but different timestamps.
This is driving me nuts. I hate having to manually .reload models without necessity. Why is this happening? I tried looking at cache_key's source code but couldn't find an answer there, since it apparently depends only on the updated_at value (which is the same).
Found out the issue. It was a Rails bug, fixed in Rails 5.0.
The source code for .cache_key for Rails 4.2 can be found here. As you can see, it used cache_timestamp_format = :nsec, which was too precise.
The reason, as far as I can understand it, is that BEFORE the reload the model's updated_at is still in memory, so it has enough resolution (in memory) to produce a very precise nanosecond key. But after model.reload, updated_at comes from the database, which doesn't have that high a resolution, which is why the cache_key timestamp comes out as a different number with a lot of trailing zeroes (in my example, 20180109160158000000000).
This issue details the problem with the overly precise (nanosecond) timestamp, and this pull request was merged to fix it, changing the precision from :nsec to :usec.
In the issue discussion (same link above), @tarzan suggests fixing it by creating an ActiveSupport::Concern with the following code:
module UsecCacheKey # hypothetical module name
  extend ActiveSupport::Concern
  included do
    self.cache_timestamp_format = :usec
  end
end
and defining :usec as a time format in an initializer:
Time::DATE_FORMATS[:usec] = "%Y%m%d%H%M%S%6N"
Then manually include the concern in the models whose .cache_key you want to use. (By the way, I'm using this method as recommended by the official Rails guides for low-level caching, via Rails.cache.fetch(self.cache_key); see http://guides.rubyonrails.org/caching_with_rails.html, section 1.6 Low-Level Caching.)
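For reference, a minimal sketch of that low-level caching pattern as I use it (the serialized fields are hypothetical; as_cached_json is the method mentioned further below):

class Contact < ActiveRecord::Base
  def as_cached_json
    Rails.cache.fetch(cache_key) do
      as_json(only: [:id, :name]) # the expensive serialization runs only on a cache miss
    end
  end
end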
The problem is that, in my tests at least (MySQL on macOS High Sierra), with contact being a scaffolded model, the precision of a :datetime column (like updated_at) only goes down to the second. That's why his fix doesn't solve it for us: changing from nanosecond to microsecond precision still yields two different cache_keys, just with fewer zeroes appended. As an example:
[4] pry(#<ContactsController>)> ::Time::DATE_FORMATS[:usec] = "%Y%m%d%H%M%S%6N"
=> "%Y%m%d%H%M%S%6N"
[5] pry(#<ContactsController>)> @contact.updated_at.utc.to_s(:usec)
=> "20180109234014062142"
[6] pry(#<ContactsController>)> @contact.reload.updated_at.utc.to_s(:usec)
Contact Load (2.2ms) SELECT `contacts`.* FROM `contacts` WHERE `contacts`.`id` = 2615608 LIMIT 1 /*application:Temporadalivre,controller:contacts,action:toggle_status*/
=> "20180109234014000000"
So for now, at least, I'm sticking with the .reload, unfortunately.
Last but not least, I only noticed this bug because we had a page of 20 of these contacts that could be manipulated via ajax. Each ajax call would call .update on the model and return the model's .as_cached_json to the view. Since .as_cached_json was used, we expected that on reloading the page everything would already be cached, but it wasn't; only by checking the cache keys did we discover this bug, and fixing it won back a really nice performance boost we had been losing.

Rails / Postgres Lookup Performance

I have a status dashboard that shows the status of remote hardware devices that 'ping' the application every minute and log their status.
class Sensor < ActiveRecord::Base
  has_many :logs

  def most_recent_log
    logs.order("id DESC").first
  end
end

class Log < ActiveRecord::Base
  belongs_to :sensor
end
Given I'm only interested in showing the current status, the dashboard only shows the most recent log for all sensors. This application has been running for a long time now and there are tens of millions of Log records.
The problem I have is that the dashboard takes around 8 seconds to load. From what I can tell, this is largely because there is an N+1 Query fetching these logs.
Completed 200 OK in 4729.5ms (Views: 4246.3ms | ActiveRecord: 480.5ms)
I do have the following index in place:
add_index "logs", ["sensor_id", "id"], :name => "index_logs_on_sensor_id_and_id", :order => {"id"=>:desc}
My controller / lookup code is the following:
class SensorsController < ApplicationController
  def index
    @sensors = Sensor.all
  end
end
How do I make the load time reasonable?
Is there a way to avoid the N+1 and preload these logs?
I had thought of putting a latest_log_id reference onto Sensor and updating it every time a new log for that sensor is posted - but something in my head tells me other developers would say this is a bad thing. Is this the case?
How are problems like this usually solved?
There are 2 relatively easy ways to do this:
Use ActiveRecord eager loading to pull in just the most recent logs
Roll your own mini eager loading system (as a Hash) for just this purpose
Basic ActiveRecord approach:
subquery = Log.group(:sensor_id).select("MAX(id)")
@sensors = Sensor.eager_load(:logs).where(logs: { id: subquery }).all
Note that you should NOT use your most_recent_log method for each sensor (that will trigger an N+1), but rather logs.first. Only the latest log for each sensor will actually be prefetched in the logs collection.
Rolling your own may be more efficient from a SQL perspective, but more complex to read and use:
@sensors = Sensor.all
logs = Log.where(id: Log.group(:sensor_id).select("MAX(id)"))
@sensor_logs = logs.each_with_object({}) do |log, hash|
  hash[log.sensor_id] = log
end
@sensor_logs is a Hash, permitting a fast lookup of the latest log by sensor.id.
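For example, a hypothetical dashboard loop can then resolve each sensor's latest log with a constant-time Hash lookup instead of a query:

@sensors.each do |sensor|
  latest = @sensor_logs[sensor.id]
  # latest may be nil for a sensor with no logs yet; render its status accordingly
end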
Regarding your comment about storing the latest log id - you are essentially asking if you should build a cache. The answer would be 'it depends'. There are many advantages and many disadvantages to caching, so it comes down to 'is the benefit worth the cost'. From what you are describing, it doesn't appear that you are familiar with the difficulties they introduce (Google 'cache invalidation') or if they are applicable in your case. I'd recommend against it until you can demonstrate that a) it is adding real value over a non-cache solution, and b) it can be safely applied for your scenario.
There are 3 options:
eager loading
joining
caching the current status
Eager loading is explained by PinnyM in his answer.
As for joining: you can do a join from Sensor to just the latest Log record for each sensor, so everything gets fetched in one query. Not sure offhand how that will perform with the number of rows you have; it will likely still be slower than you want.
The thing you mentioned - caching the latest_log_id (or even caching just the latest_status, if that's all the dashboard needs) - is actually OK. It's called denormalization, and it's a useful technique when applied carefully. You've likely come across "counter cache" plugins for Rails, which are in the same vein: duplicating data in the interest of read performance.
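A hedged sketch of that denormalization (the latest_log_id column and the after_create hook are assumptions, not code from the question):

class Log < ActiveRecord::Base
  belongs_to :sensor
  after_create { sensor.update_column(:latest_log_id, id) } # keep the cached pointer fresh
end

class Sensor < ActiveRecord::Base
  has_many :logs
  belongs_to :latest_log, class_name: 'Log' # reads the denormalized column
end

The dashboard can then use Sensor.includes(:latest_log) and render without any per-sensor query.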

Request timeout in Rails

We are working on a data visualization problem right now. Our customer wants us to show the last 6 months of data for a honeybee hive on a graph.
Clearly it's going to be a huge dataset. By adding indexes we overcame the slowness of loading the data from the database, but we still have a problem visualizing the data on a graph.
Here is the related code:
def self.prepare_single_hive_messages_for_datatable_dygraph(messages, us_metric_enabled)
  data = []
  messages.each do |message|
    record = []
    record << message.occurance_time.to_s(:dygraph_format)
    record << weight_according_to_metric(message.weight, us_metric_enabled)
    record << temperature_according_to_metric(message.temperature, us_metric_enabled)
    record << (message.humidity.nil? ? nil : message.humidity.to_f)
    data << record
  end
  return data
end
The problem is that messages.each is very slow and takes more than 30 seconds. Is there any solution to overcome this?
Project Specification:
Rails Version: 4.1.9
Graph Library: Dygraph
Database: Postgres
There are two ways to attack a performance problem like this:
Find and correct the performance bottleneck
Break it into smaller pieces
Finding Performance issues
First, get a dataset large enough to reproduce the problem setup on your dev system. Then look at the logs so you can see how long the transaction is taking. You should be looking for a line like this:
Completed 200 OK in 432.1ms (Views: 367.7ms | ActiveRecord: 61.4ms)
Rerun the task a couple times since caching can cause variations. Write down your different times. Then remove everything in the loop and run it with just the loop. Do the numbers go back to looking reasonable? If that is the case then you know the problem is the work you are doing inside the loop. Next, add each line in the loop back on its own (or one at a time if they depend on each other). Figure out which line causes those numbers to jump the most.
This is the point where you should try to performance tune your code. Check for queries that could be smarter. Make sure you aren't querying the same data over and over. If you have a function in a model that computes something and you call it multiple times to get the same answer then use this to only compute once:
def something
  return @savedvalue if @savedvalue
  @savedvalue = really_complex_calculation # placeholder for the expensive work
end
The goal is to find the worst offender so you can make the changes that have the biggest impact. However, if you are working with a LOT of data, this may only get you so far. It may be impossible to tune performance enough for all the data. In that case there is option 2.
Break it into smaller pieces
Write a second Rails action whose only job is to render a single record on a graph. It does the inner part of your loop, but only for the message whose id was passed to it.
Call your original function to set up the view, and pass the list of messages to the view. In the view, loop through the list of messages to set up jQuery ajax code that calls the above action once for each message. Have this run on document ready.
Then the page will load with an empty graph... but as soon as it is up, the individually processed records will be fed to it and appear one at a time on the page. It will still take just as long (or even a little longer, because of overhead) to complete the graph... but it will no longer time out. Each ajax call is its own quick hit to the server instead of one big long one.
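A minimal sketch of such an action (the controller name and metric parameter are hypothetical, and I'm assuming the class method from your question lives on a Message model):

class GraphPointsController < ApplicationController
  def show
    message = Message.find(params[:id])
    data = Message.prepare_single_hive_messages_for_datatable_dygraph(
      [message], params[:us_metric_enabled] == "true"
    )
    render json: data.first # one [time, weight, temperature, humidity] row
  end
end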
I just used this very technique to load a rather long report on a site I work on. Ideally we'd like to fix any underlying performance issues... but what we really wanted was to have a report working right away and then fix the performance issues as we had time.
OK, you said every person sees the same set of data, which is great; it means we can cache without worrying about who's logged in. First, here's your method with tiny improvements:
def self.prepare_single_hive_messages_for_datatable_dygraph(messages, us_metric_enabled)
  messages.inject([]) do |records, message|
    records << [].tap do |record|
      record << message.occurance_time.to_s(:dygraph_format)
      record << weight_according_to_metric(message.weight, us_metric_enabled)
      record << temperature_according_to_metric(message.temperature, us_metric_enabled)
      record << (message.humidity.nil? ? nil : message.humidity.to_f)
    end
  end
end
Then create a caching method that runs the method above and caches its result:
# some class constants
CACHE_KEY = 'some_cache_key'
EXPIRY_TIME = 15.minutes

# the methods
def self.write_single_hive_messages_to_cache(messages, us_metric_enabled)
  # self is already the class here, so call the sibling class method directly
  Rails.cache.write CACHE_KEY,
                    prepare_single_hive_messages_for_datatable_dygraph(messages, us_metric_enabled),
                    expires_in: EXPIRY_TIME
end
And a simple cache-reading method:
def self.read_single_hive_messages_from_cache
  Rails.cache.read CACHE_KEY
end
Then create a rake task that just fetches these messages and calls the caching method, so Rails writes the cache.
Set up a cron job that runs this rake task every 5 minutes or so. The expiry time is longer on purpose: if the cron job fails to run for some reason, the data will still be available for the next run.
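A hedged sketch of such a rake task (the task name, Message model, and date scope are placeholders; adapt them to wherever the class methods above live):

namespace :hive do
  desc 'Refresh the cached dygraph data'
  task refresh_message_cache: :environment do
    messages = Message.where('occurance_time > ?', 6.months.ago)
    Message.write_single_hive_messages_to_cache(messages, false)
  end
end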
This way your processing runs in the background every 5 minutes (or whatever interval you choose), and the page loads normally with no delay at all, since the array data comes from the pre-calculated cache.
If the cron stops working, the data will expire after the 15 minutes I've set, and the read method will then return nil. You could avoid this by making the data never expire, but then the data would become stale and the old data would keep being returned.
Another way to handle this is to teach the cache-reading method how to generate the cache itself: if it finds the cache empty, it generates the data and caches it before returning. The method would look like this:
def self.read_single_hive_messages_from_cache(messages, us_metric_enabled)
  Rails.cache.fetch CACHE_KEY, expires_in: EXPIRY_TIME do
    # the block's return value is what gets cached
    prepare_single_hive_messages_for_datatable_dygraph(messages, us_metric_enabled)
  end
end
But then make sure that messages is an ActiveRecord::Relation and not an already-processed array, because you don't want to query for 1+ million records only to find the cache already warm. If it's an ActiveRecord::Relation, it won't touch the database until the array is materialized (inside the caching block); if the cache exists, it is returned before you enter the block, so the data is never fetched, saving you that huge query.
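To illustrate the difference (model and column names assumed, as above):

messages = Message.where('occurance_time > ?', 6.months.ago)      # lazy Relation: no query yet
messages = Message.where('occurance_time > ?', 6.months.ago).to_a # eager Array: the query runs immediately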
I know the answer got long; if you need more help, tell me.

Mongoid identity_map and memory usage, memory leaks

When I execute the query
Mymodel.all.each do |model|
  # ..do something
end
it uses a lot of memory, the amount used keeps growing the whole time, and in the end it crashes. I found out that to fix it I need to disable the identity map, but when I add identity_map_enabled: false to my mongoid.yml file I get this error:
Invalid configuration option: identity_map_enabled.
Summary:
An invalid configuration option was provided in your mongoid.yml, or a typo is potentially present. The valid configuration options are: :include_root_in_json, :include_type_for_serialization, :preload_models, :raise_not_found_error, :scope_overwrite_exception, :duplicate_fields_exception, :use_activesupport_time_zone, :use_utc.
Resolution:
Remove the invalid option or fix the typo. If you were expecting the option to be there, please consult the following page with respect to Mongoid's configuration:
I am using Rails 4 and Mongoid 4; Mymodel.all.count => 3202400.
How can I fix it? Or maybe someone knows another way to reduce the amount of memory used while executing an .all.each query?
Thank you very much for the help!
I started with something just like yours, looping through millions of records, and the memory just kept increasing.
Original code:
@portal.listings.each do |listing|
  listing.do_something
end
I've gone through many forum answers and I tried them out.
1st attempt: I tried the combination of WeakRef and GC.start, but no luck; it failed.
2nd attempt: I added listing = nil to the first attempt, and it still failed.
Success Attempt:
@start_date = 10.years.ago
@end_date = 1.day.ago
while @start_date < @end_date
  @portal.listings.where(created_at: @start_date..@start_date.next_month).each do |listing|
    listing.do_something
  end
  @start_date = @start_date.next_month
end
Conclusion
All the memory allocated for the records is never released while the query request runs. Therefore, fetching a small number of records per query does the job, and memory stays in good shape since it is released after each batch.
Your problem isn't the identity map; I don't think Mongoid 4 even has an identity map built in, hence the configuration error when you try to turn it off. Your problem is that you're using all. When you do this:
Mymodel.all.each
Mongoid will attempt to instantiate every single document in the db.mymodels collection as a Mymodel instance before it starts iterating. You say that you have about 3.2 million documents in the collection, that means that Mongoid will try to create 3.2 million model instances before it tries to iterate. Presumably you don't have enough memory to handle that many objects.
Your Mymodel.all.count works fine because that just sends a simple count call into the database and returns a number, it won't instantiate any models at all.
The solution is to not use all (and preferably forget that it exists). Depending on what "do something" does, you could:
Page through all the models so that you're only working with a reasonable number of them at a time.
Push the logic into the database using mapReduce or the aggregation framework.
Whenever you're working with real data (i.e. something other than a trivially small database), you should push as much work as possible into the database because databases are built to manage and manipulate big piles of data.
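As an illustration of the paging option, a minimal sketch (the batch size is arbitrary and do_something stands in for your per-document work):

batch_size = 1000
offset = 0
loop do
  batch = Mymodel.asc(:id).skip(offset).limit(batch_size).to_a
  break if batch.empty?
  batch.each { |model| model.do_something }
  offset += batch_size
end

Note that skip gets slow at very large offsets; paging over an indexed field range (as in the created_at example above) scales better.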

Why does a sleep between ActiveRecord#first calls change its behaviour?

I've inherited some code written in ruby 1.8.7 with Rails ~> 2.3.15 and it contains a test which looks something like this:
class Wibble < ActiveRecord::Base
  # Wibbles have integer primary keys and string names
end

def test
  create_two_wibbles
  w1 = Wibble.first
  sleep 2 # This sleep is necessary to
  w2 = Wibble.first
  w1.name = "Name for w1"
  w2.name = "Name for w2"
  w1.save
  w3 = Wibble.first
  assert(!w3.update_attributes(w2.attributes))
end
That comment next to the sleep line hasn't been cut off; it literally says "This sleep is necessary to". Without that sleep, this test fails - removing it changes the behaviour somehow, beyond making the run 2 seconds faster. I've been digging through the file's history in our version control system, and the commit messages were uninformative. I also cannot contact the original author of this test to figure out what they were trying to do.
From my understanding, we're pulling the same record out of the database twice, editing it in two different ways, saving the first, and asserting that we can't then save the second. I suspect this is a test to make sure our database locks the table correctly, but surely if this were to fail our Wibbles would be fine, and ActiveRecord would be at fault. Nobody in the office can figure out why this test may have been necessary, nor what difference the sleep statement might make. Any ideas?
This is likely caused by Active Record caching and memoization stopping the second find from actually going to the DB and getting fresh data.
In fact, try printing the object_id of each wibble; they might be the exact same memoized object. If that is the case, then the test kinda makes sense.
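For example, the quick check suggested above could look like this in a console (reusing the test's setup):

w1 = Wibble.first
w2 = Wibble.first
puts w1.object_id == w2.object_id # true would mean one shared, memoized object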
You could also do the same test in a controller action and look at the verbose SQL logs from Rails; I'd expect the second find call to tell you it just used cached data and did not actually run any SQL query.
