How much memory is consumed if I create a ruby object? - ruby-on-rails

I want to know how much memory is consumed if I create a ruby object. Does Ruby have any method to tell?
Is there any difference for memory consumption in the following?
users = User.where("created_at > ?", 2.months.ago) # select all fields
users = User.select(:user_name).where("created_at > ?", 2.months.ago) # just select one field

You could use ruby-prof, a wonderful ruby profiler that will tell you everything your code is doing, including memory allocation. The usage is really simple:
require 'ruby-prof'
# Profile the code
result = RubyProf.profile do
...
[code to profile]
...
end
# Print a flat profile to text
printer = RubyProf::FlatPrinter.new(result)
printer.print(STDOUT)
It can output results as text, text graph, html graph, call stack and more. In the readme there is also a section about profiling rails applications. The installation is immediate, so give it a try:
gem install ruby-prof

First, there's no easy way to measure how much memory an object consumes. If it's a Rails app, you can use this Unix script to check. There's also a great blog post that may help you with this issue.
As for your second question: the second query will probably consume less memory, since ActiveRecord isn't processing all the fields to build each AR object. Ultimately, it's better to use .pluck for your second query, because it returns plain values instead of building model objects.
users = User.where("created_at > ?", 2.months.ago).pluck(:user_name)
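For a rough, standard-library-only look at object size, Ruby's objspace extension may help. The sketch below is an illustration rather than a Rails-aware measurement: `ObjectSpace.memsize_of` reports only the shallow size of a single object (it does not follow references, and the numbers vary by Ruby version and platform), and `allocated_objects` is a hypothetical helper name introduced here.

```ruby
require 'objspace'

# Hypothetical helper: counts how many objects Ruby allocates while the
# block runs, using the interpreter's running allocation counter.
def allocated_objects
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

name = "a" * 1_000
# Shallow size in bytes of this one String (implementation-dependent).
puts "String of 1000 chars: #{ObjectSpace.memsize_of(name)} bytes"

count = allocated_objects { Array.new(1_000) { |i| "user_#{i}" } }
puts "Objects allocated while building 1000 strings: #{count}"
```

Wrapping each of the two queries from the question in `allocated_objects { ... }` would give a comparable allocation count for each, assuming the queries are forced to load (for example with `.to_a`).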

Related

Rails / Postgres Lookup Performance

I have a status dashboard that shows the status of remote hardware devices that 'ping' the application every minute and log their status.
class Sensor < ActiveRecord::Base
  has_many :logs
  def most_recent_log
    logs.order("id DESC").first
  end
end
class Log < ActiveRecord::Base
  belongs_to :sensor
end
Given I'm only interested in showing the current status, the dashboard only shows the most recent log for all sensors. This application has been running for a long time now and there are tens of millions of Log records.
The problem I have is that the dashboard takes around 8 seconds to load. From what I can tell, this is largely because there is an N+1 Query fetching these logs.
Completed 200 OK in 4729.5ms (Views: 4246.3ms | ActiveRecord: 480.5ms)
I do have the following index in place:
add_index "logs", ["sensor_id", "id"], :name => "index_logs_on_sensor_id_and_id", :order => {"id"=>:desc}
My controller / lookup code is the following:
class SensorsController < ApplicationController
  def index
    @sensors = Sensor.all
  end
end
How do I make the load time reasonable?
Is there a way to avoid the N+1 and reload this?
I had thought of putting a latest_log_id reference on to Sensor and then updating this every time a new log for that sensor is posted - but something in my head is telling me that other developers would say this is a bad thing. Is this the case?
How are problems like this usually solved?
There are 2 relatively easy ways to do this:
Use ActiveRecord eager loading to pull in just the most recent logs
Roll your own mini eager loading system (as a Hash) for just this purpose
Basic ActiveRecord approach:
subquery = Log.group(:sensor_id).select("MAX(id)")
@sensors = Sensor.eager_load(:logs).where(logs: {id: subquery}).all
Note that you should NOT use your most_recent_log method for each sensor (that will trigger an N+1), but rather logs.first. Only the latest log for each sensor will actually be prefetched in the logs collection.
Rolling your own may be more efficient from a SQL perspective, but more complex to read and use:
@sensors = Sensor.all
logs = Log.where(id: Log.group(:sensor_id).select("MAX(id)"))
@sensor_logs = logs.each_with_object({}) { |log, hash|
  hash[log.sensor_id] = log
}
@sensor_logs is a Hash, permitting a fast lookup for the latest log by sensor.id.
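To make the lookup idea concrete without a database, here is a minimal plain-Ruby sketch. The `Log` and `Sensor` structs and their sample data are stand-ins invented for illustration, not the ActiveRecord models; unlike the query above, this version picks the highest id per sensor in Ruby rather than in SQL.

```ruby
# Plain-Ruby stand-ins for the ActiveRecord models (assumed for illustration).
Log = Struct.new(:id, :sensor_id, :status)
Sensor = Struct.new(:id, :name)

sensors = [Sensor.new(1, "north"), Sensor.new(2, "south"), Sensor.new(3, "spare")]
logs = [
  Log.new(10, 1, "ok"),
  Log.new(11, 2, "ok"),
  Log.new(12, 1, "fault"),
  Log.new(13, 2, "ok")
]

# Keep only the highest-id log per sensor, keyed by sensor id.
latest_by_sensor = logs.each_with_object({}) do |log, hash|
  current = hash[log.sensor_id]
  hash[log.sensor_id] = log if current.nil? || log.id > current.id
end

sensors.each do |sensor|
  log = latest_by_sensor[sensor.id] # hash lookup, no query per sensor
  puts "#{sensor.name}: #{log ? log.status : 'no logs yet'}"
end
```

In a view, the equivalent would be reading from the hash for each sensor instead of calling most_recent_log, which is what avoids the N+1.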
Regarding your comment about storing the latest log id - you are essentially asking if you should build a cache. The answer would be 'it depends'. There are many advantages and many disadvantages to caching, so it comes down to 'is the benefit worth the cost'. From what you are describing, it doesn't appear that you are familiar with the difficulties they introduce (Google 'cache invalidation') or if they are applicable in your case. I'd recommend against it until you can demonstrate that a) it is adding real value over a non-cache solution, and b) it can be safely applied for your scenario.
There are 3 options:
eager loading
joining
caching the current status
Eager loading is explained by PinnyM.
You can do a join from the Sensor just to the latest Log record for each row, so everything gets fetched in the one query. Not sure off hand how that'll perform with the number of rows you have, likely it'll still be slower than you want.
The thing you mentioned - caching the latest_log_id (or even caching just the latest_status if that's all you need for the dashboard) is actually OK. It's called denormalization and it's a useful thing if used carefully. You've likely come across "counter cache" plugins for rails which are in the same vein - duplicating data, in the interests of being able to optimise read performance.

Mongoid identity_map and memory usage, memory leaks

When I execute this query:
Mymodel.all.each do |model|
# ..do something
end
It uses a lot of memory, the amount of used memory keeps increasing, and in the end it crashes. I found out that to fix it I need to disable identity_map, but when I add identity_map_enabled: false to my mongoid.yml file I get this error:
Invalid configuration option: identity_map_enabled.
Summary:
A invalid configuration option was provided in your mongoid.yml, or a typo is potentially present. The valid configuration options are: :include_root_in_json, :include_type_for_serialization, :preload_models, :raise_not_found_error, :scope_overwrite_exception, :duplicate_fields_exception, :use_activesupport_time_zone, :use_utc.
Resolution:
Remove the invalid option or fix the typo. If you were expecting the option to be there, please consult the following page with repect to Mongoid's configuration:
I am using Rails 4 and Mongoid 4, Mymodel.all.count => 3202400
How can I fix it, or does anyone know another way to reduce the amount of memory used while executing .all.each?
Thank you very much for the help!!!!
I started with something just like yours, looping through millions of records while the memory just kept increasing.
Original code:
@portal.listings.each do |listing|
  listing.do_something
end
I've gone through many forum answers and I tried them out.
1st attempt: I tried combining WeakRef and GC.start, but no luck.
2nd attempt: I added listing = nil to the first attempt, and it still failed.
Successful attempt:
@start_date = 10.years.ago
@end_date = 1.day.ago
while @start_date < @end_date
  @portal.listings.where(created_at: @start_date..@start_date.next_month).each do |listing|
    listing.do_something
  end
  @start_date = @start_date.next_month
end
Conclusion
The memory allocated for the records is not released while a query is still being iterated. Working through a small number of records per query therefore does the job: memory stays in good condition because it can be released after each query.
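The windowing loop above can be sketched in plain Ruby as follows. This is an illustration under assumptions, not the answerer's code: `month_windows` is a hypothetical helper, it uses `Date` from the standard library instead of ActiveSupport, and it yields half-open ranges so that a record created exactly on a boundary would fall into one window only. Whether your ORM translates an exclusive range the way you expect is worth checking.

```ruby
require 'date'

# Hypothetical helper: yields consecutive half-open date ranges of at most
# one month each, covering start_date up to (not including) end_date.
def month_windows(start_date, end_date)
  return enum_for(:month_windows, start_date, end_date) unless block_given?

  cursor = start_date
  while cursor < end_date
    upper = [cursor.next_month, end_date].min
    yield(cursor...upper)
    cursor = upper
  end
end

month_windows(Date.new(2020, 1, 15), Date.new(2020, 4, 1)).each do |window|
  puts "#{window.begin} up to (not including) #{window.end}"
end
```

Each yielded range could then be passed to a `where(created_at: window)` style query so that only one month of records is loaded at a time.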
Your problem isn't the identity map; I don't think Mongoid 4 even has an identity map built in, hence the configuration error when you try to turn it off. Your problem is that you're using all. When you do this:
Mymodel.all.each
Mongoid will attempt to instantiate every single document in the db.mymodels collection as a Mymodel instance before it starts iterating. You say that you have about 3.2 million documents in the collection, that means that Mongoid will try to create 3.2 million model instances before it tries to iterate. Presumably you don't have enough memory to handle that many objects.
Your Mymodel.all.count works fine because that just sends a simple count call into the database and returns a number, it won't instantiate any models at all.
The solution is to not use all (and preferably forget that it exists). Depending on what "do something" does, you could:
Page through all the models so that you're only working with a reasonable number of them at a time.
Push the logic into the database using mapReduce or the aggregation framework.
Whenever you're working with real data (i.e. something other than a trivially small database), you should push as much work as possible into the database because databases are built to manage and manipulate big piles of data.
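As a rough illustration of the paging idea, the plain-Ruby sketch below processes a lazily generated sequence in fixed-size batches, so only one batch is held at a time. The lazy range standing in for a database cursor, the batch size, and the record shape are all assumptions made for this example; the real paging call depends on your ORM.

```ruby
# A lazy source standing in for a database cursor (an assumption for this sketch).
records = (1..10_000).lazy.map { |id| { id: id, email: "user#{id}@example.com" } }

BATCH_SIZE = 500
processed = 0
largest_batch = 0

# each_slice pulls BATCH_SIZE records at a time from the lazy source, so the
# full set of 10,000 hashes is never materialized at once.
records.each_slice(BATCH_SIZE) do |batch|
  largest_batch = [largest_batch, batch.size].max
  processed += batch.size
end

puts "processed #{processed} records, at most #{largest_batch} in memory per batch"
```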

Rails - given an array of Users - how to get a output of just emails?

I have the following:
@users = User.all
User has several fields including email.
What I would like to be able to do is get a list of all the @users emails.
I tried:
@users.email.all but that errors with undefined
Ideas? Thanks
(by popular demand, posting as a real answer)
What I don't like about fl00r's solution is that it instantiates a new User object per record in the DB; which just doesn't scale. It's great for a table with just 10 emails in it, but once you start getting into the thousands you're going to run into problems, mostly with the memory consumption of Ruby.
One can get around this little problem by using connection.select_values on a model, and a little bit of ARel goodness:
User.connection.select_values(User.select("email").to_sql)
This will give you the straight strings of the email addresses from the database. No faffing about with user objects and will scale better than a straight User.select("email") query, but I wouldn't say it's the "best scale". There's probably better ways to do this that I am not aware of yet.
The point is: a String object will use way less memory than a User object and so you can have more of them. It's also a quicker query and doesn't go the long way about it (running the query, then mapping the values). Oh, and map would also take longer too.
If you're using Rails 2.3...
Then you'll have to construct the SQL manually, I'm sorry to say.
User.connection.select_values("SELECT email FROM users")
Just provides another example of the helpers that Rails 3 provides.
I still find the connection.select_values to be a valid way to go about this, but I recently found a default AR method that's built into Rails that will do this for you: pluck.
In your example, all that you would need to do is run:
User.pluck(:email)
The select_values approach can be faster on extremely large datasets, but that's because it doesn't typecast the returned values. E.g., boolean values will be returned how they are stored in the database (as 1's and 0's) and not as true | false.
The pluck method works with ARel, so you can daisy chain things:
User.order('created_at desc').limit(5).pluck(:email)
User.select(:email).map(&:email)
Just use:
User.select("email")
While I visit SO frequently, I only registered today. Unfortunately that means that I don't have enough of a reputation to leave comments on other people's answers.
Piggybacking on Ryan's answer above, you can extend ActiveRecord::Base to create a method that will allow you to use this throughout your code in a cleaner way.
Create a file in config/initializers (e.g., config/initializers/active_record.rb):
class ActiveRecord::Base
  def self.selected_to_array
    connection.select_values(self.scoped)
  end
end
You can then chain this method at the end of your ARel declarations:
User.select('email').selected_to_array
User.select('email').where('id > ?', 5).limit(4).selected_to_array
Use this to get an array of all the e-mails:
@users.collect { |user| user.email }
# => ["test@example.com", "test2@example.com", ...]
Or a shorthand version:
@users.collect(&:email)
You should avoid using User.all.map(&:email) as it will create a lot of ActiveRecord objects which consume large amounts of memory, a good chunk of which will not be collected by Ruby's garbage collector. It's also CPU intensive.
If you simply want to collect only a few attributes from your database without sacrificing performance, high memory usage and cpu cycles, consider using Valium.
https://github.com/ernie/valium
Here's an example for getting all the emails from all the users in your database.
User.all[:email]
Or only for users that subscribed or whatever.
User.where(:subscribed => true)[:email].each do |email|
  puts "Do something with #{email}"
end
Using User.all.map(&:email) is considered bad practice for the reasons mentioned above.

Working with a large data object between ruby processes

I have a Ruby hash that reaches approximately 10 megabytes if written to a file using Marshal.dump. After gzip compression it is approximately 500 kilobytes.
Iterating through and altering this hash is very fast in ruby (fractions of a millisecond). Even copying it is extremely fast.
The problem is that I need to share the data in this hash between Ruby on Rails processes. In order to do this using the Rails cache (file_store or memcached) I need to Marshal.dump the hash first; however, this incurs a 1000 millisecond delay when serializing it and a 400 millisecond delay when deserializing it.
Ideally I would want to be able to save and load this hash from each process in under 100 milliseconds.
One idea is to spawn a new Ruby process to hold this hash that provides an API to the other processes to modify or process the data within it, but I want to avoid doing this unless I'm certain that there are no other ways to share this object quickly.
Is there a way I can more directly share this hash between processes without needing to serialize or deserialize it?
Here is the code I'm using to generate a hash similar to the one I'm working with:
@a = []
0.upto(500) do |r|
  @a[r] = []
  0.upto(10_000) do |c|
    if rand(10) == 0
      @a[r][c] = 1 # 10% chance of being 1
    else
      @a[r][c] = 0
    end
  end
end
@c = Marshal.dump(@a) # 1000 milliseconds
Marshal.load(@c) # 400 milliseconds
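To reproduce the timings on your own machine, a small harness along these lines may help. It is a sketch using only the standard library: `build_grid` and `time_round_trip` are hypothetical helper names, and the numbers it prints will differ from the 1000 ms and 400 ms quoted above depending on Ruby version and hardware.

```ruby
require 'benchmark'

# Builds the same shape of data as above: rows x cols of 0/1 with ~10% ones.
def build_grid(rows, cols)
  Array.new(rows) { Array.new(cols) { rand(10).zero? ? 1 : 0 } }
end

# Times one Marshal round trip and checks that the data survives it.
def time_round_trip(grid)
  dumped = nil
  loaded = nil
  dump_seconds = Benchmark.realtime { dumped = Marshal.dump(grid) }
  load_seconds = Benchmark.realtime { loaded = Marshal.load(dumped) }
  { dump_seconds: dump_seconds, load_seconds: load_seconds,
    bytes: dumped.bytesize, intact: loaded == grid }
end

stats = time_round_trip(build_grid(501, 10_001))
puts format("dump: %.1f ms, load: %.1f ms, size: %d bytes",
            stats[:dump_seconds] * 1000, stats[:load_seconds] * 1000, stats[:bytes])
```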
Update:
Since my original question did not receive many responses, I'm assuming there's no solution as easy as I would have hoped.
Presently I'm considering two options:
Create a Sinatra application to store this hash with an API to modify/access it.
Create a C application to do the same as #1, but a lot faster.
The scope of my problem has increased such that the hash may be larger than my original example. So #2 may be necessary. But I have no idea where to start in terms of writing a C application that exposes an appropriate API.
A good walkthrough through how best to implement #1 or #2 may receive best answer credit.
Update 2
I ended up implementing this as a separate application written in Ruby 1.9 that has a DRb interface to communicate with application instances. I use the Daemons gem to spawn DRb instances when the web server starts up. On start up the DRb application loads in the necessary data from the database, and then it communicates with the client to return results and to stay up to date. It's running quite well in production now. Thanks for the help!
A sinatra app will work, but the {un}serializing, and the HTML parsing could impact performance compared to a DRb service.
Here's an example, based on your example in the related question. I'm using a hash instead of an array so you can use user ids as indexes. This way there is no need to keep both a table on interests and a table of user ids on the server. Note that the interest table is "transposed" compared to your example, which is the way you want it anyways, so it can be updated in one call.
# server.rb
require 'drb'
class InterestServer < Hash
  include DRbUndumped # don't send the data over!
  def closest(cur_user_id)
    cur_interests = fetch(cur_user_id)
    selected_interests = cur_interests.each_index.select{|i| cur_interests[i]}
    scores = map do |user_id, interests|
      nb_match = selected_interests.count{|i| interests[i] }
      [nb_match, user_id]
    end
    scores.sort!
  end
end
DRb.start_service nil, InterestServer.new
puts DRb.uri
DRb.thread.join
# client.rb
uri = ARGV.shift
require 'drb'
DRb.start_service
interest_server = DRbObject.new nil, uri
USERS_COUNT = 10_000
INTERESTS_COUNT = 500
# Mock users
users = Array.new(USERS_COUNT) { {:id => rand(100000)+100000} }
# Initial send over user interests
users.each do |user|
  interest_server[user[:id]] = Array.new(INTERESTS_COUNT) { rand(10) == 0 }
end
# query at will
puts interest_server.closest(users.first[:id]).inspect
# update, say there's a new user:
new_user = {:id => 42}
users << new_user
# This guy is interested in everything!
interest_server[new_user[:id]] = Array.new(INTERESTS_COUNT) { true }
puts interest_server.closest(users.first[:id])[-2,2].inspect
# Will output our first user and this new user which both match perfectly
To run in terminal, start the server and give the output as the argument to the client:
$ ruby server.rb
druby://mal.lan:51630
$ ruby client.rb druby://mal.lan:51630
[[0, 100035], ...]
[[45, 42], [45, 178902]]
Maybe it's too obvious, but if you sacrifice a little access speed to the members of your hash, a traditional database will give you much more constant time access to values. You could start there and then add caching to see if you could get enough speed from it. This will be a little simpler than using Sinatra or some other tool.
Be careful with memcached: it has an object size limit (1 MB per item by default).
One thing to try is to use MongoDB as your storage. It is pretty fast and you can map pretty much any data structure into it.
If it's sensible to wrap your monster hash in a method call, you might simply present it using DRb - start a small daemon that starts a DRb server with the hash as the front object - other processes can make queries of it using what amounts to RPC.
More to the point, is there another approach to your problem? Without knowing what you're trying to do, it's hard to say for sure - but maybe a trie, or a Bloom filter would work? Or even a nicely interfaced bitfield would probably save you a fair amount of space.
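As one possible reading of the bitfield suggestion, each row of 0/1 flags can be packed into a binary string with `Array#pack`, eight flags per byte. This is only a sketch: `pack_row` and `unpack_row` are hypothetical helpers, and it assumes every cell is strictly 0 or 1 as in the question's sample data.

```ruby
# Packs an array of 0/1 integers into a binary String, 8 flags per byte.
def pack_row(row)
  [row.join].pack('b*')
end

# Reverses pack_row; length is needed because the last byte is zero-padded.
def unpack_row(packed, length)
  packed.unpack1('b*')[0, length].each_char.map(&:to_i)
end

row = Array.new(10_001) { rand(10).zero? ? 1 : 0 }
packed = pack_row(row)
puts "#{row.size} flags packed into #{packed.bytesize} bytes"
```

A packed row is already a String, so it can be written to a cache or file directly, which sidesteps Marshal for the bulk of the data.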
Have you considered upping the memcache max object size?
Versions greater than 1.4.2
memcached -I 11m #giving yourself an extra MB in space
or on previous versions changing the value of POWER_BLOCK in the slabs.c and recompiling.
What about storing the data in Memcache instead of storing the Hash in Memcache? Using your code above:
@a = []
0.upto(500) do |r|
  @a[r] = []
  0.upto(10_000) do |c|
    key = "#{r}:#{c}"
    if rand(10) == 0
      Cache.set(key, 1) # 10% chance of being 1
    else
      Cache.set(key, 0)
    end
  end
end
This will be speedy and you won't have to worry about serialization and all of your systems will have access to it. I asked in a comment on the main post about accessing the data, you will have to get creative, but it should be easy to do.

Profile a rails controller action

What is the best way to profile a controller action in Ruby on Rails? Currently I am using the brute-force method of throwing in puts Time.now calls between what I think will be a bottleneck. But that feels really, really dirty. There has got to be a better way.
I picked up this technique a while back and have found it quite handy.
When it's in place, you can add ?profile=true to any URL that hits a controller. Your action will run as usual, but instead of delivering the rendered page to the browser, it'll send a detailed, nicely formatted ruby-prof page that shows where your action spent its time.
First, add ruby-prof to your Gemfile, probably in the development group:
group :development do
  gem "ruby-prof"
end
Then add an around filter to your ApplicationController:
around_action :performance_profile if Rails.env == 'development'
def performance_profile
  if params[:profile] && result = RubyProf.profile { yield }
    out = StringIO.new
    RubyProf::GraphHtmlPrinter.new(result).print out, :min_percent => 0
    self.response_body = out.string
  else
    yield
  end
end
Reading the ruby-prof output is a bit of an art, but I'll leave that as an exercise.
Additional note by ScottJShea:
If you want to change the measurement type place this:
RubyProf.measure_mode = RubyProf::GC_TIME #example
Before the if in the profile method of the application controller. You can find a list of the available measurements at the ruby-prof page. As of this writing the memory and allocations data streams seem to be corrupted (see defect).
Use the Benchmark standard library and the various tests available in Rails (unit, functional, integration). Here's an example:
def test_do_something
  elapsed_time = Benchmark.realtime do
    100.downto(1) do |index|
      # do something here
    end
  end
  assert elapsed_time < SOME_LIMIT
end
So here we just do something 100 times, time it via the Benchmark library, and ensure that it took less than SOME_LIMIT amount of time.
You also may find these links useful: The Benchmark.realtime reference and the Test::Unit reference. Also, if you're into the 'book reading' thing, I picked up the idea for the example from Agile Web Development with Rails, which talks all about the different testing types and a little on performance testing.
There's a Railscast on profiling that's well worth watching
http://railscasts.com/episodes/98-request-profiling
You might want to give the FiveRuns TuneUp service a try, as it's really rather impressive. Disclaimer: I'm not associated with FiveRuns in any way, I've just tried this service out.
TuneUp is a free service whereby you download a plugin and when you run your application it injects a panel at the top of the screen that can be expanded to display detailed performance metrics.
It gives you some nice graphs, including one that shows what proportion of time is spent in the Model, View and Controller. You can even drill right down to see the individual SQL queries that ActiveRecord is executing if you need to and it can show you the underlying database schema with another click.
Finally, you can optionally upload your profiling data to the FiveRuns site for community performance analysis and advice.
This works in Rails 4.2.6:
o=OpenStruct.new(logger: Rails.logger)
o.extend ActiveSupport::Benchmarkable
o.benchmark 'name' do
  # ... your code ...
end
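Outside Rails, a similar labelled timing can be sketched with the standard library alone. The `timed` helper below is a hypothetical name, not part of ActiveSupport; it uses the monotonic clock so that the measurement is not affected by system clock adjustments, and it returns the block's value so it can wrap existing code.

```ruby
# Hypothetical helper: runs the block, prints "label (N.Nms)" to `out`,
# and returns whatever the block returned.
def timed(label, out: $stdout)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000
  out.puts format("%s (%.1fms)", label, elapsed_ms)
  result
end

total = timed("sum of squares") { (1..100_000).sum { |n| n * n } }
puts total
```

This is a tidier replacement for scattered puts Time.now calls, though it only measures wall-clock time for the wrapped block and says nothing about where inside the block the time went.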
