Memory is not freed in worker after job ends - ruby-on-rails

Scenario:
I have a Sidekiq job running in production (Heroku). The job imports CSV data from S3 into a DB model using the activerecord-import gem, which does bulk insertion of data. The dbRows variable therefore holds a considerable amount of memory from all the ActiveRecord objects built while iterating the CSV lines (all good). Once the data is imported (in db_model.import dbRows), dbRows is cleared (it should be!) and the next object is processed.
Such as (script simplified for better understanding):
def import
  ....
  s3_objects.contents.each do |obj|
    #cli.get_object({..., key: obj.key}, target: file)
    dbRows = []
    csv = CSV.new(file, headers: false)

    while line = csv.shift
      # >> here dbRows grows and grows and never gets freed!
      dbRows << db_model.new(
        field1: field1,
        field2: field2,
        fieldN: fieldN
      )
    end

    db_model.import dbRows
    dbRows = nil # try 1 to free the array
    GC.start     # try 2 to free memory
  end
  ....
end
Issue:
Job memory grows while the process runs, BUT once the job is done the memory does not go down. It stays forever and ever!
Debugging, I found that dbRows never seems to be garbage collected, and I learned about RETAINED objects and how memory works in Rails. Although I have not yet found a way to apply that to solve my problem.
I would like that, once the job has finished, all references held in dbRows are GC'd and the worker memory is freed.
Any help appreciated.
UPDATE: I read about WeakRef but I don't know if it would be useful. Any insights there?

Try importing lines from the CSV in batches, e.g. import lines into the DB 1,000 at a time, so you're not holding onto previous rows and the GC can collect them. This is good for the database in any case (and for the download from S3, if you hand CSV the IO object from S3).
s3_io_object = s3_client.get_object(*s3_obj_params).body
csv = CSV.new(s3_io_object, headers: true, header_converters: :symbol)

csv.each_slice(1_000) do |row_batch|
  db_model.import ALLOWED_FIELDS, row_batch.map(&:to_h), validate: false
end
Note that I'm not instantiating AR models either, to save on memory; I'm only passing in hashes and telling activerecord-import to validate: false.
Also, where does the file reference come from? It seems to be long-lived.
It's not evident from your example, but is it possible that references to objects are still being held globally by a library or extension in your environment?
Sometimes these things are very difficult to track down, as any code from anywhere that's called (including external library code) could do something like:
Dynamically defining constants, since they never get GC'd
Any::Module::Or::Class.const_set('NewConstantName', :foo)
or adding data to anything referenced/owned by a constant
SomeConstant::Referenceable::Globally.array << foo # array will only get bigger and contents will never be GC'd
Otherwise, the best you can do is use some memory profiling tools, either inside of Ruby (memory profiling gems) or outside of Ruby (job and system logs) to try and find the source.
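If you go the gem route, here is a minimal sketch using the memory_profiler gem (the gem and the report path are my assumptions, not something already in your setup):
require "memory_profiler" # Gemfile: gem "memory_profiler"

report = MemoryProfiler.report do
  import # run the job body you want to inspect (your import method)
end

# dumps allocated vs. retained objects grouped by gem, file and location
report.pretty_print(to_file: "tmp/import_memory_report.txt")
The "retained" sections of the report are the interesting part here, since they list the objects that survived the GC run after the block finished.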

Related

Request timeout in Rails

We are working on a data visualization problem right now. Our customer wants us to show the last 6 months of data for a honeybee hive on a graph.
Clearly it's going to be a huge dataset. By adding indexes we overcame the database slowness in loading the data, though we still have a problem visualizing the data on a graph.
Here is the related code:
def self.prepare_single_hive_messages_for_datatable_dygraph(messages, us_metric_enabled)
  data = []
  messages.each do |message|
    record = []
    record << message.occurance_time.to_s(:dygraph_format)
    record << weight_according_to_metric(message.weight, us_metric_enabled)
    record << temperature_according_to_metric(message.temperature, us_metric_enabled)
    record << (message.humidity.nil? ? nil : message.humidity.to_f)
    data << record
  end
  return data
end
The problem is that messages.each is very slow and takes more than 30 seconds. Is there any solution to overcome this?
Project Specification:
Rails Version: 4.1.9
Graph Library: Dygraph
Database: Postgres
There are two ways to attack a performance problem like this.
Find and correct the performance bottleneck
Break it into smaller pieces
Finding Performance issues
First, get a dataset large enough to reproduce the problem setup on your dev system. Then look at the logs so you can see how long the transaction is taking. You should be looking for a line like this:
Completed 200 OK in 432.1ms (Views: 367.7ms | ActiveRecord: 61.4ms)
Rerun the task a couple times since caching can cause variations. Write down your different times. Then remove everything in the loop and run it with just the loop. Do the numbers go back to looking reasonable? If that is the case then you know the problem is the work you are doing inside the loop. Next, add each line in the loop back on its own (or one at a time if they depend on each other). Figure out which line causes those numbers to jump the most.
This is the point where you should try to performance tune your code. Check for queries that could be smarter. Make sure you aren't querying the same data over and over. If you have a function in a model that computes something and you call it multiple times to get the same answer then use this to only compute once:
def something
  return @savedvalue if @savedvalue
  @savedvalue = really_complex_calculation
end
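If the cached value can never legitimately be nil or false, the same memoization is often written more tersely with ||= (really_complex_calculation is just a stand-in name):
def something
  @savedvalue ||= really_complex_calculation
end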
The goal is to find the worst offender so you can make the changes that have the biggest impact. However, if you are working with a LOT of data this may only get you so far. It may be impossible to tune the performance enough for all the data. In that case there is option 2.
Break it into smaller pieces
Write a second Rails action whose only job is to render a single record on a graph. It will do the inner part of your loop, but only for the message whose id was passed to it.
Call your original function to set up the view and pass the list of messages to the view. In the view, loop through the list of messages to set up jQuery ajax code that calls the above action once for each message. Have this run on document ready.
Then the page will load with an empty graph... but as soon as it is up, the individually processed records will be fed to it and appear one at a time on the page. It will still take just as long (or even a little longer because of the overhead) to complete the graph... but it will no longer time out. Each ajax call will be its own quick hit to the server instead of one big long hit.
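As a rough illustration of the server-side half of this idea, the per-record action could be as small as the following sketch. The controller, action, route and parameter names are my assumptions, and I'm assuming the metric helpers are class methods on a Message model; adjust to your app.
# config/routes.rb (assumed): get 'messages/:id/datapoint', to: 'messages#datapoint'
class MessagesController < ApplicationController
  def datapoint
    message = Message.find(params[:id])
    us_metric_enabled = params[:us_metric] == "true"

    render json: [
      message.occurance_time.to_s(:dygraph_format),
      Message.weight_according_to_metric(message.weight, us_metric_enabled),
      Message.temperature_according_to_metric(message.temperature, us_metric_enabled),
      (message.humidity.nil? ? nil : message.humidity.to_f)
    ]
  end
end
The view then loops over the message ids and fires one ajax GET per id on document ready, appending each returned record to the graph.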
I just used this very technique to load a rather long report on a site I work on. Ideally we'd like to fix any underlying performance issues... but what we really wanted was to have a report working right away and then fix the performance issues as we had time.
OK, you said every person sees the same set of data, which is great: it means we can cache without worrying about who's logged in. First, here's your method with tiny improvements:
def self.prepare_single_hive_messages_for_datatable_dygraph(messages, us_metric_enabled)
  messages.inject([]) do |records, message|
    records << [].tap do |record|
      record << message.occurance_time.to_s(:dygraph_format)
      record << weight_according_to_metric(message.weight, us_metric_enabled)
      record << temperature_according_to_metric(message.temperature, us_metric_enabled)
      record << (message.humidity.nil? ? nil : message.humidity.to_f)
    end
  end
end
Then create a caching method that runs this method and caches the result:
# some class constants
CACHE_KEY = 'some_cache_key'
EXPIRY_TIME = 15.minutes

# the methods
def self.write_single_hive_messages_to_cache(messages, us_metric_enabled)
  Rails.cache.write CACHE_KEY,
    prepare_single_hive_messages_for_datatable_dygraph(messages, us_metric_enabled),
    expires_in: EXPIRY_TIME
end
And a simple cache reading method
def self.read_single_hive_messages_from_cache
  Rails.cache.read CACHE_KEY
end
Then create a rake task that just fetches these messages and calls the caching method, and Rails will write the cache.
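A bare-bones sketch of such a task, assuming the methods above live on a Message model and that occurance_time is the column to filter on (names are mine, not from the question):
# lib/tasks/hive_cache.rake
namespace :hive do
  desc "Pre-calculate and cache the dygraph message data"
  task warm_messages_cache: :environment do
    messages = Message.where("occurance_time >= ?", 6.months.ago)
    Message.write_single_hive_messages_to_cache(messages, false) # false = metric units; adjust as needed
  end
end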
Create a cron job that calls this rake task, set to run every 5 minutes or so. The expiry time is longer just in case the cron job didn't run for some reason; the data will still be available for the next run.
This way your processing runs in the background every 5 minutes (or whatever interval you choose), and the page should load normally with no delay at all, since the array data will be read from the pre-calculated cache.
If the cron stops working, the data will expire after the 15 minutes I've set, and then the read-cache method will return nil. You could avoid this by setting the data to never expire, but then the data would become stale and the old data would keep getting returned.
Another way to handle this is to tell the cache-reading method how to generate the cache itself, so if it finds the cache empty it generates one and caches it before returning the data. The method would look like this:
def self.read_single_hive_messages_from_cache(messages, us_metric_enabled)
  Rails.cache.fetch CACHE_KEY, expires_in: EXPIRY_TIME do
    prepare_single_hive_messages_for_datatable_dygraph(messages, us_metric_enabled)
  end
end
But then make sure that messages is an ActiveRecord::Relation and not an already processed array, because you don't want to query for 1+ million records only to find the cache already populated. If it's an ActiveRecord::Relation it will not touch the database until the array is iterated (inside the caching block); if the cache exists, it is returned before you enter the block and the data is never fetched, saving you that huge query.
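To illustrate the difference (Message and hive_id are assumed names here):
# lazy: builds an ActiveRecord::Relation, no SQL runs yet
messages = Message.where(hive_id: hive_id)

# eager: to_a materializes every row immediately, even if the cache turns out to be warm
messages = Message.where(hive_id: hive_id).to_a

# pass the lazy relation: the query only executes inside the cache block, i.e. on a cache miss
Message.read_single_hive_messages_from_cache(messages, us_metric_enabled)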
I know the answer got long, if you need more help tell me.

Mongoid identity_map and memory usage, memory leaks

When I execute the query
Mymodel.all.each do |model|
  # ..do something
end
it uses a lot of memory, the amount of used memory keeps increasing the whole time, and in the end it crashes. I found out that to fix it I need to disable the identity_map, but when I add identity_map_enabled: false to my mongoid.yml file I get the error
Invalid configuration option: identity_map_enabled.
Summary:
A invalid configuration option was provided in your mongoid.yml, or a typo is potentially present. The valid configuration options are: :include_root_in_json, :include_type_for_serialization, :preload_models, :raise_not_found_error, :scope_overwrite_exception, :duplicate_fields_exception, :use_activesupport_time_zone, :use_utc.
Resolution:
Remove the invalid option or fix the typo. If you were expecting the option to be there, please consult the following page with respect to Mongoid's configuration:
I am using Rails 4 and Mongoid 4, Mymodel.all.count => 3202400
How can I fix it, or does someone know another way to reduce the amount of memory used while executing a query with .all.each ..?
Thank you very much for the help!!!!
I started out just like you, looping through millions of records, and the memory just kept increasing.
Original code:
@portal.listings.each do |listing|
  listing.do_something
end
I've gone through many forum answers and I tried them out.
1st attempt: I tried the combination of WeakRef and GC.start, but no luck; it failed.
2nd attempt: adding listing = nil to the first attempt, and it still failed.
Successful attempt:
@start_date = 10.years.ago
@end_date = 1.day.ago

while @start_date < @end_date
  @portal.listings.where(created_at: @start_date..@start_date.next_month).each do |listing|
    listing.do_something
  end
  @start_date = @start_date.next_month
end
Conclusion
All the memory allocated for the records is never released during a single query request. Therefore, working with a small number of records per request does the job, and memory stays in good shape since it is released after each request.
Your problem isn't the identity map; I don't think Mongoid 4 even has an identity map built in, hence the configuration error when you try to turn it off. Your problem is that you're using all. When you do this:
Mymodel.all.each
Mongoid will attempt to instantiate every single document in the db.mymodels collection as a Mymodel instance before it starts iterating. You say that you have about 3.2 million documents in the collection, that means that Mongoid will try to create 3.2 million model instances before it tries to iterate. Presumably you don't have enough memory to handle that many objects.
Your Mymodel.all.count works fine because that just sends a simple count call into the database and returns a number, it won't instantiate any models at all.
The solution is to not use all (and preferably forget that it exists). Depending on what "do something" does, you could:
Page through all the models so that you're only working with a reasonable number of them at a time (see the sketch below).
Push the logic into the database using mapReduce or the aggregation framework.
Whenever you're working with real data (i.e. something other than a trivially small database), you should push as much work as possible into the database because databases are built to manage and manipulate big piles of data.
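For the paging option, a minimal sketch keyed on _id (do_something stands in for whatever your per-document work is):
batch_size = 1_000
last_id    = nil

loop do
  criteria = Mymodel.asc(:_id).limit(batch_size)
  criteria = criteria.where(:_id.gt => last_id) if last_id
  batch    = criteria.to_a
  break if batch.empty?

  batch.each { |model| model.do_something }
  last_id = batch.last.id
end
Only batch_size documents are instantiated at any one time, so memory should stay roughly flat regardless of the collection size.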

Getting error "Cannot allocate memory" for Rails

In my project there is one script that returns the list of products which I have to display in a table.
To store the output of the script I used IO.popen:
@device_list = []
IO.popen("device list").each do |device|
  @device_list << device
end
device list is the command that will give me the product list.
I return the @device_list array to my view and display it by iterating over it.
When I run it I get an error:
Errno::ENOMEM (Cannot allocate memory):
for IO.popen
I have another script, device status, that returns only true or false, but I get the same error:
def check_status(device_id)
  @stat = system("status device_id")

  if @stat == true
    "sold"
  else
    "not sold"
  end
end
What should I do?
Both IO.popen and Kernel#system can be expensive operations in terms of memory because they both rely on fork(2). fork(2) is a Unix system call that creates a child process which clones the parent's memory and resources. That means that if your parent process uses 500 MB of memory, then your child will also use 500 MB of memory. Each time you call Kernel#system or IO.popen you increase your application's memory usage by the amount of memory it takes to run your Rails app.
If your development machine has more RAM than your production server or if your production server produces a lot more output, there are two things you could do:
Increase memory for your production server.
Do some memory management using something like Resque.
You can use Resque to queue those operations as jobs. Resque will then spawn "workers"/child processes to pick a job from the queue, work on it, and then exit. Resque still forks, but the important thing is that the worker exits after working on the task, which frees up the memory. There'll be a spike in memory every time a worker does a job, but it will go back to your app's baseline memory after each one.
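A bare-bones sketch of what such a Resque job could look like (the job name and the cache-based handoff are my assumptions, not something Resque prescribes):
class DeviceListJob
  @queue = :devices

  def self.perform
    devices = IO.popen("device list") { |io| io.readlines }
    Rails.cache.write("device_list", devices) # hand the result back to the web process
  end
end

# in the controller, instead of shelling out inline:
Resque.enqueue(DeviceListJob)
# later, render whatever Rails.cache.read("device_list") returns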
You might have to do both options above and look for other ways to minimize the memory-usage of your app.
It seems your output from device list is too large.
"Cannot allocate memory (Errno::ENOMEM)" is a useful link which describes the question.
Limit the output of device list and check. Then you can know if it is a memory issue or not.
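One quick way to test that, assuming a Unix-like host where head is available:
@device_list = []
IO.popen("device list | head -n 100").each do |device|
  @device_list << device
end
If the error disappears with the output capped, the size of the output (or what you do with it afterwards) is the problem; if it still raises ENOMEM, the fork itself is running out of memory, as described in the other answer.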

Working with a large data object between ruby processes

I have a Ruby hash that reaches approximately 10 megabytes if written to a file using Marshal.dump. After gzip compression it is approximately 500 kilobytes.
Iterating through and altering this hash is very fast in ruby (fractions of a millisecond). Even copying it is extremely fast.
The problem is that I need to share the data in this hash between Ruby on Rails processes. In order to do this using the Rails cache (file_store or memcached) I need to Marshal.dump the hash first; however, this incurs a 1000 millisecond delay when serializing it and a 400 millisecond delay when deserializing it.
Ideally I would want to be able to save and load this hash from each process in under 100 milliseconds.
One idea is to spawn a new Ruby process to hold this hash that provides an API to the other processes to modify or process the data within it, but I want to avoid doing this unless I'm certain that there are no other ways to share this object quickly.
Is there a way I can more directly share this hash between processes without needing to serialize or deserialize it?
Here is the code I'm using to generate a hash similar to the one I'm working with:
@a = []
0.upto(500) do |r|
  @a[r] = []
  0.upto(10_000) do |c|
    if rand(10) == 0
      @a[r][c] = 1 # 10% chance of being 1
    else
      @a[r][c] = 0
    end
  end
end

@c = Marshal.dump(@a) # 1000 milliseconds
Marshal.load(@c)      # 400 milliseconds
Update:
Since my original question did not receive many responses, I'm assuming there's no solution as easy as I would have hoped.
Presently I'm considering two options:
Create a Sinatra application to store this hash with an API to modify/access it.
Create a C application to do the same as #1, but a lot faster.
The scope of my problem has increased such that the hash may be larger than my original example. So #2 may be necessary. But I have no idea where to start in terms of writing a C application that exposes an appropriate API.
A good walkthrough through how best to implement #1 or #2 may receive best answer credit.
Update 2
I ended up implementing this as a separate application written in Ruby 1.9 that has a DRb interface to communicate with application instances. I use the Daemons gem to spawn DRb instances when the web server starts up. On start up the DRb application loads in the necessary data from the database, and then it communicates with the client to return results and to stay up to date. It's running quite well in production now. Thanks for the help!
A Sinatra app will work, but the (un)serializing and the HTML parsing could impact performance compared to a DRb service.
Here's an example, based on your example in the related question. I'm using a hash instead of an array so you can use user ids as indexes. This way there is no need to keep both a table of interests and a table of user ids on the server. Note that the interest table is "transposed" compared to your example, which is the way you want it anyway, so it can be updated in one call.
# server.rb
require 'drb'

class InterestServer < Hash
  include DRbUndumped # don't send the data over!

  def closest(cur_user_id)
    cur_interests = fetch(cur_user_id)
    selected_interests = cur_interests.each_index.select { |i| cur_interests[i] }

    scores = map do |user_id, interests|
      nb_match = selected_interests.count { |i| interests[i] }
      [nb_match, user_id]
    end
    scores.sort!
  end
end

DRb.start_service nil, InterestServer.new
puts DRb.uri

DRb.thread.join
# client.rb
uri = ARGV.shift

require 'drb'
DRb.start_service
interest_server = DRbObject.new nil, uri

USERS_COUNT = 10_000
INTERESTS_COUNT = 500

# Mock users
users = Array.new(USERS_COUNT) { {:id => rand(100000)+100000} }

# Initial send over of user interests
users.each do |user|
  interest_server[user[:id]] = Array.new(INTERESTS_COUNT) { rand(10) == 0 }
end

# query at will
puts interest_server.closest(users.first[:id]).inspect

# update, say there's a new user:
new_user = {:id => 42}
users << new_user
# This guy is interested in everything!
interest_server[new_user[:id]] = Array.new(INTERESTS_COUNT) { true }

puts interest_server.closest(users.first[:id])[-2,2].inspect
# Will output our first user and this new user, which both match perfectly
To run in terminal, start the server and give the output as the argument to the client:
$ ruby server.rb
druby://mal.lan:51630
$ ruby client.rb druby://mal.lan:51630
[[0, 100035], ...]
[[45, 42], [45, 178902]]
Maybe it's too obvious, but if you sacrifice a little access speed to the members of your hash, a traditional database will give you much more constant time access to values. You could start there and then add caching to see if you could get enough speed from it. This will be a little simpler than using Sinatra or some other tool.
Be careful with memcache: it has some object size limitations (2 MB or so).
One thing to try is to use MongoDB as your storage. It is pretty fast and you can map pretty much any data structure into it.
If it's sensible to wrap your monster hash in a method call, you might simply present it using DRb: start a small daemon that runs a DRb server with the hash as the front object; other processes can then make queries of it using what amounts to RPC.
More to the point, is there another approach to your problem? Without knowing what you're trying to do, it's hard to say for sure, but maybe a trie or a Bloom filter would work? Or even a nicely interfaced bitfield would probably save you a fair amount of space.
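For what it's worth, a rough sketch of the bitfield idea, packing one user's 500 boolean interests into a single Integer (the 10% distribution mirrors the generator code in the question):
interests = Array.new(500) { rand(10) == 0 }

packed = interests.each_with_index.inject(0) do |bits, (flag, i)|
  flag ? bits | (1 << i) : bits
end

packed[42]           # => 1 or 0; Integer#[] reads a single bit
packed |= (1 << 42)  # set interest number 42
Packed like this, the whole structure becomes a hash of 10,000 smallish Integers, which should marshal considerably faster than nested arrays of Fixnums.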
Have you considered upping the memcache max object size?
Versions greater than 1.4.2
memcached -I 11m #giving yourself an extra MB in space
or on previous versions changing the value of POWER_BLOCK in the slabs.c and recompiling.
What about storing the data in Memcache instead of storing the Hash in Memcache? Using your code above:
@a = []
0.upto(500) do |r|
  @a[r] = []
  0.upto(10_000) do |c|
    key = "#{r}:#{c}"

    if rand(10) == 0
      Cache.set(key, 1) # 10% chance of being 1
    else
      Cache.set(key, 0)
    end
  end
end
This will be speedy, you won't have to worry about serialization, and all of your systems will have access to it. I asked in a comment on the main post about accessing the data; you will have to get creative, but it should be easy to do.

Tempfile and Garbage Collection

I have this command in a Rails controller
open(source) { |s| content = s.read }
rss = RSS::Parser.parse(content, false)
and it is resulting in temporary files that are filling up the (scarce) disk space.
I have examined the problem to some extent and it turns out somewhere in the stack this happens:
io = Tempfile.new('open-uri')
but it looks like this Tempfile instance never gets explicitly closed. It's got a
def _close # :nodoc:
method which might fire automatically upon garbage collection?
Any help in understanding what's happening, or how to clean up the temp files, would be appreciated.
If you really want to force open-uri not to use a tempfile, you can mess with the OpenURI::Buffer::StringMax constant:
> require 'open-uri'
=> true
> OpenURI::Buffer::StringMax
=> 10240
> open("http://www.yahoo.com")
=> #<File:/tmp/open-uri20110111-16395-8vco29-0>
> OpenURI::Buffer::StringMax = 1_000_000_000_000
(irb):10: warning: already initialized constant StringMax
=> 1000000000000
> open("http://www.yahoo.com")
=> #<StringIO:0x6f5b1c>
That's because of this snippet from open-uri.rb:
class Buffer
  [...]
  StringMax = 10240
  def <<(str)
    [...]
    if [...] StringMax < @size
      require 'tempfile'
It looks like _close closes the file and then waits for garbage collection to unlink (remove) the file. Theoretically you could force unlinking immediately by calling the Tempfile's close! method instead of close, or by calling close(true) (which calls close! internally).
edit: But the problem is in open-uri, which is out of your hands, and open-uri makes no promises about cleaning up after itself: it just assumes that the garbage collector will finalize all Tempfiles in due time.
In such a case, you are left with no choice but to call the garbage collector yourself using ObjectSpace.garbage_collect (see here). This should cause the removal of all temp files.
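A minimal sketch of that workaround applied to the controller code from the question:
content = nil
open(source) { |s| content = s.read }
rss = RSS::Parser.parse(content, false)

# force a GC pass so any orphaned Tempfile finalizers run and their files get unlinked
ObjectSpace.garbage_collect # equivalent to GC.start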
Definitely not a bug, but faulty handling of the IO. Buffer.io is either a StringIO, if @size is less than 10240 bytes, or a Tempfile over that amount. The ensure clause in OpenURI.open_uri() calls close(), but because the object could be a StringIO, which doesn't have a close!() method, it can't just call close!().
The fix, I think, would be either one of these:
The ensure clause checks for class and calls either StringIO.close or Tempfile.close! as needed.
--or--
The Buffer class needs a finalizer that handles the class check and calls the correct method.
Granted, neither of those fixes it if you don't use a block to handle the IO, but I suppose in that case you can do your own checking, since open() returns the IO object, not the Buffer object.
The lib is a big chunk of messy code, imho, so it could use a work-over to clean it up. I think I might do that, just for fun. ^.^
