Tempfile and Garbage Collection

I have this command in a Rails controller:
content = nil # defined outside the block so the assignment is visible afterwards
open(source) { |s| content = s.read }
rss = RSS::Parser.parse(content, false)
and it is resulting in temporary files that are filling up the (scarce) disk space.
I have examined the problem to some extent and it turns out somewhere in the stack this happens:
io = Tempfile.new('open-uri')
but it looks like this Tempfile instance never gets explicitly closed. It's got a
def _close # :nodoc:
method which might fire automatically upon garbage collection?
Any help in knowing what's happening or how to clean up the tempfiles would be helpful indeed.

If you really want to force open-uri not to use a tempfile, you can mess with the OpenURI::Buffer::StringMax constant:
> require 'open-uri'
=> true
> OpenURI::Buffer::StringMax
=> 10240
> open("http://www.yahoo.com")
=> #<File:/tmp/open-uri20110111-16395-8vco29-0>
> OpenURI::Buffer::StringMax = 1_000_000_000_000
(irb):10: warning: already initialized constant StringMax
=> 1000000000000
> open("http://www.yahoo.com")
=> #<StringIO:0x6f5b1c>
That's because of this snippet from open-uri.rb:
class Buffer
  [...]
  StringMax = 10240
  def <<(str)
    [...]
    if [...] StringMax < @size
      require 'tempfile'

It looks like _close closes the file and then waits for garbage collection to unlink (remove) the file. Theoretically you could force unlinking immediately by calling the Tempfile's close! method instead of close, or by calling close(true) (which calls close! internally).
edit: But the problem is in open-uri, which is out of your hands - and open-uri makes no promises about cleaning up after itself: it just assumes that the garbage collector will finalize all Tempfiles in due time.
In such a case, you are left with no choice but to call the garbage collector yourself using ObjectSpace.garbage_collect (see here). This should cause the removal of all temp files.
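A minimal sketch of that workaround, reusing the snippet from the question:
content = nil
open(source) { |s| content = s.read }
rss = RSS::Parser.parse(content, false)
# the Tempfile open-uri created is now unreferenced; an explicit GC pass
# runs its finalizer, which unlinks the file on disk
ObjectSpace.garbage_collect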

Definitely not a bug, but faulty handling of the IO. Buffer#io is either a StringIO (if #size is less than 10240 bytes) or a Tempfile (over that amount). The ensure clause in OpenURI.open_uri() calls close(), but because the io could be a StringIO object, which doesn't have a close!() method, it can't simply call close!().
The fix, I think, would be either one of these:
The ensure clause checks the class and calls either StringIO#close or Tempfile#close! as needed.
--or--
The Buffer class needs a finalizer that handles the class check and calls the correct method.
Granted, neither of those fixes it if you don't use a block to handle the IO, but I suppose in that case you can do your own checking (sketched below), since open() returns the IO object, not the Buffer object.
The lib is a big chunk of messy code, imho, so it could use a work-over to clean it up. I think I might do that, just for fun. ^.^
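A minimal sketch of that manual check, assuming open() was called without a block:
require 'open-uri'
require 'tempfile'

io = open("http://www.yahoo.com")
begin
  content = io.read
ensure
  if io.is_a?(Tempfile)
    io.close! # close and unlink the backing temp file immediately
  else
    io.close # StringIO has no close!
  end
end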

Related

Memory is not freed in worker after job ends

Scenario:
I have a job running a process (Sidekiq) in production (Heroku). The process imports data (CSV) from S3 into a DB model using the activerecord-import gem, which does bulk insertion of data. The dbRows variable therefore holds a considerable amount of memory from all the ActiveRecord objects stored while iterating the CSV lines (all good). Once the data is imported (in: db_model.import dbRows), dbRows is cleared (it should be!) and the next object is processed.
Such as: (script simplified for better understanding)
def import
  ....
  s3_objects.contents.each do |obj|
    @cli.get_object({..., key: obj.key}, target: file)
    dbRows = []
    csv = CSV.new(file, headers: false)
    while line = csv.shift
      # >> here dbRows grows and grows and is never freed!
      dbRows << db_model.new(
        field1: field1,
        field2: field2,
        fieldN: fieldN
      )
    end
    db_model.import dbRows
    dbRows = nil # try 1 to free the array
    GC.start # try 2 to force garbage collection
  end
  ....
end
Issue:
Job memory grows while the process runs, BUT once the job is done the memory does not go down. It stays forever and ever!
Debugging, I found that dbRows never seems to be garbage collected,
and I learned about RETAINED objects and how memory works in Rails. I have not yet found a way to apply that to solve my problem, though.
I would like that, once the job finishes, all references held by dbRows are GC'd and the worker memory is freed.
Any help appreciated.
UPDATE: I read about WeakRef but I don't know if it would be useful. Any insights there?
Try importing lines from the CSV in batches, e.g. import lines into the DB 1000 at a time, so you're not holding onto previous rows and the GC can collect them. This is good for the database in any case (and for the download from S3, if you hand CSV the IO object from S3):
s3_io_object = s3_client.get_object(*s3_obj_params).body
csv = CSV.new(s3_io_object, headers: true, header_converters: :symbol)
csv.each_slice(1_000) do |row_batch|
db_model.import ALLOWED_FIELDS, row_batch.map(&:to_h), validate: false
end
Note that I'm not instantiating AR models either, to save memory, and am only passing in hashes and telling activerecord-import to use validate: false.
Also, where does the file reference come from? It seems to be long-lived.
It's not evident from your example, but is it possible that references to objects are still being held globally by a library or extension in your environment?
Sometimes these things are very difficult to track down, as any code from anywhere that's called (including external library code) could do something like:
Dynamically defining constants, since they never get GC'd
Any::Module::Or::Class.const_set('NewConstantName', :foo)
or adding data to anything referenced/owned by a constant
SomeConstant::Referenceable::Globally.array << foo # array will only get bigger and contents will never be GC'd
Otherwise, the best you can do is use some memory profiling tools, either inside of Ruby (memory profiling gems) or outside of Ruby (job and system logs), to try to find the source.
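As a sketch of the in-Ruby route, the memory_profiler gem reports which call sites allocate and, more importantly, retain objects (assuming your setup lets you run the job body, or a reduced slice of it, under the profiler):
require 'memory_profiler'

report = MemoryProfiler.report do
  import # run the suspect code path
end
# the "retained" sections list allocations that survived a GC pass
report.pretty_print(to_file: 'memory_report.txt')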

Rails Memoization of Helper method

I have a helper method that does expensive calculations and returns a Hash, and this Hash is constant during my entire application lifespan (meaning it can only change after a re-deploy) and it also doesn't take any arguments.
For performance, I wish I could 'cache' the resulting Hash.
I don't want to use Rails cache for this, since I want to avoid the extra trip to memcached and I don't want the overhead of de-serializing the string into a hash.
My first idea was to assign the resulting hash to a constant and call .freeze on it. But the helper is an instance method, the constant lives on the class, and I had to do this ultra hacky solution:
module FooHelper
  def expensive_calculation_method
    resulting_hash
  end

  EXPENSIVE_CALCULATION_CONSTANT = Class.new.extend(self).expensive_calculation_method.freeze
end
This is due to the helper method being an instance method, the helper being a Module (which leads to the fake Class extend so I can call the instance method), and I also must declare the constant AFTER the instance method (if I declare it right after module FooHelper, I get an undefined method 'expensive_calculation_method').
The second idea was to use memoization, but at least for Rails controllers memoization persists a variable only over the lifecycle of a single request, so it's only valuable if you reuse the variable many times within a single request, which is not my case. At the same time, helpers are modules, not classes to be instantiated, and at this point I don't know what to do.
How would I cache that Hash, or memoize it in a way that persists over requests?
Per your comments, this will only change at application boot, so placing it in an initializer would do the trick.
# config/initializers/expensive_thing.rb
$EXPENSIVE_THING_GLOBAL = expensive_calculation
# or
EXPENSIVE_THING_CONSTANT = expensive_calculation
# or
Rails.application.config.expensive_thing = expensive_calculation
If you want to cache the result of some painful operation at launch time:
module MyExpensiveOperation
  COMPUTED_RESULT = OtherModule.expensive_operation

  def self.cached
    COMPUTED_RESULT
  end
end
Just make sure the module gets loaded somehow, or the constant will never be computed. You can always force-require the module in environment.rb or in a config/initializers file if necessary.
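For instance (the lib/ location is an assumption about where the module lives):
# config/initializers/my_expensive_operation.rb
require Rails.root.join('lib', 'my_expensive_operation').to_s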
If you want to lazy-load, the basic principle is the same:
module MyExpensiveOperation
  def self.cached
    return @cached if defined?(@cached)
    @cached = OtherModule.expensive_operation
  end
end
That will handle operations that, for whatever reason, return nil or false. It will run once and once only, unless you have multiple threads triggering it at the same time. If that's the case, there are ways of making your module concurrency-aware with automatic locks.
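A sketch of one such lock, reusing the OtherModule.expensive_operation stand-in from above; the second defined? check inside synchronize keeps the operation from running twice:
module MyExpensiveOperation
  LOCK = Mutex.new

  def self.cached
    return @cached if defined?(@cached)
    LOCK.synchronize do
      # re-check inside the lock: another thread may have computed it already
      @cached = OtherModule.expensive_operation unless defined?(@cached)
      @cached
    end
  end
end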

Best practice for a big array manipulation with values that never change and will be used in more than one view

What would be the best and most efficient way in Rails to use a hash of about 300-500 integers (which will never be modified) in more than one view in the application?
Should I save the data in the database? Create the hash in each action where it's used? (This is what I do now, but the code looks ugly and inefficient.) Or is there another option?
Why don't you put it in a constant? You said it will never change, so it fits either configuration or constant.
Using the cache has the downside that it can be dropped out of cache, triggering a reload, which seems quite useless in this case.
The overhead of having it always in memory is negligible: 500 integers are 4KB or so at most, so you are safe.
You can write the hash manually or load a YAML file (or whatever) if you prefer, your choice.
My suggestion is create a file app/models/whatever.rb and:
module Whatever
  MY_HASH = {
    1 => 241
  }.freeze
end
This will be preloaded by Rails on startup (in production) and kept in memory all the time.
You can access those values in views with Whatever::MY_HASH[1], or you can write a wrapper method like:
module Whatever
  MY_HASH = {
    1 => 241
  }.freeze

  def self.get(id)
    MY_HASH.fetch(id)
  end
end
And use it with Whatever.get(1).
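If you'd rather load the hash from a YAML file, as mentioned above, the same module might look like this (config/whatever.yml is a hypothetical path):
require 'yaml'

module Whatever
  MY_HASH = YAML.load_file(
    Rails.root.join('config', 'whatever.yml')
  ).freeze

  def self.get(id)
    MY_HASH.fetch(id)
  end
end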
If the data will never be changed, why not just calculate the values beforehand and write them directly into the view?
Another option would be to put the values into a singleton and cache them there.
require 'singleton'
class MyHashValues
  include Singleton

  def initialize
    @results = calculation
  end

  def result_key_1
    @results[:result_key_1]
  end

  def calculation
    Hash.new
  end
end
MyHashValues.instance.result_key_1
Cache it, it'll do exactly what you want and it's a standard Rails component. If you're not caching yet, check out the Rails docs on caching. If you use the memory store, your data will essentially be in RAM.
You will then be able to do this sort of thing
# The block contains the value to cache, if there's a miss
# Setting the value is done initially and after the cache
# expires or is cleared.
# put this in application controller and make it a helper method
def integer_hash
  Rails.cache.fetch('integer_hash') { ... }
end
helper_method :integer_hash

Creating thread-safe non-deleting unique filenames in ruby/rails

I'm building a bulk-file-uploader. Multiple files are uploaded in individual requests, and my UI provides progress and success/fail. Then, once all files are complete, a final request processes/finalizes them. For this to work, I need to create many temporary files that live longer than a single request. Of course I also need to guarantee filenames are unique across app instances.
Normally I would use Tempfile for easy unique filenames, but in this case it won't work because the files need to stick around until another request comes in to further process them. Tempfile auto-unlinks files when they're closed and garbage collected.
An earlier question here suggests using Dir::Tmpname.make_tmpname but this seems to be undocumented and I don't see how it is thread/multiprocess safe. Is it guaranteed to be so?
In C I would open the file with O_EXCL, which fails if the file exists. I could then keep trying until I successfully get a handle on a file with a truly unique name. But Ruby's File.open doesn't seem to have an "exclusive" option of any kind. If the file I'm opening already exists, I have to either append to it, open it for writing at the end, or empty it.
Is there a "right" way to do this in ruby?
I have worked out a method that I think is safe, but is seems overly complex:
# make a unique filename
time = Time.now
filename = "#{time.to_i}-#{sprintf('%06d', time.usec)}"
# make a tempfile (this is guaranteed to find a unique, creatable name)
data_file = Tempfile.new(["upload", ".data"], UPLOAD_BASE)
# but the file will be deleted automatically, which we don't want, so now link it in a stable location
count = 1
loop do
  begin
    # File.link will raise an exception if the destination path exists
    File.link(data_file.path, File.join(UPLOAD_BASE, "#{filename}-#{count}.data"))
    # so here we know we created a file successfully and nobody else will take it
    break
  rescue Errno::EEXIST
    count += 1
  end
end
# now unlink the original tempfile (it's still writable until it's closed)
data_file.unlink
# ... write to data_file and close it ...
NOTE: This won't work on Windows. Not a problem for me, but reader beware.
In my testing this works reliably. But again, is there a more straightforward way?
I would use SecureRandom.
Maybe something like:
p SecureRandom.uuid #=> "2d931510-d99f-494a-8c67-87feb05e1594"
or
p SecureRandom.hex #=> "eb693ec8252cd630102fd0d0fb7c3485"
You can specify the length, and count on an almost impossibly small chance of collision.
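A sketch of how that could plug into the upload directory from the question (UPLOAD_BASE is the constant used there; the hex length is arbitrary):
require 'securerandom'

# 32 hex characters give ~128 bits of randomness; a collision is practically impossible
path = File.join(UPLOAD_BASE, "upload-#{SecureRandom.hex(16)}.data")
data_file = File.open(path, 'w')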
I actually found the answer after some digging. Of course the obvious approach is to see what Tempfile itself does. I just assumed it was native code, but it is not. The source for 1.8.7 can be found here for instance.
As you can see, Tempfile uses an apparently undocumented file mode of File::EXCL. So my code can be simplified substantially:
# make a unique filename
time = Time.now
filename = "#{time.to_i}-#{sprintf('%06d', time.usec)}"
data_file = nil
count = 1
loop do
  begin
    data_file = File.open(File.join(UPLOAD_BASE, "#{filename}-#{count}.data"), File::RDWR|File::CREAT|File::EXCL)
    break
  rescue Errno::EEXIST
    count += 1
  end
end
# ... write to data_file and close it ...
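If I remember right, Ruby 2.3 and later also accept an "x" (exclusive) flag in string file modes, which is a shorter spelling of the same open; treat the exact version as an assumption worth checking:
# "x" makes open raise Errno::EEXIST if the file already exists
data_file = File.open(File.join(UPLOAD_BASE, "#{filename}-#{count}.data"), 'wx')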
UPDATE And now I see that this is covered in a prior thread:
How do open a file for writing only if it doesn't already exist in ruby
So maybe this whole question should be marked a duplicate.

Does ruby IO.read() lock?

I have a webservice method that reads a photo and returns its byte data. I'm currently doing the following:
@photo_bytes = IO.read("/path/to/file")
send_data(@photo_bytes, :filename => "filename", :type => "filetype", :disposition => "inline")
I'm getting some strange behavior when calling this a lot... occasionally send_data returns nil. I'm thinking that maybe I'm getting read contention if a file hasn't been closed yet. Do I need to explicitly close the file after opening it with IO.read? How could I use read_nonblock to do this, and would it be worth it?
UPDATE:
So I did some more logging, and occasionally IO.read returns a value like 1800 bytes when it usually returns ~5800 bytes for a picture. When it returns 1800 bytes, the picture does not show up on the client. This happens fairly randomly when two users are calling the web service.
Thanks
Tom
The IO.read method doesn't do any advisory file locking, and so shouldn't be affected by other concurrent readers. However, if you have code elsewhere in your application that writes to the same path, you need to make sure you update the file atomically. Opening a file in write (not append) mode immediately truncates the file to zero bytes, so until the new version has been written, you could well see truncated or empty responses generated from the above snippet.
Assuming you're on a *NIX platform like Linux or OS X, though, you can update a file atomically using code like this:
require 'tempfile'
require 'fileutils'
def safe_write(path, data)
  tmp = Tempfile.new
  tmp.binmode # photos are binary data; avoid any newline/encoding translation
  tmp.write(data)
  tmp.close
  # replace the destination in a single step
  FileUtils.mv(tmp.path, path)
end
This will write data to a temporary file, then move it to the "/path/to/file" location atomically, without readers ever seeing the zero-length truncated version.
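One caveat worth adding: FileUtils.mv is only an atomic rename when the source and destination are on the same filesystem, and Tempfile defaults to Dir.tmpdir, which may not be. A variation that creates the scratch file next to the destination avoids the copy fallback (a sketch, keeping the same safe_write signature):
require 'tempfile'
require 'fileutils'

def safe_write(path, data)
  # create the tempfile in the destination directory so mv is a rename, not a copy
  tmp = Tempfile.new('safe-write', File.dirname(path))
  tmp.binmode
  tmp.write(data)
  tmp.close
  FileUtils.mv(tmp.path, path)
end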
