Does ruby IO.read() lock?

I have a webservice method that reads a photo and returns its byte data. I'm currently doing the following:
@photo_bytes = IO.read("/path/to/file")
send_data(@photo_bytes, :filename => "filename", :type => "filetype", :disposition => "inline")
I'm getting some strange behavior when calling this a lot... occasionally send_data is returning null. I'm thinking that maybe I'm getting read contention if a file hasn't been closed yet. Do I need to explicitly close the file after opening it with IO.read? How could I use read_nonblock to do this and would it be worth it?
UPDATE:
So I did some more logging and occasionally IO.read returns a value like 1800 bytes when it usually returns ~5800 bytes for a picture. When it returns 1800 bytes, the picture does not show up on the client. This happens fairly randomly when two users are calling the web service.
Thanks
Tom

The IO.read method doesn't do any advisory file locking, and so shouldn't be affected by other concurrent readers. However, if you have code elsewhere in your application which writes to the same path, you need to make sure you update the file atomically. Opening a file in write (not append) mode immediately truncates the file to zero bytes, so until the new version has been written, you could well see empty or partial responses generated from the above snippet.
Assuming you're on a *NIX platform like Linux or OS X, though, you can update a file atomically using code like this:
require 'tempfile'
require 'fileutils'

def safe_write(path, data)
  tmp = Tempfile.new('safe_write')
  tmp.write(data)
  tmp.close
  # Note: the rename is only atomic when the tempfile and the target are on the
  # same filesystem; pass the target's directory as Tempfile.new's second argument if needed.
  FileUtils.mv(tmp.path, path)
end
This will write data to a temporary file, then move it to the "/path/to/file" location atomically, without readers ever seeing the zero-length truncated version.
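For example (a sketch; uploaded_photo_bytes is just a placeholder for whatever produces the new file's contents), the writer goes through safe_write while the reading side from the question stays unchanged:
safe_write("/path/to/file", uploaded_photo_bytes)

@photo_bytes = IO.read("/path/to/file")
send_data(@photo_bytes, :filename => "filename", :type => "filetype", :disposition => "inline")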

Related

Memory is not freed in worker after job ends

Scenario:
I have a job running a process (Sidekiq) in production (Heroku). The process imports data (CSV) from S3 into a DB model using the activerecord-import gem. This gem helps with bulk insertion of data. Thus the dbRows variable holds a considerable amount of memory from all the ActiveRecord objects stored while iterating the CSV lines (all good). Once the data is imported (in: db_model.import dbRows), dbRows is cleared (should be!) and the next object is processed.
Such as: (script simplified for better understanding)
def import
  ....
  s3_objects.contents.each do |obj|
    @cli.get_object({..., key: obj.key}, target: file)
    dbRows = []
    csv = CSV.new(file, headers: false)
    while line = csv.shift
      # >> here dbRows grows and grows and is never freed!
      dbRows << db_model.new(
        field1: field1,
        field2: field2,
        fieldN: fieldN
      )
    end
    db_model.import dbRows
    dbRows = nil # try 1 to free the array
    GC.start     # try 2 to free memory
  end
  ....
end
Issue:
Job memory grows while the process runs BUT once the job is done the memory does not go down. It stays forever and ever!
Debugging, I found that dbRows never seems to be garbage collected.
I learned about retained objects and how memory works in Rails, although I have not yet found a way to apply that to solve my problem.
I would like all the references held by dbRows to be garbage collected once the job finishes, so the worker memory is freed.
Any help appreciated.
UPDATE: I read about WeakRef but I don't know if it would be useful. Any insights there?
Try importing lines from the CSV in batches, e.g. import lines into the DB 1000 at a time so you're not holding onto previous rows, and the GC can collect them. This is good for the database in any case (and for the download from S3, if you hand CSV the IO object from S3).
s3_io_object = s3_client.get_object(*s3_obj_params).body
csv = CSV.new(s3_io_object, headers: true, header_converters: :symbol)
csv.each_slice(1_000) do |row_batch|
  db_model.import ALLOWED_FIELDS, row_batch.map(&:to_h), validate: false
end
Note that I'm not instantiating AR models either to save on memory, and only passing in hashes and telling activerecord-import to validate: false.
Also, where does the file reference come from? It seems to be long-lived.
It's not evident from your example, but is it possible that references to objects are still being held globally by a library or extension in your environment?
Sometimes these things are very difficult to track down, as any code from anywhere that's called (including external library code) could do something like:
Dynamically defining constants, since they never get GC'd
Any::Module::Or::Class.const_set('NewConstantName', :foo)
or adding data to anything referenced/owned by a constant
SomeConstant::Referenceable::Globally.array << foo # array will only get bigger and contents will never be GC'd
Otherwise, the best you can do is use some memory profiling tools, either inside of Ruby (memory profiling gems) or outside of Ruby (job and system logs) to try and find the source.
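For instance, the memory_profiler gem (an assumed choice, not something from your setup) can report which call sites retain objects after a block finishes; a rough sketch:
require 'memory_profiler'

report = MemoryProfiler.report do
  import # run one pass of the job's work inside the block
end

# the "retained" sections list objects still alive after the block, i.e. leak candidates
report.pretty_print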

Creating thread-safe non-deleting unique filenames in ruby/rails

I'm building a bulk-file-uploader. Multiple files are uploaded in individual requests, and my UI provides progress and success/fail. Then, once all files are complete, a final request processes/finalizes them. For this to work, I need to create many temporary files that live longer than a single request. Of course I also need to guarantee filenames are unique across app instances.
Normally I would use Tempfile for easy unique filenames, but in this case it won't work because the files need to stick around until another request comes in to further process them. Tempfile auto-unlinks files when they're closed and garbage collected.
An earlier question here suggests using Dir::Tmpname.make_tmpname but this seems to be undocumented and I don't see how it is thread/multiprocess safe. Is it guaranteed to be so?
In C I would open the file with O_EXCL, which will fail if the file exists. I could then keep trying until I successfully get a handle on a file with a truly unique name. But Ruby's File.open doesn't seem to have an "exclusive" option of any kind. If the file I'm opening already exists, I have to either append to it, open for writing at the end, or empty it.
Is there a "right" way to do this in ruby?
I have worked out a method that I think is safe, but it seems overly complex:
# make a unique filename
time = Time.now
filename = "#{time.to_i}-#{sprintf('%06d', time.usec)}"

# make a tempfile (this is guaranteed to find a unique, creatable name)
data_file = Tempfile.new(["upload", ".data"], UPLOAD_BASE)

# but the file will be deleted automatically, which we don't want, so now link it in a stable location
count = 1
loop do
  begin
    # File.link will raise an exception if the destination path exists
    File.link(data_file.path, File.join(UPLOAD_BASE, "#{filename}-#{count}.data"))
    # so here we know we created a file successfully and nobody else will take it
    break
  rescue Errno::EEXIST
    count += 1
  end
end

# now unlink the original tempfile (it's still writable until it's closed)
data_file.unlink

# ... write to data_file and close it ...
NOTE: This won't work on Windows. Not a problem for me, but reader beware.
In my testing this works reliably. But again, is there a more straightforward way?
I would use SecureRandom.
Maybe something like:
p SecureRandom.uuid #=> "2d931510-d99f-494a-8c67-87feb05e1594"
or
p SecureRandom.hex #=> "eb693ec8252cd630102fd0d0fb7c3485"
You can specify the length, and count on an almost impossibly small chance of collision.
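Applied to your setup it might look something like this (a sketch; uploaded_io stands in for whatever holds the request's file data, and UPLOAD_BASE is your own constant):
require 'securerandom'

# build a collision-resistant name under UPLOAD_BASE
path = File.join(UPLOAD_BASE, "upload-#{SecureRandom.uuid}.data")

# CREAT|EXCL is belt and braces: creation fails if the name somehow already exists
File.open(path, File::WRONLY | File::CREAT | File::EXCL) do |f|
  f.write(uploaded_io.read)
end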
I actually found the answer after some digging. Of course the obvious approach is to see what Tempfile itself does. I just assumed it was native code, but it is not. The source for 1.8.7 can be found here for instance.
As you can see, Tempfile uses an apparently undocumented file mode of File::EXCL. So my code can be simplified substantially:
# make a unique filename
time = Time.now
filename = "#{time.to_i}-#{sprintf('%06d', time.usec)}"

data_file = nil
count = 1
loop do
  begin
    data_file = File.open(File.join(UPLOAD_BASE, "#{filename}-#{count}.data"),
                          File::RDWR | File::CREAT | File::EXCL)
    break
  rescue Errno::EEXIST
    count += 1
  end
end

# ... write to data_file and close it ...
UPDATE: And now I see that this is covered in a prior thread:
How do open a file for writing only if it doesn't already exist in ruby
So maybe this whole question should be marked a duplicate.

Including .xml file to rails and using it

So I have this currency .xml file:
http://www.ecb.int/stats/eurofxref/eurofxref-daily.xml
Now, I am wondering, how can I make my rails application read it? Where do I even have to put it and how do I include it?
I am basically making a currency exchange rate calculator.
And I am going to make the dropdown menu show the currency names from the .xml file so they can be selected.
First of all you're going to have to be able to read the file--I assume you want the very latest from that site, so you'll be making an HTTP request (otherwise, just store the file anywhere in your app and read it with File.read with a relative path). Here I use Net::HTTP, but you could use HTTParty or whatever you prefer.
It looks like it changes on a daily basis, so maybe you'll only want to make one HTTP request every day and cache the file somewhere along with a timestamp.
Let's say you have a directory in your application called rates where we store the cached xml files, the heart of the functionality could look like this (kind of clunky but I want the behaviour to be obvious):
require 'net/http'

def get_rates
  today_path = Rails.root.join 'rates', "#{Date.today.to_s}.xml"
  xml_content = if File.exist? today_path
    # Read it from local storage
    File.read today_path
  else
    # Go get it and store it!
    xml = Net::HTTP.get URI 'http://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml'
    File.write today_path, xml
    xml
  end
  # Now convert that XML to a hash. Lots of ways to do this, but this is very simple XML.
  currency_list = Hash.from_xml(xml_content)["Envelope"]["Cube"]["Cube"]["Cube"]
  # Now currency_list is an Array of hashes, e.g. [{"currency"=>"USD", "rate"=>"1.3784"}, ...]
  # Let's say you want a single hash like "USD" => "1.3784"; you could do a conversion like this:
  Hash[currency_list.map &:values]
end
The important part there is Hash.from_xml. Where you have XML that is essentially key/value pairs, this is your friend. For anything more complicated you will want to look for an XML library like Nokogiri. The ["Envelope"]["Cube"]["Cube"]["Cube"] is digging through the hash to get to the important part.
Now, you can see how sensitive this will be to any changes in the XML structure, and you should make the endpoint configurable, and that hash is probably small enough to cache up in memory, but this is the basic idea.
To get your list of currencies out of the hash just say get_rates.keys.
As long as you understand what's going on, you can make that smaller:
def get_rates
  today_path = Rails.root.join 'rates', "#{Date.today.to_s}.xml"
  Hash[Hash.from_xml(
    if File.exist? today_path
      File.read today_path
    else
      xml = Net::HTTP.get URI 'http://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml'
      File.write today_path, xml
      xml
    end
  )["Envelope"]["Cube"]["Cube"]["Cube"].map &:values]
end
If you do choose to cache the xml you will probably want to automatically clear out old versions of the cached XML file, too. If you want to cache other conversion lists consider a naming scheme derived automatically from the URI, e.g. eurofxref-daily-2013-10-28.xml.
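As an aside, if the feed ever grows beyond simple key/value pairs, the Nokogiri route mentioned above might look roughly like this (an untested sketch, reusing xml_content from get_rates):
require 'nokogiri'

doc = Nokogiri::XML(xml_content)
doc.remove_namespaces! # the ECB feed is namespaced; dropping namespaces keeps the XPath simple
rates = doc.xpath('//Cube[@currency]').each_with_object({}) do |node, h|
  h[node['currency']] = node['rate']
end
# rates => {"USD"=>"1.3784", ...}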
Edit: let's say you want to cache the converted xml in memory--why not!
module CurrencyRetrieval
  def get_rates
    if defined?(@@rates_retrieved) && (@@rates_retrieved == Date.today)
      @@rates
    else
      @@rates_retrieved = Date.today
      @@rates = Hash[Hash.from_xml(Net::HTTP.get URI 'http://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml')["Envelope"]["Cube"]["Cube"]["Cube"].map &:values]
    end
  end
end
Now just include CurrencyRetrieval wherever you need it and you're golden. @@rates and @@rates_retrieved will be stored as class variables in whatever class you include this module within. You must test that this persists between calls in your production setup (otherwise fall back to the file-based approach or store those values elsewhere).
Note, if the XML structure changes, or the XML is unavailable today, you'll want to invalidate @@rates and handle exceptions in some nice way...better safe than sorry.

Stream uploading large files using aws-sdk

Is there a way to stream upload large files to S3 using aws-sdk?
I can't seem to figure it out but I'm assuming there's a way.
Thanks
Update
My memory failed me and I didn't read the quote mentioned in my initial answer correctly (see below), as revealed by the API documentation for (S3Object, ObjectVersion) write(data, options = {}):
Writes data to the object in S3. This method will attempt to
intelligently choose between uploading in one request and using
#multipart_upload.
[...] You can pass :data or :file as the first argument or as options. [emphasis mine]
The data parameter is the one to be used for streaming, apparently:
:data (Object) — The data to upload. Valid values include:
[...] Any object responding to read and eof?; the object must support the following access methods:
read # all at once
read(length) until eof? # in chunks
If you specify data this way, you must also include the
:content_length option.
[...]
:content_length (Integer) — If provided, this option must match the
total number of bytes written to S3 during the operation. This option
is required if :data is an IO-like object without a size method.
[emphasis mine]
The resulting sample fragment might look like this:
# Upload a file.
key = File.basename(file_name)
s3.buckets[bucket_name].objects[key].write(:data => File.open(file_name),
                                           :content_length => File.size(file_name))
puts "Uploading file #{file_name} to bucket #{bucket_name}."
Please note that I still haven't actually tested this, so beware ;)
Initial Answer
This is explained in Upload an Object Using the AWS SDK for Ruby:
Uploading Objects
Create an instance of the AWS::S3 class by providing your AWS credentials.
Use the AWS::S3::S3Object#write method which takes a data parameter and options hash which allow you to upload data from a file, or a stream. [emphasis mine]
The page contains a complete example as well, which uses a file rather than a stream though, the relevant fragment:
# Upload a file.
key = File.basename(file_name)
s3.buckets[bucket_name].objects[key].write(:file => file_name)
puts "Uploading file #{file_name} to bucket #{bucket_name}."
That should be easy to adjust to use a stream instead (if I recall correctly you might just need to replace the file_name parameter with open(file_name) - make sure to verify this though), e.g.:
# Upload a file.
key = File.basename(file_name)
s3.buckets[bucket_name].objects[key].write(:file => open(file_name))
puts "Uploading file #{file_name} to bucket #{bucket_name}."
I don't know how big the files you want to upload are, but for large files a 'pre-signed post' allows the user operating the browser to bypass your server and upload directly to S3. That may be what you need - to free up your server during an upload.
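With the newer aws-sdk-s3 gem (an assumption; the snippets above use the older v1 API) a presigned-URL sketch could look like this, with made-up bucket and key names:
require 'aws-sdk-s3'

# generate a URL the browser can PUT the file to directly,
# so the upload never passes through your own server
obj = Aws::S3::Resource.new(region: 'us-east-1').bucket('my-bucket').object('uploads/big-file.bin')
url = obj.presigned_url(:put, expires_in: 15 * 60) # valid for 15 minutes

# hand `url` to the client; it uploads the file body with a plain HTTP PUT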

Tempfile and Garbage Collection

I have this command in a Rails controller
open(source) { |s| content = s.read }
rss = RSS::Parser.parse(content, false)
and it is resulting in temporary files that are filling up the (scarce) disk space.
I have examined the problem to some extent and it turns out somewhere in the stack this happens:
io = Tempfile.new('open-uri')
but it looks like this Tempfile instance never gets explicitly closed. It's got a
def _close # :nodoc:
method which might fire automatically upon garbage collection?
Any help in knowing what's happening or how to clean up the tempfiles would be helpful indeed.
If you really want to force open-uri not to use a tempfile, you can mess with the OpenURI::Buffer::StringMax constant:
> require 'open-uri'
=> true
> OpenURI::Buffer::StringMax
=> 10240
> open("http://www.yahoo.com")
=> #<File:/tmp/open-uri20110111-16395-8vco29-0>
> OpenURI::Buffer::StringMax = 1_000_000_000_000
(irb):10: warning: already initialized constant StringMax
=> 1000000000000
> open("http://www.yahoo.com")
=> #<StringIO:0x6f5b1c>
That's because of this snippet from open-uri.rb:
class Buffer
  [...]
  StringMax = 10240
  def <<(str)
    [...]
    if [...] StringMax < @size
      require 'tempfile'
It looks like _close closes the file and then waits for garbage collection to unlink (remove) the file. Theoretically you could force unlinking immediately by calling the Tempfile's close! method instead of close, or by calling close(true) (which calls close! internally).
Edit: But the problem is in open-uri, which is out of your hands, and open-uri makes no promises about cleaning up after itself: it just assumes that the garbage collector will finalize all Tempfiles in due time.
In such a case, you are left with no choice but to call the garbage collector yourself using ObjectSpace.garbage_collect (see here). This should cause the removal of all temp files.
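In the controller from the question that could be as blunt as this (a sketch):
content = open(source) { |s| s.read } # open returns the block's value
rss = RSS::Parser.parse(content, false)
ObjectSpace.garbage_collect # finalizes unreferenced Tempfiles, which unlinks them from /tmp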
Definitely not a bug, but faulty error handling of the IO. Buffer's io is either a StringIO if @size is less than 10240 bytes, or a Tempfile if over that amount. The ensure clause in OpenURI.open_uri() is calling close(), but because it could be a StringIO object, which doesn't have a close!() method, it can't just call close!().
The fix, I think, would be either one of these:
The ensure clause checks the class and calls either StringIO#close or Tempfile#close! as needed.
--or--
The Buffer class needs a finalizer that handles the class check and calls the correct method.
Granted, neither of those fixes it if you don't use a block to handle the IO, but I suppose in that case you can do your own checking, since open() returns the IO object, not the Buffer object.
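For example (an untested sketch): if you skip the block form, you can inspect what open() hands back and clean up the temp file yourself:
io = open(source) # a StringIO for small bodies, a Tempfile-backed object for large ones
content = io.read
if io.respond_to?(:close!)
  io.close! # Tempfile: close and unlink the on-disk file right away
else
  io.close  # StringIO: nothing on disk to clean up
end
rss = RSS::Parser.parse(content, false)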
The lib is a big chunk of messy code, imho, so it could use a work-over to clean it up. I think I might do that, just for fun. ^.^
