Creating thread-safe non-deleting unique filenames in ruby/rails - ruby-on-rails

I'm building a bulk-file-uploader. Multiple files are uploaded in individual requests, and my UI provides progress and success/fail. Then, once all files are complete, a final request processes/finalizes them. For this to work, I need to create many temporary files that live longer than a single request. Of course I also need to guarantee filenames are unique across app instances.
Normally I would use Tempfile for easy unique filenames, but in this case it won't work because the files need to stick around until another request comes in to further process them. Tempfile auto-unlinks files when they're closed and garbage collected.
An earlier question here suggests using Dir::Tmpname.make_tmpname but this seems to be undocumented and I don't see how it is thread/multiprocess safe. Is it guaranteed to be so?
In c I would open the file O_EXCL which will fail if the file exists. I could then keep trying until I successfully get a handle on a file with a truly unique name. But ruby's File.open doesn't seem to have an "exclusive" option of any kind. If the file I'm opening already exists, I have to either append to it, open for writing at the end, or empty it.
Is there a "right" way to do this in ruby?
I have worked out a method that I think is safe, but is seems overly complex:
# make a unique filename
time = Time.now
filename = "#{time.to_i}-#{sprintf('%06d', time.usec)}"
# make tempfiles (this is gauranteed to find a unique creatable name)
data_file = Tempfile.new(["upload", ".data"], UPLOAD_BASE)
# but the file will be deleted automatically, which we don't want, so now link it in a stable location
count = 1
loop do
begin
# File.link will raise an exception if the destination path exists
File.link(data_file.path, File.join(UPLOAD_BASE, "#{filename}-#{count}.data"))
# so here we know we created a file successfully and nobody else will take it
break
rescue Errno::EEXIST
count += 1
end
end
# now unlink the original tempfiles (they're still writeable until they're closed)
data_file.unlink
# ... write to data_file and close it ...
NOTE: This won't work on Windows. Not a problem for me, but reader beware.
In my testing this works reliably. But again, is there a more straightforward way?

I would use SecureRandom.
Maybe something like:
p SecureRandom.uuid #=> "2d931510-d99f-494a-8c67-87feb05e1594"
or
p SecureRandom.hex #=> "eb693ec8252cd630102fd0d0fb7c3485"
You can specify the length, and count on an almost impossibly small chance of collision.

I actually found the answer after some digging. Of course the obvious approach is to see what Tempfile itself does. I just assumed it was native code, but it is not. The source for 1.8.7 can be found here for instance.
As you can see, Tempfile uses an apparently undocumented file mode of File::EXCL. So my code can be simplified substantially:
# make a unique filename
time = Time.now
filename = "#{time.to_i}-#{sprintf('%06d', time.usec)}"
data_file = nil
count = 1
loop do
begin
data_file = File.open(File.join(UPLOAD_BASE, "#{filename}-#{count}.data"), File::RDWR|File::CREAT|File::EXCL)
break
rescue Errno::EEXIST
count += 1
end
end
# ... write to data_file and close it ...
UPDATE And now I see that this is covered in a prior thread:
How do open a file for writing only if it doesn't already exist in ruby
So maybe this whole question should be marked a duplicate.

Related

Recursive directory search in Rails

So I have a function here that should take the path to an archive.zip as an argument, and recursively dive in to every sub directory until it finds a file with the extension .html
def path_to_project
"#{recursive_find(File.dirname(self.folder.path))}"
end
path_to_project is used to apply this recursive_find process on the fly as a string, since it's used repeatedly in the process as a whole.
def recursive_find(proj_path)
Dir.glob("#{proj_path}/*") do |object_path|
if object_path.split(".").last == "html"
#found_it = File.dirname(object_path)
else
recursive_find(object_path)
end
end
#found_it
end
Anyways, I have two questions for the smart folks of stackoverflow-
1- Is my use of the #found_it instance variable correct? , perhaps I should use attr_accessor :found_it instead? Obviously named something else that isn't stupid.. maybe :html_file.
Perhaps -
unless #found_it
# do the whole recursive thing
end
return #found_it
# I don't actually have to return the variable right?
2 - Could my recursive method be better? I realize this is pretty open ended so by all means, flame away you angry dwellers. I gladly accept your harsh criticisms and whole heartedly appreciate your good advices :)
If you don't need to use recursion you could just do
Dir["#{proj_path}"/**/*.html"] That should give you a list of all the files that have html extension.
As far as your questions: Your use of #found_it depends on the bigger scope of things. Where is this function defined a class or a module? The name of the variable itself could more meaningful like #html_file or maybe what the context of the file is like #result_page.
Ruby's Find class is a very easy to use, and scalable solution. It descends into a directory hierarchy, returning each element as it is encountered. You can tell it to recurse into particular directories, to ignore files and directories based on attributes, and it is very fast.
This is the example from the documentation:
require 'find'
total_size = 0
Find.find(ENV["HOME"]) do |path|
if FileTest.directory?(path)
if File.basename(path)[0] == ?.
Find.prune # Don't look any further into this directory.
else
next
end
else
total_size += FileTest.size(path)
end
end
I use this to do several scans of directories containing thousands of files. It's easily as fast as using a glob.

Including .xml file to rails and using it

So I have this currency .xml file:
http://www.ecb.int/stats/eurofxref/eurofxref-daily.xml
Now, I am wondering, how can I make my rails application read it? Where do I even have to put it and how do I include it?
I am basically making a currency exchange rate calculator.
And I am going to make the dropdown menu have the currency names from the .xml table appear in it and be usable.
First of all you're going to have to be able to read the file--I assume you want the very latest from that site, so you'll be making an HTTP request (otherwise, just store the file anywhere in your app and read it with File.read with a relative path). Here I use Net::HTTP, but you could use HTTParty or whatever you prefer.
It looks like it changes on a daily basis, so maybe you'll only want to make one HTTP request every day and cache the file somewhere along with a timestamp.
Let's say you have a directory in your application called rates where we store the cached xml files, the heart of the functionality could look like this (kind of clunky but I want the behaviour to be obvious):
def get_rates
today_path = Rails.root.join 'rates', "#{Date.today.to_s}.xml"
xml_content = if File.exists? today_path
# Read it from local storage
File.read today_path
else
# Go get it and store it!
xml = Net::HTTP.get URI 'http://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml'
File.write today_path, xml
xml
end
# Now convert that XML to a hash. Lots of ways to do this, but this is very simple xml.
currency_list = Hash.from_xml(xml_content)["Envelope"]["Cube"]["Cube"]["Cube"]
# Now currency_list is an Array of hashes e.g. [{"currency"=>"USD", "rate"=>"1.3784"}, ...]
# Let's say you want a single hash like "USD" => "1.3784", you could do a conversion like this
Hash[currency_list.map &:values]
end
The important part there is Hash.from_xml. Where you have XML that is essentially key/value pairs, this is your friend. For anything more complicated you will want to look for an XML library like Nokogiri. The ["Envelope"]["Cube"]["Cube"]["Cube"] is digging through the hash to get to the important part.
Now, you can see how sensitive this will be to any changes in the XML structure, and you should make the endpoint configurable, and that hash is probably small enough to cache up in memory, but this is the basic idea.
To get your list of currencies out of the hash just say get_rates.keys.
As long as you understand what's going on, you can make that smaller:
def get_rates
today_path = Rails.root.join 'rates', "#{Date.today.to_s}.xml"
Hash[Hash.from_xml(if File.exists? today_path
File.read today_path
else
xml = Net::HTTP.get URI 'http://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml'
File.write today_path, xml
xml
end)["Envelope"]["Cube"]["Cube"]["Cube"].map &:values]
end
If you do choose to cache the xml you will probably want to automatically clear out old versions of the cached XML file, too. If you want to cache other conversion lists consider a naming scheme derived automatically from the URI, e.g. eurofxref-daily-2013-10-28.xml.
Edit: let's say you want to cache the converted xml in memory--why not!
module CurrencyRetrieval
def get_rates
if defined?(##rates_retrieved) && (##rates_retrieved == Date.today)
##rates
else
##rates_retrieved = Date.today
##rates = Hash[Hash.from_xml(Net::HTTP.get URI 'http://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml')["Envelope"]["Cube"]["Cube"]["Cube"].map &:values]
end
end
end
Now just include CurrencyRetrieval wherever you need it and you're golden. ##rates and ##rates_retrieved will be stored as class variables in whatever class you include this module within. You must test that this persists between calls in your production setup (otherwise fall back to the file-based approach or store those values elsewhere).
Note, if the XML structure changes, or the XML is unavailable today, you'll want to invalidate ##rates and handle exceptions in some nice way...better safe than sorry.

Using and Editing Class Variables in Ruby?

So I've done a couple of days worth of research on the matter, and the general consensus is that there isn't one. So I was hoping for an answer more specific to my situation...
I'm using Rails to import a file into a database. Everything is working regarding the import, but I'm wanting to give the database itself an attribute, not just every entry. I'm creating a hash of the file, and I figured it'd be easiest to just assign it to the database (or the class).
I've created a class called Issue (and thus an 'issues' database) with each entry having a couple of attributes. I was wanting to figure out a way to add a class variable (at least, that's what I think is the best option) to Issue to simply store the hash. I've written a rake to import the file, iff the new file is different than the previous file imported (read, if the hash's are different).
desc "Parses a CSV file into the 'issues' database"
task :issues, [:file] => :environment do |t, args|
md5 = Digest::MD5.hexdigest(args[:file])
puts "1: Issue.md5 = #{Issue.md5}"
if md5 != Issue.md5
Issue.destroy_all()
#import new csv file
CSV.foreach(args[:file]) do |row|
issue = {
#various attributes to be columns...
}
Issue.create(issue)
end #end foreach loop
Issue.md5 = md5
puts "2: Issue.md5 = #{Issue.md5}"
end #end if statement
end #end task
And my model is as follows:
class Issue < ActiveRecord::Base
attr_accessible :md5
##md5 = 5
def self.md5
##md5
end
def self.md5=(newmd5)
##md5 = newmd5
end
attr_accessible #various database-entry attributes
end
I've tried various different ways to write my model, but it all comes down to this. Whatever I set the ##md5 in my model, becomes a permanent change, almost like a constant. If I change this value here, and refresh my database, the change is noted immediately. If I go into rails console and do:
Issue.md5 # => 5
Issue.md5 = 123 # => 123
Issue.md5 # => 123
But this change isn't committed to anything. As soon as I exit the console, things return to "5" again. It's almost like I need a .save method for my class.
Also, in the rake file, you see I have two print statements, printing out Issue.md5 before and after the parse. The first prints out "5" and the second prints out the new, correct hash. So Ruby is recognizing the fact that I'm changing this variable, it's just never saved anywhere.
Ruby 1.9.3, Rails 3.2.6, SQLite3 3.6.20.
tl;dr I need a way to create a class variable, and be able to access it, modify it, and re-store it.
Fixes please? Thanks!
There are a couple solutions here. Essentially, you need to persist that one variable: Postgres provides a key/value store in the database, which would be most ideal, but you're using SQLite so that isn't an option for you. Instead, you'll probably need to use either redis or memcached to persist this information into your database.
Either one allows you to persist values into a schema-less datastore and query them again later. Redis has the advantage of being saved to disk, so if the server craps out on you you can get the value of md5 again when it restarts. Data saved into memcached is never persisted, so if the memcached instance goes away, when it comes back md5 will be 5 once again.
Both redis and memcached enjoy a lot of support in the Ruby community. It will complicate your stack slightly installing one, but I think it's the best solution available to you. That said, if you just can't use either one, you could also write the value of md5 to a temporary file on your server and access it again later. The issue there is that the value then won't be shared among all your server processes.

Does ruby IO.read() lock?

I have a webservice method that reads a photo and returns its byte data. I'm currently doing the following:
#photo_bytes = IO.read("/path/to/file")
send_data(#photo_bytes, :filename => "filename", :type => "filetype", :disposition => "inline")
I'm getting some strange behavior when calling this a lot... occasionally send_data is returning null. I'm thinking that maybe I'm getting read contention if a file hasn't been closed yet. Do I need to explicitly close the file after opening it with IO.read? How could I use read_nonblock to do this and would it be worth it?
UPDATE:
So I did some more logging and occasionally IO.read is returning a value like 1800 bytes when it usually returns ~5800 bytes for a picture. When it return 1800 bytes the picture does not show up on the client. This happens fairly randomly when two users are calling the web service.
Thanks
Tom
The IO.read method doesn't do any advisory file locking, and so shouldn't be affected by other concurrent readers. However, if you have code elsewhere in your application which writes to the same path, you need to make sure you update the file atomically. Opening a file in write (not append) mode immediately truncates the file to zero bytes, so until the new version had been written, you could well see empty responses generated from the above snippet.
Assuming you're on a *NIX platform like Linux or OS X, though, you can update a file atomically using code like this:
require 'tempfile'
require 'fileutils'
def safe_write(path, data)
tmp = Tempfile.new
tmp.write(data)
tmp.close
FileUtils.mv(tmp.path, path)
end
This will write data to a temporary file, then move it to the "/path/to/file" location atomically, without readers ever seeing the zero-length truncated version.

Tempfile and Garbage Collection

I have this command in a Rails controller
open(source) { |s| content = s.read }
rss = RSS::Parser.parse(content, false)
and it is resulting in temporary files that are filling up the (scarce) disk space.
I have examined the problem to some extent and it turns out somewhere in the stack this happens:
io = Tempfile.new('open-uri')
but it looks like this Tempfile instance never gets explicitly closed. It's got a
def _close # :nodoc:
method which might fire automatically upon garbage collection?
Any help in knowing what's happening or how to clean up the tempfiles would be helpful indeed.
If you really want to force open-uri not to use a tempfile, you can mess with the OpenURI::Buffer::StringMax constant:
> require 'open-uri'
=> true
> OpenURI::Buffer::StringMax
=> 10240
> open("http://www.yahoo.com")
=> #<File:/tmp/open-uri20110111-16395-8vco29-0>
> OpenURI::Buffer::StringMax = 1_000_000_000_000
(irb):10: warning: already initialized constant StringMax
=> 1000000000000
> open("http://www.yahoo.com")
=> #<StringIO:0x6f5b1c>
That's because of this snippet from open-uri.rb:
class Buffer
[...]
StringMax = 10240
def <<(str)
[...]
if [...] StringMax < #size
require 'tempfile'
it looks like _close closes the file and then waits for garbage collection to unlink (remove) the file. Theoretically you could force unlinking immediately by calling the Tempfile's close! method instead of close, or to call close(true) (which calls close! internally).
edit: But the problem is in open-uri, which is out of your hands - and that makes no promises for cleaning up after itself: it just assumes that the garbage collector will finalize all Tempfiles in due time.
In such a case, you are left with no choice but to call the garbage collector yourself using ObjectSpace.garbage_collect (see here). This should cause the removal of all temp files.
Definitely not a bug, but faulty error handling the IO. Buffer.io is either StringIO if the #size is less than 10240 bytes or a Tempfile if over that amount. The ensure clause in OpenURI.open_uri() is calling close(), but because it could be a StringIO object, which doesn't have a close!() method, it just can't just call close!().
The fix, I think, would be either one of these:
The ensure clause checks for class and calls either StringIO.close or Tempfile.close! as needed.
--or--
The Buffer class needs a finalizer that handles the class check and calls the correct method.
Granted, neither of those fix it if you don't use a block to handle the IO, but I suppose in that case, you can do your own checking, since open() returns the IO object, not the Buffer object.
The lib is a big chunk of messy code, imho, so it could use a work-over to clean it up. I think I might do that, just for fun. ^.^

Resources