Extracting uploaded archive to S3 with CarrierWave on Heroku

I want to do something that I thought would be a simple task:
Have a form with these controls:
File upload for one file
Checkbox if this file should be extracted
Text input where I would specify which file I should link to (required only if the checkbox is checked) - index_file
After submitting form:
If the checkbox isn't checked, upload the file via CarrierWave to S3 to the specified store_dir
If the checkbox is checked, extract all files from the archive (I expect only ZIP archives; I need to keep the directory structure), upload the extracted files to the specified store_dir, and set the index_file in the database (I don't need to save anything about the other extracted files to the database)
As I have found, this isn't an easy task because of Heroku's limitations. These files will be large (hundreds of MiB or a few GiB), so I don't want to redownload them from S3 if possible.
I think that using Delayed Job or Resque might work, but I'm not exactly sure how to do it or what the best solution to my problem would be.
Does anyone have any idea how to solve this using as few resources as possible? I can change CarrierWave to another uploader (Paperclip, etc.) and change my hosting provider too if this isn't possible on Heroku.
I was also thinking about using CloudFlare; would this still work without problems?
Thank you for your answers.

Based on this Heroku support email, it would seem that the /tmp directory is many gigabytes in size. You just need to clean up after yourself, so Heroku as a platform is not the issue.
A couple of articles may help you solve the problem:
https://github.com/jnicklas/carrierwave/wiki/How-to%3A-Make-Carrierwave-work-on-Heroku - which explains how to configure your app to use the /tmp directory as the cache directory for CarrierWave. Pay attention to the following line:
use Rack::Static, :urls => ['/carrierwave'], :root => 'tmp' # adding this line
This instructs Rack to serve /carrierwave/xyz from the /tmp directory (useful for storing images temporarily).
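For reference, the configuration from the wiki boils down to pointing CarrierWave's cache at tmp/carrierwave, roughly along these lines (a sketch based on the linked article, not a verbatim copy):

CarrierWave.configure do |config|
  config.root = Rails.root.join('tmp') # Heroku's writable directory
  config.cache_dir = 'carrierwave'     # cached files land in tmp/carrierwave
end

With that in place, the Rack::Static line above serves the cached files at /carrierwave/... while they wait to be promoted to S3.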
Then, using the uploader.cache! method, you can deliberately cache the inbound uploaded file. Once cached, you can run checks to decide whether to call the uploader.store! method, which will promote the contents to S3 (assuming you configured S3 as the store for CarrierWave).
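Putting the pieces together for the original question, a rough sketch of a worker might look like this. It assumes the rubyzip gem for extraction, a queue such as Delayed Job or Resque to stay within Heroku's request timeout, and a hypothetical ArchiveUploader; none of these names come from the answer above:

require 'zip'      # rubyzip gem -- an assumed dependency
require 'tmpdir'
require 'fileutils'

class ArchiveUploadJob
  # tmp_path is the file cached under /tmp; extract mirrors the form checkbox.
  def self.perform(tmp_path, extract)
    if extract
      extract_dir = Dir.mktmpdir
      Zip::File.open(tmp_path) do |zip|
        zip.each do |entry|
          dest = File.join(extract_dir, entry.name)
          FileUtils.mkdir_p(File.dirname(dest))
          entry.extract(dest) # keeps the archive's directory structure
        end
      end
      # Store each extracted file; mapping the relative path into store_dir
      # is left to the uploader's configuration.
      Dir.glob("#{extract_dir}/**/*").select { |f| File.file?(f) }.each do |f|
        ArchiveUploader.new.store!(File.open(f))
      end
      FileUtils.remove_entry(extract_dir)
    else
      uploader = ArchiveUploader.new
      uploader.cache!(File.open(tmp_path)) # the deliberate cache described above
      uploader.store!                      # promote the cached file to S3
    end
  ensure
    FileUtils.rm_f(tmp_path) # clean up /tmp, per the Heroku advice above
  end
end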

Related

Basic file uploading in Rails

I'm having a lot of trouble finding a solution to uploading a file to a folder in Rails.
I have a file that I need to upload to a specific folder in the app ('public/uploads') with a specific name. Each time I upload, I need to run a pre-existing background job, which will remove the file after it's done.
If it happens that a file already exists, it should just overwrite it.
I can't find a solution that covers this. All the examples are things about attaching a file to an instance of a model and storing it in my DB. I don't need that. That's overkill for my scenario.
Just upload a file to a folder, simple as that.
Suggestions?
You can modify the file path. The easiest strategy is to add a randomly generated or sequential subfolder/filename prefix for each upload.
So Rails.root.join('public', 'uploads', uploaded_file.original_filename) becomes Rails.root.join('public', 'uploads', "#{my_random_value}-#{uploaded_file.original_filename}").
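As a minimal sketch of that strategy (the action name and params key are illustrative, and SecureRandom stands in for my_random_value):

require 'securerandom'

def upload
  uploaded_file = params[:file]
  prefix = SecureRandom.hex(8) # any unique value works: a sequence, a UUID...
  path = Rails.root.join('public', 'uploads',
                         "#{prefix}-#{uploaded_file.original_filename}")
  File.open(path, 'wb') { |file| file.write(uploaded_file.read) }
  head :ok
end

The pre-existing background job can then pick the file up at that path and remove it when finished.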

Rails file upload: upload a folder

I work on a Rails project, and the client asked if I can add an 'upload a folder' feature to the simple file upload system we have now. Currently it attaches files to a model and then displays them on a page for download. Pretty basic.
But I can't figure out how to handle folder uploads, with every folder having its own content. Are there any pre-made gems that can help accomplish that?
We use Paperclip at the moment, but I don't mind migrating to CarrierWave or some other gem that would support this.
UPDATE: I see that I was unclear about my needs. I need an upload system that can handle folders. Something like this:
In Dropbox I am able to upload both files and folders. How can I make my uploaders accept folders and then display them alongside regular attached files?
You can solve this using Paperclip's interpolations, which let you name the folder dynamically. To do so:
Specify the desired path in the model:
:path => ":folder/:id_:filename"
and define the interpolation as a private method in the same model, or globally in an initializer:
Paperclip.interpolates :folder do |attachment, style|
  attachment.instance.name # the model's name attribute becomes the folder
end
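Tying it together, a hypothetical model using that interpolation might look like this (Document and its name column are illustrative, not from the answer above):

class Document < ActiveRecord::Base
  # :folder is resolved by the interpolation above to attachment.instance.name
  has_attached_file :attachment, :path => ":folder/:id_:filename"
end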

How can I migrate CarrierWave files to a new storage mechanism?

I have a Ruby on Rails site with models using CarrierWave for file handling, currently with local storage. I want to start using cloud storage, and I need to migrate existing local files to the cloud. Can anyone point out a method for doing this?
Bonus points for using a model attribute that would allow me to do this row-by-row in the background without interrupting my site for extended downtime (in other words, some model rows would still have local storage while others used cloud storage).
My first instinct is to create a new uploader for each model that uses cloud storage, so I would have two uploaders on each model, then transfer the files from one to the other, setting an attribute to indicate which file should be used until they are all transferred, and finally remove the old uploader. That seems a little excessive.
Minimal to Possibly Zero Downtime Procedure
In my opinion, the easiest and fastest way to accomplish what you want with almost no downtime is this (I will assume AWS, but a similar procedure applies to any cloud service):
Figure out and set up your assets bucket, bucket policies, etc. to make the assets publicly accessible.
Using s3cmd (a command line tool for interacting with S3) or a GUI app, copy the entire assets folder from the file system to the appropriate folder in S3.
In your app, set up CarrierWave and update your models/uploaders for :fog storage (a sample initializer appears after the PS below).
Do not restart your application yet. Instead, bring up a Rails console and, for your models, check that the new asset URLs are correct and accessible as planned. For example, for a video model with a picture asset, you can check this way:
Video.first.picture.url
This will give you a full cloud URL based on the updated settings. Copy the URL and paste it into a browser to make sure you can access it.
If this works for at least one instance of each model that has assets, you are good to restart your application.
Upon restart, all your assets are served from the cloud, and you didn't need any migrations or multiple uploaders in your models.
(Based on a comment by Frederick Cheung): using s3cmd (or something similar), rsync or sync the assets folder from the filesystem to S3 again, to account for any assets uploaded between steps 2 and 5.
PS: If you need help setting up CarrierWave for cloud storage, let me know.
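For those who do: a minimal CarrierWave/fog initializer might look like the following sketch; the bucket name, region, and credential environment variables are placeholders, not values from this answer:

# config/initializers/carrierwave.rb
CarrierWave.configure do |config|
  config.storage = :fog
  config.fog_credentials = {
    :provider              => 'AWS',
    :aws_access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
    :aws_secret_access_key => ENV['AWS_SECRET_ACCESS_KEY'],
    :region                => 'us-east-1' # your bucket's region
  }
  config.fog_directory = 'my-assets-bucket' # the bucket from step 1
  config.fog_public = true                  # assets are publicly accessible
end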
I'd try the following steps:
Change the storage in the uploaders to :fog or whatever you want to use.
Write a migration, e.g. rails g migration MigrateFiles, to let CarrierWave fetch the current files, process them, and upload them to the cloud.
If your model looks like this:
class Video < ActiveRecord::Base
  mount_uploader :attachment, VideoUploader
end
The migration would look like this:
videos = Video.all
videos.each do |video|
  video.remote_attachment_url = video.attachment_url
  video.save
end
If you execute this migration, the following should happen:
CarrierWave downloads each image, because you specified a remote URL for the attachment (its current location, like http://test.com/images/1.jpg), and saves it to the cloud, because you changed that in the uploader.
Edit:
As San pointed out, this will not work directly, so you should perhaps create an extra column first, run a migration to copy the current attachment URLs of all videos into that column, change the uploader after that, and then run the above migration using the copied URLs in the new column. With another migration, just delete the column again. Not that clean and easy, but done in a few minutes.
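A sketch of that workaround, with illustrative class and column names (legacy_attachment_url is not from the original answer):

class CopyLegacyAttachmentUrls < ActiveRecord::Migration
  def up
    add_column :videos, :legacy_attachment_url, :string
    Video.reset_column_information
    # Copy the current local-storage URLs before switching the uploader to :fog.
    Video.find_each do |video|
      video.update_column(:legacy_attachment_url, video.attachment_url)
    end
  end
end

Then, after switching the uploader's storage, re-run the loop from above against the copied column:

Video.find_each do |video|
  video.remote_attachment_url = video.legacy_attachment_url
  video.save
end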
When we use Heroku, most people suggest using Cloudinary. Free and simple to set up.
My case: we were using the Cloudinary service and needed to move to AWS S3 for various reasons.
This is what I did with the uploader:
class AvatarUploader < CarrierWave::Uploader::Base
  # Pick the storage backend based on the environment variable.
  def self.set_storage
    if ENV['UPLOADER_SERVICE'] == 'aws'
      :fog
    else
      nil
    end
  end

  if ENV['UPLOADER_SERVICE'] == 'aws'
    include CarrierWave::MiniMagick
  else
    include Cloudinary::CarrierWave
  end

  storage set_storage
end
Also, set up the rake task:
task :migrate_cloudinary_to_aws do
  # Remember the old (Cloudinary-backed) records before switching providers.
  profile_image_old_url = []
  Profile.where("picture IS NOT NULL").each do |profile_image|
    profile_image_old_url << profile_image
  end

  # Flip the provider and reload the uploader class so it picks up :fog.
  ENV['UPLOADER_SERVICE'] = 'aws'
  load("#{Rails.root}/app/uploaders/avatar_uploader.rb")

  Profile.where("picture IS NOT NULL OR cover IS NOT NULL").each do |profile_image|
    old_profile_image = profile_image_old_url.detect { |image| image.id == profile_image.id }
    next unless old_profile_image # guard: cover-only profiles have no saved picture
    profile_image.remote_picture_url = old_profile_image.picture.url
    profile_image.save
  end
end
The trick is switching the uploader's provider via an environment variable. Good luck!
I migrated the CarrierWave files to Amazon S3 with s3cmd and it works.
Here are the steps to follow:
Change the storage kind of the uploader to :fog.
Create a bucket on Amazon S3 if you don't already have one.
Install s3cmd on the remote server: sudo apt-get install s3cmd
Configure s3cmd: s3cmd --configure
You will need to enter the access key and secret key here, provided by Amazon.
Sync the files with this command: s3cmd sync /path_to_your_files s3://bucket_name/
Set the --acl-public flag to upload the files as public and avoid permission issues.
Restart your server
Notes:
sync will not duplicate your records; it first checks whether each file is already present on the remote server.

Strategy for avoiding file upload naming conflicts

I have a webapp in Rails which has an AJAX file upload feature. Files are uploaded to a remote server (AWS S3). My current strategy is to upload the files to a temp/ directory (with their original names) until the user submits the form, and then rename them to their definitive names.
But the problem is that if multiple users try to upload two files with the same name at the same time, one is going to overwrite the other.
The strategy I was thinking of to solve this was to generate a random SHA1 when the upload page is loaded, store it in a local table to make sure it's unique, and remove it when the temp file is renamed.
Do you see problems with this approach?
What's a good strategy to solve this problem?
One problem is that if they navigate away from the page without uploading anything, their hash will stay in the database and eventually make a mess. I would avoid storing anything this temporary in the database.
Rather than trying to come up with your own way to name temporary files, why not use Ruby's tempfile library, which will do it for you?
Originally, I thought you were uploading the files to the Ruby server and pushing them to S3 yourself. Tempfiles won't help if users are uploading files directly. If you just want unique names for your temp files, a UUID generator might work for you. There is a Ruby UUID generator gem designed not to produce duplicates, even in a distributed setting. If you name your files with these, you shouldn't need to store anything in the database.
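As a sketch, Ruby's standard-library SecureRandom.uuid gives you the same property without an extra gem (the temp/ key layout follows the question's scheme):

require 'securerandom'

def temp_key_for(original_filename)
  # A fresh UUID per upload means no table of issued names is needed.
  "temp/#{SecureRandom.uuid}-#{original_filename}"
end

temp_key_for('report.pdf')
# => "temp/f81d4fae-7dec-...-report.pdf" (unique even across concurrent uploads)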

carrierwave upload caching

How does CarrierWave's upload caching functionality work? From what I've read, it keeps the uploaded file in public/uploads/tmp to avoid re-uploading across form redisplays. I am guessing the cache entry gets a unique ID but is still publicly accessible. How can I make it more secure for sensitive uploads, or disable this feature altogether?
One way to avoid this is to make the uploader a separate model from the target model, so that validation errors won't require re-uploading.
CarrierWave keeps uploaded images in a cache dir so you can easily re-submit forms in case of validation errors without forcing your users to re-upload images.
The default cache dir is public/uploads/tmp, but you can change it by setting the cache_dir configuration parameter.
Usually uploaded images are available for download without authentication, so placing uploaded and cached files in a public directory is fine. You can also change your uploader class to have a filename method that generates a unique random ID, making files less guessable. A sketch of both follows.
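This sketch combines both suggestions; the tmp/uploads path and token length are choices, not CarrierWave defaults:

require 'securerandom'

class SecureUploader < CarrierWave::Uploader::Base
  # Cache outside the web root so half-finished uploads are never served publicly.
  def cache_dir
    Rails.root.join('tmp', 'uploads').to_s
  end

  # Hard-to-guess filename, stable across calls for the same model instance.
  def filename
    "#{secure_token}#{File.extname(super)}" if original_filename.present?
  end

  private

  def secure_token
    var = :"@#{mounted_as}_secure_token"
    model.instance_variable_get(var) ||
      model.instance_variable_set(var, SecureRandom.hex(16))
  end
end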
By the way, this blog post describes how to integrate CarrierWave while storing and transforming images in the cloud and delivering through a CDN.
