The Problem:
I have a Rails app that requires a user to upload some type of spreadsheet (csv, xlsx, xls, etc.) for processing, which can be a costly operation, so we've decided to send it off to a background job. The issue we're concerned about is that because our production system is on Heroku, we need to store the file on Amazon S3 first and retrieve it later for processing.
Because uploading the file to S3 is in itself a costly operation, this should probably also be done as a background job. The problem is the concern that using Resque to do this could eat up a lot of RAM, because Resque would need to put the file data into Redis for later retrieval. As you know, Redis only stores its data in RAM and also prefers simple key-value pairs, so we would like to avoid this.
Here's some pseudocode as an example of what we'd like to try to do:
workers/AS3Uploader.rb
require 'fog'

class AS3Uploader
  @queue = :as3_uploader

  def self.perform(some, file, data)
    # create a connection
    connection = Fog::Storage.new({
      :provider              => 'AWS',
      :aws_access_key_id     => APP_CONFIG['s3_key'],
      :aws_secret_access_key => APP_CONFIG['s3_secret']
    })

    # First, a place to contain the glorious details
    directory = connection.directories.create(
      :key    => "catalog-#{Time.now.to_i}", # globally unique name
      :public => true
    )

    # list directories
    p connection.directories

    # upload that catalog
    file = directory.files.create(
      :key    => 'catalog.xml',
      :body   => File.open(blah), # not sure how to get the file data here without putting it into RAM first via Resque/Redis
      :public => true
    )

    # make a call to enqueue the processing of the catalog
    Resque.enqueue(CatalogProcessor, some, parameters, here)
  end
end
controllers/catalog_upload_controller.rb
def create
  # process params

  # call enqueue to start the file processing
  # What do I do here? I could send all of the file data here right now,
  # but like I said previously that means storing potentially 100s of MB in RAM
  Resque.enqueue(AS3Uploader, some, parameters, here)
end
What I would suggest you do:
store your file in a tmp dir you create and get the file path
tell Resque to upload the file by passing it the file path
have Resque store the file path in Redis, not the whole file content (that would be very expensive)
the worker will then upload the file to AWS S3 (see the sketch below)
Note: If you have multiple instances, e.g. one instance for background processing, one for the database, and one as a utility instance, then your tmp dir may not be available to the other instances, so store the file in a tmp dir inside the instance running Resque.
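A minimal sketch of that flow, reusing the worker shape from the question; the bucket name, the tmp path layout, and the CatalogProcessor argument are placeholders:

# controllers/catalog_upload_controller.rb
def create
  upload = params[:catalog] # the uploaded spreadsheet
  # Copy the upload to a tmp file and enqueue only its path,
  # so Redis never has to hold the file contents.
  tmp_path = Rails.root.join('tmp', "catalog-#{SecureRandom.hex(8)}#{File.extname(upload.original_filename)}").to_s
  FileUtils.cp(upload.tempfile.path, tmp_path)
  Resque.enqueue(AS3Uploader, tmp_path)
  head :accepted
end

# workers/AS3Uploader.rb
require 'fog'

class AS3Uploader
  @queue = :as3_uploader

  def self.perform(file_path)
    connection = Fog::Storage.new({
      :provider              => 'AWS',
      :aws_access_key_id     => APP_CONFIG['s3_key'],
      :aws_secret_access_key => APP_CONFIG['s3_secret']
    })
    directory = connection.directories.get('my-catalog-bucket') # existing bucket, placeholder name
    key = File.basename(file_path)
    # Fog reads from the File object while uploading, so the file contents never pass through Redis
    directory.files.create(:key => key, :body => File.open(file_path), :public => false)
    Resque.enqueue(CatalogProcessor, key)
  ensure
    File.delete(file_path) if File.exist?(file_path)
  end
end

Keep in mind this only works when the web process and the Resque worker share a filesystem; as the note above says, on a separate worker instance (or a separate Heroku dyno) the tmp file won't be visible, so the handoff has to happen somewhere both sides can reach.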
Related
I'm trying to work around a known issue in Active Storage where the MIME type of a stored file is incorrectly set, without the ability to override it.
https://github.com/rails/rails/issues/32632
This has been addressed on the master branch of Rails, however it doesn't appear to be released yet (the project is currently using 5.2.0). Therefore I'm trying to work around the issue using one of the comments provided in the issue:
Within a new initializer (config/initializers/active_record_fix.rb):
Rails.application.config.after_initialize do
  # Defeat the ActiveStorage MIME type detection.
  ActiveStorage::Blob.class_eval do
    def extract_content_type(io)
      return content_type if content_type
      Marcel::MimeType.for io, name: filename.to_s, declared_type: content_type
    end
  end
end
I'm processing and storing a zip file within a background job using delayed_job. The initializer doesn't appear to be getting called. I have restarted the server. I'm running the project locally, using heroku local to process background jobs.
Here is the code storing the file:
file.attach(io: File.open(temp_zip_path), filename: 'Download.zip', content_type: 'application/zip')
Any ideas why the code above is not working? Active Storage likes to somewhat randomly decide this ZIP file is a PDF and save the content type as application/pdf. Unrelated: attempting to manually override the content_type after attaching doesn't work either:
file.content_type = 'application/zip'
file.save # No errors, but record doesn't update the content_type
Try Rails.application.config.to_prepare in place of the after_initialize initialization event.
more info :
https://guides.rubyonrails.org/configuring.html#initialization-events
https://guides.rubyonrails.org/v5.2.0/initialization.html
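For example, a sketch of the initializer from the question rewritten with to_prepare, which the reloader re-runs whenever application code is reloaded, so the patch is re-applied after ActiveStorage::Blob is reloaded in development:

# config/initializers/active_record_fix.rb
Rails.application.config.to_prepare do
  # Defeat the ActiveStorage MIME type detection.
  ActiveStorage::Blob.class_eval do
    def extract_content_type(io)
      return content_type if content_type
      Marcel::MimeType.for io, name: filename.to_s, declared_type: content_type
    end
  end
end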
My questions
Cost-related pitfalls to avoid when deploying a Rails app?
Attacks are welcome, as they would teach me what to expect and what to brace myself against.
I would rather avoid big bills at the end of the month, however.
Easy cloud hosting services to use?
I picked AWS because it seems scalable and I thought I could avoid learning another service later.
I have no regrets, but AWS is overwhelming; if there were a significantly simpler service, I should have used it.
My current concern
A DoS attack or GET-request flooding against AWS S3 could raise my hosting costs significantly, as I'm uploading some content there.
A billing alarm is useful, but without an automatic shutdown I feel a little uncomfortable taking a break and going into a jungle or to an uninhabited island where I have no Internet connection to be informed about, or to shut down, my service...
Obvious fix for my case
Stop using S3 and save user uploads to the database, where I can control scaling behavior. But then, most people seem to be using S3 with CarrierWave, so why?
What I'm doing
Making my first ever home page using:
Elastic Beanstalk
Rails 5
the CarrierWave gem, configured to save user uploads to S3
Edit
In the end, I could not find any real solution to the lack of a spending cap for S3.
What follows is more or less my own notes.
I'm guessing S3 has some basic built-in defenses against attacks, because I have not heard sad stories about people hosting static websites on S3 and getting a bill for over 10,000 USD; that could still happen, though, regardless of how good Amazon's defenses might be.
Mitigation
A script that periodically checks the S3 log files and calls an action that disables S3 resource serving when the cumulative size of those files is too large.
S3 logs sometimes take more than an hour to become available, so this is not a real solution, but it is better than nothing.
require 'aws-sdk-s3'

class LogObserver
  def initialize
    Aws.config.update(
      access_key_id:     ENV['AWS_ACCESS_KEY_ID'],
      secret_access_key: ENV['AWS_SECRET_ACCESS_KEY'],
      region:            'ap-northeast-1'
    )
    @bucket_name = "bucket name that holds s3 log"
    @last_checked_log_timestamp = Time.now.utc
    log "started at: #{Time.now}"
  end

  def run
    bucket = Aws::S3::Resource.new.bucket(@bucket_name)
    loop do
      prv_log_ts = @last_checked_log_timestamp
      log_size = fetch_log_size(bucket)
      log "The total size of S3 log accumulated since this script last checked: #{log_size}"
      time_range = @last_checked_log_timestamp - prv_log_ts # float, in seconds
      log_size_per_second = log_size / time_range
      if log_size_per_second > (500.kilobytes / 60)
        log "Disabling S3 access as S3 log size is greater than expected."
        `curl localhost/static_pages/disable_s3`
      end
      sleep 60 * 60
    end
  end

  def log(text)
    puts text
    File.open('./s3_observer_log.txt', 'a') do |f|
      f << text
    end
  end

  def fetch_log_size(bucket)
    log_size = 0
    bucket.objects(prefix: 'files').each do |o|
      next if o.last_modified < @last_checked_log_timestamp
      @last_checked_log_timestamp = o.last_modified
      log_size += o.size.to_i
    end
    log_size
  end
end
Rake task:
namespace :s3_log do
  desc "Access S3 access log files and check their cumulative size. If the size is above the expected value, disable S3 access."
  task :start_attack_detection_loop do
    require './s3_observer.rb'
    id = Process.fork do
      o = LogObserver.new
      o.run
    end
    puts "Forked a new process that watches the S3 log. Process id: #{id}"
  end
end
controller action:
before_action :ensure_permitted_ip, only: [:enable_s3, :disable_s3]

def enable_s3
  # allow enabling S3 only from localhost
  CarrierWave.configure do |config|
    config.fog_authenticated_url_expiration = 3
  end
end

def disable_s3
  # allow disabling S3 only from localhost
  CarrierWave.configure do |config|
    config.fog_authenticated_url_expiration = 0
  end
end

private

def ensure_permitted_ip
  if request.remote_ip != "127.0.0.1" # allow access only from localhost
    redirect_to root_path
  end
end
Gems:
gem 'aws-sdk-rails'
gem 'aws-sdk-s3', '~> 1'
My experiences are limited but my suggestions would be:
Cost-related pitfalls to avoid when deploying a Rails app?
If you're going to use a background job, consider rufus-scheduler instead of Sidekiq or delayed_job, because it runs on top of your Rails server and does not require additional memory or additional dedicated processes. This allows you to procure the smallest/cheapest possible instance, a t2.nano, which I have done once before; for example:
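A minimal sketch of that approach; the schedule and the CatalogImportJob class are hypothetical stand-ins for your own job logic:

# config/initializers/scheduler.rb
require 'rufus-scheduler'

# The scheduler thread lives inside the Rails server process,
# so no extra worker process or dyno is needed.
scheduler = Rufus::Scheduler.singleton

scheduler.every '10m' do
  # CatalogImportJob is a placeholder for whatever work you would
  # otherwise hand to Sidekiq or delayed_job.
  CatalogImportJob.new.perform
end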
Easy cloud hosting services to use?
Heroku would be a good choice, because it is a lot easier to set up. However, if you're doing this for the experience, I would suggest procuring unmanaged hosting like AWS EC2 or Linode. I migrated my server from AWS to Vpsdime 3 months ago because it's cheap and has lots of memory; so far so good.
My current concern
For CarrierWave, you may restrict S3 access (see the reference). This prevents hotlinking and requires a user to go through your Rails pages first in order to view or download the S3 files. Now that Rails has control over the S3 files, you can simply use something like Rack::Attack to prevent DDoS or excessive requests. If your Rails app sits behind Apache or Nginx, you can set up DDoS rules there instead of using Rack::Attack. Or, if you are going to use an AWS load balancer to manage/route the requests, you can use AWS Shield... haven't really used that yet, though.
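A minimal Rack::Attack sketch of that idea, assuming the rack-attack gem is in the Gemfile; the path prefix and limits are illustrative:

# config/initializers/rack_attack.rb
class Rack::Attack
  # Throttle the pages that redirect to (or serve) S3 files,
  # keyed by client IP: at most 30 requests per minute per IP.
  throttle('downloads/ip', limit: 30, period: 1.minute) do |req|
    req.ip if req.path.start_with?('/downloads')
  end
end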
I'm developing an image editing app in Ruby on Rails, and I want to update my image on AWS S3 cloud storage.
Currently I have a system where the user signs in and then uploads an image to S3 with a CarrierWave uploader (using fog in production);
then an AJAX call to the controller triggers the editing with the mini_magick gem, and finally the image is reloaded.
The problem is that I don't know how to re-upload it to S3 (updating the image). Locally it works fine; the problem is in production on Heroku with S3.
One of the answers was this, but it doesn't work for me: AWS::S3::S3Object.store 's3/path/to/untitled.txt', params[:textarea_contents_params_name], your_bucket_name.
This is my code:
def flip # controller
  imagesource = params["imagesource"].to_s # URL
  image = MiniMagick::Image.open("#{imagesource}")
  image.combine_options do |i|
    i.flip
  end

  # image.write "#{imagesource}" # development

  # production
  AWS::S3::Base.establish_connection!(
    :access_key_id     => 'xxxxxxxxxxxxxx',
    :secret_access_key => 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
  )
  AWS::S3::S3Object.store '#{imagesource}', image, 'mybucket'
  AWS::S3::S3Object.store('#{imagesource}', open(image), 'mybucket') # 2nd attempt

  respond_to do |format|
    format.json { render :json => { :result => image } }
  end
end
You need to write the file to disk (as you do in development) before attempting to upload it to S3. Your code should explicitly write the data to a temporary location (e.g. use a Tempfile) rather than letting that location be controlled by input from the user.
If, as the comment suggests, the user input is a URL, recent security updates may prevent you from directly passing a URL to MiniMagick. If so, download the image first (to a temporary file) and then pass that to MiniMagick.
It looks like you are using aws-s3 for your S3 uploads (super old and unmaintained, etc.). If so, it's not completely clear what the values of the parameters are, but when you upload the image to S3 you probably want to specify only the path portion of the URL as the key. The second argument to store can be either an IO object (such as an instance of File) or a string containing the actual data.
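A minimal sketch of what that looks like with the question's aws-s3 gem: download to a Tempfile, edit it, then store only the path portion of the URL as the key and pass a File as the body. The bucket name and key derivation are illustrative:

require 'open-uri'
require 'tempfile'

def flip
  imagesource = params["imagesource"].to_s # full URL of the stored image

  # Download the image to a temp file instead of handing the URL to MiniMagick.
  temp = Tempfile.new(['flip', File.extname(imagesource)])
  temp.binmode
  temp.write(URI.parse(imagesource).open.read)
  temp.rewind

  image = MiniMagick::Image.open(temp.path)
  image.combine_options { |i| i.flip }
  image.write(temp.path)

  AWS::S3::Base.establish_connection!(
    :access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
    :secret_access_key => ENV['AWS_SECRET_ACCESS_KEY']
  )

  # Key is the path portion of the URL; body is an IO, not the URL string.
  key = URI.parse(imagesource).path.sub(%r{\A/}, '')
  AWS::S3::S3Object.store(key, File.open(temp.path), 'mybucket')

  respond_to do |format|
    format.json { render :json => { :result => key } }
  end
ensure
  temp.close! if temp
end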
I just installed the asset_sync gem and I am trying to get it set up with my AWS account. When I run bundle exec rake assets:precompile I get the following error:
AssetSync::Config::Invalid: Fog provider can't be blank, Fog directory can't be blank
I understand the simple reason that I am getting this error, namely that I haven't pushed the Fog provider or directory settings to Heroku. What I am stumped about is where to put the following code (taken from the Fog README). In config/initializers/fog.rb? Is this all I need to do to start using fog, other than installing the gem?
require 'rubygems'
require 'fog'

# create a connection
connection = Fog::Storage.new({
  :provider              => 'AWS',
  :aws_access_key_id     => YOUR_AWS_ACCESS_KEY_ID,
  :aws_secret_access_key => YOUR_AWS_SECRET_ACCESS_KEY
})

# First, a place to contain the glorious details
directory = connection.directories.create(
  :key    => "fog-demo-#{Time.now.to_i}", # globally unique name
  :public => true
)
Not a problem, getting started tends to be the hardest part.
The answer is: it depends. I'd actually venture to say it would be best to put this in your environment-based initializers, i.e. config/init/development or config/init/production, etc. Relatedly, you probably will not want to generate a new directory every time you start your app (there is an account-level limit of 100 buckets in total, I believe). So you might want to either set a per-environment key for that create call, or simply create the directory somewhere outside the initializers (and within the initializer assume it exists).
If you want to use that directory directly, you'll still need a reference to it, but you can create a local reference without making any API calls by using #new, like this:
directory = connection.directories.new(:key => ...)
As for asset_sync, it needs those keys and a reference to the directory key, which you will probably want to provide via ENV vars (to avoid checking your credentials into version control). You can find details on which keys to set and how here: https://github.com/rumblelabs/asset_sync#built-in-initializer-environment-variables (the readme also describes how to do it via initializers, but that probably isn't the best plan).
Hope that helps!
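For instance, a minimal ENV-driven sketch along those lines; the variable names and the assumption that the bucket already exists are mine, and asset_sync itself reads its own FOG_* variables as described in its README:

# config/initializers/fog.rb
require 'fog'

connection = Fog::Storage.new({
  :provider              => 'AWS',
  :aws_access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
  :aws_secret_access_key => ENV['AWS_SECRET_ACCESS_KEY']
})

# Reference a bucket that was created once, outside the initializer,
# without making an API call here.
DIRECTORY = connection.directories.new(:key => ENV['FOG_DIRECTORY'])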
I have a Rails application hosted on Heroku. The app generates and stores PDF files on Amazon S3. Users can download these files for viewing in their browser or to save on their computer.
The problem I am having is that although downloading these files is possible via the S3 URL (like "https://s3.amazonaws.com/my-bucket/F4D8CESSDF.pdf"), it is obviously NOT a good way to do it. It is not desirable to expose so much information about the backend to the user, not to mention the security issues that arise.
Is it possible to have my app somehow retrieve the file data from S3 in a controller, then create a download stream for the user, so that the Amazon URL is not exposed?
You can create your S3 objects as private and generate temporary public URLs for them with the url_for method (aws-s3 gem). This way you don't stream files through your app servers, which is more scalable. It also allows session-based authorization (e.g. Devise in your app), tracking of download events, etc.
To do this, change the direct links to S3-hosted files into links to a controller action which creates a temporary URL and redirects to it. Like this:
class HostedFilesController < ApplicationController
  def show
    s3_name = params[:id] # sanitize name here, restrict access to only some paths, etc.
    AWS::S3::Base.establish_connection!( ... )
    url = AWS::S3::S3Object.url_for(s3_name, YOUR_BUCKET, :expires_in => 2.minutes)
    redirect_to url
  end
end
Hiding the Amazon domain in download URLs is usually done with DNS aliasing. You need to create a CNAME record aliasing your subdomain, e.g. downloads.mydomain, to s3.amazonaws.com. Then you can specify the :server option in AWS::S3::Base.establish_connection!(:server => "downloads.mydomain", ...) and the S3 gem will use it when generating links.
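For example, a sketch of the same connection configured against the aliased host; the subdomain is a placeholder and the credentials come from ENV:

AWS::S3::Base.establish_connection!(
  :access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
  :secret_access_key => ENV['AWS_SECRET_ACCESS_KEY'],
  :server            => "downloads.mydomain.com"
)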
Yes, this is possible: just fetch the remote file with Rails and either store it temporarily on your server or send it directly from the buffer. The problem with this is, of course, that you need to fetch the file from S3 before you can serve it to the user. See this thread for a discussion; their solution is something like this:
# environment.rb
require 'open-uri'

# controller
def index
  data = open(params[:file])
  send_data data, :filename => params[:name], ...
end
This issue is also somewhat related.
First, you need to create a CNAME record for your domain, as explained here.
Second, you need to create a bucket with the same name that you put in the CNAME.
Finally, add this configuration to config/initializers/carrierwave.rb:
CarrierWave.configure do |config|
  ...
  config.asset_host    = 'http://bucket_name.your_domain.com'
  config.fog_directory = 'bucket_name.your_domain.com'
  ...
end