How to manage a large number of images in a Rails app?

I have to manage around 200 high-quality images in my app. I am currently using Cloudinary to store these images.
But I've seen that many apps use a different domain name to store images and other assets (e.g. "assets.example.com"). If I understand correctly, this is called asset_host in a Rails app. I've found documentation on what it is, but not much on how to set it up or how the files are served.
How do they do something like that? Do they pay for another domain name/server and just use that server to store assets?

200 images is not that much; even at 100 MB per image (original plus downsized variants) that's just 20 GB of storage, which under moderate load can easily be handled by one server without any clouds, additional domains, etc. And since you're already storing them in cloud storage, you don't have to worry.
asset_host is for the asset pipeline (your CSS/JS/images from app/assets, which end up in public/assets), not for the app's managed data.
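For reference, asset_host is set per environment; a minimal sketch (the host name is just a placeholder):

```ruby
# config/environments/production.rb
Rails.application.configure do
  # URLs for compiled assets (public/assets) will point at this host
  # instead of the app's own domain.
  config.asset_host = "https://assets.example.com"
end
```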
In the old days, assets were served from other hosts to get around connection-count limits in browsers (so assets could be downloaded in parallel and the site loaded faster). This is not relevant for modern HTTP/2 (it is even counterproductive, since there is overhead in establishing an additional HTTP connection), unless you're under really high load or have a specific need for it (for example, when deploying in a container it may be useful to store assets separately).
A second benefit is that the browser will not send the app's cookies to the other host, which saves a little bandwidth.
Many sites set up that domain to be handled by the same physical web server.
As for paying for a domain: assets.example.com is a third-level domain of example.com, so if you already own the latter, you own it too; you just need to set up an A (and optionally AAAA) DNS record and the server.

Related

Web application security concerns with user-uploaded files on Amazon S3?

Background
I have a web application where users may upload a wide variety of files (e.g. jpg, png, r, csv, epub, pdf, docx, nb, tex, etc). Currently, we whitelist exactly which file types a user may upload. This limitation is sometimes annoying for users (because they must zip disallowed files, then upload the zip) and for us (because users write to support asking for additional file types to be whitelisted).
Ideal Solution
Ideally, I'd love to whitelist files more aggressively. Specifically, I would like to (1) figure out which file types may be trusted and (2) whitelist them all. Having a larger whitelist would be more convenient for users and would reduce the number of support tickets (if only very slightly).
What I Know
I've done a few hours of research and have identified common problems (e.g. path traversal, placing assets in the root directory, htaccess vulnerabilities, failure to validate mime type, etc). While this research has been interesting, my understanding is that many of these issues are moot (or considerably mitigated) if your assets are stored on Amazon S3 (or a similar cloud storage service), which is how most modern web applications manage user-uploaded files.
Hasn't this question already been asked a zillion times?!
Please don't mistake this as a general "What are the security risks of user-uploaded content?" question. There are already many questions like that and I don't want to rehash that discussion here.
More specifically, my question is, "What risks, if any, exist given a conventional / modern web application setup?" In other words, I don't care about some old PHP app or vulnerabilities related to IE6. What should I be worried about assuming files are stored in a cloud service like Amazon S3?
Context about infrastructure / architecture
So... To answer that, you'll probably need more context about my setup. That said, I suspect this is a relatively common setup and therefore hope the answers will be broadly useful to anyone writing a modern web application.
My stack
Ruby on Rails application, hosted on Heroku
Users may upload a variety of files (via Paperclip)
Server validates both mime type and extension (against a whitelist)
Files are stored on Amazon S3 (with varying ACL permissions)
When a user uploads a file...
The file is uploaded directly to S3 into a tmp folder (it hasn't touched my server yet).
My server then downloads the file from the tmp folder on S3.
Paperclip runs validations and executes any processing (e.g. generating image thumbnails).
Finally, Paperclip places the file(s) back on S3 in their new, permanent location.
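For concreteness, the whitelist validation mentioned above looks roughly like this with Paperclip (the attachment name and allowed types are placeholders):

```ruby
class Article < ActiveRecord::Base
  has_attached_file :document # hypothetical attachment name

  # Whitelist MIME types...
  validates_attachment_content_type :document,
    content_type: %w[image/jpeg image/png application/pdf text/csv]

  # ...and file extensions, so both must match the whitelist.
  validates_attachment_file_name :document,
    matches: [/\.jpe?g\z/i, /\.png\z/i, /\.pdf\z/i, /\.csv\z/i]
end
```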
When a user downloads a file...
The user clicks to download a file, which sends a request to my API (e.g. /api/article/123/download).
Internally, my API reads the file from S3 and then serves it to the user (as an attachment).
Thus the file does briefly pass through my server (i.e. it is not merely a redirect).
From the user's perspective, the file is served from my API (i.e. the user has no idea the files live on S3).
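A rough sketch of that download action (controller and attachment names are hypothetical); the body is read from S3 and returned with an attachment disposition:

```ruby
class DownloadsController < ApplicationController
  def show
    article = Article.find(params[:id])
    attachment = article.document # Paperclip attachment stored on S3

    # Stream the file through the API so the client never sees the S3 URL.
    send_data Paperclip.io_adapters.for(attachment).read,
              filename:    attachment.original_filename,
              type:        attachment.content_type,
              disposition: "attachment"
  end
end
```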
Questions
Given this setup, is it safe to whitelist a wide range of file types?
Are there some types of files that are always best avoided (e.g. JS files)?
Are there any glaring flaws in my setup? I suspect not, but if so, please alert me!

Images located on a separate server -- is there overhead?

I'm planning to upload images to Facebook to my account first, get their "src", and then show them in my Rails app, where the img src will point to the location on Facebook of the images I've uploaded.
Is there overhead in this approach as opposed to having the images on my own website? Will that slow down the server? Will this approach work in general? Is it legal?
No, there is no overhead; in fact, this could actually speed up your app by reducing the number of requests your server receives. This is basically like using a distributed CDN for your JavaScript and CSS.
Typically your Rails server serves an HTML response with links to CSS, JavaScript, and images. The user's browser then starts rendering this HTML and makes requests when it encounters these links. If all these links point back to your server, then your Rails server has to serve these static assets as well (and it can only handle so many requests per second).
In production it's common to put your assets on a CDN such as Amazon's CloudFront to decrease the load on your Rails server. As long as your Facebook image is public, I believe this is actually a good idea.
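For what it's worth, referencing an externally hosted image from a Rails view only requires the full URL (the URL below is a placeholder for the Facebook-hosted src):

```ruby
# In an ERB view: image_tag accepts an absolute URL, so the <img> src points
# straight at the externally hosted file.
image_tag "https://scontent.example.com/photos/123.jpg", alt: "User photo"
```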

PDF caching on heroku with cloudflare

I'm having a problem getting the caching I need to work using CloudFlare.
We use CloudFlare for caching all our assets on S3, which works 100% using a separate cdn subdomain.
We also use CloudFlare for our main site (hosted on Heroku), e.g. www.
My problem is I can't get CloudFlare to cache PDFs that are generated from our Rails app. I'm using the WickedPDF gem to dynamically generate certain PDFs for invoices, etc. I don't want to upload these as files to, say, S3, but we would like CloudFlare to cache them so they don't get generated each and every time, as the time spent generating these PDFs is a little intensive.
CloudFlare is turned on and is "accelerating" the subdomain in question, and we're using SSL, but the PDFs never seem to cache properly.
Is there something else we need to do to ensure these get cached? Or maybe there's another solution that would work on Heroku? (e.g. we can't use page caching since it relies on the filesystem.) I also checked the WickedPDF documentation to see if we could do anything else, but found nothing about expiry controls.
Thanks,
We should actually cache it as long as the resources are on-domain & not being delivered through a third-party resource in some way.
Keep in mind:
1. Our caching depends on the number of requests for the resources (at least three).
2. Caching is very much data center dependent (in other words, if your site receives a lot of traffic at a data center it is going to be cached; if your site doesn't get a lot of traffic in another data center it may not cache).
I would open a support ticket if you're still having issues.
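One thing that often matters here (not stated in the answer above, so treat it as an assumption): the origin response has to be cacheable in the first place. A minimal sketch of a WickedPDF controller action that sets a public Cache-Control header; the controller, route, and TTL are placeholders, and CloudFlare may still need to be told to cache that path (e.g. via a "Cache Everything" page rule):

```ruby
class InvoicesController < ApplicationController
  def show
    invoice = Invoice.find(params[:id])

    # Sets "Cache-Control: public, max-age=43200" on the response so an
    # edge cache is allowed to store the generated PDF.
    expires_in 12.hours, public: true

    respond_to do |format|
      format.pdf do
        render pdf: "invoice-#{invoice.id}" # WickedPDF's render option
      end
    end
  end
end
```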

Scalable way to share cached files across frontend servers

I have multiple backend servers continuously building and refreshing the public parts of an API in order to cache it. What each backend server builds depends on the contents of the job queue.
At a given time,
backend server 1 will build:
/article/1.json
/article/5.json
backend server 2 will build:
/article/3.json
/article/9.json
/article/6.json
I need to serve these files from the front-end servers. The cache is stored as files so that nginx can serve them directly without going through the Rails stack.
The issue is keeping the cache up to date on the front-end servers in a scalable way (adding new servers should be seamless).
I've considered :
NFS / S3 (but too slow)
Memcached (but it can't be served directly from nginx; I might be wrong?)
CouchDB directly serving JSON (I feel this is too big for the job)
Backends writing JSON to Redis, with a job on the front end rewriting the files to the right place (currently my favorite option)
Any experience or ideas on a better way to achieve this?
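To make the last option concrete, here is a rough sketch of what I have in mind; key names, the channel name, and paths are placeholders:

```ruby
require "redis"
require "json"
require "fileutils"

CACHE_ROOT = "/var/www/cache" # directory nginx serves directly

# Back-end side: store the rendered payload and announce the change.
def publish_article(redis, path, payload)
  redis.set("cache:#{path}", payload.to_json)
  redis.publish("cache-updates", path)
end

# Front-end side: a small worker rewrites files as updates arrive.
def run_frontend_worker
  subscriber = Redis.new
  store      = Redis.new # a subscribed connection cannot issue GETs
  subscriber.subscribe("cache-updates") do |on|
    on.message do |_channel, path|
      body = store.get("cache:#{path}")
      next unless body
      file = File.join(CACHE_ROOT, path)
      FileUtils.mkdir_p(File.dirname(file))
      File.write(file, body)
    end
  end
end
```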
You don't say how long it takes to build a single article, but assuming it's not horrifically slow, I think you'd be better off letting the app servers build the pages on the fly and having the front-end servers do the caching. In this scenario you could put some combination of haproxy/varnish/squid/nginx in front of your app servers and let them do the balancing/caching for you.
You could do the same thing I suppose if you continued to build them continuously on the backend.
Your end goal is to have this:

internet -> load balancer -> caching server 1 --> numerous app servers
                         \-> caching server 2 -/
Add more caching servers and app servers as needed. The internet will never know. Depending on what software you pick, the load balancer and caching server might be the same thing or might not; it really depends on your load and particular needs.
If you don't want to hit the Rails stack, you can catch the request with something like rack-cache before it ever reaches the whole app:
http://rtomayko.github.io/rack-cache/
At least that way, you only have to bootstrap rack.
It also supports memcached as a storage mechanism: http://rtomayko.github.io/rack-cache/storage
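A minimal sketch of wiring that up in config.ru, with memcached as the backing store (hosts and namespaces are placeholders):

```ruby
# config.ru
require_relative "config/environment"
require "rack/cache"

# Cacheable responses are answered by Rack::Cache before reaching Rails.
use Rack::Cache,
    verbose:     true,
    metastore:   "memcached://localhost:11211/meta",
    entitystore: "memcached://localhost:11211/body"

run Rails.application
```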
You are right, S3 is pretty slow by itself; HTTPS session setup alone can take up to 5-10 seconds. But S3 is the ideal storage for primary data. We use it a lot, but in combination with an nginx S3 proxy to speed up data delivery and add caching.
The nginx S3 proxy solution is well tested in production and the caching mechanism works perfectly: every application server goes to the proxy, which fetches the original file from S3 and caches it.
To prevent the dog-pile effect you can use:
proxy_cache_lock for new files (see the nginx docs)
proxy_cache_use_stale updating for files being refreshed (see the nginx docs)
For an example nginx S3 proxy configuration, look at this gist: https://gist.github.com/mikhailov/9639593

Using a CDN to store/serve user image uploads?

I'm still new to the whole CDN ideology, so this might be a stupid question, but I'm sure someone can shed some light on this. I've got a basic PHP script that takes user image uploads, resizes them, creates a directory ($user_id), and stores the finished product in that directory (like www.mysite.com/uploads/$user_id/image1.jpg). Works like a charm.
I just got all the hosting stuff squared away and I'm using the Rackspace (Slicehost?) "Cloud Server" architecture. I also signed up for the Rackspace (Mosso?) "Cloud Files". So far so good.
So my question is: should I be storing the images that users upload locally (on my Apache server) or as objects via Cloud Files? It seems like a great idea to separate the static content from my web server so I can use the latter just to generate the dynamic content. But would it be a lot of overhead to create a CDN-enabled container each time a user uploads an image?
Hopefully I'm not missing the boat on this one totally. I can't seem to find a whole lot of info about this, but I'm sure there is a good reason why I should either pursue or avoid this idea. Any suggestions are greatly appreciated!
I am not familiar with Rackspace's offering, but the general logic behind using a CDN for static content is to achieve two goals:
1. Offload the bandwidth and processing of those requests to other servers, freeing up yours.
2. Move the large static content closer to the client.
When you send the generated HTML to the browser, it will "see" the images as www.yourdomain.com/my_image.jpg, for example, and perform additional requests for each piece of static content, potentially starving your server of threads to service requests. If you move all this content onto a CDN, the browser would see something like cdn.yourdomain.com, and it will request the images from the CDN, thus allowing your server to service other requests instead. Additionally, most CDNs distribute your content to multiple locations and route requests geographically to serve the content from the closest possible location, improving the perceived load time for clients.
