Where to save scraped images? - ruby-on-rails

I'm building a Ruby on Rails app that scrapes the images off a website. What is the best location to save these images to?
Edit:
To be clear, I know the file system is the best type of storage, but where on the file system? I suppose I have to stay in the RoR app directory, but which folder is most suitable for this? public?

You can store them on your file server (a static Apache server), on your app server (save them somewhere on the local disk and serve them via the app server), or on Amazon S3.
I would suggest not storing them in the database. (Some people think that's all right, so I'll leave it as a suggestion.)
In RoR, you can save them under <app_name>/public/images, but the data will then be public. If you are worried about privacy, this is probably not the right place.
If you are concerned about privacy, see the options discussed in "How to store private pictures and videos in Ruby on Rails". As a suggestion: serving files from the app server may be painful under high traffic, and in my experience it is better offloaded to a file server or a cloud service like S3.
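As a rough illustration of the public/images option, here is a minimal sketch that writes already-downloaded image bytes to a directory. The method name `save_scraped_image`, the hash-based file naming, and the extension whitelist are assumptions made for this example, not part of Rails.

```ruby
require "digest"
require "fileutils"
require "uri"

# Hypothetical helper: stores image bytes under `dir` and returns the path.
# The file name is derived from the source URL so that re-scraping the same
# URL overwrites the same file instead of creating duplicates.
def save_scraped_image(bytes, source_url, dir)
  ext = File.extname(URI(source_url).path).downcase
  # Assumption: only keep well-known image extensions, fall back to .bin.
  ext = ".bin" unless %w[.jpg .jpeg .png .gif .webp].include?(ext)
  filename = Digest::SHA256.hexdigest(source_url) + ext
  FileUtils.mkdir_p(dir)
  path = File.join(dir, filename)
  File.binwrite(path, bytes)
  path
end
```

In a Rails app the directory could be something like `Rails.root.join("public", "images").to_s`; anything saved there is reachable by anyone who knows the URL.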

It's not hard to write a server that only serves images from a file store outside your website's directory structure. A simple rewrite of the URL can give your code the info it needs to find the actual file location, which it then outputs to the browser.
An alternative is to map the image's URL to the image's directory path in a database, then do a lookup. Make the URL field an indexed lookup and it will be very fast.
I wrote an image server in Ruby a couple years ago along those lines and it was a pretty simple task.
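The URL-rewrite idea can be sketched roughly as follows. This is not the image server mentioned above, only an illustration: the `/images/` prefix, the `IMAGE_ROOT` path, and the method name are all hypothetical. The check at the end keeps requests such as `/images/../../etc/passwd` from escaping the storage root.

```ruby
# Hypothetical storage root outside the web app's directory tree.
IMAGE_ROOT = "/var/data/scraped_images"

# Maps a request path such as "/images/cats/1.jpg" to a file path under
# `root`. Returns nil when the path does not belong to the image store.
def resolve_image_path(url_path, root: IMAGE_ROOT)
  prefix = "/images/"
  return nil unless url_path.start_with?(prefix)

  relative = url_path.delete_prefix(prefix)
  return nil if relative.empty?

  root_path = File.expand_path(root)
  candidate = File.expand_path(File.join(root_path, relative))
  # Reject anything that resolves outside the storage root.
  return nil unless candidate.start_with?(root_path + File::SEPARATOR)

  candidate
end
```

The database variant would replace the path arithmetic with a lookup keyed on the URL, for example a table with an indexed `url` column and a `file_path` column.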

Related

What is the best way for an iOS app to access data from a public website without overloading it?

I would like to use some publicly available data from a government website as a source of data in an iOS app. But I am not sure what the best / most polite / most scalable way is to have a large number of users request data from this website with the least impact on their servers and the best reliability for me.
It is 1-50kb of static XML with a fixed URL scheme
It updates with a new XML once a day
New users would need to download past data
It has a Last-Modified header but no caching headers
It does not use compression or a CDN
It's a government website, so if someone even replies to my email I doubt they are going to change how they host it for me...
I'm thinking I could run a script on a server to download this file once a day and re-host it for my app however my heart desires. But I don't currently run a server that I could use for this, and it seems like a lot just for this. My knowledge of web development is not great, so am I perhaps missing something obvious, and I just don't know what search terms I should be using to find the answer?
Can I point a CDN at this static data somehow and use that?
Is there something in CloudKit I could use?
Should I run a script on AWS somehow to do the rehosting without needing a full server?
Should I just not worry about it and access the data directly?

You can use the AWS S3 service (Simple Storage Service).
The flow is somewhat like this:
If the file doesn't exist on S3 yet, or if the creation date of the file on S3 is yesterday or earlier, the iOS app downloads the XML from the gov site and stores it in S3.
If the file exists on S3 and is up to date, the app downloads it from S3.
After that, the data can be presented by the app without overloading the site.
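The flow above amounts to a read-through cache that refreshes once a day. A minimal sketch of that decision logic, with a plain Hash standing in for the S3 bucket and a lambda standing in for the download from the government site (the class name `DailyCache` and its parameters are hypothetical):

```ruby
require "date"

# Read-through cache that refetches an entry when it is missing or was
# stored on an earlier day. `store` is any Hash-like object (assumption:
# in a real setup this would be backed by S3), `fetcher` downloads the
# document from the origin site, `clock` returns today's date.
class DailyCache
  def initialize(store:, fetcher:, clock: -> { Date.today })
    @store = store
    @fetcher = fetcher
    @clock = clock
  end

  def read(key)
    today = @clock.call
    entry = @store[key]
    if entry.nil? || entry[:stored_on] < today
      # Missing or stale: hit the origin once and remember the result.
      entry = { body: @fetcher.call(key), stored_on: today }
      @store[key] = entry
    end
    entry[:body]
  end
end
```

With this shape the origin site sees at most one request per file per day from whoever owns the store, no matter how many times `read` is called.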

I think the best way for you is to create an intermediary database where you can store your data in a secure manner.
Create a pipeline that does some data transformation and stores the result in your newly created database.
Create an API with pagination and your desired filters.
Also make sure you are not violating any data policies in the process.
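As a small illustration of the pagination step, here is a sketch over an in-memory array. The method name and the shape of the returned hash are assumptions; a real API would run the equivalent query against the intermediary database.

```ruby
# Returns one page of `records` plus the metadata a client needs to
# request the next page. Pages are numbered from 1.
def paginate(records, page:, per_page: 20)
  raise ArgumentError, "page must be >= 1" if page < 1
  raise ArgumentError, "per_page must be >= 1" if per_page < 1

  offset = (page - 1) * per_page
  items = records[offset, per_page] || []
  {
    items: items,
    page: page,
    per_page: per_page,
    total: records.size,
    total_pages: (records.size.to_f / per_page).ceil
  }
end
```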
I hope this helps.

Can Heroku be made to use a persistent filesystem?

I've built an app where users can upload their avatars. I used the paperclip gem and everything works fine on my local machine. On Heroku everything works fine until the server restarts. Then every uploaded image disappears. Is it possible to keep them on the server?
Notice: I probably should use a service such as Amazon S3 or Google Cloud. However, each of those services requires credit card or bank account information, even if you want to use a free tier. This is a small app just for my portfolio and I would rather avoid sending that information.

No, this isn't possible. Heroku's filesystem is ephemeral and there is no way to make it persistent. You will lose your uploads every time your dyno restarts.
You must use an off-site file storage service like Amazon S3 if you want to store files long-term.
(Technically you could store your images directly in your database, e.g. as a bytea in Postgres, but I strongly advise against that. It's not very efficient and then you have to worry about how to provide the saved files to the browser. Go with S3 or something similar.)

Where to store web application uploaded files?

I'm developing a web application using Grails.
I'm wondering where I should store uploaded files (pictures, PDFs, ...): on the application server, on a remote FTP server, or somewhere else?

It depends on:
what you want to do with these files later
whether you want to allow them to be downloaded
how many clients your application has and how much traffic these files get
how many files you expect to have
The simplest way is the best, so you can start by simply storing files in the local filesystem.
A good practice is to create a dedicated class/service for file storage; when in the future you want to store files elsewhere, you only have to change the implementation in one place.
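The dedicated-service idea could look roughly like this. The sketch is in Ruby rather than Groovy, and the class and method names are hypothetical. Both stores expose the same three methods, so the rest of the application does not need to know which one it is talking to.

```ruby
require "fileutils"

# Stores files under a root directory on the local filesystem.
class LocalFileStore
  def initialize(root)
    @root = File.expand_path(root)
  end

  def save(name, data)
    path = path_for(name)
    FileUtils.mkdir_p(File.dirname(path))
    File.binwrite(path, data)
    name
  end

  def read(name)
    File.binread(path_for(name))
  end

  def exist?(name)
    File.file?(path_for(name))
  end

  private

  # Refuses names that would resolve outside the root directory.
  def path_for(name)
    path = File.expand_path(File.join(@root, name))
    raise ArgumentError, "invalid name" unless path.start_with?(@root + File::SEPARATOR)
    path
  end
end

# Drop-in replacement with the same interface, handy for tests. A remote
# store (FTP, S3, ...) would implement the same three methods.
class InMemoryFileStore
  def initialize
    @files = {}
  end

  def save(name, data)
    @files[name] = data.dup
    name
  end

  def read(name)
    @files.fetch(name)
  end

  def exist?(name)
    @files.key?(name)
  end
end
```

Switching storage later then means constructing a different store object in one place, with callers left unchanged.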

Access MP3 files on server from iOS.

I'm building a streaming app similar to Pandora. Right now I'm serving all my files over HTTP and accessing them by URL. Is there an alternative to this, given that all the files are in the public HTML folder? For example, how do apps like Pandora or Spotify pull files off their servers? I'm new to web servers and not sure where to ask this question. I have a CentOS server on VPS hosting with Apache, MySQL, HTTP, and FTP.

You just need to provide the content as a bit stream rather than a file download. The source of the data to send as a stream can be stored as binary data in a BLOB column in a database or as a regular file on a non-public part of the file system. It really does not matter which one you use.
Storing them in the database gives your app somewhat easier access and makes the app more portable, since it is not restricted by file-system-level permissions.
The fact you currently have the files in a public folder is not really that critical of an issue since you are making them available for download. You would just need to make sure you have an authentication requirement if you want to restrict who can access them.
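A minimal sketch of the non-public-file variant: the file lives outside the public folder and is copied to the client in chunks only after an authentication check. The `authorized:` flag is a stand-in for whatever login check the app uses, and `out` is any object with a `write` method (a socket or a response body in a real server); all names are hypothetical.

```ruby
class NotAuthorized < StandardError; end

# Copies the file at `path` to `out` in fixed-size chunks so the whole
# track never has to be held in memory. Returns the number of bytes sent.
def stream_file(path, out, authorized:, chunk_size: 64 * 1024)
  raise NotAuthorized, "login required" unless authorized

  bytes_sent = 0
  File.open(path, "rb") do |file|
    # File#read(n) returns nil at end of file, which ends the loop.
    while (chunk = file.read(chunk_size))
      out.write(chunk)
      bytes_sent += chunk.bytesize
    end
  end
  bytes_sent
end
```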

How does Dropbox upload data to its servers?

Just recently I was wondering: how does Dropbox upload my files to its S3 storage, and how might that storage be organized?
Let's just completely forget about the sync aspect for a second and scale the problem down to one S3 bucket.
Say, in that bucket's root directory you have lots of folders, each belonging to an arbitrary user.
Now if that user wants to upload a file to his folder... how does that happen internally? I mean, Dropbox can't just store the Amazon S3 access credentials/keys hard-coded into the application (be it on iOS or Windows), as it might get reverse-engineered and thus exposed.
Any thoughts on this?
Thanks!

Some guys from EADS reverse-engineered Dropbox; the presentation slides are available for download: "A Critical Analysis of Dropbox Software Security".

In the same way websites don't allow users to directly access their databases but rather provide interfaces that can control permissions and handle authentication, I'm sure Dropbox has some kind of application that the client on your computer interacts with. Their server daemon will have permissions to write to the disk, but your computer has to go through it (and its security procedures) before anything your computer sends is written.
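To make the general idea concrete, here is one way a server could mediate uploads without the client ever holding storage credentials. This is purely illustrative and not a description of how Dropbox actually works: the server signs a short-lived token that is only valid for a path inside the user's own folder, and checks it again before writing. The secret, the token format, and all method names are assumptions; paths containing "|" are not supported by this toy format.

```ruby
require "openssl"

# Hypothetical secret that exists only on the server, never in the client.
SERVER_SECRET = "demo-secret-known-only-to-the-server"

def sign(payload)
  OpenSSL::HMAC.hexdigest("SHA256", SERVER_SECRET, payload)
end

# Compares two strings without stopping at the first differing byte.
def constant_time_equal?(a, b)
  return false unless a.bytesize == b.bytesize
  diff = 0
  a.bytes.zip(b.bytes) { |x, y| diff |= x ^ y }
  diff.zero?
end

# Issued by the server after the user has logged in.
def issue_upload_token(user_id, path, expires_at)
  payload = "#{user_id}|#{path}|#{expires_at}"
  "#{payload}|#{sign(payload)}"
end

# Checked by the server before it writes anything to storage.
# Returns the verified user and path, or nil if the token is not acceptable.
def verify_upload_token(token, now)
  user_id, path, expires_at, signature = token.split("|", 4)
  return nil if signature.nil?
  return nil unless constant_time_equal?(sign("#{user_id}|#{path}|#{expires_at}"), signature)
  return nil if now > Integer(expires_at)
  # The path must stay inside the user's own folder.
  return nil unless path.start_with?("#{user_id}/")
  return nil if path.split("/").include?("..")
  { user_id: user_id, path: path }
end
```

Cloud storage services offer comparable building blocks (for example, time-limited signed upload URLs), so the long-lived keys can stay on the server side.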
