I'm currently using Shrine to upload files to DigitalOcean (via s3 sdk). It works nicely, it's perfect. However, in the guide there's a storage option to make a temporary copy of the uploads, specified by the 'cache' prefix.
How is this cache used? Or, put differently, what features does it provide?
Since I'm totally unaware of its use, currently all I'm seeing are duplicates of my uploads in my Spaces (bucket) resource. Are these files ever disposed of?
Finally, if the cached files are for data retrieval purposes, wouldn't it make sense to make a local cache, rather than it being sent to the S3 resource?
I apologize if all of this is general knowledge, it didn't seem clear to me as I was rushing in to get it implemented.
Shrine's temporary storage is used mainly to prevent orphan files (files not attached to any record) from entering your primary storage. An uploaded file might not end up attached in case of validation errors, or if the user decides not to save the form after the file has been asynchronously uploaded to the storage.
Because Shrine's uploaded files are not backed by database records by default (like with Active Storage), the temporary storage also provides a security measure where it prevents users from hijacking files of other users. If only the primary storage were used, an attacker could guess the uploaded file ID from the URL of another file, and assign it in their form when creating a record. Afterwards they could delete the record, and the file belonging to the other user would get deleted with it.
Shrine recommends using cloud storage for temporary storage to enable direct uploads to the cloud storage from the browser, and also because disk storage doesn't work if you're hosting your app on multiple servers, since only one server would have access to the saved file. Note that you can still use disk for temporary storage if you want to, just change the :cache declaration.
Shrine used cache to move slow processing action on background. You can specify some fast actions on caching and then make heavy processing in background. This is improving user side effect of uploading files. Also Shrine does not delete temporary files and you need to destroy it yourself in background
Related
The problem at hand
I have a rails app.
Users will be uploading files. Anywhere between 1 file to 3000 files. Sometimes they are zip files, and sometimes they are not. I do not want hold up the server with these files uploads, so I am looking for a solution around this problem.
The zipped files will have to be unzipped.
I then want to check whether: the user has previously uploaded the same files? i.e. if the user has already uploaded the same file(2) one week ago, then this is a problem: (i) either we don’t allow that particular file to be uploaded, or we ask the user: are you sure you want to upload the same file again?
Then I want to store the keys/links to the files within the appropriate models/records on the back end.
Was wondering what the best workflow for handling the above could be: i.e a very general overview: in other words, could AWS Lambda / Google cloud computing etc. etc be best employed to handle the above problem? How would we use the Shrine gem, to best handle this situation? Would it make sense to use AWS Lambda rather than using background jobs?
My preferences are to use the Shrine gem for uploading.
My Ideas:
In the client side, the user drags and drops the files the user
wants to upload.
All the files are then uploaded (whether zipped or otherwise) to a temporary bucket location via the Shrine gem.
IF the zip files are uploaded then perhaps an AWS lambda function must be triggered to unzip the files. If that’s the case,then at the end of the day, the keys for these files must somehow be returned to the client, to handle validation issues – but then how would the AWS lambda function be able to return this request to the original client side where the request was originated? Or rather,should the AWS lambda function be generated from the client side,passing in the IDs of the unzipped blobs?
Then we need to run some validations: we want to handle the situation where there are duplicate files. We will need to check with our rails backed as to whether those files have already been uploaded.
After those validation issues are handled, then user submits the form, and all the keys are stored within the appropriate records.
These ideas are by no means prescriptive
Am seeking some very general advise on what the best way is of doing this all. I am by no means constrained to AWS: I could use Google or Azure just as easily. Any guidance on the above would be much appreciated.
Specific questions:
How would the AWS lambda function get triggered?
How would be be able to return the keys of the uploaded files back to the client?
What do I mean by general overview?
Here are some examples of general overviews:
(1) Uploading & Unzipping files to S3 through Rails hosted on Heroku?
(2) https://www.quora.com/How-do-I-extract-large-zip-files-in-AWS-Lambda
Any pointers in the right direction would be much appreciated.
Cheers!
This isn't a really difficult problem to solve if you are willing to change the process flow a little bit.
In the client side, the user drags and drops the files the user wants to upload.
When the user requests the upload operation to begin you can make HTTP GET requests to an API Gateway endpoint, backed with a Lambda. The Lambda can query for previous files uploaded by the client and send back a result set showing what files already exist. You then filter those out and send only what is considered new from the client to the server. This will save the user time in waiting for the upload to happen and save you time on the S3/Lambda side of not having to store duplicates or process them. This isn't a substitute for server-side validation though, you'll still want to do that. For legit clients, this will save you and them a lot of bandwidth and storage.
All the files are then uploaded (whether zipped or otherwise) to a temporary bucket location via the Shrine gem.
This works. As they enter the temp bucket, use a Lambda with an S3 event to process the files, unzip files, push any metadata needed into DynamoDb and delete the files from the temp bucket. In the temp bucket, I would place the files into a folder that is unique per request and user. I would take the user/client Id and a UUID of some kind and make that your folder name. Such as Johnathon+3b5339b8-c8db-4d5c-b678-406fcf073f4f, or encode this value into a Base64 string and make that your folder name. Store this in DynamoDb with each file uploaded into your permanent bucket with the Hash Key being the userid/clientid, a Sort Key being the full folder path + file name and an extra attribute of IsProcessed. The IsProcessed attribute will be updated by your Lambda that is processing the files and moving them to their permanent S3 bucket. If there are errors, you can put the error in this field. If it is successful then you put it in this field.
the keys for these files must somehow be returned to the client, to handle validation issues – but then how would the AWS lambda function be able to return this request to the original client side where the request was originated? Or rather,should the AWS lambda function be generated from the client side,passing in the IDs of the unzipped blobs?
The original API request to push the files to the temp S3 bucket would be able to return back to the client the folder name johnathon+3b5339b8-c8db-4d5c-b678-406fcf073f4f to the client. So let's say you made a HTTP POST to /jobs. You would return back 201 Created with a HTTP Header of Location /jobs/johnathon+3b5339b8-c8db-4d5c-b678-406fcf073f4f. Your client can then start polling /jobs/johnathon+3b5339b8-c8db-4d5c-b678-406fcf073f4f for the status of the process.
Your response back to /jobs/johnathon+3b5339b8-c8db-4d5c-b678-406fcf073f4f can return the DynamoDB records. This would include all DynamoDB records for the HashKey matching the folder name. Your client side can look at all of the objects in the result set and check the IsProcessed attribute to see if everything worked out ok, or if there were issues.
Then we need to run some validations: we want to handle the situation where there are duplicate files. We will need to check with our rails backed as to whether those files have already been uploaded.
Handle this with the Lambda that is executed by the temporary bucket. Grab the files from the temp bucket folder, handle your business logic and back-end queries then push them to their final permanent bucket.
After those validation issues are handled, then user submits the form, and all the keys are stored within the appropriate records.
All of this would happen asynchronously, starting when the user submits the form. The client side needs to be able to handle this by making HTTP GET requests to the endpoint mentioned above, checking for the status of the process. This gives you some more flexibility as you can also publish SNS messages on failures as well, such as sending an email to the clients if they upload 3,000 files and you need to spend 30 minutes processing them.
I am working on a project where the user joins a "stream". During stream setup, the person who is creating the stream (the stream creator) can choose to either:
Upload all photos added to the stream by members to our hosting solution (S3)
Upload all photos added to the stream by members to the stream creator's own Dropbox authenticated folder
In the future I would like to add more storage providers (such as Drive, Onesky etc)
There is a couple of different questions I have in regards to how to solve this.
What should the structure be in the database for photos? I currently only have photo_url, but that won't be easy to manage from a data perspective with pre-signed urls and when there are different ways a photo can be uploaded (s3, dropbox etc.)
How should the access tokens for each storage provider be stored? Remember that only the stream creator's access_token will be stored and everyone who is on the stream will share that token when uploading photos
I will add iOS and web clients in the future that will do a direct upload to the storage provider and bypass the server to avoid a heavy load on the server
As far as database storage, your application should dictate the structure based on the interface that you present both to the user and to the stream.
If you have users upload a photo and they don't get to choose the URI, and you don't have any hierarchy within a stream, then I'd recommend storing just an ID and a stream_id in your main photo table.
So at a minimum you might have something looking like
create table photos(id integer primary key, stream_id integer references streams(id) not null);
But you probably also want description and other information that is independent of storage.
The streams table would have all the generic information about a stream, but would have a polymorphic association to a class dependent on the type of stream. So you could use that association to get an instance of S3Stream or DropBoxStream based on what actual stream was used.
That instance (also an ActiveRecord resource) could store the access key, and for things like dropbox, the path to the folder etc. In addition, that instance could provide methods to construct a URI given your Photo object.
If a particular technology needs to cache signed URIs, then say the S3Stream object could reference a S3SignedUrl model where the URIs are signed.
If it turns out that the signed URL code is similar between DropBox and S3, then perhaps you have a single SignedUrl model.
When you design the ios and android clients, it is critical that they are not given access to the stream owner's access tokens. Instead, you'll need to do all the signing inside your server app. You wouldn't want a compromise of a device to lead to exposing the access token creating billing problems as well as privacy exposures.
Hope this helps.
we setup a lot of rails applications with different kind of file storages behind it.
Yes, just an url is not manageable in the future. To save a lot of time you could use gems like carrierwave or paperclip. They handle all the thumbnail generation and file validation. One approach is, that you could upload the file from the client directly to S3 or Dropbox to a tmp folder and just tell your Rails App "Hey, here is the url of a new upload file" and paperclip and carrierwave will take care of the thumbnail generation and storaging. (Example for paperclip)
Don't know exactly how your stream works, so I cannot give a good answer to this -.-
With the setup I mentioned in 1. you should upload form your different clients directly to S3 or Dropbox etc. and after uploading, the client tells the Rails Backend that it should import the file from that url. (And before paperclip or carrierwave finish their processing you could use the tmp url from the file to display something directly in your stream)
I have a Rails app that catalogues recorded music products with metadata & wav files.
Previously, my users had the option to send me files via ftp, which i'd monitor with a cron task for new .complete files and then pick it's associated .xml file and a perform metadata import and audio file transfer to S3.
I regularly hit capacity limits on the prior FTP so decided to move the user 'dropbox' to S3, with an FTP gateway to allow users to send me their files. Now it's on S3 and due to S3 not storing the object in folders i'm struggling to get my head around how to navigate the bucket, find the .complete files and then perform my imports as usual.
Can anyway recommend how to 'scan' a bucket for new .complete files.....read the filename and then pass back to my app so that I can then pick up it's xml, wav and jpg files?
The structure of the files in my bucket is like this. As you can see there are two products here. I would need to find both and import their associated xml data and wavs/jpg
42093156-5060156655634/
42093156-5060156655634/5060156655634.complete
42093156-5060156655634/5060156655634.jpg
42093156-5060156655634/5060156655634.xml
42093156-5060156655634/5060156655634_1_01_wav.wav
42093156-5060156655634/5060156655634_1_02_wav.wav
42093156-5060156655634/5060156655634_1_03_wav.wav
42093156-5060156655634/5060156655634_1_04_wav.wav
42093156-5060156655634/5060156655634_1_05_wav.wav
42093156-5060156655634/5060156655634_1_06_wav.wav
42093156-5060156655634/5060156655634_1_07_wav.wav
42093156-5060156655634/5060156655634_1_08_wav.wav
42093156-5060156655634/5060156655634_1_09_wav.wav
42093156-5060156655634/5060156655634_1_10_wav.wav
42093156-5060156655634/5060156655634_1_11_wav.wav
42093163-5060243322593/
42093163-5060243322593/5060243322593.complete
42093163-5060243322593/5060243322593.jpg
42093163-5060243322593/5060243322593.xml
42093163-5060243322593/5060243322593_1_01_wav.wav
Though Amazon S3 does not formally have the concept of folders, you can actually simulate folders through the GET Bucket API, using the delimiter and prefix parameters. You'd get a result similar to what you see in the AWS Management Console interface.
Using this, you could list the top-level directories, and scan through them. After finding the names of the top-level directories, you could change the parameters and issue a new GET Bucket request, to list the "files" inside the "directory", and check for the existence of the .complete file as well as your .xml and other relevant files.
However, there might be a different approach to your problem: did you consider using SQS? You could make the process that receives the uploads post a message to a queue in SQS, say, completed-uploads, with the name of the folder of the upload that just completed. Another process would then consume the queue and process the finished uploads. No need to scan through the directories in S3.
Just note that, if you try the SQS approach, you might need to be prepared for the possibility of being notified more than once of a finished upload: SQS guarantees that it will eventually deliver posted messages at least once; you might receive duplicated messages! (you can identify a duplicated message by saving the id of the received message on, say, a consistent database, and checking newly received messages against the same database).
Also, remember that, if you use the US Standard Region for S3, then you don't have read-after-write consistency, you have only eventual-consistency, which means that the process receiving messages from SQS might try to GET the object from S3 and get nothing back -- just try again until it sees the object.
I am storing many images in Amazon S3,
using a ruby lib (http://amazon.rubyforge.org/)
I don't care the photos older than 1 week, then to free the space in S3 I have to delete those photos.
I know there is a method to delete the object in a certain bucket:
S3Object.delete 'photo-1.jpg', 'photos'
Is there a way to automatically delete the image older than a week ?
If it does Not exist, I'll have to write a daemon to do that :-(
Thank you
UPDATE: now it is possible, check the Roberto's answer.
You can use the Amazon S3 Object Expiration policy
Amazon S3 - Object Expiration | AWS Blog
If you use S3 to store log files or other files that have a limited
lifetime, you probably had to build some sort of mechanism in-house to
track object ages and to initiate a bulk deletion process from time to
time. Although our new Multi-Object deletion function will help you to
make this process faster and easier, we want to go ever farther.
S3's new Object Expiration function allows you to define rules to
schedule the removal of your objects after a pre-defined time period.
The rules are specified in the Lifecycle Configuration policy that you
apply to a bucket. You can update this policy through the S3 API or
from the AWS Management Console.
Object Expiration | AWS S3 Documentation
Some objects that you store in an Amazon S3 bucket might have a
well-defined lifetime. For example, you might be uploading periodic
logs to your bucket, but you might need to retain those logs for a
specific amount of time. You can use using the Object Lifecycle
Management to specify a lifetime for objects in your bucket; when the
lifetime of an object expires, Amazon S3 queues the objects for
deletion.
Ps: Click on the links for more information.
If you have access to a local database, it's easy to simply log each image (you may be doing this already depending on your application), and then you can perform a simple query to retrieve the entire list and delete them each. This is much faster than querying S3 directly, but does require local storage of some kind.
Unfortunately, Amazon doesn't offer an API for automatic deletion based on a specific set of criteria.
You'll need to write a daemon that goes through all of the photos and and selects just those that meet your criteria, and then delete them one by one.
I have seen quite a few code samples/plugins that promote uploading assets directly to S3. For example, if you have a user object with an avatar, the file upload field would load directly to S3.
The only way I see this being possible is if the user object is already created in the database and your S3 bucket + path is something like
user_avatars.domain.com/some/id/partition/medium.jpg
But then if you had an image tag that tried to access that URL when an avatar was not uploaded, it would yield a bad result. How would you handle checking for existence?
Also, it seems like this would not work well for most has many associations. For example, if a user had many songs/mp3s, where would you store those and how would you access them.
Also, your validations will be shot.
I am having trouble thinking of situations where direct upload to S3 (or any cloud) is a good idea and was hoping people could clarify either proper use cases, or tell me why my logic is incorrect.
Why pay for storage/bandwidth/backups/etc. when you can have somebody in the cloud handle it for you?
S3 (and other Cloud-based storage options) handle all the headaches for you. You get all the storage you need, a good distribution network (almost definitely better than you'd have on your own unless you're paying for a premium CDN), and backups.
Allowing users to upload directly to S3 takes even more of the bandwidth load off of you. I can see the tracking concerns, but S3 makes it pretty easy to handle that situation. If you look at the direct upload methods, you'll see that you can force a redirect on a successful upload.
Amazon will then pass the following to the redirect handler: bucket, key, etag
That should give you what you need to track the uploaded asset after success. Direct uploads give you the best of both worlds. You get your tracking information and it unloads your bandwidth.
Check this link for details: Amazon S3: Browser-Based Uploads using POST
If you are hosting your Rails application on Heroku, the reason could very well be that Heroku doesn't allow file-uploads larger than 4MB:
http://docs.heroku.com/s3#direct-upload
So if you would like your users to be able to upload large files, this is the only way forward.
Remember how web servers work.
Unless you're using a sort of async web setup like you could achieve with Node.JS or Erlang (just 2 examples), then every upload request your web application serves ties up an entire process or thread while the file is being uploaded.
Imagine that you're uploading a file that's several megabytes large. Most internet users don't have tremendously fast uplinks, so your web server spends a lot of time doing nothing. While it's doing all of that nothing, it can't service any other requests. Which means your users start to get long delays and/or error responses from the server. Which means they start using some other website to get the same thing done. You can always have more processes and threads running, but each of those costs additional memory which eventually means additional $.
By uploading straight to S3, in addition to the bandwidth savings that Justin Niessner mentioned and the Heroku workaround that Thomas Watson mentioned, you let Amazon worry about that problem. You can have a single-process webserver effectively handle very large uploads, since it punts that actual functionality over to Amazon.
So yeah, it's more complicated to set up, and you have to handle the callbacks to track things, but if you deal with anything other than really small files (and even in those cases), why cost yourself more money?
Edit: fixing typos