I am using PhantomJS to dynamically generate 10 large website images per request. Because of that, it is important that I cache these images and check whether a cached copy exists so I can serve it instead of regenerating it. I've never cached images before, so I have no idea how to do this.
Some other information:
PhantomJS writes images to your local filesystem at the path you specify.
I want to cache these images but also need to balance that with updating the cache if the websites have updated.
I will be running these image generation processes in parallel.
I'm thinking of using Amazon's Elastic MapReduce to take advantage of Hadoop and to help with the load. I've never used it before, so any advice here would be appreciated.
I am pretty much a complete noob with this, so in depth explanations with examples would be really helpful.
What's your front-end web server? Since PhantomJS can write images to your local filesystem at any path you specify, you can point it at the document root of your web server so the images are served statically. That way Rails doesn't even have to be involved.
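As a rough sketch of the caching side (assuming a public/screenshots directory served by your web server and a hypothetical render.js PhantomJS script), the Rails side only needs to check whether a fresh cached file exists before shelling out to PhantomJS:

# app/services/screenshot_cache.rb - a minimal sketch, not production code
require "digest"
require "fileutils"

class ScreenshotCache
  CACHE_DIR = Rails.root.join("public", "screenshots") # served statically by the web server
  MAX_AGE   = 1.hour                                   # how stale a cached image may get

  # Returns the public path of a cached screenshot, regenerating it if missing or stale.
  def self.fetch(url)
    FileUtils.mkdir_p(CACHE_DIR)
    file = CACHE_DIR.join("#{Digest::MD5.hexdigest(url)}.png")

    if !File.exist?(file) || File.mtime(file) < MAX_AGE.ago
      # render.js is a hypothetical PhantomJS script: phantomjs render.js <url> <output>
      system("phantomjs", "render.js", url, file.to_s)
    end

    "/screenshots/#{file.basename}"
  end
end

Keeping the cache fresh when the websites change is then just a matter of tuning MAX_AGE, or deleting a file to force regeneration.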
I currently have a rails app running on some dedicated servers, where I need to dynamically (per request) generate banner images by fetching product images from S3 and applying some custom formatting on top of each image (combining it with a logo, the product price, some text, etc.). After an image has been generated, it can be cached in a CDN.
There are many, many product images, and the data/text/prices that need to go into each image change often, so I don't want to rely on pre-generating all the image combinations and storing them in S3. In my current setup, which uses ImageMagick under the hood to generate the images, a request takes around 750 ms, of which roughly 2/3 is time spent in Ruby/ImageMagick and 1/3 is network.
I'm considering moving the job of generating the actual banner images onto Amazon (probably EC2). That way the network path to S3 is shorter and I can scale up and down as requests come in, fetching the data/text needed for each image from my current app via an API. However, I'm unsure whether there are libraries that fit this exact image-generation task and which would perform well. And is there a better way than spawning some EC2 servers (e.g. Lambda)?
You could look at ruby-vips. It's typically 3x to 4x faster than imagemagick, and needs a lot less memory. If you open an issue on the ruby-vips github repo I could help set up a benchmark.
libvips support is part of Rails 6 (Active Storage's image_processing variants can use it), so it might be simple to experiment with, depending on which Rails you are using.
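For a rough idea of what the compositing could look like with ruby-vips (file names, offsets and output quality are made up for illustration, and a reasonably recent libvips is assumed; text/prices can be rendered the same way with Vips::Image.text):

require "vips"

product = Vips::Image.new_from_file("product.jpg")
logo    = Vips::Image.new_from_file("logo.png")   # a PNG with an alpha channel

# Overlay the logo near the top-left corner; :over is normal alpha blending.
banner = product.composite2(logo, :over, x: 20, y: 20)

banner.write_to_file("banner.jpg", Q: 90)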
You could deploy your app to Elastic Beanstalk and let Amazon take care of the infrastructure needed to scale when you see high load.
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/Welcome.html
Or, as you pointed out, you could go serverless with Lambda.
I want to use Heroku to host my Ruby on Rails project. It will involve lots of file uploads, mostly images. Can I host and serve those static files on Heroku, or is it wiser to use a service like Amazon S3? What is your opinion on that approach? What are my options for hosting those static files on Heroku?
To answer your question, Heroku's "ephemeral filesystem" will not serve as storage for static uploads. Heroku is an app server, period. You have to plug into data storage elsewhere.
From Heroku's spec:
Ephemeral filesystem
Each dyno gets its own ephemeral filesystem, with a fresh copy of the most recently deployed code. During the dyno’s lifetime its running processes can use the filesystem as a temporary scratchpad, but no files that are written are visible to processes in any other dyno and any files written will be discarded the moment the dyno is stopped or restarted. For example, this occurs any time a dyno is replaced due to application deployment and approximately once a day as part of normal dyno management.
Heroku is a great option for RoR in my opinion. I have used it personally and ran into the problem that has already been mentioned here (you can't store anything on Heroku's filesystem). I therefore used S3, following this tutorial: https://devcenter.heroku.com/articles/s3
Hope it helps!
PS: Make sure not to store the S3 credentials in any file; instead, create config vars as described here: https://devcenter.heroku.com/articles/config-vars
I used to have them in a file and, long story short, someone gained access to my Amazon account and it was billed several thousand dollars (over just a couple of days). The Amazon staff was kind enough to waive those charges. Just something to keep in mind.
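For example, a minimal sketch of reading the credentials from config vars with the aws-sdk-s3 gem (the var names and bucket are assumptions; you set them with "heroku config:set AWS_ACCESS_KEY_ID=..." and so on):

# config/initializers/s3.rb - credentials come from Heroku config vars, never from a checked-in file
require "aws-sdk-s3"

S3_BUCKET = Aws::S3::Resource.new(
  region:            ENV.fetch("AWS_REGION", "us-east-1"),
  access_key_id:     ENV.fetch("AWS_ACCESS_KEY_ID"),
  secret_access_key: ENV.fetch("AWS_SECRET_ACCESS_KEY")
).bucket(ENV.fetch("S3_BUCKET"))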
As pointed out, you shouldn't do this on Heroku for the specific reason of ephemeral storage, but to answer your question more broadly, storing user-uploaded content on the local filesystem of any host has a few inherent issues:
You can quickly run out of local storage space on the disk
You can lose all your user-uploaded content if the hardware crashes / the directory gets deleted / etc.
Heroku, EC2, Digital Ocean, etc. all provide servers that don't come with any guarantee of persistence (ephemeral storage especially). This means that your instance may shut down at any point, be swapped out, etc.
You can't scale your application horizontally. The files on one server won't be accessible from another (or dyno, or whatever your provider of choice calls them).
S3, however, is such a widely-used solution because:
It's incredibly cheap (we store 20 TB of data for something like $500 a month)
Your uploaded files aren't at risk of disappearing due to hardware failure
Your uploaded files are decoupled from the application, meaning any server / dyno / whatever could access them.
You can always publish your S3 buckets through CloudFront if you need a CDN, without any extra effort.
And certainly many more reasons. The most important thing to remember is that by storing uploaded content locally on a server, you put yourself in a position where you can't scale horizontally, regardless of how you're hosting your app.
It is wiser to host files on S3, and actually it is even wiser to use direct uploads to S3.
You can read the arguments, for example, here.
Main point: Heroku is really, really expensive.
So you need to save every bit of resources you have. The only option for storing static files on Heroku is to have a separate dyno running an app server for you, and static files don't need an app server. So it's just a waste of CPU time (and you should read that as "a waste of a lot of my money").
Also, uploading a huge amount of huge files will quickly push you past your memory quota (read that as "will waste even more of my money because I will need to run more dynos"). So it's best to upload files directly to S3.
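For example, a minimal presigned-URL sketch with the aws-sdk-s3 gem (bucket name and key scheme are made up); the browser then PUTs the file straight to S3 and your dynos never touch the bytes:

# Somewhere in a controller: hand the browser a short-lived URL it can upload to directly.
require "aws-sdk-s3"

def presigned_upload_url
  object = Aws::S3::Resource.new(region: ENV.fetch("AWS_REGION", "us-east-1"))
                            .bucket("my-uploads-bucket")            # hypothetical bucket
                            .object("uploads/#{SecureRandom.uuid}") # hypothetical key scheme
  render json: { url: object.presigned_url(:put, expires_in: 15.minutes.to_i) }
end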
Heroku is great for hosting your app. Use the tool that best suits the task.
UPD: Forgot to mention: not only will you need a separate dyno for static assets, those assets will also disappear every time that dyno is restarted.
I had the same problem. I solved it by adding all my images to my rails app and then referencing them by their links, which look something like
myapp.herokuapp.com/assets/image1.jpg
I might add the link from the CMS. It might not be the best option, but it works.
I have implemented a rails application and deployed it on an Azure web server.
The issue I am getting is that some of the images in the public folder take a long time to load, so the website's performance is very poor. Some of the images are as small as 20 KB and still take around 13 seconds to load.
My question is: if I were to put the images from the public directory into a CDN (content delivery network) and load them from its cache, would that give better performance, or would it not affect the overall performance?
Is it also possible to put all the images for the rails app into a CDN in production?
Thanks.
Well, remember that a CDN is just another webserver. All you are doing when you use a CDN is hyperlinking to the resource on that other webserver.
<img src="http://www.quackit.com/pix/milford_sound/milford_sound_t.jpg" />
Now, will it speed up the load times of your app? Maybe. There are lots of factors affecting this, namely:
Why is your app loading slow? Is it your connection? Are you on dialup? A CDN can't help that.
Why is your azure server slow? Is there lots of traffic? If so, a CDN will help.
Most large production applications likely use a CDN for all of their static assets, such as images, CSS, and JavaScript. (They probably own the CDN, but still, it's a CDN.) So, yes, every image on your site could be stored in a CDN. (Very easy if these are all static images.) However, CDNs that do this are not usually free.
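If you do go the CDN route, Rails makes pointing asset URLs at it a one-liner (the CloudFront hostname below is just a placeholder):

# config/environments/production.rb
# Rails will then generate asset URLs (images, CSS, JS) pointing at the CDN host.
config.action_controller.asset_host = "https://d1234abcd.cloudfront.net" # placeholder hostname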
Why are you choosing Azure for a rails app? It's possible, but it would be much easier to use something like Heroku or Engine Yard. You could even use a VPS service like Digital Ocean and set up your own small army of CDNs using VPS providers around the country. (If you're a cheapskate, like me.)
There usually aren't that many images in a production rails app located in /public. Usually those images are in assets/images/... The only things I would keep in public would be a small front-end site and perhaps some 404/error pages.
I am interested in understanding the different approaches to handling large file uploads in a Rails application, for 2-5 GB files.
I understand that in order to transfer a file of this size it will need to be broken down into smaller parts. I have done some research, and here is what I have so far:
Server-side config will be required to accept large POST requests, and probably a 64-bit machine to handle anything over 4 GB.
AWS supports multipart upload.
The HTML5 FileSystem API has a persistent uploader that uploads the file in chunks.
A library for BitTorrent, although this requires a transmission client, which is not ideal.
Can all of these methods be resumed like FTP? The reason I don't want to use FTP is that I want to keep this in the web app, if possible. I have used CarrierWave and Paperclip, but I am looking for something that can be resumed, as uploading a 5 GB file could take some time!
Of the approaches I have listed, I would like to understand what has worked well, and whether there are other approaches I may be missing. No plugins if possible; I would rather not use Java applets or Flash. Another concern is that these solutions hold the file in memory while uploading, which is also a constraint I would rather avoid if possible.
I've dealt with this issue on several sites, using a few of the techniques you've illustrated above and a few that you haven't. The good news is that it is actually pretty realistic to allow massive uploads.
A lot of this depends on what you actually plan to do with the file after you have uploaded it... The more work you have to do on the file, the closer you are going to want it to your server. If you need to do immediate processing on the upload, you probably want to do a pure rails solution. If you don't need to do any processing, or it is not time-critical, you can start to consider "hybrid" solutions...
Believe it or not, I've actually had pretty good luck just using mod_porter. Mod_porter makes apache do a bunch of the work that your app would normally do. It helps not tie up a thread and a bunch of memory during the upload. It results in a file local to your app, for easy processing. If you pay attention to the way you are processing the uploaded files (think streams), you can make the whole process use very little memory, even for what would traditionally be fairly expensive operations. This approach requires very little actual setup to your app to get working, and no real modification to your code, but it does require a particular environment (apache server), as well as the ability to configure it.
I've also had good luck using jQuery-File-Upload, which supports good stuff like chunked and resumable uploads. Without something like mod_porter, this can still tie up an entire thread of execution during upload, but it should be decent on memory, if done right. This also results in a file that is "close" and, as a result, easy to process. This approach will require adjustments to your view layer to implement, and will not work in all browsers.
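On the Rails side, handling chunked uploads roughly amounts to appending each chunk to a file keyed by the upload. A very rough sketch (the param names, the upload_id scheme and the in-order delivery assumption all depend on how you configure the uploader):

# A hypothetical controller action that assembles chunks sent by a chunked uploader.
class UploadsController < ApplicationController
  def append_chunk
    chunk  = params[:file]                                          # the current chunk (an uploaded file)
    target = Rails.root.join("tmp", "uploads", params[:upload_id])  # hypothetical per-upload identifier
    FileUtils.mkdir_p(target.dirname)

    # Append the chunk; this naive version assumes chunks arrive in order.
    File.open(target, "ab") { |f| IO.copy_stream(chunk.tempfile, f) }

    head :ok
  end
end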
You mentioned FTP and BitTorrent as possible options. These are not as bad as you might think, since you can still get the files pretty close to the server. They are not even mutually exclusive, which is nice, because (as you pointed out) they do require an additional client that may or may not be present on the uploading machine. The way this works is, basically, you set up an area for them to dump files into that is visible to your app. Then, if you need to do any processing, you run a cron job (or whatever) to monitor that location for uploads and trigger your server's processing method (a sketch of such a watcher follows below). This does not get you the immediate response the methods above can provide, but you can set the interval small enough to get pretty close. The only real advantage to this method is that the protocols used are better suited to transferring large files; in my experience, the additional client requirement and fragmented process usually outweigh that benefit.
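A watch-folder poller can be as simple as the following (the directories, the "finished transferring" heuristic and the job class are all placeholders):

# watch_uploads.rb - run periodically from cron via `rails runner`; hands finished files to the app.
require "fileutils"

WATCH_DIR = "/srv/uploads/incoming"   # where the FTP/torrent clients drop files (placeholder path)
DONE_DIR  = "/srv/uploads/accepted"   # files are moved here once handed off (placeholder path)

FileUtils.mkdir_p(DONE_DIR)

Dir.glob(File.join(WATCH_DIR, "*")).each do |path|
  next unless File.file?(path)

  # Crude "is it still being written?" check: size unchanged over a short interval.
  size = File.size(path)
  sleep 5
  next unless File.size(path) == size

  dest = File.join(DONE_DIR, File.basename(path))
  FileUtils.mv(path, dest)
  ProcessUploadJob.perform_later(dest)   # hypothetical background job in your app
end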
If you don't need any processing at all, your best bet may be to simply go straight to S3 with them. This solution falls down the second you actually need to do anything with the files other than serve them as static assets...
I do not have any experience using the HTML5 FileSystemAPI in a rails app, so I can't speak to that point, although it seems that it would significantly limit the clients you are able to support.
Unfortunately, there is not one real silver bullet - all of these options need to be weighed against your environment in the context of what you are trying to accomplish. You may not be able to configure your web server or permanently write to your local file system, for example. For what it's worth, I think jQuery-File-Upload is probably your best bet in most environments, as it only really requires modification to your application, so you could move an implementation to another environment most easily.
This project is a new protocol over HTTP to support resumable uploads of large files. It bypasses Rails by providing its own server.
http://tus.io/
http://www.jedi.be/blog/2009/04/10/rails-and-large-large-file-uploads-looking-at-the-alternatives/ has some good comparisons of the options, including some outside of Rails.
Please go through it; it was helpful in my case.
Another site worth going to is:
http://bclennox.com/extremely-large-file-uploads-with-nginx-passenger-rails-and-jquery
Please let me know if any of this does not work out.
I would bypass the rails server and post your large files (split into chunks) directly from the browser to Amazon Simple Storage Service. Take a look at this post on splitting files with JavaScript. I'm a little curious how performant this setup would be, and I feel like tinkering with it this weekend.
I think that Brad Werth nailed the answer.
Just one approach could be to upload directly to S3 (and even if you do need some reprocessing afterwards, you could theoretically use AWS Lambda to notify your app... but to be honest I'm just guessing here, I'm about to solve the same problem myself, so I'll expand on this later).
http://aws.amazon.com/articles/1434
If you use CarrierWave:
https://github.com/dwilkie/carrierwave_direct_example
Uploading large files on Heroku with Carrierwave
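If some uploads do have to pass through your own Ruby code, the aws-sdk-s3 gem can also handle the multipart mechanics for you; upload_file switches to a multipart upload above a size threshold (the bucket, key and file path below are placeholders):

require "aws-sdk-s3"

obj = Aws::S3::Resource.new(region: "us-east-1")
                       .bucket("my-big-files")          # placeholder bucket
                       .object("videos/raw/movie.mp4")  # placeholder key

# Files larger than multipart_threshold are split into parts and uploaded concurrently.
obj.upload_file("/tmp/movie.mp4", multipart_threshold: 100 * 1024 * 1024)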
Let me also pin down a few options that might help others looking for a real-world solution.
I have a Rails 6 app with Ruby 2.7, and the main purpose of this app is to create a Google Drive-like environment where users can upload images and videos and then have them processed again into high quality.
Obviously we did try local processing using Sidekiq background jobs, but it was overwhelming for large uploads of 1 GB and more.
We did try tus.io, but personally I think it is not quite as easy to set up as jQuery File Upload.
So we experimented with AWS, moving through the steps listed below, and it worked like a charm: uploading directly to S3 from the browser.
Using a React dropzone uploader, we upload multiple files to S3.
We set up an AWS Lambda on the input bucket, triggered by all object-creation events on that bucket.
That Lambda converts the file, uploads the reprocessed version to a second (output) bucket, and notifies us via AWS SNS so we can keep track of what worked and what failed.
On the Rails side, we just dynamically use the new output bucket and serve it through an AWS CloudFront distribution.
You can check the AWS notes on MediaConvert for a step-by-step guide; they also have well-written GitHub repos for all sorts of experimentation.
So, from the user's point of view, they upload one large file (with Transfer Acceleration enabled on the S3 bucket), the React library shows upload progress, and once the file is uploaded, a Rails callback API verifies its existence in the S3 bucket under a key like mybucket/user_id/file_uploaded_slug and then confirms it to the user through a simple flash message.
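The verification callback can be as small as this (the bucket name, key scheme and route are placeholders):

# Called after the browser reports that the direct upload finished; confirms the object really exists.
require "aws-sdk-s3"

def confirm_upload
  key    = "#{current_user.id}/#{params[:file_slug]}"  # placeholder key scheme
  object = Aws::S3::Resource.new(region: ENV.fetch("AWS_REGION", "us-east-1"))
                            .bucket("mybucket")        # placeholder bucket name
                            .object(key)

  if object.exists?
    flash[:notice] = "Upload complete!"
  else
    flash[:alert] = "We couldn't find your upload yet, please try again."
  end
  redirect_to uploads_path   # hypothetical route
end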
You can also configure Lambda to notify the end user on successful upload/encoding, if needed.
Refer to this documentation - https://github.com/mike1011/aws-media-services-vod-automation/tree/master/MediaConvert-WorkflowWatchFolderAndNotification
Hope it helps someone here.
I am working on a rails application that requires files to be uploaded to my server so that Resque workers (running on several other computers) can use those files to do some tasks. I have my workers all set up to do the task, but I can't seem to find a nice way to get the files from my host computer to the worker computers. I've tried Carrierwave (and looked at the documentation for Paperclip), but all I see is using S3, which I cannot use. My only idea is to store a string containing the URI where the file may be found so that the workers can download it and start working. I'm not particularly fond of this idea. Does anyone have any suggestions on the best way to do this? Thank you!
Update
I should also note that the files that need to be shared are roughly 200 MB each.
Have you considered something like a Network File System instead of doing this inside your application?
Depending on what platform your workers and server run on, you should have numerous options for sharing a filesystem (I assume you have a LAN running between them).
And even if there is no real LAN, sshfs could work too.
The upsides are obvious: your Ruby application only has to deal with a regular filesystem using FileUtils, and the heavy lifting of pushing data around is handled by much more reliable infrastructure.
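With a shared mount in place, handing a file to the workers is just a copy plus an enqueue (the mount point, filename scheme and job class are placeholders):

# On the web host: drop the uploaded file onto the shared mount and tell Resque where it is.
require "fileutils"
require "securerandom"

SHARED_DIR = "/mnt/shared/uploads"   # NFS or sshfs mount visible to every worker machine (placeholder)

def hand_off_to_workers(uploaded_file)
  FileUtils.mkdir_p(SHARED_DIR)
  dest = File.join(SHARED_DIR, "#{SecureRandom.uuid}-#{uploaded_file.original_filename}")
  FileUtils.cp(uploaded_file.tempfile.path, dest)

  # Workers on the other machines see the same path through their own mount.
  Resque.enqueue(ProcessFileJob, dest)   # hypothetical Resque job class
end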