I have a simple setup going for an API I'm building in Rails.
A zip is uploaded via a POST; I take the file, store it in rails.root/tmp using CarrierWave, and then background an S3 upload with Sidekiq.
The reason I store the file temporarily is that I can't send a complex object to Sidekiq, so I store it and send the ID, let Sidekiq find it and do work with it, then delete the file once it's done.
The problem is that when it's time for my Sidekiq worker to find the file by its path, it can't, because the file doesn't exist. I've read that Heroku's ephemeral filesystem deletes its files when things are reconfigured, servers are restarted, etc.
None of those things are happening, however, and the file still doesn't exist. My theory is that the Sidekiq worker is actually trying to open the path it was passed on its own filesystem, since it's a separate worker, and the file doesn't exist there. Can someone confirm this? If that's the case, are there any alternative ways to do this?
If your worker is executed on a different dyno than your web process, you are experiencing this issue because of dyno isolation. Read more about it here: https://devcenter.heroku.com/articles/dynos#isolation-and-security
Although it is possible to run Sidekiq workers and the web process on the same machine (maybe not on Heroku, I am not sure about that), it is not advisable to design your system architecture that way.
If your application grows or experiences temporary high loads, you may want to spread the load across multiple servers, and usually also run your workers on servers separate from your web process, so that busy workers don't block the web process.
In all those cases you can never share data on the local filesystem between the web process and the worker.
I would recommend uploading the file directly to S3 using https://github.com/waynehoover/s3_direct_upload
This also takes a lot of load off your web server.
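If you want to keep something close to your current flow instead, the usual fix is to park the temporary file somewhere both dynos can reach (S3) and hand the worker an object key rather than a local path. A rough sketch of that idea, assuming the aws-sdk-s3 gem; the worker class, bucket, and ENV variable names are placeholders:

```ruby
# Gemfile: gem "aws-sdk-s3"
require "aws-sdk-s3"

class ZipProcessingWorker
  include Sidekiq::Worker

  # Receives an S3 object key instead of a local path, so it works
  # no matter which dyno the job lands on.
  def perform(s3_key)
    s3     = Aws::S3::Resource.new(region: ENV["AWS_REGION"])
    object = s3.bucket(ENV["S3_BUCKET"]).object(s3_key)

    # Download to this dyno's own tmp scratchpad, do the work, then clean up.
    local_path = Rails.root.join("tmp", File.basename(s3_key)).to_s
    object.download_file(local_path)
    # ... unpack the zip and process it here ...
    object.delete
  ensure
    File.delete(local_path) if local_path && File.exist?(local_path)
  end
end
```

The web process uploads the zip to the bucket (or the browser does, via direct upload) and enqueues only the key, e.g. `ZipProcessingWorker.perform_async(key)`.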
Related
I have an import feature in my Rails app that imports a CSV file and updates records accordingly. As the file gets bigger, the request takes longer and eventually times out, so I chose to implement delayed_job to handle the long-running work. The only problem is that when the job runs, the error Errno::ENOENT: No such file or directory is raised. This is because my solution works with the CSV file in memory.
Is there a way to save the CSV file on my Heroku server (and delete it after the import)?
Heroku's filesystems are ephemeral, i.e. content on them doesn't persist and they are not shared across dynos.
If your delayed job runs on another dyno (which is how it should be, if it isn't already), you cannot access a CSV that exists on the disk of your web dyno.
One workaround would be to create an action that serves the CSV. You could then use an HTTP library to download the CSV when the job starts.
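A rough sketch of that workaround, assuming delayed_job's custom-job style and a hypothetical route on the web app that serves the file (the URL, job class, and model names here are made up):

```ruby
require "net/http"
require "csv"

# Hypothetical delayed_job payload: it fetches the CSV over HTTP from the
# web app when it runs, instead of reading a path on the local filesystem.
class CsvImportJob < Struct.new(:import_id)
  def perform
    url  = URI("https://myapp.example.com/imports/#{import_id}.csv")
    body = Net::HTTP.get(url)

    CSV.parse(body, headers: true) do |row|
      # ... update records from each row ...
    end
  end
end

# Enqueued from the controller once the upload is persisted somewhere reachable:
# Delayed::Job.enqueue(CsvImportJob.new(import.id))
```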
You can't store files on a dyno's filesystem and read them back from another dyno.
You can store the temporary file in external cloud storage such as AWS S3 and read it back from your delayed job.
I have a scheduler task which downloads a video from a URL. I want to temporarily store it on my Heroku server just long enough to upload it to S3. I can't figure out a way to upload directly from an external URL to S3, so instead I'm using my server as the middle man.
But I don't understand where I should be storing the file on my server, or whether Heroku will even allow it.
If you're on the Cedar or Cedar-14 stack, you can write the file anywhere on the filesystem.
You're probably aware (if not, you should be) that Heroku Dynos have an ephemeral filesystem and that this filesystem is discarded the moment a dyno is stopped or restarted - which can happen for any number of reasons. With that in mind, you'll probably want to design your task scheduler in such a way that failed jobs are retried a couple of times.
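A rough sketch of that download-then-upload flow, done inside a single task run so the file never needs to outlive the dyno. This assumes a Rails app and the aws-sdk-s3 gem; the bucket and ENV variable names are placeholders:

```ruby
require "open-uri"
require "aws-sdk-s3"

def mirror_video_to_s3(video_url, s3_key)
  tmp_path = Rails.root.join("tmp", File.basename(s3_key)).to_s

  # Stream the download into the dyno's tmp scratchpad...
  File.open(tmp_path, "wb") do |file|
    URI.parse(video_url).open { |remote| IO.copy_stream(remote, file) }
  end

  # ...then push it to S3 in the same run, before the dyno can be restarted.
  bucket = Aws::S3::Resource.new(region: ENV["AWS_REGION"]).bucket(ENV["S3_BUCKET"])
  bucket.object(s3_key).upload_file(tmp_path)
ensure
  File.delete(tmp_path) if File.exist?(tmp_path)
end
```

Wrapping this in your scheduled task and letting it retry on failure covers the case where the dyno restarts mid-run.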
I saw this question: After git push heroku - uploaded files on Heroku are lost
"...each time the application shuts down and restarts (after being inactive for x minutes), your application is recreated and all stored data is lost."
Right now I have users who can upload two photos. I get email confirmations for new users, so I can see that users registered and uploaded photos 4 and 14 hours ago.
I made my last commit and pushed it to Heroku around 19 hours ago, and those 4 images that new users uploaded are now lost. But I can see the images if I register a user right now. So it really does seem that my app was inactive for x minutes, then restarted and deleted the images.
I read some questions like this one: [Rails] Images erased after a new commit on heroku
There it says that I should use an external service like AWS S3 (I have no idea what it is, how much it costs, or how to connect to it).
So is it really true? What are my other options? Maybe I should simply use DigitalOcean (won't it have the same problem?) or something else. Will this problem continue on a paid account?
I use Rails and upload files using the CarrierWave gem; I can't post code here because I'm writing from another laptop.
Heroku's filesystem is read-only. You can't expect anything you upload to persist there; you need to use an external storage mechanism, something like Amazon's S3 for example.
See the links for more details.
https://devcenter.heroku.com/articles/s3
https://devcenter.heroku.com/articles/dynos#isolation-and-security
The filesystem that your Heroku instance runs on is not read-only, but it is transient - i.e. files that you store there will not persist after an instance restart.
This is a deliberate design decision by Heroku, to force you to think about where you store your data and how it impacts on scalability.
You're asking about Digital Ocean - I haven't used them but I assume from your question that they allow you to store to a persistent local filesystem.
The question that you then have to ask is: what happens if you want to run more than one instance of your app? Do they share the same persistent filesystem? Can they access each other's files? How do you handle file locking to avoid race-conditions when several app instances are using the same filesystem?
Heroku's model forces you to either put stuff in a database or store it using some external service. Generally, any of these sorts of systems will be reasonably scalable - you can have multiple Heroku instances (perhaps running on different machines, different data-centers, etc), and they will all interact nicely.
I do agree that for a simple use case, where you just want to run a single instance of an app during development, it can be inconvenient. But I think this is the reasoning behind it: to force you to design this sort of thing in from the start, rather than developing your whole app on the assumption that it can store everything locally and then finding out later that you need to completely redesign it to make it scalable.
What you're looking at is something called the ephemeral file system in Heroku:
Each dyno gets its own ephemeral filesystem, with a fresh copy of the most recently deployed code. During the dyno’s lifetime its running processes can use the filesystem as a temporary scratchpad, but no files that are written are visible to processes in any other dyno and any files written will be discarded the moment the dyno is stopped or restarted.
In short, it means that any files you upload will only last for the time the dyno is running. When the dyno shuts down, the files will be removed unless they were part of the local git repo.
The way to resolve the issue is to store the files on a third-party service - typically S3. This will store the files on a system independent of Heroku.
Both Paperclip & Carrierwave support S3 (Simple Storage Service) - through a gem called fog. S3 gives you a "free" tier, allowing you to store a certain amount of data (I've forgotten how much) for free.
I would strongly recommend setting up an S3 account and linking it to your Heroku app. This way, any files you upload will be stored off-site.
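A minimal sketch of wiring CarrierWave to S3 through fog (the fog-aws adapter); the bucket and ENV variable names are placeholders you'd set as Heroku config vars:

```ruby
# Gemfile: gem "carrierwave"; gem "fog-aws"
# config/initializers/carrierwave.rb
CarrierWave.configure do |config|
  config.fog_provider = "fog/aws"
  config.fog_credentials = {
    provider:              "AWS",
    aws_access_key_id:     ENV["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"],
    region:                ENV["AWS_REGION"]
  }
  config.fog_directory = ENV["S3_BUCKET"] # the bucket your uploads land in
end

# In each uploader, switch storage from :file to :fog:
class ImageUploader < CarrierWave::Uploader::Base
  storage :fog
end
```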
I want to use Heroku to host my Ruby on Rails project. It will involve lots of file uploads, mostly images. Can I host and serve those static files on Heroku, or is it wiser to use a service like Amazon S3? What is your opinion on that approach? What are my options for hosting those static files on Heroku?
To answer your question: Heroku's "ephemeral filesystem" will not serve as storage for static uploads. Heroku is an app server, period. You have to plug into data storage elsewhere.
From Heroku's spec:
Ephemeral filesystem
Each dyno gets its own ephemeral filesystem, with a fresh copy of the most recently deployed code. During the dyno’s lifetime its running processes can use the filesystem as a temporary scratchpad, but no files that are written are visible to processes in any other dyno and any files written will be discarded the moment the dyno is stopped or restarted. For example, this occurs any time a dyno is replaced due to application deployment and approximately once a day as part of normal dyno management.
Heroku is a great option for RoR in my opinion. I have used it personally and ran into the problem that has already been mentioned here (you can't store anything on Heroku's filesystem). I therefore used S3, following this tutorial: https://devcenter.heroku.com/articles/s3
Hope it helps!
PS: Make sure not to store the S3 credentials in any file; instead create config vars as described here: https://devcenter.heroku.com/articles/config-vars
I used to have them in a file and, long story short, someone gained access to my Amazon account and it was billed several thousand dollars (over just a couple of days). The Amazon staff were kind enough to waive those charges. Just something to keep in mind.
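For example, the credentials can be set once as Heroku config vars and then read from ENV in the application, so they never live in the repo (the variable names below are just the conventional ones):

```ruby
# Shell, one time:
#   heroku config:set AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... AWS_REGION=...

# In the app, build the client from the environment instead of a checked-in file:
require "aws-sdk-s3"

s3 = Aws::S3::Resource.new(
  region: ENV.fetch("AWS_REGION"),
  credentials: Aws::Credentials.new(
    ENV.fetch("AWS_ACCESS_KEY_ID"),
    ENV.fetch("AWS_SECRET_ACCESS_KEY")
  )
)
```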
As pointed out, you shouldn't do this with Heroku for the specific reason of ephemeral storage, but to answer your question more broadly, storing user-uploaded content on a local filesystem on any host has a few inherent issues:
You can quickly run out of local storage space on the disk
You can lose all your user-uploaded content if the hardware crashes / the directory gets deleted / etc.
Heroku, EC2, Digital Ocean, etc. all provide servers that don't come with any guarantee of persistence (ephemeral storage especially). This means that your instance may shut down at any point, be swapped out, etc.
You can't scale your application horizontally. The files on one server won't be accessible from another (server, dyno, or whatever your provider of choice calls them).
S3, however, is such a widely-used solution because:
It's incredibly cheap (we store 20 TB of data for something like $500 a month)
Your uploaded files aren't at risk of disappearing due to hardware failure
Your uploaded files are decoupled from the application, meaning any server / dyno / whatever could access them.
You can always put your S3 buckets behind CloudFront if you need a CDN, with very little extra effort.
And there are certainly many more reasons. The most important thing to remember is that by storing uploaded content locally on a server, you put yourself in a position where you can't scale horizontally, regardless of how you're hosting your app.
It is wiser to host files on S3, and actually it is even wiser to use direct uploads to S3.
You can read the arguments, for example, here.
Main point: Heroku is a really, really expensive thing.
So you need to save every bit of resources you have. And the only option for serving static files from Heroku is to have a separate dyno running an app server for you. Static files don't need an app server, so it's just a waste of CPU time (and you should read that as "a waste of a lot of my money").
Also, uploading a huge number of huge files will quickly push you over the memory quota (read that as "will waste even more of my money, because I will need to run more dynos"). So it's best to upload files directly to S3.
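For illustration, one way to do a direct upload is to have the app hand the browser a presigned POST so the file goes straight to S3 and never touches a dyno. A rough sketch with the aws-sdk-s3 gem; the action name, bucket, and key prefix are placeholders:

```ruby
require "aws-sdk-s3"
require "securerandom"

# Controller action returning the URL and form fields a browser needs
# to POST a file directly to S3, bypassing the dyno entirely.
def presigned_upload
  bucket = Aws::S3::Resource.new(region: ENV["AWS_REGION"]).bucket(ENV["S3_BUCKET"])
  post   = bucket.presigned_post(
    key: "uploads/#{SecureRandom.uuid}/${filename}",
    success_action_status: "201",
    acl: "private"
  )
  render json: { url: post.url, fields: post.fields }
end
```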
Heroku is great for hosting your app. Use the tool that best suits the task.
UPD. Forgot to mention: not only will you need a separate dyno for static assets, your static assets will also die every time that dyno is restarted.
I had the same problem. I solved it by adding all my images to my Rails app itself. I then reference the images using their asset links, which might look something like
myapp.herokuapp.com/assets/image1.jpg
I might add the link from the CMS. It might not be the best option, but it works.
Is it possible to host the application on one server and queue jobs on another server?
Possible examples:
Two different EC2 instances: one running the main server and the second running the queueing service.
Host the app on Heroku and use an EC2 instance for the queueing service.
Is that possible?
Thanks
Yes, definitely. We have delayed_job set up that way where I work.
There are a couple of requirements for it to work:
The servers have to have synced clocks. This is usually not a problem as long as the servers' timezones are all set the same.
The servers all have to access the same database.
To do it, you simply have the same application on both (or all, if more than two) servers, and start workers on whichever server you want to process jobs. Either server can still queue jobs, but only the one(s) with workers running will actually process them.
For example, we have one interface server, a DB server, and several worker servers. The interface server serves the application via Apache/Passenger, connecting the Rails application to the DB server. The workers have the same application, though Apache isn't running and you can't access the application over HTTP. They do, on the other hand, have delayed_job workers running. In a common scenario, the interface server queues up jobs in the DB, and the worker servers process them.
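A rough sketch of that layout with the standard delayed_job setup (the hostname, class name, and worker counts below are just examples):

```ruby
# config/database.yml on every server points at the same shared DB, e.g.:
#   host: db.internal.example.com

# Any server can queue a job; it is just a row in the shared delayed_jobs table:
ReportGenerator.delay.build_monthly_report(account_id)

# Only the worker servers actually start workers, e.g. from the app directory:
#   RAILS_ENV=production bin/delayed_job -n 4 start
# or, for a single foreground worker:
#   rake jobs:work
```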
One word of caution: If you're relying on physical files in your application (attachments, log files, downloaded XML or anything else), you'll most likely need a solution like S3 to keep those files. The reason for this is that the individual servers might not have the actual files. An example of this is if your user were to upload their profile picture on your web-facing server, the files would likely be stored on that server. If you then have another server to resize the profile pictures, the image wouldn't exist on the worker server.
Just to offer another viable option: you can use a newer worker service like IronWorker, which relies entirely on an elastic farm of cloud servers inside EC2.
This way you can queue or schedule jobs to run, and they will parallelize across tons of threads spanning multiple servers, all without worrying about the infrastructure.
Same deal with the database, though: it needs to be accessible from the outside.
Full disclosure: I helped build IW.