Hi everyone,
I have a small site running on the free tier of Heroku. It fetches and updates data from various sources frequently, and I want to save a copy of the database (~10,000 records) somewhere else every month, so I can see how the data changes over time and do some more detailed analysis. The website is developed in Ruby on Rails.
I want to know:
What is the best practice for exporting data from Heroku, especially for Ruby on Rails apps? (~10,000 records)
Is there a good place to share this data with others (e.g., a Kaggle dataset or a GitHub repo)?
Thanks!
TL;DR: You're better off building your own export script that connects to your instance and takes a SQL dump. The hobby plan is very limited.
There are multiple backup strategies. For instance, if you need to export once a month, you could set up a cron job that runs every 30 days and exports the data you want.
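A minimal sketch of such an export job, using only the Ruby standard library. It assumes the job is triggered once a day and decides for itself whether to export; the first-of-the-month rule, the file naming, and the `records` argument are illustrative choices, not something Heroku prescribes.

```ruby
require 'date'
require 'json'
require 'fileutils'

# Assumption: the task is triggered daily and exports only on the
# first day of each month.
def monthly_run?(date)
  date.day == 1
end

# Build a dated file name so each month's snapshot is kept side by side.
def snapshot_path(dir, date)
  File.join(dir, "snapshot-#{date.strftime('%Y-%m')}.json")
end

# `records` is any array of hashes. In a Rails app it could come from
# something like SomeModel.all.map(&:attributes) (hypothetical model name).
def export_snapshot(records, dir, date)
  FileUtils.mkdir_p(dir)
  path = snapshot_path(dir, date)
  File.write(path, JSON.pretty_generate(records))
  path
end
```

On Heroku the resulting file still has to be uploaded somewhere else (S3, a Git repo, and so on), because files written on a dyno do not survive a restart.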
Since you're using Heroku, note that it has its own way to manage backups. To find it, navigate to:
https://dashboard.heroku.com/apps/{your-app}/resources
Select your database add-on
Navigate to Durability
And there you should see Heroku's default backup strategy, which is a daily one. To modify it, the Heroku Toolbelt provides the following command:
heroku pg:backups:schedule DATABASE_URL --at '02:00 America/Los_Angeles' --app sushi
However, this would still give you daily backups.
Mind the following constraint:
A monthly backup means that only 1 backup is saved over the course of a month. Based on current limits, for example, a Premium-0 would have 12 monthly backups, one for each of the last 12 months.
Also, if you decide to adopt Heroku's built-in approach, mind the following:
There is a limit to the number of manual backups that you can retain. That number is based on your database plan.
Plan Backups Retained
Hobby-Dev 2
As for sharing it, there are a few things to take into consideration. For instance, if the information is sensitive, you want a way to control who has access to the resource. There are commercial ways to achieve this, such as a private GitHub repo or an Amazon S3 bucket with an ACL (Access Control List). Heroku's Dataclips may also be used, but I'm not sure that's what you want.
I have a very high-traffic Rails app. We use an older version of PostgreSQL as the backend database, which we need to upgrade. We cannot use the data-directory copy method because the data file formats have changed too much between our existing release and the current PostgreSQL release (10.x at the time of writing). We also cannot use a dump-and-restore process for the migration because we would either incur several hours of downtime or lose important customer data. Replication is not possible either, as the two DB versions are incompatible for that.
The strategy so far is to have two databases and copy all the data (and functions) from existing to a new installation. However, while the copy is happening, we need data arriving at the backend to reach both servers so that once the data migration is complete, the switch becomes a matter of redeploying the code.
I have figured out the other parts of the puzzle but am unable to determine how to send all writes happening on the Rails app to both DB servers.
I am not bothered if both installations get queried for displaying data to the user (I can discard the data coming out of the new installation); so, if it is possible on driver level, or adding a line somewhere in the ActiveRecord, I am fine with it.
PS: Rails version is 4.1 and the company is not planning to upgrade that.
You can have multiple databases by adding an environment for the second one to the database.yml file. After that, you can create a separate base class (similar to ActiveRecord::Base) and connect it to the new environment.
Have a look at this post.
However, as I can see, that will not solve your problem. Redirecting new data to the new DB while copying from the old one can lead to data inconsistencies.
For example, the ID of a record can differ between the two databases because they are fed from two data sources.
If you are upgrading the DB, I would recommend defining a scheduled downtime and letting your users know in advance. I would say that a short downtime is far better than fixing inconsistent data down the line.
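To make the ID problem concrete, here is a small in-memory simulation. `FakeTable` is a made-up stand-in for a table with an auto-incrementing primary key, not real ActiveRecord or PostgreSQL code. It shows a live write getting different IDs in the two databases, and the later bulk copy colliding with it.

```ruby
# In-memory stand-in for a table with an auto-incrementing primary key.
class FakeTable
  attr_reader :rows

  def initialize
    @rows = {}
    @next_id = 1
  end

  # Insert with a database-assigned ID (what a live write does).
  def insert(attrs)
    id = @next_id
    @next_id += 1
    @rows[id] = attrs
    id
  end

  # Insert with an explicit ID (what a bulk copy does); fails on collision.
  def copy_row(id, attrs)
    raise ArgumentError, "duplicate key #{id}" if @rows.key?(id)
    @rows[id] = attrs
    @next_id = [@next_id, id + 1].max
  end
end

def simulate_dual_write
  old_db = FakeTable.new
  new_db = FakeTable.new
  3.times { |i| old_db.insert(name: "existing-#{i}") } # IDs 1..3 in the old DB

  # A live write arrives before the bulk copy has run.
  old_id = old_db.insert(name: 'live')
  new_id = new_db.insert(name: 'live') # new_db is still empty

  # The bulk copy then tries to reuse an ID the live write already took.
  copy_error = begin
    old_db.rows.each { |id, attrs| new_db.copy_row(id, attrs) }
    nil
  rescue ArgumentError => e
    e.message
  end

  { old_id: old_id, new_id: new_id, copy_error: copy_error }
end
```

The same record ends up with a different ID in each database, so anything that references it by ID would point at the wrong row after the switch.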
When you have a downtime,
Let the customers know well in advance
Keep the downtime minimal
Have a fallback procedure: in the event the new site takes longer than you expect, roll back to the old site.
I saw this question: After git push heroku - uploaded files on Heroku are lost
each time the application shuts down and restarts after being inactive for x minutes), your application is recreated and all stored data is lost. (C)
Right now I have users who can each upload two photos. I get an email confirmation for new users, so I can see that users registered and uploaded photos 4 and 14 hours ago.
I made my last commit and pushed it to Heroku around 19 hours ago, and the 4 images that the new users uploaded are now lost. But I can see the images if I register a user right now. So it seems to be true that my app was inactive for x minutes, then restarted and deleted the images.
I read some questions like this one: [Rails] Images erased after a new commit on heroku
It says that I should use an external service like AWS S3 (I have no idea what it is, how much it will cost, or how to connect it).
So is it really true? What are my other options? Maybe I should simply use DigitalOcean (won't it have the same problem?) or something else. Will this problem continue on a paid account?
I use Rails and upload files using the carrierwave gem. I can't post code here because I am writing from another laptop.
Heroku's filesystem is readonly. You can't expect anything you upload to persist there, you need to use an external storage mechanism, something like Amazon's S3 for example.
See the links for more details.
https://devcenter.heroku.com/articles/s3
https://devcenter.heroku.com/articles/dynos#isolation-and-security
The filesystem that your Heroku instance runs on is not read-only, but it is transient - i.e. files that you store there will not persist after an instance restart.
This is a deliberate design decision by Heroku, to force you to think about where you store your data and how it impacts on scalability.
You're asking about Digital Ocean - I haven't used them but I assume from your question that they allow you to store to a persistent local filesystem.
The question that you then have to ask is: what happens if you want to run more than one instance of your app? Do they share the same persistent filesystem? Can they access each other's files? How do you handle file locking to avoid race-conditions when several app instances are using the same filesystem?
Heroku's model forces you to either put stuff in a database or store it using some external service. Generally, any of these sorts of systems will be reasonably scalable - you can have multiple Heroku instances (perhaps running on different machines, different data-centers, etc), and they will all interact nicely.
I do agree that for a simple use-case where you just want to run a single instance of an app during development it can be inconvenient, but I think this is the reasoning behind it - to force you to design this sort of thing in, rather than developing your whole app on the assumption that it can store everything locally and then find out later that you need to completely redesign to make it scalable.
What you're looking at is something called the ephemeral file system in Heroku:
Each dyno gets its own ephemeral filesystem, with a fresh copy of the most recently deployed code. During the dyno’s lifetime its running processes can use the filesystem as a temporary scratchpad, but no files that are written are visible to processes in any other dyno and any files written will be discarded the moment the dyno is stopped or restarted
In short, it means that any files you upload will only last for the time the dyno is running. When the dyno shuts down, the files will be removed unless they were part of the local git repo.
The way to resolve the issue is to store the files on a third-party service - typically S3. This will store the files on a system independent of Heroku.
Both Paperclip & Carrierwave support S3 (Simple Storage Service) - through a gem called fog. S3 gives you a "free" tier, allowing you to store a certain amount of data (I've forgotten how much) for free.
I would strongly recommend setting up an S3 account and linking it to your Heroku app. This way, any files you upload will be stored off-site.
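As a rough sketch of what the CarrierWave side could look like: the initializer below points CarrierWave at S3 through fog. It assumes the fog-aws gem is in the Gemfile, and the environment variable names and the default region are my own choices. Depending on the CarrierWave version you may also need to set `config.fog_provider` or require `fog/aws` yourself, so check the README for the version you run.

```ruby
# config/initializers/carrierwave.rb (path is the usual Rails convention)

# Build the credentials hash that CarrierWave hands to fog.
# The variable names are assumptions; use whatever you set with
# `heroku config:set`.
def fog_credentials(env = ENV)
  {
    provider: 'AWS',
    aws_access_key_id: env.fetch('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key: env.fetch('AWS_SECRET_ACCESS_KEY'),
    region: env.fetch('AWS_REGION', 'us-east-1')
  }
end

# Only runs inside an app where the carrierwave gem is loaded.
if defined?(CarrierWave)
  CarrierWave.configure do |config|
    config.fog_credentials = fog_credentials
    config.fog_directory = ENV.fetch('S3_BUCKET_NAME') # bucket name, assumed variable
  end
end
```

The uploader class then needs `storage :fog` instead of `storage :file` so that new uploads go to the bucket rather than the dyno's filesystem.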
I want to be able to have one app access multiple databases on the HEROKU "system".
Can the connection to the database be changed dynamically?
Why I ask...
I have an app that has a lot of very processor-heavy background jobs. If a given user uploads a product feed of, say, 50,000 products that have to be compared to existing products so that only the deltas are updated, it can take a "few" minutes.
Now to mitigate the delay I spin up multiple workers, each taking small bites out of the lot until there's none. I can get to about 20 workers before the GUI starts to feel sluggish because the DB is being hammered.
I've tuned some of the code and indexed the DB to some extent, and I'm sure there's more I could do, but it will eventually run into the law of diminishing returns.
For one user, I don't much care... if you upload 50k products you need to wait a bit..
But user one's choice to upload impacts user two. (different company so no cross over of data)..
Currently I handle different users by separating their data with schemas in postgresql.
The different users however share the same db connection and even on the best plan I can see a time when 20 users try to upload 50,000 products at the same time.(first of month/quarter for example).
User 21 would see a huge slow down on their system because of this..
So the question: Can I assign different users to different databases? User logs in, validates their info against a central DB, and then a different DB takes over?
My current solution is different instances of heroku. It's easy to maintain the code because it's one base and I just script the git push(es). The only issue is the different login URL's; which I suppose I could confront if I can't find an easy DB switch solution.
It sounds like you're able to shard your data by user, or set of users without much concern since you already separate them by schema. If that's the case, and you're using Ruby and ActiveRecord, look at https://github.com/tchandy/octopus. I imagine you're not looking to spin up databases on the fly, rather, you'll have them already built and ready to be used, and can add more as you go.
Granted, it sounds like what you're doing could be done a lot more effectively by using the right tool for that type of intensive processing like one of the Heroku Hadoop add-ons; nonetheless, if that's not an option for whatever reason, check out the gem above. There are a couple other gems like it, and of course you could technically manage your own ActiveRecord connections without this gem, but I think you'll find that will be painful really fast.
Of course, if you aren't using Ruby or ActiveRecord, still shard the data, and look for something like the gem above in your app's language :).
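To illustrate the "central DB decides which database takes over" idea, here is a plain Ruby sketch of the lookup step only. It does not use the Octopus API. The company names, shard names, and environment variable names are invented, and a real app would read the directory from the central database rather than from a constant.

```ruby
# Central directory: which shard holds each company's data.
# Company and shard names are made up for illustration.
SHARD_DIRECTORY = {
  'acme'   => :shard_one,
  'globex' => :shard_two
}.freeze

# Each shard name maps to the environment variable holding its URL.
SHARD_ENV_VARS = {
  shard_one: 'SHARD_ONE_DATABASE_URL',
  shard_two: 'SHARD_TWO_DATABASE_URL'
}.freeze

# Resolve the connection URL for a company after login.
def database_url_for(company, env = ENV)
  shard = SHARD_DIRECTORY.fetch(company) do
    raise ArgumentError, "unknown company #{company}"
  end
  env.fetch(SHARD_ENV_VARS.fetch(shard))
end
```

A sharding gem takes care of the harder part, which is switching the ActiveRecord connection per request or per job once the shard is known.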
The Postgres databases on Heroku are configured with environment variables. When you run heroku config you should see:
DATABASE_URL: postgres://xxx.compute.amazonaws.com:5432/xxx
You can use these variables to connect to databases on other Heroku instances, or share a single database between different Heroku apps.
If you try to run this kind of setup on free Heroku instances, I think it is against their terms of service.
If it's about scalability, I think you will just have to pay for a more expensive database instance...
I recently reached the 5mb database limit with heroku, the costs rise dramatically after this point so I'm looking to move the database elsewhere.
I am very new to using VPS and setting up servers from scratch, however, I have done this recently for another app.
I have a couple questions related to this:
Is it possible to create a database on a VPS and point my rails app on heroku to use that database?
If so, what would database.yml actually look like? What would an example look like where the host is not localhost and the database is stored outside the app?
These may be elementary questions, but my knowledge of servers and programming is very much self-taught, so I admit there may be huge gaps in things that I "should" already understand.
Note: Other (simpler) suggestions for moving my database are welcomed. Thanks.
OK - for starters, yes you can host a database external to Heroku and point your database.yml at that server - it's simply a case of setting up the hostname to point at the right address, and give it the correct credentials.
However, you need to consider a couple of things:
1) Latency - unless you're hosting inside EC2 East the latency between Heroku and your DB will cause you all sorts of performance issues.
2) Setting up a database server is not a simple task. You need to consider how secure it is, how it performs, keeping it up to date, keeping it backed up, and having to worry day and night about it being up. With Heroku you don't need to do this as it's fully managed.
Price-wise, are you aware of the new low-cost Postgres plans at Heroku? $15/mo will get you 20 GB (shared instance), and $50/mo will get you a terabyte (dedicated instance). To me, that is absurdly cheap as I value my time much more, and I know how many hours I would need to invest in making my own server to save maybe $10 a month.
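On the database.yml part of the question, here is a sketch of the keys such an entry carries, derived from a connection URL. The host name, user, and database name in the usage line are placeholders, and the sketch assumes a PostgreSQL server.

```ruby
require 'uri'

# Turn a connection URL into the keys a database.yml entry would carry.
def database_config(url)
  uri = URI.parse(url)
  {
    'adapter'  => uri.scheme == 'postgres' ? 'postgresql' : uri.scheme,
    'host'     => uri.host,                    # the external server, not localhost
    'port'     => uri.port || 5432,            # PostgreSQL's default port
    'database' => uri.path.sub(%r{\A/}, ''),
    'username' => uri.user,
    'password' => uri.password
  }
end

# Usage with placeholder values:
#   database_config('postgres://app_user:secret@db.example.com:5432/myapp_production')
```

The only real difference from a local setup is that `host` points at the external server and the credentials belong to a user that server accepts remote connections from.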
It would be cheaper to use Amazon RDS, which is officially supported by Heroku and served from the same datacenter (Amazon US-East). If you do want to use a VPS, use an Amazon EC2 instance in US-East for maximum performance. This tutorial shows exactly how to do it with Django in detail. Even if you don't decide to use EC2, refer to that tutorial to see how to properly add external database information to your Heroku application so that Heroku doesn't try to overwrite it.
Still, Heroku's shared database is extremely cost-competitive, far more so than most VPSes, and with much less setup and maintenance.
I am building an app that is fast moving into production and I am concerned about the possibility that due to hacking, some silly personal error (like running rake db:schema:load or rake db:rollback) or other circumstance we may suffer data loss in one database table or even across the system.
While I don't find it likely that the above will happen, I would be remiss in not being prepared in case it ever does.
I am using Heroku's PG Backups (which is to be replaced with something else this month), and I also run automated daily backups to S3: http://trevorturk.com/2010/04/14/automated-heroku-backups/, successfully generating .dump files.
What is the correct way to deal with data loss on a production app?
How would I restore the .dump file in case I need to? Can I do a selective restore if a small part of the system is hit?
In case a selective restore is not possible: assume one table loses data 4 hours after the last backup. Result => would fixing the lost table require rolling back 4 hours of users' activity? Any good solution to this?
What is the best way to support users through the inconvenience if something like this happens?
A full DR (disaster recovery) solution requires the following:
Multisite. If a fire, flood, Osama Bin Laden or whathaveyou strikes the Amazon (or is it Salesforce?) data center that Heroku uses, you want to be sure that your data is safe elsewhere.
On-going replication of the data to a separate site (or sites). That means that every transaction that's written to your database on one site, is replicated within seconds to the mirror database on the other site. Most RDBMS's have mechanisms to let you do a master-slave replication like that.
The same goes for anything you put on a filesystem outside of the database, such as images, XML configuration files etc. S3 is a good solution here - they replicate everything to multiple data centers for you.
It won't hurt to create periodic (daily or so) dumps of the database and store them separately (e.g. on S3). This helps you recover from data corruption that propagates to the slave DBs.
Automate the process of data recovery. You want this to just work when you need it.
Test everything. Ideally, you want to automate the test process and run it periodically to ensure that your backups can restore. Netflix Chaos Monkey is an extreme example of this.
I'm not sure how you'd implement all this on Heroku. A complete solution is still priced out of reach for most companies - we're running this across our own data centers (one in the US, one in EU) and it costs many millions. Work according to the 80-20 rule - on-going backup to a separate site, plus a well tested recovery plan (continuously test your ability to recover from backups) covers 80% of what you need.
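One small piece of "continuously test your backups" that is easy to automate is checking that a recent backup exists at all. The sketch below is only that freshness check; the 24-hour threshold is an assumption, and it does not replace an actual restore test.

```ruby
# Returns true when the newest backup is older than max_age_hours,
# or when there are no backups at all.
def backups_stale?(backup_times, now: Time.now, max_age_hours: 24)
  newest = backup_times.max
  return true if newest.nil?

  (now - newest) > max_age_hours * 3600
end
```

`backup_times` would be the timestamps of the dump files in your off-site storage. A scheduled job can call this and alert someone when it returns true.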
As for supporting users, the best solution is simply to communicate timely and truthfully when trouble happens and make sure you don't lose any data. If your users are paying for your service (i.e. you're not ad-supported), then you should probably have an SLA in place.
About backups: you can never be 100 percent sure that no data will be lost. The best approach is to test the restore on another server. You must have at least two types of backup:
A database backup, like pg_dump. A plain dump consists solely of SQL commands, so you can use it to recreate the whole database, just a table, or just a few rows. You lose the data added in the meantime.
A code backup, for example a git repository.
in addition to Hartator's answer:
use replication if your DB offers it, e.g. at least master/slave replication with one slave
do database backups on a slave DB server and store them externally (e.g. scp or rsync them out of your server)
use a good version control system for your source code, e.g. Git
use a solid deploy mechanism, such as Capistrano and write your custom tasks, so nobody needs to do DB migrations by hand
have somebody you trust check your firewall setup and the security of your system in general
The DB-Dumps contain SQL-commands to recreate all tables and all data... if you were to restore only one table, you could extract that portion from a copy of the dump file and (very carefully) edit it and then restore with the modified dump file (for one table).
Always restore first to an independent machine and check that the data looks right. For example, you could take one slave server offline, restore there locally, and check the data. It helps if you have two slaves in your system, because the remaining system then still has one master and one slave while you restore to the second slave.
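For the selective-restore question, here is a sketch that builds a pg_restore call restoring either the whole dump or a single table's data into a scratch database. It assumes the .dump file is in PostgreSQL's custom format (a plain SQL dump is loaded with psql instead), and the file name, table name, and SCRATCH_DATABASE_URL variable are placeholders.

```ruby
# Build a pg_restore invocation as an argument array, which avoids
# shell quoting problems. Pass `table:` to restore only that table's data.
def pg_restore_command(dump_path, database_url, table: nil)
  cmd = ['pg_restore', '--verbose', '--no-owner', '--no-acl',
         "--dbname=#{database_url}"]
  cmd += ['--data-only', "--table=#{table}"] if table
  cmd << dump_path
end

# Usage (placeholders, run against a scratch database first):
#   system(*pg_restore_command('latest.dump',
#                              ENV.fetch('SCRATCH_DATABASE_URL'),
#                              table: 'orders'))
```

Restoring into a scratch database first lets you inspect the table and then copy only the missing rows into production, instead of rolling everyone back to the time of the backup.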
To simulate a fairly simple "total disaster recovery" on Heroku, create another Heroku project and replicate your production application completely (except use a different custom domain name).
You can add multiple remote git targets to a single git repository so you can use your current production code base. You can push your database backups to the replicated project, and then you should be good to go.
The only step missing from this exercise versus a real disaster recovery is assigning your production domain to the replicated Heroku project.
If you can afford to run two copies of your application in parallel, you could automate this exercise and have it replicate itself on a regular basis (e.g. hourly, daily) based on your data loss tolerance.