Cassandra keeping incremental backups AND commitlog archive? - datastax-enterprise

I am currently refining a Cassandra backup solution.
I am stuck on the question of whether I should keep incremental_backups AND the commitlog archive.
If I understand correctly, restoring from either
Snapshot + incremental backups + commit logs (only those written after the last flush)
OR
Snapshot + commit logs from the archive
should end in the same set of data, right?
Or is the latter option much slower because replaying commit logs takes longer than just checking the SSTables' integrity?
Should I keep both?

I would prefer incremental backups over commit logs.
Incremental backups produce hard links to immutable SSTables, which can then be loaded back into a live Cassandra cluster using sstableloader. When incremental backups are enabled (they are disabled by default), Cassandra hard-links each flushed SSTable to a backups directory under the keyspace data directory. The disadvantage of incremental backups is that they are all or nothing: it is not possible to select a subset of column families for incremental backup. As I mentioned before, the ability to restore into a live Cassandra cluster, even into a different column family, makes incremental backups superior. You also have to manage the incremental backup space yourself, because there is no utility to clean up incremental backups over time or even do a rebase.
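For context, a minimal sketch of what this looks like in practice; the keyspace, table, node address and paths below are placeholders, and the runtime toggle is only available on newer Cassandra/DSE versions:
# cassandra.yaml
incremental_backups: true
# newer versions also let you toggle this at runtime on each node:
nodetool enablebackup
# Flushed SSTables are then hard-linked under <data_dir>/<keyspace>/<table>/backups/.
# To load them back into a live cluster, copy the backed-up SSTables into a
# keyspace/table directory layout and point sstableloader at it:
sstableloader -d 10.0.0.1 /var/lib/cassandra/data/my_keyspace/my_table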
The advantage of commit logs is that they provide a point-in-time restore capability. To restore from commit logs, you go back to the latest incremental backup (your former case) or the latest snapshot (your latter case), stop the database, clear the existing commit logs, copy in the archived commit logs written since that incremental backup or snapshot, and replay them (roll forward) to bring the database to the exact point in time that you require.
However, if you use only commit logs, your database downtime is going to be higher, as you have more commit logs to process while the database is down. So I would use the incremental backup approach and then apply commit logs on top.
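For reference, commit log archiving is driven by conf/commitlog_archiving.properties; a minimal sketch of the archive side, assuming /backup/commitlog as the archive location (Cassandra substitutes %path and %name itself):
# conf/commitlog_archiving.properties (archive side; paths are examples)
archive_command=/bin/ln %path /backup/commitlog/%name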
Lastly, it is better to use a professional tool here rather than hacking this on your own; from experience with multiple customers, both the first and second approaches are fraught with potential for error.

It depends on your restore requirements. If you want to restore to a particular point in time, then you will need snapshots, incremental backups and commit logs.
Why commit logs?
Say you take a snapshot at 11:00 am today and you want to restore to 11:30 am today. If you only use incremental backups that cover 11:30 am,
there is a possibility that you will end up with some extra or missing data.
For example, if someone deletes a row at 11:31 am and your incremental backups include an SSTable that was flushed at 11:32 am, then after the restore you will find that row as a tombstone, which is wrong for an 11:30 am restore point.
So for a point-in-time restore you need to process commit logs along with the full snapshot and the incremental backups.
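The restore side of that same commitlog_archiving.properties file is where the point-in-time part comes in; a sketch using the 11:30 am example above, with placeholder paths and date (check the comments in your version's properties file for the exact timestamp format):
# conf/commitlog_archiving.properties (restore side; values are examples)
restore_command=cp -f %from %to
restore_directories=/backup/commitlog
restore_point_in_time=2023:06:01 11:30:00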

Related

Rails & Heroku: Automatically save a copy of data from heroku

Hi everyone,
I have a small site running on Heroku's free tier. It fetches/updates data from various sources frequently, and I want to save a copy of the database (~10,000 records) every month to somewhere else, so I can see how the data changes over time and do some more detailed analysis. The website is developed in Ruby on Rails.
I want to know:
What is the best practice for exporting data from Heroku, especially for Ruby on Rails apps? (~10,000 records)
Is there any good place to share this data with others (e.g. a Kaggle dataset or a GitHub repo)?
Thanks!
TL;DR: You're better off building your own export script that connects to your database instance and takes a SQL dump. The hobby plan is very limited.
There are multiple backup strategies. For instance, if you only need an export once a month, you could set up a cron job that runs every 30 days (or on the 1st of each month) and exports the data you want.
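As a rough sketch of such a script; the app name, paths and schedule are assumptions, and it presumes the Heroku CLI and pg_dump are installed on the machine running cron:
# crontab: run at 03:00 on the 1st of each month
0 3 1 * * /home/deploy/bin/export_my_app.sh
# export_my_app.sh -- dump the Heroku Postgres database
DB_URL=$(heroku config:get DATABASE_URL --app my-app)
pg_dump "$DB_URL" --format=custom --file="/backups/my_app_$(date +%Y-%m).dump"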
Since you're using Heroku, they have a way to manage backups. To do so navigate to:
https://dashboard.heroku.com/apps/{your-app}/resources
Select your database add-on
Navigate to Durability
There you should see Heroku's default backup strategy. To modify it, the Heroku toolbelt provides the following:
heroku pg:backups:schedule DATABASE_URL --at '02:00 America/Los_Angeles' --app sushi
but this would be daily backups.
Mind the following constraint:
A monthly backup means that only 1 backup is saved over the course of a month. Based on current limits, for example, a Premium-0 would have 12 monthly backups, one for each of the last 12 months.
Also, if you decide to adopt Heroku's built-in approach, mind the following:
There is a limit to the number of manual backups that you can retain. That number is based on your database plan.
Plan Backups Retained
Hobby-Dev 2
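If the scheduled backups don't fit a monthly cadence, you can also trigger and fetch backups yourself and store them off-platform; a sketch with the current Heroku CLI (the app name is a placeholder), subject to the retention limits quoted above:
heroku pg:backups:capture --app my-app
heroku pg:backups:download --app my-app   # saves the most recent backup as latest.dump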
Concerning sharing it, there are a few things to take into consideration; for instance, if the information is sensitive, by default you want a way to control who has access to the resource. There are ways to achieve this commercially, such as a private GitHub repo or even an Amazon S3 bucket with an ACL (Access Control List). Heroku's Dataclips may also be an option, but I am not sure that is what you want.

setting up Neo4j replication on two instances

I am planning to configure some sort of 2-node replication for Neo4j, similar to MySQL replication. Since I am a little constrained on resources I don't want to pay for more than two cloud compute instances. Also, I am happy with just one real-time or near real-time copy of the Neo4j database. So the approaches I can think of are:
Configure HA on the two compute nodes with the help of an arbiter instance: set up one Neo4j instance (master) on the first node, and a second Neo4j instance (slave) plus an arbiter instance (only for arbitration, no data) on the second node.
OR
Set up a cron job for online backups using the neo4j-backup tool, with incremental backups every hour or so. I am not sure about the load this may put on the prod server; I plan to test that out.
I am more inclined towards the first approach since I get a more real-time copy of the database (I also get HA/load balancing with instant failover, but that is not a priority right now).
Please let me know
which of the two approaches is better,
if there is another way to achieve the same or
if any of the above approaches are not suitable or have some flaws.
I am a little new to Neo4j HA so please pardon me for my ignorance. Thanks !
You already mentioned the available solutions.
TL;DR: I prefer the first option.
Cluster
In general, the recommended layout is 3 nodes (2 slaves + 1 master).
But your layout, 2 nodes (1 master + 1 slave + 1 arbiter), is viable too, especially if one server can handle your workload.
Good things:
Almost "real-time" replica.
Possibility to utilise both machines' resources to handle a bigger workload.
Better availability.
Notes:
If you have a 10 MB/sec write load on the master, then the same load will be applied to the slave node. This shouldn't affect reads from the slave at all (unless the write load is REALLY huge).
Maintenance costs are higher than for a single-instance installation. You should plan how to handle cluster upgrades, configuration updates and plugin updates.
Branched data. In a clustered environment there is a possibility of ending up in a "split-brain" scenario, where the 2 nodes have different data and a decision has to be made about which data to keep. Neo4j handles such cases quite well, but you should keep in mind that small data loss can occur in VERY RARE scenarios.
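For orientation, roughly what the 1 master + 1 slave + 1 arbiter layout looks like in a Neo4j 2.x-era HA configuration; host names, ports and ids are placeholders, and exact file and setting names vary by version, so treat this as a sketch rather than a recipe:
# conf/neo4j.properties on node1 (master candidate)
ha.server_id=1
ha.initial_hosts=node1:5001,node2:5001,node2:5002
# conf/neo4j.properties on node2 (slave); the arbiter on node2 joins as a third member
ha.server_id=2
ha.initial_hosts=node1:5001,node2:5001,node2:5002
# conf/neo4j-server.properties on both database nodes
org.neo4j.server.database.mode=HA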
Backup
Good things:
Simple. Just take backups of the database.
Consistency check. When a backup is made, the tool runs a consistency check to verify that the database is not damaged. There is no possibility that the backup will screw up the live database. If there are any issues, you will be notified via the backup utility's logs. See below for detailed info on how a backup is performed.
Database. A Neo4j backup is a fully functional database. You can spin up a server that points to the backup database and do everything you want.
Incremental backups. You can take incremental backups as often as you want.
Notes:
Neo4j scales vertically very well (depending on the size of the database). It can handle a huge load on a single instance (we had up to 3k requests/second on a medium machine). So you can get one bigger machine for the Neo4j server and a smaller (cheaper) one for backups.
How backup is performed?
One thing to keep in mind: the live database remains fully operational during a backup. The backup utility doesn't stop or prevent any actions.
When a transaction in the database is committed, all changes are appended to the transaction log.
When no previous backup is present: copy the whole store.
When there is a previous backup AND transaction logs are available: copy the new transaction logs and replay them onto the backup store.
When there is a previous backup AND transaction logs are NOT available: discard the existing backup and copy the whole store again.
Why might transaction logs not be available? Your configuration may keep only the latest transaction logs (e.g. 1 hour's worth), or none at all.
Relevant settings:
keep_logical_logs
logical_log_rotation_threshold
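A sketch of what that looks like on the command line and in the config; the host, paths and retention value are placeholders, and the backup tool's flags differ slightly between Neo4j versions:
# full backup the first time, incremental afterwards if the target directory already holds one
./bin/neo4j-backup -host 192.168.1.10 -to /mnt/backups/neo4j
# conf/neo4j.properties -- keep enough transaction logs for incremental backups to work
keep_logical_logs=7 days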
Other
Anyway, you should consider making backups even in a clustered environment. Everything can fail at any moment.
In general, everything depends on your load and database size.
If your database is small enough to fit fully in memory and one machine is enough to handle the full load, then one Neo4j instance will be enough. Just take backups.
If you want better scalability/availability and a real-time working replica, then the cluster setup is the best choice.

How to take a snapshot of neo4j database

I see there is a tool that allows for backups to be taken of a running Neo4J database, either via Java or via the backup tool.
The backup will obviously take some time to complete, during which time additional nodes may be added, modified or deleted. Is it possible to take a snapshot of the graph database at a particular instant in time?
My use case: N4J is used to store events, which are stored elsewhere. I'd like to take a snapshot of the graph at an instant in time, then when it's restored at a later date, know what was missing from the graph based on when the backup was taken and be able to reconstruct a complete version of the database that is accurate to the present time by adding the missing events.
There's a related question with a good discussion of this; let me cut to the chase.
If you're using the commercial version of neo4j, then neo4j backup options and/or the backup tool are your best options.
If you're using community edition, then you can't do online backups at present. I have several applications that run on Neo4j Community, and we have a cron job that runs at 03:00. It shuts down the application and creates a copy of the database in another location (by copy, I mean it actually creates a tar.gz archive of the DB directory). After this and some other maintenance are completed, the application gets restarted.
Depending on file copy performance and DB size, this isn't too bad. We have a moderately sized DB and we simply accept about 10 minutes of downtime every night.
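Roughly what that nightly job boils down to, assuming a stock install; the service name, paths and schedule are placeholders:
# crontab: 03:00 every night
0 3 * * * /usr/local/bin/neo4j_cold_backup.sh
# neo4j_cold_backup.sh -- stop, archive, restart
service neo4j stop
tar czf /backups/neo4j-$(date +%F).tar.gz /var/lib/neo4j/data/graph.db
service neo4j start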
The neo4j-backup tool is part of Neo4j Enterprise edition. It takes a consistent backup as of the time you start it. After the backup is finished, a verbose consistency check is run to validate recoverability. It works either as a full backup or incrementally.
This tool does not cover restoring to a given point in time or comparing with other backups. A point-in-time restore can be achieved by combining it with a classic file backup tool. I've had good experience with backup2l; neo4j-backup is started as part of backup2l's PRE_BACKUP hook. The same approach should work with any other backup tool out there.
Using your backup tool you can then retrieve the full graph.db directory at the desired point in time from your archives and use it.
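For illustration, the hook is just a shell function in backup2l's config; a sketch with placeholder paths:
# /etc/backup2l.conf (excerpt)
PRE_BACKUP ()
{
    # take/refresh the Neo4j backup before backup2l archives the directory
    /opt/neo4j/bin/neo4j-backup -host localhost -to /var/backups/neo4j
}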

What does Transactional Backup Interval mean in the Backup Plan Wizard for Team Foundation Server?

I'm in the process of setting up a backup plan for a Team Foundation Server. I downloaded Power Tools for TFS and I'm using the Backup Plan Wizard that was included in that pack. I am now at the step where I'm supposed to decide how to schedule the backups and I have no idea what to choose for my setup.
I get what everything means, except Transactional Backup Interval.
I would appreciate suggestions for a good schedule. What I would like to achieve is being able to restore and still look back a few versions, if possible. The minimum backup I would like to have is the latest version.
It might be important to add that I got to choose "Backup retention days" earlier and set that to 30.
The transactional backup interval likely refers to how often transaction logs for your TFS databases are backed up. The schedule you choose will probably depend on how busy your repository is.
At my current client there are six developers, and we share some of the load for source control between VSS and TFS (we're transitioning). Corporate policy says we must back up transaction logs every hour during business hours, plus an additional one at midnight. Our local backups are on a four-day retention cycle, with off-site backups lasting years.
I would make the decision based on how much work you'd be willing to lose if your repository was lost and your working copy was destroyed simultaneously (natural disaster?).

How to prepare for data loss in a production website?

I am building an app that is fast moving into production, and I am concerned about the possibility that, due to hacking, a silly personal error (like running rake db:schema:load or rake db:rollback) or some other circumstance, we may suffer data loss in one database table or even across the system.
While I don't find it likely that the above will happen, I would be remiss in not being prepared in case it ever does.
I am using Heroku's PG Backups (which is to be replaced with something else this month), and I also run automated daily backups to S3: http://trevorturk.com/2010/04/14/automated-heroku-backups/, successfully generating .dump files.
What is the correct way to deal with data loss on a production app?
How would I restore the .dump file in case I need to? Can I do a selective restore if a small part of the system is hit?
In case a selective restore is not possible: assume one table loses data 4 hours after the last backup. Result => would fixing the lost table require rolling back 4 hours of users' activity? Any good solution to this?
What is the best way to support users through the inconvenience if something like this happens?
A full DR (disaster recovery) solution requires the following:
Multisite. If a fire, flood, Osama Bin Laden or what have you strikes the Amazon (or is it Salesforce?) data center that Heroku uses, you want to be sure that your data is safe elsewhere.
Ongoing replication of the data to a separate site (or sites). That means that every transaction written to your database on one site is replicated within seconds to the mirror database on the other site. Most RDBMSs have mechanisms to let you do master-slave replication like that.
The same goes for anything you put on a filesystem outside of the database, such as images, XML configuration files, etc. S3 is a good solution here: they replicate everything to multiple data centers for you.
It won't hurt to create periodic (daily or so) dumps of the database and store them separately (e.g. on S3). This helps you recover from data corruption that propagates to the slave DBs.
Automate the process of data recovery. You want this to just work when you need it.
Test everything. Ideally, you want to automate the test process and run it periodically to ensure that your backups can restore. Netflix Chaos Monkey is an extreme example of this.
I'm not sure how you'd implement all of this on Heroku. A complete solution is still priced out of reach for most companies; we're running this across our own data centers (one in the US, one in the EU) and it costs many millions. Work according to the 80-20 rule: ongoing backup to a separate site, plus a well-tested recovery plan (continuously test your ability to recover from backups), covers 80% of what you need.
As for supporting users, the best solution is simply to communicate timely and truthfully when trouble happens and make sure you don't lose any data. If your users are paying for your service (i.e. you're not ad-supported), then you should probably have an SLA in place.
About backups, you can never be 100 percent sure that no data will be lost. The best thing is to test them on another server. You must have at least two types of backups:
A database backup, like a pg_dump. A dump is just SQL commands, so you can use it to recreate the whole database, just a table, or just a few rows. You lose the data added in the meantime.
A code backup, for example a git repository.
In addition to Hartator's answer:
use replication if your DB offers it, e.g. at least master/slave replication with one slave
do database backups on a slave DB server and store them externally (e.g. scp or rsync them out of your server; see the sketch after this list)
use a good version control system for your source code, e.g. Git
use a solid deploy mechanism, such as Capistrano and write your custom tasks, so nobody needs to do DB migrations by hand
have somebody you trust check your firewall setup and the security of your system in general
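For the second point in the list (dumping on a replica and shipping the dump off the server), a minimal sketch; the database name, paths and host are placeholders:
# on the slave/replica DB server
pg_dump -Fc mydb -f /var/backups/mydb-$(date +%F).dump
rsync -az /var/backups/ backupuser@offsite.example.com:/srv/db-backups/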
The DB dumps contain SQL commands to recreate all tables and all data... If you were to restore only one table, you could extract that portion from a copy of the dump file, (very carefully) edit it, and then restore from the modified dump file (for that one table).
Always restore first to an independent machine and check whether the data looks right. For example, you could use one slave server, take it offline, restore there locally and check the data. It is good if you have two slaves in your system; then the remaining system still has one master and one slave while you restore to the second slave.
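If the dump was made with pg_dump's custom format, pg_restore can pull a single table out directly instead of hand-editing SQL; restoring it into a scratch database first matches the advice above (all names and paths are placeholders):
createdb scratch_restore
pg_restore --dbname=scratch_restore --table=users /backups/mydb-2023-06-01.dump
# inspect the table in scratch_restore before copying anything back into production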
To simulate a fairly simple "total disaster recovery" on Heroku, create another Heroku project and replicate your production application completely (except use a different custom domain name).
You can add multiple remote git targets to a single git repository so you can use your current production code base. You can push your database backups to the replicated project, and then you should be good to go.
The only step missing from this exercise versus a real disaster recovery is assigning your production domain to the replicated Heroku project.
If you can afford to run two copies of your application in parallel, you could automate this exercise and have it replicate itself on a regular basis (e.g. hourly, daily) based on your data loss tolerance.
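With the current Heroku CLI, the moving parts look roughly like this; the app names and branch are placeholders:
# deploy the same code base to the replicated app via a second git remote
git remote add dr https://git.heroku.com/my-app-dr.git
git push dr main            # or master, depending on your default branch
# load the latest production backup into the replica's database
heroku pg:backups:capture --app my-app
heroku pg:backups:restore $(heroku pg:backups:url --app my-app) DATABASE_URL --app my-app-dr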
