I see there is a tool that allows backups to be taken of a running Neo4j database, either via Java or via the backup tool.
The backup will obviously take some time to complete, during which time additional nodes may be added, modified or deleted. Is it possible to take a snapshot of the graph database at a particular instant in time?
My use case: Neo4j is used to store events, which are stored elsewhere as well. I'd like to take a snapshot of the graph at an instant in time; then, when it's restored at a later date, I want to know what is missing from the graph based on when the backup was taken, and be able to reconstruct a complete version of the database that is accurate to the present time by adding the missing events.
There's a related question with a good discussion of this, but let me cut to the chase.
If you're using the commercial edition of Neo4j, then its online backup options and/or the backup tool are your best bet.
If you're using the community edition, then you can't take online backups at present. I have several applications that run on Neo4j Community, and we have a cron job that runs at 03:00: it shuts down the application and creates a copy of the database in another location (by copy, I mean it actually creates a tar.gz archive of the DB directory). After this and any other maintenance is completed, the application is restarted.
Depending on file copy performance and DB size, this isn't too bad. We have a moderately sized DB and simply accept about 10 minutes of downtime every night.
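For illustration, a minimal sketch of such a nightly job, assuming a systemd-managed application and a store directory under /var/lib/neo4j (all paths and service names are placeholders for your own setup):

    #!/usr/bin/env bash
    # Nightly offline backup of a Neo4j Community database, run from cron at 03:00.
    # Service names and paths are examples only; adjust for your environment.
    set -euo pipefail

    DB_DIR=/var/lib/neo4j/data/graph.db      # Neo4j store directory
    BACKUP_DIR=/backups/neo4j                # where the archives are kept
    STAMP=$(date +%Y%m%d)

    systemctl stop myapp neo4j               # stop the application and the database

    # Archive the store directory while nothing is writing to it.
    tar -czf "$BACKUP_DIR/graph.db-$STAMP.tar.gz" \
        -C "$(dirname "$DB_DIR")" "$(basename "$DB_DIR")"

    # ... any other maintenance goes here ...

    systemctl start neo4j myapp              # bring everything back up

A crontab entry such as 0 3 * * * /usr/local/bin/neo4j-offline-backup.sh schedules it; the essential part is that the database is fully stopped while the archive is made.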
The neo4j-backup tool is part of the Neo4j Enterprise edition. It takes a consistent backup as of the moment you start it. After the backup finishes, a verbose consistency check is run to validate recoverability. It works either as a full backup or incrementally.
The tool itself does not support restoring to a given point in time or comparing backups. A point-in-time restore can be achieved by combining it with a classic file backup tool; I've had good experience with backup2l. neo4j-backup would be started as part of backup2l's PRE_BACKUP step. The same approach should work with any other backup tool out there.
Using your file backup tool you can then retrieve the full graph.db directory from your archives as it existed at the desired point in time and use it.
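As a sketch only (the exact flags differ between Neo4j versions, so check neo4j-backup --help for yours), the PRE_BACKUP hook in /etc/backup2l.conf could invoke the tool like this, with placeholder paths:

    # Excerpt from /etc/backup2l.conf (paths are examples).
    # backup2l runs PRE_BACKUP before it archives the configured directories,
    # so the fresh neo4j-backup output ends up inside the file-level backup.
    PRE_BACKUP ()
    {
        # Online backup into a local directory; with Neo4j 2.x the flags
        # look roughly like this (the first run is full, later runs incremental).
        /opt/neo4j/bin/neo4j-backup -host localhost -to /var/backups/neo4j/graph.db
    }

    # Make sure /var/backups/neo4j is listed in SRCLIST so backup2l picks it up.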
I'm trying to make a shell script that will allow the users to backup an Informix IDS database before using it and rollback (restore it) if they need to do so.
I know I can use ontape and onbar, but I don't know if they would work for every database, no matter the size, and to be honest, I don't know if it would be safe for the users to use a script that takes the DBNAME as an argument to back up/restore.
Using ON-Tape (ontape), you can back up a whole server, but not a single database. Using ON-Bar (onbar), you can back up one or more storage spaces (dbspaces, blobspaces, etc) or the whole server. Therefore, if you locate the database in a separate dbspace and ensure no other database uses the dbspace, then you can use ON-Bar to achieve a database-level backup. Consequently, you must design your system to allow for database recovery and restore.
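As a rough sketch (exact options depend on your Informix version and ON-Bar/onconfig setup), with mydb_space standing in for a dbspace that holds only the one database:

    # Level-0 backup of only the dbspace that holds the target database.
    onbar -b -L 0 mydb_space

    # Later: restore that storage space (physical restore plus logical-log replay).
    onbar -r mydb_space

    # By contrast, ON-Tape archives the whole server, e.g. a level-0 archive:
    ontape -s -L 0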
Running backups requires administrative privileges, which you should not give to anyone casually. Therefore, you will need to design a backup and restore system that will limit people to backing up the databases you intend them to be able to backup. I have some views on how this can be done, but the result is complex.
Amongst other places, read the Comparison of the ON-Bar and ON-Tape utilities. That is part of the Backup and Restore Guide documentation.
I am currently refining a Cassandra backup solution.
So I am stuck on the question of whether I should keep incremental_backups AND the commitlog_archive.
If I understand correctly, restoring from either
Snapshot + incremental backups + commit logs (only those written after the last flush)
OR
Snapshot + commit logs from the archive
should result in the same set of data, right?
Or is the latter option much slower because replaying the commit logs takes longer than just checking the SSTables' integrity?
Should I keep both?
I would prefer incremental backups over commit logs.
Incremental backups result in hard links to immutable SSTables, which can then be loaded back into a live Cassandra cluster using sstableloader. When incremental backups are enabled (they are disabled by default), Cassandra hard-links each flushed SSTable into a backups directory under the keyspace data directory. The disadvantage of incremental backups is that they are all or nothing: it is not possible to select a subset of column families for incremental backup. As I mentioned before, the ability to restore to a live Cassandra cluster, even to a different column family, is what makes incremental backups superior. You also have to manage the incremental backup space yourself, because there is no utility to clean up incremental backups over time or to rebase them.
The advantage of commit logs is that they provide a point-in-time restore capability. To restore from commit logs, you go back to the latest incremental backup (in your first option) or the latest snapshot, stop the database, clear the existing commit logs, copy in the archived commit logs written since that backup, and replay them to roll the database forward to the exact point in time you require.
However, if you use only commit logs, your database downtime is going to be higher as you have more commit logs to process while the database is down. So, I would use the incremental backup approach and then use commit logs.
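A minimal sketch of that workflow, assuming a hypothetical keyspace/table ks.events and default data directories; check the nodetool and sstableloader options for your Cassandra version:

    # Enable incremental backups on each node (equivalent to setting
    # incremental_backups: true in cassandra.yaml).
    nodetool enablebackup

    # Take a baseline snapshot of the keyspace.
    nodetool snapshot -t baseline ks

    # From now on every flushed SSTable is hard-linked into
    # .../data/ks/events-*/backups/ and can be copied off the node.

    # To restore into a live cluster, point sstableloader at a directory laid
    # out as <keyspace>/<table>/ containing the snapshot plus incremental SSTables.
    sstableloader -d 10.0.0.1 /restore/ks/events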
Lastly, it is better to use a professional tool here rather than hacking this up on your own; from experience with multiple customers, both the first and second approaches are fraught with potential for error.
It depends on your restore requirements. If you want to restore to a particular time, then you will need the snapshot, the incremental backups and the commit logs.
Why commit logs?
Say you take a snapshot at 11:00 am today and you want to restore to 11:30 am today, using only incremental backups that cover 11:30 am.
Then there is a possibility that you will end up with some extra or missing data.
For example, if someone deletes a row at 11:31 am and the incremental backups contain an SSTable flushed at 11:32 am, then after the restore you will find that row as a tombstone, which is wrong for the restore time you wanted.
So for a point-in-time restore you need to replay commit logs along with the full snapshot and incremental backups.
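For reference, a hedged sketch of how that is wired up in conf/commitlog_archiving.properties; the archive directory is hypothetical, while the %path/%name/%from/%to placeholders are Cassandra's own:

    # conf/commitlog_archiving.properties (restart the node after editing)

    # While the node runs, each closed commit log segment is archived:
    archive_command=/bin/cp %path /backup/commitlog_archive/%name

    # During a restore, archived segments are replayed from this directory:
    restore_command=/bin/cp -f %from %to
    restore_directories=/backup/commitlog_archive

    # Replay stops at the desired restore point (the format is Cassandra's own):
    restore_point_in_time=2016:01:28 11:30:00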
I am planning to configure some sort of 2-node replication for Neo4j, similar to MySQL replication. Since I am a little constrained on resources, I don't want to pay for more than two cloud compute instances. Also, I am happy with just one real-time or near-real-time copy of the Neo4j database. So the approaches I can think of are:
Configure HA on the two compute nodes with the help of an arbiter instance: set up one Neo4j instance (master) on the first node, and another Neo4j instance (slave) plus an arbiter instance (only for arbitration, no data logging) on the second node.
OR
Set up a cron job for online backups using the neo4j-backup tool, with incremental backups every hour or so. I'm not sure how much load that may put on the prod server; I'm planning to test that out.
I am more inclined towards the first approach since I get a more real-time copy of the database (I also get HA/load balancing with instant failover, but that is not a priority right now).
Please let me know:
which of the two approaches is better,
if there is another way to achieve the same, or
if either of the above approaches is not suitable or has some flaws.
I am a little new to Neo4j HA, so please pardon my ignorance. Thanks!
So, you have already mentioned the available solutions.
TL;DR: I prefer the first option.
Cluster
In general, the recommended layout is 3 nodes (2 slaves + 1 master).
But your layout of 2 machines (1 master + 1 slave + 1 arbiter) is viable too, especially if one server can handle your workload.
Good things:
Almost "real-time" replica.
The possibility to utilise the resources of both machines to handle a bigger workload.
Better availability.
Notes:
If you have 10 MB/sec of write load on the master, the same load will be applied to the slave node. This shouldn't affect reads from the slave at all (unless the write load is really huge).
Maintenance costs are higher than for a single-instance installation. You should plan how to handle cluster upgrades, configuration updates and plugin updates.
Branched data. In a clustered environment there is a possibility of ending up in a "split-brain" scenario, where the two nodes have different data and a decision has to be made about which data to keep. Neo4j handles such cases quite well, but you should keep in mind that a small data loss can occur in very rare scenarios.
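For the 2-machine layout discussed above, a hedged sketch of the relevant HA settings (names follow the Neo4j 2.x HA documentation; IPs, ports and server IDs are placeholders, and how the arbiter process is launched depends on your version):

    # conf/neo4j.properties on machine 1 (database instance, master candidate)
    ha.server_id=1
    ha.initial_hosts=10.0.0.1:5001,10.0.0.2:5001,10.0.0.2:5002

    # conf/neo4j.properties on machine 2 (database instance, slave)
    ha.server_id=2
    ha.initial_hosts=10.0.0.1:5001,10.0.0.2:5001,10.0.0.2:5002

    # The arbiter also runs on machine 2 with its own ha.server_id (e.g. 3) and
    # cluster port (5002 above); it joins the cluster but stores no data.

    # conf/neo4j-server.properties on both database machines
    org.neo4j.server.database.mode=HA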
Backup
Good things:
Simple. Just take backups of the database.
Consistency check. When a backup is made, the tool runs a consistency check to verify that the database is not damaged. There is no possibility that the backup will corrupt the live database; if there are any issues, you will be notified via the backup utility's logs. See below for details on how the backup is performed.
Database. A Neo4j backup is a fully functional database. You can spin up a server that points to the backup database and do anything you want with it.
Incremental backups. You can take incremental backups as often as you want.
Notes:
Neo4j scales vertically very well (depending on the size of the database). It can handle a huge load on a single instance (we had up to 3k requests/second on a medium machine). So you can get one bigger machine for the Neo4j server and another, smaller (cheaper) one for backups.
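To illustrate the point that a backup is a fully functional database, here is a sketch of verifying one by pointing a throwaway Neo4j 2.x server at it (paths and the config key follow the 2.x docs; adjust for your version):

    # Copy the backup to where a scratch Neo4j installation can see it.
    cp -r /var/backups/neo4j/graph.db /opt/neo4j-verify/data/graph.db

    # Point the scratch server at that store in conf/neo4j-server.properties:
    #   org.neo4j.server.database.location=data/graph.db

    # Start it and check the data with the browser or Cypher queries.
    /opt/neo4j-verify/bin/neo4j start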
How is the backup performed?
One thing to keep in mind: the live database stays fully operational. The backup utility doesn't stop or block any operations.
When a transaction is committed, all its changes are appended to the transaction log.
When no previous backup is present: copy the whole store.
When there is a previous backup AND transaction logs are available: copy the new transaction logs and replay them onto the backed-up store.
When there is a previous backup AND the transaction logs are NOT available: discard the existing backup and copy the whole store again.
Why might transaction logs not be available? Your configuration may keep only the most recent transaction logs (e.g. 1 hour's worth), or none at all.
Relevant settings:
keep_logical_logs
logical_log_rotation_threshold
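A hedged example of how those settings might look in conf/neo4j.properties (the values are illustrative only):

    # conf/neo4j.properties
    # Keep enough transaction logs for incremental backups to stay incremental;
    # accepts values such as "true", "7 days" or "500M size".
    keep_logical_logs=7 days

    # Rotate the active transaction log once it reaches this size.
    logical_log_rotation_threshold=250M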
Other
Anyway, you should consider making backups even in a clustered environment. Everything can fail, at any moment.
In general, everything depends on your load and database size.
If your database is small enough to fit fully in memory and one machine is enough to handle all the load, then one Neo4j instance will be enough. Just take backups.
If you want better scalability/availability and a real-time working replica, then the cluster setup is the best choice.
I am building an app that is fast moving into production and I am concerned about the possibility that due to hacking, some silly personal error (like running rake db:schema:load or rake db:rollback) or other circumstance we may suffer data loss in one database table or even across the system.
While I don't find it likely that the above will happen, I would be remiss in not being prepared in case it ever does.
I am using Heroku's PG Backups (which is to be replaced with something else this month), and I also run automated daily backups to S3: http://trevorturk.com/2010/04/14/automated-heroku-backups/, successfully generating .dump files.
What is the correct way to deal with data loss on a production app?
How would I restore the .dump file in case I need to? Can I do a selective restore if a small part of the system is hit?
If a selective restore is not possible: assume one table loses data 4 hours after the last backup. Would fixing that table require rolling back 4 hours of users' activity? Is there a good solution to this?
What is the best way to support users through the inconvenience if something like this happens?
A full DR (disaster recovery) solution requires the following:
Multisite. If a fire, flood, Osama Bin Laden or what have you strikes the Amazon (or is it Salesforce?) data center that Heroku uses, you want to be sure that your data is safe elsewhere.
On-going replication of the data to a separate site (or sites). That means that every transaction that's written to your database on one site, is replicated within seconds to the mirror database on the other site. Most RDBMS's have mechanisms to let you do a master-slave replication like that.
The same goes for anything you put on a filesystem outside of the database, such as images, XML configuration files etc. S3 is a good solution here - they replicate everything to multiple data centers for you.
It won't hurt to create periodic (daily or so) dumps of the database and store them separately, e.g. on S3 (a sketch of such a job follows this list). This helps you recover from data corruption that propagates to the slave DBs.
Automate the process of data recovery. You want this to just work when you need it.
Test everything. Ideally, you want to automate the test process and run it periodically to ensure that your backups can restore. Netflix Chaos Monkey is an extreme example of this.
I'm not sure how you'd implement all this on Heroku. A complete solution is still priced out of reach for most companies - we're running this across our own data centers (one in the US, one in EU) and it costs many millions. Work according to the 80-20 rule - on-going backup to a separate site, plus a well tested recovery plan (continuously test your ability to recover from backups) covers 80% of what you need.
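To illustrate the periodic-dump point from the list above, a minimal sketch of a daily dump shipped to S3, assuming a conventional Postgres host and the AWS CLI (on Heroku itself you would lean on its scheduler and pg:backups instead; database and bucket names are invented):

    #!/usr/bin/env bash
    # Daily logical backup pushed off-site; run from cron on a DB or worker host.
    set -euo pipefail

    STAMP=$(date +%Y%m%d)
    DUMP=/tmp/myapp-$STAMP.dump

    # Custom-format dump: compressed, and restorable table-by-table with pg_restore.
    pg_dump -Fc -d myapp_production -f "$DUMP"

    # Ship it to a separate site/account, then clean up locally.
    aws s3 cp "$DUMP" "s3://myapp-db-backups/$STAMP/"
    rm -f "$DUMP"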
As for supporting users, the best solution is simply to communicate timely and truthfully when trouble happens and make sure you don't lose any data. If your users are paying for your service (i.e. you're not ad-supported), then you should probably have an SLA in place.
About backups: you can never be 100 percent sure that no data will be lost, so the best thing is to test your backups on another server. You should have at least two types of backup:
A database backup, like pg_dump. A dump is just SQL commands, so you can use it to recreate the whole database, just a table, or just a few rows. You lose the data added since the dump was taken.
A code backup, for example a git repository.
In addition to Hartator's answer:
use replication if your DB offers it, e.g. at least master/slave replication with one slave
do database backups on a slave DB server and store them externally (e.g. scp or rsync them out of your server)
use a good version control system for your source code, e.g. Git
use a solid deploy mechanism, such as Capistrano and write your custom tasks, so nobody needs to do DB migrations by hand
have somebody you trust check your firewall setup and the security of your system in general
The DB dumps contain SQL commands to recreate all tables and all data. If you want to restore only one table, you can extract that portion from a copy of the dump file, (very carefully) edit it, and then restore just that table from the modified dump file.
Always restore to an independent machine first and check that the data looks right. For example, you could use one slave server, take it offline, restore there locally and check the data. It is good to have two slaves in your system; then, while you restore to the second slave, the remaining system still has one master and one slave.
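If the dump is a custom-format archive (which pg_dump -Fc and Heroku's PG Backups produce), you don't even have to hand-edit it: pg_restore can pull out a single table. A hedged sketch, with the database and table names invented for illustration:

    # Restore just one table into a scratch database first, to inspect it safely.
    createdb restore_check
    pg_restore --table=orders -d restore_check myapp-20160128.dump

    # Once verified, load the same table's rows into production (use --data-only
    # if the table structure still exists there, --clean to drop and recreate it).
    pg_restore --table=orders --data-only -d myapp_production myapp-20160128.dump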
To simulate a fairly simple "total disaster recovery" on Heroku, create another Heroku project and replicate your production application completely (except use a different custom domain name).
You can add multiple remote git targets to a single git repository so you can use your current production code base. You can push your database backups to the replicated project, and then you should be good to go.
The only step missing from this exercise versus a real disaster recovery is assigning your production domain to the replicated Heroku project.
If you can afford to run two copies of your application in parallel, you could automate this exercise and have it replicate itself on a regular basis (e.g. hourly, daily) based on your data loss tolerance.
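A rough sketch of the moving parts; Heroku's backup subcommands have been renamed over the years, so treat the exact CLI invocations as illustrative, and myapp/myapp-dr are hypothetical app names:

    # One repository, two Heroku remotes: production and the DR copy.
    git remote add production https://git.heroku.com/myapp.git
    git remote add dr https://git.heroku.com/myapp-dr.git
    git push dr master          # deploy the same code base to the replica app

    # Load the latest production backup into the replica's database.
    # (Older toolbelts used `heroku pgbackups:...`; newer CLIs use `heroku pg:backups`.)
    BACKUP_URL=$(heroku pg:backups:url --app myapp)
    heroku pg:backups:restore "$BACKUP_URL" DATABASE_URL --app myapp-dr --confirm myapp-dr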
I have been put in charge of looking at setting up a build server for our office. We currently put all queries into stored procedures on our SQL Server 2000 instance. This is done manually, and no SQL files are produced or put into SVN.
What I am after is a good way of dealing with having a build server that can get all the stored procs from a DB.
I am guessing this might not be possible or practical, and I'm pretty sure it's not best practice. I realize one solution could be to start creating SQL script files and putting them into SVN so they can be picked up and dealt with.
You have answered your own question. Get these things into source control before you start digging yourself further into a hole you really don't want to be in.
Once that's done, an approach we have used successfully is to have an initial snapshot set of scripts, then version-numbered script folders for changes, with the overall database version number stored in a database table specifically for that purpose. We then wrote a utility to assemble all the update scripts newer than the stored version number, run them, and update the version number. This was integrated with our build script, which an automated build ran against the dev DB. Schedules and so on are of course up to you.
I would strongly advise you to make all DB scripts safely repeatable (idempotent) as well.
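As a sketch of the build-server side of that approach, assuming a SchemaVersion table and script folders named 001, 002, ... checked out from SVN (all names are invented, and the sqlcmd/osql invocation depends on your SQL Server tooling):

    #!/usr/bin/env bash
    # Apply all update-script folders newer than the version recorded in the DB.
    set -euo pipefail

    SQL="sqlcmd -S buildsql -d MyAppDb -b"   # or osql for SQL Server 2000-era tools

    # Read the current schema version from the database (a single integer).
    current=$($SQL -h -1 -Q "SET NOCOUNT ON; SELECT MAX(Version) FROM SchemaVersion" | tr -d '[:space:]')

    for dir in scripts/[0-9][0-9][0-9]; do
        ver=$((10#$(basename "$dir")))
        [ "$ver" -le "$current" ] && continue
        for f in "$dir"/*.sql; do
            echo "Applying $f"
            $SQL -i "$f"
        done
        $SQL -Q "INSERT INTO SchemaVersion (Version, AppliedAt) VALUES ($ver, GETDATE())"
    done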