Ref : https://neo4j.com/docs/operations-manual/current/clustering/high-availability/haproxy/
I try to configure HAProxy as said above. I try writing to master and read from slaves, as suggested. When the read is happening the salves is not up-to date with the master, resulting in discrepancy(after some time lag it is updated). How to ensure that data is in sync with master before reading?
First thing make sure you have configured neo4j.conf correctly. ha.tx_push_factor determines to how many slaves a transaction should be pushed to synchronously. When setting this to ha.tx_push_factor=-1 you have immediate full consistency.
If inconsistent more, then read this link for consistency. There they are telling that, splitting should be based on consistency, not read vs write
Related
my C Drive size is growing and my server is not running any thing but neo4j.
even though i configured neo4j to store database information on some other drive.
node count might be irrelevant but for the record, i have almost 10 million nodes and traffic to database about 200 request / minute.
is there any thing else written by neo4j that i should be aware of?
dbms.directories.data=E:/MyNeoDB4/
dbms.directories.logs=E:/MyNeoDb4
dbms.jvm.additional=-Dunsupported.dbms.udc.source=zip
dbms.memory.heap.initial_size=15
dbms.memory.heap.max_size=15G
dbms.security.procedures.unrestricted=apoc.*
dbms.memory.pagecache.size=8G
Update 1:
things i have checked already:
my debug log is being written some where other than Drive C
metrics.enabled=false
Update 2:
- as #InverseFalcon said i also checked transaction logs in the first step. they were being written in some other directory.
(Note: Answer was written before original question was updated to say that neither metrics nor logs were the likely culprits)
Logs, and possibly metrics
I'm not sure what your logging needs have been like, but a major source of disk consumption that is not the data itself is the writing of log files. They typically do not grow extremely quickly, but it totally depends on your set up.
I suspect that your drive may be filling up with logs, although I am surprised it's filling up so quickly. I would check out your log files and see if they are full of long chains of exceptions.
It could also be metrics being exported to CSV on the local disk, although I do not believe that Neo4J will do that without being explicitly configured to do so.
More info on metrics is at the official docs:
https://neo4j.com/docs/operations-manual/current/monitoring/metrics/
A variant on Rebecca Nelson's answer, you might want to check for transaction log files.
Transaction logs are the source of truth for changes made to a database, and they are not the same kinds of logs as the readable log files (debug.log, neo4j.log) that live in the logs folder.
You can find transaction logs in your graph.db folder (or whatever name you've given to your graph database folder) using the naming pattern neostore.transaction.db.0 (with incremental numbering of the log files starting with 0).
Transaction logs are a stage of data persistence. Transactions affecting the database first write to these logs. When criteria are met, a checkpoint operation occurs which flushes the contents of the transaction logs to the datastore files (some of the other files in the graph.db folder) and the transaction logs are pruned and/or rotated.
While you should not modify or delete transaction log files yourself, you can add configuration parameters in neo4j.conf to control how these files are handled.
Here are the docs dealing with transaction logs.
The situation right now:
Every Monday morning I manually check Jenkins jobs jUnit results that ran over the weekend, using Project Health plugin I can filter on the timeboxed runs. I then copy paste this table into Excel and go over each test case's output log to see what failed and note down the failure cause. Every weekend has another tab in Excel. All this makes tracability a nightmare and causes time consuming manual labor.
What I am looking for (and hoping that already exists to some degree):
A database that stores all failed tests for all jobs I specify. It parses the output log of a failed test case and based on some regex applies a 'tag' e.g. 'Audio' if a test regarding audio is failing. Since everything is in a database I could make or use a frontend that can apply filters at will.
For example, if I want to see all tests regarding audio failing over the weekend (over multiple jobs and multiple runs) I could run a query that returns all entries with the Audio tag.
I'm OK with manually tagging failed tests and the cause, as well as writing my own frontend, is there a way (Jenkins API perhaps?) to grab the failed tests (jUnit format and Jenkins plugin) and create such a system myself if it does not exist?
A good question. Unfortunately, it is very difficult in Jenkins to get such "meta statistics" that spans several jobs. There is no existing solution for that.
Basically, I see two options for getting what you want:
Post-processing Jenkins-internal data to get the statistics that you need.
Feeding a database on-the-fly with build execution data.
The first option basically means automating the tasks that you do manually right now.
you can use external scripting (Python, Perl,...) to process Jenkins-internal data (via REST or CLI APIs, or directly reading on-disk data)
or you run Groovy scripts internally (which will be faster and more powerful)
It's the most direct way to go. However, depending on the statistics that you need and depending on your requirements regarding data persistance , you may want to go for...
The second option: more flexible and completely decoupled from Jenkins' internal data storage. You could implement it by
introducing a Groovy post-build step for all your jobs
that script parses job results and puts data of interest in a custom, external database
Statistics you'd get from querying that database.
Typically, you'd start with the first option. Once requirements grow, you'd slowly migrate to the second one (e.g., by collecting internal data via explicit post-processing scripts, putting that into a database, and then running queries on it). You'll want to cut this migration phase as short as possible, as it eventually requires the effort of implementing both options.
You may want to have a look at couchdb-statistics. It is far from a perfect fit, but at least seems to do partially what you want to achieve.
I am planning to configure some sort of 2 node replication for neo4j, similar to mysql replication. Since I am a little constrained on resources I don't want to pay for more than two Cloud compute instances. Also I am happy with just one real time or near real time copy of the neo4j database. So the approach i can think of is:
Configure HA on the two compute nodes with the help of an arbiter instance. Setup one neo4j instance (master) on first node and another neo4j instance (slave) + another neo4j instance (arbiter, only for arbitration, no data logging) instance on second node.
OR
Setup a cron for online backup using the neo4j-backup tool. Setup incremental backups every hour or so. Not sure the load it may put on the prod server, planning to test that out.
I am more inclined on the first approach since I get a more real time copy the database (I also get HA/load balancing with instant failover but that is not a priority right now).
Please let me know
which of the two approach is better,
if there is another way to achieve the same or
if any of the above approaches are not suitable or have some flaws.
I am a little new to Neo4j HA so please pardon me for my ignorance. Thanks !
So. You already mentioned available solutions.
TL;DR; I prefer first option.
Cluster
In general, recommended layout is 3 nodes (2 slaves + 1 master).
But your layout - 2 nodes (1 master + 1 slave + 1 arbiter) is viable too. Especially if one server can handle your workload.
Good things:
Almost "real-time" replica.
Possibility to utilise resources to handle bigger workload.
Better availability.
Notes:
If you have 10mb/sec write load on master, then same load will be applied on slave node. This shouldn't affect reads from slave at all (except write load is REALLY huge).
Maintenance costs are bigger, then single-instance installation. You should plan how to handle cluster upgrades, configuration updates, plugin updates.
Branched data. In clustered environment there is possibility to end up in "split-brain" scenario, when 2 nodes have different data and decision should be made which data should be kept. Neo4j handles such cases quite good. But you should keep in mind that small data-loss can occur in VERY RARE scenarios.
Backup
Good things:
Simple. Just do backups from database.
Consistency check. When backup is made, tool runs consistency check to verify if database is not damaged. There is no possibility that Backup will screw up live database. If there any issues - you will be notified via logs from backup utility. See below detailed info on to how backup is performed.
Database. Neo4j backup is fully-functional database. You can spin-up server that points to backup database, and do everything you wan't.
Incremental backups. You can do incremental backups as often, as you wan't.
Notes:
Neo4j scales vertically very well (depends on size of database). It can handle huge load on single instance (we had up to 3k requests/second on medium machine). So, you can get one bigger machine for Neo4j server and other smaller (cheaper) for backups.
How backup is performed?
One thing that should be kept in mind - live database is still fully operational. Backup utility doesn't not stop or prevent any actions.
When transaction in database is committed, all changes are appended to transaction log.
When there are no previous backup present: copy whole storage.
When there is previous backup AND transaction logs are available: copy new transaction logs and replay them on to storage.
When there is previous backup AND transactions are NOT available: discard existing storage, copy existing storage.
Why transaction logs can not be available? Your configuration may say to keep only latest transaction logs (i.e. 1 hour), or not to keep at all.
Relevant settings:
keep_logical_logs
logical_log_rotation_threshold
Other
Anyway, you should consider making backups event in clustered environment. Everything can fail, in any moment.
In general - everything depends on your load and database size.
If your database is small enough to fully fit in memory and one machine is enough to handle all load, then one Neo4j instance will be enough. Just do backup.
If you wan't better scalability/availability and real-time working replica, then cluster setup is best choice.
I am currently setting up a master-slave app using Ruby on Rails. I am planning to use data-fabric or octopus gem for handling the read/write connections.
This is my first time setting up master-slave DBs. I am confused over the various open source tools available to implement the postgresql replication e.g. pgpool II, pgcluster, Bucardo and Hot Standby/Streaming Replication (built in feature in postgresql 9.1)
My requirements are
fault tolerance(high availability and no data loss on failover)
load balancing
Thanks in advance
Note: I have gone through the stackoverflow post regarding postgresql replication but they are pretty old and not helping to conclude on which tool I should go with.
In your case, streaming replication is the place to start. It is not very flexible but it does what you need regarding database reads as long as you don't need to replicate between major versions.
Database Replication 101
Database replication is a way to ensure that data saved to a specific server becomes stored in a number of other servers. This is often done to better utilize more limited network connections, ensure fault tolerance (so there is essentially a hot back-up), ensure that read-only queries can be distributed over a larger number of databases, etc. This all must be done without sacrificing the the basic guarantees of ACID.
There are a number of different overlapping ways to categorize replication solutions. These include:
Page or file-level vs row-level vs statement-level
Synchronous vs Asynchronous
Master-slave vs Multi-Master
In general understanding replication and the tradeoffs between solutions requires relatively strong understanding of database mechanics and ACID guarantees. I will assume you are relatively familiar with storage mechanics, and deterministic vs non-deterministic operations and the like.
What is Being Replicated? File changes (Physical) vs Row Changes (Logical) vs Statements
The simplest approach is to replicate block changes to files, for example as stored in the write-ahead log in PostgreSQL. This replicates changes at the page level and it requires identical file formats. This means you cannot replicate across major versions, CPU architectures, or operating systems. Anything that could affect the alignment of tuples, for example, will cause the replication to either fail or, worse, corrupt the slave's database. This is the approach streaming replication uses. It is simple to set up, and it always replicates everything in the database cluster.
Additionally this approach means you can easily guarantee that the master and slave databases are identical down to the file level. Because of the fact that the PostgreSQL WAL is cluster-global it is unlikely that this approach will ever replicate anything short of the entire database cluster.
As a description of how this works, suppose I:
UPDATE my_table SET rand_value = random() WHERE id > 10000;
In this case, this changes a bunch of data pages and the file operations are replicated to the replicas. The files remain identical between the master and slave.
Another approach, one taken by Slony, Bucardo, and others is to replicate rows in a logical manner. In this approach, changed rows are flagged and logged, and the changes sent to the replicas. The replicas re-run row operations from the master database. Because these are add-on tools which do not replicate file operations but rather logical database operations, they can replicate across CPU architectures, operating systems, etc. Also they are usually designed so that you can replicate some but not all tables in a database, allowing for a lot of flexibility. On the other hand this leads to a lot of potential for errors. "Oops, that table was not replicated" is a real problem.
In this case when I run the update statement above, a trigger is fired capturing the actual rows inserted and deleted and these are logged, replicated, and the row operations re-run. Because this happens after rand() is run, the databases are logically, but not necessarily physically identical.
A final approach is statement replication. In this case we replicate statements and re-run the statements on the replicas. Some configurations of PgPool will do this. In this case, you cannot ensure that a database is logically equivalent to its replica if any non-deterministic functions are run. In the statement above, the statement itself will run on each replica, ensuring different pseudorandom numbers in the relevant column.
Synchronous vs Asynchronous
This distinction is important to understand regarding failover guarantees. In an asynchronous replication system, the updates are queued and transferred when possible to the replicas and re-run there. In a synchronous replication system the database which accepts the write will not return a successful commit until at least a certain number of replica databases report a successful commit.
Asynchronous replication is generally more robust and produces better availability than synchronous replication. This is because synchronous replication introduces additional points of failure. If you have one master and one slave, then if either system goes down, your database becomes unavailable at least for write operations.
The tradeoff though is that synchronous replication offers a guarantee that data which is committed is in fact available on replicas in the event that the master, say, suffers catastrophic hardware failure immediately following commit. This is a very low probability event, but in some cases it is important that you know the data is still available. In short this provides additional durability guarantees not present in async replication.
Multi-Master vs Master-Slave
Most replication systems are master-slave. In this case, all writes begin at one node and are replicated to other nodes. Writes may only begin at one node. They may not begin at other nodes. This makes replication straight-forward because we know that the slaves represent a past state of the master.
Multi-master replication allows writes to occur to more than one node. In an asynchronous replication system, this leads to the problem of conflict resolution. These problems are actually worse than most assume when you add DDL statements. Suppose two different users run the above update statement on two different masters. We will now have a set of records that have to be replicated across but they will conflict.
Multi-master replication typically requires that people think through this conflict resolution process quite carefully. It is never a process that just works out of the box. Often times you write your own conflict resolution routines. For this reason I typically recommend avoiding multi-master replication unless you really need it.
I am building an app that is fast moving into production and I am concerned about the possibility that due to hacking, some silly personal error (like running rake db:schema:load or rake db:rollback) or other circumstance we may suffer data loss in one database table or even across the system.
While I don't find it likely that the above will happen, I would be remiss in not being prepared in case it ever does.
I am using Heroku's PG Backups (which is to be replaced with something else this month), and I also run automated daily backups to S3: http://trevorturk.com/2010/04/14/automated-heroku-backups/, successfully generating .dump files.
What is the correct way to deal with data loss on a production app?
How would I restore the .dump file in case I need to? Can I do a selective restore if a small part of the system is hit?
In case a selective restore is not possible: assume one table loses data 4 hours after the last backup. Result => would fixing the lost table require rolling back 4 hours of users' activity? Any good solution to this?
What is the best way to support users through the inconvenience if something like this happens?
A full DR (disaster recovery) solution requires the following:
Multisite. If a fire, flood, Osama Bin Laden or whathaveyou strikes the Amazon (or is it Salesforce?) data center that Heroku uses, you want to be sure that your data is safe elsewhere.
On-going replication of the data to a separate site (or sites). That means that every transaction that's written to your database on one site, is replicated within seconds to the mirror database on the other site. Most RDBMS's have mechanisms to let you do a master-slave replication like that.
The same goes for anything you put on a filesystem outside of the database, such as images, XML configuration files etc. S3 is a good solution here - they replicate everything to multiple data centers for you.
I won't hurt to create periodic (daily or so) dumps of the database and store them separately (e.g. on S3). This helps you recover from data corruption that propagates to the slave DBs.
Automate the process of data recovery. You want this to just work when you need it.
Test everything. Ideally, you want to automate the test process and run it periodically to ensure that your backups can restore. Netflix Chaos Monkey is an extreme example of this.
I'm not sure how you'd implement all this on Heroku. A complete solution is still priced out of reach for most companies - we're running this across our own data centers (one in the US, one in EU) and it costs many millions. Work according to the 80-20 rule - on-going backup to a separate site, plus a well tested recovery plan (continuously test your ability to recover from backups) covers 80% of what you need.
As for supporting users, the best solution is simply to communicate timely and truthfully when trouble happens and make sure you don't lose any data. If your users are paying for your service (i.e. you're not ad-supported), then you should probably have an SLA in place.
About backups, you cannot be sure at 100 percent every time that no data will be lost. The best is to test it on another server. You must have at leat two types of backup :
A database backup, like pg-dump. A dump is uniquely SQL commands so you can use it to recreate the whole database, just a table, or just a few rows. You loose the data added in the meantime.
A code backup, for example a git repository.
in addition to Hartator's answer:
use replication if your DB offers it, e.g. at least master/slave replication with one slave
do database backups on a slave DB server and store them externally (e.g. scp or rsync them out of your server)
use a good version control system for your source code, e.g. Git
use a solid deploy mechanism, such as Capistrano and write your custom tasks, so nobody needs to do DB migrations by hand
have somebody you trust check your firewall setup and the security of your system in general
The DB-Dumps contain SQL-commands to recreate all tables and all data... if you were to restore only one table, you could extract that portion from a copy of the dump file and (very carefully) edit it and then restore with the modified dump file (for one table).
Always restore first to an independent machine and check if the data looks right. e.g. you could use one Slave server, take if offline, then restore there locally and check the data. Good if you have two slaves in your system, then the remaining system has still one master and one slave while you restore to the second slave.
To simulate a fairly simple "total disaster recovery" on Heroku, create another Heroku project and replicate your production application completely (except use a different custom domain name).
You can add multiple remote git targets to a single git repository so you can use your current production code base. You can push your database backups to the replicated project, and then you should be good to go.
The only step missing from this exercise verses a real disaster recovery is assigning your production domain to the replicated Heroku project.
If you can afford to run two copies of your application in parallel, you could automate this exercise and have it replicate itself on a regular basis (e.g. hourly, daily) based on your data loss tolerance.