Neo4j is storing arbitrary files on drive C?

My C drive is filling up, and the server is not running anything but Neo4j,
even though I configured Neo4j to store database information on another drive.
Node count might be irrelevant, but for the record, I have almost 10 million nodes, and traffic to the database is about 200 requests per minute.
Is there anything else written by Neo4j that I should be aware of?
dbms.directories.data=E:/MyNeoDB4/
dbms.directories.logs=E:/MyNeoDb4
dbms.jvm.additional=-Dunsupported.dbms.udc.source=zip
dbms.memory.heap.initial_size=15
dbms.memory.heap.max_size=15G
dbms.security.procedures.unrestricted=apoc.*
dbms.memory.pagecache.size=8G
Update 1:
Things I have checked already:
My debug log is being written somewhere other than drive C.
metrics.enabled=false
Update 2:
- As @InverseFalcon said, I also checked the transaction logs as a first step. They were being written to another directory.

(Note: Answer was written before original question was updated to say that neither metrics nor logs were the likely culprits)
Logs, and possibly metrics
I'm not sure what your logging needs have been like, but a major source of disk consumption other than the data itself is the writing of log files. They typically do not grow extremely quickly, but that depends entirely on your setup.
I suspect that your drive may be filling up with logs, although I am surprised it's filling up so quickly. I would check your log files and see if they are full of long chains of exceptions.
It could also be metrics being exported to CSV on the local disk, although I do not believe that Neo4j will do that without being explicitly configured to do so.
More info on metrics is at the official docs:
https://neo4j.com/docs/operations-manual/current/monitoring/metrics/
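One way to narrow down what is actually growing, independent of Neo4j, is to measure which top-level directories on the drive hold the most data and compare the numbers over time. The following is only a minimal Python sketch of that idea; the function names `dir_sizes` and `largest` are made up for illustration and are not part of any Neo4j tooling.

```python
import os
from pathlib import Path


def dir_sizes(root):
    """Return {immediate child directory: total bytes of the files beneath it}."""
    sizes = {}
    for child in Path(root).iterdir():
        if not child.is_dir():
            continue
        total = 0
        # onerror swallows directories we are not allowed to list.
        for dirpath, _dirnames, filenames in os.walk(child, onerror=lambda e: None):
            for name in filenames:
                try:
                    total += os.path.getsize(os.path.join(dirpath, name))
                except OSError:
                    # Files can vanish or be locked while we scan; skip them.
                    pass
        sizes[str(child)] = total
    return sizes


def largest(root, n=10):
    """Return the n biggest (path, bytes) pairs under root, largest first."""
    ranked = sorted(dir_sizes(root).items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```

Calling `largest("C:/")` a few hours apart and comparing the results should show which directory is growing, which is usually enough to tell whether the files belong to Neo4j, the JVM, or something else entirely.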

As a variant on Rebecca Nelson's answer, you might want to check for transaction log files.
Transaction logs are the source of truth for changes made to a database, and they are not the same kinds of logs as the readable log files (debug.log, neo4j.log) that live in the logs folder.
You can find transaction logs in your graph.db folder (or whatever name you've given to your graph database folder) using the naming pattern neostore.transaction.db.0 (with incremental numbering of the log files starting with 0).
Transaction logs are a stage of data persistence. Transactions affecting the database first write to these logs. When criteria are met, a checkpoint operation occurs which flushes the contents of the transaction logs to the datastore files (some of the other files in the graph.db folder) and the transaction logs are pruned and/or rotated.
While you should not modify or delete transaction log files yourself, you can add configuration parameters in neo4j.conf to control how these files are handled.
Here are the docs dealing with transaction logs.
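To see how much space the transaction logs take, you can list the files that match the naming pattern mentioned above and add up their sizes. This is a rough Python sketch rather than an official tool: it assumes the `neostore.transaction.db.<number>` naming described in this answer, and the helper names are hypothetical.

```python
import re
from pathlib import Path

# Naming pattern described in the answer: neostore.transaction.db.<number>
TX_LOG_NAME = re.compile(r"^neostore\.transaction\.db\.(\d+)$")


def transaction_logs(db_dir):
    """Return [(log number, file name, size in bytes)] sorted by log number."""
    found = []
    for entry in Path(db_dir).iterdir():
        match = TX_LOG_NAME.match(entry.name)
        if match and entry.is_file():
            found.append((int(match.group(1)), entry.name, entry.stat().st_size))
    return sorted(found)


def total_bytes(logs):
    """Sum the sizes of the logs returned by transaction_logs()."""
    return sum(size for _number, _name, size in logs)
```

Pointing `transaction_logs` at your graph database folder gives a read-only overview; the script only inspects the files and never modifies or deletes them.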

Related

CDC log retention for Informix

Current situation:
We use IBM Data Replication (11.4) to replicate data from an Informix database to a SQL Server database.
We now have an instance with 45 different subscriptions. On the Informix side, we have 30 different log files.
The problem:
When we want to “refresh” all subscriptions at once, we run into trouble because some logs are no longer available: they have been overwritten.
The problem is that these logs were not 100% full, but only approximately 0.5% full.
I don't know exactly when a new log is created.
Is there any way to change the settings that determine when a new log file is created, for example so that a new log file is only created when the current one is 100% full? Or do you have another solution to this problem altogether?
We have found the problem:
The parameter “log_api_switch_log_num_pages” has to be defined. It describes log switching after a refresh.
See details here:
http://www-01.ibm.com/support/docview.wss?uid=swg21997830

mnesia files damaged need to forensically dump everything

I have damaged my Mnesia database beyond repair as a result of overestimating the robustness of the implementation. When I try the Mnesia API, the records I need are not visible even though their keys are visible in the file. Even though the documentation indicates that Mnesia artifacts are DETS files, they cannot be opened with or identified as DETS artifacts. PS: dump_to_textfile() does not work either.
Eventually I was able to dump my DB. It did not end my Mnesia problems but it gave me options I did not have before.
SETUP:
Originally I had implemented a master-master Mnesia cluster (read the docs). It turns out that not even the most seasoned Erlang programmers use Mnesia replication, as there are too many flaws. In fact I got this information from the Erlang inner circle and a few L1 teams too. In my case, however, the work was already in production. And that's when the problems started.
We started getting DB consistency errors and, my favorite, network or DB partition errors. Recovery takes a very highly skilled and knowledgeable individual, as well as a lot of planning and code in advance, which I did not have.
Ultimately I took two steps. (a) I removed the second app, so that even though the DB was in a master-master cluster, one node was effectively a slave because it was never used as a master. (b) In a second implementation I split the cluster so that the app ran on a single node with a single DB. Setup (a) was in production and setup (b) was the warm standby. Replication was manual, as writes were very rare.
In the single-node deployment there are two Erlang nodes. The first is the application node, app@ks; on the same hardware there was an "erl" node that I used when I needed to rpc into the app and see how things were going.
MY SOLUTION:
When I posted this question I was trying to dump the contents of my Mnesia DB. I was having a number of problems because I was trying to access the DB from the admin node while the application node was operational.
Because I was calling the mnesia library from the erl node, the DB was not LOCAL to that node, and so dump_to_textfile produced an empty file. I eventually had success when I used rpc to tell the app@ks node to do the dump.
STILL UNDEFINED
When I launched the admin node I set the mnesia dir parameter to the same folder as the app@ks node. I have a vague memory that this is undesirable.
There are many more Mnesia issues to solve but none that refer to the problem I reported. But I still do not know how to extract the raw data from the various DB files.

setting up Neo4j replication on two instances

I am planning to configure some sort of two-node replication for Neo4j, similar to MySQL replication. Since I am a little constrained on resources, I don't want to pay for more than two cloud compute instances. I am also happy with just one real-time or near-real-time copy of the Neo4j database. The approaches I can think of are:
Configure HA on the two compute nodes with the help of an arbiter instance: set up one Neo4j instance (master) on the first node, and another Neo4j instance (slave) plus an arbiter instance (only for arbitration, no data logging) on the second node.
OR
Set up a cron job for online backup using the neo4j-backup tool, with incremental backups every hour or so. I am not sure how much load this may put on the prod server; I plan to test that.
I am more inclined toward the first approach, since I get a more real-time copy of the database (I also get HA/load balancing with instant failover, but that is not a priority right now).
Please let me know
which of the two approaches is better,
if there is another way to achieve the same, or
if any of the above approaches are unsuitable or have flaws.
I am a little new to Neo4j HA, so please pardon my ignorance. Thanks!
So, you have already mentioned the available solutions.
TL;DR: I prefer the first option.
Cluster
In general, the recommended layout is 3 nodes (1 master + 2 slaves).
But your layout of 2 machines (1 master + 1 slave + 1 arbiter) is viable too, especially if one server can handle your workload.
Good things:
Almost "real-time" replica.
Possibility to utilise resources to handle bigger workload.
Better availability.
Notes:
If you have a 10 MB/sec write load on the master, the same load will be applied to the slave node. This shouldn't affect reads from the slave at all (unless the write load is REALLY huge).
Maintenance costs are higher than for a single-instance installation. You should plan how to handle cluster upgrades, configuration updates, and plugin updates.
Branched data. In a clustered environment it is possible to end up in a "split-brain" scenario, where 2 nodes have different data and a decision has to be made about which data to keep. Neo4j handles such cases quite well, but keep in mind that a small amount of data loss can occur in VERY RARE scenarios.
Backup
Good things:
Simple. Just take backups of the database.
Consistency check. When a backup is made, the tool runs a consistency check to verify that the database is not damaged. There is no possibility that the backup will damage the live database. If there are any issues, you will be notified via the logs of the backup utility. See below for details on how a backup is performed.
Database. A Neo4j backup is a fully functional database. You can spin up a server that points to the backup database and do anything you want with it.
Incremental backups. You can do incremental backups as often as you want.
Notes:
Neo4j scales vertically very well (depending on the size of the database). It can handle a huge load on a single instance (we had up to 3k requests/second on a medium machine). So you can get one bigger machine for the Neo4j server and another, smaller (cheaper) one for backups.
How is a backup performed?
One thing to keep in mind: the live database remains fully operational. The backup utility does not stop or block any operations.
When a transaction in the database is committed, all changes are appended to the transaction log.
When there is no previous backup: copy the whole storage.
When there is a previous backup AND the transaction logs are available: copy the new transaction logs and replay them onto the backup storage.
When there is a previous backup AND the transaction logs are NOT available: discard the existing backup storage and copy the whole storage again.
Why might the transaction logs be unavailable? Your configuration may say to keep only the latest transaction logs (e.g. 1 hour), or not to keep any at all.
Relevant settings:
keep_logical_logs
logical_log_rotation_threshold
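The three cases above boil down to a small decision table. The sketch below only restates that logic in Python to make it explicit; it is not how the backup utility is implemented, and the names `BackupAction` and `choose_backup_action` are invented for this illustration.

```python
from enum import Enum


class BackupAction(Enum):
    FULL_COPY = "copy the whole storage"
    INCREMENTAL = "copy new transaction logs and replay them onto the backup"
    DISCARD_AND_FULL_COPY = "discard the old backup, then copy the whole storage"


def choose_backup_action(previous_backup_exists, tx_logs_available):
    """Map the two conditions described above to the action the backup takes."""
    if not previous_backup_exists:
        # Nothing to build on, so transaction log availability is irrelevant.
        return BackupAction.FULL_COPY
    if tx_logs_available:
        return BackupAction.INCREMENTAL
    # The logs needed to catch up have been pruned: start over.
    return BackupAction.DISCARD_AND_FULL_COPY
```

The practical consequence is that if the log retention setting keeps less history than the interval between your backups, every "incremental" backup silently falls into the last case and becomes a full copy.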
Other
In any case, you should consider making backups even in a clustered environment. Anything can fail at any moment.
In general, everything depends on your load and database size.
If your database is small enough to fit fully in memory and one machine is enough to handle the whole load, then one Neo4j instance will be enough. Just take backups.
If you want better scalability/availability and a real-time working replica, then a cluster setup is the best choice.

How to take a snapshot of neo4j database

I see there is a tool that allows backups to be taken of a running Neo4j database, either via Java or via the backup tool.
The backup will obviously take some time to complete, during which time additional nodes may be added, modified or deleted. Is it possible to take a snapshot of the graph database at a particular instant in time?
My use case: Neo4j is used to store events, which are also stored elsewhere. I'd like to take a snapshot of the graph at an instant in time; then, when it's restored at a later date, I want to know what is missing from the graph based on when the backup was taken, and be able to reconstruct a complete version of the database that is accurate to the present time by adding the missing events.
There's a related question that has good discussion of this, let me cut to the chase.
If you're using the commercial version of neo4j, then neo4j backup options and/or the backup tool are your best options.
If you're using community edition, then you can't do online backups at present. I have several applications that run using neo4j community, and we have a cron job that runs at 03:00. It shuts down the application, and creates a copy of the database in another location (by copy, I mean it actually creates a tar.gz archive of the DB directory). After this is completed with other maintenance, the application gets restarted again.
Depending on file copy performance and DB size, this isn't too bad. We have a moderate sized DB and we simply accept about 10 minutes of downtime every night.
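The archiving step of such a nightly job can be sketched in a few lines of Python. This is an illustration under assumptions, not the script we actually run: the function name, the paths, and the archive naming scheme are hypothetical, and stopping and restarting the application (and Neo4j) is left to the surrounding cron job.

```python
import tarfile
import time
from pathlib import Path


def archive_database(db_dir, backup_dir, timestamp=None):
    """Write <backup_dir>/<db name>-<timestamp>.tar.gz and return its path.

    Assumes the caller has already shut the database down, so that the
    files on disk are in a consistent state while they are copied.
    """
    db_dir = Path(db_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = timestamp or time.strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{db_dir.name}-{stamp}.tar.gz"
    with tarfile.open(str(target), "w:gz") as tar:
        # arcname keeps the archive rooted at the database directory name.
        tar.add(str(db_dir), arcname=db_dir.name)
    return target
```

Because each archive gets its own timestamp, older copies stay available until you prune them, so the backup location needs its own retention policy.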
The neo4j-backup tool is part of Neo4j Enterprise edition. It takes a backup that is consistent as of the time you started it. After the backup has finished, a verbose consistency check is run to validate recoverability. It works either as a full backup or incrementally.
This tool does not support restoring to a given point in time or comparing with other backups. A point-in-time restore can be achieved by combining it with a classic file backup tool. I've had good experience with backup2l: neo4j-backup would be started as part of backup2l's PRE_BACKUP. The same approach should work with any other backup tool out there.
Using your backup tool, you can retrieve the full graph.db directory as of a desired point in time from your archives and use it.

How to prepare for data loss in a production website?

I am building an app that is fast moving into production and I am concerned about the possibility that due to hacking, some silly personal error (like running rake db:schema:load or rake db:rollback) or other circumstance we may suffer data loss in one database table or even across the system.
While I don't find it likely that the above will happen, I would be remiss in not being prepared in case it ever does.
I am using Heroku's PG Backups (which is to be replaced with something else this month), and I also run automated daily backups to S3: http://trevorturk.com/2010/04/14/automated-heroku-backups/, successfully generating .dump files.
What is the correct way to deal with data loss on a production app?
How would I restore the .dump file in case I need to? Can I do a selective restore if a small part of the system is hit?
In case a selective restore is not possible: assume one table loses data 4 hours after the last backup. Result => would fixing the lost table require rolling back 4 hours of users' activity? Any good solution to this?
What is the best way to support users through the inconvenience if something like this happens?
A full DR (disaster recovery) solution requires the following:
Multisite. If a fire, flood, Osama Bin Laden or whathaveyou strikes the Amazon (or is it Salesforce?) data center that Heroku uses, you want to be sure that your data is safe elsewhere.
On-going replication of the data to a separate site (or sites). That means that every transaction that's written to your database on one site, is replicated within seconds to the mirror database on the other site. Most RDBMS's have mechanisms to let you do a master-slave replication like that.
The same goes for anything you put on a filesystem outside of the database, such as images, XML configuration files etc. S3 is a good solution here - they replicate everything to multiple data centers for you.
It won't hurt to create periodic (daily or so) dumps of the database and store them separately (e.g. on S3). This helps you recover from data corruption that propagates to the slave DBs.
Automate the process of data recovery. You want this to just work when you need it.
Test everything. Ideally, you want to automate the test process and run it periodically to ensure that your backups can restore. Netflix Chaos Monkey is an extreme example of this.
I'm not sure how you'd implement all this on Heroku. A complete solution is still priced out of reach for most companies - we're running this across our own data centers (one in the US, one in EU) and it costs many millions. Work according to the 80-20 rule - on-going backup to a separate site, plus a well tested recovery plan (continuously test your ability to recover from backups) covers 80% of what you need.
As for supporting users, the best solution is simply to communicate timely and truthfully when trouble happens and make sure you don't lose any data. If your users are paying for your service (i.e. you're not ad-supported), then you should probably have an SLA in place.
About backups: you can never be 100 percent sure that no data will be lost. The best approach is to test restores on another server. You must have at least two types of backup:
A database backup, such as pg_dump. A dump consists solely of SQL commands, so you can use it to recreate the whole database, just a table, or just a few rows. You lose the data added in the meantime.
A code backup, for example a git repository.
In addition to Hartator's answer:
use replication if your DB offers it, e.g. at least master/slave replication with one slave
do database backups on a slave DB server and store them externally (e.g. scp or rsync them out of your server)
use a good version control system for your source code, e.g. Git
use a solid deploy mechanism, such as Capistrano and write your custom tasks, so nobody needs to do DB migrations by hand
have somebody you trust check your firewall setup and the security of your system in general
The DB dumps contain SQL commands to recreate all tables and all data. If you wanted to restore only one table, you could extract that portion from a copy of the dump file, (very carefully) edit it, and then restore from the modified dump file (for that one table).
Always restore first to an independent machine and check that the data looks right. For example, you could take one slave server offline, restore there locally, and check the data. It helps to have two slaves in your system, so that the remaining system still has one master and one slave while you restore to the second slave.
To simulate a fairly simple "total disaster recovery" on Heroku, create another Heroku project and replicate your production application completely (except use a different custom domain name).
You can add multiple remote git targets to a single git repository so you can use your current production code base. You can push your database backups to the replicated project, and then you should be good to go.
The only step missing from this exercise versus a real disaster recovery is assigning your production domain to the replicated Heroku project.
If you can afford to run two copies of your application in parallel, you could automate this exercise and have it replicate itself on a regular basis (e.g. hourly, daily) based on your data loss tolerance.
