We use Neo4j 3.0.6 Enterprise and are experiencing very slow queries.
What we found out is that the index files have become very large and probably need to be pruned/vacuumed (as PostgreSQL does).
The data we use changes very quickly over time, and the index files contain a lot of stale or marked-as-deleted data.
Is there a way to clean the index files and, if so, how?
Related
I have a very high-traffic Rails app. We use an older version of PostgreSQL as the backend database, which we need to upgrade. We cannot use the data-directory copy method because the format of the data files has changed too much between our existing release and the current PostgreSQL release (10.x at the time of writing). We also cannot use the dump-restore process for the migration because we would either incur downtime of several hours or lose important customer data. Replication is not possible either, as the two DB versions are incompatible for that.
The strategy so far is to have two databases and copy all the data (and functions) from the existing installation to a new one. However, while the copy is happening, we need data arriving at the backend to reach both servers, so that once the data migration is complete, the switch becomes a matter of redeploying the code.
I have figured out the other parts of the puzzle but am unable to determine how to send all writes happening on the Rails app to both DB servers.
I am not bothered if both installations get queried for displaying data to the user (I can discard the data coming out of the new installation); so if it is possible at the driver level, or by adding a line somewhere in ActiveRecord, I am fine with it.
PS: The Rails version is 4.1, and the company is not planning to upgrade it.
You can have multiple databases by adding another environment entry to the database.yml file. After that you can define a separate base class (analogous to ActiveRecord::Base) and connect it to the new environment via establish_connection.
Have a look at this post.
However, as far as I can see, that will not solve your problem. Redirecting new data to the new DB while copying from the old one can lead to data inconsistencies.
For example, the ID of a record can change when there are two data-source feeds.
If you are upgrading the DB, I would recommend defining a scheduled downtime and letting your users know in advance. I would say having a small downtime is far better than fixing inconsistent data down the line.
When you have a downtime:
Let the customers know well in advance
Keep the downtime minimal
Have a backup procedure: in the event the new site takes longer than you think, roll back to the old site.
I see there is a tool that allows backups to be taken of a running Neo4j database, either via Java or via the backup tool.
The backup will obviously take some time to complete, during which time additional nodes may be added, modified or deleted. Is it possible to take a snapshot of the graph database at a particular instant in time?
My use case: Neo4j is used to store events, which are also stored elsewhere. I'd like to take a snapshot of the graph at an instant in time; then, when it is restored at a later date, know what is missing from the graph based on when the backup was taken, and be able to reconstruct a complete version of the database that is accurate to the present time by adding the missing events.
There's a related question that has a good discussion of this, but let me cut to the chase.
If you're using the commercial version of Neo4j, then the Neo4j backup options and/or the backup tool are your best options.
If you're using Community edition, then you can't do online backups at present. I have several applications that run on Neo4j Community, and we have a cron job that runs at 03:00. It shuts down the application and creates a copy of the database in another location (by copy, I mean it actually creates a tar.gz archive of the DB directory). Once this and other maintenance are completed, the application is restarted.
Depending on file copy performance and DB size, this isn't too bad. We have a moderate sized DB and we simply accept about 10 minutes of downtime every night.
The neo4j-backup tool is part of Neo4j Enterprise edition. It takes a consistent backup as of the time you started it. After the backup is finished, a verbose consistency check is run to validate recoverability. It works either as a full backup or incrementally.
The tool does not cover restoring to a given point in time or comparing with other backups. A point-in-time restore can be achieved by combining it with a classic file backup tool; I've had good experience with backup2l. neo4j-backup would be started as part of backup2l's PRE_BACKUP. The same approach should work with any other backup tool out there.
Using your backup tool you can then retrieve the full graph.db directory at the desired point in time from your archives and use it.
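The same backup service is also reachable from Java, which the question alludes to. A minimal sketch, assuming the enterprise backup classes (org.neo4j.backup.OnlineBackup) are on the classpath and the server's online backup service is enabled on its default port; the host and target directory are illustrative:

    import org.neo4j.backup.OnlineBackup;

    public class NightlyBackup {
        public static void main(String[] args) {
            // Talk to the backup service of the running server
            OnlineBackup backup = OnlineBackup.from("localhost");
            // Full backup into a fresh directory; incremental(...) against an
            // existing backup directory only pulls transactions since last run
            backup.full("/mnt/backups/graph.db.backup");
        }
    }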
TokuMX has its benefits, but we are running into issues. We recently migrated to this engine, and in the process our cleanup scripts became useless. We have transient data that we used to clean up every night, then reclaim disk space via db.repairDatabase. However, that command is not supported by TokuMX, and as a result we are not able to reclaim the disk space.
Is there an alternative way?
It sounds like partitioned collections are the right abstraction for your application. Normal collections will suffer from the accumulation of MVCC garbage if you have a pattern of deleting large swaths of old data. With partitioned collections, you can drop a partition and reclaim all the space instantaneously.
I'm using Neo4j on Windows for testing purposes, and I'm working with a DB containing ~2 million relationships and about the same number of nodes. After an ungraceful shutdown of Neo4j while writing a batch of relationships, the DB got corrupted.
It seems like there are some broken nodes/relationships in the DB, and whenever I try to read them I get this error (I'm using py2neo):
Error: NodeImpl#1292315 not found. This can be because someone else deleted this entity while we were trying to read properties from it, or because of concurrent modification of other properties on this entity. The problem should be temporary.
I tried rebooting, but Neo4j fails to recover from this error. I found this question:
Neo4j cannot read certain nodes. Throws NotFoundException. Corrupt database
but the answer there is no good for me because it involves going over the DB and redoing the indexing, and I can't even read those broken nodes/relationships, so I can't fix their index entries (I tried it and got the same error).
In general I've had many stability issues with Neo4j (and on multiple platforms, not just Windows). If no decent solution is found for this problem, I will have to switch to a different database.
Thanks in advance!
I wrote a tool a while ago that lets you copy a broken store while keeping the good records intact.
You might want to check it out. I assume you used the 2.1.x version of Neo4j.
https://github.com/jexp/store-utils/tree/21
For 2.0.x check out:
https://github.com/jexp/store-utils/tree/20
To verify whether your datastore is consistent, follow the steps mentioned in http://www.markhneedham.com/blog/2014/01/22/neo4j-backup-store-copy-and-consistency-check/.
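For reference, the check described in that post boils down to running the consistency checker that ships with the Neo4j distribution against the store directory while the database is offline. A hedged sketch of invoking it from Java, assuming the 2.x distribution jars are on the classpath; the store path is illustrative:

    import org.neo4j.consistency.ConsistencyCheckTool;

    public class CheckStore {
        public static void main(String[] args) throws Exception {
            // Runs the same consistency check that neo4j-backup performs
            // after completing a backup; the database must not be running
            ConsistencyCheckTool.main(new String[] { "/data/graph.db" });
        }
    }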
Are you referring to the batch inserter API when you speak of "writing a batch of relations"?
If so, be aware that the batch inserter API requires a clean shutdown; see the big fat red warning on http://docs.neo4j.org/chunked/stable/batchinsert.html.
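To make that concrete, here is a minimal sketch of 2.x batch-inserter usage with the shutdown guarded by try/finally, so an exception in your own code still leaves the store cleanly closed; the store path, properties, and relationship type are illustrative:

    import org.neo4j.graphdb.DynamicRelationshipType;
    import org.neo4j.helpers.collection.MapUtil;
    import org.neo4j.unsafe.batchinsert.BatchInserter;
    import org.neo4j.unsafe.batchinsert.BatchInserters;

    public class BatchLoad {
        public static void main(String[] args) {
            BatchInserter inserter = BatchInserters.inserter("/data/graph.db");
            try {
                long a = inserter.createNode(MapUtil.map("name", "a"));
                long b = inserter.createNode(MapUtil.map("name", "b"));
                inserter.createRelationship(a, b,
                        DynamicRelationshipType.withName("CONNECTS"), null);
            } finally {
                // The batch inserter bypasses the transaction log, so skipping
                // this shutdown call virtually guarantees a broken store
                inserter.shutdown();
            }
        }
    }

Note that a hard crash of the JVM still corrupts the store; the try/finally only protects against exceptions in your own code.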
Are the broken nodes schema indexed and are you attempting to read them via this indexed label/property? If so, it's possible you may have a broken index following the sudden shutdown.
Assuming this is the case, you could try deleting the schema subdirectory within the graph store directory while the server is not running and let the database rebuild the index on restart. While this isn't an official way to recover from a broken index, it can sometimes work. Obviously, I suggest you back up your store before trying this.
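For illustration, a small sketch of that cleanup using plain JDK file APIs; the store path is hypothetical, the server must be stopped, and you should back up the whole store before running it:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.Comparator;

    public class DropSchemaIndexDir {
        public static void main(String[] args) throws IOException {
            // Hypothetical store location; adjust to your installation
            Path schema = Paths.get("/var/lib/neo4j/data/graph.db/schema");
            // Delete children before their parent directories
            Files.walk(schema)
                 .sorted(Comparator.reverseOrder())
                 .forEach(p -> {
                     try { Files.delete(p); }
                     catch (IOException e) { throw new RuntimeException(e); }
                 });
        }
    }

On the next start, the database should then rebuild the schema indexes from the store, as described above.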
Hi, I am using Neo4j in my application, and my setup is as follows:
I am using the Embedded Graph API.
I have several databases that I point to using a pool that I maintain in my application, e.g. db1, db2, db3, ..., db100.
When I want to access a particular database, I point to it using new EmbeddedGraphDatabase("Path to db(n)").
The problem is that as the connection pool count increases, the RAM consumed by the application keeps increasing, until at some limit it breaks the application.
So I am thinking of migrating from Neo4j to some other database.
Additionally, only a small part of my data makes use of the graph structure.
One way to migrate is to write a script for it. Is there any better option?
My other question is: what is the best database for preserving my structure?
Another option I am considering is keeping part of my data in Neo4j and moving the rest to some other database.
If anything is unclear I can clarify.
Thanks in advance.
An EmbeddedGraphDatabase instance is not the equivalent of a "connection" in SQL. It's designed to run a long time (days, months). Hence starting/stopping is costly.
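To illustrate the intended lifecycle, a minimal 2.x-style sketch: one instance per store, created once at process startup and shut down once via a JVM shutdown hook (the path is illustrative):

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.factory.GraphDatabaseFactory;

    public class Db {
        // One long-lived instance per store, not one per "connection"
        private static final GraphDatabaseService DB =
                new GraphDatabaseFactory().newEmbeddedDatabase("/data/db1");

        static {
            // Guarantee a clean shutdown when the JVM exits
            Runtime.getRuntime().addShutdownHook(new Thread(DB::shutdown));
        }

        public static GraphDatabaseService get() { return DB; }
    }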
What is the use case for having hundreds of separate databases in the same JVM?
Lots of small databases will perform poorly, as the graph DB is designed to hold the whole data model on a single host.
Do you run a single JVM per database?
You can control the amount of memory used by Neo4j by providing the correct properties for memory mapping; you can also use the GCR cache from neo4j-enterprise and tune the cache-size properties.
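A sketch of what that can look like with the embedded database builder; the keys shown are the 1.9/2.x-era memory-mapping and cache settings, and the sizes are illustrative, so verify them against the reference manual for your version:

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.factory.GraphDatabaseFactory;

    public class TunedDb {
        public static void main(String[] args) {
            GraphDatabaseService db = new GraphDatabaseFactory()
                    .newEmbeddedDatabaseBuilder("/data/db1")
                    // Memory-mapping budget per store file (illustrative sizes)
                    .setConfig("neostore.nodestore.db.mapped_memory", "50M")
                    .setConfig("neostore.relationshipstore.db.mapped_memory", "200M")
                    .setConfig("neostore.propertystore.db.mapped_memory", "100M")
                    // GCR cache is enterprise-only; the community default is "soft"
                    .setConfig("cache_type", "gcr")
                    .newGraphDatabase();
            Runtime.getRuntime().addShutdownHook(new Thread(db::shutdown));
        }
    }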
I think it still makes sense to keep the graph part in Neo4j and only move the non-graphy part.