I have a basic neo4j database with 10 nodes (no relationships). I am running embedded mode and starting the server/webadmin as follows:
GraphDatabaseService graphDb;
graphDb = new GraphDatabaseFactory().newEmbeddedDatabaseBuilder("/path/to/data/directory").newGraphDatabase();
WrappingNeoServer srv = new WrappingNeoServer((GraphDatabaseAPI) graphDb);
srv.start();
Once I create the nodes, query performance is fine, but when I restart the server, query performance for basic Cypher queries becomes slow. The following query takes about 1-3 seconds:
MATCH (n) RETURN count(n);
Before a restart (immediately after the nodes are created), this query is less than 100ms.
Here is a link to the data directory I am using: https://drive.google.com/file/d/0B1pENwDgk7SQTFkxU1BGd2poeUU/edit?usp=sharing
I am running version 2.0.1.
What could be causing this slow performance?
The database has to be initialized and memory mapping set up.
Nodes have to be loaded from disk.
The Cypher query parser has to build up its rules first.
Your query has to be parsed, built, and put in the cache.
And much more.
That's why you should never trust the first 100-1000 queries for performance measurements.
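If you want a fair measurement right after a restart, a common trick is to warm things up first with a throwaway query that touches every node and relationship once before you start timing. A minimal sketch:
MATCH (n)
OPTIONAL MATCH (n)-[r]->()
RETURN count(n) + count(r);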
Related
We're loading data into a Neo4j server which mainly represents (almost) k-ary trees, with k between 2 and 10 in most cases. We have about 50 possible node types, and about the same number of relationship types.
The server is online and data can be loaded from several instances (so, unfortunately, we can't use neo4j-import).
We experience very slow loading for about 100,000 nodes and relationships, which takes about 6 minutes on a good machine. Sometimes loading the same data takes 40 minutes! Looking at the neo4j process, it sometimes appears to be doing nothing...
In those cases, we see messages like:
WARN [o.n.k.g.TimeoutGuard] Transaction timeout. (Overtime: 1481 ms).
Apart from this, we don't experience problems with queries, which execute quickly despite very complex structures.
We load data as follows.
A Cypher file is loaded like this:
neo4j-shell -host localhost -v -port 1337 -file myGraph.cypher
The Cypher file contains several sections:
Constraint creation:
CREATE CONSTRAINT ON (p:MyNodeType) ASSERT p.uid IS UNIQUE;
Indexes on only a very small set of node types (10 at most).
We carefully select these to avoid hurting performance.
CREATE INDEX ON :MyNodeType1(uid);
Node creation:
USING PERIODIC COMMIT 4000 LOAD CSV WITH HEADERS FROM "file:////tmp/my.csv" AS csvLine CREATE (p:MyNodeType1 {Prop1: csvLine.prop1, mySupUUID: toInt(csvLine.uidFonctionEnglobante), lineNum: toInt(csvLine.lineNum), uid: toInt(csvLine.uid), name: csvLine.name, projectID: csvLine.projectID, vValue: csvLine.vValue});
Relationship creation:
LOAD CSV WITH HEADERS FROM "file:////tmp/RelsInfixExpression-vLeftOperand-SimpleName_javaouille-normal-b11695.csv" AS csvLine Match (n1:MyNodeType1) Where n1.uid = toInt(csvLine.uidFather) With n1, csvLine Match (n2:MyNodeType2) Where n2.uid = toInt(csvLine.uidSon) MERGE (n1)-[:vOperandLink]-(n2);
Question 1
We sometimes experienced OOM errors in the Neo4j server while loading data, which were difficult to reproduce even with the same data. But since we recently added USING PERIODIC COMMIT 1000 to the relationship-loading commands, we have not reproduced the problem. Could this be the solution to the OOM problem?
Question 2
Is the periodic commit value a good one?
Is there another way to speed up data loading, i.e. another strategy for writing the data loading script?
Question 3
Is there a way to prevent the timeouts? Perhaps another way of writing the data loading script, or maybe JVM tuning?
Question 4
Some months ago we split the Cypher script into 2 or 3 parts to run them concurrently, but we stopped doing that because the server frequently messed up the data and became unusable. Is there a way to split the script "cleanly" and run the parts concurrently?
Question 1: Yes, USING PERIODIC COMMIT is the first thing to try when LOAD CSV causes OOM errors.
Question 2&3: The "sweet spot" for periodic commit batch size depends on your Cypher query, your data characteristics, and how your neo4j server is configured (all of which can change over time). You do not want the batch size to be too high (to avoid occasional OOMs), nor too low (to avoid slowing down the import). And you should tune the server's memory configuration as well. But you will have to do your own experimentation to discover the best batch size and server configuration, and adjust them as needed.
Question 4: Concurrent write operations that touch the same nodes and/or relationships must be avoided, as they can cause errors (like deadlocks and constraint violations). If you can split up your operations so that they act on completely disjoint subgraphs, then they should be able to run concurrently without these kinds of errors.
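As a sketch (the labels and file names are only placeholders), two scripts that each create nodes with a different label never touch the same entities, so they could in principle be run from two clients at the same time:
// client A
USING PERIODIC COMMIT 4000
LOAD CSV WITH HEADERS FROM "file:////tmp/typeA.csv" AS csvLine
CREATE (:MyNodeType1 {uid: toInt(csvLine.uid)});
// client B
USING PERIODIC COMMIT 4000
LOAD CSV WITH HEADERS FROM "file:////tmp/typeB.csv" AS csvLine
CREATE (:MyNodeType2 {uid: toInt(csvLine.uid)});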
Also, you should PROFILE your queries to see how the server will actually execute them. For example, even if both :MyNodeType1(uid) and :MyNodeType2(uid) are indexed (or have uniqueness constraints), that does not mean that the Cypher planner will automatically use those indexes when it executes your last query. If your profile of that query shows that it is not using the indexes, then you can add hints to the query to make the planner (more likely to) use them:
LOAD CSV WITH HEADERS FROM "file:////tmp/RelsInfixExpression-vLeftOperand-SimpleName_javaouille-normal-b11695.csv" AS csvLine
MATCH (n1:MyNodeType1) USING INDEX n1:MyNodeType1(uid)
WHERE n1.uid = TOINT(csvLine.uidFather)
MATCH (n2:MyNodeType2) USING INDEX n2:MyNodeType2(uid)
WHERE n2.uid = TOINT(csvLine.uidSon)
MERGE (n1)-[:vOperandLink]-(n2);
In addition, if it is OK to store the uid values as strings, you can remove the uses of TOINT(). This will speed things up to some extent.
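As a sketch, assuming the uid values are stored as strings when the nodes are created (a change on both sides, since the original load uses toInt()), the relationship query would become:
LOAD CSV WITH HEADERS FROM "file:////tmp/RelsInfixExpression-vLeftOperand-SimpleName_javaouille-normal-b11695.csv" AS csvLine
MATCH (n1:MyNodeType1) USING INDEX n1:MyNodeType1(uid)
WHERE n1.uid = csvLine.uidFather
MATCH (n2:MyNodeType2) USING INDEX n2:MyNodeType2(uid)
WHERE n2.uid = csvLine.uidSon
MERGE (n1)-[:vOperandLink]-(n2);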
I am running Neo4j 3.0.4 Enterprise in HA and am encountering an extremely mysterious issue. On running a very basic Cypher query from our Spring Data Neo4j application (the following is taken from the Neo4j DB query.log):
MATCH (n:`Node`)
WHERE n.`key` = { `key` }
WITH n
MATCH p=(n)-[*0..1]-(m)
RETURN p, ID(n) - {key: 123456}
I get an exception related to the page cache:
java.lang.NullPointerException
at org.neo4j.io.pagecache.impl.muninn.MuninnPageCursor.assertPagedFileStillMappedAndGetIdOfLastPage(MuninnPageCursor.java:369)
at org.neo4j.io.pagecache.impl.muninn.MuninnReadPageCursor.next(MuninnReadPageCursor.java:55)
at org.neo4j.io.pagecache.impl.muninn.MuninnPageCursor.next(MuninnPageCursor.java:121)
at org.neo4j.kernel.impl.store.CommonAbstractStore.readIntoRecord(CommonAbstractStore.java:1039)
at org.neo4j.kernel.impl.store.CommonAbstractStore.access$000(CommonAbstractStore.java:64)
at org.neo4j.kernel.impl.store.CommonAbstractStore$1.next(CommonAbstractStore.java:1179)
at org.neo4j.kernel.impl.api.store.StoreSingleNodeCursor.next(StoreSingleNodeCursor.java:64)
at org.neo4j.kernel.impl.api.StateHandlingStatementOperations.nodeCursorById(StateHandlingStatementOperations.java:137)
at org.neo4j.kernel.impl.api.ConstraintEnforcingEntityOperations.nodeCursorById(ConstraintEnforcingEntityOperations.java:422)
at org.neo4j.kernel.impl.api.OperationsFacade.nodeGetProperty(OperationsFacade.java:333)
at org.neo4j.cypher.internal.spi.v3_0.TransactionBoundQueryContext$NodeOperations.getProperty(TransactionBoundQueryContext.scala:316)
And then from here on out, the same query will fail intermittently and will be logged as a one liner:
java.lang.NullPointerException
On the app server side, we get the following generic error code:
org.neo4j.ogm.exception.CypherException: Error executing Cypher "Neo.DatabaseError.Statement.ExecutionFailed"
Have any Neo4j experts seen this issue before?
Moving the master in my cluster (i.e. restarting) seems to fix it, but the problem resurfaces as soon as any significant load is placed on the server. The page cache should be fairly large: I have an 8 GB database, and I have left the heap size at its default and the page cache size unset so that it also takes its default value.
The logs indicate that there are a large number of concurrent queries right before this exception (all happening within the same second and quite a few querying for the exact same thing). Could there be some type of race condition in how the page cache works? What is a realistic limit on concurrent reads?
Any advice at all is greatly appreciated!
I was doing a POC on a publicly available Twitter dataset for our project. I was able to create the Neo4j database for it using Michael Hunger's batch inserter utility, and it was relatively fast (it took just 2 hours and 53 minutes to finish). All in all there were:
15,203,731 Nodes, with 2 properties (name, url)
256,147,121 Relationships, with 1 property
I then created a Cypher query to update the Twitter database: I added a new property (Age) to the nodes and a new property (FollowedSince) to the relationships in the CSVs. This is where things start to look bad. The query to update the relationships (see below) takes forever to run.
USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM {csvfile} AS row FIELDTERMINATOR '\t'
MATCH (u1:USER {name:row.`name:string:user`}), (u2:USER {name:row.`name:string:user2`})
MERGE (u1)-[r:Follows]->(u2)
ON CREATE SET r.Property=row.Property, r.FollowedSince=row.FollowedSince
ON MATCH SET r.Property=row.Property, r.FollowedSince=row.FollowedSince;
I already pre-created the index by running
CREATE INDEX ON :USER(name);
My neo4j properties:
allow_store_upgrade=true
dump_configuration=false
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=260M
neostore.propertystore.db.index.mapped_memory=260M
neostore.nodestore.db.mapped_memory=768M
neostore.relationshipstore.db.mapped_memory=12G
neostore.propertystore.db.mapped_memory=2048M
neostore.propertystore.db.strings.mapped_memory=2048M
neostore.propertystore.db.arrays.mapped_memory=260M
node_auto_indexing=true
I'd like to know what I should do to speed up my Cypher query. As of this writing, more than an hour and a half has passed and my relationship update (10,000,747 rows) still hasn't finished. The node update (15,203,731 rows) that finished earlier clocked in at 34 minutes, which I think is way too long. (The batch inserter utility processed all the nodes in just 5 minutes!)
I did test my queries on a small dataset first before tackling the bigger one, and they did work.
My Neo4j lives on a server-grade machine, so hardware is not an issue here.
Any advice please? Thanks.
I'm loading a Neo4j database using Cypher commands piped directly into the neo4j-shell. Some experiments suggest that subgraph batches of about 1000 lines give the optimal throughput (about 3.2ms/line, 300 lines/sec (slow!), Neo4j 2.0.1). I use MATCH statements to bind existing nodes to the loading subgraph. Here's a chopped example:
begin
...
MATCH (domain75ea8a4da9d65189999d895f536acfa5:SubDomain { shorturl: "threeboysandanoldlady.blogspot.com" })
MATCH (domainf47c8afacb0346a5d7c4b8b0e968bb74:SubDomain { shorturl: "myweeview.com" })
MATCH (domainf431704fab917205a54b2477d00a3511:SubDomain { shorturl: "www.computershopper.com" })
CREATE
(article1641203:Article { id: "1641203", url: "http://www.coolsocial.net/sites/www/blackhawknetwork.com.html", type: 4, timestamp: 1342549270, datetime: "2012-07-17 18:21:10"}),
(article1641203)-[:PUBLISHED_IN]->(domaina9b3ed6f4bc801731351b913dfc3f35a),(author104675)-[:WROTE]->(article1641203),
....
commit
Using this (ridiculously slow) method, it takes several hours to load 200K nodes (~370K relationships) and, at that point, the loading slows down even more. I presume the asymptotic slowdown is due to the overhead of the MATCH statements. They make up 1/2 of the subgraph load statements by the time the graph hits 200K nodes. There's got to be a better way of doing this, it just doesn't scale.
I'm going to try rewriting the statements with parameters (refs: What is the most efficient way to insert nodes into a neo4j database using cypher AND http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/). I expect that to help, but it seems that I will still have problems making the subgraph connections. Would using MERGE or CREATE UNIQUE instead of the MATCH statements be the way to go? There must be best practice ways to do this that I'm missing. Any other speed-up ideas?
many thanks
Use MERGE, and do smaller transactions--I've found best results with batches of 50-100 (while doing index lookups). Bigger batches are better when doing CREATE only without MATCH. Also, I recommend using a driver to send your commands over the transactional API (with parameters) instead of via neo4j-shell--it tends to be a fair bit faster.
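As a rough sketch of what a parameterized statement could look like for the article/domain pattern in the question (the parameter names and the :Author label are assumptions, since the original snippet doesn't show the author node's label), the driver would send one statement like this per row, with a fresh parameter map each time:
MATCH (d:SubDomain {shorturl: {shorturl}})
MATCH (a:Author {id: {authorId}})
CREATE (art:Article {id: {articleId}, url: {url}, type: {type}, timestamp: {timestamp}, datetime: {datetime}})
CREATE (art)-[:PUBLISHED_IN]->(d)
CREATE (a)-[:WROTE]->(art);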
Alternatively (might not be applicable to all use cases), keep a local "index" of the node ids you've created. For only 200k items, this should be easy to fit in a normal map/dict of string->long. This will prevent you needing to tax the index on the db, and you can do only node-ID-based lookups and CREATE statements, and create the indexes later.
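A sketch of the per-row statement once the internal node id of the target is already known from your local map (the parameter names are placeholders):
MATCH (d) WHERE id(d) = {domainId}
CREATE (art:Article {id: {articleId}, url: {url}})
CREATE (art)-[:PUBLISHED_IN]->(d);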
The load2neo plugin worked well for me. Installation was fast and painless, and it has a very Cypher-like command structure that easily supports uniqueness requirements. It works with Neo4j 2.0 labels.
load2neo install + curl usage example:
http://nigelsmall.com/load2neo
load2neo Geoff syntax:
http://nigelsmall.com/geoff
It is much faster (>>10x) than using Cypher via neo4j-shell.
I wasn't able to get Cypher parameters working through neo4j-shell, despite trying everything I could find via internet search.
I am new to Neo4j and have set up a test graph DB for organizing some clickstream data, with a very small subset of what we actually use on a day-to-day basis. This graph has about 23 million nodes and 34 million relationships. The queries seem to be taking forever to run, i.e. I haven't seen a response come back even after waiting more than 30 minutes.
The data is organized as Year->Month->Day->Session{1..n}->Event{1..n}
I am running the DB on a Windows 7 machine with 1.5 GB of heap allocated to the Neo4j server.
These are the settings in neo4j-wrapper.conf:
wrapper.java.additional.1=-Dorg.neo4j.server.properties=conf/neo4j-server.properties
wrapper.java.additional.2=-Djava.util.logging.config.file=conf/logging.properties
wrapper.java.additional.3=-Dlog4j.configuration=file:conf/log4j.properties
wrapper.java.additional.6=-XX:+UseParNewGC
wrapper.java.additional.7=-XX:+UseConcMarkSweepGC
wrapper.java.additional.8=-Xloggc:data/log/neo4j-gc.log
wrapper.java.initmemory=1500
wrapper.java.maxmemory=1500
This is what my query looks like:
START n=node(3)
MATCH (n)-[:HAS]->(s)
WITH distinct s
MATCH (s)-[:HAS]->(e) WHERE e.page_name = 'Login'
WITH s.session_id as session, e
MATCH (e)-[:FOLLOWEDBY*0..1]->(e1)
WITH count(session) as session_cnt, e.page_name as startPage, e1.page_name as nextPage
RETURN startPage, nextPage, session_cnt
Also I have these properties set:
node_auto_indexing=true
node_keys_indexable=name,page_name,geo_country
relationship_auto_indexing=true
Can anyone help me figure out what might be wrong?
Even when I run portions of the query it takes 10-15 minutes before I can see a response.
Note: I have no other applications running on the Windows Machine
Why would you want to return all the nodes in the first place?
If you really want to do that, use the transactional http endpoint and curl to stream the response:
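Something along these lines should work (assuming the default port 7474 and no authentication; adjust as needed):
curl -H "Content-Type: application/json" -H "Accept: application/json" \
     -d '{"statements":[{"statement":"MATCH (n) RETURN id(n)"}]}' \
     http://localhost:7474/db/data/transaction/commit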
I tested it with a database of 100k nodes. It takes 0.9 seconds to transfer them (1.5MB) over the wire.
If you transfer all their properties by using "return n", it takes 1.4 seconds and results in 4.1MB transferred.
If you just want to know how many nodes are in your db, use something like this instead:
match (n) return count(*);