Neo4j 1.9.9 legacy index very slow after deletes

Using Neo4j 1.9.9.
Some Cypher queries we were running seemed to be unreasonably slow. Some investigation showed that:
Deleting 200k nodes takes about 2-3 seconds on my hardware (a MacBook Pro) when I select them using:
START n=node(*) DELETE n
Adding a WHERE clause does not significantly slow it down
Selecting the nodes using an index gives similar performance, e.g.
START n=node:__types__(className="com.e2sd.domain.Comment") DELETE n
Except that when repeating the previous test, it is 20x or more slower, with actual time varying from 80 to several hundred seconds. Even more curiously, it doesn't matter whether I repeat the test in the same JVM, start a new program, or clear out all the nodes in the database and verify it has zero nodes. The index-based delete is extremely slow on any subsequent run of the test until I clobber my Neo4j data directory with
rm -R target/neo4j-test/
I'll give some example Scala code here. I'm happy to provide more detail as required.
for (j <- 1 to 3) {
  log("Total nodes in database: " + inNeo4j(""" START n=node(*) RETURN COUNT(n) """).to(classOf[Int]).single)
  log("Start")
  inNeo4j(""" CREATE (x) WITH x FOREACH(i IN RANGE(1, 200000, 1) : CREATE ({__type__: "com.e2sd.domain.Comment"})) """)
  rebuildTypesIndex()
  log("Created lots of nodes")
  val x = inNeo4j(
    """
    START n=node:__types__(className="com.e2sd.domain.Comment")
    DELETE n
    RETURN COUNT(n)
    """).to(classOf[Int]).single
  log("Deleted x nodes: " + x)
}
// log is a convenience method that prints a string and the time since the last log
// inNeo4j is a convenience method to run a Cypher query
def rebuildTypesIndex(): Unit = {
  TransactionUtils.withTransaction(neo4jTemplate) {
    log.info("Rebuilding __types__ index...")
    val index = neo4jTemplate.getGraphDatabase.getIndex[Node]("__types__")
    for (node <- GlobalGraphOperations.at(neo4jTemplate.getGraphDatabaseService).getAllNodes.asScala) {
      index.remove(node)
      if (node.hasProperty("__type__")) {
        val typeProperty = node.getProperty("__type__")
        index.add(node, "className", typeProperty)
      }
    }
    log.info("Done")
  }
}
We are using Neo4j embedded here with the following Spring Data configuration.
<bean id="graphDbFactory" class="org.neo4j.graphdb.factory.GraphDatabaseFactory"/>
<bean id="graphDatabaseService" scope="singleton" destroy-method="shutdown"
factory-bean="graphDbFactory" factory-method="newEmbeddedDatabase">
<constructor-arg value="target/neo4j-test"/>
</bean>
<neo4j:config graphDatabaseService="graphDatabaseService" base-package="my.package.*"/>
Why is the DELETE query slow under the conditions described?

You have to explicitly delete entries from the legacy index; deleting a node is not enough to remove its entries from a legacy index. Thus, when you run the test a second time, you have 400k entries in your index, even though half of them point to deleted nodes. Repeated runs keep growing the index, which is why your program gets slower and slower.
I had this problem when I wrote an extension to neo4j spatial to bulk-load the RTree. Using the Java API, I had to explicitly delete from the index separately from deleting the node. Glad I could help.
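In embedded mode, the fix looks roughly like the following. This is a minimal sketch against the 1.9 Java API (assumptions: db is your GraphDatabaseService, and the method name is illustrative), removing each node's index entry in the same pass that deletes the node:
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;

// Delete each node AND its legacy index entry in one pass, so repeated
// runs never accumulate stale index entries.
void deleteCommentsWithIndexEntries(GraphDatabaseService db) {
    Transaction tx = db.beginTx(); // 1.9 API: no try-with-resources yet
    try {
        Index<Node> index = db.index().forNodes("__types__");
        for (Node node : index.get("className", "com.e2sd.domain.Comment")) {
            index.remove(node); // explicitly remove the index entry
            for (Relationship rel : node.getRelationships()) {
                rel.delete(); // a node must have no relationships before delete
            }
            node.delete();
        }
        tx.success();
    } finally {
        tx.finish(); // finish(), not close(), in 1.9
    }
}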


How to print logs and measure execution time

Currently, I'm trying to import a CSV file that contains around 2 million lines. Each line corresponds to a node. I'm using the Neo4j browser. Note: I also tried the neo4j import tool, but it was also somehow slower.
I tried to run the import with a standard Cypher query like:
USING PERIODIC COMMIT 500 LOAD CSV FROM 'file:///data.csv' AS r
WITH toInteger(r[0]) AS ID, toInteger(r[1]) AS national_id, toInteger(r[2]) as passport_no, toInteger(r[3]) as status, toInteger(r[4]) as activation_date
MERGE (p:Customer {ID: ID}) SET p.national_id = national_id, p.passport_no = passport_no, p.status = status, p.activation_date = activation_date
This works very slowly.
Later I tried:
CALL apoc.periodic.iterate(
  'CALL apoc.load.csv(\'file:/data.csv\') yield list as r return r',
  'WITH toInteger(r[0]) AS ID, toInteger(r[1]) AS national_id, toInteger(r[2]) AS passport_no, toInteger(r[3]) AS status, toInteger(r[4]) AS activation_date
   MERGE (p:Customer {ID: ID}) SET p.national_id = national_id, p.passport_no = passport_no, p.status = status, p.activation_date = activation_date',
  {batchSize:10000, iterateList:true, parallel:true});
This one seems to work faster since the parallel option is true. BUT I want to measure the execution time of one batch.
How could I print something in the Neo4j browser?
How could I measure the execution time of one batch?
Your first query uses a batch size of 500, and your second one uses a batch size that is 20 times larger. You need to use the same batch size to do a valid comparison.
Since your query requires a large number of batches (at least 200), dividing the total time by the number of batches should be a reasonable approximation of the average time per batch.
Have you created an index on :Customer(ID)? That should help to speed up your queries.
You should consider using the ON CREATE expression with your MERGE clause. Right now, the SET clause is always executed, even when the node already exists.
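Also, if you can drive the load from code rather than the browser, timing it is straightforward, because apoc.periodic.iterate reports its batch statistics via YIELD. Here is a minimal sketch using the official Java driver (the bolt URI, credentials and the abbreviated inner statement are placeholder assumptions):
import org.neo4j.driver.*;

public class TimeBatches {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"));
             Session session = driver.session()) {
            long start = System.currentTimeMillis();
            // YIELD the statistics that apoc.periodic.iterate reports
            Record rec = session.run(
                "CALL apoc.periodic.iterate("
                + " 'CALL apoc.load.csv(\"file:/data.csv\") YIELD list AS r RETURN r',"
                + " 'MERGE (p:Customer {ID: toInteger(r[0])})',"
                + " {batchSize:10000, parallel:true}) "
                + "YIELD batches, total RETURN batches, total").single();
            long elapsedMs = System.currentTimeMillis() - start;
            long batches = rec.get("batches").asLong();
            System.out.printf("%d rows in %d batches, ~%d ms per batch%n",
                    rec.get("total").asLong(), batches,
                    batches == 0 ? 0 : elapsedMs / batches);
        }
    }
}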
The key thing is adding a unique constraint before adding any data. This makes the process a lot faster. I saw that at https://neo4j.com/docs/getting-started/current/cypher-intro/load-csv/
Now a script like this:
CREATE CONSTRAINT ON (n:Movie) ASSERT n.no IS UNIQUE;
USING PERIODIC COMMIT 10000
LOAD CSV FROM 'file:///data/MovieData.csv' AS r
WITH r[0] AS no, toInteger(r[1]) AS status, toInteger(r[2]) as activation_date
MERGE (p:Movie {no: no})
ON CREATE SET p.status = status, p.activation_date = activation_date
adds 1 million nodes in 1 minute. Before, it took more than 2-3 days.

Move properties from relation to node in Neo4J on large datasets

I'm trying to move a property I've set up on a relationship in Neo4j to one of its member nodes, because I want to index that property, and as of version 2.2.5, which I am using, indexing on relationships is not possible.
However, when I try to move it via the Cypher command MATCH (k)-[l]->(m) SET m.key = l.key, my request fails due to a lack of memory. I have no way to add additional memory either.
Does anyone know of a good way to do this without needing lots of memory when dealing with large (ca. 20M) datasets?
If it's a one-time operation, I highly recommend writing an unmanaged extension.
It will be much faster than Cypher.
Here is an example
import org.neo4j.graphdb.*;
import org.neo4j.helpers.collection.IteratorUtil;

// Assumes "database" is the GraphDatabaseService injected into your extension.
Label startNodeLabel = DynamicLabel.label("StartNode");
Label endNodeLabel = DynamicLabel.label("EndNode");
RelationshipType relationshipType = DynamicRelationshipType.withName("RelationshipType");
String nodeProperty = "nodeProperty";
String relationshipProperty = "relationshipProperty";

try (Transaction tx = database.beginTx()) {
    final ResourceIterator<Node> nodes = database.findNodes(startNodeLabel);
    for (Node startNode : IteratorUtil.asCollection(nodes)) {
        if (startNode.hasRelationship(relationshipType, Direction.OUTGOING)) {
            final Iterable<Relationship> relationships =
                    startNode.getRelationships(relationshipType, Direction.OUTGOING);
            for (Relationship relationship : relationships) {
                final Node endNode = relationship.getOtherNode(startNode);
                if (endNode.hasLabel(endNodeLabel)) {
                    // Copy the property from the relationship onto the end node
                    endNode.setProperty(nodeProperty, relationship.getProperty(relationshipProperty));
                }
            }
        }
    }
    tx.success();
}
If you do not want to build an unmanaged extension for a one-time task, you can also write e.g. a shell script that calls curl in a loop, paging through the data with SKIP and LIMIT. This also has the advantage that you can copy the values rather than move them.
MATCH (k)-[l]->(m)
WITH l, m SKIP 200000 LIMIT 100000
SET m.key = l.key
RETURN COUNT(*) AS nRows
Replace 200000 with the value of the loop variable.
You can use LIMIT to restrict the query to a specific number of rows, and then repeat the query until no more rows are returned. That will also limit the amount of memory used.
For example, if you also wanted to remove the key property from the relationship at the same time (and you wanted to process 100K rows each time):
[EDITED]
MATCH (k)-[l]->(m)
WHERE HAS(l.key)
WITH k, l, m
LIMIT 100000
SET m.key = l.key
REMOVE l.key
RETURN COUNT(*) AS nRows;
This query will return an nRows value of 0 when you are done.
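If you would rather drive that loop from code than rerun the query by hand, here is a rough sketch against the embedded 2.2 Java API (assumptions: db is your GraphDatabaseService; if you run in server mode, issue the same query via the transactional HTTP endpoint instead):
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;
import org.neo4j.graphdb.Transaction;

// Repeat the batched update until a pass touches zero rows.
long nRows;
do {
    try (Transaction tx = db.beginTx()) {
        Result result = db.execute(
                "MATCH (k)-[l]->(m) WHERE HAS(l.key) "
                + "WITH k, l, m LIMIT 100000 "
                + "SET m.key = l.key REMOVE l.key "
                + "RETURN COUNT(*) AS nRows");
        nRows = (Long) result.columnAs("nRows").next();
        tx.success(); // commit this batch before starting the next
    }
} while (nRows > 0);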

Neo4j 1.9.5 (using spring data neo4j) query time slow

I have a Neo4j community edition 1.9.5 running on an EC2 m1.medium instance (around 4 GB RAM). I have around 300 nodes, 800 relationships and some 2000 properties. Neo4j is running in REST mode. Below is my applicationContext.xml:
<beans profile="default">
    <bean class="org.springframework.data.neo4j.rest.SpringRestGraphDatabase" id="graphDatabaseService">
        <constructor-arg index="0" value="http://localhost:7474/db/data/"/>
    </bean>
    <neo4j:config graphDatabaseService="graphDatabaseService"/>
</beans>
Now, I have the query below, which shows all the movies your friends have liked. It takes ~10 seconds to return:
start user=node(*)
match user-[friend_rela:FRIENDS]-friend, friend-[movie_rela:LIKE]->movie
where has(user.uid) and user.uid={0}
return distinct movie, movie_rela, friend
order by movie_rela.timeStamp desc
skip {1}
limit {2}
The Indexes tab in the admin UI shows I have indexed the following:
Nodes:
movieId (from Movie)
__types__
Movie
uid (from User)
User
Relationships:
IsFriends
Like
__rel_types__
timeStamp
I have also changed the neo4j-wrapper.conf file to have the following heap sizes
# Initial Java Heap Size (in MB)
wrapper.java.initmemory=512
# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=2000
Do you think I am missing anything? I am wondering why it takes so long. Please advise.
Thanks
Your query is horribly inefficient; it basically traverses the full graph multiple times. You should use an index lookup to find the start point for your query and then traverse from the start point(s), so you're doing a local query instead of a global one:
start user=node:User(uid={0})
match (user)-[friend_rela:FRIENDS]-(friend)-[movie_rela:LIKE]->(movie)
return distinct movie,movie_rela,friend
order by movie_rela.timeStamp desc
skip {1} limit {2}

Too much time importing data and creating nodes

I have recently started with Neo4j and graph databases.
I am using this API for the persistence of my model. I have everything done and working, but my problem is efficiency.
So first of all, the scenario: I have a couple of XML documents which translate to some nodes and relations between them. As I have read that this API does not yet support batch insertion, I am creating the nodes and relations one at a time.
This is the code I am using for creating a node:
var newEntry = new EntryNode { hash = incremento++.ToString() };
var result = client.Cypher
    .Merge("(entry:EntryNode {hash: {_hash} })")
    .OnCreate()
    .Set("entry = {newEntry}")
    .WithParams(new
    {
        _hash = newEntry.hash,
        newEntry
    })
    .Return(entry => new
    {
        EntryNode = entry.As<Node<EntryNode>>()
    });
I accept that it takes time to create all the nodes, but I do not understand why the time to create one grows so fast. I have run some tests and am stuck at the point where creating an EntryNode takes 0.2 seconds at first, but once there are about 500 nodes it has grown to ~2 seconds.
I have also created an index on EntryNode(hash) manually in the console before inserting any data, and ran the tests both with and without the index.
Am I doing something wrong? Is this time normal?
EDITED:
@Tatham
Thanks for the answer, it really helped. Now I am using the FOREACH statement in Neo4jClient to create 1000 nodes in just 2 seconds.
On a related topic, now that I create the nodes this way, I also want to create relationships. This is the code I am trying right now, but I get some errors.
client.Cypher
    .Match("(e:EntryNode)")
    .Match("(p:EntryPointerNode)")
    .ForEach("(n in {set} | " +
        "FOREACH (e in (CASE WHEN e.hash = n.EntryHash THEN [e] END) " +
        "FOREACH (p in pointers (CASE WHEN p.hash = n.PointerHash THEN [p] END) " +
        "MERGE ((p)-[r:PointerToEntry]->(ee)) )))")
    .WithParam("set", nodesSet)
    .ExecuteWithoutResults();
What I want it to do is: given a list of pairs of strings, get the (unique) nodes whose "hash" property equals each string value, and create a relationship between them. I have tried a couple of variants of this query, but I don't seem to find the solution.
Is this possible?
This approach is going to be very slow because you do a separate HTTP call to Neo4j for every node you are inserting. Each call is then a transaction. Finally, you are also returning the node back, which is probably a waste.
There are two options for doing this in batches instead.
From https://stackoverflow.com/a/21865110/211747, you can do something like this, where you pass in a set of objects and then FOREACH through them in Cypher. This means one larger HTTP call to Neo4j, executed in a single transaction on the DB:
FOREACH (n in {set} | MERGE (c:Label {Id : n.Id}) SET c = n)
http://docs.neo4j.org/chunked/stable/query-foreach.html
The other option, coming soon, is that you will be able to write something like this in Cypher:
LOAD CSV WITH HEADERS FROM 'file://c:/temp/input.csv' AS n
MERGE (c:Label { Id : n.Id })
SET c = n
https://github.com/davidegrohmann/neo4j/blob/2.1-fix-resource-failure-load-csv/community/cypher/cypher/src/test/scala/org/neo4j/cypher/LoadCsvAcceptanceTest.scala

Neo4j Cypher: How to stop duplicate SET if multiple CREATES

I have a complex Cypher query that creates multiple nodes and increments some counters on those nodes. For the sake of example, here is a simplified version of what I am trying to do:
START a = node(1), e = node(2)
CREATE a-[r1]->(b {})-[r2]->(c {}), e-[r3]->b-[r4]->(d{})
SET a.first=a.first+1, e.second=e.second+1
RETURN b
The issue is that, because there are two CREATE commands, the SET commands run twice and the values are incremented by 2 instead of 1 as intended. I have looked to see if I can merge the multiple CREATE statements, and I cannot.
My initial idea is to separate the creates into a batch query; however, I was wondering if there is another option.
Where are you executing this query? What version of neo4j are you using?
I went to console.neo4j.org and successfully ran the following and it correctly added one to both a.first and e.second:
START a = node(1), e = node(2)
CREATE a-[r:KNOWS]->b-[r2:KNOWS]->c, e-[:KNOWS]->b-[:KNOWS]->d
SET a.first=a.first+1, e.second=e.second+1
RETURN b
