Move properties from relationship to node in Neo4j on large datasets

I'm trying to move a property I've set up on a relationship in Neo4j to one of its member nodes, because I want to index that property, and in version 2.2.5, which I am using, indexing relationship properties is not possible.
However, when I try to move it with the Cypher command MATCH (k)-[l]->(m) SET m.key = l.key, that query also fails due to lack of memory. I have no way to add more memory either.
Does anyone know of a good way to do this on large datasets (ca. 20M) without needing lots of memory?

If it's a one-time operation, I highly recommend writing an unmanaged extension.
It will be much faster than Cypher.
Here is an example:
Label startNodeLabel = DynamicLabel.label("StartNode");
Label endNodeLabel = DynamicLabel.label("EndNode");
RelationshipType relationshipType = DynamicRelationshipType.withName("RelationshipType");
String nodeProperty = "nodeProperty";
String relationshipProperty = "relationshipProperty";

try (Transaction tx = database.beginTx()) {
    final ResourceIterator<Node> nodes = database.findNodes(startNodeLabel);
    for (Node startNode : IteratorUtil.asCollection(nodes)) {
        if (startNode.hasRelationship(relationshipType, Direction.OUTGOING)) {
            final Iterable<Relationship> relationships =
                    startNode.getRelationships(relationshipType, Direction.OUTGOING);
            for (Relationship relationship : relationships) {
                final Node endNode = relationship.getOtherNode(startNode);
                if (endNode.hasLabel(endNodeLabel)) {
                    // copy the property from the relationship onto the end node
                    endNode.setProperty(nodeProperty, relationship.getProperty(relationshipProperty));
                }
            }
        }
    }
    tx.success();
}
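If you have not packaged an unmanaged extension before: the loop above goes into a JAX-RS resource class that the server picks up from its plugins directory. A minimal sketch of such a wrapper follows; the class name and URL path here are made up, only the JAX-RS annotations and the @Context-injected GraphDatabaseService are part of the actual extension API.

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;
import org.neo4j.graphdb.GraphDatabaseService;

// Hypothetical resource wrapping the copy loop shown above.
@Path("/copy-relationship-property")
public class CopyRelationshipPropertyResource {

    private final GraphDatabaseService database;

    public CopyRelationshipPropertyResource(@Context GraphDatabaseService database) {
        this.database = database;
    }

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public Response copy() {
        // run the copy loop shown above inside a transaction here
        return Response.ok("done").build();
    }
}

In Neo4j 2.x the class is registered via org.neo4j.server.thirdparty_jaxrs_classes in conf/neo4j-server.properties, mapping its package to a URL mount point.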

If you do not want to go for an unmanaged extension because you only need to move the properties once, you can also write e.g. a shell script that calls curl in a loop with SKIP and LIMIT. This also has the advantage that you don't have to move the values, you can simply copy them.
MATCH (k)-[l]->(m)
WITH l, m SKIP 200000 LIMIT 100000
SET m.key = l.key
RETURN COUNT(*) AS nRows
Replace 200000 with the value of the loop variable.
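The shell script itself is not shown in the answer; as a sketch of the same idea in Java (instead of curl), the loop below POSTs each batch to the transactional HTTP endpoint. The host, port, batch size, and the assumption that authentication is disabled are mine, not from the original answer.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class BatchedPropertyCopy {
    public static void main(String[] args) throws Exception {
        // transactional endpoint of a local 2.2 server; adjust host/port/credentials as needed
        String endpoint = "http://localhost:7474/db/data/transaction/commit";
        int batchSize = 100000;

        for (long skip = 0; ; skip += batchSize) {
            String statement = "MATCH (k)-[l]->(m) "
                    + "WITH l, m SKIP " + skip + " LIMIT " + batchSize + " "
                    + "SET m.key = l.key "
                    + "RETURN COUNT(*) AS nRows";
            String payload = "{\"statements\":[{\"statement\":\"" + statement + "\"}]}";

            HttpURLConnection con = (HttpURLConnection) new URL(endpoint).openConnection();
            con.setRequestMethod("POST");
            con.setRequestProperty("Content-Type", "application/json");
            con.setDoOutput(true);
            try (OutputStream out = con.getOutputStream()) {
                out.write(payload.getBytes(StandardCharsets.UTF_8));
            }

            String response;
            try (Scanner scanner = new Scanner(con.getInputStream(), "UTF-8")) {
                response = scanner.useDelimiter("\\A").hasNext() ? scanner.next() : "";
            }
            // the commit endpoint answers with {"results":[{... "data":[{"row":[<nRows>]}]}],"errors":[]};
            // stop once a batch no longer matches anything
            if (response.contains("\"row\":[0]")) {
                break;
            }
        }
    }
}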

You can use LIMIT to restrict the query to a specific number of rows, and then repeat the query until no more rows are returned. That also limits the amount of memory used.
For example, if you also wanted to remove the key property from the relationship at the same time (processing 100K rows per batch):
MATCH (k)-[l]->(m)
WHERE HAS(l.key)
WITH l, m
LIMIT 100000
SET m.key = l.key
REMOVE l.key
RETURN COUNT(*) AS nRows;
This query will return an nRows value of 0 when you are done.
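The answer does not show the surrounding loop; as a sketch (my assumption, using the embedded API, where GraphDatabaseService.execute is available from Neo4j 2.2 onwards), the repetition could be driven like this:

// repeat the batched query until it reports zero processed rows;
// "database" is assumed to be an embedded GraphDatabaseService
String batchQuery =
        "MATCH (k)-[l]->(m) " +
        "WHERE HAS(l.key) " +
        "WITH l, m LIMIT 100000 " +
        "SET m.key = l.key " +
        "REMOVE l.key " +
        "RETURN COUNT(*) AS nRows";

while (true) {
    long nRows;
    try (Transaction tx = database.beginTx();
         Result result = database.execute(batchQuery)) {
        nRows = (long) result.next().get("nRows");
        tx.success();
    }
    if (nRows == 0) {
        break; // nothing left to copy
    }
}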

Related

How to properly use apoc.periodic.iterate to reduce heap usage for large transactions?

I am trying to use apoc.periodic.iterate to reduce heap usage when doing very large transactions in a Neo4j database.
I've been following the advice given in this presentation.
However, my results differ from those shown in the slides.
First, some notes on my setup:
Using Neo4j Desktop, graph version 4.0.3 Enterprise, with APOC 4.0.0.10
I'm calling queries using the .NET Neo4j Driver, version 4.0.1.
neo4j.conf values:
dbms.memory.heap.initial_size=2g
dbms.memory.heap.max_size=4g
dbms.memory.pagecache.size=2g
Here is the Cypher query I'm running:
CALL apoc.periodic.iterate(
  "UNWIND $nodes AS newNodeObj RETURN newNodeObj",
  "CREATE (n:MyNode)
   SET n = newNodeObj",
  {batchSize:2000, iterateList:true, parallel:false, params: { nodes: $nodes_in } }
)
And the line of C#:
var createNodesResCursor = await session.RunAsync(createNodesQueryString, new { nodes_in = nodeData });
where createNodesQueryString is the query above, and nodeData is a List<Dictionary<string, object>> where each Dictionary has just three entries: 2 strings, 1 long.
When attempting to run this to create 1.3 million nodes, I observe the heap usage (via JConsole) climbing all the way up to the 4 GB available and bouncing back and forth between ~2.5 GB and 4 GB. Reducing the batch size makes no discernible difference, and raising heap.max_size just causes the heap usage to climb to almost that value instead. It's also really slow, taking 30+ minutes to create those 1.3 million nodes.
Does anyone have any idea what I may be doing wrong or differently from the linked presentation? I understand my query does a CREATE whereas the presentation only updates an already loaded dataset, but I can't imagine that's the reason my heap usage is so high.
Thanks
My issue was that although I was using apoc.periodic.iterate, I was still uploading the entire 1.3-million-node data set to the database as a single query parameter!
Modifying my code to do the batching myself, as follows, fixed both the heap usage problem and the slowness:
const int batchSize = 2000;
for (int count = 0; count < nodeData.Count; count += batchSize)
{
    string createNodesQueryString = $@"
        UNWIND $nodes_in AS newNodeObj
        CREATE (n:MyNode)
        SET n = newNodeObj";

    int length = Math.Min(batchSize, nodeData.Count - count);
    var createNodesResCursor = await session.RunAsync(createNodesQueryString,
        new { nodes_in = nodeData.ToList().GetRange(count, length) });
    var createNodesResSummary = await createNodesResCursor.ConsumeAsync();
}

How can I find all the paths that cost less than a maximum in a Neo4j DB?

Hi everyone. I am new to the Neo4j database.
I have a graph containing nodes and relationships, and I want to get all paths from A to other nodes whose total cost is less than a maximum.
The maximum needs to be configurable.
I query Neo4j from Java. I know that an Evaluator decides when we stop traversing a path, but I can't pass my maximum into the interface's evaluate() method.
My code is here:
public class MyEvaluators implements Evaluator {

    @Override
    public Evaluation evaluate(Path path) {
        Iterable<Relationship> rels = path.relationships();
        double totalCost = 0.0;
        for (Relationship rel : rels) {
            totalCost += (double) rel.getProperty("cost");
        }
        return totalCost > MAXIMUM
                ? Evaluation.EXCLUDE_AND_PRUNE
                : Evaluation.INCLUDE_AND_CONTINUE;
    }
}
And I don't want to limit the path depth.
So how can I do this query quickly?
Which version are you looking at?
https://neo4j.com/docs/java-reference/current/tutorial-traversal/
In the current API you can pass a context object (branch state) to the traversal that keeps your current state per branch, so you can accumulate the total cost in a PathEvaluator:
https://neo4j.com/docs/java-reference/3.4/javadocs/org/neo4j/graphdb/traversal/PathEvaluator.html
Also perhaps you want to derive from the Dijkstra Evaluator.
https://github.com/neo4j/neo4j/blob/3.5/community/graph-algo/src/main/java/org/neo4j/graphalgo/impl/path/Dijkstra.java
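A minimal sketch of that branch-state idea against the 3.x embedded API (my assumption, not code from the answer): database is a GraphDatabaseService, costType is the relationship type to follow, and maxCost is the configurable maximum; the relevant classes live in org.neo4j.graphdb and org.neo4j.graphdb.traversal.

// the Double branch state carries the cost accumulated up to the previous node,
// so each step only adds the cost of the last relationship instead of re-summing the path
TraversalDescription traversal = database.traversalDescription()
        .uniqueness(Uniqueness.NODE_PATH)
        .expand(PathExpanders.<Double>forTypeAndDirection(costType, Direction.OUTGOING),
                new InitialBranchState.State<Double>(0.0, 0.0))
        .evaluator(new PathEvaluator.Adapter<Double>() {
            @Override
            public Evaluation evaluate(Path path, BranchState<Double> state) {
                Relationship last = path.lastRelationship();
                if (last == null) {
                    return Evaluation.INCLUDE_AND_CONTINUE; // start node, cost 0
                }
                double total = state.getState() + (double) last.getProperty("cost");
                state.setState(total); // children of this branch continue from the new total
                return total > maxCost
                        ? Evaluation.EXCLUDE_AND_PRUNE
                        : Evaluation.INCLUDE_AND_CONTINUE;
            }
        });

for (Path path : traversal.traverse(startNode)) {
    // every path returned here has a total cost <= maxCost
}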

Loading 9M rows from Neo4j and writing them to CSV throws an out of memory exception

I have a big graph model and I need to write the result of the following query to a CSV file.
Match (u:USER)-[r:PURCHASED]->(o:ORDER)-[h:HAS]->(i:ITEM) return u.id as userId,i.product_number as itemId
When I "Explain" query, this is the result I get :
It shows that the estimated result is something around 9M. My problems are :
1) It takes a lot of time to get a response. From neo4j-shell it takes 38 minutes! Is this normal? BTW, I have all the schema indexes there and they are all ONLINE.
2) When I use Spring Data Neo4j (SDN) to fetch the result, it throws a "java.lang.OutOfMemoryError: GC overhead limit exceeded" error, and that happens when SDN tries to convert the loaded data to our @QueryResult object.
I tried to optimize the query in all different ways but nothing changed! My impression is that I am doing something wrong. Does anyone have any idea how I can solve this problem? Should I go for batch read/write?
P.S. I am using Neo4j Community Edition version 3.0.1; these are my server configs:
dbms.jvm.additional=-Dunsupported.dbms.udc.source=tarball
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=3G
neostore.relationshipstore.db.mapped_memory=4G
neostore.propertystore.db.mapped_memory=3G
neostore.propertystore.db.strings.mapped_memory=1000M
neostore.propertystore.db.index.keys.mapped_memory=500M
neostore.propertystore.db.index.mapped_memory=500M
Although Neo4j will stream results to you as it matches them, when you use SDN it has to collect the output into a single @QueryResult object. To avoid OOM problems you'll need to either ensure your application has sufficient heap memory available to load all 9M responses, or use the neo4j-shell, or use a purpose-built streaming interface, such as https://www.npmjs.com/package/cypher-stream. (Caveat emptor: I haven't tried this, but it looks like it should do the trick.)
Your config settings are not correct for Neo4j 3.0.1.
You have to set the heap in conf/neo4j-wrapper.conf, e.g. 8G,
and the page cache in conf/neo4j.conf (looking at your store you only need 2G for the page cache).
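For 3.0.x that would look something like this (the 8G heap is just the example figure from above; size it to your machine):

# conf/neo4j-wrapper.conf
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g

# conf/neo4j.conf
dbms.memory.pagecache.size=2g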
Also as you can see it will create 8+ million rows.
You might have more luck with this query:
Match (u:USER)-[:PURCHASED]->(:ORDER)-[:HAS]->(i:ITEM)
with distinct u,i
return u.id as userId,i.product_number as itemId
It also doesn't make sense to return 8M rows to neo4j-shell, to be honest.
If you want to measure it, replace the RETURN with WITH and add a RETURN count(*)
Match (u:USER)-[r:PURCHASED]->(o:ORDER)-[h:HAS]->(i:ITEM)
with distinct u,i
WITH u.id as userId,i.product_number as itemId
RETURN count(*)
Another optimization could be to go via item and user and do a hash-join in the middle for a global query like this:
Match (u:USER)-[:PURCHASED]->(o:ORDER)-[:HAS]->(i:ITEM)
USING JOIN ON o
with distinct u,i
WITH u.id as userId,i.product_number as itemId
RETURN count(*)
The other thing that I'd probably do to reduce the number of returned results is to try aggregation.
Match (u:USER)-[:PURCHASED]->(o:ORDER)-[:HAS]->(i:ITEM)
with distinct u,i
WITH u, collect(distinct i) as products
WITH u.id as userId,[i in products | i.product_number] as items
RETURN count(*)
Thanks to Vince's and Michael's comments I found a solution!
After doing some experiments it became clear that the server response time is actually good: 1.5 minutes for 9 million rows! The problem is with SDN, as Vince mentioned: the OOM happens when SDN tries to convert the data to @QueryResult objects. Increasing the heap memory for our application is not a permanent solution, as we will have more rows in the future, so we decided to use the neo4j-jdbc driver for big data queries... and it works like a jet! Here is the code example we used:
Class.forName("org.neo4j.jdbc.Driver");
try (Connection con = DriverManager.getConnection("jdbc:neo4j:bolt://HOST:PORT", "USER", "PASSWORD")) {
    // Querying
    String query = "match (u:USER)-[r:PURCHASED]->(o:ORDER)-[h:HAS]->(i:ITEM) return u.id as userId,i.product_number as itemId";
    con.setAutoCommit(false); // important for large datasets
    Statement st = con.createStatement();
    st.setFetchSize(50);      // important for large datasets
    try (ResultSet rs = st.executeQuery(query)) {
        // writer is a BufferedWriter opened on the target CSV file beforehand
        while (rs.next()) {
            writer.write(rs.getInt("userId") + "," + rs.getInt("itemId"));
            writer.newLine();
        }
    }
    st.setFetchSize(0);
    writer.close();
    st.close();
}
Just make sure you use con.setAutoCommit(false) and st.setFetchSize(50) if you know that you are going to load a large dataset. Thanks everyone!

What is the least expensive way to traverse through an ExecutionResult in Neo4j?

I am very new to the Neo4j database.
I am using the site above as a reference and trying to create nodes, store data on them, and retrieve the nodes and their respective properties.
For retrieving the nodes I am using the following method:
ExecutionResult result = engine.execute(query, map);
Iterator<Object> columnAs = result.columnAs("n");
while (columnAs.hasNext()) {
    Node n = (Node) columnAs.next();
    for (String key : n.getPropertyKeys()) {
        System.out.println(key);
        System.out.println(n.getProperty(key));
    }
}
Executing the above while loop takes a lot of time: almost 10-12 seconds to traverse 28k nodes.
I am not sure whether I am following the proper method, or whether there is a better alternative.
Thanks in advance.

GC OutOfMemory error when deleting a large amount of nodes in Neo4j

I have a large number of highly connected nodes that I sometimes want removed from the database. Through a couple of traversals, I wind up with a list of nodes I want to delete:
for (Node nodeToDelete : nodesToDelete)
{
    for (Relationship rel : nodeToDelete.getRelationships())
    {
        rel.delete();
    }
    nodeToDelete.delete();
}
The problem is that no matter how large I set my Heap, I keep getting:
java.lang.OutOfMemoryError: GC overhead limit exceeded
What is the best way to delete a large list of nodes? I know I have to remove their relationships first before actually deleting them - I step through the code and it appears to fail on a relationship delete. Is there a better function for deleting nodes than what I have? Everything is wrapped in a transaction which is very important since no part of this delete is allowed to fail - could that be an issue?
Thanks!
Do it in batches. The problem is that your deletes are all wrapped in one transaction, and in order to be able to revert that transaction, all of its changes are kept in memory until it commits. Note that committing in batches means the whole delete is no longer atomic, so weigh that against your requirement that no part may fail. Try something like this:
long counter = 0;
Transaction tx = db.beginTx();
try
{
    for (Node nodeToDelete : nodesToDelete)
    {
        if (counter == 1000)
        {
            // commit the current batch and start a new transaction
            tx.success();
            tx.finish();
            tx = db.beginTx();
            counter = 0;
        }
        for (Relationship rel : nodeToDelete.getRelationships())
        {
            rel.delete();
        }
        nodeToDelete.delete();
        counter++;
    }
    tx.success();
}
finally
{
    tx.finish();
}
