What is the least expensive way to traverse an ExecutionResult in Neo4j?

I am very new to the Neo4j database.
I am taking the above site as a reference and trying to create nodes for storing data, then retrieving the nodes and their respective properties.
To retrieve the nodes I am using the following method:
ExecutionResult result = engine.execute(query, map);
Iterator<Node> columnAs = result.columnAs("n");
while (columnAs.hasNext()) {
    Node n = columnAs.next();
    for (String key : n.getPropertyKeys()) {
        System.out.println(key);
        System.out.println(n.getProperty(key));
    }
}
Executing the above while loop takes a lot of time: almost 10-12 seconds to traverse 28k nodes.
I am not sure whether I am following the proper method, or whether there is some alternative to this.
Thanks in advance.
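One thing worth trying (a sketch, not an answer from the original thread): if only the property values are needed, return them directly from Cypher instead of whole Node objects, so the loop never calls getPropertyKeys()/getProperty() per node. The property names name and age below are hypothetical; engine and map are the same objects as in the question.

// Sketch: return only the needed columns rather than Node objects.
// "n.name" and "n.age" are hypothetical properties; substitute your own.
ExecutionResult result = engine.execute(
        "start n=node(*) return n.name as name, n.age as age", map);
for (Map<String, Object> row : result) {
    for (Map.Entry<String, Object> column : row.entrySet()) {
        System.out.println(column.getKey() + ": " + column.getValue());
    }
}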

Related

How to properly use apoc.periodic.iterate to reduce heap usage for large transactions?

I am trying to use apoc.periodic.iterate to reduce heap usage when doing very large transactions in a Neo4j database.
I've been following the advice given in this presentation.
But my results differ from those shown in the slides.
First, some notes on my setup:
Using Neo4j Desktop, graph version 4.0.3 Enterprise, with APOC 4.0.0.10
I'm calling queries using the .NET Neo4j Driver, version 4.0.1.
neo4j.conf values:
dbms.memory.heap.initial_size=2g
dbms.memory.heap.max_size=4g
dbms.memory.pagecache.size=2g
Here is the Cypher query I'm running:
CALL apoc.periodic.iterate(
  "UNWIND $nodes AS newNodeObj RETURN newNodeObj",
  "CREATE (n:MyNode)
   SET n = newNodeObj",
  {batchSize:2000, iterateList:true, parallel:false, params: { nodes: $nodes_in }}
)
And the line of C#:
var createNodesResCursor = await session.RunAsync(createNodesQueryString, new { nodes_in = nodeData });
where createNodesQueryString is the query above, and nodeData is a List<Dictionary<string, object>> where each Dictionary has just three entries: 2 strings, 1 long.
When attempting to run this to create 1.3 million nodes, I observe the heap usage (via JConsole) going all the way up to the 4 GB available and bouncing back and forth between ~2.5 GB and 4 GB. Reducing the batch size makes no discernible difference, and raising dbms.memory.heap.max_size just causes the heap usage to climb to almost that value. It's also really slow, taking 30+ minutes to create those 1.3 million nodes.
Does anyone have any idea what I may be doing wrong or differently from the linked presentation? I understand my query does a CREATE whereas the presentation only updates an already loaded dataset, but I can't imagine that's the reason my heap usage is so high.
Thanks
My issue was that, although I was using apoc.periodic.iterate, I was still uploading that entire 1.3-million-node data set to the database as a single query parameter! The batching only applies to the inner statement, so the whole parameter list still had to be held on the heap at once.
Modifying my code to do the batching myself, as follows, fixed both the heap usage problem and the slowness:
const int batchSize = 2000;
for (int count = 0; count < nodeData.Count; count += batchSize)
{
    // Plain UNWIND over a batch-sized parameter; no apoc.periodic.iterate needed.
    string createNodesQueryString = @"
        UNWIND $nodes_in AS newNodeObj
        CREATE (n:MyNode)
        SET n = newNodeObj";
    int length = Math.Min(batchSize, nodeData.Count - count);
    var createNodesResCursor = await session.RunAsync(createNodesQueryString,
        new { nodes_in = nodeData.GetRange(count, length) });
    var createNodesResSummary = await createNodesResCursor.ConsumeAsync();
}
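For comparison, the same client-side batching might look like this with the official Neo4j Java driver (a sketch; assumes the 4.x org.neo4j.driver API, and the bolt URI and credentials are placeholders -- the original post uses the .NET driver):

import java.util.List;
import java.util.Map;
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import static org.neo4j.driver.Values.parameters;

public class BatchedNodeCreate {
    public static void createNodes(List<Map<String, Object>> nodeData) {
        final int batchSize = 2000;
        final String query = "UNWIND $nodes_in AS newNodeObj "
                + "CREATE (n:MyNode) SET n = newNodeObj";
        // Hypothetical URI and credentials; substitute your own.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                     AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            for (int offset = 0; offset < nodeData.size(); offset += batchSize) {
                int end = Math.min(offset + batchSize, nodeData.size());
                // One transaction per batch-sized slice of the parameter list.
                session.run(query, parameters("nodes_in", nodeData.subList(offset, end)))
                       .consume();
            }
        }
    }
}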

Move properties from relation to node in Neo4J on large datasets

I'm trying to move a property I've set up on a relationship in Neo4j to one of its member nodes, because I want to index that property, and as of version 2.2.5, which I am using, indexing on relationships is not possible.
However, when I try to move it via the Cypher command MATCH (k)-[l]->(m) SET m.key = l.key, my request also fails due to a lack of memory. I have no way to add more memory either.
Does anyone know of a good way to do this without resorting to lots of memory when dealing with large (ca. 20M) datasets?
If it's a one-time operation, I highly recommend writing an unmanaged extension.
It will be much faster than Cypher.
Here is an example:
Label startNodeLabel = DynamicLabel.label("StartNode");
Label endNodeLabel = DynamicLabel.label("EndNode");
RelationshipType relationshipType = DynamicRelationshipType.withName("RelationshipType");
String nodeProperty = "nodeProperty";
String relationshipProperty = "relationshipProperty";

try (Transaction tx = database.beginTx()) {
    final ResourceIterator<Node> nodes = database.findNodes(startNodeLabel);
    for (Node startNode : IteratorUtil.asCollection(nodes)) {
        if (startNode.hasRelationship(relationshipType, Direction.OUTGOING)) {
            final Iterable<Relationship> relationships =
                    startNode.getRelationships(relationshipType, Direction.OUTGOING);
            for (Relationship relationship : relationships) {
                final Node endNode = relationship.getOtherNode(startNode);
                if (endNode.hasLabel(endNodeLabel)) {
                    // Copy the property from the relationship onto the end node.
                    endNode.setProperty(nodeProperty,
                            relationship.getProperty(relationshipProperty));
                }
            }
        }
    }
    tx.success();
}
If you do not want to go for an unmanaged extension because you are only moving the properties as a one-time job, you can also write e.g. a shell script that calls curl in a loop with SKIP and LIMIT (a Java version of that loop is sketched after the query below). This has the advantage that you don't need to move the values but can copy them.
MATCH (k)-[l]->(m)
WITH l, m
SKIP 200000 LIMIT 100000
SET m.key = l.key
RETURN COUNT(*) AS nRows
Replace 200000 with the value of the loop variable.
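Driven from Java instead of a shell script, that loop might look like the sketch below (assumes the Neo4j 2.x transactional HTTP endpoint at /db/data/transaction/commit on localhost without authentication; the substring check for "row":[0] is a deliberately crude stop condition):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CopyPropertiesInBatches {
    public static void main(String[] args) throws Exception {
        final int batch = 100000;
        for (int skip = 0; ; skip += batch) {
            String statement = "MATCH (k)-[l]->(m) WITH l, m SKIP " + skip
                    + " LIMIT " + batch + " SET m.key = l.key RETURN COUNT(*) AS nRows";
            String payload = "{\"statements\":[{\"statement\":\""
                    + statement.replace("\"", "\\\"") + "\"}]}";

            // POST one batch to the transactional endpoint.
            HttpURLConnection con = (HttpURLConnection) new URL(
                    "http://localhost:7474/db/data/transaction/commit").openConnection();
            con.setRequestMethod("POST");
            con.setRequestProperty("Content-Type", "application/json");
            con.setDoOutput(true);
            try (OutputStream out = con.getOutputStream()) {
                out.write(payload.getBytes(StandardCharsets.UTF_8));
            }

            // Read the whole response; stop once a batch touches zero rows.
            StringBuilder response = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
                for (String line; (line = in.readLine()) != null; ) {
                    response.append(line);
                }
            }
            if (response.toString().contains("\"row\":[0]")) {
                break;
            }
        }
    }
}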
You can use LIMIT to restrict the query to a specific number of rows, and then repeat the query until no more rows are returned. That will also limit the amount of memory used.
For example, if you also wanted to remove the key property from the relationship at the same time (and you wanted to process 100K rows each time):
[EDITED]
MATCH (k)-[l]->(m)
WHERE HAS(l.key)
WITH l, m
LIMIT 100000
SET m.key = l.key
REMOVE l.key
RETURN COUNT(*) AS nRows;
This query will return an nRows value of 0 when you are done.

Too much time importing data and creating nodes

I have recently started with Neo4j and graph databases.
I am using this API for the persistence of my model. I have everything done and working, but my problem is efficiency.
So first of all, I will describe the scenario. I have a couple of XML documents which translate to some nodes and relations between them. As I have already read that this API does not yet support batch insertion, I am creating the nodes and relations one at a time.
This is the code I am using to create a node:
var newEntry = new EntryNode { hash = incremento++.ToString() };
var result = client.Cypher
.Merge("(entry:EntryNode {hash: {_hash} })")
.OnCreate()
.Set("entry = {newEntry}")
.WithParams(new
{
_hash = newEntry.hash,
newEntry
})
.Return(entry => new
{
EntryNode = entry.As<Node<EntryNode>>()
});
As it takes time to create all the nodes, I do not understand why the time to create one increases so fast. I have done some tests and am stuck at the point where creating an EntryNode takes 0.2 seconds to resolve, but once it has reached 500 nodes this has increased to ~2 seconds.
I have also created an index on EntryNode(hash) manually on the console before inserting any data, and tested both versions, with and without the index.
Am I doing something wrong? Is this time normal?
EDITED:
@Tatham
Thanks for the answer, it really helped. Now I am using the FOREACH statement in neo4jclient to create 1000 nodes in just 2 seconds.
On a related topic, now that I create the nodes this way, I also want to create relationships. This is the code I am trying right now, but I get some errors.
client.Cypher
.Match("(e:EntryNode)")
.Match("(p:EntryPointerNode)")
.ForEach("(n in {set} | " +
"FOREACH (e in (CASE WHEN e.hash = n.EntryHash THEN [e] END) " +
"FOREACH (p in pointers (CASE WHEN p.hash = n.PointerHash THEN [p] END) "+
"MERGE ((p)-[r:PointerToEntry]->(ee)) )))")
.WithParam("set", nodesSet)
.ExecuteWithoutResults();
What I want it to do is: given a list of pairs of strings, get the (unique) nodes whose "hash" property equals each string value, and create a relationship between them. I have tried a couple of variants of this query, but I cannot seem to find the solution.
Is this possible?
This approach is going to be very slow because you make a separate HTTP call to Neo4j for every node you insert. Each call is also its own transaction. Finally, you are returning the node back, which is probably a waste.
There are two options for doing this in batches instead.
From https://stackoverflow.com/a/21865110/211747, you can do something like this, where you pass in a set of objects and then FOREACH through them in Cypher. This means one, larger, HTTP call to Neo4j and then executing in a single transaction on the DB:
FOREACH (n in {set} | MERGE (c:Label {Id : n.Id}) SET c = n)
http://docs.neo4j.org/chunked/stable/query-foreach.html
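From embedded Java, the same FOREACH pattern might look like the following sketch (assumes a Neo4j 2.x org.neo4j.cypher.javacompat.ExecutionEngine over an open GraphDatabaseService named database; the data values are hypothetical):

// Build the parameter list client-side, then send one call that MERGEs
// all the nodes in a single transaction.
List<Map<String, Object>> set = Arrays.asList(
        Collections.<String, Object>singletonMap("Id", "a1"),
        Collections.<String, Object>singletonMap("Id", "a2"));
ExecutionEngine engine = new ExecutionEngine(database);
engine.execute("FOREACH (n in {set} | MERGE (c:Label {Id : n.Id}) SET c = n)",
        Collections.<String, Object>singletonMap("set", set));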
The other option, coming soon, is that you will be able to write something like this in Cypher:
LOAD CSV WITH HEADERS FROM 'file://c:/temp/input.csv' AS n
MERGE (c:Label { Id : n.Id })
SET c = n
https://github.com/davidegrohmann/neo4j/blob/2.1-fix-resource-failure-load-csv/community/cypher/cypher/src/test/scala/org/neo4j/cypher/LoadCsvAcceptanceTest.scala
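As for the relationship query in the edited question: the nested FOREACH/CASE shown there is not valid Cypher. One way the intent could be expressed (a sketch, not from the original thread; assumes Cypher 2.x UNWIND and the labels/properties from the snippet):

UNWIND {set} AS n
MATCH (e:EntryNode {hash: n.EntryHash})
MATCH (p:EntryPointerNode {hash: n.PointerHash})
MERGE (p)-[r:PointerToEntry]->(e)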

ExecutionResult result.hasNext() taking very long time to return

I am fairly new to Neo4j. I am running into peculiar behavior when iterating over an ExecutionResult result set. In the following code snippet, the last res.hasNext() takes close to 50 seconds to return on the last iteration.
The cypher query I am using is
start p=node(*) where (p.`process-workflowID`? = '" + Id + "') and (p.type? = 'process') return ID(p);
I am using neo4j-community-1.8.1 and java 1.6.0_41, testing against a DB with 226710 nodes.
Does anyone have any clue as to why this is happening? I assume the query is done when engine.execute(query) returns, but if this isn't the case, would appreciate someone shedding some light on when the query actually gets completed. Thank you in advance.
ExecutionResult result = engine.execute(query);
Iterator<Map<String, Object>> res = result.iterator();
while (res.hasNext()) {
    Map<String, Object> row = res.next();
    for (Entry<String, Object> column : row.entrySet()) {
        ...
    }
    long t1 = System.currentTimeMillis();
    res.hasNext(); // <--------------------------- statement in question
    long t2 = System.currentTimeMillis();
    System.out.println(t2 - t1);
}
Queries are performed lazily, while iterating over the result set, so each call to hasNext()/next() involves some operation on the graph. Nevertheless, a pause of 50 seconds on a graph of ~250k nodes indicates that you are doing something fundamentally wrong.
You might look into:
Your query is very inefficient; you should make use of indexes. The easiest way is to set up auto-indexing for the properties you're searching on, see http://docs.neo4j.org/chunked/stable/auto-indexing.html (a configuration sketch follows this list). Please note that pre-existing data does not get reindexed!
After rebuilding the database, use the following Cypher statement instead. Note that the full index lookup string is passed as a single parameter; embedding {id} inside a quoted Lucene string would not get substituted:
Map<String, Object> params = Collections.singletonMap("query", "process-workflowID:\"" + Id + "\" AND type:process");
executionEngine.execute("start p=node:node_auto_index({query}) return ID(p)", params);
I'm not sure if "process-workflowID" needs additional quoting in Lucene syntax.
Make sure that you're not suffering from GC/memory issues, using e.g. jvisualvm.
Set up mapped memory according to http://docs.neo4j.org/chunked/stable/configuration-caches.html and run your query more than once to benefit from warmed-up caches.
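The auto-indexing setup from the first point might look like this for an embedded database (a sketch; assumes the 1.8-era GraphDatabaseFactory API, and the store path is a placeholder):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.graphdb.factory.GraphDatabaseSettings;

// Enable node auto-indexing for the two properties the query filters on.
GraphDatabaseService database = new GraphDatabaseFactory()
        .newEmbeddedDatabaseBuilder("/path/to/db")   // hypothetical store path
        .setConfig(GraphDatabaseSettings.node_auto_indexing, "true")
        .setConfig(GraphDatabaseSettings.node_keys_indexable, "process-workflowID,type")
        .newGraphDatabase();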

How to speed up parsing Neo4j ExecutionResult set?

I am running a two part Neo4j search which is performing well. However, the actual parsing of the ExecutionResult set is taking longer than the Cypher query by a factor of 8 or 10. I'm looping through the ExecutionResult map as follows:
result = engine.execute("START facility=node({id}), service=node({serviceIds}) WHERE facility-->service RETURN facility.name as facilityName, service.name as serviceName", cypherParams);
for (Map<String, Object> row : result) {
    sb.append((String) row.get("facilityName")).append(" : ")
      .append((String) row.get("serviceName")).append("<BR/>");
}
Any suggestions for speeding this up? Thanks
Do you need access to entities or is it sufficient to work with nodes (and thus use the core API)? In the latter case, you could use the traversal API which is faster than Cypher.
I'm not sure what your use case is, but depending on the scenario, you could probably do something like this:
for (final Path position : Traversal.description().depthFirst()
        .relationships(YOUR_RELATION_TYPE, Direction.INCOMING)
        .uniqueness(Uniqueness.NODE_RECENT)
        .evaluator(Evaluators.toDepth(1))
        .traverse(facilityNode, serviceNode)) {
    // do something like e.g. position.endNode().getProperty("name")
}
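And if all that's needed is each facility's directly connected services, the core API alone may already be enough (a sketch; facilityNode, sb, and YOUR_RELATION_TYPE are carried over from the snippets above):

// Walk the facility's outgoing relationships directly; no Cypher parsing,
// no traversal framework.
for (Relationship rel : facilityNode.getRelationships(YOUR_RELATION_TYPE, Direction.OUTGOING)) {
    Node service = rel.getEndNode();
    sb.append((String) facilityNode.getProperty("name")).append(" : ")
      .append((String) service.getProperty("name")).append("<BR/>");
}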
