I'm not sure if this has been answered already but here goes.
I have a Neo4j DB already populated with, let's say, 100k nodes labelled as Person.
I want to import activities that these persons have created and label them Activity.
I have a csv of about 10 million activities which I would like to import into Neo4j.
The code below is what I use to generate Cypher statements that look up the user associated with an activity, create the activity node, and establish a relationship between the user and the activity.
The method that handles this is below:
public void addActivityToGraph(List<String> activities) {
    Map<String, Object> params = new HashMap<>();
    for (String r : activities) {
        String[] rd = r.split(";");
        log.info("Row count: " + (rowCount + 1) + "| " + r);
        log.info("Row count: " + (rowCount + 1)
                + "| Array Length: " + rd.length);
        Map<String, Object> props = new HashMap<>();
        props.put("activityid", Long.parseLong(rd[0]));
        props.put("objecttype", Integer.parseInt(rd[1]));
        props.put("objectid", Integer.parseInt(rd[2]));
        props.put("containertype", Integer.parseInt(rd[3]));
        props.put("containerid", Integer.parseInt(rd[4]));
        props.put("activitytype", Integer.parseInt(rd[5]));
        props.put("creationdate", Long.parseLong(rd[7]));
        params.put("props", props);
        params.put("userid", Integer.parseInt(rd[6]));
        try (Transaction tx = gd.beginTx()) {
            // engine is RestCypherQueryEngine
            engine.query("match (p:Person{userid:{userid}}) create unique (p)-[:created]->(a:Activity{props})", params);
            params.clear();
            tx.success();
        }
    }
}
While this works, I'm sure I am not using the right mix of tools, as this process takes a whole day to finish. There has to be an easier way. I see a lot of documentation on the batch REST API, but I've not seen any covering the case I have here (find an already existing user, then create a relationship between that user and a new activity).
I appreciate all the help I can get here.
Thanks.
There are many ways to do batch import into Neo4j.
If you're using the 2.1 milestone release, there's a LOAD CSV option in Cypher.
If you actually already have structured CSV, I'd suggest not writing a bunch of Java code to do it. Explore the available tools, and go from there.
Using the new Cypher option, it might look something like this. The query can be run in the neo4j-shell, or via Java if you want.
LOAD CSV WITH HEADERS FROM "file:///tmp/myPeople.csv" AS csvLine
MERGE (p:Person { userid: csvLine.userid})
MERGE (a:Activity { someProperty: csvLine.someProperty })
CREATE UNIQUE (p)-[:created]->(a)
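Adapted to the semicolon-delimited activities file from the question, it might look like the sketch below. The column order follows the question's code, and USING PERIODIC COMMIT / FIELDTERMINATOR are assumptions about what the 2.1 release supports; only a couple of the activity properties are shown:

```cypher
USING PERIODIC COMMIT 1000
LOAD CSV FROM "file:///tmp/activities.csv" AS line FIELDTERMINATOR ';'
MATCH (p:Person { userid: toInt(line[6]) })
CREATE (p)-[:created]->(a:Activity { activityid: toInt(line[0]), creationdate: toInt(line[7]) })
```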
There are no transactions with the rest-query-engine over the wire. You could use batching, but I think it is more sensible to use something like my neo4j-shell-tools to load your CSV file.
Install them as outlined here, then use:
import-cypher -i activities.csv MATCH (p:Person{userid:{userid}}) CREATE (p)-[:created]->(a:Activity{activityid:{activityid}, ....})
Make sure to have indexes/constraints for your :Person(userid) and :Activity(activityid) to make matching and merging fast.
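For example, with 2.0-style syntax (the label and property names follow the question; only use the uniqueness constraints if the values really are unique):

```cypher
CREATE CONSTRAINT ON (p:Person) ASSERT p.userid IS UNIQUE;
CREATE CONSTRAINT ON (a:Activity) ASSERT a.activityid IS UNIQUE;
// or a plain index if uniqueness is not guaranteed:
CREATE INDEX ON :Person(userid);
```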
I am using the py2neo.ogm api to construct queries of my IssueOGM class based on its relationship to another class.
I can see why this fails:
>>> list(IssueOGM.select(graph).where(
... "_ -[:HAS_TAG]- (t:TagOGM {tag: 'critical'})"))
Traceback (most recent call last):
...
py2neo.database.status.CypherSyntaxError: Variable `t` not defined (line 1, column 42 (offset: 41))
"MATCH (_:IssueOGM) WHERE _ -[:HAS_TAG]- (t:TagOGM {tag: 'critical'}) RETURN _"
Is there a way using the OGM api to create a filter that is interpreted as this?
"MATCH (_:IssueOGM) -[:HAS_TAG]- (t:TagOGM {tag: 'critical'}) RETURN _"
Like an ORM, the OGM is really good for quickly storing and/or retrieving nodes from your graph, and for attaching special methods and so forth to make each node 'work' nicely in your application. In this instance, you could use the RelatedFrom class on TagOGM to list all the issues tagged with a particular tag. However, this approach can sometimes lead to making lots of inadvertent DB calls without realising it (especially in a big application).
Often for cases like this (where you're looking for a pattern rather than a specific node), I'd recommend just writing a cypher query to get the job done. py2neo.ogm actually makes this remarkably simple, by allowing you to store it as a class method of the GraphObject. In your example, something like the following should work. Writing similar queries in the future will also allow you to search based on much more complex criteria and leverage the functionality of neo4j and cypher to make really complex queries quickly in a single transaction (rather than going back and forth to the db as you manipulate an OGM object).
from py2neo.ogm import GraphObject, Property

class TagOGM(GraphObject):
    name = Property()

class IssueOGM(GraphObject):
    name = Property()
    time = Property()
    description = Property()

    @classmethod
    def select_by_tag(cls, tag_name):
        '''
        Returns an OGM instance for every issue tagged a certain way.
        '''
        q = 'MATCH (t:TagOGM { name: {tag_name} })<-[:HAS_TAG]-(i:IssueOGM) RETURN i'
        return [
            cls.wrap(row['i'])
            for row in graph.run(q, {'tag_name': tag_name}).data()
        ]
I have a use case where I create a new relationship every time a user sees a photo like this:
var dateParams = new { Date = DateTime.Now.ToString() };

graphClient.Cypher
    .Match("(user:User), (photo:Photo)")
    .Where((UserEntity user) => user.Id == userId)
    .AndWhere((PhotoEntity photo) => photo.Id == photoId)
    .CreateUnique("user-[:USER_SEEN_PHOTO {params}]->photo")
    .WithParam("params", dateParams)
    .ExecuteWithoutResults();
With many concurrent users this will happen very often, so I need to be able to queue a number of write operations and execute them together at once. Unfortunately I haven't been able to find good info about how to do this in the most efficient way with Neo4jClient, so all suggestions would be very appreciated :)
--- UPDATE ---
So I tried different combinations, but I still haven't found anything that works. The query below gives me a "PatternException: Unbound pattern!"?
var query = graphClient.Cypher;

for (int i = 0; i < seenPhotosList.Count; i++)
{
    query = query.CreateUnique("(user" + i + ":User {Id : {userId" + i + "} })-[:USER_SEEN_PHOTO]->(photo" + i + ":Photo {Id : {photoId" + i + "} })")
        .WithParam("userId" + i, seenPhotosList[i].UserId)
        .WithParam("photoId" + i, seenPhotosList[i].PhotoId);
}

query.ExecuteWithoutResults();
I also tried changing CreateUnique to Merge; that query executes without exception, but it creates new nodes instead of connecting the existing ones.
var query = graphClient.Cypher;

for (int i = 0; i < seenPhotosList.Count; i++)
{
    query = query.Merge("(user" + i + ":User {Id : {userId" + i + "} })-[:USER_SEEN_PHOTO]->(photo" + i + ":Photo {Id : {photoId" + i + "} })")
        .WithParam("userId" + i, seenPhotosList[i].UserId)
        .WithParam("photoId" + i, seenPhotosList[i].PhotoId);
}

query.ExecuteWithoutResults();
I set up 5 types of relationships using batch insert. It runs extremely fast, but I'm not sure how you'd manage the interrupt in a multiuser environment. You need to know the node IDs in advance and then create a string for the API request that looks like this ...
[{"method":"POST","to":"/node/222/relationships","id":222,"body":{"to":"26045","type":"mother"}},
{"method":"POST","to":"/node/291/relationships","id":291,"body":{"to":"26046","type":"mother"}},
{"method":"POST","to":"/node/389/relationships","id":389,"body":{"to":"26047","type":"mother"}},
{"method":"POST","to":"/node/1031/relationships","id":1031,"body":{"to":"1030","type":"wife"}},
{"method":"POST","to":"/node/1030/relationships","id":1030,"body":{"to":"1031","type":"husband"}},
{"method":"POST","to":"/node/1034/relationships","id":1034,"body":{"to":"26841","type":"father"}},
{"method":"POST","to":"/node/34980/relationships","id":34980,"body":{"to":"26042","type":"child"}}]
I also broke this down into reasonably sized iterative requests to avoid memory challenges, but the iterations run very fast for setting up the strings needed. Getting node IDs also required iteration, because Neo4j limits the number of nodes returned to 1000. This is a shortcoming of Neo4j that was designed around visualization concerns (who can study a picture with 10000 nodes?) rather than coding issues such as those we are discussing.
I do not believe Neo4j/cypher has any built-in way of performing what you are asking for. What you could do is build something that does this for you with a queueing system. Here's a blog post on doing scalable writes in Ruby which is something that you could implement in your language to handle doing batch inserts/updates.
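The queueing idea can be sketched without any Neo4j-specific code. BatchQueue below is a hypothetical helper, not part of any client library; the flushed list stands in for "execute one transaction per batch" against the database:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: accumulates write operations and releases them
// in fixed-size batches, so many tiny writes become a few large
// transactions instead of one HTTP call per write.
public class BatchQueue {
    private final List<String> pending = new ArrayList<>();
    private final int batchSize;
    // In real code each entry here would be sent as one transaction.
    public final List<List<String>> flushed = new ArrayList<>();

    public BatchQueue(int batchSize) {
        this.batchSize = batchSize;
    }

    public void add(String operation) {
        pending.add(operation);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    public void flush() {
        if (pending.isEmpty()) {
            return;
        }
        flushed.add(new ArrayList<>(pending));
        pending.clear();
    }
}
```

The final flush() call matters: anything still pending when the queue drains must be written out even though it never filled a whole batch.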
I have recently started with Neo4j and graph databases.
I am using this API to handle the persistence of my model. I have everything done and working, but my problem is related to efficiency.
First of all, the scenario: I have a couple of XML documents which translate to some nodes and relationships between them. As I have read that this API still does not support batch insertion, I am creating the nodes and relationships one at a time.
This is the code I am using to create a node:
var newEntry = new EntryNode { hash = incremento++.ToString() };

var result = client.Cypher
    .Merge("(entry:EntryNode {hash: {_hash} })")
    .OnCreate()
    .Set("entry = {newEntry}")
    .WithParams(new
    {
        _hash = newEntry.hash,
        newEntry
    })
    .Return(entry => new
    {
        EntryNode = entry.As<Node<EntryNode>>()
    });
I understand that it takes time to create all the nodes, but I do not understand why the time to create a single one increases so fast. I have run some tests and am stuck at the point where creating an EntryNode takes 0.2 seconds to resolve, but once I have inserted about 500 it has increased to ~2 seconds.
I have also created an index on :EntryNode(hash) manually in the console before inserting any data, and ran tests with both versions, with and without the index.
Am I doing something wrong? Is this time normal?
EDITED:
@Tatham
Thanks for the answer, it really helped. Now I am using the FOREACH statement in Neo4jClient to create 1000 nodes in just 2 seconds.
On a related topic, now that I create the nodes this way, I also want to create relationships. This is the code I am trying right now, but I get some errors.
client.Cypher
    .Match("(e:EntryNode)")
    .Match("(p:EntryPointerNode)")
    .ForEach("(n in {set} | " +
        "FOREACH (e in (CASE WHEN e.hash = n.EntryHash THEN [e] END) " +
        "FOREACH (p in pointers (CASE WHEN p.hash = n.PointerHash THEN [p] END) " +
        "MERGE ((p)-[r:PointerToEntry]->(ee)) )))")
    .WithParam("set", nodesSet)
    .ExecuteWithoutResults();
What I want it to do is: given a list of pairs of strings, match the (unique) nodes whose "hash" property equals each string, and create a relationship between them. I have tried a couple of variants of this query, but I can't seem to find the solution.
Is this possible?
This approach is going to be very slow because you do a separate HTTP call to Neo4j for every node you are inserting. Each call is then a transaction. Finally, you are also returning the node back, which is probably a waste.
There are two options for doing this in batches instead.
From https://stackoverflow.com/a/21865110/211747, you can do something like this, where you pass in a set of objects and then FOREACH through them in Cypher. This means one, larger, HTTP call to Neo4j and then executing in a single transaction on the DB:
FOREACH (n in {set} | MERGE (c:Label {Id : n.Id}) SET c = n)
http://docs.neo4j.org/chunked/stable/query-foreach.html
The other option, coming soon, is that you will be able to write something like this in Cypher:
LOAD CSV WITH HEADERS FROM 'file://c:/temp/input.csv' AS n
MERGE (c:Label { Id : n.Id })
SET c = n
https://github.com/davidegrohmann/neo4j/blob/2.1-fix-resource-failure-load-csv/community/cypher/cypher/src/test/scala/org/neo4j/cypher/LoadCsvAcceptanceTest.scala
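For the relationship query in the edit: a FOREACH cannot contain a MATCH, which is why the CASE workarounds get awkward. On 2.1+, an UNWIND-based statement is the usual alternative. A sketch, assuming nodesSet is a list of { EntryHash, PointerHash } pairs as in the question:

```cypher
UNWIND {set} AS n
MATCH (e:EntryNode { hash: n.EntryHash })
MATCH (p:EntryPointerNode { hash: n.PointerHash })
MERGE (p)-[:PointerToEntry]->(e)
```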
I am fairly new to Neo4j. I am running into peculiar behaviour when trying to iterate over an ExecutionResult result set. In the following code snippet, the last res.hasNext() takes close to 50 seconds to return on the last iteration.
The cypher query I am using is
start p=node(*) where (p.`process-workflowID`? = '" + Id + "') and (p.type? = 'process') return ID(p);
I am using neo4j-community-1.8.1 and java 1.6.0_41, testing against a DB with 226710 nodes.
Does anyone have any clue as to why this is happening? I assume the query is done when engine.execute(query) returns, but if this isn't the case, would appreciate someone shedding some light on when the query actually gets completed. Thank you in advance.
ExecutionResult result = engine.execute(query);
Iterator<Map<String, Object>> res = result.iterator();

while (res.hasNext()) {
    Map<String, Object> row = res.next();
    for (Entry<String, Object> column : row.entrySet()) {
        ...
    }
    long t1 = System.currentTimeMillis();
    res.hasNext(); // <--------------------------- statement in question
    long t2 = System.currentTimeMillis();
    System.out.println(t2 - t1);
}
Queries are performed while iterating over the result set, so each call to hasNext/next involves some operation on the graph. Nevertheless, a pause of 50 seconds with a graph of ~250k nodes indicates that you are doing something basically wrong.
You might look into:
Your query is very inefficient; you should make use of indexes. The easiest way is to set up auto-indexing for the properties you're searching for, see http://docs.neo4j.org/chunked/stable/auto-indexing.html. Please note that pre-existing data does not get reindexed!
After rebuilding the database use the following cypher statement instead:
Map<String, Object> params = Collections.singletonMap("id", Id);
executionEngine.execute("start p=node:node_auto_index('process-workflowID:{id} type:process') return ID(p)", params);
I'm not sure if "process-workflowID" needs additional quoting in lucene syntax.
Make sure that you're not suffering from GC/memory issues, using e.g. jvisualvm.
Set up mapped memory according to http://docs.neo4j.org/chunked/stable/configuration-caches.html and run your query more than once to benefit from warmed-up caches.
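For the auto-indexing suggestion, a minimal neo4j.properties sketch (the property names are taken from the query in the question):

```
node_auto_indexing=true
node_keys_indexable=process-workflowID,type
```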
I am running a two part Neo4j search which is performing well. However, the actual parsing of the ExecutionResult set is taking longer than the Cypher query by a factor of 8 or 10. I'm looping through the ExecutionResult map as follows:
result = engine.execute("START facility=node({id}), service=node({serviceIds}) WHERE facility-->service RETURN facility.name as facilityName, service.name as serviceName", cypherParams);
for (Map<String, Object> row : result)
{
    sb.append((String) row.get("facilityName") + " : " + (String) row.get("serviceName") + "<BR/>");
}
Any suggestions for speeding this up? Thanks
Do you need access to entities, or is it sufficient to work with nodes (and thus use the core API)? In the latter case, you could use the traversal API, which is faster than Cypher.
I'm not sure what your use case is, but depending on the scenario, you could probably do something like this:
for (final Path position : Traversal.description().depthFirst()
        .relationships(YOUR_RELATION_TYPE, Direction.INCOMING)
        .uniqueness(Uniqueness.NODE_RECENT)
        .evaluator(Evaluators.toDepth(1))
        .traverse(facilityNode, serviceNode)) {
    // do something, e.g. position.endNode().getProperty("name")
}