Too much time importing data and creating nodes - neo4jclient

I have recently started with neo4j and graph databases.
I am using this API to persist my model. I have everything done and working, but my problem is efficiency.
First of all, the scenario: I have a couple of XML documents which translate to some nodes and the relations between them. Since I have read that this API does not yet support batch insertion, I am creating the nodes and relations one at a time.
This is the code I am using to create a node:
var newEntry = new EntryNode { hash = incremento++.ToString() };
var result = client.Cypher
    .Merge("(entry:EntryNode {hash: {_hash} })")
    .OnCreate()
    .Set("entry = {newEntry}")
    .WithParams(new
    {
        _hash = newEntry.hash,
        newEntry
    })
    .Return(entry => new
    {
        EntryNode = entry.As<Node<EntryNode>>()
    })
    .Results; // executes the query; without this the statement is only built, never sent
I understand it takes time to create all the nodes, but I do not understand why the time to create a single one increases so fast. In my tests, creating an EntryNode starts at about 0.2 seconds per statement, but once around 500 nodes have been created it has grown to ~2 seconds.
I have also created an index on EntryNode(hash) manually in the console before inserting any data, and have tested both versions, with and without the index.
Am I doing something wrong? Is this time normal?
EDIT:
@Tatham
Thanks for the answer, it really helped. Now I am using the FOREACH statement through neo4jclient and can create 1000 nodes in just 2 seconds.
On a related topic, now that I create the nodes this way, I also want to create relationships. This is the code I am trying right now, but I get some errors:
client.Cypher
    .Match("(e:EntryNode)")
    .Match("(p:EntryPointerNode)")
    .ForEach("(n in {set} | " +
             "FOREACH (e in (CASE WHEN e.hash = n.EntryHash THEN [e] END) " +
             "FOREACH (p in pointers (CASE WHEN p.hash = n.PointerHash THEN [p] END) " +
             "MERGE ((p)-[r:PointerToEntry]->(ee)) )))")
    .WithParam("set", nodesSet)
    .ExecuteWithoutResults();
What I want it to do is: given a list of pairs of strings, look up the (unique) nodes whose "hash" property matches each string, and create a relationship between them. I have tried a couple of variants of this query, but I can't seem to find the solution.
Is this possible?

This approach is going to be very slow because you do a separate HTTP call to Neo4j for every node you are inserting. Each call is then a transaction. Finally, you are also returning the node back, which is probably a waste.
There are two options for doing this in batches instead.
From https://stackoverflow.com/a/21865110/211747, you can do something like this, where you pass in a set of objects and then FOREACH through them in Cypher. This means one, larger, HTTP call to Neo4j and then executing in a single transaction on the DB:
FOREACH (n in {set} | MERGE (c:Label {Id : n.Id}) SET c = n)
http://docs.neo4j.org/chunked/stable/query-foreach.html
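Applied to the EntryNode case from the question, the batched statement would look roughly like this. This is a sketch along the lines of the linked answer, not code from it; it assumes {set} is the parameter holding the collection of new entries, each carrying the node's properties including hash:
// Sketch: one HTTP call, one transaction, merging on the hash property.
// {set} is assumed to be a collection of objects with the EntryNode properties.
FOREACH (n IN {set} |
  MERGE (e:EntryNode { hash: n.hash })
  SET e = n)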
The other option, coming soon, is that you will be able to write something like this in Cypher:
LOAD CSV WITH HEADERS FROM 'file://c:/temp/input.csv' AS n
MERGE (c:Label { Id : n.Id })
SET c = n
https://github.com/davidegrohmann/neo4j/blob/2.1-fix-resource-failure-load-csv/community/cypher/cypher/src/test/scala/org/neo4j/cypher/LoadCsvAcceptanceTest.scala

Related

Can I load in nodes and relationships from a csv file using 1 cypher command?

I have 2 csv files which I am trying to load into a Neo4j database using Cypher: drivers.csv, which holds every Formula 1 driver, and lap_times.csv, which stores every lap ever raced in F1.
I have managed to load in all of the nodes, although the lap times file is very large so it took quite a long time! I then tried to add the relationships afterwards, but there are so many that need to be added that I gave up waiting (it was taking multiple days and still had not fully loaded).
I'm pretty sure there is a way to load in the nodes and relationships at the same time, which would allow me to use periodic commit for the relationships, which I cannot do right now. Essentially I just need to combine the 2 commands into one, but after some attempts I can't seem to work out how to do it.
// load in the lap_times.csv, changing the variable names - about half a million nodes (takes 3-4 days)
USING PERIODIC COMMIT 25000
LOAD CSV WITH HEADERS from 'file:///lap_times.csv'
AS row
MERGE (lt:lapTimes {raceId: row.raceId, driverId: row.driverId, lap: row.lap, position: row.position, time: row.time, milliseconds: row.milliseconds})
RETURN lt;
// add a relationship between laptimes, drivers and races - takes 3-4 days
MATCH (lt:lapTimes),(d:Driver),(r:race)
WHERE lt.raceId = r.raceId AND lt.driverId = d.driverId
MERGE (d)-[rel8:LAPPING_AT]->(lt)
MERGE (r)-[rel9:TIMED_LAP]->(lt)
RETURN type(rel8), type(rel9)
Thanks in advance for any help!
You should review the documentation for indexes here:
https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/
Basically, indexes, once created, allow quick lookups of nodes of a given label, for the given property or properties. If you DON'T have an index and you do a MATCH or MERGE of a node, then for every row of that MATCH or MERGE, it has to do a label scan of all nodes of the given label and check all of their properties to find the nodes, and that becomes very expensive, especially when loading CSVs because those operations are likely happening for each row in the CSV.
For your :lapTimes nodes (though we would recommend you use singular labels in most cases), if there are none of them in your graph to start with, then a CREATE instead of a MERGE is fine. You may want a composite index on :lapTimes(raceId, driverId, lap), since that should uniquely identify the node, if you need to look it up later. Using CREATE instead of MERGE here should process much much faster.
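For example, that composite index could be created like this (a sketch; composite indexes require Neo4j 3.2+, and the label casing matches your existing queries):
// Composite index so a :lapTimes node can be looked up by its identifying properties.
CREATE INDEX ON :lapTimes(raceId, driverId, lap)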
Your second query should be MATCHing on :lapTimes nodes (label scan), and from each doing an index lookup on the :race and :driver nodes, so indexes are key here for performance.
You need indexes on: :race(raceId) and :Driver(driverId).
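For example, using the same CREATE INDEX syntax shown further below and keeping the label casing from your current queries:
CREATE INDEX ON :race(raceId);
CREATE INDEX ON :Driver(driverId);
With those indexes in place, the relationship query becomes: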
MATCH (lt:lapTimes)
WITH lt, lt.raceId as raceId, lt.driverId as driverId
MATCH (d:Driver), (r:race)
WHERE r.raceId = raceId AND d.driverId = driverId
MERGE (d)-[:LAPPING_AT]->(lt)
MERGE (r)-[:TIMED_LAP]->(lt)
You might consider CREATE instead of MERGE for the relationships, if you know there are no duplicate entries.
I removed your RETURN because returning the types isn't useful information.
Also, use consistent casing for your node labels, and make sure the labels in your graph use the same casing as the indexes you create.
Also, you would probably want to batch these changes instead of trying to process them all at once.
If you install APOC Procedures you can make use of apoc.periodic.iterate(), which can be used to batch changes, which will be faster and easier on your heap. You will still need indexes first.
CALL apoc.periodic.iterate("
  MATCH (lt:lapTimes)
  WITH lt, lt.raceId as raceId, lt.driverId as driverId
  MATCH (d:Driver), (r:race)
  WHERE r.raceId = raceId AND d.driverId = driverId
  RETURN lt, d, r",
  "MERGE (d)-[:LAPPING_AT]->(lt)
   MERGE (r)-[:TIMED_LAP]->(lt)", {}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
Single CSV load
If you want to handle everything all at once in a single CSV load, you can do that, but again you will need indexes first. Here's what you'll need at a minimum:
CREATE INDEX ON :Driver(driverId);
CREATE INDEX ON :Race(raceId);
After those are created, you can use this, assuming you are starting from scratch (I fixed the case of your labels and made them singular):
USING PERIODIC COMMIT 25000
LOAD CSV WITH HEADERS from 'file:///lap_times.csv' AS row
MERGE (d:Driver {driverId:row.driverId})
MERGE (r:Race {raceId:row.raceId})
CREATE (lt:LapTime {raceId: row.raceId, driverId: row.driverId, lap: row.lap, position: row.position, time: row.time, milliseconds: row.milliseconds})
CREATE (d)-[:LAPPING_AT]->(lt)
CREATE (r)-[:TIMED_LAP]->(lt)

Merging a set of nodes onto one (only in query)

I am currently investigating how to model a bitemporal graph in neo4j. Unfortunately no one seems to have publicly undertaken this before.
One particular thing I am looking at is whether I can store in a new node only those values that have changed and then express a query that would merge all those values ordered by a given timestamp:
This creates the data I am playing with:
CREATE (:P1 {id: '1'})<-[:EXPANDS {date:5200, recorded:5100}]-(:P1Data {name:'Joe', wage: 3000})
// New data, recorded 2014-10-1 for 2015-1-1
MATCH (p:P1 {id: '1'}) CREATE (:P1Data { wage:3100 })-[:EXPANDS { date:5479, recorded: 5387}]->(p)
Now, I can get a history for a given point in time, e.g. like this:
MATCH (:P1 { id: '1' })<-[x:EXPANDS]-(d:P1Data)
WHERE x.recorded < 6000
WITH {date: x.date, data:d} as data
RETURN data
ORDER BY data.date DESC
What I would like to achieve is to merge the name and wage values such that I get a whole view of the data at a given point in time. The answer may also be that this is not really possible.
(PS: I say "only in query" because I found a refactor function in APOC which does merge nodes, but that procedure actually merges and persists the node, whereas I just want to query it.)
As with most things, you can do it using REDUCE like so:
MATCH (:P1 { id: '1' })<-[x:EXPANDS]-(d:P1Data)
WITH x.date AS date, d AS data
ORDER BY date
WITH COLLECT(data) AS datas
WITH REDUCE(s = {}, y IN datas |
       {name: COALESCE(y.name, s.name),
        wage: COALESCE(y.wage, s.wage)}) AS most_recent_fields
RETURN most_recent_fields.name AS name, most_recent_fields.wage AS wage
You can do it in descending order instead (swap s and y inside the COALESCE statements if so), but there isn't really a way to shortcut processing the entire set of results from your queried time back to the start.
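For reference, the descending variant would look roughly like this (my sketch of the same query, with the ORDER BY reversed and s and y swapped inside the COALESCE calls):
// Most recent record is processed first, so keep the accumulated value if already set.
MATCH (:P1 { id: '1' })<-[x:EXPANDS]-(d:P1Data)
WITH x.date AS date, d AS data
ORDER BY date DESC
WITH COLLECT(data) AS datas
WITH REDUCE(s = {}, y IN datas |
       {name: COALESCE(s.name, y.name),
        wage: COALESCE(s.wage, y.wage)}) AS most_recent_fields
RETURN most_recent_fields.name AS name, most_recent_fields.wage AS wage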
UPDATE: This will, of course, generate a Map and not a Node, but if you only want the properties and don't want to create a permanent record, a Map is actually better suited to your needs.
EXTENDED: If you don't want to specify which keys to use, you can do it without REDUCE like this instead:
MATCH (:P1 { id: '1' })<-[x:EXPANDS]-(d:P1Data)
WITH x.date AS date, d AS data
ORDER BY date
WITH COLLECT(data) AS datas
CREATE (t:Temp)
FOREACH(data IN datas|
SET t += data)
DELETE t
RETURN t
This approach does create a node, but if you DELETE it right before you RETURN it, it won't persist at all. += ensures that pre-existing properties aren't removed, only overwritten if the data node has existing values.
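As a minimal illustration of that += behaviour, using the sample values from the question (this snippet is mine, not part of the original answer, and assumes a Cypher version that supports SET += with a map literal):
// name is kept, wage is overwritten: t ends up as {name: 'Joe', wage: 3100}
CREATE (t:Temp {name: 'Joe', wage: 3000})
SET t += {wage: 3100}
RETURN t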

connecting the nodes with relationships in py2neo

I have the following python code to make a graph in neo4j. I am using py2neo version 2.0.3.
import json
from py2neo import neo4j, Node, Relationship, Graph

graph = neo4j.Graph("http://localhost:7474/db/data/")

with open("example.json") as f:
    for line in f:
        while True:
            try:
                file = json.loads(line)
                break
            except ValueError:
                # Not yet a complete JSON value
                line += next(f)

        # Now creating the node and relationships
        news, = graph.create(Node("Mainstream_News", id=unicode(file["_id"]), entry_url=unicode(file["entry_url"]),
                                  title=unicode(file["title"])))  # Comma unpacks length-1 tuple.
        authors, = graph.create(
            Node("Authors", auth_name=unicode(file["auth_name"]), auth_url=unicode(file["auth_url"]),
                 auth_eml=unicode(file["auth_eml"])))
        graph.create(Relationship(news, "hasAuthor", authors))
I can create a graph with Mainstream_News and Authors nodes and a 'hasAuthor' relation between them. My problem is that this gives me one Mainstream_News node per Authors node, but in reality one author node has more than one Mainstream_News. I would like to use the auth_name property of the Authors nodes as an index to connect them with the Mainstream_News nodes. Any suggestions would be great.
You are creating a new Authors node each time through your loop, even if an Author node (with the same properties) already exists.
First, I think you should create uniqueness constraints on Authors(auth_name) and Mainstream_News(id), to enforce what seem to be your requirements. This only needs to be done once. A uniqueness constraint also creates an index for you automatically, which is a bonus.
graph.schema.create_uniqueness_constraint("Authors", "auth_name")
graph.schema.create_uniqueness_constraint("Mainstream_News", "id")
But you will probably have to empty out your DB first (at least of all Authors and Mainstream_News nodes and their relationships), since I presume it currently has a lot of duplicate nodes.
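One way to clear them out is a sketch like this (it assumes Neo4j 2.3+ for DETACH DELETE; on older versions, delete the relationships first and then the nodes):
// Remove all Authors and Mainstream_News nodes together with their relationships.
MATCH (n)
WHERE n:Authors OR n:Mainstream_News
DETACH DELETE n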
Then, you can use the merge_one and create_unique APIs to prevent duplicate nodes and relationships:
news = graph.merge_one("Mainstream_News", "id", unicode(file["_id"]))
news.properties["entry_url"] = unicode(file["entry_url"])
news.properties["title"] = unicode(file["title"])
news.push()  # push the locally-set properties to the server
authors = graph.merge_one("Authors", "auth_name", unicode(file["auth_name"]))
authors.properties["auth_url"] = unicode(file["auth_url"])
authors.properties["auth_eml"] = unicode(file["auth_eml"])
authors.push()
graph.create_unique(Relationship(news, "hasAuthor", authors))
This is what I normally do, as I find it easier to know what's happening. As far as I know there is a bug when you create_unique with only a Node, and there is no need to create the nodes separately when you also have to create an edge.
I don't have the database on this computer, so please bear with me if there are some typos; I'll correct them in the morning, but I guess you'd rather have a fast answer. :-)
news = graph.cypher.execute_one('MATCH (m:Mainstream_News) '
                                'WHERE m.id = {id} '
                                'RETURN m', {'id': unicode(file["_id"])})
if not news:
    news = Node("Mainstream_News")
    news.properties['id'] = unicode(file["_id"])
    news.properties['entry_url'] = unicode(file["entry_url"])
    news.properties['title'] = unicode(file["title"])

# You can make a for-loop here
authors = Node("Authors")
authors.properties['auth_name'] = unicode(file["auth_name"])
authors.properties['auth_url'] = unicode(file["auth_url"])
authors.properties['auth_eml'] = unicode(file["auth_eml"])

rel = Relationship(news, "hasAuthor", authors)
graph.create_unique(rel)
# For-loop should end here
I've included the first three lines to make it more generic. It returns a node object or None.
EDIT:
@cybersam's use of schema is cool, implement that too. I'll try to use it myself as well. :-)
You can read more about it here:
http://neo4j.com/docs/stable/query-constraints.html
http://py2neo.org/2.0/schema.html

Incorrect sort order with Neo4jClient Cypher query

I have the following Neo4jClient code
var queryItem = _graphClient
.Cypher
.Start(new
{
n = Node.ByIndexLookup("myindex", "Name", sku),
})
.Match("p = n-[r:Relationship]->ci")
.With("ci , r")
.Return((ci, r) => new
{
N = ci.Node<Item>(),
R = r.As<RelationshipInstance<Payload>>()
})
.Limit(5)
.Results
.OrderByDescending(u => u.R.Data.Frequency);
The query is executing fine but the results are not sorted correctly (i.e. in descending order). Here is the Payload class as well.
Please let me know if you see something wrong with my code. TIA.
You're doing the sorting after the .Results call. This means that you're doing it back in .NET, not on Neo4j. Neo4j is returning any 5 results, because the Cypher query doesn't contain a sort instruction.
Change the last three lines to:
.OrderByDescending("r.Frequency")
.Limit(5)
.Results;
As a general debugging tip, Neo4jClient does two things:
1. It helps you construct Cypher queries using the fluent interface.
2. It executes these queries for you. This is a fairly dumb process: we send the text to Neo4j, and it gives the objects back.
The execution is obviously working, so you need to work out why the query being generated is different from what you expect:
1. Read the doco at http://hg.readify.net/neo4jclient/wiki/cypher (we write it for a reason)
2. Read the "Debugging" section on that page, which tells you how to get the query text
3. Compare the query text with what you expected to be run
4. Resolve the difference (or report an issue at http://hg.readify.net/neo4jclient/issues/new if it's a library bug)

What is the least expensive way to traverse through an ExecutionResult in Neo4j?

I am very new to the Neo4j database.
I am taking the above site as a reference and trying to create nodes for storing data, and to retrieve the nodes and their respective properties.
For retrieving the nodes I am using the following method:
ExecutionResult result = engine.execute(query, map);
Iterator<Object> columnAs = result.columnAs("n");
while (columnAs.hasNext())
{
    Node n = (Node) columnAs.next();
    for (String key : n.getPropertyKeys()) {
        System.out.println(key);
        System.out.println(n.getProperty(key));
    }
}
Executing the above while loop takes a lot of time: almost 10-12 seconds to traverse 28k nodes.
I am not sure whether I am following the proper method or whether there is a better alternative for this.
Thanks in advance.
