Create Relationships between Consecutive Nodes (on date attribute) in Neo4j

I am trying to get a csv into Neo4j. As it consists of log entries, I'd like to connect nodes with a NEXT-pointer/relationship when the corresponding logs have been created at subsequent times.
LOAD CSV WITH HEADERS FROM 'http://localhost/Export.csv' AS line
CREATE (:Entry { date: line[0], ...})
MATCH (n)
RETURN n
ORDER BY n:date
MATCH (a:Entry),(b:Entry),(c:Entry)
WITH p AS min(b:date)
WHERE a:date < b:date AND c.date = p
CREATE (a)-[r:NEXT]->(c)
The last four lines do not work, however. What I am trying to do is get the earliest entry 'c' out of the group of entries 'b' that have a larger timestamp than 'a'. Can anyone help me out here?

Not sure if I understood your question correctly: you have a csv file containing log records with a timestamp. Each line contains one record. You want to interconnect the events to form a linked list based on a timestamp?
In this case I'd split up the process into two steps:
using LOAD CSV, create a node with a date property for each line
afterwards connect the entries using e.g. a cypher statement like this:
MATCH (e:Entry)
// ascending order, so NEXT points from each entry to the next-later one
WITH e ORDER BY e.date ASC
WITH collect(e) AS entries
// the single-element FOREACHs bind entries[i] to a variable, since MERGE patterns need variables
FOREACH (i IN range(0, size(entries)-2) |
  FOREACH (e1 IN [entries[i]] |
    FOREACH (e2 IN [entries[i+1]] |
      MERGE (e1)-[:NEXT]->(e2))))
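On more recent Neo4j versions, the same linked list can be built without the nested FOREACH workaround. A minimal sketch, assuming Neo4j 3.x or later:
MATCH (e:Entry)
WITH e ORDER BY e.date ASC
WITH collect(e) AS entries
// pair each entry with its successor and link them
UNWIND range(0, size(entries)-2) AS i
WITH entries[i] AS e1, entries[i+1] AS e2
MERGE (e1)-[:NEXT]->(e2)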

Related

Longest Path Neo4j returning incorrect path

I have the following graph stored in csv format:
graphUnioned.csv:
a b
b c
The first row denotes a path from Node:a to Node:b. Note that the first column in the file denotes the source and the second column denotes the destination. With this logic, the second path in the graph is from Node:b to Node:c, and the longest path in the graph is Node:a to Node:b to Node:c.
I loaded the above csv in Neo4j desktop using the following command:
LOAD CSV WITH HEADERS FROM "file:\\graphUnioned.csv" AS csvLine
MERGE (s:s {s:csvLine.s})
MERGE (o:o {o:csvLine.o})
MERGE (s)-[]->(o)
RETURN *;
And then for finding longest path I run the following command:
match (n:s)
where (n:s)-[]->()
match p = (n:s)-[*1..]->(m:o)
return p, length(p) as L
order by L desc
limit 1;
However, unfortunately this command only gives me the path from Node:a to Node:b and does not return the longest path. Can someone please help me understand where I am going wrong?
There are two mistakes in your CSV import query.
First, you need to use a type when you MERGE a relationship between nodes; the query won't compile otherwise. You likely supplied one and forgot to add it when you pasted it here.
Second, the big one, is that your query is merging nodes with different labels and different properties, and this is what's throwing it off. Your intent was to create 3 nodes with a longest path connecting them, but your query creates 4 nodes: two isolated groups of two nodes each.
It creates two b nodes, (:s {s:"b"}) and (:o {o:"b"}), each connected to a different node, because the nodes created from each column of the CSV are treated differently.
What you should be doing is using the same label and property key for all of the nodes involved, and this will allow the match to the b node to only refer to a single node and not create two:
LOAD CSV WITH HEADERS FROM "file:\\graphUnioned.csv" AS csvLine
MERGE (s:Node {value:csvLine.s})
MERGE (o:Node {value:csvLine.o})
MERGE (s)-[:REL]->(o)
RETURN *;
You'll also want an index on :Node(value) (or whatever your equivalent is when you import real data) so that your MERGEs and subsequent MATCHes are fast when performing lookups of the nodes by property.
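For example, a minimal sketch using the Neo4j 3.x syntax (on Neo4j 4.x and later you would write CREATE INDEX FOR (n:Node) ON (n.value) instead):
CREATE INDEX ON :Node(value);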
Now, to get to your longest path query.
If you can assume that the start node has no incoming relationships, and that the end node has no outgoing relationships, then you can use a query like this:
match (start:Node)
where not ()-->(start)
match p = (start)-[*]->(end)
where not (end)-->()
return p, length(p) as L
order by L desc
limit 1;

Error creating relationships over huge dataset

My question is similar to the one pointed here :
Creating unique node and relationship NEO4J over huge dataset
I have 2 tables, Entity (Entities.txt) and Relationships (EntitiesRelationships_Updated.txt), which look like the ones below. Both tables are inside the import folder of the Neo4j database. What I am trying to do is load the tables using the LOAD CSV command and then create the relationships.
As in the table below: if PARENTID is 0, the ENT_ID has no parent; if it is populated, it does. For example, in the table below ENT_ID = 3 is the parent of ENT_ID = 4, and ENT_ID = 1 is the parent of ENT_ID = 2.
**Entity Table**
ENT_ID Name PARENTID
1 ABC 0
2 DEF 1
3 GHI 0
4 JKG 3
**Relationship Table**
RID ENT_IDPARENT ENT_IDCHILD
1 1 2
2 3 5
The Entity table has 2 million records and the relationship table has about 400K lines.
Each RID has a particular tag associated with it. For example, RID = 1 means the relation is A FATHER_OF B; RID = 2 means A MOTHER_OF B. There are about 20 such RIDs.
Both of these are in txt format.
My first step is to load the entity table. I used the following script:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Entities.txt" AS Entity FIELDTERMINATOR '|'
CREATE (n:Entity{ENT_ID: toInt(Entity.ENT_ID),NAME: Entity.NAME,PARENTID: toInt(Entity.PARENTID)})
This query works fine; it takes about 10 minutes to load the 2.8 million records. The next step is to index the records:
CREATE INDEX ON :Entity(PARENTID)
CREATE INDEX ON :Entity(ENT_ID)
This query runs fine as well. Following this, I tried creating the relationships from the relationship table, using a query similar to the one in the above link:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|'
MATCH (n:A {ENT_IDPARENT : Rships.ENT_IDPARENT})
with Entity, n
MATCH (m:B {ENT_IDCHILD : Rships.ENT_IDCHILD})
with m,n
MERGE (n)-[r:RELATION_OF]->(m);
As I do this, my query keeps running for about an hour and then stops at a particular size (in my case 2.2 GB). I followed this query based on the link above; it includes the edit from the solution below and still does not work.
I have one more query, as follows (based on the above link). I run this query because I want to create a relationship based on the Entity table:
PROFILE
MATCH(Entity)
MATCH (a:Entity {ENT_ID : Entity.ENT_ID})
WITH Entity, a
MATCH (b:Entity {PARENTID : Entity.PARENTID})
WITH a,b
MERGE (a)-[r:PARENT_OF]->(b)
While running this query, I get a Java heap space error. Unfortunately, I have not been able to find a solution to these problems.
Could you please advise if I am doing something wrong?
This query allows you to take advantage of your :Entity(ENT_ID) index:
MATCH (child:Entity)
WHERE child.PARENTID > 0
WITH child.PARENTID AS pid, child
MATCH (parent:Entity {ENT_ID : pid})
MERGE (parent)-[:PARENT_OF]->(child);
Cypher does not use indexes when the property value comes from another node. To get around that, the above query uses a WITH clause to project child.PARENTID as the variable pid. The time complexity of this query should be O(N); your original query has a complexity of O(N * N).
[EDITED]
If the above query takes too long or encounters errors that might be related to running out of memory, try this variant, which creates 1000 new relationships at a time. You can change 1000 to any number that works for you.
MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT 1000
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);
The WHERE clause filters out child nodes that already have a parent relationship. And the MERGE operation has been changed to a simpler CREATE operation, since we have already ascertained that the relationship does not yet exist. The query returns a count of the number of relationships created. If the result is less than 1000, then all parent relationships have been created.
Finally, to automate the repeated queries, you can install the APOC plugin on the neo4j server and use the apoc.periodic.commit procedure, which will repeatedly invoke a query until it returns 0. In this example, I use a limit parameter of 10000:
CALL apoc.periodic.commit(
"MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT {limit}
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);",
{limit: 10000});
Your entity creation Cypher looks fine, as do your indexes.
I am rather confused about the last two Cypher fragments though.
Since your relationships have a specific label or id associated with them, it's probably best to add your relationships by loading from the relationship table data. However, the node labels in your query (A and B) aren't used in your Entity creation and aren't in your graph, and neither are the ENT_IDPARENT or ENT_IDCHILD fields. It looks like this isn't really the Cypher you used, but an example you built off of?
I'd change this relationship creation query to the following, setting the RID property on the relationship for post-processing later (this assumes that there can only be one :RELATION_OF relationship between the same two nodes):
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|'
MATCH (parent:Entity {ENT_ID : Rships.ENT_IDPARENT})
MATCH (child:Entity {ENT_ID : Rships.ENT_IDCHILD})
MERGE (parent)-[r:RELATION_OF]->(child)
ON CREATE SET r.RID = Rships.RID;
Later on, if you like, you can match on your relationships with an RID, and add the corresponding type ("FATHER_OF", "MOTHER_OF", etc) property.
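That post-processing pass could look like the sketch below; the RID-to-type mapping is assumed from the question (RID = 1 meaning FATHER_OF), and r.RID is compared as a string because it was set straight from the CSV:
MATCH (:Entity)-[r:RELATION_OF]->(:Entity)
WHERE r.RID = "1"
SET r.type = "FATHER_OF";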
As for creating the :PARENT_OF relationship, you're doing some extra match on an Entity variable bound to every single node in your graph - get rid of that.
Instead, use this:
PROFILE
// first, match on all Entities with a PARENTID property
MATCH(child:Entity)
WHERE EXISTS(child.PARENTID)
// next, find the parent for each child by the child's PARENTID
WITH child
MATCH (parent:Entity {ENT_ID : child.PARENTID})
MERGE (parent)-[:PARENT_OF]->(child)
// lastly remove the parentid from the child, so it won't be reprocessed
// if we run the query again.
REMOVE child.PARENTID
EDITED the above query to use an existence check on child.PARENTID, and to remove child.PARENTID after the corresponding relationship has been created.
If you need a solution that uses batching, you could do this manually (adding LIMIT 100000 to your WITH child line), or you could install the APOC Procedures library and use its apoc.periodic.commit() procedure to batch your processing.
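For example, a sketch of the APOC variant, reusing the existence check and REMOVE from the query above, and projecting the PARENTID to a variable so the index on :Entity(ENT_ID) is used:
CALL apoc.periodic.commit(
"MATCH (child:Entity)
WHERE EXISTS(child.PARENTID)
WITH child.PARENTID AS pid, child
LIMIT {limit}
MATCH (parent:Entity {ENT_ID : pid})
MERGE (parent)-[:PARENT_OF]->(child)
REMOVE child.PARENTID
RETURN COUNT(*)",
{limit: 100000});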

Neo4j Load CSV Import Stalling

After playing around with toy datasets, this was my first attempt to use data that is relevant to a project at work. In short, after limping along to get nearly all of my data into Neo4j, my last query simply stalled.
Note: I was prototyping my queries by pasting them into the browser tool, but my longer term plan was to keep all of the commands in a .cql file that I could script on my workstation in order to perform nightly analyses.
To add context to my problem, I am prototyping on my macbook.
8gb ram
2.2 ghz intel core i7
osx 10.9.5
2.2.0 community
The files I am processing are listed below (rows/columns). I am not importing every column; it was just easier to keep my current datasets in check.
Ability.csv = 3/1
brm.csv = 276992/34
cont.csv = 80093/17
email chain.csv = 199143/34 (this is the only data I can't get in)
email first last.csv = 77849/20
recs.csv = 77962/20
templates_topics.csv = 29/3
templates.csv = 49/4
topics.csv = 13/1
vendors.csv = 5/1
The only config options I set manually for Neo4j were in neo4j-wrapper.conf, where I set wrapper.java.initmemory and wrapper.java.maxmemory to 4096. I did this after poking around to find similar problems.
I made these changes out of the gate because, within the browser, I was getting error messages that the database had disconnected while processing my queries.
Lastly, because my data are work-related, I can't provide test data. I can, however, link to my cypher queries.
Constraint and LOAD CSV .cql file
Any help and advice would be greatly appreciated. I am pretty confident this is user error on my end, but I have definitely hit a wall with respect to what my next steps should be.
Avoid eager loading in LOAD CSV. It doesn't respect PERIODIC COMMIT. See this article by Mark Needham for a thorough explanation.
I would split this one up: create the nodes once, then create each relationship separately. Here is the original query:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///Users/btibert/Dropbox/Projects/bentley-search-neo4j/data/templates.csv" AS row
WITH row
MATCH (r:Vendor {name:row.vendor})
WITH row, r
MERGE (p:Template {name:row.template_clean})
MERGE (v:Version {version:row.template_ver})
MERGE (p)-[:FROM_VERSION]->(v)
MERGE (p)-[:CREATED_BY]->(r);
You can clearly see the Eager operation in the plan below. It doesn't matter much if you have just a few thousand rows, but as you get towards many hundred thousand or millions of rows, pulling all the data in eagerly takes much more memory:
+----------------+------------------------------------+------------------------------------------------------------------------------------------------+
| Operator | Identifiers | Other |
+----------------+------------------------------------+------------------------------------------------------------------------------------------------+
| EmptyResult | | |
| UpdateGraph(0) | anon[270], anon[301], p, r, row, v | MergePattern |
| UpdateGraph(1) | anon[270], p, r, row, v | MergePattern |
| UpdateGraph(2) | p, r, row, v | MergeNode; row.template_clean; :Template(name); MergeNode; row.template_ver; :Version(version) |
| Eager | r, row | |
| SchemaIndex | r, row | row.vendor; :Vendor(name) |
| LoadCSV | row | |
+----------------+------------------------------------+------------------------------------------------------------------------------------------------+
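A sketch of that split for the query above (file path shortened to "...templates.csv" for readability): first create the nodes, then match everything and create the relationships in a second pass.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "...templates.csv" AS row
MERGE (p:Template {name:row.template_clean})
MERGE (v:Version {version:row.template_ver});

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "...templates.csv" AS row
MATCH (r:Vendor {name:row.vendor})
MATCH (p:Template {name:row.template_clean})
MATCH (v:Version {version:row.template_ver})
MERGE (p)-[:FROM_VERSION]->(v)
MERGE (p)-[:CREATED_BY]->(r);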
Also, if you have multiple rows per student, you can use WITH DISTINCT toInt(row.pidm) as pidm, .... to reduce the number of merges it has to run; a sketch of that follows the query below.
I would probably change this into an ON CREATE SET variant for the non-key properties:
LOAD CSV WITH HEADERS FROM "recs.csv" AS row
WITH row
MERGE (s:Student {pidm:toInt(row.pidm)})
ON CREATE SET s.hash_pidm=toInt(row.hash_pidm), ....;
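A sketch of that DISTINCT variant, with hash_pidm standing in for the elided non-key properties:
LOAD CSV WITH HEADERS FROM "recs.csv" AS row
WITH DISTINCT toInt(row.pidm) AS pidm, toInt(row.hash_pidm) AS hash_pidm
MERGE (s:Student {pidm: pidm})
ON CREATE SET s.hash_pidm = hash_pidm;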
This one I'd split up into two statements, one for each relationship, otherwise you might get too many matches:
(And you don't need the WITHs in between)
LOAD CSV WITH HEADERS FROM "...recs.csv" AS row
WITH row
MATCH (s:Student {pidm: toInt(row.pidm)} )
MATCH (v:Vendor {name: row.vendor} )
MATCH (a:Ability {name: row.ability} )
WITH row, s, v, a
MERGE (s)-[:PURCHASED_FROM]->(v)
MERGE (s)-[:HAS_ABILITY]->(a);
would become:
LOAD CSV WITH HEADERS FROM "...recs.csv" AS row
MATCH (s:Student {pidm: toInt(row.pidm)} )
MATCH (v:Vendor {name: row.vendor} )
MERGE (s)-[:PURCHASED_FROM]->(v);
LOAD CSV WITH HEADERS FROM "...recs.csv" AS row
MATCH (s:Student {pidm: toInt(row.pidm)} )
MATCH (a:Ability {name: row.ability} )
MERGE (s)-[:HAS_ABILITY]->(a);
Here I would also create the contacts on their own first (again with ON CREATE SET), and create the student relationship in a separate statement:
LOAD CSV WITH HEADERS FROM "....cont.csv" AS row
MERGE (c:Contact {cid:row.cid}) ON CREATE SET ....;
LOAD CSV WITH HEADERS FROM "...cont.csv" AS row
MATCH (s:Student {pidm:toInt(row.pidm)} )
MATCH (c:Contact {cid:row.cid})
MERGE (s)-[:HAS_CONTACT]->(c);
I would also split this one up into two statements:
LOAD CSV WITH HEADERS FROM "...cont.csv" AS row
WITH row WHERE toInt(row.seqnum) = 1
MATCH (s:Student {pidm:toInt(row.pidm)})
MATCH (f:Contact {cid:row.first_cont})
MERGE (s)-[:FIRST]->(f);
LOAD CSV WITH HEADERS FROM "...cont.csv" AS row
WITH row WHERE toInt(row.seqnum) = 1
MATCH (s:Student {pidm:toInt(row.pidm)})
MATCH (l:Contact {cid:row.last_cont})
MERGE (s)-[:LAST]->(l);
Split this one up into e-mail creation, and then connect the e-mails to the students by msg-id afterwards:
LOAD CSV WITH HEADERS FROM "...brm.csv" AS row
MERGE (e:Email {msgid:row.msgid}) ON CREATE SET ... ;
LOAD CSV WITH HEADERS FROM "file:///Users/btibert/Dropbox/Projects/bentley-search-neo4j/data/brm.csv" AS row
MATCH (s:Student {pidm:toInt(row.pidm)})
MATCH (e:Email {msgid:row.msgid})
MERGE (s)-[:WAS_SENT]->(e);
HTH Michael

Neo4j Cypher Load CSV Failure on Unique Constraint

I'm having issues importing a large volume of data into a Neo4j instance using the Cypher LOAD CSV command. I'm attempting to load roughly 253k user records, each with a unique user_id. My first step was to add a unique constraint on the label to make sure each user would only be created once:
CREATE CONSTRAINT ON (b:User) ASSERT b.user_id IS UNIQUE;
I then tried to run LOAD CSV with periodic commits to pull this data in.
This query failed, so I tried to merge the User record before setting its properties:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_users.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id)})-[:REGISTERED_TO]->(t)
set p.created=toInt(line.created), p.completed=toInt(line.completed);
Modifying the periodic commit value has made no difference; the same error is returned.
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})-[:REGISTERED_TO]->(t);
I receive the following error:
LoadCsvStatusWrapCypherException: Node 9752 already exists with label Person and property "hpcm_uk_buddy_id"=[2446] (Failure when processing URL 'file:/home/data/uk_buddies.csv' on line 253316 (which is the last row in the file). Possibly the last row committed during import is line 253299. Note that this information might not be accurate.)
The numbers seem to match up roughly; the CSV file contains 253315 records in total. The periodic commit doesn't seem to have taken effect either: a count of nodes returns only 5446 rows.
neo4j-sh (?)$ match (n) return count(n);
+----------+
| count(n) |
+----------+
| 5446 |
+----------+
1 row
768 ms
I can understand the number of nodes being incorrect if this ID is only roughly 5000 rows into the CSV file, but is there any technique or command I can use to make this import succeed?
You're falling victim to a common mistake with MERGE, I think. Seriously, this would be in my top 10 FAQs about common problems with Cypher. You're doing this:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})-[:REGISTERED_TO]->(t);
The way MERGE works, that last MERGE matches on the entire pattern, not just on the user node. So you're probably creating duplicate users that you shouldn't be. When you run this MERGE, even if a user with those exact properties already exists, the relationship to the t node may not, so it attempts to create a new user node with those attributes to connect to t, which isn't what you want.
The solution is to merge the user individually, then separately merge the relationship path, like this:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})
merge (p)-[:REGISTERED_TO]->(t);
Note the two MERGEs at the end. The first creates just the user; if the user already exists, it won't try to create a duplicate, and you should be OK with your constraint (assuming there aren't two users with the same user_id but different created values). After you've merged just the user, you then merge the relationship.
The net result of the second query is the same, but it shouldn't create duplicate users.
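A variant that also sidesteps the edge case above (two rows with the same user_id but different created values) keeps only the constrained key in the MERGE pattern and moves the other properties to ON CREATE SET. This is a sketch, not from the original answer:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id)})
on create set p.created = toInt(line.created), p.completed = toInt(line.completed)
merge (p)-[:REGISTERED_TO]->(t);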

neo4j Optimize a relationship check (query)

After importing data via LOAD CSV, I want to connect the imported nodes to customer nodes that are already in the DB. The idea was to look up all imported nodes with the label Ticket, run through the result set, and create the relationships.
Here is the code I came up with as a first approach:
// Find nodes without a relationship, for the label Ticket
MATCH (t:Ticket), (c:Customer)
WHERE NOT (t)--(c)
RETURN t.number AS ticket_number, t.type AS ticket_type, t.sid AS ticket_sid
// Run through the result set and execute this for each found node
MATCH (t:Ticket { number: "xxx" }), (c:Customer {code: "xxx"})
MERGE (t)-[:IS_TICKET_OF]->(c);
There is an index
ON :Ticket (number)
ON :Customer(code)
Handling it this way is very slow; it took minutes to run through the CSV file. I hope there is a way to optimize the query, or perhaps an easier way to create the missing relationships than looking them all up first and then running through a loop.
The CSV Load is :
LOAD CSV FROM "file:c:..." AS csvLine
MERGE (t:Ticket { number: csvLine[0]})
Maybe it's also fine to create the relationship already during the CSV import, with something like:
MATCH (c:Customer {code:"xxx"})
MERGE (t)-[:IS_TICKET_OF]->(c)
But I would need to figure out in the query how to extract the code from a field: I have something like "aaa/vvv/bbb/1234" in the CSV import and would need only "aaa" for the match above, as this is what is stored in the customer node as its ID.
Any hint is very appreciated.
Thanks!
Does this query work for you?
It stores the aaa part of the input string in num, makes sure the ticket with that number exists, and then makes sure a relationship exists to the matching customer (if there is such a customer).
LOAD CSV FROM "file:c:..." AS csvLine
WITH SPLIT(csvLine[0], '/')[0] AS num
MERGE (t:Ticket {number: num})
WITH num, t
OPTIONAL MATCH (c:Customer {code: num})
MERGE (t)-[:IS_TICKET_OF]->(c);
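One caveat with the query above: if no matching Customer exists, c will be null, and MERGE on a pattern containing a null node fails at runtime. A guarded sketch using the same single-element FOREACH trick shown earlier:
LOAD CSV FROM "file:c:..." AS csvLine
WITH SPLIT(csvLine[0], '/')[0] AS num
MERGE (t:Ticket {number: num})
WITH num, t
OPTIONAL MATCH (c:Customer {code: num})
// only create the relationship when a customer was found
FOREACH (cust IN CASE WHEN c IS NULL THEN [] ELSE [c] END |
  MERGE (t)-[:IS_TICKET_OF]->(cust));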
