Re-running a query after a few other queries have run in the meantime may return different results. This shows up only in larger databases and was somewhat difficult to reproduce. However, the following protocol almost always reproduces the problem (it did so on both Windows and Linux installations):
1. Begin with a newly created empty database.
2. Create a unique index for identifying nodes when importing relationships. Wait until the index is online.
3. Load many nodes from CSV.
//load nodes from UTF-8 encoded TSV file
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///all_nodes.tsv' AS row FIELDTERMINATOR '\t'
MERGE (root:UID {Spec:row.UID})
WITH root,row
// set additional label
CALL apoc.create.addLabels(root,[row.Label]) YIELD node AS labnode
// set additional property
CALL apoc.create.setProperty(root,row.PropName,row.PropValue) YIELD node as propnode
RETURN count(*)
4. Load many relationships from CSV.
// load relationships from UTF-8 encoded TSV file
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///all_edges.tsv' AS row FIELDTERMINATOR '\t'
MATCH (sourcenode:UID {Spec:row.SourceUID})
MATCH (targetnode:UID {Spec:row.TargetUID})
WITH row,sourcenode,targetnode
CALL apoc.merge.relationship(sourcenode,row.Relationship,CASE row.PropName
WHEN "Source" THEN {Source:row.PropValue} ELSE {} END,{}, targetnode) YIELD rel
RETURN count(*)
5. MATCH (n) RETURN count(*).
6. MATCH ()-[r]->() RETURN count(*).
7. Delete many relationships until all relationships are deleted.
7.5. MATCH (n) RETURN count(*).
7.6. MATCH ()-[r]->() RETURN count(*).
8. Load the same relationships from CSV (same query as in 4).
8.5. MATCH (n) RETURN count(*).
8.6. MATCH ()-[r]->() RETURN count(*).
I cannot make my data publicly available, so I can only show the numbers:
Test Case 1:
Loaded 5908886 nodes and 11801553 relationships.
Deleted all relationships in small batches.
Finalized with MATCH ()-[r]->() DELETE r.
Database contained 5908886 nodes and 0 relationships.
Loaded 11801871 relationships.
Database contained 5908886 nodes and 11801871 relationships.
Test Case 2:
Loaded 3338901 nodes and 8892829 relationships.
Deleted all relationships in small batches.
Finalized with MATCH ()-[r]->() DELETE r.
Database contained 3338901 nodes and 0 relationships.
Loaded 8893041 relationships.
Database contained 3338901 nodes and 8893041 relationships.
The differences are only 318 and 212 relationships, but should they not be 0?
EDIT: A partial solution has been found. Both test cases above contained unescaped Neo4j control characters such as / (forward slash) in the property values to be imported. APOC did not flag these as errors, so they ended up in the database store. Whenever a query accessed these property values, it caused unexpected side effects that 'corrupted' the database. The issue was more or less resolved by removing the faulty property values from the CSV input.
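A pre-import check along these lines could flag such rows before they reach the store. This is only a sketch in Python, not part of the original workflow: the column name PropValue comes from the import query above, while the function name find_suspicious_rows and the set of characters treated as suspicious are assumptions to adapt to your own data.

```python
import csv
import io

def find_suspicious_rows(tsv_text, column="PropValue", suspicious=("\\", "/")):
    """Return (line_number, value) pairs whose `column` contains a suspicious character.

    `suspicious` is an assumed list; extend it with whatever characters
    caused trouble in your own import.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    hits = []
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        value = row.get(column) or ""
        if any(ch in value for ch in suspicious):
            hits.append((line_number, value))
    return hits

# Tiny made-up sample using the column names from the import query
sample = (
    "UID\tLabel\tPropName\tPropValue\n"
    "n1\tGene\tName\tabc\n"
    "n2\tGene\tName\ta/b\n"
)
print(find_suspicious_rows(sample))  # [(3, 'a/b')]
```

Rows reported this way can then be escaped or removed before running LOAD CSV.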
I have a graph with 300k nodes and 4M relationships.
I'd like to query all triples:
MATCH p=()-[]->()
RETURN p
I get the following error:
Neo.DatabaseError.Statement.ExecutionFailed
org.neo4j.io.pagecache.CursorException: PropertyRecord claims to have more property blocks than can fit in a record
Do you know what goes wrong? Thanks.
One way to export all nodes and relationships to a CSV file is to use an APOC procedure.
Ref: https://neo4j.com/labs/apoc/4.1/export/csv/
For example, to export all nodes and relationships of the Movies database:
CALL apoc.export.csv.all("movies.csv", {})
Or, if you want to export the result of your own query, see the sample below:
MATCH (person:Person)
WHERE person.name STARTS WITH "L"
WITH collect(person) AS people
CALL apoc.export.csv.data(people, [], "movies-l.csv", {})
YIELD file, source, format, nodes, relationships, properties, time, rows, batchSize, batches, done, data
RETURN file, source, format, nodes, relationships, properties, time, rows, batchSize, batches, done, data
==================
Why do you need to see 300k nodes and 4M relationships in one browser view?
You can use the alternatives below:
1. CALL db.schema.visualization() gives a simplified view of the database schema.
2. Add a LIMIT so that only a few paths are returned:
MATCH p=()-[]->()
RETURN p
LIMIT 25
I have created a graph with a constraint on the primary id. In my CSV a primary id can be duplicated, but the other properties differ. Based on those other properties I want to create relationships.
I tried multiple times to change the code but it does not do what I need.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Trial.csv' AS line FIELDTERMINATOR '\t'
MATCH (n:Trial {id: line.primary_id})
with line.cui= cui
MATCH (m:Intervention)
where m.id = cui
MERGE (n)-[:HAS_INTERVENTION]->(m);
I already have the Intervention nodes in the graph, as well as the trials. So what I am trying to do is match a trial with the id of an intervention and create only the relationship. Instead, the query also creates new nodes.
This is a sample of my data, so the same primary id, having different cuis and I am trying to match on cui:
You can use the following query, which finds Trial and Intervention nodes by primary_id and cui respectively and creates the relationship between them.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Trial.csv' AS line FIELDTERMINATOR '\t'
MATCH (n:Trial {id: line.primary_id}), (m:Intervention {id: line.cui})
MERGE (n)-[:HAS_INTERVENTION]->(m);
The behavior you observed is caused by 2 aspects of the Cypher language:
The WITH clause drops all existing variables except for the ones explicitly specified in the clause. Therefore, since your WITH clause does not specify the n node, n becomes an unbound variable after the clause.
The MERGE clause will create its entire pattern if any part of the pattern does not already exist. Since n is not bound to anything, the MERGE clause would go ahead and create the entire pattern (including the 2 nodes).
So, you could have fixed the issue by simply specifying the n variable in the WITH clause, as in:
WITH n, line.cui AS cui
But @Raj's query is even better, avoiding the need for WITH entirely.
I have the following graph stored in csv format:
graphUnioned.csv:
a b
b c
The above graph denotes path from Node:a to Node:b. Note that the first column in the file denotes source and the second column denotes destination. With this logic the second path in the graph is from Node:b to Node:c. And the longest path in the graph is: Node:a to Node:b to Node:c.
I loaded the above csv in Neo4j desktop using the following command:
LOAD CSV WITH HEADERS FROM "file:\\graphUnioned.csv" AS csvLine
MERGE (s:s {s:csvLine.s})
MERGE (o:o {o:csvLine.o})
MERGE (s)-[]->(o)
RETURN *;
And then for finding longest path I run the following command:
match (n:s)
where (n:s)-[]->()
match p = (n:s)-[*1..]->(m:o)
return p, length(p) as L
order by L desc
limit 1;
Unfortunately, this command only gives me the path from Node:a to Node:b and does not return the longest path. Can someone please help me understand where I am going wrong?
There are two mistakes in your CSV import query.
First, you need to use a type when you MERGE a relationship between nodes, that query won't compile otherwise. You likely supplied one and forgot to add it when you pasted it here.
Second, the big one, is that your query is merging nodes with different labels and different properties, and this is majorly throwing it off. Your intent was to create 3 nodes, with a longest path connecting them, but your query creates 4 nodes, two isolated groups of two nodes each:
This creates 2 b nodes: (:s {s:b}) and (:o {o:b}). Each of them is connected to a different node, and this is due to treating the nodes to be created from each variable in the CSV differently.
What you should be doing is using the same label and property key for all of the nodes involved, and this will allow the match to the b node to only refer to a single node and not create two:
LOAD CSV WITH HEADERS FROM "file:///graphUnioned.csv" AS csvLine
MERGE (s:Node {value:csvLine.s})
MERGE (o:Node {value:csvLine.o})
MERGE (s)-[:REL]->(o)
RETURN *;
You'll also want an index on :Node(value) (or whatever your equivalent is when you import real data) so that your MERGEs and subsequent MATCHes are fast when performing lookups of the nodes by property.
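To see why the original import ends up with four nodes instead of three, here is a small Python sketch that mimics MERGE by treating each node as a (label, value) pair. The function name merge_nodes is made up for illustration, and the simulation ignores everything about MERGE except node identity.

```python
def merge_nodes(edges, source_label, target_label):
    """Simulate MERGE: a node is identified by its (label, value) pair."""
    nodes = set()
    for source, target in edges:
        nodes.add((source_label, source))
        nodes.add((target_label, target))
    return nodes

edges = [("a", "b"), ("b", "c")]

split = merge_nodes(edges, "s", "o")         # original import: labels :s and :o
shared = merge_nodes(edges, "Node", "Node")  # corrected import: a single label

print(len(split))   # 4, because ("s", "b") and ("o", "b") are different nodes
print(len(shared))  # 3
```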
Now, to get to your longest path query.
If you are assuming that the start node has no relations to it, and that your end node has no relationships from it, then you can use a query like this:
match (start:Node)
where not ()-->(start)
match p = (start)-[*]->(end)
where not (end)-->()
return p, length(p) as L
order by L desc
limit 1;
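The same idea (start at nodes with no incoming relationship, stop at nodes with no outgoing one, keep the longest path) can be sketched outside the database. The Python below is only an illustration; it assumes the graph is acyclic, and the function name longest_path is hypothetical.

```python
def longest_path(edges):
    """Longest directed path from a node with no incoming edge
    to a node with no outgoing edge. Assumes an acyclic graph."""
    children = {}
    has_incoming = set()
    nodes = set()
    for source, target in edges:
        children.setdefault(source, []).append(target)
        has_incoming.add(target)
        nodes.update((source, target))

    def extend(path):
        nexts = children.get(path[-1], [])
        if not nexts:  # leaf: nothing leaves this node
            return path
        return max((extend(path + [n]) for n in nexts), key=len)

    starts = [n for n in nodes if n not in has_incoming]
    return max((extend([s]) for s in starts), key=len, default=[])

print(longest_path([("a", "b"), ("b", "c")]))  # ['a', 'b', 'c']
```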
I have two csv files:
(clean_data_2.csv : Sample Content as under)
(stationdata.csv : Sample Content as under)
With my Cypher query, I want each station to be represented as a node and each relationship between stations to carry the count.
I did something like this:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///stationdata.csv" AS line
CREATE (s:station{id:line.station_id,station_name:line.name});
Loading all station data: it creates all the nodes - source and destination columns
LOAD CSV WITH HEADERS FROM "file:///clean_data_2.csv" AS line
MATCH (src:station),(dst:station)
CREATE (src)-[:TO{ count: [line.count]}]->(dst);
The above part runs, but does not give me count in the relationship between nodes.
I am new to Neo4j - graph databases, thanks!
Your second query's MATCH clause does not specify the names of the station nodes for src and dst, so all possible pairs of station nodes would be matched. That would cause the creation of a lot of extra TO relationships with count properties.
Try using this instead of your second query:
LOAD CSV WITH HEADERS FROM "file:///clean_data_2.csv" AS line
MATCH (src:station {station_name: line.src}), (dst:station {station_name: line.dst})
CREATE (src)-[:TO {count: toInteger(line.count)}]->(dst);
This query specifies the station names in the MATCH clause, which your query was not doing.
This query also converts the line.count value from a string (which all values produced by LOAD CSV are) into an integer, and assigns it as a scalar value to the count property, as there does not seem a need for it to be an array.
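The effect of the unconstrained MATCH can be illustrated with a few lines of Python. The station names and CSV rows below are invented for the example; only the shape of the problem is taken from the queries above.

```python
import itertools

stations = ["A", "B", "C"]
# Hypothetical rows of clean_data_2.csv: (src, dst, count as a string)
rows = [("A", "B", "5"), ("B", "C", "2")]

# Unconstrained MATCH (src:station),(dst:station): every ordered pair, for every row
unconstrained = [
    (src, dst, [count])
    for (_, _, count) in rows
    for src, dst in itertools.product(stations, repeat=2)
]

# Constrained MATCH: only the pair named on the row, count stored as an integer
constrained = [
    (src, dst, int(count))
    for (src, dst, count) in rows
    if src in stations and dst in stations
]

print(len(unconstrained))  # 18 relationships for 2 rows and 3 stations
print(constrained)         # [('A', 'B', 5), ('B', 'C', 2)]
```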
My question is similar to this one:
Creating unique node and relationship NEO4J over huge dataset
I have 2 tables Entity (Entities.txt) & Relationships (EntitiesRelationships_Updated.txt) which looks like below: Both the tables are inside an import folder within the Neo4j database. What I am trying to do is load the tables using the load csv command and then create relationships.
As in the table below: If ParentID is 0, it means that ENT_ID does not have a parent. If it is populated, then it has a parent. For example in the table below, ENT_ID = 3 is the parent of ENT_ID = 4 and ENT_ID = 1 is the parent of ENT_ID = 2
**Entity Table**
ENT_ID  Name  PARENTID
1       ABC   0
2       DEF   1
3       GHI   0
4       JKG   3
**Relationship Table**
RID  ENT_IDPARENT  ENT_IDCHILD
1    1             2
2    3             5
The Entity table has 2 million records and the relationship tables has about 400K lines
Each RID has a particular tag associated with it. For example RID = 1 has it that the relation is A FATHER_OF B; RID = 2 has it that the relation is A MOTHER_OF B. Similarly there are 20 such RIDs associated.
Both of these are in txt format.
My first step is to load the entity table. I used the following script:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Entities.txt" AS Entity FIELDTERMINATOR '|'
CREATE (n:Entity{ENT_ID: toInt(Entity.ENT_ID),NAME: Entity.NAME,PARENTID: toInt(Entity.PARENTID)})
This query works fine. It takes about 10 minutes to load 2.8mil records. The next step I do is to index the records:
CREATE INDEX ON :Entity(PARENTID)
CREATE INDEX ON :Entity(ENT_ID)
This query runs fine as well. Following this I tried creating the relationships from the relationship table using a similar query as in the above link:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|'
MATCH (n:A {ENT_IDPARENT : Rships.ENT_IDPARENT})
with Entity, n
MATCH (m:B {ENT_IDCHILD : Rships.ENT_IDCHILD})
with m,n
MERGE (n)-[r:RELATION_OF]->(m);
As I do this, my query keeps running for about an hour and then stops at a particular size (in my case 2.2 GB). I based this query on the link above. It includes the edit from the solution below and still does not work.
I have one more query, which would be as follows (Based on the above link). I run this query as I want to create a relationship based of the Entity table
PROFILE
MATCH(Entity)
MATCH (a:Entity {ENT_ID : Entity.ENT_ID})
WITH Entity, a
MATCH (b:Entity {PARENTID : Entity.PARENTID})
WITH a,b
MERGE (a)-[r:PARENT_OF]->(b)
While I tried running this query, I get a Java Heap Space Error. Unfortunately, I have not been able to get the solution for these.
Could you please advise if I am doing something wrong?
This query allows you to take advantage of your :Entity(ENT_ID) index:
MATCH (child:Entity)
WHERE child.PARENTID > 0
WITH child.PARENTID AS pid, child
MATCH (parent:Entity {ENT_ID : pid})
MERGE (parent)-[:PARENT_OF]->(child);
Cypher does not use indices when the property value comes from another node. To get around that, the above query uses a WITH clause to represent child.PARENTID as a variable (pid). The time complexity of this query should be O(N). Your original query has a complexity of O(N * N).
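As a rough analogy for the difference, the Python sketch below uses a dictionary in the role of the :Entity(ENT_ID) index, so each child needs a single lookup instead of a scan over all entities. The rows are the sample Entity table from the question; the variable names are made up for illustration.

```python
# Entity rows from the question: (ENT_ID, NAME, PARENTID); PARENTID 0 means "no parent"
entities = [(1, "ABC", 0), (2, "DEF", 1), (3, "GHI", 0), (4, "JKG", 3)]

# Plays the role of the :Entity(ENT_ID) index: one constant-time lookup per child
by_id = {ent_id: name for ent_id, name, _ in entities}

# One pass over the children, one index lookup each: O(N) overall
parent_of = [
    (parent_id, ent_id)
    for ent_id, _, parent_id in entities
    if parent_id > 0 and parent_id in by_id
]

print(parent_of)  # [(1, 2), (3, 4)]
```

Without the dictionary, finding each parent would mean scanning the whole list again, which is where the O(N * N) behaviour comes from.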
[EDITED]
If the above query takes too long or encounters errors that might be related to running out of memory, try this variant, which creates 1000 new relationships at a time. You can change 1000 to any number that is workable for you.
MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT 1000
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);
The WHERE clause filters out child nodes that already have a parent relationship. And the MERGE operation has been changed to a simpler CREATE operation, since we have already ascertained that the relationship does not yet exist. The query returns a count of the number of relationships created. If the result is less than 1000, then all parent relationships have been created.
Finally, to make the repeated queries automated, you can install the APOC plugin on the neo4j server and use the apoc.periodic.commit procedure, which will repeatedly invoke a query until it returns 0. In this example, I use a limit parameter of 10000:
CALL apoc.periodic.commit(
"MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT {limit}
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);",
{limit: 10000});
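The control flow of this repeat-until-zero approach can be sketched in plain Python. This is a simulation only: create_batch stands in for one run of the batched Cypher query, and both function names are hypothetical.

```python
def create_batch(children, linked, limit):
    """Link up to `limit` children that have no parent relationship yet.

    Returns how many were linked, like the RETURN COUNT(*) of the query.
    """
    batch = [c for c in children if c not in linked][:limit]
    linked.update(batch)
    return len(batch)

def run_until_done(children, limit):
    """Repeat the batch step until it reports 0."""
    linked = set()
    counts = []
    while True:
        created = create_batch(children, linked, limit)
        if created == 0:
            break
        counts.append(created)
    return counts

print(run_until_done(list(range(25)), limit=10))  # [10, 10, 5]
```

Each batch is small enough to finish quickly, and the loop ends as soon as a batch finds nothing left to link.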
Your entity creation Cypher looks fine, as do your indexes.
I am rather confused about the last two Cypher fragments though.
Since your relationships have a specific label or id associated with them, it's probably best to add your relationships by loading from the relationship table data, though the node labels in your query (A and B) aren't used in your Entity creation and aren't in your graph, and neither are ENT_IDPARENT or ENT_IDCHILD fields. Looks like this isn't really the Cypher you used, but an example you built off of?
I'd change this relationship creation query to this, setting the type property of the relationship for post-processing later (this assumes that there can only be one :RELATION_OF relation between the same two nodes):
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|'
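// LOAD CSV yields strings, and ENT_ID was stored with toInt(), so convert before matching
MATCH (parent:Entity {ENT_ID : toInt(Rships.ENT_IDPARENT)})
MATCH (child:Entity {ENT_ID : toInt(Rships.ENT_IDCHILD)})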
MERGE (parent)-[r:RELATION_OF]->(child)
ON CREATE SET r.RID = Rships.RID;
Later on, if you like, you can match on your relationships with an RID, and add the corresponding type ("FATHER_OF", "MOTHER_OF", etc) property.
As for creating the :PARENT_OF relationship, you're doing some extra match on an Entity variable bound to every single node in your graph - get rid of that.
Instead, use this:
PROFILE
// first, match on all Entities with a PARENTID property
MATCH(child:Entity)
WHERE EXISTS(child.PARENTID)
// next, find the parent for each child by the child's PARENTID
WITH child
MATCH (parent:Entity {ENT_ID : child.PARENTID})
MERGE (parent)-[:PARENT_OF]->(child)
// lastly remove the parentid from the child, so it won't be reprocessed
// if we run the query again.
REMOVE child.PARENTID
EDITED the above query to use an existence check on child.PARENTID, and to remove child.PARENTID after the corresponding relationship has been created.
If you need a solution that uses batching, you could do this manually (adding LIMIT 100000 to your WITH child line), or you could install the APOC Procedures Library and use its apoc.periodic.commit procedure to batch your processing.