Cypher query - creation of graph from csv file - neo4j

I have two csv files:
(clean_data_2.csv : Sample Content as under)
(stationdata.csv : Sample Content as under)
From my cypher query, I want that each station is represented as node and relationship is represented as count.
I did something like this:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///stationdata.csv" AS line
CREATE (s:station{id:line.station_id,station_name:line.name});
Loading all station data: it creates all the nodes - source and destination columns
LOAD CSV WITH HEADERS FROM "file:///clean_data_2.csv" AS line
MATCH (src:station),(dst:station)
CREATE (src)-[:TO{ count: [line.count]}]->(dst);
The above part runs, but does not give me count in the relationship between nodes.
I am new to Neo4j - graph databases, thanks!

Your second query's MATCH clause does not specify the names of the station nodes for src and dst, so all possible pairs of station nodes would be matched. That would cause the creation of a lot of extra TO relationships with count properties.
Try using this instead of your second query:
LOAD CSV WITH HEADERS FROM "file:///clean_data_2.csv" AS line
MATCH (src:station {name: line.src}), (dst:station {name: line.dst})
CREATE (src)-[:TO {count: TOINTEGER(line.count)}]->(dst);
This query specifies the station names in the MATCH clause, which your query was not doing.
This query also converts the line.count value from a string (which all values produced by LOAD CSV are) into an integer, and assigns it as a scalar value to the count property, as there does not seem a need for it to be an array.

Related

Update a list property of a node by values of a CSV file

I would like to update the list property of one node in my DB by a CSV file. The node's name is Class and it has two properties name of type string and vector of type list of string. The CSV file has two columns nameCSV and vectorCSV and some rows have the same value for the nameCSV column but different values for the vectorCSV column. So I want to update the vector property by adding the values of the vectorCSV when their corresponding names are equal. I wrote this query in Cypher but it doesn't work:
load CSV WITH HEADERS FROM "file:///test2.csv" AS vector
MATCH (c:Class)
Merge( nv:CSVfile {nameCSV: vector.nameCSV,vectorCSV:vector.vectorCSV})
where c.name = nv.nameCSV
set c.vector = c.vector + nv.vectorCSV
How should I change this query to update the list property vector of the node Class.
I assume that the vector property on the Class nodes is already an array.
LOAD CSV WITH HEADERS FROM "file:///test2.csv" AS vector
MATCH (c:Class {name: vector.nameCSV})
SET c.vector=c.vector + vector.vectorCSV
This could result in duplicate values. If you do not want duplicates, you can use apoc.coll.union
LOAD CSV WITH HEADERS FROM "file:///test2.csv" AS vector
MATCH (c:Class {name: vector.nameCSV})
SET c.vector=apoc.coll.union(c.vector,[vector.vectorCSV])

Loading sparse adjacency matrix on Neo4j

I'm trying to load a sparse (co-occurrence) matrix in Neo4j but after many failed queries, it's getting frustrating.
Raw data
Basically, I want to create the nodes from the ids, and the relationship weight against each other node (including itself) should be the value on the matrix.
So, for example, 'nhs' should have a self-relationship with weight 41 and 16 with 'england', and so on.
I was trying things like:
LOAD CSV WITH HEADERS FROM 'file:///matpharma.csv' AS row
MERGE (a: node{name: row.id})
MERGE (b: node{name: row.key})
MERGE (a)-[:w]-(b);
I'm not sure how to attach the edge values though (and not yet sure if the merges are producing the expected result).
Thanks in advance for the assistance
If you just need to add a property on a relationship, where the property value is in your CSV, then it's just a matter of adding a variable for the relationship that you MERGE in, and then using SET (or ON CREATE SET, if you only want to set the property if the relationship didn't exist and needed to be created). So something like:
LOAD CSV WITH HEADERS FROM 'file:///matpharma.csv' AS row
MERGE (a: node{name: row.id})
MERGE (b: node{name: row.key})
MERGE (a)-[r:w]-(b)
SET r.weight = row.weight
EDIT
Ah, took a look at the CSV clip. This is a very strange way to format your data. You have data in your header (that is, your headers are trying to define the other node to lookup) which is the wrong way to go about this. You should instead have, per row, one column that defines one of the two nodes to connect (like the "id" column) and then another column for the other node (something like an "id2"). That way you can just do two MATCHes to get your nodes, then a MERGE between them, and then setting the relationship property, similar to the sample query I provided above.
But if you're set on this format, then it's going to be a more complicated query, since we have to deal with dynamic access of the row keys and values.
Something like:
LOAD CSV WITH HEADERS FROM 'file:///matpharma.csv' AS row
MERGE (start:Node {name:row.id})
WITH start, row, [key in keys(row) WHERE key <> 'id'] as keys
FOREACH (key in keys |
MERGE (end:Node {name:key})
MERGE (start)-[r:w]-(end)
ON CREATE SET r.weight = row[key] )
This is a nice Cypher challenge :) Let's say that LOAD CSV is not really meant to do this and probably you would be happier by flattening your data
Here is what I came up with :
LOAD CSV FROM "https://gist.githubusercontent.com/ikwattro/a5260d131f25bcce97c945cb97bc0bee/raw/4ce2b3421ad80ca946329a0be8a6e79ca025f253/data.csv" AS row
WITH collect(row) AS rows
WITH rows, rows[0] AS firstRow
UNWIND rows AS row
WITH firstRow, row SKIP 1
UNWIND range(0, size(row)-2) AS i
RETURN firstRow[i+1], row[0], row[i+1]
You can take a look at the gist

creating a list type collection for a property of node while importing CSV

I am trying to import CSV into Neo4j and create a list collection type property to node.
I have tried with the below code but it creates multiple nodes for the values in csvline.name.
LOAD CSV WITH HEADERS FROM "file:\\persons1.csv" AS csvLine
merge (p:Persons {id: toInteger(csvLine.id), name: [csvLine.name]})
CREATE (n:Person{name:'john',age:34,gender:'m', phone_no:[1234,5678]})
I am expecting only one node having property with collection of phone number should be created in the above case.
Since your CREATE clause is in the same Cypher statement as the LOAD CSV, it will be executed once for every csvLine value.
You will need to run the CREATE clause separately if you want it to be executed only once. (But you may still end up with 2 Person nodes with the name, "John", since a MERGE call may have already created one.)

Performance: Creating a relationship in neo4j based on property ID

For test purposes i imported data from another datasource into neo4j.
I imported the data only as nodes. Now i want to add the edges based on the imported ID. Every node has 2 fields
id: contains the identification as String
from: contains all connections as a String[]
For performance improvements i also created an index for the propertiy "id" and an index for the property "from"
First i created both properties as String (the from list as comma separated String).
This works, but is really slow:
MATCH (e:Test1),(r:Test2)
WHERE r.from CONTAINS e._id
MERGE (e)-[:HAS]->(r)
is there a better way?
PS: i tried also to store the from field as String[]. than i used the following query
MATCH (e:Test1),(r:Test2)
WHERE e._id IN r.from
MERGE (e)-[:HAS]->(r)
-> Performance is the same
The problem is that you take a combination of all components - the Cartesian product. In both cases. More would be better to split the string by comma to identifiers. For example:
MATCH (T2:Test2)
UNWIND split(T2.from, ",") as id
MATCH (T1:Test1) WHERE T1._id = id
MERGE (T1)-[:HAS]->(T2)
Or, if you keep the identifiers in the array:
MATCH (T2:Test2)
UNWIND T2.from as id
MATCH (T1:Test1) WHERE T1._id = id
MERGE (T1)-[:HAS]->(T2)
And, of course, do not forget about the index.
Actually, at import time, you should be creating the :HAS relationships instead of creating the from property (which forces you to make a wasteful additional query to create the relationships, and leaves you with redundant from properties that you would probably want to delete).
For example, if you are using LOAD CSV to import, and your import file has test2Id and from columns (a string and a string collection, respectively), this import query should create all the nodes and relationships:
LOAD CSV WITH HEADERS FROM "file:///input.csv" AS row
MERGE (t2:Test2 {id: row.test2Id})
WITH row, t2
UNWIND row.from AS t1Id
MERGE (t1:Test1 {id: t1Id})
MERGE (t1)-[:HAS]->(t2);
For better performance, you would want indexes on both :Test1(id) and :Test2(id).

Error creating relationships over huge dataset

My question is similar to the one pointed here :
Creating unique node and relationship NEO4J over huge dataset
I have 2 tables Entity (Entities.txt) & Relationships (EntitiesRelationships_Updated.txt) which looks like below: Both the tables are inside an import folder within the Neo4j database. What I am trying to do is load the tables using the load csv command and then create relationships.
As in the table below: If ParentID is 0, it means that ENT_ID does not have a parent. If it is populated, then it has a parent. For example in the table below, ENT_ID = 3 is the parent of ENT_ID = 4 and ENT_ID = 1 is the parent of ENT_ID = 2
**Entity Table**
ENT_ID Name PARENTID
1 ABC 0
2 DEF 1
3 GHI 0
4 JKG 3
**Relationship Table**
RID ENT_IDPARENT ENT_IDCHILD
1 1 2
2 3 5
The Entity table has 2 million records and the relationship tables has about 400K lines
Each RID has a particular tag associated with it. For example RID = 1 has it that the relation is A FATHER_OF B; RID = 2 has it that the relation is A MOTHER_OF B. Similarly there are 20 such RIDs associated.
Both of these are in txt format.
My first step is to load the entity table. I used the following script:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Entities.txt" AS Entity FIELDTERMINATOR '|'
CREATE (n:Entity{ENT_ID: toInt(Entity.ENT_ID),NAME: Entity.NAME,PARENTID: toInt(Entity.PARENTID)})
This query works fine. It takes about 10 minutes to load 2.8mil records. The next step I do is to index the records:
CREATE INDEX ON :Entity(PARENTID)
CREATE INDEX ON :Entity(ENT_ID)
This query runs fine as well. Following this I tried creating the relationships from the relationship table using a similar query as in the above link:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|'
MATCH (n:A {ENT_IDPARENT : Rships.ENT_IDPARENT})
with Entity, n
MATCH (m:B {ENT_IDCHILD : Rships.ENT_IDCHILD})
with m,n
MERGE (n)-[r:RELATION_OF]->(m);
As I do this, my query keeps running for about an hour and it stops at a particular size(in my case 2.2gb) I followed this query based on the link above. This includes the edit from the solution below and still does not work
I have one more query, which would be as follows (Based on the above link). I run this query as I want to create a relationship based of the Entity table
PROFILE
MATCH(Entity)
MATCH (a:Entity {ENT_ID : Entity.ENT_ID})
WITH Entity, a
MATCH (b:Entity {PARENTID : Entity.PARENTID})
WITH a,b
MERGE (a)-[r:PARENT_OF]->(b)
While I tried running this query, I get a Java Heap Space Error. Unfortunately, I have not been able to get the solution for these.
Could you please advice if I am doing something wrong?
This query allows you to take advantage of your :Entity(ENT_ID) index:
MATCH (child:Entity)
WHERE child.PARENTID > 0
WITH child.PARENTID AS pid, child
MATCH (parent:Entity {ENT_ID : pid})
MERGE (parent)-[:PARENT_OF]->(child);
Cypher does not use indices when the property value comes from another node. To get around that, the above query uses a WITH clause to represent child.PARENTID as a variable (pid). The time complexity of this query should be O(N). You original query has a complexity of O(N * N).
[EDITED]
If the above query takes too long or encounters errors that might be related to running out of memory, try this variant, which creates 1000 new relationships at a time. You can change 1000 to any number that is workable for you.
MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT 1000
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);
The WHERE clause filters out child nodes that already have a parent relationship. And the MERGE operation has been changed to a simpler CREATE operation, since we have already ascertained that the relationship does not yet exist. The query returns a count of the number of relationships created. If the result is less than 1000, then all parent relationships have been created.
Finally, to make the repeated queries automated, you can install the APOC plugin on the neo4j server and use the apoc.periodic.commit procedure, which will repeatedly invoke a query until it returns 0. In this example, I use a limit parameter of 10000:
CALL apoc.periodic.commit(
"MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT {limit}
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);",
{limit: 10000});
Your entity creation Cypher looks fine, as do your indexes.
I am rather confused about the last two Cypher fragments though.
Since your relationships have a specific label or id associated with them, it's probably best to add your relationships by loading from the relationship table data, though the node labels in your query (A and B) aren't used in your Entity creation and aren't in your graph, and neither are ENT_IDPARENT or ENT_IDCHILD fields. Looks like this isn't really the Cypher you used, but an example you built off of?
I'd change this relationship creation query to this, setting the type property of the relationship for post-processing later (this assumes that there can only be one :RELATION_OF relation between the same two nodes):
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|'
MATCH (parent:Entity {ENT_ID : Rships.ENT_IDPARENT})
MATCH (child:Entity {ENT_ID : Rships.ENT_IDCHILD})
MERGE (parent)-[r:RELATION_OF]->(child)
ON CREATE SET r.RID = Rships.RID;
Later on, if you like, you can match on your relationships with an RID, and add the corresponding type ("FATHER_OF", "MOTHER_OF", etc) property.
As for creating the :PARENT_OF relationship, you're doing some extra match on an Entity variable bound to every single node in your graph - get rid of that.
Instead, use this:
PROFILE
// first, match on all Entities with a PARENTID property
MATCH(child:Entity)
WHERE EXISTS(child.PARENTID)
// next, find the parent for each child by the child's PARENTID
WITH child
MATCH (parent:Entity {ENT_ID : child.PARENTID})
MERGE (parent)-[:PARENT_OF]->(child)
// lastly remove the parentid from the child, so it won't be reprocessed
// if we run the query again.
REMOVE child.PARENTID
EDITED the above query to use an existence check on child.PARENTID, and to remove child.PARENTID after the corresponding relationship has been created.
If you need a solution that uses batching, you could do this manually (adding LIMIT 100000 to your WITH child line, or you could install the APOC Procedures Library and use its periodic.commit() function to batch your processing.

Resources