I have a set of CSV files with duplicate data, i.e. the same row might (and does) appear in multiple files. Each row is uniquely identified by one of the columns (id) and has quite a few other columns that indicate properties, as well as required relationships (i.e. ids of other nodes to link to). The files all have the same format.
My problem is that, due to the size and number of the files, I want to avoid processing rows that already exist - I know that as long as the id is the same, the contents of the row will be the same across the files.
Can any Cypher wizard advise how to write a query that creates the node, sets all the properties and creates all the relationships if a node with the given id does not exist, but skips the whole action if such a node is found? I tried with MERGE ON CREATE, something along the lines of:
LOAD CSV WITH HEADERS FROM "..." AS row
MERGE (f:MyLabel {id:row.uniqueId})
ON CREATE SET f....
WITH f,row
MATCH (otherNode:OtherLabel {id : row.otherNodeId})
MERGE (f) -[:REL1] -> (otherNode)
but unfortunately ON CREATE only lets me avoid setting the properties again; I couldn't work out how to also skip merging the relationships (only one is shown here, but there are quite a few more).
Thanks in advance!
You can OPTIONAL MATCH the node first and then keep only the rows where it was not found (WHERE existing IS NULL).
Make sure you have an index or constraint on :MyLabel(id)
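For example, a uniqueness constraint (shown here in the legacy syntax; newer Neo4j versions spell this CREATE CONSTRAINT ... FOR ... REQUIRE):
CREATE CONSTRAINT ON (n:MyLabel) ASSERT n.id IS UNIQUE;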
LOAD CSV WITH HEADERS FROM "..." AS row
OPTIONAL MATCH (existing:MyLabel {id: row.uniqueId})
WITH row, existing
WHERE existing IS NULL
MERGE (f:MyLabel {id: row.uniqueId})
ON CREATE SET f....
WITH f, row
MATCH (otherNode:OtherLabel {id: row.otherNodeId})
MERGE (f)-[:REL1]->(otherNode)
I'm trying to load a sparse (co-occurrence) matrix into Neo4j, but after many failed queries it's getting frustrating.
Raw data
Basically, I want to create the nodes from the ids, and the weight of the relationship to each other node (including itself) should be the corresponding value in the matrix.
So, for example, 'nhs' should have a self-relationship with weight 41, a relationship with weight 16 to 'england', and so on.
I was trying things like:
LOAD CSV WITH HEADERS FROM 'file:///matpharma.csv' AS row
MERGE (a:node {name: row.id})
MERGE (b:node {name: row.key})
MERGE (a)-[:w]-(b);
I'm not sure how to attach the edge values though (and not yet sure if the merges are producing the expected result).
Thanks in advance for the assistance
If you just need to add a property on a relationship, where the property value is in your CSV, then it's just a matter of adding a variable for the relationship in your MERGE, and then using SET (or ON CREATE SET, if you only want to set the property when the relationship didn't already exist and had to be created). So something like:
LOAD CSV WITH HEADERS FROM 'file:///matpharma.csv' AS row
MERGE (a:node {name: row.id})
MERGE (b:node {name: row.key})
MERGE (a)-[r:w]-(b)
SET r.weight = row.weight
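Note that LOAD CSV reads every value as a string, so if the weight should be numeric you would probably convert it, for example:
SET r.weight = toFloat(row.weight)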
EDIT
Ah, I took a look at the CSV clip. This is a very strange way to format your data: you have data in your header (that is, your headers are trying to define the other node to look up), which is the wrong way to go about this. You should instead have, per row, one column that defines one of the two nodes to connect (like the "id" column) and then another column for the other node (something like an "id2"). That way you can just do two MATCHes to get your nodes, then a MERGE between them, and then set the relationship property, similar to the sample query I provided above.
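For example, a sketch assuming a flattened file with hypothetical id, id2 and weight columns:
LOAD CSV WITH HEADERS FROM 'file:///matpharma_flat.csv' AS row
MATCH (a:Node {name: row.id})
MATCH (b:Node {name: row.id2})
MERGE (a)-[r:w]-(b)
SET r.weight = toFloat(row.weight)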
But if you're set on this format, then it's going to be a more complicated query, since we have to deal with dynamic access of the row keys and values.
Something like:
LOAD CSV WITH HEADERS FROM 'file:///matpharma.csv' AS row
MERGE (start:Node {name: row.id})
WITH start, row, [key IN keys(row) WHERE key <> 'id'] AS keys
FOREACH (key IN keys |
  MERGE (end:Node {name: key})
  MERGE (start)-[r:w]-(end)
  ON CREATE SET r.weight = row[key])
This is a nice Cypher challenge :) That said, LOAD CSV is not really meant for this, and you would probably be happier flattening your data first.
Here is what I came up with:
LOAD CSV FROM "https://gist.githubusercontent.com/ikwattro/a5260d131f25bcce97c945cb97bc0bee/raw/4ce2b3421ad80ca946329a0be8a6e79ca025f253/data.csv" AS row
WITH collect(row) AS rows
WITH rows, rows[0] AS firstRow
UNWIND rows AS row
WITH firstRow, row SKIP 1
UNWIND range(0, size(row)-2) AS i
RETURN firstRow[i+1], row[0], row[i+1]
You can take a look at the gist
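If you then want to build the graph from those triples rather than just returning them, one sketch (reusing the :Node label and :w relationship type from the other answer, and converting the string values to numbers) would be to replace the final RETURN with:
MERGE (a:Node {name: row[0]})
MERGE (b:Node {name: firstRow[i+1]})
MERGE (a)-[r:w]-(b)
SET r.weight = toFloat(row[i+1])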
I have created a graph with a constraint on the primary id. In my CSV the same primary id appears multiple times, but the other properties are different. Based on the other properties I want to create relationships.
I tried multiple times to change the code, but it does not do what I need.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Trial.csv' AS line FIELDTERMINATOR '\t'
MATCH (n:Trial {id: line.primary_id})
with line.cui AS cui
MATCH (m:Intervention)
where m.id = cui
MERGE (n)-[:HAS_INTERVENTION]->(m);
I already have the Intervention nodes in the graph, as well as the Trial nodes. What I am trying to do is match a trial with the id from the intervention and create only the relationship. Instead, it is also creating new nodes.
This is a sample of my data: the same primary id with different cuis, and I am trying to match on cui:
You can refer to the following query, which finds the Trial and Intervention nodes by primary_id and cui respectively and creates the relationship between them.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Trial.csv' AS line FIELDTERMINATOR '\t'
MATCH (n:Trial {id: line.primary_id}), (m:Intervention {id: line.cui})
MERGE (n)-[:HAS_INTERVENTION]->(m);
The behavior you observed is caused by 2 aspects of the Cypher language:
The WITH clause drops all existing variables except for the ones explicitly specified in the clause. Therefore, since your WITH clause does not specify the n node, n becomes an unbound variable after the clause.
The MERGE clause will create its entire pattern if any part of the pattern does not already exist. Since n is not bound to anything, the MERGE clause would go ahead and create the entire pattern (including the 2 nodes).
So, you could have fixed the issue by simply specifying the n variable in the WITH clause, as in:
WITH n, line.cui AS cui
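For completeness, the original query with just that fix applied would look like this:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Trial.csv' AS line FIELDTERMINATOR '\t'
MATCH (n:Trial {id: line.primary_id})
WITH n, line.cui AS cui
MATCH (m:Intervention)
WHERE m.id = cui
MERGE (n)-[:HAS_INTERVENTION]->(m);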
But @Raj's query is even better, avoiding the need for a WITH clause entirely.
For test purposes I imported data from another data source into Neo4j.
I imported the data only as nodes. Now I want to add the edges based on the imported ID. Every node has 2 fields:
id: contains the identification as String
from: contains all connections as a String[]
For better performance I also created an index on the property "id" and an index on the property "from".
At first I stored both properties as Strings (the from list as a comma-separated String).
This works, but it is really slow:
MATCH (e:Test1),(r:Test2)
WHERE r.from CONTAINS e._id
MERGE (e)-[:HAS]->(r)
Is there a better way?
PS: I also tried storing the from field as a String[]; then I used the following query:
MATCH (e:Test1),(r:Test2)
WHERE e._id IN r.from
MERGE (e)-[:HAS]->(r)
The performance is the same.
The problem is that in both cases you combine every node with every other node - a Cartesian product. It would be better to split the string by comma into individual identifiers and match on each one. For example:
MATCH (T2:Test2)
UNWIND split(T2.from, ",") as id
MATCH (T1:Test1) WHERE T1._id = id
MERGE (T1)-[:HAS]->(T2)
Or, if you keep the identifiers in the array:
MATCH (T2:Test2)
UNWIND T2.from as id
MATCH (T1:Test1) WHERE T1._id = id
MERGE (T1)-[:HAS]->(T2)
And, of course, do not forget about the index.
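For example, since the lookup is on T1._id (legacy index syntax; adjust for your Neo4j version):
CREATE INDEX ON :Test1(_id);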
Actually, at import time, you should be creating the :HAS relationships instead of creating the from property (which forces you to make a wasteful additional query to create the relationships, and leaves you with redundant from properties that you would probably want to delete).
For example, if you are using LOAD CSV to import, and your import file has test2Id and from columns (a string and a string collection, respectively), this import query should create all the nodes and relationships:
LOAD CSV WITH HEADERS FROM "file:///input.csv" AS row
MERGE (t2:Test2 {id: row.test2Id})
WITH row, t2
UNWIND row.from AS t1Id
MERGE (t1:Test1 {id: t1Id})
MERGE (t1)-[:HAS]->(t2);
For better performance, you would want indexes on both :Test1(id) and :Test2(id).
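For example, with the legacy syntax:
CREATE INDEX ON :Test1(id);
CREATE INDEX ON :Test2(id);
Since both MERGEs use id as the lookup key, uniqueness constraints on those properties would work as well and also guard against duplicates.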
In Neo4j, I am trying to load a CSV file whilst creating a relationship between nodes based on the condition that a certain property is matched.
My Cypher code is:
LOAD CSV WITH HEADERS FROM "file:C:/Users/George.Kyle/Simple/Simple scream v3.csv" AS csvLine
MATCH (g:simplepages {page: csvLine.page}), (y:simplepages {pagekeyword: csvLine.keyword})
MATCH (n:sensitiveskin)
WHERE g.keyword = n.keyword
CREATE (f)-[:_]->(n)
You can see I am trying to create a relationship between 'simplepages' and 'sensitiveskin' based on their keyword properties being the same.
The query executes, but no relationships are formed.
What I hope for is that when I execute a query such as
MATCH (n:sensitiveskin) RETURN n LIMIT 25
I would see all nodes (both sensitive skin and simple pages), with auto-complete switched on.
CREATE (f)-[:_]->(n) is using an f variable that was not previously defined, so it is creating a new node (with no label or properties) instead, and then creating a relationship from that new node. I think you meant to use either g or y instead of f. (Probably y, since you don't otherwise use it?)
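A sketch of the corrected query, assuming y was the intended variable:
LOAD CSV WITH HEADERS FROM "file:C:/Users/George.Kyle/Simple/Simple scream v3.csv" AS csvLine
MATCH (g:simplepages {page: csvLine.page}), (y:simplepages {pagekeyword: csvLine.keyword})
MATCH (n:sensitiveskin)
WHERE g.keyword = n.keyword
CREATE (y)-[:_]->(n)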
Is there a way to do a case insensitive MERGE in Cypher (Neo4J)?
I'm creating a graph of entities I have been able to extract from a set of documents, and want to merge entities that are the same across multiple documents (accepting the risk that the same name doesn't mean it's the same entity!). The issue is that the case can vary between documents.
At the moment, I'm using the MERGE syntax to create merged nodes, but it is sensitive to the differences in case. How can I perform a case-insensitive merge?
There is no direct way, but you can try something like the following. MERGE is made for pattern matching, and labels with different cases constitute different patterns.
MERGE (a:Crew123)
WITH a, labels(a) AS t
LIMIT 1
MATCH (n)
WHERE [l IN labels(n) WHERE lower(l) = lower(t[0])] AND a <> n
WITH a, collect(n) AS s
FOREACH (x IN s |
  DELETE a)
RETURN *
The above query will give you an ERROR, but it will delete the newly created node if a similar label already exists. You can add additional patterns in the MERGE clause. If there are no similar labels, it will run successfully.
Again, this is just a workaround to prevent creating new labels that differ only in case.
If the data is coming, for instance, from a CSV or a similar source (a parameter), you can use a dedicated, consistently-cased property for the merge and set the original value separately.
e.g.
CREATE CONSTRAINT ON (u:User) ASSERT u.iname IS UNIQUE;
LOAD CSV WITH HEADERS FROM "http://some/url" AS line
WITH line, lower(line.name) as iname
MERGE (u:User {iname:iname}) ON CREATE SET u.name = line.name;
The best solution we found was to change our schema to include a labelled node that contains the upper-cased value, which we can merge on, whilst still retaining the case information on the original node. E.g. (OriginalCase)-[uppercased]->(ORIGINALCASE)
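A minimal sketch of that schema, using hypothetical labels and a $name parameter:
MERGE (upper:EntityUpper {name: toUpper($name)})
MERGE (original:Entity {name: $name})
MERGE (original)-[:UPPERCASED]->(upper)
The merge key lives on the upper-cased node, while each original node keeps the casing it arrived with.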