I'm trying to load a sparse (co-occurrence) matrix in Neo4j but after many failed queries, it's getting frustrating.
Raw data
Basically, I want to create the nodes from the ids, and the relationship weight against each other node (including itself) should be the value on the matrix.
So, for example, 'nhs' should have a self-relationship with weight 41 and 16 with 'england', and so on.
I was trying things like:
LOAD CSV WITH HEADERS FROM 'file:///matpharma.csv' AS row
MERGE (a: node{name: row.id})
MERGE (b: node{name: row.key})
MERGE (a)-[:w]-(b);
I'm not sure how to attach the edge values though (and not yet sure if the merges are producing the expected result).
Thanks in advance for the assistance
If you just need to add a property on a relationship, where the property value is in your CSV, then it's just a matter of adding a variable for the relationship that you MERGE in, and then using SET (or ON CREATE SET, if you only want to set the property if the relationship didn't exist and needed to be created). So something like:
LOAD CSV WITH HEADERS FROM 'file:///matpharma.csv' AS row
MERGE (a: node{name: row.id})
MERGE (b: node{name: row.key})
MERGE (a)-[r:w]-(b)
SET r.weight = row.weight
EDIT
Ah, took a look at the CSV clip. This is a very strange way to format your data. You have data in your header (that is, your headers are trying to define the other node to lookup) which is the wrong way to go about this. You should instead have, per row, one column that defines one of the two nodes to connect (like the "id" column) and then another column for the other node (something like an "id2"). That way you can just do two MATCHes to get your nodes, then a MERGE between them, and then setting the relationship property, similar to the sample query I provided above.
But if you're set on this format, then it's going to be a more complicated query, since we have to deal with dynamic access of the row keys and values.
Something like:
LOAD CSV WITH HEADERS FROM 'file:///matpharma.csv' AS row
MERGE (start:Node {name:row.id})
WITH start, row, [key in keys(row) WHERE key <> 'id'] as keys
FOREACH (key in keys |
MERGE (end:Node {name:key})
MERGE (start)-[r:w]-(end)
ON CREATE SET r.weight = row[key] )
This is a nice Cypher challenge :) Let's say that LOAD CSV is not really meant to do this and probably you would be happier by flattening your data
Here is what I came up with :
LOAD CSV FROM "https://gist.githubusercontent.com/ikwattro/a5260d131f25bcce97c945cb97bc0bee/raw/4ce2b3421ad80ca946329a0be8a6e79ca025f253/data.csv" AS row
WITH collect(row) AS rows
WITH rows, rows[0] AS firstRow
UNWIND rows AS row
WITH firstRow, row SKIP 1
UNWIND range(0, size(row)-2) AS i
RETURN firstRow[i+1], row[0], row[i+1]
You can take a look at the gist
Related
This was inspired by these two SO threads:
Boolean value return from Neo4j cypher query without CASE
How to set property name and value on csv load?
I have a CSV with 3 columns:
first_name,prop_name,prop_value
John,weight,100
Paul,height,200
John,hair_color,blonde
Paul,weight,120
So, there are a number of people, and their properties are randomly scattered across different rows. My goal is to screen the rows and assign all found properties to their holders. For the sake of simplicity, let's focus on the 'weight' property only.
I do know how to make this work the long way:
LOAD CSV WITH HEADERS FROM
"file:///test.csv" AS line
WITH line
MERGE (m:Person {name:line.first_name})
WITH line, CASE line.prop_name WHEN "weight" THEN [1] ELSE [] END as loopW
UNWIND loopW as w
MATCH (p:Person {name: line.first_name})
SET p.weight = line.prop_value
But then, I tried to replace the CASE line with a shorter version
WITH line, collect(line.prop_name = "weight") as loopW
...which resulted in weird behavior, where the created nodes did get their 'weight' keys assigned to, but sometimes with the wrong values. So, I could see something like (:Person {weight:blue})
What would be the right way to get rid of the CASE?
You should know that your current usage filters out all lines that don't have "weight" as the prop_name (the UNWIND of an empty collection wipes out all the other lines and they won't get processed).
What you really need is a better way to set dynamically named properties on your nodes so you can avoid the CASE usage completely.
If you can install APOC Procedures (please read the instructions at the top for how to modify your neo4j.conf to whitelist the procedures, and pay attention to the version matrix to ensure you get the version that corresponds with your Neo4j version) there is one that is a perfect fit for what you're trying to do: CALL apoc.create.setProperty( [node,id,ids,nodes], key, value) YIELD node
Usage would be something like:
LOAD CSV WITH HEADERS FROM
"file:///test.csv" AS line
MERGE (m:Person {name:line.first_name})
CALL apoc.create.setProperty(m, line.prop_name, line.prop_value) YIELD node
RETURN count(distinct m)
EDIT
Expanding on what's wrong with the original query:
UNWIND produces rows in a multiplicative way, with respect to the number of elements in the collection. If the collection in the row had 5 elements, the single row would result in 5 rows, one for each element. If the collection was empty, the row would be removed instead, as there wouldn't be any elements in the collection for which to output rows. Because of this, any further WITH line, CASE ... lines in the query wouldn't do any good.
Let's analyze the original query, going by the example input csv:
LOAD CSV WITH HEADERS FROM
"file:///test.csv" AS line
WITH line //redundant, line not needed
MERGE (m:Person {name:line.first_name}) // 4 rows corresponding with 2 nodes
WITH line, CASE line.prop_name WHEN "weight" THEN [1] ELSE [] END as loopW
// still 4 rows, 2 have [1] as loopW, the other 2 have [] as loopW
UNWIND loopW as w // 2 rows eliminated by unwinding empty collections
MATCH (p:Person {name: line.first_name})
SET p.weight = line.prop_value
// only 2 rows are for 'John,weight,100' and 'Paul,weight,120'
// any further repetitions of WITH line, CASE ... UNWIND for different props will fail and eliminate the remaining 2 rows.
i'm trying to make a graph database from an edgelist and i'm kind of new with neo4j so i have this problem. First of all, the edgelist i got is like this:
geneId geneSymbol diseaseId diseaseName score
10 NAT2 C0005695 Bladder Neoplasm 0.245871429880008
10 NAT2 C0013182 Drug Allergy 0.202681755307501
100 ADA C0002170 Alopecia 0.2
100 ADA C0002880 Autoimmune hemolytic anemia 0.2
100 ADA C0004096 Asthma 0.21105290517153
i have a lot of relationships like that (165k) between gen and diseases associated.
I want to make a bipartite network in which nodes are gen or diseases, so i upload the data like this:
LOAD CSV WITH HEADERS FROM "file:///path/curated_gene_disease_associations.tsv" as row FIELDTERMINATOR '\t'
MERGE (g:Gene{geneId:row.geneId})
ON CREATE SET g.geneSymbol = row.geneSymbol
MERGE (d:Disease{diseaseId:row.diseaseId})
ON CREATE SET d.diseaseName = row.diseaseName
after a while (which is way longer than what it takes in R to upload the nodes using igraph), it's done and i got the nodes, i used MERGE because i don't want to repeat the gen/disease. The problem is that i don't know how to make the relationships, i've searched and they always use something like
MATCH (g:Gene {geneId: toInt(row.geneId)}), (d:Disease {diseaseId: toInt(row.geneId)})
CREATE (g)-[:RELATED_TO]->(d);
But when i run it it says that there are no changes. I've seen the neo4j tutorial but when they do the relations they don't work with edgelists so maybe the problem is when i merge the nodes so they don't repeat. I'd appreciate any help!
Looks like there might be two problems with your relationship query:
1) You're inserting (probably) as a string type (no toInt), and doing the MATCH query as an integer type (with toInt).
2) You're MATCHing the Disease node on row.geneId, not row.diseaseId.
Try the following modification:
MATCH (g:Gene {geneId: row.geneId}), (d:Disease {diseaseId: row.diseaseId})
CREATE (g)-[:RELATED_TO]->(d);
#DanielKitchener's answer seems to address your main question.
With respect to the slowness of creating the nodes, you should create indexes (or uniqueness constraints, which automatically create indexes as well) on these label/property pairs:
:Gene(geneId)
:Disease(diseaseId)
For example, execute these 2 statements separately:
CREATE INDEX ON :Gene(geneId);
CREATE INDEX ON :Disease(diseaseId);
Once the DB has those indexes, your MERGE clauses should be much faster, as they would not have to scan through all existing Gene or Disease nodes to find possible matches.
For test purposes i imported data from another datasource into neo4j.
I imported the data only as nodes. Now i want to add the edges based on the imported ID. Every node has 2 fields
id: contains the identification as String
from: contains all connections as a String[]
For performance improvements i also created an index for the propertiy "id" and an index for the property "from"
First i created both properties as String (the from list as comma separated String).
This works, but is really slow:
MATCH (e:Test1),(r:Test2)
WHERE r.from CONTAINS e._id
MERGE (e)-[:HAS]->(r)
is there a better way?
PS: i tried also to store the from field as String[]. than i used the following query
MATCH (e:Test1),(r:Test2)
WHERE e._id IN r.from
MERGE (e)-[:HAS]->(r)
-> Performance is the same
The problem is that you take a combination of all components - the Cartesian product. In both cases. More would be better to split the string by comma to identifiers. For example:
MATCH (T2:Test2)
UNWIND split(T2.from, ",") as id
MATCH (T1:Test1) WHERE T1._id = id
MERGE (T1)-[:HAS]->(T2)
Or, if you keep the identifiers in the array:
MATCH (T2:Test2)
UNWIND T2.from as id
MATCH (T1:Test1) WHERE T1._id = id
MERGE (T1)-[:HAS]->(T2)
And, of course, do not forget about the index.
Actually, at import time, you should be creating the :HAS relationships instead of creating the from property (which forces you to make a wasteful additional query to create the relationships, and leaves you with redundant from properties that you would probably want to delete).
For example, if you are using LOAD CSV to import, and your import file has test2Id and from columns (a string and a string collection, respectively), this import query should create all the nodes and relationships:
LOAD CSV WITH HEADERS FROM "file:///input.csv" AS row
MERGE (t2:Test2 {id: row.test2Id})
WITH row, t2
UNWIND row.from AS t1Id
MERGE (t1:Test1 {id: t1Id})
MERGE (t1)-[:HAS]->(t2);
For better performance, you would want indexes on both :Test1(id) and :Test2(id).
I have a set of CSV files with duplicate data, i.e. the same row might (and does) appear in multiple files. Each row is uniquely identified by one of the columns (id) and has quite a few other columns that indicate properties, as well as required relationships (i.e. ids of other nodes to link to). The files all have the same format.
My problem is that, due to size and number of the files, I want to avoid processing the rows that already exist - I know that as long as id is the same, the contents of the rows will be the same across the files.
Can any cypher wizard advise how to write a query that would create the node, set all the properties and create all the relationship if a node with given id does not exist, but skip the action altogether if such node is found? I tried with MERGE ON CREATE, something along the lines of:
LOAD CSV WITH HEADERS FROM "..." AS row
MERGE (f:MyLabel {id:row.uniqueId})
ON CREATE SET f....
WITH f,row
MATCH (otherNode:OtherLabel {id : row.otherNodeId})
MERGE (f) -[:REL1] -> (otherNode)
but unfortunately that can only be applied to not setting the properties again, but I couldn't work out how to skip the merging part of relationships (only shown one here, but there are quite a few more).
Thanks in advance!
You can just optionally match the node and then skip with WHERE n IS NULL
Make sure you have an index or constraint on :MyLabel(id)
LOAD CSV WITH HEADERS FROM "..." AS row
OPTIONAL MATCH (f:MyLabel {id:row.uniqueId})
WHERE f IS NULL
MERGE (f:MyLabel {id:row.uniqueId})
ON CREATE SET f....
WITH f,row
MATCH (otherNode:OtherLabel {id : row.otherNodeId})
MERGE (f) -[:REL1] -> (otherNode)
Is there a way to do a case insensitive MERGE in Cypher (Neo4J)?
I'm creating a graph of entities I have been able to extract from a set of documents, and want to merge entities that are the same across multiple documents (accepting the risk that the same name doesn't mean it's the same entity!). The issue is that the case can vary between documents.
At the moment, I'm using the MERGE syntax to create merged nodes, but it is sensitive to the differences in case. How can I perform a case-insensitive merge?
There is no direct way but you can try out something like below.MERGE is made for pattern matching and labels of different cases constitute different patterns
MERGE (a:Crew123)
WITH a,labels(a) AS t
LIMIT 1
MATCH (n)
WHERE [l IN labels(n)
WHERE lower(l)=lower(t[0])] AND a <> n
WITH a,collect(n) AS s
FOREACH (x IN s |
DELETE a)
RETURN *
The above query will give you an ERROR but it will delete the newly created node if a similar label exists. You can add additional pattern in the MERGE clause . And in case there are no similar labels it will run successfully.
Again this is just a work around to not allow new similar labels.
If the data is coming for instance from a CSV or similar source (parameter) you can use a dedicated, consistent case property for the merge and set the original value separately.
e.g.
CREATE CONSTRAINT ON (u:User) ASSERT u.iname IS UNIQUE;
LOAD CSV WITH HEADERS FROM "http://some/url" AS line
WITH line, lower(line.name) as iname
MERGE (u:User {iname:iname}) ON CREATE SET u.name = line.name;
The best solution we found was to change our schema to include a labelled node that contains the upper-cased value which we can merge on, whilst still retaining the case information on the original mode. E.g. (OriginalCase)-[uppercased]->(ORIGINALCASE)