Cypher: create node and relationship if not exists, else create relationship - neo4j

I am trying to use neo to create a unified data dictionary across many datasets, since many of the columns are shared. I have one dictionary as a csv per dataset, with common columns in each. I am new to graph databases, but I think the pseudo code should look like this:
1. Create dataset node (single node with name and features of dataset)
2. Upload data dictionary for dataset in step 1
3. If field node exists, create relationship between dataset node and existing field node.
4. If not exists, create field node and relationship between dataset node and field node.
Excluding step 1 which I am doing manually for each dataset node, here is what I have so far:
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM "file:.csv" AS csvLine
MERGE (d:data {field: csvLine.Field, dtype: csvLine.Type, format: csvLine.Format})
ON CREATE SET d.field = csvLine.Field
ON MATCH SET d.field = csvLine.Field
CREATE (dataset)-[r:CONTAINS]->(d);
The results appear almost correct: only new fields are created, and the number of created relationships equals the number of fields in the uploaded dataset. However, the (dataset) node I created previously does not connect to the fields. Instead, label-less nodes are created and attached to all the fields in the new dataset. How can I properly connect the dataset node to the appropriate fields?

The problem is here: CREATE (dataset)-[r:CONTAINS]->(d)
dataset is a variable, and this is the first time it's used in the query, so this CREATE will create a blank node, bind it to the dataset variable, then create the relationship to d.
Variables only last for the duration of a query (or less, if they are not carried in scope with the WITH clause), and are never persisted to the database. If you previously created some node in a different query with the dataset variable, then that variable went out of scope when the query ended. If you want to refer to that same node again, you will need to match to that node in this query.
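As a rough sketch of what "match to that node in this query" could look like: the :Dataset label and the name value below are assumptions, since the question does not show how the dataset node was created, and the file URL is copied as written in the question. The USING PERIODIC COMMIT prefix is left out for brevity, and MERGE is used for the relationship (rather than CREATE) so that re-running the import does not add duplicate :CONTAINS relationships.

```cypher
LOAD CSV WITH HEADERS FROM "file:.csv" AS csvLine
// Assumption: the dataset node has a :Dataset label and a name property.
MATCH (dataset:Dataset {name: "my_dataset"})
// Find or create the field node, as in the original query.
MERGE (d:data {field: csvLine.Field, dtype: csvLine.Type, format: csvLine.Format})
// dataset is now bound to the existing node, so no blank node is created.
MERGE (dataset)-[:CONTAINS]->(d);
```

If the MATCH finds no dataset node, no rows survive past it and nothing is created, so it is worth checking that the name matches exactly.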

Related

Connecting two nodes based on identical properties in Neo4j

New to Neo4j. My goal is to make a database with various csv sources. I have created node labels of "geochemistry" and "geospatial" to be linked with the "LABID" node via a common property. I have loaded in one dataset easily, and made the necessary connections with the ":LOCATED" relationship being defined as between the geospatial and LABID nodes.
However, moving on to the second csv source, I am a little confused. I have tried matching the new geospatial data (which have no current relationships) to another set of Lab IDs. Below is my current code:
MATCH (g:geospatial) WHERE NOT (g)-[:LOCATED]->(:LABID)
MATCH (l:LABID)
WHERE l.labid = g.Sample_ID
MERGE (g)-[r:LOCATED]->(l)
RETURN r
labid is an existing property on the LABID nodes, and Sample_ID is an existing property on the geospatial nodes.
After completing the above query, the output is "(no changes, no records)"
Thanks for the help in advance!

Neo4j: how to avoid node to be created again if it is already in the database?

I have a question about Cypher requests and the update of a database.
I have a Python script that does web scraping and generates a csv at the end. I use this csv to import data into a Neo4j database.
The scraping is done 5 times a day. So every time a new scraping is done the csv is updated: new data is added to the previous csv, and so on.
I import the data after each scraping.
Currently, when I import the data after each scraping to update the DB, all the nodes are created again even if they are already in the DB.
For example the first csv gives 5 rows and I insert this in Neo4j.
Next the new scraping gives 2 rows so the csv has now 7 rows. And if I insert the data I will have the first five rows twice in the DB.
I would like to have everything unique and not added if it is already in the database.
For example when I try to create node ARTICLE I do this:
CREATE (a:ARTICLE {id:$id, title:$title, img_url:$img_url, link:$link, sentence:$sentence, published:$published})
I think using MERGE instead of CREATE should solve the problem, but it doesn't and I can't figure out why.
How can I do this?
A MERGE clause will create its entire pattern if any part of it does not already exist. So, for a MERGE clause to work reasonably, the pattern used with it must only specify the minimum data necessary to uniquely identify a node (or a relationship).
For instance, assuming ARTICLE nodes are supposed to have unique id properties, then you should replace your CREATE clause:
CREATE (a:ARTICLE {id:$id, title:$title, img_url:$img_url, link:$link, sentence:$sentence, published:$published})
with something like this:
MERGE (a:ARTICLE {id:$id})
SET a += {title:$title, img_url:$img_url, link:$link, sentence:$sentence, published:$published}
In the above example, the SET clause will always overwrite the non-id properties. If you want to set those properties only when the node is created, you can use ON CREATE before the SET clause.
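A minimal sketch of that ON CREATE variant, reusing the same parameters as the query above. With this form, the non-id properties are written only when the node is first created and are left untouched on later imports:

```cypher
// Identify the article by id only.
MERGE (a:ARTICLE {id: $id})
// Runs only when the MERGE had to create the node.
ON CREATE SET a += {title: $title, img_url: $img_url, link: $link, sentence: $sentence, published: $published}
```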
Use MERGE instead of CREATE. You can use it for both nodes and relationships.
MERGE (charlie { name: 'Charlie Sheen', age: 10 })
Create a single node with properties where not all properties match any existing node.
MATCH (a:Person {name: "Martin"}),
(b:Person {name: "Marie"})
MERGE (a)-[r:LOVES]->(b)
Finds or creates a relationship between the nodes.

MERGE the Creation of Nodes from Two geohash Columns in CSV

So I am planning to create a geohash Graph with neo4j.
My CSV contains, for each row, two geohash values, one for pickup and another for dropoff, as follows:
What I want is:
a node that has the same geohash as an existing one shouldn't be recreated (so multiple edges are allowed).
one node could be a pickup and a dropoff at the same time
I tried to use MERGE, but it works per column:
load csv from "file:///green_data.csv" as line
merge (pick:pickup {geohash: line[20]})
merge (drop:dropoff {geohash: line[22]})
merge (pick)-[:trip]->(drop)
As you can see, the same geohash dr5rkky node is being created twice, once for pickups and once for dropoffs.
How can I avoid that?
LOAD CSV FROM "file:///green_data.csv" AS line
MERGE (p:HashNode {geohash: line[20]})
ON CREATE SET p.pickup = true
ON MATCH SET p.pickup = true
MERGE (d:HashNode {geohash: line[22]})
ON CREATE SET d.dropoff = true
ON MATCH SET d.dropoff = true
MERGE (p)-[:trip]->(d)
Based on the Neo4j docs:
MERGE either matches existing nodes and binds them, or it creates new data and binds that. It’s like a combination of MATCH and CREATE that additionally allows you to specify what happens if the data was matched or created.
The last part of MERGE is the ON CREATE and ON MATCH. These allow a query to express additional changes to the properties of a node or relationship, depending on if the element was MATCH-ed in the database or if it was CREATE-ed.
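To illustrate the difference between the two sub-clauses, here is a small sketch in which they do different things. The created and seen properties are hypothetical and only there for the example; the geohash value is the one mentioned in the question.

```cypher
MERGE (n:HashNode {geohash: "dr5rkky"})
// First time this geohash is seen: stamp it and start the counter.
ON CREATE SET n.created = timestamp(), n.seen = 1
// Node already existed: only bump the counter.
ON MATCH SET n.seen = n.seen + 1
RETURN n.geohash, n.seen;
```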

Create at most one relationship for newly created node with existing node based on some property

My application receives a stream of data which I need to persist in a graph DB. With this data, I first create nodes in the neo4j db in batches (of 1000), and just after that I try to find a matching node in the existing data to link each one to.
MATCH (new:EVENT) where new.uniqueId in [NEWLY CREATED NODES UNIQUE ID]
MATCH (existing:EVENT) where new.myprop = existing.myprop and new.uniqueId <> existing.uniqueId
CREATE (new)-[:LINKED]->(existing)
My problem is that if a node has more than one matching existing node, then I want to create a relationship with just one existing node. My current query above will create relationships with all matching nodes.
Is there an efficient way of doing this, given that the number of existing nodes could be huge, i.e. approx 300M?
Note: I have indexes created on the myprop and uniqueId fields.
As @InverseFalcon's answer states, you can use aggregation to collect the existing nodes for each distinct new, and take the first in each collection.
For better performance, you should always PROFILE a query to see what can be improved. For example, after doing that with some sample data on my neo4j installation, I saw two things: the index was not automatically being used when finding new, and the new.uniqueId <> existing.uniqueId test was causing DB hits. This query fixes both issues, and should have better performance:
MATCH(new:EVENT)
USING INDEX new:EVENT(uniqueId)
WHERE new.uniqueId in [NEWLY CREATED NODES UNIQUE ID]
MATCH (existing:EVENT)
WHERE new.myprop = existing.myprop AND new <> existing
WITH new, COLLECT(existing)[0] AS e
CREATE (new)-[:LINKED]->(e);
It uses USING INDEX to provide a hint to use the index. Also, since uniqueId is supposed to be unique, it just compares the new and existing nodes directly to see if they are the same node.
To ensure that the uniqueness is actually enforced by neo4j, you should create a uniqueness constraint:
CREATE CONSTRAINT ON (e:EVENT) ASSERT e.uniqueId IS UNIQUE;
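The ON ... ASSERT form above is the older constraint syntax. On more recent Neo4j releases (4.4 and later, to my knowledge), the equivalent is written with FOR ... REQUIRE. A sketch, where the constraint name event_unique_id is just an illustrative choice:

```cypher
// Newer syntax for the same uniqueness constraint; the name is hypothetical.
CREATE CONSTRAINT event_unique_id IF NOT EXISTS
FOR (e:EVENT) REQUIRE e.uniqueId IS UNIQUE;
```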
You can collect the existing node matches per new node and just grab the first:
MATCH (new:EVENT) where new.uniqueId in [NEWLY CREATED NODES UNIQUE ID]
MATCH (existing:EVENT) where new.myprop = existing.myprop and new.uniqueId <> existing.uniqueId
WITH new, head(collect(existing)) as existing
CREATE (new)-[:LINKED]->(existing)

How to get the Node name in Neo4J

I am new to Neo4j and am referring to this tutorial.
I am not finding any answer on how to fetch the node name using CQL.
For example:
If I create two nodes like so:
CREATE (Dhawan:player{name: "Shikar Dhawan", YOB: 1985, POB: "Delhi"})
CREATE (Ind:Country {name: "India"})
and then build relationship at a later date using:
CREATE (Dhawan)-[r:BATSMAN_OF]->(Ind)
How do we know the node name: Dhawan or Ind?
Using:
MATCH (n) RETURN n
I am getting back the label name but not the node name!
How do I get all the details of an existing graph DB?
The thing you're calling "the node name" is actually a variable, and is only present for the duration of a single query (or less, if you don't include it in a WITH clause and it goes out of scope). It is never saved to the graph db, and is not persisted data.
In your example, you would only be able to use CREATE (Dhawan)-[r:BATSMAN_OF]->(Ind) (and have those variables refer to your previously created nodes) if the create was performed in the same query where those variables were previously bound (and still in scope).
Otherwise, this would create two new nodes, create the :BATSMAN_OF relationship between them, and bind those variables to the new nodes for the duration of their scope.
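So, to build the relationship in a later query, something along these lines should work: look the two nodes up again by their stored properties (the name values from the CREATE statements in the question) and bind them to fresh variables. The variable names themselves can be anything.

```cypher
// Re-bind the previously created nodes by their persisted properties.
MATCH (dhawan:player {name: "Shikar Dhawan"})
MATCH (ind:Country {name: "India"})
// Both variables now refer to existing nodes, so only the relationship is created.
CREATE (dhawan)-[r:BATSMAN_OF]->(ind)
RETURN dhawan.name, type(r), ind.name;
```

What is stored in the graph and can be queried back is the label (player, Country) and the properties (name, YOB, POB), not the variable names.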
