I am working on creating a graph database in neo4j for a CALL dataset. The dataset is stored in csv file with following columns: Source, Target, Timestamp, Duration. Here Source and Target are Person id's (numeric), Timestamp is datetime and duration is in seconds (integer).
I modeled my graph where person are nodes(person_id as property) and call as relationship (time and duration as property).
There are around 2,00,000 nodes and around 70 million relationships. I have a separate csv files with person id's which I used to create the nodes. I also added uniqueness constraint on the Person id's.
CREATE CONSTRAINT ON ( person:Person ) ASSERT (person.pid) IS UNIQUE
I didn't completely understand the working of bulk import so I wrote a python script to split my csv into 70 csv's where each csv has 1 million nodes (saved as calls_0, calls_1, .... calls_69). I took the initiative to manually run a cypher query changing the filename every time. It worked well(fast enough) for first few(around 10) files but then I noticed that after adding relationship from a file, the import is getting slower for the next file. Now it is taking almost 25 minutes for importing a file.
Can someone link me to an efficient and easy way of doing it?
Here is the cypher query:
:auto USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH toInteger(line.Source) AS Source,
datetime(replace(line.Time,' ','T')) AS time,
toInteger(line.Target) AS Target,
toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: Source})
MATCH (p2:Person {pid: Target})
MERGE (p1)-[rel:CALLS {time: time, duration: Duration}]->(p2)
RETURN count(rel)
I am using Neo4j 4.0.3
Your MERGE clause has to check for an existing matching relationship (to avoid creating duplicates). If you added a lot of relationships between Person nodes, that could make the MERGE clause slower.
You should consider whether it is safe for you to use CREATE instead of MERGE.
Is much better if you export the match using the ID of each node and then create the relationship.
POC
CREATE INDEX ON :Person(`pid`);
CALL apoc.export.csv.query("LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH toInteger(line.Source) AS Source,
datetime(replace(line.Time,' ','T')) AS time,
toInteger(line.Target) AS Target,
toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: Source})
MATCH (p2:Person {pid: Target})
RETURN ID(a) AS ida,ID(b) as idb,time,Duration","rels.csv", {});
and then
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:////rels.csv' AS row
MATCH (a:Person) WHERE ID(a) = toInt(row.ida)
MATCH (b:Person) WHERE ID(b) = toInt(row.idb)
MERGE (b)-[:CALLS {time: row.time, duration: Duration}]->(a);
For me this is the best way to do this.
Related
I have been created a graph having a constraint on primary id. In my csv a primary id is duplicate but the other proprieties are different. Based on the other properties I want to create relationships.
I tried multiple times to change the code but it does not do what I need.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Trial.csv' AS line FIELDTERMINATOR '\t'
MATCH (n:Trial {id: line.primary_id})
with line.cui= cui
MATCH (m:Intervention)
where m.id = cui
MERGE (n)-[:HAS_INTERVENTION]->(m);
I already have the nodes Intervention in the graph as well as the trials. So what I am trying to do is to match a trial with the id from intervention and create only the relationship. Instead is creating me also the nodes.
This is a sample of my data, so the same primary id, having different cuis and I am trying to match on cui:
You can refer the following query which finds Trial and Intervention nodes by primary_id and cui respectively and creates the relationship between them.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Trial.csv' AS line FIELDTERMINATOR '\t'
MATCH (n:Trial {id: line.primary_id}), (m:Intervention {id: line.cui})
MERGE (n)-[:HAS_INTERVENTION]->(m);
The behavior you observed is caused by 2 aspects of the Cypher language:
The WITH clause drops all existing variables except for the ones explicitly specified in the clause. Therefore, since your WITH clause does not specify the n node, n becomes an unbound variable after the clause.
The MERGE clause will create its entire pattern if any part of the pattern does not already exist. Since n is not bound to anything, the MERGE clause would go ahead and create the entire pattern (including the 2 nodes).
So, you could have fixed the issue by simply specifying the n variable in the WITH clause, as in:
WITH n, line.cui= cui
But #Raj's query is even better, avoiding the need for WITH entirely.
I am working in a Streamsets pipeline to read data from a active file directory where .csv files are uploaded remotely and put those data in a neo4j database.
The steps I have used is-
Creating a observation node for each row in .csv
Creating a csv node and creating relation between csv & the record
Updating Timestamp taken from csv node to burn_in_test nodes, already created in graph database from different pipeline, if it is latest
creating relation from csv to burn in test
deleting outdated relation based on latest timestamp
Now I am doing all of these using jdbc query and the cypher query used is
MERGE (m:OBSERVATION{
SerialNumber: "${record:value('/SerialNumber')}",
Test_Stage: "${record:value('/Test_Stage')}",
CUR: "${record:value('/CUR')}",
VOLT: "${record:value('/VOLT')}",
Rel_Lot: "${record:value('/Rel_Lot')}",
TimestampINT: "${record:value('/TimestampINT')}",
Temp: "${record:value('/Temp')}",
LP: "${record:value('/LP')}",
MON: "${record:value('/MON')}"
})
MERGE (t:CSV{
SerialNumber: "${record:value('/SerialNumber')}",
Test_Stage: "${record:value('/Test_Stage')}",
TimestampINT: "${record:value('/TimestampINT')}"
})
WITH m
MATCH (t:CSV) where t.SerialNumber=m.SerialNumber and t.Test_Stage=m.Test_Stage and t.TimestampINT=m.TimestampINT MERGE (m)-[:PART_OF]->(t)
WITH t, t.TimestampINT AS TimestampINT
MATCH (rl:Burn_In_Test) where rl.SerialNumber=t.SerialNumber and rl.Test_Stage=t.Test_Stage and rl.TimestampINT<TimestampINT
SET rl.TimestampINT=TimestampINT
WITH t
MATCH (rl:Burn_In_Test) where rl.SerialNumber=t.SerialNumber and rl.Test_Stage=t.Test_Stage
MERGE (t)-[:POINTS_TO]->(rl)
WITH rl
MATCH (t:CSV)-[r:POINTS_TO]->(rl) WHERE t.TimestampINT<rl.TimestampINT
DELETE r
Right now this process is very slow and taking about 15 mins of time for 10 records. Can This be further optimized?
Best practices when using MERGE is to merge on a single property and then use SET to add other properties.
If I assume that serial number is property is unique for every node (might not be), it would look like:
MERGE (m:OBSERVATION{SerialNumber: "${record:value('/SerialNumber')}"})
SET m.Test_Stage = "${record:value('/Test_Stage')}",
m.CUR= "${record:value('/CUR')}",
m.VOLT= "${record:value('/VOLT')}",
m.Rel_Lot= "${record:value('/Rel_Lot')}",
m.TimestampINT = "${record:value('/TimestampINT')}",
m.Temp= "${record:value('/Temp')}",
m.LP= "${record:value('/LP')}",
m.MON= "${record:value('/MON')}"
MERGE (t:CSV{
SerialNumber: "${record:value('/SerialNumber')}"
})
SET t.Test_Stage = "${record:value('/Test_Stage')}",
t.TimestampINT = "${record:value('/TimestampINT')}"
WITH m
MATCH (t:CSV) where t.SerialNumber=m.SerialNumber and t.Test_Stage=m.Test_Stage and t.TimestampINT=m.TimestampINT MERGE (m)-[:PART_OF]->(t)
WITH t, t.TimestampINT AS TimestampINT
MATCH (rl:Burn_In_Test) where rl.SerialNumber=t.SerialNumber and rl.Test_Stage=t.Test_Stage and rl.TimestampINT<TimestampINT
SET rl.TimestampINT=TimestampINT
WITH t
MATCH (rl:Burn_In_Test) where rl.SerialNumber=t.SerialNumber and rl.Test_Stage=t.Test_Stage
MERGE (t)-[:POINTS_TO]->(rl)
WITH rl
MATCH (t:CSV)-[r:POINTS_TO]->(rl) WHERE t.TimestampINT<rl.TimestampINT
DELETE r
another thing to add is that I would probably split this into two queries.
First one would be the importing part and the second one would be the delete of relationships. Also add unique constraints and indexes where possible.
i'm trying to make a graph database from an edgelist and i'm kind of new with neo4j so i have this problem. First of all, the edgelist i got is like this:
geneId geneSymbol diseaseId diseaseName score
10 NAT2 C0005695 Bladder Neoplasm 0.245871429880008
10 NAT2 C0013182 Drug Allergy 0.202681755307501
100 ADA C0002170 Alopecia 0.2
100 ADA C0002880 Autoimmune hemolytic anemia 0.2
100 ADA C0004096 Asthma 0.21105290517153
i have a lot of relationships like that (165k) between gen and diseases associated.
I want to make a bipartite network in which nodes are gen or diseases, so i upload the data like this:
LOAD CSV WITH HEADERS FROM "file:///path/curated_gene_disease_associations.tsv" as row FIELDTERMINATOR '\t'
MERGE (g:Gene{geneId:row.geneId})
ON CREATE SET g.geneSymbol = row.geneSymbol
MERGE (d:Disease{diseaseId:row.diseaseId})
ON CREATE SET d.diseaseName = row.diseaseName
after a while (which is way longer than what it takes in R to upload the nodes using igraph), it's done and i got the nodes, i used MERGE because i don't want to repeat the gen/disease. The problem is that i don't know how to make the relationships, i've searched and they always use something like
MATCH (g:Gene {geneId: toInt(row.geneId)}), (d:Disease {diseaseId: toInt(row.geneId)})
CREATE (g)-[:RELATED_TO]->(d);
But when i run it it says that there are no changes. I've seen the neo4j tutorial but when they do the relations they don't work with edgelists so maybe the problem is when i merge the nodes so they don't repeat. I'd appreciate any help!
Looks like there might be two problems with your relationship query:
1) You're inserting (probably) as a string type (no toInt), and doing the MATCH query as an integer type (with toInt).
2) You're MATCHing the Disease node on row.geneId, not row.diseaseId.
Try the following modification:
MATCH (g:Gene {geneId: row.geneId}), (d:Disease {diseaseId: row.diseaseId})
CREATE (g)-[:RELATED_TO]->(d);
#DanielKitchener's answer seems to address your main question.
With respect to the slowness of creating the nodes, you should create indexes (or uniqueness constraints, which automatically create indexes as well) on these label/property pairs:
:Gene(geneId)
:Disease(diseaseId)
For example, execute these 2 statements separately:
CREATE INDEX ON :Gene(geneId);
CREATE INDEX ON :Disease(diseaseId);
Once the DB has those indexes, your MERGE clauses should be much faster, as they would not have to scan through all existing Gene or Disease nodes to find possible matches.
I have the following file A.csv
"NODE","PREDECESSORS"
"1",""
"2","1"
"3","1;2"
I want to create with the nodes: 1,2,3 and its relationships 1->2->3 and 1->3
I have already tried to do so:
LOAD CSV WITH HEADERS FROM 'file:///A.csv' AS line
CREATE (:Task { NODE: line.NODE, PREDECESSORS: SPLIT(line.PREDECESSORS ';')})
FOREACH (value IN line.PREDECESSORS |
MERGE (PREDECESSORS:value)-[r:RELATIONSHIP]->(NODE) )
But it does not work, that is, it does not create any relationship.
Please, might you help me?
The problem is in your MERGE:
MERGE (PREDECESSORS:value)-[r:RELATIONSHIP]->(NODE)
This is merging a :value labeled node and assigning it to the variable PREDECESSORS, which can't be what you want to do.
A better approach would be not save the predecessor data in the node, just use that to match on the relevant nodes and create the relationships.
It will also help to have an index on :Task(NODE) so your matches to the predecessors are quick.
Remember also that cypher queries do not process the entire query for each row, but rather each operation in the query is processed for each row, so once the CREATE executes, all nodes will be created, there's no need to use MERGE the predecessor nodes.
Try something like this:
LOAD CSV WITH HEADERS FROM 'file:///A.csv' AS line
CREATE (node:Task { NODE: line.NODE})
WITH node, SPLIT(line.PREDECESSORS, ';') as predecessors
MATCH (p:Task)
WHERE p.NODE in predecessors
MERGE (p)-[:RELATIONSHIP]->(node)
after importing data via CSV LOAD I want to connect the imported nodes to customer nodes that are already in the DB. The idea was to look up all imported nodes with the Label TICKET and run through the result set and create the relationship.
Here is the code I come up with first approach:
# Find nodes without relationship for label Ticket
MATCH (t:Ticket), (c:Customer)
WHERE NOT (t)--(c)
RETURN t.number as ticket_number, t.type as ticket_type,t.sid as ticket_sid
# Run through the resultset and execute for each found node
MATCH (t:Ticket { number: "xxx" }), (c:Customer {code: "xxx"})
MERGE (t)-[:IS_TICKET_OF]->(c);
There is an index
ON :Ticket (number)
ON :Customer(code)
This way to handle it is very slow and it took minutes to run through the CSV file. I hope there is a way to optimize the query or maybe to find a way to create the missing relationship easier as first to look them all up and then run through a loop.
The CSV Load is :
LOAD CSV FROM "file:c:..." AS csvLine
MERGE (t:Ticket { number: csvLine[0]})
Maybe its also fine to create the relation already in the CSV import - maybe something like
MATCH (c:Customer {code:"xxx"})
MERGE (t) - [:IS_TICKET_OF]-> (c)
But I would need to figure out in the query how to extract the code from a field as I have something like "aaa/vvv/bbb/1234" in the CSV import and would need only aaa for the match above as this is stored in the customer node as ID.
Any hint is very appreciated.
Thanks!
Does this query work for you?
It stores the aaa part of the input string in num, makes sure the ticket with that number exists, and then makes sure a relationship exists to the matching customer (if there is such a customer).
LOAD CSV FROM "file:c:..." AS csvLine
WITH SPLIT(csvLine[0], '/')[0] AS num
MERGE (t:Ticket {number: num})
WITH num, t
OPTIONAL MATCH (c:Customer {code: num})
MERGE (t)-[:IS_TICKET_OF]->(c);