Streamsets: Neo4j query very slow - neo4j

I am working in a Streamsets pipeline to read data from a active file directory where .csv files are uploaded remotely and put those data in a neo4j database.
The steps I have used is-
Creating a observation node for each row in .csv
Creating a csv node and creating relation between csv & the record
Updating Timestamp taken from csv node to burn_in_test nodes, already created in graph database from different pipeline, if it is latest
creating relation from csv to burn in test
deleting outdated relation based on latest timestamp
Now I am doing all of these using jdbc query and the cypher query used is
MERGE (m:OBSERVATION{
SerialNumber: "${record:value('/SerialNumber')}",
Test_Stage: "${record:value('/Test_Stage')}",
CUR: "${record:value('/CUR')}",
VOLT: "${record:value('/VOLT')}",
Rel_Lot: "${record:value('/Rel_Lot')}",
TimestampINT: "${record:value('/TimestampINT')}",
Temp: "${record:value('/Temp')}",
LP: "${record:value('/LP')}",
MON: "${record:value('/MON')}"
})
MERGE (t:CSV{
SerialNumber: "${record:value('/SerialNumber')}",
Test_Stage: "${record:value('/Test_Stage')}",
TimestampINT: "${record:value('/TimestampINT')}"
})
WITH m
MATCH (t:CSV) where t.SerialNumber=m.SerialNumber and t.Test_Stage=m.Test_Stage and t.TimestampINT=m.TimestampINT MERGE (m)-[:PART_OF]->(t)
WITH t, t.TimestampINT AS TimestampINT
MATCH (rl:Burn_In_Test) where rl.SerialNumber=t.SerialNumber and rl.Test_Stage=t.Test_Stage and rl.TimestampINT<TimestampINT
SET rl.TimestampINT=TimestampINT
WITH t
MATCH (rl:Burn_In_Test) where rl.SerialNumber=t.SerialNumber and rl.Test_Stage=t.Test_Stage
MERGE (t)-[:POINTS_TO]->(rl)
WITH rl
MATCH (t:CSV)-[r:POINTS_TO]->(rl) WHERE t.TimestampINT<rl.TimestampINT
DELETE r
Right now this process is very slow and taking about 15 mins of time for 10 records. Can This be further optimized?

Best practices when using MERGE is to merge on a single property and then use SET to add other properties.
If I assume that serial number is property is unique for every node (might not be), it would look like:
MERGE (m:OBSERVATION{SerialNumber: "${record:value('/SerialNumber')}"})
SET m.Test_Stage = "${record:value('/Test_Stage')}",
m.CUR= "${record:value('/CUR')}",
m.VOLT= "${record:value('/VOLT')}",
m.Rel_Lot= "${record:value('/Rel_Lot')}",
m.TimestampINT = "${record:value('/TimestampINT')}",
m.Temp= "${record:value('/Temp')}",
m.LP= "${record:value('/LP')}",
m.MON= "${record:value('/MON')}"
MERGE (t:CSV{
SerialNumber: "${record:value('/SerialNumber')}"
})
SET t.Test_Stage = "${record:value('/Test_Stage')}",
t.TimestampINT = "${record:value('/TimestampINT')}"
WITH m
MATCH (t:CSV) where t.SerialNumber=m.SerialNumber and t.Test_Stage=m.Test_Stage and t.TimestampINT=m.TimestampINT MERGE (m)-[:PART_OF]->(t)
WITH t, t.TimestampINT AS TimestampINT
MATCH (rl:Burn_In_Test) where rl.SerialNumber=t.SerialNumber and rl.Test_Stage=t.Test_Stage and rl.TimestampINT<TimestampINT
SET rl.TimestampINT=TimestampINT
WITH t
MATCH (rl:Burn_In_Test) where rl.SerialNumber=t.SerialNumber and rl.Test_Stage=t.Test_Stage
MERGE (t)-[:POINTS_TO]->(rl)
WITH rl
MATCH (t:CSV)-[r:POINTS_TO]->(rl) WHERE t.TimestampINT<rl.TimestampINT
DELETE r
another thing to add is that I would probably split this into two queries.
First one would be the importing part and the second one would be the delete of relationships. Also add unique constraints and indexes where possible.

Related

Neo4j Node Overwrites

I'm looking to perform periodic refreshes of node data in my neo4j db. A good example of my needs would be company employees -- where if an employee is terminated, they are removed from the graph completely and new employees are added.
Really, deleting all nodes of this label and ingesting a fresh dataset likely suffices -- but it feels quite ugly. Is there a more elegant solution? My fresh data exists in csv and I want to pull it in daily.
You could put a 'last updated' timestamp on your nodes. Each day, do your update using MERGE. If the csv data exists on the database, update the timestamp using the ON MATCH clause of MERGE. If the csv data doesn't exist MERGE will create new nodes (make sure to add a timestamp property of some description). E.g:
MERGE (n:Person {<selection_filter>})
ON CREATE SET <required_properties>, n.lastUpdated = date()
ON MATCH SET
n.lastUpdated = date()
After updating the graph with csv data, run a query which deletes all nodes whose timestamps are before today's; i.e. haven't been updated.
You might find creating an index on lastUpdated will improve performance for the delete query.
If your CSV file is all active employees, then you can do something like this:
MATCH (e:Employee)
SET e.ActiveToday = False
LOAD CSV FROM "employees.csv" as line
MERGE (e:Employee {employeeID:line.employeeID})
SET e.ActiveToday = True
MATCH (e:Employee {ActiveToday: False})
DETACH DELETE e
The MERGE will create nodes for new employees in the file and match those that already exist. Both creates and matches will have their ActiveToday property updated. From there, you just match those where the property is still false and remove them.

Efficient way to import multiple csv's in neo4j

I am working on creating a graph database in neo4j for a CALL dataset. The dataset is stored in csv file with following columns: Source, Target, Timestamp, Duration. Here Source and Target are Person id's (numeric), Timestamp is datetime and duration is in seconds (integer).
I modeled my graph where person are nodes(person_id as property) and call as relationship (time and duration as property).
There are around 2,00,000 nodes and around 70 million relationships. I have a separate csv files with person id's which I used to create the nodes. I also added uniqueness constraint on the Person id's.
CREATE CONSTRAINT ON ( person:Person ) ASSERT (person.pid) IS UNIQUE
I didn't completely understand the working of bulk import so I wrote a python script to split my csv into 70 csv's where each csv has 1 million nodes (saved as calls_0, calls_1, .... calls_69). I took the initiative to manually run a cypher query changing the filename every time. It worked well(fast enough) for first few(around 10) files but then I noticed that after adding relationship from a file, the import is getting slower for the next file. Now it is taking almost 25 minutes for importing a file.
Can someone link me to an efficient and easy way of doing it?
Here is the cypher query:
:auto USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH toInteger(line.Source) AS Source,
datetime(replace(line.Time,' ','T')) AS time,
toInteger(line.Target) AS Target,
toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: Source})
MATCH (p2:Person {pid: Target})
MERGE (p1)-[rel:CALLS {time: time, duration: Duration}]->(p2)
RETURN count(rel)
I am using Neo4j 4.0.3
Your MERGE clause has to check for an existing matching relationship (to avoid creating duplicates). If you added a lot of relationships between Person nodes, that could make the MERGE clause slower.
You should consider whether it is safe for you to use CREATE instead of MERGE.
Is much better if you export the match using the ID of each node and then create the relationship.
POC
CREATE INDEX ON :Person(`pid`);
CALL apoc.export.csv.query("LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH toInteger(line.Source) AS Source,
datetime(replace(line.Time,' ','T')) AS time,
toInteger(line.Target) AS Target,
toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: Source})
MATCH (p2:Person {pid: Target})
RETURN ID(a) AS ida,ID(b) as idb,time,Duration","rels.csv", {});
and then
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:////rels.csv' AS row
MATCH (a:Person) WHERE ID(a) = toInt(row.ida)
MATCH (b:Person) WHERE ID(b) = toInt(row.idb)
MERGE (b)-[:CALLS {time: row.time, duration: Duration}]->(a);
For me this is the best way to do this.

Import Edgelist from CSV Neo4J

i'm trying to make a graph database from an edgelist and i'm kind of new with neo4j so i have this problem. First of all, the edgelist i got is like this:
geneId geneSymbol diseaseId diseaseName score
10 NAT2 C0005695 Bladder Neoplasm 0.245871429880008
10 NAT2 C0013182 Drug Allergy 0.202681755307501
100 ADA C0002170 Alopecia 0.2
100 ADA C0002880 Autoimmune hemolytic anemia 0.2
100 ADA C0004096 Asthma 0.21105290517153
i have a lot of relationships like that (165k) between gen and diseases associated.
I want to make a bipartite network in which nodes are gen or diseases, so i upload the data like this:
LOAD CSV WITH HEADERS FROM "file:///path/curated_gene_disease_associations.tsv" as row FIELDTERMINATOR '\t'
MERGE (g:Gene{geneId:row.geneId})
ON CREATE SET g.geneSymbol = row.geneSymbol
MERGE (d:Disease{diseaseId:row.diseaseId})
ON CREATE SET d.diseaseName = row.diseaseName
after a while (which is way longer than what it takes in R to upload the nodes using igraph), it's done and i got the nodes, i used MERGE because i don't want to repeat the gen/disease. The problem is that i don't know how to make the relationships, i've searched and they always use something like
MATCH (g:Gene {geneId: toInt(row.geneId)}), (d:Disease {diseaseId: toInt(row.geneId)})
CREATE (g)-[:RELATED_TO]->(d);
But when i run it it says that there are no changes. I've seen the neo4j tutorial but when they do the relations they don't work with edgelists so maybe the problem is when i merge the nodes so they don't repeat. I'd appreciate any help!
Looks like there might be two problems with your relationship query:
1) You're inserting (probably) as a string type (no toInt), and doing the MATCH query as an integer type (with toInt).
2) You're MATCHing the Disease node on row.geneId, not row.diseaseId.
Try the following modification:
MATCH (g:Gene {geneId: row.geneId}), (d:Disease {diseaseId: row.diseaseId})
CREATE (g)-[:RELATED_TO]->(d);
#DanielKitchener's answer seems to address your main question.
With respect to the slowness of creating the nodes, you should create indexes (or uniqueness constraints, which automatically create indexes as well) on these label/property pairs:
:Gene(geneId)
:Disease(diseaseId)
For example, execute these 2 statements separately:
CREATE INDEX ON :Gene(geneId);
CREATE INDEX ON :Disease(diseaseId);
Once the DB has those indexes, your MERGE clauses should be much faster, as they would not have to scan through all existing Gene or Disease nodes to find possible matches.

I can't create a relationship between nodes and predecessors by cypher while creating the graph

I have the following file A.csv
"NODE","PREDECESSORS"
"1",""
"2","1"
"3","1;2"
I want to create with the nodes: 1,2,3 and its relationships 1->2->3 and 1->3
I have already tried to do so:
LOAD CSV WITH HEADERS FROM 'file:///A.csv' AS line
CREATE (:Task { NODE: line.NODE, PREDECESSORS: SPLIT(line.PREDECESSORS ';')})
FOREACH (value IN line.PREDECESSORS |
MERGE (PREDECESSORS:value)-[r:RELATIONSHIP]->(NODE) )
But it does not work, that is, it does not create any relationship.
Please, might you help me?
The problem is in your MERGE:
MERGE (PREDECESSORS:value)-[r:RELATIONSHIP]->(NODE)
This is merging a :value labeled node and assigning it to the variable PREDECESSORS, which can't be what you want to do.
A better approach would be not save the predecessor data in the node, just use that to match on the relevant nodes and create the relationships.
It will also help to have an index on :Task(NODE) so your matches to the predecessors are quick.
Remember also that cypher queries do not process the entire query for each row, but rather each operation in the query is processed for each row, so once the CREATE executes, all nodes will be created, there's no need to use MERGE the predecessor nodes.
Try something like this:
LOAD CSV WITH HEADERS FROM 'file:///A.csv' AS line
CREATE (node:Task { NODE: line.NODE})
WITH node, SPLIT(line.PREDECESSORS, ';') as predecessors
MATCH (p:Task)
WHERE p.NODE in predecessors
MERGE (p)-[:RELATIONSHIP]->(node)

Fetching specific paths from neo4j Graph database

We are using Neo4j 2.1.4 Community edition.
We are facing some issues in getting the specific paths in neo4j.
Below is the csv file used.
In graph database we are creating Product, Company,Country and Zipcode nodes along with the relationship type as ‘MyRel’ at each level.In the above data we wanted to differentiate each path ,
that is
Mobile, Google,US,88888 -- as path1
Mobile,Goolge,US -- as path2
Mobile,Goolge -- as path3
That’s why we created one more column called Path in data file and maintaining the Path value as a relatioship property. So whenever someone wants to see the different paths he can query based on Relationship property either 1 or 2 or 3. For eample , whenever we query for the relationship property, we should get Mobile,Google ,US
But whenever I do this , in graph it is creating dummy node for Country and Zipcode. This is due to in 2nd and 3rd row the zip and country values are empty(null).
Query used:
LOAD CSV WITH HEADERS FROM "file:C:\\WorkingFolder\\Neo4j\\EDGE_Graph_POC\\newdata\\trial1.csv " as file
MERGE (p:Product {Name:file.Product})
MERGE (comp:Company {Name:file.Company})
MERGE (c:Country {Name:file.Country})
MERGE (zip:Zipcode{Code:file.Zipcode})
CREATE (p)-[:MyRel{Path:file.Path}]->(comp)-[:MyRel{Path:file.Path}]->(c)-[:MyRel{Path:file.Path}]->(zip)
Resultant graph:
So how can I avoid creating dummy nodes ?
Is there any better alternative option to get the proper path?
Thanks,
First, A simple solution is to follow the LOAD CSV query with others that clean up your graph. Run the queries
MATCH (zip:Zipcode { Code : ''})<-[r]-()
DELETE zip, r
and
MATCH (c:Country { Name : ''})<-[r]-()
DELETE c, r
You will then have the graph you desire.
You can filter them out with
WHERE file.Country <> '' and file.Zipcode <> ''
and split up your CREATE e.g.
CREATE (p)-[:MyRel{Path:file.Path}]->(comp)
WHERE file.Country <> '' and file.Zipcode <> ''
MERGE (c:Country {Name:file.Country})
MERGE (zip:Zipcode{Code:file.Zipcode})
CREATE (comp)-[:MyRel{Path:file.Path}]->(c)-[:MyRel{Path:file.Path}]->(zip)

Resources