Cypher Query endless loop - neo4j

I am new to graph databases and especially Cypher. I am importing data from my CSV. Below is the sample I pulled for some country data, to which I added the cities and states. Now I am pushing the data for areas:
LOAD CSV WITH HEADERS FROM
"file:///X:/loc.csv" as csvRow
MATCH (ct:city {poc:csvRow.poc})
MERGE (loc:area {eoc: csvRow.eoc, name:csvRow.loc_nme, name_wr:replace(csvRow.loc_nme," ","")})
MERGE (loc)-[:exists_inside]->(ct)
I've already pushed city and country data using the same kind of query and built a relationship between them too.
But when I try to create the areas inside the cities, the query just keeps running with no sign of stopping (15 minutes have passed).
There are 7000 cities in the data I've got from the internet and 90k areas inside those cities.
Is it just taking time, or have I messed up the query?
After the update:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
"file:///X:/loc.csv" as csvRow
MATCH (ct:city {poc:csvRow.poc})
MERGE (loc:area {eoc: csvRow.eoc, name:csvRow.loc_nme, name_wr:replace(csvRow.loc_nme," ","")})
MERGE (loc)-[:exists_inside]->(ct)

Okay, your query plan shows NodeByLabelScans and filters are being used to find your nodes, which means that every time you match or merge to a node, it has to scan all nodes with the given labels and perform property access on all of them to find the nodes you're looking for.
You need to add indexes (or unique constraints, depending on if the field is supposed to be unique) on the relevant label/property combinations so those lookups will be quick.
So you'll need one on :city(poc), and probably one on :area(eoc), assuming those properties are referring to unique properties.
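For example, using the same CREATE INDEX ON syntax as the other answers in this thread (on Neo4j 4.x+ the equivalent is CREATE INDEX FOR (n:city) ON (n.poc)):
CREATE INDEX ON :city(poc);
CREATE INDEX ON :area(eoc);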
EDIT
One other big thing I initially missed: you need to add USING PERIODIC COMMIT before the LOAD CSV so the load batches its writes to the db. That should do the trick here.

Related

How to make relationships between already created nodes of different columns from a single csv file?

I have a single csv file whose contents are as follows -
id,name,country,level
1,jon,USA,international
2,don,USA,national
3,ron,USA,local
4,bon,IND,national
5,kon,IND,national
6,jen,IND,local
7,ken,IND,international
8,ben,GB,local
9,den,GB,international
10,lin,GB,national
11,min,AU,national
12,win,AU,local
13,kin,AU,international
14,bin,AU,international
15,nin,CN,national
16,con,CN,local
17,eon,CN,international
18,fon,CN,international
19,pon,SZN,national
20,zon,SZN,international
First of all I created a constraint on id
CREATE CONSTRAINT idConstraint ON (n:Name) ASSERT n.id IS UNIQUE
Then I created nodes for name, then for country and finally for level as follows -
LOAD CSV WITH HEADERS FROM "file:///demo.csv" AS row
MERGE (name:Name {name: row.name, id: row.id, country:row.country, level:row.level})
MERGE (country:Country {name: row.country})
MERGE (level:Level {type: row.level})
I can see the nodes fine. However, I want to be able to query for things like, for a given country how many names are there? For a given level, how many countries and then how many names for that country are there?
So for that I need to make Relationships between the nodes.
For that I tried like this -
LOAD CSV WITH HEADERS FROM "file:///demo.csv" AS row
MATCH (n:Name {name:row.name}), (c:Country {name:row.country})
CREATE (n)-[:LIVES_IN]->(c)
RETURN n,c
However this gives me a warning as follows -
This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (c))
Moreover, the resulting graph looks slightly wrong: each Name node has two relationships with a country, whereas I would expect only one.
I also have a nagging fear that I am not doing things in an optimized or correct way. This is just a demo. In my real dataset, I often cannot run multiple CREATE or MERGE statements together, so I have to LOAD the same CSV file again and again to do pretty much everything, starting from creating nodes. When creating relationships, because a cartesian product forms, the command basically gives a Java heap memory error.
PS. I just started with neo4j yesterday. I really don't know much about it. I have been struggling with this for a whole day, hence thought of asking here.
You can ignore the cartesian product warning, since that exact approach is needed in order to create the relationships that form the patterns you need.
As for the multiple relationships, it's possible you ran the query twice; the second run would have created the duplicate relationships. You could use MERGE instead of CREATE for the relationships, which would ensure there are no duplicates.
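For example, the relationship step from the question becomes idempotent with MERGE, so re-running it cannot create duplicates:
LOAD CSV WITH HEADERS FROM "file:///demo.csv" AS row
MATCH (n:Name {name:row.name}), (c:Country {name:row.country})
MERGE (n)-[:LIVES_IN]->(c)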

Add relationships to existing data in Neo4j

To start with Neo4j (4.2.3) I loaded a year's worth of flights data (7m rows) and wanted to try and model a flight as a relationship between origin and destination airport. However the following query just eats up memory and has not finished after two days, so something is clearly amiss:
MATCH (f:Flight), (dest:Airport), (orig:Airport)
WHERE f.Dest = dest.IATA_Code AND f.Origin = orig.IATA_Code
CREATE (orig)-[r:FlightTo {DeptDateTime:f.DepDT, ArriveDateTime:f.ArrDT, Flight:f.Name}]->(dest)
I can do this instead:
LOAD CSV WITH HEADERS FROM 'file:///flights.csv' AS row
MERGE (o:Org_Airport {Org_IATA:row.Origin})
MERGE (d:Dest_Airport {Dest_IATA:row.Dest})
CREATE (o)-[r:FlightTo {DeptDateTime:row.DepDT, ArriveDateTime:row.ArrDT, Flight:row.Name}]->(d)
While this has the advantage of working (even in a reasonable time), it feels ugly to essentially duplicate the airports, and also to go through the CSV file again when all the required data is already in the database.
I'm not quite there with my graph thinking probably so I'd appreciate some guidance on what the best way is to add a relationship like this, keeping in mind that original load files might get lost.
Do you have indexes set? Looking at your first query, you'd need:
CREATE INDEX ON :Flight(Dest);
CREATE INDEX ON :Airport(IATA_Code);
If you don't have indexes/constraints set on the label/property, the look up/merge will be very slow.
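If the one-shot query is still too heavy for the heap even with indexes (7m flights in a single transaction is a lot), one option is to batch it. Here is a sketch, assuming the APOC plugin is installed, using apoc.periodic.iterate to commit in batches:
CALL apoc.periodic.iterate(
  "MATCH (f:Flight) RETURN f",
  "MATCH (orig:Airport {IATA_Code: f.Origin}), (dest:Airport {IATA_Code: f.Dest})
   CREATE (orig)-[:FlightTo {DeptDateTime: f.DepDT, ArriveDateTime: f.ArrDT, Flight: f.Name}]->(dest)",
  {batchSize: 10000}
);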

Creating relationships between nodes in neo4j is extremely slow

I'm using a python script to generate and execute queries loaded from data in a CSV file. I've got a substantial amount of data that needs to be imported so speed is very important.
The problem I'm having is that merging the relationship between two nodes takes a very long time: including the Cypher to create the relationships between the nodes makes a query take around 3 seconds, versus around 100 ms without it.
Here's a small bit of the query I'm trying to execute:
MERGE (s0:Chemical{`name`: "10074-g5"})
SET s0.`name`="10074-g5"
MERGE (y0:Gene{`gene-id`: "4149"})
SET y0.`name`="MAX"
SET y0.`gene-id`="4149"
MERGE (s0)-[:INTERACTS_WITH]->(y0)
MERGE (s1:Chemical{`name`: "10074-g5"})
SET s1.`name`="10074-g5"
MERGE (y1:Gene{`gene-id`: "4149"})
SET y1.`name`="MAX"
SET y1.`gene-id`="4149"
MERGE (s1)-[:INTERACTS_WITH]->(y1)
Any suggestions on why this is running so slowly? I've got indexes set up on Chemical->name and Gene->gene-id, so I honestly don't understand why this runs so slowly.
Most of your SET clauses are just setting properties to the same values they already have (as guaranteed by the preceding MERGE clauses).
The remaining SET clauses probably only need to be executed if the MERGE had created a new node. So, they should probably be preceded by ON CREATE.
You should never generate a long sequence of almost identical Cypher code. Instead, your Cypher code should use parameters, and you should pass your data as parameter(s).
You said you have a :Gene(id) index, whereas your code actually requires a :Gene(gene-id) index.
Below is sample Cypher code that uses the dataList parameter (a list of maps containing the desired property values), which fixes most of the above issues. The UNWIND clause just "unwinds" the list into individual maps.
UNWIND $dataList AS d
MERGE (s:Chemical{name: d.sName})
MERGE (y:Gene{`gene-id`: d.yId})
ON CREATE SET y.name=d.yName
MERGE (s)-[:INTERACTS_WITH]->(y)
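For a quick test in Neo4j Browser or cypher-shell, you could set the parameter like this (values taken from the question); from your Python script you would instead pass dataList as a query parameter through the driver:
:param dataList => [{sName: '10074-g5', yId: '4149', yName: 'MAX'}]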

Big data import into neo4j

I am importing around 12 million nodes and 13 million relationships.
First I used LOAD CSV with PERIODIC COMMIT 50000 and divided the data into different chunks, but it's still taking too much time.
Then I looked at the batch insertion method, but for that I would have to create new data sets in an Excel sheet.
Basically I am importing the data from SQL Server: first I save the data into CSV, then import it into Neo4j.
Also, I am using the Neo4j community edition. I changed the configuration properties according to everything I found on Stack Overflow, but still: with PERIODIC COMMIT 50000 it initially goes fast, then after 1 million rows it takes too much time.
Is there any way to import this data directly from SQL in a short span of time, given that Neo4j is famous for being fast with big data? Any suggestions or help?
Here is the LOAD CSV I used (with an index on :Number(num)):
USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv" AS csvLine FIELDTERMINATOR ';'
MERGE (Numbers:Number {num: csvLine.Numbers}) RETURN *;

USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv" AS csvLine FIELDTERMINATOR ';'
MERGE (TermNum:Number {num: csvLine.TermNum})
MERGE (OrigNum:Number {num: csvLine.OrigNum})
MERGE (OrigNum)-[r:CALLS]->(TermNum) RETURN *;
How long is it taking?
To give you a reference, my db is about 4m nodes, 650,000 unique relationships, and ~10m-15m properties (not as large, but it should provide an idea). It takes me less than 10 minutes to load in the nodes file + set multiple labels, and then load in the relationships file + set the relationships (all via LOAD CSV). This is also being done on a souped-up computer, but if yours is taking hours, I would make some tweaks.
My suggestions are as follows:
Are you intentionally returning the nodes after the MERGE? I can't imagine you are doing anything with it, but either way, consider removing the RETURN *. With RETURN *, you're returning all nodes, relationships, and paths found in the query and that's bound to slow things down. (http://neo4j.com/docs/stable/query-return.html#return-return-all-elements)
Is the "num" field meant to be unique? If so, consider adding the following constraints (NOTE: this will also create the index, so no need to create it separately). I think this might speed up the MERGE (I'm not sure on that), though see next point.
CREATE CONSTRAINT ON (Numbers:Number) ASSERT Numbers.num IS UNIQUE;
If the num field is unique AND this is a brand new database (i.e. you're starting from scratch when you run this script), then call CREATE to create the nodes, rather than MERGE (for the creation of the nodes only).
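For example, the first load could then be (a sketch of the same query with CREATE):
USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv" AS csvLine FIELDTERMINATOR ';'
CREATE (:Number {num: csvLine.Numbers});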
As was already mentioned by Christophe, you should definitely increase the heap size to around 4g.
Let us know how it goes!
EDIT 1
I have not been able to find much relevant information on memory/performance tuning for the Windows version. What I have found leaves me with a couple of questions, and is potentially outdated.
This is potentially outdated, but provides some background on some of the different settings and the differences between Windows and Linux.
http://blog.bruggen.com/2014/02/some-neo4j-import-tweaks-what-and-where.html
Those differences between Windows & Linux have themselves changed from one version to the next (as demonstrated with the following links)
Cypher MATCH query speed,
https://stackoverflow.com/a/29055966/4471711
Michael's response above seems to indicate that if you're NOT running a Java application with Neo4j, you don't need to worry about the heap (-Xmx). That doesn't seem right to me given the other information I saw, but perhaps all of that other information predates 2.2.
I have also been through this.
http://neo4j.com/docs/stable/configuration.html
So, what I have done is set both heap (-Xmx in the neo4j.vmoptions) and the pagecache to 32g.
Can you modify your heap settings to 4096MB?
Also, in the second LOAD CSV, are the numbers used by the first two MERGE clauses already in the database? If so, use MATCH instead.
I would also commit at a level of 10000.
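Putting those two suggestions together, the second load might look like this (a sketch, assuming the first load has already created all the numbers):
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv" AS csvLine FIELDTERMINATOR ';'
MATCH (TermNum:Number {num: csvLine.TermNum})
MATCH (OrigNum:Number {num: csvLine.OrigNum})
MERGE (OrigNum)-[:CALLS]->(TermNum);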

Cypher LOAD FROM CSV performance for relationship merge

I am merging large batches of ~500,000 relationships with the LOAD CSV command:
LOAD CSV WITH HEADERS FROM 'http://file.csv' AS csv
MATCH (a:Label {uid: csv.uid1}),(b:Otherlabel {uid: csv.uid2})
MERGE (a)-[:TYPE {key1: csv.key1}]->(b)
Both uid properties have a UNIQUE constraint.
The CSV file looks like:
uid1,uid2,key1
123,abc,some_value
456,def,some_value
This is usually very fast (< 1 min) when there are many different nodes on each side.
But performance drops dramatically when I load batches where a single a node is connected to many different b nodes. The uid1 is always the same but schema constraints are still there. ~30,000 relationships take ~8 min to load.
Am I missing something here? What could explain the huge performance difference in MERGEing 'many-to-many' relationships vs. 'one-to-many'?
As I mentioned in the comment on the question, I verified this behavior with a ~300,000 line CSV file that I created with unique random values for uid1 and uid2. @MartinPreusse then mentioned that if you change the query to use CREATE instead of MERGE, the query is fast. This observation made me realize what is going on.
The slowdown is caused by the need to scan the relationships list of the 'a' node each time a MERGE is performed. When a CREATE is performed, the relationship is added without testing first to see if the relationship already exists. When the relationship lists remain short (first case), scanning the relationship lists has little impact. When the relationship lists are growing long (second case), the repeated scanning of a growing list is dominating the process. In my test I linked all 300,000 nodes to a single node using a MERGE clause and it took hours.
If you don't have to worry about creating duplicate relationships, you can use CREATE without fear. Even if duplicates are an issue, it might be faster to use CREATE and then craft a query that will remove the duplicates.
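If duplicates do sneak in, one common cleanup pass looks like this (a sketch, assuming a duplicate means the same key1 between the same pair of nodes):
MATCH (a:Label)-[r:TYPE]->(b:Otherlabel)
WITH a, b, r.key1 AS key, collect(r) AS rels
WHERE size(rels) > 1
FOREACH (dup IN rels[1..] | DELETE dup)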
