I have a CSV file with the following headers:
jobid,prev_jobid,next_jobid,change_number,change_datetime,change_description,username,email,job_submittime,job_starttime,job_endtime,job_status
"27555","27552","0","134180","2017-09-07 17:39:06","Testing a new methodology",john,john#myco.com,"2017-09-07 17:39:09","2017-09-07 18:11:10","success"
"27552","27549","27555","134178","2017-09-07 17:32:06","bug fix",emma,emma#myco.co,"2017-09-07 17:29:09","2017-09-07 17:11:10","success"
..
..
I've loaded up the CSV and created 3 types of nodes:
LOAD CSV WITH HEADERS FROM "file:///wbdqueue.csv" AS bud
CREATE (j:job{id:bud.jobid,pid:bud.prev_jobid,nid:bud.next_jobid,
add_time:bud.job_submittime,start_time:bud.job_starttime,end_time:bud.job_endtime,status:bud.job_status})
CREATE (c:cl{clnum:bud.change_number,time:bud.change_datetime,desc:bud.change_description})
CREATE (u:user{user:bud.username,email:bud.email})
I am then attempting to create relationships like so:
LOAD CSV WITH HEADERS FROM "file:///wbdqueue.csv" AS node
MATCH (c:cl),(u:user) WHERE c.clnum=node.change_number AND u.user=node.username
CREATE (u)-[:SUBMITTED]->(c)
First of all there is a warning in the browser that this builds a cartesian product of all disconnected patterns and may take a lot of memory/time and suggests adding an optional MATCH.
Secondly I've given it many hours (>3 days) but this does not create any relationships.
What am I doing wrong with my query?
What I am eventually trying to achieve is something like this (if you run the following in your Neo4j console you should get the sample graph that I am also attaching; I've reduced the properties to keep this visual example simple):
CREATE
(`0` :user ) ,
(`1` :job ) ,
(`2` :change_number ) ,
(`3` :user ) ,
(`4` :change_number ) ,
(`5` :job ) ,
(`0`)-[:`SUBMITTED`]->(`2`),
(`2`)-[:`CAUSED`]->(`1`),
(`3`)-[:`SUBMITTED`]->(`4`),
(`2`)-[:`NEXT_CHECKIN`]->(`4`),
(`4`)-[:`CAUSED`]->(`5`),
(`1`)-[:`NEXT_JOB`]->(`5`),
(`5`)-[:`PREVIOUS_JOB`]->(`1`)
Thank you
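The warning about a Cartesian product occurs because you are matching multiple disconnected patterns (MATCH (c:cl),(u:user)).
I believe you can MATCH each node in its own clause and then use MERGE instead of CREATE for the relationship, so that re-running the import does not add duplicate relationships. Avoid a single MERGE over the whole (node)-[rel]->(node) pattern here: if that exact path does not exist yet, MERGE creates all of it, including new copies of nodes that already exist. Also, add USING PERIODIC COMMIT to reduce the memory overhead of the transaction state:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///wbdqueue.csv" AS node
MATCH (u:user {user: node.username})
MATCH (c:cl {clnum: node.change_number})
MERGE (u)-[:SUBMITTED]->(c)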
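Two further points may explain the multi-day runtime. Without an index, every property lookup scans all nodes with that label for each CSV row, and a CREATE per row produces a new user and cl node even when the same user or change appears in several rows. The following is a minimal sketch, not a tested import: it assumes you start from an empty database, it uses the index syntax seen elsewhere in this thread, and it sets only a few of the properties from the question.

```cypher
// Run each statement separately.
// Index syntax follows the rest of this thread (Neo4j 3.x);
// newer versions use CREATE INDEX FOR (n:Label) ON (n.property).
CREATE INDEX ON :user(user);
CREATE INDEX ON :cl(clnum);
CREATE INDEX ON :job(id);

// MERGE on the key property reuses an existing node instead of
// creating one node per CSV row.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///wbdqueue.csv" AS row
MERGE (u:user {user: row.username})
  ON CREATE SET u.email = row.email
MERGE (c:cl {clnum: row.change_number})
  ON CREATE SET c.time = row.change_datetime
MERGE (j:job {id: row.jobid})
  ON CREATE SET j.status = row.job_status
MERGE (u)-[:SUBMITTED]->(c)
MERGE (c)-[:CAUSED]->(j);
```

The CAUSED relationship is taken from the sample graph in the question. The remaining properties can be added to the ON CREATE SET clauses in the same way.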
Related
I've got two CSV files, Job (30,000 entries) and Cat (30 entries), imported into Neo4j and am trying to create a relationship between them.
Each Job has a cat_ID, and Cat contains the category name and ID.
After executing the following:
LOAD CSV WITH HEADERS FROM 'file:///DimCategory.csv' AS row
MATCH (job:Job {cat_ID: row.cat_ID})
MATCH (cat:category {category: row.category})
CREATE (job)-[r:under]->(cat)
it returns (no changes, no records)
I received a prompt recommending that I index the category, so I ran
Create INDEX ON :Job(cat_id); but I still get the same result.
How do I create a relationship between the two?
I am able to get this to work on a smaller dataset
You are probably trying to match on non-existing nodes. Try
LOAD CSV WITH HEADERS FROM 'file:///DimCategory.csv' AS row
MERGE (job:Job {cat_ID: row.cat_ID})
MERGE (cat:category {category: row.category})
CREATE (job)-[r:under]->(cat)
Have a look in your logs and see if you are running out of memory.
You could try chunking the data set up into smaller pieces with Periodic Commit and see if that helps:
:auto USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///DimCategory.csv' AS row
MATCH (job:Job {cat_ID: row.cat_ID})
MATCH (cat:category {category: row.category})
CREATE (job)-[r:under]->(cat)
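Before changing the import, it may help to check which of the two MATCH clauses finds nothing. The query below is a diagnostic sketch that only reads data. Two things to look for: LOAD CSV returns every field as a string, so a cat_ID that was stored as an integer will not equal the string from the file, and property keys are case-sensitive, so an index on cat_id does not help lookups on cat_ID.

```cypher
// For every CSV row, count how many nodes each MATCH would find.
// A 0 in either column explains why no relationship is created for that row.
LOAD CSV WITH HEADERS FROM 'file:///DimCategory.csv' AS row
OPTIONAL MATCH (job:Job {cat_ID: row.cat_ID})
OPTIONAL MATCH (cat:category {category: row.category})
RETURN row.cat_ID AS cat_ID,
       row.category AS category,
       count(DISTINCT job) AS matching_jobs,
       count(DISTINCT cat) AS matching_categories;
```

If matching_jobs is 0 everywhere, compare the type and spelling of the property on an existing Job node with what the CSV provides.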
I am importing the following to Neo4J:
categories.csv
CategoryName1
CategoryName2
CategoryName3
...
categories_relations.csv
category_parent category_child
CategoryName3 CategoryName10
CategoryName32 CategoryName41
...
Basically, categories_relations.csv shows parent-child relationships between the categories from categories.csv.
I imported the first csv file with the following query which went well and pretty quickly:
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///categories.csv' as line
CREATE (:Category {name:line[0]})
Then I imported the second csv file with:
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///categories_relations.csv' as line
MATCH (a:Category),(b:Category)
WHERE a.name = line[0] AND b.name = line[1]
CREATE (a)-[r:ISPARENTOF]->(b)
I have about 2 million nodes.
I tried executing the 2nd query and it is taking quite long. Can I make the query execute more quickly?
Confirm you are matching on the right property. You set only one property on the Category nodes when creating them, name, so the second query has to match on that same property.
To make the second query faster, add an index on the property you match Category nodes on (here name):
CREATE INDEX ON :Category(name)
If it still takes time, you can refer to my answer on LOAD CSV here
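Put together, the relationship import from the question could look like the sketch below. It assumes the index has finished building before the load starts. The batch size of 10000 is an arbitrary choice, not a recommendation from the original answer.

```cypher
// Run as two separate statements.
CREATE INDEX ON :Category(name);

// With the index in place, each MATCH is an index lookup
// instead of a scan over all Category nodes.
USING PERIODIC COMMIT 10000
LOAD CSV FROM 'file:///categories_relations.csv' AS line
MATCH (a:Category {name: line[0]})
MATCH (b:Category {name: line[1]})
CREATE (a)-[:ISPARENTOF]->(b);
```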
I am trying to achieve what is shown here:
I have 2 CSV files, disease_mstr and Test_mstr. Test_mstr contains many test-to-disease-ID records, which means none of them are unique. The disease ID points to the disease_mstr file, which has only 2 fields, ID and Disease_name (the disease name is unique).
I am creating 3 kinds of nodes, with these labels:
1) Tests, which holds the unique tests (345 unique test names in total)
Properties:
a) testname
2) Linknode (the entire Test_mstr file, plus the "disease_name" for the corresponding disease_ID pulled from the disease_mstr file)
Properties:
a) tname
b) dname
c) did
3) Disease (pulled from the disease_mstr file)
Properties:
a) did
b) diseasename
After that I create the relationships:
1)MATCH (t:Tests),(n:Linknode) where t.testname = n.tname CREATE (n)-[r:TEST_2]->(t) RETURN n,r,t
2)MATCH (d:Disease), (l:Linknode) where d.did = l.did MERGE (d)-[r:FOR_DISEASE]->(l) RETURN d,r,l
To get the desired result as shown in image, I run following cypher command :
MATCH (d:Disease)-[r2:FOR_DISEASE]->(l:Linknode)-[r:TEST_2]->(t:Tests) RETURN l,r,t,r2 LIMIT 25
Can someone please help me create the 2 further relationships that are marked and linked in the image with BLUE and GREEN lines?
Sample files and images can be accessed in my google folder link
Is your goal to link all diseases to tests so that for any disease you can find out which tests are relevant and for each test, which diseases it tests for?
If so, you are nearly there.
You don't need the link nodes other than to help you during linking the tests to the diseases. In your current scenario you're treating the link nodes as you would if you were creating a relational database. They won't add any value in your graph db. You can create a single relationship between diseases and tests which will do all the work.
Here's a step by step way to load your database. (It probably isn't the most efficient, but it's easy to follow and it works.)
Normalise and load your tests:
load csv with headers from "file:///test_mstr_csv.csv" as line
merge (:Test {testname:line.test_name});
Load your diseases (these looked normalised to me)
load csv with headers from "file:///disease_mstr_csv.csv" as line
create (:Disease {did:line.did, diseasename:line.disease_name});
Load your link nodes:
load csv with headers from "file:///test_mstr_csv.csv" as line
merge (:Link {testname:line.test_name, parentdiseaseid:line.parent_disease_ID});
Now you can create a direct relationship between the diseases and tests with the following query:
match(d:Disease), (l:Link) where d.did = l.parentdiseaseid
with d, l.testname as name
match(t:Test {testname:name}) create (d)<-[:TEST_FOR]-(t);
This last query will find all the link nodes for each disease and extract the test name. It then looks up the test and joins it directly to its corresponding disease.
The link nodes are redundant now, so you can delete them if you wish.
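As a rough illustration of what the direct TEST_FOR relationship gives you, the queries below look up tests for a disease and diseases for a test, then remove the helper nodes. The names 'Asthma' and 'CBC' are placeholders, not values taken from the sample files.

```cypher
// Tests recorded for one disease ('Asthma' is a placeholder name)
MATCH (t:Test)-[:TEST_FOR]->(d:Disease {diseasename: 'Asthma'})
RETURN t.testname;

// Diseases covered by one test ('CBC' is a placeholder name)
MATCH (t:Test {testname: 'CBC'})-[:TEST_FOR]->(d:Disease)
RETURN d.diseasename;

// Remove the helper nodes once the TEST_FOR relationships exist
MATCH (l:Link)
DETACH DELETE l;
```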
To create the 'blue lines', which I assume are meant to show where tests have diseases in common, run the query below:
match (d:Disease)<-[]-(:Test)-[]->(e:Disease) where id(d) > id(e)
merge (d)-[:BLUE_LINE]->(e);
The match clause finds all disease pairs with a common test, the where clause ensures a link is created in only one direction and the merge clause ensures only one link is created.
I have a csv file generated with contents as follows
GOID GOName
GO:0007190 activation of adenylate cyclase activity
DiseaseID DiseaseName
D058490 46 XY Disorders of Sex Development
D000172 Acromegaly
D049913 ACTH-Secreting,Pituitary Adenoma
D058186 Acute Kidney Injury
D000310 Adrenal Gland Neoplasms
D000312 Adrenal Hyperplasia Congenital
C537045 Albright's hereditary osteodystrophy
D000544 Alzheimer Disease
D019969 Amphetamine-Related Disorders
D000855 Anorexia
D000860 Anoxia
D001008 Anxiety Disorders
D001169 Arthritis Experimental
D001171 Arthritis Juvenile
D001172 Arthritis Rheumatoid
D001249 Asthma
D001254 Astrocytoma
and so on.
I want to create link between GOIDs through Diseases such that one disease node is connected to two or more different GOID nodes.
My output should look like this
Load all your diseases at once under a :Disease label.
Load all your Global data at once under a :Global label.
Create another CSV file with the Global->Disease linkages, and use MERGE to create the relationships.
The relationship CSV would look like this:
goID,diseaseID
"GO:1234","D000456"
The command to read the CSV and create the relationships would look like this:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:/D:/Relationships.csv" as line
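// Look up the nodes loaded earlier, then MERGE only the relationship;
// a MERGE over the whole pattern would create duplicate nodes.
MATCH (g:Global {goID: line.goID})
MATCH (d:Disease {diseaseID: line.diseaseID})
MERGE (g)-[:RELATIONSHIP]->(d)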
Once your data is loaded, you can then query it like so:
MATCH (g:Global {goID: "GO:0007190"})-[r:RELATIONSHIP]->(d:Disease)
return g, r, d
For cases where a disease has multiple global conditions, you can find and create a relationship like so:
match (d:Disease)
match (go1:Global)-[:RELATIONSHIP]->(d)
match (go2:Global)-[:RELATIONSHIP]->(d) where go2 <> go1
merge (go1)-[:RELATIONSHIP]->(go2)
merge (go2)-[:RELATIONSHIP]->(go1)
Strictly speaking you don't need a bi-directional relationship, so creating the second relationship could be left out. One potential concern is if more than one disease links two global values. If that is a concern, then setting a "Disease" property on the relationship would help identify how these globals are related.
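The idea of recording the disease on the relationship could be sketched as follows. The relationship type SHARES_DISEASE is a hypothetical name chosen for this example so that it does not get mixed up with the Global->Disease links. The id() comparison means each pair of globals is processed once and gets a single direction.

```cypher
// One relationship per (pair of globals, shared disease);
// the diseaseID property records which disease connects them.
MATCH (go1:Global)-[:RELATIONSHIP]->(d:Disease)<-[:RELATIONSHIP]-(go2:Global)
WHERE id(go1) < id(go2)
MERGE (go1)-[:SHARES_DISEASE {diseaseID: d.diseaseID}]->(go2);

// Query without a direction to find every global linked to a given one
MATCH (g:Global {goID: "GO:0007190"})-[s:SHARES_DISEASE]-(other:Global)
RETURN other.goID, s.diseaseID;
```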
I'm new to Neo4J, and I want to try it on some data I've exported from MySQL. I've got the community edition running with neo4j console, and I'm entering commands using the neo4j-shell command line client.
I have 2 CSV files, that I use to create 2 types of node, as follows:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/tmp/updates.csv" AS row
CREATE (:Update {update_id: row.id, update_type: row.update_type, customer_name: row.customer_name, .... });
CREATE INDEX ON :Update(update_id);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/tmp/facts.csv" AS row
CREATE (:Fact {update_id: row.update_id, status: row.status, ..... });
CREATE INDEX ON :Fact(update_id);
This gives me approx 650,000 Update nodes, and 21,000,000 Fact nodes.
Once the indexes are online, I try to create relationships between the nodes, as follows:
MATCH (a:Update)
WITH a
MATCH (b:Fact{update_id:a.update_id})
CREATE (b)-[:FROM]->(a)
This fails with an OutOfMemoryError. I believe this is because Neo4J does not commit the transaction until it completes, keeping it in memory.
What can I do to prevent this? I have read about USING PERIODIC COMMIT but it appears this is only useful when reading the CSV, as it doesn't work in my case:
neo4j-sh (?)$ USING PERIODIC COMMIT
> MATCH (a:Update)
> WITH a
> MATCH (b:Fact{update_id:a.update_id})
> CREATE (b)-[:FROM]->(a);
QueryExecutionKernelException: Invalid input 'M': expected whitespace, comment, an integer or LoadCSVQuery (line 2, column 1 (offset: 22))
"MATCH (a:Update)"
^
Is it possible to create relationships in this way, between large numbers of existing nodes, or do I need to take a different approach?
The OutOfMemoryError is expected, because Neo4j tries to commit everything in a single transaction. Since you didn't mention them, I assume your Java heap settings are at the default (512m).
You can, however, batch the process with a kind of pagination. I would also prefer MERGE to CREATE in this case:
MATCH (a:Update)
WITH a
SKIP 0
LIMIT 50000
MATCH (b:Fact{update_id:a.update_id})
MERGE (b)-[:FROM]->(a)
Increase SKIP by 50000 after each batch (LIMIT can stay at 50000) until you have covered all 650k Update nodes.
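As an alternative to editing SKIP by hand, the batching can be delegated to a procedure. The sketch below assumes the APOC plugin is installed, which is not the case on a default Neo4j install, and that a version matching your server is available. The batch size of 10000 is an arbitrary starting point.

```cypher
// The first statement streams the Update nodes; the second runs once per
// node, and APOC commits after every batchSize executions.
CALL apoc.periodic.iterate(
  "MATCH (a:Update) RETURN a",
  "MATCH (b:Fact {update_id: a.update_id}) MERGE (b)-[:FROM]->(a)",
  {batchSize: 10000, parallel: false}
);
```

Keeping parallel set to false avoids lock contention between batches at the cost of a longer run.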