I've got two CSV files, Job (30,000 entries) and Cat (30 entries), imported into Neo4j, and am trying to create a relationship between them.
Each Job has a cat_ID, and Cat contains the category name and ID.
After executing the following:
LOAD CSV WITH HEADERS FROM 'file:///DimCategory.csv' AS row
MATCH (job:Job {cat_ID: row.cat_ID})
MATCH (cat:category {category: row.category})
CREATE (job)-[r:under]->(cat)
it returns (no changes, no records).
I received a prompt recommending that I index the category, so I ran
CREATE INDEX ON :Job(cat_ID);
but I still get the same result.
How do I create a relationship between the two? I am able to get this to work on a smaller dataset.
You are probably trying to match on non-existing nodes. Try:
LOAD CSV WITH HEADERS FROM 'file:///DimCategory.csv' AS row
MERGE (job:Job {cat_ID: row.cat_ID})
MERGE (cat:category {category: row.category})
CREATE (job)-[r:under]->(cat)
Have a look in your logs and see if you are running out of memory.
You could try chunking the data set up into smaller pieces with Periodic Commit and see if that helps:
:auto USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///DimCategory.csv' AS row
MATCH (job:Job {cat_ID: row.cat_ID})
MATCH (cat:category {category: row.category})
CREATE (job)-[r:under]->(cat)
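If the MATCH version still returns no rows, a quick way to see which side fails to match (a sketch, assuming the labels and property names exactly as written above, including the lowercase category label) is:

```cypher
// Count, per CSV row, how many nodes each MATCH would find.
LOAD CSV WITH HEADERS FROM 'file:///DimCategory.csv' AS row
OPTIONAL MATCH (job:Job {cat_ID: row.cat_ID})
OPTIONAL MATCH (cat:category {category: row.category})
RETURN row.cat_ID, row.category, count(job) AS jobs, count(cat) AS cats
LIMIT 25
```

A zero in either count pinpoints the failing side; common culprits are a label-case mismatch (category vs Category) or cat_ID being stored as an integer while LOAD CSV always yields strings.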
I am importing the following into Neo4j:
categories.csv
CategoryName1
CategoryName2
CategoryName3
...
categories_relations.csv
category_parent category_child
CategoryName3 CategoryName10
CategoryName32 CategoryName41
...
Basically, categories_relations.csv shows parent-child relationships between the categories from categories.csv.
I imported the first csv file with the following query which went well and pretty quickly:
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///categories.csv' as line
CREATE (:Category {name:line[0]})
Then I imported the second csv file with:
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///categories_relations.csv' as line
MATCH (a:Category),(b:Category)
WHERE a.name = line[0] AND b.name = line[1]
CREATE (a)-[r:ISPARENTOF]->(b)
I have about 2 million nodes.
I tried executing the 2nd query and it is taking quite long. Can I make the query execute more quickly?
Confirm you are matching on the right property. You set only one property, name, when creating the Category nodes, so that is the property your second query must match on.
To make the second query execute faster, add an index on the property you are matching Category nodes on (here, name):
CREATE INDEX ON :Category(name)
If it still takes too long, you can refer to my answer on LOAD CSV here.
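With an index on the matched property online, the relationship import can also be batched so each chunk commits separately (a sketch reusing the file and label from the question):

```cypher
// Batch the relationship creation; the index makes each MATCH a lookup
// instead of a label scan.
USING PERIODIC COMMIT 1000
LOAD CSV FROM 'file:///categories_relations.csv' AS line
MATCH (a:Category {name: line[0]})
MATCH (b:Category {name: line[1]})
CREATE (a)-[:ISPARENTOF]->(b)
```

With ~2 million nodes, the difference between an indexed lookup and a full scan per CSV row is what dominates the runtime.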
I have a CSV file with the following headers:
jobid,prev_jobid,next_jobid,change_number,change_datetime,change_description,username,email,job_submittime,job_starttime,job_endtime,job_status
"27555","27552","0","134180","2017-09-07 17:39:06","Testing a new methodology",john,john#myco.com,"2017-09-07 17:39:09","2017-09-07 18:11:10","success"
"27552","27549","27555","134178","2017-09-07 17:32:06","bug fix",emma,emma#myco.co,"2017-09-07 17:29:09","2017-09-07 17:11:10","success"
..
..
I've loaded up the CSV and created 3 types of nodes:
LOAD CSV WITH HEADERS FROM "file:///wbdqueue.csv" AS bud
CREATE (j:job{id:bud.jobid,pid:bud.prev_jobid,nid:bud.next_jobid,
add_time:bud.job_submittime,start_time:bud.job_starttime,end_time:bud.job_endtime,status:bud.job_status})
CREATE (c:cl{clnum:bud.change_number,time:bud.change_datetime,desc:bud.change_description})
CREATE (u:user{user:bud.username,email:bud.email})
I am then attempting to create relationships like so:
LOAD CSV WITH HEADERS FROM "file:///wbdqueue.csv" AS node
MATCH (c:cl),(u:user) WHERE c.clnum=node.change_number AND u.user=node.username
CREATE (u)-[:SUBMITTED]->(c)
First of all, there is a warning in the browser that this builds a cartesian product of all disconnected patterns, may take a lot of memory/time, and suggests adding an optional MATCH.
Secondly, I've given it many hours (>3 days), but this does not create any relationships.
What am I doing wrong with my query?
What I am eventually trying to achieve is something like this (if you run the following in your Neo4j console you should get the sample graph that I am also attaching; I've reduced the properties to keep this visual example simple):
CREATE
(`0` :user ) ,
(`1` :job ) ,
(`2` :change_number ) ,
(`3` :user ) ,
(`4` :change_number ) ,
(`5` :job ) ,
(`0`)-[:`SUBMITTED`]->(`2`),
(`2`)-[:`CAUSED`]->(`1`),
(`3`)-[:`SUBMITTED`]->(`4`),
(`2`)-[:`NEXT_CHECKIN`]->(`4`),
(`4`)-[:`CAUSED`]->(`5`),
(`1`)-[:`NEXT_JOB`]->(`5`),
(`5`)-[:`PREVIOUS_JOB`]->(`1`)
Thank you
The warning about the cartesian product occurs because you are matching multiple disconnected patterns (MATCH (c:cl),(u:user)).
You can use MERGE instead of MATCH & WHERE followed by CREATE; also, add USING PERIODIC COMMIT to reduce the memory overhead of the transaction state. Note that the relationship should run from the user node to the cl node, as in your target graph:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///wbdqueue.csv" AS node
MERGE (c:cl {clnum: node.change_number})
MERGE (u:user {user: node.username})
MERGE (u)-[:SUBMITTED]->(c)
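Whichever form the write takes, each CSV row triggers lookups on cl.clnum and user.user; creating indexes first (a sketch, using the labels and property names from the question) avoids a full label scan per row, which is why the original query ran for days:

```cypher
// One-time setup before the relationship import.
CREATE INDEX ON :cl(clnum);
CREATE INDEX ON :user(user);
```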
I am trying to add relationship between existing employee nodes in my sample database from csv file using the following commands:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
'file:///newmsg1.csv' AS line
WITH line
MATCH (e:Employee {mail: line.fromemail}), (b:Employee {mail: line.toemail})
CREATE (e)-[m:Message]->(b);
The problem I am facing is that while there are only 71,253 entries in the CSV file, each with a "fromemail" and a "toemail",
I am getting "Created 240643 relationships, completed after 506170 ms." as the output. I am not able to understand what I am doing wrong. Kindly help me. Thanks in advance!
The extra relationships most likely come from duplicate rows in the CSV, or from several Employee nodes sharing the same mail value (each matching (e, b) pair then multiplies the count). You can use MERGE to ensure the uniqueness of relationships:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
'file:///newmsg1.csv' AS line
WITH line
MATCH (e:Employee {mail: line.fromemail}), (b:Employee {mail: line.toemail})
MERGE (e)-[m:Message]->(b);
Try changing your CREATE to CREATE UNIQUE (note: CREATE UNIQUE was removed in Neo4j 4.0; on recent versions use MERGE instead):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
'file:///newmsg1.csv' AS line
WITH line
MATCH (e:Employee {mail: line.fromemail}), (b:Employee {mail: line.toemail})
CREATE UNIQUE (e)-[m:Message]->(b);
From the docs:
CREATE UNIQUE is in the middle of MATCH and CREATE — it will match what it can, and create what is missing. CREATE UNIQUE will always make the least change possible to the graph — if it can use parts of the existing graph, it will.
I am trying to run the following query to create my nodes and relationships from a .csv file that I have:
USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM 'file:///LoanStats3bEDITED.csv' AS line
//USING PERIODIC COMMIT 1000 makes sure we don't get a memory error
//creating the nodes with their properties
//member node
CREATE (member:Person{member_id:TOINT(line.member_id)})
//Personal information node
CREATE (personalInformation:PersonalInformation{addr_state:line.addr_state})
//recordHistory node
CREATE (recordHistory:RecordHistory{delinq_2yrs:TOFLOAT(line.delinq_2yrs),earliest_cr_line:line.earliest_cr_line,inq_last_6mths:TOFLOAT(line.inq_last_6mths),collections_12_mths_ex_med:TOFLOAT(line.collections_12_mths_ex_med),delinq_amnt:TOFLOAT(line.delinq_amnt),percent_bc_gt_75:TOFLOAT(line.percent_bc_gt_75), pub_rec_bankruptcies:TOFLOAT(line.pub_rec_bankruptcies), tax_liens:TOFLOAT(line.tax_liens)})
//Loan node
CREATE (loan:Loan{funded_amnt:TOFLOAT(line.funded_amnt),term:line.term, int_rate:line.int_rate, installment:TOFLOAT(line.installment),purpose:line.purpose})
//Customer Finances node
CREATE (customerFinances:CustomerFinances{emp_length:line.emp_length,verification_status_joint:line.verification_status_joint,home_ownership:line.home_ownership, annual_inc:TOFLOAT(line.annual_inc), verification_status:line.verification_status,dti:TOFLOAT(line.dti), annual_inc_joint:TOFLOAT(line.annual_inc_joint),dti_joint:TOFLOAT(line.dti_joint)})
//Accounts node
CREATE (accounts:Accounts{revol_util:line.revol_util,tot_cur_bal:TOFLOAT(line.tot_cur_bal)})
//creating the relationships
CREATE UNIQUE (member)-[:FINANCIAL{issue_d:line.issue_d,loan_status:line.loan_status, application_type:line.application_type}]->(loan)
CREATE UNIQUE (customerFinances)<-[:FINANCIAL]-(member)
CREATE UNIQUE (accounts)<-[:FINANCIAL{open_acc:TOINT(line.open_acc),total_acc:TOFLOAT(line.total_acc)}]-(member)
CREATE UNIQUE (personalInformation)<-[:PERSONAL]-(member)
CREATE UNIQUE (recordHistory)<-[:HISTORY]-(member)
However, I keep getting the following error:
Unable to rollback transaction
What does this mean and how can I fix my query so it can be run successfully?
I am now getting the following error:
GC overhead limit exceeded
I think you are out of memory.
Solutions:
Use the neo4j-import batch tool.
Split your queries.
Create constraints to speed up the queries.
Why do you need CREATE UNIQUE? You could just use CREATE if your CSV is clean, or use MERGE.
It could also be faster to execute the query in the shell rather than in the browser.
Add more RAM :-)
If you really need the uniqueness of relationships, replace CREATE UNIQUE with MERGE.
Also, your repeated CREATE UNIQUE / MERGE operations on FINANCIAL cause Cypher to materialize the whole intermediate result with an Eager operator before each of the three operations, so that reads and writes don't interfere with each other.
That's why the periodic commit does not take effect, and the whole intermediate result uses too much memory.
Something else you could do is to use the APOC library and apoc.periodic.iterate instead for batching.
call apoc.periodic.iterate("
LOAD CSV WITH HEADERS FROM 'file:///LoanStats3bEDITED.csv' AS line RETURN line
","
//member node
CREATE (member:Person{member_id:TOINT(line.member_id)})
//Personal information node
CREATE (personalInformation:PersonalInformation{addr_state:line.addr_state})
//recordHistory node
CREATE (recordHistory:RecordHistory{delinq_2yrs:TOFLOAT(line.delinq_2yrs),earliest_cr_line:line.earliest_cr_line,inq_last_6mths:TOFLOAT(line.inq_last_6mths),collections_12_mths_ex_med:TOFLOAT(line.collections_12_mths_ex_med),delinq_amnt:TOFLOAT(line.delinq_amnt),percent_bc_gt_75:TOFLOAT(line.percent_bc_gt_75), pub_rec_bankruptcies:TOFLOAT(line.pub_rec_bankruptcies), tax_liens:TOFLOAT(line.tax_liens)})
//Loan node
CREATE (loan:Loan{funded_amnt:TOFLOAT(line.funded_amnt),term:line.term, int_rate:line.int_rate, installment:TOFLOAT(line.installment),purpose:line.purpose})
//Customer Finances node
CREATE (customerFinances:CustomerFinances{emp_length:line.emp_length,verification_status_joint:line.verification_status_joint,home_ownership:line.home_ownership, annual_inc:TOFLOAT(line.annual_inc), verification_status:line.verification_status,dti:TOFLOAT(line.dti), annual_inc_joint:TOFLOAT(line.annual_inc_joint),dti_joint:TOFLOAT(line.dti_joint)})
//Accounts node
CREATE (accounts:Accounts{revol_util:line.revol_util,tot_cur_bal:TOFLOAT(line.tot_cur_bal)})
//creating the relationships
MERGE (member)-[:FINANCIAL{issue_d:line.issue_d,loan_status:line.loan_status, application_type:line.application_type}]->(loan)
MERGE (customerFinances)<-[:FINANCIAL]-(member)
MERGE (accounts)<-[:FINANCIAL{open_acc:TOINT(line.open_acc),total_acc:TOFLOAT(line.total_acc)}]-(member)
MERGE (personalInformation)<-[:PERSONAL]-(member)
MERGE (recordHistory)<-[:HISTORY]-(member)
", {batchSize:1000, iterateList:true})
I'm new to Neo4j, and I want to try it on some data I've exported from MySQL. I've got the community edition running with neo4j console, and I'm entering commands using the neo4j-shell command-line client.
I have 2 CSV files, that I use to create 2 types of node, as follows:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/tmp/updates.csv" AS row
CREATE (:Update {update_id: row.id, update_type: row.update_type, customer_name: row.customer_name, .... });
CREATE INDEX ON :Update(update_id);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/tmp/facts.csv" AS row
CREATE (:Fact {update_id: row.update_id, status: row.status, ..... });
CREATE INDEX ON :Fact(update_id);
This gives me approx 650,000 Update nodes, and 21,000,000 Fact nodes.
Once the indexes are online, I try to create relationships between the nodes, as follows:
MATCH (a:Update)
WITH a
MATCH (b:Fact{update_id:a.update_id})
CREATE (b)-[:FROM]->(a)
This fails with an OutOfMemoryError. I believe this is because Neo4J does not commit the transaction until it completes, keeping it in memory.
What can I do to prevent this? I have read about USING PERIODIC COMMIT but it appears this is only useful when reading the CSV, as it doesn't work in my case:
neo4j-sh (?)$ USING PERIODIC COMMIT
> MATCH (a:Update)
> WITH a
> MATCH (b:Fact{update_id:a.update_id})
> CREATE (b)-[:FROM]->(a);
QueryExecutionKernelException: Invalid input 'M': expected whitespace, comment, an integer or LoadCSVQuery (line 2, column 1 (offset: 22))
"MATCH (a:Update)"
^
Is it possible to create relationships in this way, between large numbers of existing nodes, or do I need to take a different approach?
The OutOfMemoryError is expected, as Neo4j will try to commit everything in a single transaction; since you didn't say otherwise, I assume the Java heap settings are at their default (512 MB).
You can, however, batch the process with a kind of pagination, only I would prefer to use MERGE rather than CREATE in this case:
MATCH (a:Update)
WITH a
SKIP 0
LIMIT 50000
MATCH (b:Fact{update_id:a.update_id})
MERGE (b)-[:FROM]->(a)
Increase SKIP by the LIMIT amount after each batch until you reach the 650k Update nodes.
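If the APOC library is available (an assumption; it ships separately from Neo4j), the same batching works without manual SKIP/LIMIT bookkeeping:

```cypher
// Iterate over all Update nodes, linking matching Facts in batches.
CALL apoc.periodic.iterate(
  "MATCH (a:Update) RETURN a",
  "MATCH (b:Fact {update_id: a.update_id}) MERGE (b)-[:FROM]->(a)",
  {batchSize: 10000}
);
```

Each batch of 10,000 Update nodes is committed in its own transaction, so the transaction state never grows large enough to exhaust the heap.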