I'm running two different Neo4j servers and running the same queries on them in the same order.
I want to check whether both databases are equivalent. For that purpose I'm generating a dump of the entire database (and I do realize this is not a realistic option for huge databases) with bin/neo4j-shell -c "dump" > /home/my_user/dump.txt and then comparing the MD5 hashes generated with md5sum dump.txt.
The generated dump.txt files look pretty much identical except that the variables which identify the nodes are sometimes different, which of course generates completely different hashes for each file. Example:
dump.txt #1
begin
commit
begin
create (_6:`Person` {`name`:"Arthur", `title`:"King"})
create (_7:`Person` {`name`:"Saladin", `title`:"Sultan"})
create (_8:`Army` {`name`:"Saxon army"})
create (_6)-[:`FIGHTS_AGAINST`]->(_8)
create (_7)-[:`LEADS`]->(_8)
;
commit
dump.txt #2
begin
commit
begin
create (_7:`Person` {`name`:"Arthur", `title`:"King"})
create (_8:`Person` {`name`:"Saladin", `title`:"Sultan"})
create (_9:`Army` {`name`:"Saxon army"})
create (_7)-[:`FIGHTS_AGAINST`]->(_9)
create (_8)-[:`LEADS`]->(_9)
;
commit
I'm guessing the identifier is based on the number of nodes the database has created so far, and cleaning it out with MATCH (n) DETACH DELETE n doesn't reset this counter. The only way I've found to reset it is by restarting the server, which isn't exactly practical.
I guess the simplest way to solve my issue would be a script that erases all numbers preceded by an _, but couldn't that, in very specific situations, generate false positives? For example, if the queries were very similar and in the same order but updated different nodes.
Does anyone have a better alternative? Maybe a command to reset this node counter?
Ended up using a sed regular expression in Bash to achieve the desired result of removing all node identifiers from the files.
sed 's/(_[0-9]*/(/g' dump.txt > dump_new.txt
Which from a dump.txt like this one:
begin
commit
begin
create (_18:`Person` {`name`:"Arthur", `title`:"King"})
create (_19:`Person` {`name`:"Saladin", `title`:"Sultan"})
create (_20:`Army` {`name`:"Saxon army"})
create (_18)-[:`FIGHTS_AGAINST`]->(_20)
create (_19)-[:`LEADS`]->(_20)
;
commit
Generates a dump_new.txt like this one:
begin
commit
begin
create (:`Person` {`name`:"Arthur", `title`:"King"})
create (:`Person` {`name`:"Saladin", `title`:"Sultan"})
create (:`Army` {`name`:"Saxon army"})
create ()-[:`FIGHTS_AGAINST`]->()
create ()-[:`LEADS`]->()
;
commit
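With that in place, the two servers can be compared by normalizing both dumps and hashing the results (a small sketch; the dump file names are just placeholders):
# strip the _N identifiers, then compare the checksums of the normalized dumps
sed 's/(_[0-9]*/(/g' dump1.txt > dump1_clean.txt
sed 's/(_[0-9]*/(/g' dump2.txt > dump2_clean.txt
md5sum dump1_clean.txt dump2_clean.txt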
I have a LOAD CSV Cypher script that creates and sets properties for nodes and edges.
I want to add a parameter at run time (i.e. when I do cat mycypher.cql | cypher-shell -u xxxx -p xxx) so that a key property gets set on nodes -- like so:
LOAD CSV WITH HEADERS FROM $MY_CSV AS row
MERGE (a:abcLabel {abcId: toInteger(row.abc_id), extraProp: $EXTRA_PROPERTY})
ON CREATE SET
a.name = row.abc_name
MERGE (b:bcdLabel {bcdId: toInteger(row.bcd_id), extraProp: $EXTRA_PROPERTY})
ON CREATE SET
etc ....
Now, I know that I can't use shell-style parameters, but is there a way to set $EXTRA_PROPERTY and $MY_CSV so that I can rerun the script against a separate data set and ensure that a subsequent MATCH (:abcLabel {extraProp: "xyz"}) will return the nodes that were given the "xyz" property?
In principle this would be completely automated and templated so I will never do a manual load.
TIA
The upcoming version 1.2 of cypher-shell will support the command-line option --param, which allows you to specify Cypher parameters.
Here is the merged pull request.
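Once that option is available, the invocation from the question could look roughly like this (a sketch only; the exact quoting of the name => value pairs may differ in the released version, and the CSV path is a placeholder):
cat mycypher.cql | cypher-shell -u xxxx -p xxx \
  --param "MY_CSV => 'file:///my_data.csv'" \
  --param "EXTRA_PROPERTY => 'xyz'"
Inside the script the values would then be available as $MY_CSV and $EXTRA_PROPERTY, exactly as written in the MERGE above.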
I am trying to run the following query to create my nodes and relationships from a .csv file that I have:
USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM 'file:///LoanStats3bEDITED.csv' AS line
//USING PERIODIC COMMIT 1000 makes sure we don't get a memory error
//creating the nodes with their properties
//member node
CREATE (member:Person{member_id:TOINT(line.member_id)})
//Personal information node
CREATE (personalInformation:PersonalInformation{addr_state:line.addr_state})
//recordHistory node
CREATE (recordHistory:RecordHistory{delinq_2yrs:TOFLOAT(line.delinq_2yrs),earliest_cr_line:line.earliest_cr_line,inq_last_6mths:TOFLOAT(line.inq_last_6mths),collections_12_mths_ex_med:TOFLOAT(line.collections_12_mths_ex_med),delinq_amnt:TOFLOAT(line.delinq_amnt),percent_bc_gt_75:TOFLOAT(line.percent_bc_gt_75), pub_rec_bankruptcies:TOFLOAT(line.pub_rec_bankruptcies), tax_liens:TOFLOAT(line.tax_liens)})
//Loan node
CREATE (loan:Loan{funded_amnt:TOFLOAT(line.funded_amnt),term:line.term, int_rate:line.int_rate, installment:TOFLOAT(line.installment),purpose:line.purpose})
//Customer Finances node
CREATE (customerFinances:CustomerFinances{emp_length:line.emp_length,verification_status_joint:line.verification_status_joint,home_ownership:line.home_ownership, annual_inc:TOFLOAT(line.annual_inc), verification_status:line.verification_status,dti:TOFLOAT(line.dti), annual_inc_joint:TOFLOAT(line.annual_inc_joint),dti_joint:TOFLOAT(line.dti_joint)})
//Accounts node
CREATE (accounts:Accounts{revol_util:line.revol_util,tot_cur_bal:TOFLOAT(line.tot_cur_bal)})
//creating the relationships
CREATE UNIQUE (member)-[:FINANCIAL{issue_d:line.issue_d,loan_status:line.loan_status, application_type:line.application_type}]->(loan)
CREATE UNIQUE (customerFinances)<-[:FINANCIAL]-(member)
CREATE UNIQUE (accounts)<-[:FINANCIAL{open_acc:TOINT(line.open_acc),total_acc:TOFLOAT(line.total_acc)}]-(member)
CREATE UNIQUE (personalInformation)<-[:PERSONAL]-(member)
CREATE UNIQUE (recordHistory)<-[:HISTORY]-(member)
However, I keep getting the following error:
Unable to rollback transaction
What does this mean and how can I fix my query so it can be run successfully?
I am now getting the following error:
GC overhead limit exceeded
I think you are out of memory.
Solutions:
Use the neo4j-import batch importer.
Split your queries.
Create constraints to speed up the queries (see the sketch after this list).
Why do you need CREATE UNIQUE? You could just use CREATE if your CSV is clean, or use MERGE.
I think it could also be faster if you execute the query in the shell rather than in the browser.
Download more RAM :-)
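As an illustration of the constraints point, a minimal sketch assuming the Neo4j 3.x syntax of the time (the :Person label and member_id property are taken from the query above):
// make member_id unique so lookups and merges on :Person are index-backed
CREATE CONSTRAINT ON (p:Person) ASSERT p.member_id IS UNIQUE;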
If you really need the uniqueness of relationships, replace CREATE UNIQUE with MERGE.
Also, your repeated MERGE operations on FINANCIAL cause Cypher to materialize the whole result before each of the three operations with an Eager operator, so that it doesn't run into endless loops.
That's why the periodic commit does not take effect, and the whole intermediate result uses too much memory.
Something else you could do is to use the APOC library and apoc.periodic.iterate instead for batching.
call apoc.periodic.iterate("
LOAD CSV WITH HEADERS FROM 'file:///LoanStats3bEDITED.csv' AS line RETURN line
","
//member node
CREATE (member:Person{member_id:TOINT(line.member_id)})
//Personal information node
CREATE (personalInformation:PersonalInformation{addr_state:line.addr_state})
//recordHistory node
CREATE (recordHistory:RecordHistory{delinq_2yrs:TOFLOAT(line.delinq_2yrs),earliest_cr_line:line.earliest_cr_line,inq_last_6mths:TOFLOAT(line.inq_last_6mths),collections_12_mths_ex_med:TOFLOAT(line.collections_12_mths_ex_med),delinq_amnt:TOFLOAT(line.delinq_amnt),percent_bc_gt_75:TOFLOAT(line.percent_bc_gt_75), pub_rec_bankruptcies:TOFLOAT(line.pub_rec_bankruptcies), tax_liens:TOFLOAT(line.tax_liens)})
//Loan node
CREATE (loan:Loan{funded_amnt:TOFLOAT(line.funded_amnt),term:line.term, int_rate:line.int_rate, installment:TOFLOAT(line.installment),purpose:line.purpose})
//Customer Finances node
CREATE (customerFinances:CustomerFinances{emp_length:line.emp_length,verification_status_joint:line.verification_status_joint,home_ownership:line.home_ownership, annual_inc:TOFLOAT(line.annual_inc), verification_status:line.verification_status,dti:TOFLOAT(line.dti), annual_inc_joint:TOFLOAT(line.annual_inc_joint),dti_joint:TOFLOAT(line.dti_joint)})
//Accounts node
CREATE (accounts:Accounts{revol_util:line.revol_util,tot_cur_bal:TOFLOAT(line.tot_cur_bal)})
//creating the relationships
MERGE (member)-[:FINANCIAL{issue_d:line.issue_d,loan_status:line.loan_status, application_type:line.application_type}]->(loan)
MERGE (customerFinances)<-[:FINANCIAL]-(member)
MERGE (accounts)<-[:FINANCIAL{open_acc:TOINT(line.open_acc),total_acc:TOFLOAT(line.total_acc)}]-(member)
MERGE (personalInformation)<-[:PERSONAL]-(member)
MERGE (recordHistory)<-[:HISTORY]-(member)
", {batchSize:1000, iterateList:true})
I'm new to Neo4J, and I want to try it on some data I've exported from MySQL. I've got the community edition running with neo4j console, and I'm entering commands using the neo4j-shell command line client.
I have 2 CSV files, that I use to create 2 types of node, as follows:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/tmp/updates.csv" AS row
CREATE (:Update {update_id: row.id, update_type: row.update_type, customer_name: row.customer_name, .... });
CREATE INDEX ON :Update(update_id);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/tmp/facts.csv" AS row
CREATE (:Fact {update_id: row.update_id, status: row.status, ..... });
CREATE INDEX ON :Fact(update_id);
This gives me approx 650,000 Update nodes, and 21,000,000 Fact nodes.
Once the indexes are online, I try to create relationships between the nodes, as follows:
MATCH (a:Update)
WITH a
MATCH (b:Fact{update_id:a.update_id})
CREATE (b)-[:FROM]->(a)
This fails with an OutOfMemoryError. I believe this is because Neo4j does not commit the transaction until it completes, keeping it all in memory.
What can I do to prevent this? I have read about USING PERIODIC COMMIT but it appears this is only useful when reading the CSV, as it doesn't work in my case:
neo4j-sh (?)$ USING PERIODIC COMMIT
> MATCH (a:Update)
> WITH a
> MATCH (b:Fact{update_id:a.update_id})
> CREATE (b)-[:FROM]->(a);
QueryExecutionKernelException: Invalid input 'M': expected whitespace, comment, an integer or LoadCSVQuery (line 2, column 1 (offset: 22))
"MATCH (a:Update)"
^
Is it possible to create relationships in this way, between large numbers of existing nodes, or do I need to take a different approach?
The OutOfMemoryError is expected, as Neo4j will try to commit it all at once, and since you didn't specify otherwise, I assume the Java heap settings are at their default (512m).
You can, however, batch the process with a kind of pagination; I would also prefer to use MERGE rather than CREATE in this case:
MATCH (a:Update)
WITH a
SKIP 0
LIMIT 50000
MATCH (b:Fact{update_id:a.update_id})
MERGE (b)-[:FROM]->(a)
Modify SKIP and LIMIT after each batch until you reach the 650k Update nodes.
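Since neo4j-shell is already in use here, a small shell loop could drive those batches automatically (a rough sketch, assuming default connection settings and that 50,000 rows per batch fit in the heap):
# run one 50k batch per iteration; the end value goes slightly past the ~650k Update nodes
for skip in $(seq 0 50000 650000); do
  bin/neo4j-shell -c "MATCH (a:Update) WITH a SKIP ${skip} LIMIT 50000 MATCH (b:Fact {update_id: a.update_id}) MERGE (b)-[:FROM]->(a);"
done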
For versioning purposes I want to have a VERSION node in my DB. This node and its relationship should only be created if the property size of a FILE node changes. The FILE node should always contain the current value for size. Since it isn't possible to use MATCH and WHERE inside a FOREACH, how can I make such a scenario work? Is there something like an "if clause" for use with FOREACH?
MERGE (root:FOLDER {fullpath: {newRoot}.fullpath})
ON CREATE SET root={newRoot}
FOREACH (newFile IN {newFiles} |
MERGE (file:FILE {fullpath:newFile.fullpath}) ON CREATE SET file.guid = newFile.guid
SET file.visited = 1
MERGE (root)-[:CONTAINS]->(file)
And the following two lines should only execute if file.size != newFile.size (size changed)
SET file.size = newFile.size
CREATE (file)-[:HASVERSION{v:1}]->(v:VERSION {size:newFile.size})
I hope you understand what I want to achieve. Please advise if there are better solutions for node versioning.
fadanner,
Here's what I found. This would work in the earlier answer I gave you as well (instead of the FOREACH).
MERGE (root:FOLDER {fullpath: {newRoot}.fullpath})
ON CREATE SET root={newRoot}
WITH root UNWIND {newFiles} AS newFile
MERGE (file:FILE {fullpath : newFile.fullpath})
ON CREATE SET file.guid = newFile.guid
SET file.visited = 1
MERGE (root)-[:CONTAINS]->(file)
WITH file, newFile
WHERE file.size <> newFile.size
CREATE (file)-[:HASVERSION {v : 1}]->(v:VERSION {size : newFile.size})
This is more complicated than the simple example that I tested, so there may be errors. But see if something like this won't solve your problem.
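One detail from the question that the query above doesn't cover is keeping file.size current on the FILE node. A hedged tweak (untested, following the same pattern) would be to add a SET between the WHERE and the CREATE:
WITH file, newFile
WHERE file.size <> newFile.size
SET file.size = newFile.size
CREATE (file)-[:HASVERSION {v : 1}]->(v:VERSION {size : newFile.size})
Note that brand-new FILE nodes have no file.size yet, so the <> comparison evaluates to null and filters them out; they would need their own handling.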
Grace and peace,
Jim
In production, I am facing this problem.
There is a DELETE which is taking a long time to execute and finally throws SQL error -243.
I got the query using onstat -g.
Is there any way to find out what is causing it to take this much time and finally error out?
It uses COMMITTED READ isolation.
This is causing high Informix CPU usage as well.
EDIT
Environment - Informix 9.2 on Solaris
I do not see any issue related to indexes or application logic, but I suspect some Informix corruption.
The session holds 8 locks on different tables while executing this DELETE query.
However, I do not see any locks on the table on which the DELETE is performed.
Could it be that Informix is unable to get a lock on the table?
DELETE doesn't care about your isolation level. You are getting -243 because another process is locking the table while you're trying to run your delete operation.
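Before changing any code, it may help to confirm what is blocking; if I remember the onstat options correctly, something like this shows the current locks and the SQL of a suspect session (the session id here is just a placeholder):
onstat -k          # list current locks and their owners
onstat -g sql 1234 # show the SQL being run by session 1234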
I would put your delete into an SP and commit every Xth record:
CREATE PROCEDURE tmp_delete_sp (
p_commit_records INTEGER
)
RETURNING
INTEGER,
VARCHAR(64);
DEFINE l_current_count INTEGER;
SET LOCK MODE TO WAIT 5; -- Wait 5 seconds if another process is locking the table.
LET l_current_count = 0; -- initialise the counter before it is incremented
BEGIN WORK;
FOREACH WITH HOLD
SELECT .....
DELETE FROM table WHERE ref = ^^ Ref from above;
LET l_current_count = l_current_count + 1;
IF (l_current_count >= p_commit_records) THEN
COMMIT WORK;
BEGIN WORK;
LET l_current_count = 0;
END IF;
END FOREACH;
COMMIT WORK;
RETURN 0, 'Deleted records';
END PROCEDURE;
Some syntax issues there, but it's a good starting block for you. Remember, inserts and updates get incrementally slower as you use more logical logs.
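For example, committing every 1,000 rows would then just be:
EXECUTE PROCEDURE tmp_delete_sp(1000);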
Informix had been restarted ungracefully many times, which led to instability.
This was the root cause.