Neo4j Cypher taking time to set relationship

I am relatively new to Neo4j.
I have imported a dataset of 12 million records and I have created a relationship between two nodes. When I created the relationship, I forgot to attach a property to it. Now I am trying to set the property on the relationship as follows:
LOAD CSV WITH HEADERS FROM 'file:///FileName.csv' AS row
MATCH (user:User{userID: USERID})
MATCH (order:Order{orderID: OrderId})
MATCH(user)-[acc:ORDERED]->(order)
SET acc.field1=field1,
acc.field2=field2;
But this query is taking too much time to execute.
I even tried USING INDEX hints on the user and order nodes:
MATCH (user:User{userID: USERID}) USING INDEX user:User(userID)
Isn't it possible to add new properties to a relationship at a later point?
Please let me know how I can do this operation quickly and efficiently.

You forgot to prefix your query with USING PERIODIC COMMIT;
without it, your query will build up transaction state for 24 million changes (property updates) and won't have enough memory to keep all that state.
You also forgot the row. prefix for the data that comes from your CSV, and those column names are inconsistently spelled.
If you run this from the Neo4j Browser, pay attention to any YELLOW warning signs.
Run
CREATE CONSTRAINT ON (u:User) ASSERT u.userID IS UNIQUE;
Run
CREATE CONSTRAINT ON (o:Order) ASSERT o.orderID IS UNIQUE;
Run
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///FileName.csv' AS row
with row.USERID as userID, row.OrderId as orderID
MATCH (user:User{userID: userID})
USING INDEX user:User(userID)
MATCH (order:Order{orderID: orderID})
USING INDEX order:Order(orderID)
MATCH(user)-[acc:ORDERED]->(order)
SET acc.field1=row.field1, acc.field2=row.field2;

Related

Neo4j Node Overwrites

I'm looking to perform periodic refreshes of node data in my neo4j db. A good example of my needs would be company employees -- where if an employee is terminated, they are removed from the graph completely and new employees are added.
Really, deleting all nodes of this label and ingesting a fresh dataset likely suffices -- but it feels quite ugly. Is there a more elegant solution? My fresh data exists in csv and I want to pull it in daily.
You could put a 'last updated' timestamp on your nodes. Each day, do your update using MERGE. If the csv data exists in the database, update the timestamp using the ON MATCH clause of MERGE. If the csv data doesn't exist, MERGE will create new nodes (make sure to add a timestamp property of some description). E.g.:
MERGE (n:Person {<selection_filter>})
ON CREATE SET <required_properties>, n.lastUpdated = date()
ON MATCH SET
n.lastUpdated = date()
After updating the graph with the csv data, run a query which deletes all nodes whose timestamps are before today's, i.e. that haven't been updated.
You might find creating an index on lastUpdated will improve performance for the delete query.
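A minimal sketch of that cleanup query, assuming the Person label and lastUpdated property from the template above:
// remove anything that today's refresh did not touch
MATCH (n:Person)
WHERE n.lastUpdated < date()
DETACH DELETE n;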
If your CSV file is all active employees, then you can do something like this (run as three separate statements):
// first, flag every existing employee as inactive
MATCH (e:Employee)
SET e.ActiveToday = False;
// then load the file (WITH HEADERS is needed to access line.employeeID by name);
// MERGE matches existing employees and creates new ones
LOAD CSV WITH HEADERS FROM "file:///employees.csv" AS line
MERGE (e:Employee {employeeID: line.employeeID})
SET e.ActiveToday = True;
// finally, remove everyone who wasn't in the file
MATCH (e:Employee {ActiveToday: False})
DETACH DELETE e;
The MERGE will create nodes for new employees in the file and match those that already exist. Both creates and matches will have their ActiveToday property updated. From there, you just match those where the property is still false and remove them.

Import Edgelist from CSV Neo4J

I'm trying to make a graph database from an edgelist and I'm kind of new to Neo4j, so I have this problem. First of all, the edgelist I got looks like this:
geneId geneSymbol diseaseId diseaseName score
10 NAT2 C0005695 Bladder Neoplasm 0.245871429880008
10 NAT2 C0013182 Drug Allergy 0.202681755307501
100 ADA C0002170 Alopecia 0.2
100 ADA C0002880 Autoimmune hemolytic anemia 0.2
100 ADA C0004096 Asthma 0.21105290517153
I have a lot of relationships like that (165k) between genes and their associated diseases.
I want to make a bipartite network in which nodes are genes or diseases, so I upload the data like this:
LOAD CSV WITH HEADERS FROM "file:///path/curated_gene_disease_associations.tsv" as row FIELDTERMINATOR '\t'
MERGE (g:Gene{geneId:row.geneId})
ON CREATE SET g.geneSymbol = row.geneSymbol
MERGE (d:Disease{diseaseId:row.diseaseId})
ON CREATE SET d.diseaseName = row.diseaseName
After a while (which is way longer than it takes in R to load the nodes using igraph), it's done and I've got the nodes; I used MERGE because I don't want to repeat the genes/diseases. The problem is that I don't know how to make the relationships. I've searched, and they always use something like:
MATCH (g:Gene {geneId: toInt(row.geneId)}), (d:Disease {diseaseId: toInt(row.geneId)})
CREATE (g)-[:RELATED_TO]->(d);
But when I run it, it says that there are no changes. I've seen the Neo4j tutorial, but when they do the relations they don't work with edgelists, so maybe the problem is how I merge the nodes so they don't repeat. I'd appreciate any help!
Looks like there might be two problems with your relationship query:
1) You're inserting (probably) as a string type (no toInt), and doing the MATCH query as an integer type (with toInt).
2) You're MATCHing the Disease node on row.geneId, not row.diseaseId.
Try the following modification:
MATCH (g:Gene {geneId: row.geneId}), (d:Disease {diseaseId: row.diseaseId})
CREATE (g)-[:RELATED_TO]->(d);
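For completeness, here is that fix in context; a sketch that assumes the same TSV file and field terminator as your node-loading query:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///path/curated_gene_disease_associations.tsv" AS row FIELDTERMINATOR '\t'
// both properties were stored as strings, so no toInt() on either side
MATCH (g:Gene {geneId: row.geneId}), (d:Disease {diseaseId: row.diseaseId})
CREATE (g)-[:RELATED_TO]->(d);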
@DanielKitchener's answer seems to address your main question.
With respect to the slowness of creating the nodes, you should create indexes (or uniqueness constraints, which automatically create indexes as well) on these label/property pairs:
:Gene(geneId)
:Disease(diseaseId)
For example, execute these 2 statements separately:
CREATE INDEX ON :Gene(geneId);
CREATE INDEX ON :Disease(diseaseId);
Once the DB has those indexes, your MERGE clauses should be much faster, as they would not have to scan through all existing Gene or Disease nodes to find possible matches.
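If you also want to guarantee uniqueness, uniqueness constraints give you the backing indexes for free; a sketch using the same constraint syntax as elsewhere on this page:
CREATE CONSTRAINT ON (g:Gene) ASSERT g.geneId IS UNIQUE;
CREATE CONSTRAINT ON (d:Disease) ASSERT d.diseaseId IS UNIQUE;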

Error creating relationships over huge dataset

My question is similar to the one pointed here :
Creating unique node and relationship NEO4J over huge dataset
I have 2 tables Entity (Entities.txt) & Relationships (EntitiesRelationships_Updated.txt) which looks like below: Both the tables are inside an import folder within the Neo4j database. What I am trying to do is load the tables using the load csv command and then create relationships.
As in the table below: If ParentID is 0, it means that ENT_ID does not have a parent. If it is populated, then it has a parent. For example in the table below, ENT_ID = 3 is the parent of ENT_ID = 4 and ENT_ID = 1 is the parent of ENT_ID = 2
**Entity Table**
ENT_ID Name PARENTID
1 ABC 0
2 DEF 1
3 GHI 0
4 JKG 3
**Relationship Table**
RID ENT_IDPARENT ENT_IDCHILD
1 1 2
2 3 5
The Entity table has 2 million records and the relationship table has about 400K lines.
Each RID has a particular tag associated with it. For example RID = 1 has it that the relation is A FATHER_OF B; RID = 2 has it that the relation is A MOTHER_OF B. Similarly there are 20 such RIDs associated.
Both of these are in txt format.
My first step is to load the entity table. I used the following script:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Entities.txt" AS Entity FIELDTERMINATOR '|'
CREATE (n:Entity{ENT_ID: toInt(Entity.ENT_ID),NAME: Entity.NAME,PARENTID: toInt(Entity.PARENTID)})
This query works fine. It takes about 10 minutes to load 2.8mil records. The next step I do is to index the records:
CREATE INDEX ON :Entity(PARENTID)
CREATE INDEX ON :Entity(ENT_ID)
This query runs fine as well. Following this I tried creating the relationships from the relationship table using a similar query as in the above link:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|'
MATCH (n:A {ENT_IDPARENT : Rships.ENT_IDPARENT})
with Entity, n
MATCH (m:B {ENT_IDCHILD : Rships.ENT_IDCHILD})
with m,n
MERGE (n)-[r:RELATION_OF]->(m);
As I do this, my query keeps running for about an hour and then stops at a particular size (in my case 2.2 GB). I followed this query based on the link above; this includes the edit from the solution below and still does not work.
I have one more query, which is as follows (based on the above link). I run this query because I want to create a relationship based on the Entity table:
PROFILE
MATCH(Entity)
MATCH (a:Entity {ENT_ID : Entity.ENT_ID})
WITH Entity, a
MATCH (b:Entity {PARENTID : Entity.PARENTID})
WITH a,b
MERGE (a)-[r:PARENT_OF]->(b)
When I run this query, I get a Java heap space error. Unfortunately, I have not been able to find a solution for either problem.
Could you please advise if I am doing something wrong?
This query allows you to take advantage of your :Entity(ENT_ID) index:
MATCH (child:Entity)
WHERE child.PARENTID > 0
WITH child.PARENTID AS pid, child
MATCH (parent:Entity {ENT_ID : pid})
MERGE (parent)-[:PARENT_OF]->(child);
Cypher does not use indexes when the property value comes from another node. To get around that, the above query uses a WITH clause to represent child.PARENTID as the variable pid. The time complexity of this query should be O(N). Your original query has a complexity of O(N * N).
[EDITED]
If the above query takes too long or encounters errors that might be related to running out of memory, try this variant, which creates 1000 new relationships at a time. You can change 1000 to any number that is workable for you.
MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT 1000
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);
The WHERE clause filters out child nodes that already have a parent relationship. And the MERGE operation has been changed to a simpler CREATE operation, since we have already ascertained that the relationship does not yet exist. The query returns a count of the number of relationships created. If the result is less than 1000, then all parent relationships have been created.
Finally, to make the repeated queries automated, you can install the APOC plugin on the neo4j server and use the apoc.periodic.commit procedure, which will repeatedly invoke a query until it returns 0. In this example, I use a limit parameter of 10000:
CALL apoc.periodic.commit(
"MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT {limit}
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);",
{limit: 10000});
Your entity creation Cypher looks fine, as do your indexes.
I am rather confused about the last two Cypher fragments though.
Since your relationships have a specific label or id associated with them, it's probably best to add your relationships by loading from the relationship table data. Note, though, that the node labels in your query (A and B) aren't used in your Entity creation and aren't in your graph, and neither are the ENT_IDPARENT or ENT_IDCHILD fields. It looks like this isn't really the Cypher you used, but an example you built off of?
I'd change this relationship creation query to this, setting the type property of the relationship for post-processing later (this assumes that there can only be one :RELATION_OF relation between the same two nodes):
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|'
// ENT_ID was stored with toInt(), so convert the CSV strings before matching
MATCH (parent:Entity {ENT_ID : toInt(Rships.ENT_IDPARENT)})
MATCH (child:Entity {ENT_ID : toInt(Rships.ENT_IDCHILD)})
MERGE (parent)-[r:RELATION_OF]->(child)
ON CREATE SET r.RID = Rships.RID;
Later on, if you like, you can match on your relationships with an RID and add the corresponding type ("FATHER_OF", "MOTHER_OF", etc.) property.
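A sketch of that post-processing, using the two RID-to-type pairs mentioned in the question (the RID values come straight from the CSV, so they are matched as strings here):
MATCH (:Entity)-[r:RELATION_OF {RID: "1"}]->(:Entity)
SET r.type = "FATHER_OF";
MATCH (:Entity)-[r:RELATION_OF {RID: "2"}]->(:Entity)
SET r.type = "MOTHER_OF";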
As for creating the :PARENT_OF relationship, you're doing some extra match on an Entity variable bound to every single node in your graph - get rid of that.
Instead, use this:
PROFILE
// first, match on all Entities with a PARENTID property
MATCH(child:Entity)
WHERE EXISTS(child.PARENTID)
// next, find the parent for each child by the child's PARENTID
WITH child
MATCH (parent:Entity {ENT_ID : child.PARENTID})
MERGE (parent)-[:PARENT_OF]->(child)
// lastly remove the parentid from the child, so it won't be reprocessed
// if we run the query again.
REMOVE child.PARENTID
EDITED the above query to use an existence check on child.PARENTID, and to remove child.PARENTID after the corresponding relationship has been created.
If you need a solution that uses batching, you could do this manually (adding LIMIT 100000 to your WITH child line), or you could install the APOC procedures library and use its apoc.periodic.commit() procedure to batch your processing, as sketched below.
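For example, a sketch combining the query above with the apoc.periodic.commit pattern from the other answer (using the same pid trick so the index stays usable):
CALL apoc.periodic.commit(
"MATCH (child:Entity)
WHERE EXISTS(child.PARENTID)
WITH child.PARENTID AS pid, child
LIMIT {limit}
MATCH (parent:Entity {ENT_ID : pid})
MERGE (parent)-[:PARENT_OF]->(child)
REMOVE child.PARENTID
RETURN COUNT(*)",
{limit: 100000});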

Basic / conceptual issues, query performance with Cypher and Neo4J

I'm doing a project on credit card fraud, and I've got some generated sample data in .CSV (pipe delimited) where each line is basically the person's info and the transaction details, along with the merchant name, etc. Since this is generated data, there's also a flag that indicates whether the transaction was fraudulent.
What I'm attempting to do is to load the data into Neo4j, create nodes (persons, transactions, and merchants), and then visualize a graph of the fraudulent charges to see if there are any common merchants. (I am aware there is a sample neo4j data set similar to this, but I'm attempting to apply this concept to a separate project).
I load the data in, create constraints, and then attempt my query, which seems to run forever.
Here are a few lines of example data..
ssn|cc_num|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|acct_num|profile|trans_num|trans_date|trans_time|unix_time|category|amt|is_fraud|merchant|merch_lat|merch_long
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|2e5186427c626815e47725e59cb04c9f|2013-03-21|02:01:05|1363831265|misc_net|838.47|1|fraud_Greenfelder, Bartoletti and Davis|31.616203|-110.221915
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|7d3f5eae923428c51b6bb396a3b50aab|2013-03-22|22:36:52|1363991812|shopping_net|907.03|1|fraud_Gerlach Inc|32.142740|-111.675048
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|76083345f18c5fa4be6e51e4d0ea3580|2013-03-22|16:40:20|1363970420|shopping_pos|912.03|1|fraud_Morissette PLC|31.909227|-111.3878746
The sample file I'm using has about 60k transactions.
Below is my cypher query / code thus far.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "card_data.csv"
AS line FIELDTERMINATOR '|'
CREATE (p:Person { id: toInt(line.cc_num), name_first: line.first, name_last: line.last })
CREATE (m:Merchant { id: line.merchant, name: line.merchant })
CREATE (t:Transaction { id: line.trans_num, merchant_name: line.merchant, card_number:line.cc_num, amount:line.amt, is_fraud:line.is_fraud, trans_date:line.trans_date, trans_time:line.trans_time })
create constraint on (t:Transaction) assert t.trans_num is unique;
create constraint on (p:Person) assert p.cc_num is unique;
MATCH (m:Merchant)
WITH m
MATCH (t:Transaction{merchant_name:m.merchant,is_fraud:1})
CREATE (m)-[:processed]->(t)
You can see in the 2nd MATCH query, I am attempting to specify that we only examine fraudulent transactions (is_fraud:1), and of the roughly 65k transactions, 230 have is_fraud:1.
Any ideas why this query would seem to run endlessly? I do have MUCH larger sets of data I'd like to examine this way, and the small-data results thus far are not promising (I'm sure due to my lack of understanding, not Neo4j's fault).
You don't show any index creation. To speed things up, you should create indexes on both merchant_name and is_fraud, to avoid going through all transaction nodes sequentially for a given merchant:
CREATE INDEX ON :Transaction(merchant_name)
CREATE INDEX ON :Transaction(is_fraud)
You create duplicate entries both for merchants as well as for people.
// not really needed if you don't merge transactions
// and if you don't look up transactions by trans_num
// create constraint on (t:Transaction) assert t.trans_num is unique;
// can't a person use multiple credit cards?
create constraint on (p:Person) assert p.cc_num is unique;
create constraint on (p:Person) assert p.id is unique;
create constraint on (m:Merchant) assert m.id is unique;
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///card_data.csv" AS line FIELDTERMINATOR '|'
MERGE (p:Person { id: toInt(line.cc_num)})
ON CREATE SET p.name_first = line.first, p.name_last = line.last
MERGE (m:Merchant { id: line.merchant}) ON CREATE SET m.name = line.merchant
CREATE (t:Transaction { id: line.trans_num, card_number: line.cc_num, amount: line.amt, merchant_name: line.merchant,
is_fraud: toInt(line.is_fraud), trans_date: line.trans_date, trans_time: line.trans_time })
CREATE (p)-[:issued]->(t)
// only connect fraudulent transactions to the merchant
// (the toInt above makes the comparison against the integer 1 work)
WITH m, t
WHERE t.is_fraud = 1
// also add an indicator label to the transaction for easier selection / processing later
SET t:Fraudulent
CREATE (m)-[:processed]->(t);
Alternatively, you can connect all transactions to the merchant and indicate fraud only via a label or alternative relationship types, as sketched below.
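A sketch of that alternative, reusing the same assumed file and columns as above, which connects every transaction and marks only the fraudulent ones with a label:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///card_data.csv" AS line FIELDTERMINATOR '|'
MERGE (p:Person { id: toInt(line.cc_num)})
ON CREATE SET p.name_first = line.first, p.name_last = line.last
MERGE (m:Merchant { id: line.merchant}) ON CREATE SET m.name = line.merchant
CREATE (t:Transaction { id: line.trans_num, card_number: line.cc_num, amount: line.amt,
is_fraud: toInt(line.is_fraud), trans_date: line.trans_date, trans_time: line.trans_time })
CREATE (p)-[:issued]->(t)
// connect every transaction to its merchant, fraudulent or not
CREATE (m)-[:processed]->(t)
// then flag just the fraudulent ones for easy selection later
WITH t
WHERE t.is_fraud = 1
SET t:Fraudulent;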

Neo4j Cypher Load CSV Failure on Unique Constraint

I'm having issues importing a large volume of data into a Neo4j instance using the Cypher LOAD CSV command. I'm attempting to load roughly 253k user records, each with a unique user_id. My first step was to add a unique constraint on the label to make sure each user would only be created once:
CREATE CONSTRAINT ON (b:User) ASSERT b.user_id IS UNIQUE;
I then tried to run LOAD CSV with periodic commits to pull this data in.
That query failed, so I next tried to MERGE the User record before setting its properties:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_users.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id)})-[:REGISTERED_TO]->(t)
set p.created=toInt(line.created), p.completed=toInt(line.completed);
Modifying the periodic commit value has made no difference; the same error is returned.
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})-[:REGISTERED_TO]->(t);
I receive the following error:
LoadCsvStatusWrapCypherException: Node 9752 already exists with label Person and property "hpcm_uk_buddy_id"=[2446] (Failure when processing URL 'file:/home/data/uk_buddies.csv' on line 253316 (which is the last row in the file). Possibly the last row committed during import is line 253299. Note that this information might not be accurate.)
The numbers seem to match up roughly, the CSV file contains 253315 records in total. The periodic commit doesn't seem to have taken effect either, a count of nodes returns only 5446 rows.
neo4j-sh (?)$ match (n) return count(n);
+----------+
| count(n) |
+----------+
| 5446 |
+----------+
1 row
768 ms
I can understand the number of nodes being incorrect if this ID is only roughly 5000 rows into the CSV file. But is there any technique or command I can use to make this import succeed?
You're falling victim to a common mistake with MERGE, I think. Seriously, this would be in my top 10 FAQs about common problems with Cypher. See, you're doing this:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})-[:REGISTERED_TO]->(t);
The way MERGE works, that last MERGE matches on the entire pattern, not just on the user node. So you're probably creating duplicate users that you shouldn't be. When you run this MERGE, even if a user with those exact properties already exists, the relationship to the t node doesn't, so it attempts to create a new user node with those attributes to connect to t, which isn't what you want.
The solution is to merge the user individually, then separately merge the relationship path, like this:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})
merge (p)-[:REGISTERED_TO]->(t);
Note the two MERGEs at the end. One creates just the user. If the user already exists, it won't try to create a duplicate, and you should hopefully be OK with your constraint (assuming there aren't two users with the same user_id but different created values). After you've merged just the user, you then merge the relationship.
The net result of the second query is the same, but shouldn't create duplicate users.
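If the file might repeat a user_id with different created or completed values, a safer variant (a sketch) merges only on the constrained key and sets the other properties on create:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id)})
on create set p.created=toInt(line.created), p.completed=toInt(line.completed)
merge (p)-[:REGISTERED_TO]->(t);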