Basic / conceptual issues, query performance with Cypher and Neo4j

I'm doing a project on credit card fraud, and I've got some generated sample data in .CSV (pipe delimited) where each line is basically the person's info, the transaction details along with the merchant name, etc. Since this is generated data, there's also a flag that indicates whether the transaction was fraudulent.
What I'm attempting to do is to load the data into Neo4j, create nodes (persons, transactions, and merchants), and then visualize a graph of the fraudulent charges to see if there are any common merchants. (I am aware there is a sample neo4j data set similar to this, but I'm attempting to apply this concept to a separate project).
I load the data in, create constraints, and then attempt my query, which seems to run forever.
Here are a few lines of example data:
ssn|cc_num|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|acct_num|profile|trans_num|trans_date|trans_time|unix_time|category|amt|is_fraud|merchant|merch_lat|merch_long
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|2e5186427c626815e47725e59cb04c9f|2013-03-21|02:01:05|1363831265|misc_net|838.47|1|fraud_Greenfelder, Bartoletti and Davis|31.616203|-110.221915
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|7d3f5eae923428c51b6bb396a3b50aab|2013-03-22|22:36:52|1363991812|shopping_net|907.03|1|fraud_Gerlach Inc|32.142740|-111.675048
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|76083345f18c5fa4be6e51e4d0ea3580|2013-03-22|16:40:20|1363970420|shopping_pos|912.03|1|fraud_Morissette PLC|31.909227|-111.3878746
The sample file I'm using has about 60k transactions.
Below is my Cypher query / code thus far:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "card_data.csv"
AS line FIELDTERMINATOR '|'
CREATE (p:Person { id: toInt(line.cc_num), name_first: line.first, name_last: line.last })
CREATE (m:Merchant { id: line.merchant, name: line.merchant })
CREATE (t:Transaction { id: line.trans_num, merchant_name: line.merchant, card_number:line.cc_num, amount:line.amt, is_fraud:line.is_fraud, trans_date:line.trans_date, trans_time:line.trans_time })
create constraint on (t:Transaction) assert t.trans_num is unique;
create constraint on (p:Person) assert p.cc_num is unique;
MATCH (m:Merchant)
WITH m
MATCH (t:Transaction{merchant_name:m.merchant,is_fraud:1})
CREATE (m)-[:processed]->(t)
You can see that in the 2nd MATCH query I am attempting to specify that we only examine fraudulent transactions (is_fraud:1); of the roughly 65k transactions, 230 have is_fraud:1.
Any ideas why this query would seem to run endlessly? I do have MUCH larger sets of data I'd like to examine this way, and the small-data results thus far are not promising (I'm sure due to my lack of understanding, not Neo4j's fault).

You don't show any index creation. To speed things up, you should create an index on both merchant_name and is_fraud, to avoid going through all transaction nodes sequentially for a given merchant:
CREATE INDEX ON :Transaction(merchant_name)
CREATE INDEX ON :Transaction(is_fraud)
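Once the indexes are online, you can double-check that they are actually picked up by looking for a NodeIndexSeek (rather than a full label scan) in the plan. A quick sketch, using a merchant name taken from the sample rows above:
PROFILE
MATCH (t:Transaction {merchant_name: "fraud_Gerlach Inc", is_fraud: 1})
// note: is_fraud = 1 only matches if it was stored as an integer (e.g. toInt(line.is_fraud)), not as the raw CSV string
RETURN count(t);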

You create duplicate entries both for merchants as well as for people.
// not really needed if you don't merge transactions
// and if you don't look up transactions by trans_num
// create constraint on (t:Transaction) assert t.trans_num is unique;
// can't a person use multiple credit cards?
create constraint on (p:Person) assert p.cc_num is unique;
create constraint on (p:Person) assert p.id is unique;
create constraint on (m:Merchant) assert m.id is unique;
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "card_data.csv" AS line FIELDTERMINATOR '|'
MERGE (p:Person { id: toInt(line.cc_num)})
ON CREATE SET p.name_first=line.first, p.name_last=line.last
MERGE (m:Merchant { id: line.merchant}) ON CREATE SET m.name = line.merchant
CREATE (t:Transaction { id: line.trans_num, card_number: line.cc_num, amount: line.amt, merchant_name: line.merchant,
        is_fraud: toInt(line.is_fraud), trans_date: line.trans_date, trans_time: line.trans_time })
CREATE (p)-[:issued]->(t)
// only connect fraudulent transactions to the merchant
// (WHERE cannot follow CREATE directly, so pipe m and t through WITH first)
WITH m, t
WHERE t.is_fraud = 1
// also add an indicator label to the transaction for easier selection / processing later
SET t:Fraudulent
CREATE (m)-[:processed]->(t);
Alternatively, you can connect all transactions to the merchant and indicate fraud only via a label or alternative relationship types.
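A sketch of that alternative, run as a separate pass after the load above (same property names; MERGE keeps it idempotent):
// connect every transaction to its merchant, regardless of fraud status
// (relies on the :Transaction(merchant_name) index above)
MATCH (m:Merchant)
MATCH (t:Transaction {merchant_name: m.id})
MERGE (m)-[:processed]->(t);

// then flag fraudulent transactions with a label, or use a second relationship type instead
// (assumes is_fraud was stored with toInt(...) as in the load script above)
MATCH (t:Transaction)
WHERE t.is_fraud = 1
SET t:Fraudulent;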

Related

My match/merge process is not creating relationships in the Neo4J database

I am very new to Neo4j/cypher/graph databases, and have been trying to follow the Neo4j tutorial to import data I have in a csv and create relationships.
The following code does what I want in terms of reading in the data, creating nodes, and setting properties.
/* Importing data on seller-buyer relationships */
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///customer_rel_table.tsv' AS row
FIELDTERMINATOR '\t'
MERGE (seller:Seller {sellerID: row.seller})
ON CREATE SET seller += {name: row.seller_name,
root_eid: row.vendor_eid,
city: row.city}
MERGE (buyer:Buyer {buyerID: row.buyer})
ON CREATE SET buyer += {name: row.buyer_name};
/* Creating indices for the properties I might want to match on */
CREATE INDEX seller_id FOR (s:Seller) on (s.seller_name);
CREATE INDEX buyer_id FOR (b:Buyer) on (b.buyer_name);
/* Creating constraints to guarantee buyer-seller pairs are not duplicated */
CREATE CONSTRAINT sellerID ON (s:Seller) ASSERT s.sellerID IS UNIQUE;
CREATE CONSTRAINT buyerID on (b:Buyer) ASSERT b.buyerID IS UNIQUE;
Now I have the nodes (sellers and buyers) that I want, and I would like to link buyers and sellers. The code I have tried for this is:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///customer_rel_table.tsv' AS row
MATCH (s:Seller {sellerID: row.seller})
MATCH (b:Buyer {buyerID: row.buyer})
MERGE (s)-[st:SOLD_TO]->(b)
The query runs, but I don't get any relationships:
Query executed in 294ms. Query type: WRITE_ONLY.
No results.
Since I'm not asking it to RETURN anything, I think the "No results" comment is correct, but when I look at metadata for the DB, no relationships appear. Also, my data has ~220K rows, so 294ms seems fast.
EDIT: At @cybersam's prompting, I tried this query:
MATCH p=(:Seller)-[:SOLD_TO]->(:Buyer) RETURN p, which gives No results.
For clarity, there are two fields in my data that are the heart of the relationship:
seller and buyer, where the seller sells stuff to the buyer. The seller identifiers are repeated, but for each seller there are unique seller-buyer pairs.
What do I need to fix in my code to get relationships between the sellers and buyers? Thank you!
Your second query's LOAD CSV clause does not specify FIELDTERMINATOR '\t'. The default terminator is a comma (','). That is probably why it fails to MATCH anything.
Try adding FIELDTERMINATOR '\t' at the end of that clause.
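That is, the second query would become something like this (same file, labels, and properties as in the question):
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///customer_rel_table.tsv' AS row
FIELDTERMINATOR '\t'
MATCH (s:Seller {sellerID: row.seller})
MATCH (b:Buyer {buyerID: row.buyer})
MERGE (s)-[st:SOLD_TO]->(b);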

How to conditionally add property to relationship from CSV in Neo4j

I am making a Neo4j graph to show a network of music artists.
I have a CSV with a few columns. The first column is called Artist and is the person who made the song. The second and third columns are called Feature1 and Feature2, respectively, and represent the featured artists on a song (see example https://docs.google.com/spreadsheets/d/1TE8MtNy6XnR2_QE_0W8iwoWVifd6b7KXl20oCTVo5Ug/edit?usp=sharing)
I have merged so that any given artist has just a single node. Artists are connected by a FEATURED relationship with a strength property that represents the number of times someone has been featured. When the relationship is initialized, the relationship property strength is set to 1. For example, when (X)-[r:FEATURED]->(Y) occurs the first time r.strength = 1.
CREATE CONSTRAINT ON (a:artist) ASSERT a.artistName IS UNIQUE;
CREATE CONSTRAINT ON (f:feature) ASSERT f.artistName IS UNIQUE;
CREATE CONSTRAINT ON (f:feature1) ASSERT f.artistName IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS from 'aws/artist-test.csv' as line
MERGE (artist:Artist {artistName: line.Artist})
MERGE (feature:Artist {artistName: line.Feature1})
MERGE (feature1:Artist {artistName: line.Feature2})
CREATE (artist)-[:FEATURES {strength:1}]->(feature)
CREATE (artist)-[:FEATURES {strength:1}]->(feature1)
Then I deleted the None node for songs that have no features
MATCH (artist:Artist {artistName:'None'})
OPTIONAL MATCH (artist)-[r]-()
DELETE artist, r
If X features Y on another song further down the CSV, the code currently creates another (duplicate) relationship with r.strength = 1. Rather than creating a new relationship, I'd like to have only the one (previously created) relationship and increase the value of r.strength by 1.
Any idea how can I do this? My current approach has been to just create a bunch of duplicate relationships, then go back through and count all duplicate relationships, and set
r.strength = #duplicate relationships. However, I haven't been able to get this to work, and before I waste more time on this, I figured there is a more efficient way to accomplish this.
Any help is greatly appreciated. Thanks!
You can use MERGE on relationships together with ON CREATE SET / ON MATCH SET:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS from 'aws/artist-test.csv' as line
MERGE (artist:Artist {artistName: line.Artist})
MERGE (feature:Artist {artistName: line.Feature1})
MERGE (feature1:Artist {artistName: line.Feature2})
MERGE (artist)-[f1:FEATURES]->(feature)
ON CREATE SET f1.strength = 1
ON MATCH SET f1.strength = f1.strength + 1
MERGE (artist)-[f2:FEATURES]->(feature1)
ON CREATE SET f2.strength = 1
ON MATCH SET f2.strength = f2.strength + 1
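As a quick sanity check afterwards, you can verify that no artist pair ends up with more than one FEATURES relationship (a verification sketch, not part of the import itself):
MATCH (a:Artist)-[f:FEATURES]->(b:Artist)
WITH a, b, count(f) AS rels
WHERE rels > 1
RETURN a.artistName AS artist, b.artistName AS feature, rels;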

How does load csv create relations in neo4j?

I am using the following commands to load data from a csv file into Neo4j. The input file is large and there are millions of rows.
While this query is running I can query the number of nodes and check the progress. But once it stops creating nodes, I guess it moves on to creating relationships, and I am not able to check the progress of that step.
I have two doubts:
Does it process the command for each line of the file, i.e. create the nodes and relationships for each source line?
Or does it create all the nodes in one shot and then create the relationships?
Anyway, I want to monitor the progress of the following command. It seems to get stuck after creating the nodes, and when I try to query for the number of relationships I get 0 as output.
I created a constraint on the key attribute.
CREATE CONSTRAINT ON (n:Node) ASSERT n.key is UNIQUE;
Here is the cypher that loads the file.
USING PERIODIC COMMIT
LOAD CSV FROM "file:///data/abc.csv" AS row
MERGE (u:Node {name:row[1],type:row[2],key:row[1]+"*"+row[2]})
MERGE (v:Node {name:row[4],type:row[5], key:row[4]+"*"+row[5]})
CREATE (u) - [r:relatedTo]-> (v)
SET r.type = row[3], r.frequency=toint(trim(row[6]));
For every row of your CSV file, Neo4j runs the Cypher script, i.e.:
MERGE (u:Node {name:row[1],type:row[2],key:row[1]+"*"+row[2]})
MERGE (v:Node {name:row[4],type:row[5], key:row[4]+"*"+row[5]})
CREATE (u) - [r:relatedTo]-> (v)
SET r.type = row[3], r.frequency=toint(trim(row[6]))
Because you are using periodic commit, a commit is done every 500 lines (the default value). You will only see changes in your graph once Neo4j has committed a batch of 500 lines.
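That also means you can watch the progress from a second session by counting what has already been committed, for example (run each statement separately while the import is going):
MATCH (n:Node) RETURN count(n) AS nodesSoFar;
MATCH ()-[r:relatedTo]->() RETURN count(r) AS relationshipsSoFar;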
But your script is not optimized: you are not using the constraint in the MERGE, because you merge on all three properties instead of just the key.
You should consider this script instead:
USING PERIODIC COMMIT
LOAD CSV FROM "file:///data/abc.csv" AS row
MERGE (u:Node {key:row[1]+"*"+row[2]})
ON CREATE SET u.name = row[1],
u.type = row[2]
MERGE (v:Node {key:row[4]+"*"+row[5]})
ON CREATE SET v.name = row[4],
v.type = row[5]
CREATE (u)-[r:relatedTo]->(v)
SET r.type = row[3], r.frequency=toint(trim(row[6]));
Cheers

neo4j cypher taking time to set relationship

I am relatively new to neo4j.
I have imported a dataset of 12 million records and created a relationship between two node types. When I created the relationship, I forgot to attach a property to it. Now I am trying to set the property on the relationship as follows.
LOAD CSV WITH HEADERS FROM 'file:///FileName.csv' AS row
MATCH (user:User{userID: USERID})
MATCH (order:Order{orderID: OrderId})
MATCH(user)-[acc:ORDERED]->(order)
SET acc.field1=field1,
acc.field2=field2;
But this query is taking too much time to execute.
I even tried USING INDEX on the user and order nodes:
MATCH (user:User{userID: USERID}) USING INDEX user:User(userID)
Isn't it possible to create new attributes for the relationship at a later point?
Please let me know how I can do this operation in a quick and efficient way.
You forgot to prefix your query with USING PERIODIC COMMIT, so your query builds up transaction state for 24 million changes (property updates) and won't have enough memory to keep all that state.
You also forgot the row. prefix for the data that comes from your CSV, and those column names are inconsistently spelled (USERID vs. OrderId).
If you run this from neo4j browser pay attention to any YELLOW warning signs.
Run
CREATE CONSTRAINT ON (u:User) ASSERT u.userID IS UNIQUE;
Run
CREATE CONSTRAINT ON (o:Order) ASSERT o.orderID IS UNIQUE;
Run
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///FileName.csv' AS row
with row, row.USERID as userID, row.OrderId as orderID
MATCH (user:User{userID: userID})
USING INDEX user:User(userID)
MATCH (order:Order{orderID: orderID})
USING INDEX order:Order(orderID)
MATCH(user)-[acc:ORDERED]->(order)
SET acc.field1=row.field1, acc.field2=row.field2;
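After the run, a quick spot-check (same assumed field names as above) shows whether the properties were actually applied:
MATCH (:User)-[acc:ORDERED]->(:Order)
WHERE acc.field1 IS NOT NULL
RETURN count(acc) AS updatedRelationships;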

Neo4j Cypher Load CSV Failure on Unique Constraint

I'm having issues importing a large volume of data into a Neo4j instance using the Cypher LOAD CSV command. I'm attempting to load roughly 253k user records, each with a unique user_id. My first step was to add a unique constraint on the label to make sure each user would only be created once:
CREATE CONSTRAINT ON (b:User) ASSERT b.user_id IS UNIQUE;
I then tried to run LOAD CSV with periodic commits to pull this data in.
This query failed, so I tried to merge the User record before setting its properties:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_users.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id)})-[:REGISTERED_TO]->(t)
set p.created=toInt(line.created), p.completed=toInt(line.completed);
Modifying the periodic commit value has made no difference; the same error is returned.
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})-[:REGISTERED_TO]->(t);
I receive the following error:
LoadCsvStatusWrapCypherException: Node 9752 already exists with label Person and property "hpcm_uk_buddy_id"=[2446] (Failure when processing URL 'file:/home/data/uk_buddies.csv' on line 253316 (which is the last row in the file). Possibly the last row committed during import is line 253299. Note that this information might not be accurate.)
The numbers seem to match up roughly; the CSV file contains 253315 records in total. The periodic commit doesn't seem to have taken effect either: a count of nodes returns only 5446.
neo4j-sh (?)$ match (n) return count(n);
+----------+
| count(n) |
+----------+
| 5446 |
+----------+
1 row
768 ms
I can understand the number of nodes being incorrect if this ID is only roughly 5000 rows into the CSV file. But is there any technique or command I can use to make this import succeed?
I think you're falling victim to a common mistake with MERGE; this is probably in my top 10 FAQ about common problems with Cypher. You're doing this:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})-[:REGISTERED_TO]->(t);
The way MERGE works, that last MERGE matches on the entire pattern (the user node plus its relationship), not just on the user node. So you're probably creating duplicate users that you shouldn't be. When you run this MERGE, even if a user with those exact properties already exists, the relationship to the t node doesn't, so it attempts to create a new user node with those attributes to connect to t, which isn't what you want.
The solution is to merge the user individually, then separately merge the relationship path, like this:
USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})
merge (p)-[:REGISTERED_TO]->(t);
Note the two merges at the end. One creates just the user. If the user already exists, it won't try to create a duplicate, and you should hopefully be OK with your constraint (assuming there aren't two users with the same user_id, but different created values). After you've merged just the user, then you merge the relationship.
The net result of the second query is the same, but shouldn't create duplicate users.
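If two rows can share a user_id but differ in created or completed, the MERGE on all three properties would still trip the constraint. A slightly more defensive variant (a sketch along the same lines, same file and labels) merges on user_id alone and fills in the other properties only on creation:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///home/data/uk_buddies.csv" AS line
MATCH (t:Territory {territory:"uk"})
MERGE (p:User {user_id: toInt(line.user_id)})
  ON CREATE SET p.created = toInt(line.created), p.completed = toInt(line.completed)
MERGE (p)-[:REGISTERED_TO]->(t);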
