How to combine similar nodes in neo4j

I have defined a few nodes and relationships in the neo4j graph database, but the output is a bit different from what I expected: each node represents only its own data and attributes. I want the same node to be combined so that it shows its different relationships and attributes together.
LOAD CSV WITH HEADERS FROM "file:///data.csv" AS line
CREATE (s:SourceID {Name: line.SourceID})
CREATE (t:Title {Name: line.Title})
CREATE (c:Coverage {Name: line.Coverage})
CREATE (p:Publisher {Name: line.Publisher})
MERGE (p)-[:PUBLISHES]->(t)
MERGE (p)-[:Coverage {covers: line.Coverage}]->(t)
MERGE (t)-[:BelongsTO]->(p)
MERGE (s)-[:SourceID]->(t)
In the attached picture there are two nodes named Springer Nature. I wish to have only one Springer Nature node, with all the data associated with both nodes present on that single node.

First of all, I would recommend setting a CONSTRAINT before adding data.
The nodes end up duplicated because the query creates them on every CSV row, and nothing in the query tells Cypher that nodes with the same Name must be treated as one unique node.
So in your case, first run this for each of the node labels:
CREATE CONSTRAINT publisherID IF NOT EXISTS FOR (n:Publisher) REQUIRE (n.Name) IS UNIQUE;
CREATE CONSTRAINT sourceID IF NOT EXISTS FOR (n:SourceID) REQUIRE (n.Name) IS UNIQUE;
CREATE CONSTRAINT titleID IF NOT EXISTS FOR (n:Title) REQUIRE (n.Name) IS UNIQUE;
CREATE CONSTRAINT coverageID IF NOT EXISTS FOR (n:Coverage) REQUIRE (n.Name) IS UNIQUE;
Even better would be to use a publisher ID rather than the name, but that is your choice; if there aren't thousands of publishers in the data, using the name is no issue at all.
Also, I would use MERGE instead of CREATE for the nodes. LOAD CSV processes the file row by row, so as soon as a row tries to CREATE a node that already exists (which could happen on the second row or on the fiftieth), the query would fail against the constraints above, whereas MERGE simply reuses the existing node.
And try everything on a blank database; for example, by deleting all nodes:
MATCH (n) DETACH DELETE n
To sum up, run the queries separately: first the constraints, then the import:
CREATE CONSTRAINT publisherID IF NOT EXISTS FOR (n:Publisher) REQUIRE (n.Name) IS UNIQUE;
CREATE CONSTRAINT sourceID IF NOT EXISTS FOR (n:SourceID) REQUIRE (n.Name) IS UNIQUE;
CREATE CONSTRAINT titleID IF NOT EXISTS FOR (n:Title) REQUIRE (n.Name) IS UNIQUE;
CREATE CONSTRAINT coverageID IF NOT EXISTS FOR (n:Coverage) REQUIRE (n.Name) IS UNIQUE;
LOAD CSV WITH HEADERS FROM "file:///data.csv" AS line
MERGE(s:SourceID{Name:line.SourceID})
MERGE(t:Title{Name:line.Title})
MERGE(c:Coverage{Name:line.Coverage})
MERGE(p:Publisher{Name:line.Publisher})
MERGE (p)-[:PUBLISHES]->(t)
MERGE (p)-[:Coverage{covers:line.Coverage}]->(t)
MERGE (t)-[:BelongsTO]->(p)
MERGE (s)-[:SourceID]->(t)
RETURN count(p), count(t), count(c), count(s);
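Once loaded, a quick way to confirm the duplicates are gone (a sketch, not part of the original answer) is to look for any Name that still appears on more than one Publisher node:
// returns no rows if every Publisher is unique
MATCH (p:Publisher)
WITH p.Name AS name, count(*) AS nodes
WHERE nodes > 1
RETURN name, nodes;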

Related

Trouble with correctly modeling data in Neo4j / Cypher

I am a beginner with Neo4j/Cypher and I'm having trouble modeling my data correctly.
The data has the following relationships:
MANUFACTURER_A (unique) -> PRODUCT_A (unique just within MANUFACTURER_A) -> CUSTOMER_A (globally unique)
MANUFACTURER_A (unique) -> PRODUCT_A (unique just within MANUFACTURER_A) -> CUSTOMER_B (globally unique)
MANUFACTURER_B (unique) -> PRODUCT_A (unique just within MANUFACTURER_B) -> CUSTOMER_A (globally unique)
I am not able to make Product_A unique within a Manufacturer: I always get one line from Manufacturer_A to Product_A and another from Manufacturer_B to the same node. What I actually want is two Product_A nodes, one per Manufacturer, and no more than one Product_A node per Manufacturer.
I tried the following:
CREATE CONSTRAINT ON (c:MANUFACTURER) ASSERT c.MANUFACTURER IS UNIQUE;
CREATE CONSTRAINT ON (c:CUSTOMER) ASSERT c.CUSTOMER IS UNIQUE;
LOAD CSV WITH HEADERS FROM
'file:///small.csv' AS line
WITH line LIMIT 20
MERGE (CUSTOMER:Customer {Name: line.CUSTOMER})
MERGE (MANUFACTURER:Manufacturer {Name:line.MANUFACTURER})
MERGE (PRODUCT:Product {Name: line.PRODUCT})
MERGE (MANUFACTURER)<-[:PRODUCES]-(PRODUCT)
MERGE (CUSTOMER)<-[:CONSUMES]-(PRODUCT)
;
How would I model that correctly?
In this case, you don't want to MERGE the :Product node on its own, but as part of a pattern connected to your already-merged manufacturer variable. That provides the context that the pattern you are looking for must be connected, and if no such pattern exists, the non-bound parts of it will be created.
...
MERGE (MANUFACTURER:Manufacturer {Name:line.MANUFACTURER})
MERGE (MANUFACTURER)<-[:PRODUCES]-(PRODUCT:Product {Name: line.PRODUCT})
...
So it won't matter if a product of that name exists elsewhere in the graph: as long as one isn't found connected to that manufacturer, it will be created as part of the MERGE.
This knowledge base article may be helpful as well, as it covers this application and more:
https://neo4j.com/developer/kb/understanding-how-merge-works/
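Put together with the rest of the query from the question, the whole import could look roughly like this (a sketch using the column names from the question, not a tested script):
LOAD CSV WITH HEADERS FROM 'file:///small.csv' AS line
WITH line LIMIT 20
MERGE (customer:Customer {Name: line.CUSTOMER})
MERGE (manufacturer:Manufacturer {Name: line.MANUFACTURER})
// merge the product as part of the manufacturer pattern, so it is unique per manufacturer
MERGE (manufacturer)<-[:PRODUCES]-(product:Product {Name: line.PRODUCT})
MERGE (customer)<-[:CONSUMES]-(product);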

get the root parent node using neo4j

I've imported a CSV file into neo4j and created nodes and relationships for it.
In the above, the first two nodes come under db1 and the last four come under db2.
How can I find that the last four nodes belong to db2?
Below are the code and the CSV file:
columnname,tablename,databasename,systemname
abc,1a,db1,Finance
def,1a,db1,Finance
ghi,1a,db1,Finance
klm,1a,db1,Finance
abc,1a,db2,Medical
def,1a,db2,Medical
ghi,1a,db2,Medical
klm,1a,db2,Medical
nop,1a,db2,Medical
qrs,1a,db2,Medical
I've created nodes and relationships for the above CSV file in neo4j.
This is for getting unique values:
CREATE CONSTRAINT ON (c:ColumnName) ASSERT c.ColumnName IS UNIQUE;
CREATE CONSTRAINT ON (c:TableName) ASSERT c.TableName IS UNIQUE;
CREATE CONSTRAINT ON (c:DatabaseName) ASSERT c.DatabaseName IS UNIQUE;
CREATE CONSTRAINT ON (c:SystemName) ASSERT c.SystemName IS UNIQUE;
This is for loading the CSV file and creating the nodes:
LOAD CSV WITH HEADERS FROM "file:///test.csv" AS line
MERGE (ColumnName:ColumnName {ColumnName: line.ColumnName})
MERGE (TableName:TableName {TableName:line.TableName})
MERGE (DatabaseName: DatabaseName {DatabaseName:line.DatabaseName})
MERGE (SystemName: SystemName {SystemName:line.SystemName})
This creates the relationships among the nodes:
MERGE (ColumnName)-[:iscolumnof]->(TableName)
MERGE (TableName)-[:istableof]->(DatabaseName)
MERGE (DatabaseName)-[:isdatabaseof ]->(SystemName)
If I select the node 'nop' and expand it, I'll get the table node 1a, and if I expand 1a I'll get all the column nodes. How do I find that 'nop' belongs to 'db2'?
As far as I understand, you have a pattern
(:ColumnName)-[:iscolumnof]->(:TableName)-[:istableof]->(:DatabaseName)-[:isdatabaseof ]->(:SystemName)
If you want to test whether a certain :ColumnName belongs to a :DatabaseName:
WITH 'nop' AS columnName, 'db2' AS databaseName
MATCH (col:ColumnName {ColumnName: columnName}), (db:DatabaseName {DatabaseName: databaseName})
RETURN EXISTS((col)-[:iscolumnof]->(:TableName)-[:istableof]->(db)) AS result
If you want all the columns of db2:
WITH 'db2' AS databaseName
MATCH (c:ColumnName)-[:iscolumnof]->(:TableName)-[:istableof]->(:DatabaseName {DatabaseName:databaseName})
RETURN c.ColumnName AS column
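And since the question title asks for the root parent node: to list, for each column, the database and system it ultimately belongs to, something along these lines should work (a sketch reusing the labels and relationship types above):
MATCH (c:ColumnName)-[:iscolumnof]->(:TableName)-[:istableof]->(db:DatabaseName)-[:isdatabaseof]->(s:SystemName)
RETURN c.ColumnName AS column, db.DatabaseName AS database, s.SystemName AS system
ORDER BY database, column;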

How to match line of csv which is ignored by constraint and create only relationship

I have created a graph with a constraint on the primary id. In my CSV the primary id is duplicated, but the other properties differ. Based on those other properties I want to create relationships.
I tried changing the code multiple times, but it does not do what I need.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Trial.csv' AS line FIELDTERMINATOR '\t'
MATCH (n:Trial {id: line.primary_id})
WITH line.cui AS cui
MATCH (m:Intervention)
where m.id = cui
MERGE (n)-[:HAS_INTERVENTION]->(m);
I already have the Intervention nodes in the graph as well as the Trial nodes. So what I am trying to do is to match a trial with the id from an intervention and create only the relationship. Instead, it is also creating the nodes.
This is a sample of my data: the same primary id appears with different cuis, and I am trying to match on cui.
You can refer to the following query, which finds the Trial and Intervention nodes by primary_id and cui respectively and creates the relationship between them:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Trial.csv' AS line FIELDTERMINATOR '\t'
MATCH (n:Trial {id: line.primary_id}), (m:Intervention {id: line.cui})
MERGE (n)-[:HAS_INTERVENTION]->(m);
The behavior you observed is caused by 2 aspects of the Cypher language:
The WITH clause drops all existing variables except for the ones explicitly specified in the clause. Therefore, since your WITH clause does not specify the n node, n becomes an unbound variable after the clause.
The MERGE clause will create its entire pattern if any part of the pattern does not already exist. Since n is not bound to anything, the MERGE clause would go ahead and create the entire pattern (including the 2 nodes).
So, you could have fixed the issue by simply including the n variable in the WITH clause, as in:
WITH n, line.cui AS cui
But @Raj's query above is even better, avoiding the need for WITH entirely.
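For completeness, here is roughly what the WITH-based fix would look like applied to the original query (a sketch; it achieves the same result as the query above):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///Trial.csv' AS line FIELDTERMINATOR '\t'
MATCH (n:Trial {id: line.primary_id})
// keep n in scope so the MERGE does not recreate the whole pattern
WITH n, line.cui AS cui
MATCH (m:Intervention {id: cui})
MERGE (n)-[:HAS_INTERVENTION]->(m);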

Stuck after ~500 inserts

I am inserting nodes and relationships into my neo4j DB (GrapheneDB, but it also happens locally).
After roughly 500 inserts, the insert statement gets stuck.
After a neo4j server restart, the same insert works as usual and I can continue with the next ~500 inserts.
Do you have any clue why it gets stuck?
One insert statement looks like the following:
MERGE (b0:Company{company_id:{b1},universal_name:{b2},company_name:{b3}})
ON CREATE SET b0.funding_total_usd = null
ON MATCH SET b0.funding_total_usd = null
MERGE (b13:Industry{name:{b12}})
MERGE (b0)-[:company_industry]->(b13)
MERGE (b15:Category{name:{b14}})
MERGE (b0)-[:company_category]->(b15)
MERGE (b17:Category{name:{b16}})
MERGE (b0)-[:company_category]->(b17)
MERGE (b19:Category{name:{b18}})
MERGE (b0)-[:company_category]->(b19)
MERGE (b21:Category{name:{b20}})
MERGE (b0)-[:company_category]->(b21)
MERGE (b23:Category{name:{b22}})
MERGE (b0)-[:company_category]->(b23)
MERGE (b25:Category{name:{b24}})
MERGE (b0)-[:company_category]->(b25)
MERGE (b27:Category{name:{b26}})
MERGE (b0)-[:company_category]->(b27)
Indexes are present:
Indexes
ON :Category(name) ONLINE (for uniqueness constraint)
ON :Company(company_id) ONLINE (for uniqueness constraint)
ON :Company(universal_name) ONLINE (for uniqueness constraint)
ON :Industry(name) ONLINE (for uniqueness constraint)
Constraints
ON ( category:Category ) ASSERT category.name IS UNIQUE
ON ( company:Company ) ASSERT company.company_id IS UNIQUE
ON ( company:Company ) ASSERT company.universal_name IS UNIQUE
ON ( industry:Industry ) ASSERT industry.name IS UNIQUE
I use following PHP code to submit the statement:
$config = \GraphAware\Bolt\Configuration::create()
    ->withCredentials($user, $pw)
    ->withTimeout($timeout);
if ($ssl) {
    $config = $config->withTLSMode(\GraphAware\Bolt\Configuration::TLSMODE_REQUIRED);
}
$driver = \GraphAware\Bolt\GraphDatabase::driver($uri, $config);
$driver->session()->run($query, $binds);
Tested versions: 3.4.12 and 3.5.1
Edit: added the code used to submit the statement, and the neo4j version.
You should be batching your insertions, and you should not explicitly create a separate variable for each individual node. Instead, see if you can provide parameters that include lists of properties, which you can then process in a single query using UNWIND.
See some of our batching tips and tricks.
Applied to your query, your parameter input per batch could look something like this:
{entries:[{companyId:12345, universalName:'foo', companyName:'bar',
industry:'industry', categories:[{name:'cat1'}, {name:'cat2'},
{name:'cat3'}]}]}
And the query itself per batch execution could look like this:
UNWIND $entries as entry
MERGE (c:Company{company_id:entry.companyId, universal_name:entry.universalName, company_name:entry.companyName})
SET c.funding_total_usd = null
MERGE (industry:Industry{name:entry.industry})
MERGE (c)-[:company_industry]->(industry)
WITH entry, c
UNWIND entry.categories as cat
MERGE (category:Category{name:cat.name})
MERGE (c)-[:company_category]->(category)
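To try this interactively (an illustration, not from the original answer), you can set the parameter in Neo4j Browser with :param and then run the batched query:
:param entries => [{companyId: 12345, universalName: 'foo', companyName: 'bar', industry: 'industry', categories: [{name: 'cat1'}, {name: 'cat2'}]}]
From the PHP driver, the same list of maps would simply be passed under the entries key of the $binds array that run($query, $binds) already receives.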

Basic / conceptual issues, query performance with Cypher and Neo4J

I'm doing a project on credit card fraud, and I've got some generated sample data in CSV (pipe-delimited) where each line is basically the person's info and the transaction details, along with the merchant name, etc. Since this is generated data, there's also a flag that indicates whether the transaction was fraudulent.
What I'm attempting to do is to load the data into Neo4j, create nodes (persons, transactions, and merchants), and then visualize a graph of the fraudulent charges to see if there are any common merchants. (I am aware there is a sample neo4j data set similar to this, but I'm attempting to apply this concept to a separate project).
I load the data in, create constraints, and then attempt my query, which seems to run forever.
Here are a few lines of example data:
ssn|cc_num|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|acct_num|profile|trans_num|trans_date|trans_time|unix_time|category|amt|is_fraud|merchant|merch_lat|merch_long
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|2e5186427c626815e47725e59cb04c9f|2013-03-21|02:01:05|1363831265|misc_net|838.47|1|fraud_Greenfelder, Bartoletti and Davis|31.616203|-110.221915
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|7d3f5eae923428c51b6bb396a3b50aab|2013-03-22|22:36:52|1363991812|shopping_net|907.03|1|fraud_Gerlach Inc|32.142740|-111.675048
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|76083345f18c5fa4be6e51e4d0ea3580|2013-03-22|16:40:20|1363970420|shopping_pos|912.03|1|fraud_Morissette PLC|31.909227|-111.3878746
The sample file I'm using has about 60k transactions.
Below is my Cypher query / code thus far:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "card_data.csv"
AS line FIELDTERMINATOR '|'
CREATE (p:Person { id: toInt(line.cc_num), name_first: line.first, name_last: line.last })
CREATE (m:Merchant { id: line.merchant, name: line.merchant })
CREATE (t:Transaction { id: line.trans_num, merchant_name: line.merchant, card_number:line.cc_num, amount:line.amt, is_fraud:line.is_fraud, trans_date:line.trans_date, trans_time:line.trans_time })
create constraint on (t:Transaction) assert t.trans_num is unique;
create constraint on (p:Person) assert p.cc_num is unique;
MATCH (m:Merchant)
WITH m
MATCH (t:Transaction{merchant_name:m.merchant,is_fraud:1})
CREATE (m)-[:processed]->(t)
You can see in the 2nd MATCH query that I am attempting to restrict to fraudulent transactions (is_fraud:1); of the roughly 65k transactions, 230 have is_fraud:1.
Any ideas why this query seems to run endlessly? I have MUCH larger sets of data I'd like to examine this way, and the results on this small data set are not promising so far (I'm sure due to my lack of understanding, not Neo4j's fault).
You don't show any index creation. To speed things up, you should create an index on both merchant_name and is_fraud, to avoid going through all transaction nodes sequentially for a given merchant:
CREATE INDEX ON :Transaction(merchant_name)
CREATE INDEX ON :Transaction(is_fraud)
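As an aside (not part of the original answer), you can verify that an index is actually being used by prefixing the lookup with PROFILE and checking for a NodeIndexSeek rather than a NodeByLabelScan in the plan, for example:
PROFILE
MATCH (t:Transaction {merchant_name: 'fraud_Gerlach Inc'})
RETURN count(t);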
You create duplicate entries both for merchants as well as for people.
// not really needed if you don't merge transactions
// and if you don't look up transactions by trans_num
// create constraint on (t:Transaction) assert t.trans_num is unique;
// can't a person use multiple credit cards?
create constraint on (p:Person) assert p.cc_num is unique;
create constraint on (p:Person) assert p.id is unique;
create constraint on (m:Merchant) assert m.id is unique;
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "card_data.csv" AS line FIELDTERMINATOR '|'
MERGE (p:Person { id: toInt(line.cc_num)})
ON CREATE SET p.name_first = line.first, p.name_last = line.last
MERGE (m:Merchant { id: line.merchant}) ON CREATE SET m.name = line.merchant
// store is_fraud as a number so it can be compared with 1 below
CREATE (t:Transaction { id: line.trans_num, card_number: line.cc_num, amount: line.amt, merchant_name: line.merchant,
is_fraud: toInt(line.is_fraud), trans_date: line.trans_date, trans_time: line.trans_time })
CREATE (p)-[:issued]->(t)
// only connect fraudulent transactions to the merchant
// (a WITH is needed here because WHERE cannot directly follow CREATE)
WITH m, t
WHERE t.is_fraud = 1
// also add an indicator label to the transaction for easier selection / processing later
SET t:Fraudulent
CREATE (m)-[:processed]->(t);
Alternatively, you can connect all transactions to the merchant and indicate fraud only via the label / alternative relationship types.
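With either approach, the original goal of spotting merchants that are common across fraudulent charges becomes a simple aggregation; for example, assuming the Fraudulent label set in the query above (a sketch):
// merchants ranked by the number of fraudulent transactions they processed
MATCH (m:Merchant)-[:processed]->(t:Fraudulent)
RETURN m.name AS merchant, count(t) AS fraudulent_transactions
ORDER BY fraudulent_transactions DESC
LIMIT 10;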
