I have a big dataset of person data and found a lot of duplicates using an algorithm.
I marked these duplicates in Neo4j with a relationship.
Example:
(p:Person)-[:similar]->(d:Person)
For testing purposes, I created virtual nodes by combining all nodes connected by the similar relationship.
CALL algo.unionFind.stream('Person', 'similar', {})
YIELD nodeId, setId
WITH setId AS idd, collect(algo.getNodeById(nodeId)) AS nodis
WHERE size(nodis) > 1
CALL apoc.nodes.collapse(nodis,{properties:'combine'}) YIELD from, rel
RETURN idd, from, rel
Here I ran into the problem that only two nodes at a time were compared and stored in the result data.
Example:
ID: 5, Peter Smith
ID: 4635, Peter Smit
ID: 4635, Peter Smit
ID: 765, Peter Smith
ID: 5, Peter Smith
ID: 765, Peter Smith
I want to refactor the graph and merge each group of duplicates (a forest) into one node. But only one node is merged. How can I merge all the forests that exist due to the relationship 'similar'?
UPDATE:
I found a partial solution. All similar persons were merged by the following code, and all properties were combined into lists. That seems fine to me, except that the IDs are now in a list too, but that isn't the topic of this question.
CALL algo.unionFind.stream('Person', 'similar', {})
YIELD nodeId,setId
WITH setId AS idd, collect(algo.getNodeById(nodeId)) AS nodis
CALL apoc.refactor.mergeNodes(nodis, {properties:'combine', mergeRels: true}) YIELD node
RETURN node
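As a rough illustration of why the union-find step groups all three "Peter Smith" records together even though the similarity algorithm only reports pairs, here is a minimal in-memory sketch in Python. It is not the algo.unionFind implementation; the function names and the extra pair (10, 11) are assumptions made up for the example.

```python
def find(parent, x):
    # Follow parent pointers to the root, halving the path on the way.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def connected_sets(pairs):
    # Build disjoint sets from pairwise "similar" links.
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    groups = {}
    for node in parent:
        groups.setdefault(find(parent, node), set()).add(node)
    # Mirrors the WHERE size(nodis) > 1 filter: keep only real duplicate groups.
    return [members for members in groups.values() if len(members) > 1]


# The three pairs from the example above, plus one hypothetical unrelated pair.
similar_pairs = [(5, 4635), (4635, 765), (5, 765), (10, 11)]
sets = connected_sets(similar_pairs)
```

Each resulting set corresponds to one setId, so merging per set (rather than per pair) yields one node per group of duplicates.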
How about using a unique constraint?
I also faced the same problem with MERGE.
Example:
CREATE CONSTRAINT ON ( book:Book) ASSERT book.isbn IS UNIQUE
Related
I'm very new to Neo4j. I have a schema that looks like the image below:
Here, the nodes attached through HAS are of different kinds, for example Passport, Merchant, Driving License, etc. These nodes describe the customer node (with a view to filtering customers based on them in the future).
SIMILAR is a self-relation meaning there exists a customer with ID:1 is related to another customer with ID:2 with a score of 2800.
I have the following questions:
Is this a good schema given the future scope I mentioned above, or would it be viable to put all the properties in a single customer node? (Different nodes may hold arrays of items as well, for example: ()-[:HAS]->(Phone) having {active: "+91-1231241", historic_phone_numbers: ["+91-121213", "+91-1231421"]})
I want to get the customer along with its describing nodes, in relation to other customers. For that, I tried the queries below (with and without requiring more than one relation):
// With number_of_relation > 1
MATCH (searched:Customer)-[r:SIMILAR]->(matched:Customer)
WHERE r.score > 2700
WITH searched, COLLECT(matched.customer_id) AS MatchedList, count(r) as cnt
WHERE cnt > 1
UNWIND MatchedList AS matchedCustomer
MATCH (person:Customer {customer_id: matchedCustomer})-[:HAS|:LIVES_IN|:IS_EMPLOYED_BY]->(related)
RETURN searched, person, related
The result I got is below; notice that one customer node does not have its describing nodes:
// without number_of_relation > 1
// second attempt - for a sample customer_id
MATCH (matched)<-[r:SIMILAR]-(c)-[:HAS|:LIVES_IN|:IS_EMPLOYED_BY]->(b)
WHERE size(keys(b)) > 0
AND c.customer_id = "1b093559-a39b-4f95-889b-a215cac698dc"
AND r.score > 2700
RETURN b AS props, c AS src_cust, r AS relation, matched
The result I got is below; notice that the related nodes do not have their describing nodes:
If I had two describing nodes with some property (some may hold a list) that I wanted to query on to build the expected graph specified in point 2 above, how could I do that?
I want the database to find a similar customer given the describing nodes. Example: a customer {name: "Dave"} who has phone {active_number: "+91-12345"} is similar to a customer {name: "Mike"} who has phone {active_number: "+91-12345"}. How can I get started with this?
If something is unclear, please ask. I can explain with examples.
[EDITED]
Yes, the schema seems fine, except that you should not use the same HAS relationship type between different node label pairs.
The main problem with your first query is that its top MATCH clause uses a directional relationship pattern, ()-->(), which does not allow all Customer nodes to have a chance to be the searched node (because some nodes may only be at the tail end of SIMILAR relationships). This tweaked query should work better:
MATCH (searched:Customer)-[r:SIMILAR]-(matched:Customer)
WHERE r.score > 2700
WITH searched, COLLECT(matched) AS matchedList
WHERE SIZE(matchedList) > 1
UNWIND matchedList AS person
MATCH (person)-[:HAS|LIVES_IN|IS_EMPLOYED_BY]->(pDesc)
WITH searched, person, COLLECT(pDesc) AS personDescribers
MATCH (searched)-[:HAS|LIVES_IN|IS_EMPLOYED_BY]->(sDesc)
RETURN searched, person, personDescribers, COLLECT(sDesc) AS searchedDescribers
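To make the effect of dropping the arrow concrete, here is a small Python sketch that models SIMILAR relationships as (start, end) pairs. The customer ids are hypothetical and the two functions only imitate which nodes can bind to `searched` in each pattern; they are not how Neo4j evaluates the query.

```python
# Hypothetical SIMILAR relationships as (start, end) pairs.
similar = [("c1", "c2"), ("c1", "c3"), ("c4", "c3")]


def searched_directed(edges):
    # Imitates (searched)-[:SIMILAR]->(matched): only start nodes can be "searched".
    return {start for start, _ in edges}


def searched_undirected(edges):
    # Imitates (searched)-[:SIMILAR]-(matched): either end can be "searched".
    return {node for edge in edges for node in edge}


directed = searched_directed(similar)
undirected = searched_undirected(similar)
```

In this toy data, c2 and c3 only ever appear at the end of a relationship, so the directed pattern never considers them as the searched customer, while the undirected one does.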
It's not clear what you are trying to do.
To get all Customers who have the same phone number:
MATCH (c:Customer)-[:HAS_PHONE]-(p:Phone)
WHERE p.activeNumber = '+91-12345'
WITH p.activeNumber AS phoneNumber, COLLECT(c) AS customers
WHERE SIZE(customers) > 1
RETURN phoneNumber, customers
I have a requirement to merge duplicate nodes and keep one copy. The issue I am facing is that when I merge the nodes, duplicate relationships are created. Instead, I want the relationships to be merged as well, without duplicates.
Can you give some suggestions?
CREATE (n:People { name: 'Person1', lastname: 'Person1LastName', email_ID:'Person1#test2.com' })
CREATE (n:People { name: 'Person2', lastname: 'Person2LastName', email_ID:'Person2#test2.com' })
CREATE (n:People { name: 'Person2', lastname: 'Person2LastName', staysin:'California' })
CREATE (n:People { name: 'Person3', lastname: 'Person3LastName', email_ID:'Person3#test2.com' })
Person2 -[r:Has_Met]->(Person1)
(Person3)-[r:FRIENDS_WITH]->(Person2) having email_ID='Person2#test2.com'
Now I want to keep a single Person2 node and keep both of its relationships with the other nodes, something like this:
MATCH (p:People{name:"person1"})
WITH p.name as name, collect(p) as nodes, count(*) as cnt
WHERE cnt > 1
WITH head(nodes) as first, tail(nodes) as rest
UNWIND rest AS to_delete
MATCH (to_delete)-[r:HAS_MET]->(e:name)
MERGE (first)-[r1:HAS_MET]->(e)
on create SET r1=r
SET to_delete.isDuplicate=true
RETURN count(*);
This is a related question, but here I know only one relationship (HAS_MET) will be considered. How do I consider all the relationships once?
Without presentation of your model or listing of sample data, unfortunately, I am only able to answer in general, which may help you nevertheless.
Have a look at the APOC library and consider the use of the procedures Merge Nodes and Redirect Relationship To. You will find explanatory images and Cypher statements there for each case.
Extension after question update
Initial situation
CREATE
(p1:People {name: 'Person1', lastname: 'Person1LastName', email_ID: 'Person1#test2.com'}),
(p2a:People {name: 'Person2', lastname: 'Person2LastName', email_ID: 'Person2#test2.com'}),
(p2b:People {name: 'Person2', lastname: 'Person2LastName', staysin: 'California'}),
(p3:People {name: 'Person3', lastname: 'Person3LastName', email_ID: 'Person3#test2.com'}),
(p2a)-[:HAS_MET]->(p1),
(p2b)-[:HAS_MET]->(p1),
(p3)-[:FRIENDS_WITH]->(p2a);
Solution
MATCH (oneNode:People {email_ID: 'Person2#test2.com'}), (otherNode:People {staysin: 'California'})
CALL apoc.refactor.mergeNodes([oneNode, otherNode])
YIELD node
MATCH (node)-[relation:HAS_MET]->(:People)
WITH tail(collect(relation)) AS surplusRelations
UNWIND surplusRelations AS surplusRelation
DELETE surplusRelation;
line 1: select both nodes to be combined
line 2: call the appropriate merge nodes procedure
line 3: define the result variable
line 4: identify all relationships between the combined node and a met person (there are at least two)
line 5: select all relationships but the first one
line 6: unwind the surplus relationships into individual rows
line 7: delete all surplus relationships
Result
merged node Person2, containing all attributes from source nodes (note especially email_ID and staysin)
one relationship Person1-Person2
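The question also asked how to handle every relationship type at once rather than only HAS_MET. One way to think about it is to treat two relationships as duplicates when they share start node, type, and end node. The Python sketch below illustrates that idea on an in-memory list; the dictionary layout and the id "p2" for the merged node are assumptions for the example, and this is not APOC's implementation. The `mergeRels: true` option shown in an earlier snippet is aimed at the same problem.

```python
def dedupe_relationships(rels):
    # Keep the first relationship for each (start, type, end) triple.
    seen = set()
    kept = []
    for rel in rels:
        key = (rel["start"], rel["type"], rel["end"])
        if key not in seen:
            seen.add(key)
            kept.append(rel)
    return kept


# State after merging p2a and p2b into one node, here called "p2" (hypothetical id).
rels = [
    {"start": "p2", "type": "HAS_MET", "end": "p1"},
    {"start": "p2", "type": "HAS_MET", "end": "p1"},
    {"start": "p3", "type": "FRIENDS_WITH", "end": "p2"},
]
kept = dedupe_relationships(rels)
```

Because the key includes the end node and the type, relationships of the same type to different people are kept, and no per-type query is needed.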
I am trying to write a query which returns only the first common node between two nodes in a scenario where there may be multiple.
Using this graph for reference - http://neo4j.com/docs/stable/cypher-cookbook-friend-finding.html.
For example, I'm Joe, and I would like to find the list of friend-of-friends I don't know, with only one person that I should ask for an introduction. An example return set is this, even though Bill is also a connection to Ian:
Bill Derrick
Sara Ian
Sara Jill
I've tried using DISTINCT, but that doesn't group properly:
MATCH (joe { name: 'Joe' })-[:knows]-(friend)-[:knows]-(friend_of_friend)
WHERE NOT (joe)-[:knows]-(friend_of_friend)
WITH DISTINCT friend_of_friend, friend
RETURN friend.name, friend_of_friend.name
I'm starting to believe I need a second query with the friend node passed to it. Hopefully not though, because that sounds painfully inefficient. What am I missing?
You need to aggregate at the level of friend using the collect function:
MATCH (joe { name: 'Joe' })-[:knows]-(friend)-[:knows]-(friend_of_friend)
WHERE NOT (joe)-[:knows]-(friend_of_friend)
RETURN friend.name, collect(friend_of_friend.name)
update
MATCH path=(joe { name: 'Joe' })-[:knows]-(friend)-[:knows]-(friend_of_friend)
WHERE NOT (joe)-[:knows]-(friend_of_friend)
RETURN collect(friend)[0] AS friend, friend_of_friend
This gives you 3 rows:
Bill, Derrick
Bill, Ian or Sara, Ian
Sara, Jill
It is not deterministic whether Bill-Ian or Sara-Ian ends up in the result.
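The second query groups by friend_of_friend and keeps the first collected friend. A minimal Python sketch of that "one introducer per friend-of-friend" grouping, using the rows from the example above, could look like this. The function name is made up, and unlike Cypher without an ORDER BY, the Python list has a fixed order, so the result here is deterministic.

```python
# (friend, friend_of_friend) rows as produced by the MATCH for Joe.
rows = [
    ("Bill", "Derrick"),
    ("Bill", "Ian"),
    ("Sara", "Ian"),
    ("Sara", "Jill"),
]


def one_introducer(pairs):
    # Group by friend_of_friend and keep the first friend seen for each,
    # which is what collect(friend)[0] does per grouping key.
    first = {}
    for friend, fof in pairs:
        first.setdefault(fof, friend)
    return first


introducers = one_introducer(rows)
```

To get a stable choice in Cypher as well, the rows would need to be ordered before the aggregation.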
I have the following two node types:
c:City {name: 'blah'}
s:Course {title: 'whatever', city: 'New York'}
Looking to create this:
(s)-[:offered_in]->(c)
I'm trying to get all courses that are NOT tied to cities and create the relationship to the city (the city gets created if it doesn't exist). However, my dataset is about 5 million nodes, and any query I run times out (unless I do it in increments of 10k).
... anybody has any advice?
EDIT:
Here is the query for jobs I'm running now. It has to be done in 10k chunks (out of millions) because it already takes a few minutes as it is. It creates the city if it doesn't exist:
match (j:Job)
where not has(j.merged) and has(j.city)
WITH j
LIMIT 10000
MERGE (c:City {name: j.city})
WITH j, c
MERGE (j)-[:in]->(c)
SET j.merged = 1
return count(j)
(For now I don't know of a good way to filter out the ones already matched, so I'm tagging them with a custom "merged" attribute that I already have an index on.)
500000 is a fair few nodes, and in your other question you suggested that 90% were without the relationship you want to create here, so it is going to take a bit of time. Without knowing more about your system (spec, Neo4j setup, programming environment) and when you are running this (on old data or on insert), this is just a best guess at a tidier solution:
MATCH (j:Job)
WHERE NOT (j)-[:IN]->() AND HAS(j.city)
MERGE (c:City {name: j.city})
MERGE (j)-[:IN]->(c)
return count(j)
Obviously you can add your limits back as required.
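For readers unfamiliar with MERGE, the following Python sketch models the same "get or create the city, then link the job" logic in memory, processed in fixed-size batches. It is only an illustration under stated assumptions: the job dictionaries, the function name, and the batch size are invented, and no Neo4j driver is involved.

```python
def link_jobs_in_batches(jobs, batch_size):
    cities = {}    # name -> city record; stands in for MERGE (c:City {name: j.city})
    links = set()  # (job id, city name); stands in for MERGE (j)-[:IN]->(c)
    batches = 0
    # Mirrors HAS(j.city): skip jobs without a city value.
    pending = [job for job in jobs if job.get("city")]
    for start in range(0, len(pending), batch_size):
        for job in pending[start:start + batch_size]:
            # Get-or-create: an existing city is reused, never duplicated.
            city = cities.setdefault(job["city"], {"name": job["city"]})
            links.add((job["id"], city["name"]))
        batches += 1
    return cities, links, batches


# Hypothetical sample data.
jobs = [
    {"id": 1, "city": "New York"},
    {"id": 2, "city": "Boston"},
    {"id": 3},
    {"id": 4, "city": "New York"},
    {"id": 5, "city": "Boston"},
]
cities, links, batches = link_jobs_in_batches(jobs, batch_size=2)
```

Because both steps are idempotent, re-running a batch does not create duplicate cities or links, which is the property that makes chunked processing safe.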
I have set up a graph gist to show my problem: http://gist.neo4j.org/?dropbox-2900504%2Fnames.adoc
I have the problem that if I don't specifically return the person node, or person id, two of my person nodes get merged into one for the return. They both have the same second name and the same labels on the person node (id 3 and 4, Tom and Sarah Smith).
If I add a label to the person node, as with James Smith (id 1) in this example, there is no problem. If I were to remove his :Foo label he would also be merged in with Sarah and Tom in query 2.
If this is not a bug, is there a way for me to return these people distinctly without the person id or node being returned?
I have shown the problem in the above gist, with the only difference between the two queries being that the second one also returns the person id.
Many thanks for your help,
tekiegirl
Edit:
How I want my results to look (basically like query 3 in the gist, but without the person id):
labels          names
[Person, Bar]   [Sally, Jones]
[Person, Foo]   [James, Smith]
[Person]        [Sarah, Smith]
[Person]        [Tom, Smith]
I think maybe you're not expecting the aggregation behavior you get with collect. Is this what you're trying to get?
MATCH (:Club { name:'FooFighters' })-[:MEMBER]->(p:Person)-[r:NAMED]->(n:Name)
RETURN labels(p) AS labels, n.content AS names
ORDER BY r.order, names
Update with more info, and now I understand what you were doing with your multiple names and order by in the WITH:
collect actually does an implicit group by on the other terms, making them distinct and grouping on them. If you want to group on person, then you need to include person p in the WITH/RETURN that you're collecting in. Here's a rewrite. You can avoid returning p if you want, in the last return statement:
MATCH (:Club{name:'FooFighters'})-[:MEMBER]->(p:Person)-[r:NAMED]->(n:Name)
WITH p, n, r
ORDER BY r.order
WITH p, labels(p) as labels, collect(n.content) as names
RETURN labels, names
ORDER BY names[length(names)-1], names[0]
http://gist.neo4j.org/?8008646
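The implicit grouping described above can be imitated in a few lines of Python. In this sketch the rows stand for (person id, labels, name part) already ordered by r.order; the id 2 for Sally Jones is an assumption, and the helper name is invented. Grouping only by labels merges Sarah and Tom, while including the person in the key keeps them apart.

```python
from collections import defaultdict

# (person id, labels, name part), already ordered by r.order within each person.
rows = [
    (1, ("Person", "Foo"), "James"), (1, ("Person", "Foo"), "Smith"),
    (2, ("Person", "Bar"), "Sally"), (2, ("Person", "Bar"), "Jones"),
    (3, ("Person",), "Sarah"), (3, ("Person",), "Smith"),
    (4, ("Person",), "Tom"), (4, ("Person",), "Smith"),
]


def collect_names(records, key):
    # Imitates collect(n.content): every non-aggregated term forms the grouping key.
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[2])
    return dict(groups)


by_labels = collect_names(rows, key=lambda r: r[1])          # grouping key: labels only
by_person = collect_names(rows, key=lambda r: (r[0], r[1]))  # grouping key: person and labels
```

The person can still be dropped from the final output after grouping, which corresponds to carrying p through the WITH and leaving it out of the RETURN.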