Finding the most similar node by their shared child nodes - neo4j

I am trying to find a node which would be the most similar to another one by the child nodes they both share and then list those nodes they share.
For example I have:
N1-[has]->A
N1-[has]->B
N1-[has]->C
N1-[has]->D
N2-[has]->A
N2-[has]->B
N2-[has]->E
N2-[has]->F
N3-[has]->A
N3-[has]->B
N3-[has]->C
N3-[has]->G
So I want to check which node is the most similar to N1 by its child nodes.
It should be N3, because they share 3 child nodes.
Now I can find which node it is by using:
match (n1:Node {name: "some name"})-[:HAS]->(i)<-[:HAS]-(n2:Node)
with n2.name as n, count(*) as c
return n order by c desc limit 1
But I also need the list of the nodes they share. I have been stuck on this for quite some time and cannot get my head around it.

You can use collect() to gather the shared nodes into a list and then return it:
match (n1:Node {name: "some name"})-[:HAS]->(i)<-[:HAS]-(n2:Node)
with n2.name as n, collect(i) as similarNodes, count(*) as c
return n, similarNodes
order by c desc limit 1
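A minimal sketch of the same idea run against the sample data from the question. The :Item label on the child nodes and their name property are assumptions made for illustration, since the question does not say how the child nodes are labelled.

```cypher
// Sample data from the question; the :Item label is an assumption.
CREATE (n1:Node {name: "N1"}), (n2:Node {name: "N2"}), (n3:Node {name: "N3"}),
       (a:Item {name: "A"}), (b:Item {name: "B"}), (c:Item {name: "C"}),
       (d:Item {name: "D"}), (e:Item {name: "E"}), (f:Item {name: "F"}),
       (g:Item {name: "G"}),
       (n1)-[:HAS]->(a), (n1)-[:HAS]->(b), (n1)-[:HAS]->(c), (n1)-[:HAS]->(d),
       (n2)-[:HAS]->(a), (n2)-[:HAS]->(b), (n2)-[:HAS]->(e), (n2)-[:HAS]->(f),
       (n3)-[:HAS]->(a), (n3)-[:HAS]->(b), (n3)-[:HAS]->(c), (n3)-[:HAS]->(g);

// Group by the other node, collect the names of the shared children,
// and keep the candidate that shares the most of them.
MATCH (n1:Node {name: "N1"})-[:HAS]->(i)<-[:HAS]-(n2:Node)
WITH n2, collect(i.name) AS shared
RETURN n2.name AS mostSimilar, shared, size(shared) AS sharedCount
ORDER BY sharedCount DESC
LIMIT 1;
```

On this data the query should return N3 with A, B and C as the shared children. The order of the names inside the list is not guaranteed.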

Related

Neo4j Count the most connected parent node given a set of child node

Say I have multiple trees:
A<-{D, E}<-F
B<-{E, G}
C<-{E, H}
//Where only A, B, and C are of (:parent{name:""})
//The rest are children
Given a set of children nodes:
{E, F} //(:child{name:""})
//Clearly A is the most connected parent even if F is not directly connected to A
Question: How can I find the most connected parent node given a collection of child nodes? Any Cypher query, plugin function, or procedure is welcome.
Here's what I have tried, without luck, because it counts the total relationships between the two nodes:
MATCH (c:child)--(p:parent)
WHERE c.name IN ['E', 'F']
RETURN p ORDER BY size( (p)--(c) ) DESC LIMIT 1
//Also tried size( (p)--() ), but it counts all relationships that the parent node has.
The concept you're missing is variable-length relationship patterns. With this you can match from the :child nodes you need to :parent nodes at a variable distance, then count the times the parent nodes occur and take the top:
MATCH (c:child)-[*]->(p:parent) // assumes only incoming rels toward :parent
WHERE c.name IN ['E', 'F'] // make sure you have an index on :child(name)
WITH p, count(p) as connected
RETURN p
ORDER BY connected DESC
LIMIT 1
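Note that count(p) counts matched paths, not distinct children. A sketch that returns both numbers side by side, assuming every relationship points toward the parent as the arrows in the question suggest (F->D, F->E, D->A, E->A, E->B, E->C). The upper bound of 5 hops is an arbitrary assumption, added only to keep the variable-length expansion bounded on a large graph.

```cypher
// Count both the paths and the distinct starting children per parent.
MATCH (c:child)-[*1..5]->(p:parent)
WHERE c.name IN ['E', 'F']
RETURN p.name AS parent,
       count(*) AS paths,               // every matched path to this parent
       count(DISTINCT c) AS children    // how many of the given children reach it
ORDER BY paths DESC, children DESC
LIMIT 1
```

Under that assumption, A is reached by three paths (E->A, F->D->A, F->E->A) while B and C are reached by two each, so A comes out on top. Counting only distinct children would give a three-way tie here, because F also reaches B and C through E.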
Alright, so I tried something else, but I'm not sure whether it is efficient enough for a huge graph (say, 2M+ nodes):
MATCH path= shortestPath( (c:child)--(p:parent) )
WHERE c.name IN [...]
WITH p, collect(path) as cnt
RETURN p, size(cnt) AS nchild
ORDER BY nchild DESC LIMIT 1
Any opinion on this?

Neo4j how to create relationships of nodes in neo4j within a list?

I want to connect the first node of a list to the other nodes in the list via a relationship in Neo4j.
My approach is:
MATCH (n)
WITH n.title AS id, COLLECT(n) as nodes
where size(nodes)>1 ,COALESCE(COLLECT(n)) as firstNode
UNWIND TAIL(nodes) as x
CREATE (firstNode)-[r:Child]->(x)
return r
Basically, I have some nodes with the same title. I want to group them and make one node of each same-title group the superior one by creating a Child relationship from it to the other nodes in the list.
Try this:
MATCH (n)
WITH n.title AS id, collect(n) as nodes
WHERE size(nodes) > 1
WITH nodes[0] as firstNode, nodes[1..] as otherNodes
UNWIND otherNodes as other
CREATE (firstNode)-[r:Child]->(other)
In the second WITH, I extract the first node and a list running from the second node to the end. Then I unwind the otherNodes list and create the desired relationship between firstNode and each unwound node.
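A hedged variant of the same approach, for the case where the query may be run more than once or where some nodes have no title. It makes three assumptions that are not in the original: nodes without a title should be skipped (otherwise they all fall into one group keyed by null), the node with the lowest internal id should become the first node, and MERGE is preferred over CREATE so a re-run does not add duplicate relationships. Note that id() is deprecated in Neo4j 5, so any other stable ordering key would do equally well.

```cypher
MATCH (n)
WHERE n.title IS NOT NULL            // skip nodes that have no title at all
WITH n ORDER BY id(n)                // fix the order before collecting
WITH n.title AS title, collect(n) AS nodes
WHERE size(nodes) > 1
WITH head(nodes) AS firstNode, tail(nodes) AS otherNodes
UNWIND otherNodes AS other
MERGE (firstNode)-[r:Child]->(other) // idempotent: reuses an existing relationship
RETURN firstNode.title AS title, count(r) AS links
```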

Create association between nodes if one doesnt exist using cypher

Say there are 2 labels, P and M. M has nodes with names M1, M2, M3 .. M10. I need to associate 50 nodes of P with each node of M. Also, no node of label P should be associated with more than one node of M.
This is the Cypher query I came up with, but it doesn't seem to work.
MATCH (u:P), (r:M{Name:'M1'}),(s:M)
where not (s)-[:OWNS]->(u)
with u limit 50
CREATE (r)-[:OWNS]->(u);
This way I would run it once for each of the 10 nodes of M. Any help in correcting the query is appreciated.
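You can use the apoc.periodic.* procedures for batching; see the APOC documentation for more information. (On Neo4j 4.0 and later, write the parameter as $limit instead of {limit}.)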
call apoc.periodic.commit("
MATCH (u:P), (r:M{Name:'M1'}),(s:M) where not (s)-[:OWNS]->(u)
with u,r limit {limit}
CREATE (r)-[:OWNS]->(u)
RETURN count(*)
",{limit:10000})
If there will only ever be one (r)-[:OWNS]->(u) relationship per P node, I would change the first MATCH to also exclude the nodes that r already owns:
call apoc.periodic.commit("
MATCH (u:P), (r:M{Name:'M1'}),(s:M) where not (s)-[:OWNS]->(u) and not (r)-[:OWNS]->(u)
with u,r limit {limit}
CREATE (r)-[:OWNS]->(u)
RETURN count(*)
",{limit:10000})
That way the procedure cannot fall into an endless loop.
This query should be fast and easy to understand. It is fast because it avoids Cartesian products:
MATCH (u:P)
WHERE not (:M)-[:OWNS]->(u)
WITH u LIMIT 50
MATCH (r:M {Name:'M1'})
CREATE (r)-[:OWNS]->(u);
It first matches 50 unowned P nodes. It then finds the M node that is supposed to be the "owner", and creates an OWNS relationship between it and each of the 50 P nodes.
To make this query even faster, you can first create an index on :M(Name) so that the owning M node can be found quickly (without scanning all M nodes):
CREATE INDEX ON :M(Name);
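The CREATE INDEX ON form above is the older syntax. On Neo4j 4.x and later the equivalent statement would look roughly like this; the index name m_name is a hypothetical choice made for illustration.

```cypher
// Newer index syntax (Neo4j 4.0+); m_name is an illustrative index name.
CREATE INDEX m_name FOR (m:M) ON (m.Name);
```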
This worked for me.
MATCH (u:P), (r:M{Name:'M1'}),(s:M)
where not (s)-[:OWNS]->(u)
with u,r limit 50
CREATE (r)-[:OWNS]->(u);
Thanks to Thomas for mentioning the limit on u and r.
I think this is one way to connect all 10 :M nodes in one query:
MATCH (m:M)
WITH collect(m) as nodes
UNWIND nodes as node
MATCH (p:P) where not ()-[:OWNS]->(p)
WITH node,p limit 50
CREATE (node)-[:OWNS]->(p)
Although I am not really sure we need to collect and unwind; it could be simplified to:
MATCH (m:M)
MATCH (p:P) where not ()-[:OWNS]->(p)
WITH m,p limit 50
CREATE (m)-[:OWNS]->(p)
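One caveat: LIMIT 50 applies to the combined (m, p) rows, so it caps the whole result at 50 rows rather than giving each M node its own 50 P nodes. A sketch of one way to hand out disjoint batches in a single query, assuming the set of unowned P nodes is small enough to collect into one list in memory:

```cypher
// Gather every P node that no M node owns yet.
MATCH (p:P) WHERE NOT (:M)-[:OWNS]->(p)
WITH collect(p) AS unowned
// Gather the owners and give each one an index.
MATCH (m:M)
WITH unowned, collect(m) AS owners
UNWIND range(0, size(owners) - 1) AS i
// Owner i gets the i-th slice of 50 unowned nodes (possibly shorter or empty).
WITH owners[i] AS m, unowned[i * 50 .. (i + 1) * 50] AS batch
UNWIND batch AS p
CREATE (m)-[:OWNS]->(p)
```

Because the slices do not overlap, no P node ends up with two owners. If fewer than 500 unowned P nodes exist, the later M nodes simply receive fewer nodes or none.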

Neo4j duplicate relationship

I have duplicate relationships between nodes, e.g.:
(Author)-[:CONNECTED_TO {weight: 1}]->(Coauthor)
(Author)-[:CONNECTED_TO {weight: 1}]->(Coauthor)
(Author)-[:CONNECTED_TO {weight: 1}]->(Coauthor)
and I want to merge these relationships into a single one of the form (Author)-[:CONNECTED_TO {weight: 3}]->(Coauthor) across my whole graph.
I tried something like the following (I'm reading the data from a CSV file):
MATCH (a:Author {authorid: csvLine.author_id}),(b:Coauthor { coauthorid: csvLine.coauthor_id})
CREATE UNIQUE (a)-[r:CONNECTED_TO]-(b)
SET r.weight = coalesce(r.weight, 0) + 1
But when I run this query, it creates duplicate Coauthor nodes. The weight is updated. The result looks like this:
(Author)-[r:CONNECTED_TO]->(Coauthor)
(It creates 3 identical Coauthor nodes for the author.)
If you need to fix it after the fact, you could aggregate all of the relationships and the weight between each set of applicable nodes. Then update the first relationship with the new aggregated number. Then with the collection of relationships delete the second through the last. Perform the update only where there is more than one relationship. Something like this...
MATCH (a:Author {name: 'A'})-[r:CONNECTED_TO]->(b:Coauthor {name: 'B'})
// aggregate the relationships and limit it to those with more than 1
WITH a, b, collect(r) AS rels, sum(r.weight) AS new_weight
WHERE size(rels) > 1
// update the first relationship with the new total weight
SET (rels[0]).weight = new_weight
// bring the aggregated data forward
WITH a, b, rels, new_weight
// delete the relationships 1..n
UNWIND range(1,size(rels)-1) AS idx
DELETE rels[idx]
If you are doing it for the whole graph and the graph is large, you may want to perform the update in batches using LIMIT or some other control mechanism.
MATCH (a:Author)-[r:CONNECTED_TO]->(b:Coauthor)
WITH a, b, collect(r) AS rels, sum(r.weight) AS new_weight
// keep only pairs that still have duplicates, then take one batch of them
WHERE size(rels) > 1
WITH a, b, rels, new_weight
LIMIT 100
SET (rels[0]).weight = new_weight
WITH a, b, rels, new_weight
UNWIND range(1,size(rels)-1) AS idx
DELETE rels[idx]
If you want to eliminate the problem when loading...
MATCH (a:Author {authorid: csvLine.author_id}),(b:Coauthor { coauthorid: csvLine.coauthor_id})
MERGE (a)-[r:CONNECTED_TO]->(b)
ON CREATE SET r.weight = 1
ON MATCH SET r.weight = coalesce(r.weight, 0) + 1
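For context, a sketch of how that MERGE could sit inside a complete load statement. The file name coauthors.csv is hypothetical, and the choice to MERGE the two nodes (rather than MATCH them) is an assumption made so that each author and coauthor is created at most once.

```cypher
// Hypothetical file name; the column names are the ones used in the question.
LOAD CSV WITH HEADERS FROM 'file:///coauthors.csv' AS csvLine
// MERGE on the key property so repeated rows reuse the same nodes.
MERGE (a:Author {authorid: csvLine.author_id})
MERGE (b:Coauthor {coauthorid: csvLine.coauthor_id})
// One relationship per pair; each further row for the pair bumps the weight.
MERGE (a)-[r:CONNECTED_TO]->(b)
ON CREATE SET r.weight = 1
ON MATCH SET r.weight = coalesce(r.weight, 0) + 1
```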
Side note: without really knowing your data model, I would consider modelling Coauthor as Author, since coauthors are likely authors in their own right. It is probably only in the context of a particular project that they would be considered a coauthor.

How to retrieve only the nodes from the path in Neo4J Cypher query?

I have a query of the following kind:
MATCH (u1:User{name:"user_name"}), (s1:Statement), s1-[:BY]->u1
WITH DISTINCT s1,u1
MATCH (s2:Statement), s2-[:BY]->u1,
p=s1<-[:OF]-c-[:OF]->s2
WHERE s1 <> s2
WITH collect(p) AS coll, count(p) AS paths, s1, s2
RETURN s1,s2,paths,coll
ORDER BY paths DESC
LIMIT 2;
Right now it returns a list of all the paths p in the coll variable. I want it to list only the nodes c. How can I do that?
Maybe the query is not right. In that case, what I'm trying to do is:
1) Find all statements made by a user;
2) Find the nodes that connect any two of those statements;
3) Return the pairs of statements that have the most nodes connecting them, in descending order, including the names of the actual nodes that connect them.
Thank you!
I can't test it at the moment, but you could try something like
MATCH (u:User {name:"user_name"})<-[:BY]-(s1)<-[:OF]-(c)-[:OF]->(s2)-[:BY]->(u)
RETURN s1, s2, collect(c) as connections
ORDER BY size(connections) DESC
LIMIT 2
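Since the question also asks for the names of the connecting nodes, here is a hedged variation on the same pattern. It assumes the connecting nodes carry a name property, and it uses an id() comparison (deprecated in Neo4j 5, but a common idiom) so that each pair of statements is reported once rather than in both orders.

```cypher
MATCH (u:User {name: "user_name"})<-[:BY]-(s1)<-[:OF]-(c)-[:OF]->(s2)-[:BY]->(u)
WHERE id(s1) < id(s2)                        // keep one row per unordered pair
WITH s1, s2, collect(DISTINCT c.name) AS connections
RETURN s1, s2, connections, size(connections) AS paths
ORDER BY paths DESC
LIMIT 2
```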
