In Neo4j for every disjoint subgraph return the node with the most relationships - neo4j

I’m new to Neo4j and graph theory and I’m trying to figure out if I can use Neo4j to solve a problem I have. Please correct me if I’m using the wrong words to describe stuff. Since I’m new to the subject I haven’t really wrapped my head around what to call everything.
I think the easiest way to describe my problem is with a lot of pictures.
Let’s say you have two disjoint subgraphs that look like this.
From the subgraphs above I want to get a list of subgraphs that fulfills one of two criteria.
Criteria 1.
If a node has a unique relationship to another node, the nodes and relationship should be returned as a subgraph.
Criteria 2.
If the relations are not unique, I'd like the node with the most relationships to be returned, as a subgraph with its relationships and related nodes.
If other nodes come in tie in criteria 2, I want all subgraphs to be returned.
Or put in the context of this graph,
Give me the people who have unique games, and if there are other people having the same games, give me back the person with the most games. If they come in tie, return all people who come in tie.
Or actually, return the whole subgraph, not only the person.
To clarify what I am after here is a picture that describes the result I want to get. The ordering of the result is not important.
Disjoint subgraph A, because of Criteria 1, Andrew is the only person who has Bubble Bobble.
Disjoint subgraph B, because of Criteria 1, Johan is the only person who has Puzzle Bobble 1.
Disjoint subgraph C, because of Criteria 2, Julia since she has the most games.
Disjoint subgraph D, because of Criteria 2, Anna since she comes in tie with Julia having the most games.
Worth noting is that Johan's relationship to Puzzle Bobble 2 is not returned because it's not unique and he has not the most games.
Is this a problem you could solve with only Neo4j and is it a good idea?
If you could solve it how would you do it in Cypher?
Create script:
CREATE (p1:Person {name:"Johan"}),
(p2:Person {name:"Julia"}),
(p3:Person {name:"Anna"}),
(p4:Person {name:"Andrew"}),
(v1:Videogame {name:"Puzzle Bobble 1"}),
(v2:Videogame {name:"Puzzle Bobble 2"}),
(v3:Videogame {name:"Puzzle Bobble 3"}),
(v4:Videogame {name:"Puzzle Bobble 4"}),
(v5:Videogame {name:"Bubble Bobble"}),
(p1)-[:HAS]->(v1),
(p1)-[:HAS]->(v2),
(p2)-[:HAS]->(v2),
(p2)-[:HAS]->(v3),
(p2)-[:HAS]->(v4),
(p3)-[:HAS]->(v2),
(p3)-[:HAS]->(v3),
(p3)-[:HAS]->(v4),
(p4)-[:HAS]->(v5)

I feel like this solution might not be quite what you're looking for, but it could be a good start:
MATCH (game:Videogame)<-[:HAS]-(owner:Person)
OPTIONAL MATCH owner-[:HAS]->(other_game:Videogame)
WITH game, owner, count(other_game) AS other_game_count
ORDER BY other_game_count DESC
RETURN game, collect(owner)[0]
Here the query:
Finds all of the games and their owners (games without owners will not be matched)
Does an OPTIONAL MATCH against any other games those owners might own (by doing an optional match we're saying that it's OK if they own zero)
Pass through each game/owner pair along with a count of the number of other games owned by that owner, sorting so that those with the most games come first
RETURN the first owner for each game (the ORDER is preserved when doing the collect)

Related

NEO4J - Matching a path where middle node might exist or not

I have the following graph:
I would look to get all contractors and subcontractors and clients, starting from David.
So I thought of a query likes this:
MATCH (a:contractor)-[*0..1]->(b)-[w:works_for]->(c:client) return a,b,c
This would return:
(0:contractor {name:"David"}) (0:contractor {name:"David"}) (56:client {name:"Sarah"})
(0:contractor {name:"David"}) (1:subcontractor {name:"John"}) (56:client {name:"Sarah"})
Which returns the desired result. The issue here is performance.
If the DB contains millions of records and I leave (b) without a label, the query will take forever. If I add a label to (b) such as (b:subcontractor) I won't hit millions of rows but I will only get results with subcontractors:
(0:contractor {name:"David"}) (1:subcontractor {name:"John"}) (56:client {name:"Sarah"})
Is there a more efficient way to do this?
link to graph example: https://console.neo4j.org/r/pry01l
There are some things to consider with your query.
The relationship type is not specified- is it the case that the only relationships from contractor nodes are works_for and hired? If not, you should constrain the relationship types being matched in your query. For example
MATCH (a:contractor)-[:works_for|:hired*0..1]->(b)-[w:works_for]->(c:client)
RETURN a,b,c
The fact that (b) is unlabelled does not mean that every node in the graph will be matched. It will be reached either as a result of traversing the works_for or hired relationships if specified, or any relationship from :contractor, or via the works_for relationship.
If you do want to label it, and you have a hierarchy of types, you can assign multiple labels to nodes and just use the most general one in your query. For example, you could have a label such as ExternalStaff as the generic label, and then further add Contractor or SubContractor to distinguish individual nodes. Then you can do something like
MATCH (a:contractor)-[:works_for|:hired*0..1]->(b:ExternalStaff)-[w:works_for]->(c:client)
RETURN a,b,c
Depends really on your use cases.

How to calculate custom degree based on the node label or other conditions?

I have a scenario where I need to calcula a custom degree between the first node (:employee) where it should only be incremented to another node when this node's label is :natural or :relative, but not when it is :legal.
Example:
The thing is I'm having trouble generating this custom degree property as I needed it.
So far I've tried playing with FOREACH and CASE but had no luck. The closest I got to getting some sort of calculated custom degree is this:
match p = (:employee)-[*5..5]-()
WITH distinct nodes(p) AS nodes
FOREACH(i IN RANGE(0, size(nodes)) |
FOREACH(node IN [nodes[i]] |
SET node.degree = i
))
return *
limit 1
But even this isn't right, as despite having 5 distinct nodes, I get SIZE(nodes) = 6, as the :legal node is accounted for twice for some reason.
Does anyone know how to achieve my goal within a single cypher query?
Also, if you know why the :legal node is account for twice, please let me know. I suspect it is because it has 2 :natural nodes related to it, but don't know the inner workings that make it appear twice.
More context:
:employee nodes are, well, employees of an organization
:relative nodes are relatives to an employee
:natural nodes are natural persons that may or may not be related to a :legal
:legal nodes are companies (legal persons) that may, or may not, be related to an :employee, :relative, :natural or another :legal on an IS_PARTNER relationship when, in real life, they are part of the board of directors or are shareholders of that company (:legal).
custom degree is what I aim to create and will define how close one node is to another given some conditions to this project (specified below).
All nodes have a total_contracts property that are the total amount of money received through contracts.
The objective is to find any employees with relationships to another node that has total_contracts > 0 and are up to custom degree <= 3, as employees may be receiving money from external sources, when they shouldn't.
As for why I need this custom degree ignoring the distance when it is a :legal node, is because we threat companies as the same distance as the natural person that is a partner.
On the illustrated example above, the employee has a son, DIEGO, that is a shareholder of a company (ALLURE) and has 2 other business partners (JOSE and ROSIEL). When I ask what's the degree of the son to the employee, I should get 1, as they are directly related; when I ask whats the degree of JOSE to the employee I should get 2, as JOSE is related to DIEGO through ALLURE and we shouldn't increment the custom degree when it is a company, only when its a person.
The trick with this type of graph is making sure we avoid paths that loop back to the same nodes (which is definitely going to happen quite a lot because you're using multiple relationships between nodes instead of just one...you may want to make sure this is necessary in your model).
The easiest way to do that is via APOC Procedures, as you can adjust the uniqueness of traversals so that nodes are unique in each path.
So for example, for a specific start node (let's say the :employee has empId:1 just for the sake of mocking up a lookup of the node, we'll calculate a degree for all nodes within 5 hops of the starting node. The idea here is that we'll take the length of the path (the number of hops) - the number of :legal nodes in the path (by filtering the nodes in the path for just :legal nodes, then getting the size of that filtered list).
MATCH (e:employee {empId:1})
CALL apoc.path.expandConfig(e, {minLevel:1, maxLevel:5, uniqueness:'NODE_PATH'}) YIELD path
WITH e, last(nodes(path)) as endNode,
length(path) - size([x in nodes(path) WHERE x:legal]) as customDegree
RETURN e, endNode, customDegree

How can I mitigate having bidirectional relationships in a family tree, in Neo4j?

I am running into this wall regarding bidirectional relationships.
Say I am attempting to create a graph that represents a family tree. The problem here is that:
* Timmy can be Suzie's brother, but
* Suzie can not be Timmy's brother.
Thus, it becomes necessary to model this in 2 directions:
(Sure, technically I could say SIBLING_TO and leave only one edge...what I'm not sure what the vocabulary is when I try to connect a grandma to a grandson.)
When it's all said and done, I pretty sure there's no way around the fact that the direction matters in this example.
I was reading this blog post, regarding common Neo4j mistakes. The author states that this bidirectionality is not the most efficient way to model data in Neo4j and should be avoided.
And I am starting to agree. I set up a mock set of 2 families:
and I found that a lot of queries I was attempting to run were going very, very slow. This is because of the 'all connected to all' nature of the graph, at least within each respective family.
My question is this:
1) Am I correct to say that bidirectionality is not ideal?
2) If so, is my example of a family tree representable in any other way...and what is the 'best practice' in the many situations where my problem may occur?
3) If it is not possible to represent the family tree in another way, is it technically possible to still write queries in some manner that gets around the problem of 1) ?
Thanks for reading this and for your thoughts.
Storing redundant information (your bidirectional relationships) in a DB is never a good idea. Here is a better way to represent a family tree.
To indicate "siblingness", you only need a single relationship type, say SIBLING_OF, and you only need to have a single such relationship between 2 sibling nodes.
To indicate ancestry, you only need a single relationship type, say CHILD_OF, and you only need to have a single such relationship between a child to each of its parents.
You should also have a node label for each person, say Person. And each person should have a unique ID property (say, id), and some sort of property indicating gender (say, a boolean isMale).
With this very simple data model, here are some sample queries:
To find Person 123's sisters (note that the pattern does not specify a relationship direction):
MATCH (p:Person {id: 123})-[:SIBLING_OF]-(sister:Person {isMale: false})
RETURN sister;
To find Person 123's grandfathers (note that this pattern specifies that matching paths must have a depth of 2):
MATCH (p:Person {id: 123})-[:CHILD_OF*2..2]->(gf:Person {isMale: true})
RETURN gf;
To find Person 123's great-grandchildren:
MATCH (p:Person {id: 123})<-[:CHILD_OF*3..3]-(ggc:Person)
RETURN ggc;
To find Person 123's maternal uncles:
MATCH (p:Person {id: 123})-[:CHILD_OF]->(:Person {isMale: false})-[:SIBLING_OF]-(maternalUncle:Person {isMale: true})
RETURN maternalUncle;
I'm not sure if you are aware that it's possible to query bidirectionally (that is, to ignore the direction). So you can do:
MATCH (a)-[:SIBLING_OF]-(b)
and since I'm not matching a direction it will match both ways. This is how I would suggest modeling things.
Generally you only want to make multiple relationships if you actually want to store different state. For example a KNOWS relationship could only apply one way because person A might know person B, but B might not know A. Similarly, you might have a LIKES relationship with a value property showing how much A like B, and there might be different strengths of "liking" in the two directions

neo4j and uni-directional relationship

I'm new to neo4j. I've just read some information on this tool, installed it on Ubuntu and made a bunch of queries. And at this moment, I must confess, that I realy like it. However, there is something (I think very simple and intuitive), which I do not know how to implement. So, I created three nodes like so:
CREATE (n:Object {id:1}) RETURN n
CREATE (n:Object {id:2}) RETURN n
CREATE (n:Object {id:3}) RETURN n
And I created a hierarchical relationship between them:
MATCH (a:Object {id:1}), (b:Object {id:2}) CREATE (a)-[:PARENT]->(b)
MATCH (a:Object {id:2}), (b:Object {id:3}) CREATE (a)-[:PARENT]->(b)
So, I think this simple hierarchy should look like this:
(id:1)
-> (id:2)
-> (id:3)
What I want now is to get a path from any node. For example, if I want to have a path from node (id:2), I will get (id:2) -> (id:3). And if I want to get a path from node (id:1), I will get (id:1)->(id:2)->(id:3). I tried this query:
MATCH (n:Object {id:2})-[*]-(children) return n, children
which I though should return a path (id:2)->(id:3), but unexpectedly (just for me) it returns (id:1)->(id:2)->(id:3). So, what I'm doing wrong and what is the right query to use?
All relationships in neo4j are directed. When you say (n)-[:foo]->(m), that relationship goes only one way, from n to m.
Now what's tricky about this is that you can navigate the relationship both ways. This doesn't make the relationship bi-directional, it never is -- it only means that you can look at it in either direction.
When you write this query: (n:Object {id:2})-[*]-(children) you didn't put an arrow head on that relationship, so children could refer to something either downstream or upstream of the node in question.
In other words, saying (n)-[:test]-(m) is the same thing as matching both (n)<-[:test]-(m) and (n)-[:test]->(m).
So children could refer to the ID 1 object or ID 2 object.
Returning only children
To directly answer your question,
Your query
MATCH (n:Object {id:2})-[*]-(children) return n, children
matches not only relationships FROM (n {id:2}) TO its children, but also relationships TO (n {id:2}) FROM its parents.
You need to additionally specify the direction that you'd like. This returns the results you expect:
MATCH (n:Object {id:2})-[*]->(children) return n, children
Issues with the example
I'd like to answer your comment about uni-directional and bi-directional relationships, but let's first resolve a couple of issues with the example.
Using correct labels
Let's revisit your example:
(:Object {id:1})-[:PARENT]->(:Object {id:2})-[:PARENT]->(:Object {id:3})
There's no point to using labels like :Object, :Node, :Thing. If you really don't care, don't use a label at all!
In this case, it looks we're talking about people, although it could easily also be motherboards and daughterboards, or something else!
Let's use People instead of Objects:
(:Person {id:1})-[:PARENT]->(:Person {id:2})-[:PARENT]->(:Person {id:3})
IDs in Neo4j
Neo4j stores its own IDs of every node and relationship. You can retrieve those IDs with id(nodeOrRelationship), and access by ID with a WHERE clause or by specifying them as a start point for your match. START n=node(2) MATCH (n)-[*]-(children) return n, children is equivalent to your original query MATCH (n:Object {id:2})-[*]-(children) return n, children.
Let's, instead of IDs, store something useful about the nodes, like names:
(:Person {name:'Bob'})-[:PARENT]->(:Person {name:'Mary'})-[:PARENT]->(:Person {name:'Tom'})
Relationship ambiguity
Lastly, let's disambiguate the relationships. Does PARENT mean "is the parent of", or "has this parent"? It might be clear to you which one you meant, but someone unfamiliar with your system might have the opposite interpretation.
I think you meant "is the parent of", so let's make that clear:
(:Person {name:'Bob'})-[:PARENT_OF]->(:Person {name:'Mary'})-[:PARENT_OF]->(:Person {name:'Tom'})
More information about uni-directional and bi-directional relationships in Neo4j
Now that we've taken care of a few basic issues with the example, let's address the directionality of relationships in Neo4j and graphs in general.
There are several ways we could have expressed the relationships this example. Let's look at a few.
Undirected/bidirectional relationship
Let's abstract the parent relationship that we used above, for the purposes of discussion:
(bob)-[:KIN]-(mary)-[:KIN]-(tom)
Here the relationship KIN indicates that they are related but we don't know exactly who is the parent of whom. Is Tom the child of Mary, or vice-versa?
Notice that I didn't use any arrows. In the graph pseudo-code above, the KIN relationship is a bidirectional or undirected relationship.
Relationships in Neo4j, however, are always directional. If the KIN relationship was really how you wanted to track things, then you'd create a directional relationship, but always ignore the direction in your MATCH queries, e.g. MATCH (a)-[:KIN]-(b) and not MATCH (a)-[:KIN]->(b).
But is the KIN relationship really the best way to store this information? We can make it more specific. Let's go back to the PARENT_OF relationship that we were using earlier.
Directed/unidirectional relationship
Back to the example. We know that Bob is the parent of Mary who is the parent of Tom:
(bob)-[:PARENT_OF]->(mary)-[:PARENT_OF]->(tom)
Obviously, the corollary of this is:
(bob)<-[:CHILD_OF]-(mary)<-[:CHILD_OF]-(tom)
Or, equivalently:
(tom)-[:CHILD_OF]->(mary)-[:CHILD_OF]->(bob)
So, should we go ahead and create both the PARENT_OF and the CHILD_OF relationships between our (bob), (mary) and (tom) nodes?
The answer is no. We can pick one of those relationships, whichever best models the idea, and still be able to search both ways.
Using only the :PARENT_OF relationship, we can do
MATCH (mary {name:'Mary'})-[:PARENT_OF]->(children) RETURN children
to find the children, or
MATCH (mary {name:'Mary'})<-[:PARENT_OF]-(parents) RETURN parents
to find the parents, using (mary) as the starting point each time.
For more information, see this fantastic article from GraphAware

Neo4j - Cypher return 1 to 1 relationships

Using neo4j 1.9.2, I'm trying to find all nodes in my graph that have a one to one relationship to another node. Let's say I have persons in my graph and I would like to find all persons, that have exactly one friend (since 2013), and this one friend only has the other person as friend and no one else. As a return, I would like to have all these pairs of "isolated" friends.
I tried the following:
START n=node(*) MATCH n-[r:is_friend]-m-[s:is_friend]-n
WHERE r.since >= 2013 and s.since >= 2013
WITH n, m, count(r), count(s)
WHERE count(r) = 1 AND count(s) = 1
RETURN n, m
But this query does not what it is supposed to do - it simply returns nothing.
Note: There exists just one relation between the two persons. So one friend has a incoming relationship and the other one an outgoing one. Also, these two persons might have some other relations, like "works_in" or so, but I just want to check if there is a 1:1 relation of type *is_friends* between the persons.
EDIT: The suggestion of Stefan works perfect if using node(*) as starting point. But when trying this query for one specific node as start point (e.g. start n=node(42)), it doesn't work. What would the solution look like in this case?
Update: I'm still wondering about a solution for this szenario: How to check if a given start node has a 1-to-1 relation to another node of a specific relationship type. Any ideas?
Here it's crucial to understand the concept of paths in the MATCH clause. A path is a alternating collection of node, relationship, node, relationship, .... node. There is the constraint that the same relationship will never occur twice in the same path - otherwise there would be a danger of having endless loops.
That said, you need to decide if is_friend in your domain is directed. If it is directed you'd distinguish a being friend to b and b being friend to a. From the description I assume is_friend is undirected and the statement should look like:
START n=node(*) MATCH n-[r:is_friend]-()
WHERE r.since >= 2013
WITH n, count(r) as numberOfFriends
WHERE numberOfFriends=1
RETURN n
You don't have to care about the other end, it's traversed nonetheless since you do a node(*). Be aware that node(*) gets obviously more expensive when your graph grows.

Resources