How can I mitigate having bidirectional relationships in a family tree, in Neo4j? - neo4j

I am running into this wall regarding bidirectional relationships.
Say I am attempting to create a graph that represents a family tree. The problem here is that:
* Timmy can be Suzie's brother, but
* Suzie can not be Timmy's brother.
Thus, it becomes necessary to model this in 2 directions:
(Sure, technically I could say SIBLING_TO and leave only one edge...what I'm not sure what the vocabulary is when I try to connect a grandma to a grandson.)
When it's all said and done, I pretty sure there's no way around the fact that the direction matters in this example.
I was reading this blog post, regarding common Neo4j mistakes. The author states that this bidirectionality is not the most efficient way to model data in Neo4j and should be avoided.
And I am starting to agree. I set up a mock set of 2 families:
and I found that a lot of queries I was attempting to run were going very, very slow. This is because of the 'all connected to all' nature of the graph, at least within each respective family.
My question is this:
1) Am I correct to say that bidirectionality is not ideal?
2) If so, is my example of a family tree representable in any other way...and what is the 'best practice' in the many situations where my problem may occur?
3) If it is not possible to represent the family tree in another way, is it technically possible to still write queries in some manner that gets around the problem of 1) ?
Thanks for reading this and for your thoughts.

Storing redundant information (your bidirectional relationships) in a DB is never a good idea. Here is a better way to represent a family tree.
To indicate "siblingness", you only need a single relationship type, say SIBLING_OF, and you only need to have a single such relationship between 2 sibling nodes.
To indicate ancestry, you only need a single relationship type, say CHILD_OF, and you only need to have a single such relationship between a child to each of its parents.
You should also have a node label for each person, say Person. And each person should have a unique ID property (say, id), and some sort of property indicating gender (say, a boolean isMale).
With this very simple data model, here are some sample queries:
To find Person 123's sisters (note that the pattern does not specify a relationship direction):
MATCH (p:Person {id: 123})-[:SIBLING_OF]-(sister:Person {isMale: false})
RETURN sister;
To find Person 123's grandfathers (note that this pattern specifies that matching paths must have a depth of 2):
MATCH (p:Person {id: 123})-[:CHILD_OF*2..2]->(gf:Person {isMale: true})
RETURN gf;
To find Person 123's great-grandchildren:
MATCH (p:Person {id: 123})<-[:CHILD_OF*3..3]-(ggc:Person)
RETURN ggc;
To find Person 123's maternal uncles:
MATCH (p:Person {id: 123})-[:CHILD_OF]->(:Person {isMale: false})-[:SIBLING_OF]-(maternalUncle:Person {isMale: true})
RETURN maternalUncle;

I'm not sure if you are aware that it's possible to query bidirectionally (that is, to ignore the direction). So you can do:
MATCH (a)-[:SIBLING_OF]-(b)
and since I'm not matching a direction it will match both ways. This is how I would suggest modeling things.
Generally you only want to make multiple relationships if you actually want to store different state. For example a KNOWS relationship could only apply one way because person A might know person B, but B might not know A. Similarly, you might have a LIKES relationship with a value property showing how much A like B, and there might be different strengths of "liking" in the two directions

Related

Why neo4j don't allows not directed or bidirectional relationships at creation time?

I know that Neo4j requires a relationship direction at creation time, but allows ignore this direction in query time. By this way I can query my graph ignoring the relationship direction.
I also know that there are some workarounds for cases when the relationships are naturally bidirectional or not directed, like described here.
My question is: Why is it implemented that way? Has a good reason to not allow not directed or bidirectional relationships at creation time? Is it a limitation of the database architecture?
The Cypher statements like below are not allowed:
CREATE ()-[:KNOWS]-()
CREATE ()<-[:KNOWS]->()
I searched the web for an answer, but I did not find much. For example, this github issue.
Is strange to have to define a relationship direction to one that don't have it. It seems to me that i'm hurting the semantic of my graph.
EDIT 1:
To clarify my standpoint about a "semantic problem" (maybe the term is wrong):
Suppose that I run this simple CREATE statement:
CREATE (a:Person {name:'a'})-[:KNOWS]->(b:Person {name:'b'})
As result i have this very simple graph:
The :KNOWS relationship has a direction only because Neo4j requires a relationship direction at creation time. In my domain a knows b and b knows a.
Now, a new team member will query my graph with this Cypher query:
MATCH path = (a:Person {name:'a'})-[:KNOWS]-(b:Person {name:'b'})
return path
This new team member don't know that when I created this graph I considered that :KNOWS relationship is not directed. The result that he will see is the same:
By the result this new team member can think that only Person a consider knows Person b. It seems to me bad. Not for you? This make any sense?
Fundamentally, it boils down to the internals of how the data is stored on disk in Neo4j -- note Chapter 6 of the O'Reilly Neo4j e-book.
In the data structure of a relationship they have a "firstNode" and a "secondNode", where each is either the left or the right hand side of the relationship.
To flag a relationship as uni/bi-directional would require an additional bit per node, where I would argue it is better to retain the direction in the data store and just ignore direction during querying.
In Neo4j relationships are always directed.
But if you don't care about the direction, you can ignore the direction when querying.
MATCH (p1:Person {name:"me"})-[:KNOWS]-(p2)
RETURN p2;
And with MERGE you can also leave off the direction when creating.
MATCH (p1:Person {name:"me"})
MATCH (p2:Person {name:"you"})
MERGE (p1)-[:KNOWS]-(p2);
You only need 2 relationships if they really convey a different meaning, e.g. :FOLLOWS on Twitter.
It seems to me that i'm hurting the semantic of my graph.
I can't see why a < or > symbol used during creation of a relationship hurts the semantics of your graph if you are going to not use that symbol during matching (and thus treating that relationship as undirected/bidirectional).
Suppose that the syntax proposed by you is supported. Now how will you connect with an undirected relationship two nodes a and b? You still have two options:
CREATE (a)-[:KNOWS]-(b)
CREATE (b)-[:KNOWS]-(a)
The pair (a, b) is always ordered by appearance even if not by semantics. So even if we remove the < or > symbol from the relationship declaration, the problem with the order of nodes in it cannot be eliminated. Therefore simply don't treat it is a problem.

neo4j and uni-directional relationship

I'm new to neo4j. I've just read some information on this tool, installed it on Ubuntu and made a bunch of queries. And at this moment, I must confess, that I realy like it. However, there is something (I think very simple and intuitive), which I do not know how to implement. So, I created three nodes like so:
CREATE (n:Object {id:1}) RETURN n
CREATE (n:Object {id:2}) RETURN n
CREATE (n:Object {id:3}) RETURN n
And I created a hierarchical relationship between them:
MATCH (a:Object {id:1}), (b:Object {id:2}) CREATE (a)-[:PARENT]->(b)
MATCH (a:Object {id:2}), (b:Object {id:3}) CREATE (a)-[:PARENT]->(b)
So, I think this simple hierarchy should look like this:
(id:1)
-> (id:2)
-> (id:3)
What I want now is to get a path from any node. For example, if I want to have a path from node (id:2), I will get (id:2) -> (id:3). And if I want to get a path from node (id:1), I will get (id:1)->(id:2)->(id:3). I tried this query:
MATCH (n:Object {id:2})-[*]-(children) return n, children
which I though should return a path (id:2)->(id:3), but unexpectedly (just for me) it returns (id:1)->(id:2)->(id:3). So, what I'm doing wrong and what is the right query to use?
All relationships in neo4j are directed. When you say (n)-[:foo]->(m), that relationship goes only one way, from n to m.
Now what's tricky about this is that you can navigate the relationship both ways. This doesn't make the relationship bi-directional, it never is -- it only means that you can look at it in either direction.
When you write this query: (n:Object {id:2})-[*]-(children) you didn't put an arrow head on that relationship, so children could refer to something either downstream or upstream of the node in question.
In other words, saying (n)-[:test]-(m) is the same thing as matching both (n)<-[:test]-(m) and (n)-[:test]->(m).
So children could refer to the ID 1 object or ID 2 object.
Returning only children
To directly answer your question,
Your query
MATCH (n:Object {id:2})-[*]-(children) return n, children
matches not only relationships FROM (n {id:2}) TO its children, but also relationships TO (n {id:2}) FROM its parents.
You need to additionally specify the direction that you'd like. This returns the results you expect:
MATCH (n:Object {id:2})-[*]->(children) return n, children
Issues with the example
I'd like to answer your comment about uni-directional and bi-directional relationships, but let's first resolve a couple of issues with the example.
Using correct labels
Let's revisit your example:
(:Object {id:1})-[:PARENT]->(:Object {id:2})-[:PARENT]->(:Object {id:3})
There's no point to using labels like :Object, :Node, :Thing. If you really don't care, don't use a label at all!
In this case, it looks we're talking about people, although it could easily also be motherboards and daughterboards, or something else!
Let's use People instead of Objects:
(:Person {id:1})-[:PARENT]->(:Person {id:2})-[:PARENT]->(:Person {id:3})
IDs in Neo4j
Neo4j stores its own IDs of every node and relationship. You can retrieve those IDs with id(nodeOrRelationship), and access by ID with a WHERE clause or by specifying them as a start point for your match. START n=node(2) MATCH (n)-[*]-(children) return n, children is equivalent to your original query MATCH (n:Object {id:2})-[*]-(children) return n, children.
Let's, instead of IDs, store something useful about the nodes, like names:
(:Person {name:'Bob'})-[:PARENT]->(:Person {name:'Mary'})-[:PARENT]->(:Person {name:'Tom'})
Relationship ambiguity
Lastly, let's disambiguate the relationships. Does PARENT mean "is the parent of", or "has this parent"? It might be clear to you which one you meant, but someone unfamiliar with your system might have the opposite interpretation.
I think you meant "is the parent of", so let's make that clear:
(:Person {name:'Bob'})-[:PARENT_OF]->(:Person {name:'Mary'})-[:PARENT_OF]->(:Person {name:'Tom'})
More information about uni-directional and bi-directional relationships in Neo4j
Now that we've taken care of a few basic issues with the example, let's address the directionality of relationships in Neo4j and graphs in general.
There are several ways we could have expressed the relationships this example. Let's look at a few.
Undirected/bidirectional relationship
Let's abstract the parent relationship that we used above, for the purposes of discussion:
(bob)-[:KIN]-(mary)-[:KIN]-(tom)
Here the relationship KIN indicates that they are related but we don't know exactly who is the parent of whom. Is Tom the child of Mary, or vice-versa?
Notice that I didn't use any arrows. In the graph pseudo-code above, the KIN relationship is a bidirectional or undirected relationship.
Relationships in Neo4j, however, are always directional. If the KIN relationship was really how you wanted to track things, then you'd create a directional relationship, but always ignore the direction in your MATCH queries, e.g. MATCH (a)-[:KIN]-(b) and not MATCH (a)-[:KIN]->(b).
But is the KIN relationship really the best way to store this information? We can make it more specific. Let's go back to the PARENT_OF relationship that we were using earlier.
Directed/unidirectional relationship
Back to the example. We know that Bob is the parent of Mary who is the parent of Tom:
(bob)-[:PARENT_OF]->(mary)-[:PARENT_OF]->(tom)
Obviously, the corollary of this is:
(bob)<-[:CHILD_OF]-(mary)<-[:CHILD_OF]-(tom)
Or, equivalently:
(tom)-[:CHILD_OF]->(mary)-[:CHILD_OF]->(bob)
So, should we go ahead and create both the PARENT_OF and the CHILD_OF relationships between our (bob), (mary) and (tom) nodes?
The answer is no. We can pick one of those relationships, whichever best models the idea, and still be able to search both ways.
Using only the :PARENT_OF relationship, we can do
MATCH (mary {name:'Mary'})-[:PARENT_OF]->(children) RETURN children
to find the children, or
MATCH (mary {name:'Mary'})<-[:PARENT_OF]-(parents) RETURN parents
to find the parents, using (mary) as the starting point each time.
For more information, see this fantastic article from GraphAware

What is the most performant way to create the following MATCH statement and why?

The question:
What is the most performant way to create the following MATCH statement and why?
The detailed problem:
Let's say we have a Place node with a variable amount of properties and need to look up nodes from potentially billions of nodes by it's category. I'm trying to wrap my head around the performance of each query and it's proving to be quite difficult.
The possible queries:
Match Place node using a property lookup:
MATCH (entity:Place { category: "Food" })
Match Place node with isCategory relationship to Food node:
MATCH (entity:Place)-[:isCategory]->(category:Food)
Match Place node with Food relationship to Category node:
MATCH (entity)-[category:Food]->(:Category)
Match Food node with isCategoryFor relationship to Place node:
MATCH (category:Food)-[:isCategoryFor]->(entity:place)
And obviously all the variations in between. With relationship directions going the other way as well.
More complexity:
Let's throw in a little more complexity and say we now need to find all Place nodes using multiple categories. For example: Find all Place nodes with category Food or Bar
Would we just tack on another MATCH statement? If not, what is the most performant route to take here?
Extra:
Is there a tool to help me describe the traversal process and tell me the best method to choose?
If I understand your domain correctly, I would recommend making your Categorys into nodes themselves.
MERGE (:Category {name:"Food"})
MERGE (:Category {name:"Bar"})
MERGE (:Category {name:"Park"})
And connecting each Place node to the Categorys it belongs to.
MERGE (:Place {name:"Central Park"})-[:IS_A]->(:Category {name:"Park"})
MERGE (:Place {name:"Joe's Diner"})-[:IS_A]->(:Category {name:"Food"})
MERGE (:Place {name:"Joe's Diner"})-[:IS_A]->(:Category {name:"Bar"})
Then, if you want to find Places that belong to a Category, it can be pretty quick. Start by matching the category, then branch out to the places related to the category.
MATCH (c:Category {name:"Bar"}), (c)<-[:IS_A]-(p:Place)
RETURN p
You'll have a relatively limited number of categories, so matching the category will be quick. Then, because of the way Neo4j actually stores data, it will be fast to find all the places related to that category.
More Complexity
Finding places within multiple categories will be easy as well.
MATCH (c:Category) WHERE c.name = "Bar" OR c.name = "Food", (c)<-[:IS_A]-(p:Place)
RETURN p
Again, you just match the categories first (fast because there aren't many of them), then branch out to the connected places.
Use an Index
If you want fast, you need to use indexes where it makes sense. In this example, I would use an index on the category's name property.
CREATE INDEX ON :Category(name)
Or better yet, use a uniqueness constraint on the category names, which will index them and prevent duplicates.
CREATE CONSTRAINT ON (c:Category) ASSERT c.name IS UNIQUE
Indexes (and uniqueness) make a big difference on the speed of your queries.
Why this is fastest
Neo4j stores nodes and relationships in a very compact, quick-to-access format. Once you have a node or relationship, getting the adjacent relationships or nodes is very fast. However, it stores each node's (and relationship's) properties separately, meaning that looking through properties is relatively slow.
The goal is to get to a starting node as quickly as possible. Once there, traversing related entities is quick. If you only have 1,000 categories, but you have a billion places, it will be faster to pick out an individual Category than an individual Place. Once you have that starting node, getting to related nodes will be very efficient.
The Other Options
Just to reinforce, this is what makes your other options slower or otherwise worse.
In your first example, you are looking through properties on each node to look for the match. Property lookup is slow and you are doing it a billion times. An index can help with this, but it's still a lot of work. Additionally, you are effectively duplicating the category data over each of you billion places, and not taking advantage of Neo4j's strengths.
In all your other examples, your data models seem odd. "Food", "Bar", "Park", etc. are all instances of categories, not separate types. They should each be their own node, but they should all have the Category label, because that's what they are. In addition, categories are things, and thus they should be nodes. A relationship describes the connection between things. It does not make sense to use categories in this way.
I hope this helps!

Neo4j labels, relationship types, and cypher matching performance

Say I have a massive graph of users and other types of nodes. Each type has a label, some may have multiple labels. Since I am defining users and their access to nodes, there is one relationship type between users and nodes: CAN_ACCESS. Between other objects, there are different relationship types, but for the purpose of access control, everything involves a CAN_ACCESS relationship when we start from a user.
I never perform a match without using labels, so my intention and hope is that any performance downsides to having one heavily-used relationship type from my User nodes should be negated by matching a label. Obviously, this match could get messy:
MATCH (n:`User`)-[r1:`CAN_ACCESS`]->(n2)
But I'd never do that. I'd do this:
MATCH (n:`User`)-[r1:`CAN_ACCESS`]->(n2:`LabelX`)
My question, then is whether the use of labels on the destination side of the match is effectively equivalent to having a dedicated relationship type between a User and any given label. In other words, does this:
MATCH (n:`User`)-[r1:`CAN_ACCESS`]->(n2:`LabelX`)
Give me the same performance as this:
MATCH (n:`User`)-[r1:`CAN_ACCESS_LABEL_X`]->(n2)
If CAN_ACCESS_LABEL_X ALWAYS goes (n:`User`)-->(n:`LabelX`)?
As pointed out by Michael Hunger's comment, Mark Needham's blog post here demonstrates that performance is best when you use a dedicated relationship type instead of relying on labels.

Neo4J cypher efficient comparison of optional properties or relationships to collections

Alice loves kittens and puppies and will only go to places where they are both present. She likes guppies and goldfish and insists on at least one of them being present. She hates echidnas though, and will not show up anywhere that hosts spiny things. Bob likes kittens and hamsters, but has no strong feelings regarding other animals, piscine, spiny or otherwise.
CREATE (alice:PERSON {name: 'Alice', loves: ['kittens','puppies'], likes: ['guppies','goldfish'], hates:['echidnas']})
CREATE (bob:PERSON {name: 'Bob', likes: ['kittens','hamsters']})
I'm throwing a party where I will have kittens, puppies, guppies and manatees. Who will come? Here's how I'm asking.
MATCH (a:PERSON)
WHERE
(NOT HAS (a.loves) OR (LENGTH(FILTER(love IN a.loves WHERE love IN ['kittens','puppies','guppies','manatees'])) = LENGTH(a.loves)))
AND
(NOT HAS (a.likes) OR (LENGTH(FILTER(like IN a.likes WHERE like IN ['kittens','puppies','guppies','manatees'])) > 0))
AND
(NOT HAS (a.hates) OR (LENGTH(FILTER(hate IN a.hates WHERE hate IN ['kittens','puppies','guppies','manatees'])) = 0))
RETURN a.name
Huzzah, Alice and Bob are both OK with that.
However, is that the smartest way to do it in Cypher?
This is of course a toy example: there will also be a shape in the MATCH and other filtering conditions.
However, my focus is on every person in the shape optionally having none, one two or three set of things[*], one of which contains elements which must ALL be matched by the elements of a provided collection, one of which ANY must be matched, and one which contains elements of which NONE must be matched by anything in the (common) provided collection. Implementing this is a core requirement.
[*] I say "things" rather than "properties", because I'm not averse to modelling my animals as nodes, and linking my people to them. Like this, for Carol and Dan who conveniently share the same taste in animals as Alice and Bob.
CREATE (carol:PERSON {name: 'Carol'})-[:LOVES]->(kittens {name:'kittens'}),
(carol)-[:LOVES]->(puppies {name:'puppies'}),
(carol)-[:LIKES]->({name:'guppies'}),
(carol)-[:LIKES]->({name:'goldfish'}),
(carol)-[:HATES]->({name:'echidnas'}),
(dan:PERSON {name: 'Dan'})-[:LIKES]->(kittens),
(dan)-[:LIKES]->({name:'hamsters'})
MATCH (a:PERSON), (a)-[:LOVES]->(a_loves), (a)-[:LIKES]->(a_likes), (a)-[:HATES]->(a_hates)
WHERE
ALL (loves_name IN a_loves.name WHERE loves_name IN ['kittens','puppies','guppies','manatees'])
AND
ANY (likes_name IN a_likes.name WHERE likes_name IN ['kittens','puppies','guppies','manatees'])
AND
NONE (hates_name IN a_hates.name WHERE hates_name IN ['kittens','puppies','guppies','manatees'])
RETURN a.name
This works for each person who does love, like and hate at least one animal, i.e. for Carol. However, it does not work when a person does not have all three of LOVES and LIKES and HATES relationships, as that shape isn't found in the graph, and so it does not find Dan. I cannot see an obvious way to make OPTIONAL MATCH perform that kind of pruning.
To get around that issue, I can add fake animal nodes and give each and every PERSON relationships to them, thusly: -[:LOVES]->(unicorns), -[:LIKES]->(manticores) -[:HATES]->(basilisks) and then always add 'unicorns' to the collection being compared against the LOVES node. However, that feels very contrived and clunky.
Long post short, what would be your favoured way to model this in Neo4J and Cypher, and why?
Here's a first cut http://gist.neo4j.org/?8932364
Do take a look at let me know if first of all, the problem is understood and second, if the query fits.
I'd like to come back later and improve on that query, it was put together quickly.

Resources