Neo4J cypher efficient comparison of optional properties or relationships to collections - neo4j

Alice loves kittens and puppies and will only go to places where they are both present. She likes guppies and goldfish and insists on at least one of them being present. She hates echidnas though, and will not show up anywhere that hosts spiny things. Bob likes kittens and hamsters, but has no strong feelings regarding other animals, piscine, spiny or otherwise.
CREATE (alice:PERSON {name: 'Alice', loves: ['kittens','puppies'], likes: ['guppies','goldfish'], hates:['echidnas']})
CREATE (bob:PERSON {name: 'Bob', likes: ['kittens','hamsters']})
I'm throwing a party where I will have kittens, puppies, guppies and manatees. Who will come? Here's how I'm asking.
MATCH (a:PERSON)
WHERE
(NOT HAS (a.loves) OR (LENGTH(FILTER(love IN a.loves WHERE love IN ['kittens','puppies','guppies','manatees'])) = LENGTH(a.loves)))
AND
(NOT HAS (a.likes) OR (LENGTH(FILTER(like IN a.likes WHERE like IN ['kittens','puppies','guppies','manatees'])) > 0))
AND
(NOT HAS (a.hates) OR (LENGTH(FILTER(hate IN a.hates WHERE hate IN ['kittens','puppies','guppies','manatees'])) = 0))
RETURN a.name
Huzzah, Alice and Bob are both OK with that.
However, is that the smartest way to do it in Cypher?
This is of course a toy example: there will also be a shape in the MATCH and other filtering conditions.
However, my focus is on every person in the shape optionally having none, one two or three set of things[*], one of which contains elements which must ALL be matched by the elements of a provided collection, one of which ANY must be matched, and one which contains elements of which NONE must be matched by anything in the (common) provided collection. Implementing this is a core requirement.
[*] I say "things" rather than "properties", because I'm not averse to modelling my animals as nodes, and linking my people to them. Like this, for Carol and Dan who conveniently share the same taste in animals as Alice and Bob.
CREATE (carol:PERSON {name: 'Carol'})-[:LOVES]->(kittens {name:'kittens'}),
(carol)-[:LOVES]->(puppies {name:'puppies'}),
(carol)-[:LIKES]->({name:'guppies'}),
(carol)-[:LIKES]->({name:'goldfish'}),
(carol)-[:HATES]->({name:'echidnas'}),
(dan:PERSON {name: 'Dan'})-[:LIKES]->(kittens),
(dan)-[:LIKES]->({name:'hamsters'})
MATCH (a:PERSON), (a)-[:LOVES]->(a_loves), (a)-[:LIKES]->(a_likes), (a)-[:HATES]->(a_hates)
WHERE
ALL (loves_name IN a_loves.name WHERE loves_name IN ['kittens','puppies','guppies','manatees'])
AND
ANY (likes_name IN a_likes.name WHERE likes_name IN ['kittens','puppies','guppies','manatees'])
AND
NONE (hates_name IN a_hates.name WHERE hates_name IN ['kittens','puppies','guppies','manatees'])
RETURN a.name
This works for each person who does love, like and hate at least one animal, i.e. for Carol. However, it does not work when a person does not have all three of LOVES and LIKES and HATES relationships, as that shape isn't found in the graph, and so it does not find Dan. I cannot see an obvious way to make OPTIONAL MATCH perform that kind of pruning.
To get around that issue, I can add fake animal nodes and give each and every PERSON relationships to them, thusly: -[:LOVES]->(unicorns), -[:LIKES]->(manticores) -[:HATES]->(basilisks) and then always add 'unicorns' to the collection being compared against the LOVES node. However, that feels very contrived and clunky.
Long post short, what would be your favoured way to model this in Neo4J and Cypher, and why?

Here's a first cut http://gist.neo4j.org/?8932364
Do take a look at let me know if first of all, the problem is understood and second, if the query fits.
I'd like to come back later and improve on that query, it was put together quickly.

Related

How can I mitigate having bidirectional relationships in a family tree, in Neo4j?

I am running into this wall regarding bidirectional relationships.
Say I am attempting to create a graph that represents a family tree. The problem here is that:
* Timmy can be Suzie's brother, but
* Suzie can not be Timmy's brother.
Thus, it becomes necessary to model this in 2 directions:
(Sure, technically I could say SIBLING_TO and leave only one edge...what I'm not sure what the vocabulary is when I try to connect a grandma to a grandson.)
When it's all said and done, I pretty sure there's no way around the fact that the direction matters in this example.
I was reading this blog post, regarding common Neo4j mistakes. The author states that this bidirectionality is not the most efficient way to model data in Neo4j and should be avoided.
And I am starting to agree. I set up a mock set of 2 families:
and I found that a lot of queries I was attempting to run were going very, very slow. This is because of the 'all connected to all' nature of the graph, at least within each respective family.
My question is this:
1) Am I correct to say that bidirectionality is not ideal?
2) If so, is my example of a family tree representable in any other way...and what is the 'best practice' in the many situations where my problem may occur?
3) If it is not possible to represent the family tree in another way, is it technically possible to still write queries in some manner that gets around the problem of 1) ?
Thanks for reading this and for your thoughts.
Storing redundant information (your bidirectional relationships) in a DB is never a good idea. Here is a better way to represent a family tree.
To indicate "siblingness", you only need a single relationship type, say SIBLING_OF, and you only need to have a single such relationship between 2 sibling nodes.
To indicate ancestry, you only need a single relationship type, say CHILD_OF, and you only need to have a single such relationship between a child to each of its parents.
You should also have a node label for each person, say Person. And each person should have a unique ID property (say, id), and some sort of property indicating gender (say, a boolean isMale).
With this very simple data model, here are some sample queries:
To find Person 123's sisters (note that the pattern does not specify a relationship direction):
MATCH (p:Person {id: 123})-[:SIBLING_OF]-(sister:Person {isMale: false})
RETURN sister;
To find Person 123's grandfathers (note that this pattern specifies that matching paths must have a depth of 2):
MATCH (p:Person {id: 123})-[:CHILD_OF*2..2]->(gf:Person {isMale: true})
RETURN gf;
To find Person 123's great-grandchildren:
MATCH (p:Person {id: 123})<-[:CHILD_OF*3..3]-(ggc:Person)
RETURN ggc;
To find Person 123's maternal uncles:
MATCH (p:Person {id: 123})-[:CHILD_OF]->(:Person {isMale: false})-[:SIBLING_OF]-(maternalUncle:Person {isMale: true})
RETURN maternalUncle;
I'm not sure if you are aware that it's possible to query bidirectionally (that is, to ignore the direction). So you can do:
MATCH (a)-[:SIBLING_OF]-(b)
and since I'm not matching a direction it will match both ways. This is how I would suggest modeling things.
Generally you only want to make multiple relationships if you actually want to store different state. For example a KNOWS relationship could only apply one way because person A might know person B, but B might not know A. Similarly, you might have a LIKES relationship with a value property showing how much A like B, and there might be different strengths of "liking" in the two directions

In Neo4j for every disjoint subgraph return the node with the most relationships

I’m new to Neo4j and graph theory and I’m trying to figure out if I can use Neo4j to solve a problem I have. Please correct me if I’m using the wrong words to describe stuff. Since I’m new to the subject I haven’t really wrapped my head around what to call everything.
I think the easiest way to describe my problem is with a lot of pictures.
Let’s say you have two disjoint subgraphs that look like this.
From the subgraphs above I want to get a list of subgraphs that fulfills one of two criteria.
Criteria 1.
If a node has a unique relationship to another node, the nodes and relationship should be returned as a subgraph.
Criteria 2.
If the relations are not unique, I'd like the node with the most relationships to be returned, as a subgraph with its relationships and related nodes.
If other nodes come in tie in criteria 2, I want all subgraphs to be returned.
Or put in the context of this graph,
Give me the people who have unique games, and if there are other people having the same games, give me back the person with the most games. If they come in tie, return all people who come in tie.
Or actually, return the whole subgraph, not only the person.
To clarify what I am after here is a picture that describes the result I want to get. The ordering of the result is not important.
Disjoint subgraph A, because of Criteria 1, Andrew is the only person who has Bubble Bobble.
Disjoint subgraph B, because of Criteria 1, Johan is the only person who has Puzzle Bobble 1.
Disjoint subgraph C, because of Criteria 2, Julia since she has the most games.
Disjoint subgraph D, because of Criteria 2, Anna since she comes in tie with Julia having the most games.
Worth noting is that Johan's relationship to Puzzle Bobble 2 is not returned because it's not unique and he has not the most games.
Is this a problem you could solve with only Neo4j and is it a good idea?
If you could solve it how would you do it in Cypher?
Create script:
CREATE (p1:Person {name:"Johan"}),
(p2:Person {name:"Julia"}),
(p3:Person {name:"Anna"}),
(p4:Person {name:"Andrew"}),
(v1:Videogame {name:"Puzzle Bobble 1"}),
(v2:Videogame {name:"Puzzle Bobble 2"}),
(v3:Videogame {name:"Puzzle Bobble 3"}),
(v4:Videogame {name:"Puzzle Bobble 4"}),
(v5:Videogame {name:"Bubble Bobble"}),
(p1)-[:HAS]->(v1),
(p1)-[:HAS]->(v2),
(p2)-[:HAS]->(v2),
(p2)-[:HAS]->(v3),
(p2)-[:HAS]->(v4),
(p3)-[:HAS]->(v2),
(p3)-[:HAS]->(v3),
(p3)-[:HAS]->(v4),
(p4)-[:HAS]->(v5)
I feel like this solution might not be quite what you're looking for, but it could be a good start:
MATCH (game:Videogame)<-[:HAS]-(owner:Person)
OPTIONAL MATCH owner-[:HAS]->(other_game:Videogame)
WITH game, owner, count(other_game) AS other_game_count
ORDER BY other_game_count DESC
RETURN game, collect(owner)[0]
Here the query:
Finds all of the games and their owners (games without owners will not be matched)
Does an OPTIONAL MATCH against any other games those owners might own (by doing an optional match we're saying that it's OK if they own zero)
Pass through each game/owner pair along with a count of the number of other games owned by that owner, sorting so that those with the most games come first
RETURN the first owner for each game (the ORDER is preserved when doing the collect)

Simple recursive CYPHER query

This is an extremely simple question, but reading through the docs for the first time, I can't figure out how to construct this query. Let's say I have a graph that looks like:
and additionally each person has an age associated with them.
What CYPHER query will get me a list of John's age and all the ages of the entire friend tree of John?
What I've tried so far:
MATCH (start)-[:friend]>(others)
WHERE start.name="John"
RETURN start.age, others.age
This has several problems,
It only goes one one one friend deep and I'd like to go to all friends of John.
It doesn't return a list, but a series of (john.age, other.age).
So what you need is not only friend of john, but also friends of friends. So you need to tell neo4j to recursively go deep in the graph.
This query goes 2 level deep.
MATCH (john:Person { name:"John" })-[:friend *1..2]->(friend: Person)
RETURN friend.name, friend.age;
To go n nevel deep, don't put anything i.e. *1..
Oh and I also found this nice example in neo4j
So what does *1..2 here means:
* to denote that its a recursion.
1 to denote that do not include john itself i.e. is the start node. If you put 0 here, it will also include the John node itself.
.. to denote that go from this node till ...
2 denotes the level of recursion. Here you say to stop at level 2. i.e. dont go beyond steve. If you put nothing there, it will keep on going until it cannot find a node that "has" a friend relationship
Documentation for these types of query match is here and a similar answer here.

What is the most performant way to create the following MATCH statement and why?

The question:
What is the most performant way to create the following MATCH statement and why?
The detailed problem:
Let's say we have a Place node with a variable amount of properties and need to look up nodes from potentially billions of nodes by it's category. I'm trying to wrap my head around the performance of each query and it's proving to be quite difficult.
The possible queries:
Match Place node using a property lookup:
MATCH (entity:Place { category: "Food" })
Match Place node with isCategory relationship to Food node:
MATCH (entity:Place)-[:isCategory]->(category:Food)
Match Place node with Food relationship to Category node:
MATCH (entity)-[category:Food]->(:Category)
Match Food node with isCategoryFor relationship to Place node:
MATCH (category:Food)-[:isCategoryFor]->(entity:place)
And obviously all the variations in between. With relationship directions going the other way as well.
More complexity:
Let's throw in a little more complexity and say we now need to find all Place nodes using multiple categories. For example: Find all Place nodes with category Food or Bar
Would we just tack on another MATCH statement? If not, what is the most performant route to take here?
Extra:
Is there a tool to help me describe the traversal process and tell me the best method to choose?
If I understand your domain correctly, I would recommend making your Categorys into nodes themselves.
MERGE (:Category {name:"Food"})
MERGE (:Category {name:"Bar"})
MERGE (:Category {name:"Park"})
And connecting each Place node to the Categorys it belongs to.
MERGE (:Place {name:"Central Park"})-[:IS_A]->(:Category {name:"Park"})
MERGE (:Place {name:"Joe's Diner"})-[:IS_A]->(:Category {name:"Food"})
MERGE (:Place {name:"Joe's Diner"})-[:IS_A]->(:Category {name:"Bar"})
Then, if you want to find Places that belong to a Category, it can be pretty quick. Start by matching the category, then branch out to the places related to the category.
MATCH (c:Category {name:"Bar"}), (c)<-[:IS_A]-(p:Place)
RETURN p
You'll have a relatively limited number of categories, so matching the category will be quick. Then, because of the way Neo4j actually stores data, it will be fast to find all the places related to that category.
More Complexity
Finding places within multiple categories will be easy as well.
MATCH (c:Category) WHERE c.name = "Bar" OR c.name = "Food", (c)<-[:IS_A]-(p:Place)
RETURN p
Again, you just match the categories first (fast because there aren't many of them), then branch out to the connected places.
Use an Index
If you want fast, you need to use indexes where it makes sense. In this example, I would use an index on the category's name property.
CREATE INDEX ON :Category(name)
Or better yet, use a uniqueness constraint on the category names, which will index them and prevent duplicates.
CREATE CONSTRAINT ON (c:Category) ASSERT c.name IS UNIQUE
Indexes (and uniqueness) make a big difference on the speed of your queries.
Why this is fastest
Neo4j stores nodes and relationships in a very compact, quick-to-access format. Once you have a node or relationship, getting the adjacent relationships or nodes is very fast. However, it stores each node's (and relationship's) properties separately, meaning that looking through properties is relatively slow.
The goal is to get to a starting node as quickly as possible. Once there, traversing related entities is quick. If you only have 1,000 categories, but you have a billion places, it will be faster to pick out an individual Category than an individual Place. Once you have that starting node, getting to related nodes will be very efficient.
The Other Options
Just to reinforce, this is what makes your other options slower or otherwise worse.
In your first example, you are looking through properties on each node to look for the match. Property lookup is slow and you are doing it a billion times. An index can help with this, but it's still a lot of work. Additionally, you are effectively duplicating the category data over each of you billion places, and not taking advantage of Neo4j's strengths.
In all your other examples, your data models seem odd. "Food", "Bar", "Park", etc. are all instances of categories, not separate types. They should each be their own node, but they should all have the Category label, because that's what they are. In addition, categories are things, and thus they should be nodes. A relationship describes the connection between things. It does not make sense to use categories in this way.
I hope this helps!

how to store nodes in sequence?

I am using neo4j 2.1.5 and tying to model a scenario where nodes are video-links and they fall under multiple categories and subcategories, however these nodes will also form a ordered sequence under a particular sequence-category.
e.g;
Sequence 1: (node 1)-[:relatedTo]->(node 2)->[:relatedTo]->(node 3)-[:relatedTo]->(node 4)
Sequence 2: (node 2)-[:someRel]->(node 8)->[:someRel]->(node 4)
What should be the best way to model this?
Note: more video nodes will be added in a sequence-category between the nodes.
The way that you've done this looks fine to me. A set of nodes, with relationships between them, essentially forming a queue. What you're doing here is basically using neo4j to model a linked list.
Linkage to the category is probably where it's going to get interesting. With linked lists, running through them is easy, but random access (jumping into the middle) is hard. Now, if the order were always fixed, you could just add a seq attribute to your relationships, and label them numerically (0, 1, 2...). You could then jump in anywhere by querying for the appropriate seq value. BUT, if you're saying you want to do dynamic insertions into the list, this is probably a terrible idea. If you create a new #3, that means you have to increment all subsequent seq numbers. So that's probably not going to work.
Probably just a Category node that is always guaranteed to link to the first item in the sequence is a good way to go, like this:
(c:Category {name:"Horror"})-[:contains]->(v1:Video {name: "Texas Chainsaw Massacre"})-[:next]->(v2:Video { name: "Hellraiser"});
You can do as many :next relationships as you want. To insert something new, you might do this:
MATCH (v1:Video {name: "Texas Chainsaw Massacre"})-[r:next]->(v2:Video {name: "Hellraiser"})
CREATE v1-[:next]->(v3:Video {name: "New Thing I'm inserting"}-[:next]->v2
DELETE r;
An alternative is that you could link the category to every video (not just the first). Your videos would still have ordering, and you'd be able to jump into them with random access by video name, but you wouldn't know where the head of your list was anymore.

Resources