I am using neo4j 2.1.5 and tying to model a scenario where nodes are video-links and they fall under multiple categories and subcategories, however these nodes will also form a ordered sequence under a particular sequence-category.
e.g;
Sequence 1: (node 1)-[:relatedTo]->(node 2)->[:relatedTo]->(node 3)-[:relatedTo]->(node 4)
Sequence 2: (node 2)-[:someRel]->(node 8)->[:someRel]->(node 4)
What should be the best way to model this?
Note: more video nodes will be added in a sequence-category between the nodes.
The way that you've done this looks fine to me. A set of nodes, with relationships between them, essentially forming a queue. What you're doing here is basically using neo4j to model a linked list.
Linkage to the category is probably where it's going to get interesting. With linked lists, running through them is easy, but random access (jumping into the middle) is hard. Now, if the order were always fixed, you could just add a seq attribute to your relationships, and label them numerically (0, 1, 2...). You could then jump in anywhere by querying for the appropriate seq value. BUT, if you're saying you want to do dynamic insertions into the list, this is probably a terrible idea. If you create a new #3, that means you have to increment all subsequent seq numbers. So that's probably not going to work.
Probably just a Category node that is always guaranteed to link to the first item in the sequence is a good way to go, like this:
(c:Category {name:"Horror"})-[:contains]->(v1:Video {name: "Texas Chainsaw Massacre"})-[:next]->(v2:Video { name: "Hellraiser"});
You can do as many :next relationships as you want. To insert something new, you might do this:
MATCH (v1:Video {name: "Texas Chainsaw Massacre"})-[r:next]->(v2:Video {name: "Hellraiser"})
CREATE v1-[:next]->(v3:Video {name: "New Thing I'm inserting"}-[:next]->v2
DELETE r;
An alternative is that you could link the category to every video (not just the first). Your videos would still have ordering, and you'd be able to jump into them with random access by video name, but you wouldn't know where the head of your list was anymore.
Related
Suppose you've got two nodes that represent the same thing, and you want to merge those two nodes. Both nodes can have any number of relations with other nodes.
The basics are fairly easy, and would look something like this:
MATCH (a), (b) WHERE a.id == b.id
MATCH (b)-[r]->()
CREATE (a)-[s]->()
SET s = PROPERTIES(r)
DELETE DETACH b
Only I can't create a relation without a type. And Cypher doesn't support variable labels either. I'd love to be able to do something like
CREATE (a)-[s:{LABELS(r)}]->(o)
but that doesn't work. To create the relation, you need to know the type of the relation, and in this case I really don't.
Is there a way to dynamically assign types to relationships, or am I going to have to query the types of the old relation, and then string concat new queries with the proper types? That's not impossible, but a lot slower and more complex. And this could potentially match a lot of elements and even more relationships, so having to generate a separate query for every instance is going to slow things down quite a lot.
Or is there a way to change the target of the old relationship? That would probably be the fastest, but I'm not aware of any way to do that.
I think you need to take a look at APOC, especially apoc.create.relationship which enable creating relationships with dynamic type.
Adapting your example, you should end up with something along the line of (not tested):
MATCH (a), (b) WHERE a.id == b.id
MATCH (b)-[r]->(n)
CALL apoc.create.relationship(a, type(r), properties(r), n)
DETACH DELETE b
NB
relationships have TYPE and not label
the proper cypher statement to delete relationships attached to a node and the node itself is DETACH DELETE (and not DELETE DETACH)
Related resource: https://markhneedham.com/blog/2016/10/30/neo4j-create-dynamic-relationship-type/
The APOC procedure apoc.refactor.mergeNodes should be very helpful. That procedure is very powerful, and you need to read the documentation to understand how to configure it to do what you want in your specific situation.
Here is a simple example that shows how to use the procedure's default configuration to merge nodes with the same id:
MATCH (node:Foo)
WITH node.id AS id, COLLECT(node) AS nodes
WHERE SIZE(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes, {}) YIELD node
RETURN node
In this example, I specified an arbitrary Foo label to avoid accidentally merging unwanted nodes. Doing so also helps to speed up the query if you have a lot of nodes with other labels (since they will not need to be scanned for the id property).
The aggregating function COLLECT is used to collect a list of all the nodes with the same id. After checking the size of the list, it is passed to the procedure.
I am new to Cypher and I am trying to learn it through a small project I am trying to set up.
I have the following data model so far:
For every Thought created, I connect Tags through Categories.
The Categories only serve as intermediate between the Tags and Thoughts, this is done to improve querying, prevent Tag duplication and reduce relationships between the objects.
To prevent creation of new Tags with the same value, I thought of the following query:
CREATE (t: Thought {moment:timestamp(), message:'Testing new Thought'})
MERGE (t1: Tag{value: 'work'})
MERGE (t2: Tag{value: 'tasks'})
MERGE (t3: Tag{value: 'administration'})
MERGE (c: Category)
MERGE (t1)<-[u:CONSISTS_OF{index:0}]-(c)
MERGE (t2)<-[v:CONSISTS_OF{index:1}]-(c)
MERGE (t3)<-[w:CONSISTS_OF{index:2}]-(c)
MERGE (t)-[x:CATEGORIZED_AS{index: 0}]->(c)
This works fine, except for one thing: the Thought receives a relationship with all created Categories.
This I understand, I define no restrictions in the MERGE query.
However, I do not know how to apply restrictions to the CATEGORIZED_AS relationship?
I tried to add this to the bottom of the query, but that does not work:
WHERE (t)-[x]->(c)
Any idea how to apply a restriction like I need in my case?
EDIT:
I forgot to mention the unique connection of a Category:
A category is connect to a fixed set of Tags in a specific order.
E.g I have three tags:
work
tasks
administration
The only way the Category matches the Thought is if the Category has the following relationships with the Tags:
work <-[:CONSISTS_OF {index:0}]-(category)
tasks <-[:CONSISTS_OF {index:1}]-(category)
administration <-[:CONSISTS_OF {index:2}]-(category)
Any other order of relationships is invalid and a new Category should be created.
The Problem: Use of MERGE
MERGE will try and find a pattern in the graph, if it finds the pattern it will return it, else it will try and create the entire pattern. This works individually for each MERGE clause. So, this works great and as expected for (n:Tag) nodes, since you only want one tag for each word in the graph, but the issue comes with the later in your query when you try to merge a category.
What you want to do is try and find this (c:Category) that is connected to these three (t:Tag) nodes with these r.index properties on the relationship (:Tag)-[r:CONSISTS_OF]-(). However, you're running four merge clauses which do the following:
MERGE (c: Category)
Find or create any node c with the label `Category.
MERGE (t1)<-[u:CONSISTS_OF{index:0}]-(c)
MERGE (t2)<-[v:CONSISTS_OF{index:1}]-(c)
MERGE (t3)<-[w:CONSISTS_OF{index:2}]-(c)
Find or Create a relationship between that node and t1, then t2, t3 etc.
If you were to run that query, and then change one of the tags to something different like "rest", and run the query again, you'd expect a new category to appear. But it won't with the current query, it'll simply create a new tag, then find the existing (c:Category) node in that first MERGE clause, and create a relationship between it and the new tag. So, rather than having two categories each linked to three tags (with two tags being shared), you'll just have four tags all linked to one category with duplicate indexes on your relationships.
So, what you actually want to do is use MERGE to find the complex pattern like below.
MERGE (t1)<-[:CONSISTS_OF {index:0}]-(c:Category)-[:CONSISTS_OF {index:1}]->(t2),
(t3)<-[:CONSISTS_OF {index:2}]-(c)
Annoyingly, that will give you a syntax error, as cypher can't currently merge complex patterns like that. So, here comes the creative bit.
Solution 1: Conditional Execution with CASE and FOREACH (Easy)
This is quite a handy goto for these kinds of situation, see the commented query below. You'll essentially split the merge up, use OPTIONAL MATCH to try and find the pattern, and then use a little trick in cypher syntax to CREATE the pattern if we find it doesn't exist.
CREATE (t: Thought {moment:timestamp(), message:'Testing new Thought'})
MERGE (t1:Tag{value: 'work'})
MERGE (t2:Tag{value: 'abayo'})
MERGE (t3:Tag{value: 'rest'})
WITH *
// we can't merge this category because it's a complex pattern
// so, can we find it in the db?
OPTIONAL MATCH (t1)<-[:CONSISTS_OF {index:0}]-(c:Category)-[:CONSISTS_OF {index:1}]->(t2),
(t3)<-[:CONSISTS_OF {index:2}]-(c)
// the CASE here works in conjunction with the foreach to
// conditionally execute the create clause
WITH t, t1, t2, t3, c, CASE c WHEN NULL THEN [1] ELSE [] END AS make_cat
FOREACH (i IN make_cat |
// if no such category exists, this code will run as c is null
// if a category does exist, c will not be null, and so this won't run
CREATE (t1)<-[:CONSISTS_OF {index:0}]-(new_cat:Category)-[:CONSISTS_OF {index:1}]->(t2),
(t3)<-[:CONSISTS_OF {index:2}]-(new_cat)
)
// now we're not sure if we're referring to new_cat or cat
// remove variable c from scope
WITH t, t1, t2, t3
// and now match it, we know for sure now we'll find it
// alternatively, use conditional execution again here
MATCH (t1)<-[:CONSISTS_OF]-(c:Category)-[:CONSISTS_OF]->(t2),
(t3)<-[:CONSISTS_OF]-(c)
// now we have the category, we definitely want
// to create the relationship between the thought and the category
CREATE (t)-[:CATEGORIZED_AS]->(c)
RETURN *
Solution 2: Refactor Your Graph (Hard)
I haven't included a query here - although I can do if requested - but an alternative would be to refactor your graph to attach tags to categories in a ring (or chain - with a final member marker) structure, so that you can merge the pattern straight away without having to split it up.
Since the categories are in an order, you could express the data like the below, in one MERGE clause.
MERGE (c:Category)-[:CONSISTS_OF_TAG_SEQUENCE]->(t1)-[:NEXT_TAG_IN_SEQUENCE]->(t2)-[:NEXT_TAG_IN_SEQUENCE]->(t3)-[:NEXT_TAG_IN_SEQUENCE]->(c)
This might seem like a neat solution at first, but the problem is, that since tags will belong to multiple categories, if tags are shared between categories you will need to either:
create a composite index to identify categories and store this as a property of the sequential relationships so you know which relationships to follow in your path (i.e., so you can always find one, and only one, sequence of tags for a category)
still link each tag to the categories it is in and query on this pattern (to allow you to find that single path like in #1)
Use an intermediate node to achieve the same as 1 and 2
All of the above and more.
As you might have guessed, this will make your query much more complicated than it needs to be quite quickly. It could be fun to try, and may suit some use cases, but for the time being I'd stick with the easy solution!
My solution to your problem, is to enforce that every Category has a unique, consistently reproducible id. In your case, add a cid or id field, where the value is something along the lines of tag1<_>tag2<_>tag3<_>. (<_> is used because the chances of that being part of a tag are zero. If _ is an invalid tag character replacing <_> with _ will do just fine).
This way you can lock onto a category node without having to know anything about the nodes it is attached to. Essentially, the unique id IS your merge logic. This can even be dynamicly built up in Cypher using reduce. I usually also have a value field as a "pretty print display id value".
When running the final Cypher, you would Merge on each node alone by instance id, use Set for non node-defining fields, then use Create Unique to make make sure there was one and only one relation between the nodes.
I have a simple model of a chess tournament. It has 5 players playing each other. The graph looks like this:
The graph is generally fine, but upon further inspection, you can see that both sets
Guy1 vs Guy2,
and
Guy4 vs Guy5
have a redundant relationship each.
The problem is obviously in the data, where there is a extraneous complementary row for each of these matches (so in a sense this is a data quality issue in the underlying csv):
I could clean these rows by hand, but the real dataset has millions of rows. So I'm wondering how I could remove these relationships in either of 2 ways, using CQL:
1) Don't read in the extra relationship in the first place
2) Go ahead and create the extra relationship, but then remove it later.
Thanks in advance for any advice on this.
The code I'm using is this:
/ Here, we load and create nodes
LOAD CSV WITH HEADERS FROM
'file:///.../chess_nodes.csv' AS line
WITH line
MERGE (p:Player {
player_id: line.player_id
})
ON CREATE SET p.name = line.name
ON MATCH SET p.name = line.name
ON CREATE SET p.residence = line.residence
ON MATCH SET p.residence = line.residence
// Here create the edges
LOAD CSV WITH HEADERS FROM
'file:///.../chess_edges.csv' AS line
WITH line
MATCH (p1:Player {player_id: line.player1_id})
WITH p1, line
OPTIONAL MATCH (p2:Player {player_id: line.player2_id})
WITH p1, p2, line
MERGE (p1)-[:VERSUS]->(p2)
It is obvious that you don't need this extra relationship as it doesn't add any value nor weight to the graph.
There is something that few people are aware of, despite being in the documentation.
MERGE can be used on undirected relationships, neo4j will pick one direction for you (as realtionships MUST be directed in the graph).
Documentation reference : http://neo4j.com/docs/stable/query-merge.html#merge-merge-on-an-undirected-relationship
An example with the following statement, if you run it for the first time :
MATCH (a:User {name:'A'}), (b:User {name:'B'})
MERGE (a)-[:VERSUS]-(b)
It will create the relationship as it doesn't exist. However if you run it a second time, nothing will be changed nor created.
I guess it would solve your problem as you will not have to worry about cleaning the data in upfront nor run scripts afterwards for cleaning your graph.
I'd suggest creating a "match" node like so
(x:Player)-[:MATCH]->(m:Match)<-[:MATCH]-(y:Player)
to enable tracking details about the match separate from the players.
If you need to track player matchups distinct from the matches themselves, then
(x:Player)-[:HAS_PLAYED]->(pair:HasPlayed)<-[:HAS_PLAYED]-(y:Player)
would do the trick.
If the schema has to stay as-is and the only requirement is to remove redundant relationships, then
MATCH (p1:Player)-[r1:VERSUS]->(p2:Player)-[r2:VERSUS]->(p1)
DELETE r2
should do the trick. This finds all p1, p2 nodes with bi-directional VERSUS relationships and removes one of them.
You need to use UNWIND to do the trick.
MATCH (p1:Player)-[r:VERSUS]-(p2:Player)
WITH p1,p2,collect(r) AS rels
UNWIND tail(rels) as rel
DELETE rel;
THe previous code will find the direct connections of type VERSUS between p1 and p2 using match (note that this is not directed). Then will get the collection of relationships and finally the last of those relations, which is deleted.
Of course you can add a check to see whether the length of the collection is 2.
The question:
What is the most performant way to create the following MATCH statement and why?
The detailed problem:
Let's say we have a Place node with a variable amount of properties and need to look up nodes from potentially billions of nodes by it's category. I'm trying to wrap my head around the performance of each query and it's proving to be quite difficult.
The possible queries:
Match Place node using a property lookup:
MATCH (entity:Place { category: "Food" })
Match Place node with isCategory relationship to Food node:
MATCH (entity:Place)-[:isCategory]->(category:Food)
Match Place node with Food relationship to Category node:
MATCH (entity)-[category:Food]->(:Category)
Match Food node with isCategoryFor relationship to Place node:
MATCH (category:Food)-[:isCategoryFor]->(entity:place)
And obviously all the variations in between. With relationship directions going the other way as well.
More complexity:
Let's throw in a little more complexity and say we now need to find all Place nodes using multiple categories. For example: Find all Place nodes with category Food or Bar
Would we just tack on another MATCH statement? If not, what is the most performant route to take here?
Extra:
Is there a tool to help me describe the traversal process and tell me the best method to choose?
If I understand your domain correctly, I would recommend making your Categorys into nodes themselves.
MERGE (:Category {name:"Food"})
MERGE (:Category {name:"Bar"})
MERGE (:Category {name:"Park"})
And connecting each Place node to the Categorys it belongs to.
MERGE (:Place {name:"Central Park"})-[:IS_A]->(:Category {name:"Park"})
MERGE (:Place {name:"Joe's Diner"})-[:IS_A]->(:Category {name:"Food"})
MERGE (:Place {name:"Joe's Diner"})-[:IS_A]->(:Category {name:"Bar"})
Then, if you want to find Places that belong to a Category, it can be pretty quick. Start by matching the category, then branch out to the places related to the category.
MATCH (c:Category {name:"Bar"}), (c)<-[:IS_A]-(p:Place)
RETURN p
You'll have a relatively limited number of categories, so matching the category will be quick. Then, because of the way Neo4j actually stores data, it will be fast to find all the places related to that category.
More Complexity
Finding places within multiple categories will be easy as well.
MATCH (c:Category) WHERE c.name = "Bar" OR c.name = "Food", (c)<-[:IS_A]-(p:Place)
RETURN p
Again, you just match the categories first (fast because there aren't many of them), then branch out to the connected places.
Use an Index
If you want fast, you need to use indexes where it makes sense. In this example, I would use an index on the category's name property.
CREATE INDEX ON :Category(name)
Or better yet, use a uniqueness constraint on the category names, which will index them and prevent duplicates.
CREATE CONSTRAINT ON (c:Category) ASSERT c.name IS UNIQUE
Indexes (and uniqueness) make a big difference on the speed of your queries.
Why this is fastest
Neo4j stores nodes and relationships in a very compact, quick-to-access format. Once you have a node or relationship, getting the adjacent relationships or nodes is very fast. However, it stores each node's (and relationship's) properties separately, meaning that looking through properties is relatively slow.
The goal is to get to a starting node as quickly as possible. Once there, traversing related entities is quick. If you only have 1,000 categories, but you have a billion places, it will be faster to pick out an individual Category than an individual Place. Once you have that starting node, getting to related nodes will be very efficient.
The Other Options
Just to reinforce, this is what makes your other options slower or otherwise worse.
In your first example, you are looking through properties on each node to look for the match. Property lookup is slow and you are doing it a billion times. An index can help with this, but it's still a lot of work. Additionally, you are effectively duplicating the category data over each of you billion places, and not taking advantage of Neo4j's strengths.
In all your other examples, your data models seem odd. "Food", "Bar", "Park", etc. are all instances of categories, not separate types. They should each be their own node, but they should all have the Category label, because that's what they are. In addition, categories are things, and thus they should be nodes. A relationship describes the connection between things. It does not make sense to use categories in this way.
I hope this helps!
Alice loves kittens and puppies and will only go to places where they are both present. She likes guppies and goldfish and insists on at least one of them being present. She hates echidnas though, and will not show up anywhere that hosts spiny things. Bob likes kittens and hamsters, but has no strong feelings regarding other animals, piscine, spiny or otherwise.
CREATE (alice:PERSON {name: 'Alice', loves: ['kittens','puppies'], likes: ['guppies','goldfish'], hates:['echidnas']})
CREATE (bob:PERSON {name: 'Bob', likes: ['kittens','hamsters']})
I'm throwing a party where I will have kittens, puppies, guppies and manatees. Who will come? Here's how I'm asking.
MATCH (a:PERSON)
WHERE
(NOT HAS (a.loves) OR (LENGTH(FILTER(love IN a.loves WHERE love IN ['kittens','puppies','guppies','manatees'])) = LENGTH(a.loves)))
AND
(NOT HAS (a.likes) OR (LENGTH(FILTER(like IN a.likes WHERE like IN ['kittens','puppies','guppies','manatees'])) > 0))
AND
(NOT HAS (a.hates) OR (LENGTH(FILTER(hate IN a.hates WHERE hate IN ['kittens','puppies','guppies','manatees'])) = 0))
RETURN a.name
Huzzah, Alice and Bob are both OK with that.
However, is that the smartest way to do it in Cypher?
This is of course a toy example: there will also be a shape in the MATCH and other filtering conditions.
However, my focus is on every person in the shape optionally having none, one two or three set of things[*], one of which contains elements which must ALL be matched by the elements of a provided collection, one of which ANY must be matched, and one which contains elements of which NONE must be matched by anything in the (common) provided collection. Implementing this is a core requirement.
[*] I say "things" rather than "properties", because I'm not averse to modelling my animals as nodes, and linking my people to them. Like this, for Carol and Dan who conveniently share the same taste in animals as Alice and Bob.
CREATE (carol:PERSON {name: 'Carol'})-[:LOVES]->(kittens {name:'kittens'}),
(carol)-[:LOVES]->(puppies {name:'puppies'}),
(carol)-[:LIKES]->({name:'guppies'}),
(carol)-[:LIKES]->({name:'goldfish'}),
(carol)-[:HATES]->({name:'echidnas'}),
(dan:PERSON {name: 'Dan'})-[:LIKES]->(kittens),
(dan)-[:LIKES]->({name:'hamsters'})
MATCH (a:PERSON), (a)-[:LOVES]->(a_loves), (a)-[:LIKES]->(a_likes), (a)-[:HATES]->(a_hates)
WHERE
ALL (loves_name IN a_loves.name WHERE loves_name IN ['kittens','puppies','guppies','manatees'])
AND
ANY (likes_name IN a_likes.name WHERE likes_name IN ['kittens','puppies','guppies','manatees'])
AND
NONE (hates_name IN a_hates.name WHERE hates_name IN ['kittens','puppies','guppies','manatees'])
RETURN a.name
This works for each person who does love, like and hate at least one animal, i.e. for Carol. However, it does not work when a person does not have all three of LOVES and LIKES and HATES relationships, as that shape isn't found in the graph, and so it does not find Dan. I cannot see an obvious way to make OPTIONAL MATCH perform that kind of pruning.
To get around that issue, I can add fake animal nodes and give each and every PERSON relationships to them, thusly: -[:LOVES]->(unicorns), -[:LIKES]->(manticores) -[:HATES]->(basilisks) and then always add 'unicorns' to the collection being compared against the LOVES node. However, that feels very contrived and clunky.
Long post short, what would be your favoured way to model this in Neo4J and Cypher, and why?
Here's a first cut http://gist.neo4j.org/?8932364
Do take a look at let me know if first of all, the problem is understood and second, if the query fits.
I'd like to come back later and improve on that query, it was put together quickly.