I am new to Cypher and am learning it through a small project I am setting up.
I have the following data model so far:
For every Thought created, I connect Tags through Categories.
The Categories only serve as intermediates between the Tags and Thoughts. This is done to improve querying, prevent Tag duplication and reduce the number of relationships between the objects.
To prevent creation of new Tags with the same value, I thought of the following query:
CREATE (t: Thought {moment:timestamp(), message:'Testing new Thought'})
MERGE (t1: Tag{value: 'work'})
MERGE (t2: Tag{value: 'tasks'})
MERGE (t3: Tag{value: 'administration'})
MERGE (c: Category)
MERGE (t1)<-[u:CONSISTS_OF{index:0}]-(c)
MERGE (t2)<-[v:CONSISTS_OF{index:1}]-(c)
MERGE (t3)<-[w:CONSISTS_OF{index:2}]-(c)
MERGE (t)-[x:CATEGORIZED_AS{index: 0}]->(c)
This works fine, except for one thing: the Thought receives a relationship with all created Categories.
This I understand, since I define no restrictions in the MERGE query.
However, I do not know how to apply restrictions to the CATEGORIZED_AS relationship.
I tried to add this to the bottom of the query, but that does not work:
WHERE (t)-[x]->(c)
Any idea how to apply a restriction like I need in my case?
EDIT:
I forgot to mention the unique connection of a Category:
A Category is connected to a fixed set of Tags in a specific order.
E.g., I have three tags:
work
tasks
administration
The only way the Category matches the Thought is if the Category has the following relationships with the Tags:
work <-[:CONSISTS_OF {index:0}]-(category)
tasks <-[:CONSISTS_OF {index:1}]-(category)
administration <-[:CONSISTS_OF {index:2}]-(category)
Any other order of relationships is invalid and a new Category should be created.
The Problem: Use of MERGE
MERGE will try to find a pattern in the graph. If it finds the pattern it will return it; otherwise it will try to create the entire pattern. This works individually for each MERGE clause. So, this works great and as expected for (n:Tag) nodes, since you only want one tag for each word in the graph, but the issue comes later in your query, when you try to merge a category.
What you want to do is try and find this (c:Category) that is connected to these three (t:Tag) nodes with these r.index properties on the relationship (:Tag)-[r:CONSISTS_OF]-(). However, you're running four merge clauses which do the following:
MERGE (c: Category)
Find or create any node c with the label Category.
MERGE (t1)<-[u:CONSISTS_OF{index:0}]-(c)
MERGE (t2)<-[v:CONSISTS_OF{index:1}]-(c)
MERGE (t3)<-[w:CONSISTS_OF{index:2}]-(c)
Find or Create a relationship between that node and t1, then t2, t3 etc.
If you were to run that query, then change one of the tags to something different like "rest" and run the query again, you'd expect a new category to appear. But it won't with the current query: it'll simply create a new tag, then find the existing (c:Category) node in that first MERGE clause, and create a relationship between it and the new tag. So, rather than having two categories each linked to three tags (with two tags being shared), you'll just have four tags all linked to one category, with duplicate indexes on your relationships.
So, what you actually want to do is use MERGE to find the complex pattern like below.
MERGE (t1)<-[:CONSISTS_OF {index:0}]-(c:Category)-[:CONSISTS_OF {index:1}]->(t2),
(t3)<-[:CONSISTS_OF {index:2}]-(c)
Annoyingly, that will give you a syntax error, as Cypher can't currently merge complex patterns like that. So, here comes the creative bit.
Solution 1: Conditional Execution with CASE and FOREACH (Easy)
This is quite a handy go-to for these kinds of situations; see the commented query below. You'll essentially split the merge up, use OPTIONAL MATCH to try to find the pattern, and then use a little trick in Cypher syntax to CREATE the pattern if it turns out not to exist.
CREATE (t: Thought {moment:timestamp(), message:'Testing new Thought'})
MERGE (t1:Tag{value: 'work'})
MERGE (t2:Tag{value: 'abayo'})
MERGE (t3:Tag{value: 'rest'})
WITH *
// we can't merge this category because it's a complex pattern
// so, can we find it in the db?
OPTIONAL MATCH (t1)<-[:CONSISTS_OF {index:0}]-(c:Category)-[:CONSISTS_OF {index:1}]->(t2),
(t3)<-[:CONSISTS_OF {index:2}]-(c)
// the CASE here works in conjunction with the foreach to
// conditionally execute the create clause
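// note: "CASE c WHEN NULL" never matches, because c = NULL is never true; test with IS NULL instead
WITH t, t1, t2, t3, c, CASE WHEN c IS NULL THEN [1] ELSE [] END AS make_cat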
FOREACH (i IN make_cat |
// if no such category exists, this code will run as c is null
// if a category does exist, c will not be null, and so this won't run
CREATE (t1)<-[:CONSISTS_OF {index:0}]-(new_cat:Category)-[:CONSISTS_OF {index:1}]->(t2),
(t3)<-[:CONSISTS_OF {index:2}]-(new_cat)
)
// now we're not sure if we're referring to new_cat or c
// remove variable c from scope
WITH t, t1, t2, t3
// and now match it, we know for sure now we'll find it
// (keep the index properties, so a category with the same tags in another order can't match)
// alternatively, use conditional execution again here
MATCH (t1)<-[:CONSISTS_OF {index:0}]-(c:Category)-[:CONSISTS_OF {index:1}]->(t2),
(t3)<-[:CONSISTS_OF {index:2}]-(c)
// now we have the category, we definitely want
// to create the relationship between the thought and the category
CREATE (t)-[:CATEGORIZED_AS]->(c)
RETURN *
Solution 2: Refactor Your Graph (Hard)
I haven't included a query here - although I can do if requested - but an alternative would be to refactor your graph to attach tags to categories in a ring (or chain - with a final member marker) structure, so that you can merge the pattern straight away without having to split it up.
Since the categories are in an order, you could express the data like the below, in one MERGE clause.
MERGE (c:Category)-[:CONSISTS_OF_TAG_SEQUENCE]->(t1)-[:NEXT_TAG_IN_SEQUENCE]->(t2)-[:NEXT_TAG_IN_SEQUENCE]->(t3)-[:NEXT_TAG_IN_SEQUENCE]->(c)
This might seem like a neat solution at first, but the problem is that, since tags will belong to multiple categories, if tags are shared between categories you will need to do one of the following:
1) create a composite index to identify categories and store this as a property of the sequential relationships so you know which relationships to follow in your path (i.e., so you can always find one, and only one, sequence of tags for a category)
2) still link each tag to the categories it is in and query on this pattern (to allow you to find that single path like in 1)
3) use an intermediate node to achieve the same as 1 and 2
4) all of the above and more.
As you might have guessed, this will make your query much more complicated than it needs to be quite quickly. It could be fun to try, and may suit some use cases, but for the time being I'd stick with the easy solution!
My solution to your problem is to enforce that every Category has a unique, consistently reproducible id. In your case, add a cid or id field whose value is something along the lines of tag1<_>tag2<_>tag3<_>. (<_> is used as the separator because it is very unlikely to appear inside a tag. If _ is an invalid tag character, replacing <_> with _ will do just fine.)
This way you can lock onto a category node without having to know anything about the nodes it is attached to. Essentially, the unique id IS your merge logic. It can even be built up dynamically in Cypher using reduce. I usually also have a value field as a "pretty print display id value".
When running the final Cypher, you would MERGE each node on its own by its instance id, use SET for the non-node-defining fields, and then use CREATE UNIQUE to make sure there is one and only one relationship between the nodes.
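To make the id scheme concrete, here is a minimal sketch of how such a key could be built on the client side before it is passed to the query as a parameter. It is written in Python purely for illustration; the function name category_id and the SEPARATOR constant are assumptions of this sketch, not part of any Neo4j API, and the same string could equally be assembled in Cypher with reduce.

```python
SEPARATOR = "<_>"  # assumed separator, taken from the example id above


def category_id(tags):
    """Build a reproducible id from an ordered list of tag values."""
    for tag in tags:
        # A tag containing the separator would make two different
        # tag sequences collide on the same id, so reject it.
        if SEPARATOR in tag:
            raise ValueError(f"tag {tag!r} contains the separator")
    # Every tag is followed by the separator: tag1<_>tag2<_>tag3<_>
    return "".join(tag + SEPARATOR for tag in tags)


print(category_id(["work", "tasks", "administration"]))
```

Because the tags are joined in order, the same tags in a different order produce a different id, which is what makes a MERGE on that id find or create the right Category.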
Related
I have a single csv file whose contents are as follows -
id,name,country,level
1,jon,USA,international
2,don,USA,national
3,ron,USA,local
4,bon,IND,national
5,kon,IND,national
6,jen,IND,local
7,ken,IND,international
8,ben,GB,local
9,den,GB,international
10,lin,GB,national
11,min,AU,national
12,win,AU,local
13,kin,AU,international
14,bin,AU,international
15,nin,CN,national
16,con,CN,local
17,eon,CN,international
18,fon,CN,international
19,pon,SZN,national
20,zon,SZN,international
First of all I created a constraint on id
CREATE CONSTRAINT idConstraint ON (n:Name) ASSERT n.id IS UNIQUE
Then I created nodes for name, then for country and finally for level as follows -
LOAD CSV WITH HEADERS FROM "file:///demo.csv" AS row
MERGE (name:Name {name: row.name, id: row.id, country:row.country, level:row.level})
MERGE (country:Country {name: row.country})
MERGE (level:Level {type: row.level})
I can see the nodes fine. However, I want to be able to query for things like, for a given country how many names are there? For a given level, how many countries and then how many names for that country are there?
So for that I need to make Relationships between the nodes.
For that I tried like this -
LOAD CSV WITH HEADERS FROM "file:///demo.csv" AS row
MATCH (n:Name {name:row.name}), (c:Country {name:row.country})
CREATE (n)-[:LIVES_IN]->(c)
RETURN n,c
However this gives me a warning as follows -
This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (c))
Moreover the resulting Graph looks slightly wrong - each Name node has 2 relations with a country whereas I would think there would be only one?
I also have a nagging fear that I am not doing things in an optimized or correct way. This is just a demo. In my real dataset, I often cannot run multiple CREATE or MERGE statements together. I have to LOAD the same CSV file again and again to do pretty much everything from creating nodes. When creating relationships, because a cartesian product forms, the command basically gives Java Heap Memory error.
PS. I just started with neo4j yesterday. I really don't know much about it. I have been struggling with this for a whole day, hence thought of asking here.
You can ignore the cartesian product warning, since that exact approach is needed in order to create the relationships that form the patterns you need.
As for the multiple relationships, it's possible you may have run the query twice. The second run would have created the duplicate relationships. You could use MERGE instead of CREATE for the relationships, which would ensure that there are no duplicates.
Suppose you've got two nodes that represent the same thing, and you want to merge those two nodes. Both nodes can have any number of relations with other nodes.
The basics are fairly easy, and would look something like this:
MATCH (a), (b) WHERE a.id == b.id
MATCH (b)-[r]->()
CREATE (a)-[s]->()
SET s = PROPERTIES(r)
DELETE DETACH b
Only I can't create a relation without a type. And Cypher doesn't support variable labels either. I'd love to be able to do something like
CREATE (a)-[s:{LABELS(r)}]->(o)
but that doesn't work. To create the relation, you need to know the type of the relation, and in this case I really don't.
Is there a way to dynamically assign types to relationships, or am I going to have to query the types of the old relation, and then string concat new queries with the proper types? That's not impossible, but a lot slower and more complex. And this could potentially match a lot of elements and even more relationships, so having to generate a separate query for every instance is going to slow things down quite a lot.
Or is there a way to change the target of the old relationship? That would probably be the fastest, but I'm not aware of any way to do that.
I think you need to take a look at APOC, especially apoc.create.relationship, which enables creating relationships with a dynamic type.
Adapting your example, you should end up with something along the lines of (not tested):
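// Cypher compares with =, not ==; id(a) < id(b) keeps a node from matching itself
MATCH (a), (b) WHERE a.id = b.id AND id(a) < id(b)
MATCH (b)-[r]->(n)
// a CALL inside a larger query has to YIELD its result
CALL apoc.create.relationship(a, type(r), properties(r), n) YIELD rel
// finish copying all relationships before deleting b
WITH DISTINCT b
DETACH DELETE b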
NB
relationships have TYPE and not label
the proper cypher statement to delete relationships attached to a node and the node itself is DETACH DELETE (and not DELETE DETACH)
Related resource: https://markhneedham.com/blog/2016/10/30/neo4j-create-dynamic-relationship-type/
The APOC procedure apoc.refactor.mergeNodes should be very helpful. That procedure is very powerful, and you need to read the documentation to understand how to configure it to do what you want in your specific situation.
Here is a simple example that shows how to use the procedure's default configuration to merge nodes with the same id:
MATCH (node:Foo)
WITH node.id AS id, COLLECT(node) AS nodes
WHERE SIZE(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes, {}) YIELD node
RETURN node
In this example, I specified an arbitrary Foo label to avoid accidentally merging unwanted nodes. Doing so also helps to speed up the query if you have a lot of nodes with other labels (since they will not need to be scanned for the id property).
The aggregating function COLLECT is used to collect a list of all the nodes with the same id. After checking the size of the list, it is passed to the procedure.
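For readers less familiar with aggregation in Cypher, the grouping step can be sketched outside the database as follows. This is only an illustrative Python model of what WITH node.id AS id, COLLECT(node) AS nodes followed by WHERE SIZE(nodes) > 1 selects; the function name duplicate_groups and the sample data are made up for the example.

```python
from collections import defaultdict


def duplicate_groups(nodes):
    """Group nodes by their 'id' property and keep groups with more than one member."""
    groups = defaultdict(list)
    for node in nodes:
        # Equivalent in spirit to: WITH node.id AS id, COLLECT(node) AS nodes
        groups[node["id"]].append(node)
    # Equivalent in spirit to: WHERE SIZE(nodes) > 1
    return {key: members for key, members in groups.items() if len(members) > 1}


nodes = [
    {"id": 1, "name": "a"},
    {"id": 1, "name": "b"},
    {"id": 2, "name": "c"},
]
print(duplicate_groups(nodes))
```

Only the groups that survive the size check would be handed to the merge procedure, so nodes with a unique id are left untouched.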
Given the following graph:
(a)<--(b)-->(c)<--(d)-->(e)<--(f)-->(a)
I believe it is (currently) impossible to create a node (g) using the merge clause such that:
(g)-->(a)
(g)-->(c)
(g)-->(e)
The reason being that it requires a comma to describe the above pattern, and the MERGE clause will not accept a comma. e.g. (a)<--(g)-->(c), (g)-->(e)
For ease of reference, see picture below. Given that graph (except node 6), I cannot create node 6 using the MERGE command.
Can someone come up with a way to do this? I believe new functionality needs to be added, but I'd like to be more reasonably sure there's not a viable workaround before heading down that path.
There is no way to do this in Cypher, or in APOC, right now. That said, there is a workaround. It's a bit manual: you'll need to acquire locks on the nodes in question (we'll use APOC for that), then use OPTIONAL MATCH along with WHERE ... IS NULL to determine whether or not the center node exists, and create it only when it doesn't.
For this, I'm using the following example graph to mimic yours, before the addition of node 6:
create (zero:Node{name:0})
create (one:Node{name:1})
create (two:Node{name:2})
create (three:Node{name:3})
create (four:Node{name:4})
create (five:Node{name:5})
create (zero)<-[:TYPE]-(one)-[:TYPE]->(two)
create (two)<-[:TYPE]-(three)-[:TYPE]->(four)
create (four)<-[:TYPE]-(five)-[:TYPE]->(zero)
And now, the query to merge
match (node:Node)
where node.name in [0,2,4]
with collect(node) as nodes
call apoc.lock.nodes(nodes)
with nodes[0] as first, nodes[1] as second, nodes[2] as third
optional match (first)<-[:TYPE]-(center)-[:TYPE]->(second)
where (center)-[:TYPE]->(third)
with first, second, third, center
where center is null
// above 'where' will result in no rows if center exists, preventing creation of duplicate pattern below
create (first)<-[:TYPE]-(newCenter:Node{name:6})-[:TYPE]->(second)
create (newCenter)-[:TYPE]->(third)
I have a linked list, in neo4j that looks something like this:
CREATE (p:Procedure {id:1})
CREATE (s1:Step {title:"Do Thing 1"})
CREATE (s2:Step {title:"Do Thing 2"})
MERGE (p)-[:FIRST_STEP {parent:[1]}]->(s1)-[:NEXT {parent:[1]}]->(s2)
Now I might create another list that contains this list, and for that to work, I'd either create a separate set of relationships with a new parent value, or I'd add the new parent id to the list of parents: e.g. parent[1,2].
Now, is it possible to do a match like this:
match (p:Procedure)-[rel:FIRST_STEP|NEXT*]->(steps)
WHERE p.id = 1 and 1 in rel.parent
return p, steps
I can do it if I put the constraint in the initial declaration of the relationship e.g. -[rel:FIRST_STEP|NEXT* {parent:1}]->, but that doesn't allow me to do the "IN" query.
Any thoughts or direction much appreciated.
Are there any expected use cases that will modify the list in some way, such as inserting, rearranging, or removing nodes? And if so, are the changes to one list meant to reflect changes to the other?
If these use cases exist, and if the list changes are meant to stay in sync with each other, single relationships with a list of parent ids makes sense (though the APOC Procedures library contains graph refactoring procedures that could handle either design).
If changes to one list aren't meant to reflect in the other list, then separate relationships per parent make the most sense.
Also, as far as I can tell there aren't easy operations to subtract elements from a list (you can use "+" to add an element, but you can't use "-"). I think you'd have to use filter() or a list comprehension to do this, which is a little awkward. It's easier syntactically to delete relationships entirely than to remove elements from lists on relationships, though that probably won't be a driving concern for your design choice.
I have a simple model of a chess tournament. It has 5 players playing each other. The graph looks like this:
The graph is generally fine, but upon further inspection, you can see that both sets
Guy1 vs Guy2,
and
Guy4 vs Guy5
have a redundant relationship each.
The problem is obviously in the data, where there is an extraneous complementary row for each of these matches (so in a sense this is a data quality issue in the underlying csv):
I could clean these rows by hand, but the real dataset has millions of rows. So I'm wondering how I could remove these relationships in either of 2 ways, using CQL:
1) Don't read in the extra relationship in the first place
2) Go ahead and create the extra relationship, but then remove it later.
Thanks in advance for any advice on this.
The code I'm using is this:
// Here, we load and create nodes
LOAD CSV WITH HEADERS FROM
'file:///.../chess_nodes.csv' AS line
WITH line
MERGE (p:Player {
player_id: line.player_id
})
ON CREATE SET p.name = line.name
ON MATCH SET p.name = line.name
ON CREATE SET p.residence = line.residence
ON MATCH SET p.residence = line.residence
// Here create the edges
LOAD CSV WITH HEADERS FROM
'file:///.../chess_edges.csv' AS line
WITH line
MATCH (p1:Player {player_id: line.player1_id})
WITH p1, line
OPTIONAL MATCH (p2:Player {player_id: line.player2_id})
WITH p1, p2, line
MERGE (p1)-[:VERSUS]->(p2)
It is obvious that you don't need this extra relationship as it doesn't add any value nor weight to the graph.
There is something that few people are aware of, despite being in the documentation.
MERGE can be used on undirected relationships; Neo4j will pick one direction for you (as relationships MUST be directed in the graph).
Documentation reference : http://neo4j.com/docs/stable/query-merge.html#merge-merge-on-an-undirected-relationship
As an example, take the following statement and run it for the first time:
MATCH (a:User {name:'A'}), (b:User {name:'B'})
MERGE (a)-[:VERSUS]-(b)
It will create the relationship as it doesn't exist. However if you run it a second time, nothing will be changed nor created.
I guess it would solve your problem, as you will not have to worry about cleaning the data up front nor about running scripts afterwards to clean your graph.
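If you would still rather keep the mirrored rows out of the import altogether (option 1 in the question), a small preprocessing pass over the edges file is one way to do it. The following is a minimal Python sketch under a few assumptions: the column names player1_id and player2_id are taken from the LOAD CSV query in the question, the function name drop_mirrored_rows is made up, and the inline sample stands in for the real file.

```python
import csv
import io


def drop_mirrored_rows(rows):
    """Yield only the first row seen for each unordered pair of players."""
    seen = set()
    for row in rows:
        # frozenset ignores order, so (1, 2) and (2, 1) are the same pair
        pair = frozenset((row["player1_id"], row["player2_id"]))
        if pair in seen:
            continue
        seen.add(pair)
        yield row


# Inline sample standing in for chess_edges.csv
sample = io.StringIO("player1_id,player2_id\n1,2\n2,1\n4,5\n5,4\n1,3\n")
unique_rows = list(drop_mirrored_rows(csv.DictReader(sample)))
print([(row["player1_id"], row["player2_id"]) for row in unique_rows])
```

The set of seen pairs has to fit in memory, which may or may not be acceptable for millions of rows; the undirected MERGE above avoids that concern by letting the database do the check.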
I'd suggest creating a "match" node like so
(x:Player)-[:MATCH]->(m:Match)<-[:MATCH]-(y:Player)
to enable tracking details about the match separate from the players.
If you need to track player matchups distinct from the matches themselves, then
(x:Player)-[:HAS_PLAYED]->(pair:HasPlayed)<-[:HAS_PLAYED]-(y:Player)
would do the trick.
If the schema has to stay as-is and the only requirement is to remove redundant relationships, then
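// visit each pair only once; without the WHERE, the pattern also matches with p1 and p2 swapped and both relationships get deleted
MATCH (p1:Player)-[r1:VERSUS]->(p2:Player)-[r2:VERSUS]->(p1)
WHERE id(p1) < id(p2)
DELETE r2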
should do the trick. This finds all p1, p2 nodes with bi-directional VERSUS relationships and removes one of them.
You need to use UNWIND to do the trick.
MATCH (p1:Player)-[r:VERSUS]-(p2:Player)
WHERE id(p1) < id(p2)
WITH p1,p2,collect(r) AS rels
UNWIND tail(rels) AS rel
DELETE rel;
The previous code finds the direct connections of type VERSUS between p1 and p2 using MATCH (note that the pattern is not directed; the id comparison keeps each pair from being processed twice). It then collects the relationships for each pair and deletes all but the first of them, which is what tail() returns.
Of course you can add a check to see whether the length of the collection is 2.