I have a simple model of a chess tournament. It has 5 players playing each other. The graph looks like this:
The graph is generally fine, but upon further inspection, you can see that both sets
Guy1 vs Guy2,
and
Guy4 vs Guy5
have a redundant relationship each.
The problem is obviously in the data, where there is an extraneous complementary row for each of these matches (so in a sense this is a data quality issue in the underlying CSV):
I could clean these rows by hand, but the real dataset has millions of rows. So I'm wondering how I could remove these relationships in either of 2 ways, using CQL:
1) Don't read in the extra relationship in the first place
2) Go ahead and create the extra relationship, but then remove it later.
Thanks in advance for any advice on this.
The code I'm using is this:
// Here, we load and create the nodes
LOAD CSV WITH HEADERS FROM
'file:///.../chess_nodes.csv' AS line
WITH line
MERGE (p:Player {
player_id: line.player_id
})
ON CREATE SET p.name = line.name
ON MATCH SET p.name = line.name
ON CREATE SET p.residence = line.residence
ON MATCH SET p.residence = line.residence
// Here, we create the edges
LOAD CSV WITH HEADERS FROM
'file:///.../chess_edges.csv' AS line
WITH line
MATCH (p1:Player {player_id: line.player1_id})
WITH p1, line
OPTIONAL MATCH (p2:Player {player_id: line.player2_id})
WITH p1, p2, line
MERGE (p1)-[:VERSUS]->(p2)
It is obvious that you don't need this extra relationship, as it doesn't add any value or weight to the graph.
There is something that few people are aware of, despite being in the documentation.
MERGE can be used on an undirected relationship; Neo4j will pick a direction for you (as relationships MUST be directed in the graph).
Documentation reference : http://neo4j.com/docs/stable/query-merge.html#merge-merge-on-an-undirected-relationship
An example: take the following statement.
MATCH (a:User {name:'A'}), (b:User {name:'B'})
MERGE (a)-[:VERSUS]-(b)
If you run it for the first time, it will create the relationship, as it doesn't exist yet. However, if you run it a second time, nothing will be changed or created.
I guess this would solve your problem, as you will not have to worry about cleaning the data up front or running cleanup scripts on your graph afterwards.
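Applied to your import, it would look something like this (untested; it assumes the same CSV columns as in your script, and swaps the OPTIONAL MATCH for a plain MATCH so rows without a matching second player are simply skipped):
LOAD CSV WITH HEADERS FROM
'file:///.../chess_edges.csv' AS line
MATCH (p1:Player {player_id: line.player1_id})
MATCH (p2:Player {player_id: line.player2_id})
// undirected MERGE: a complementary row will match the existing
// VERSUS relationship instead of creating a reversed duplicate
MERGE (p1)-[:VERSUS]-(p2)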
I'd suggest creating a "match" node like so
(x:Player)-[:MATCH]->(m:Match)<-[:MATCH]-(y:Player)
to enable tracking details about the match separate from the players.
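For example, a game could be created along these lines (untested; the player_id values and the date and result properties are just hypothetical placeholders):
MATCH (x:Player {player_id: '1'}), (y:Player {player_id: '2'})
CREATE (m:Match {date: '2017-05-01', result: '1-0'})
CREATE (x)-[:MATCH]->(m)<-[:MATCH]-(y)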
If you need to track player matchups distinct from the matches themselves, then
(x:Player)-[:HAS_PLAYED]->(pair:HasPlayed)<-[:HAS_PLAYED]-(y:Player)
would do the trick.
If the schema has to stay as-is and the only requirement is to remove redundant relationships, then
MATCH (p1:Player)-[r1:VERSUS]->(p2:Player)-[r2:VERSUS]->(p1)
DELETE r2
should do the trick. This finds all p1, p2 nodes with bi-directional VERSUS relationships and removes one of them.
You can use UNWIND to do the trick:
MATCH (p1:Player)-[r:VERSUS]-(p2:Player)
WITH p1,p2,collect(r) AS rels
UNWIND tail(rels) as rel
DELETE rel;
The previous code finds the direct connections of type VERSUS between p1 and p2 using MATCH (note that the pattern is not directed). It then collects those relationships into a list and deletes everything in the list except the first one (the tail of the collection).
Of course you can add a check to see whether the length of the collection is 2.
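For instance, something like this (untested) only touches pairs that actually have more than one VERSUS relationship, and the id(p1) < id(p2) guard makes sure each pair is processed only once:
MATCH (p1:Player)-[r:VERSUS]-(p2:Player)
WHERE id(p1) < id(p2)
WITH p1, p2, collect(r) AS rels
WHERE size(rels) > 1
UNWIND tail(rels) AS rel
DELETE rel;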
Suppose you've got two nodes that represent the same thing, and you want to merge those two nodes. Both nodes can have any number of relations with other nodes.
The basics are fairly easy, and would look something like this:
MATCH (a), (b) WHERE a.id == b.id
MATCH (b)-[r]->()
CREATE (a)-[s]->()
SET s = PROPERTIES(r)
DELETE DETACH b
Only I can't create a relation without a type. And Cypher doesn't support variable labels either. I'd love to be able to do something like
CREATE (a)-[s:{LABELS(r)}]->(o)
but that doesn't work. To create the relation, you need to know the type of the relation, and in this case I really don't.
Is there a way to dynamically assign types to relationships, or am I going to have to query the types of the old relation, and then string concat new queries with the proper types? That's not impossible, but a lot slower and more complex. And this could potentially match a lot of elements and even more relationships, so having to generate a separate query for every instance is going to slow things down quite a lot.
Or is there a way to change the target of the old relationship? That would probably be the fastest, but I'm not aware of any way to do that.
I think you need to take a look at APOC, especially apoc.create.relationship, which enables creating relationships with a dynamic type.
Adapting your example, you should end up with something along the lines of (not tested):
MATCH (a), (b) WHERE a.id = b.id AND id(a) < id(b)
MATCH (b)-[r]->(n)
CALL apoc.create.relationship(a, type(r), properties(r), n) YIELD rel
DETACH DELETE b
NB:
relationships have a TYPE, not a label
the proper Cypher statement to delete a node together with its attached relationships is DETACH DELETE (not DELETE DETACH)
Related resource: https://markhneedham.com/blog/2016/10/30/neo4j-create-dynamic-relationship-type/
The APOC procedure apoc.refactor.mergeNodes should be very helpful. That procedure is very powerful, and you need to read the documentation to understand how to configure it to do what you want in your specific situation.
Here is a simple example that shows how to use the procedure's default configuration to merge nodes with the same id:
MATCH (node:Foo)
WITH node.id AS id, COLLECT(node) AS nodes
WHERE SIZE(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes, {}) YIELD node
RETURN node
In this example, I specified an arbitrary Foo label to avoid accidentally merging unwanted nodes. Doing so also helps to speed up the query if you have a lot of nodes with other labels (since they will not need to be scanned for the id property).
The aggregating function COLLECT is used to collect a list of all the nodes with the same id. After checking the size of the list, it is passed to the procedure.
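If you also want the duplicates' properties and relationships carried over onto the surviving node, the procedure takes a configuration map; something along these lines should work, but check the APOC documentation for your version, as the available options have changed over time:
MATCH (node:Foo)
WITH node.id AS id, COLLECT(node) AS nodes
WHERE SIZE(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes, {properties: "combine", mergeRels: true}) YIELD node
RETURN node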
I am new to Cypher and am learning it through a small project I am setting up.
I have the following data model so far:
For every Thought created, I connect Tags through Categories.
The Categories only serve as an intermediary between the Tags and Thoughts; this is done to improve querying, prevent Tag duplication, and reduce the number of relationships between the objects.
To prevent creation of new Tags with the same value, I thought of the following query:
CREATE (t: Thought {moment:timestamp(), message:'Testing new Thought'})
MERGE (t1: Tag{value: 'work'})
MERGE (t2: Tag{value: 'tasks'})
MERGE (t3: Tag{value: 'administration'})
MERGE (c: Category)
MERGE (t1)<-[u:CONSISTS_OF{index:0}]-(c)
MERGE (t2)<-[v:CONSISTS_OF{index:1}]-(c)
MERGE (t3)<-[w:CONSISTS_OF{index:2}]-(c)
MERGE (t)-[x:CATEGORIZED_AS{index: 0}]->(c)
This works fine, except for one thing: the Thought receives a relationship with all created Categories.
This I understand: I define no restrictions in the MERGE query.
However, I do not know how to apply restrictions to the CATEGORIZED_AS relationship.
I tried to add this to the bottom of the query, but that does not work:
WHERE (t)-[x]->(c)
Any idea how to apply a restriction like I need in my case?
EDIT:
I forgot to mention the unique connection of a Category:
A category is connected to a fixed set of Tags in a specific order.
E.g I have three tags:
work
tasks
administration
The only way the Category matches the Thought is if the Category has the following relationships with the Tags:
work <-[:CONSISTS_OF {index:0}]-(category)
tasks <-[:CONSISTS_OF {index:1}]-(category)
administration <-[:CONSISTS_OF {index:2}]-(category)
Any other order of relationships is invalid and a new Category should be created.
The Problem: Use of MERGE
MERGE will try to find a pattern in the graph; if it finds the pattern it will return it, otherwise it will try to create the entire pattern. This works individually for each MERGE clause. So, this works great and as expected for (n:Tag) nodes, since you only want one tag for each word in the graph, but the issue comes later in your query when you try to merge a category.
What you want to do is try to find the (c:Category) that is connected to these three (t:Tag) nodes with these r.index properties on the relationship (:Tag)-[r:CONSISTS_OF]-(). However, you're running four MERGE clauses, which do the following:
MERGE (c: Category)
Find or create any node c with the label Category.
MERGE (t1)<-[u:CONSISTS_OF{index:0}]-(c)
MERGE (t2)<-[v:CONSISTS_OF{index:1}]-(c)
MERGE (t3)<-[w:CONSISTS_OF{index:2}]-(c)
Find or Create a relationship between that node and t1, then t2, t3 etc.
If you were to run that query, and then change one of the tags to something different like "rest", and run the query again, you'd expect a new category to appear. But it won't with the current query; it'll simply create a new tag, then find the existing (c:Category) node in that first MERGE clause, and create a relationship between it and the new tag. So, rather than having two categories each linked to three tags (with two tags being shared), you'll just have four tags all linked to one category, with duplicate indexes on your relationships.
So, what you actually want to do is use MERGE to find the complex pattern like below.
MERGE (t1)<-[:CONSISTS_OF {index:0}]-(c:Category)-[:CONSISTS_OF {index:1}]->(t2),
(t3)<-[:CONSISTS_OF {index:2}]-(c)
Annoyingly, that will give you a syntax error, as Cypher can't currently merge complex patterns like that. So, here comes the creative bit.
Solution 1: Conditional Execution with CASE and FOREACH (Easy)
This is quite a handy go-to for these kinds of situations; see the commented query below. You'll essentially split the merge up, use OPTIONAL MATCH to try to find the pattern, and then use a little trick in Cypher syntax to CREATE the pattern if we find it doesn't exist.
CREATE (t: Thought {moment:timestamp(), message:'Testing new Thought'})
MERGE (t1:Tag{value: 'work'})
MERGE (t2:Tag{value: 'abayo'})
MERGE (t3:Tag{value: 'rest'})
WITH *
// we can't merge this category because it's a complex pattern
// so, can we find it in the db?
OPTIONAL MATCH (t1)<-[:CONSISTS_OF {index:0}]-(c:Category)-[:CONSISTS_OF {index:1}]->(t2),
(t3)<-[:CONSISTS_OF {index:2}]-(c)
// the CASE here works in conjunction with the foreach to
// conditionally execute the create clause
WITH t, t1, t2, t3, c, CASE WHEN c IS NULL THEN [1] ELSE [] END AS make_cat
FOREACH (i IN make_cat |
// if no such category exists, this code will run as c is null
// if a category does exist, c will not be null, and so this won't run
CREATE (t1)<-[:CONSISTS_OF {index:0}]-(new_cat:Category)-[:CONSISTS_OF {index:1}]->(t2),
(t3)<-[:CONSISTS_OF {index:2}]-(new_cat)
)
// now we're not sure if we're referring to new_cat or c
// remove variable c from scope
WITH t, t1, t2, t3
// and now match it, we know for sure now we'll find it
// alternatively, use conditional execution again here
MATCH (t1)<-[:CONSISTS_OF]-(c:Category)-[:CONSISTS_OF]->(t2),
(t3)<-[:CONSISTS_OF]-(c)
// now we have the category, we definitely want
// to create the relationship between the thought and the category
CREATE (t)-[:CATEGORIZED_AS]->(c)
RETURN *
Solution 2: Refactor Your Graph (Hard)
I haven't included a query here - although I can do so if requested - but an alternative would be to refactor your graph to attach tags to categories in a ring (or chain - with a final member marker) structure, so that you can merge the pattern straight away without having to split it up.
Since the tags in a category are in a set order, you could express the data like the below, in one MERGE clause.
MERGE (c:Category)-[:CONSISTS_OF_TAG_SEQUENCE]->(t1)-[:NEXT_TAG_IN_SEQUENCE]->(t2)-[:NEXT_TAG_IN_SEQUENCE]->(t3)-[:NEXT_TAG_IN_SEQUENCE]->(c)
This might seem like a neat solution at first, but the problem is that tags can belong to multiple categories. If tags are shared between categories, you will need to either:
create a composite index to identify categories and store this as a property of the sequential relationships so you know which relationships to follow in your path (i.e., so you can always find one, and only one, sequence of tags for a category)
still link each tag to the categories it is in and query on this pattern (to allow you to find that single path like in #1)
Use an intermediate node to achieve the same as 1 and 2
All of the above and more.
As you might have guessed, this will quickly make your query much more complicated than it needs to be. It could be fun to try, and may suit some use cases, but for the time being I'd stick with the easy solution!
My solution to your problem is to enforce that every Category has a unique, consistently reproducible id. In your case, add a cid or id field, where the value is something along the lines of tag1<_>tag2<_>tag3<_>. (<_> is used because the chances of it appearing inside a tag are essentially zero; if _ can never appear in a tag value, a plain _ separator will do just fine.)
This way you can lock onto a category node without having to know anything about the nodes it is attached to. Essentially, the unique id IS your merge logic. This can even be dynamically built up in Cypher using reduce. I usually also have a value field as a "pretty print display id value".
When running the final Cypher, you would MERGE on each node alone by instance id, use SET for the non-node-defining fields, then use CREATE UNIQUE to make sure there is one and only one relationship between the nodes.
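As a rough, untested sketch of that idea (using MERGE rather than the older CREATE UNIQUE, and assuming the tag values are supplied in order as a literal list):
WITH ['work', 'tasks', 'administration'] AS tags
// build the reproducible category id, e.g. 'work<_>tasks<_>administration<_>'
WITH tags, reduce(cid = '', tag IN tags | cid + tag + '<_>') AS cid
MERGE (c:Category {cid: cid})
WITH c, tags
UNWIND range(0, size(tags) - 1) AS i
MERGE (t:Tag {value: tags[i]})
MERGE (c)-[:CONSISTS_OF {index: i}]->(t)
RETURN DISTINCT c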
I am getting multiple nodes from
MATCH(n:Employee{name:"Govind Singh"}) return (n);
Actually, by mistake I have created duplicate nodes.
Now I want to delete all the duplicate nodes except one.
Assuming the duplicate nodes are all equivalent and don't have relationships:
MATCH (n:Employee {name: "Govind Singh"})
WITH n
SKIP 1
DELETE n
There are probably a few ways to do this; I just came up with this off the top of my head. I created a bunch of Govind Singhs, and this appears to work:
MATCH(n:Employee{name:"Govind Singh"})
WITH max(id(n)) as justOneOfThem
MATCH(n:Employee{name:"Govind Singh"})
WHERE id(n)<>justOneOfThem
DELETE n;
When you say "delete duplicate nodes", I interpret this to mean "delete all except one chosen". I'm somewhat arbitrarily choosing here that whichever one has the highest internal ID gets to stay. (The internal IDs mean nothing; don't read anything into that choice.) So I find all the Govind Singh nodes, figure out which one has the highest ID, then use that in a second MATCH to find them all again and delete any node that doesn't have that ID.
We have a scenario where we display relationships that spread pictures (or messages) to users.
For example: Relationship 1 of Node A has a message "Foo", Relationship 2 of Node B also has the same message "Foo" ... Relationship n of Node N also has the same message "Foo".
Now we are going to display a relationship graph by querying Neo4j.
This is my query:
MATCH (a)-[r1]-()-[r2]-()-[r3]-()-[r4]-()
WHERE a.id = '59072662'
and r2.message_id = r1.target_message_id
and r3.message_id = r2.target_message_id
and r4.message_id = r3.target_message_id
RETURN r1,r2,r3,r4
The problem is that this query does not work if there are only 2 levels of linking. If there are only an r1 and an r2, the query returns nothing.
How can I write a Cypher query that returns the set of relationships in my case?
Adding to Stefan's answer.
If you want to keep track of how pictures spread then you would also include a relationship to the image like:
(message)-[:INCLUDES]->(image)
If you want to see how a specific picture spread through the message network:
MATCH (i:Image {url: "X"}), p=(recipient:User)<-[*]-(m:Message)<-[*]-(sender:User)
WHERE (m)-[:INCLUDES]->(i)
WITH length(p) AS length, sender
ORDER BY length
RETURN DISTINCT sender
This will return all senders, ordered by path length, so the top one should be the original sender.
If you're just interested in the original sender you could use LIMIT 1.
Alternatively, if you find yourself traversing huge networks and hitting performance issues because of the massive paths that have to be traversed, you could also add a relationship between the message and the original uploader.
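For example (the relationship type and property names here are just assumptions), a single shortcut relationship per message would do:
MATCH (u:User {name: "OriginalUploader"}), (m:Message {message_id: "X"})
MERGE (u)-[:UPLOADED]->(m)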
As for the question you posted at the bottom, about how to get the set of relationships in a variable-length path:
You define a path, like in the example above
p=(recipient:User)<-[*]-(m:Message)<-[*]-(sender:User)
Then, to access the relationships in that path, you use the rels() function (an alias for relationships()):
RETURN rels(p)
You didn't provide much detail on your use case. From my experience, I suggest that you rethink your way of modelling the graph data.
A message seems to be a central concept in your domain. Therefore the message should probably be modeled as a node. To connect (a) and (b) via message (m), you might use something like (a)-[:SENT]->(m {message_id: ....})-[:TO]->(b).
Using this model, (m) could easily have a REFERS_TO relationship to another message, making the query above much more graphy.
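As a rough sketch of what a query over that remodelled graph might look like (the SENT and TO types come from the pattern above, REFERS_TO is the suggested link between messages, and the labels and property names are assumptions):
MATCH (a:User {id: '59072662'})-[:SENT]->(first:Message)
MATCH (first)<-[:REFERS_TO*0..]-(m:Message)-[:TO]->(b:User)
RETURN m, b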
My database contains about 300k nodes and 350k relationships.
My current query is:
start n=node(3) match p=(n)-[r:move*1..2]->(m) where all(r2 in relationships(p) where r2.GameID = STR(id(n))) return m;
The nodes touched in this query are all of the same kind, they are different positions in a game. Each of the relationships contains a property "GameID", which is used to identify the right relationship if you want to pass the graph via a path. So if you start traversing the graph at a node and follow the relationship with the right GameID, there won't be another path starting at the first node with a relationship that fits the GameID.
There are nodes that have hundreds of in and outgoing relationships, some others only have a few.
The problem is, that I don't know how to tell Cypher how to do this. The above query works for a depth of 1 or 2, but it should look like [r:move*] to return the whole path, which is about 20-200 hops.
But if I raise the values, the queries won't finish. I think that Cypher looks at every outgoing relationship at every single path depth relative to the start node, but as I already explained, there is only one right path. So it should do some kind of DFS instead of a BFS. Is there a way to do so?
I would consider configuring a relationship index for the GameID property. See http://docs.neo4j.org/chunked/milestone/auto-indexing.html#auto-indexing-config.
Once you have done that, you can try a query like the following (I have not tested this):
START n=node(3), r=relationship:rels(GameID = 3)
MATCH (n)-[r*1..]->(m)
RETURN m;
Such a query would limit the relationships considered by the MATCH cause to just the ones with the GameID you care about. And getting that initial collection of relationships would be fast, because of the indexing.
As an aside: since neo4j reuses its internally-generated IDs (for nodes that are deleted), storing those IDs as GameIDs will make your data unreliable (unless you never delete any such nodes). You may want to generate and use you own unique IDs, and store them in your nodes and use them for your GameIDs; and, if you do this, then you should also create a uniqueness constraint for your own IDs -- this will, as a nice side effect, automatically create an index for your IDs.