Removing Duplicates Nodes in neo4j - neo4j

I am getting multiples nodes by
MATCH(n:Employee{name:"Govind Singh"}) return (n);
actually by mistake i have created duplicates Nodes.
Now I want to delete all duplicates nodes Except One.

Assuming the duplicate nodes are all equivalent and don't have relationships:
MATCH (n:Employee {name: "Govind Singh"})
WITH n
SKIP 1
DELETE n

There are probably a few ways to do this, I just came up with this off of the top of my head. I created a bunch of Govind Singh's, and this appears to work:
MATCH(n:Employee{name:"Govind Singh"})
WITH max(id(n)) as justOneOfThem
MATCH(n:Employee{name:"Govind Singh"})
WHERE id(n)<>justOneOfThem
DELETE n;
When you say "delete duplicate nodes", I interpret this to mean "delete all except one chosen". I'm somewhat arbitrarily choosing here that whichever one has the highest internal ID gets to stay. (The internal IDs mean nothing, don't read anything into the meaning of that choice). So I find all Govind Singh's, figure out which one has the highest ID, then I use that in a second match to find them all again and delete anybody that doesn't have that ID.

Related

Remove redundant two way relationships in a Neo4j graph

I have a simple model of a chess tournament. It has 5 players playing each other. The graph looks like this:
The graph is generally fine, but upon further inspection, you can see that both sets
Guy1 vs Guy2,
and
Guy4 vs Guy5
have a redundant relationship each.
The problem is obviously in the data, where there is a extraneous complementary row for each of these matches (so in a sense this is a data quality issue in the underlying csv):
I could clean these rows by hand, but the real dataset has millions of rows. So I'm wondering how I could remove these relationships in either of 2 ways, using CQL:
1) Don't read in the extra relationship in the first place
2) Go ahead and create the extra relationship, but then remove it later.
Thanks in advance for any advice on this.
The code I'm using is this:
/ Here, we load and create nodes
LOAD CSV WITH HEADERS FROM
'file:///.../chess_nodes.csv' AS line
WITH line
MERGE (p:Player {
player_id: line.player_id
})
ON CREATE SET p.name = line.name
ON MATCH SET p.name = line.name
ON CREATE SET p.residence = line.residence
ON MATCH SET p.residence = line.residence
// Here create the edges
LOAD CSV WITH HEADERS FROM
'file:///.../chess_edges.csv' AS line
WITH line
MATCH (p1:Player {player_id: line.player1_id})
WITH p1, line
OPTIONAL MATCH (p2:Player {player_id: line.player2_id})
WITH p1, p2, line
MERGE (p1)-[:VERSUS]->(p2)
It is obvious that you don't need this extra relationship as it doesn't add any value nor weight to the graph.
There is something that few people are aware of, despite being in the documentation.
MERGE can be used on undirected relationships, neo4j will pick one direction for you (as realtionships MUST be directed in the graph).
Documentation reference : http://neo4j.com/docs/stable/query-merge.html#merge-merge-on-an-undirected-relationship
An example with the following statement, if you run it for the first time :
MATCH (a:User {name:'A'}), (b:User {name:'B'})
MERGE (a)-[:VERSUS]-(b)
It will create the relationship as it doesn't exist. However if you run it a second time, nothing will be changed nor created.
I guess it would solve your problem as you will not have to worry about cleaning the data in upfront nor run scripts afterwards for cleaning your graph.
I'd suggest creating a "match" node like so
(x:Player)-[:MATCH]->(m:Match)<-[:MATCH]-(y:Player)
to enable tracking details about the match separate from the players.
If you need to track player matchups distinct from the matches themselves, then
(x:Player)-[:HAS_PLAYED]->(pair:HasPlayed)<-[:HAS_PLAYED]-(y:Player)
would do the trick.
If the schema has to stay as-is and the only requirement is to remove redundant relationships, then
MATCH (p1:Player)-[r1:VERSUS]->(p2:Player)-[r2:VERSUS]->(p1)
DELETE r2
should do the trick. This finds all p1, p2 nodes with bi-directional VERSUS relationships and removes one of them.
You need to use UNWIND to do the trick.
MATCH (p1:Player)-[r:VERSUS]-(p2:Player)
WITH p1,p2,collect(r) AS rels
UNWIND tail(rels) as rel
DELETE rel;
THe previous code will find the direct connections of type VERSUS between p1 and p2 using match (note that this is not directed). Then will get the collection of relationships and finally the last of those relations, which is deleted.
Of course you can add a check to see whether the length of the collection is 2.

Kevin path at neo4j

I am trying to discover how many hops I have to do to know a friend with Cipher. I have these relationships.
(Gutierrez)-[:Conhece]->(Felipe),
(Felipe)-[:Conhece]->(Gutierrez),
(Felipe)-[:Conhece]->(Fernando),
(Fernando)-[:Conhece]->(Felipe),
(Fernando)-[:Conhece]->(Pedro),
(Pedro)-[:Conhece]->(Fernando),
(Pedro)-[:Conhece]->(Arthur),
(Arthur)-[:Conhece]->(Pedro),
(Arthur)-[:Conhece]->(Vitor),
(Vitor)-[:Conhece]->(Arthur),
and when I execute my query it shows Fernando. What I want is to show only Vitor, Pedro and Arthur.
MATCH (n:Leitor)-[r:Conhece]-m
WHERE n.nome='Pedro' OR m.nome='Vitor'
RETURN n,r,m
with my Bacon Path query ->
Ok, I'm adding another answer because I think I understand what you want and it's different than my other answer.
If you want to find everybody that both Pedro and Vitor have met (in this case, just Author):
MATCH (pedro:Leitor)-[:Conhece]-(in_common:Leitor)-[:Conhece]-(vitor:Leitor)
WHERE pedro.nome='Pedro' AND vitor.nome='Vitor'
RETURN in_common
That also might look a bit better like this:
MATCH (pedro:Leitor {nome: 'Pedro'})-[:Conhece]-(in_common:Leitor)-[:Conhece]-(vitor:Leitor {name: 'Vitor'})
RETURN in_common
I also notice from your screenshots that every meeting has two relationships. That may very well be what you want, but if your plan was to always have two relationships whenever two people meet then you can get away with just one relationship. When you query bidirectionally (that is, without specifying direction like in the queries above) then you'll get relationships in either direction.
Normally you only want relationships going in both directions if the direction is important. That could be because your just recording that it goes from one node to another, or it could be because you're storing different values on the relationships.
Here you go:
MATCH shortestPath((n:Leitor)-[rels:Conhece*]-m)
WHERE n.nome IN ['Pedro', 'Vitor']
RETURN n,rels,m,length(rels)
In this case rels will be a collection of relationships because the path is variable length. You can also do:
MATCH path=shortestPath((n:Leitor)-[rels:Conhece*]-m)
WHERE n.nome IN ['Pedro', 'Vitor']
RETURN n,rels,m,length(rels),path,length(path)

Neo4j: error in adding relationships among existing nodes

When writing a query to add relationships to existing nodes, it keeps me warning with this message:
"This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (e))"
If I run the query, it creates no relationships.
The query is:
match
(a{name:"Angela"}),
(b{name:"Carlo"}),
(c{name:"Andrea"}),
(d{name:"Patrizia"}),
(e{name:"Paolo"}),
(f{name:"Roberta"}),
(g{name:"Marco"}),
(h{name:"Susanna"}),
(i{name:"Laura"}),
(l{name:"Giuseppe"})
create
(a)-[:mother]->(b),
(a)-[:grandmother]->(c), (e)-[:grandfather]->(c), (i)-[:grandfather]->(c), (l)-[:grandmother]->(c),
(b)-[:father]->(c),
(e)-[:father]->(b),
(l)-[:father]->(d),
(i)-[:mother]->(d),
(d)-[:mother]->(c),
(c)-[:boyfriend]->(f),
(g)-[:brother]->(f),
(g)-[:brother]->(h),
(f)-[:sister]->(g), (f)-[:sister]->(h)
Can anyone help me?
PS: if I run the same query, but with just one or two relationships (and less nodes in the match clause), it creates the relationships correctly.
What is wrong here?
First of all, as I mentionned in my comments, you don't have any Labels, it's a really bad practice because Labels are useful to match properties in a certains dataset (if you match "name" property, you don't want to match it on a node who doesn't have a name, Labels are here for that.
The second problem is that your query doesn't know how many nodes it will get before it does. It means that if you have 500 000 nodes having name : "Angela" and 500 000 nodes having name : "Carlo", you will create one relation from each Angela node, going on each Carlo, that's quite a big query (500 000 * 500 000 relations to create if my maths aren't bad). Cypher is giving you a warning for that.
Cypher will still tell you this warning because you aren't using Unique properties to match your nodes, even with Labels, you will still have the warning.
Solution?
Use unique properties to create and match your nodes, so you avoid cartesian product.
Always use labels, Neo4j without labels is like using one giant table in SQL to store all of your data.
If you want to know how your query will run, use PROFILE before your query, here is the profile plan for your query:
Does every single one of those name strings exist? If not then you're not going to get any results because it's all one big match. You could try changing it to a MERGE.
But Supamiu is right, you really should have a label (say Person) and an index on :Person(name).

Neo4j: Multiple Relationship Combinations Between Nodes

I've been playing around with Neo4j and have a problem for which I do not have a solution, hence my question here.
For my particular problem I'll describe a simplified version that captures the essence. Suppose I have a graph of locations that are connected either directly or via a detour:
direct: (A)-[:GOES_TO]->(B)
indirect: (A)->[:GOES_THROUGH]->(C)-[:COMES_BACK_TO]->(B)
If I want to have everything between "Go" and the "Finish" with a GOES_TO relationship I can easily use the Cypher query:
START a=node:NODE_IDX(Id = "Go"), b=node:NODE_IDX(Id = "Finish)
MATCH a-[r:GOES_TO*]->b
RETURN a,r,b
Here, NODE_IDX is an index on the nodes (Id).
Where I get stuck is when I want to have all the paths between "Go" and "Finish" that are not GOES_TO relationships but rather multiple GOES_THROUGH-->()-->COMES_BACK_TO relationship combinations (of variable depth).
I do not want to filter out the GOES_TO relationships because there are many more relationships among the nodes, and I do not want to accommodate removing all of them (dynamically). Is it possible to have a variable-depth, multi-relationship MATCH that I envisage?
Thanks!
Let me restate what I believe is being asked.
"If there is a path of the form (a)-[:X]->(b), find all other paths from a to b."
The answer is simple:
MATCH p=(a)-[:X]->(b), q=(a)-[r*]->(b)
WHERE p<>q
RETURN r;

Find all sub-graphes containing at least one node having a certain property

My graph is composed of multiple "sub-graphes" that are disconnected from one another. These sub-graphes are composed of nodes that are connected with a given relation type.
I would like to get (for example) the list of sub-graphes that contain at least one node that has the property "name" equals "John".
It's equivalent to finding one node per subgraph having this property.
One solution would be to find all the nodes having this property and loop through this list to only pick the ones that are not connected to the previously picked ones. But that would be ugly and quite heavy. Is there an elegant way to do that with Cypher?
I'm trying with something along this direction but have no success so far:
START source=node:user('name:"John"')
MATCH source-[r?:KNOWS*]-target
WHERE r is null
RETURN source
Try this one it may help
START source=node:user('name:"John"')
MATCH source-[r:KNOWS]-()-[r2:KNOWS]-target
WHERE NOT(source-[r:KNOWS]-target)
RETURN target

Resources