Deleting duplicate relationships in Neo4j - is this correct?

I have developed a query which, by trial and error, appears to find all of the duplicated relationships in a Neo4j DB. I want to delete all but one of each set of duplicates, but I'm concerned that I have overlooked problematic cases that could result in unintended data deletion.
So, does this query delete all but one of each set of duplicated relationships?
MATCH (a)-->(b)<--(a) // identify where the duplication is present
WITH DISTINCT a, b
MATCH (a)-[r]->(b) // get all duplicated relationships themselves
WITH a, b, collect(r)[1..] as rs // remove the first instance from the list
UNWIND rs as r
DELETE r
If I replace the final "UNWIND rs as r DELETE r" with "WITH a, b, count(rs) as cnt RETURN cnt", it seems to return the unnecessary relationships.
I'm still reluctant to put this somewhere to be used by others, though....
Thanks

First of all, let me (strictly) define the term: "duplicate relationships". Two relationships are duplicates if they:
1. Connect the same pair of nodes (call them a and b)
2. Have the same relationship type
3. Have exactly the same set of properties (both names and values)
4. Have the same directionality between a and b (iff directionality is significant for the use case)
Your query only considers #1 and #4, so it generally could delete non-duplicate relationships as well.
Here is a query that will take all of the above into consideration (assuming #4 should be included):
MATCH (a)-[r1]->(b)<-[r2]-(a)
WHERE TYPE(r1) = TYPE(r2) AND PROPERTIES(r1) = PROPERTIES(r2)
WITH a, b, apoc.coll.union(COLLECT(r1), COLLECT(r2))[1..] AS rs
UNWIND rs as r
DELETE r
Aggregating functions (like COLLECT) use non-aggregated terms as grouping keys, so there is no need for the query to perform a separate redundant DISTINCT a,b test.
The APOC function apoc.coll.union returns the distinct union of its 2 input lists.
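To make the four criteria concrete, here is a minimal sketch in plain Python (not Cypher, and not part of the original answer). The Rel tuple, make_rel and surplus_duplicates are hypothetical names used only for illustration. It groups relationships by a key and marks everything after the first member of each group for deletion, once with a key covering all four criteria and once with a key covering only the node pair and direction.

```python
from collections import defaultdict
from typing import NamedTuple


class Rel(NamedTuple):
    """Hypothetical stand-in for a Neo4j relationship."""
    rel_id: int
    start: str
    end: str
    rel_type: str
    props: tuple  # sorted (key, value) pairs so the value is hashable


def make_rel(rel_id, start, end, rel_type, **props):
    return Rel(rel_id, start, end, rel_type, tuple(sorted(props.items())))


def surplus_duplicates(rels, key):
    """Group rels by key(rel); return ids of all but the first in each group."""
    groups = defaultdict(list)
    for rel in rels:
        groups[key(rel)].append(rel)
    return sorted(r.rel_id for group in groups.values() for r in group[1:])


def strict_key(rel):
    # Criteria 1-4: same endpoints, same direction, same type, same properties.
    return (rel.start, rel.end, rel.rel_type, rel.props)


def pair_key(rel):
    # Criteria 1 and 4 only: what grouping on (a, b) alone amounts to.
    return (rel.start, rel.end)


rels = [
    make_rel(1, "a", "b", "KNOWS", since=2010),
    make_rel(2, "a", "b", "KNOWS", since=2010),  # true duplicate of 1
    make_rel(3, "a", "b", "LIKES"),               # different type, not a duplicate
    make_rel(4, "b", "a", "KNOWS", since=2010),  # opposite direction
]

print(surplus_duplicates(rels, strict_key))  # [2]
print(surplus_duplicates(rels, pair_key))    # [2, 3]
```

The second result shows the non-duplicate LIKES relationship being marked for deletion when only the node pair is used as the key. The same reasoning suggests that any query aggregating on a and b alone could merge two separate sets of duplicates between the same nodes into one list, so it may be worth including the type and properties in the grouping key and checking the result on a copy of the data first.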

Related

Cypher Query: Related Nodes Distinct

I am trying to write a query to return all related nodes related by "IMMEDIATE_FAMILY_MEMBER"
This is my query so far
MATCH (f:NaturalPerson)-[r:IMMEDIATE_FAMILY_MEMBER*1..6]-(t)
WHERE f.Name="Jacob"
RETURN f AS fromNode, t AS toNode, r AS Metadata
Initially I thought it worked quite well, but as soon as I added the child Thuthukile (parents Jacob and Nkosazana) I got "duplicate" results.
At the moment the query returns a pair of related nodes and all the relationships that were traversed to link them together (i.e., the metadata).
How do I change this query so that it returns each distinct pair of nodes with the shortest path (all the relationships) between them?
As an extra question, is it possible to specify an OR for the type of the relationship itself? I.e., the same query, but also including the :KNOWS relationship.
Edit:
cybersam's answer was correct; I made one small change to get the result I wanted.
My final query was this:
MATCH (f:NaturalPerson)-[r:IMMEDIATE_FAMILY_MEMBER*..6]-(t)
WHERE f.Name="Jacob" AND t.Name<>"Jacob"
WITH f, t, r
ORDER BY SIZE(r)
RETURN f AS fromNode, t AS toNode, COLLECT(r)[0] AS Metadata
I needed to exclude the "from person" as a destination, as I was not interested in the shortest path back to the parent.
Aside: why is it NaturalPerson? Are there "unnatural" people in the DB as well?
This should work:
MATCH (f:NaturalPerson)-[r:IMMEDIATE_FAMILY_MEMBER*..6]-(t)
WHERE f.Name="Jacob"
WITH f, t, r
ORDER BY SIZE(r)
RETURN f AS fromNode, t AS toNode, COLLECT(r)[0] AS Metadata
The query sorts by the length of the paths found, uses the aggregating function COLLECT to get a list of all the r paths for a given pair of f and t values, and uses the first (i.e., the shortest) path in each list.
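The sort-then-take-first idea can be modelled outside the database. The following is a small Python sketch, not the answer's own code: shortest_per_pair is a hypothetical helper, and the rows are made-up stand-ins for the (f, t, r) rows that the MATCH would produce, with each path written as a list of relationship labels.

```python
def shortest_per_pair(rows):
    """rows: iterable of (from_node, to_node, path), path being a list.

    Mirrors ORDER BY SIZE(r) followed by COLLECT(r)[0] per (f, t) pair.
    """
    best = {}
    for f, t, path in sorted(rows, key=lambda row: len(row[2])):
        # Rows arrive sorted by length, so the first path seen for a
        # pair is the shortest one; later paths are ignored.
        best.setdefault((f, t), path)
    return best


# Hypothetical rows: Thuthukile is reachable directly and via Nkosazana.
rows = [
    ("Jacob", "Thuthukile", ["Jacob-Nkosazana", "Nkosazana-Thuthukile"]),
    ("Jacob", "Thuthukile", ["Jacob-Thuthukile"]),
    ("Jacob", "Nkosazana", ["Jacob-Nkosazana"]),
]

result = shortest_per_pair(rows)
print(result[("Jacob", "Thuthukile")])  # ['Jacob-Thuthukile']
```

Each pair of nodes ends up with exactly one entry, which is why the "duplicate" rows disappear.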

Filtering out nodes on two cypher paths

I have a simplified Neo4j graph (old version 2.x), as shown in the image, with 'define' and 'same' edges. Assume the number on each define edge is a property of that edge.
The queries I would like to run are:
1) Find nodes defined by both A and B -- Required result: C, C, D
START A=node(885), B=node(996) MATCH (A-[:define]->(x)<-[:define]-B) RETURN DISTINCT x
The above works and returns C and D. But I want C twice, since it's defined twice. Without the DISTINCT on x, however, it returns all the paths from A to B.
2) Find nodes that are NOT (defined by both A,B OR are defined by both A,B but connected via a same edge) -- Required result: G
Something like:
R1: MATCH (A-[:define]->(x)<-[:define]-B) RETURN DISTINCT x
R2: MATCH (A-[:define]->(e)-(:similar)-(f)<-[:define]-B) RETURN e,f
(Nodes defined by A - (R1+R2) )
3) Find 'middle' nodes that do not have matching calls from both A and B -- Required result: C, G
I want to output C due to the 1 define(either 45/46) that does not have a matching define from B.
Also output G because there's no define to G from B.
Appreciate any help on this!
Your syntax is a bit strange to me, so I'm going to assume you're using an older version of Neo4j. We should be able to use the same approaches, though.
For #1, your proposed match without DISTINCT really should work. The only thing I can see to change is adding the missing parentheses around the A and B node variables.
START A=node(885), B=node(996)
MATCH (A)-[:define]->(x)<-[:define]-(B)
RETURN x
Also, I'm not sure what you mean by "returns all paths from A to B." Can you clarify that, and provide an example of the output?
As for #2, we'll need several parts to this query, separated with WITH accordingly.
START A=node(885), B=node(996)
MATCH (A)-[:define]->(x)<-[:define]-(B)
WITH A, B, COLLECT(DISTINCT x) as exceptions
OPTIONAL MATCH (A)-[:define]->(x)-[:same]-(y)<-[:define]-(B)
WHERE x NOT IN exceptions AND y NOT IN exceptions
WITH A, B, exceptions + COLLECT(DISTINCT x) + COLLECT(DISTINCT y) as allExceptions
MATCH (aNode)
WHERE aNode NOT IN allExceptions AND aNode <> A AND aNode <> B
RETURN aNode
Also, you should really be using labels on your nodes. Without them, the final MATCH will match every node in your graph and then have to filter them down.
EDIT
Regarding your #3 requirement, the SIZE() function will be very helpful here: applied to a pattern, it returns the number of occurrences of that pattern.
The approach is to first collect the nodes defined by A or B, then keep only those for which the number of :define relationships from A is not equal to the number of :define relationships from B.
Ideally we would take the nodes defined by A, UNION them with the nodes defined by B, and continue processing the combined result. However, Neo4j's UNION support is weak right now: it doesn't let you perform any additional operations after the UNION. Instead, we have to add both sets of nodes to the same collection and then unwind them back into rows.
START A=node(885), B=node(996)
MATCH (A)-[:define]->(x)
WITH A, B, COLLECT(x) as middleNodes
MATCH (B)-[:define]->(x)
WITH A, B, middleNodes + COLLECT(x) as allMiddles
UNWIND allMiddles as middle
WITH DISTINCT A, B, middle
WHERE SIZE((A)-[:define]->(middle)) <> SIZE((B)-[:define]->(middle))
RETURN middle
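The count comparison in this query can be checked by hand with a small Python sketch. This is an illustration only: the edge list is hypothetical (the original image is not available, and the numbers on B's edges are made up), chosen to match the description that A defines C twice (45 and 46) and that B has no define edge to G.

```python
from collections import Counter

# Hypothetical edge list: (definer, defined, number on the edge).
define_edges = [
    ("A", "C", 45),
    ("A", "C", 46),
    ("A", "D", 47),
    ("A", "G", 48),
    ("B", "C", 50),
    ("B", "D", 51),
]


def unmatched_middles(edges, left, right):
    """Return nodes whose number of define edges from left and right differ."""
    left_counts = Counter(dst for src, dst, _ in edges if src == left)
    right_counts = Counter(dst for src, dst, _ in edges if src == right)
    # Union of both sets of defined nodes, like the collect + UNWIND + DISTINCT step.
    middles = set(left_counts) | set(right_counts)
    # A Counter returns 0 for a missing key, so G (never defined by B) compares 1 vs 0.
    return sorted(m for m in middles if left_counts[m] != right_counts[m])


print(unmatched_middles(define_edges, "A", "B"))  # ['C', 'G']
```

With this data, C is reported because A has two define edges to it and B only one, and G is reported because B has none.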

Cypher query that will return only 1 relation of each type between two nodes

How can I craft a query that will return only one relation of a certain type between two nodes?
For example:
MATCH (a)-[r:InteractsWith*..5]->(b) RETURN a,r,b
Because (a) may have interacted with (b) many times, the result will contain many relations between the two. However, the relations are not identical. They have different properties because they occurred at different points in time.
But what if you're only interested in the fact that they have interacted at least once?
Instead of the result as it appears currently I'd like to receive a result that has either:
Only one random relation from the set of relations between (a) and (b)
Only those relations that fit to some criteria (e.g. "newest" or one of each type, ...)
One approach I have thought of is creating new relations of the type "hasEverInteractedWith". But there should be another way, right?
Use shortestPath() to get the quickest single result.
MATCH (a)-[:InteractsWith*..5]->(b)
WITH DISTINCT a, b
MATCH p = shortestPath((a)-[:InteractsWith*..5]->(b))
RETURN a, b, RELATIONSHIPS(p) AS r
If you want to get a specific one, you'll have to get all of the r and then filter them down, which will be slower (but provide more context).
MATCH (a)-[r:InteractsWith*..5]->(b)
WITH a, b, COLLECT(r) AS rs
RETURN a, b, REDUCE(s = HEAD(rs), r IN TAIL(rs)|CASE WHEN s.date > r.date THEN s ELSE r END)
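The REDUCE expression is a fold that keeps the "winner" of each pairwise comparison. Below is a minimal Python sketch of the same fold, under two assumptions that are not stated in the answer: each element of rs is a single relationship (with a variable-length pattern such as *..5, r is bound to a list of relationships, so the property access would have to be adapted), and date is a comparable property such as an ISO date string. The data is made up.

```python
from functools import reduce

# Hypothetical single-hop relationships between one pair (a, b).
rs = [
    {"id": 1, "date": "2015-03-01"},
    {"id": 2, "date": "2016-07-15"},
    {"id": 3, "date": "2014-11-30"},
]

# Same shape as REDUCE(s = HEAD(rs), r IN TAIL(rs) | CASE ... END):
# start from the first element and keep whichever of (s, r) has the later date.
newest = reduce(lambda s, r: s if s["date"] > r["date"] else r, rs[1:], rs[0])

print(newest["id"])  # 2
```

Swapping the comparison, or comparing on a different property, selects a different representative relationship for each pair.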

How to find other relations of a node in a query in Cypher

My graph is like this:
a-[sends]->b-[sends]->d
c-[sends]->d
a-[hostedOn]->S1
a-[hostedOn]->S3
b-[hostedOn]->S1
b-[hostedOn]->S2
I have queries that filter on a property of the "sends" relationship and return the desired results. Now I would also like the same query to return the "hostedOn" relationships. Say my output is b-[sends]->d; how can I also get b-[hostedOn]->S1 and b-[hostedOn]->S2 in the same output? b and d will change every time, depending on the filters applied to the "sends" relationship.
Here is a possible solution, given the very little information provided. Many solutions are possible, depending on exactly what you need to be returned and if you want any aggregation.
MATCH (a)-[r:sends]->(b)
WHERE r.foo = "bar"
MATCH (a)-[r1:hostedOn]->(s1), (b)-[r2:hostedOn]->(s2)
RETURN a, r, b, r1, s1, r2, s2;
This query assumes all a and b nodes must also have :hostedOn relationships, so there are no OPTIONAL MATCH clauses.

Cypher: preventing results from duplicating on WITH / sequential querying

In a query like this
MATCH (a)
WHERE id(a) = {x}
MATCH (a)-->(b:x)
WITH a, collect(DISTINCT id(b)) AS Bs
MATCH (a)-->(c:y)
RETURN collect(c) + Bs
what I'm trying to do is to gather two sets of nodes that came from different queries, but with this kind of procedure all the b rows get to be returned multiplied by the number of a rows.
How should I deal with this kind of problem that arises from sequential queries?
[Note that the reported query is only a conceptual representation of what I mean. Please don't try to solve the code (that would be trivial) but only the presented problem.]
Your query shouldn't return any cross product, since you aggregate in the WITH clause: there is only one result row (the disconnected path a, collect(b)) when the second MATCH begins. It's therefore not clear what problem you want solved. Cross products can be solved differently in different cases.
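The effect of aggregating before the second MATCH can be illustrated by counting rows. This is only a sketch in Python with hypothetical neighbour lists for a single node a; it models the row counts, not how Neo4j actually executes the query.

```python
from itertools import product

# Hypothetical neighbours of a single node a, split by label.
b_ids = [11, 12, 13]    # ids of neighbours labelled :x
c_nodes = ["c1", "c2"]  # neighbours labelled :y

# Without aggregation, the second MATCH runs once per existing row,
# so every b row is repeated for every c match.
rows_without_agg = list(product(b_ids, c_nodes))

# With collect() in WITH, the b rows collapse into one row carrying a list,
# so the second MATCH yields only one row per c match.
rows_with_agg = [(b_ids, c) for c in c_nodes]

# Aggregating again in RETURN gives a single combined list.
combined = [c for _, c in rows_with_agg] + b_ids

print(len(rows_without_agg))  # 6
print(len(rows_with_agg))     # 2
print(combined)               # ['c1', 'c2', 11, 12, 13]
```

If duplicated rows still show up in a real query, a likely place to look is a MATCH that runs before any aggregation has reduced the row count.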
The way your query would work, conceptually speaking, is: match anything related from a, then filter that anything on having label :x. The second leg of the query does the same but filters on label :y. You can therefore combine your queries as
MATCH (a)-->(b)
WHERE id(a) = {x} AND (b:x OR b:y)
RETURN b
Other cases of 'path explosion' can't be solved as easily (sometimes UNION is good, sometimes you can reorder your pattern, sometimes you can do some aggregate-and-reduce to make it happen), but you'll have to ask about those separately.
How about using UNION for this? See http://docs.neo4j.org/chunked/milestone/query-union.html#union-combine-two-queries-and-remove-duplicates
-brian
