Cypher: preventing results from duplicating on WITH / sequential querying - neo4j

In a query like this
MATCH (a)
WHERE id(a) = {x}
MATCH (a)-->(b:x)
WITH a, collect(DISTINCT id(b)) AS Bs
MATCH (a)-->(c:y)
RETURN collect(c) + Bs
what I'm trying to do is to gather two sets of nodes that came from different queries, but with this kind of procedure all the b rows get to be returned multiplied by the number of a rows.
How should I deal with this kind of problem that arises from sequential queries?
[Note that the reported query is only a conceptual representation of what I mean. Please don't try to solve the code (that would be trivial) but only the presented problem.]

Your query shouldn't return any cross product since you aggregate in the WITH clause, so there is only one result item/row (the disconnected path a, collect(b)) when the second match begins. It's not clear therefore what the problem is that you want solved–cross products can be solved differently in different cases.
The way your query would work, conceptually speaking, is: match anything related from a, then filter that anything on having label :x. The second leg of the query does the same but filters on label :y. You can therefore combine your queries as
MATCH (a)-->(b)
WHERE id(a) = {x} AND (b:x OR b:y)
RETURN b
Other cases of 'path explosion' can't be solved as easily (sometimes UNION is good, sometimes you can reorder your pattern, sometimes you can do some aggregate-and-reduce to make it happen) , but you'll have to ask about that separately.

How about using UNION for this? See http://docs.neo4j.org/chunked/milestone/query-union.html#union-combine-two-queries-and-remove-duplicates
-brian

Related

Remove automorphisms of a cypher query output

When doing a Cypher query to retrieve a specific subgraph with automorphisms, let's say
MATCH (a)-[:X]-(b)-[:X]-(c),
RETURN a, b, c
It seems that the default behaviour is to return every retrieved subgraph and all their automorphisms.
In that exemple, if (u)-[:X]-(v)-[:X]-(w) is a graph matching the pattern, the output will be u,v,w but also w,v,u, which consist in the same graph.
Is there a way to retrieve each subgraph only once ?
EDIT: It would be great if Cypher have a feature to do that in the search, using some kind of symmetry breaking condition as it would reduce the computing time. If that is not the case, how would you post-process to find the desired output ?
In the query you are making, (a)-[r:X]-(b) and (a)-[t:X]-(c) refer to a similar pattern. Since (b) and (c) can be interchanged. What is the need to repeat matching twice? MATCH (a)-[r:X]-(b) RETURN a, r, b returns all the subgraphs you are looking for.
EDIT
You can do something as follows to find the nodes, which are having two relations of type X.
MATCH (a)-[r:X]-(b) WHERE size((a)-[:X]-()) = 2 RETURN a, r, b
For these kind of mirrored patterns, we can add a restriction on the internal graph ids so only one of the two paths is kept:
MATCH (a)-[:X]-(b)-[:X]-(c)
WHERE id(a) < id(c)
RETURN a, b, c
This will also prevent the case where a = c.

simple match query taking ages

I have a simple query
MATCH (n:TYPE {id:123})<-[:CONNECTION*]<-(m:TYPE) RETURN m
and when executing the query "manually" (i.e. using the browser interface to follow edges) I only get a single node as a result as there are no further connections. Checking this with the query
MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:TYPE)<-[n:CONNECTION]-(o:TYPE) RETURN m,o
shows no results and
MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:TYPE) RETURN m
shows a single node so I have made no mistake doing the query manually.
However, the issue is that the first question takes ages to finish and I do not understand why.
Consequently: What is the reason such trivial query takes so long even though the maximum result would be one?
Bonus: How to fix this issue?
As Tezra mentioned, the variable-length pattern match isn't in the same category as the other two queries you listed because there's no restrictions given on any of the nodes in between n and m, they can be of any type. Given that your query is taking a long time, you likely have a fairly dense graph of :CONNECTION relationships between nodes of different types.
If you want to make sure all nodes in your path are of the same label, you need to add that yourself:
MATCH path = (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE)
WHERE all(node in nodes(path) WHERE node:TYPE)
RETURN m
Alternately you can use APOC Procedures, which has a fairly efficient means of finding connected nodes (and restricting nodes in the path by label):
MATCH (n:TYPE {id:123})
CALL apoc.path.subgraphNodes(n, {labelFilter:'TYPE', relationshipFilter:'<CONNECTION'}) YIELD node
RETURN node
SKIP 1 // to avoid returning `n`
MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:TYPE)<-[n:CONNECTION]-(o:TYPE) RETURN m,o Is not a fair test of MATCH (n:TYPE {id:123})<-[:CONNECTION*]<-(m:TYPE) RETURN m because it excludes the possibility of MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:ANYTHING_ELSE)<-[n:CONNECTION]-(o:TYPE) RETURN m,o.
For your main query, you should be returning DISTINCT results MATCH (n:TYPE {id:123})<-[:CONNECTION*]<-(m:TYPE) RETURN DISTINCT m.
This is for 2 main reasons.
Without distinct, each node needs to be returned the number of times for each possible path to it.
Because of the previous point, that is a lot of extra work for no additional meaningful information.
If you use RETURN DISTINCT, it gives the cypher planner the choice to do a pruning search instead of an exhaustive search.
You can also limit the depth of the exhaustive search using ..# so that it doesn't kill your query if you run against a much older version of Neo4j where the Cypher Planner hasn't learned pruning search yet. Example use MATCH (n:TYPE {id:123})<-[:CONNECTION*..10]<-(m:TYPE) RETURN m

Cypher query doesn't return all the expected nodes

I have this graph:
A<-B->C
B is the root of a tiny tree. There is exactly one relation between A and B, and one between B and C.
When I run the following, one node is returned. Why does this Cypher query not return the A and C nodes?
MATCH(a {name:"A"})<-[]-(rewt)-[]->(c) RETURN c
It would seem to be that the first half of that query would find the root, and the second half would find both child nodes.
Until a few minutes ago, I would have thought it logically identical to the following query which works. What's the difference?
MATCH (a {name:"A"})<-[]-(rewt)
MATCH (rewt)-[]->(c)
RETURN c
EDIT for cybersam
I have abstracted my database so we could discuss my specific issue. Now, we still have a tiny tree, but there are 4 nodes that are children of the root.(Sorry this is different, but I'm developing and don't want to change my environment too much.)
This query returns all 4:
match(a)<-[]-(b:ROOT)-[]->(c) return c
One of them has a name of "dddd"...
match(a {name"dddd"})<-[]-(b:ROOT)-[]->(c) return c
This query only returns three of them. "dddd" is not included. omg.
To answer cybersam's specific question, this query:
MATCH (a {name:"dddd"})<--(rewt:CODE_ROOT)
MATCH (rewt)-->(c)
RETURN a = c;
Returns four rows. The values are true, false, false, false
[UPDATED]
There is a difference between your 2 queries. A MATCH clause will filter out all duplicate relationships.
Therefore, your first query would filter out all matches where the left-side relationship is the same as the right-side relationship:
MATCH(a {name:"A"})<--(rewt)-->(c)
RETURN c;
Your second query would allow the 2 relationships to be the same, since the relationships are found by 2 separate MATCH clauses:
MATCH (a {name:"A"})<--(rewt)
MATCH (rewt)-->(c)
RETURN c;
If I am right, then the following query should return N rows (where N is the number of outgoing relationships from rewt) and only one value should be true:
MATCH (a {name:"A"})<--(rewt)
MATCH (rewt)-->(c)
RETURN a = c;
Both work just fine for me. I've tried on 2.3.0 Community.
Do you mind posting your CREATE command ?
In each MATCH clause, each relationship will be matched only once. See http://neo4j.com/docs/stable/cypherdoc-uniqueness.html for reference.
See this related question as well: What does a comma in a Cypher query do?

How to find other relations of a node in a query in Cypher

My graph is like this:
a-[sends]->b-[sends]->d
c-[sends]->d
a-[hostedOn]->S1
a-[hostedOn]->S3
b-[hostedOn]->S1
b-[hostedOn]->S2
I have queries which filter on the property of "sends" relationship and returning the desired results. Now I also want that in the same query if I can also ask it to return "hostedOn" as well. Say, my output is b-[sends]->d, how can I also have in the same output b-[hostedOn]->S1 & S2? b & d will change every time depending on the filters applied on "sends" relation.
Here is a possible solution, given the very little information provided. Many solutions are possible, depending on exactly what you need to be returned and if you want any aggregation.
MATCH (a)-[r:sends]->(b)
WHERE r.foo = "bar"
MATCH (a)-[r1:hostedOn]->(s1), (b)-[r2:hostedOn]->(s2)
RETURN a, r, b, r1, s1, r2, s2;
This query assumes all a and b nodes must also have :hostedOn relationships, so there are no OPTIONAL MATCH clauses.

In neo4j is there a way to get path between more than 2 random nodes whose direction of relation is not known

I have a scenario where I have more than 2 random nodes.
I need to get all possible paths connecting all three nodes. I do not know the direction of relation and the relationship type.
Example : I have in the graph database with three nodes person->Purchase->Product.
I need to get the path connecting these three nodes. But I do not know the order in which I need to query, for example if I give the query as person-Product-Purchase, it will return no rows as the order is incorrect.
So in this case how should I frame the query?
In a nutshell I need to find the path between more than two nodes where the match clause may be mentioned in what ever order the user knows.
You could list all of the nodes in multiple bound identifiers in the start, and then your match would find the ones that match, in any order. And you could do this for N items, if needed. For example, here is a query for 3 items:
start a=node:node_auto_index('name:(person product purchase)'),
b=node:node_auto_index('name:(person product purchase)'),
c=node:node_auto_index('name:(person product purchase)')
match p=a-->b-->c
return p;
http://console.neo4j.org/r/tbwu2d
I actually just made a blog post about how start works, which might help:
http://wes.skeweredrook.com/cypher-it-all-starts-with-the-start/
Wouldn't be acceptable to make several queries ? In your case you'd automatically generate 6 queries with all the possible combinations (factorial on the number of variables)
A possible solution would be to first get three sets of nodes (s,m,e). These sets may be the same as in the question (or contain partially or completely different nodes). The sets are important, because starting, middle and end node are not fixed.
Here is the code for the Matrix example with added nodes.
match (s) where s.name in ["Oracle", "Neo", "Cypher"]
match (m) where m.name in ["Oracle", "Neo", "Cypher"] and s <> m
match (e) where e.name in ["Oracle", "Neo", "Cypher"] and s <> e and m <> e
match rel=(s)-[r1*1..]-(m)-[r2*1..]-(e)
return s, r1, m, r2, e, rel;
The additional where clause makes sure the same node is not used twice in one result row.
The relations are matched with one or more edges (*1..) or hops between the nodes s and m or m and e respectively and disregarding the directions.
Note that cypher 3 syntax is used here.

Resources