Multiple relationships in Match Cypher - neo4j

I am trying to find similar movies on the basis of tags. I also need all the tags for the given movie and for each of its similar movies (to do some calculations). Surprisingly, collect(h.w) returns repeated values of h.w (where w is a property of h).
Here is the Cypher query. Please help.
MATCH (m:Movie{id:1})-[h1:Has]->(t:Tag)<-[h2:Has]-(sm:Movie),
WHERE m <> sm
RETURN distinct(sm), collect(h.w)
Basically a query like
MATCH (x)-[h]->(y), (a)-[H]->(b)
is returning each result for h n times where n is the number of results for H. Any way around this?

I replicated the data model for this question to help answer it.
I then set up a sample dataset using Neo4j's online console:
Running the following query from your question:
MATCH (m:Movie { title: "The Matrix" })-[h1:HAS_TAG]->(t:Tag),
WHERE m <> sm
RETURN DISTINCT sm, collect(h.weight)
Which results in:
(1:Movie {title:"The Matrix: Reloaded"}) [0.31, 0.12, 0.31, 0.12, 0.31, 0.01, 0.31, 0.01]
The issue is that there are duplicate relationships being returned, which results in duplicated weight in the collection. The solution is to use WITH to limit relationships to distinct records and then return the collection of weights of those relationships.
MATCH (m:Movie { title: "The Matrix" })-[h1:HAS_TAG]->(t:Tag),
WHERE m <> sm
RETURN sm, collect(h.weight)
(1:Movie {title:"The Matrix: Reloaded"}) [0.31, 0.12, 0.01]
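The effect of limiting the rows to distinct relationships before collecting can be imitated outside the database. This is only a sketch in Python with made-up relationship ids (10, 11, 12 are assumptions, not values from the console dataset). It shows that deduplicating on the (movie, relationship) pair, rather than on the weight value, turns the eight collected weights into three.

```python
# Hypothetical result rows: (similar movie, relationship id, weight).
# The relationship ids are invented for illustration.
rows = [
    ("The Matrix: Reloaded", 10, 0.31), ("The Matrix: Reloaded", 11, 0.12),
    ("The Matrix: Reloaded", 10, 0.31), ("The Matrix: Reloaded", 11, 0.12),
    ("The Matrix: Reloaded", 10, 0.31), ("The Matrix: Reloaded", 12, 0.01),
    ("The Matrix: Reloaded", 10, 0.31), ("The Matrix: Reloaded", 12, 0.01),
]

# Keep the first occurrence of each (movie, relationship) row.
# dict.fromkeys preserves insertion order.
distinct_rows = list(dict.fromkeys(rows))

# Collect the weights per movie from the deduplicated rows.
weights = {}
for movie, _rel_id, weight in distinct_rows:
    weights.setdefault(movie, []).append(weight)

print(weights)
```

Deduplicating on the relationship keeps two different relationships that happen to share the same weight, which deduplicating on the weight alone would not.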

I'm afraid I still don't quite get your intention, but about the general question of duplicate results, that is just the way a disconnected pattern works. Cypher must consider something like
(:A), (:B)
as one pattern, not two. That means that any satisfying graph structure is considered a distinct match. Suppose you have the graph resulting from
CREATE (:A), (:B), (:B)
and query it for the pattern above, you get two results, namely
neo4j-sh (?)$ MATCH (a:A),(b:B) RETURN *;
==> +-------------------------------+
==> | a | b |
==> +-------------------------------+
==> | Node[15204]{} | Node[15207]{} |
==> | Node[15204]{} | Node[15208]{} |
==> +-------------------------------+
==> 2 rows
==> 53 ms
Similarly, when matching your pattern (x)-[h]->(y), (a)-[H]->(b), Cypher considers each combination of the two pattern parts to be a unique match for the whole pattern, so the results for h are multiplied by the results for H.
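As a rough illustration of that multiplication (a Python sketch with invented match names, not output from Neo4j), a disconnected pattern behaves like a Cartesian product of the matches of its parts:

```python
from itertools import product

# Hypothetical matches for each disconnected pattern part.
h_matches = ["h1", "h2"]        # results of (x)-[h]->(y)
H_matches = ["H1", "H2", "H3"]  # results of (a)-[H]->(b)

# Every combination of the two parts counts as one match of the whole pattern.
rows = list(product(h_matches, H_matches))

# Collecting h over those rows repeats each h once per H match.
collected_h = [h for h, _ in rows]

print(len(rows))
print(collected_h)
```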
This is the way pattern matching works. To achieve what you want, first consider whether you really need to query for a disconnected pattern. If you do, or if a connected pattern also generates redundant matches, then aggregate one or more of the pattern parts. A simple case might be
CREATE (a:A), (b1:B), (b2:B)
, (c1:C), (c2:C), (c3:C)
, (a)-[:X]->(b1), (a)-[:X]->(b2)
, (a)-[:Y]->(c1), (a)-[:Y]->(c2), (a)-[:Y]->(c3)
queried with
MATCH (b:B)<-[:X]-(a:A)-[:Y]->(c:C) // with 1 (a), 2 (b) and 3 (c) you get 6 matched paths
RETURN a, collect (b) as bb, collect (c) as cc // after aggregation by (a) there is one path
Sometimes it makes sense to do the aggregation as an intermediate step
MATCH (b)<-[:X]-(a:A) // 2 paths
WITH a, collect(b) as bb // 1 path
MATCH (a)-[:Y]->(c) // 3 paths
RETURN a, bb, collect(c) as cc // 1 path
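The difference between the two queries can be seen on a toy copy of that graph. This is a minimal Python sketch (the edge lists are a hand-written stand-in for the CREATE statement above, not a database call): the one-step match yields six rows and a collection of b with repeats, while aggregating each leg separately keeps every node once.

```python
# Toy version of the graph: one a, two b, three c.
x_edges = [("a", "b1"), ("a", "b2")]
y_edges = [("a", "c1"), ("a", "c2"), ("a", "c3")]

# One-step match: every (b, c) combination for the same a is a row.
rows = [(a, b, c) for a, b in x_edges for a2, c in y_edges if a == a2]
bb_one_step = [b for _, b, _ in rows]  # each b appears three times

# Two-step approach: aggregate b per a first, then aggregate c per a.
bb_by_a = {}
for a, b in x_edges:
    bb_by_a.setdefault(a, []).append(b)

cc_by_a = {}
for a, c in y_edges:
    cc_by_a.setdefault(a, []).append(c)

print(len(rows), bb_one_step)
print(bb_by_a, cc_by_a)
```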


Pattern Matching in Neo4j

Assume that in an application, the user gives us a graph and we want to consider it as a pattern and find all occurrences of the pattern in the neo4j database. If we knew what the pattern is, we could write the pattern as a Cypher query and run it against our database. However, now we do not know what the pattern is beforehand and receive it from the user in the form of a graph. How can we perform a pattern matching on the database based on the given graph (pattern)? Is there any apoc for that? Any external library?
One way of doing this is to decompose your input graph into edges and create a dynamic cypher from it. I have worked on this quite some time ago, and the solution below is not perfect but indicates a possible direction.
For example, if you feed it a graph and take the id(node) values from that graph (I am not taking the relationship ids; this is one of the imperfections),
this query
WITH $nodeids AS selection
UNWIND selection AS s
WITH selection,
SPLIT(left('a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z',SIZE(selection)*2-1),",") AS nodeletters
WITH selection,
REDUCE (acc="", nl in nodeletters |
CASE acc
WHEN "" THEN acc+nl
ELSE acc+','+nl
END) AS rtnnodes
MATCH (n) WHERE id(n) IN selection
WITH COLLECT(n) AS nodes,selection,nodeletters,rtnnodes
UNWIND nodes AS n
UNWIND nodes AS m
MATCH (n)-[r]->(m)
+nodeletters[REDUCE(x=[-1,0], i IN selection | CASE WHEN i = id(n) THEN [x[1], x[1]+1] ELSE [x[0], x[1]+1] END)[0]]
+TRIM(REDUCE(acc = '', p IN labels(n)| acc + ':'+ p))+")-[:"+type(r)+"]->("
+ nodeletters[REDUCE(x=[-1,0], i IN selection | CASE WHEN i = id(m) THEN [x[1], x[1]+1] ELSE [x[0], x[1]+1] END)[0]]
+TRIM(REDUCE(acc = '', p IN labels(m)| acc + ':'+ p))+")" as z,rtnnodes
WITH COLLECT(z) AS parts,rtnnodes
WITH REDUCE(y=[], x in range(0, size(parts)-1) | y + replace(parts[x],"[","[r" + (x+1))) AS parts2,
REDUCE (acc="", x in range(0, size(parts)-1) | CASE acc WHEN "" THEN acc+"r"+(x+1) ELSE acc+",r"+(x+1) END) AS rtnrels,
REDUCE (acc="MATCH ",p in parts2 |
CASE acc
ELSE acc+','+p
" LIMIT "+{limit}
AS cypher
returns something like
cypher: "MATCH (a:Person)-[r1:DRIVES]->(b:Car),(a:Person)-[r2:KNOWS]->(c:Person) RETURN a,b,c,r1,r2 LIMIT 50"
which you can feed to the next query.
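The same decomposition can also be done on the application side before the query is sent. The following is a minimal Python sketch, not the author's implementation: the function name build_match and the edge tuple layout are assumptions made for illustration, it handles one label per node and at most 26 nodes, and it concatenates labels and relationship types without escaping, so it should only be fed trusted input.

```python
import string


def build_match(edges, limit=50):
    """Build a Cypher MATCH from edges given as
    (src_id, src_label, rel_type, dst_id, dst_label) tuples (hypothetical layout)."""
    letters = {}  # node id -> variable name, in order of first appearance

    def var(node_id):
        if node_id not in letters:
            letters[node_id] = string.ascii_lowercase[len(letters)]
        return letters[node_id]

    parts = []
    for i, (src, src_label, rel, dst, dst_label) in enumerate(edges, start=1):
        # One pattern part per edge; relationships are named r1, r2, ...
        parts.append(
            f"({var(src)}:{src_label})-[r{i}:{rel}]->({var(dst)}:{dst_label})"
        )

    returns = list(letters.values()) + [f"r{i}" for i in range(1, len(edges) + 1)]
    return f"MATCH {','.join(parts)} RETURN {','.join(returns)} LIMIT {limit}"


edges = [(1, "Person", "DRIVES", 2, "Car"), (1, "Person", "KNOWS", 3, "Person")]
query = build_match(edges)
print(query)
```

For the two edges above this produces the same string as the example result shown earlier.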
In Graphileon, you can just select the nodes, and the result will be visualized as well.
Disclosure : I work for Graphileon
I have used patterns in genealogy queries.
The X-chromosome is not transmitted from father to son. As you traverse a family tree you can use the reduce function to create a concatenated string of the sex of the ancestor. You can then accept results that lack MM (father-son). This query gives all the descendants inheriting the ancestor's (RN=32) X-chromosome.
match p=(n:Person{RN:32})<-[:father|mother*..99]-(m)
with m, reduce(status ='', q IN nodes(p)| status + AS c
where c=replace(c,'MM','')
return distinct m.fullname as Fullname
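The filter c = replace(c, 'MM', '') is simply a test that the string contains no 'MM'. A small Python sketch of the same check, assuming the concatenated string holds one 'M' or 'F' per person from ancestor to descendant (the function name is hypothetical):

```python
def inherits_x(sex_chain: str) -> bool:
    # Same test as c = replace(c, 'MM', ''): the string is unchanged
    # only when it contains no father-son step ('MM').
    return sex_chain == sex_chain.replace("MM", "")


print(inherits_x("MFM"))  # father -> daughter -> son
print(inherits_x("MMF"))  # father -> son -> daughter
```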
I am developing other pattern specific queries as part of a Neo4j PlugIn for genealogy. These will include patterns of triangulation groups.
GitHub repository for Neo4j Genealogy PlugIn

Efficiently getting relationship histogram for a set of nodes

I want to create a histogram of the relationships starting from a set of nodes.
Input is a set of node ids, for example set = [ id_0, id_1, id_2, id_3, ... id_n ].
The output is the relationship type histogram for each node (e.g. Map<Long, Map<String, Long>>):
- ACTED_IN: 14
- WROTE: 5
The current cypher query I've written is:
MATCH (n)-[r]-()
WHERE id(n) IN [ id_0, id_1, id_2, id_3, ... id_n ] // set
RETURN id(n) as id, type(r) as type, count(r) as count
It returns the pair of [ id, type ] count like:
id | rel type | count
id0 | ACTED_IN | 14
id0 | DIRECTED | 1
id1 | DIRECTED | 12
id1 | WROTE | 5
id1 | ACTED_IN | 2
The result is collected using java and merged to the first structure (e.g. Map<Long, Map<String, Long>>).
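For reference, the merge step is nothing more than grouping the rows by id. A minimal sketch of it, written in Python for brevity rather than Java and using the placeholder ids from the table above:

```python
from collections import defaultdict

# Rows as returned by the query: (id, rel type, count).
rows = [
    ("id0", "ACTED_IN", 14),
    ("id0", "DIRECTED", 1),
    ("id1", "DIRECTED", 12),
    ("id1", "WROTE", 5),
    ("id1", "ACTED_IN", 2),
]

# Nested map: id -> {rel type -> count}.
histogram = defaultdict(dict)
for node_id, rel_type, count in rows:
    histogram[node_id][rel_type] = count

print(dict(histogram))
```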
Getting the relationship histogram on smaller graphs is fast but can be very slow on bigger datasets. For example, when I create the histogram for a set of about 100 ids/nodes, where each of those nodes has around 1000 relationships, the Cypher query takes about 5 minutes to execute.
Is there a more efficient way to collect the histogram for a set of nodes?
Could this query be parallelized? (With java code or using UNION?)
Is something wrong with how I set up my neo4j database, should these queries be this slow?
There is no need for parallel queries, just the need to understand Cypher efficiency and how to use statistics.
A bit of background:
Using count will execute an expandAll, whose cost grows with the number of relationships a node has
MATCH (n) WHERE id(n) = 21
MATCH (n)-[r]-(x)
RETURN n, type(r), count(*)
Using size with a relationship type internally uses getDegree, which is a statistic a node stores locally, and is thus very efficient
MATCH (n) WHERE id(n) = 0
RETURN n, size((n)-[:SEARCH_RESULT]-())
Moral of the story: to use size you need to know the relationship types a labeled node can have. So you need to know the schema of the database (in general you will want that; it makes things easily predictable, and building efficient queries dynamically becomes a joy).
But let's assume you don't know the schema, you can use APOC cypher procedures, allowing you to build dynamic queries.
The flow is :
Get all the relationship types from the database ( fast )
Get the nodes from id list ( fast )
Build dynamic queries using size ( fast )
CALL db.relationshipTypes() YIELD relationshipType
WITH collect(relationshipType) AS types
MATCH (n) WHERE id(n) IN [21, 0]
UNWIND types AS type
CALL"RETURN size((n)-[:`" + type + "`]-()) AS count", {n: n})
YIELD value
RETURN id(n), type, value.count
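The cost difference between the two approaches can be pictured with a toy in-memory store. This Python sketch is only an analogy: the rels and degree dictionaries are invented stand-ins, not Neo4j internals. Counting by expansion touches every relationship, while the degree lookup does one read per node and relationship type, and both give the same histogram.

```python
from collections import Counter

# Toy store: relationship types per node (hypothetical data).
rels = {21: ["ACTED_IN"] * 3 + ["WROTE"], 0: ["DIRECTED"] * 2}

# Stand-in for a per-node, per-type degree kept alongside the node.
degree = {node: Counter(types) for node, types in rels.items()}
all_types = sorted({t for types in rels.values() for t in types})


def histogram_by_expand(node_ids):
    # Visits every relationship: work grows with the relationship count.
    return {n: dict(Counter(rels[n])) for n in node_ids}


def histogram_by_degree(node_ids):
    # One lookup per (node, type): work grows with the number of types.
    return {
        n: {t: degree[n][t] for t in all_types if degree[n][t] > 0}
        for n in node_ids
    }


print(histogram_by_expand([21, 0]))
print(histogram_by_degree([21, 0]))
```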

neo4j get random path from known node

I have a big neo4j db with info about celebs. All of them have relations with many others: they are linked, dated, or married to each other. I need to get a random path from one celeb with a defined number of relations (5). I don't care who is in this chain; the only condition is that no celeb is repeated in the chain.
To be more clear: I need to get "new" chain after each query, for example:
I try to get chain started with Rita Ora
She has relations with Drake, Jay Z and Justin Bieber
Query takes random from these guys, for example Jay Z
Then Query takes relations of Jay Z: Karrine Steffans, Rosario Dawson and Rita Ora
Query can't take Rita Ora cuz she is already in chain, so it takes random from others two, for example Rosario Dawson
And at the end we should have a chain Rita Ora - Jay Z - Rosario Dawson - other celeb - other celeb 2
Is that possible to do it by query?
This is doable in Cypher, but it's quite tricky. You mention that "the only condition I have I shouldn't have repeated celebs in chain."
This condition could be captured by using node-isomorphic pattern matching, which requires all nodes in a path to be unique. Unfortunately, this is not yet supported in Cypher. It is proposed as part of the openCypher project, but is still work-in-progress. Currently, Cypher only supports relationship uniqueness, which is not enough for this use case as there are multiple relationship types (e.g. A is married to B, but B also collaborated with A, so we already have a duplicate with only two nodes).
APOC solution. If you can use the APOC library, take a look at the path expander, which supports various uniqueness constraints, including NODE_GLOBAL.
Plain Cypher solution. To work around this limitation, you can capture the node uniqueness constraint with a filtering operation:
MATCH p = (c1:Celebrity {name: 'Rita Ora'})-[*5]-(c2:Celebrity)
UNWIND nodes(p) AS node
WITH p, count(DISTINCT node) AS countNodes
WHERE countNodes = 6 // a path with 5 relationships has 6 nodes
RETURN p
LIMIT 1
Performance-wise this should be okay as long as you limit its results, because the query engine will basically keep enumerating new paths until one of them passes the filtering test.
The goal of the UNWIND nodes(p) AS node WITH p, count(DISTINCT node) ... construct is to count the distinct nodes of each path by first UNWIND-ing the node list to separate rows, then aggregating them per path with count(DISTINCT ...). We then check whether the number of distinct nodes is still 6, since a path with 5 relationships contains 6 nodes. If so, the original list had no repeated nodes and we RETURN the path. If the chain should contain 5 celebs rather than 5 relationships, use [*4] and countNodes = 5 instead.
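The filtering idea is independent of the path length: a path has no repeated node exactly when the number of distinct nodes equals the number of nodes. A minimal Python sketch of that test on hand-written candidate paths (the paths and the function name are made up for illustration):

```python
def has_unique_nodes(path_nodes):
    # No node repeats exactly when deduplicating does not shrink the list.
    return len(set(path_nodes)) == len(path_nodes)


# Hypothetical candidate paths, as lists of node names.
candidates = [
    ["Rita Ora", "Jay Z", "Rita Ora"],        # revisits the start node
    ["Rita Ora", "Jay Z", "Rosario Dawson"],  # all nodes distinct
]

valid = [p for p in candidates if has_unique_nodes(p)]
print(valid)
```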
Note. Instead of UNWIND and count(DISTINCT ...), getting unique elements from a list could be expressed in other ways:
(1) Using a list comprehension and ranges:
WITH [1, 2, 2, 3, 2] AS l
RETURN [i IN range(0, size(l)-1) WHERE NOT l[i] IN l[0..i] | l[i]]
(2) Using reduce:
WITH [1, 2, 2, 3, 2] AS l
RETURN reduce(acc = [], i IN l | acc + CASE NOT i IN acc WHEN true THEN [i] ELSE [] END)
However, I believe both forms are less readable than the original one.
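For comparison, here is how the two list-based forms read in Python. This is only a transliteration to make the logic easier to follow, not Cypher: the first keeps an element when it does not occur earlier in the list, the second accumulates elements that have not been seen yet.

```python
from functools import reduce

l = [1, 2, 2, 3, 2]

# (1) Index-based: keep l[i] only if it does not appear in l[0..i).
by_range = [l[i] for i in range(len(l)) if l[i] not in l[:i]]

# (2) Reduce-based: append i to the accumulator only if it is new.
by_reduce = reduce(lambda acc, i: acc + ([i] if i not in acc else []), l, [])

print(by_range, by_reduce)
```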

Neo4j cypher query efficiency and syntax

I am attempting to query an ontology of health represented as an acyclic, directed graph in Neo4j v2.1.5. The database consists of 2 million nodes and 5 million edges/relationships. The following query identifies all nodes subsumed by a disease concept and caused by a particular bacteria or any of the bacteria subtypes as follows:
MATCH p = (a:ObjectConcept{disease}) <-[:ISA*]- (b:ObjectConcept),
WHERE NOT (b)-->()--(c) AND NOT (b)-->()-->(d)
RETURN distinct b.sctid, b.FSN
This query runs in < 1 second and returns the correct answers. However, adding one additional parameter adds substantial time (20 minutes). Example:
MATCH p = (a:ObjectConcept{disease}) <-[:ISA*]- (b:ObjectConcept),
WHERE NOT (b)-->()--(c)
AND NOT (b)-->()-->(d)
AND NOT (b)-->()-->(e)
AND NOT (b)-->()-->(f)
RETURN distinct b.sctid, b.FSN
I am new to cypher coding, but I have to imagine there is a better way to write this query to be more efficient. How would Collections improve this?
I already answered that on the google group:
Hi Scott,
I presume you created indexes or constraints for :ObjectConcept(name) ?
I am working with an acyclic, directed graph (an ontology) that models
human health and am needing to identify certain diseases (example:
Pneumonia) that are infectious but NOT caused by certain bacteria
(staph or streptococcus). All concepts are Nodes defined as
ObjectConcepts. ObjectConcepts are connected by relationships such as
[ISA], [Pathological_process], [Causative_agent], etc.
The query requires:
a) Identification of all concepts subsumed by the concept Pneumonia as follows:
MATCH p = (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept)
this already returns a number of paths, potentially millions, can you check that with
MATCH p = (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept) return count(*)
b) Identification of all concepts subsumed by Genus Staph and Genus Strep (including the concept Genus Staph and Genus Strep) as follows. Note:
with b MATCH (b) q = (c:ObjectConcept{Strep})<-[:ISA*]-(d:ObjectConcept), h = (e:ObjectConcept{Staph})<-[:ISA*]-(f:ObjectConcept)
this is then the cross product of the paths from "p", "q" and "h", e.g. if all 3 of them return 1000 paths, you're at 1bn paths !!
c) Identify all nodes(p) that do not have a causative agent of Strep (i.e., nodes(q)) or Staph (nodes(h)) as follows:
with b,c,d,e,f MATCH (b),(c),(d),(e),(f) WHERE (b)--()-->(c) OR (b)-->()-->(d) OR (b)-->()-->(e) OR (b)-->()-->(f) RETURN distinct b.Name;
you don't need the WITH or even the MATCH (b),(c),(d),(e),(f)
what connections are there between b and the other nodes ? do you have concrete ones? for the first there is also missing one direction.
the where clause can be a problem, in general you want to show that perhaps this query is better reproduced by a UNION of simpler matches
MATCH (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept)-->()-->(c:ObjectConcept{name:Strep}) RETURN
MATCH (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept)-->()-->(e:ObjectConcept{name:Staph}) RETURN
MATCH (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept)-->()-->(d:ObjectConcept)-[:ISA*]->(c:ObjectConcept{name:Strep}) return
MATCH (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept)-->()-->(d:ObjectConcept)-[:ISA*]->(c:ObjectConcept{name:Staph}) return
another option would be to utilize the shortestPath() function to find one or all shortest path(s) between Pneumonia and the bacteria with certain rel-types and direction.
Perhaps you can share the dataset and the expected result.
The query was successfully rewritten using UNION as follows:
MATCH p = (a:ObjectConcept{sctid:233604007}) <-[:ISA*]- (b:ObjectConcept),
q = (c:ObjectConcept{sctid:58800005})<-[:ISA*]-(d:ObjectConcept)
WHERE NOT (b)-->()--(c) AND NOT (b)-->()-->(d)
RETURN distinct b
UNION
MATCH p = (a:ObjectConcept{sctid:233604007}) <-[:ISA*]- (b:ObjectConcept),
t = (e:ObjectConcept{sctid:65119002}) <-[:ISA*]- (f:ObjectConcept)
WHERE NOT (b)-->()-->(e) AND NOT (b)-->()-->(f)
RETURN distinct b
The query now runs in under 20 seconds instead of 20 minutes, because each part of the UNION works on a much smaller set of path combinations.
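The arithmetic behind the cross-product remark earlier in this thread is easy to check. A small Python sketch, using the illustrative figure of 1000 paths per pattern from that remark (not measured values): matching p, q and h in one MATCH multiplies the path counts, whereas running the patterns separately only adds them.

```python
from math import prod

# Illustrative path counts per pattern, as in the "1000 paths each" example.
paths_per_pattern = {"p": 1000, "q": 1000, "h": 1000}

# One MATCH with all three patterns: combinations multiply.
combined_rows = prod(paths_per_pattern.values())

# Patterns evaluated independently: rows add up.
separate_rows = sum(paths_per_pattern.values())

print(combined_rows, separate_rows)
```

This is a simplification: the UNION queries above still pair p with one other pattern each, so their cost lies between the two figures.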

neo4j collecting nodes and relations type b-->a<--c,a<--d

I am extending maxdemarzi's excellent graph visualisation example using VivaGraph backed by neo4j.
I want to display relationships of the type given in the title: b-->a<--c, a<--d
I tried the query
MATCH p = (a)--(b:X)--(c),(b:X)--(d)
RETURN EXTRACT(n in nodes(p) | {id:ID(n), name:COALESCE(, n.title, ID(n)), type:LABELS(n)}) AS nodes,
EXTRACT(r in relationships(p)| {source:ID(startNode(r)) , target:ID(endNode(r))}) AS rels
It looks like the named query picks up only a-->b<--c pattern and omits the b<--d patterns.
Am I missing something? Can I not add multiple patterns in a named query?
The most immediate problem is that the comma in the MATCH clause separates the first pattern from the second. The variable 'p' only stores the first pattern. This is why you aren't getting the results you desire. Independent of that, you are at risk of having a 'loose binding' by putting a label on both of your nodes named 'b' in the two patterns. The second 'b' node should not have a label.
So here is a version of your query that should work.
MATCH p1=(a)-->(b:X)<--(c), p2=(b)<--(d)
WITH nodes(p1) + d AS ns, relationships(p1) + relationships(p2) AS rs
RETURN EXTRACT(n IN ns | {id:ID(n), name:COALESCE(, n.title, ID(n)), type:LABELS(n)}) AS nodes,
EXTRACT(r in rs| {source:ID(startNode(r)) , target:ID(endNode(r))}) AS rels
Capture both paths, then build collections from the nodes and relationships of both paths. The collection of nodes actually only extracts the nodes from p1 and adds the 'd' node. You could write that part as
nodes(p1) + nodes(p2) as ns
but then the 'b' node will appear in the list twice.
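The duplicate is easy to see with plain lists. A tiny Python sketch, using single letters as stand-ins for the matched nodes:

```python
# Stand-ins for nodes(p1) and nodes(p2) of one match.
p1_nodes = ["a", "b", "c"]
p2_nodes = ["b", "d"]

# Concatenating both paths repeats the shared node b.
both_paths = p1_nodes + p2_nodes

# Adding only d, as in the query above, keeps every node once.
p1_plus_d = p1_nodes + ["d"]

print(both_paths, p1_plus_d)
```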
