Assume that in an application, the user gives us a graph and we want to consider it as a pattern and find all occurrences of the pattern in the neo4j database. If we knew what the pattern is, we could write the pattern as a Cypher query and run it against our database. However, now we do not know what the pattern is beforehand and receive it from the user in the form of a graph. How can we perform a pattern matching on the database based on the given graph (pattern)? Is there any apoc for that? Any external library?
One way of doing this is to decompose your input graph into edges and create a dynamic cypher from it. I have worked on this quite some time ago, and the solution below is not perfect but indicates a possible direction.
For example, if you feed this graph:
and you take the id(node) from the graph, (i am not taking the rel ids, this is one of the imperfections)
this query
WITH $nodeids AS selection
UNWIND selection AS s
WITH COLLECT (DISTINCT s) AS selection
WITH selection,
SPLIT(left('a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z',SIZE(selection)*2-1),",") AS nodeletters
WITH selection,
nodeletters,
REDUCE (acc="", nl in nodeletters |
CASE acc
WHEN "" THEN acc+nl
ELSE acc+','+nl
END) AS rtnnodes
MATCH (n) WHERE id(n) IN selection
WITH COLLECT(n) AS nodes,selection,nodeletters,rtnnodes
UNWIND nodes AS n
UNWIND nodes AS m
MATCH (n)-[r]->(m)
WITH DISTINCT "("
+nodeletters[REDUCE(x=[-1,0], i IN selection | CASE WHEN i = id(n) THEN [x[1], x[1]+1] ELSE [x[0], x[1]+1] END)[0]]
+TRIM(REDUCE(acc = '', p IN labels(n)| acc + ':'+ p))+")-[:"+type(r)+"]->("
+ nodeletters[REDUCE(x=[-1,0], i IN selection | CASE WHEN i = id(m) THEN [x[1], x[1]+1] ELSE [x[0], x[1]+1] END)[0]]
+TRIM(REDUCE(acc = '', p IN labels(m)| acc + ':'+ p))+")" as z,rtnnodes
WITH COLLECT(z) AS parts,rtnnodes
WITH REDUCE(y=[], x in range(0, size(parts)-1) | y + replace(parts[x],"[","[r" + (x+1))) AS parts2,
REDUCE (acc="", x in range(0, size(parts)-1) | CASE acc WHEN "" THEN acc+"r"+(x+1) ELSE acc+",r"+(x+1) END) AS rtnrels,
rtnnodes
RETURN
REDUCE (acc="MATCH ",p in parts2 |
CASE acc
WHEN "MATCH " THEN acc+p
ELSE acc+','+p
END)+
" RETURN "+
rtnnodes+","+rtnrels+
" LIMIT "+{limit}
AS cypher
returns something like
cypher: "MATCH (a:Person)-[r1:DRIVES]->(b:Car),(a:Person)-[r2:KNOWS]->(c:Person) RETURN a,b,c,r1,r2 LIMIT 50"
which you can feed to the next query.
In Graphileon, you can just select the nodes, and the result will be visualized as well.
Disclosure : I work for Graphileon
I have used patterns in genealogy queries.
The X-chromosome is not transmitted from father to son. As you traverse a family tree you can use the reduce function to create a concatenated string of the sex of the ancestor. You can then accept results that lack MM (father-son). This query gives all the descendants inheriting the ancestor's (RN=32) X-chromosome.
match p=(n:Person{RN:32})<-[:father|mother*..99]-(m)
with m, reduce(status ='', q IN nodes(p)| status + q.sex) AS c
where c=replace(c,'MM','')
return distinct m.fullname as Fullname
I am developing other pattern specific queries as part of a Neo4j PlugIn for genealogy. These will include patterns of triangulation groups.
GitHub repository for Neo4j Genealogy PlugIn
Related
I would like to query for various things and returned a combined set of relationships. In the example below, I want to return all people named Joe living on Main St. I want to return both the has_address and has_state relationships.
MATCH (p:Person),
(p)-[r:has_address]-(a:Address),
(a)-[r1:has_state]-(s:State)
WHERE p.name =~ ".*Joe.*" AND a.street = ".*Main St.*"
RETURN r, r1;
But when I run this query in the Neo4J browser and look under the "Text" view, it seems to put r and r1 as columns in a table (something like this):
│r │r1 │
╞═══╪═══|
│{} │{} │
rather than as desired with each relationship on a different row, like:
Joe Smith | has_address | 1 Main Street
1 Main Street | has_state | NY
Joe Richards | has_address | 22 Main Street
I want to download this as a CSV file for filtering elsewhere. How do I re-write the query in Neo4J to get the desired result?
You may want to look at the Cypher cheat sheet, specifically the Relationship Functions.
That said, you have variables on all the nodes you need. You can output all the data you need on each row.
MATCH (p:Person),
(p)-[r:has_address]-(a:Address),
(a)-[r1:has_state]-(s:State)
WHERE p.name =~ ".*Joe.*" AND a.street = ".*Main St.*"
RETURN p.name AS name, a.street AS address, s.name AS state
That should be enough.
What you seem to be asking for above is a way to union r and r1, but in such a way that they alternate in-order, one row being r and the next being its corresponding r1. This is a rather atypical kind of query, and as such there isn't a lot of support for easily making this kind of output.
If you don't mind rows being out of order, it's easy to do, but your start and end nodes for each relationship are no longer the same type of thing.
MATCH (p:Person),
(p)-[r:has_address]-(a:Address),
(a)-[r1:has_state]-(s:State)
WHERE p.name =~ ".*Joe.*" AND a.street = ".*Main St.*"
WITH COLLECT(r) + COLLECT(r1) as rels
UNWIND rels AS rel
RETURN startNode(rel) AS start, type(rel) AS type, endNode(rel) as end
The query purpose is pretty trivial. For a given nodeId(userId) I want to return on the graph all nodes which has relaionship within X hops and I want to aggregate and return the distance(param which set on the relationship) between them)
I came up with this:
MATCH p=shortestPath((user:FOLLOWERS{userId:{1}})-[r:follow]-(f:FOLLOWERS)) " +
"WHERE f <> user " +
"RETURN (f.userId) as userId," +
"reduce(s = '', rel IN r | s + rel.dist + ',') as dist," +
"length(r) as hop"
userId({1}) is given as Input and is indexed.
I believe Iam having here cartesian product. how would you suggest avoiding it?
You can make the cartesian product less onerous by creating an index on :FOLLOWERS(userId) to speed up one of the two "legs" of the cartesian product:
CREATE INDEX ON :FOLLOWERS(userId);
Even though this will not get rid of the cartesian product, it will run in O(N log N) time, which is much faster than O(N ^ 2).
By the way, your r relationship needs to be variable-length in order for your query to work. You should specify a reasonable upper bound (which depends on your DB) to assure that the query will finish in a reasonable time and not run out of memory. For example:
MATCH p=shortestPath((user:FOLLOWERS { userId: 1 })-[r:follow*..5]-(f:FOLLOWERS))
WHERE f <> user
RETURN (f.userId) AS userId,
REDUCE (s = '', rel IN r | s + rel.dist + ',') AS dist,
LENGTH(r) AS hop;
I am attempting to query an ontology of health represented as an acyclic, directed graph in Neo4j v2.1.5. The database consists of 2 million nodes and 5 million edges/relationships. The following query identifies all nodes subsumed by a disease concept and caused by a particular bacteria or any of the bacteria subtypes as follows:
MATCH p = (a:ObjectConcept{disease}) <-[:ISA*]- (b:ObjectConcept),
q=(c:ObjectConcept{bacteria})<-[:ISA*]-(d:ObjectConcept)
WHERE NOT (b)-->()--(c) AND NOT (b)-->()-->(d)
RETURN distinct b.sctid, b.FSN
This query runs in < 1 second and returns the correct answers. However, adding one additional parameter adds substantial time (20 minutes). Example:
MATCH p = (a:ObjectConcept{disease}) <-[:ISA*]- (b:ObjectConcept),
q=(c:ObjectConcept{bacteria})<-[:ISA*]-(d:ObjectConcept),
t=(e:ObjectConcept{bacteria})<-[:ISA*]-(f:ObjectConcept),
WHERE NOT (b)-->()--(c)
AND NOT (b)-->()-->(d)
AND NOT (b)-->()-->(e)
AND NOT (b)-->()-->(f)
RETURN distinct b.sctid, b.FSN
I am new to cypher coding, but I have to imagine there is a better way to write this query to be more efficient. How would Collections improve this?
Thanks
I already answered that on the google group:
Hi Scott,
I presume you created indexes or constraints for :ObjectConcept(name) ?
I am working with an acyclic, directed graph (an ontology) that models
human health and am needing to identify certain diseases (example:
Pneumonia) that are infectious but NOT caused by certain bacteria
(staph or streptococcus). All concepts are Nodes defined as
ObjectConcepts. ObjectConcepts are connected by relationships such as
[ISA], [Pathological_process], [Causative_agent], etc.
The query requires:
a) Identification of all concepts subsumed by the concept Pneumonia as follows:
MATCH p = (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept)
this already returns a number of paths, potentially millions, can you check that with
MATCH p = (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept) return count(*)
b) Identification of all concepts subsumed by Genus Staph and Genus Strep (including the concept Genus Staph and Genus Strep) as follows. Note:
with b MATCH (b) q = (c:ObjectConcept{Strep})<-[:ISA*]-(d:ObjectConcept), h = (e:ObjectConcept{Staph})<-[:ISA*]-(f:ObjectConcept)
this is then the cross product of the paths from "p", "q" and "h", e.g. if all 3 of them return 1000 paths, you're at 1bn paths !!
c) Identify all nodes(p) that do not have a causative agent of Strep (i.e., nodes(q)) or Staph (nodes(h)) as follows:
with b,c,d,e,f MATCH (b),(c),(d),(e),(f) WHERE (b)--()-->(c) OR (b)-->()-->(d) OR (b)-->()-->(e) OR (b)-->()-->(f) RETURN distinct b.Name;
you don't need the WITH or even the MATCH (b),(c),(d),(e),(f)
what connections are there between b and the other nodes ? do you have concrete ones? for the first there is also missing one direction.
the where clause can be a problem, in general you want to show that perhaps this query is better reproduced by a UNION of simpler matches
e.g
MATCH (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept)-->()-->(c:ObjectConcept{name:Strep}) RETURN b.name
UNION
MATCH (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept)-->()-->(e:ObjectConcept{name:Staph}) RETURN b.name
UNION
MATCH (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept)-->()-->(d:ObjectConcept)-[:ISA*]->(c:ObjectConcept{name:Strep}) return b.name
UNION
MATCH (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept)-->()-->(d:ObjectConcept)-[:ISA*]->(c:ObjectConcept{name:Staph}) return b.name
another option would be to utilize the shortestPath() function to find one or all shortest path(s) between Pneumonia and the bacteria with certain rel-types and direction.
Perhaps you can share the dataset and the expected result.
The query was successfully accomplished using UNION functions as follows:
MATCH p = (a:ObjectConcept{sctid:233604007}) <-[:ISA*]- (b:ObjectConcept),
q = (c:ObjectConcept{sctid:58800005})<-[:ISA*]-(d:ObjectConcept)
WHERE NOT (b)-->()--(c) AND NOT (b)-->()-->(d)
RETURN distinct b
UNION
MATCH p = (a:ObjectConcept{sctid:233604007}) <-[:ISA*]- (b:ObjectConcept),
t = (e:ObjectConcept{sctid:65119002}) <-[:ISA*]- (f:ObjectConcept)
WHERE NOT (b)-->()-->(e) AND NOT (b)-->()-->(f)
RETURN distinct b
The query runs in sub 20 seconds vs. 20 minutes by reducing the cardinality of the objects being queried.
I am extending maxdemarzi's excellent graph visualisation example (http://maxdemarzi.com/2013/07/03/the-last-mile/) using VivaGraph backed by neo4j.
I want to display relationships of the type
a-->b<--c,b<--d
I tried the query
MATCH p = (a)--(b:X)--(c),(b:X)--(d)
RETURN EXTRACT(n in nodes(p) | {id:ID(n), name:COALESCE(n.name, n.title, ID(n)), type:LABELS(n)}) AS nodes,
EXTRACT(r in relationships(p)| {source:ID(startNode(r)) , target:ID(endNode(r))}) AS rels
It looks like the named query picks up only a-->b<--c pattern and omits the b<--d patterns.
Am i missing something... can i not add multiple patterns in a named query?
The most immediate problem is that the comma in the MATCH clause separates the first pattern from the second. The variable 'p' only stores the first pattern. This is why you aren't getting the results you desire. Independent of that, you are at risk of having a 'loose binding' by putting a label on both of your nodes named 'b' in the two patterns. The second 'b' node should not have a label.
So here is a version of your query that should work.
MATCH p1=(a)-->(b:X)<--(c), p2=(b)<--(d)
WITH nodes(p1) + d AS ns, relationships(p1) + relationships(p2) AS rs
RETURN EXTRACT(n IN ns | {id:ID(n), name:COALESCE(n.name, n.title, ID(n)), type:LABELS(n)}) AS nodes,
EXTRACT(r in rs| {source:ID(startNode(r)) , target:ID(endNode(r))}) AS rels
Capture both paths, then build collections from the nodes and relationships of both paths. The collection of nodes actually only extracts the nodes from p1 and adds the 'd' node. You could write that part as
nodes(p1) + nodes(p2) as ns
but then the 'b' node will appear in the list twice.
Trying to find similar movies on the basis of tags. But I also need all the tags for the given movie and its each similar movie (to do some calculations). But surprisingly collect(h.w) gives repeated values of h.w (where w is a property of h)
Here is the cypher query. Please help.
MATCH (m:Movie{id:1})-[h1:Has]->(t:Tag)<-[h2:Has]-(sm:Movie),
(m)-[h:Has]->(t0:Tag),
(sm)-[H:Has]->(t1:Tag)
WHERE m <> sm
RETURN distinct(sm), collect(h.w)
Basically a query like
MATCH (x)-[h]->(y), (a)-[H]->(b)
RETURN h
is returning each result for h n times where n is the number of results for H. Any way around this?
I replicated the data model for this question to help answer it.
I then setup a sample dataset using Neo4j's online console: http://console.neo4j.org/?id=dakmi3
Running the following query from your question:
MATCH (m:Movie { title: "The Matrix" })-[h1:HAS_TAG]->(t:Tag),
(t)<-[h2:HAS_TAG]-(sm:Movie),
(m)-[h:HAS_TAG]->(t0:Tag),
(sm)-[H:HAS_TAG]->(t1:Tag)
WHERE m <> sm
RETURN DISTINCT sm, collect(h.weight)
Which results in:
(1:Movie {title:"The Matrix: Reloaded"}) [0.31, 0.12, 0.31, 0.12, 0.31, 0.01, 0.31, 0.01]
The issue is that there are duplicate relationships being returned, which results in duplicated weight in the collection. The solution is to use WITH to limit relationships to distinct records and then return the collection of weights of those relationships.
MATCH (m:Movie { title: "The Matrix" })-[h1:HAS_TAG]->(t:Tag),
(t)<-[h2:HAS_TAG]-(sm:Movie),
(m)-[h:HAS_TAG]->(t0:Tag),
(sm)-[H:HAS_TAG]->(t1:Tag)
WHERE m <> sm
WITH DISTINCT sm, h
RETURN sm, collect(h.weight)
(1:Movie {title:"The Matrix: Reloaded"}) [0.31, 0.12, 0.01]
I'm afraid I still don't quite get your intention, but about the general question of duplicate results, that is just the way a disconnected pattern works. Cypher must consider something like
(:A), (:B)
as one pattern, not two. That means that any satisfying graph structure is considered a distinct match. Suppose you have the graph resulting from
CREATE (:A), (:B), (:B)
and query it for the pattern above, you get two results, namely
neo4j-sh (?)$ MATCH (a:A),(b:B) RETURN *;
==> +-------------------------------+
==> | a | b |
==> +-------------------------------+
==> | Node[15204]{} | Node[15207]{} |
==> | Node[15204]{} | Node[15208]{} |
==> +-------------------------------+
==> 2 rows
==> 53 ms
Similarly when matching your pattern (x)-[h]->(y), (a)-[H]->(b) cypher considers each combination of the two pattern parts to make up a unique match for the one whole pattern–so the results for h are compounded by the results for H.
This the way the pattern matching works. To achieve what you want you could first consider if you really need to query for a disconnected pattern. If you do, or if a connected pattern also generates redundant matches, then aggregate one or more of the pattern parts. A simple case might be
CREATE (a:A), (b1:B), (b2:B)
, (c1:C), (c2:C), (c3:C)
, a-[:X]->b1, a-[:X]->b2
, a-[:Y]->c1, a-[:Y]->c2, a-[:Y]->c3
queried with
MATCH (b:B)<-[:X]-(a:A)-[:Y]->(c:C) // with 1 (a), 2 (b) and 3 (c) you get 6 matched paths
RETURN a, collect (b) as bb, collect (c) as cc // after aggregation by (a) there is one path
Sometimes it makes sense to do the aggregation as an intermediate step
MATCH (b)<-[:X]-(a:A) // 2 paths
WITH a, collect(b) as bb // 1 path
MATCH a-[:Y]->(c) // 3 paths
RETURN a, bb, collect(c) as cc // 1 path