I am currently modelling a database with over 50,000 nodes, and every node has two directed relationships. For a given input node (the root node), I am trying to get all nodes connected to it by one relationship, then all so-called children of those nodes, and so on, until every node directly or indirectly connected to the root node has been reached.
String query =
    "MATCH (m {title:{title},namespaceID:{namespaceID}})-[:categorieLinkTo*..]->(n) " +
    "RETURN DISTINCT n.title AS Title, n.namespaceID " +
    "ORDER BY n.title";
Result result = db.execute(query, params);
String infos = result.resultAsString();
I have read that the runtime of such a query is likely O(n^x), but I cannot find any clause that excludes, for example, loops or multiple paths to the same node, so the query takes well over 2 hours, which is not acceptable for my use case.
For simple relationship expressions, Cypher excludes multiple relationships automatically by enforcing uniqueness:
While pattern matching, Neo4j makes sure to not include matches where the same graph relationship is found multiple times in a single pattern.
The documentation is not entirely clear on whether this works for variable length paths - so let's design a small experiment to confirm it:
CREATE
(n1:Node {name: "n1"}),
(n2:Node {name: "n2"}),
(n3:Node {name: "n3"}),
(n4:Node {name: "n4"}),
(n1)-[:REL]->(n2),
(n2)-[:REL]->(n3),
(n3)-[:REL]->(n2),
(n2)-[:REL]->(n4)
This results in the following graph:
Query with:
MATCH (n:Node {name:"n1"})-[:REL*..]->(m)
RETURN m
The result is:
╒══════════╕
│m │
╞══════════╡
│{name: n2}│
├──────────┤
│{name: n3}│
├──────────┤
│{name: n2}│
├──────────┤
│{name: n4}│
├──────────┤
│{name: n4}│
└──────────┘
As you can see, n4 is included multiple times (it can be reached both by avoiding the loop and by going through it).
Check the execution with PROFILE:
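For reference, the profile output is produced simply by prefixing the query with the keyword:
PROFILE
MATCH (n:Node {name:"n1"})-[:REL*..]->(m)
RETURN m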
So we should use DISTINCT to get rid of the duplicates:
MATCH (n:Node {name:"n1"})-[:REL*..]->(m)
RETURN DISTINCT m
The result is:
╒══════════╕
│m │
╞══════════╡
│{name: n2}│
├──────────┤
│{name: n3}│
├──────────┤
│{name: n4}│
└──────────┘
Again, check the execution with PROFILE:
There are certainly some things we can do to improve this query.
For one, you aren't using labels at all. Because of that, your match on the input node can't take advantage of any schema indexes you may have in place, and it must scan all 50k nodes, accessing and comparing properties until it has found every one with the given title and namespace (it does not stop after the first hit, since it cannot know whether other nodes satisfy the conditions). You can check the timing by matching on the start node alone.
To improve this, your nodes should be labeled, your match on the start node should include that label, and the title and namespaceID properties should be indexed.
That alone should provide a noticeable improvement in query speed.
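As a sketch, assuming a hypothetical label such as :Page on your nodes (substitute whatever label you actually use), the schema setup and the rewritten match would look something like this:
CREATE INDEX ON :Page(title);
CREATE INDEX ON :Page(namespaceID);

MATCH (m:Page {title: {title}, namespaceID: {namespaceID}})-[:categorieLinkTo*]->(n)
RETURN DISTINCT n.title AS Title, n.namespaceID
ORDER BY n.title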
The next question is whether the remaining bottleneck is due more to the sorting or to returning a massive result set.
You can check the cost for sorting alone by LIMITing the results you return.
You can use this for the end of the query, after the match.
WITH DISTINCT n
ORDER BY n.title
LIMIT 10
RETURN n.title AS Title, n.namespaceID
ORDER BY n.title
Also, when doing any performance tuning, you should PROFILE your queries (at least the ones that finish in a reasonable amount of time) and EXPLAIN the ones that are taking an absurd amount of time to examine the query plan.
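For example, to look at the plan of the original query without running it (again assuming the hypothetical :Page label from above):
EXPLAIN
MATCH (m:Page {title: {title}, namespaceID: {namespaceID}})-[:categorieLinkTo*]->(n)
RETURN DISTINCT n.title AS Title, n.namespaceID
ORDER BY n.title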
public static HashSet<Node> breadthFirst(String name, int namespace, GraphDatabaseService db) {
    // Map holding the Cypher query parameters
    Map<String, Object> params = new HashMap<>();
    // Add the title and namespaceID as parameters
    params.put("title", name);
    params.put("namespaceID", namespace);
    /* A plain BFS: a queue (touched) of nodes still to expand, a set (finished)
     * of nodes that have already been seen, a result set to return, and two
     * Result variables for the queries.
     */
    Node startNode = null;
    String query = "MATCH (n {title:{title},namespaceID:{namespaceID}})-[:categorieLinkTo]->(m) RETURN m";
    Queue<Node> touched = new LinkedList<Node>();
    HashSet<Node> finished = new HashSet<Node>();
    HashSet<Node> returnResult = new HashSet<Node>();
    Result iniResult = null;
    Result tempResult = null;
    // Fetch the nodes directly connected to the root and seed the queue with them.
    try (Transaction tx = db.beginTx()) {
        iniResult = db.execute(query, params);
        while (iniResult.hasNext()) {
            Map<String, Object> iniNode = iniResult.next();
            startNode = (Node) iniNode.get("m");
            touched.add(startNode);
            finished.add(startNode);
        }
        tx.success();
    } catch (QueryExecutionException e) {
        logger.error("Error while executing the query", e);
    }
    /* Standard BFS loop: poll a node from the queue, record it in the result set,
     * query its direct children, and enqueue every child that has not been
     * visited yet. The only fiddly part is casting between Result rows and Node.
     */
    while (!touched.isEmpty()) {
        try (Transaction tx = db.beginTx()) {
            Node currNode = touched.poll();
            returnResult.add(currNode);
            tempResult = null;
            Map<String, Object> paramsTemp = new HashMap<>();
            paramsTemp.put("title", currNode.getProperty("title").toString());
            paramsTemp.put("namespaceID", 14);
            String tempQuery = "MATCH (n {title:{title},namespaceID:{namespaceID}})-[:categorieLinkTo]->(m) RETURN m";
            tempResult = db.execute(tempQuery, paramsTemp);
            while (tempResult.hasNext()) {
                Map<String, Object> currResult = tempResult.next();
                Node tempCurrNode = (Node) currResult.get("m");
                if (!finished.contains(tempCurrNode)) {
                    touched.add(tempCurrNode);
                    finished.add(tempCurrNode);
                }
            }
            tx.success();
        } catch (QueryExecutionException f) {
            logger.error("Error while executing the query", f);
        }
    }
    return returnResult;
}
Since I was not able to find a fitting Cypher expression, I simply wrote a plain BFS myself - and it seems to work.
Related
I have for example the following graph in Neo4j
(startnode)-[:BELONG_TO]-(Interface)-[:IS_CONNECTED]-(Interface)-[:BELONG_TO]-
#the line below can repeat itself 0..n times
(node)-[:BELONG_TO]-(Interface)-[:IS_CONNECTED]-(Interface)-[:BELONG_TO]-
#up to the endnode
(endnode)
There is an Interface property I also need to match on. I do not want to follow all the paths, just the ones whose Interface nodes have the property value I am looking for, for example Interface.VlanList CONTAINS ",23,".
I have done the following in Cypher, but it requires that I already know how many iterations there will be, which in reality is not the case.
match (n:StartNode {name:"device name"}) -[:BELONG_TO]- (i:Interface) -[:IS_CONNECTED]- (ii:Interface)-[:BELONG_TO]-(nn:Node) -[:BELONG_TO]- (iii:Interface) -[:IS_CONNECTED]- (iiii:Interface) -[:BELONG_TO]-(nnn:Node)
where i.VlanList CONTAINS ",841,"
AND ii.VlanList CONTAINS ",841,"
AND iii.VlanList CONTAINS ",841,"
return n, i,ii,nn,iii,iiii,nnn
I have been looking at the documentation but cannot work out how the above could be resolved.
This should work:
// put the searchstring in a variable
WITH ',841,' AS searchstring
// look up start and end node
MATCH (startNode: .... {...}), (endNode: .... {...})
// look for paths of variable length
// that have your search string in all nodes,
// except the first and the last one
WITH searchstring, startNode, endNode
MATCH path=(startNode)-[:BELONG_TO|IS_CONNECTED*]-(endNode)
WHERE ALL(i IN nodes(path)[1..-1] WHERE i.VlanList CONTAINS searchstring)
RETURN path
You can also look at https://neo4j.com/labs/apoc/4.1/graph-querying/path-expander/ for more ideas about how you can limit the pathfinding.
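For instance, a rough sketch with APOC's expandConfig procedure (this assumes the APOC library is installed; the maxLevel value of 20 is only an illustration):
MATCH (start:StartNode {name: "device name"})
CALL apoc.path.expandConfig(start, {
  relationshipFilter: "BELONG_TO|IS_CONNECTED",
  uniqueness: "NODE_GLOBAL",
  maxLevel: 20
}) YIELD path
RETURN path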
This query should work for you (assuming that the relationship directions I chose are correct):
MATCH p = (sNode:StartNode)-[:BELONG_TO]->(i1:Interface)-[:IS_CONNECTED]->(i2:Interface)-[:BELONG_TO]->(n1)-[:BELONG_TO|IS_CONNECTED*0..]->(eNode:Node)
WHERE sNode.name = "device name" AND eNode.name = "foo" AND LENGTH(p)%3 = 0
WITH p, i1, i2, n1, eNode, RELATIONSHIPS(p) AS rels, NODES(p) AS ns
WHERE n1 = eNode OR (
ALL(j IN RANGE(3, SIZE(rels)-3, 3) WHERE
'BELONG_TO' = TYPE(rels[j]) = TYPE(rels[j+2]) AND
'IS_CONNECTED' = TYPE(rels[j+1])) AND
ALL(x IN ([i1, i2] + REDUCE(s = [], i IN RANGE(3, SIZE(ns)-2, 3) | CASE WHEN i%3 = 0 THEN s ELSE s +ns[i] END))
WHERE x:Interface AND x.VlanList CONTAINS $substring)
)
RETURN p
It checks that the returned paths have the required pattern of node labels, node property values, and relationship types. It takes advantage of the variable-length relationship syntax, using zero as the lower bound. Since there is no upper bound, the variable-length relationship query can take "forever" to finish (in such a situation, you should use a reasonable upper bound).
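As a minimal sketch, a bounded variant of the variable-length segment could look like this (the 0..12 bound is an arbitrary illustration, not a value implied by the question):
MATCH p = (sNode:StartNode {name: "device name"})-[:BELONG_TO|IS_CONNECTED*0..12]->(eNode:Node {name: "foo"})
RETURN p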
I've been playing with neo4j for a genealogy site and it's worked great!
I've run into a snag where finding the starting node isn't as easy. Looking through the docs and the posts online I haven't seen anything that hints at this so maybe it isn't possible.
What I would like to do is pass in a list of genders and from that list follow a specific path through the nodes to get a single node.
In the context of the family:
I want to get my mother's father's mother's mother. I have my own id, so I would start at my node and traverse four nodes from there.
So the pseudo query would be:
select person (follow childof relationship)
where starting node is me
where firstNode.gender == female
AND secondNode.gender == male
AND thirdNode.gender == female
AND fourthNode.gender == female
Focusing on the general solution:
MATCH p = (me:Person)-[:IS_CHILD_OF*]->(ancestor:Person)
WHERE me.uuid = {uuid}
AND length(p) = size({genders})
AND extract(x in tail(nodes(p)) | x.gender) = {genders}
RETURN ancestor
here's how it works:
match the starting node by id
match all the variable-length paths going to any ancestor
constrain the length of the path (i.e. the number of relationships, which is the same as the number of ancestors), as you can't parameterize the length in the query
extract the genders in the path
nodes(p) returns all the nodes in the path, including the starting node
tail(nodes(p)) skips the first element of the list, i.e. the starting node, so now we only have the ancestors
extract() extracts the genders of all the ancestor nodes, i.e. it transforms the list of ancestor nodes into their genders
the extracted list of genders can be compared to the parameter
if the path matched, we can return the bound ancestor, which is the end of the path
However, I don't think it will be faster than the explicit solution, though the performance could remain comparable. On my small test data (just 5 nodes), the general solution does 26 DB accesses whereas the specific solution only does 22, as reported by PROFILE. Further profiling would be needed on a larger database to compare the performances:
PROFILE MATCH p = (me:Person)-[:IS_CHILD_OF*]->(ancestor:Person)
WHERE me.uuid = {uuid}
AND length(p) = size({genders})
AND extract(x in tail(nodes(p)) | x.gender) = {genders}
RETURN ancestor
The general solution has the advantage of being a single query which won't need to be parsed again by the Cypher engine, whereas each generated query will need to be parsed.
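To make the parameter shapes concrete, here is the same query with literal values substituted (the uuid is the example id from the question, and the gender strings are purely illustrative):
MATCH p = (me:Person)-[:IS_CHILD_OF*]->(ancestor:Person)
WHERE me.uuid = 'f59c40de-506d-4829-a765-7a3ae94af8d1'
  AND length(p) = size(['female', 'male', 'female', 'female'])
  AND extract(x IN tail(nodes(p)) | x.gender) = ['female', 'male', 'female', 'female']
RETURN ancestor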
It was simpler than I thought. Maybe there is still a better way, so I'll leave this open for a bit.
The query would be:
MATCH (n1:Person { Id: 'f59c40de-506d-4829-a765-7a3ae94af8d1' })
<-[:CHILDOF]-(n2 { Gender:'0'})
<-[:CHILDOF]-(n3 { Gender:'1'})
<-[:CHILDOF]-(n4 { Gender:'1'})
RETURN n4
and each additional generation back would add another line to the pattern.
The equivalent query would look something like this:
MATCH (me:Person)
WHERE me.ID = ?
WITH me
MATCH (me)-[r:childof*4]->(ancestor:Person)
WITH ancestor, EXTRACT(rel IN r | endNode(rel).gender) AS genders
WHERE genders = ?
RETURN ancestor
Disclaimer, I haven't double-checked the syntax.
In Neo4j you usually find your start node first, typically by an ID of some sort (modify as required to match on a unique property). We then traverse a number of relationships to an ancestor, extract the gender property of the end node of each traversed relationship, and compare those genders to the expected list of genders (you'll need to make sure the argument is a bracketed list in the desired order).
Note that this approach filters down all possible results at that degree of childof relationship rather than walking a single path through your graph, so the higher the degree of ancestry you're querying, the slower the call will get.
I'm also unsure if you can parameterize the degree of the variable relationship, so that might prevent this from being a generalized solution for any degree of ancestry.
I'm not sure if you want a generic query which can work whatever the collection of genders you pass, or a specific solution.
Here's the specific solution: you match the path with the wanted length, and match each gender, as you've already noted in your own answer.
MATCH (me:Person)-[:IS_CHILD_OF]->(p1:Person)
-[:IS_CHILD_OF]->(p2:Person)
-[:IS_CHILD_OF]->(p3:Person)
-[:IS_CHILD_OF]->(p4:Person)
WHERE me.uuid = {uuid}
AND p1.gender = {genders}[0]
AND p2.gender = {genders}[1]
AND p3.gender = {genders}[2]
AND p4.gender = {genders}[3]
RETURN p4
Now, if you want to pass in a list of genders of an arbitrary length, it's actually possible. You match a variable-length path, make sure it has the right length (matching the number of genders), then match each gender in sequence.
MATCH p = (me:Person)-[:IS_CHILD_OF*]->(ancestor:Person)
WHERE me.uuid = {uuid}
AND length(p) = size({genders})
AND all(i IN range(0, size({genders}) - 1)
WHERE {genders}[i] = extract(x in tail(nodes(p)) | x.gender)[i])
RETURN ancestor
Building on @InverseFalcon's answer, you can actually compare collections, which simplifies the query:
MATCH p = (me:Person)-[:IS_CHILD_OF*]->(ancestor:Person)
WHERE me.uuid = {uuid}
AND length(p) = size({genders})
AND extract(x in tail(nodes(p)) | x.gender) = {genders}
RETURN ancestor
I want to write a query in Cypher and run it on Neo4j.
The query is:
Given some start vertexes, walk the edges and find all vertexes that are connected to any of the start vertexes.
(start)-[*]->(v)
for every edge E walked
if startVertex(E).someproperty != endVertex(E).someproperty, output E.
The graph may contain cycles.
For example, in the graph above, vertexes are grouped by the "group" property. The query should return 7 rows representing the 7 orange colored edges in the graph.
If I wrote the algorithm myself, it would be a simple depth- or breadth-first search: for every edge visited, output the edge if the filter condition is true. The complexity is O(V+E).
But I can't express this algorithm in Cypher, since it's a very different language.
Then I wrote this query:
Find all reachable vertexes:
(start)-[*]->(v), reachable = start + v.
Find all edges starting from any reachable vertex. If an edge ends at a reachable vertex and passes the filter, output it:
match (reachable)-[]->(n) where n in reachable and reachable.someprop != n.someprop
So the Cypher code looks like this:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
The performance of this query is not as good as I expected. There is an index on :Col(schema).
I am running the neo4j 2.3.0 Docker image from Docker Hub on my Windows laptop (it actually runs in a Linux virtual machine on the laptop).
My sample data is a small dataset that contains 0.1M vertexes and 0.5M edges. For some starting nodes it takes 60 or more seconds to complete this query. Any advice for optimizing or rewriting the query? Thanks.
The following code block is the logic I want:
VertexQueue1 = (starting vertexes);
VisitedVertexSet = (empty);
EdgeSet1 = (empty);
While (VertexQueue1 is not empty)
{
    Vertex0 = VertexQueue1.pop();
    VisitedVertexSet.add(Vertex0);
    foreach (Edge0 starting from Vertex0)
    {
        Vertex1 = endingVertex(Edge0);
        if (Vertex1.schema <> Vertex0.schema)
        {
            EdgeSet1.put(Edge0);
        }
        if (VisitedVertexSet.notContains(Vertex1)
            and VertexQueue1.notContains(Vertex1))
        {
            VertexQueue1.push(Vertex1);
        }
    }
}
return EdgeSet1;
EDIT:
The profile result shows that expanding all paths has a high cost. Looking at the row count, it seems that the Cypher execution engine returns all paths, but I only want the distinct edge list.
LEFT one:
match (start:Col {table:"F_XXY_DSMK_ITRPNL_IDX_STAT_W"})
,(start)-[*0..]->(prev:Col)-->(node:Col)
where prev.schema<>node.schema
return distinct prev,node
RIGHT one:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
I think Cypher makes this much easier than you're expecting it to be, if I'm understanding the query. Try this:
MATCH (start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WHERE start.schema <> node.schema
RETURN start, node
Though I'm not sure why you're comparing the schema property on the nodes. Isn't the schema for the start node fixed by the value that you pass in?
I might not be understanding the query though. If you're looking for more than just the nodes connected to the start node, you could do:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN prev, node
That open-ended variable length relationship specification might be slow, though.
Also note that when Cypher is traversing a particular path it stops when it finds that it has looped back onto some node (EDIT: relationship, not node) already in the path matched so far, so cycles aren't really a problem.
Also, is the DWMDATA value that you're passing in interpolated? If so, you should think about using parameters for security / performance:
http://neo4j.com/docs/stable/cypher-parameters.html
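A sketch of the parameterized form of the start-node lookup (the parameter names here are just examples):
MATCH (start:Col {schema: {schema}, table: {table}})-->(node:Col)
WHERE start.schema <> node.schema
RETURN start, node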
EDIT:
Based on your comment I have a couple of thoughts. First, limiting to DISTINCT paths isn't going to help, because every path that it finds is already distinct. What you want, I think, is the distinct set of pairs, which could be achieved by just adding DISTINCT to the query:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINCT prev, node
Here is another way to go about it which may or may not be more efficient, but might at least give you an idea for how to go about things differently:
MATCH
path=(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-[*]->(node:Col)
WITH relationships(path) AS rels
UNWIND rels AS rel
WITH DISTINCT rel
WITH startNode(rel) AS start_node, endNode(rel) AS end_node
WHERE start_node.schema <> end_node.schema
RETURN start_node, end_node
I can't say that this would be faster, but here's another way to try:
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH collect(ID(node)) AS node_ids
MATCH (:Col)-[r]->(node:Col)
WHERE ID(node) IN node_ids
WITH DISTINCT r
RETURN startNode(r) AS start_node, endNode(r) AS end_node
I suspect that the problem in all cases is with the open-ended variable length path. I've actually asked on the Slack group to try to get a better understanding of how it works. In the meantime, for all the queries that you try I would suggest prefixing them with the PROFILE keyword to get a report from Neo4j on what parts of the query are slow.
// this is very inefficient!
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH distinct node
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
you might be better off with this:
MATCH (start:Col)
WHERE start.property IN {property_values}
MATCH (node:Col)
WHERE shortestPath((start)-[*]->(node)) IS NOT NULL
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
I'm new to Cypher, Neo4j and graph databases in general. The data model I was given to work with is a tad confusing, but it looks like the nodes are just GUID placeholders, with all the real "data" as properties on relationships to the nodes (which relate each node back to node zero).
Each node (which basically only has a GUID) has a dozen relationships with key/value pairs that are the actual data I need. (I'm guessing this was done for versioning?)
I need to be able to make a single Cypher call to get properties off two (or more) relationships connected to the same node. Here are two calls that I'd like to combine into one:
start n = Node(*)
match n-[ar]->m
where has(ar.value) and has(ar.proptype) and ar.proptype = 'ccid'
return ar.value
and
start n = Node(*)
match n-[br]->m
where has(br.value) and has(br.proptype) and br.proptype = 'description'
return br.value
How do I go about doing this in a single cypher call?
EDIT - for clarification;
I want to get the results as a list of rows with a column for each value requested.
Something returned like:
n.id as Node, ar.value as CCID, br.value as Description
Correct me if I'm wrong, but I believe you can just do this:
start n = Node(*)
match n-[r]->m
where has(r.value) and has(r.proptype) and (r.proptype = 'ccid' or r.proptype = 'description')
return r.value
See here for more documentation on Cypher operations.
Based on your edits I'm guessing that you're actually looking to find cases where a node has both a ccid and a description? The question is vague, but I think this is what you're looking for.
start n = Node(*)
match n-[ar]->m, n-[br]->m
where (has(ar.value) and has(ar.proptype) and ar.proptype = 'ccid') and
(has(br.value) and has(br.proptype) and br.proptype = 'description')
return n.id as Node, ar.value as CCID, br.value as Description
You can match relationships from two sides:
start n = Node(*)
match m<-[br]-n-[ar]->m
where has(ar.value) and has(ar.proptype) and ar.proptype = 'ccid' and
has(br.value) and has(br.proptype) and br.proptype = 'description'
return ar.value, br.value
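For readers on current Neo4j versions, where START and has() are gone, roughly the same idea could be written as follows (a sketch, not tested against the original data model; on 4.x+ use ar.value IS NOT NULL instead of exists()):
MATCH (m)<-[br]-(n)-[ar]->(m)
WHERE ar.proptype = 'ccid' AND exists(ar.value)
  AND br.proptype = 'description' AND exists(br.value)
RETURN n.id AS Node, ar.value AS CCID, br.value AS Description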
Is there a built-in way to match only the first n relationships, other than filtering with LIMIT n afterwards?
I have this query:
START n=node({id})
MATCH n--u--n2
RETURN u, count(*) as cnt order by cnt desc limit 10;
But assuming the number of n--u relationships is very high, I want to relax this query and take, for example, the first 100 random relationships and then continue with u--n2...
This is for a collaborative filtering task, and assuming the users are more or less similar, I don't want to match all users u, only a random subset. This approach should perform better - right now I get ~500ms query time and would like to drop it under 50ms.
I know I could break the above query into 2 separate ones, but even then the first query goes through all users and only limits the output afterwards. I want to limit the maximum number of relationships during the match phase.
You can pipe the current results of your query using WITH, then LIMIT those initial results, and then continue on in the same query:
START n=node({id})
MATCH n--u
WITH u
LIMIT 10
MATCH u--n2
RETURN u, count(*) as cnt
ORDER BY cnt desc
LIMIT 10;
The query above will give you the first ten u nodes found, and then continue matching n2 nodes for them, with the final LIMIT capping the output at ten rows.
Optionally, you can leave off the second LIMIT and you will get all matching n2 nodes for the first ten u nodes (meaning more than ten rows could be returned).
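For example, the variant without the trailing LIMIT would be:
START n=node({id})
MATCH n--u
WITH u
LIMIT 10
MATCH u--n2
RETURN u, count(*) as cnt
ORDER BY cnt desc;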
This is not a direct solution to your question, but since I was running into a similar problem, my work-around might be interesting for you.
What I need to do is get relationships by index (which might yield many thousands) and get their start node. Since the start node is always the same for that index query, I only need the very first relationship's start node.
Since I wasn't able to achieve that with Cypher (the proposed query by ean5533 does not perform any better), I am using a simple unmanaged extension (nice template).
@GET
@Path("/address/{address}")
public Response getUniqueIDofSenderAddress(@PathParam("address") String addr,
                                           @Context GraphDatabaseService graphDB) throws IOException
{
    try {
        RelationshipIndex index = graphDB.index().forRelationships("transactions");
        IndexHits<Relationship> rels = index.get("sender_address", addr);
        int unique_id = -1;
        for (Relationship rel : rels) {
            Node sender = rel.getStartNode();
            unique_id = (Integer) sender.getProperty("unique_id");
            rels.close();
            break;
        }
        return Response.ok().entity("Unique ID: " + unique_id).build();
    } catch (Exception e) {
        return Response.serverError().entity("Could not get unique ID.").build();
    }
}
For this case here, the speed up is quite nice.
I don't know your exact use case, but since Neo4j even supports HTTP streaming afaik, you should be able to convert your query into an unmanaged extension and still get the full performance.
E.g., "java-querying" all your qualifying nodes and emitting the partial results to the HTTP stream.