Pruning a general tree using cypher - neo4j

I have a Neo4j database whose vertices are either of type folder or of type leaf. A general tree is modelled using :childof relations, and there is a single 'root' node which is the common ancestor of all vertices.
When presenting the tree, I want to filter out full branches based on the id of any vertex of type folder. Additionally, there is a filter on the properties of vertices of type leaf. The tricky part is that I do not want to see any folder whose descendant leaf nodes are all filtered out. Each query only returns immediate descendants, but the filter is applied to the whole subtree. The query must return the immediate children, and a collection of the ids of each folder containing leaves which are not filtered out.
The use case is an API for showing a hierarchy based on some filter constraints. I have programmed this in the API application code, but transferring all data from the db to the API application is too slow, so I need to improve the query to condense the data transfer. A third approach is using a purpose-built process that does this filtering, keeping the tree in memory. This has been done with some success, but I prefer to use off-the-shelf software if I can.
The following code is used to get the top-level nodes, without filtering. I struggle with expressing "MATCH only if at least one descendant also matches":
MATCH (p)-[:childof]->(s:Folder) WHERE s.name = 'root'
WITH p OPTIONAL MATCH (v)-[:childof*1..]->(p)
WHERE NOT((v)<-[:childof]-(:Folder))
RETURN p, collect(v.id) as folder_ids
My personal inclination to the problem is that it is too specific for a general purpose graph engine, but I am hoping to be proved wrong.

It sounds like you're close.
We can use pattern comprehensions at the folder level to check for children that meet the filter, and keep only folders that have at least one such child.
At the immediate-descendant level, using MATCH instead of OPTIONAL MATCH means that, because the folders are already filtered, the only immediate descendants left are those with at least one of these folders beneath them.
Let's say, for example, that our filter is that leaf nodes must have active = true. Then a folder is only considered if it has at least one child node meeting the filter, and an immediate descendant is only kept if its collection of eligible folders isn't empty.
Something like this:
MATCH (p)-[:childof]->(s:Folder) WHERE s.name = 'root'
WITH p
MATCH (folder)-[:childof*1..]->(p)
WHERE NOT((folder)<-[:childof]-(:Folder)) AND
size([(folder)<-[:childof]-(child) WHERE child.active = true | child]) <> 0
RETURN p, collect(folder.id) as folder_ids
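To make the intended result concrete, here is a minimal in-memory sketch in Python that mirrors the rule the query expresses. It is only an illustration, not a replacement for the Cypher. The dictionaries `parent_of`, `kind` and `active`, the sample ids, and the function name `visible_children` are hypothetical stand-ins for the graph. Note that, like `[:childof*1..]` in the query, it only looks at folders strictly below each immediate child `p`, so `p` itself is never reported as a folder.

```python
from collections import defaultdict


def visible_children(parent_of, kind, active, root):
    """Return {child_id: [folder_ids]} for the immediate children of `root`."""
    children = defaultdict(list)
    for node, parent in parent_of.items():
        children[parent].append(node)

    def descendants(node):
        # All nodes one or more :childof hops below `node`.
        stack = list(children[node])
        while stack:
            current = stack.pop()
            yield current
            stack.extend(children[current])

    def qualifies(node):
        kids = children[node]
        has_subfolder = any(kind[k] == "Folder" for k in kids)
        has_active_child = any(active.get(k, False) for k in kids)
        return not has_subfolder and has_active_child

    result = {}
    for p in children[root]:
        folder_ids = sorted(d for d in descendants(p) if qualifies(d))
        if folder_ids:  # plain MATCH: p is dropped when nothing below it qualifies
            result[p] = folder_ids
    return result


# Hypothetical data: root -> a -> a1 -> leaf1 (active)
#                    root -> b -> b1 -> leaf2 (inactive)
parent_of = {"a": "root", "b": "root", "a1": "a", "b1": "b",
             "leaf1": "a1", "leaf2": "b1"}
kind = {"root": "Folder", "a": "Folder", "b": "Folder", "a1": "Folder",
        "b1": "Folder", "leaf1": "Leaf", "leaf2": "Leaf"}
active = {"leaf1": True, "leaf2": False}

pruned = visible_children(parent_of, kind, active, "root")
print(pruned)
```

With this sample data only branch `a` survives, because the only leaf under `b` is filtered out.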

Related

How to query Neo4j N levels deep with variable length relationship and filters on each level

I'm new(ish) to Neo4j and I'm attempting to build a tool that allows users on a UI to essentially specify a path of nodes they would like to query neo4j for. For each node in the path they can specify specific properties of the node and generally they don't care about the relationship types/properties. The relationships need to be variable in length because the typical use case for them is they have a start node and they want to know if it reaches some end node without caring about (all of) the intermediate nodes between the start and end.
Some restrictions the user has when building the path from the UI are that it can't have cycles, it can't have nodes that have more than one child with children, and nodes can't have more than one incoming edge. This is only enforced from their perspective, not in the query itself.
The issue I'm having is being able to specify filtering on each level of the path without getting strange behavior.
I've tried a lot of variations of my Cypher query such as breaking up the path into multiple MATCH statements, tinkering with the relationships and anything else I could think of.
Here is a Gist of a sample Cypher dump
cypher-dump
This query gives me the path that I'm trying to get however it doesn't specify name or type on n_four.
MATCH path = (n_one)-[*0..]->(n_two)-[*0..]->(n_three)-[*0..]->(n_four)
WHERE n_one.type IN ["JCL_JOB"]
AND n_two.type IN ["JCL_PROC"]
AND n_three.name IN ["INPA", "OUTA", "PRGA"]
AND n_three.type IN ["RESOURCE_FILE", "COBOL_PROGRAM"]
RETURN path
This query is what I'd like to work; however, it leaves out the leaves at the third level, which I am having trouble understanding.
MATCH path = (n_one)-[*0..]->(n_two)-[*0..]->(n_three)-[*0..]->(n_four)
WHERE n_one.type IN ["JCL_JOB"]
AND n_two.type IN ["JCL_PROC"]
AND n_three.name IN ["INPA", "OUTA", "PRGA"]
AND n_three.type IN ["RESOURCE_FILE", "COBOL_PROGRAM"]
AND n_four.name IN ["TAB1", "TAB2", "COPYA"]
AND n_four.type IN ["RESOURCE_TABLE", "COBOL_COPYBOOK"]
RETURN path
What I've noticed is that when I "... RETURN n_four" in my query it is including nodes that are at the third level as well.
This behavior is caused by your (probably inappropriate) use of [*0..] in your MATCH pattern.
FYI:
[*0..] matches 0 or more relationships. For instance, (a)-[*0..]->(b) would succeed even if a and b are the same node (and there is no relationship from that node back to itself).
The default lower bound is 1. So [*] is equivalent to [*..] and [*1..].
Your 2 queries use the same MATCH pattern, ending in ...->(n_three)-[*0..]->(n_four).
Your first query does not specify any WHERE tests for n_four, so the query is free to return paths in which n_three and n_four are the same node. This lack of specificity is why the query is able to return 2 extra nodes.
Your second query specifies WHERE tests for n_four that make it impossible for n_three and n_four to be the same node. The query is now more picky, and so those 2 extra nodes are no longer returned.
You should not use [*0..] unless you are sure you want to optionally match 0 relationships. It can also add unnecessary overhead. And, as you now know, it also makes the query a bit trickier to understand.
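The effect of the lower bound can be sketched outside the database. The following Python snippet is only an illustration of the idea, not how Neo4j evaluates patterns. The `edges` dictionary is a hypothetical chain built from node names in the question, and `var_length_targets` is a made-up helper that returns every node reachable over at least `min_hops` directed relationships (an acyclic graph is assumed).

```python
def var_length_targets(edges, start, min_hops):
    """Nodes reachable from `start` via at least `min_hops` directed edges."""
    found = set()
    stack = [(start, 0)]
    while stack:
        node, hops = stack.pop()
        if hops >= min_hops:
            found.add(node)
        for nxt in edges.get(node, []):
            stack.append((nxt, hops + 1))
    return found


# Hypothetical chain: JOB -> PROC -> INPA -> TAB1
edges = {"JOB": ["PROC"], "PROC": ["INPA"], "INPA": ["TAB1"]}

zero_or_more = sorted(var_length_targets(edges, "INPA", 0))  # like [*0..]
one_or_more = sorted(var_length_targets(edges, "INPA", 1))   # like [*1..]
print(zero_or_more)
print(one_or_more)
```

With a lower bound of 0 the start node `INPA` is itself among the matches for the end of the pattern, which is how n_three and n_four can end up being the same node.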

Cypher query - Property value change propagation

Hi,
In the above graph, we have a scenario where, when the value property of any one node is updated, the effect of that change has to be propagated to the remaining nodes. How should this value change be propagated through a Cypher query?
Appreciate your support
Is the requirement that this particular property should always be the same for this group of nodes? If it must be the same, then I would recommend extracting it into a node instead, and create relationships to that node from all nodes that should be using it.
With the value in a single place, it will only require a single property change on that node and everything will be in the right state.
EDIT
Requirements are rather fuzzy, so my answer will be fuzzy as well.
If you're matching based upon relationship types, then you'll want some kind of multiplicity on the relationship and maybe specifying allowed types in the match. Such as:
MATCH (start:RNode)-[:R45|R34|R23|R12*]->(r:RNode)
WHERE start.ID = 123 // or however you're matching on your start node
That will match on every single node from your startNode up the relationship chain until there are no more of the allowed relationships to continue traversing.
If you need a more complicated expansion, you may want to look at the APOC Procedure library's Path Expander.
After you find the right matching query, then it should just be a matter of doing the recalculation for all matched nodes.
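As a rough illustration of "find the affected nodes, then recalculate", here is a small Python sketch of the same traversal over an in-memory edge list. The relationship type names come from the query above. The `edges` layout, the node ids, the `values` dictionary and the helper `downstream` are hypothetical, and copying the start value is only a placeholder for whatever recalculation is actually required.

```python
from collections import deque

ALLOWED = {"R12", "R23", "R34", "R45"}


def downstream(edges, start, allowed=ALLOWED):
    """Nodes reachable from `start` over one or more edges of an allowed type."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for rel_type, target in edges.get(node, []):
            if rel_type in allowed and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical graph: node id -> list of (relationship type, target id)
edges = {
    1: [("R12", 2)],
    2: [("R23", 3), ("OTHER", 9)],  # OTHER is not an allowed type
    3: [("R34", 4)],
}
values = {1: 10, 2: 0, 3: 0, 4: 0, 9: 0}

affected = downstream(edges, 1)
for node in affected:
    values[node] = values[1]  # placeholder recalculation

print(sorted(affected), values)
```

Node 9 is left untouched because it is only reachable through a relationship type that is not in the allowed set.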

Neo4j labels and properties, and their differences

Say we have a Neo4j database with several 50,000 node subgraphs. Each subgraph has a root. I want to find all nodes in one subgraph.
One way would be to recursively walk the tree. It works but can be thousands of trips to the database.
One way is to add a subgraph identifier to each node:
MATCH(n {subgraph_id:{my_graph_id}}) return n
Another way would be to relate each node in a subgraph to the subgraph's root:
MATCH(n)-[]->(root:ROOT {id: {my_graph_id}}) return n
This feels more "graphy" if that matters. Seems expensive.
Or, I could add a label to each node. If {my_graph_id} was "BOBS_QA_COPY" then
MATCH(n:BOBS_QA_COPY) return n
would scoop up all the nodes in the subgraph.
My question is when is it appropriate to use a garden-variety property, add relationships, or set a label?
Setting a label to identify a particular subgraph makes me feel weird, like I am abusing the tool. I expect labels to say what something is, not which instance of something it is.
For example, if we were graphing car information, I could see having parts labeled "FORD EXPLORER". But I am less sure that it would make sense to have parts labeled "TONYS FORD EXPLORER". Now, I could see (USER id:"Tony") having a relationship to a FORD EXPLORER graph...
I may be having a bout of "SQL brain"...
Let's work this through, step by step.
If there are N non-root nodes, adding an extra N ROOT relationships makes the least sense. It is very expensive in storage, it will pollute the data model with relationships that don't need to be there and that can unnecessarily complicate queries that want to traverse paths, and it is not the fastest way to find all the nodes in a subgraph.
Adding a subgraph ID property to every node is also expensive in storage (but less so), and would require either: (a) scanning every node to find all the nodes with a specific ID (slow), or (b) using an index, say, :Node(subgraph_id) (faster). Approach (b), which is preferable, would also require that all the nodes have the same Node label.
But wait, if approach (b) already requires all nodes to be labelled, why don't we just use a different label for each subgraph? By doing that, we don't need the subgraph_id property at all, and we don't need an index either! And finding all the nodes with the same label is fast.
Thus, using a per-subgraph label would be the best option.

Cypher: Ordering Nodes on Same Level by Property on Relationship

I am new to Neo4j and currently playing with this tree structure:
The numbers in the yellow boxes are a property named order on the relationship CHILD_OF.
My goal was
a) to manage the sorting order of nodes at the same level through this property rather than through directed relationships (like e.g. LEFT, RIGHT or IS_NEXT_SIBLING, etc.).
b) being able to use plain integers instead of complete paths for the order property (i.e. not maintaining something like 0001.0001.0002).
I can't however find the right hint on how or if it is possible to recursively query the graph so that it keeps returning the nodes depth-first but for the sorting at each level consider the order property on the relationship.
I expect that if it is possible it might include matching the complete path iterating over it with the collection utilities of Cypher, but I am not even close enough to post some good starting point.
Question
What I'd expect from answers to this question is not necessarily a solution, but a hint on whether this is a bad approach that would perform badly anyways. In terms of Cypher I am interested if there is a practical solution to this.
I have a general idea on how I would tackle it as a Neo4j server plugin with the Java traversal or core api (which doesn't mean that it would perform well, but that's another topic), so this question really targets the design and Cypher aspect.
This might work:
MATCH path = (n:Root {id:9})-[:CHILD_OF*]->(m)
// extract() and rels() are not available in newer Neo4j versions, so use a list comprehension and relationships()
WITH path, m, [r IN relationships(path) | r.order] AS orders
ORDER BY orders
RETURN m
If it complains about sorting arrays, then compute a number in which each digit (or pair of digits) is one of your order values, and order by that number:
MATCH path = (n:Root {id:9})-[:CHILD_OF*]->(m)
WITH path, m, reduce(a=1, r IN relationships(path) | a*10+r.order) AS orders
ORDER BY orders
RETURN m
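The two sort keys behave differently when paths have different lengths, which is easy to check with a small Python sketch. The node names and order values below are hypothetical, and the sketch only shows how the keys compare in Python. Whether ORDER BY on a list compares element by element in your Neo4j version is worth verifying separately. `encode` reproduces the reduce expression above, and `encode_padded` is one possible workaround, assuming single-digit order values starting at 1.

```python
from functools import reduce

# Hypothetical tree: node name -> order values along the path from the root
paths = {
    "A": [1],
    "A1": [1, 1],
    "A2": [1, 2],
    "B": [2],
    "B1": [2, 1],
}


def encode(orders):
    # Same arithmetic as reduce(a=1, r in rels(path) | a*10+r.order)
    return reduce(lambda a, order: a * 10 + order, orders, 1)


def encode_padded(orders, depth):
    # Pad shorter paths with zeros so every key has the same number of digits.
    return encode(orders + [0] * (depth - len(orders)))


max_depth = max(len(orders) for orders in paths.values())

by_list = sorted(paths, key=lambda name: paths[name])
by_number = sorted(paths, key=lambda name: encode(paths[name]))
by_padded = sorted(paths, key=lambda name: encode_padded(paths[name], max_depth))

print(by_list)    # element-wise comparison of the order lists
print(by_number)  # plain numeric key
print(by_padded)  # numeric key padded to a common depth
```

In this sample the list key and the padded key give a depth-first order (A, A1, A2, B, B1), while the unpadded number sorts all shorter paths first (A, B, A1, A2, B1), so the numeric variant only helps when every key is padded to the same depth.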

Implementing a 'greedy' match to find the extent of a subtree in Cypher

I have a graph that contains many 'subtrees' of items where an original item can be cloned which results in
(clone:Item)-[:clones]->(original:Item)
and a cloned item can also be cloned:
(newclone:Item)-[:clones]->(clone:Item)
the first item is created by a user:
(:User)-[:created]->(:Item)
and the clones are collected by a user:
(:User)-[:collected]->(:Item)
Given any item in the tree, I want to be able to match all the items in the tree. I'm using:
(1) match (known:Item)-[:clones*]-(others:Item)
My understanding is that this implements a 'greedy' match, traversing the tree in all directions, matching all items.
In general this works, however in some circumstances it doesn't seem to match all the items in the tree. For example, in the following query, this doesn't seem to be matching the whole subtree.
match p = (known:Item)-[r:clones*]-(others:Item) where not any(x in nodes(p) where (x)<-[:created]-(:User)) return p
Here I'm trying to find subtrees which are missing a 'created' Item (which were deleted in the source SQL database).
What I'm finding is that it is giving me false positives because it's matching only part of a particular tree. For example, if there is a tree with 5 items structured properly as described above, it seems (in some cases) to be matching a subset of the tree (maybe 2 out of 5 items), and that subset doesn't contain the created card and so is returned by the query when I didn't expect it to be.
Question
Is my logic correct or am I misunderstanding something? I'm suspecting that I'm misunderstanding paths, but I'm confused by the fact that the basic 'greedy' match works in most cases.
I think that my problem is that I've been confused because the query is finding multiple paths in the tree, some of which satisfy the test in the query and some don't. When viewed in the neo4j visualisation, the multiple paths are consolidated into what looks like the whole tree whereas the tabular results show that the match (1) above actually gives multiple paths.
I'm now thinking that I should be using collections rather than paths for this.
You are quite right that the query matches more paths than what is apparent in the browser visualization. The query is greedy in the sense that it has no upper bound for depth, but it also has no lower bound (well, strictly the lower bound is 1), which means it will emit a short path and a longer path that includes it if there are such. For data like
CREATE
(u)-[:CREATED]->(i)<-[:CLONES]-(c1)<-[:CLONES]-(c2)
the query will match paths
i<--c1
i<--c1<--c2
c1<--c2
c2-->c1
c2-->c1-->i
c1-->i
Of these paths, only the ones containing i are filtered out by the condition NOT x<-[:CREATED]-(), leaving the paths
c1<--c2
c2-->c1
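The enumeration above can be reproduced with a short Python sketch. It is an illustration only. The helper `undirected_paths` is hypothetical and simply lists every path of one or more relationships that uses each relationship at most once, ignoring direction, over the three-item example data.

```python
def undirected_paths(edges):
    """All paths of length >= 1 using each edge at most once, ignoring direction."""
    adjacency = {}
    for index, (src, dst) in enumerate(edges):
        adjacency.setdefault(src, []).append((index, dst))
        adjacency.setdefault(dst, []).append((index, src))

    paths = []

    def extend(path, used):
        for index, neighbour in adjacency[path[-1]]:
            if index not in used:
                longer = path + [neighbour]
                paths.append(longer)
                extend(longer, used | {index})

    for start in adjacency:
        extend([start], frozenset())
    return paths


clones = [("c1", "i"), ("c2", "c1")]  # (clone, original) pairs
created = {"i"}                       # items with an incoming CREATED relationship

all_paths = undirected_paths(clones)
kept = [p for p in all_paths if not any(node in created for node in p)]

print(len(all_paths))
print(kept)
```

The sketch finds six paths, and the two that survive the filter are the ones between c1 and c2, even though the tree as a whole does contain a created item.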
You need a further condition in your pattern before that filter, a condition such that each path that passes it should contain some node x where x<-[:CREATED]-(). That way that filter condition is unequivocal. Judging from the example model/data in your question, you could try matching all directed variable depth (clone)-[:CLONES]->(cloned) paths, where the last cloned does not itself clone anything. That last cloned should be a created item, so each path found can now be expected to contain a b<-[:CREATED]-(). That is, if created items don't clone anything, something like this should work
MATCH (a)-[:CLONES*]->(b)
WHERE NOT (b)-[:CLONES]->()
AND NOT (b)<-[:CREATED]-()
This relies on only matching paths where a particular node in each path can be expected to be created. An alternative is to work on each whole tree by itself: get a single pointer into the tree, and test the entire tree for any created item node. Seen this way, the problem with your query is that it treats c1<--c2 as if it were a full tree, and the solution is a pattern that only matches once per tree. You can then collect the nodes of the tree with the variable depth match from there. You can get such a pointer in different ways; the easiest is perhaps to provide a discriminating property to find a specific node and collect all the items in that node's tree. Perhaps something like
MATCH (i {prop:"val"})-[:CLONES*]-(c)
WITH i, collect(distinct c) as cc
WHERE NOT (
(i)<-[:CREATED]-() OR
ANY (c IN cc WHERE (c)<-[:CREATED]-())
) //etc
This is not a generic query, however, since it only works on the one tree of the one node. If you have a property pattern that is unique per tree, you can use that. You can also model your data so that each tree has exactly one relationship to a containing 'forest'.
MATCH (forest)-[:TREE]->(tree)-->(item)-[:CLONES*]-(c) // etc
If your [:COLLECTED] or some other relationship, or a combination of relationships and properties make a unique pattern per tree, these can also be used.
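The "test the whole tree once" idea can be sketched in Python as follows. This is only an in-memory illustration under assumed data. `tree_of` and `missing_creator` are hypothetical helper names, and the `clones` pairs and `created` set stand in for the CLONES and CREATED relationships.

```python
def tree_of(item, clones):
    """All items connected to `item` through clone relationships, in either direction."""
    neighbours = {}
    for clone, original in clones:
        neighbours.setdefault(clone, set()).add(original)
        neighbours.setdefault(original, set()).add(clone)
    seen, stack = {item}, [item]
    while stack:
        for nxt in neighbours.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def missing_creator(item, clones, created):
    """True when no item in the whole tree of `item` was created by a user."""
    return not any(node in created for node in tree_of(item, clones))


# Hypothetical data: one healthy tree (i, c1, c2) and one whose created item is gone (d0, d1)
clones = [("c1", "i"), ("c2", "c1"), ("d1", "d0")]
created = {"i"}

print(missing_creator("c2", clones, created))
print(missing_creator("d1", clones, created))
```

Because the check runs over the collected set of items rather than over individual paths, starting from c2 no longer produces a false positive.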
