I have this kind of data model in the db:
(a)<-[:has_parent]-(b)<-[:has_parent]-(c)<-[:has_parent]-(...)
Every parent can have multiple children, and this can go on to an unknown number of levels.
I want to find these values for every node:
the number of descendants it has
the depth [distance from the node] of every descendant
the creation time of every descendant
I also want to rank the returned nodes based on these values. Right now, with no optimization, the query runs very slowly (especially as the number of descendants increases).
The Questions:
What can I do in the model to make the query performant (indexing, data structure, ...)?
What can I do in the query?
What can I do anywhere else?
edit:
the query starts from a specific node using START or MATCH
to clarify:
a. the query may start from any point in the hierarchy, not just the root node
b. every node under the starting node is returned, ranked by the total number of descendants it has, the distance (from the returned node) of every descendant, and the timestamp of every descendant.
c. by descendant I mean everything under it, not just its direct children
for example,
here's a sample graph:
http://console.neo4j.org/r/awk6m2
First you need to know how to find the root node. The following statement finds the nodes that have no outbound parent relationship. Be aware that this statement is potentially expensive in a large graph.
MATCH (n)
WHERE NOT ((n)-[:has_parent]->())
RETURN n
Instead you should use an index to find that node:
MATCH (n:Node {name:'abc'})
Starting with our root node, we traverse the inbound parent relationships with variable depth. For each node traversed we calculate the number of children. Since this might be zero, an OPTIONAL MATCH is used:
MATCH (root:Node) // line 1-3 to find root node, replace by index lookup
WHERE NOT ((root)-[:has_parent]->())
WITH root
MATCH p =(root)<-[:has_parent*]-() // variable path length match
WITH last(nodes(p)) AS currentNode, length(p) AS currentDepth
OPTIONAL MATCH (currentNode)<-[:has_parent]-(c) // traverse children
RETURN currentNode, currentNode.created, currentDepth, count(c) AS countChildren
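The query above counts direct children only. As a rough illustration of the three values the question asks for (number of descendants, depth of every descendant, creation time of every descendant), here is a minimal Python sketch over an in-memory copy of the hierarchy. The names `parent_of`, `created` and `descendant_stats` are made up for this example, and the sample data is hypothetical; it says nothing about how Neo4j evaluates the Cypher.

```python
from collections import defaultdict, deque


def descendant_stats(parent_of, created, start):
    """For start and every node under it, list (descendant, depth, created).

    parent_of maps child -> parent (one has_parent edge per child);
    created maps node -> creation timestamp. Both are hypothetical inputs.
    """
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    def below(node):
        # Breadth-first walk so each descendant gets its distance from node.
        found = []
        queue = deque((c, 1) for c in children[node])
        while queue:
            current, depth = queue.popleft()
            found.append((current, depth, created[current]))
            queue.extend((c, depth + 1) for c in children[current])
        return found

    stats = {start: below(start)}
    for node, _, _ in stats[start]:
        stats[node] = below(node)
    return stats


parent_of = {"b": "a", "c": "b", "d": "b"}
created = {"a": 10, "b": 20, "c": 30, "d": 40}
stats = descendant_stats(parent_of, created, "a")
# Rank nodes by how many descendants they have, largest first.
ranking = sorted(stats, key=lambda n: len(stats[n]), reverse=True)
```

The sketch walks the subtree once per node, so it is only meant to pin down what is being computed, not to suggest an efficient plan.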
Related
I'm using cypher and neo4j
I have a big dataset of parent and child relations as
(:Person)-[:PARENT_OF*]-(:Person)
I need to get the family tree with only 5 children(nodes) on each level of the tree
I've tried:
MATCH path = (jon:Person )-[:PARENT_OF*]-(:Person)
WITH collect(path) as paths
CALL apoc.convert.toTree(paths) yield value
RETURN value;
It's returning the whole tree structure. I've tried limiting the nodes with LIMIT, but it isn't working properly.
I guess you have to filter out the paths first. My approach would be to make sure that all children in the path are among the first 5 child nodes of the previous parent. I don't have a dataset ready to test it, but it could be along the lines of this:
MATCH path = (jon:Person )-[:PARENT_OF*]->(leaf:Person)
// limit the number of paths, by only considering the ones that are not parent of someone else.
WHERE NOT (leaf)-[:PARENT_OF]->(:Person)
// and all nodes in the path (except the first one, the root) be in the first five children of the parent
AND
ALL(child in nodes(path)[1..] WHERE child IN [(sibling)<-[:PARENT_OF]-(parent)-[:PARENT_OF]->(child) | sibling][..5])
WITH collect(path) as paths
CALL apoc.convert.toTree(paths) yield value
RETURN value
Another approach, perhaps faster, would be to first collect the first five children of each of jon's descendants:
// find all sets of "firstFiveChildren"
MATCH (jon:Person { name:'jon'}),
(p:Person)-[:PARENT_OF]->(child)
WHERE EXISTS((jon)-[:PARENT_OF*]->(p))
WITH jon,p,COLLECT(child)[..5] AS firstFiveChildren
// create a unique list of the persons that could be part of the tree
WITH jon,apoc.coll.toSet(
apoc.coll.flatten(
[jon]+COLLECT(firstFiveChildren)
)
) AS personsInTree
MATCH path = (jon)-[:PARENT_OF*]->(leaf:Person)
WHERE NOT (leaf)-[:PARENT_OF]->(:Person)
AND ALL(node in nodes(path) WHERE node IN personsInTree)
WITH collect(path) as paths
CALL apoc.convert.toTree(paths) yield value
RETURN value;
UPDATE
The issue with the data is that the tree is not symmetric, i.e. not all the paths have the same depth. Node d0, for instance, has no children. So if you pick five children at the first level, you may not get any deeper.
I added a slightly different approach, that should work with symmetric trees, and which allows you to set the maximum number of children per node. Try it with 3, and you will see that you only get nodes from the first level; with 8 you get more.
// find all sets of "firstChildren"
WITH 8 AS numberChildren
MATCH (jon:Person { name:'00'}),
(p:Person)-[:PARENT_OF]->(child)
WHERE EXISTS((jon)-[:PARENT_OF*0..]->(p))
WITH jon,p,COLLECT(child)[..numberChildren] AS firstChildren
// create a unique list of the persons that could be part of the tree
WITH jon,apoc.coll.toSet(
apoc.coll.flatten(
[jon]+COLLECT(firstChildren)
)
) AS personsInTree
MATCH path = (jon)-[:PARENT_OF*]->(leaf:Person)
WHERE NOT (leaf)-[:PARENT_OF]->(:Person)
AND ALL(node in nodes(path) WHERE node IN personsInTree)
WITH collect(path) as paths
CALL apoc.convert.toTree(paths) yield value
RETURN value
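To make the effect of the per-node limit easier to see, here is a small Python sketch of the same idea on an in-memory tree. The function names and the sample data are hypothetical, and "first" simply means list order here; in Cypher, the order of COLLECT is not guaranteed unless you sort before collecting.

```python
def first_k_tree(children, root, k):
    """Nested dict of root's tree keeping at most k children per node.

    children maps a person to an ordered list of their children
    (a hypothetical stand-in for the PARENT_OF relationships).
    """
    kept = children.get(root, [])[:k]
    return {"name": root, "children": [first_k_tree(children, c, k) for c in kept]}


def levels_below(tree):
    # Number of levels under the root that survived the pruning.
    if not tree["children"]:
        return 0
    return 1 + max(levels_below(c) for c in tree["children"])


# Only the fourth child has children of its own, as in an asymmetric tree.
children = {"00": ["d0", "d1", "d2", "d3"], "d3": ["e0"]}
shallow = first_k_tree(children, "00", 3)
deeper = first_k_tree(children, "00", 4)
```

With a limit of 3 the only child that has children of its own is cut, so the tree stops at the first level; with a limit of 4 the second level appears.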
I am working with bill of materials (BOM) and part data in a Neo4J database.
There are 3 types of nodes in my graph:
(ItemUsageInstance) these are the elements of the bill of materials tree
(Item) one exists for each unique item on the BOM tree
(Material)
The relationships are:
(ItemUsageInstance)-[CHILD_OF]->(ItemUsageInstance)
(ItemUsageInstance)-[INSTANCE_OF]->(Item)
(Item)-[MADE_FROM]->(Material)
The schema is pictured below:
Here is a simplified picture of the data. (Diagram with nodes repositioned to enhance visibility):
What I would like to do is find subtrees of adjacent ItemUsageInstances whose Items are all made from the same Material.
The query I have so far is:
MATCH (m:Material)
MATCH (m)<-[:MADE_FROM]-(i1:Item)<-[]-(iui1:ItemUsageInstance)-[:CHILD_OF]->(iui2:ItemUsageInstance)-[]->(i2:Item)-[:MADE_FROM]->(m)
RETURN iui1, i1, iui2, i2, m
However, this only returns one such subtree, the adjacent nodes in the middle of the graph that have a common Material of "M0002". Also, the rows of the results are separate entries, one for each parent-child pair in the subtree:
╒══════════════════════════╤══════════════════════╤══════════════════════════╤══════════════════════╤═══════════════════════╕
│"iui1" │"i1" │"iui2" │"i2" │"m" │
╞══════════════════════════╪══════════════════════╪══════════════════════════╪══════════════════════╪═══════════════════════╡
│{"instance_id":"inst5002"}│{"part_number":"p003"}│{"instance_id":"inst7003"}│{"part_number":"p004"}│{"material_id":"M0002"}│
├──────────────────────────┼──────────────────────┼──────────────────────────┼──────────────────────┼───────────────────────┤
│{"instance_id":"inst7002"}│{"part_number":"p003"}│{"instance_id":"inst7003"}│{"part_number":"p004"}│{"material_id":"M0002"}│
├──────────────────────────┼──────────────────────┼──────────────────────────┼──────────────────────┼───────────────────────┤
│{"instance_id":"inst7001"}│{"part_number":"p002"}│{"instance_id":"inst7002"}│{"part_number":"p003"}│{"material_id":"M0002"}│
└──────────────────────────┴──────────────────────┴──────────────────────────┴──────────────────────┴───────────────────────┘
I was expecting a second subtree, which happens to also be a linked list, to be included. This second subtree consists of ItemUsageInstances inst7006, inst7007, inst7008 at the far right of the graph. For what it's worth, not only are these adjacent instances made from the same Material, they are all instances of the same Item.
I confirmed that every ItemUsageInstance node has an [INSTANCE_OF] relationship to an Item node:
MATCH (iui:ItemUsageInstance) WHERE NOT (iui)-[:INSTANCE_OF]->(:Item) RETURN iui
(returns 0 records).
Also confirmed that every Item node has a [MADE_FROM] relationship to a Material node:
MATCH (i:Item) WHERE NOT (i)-[:MADE_FROM]->(:Material) RETURN i
(returns 0 records).
Confirmed that inst7008 is the only ItemUsageInstance without an outgoing [CHILD_OF] relationship.
MATCH (iui:ItemUsageInstance) WHERE NOT (iui)-[:CHILD_OF]->(:ItemUsageInstance) RETURN iui
(returns 1 record: {"instance_id":"inst7008"})
inst5000 and inst7001 are the only ItemUsageInstances without an incoming [CHILD_OF] relationship
MATCH (iui:ItemUsageInstance) WHERE NOT (iui)<-[:CHILD_OF]-(:ItemUsageInstance) RETURN iui
(returns 2 records: {"instance_id":"inst7001"} and {"instance_id":"inst5000"})
I'd like to collect/aggregate the results so that each row is a subtree. I saw this example of how to collect() and got the array method to work. But it still has duplicate ItemUsageInstances in it. (The "map of items" discussed there failed completely...)
Any insights as to why my query is only finding one subtree of adjacent item usage instances with the same material?
What is the best way to aggregate the results by subtree?
Finding the roots is easy. Since children point to their parents, a root has no outgoing CHILD_OF relationship: MATCH (root:ItemUsageInstance) WHERE NOT (root)-[:CHILD_OF]->()
And for the children, you can include the root by specifying a min distance of 0 (default is 1).
MATCH p=(root)<-[:CHILD_OF*0..25]-(ins), (m:Material)<-[:MADE_FROM]-(:Item)<-[:INSTANCE_OF]-(ins)
And then assuming only one item-material per instance, aggregate everything based on material (You can't aggregate in an aggregate, so use WITH to get the depth before collecting the depth with the node)
WITH ins, SIZE(NODES(p)) as depth, m RETURN COLLECT({node:ins, depth:depth}) as instances, m as material
So, all together
MATCH (root:ItemUsageInstance),
p=(root)<-[:CHILD_OF*0..25]-(ins),
(m:Material)<-[:MADE_FROM]-(:Item)<-[:INSTANCE_OF]-(ins)
WHERE NOT ()<-[:CHILD_OF]-(root)
AND NOT (m:Material)<-[:MADE_FROM]-(:Item)<-[:INSTANCE_OF]-()<-[:CHILD_OF]-(ins)
MATCH p2=(ins)<-[:CHILD_OF*1..25]-(cins)
WHERE ALL(n in NODES(p2) WHERE (m)<-[:MADE_FROM]-(:Item)<-[:INSTANCE_OF]-(n))
WITH ins, cins, SIZE(NODES(p2)) as depth, m ORDER BY depth ASC
RETURN ins as collection_head, ins+COLLECT(cins) as instances, m as material
In your pattern, you don't account for situations like the link between inst_5001 and inst_7001. Inst_5001 doesn't have any links to any part usages, but your match pattern requires that both usages have such a link. I think this is where you're going off track. You find the inst_5002 tree because it happens to have a link to a usage, as your pattern requires.
In terms of "aggregating by subtree", I would return the ID of the root of the tree (e.g. id(iui1)) and then count(*) the rest, to show how many subtrees a given root participates in.
Here is my heavily edited query:
MATCH path = (cinst:ItemUsageInstance)-[:CHILD_OF*1..]->(pinst:ItemUsageInstance), (m:Material)<-[:MADE_FROM]-(:Item)<-[:INSTANCE_OF]-(pinst)
WHERE ID(cinst) <> ID(pinst) AND ALL (x in nodes(path) WHERE ((x)-[:INSTANCE_OF]->(:Item)-[:MADE_FROM]->(m)))
WITH nodes(path) as insts, m
UNWIND insts AS instance
WITH DISTINCT instance, m
RETURN collect(instance), m
It returns what I was expecting:
╒═════════════════════════════════════════════════════════════════════════════════════════════════════════════╤═══════════════════════╕
│"collect(instance)" │"m" │
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════╪═══════════════════════╡
│[{"instance_id":"inst7002"},{"instance_id":"inst7003"},{"instance_id":"inst7001"},{"instance_id":"inst5002"}]│{"material_id":"M0002"}│
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────┤
│[{"instance_id":"inst7007"},{"instance_id":"inst7008"},{"instance_id":"inst7006"}] │{"material_id":"M0001"}│
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────┘
The one limitation is that it does not distinguish the root of the subtree from the children. Ideally the list of {"instance_id"} would be sorted by depth in the tree.
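For the depth ordering mentioned above, the grouping can be sketched outside the database. The following Python is only an illustration under assumptions: `child_of` and `material` are hypothetical in-memory maps (instance to parent, instance to material id), each instance has exactly one material, and the instance names are invented.

```python
from collections import defaultdict


def same_material_subtrees(child_of, material):
    """Group adjacent instances sharing a material, head of each group first.

    child_of maps instance -> parent instance (CHILD_OF);
    material maps instance -> material id (via INSTANCE_OF and MADE_FROM).
    """
    children = defaultdict(list)
    for child, parent in child_of.items():
        children[parent].append(child)

    groups = []
    for head in material:
        parent = child_of.get(head)
        if parent is not None and material[parent] == material[head]:
            continue  # not a head: its parent already shares the material
        # Walk down while the material stays the same, recording the depth.
        members, stack = [], [(head, 0)]
        while stack:
            node, depth = stack.pop()
            members.append((depth, node))
            stack.extend(
                (c, depth + 1) for c in children[node]
                if material[c] == material[head]
            )
        if len(members) > 1:  # ignore instances with no same-material neighbour
            groups.append((material[head], [n for _, n in sorted(members)]))
    return groups


child_of = {"i2": "i1", "i3": "i2", "i4": "i1", "i5": "i4"}
material = {"i1": "M1", "i2": "M2", "i3": "M2", "i4": "M1", "i5": "M3"}
groups = same_material_subtrees(child_of, material)
```

Each group starts at the instance whose parent has a different material (or no parent), and the members are sorted by their distance from that head.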
I have a set of (n) values which all have corresponding nodes in my graph. Their relationships to each other are initially unknown (see start nodes in blue).
What I want to find, as simply as possible, is whether any of the values/nodes are children of any of the others, and then apply these rules to filter the results:
If the node is a child then discard it. (white nodes)
If the node is a root then return it. (green nodes)
If the node does not have any children also return it. (green node 673)
There can be up to 50 starting nodes. I've tried iterating through them comparing two at a time discarding them if they are a child - but the number of iterations quickly gets out of hand in larger sets. I'm hoping there is some graph magic I've overlooked. Cypher please!
Thanks!
Let's say that you have an input parameter nids, a set of values for the id property of the nodes; the target nodes have the label Node, and the relationship between nodes is of type hasChild.
Then you need to find the nodes that correspond to the input set and that have no parents among the nodes corresponding to the input set:
UNWIND {nids} as nid
MATCH (N:Node {id: nid})
OPTIONAL MATCH (N:Node {id: nid})<-[:hasChild]-(P:Node) WHERE P.id IN {nids}
WITH N, collect(P) AS ts WHERE size(ts) = 0
RETURN N
And do not forget to add an index to the id property for the node:
CREATE INDEX ON :Node(id)
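The filtering rule in the query can be written down in a few lines of Python, which may make it easier to check against a small example. This is a sketch only: `top_level` and `parents_of` are hypothetical names, and the sample ids (apart from 673, taken from the question) are invented.

```python
def top_level(nids, parents_of):
    """Keep the ids from nids that have no direct parent inside nids.

    parents_of maps id -> set of parent ids (hasChild edges, reversed).
    """
    wanted = set(nids)
    return [n for n in nids if not (parents_of.get(n, set()) & wanted)]


parents_of = {2: {1}, 3: {2}, 673: {99}}
kept = top_level([1, 2, 673], parents_of)
# Like the query, this looks at direct parents only: 3 is kept here
# because its parent 2 is not part of the input set.
kept_grandchild = top_level([1, 3, 673], parents_of)
```

If deeper descendants should be discarded too, the Cypher pattern would presumably need a variable-length relationship such as <-[:hasChild*]- instead of a single hop.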
In my application, I am trying to calculate the betweenness centrality of nodes. Here is the
Betweenness Centrality Calculation Formula. The denominator is just the count of relationships between two nodes, but the numerator is the number of those in which a specific node occurs. So how can I find how often a node occurs in the relation?
For example, as a result of this Cypher:
MATCH p=allShortestPaths( (u1{name:1174}) - [*..20] - (u2{name:1179}) ) return p
A graph between 2 specific nodes. How can I find the transition count of node 1204 on the relationships between nodes 1174 and 1179?
In short, there are shortest paths from node A to node B. How many of them include node C?
You can use the nodes() function on the path variable to get a collection of nodes for the given path. From there you can check if a specific previously matched node is present. Here's one way to do it:
MATCH (n{name:1204})
MATCH p=allShortestPaths( (u1{name:1174}) - [*..20] - (u2{name:1179}) )
WITH p, case when n in nodes(p) then 1 else 0 end as occurrences
RETURN count(p) as shortestPathCount, sum(occurrences) as occurrencesCount
Also, it's recommended you use labels in your graph and in your queries. Currently this is performing an all nodes scan to find any node with the given name, and that gets less and less efficient as the number of nodes grows. If you add a label on your nodes and use them in your match, then you'll at least be performing a label scan instead. If you create an index on the label/property then you'll be doing index lookups for starting nodes in your query.
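To check the two counts on a toy graph, the same calculation can be sketched in plain Python. This is an illustration under assumptions, not what allShortestPaths does internally: the adjacency map is hypothetical (node 1300 is invented), the graph is treated as undirected, and all shortest paths are enumerated explicitly, which is only practical for small graphs.

```python
from collections import deque


def shortest_path_counts(adj, source, target, via):
    """Return (number of shortest source-target paths, how many contain via)."""
    # Breadth-first search gives every node's distance from source.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    if target not in dist:
        return 0, 0

    def paths_to(node):
        # Walk back from node along edges that reduce the distance by one.
        if node == source:
            return [[source]]
        return [
            path + [node]
            for prev in adj[node]
            if dist.get(prev) == dist[node] - 1
            for path in paths_to(prev)
        ]

    paths = paths_to(target)
    return len(paths), sum(1 for path in paths if via in path)


adj = {
    1174: [1204, 1300],
    1204: [1174, 1179],
    1300: [1174, 1179],
    1179: [1204, 1300],
}
total, through = shortest_path_counts(adj, 1174, 1179, 1204)
```

Here `total` corresponds to shortestPathCount and `through` to occurrencesCount in the query above.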
I have a bunch of nodes that "refer" to other nodes. The nodes referred to (refer_to being the relationship) may then have a relationship called changed_to to another node. Nodes reached through a changed_to relationship may in turn have a changed_to relationship to yet another node. I want to return the nodes referred to, but also the nodes that the referred nodes were changed into. I tried a query that returns the referred-to nodes, combined via UNION with an OPTIONAL MATCH of ReferencedNode changed to ResultNode. I don't think this will work, as it will only get me the referenced node plus the first changed-to node and nothing after that, assuming it would work at all to begin with. How can I write a query with the described behavior?
Edit:
This is an example of the relationships involved. I would like to return the referenced node and the node that the referenced node ultimately became, with some indicator showing that it ultimately became that node.
Can you give some examples of queries you've tried? Here's what I'm thinking:
MATCH path=(n)-[:refer_to]->(o)-[:changed_to*1..5]->(p)
WHERE n.id = {start_id}
RETURN nodes(path), relationships(path)
Of course I don't know if you have an id property so that might need to change. Also you'll be passing in a start_id parameter here.
If you want to return the "references" node and the last "changed_to" node if there is one, you can first match the relationship you know is there, then optionally match at variable depth the path that may be there. If there is more than one "changed_to" relationship you will have more than one result item at this point. If you want all the "changed_to" nodes you can return now, but if you want only the last one you can order the result items by path depth descending with a limit of 1 to get the longest path and then return the last node in that path. That query could look something like
MATCH (n)-[:REFERENCES]->(o)
WHERE n.uid = {uid}
OPTIONAL MATCH path=(o)-[:CHANGED_TO*1..5]->(p)
WITH n, o, path
ORDER BY length(path) DESC
LIMIT 1
RETURN n, o, nodes(path)[-1]
This returns the start node, the "references" node and
nothing when there is no "changed_to" node
the one "changed_to" node when there is only one
the last "changed_to" node when there is more than one
You can test the query in this console. It contains these three cases, and you can test them by replacing {uid} above with the values 1, 5 and 8 to get the starting nodes for the three paths.
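The three cases listed above can also be reproduced with a short Python sketch. It assumes each node has at most one outgoing changed_to link and mirrors the *1..5 bound with a hypothetical `max_hops` parameter; the map `changed_to` and the node names are invented for the example.

```python
def final_version(changed_to, node, max_hops=5):
    """Follow changed_to links from node; return the last node reached, or None."""
    last = None
    hops = 0
    while node in changed_to and hops < max_hops:
        node = changed_to[node]
        last = node
        hops += 1
    return last


# "a" was never changed, "c" was changed once, "b" was changed twice.
changed_to = {"b": "c", "c": "d"}
none_case = final_version(changed_to, "a")
single_case = final_version(changed_to, "c")
chain_case = final_version(changed_to, "b")
```

The result is None when the referenced node was never changed, the single successor when there is one link, and the end of the chain otherwise.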