I have a hierarchical structure of nodes, which all have a custom-assigned sorting property (numeric). Here's a simple Cypher query to recreate:
merge (p {my_id: 1})-[:HAS_CHILD]->(c1 { my_id: 11, sort: 100})
merge (p)-[:HAS_CHILD]->(c2 { my_id: 12, sort: 200 })
merge (p)-[:HAS_CHILD]->(c3 { my_id: 13, sort: 300 })
merge (c1)-[:HAS_CHILD]->(cc1 { my_id: 111 })
merge (c2)-[:HAS_CHILD]->(cc2 { my_id: 121 })
merge (c3)-[:HAS_CHILD]->(cc3 { my_id: 131 });
The problem I'm struggling with is that I often need to make decisions based on a child node's rank relative to its parent node, with regard to this sort property. So, for example, node c1 has rank 1 relative to node p (because it has the smallest sort value), c2 has rank 2, and c3 has rank 3 (the biggest sort).
The kind of decision I need to make based on this information: display the children of only the first 2 cX nodes. Here's what I want to get:
cc1 and cc2 are present, but cc3 is not because c3 (its parent) is not the first or the second child of p. Here's a dumb query for that:
match (p {my_id: 1 })-->(c)
optional match (c)-->(cc) where c.sort <= 200
return p, c, cc
The problem is, these sort properties are custom-set and imported, so I have no way of knowing which value will be held for child number 2.
My current solution is to rank the nodes during import, and since I'm using Oracle, that's quite simple: I just need to use the rank window function. But it seems awkward to me, and I feel there could be a more elegant solution. I tried the next query and it works, but it looks weird and it's quite slow on bigger graphs:
match (p {my_id: 1 })-->(c)
optional match (c)-->(cc)
where size([ (p)-->(c1) where c1.sort < c.sort |c1]) < 2
return p, c, cc
Here's the plan for this query and the most expensive part is in fact the size expression:
The slowness you're seeing is likely because you're not performing an index lookup in your query, so it's performing an all nodes scan and accessing the my_id property of every node in your graph to find the one with id 1 (your p node).
You need to add labels on your nodes and use these labels in your queries (at least for your p node), and create an index (or in this case, probably a unique constraint) on the label for my_id so this lookup becomes fast.
You can confirm what's going on by doing a PROFILE of your query (if you can add the profile plan to your description, with all elements of the plan expanded that would help determine further optimizations).
As for your query, something like this should work (I'm using a :Node label as a stand-in for your actual label):
match (p:Node {my_id: 1 })-->(c)
with p, c
order by c.sort asc
with p, collect(c) as children // children are in order
unwind children[..2] as child // one row for each of the first 2 children
optional match (child)-->(cc) // only matched for the first 2 children
return p, children, collect(cc) as grandchildren
Note that this only returns nodes, not paths or relationships. The reason you're getting the result graph in the graphical view is that, in the Browser Settings tab (the gear icon in the lower left menu), you have Connect result nodes checked at the bottom.
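To make the order-then-slice idea concrete outside of Cypher, here is a minimal Python sketch of the same logic on the sample data from the question. The dictionaries and the function name grandchildren_of_first_children are made up for illustration; this models the result of the query, not how Neo4j executes it.

```python
# Hypothetical in-memory copy of the sample graph from the question.
children = {1: [11, 12, 13], 11: [111], 12: [121], 13: [131]}
sort_key = {11: 100, 12: 200, 13: 300}

def grandchildren_of_first_children(parent_id, limit=2):
    """Order the children by sort, keep the first `limit`, collect their children."""
    ordered = sorted(children.get(parent_id, []), key=lambda c: sort_key[c])
    grandchildren = []
    for child in ordered[:limit]:  # plays the role of children[..2]
        grandchildren.extend(children.get(child, []))
    return ordered, grandchildren

ordered, grandchildren = grandchildren_of_first_children(1)
print(ordered, grandchildren)
```

Because the cut-off is positional rather than a fixed sort value, it keeps working when the imported sort values change.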
I am working with bill of materials (BOM) and part data in a Neo4J database.
There are 3 types of nodes in my graph:
(ItemUsageInstance) these are the elements of the bill of materials tree
(Item) one exists for each unique item on the BOM tree
(Material)
The relationships are:
(ItemUsageInstance)-[CHILD_OF]->(ItemUsageInstance)
(ItemUsageInstance)-[INSTANCE_OF]->(Item)
(Item)-[MADE_FROM]->(Material)
The schema is pictured below:
Here is a simplified picture of the data. (Diagram with nodes repositioned to enhance visibility):
What I would like to do is find subtrees of adjacent ItemUsageInstances whose Items are all made from the same Material.
The query I have so far is:
MATCH (m:Material)
WITH m AS m
MATCH (m)<-[:MADE_FROM]-(i1:Item)<-[]-(iui1:ItemUsageInstance)-[:CHILD_OF]->(iui2:ItemUsageInstance)-[]->(i2:Item)-[:MADE_FROM]->(m) RETURN iui1, i1, iui2, i2, m
However, this only returns one such subtree, the adjacent nodes in the middle of the graph that have a common Material of "M0002". Also, the rows of the results are separate entries, one for each parent-child pair in the subtree:
╒══════════════════════════╤══════════════════════╤══════════════════════════╤══════════════════════╤═══════════════════════╕
│"iui1" │"i1" │"iui2" │"i2" │"m" │
╞══════════════════════════╪══════════════════════╪══════════════════════════╪══════════════════════╪═══════════════════════╡
│{"instance_id":"inst5002"}│{"part_number":"p003"}│{"instance_id":"inst7003"}│{"part_number":"p004"}│{"material_id":"M0002"}│
├──────────────────────────┼──────────────────────┼──────────────────────────┼──────────────────────┼───────────────────────┤
│{"instance_id":"inst7002"}│{"part_number":"p003"}│{"instance_id":"inst7003"}│{"part_number":"p004"}│{"material_id":"M0002"}│
├──────────────────────────┼──────────────────────┼──────────────────────────┼──────────────────────┼───────────────────────┤
│{"instance_id":"inst7001"}│{"part_number":"p002"}│{"instance_id":"inst7002"}│{"part_number":"p003"}│{"material_id":"M0002"}│
└──────────────────────────┴──────────────────────┴──────────────────────────┴──────────────────────┴───────────────────────┘
I was expecting a second subtree, which happens to also be a linked list, to be included. This second subtree consists of ItemUsageInstances inst7006, inst7007, inst7008 at the far right of the graph. For what it's worth, not only are these adjacent instances made from the same Material, they are all instances of the same Item.
I confirmed that every ItemUsageInstance node has an [INSTANCE_OF] relationship to an Item node:
MATCH (iui:ItemUsageInstance) WHERE NOT (iui)-[:INSTANCE_OF]->(:Item) RETURN iui
(returns 0 records).
Also confirmed that every Item node has a [MADE_FROM] relationship to a Material node:
MATCH (i:Item) WHERE NOT (i)-[:MADE_FROM]->(:Material) RETURN i
(returns 0 records).
Confirmed that inst7008 is the only ItemUsageInstance without an outgoing [CHILD_OF] relationship.
MATCH (iui:ItemUsageInstance) WHERE NOT (iui)-[:CHILD_OF]->(:ItemUsageInstance) RETURN iui
(returns 1 record: {"instance_id":"inst7008"})
inst5000 and inst7001 are the only ItemUsageInstances without an incoming [CHILD_OF] relationship
MATCH (iui:ItemUsageInstance) WHERE NOT (iui)<-[:CHILD_OF]-(:ItemUsageInstance) RETURN iui
(returns 2 records: {"instance_id":"inst7001"} and {"instance_id":"inst5000"})
I'd like to collect/aggregate the results so that each row is a subtree. I saw this example of how to collect() and got the array method to work. But it still has duplicate ItemUsageInstances in it. (The "map of items" discussed there failed completely...)
Any insights as to why my query is only finding one subtree of adjacent item usage instances with the same material?
What is the best way to aggregate the results by subtree?
Finding the roots is easy. A root has no outgoing CHILD_OF relationship: MATCH (root:ItemUsageInstance) WHERE NOT (root)-[:CHILD_OF]->()
And for the children, you can include the root by specifying a min distance of 0 (default is 1).
MATCH p=(root)<-[:CHILD_OF*0..25]-(ins), (m:Material)<-[:MADE_FROM]-(:Item)<-[:INSTANCE_OF]-(ins)
And then assuming only one item-material per instance, aggregate everything based on material (You can't aggregate in an aggregate, so use WITH to get the depth before collecting the depth with the node)
WITH ins, SIZE(NODES(p)) as depth, m RETURN COLLECT({node:ins, depth:depth}) as instances, m as material
So, all together
MATCH (root:ItemUsageInstance),
p=(root)<-[:CHILD_OF*0..25]-(ins),
(m:Material)<-[:MADE_FROM]-(:Item)<-[:INSTANCE_OF]-(ins)
WHERE NOT ()<-[:CHILD_OF]-(root)
AND NOT (m:Material)<-[:MADE_FROM]-(:Item)<-[:INSTANCE_OF]-()<-[:CHILD_OF]-(ins)
MATCH p2=(ins)<-[:CHILD_OF*1..25]-(cins)
WHERE ALL(n in NODES(p2) WHERE (m)<-[:MADE_FROM]-(:Item)<-[:INSTANCE_OF]-(n))
WITH ins, cins, SIZE(NODES(p2)) as depth, m ORDER BY depth ASC
RETURN ins as collection_head, ins+COLLECT(cins) as instances, m as material
In your pattern, you don't account for situations like the link between inst_5001 and inst_7001. Inst_5001 doesn't have any links to any part usages, but your match pattern requires that both usages have such a link. I think this is where you're going off track. You're finding the inst_5002 tree because it happens to have a link to a usage, as your pattern requires.
In terms of "aggregating by subtree", I would return the ID of the root of the tree (e.g. id(iui1)) and then count(*) the rest, to show how many subtrees a given root participates in.
Here is my heavily edited query:
MATCH path = (cinst:ItemUsageInstance)-[:CHILD_OF*1..]->(pinst:ItemUsageInstance), (m:Material)<-[:MADE_FROM]-(:Item)<-[:INSTANCE_OF]-(pinst)
WHERE ID(cinst) <> ID(pinst) AND ALL (x in nodes(path) WHERE ((x)-[:INSTANCE_OF]->(:Item)-[:MADE_FROM]->(m)))
WITH nodes(path) as insts, m
UNWIND insts AS instance
WITH DISTINCT instance, m
RETURN collect(instance), m
It returns what I was expecting:
╒═════════════════════════════════════════════════════════════════════════════════════════════════════════════╤═══════════════════════╕
│"collect(instance)" │"m" │
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════╪═══════════════════════╡
│[{"instance_id":"inst7002"},{"instance_id":"inst7003"},{"instance_id":"inst7001"},{"instance_id":"inst5002"}]│{"material_id":"M0002"}│
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────┤
│[{"instance_id":"inst7007"},{"instance_id":"inst7008"},{"instance_id":"inst7006"}] │{"material_id":"M0001"}│
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────┘
The one limitation is that it does not distinguish the root of the subtree from the children. Ideally the list of {"instance_id"} would be sorted by depth in the tree.
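One way to address the ordering limitation is to sort each collected list by depth after the query returns. The Python sketch below is only an illustration: the child_of links are assumed (the text does not spell out the actual links between inst7006, inst7007 and inst7008), and depth_in_subtree and sort_by_depth are hypothetical helper names.

```python
# Hypothetical CHILD_OF links (child -> parent) for one returned row.
child_of = {"inst7006": "inst7007", "inst7007": "inst7008"}

def depth_in_subtree(node, members, child_of):
    """Count CHILD_OF hops from node up to the top-most member of the subtree."""
    depth = 0
    while child_of.get(node) in members:
        node = child_of[node]
        depth += 1
    return depth

def sort_by_depth(members, child_of):
    """Return the members with the subtree root first, then by increasing depth."""
    members = set(members)
    return sorted(members, key=lambda n: (depth_in_subtree(n, members, child_of), n))

print(sort_by_depth(["inst7007", "inst7008", "inst7006"], child_of))
```

The first element of the sorted list is then the root of the subtree, which also distinguishes it from its children.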
I have a dataset containing 3M nodes and more than 5M relationships. There are about 8 different relationship types. Now I want to return 2 nodes if they are inter-connected. Here the 2 nodes are A and B, and I would like to see if they are inter-connected.
MATCH (n:WCD_Ent)
USING INDEX n:WCD_Ent(WCD_NAME)
WHERE n.WCD_NAME = "A"
MATCH (m:WCD_Ent)
USING INDEX m:WCD_Ent(WCD_NAME)
WHERE m.WCD_NAME = "B"
MATCH (n) - [r*] - (m)
RETURN n,r,m
This gives me Java Heap Space error.
Another condition I would like to add to my query is that the path between the 2 nodes A and B contains one particular relationship type (NAME_MATCH) at least once. Could you help me with that?
Gabor's suggestion is the most important fix; you are blowing up heap space because you are generating a cartesian product of rows to start, then filtering out using the pattern. Generate rows using the pattern and you'll be much more space efficient. If you have an index on WCD_Ent(WCD_NAME), you don't need to specify the index, either; this is something you only do if your query is running very slow and a PROFILE shows that the query planner is skipping the index. Try this one instead:
MATCH (n:WCD_Ent { WCD_NAME: "A" })-[r*..5]-(m:WCD_Ent { WCD_NAME: "B" })
WHERE ANY(rel IN r WHERE TYPE(rel) = 'NAME_MATCH')
RETURN n, r, m
The WHERE filter here will check all of the relationships in r (which is a collection, the way you've assigned it) and ensure that at least 1 of them matches the desired type.
Tore's answer (including the variable relationship upper bound) is the best one for finding whether two nodes are connected and if a certain relationship exists in a path connecting them.
One weakness with most of the solutions given so far is that there is no limitation on the variable relationship match, meaning the query is going to crawl your entire graph attempting to match on all possible paths, instead of only checking that one such path exists and then stopping. This is likely the cause of your heap space error.
Tore's suggestion of adding an upper bound on the variable-length relationships in your match is a great solution, as it also helps out in cases where the two nodes aren't connected, preventing you from having to crawl the entire graph. In all cases, the upper bound should prevent the heap from blowing up.
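The combination of "stop at the first hit" and "give up after a fixed number of hops" can be sketched as a bounded breadth-first search. This is a plain Python illustration over a hypothetical adjacency map, not Neo4j's implementation: it only checks reachability and does not look at relationship types such as NAME_MATCH.

```python
from collections import deque

def connected_within(adjacency, start, goal, max_hops):
    """Return True as soon as goal is reached within max_hops, ignoring direction."""
    if start == goal:
        return True
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # the upper bound: do not expand any further from here
        for neighbour in adjacency.get(node, ()):
            if neighbour == goal:
                return True  # stop at the first path found
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return False

# Hypothetical undirected graph stored as an adjacency map.
adjacency = {"A": ["X"], "X": ["A", "Y"], "Y": ["X", "B"], "B": ["Y"], "Z": []}
print(connected_within(adjacency, "A", "B", 5))
```

Each node is visited at most once, so memory stays proportional to the number of nodes within the bound instead of the number of possible paths.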
Here are a couple more possibilities. I'm leaving off the relationship upper bound, but that can easily be added in if needed.
// this one won't check for the particular relationship type in the path
// but doesn't need to match on all possible paths, just find connectedness
MATCH (n:WCD_Ent { WCD_NAME: "A" }), (m:WCD_Ent { WCD_NAME: "B" })
RETURN EXISTS((n)-[*]-(m))
// using shortestPath() will only give you a single path back that works
// however WHERE ANY may be a filter to apply after matches are found
// so this may still blow up, not sure
MATCH (n:WCD_Ent { WCD_NAME: "A" }), (m:WCD_Ent { WCD_NAME: "B" })
MATCH p = shortestPath((n)-[r*]-(m))
WHERE ANY(rel IN r WHERE TYPE(rel) = 'NAME_MATCH')
RETURN p
// Adding LIMIT 1 will only return one path result
// Unsure if this will prevent the heap from blowing up though
// The performance and outcome may be identical to the above query
MATCH (n:WCD_Ent { WCD_NAME: "A" }), (m:WCD_Ent { WCD_NAME: "B" })
MATCH (n)-[r*]-(m)
WHERE ANY(rel IN r WHERE TYPE(rel) = 'NAME_MATCH')
RETURN n, r, m
LIMIT 1
Some enhancements:
Instead of the WHERE condition, you can bind the property value inside the pattern.
You can combine the three MATCH clauses into a single one, which makes sure that the query engine will not calculate a Cartesian product of n and m. (You can also use EXPLAIN to visualize the query plan and check this.)
The resulting query:
MATCH (n:WCD_Ent { WCD_NAME: "A" })-[r*]-(m:WCD_Ent { WCD_NAME: "B" })
RETURN n, r, m
Update: Tore Eschliman pointed out that you don't need to specify the indices, so I removed these two lines from the query:
USING INDEX n:WCD_Ent(WCD_NAME)
USING INDEX m:WCD_Ent(WCD_NAME)
I have 2 kinds of nodes, labelled Class and Parents. These nodes are connected by a hasParents relationship. There are 4 million Class nodes and 700K Parents nodes. I want to create a Sibling relationship between the Class nodes. I ran the following query:
Match (A:Class)-[:hasParents]-> (B:Parents) <-[:hasParents]-(C:Class) Merge (A)-[:Sibling]-(C)
This query is taking ages to complete. I have indexes on both the class_id and parent_id properties of the Class and Parents nodes. I am using Neo4j version 2.1.6. Any suggestions to speed this up?
First of all, the indices won't help the query since the properties are not referenced anywhere in the query.
With 700K Parents nodes and 4M Class nodes, you have on average 5.7 classes per parent. With 6 classes under one parent, there are 15 Sibling relationships (6 * 5 / 2), so there would be more than 10M relationships to create for the whole graph.
That's a lot for one transaction, you're almost guaranteed to hit an OutOfMemory error.
To avoid that, you should batch changes into several smaller transactions.
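The size estimate can be reproduced with a few lines of Python. This is a rough sketch that assumes every parent has the same (rounded) number of classes; the real total depends on how the classes are actually distributed, and sibling_pairs is just an illustrative helper name.

```python
def sibling_pairs(class_count):
    """Number of undirected pairs among class_count classes sharing one parent."""
    return class_count * (class_count - 1) // 2

parents = 700_000
classes = 4_000_000
average = classes / parents  # about 5.7 classes per parent
estimate = parents * sibling_pairs(round(average))
print(round(average, 1), sibling_pairs(5), sibling_pairs(6), estimate)
```

The pair count grows quadratically with the number of classes per parent, so a few very large families can dominate the total.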
I'd use a marker label to manage the progression. First, mark all the parents:
MATCH (p:Parents) SET p:ToProcess
Then, repeatedly select a subset of the nodes that remain to be processed, and connect the siblings:
MATCH (p:ToProcess)
WITH p
LIMIT 1000
REMOVE p:ToProcess
WITH p
OPTIONAL MATCH (p)<-[:hasParents]-(c:Class)
WITH p, collect(c) AS children
FOREACH (c1 IN children |
FOREACH (c2 IN filter(c IN children WHERE c <> c1) |
MERGE (c1)-[:Sibling]-(c2)))
RETURN count(p)
As the query returns the number of parents that were processed, you just repeat it until it returns 0. At that point, no parent has the ToProcess label anymore.
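The repeat-until-zero loop could be driven from a small client script. The sketch below deliberately avoids any specific driver API: run_batch is a hypothetical callable that would execute the batch query in its own transaction and return its count(p), and fake_batch is only a stand-in so the loop can be run without a database.

```python
def run_until_done(run_batch, max_rounds=100_000):
    """Call run_batch() until it reports 0 processed parents; return the total."""
    total = 0
    for _ in range(max_rounds):
        processed = run_batch()
        if processed == 0:
            return total
        total += processed
    raise RuntimeError("batch loop did not finish within max_rounds")

# Stand-in for the real query: pretends 2500 parents are still marked ToProcess.
remaining = [2500]

def fake_batch(batch_size=1000):
    done = min(batch_size, remaining[0])
    remaining[0] -= done
    return done

print(run_until_done(fake_batch))
```

The max_rounds guard is an added safety net so that a query that never reaches 0 does not loop forever.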
I've this kind of data model in the db:
(a)<-[:has_parent]-(b)<-[:has_parent]-(c)<-[:has_parent]-(...)
every parent can have multiple children & this can go on to unknown number of levels.
I want to find these values for every node
the number of descendants it has
the depth [distance from the node] of every descendant
the creation time of every descendant
& I want to rank the returned nodes based on these values. Right now, with no optimization, the query runs very slow (especially when the number of descendants increases).
The Questions:
what can I do in the model to make the query performant (indexing, data structure, ...)
what can I do in the query
what can I do anywhere else?
edit:
the query starts from a specific node using START or MATCH
to clarify:
a. the query may start from any point in the hierarchy, not just the root node
b. every node under the starting node is returned ranked by the total number of descendants it has, the distance (from the returned node) of every descendant & timestamp of every descendant it has.
c. by descendant I mean everything under it, not just its direct children
for example,
here's a sample graph:
http://console.neo4j.org/r/awk6m2
First you need to know how to find the root node. The following statement finds the nodes that have no outbound parent relationship. Be aware that this statement is potentially expensive in a large graph.
MATCH (n)
WHERE NOT ((n)-[:has_parent]->())
RETURN n
Instead you should use an index to find that node:
MATCH (n:Node {name:'abc'})
Starting with our root node, we traverse the inbound parent relationships with variable depth. For each node traversed we count its direct children; since this might be zero, an OPTIONAL MATCH is used:
MATCH (root:Node) // line 1-3 to find root node, replace by index lookup
WHERE NOT ((root)-[:has_parent]->())
WITH root
MATCH p =(root)<-[:has_parent*]-() // variable path length match
WITH last(nodes(p)) AS currentNode, length(p) AS currentDepth
OPTIONAL MATCH (currentNode)<-[:has_parent]-(c) // traverse children
RETURN currentNode, currentNode.created, currentDepth, count(c) AS countChildren
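The query above counts direct children only. For the total number of descendants and their distances, the underlying computation can be sketched in Python on a hypothetical in-memory tree; the children map and both function names are invented for illustration, and recomputing the count per node like this is quadratic, so it is only meant to show the logic.

```python
# Hypothetical tree: parent -> direct children (the reverse of has_parent).
children = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}

def descendants_with_depth(node, depth=1):
    """Yield (descendant, distance from the starting node) for everything below node."""
    for child in children.get(node, []):
        yield child, depth
        yield from descendants_with_depth(child, depth + 1)

def rank_by_descendant_count(start):
    """Return the nodes under start, most descendants first (ties broken by name)."""
    nodes = [n for n, _ in descendants_with_depth(start)]
    counts = {n: sum(1 for _ in descendants_with_depth(n)) for n in nodes}
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

print(list(descendants_with_depth("a")))
print(rank_by_descendant_count("a"))
```

A creation timestamp could be added to the sort key in the same way, once it is decided how it should weigh against count and depth.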
I have a scenario where I have more than 2 random nodes.
I need to get all possible paths connecting all three nodes. I do not know the direction of relation and the relationship type.
Example : I have in the graph database with three nodes person->Purchase->Product.
I need to get the path connecting these three nodes. But I do not know the order in which I need to query, for example if I give the query as person-Product-Purchase, it will return no rows as the order is incorrect.
So in this case how should I frame the query?
In a nutshell, I need to find the path between more than two nodes, where the nodes in the match clause may be given in whatever order the user knows.
You could list all of the nodes in multiple bound identifiers in the start, and then your match would find the ones that match, in any order. And you could do this for N items, if needed. For example, here is a query for 3 items:
start a=node:node_auto_index('name:(person product purchase)'),
b=node:node_auto_index('name:(person product purchase)'),
c=node:node_auto_index('name:(person product purchase)')
match p=a-->b-->c
return p;
http://console.neo4j.org/r/tbwu2d
I actually just made a blog post about how start works, which might help:
http://wes.skeweredrook.com/cypher-it-all-starts-with-the-start/
Wouldn't it be acceptable to make several queries? In your case you'd automatically generate 6 queries covering all the possible orderings (the factorial of the number of variables).
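Generating those orderings is straightforward with itertools.permutations. The sketch below builds one undirected pattern per ordering; the name property is an assumption borrowed from the index lookup in the other answer, candidate_patterns is a hypothetical helper, and real code should pass values as query parameters rather than formatting them into the query text.

```python
from itertools import permutations

def candidate_patterns(names):
    """Build one undirected path pattern per ordering of the given node names."""
    patterns = []
    for order in permutations(names):
        parts = [f"({{name: '{name}'}})" for name in order]
        patterns.append("MATCH p = " + "--".join(parts) + " RETURN p")
    return patterns

queries = candidate_patterns(["person", "product", "purchase"])
print(len(queries))
for query in queries:
    print(query)
```

With 3 names this yields 3! = 6 queries; the count grows factorially, so the approach is only practical for a handful of nodes.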
A possible solution would be to first get three sets of nodes (s,m,e). These sets may be the same as in the question (or contain partially or completely different nodes). The sets are important, because starting, middle and end node are not fixed.
Here is the code for the Matrix example with added nodes.
match (s) where s.name in ["Oracle", "Neo", "Cypher"]
match (m) where m.name in ["Oracle", "Neo", "Cypher"] and s <> m
match (e) where e.name in ["Oracle", "Neo", "Cypher"] and s <> e and m <> e
match rel=(s)-[r1*1..]-(m)-[r2*1..]-(e)
return s, r1, m, r2, e, rel;
The additional where clause makes sure the same node is not used twice in one result row.
The relations are matched with one or more edges (*1..) or hops between the nodes s and m or m and e respectively and disregarding the directions.
Note that Cypher 3 syntax is used here.