I have 2 nodes. First of them "b1" has 16m relationships and second one "b" - 17k. Label B is indexed on the id property.
My query to retrieve if they have a direct relation is:
profile
MATCH (b:B {id :'D006019' }) WITH b
MATCH (b1:B {id :'D006801' }) WITH b, b1
MATCH (b)-[r]-(b1) RETURN r
Several observations:
Query is extremely slow. It's running for like 5 mins. First it makes a nodeindexscan which is very fast, but somehow it manages to grab the node b1 and continues execution with expanding this node. Byt "b1" has 16m relations and this with the following filter ruins the performance
I can make this query fast enough if I change it a little.
Here is the much faster query:
profile
MATCH (bB {id :'D006019' }) WITH b
MATCH (b1:B) WHERE b1.id IN ['D006801' ] WITH b, b1
MATCH (b)-[r]-(b1) RETURN r
So now "b1" is in "IN" clause and neo4j starts expanding over "b" which has only 17k relations and the query executes around 100 ms.
My question is: can the query be written in a way that neo4j expands automatically on the less connected node.
Sometimes you have to give Cypher some hints:
MATCH (b:B {id :'D006019'})
USING INDEX b:B(id)
MATCH (b1:B {id :'D006801'})
USING INDEX b1:B(id)
MATCH (b)-[r]-(b1)
RETURN r;
The above query tells Cypher that it should use the :B(id) index for each of the first 2 matches. Without the hints, there is currently a tendency for the planner to only use the index once.
Related
I have a following network result when I run this query in neo4j browser:
MATCH (n1:Item {name: 'A'})-[r]-(n2:Item) Return n1,r,n2
At the bottom of the graph, it says: Displaying 6 nodes, 7 relationships.
But when I look on the table in the neo4j browser, I only have 5 records
n1,r,n2
A,A->B,B
A,A->C,C
A,A->D,D
A,A->E,E
A,A->F,F
So in the java code, when I get the list of records using the code below:
List<Record> records = session.run(query).list();
I only get 5 records, so I only get the 5 relationships.
But I want to get all 7 relationships including the 2 below:
B->C
C->F
How can i achieve that using the cypher query?
This should work:
MATCH (n:Item {name: 'A'})-[r1]-(n2:Item)
WITH n, COLLECT(r1) AS rs, COLLECT(n2) as others
UNWIND others AS n2
OPTIONAL MATCH (n2)-[r2]-(x)
WHERE x IN others
RETURN n, others, rs + COLLECT(r2) AS rs
Unlike #FrantišekHartman's first approach, this query uses UNWIND to bind n2 (which is not specified in the WITH clause and therefore becomes unbound) to the same n2 nodes found in the MATCH clause. This query also combines all the relationships into a single rs list.
There are many ways to achieve this. One ways is to travers to 2nd level and check that the 2nd level node is in the first level as well
MATCH (n1:Item {name: 'A'})-[r]-(n2:Item)
WITH n1,collect(r) AS firstRels,collect(n2) AS firstNodes
OPTIONAL MATCH (n2)-[r2]-(n3:Item)
WHERE n3 IN firstNodes
RETURN n1,firstRels,firstNodes,collect(r2) as secondRels
Or you could do a Cartesian product between the first level nodes and match:
MATCH (n1:Item {name: 'A'})-[r]-(n2:Item)
WITH n1,collect(r) AS firstRels,collect(n2) as firstNodes
UNWIND firstNodes AS x
UNWIND firstNodes AS y
OPTIONAL MATCH (x)-[r2]-(y)
RETURN n1,firstRels,firstNodes,collect(r2) as secondRels
Depending on on cardinality of firstNodes and secondRels and other existing relationships one might be faster than the other.
I have a hierarchical structure of nodes, which all have a custom-assigned sorting property (numeric). Here's a simple Cypher query to recreate:
merge (p {my_id: 1})-[:HAS_CHILD]->(c1 { my_id: 11, sort: 100})
merge (p)-[:HAS_CHILD]->(c2 { my_id: 12, sort: 200 })
merge (p)-[:HAS_CHILD]->(c3 { my_id: 13, sort: 300 })
merge (c1)-[:HAS_CHILD]->(cc1 { my_id: 111 })
merge (c2)-[:HAS_CHILD]->(cc2 { my_id: 121 })
merge (c3)-[:HAS_CHILD]->(cc3 { my_id: 131 });
The problem I'm struggling with is that often I need to make decisions based on child node rank relative to some parent node, with regads to this sort identifier. So, for example, node c1 has rank 1 relative to node p (because it has the least sort property), c2 has rank 2, and c3 has rank 3 (the biggest sort).
The kind of decision I need to make based to this information: display children only of the first 2 cX nodes. Here's what I want to get:
cc1 and cc2 are present, but cc3 is not because c3 (its parent) is not the first or the second child of p. Here's a dumb query for that:
match (p {my_id: 1 })-->(c)
optional match (c)-->(cc) where c.sort <= 200
return p, c, cc
The problem is, these sort properties are custom-set and imported, so I have no way of knowing which value will be held for child number 2.
My current solution is to rank it during import, and since I'm using Oracle, that's quite simple -- I just need to use rank window function. But it seems awkward to me and I feel like there could be more elegant solution to that. I tried the next query and it works, but it looks weird and it's quite slow on bigger graphs:
match (p {my_id: 1 })-->(c)
optional match (c)-->(cc)
where size([ (p)-->(c1) where c1.sort < c.sort |c1]) < 2
return p, c, cc
Here's the plan for this query and the most expensive part is in fact the size expression:
The slowness you're seeing is likely because you're not performing an index lookup in your query, so it's performing an all nodes scan and accessing the my_id property of every node in your graph to find the one with id 1 (your p node).
You need to add labels on your nodes and use these labels in your queries (at least for your p node), and create an index (or in this case, probably a unique constraint) on the label for my_id so this lookup becomes fast.
You can confirm what's going on by doing a PROFILE of your query (if you can add the profile plan to your description, with all elements of the plan expanded that would help determine further optimizations).
As for your query, something like this should work (I'm using a :Node label as a standin for your actual label)
match (p:Node {my_id: 1 })-->(c)
with p, c
order by c.sort asc
with p, collect(c) as children // children are in order
unwind children[..2] as child // one row for each of the first 2 children
optional match (child)-->(cc) // only matched for the first 2 children
return p, children, collect(cc) as grandchildren
Note that this only returns nodes, not paths or relationships. The reason why you're getting the result graph in the graphical view is because, in the Browser Setting tab (the gear icon in the lower left menu) you have Connect result nodes checked at the bottom.
I'm using Neo4j to try to find any node that is not connected to a specific node "a". The query that I have so far is:
MATCH p = shortestPath((a:Node {id:"123"})-[*]-(b:Node))
WHERE p IS NULL
RETURN b.id as b
So it tries to find the shortest path between a and b. If it doesn't find a path, then it returns that node's id. However, this causes my query to run for a few minutes then crashes when it runs out of memory. I was wondering if this method would even work, and if there is a more efficient way? Any help would be greatly appreciated!
edit:
MATCH (a:Node {id:"123"})-[*]-(b:Node),
(c:Node)
WITH collect(b) as col, a, b, c
WHERE a <> b AND NOT c IN col
RETURN c.id
So col (collect(b)) contains every node connected to a, therefore if c is not in col then c is not connected to a?
For one, you're giving this MATCH an impossible predicate to fulfill, so it will never find the shortest path.
WHERE clauses are associated with MATCH, OPTIONAL MATCH, and WITH clauses, so your query is asking for the shortest path where the path doesn't exist. That will never return anything.
Also, the shortestPath will start at the node you DON'T want to be connected, so this has no way of finding the nodes that aren't connected to it.
Probably the easiest way to approach this is to MATCH to all nodes connected to your node in question, then MATCH to all :Nodes checking for those that aren't in the connected set. That means you won't have to do a shortestPath from every single node in the db, just a membership check in a collection.
You'll need APOC Procedures for this, as it has the fastest means of matching to nodes within a subgraph.
MATCH (a:Node {id:"123"})
CALL apoc.path.subgraphNodes(a, {}) YIELD node
WITH collect(node) as subgraph
MATCH (b:Node)
WHERE NOT b in subgraph
RETURN b.id as b
EDIT
Your edited query is likely to blow up, that's going to generate a huge result set (the query will build a result set of every node reachable from your start node by a unique path in a cartesian product with every :Node).
Instead, go step by step, collect the distinct nodes (because otherwise you'll get multiples of the same nodes that can be reached via different paths), and then only after you have your collection should you start your match for nodes that aren't in the list.
MATCH (:Node {id:"123"})-[*0..]-(b:Node)
WITH collect(DISTINCT b) as col
MATCH (a:Node)
WHERE NOT a IN col
RETURN a.id
I have about 3.5M nodes with label A and about 400 nodes with label B.
Nodes with label B already have directed relation like (b1:B)-(c:CONNECTS)->(b2:B) now I need to add 3.5M another type of relationships by comparing A node properties with :CONNECTS relationship properties.
My statement looks like this:
MATCH (a:A)
MATCH (c:C)
MATCH (b1:B {id: a.a1_id})-[rl:CONNECTS*1..21]->(b2:B {id: a.b2_id}) WHERE ALL(x in rl WHERE x.connect_id = c.connect_id)
MATCH (new_a:B)-[r:TO]->(new_b:B) WHERE r in rl
CREATE (new_a)-[:TICKET {ticket_id: ID(a)}]->(new_b)
This statement is extremely slow and just hangs up. I even tried to do some performance tuning mentioned here, especially I allocated heap size to 16GB.
I find it quite strange that it can't handle this size of data. What am I missing? I tried to model differently and reduce relationship queries and use more schema index, but I failed to do a lot differently because of type of data I have and type of query I want to perform after all data is there.
I also tried to use periodic commit while creating A nodes with csv import. It has same issues.
I hope I am clear enough. I would really appreciate some inputs. Thanks.
What are the labels A, B, C ? A CONNECTS relationship is also free of meaning.
Queries like this are meant to be comprehensible not the opposite!
// generates 3.5M rows
MATCH (a:A)
// generates x-times 3.5M rows
// you never use that C except for checking an connect id?
MATCH (c:C)
// many million times execute this variable length expand
MATCH (b1:B {id: a.a1_id})-[rl:CONNECTS*1..21]->(b2:B {id: b2_id})
WHERE ALL(x in rl WHERE x.connect_id = c.connect_id)
// lookup by relationship is very bad esp. as you looking over a cross product of all 400x400 B's
MATCH (new_a:B)-[r:TO]->(new_b:B) WHERE r in rl
// why do you store the id of a on this self!!-relationship?
CREATE (new_b)-[:TICKET {ticket_id: ID(a)}]->(new_b);
Where does b2_id come from?
Perhaps something like this:
MATCH (a:A)
MATCH (b1:B {id: a.a1_id})
MATCH (b2:B {id: {b2_id}})
MATCH (b1)-[rels:CONNECTS*..21]->(b2)
WHERE ALL(x in tail(rels) WHERE x.connect_id = head(rels).connect_id)
UNWIND rels AS r
WITH a,startNode(r) as new_a, endNode(r) as new_b
CREATE (new_a)-[:TICKET {ticket_id: ID(a)}]->(new_b);
I want to write a query in Cypher and run it on Neo4j.
The query is:
Given some start vertexes, walk edges and find all vertexes that is connected to any of start vertex.
(start)-[*]->(v)
for every edge E walked
if startVertex(E).someproperty != endVertex(E).someproperty, output E.
The graph may contain cycles.
For example, in the graph above, vertexes are grouped by "group" property. The query should return 7 rows representing the 7 orange colored edges in the graph.
If I write the algorithm by myself it would be a simple depth / breadth first search, and for every edge visited if the filter condition is true, output this edge. The complexity is O(V+E)
But I can't express this algorithm in Cypher since it's very different language.
Then i wrote this query:
find all reachable vertexes
(start)-[*]->(v), reachable = start + v.
find all edges starting from any of reachable. if an edge ends with any reachable vertex and passes the filter, output it.
match (reachable)-[]->(n) where n in reachable and reachable.someprop != n.someprop
so the Cypher code looks like this:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
The performance of this query is not good as I thought. There are index on :Col(schema)
I am running neo4j 2.3.0 docker image from dockerhub on my windows laptop. Actually it runs on a linux virtual machine on my laptop.
My sample data is a small dataset that contains 0.1M vertexes and 0.5M edges. For some starting nodes it takes 60 or more seconds to complete this query. Any advice for optimizing or rewriting the query? Thanks.
The following code block is the logic I want:
VertexQueue1 = (starting vertexes);
VisitedVertexSet = (empty);
EdgeSet1 = (empty);
While (VertexSet1 is not empty)
{
Vertex0 = VertexQueue1.pop();
VisitedVertexSet.add(Vertex0);
foreach (Edge0 starting from Vertex0)
{
Vertex1 = endingVertex(Edge0);
if (Vertex1.schema <> Vertex0.schema)
{
EdgeSet1.put(Edge0);
}
if (VisitedVertexSet.notContains(Vertex1)
and VertexQueue1.notContains(Vertex1))
{
VertexQueue1.push(Vertex1);
}
}
}
return EdgeSet1;
EDIT:
The profile result shows that expanding all paths has a high cost. Looking at the row number, it seems that Cypher exec engine returns all paths but I want distint edge list only.
LEFT one:
match (start:Col {table:"F_XXY_DSMK_ITRPNL_IDX_STAT_W"})
,(start)-[*0..]->(prev:Col)-->(node:Col)
where prev.schema<>node.schema
return distinct prev,node
RIGHT one:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
I think Cypher lets this be much easier than you're expecting it to be, if I'm understanding the query. Try this:
MATCH (start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WHERE start.schema <> node.schema
RETURN start, node
Though I'm not sure why you're comparing the schema property on the nodes. Isn't the schema for the start node fixed by the value that you pass in?
I might not be understanding the query though. If you're looking for more than just the nodes connected to the start node, you could do:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN prev, node
That open-ended variable length relationship specification might be slow, though.
Also note that when Cypher is browsing a particular path it stops which it finds that it's looped back onto some node (EDIT relationship, not node) in the path matched so far, so cycles aren't really a problem.
Also, is the DWMDATA value that you're passing in interpolated? If so, you should think about using parameters for security / performance:
http://neo4j.com/docs/stable/cypher-parameters.html
EDIT:
Based on your comment I have a couple of thoughts. First limiting to DISTINCT path isn't going to help because every path that it finds is distinct. What you want is the distinct set of pairs, I think, which I think could be achieved by just adding DISTINCT to the query:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINT prev, node
Here is another way to go about it which may or may not be more efficient, but might at least give you an idea for how to go about things differently:
MATCH
path=(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WITH rels(path) AS rels
UNWIND rels AS rel
WITH DISTINCT rel
WITH startNode(rel) AS start_node, endNode(rel) AS end_node
WHERE start_node.schema <> end_node.schema
RETURN start_node, end_node
I can't say that this would be faster, but here's another way to try:
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH collect(ID(node)) AS node_ids
MATCH (:Col)-[r]->(node:Col)
WHERE ID(node) IN node_ids
WITH DISTINCT r
RETURN startNode(r) AS start_node, endNode(r) AS end_node
I suspect that the problem in all cases is with the open-ended variable length path. I've actually asked on the Slack group to try to get a better understanding of how it works. In the meantime, for all the queries that you try I would suggest prefixing them with the PROFILE keyword to get a report from Neo4j on what parts of the query are slow.
// this is very inefficient!
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH distinct node
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
you might be better off with this:
MATCH (start:Col)
WHERE start.property IN {property_values}
MATCH (node:Col)
WHERE shortestPath((start)-[*]->(node)) IS NOT NULL
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;