How to stop Neo4J Cypher from processing an empty collection? - neo4j

I have this kind of Cypher request in Neo4j:
MATCH (c1:Concept)
WHERE c1.name in (['word'])
WITH COLLECT(distinct c1) as concepts
MATCH (ctx:Context)
WHERE ALL(c in concepts
WHERE (c)-->(ctx) AND ((ctx.by) = '15229100-b20e-11e3-80d3-6150cb20a1b9'))
RETURN ctx
If there is a c1 with the name word, it is processed correctly and I get acceptable results.
However, if there is no c1 with that name, an empty collection is returned, yet the query keeps going and I just get all the ctx:Context nodes that satisfy the ctx.by criteria, which is not right.
How can I fix that in the request?

Aggregations (alone, without any non-aggregation variables as grouping keys) will succeed even when there are no rows, emitting a single row with the result, which will allow further processing, since there is a row to operate on.
To get the behavior you want, add a filter after the aggregation to ensure you have a non-empty list. This will ensure that if the list is empty, rows go to 0 and the subsequent operations won't take place:
MATCH (c1:Concept)
WHERE c1.name in (['word'])
WITH COLLECT(distinct c1) as concepts
WHERE size(concepts) <> 0
MATCH (ctx:Context)
WHERE ALL(c in concepts
WHERE (c)-->(ctx) AND ((ctx.by) = '15229100-b20e-11e3-80d3-6150cb20a1b9'))
RETURN ctx
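To see the behavior described above in isolation, a minimal sketch follows. It assumes no :Concept node carries the made-up name 'no-such-word'. The MATCH produces zero rows, but the standalone aggregation still emits one row, and since ALL over an empty list is vacuously true, the original query then kept going.

```cypher
// Assumption: no :Concept node is named 'no-such-word', so MATCH yields zero rows.
MATCH (c1:Concept)
WHERE c1.name IN ['no-such-word']
// The standalone aggregations still emit a single row: a count of 0 and an empty list.
RETURN count(c1) AS n, collect(DISTINCT c1) AS concepts
```

Adding WHERE size(concepts) <> 0 after the WITH, as in the answer, is what drops that single row again.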

Related

Neo4j count Query

match(m:master_node:Application)-[r]-(k:master_node:Server)-[r1]-(n:master_node)
where (m.name contains '' and (n:master_node:DeploymentUnit or n:master_node:Schema))
return distinct m.name,n.name
Hi, I am trying to get the total number of records for the above query. How can I change the query to use the count function and get the record count directly?
Thanks in advance
The following query uses the aggregating function COUNT. Distinct pairs of m.name, n.name values are used as the "grouping keys".
MATCH (m:master_node:Application)--(:master_node:Server)--(n:master_node)
WHERE EXISTS(m.name) AND (n:DeploymentUnit OR n:Schema)
RETURN m.name, n.name, COUNT(*) AS cnt
I assume that m.name contains '' in your query was an attempt to test for the existence of m.name. This query uses the EXISTS() function to test that more efficiently.
[UPDATE]
To determine the number of distinct n and m pairs in the DB (instead of the number of times each pair appears in the DB):
MATCH (m:master_node:Application)--(:master_node:Server)--(n:master_node)
WHERE EXISTS(m.name) AND (n:DeploymentUnit OR n:Schema)
WITH DISTINCT m.name AS n1, n.name AS n2
RETURN COUNT(*) AS cnt
Some things to consider for speeding up the query even further:
Remove unnecessary label tests from the MATCH pattern. For example, can we omit the master_node label test from any nodes? In fact, can we omit all label testing for any nodes without affecting the validity of the result? (You will likely need a label on at least one node, though, to avoid scanning all nodes when kicking off the query.)
Can you add a direction to each relationship (to avoid having to traverse relationships in both directions)?
Specify the relationship types in the MATCH pattern. This will filter out unwanted paths earlier. Once you do so, you may also be able to remove some node labels from the pattern as long as you can still get the same result.
Use the PROFILE clause to evaluate the number of DB hits needed by different Cypher queries.
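As a rough sketch of how these suggestions might combine: the relationship types RUNS_ON and DEPLOYED_ON and their directions below are hypothetical placeholders, not taken from the question, and would need to be replaced with whatever the real model uses. The extra master_node labels are dropped on the assumption that the remaining labels are selective enough.

```cypher
// RUNS_ON and DEPLOYED_ON (and their directions) are hypothetical; substitute your own.
// PROFILE runs the query and reports rows and DB hits per operator.
PROFILE
MATCH (m:Application)-[:RUNS_ON]->(:Server)<-[:DEPLOYED_ON]-(n)
// IS NOT NULL tests for the property's existence, like the EXISTS(m.name) test above.
WHERE m.name IS NOT NULL AND (n:DeploymentUnit OR n:Schema)
WITH DISTINCT m.name AS n1, n.name AS n2
RETURN count(*) AS cnt
```

Comparing the PROFILE output of this variant against the original query is the way to check whether each change actually reduces DB hits.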
You can find examples of how to use count in the Neo4j docs.
In your case the first example there, which uses
count(*)
to return a count for each returned item, should work.

Update nodes by a list of ids and values in one cypher query

I've got a list of ids and a list of values. I want to find each node by its id and set a property to the matching value.
With just one node that is super basic:
MATCH (n) WHERE n.id='node1' SET n.name='value1'
But I have a list of ids ['node1', 'node2', 'node3'] and the same number of values ['value1', 'value2', 'value3'] (for simplicity I used a pattern, but the values and ids vary a lot). My first approach was to use the query above and just call the database once per node. That is no longer appropriate, since I now have thousands of ids, which would result in thousands of requests.
I came up with this approach, in which I iterate over each entry in both lists and set the values. The first node from the node list has to get the first value from the value list, and so on.
MATCH (n) WHERE n.id IN["node1", "node2"]
WITH n, COLLECT(n) as nodeList, COLLECT(["value1","value2"]) as valueList
UNWIND nodeList as nodes
UNWIND valueList as values
FOREACH (index IN RANGE(0, size(nodeList)) | SET nodes.name=values[index])
RETURN nodes, values
The problem with this query is that every node gets the same value (the last one in the value list). The reason is the last part, SET nodes.name=values[index]: I can't use the index on the left side, because nodes[index].name doesn't work and the database throws an error if I try. I tried it with nodeList, nodes and n, and nothing worked. I'm not sure this is the right way to achieve the goal; maybe there is a more elegant way.
Create pairs from the ids and values first, then use UNWIND and a simple MATCH .. SET query:
// The first line will likely come from parameters instead
WITH ['node1', 'node2', 'node3'] AS ids,['value1', 'value2', 'value3'] AS values
// range() includes its upper bound, so stop at size(ids) - 1
WITH [i in range(0, size(ids) - 1) | {id:ids[i], value:values[i]}] as pairs
UNWIND pairs AS pair
MATCH (n:Node) WHERE n.id = pair.id
SET n.value = pair.value
The line
WITH [i in range(0, size(ids) - 1) | {id:ids[i], value:values[i]}] as pairs
combines two concepts: list comprehensions and maps. Using the list comprehension (with the WHERE clause omitted), it converts a list of indexes into a list of maps with id and value keys.
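If the application already holds the ids and values together, one option is to build the list of maps client-side and send it as a single parameter. This is only a sketch: $pairs is a hypothetical parameter name, and the $name syntax applies to newer Neo4j releases (older ones wrote parameters as {pairs}).

```cypher
// $pairs is a hypothetical parameter supplied by the driver, for example:
// [{id: 'node1', value: 'value1'}, {id: 'node2', value: 'value2'}]
UNWIND $pairs AS pair
MATCH (n:Node) WHERE n.id = pair.id
SET n.value = pair.value
```

This keeps everything in one request and avoids depending on two separate lists staying aligned by index.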

Inconsistent query results depending on the order of the query in neo4j

Depending on the order I query 2 relationships, I get 2 different answers despite the query being the same (as far as I understand). The query obviously is not the same but I don't know why.
MATCH p1=(:Barrier {code: 'B2'})-[:REL1]->()
WITH count(DISTINCT p1) AS failed_B2
MATCH p2=(:Barrier {code: 'B2'})-[:REL2]->()
RETURN count(DISTINCT p2) AS worked_B2, failed_B2
Returns 1 and 0 - which is correct
But the other way round:
MATCH p1=(:Barrier {code: 'B2'})-[:REL2]->()
WITH count(DISTINCT p1) AS failed_B2
MATCH p2=(:Barrier {code: 'B2'})-[:REL1]->()
RETURN count(DISTINCT p2) AS worked_B2, failed_B2
Returns 0 and 0 - which is incorrect
I would like to combine the results of multiple queries but UNION does not work because it needs to group the results under the same column which in my case would be incorrect. I need the results in different columns.
So this is an interesting thing that centers around what happens when rows get filtered out (such as when a MATCH fails or a WHERE condition filters the row).
But first we need to address what you observed in the second case: "Returns 0 and 0". I don't think that's really true, and I'd like to know what version of Neo4j you're using here. In this particular case, I would instead have expected no rows to be returned, and this is ENTIRELY different from a row being returned with 0 values for both.
When Cypher queries execute, they build up records (or rows) of data. And Cypher operations execute per row. So when you do a MATCH in some point in your query, that's being performed per row, and when the MATCH fails, when no such pattern exists (that adheres to your WHERE clause, if present), then the row is filtered out. This is important, because this means that any other data in that record is gone and no longer addressable.
The second thing to keep in mind is that we allow certain aggregations such as count() and collect() to execute even when no rows are present, as it is conceivable that you may have a query where nothing matches, and getting that count of 0 (or that empty collection when you collect) is an entirely valid case and should be allowed. In these cases, where there may not have been any rows left at all after a MATCH or filter (and because of no rows, nothing else would be able to execute, as Cypher operations execute per row, so it's no-op if there are no rows), the count() or collect() would cause a new row to emerge with that count of 0, or that empty collection. And since a row is now present, the remaining operators in the query have something to execute on, and the query can continue execution.
This is what happens in your first case, where pattern p1 doesn't exist, but pattern p2 does (once). Here's the breakdown of what happens:
The first match fails to find anything. Rows go to 0. There is nothing left to execute subsequent operations upon.
You perform a standalone count() aggregation (with no other variables in scope, this is important). This emits a single row with a count of 0, which is correct: there are no occurrences of that pattern in the graph.
You perform the second MATCH, and there is a record/row for it to execute upon (with the value {failed_B2:0}), and it finds the single occurrence, and gets its count (1), and is able to output the expected answer (1, 0), with the 1 being the count of the pattern matches at the end of the query, p2, and the 0 being the count of the pattern matches from the first two lines of the query, p1.
Now let's see what happens when we reverse this.
In your second query, it is now pattern p1 that exists once in the graph, and pattern p2 that doesn't exist. Here's the breakdown that happens:
The first MATCH succeeds and finds the pattern.
You get the count of the patterns found: 1. You now have a single record/row with the value {failed_B2:1}
You execute the second MATCH, and the pattern isn't found. The record/row is filtered out. You now have no records/rows, so not only is there nothing to operate on, anything that was previously in the record/rows is gone. There IS no failed_B2 value anywhere to reference.
You attempt to get the count of p2 along with failed_B2. But this isn't allowed by Cypher: we only allow aggregation across 0 rows when it's a standalone count() or collect(). There IS no failed_B2 to reference; it was wiped out when the record/row that contained it got filtered out. There is no way to process that sanely, as that previously existing data is just not there (and this is correct behavior). The query should be returning no rows, which is NOT the same as 0, 0, as that implies you got a row returned (which is why I'm interested in clarifying that point with you).
As for how you should be correctly executing this, when you have to aggregate like this and you know that some patterns may not exist, use an OPTIONAL MATCH instead.
When you OPTIONAL MATCH, it doesn't filter out the row if no match is found. Instead newly introduced variables in the pattern go to null, and when you count() or collect() over nulls, it ignores them, giving you a correct count of 0, but not wiping out the record/row that contains the failed_B2 value you also want to return at the end.
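Applied to the second query from the question, the OPTIONAL MATCH approach might look like the sketch below. It keeps the question's variable names, and assumes, as in the question, that the REL2 pattern exists once and the REL1 pattern does not.

```cypher
// OPTIONAL MATCH keeps the row and sets p1 to null when nothing matches.
OPTIONAL MATCH p1=(:Barrier {code: 'B2'})-[:REL2]->()
WITH count(DISTINCT p1) AS failed_B2
// The row carrying failed_B2 survives even if this pattern is absent.
OPTIONAL MATCH p2=(:Barrier {code: 'B2'})-[:REL1]->()
// count() ignores nulls, so a missing pattern is reported as 0.
RETURN count(DISTINCT p2) AS worked_B2, failed_B2
```

With this form the order of the two patterns no longer matters, because neither a missing p1 nor a missing p2 removes the row.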

Optimizing Cypher Query Neo4j

I want to write a query in Cypher and run it on Neo4j.
The query is:
Given some start vertexes, walk the edges and find all vertexes that are connected to any of the start vertexes.
(start)-[*]->(v)
for every edge E walked
if startVertex(E).someproperty != endVertex(E).someproperty, output E.
The graph may contain cycles.
For example, in the graph above, vertexes are grouped by "group" property. The query should return 7 rows representing the 7 orange colored edges in the graph.
If I wrote the algorithm myself, it would be a simple depth-first or breadth-first search that outputs each visited edge for which the filter condition is true. The complexity is O(V+E).
But I can't express this algorithm in Cypher, since it's a very different language.
So I wrote this query:
find all reachable vertexes
(start)-[*]->(v), reachable = start + v.
find all edges starting from any of reachable. if an edge ends with any reachable vertex and passes the filter, output it.
match (reachable)-[]->(n) where n in reachable and reachable.someprop != n.someprop
so the Cypher code looks like this:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
The performance of this query is not as good as I expected. There is an index on :Col(schema).
I am running neo4j 2.3.0 docker image from dockerhub on my windows laptop. Actually it runs on a linux virtual machine on my laptop.
My sample data is a small dataset that contains 0.1M vertexes and 0.5M edges. For some starting nodes it takes 60 or more seconds to complete this query. Any advice for optimizing or rewriting the query? Thanks.
The following code block is the logic I want:
VertexQueue1 = (starting vertexes);
VisitedVertexSet = (empty);
EdgeSet1 = (empty);
While (VertexQueue1 is not empty)
{
    Vertex0 = VertexQueue1.pop();
    VisitedVertexSet.add(Vertex0);
    foreach (Edge0 starting from Vertex0)
    {
        Vertex1 = endingVertex(Edge0);
        if (Vertex1.schema <> Vertex0.schema)
        {
            EdgeSet1.put(Edge0);
        }
        if (VisitedVertexSet.notContains(Vertex1)
            and VertexQueue1.notContains(Vertex1))
        {
            VertexQueue1.push(Vertex1);
        }
    }
}
return EdgeSet1;
EDIT:
The profile result shows that expanding all paths has a high cost. Looking at the row counts, it seems that the Cypher execution engine returns all paths, but I only want the distinct edge list.
LEFT one:
match (start:Col {table:"F_XXY_DSMK_ITRPNL_IDX_STAT_W"})
,(start)-[*0..]->(prev:Col)-->(node:Col)
where prev.schema<>node.schema
return distinct prev,node
RIGHT one:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
I think Cypher lets this be much easier than you're expecting it to be, if I'm understanding the query. Try this:
MATCH (start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WHERE start.schema <> node.schema
RETURN start, node
Though I'm not sure why you're comparing the schema property on the nodes. Isn't the schema for the start node fixed by the value that you pass in?
I might not be understanding the query though. If you're looking for more than just the nodes connected to the start node, you could do:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN prev, node
That open-ended variable length relationship specification might be slow, though.
Also note that when Cypher is traversing a particular path, it stops when it finds that it has looped back onto a relationship (EDIT: relationship, not node) already in the path matched so far, so cycles aren't really a problem.
Also, is the DWMDATA value that you're passing in interpolated? If so, you should think about using parameters for security / performance:
http://neo4j.com/docs/stable/cypher-parameters.html
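A parameterized form of the query might look like the following sketch. The parameter names schema and table are made up for illustration, and the $name syntax shown here belongs to newer Neo4j releases; on older versions such as 2.3 the same parameters are written {schema} and {table}.

```cypher
// $schema and $table are hypothetical parameters sent alongside the query text,
// so the values are never spliced into the Cypher string itself.
MATCH (start:Col {schema: $schema, table: $table})
MATCH (start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINCT prev, node
```

Because the query text stays identical between calls, the server can also reuse the cached execution plan.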
EDIT:
Based on your comment I have a couple of thoughts. First limiting to DISTINCT path isn't going to help because every path that it finds is distinct. What you want is the distinct set of pairs, I think, which I think could be achieved by just adding DISTINCT to the query:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINCT prev, node
Here is another way to go about it which may or may not be more efficient, but might at least give you an idea for how to go about things differently:
MATCH
path=(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-[*]->(node:Col)
WITH rels(path) AS rels
UNWIND rels AS rel
WITH DISTINCT rel
WITH startNode(rel) AS start_node, endNode(rel) AS end_node
WHERE start_node.schema <> end_node.schema
RETURN start_node, end_node
I can't say that this would be faster, but here's another way to try:
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH collect(ID(node)) AS node_ids
MATCH (:Col)-[r]->(node:Col)
WHERE ID(node) IN node_ids
WITH DISTINCT r
RETURN startNode(r) AS start_node, endNode(r) AS end_node
I suspect that the problem in all cases is with the open-ended variable length path. I've actually asked on the Slack group to try to get a better understanding of how it works. In the meantime, for all the queries that you try I would suggest prefixing them with the PROFILE keyword to get a report from Neo4j on what parts of the query are slow.
// this is very inefficient!
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH distinct node
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
You might be better off with this:
MATCH (start:Col)
WHERE start.property IN {property_values}
MATCH (node:Col)
WHERE shortestPath((start)-[*]->(node)) IS NOT NULL
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;

Perform MATCH on collection / Break apart collection

I am trying to mimic the functionality of the neo4j browser to display my graph in my front end. The neo4j browser issues two calls for every query: the first call performs the query that the user types into the query box, and the second call finds the relationships between every node returned by that first, user-entered query.
{
"statements":[{
"statement":"START a = node(1,2,3,4), b = node(1,2,3,4)
MATCH a -[r]-> b RETURN r;",
"resultDataContents":["row","graph"],
"includeStats":true}]
}
In my application I would like to be more efficient so I would like to be able to get all of my nodes and relationships in a single query. The query that I have at present is:
START person = node({personId})
MATCH person-[:RELATIONSHIP*]-(p:Person)
WITH distinct p
MATCH p-[r]-(d:Data), p-[:DETAILS]->(details), d-[:FACT]->(facts)
RETURN p, r, d, details, facts
This query runs well but it doesn't give me the "d" and "details" nodes which were linked to the original "person".
I have tried to join the "p" and "person" results in a collection:
collect(p) + collect(person) AS people
But this does not allow me to perform a MATCH on the resulting collection. As far as I can figure out there is no way of breaking apart a collection.
The only option I see at the moment is to split the query into two; return the "collect(p) + collect(person) AS people" collection and then use the node values in a second query. Is there a more efficient way of performing this query?
If you use the quantifier *0.., RELATIONSHIP is also matched at a depth of 0, making person the same as p in that case. A * without specified limits defaults to 1..infinity.
START person = node({personId})
MATCH person-[:RELATIONSHIP*0..]-(p:Person)
WITH distinct p
MATCH p-[r]-(d:Data), p-[:DETAILS]->(details), d-[:FACT]->(facts)
RETURN p, r, d, details, facts
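The START clause and node patterns without parentheses are legacy syntax that later Neo4j versions no longer accept. A sketch of the same query in the newer form follows. It assumes the starting node carries the Person label (which the *0.. trick already requires, since p must be able to equal person) and that personId is still the internal node id, passed here as a $personId parameter.

```cypher
// Assumption: the start node has the :Person label and $personId is its internal id.
MATCH (person:Person)
WHERE id(person) = $personId
// *0.. also matches at depth 0, so p includes person itself.
MATCH (person)-[:RELATIONSHIP*0..]-(p:Person)
WITH DISTINCT p
MATCH (p)-[r]-(d:Data), (p)-[:DETAILS]->(details), (d)-[:FACT]->(facts)
RETURN p, r, d, details, facts
```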

Resources