Neo4J visit inner circle nodes only once - neo4j

I have a Cypher query that returns closed circle results like
a-b-t-y-a
b-t-y-w-b
and so on ...
From time to time I get results like
a-b-c-b-e-f-a
c-e-r-e-f-g-c
c-e-r-c-d-a-c
that I do not need
In these results, b is visited twice in the first path and e is visited twice in the second. In the third, c appears in the middle of the path as well as at the start and end, where it should be.
How can I avoid results where the same node repeats in the inner part of the path, or where the start/end node also appears inside it? Having the same node at the start and at the end of a result is fine.
My current Cypher query is:
start n=node(*) match p=n-[r:OWES*1..200]->n
where HAS(n.taxnumber)
return extract(s in relationships(p) : s.amount), extract(t in nodes(p) : ID(t)), length(p);
It returns all nodes that have a taxnumber property and are connected by OWES relationships, up to a depth of 200.

If you can't use CASE you could try this. I tested it on a 1.8.3 server and it runs OK. I was not able to get it working with reduce: I got several different unexpected errors, including 'unclosed parenthesis' and type comparison problems, and this was the only query that I got to work. I also ran it on a 2.0 server with a mock-up data set, where it finished in ~39s, while Michael's query with case and reduce took ~23s.
START n=node(*)
MATCH p=n-[r:OWES*1..200]->n
WHERE HAS(n.taxnumber) AND
ALL(x IN tail(nodes(p)) WHERE SINGLE(y IN tail(nodes(p)) WHERE x=y))
RETURN EXTRACT(s IN relationships(p) : s.amount), EXTRACT(t IN nodes(p) : ID(t)), length(p)

Can't you check for non-duplicates in the path?
Right now Cypher lacks a uniq function that would make this simple, so here is a workaround.
This is for Neo4j 2.0. Since 1.9 doesn't have CASE WHEN, it might be more involved there, probably using filter.
start n=node(*)
match p=n-[r:OWES*1..200]->n
where HAS(n.taxnumber) AND
reduce(a=tail(nodes(p)), x in tail(nodes(p)) |
case when a IS NULL OR x in tail(a) then null else tail(a) end) IS NOT NULL
return extract(s in relationships(p) | s.amount), extract(t in nodes(p) | ID(t)), length(p);
For 1.9 you might use an expression like this to find duplicates:
WITH [1,2,3] AS coll
reduce(a=false, x IN coll : a OR length(filter(y IN coll : x = y))> 1)
It works like this:
It uses tail(nodes(path)) because the start and end node are the same, which would always count as a duplicate.
It reduces over all elements of the collection. When it doesn't find the current element again in the rest of the collection, it returns the rest of the collection and repeats.
Otherwise it returns null and short-circuits.
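To make the steps above concrete, here is a minimal Python sketch of the same check. It is an illustration, not the Cypher implementation: None stands in for Cypher's null, the function name has_no_inner_duplicates is made up for this example, and it assumes the CASE condition is read as "a is null, or x occurs again later in a".

```python
from functools import reduce

def has_no_inner_duplicates(path_nodes):
    """Mirror the reduce/case expression; None plays the role of Cypher's null."""
    # Drop the start node: it always equals the end node in a closed cycle.
    tail = path_nodes[1:]

    def step(rest, x):
        # Once a duplicate has been found (rest is None), stay None.
        # Otherwise check whether x shows up again later in the remaining nodes.
        if rest is None or x in rest[1:]:
            return None
        return rest[1:]

    return reduce(step, tail, tail) is not None

print(has_no_inner_duplicates(list("abtya")))    # clean cycle
print(has_no_inner_duplicates(list("abcbefa")))  # b repeats
print(has_no_inner_duplicates(list("cercdac")))  # c repeats inside
```

The first call accepts the path and the other two reject it, which matches the wanted and unwanted examples from the question.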

Related

Neo4j Cypher query running endlessly

I have rerun the query below multiple times over the last two days. The Neo4j interface says it's running, but it seems to run endlessly. I have run other queries, which have all returned an output. I left this query running for 9 hours and it still had not finished. I'm not sure what the issue is but would appreciate any help.
I'm running Neo4j-community-2.3.12 which is an older version but it should work as I am following a tutorial and the rest of the queries work fine.
Cypher script - which is very basic:
match p=(ione)-[:ResponseTo*]->(itwo)
where length(p)=9 with p
match (u)-[:CreateChat]->(i)
where i in nodes(p)
return count(distinct u);
This query looks like an endless loop.
Instead of matching all paths and checking the length afterwards, I would suggest matching only paths of the desired length (9).
Also, consider adding labels to the path query.
match p=(ione)-[:ResponseTo*9]->(itwo)
with p
match (u)-[:CreateChat]->(i)
where i in nodes(p)
return count(distinct u);
As Raj noted, you will want to use labels in this, as right now this is doing an all nodes scan which isn't performant.
We can also make sure the second match is more performant by ensuring we start i with the previously matched nodes, rather than applying that as a filter after the match:
match p=(ione)-[:ResponseTo*9]->(itwo)
unwind nodes(p) as i
with DISTINCT i
match (u)-[:CreateChat]->(i)
return count(distinct u);

Neo4j indices slow when querying across 2 labels

I've got a graph where each node has label either A or B, and an index on the id property for each label:
CREATE INDEX ON :A(id);
CREATE INDEX ON :B(id);
In this graph, I want to find the node(s) with id "42", but I don't know the label a priori. To do this I am executing the following query:
MATCH (n {id:"42"}) WHERE (n:A OR n:B) RETURN n;
But this query takes 6 seconds to complete. However, doing either of:
MATCH (n:A {id:"42"}) RETURN n;
MATCH (n:B {id:"42"}) RETURN n;
Takes only ~10ms.
Am I not formulating my query correctly? What is the right way to formulate it so that it takes advantage of the installed indices?
Here is one way to use both indices. result will be a collection of matching nodes.
OPTIONAL MATCH (a:B {id:"42"})
OPTIONAL MATCH (b:A {id:"42"})
RETURN
(CASE WHEN a IS NULL THEN [] ELSE [a] END) +
(CASE WHEN b IS NULL THEN [] ELSE [b] END)
AS result;
You should use PROFILE to verify that the execution plan for your neo4j environment uses the NodeIndexSeek operation for both OPTIONAL MATCH clauses. If not, you can use the USING INDEX clause to give a hint to Cypher.
You should use UNION to make sure that both indexes are used. In your question you almost had the answer.
MATCH (n:A {id:"42"}) RETURN n
UNION
MATCH (n:B {id:"42"}) RETURN n
;
This will work. To check whether the indexes are used, put PROFILE or EXPLAIN before your query statement.
Indexes are created and used via a node label and property, and to use them you need to form your query the same way. That means queries without a label will scan all nodes, which gives the results you got.

Neo4J exclude nodes after set property

I have only just started with Neo4j so I suspect I am missing something very basic here. Given I have the following graph.
Starting on the highlighted node (ID 7937), I need to get all the connected nodes but nothing past any of the "off" nodes.
Using this
match (n:TestNode)-[:LINK*]-(m)
where ID(n) = 7937
return *
This gives me everything, of course, which I would expect since there is no filter.
I need the end result to be:
This seems to give me the result I need. Is this the correct way, or is there something better?
match p=n-[:LINK*..]-m
where ID(n) = 7937 and all(x in nodes(p) WHERE x.status = 'on')
return p;
Your query does not give you the results you said you wanted. That is, it does not include the nearest off nodes in the results.
Here is a query that does include the nearest off nodes. However, the results will contain partial paths as well as full paths; this is because your data has no consistent directionality to the LINK relationships, so it is hard to determine when we have reached the end of a path (it can probably be done, but it would make the query more complex).
MATCH p=n-[:LINK*..]-m
WHERE ID(n) = 7937
RETURN DISTINCT REDUCE(s =[n], x IN NODES(p)[1..] |
CASE (s[-1]).status
WHEN 'on' THEN s + x
ELSE s
END) AS res;
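As a rough illustration of what the REDUCE/CASE expression does to a single path, here is a small Python sketch. The node dictionaries and the function name truncate_at_first_off are hypothetical; only the status values 'on' and 'off' come from the question.

```python
from functools import reduce

def truncate_at_first_off(path_nodes):
    """Keep nodes up to and including the first node whose status is not 'on'."""
    def step(acc, x):
        # Append x only while the last kept node is still 'on'.
        return acc + [x] if acc[-1]["status"] == "on" else acc
    # Start the accumulator with the first node, like s = [n] in the query.
    return reduce(step, path_nodes[1:], [path_nodes[0]])

path = [
    {"id": 1, "status": "on"},
    {"id": 2, "status": "on"},
    {"id": 3, "status": "off"},
    {"id": 4, "status": "on"},
]
print([n["id"] for n in truncate_at_first_off(path)])  # [1, 2, 3]
```

The nearest off node (id 3) is kept and everything behind it is dropped, which is why different full paths can collapse to the same truncated result and DISTINCT is needed.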
Here is a console that shows sample results.

Optimizing Cypher Query Neo4j

I want to write a query in Cypher and run it on Neo4j.
The query is:
Given some start vertexes, walk the edges and find all vertexes that are connected to any of the start vertexes.
(start)-[*]->(v)
for every edge E walked
if startVertex(E).someproperty != endVertex(E).someproperty, output E.
The graph may contain cycles.
For example, in the graph above, vertexes are grouped by "group" property. The query should return 7 rows representing the 7 orange colored edges in the graph.
If I write the algorithm by myself it would be a simple depth / breadth first search, and for every edge visited if the filter condition is true, output this edge. The complexity is O(V+E)
But I can't express this algorithm in Cypher, since it's a very different language.
So I wrote this query instead:
find all reachable vertexes
(start)-[*]->(v), reachable = start + v.
find all edges starting from any of reachable. if an edge ends with any reachable vertex and passes the filter, output it.
match (reachable)-[]->(n) where n in reachable and reachable.someprop != n.someprop
so the Cypher code looks like this:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
The performance of this query is not as good as I expected. There is an index on :Col(schema).
I am running neo4j 2.3.0 docker image from dockerhub on my windows laptop. Actually it runs on a linux virtual machine on my laptop.
My sample data is a small dataset that contains 0.1M vertexes and 0.5M edges. For some starting nodes it takes 60 or more seconds to complete this query. Any advice for optimizing or rewriting the query? Thanks.
The following code block is the logic I want:
VertexQueue1 = (starting vertexes);
VisitedVertexSet = (empty);
EdgeSet1 = (empty);
While (VertexQueue1 is not empty)
{
Vertex0 = VertexQueue1.pop();
VisitedVertexSet.add(Vertex0);
foreach (Edge0 starting from Vertex0)
{
Vertex1 = endingVertex(Edge0);
if (Vertex1.schema <> Vertex0.schema)
{
EdgeSet1.put(Edge0);
}
if (VisitedVertexSet.notContains(Vertex1)
and VertexQueue1.notContains(Vertex1))
{
VertexQueue1.push(Vertex1);
}
}
}
return EdgeSet1;
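For reference, the pseudocode above could be written as runnable Python roughly as follows. This is a sketch under assumptions: the graph is held in two plain dictionaries (out_edges and schema), which is a made-up in-memory layout and not how Neo4j stores data, and the function name cross_schema_edges is hypothetical.

```python
from collections import deque

def cross_schema_edges(start_vertices, out_edges, schema):
    """Breadth-first walk; returns edges whose endpoints have different schemas.

    out_edges: dict mapping vertex -> list of successor vertices (assumed layout)
    schema:    dict mapping vertex -> schema name
    """
    queue = deque(start_vertices)
    seen = set(start_vertices)  # vertices already queued or visited
    result = set()
    while queue:
        v0 = queue.popleft()
        for v1 in out_edges.get(v0, []):
            if schema[v1] != schema[v0]:
                result.add((v0, v1))
            if v1 not in seen:
                seen.add(v1)
                queue.append(v1)
    return result

# Small graph with a cycle a -> c -> a; d is not reachable from a.
out_edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
schema = {"a": "s1", "b": "s1", "c": "s2", "d": "s2"}
print(sorted(cross_schema_edges(["a"], out_edges, schema)))
```

Each vertex is queued once and each outgoing edge is inspected once, which is the O(V+E) behaviour described in the question.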
EDIT:
The profile result shows that expanding all paths has a high cost. Looking at the row counts, it seems that the Cypher execution engine returns all paths, but I only want the distinct edge list.
LEFT one:
match (start:Col {table:"F_XXY_DSMK_ITRPNL_IDX_STAT_W"})
,(start)-[*0..]->(prev:Col)-->(node:Col)
where prev.schema<>node.schema
return distinct prev,node
RIGHT one:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
I think Cypher lets this be much easier than you're expecting it to be, if I'm understanding the query. Try this:
MATCH (start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WHERE start.schema <> node.schema
RETURN start, node
Though I'm not sure why you're comparing the schema property on the nodes. Isn't the schema for the start node fixed by the value that you pass in?
I might not be understanding the query though. If you're looking for more than just the nodes connected to the start node, you could do:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN prev, node
That open-ended variable length relationship specification might be slow, though.
Also note that when Cypher is following a particular path, it stops when it finds that it has looped back onto some node (EDIT: relationship, not node) in the path matched so far, so cycles aren't really a problem.
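A small Python sketch can show why this rule makes path expansion terminate on a cycle. It is only a model of the idea that no relationship is used twice within one path; the tuple-based graph and the function name paths_without_repeated_rels are hypothetical and say nothing about how Neo4j implements it internally.

```python
def paths_without_repeated_rels(start, rels):
    """Enumerate paths from start in which no relationship is used twice.

    rels: list of (rel_id, source, target) tuples, an assumed stand-in for a graph.
    """
    found = []

    def expand(node, used, path):
        for rel_id, src, dst in rels:
            # Follow only outgoing relationships not yet used in this path.
            if src == node and rel_id not in used:
                new_path = path + [dst]
                found.append(new_path)
                expand(dst, used | {rel_id}, new_path)

    expand(start, frozenset(), [start])
    return found

# A three-node cycle: a -> b -> c -> a
rels = [("r1", "a", "b"), ("r2", "b", "c"), ("r3", "c", "a")]
for p in paths_without_repeated_rels("a", rels):
    print("-".join(p))
```

The expansion stops after a-b-c-a because the only way onward from a is r1, which is already part of the path. Nodes may repeat (a appears twice), but relationships may not.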
Also, is the DWMDATA value that you're passing in interpolated? If so, you should think about using parameters for security / performance:
http://neo4j.com/docs/stable/cypher-parameters.html
EDIT:
Based on your comment I have a couple of thoughts. First, limiting to DISTINCT paths isn't going to help, because every path that it finds is distinct. What you want, I think, is the distinct set of pairs, which could be achieved by just adding DISTINCT to the query:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINCT prev, node
Here is another way to go about it which may or may not be more efficient, but might at least give you an idea for how to go about things differently:
MATCH
path=(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WITH rels(path) AS rels
UNWIND rels AS rel
WITH DISTINCT rel
WITH startNode(rel) AS start_node, endNode(rel) AS end_node
WHERE start_node.schema <> end_node.schema
RETURN start_node, end_node
I can't say that this would be faster, but here's another way to try:
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH collect(ID(node)) AS node_ids
MATCH (:Col)-[r]->(node:Col)
WHERE ID(node) IN node_ids
WITH DISTINCT r
RETURN startNode(r) AS start_node, endNode(r) AS end_node
I suspect that the problem in all cases is with the open-ended variable length path. I've actually asked on the Slack group to try to get a better understanding of how it works. In the meantime, for all the queries that you try I would suggest prefixing them with the PROFILE keyword to get a report from Neo4j on what parts of the query are slow.
// this is very inefficient!
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH distinct node
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
you might be better off with this:
MATCH (start:Col)
WHERE start.property IN {property_values}
MATCH (node:Col)
WHERE shortestPath((start)-[*]->(node)) IS NOT NULL
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;

Cypher with clause unknown identifier `n` error.

This is my cypher query:
start n=node(*) match p=n-[r:OWES*1..200]->n
with count(n) as numbern ,count(r) as numberr
where HAS(n.taxnumber) and numbern >= numberr
return extract(s in relationships(p) : s.amount), extract(t in nodes(p) : ID(t)), length(p);
This gives me Unknown identifier n error. What is wrong with this? I use Neo4j 1.8.2 for this.
n is no longer visible after your WITH,
p is also not visible then.
This query doesn't make any sense, what do you want to achieve with the aggregation?
Both counts return the same number btw.
Besides what we already discussed in the other issues, what do you want to achieve?
