Determining distinct clusters with Neo4j - neo4j

I have a cypher query that returns a series of paths, which are partly overlapping and result in a number of distinct clusters. In this case there will be a modest number of clusters (100 - 1000) of relatively small size (1-50 nodes). The complete dataset is typically a few million nodes (the query extracts a relatively small subset of the total nodes).
A simplified version of the query looks like this:
MATCH p=(a:M)-[:F2EDGE]-(b:M) WHERE a.prop > 90 AND b.prop > 90 RETURN p
The actual query would be a bit more complex than that with a variable number of intermediate nodes, but that should exemplify the problem.
Now I want to explore the different clusters that are generated by that query.
I have found the docs on the Connected Components algorithm, which seems to be along the right lines, but I can't see how it can be applied to the list of paths returned by the query.
I would want to be able to:
get a list of the clusters and some basic properties for them (e.g. number of nodes)
fetch data that allows me to reproducibly fetch that cluster again in the future (maybe by fetching the node ids or by adding new "cluster" nodes that link to each cluster)
Can someone suggest how to achieve this?

You can use Cypher projections for that, something along these lines:
CALL algo.unionFind('
MATCH (a:M) WHERE a.prop > 90 RETURN id(a) as id
UNION
MATCH (b:M) WHERE b.prop > 90 RETURN id(b) as id
', '
MATCH p=(a:M)-[:F2EDGE]->(b:M) WHERE a.prop > 90 AND b.prop > 90 RETURN id(a) as source, id(b) as target
', {graph:"cypher",write:true, partitionProperty:"partition"})
Please note that in this case one of the node queries would have been enough as they both have the same criteria, I just wanted to demonstrate how to combine queries on source and target nodes.
If you want to restrict the nodes to only the ones in your connected graph you can also use this as "node-query":
MATCH (a:M)-[:F2EDGE]-(b:M)
WHERE a.prop > 90 AND b.prop > 90
UNWIND [id(a), id(b)] as id
RETURN distinct id
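To make it clearer what the union-find call computes from the relationship query, here is a minimal Python sketch that works outside the database. It assumes the (source, target) id pairs have already been fetched into a list; the function name connected_components and the sample ids are made up for illustration. Nodes without any matching relationship never appear in the edge list, so they are not reported as single-node clusters here.

```python
from collections import defaultdict


def connected_components(edges):
    """Group node ids into clusters, given (source, target) id pairs."""
    parent = {}

    def find(node):
        # Walk up to the root, shortening the path as we go.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in edges:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            # Keep the smaller id as the root so cluster labels are reproducible.
            low, high = sorted((root_s, root_t))
            parent[high] = low

    clusters = defaultdict(list)
    for node in parent:
        clusters[find(node)].append(node)
    return {root: sorted(members) for root, members in clusters.items()}


# Hypothetical rows, as if returned by the relationship query above.
edges = [(1, 2), (2, 3), (10, 11), (20, 21), (21, 22), (22, 20)]
clusters = connected_components(edges)
for root, members in sorted(clusters.items()):
    print(root, len(members), members)
```

Each cluster is keyed by its smallest node id, which gives both of the things asked for: a size per cluster and a list of node ids that can be used to fetch the same cluster again later.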

Related

I want to rank the nodes by degree - why is this Neo4J Cypher request so slow?

I want to first get all the nodes of a certain type connected to a context and then simply rank them by their degree, but only for the (:TO) type of connection to the other nodes that belong to the same context. I tried several ways including the ones below but they are too slow (tens of seconds). Is there any way to make it faster?
MATCH (ctx:Context{uid:'60156a60-d3e1-11ea-9477-f71401ca7fdb'})<-[:AT]-(c1:Concept)
WITH c1 MATCH (c1)-[r:TO]-(c2:Concept)
WHERE r.context = '60156a60-d3e1-11ea-9477-f71401ca7fdb'
RETURN c2, count(r) as degree ORDER BY degree DESC LIMIT 10;
MATCH (ctx:Context{uid:'60156a60-d3e1-11ea-9477-f71401ca7fdb'})<-[:AT]-(c1:Concept)-[:TO]-(c2:Concept)
RETURN c1, count(c2) as degree
ORDER BY degree DESC LIMIT 10;
One way to examine degree is to use the size function. Have you tried something like this?
size((c1)-[:TO]-(:Concept))
In my graph size() appears to be more efficient, but it might be my cypher rearrangement as well.
Example (in my graph): this statement takes 81 db hits
PROFILE MATCH (g:Gene {name:'ACE2'})-[r:EXPRESSED_IN]-(a)
return count(r)
And this is 4 db hits
PROFILE MATCH (g:Gene {name:'ACE2'})
return size((g)-[:EXPRESSED_IN]-())
I'm not sure this next suggestion is faster/more efficient, but if you always calculate degree on a single or subset of relationships, you might look into storing the degree values just to see if that might be an option (faster?).
I do this on my entire graph right after a bulk load
CALL apoc.periodic.iterate(
"MATCH (n) return n",
"set n.degree = size((n)--())",
{batchSize:50000, batchMode: "BATCH", parallel:true});
I do this for a different reason, though: I want to see the degree value in the Neo4j browser, for example. Note: I rebuild my graphs daily from the ground up, but they are then static until the next rebuild.
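For reference, the ranking the question is after can be written down in a few lines of Python. This is only a sketch of the intended result, not of how Neo4j evaluates it; it assumes the :TO relationships are available as (source, target, context) tuples, and the function name top_by_degree and the sample data are hypothetical.

```python
from collections import Counter


def top_by_degree(relationships, context, limit=10):
    """Rank nodes by how many relationships in `context` touch them."""
    degree = Counter()
    for source, target, rel_context in relationships:
        if rel_context != context:
            continue
        # The pattern is undirected, so each relationship counts for both ends.
        degree[source] += 1
        degree[target] += 1
    return degree.most_common(limit)


# Hypothetical relationships: (source, target, context uid)
rels = [
    ("a", "b", "ctx1"),
    ("a", "c", "ctx1"),
    ("b", "c", "ctx2"),
    ("a", "d", "ctx1"),
]
print(top_by_degree(rels, "ctx1", limit=3))
```

The work is a single pass over the relationships of one context, which is a useful yardstick when reading the PROFILE output of the Cypher versions.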

Neo4j: How to find for each node its next neighbour by distance and create a relationship

I imported a large set of nodes (>16 000) where each node contains information about a location (longitude/latitude geo-data). All nodes have the same label. There are no relationships in this scenario. Now I want to identify, for each node, the nearest neighbour by distance and create a relationship between these nodes.
This (brute force) way worked well for sets containing about 1000 nodes: (1) I first defined relationships between all nodes containing the distance information. (2) Then I defined for all relationships the property "mindist=false". (3) After that I identified the nearest neighbour by looking at the distance information for each relationship and set the "mindist" property to "true" where the relationship represents the shortest distance. (4) Finally I deleted all relationships with "mindist=false".
(1)
match (n1:XXX),(n2:XXX)
where id(n1) <> id(n2)
with n1,n2,distance(n1.location,n2.location) as dist
create(n1)-[R:DISTANCE{dist:dist}]->(n2)
Return R
(2)
match (n1:XXX)-[R:DISTANCE]->(n2:XXX)
set R.mindist=false return R.mindist
(3)
match (n1:XXX)-[R:DISTANCE]->(n2:XXX)
with n1, min(R.dist) as mindist
match (o1:XXX)-[r:DISTANCE]->(o2:XXX)
where o1.name=n1.name and r.dist=mindist
Set r.mindist=TRUE
return r
(4)
match (n)-[R:DISTANCE]->()
where R.mindist=false
delete R return n
With sets containing about 16000 nodes this solution didn't work (memory problems ...). I am sure there is a smarter way to solve this problem (but at this point of time I am still short on experience working with neo4j/cypher). ;-)
You can find the closest neighbor for each node one at a time, in batches, using APOC. (This is also a brute-force approach, but it runs faster.) It takes around 75 seconds for 7322 nodes.
CALL apoc.periodic.iterate("MATCH (n1:XXX)
RETURN n1", "
WITH n1
MATCH (n2:XXX)
WHERE id(n1) <> id(n2)
WITH n1, n2, distance(n1.location,n2.location) as dist ORDER BY dist LIMIT 1
CREATE (n1)-[r:DISTANCE{dist:dist}]->(n2)", {batchSize:1, parallel:true, concurrency:10})
NOTE: batchSize should always be 1 in this query. You can change concurrency for experimentation.
Our options within Cypher are I think limited to a naive O(n^2) brute-force check of the distance from every node to every other node. If you were to write some custom Java to do it (which you could expose as a Neo4j plugin), you could do the check much quicker.
Still, you can do it with arbitrary numbers of nodes in the graph without blowing out the heap if you use APOC to split the query up into multiple transactions. Note: you'll need to add the APOC plugin to your install.
Let's first create 20,000 points of test data:
WITH range(0, 19999) as ids
WITH [x in ids | { id: x, loc: point({ x: rand() * 100, y: rand() * 100 }) }] as points
UNWIND points as pt
CREATE (p: Point { id: pt.id, location: pt.loc })
We'll probably want a couple of indexes too:
CREATE INDEX ON :Point(id)
CREATE INDEX ON :Point(location)
In general, the following query (don't run it yet...) would, for each Point node create a list containing the ID and distance to every other Point node in the graph, sort that list so the nearest one is at the top, pluck the first item from the list and create the corresponding relationship.
MATCH (p: Point)
MATCH (other: Point) WHERE other.id <> p.id
WITH p, [x in collect(other) | { id: x.id, dist: distance(p.location, x.location) }] AS dists
WITH p, head(apoc.coll.sortMaps(dists, '^dist')) AS closest
MATCH (closestPoint: Point { id: closest.id })
MERGE (p)-[:CLOSEST_TO]->(closestPoint)
However, the first two lines there cause a cartesian product of nodes in the graph: for us, it's 400 million rows (20,000 * 20,000) that flow into the rest of the query all of which is happening in memory - hence the blow-up. Instead, let's use APOC and apoc.periodic.iterate to split the query in two:
CALL apoc.periodic.iterate(
"
MATCH (p: Point)
RETURN p
",
"
MATCH (other: Point) WHERE other.id <> p.id
WITH p, [x in collect(other) | { id: x.id, dist: distance(p.location, x.location) }]
AS dists
WITH p, head(apoc.coll.sortMaps(dists, '^dist')) AS closest
MATCH (closestPoint: Point { id: closest.id })
MERGE (p)-[:CLOSEST_TO]->(closestPoint)
", { batchSize: 100 })
The first query just returns all Point nodes. apoc.periodic.iterate will then take the 20,000 nodes from that query and split them up into batches of 100 before running the inner query on each of the nodes in each batch. We'll get a commit after each batch, and our memory usage is constrained to whatever it costs to run the inner query.
It's not quick, but it does complete. On my machine it processes about 12 nodes a second on a graph with 20,000 nodes, but the total cost grows quadratically with the number of nodes, since every node is compared with every other node. You'll rapidly hit the point where this approach just doesn't scale well enough.
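The same brute-force idea can be sketched in plain Python to show where the cost comes from. This is an illustration under stated assumptions, not a replacement for the Cypher: it uses planar Euclidean distance on (x, y) tuples, whereas Cypher's distance() on geographic points is computed differently, and the function name nearest_neighbours and the sample points are made up.

```python
import math


def nearest_neighbours(points):
    """Map each point id to (nearest other id, distance) by brute force."""
    result = {}
    for pid, (px, py) in points.items():
        best_id, best_dist = None, math.inf
        for oid, (ox, oy) in points.items():
            if oid == pid:
                continue
            dist = math.hypot(px - ox, py - oy)
            # Keep only the running minimum instead of collecting and sorting.
            if dist < best_dist:
                best_id, best_dist = oid, dist
        result[pid] = (best_id, best_dist)
    return result


# Hypothetical planar coordinates keyed by node id.
points = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 0.0)}
for pid, (other, dist) in nearest_neighbours(points).items():
    print(pid, "->", other, dist)
```

Tracking a running minimum keeps memory per node constant, but the two nested loops are still the n * n comparisons discussed above. Doing substantially better would need a spatial index of some kind.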

Find a set of (n) nodes where relationship weight between each pair of node is greater than a value(w)

I have a database where each node is connected to all other nodes with a relationship, and each relationship has a weight. Given a weight w and a number of nodes n, I need a query that returns every set of n nodes in which the relationship between each pair of nodes has a weight greater than w.
Any help on this would be great
It depends on what you would like your result set to look like. Something as simple as this query would return all paths that meet your criteria (substitute your threshold for w):
MATCH p=()-[r:my_rel]->() WHERE r.weight > w RETURN p;
If you would like the two nodes only (and not the entire pattern's results), you can return only those two nodes:
MATCH (n1)-[r:my_rel]->(n2) WHERE r.weight > w RETURN n1,n2;
Do note that, due to Neo4j's storage internals, searching on the properties of a relationship tends not to perform as well as searching on the properties of a node.
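The queries above filter individual relationships. The question as asked also requires that every pair inside a group of n nodes passes the threshold. A minimal Python sketch of that check, assuming the weights have been exported as a dictionary keyed by node pair (the name heavy_groups and the sample weights are hypothetical):

```python
from itertools import combinations


def heavy_groups(weights, n, w):
    """Return every n-node set in which all pairwise weights exceed w."""
    nodes = sorted({node for pair in weights for node in pair})

    def weight(a, b):
        # Each pair is stored once; look it up in either orientation.
        return weights.get((a, b), weights.get((b, a), float("-inf")))

    return [
        group
        for group in combinations(nodes, n)
        if all(weight(a, b) > w for a, b in combinations(group, 2))
    ]


# Hypothetical weights on a fully connected graph of four nodes.
weights = {
    ("a", "b"): 5, ("a", "c"): 7, ("b", "c"): 6,
    ("a", "d"): 1, ("b", "d"): 9, ("c", "d"): 2,
}
print(heavy_groups(weights, 3, 4))
```

This tries every combination of n nodes, so it is only practical for small graphs or small n. It is meant to pin down the expected result, not to suggest an efficient method.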

Optimizing Cypher Query Neo4j

I want to write a query in Cypher and run it on Neo4j.
The query is:
Given some start vertexes, walk the edges and find all vertexes that are connected to any of the start vertexes.
(start)-[*]->(v)
for every edge E walked
if startVertex(E).someproperty != endVertex(E).someproperty, output E.
The graph may contain cycles.
For example, in the graph above, vertexes are grouped by "group" property. The query should return 7 rows representing the 7 orange colored edges in the graph.
If I write the algorithm by myself it would be a simple depth / breadth first search, and for every edge visited if the filter condition is true, output this edge. The complexity is O(V+E)
But I can't express this algorithm in Cypher since it's a very different language.
Then I wrote this query:
find all reachable vertexes
(start)-[*]->(v), reachable = start + v.
find all edges starting from any of reachable. if an edge ends with any reachable vertex and passes the filter, output it.
match (reachable)-[]->(n) where n in reachable and reachable.someprop != n.someprop
so the Cypher code looks like this:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
The performance of this query is not as good as I expected. There is an index on :Col(schema).
I am running neo4j 2.3.0 docker image from dockerhub on my windows laptop. Actually it runs on a linux virtual machine on my laptop.
My sample data is a small dataset that contains 0.1M vertexes and 0.5M edges. For some starting nodes it takes 60 or more seconds to complete this query. Any advice for optimizing or rewriting the query? Thanks.
The following code block is the logic I want:
VertexQueue1 = (starting vertexes);
VisitedVertexSet = (empty);
EdgeSet1 = (empty);
While (VertexQueue1 is not empty)
{
    Vertex0 = VertexQueue1.pop();
    VisitedVertexSet.add(Vertex0);
    foreach (Edge0 starting from Vertex0)
    {
        Vertex1 = endingVertex(Edge0);
        if (Vertex1.schema <> Vertex0.schema)
        {
            EdgeSet1.put(Edge0);
        }
        if (VisitedVertexSet.notContains(Vertex1)
            and VertexQueue1.notContains(Vertex1))
        {
            VertexQueue1.push(Vertex1);
        }
    }
}
return EdgeSet1;
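For anyone who wants to run that logic, here is one possible Python rendering of the pseudocode. It assumes the graph has been loaded into a dictionary of outgoing neighbours and a dictionary of schema values per vertex; the names cross_schema_edges, out_edges and schema are illustrative and not part of any Neo4j API.

```python
from collections import deque


def cross_schema_edges(out_edges, schema, starts):
    """Breadth-first walk that collects edges whose endpoints differ in schema."""
    queue = deque(dict.fromkeys(starts))  # drop duplicate start vertexes
    seen = set(queue)  # every vertex ever queued, so none is expanded twice
    found = []
    while queue:
        vertex = queue.popleft()
        for neighbour in out_edges.get(vertex, ()):
            if schema[vertex] != schema[neighbour]:
                found.append((vertex, neighbour))
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return found


# Hypothetical graph with a cycle a -> c -> a.
out_edges = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"]}
schema = {"a": "s1", "b": "s1", "c": "s2", "d": "s2"}
print(cross_schema_edges(out_edges, schema, ["a"]))
```

Each vertex is expanded once and each outgoing edge is inspected once, which is the O(V+E) behaviour described above. The cycle causes no trouble because a vertex that has already been queued is never queued again.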
EDIT:
The profile result shows that expanding all paths has a high cost. Looking at the row counts, it seems that the Cypher execution engine returns all paths, but I want only the distinct list of edges.
LEFT one:
match (start:Col {table:"F_XXY_DSMK_ITRPNL_IDX_STAT_W"})
,(start)-[*0..]->(prev:Col)-->(node:Col)
where prev.schema<>node.schema
return distinct prev,node
RIGHT one:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
I think Cypher lets this be much easier than you're expecting it to be, if I'm understanding the query. Try this:
MATCH (start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WHERE start.schema <> node.schema
RETURN start, node
Though I'm not sure why you're comparing the schema property on the nodes. Isn't the schema for the start node fixed by the value that you pass in?
I might not be understanding the query though. If you're looking for more than just the nodes connected to the start node, you could do:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN prev, node
That open-ended variable length relationship specification might be slow, though.
Also note that when Cypher is traversing a particular path it stops when it finds that it has looped back onto some node (EDIT: relationship, not node) in the path matched so far, so cycles aren't really a problem.
Also, is the DWMDATA value that you're passing in interpolated? If so, you should think about using parameters for security / performance:
http://neo4j.com/docs/stable/cypher-parameters.html
EDIT:
Based on your comment I have a couple of thoughts. First, limiting to DISTINCT paths isn't going to help because every path that it finds is distinct. What you want, I think, is the distinct set of pairs, which could be achieved by just adding DISTINCT to the query:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINCT prev, node
Here is another way to go about it which may or may not be more efficient, but might at least give you an idea for how to go about things differently:
MATCH
path=(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WITH rels(path) AS rels
UNWIND rels AS rel
WITH DISTINCT rel
WITH startNode(rel) AS start_node, endNode(rel) AS end_node
WHERE start_node.schema <> end_node.schema
RETURN start_node, end_node
I can't say that this would be faster, but here's another way to try:
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH collect(ID(node)) AS node_ids
MATCH (:Col)-[r]->(node:Col)
WHERE ID(node) IN node_ids
WITH DISTINCT r
RETURN startNode(r) AS start_node, endNode(r) AS end_node
I suspect that the problem in all cases is with the open-ended variable length path. I've actually asked on the Slack group to try to get a better understanding of how it works. In the meantime, for all the queries that you try I would suggest prefixing them with the PROFILE keyword to get a report from Neo4j on what parts of the query are slow.
// this is very inefficient!
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH distinct node
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
You might be better off with this:
MATCH (start:Col)
WHERE start.property IN {property_values}
MATCH (node:Col)
WHERE shortestPath((start)-[*]->(node)) IS NOT NULL
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;

neo4J graph query generator

I am looking for a way to generate some graph queries in neo4j.
As the database is huge, can anyone suggest a procedure to generate small queries (3-5 nodes, e.g. a -> b -> c -> a)?
I can run BFS from a node but how can I find the small graph containing only a specific number of nodes as graph structure?
   a
  / \
b-----c----d
[UPDATED]
If you want to get a single arbitrary path of length 4 (having 4 relationships and 5 nodes), and you do not need the path to be unidirectional, then you can simply do this:
MATCH p=()-[*4]-()
RETURN p
LIMIT 1;
If you want the path to be unidirectional (where all relationships point in the same direction), then you just need to specify a direction:
MATCH p=()-[*4]->()
RETURN p
LIMIT 1;
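One detail worth knowing about such patterns: Cypher will not use the same relationship twice within one matched path, but it may visit the same node more than once, so the 5 nodes on a path of length 4 are not necessarily distinct. The Python sketch below mimics that rule on a small edge list to show the effect. The function find_path and the sample edges (taken from the little diagram in the question) are illustrative only.

```python
def find_path(edges, length, directed=False):
    """Return one node sequence that uses `length` distinct relationships, or None."""
    # Index each relationship by the nodes it can be left from.
    incident = {}
    for rel_id, (source, target) in enumerate(edges):
        incident.setdefault(source, []).append((rel_id, target))
        incident.setdefault(target, [])
        if not directed:
            incident[target].append((rel_id, source))

    def extend(node, used, path):
        if len(used) == length:
            return path
        for rel_id, nxt in incident[node]:
            if rel_id in used:
                continue  # a relationship may appear only once per path
            result = extend(nxt, used | {rel_id}, path + [nxt])
            if result is not None:
                return result
        return None

    for start in incident:
        result = extend(start, frozenset(), [start])
        if result is not None:
            return result
    return None


# Edges of the diagram above: a-b, a-c, b-c, c-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(find_path(edges, 4))                 # direction ignored
print(find_path(edges, 3, directed=True))  # direction respected
```

On this graph the only undirected paths with four relationships pass through c twice, so if the small graph should contain distinct nodes only, that needs an extra check on the result.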
