Can graph algorithms take nodes' and relationships' properties in Neo4J?

I'm starting to use the Graph Algorithms plugin of Neo4J (3.3.x) and wanted to ask whether the plugin can take the properties of the nodes/relationships into account, so that I could filter a request like this:
CALL algo.pageRank.stream('Page', 'LINKS', {iterations:20, dampingFactor:0.85})
YIELD node, score
RETURN node,score order by score desc limit 20
I would like to restrict it to only some of the nodes labeled Page (e.g. only the ones with timestamp > certain_date), or to only the LINKS relationships that have a specific property x.
Or, if that's not possible, should I use Cypher projection and simply put a Cypher query inside the pageRank call?

You can use Cypher projection to be more selective about which nodes and relationships to process with a graph algorithm.
For example, to execute the algo.pageRank algorithm only on Page nodes whose timestamp > 1000, and LINKS relationships that have a specific property x, this should work:
MERGE (dummy:Dummy)
WITH dummy, ID(dummy) AS dummy_id
CALL algo.pageRank.stream(
'OPTIONAL MATCH (p:Page) WHERE p.timestamp > 1000 RETURN CASE WHEN p IS NOT NULL THEN ID(p) ELSE ' + dummy_id + ' END AS id',
'OPTIONAL MATCH (p1:Page)-[link:LINKS]->(p2:Page) WHERE EXISTS(link.x) WITH CASE WHEN link IS NOT NULL THEN [ID(p1), ID(p2)] ELSE [' + dummy_id + ',' + dummy_id + '] END AS res RETURN res[0] AS source, res[1] as target',
{graph:'cypher', iterations:20, dampingFactor:0.85})
YIELD node, score
WITH dummy, node, score
WHERE node <> dummy
RETURN node, score ORDER BY score DESC LIMIT 20;
NOTE: The graph algorithms are currently badly behaved (i.e., they throw exceptions) when either of the Cypher statements used in a Cypher projection return no results. The above query works around that by making sure that both statements return a dummy node instead of returning nothing. The Cypher statement that "wraps" the algorithm call will then filter out the dummy node if it is returned by the algorithm.

Related

Determining distinct clusters with Neo4j

I have a cypher query that returns a series of paths, which are partly overlapping and result in a number of distinct clusters. In this case there will be a modest number of clusters (100 - 1000) of relatively small size (1-50 nodes). The complete dataset is typically a few million nodes (the query extracts a relatively small subset of the total nodes).
A simplified version of the query looks like this:
MATCH p=(a:M)-[:F2EDGE]-(b:M) WHERE a.prop > 90 AND b.prop > 90 RETURN p
The actual query would be a bit more complex than that with a variable number of intermediate nodes, but that should exemplify the problem.
Now I want to explore the different clusters that are generated by that query.
I have found the docs on the Connected Components algorithm which seems on the right lines, but I can't see how that can be applied to a list of paths that is the result of the query.
I would want to be able to:
get a list of the clusters and some basic properties for them (e.g. number of nodes)
fetch data that allowed me to reproducibly fetch that cluster again in the future (maybe by fetching the node ids or by adding new "cluster" nodes that linked to each cluster)
Can someone suggest how to achieve this?
You can use Cypher projections for that, something along these lines:
CALL algo.unionFind('
MATCH (a:M) WHERE a.prop > 90 RETURN id(a) as id
UNION
MATCH (b:M) WHERE b.prop > 90 RETURN id(b) as id
', '
MATCH p=(a:M)-[:F2EDGE]->(b:M) WHERE a.prop > 90 AND b.prop > 90 RETURN id(a) as source, id(b) as target
', {graph:"cypher",write:true, partitionProperty:"partition"})
Please note that in this case one of the node queries would have been enough as they both have the same criteria, I just wanted to demonstrate how to combine queries on source and target nodes.
If you want to restrict the nodes to only the ones in your connected graph you can also use this as "node-query":
MATCH (a:M)-[:F2EDGE]-(b:M)
WHERE a.prop > 90 AND b.prop > 90
UNWIND [id(a), id(b)] as id
RETURN distinct id
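To make the result easier to picture, here is a minimal Python sketch of what a union-find pass computes from the two projections: the node query supplies ids, the relationship query supplies (source, target) pairs, and every node ends up in exactly one cluster. This is not the plugin's implementation; the function name, the toy ids, and the choice to ignore edges whose endpoints are missing from the node list are all assumptions made for illustration.

```python
from collections import defaultdict


def connected_components(node_ids, edges):
    """Group node ids into clusters with a simple union-find (disjoint set)."""
    parent = {n: n for n in node_ids}

    def find(x):
        # Walk up to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for source, target in edges:
        # This sketch skips edges whose endpoints are outside the node list.
        if source in parent and target in parent:
            root_s, root_t = find(source), find(target)
            if root_s != root_t:
                # Keep the smaller id as the root so cluster keys are stable.
                parent[max(root_s, root_t)] = min(root_s, root_t)

    clusters = defaultdict(list)
    for n in node_ids:
        clusters[find(n)].append(n)
    return dict(clusters)


# Hypothetical ids standing in for the rows returned by the two statements.
nodes = [1, 2, 3, 4, 5, 6]
rels = [(1, 2), (2, 3), (4, 5)]
clusters = connected_components(nodes, rels)
sizes = {root: len(members) for root, members in clusters.items()}
```

The cluster key plays the same role as the partition property written back by the procedure: grouping nodes by it gives the list of clusters and their sizes, and storing it lets you fetch the same cluster again later.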

Neo4J Graph Algorithms Cypher Projection should return only numbers?

Hello, I make a Graph Algorithms request in Cypher of the following kind, which first finds the nodes and then the relationships between them:
CALL algo.pageRank.stream('MATCH (u:User{uid:"0ee14110-426a-11e8-9d67-e79789c69fd7"}),
(ctx:Context{name:"news180417"}), (u)<-[:BY]-(c:Concept)-[:AT]->(ctx)
RETURN DISTINCT id(c) as id',
'CALL apoc.index.relationships("TO","user:0ee14110-426a-11e8-9d67-e79789c69fd7")
YIELD rel, start, end WITH DISTINCT rel, start, end MATCH (ctx:Context)
WHERE rel.context = ctx.uid AND (ctx.name="news180417" )
RETURN DISTINCT id(start) AS source, id(end) AS target',
{graph:'cypher', iterations:5});
Which works fine. However, when I try to return c.uid instead of its Neo4J id() the Graph Algorithms don't accept it.
Does it mean I can only operate using Neo4J ids in Graph Algorithms?
When you use Cypher projection with the Graph Algorithms procedures, you pass 2 Cypher statements (and a config map).
The first Cypher statement must return an id variable whose value is the native ID of a node.
The second Cypher statement must return source and target variables whose values are also node IDs.
So, yes, your Cypher statements must always return neo4j native IDs.
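If you still want c.uid in the output, one option is to keep native IDs inside the projection and translate them afterwards. A minimal Python sketch of that translation step, assuming you have already fetched an id-to-uid lookup and the (id, score) rows; the function name and the sample values are hypothetical:

```python
def scores_by_uid(id_to_uid, scored_rows):
    """Translate (native_id, score) rows into (uid, score) rows, best first."""
    translated = [
        (id_to_uid[node_id], score)
        for node_id, score in scored_rows
        if node_id in id_to_uid  # ignore ids we have no uid for
    ]
    return sorted(translated, key=lambda row: row[1], reverse=True)


# Hypothetical data: native ids used by the algorithm, uids used by the application.
id_to_uid = {101: "concept-a", 102: "concept-b", 103: "concept-c"}
scored_rows = [(101, 0.15), (102, 0.42), (103, 0.27)]
ranked = scores_by_uid(id_to_uid, scored_rows)
```

The algorithm only ever sees native IDs; the uid is looked up once the scores are back.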

Cypher-How to set property for the nodes along the shortestpath

I'm new to Neo4j and Cypher, with about a week's experience. I'm working on a small project that works with a graph of the tens of thousands of TWS batch jobs running on my company's mainframe. A key task is to find what we call the key path of the batch jobs in the last batch at midnight, which is actually the weighted shortest path in Neo4j. I have already achieved that goal using a Cypher query like the one below.
MATCH (a:Job {Jobname:...}),(b:Job {Jobname:...})
call apoc.algo.dijkstra(a,b,'runafter>','Duration') YIELD path, weight
RETURN path,weight
I wrote a Python program with the Neo4j driver that runs automatically every day: it extracts the batch job data from an RDBMS, creates a new graph in Neo4j, runs the Cypher queries, and formats the resulting key path to fit my MySQL database so that I can compare the key paths of different days.
But a new idea came to my mind: what if I could enhance this Cypher query so that the nodes along the returned path get a label or a property? Then I could easily refer to the key path again later without calling Dijkstra every time. I know I can use my Python program to do that, by generating a series of Cypher statements once the key path is back, but I think there should be a solution with Cypher alone. Thanks a lot in advance!
Compute the path identifier value.
Take the array of nodes along the path: NODES.
Go through each node: UNWIND or FOREACH.
Set a label or a property: plain Cypher cannot use the value of a variable as a label, so write the identifier to a property with SET.
MATCH (a:Job {Jobname:...}),(b:Job {Jobname:...}) WITH a, b,
a.Jobname + '-' + b.Jobname AS pathID
CALL apoc.algo.dijkstra(a,b,'runafter>','Duration') YIELD path, weight
FOREACH (n IN NODES(path)|
SET n.pathID = pathID,
n.pathWeight = weight
)
RETURN path,weight
Since you use apoc, you can set labels:
MATCH (a:Job {Jobname:...}),(b:Job {Jobname:...}) WITH a, b,
'inCalculatedPath' + '-' + a.Jobname + '-' + b.Jobname AS pathID
CALL apoc.algo.dijkstra(a,b,'runafter>','Duration') YIELD path, weight
CALL apoc.create.addLabels( NODES(path), ['inCalculatedPath', pathID]) YIELD node
RETURN DISTINCT path, weight
Another way is to add something like a CalculatedPath node:
MATCH (a:Job {Jobname:...}),(b:Job {Jobname:...}) WITH a, b
CALL apoc.algo.dijkstra(a,b,'runafter>','Duration') YIELD path, weight
CREATE (P:CalculatedPath)
SET P.weight = weight,
P.start = ID(a),
P.end = ID(b),
P.pathNodes = REDUCE(ids=[], n IN NODES(path)| ids + ID(n)),
P.pathRels = REDUCE(ids=[], r IN RELS(path) | ids + ID(r))
FOREACH (n IN NODES(path)|
MERGE (n)-[:inPath]->(P)
)
RETURN path, weight
And get paths back:
MATCH (a:Job {Jobname:...}),(b:Job {Jobname:...}) WITH a, b
MATCH (path:CalculatedPath {start: ID(a), end: ID(b)})
RETURN path, path.weight AS weight

Neo4J order by count relationships extremely slow

I'm trying to model a large knowledge graph. (using v3.1.1).
My actual graph contains only two types of Nodes (Topic, Properties) and a single type of Relationships (HAS_PROPERTIES).
The count of nodes is about 85M (47M :Topic, the rest of nodes are :Properties).
I'm trying to get the most connected node:Topic for this. I'm using the following query:
MATCH (n:Topic)-[r]-()
RETURN n, count(DISTINCT r) AS num
ORDER BY num
This query or almost any query I try to perform (without filtering the results) using the count(relationships) and order by count(relationships) is always extremely slow: these queries take more than 10 minutes and still no response.
Am I missing indexes, or is there a better syntax?
Is there any chance I can execute this query in a reasonable time?
Use this:
MATCH (n:Topic)
RETURN n, size( (n)--() ) AS num
ORDER BY num DESC
LIMIT 100
This reads the degree directly from the node instead of expanding and counting each relationship.

Optimizing Cypher Query Neo4j

I want to write a query in Cypher and run it on Neo4j.
The query is:
Given some start vertexes, walk the edges and find all vertexes that are connected to any of the start vertexes.
(start)-[*]->(v)
for every edge E walked
if startVertex(E).someproperty != endVertex(E).someproperty, output E.
The graph may contain cycles.
For example, in the graph above, vertexes are grouped by "group" property. The query should return 7 rows representing the 7 orange colored edges in the graph.
If I write the algorithm by myself it would be a simple depth / breadth first search, and for every edge visited if the filter condition is true, output this edge. The complexity is O(V+E)
But I can't express this algorithm in Cypher, since it's a very different language.
So I wrote this query:
find all reachable vertexes
(start)-[*]->(v), reachable = start + v.
find all edges starting from any of reachable. if an edge ends with any reachable vertex and passes the filter, output it.
match (reachable)-[]->(n) where n in reachable and reachable.someprop != n.someprop
so the Cypher code looks like this:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
The performance of this query is not as good as I expected. There is an index on :Col(schema).
I am running neo4j 2.3.0 docker image from dockerhub on my windows laptop. Actually it runs on a linux virtual machine on my laptop.
My sample data is a small dataset that contains 0.1M vertexes and 0.5M edges. For some starting nodes it takes 60 or more seconds to complete this query. Any advice for optimizing or rewriting the query? Thanks.
The following code block is the logic I want:
VertexQueue1 = (starting vertexes);
VisitedVertexSet = (empty);
EdgeSet1 = (empty);
While (VertexQueue1 is not empty)
{
Vertex0 = VertexQueue1.pop();
VisitedVertexSet.add(Vertex0);
foreach (Edge0 starting from Vertex0)
{
Vertex1 = endingVertex(Edge0);
if (Vertex1.schema <> Vertex0.schema)
{
EdgeSet1.put(Edge0);
}
if (VisitedVertexSet.notContains(Vertex1)
and VertexQueue1.notContains(Vertex1))
{
VertexQueue1.push(Vertex1);
}
}
}
return EdgeSet1;
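For reference, the logic above can be written as a short runnable Python sketch. It assumes the graph is available as a plain adjacency map and a schema lookup; the function name, both dictionaries, and the toy data are hypothetical stand-ins for the real :Col nodes.

```python
from collections import deque


def cross_schema_edges(start_nodes, out_edges, schema_of):
    """Breadth-first walk that collects every edge whose endpoints differ in schema."""
    queue = deque(start_nodes)
    seen = set(start_nodes)  # nodes that are visited or already queued
    result = []
    while queue:
        current = queue.popleft()
        for nxt in out_edges.get(current, []):
            if schema_of[current] != schema_of[nxt]:
                result.append((current, nxt))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return result


# Hypothetical toy graph with a cycle a -> b -> c -> a, plus c -> d.
out_edges = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
schema_of = {"a": "s1", "b": "s1", "c": "s2", "d": "s2"}
edges = cross_schema_edges(["a"], out_edges, schema_of)
```

Each node leaves the queue once, so each outgoing edge is examined once, which is where the O(V+E) bound comes from. The cycle back to a is harmless because a is already in the seen set.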
EDIT:
The profile result shows that expanding all paths has a high cost. Looking at the row counts, it seems that the Cypher execution engine returns all paths, but I want the distinct edge list only.
LEFT one:
match (start:Col {table:"F_XXY_DSMK_ITRPNL_IDX_STAT_W"})
,(start)-[*0..]->(prev:Col)-->(node:Col)
where prev.schema<>node.schema
return distinct prev,node
RIGHT one:
MATCH (n:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})
WITH n MATCH (n:Col)-[*]->(m:Col)
WITH collect(distinct n) + collect(distinct m) AS c1
UNWIND c1 AS rn
MATCH (rn:Col)-[]->(xn:Col) WHERE rn.schema<>xn.schema and xn in c1
RETURN rn,xn
I think Cypher lets this be much easier than you're expecting it to be, if I'm understanding the query. Try this:
MATCH (start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WHERE start.schema <> node.schema
RETURN start, node
Though I'm not sure why you're comparing the schema property on the nodes. Isn't the schema for the start node fixed by the value that you pass in?
I might not be understanding the query though. If you're looking for more than just the nodes connected to the start node, you could do:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN prev, node
That open-ended variable length relationship specification might be slow, though.
Also note that when Cypher is traversing a particular path, it stops when it finds that it has looped back onto some node (EDIT: relationship, not node) in the path matched so far, so cycles aren't really a problem.
Also, is the DWMDATA value that you're passing in interpolated? If so, you should think about using parameters for security / performance:
http://neo4j.com/docs/stable/cypher-parameters.html
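A minimal Python sketch of the parameterised form, assuming the query text is built in application code: the values travel in a separate map and never become part of the query string. The function name and the "DWM" value are hypothetical. The {name} placeholders are the parameter syntax of the Neo4j 2.x line mentioned in the question; from 3.0 on the same thing is written $name.

```python
def cross_schema_query(schema, table):
    """Return (query_text, parameters) without interpolating values into the text."""
    # This is a plain string, not an f-string, so Python leaves the
    # {schema} and {table} Cypher parameter placeholders untouched.
    query = (
        "MATCH (start:Col {schema: {schema}, table: {table}})-->(node:Col) "
        "WHERE start.schema <> node.schema "
        "RETURN start, node"
    )
    return query, {"schema": schema, "table": table}


# Hypothetical values; the table name is the one from the question.
query, params = cross_schema_query("DWM", "CHK_P_T80_ASSET_ACCT_AMT_DD")
```

Because the text is identical for every schema/table pair, the server can reuse the cached query plan, and a value containing quotes cannot change the meaning of the query.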
EDIT:
Based on your comment I have a couple of thoughts. First, limiting to DISTINCT paths isn't going to help, because every path that it finds is distinct. What you want is the distinct set of pairs, which I think could be achieved by just adding DISTINCT to the query:
MATCH
(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"}),
(start)-[*0..]->(prev:Col)-->(node:Col)
WHERE prev.schema <> node.schema
RETURN DISTINCT prev, node
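To see why the row count can exceed the number of edges, here is a small Python sketch that enumerates paths the way a variable-length match does, never reusing a relationship within one path, and then compares the number of paths with the number of distinct final hops. The helper name and the toy graph are hypothetical; it models the matching rule only, not how Neo4j executes the query.

```python
def paths_without_repeated_edges(start, out_edges):
    """Enumerate every path from start that never reuses an edge."""
    paths = []

    def walk(node, used, path):
        for nxt in out_edges.get(node, []):
            edge = (node, nxt)
            if edge in used:
                continue  # an edge may appear once per path, so cycles terminate
            new_path = path + [edge]
            paths.append(new_path)
            walk(nxt, used | {edge}, new_path)

    walk(start, frozenset(), [])
    return paths


# Hypothetical toy graph: two routes from a to d, then a cycle d -> e -> d.
out_edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": ["d"]}
paths = paths_without_repeated_edges("a", out_edges)
last_edges = {path[-1] for path in paths}
```

On this graph there are 8 paths but only 6 distinct final (prev, node) pairs, because the hops after d are reached once through b and once through c. DISTINCT removes those duplicates from the result, but the paths are still expanded first, which is why the open-ended pattern stays expensive.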
Here is another way to go about it which may or may not be more efficient, but might at least give you an idea for how to go about things differently:
MATCH
path=(start:Col {schema:"${DWMDATA}",table:"CHK_P_T80_ASSET_ACCT_AMT_DD"})-->(node:Col)
WITH rels(path) AS rels
UNWIND rels AS rel
WITH DISTINCT rel
WITH startNode(rel) AS start_node, endNode(rel) AS end_node
WHERE start_node.schema <> end_node.schema
RETURN start_node, end_node
I can't say that this would be faster, but here's another way to try:
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH collect(ID(node)) AS node_ids
MATCH (:Col)-[r]->(node:Col)
WHERE ID(node) IN node_ids
WITH DISTINCT r
RETURN startNode(r) AS start_node, endNode(r) AS end_node
I suspect that the problem in all cases is with the open-ended variable length path. I've actually asked on the Slack group to try to get a better understanding of how it works. In the meantime, for all the queries that you try I would suggest prefixing them with the PROFILE keyword to get a report from Neo4j on what parts of the query are slow.
// this is very inefficient!
MATCH (start:Col)-[*]->(node:Col)
WHERE start.property IN {property_values}
WITH distinct node
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
You might be better off with this:
MATCH (start:Col)
WHERE start.property IN {property_values}
MATCH (node:Col)
WHERE shortestPath((start)-[*]->(node)) IS NOT NULL
MATCH (prev)-[r]->(node)
RETURN distinct prev, node;
