How to implement weighted PageRank algorithm in Neo4j?

Is there an argument to the built-in PageRank algorithm, or is a separate algorithm available, for applying PageRank to a weighted Neo4j graph? I found the algorithm here but don't know how to run it interactively on Neo4j Desktop.

There is a Neo4j Graph Algorithms library that contains a PageRank procedure. The procedure signature is as follows:
CALL algo.pageRank(label:String, relationship:String,
    {iterations: 5, dampingFactor: 0.85, write: true, writeProperty: 'pagerank', concurrency: 4})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty
- calculates PageRank and potentially writes back
You can run the algorithm with a query like this:
CALL algo.pageRank.stream('Page', 'LINKS', {iterations: 20, dampingFactor: 0.85})
YIELD node, score
RETURN node, score
ORDER BY score DESC
LIMIT 20
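For the weighted case specifically: the Graph Algorithms library has since been superseded by the Graph Data Science (GDS) library, where PageRank honours relationship weights through the relationshipWeightProperty option. A minimal sketch, assuming Page nodes connected by LINKS relationships that carry a numeric weight property:
CALL gds.graph.project('pagesWeighted', 'Page', 'LINKS',
  {relationshipProperties: 'weight'});

CALL gds.pageRank.stream('pagesWeighted', {
  maxIterations: 20,
  dampingFactor: 0.85,
  relationshipWeightProperty: 'weight'  // this is what makes the computation weighted
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId) AS node, score
ORDER BY score DESC
LIMIT 20;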

Related

Network graph clustering

I have a large-scale network consisting of 62,578 nodes. It represents the network topology of an SDN network (Software Defined Networking).
I want to partition the graph into a number of clusters; each cluster should be controlled by an SDN controller.
I tried the k-means algorithm, but it doesn't take the relationships between the nodes into account; it relies only on the nodes and their properties.
Then I tried the similarity algorithm, but it calculates a similarity score between two nodes and creates a new relationship holding that value between them. As a result, I couldn't feed its output into the k-means algorithm.
Louvain and Leiden don't allow specifying the number of clusters in advance. Is there any way to do that with these algorithms?
Any suggestions or ideas would be a great help. Many thanks.
Update: [image omitted: it showed part of the whole graph]
Update 2:
CALL gds.graph.project(
  'myGraph',
  'node',
  'to',
  { relationshipProperties: 'cost' }
)

CALL gds.fastRP.write(
  'myGraph',
  {
    embeddingDimension: 1,
    writeProperty: 'fastrp-embedding'
  }
)
YIELD nodePropertiesWritten

CALL gds.graph.project(
  'myGraph_1',
  {
    node: {
      properties: 'fastrp-embedding'
    }
  },
  '*'
)

CALL gds.alpha.kmeans.write(
  'myGraph_1',
  {
    nodeProperty: 'fastrp-embedding',
    k: 3,
    randomSeed: 42,
    writeProperty: 'kmeans'
  }
)
YIELD nodePropertiesWritten
Update 3:
I applied FastRP to create node embeddings for a graph consisting of 6,301 nodes. These embeddings are stored as node properties to be used for clustering with the k-means algorithm. I noticed that nodes that are near each other are nevertheless assigned to different clusters.
Notes:
For FastRP, I set the embedding dimension to 256.
For k-means, I set K to 2.
I tried smaller embedding dimensions (2, 4, etc.); the same results occurred.
In addition, I tried another graph with 8,846 nodes. Similarly confusing results occurred.
I didn't specify a random seed for FastRP, as I didn't know how to choose a good value for this parameter. Is it related to the graph size, like the embedding dimension?
For the sub-graph below, the following are the clustering results: [images omitted]
If you want to control the number of clusters, you can try the following.
First, create node embeddings using any of the available models, such as FastRP or node2vec. Next, use the k-means algorithm to cluster nodes based on the embeddings.
Use the mutate mode of FastRP and store the results under, say, an embedding property. Then run the k-means algorithm on the embeddings, as in the example:
CALL gds.alpha.kmeans.write('cities', {
  nodeProperty: 'embedding',
  k: 3,
  writeProperty: 'cluster'
})
where the k parameter defines the number of clusters and the nodeProperty should point to the embedding property.
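A minimal end-to-end sketch of that flow, assuming a projected in-memory graph named 'cities' (the graph name, dimension, and property names are illustrative):
CALL gds.fastRP.mutate('cities', {
  embeddingDimension: 128,     // illustrative; tune for your graph
  mutateProperty: 'embedding'  // stored on the in-memory graph only
})
YIELD nodePropertiesWritten;

CALL gds.alpha.kmeans.write('cities', {
  nodeProperty: 'embedding',   // reads the mutated embeddings
  k: 3,                        // the desired number of clusters
  writeProperty: 'cluster'     // cluster id written back to the database
})
YIELD nodePropertiesWritten;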

Is it a problem if mean similarity score is high when building a similarity graph?

I'm building a similarity graph in Neo4j and gds.nodeSimilarity.stats is reporting a mean similarity score in the 0.60 to 0.85 range for the projection I'm using regardless of how I transform the graph. I've tried:
Only projecting relationships with edge weights greater than 1
Deleting the core node to increase the number of components (my graph is about a single topic, with the core node representing that topic)
Changing it to an undirected graph
I realize I can always set the similarityCutoff in gds.nodeSimilarity.write to a higher value, but I'm second-guessing myself since all the toy problems I used for training, including Neo4j's practice exercises, had mean Jaccard scores below 0.5. Am I overthinking this, or is it a sign that something is wrong?
*** EDITED TO ADD ***
This is a graph that has two types of nodes: Posts and entities. The posts reflect various media types, while the entities reflect various authors and proper nouns. In this case, I'm mostly focused on Twitter. Some examples of relationships:
(e1 {Type: 'TwitterAccount'})-[:TWEETED]->(p:Post {Type: 'Tweet'})-[:AT_MENTIONED]->(e2 {Type: 'TwitterAccount'})

(e1 {Type: 'TwitterAccount'})-[:TWEETED]->(p1:Post {Type: 'Tweet'})-[:QUOTE_TWEETED]->(p2:Post {Type: 'Tweet'})-[:AT_MENTIONED]->(e2 {Type: 'TwitterAccount'})
For my code, I've tried first projecting only AT_MENTIONED relationships:
CALL gds.graph.create('similarity_graph', ["Entity", "Post"], "AT_MENTIONED")
I've tried doing that with a reversed orientation:
CALL gds.graph.create('similarity_graph', ["Entity", "Post"], {AT_MENTIONED:{type:'AT_MENTIONED', orientation:'REVERSE'}})
I've tried creating a monopartite, weighted relationship between all the nodes with a RELATED_TO relationship ...
MATCH (e1:Entity)-[*2..3]->(e2:Entity)
WHERE e1.Type = 'TwitterAccount' AND e2.Type = 'TwitterAccount' AND id(e1) < id(e2)
WITH e1, e2, count(*) AS strength
MERGE (e1)-[r:RELATED_TO]->(e2)
SET r.strength = strength
...and then projecting that:
CALL gds.graph.create("similarity_graph", "Entity", "RELATED_TO")
Whichever one of the above I try, I then get my Jaccard distribution by running:
CALL gds.nodeSimilarity.stats('similarity_graph') YIELD nodesCompared, similarityDistribution
Part of why you are getting a high mean similarity score is that the default topK value is 10. This means that relationships are created / considered only for the top 10 most similar neighbors of each node. Try running the following query:
CALL gds.nodeSimilarity.stats('similarity_graph', {topK:1000})
YIELD nodesCompared, similarityDistribution
Now you will probably get a lower mean similarity.
How dense the similarity graph should be depends on your use case. You can try the default values and see how it goes. If the graph is still too dense, raise the similarityCutoff threshold; if it is too sparse, raise the topK parameter. There is no silver bullet; it depends on your use case and dataset.
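For example, a sketch of tuning those two knobs when writing the similarity relationships back (the threshold here is illustrative, not a recommendation):
CALL gds.nodeSimilarity.write('similarity_graph', {
  topK: 10,                           // keep at most 10 neighbors per node
  similarityCutoff: 0.5,              // illustrative threshold; tune for your data
  writeRelationshipType: 'SIMILAR',
  writeProperty: 'score'
})
YIELD nodesCompared, relationshipsWritten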
Changing the relationship direction will heavily influence the results. In a graph of
(:User)-[:RELATIONSHIP]->(:Item)
the resulting monopartite network will be a network of users. However, if you reverse the relationship:
(:User)<-[:RELATIONSHIP]-(:Item)
then the resulting network will be a network of items.
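A minimal projection sketch of the two cases, using the same GDS 1.x syntax as the calls above (graph names are illustrative):
// similarity will be computed over :User nodes (natural direction)
CALL gds.graph.create('by_user', ['User', 'Item'],
  {RELATIONSHIP: {type: 'RELATIONSHIP', orientation: 'NATURAL'}});

// similarity will be computed over :Item nodes (reversed direction)
CALL gds.graph.create('by_item', ['User', 'Item'],
  {RELATIONSHIP: {type: 'RELATIONSHIP', orientation: 'REVERSE'}});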
Finally, having a Jaccard mean of 0.7 when you use topK 10 is actually great, as it means the created relationships connect genuinely similar nodes. The Neo4j examples lower the similarity cutoff just so that some relationships are created and the similarity graph is not too sparse. You could also raise the topK parameter; it's hard to say exactly without more information about the size of your graph.

How to create vocabulary graph with word vectors using neo4j?

I want to create a vocabulary graph with word vectors. The aim is to query for the nearest word in the vocabulary graph based on word similarity. How can we achieve this in Neo4j?
The following is an example:
Suppose vocabulary consists of the following:
Product Quality
Wrong Product
Product Price
Product Replacement
And query word is: Affordable Product
In a single query I should be able to figure out that "Affordable Product" is more closely related to "Product Price" than to any of the others.
Please note that I am storing the word embeddings in the graph, so checking cosine similarity against each word in the vocabulary one by one would achieve this. However, when the vocabulary becomes large, querying one by one hurts speed and performance.
If there is a way to store the word embeddings for a domain vocabulary as a graph that can be queried for the nearest node based on cosine similarity, that would be a possible solution. However, I have not been able to find anything like this so far.
Looking forward to any pointers. Thanks.
What you want to do is store your embedding results in the graph. The next step is to use the Neo4j Graph Data Science library, specifically to run the cosine similarity algorithm. It should look something along the lines of:
MATCH (p:Word)
WITH {item: id(p), weights: p.embedding} AS wordData
WITH collect(wordData) AS data
CALL gds.alpha.similarity.cosine.write({
  nodeProjection: '*',
  relationshipProjection: '*',
  data: data,
  // here is where you define how many nearest neighbours should be stored
  topK: 1,
  // here you define the minimal similarity between a given
  // pair of nodes for it to still be relevant
  similarityCutoff: 0.1
})
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95
You have now preprocessed your nearest neighbors and can easily query them like:
MATCH (w:Word)-[:SIMILAR]-(other)
RETURN other
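For the original example, a hedged sketch of inspecting one word's precomputed neighborhood, assuming the write above created SIMILAR relationships with a score property and that Word nodes have a name property (both names are assumptions):
MATCH (w:Word {name: 'Affordable Product'})-[s:SIMILAR]-(other:Word)
RETURN other.name AS word, s.score AS similarity
ORDER BY s.score DESC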
Hope this helps, let me know if you have any other questions.
After trying out and reading about several options, I found that https://github.com/facebookresearch/faiss is the best option for this use case.

Explore Graph from a source node, based on weighted distance

In Neo4j, I have a network of connected nodes, and the connections all have a weight associated with them.
I want to be able to specify a starting node and a max distance (by distance I mean sum of weights on the edges the path goes through), and get in return all the nodes that are reachable within that distance.
I do not want to compute the minimum distance for all the nodes in my graph, so I was wondering if there was an algorithm that can "explore" the graph from a starting node, and stop once it hits a threshold.
I am not necessarily looking for a solution, but I could use some links to relevant documentation.
I'm using the following, which does the trick (the bracketed fields are filled in from template inputs).
CALL apoc.path.spanningTree(n, {{
  relationshipFilter: '{relationship_filter}',
  labelFilter: '{label_filter}',
  minLevel: {min_level},
  maxLevel: {max_level}
}}) YIELD path
WITH last(nodes(path)) AS node,
     reduce(weight = 0, rel IN relationships(path) | weight + rel.weight) AS depth
WHERE depth < {weighted_depth_limit}
WITH depth, collect(node) AS nodes_at_depth
ORDER BY depth ASC
RETURN nodes_at_depth, depth
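Note that the spanning-tree traversal sums weights along the tree path, which is not necessarily the minimum weighted distance to each node. If exact weighted distances matter, single-source Dijkstra from the GDS library is an alternative, although it computes distances to all reachable nodes rather than stopping at the threshold. A sketch, where the graph name 'g', the weight property, and the parameters are illustrative:
MATCH (start) WHERE id(start) = $startId   // hypothetical starting-node parameter
CALL gds.allShortestPaths.dijkstra.stream('g', {
  sourceNode: start,
  relationshipWeightProperty: 'weight'
})
YIELD targetNode, totalCost
WHERE totalCost < $weightedDepthLimit      // hypothetical distance threshold
RETURN gds.util.asNode(targetNode) AS node, totalCost
ORDER BY totalCost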

Synonym chains - Efficient routing algorithm for iOS/sqlite

A synonym chain is a series of closely related words that span two anchors. For example, the English words "black" and "white" can be connected as:
black-dark-obscure-hidden-concealed-snug-comfortable-easy-simple-pure-white
Or, here's "true" and "false":
true-just=fair=beautiful=pretty-artful-artificial-sham-false
I'm working on a thesaurus iOS app, and I would like to display synonym chains also. The goal is to return a chain from within a weighted graph of word relations. My source is a very large thesaurus with weighted data, where the weights measure similarity between words. (e.g., "outlaw" is closely related to "bandit", but more distantly related to "rogue.") Our actual values range from 0.001 to ~50, but you can assume any weight range.
What optimization strategies do you recommend to make this realistic, e.g., within 5 seconds of processing on a typical iOS device? Assume the thesaurus has half a million terms, each with 20 associations. I'm sure there's a ton of prior research on these kinds of problems, and I'd appreciate pointers on what might be applied to this.
My current algorithm involves recursively descending a few levels from the start and end words, and then looking for intercepting words, but that becomes too slow with thousands of sqlite (or Realm) selects.
Since you said your source is a large thesaurus with weighted data, I'm assuming that if you pick any word, you will have the weight to its successors in the similarity graph. I will use the sequence below in all of the examples:
black-dark-obscure-hidden-concealed-snug-comfortable-easy-simple-pure-white
Let's think of the words as nodes in a graph; each similarity relationship a word has with another is an edge in that graph. Each edge is weighted with a cost, which is the weight you have in the source file. So a good solution for finding a path from one word to another is A* (A-star) pathfinding.
I'm taking the minimum cost to travel from a word to its successor to be 1; you can adjust it accordingly. First you will need a good heuristic function, since this is a greedy algorithm. The heuristic function returns the "greedy" distance between any two words, and it must respect one rule: the "distance" it returns can never be bigger than the real distance between the two words. Since I don't know of any a-priori relationship between arbitrary words in a thesaurus, my heuristic function will always return the minimum cost, 1. In other words, it always claims a word is the most similar word to any other; for example, my heuristic function tells me that 'black' is the best synonym for 'white'.
You should tune the heuristic function if you can, so that it returns more accurate distances and makes the algorithm run faster. That's the tricky part, I guess.
You can see the pseudo-code for the algorithm in the Wikipedia article on A*, but here it is for a quicker explanation:
function A*(start, goal)
    closedset := the empty set    -- The set of nodes already evaluated.
    openset := {start}            -- The set of tentative nodes to be evaluated, initially containing the start node.
    came_from := the empty map    -- The map of navigated nodes.
    g_score[start] := 0           -- Cost from start along best known path.
    -- Estimated total cost from start to goal through y.
    f_score[start] := g_score[start] + heuristic_cost_estimate(start, goal)

    while openset is not empty
        current := the node in openset having the lowest f_score[] value
        if current = goal
            return reconstruct_path(came_from, goal)
        remove current from openset
        add current to closedset
        for each neighbor in neighbor_nodes(current)
            if neighbor in closedset
                continue
            tentative_g_score := g_score[current] + dist_between(current, neighbor)
            if neighbor not in openset or tentative_g_score < g_score[neighbor]
                came_from[neighbor] := current
                g_score[neighbor] := tentative_g_score
                f_score[neighbor] := g_score[neighbor] + heuristic_cost_estimate(neighbor, goal)
                if neighbor not in openset
                    add neighbor to openset
    return failure

function reconstruct_path(came_from, current)
    total_path := [current]
    while current in came_from:
        current := came_from[current]
        total_path.append(current)
    return total_path
Now, for the algorithm, you'll have two collections of nodes: the ones still to visit (openset) and the ones already visited (closedset). You will also keep two score maps, which you fill in as you travel through the graph.
One map (g_score) tells you the lowest travel cost actually found between the starting node and a given node. For example, g_score["hidden"] returns the lowest weighted cost found so far to travel from 'black' to 'hidden'.
The other map (f_score) holds the estimated total cost of a path from the start to the goal that passes through the given node: its g_score plus the heuristic estimate from that node to the goal. For example, f_score["snug"] is the cost already paid to reach "snug" plus the heuristic estimate from "snug" to "white". Remember, the heuristic part is always less than or equal to the real remaining cost, since our heuristic function must respect the aforementioned rule.
As the algorithm runs, you travel from node to node starting at the start word, recording the nodes you traveled through and the costs you paid. You update the g_score map whenever you find a cheaper way to reach a node, and you use f_score to decide which of the open nodes to visit next. It's best to keep the open set in a min-heap keyed by f_score.
The algorithm ends when it reaches the goal node; you then reconstruct the minimum path from the came_from map you kept updating at each iteration. The other way it stops is when it has visited every reachable node without finding the goal, in which case there is no path from the start node to the goal.
This algorithm is the one most used in games to find the best path between two objects in a 3D world. To improve it, you just need a better heuristic function, one that lets the algorithm pick the most promising nodes to visit first, leading it to the goal faster.
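Since this page is otherwise Neo4j-centric, it's worth noting that if the thesaurus were loaded into Neo4j, a weighted source-to-target shortest path would produce such a chain directly. The sketch below uses Dijkstra rather than A* (GDS does ship an A* variant, but its built-in heuristic is geographic), and assumes hypothetical Word nodes with a name property plus a projected graph 'thesaurus' whose relationships carry a weight property. If higher weights mean more similar, first convert them into costs (for example an inverse or negative-log transform), since shortest-path algorithms minimize the summed weights:
MATCH (s:Word {name: 'black'}), (t:Word {name: 'white'})
CALL gds.shortestPath.dijkstra.stream('thesaurus', {
  sourceNode: s,
  targetNode: t,
  relationshipWeightProperty: 'weight'  // assumed precomputed cost, not raw similarity
})
YIELD totalCost, nodeIds
RETURN totalCost,
       [nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS chain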
Here's a closely related question and answer: Algorithm to find multiple short paths
There you can see comments about Dijkstra's, A*, and Dinic's algorithms, and more broadly the ideas of maximum flow and minimum-cost flow.
