Explore Graph from a source node, based on weighted distance - neo4j

In Neo4j, I have a network of connected nodes, and the connections all have a weight associated with them.
I want to be able to specify a starting node and a max distance (by distance I mean sum of weights on the edges the path goes through), and get in return all the nodes that are reachable within that distance.
I do not want to compute the minimum distance for all the nodes in my graph, so I was wondering if there was an algorithm that can "explore" the graph from a starting node, and stop once it hits a threshold.
I am not necessarily looking for a solution, but I could use some links to relevant documentation.

I'm using the following, which does the trick (bracketed fields are formatted with some input).
CALL apoc.path.spanningTree(n, {{relationshipFilter:'{relationship_filter}', labelFilter:'{label_filter}', minLevel:{min_level}, maxLevel:{max_level}}}) YIELD path
WITH last(nodes(path)) as node, reduce(weight = 0, rel IN relationships(path) | weight+rel.weight) as depth
WHERE depth<{weighted_depth_limit}
WITH depth, collect(node) as nodes_at_depth
ORDER BY depth ASC
RETURN nodes_at_depth, depth
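
For comparison, the "explore and stop at a threshold" idea from the question can be sketched outside Neo4j as Dijkstra's algorithm with a distance cap. This is only an illustration under assumptions: the in-memory adjacency dict `adj`, the function name `reachable_within` and the toy graph are all hypothetical, and edge weights are assumed to be non-negative. It is not what the APOC procedure above does internally.

```python
import heapq

def reachable_within(adj, source, max_distance):
    """Return {node: shortest weighted distance} for nodes within max_distance.

    adj maps a node to a list of (neighbor, weight) pairs (hypothetical layout).
    Weights are assumed non-negative, as Dijkstra's algorithm requires.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a shorter route to u was already found
        for v, w in adj.get(u, []):
            nd = d + w
            # Never expand past the threshold, so far-away nodes are not explored.
            if nd <= max_distance and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Toy graph: d is 6 away from a by either route, so a cap of 4 excludes it.
adj = {"a": [("b", 2), ("c", 5)], "b": [("c", 1), ("d", 4)], "c": [("d", 3)]}
print(reachable_within(adj, "a", 4))
```

Because nodes leave the heap in order of increasing distance, the search only touches nodes inside the radius and their immediate neighbors.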

Related

Is it a problem if mean similarity score is high when building a similarity graph?

I'm building a similarity graph in Neo4j and gds.nodeSimilarity.stats is reporting a mean similarity score in the 0.60 to 0.85 range for the projection I'm using regardless of how I transform the graph. I've tried:
Only projecting relationships with edge weights greater than 1
Deleting the core node to increase the number of components (my graph is about a single topic, with the core node representing that topic)
Changing it to an undirected graph
I realize I can always set the similarityCutoff in gds.nodeSimilarity.write to a higher value, but I'm second-guessing myself since all the toy problems I used for training, including Neo4j's practices, had mean Jaccard scores less than 0.5. Am I overthinking this or is it a sign that something is wrong?
*** EDITED TO ADD ***
This is a graph that has two types of nodes: Posts and entities. The posts reflect various media types, while the entities reflect various authors and proper nouns. In this case, I'm mostly focused on Twitter. Some examples of relationships:
(e1 {Type:TwitterAccount})-[TWEETED]->(p:Post {Type:Tweet})-[AT_MENTIONED]->(e2 {Type:TwitterAccount})
(e1 {Type:TwitterAccount})-[TWEETED]->(p2:Post {Type:Tweet})-[QUOTE_TWEETED]->(p2:Post {Type:Tweet})-[AT_MENTIONED]->(e2 {Type:TwitterAccount})
For my code, I've tried first projecting only AT_MENTIONED relationships:
CALL gds.graph.create('similarity_graph', ["Entity", "Post"], "AT_MENTIONED")
I've tried doing that with a reversed orientation:
CALL gds.graph.create('similarity_graph', ["Entity", "Post"], {AT_MENTIONED:{type:'AT_MENTIONED', orientation:'REVERSE'}})
I've tried creating a monopartite, weighted relationship between all the nodes with a RELATED_TO relationship ...
MATCH (e1:Entity)-[*2..3]->(e2:Entity) WHERE e1.Type = 'TwitterAccount' AND e2.Type = 'TwitterAccount' AND id(e1) < id(e2) WITH e1, e2, count(*) as strength MERGE (e1)-[r:RELATED_TO]->(e2) SET r.strength = strength
...and then projecting that:
CALL gds.graph.create("similarity_graph", "Entity", "RELATED_TO")
Whichever one of the above I try, I then get my Jaccard distribution by running:
CALL gds.nodeSimilarity.stats('similarity_graph') YIELD nodesCompared, similarityDistribution
Part of the reason you are getting a high similarity score is that the default topK value is 10. This means that relationships are created (or considered) only between a node and its 10 most similar neighbors. Try running the following query:
CALL gds.nodeSimilarity.stats('similarity_graph', {topK:1000})
YIELD nodesCompared, similarityDistribution
Now you will probably get a lower mean similarity.
How dense the similarity graph should be depends on your use case. You can try the default values and see how it goes. If the result is still too dense you can raise the similarityCutoff threshold, and if it is too sparse you can raise the topK parameter. There is no silver bullet; it depends on your use case and dataset.
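
To see why a small topK pushes the mean up, here is a toy sketch in Python. It is a simplified model, not the GDS implementation: the function names and the four-node data set are made up for illustration, and it assumes the reported distribution is taken over the pairs that survive the per-node topK filter.

```python
def jaccard(a, b):
    """Jaccard similarity of two neighbor sets."""
    return len(a & b) / len(a | b)

def mean_topk_similarity(neighbors, top_k):
    """Mean Jaccard score over the top_k most similar partners of each node."""
    kept = []
    for u in neighbors:
        scores = sorted(
            (jaccard(neighbors[u], neighbors[v]) for v in neighbors if v != u),
            reverse=True,
        )
        # Keep only the best top_k partners; drop pairs with no overlap at all.
        kept.extend(s for s in scores[:top_k] if s > 0)
    return sum(kept) / len(kept)

# Hypothetical data: each node maps to the set of nodes it points at.
neighbors = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {1, 2, 4}, "d": {1, 5, 6}}
print(mean_topk_similarity(neighbors, 1))  # only each node's best match counts
print(mean_topk_similarity(neighbors, 3))  # weaker matches pull the mean down
```

With a low topK only each node's strongest matches enter the average, so a high mean is the expected outcome of the filter rather than a sign that the graph is wrong.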
Changing the relationship direction will heavily influence the results. In a graph of
(:User)-[:RELATIONSHIP]->(:Item)
the resulting monopartite network will be a network of users. However if you reverse the relationship
(:User)<-[:RELATIONSHIP]-(:Item)
Then the resulting network will be a network of items.
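
A small Python sketch of the same point, assuming (as in the User/Item example) that nodes are compared by the sets of nodes their outgoing relationships point at. The helper name and the sample edges are hypothetical.

```python
def comparable_nodes(edges, reverse=False):
    """Map each node with outgoing edges to the set of nodes it points at.

    Only the keys of the result can be compared with each other, because
    they are the only nodes that have a neighbor set.
    """
    out = {}
    for src, dst in edges:
        if reverse:
            src, dst = dst, src  # flip the relationship direction
        out.setdefault(src, set()).add(dst)
    return out

# (:User)-[:RELATIONSHIP]->(:Item) style edges, made up for illustration.
edges = [("alice", "book"), ("bob", "book"), ("bob", "film")]
print(sorted(comparable_nodes(edges)))                # users get compared
print(sorted(comparable_nodes(edges, reverse=True)))  # items get compared
```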
Finally, having a Jaccard mean of 0.7 when you use topK 10 is actually great, as that means the relationships will be between nodes that really are similar. The Neo4j examples lower the similarity cutoff just so some relationships are created and the similarity graph is not too sparse. You can also raise the topK parameter, but it's hard to say exactly without more information about the size of your graph.

Graph Algorithm : Similar to TSP

I want to solve a problem similar to the TSP (Travelling Salesman Problem).
I have N (0 < N < 20) nodes and I must visit all of them.
The cost between any two nodes is equal.
I can visit a node an unlimited number of times.
I want to find more than one path, and there is no restriction on the cost.
Can you suggest some effective algorithms for this problem?
Here is a solution that works with a weighted graph.
First, the naive solution, enumerating.
It works in O(n!) because there are (n-1)! Hamiltonian paths, and you need O(n) to check each one.
There is a better algorithm, using dynamic programming, in O(n^2 * 2^n): there are O(n * 2^n) states and each one takes O(n) to evaluate.
Define the state as follows: for x a node, and S a set of nodes containing x:
w[S][x] = the weight of the shortest path that starts at node x, goes through all the nodes in the set S, and then finishes at 0.
Note that 0 does not necessarily belong to S.
S = {x} is the base case: w[S][x] = weight(x,0)
Then the recursion formula:
If S is larger than {x}, iterate over the possible next step y:
w[S][x] = min(weight(x,y) + w[S\{x}][y] for all y in S\{x})
This algorithm will output just one optimal path.
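
A minimal Python sketch of the recursion above, under a few assumptions of mine: the set S is encoded as a bitmask, `weight` is a full n-by-n matrix, and the function name and sample matrix are made up for illustration. It returns only the optimal cost, not the path itself.

```python
from functools import lru_cache

def shortest_tour(weight):
    """Cost of the cheapest tour that starts at node 0, visits every node, and ends at 0."""
    n = len(weight)

    @lru_cache(maxsize=None)
    def w(S, x):
        # S is a bitmask of the nodes still to be covered; x is always a member of S.
        rest = S & ~(1 << x)           # S \ {x}
        if rest == 0:                  # base case S = {x}: go straight back to 0
            return weight[x][0]
        return min(weight[x][y] + w(rest, y)
                   for y in range(n) if rest >> y & 1)

    # Start at 0 with every node in S (0 may or may not be in S, here it is).
    return w((1 << n) - 1, 0)

# Small symmetric example; the best tour is 0-1-2-3-0.
weight = [
    [0, 1, 4, 3],
    [1, 0, 2, 5],
    [4, 2, 0, 1],
    [3, 5, 1, 0],
]
print(shortest_tour(weight))
```

To recover the path as well, one could remember which y achieved the minimum for each (S, x) and follow those choices from the starting state.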

Feedback on algorithm for Steiner Tree with restrictions

For an assignment, I have to create a Steiner Tree. However, this is not a typical Steiner Tree, as the graph structure we're required to use does not allow insertion of new vertices. Rather, the test cases define a graph structure of N vertices and M edges while specifically marking X vertices as target nodes. These are the nodes we have to span while using some, none or all of the unmarked vertices in the graph.
My solution to this problem is:
Implement Dijkstra's Algorithm to find the shortest path between all the target vertices
For each of the shortest paths 1:n
    Extract all current selected path vertices into a set
    Extract all remaining vertices into a set
    For all vertices of the current selected path 1:m
        Execute Dijkstra to find shortest path between current vertex and other path's vertices
        If this creates a spanning tree, save path and length in priority queue sorted by length value
Pop top of priority queue and return path
My issue is that this is an exhaustive search that uses the initial application of Dijkstra to create a reduced set of possible start-end vertices for a shorter path than a minimum spanning tree.
Is there a heuristic or other algorithm that may solve this problem?
With some help, I worked out this answer for a similar problem that I had. Rather than adding new vertices as in a spatial Steiner tree problem, the new Steiner points in this graph are the vertices that lie along the path between the marked nodes. For a graph with N vertices, M edges, X required vertices, and S found vertices (vertices along our path):
Compute All Pairs Shortest Paths (Floyd-Warshall, Johnson's, whatever)
for k in X
    remove k from X, insert k into S
    for v in (X + S) - Both sets
        find the shortest distance from k to v - path P
    for u in P (all vertices on the path)
        insert u into S
        if u exists in X, remove u from X
Now for the wall of text as to what this algorithm does. We pick a vertex k in X, and then find the nearest other vertex in the target set X or in the result set S, and call it v. Then we follow the path of nodes from k to v, inserting them into our result set. Finally, double check and make sure that any vertices in X that were on the path (shouldn't happen) are removed from X.
Any new vertex that you want to add, c, will have a minimum distance to some node already in your result set S. Since the nodes already in S are the minimum distance apart, it follows that c will be the minimum distance from any point in S to c. For example, if you have three nodes, A, B, and C, if A and B are already found to be a minimum distance apart, adding C fulfills the requirement that it is the minimum distance from B, and the minimum distance path from A to C goes through B.
I did some research on the discrete Steiner Tree problem (which is what this is), and this is the best brute force solution that I found. The main problem is going to be the O(n^3) time it takes to do all pairs shortest paths, but then the construction of the minimum tree should be straightforward and quick, since you just need to look up distance information. The implementation I wound up working with is outlined nicely on Wikipedia.
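
Here is a rough Python sketch of this kind of heuristic. It is a closely related variant, not a transcription of the steps above: every remaining target is joined to the nearest vertex already in S, so the result stays connected. That change, the function names, the edge-list format and the assumption of a connected undirected graph with integer vertex ids are all mine. Like the original, it is a heuristic and is not guaranteed to find the minimum tree.

```python
import math

def floyd_warshall(n, edges):
    """All-pairs shortest distances plus a next-hop table for path recovery."""
    dist = [[math.inf] * n for _ in range(n)]
    nxt = [[None] * n for _ in range(n)]
    for i in range(n):
        dist[i][i], nxt[i][i] = 0, i
    for u, v, w in edges:              # undirected edges (u, v, weight)
        if w < dist[u][v]:
            dist[u][v] = dist[v][u] = w
            nxt[u][v], nxt[v][u] = v, u
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt

def shortest_path(nxt, u, v):
    """Vertices on a shortest path from u to v (assumes v is reachable)."""
    path = [u]
    while u != v:
        u = nxt[u][v]
        path.append(u)
    return path

def steiner_heuristic(n, edges, targets):
    dist, nxt = floyd_warshall(n, edges)
    remaining = set(targets)           # X: targets not yet connected
    first = min(remaining)
    remaining.discard(first)
    found = {first}                    # S: vertices in the tree so far
    tree_edges = set()
    while remaining:
        # Closest pair (remaining target k, tree vertex v).
        k, v = min(((k, v) for k in remaining for v in found),
                   key=lambda kv: dist[kv[0]][kv[1]])
        path = shortest_path(nxt, k, v)
        tree_edges.update((min(a, b), max(a, b)) for a, b in zip(path, path[1:]))
        found.update(path)
        remaining -= found             # targets picked up along the way
    return found, tree_edges

# Star with centre 0; the direct leaf-to-leaf edges are more expensive.
edges = [(0, 1, 1), (0, 2, 1), (0, 3, 1), (1, 2, 3), (2, 3, 3), (1, 3, 3)]
print(steiner_heuristic(4, edges, {1, 2, 3}))
```

In the example the unmarked centre vertex 0 ends up in the tree as a Steiner point because every cheapest path between targets passes through it.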

Decide Whether All Shortest Paths From s to t Contain The Edge e

Let G = (V, E) be a directed graph whose edges all have non-negative weights. Let s, t be two vertices in V, and let e be an edge in E.
Describe an algorithm that decides whether all shortest paths from s to t contain the edge e.
Well, this is how you can achieve Dijkstra's time complexity:
Simply run Dijkstra from s and calculate delta(s,t) (the weight of the shortest path from s to t).
Remove the edge e, and run Dijkstra again from s in the new graph.
If delta(s,t) in the new graph has increased, it means that all shortest paths from s to t contain the edge e; otherwise, at least one shortest path avoids e.
I was wondering whether there is a more efficient algorithm for solving this problem. Do you think that it's possible to beat Dijkstra's time complexity?
Thanks in advance
Your approach sounds correct to me. You just calculate the shortest path with and without the possibility of taking edge e. That gives you 2 Dijkstra searches.
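
The two-search approach can be sketched in Python as follows. The adjacency-dict layout and function names are hypothetical, the edge e is identified by its (tail, head) pair (so parallel edges are assumed absent), and returning False when t is unreachable is just a convention picked for this sketch.

```python
import heapq
import math

def dijkstra(adj, source, skip_edge=None):
    """Shortest distances from source; adj maps u -> list of (v, weight).

    skip_edge, if given, is a (u, v) pair that is treated as removed.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            if (u, v) == skip_edge:
                continue
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def edge_on_all_shortest_paths(adj, s, t, e):
    before = dijkstra(adj, s).get(t, math.inf)
    if before == math.inf:
        return False  # convention chosen here: there is no s-t path at all
    after = dijkstra(adj, s, skip_edge=e).get(t, math.inf)
    # If removing e makes t farther away (or unreachable), every shortest path used e.
    return after > before

# 0 -> 1 -> 3 costs 2 and 0 -> 2 -> 3 costs 3, so the only shortest path uses (1, 3).
adj = {0: [(1, 1), (2, 1)], 1: [(3, 1)], 2: [(3, 2)]}
print(edge_on_all_shortest_paths(adj, 0, 3, (1, 3)))
print(edge_on_all_shortest_paths(adj, 0, 3, (0, 2)))
```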
There is room for improvement if you go to A*, bidirectional search or recover your Dijkstra search tree:
A* would speed up your Dijkstra query, but it might not be possible for your graph, as you need to be able to define a good lower bound on the remaining distance.
Bidirectional search could be done with both searches meeting around the edge. You can then examine all paths with and without the edge with only one fast bidirectional query plus some extra work for both cases, instead of having two full Dijkstra runs that are very similar.
You could search once without the edge and keep your search tree. Then you add e and update the shortest path tree starting from the start point of e. If the label of the end point > the label of the start point + the length of e, the end point can be reached faster when using e. Recursively search the neighbours of the end point and only update the distances if they can be reached faster than before. That should save you some work.

Synonym chains - Efficient routing algorithm for iOS/sqlite

A synonym chain is a series of closely related words that span two anchors. For example, the English words "black" and "white" can be connected as:
black-dark-obscure-hidden-concealed-snug-comfortable-easy-simple-pure-white
Or, here's "true" and "false":
true-just=fair=beautiful=pretty-artful-artificial-sham-false
I'm working on a thesaurus iOS app, and I would like to display synonym chains also. The goal is to return a chain from within a weighted graph of word relations. My source is a very large thesaurus with weighted data, where the weights measure similarity between words. (e.g., "outlaw" is closely related to "bandit", but more distantly related to "rogue.") Our actual values range from 0.001 to ~50, but you can assume any weight range.
What optimization strategies do you recommend to make this realistic, e.g., within 5 seconds of processing on a typical iOS device? Assume the thesaurus has half a million terms, each with 20 associations. I'm sure there's a ton of prior research on these kinds of problems, and I'd appreciate pointers on what might be applied to this.
My current algorithm involves recursively descending a few levels from the start and end words, and then looking for intercepting words, but that becomes too slow with thousands of sqlite (or Realm) selects.
Since you said your source is a large thesaurus with weighted data, I'm assuming if you pick any word, you will have the weight to its successor in the similarity graph. I will always use the sequence below, when I'm giving any example:
black-dark-obscure-hidden-concealed-snug-comfortable-easy-simple-pure-white
Let's think of each word as a node in a graph; each similarity relationship a word has with another is an edge in that graph. Each edge is weighted with a cost, which is the weight you have in the source file. So the best solution to find a path from one word to another is to use A* (A star) pathfinding.
I'm taking the minimum "cost" to travel from a word to its successor to be 1. You can adjust it accordingly. First you will need a good heuristic function to use, since this is a greedy algorithm. This heuristic function will return the "greedy" distance between two words, any words. You must respect the fact that the "distance" it returns can never be bigger than the real distance between the two words. Since I don't know any relationship between any words for a thesaurus, my heuristic function will always return the minimum cost 1. In other words, it will always say a word is the most similar word to any other. For example, my heuristic function tells me that 'black' is the best synonym for 'white'.
You must tune the heuristic function if you can, so it responds with more accurate distances and makes the algorithm run faster. That's the tricky part, I guess.
You can see the pseudo-code for the algorithm in the Wikipedia article on A*. But here it is for a faster explanation:
function A*(start,goal)
    closedset := the empty set -- The set of nodes already evaluated.
    openset := {start} -- The set of tentative nodes to be evaluated, initially containing the start node
    came_from := the empty map -- The map of navigated nodes.
    g_score[start] := 0 -- Cost from start along best known path.
    -- Estimated total cost from start to goal through y.
    f_score[start] := g_score[start] + heuristic_cost_estimate(start, goal)
    while openset is not empty
        current := the node in openset having the lowest f_score[] value
        if current = goal
            return reconstruct_path(came_from, goal)
        remove current from openset
        add current to closedset
        for each neighbor in neighbor_nodes(current)
            if neighbor in closedset
                continue
            tentative_g_score := g_score[current] + dist_between(current,neighbor)
            if neighbor not in openset or tentative_g_score < g_score[neighbor]
                came_from[neighbor] := current
                g_score[neighbor] := tentative_g_score
                f_score[neighbor] := g_score[neighbor] + heuristic_cost_estimate(neighbor, goal)
                if neighbor not in openset
                    add neighbor to openset
    return failure
function reconstruct_path(came_from,current)
    total_path := [current]
    while current in came_from:
        current := came_from[current]
        total_path.append(current)
    return total_path
Now, for the algorithm you'll have two collections of nodes: the ones you are going to visit (open) and the ones you have already visited (closed). You will also have two maps of distances, with one entry per node, that you fill in as you travel through the graph.
One map (g_score) tells you the lowest distance actually traveled so far between the starting node and the specified node. For example, g_score["hidden"] returns the lowest weighted cost found to travel from 'black' to 'hidden'.
The other map (f_score) tells you the estimated total cost of a path from the start to the goal through the specified node: its g_score plus the heuristic estimate of the remaining distance. For example, f_score["snug"] is g_score["snug"] plus the estimated weighted cost to travel from "snug" to "white" according to the heuristic function. Remember, that estimate must always be less than or equal to the real cost to travel between the two words, since our heuristic function needs to respect the aforementioned rule.
As the algorithm runs, you travel from node to node, beginning at the starting word, saving all the nodes you visited and the costs you "used" to get there. You replace the recorded path to a node whenever you find a better cost for it in the g_score map. You use the f_score to predict which node from the set of unvisited nodes is best to visit next. It's best to keep your f_score values in a min-heap.
The algorithm ends when you reach the node that is the goal you want. Then you reconstruct the minimum path using the map of predecessors (came_from) that you kept updating at each iteration. The algorithm also stops if it has visited every reachable node without finding the goal. When this happens, you can say there is no path from the starting node to the goal.
This algorithm is widely used in games to find the best path between two objects in a 3D world. To improve it, you just need to create a better heuristic function that lets the algorithm pick the most promising nodes first, leading it to the goal faster.
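
For reference, here is a compact Python sketch of the same search, under stated assumptions. The `neighbors` callback stands in for whatever lookup (sqlite, Realm, an in-memory dict) returns a word's related words with their costs; that callback, the toy graph and the heuristic are made up for illustration. The heuristic mirrors the one described above (1 for any word other than the goal), which is only admissible if every edge cost is at least 1. Instead of a closed set, this version skips stale heap entries, which is a design choice of the sketch and not part of the pseudocode.

```python
import heapq

def min_edge_heuristic(word, goal):
    """Never overestimates as long as every edge costs at least 1 (assumption)."""
    return 0 if word == goal else 1

def a_star(neighbors, start, goal, heuristic=min_edge_heuristic):
    """Return the cheapest chain of words from start to goal, or None."""
    g_score = {start: 0}
    came_from = {}
    # Heap entries are (f_score, g_score when pushed, word).
    heap = [(heuristic(start, goal), 0, start)]
    while heap:
        _, g, current = heapq.heappop(heap)
        if g > g_score[current]:
            continue  # stale entry, a cheaper route to this word was found later
        if current == goal:
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]  # reverse so the chain reads start -> goal
        for nxt, cost in neighbors(current):
            tentative = g + cost
            if tentative < g_score.get(nxt, float("inf")):
                g_score[nxt] = tentative
                came_from[nxt] = current
                heapq.heappush(heap, (tentative + heuristic(nxt, goal), tentative, nxt))
    return None

# Hypothetical word graph with costs (lower means more closely related).
graph = {
    "black": [("dark", 1), ("dim", 5)],
    "dark": [("obscure", 1)],
    "dim": [("white", 1)],
    "obscure": [("white", 1)],
}
chain = a_star(lambda w: graph.get(w, []), "black", "white")
print("-".join(chain))
```

If the thesaurus weights measure similarity (higher meaning closer), they would first need to be converted into costs, for example by inverting them, before a shortest-path search like this makes sense.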
Here's a closely related question and answer: Algorithm to find multiple short paths
There you can see comments about Dijkstra's and A-star, Dinic's, but more broadly also the idea of maximum flow and minimum cost flow.

Resources