What is the best way to find all possible paths between two nodes in a large-scale network? - scalability

I wonder what the best way is to find all possible paths from a source to a destination in a very large network (stored as a network matrix), e.g. 5000 nodes. I have used this function, which is implemented using stacks, but its limit seems to be about 60 nodes, and it can't retrieve the paths for a 200-node network. DFS (depth-first search) could be another option, but this algorithm also uses a stack, so I am worried about its scalability. Is there an efficient way to find all paths between two given nodes in such a large network?

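Depth-first is the only way to make it scalable at the level you specify, at least until quantum computing gives us infinite processing power. The number of paths if you have 100% adjacency among all nodes grows factorially: there are roughly e * (n-2)! simple paths between two nodes of a fully connected n-node network. At around 60 nodes that already exceeds the number of atoms in the observable universe (about 10^80, or roughly 2^266), so no algorithm can list them all, whatever it does with its stack.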
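Here is a minimal Python sketch of the depth-first enumeration, assuming the network is available as an adjacency dict. The name all_simple_paths, the max_len cutoff and the toy graph are illustrative and not taken from the function mentioned in the question. Paths are yielded lazily, so memory stays proportional to the current path; the running time is still at least the number of paths, so on dense networks a length cutoff is the only practical lever.

```python
from typing import Dict, Hashable, Iterator, List, Optional


def all_simple_paths(
    graph: Dict[Hashable, List[Hashable]],
    source: Hashable,
    target: Hashable,
    max_len: Optional[int] = None,
) -> Iterator[List[Hashable]]:
    """Lazily yield every simple (cycle-free) path from source to target.

    max_len, if given, caps the number of edges in a path.
    """
    if source == target:
        yield [source]
        return
    path = [source]                        # nodes on the current partial path
    on_path = {source}                     # same nodes, for O(1) membership tests
    stack = [iter(graph.get(source, ()))]  # one neighbour iterator per path node
    while stack:
        try:
            nxt = next(stack[-1])
        except StopIteration:
            # All neighbours of the last path node are exhausted: backtrack.
            stack.pop()
            on_path.discard(path.pop())
            continue
        if nxt in on_path:
            continue                       # would create a cycle
        if nxt == target:
            if max_len is None or len(path) <= max_len:
                yield path + [nxt]
        elif max_len is None or len(path) < max_len:
            path.append(nxt)
            on_path.add(nxt)
            stack.append(iter(graph.get(nxt, ())))


# Illustrative toy graph (not from the question).
example = {"a": ["b", "c"], "b": ["c", "d"], "c": ["d"], "d": []}
paths = list(all_simple_paths(example, "a", "d"))
```

Because the function is a generator, you can stop after the first k paths or stream them to disk instead of holding them all in memory.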

Related

Cytoscape: Trying to Determine Which of Two Node Types Is More "Central"

So I have a decently large dataset (4k+ nodes, 16k+ edges), and there are two node types (let's call them "A" and "B," about 130 nodes combined) that should be considered the centers of many sub-networks. I'm trying to create a visualization that can illustrate whether A or B is more "central" to these sub-networks. To put it another way, is A or B the more "important" organizing type? If any of this makes any sense at all, I'd appreciate your thoughts. (As a disclaimer, I'm fairly new to the software but pretty comfortable with the fundamentals. Consider me a decently intelligent noob haha)
There is a tool included with Cytoscape called Network Analyzer (Tools->Analyze Network). What you are asking for is a measure of the "centrality" of the nodes. There are several types of centrality measures that can be used for "importance" depending on what you mean by importance. Network Analyzer will provide new columns with the main measures of centrality: degree centrality (the extent to which the node is a hub), betweenness centrality (the extent to which paths go through this node) and closeness centrality (the extent to which this node is close to other nodes). See https://cytoscape.org/cytoscape-tutorials/presentations/intro-cytoscape-2020-ucsf.html#/12 for a brief discussion of some of the common network centrality measures.
-- scooter
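If you want to sanity-check the comparison outside Cytoscape, here is a rough Python sketch of two of the measures mentioned above (degree and closeness), averaged per node type. It assumes an undirected network stored as a dict of neighbour sets; all function names and the toy data are made up for illustration, and closeness is taken here as the reciprocal of the mean shortest-path distance to the reachable nodes, which may differ in normalization from what Network Analyzer reports.

```python
from collections import deque
from statistics import mean


def degree_centrality(adj):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def closeness_centrality(adj):
    """Reciprocal of the mean BFS distance from each node to the nodes it can reach."""
    result = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[s] = (len(dist) - 1) / total if total else 0.0
    return result


def mean_by_type(scores, node_type):
    """Average a centrality score over the nodes of each type."""
    groups = {}
    for v, t in node_type.items():
        groups.setdefault(t, []).append(scores[v])
    return {t: mean(vals) for t, vals in groups.items()}


# Toy star network: A1 is the hub, B1 is one of its leaves.
adj = {"A1": {"x", "y", "B1"}, "x": {"A1"}, "y": {"A1"}, "B1": {"A1"}}
node_type = {"A1": "A", "B1": "B"}
degree_by_type = mean_by_type(degree_centrality(adj), node_type)
closeness_by_type = mean_by_type(closeness_centrality(adj), node_type)
```

Comparing the per-type averages (or, better, the full distributions) gives a first indication of which type sits closer to the center of the network.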

Is it ok to use a graph DB on a large dense graph?

We want to present our data in a graph and thought about using one of the graph DBs. During our vendor investigation process, one of the experts suggested that using a graph DB on a dense graph won't be efficient and that we'd be better off with a columnar DB like Cassandra.
I gave your use case some thought. Given that your graph is very dense (number of relationships = number of nodes squared) and that you seem to need only few-hop traversals from a particular node along different relationships, I’d actually recommend you also try out a columnar database.
Graph databases tend to work well when you have sparse graphs (num of relationships << num of nodes ^ 2) and deep traversals, from 4-5 hops to hundreds of hops. If I understood your use case correctly, a columnar database should generally outperform graph databases there.
Our use case will probably end up with nodes connected to 10s of millions of other nodes with about 30% overlap between different nodes - so in a way, it's probably a dense graph. Overall there will be probably a few billion nodes.
Looking in the Neo4j source code, I found some references to an isDense flag on the nodes that switches the processing logic; I'm not sure what it does. I also wonder whether it was added as an edge-case patch and won't work well if most of the nodes in the graph are dense.
Does anyone have any experience with graphdbs on dense graphs and should it be considered in such cases?
All opinions are appreciated!
When a graph DB comes to mind, it usually means that multiple tables are linked with each other, which is a perfect use case for a graph DB.
We are running JanusGraph at a scale of 20B vertices and 15B edges. It's not a large dense graph where a vertex is connected to tens of millions of vertices, but we still observed the supernode case, where a vertex is connected to far more vertices than expected. In our use case, however, the traversal (DFS) always visits at most N child nodes per node and goes to a limited depth, say M, which is absolutely fine considering the number of joins required in non-graph DBs (columnar, relational, Athena, etc.).
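To make the bounded traversal concrete, here is a small Python sketch of a DFS that follows at most N edges per node and goes at most M hops deep. It is only an illustration under assumptions: neighbors is a hypothetical callable returning an iterable of adjacent vertices, not a JanusGraph API, and a node is expanded at the depth where it is first discovered.

```python
from itertools import islice


def bounded_dfs(neighbors, start, max_children, max_depth):
    """Return the nodes reached from start, in discovery order, following at
    most max_children edges out of each node and at most max_depth hops."""
    visited = {start}
    order = [start]
    stack = [(start, 0)]
    while stack:
        node, depth = stack.pop()
        if depth == max_depth:
            continue  # depth budget used up: do not expand this node
        # islice stops after max_children edges, so a supernode's full
        # adjacency list is never materialized.
        for child in islice(neighbors(node), max_children):
            if child not in visited:
                visited.add(child)
                order.append(child)
                stack.append((child, depth + 1))
    return order


# Toy adjacency data; vertex 0 plays the role of a (tiny) supernode.
toy = {0: [1, 2, 3, 4], 1: [5], 2: [6], 5: [7]}
reached = bounded_dfs(lambda v: iter(toy.get(v, [])), 0, max_children=2, max_depth=2)
```

The work per query is bounded by roughly N^M visited nodes regardless of how large individual adjacency lists get, which is why supernodes stop being a problem for this access pattern.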
The only way (I feel) to get all relations of a node is to do a full DFS, or to inner-join datasets until no common data is found.
Excited to know more about other creative solutions.
I do not have experience with dense graphs in graph databases, but I do not think that a dense graph is a problem. Since you are going to use graph algorithms, I suppose, you would benefit from using a graph database (depending on the algorithms' complexity: the more "hops", the more you benefit from constant edge-traversal time).
A good trade-off could be to use one of the non-native graph databases (like Titan, its follow-up JanusGraph, MongoDB, ...), which actually use column-based storage (Cassandra, Berkeley DB, ...) as their backend.

What parameters can I play with using mcl?

I am clustering undirected graphs using mcl. To do so, I have to choose a threshold under which nodes are connected, a similarity measure for each edge, and the inflation parameter to tune the granularity of my clustering. I have been playing around with these parameters, but so far the clusters I get seem to be too large (I did visualizations that suggest that the largest clusters should be cut into 2 or more clusters). Therefore, I was wondering what other parameters I can play with to improve my clustering. (I am currently working with the scheme parameter of mcl to see whether increasing the accuracy would help, but if there are other, 'more specific' parameters that could help to get smaller clusters, for instance, please let me know.)
There are really mainly two things to consider. The first and most important is outside mcl (http://micans.org/mcl/) itself, namely how the network is constructed. I've written about it elsewhere, but I'll repeat it here because it is important.
If you have a weighted similarity, choose an edge-weight (similarity) cutoff
such that the topology of the network becomes informative; i.e. too many edges
or too few edges yield little discriminative information in the
absence/presence structure of edges. Choose it such that no edges connect
things you consider very dissimilar, and that edges connect things you consider
somewhat similar to quite similar. In the case of mcl, the dynamic range in
edge weight between 'a bit similar' and 'very similar' should be, as a rule of
thumb, one order of magnitude, i.e. two-fold or five-fold or ten-fold, as
opposed to varying from 0.9 to 1.0. Of course, it is possible to give simple
networks to mcl and it will just utilise the absence/presence of edges. Make sure
the network does not become very dense - a very rough rule of thumb could be to aim
for a total number of edges that is on the order of V * sqrt(V) if the number of nodes (vertices) is V, that is, each node has, on average, on the order of sqrt(V) neighbours.
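As a rough illustration of the V * sqrt(V) rule of thumb, the Python sketch below picks a similarity cutoff that keeps approximately that many of the strongest edges. The function name and the toy weights are made up for this example, and the result should be treated as a starting point to compare against what you know about the data, not as a recommended setting.

```python
import math


def cutoff_for_target_edges(weights, num_nodes):
    """Return a similarity cutoff such that keeping the edges with
    weight >= cutoff leaves about V * sqrt(V) edges (more if weights tie)."""
    if not weights:
        raise ValueError("no edge weights given")
    target = int(num_nodes * math.sqrt(num_nodes))
    ranked = sorted(weights, reverse=True)
    if target >= len(ranked):
        return ranked[-1]      # the network is already sparse enough
    return ranked[target - 1]  # weight of the target-th strongest edge


# Toy example: 4 nodes give a target of 4 * sqrt(4) = 8 edges.
toy_weights = list(range(1, 13))
cutoff = cutoff_for_target_edges(toy_weights, num_nodes=4)
kept = [w for w in toy_weights if w >= cutoff]
```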
The above, network construction, is really crucial, and it is advisable
to try different approaches. Now, given a network,
there is really only one mcl parameter to vary: the inflation parameter (the -I option).
A good set of values to test with is 1.4, 2, 3, 4, 6.
In summary, if you are exploring, try different ways of network construction,
using your knowledge of the data to make the network a meaningful representation,
and combine this with trying different mcl inflation values.
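The inflation sweep can be scripted. The Python sketch below only builds the command lines for the values listed above, so nothing runs until you pass them to subprocess.run. It assumes the network is stored in label (ABC) format, hence --abc, and the input and output file names are hypothetical.

```python
def mcl_commands(input_file, inflations=(1.4, 2, 3, 4, 6)):
    """Build one mcl command line per inflation value (-I)."""
    commands = []
    for value in inflations:
        tag = str(value).replace(".", "")  # e.g. 1.4 -> "14"; naming is arbitrary
        commands.append(
            ["mcl", input_file, "--abc", "-I", str(value), "-o", f"out.I{tag}.txt"]
        )
    return commands


# "graph.abc" is a placeholder file name.
commands = mcl_commands("graph.abc")
```

Each command can then be executed with subprocess.run(cmd, check=True), and the resulting clusterings compared for granularity.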

Determining groups in a hierarchical cluster

I have an algorithm that can group data into a hierarchical cluster tree. The algorithm is the one described in Toby Segaran's Programming Collective Intelligence. The tree output is a binary tree with a "distance" value at each node that tells you how far apart the two child nodes are.
I can then display this as a dendrogram, which makes it fairly easy for a human to spot which values are grouped together. However, I'm having difficulty coming up with an algorithm that automatically decides what the groups should be. I'd like to be able to determine automatically:
The number of groups
Which points should be placed in each group
Is there a standard algorithm for this?
I think there is no default way to do this. Simple 'manual' methods would be to either:
specify the number of clusters you want/expect
set a threshold for the maximum distance between two nodes; any nodes with a larger distance belong to another cluster
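The distance-threshold option can be sketched in a few lines of Python. The Node class below is a hypothetical stand-in for whatever tree structure your algorithm produces, and the sketch assumes that a parent's distance is never smaller than its children's, which holds for the usual linkage methods but is worth checking for yours.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    distance: float = 0.0           # how far apart the two children are
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None     # set on leaves only


def leaves(node: Node) -> List[str]:
    """Collect the labels of all leaves below node."""
    if node.left is None and node.right is None:
        return [node.label]
    return leaves(node.left) + leaves(node.right)


def cut_tree(root: Node, max_distance: float) -> List[List[str]]:
    """Split the tree into groups whose internal merge distance is <= max_distance."""
    is_leaf = root.left is None and root.right is None
    if is_leaf or root.distance <= max_distance:
        return [leaves(root)]
    return cut_tree(root.left, max_distance) + cut_tree(root.right, max_distance)


# Toy tree: (a, b) merge at 0.2, (c, d) at 0.3, and the two pairs at 1.5.
ab = Node(0.2, Node(label="a"), Node(label="b"))
cd = Node(0.3, Node(label="c"), Node(label="d"))
tree = Node(1.5, ab, cd)
groups = cut_tree(tree, max_distance=0.5)
```

The number of groups then follows from the threshold. To ask for a fixed number of groups instead, you can lower the threshold step by step until the cut yields that many.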
There are some automatic methods to determine the number of clusters. R has the Dynamic Tree Cut package, which deals with this problem automatically; pvclust could also be used. Two more methods for this problem are described by Salvador (2002) and Daniels (2006).
I have found out that the Calinski-Harabasz index (also known as Variance Ratio Criterion) works well with dendrograms produced by hierarchical clustering. You can find more information (and a comparative study) in this paper.
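For reference, here is a plain-Python sketch of the Calinski-Harabasz index, assuming the points are numeric tuples and that distances are Euclidean. It is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: CH = (B / (k - 1)) / (W / (n - k)). The usual way to apply it is to cut the dendrogram at several candidate levels and keep the cut with the highest score. The function name and toy data are illustrative.

```python
import math


def calinski_harabasz(clusters):
    """clusters: list of clusters, each a non-empty list of equal-length
    numeric tuples. Higher scores mean tighter, better-separated clusters."""
    points = [p for cluster in clusters for p in cluster]
    n, k = len(points), len(clusters)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more points than clusters")

    def centroid(pts):
        return [sum(col) / len(pts) for col in zip(*pts)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    # Between-cluster dispersion: size-weighted distance of centroids to the overall mean.
    between = sum(len(c) * sqdist(centroid(c), overall) for c in clusters)
    # Within-cluster dispersion: distance of each point to its own centroid.
    within = sum(sqdist(p, centroid(c)) for c in clusters for p in c)
    if within == 0:
        return math.inf
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two well-separated pairs, grouped sensibly and then badly.
good = calinski_harabasz([[(0, 0), (0, 2)], [(10, 0), (10, 2)]])
bad = calinski_harabasz([[(0, 0), (10, 0)], [(0, 2), (10, 2)]])
```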

Mahout: RowSimilarity vs Clustering

I was trying to cluster some documents using the KMeansClustering approach and successfully created the clusters. I saved the cluster id corresponding to a particular document for recommendations. So whenever I wanted to recommend documents similar to a particular document, I would query all the documents in a particular cluster and return n random documents from the cluster. However, returning any random document from the cluster did not seem appropriate and I read somewhere that we should be returning the documents nearest to the document in question.
So I started searching for ways to calculate the distance between documents and stumbled upon the RowSimilarity approach, which returns the 10 most similar documents for each document, ordered by distance. This approach relies on a similarity metric such as LogLikelihood to calculate the distance between documents.
Now my question is this. How is clustering better/worse than RowSimilarity given that both the approaches use a similarity distance metric to calculate the distance between documents?
What I'm trying to achieve is to cluster products on the basis of their titles and other text properties in order to recommend similar products. Any help is appreciated.
Clustering is not just another variant of classification or recommendation. It is a different discipline.
When you are doing cluster analysis, you want to discover structure in the data. But then, you should actually be analyzing the structure you found.
Now k-means is not really meant for documents. It tries to find a near optimal partitioning of a data set into k Voronoi cells. Unless you have a good reason to believe that Voronoi cells are a good partitioning for your data, the algorithm may be pretty much useless. Just because it returns a result does not at all indicate that the result is useful.
For documents, Euclidean distance (and k-means is in fact optimizing squared Euclidean distances) is usually pretty much meaningless. The vectors are very sparse, and k-means cluster centers will then often resemble impossible (and thus insensible) "average documents".
And I haven't even started on the need to find an appropriate value of k, on the Mahout implementation likely being just an approximation of Lloyd's k-means approximation, and so on. Did you even check the cluster sizes? In situations like these, k-means will often produce degenerate results, for example almost all clusters containing 1 or 0 elements and a mega-cluster containing the rest. In this situation, you might in fact be returning just random documents from your database...
Just because you can use it does not mean it is helpful. Make sure to validate the individual steps of your approach, for example if the clusters are in any way useful and sensible!
Similarity is not the same thing as distance -- one is big when the other is small. Clustering is not the same as computing distances either. First you should decide whether you have a clustering problem -- it does not sound like you do based on what you say. So, don't use k-means.
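To illustrate the nearest-documents idea without any clustering step, here is a small Python sketch that ranks documents by cosine similarity of their term counts. Cosine is used here only because it is simple to show; it is not the LogLikelihood measure mentioned in the question, and the function names and product titles are made up.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two sparse term-count vectors (dicts or Counters)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(docs, query_id, n=10):
    """Return up to n (doc_id, similarity) pairs, most similar first."""
    query = docs[query_id]
    scored = [
        (cosine_similarity(query, vec), doc_id)
        for doc_id, vec in docs.items()
        if doc_id != query_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:n]]


# Toy product titles turned into term counts.
docs = {
    "p1": Counter("red cotton shirt".split()),
    "p2": Counter("blue cotton shirt".split()),
    "p3": Counter("steel kitchen knife".split()),
}
recommendations = most_similar(docs, "p1", n=2)
```

Note the direction: a larger similarity means a closer document, so a distance would be something like 1 - similarity. This brute-force version compares the query against every document, which is fine for a sketch but is exactly the part a tool like RowSimilarity is meant to scale up.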

Resources