Edge Pair Prediction with Graph Neural Networks (GNN) - machine-learning

I am working on a Project about a systems stability regarding the failure/removal of PAIRS of edges. I am currently reading into GNNs. GNNs of course usually produce predictions for the whole graph or for single nodes or edges.
So, can GNNs be adapted to classify Edge Pairs?

Related

Why k-means in scikit learn have a predict function but DBSCAN/agglomerative doesnt?

Scikit-learn implementation of K-means has a predict() function which can be applied on unseen data. Where as DBSCAN and Agglomerative does not have a predict() function.
All the three algorithms has fit_predict() which is used to fit the model and then predict. But k-means has predict() which can be directly used on unseen data which is not the case for the other algorithm.
I am very much aware that there are clustering algorithms and as per my opinion, predict() should not be there for K-means also.
What is the possible intuition/reason behind this discrepancy? is it only because k-means performs "1NN classification", so it has a predict() function?
My interpretation is that the difference comes from the way the cluster are computed. In the KMeans there is a native way to assign a new point to a cluster, while not in DBSCAN or Agglomerative clustering.
A) KMeans
In KMeans, during the construction of the clusters, a data point is assigned to the cluster with the closest centroid, and the centroids are updated afterwards. "Predicting" in the KMeans algorithm is actually doing the assignment step without updating the clusters.
If you assume that the new data points are drawn from the same distribution than the "training" set, and that your "training" set was representative enough, it is reasonable to think that one can assign the new data points following the heuristic of the algorithm without updating the cluster centroids, thus making predictions.
Of course, if the data points distribution is likely to be change one should rerun the KMeans clustering on the updated dataset.
B) DBSCAN
DBSCAN creates the cluster by finding high density areas of the dataset (parametrized by the parameters epsilon and min_points). This is done by computing point-level properties (whether the point is a core point, a directly reachable point, a reachable point or a noise point). Adding a new data point can modify the definition of the neighboring points, and thus make the computed clusters obsolete.
As an example, let's look at this illustration from wikipedia, copied below. On this image there is one cluster (red+yellow points) and one noise point (blue). Red points are core points and yellow points are reachable points.
and consider two cases:
Adding a new point halfway between A and N would make N a reachable point from A and thus belonging to the cluster.
Adding (min_points-1) new points in the epsilon-neighborhood of N, but in no other epsilon-neighborhood (as an example at the top of the picture), would change the status of N which would become a core point, and form a new cluster with the newly added points.
Here adding new data points clearly requires to recompute the clusters.
C) Aggglomerative clustering
Agglomerative clustering iteratively builds the cluster starting from points and merges them according to a linkage measure. Similarly to DBSCAN, adding new data points can entirely modify the final clusters because it can trigger different mergings.
As an example, if the linkage strategy you choose in sklearn is "single", clusters are merged if the minimum distance between all elements of the two clusters is below a chosen threshold. You can easily figure out that a single well placed new data point can trigger a merge between two clusters that would have been separated otherwise.
Thus predicting here also requires to recompute the clusters

How to find probability of predicted weight of a link in weighted graph

I have an undirected weighted graph. Let's say node A and node B don't have a direct link between them but there are paths connects both nodes through other intermediate nodes. Now I want to predict the possible weight of the direct link between the node A and B as well as the probability of it.
I can predict the weight by finding the possible paths and their average weight but how can I find the probability of it
The problem you are describing is called link prediction. Here is a short tutorial explaining about the problem and some simple heuristics that can be used to solve it.
Since this is an open-ended problem, these simple solutions can be improved a lot by using more complicated techniques. Another approach for predicting the probability for an edge is to use Machine Learning rather than rule-based heuristics.
A recent article called node2vec, proposed an algorithm that maps each node in a graph to a dense vector (aka embedding). Then, by applying some binary operator on a pair of nodes, we get an edge representation (another vector). This vector is then used as input features to some classifier that predicts the edge-probability. The paper compared a few such binary operators over a few different datasets, and significantly outperformed the heuristic benchmark scores across all of these datasets.
The code to compute embeddings given your graph can be found here.

why do we have multiple layers and multiple nodes per layer in a neural network?

I just started to learn about neural networks and so far my knowledge of machine learning is simply linear and logistic regression. from my understanding of the latter algorithms, is that given multiples of inputs the job of the learning algorithm is to come up with appropriate weights for each input so that eventually I have a polynomial that either describes the data which is the case of linear regression or separates it as in the case of logistic regression.
if I was to represent the same mechanism in neural network, according to my understanding, it would look something like this,
multiple nodes at the input layer and a single node in the output layer. where I can back propagate the error proportionally to each input. so that also eventually I arrive to a polynomial X1W1 + X2W2+....XnWn that describes the data. to me having multiple nodes per layer, aside from the input layer, seems to make the learning process parallel, so that I can arrive to the result faster. it's almost like running multiple learning algorithms each with different starting points to see which one converges faster. and as for the multiple layers I'm at a lose of what mechanism and advantage does it have on the learning outcome.
why do we have multiple layers and multiple nodes per layer in a neural network?
We need at least one hidden layer with a non-linear activation to be able to learn non-linear functions. Usually, one thinks of each layer as an abstraction level. For computer vision, the input layer contains the image and the output layer contains one node for each class. The first hidden layer detects edges, the second hidden layer might detect circles / rectangles, then there come more complex patterns.
There is a theoretical result which says that an MLP with only one hidden layer can fit every function of interest up to an arbitrary low error margin if this hidden layer has enough neurons. However, the number of parameters might be MUCH larger than if you add more layers.
Basically, by adding more hidden layers / more neurons per layer you add more parameters to the model. Hence you allow the model to fit more complex functions. However, up to my knowledge there is no quantitative understanding what adding a single further layer / node exactly makes.
It seems to me that you might want a general introduction into neural networks. I recommend chapter 4.3 and 4.4 of [Tho14a] (my bachelors thesis) as well as [LBH15].
[Tho14a]
M. Thoma, “On-line recognition of handwritten mathematical symbols,”
Karlsruhe, Germany, Nov. 2014. [Online]. Available: https://arxiv.org/abs/1511.09030
[LBH15]
Y. LeCun,
vol. 521,
Y. Bengio,
no. 7553,
and G. Hinton,
pp. 436–444,
“Deep learning,”
Nature,
May 2015. [Online]. Available:
http://www.nature.com/nature/journal/v521/n7553/abs/nature14539.html

Iterative modularity

I have run modulartiy edge_weight/randomized at a resolution of 1, atleast 20 times on the same network. This is the same network I have created based on the following rule. Two nodes are related if they have atleast one item in common. Every time I run modularity I get a little different node distribution among communities. Additionally, I get 9 or 10 communities but it is never consistent. Any comment or help is much appreciated.
I found a solution to my problem using consensus clustering. Here is the paper that describes it. One way to get the optimum clusters without having to solve them in a high-dimensional space using spectral clustering would be to run the algorithm repeatedly until no more partitions can be achieved. Here is the article and complete explanation details:
SCIENTIFIC REPORTS | 2 : 336 | DOI: 10.1038/srep00336
Consensus clustering in complex networks Andrea Lancichinetti & Santo Fortunato
The consensus matrix. Let us suppose that we wish to combine nP partitions found by a clustering algorithm on a network with n vertices. The consensus matrix D is an n x n matrix, whose entry Dij indicates the number of partitions in which vertices i and j of the network were assigned to the same cluster, divided by the number of partitions nP. The matrix D is usually much denser than the adjacency matrix A of the original network, because in the consensus matrix there is an edge between any two vertices which have cooccurred in the same cluster at least once. On the other hand, the weights are large only for those vertices which are most frequently coclustered, whereas low weights indicate that the vertices are probably at the boundary between different (real) clusters, so their classification in the same cluster is unlikely and essentially due to noise. We wish to maintain the large weights and to drop the low ones, therefore a filtering procedure is in order. Among the other things, in the absence of filtering the consensus matrix would quickly grow into a very dense matrix, which would make the application of any clustering algorithm computationally expensive.We discard all entries of D below a threshold t. We stress that there might be some noisy vertices whose edges could all be below the threshold, and they would be not connected anymore. When this happens, we just connect them to their neighbors with highest weights, to keep the graph connected all along the procedure.
Next we apply the same clustering algorithm to D and produce another set of partitions, which is then used to construct a new consensus matrix D9, as described above. The procedure is iterated until the consensus matrix turns into a block diagonal matrix Dfinal, whose weights equal 1 for vertices in the same block and 0 for vertices in different blocks. The matrix Dfinal delivers the community structure of the original network. In our calculations typically one iteration is sufficient to lead to stable results. We remark that in order to use the same clustering method all along, the latter has to be able to detect clusters in weighted networks, since the consensus matrix is weighted. This is a necessary constraint on the choice of the methods for which one could use the procedure proposed here. However, it is not a severe limitation,as most clustering algorithms in the literature can handle weighted networks or can be trivially extended to deal with them.
I think that the answer is in the randomizing part of the algorithm. You can find more details here:
https://github.com/gephi/gephi/wiki/Modularity
https://sites.google.com/site/findcommunities/
http://lanl.arxiv.org/abs/0803.0476

Clustering Method Selection in High-Dimension?

If the data to cluster are literally points (either 2D (x, y) or 3D (x, y,z)), it would be quite intuitive to choose a clustering method. Because we can draw them and visualize them, we somewhat know better which clustering method is more suitable.
e.g.1 If my 2D data set is of the formation shown in the right top corner, I would know that K-means may not be a wise choice here, whereas DBSCAN seems like a better idea.
However, just as the scikit-learn website states:
While these examples give some intuition about the algorithms, this
intuition might not apply to very high dimensional data.
AFAIK, in most of the piratical problems we don't have such simple data. Most probably, we have high-dimensional tuples, which cannot be visualized like such, as data.
e.g.2 I wish to cluster a data set where each data is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT visualize it in a coordinate system and observes its distribution like before. So I will NOT be able to say DBSCAN is superior to K-means in this case.
So my question:
How does one choose the suitable clustering method for such an "invisualizable" high-dimensional case?
"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).
4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2d projections (or even 3d, using rotation); or using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.
Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.
The authors of that survey also publish a software framework which has a lot of these advanced clustering methods (not just k-means, but e.h. CASH, FourC, ERiC): ELKI
There are at least two common, generic approaches:
One can use some dimensionality reduction technique in order to actually visualize the high dimensional data, there are dozens of popular solutions including (but not limited to):
PCA - principal component analysis
SOM - self-organizing maps
Sammon's mapping
Autoencoder Neural Networks
KPCA - kernel principal component analysis
Isomap
After this one goes back to the original space and use some techniques that seems resonable based on observations in the reduced space, or performs clustering in the reduced space itself.First approach uses all avaliable information, but can be invalid due to differences induced by the reduction process. While the second one ensures that your observations and choice is valid (as you reduce your problem to the nice, 2d/3d one) but it loses lots of information due to transformation used.
One tries many different algorithms and choose the one with the best metrics (there have been many clustering evaluation metrics proposed). This is computationally expensive approach, but has a lower bias (as reducting the dimensionality introduces the information change following from the used transformation)
It is true that high dimensional data cannot be easily visualized in an euclidean high dimensional data but it is not true that there are no visualization techniques for them.
In addition to this claim I will add that with just 4 features (your dimensions) you can easily try the parallel coordinates visualization method. Or simply try a multivariate data analysis taking two features at a time (so 6 times in total) to try to figure out which relations intercour between the two (correlation and dependency generally). Or you can even use a 3d space for three at a time.
Then, how to get some info from these visualizations? Well, it is not as easy as in an euclidean space but the point is to spot visually if the data clusters in some groups (eg near some values on an axis for a parallel coordinate diagram) and think if the data is somehow separable (eg if it forms regions like circles or line separable in the scatter plots).
A little digression: the diagram you posted is not indicative of the power or capabilities of each algorithm given some particular data distributions, it simply highlights the nature of some algorithms: for instance k-means is able to separate only convex and ellipsoidail areas (and keep in mind that convexity and ellipsoids exist even in N-th dimensions). What I mean is that there is not a rule that says: given the distributiuons depicted in this diagram, you have to choose the correct clustering algorithm consequently.
I suggest to use a data mining toolbox that lets you explore and visualize the data (and easily transform them since you can change their topology with transformations, projections and reductions, check the other answer by lejlot for that) like Weka (plus you do not have to implement all the algorithms by yourself.
In the end I will point you to this resource for different cluster goodness and fitness measures so you can compare the results rfom different algorithms.
I would also suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are added to find the most relevant features. You can use these weights to increase performance and improve the BMU calculation with euclidean distance, for example.

Resources