Minimum Spanning Tree of Subgraph - graph-algorithm

Suppose you have a graph G = (V, E). You can do whatever you want in terms of preprocessing on this graph G (within reasonable time and space constraints for a graph with a few thousands vertices, so you couldn't just store every possible answer for example).
Now suppose I select a subset V' of V. I want the MST over just these vertices V'. How do you do this quickly and efficiently?

There are two ways to solve the problem. Their performance is dependent to different states of the problem.
Applying MST algorithms on sub-graph(solve from scratch).
Using dynamic algorithms to update tree after changes in the problem.
There are two types of dynamic algorithms:
I) Edge insertion and deletion**
G. Ramalingam and T. Reps, “On the computational complexity of
dynamic graph problems,” Theoret. Comput. Sci., vol. 158, no. 1, pp.
233–277, 1996.
II) Edge weight decreasing and increasing**
D. Frigioni, A. Marchetti-Spaccamela, and U. Nanni, “Fully dynamic
output bounded single source shortest path problem,” in ACM-SIAM
Symp. Discrete Algorithms, 1996, pp. 212–221.
“Fully dynamic algorithms for maintaining shortest paths trees,”
J. Algorithms, vol. 34, pp. 251–281, 2000.
You can use them directly or change them with respect to the problem and consider node insertion and deletion.

Related

How to combine various distance functions into one given the following dataset?

I have a few distance functions which return distance between two images , I want to combine these distance into a single distance, using weighted scoring e.g. ax1+bx2+cx3+dx4 etc i want to learn these weights automatically such that my test error is minimised.
For this purpose i have a labeled dataset which has various triplets of images such that (a,b,c) , a has less distance to b than it has to c.
i.e. d(a,b)<d(a,c)
I want to learn such weights so that this ordering of triplets can be as accurate as possible.(i.e. the weighted linear score given is less for a&b and more for a&c).
What sort of machine learning algorithm can be used for the task,and how the desired task can be achieved?
Hopefully I understand your question correctly, but it seems that this could be solved more easily with constrained optimization directly, rather than classical machine learning (the algorithms of which are often implemented via constrained optimization, see e.g. SVMs).
As an example, a possible objective function could be:
argmin_{w} || e ||_2 + lambda || w ||_2
where w is your weight vector (Oh god why is there no latex here), e is the vector of errors (one component per training triplet), lambda is some tunable regularizer constant (could be zero), and your constraints could be:
max{d(I_p,I_r)-d(I_p,I_q),0} <= e_j for jth (p,q,r) in T s.t. d(I_p,I_r) <= d(I_p,I_q)
for the jth constraint, where I_i is image i, T is the training set, and
d(u,v) = sum_{w_i in w} w_i * d_i(u,v)
with d_i being your ith distance function.
Notice that e is measuring how far your chosen weights are from satisfying all the chosen triplets in the training set. If the weights preserve ordering of label j, then d(I_p,I_r)-d(I_p,I_q) < 0 and so e_j = 0. If they don't, then e_j will measure the amount of violation of training label j. Solving the optimization problem would give the best w; i.e. the one with the lowest error.
If you're not familiar with linear/quadratic programming, convex optimization, etc... then start googling :) Many libraries exist for this type of thing.
On the other hand, if you would prefer a machine learning approach, you may be able to adapt some metric learning approaches to your problem.

a new edge is insert to a Minimum spanning tree

I trying to find an algorithm to the following question with one different :
the edge are not distinct.
Give an efficient algorithm to test if T remains the minimum-cost spanning tree with the new edge added to G.
in this link- there is a solution but it is not for the different I wrote up:
the edges are not nessecerliy distinct.
Updating a Minimum spanning tree when a new edge is inserted
someone has an idea?
Well, the naive approach of just using Prim or Kruskal to find the min cost spanning tree of the new graph and then see which one has a lower total cost isn't too bad at O(|E|log|E|).
But we don't need to look at the whole graph.
Suppose your new edge connects vertices A and B. Let C be the parent of A. If B is not a descendent of A, then if A-B is lower cost than A-C, then T is no longer the MST and B should be the new parent of the subtree rooted at A.
If B is a descendant of A, then if A-B is shorter than any of the branches in T along the path from A to B, then T is no longer the MST, and the highest cost edge along that path should be removed, B is the root of the newly disconnected component, and should be added as a child of A.
I believe you may need to check these things a second time, reversing which vertices are A and B. The complexity of this is log|V| where the base of the log is the average number of children per node of T. In the case of T being a straight line, it's O(|V|), but otherwise, I think you could say it is O(log|V|).
First find an MST using one of the existing efficient algorithms.
Now adding an edge (v,w) creates a cycle in the MST. If the newly added edge has the maximum cost among the edges on the cycle then the MST remains as it is. If some other edge on the cycle has the maximum cost, then that's the edge to be removed to get a tree with lower cost.
So we need an efficient way to find the edge with the maximum value on the cycle. You can climb from v and w until you reach LCA(v, w) (the least common ancestor of v and w) to get the edge with the max cost. This takes linear time in the worst case.
If you are going to answer multiple such queries then pre-processing the MST is probably better. You can pre-process the MST to get a sparse table data structure in O(N lg N) time and then use this data structure to answer max queries in O(lg N) time in the worst case.

In DBSCAN, how to determine border points?

In DBSCAN, the core points is defined as having more than MinPts within Eps.
So if MinPts = 4, a points with total 5 points in Eps is definitely a core point.
How about a point with 4 points (including itself) in Eps? Is it a core point, or a border point?
Border points are points that are (in DBSCAN) part of a cluster, but not dense themselves (i.e. every cluster member that is not a core point).
In the followup algorithm HDBSCAN, the concept of border points was discarded.
Campello, R. J. G. B.; Moulavi, D.; Sander, J. (2013).
Density-Based Clustering Based on Hierarchical Density Estimates.
Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD 2013. Lecture Notes in Computer Science 7819. p. 160.
doi:10.1007/978-3-642-37456-2_14
which states:
Our new definitions are more consistent with a statistical interpretation of clusters as connected components of a level set of a density [...] border objects do not technically belong to the level set (their estimated density is below the threshold).
Actually, I just re-read the original paper and Definition 1 makes it look like the core point belongs to its own eps neighborhood. So if minPts is 4, then a point needs at least 3 others in its eps neighborhood.
Notice in Definition 1 that they say NEps(p) = {q ∈D | dist(p,q) ≤ Eps}. If the point were excluded from its eps neighborhood, then it would have said NEps(p) = {q ∈D | dist(p,q) ≤ Eps and p != q}. Where != is "not equal to".
This is also reinforced by the authors of DBSCAN in their OPTICS paper in Figure 4. http://fogo.dbs.ifi.lmu.de/Publikationen/Papers/OPTICS.pdf
So I think the SciKit interpretation is correct and the Wikipedia illustration is misleading in http://en.wikipedia.org/wiki/DBSCAN
This largely depends on the implementation. The best way is to just play with the implementation yourself.
In the original DBSCAN1 paper, core point condition is given as N_Eps>=MinPts, where N_Eps is the Epsilon neighborhood of a certain data point, which is excluded from its own N_Eps.
Following your example, if MinPts = 4 and N_Eps = 3 (or 4 including itself as you say), then they don't form a cluster according to the original paper. On the other hand, the scikit-learn2 implementation of DBSCAN works otherwise, meaning it counts the point itself for forming a group. So for MinPts=4, four points are needed in total to form a cluster.
[1] Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). "A density-based algorithm for discovering clusters in large spatial databases with noise."
[2] http://scikit-learn.org

Kohonen Self Organizing Maps: Determining the number of neurons and grid size

I have a large dataset I am trying to do cluster analysis on using SOM. The dataset is HUGE (~ billions of records) and I am not sure what should be the number of neurons and the SOM grid size to start with. Any pointers to some material that talks about estimating the number of neurons and grid size would be greatly appreciated.
Thanks!
Quoting from the som_make function documentation of the som toolbox
It uses a heuristic formula of 'munits = 5*dlen^0.54321'. The
'mapsize' argument influences the final number of map units: a 'big'
map has x4 the default number of map units and a 'small' map has
x0.25 the default number of map units.
dlen is the number of records in your dataset
You can also read about the classic WEBSOM which addresses the issue of large datasets
http://www.cs.indiana.edu/~bmarkine/oral/self-organization-of-a.pdf
http://websom.hut.fi/websom/doc/ps/Lagus04Infosci.pdf
Keep in mind that the map size is also a parameter which is also application specific. Namely it depends on what you want to do with the generated clusters. Large maps produce a large number of small but "compact" clusters (records assigned to each cluster are quite similar). Small maps produce less but more generilized clusters. A "right number of clusters" doesn't exists, especially in real world datasets. It all depends on the detail which you want to examine your dataset.
I have written a function that, with the data set as input, returns the grid size. I rewrote it from the som_topol_struct() function of Matlab's Self Organizing Maps Toolbox into a R function.
topology=function(data)
{
#Determina, para lattice hexagonal, el número de neuronas (munits) y su disposición (msize)
D=data
# munits: número de hexágonos
# dlen: número de sujetos
dlen=dim(data)[1]
dim=dim(data)[2]
munits=ceiling(5*dlen^0.5) # Formula Heurística matlab
#munits=100
#size=c(round(sqrt(munits)),round(munits/(round(sqrt(munits)))))
A=matrix(Inf,nrow=dim,ncol=dim)
for (i in 1:dim)
{
D[,i]=D[,i]-mean(D[is.finite(D[,i]),i])
}
for (i in 1:dim){
for (j in i:dim){
c=D[,i]*D[,j]
c=c[is.finite(c)];
A[i,j]=sum(c)/length(c)
A[j,i]=A[i,j]
}
}
VS=eigen(A)
eigval=sort(VS$values)
if (eigval[length(eigval)]==0 | eigval[length(eigval)-1]*munits<eigval[length(eigval)]){
ratio=1
}else{
ratio=sqrt(eigval[length(eigval)]/eigval[length(eigval)-1])}
size1=min(munits,round(sqrt(munits/ratio*sqrt(0.75))))
size2=round(munits/size1)
return(list(munits=munits,msize=sort(c(size1,size2),decreasing=TRUE)))
}
hope it helps...
Iván Vallés-Pérez
I don't have a reference for it, but I would suggest starting off by using approximately 10 SOM neurons per expected class in your dataset. For example, if you think your dataset consists of 8 separate components, go for a map with 9x9 neurons. This is completely just a ballpark heuristic though.
If you'd like the data to drive the topology of your SOM a bit more directly, try one of the SOM variants that change topology during training:
Growing SOM
Growing Neural Gas
Unfortunately these algorithms involve even more parameter tuning than plain SOM, but they might work for your application.
Kohenon has written on the issue of selecting parameters and map size for SOM in his book "MATLAB Implementations and Applications of the Self-Organizing Map". In some cases, he suggest the initial values can be arrived at after testing several sizes of the SOM to check that the cluster structures were shown with sufficient resolution and statistical accuracy.
my suggestion would be the following
SOM is distantly related to correspondence analysis. In statistics, they use 5*r^2 as a rule of thumb, where r is the number of rows/columns in a square setup
usually, one should use some criterion that is based on the data itself, meaning that you need some criterion for estimating the homogeneity. If a certain threshold would be violated, you would need more nodes. For checking the homogeneity you would need some records per node. Agai, from statistics you could learn that for simple tests (small number of variables) you would need around 20 records, for more advanced tests on some variables at least 8 records.
remember that the SOM represents a predictive model. So validation is the key, absolutely mandatory. Yet, validation of predictive models (see typeI / II error entry in Wiki) is a subject on its own. And the acceptable risk as well as the risk structure also depend fully on your purpose.
You may test the dynamics of the error rate of the model by reducing its size more and more. Then take the smallest one with acceptable error.
It is a strength of the SOM to allow for empty nodes. Yet, there should not be too much of them. Let me say, less than 5%.
Taken all together, from experience, I would recommend the following criterion a minimum of the absolute number of 8..10 records, but those should not be more than 5% of all clusters.
Those 5% rule is of of course a heuristics, which however can be justified by the general usage of the confidence level in statistical tests. You may choose any percentage from 1% to 5%.

How do I cluster with KL-divergence?

I want to cluster my data with KL-divergence as my metric.
In K-means:
Choose the number of clusters.
Initialize each cluster's mean at random.
Assign each data point to a cluster c with minimal distance value.
Update each cluster's mean to that of the data points assigned to it.
In the Euclidean case it's easy to update the mean, just by averaging each vector.
However, if I'd like to use KL-divergence as my metric, how do I update my mean?
Clustering with KL-divergence may not be the best idea, because KLD is missing an important property of metrics: symmetry. Obtained clusters could then be quite hard to interpret. If you want to go ahead with KLD, you could use as distance the average of KLD's i.e.
d(x,y) = KLD(x,y)/2 + KLD(y,x)/2
It is not a good idea to use KLD for two reasons:-
It is not symmetry KLD(x,y) ~= KLD(y,x)
You need to be careful when using KLD in programming: the division may lead to Inf values and NAN as a result.
Adding a small number may affect the accuracy.
Well, it might not be a good idea use KL in the "k-means framework". As it was said, it is not symmetric and K-Means is intended to work on the euclidean space.
However, you can try using NMF (non-negative matrix factorization). In fact, in the book Data Clustering (Edited by Aggarwal and Reddy) you can find the prove that NMF (in a clustering task) works like k-means, only with the non-negative constrain. The fun part is that NMF may use a bunch of different distances and divergences. If you program python: scikit-learn 0.19 implements the beta divergence, which has a variable beta as a degree of liberty. Depending on the value of beta, the divergence has a different behavour. On beta equals 2, it assumes the behavior of the KL divergence.
This is actually very used in the topic model context, where people try to cluster documents/words over topics (or themes). By using KL, the results can be interpreted as a probabilistic function on how the word-topic and topic distributions are related.
You can find more information:
FÉVOTTE, C., IDIER, J. “Algorithms for Nonnegative Matrix
Factorization with the β-Divergence”, Neural Computation, v. 23, n.
9, pp. 2421– 2456, 2011. ISSN: 0899-7667. doi: 10.1162/NECO_a_00168.
Dis- ponível em: .
LUO, M., NIE, F., CHANG, X., et al. “Probabilistic Non-Negative
Matrix Factorization and Its Robust Extensions for Topic Modeling.”
In: AAAI, pp. 2308–2314, 2017.
KUANG, D., CHOO, J., PARK, H. “Nonnegative matrix factorization for
in- teractive topic modeling and document clustering”. In:
Partitional Clus- tering Algorithms, Springer, pp. 215–243, 2015.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
K-means is intended to work with Euclidean distance: if you want to use non-Euclidean similarities in clustering, you should use a different method. The most principled way to cluster with an arbitrary similarity metric is spectral clustering, and K-means can be derived as a variant of this where the similarities are the Euclidean distances.
And as #mitchus says, KL divergence is not a metric. You want the Jensen-Shannon divergence or its square root named as the Jensen-Shannon distance as it has symmetry.

Resources