How do I cluster with KL-divergence?

How do I cluster with KL-divergence? - machine-learning

I want to cluster my data with KL-divergence as my metric.
In K-means:
Choose the number of clusters.
Initialize each cluster's mean at random.
Assign each data point to a cluster c with minimal distance value.
Update each cluster's mean to that of the data points assigned to it.
In the Euclidean case it's easy to update the mean, just by averaging each vector.
However, if I'd like to use KL-divergence as my metric, how do I update my mean?

Clustering with KL-divergence may not be the best idea, because KLD is missing an important property of metrics: symmetry. Obtained clusters could then be quite hard to interpret. If you want to go ahead with KLD, you could use as distance the average of KLD's i.e.
d(x,y) = KLD(x,y)/2 + KLD(y,x)/2

It is not a good idea to use KLD for two reasons:-
It is not symmetry KLD(x,y) ~= KLD(y,x)
You need to be careful when using KLD in programming: the division may lead to Inf values and NAN as a result.
Adding a small number may affect the accuracy.

Well, it might not be a good idea use KL in the "k-means framework". As it was said, it is not symmetric and K-Means is intended to work on the euclidean space.
However, you can try using NMF (non-negative matrix factorization). In fact, in the book Data Clustering (Edited by Aggarwal and Reddy) you can find the prove that NMF (in a clustering task) works like k-means, only with the non-negative constrain. The fun part is that NMF may use a bunch of different distances and divergences. If you program python: scikit-learn 0.19 implements the beta divergence, which has a variable beta as a degree of liberty. Depending on the value of beta, the divergence has a different behavour. On beta equals 2, it assumes the behavior of the KL divergence.
This is actually very used in the topic model context, where people try to cluster documents/words over topics (or themes). By using KL, the results can be interpreted as a probabilistic function on how the word-topic and topic distributions are related.
You can find more information:
FÉVOTTE, C., IDIER, J. “Algorithms for Nonnegative Matrix
Factorization with the β-Divergence”, Neural Computation, v. 23, n.
9, pp. 2421– 2456, 2011. ISSN: 0899-7667. doi: 10.1162/NECO_a_00168.
Dis- ponível em: .
LUO, M., NIE, F., CHANG, X., et al. “Probabilistic Non-Negative
Matrix Factorization and Its Robust Extensions for Topic Modeling.”
In: AAAI, pp. 2308–2314, 2017.
KUANG, D., CHOO, J., PARK, H. “Nonnegative matrix factorization for
in- teractive topic modeling and document clustering”. In:
Partitional Clus- tering Algorithms, Springer, pp. 215–243, 2015.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html

K-means is intended to work with Euclidean distance: if you want to use non-Euclidean similarities in clustering, you should use a different method. The most principled way to cluster with an arbitrary similarity metric is spectral clustering, and K-means can be derived as a variant of this where the similarities are the Euclidean distances.
And as #mitchus says, KL divergence is not a metric. You want the Jensen-Shannon divergence or its square root named as the Jensen-Shannon distance as it has symmetry.

Related

In DBSCAN, how to determine border points?

In DBSCAN, the core points is defined as having more than MinPts within Eps.
So if MinPts = 4, a points with total 5 points in Eps is definitely a core point.
How about a point with 4 points (including itself) in Eps? Is it a core point, or a border point?

Border points are points that are (in DBSCAN) part of a cluster, but not dense themselves (i.e. every cluster member that is not a core point).
In the followup algorithm HDBSCAN, the concept of border points was discarded.
Campello, R. J. G. B.; Moulavi, D.; Sander, J. (2013).
Density-Based Clustering Based on Hierarchical Density Estimates.
Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD 2013. Lecture Notes in Computer Science 7819. p. 160.
doi:10.1007/978-3-642-37456-2_14
which states:
Our new definitions are more consistent with a statistical interpretation of clusters as connected components of a level set of a density [...] border objects do not technically belong to the level set (their estimated density is below the threshold).

Actually, I just re-read the original paper and Definition 1 makes it look like the core point belongs to its own eps neighborhood. So if minPts is 4, then a point needs at least 3 others in its eps neighborhood.
Notice in Definition 1 that they say NEps(p) = {q ∈D | dist(p,q) ≤ Eps}. If the point were excluded from its eps neighborhood, then it would have said NEps(p) = {q ∈D | dist(p,q) ≤ Eps and p != q}. Where != is "not equal to".
This is also reinforced by the authors of DBSCAN in their OPTICS paper in Figure 4. http://fogo.dbs.ifi.lmu.de/Publikationen/Papers/OPTICS.pdf
So I think the SciKit interpretation is correct and the Wikipedia illustration is misleading in http://en.wikipedia.org/wiki/DBSCAN

This largely depends on the implementation. The best way is to just play with the implementation yourself.
In the original DBSCAN1 paper, core point condition is given as N_Eps>=MinPts, where N_Eps is the Epsilon neighborhood of a certain data point, which is excluded from its own N_Eps.
Following your example, if MinPts = 4 and N_Eps = 3 (or 4 including itself as you say), then they don't form a cluster according to the original paper. On the other hand, the scikit-learn2 implementation of DBSCAN works otherwise, meaning it counts the point itself for forming a group. So for MinPts=4, four points are needed in total to form a cluster.
[1] Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). "A density-based algorithm for discovering clusters in large spatial databases with noise."
[2] http://scikit-learn.org

How to normalize tf-idf vectors for SVMs?

I am using Support Vector Machines for document classification. My feature set for each document is a tf-idf vector. I have M documents with each tf-idf vector of size N.
Giving M * N matrix.
The size of M is just 10 documents and tf-idf vector is 1000 word vector. So my features are much larger than number of documents. Also each word occurs in either 2 or 3 documents. When i am normalizing each feature ( word ) i.e. column normalization in [0,1] with
val_feature_j_row_i = ( val_feature_j_row_i - min_feature_j ) / ( max_feature_j - min_feature_j)
It either gives me 0, 1 of course.
And it gives me bad results. I am using libsvm, with rbf function C = 0.0312, gamma = 0.007815
Any recommendations ?
Should i include more documents ? or other functions like sigmoid or better normalization methods ?

The list of things to consider and correct is quite long, so first of all I would recommend some machine-learning reading before trying to face the problem itself. There are dozens of great books (like ie. Haykin's "Neural Networks and Learning Machines") as well as online courses, which will help you with such basics, like those listed here: http://www.class-central.com/search?q=machine+learning .
Getting back to the problem itself:
10 documents is rows of magnitude to small to get any significant results and/or insights into the problem,
there is no universal method of data preprocessing, you have to analyze it through numerous tests and data analytics,
SVMs are parametrical models, you cannot use a single C and gamma values and expect any reasonable results. You have to check dozens of them to even get a clue "where to search". The most simple method for doing so is so called grid search,
1000 of features is a great number of dimensions, this suggest that using a kernel, which implies infinitely dimensional feature space is quite... redundant - it would be a better idea to first analyze simplier ones, which have smaller chance to overfit (linear or low degree polynomial)
finally is tf*idf a good choice if "each word occurs in 2 or 3 documents"? It can be doubtfull, unless what you actually mean is 20-30% of documents
finally why is simple features squashing
It either gives me 0, 1 of course.
it should result in values in [0,1] interval, not just its limits. So if this is a case you are probably having some error in your implementation.

Restricted Boltzmann Machine for real-valued data - gaussian linear units (glu) -

I want my Restricted Boltzmann Machine to learn a new representation of real-valued data (see: Hinton - 2010 - A Practical Guide to Training RBMs). I'm struggling with an implementation of Gaussian linear units.
With Gaussian linear units in the visible layer the energy changes to E(v,h)= ∑ (v-a)²/2σ - ∑ bh - ∑v/σ h w. Now I don't know how to change the Contrastive Divergence Learning Algorithm. The visible units won't be sampled any more as they are linear. I use the expectation (mean-fied activation) p(v_i=1|h)= a +∑hw + N(0,1) as their state. The associations are left unchangend ( pos: data*p(h=1|v)' neg: p(v=1|h)*p(h=1|v)' ). But this only leads to random noise when I want to reconstruct the data. The error rate will stop improving around 50%.
Finally I want to use Gaussian linear units in both layers. How will I get the states of the hidden units then? I suggest by using the mean-field activation p(h_i=1|v)= b +∑vw + N(0,1) but I'm not sure.

You could take a look at the gaussian RBM that Hinton himself has provided
Please find it here.
http://www.cs.toronto.edu/~hinton/code/rbmhidlinear.m

I have been working on a similar project implementing an RBM with c++ and matlab mexfunction.
i have found from the implementation of Professor Hinton (in matlab) that the the binary activation for visible units uses the simoid function.
ie : vs = sigmoid(bsxfun(#plus, hs*obj.W2', obj.b));
But when implementing the gaussian visible units RBM you simply have to sample the visible units without using the simoid.
ie: vs = bsxfun(#plus, h0*obj.W2', obj.b);
or better look at the Professor Hinton implementation (see: http://www.cs.toronto.edu/~hinton/code/rbmhidlinear.m)

Which of the parameters in LibSVM is the slack variable?

I am a bit confused about the namings in the SVM. I am using this library LibSVM. There are so many parameters that can be set. Does anyone know which of these is the slack variable?
thx

The "slack variable" is C in c-svm and nu in nu-SVM. These both serve the same function in their respective formulations - controlling the tradeoff between a wide margin and classifier error. In the case of C, one generally test it in orders of magnitude, say 10^-4, 10^-3, 10^-2,... to 1, 5 or so. nu is a number between 0 and 1, generally from .1 to .8, which controls the ratio of support vectors to data points. When nu is .1, the margin is small, the number of support vectors will be a small percentage of the number of data points. When nu is .8, the margin is very large and most of the points will fall in the margin.
The other things to consider are your choice of kernel (linear, RBF, sigmoid, polynomial) and the parameters for the chosen kernel. Generally one has to do a lot of experimenting to find the best combination of parameters. However, be careful of over-fitting to your dataset.
Burges wrote a great tutorial: A Tutorial on Support Vector Machines for Pattern
Recognition
But if you mostly just want to know how to USE it and less about how it works, read "A Practical Guide to Support Vector Classication" by Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin (authors of libsvm)

First decide which type of SVM are u intending to use: C-SVC, nu-SVC , epsilon-SVR or nu-SVR. In my opinion u need to vary C and gamma most of the time... the rest are usually fixed..

Recommended anomaly detection technique for simple, one-dimensional scenario?

I have a scenario where I have several thousand instances of data. The data itself is represented as a single integer value. I want to be able to detect when an instance is an extreme outlier.
For example, with the following example data:
a = 10
b = 14
c = 25
d = 467
e = 12
d is clearly an anomaly, and I would want to perform a specific action based on this.
I was tempted to just try an use my knowledge of the particular domain to detect anomalies. For instance, figure out a distance from the mean value that is useful, and check for that, based on heuristics. However, I think it's probably better if I investigate more general, robust anomaly detection techniques, which have some theory behind them.
Since my working knowledge of mathematics is limited, I'm hoping to find a technique which is simple, such as using standard deviation. Hopefully the single-dimensioned nature of the data will make this quite a common problem, but if more information for the scenario is required please leave a comment and I will give more info.
Edit: thought I'd add more information about the data and what I've tried in case it makes one answer more correct than another.
The values are all positive and non-zero. I expect that the values will form a normal distribution. This expectation is based on an intuition of the domain rather than through analysis, if this is not a bad thing to assume, please let me know. In terms of clustering, unless there's also standard algorithms to choose a k-value, I would find it hard to provide this value to a k-Means algorithm.
The action I want to take for an outlier/anomaly is to present it to the user, and recommend that the data point is basically removed from the data set (I won't get in to how they would do that, but it makes sense for my domain), thus it will not be used as input to another function.
So far I have tried three-sigma, and the IQR outlier test on my limited data set. IQR flags values which are not extreme enough, three-sigma points out instances which better fit with my intuition of the domain.
Information on algorithms, techniques or links to resources to learn about this specific scenario are valid and welcome answers.
What is a recommended anomaly detection technique for simple, one-dimensional data?

Check out the three-sigma rule:
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
An alternative method is the IQR outlier test:
Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
IF (x < Q25 - 1.5*IQR) OR (Q75 + 1.5*IQR < x) THEN x is a mild outlier
IF (x < Q25 - 3.0*IQR) OR (Q75 + 3.0*IQR < x) THEN x is an extreme outlier
this test is usually employed by Box plots (indicated by the whiskers):
EDIT:
For your case (simple 1D univariate data), I think my first answer is well suited.
That however isn't applicable to multivariate data.
#smaclell suggested using K-means to find the outliers. Beside the fact that it is mainly a clustering algorithm (not really an outlier detection technique), the problem with k-means is that it requires knowing in advance a good value for the number of clusters K.
A better suited technique is the DBSCAN: a density-based clustering algorithm. Basically it grows regions with sufficiently high density into clusters which will be maximal set of density-connected points.
DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighbor points within distance epsilon of the starting point.
If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.
If the number of neighbors is less than minPoints, the point is marked as noise.
If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points until they are depleted.
Finally the set of all points marked as noise are considered outliers.

There are a variety of clustering techniques you could use to try to identify central tendencies within your data. One such algorithm we used heavily in my pattern recognition course was K-Means. This would allow you to identify whether there are more than one related sets of data, such as a bimodal distribution. This does require you having some knowledge of how many clusters to expect but is fairly efficient and easy to implement.
After you have the means you could then try to find out if any point is far from any of the means. You can define 'far' however you want but I would recommend the suggestions by #Amro as a good starting point.
For a more in-depth discussion of clustering algorithms refer to the wikipedia entry on clustering.

This is an old topic but still it lacks some information.
Evidently, this can be seen as a case of univariate outlier detection. The approaches presented above have several pros and cons. Here are some weak spots:
Detection of outliers with the mean and sigma has the obvious disadvantage of dependence of mean and sigma on the outliers themselves.
The case of the small sample limit (see question for example) is not adequately covered by, 3 sigma, K-Means, IQR etc.
And I could go on... However the statistical literature offers a simple metric: the median absolute deviation. (Medians are insensitive to outliers)
Details can be found here: https://www.sciencedirect.com/book/9780128047330/introduction-to-robust-estimation-and-hypothesis-testing
I think this problem can be solved in a few lines of python code like this:
import numpy as np
import scipy.stats as sts
x = np.array([10, 14, 25, 467, 12]) # your values
np.abs(x - np.median(x))/(sts.median_abs_deviation(x)/0.6745) #MAD criterion
Subsequently you reject values above a certain threshold (97.5 percentile of the distribution of data), in case of an assumed normal distribution the threshold is 2.24. Here it translates to:
array([ 0.6745 , 0. , 1.854875, 76.387125, 0.33725 ])
or the 467 entry being rejected.
Of course, one could argue, that the MAD (as presented) also assumes a normal dist. Therefore, why is it that argument 2 above (small sample) does not apply here? The answer is that MAD has a very high breakdown point. It is easy to choose different threshold points from different distributions and come to the same conclusion: 467 is the outlier.

Both three-sigma rule and IQR test are often used, and there are a couple of simple algorithms to detect anomalies.
The three-sigma rule is correct
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
The IQR test should be:
Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
If x > Q75 + 1.5 * IQR or x < Q25 - 1.5 * IQR THEN x is a mild outlier
If x > Q75 + 3.0 * IQR or x < Q25 – 3.0 * IQR THEN x is a extreme outlier

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart