Why is a Gaussian mixture model used? - machine-learning

Can anyone give a realistic example of what a Gaussian mixture model (GMM) is? Why do we go for a GMM, and how does it work?
I have read some material about this, but it was explained in a mathematical way. It says a GMM is used for heterogeneous populations, but I am not able to relate that to a realistic example.
Thanks for the help.

A GMM is used for data whose true distribution is well described, or at least well approximated, by a mixture of Gaussians.
For example, look at the distribution of people's heights: I believe it could be approximated by a GMM with 2 components, where the first component describes men's heights (for example, around 175 cm) and the second component describes women's heights (for example, around 160 cm).
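A minimal sketch of this idea in Python with scikit-learn; the component means and spreads below are made up purely for illustration:

```python
# A minimal sketch (not from the original answer): fit a 2-component GMM
# to synthetic height data. The means (175 cm / 160 cm) and standard
# deviations are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
heights = np.concatenate([
    rng.normal(175, 7, 5000),   # "men" component
    rng.normal(160, 6, 5000),   # "women" component
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(heights)
print("component means:", gmm.means_.ravel())
print("component weights:", gmm.weights_)
```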

Related

Machine Learning - SVM

If one trains a model using an SVM from kernel data, the resulting trained model contains support vectors. Now consider the case of training a new model using the old data already present plus a small amount of new data as well.
So:
Should the new data just be combined with the support vectors from the previously formed model to form the new training set? (If yes, then how do I combine the support vectors with the new graph data? I am working with libsvm.)
Or:
Should the new data and the complete old data be combined together to form the new training set, not just the support vectors?
Which approach is better for retraining, more doable and efficient in terms of accuracy and memory?
You must always retrain considering the entire, newly concatenated, training set.
The support vectors from the "old" model might not be support vectors anymore if some of the "new" points lie closer to the decision boundary. Behind the SVM there is an optimization problem that must be solved; keep that in mind. For a given training set, you find the optimal solution (i.e. the support vectors) for that training set. As soon as the dataset changes, that solution might not be optimal anymore.
SVM training is nothing more than a maximization problem in which the geometrical and functional margins are the objective function. It is like maximizing a given function f(x)... but then you change f(x): by adding/removing points from the training set you get a better/worse picture of the decision boundary, since the decision boundary is known only via sampling, and the samples are precisely the patterns in your training set.
I understand your concern about time and memory efficiency, but that is a common problem: training SVMs on so-called big data is still an open research topic (there are some hints regarding backpropagation training) because the optimization problem (and the heuristic for choosing which Lagrange multipliers to optimize pairwise) is not easy to parallelize/distribute across several workers.
LibSVM uses the well-known Sequential Minimal Optimization (SMO) algorithm for training the SVM: here you can find John Platt's paper on the SMO algorithm, if you need further information on the optimization problem behind the SVM.
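A minimal sketch of the recommended workflow, assuming placeholder arrays for the old and new data (scikit-learn's SVC is backed by libsvm):

```python
# A minimal sketch of the recommended approach: retrain on the full
# concatenated set rather than on old support vectors + new data.
# X_old/y_old and X_new/y_new are placeholder arrays standing in for
# your actual data.
import numpy as np
from sklearn.svm import SVC

X_old, y_old = np.random.randn(200, 5), np.random.randint(0, 2, 200)
X_new, y_new = np.random.randn(20, 5), np.random.randint(0, 2, 20)

X = np.vstack([X_old, X_new])
y = np.concatenate([y_old, y_new])

model = SVC(kernel="rbf", C=1.0).fit(X, y)
print("number of support vectors per class:", model.n_support_)
```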
Idea 1 has already been examined and assessed by the research community.
Anyone interested in the faster, smarter approach (1) -- re-use the support vectors and add the new data -- should review the research published by Dave Musicant and Olvi Mangasarian on their method, referred to as the "Active Support Vector Machine".
MATLAB implementation: available from http://research.cs.wisc.edu/dmi/asvm/
[1] O. L. Mangasarian, David R. Musicant; Active Support Vector Machine Classification; 1999.
[2] David R. Musicant, Alexander Feinberg; Active Set Support Vector Regression; IEEE Transactions on Neural Networks, Vol. 15, No. 2, March 2004.
This is a purely theoretical thought on your question. The idea is not bad, but it needs to be extended a bit. I'm looking here purely at the goal of sparsening the training data from the first batch.
The main problem -- which is why this is purely theoretical -- is that your data is typically not linearly separable. Then the misclassified points become very important, and they will spoil what I write below. Furthermore, the idea requires a linear kernel, though it might be possible to generalise it to other kernels.
To understand the problem with your approach, let's look at the following support vectors (x, y, class): (-1,1,+), (-1,-1,+), (1,0,-). The hyperplane is a vertical line going through zero. If your next batch contained the point (-1,-1.1,-), the max-margin hyperplane would tilt. This can be exploited for sparsening. You calculate the, so to say, minimal-margin hyperplane between the two pairs of support vectors ({(-1,1,+),(1,0,-)} and {(-1,-1,+),(1,0,-)}); in 2D there are only 2 pairs, while higher dimensions or a non-linear kernel may give more. This is basically the line going through those points. Afterwards you classify all data points, then add every point misclassified by either of these models, plus the support vectors, to the second batch. That's it. The remaining points can't be relevant.
Besides the C/nu (soft-margin) problem mentioned above, the curse of dimensionality will obviously kill you here.
An image to illustrate: red = support vectors from batch one, blue = a non-support vector from batch one, green = the new point from batch two.
Red line: the first hyperplane; green line: the minimal-margin hyperplane, which misclassifies the blue point; blue line: the new hyperplane (it's a hand fit ;) ).
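To make the toy example concrete, here is a small sketch (my own, not part of the answer) that refits a linear SVM before and after adding the new point and shows that the support vectors and the hyperplane change:

```python
# Fit a linear SVM on the three original points, then refit after adding
# the new point (-1, -1.1, -) and observe that the support vectors and
# the separating hyperplane change.
import numpy as np
from sklearn.svm import SVC

X_old = np.array([[-1, 1], [-1, -1], [1, 0]])
y_old = np.array([+1, +1, -1])

svm_old = SVC(kernel="linear", C=1e6).fit(X_old, y_old)   # hard-margin-like
print("old support vectors:\n", svm_old.support_vectors_)
print("old w, b:", svm_old.coef_, svm_old.intercept_)     # ~vertical boundary

X_new = np.vstack([X_old, [[-1, -1.1]]])
y_new = np.append(y_old, -1)

svm_new = SVC(kernel="linear", C=1e6).fit(X_new, y_new)
print("new support vectors:\n", svm_new.support_vectors_)
print("new w, b:", svm_new.coef_, svm_new.intercept_)     # tilted boundary
```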

What validation for outlier detection?

Another general question on data science!
Let's say I have a bunch of samples and I have to detect outliers on each sample. My data would be univariate, so I can use simple methods like standard deviation or median absolute deviation.
Now my question is: how would one do any sort of validation to see if results are coherent, especially if looking at them by eye wouldn't be an option because of the size of the data? For example to choose how many standard deviations to use to define outliers. I haven't seen any quantitative method so far. Does it even exist?
Cheers
Interestingly, you didn't define which dimension of "size of the data" you mean, and I think that matters here. For example, you can draw a Q-Q plot for high-dimensional data, but that is not so easy when you have a very large number of data points.
However, when looking for a general methodology, I would attack this problem from a probabilistic perspective. This will never tell you which data point is an outlier, but it will tell you the probability that you have an outlier (in certain areas of your data).
I have to make two assumptions: (a) you know the family of distributions your data stems from, e.g. normal or Poisson, and (b) you can estimate the parameters of this family from a data set.
Now you can define the null hypothesis (H0) that your data comes from this distribution, and the alternative hypothesis (H1) that it does not. If you draw a random sample from your estimated distribution, that drawn sample should on average be about as likely under the distribution as your observed sample. If it is not, outliers are likely present.
However, it is probably more interesting to find the sub-space that contains the outlier. This can be done with the following empirical procedure: estimate the parameters of your distribution from the data, then compare the estimated distribution with the histogram of the observed data. This gives you, for each bin of the histogram, a probability that it contains an outlier. For high-dimensional data this can be checked programmatically.
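As a concrete illustration of the univariate case in the question, here is a hedged Python sketch: flag outliers with the median absolute deviation, then sanity-check the chosen cut-off by comparing the observed tail fraction with what a fitted normal predicts. The 3-sigma threshold is an illustrative assumption, not a rule from the answer.

```python
# Flag outliers with the median absolute deviation (MAD), then compare the
# observed fraction beyond the cut-off with the tail mass a normal
# distribution would predict. The planted outliers are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 10_000), [8.0, -9.0, 12.0]])  # a few planted outliers

med = np.median(x)
mad = stats.median_abs_deviation(x, scale="normal")  # scaled to estimate sigma
z_robust = np.abs(x - med) / mad
outliers = z_robust > 3

expected = 2 * stats.norm.sf(3)   # tail mass beyond 3 sigma under a normal
observed = outliers.mean()
print(f"expected beyond 3 sigma: {expected:.4%}, observed: {observed:.4%}")
```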

What is the main difference between linear discriminant analysis and principal component analysis

"The Principal Component Analysis (PCA), which is the core of the Eigenfaces method, finds a linear combination of features that maximizes the total variance in data. While this is clearly a powerful way to represent data, it doesn’t consider any classes and so a lot of discriminative information may be lost when throwing components away." (Open CV)
What is meant by "classes" here?
"Linear Discriminant Analysis maximizes the ratio of between-classes to within-classes scatter, instead of maximizing the overall scatter. The idea is simple: same classes should cluster tightly together, while different classes are as far away as possible from each other in the lower-dimensional representation."
Here too, what is meant by "classes"?
Can someone please explain this from an image-processing point of view? Thanks.
Classes in these contexts means groups or classifications, like 'faces' or 'letters': things that have a set of geometric properties that can be identified with some degree of generality. PCA tries to classify objects in an image by themselves, while LDA tries to classify things with some consideration of how many of the same thing they are near.
An example might be a picture of the ball "Wilson". By itself it doesn't look much like a face, and PCA would give it a low likelihood of being a face; but an LDA approach, if the picture included Tom Hanks right next to it, would classify Tom Hanks as having a face and make Wilson more likely to be classified as a face as well. As you can see from this contrived example, depending on what you are trying to achieve (and how good your data is), each approach has its upsides and downsides.
To make it simple: PCA tries to represent the total data in the minimum number of dimensions. LDA tries to do the same, but it also makes sure that the different classes can be differentiated (classification). PCA does not help with classification; it helps only with dimensionality reduction. So, roughly, LDA = PCA + classification.
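A minimal scikit-learn sketch of the difference, using the built-in iris data purely for illustration: PCA never sees the labels, while LDA needs them.

```python
# PCA ignores the class labels; LDA uses them to find directions that
# separate the classes. The iris data is used only as a convenient example.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                            # unsupervised: no y
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised: needs y

print("PCA projection shape:", X_pca.shape)
print("LDA projection shape:", X_lda.shape)
```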

Gaussian Weighting for point re-distribution

I am working with some points which are packed very closely together, and therefore forming clusters among them is proving very difficult. Since I am new to this concept, I read in a paper about Gaussian weighting the points randomly, or rather resampling using Gaussian weights.
My question is: how are Gaussian weights applied to the data points? Is it the actual normal distribution, where I have to compute the mean, variance, and standard deviation and then randomly sample, or are there other ways to do it? I am confused about this concept.
Can I get some hints on the concept, please?
I think you should look at this book:
http://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738
There are good chapters on modelling point distributions.
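Since the paper's exact scheme isn't quoted in the question, here is one common interpretation of "resampling with Gaussian weights", sketched in Python; treat the weighting choice (a normal density fitted to the data) as an assumption on my part:

```python
# Weight each point by a normal density centred on the data mean,
# normalise the weights into probabilities, and resample points with
# those probabilities. The 1-D uniform data is a placeholder.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
points = rng.uniform(-3, 3, size=500)           # placeholder 1-D data

mu, sigma = points.mean(), points.std()
weights = stats.norm.pdf(points, loc=mu, scale=sigma)
weights /= weights.sum()                         # normalise to a probability vector

resampled = rng.choice(points, size=len(points), replace=True, p=weights)
print("original std:", points.std(), "resampled std:", resampled.std())
```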

ML / density clustering on house areas. two-component or more mixtures in each dimension

I am trying to self-learn ML and came across this problem. Help from more experienced people in the field would be much appreciated!
Suppose I have three vectors with areas of house compartments such as bathroom, living room and kitchen. The data consists of about 70,000 houses. A histogram of each individual vector clearly shows evidence of a bimodal distribution, say a two-component Gaussian mixture. I now want some sort of ML algorithm, preferably unsupervised, that would classify houses according to these attributes, say: large bathroom, small kitchen, large living room.
More specifically, I would like an algorithm to choose the best possible separation threshold for each bimodal vector, say large/small kitchen (this can be binary, since we assume evidence of bimodality), do the same for the others, and cluster the data. Ideally this would come with some confidence measure, so that I could check houses in the intermediate regimes: for instance, a house with clearly a large kitchen but whose bathroom area falls close to the large/small threshold would be put, for example, at the bottom of a list of houses with "large kitchens and large bathrooms". For this reason, first deciding on a threshold (fitting the Gaussians with the lowest possible false discovery rate), collapsing the data, and then clustering would not be desirable.
Any advice on how to proceed? I know R and python.
Many thanks!!
What you're looking for is a clustering method: this is basically unsupervised classification. A simple method is k-means, which has many implementations (k-means can be viewed as the limit of a multi-variate Gaussian mixture as the variance tends to zero). This would naturally give you a confidence measure, which would be related to the distance metric (Euclidean distance) between the point in question and the centroids.
One final note: I don't know about clustering each attribute in turn, and then making composites from the independent attributes: why not let the algorithm find the clusters in multi-dimensional space? Depending on the choice of algorithm, this will take into account covariance in the features (big kitchen increases the probability of big bedroom) and produce natural groupings you might not consider in isolation.
Sounds like you want EM clustering with a mixture of Gaussians model.
Should be in the mclust package in R.
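For completeness, here is a small Python sketch of the EM/GMM route (scikit-learn's GaussianMixture plays a role similar to mclust in R); the two-column array is made-up data standing in for kitchen and bathroom areas, and the posterior probabilities serve as the confidence measure the question asks for:

```python
# EM clustering with a mixture of Gaussians. The synthetic two-column
# array stands in for kitchen and bathroom areas; predict_proba gives
# the posterior probability of each cluster assignment.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
areas = np.vstack([
    rng.normal([8, 4], [1.5, 0.8], size=(1000, 2)),    # "small" houses
    rng.normal([15, 8], [2.0, 1.2], size=(1000, 2)),   # "large" houses
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(areas)
labels = gmm.predict(areas)
confidence = gmm.predict_proba(areas).max(axis=1)   # posterior prob. of assigned cluster

print("cluster means:\n", gmm.means_)
print("least confident assignments:", np.argsort(confidence)[:5])
```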
In addition to what the others have suggested, it is indeed possible to cluster on the individual dimensions (even with density-based clustering methods such as DBSCAN), forming one-dimensional clusters (intervals) and working from there, possibly combining them into multi-dimensional, rectangular-shaped clusters.
I am doing a project involving exactly this. It turns out there are a few advantages to running density-based methods in one dimension, including the fact that you can do what you are saying about classifying objects on the border of one attribute according to their other attributes.
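As an illustration of the per-dimension idea (my own sketch, with made-up data and untuned parameters), here is DBSCAN run on a single attribute to recover one-dimensional clusters, i.e. intervals:

```python
# Run DBSCAN on a single attribute (kitchen area here) to find
# one-dimensional clusters (intervals); eps and min_samples are
# illustrative guesses, not tuned values.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
kitchen_area = np.concatenate([rng.normal(5, 1.0, 1000), rng.normal(15, 1.5, 1000)])

db = DBSCAN(eps=0.5, min_samples=20).fit(kitchen_area.reshape(-1, 1))
for label in sorted(set(db.labels_)):
    members = kitchen_area[db.labels_ == label]
    name = "noise" if label == -1 else f"cluster {label}"
    print(f"{name}: {len(members)} points, range [{members.min():.1f}, {members.max():.1f}]")
```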
