Assumptions when using scikit-learn DPGMM - machine-learning

I have been using scikit-learn's Dirichlet process Gaussian mixture model to cluster my dataset, following this excellent tutorial: http://blog.echen.me/2012/03/20/infinite-mixture-models-with-nonparametric-bayes-and-the-dirichlet-process/
In the end, the author uses a dataset that clusters food items by their nutritional values (e.g. total fat, vitamin D, vitamin C) as features. Before running the algorithm, the author normalizes these features. What is the importance of this normalization? Does every feature in the dataset need to follow a Gaussian distribution? Is that an underlying assumption?
Any help would be appreciated. Thanks!

A Dirichlet process Gaussian mixture model is the infinite limit of a finite Gaussian mixture model, and therefore assumes a Gaussian distribution within each cluster of your data; recall the generative process of a Gaussian mixture model, in which each observation is drawn from the Gaussian of its component. The formulation of a Dirichlet process mixture model itself, however, is independent of the observation distribution.
Normalisation of the data, e.g. by z-score, is not necessary if you parametrize the base distribution of the model properly. If that is not possible in the implementation you use, then normalisation is required.
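As a sketch of where the normalisation fits in (assuming a current scikit-learn, where the old DPGMM class has been replaced by BayesianGaussianMixture with a Dirichlet-process prior; the data here are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(0)
# Toy data: two blobs whose features live on very different scales,
# like "total fat" (grams) vs "vitamin D" (micrograms)
X = np.vstack([
    rng.normal([0, 0], [1, 100], size=(200, 2)),
    rng.normal([5, 500], [1, 100], size=(200, 2)),
])

# Z-score each feature so no single feature dominates the covariances
X_scaled = StandardScaler().fit_transform(X)

# In current scikit-learn, the Dirichlet process GMM is available as
# BayesianGaussianMixture with a Dirichlet-process weight prior
dpgmm = BayesianGaussianMixture(
    n_components=10,  # an upper bound; unused components get tiny weights
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X_scaled)

labels = dpgmm.predict(X_scaled)
```

With normalised inputs, the default base distribution is a reasonable fit; without it, you would need to tune the prior parameters to the raw feature scales instead.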

Related

Use categorical data as feature/target without encoding it

I recently found a model that classifies Iris flowers based on the sizes of their petals and sepals. There are 3 types of flowers as the target (dependent variable). As I understand it, categorical data should be encoded so that it can be used in machine learning. However, in this model the data is used directly, without an encoding step.
Can anyone explain when encoding is needed? Thank you in advance!
This is a relevant question - the encoding of feature variables.
Originally, the Iris data were published by Fisher when he introduced his linear discriminant classifier.
Generally, a distinction is made between:
Real-value classifiers
Discrete feature classifiers
Linear discriminant analysis and quadratic discriminant analysis are real-value classifiers, and simply adding discrete variables as extra inputs does not work. Special procedures have been developed for working with indicator variables (the name used in statistics) in discriminant analysis. The k-nearest-neighbour classifier also really only works well with real-valued feature variables.
The naive Bayes classifier is most commonly used for classification problems with discrete features. When you don't want to assume conditional independence between the feature variables, a multinomial classifier can be applied to discrete features.
Neural networks and support vector machines can combine real-valued and discrete features. My advice is to use one separate input node for each discrete outcome rather than a single input node with values like (0: small, 1: minor, 2: medium, 3: large, 4: big). One-node-per-outcome encoding will improve your training result and yield better test-set performance.
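A minimal sketch of the one-node-per-outcome advice, using pandas on a made-up discrete feature:

```python
import pandas as pd

# Hypothetical discrete feature with five ordered-looking labels
df = pd.DataFrame(
    {"size": ["small", "minor", "medium", "large", "big", "small"]}
)

# One input node per outcome: each category becomes its own 0/1 column,
# instead of a single column holding the values 0..4
encoded = pd.get_dummies(df, columns=["size"])
```

Each row of `encoded` has exactly one active column, so the classifier never sees a spurious ordering or distance between categories.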
The random forest classifier also combines real-valued and discrete features seamlessly.
My final advice is to train and compare, on a test set, at least 4 different types of classifiers, as there is no such thing as a universally best type of classifier.

Unsupervised Naive Bayes - how does it work?

So as I understand it, to implement an unsupervised Naive Bayes, we assign a random class probability to each instance, then run it through the normal Naive Bayes algorithm. I understand that, with each iteration, the random estimates get better, but I can't for the life of me figure out exactly how that works.
Anyone care to shed some light on the matter?
The variant of Naive Bayes I've seen in unsupervised learning is basically a Gaussian mixture model (GMM), fit with the Expectation-Maximization (EM) algorithm, used to determine the clusters in the data.
In this setting, it is assumed that the data can be classified, but the classes are hidden. The problem is to determine the most probable classes by fitting a Gaussian distribution per class. The Naive Bayes assumption defines the particular probabilistic model to use, in which the attributes are conditionally independent given the class.
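A minimal sketch of this setting with scikit-learn (GaussianMixture as the EM implementation; `covariance_type="diag"` encodes the naive Bayes independence assumption, and the data are synthetic):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Unlabeled data drawn from two hidden classes
X = np.vstack([
    rng.normal(0, 1, size=(150, 3)),
    rng.normal(4, 1, size=(150, 3)),
])

# Diagonal covariances mean each feature is modeled independently
# given the (hidden) class - exactly the naive Bayes assumption
gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(X)

# EM alternates: the E-step assigns soft class probabilities to each
# instance, the M-step refits each class's Gaussian from those soft
# assignments - which is why the initially random estimates improve
posteriors = gmm.predict_proba(X)
```

The `posteriors` array holds, for each instance, its probability of belonging to each hidden class, which is the quantity the iterations refine.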
From the paper "Unsupervised naive Bayes for data clustering with mixtures of truncated exponentials" by Jose A. Gamez:

From the previous setting, probabilistic model-based clustering is modeled as a mixture of models (see e.g. (Duda et al., 2001)), where the states of the hidden class variable correspond to the components of the mixture (the number of clusters), and the multinomial distribution is used to model discrete variables while the Gaussian distribution is used to model numeric variables. In this way we move to a problem of learning from unlabeled data and usually the EM algorithm (Dempster et al., 1977) is used to carry out the learning task when the graphical structure is fixed and structural EM (Friedman, 1998) when the graphical structure also has to be discovered (Pena et al., 2000). In this paper we focus on the simplest model with fixed structure, the so-called Naive Bayes structure (fig. 1) where the class is the only root variable and all the attributes are conditionally independent given the class.
See also this discussion on CV.SE.

What does the phrase "a machine learning algorithm learns a probability distribution" mean? What exactly is happening here

Generative and discriminative models are said to learn the joint distribution P(x, y) and the conditional distribution P(y|x), respectively. But at a fundamental level I fail to convince myself of what it means for a probability distribution to be learnt.
It means that your model is either functioning as an estimator for the distribution from which your training samples were drawn, or is utilizing that estimator to perform some other prediction.
To give a trivial example, consider a set of observations {x[1], ..., x[N]}, and say you want to fit a Gaussian estimator to them. From these samples, the natural parameter estimates for this Gaussian estimator are the sample mean and the (unbiased) sample variance of the data:
Mean = 1/N * (x[1] + ... + x[N])
Variance = 1/(N-1) * ((x[1] - Mean)^2 + ... + (x[N] - Mean)^2)
Now you have a model capable of generating new samples from (an estimate of) the distribution your training sample was drawn from.
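The two formulas above can be sketched in NumPy on synthetic data (the "true" parameters 10 and 2 are made up for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(10.0, 2.0, size=1000)  # "training" observations

# Fit the Gaussian estimator: sample mean and unbiased sample variance
mean = x.sum() / len(x)
variance = ((x - mean) ** 2).sum() / (len(x) - 1)

# The fitted model can now generate new samples from (an estimate of)
# the distribution the training data were drawn from
new_samples = rng.normal(mean, np.sqrt(variance), size=5)
```

With enough data, `mean` and `variance` land close to the true parameters, which is what "learning the distribution" amounts to here.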
Going a little further, you could consider something like a Gaussian mixture model. This similarly infers the best-fitting parameters of a model given your data, except that this time the model is composed of multiple Gaussians. As a result, if you are given some test data, you can probabilistically assign classes to each of those samples, based on the relative contribution of each Gaussian component to the probability density at the observed points. This of course rests on the fundamental assumption of machine learning: that your training and test data are drawn from the same distribution (something you ought to check).

Suggested unsupervised feature selection / extraction method for 2 class classification?

I've got a set of F features, e.g. Lab color space and entropy. By concatenating all features together, I obtain a feature vector of dimension d (between 12 and 50, depending on which features are selected).
I usually get between 1000 and 5000 new samples, denoted x. A Gaussian mixture model is then trained on these vectors, but I don't know which class each vector comes from. What I do know is that there are only 2 classes. From the GMM prediction I get the probability of a feature vector belonging to class 1 or class 2.
My question now is: how do I obtain the best subset of features, for instance only entropy and normalized RGB, that gives the best classification accuracy? I guess this is achieved if the feature subset increases the class separability.
Maybe I can use Fisher's linear discriminant analysis, since I already have the means and covariance matrices from the GMM? But wouldn't I then have to calculate the score for every combination of features?
It would be nice to hear whether this is an unrewarding approach and I'm on the wrong track, and/or to get any other suggestions.
One way of finding "informative" features is to choose the subset that maximises the held-out log likelihood. You could do this with cross-validation.
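A sketch of that idea (a hypothetical search over equal-sized feature subsets, scored by held-out GMM log-likelihood; keeping the subset size fixed keeps the likelihoods on a comparable scale, since density values change with dimension):

```python
import numpy as np
from itertools import combinations
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Toy data: 4 features, of which only the first two carry cluster structure
X = np.hstack([
    np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))]),
    rng.normal(0, 1, (200, 2)),  # uninformative noise features
])
train, test = X[::2], X[1::2]  # a simple holdout split

best_score, best_subset = -np.inf, None
for subset in combinations(range(X.shape[1]), 2):  # all 2-feature subsets
    cols = list(subset)
    gmm = GaussianMixture(n_components=2, random_state=0)
    gmm.fit(train[:, cols])
    score = gmm.score(test[:, cols])  # mean held-out log-likelihood
    if score > best_score:
        best_score, best_subset = score, subset
```

Note that held-out likelihood rewards features the GMM can model well, which is not always the same as features that separate the classes, so it is worth sanity-checking the winning subset against downstream accuracy.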
https://www.cs.cmu.edu/~kdeng/thesis/feature.pdf
Another idea is to use an unsupervised algorithm that selects features automatically, such as a clustering forest:
http://research.microsoft.com/pubs/155552/decisionForests_MSR_TR_2011_114.pdf
In that case the clustering algorithm automatically splits the data based on information gain.
Fisher's LDA will not select features but project your original data into a lower-dimensional subspace. If you are looking into subspace methods, other interesting approaches are spectral clustering, which also operates in a subspace, and unsupervised neural networks such as autoencoders.

Hidden Markov Model Tools : Jahmm

I am a novice to machine learning, I have read about the HMM but I still have a few questions:
When applying an HMM for machine learning, how can the initial, emission, and transition probabilities be obtained?
Currently I have a set of values (the angles of a hand, which I would like to classify via an HMM); what should my first step be?
I know that there are three classical problems for an HMM (solved by the Forward-Backward, Baum-Welch, and Viterbi algorithms), but what should I do with my data?
In the literature that I have read, I never encountered the use of distribution functions within an HMM, yet the constructor that Jahmm uses for an HMM takes:
number of states
Probability Distribution Function factory
Constructor Description:
Creates a new HMM. Each state has the same pi value and the transition probabilities are all equal.
Parameters:
nbStates The (strictly positive) number of states of the HMM.
opdfFactory A pdf generator that is used to build the pdfs associated to each state.
What is this used for? And how can I use it?
Thank you
You have to somehow model and learn the initial, emission, and transition probabilities such that they represent your data.
In the case of discrete distributions and not too many variables/states, you can obtain them from maximum-likelihood fitting, or you can train a discriminative classifier that gives probability estimates, like random forests or naive Bayes. For continuous distributions, have a look at Gaussian processes or any other regression method, like Gaussian mixture models or regression forests.
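As a sketch of the maximum-likelihood route for the discrete case (made-up labeled state sequences; when the states are hidden you would instead run Baum-Welch, which is what Jahmm's learners implement):

```python
import numpy as np

# Hypothetical sequences where the state at each step is known,
# e.g. hand poses labeled over time
sequences = [
    [0, 0, 1, 2, 2, 1],
    [1, 2, 2, 0, 0, 0],
]
n_states = 3

# Maximum likelihood for the transition matrix is just
# normalised transition counts
counts = np.zeros((n_states, n_states))
for seq in sequences:
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1

# Each row becomes a probability distribution over next states
transition = counts / counts.sum(axis=1, keepdims=True)
```

The initial and emission probabilities can be estimated the same way, from counts of starting states and of observations per state.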
Your second and third questions are too general and fuzzy to be answered here. I refer you to the following books: "Pattern Recognition and Machine Learning" by Bishop and "Probabilistic Graphical Models" by Koller and Friedman.