Estimating cluster assignment using Stan - MCMC

I am trying to fit a finite Gaussian mixture model with unknown means and covariances using Stan. I am aware that, because HMC cannot sample discrete parameters, mixture parameters are usually inferred in Stan by marginalizing out the discrete assignments. However, for my application I need the per-observation cluster assignments. What is the best way to infer them in Stan? Suggestions would be appreciated.

Chapter 13 of the Stan User Manual discusses this in some detail. In short, you can calculate (in the generated quantities block) the posterior probability that an observation falls in each of a finite number of categories and then use that vector of probabilities to draw a category realization from the categorical distribution.
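For illustration, here is a rough Python sketch of the same computation done as post-processing of posterior draws rather than inside Stan itself (the arrays theta, mu and sigma below are fabricated stand-ins for whatever your fitted model produces); the manual's generated-quantities version computes exactly these per-observation membership probabilities and categorical draws within the Stan program.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Placeholder posterior draws for a K-component 1-D Gaussian mixture
# (in practice these would be extracted from your fitted Stan model).
n_draws, K = 1000, 2
theta = rng.dirichlet([5.0, 5.0], size=n_draws)               # mixture weights, (n_draws, K)
mu = rng.normal([-2.0, 2.0], 0.1, size=(n_draws, K))           # component means
sigma = np.abs(rng.normal(1.0, 0.05, size=(n_draws, K)))       # component sds

y = np.array([-1.8, 0.1, 2.3])                                 # observations to be assigned

# Unnormalized log p(z = k | y_n, params) for every draw, observation and component.
log_lik = norm.logpdf(y[None, :, None], mu[:, None, :], sigma[:, None, :])
log_w = np.log(theta)[:, None, :] + log_lik
prob = np.exp(log_w - log_w.max(axis=2, keepdims=True))
prob /= prob.sum(axis=2, keepdims=True)                        # (n_draws, n_obs, K) membership probs

# Draw one categorical assignment per posterior draw and observation,
# mirroring what the generated quantities block would do inside Stan.
z = np.array([[rng.choice(K, p=p) for p in draw] for draw in prob])
print(prob.mean(axis=0))                                       # posterior mean membership probabilities
```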

Related

How to get the final equation that the Random Forest algorithm uses on your independent variables to predict your dependent variable?

I am working on an optimization problem with a manufacturing dataset that contains a huge number of controllable parameters. The goal is to find the best run settings for these parameters.
I have familiarized myself with several predictive algorithms during my research. If I use, say, Random Forest to predict my dependent variable and to understand how important each independent variable is, is there a way to extract the final equation/relationship the algorithm uses?
I'm not sure if my question was clear enough, please let me know if there's anything else I can add here.
There is no general way to extract an interpretable equation from a random forest that explains how your covariates affect the dependent variable. For that, use a more suitable model, e.g., linear regression (perhaps with kernel/basis functions) or a single decision tree. Note that you can use one model for prediction and another for descriptive analysis; there is no inherent reason to stick with a single model.
use Random Forest to predict my dependent variable to understand how important each independent variable is
Understanding how important each independent variable is does not necessarily require what the title of your question asks for, namely the actual equation/relationship. Most random forest packages include a method that quantifies how much each covariate affected the model over the training set.
There are a number of methods for estimating feature importance from a trained model. For Random Forest, the best-known are MDI (Mean Decrease in Impurity) and MDA (Mean Decrease in Accuracy, i.e., permutation importance). Many popular ML libraries support feature importance estimation for Random Forest out of the box.
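For example, a minimal scikit-learn sketch of both approaches (synthetic data stands in for the manufacturing parameters; feature_importances_ is the impurity-based MDI score, and permutation_importance gives the permutation, MDA-style, alternative):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real manufacturing data.
Xa, y = make_regression(n_samples=500, n_features=8, n_informative=3, random_state=0)
X = pd.DataFrame(Xa, columns=[f"param_{i}" for i in range(8)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

mdi = pd.Series(rf.feature_importances_, index=X.columns)                 # MDI (impurity-based)
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
mda = pd.Series(perm.importances_mean, index=X.columns)                   # MDA-style (permutation)

print(mdi.sort_values(ascending=False).head())
print(mda.sort_values(ascending=False).head())
```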

How to preprocess high cardinality categorical features?

I have a data file with features of different mobile devices. One categorical column has 1421 distinct values. I am trying to train a logistic regression model on this data together with the other features that I have.
My question is: will the high-cardinality column described above affect the model I am training? If yes, how do I go about preprocessing this column so that it has a lower number of distinct values?
You can calculate the weight of evidence (WOE) to transform your numeric or categorical variable. Refer to this link http://www.kdnuggets.com/2016/08/include-high-cardinality-attributes-predictive-model.html for an explanation of WOE.
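As a rough illustration of the idea (the column names below are made up and a binary target is assumed), the WOE of each category is the log ratio of its share of events to its share of non-events:

```python
import numpy as np
import pandas as pd

# Toy data: a categorical column and a binary target.
df = pd.DataFrame({
    "brand": ["a", "a", "b", "b", "b", "c", "c", "a", "b", "c"],
    "target": [1, 0, 1, 1, 0, 0, 0, 1, 1, 0],
})

eps = 0.5  # smoothing to avoid log(0) for pure categories
stats = df.groupby("brand")["target"].agg(events="sum", total="count")
stats["non_events"] = stats["total"] - stats["events"]

woe = np.log(
    ((stats["events"] + eps) / (df["target"].sum() + eps))
    / ((stats["non_events"] + eps) / ((df["target"] == 0).sum() + eps))
)
df["brand_woe"] = df["brand"].map(woe)   # replace each category by its WOE value
print(df)
```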
The best you can do here is to group the categories using whatever domain knowledge you have, for example grouping phones by brand. If you do not have that information, you can group the categories by frequency: any category that accounts for less than, say, 5% of the data can be lumped into an "other" level. You can also combine both methods. For more information, please refer to this article.
Also note that because logistic regression estimates a coefficient for every dummy-coded level, a very high-cardinality column inflates the dimensionality of the problem and the model suffers from the curse of dimensionality.
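A small pandas sketch of the frequency-based grouping suggested above (the 5% cutoff and the phone names are just placeholders):

```python
import pandas as pd

# Toy column: two common devices plus ten rare ones, each rare device at 1% frequency.
s = pd.Series(["galaxy_s7"] * 60 + ["iphone_6"] * 30 + [f"rare_phone_{i}" for i in range(10)])

freq = s.value_counts(normalize=True)         # relative frequency of each category
keep = freq[freq >= 0.05].index               # categories covering at least 5% of the rows
s_grouped = s.where(s.isin(keep), other="other")

print(s_grouped.value_counts())               # rare devices collapsed into "other"
```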
Typically, dimensionality reduction techniques (such as PCA and FA) are performed in order to decide which features are the most significant.
For example, in the case of PCA, the most popular and most easily applied dimensionality reduction technique, significance is defined by the largest variation in values.
By performing PCA you "wash out" variables that are insignificant yet can cause overfitting. I suggest you familiarize yourself with topics such as PCA, FA and SVD.
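If you go that route with a categorical column, one possible sketch (not the only option) is to one-hot encode it first and then keep only the leading components; TruncatedSVD is used here because it accepts the sparse one-hot matrix directly:

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import OneHotEncoder

# Stand-in for the high-cardinality device column.
df = pd.DataFrame({"device": ["a", "b", "c", "a", "d", "b", "e", "a"]})

onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(df[["device"]])  # sparse matrix
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(onehot)            # low-dimensional numeric features for the model

print(svd.explained_variance_ratio_)
```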

Difference between Generative/Discriminative and Parametric/Nonparametric Algorithms/Models

Here on SO I found the following explanation of generative and discriminative algorithms:
"A generative algorithm models how the data was generated in order to categorize a signal. It asks the question: based on my generation assumptions, which category is most likely to generate this signal?
A discriminative algorithm does not care about how the data was generated, it simply categorizes a given signal."
And here are the definitions of parametric and nonparametric algorithms:
"Parametric: data are drawn from a probability distribution of specific form up to unknown parameters.
Nonparametric: data are drawn from a certain unspecified probability distribution."
So, essentially, can we say that generative and parametric algorithms assume an underlying model, whereas discriminative and nonparametric algorithms don't assume any model?
Say you have inputs X (probably a vector) and output Y (probably univariate). Your goal is to predict Y given X.
A generative method uses a model of the joint probability p(X,Y) to determine p(Y|X). Given a generative model with known parameters, it is thus possible to sample jointly from p(X,Y) and produce new samples of both the input X and the output Y (note that they are distributed according to the assumed, not the true, distribution if you do this). Contrast this with discriminative approaches, which only have a model of the form p(Y|X): provided with an input X they can sample Y, but they cannot sample new X.
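For concreteness, the two factorizations are linked by a standard identity (nothing specific to any particular model): the generative route recovers the conditional it needs from the joint via

    p(Y \mid X) = \frac{p(X, Y)}{p(X)} = \frac{p(X \mid Y)\, p(Y)}{\sum_{y} p(X \mid y)\, p(y)},

so a generative model has to commit to p(X) (or p(X|Y)) as well, while a discriminative model writes down p(Y|X) directly.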
Both assume a model. However, discriminative approaches assume only a model of how Y depends on X, not a model of X itself; generative approaches model both. Thus, given a fixed number of parameters, you might argue (and many have) that it is easier to spend them on modelling the thing you care about, p(Y|X), than on the distribution of X, since you will always be provided with the X for which you wish to know Y.
Useful references: this (very short) paper by Tom Minka. This seminal paper by Andrew Ng and Michael Jordan.
The distinction between parametric and non-parametric models is probably going to be harder to grasp until you have more stats experience. A parametric model has a fixed, finite number of parameters regardless of how many data points are observed. Most probability distributions are parametric: consider a variable z which is the height of people, assumed to be normally distributed. As you observe more people, your estimates of the parameters \mu and \sigma, the mean and standard deviation of z, become more accurate, but you still have only two parameters.
In contrast, the number of parameters in a non-parametric model can grow with the amount of data. Consider an induced distribution over people's heights which places a normal distribution over each observed sample, with mean given by the measurement and a fixed standard deviation. The marginal distribution over new heights is then a mixture of normal distributions, and the number of mixture components increases with each new data point. This is a non-parametric model of people's heights; this specific example is called a kernel density estimator. Popular (but more complicated) non-parametric models include Gaussian Processes for regression and Dirichlet Processes.
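Written out, the induced distribution above (with a fixed kernel width \sigma) is

    p(z_{\text{new}} \mid z_1, \ldots, z_N) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{N}(z_{\text{new}} \mid z_i, \sigma^2),

so the number of mixture components, and hence the effective number of parameters, equals N and grows with the data.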
A pretty good tutorial on non-parametrics can be found here, which constructs the Chinese Restaurant Process as the limit of a finite mixture model.
I don't think you can say that. E.g., linear regression is a discriminative algorithm: you make an assumption about P(Y|X) and then estimate parameters directly from the data, without making any assumption about P(X) or P(X|Y), as you would in the case of a generative model. But at the same time, any inference based on linear regression, including the properties of the parameters, is a parametric estimation, as there is an assumption about the behaviour of the unobserved errors.
Here I'm only talking about parametric/non-parametric; generative/discriminative is a separate concept.
A non-parametric model means you don't make any assumption about the distribution of your data. For example, in the real world, data will not 100% follow theoretical distributions like the Gaussian, beta, Poisson, Weibull, etc. Those distributions were developed for our need to model data.
Parametric models, on the other hand, try to completely explain the data using parameters. In practice this approach is often preferred, because it makes it easier to define how the model should behave in different circumstances (for example, we already know the derivatives/gradients of the model, or what happens when we set the rate too high/too low in a Poisson model, etc.).

What are the metrics to evaluate a machine learning algorithm

I would like to know what techniques and metrics are used to evaluate how accurate/good an algorithm is, and how to use a given metric to draw a conclusion about an ML model.
One way to do this is to use precision and recall, as defined here on Wikipedia.
Another way is to use the accuracy metric, as explained here. So, what I would like to know is whether there are other metrics for evaluating an ML model?
A while ago I compiled a list of metrics used to evaluate classification and regression algorithms, in the form of a cheatsheet. Some metrics for classification: precision, recall, sensitivity, specificity, F-measure, Matthews correlation, etc. They are all based on the confusion matrix. Others exist for regression (continuous output variable).
The technique is mostly to run an algorithm on some data to get a model, then apply that model to new, previously unseen data, evaluate the metric on that data set, and repeat; a short code sketch follows the list below.
Some techniques (actually resampling techniques from statistics):
Jackknife
Cross-validation
K-fold cross-validation
Bootstrap
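A small scikit-learn sketch tying the two together: confusion-matrix-based metrics on a held-out set, followed by K-fold cross-validation of the same model (the data here are synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, mildly imbalanced binary classification data.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)

print(confusion_matrix(y_te, pred))                                   # raw counts of errors
print(precision_score(y_te, pred), recall_score(y_te, pred), f1_score(y_te, pred))

# 5-fold cross-validation of the same metric instead of a single split.
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="f1"))
```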
ML in general is quite a vast field, but I'll try to answer anyway. The Wikipedia definition of ML is the following:
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
In this context, learning can be defined as the parameterization of an algorithm. The parameters of the algorithm are derived using input data with a known output. Once the algorithm has "learned" the association between input and output, it can be tested with further input data for which the output is known.
Suppose your problem is to recognize words from speech. Here the input is some kind of audio file containing one word (not necessarily, but I assume this case to keep it simple). You would record X words, N times each, and then use (for example) N/2 of the repetitions to parameterize your algorithm, disregarding for the moment what your algorithm looks like.
Now, on the one hand, depending on the algorithm, if you feed it one of the remaining repetitions, it may give you a certainty estimate that characterizes the recognition of that single repetition. On the other hand, you may use all of the remaining repetitions to test the learned algorithm: for each repetition you pass it to the algorithm and compare the expected output with the actual output. In the end you obtain an accuracy value for the learned algorithm, calculated as the quotient of correct and total classifications.
Anyway, the actual accuracy will depend on the quality of your learning and test data.
A good starting point for further reading would be Pattern Recognition and Machine Learning by Christopher M. Bishop.
There are many metrics for evaluating the performance of an ML model, and there is no rule that limits you to 20 or 30 of them. You can create your own metric depending on your problem; in many real-world problems you will need a custom metric.
As for the existing ones, they are already listed in the first answer, so I will just highlight the merits and demerits of a few of them.
Accuracy is the simplest metric and it is commonly used. It is the number of correctly classified points divided by the total number of points in your dataset. For a 2-class problem, where some points belong to class 1 and some to class 2, it is not a good choice when the dataset is imbalanced, because a model that mostly predicts the majority class can still score highly, and it tells you nothing about the kind of errors being made.
Log loss is a metric that works with predicted probability scores, which give you a better understanding of how strongly a specific point is believed to belong to class 1. A nice property is that it is the loss that logistic regression, a well-known ML technique, optimizes.
The confusion matrix is best used for a 2-class classification problem: it gives four numbers, and the diagonal entries show how many points your model classified correctly. From it you can derive other interpretable metrics such as precision, recall and F1-score.
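A toy illustration of the accuracy trap on an imbalanced set (the numbers are made up): a classifier that always predicts the majority class looks good on accuracy, while the confusion matrix shows it misses every positive, and log loss scores the predicted probabilities rather than just the hard labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss

y_true = np.array([0] * 95 + [1] * 5)            # 95% negatives, 5% positives
y_pred = np.zeros_like(y_true)                   # always predict class 0
proba = np.tile([0.95, 0.05], (100, 1))          # constant, uninformative probability output

print(accuracy_score(y_true, y_pred))            # 0.95, looks great
print(confusion_matrix(y_true, y_pred))          # but all 5 positives are missed
print(log_loss(y_true, proba))                   # evaluated on the probabilities themselves
```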

ML / density clustering on house areas: two-component (or more) mixtures in each dimension

I am trying to teach myself ML and came across this problem. Help from more experienced people in the field would be much appreciated!
Suppose I have three vectors of areas for house compartments, such as bathroom, living room and kitchen. The data consist of about 70,000 houses. A histogram of each individual vector clearly shows evidence of a bimodal distribution, say a two-component Gaussian mixture. I now want some sort of ML algorithm, preferably unsupervised, that would classify houses according to these attributes, say: large bathroom, small kitchen, large living room.
More specifically, I would like an algorithm to choose the best possible separation threshold for each bimodal vector, say large/small kitchen (this can be binary, since we assume evidence of bimodality), do the same for the others, and cluster the data. Ideally this would come with some confidence measure, so that I could check houses in the intermediate regimes: for instance, a house with clearly a large kitchen, but whose bathroom falls close to the threshold area/boundary for large/small bathrooms, would be put at the bottom of a list of houses with "large kitchens and large bathrooms". For this reason, first deciding on a threshold (fitting the Gaussians with the lowest possible FDR), collapsing the data and then clustering would not be desirable.
Any advice on how to proceed? I know R and python.
What you're looking for is a clustering method: this is basically unsupervised classification. A simple method is k-means, which has many implementations (k-means can be viewed as the limit of a multi-variate Gaussian mixture as the variance tends to zero). This would naturally give you a confidence measure, which would be related to the distance metric (Euclidean distance) between the point in question and the centroids.
One final note: I don't know about clustering each attribute in turn, and then making composites from the independent attributes: why not let the algorithm find the clusters in multi-dimensional space? Depending on the choice of algorithm, this will take into account covariance in the features (big kitchen increases the probability of big bedroom) and produce natural groupings you might not consider in isolation.
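A sketch of that route in Python (the question mentions R and Python; the `areas` array below is fabricated bimodal data standing in for the real 70,000 x 3 matrix): k-means assigns each house to a cluster, and the distances to the centroids give a rough confidence margin for flagging borderline houses:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Fake bimodal data: columns are bathroom, living room, kitchen areas.
areas = np.vstack([rng.normal([4, 20, 8], 1.0, (500, 3)),
                   rng.normal([8, 35, 15], 1.5, (500, 3))])

X = StandardScaler().fit_transform(areas)      # scale so no attribute dominates the distance
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

dists = np.sort(km.transform(X), axis=1)       # distance of each house to each centroid
margin = dists[:, 1] - dists[:, 0]             # small margin = close to the cluster boundary
borderline = np.argsort(margin)[:10]           # the 10 most ambiguous houses
print(labels[:5], margin[borderline])
```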
Sounds like you want EM clustering with a mixture of Gaussians model.
Should be in the mclust package in R.
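For the Python side, a scikit-learn analogue of the mclust suggestion (again with fabricated data): GaussianMixture fits the mixture by EM on all three areas jointly, and predict_proba gives exactly the soft membership you want for ranking borderline houses:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Fake bimodal data standing in for the real bathroom / living room / kitchen areas.
areas = np.vstack([rng.normal([4, 20, 8], 1.0, (500, 3)),
                   rng.normal([8, 35, 15], 1.5, (500, 3))])

X = StandardScaler().fit_transform(areas)
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

labels = gmm.predict(X)                        # hard cluster assignment
conf = gmm.predict_proba(X).max(axis=1)        # membership probability of the assigned cluster
uncertain = np.argsort(conf)[:10]              # houses closest to the boundary between clusters
print(labels[:5], conf[uncertain])
```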
In addition to what the others have suggested, it is indeed possible to cluster on the individual dimensions (maybe even with density-based clustering methods such as DBSCAN), forming one-dimensional clusters (intervals) and working from there, possibly combining them into multi-dimensional, rectangle-shaped clusters.
I am doing a project involving exactly this. It turns out there are a few advantages to running density-based methods in one dimension, including the fact that you can do what you describe: classify objects on the border of one attribute according to their other attributes.
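A sketch of the one-dimensional version (fabricated data, and the 0.5 responsibility crossing is just one reasonable choice of threshold): fit a two-component mixture to a single attribute and read off where the "large" component takes over:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Fake bimodal kitchen areas: a "small" and a "large" population.
kitchen = np.concatenate([rng.normal(8, 1, 500), rng.normal(15, 2, 500)])

gmm = GaussianMixture(n_components=2, random_state=0).fit(kitchen.reshape(-1, 1))

# Responsibility of the larger-mean ("large kitchen") component along a grid of areas.
grid = np.linspace(kitchen.min(), kitchen.max(), 2000).reshape(-1, 1)
p_large = gmm.predict_proba(grid)[:, np.argmax(gmm.means_)]

threshold = float(grid[np.argmax(p_large >= 0.5), 0])    # first area where "large" wins
labels = (kitchen > threshold).astype(int)               # 1 = large kitchen
print(threshold, labels.mean())
```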
