MCMCglmm for continuous data - mcmc

I'm referring to R Package MCMCglmm (Monte Carlo Markov Chain Generalized Linear Mixed-effect Models), see cran.r-project.org/web/packages/MCMCglmm/MCMCglmm.pdf
While MCMCglmm specifies as a generalised mixed effect model (and is therefore unsuitable for analyzing continuous data with a Gaussian distribution), the function specifically offers family="gaussian" as an option.
Hence, my question: Am I allowed to analyze continuous data with MCMCglmm as well? If not, is there an equivalent (i.e. MCMClme) for continous data?
Thanks for your help!

Yes, you can analyse continuous data with MCMCglmm.
It sounds as if you may be misunderstanding what a "generalised" mixed effect model is. General (i.e. not generalised) linear models can analyse only data with a Gaussian error distribution. Generalised models extensions of this that allow you to analyse not only Gaussian data, but also other types of data, such as binomial or poisson.
As you have already noticed, you can specify
family = "gaussian"
This is exactly equivalent to running a "General" model (i.e. not generalised). It is a completely valid approach as long as your data conform to the assumptions of such a model.

Related

Handling geospatial coordinates in machine learning

I'm building a machine learning model where some columns are physical addresses (which I can translate into X / Y coordinates) but I'm a little bit confused on how this will be handled by the ML algorithm.
Is there a particular way to translate a GEO location into columns for use into ML (classification and/or regression) ?
Thanks in advance !
The choice of features would, in general, depend on what kind of relationship you anticipate between the features and the target variable. You are right in saying that post code number itself does not bear any relation to the target. Here the postcode is simply a string, or a category. What kind of model are you planning to use? Linear regression and Decision tree are two examples. These models capture relationships in different ways. As an example for a feature, you could compute the straight line distance between the source and destination, and use that in the model, since intuitively, the farther they are, the higher the transit time is likely to be. What else does the transit time depend on? See if you can relate the factors influencing the travel time to the information that you have, i.e., the postcodes / XY co-ordinates, in some way.
This summarizes the answer we ended up with in the comments of the questions:
This transformation from ZIP codes to geo-coordinates should not be seen as a "split" but only as a way to represent your data in a multidimensional way (in this case the dimension will be 2).
Machine learning algorithms exist for both unidimensional and multidimensional data. The two dimensions can be correlated or uncorrelated, depending on how you define the parameters of the model you choose afterwards.
Moreover, the correlation does not have to be explicitly set in most cases. Only an initial value may be useful, but many algorithm also rely on random initialization or other simple methods that estimate it from a subset of your data. So, for clarity's sake, if you model you data by a Gaussian for example, when estimating the parameters of this Gaussian, the covariance matrix will have non-diagonal term that are non-zeros which will represent the data correlation. You only need not to take an assumption that states that the 2 dimensions are uncorrelated!

Detect hidden unknown patterns when visualization fails

I have a fast set of multi dimensional timebased data which i suspect contain patterns. I simplified the dataset to create a custom visualization.
Humans see patterns in the visualization but the result of the pattern cannot be explained by the visualization. This is because of the simplification step, it hides data which is important.
I cannot put all my data in my visualization cause than humans cannot see the possible patterns anymore because too much data and dimensions are visualized.
Is there a technique that can detect hidden unknown patterns in a data set? (without using visualization, and without me learning the technique patterns) .
One optional extra would be that the technique should somehow be able to "explain the patterns" to me so that i can check if they make sense.
[edit] i can give the technique a collection of small sized datasets (extracted from the big dataset; still very multi dimensional) that i know that contain patterns (by using my visualization). The technique then needs to analyze under what conditions a pattern produces result a or result b.
First of, how did you "simplify" the data? If you did it without any heuristics, you might go ahead and perform PCA. The very idea of PCA is to solve your problem: Not losing "important" data while having a dimensional reduction. You can visualize your principal components so that patterns can be detected by the human eye as well as algorithms.
To your 2nd question: Yes, there are techniques that can detect hidden unknown patterns in data. However, this is a huge field (Machine Learning) and what algorithm you'd use, would depend on your problem structure, so it's impossible to give a specific model name at this point. From what you specified, neural networks in general seem fit to do the job. After you trained a network, you can visualize the activations or weights (Hinton Diagram) to perform an analysis on which input data is treated "similarly".

Machine learning: Which algorithm is used to identify relevant features in a training set?

I've got a problem where I've potentially got a huge number of features. Essentially a mountain of data points (for discussion let's say it's in the millions of features). I don't know what data points are useful and what are irrelevant to a given outcome (I guess 1% are relevant and 99% are irrelevant).
I do have the data points and the final outcome (a binary result). I'm interested in reducing the feature set so that I can identify the most useful set of data points to collect to train future classification algorithms.
My current data set is huge, and I can't generate as many training examples with the mountain of data as I could if I were to identify the relevant features, cut down how many data points I collect, and increase the number of training examples. I expect that I would get better classifiers with more training examples given fewer feature data points (while maintaining the relevant ones).
What machine learning algorithms should I focus on to, first,
identify the features that are relevant to the outcome?
From some reading I've done it seems like SVM provides weighting per feature that I can use to identify the most highly scored features. Can anyone confirm this? Expand on the explanation? Or should I be thinking along another line?
Feature weights in a linear model (logistic regression, naive Bayes, etc) can be thought of as measures of importance, provided your features are all on the same scale.
Your model can be combined with a regularizer for learning that penalises certain kinds of feature vectors (essentially folding feature selection into the classification problem). L1 regularized logistic regression sounds like it would be perfect for what you want.
Maybe you can use PCA or Maximum entropy algorithm in order to reduce the data set...
You can go for Chi-Square tests or Entropy depending on your data type. Supervized discretization highly reduces the size of your data in a smart way (take a look into Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani).
If you work in R, the SIS package has a function that will do this for you.
If you want to do things the hard way, what you want to do is feature screening, a massive preliminary dimension reduction before you do feature selection and model selection from a sane-sized set of features. Figuring out what is the sane-size can be tricky, and I don't have a magic answer for that, but you can prioritize what order you'd want to include the features by
1) for each feature, split the data in two groups by the binary response
2) find the Komogorov-Smirnov statistic comparing the two sets
The features with the highest KS statistic are most useful in modeling.
There's a paper "out there" titled "A selctive overview of feature screening for ultrahigh-dimensional data" by Liu, Zhong, and Li, I'm sure a free copy is floating around the web somewhere.
4 years later I'm now halfway through a PhD in this field and I want to add that the definition of a feature is not always simple. In the case that your features are a single column in your dataset, the answers here apply quite well.
However, take the case of an image being processed by a convolutional neural network, for example, a feature is not one pixel of the input, rather it's much more conceptual than that. Here's a nice discussion for the case of images:
https://medium.com/#ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721

Classification Algorithm which can take predefined weights for attributes as input

I have 20 attributes and one target feature. All the attributes are binary(present or not present) and the target feature is multinomial(5 classes).
But for each instance, apart from the presence of some attributes, I also have the information that how much effect(scale 1-5) did each present attribute have on the target feature.
How do I make use of this extra information that I have, and build a classification model that helps in better prediction for the test classes.
Why not just use the weights as the features, instead of binary presence indicator? You can code the lack of presence as a 0 on the continuous scale.
EDIT:
The classifier you choose to use will learn optimal weights on the features in training to separate the classes... thus I don't believe there's any better you can do if you do not have access to test weights. Essentially a linear classifier is learning a rule of the form:
c_i = sgn(w . x_i)
You're saying you have access to weights, but without an example of what the data look like, and an explanation of where the weights come from, I'd have to say I don't see how you'd use them (or even why you'd want to---is standard classification with binary features not working well enough?)
This clearly depends on the actual algorithms that you are using.
For decision trees, the information is useless. They are meant to learn which attributes have how much effect.
Similarly, support vector machines will learn the best linear split, so any kind of weight will disappear since the SVM already learns this automatically.
However, if you are doing NN classification, just scale the attributes as desired, to emphasize differences in the influential attributes.
Sorry, you need to look at other algorithms yourself. There are just too many.
Use the knowledge as prior over the weight of features. You can actually compute the posterior estimation out of the data and then have the final model

How to approach machine learning problems with high dimensional input space?

How should I approach a situtation when I try to apply some ML algorithm (classification, to be more specific, SVM in particular) over some high dimensional input, and the results I get are not quite satisfactory?
1, 2 or 3 dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on, and have some idea how to aproach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters I am not really sure how to attack it?
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data)
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
You can try reducing the dimensionality of the problem by PCA or the similar technique. Beware that PCA has two important points. (1) It assumes that the data it is applied to is normally distributed and (2) the resulting data looses its natural meaning (resulting in a blackbox). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVM's were already mentioned here, you might try the approach of Chang and Li (Feature Ranking Using Linear SVM) in which they used linear SVM to pre-select "interesting features" and then used RBF - based SVM on the selected features. If you are familiar with Orange, a python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness" might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve this problem with PCA (see above), you might want to go to heuristic methods, which try to select best possible combinations of predictors. The main pitfall of this kind of approaches is the high potential of overfitting. Make sure you have a bunch "virgin" data that was not seen during the entire process of model building. Test your model on that data only once, after you are sure that the model is ready. If you fail, don't use this data once more to validate another model, you will have to find a new data set. Otherwise you won't be sure that you didn't overfit once more.
List of selected papers on parameter selection:
Feature selection for high-dimensional genomic microarray data
Oh, and one more thing about SVM. SVM is a black box. You better figure out what is the mechanism that generate the data and model the mechanism and not the data. On the other hand, if this would be possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this Stochastic method to determine, in silico, the drug like character of molecules
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model(SVM in this case) is not well suited for this problem. This can be diagnozed by trying other models (adaboost etc.) or adding more parameters to your current model.
The representation of the data is not well suited for your classification task. In this case preprocessing the data with feature selection or dimensionality reduction techniques would help
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex(too many parameters) and it needs to be constrained further,
Or you trained it on a training set which is too small and you need more data
Of course it may be a mixture of the above elements. These are all "blind" methods to attack the problem. In order to gain more insight into the problem you may use visualization methods by projecting the data into lower dimensions or look for models which are suited better to the problem domain as you understand it (for example if you know the data is normally distributed you can use GMMs to model the data ...)
If I'm not wrong, you are trying to see which parameters to the SVM gives you the best result. Your problem is model/curve fitting.
I worked on a similar problem couple of years ago. There are tons of libraries and algos to do the same. I used Newton-Raphson's algorithm and a variation of genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for, through real world experiment (or if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algos I mentioned earlier reiterates this process till the result of your model(SVM in this case) somewhat matches the expected values (note that this process would take some time based your problem/data size.. it took about 2 months for me on a 140 node beowulf cluster).
If you choose to go with Newton-Raphson's, this might be a good place to start.

Resources