Usage of nested cross validation for different regressors - machine-learning

I'm working on an assignment in which I have to compare two regressors (random forest and SVR) implemented with scikit-learn. While searching for ways to evaluate them, I came across nested cross validation, where the inner loop tunes the hyperparameters and the outer loop validates on the k folds of the training set. I would like to use the inner loop to tune both regressors and the outer loop to validate both, so that both regressors see the same train and test folds.
Is this a proper way to compare two ML algorithms with each other? Are there better ways to compare two algorithms, especially regressors?
I found some blog posts, but I could not find any scientific paper stating that this is a good technique for comparing two algorithms, which would be important to me. If you have links to current papers, I would be glad if you could post them, too.
Thanks for the help in advance!
EDIT
I have very little data (approx. 200 samples) with a large number of features (approx. 250 after feature selection, otherwise about 4500), so I decided to use cross validation. My dependent variable is a continuous value from 0 to 1.
The problem is a recommender problem, so it makes no sense to test for accuracy in this case. As this is only an assignment, I can only evaluate the ML algorithms with statistical methods rather than asking users for their opinion or measuring the purchases they make.

I think it depends on what you want to compare. If you just want to compare different models with regard to predictive power (classifiers and regressors alike), nested cross validation is usually a good choice, because it avoids reporting overly optimistic metrics while still letting you find the best set of hyperparameters: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
However, sometimes it seems to be overkill: https://arxiv.org/abs/1809.09446
Also, depending on how the ML algorithms behave, which datasets you are working with, their characteristics, and so on, your "comparison" might need to take a lot of other things into consideration beyond prediction power. If you give some more details, we may be able to help more.
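To make the idea concrete, here is a minimal sketch of what nested cross validation for the two regressors in the question could look like in scikit-learn. The data is synthetic, and the parameter grids, fold counts and scoring metric are illustrative assumptions only; the point is that both models share the same outer folds.

```python
# Nested cross validation for two regressors sharing the same outer folds.
# Synthetic data; grids, fold counts and scoring are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=250, noise=0.1, random_state=0)

outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # same outer folds for both models
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # inner loop: hyperparameter tuning

candidates = {
    "random_forest": GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
        cv=inner_cv, scoring="neg_mean_squared_error"),
    "svr": GridSearchCV(
        SVR(),
        param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
        cv=inner_cv, scoring="neg_mean_squared_error"),
}

for name, model in candidates.items():
    # Outer loop: estimate of generalization error; the inner GridSearchCV tunes.
    scores = cross_val_score(model, X, y, cv=outer_cv, scoring="neg_mean_squared_error")
    print(f"{name}: MSE = {-scores.mean():.3f} (+/- {scores.std():.3f})")
```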

Related

Best Learning model for high numerical dimension data? (with Rapidminer)

I have a dataset of approx. 4800 rows with 22 attributes, all numerical, describing mostly the geometry of rock / minerals, and 3 different classes.
I tried cross validation with a k-NN model inside it, with k = 7 and Numerical Measure -> Canberra Distance as the parameters, and I got a performance of 82.53% and 0.673 kappa. Is that result representative of the dataset? I mean, 82% is quite OK.
Before doing this, I evaluated the best subset of attributes with a decision table and ended up with 6 attributes.
The problem is, you still don't learn much from that kind of model, such as instance-based k-NN. Can I get any more insight from k-NN? I don't know how to visualize the clusters in that high-dimensional space in Rapidminer; is that somehow possible?
I tried a decision tree on the data, but I got too many branches (300 or so) and it all looked too messy. The problem is that all the numerical attributes have about the same mean and distribution, so it's hard to get a distinct subset of meaningful attributes.
Ideally, the staff wants to "learn" something about the data, but my impression is that you cannot learn much meaningful from this data; all that works best are "black box" learning models like neural nets, SVMs, and other instance-based models.
How should I proceed?
Welcome to the world of machine learning! This sounds like a classic real-world case: we want to make firm conclusions, but the data rows don't cooperate. :-)
Your goal is vague: "learn something"? I'm taking this to mean that you're investigating, hoping to find quantitative discriminations among the three classes.
First of all, I highly recommend Principal Component Analysis (PCA): find out whether you can eliminate some of these attributes by automated matrix operations, rather than a hand-built decision table. I expect that the messy branches are due to an unfortunate choice of factors; decision trees work very hard at over-fitting. :-)
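As a rough illustration of the PCA suggestion (sketched in scikit-learn rather than Rapidminer, and with random data standing in for the real 4800 x 22 attribute matrix), you could check how much variance a handful of components captures:

```python
# Sketch: how much of the variance in the 22 numerical attributes is captured
# by the first few principal components? Random data is a placeholder here.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(4800, 22))                # placeholder for the real attributes

X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive
pca = PCA().fit(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
for k, frac in enumerate(cumulative, start=1):
    print(f"{k:2d} components explain {frac:.1%} of the variance")
```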
How clean are the separations of the data sets? Since you already used k-NN, I'm hopeful that you have dense clusters with gaps. If so, perhaps spectral clustering would help; these methods are good at classifying data based on gaps between the clusters, even if the cluster shapes aren't spherical. Interpretation depends on having someone on staff who can read eigenvectors and say what the values mean.
Try a multi-class SVM. Start with 3 classes, but increase if necessary until your 3 expected classes appear. (Sometimes you get one tiny outlier class, and then two major ones get combined.) The resulting kernel functions and the placement of the gaps can teach you something about your data.
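For what it's worth, a multi-class SVM is only a few lines in scikit-learn (shown here purely as a sketch; Rapidminer has its own SVM operators, and the synthetic data below merely stands in for the real rock/mineral attributes):

```python
# Sketch of a multi-class SVM; synthetic data stands in for the 4800 x 22 matrix.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=4800, n_features=22, n_informative=6,
                           n_classes=3, random_state=0)

# SVC handles the multi-class case internally (one-vs-one).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print(cross_val_score(clf, X, y, cv=5).mean())
```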
Try the Naive Bayes family, especially if you observe that the features come from a Gaussian or Bernoulli distribution.
As a holistic approach, try a neural net, but use something to visualize the neurons and weights. Letting the human visual cortex play with relationships can help extract subtle relationships.

Predictive features with high presence in one class

I am doing a logistic regression to predict the outcome of a binary variable, say whether a journal paper gets accepted or not. The predictors (independent variables) are all the phrases used in these papers (unigrams, bigrams, trigrams). One of these phrases has a skewed presence in the 'accepted' class. Including this phrase gives me a classifier with very high accuracy (more than 90%), while removing it drops the accuracy to about 70%.
My more general (naive) machine learning question is:
Is it advisable to remove such skewed features when doing classification?
Is there a method to check skewed presence for every feature and then decide whether to keep it in the model or not?
If I understand correctly, you are asking whether a feature should be removed because it is a good predictor (it makes your classifier work better). The answer is short and simple: do not remove it. In fact, the whole point is to find exactly such features.
The only reason to remove such a feature would be that this phenomenon occurs only in the training set and not in real data. But in that case you have bad data, which does not represent the underlying data density, and you should gather better data or "clean" the current data so that it has characteristics analogous to the "real" data.
Based on your comments, it sounds like the feature in your documents that's highly predictive of the class is a near-tautology: "paper accepted on" correlates with accepted papers because at least some of the papers in your database were scraped from already-accepted papers and have been annotated by the authors as such.
To me, this sounds like a useless feature for trying to predict whether a paper will be accepted, because (I'd imagine) you're trying to predict paper acceptance before the actual acceptance has been issued! In such a case, none of the papers you'd like to test your algorithm with will be annotated with "paper accepted on." So, I'd remove it.
You also asked about how to determine whether a feature correlates strongly with one class. There are three things that come to mind for this problem.
First, you could just compute a basic frequency count for each feature in your dataset and compare those values across classes. This is probably not super informative, but it's easy.
Second, since you're using a log-linear model, you can train your model on your training dataset, and then rank each feature in your model by its weight in the logistic regression parameter vector. Features with high positive weight are indicative of one class, while features with large negative weight are strongly indicative of the other.
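A rough sketch of that weight-ranking idea (the tiny toy corpus and labels below are purely hypothetical placeholders; with your real data you would plug in your own papers and accepted/rejected labels):

```python
# Sketch: rank n-gram features by their logistic regression weights.
# The toy corpus and labels are hypothetical placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "paper accepted on june first, novel method",
    "paper accepted on march, strong empirical results",
    "we propose a new approach to feature selection",
    "preliminary results on a small benchmark",
    "paper accepted on october after revision",
    "an exploratory study with limited evaluation",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = accepted, 0 = rejected

vectorizer = CountVectorizer(ngram_range=(1, 3))   # unigrams, bigrams, trigrams
X = vectorizer.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

features = np.array(vectorizer.get_feature_names_out())
order = np.argsort(clf.coef_[0])                   # ascending by weight
print("most indicative of rejection:", features[order[:5]])
print("most indicative of acceptance:", features[order[-5:]])
```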
Finally, just for the sake of completeness, I'll point out that you might also want to look into feature selection. There are many ways of selecting relevant features for a machine learning algorithm, but I think one of the most intuitive from your perspective might be greedy feature elimination. In such an approach, you train a classifier using all N features in your model, and measure the accuracy on some held-out validation set. Then, train N new models, each with N-1 features, such that each model eliminates one of the N features, and measure the resulting drop in accuracy. The feature with the biggest drop was probably strongly predictive of the class, while features that have no measurable difference can probably be omitted from your final model. As larsmans points out correctly in the comments below, this doesn't scale well at all, but it can be a useful method sometimes.
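And a small sketch of the greedy elimination idea (synthetic data as a placeholder; as noted, this trains one model per feature, so it does not scale to thousands of n-grams):

```python
# Greedy one-at-a-time feature elimination: drop each feature, retrain, and
# measure the accuracy change on a held-out validation set.
# Synthetic data is used as a placeholder for the real feature matrix.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def accuracy_without(drop_idx):
    keep = [j for j in range(X_train.shape[1]) if j != drop_idx]
    clf = LogisticRegression(max_iter=1000).fit(X_train[:, keep], y_train)
    return clf.score(X_val[:, keep], y_val)

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_val, y_val)
drops = np.array([baseline - accuracy_without(j) for j in range(X_train.shape[1])])
print("biggest accuracy drop when removing feature", drops.argmax())
```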

Machine Learning Algorithm selection

I am new to machine learning. My problem is to make a machine select a university for a student according to their location and area of interest, i.e. it should select a university in the same city as the student's address. I am confused about the selection of the algorithm; can I use the perceptron algorithm for this task?
There are no hard rules as to which machine learning algorithm is the best for which task. Your best bet is to try several and see which one achieves the best results. You can use the Weka toolkit, which implements a lot of different machine learning algorithms. And yes, you can use the perceptron algorithm for your problem -- but that is not to say that you would achieve good results with it.
From your description it sounds like the problem you're trying to solve doesn't really require machine learning. If all you want to do is match a student with the closest university that offers a course in the student's area of interest, you can do this without any learning.
I second the first remark that you probably don't need machine learning if the student has to live in the same area as the university. If you want to use an ML algorithm, it would be best to think first about what data you have to start with. What comes to mind is a vector for each university with a feature for each subject/area. Then compute the distance to an ideal feature vector for the student and minimize this distance, as sketched below.
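A minimal sketch of that distance idea (all feature values, subject areas and vectors here are made up for illustration):

```python
# Represent each university as a vector over subject areas, build an "ideal"
# vector for the student, and pick the nearest university by Euclidean distance.
# All numbers below are hypothetical.
import numpy as np

# rows: universities, columns: subject areas (e.g. CS, biology, arts)
university_features = np.array([
    [1.0, 0.2, 0.0],
    [0.3, 0.9, 0.1],
    [0.0, 0.1, 1.0],
])
student_profile = np.array([0.9, 0.3, 0.0])   # the student's areas of interest

distances = np.linalg.norm(university_features - student_profile, axis=1)
print("best match: university index", distances.argmin())
```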
The first and foremost thing you need is a labeled dataset.
It sounds like the problem could be framed as an ML problem; however, you first need a set of positive and negative examples to train on.
How big is your dataset? What features do you have available? Once you answer these questions, you can select an algorithm that best fits the features of your data.
I would suggest decision trees for this problem, since they resemble a set of if-else rules. You can take the student's location and area of interest as the conditions of if and else-if statements and then suggest a university. Since it's a direct mapping of inputs to outputs, a rule-based solution would work, and there is no learning required here.
Maybe you can use a "recommender system" or a clustering approach. You can look more deeply into techniques like "collaborative filtering" (recommender systems) or k-means (clustering), but again, as others have said, you first need data to learn from, and maybe your problem can be solved without ML.
Well, there is no straightforward, sure-fire answer to this question. The answer depends on many factors, such as the problem statement and the kind of output you want, the type and size of the data, the available computational time, and the number of features and observations in the data, to name a few.
Size of the training data
Accuracy and/or Interpretability of the output
Accuracy of a model means that the function predicts a response value for a given observation, which is close to the true response value for that observation. A highly interpretable algorithm (restrictive models like Linear Regression) means that one can easily understand how any individual predictor is associated with the response while the flexible models give higher accuracy at the cost of low interpretability.
Speed or Training time
Higher accuracy typically means higher training time. Also, algorithms require more time to train on large training data. In real-world applications, the choice of algorithm is driven by these two factors predominantly.
Algorithms like Naïve Bayes and linear and logistic regression are easy to implement and quick to run. Algorithms like SVMs, which involve tuning of parameters, neural networks with long convergence times, and random forests need a lot of time to train on the data.
Linearity
Many algorithms work on the assumption that classes can be separated by a straight line (or its higher-dimensional analog). Examples include logistic regression and support vector machines. Linear regression algorithms assume that data trends follow a straight line. If the data is linear, then these algorithms perform quite well.
Number of features
The dataset may have a large number of features that may not all be relevant and significant. For a certain type of data, such as genetics or textual, the number of features can be very large compared to the number of data points.

Text categorization using Naive Bayes

I am working on a text categorization problem using Naive Bayes. I am using each word as a feature. I have been able to implement it and I am getting good accuracy.
Is it possible for me to use tuples of words as features?
For example, suppose there are two classes, politics and sports. The word "government" might appear in both of them. However, in politics I can have the tuple (government, democracy), whereas in sports I can have the tuple (government, sportsman). So if a new article about politics comes in, the tuple (government, democracy) is more probable than the tuple (government, sportsman).
I am asking because, by doing this, am I violating the independence assumption of Naive Bayes, since I am also considering single words as features?
Also, I am thinking of adding weights to features. For example, a 3-tuple feature will have less weight than a 4-tuple feature.
Theoretically, do these two approaches change the independence assumptions of the Naive Bayes classifier? Also, I have not started with the approach I mentioned yet, but will it improve the accuracy? I think the accuracy might not improve, but the amount of training data required to reach the same accuracy would be less.
Even without adding bigrams, real documents already violate the independence assumption. Conditioned on a document containing Obama, President is much more likely to appear. Nonetheless, Naive Bayes still does a decent job at classification, even if the probability estimates it gives are hopelessly off. So I recommend that you go ahead and add more complex features to your classifier and see if they improve accuracy.
If you get the same accuracy with less data, that is basically equivalent to getting better accuracy with the same amount of data.
On the other hand, using simpler, more common features works better as you decrease the amount of data. If you try to fit too many parameters to too little data, you tend to overfit badly.
But the bottom line is to try it and see.
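If it helps, here is a small sketch of "try it and see" in scikit-learn, using contiguous bigrams as a stand-in for the word tuples in the question; the toy corpus and labels are purely hypothetical:

```python
# Compare unigram-only features against unigram+bigram features with a
# multinomial Naive Bayes classifier. The toy corpus is a hypothetical placeholder.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "government passes new democracy bill", "election debate on government policy",
    "parliament votes on government reform", "democracy activists meet government",
    "sportsman wins government sports award", "team sportsman breaks sprint record",
    "government funds new sports stadium", "star sportsman retires from football",
]
labels = ["politics", "politics", "politics", "politics",
          "sports", "sports", "sports", "sports"]

for ngram_range in [(1, 1), (1, 2)]:
    clf = make_pipeline(CountVectorizer(ngram_range=ngram_range), MultinomialNB())
    score = cross_val_score(clf, texts, labels, cv=2).mean()
    print(ngram_range, "accuracy:", round(score, 3))
```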
No, from a theoretical viewpoint, you are not changing the independence assumption. You are simply creating a modified (or new) sample space. In general, once you start using higher n-grams as events in your sample space, data sparsity becomes a problem. I think using tuples will lead to the same issue. You will probably need more training data, not less. You will probably also have to give a little more thought to the type of smoothing you use. Simple Laplace smoothing may not be ideal.
The most important point, I think, is this: whatever classifier you are using, the features are highly dependent on the domain (and sometimes even the dataset). For example, if you are classifying the sentiment of texts based on movie reviews, using only unigrams may seem counterintuitive, but they perform better than using only adjectives. On the other hand, for Twitter datasets, a combination of unigrams and bigrams was found to be good, but higher n-grams were not useful. Based on such reports (ref. Pang and Lee, Opinion Mining and Sentiment Analysis), I think using longer tuples will show similar results, since, after all, tuples of words are simply points in a higher-dimensional space. The basic algorithm behaves the same way.

How to approach machine learning problems with high dimensional input space?

How should I approach a situation in which I try to apply some ML algorithm (classification, to be more specific, SVM in particular) to some high-dimensional input, and the results I get are not quite satisfactory?
1-, 2- or 3-dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on and have some idea how to approach the problem. Once the data has more than 3 dimensions, other than intuitively playing around with the parameters, I am not really sure how to attack it.
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
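For the record, the argument can be written out in one line (assuming a linear kernel and a separating hyperplane in the projected space, exactly as in the paragraph above):

```latex
% If (w, b) separates the projected classes Y1, Y2 in R^n, then (A^T w, b)
% separates the original classes X1, X2 in R^m, since w^T (Ax) = (A^T w)^T x.
\[
w^\top y + b > 0 \;\; \forall y \in Y_1, \qquad
w^\top y + b < 0 \;\; \forall y \in Y_2
\]
\[
\Longrightarrow \quad
(A^\top w)^\top x + b > 0 \;\; \forall x \in X_1, \qquad
(A^\top w)^\top x + b < 0 \;\; \forall x \in X_2 .
\]
```

In other words, projecting can destroy linear separability but never create it, which is why the reduction cannot make the SVM's job easier in theory.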
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
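Putting those tips together, a sketch might look like the following (scikit-learn rather than libsvm directly, synthetic data as a placeholder, and the grid values are illustrative only):

```python
# Scale the features, use the RBF kernel, penalize the minority class via
# class_weight, and cross-validate over C and gamma.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=100, n_informative=15,
                           weights=[0.7, 0.3], random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("svm", SVC(kernel="rbf", class_weight="balanced"))])
grid = GridSearchCV(pipe,
                    param_grid={"svm__C": [0.1, 1, 10, 100],
                                "svm__gamma": [1e-3, 1e-2, 1e-1, "scale"]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```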
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data)
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
You can try reducing the dimensionality of the problem with PCA or a similar technique. Be aware that PCA has two important caveats: (1) it assumes that the data it is applied to is normally distributed, and (2) the resulting data loses its natural meaning (resulting in a black box). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVMs were already mentioned here, you might try the approach of Chang and Li (Feature Ranking Using Linear SVM), in which they used a linear SVM to pre-select "interesting features" and then used an RBF-based SVM on the selected features. If you are familiar with Orange, a Python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness", might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve this problem with PCA (see above), you might want to turn to heuristic methods, which try to select the best possible combinations of predictors. The main pitfall of this kind of approach is the high potential for overfitting. Make sure you have a set of "virgin" data that was not seen during the entire process of model building. Test your model on that data only once, after you are sure the model is ready. If it fails, don't use this data again to validate another model; you will have to find a new data set. Otherwise you won't be sure that you haven't overfit once more.
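A rough sketch of that two-stage idea in scikit-learn (not the authors' original code; synthetic data, and the number of features kept is an arbitrary choice):

```python
# Rank features by the weights of a linear SVM, keep the top-ranked ones, then
# train an RBF SVM on that subset. Synthetic data stands in for the real input.
# NB: for an honest estimate, the ranking itself should be done on training
# folds only (or on a separate split), not on the full data as shown here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC, SVC

X, y = make_classification(n_samples=400, n_features=500, n_informative=20,
                           random_state=0)

ranker = LinearSVC(C=1.0, dual=False, max_iter=5000).fit(X, y)
importance = np.abs(ranker.coef_).sum(axis=0)        # aggregate |weight| per feature
top = np.argsort(importance)[::-1][:50]              # keep the 50 highest-ranked features

print(cross_val_score(SVC(kernel="rbf"), X[:, top], y, cv=5).mean())
```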
Oh, and one more thing about SVMs: an SVM is a black box. You had better figure out the mechanism that generates the data and model the mechanism, not the data. Then again, if that were possible, you most probably wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this stochastic method to determine, in silico, the drug-like character of molecules.
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model (the SVM in this case) is not well suited to this problem. This can be diagnosed by trying other models (AdaBoost etc.) or adding more parameters to your current model.
The representation of the data is not well suited for your classification task. In this case preprocessing the data with feature selection or dimensionality reduction techniques would help
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex (too many parameters) and it needs to be constrained further,
Or you trained it on a training set that is too small and you need more data.
Of course it may be a mixture of the above elements. These are all "blind" methods of attacking the problem. To gain more insight into the problem, you can use visualization methods by projecting the data into lower dimensions, or look for models that are better suited to the problem domain as you understand it (for example, if you know the data is normally distributed you can use GMMs to model it).
If I'm not wrong, you are trying to see which SVM parameters give you the best result. Your problem is model/curve fitting.
I worked on a similar problem a couple of years ago. There are tons of libraries and algorithms for this. I used the Newton-Raphson algorithm and a variation of a genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for through a real-world experiment (or, if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algorithms I mentioned earlier iterate this process until the result of your model (the SVM in this case) roughly matches the expected values (note that this process can take some time depending on your problem/data size; it took about 2 months for me on a 140-node Beowulf cluster).
If you choose to go with Newton-Raphson's, this might be a good place to start.
