I am currently using auto gluon to train my dataset. When looking at the feature importance value of WeightedEnsembles_L2, I realise that the importance number is larger than the validation or the test score. Is this normal??
From my understanding, a score of x means that when the feature’s value was randomly shuffled across rows, the model’s performance dropped by x. If x is larger than R squared, does that make sense anymore?
Thanks in advance!
I tried to search up info regarding to this and found no result.
Related
I have a classification problem and my current feature vector does not seem to hold enough information.
My training set has 10k entries and I am using a SVM as classifier (scikit-learn).
What is the maximum reasonable feature vector size (how many dimension)?
(Training and evaluation using Labtop CPU)
100? 1k? 10k? 100k? 1M?
The thing is not how many features should it be for a certain number of cases (i.e. entries) but rather the opposite:
It’s not who has the best algorithm that wins. It’s who has the most data. (Banko and Brill, 2001)
Banko and Brill in 2001 made a comparison among 4 different algorithms, they kept increasing the Training Set Size to millions and came up with the above-quoted conclusion.
Moreover, Prof. Andrew Ng clearly covered this topic, and I’m quoting here:
If a learning algorithm is suffering from high variance, getting more training data is likely to help.
If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much
So as a rule of thumb, your data cases must be greater than the number of features in your dataset taking into account that all features should be informative as much as possible (i.e. the features are not highly collinear (i.e. redundant)).
I read once in more than one place and somewhere in Scikit-Learn Documentation, that the number of inputs (i.e. samples) must be at least the square size of the number of features (i.e. n_samples > n_features ** 2 ).
Nevertheless, for SVM in particular, the number of features n v.s number of entries m is an important factor to specify the type of kernel to use initially, as a second rule of thumb for SVM in particular (also according to Prof. Andrew Ng):
If thr number of features is much greater than number of entries (i.e. n is up to 10K and m is up to 1K) --> use SVM without a kernel (i.e. "linear kernel") or use Logistic Regression.
If the number of features is small and if the number of entries is intermediate (i.e. n is up to 1K and m is up to 10K) --> use SVM with Gaussian kernel.
If the number of feature is small and if the number of entries is much larger (i.e. n is up to 1K and m > 50K) --> Create/add more features, then use SVM without a kernel or use Logistic Regression.
So I have a set of data, 1900 rows and 22 columns. 21 column is just numbers but that one crucial that I want to train the data on has 3 stages: a,b, and c.
I have tried both decision trees/jungles, and neural networks and no matter how I set them up I can't get more than 55% precision.
Usually it's around 50% accuracy and the best I was ever able to get was 55% overall accuracy and around 70% average.
Should I even use NN on a such small dataset? As I said I tried with other ML algorithms but they don't yield anything better.
I think that there is no clear answer to your question. Low accuracy score may come from a few reasons. I will state some of them in the following points :
When you use decision trees / neural networks - low accuracy may be a result of a wrong setup of metaparameters (like maximum height of a tree or number of trees in DT or wrong topology or data preparation in NN case). What I advise you is to use a grid or random search for both NN and DT to look for the best metaparameters for your algorithm (in case of "static" (not sequential data) packages like e.g. h20 in R or Scikit-learn in Python may do a great job) and in neural network case - normalize your data properly (e.g. subtract mean and divide by standard deviation every x column of your data).
Your dataset might be inconsistent. If e.g. your data has not a property that there exists a functional dependency between x and y (what means that y = f(x) for some f) then what is learnt during a training session is a probability that given x - your example belong to some specified class. This inconsistency might seriously harm your accuracy. What I advice you in this case is to try specify if that phenomenon occurs and then e.g. try to segmentate your data to solve the problem.
Your data set might be simply too small. Try to get more data in this case.
I used the caret package to train a random forest, including repeated cross-validation. I’d like to know whether the OOB, as in the original RF by Breiman, is used or whether this is replaced by the cross-validation. If it is replaced, do I have the same advantages as described in Breiman 2001, like increased accuracy by reducing the correlation between input data? As OOB is drawn with replacement and CV is drawn without replacement, are both procedures comparable? What is the OOB estimate of error rate (based on CV)?
How are the trees grown? Is CART used?
As this is my first thread, please let me know if you need more details. Many thanks in advance.
There are a lot of basic questions here and you would be better served by reading a book on machine learning or predictive modeling. Thats probably why you haven't gotten much of a response.
For caret you should also consult the package website where some of these questions are answered.
Here are some notes:
CV and OOB estimation for RF are somewhat different. This post might help explain how. For this application, the OOB rate from random forest is computed while the model is being build whereas CV uses holdout samples that are predicted after the random forest model is computed.
The original random forest model (used here) uses unpruned CART trees. Again, this is in many text books and papers.
Max
I recently got a little confused with this too, but reading chapter 4 in Applied Predictive Modeling by Max Kuhn helped me to understand the difference.
If you use randomForest in R, you grow a number of decision trees by sampling N cases with replacement (N is the number of cases in the training set). You then sample m variables at each node where m is less than the number of predictors. Each tree is then grown fully and terminal nodes are assigned to a class based on the mode of cases in that node. New cases are classified by sending them down all the trees and then taking a vote; the majority vote wins.
The key points to note here are:
how the trees are grown - sampling WITH replacement (a bootstrap). This means that some cases will be represented many times in your bootstrap sample and others may not be represented at all. The bootstrap sample will be the same size as your training dataset.
The cases that are not selected for building trees are referred to as the OOB samples- an OOB error estimate is calculated by classifying the cases that aren't selected when building a tree. About 63% of the data points in the bootstrap sample are represented at least once.
If you use caret in R, you will normally use caret::train(....) and specify the method as "rf" and trControl="repeatedcv". You can change trControl to "oob" if you want out of the bag. The way this works is as follows (I'm going to use a simple example of a 10 fold cv repeated 5 times): the training dataset is split into 10 folds of roughly equal size, a number of trees will be built using only 9 samples - so omitting the 1st fold (which is held out). The held out sample is predicted by running the cases through the trees and used to estimate performance measures. The first subset is returned to the training set and the procedure repeats with the 2nd subset held out, and so on. The process is repeated 10 times. This whole procedure can be repeated multiple times (in my example, I do this 5 times); for each of the 5 runs, the training dataset with be split into 10 slightly different folds. It should be noted that 50 different held out samples are used to calculate model efficacy.
The key points to note are:
this involves sampling WITHOUT replacement - you split the training data and build a model on 9 samples and predict the held out sample (the remaining 1 sample of the 10) and repeat this process as above
the model is built using a dataset that is smaller than the training dataset; this is different to the bootstrap method discussed above
You are using 2 different resampling techniques which will yield different results therefore they are not comparable. The k fold repeated cv tends to have low bias (for k large); where k is 2 or 3, bias is high and comparable to the bootstrap method. K fold cv tends to have high variance though...
Description of classification problem:
Assume a regular dataset X with n samples and d features.
This classification problem is somewhat hard (many features, few samples, low overall AUC ~70%).
It might be useful to mention that feature selection/extraction, dimension reduction, kernels, many classifiers have been applied. So I am not interested in trying these.
I am not looking forward to see an improvement in overall AUC. The goal is to find relevant features in haystack of features.
Description of my approach:
I select all pairwise combination of d features and create many two dimensional sub-datasets x with n samples.
On each sub-dataset x, I perform a 10-fold cross-validation (using all samples of the main dataset X). A very long process, assume weeks of computation.
I select top k pairs (according to highest AUC for example) and label them as +. All other pairs are labeled as -.
For each pair, I can compute several properties (e.g. relations between each pair using Expert's knowledge). These properties can be calculated without using the labels in main dataset X.
Now I have pairs which are labeled as + or -. In addition, each pair has many properties calculated based on Expert's knowledge (i.e. features). Hence, I have a new classification problem. Lets call this newly generated dataset Y.
I train a classifier on Y while following cross-validation rules. Surprisingly, I can predict the + and - labels with 90% AUC.
As far as I can see, it means that I am able to select relevant features. However, seeing a 90% AUC makes me worried about information leakage somewhere in this long process. Specially in step 3.
I was wondering if anyone can see any leakage in this approach.
Information Leakage:
Incorporation of target labels in the actual features. Your classifier will produce good prediction while did not learn anything.
Showing your test set to you classifier during the training phase. Your classifier will "memorize" the test set and its corresponding labels without "learning" anything.
Update 1:
I want to stress that indeed I am using all data points of X in step 1. However, I am not using them ever again (even for testing). The final 90% AUC is obtained from predicting labels of dataset Y.
On the other hand, it would be useful to note that, even if I randomize the values of my main dataset X, the computed features for dataset Y is going to be the same. However, the sample labels in Y would change because the previous + pairs might not be a good one anymore. Therefore they will be labeled as -.
Update 2:
Although I haven't got any opinion, I am going to state what I have got during 4 days of talking with pattern recognition researchers. Briefly I became confident that there is no information leakage (as long as I wont go back to the first dataset X and using its labels). Later on, in case I wanted to check to see if I could have better performance in X (i.e. predicting sample labels), I need to use only a part of dataset X for pairwise comparison (as training set). Then I can use the rest of samples in X as test set while using positively predicted pairs of Y as features.
I will set this as an answer in case no one could reject this method.
If your processes in step 1 uses all data. then the features you are learning have information from the whole data set. Since you selected based on the whole dataset and THEN validation, you are leaking serious information.
You should probably stick with tools that are well known / already done for you before running out and trying weird strategies like this. Try using a model with L1 regularization to do feature selection for your, or start with some of the simpler searches like Sequential Backward Selection.
If you do cross validation correctly in the end, each training will perform its own independent feature selection. If you do one global feature selection and then do CV, you are going to be doing it wrong and probably leaking information.
Is it possible to apply RandomForests to very small datasets?
I have a dataset with many variables but only 25 observation each. Random forests produce reasonable results with low OOB errors (10-25%).
Is there any rule of thumb regarding the minimum number of observations to use?
In fact one of the response variable is unbalanced, and if I'm going to subsample it I will end up with an even smaller number of observations.
Thanks in advance
Absolutely RF can be used on these type of datasets (i.e. p>n). In fact they use RF in fields like genomics where the number of fields >= 20000 and there are only a very small number of rows - say 10-12. The entire problem is figuring out which of the 20k variables would make up a parsimonious marker (i.e. feature selection is the entire problem).
I don't have any ROTs about minimum size other than if your model doesn't work well on a held back sample (or Hold-One-Back cross validation might work well in your case) well then you should try something else.
Hope this helps