I have a multi-class machine learning problem on which I will try different methods, such as logistic regression, decision trees, and a multilayer perceptron.
The observations in the data set have an attribute which is an index from 1-5 which defines how important it is that a certain observation gets correctly classified (index 1 very important, 5 not important at all). My questions are:
Question 1: How should I emphasize to the models that the lower index observations have greater importance? I am thinking of duplicating these observations so the models fit the lower index observations better. What other approaches are possible?
Question 2: What performance evaluation criteria can I use to find the models that predict these low index observations well (apart from calculating the distribution of indexes among the correctly predicted instances)?
Regards,
Answer 1: Presenting the important patterns of the training set more often is the standard approach for this. If your training algorithm has something like a learning rate (for example, if you use backpropagation), you could also increase this parameter for the high-priority patterns.
Answer 2: I would use a weighted mean squared error and give the errors of the high-priority patterns a larger weight.
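Many scikit-learn estimators accept per-sample weights, which has much the same effect as duplicating observations without enlarging the data set. The sketch below is only an illustration under assumptions: the data is synthetic, the mapping from priority index to weight (6 minus the index) is a hypothetical choice, and a priority-weighted accuracy stands in for the weighted mean squared error mentioned above because the problem is a classification one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 3 classes plus a priority index 1-5.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
rng = np.random.default_rng(0)
priority = rng.integers(1, 6, size=len(y))  # 1 = very important, 5 = not important

# Hypothetical mapping from priority index to weight: index 1 -> 5, index 5 -> 1.
weights = 6 - priority

X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
    X, y, weights, test_size=0.3, random_state=0, stratify=y)

# Question 1: pass the weights to fit() so errors on important rows cost more.
model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr, sample_weight=w_tr)

# Question 2: score with the same weights so important rows dominate the metric.
pred = model.predict(X_te)
plain_acc = accuracy_score(y_te, pred)
weighted_acc = accuracy_score(y_te, pred, sample_weight=w_te)
print(f"plain accuracy: {plain_acc:.3f}, priority-weighted accuracy: {weighted_acc:.3f}")
```

Comparing models on the weighted score rather than the plain one favours those that get the low-index observations right.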
I have a question about variable importance ranking.
I built an MLP and an RF model using the same dataset with 34 variables and achieved the same accuracy on a similar test dataset. As you can see in the picture below the top variables for the SHAP summary plot and the RF VIM are quite different.
Interestingly, I removed the low-ranked variable from the MLP and the accuracy increased. However, the RF result didn’t change.
Does that mean the RF is not a good choice for modeling this dataset?
It’s still strange to me that the rankings are so different:
SHAP summary plot vs. RF VIM, I numbered the top and low-ranked variable
Shouldn't the variables ranking be the same for MLP and RF?
No. Different algorithms may tend to rank certain features higher, but there is no reason for the rankings to be the same.
Different algorithms:
May have different objective functions to achieve the intended goal.
May use features differently to reach the minimum (or maximum) of the objective function.
On top of that, what you cite as RF "feature importances" (mean decrease in Gini) is only one of many ways to calculate "feature importance" for RF (they differ in which metric is used and in how the total decrease due to a feature is calculated). In contrast, SHAP is model agnostic when it comes to explaining feature contributions to the outcome.
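To illustrate that even a single random forest can yield different rankings depending on the importance measure, here is a minimal sketch on synthetic data. It compares scikit-learn's impurity-based `feature_importances_` with `permutation_importance`. SHAP is deliberately left out so the example needs no extra package, and the data set and parameters are assumptions rather than anything from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 8 features, only 3 of them informative.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Ranking 1: mean decrease in impurity (Gini), computed from the training data.
gini_rank = np.argsort(rf.feature_importances_)[::-1]

# Ranking 2: drop in held-out accuracy when a feature's values are shuffled.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
perm_rank = np.argsort(perm.importances_mean)[::-1]

print("impurity-based ranking:   ", gini_rank.tolist())
print("permutation-based ranking:", perm_rank.tolist())
```

The two rankings usually agree on the strongest features and differ further down the list, which is the same kind of disagreement seen between the SHAP plot and the RF VIM.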
In sum:
Different models will have different opinions about what is important and what is not. What is important for one algorithm may be less important for another, and vice versa. This says nothing about the applicability of a model to a specific dataset.
Use SHAP values (or any other feature importance metric that you and your clients understand) to explain a model (if necessary).
Choose the "best" model based on your goals: performance or explainability.
Based on data that our business department supplied to us, I used the sklearn decision tree algorithm to determine the ROC_AUC for a binary classification problem.
The data consists of 450 rows and there are 30 features in the data.
I used 10 times StratifiedKFold repetition/split of training and test data. As a result, I got the following ROC_AUC values:
0.624
0.594
0.522
0.623
0.585
0.656
0.629
0.719
0.589
0.589
0.592
As I am new to machine learning, I am unsure whether such a variation in the ROC_AUC values can be expected (with a minimum of 0.522 and a maximum of 0.719).
My questions are:
Is such a big variation to be expected?
Could it be reduced with more data (=rows)?
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Well, you do k-fold splits to actually evaluate how well your model generalizes.
Therefore, from your current results I would assume the following:
This is a difficult problem; the AUCs are mostly low.
0.719 is an outlier; you were (probably) just lucky there.
Important questions that will help us help you:
What is the proportion of the binary classes? Are they balanced?
What are the features? Are they all continuous? If categorical, are they ordinal or nominal?
Why Decision Tree? Have you tried other methods? Logistic Regression for instance is a good start before you move on to more advanced ML methods.
You should run more iterations: instead of k-fold, use the ShuffleSplit function, run at least 100 iterations, and compute the average AUC with a 95% confidence interval. That will give you a better idea of how well the models perform.
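A minimal sketch of that suggestion, assuming synthetic data with the same shape as in the question (450 rows, 30 features). It uses `StratifiedShuffleSplit`, the stratified variant of ShuffleSplit, and reports the empirical 2.5 to 97.5 percentile range of the 100 scores. That range describes the spread across splits. It is not a formal confidence interval for the mean, because the random splits overlap.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the data in the question.
X, y = make_classification(n_samples=450, n_features=30, n_informative=5,
                           random_state=0)

# 100 random stratified train/test splits, each holding out 10% of the rows.
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=cv, scoring="roc_auc")

mean_auc = scores.mean()
low, high = np.percentile(scores, [2.5, 97.5])
print(f"mean AUC {mean_auc:.3f}, 95% of splits fall in [{low:.3f}, {high:.3f}]")
```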
Hope this helps!
Is such a big variation to be expected?
This is a textbook case of high variance.
Depending on the difficulty of your problem, 405 training samples may not be enough for the model to generalize properly, and an unconstrained decision tree may be too flexible.
Try adding some regularization by limiting the number of splits that the tree is allowed to make. This should reduce the variance of your model, though average performance may drop somewhat.
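As a rough sketch of what limiting the splits can look like in scikit-learn, the example below compares an unconstrained tree with one restricted through `max_depth` and `min_samples_leaf`. The data is synthetic and the parameter values are arbitrary assumptions, so they would need tuning on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the data in the question.
X, y = make_classification(n_samples=450, n_features=30, n_informative=5,
                           random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)

candidates = {
    "unconstrained": DecisionTreeClassifier(random_state=0),
    # Shallow tree with large leaves: fewer splits, hence lower variance.
    "regularized": DecisionTreeClassifier(max_depth=3, min_samples_leaf=20,
                                          random_state=0),
}

results = {}
for name, tree in candidates.items():
    scores = cross_val_score(tree, X, y, cv=cv, scoring="roc_auc")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean AUC {scores.mean():.3f}, std {scores.std():.3f}")
```

Comparing the standard deviations of the two rows shows how much of the fold-to-fold variation comes from the model rather than from the small test folds.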
Could it be reduced with more data (=rows)?
Yes, adding data is the other popular way of lowering the variance of your model. If you're familiar with deep learning, you'll know that deep models usually need LOTS of samples to learn properly. That's because they are very powerful models with an intrinsically high variance, and therefore a lot of data is needed for them to generalize.
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Variance will decrease with regularization and with more data; it is not tied to the actual performance "number" that you get.
Cheers
I have a minimal example of a neural network with a back-propagation trainer, and I am testing it on the Iris data set. I started off with 7 hidden nodes and it worked well.
I lowered the number of nodes in the hidden layer to 1 (expecting it to fail), but was surprised to see that the accuracy went up.
I set up the experiment in Azure ML, just to validate that it wasn't my code. Same thing there: 98.3333% accuracy with a single hidden node.
Can anyone explain to me what is happening here?
First, it has been well established that a variety of classification models yield incredibly good results on Iris (Iris is very predictable); see here, for example.
Secondly, we can observe that there are relatively few features in the Iris dataset. Moreover, if you look at the dataset description you can see that two of the features are very highly correlated with the class outcomes.
These correlation values are linear, single-feature correlations, which indicates that one can most likely apply a linear model and observe good results. Neural nets are highly nonlinear; they become more and more complex and capture greater and greater nonlinear feature combinations as the number of hidden nodes and hidden layers is increased.
Taken together, these facts, (a) that there are few features to begin with and (b) that there are high linear correlations with class, point to a less complex, linear function as the appropriate predictive model. By using a single hidden node, you are very nearly using a linear model.
It can also be noted that, in the absence of any hidden layer (i.e., just input and output nodes), and when the logistic transfer function is used, this is equivalent to logistic regression.
Just adding to DMlash's very good answer: The Iris data set can even be predicted with a very high accuracy (96%) by using just three simple rules on only one attribute:
If Petal.Width = (0.0976,0.791] then Species = setosa
If Petal.Width = (0.791,1.63] then Species = versicolor
If Petal.Width = (1.63,2.5] then Species = virginica
In general, neural networks are black boxes where you never really know what they are learning, but in this case reverse-engineering should be easy. It is conceivable that the network learned something like the above.
The above rules were found by using the OneR package.
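The three rules are easy to check directly. The sketch below applies the two inner cut points (0.791 and 1.63) to the petal width column of the copy of Iris shipped with scikit-learn. That copy is assumed to match the one used with OneR for this purpose, and the outer interval bounds are ignored because they only delimit the observed range.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
petal_width = iris.data[:, 3]  # column 3 is "petal width (cm)"
# iris.target encodes 0 = setosa, 1 = versicolor, 2 = virginica

# With right=True, np.digitize returns i where bins[i-1] < x <= bins[i]:
#   x <= 0.791        -> 0 (setosa)
#   0.791 < x <= 1.63 -> 1 (versicolor)
#   x > 1.63          -> 2 (virginica)
pred = np.digitize(petal_width, bins=[0.791, 1.63], right=True)

accuracy = (pred == iris.target).mean()
print(f"accuracy of the one-attribute rules: {accuracy:.3f}")
```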
I am interested in any tips on how to train a classifier with a very limited positive set and a large negative set.
I have about 40 positive examples (quite lengthy articles about a particular topic) and about 19,000 negative samples (most drawn from the scikit-learn newsgroups dataset). I also have about 1,000,000 tweets that I could work with as negatives for the topic I am trying to train on. Is the size of the negative set versus the positive set going to negatively influence training a classifier?
I would like to use cross-validation in scikit-learn. Do I need to break this into train / test-dev / test sets? I know there are some pre-built libraries in scikit-learn. Any implementation examples that you recommend or have used previously would be helpful.
Thanks!
The answer to your first question is yes; the amount by which it will affect your results depends on the algorithm. My advice would be to keep an eye on class-based statistics such as recall and precision (found in classification_report).
For RandomForest() you can look at this thread, which discusses the sample weight parameter. In general, sample_weight is what you're looking for in scikit-learn.
For SVMs, have a look at either this example or this example.
For NB classifiers, this should be handled implicitly by Bayes' rule; in practice, however, you may see poor performance.
Your second question is up for discussion. Personally, I break my data into a training and test split, perform cross-validation on the training set for parameter estimation, retrain on all the training data, and then test on my test set. However, the amount of data you have may influence the way you split your data (more data means more options).
You could probably use Random Forest for your classification problem. There are basically 3 parameters for dealing with data imbalance: Class Weight, Samplesize and Cutoff.
Class Weight: The higher the weight a class is given, the more its error rate is decreased.
Samplesize: Oversample the minority class to reduce class imbalance while sampling the data for each tree (not sure if scikit-learn supports this; it used to be a parameter in R).
Cutoff: If more than x% of the trees vote for the minority class, classify the sample as the minority class. By default x is 50 in Random Forest for a 2-class problem. You can set it to a lower value for the minority class.
Check out balancing predict error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
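Two of these three knobs can be sketched in scikit-learn as follows: `class_weight="balanced"` for the class weight, and a manual threshold on `predict_proba` for the cutoff. Note that scikit-learn averages the per-tree class probabilities rather than counting hard votes, so the threshold only approximates the vote cutoff described above. The data is synthetic and the 0.2 threshold is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 3% positives.
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Class weight: weight each class inversely to its frequency.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            random_state=0)
rf.fit(X_tr, y_tr)

# Cutoff: predict the minority class at a lower probability threshold.
proba_pos = rf.predict_proba(X_te)[:, 1]
pred_default = (proba_pos >= 0.5).astype(int)
pred_lowered = (proba_pos >= 0.2).astype(int)  # 0.2 is an arbitrary example

recall_default = recall_score(y_te, pred_default)
recall_lowered = recall_score(y_te, pred_lowered)
print(f"minority recall at 0.5: {recall_default:.3f}, at 0.2: {recall_lowered:.3f}")
```

Lowering the cutoff can only raise minority recall, but precision typically falls, so the threshold is best chosen on held-out data.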
For the 2nd question, if you are using Random Forest you do not need to keep a separate train/validation/test set. Random Forest does not choose any parameters based on a validation set, so a validation set is unnecessary.
Also, during the training of Random Forest, the data for training each individual tree is obtained by sampling with replacement from the training data, so each training sample is left out of roughly 1/3 of the trees. We can use the votes of those trees to compute the out-of-bag (OOB) prediction for that sample. Thus, with OOB accuracy you need only a training set, and no validation or test data, to estimate performance on unseen data. Check Out of Bag error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm for further study.
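In scikit-learn the out-of-bag estimate is available by setting `oob_score=True`. This is a minimal sketch on synthetic data, and the data set and number of trees are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# bootstrap=True (the default) is required for out-of-bag scoring.
rf = RandomForestClassifier(n_estimators=300, bootstrap=True, oob_score=True,
                            random_state=0)
rf.fit(X, y)

# Accuracy computed for each sample using only the trees that never saw it.
oob_accuracy = rf.oob_score_
print(f"out-of-bag accuracy: {oob_accuracy:.3f}")
```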
I'm trying to perform leave-one-out cross-validation for modelling a particular problem using a back-propagation neural network. I have 8 features in my training data and 20 instances. I'm trying to make the NN learn a function in order to build a prediction model. The problem is that the prediction error rate is quite high. My guess is that the number of training instances is too small relative to the number of features under consideration. Is this conclusion correct? Is there an optimal feature-to-instance ratio?
(This topic is often phrased in the ML literature as the acceptable size or shape of the data set, given that a data set is often described as an m x n matrix in which m is the number of rows (data points) and n is the number of columns (features); obviously m >> n is preferred.)
In any event, I am not aware of a general rule for an acceptable range of features-to-observations; there are probably a couple of reasons for this:
such a ratio would depend strongly on the quality of the data (signal-to-noise ratio); and
the number of features is just one element of model complexity (another is, e.g., interaction among the features), and model complexity is the strongest determinant of the number of data instances (data points) needed.
So there are two sets of approaches to this problem, which are not mutually exclusive, so both can be applied to the same model:
reduce the number of features; or
use a statistical technique to leverage the data that you do have
A couple of suggestions, one for each of the two paths above:
Eliminate "non-important" features, i.e., those features that don't contribute to the variability in your response variable. Principal Component Analysis (PCA) is a fast and reliable way to do this, though there are a number of other techniques, which are generally subsumed under the rubric "dimension reduction."
Use Bootstrap methods instead of cross-validation. The difference in methodology seems slight, but the (often substantial) improvement in reducing prediction error is well documented for multi-layer perceptrons (neural networks) (see, e.g., Efron, B. and Tibshirani, R., Improvements on cross-validation: The .632+ bootstrap method, J. of the American Statistical Association, 92, 548-560, 1997). If you are not familiar with Bootstrap methods for splitting training and testing data, the general technique is similar to cross-validation, except that instead of partitioning the data set into disjoint subsets you repeatedly draw samples from it with replacement. Section 7.11 of Elements is a good introduction to Bootstrap methods.
The best single source on this general topic that I have found is Chapter 7, Model Assessment and Selection, from the excellent treatise Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. This book is available as a free download from the book's homepage.
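To make the resampling idea concrete, here is a minimal sketch of a plain out-of-bag bootstrap error estimate on synthetic data of the same shape as in the question (20 instances, 8 features). It is not the .632+ estimator from the cited paper, and a `Ridge` regressor stands in for the neural network purely to keep the example fast. Any estimator with `fit` and `predict` could be substituted.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic stand-in: 20 instances, 8 features, as in the question.
X, y = make_regression(n_samples=20, n_features=8, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
n = len(y)

oob_errors = []
for _ in range(200):
    # Draw n row indices with replacement to form the bootstrap training set.
    boot_idx = rng.integers(0, n, size=n)
    # Rows never drawn form the out-of-bag test set for this round.
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[boot_idx] = False
    if not oob_mask.any():
        continue  # extremely unlikely: every row was drawn at least once
    model = Ridge(alpha=1.0).fit(X[boot_idx], y[boot_idx])
    oob_errors.append(mean_squared_error(y[oob_mask], model.predict(X[oob_mask])))

boot_mse = float(np.mean(oob_errors))
print(f"bootstrap out-of-bag MSE over {len(oob_errors)} rounds: {boot_mse:.1f}")
```

Because each round trains on only about 63% of the distinct rows, this plain estimate tends to be pessimistic, which is the bias that the .632 family of corrections addresses.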