I estimated a Random Forest Model to predict cardiovascular disease in a sample with 4234 observations across 23 characteristics. I have unbalanced data (3120 observations without cardiovascular disease and 1114 observations with cardiovascular disease). I separated my data into training and testing (80/20).
I also tried oversampling, undersampling, both combined, and SMOTE to improve my predictions. However, all of these strategies resulted in very low sensitivity (0.06 to 0.09), very high specificity (0.93 to 0.97), and low to fair accuracy (65% to 72%).
Could anyone help me adjust the parameters to improve my metrics?
I am using the randomForest() function in R (through RStudio).
Thanks.
I am using Random Forest for binary classification.
It gives me 85% accuracy when I train with all features (10 features).
After training, I visualized the feature importances. They show that 2 features are really important.
So I chose only the two important features and trained the RF again (with the same setup), but the accuracy decreased (to 0.70).
Can this happen? I was expecting higher accuracy.
What can I do to get better accuracy in this case?
Thanks
The general rule of thumb when using random forests is to include all observable data. The reason for this is that a priori, we don't know which features might influence the response and the model. Just because you found that there are only a handful of features which are strong influencers does not mean that the remaining features do not play some role in the model.
So, you should stick with including all features when training your random forest model. If certain features do not improve accuracy, they will rarely be chosen for splits and are effectively ignored during training. You typically do not need to remove features manually.
I have a dataset with thousands of sentences belonging to a subject. I would like to know the best way to create a classifier that will predict a text as "True" or "False" depending on whether it talks about that subject or not.
I've been using solutions with Weka (basic classifiers) and Tensorflow (neural network approaches).
I use string to word vector to preprocess the data.
Since there are no negative samples, I deal with a single class. I've tried one-class classifier (libSVM in Weka) but the number of false positives is so high I cannot use it.
I also tried adding negative samples, but when the text to predict does not fall in the negative space, the classifiers I've tried (NB, CNN, ...) tend to predict it as a false positive. I guess it's because of the sheer number of positive samples.
I'm open to discarding ML as the tool to predict the new incoming data if necessary.
Thanks for any help
I eventually added data for the negative class and built a Multinomial Naive Bayes classifier, which is doing the job as expected.
(the size of the data added is around one million samples :) )
My answer assumes that adding at least 100 negative samples to a dataset with 1000 positive samples is acceptable to the author of the question, since I have not yet received a reply to my question about it.
Since detecting a specific topic looks like a particular case of topic classification, I would recommend starting with a classification approach with two simple classes: one class for your topic and another for all other topics.
I succeeded with the same approach on a face recognition task. At the beginning I built a model with one output neuron that gave a high output when a face was detected and a low output when no face was detected.
However, that approach gave me too low an accuracy: less than 80%.
But when I tried using 2 output neurons, one class for a face present in the image and another for no face detected, it gave me more than 90% accuracy with an MLP, even without using a CNN.
The key point here is using the softmax function for the output layer. It gives a significant increase in accuracy. In my experience, it increased accuracy on the MNIST dataset, even for an MLP, from 92% to 97% for the same model.
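For reference, here is a minimal sketch of what a softmax output layer computes for the two-neuron setup described above. The logit values and the class order (face, no face) are made up for illustration; this is not the model from the answer.

```python
import math

def softmax(logits):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two output neurons: [face, no_face]
probs = softmax([2.0, 0.5])
predicted_class = probs.index(max(probs))  # index 0 stands for "face" here
```

The predicted class is simply the neuron with the highest probability, so the two outputs compete with each other instead of relying on a hand-picked threshold for a single neuron.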
About the dataset. Most supervised classification algorithms, at least in my experience, work better with an equal number of samples for each class in the training set. In fact, if one class has less than 10% of the average number of samples of the other classes, the model becomes almost useless for detecting that class. So if you have 1000 samples for your topic, I suggest creating 1000 negative samples covering as many different topics as possible.
Alternatively, if you don't want to create such a big set of negative samples, you can create a smaller set and use batch training with a batch size of 2x the number of negative samples. To do so, split your positive samples into N chunks, each about the size of the negative set, and then train your NN with N batches per iteration of the training process, where batch i contains chunk[i] of the positive samples plus all of your negative samples. Just be aware that lower accuracy will be the price of this trade-off.
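The chunking scheme above could be sketched roughly like this. The function name `balanced_batches` and the placeholder sample strings are hypothetical; the actual training call is left out because it depends on your framework.

```python
def balanced_batches(positives, negatives):
    """Yield batches that pair one chunk of positives with all negatives."""
    chunk_size = len(negatives)
    if chunk_size == 0:
        raise ValueError("need at least one negative sample")
    for start in range(0, len(positives), chunk_size):
        chunk = positives[start:start + chunk_size]
        # Label 1 = on-topic sample, label 0 = off-topic sample
        yield [(x, 1) for x in chunk] + [(x, 0) for x in negatives]

# Placeholder data: 1000 positive and 100 negative samples
positives = [f"pos_{i}" for i in range(1000)]
negatives = [f"neg_{i}" for i in range(100)]
batches = list(balanced_batches(positives, negatives))
```

With these sizes you get 10 batches of 200 samples each, and every batch is balanced 50/50. Note that the negatives are reused in every batch, which is one reason to expect the accuracy penalty mentioned above.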
Also, you could consider creating a more generic topic detector: figure out all the topics that can appear in the texts your model should analyze (for example, 10 topics) and create a training dataset with 1000 samples per topic. It can also give higher accuracy.
One more point about the dataset. The best practice is to train your model on only part of the dataset, for example 80%, and use the remaining 20% for validation. Validating on data the model has not seen before gives you a good estimate of its accuracy in real life, not just on the training set, and helps you detect overfitting.
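A minimal sketch of such an 80/20 hold-out split, using only the standard library. The function name `holdout_split` and the fixed seed are my own choices for illustration.

```python
import random

def holdout_split(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and split them into (train, test) lists."""
    shuffled = list(samples)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder data: integers stand in for 1000 labeled samples
train, test = holdout_split(range(1000))
```

Shuffling before splitting matters when the data is ordered by class or by date; otherwise the held-out 20% may not look like the training 80%.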
About building the model. I like a "from simple to complex" approach. So I would suggest starting with a simple MLP with a softmax output and a dataset with 1000 positive and 1000 negative samples. After reaching 80% to 90% accuracy you can consider using a CNN for your model, and I would also suggest increasing the size of the training dataset, because deep learning algorithms work better with bigger datasets.
For text data you can use Spy EM.
The basic idea is to combine your positive set with a whole bunch of random samples, some of which you hold out. You initially treat all the random documents as the negative class, and train a classifier with your positive samples and these negative samples.
Now some of those random samples will actually be positive, and you can conservatively relabel any documents that are scored higher than the lowest-scoring held-out true positive sample.
Then you iterate this process until it stabilizes.
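As a rough sketch of the relabeling step only, following the description above: the scores are made-up classifier outputs, and `spy_relabel` is a hypothetical helper. Training and re-scoring between iterations is up to whichever classifier you use.

```python
def spy_relabel(unlabeled_scores, spy_scores):
    """Return indices of unlabeled documents to relabel as positive.

    spy_scores are the classifier's scores for the held-out true positives.
    """
    # Conservative threshold: the lowest score given to a known positive.
    threshold = min(spy_scores)
    return [i for i, s in enumerate(unlabeled_scores) if s > threshold]

# Hypothetical scores from one training round
unlabeled_scores = [0.10, 0.85, 0.40, 0.95, 0.20]
spy_scores = [0.70, 0.90, 0.80]
relabel = spy_relabel(unlabeled_scores, spy_scores)
```

In a full loop you would move the relabeled documents into the positive set, retrain, score again, and stop once `relabel` comes back empty.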
I'm working on a project on multiclass classification of colorectal cancer stage using gene expression data. My dataset contains 11 biomarkers. The classification accuracy is around 40%. I have tried different models (KNN, SVM, neural networks, ...) as well as ensemble methods. Does anyone have any idea what I can do with the dataset to improve the results?
To decide what to do next, you will need some metrics:
How well can a team of human experts classify the data?
What is the model accuracy on the training dataset?
What is the model accuracy on the testing dataset?
If the training accuracy is much worse than human experts, you should increase the complexity of the model until the training results approach or exceed human experts. You can do this by increasing the number of input features, choosing a different machine learning model, or increasing the number of layers in the NN. If the training accuracy is poor, you need to improve this first before spending time improving the testing accuracy.
If the training accuracy is good but the testing accuracy is much worse than the training accuracy, you are probably overfitting. Get or create more training data, and use regularization.
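The decision rule above can be written down as a small helper. This is only a sketch: the function name `next_step`, the 0.05 tolerance, and the example accuracies are assumptions for illustration, not values from the question.

```python
def next_step(train_acc, test_acc, human_acc, tolerance=0.05):
    """Suggest what to work on next from three accuracy figures (0 to 1)."""
    # Fix poor training accuracy first; test accuracy cannot exceed it by much.
    if train_acc < human_acc - tolerance:
        return "underfitting: increase model capacity or add features"
    if test_acc < train_acc - tolerance:
        return "overfitting: get more training data or add regularization"
    return "ok: training and testing accuracy are both close to the target"

# Hypothetical numbers: 42% train, 40% test, experts at 80%
advice = next_step(train_acc=0.42, test_acc=0.40, human_acc=0.80)
```

If no human-expert figure is available, the best published result on comparable data is a reasonable stand-in for `human_acc`.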
EDITED:
I have a classification dataset of 350000 rows and 500 features. The features are a Tfidf vector.
My Y (the target) takes values from 1 to 16, classifying the sentences into 16 types.
The training and testing sets are randomly split.
When I send my data through a classification algorithm, I'm getting a huge difference in accuracy:
SVM and Naive Bayes are giving 20%+ (which is too low)
RandomForest gives around 55% accuracy, which is better but still low
Is there a reason why I'm getting such a huge difference across different algorithms, and is there a way to further increase the accuracy?
I'm trying to predict a person's personality from their tweets.
I have been working on sentiment analysis prediction using the Rotten Tomatoes movie reviews dataset.
The dataset has 5 classes {0,1,2,3,4}, where 0 is very negative and 4 is very positive.
The dataset is highly unbalanced:
total samples = 156061
'0': 7072 (4.5%),
'1': 27273 (17.4%),
'2': 79583 (50.9%),
'3': 32927 (21%),
'4': 9206 (5.8%)
As you can see, class 2 has about 50% of the samples, while classes 0 and 4 together contribute only ~10% of the training set.
So there is a very strong bias toward class 2, which reduces the classification accuracy for classes 0 and 4.
What can I do to balance the dataset? One solution would be to get equal number of samples by reducing the samples to only 7072 for each class, but it reduces the dataset drastically!
How can I optimize and balance the dataset without affecting the accuracy of overall classification?
You should not balance the dataset; you should train a classifier in a balanced manner. Nearly all existing classifiers can be trained with some cost-sensitive objective. For example, SVMs let you "weight" your samples, so simply weight samples of the smaller classes more. Similarly, Naive Bayes has class priors: change them! Random forests, neural networks, logistic regression: they all let you somehow "weight" samples, and it is the core technique for getting more balanced results.
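To make the weighting idea concrete, here is a small sketch that derives per-class weights from the class counts in the question. It uses the common inverse-frequency heuristic `n_samples / (n_classes * class_count)`, which is also what scikit-learn documents for `class_weight='balanced'`. The helper name is hypothetical, and how you pass the weights on depends on your classifier.

```python
# Class counts taken from the question
counts = {0: 7072, 1: 27273, 2: 79583, 3: 32927, 4: 9206}

def balanced_class_weights(counts):
    """Weight each class inversely to its frequency."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

weights = balanced_class_weights(counts)
# Rare classes (0 and 4) get weights above 1, the dominant class 2 below 1.
```

With these weights every class contributes the same total weight to the training objective, so no samples have to be thrown away to balance the classes.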
For classification problems, you can try the class_weight='balanced' option in your estimator, such as logistic regression, SVM, etc. For example:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression