I'm using the Naive Bayes classifier in Weka on a data set of 7000 instances with 15 attributes. My baseline accuracy is 87.5% using ZeroR. As part of data preprocessing I normalized the data set to zero mean and unit variance and applied a filter to randomize the order of the instances. I've used a training (70%) and testing (30%) split as well as 10-fold cross-validation on the entire data set, and applied supervised discretization and attribute selection. The best accuracy I got from the classifier is 93.43%. Is this a small improvement relative to the baseline accuracy?
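One way to put the two numbers side by side is to look at how many of the baseline's errors the classifier removes, rather than at the raw accuracy difference. The sketch below is only an illustration: the label list is hypothetical (a two-class set built to match the 87.5% majority share), and the helper names are not part of Weka.

```python
from collections import Counter


def zero_r_accuracy(labels):
    """Accuracy of always predicting the most frequent class (what ZeroR does)."""
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)


def relative_error_reduction(baseline_acc, model_acc):
    """Fraction of the baseline's errors that the model eliminates."""
    baseline_err = 1.0 - baseline_acc
    model_err = 1.0 - model_acc
    return (baseline_err - model_err) / baseline_err


# Hypothetical labels: 7000 instances, 87.5% of them in the majority class.
labels = ["majority"] * 6125 + ["minority"] * 875

baseline = zero_r_accuracy(labels)
reduction = relative_error_reduction(baseline, 0.9343)

print(f"baseline accuracy: {baseline:.4f}")
print(f"relative error reduction: {reduction:.4f}")
```

Read this way, moving from 87.5% to 93.43% removes a bit under half of the errors that the majority-class baseline makes.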
I have trained an LSTM and a decision tree on my data set (a text classification task), using k-fold cross-validation with k=10.
Decision tree accuracy: 61%
LSTM accuracy: 90%
When I predict on completely unseen data, however, the decision tree performs better than the LSTM.
Why does this happen? If the LSTM's accuracy is higher, why does the decision tree perform better on unseen data?
Your LSTM model may have greater accuracy than a decision tree when training, but the fact that it doesn't generalize well to unseen data indicates that the LSTM is overfitting to the training data. Try adjusting the train-validation split and batch size to see if that improves your models.
The validation loss during training would indicate which model is better. You can also try a random forest (an ensemble of decision trees), which often gives better results than a single decision tree.
I am new to machine learning. I have built a model that predicts whether a client will subscribe in the following month. I got 73.4 on the training set and 72.8 on the test set. Is that okay, or am I overfitting?
It's ok.
Overfitting shows up when the accuracy on the training set is higher than the accuracy on the test set by a wide margin, not by a fraction of a percentage point.
This is what overfitting looks like.
Train accuracy: 99.4%
Test accuracy: 71.4%
You can, however, try to increase the accuracy using different models and feature engineering.
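The comparison made in this answer can be sketched as a simple gap check. This is a rough heuristic, not a formal test: the 5-point cut-off `max_gap` is a hypothetical value chosen only for illustration, and a sensible threshold depends on the data and the task.

```python
def accuracy_gap(train_acc, test_acc):
    """Training accuracy minus test accuracy, in percentage points."""
    return train_acc - test_acc


def looks_overfit(train_acc, test_acc, max_gap=5.0):
    """Flag a model whose training accuracy exceeds test accuracy by more than max_gap.

    max_gap is a hypothetical cut-off used for illustration only.
    """
    return accuracy_gap(train_acc, test_acc) > max_gap


# The two cases from the discussion: a small gap and a large one.
small_gap_case = looks_overfit(73.4, 72.8)
large_gap_case = looks_overfit(99.4, 71.4)

print(small_gap_case)  # False
print(large_gap_case)  # True
```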
We call it overfitting if the accuracy on the training data is abnormally high (say, greater than 95%) while the accuracy on the test data is very low (say, less than 65%).
In your case, the training and testing accuracies are almost the same, so there is no overfitting.
Try more test data and check whether the accuracy decreases. You can also try to improve the model by:
Trying different algorithms
Increasing the size of train data
Trying K-fold cross validation
Hyperparameter tuning
Using regularization methods
Standardizing feature variables
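As an illustration of the k-fold cross-validation item in the list above, here is a minimal sketch of how the folds can be formed. It only produces index splits; the function name is hypothetical, and in practice you would shuffle the data first and train and score a model on each split.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Averaging the test score over all folds gives a more stable estimate than a single train/test split.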
Which metric is better for multi-label classification in Keras: accuracy or categorical_accuracy? The last activation function is sigmoid and the loss function is binary_crossentropy in this case.
I would not use accuracy for classification tasks with unbalanced classes.
Especially in multi-label tasks, most of your labels are probably False. That is, each data point has only a small set of labels compared to the number of all possible labels.
For that reason accuracy is not a good metric: if your model predicts all False (sigmoid activation output < 0.5), you would still measure a very high accuracy.
I would analyze either the AUC or recall/precision at each epoch.
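A small sketch of the problem, using plain Python rather than Keras: with sparse labels, a model that predicts all False still scores high on per-label accuracy while its recall is zero. The data below is hypothetical (4 samples, 10 possible labels, one active label each).

```python
def flatten(rows):
    """Turn a list of label rows into one flat list of 0/1 values."""
    return [value for row in rows for value in row]


def accuracy(y_true, y_pred):
    """Fraction of individual label slots predicted correctly."""
    t, p = flatten(y_true), flatten(y_pred)
    return sum(a == b for a, b in zip(t, p)) / len(t)


def recall(y_true, y_pred):
    """Fraction of the truly active labels that were predicted active."""
    t, p = flatten(y_true), flatten(y_pred)
    true_pos = sum(a == 1 and b == 1 for a, b in zip(t, p))
    actual_pos = sum(t)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical data: sample i has only label i switched on.
y_true = [[1 if j == i else 0 for j in range(10)] for i in range(4)]
y_pred_all_false = [[0] * 10 for _ in range(4)]

acc = accuracy(y_true, y_pred_all_false)
rec = recall(y_true, y_pred_all_false)

print(acc)  # 0.9
print(rec)  # 0.0
```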
Alternatively, a multi-label task can be seen as a ranking task (like recommender systems), and you could evaluate precision@k or recall@k, where k is the number of top predicted labels considered.
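The ranking view can be sketched for a single sample as follows. This is a plain-Python illustration, not a Keras metric; the function names, scores and label set are hypothetical.

```python
def top_k_labels(scores, k):
    """Indices of the k highest-scoring labels."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def precision_at_k(true_labels, scores, k):
    """Share of the top-k predicted labels that are actually correct."""
    hits = sum(1 for i in top_k_labels(scores, k) if i in true_labels)
    return hits / k


def recall_at_k(true_labels, scores, k):
    """Share of the true labels that appear among the top-k predictions."""
    hits = sum(1 for i in top_k_labels(scores, k) if i in true_labels)
    return hits / len(true_labels) if true_labels else 0.0


# Hypothetical sample: labels 0 and 3 are the true ones.
true_labels = {0, 3}
scores = [0.9, 0.8, 0.1, 0.7, 0.2]

p_at_2 = precision_at_k(true_labels, scores, k=2)
r_at_3 = recall_at_k(true_labels, scores, k=3)

print(p_at_2)  # 0.5
print(r_at_3)  # 1.0
```

In practice these values would be averaged over all samples in the evaluation set.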
If your Keras back-end is TensorFlow, check out the full list of supported metrics here: https://www.tensorflow.org/api_docs/python/tf/keras/metrics.
Actually, the string 'accuracy' does not name one fixed metric in Keras. When you set metrics=['accuracy'], the appropriate accuracy metric is inferred automatically based on the loss function used. As a result, since you have used binary_crossentropy as the loss function, binary_accuracy will be chosen as the metric.
Now, you should definitely choose binary_accuracy over categorical_accuracy in a multi-label classification task, since the classes are independent of each other and the prediction for each class should be considered independently of the predictions for the other classes.
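To see why the two metrics disagree on multi-label output, here is a plain-Python re-implementation of what they compute. It mirrors the usual definitions (per-label thresholding at 0.5 versus comparing argmax positions) and is meant as an illustration, not as the Keras source; the example arrays are hypothetical.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of individual labels that are correct after thresholding."""
    correct = 0
    total = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            correct += int(t == (1 if p > threshold else 0))
            total += 1
    return correct / total


def argmax(row):
    """Index of the largest value (the first one in case of ties)."""
    return max(range(len(row)), key=lambda i: row[i])


def categorical_accuracy(y_true, y_pred):
    """Fraction of samples where the argmax of y_pred matches the argmax of y_true."""
    matches = [argmax(t) == argmax(p) for t, p in zip(y_true, y_pred)]
    return sum(matches) / len(matches)


# Hypothetical multi-label batch: two samples, three labels, two active labels each.
y_true = [[1, 1, 0], [0, 1, 1]]
y_pred = [[0.9, 0.8, 0.1], [0.2, 0.6, 0.7]]

bin_acc = binary_accuracy(y_true, y_pred)
cat_acc = categorical_accuracy(y_true, y_pred)

print(bin_acc)  # 1.0
print(cat_acc)  # 0.5
```

Every label in this batch is predicted correctly, yet the argmax-based metric reports 0.5, because it only looks at one position per sample.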
I'm working on a project on multiclass classification of colorectal cancer stage using gene expression data. My dataset contains 11 biomarkers. The results from the classification are around 40%. I have tried different classification models (KNN, SVM, neural networks, ...) as well as ensemble learning algorithms. Does anyone have any idea what I can do with the dataset to improve the results?
To decide what to do next, you will need some metrics:
How well can a team of human experts classify the data?
What is the model accuracy on the training dataset?
What is the model accuracy on the testing dataset?
If the training accuracy is much worse than human experts, you should increase the complexity of the model until the training results approach or exceed human experts. You can do this by increasing the number of input features, choosing a different machine learning model, or increasing the number of layers in the NN. If the training accuracy is poor, you need to improve this first before spending time improving the testing accuracy.
If the training accuracy is good but the testing accuracy is much worse than the training accuracy, you are probably overfitting. Get or create more training data, and use regularization.
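The decision procedure described in this answer can be written down as a small helper. This is a sketch under assumptions: the function name is made up, and `tolerance` (how large a gap counts as "much worse") is a hypothetical value you would pick for your own problem.

```python
def next_step(human_acc, train_acc, test_acc, tolerance=0.05):
    """Suggest what to work on next from three accuracies given as fractions in [0, 1].

    tolerance is a hypothetical threshold for what counts as "much worse".
    """
    if train_acc < human_acc - tolerance:
        # The model cannot even fit the training data well: fix this first.
        return "underfitting: increase model complexity or add input features"
    if test_acc < train_acc - tolerance:
        # Training is fine but the model does not generalize.
        return "overfitting: get more training data and use regularization"
    return "no large gap: training and testing accuracy are both near the target"


# Hypothetical numbers for illustration only.
print(next_step(human_acc=0.90, train_acc=0.40, test_acc=0.38))
print(next_step(human_acc=0.90, train_acc=0.95, test_acc=0.60))
print(next_step(human_acc=0.90, train_acc=0.91, test_acc=0.89))
```

Checking the training accuracy first reflects the order given above: there is little point in tuning for test accuracy while the model still fits the training data poorly.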
Could somebody give me an example showing how Platt scaling is used along with k-fold cross-validation in multiclass SVM classification in libsvm?
I have divided the whole dataset into two parts: training and testing. For cross-validation I am partitioning the training data such that one partition is for testing and the rest is for training the multiclass SVM classifier.
Platt scaling has nothing to do with your partitioning or multiclass setting. Platt scaling is an internal technique of each individual binary SVM, which uses only the training data. It is really just fitting a logistic regression on top of your learned SVM projections.
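To make the "logistic regression on top of SVM projections" idea concrete, here is a minimal sketch for one binary SVM. It is not libsvm's implementation: it uses plain gradient descent, skips the smoothed targets from Platt's original method, and the decision values and labels are hypothetical stand-ins for the outputs of an already trained SVM.

```python
import math


def fit_platt(decision_values, labels, lr=0.1, n_iter=2000):
    """Fit A, B in P(y=1 | f) = 1 / (1 + exp(A*f + B)) by gradient descent."""
    a, b = 0.0, 0.0
    n = len(decision_values)
    for _ in range(n_iter):
        grad_a = 0.0
        grad_b = 0.0
        for f, y in zip(decision_values, labels):
            p = 1.0 / (1.0 + math.exp(a * f + b))
            # Derivative of the negative log-likelihood w.r.t. (A*f + B) is (y - p).
            grad_a += (y - p) * f
            grad_b += (y - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def platt_probability(f, a, b):
    """Map an SVM decision value f to a probability using the fitted A and B."""
    return 1.0 / (1.0 + math.exp(a * f + b))


# Hypothetical decision values f(x) of a trained binary SVM and their 0/1 labels.
decision_values = [-2.0, -1.0, -0.5, 0.3, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(decision_values, labels)
probs = [platt_probability(f, a, b) for f in decision_values]
print([round(p, 3) for p in probs])
```

The fitted A comes out negative, so larger decision values map to higher probabilities. In a multiclass setting this fit would be done separately for each binary SVM.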