I am using the Weka GUI for classification. I am new to Weka and getting confused by the options
Use training Set
Supplied test set
Cross validation
to train my classification algorithm (for example J48). I trained with 10-fold cross-validation and the accuracy is pretty good (97%). When I test my classifier, the accuracy drops to about 72%. I am so confused. Any tips please? This is how I did it:
I train my model on the training data (for example train.arff)
I right-click in the Results list on the item whose model I want to save
select Save model and save it, for example, as j48tree.model
and then
I load the test data (for example test.arff) via the Supplied test set button
I right-click in the Results list, select Load model, and choose j48tree.model
I select Re-evaluate model on current test set
Is the way I do it wrong? Why does the accuracy drop so miserably from 97% to 72%? Or is doing only the 10-fold cross-validation enough to train and test the classifier?
Note: my training and testing datasets have the same attributes and labels. The only difference is that I have more data in the testing set, which I don't think should be a problem.
I don't think there is any issue with how you use WEKA.
You mentioned that your test set is larger than your training set? What is the split? The usual rule of thumb is that the test set should be about 1/4 of the whole dataset, i.e. three times smaller than the training set and definitely not larger. This alone could explain the drop from 97% to 72%, which by the way is not so bad for a real-life case.
It will also be helpful to build a learning curve (https://weka.wikispaces.com/Learning+curves), as it will tell you whether you have a bias or a variance issue. Judging by your values, it sounds like you have high variance (i.e. too many parameters for your dataset), so adding more examples or changing your split between training and test set will likely help.
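For illustration, here is a minimal learning-curve sketch in Python with scikit-learn rather than WEKA; the dataset, the DecisionTreeClassifier stand-in for J48, and the fold count are placeholders, not the author's setup:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier  # rough stand-in for J48

# Placeholder data, only for illustration
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

train_sizes, train_scores, cv_scores = learning_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    cv=10, train_sizes=np.linspace(0.1, 1.0, 8), scoring="accuracy")

plt.plot(train_sizes, train_scores.mean(axis=1), label="training accuracy")
plt.plot(train_sizes, cv_scores.mean(axis=1), label="cross-validation accuracy")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
# A large, persistent gap between the curves suggests high variance (overfitting);
# two low curves that converge suggest high bias.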
Update
I ran a quick analysis of the dataset in question with random forest and my performance was similar to the one posted by the author. Details and code are available on the GitHub page http://omdv.github.io/2016/03/10/WEKA-stackoverflow
Related
I am using the Weka software to build a classification model. I am confused about the training and testing dataset partition. I split the whole dataset into 60% training data, which I save to my hard disk, and 40% test data, which I save to another file. The data I am using is imbalanced, so I applied SMOTE to my training dataset. After that, in the Classify tab of Weka I selected the Use training set option from Test options and used the Random Forest classifier to do the classification on the training dataset. After getting the result, I chose the Supplied test set option from Test options, loaded my test dataset from the hard disk, and ran the classifier again.
I tried to find a tutorial on how to load a training set and a test set in Weka but did not find one. I did the above process based on my own understanding.
Therefore, I would like to know: is that the right way to perform classification on training and test datasets?
Thank you.
There is no need to evaluate your classifier on the training set (this will be overly optimistic, since the classifier has already seen this data). Just use the Supplied test set option, then your classifier will get trained automatically on the currently loaded dataset before being evaluated on the specified test set.
Instead of manually splitting your data, you could also use the Percentage split test option, with 60% to be used for your training data.
When using filters, you should always wrap them (in this case SMOTE) and your classifier (in this case RandomForest) in the FilteredClassifier meta-classifier. That way, you will ensure that the training and test set data will get transformed correctly. This will also avoid the problem of leaking information into the test set when transforming the full dataset with a supervised filter and splitting the dataset into train/test afterwards. Finally, it also documents nicely what preprocessing is being done to your input data, all in a single command-line string.
If you need to apply more than one filter, use the MultiFilter to apply them sequentially.
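For comparison outside of WEKA, the same leakage-avoiding idea can be sketched in Python with imbalanced-learn's pipeline; the dataset, split, and parameters below are placeholders, not your actual setup:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder imbalanced data, only for illustration
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)  # 60/40 split as in the question

model = Pipeline([
    ("smote", SMOTE(random_state=0)),                # applied only when fitting
    ("rf", RandomForestClassifier(random_state=0)),
])
model.fit(X_train, y_train)                          # SMOTE sees the training data only
print("test accuracy:", model.score(X_test, y_test)) # test data stays untouched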
How are noise in the data, target complexity, and the size of the training set related to overfitting?
I am guessing that you are a beginner. Suppose you have a dataset with lots of features (i.e. columns). You create a model and test it on your training and test datasets, and you notice that it gives you an accuracy of 100 percent on your training set but only 60-70 percent on your test set. This is an example of overfitting. It happens because you have chosen a lot of features that are not related to predicting the outcome.
You can reduce it by dropping those irrelevant columns (which are called noise) and applying K-fold cross-validation on your data.
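For example, a minimal K-fold cross-validation sketch in Python with scikit-learn (the dataset and classifier are placeholders):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: many columns, only a few of them actually informative
X, y = make_classification(n_samples=1000, n_features=30, n_informative=5,
                           random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
# A mean accuracy far below the training accuracy is a sign of overfitting.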
This video might help you get a better understanding:
https://www.youtube.com/watch?v=Anq4PgdASsc
I have some doubts about implementing and tuning parameters and hyperparameters using the classic train, validation, and test sets. It would be of great help if somebody could clarify these concepts for me and give me some hints for implementing them in a language like Python.
For example, if I have a neural network, as far as I know the parameters (let's consider the number of hidden layers and neurons per layer) can be tuned with the training set. Then, when it comes to the validation set, which is approximately 20% of the dataset, I can tune my hyperparameters with the following algorithm:
Example: Tuning batch size and learning rate:
hyperListB = []
hyperListL = []
# let's suppose both lists have the same dimensions
values = []
for i in range(len(hyperListB)):
    model = fit(train_set, hyperListB[i], hyperListL[i])
    values.append(evaluate(model, validation_set))  # add the score of each run

plot_loss_functions(values)
# select best set of hyperparameters
model = fit(test_set, selected_hyperparameters)
evaluate(model)
Would this sequence of steps be correct? I have searched through different pages and did not find anything that could help me with this. Please bear in mind that I do not want to use cross-validation or library-based techniques such as GridSearchCV.
Thanks
In a train/validation/test split, the fit method is called only on the training data.
Validation data is used for hyperparameter tuning. A set of hyperparameters is selected and the model is trained on the training set. Then this model is evaluated on the validation set. This is repeated until all combinations of the different hyperparameters have been exhausted.
The best set of hyperparameters is the one that gave the best result on the validation set. This method is called grid search.
The test set is used to evaluate the model with the best hyperparameters selected. This gives the final, unbiased accuracy and loss.
The fit method will never be called on the validation or test set.
Your example would look like this:
hyperListB = []
hyperListL = []
# let's suppose both lists have the same dimensions
values = []
for hyperB in hyperListB:
    for hyperL in hyperListL:
        model = fit(train_set, hyperB, hyperL)
        values.append(evaluate(model, validation_set))  # add the score of each run

plot_loss_functions(values)
# select the best set of hyperparameters based on the validation scores
model = fit(train_set, best_hyperB, best_hyperL)  # refit on the training set with the best hyperparameters
evaluate(model, test_set)
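As a more concrete, but still illustrative, version of this loop, here is a sketch in Python with Keras; the data, the tiny network, and the hyperparameter grids are assumptions made only for the example:

import tensorflow as tf
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder data and a 60/20/20 train/validation/test split
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

def build_model(learning_rate):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

results = []
for batch_size in [16, 32, 64]:          # plays the role of hyperListB
    for learning_rate in [1e-2, 1e-3]:   # plays the role of hyperListL
        model = build_model(learning_rate)
        model.fit(X_train, y_train, batch_size=batch_size, epochs=20, verbose=0)
        _, val_acc = model.evaluate(X_val, y_val, verbose=0)
        results.append((val_acc, batch_size, learning_rate))

best_acc, best_bs, best_lr = max(results)   # best validation accuracy wins
final_model = build_model(best_lr)          # refit on the training set only
final_model.fit(X_train, y_train, batch_size=best_bs, epochs=20, verbose=0)
print("test accuracy:", final_model.evaluate(X_test, y_test, verbose=0)[1])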
I have split the dataset (around 28K images) into a 75% train set and a 25% test set. Then I randomly took 15% of the train set and 15% of the test set to create a validation set. The goal is to classify the images into two categories. The exact image samples can't be shared, but they are similar to the one attached. I'm using this model: VGG19 with ImageNet weights, with the last two layers trainable and 4 dense layers appended. I am also using ImageDataGenerator to augment the images. I trained the model for 30 epochs and found that the training accuracy is 95% and the validation accuracy is 96%, but when evaluated on the test dataset the accuracy fell enormously to only 75%.
I have tried regularization and dropout to tackle overfitting, in case the model is suffering from it. I have also tried one more thing: using the test set as the validation set and then testing the model on that same test set. The results were: train accuracy = 96%, validation accuracy = 96.3%, and test accuracy = 68%. I don't understand what I should do.
[attached sample image]
First off, you need to make sure that when you split the data, the relative size of every class is the same in the resulting datasets. The data can be imbalanced if that is the distribution of your initial data, but all datasets must have the same imbalance after the split.
Now, regarding the split. If you need train, validation and test sets, they must all be independent of each other (no shared samples). This is important if you don't want to cheat yourself with the results that you are getting.
In general, in machine learning we start from a training set and a test set. To choose the best model architecture/hyper-parameters, we further divide the training set to get the validation set (the test set should not be touched).
After determining the best architecture/hyper-parameters for our model, we combine the training and validation set and train the best-case model from scratch with the combined full training set. Only now we get to test the results on the test set.
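A minimal sketch of such an independent, stratified split in Python with scikit-learn; the data is a random placeholder and the 60/20/20 proportions are just an example:

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and binary labels, only for illustration
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))
y = rng.integers(0, 2, size=1000)

# 60% train, then split the remaining 40% evenly into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
# Each split keeps the original class ratio and shares no samples with the others,
# so the validation score is a fair proxy for test performance.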
I had faced a similar issue in one of my practice projects.
My InceptionV3 model gave a high training accuracy (99%), a high validation accuracy (95%+) but a very low testing accuracy (55%).
The dataset was a subset of the popular Dogs vs. Cats dataset (https://www.kaggle.com/c/dogs-vs-cats/data), made by me, with 15k images split into 3 folders, train, valid, and test, in a 60:20:20 ratio (9000, 3000, and 3000 images, each split evenly between a cats folder and a dogs folder).
The error in my case was actually in my code. It had nothing to do with the model or the data. The model had been defined inside a function, and that was creating an untrained instance during the evaluation. Hence, an untrained model was being evaluated on the test dataset. After correcting the errors in my notebook I got a 96%+ testing accuracy.
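Roughly, the bug followed the pattern sketched below (a simplified, hypothetical reconstruction with placeholder data, not the actual notebook code):

import numpy as np
import tensorflow as tf

# Placeholder data with a learnable pattern
X = np.random.rand(500, 10)
y = (X[:, 0] > 0.5).astype(int)

def build_model():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

trained = build_model()
trained.fit(X, y, epochs=20, verbose=0)

print(build_model().evaluate(X, y, verbose=0))  # bug: a fresh, untrained instance is evaluated
print(trained.evaluate(X, y, verbose=0))        # fix: evaluate the model that was actually trained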
Links:
https://colab.research.google.com/drive/1-PO1KJYvXdNC8LbvrdL70oG6QbHg_N-e?usp=sharing&fbclid=IwAR2k9ZCXvX_y_UNWpl4ljs1y0P3budKmlOggVrw6xI7ht0cgm03_VeoKVTI
https://drive.google.com/drive/u/3/folders/1h6jVHasLpbGLtu6Vsnpe1tyGCtR7bw_G?fbclid=IwAR3Xtsbm_EZA3TOebm5EfSvJjUmndHrWXm4Iet2fT3BjE6pPJmnqIwW8KWY
Other probable causes:
One possibility is that the test set has a different distribution than the validation set (this could be excluded by joining all the data, randomizing, and splitting again into train, validation, and test sets).
Try swapping the validation and test sets with each other and see if it has an effect (sometimes one set has relatively harder examples).
The training may have overfitted to the validation set (is it possible that during training, at one or more steps, the model with the best score on the validation set is the one being kept?).
Overlapping images between the sets, or a lack of shuffling.
In the deep learning world, if something seems way too odd to be true, or way too good to be true, a good guess is that it is probably a bug unless proven otherwise!
I am completely new to Machine Learning algorithms and I have a quick question with respect to Classification of a dataset.
Currently there is training data that consists of two columns: Message and Identifier.
Message - a typical message extracted from a log, containing a timestamp and some text
Identifier - the category label, assigned based on the message content
The training data was prepared by extracting a particular category from the tool and labelling it accordingly.
Now the test data contains just the Message, and I am trying to obtain the category for it.
Which approach is most helpful in this scenario? Is it supervised or unsupervised learning?
I have a labelled training dataset and I am trying to predict the category for the test data.
Thanks in advance,
Adam
If your labels are exact, then you can classify using ANN, SVM, etc. But if your labels are not exact, you have to cluster the data with respect to the features you have. K-means or nearest neighbour can be a starting point for clustering.
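A minimal clustering sketch in Python with scikit-learn; the log messages and the number of clusters are placeholders:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder log messages, only for illustration
messages = [
    "2023-01-01 12:00:01 connection timed out",
    "2023-01-01 12:00:05 user login successful",
    "2023-01-01 12:00:09 disk quota exceeded",
]

X = TfidfVectorizer().fit_transform(messages)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # cluster id per message; inspect the clusters to assign categories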
It is supervised learning, and a classification problem.
However, you obviously do not have the label column (the to-be-predicted value) for your test set. Thus, you cannot calculate error measures (such as false positive rate, accuracy, etc.) for that test set.
You could, however, split the labelled training data that you do have into a smaller training set and a validation set. Split it 70%/30%, perhaps. Then build a prediction model from your smaller 70% training dataset and tune it on your 30% validation set. When the accuracy is good enough, apply the model to your test set to obtain/predict the missing values.
Which techniques/algorithms to use is a different question. You do not give enough information to answer that. And even if you did, you would still need to tune the model yourself.
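An illustrative sketch of this split-and-tune procedure in Python with scikit-learn; the messages, labels, and choice of classifier are placeholders, not your real data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder labelled training data, only for illustration
messages = ["2023-01-01 12:00:01 connection timed out",
            "2023-01-01 12:00:05 user login successful",
            "2023-01-01 12:00:09 connection timed out again",
            "2023-01-01 12:00:12 user logout"]
labels = ["network", "auth", "network", "auth"]

# 70% training / 30% validation split of the labelled data
X_train, X_val, y_train, y_val = train_test_split(
    messages, labels, test_size=0.3, random_state=0)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))

# Once the validation accuracy is good enough, predict the unlabelled test messages
unlabeled_test = ["2023-01-02 08:15:00 connection timed out"]
print(model.predict(unlabeled_test))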
You have labels to predict, and training data.
So by definition it is a supervised problem.
Try any classifier for text, such as NB, kNN, SVM, ANN, RF, ...
It's hard to predict which will work best on your data. You will have to try and evaluate several.
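A quick sketch of trying and evaluating several text classifiers on a held-out split in Python with scikit-learn; the messages and labels are placeholders:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder labelled messages, only for illustration
messages = ["connection timed out", "user login successful",
            "connection reset by peer", "user logout",
            "connection refused", "password change successful"]
labels = ["network", "auth", "network", "auth", "network", "auth"]

X_train, X_val, y_train, y_val = train_test_split(
    messages, labels, test_size=0.33, stratify=labels, random_state=0)

for clf in [MultinomialNB(), KNeighborsClassifier(n_neighbors=1),
            LinearSVC(), RandomForestClassifier(random_state=0)]:
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(X_train, y_train)
    print(type(clf).__name__, model.score(X_val, y_val))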