Wrong way to cascade classifiers in Weka

I have a data set with two classes and was trying to get an optimal classifier using Weka. The best classifier I could obtain was about 79% accuracy. Then I tried adding attributes to my data by classifying it and saving the probability distribution generated by this classification in the data itself.
When I reran the training process on the modified data I got over 93% accuracy! I'm sure this is wrong but I can't exactly figure out why.
These are the exact steps I went through:
Open the data in Weka.
Click on add Filter and select AddClassification from Supervised->attribute.
Select a classifier. I select J48 with default settings.
Set "Output Classification" to false and set Output Distribution to true.
Run the filter and restore the class to be your original nominal class. Note the additional attributes added to the end of the attribute list. They will have the names: distribution_yourFirstClassName and distribution_yourSecondClassName.
Go to the Classify tab and select a classifier: again I selected J48.
Run it. In this step I noticed much higher accuracy than before.
Is this a valid way of creating classifiers? Didn't I "cheat" by adding classification information to the original data? If it is valid, how would one proceed to create a classifier that can predict unlabeled data? How can it add the additional attribute (the distribution)?
I did try reproducing the same effect using a FilteredClassifier but it didn't work.
Thanks.

The process that you appear to have undertaken is somewhat close to the Stacking ensemble method, where the outputs of several classifiers are used to generate an ensemble output.
In your case, however, the attributes and the output of a previously trained classifier are being used to predict your class. It is likely that most of the second J48 model's rules will be based on the first (as the first classifier's output will correlate more strongly with the class than the other attributes do), but with some fine-tuning to improve model accuracy. In this case, the concept of 'two heads are better than one' is used to improve the overall performance of the model.
That's not to say that it is all good though. If you needed to use your J48 with unseen data, then you would not be able to use the same J48 that was used for your attributes (unless you saved it previously). Additionally, you are adding more processing work by using more than one classifier as opposed to the single J48. These costs would also need to be considered against the problem that you are tackling.
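As a rough illustration of the Stacking idea outside Weka (purely a sketch, with made-up data and a decision tree standing in for J48), scikit-learn's StackingClassifier trains the base learners with internal cross-validation, so the second-level model only ever sees out-of-fold predictions and the leakage described in the question is avoided:

# Sketch only: placeholder data; DecisionTreeClassifier stands in for J48.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The final estimator is trained on out-of-fold probability estimates of the
# base learners, so no base learner ever scores the rows it was trained on.
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=5)),
                ("forest", RandomForestClassifier(n_estimators=100))],
    final_estimator=LogisticRegression(),
    cv=5,
    stack_method="predict_proba",
)

print(cross_val_score(stack, X, y, cv=10).mean())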
Hope this helps!

Okay, here is how I did cascaded learning:
I took the dataset D and divided it into 10 equal-sized stratified folds (D1 to D10) without repetition.
I applied algorithm A1 to train a classifier C1 on D1 to D9 and then just like you, applied C1 on D10 to give me the additional distribution of positive and negative classes. I name this D10 with the additional two (or more, depending on what information from C1 you want to be included in D10) attributes/features as D10_new.
Next, I applied the same algorithm to train a classifier C2 on D1 to D8 and D10 and then just like you, applied C2 on D9 to give me the additional distribution of positive and negative classes. I name this D9 with the additional attributes/features as D9_new.
In this way I create D1_new to D10_new.
Then I applied another classifier (perhaps with algorithm A2) on these D1_new to D10_new to predict the labels (a 10 fold CV is a good choice).
In this setup, you have removed the bias of the classifier having seen the data prior to testing on it. Also, it is advisable that A1 and A2 be different.
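For anyone who wants to reproduce this scheme outside Weka, here is a compact sketch in Python with scikit-learn; the data is synthetic and a decision tree / logistic regression merely stand in for A1 / A2. cross_val_predict builds the out-of-fold class distributions, which is exactly the D1_new to D10_new construction described above:

# Sketch only: synthetic data; DecisionTreeClassifier = A1, LogisticRegression = A2.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Each row's class distribution comes from a model trained on the other nine folds.
dist = cross_val_predict(DecisionTreeClassifier(), X, y,
                         cv=folds, method="predict_proba")

# Append the distributions as extra attributes and evaluate A2 on the enriched data.
X_new = np.hstack([X, dist])
print(cross_val_score(LogisticRegression(max_iter=1000), X_new, y, cv=10).mean())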

Related

How to build separate classifiers for each label in the dataset?

I have a list of columns and each column is to be labelled by a label from another list of labels.
Eg: Two columns namely, ALT_ID and MTRC_NM are matched with labels Alternate ID and Metric Name respectively.
This fuzzy string matching has been taken care of. Problem is, I want to incorporate a learning model in this.
Essentially, after the matched results are displayed, the user curates the matches as CORRECT or INCORRECT. Based on this feedback and other features of the column (like minimum value, maximum value), I want to train a classifier such that the learning model will eventually stop making the incorrect matches in the future.
Note: In the first run, only the name of the column is used to produce the first set of results. After this, I want to use other features (like minimum value) to train the model.
Problem is, there can be 10,000 terms (or labels), maybe even more, and the user just marks these as CORRECT or INCORRECT. For incorrect classifications, the user does not tell us what the correct classification should be.
I believe one solution could be to make separate classifiers for each label and based on the Correct/Incorrect feedback for a particular classification, we can use these feature vectors to train a classifier for this classification. So in the future, if the fuzzy string matching nominates Metric Name as the classification for some column, we can let the "Metric Name" classifier decide if it is correct or incorrect.
I don't know how to make separate classifiers for each label. I also don't know if this approach is feasible. Any other solution to this problem will also help.
You do not want to create separate models for each label, as training more than 10,000 models isn't really feasible. Two possible approaches come to mind:
Create a supervised learning model that takes the column's features as input and outputs a probability for each of the 10,000 labels, using only the examples marked CORRECT for training.
Create a reinforcement learning model with the same input but with an output that maximises a reward function defined as +1 for each correct prediction and -1 for each incorrect prediction. This model will also try to maximise the number of correct predictions, but will be able to learn from incorrect predictions at the same time, i.e. predict a -1 score for an incorrect pair (x, y).
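A minimal sketch of the first option in Python with scikit-learn, assuming the user-confirmed CORRECT pairs can be collected as (column name, label) examples; the character n-gram features and the tiny example data are purely illustrative:

# Sketch only: one multi-class model over all labels instead of 10,000 binary models.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical curated examples: column names the user confirmed as CORRECT.
columns = ["ALT_ID", "MTRC_NM", "ALTERNATE_ID", "METRIC_NAME"]
labels = ["Alternate ID", "Metric Name", "Alternate ID", "Metric Name"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams of the name
    LogisticRegression(max_iter=1000),
)
model.fit(columns, labels)

# predict_proba scores every known label; a low maximum can be treated as
# "no confident match" instead of forcing one of the 10,000 labels.
print(model.predict_proba(["MTRC_NM_2"]).max())

Other column features (minimum value, maximum value, ...) could be appended to the text features later, but that part is left out of this sketch.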

Why will K-fold cross validation build K+1 models?

I have read the general steps for K-fold cross validation at
https://machinelearningmastery.com/k-fold-cross-validation/
It describes the general procedure as follows:
Shuffle the dataset randomly.
Split the dataset into k groups (folds).
For each unique group:
Take the group as a hold-out or test data set.
Take the remaining groups as a training data set.
Fit a model on the training set and evaluate it on the test set.
Retain the evaluation score and discard the model.
Summarize the skill of the model using the sample of model evaluation scores.
So if it is K-fold, then K models will be built, right? But why does the following link from H2O say that it builds K+1 models?
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/gbm/gbmTuning.ipynb
Arguably, "I read somewhere else" is too vague a statement (where?), because context does matter.
Most probably, such statements refer to some libraries which, by default, after finishing the CV proper procedure, go on to build a model on the whole training data using the hyperparameters found by CV to give best performance; see for example the relevant train function of the caret R package, which, apart from performing CV (if requested), returns also the finalModel:
finalModel
A fit object using the best parameters
Similarly, scikit-learn's GridSearchCV also has a relevant parameter, refit:
refit : boolean, or string, default=True
Refit an estimator using the best found parameters on the whole dataset.
[...]
The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance.
But even then, the models fitted are almost never just K+1: when you use CV in practice for hyperparameter tuning (and keep in mind that there are other uses for CV, too), you will end up fitting m*K models, where m is the size of your hyperparameter combination set (all K folds in a single round are run with one single set of hyperparameters).
In other words, if your hyperparameter search grid consists of, say, 3 values for the number of trees and 2 values for the tree depth, you will fit 2*3*K = 6*K models during the CV procedure, and possibly +1 for fitting your model at the end to the whole data with the best hyperparameters found.
So, to summarize:
By definition, each K-fold CV procedure consists of fitting just K models, one for each fold, with fixed hyperparameters across all folds
In case of CV for hyperparameter search, this procedure will be repeated for each hyperparameter combination of the search grid, leading to m*K fits
Having found the best hyperparameters, you may want to use them for fitting the final model, i.e. 1 more fit
leading to a total of m*K + 1 model fits.
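To make the m*K + 1 count concrete, here is a small scikit-learn sketch mirroring the 3 x 2 grid example above (the estimator and data are arbitrary placeholders):

# 3 values x 2 values = 6 hyperparameter combinations, 10 folds each = 60 fits,
# plus one refit on the whole training data because refit=True.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

grid = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"n_estimators": [50, 100, 200],  # number of trees: 3 values
                "max_depth": [2, 3]},            # tree depth: 2 values
    cv=10,
    refit=True,  # the "+1" model, exposed afterwards as best_estimator_
)
grid.fit(X, y)
print(len(grid.cv_results_["params"]) * 10 + 1)  # 6*10 + 1 = 61 model fits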
Hope this helps...

Handle mismatch in number of features in Training Data and Prediction Data

I have 6 text features (say f1,f2,..,f6) available for the data on which I have trained a model. But when this model is deployed and a new data point comes, for which I have to make prediction using this model, it has only 2 features (f1, and f2). So, there is the problem of feature mismatch. How can I tackle this problem?
I have a few thoughts, but that are not very efficient.
Use only two features for training (f1 and f2), and discard other features (f3,..,f6). But this leads to a loss of information and my test set accuracy decreases.
Learn some relation between (f3,..,f6) and (f1, f2), so that even though (f3,..,f6) are not there in the new data point, the information can be extracted from f1 and f2 only.
The best way is of course to train a new model using f1, f2 and any new data you may have.
Don't want to do that? If you don't have f3...f6, you shouldn't magically expect the model to work as intended.
Now, think about what those "f3...f6" are. Are they related to the information you do have? If they are, you may be able to approximate them. We can't tell you what to do because we don't have any clue what they are. Interpolation? Regression? Rough approximation?
My suggestion: you are missing most of the predictors for your model. Your old model is meaningless. Please just train a new one.
Perhaps you could fill in f3 to f6 with a neutral value, namely the average of each feature over all the training data that includes it. That way the filled-in values for f3 through f6 won't stand out too much and won't lean your classifier one way or the other. The classifier would be more likely to rely on the features provided, f1 and f2, to classify.
When calculating this, make sure the averages are calculated for each class first and then averaged across classes. That way, if your data set has a large amount of one class, it won't skew the average.
Of course this might be an over simplification, and would work best with binary classification. It depends on the data set and classification.
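A small sketch of that idea in Python; the data here is synthetic and the column indices for f3 to f6 are just an assumption about how the features are laid out:

# Sketch only: fill f3..f6 with the mean of the per-class means from training data.
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 6))     # six features f1..f6
y_train = rng.integers(0, 2, size=100)  # binary labels

def class_balanced_means(X, y, missing_cols):
    # Average each missing column per class first, then across classes,
    # so an imbalanced training set does not skew the fill value.
    classes = np.unique(y)
    return {c: float(np.mean([X[y == k, c].mean() for k in classes]))
            for c in missing_cols}

fills = class_balanced_means(X_train, y_train, missing_cols=[2, 3, 4, 5])

# A new point arrives with only f1 and f2; pad f3..f6 with the neutral fills.
new_point = np.array([0.7, -1.2])
new_row = np.concatenate([new_point, [fills[c] for c in [2, 3, 4, 5]]])
print(new_row)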
Hope this helps :)

Classification accuracy in Weka

I am using Weka GUI for a classification. I am new to Weka and getting confused with the options
Use training set
Supplied test set
Cross validation
to train my classification algorithm (for example J48). I trained with 10-fold cross validation and the accuracy is pretty good (97%). When I test my classifier, the accuracy drops to about 72%. I am so confused. Any tips please? This is how I did it:
I train my model on the training data (For example: train.arff)
I right-click in the Results list on the item whose model I want to save
select Save model and save it for example as j48tree.model
and then
I load the test data (for example: test.arff) via the Supplied test set button
I right-click in the Results list, select Load model and choose j48tree.model
I select Re-evaluate model on current test set
Is the way I do it wrong? Why is the accuracy dropping so miserably, from 97% to 72%? Or is doing only the 10-fold cross-validation enough to train and test the classifier?
Note: my training and testing datasets have the same attributes and labels. The only difference is that I have more data in the testing set, which I don't think should be a problem.
I don't think there is any issue with how you use WEKA.
You mentioned that your test set is larger than your training set? What is the split? The usual rule of thumb is that the test set should be about 1/4 of the whole dataset, i.e. 3 times smaller than the training set, and definitely not larger. This alone could explain the drop from 97% to 72%, which by the way is not so bad for a real-life case.
Also, it will be helpful if you build a learning curve (https://weka.wikispaces.com/Learning+curves), as it will show whether you have a bias or a variance issue. Judging by your values, it sounds like you have high variance (i.e. too many parameters for your dataset), so adding more examples or changing your split between training and test set will likely help.
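If exporting the data out of Weka is an option, a learning curve is also quick to draw with scikit-learn; this sketch uses synthetic data and a plain decision tree as a stand-in for the actual J48 model. A persistent gap between the training and cross-validation curves points to high variance:

# Sketch only: synthetic data and a generic decision tree.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(), X, y, cv=10,
    train_sizes=np.linspace(0.1, 1.0, 8))

plt.plot(sizes, train_scores.mean(axis=1), label="training accuracy")
plt.plot(sizes, val_scores.mean(axis=1), label="cross-validation accuracy")
plt.xlabel("training examples")
plt.ylabel("accuracy")
plt.legend()
plt.show()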
Update
I ran a quick analysis of the dataset in question with random forest and my performance was similar to the one posted by the author. Details and code are available at http://omdv.github.io/2016/03/10/WEKA-stackoverflow

When classifying, does one need to normalize new incoming features when predicting on real data?

There are two data sets: the training one, and a data set of features whose labels are yet to be predicted (the new one).
I built a Random Forest classifier. Along the way I had to do two things:
Normalize continuous numeric features.
Perform a one-hot-encoding on the categorical ones.
Now I have two questions. When I am predicting labels for the new data:
Do I need to normalize the incoming features? (common sense tells me that yes :) ) If so, should I take the mean, max, min values for a specific feature from the training data set or should I somehow take into account the new values of the features?
How do I one-hot-encode the new values of the features? Do I expand the dictionary of the possible categories for a specific category, taking into account the possibly new values of the features?
In my case I possess both data sets, so I could calculate all this stuff in advance, but what if I only had a classifier and a new data set?
I only have a basic knowledge of the type of classifiers and normalization techniques you're using, but the general rule, which I think applies to what you're doing as well, is the following.
Your classifier is not a Random Forest Classifier. That is only one step of the pipeline that acts as your actual classifier. This pipeline / actual classifier is what you describe:
Normalize continuous numeric features.
Perform a one-hot-encoding on the categorical ones.
Use a Random Forest Classifier on what you get from the first 2 steps.
This pipeline, which encompasses these 3 steps, is what you're actually using as your classifier.
Now, how does a classifier work?
You build some state based on the training data.
You use that state to make predictions on the test data.
So:
Do I need to normalize the incoming features? (common sense tells me that yes :) ) If so, should I take the mean, max, min values for a specific feature from the training data set or should I somehow take into account the new values of the features?
Your classifier normalizes the incoming features for the training data, so it will normalize those for unseen instances too. To do this, it must use the state it has built during training.
For example, if you were doing min-max scaling on your features, your state would store a min(f) and max(f) for each feature f. Then, during testing / prediction, you would do min-max scaling for each feature f using the stored min(f) and max(f) values.
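With scikit-learn, for instance, that state is exactly what MinMaxScaler stores when fitted on the training data (a small sketch with made-up numbers):

# min(f) and max(f) are learned from the training data only and reused for new rows.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]])
X_new = np.array([[25.0, 700.0]])  # a previously unseen instance

scaler = MinMaxScaler().fit(X_train)  # state: data_min_ = [10, 200], data_max_ = [30, 600]
print(scaler.transform(X_new))        # scaled with the training min/max, so values
                                      # can legitimately fall outside [0, 1]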
I'm not sure what you mean by "normalize continuous numeric features". Do you mean discretization? If you build some state for this discretization during training, then you need to find a way to factor that in.
How do I one-hot-encode the new values of the features? Do I expand the dictionary of the possible categories for a specific category, taking into account the possibly new values of the features?
Don't you know how many values each category can have beforehand? Usually you do (since categoricals are things like nationality, continent, etc. - things you know in advance). If you can get a value for a categorical feature that you haven't seen during training, it raises the question of whether you should even care about it. What good is a categorical value you've never trained on?
Maybe add an "unknown" category. I think expanding by a single one should be fine; what good are more going to do if you've never trained on them?
What kind of categoricals do you have?
I could be wrong, but do you really need one-hot encoding? AFAIK, tree-based classifiers don't seem to benefit that much from it.
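If you do keep the one-hot encoding, one concrete way to handle the unseen-category case (sketched here with scikit-learn and made-up categories) is to let the encoder map unknown values to an all-zero vector:

# Sketch only: unseen categories become an all-zero row, i.e. an implicit "unknown".
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_cats = np.array([["Europe"], ["Asia"], ["Africa"]])
new_cats = np.array([["Antarctica"]])  # category never seen during training

enc = OneHotEncoder(handle_unknown="ignore").fit(train_cats)
print(enc.transform(new_cats).toarray())  # [[0. 0. 0.]]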
