Weka: Results of each fold in 10-fold CV - machine-learning

In the Weka Explorer (GUI), when we run a 10-fold CV on any given ARFF file, what the Explorer provides (as far as I can see) is the average result over all 10 folds.
Q. Is there any way to get the results of each fold? For instance, I need the error rates (incorrectly identified instances) for each fold.
Help appreciated.

I think this is possible using Weka's GUI. You need to use the Experimenter though instead of the Explorer. Here are the steps:
Open the Experimenter from the GUI Chooser
Create a new experiment (New button at the top-right)
[optional] Enter a filename and location in the Results Destination to save the results to
Set the Number of (cross-validation) folds to your liking (start experimenting with 2 folds for easy results)
Add your dataset (if your dataset needs preprocessing then you should do this in the Explorer first and then save the preprocessed dataset)
Set the Number of repetitions (I recommend 1 to start off with)
Add the algorithm(s) you want to test (again start easy, start with one algorithm)
Go to the Run tab and Start the experiment and wait till it finishes
Go to the Analyse tab and import the experiment results by clicking Experiment (top-right)
For Row select: Fold
For Column select: Percent_incorrect or Number_incorrect (or any other measure you want to see)
You now see the specified results for each fold
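If you would rather script this than click through the Experimenter, a rough equivalent in Python with scikit-learn (not Weka itself) is sketched below; the dataset and classifier are only placeholders.

    # Sketch: per-fold error counts/rates from 10-fold CV (illustrative data and model).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(random_state=1)          # stand-in for a Weka classifier such as J48

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
        clf.fit(X[train_idx], y[train_idx])
        acc = clf.score(X[test_idx], y[test_idx])
        wrong = int(round((1 - acc) * len(test_idx)))     # incorrectly classified instances in this fold
        print(f"Fold {fold}: {wrong} incorrect ({(1 - acc) * 100:.1f}%)")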

The Weka Explorer does not have an option to report the results of individual folds when using the cross-validation option, but there are some workarounds. If you explicitly don't want to write any code, you need to do some manual fiddling, but I think this gives you more or less what you want:
Instead of Cross-validation, select Percentage split and set it to 90%
Start classifier
Click More options... and change the Random seed for XVal / % Split value to something you haven't used before.
Repeat ten times.
This is not exactly equivalent to 10-fold cross-validation though, since the pseudo-folds you make this way might overlap.
An alternative that is equivalent to cross-validation, but more cumbersome, would be to make 10 folds manually by using the unsupervised instance filter RemoveFolds or RemoveRange.
Generate and save 10 training sets and 10 test sets. Then for every fold, load the training set, select Supplied test set in the classify tab, and select the appropriate test fold.
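If you prefer to script the manual-folds route instead of clicking through RemoveFolds, a short Python sketch (file names are made up) could look like this:

    # Sketch: generate and save 10 train/test splits, analogous to RemoveFolds + "Supplied test set".
    import pandas as pd
    from sklearn.model_selection import KFold

    df = pd.read_csv("dataset.csv")                       # placeholder: your exported dataset
    kf = KFold(n_splits=10, shuffle=True, random_state=1)
    for fold, (train_idx, test_idx) in enumerate(kf.split(df), start=1):
        df.iloc[train_idx].to_csv(f"train_fold{fold}.csv", index=False)
        df.iloc[test_idx].to_csv(f"test_fold{fold}.csv", index=False)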

Related

Does the training+testing set have to be different from the predicting set (so that you need to apply a time-shift to ALL columns)?

I know the general rule that we should test a trained classifier only on the testing set.
But now comes the question: When I have an already trained and tested classifier ready, can I apply it to the same dataset that was the base of the training and testing set? Or do I have to apply it to a new predicting set that is different from the training+testing set?
And what if I predict a label column of a time series? (Edited later: I do not mean a classical time-series analysis here, but just a broad selection of columns from a typical database, with weekly, monthly, or randomly stored data that I convert into separate feature columns, one for each week / month / year.) Do I have to shift all of the features of the training+testing set (not just the past columns of the time-series label column, but also all other normal features) back to a point in time where the data has no "knowledge" overlap with the predicting set?
I would then train and test the classifier on features shifted to the past by n months, scoring against a label column that is unshifted and most recent, and then predict from the most recent, unshifted features. Shifted and unshifted features have the same number of columns; I align them by assigning the column names of the shifted features to the unshifted features.
P.S. 1: The general approach, per https://en.wikipedia.org/wiki/Dependent_and_independent_variables:
In data mining tools (for multivariate statistics and machine learning), the dependent variable is assigned a role as target variable (or in some tools as label attribute), while an independent variable may be assigned a role as regular variable.[8] Known values for the target variable are provided for the training data set and test data set, but should be predicted for other data.
P.S. 2: In this basic tutorial we can see that the predicting set is kept separate: https://scikit-learn.org/stable/tutorial/basic/tutorial.html
We select the training set with the [:-1] Python syntax, which produces a new array that contains all but the last item from digits.data: […] Now you can predict new values. In this case, you'll predict using the last image from digits.data [-1:]. By predicting, you'll determine the image from the training set that best matches the last image.
I think you are mixing up some concepts, so I will try to give a general explanation for Supervised Learning.
The training set is what your algorithm LEARNS on. You split it into X (features) and Y (target variable).
The test set is a set that you use to SCORE your model, and it must contain data that was not in the training set. This means that a test set also has X and Y (meaning that you know the value of the target). What happens is that you PREDICT f(X) (your estimate of Y) based on X, compare it with the Y you have, and see how good your predictions are.
A prediction set is simply new data! This means that usually you DO NOT have a target, since the whole point of supervised learning is predicting it. You will only have your X (features) and you will predict f(X) (your estimate of the target Y) and use it for whatever you need.
So, in the end a test set is simply a prediction set for which you have a target to compare your estimation to.
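A minimal sketch of those three roles, using scikit-learn with made-up data (all names and models here are only illustrative):

    # Sketch: training set (fit), test set (score against known y), prediction set (new X, no known y).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)   # learn on the training set
    print("test accuracy:", model.score(X_test, y_test))                   # score on the held-out test set

    X_new = X_test[:5]                      # stand-in for a real prediction set (no labels available)
    print("predictions:", model.predict(X_new))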
For time series, it is a bit more complicated, because often the features (X) are transformations on past data of the target variable (Y). For example, if you want to predict today's SP500 price, you might want to use the average of the last 30 days as a feature. This means that for every new day, you need to recompute this feature over the past days.
In general though, I would suggest starting with NON time series data if you're new to ML, as Time Series is much harder in terms of feature engineering and data management and it is easy to make mistakes.
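To illustrate the SP500 example above, a small sketch of a lagged rolling-mean feature with pandas; the file and column names are assumptions:

    # Sketch: a 30-day rolling-mean feature that only uses days strictly before the day being predicted.
    import pandas as pd

    prices = pd.read_csv("sp500.csv", parse_dates=["date"], index_col="date")   # placeholder file/columns
    prices["mean_30d"] = prices["close"].rolling(window=30).mean().shift(1)     # shift(1): no same-day leakage
    target = prices["close"]                                                    # the value to predict
    features = prices[["mean_30d"]].dropna()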
The question above, "When I have an already trained and tested classifier ready, can I apply it to the same dataset that was the base of the training and testing set?", has the simple answer: No.
The question above, "Do I have to shift all of the features?", has the simple answer: Yes.
In short, if I predict a month's class column: I have to shift all of the non-class columns back in time as well, in addition to the previous class months I converted to features; all data must have been known before the month in which the class is predicted.
This also means: the predicting set has to be different from the dataset that contains the testing set. If you included the testing set, the training set would lose valuable up-to-date data of the latest month(s) available! The final "predicting set" here means the "most current input, used without a testing set", to get the "most current results" for the prediction.
This is confirmed by the following overview offered by this user who seems to have made the image, using days instead of months here, but the idea is the same:
Source: Answer on "Cross Validated" - Splitting Time Series Data into Train/Test/Validation Sets; the whole Q/A is recommended (!).
See the last line of the image and the valuable comments of that answer on "Cross Validated" to understand this.
Edit 2023-01-06:
The image shows that the last step is a training on the whole dataset; this is the "predicting set", which is the newest and does not have a testing set.
There is one "mistake" in that image, which shows that this seemingly easy question of taking former labels as features for upcoming labels seems to be hard to understand. I did not see it myself at first and posted the image without this remark: the "T&V" lies in the past of the "Test". That would be a wrong validation for a model that is supposed to predict the future; the V must be in the "future" test block (unless you have a dataset that does not change dynamically over time, as in physics).
You would have to change it to a "walk-forward" model, with the validation set (if used at all) split k-fold from the testing set, not from the training set; see the code sketch after the links below.
See also:
Can / should I use past (e.g. monthly) label columns from a database as features in an ML prediction (no time-series!)? with the "walk-forward" main image,
Splitting Time Series Data into Train/Test/Validation Sets with more insight into this and the comment that brought up the model name "walk-forward".
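A rough sketch of such walk-forward splits using scikit-learn's TimeSeriesSplit (this is my reading of the scheme described above, not the original image; the data is illustrative):

    # Sketch: walk-forward ("expanding window") splits; the test block is always later than the training data.
    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.arange(24).reshape(-1, 1)                      # 24 time-ordered samples, e.g. months
    tscv = TimeSeriesSplit(n_splits=4)
    for i, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
        print(f"split {i}: train {train_idx.min()}-{train_idx.max()}, test {test_idx.min()}-{test_idx.max()}")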

In the Orange data mining toolkit, how do I specify groups for cross-validation?

I'm using the Orange GUI, and trying to perform cross-validation. My data has 8 different groups (specified by a variable in the input data), and I'd like each fold to hold out a different group. Is this possible to do using Orange? I can select the number of folds for cross-validation, but I don't see any way of determining which data is in each one.
Cross-validation does random sampling. I don't think what you seek is possible out of the box.
If you really want it to honor the splits you made beforehand (according to some input variable), and you aren't afraid of some manual labor, you can use the Select Rows widget to select the rows of one group (i.e. Matching Data), pass that into Test & Score as Test Data, and use all the rest of the data (i.e. Unmatched Data) as training Data. This way you get the cross-validation result for a single fold (group). Repeat for each group, and finally average, to obtain results for all folds.
If you know some Python, there is always the Orange scripting layer you can fall back to.
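If you do drop down to scripting, the leave-one-group-out scheme is easy to express; the sketch below uses scikit-learn rather than the Orange API (whose exact calls I won't guess at), and the data, groups, and model are only placeholders:

    # Sketch: leave-one-group-out CV, each fold holding out one of the 8 predefined groups.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(160, 5))                         # illustrative data
    y = rng.integers(0, 2, size=160)
    groups = np.repeat(np.arange(8), 20)                  # the group variable from your input data

    scores = cross_val_score(LogisticRegression(), X, y, groups=groups, cv=LeaveOneGroupOut())
    print(scores)                                         # one score per held-out group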

Clustering before classification in Weka

The instances in my dataset have multiple numeric attributes and a binary class. In Weka is there a way to use a clusterer and pass the result to a classifier (say SMO) to improve the results of classification?
One way that you could add cluster information to your data is using the below method (in Weka Explorer):
Load your Favourite Dataset
Choose your Cluster Model (In my case, I used SimpleKMeans)
Modify the Parameters of the Clusterer as Required
Use the Training Set for the Cluster Mode
Start the Clustering Process
Once the Clusters have been generated, Right-Click on the Result List and select 'Visualize Cluster Assignments'
Select Y to be the Cluster, then hit the Save Button as shown below:
Save the Data to a nominated location.
You should then be able to load this file and use the cluster information in your classifier just like any other attribute. Just make sure that the Class is set to the right attribute and you should be right to go.
NOTE: When I ran these tests I used J48 to evaluate the class, and it seemed that J48 used only the cluster values to estimate the class. The accuracy of the model was also surprisingly high, so either the dataset was too simple or I may have missed a step somewhere in the clustering process.
Hope this Helps!
In Weka Explorer, after loading your dataset
choose the Preprocess tab,
click "Choose..." Button,
add the unsupervised-attribute-filter "AddCluster".
click the text next to the Choose button to open the filter's options, and choose a clusterer in the clusterer field,
configure/parameterize the clusterer
close all modal dialog boxes
Click "Apply" button to apply the filter. It will add another attribute called "cluster" as the rightmost one in your attribute list.
Then continue with your classification experiments.
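Outside Weka, the same idea can be sketched in a few lines: cluster the instances, append the cluster assignment as an extra attribute, and classify on the augmented data (everything below is illustrative; SVC stands in for SMO):

    # Sketch: add cluster assignments as an extra feature before classification, like Weka's AddCluster.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, n_features=8, random_state=0)
    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)   # cluster without the class
    X_aug = np.column_stack([X, clusters])                                       # "cluster" as a new attribute

    print(cross_val_score(SVC(), X_aug, y, cv=10).mean())                        # SVC standing in for SMO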

Information leakage in Cross-validation

Description of classification problem:
Assume a regular dataset X with n samples and d features.
This classification problem is somewhat hard (many features, few samples, low overall AUC ~70%).
It might be useful to mention that feature selection/extraction, dimension reduction, kernels, and many classifiers have already been applied, so I am not interested in trying these.
I am not expecting to see an improvement in overall AUC. The goal is to find the relevant features in a haystack of features.
Description of my approach:
I select all pairwise combinations of the d features and create many two-dimensional sub-datasets x, each with n samples.
On each sub-dataset x, I perform a 10-fold cross-validation (using all samples of the main dataset X). This is a very long process; assume weeks of computation.
I select the top k pairs (according to the highest AUC, for example) and label them as +. All other pairs are labeled as -.
For each pair, I can compute several properties (e.g. relations between the two features, using expert knowledge). These properties can be calculated without using the labels in the main dataset X.
Now I have pairs which are labeled as + or -. In addition, each pair has many properties calculated based on expert knowledge (i.e. features). Hence, I have a new classification problem. Let's call this newly generated dataset Y.
I train a classifier on Y while following cross-validation rules. Surprisingly, I can predict the + and - labels with 90% AUC.
As far as I can see, this means that I am able to select relevant features. However, seeing a 90% AUC makes me worried about information leakage somewhere in this long process, especially in step 3.
I was wondering if anyone can see any leakage in this approach.
Information Leakage:
Incorporating the target labels into the actual features. Your classifier will produce good predictions while not having learned anything.
Showing your test set to your classifier during the training phase. Your classifier will "memorize" the test set and its corresponding labels without "learning" anything.
Update 1:
I want to stress that I am indeed using all data points of X in step 1. However, I am not using them ever again (not even for testing). The final 90% AUC is obtained from predicting the labels of dataset Y.
On the other hand, it is worth noting that even if I randomize the values of my main dataset X, the computed features for dataset Y are going to be the same. However, the sample labels in Y would change, because the previously + pairs might not be good ones anymore; they would then be labeled as -.
Update 2:
Although I haven't received any opinions, I am going to state what I learned during 4 days of talking with pattern-recognition researchers. Briefly, I became confident that there is no information leakage (as long as I do not go back to the first dataset X and use its labels). Later on, if I want to check whether I could get better performance on X (i.e. predicting sample labels), I would need to use only a part of dataset X for the pairwise comparison (as a training set). Then I could use the rest of the samples in X as a test set, while using the positively predicted pairs of Y as features.
I will set this as an answer in case no one can reject this method.
If your process in step 1 uses all the data, then the features you are learning carry information from the whole dataset. Since you selected based on the whole dataset and THEN validated, you are leaking serious information.
You should probably stick with tools that are well known / already implemented before running off and trying unusual strategies like this. Try using a model with L1 regularization to do the feature selection for you, or start with some of the simpler searches like Sequential Backward Selection.
If you do cross-validation correctly in the end, each training run will perform its own independent feature selection. If you do one global feature selection and then do CV, you are doing it wrong and probably leaking information.
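What "each training run performs its own feature selection" looks like in code: put the selection step inside a pipeline so it is re-fit on every training fold. The data and parameters below are only illustrative:

    # Sketch: feature selection nested inside CV, so each fold selects features from its own training data only.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    X, y = make_classification(n_samples=100, n_features=500, n_informative=10, random_state=0)

    pipe = Pipeline([
        ("select", SelectKBest(f_classif, k=20)),                        # re-fit on each training fold
        ("clf", LogisticRegression(penalty="l1", solver="liblinear")),   # L1-regularized model
    ])
    print(cross_val_score(pipe, X, y, cv=10, scoring="roc_auc").mean())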

How do I compare two classifiers in Weka using the Paired T-test

In Weka I can go to the Experimenter. In the set-up I can load an .arff file and get Weka to create a classifier (e.g. J48), then I can run it, and finally I can go to the Analyse tab. That tab gives me an option for 'testing with Paired T-Test', but I cannot figure out how to create a second classifier (e.g. J48 unpruned) and do a T-test on the two results.
Google does not lead me to any tutorial or answers.
How can I get Weka to do a T-Test on the results of two different classifiers, made from the same data?
Please follow the steps in http://fiji.sc/Advanced_Weka_Segmentation_-_How_to_compare_classifiers.
In the screenshot there, the author sets the significance level to 0.05. In my understanding, in such a test you always compare against a baseline classifier (there it is NaiveBayes); the output uses the annotation v or * to indicate that a specific result is statistically better (v) or worse (*) than the baseline scheme at the specified significance level (0.05 here). The baseline might not be the one you expected, though.
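Outside the Experimenter, the same comparison can be scripted: evaluate both classifiers on identical folds and run a paired t-test on the per-fold scores. The sketch below uses scikit-learn and SciPy with stand-in classifiers; note that Weka's Analyse tab applies a corrected resampled t-test, which this plain paired test does not:

    # Sketch: paired t-test on per-fold accuracies of two classifiers evaluated on the same folds.
    from scipy.stats import ttest_rel
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)      # identical folds for both models

    pruned = DecisionTreeClassifier(max_depth=3, random_state=1)         # stand-ins for J48 pruned/unpruned
    unpruned = DecisionTreeClassifier(random_state=1)

    scores_a = cross_val_score(pruned, X, y, cv=cv)
    scores_b = cross_val_score(unpruned, X, y, cv=cv)
    t, p = ttest_rel(scores_a, scores_b)
    print(f"t = {t:.3f}, p = {p:.3f}")                                   # p < 0.05: significant at the 5% level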
