How to set the class of test data that is not labeled? - machine-learning

I'm using the Weka Explorer and I want to test my classifier on unlabeled data. People on the internet say I can set the class to '?' in order to do that, but it doesn't work. Is that only for the command line? How can I do it in the Explorer?

There is no point in having a test case without a label. Setting the class to '?' is useful for training cases, as some methods can infer parameters from unlabeled samples (semi-supervised learning). For testing purposes you have to provide a label/value/other indicator of the correct answer (depending on the type of problem).
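For reference, a missing class value is written as '?' in the ARFF data section. A minimal sketch, with a made-up relation and attributes:

    @relation weather_example
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute play {yes, no}

    @data
    21.5, 60, yes
    % the class value of the next instance is missing, i.e. unlabeled
    18.0, 75, ?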

Related

Should I retrain the model on the whole dataset after using a train-test split to find the best hyperparameters?

I split my dataset into training and test sets. After finding the best hyperparameters on the training set, should I fit the model again using all the data? The point is to reach the highest possible score on new data.
Yes, that would help your model generalize, as more data generally means better generalization.
I don't think so. If you do that, you will no longer have a valid test set. What happens when you come back to improve the model later? You would then need a new test set for each model improvement, which means more labeling, and you won't be able to compare experiments across model versions because the test sets won't be identical.
If you consider this model finished forever, then it's fine.
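To make the trade-off concrete, here is a minimal Weka sketch of the workflow under discussion: evaluate on a held-out split, then optionally refit on all the data for deployment. The file name and the choice of J48 are placeholders, not anything from this thread:

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RetrainExample {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("mydata.arff"); // placeholder file
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new java.util.Random(42));

            // Hold out 20% as a test set.
            int trainSize = (int) Math.round(data.numInstances() * 0.8);
            Instances train = new Instances(data, 0, trainSize);
            Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

            J48 tuned = new J48(); // stands in for the model with the best hyperparameters
            tuned.buildClassifier(train);

            // Honest performance estimate from the held-out split.
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tuned, test);
            System.out.println(eval.toSummaryString());

            // Refitting on all the data may generalize better, but the test
            // set is then no longer valid for future comparisons.
            J48 finalModel = new J48();
            finalModel.buildClassifier(data);
        }
    }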

What do Weka's different test options mean?

So I've recently started using Weka, and there are several test options when building a tree with, for example, J48. The following are the options, including my understanding of them:
Use training set - All I know is that it's highly optimistic and not necessarily useful. Even Weka's documentation (section 2.1.5) isn't very specific about it.
Supplied test set - Pretty self-explanatory, you supply it a test set.
Cross-Validation - I understood it by reading this short example.
Percentage Split - I assume it means partitioning the dataset into two parts by a certain percentage, one for training and one for testing.
What I want to know is what exactly the first option (training set) is and what it does. Where does it get this training set from, and what data does it test on exactly? I would also appreciate it if you could correct my understanding of the rest, where it's wrong.
The first option simply means "use all the loaded data to run this algorithm". You choose this:
to try things out,
to have a first look at the results-section in the output,
to check the performance/run duration,
to check whether Weka's output matches the implementation of the same algorithm in different software, say R or MATLAB.
...
Option one is:
test set = training set
The resulting scores will of course be prone to overfitting, which is why it's "highly optimistic and not necessarily useful".
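For completeness, here is a rough sketch of how the four test options map onto the Weka API; the file names are placeholders and J48 merely stands in for whatever classifier you use:

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TestOptions {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("train.arff"); // placeholder
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();
            tree.buildClassifier(data);

            // 1. "Use training set": test set = training set (optimistic).
            Evaluation onTrain = new Evaluation(data);
            onTrain.evaluateModel(tree, data);

            // 2. "Supplied test set": evaluate on a separate file.
            Instances test = DataSource.read("test.arff"); // placeholder
            test.setClassIndex(test.numAttributes() - 1);
            Evaluation onTest = new Evaluation(data);
            onTest.evaluateModel(tree, test);

            // 3. "Cross-validation": 10 folds, each fold trains a fresh model.
            Evaluation cv = new Evaluation(data);
            cv.crossValidateModel(new J48(), data, 10, new java.util.Random(1));

            // 4. "Percentage split": e.g. 66% for training, the rest for testing.
            data.randomize(new java.util.Random(1));
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances trainPart = new Instances(data, 0, trainSize);
            Instances testPart = new Instances(data, trainSize, data.numInstances() - trainSize);
            J48 splitTree = new J48();
            splitTree.buildClassifier(trainPart);
            Evaluation split = new Evaluation(trainPart);
            split.evaluateModel(splitTree, testPart);

            System.out.println(split.toSummaryString());
        }
    }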

Alternative to "Weka: training and test set are not compatible"?

"Weka: training and test set are not compatible" can be solved using batch filtering but at the time of training a model I don't have test.arff. My problem caused in the command "stringToWord vector" (on CLI).
So my question is, can Caret package(R) or Scikit learn (Python) provides any alternative for this one.
Note:
1. The functionality provided by StringToWordVector is a hard requirement.
2. I don't want to retrain my model at test time because it takes a lot of time.
Given the requirements you mentioned, you can use Weka's FilteredClassifier during training and testing. I am not reiterating what I have recorded as video casts here and here.
But the basic idea is not to use StringToWordVector as a standalone filter, but rather as the filter option inside FilteredClassifier. You generate the model just once. You can then apply that model directly to your unlabelled data, without retraining it and without applying StringToWordVector again; FilteredClassifier takes care of these concerns for you.
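A minimal sketch of that setup, assuming NaiveBayes as the base classifier; the file and model paths are placeholders:

    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.core.Instances;
    import weka.core.SerializationHelper;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class TrainFiltered {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("train.arff"); // placeholder
            train.setClassIndex(train.numAttributes() - 1);

            // The filter is embedded in the classifier, so it is applied
            // consistently at both training and prediction time.
            FilteredClassifier fc = new FilteredClassifier();
            fc.setFilter(new StringToWordVector());
            fc.setClassifier(new NaiveBayes());
            fc.buildClassifier(train); // train once

            SerializationHelper.write("model.bin", fc); // placeholder path

            // Later, on unlabelled data: no retraining, no manual filtering.
            FilteredClassifier loaded =
                (FilteredClassifier) SerializationHelper.read("model.bin");
            Instances unlabeled = DataSource.read("unlabeled.arff"); // placeholder
            unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
            double predicted = loaded.classifyInstance(unlabeled.instance(0));
            System.out.println("Predicted class index: " + predicted);
        }
    }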

Simple statistical yes/no classifier in WEKA

In order to compare the results of my research on labeled text classification, I need a baseline to compare against. One of my colleagues told me one solution would be to build the simplest and dumbest classifier possible: one that makes its decision based on the frequency of a particular label.
This means that when my dataset has a total of 100 samples and the classifier knows that 80% of these samples have label A, it will classify a sample as 'A' 80% of the time. Since my entire research uses the Weka API, I have looked into the documentation but unfortunately haven't found anything about this.
So my question is: is it possible to implement such a classifier in Weka, and if yes, could someone point out how? This question is purely informative; I looked into this but did not find anything, so I hope to find an answer here.
That classifier is already implemented in Weka. It is called ZeroR, and it simply predicts the most frequent class (for nominal class attributes) or the mean (for numeric class attributes). If you want to know how to implement such a classifier yourself, look at the ZeroR source code.
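A short sketch of using it as a baseline via the Weka API; the file name is a placeholder:

    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.ZeroR;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Baseline {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("labeled_text.arff"); // placeholder
            data.setClassIndex(data.numAttributes() - 1);

            // ZeroR always predicts the majority class, so its accuracy
            // equals the frequency of the most common label.
            ZeroR baseline = new ZeroR();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(baseline, data, 10, new java.util.Random(1));
            System.out.println(eval.toSummaryString());
        }
    }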

The Role of the Training & Test Sets in Building a Decision Tree and Using It to Classify

I've been working with Weka for a couple of months now.
Currently, I'm working on my machine learning course here at Ostfold University College.
I need a better way to construct a decision tree based on separate training and test sets.
Anybody who comes up with a good idea would be a great relief.
Thanks in advance.
-Neo
You might be asking for something more specific, but in general:
You build the decision tree with the training set, and you evaluate the performance of that tree using the test set. In other words, on the test data, you call a function usually named something like classify, passing in the newly built tree and a data point (from your test set) that you wish to classify.
This function returns the leaf (terminal) node of your tree to which that data point belongs. Assuming the contents of that leaf are homogeneous (populated with data from a single class, not a mixture), you have in essence assigned a class label to that data point. When you compare the class label assigned by the tree with the data point's actual class label, and repeat this for every instance in your test set, you have a metric for evaluating the performance of your tree.
A rule of thumb: shuffle your data, then assign 90% to the training set and the remaining 10% to the test set.
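A sketch of the procedure described above via the Weka API: build the tree on the training set, classify each test instance, and compare against the actual label. The file names are placeholders:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TreeEval {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("train.arff"); // placeholder
            Instances test = DataSource.read("test.arff");   // placeholder
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            J48 tree = new J48();
            tree.buildClassifier(train); // build the tree on the training set

            // Classify every test instance and count the correct predictions.
            int correct = 0;
            for (int i = 0; i < test.numInstances(); i++) {
                double predicted = tree.classifyInstance(test.instance(i));
                if (predicted == test.instance(i).classValue()) {
                    correct++;
                }
            }
            System.out.println("Accuracy: " + (double) correct / test.numInstances());
        }
    }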
Actually, I was looking for something like this: http://weka.wikispaces.com/Saving+and+loading+models
That is, to save a model, load it, and use it on the test set.
This is exactly what I was searching for. I hope it will be useful to anyone who has a similar problem to mine.
cheers
-Neo182
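For reference, the save/load pattern from that wiki page, sketched with Weka's SerializationHelper; the file paths are placeholders:

    import weka.classifiers.Classifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.SerializationHelper;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SaveLoad {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("train.arff"); // placeholder
            train.setClassIndex(train.numAttributes() - 1);

            J48 tree = new J48();
            tree.buildClassifier(train);
            SerializationHelper.write("j48.model", tree); // save the model

            // Later, possibly in another session: load and reuse it.
            Classifier loaded = (Classifier) SerializationHelper.read("j48.model");
            Instances test = DataSource.read("test.arff"); // placeholder
            test.setClassIndex(test.numAttributes() - 1);
            double prediction = loaded.classifyInstance(test.instance(0));
            System.out.println("Predicted class index: " + prediction);
        }
    }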
