So I've recently started using Weka, and there are several test options when building a tree with, for example, J48. The following are the options, along with my understanding of them:
Use training set - All I know is that it's highly optimistic and not necessarily useful. Even Weka's documentation (section 2.1.5) isn't very specific about it.
Supplied test set - Pretty self-explanatory: you supply the test set yourself.
Cross-Validation - I understood it by reading this short example.
Percentage Split - I assume it means partitioning the data set into two parts according to a given percentage, one for training and one for testing.
What I want to know is what exactly the training set (first option) is and what it does. Where does Weka get this training set from, and what data does it test on exactly? Also, please correct my understanding of the rest if it's wrong.
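For reference, the last two options correspond to standard evaluation protocols that exist outside Weka as well. A minimal sketch of "percentage split" and "cross-validation" in scikit-learn terms (an analogy only, using a built-in toy dataset and a decision tree as a rough stand-in for J48):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier  # rough stand-in for J48

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# "Percentage split": hold back a fixed fraction of the data for testing
# (Weka's default split is 66% train / 34% test).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state=0)
clf.fit(X_train, y_train)
print("percentage-split accuracy:", clf.score(X_test, y_test))

# "Cross-validation": 10 folds, each fold used once as the test set.
print("10-fold CV accuracy:", cross_val_score(clf, X, y, cv=10).mean())
```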
The first option simply means "use all data loaded to run this algorithm". You choose this
to try things out,
to have a first look at the results-section in the output,
to check the performance/run duration,
to check whether Weka's output matches an implementation of the same algorithm in different software, say R or MATLAB.
...
Option one is:
test set = training set
The resulting scores will of course be prone to overfitting, and that is why it's "highly optimistic and not necessarily useful".
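To make "test set = training set" concrete, here is a hedged scikit-learn analogy (again with a decision tree standing in for J48) showing how much more optimistic the training-set score is compared to a cross-validated one:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

# "Use training set": evaluate on exactly the data the model was fitted on.
tree.fit(X, y)
print("training-set accuracy:", tree.score(X, y))   # typically 1.0 for an unpruned tree

# Held-out folds give a noticeably lower, more honest estimate.
print("10-fold CV accuracy:", cross_val_score(tree, X, y, cv=10).mean())
```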
I am setting up a Naive Bayes classifier to try to determine sameness between two records of five string properties. I am only comparing each pair of properties exactly (i.e., with a Java .equals() method). I have some training data, both TRUE and FALSE cases, but let's just focus on the TRUE cases for now.
Let's say there are some TRUE training cases where all five properties are different. That means every comparator fails, but the records are actually determined to be the 'same' after some human assessment.
Should this training case be fed to the Naive Bayes classifier? On the one hand, since NBC treats each variable separately, these cases shouldn't totally break it. On the other hand, it certainly seems that feeding in enough of these cases wouldn't be beneficial to the classifier's performance. I understand that seeing a lot of these cases would mean better comparators are required, but I'm wondering what to do in the meantime. Another consideration is that the flip side is impossible; that is, there's no way all five properties could be identical between two records and still have them be 'different' records.
Is this a preferential issue, or is there a definitive accepted practice for handling this?
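For what it's worth, the situation can be prototyped directly. A minimal sketch, assuming scikit-learn's BernoulliNB as the Naive Bayes implementation and a tiny made-up training matrix, where each row holds the five exact-match booleans for one record pair and one TRUE ('same') pair has every comparator failing:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy data: each row is [prop1_matches, ..., prop5_matches] for one record pair.
X = np.array([
    [1, 1, 1, 1, 1],   # same
    [1, 1, 0, 1, 1],   # same
    [0, 0, 0, 0, 0],   # same, even though every comparator fails (the case in question)
    [0, 1, 0, 0, 0],   # different
    [0, 0, 0, 0, 0],   # different
])
y = np.array(["same", "same", "same", "different", "different"])

nb = BernoulliNB().fit(X, y)
# The all-zero TRUE case only nudges P(feature = 0 | same) for each feature;
# it does not break the model, but many such cases dilute the signal.
print(nb.predict_proba([[1, 1, 1, 1, 1], [0, 0, 0, 0, 0]]))
```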
Usually you will want a training data set that is as representative as feasibly possible of the domain from which you hope to classify observations (often difficult, though). An unrepresentative set may lead to a poorly functioning classifier, particularly in a production environment where varied data are received. That being said, preprocessing may be used to limit the exposure of a classifier trained on a particular subset of data, so it depends quite a bit on the purpose of the classifier.
I'm not sure why you wish to exclude some elements, though. Parameter estimation/learning should account for the fact that two different inputs may map to the same output; that is why you would use machine learning instead of simply using a hashmap. Considering that you usually don't have 'all the data' when building your model, you have to rely on this type of inference.
Have you had a look at NLTK? It is in Python, but it seems that OpenNLP may be a suitable substitute in Java. You can employ better feature extraction techniques that lead to a model accounting for minor variations in the input strings (see here).
Lastly, it seems to me that you want to learn a mapping from input strings to the classes 'same' and 'not same'; in other words, you want to infer a distance measure. It would make more sense to invest effort in directly finding a better measure (e.g. for character transposition issues you could use edit distances). I'm not sure that NB is well suited to your problem, as it attempts to determine a class given an observation (or its features). This class will have to be discernible over many different strings (I'm assuming you are going to concatenate string1 and string2 and offer them to the classifier). Will there be enough structure present to derive such a widely applicable property? The classifier will basically need to handle all pairwise 'comparisons', unless you build an NB for each one-vs-many pairing, which does not seem like a simple approach.
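As one concrete illustration of "directly finding a better measure", here is a hedged sketch using Python's standard-library difflib to turn each property pair into a graded similarity instead of a hard .equals() boolean (the property names and records are made up):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()

def pair_features(record1: dict, record2: dict,
                  props=("name", "street", "city", "phone", "email")):
    # One graded similarity per property instead of a hard exact-match flag.
    return [similarity(str(record1.get(p, "")), str(record2.get(p, ""))) for p in props]

r1 = {"name": "Jon Smith", "street": "12 Main St", "city": "Springfield",
      "phone": "555-0101", "email": "jon@example.com"}
r2 = {"name": "John Smith", "street": "12 Main Street", "city": "Springfield",
      "phone": "5550101", "email": "jon@example.com"}
print(pair_features(r1, r2))  # values near 1.0 even though exact equality fails for most properties
```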
I have a 6-dimensional training dataset with a numeric attribute that perfectly separates all the training examples: if TIME < 200 the example belongs to class1, and if TIME >= 200 it belongs to class2. J48 creates a tree with only one level, with this attribute as the only node.
However, the test dataset does not follow this hypothesis and all the examples are misclassified. I'm having trouble figuring out whether this case is considered overfitting or not. I would say it is not, as the dataset is that simple, but as far as I understand the definition, overfitting implies a very close fit to the training data, and that is what I have. Any help?
Usually a great training score and a bad testing score means overfitting. But this assumes the data are IID, and you are clearly violating this assumption: your training data are completely different from the testing data (there is a clear rule in the training data which has no meaning for the testing data). In other words, your train/test split is incorrect, or your whole problem does not follow the basic assumptions of where to use statistical ML. Of course, we often fit models without valid assumptions about the data; in your case the most natural approach is to drop the feature that violates the assumption the most, namely the one used to construct the node. This kind of "expert decision" should be made prior to building any classifier: you have to think about what is different in the test scenario as compared to the training one and remove the things that show this difference, otherwise you have a heavy skew in your data collection and statistical methods will fail.
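A minimal sketch of the "drop the offending feature" step, assuming the data lives in CSV files with a TIME column and a class column, and using a scikit-learn decision tree as a rough stand-in for J48 (file and column names are hypothetical):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("train.csv")   # hypothetical files with a TIME attribute and a "class" label
test = pd.read_csv("test.csv")

# Remove the attribute whose behaviour differs between train and test before fitting anything.
features = [c for c in train.columns if c not in ("TIME", "class")]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(train[features], train["class"])
print("test accuracy without TIME:", tree.score(test[features], test["class"]))
```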
Yes, it is overfitting. The first rule in creating a training set is to make it look as much like any other set as possible. Your training set is clearly different from any other: it has the answer embedded within it, while your test set doesn't. Any learning algorithm will likely find the correlation with the answer and use it, and, just like the J48 algorithm, will regard the other variables as noise. It's the software equivalent of Clever Hans.
You can overcome this either by removing the variable or by training on a set drawn randomly from all the available data. However, since you know there is a subset with a major hint embedded in it, you should remove the hint.
You're lucky. At times these hints can be quite subtle, and you won't discover them until you start applying the model to future data.
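And if you instead go the "train on a set drawn randomly from the entire available set" route, a hedged sketch (same hypothetical files and columns as above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Pool everything, then split at random so train and test come from the same distribution.
data = pd.concat([pd.read_csv("train.csv"), pd.read_csv("test.csv")], ignore_index=True)
train, test = train_test_split(data, test_size=0.3, random_state=0, stratify=data["class"])

tree = DecisionTreeClassifier(random_state=0)
tree.fit(train.drop(columns=["class"]), train["class"])
print("accuracy on the random split:", tree.score(test.drop(columns=["class"]), test["class"]))
```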
I'm using the Weka Explorer now and I want to test my classifier on unlabeled data. People on the internet said I can set the class to '?' in order to do that, but it doesn't work. Is that only for the command line? How can I do it in the Explorer?
There is no point in having a test case without a label. Setting '?' is useful for training cases, as some methods can infer parameters from unlabeled samples. For testing purposes you have to provide a label/value/other indicator of the correct answer (depending on the type of problem).
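To make the distinction concrete in scikit-learn terms (a hedged analogy; the question is about the Weka Explorer, but the principle is the same): predictions can be produced for unlabeled rows, whereas testing, i.e. measuring correctness, requires the true labels:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X[:120], y[:120])

X_unlabeled = X[120:]              # pretend these rows arrived with their class set to '?'
print(clf.predict(X_unlabeled))    # predictions work fine without labels...

# ...but evaluation (accuracy, confusion matrix, etc.) needs the true labels:
print(clf.score(X_unlabeled, y[120:]))
```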
"Weka: training and test set are not compatible" can be solved using batch filtering but at the time of training a model I don't have test.arff. My problem caused in the command "stringToWord vector" (on CLI).
So my question is: can the caret package (R) or scikit-learn (Python) provide an alternative for this?
Note:
1. The functionality provided by StringToWordVector is a hard requirement.
2. I don't want to retrain my model at testing time because it takes a lot of time.
Given the requirements you mentioned, you can use Weka's FilteredClassifier during training and testing. I won't reiterate what I have recorded as video casts here and here.
But the basic idea is not to apply StringToWordVector as a standalone filter; rather, use it as the filter inside a FilteredClassifier. You generate the model just once, and then you can apply it directly to your unlabelled data without retraining and without applying StringToWordVector to the unlabelled data again. FilteredClassifier takes care of these concerns for you.
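Regarding the scikit-learn alternative asked about above: a Pipeline plays roughly the same role as FilteredClassifier, in that the text-to-vector step is fitted once together with the classifier and reapplied automatically at prediction time. A hedged sketch (CountVectorizer standing in for StringToWordVector; the texts, labels, and file name are made up):

```python
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Fit the text-to-vector step and the classifier together, once, on the training texts.
model = Pipeline([
    ("to_words", CountVectorizer()),   # rough analogue of StringToWordVector
    ("clf", MultinomialNB()),
])
model.fit(["cheap pills now", "meeting at noon", "win a free prize"], ["spam", "ham", "spam"])
joblib.dump(model, "text_model.joblib")   # hypothetical file name

# Later: load and classify new, unlabelled strings; the vocabulary learned at training
# time is reused automatically, with no retraining and no separate filtering step.
loaded = joblib.load("text_model.joblib")
print(loaded.predict(["free prize pills", "the noon meeting moved"]))
```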
(noob in ML, be patient)
I want to test the performance of my scikit-learn SVMLinear classifier. My train set has a different class distribution than the actual population, but my test set is representative and distributed like the actual population.
I noticed that there's a class_weight parameter, and I want to try giving my classifier the actual population distribution to see if it helps it perform better.
However, since my train set's distribution is different, my validation set's will be too, right? So should I expect an improvement on validation, or must I use my test set to see the improvement? And if so, isn't it against the rules to calibrate using the test set, which would lead to burning the test set or overfitting?
I've thought about bootstrap resampling of my train set: making it distribute like the general population, and only then training and validating my model. Is this a good solution?
Thanks!
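For reference, this is roughly what passing a class distribution via class_weight looks like for a linear SVM in scikit-learn; the toy data and the weight values are made-up placeholders, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy, imbalanced training set (about 90% class 0, 10% class 1).
X_train, y_train = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# class_weight rescales each class's contribution to the loss: either an explicit
# dict of weights per class, or "balanced" for inverse training-set frequencies.
clf = LinearSVC(class_weight={0: 1.0, 1: 3.0}, max_iter=10000).fit(X_train, y_train)   # placeholder weights
clf_balanced = LinearSVC(class_weight="balanced", max_iter=10000).fit(X_train, y_train)
```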
It seems that you have some good ideas, most of which are worth trying. The answers mostly depend on the application and on the size of your train/test sets.
It is against the rules to calibrate on the test set and then use the whole test set again for evaluation. However, if your test set is large enough, you can always divide it into two sets: a validation set and an actual test set. Your final evaluation will then be based on a smaller test set, which might still be acceptable depending on the application.
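A hedged sketch of that test-set split (scikit-learn; the toy data just stands in for the representative test set described in the question):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy stand-in for the representative test set.
X_test, y_test = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=1)

# Carve a validation set out of it for calibration; keep the other half untouched
# for a single final evaluation.
X_val, X_final, y_val, y_final = train_test_split(
    X_test, y_test, test_size=0.5, stratify=y_test, random_state=1
)
# Tune class_weight etc. against (X_val, y_val); report once on (X_final, y_final).
```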
For your training set, which you believe has a different class distribution than the actual population, there are several things worth trying. Usually the most accepted approach is to use a classifier that can handle these differences (usually one with fewer parameters, to avoid overfitting). There is a whole topic of classification and regression on skewed datasets that you can look through. Besides the choice of classifier, and provided that you did not derive the actual population distribution from your test set, the methods below might help too:
1- One of them can be (as you said) bootstrap resampling, in case your training set is large enough for that (see the sketch after this list).
2- Another approach can be generating more training samples by adding some noise to the current samples of the training set. For example, if you are classifying images of birds, you can randomly make the images darker or brighter, or randomly shift them a few pixels to the sides or up and down (selecting values randomly within a small enough range). This way, you can add to the training set so as to reach the desired distribution.
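A hedged sketch of the bootstrap resampling idea from point 1, assuming a binary problem where the population is believed to be about 90%/10% while the training set happens to be balanced (all numbers are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Toy balanced training set; assume the real population is ~90% class 0, 10% class 1.
X_train, y_train = make_classification(n_samples=1000, weights=[0.5, 0.5], random_state=0)
target_fractions = {0: 0.9, 1: 0.1}
n_total = len(y_train)

parts_X, parts_y = [], []
for cls, frac in target_fractions.items():
    X_c, y_c = X_train[y_train == cls], y_train[y_train == cls]
    # Sample this class with replacement until it has its target share.
    X_r, y_r = resample(X_c, y_c, replace=True,
                        n_samples=int(frac * n_total), random_state=0)
    parts_X.append(X_r)
    parts_y.append(y_r)

X_boot, y_boot = np.vstack(parts_X), np.concatenate(parts_y)
# Train and validate on (X_boot, y_boot), which now mimics the population distribution.
```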