Using training data and testing data in a shared task - machine-learning

I am working on this shared task http://alt.qcri.org/semeval2017/task4/index.php?id=data-and-tools
which is just a twitter sentiment analysis. Since i am pretty new to machine learning, I am not quite sure how to use both training data and testing data.
So the shared task provides two same sets of twitter tweets one without the result (train) and one with the result.
I current understandings of using these kinds of data in machine learning are as follows:
training set: we are supposed to split this into training and testing portions (90% training and 10% testing maybe?)
But the existing of a separate test data kind of confuses.
Are we supposed to use the result that we got in the test using the 10% portion of the 'training set' and compare that to the actual result 'testing set' ?
Can someone correct my understanding?

When training a machine learning model, you are feeding your algorithm with the dataset called training set, which in this stage, you are telling the algorithm what is the ground truth of each sample you put into the algorithm, that way, the algorithm learns from each sample you are feeding to it. the training set is usually 80% of the whole dataset, the other 20% of the dataset is the testing set, which in this case, you know what is the ground truth of each sample, but you let your algorithm predict what it think the truth is to each sample you let it predict. All those prediction over the testing set are based on what the algorithm have learned from the training set you fed it before.
After you make all the predictions over your testing set you can then check how accurate your model is based on the ground truth in compare to the prediction the model have made.

Related

How to perform classification on training and test dataset in Weka

I am using Weka software to classify model. I have confusion using training and testing dataset partition. I divide 60% of the whole dataset as training dataset and save it to my hard disk and use 40% of data as test dataset and save this data to another file. The data that I am using is an imbalanced data. So I applied SMOTE in my training dataset. After that, in the classify tab of the Weka I selected Use training set option from Test options and used Random Forest classifier to do the classification on the training dataset. After getting the result I chose Supplied test set option from Test options and load my test dataset from hard disk and again ran the classifier.
I try to find out tutorial on how to load training set and test set in Weka but did not get it. I did the above process depend upon my understanding.
Therefore, I would like to know is that the right way to perform classification on training and test dataset?
Thank you.
There is no need to evaluate your classifier on the training set (this will be overly optimistic, since the classifier has already seen this data). Just use the Supplied test set option, then your classifier will get trained automatically on the currently loaded dataset before being evaluated on the specified test set.
Instead of manually splitting your data, you could also use the Percentage split test option, with 60% to be used for your training data.
When using filters, you should always wrap them (in this case SMOTE) and your classifier (in this case RandomForest) in the FilteredClassifier meta-classifier. That way, you will ensure that the training and test set data will get transformed correctly. This will also avoid the problem of leaking information into the test set when transforming the full dataset with a supervised filter and splitting the dataset into train/test afterwards. Finally, it also documents nicely what preprocessing is being done to your input data, all in a single command-line string.
If you need to apply more than one filter, use the MultiFilter to apply them sequentially.

train/validate/test split for time series anomaly detection

I'm trying to perform a multivariate time series anomaly detection. I have training data that consists of "normal" data. I train on this data and detect anomalies on the test set that contains normal + anomalous data. My understanding is that it would be wrong to tweak the model hyperparameters based on the results from the test set.
What would the train/validate/test set look like to train and evaluate a time-series anomaly detector?
Nothing very specific to anomaly detection here. You neeed to split the testing data into one or more validation and test sets, while making sure they are reasonably independent (no information leakage between them).

Applying PCA before sending data to SVM

Before applying SVM on my data I want to reduce its dimension by PCA. Should I separate the Train data and Test data then apply PCA on each of them separately or apply PCA on both sets combined then separate them?
Actually both provided answers are only partially right. The crucial part here is what is the exact problem you are trying to solve. There are two basic possible settings which can be considered, and both are valid under some assumptions.
Case 1
You have some data (which you splitted to train and test) and in the future you will get more data coming from the same distribution.
If this is the case, you should fit PCA on train data, then SVM on its projection, and for testing you just apply already fitted PCA followed by already fitted SVM, and you do exactly the same for new data that will come. This way your test error (under some "size assumptions" should approximate your expected error).
Case 2
You have some data (which you splitted train and test) and in the future you will obtain a big chunk of unlabeled data and you will be able to fit your model then.
In such a case, you fit PCA on whole data provided, learn SVM on labeled part (train set) and evaluate on test set. This way, once new data arrives you can fit PCA using both your data and new ones, and then - train SVM on your old data (as this is the only one having labels). Under the assumption that again - data comes from the same distributions, everything is correct here. You use more data to fit PCA only to have a better estimator (maybe your data is really high dimensional and PCA fails with small sample?).
You should do them separately. If you run pca on both sets combined then you are going to introduce a bias in your svn. The goal of the test set is to see how your algorithm will perform without prior knowledge of the data.
Learn the Projection Matrix of PCA on the train set and use this to reduce the dimensions of the test data.
One benifit is this way you don't have to rely on collecting sufficient data in the test set if you are applying your classifier for actual run time where test data comes one sample at a time.
Also I think separate train and test PCA will fail.Why?
Think of PCA as giving you features, and then you learn a classifier over these features. If over time your data shifts, then the test features you get using PCA would be different, and you don't have a classifier trained on these features. Even if the set of directions/features of the PCA remain same but their order varies your classifier still fails.

About training HMM by using EM

I am new to EM algorithm, studying Hidden Markov Model.
During training my HMM by EM, I am very confused on the data setting. (text processing)
Please confirm whether my EM usage is okay or not.
At first, I calculated statistics for emission probability matrix with my whole training set. And then, I ran EM with the same set.
-> Emission probability for unseen data converged to zero at the time.
While I read a text, Speech and Language Processing, I found the exercise 8.3 tells two phase training method.
8.3 Extend the HMM tagger you built in Exercise 8.?? by adding the ability to make use of some unlabeled data in addition to your labeled training corpus. First acquire a large unlabeled corpus. Next, implement the forward-backward training algorithm. Now start with the HMM parameters you trained on the training corpus in Exercise 8.??; call this model M0. Run the forward-backward algorithm with these HMM parameters to label the unsupervised corpus. Now you have a new model M1. Test the performance of M1 on some held-out labeled data.
Following this statement, I select some instances from my training set (1/3 of training set) for getting initial statistics.
And then, I run EM procedure with whole training set for optimizing parameters in EM.
Is it ok?
The procedure that the exercise is referring to is a type of unsupervised learning known as self-training. The idea is that you use your entire labeled trainign set to build a model. Then you collect more data that is unlabeled. It is much easier to find new unlabeled data than it is to find new labeled data. After that, you would label the new data using the model you originally trained. Now, using the automatically generated labels, train a new model.

Classfication accuracy on Weka

I am using Weka GUI for a classification. I am new to Weka and getting confused with the options
Use training Set
Supplied test set
Cross validation
to train my classification algorithm (for example J48), I trained with cross validation 10 folds and the accuracy is pretty good (97%). When I test my classification - the accuracy drops to about 72%. I am so confused. Any tips please? This is how I did it:
I train my model on the training data (For example: train.arff)
I right-click in the Results list on the item which model you want to save
select Save model and save it for example as j48tree.model
and then
I load the test data (for example: test.arff via the Supplied test set button
Right-click in the Results list, I selected Load model and choose j48tree.model
I selected Re-evaluate model on current test set
Is the way i do it wrong? Why the accuracy miserably dropping to 72% from 97%? Or is doing only the cross-validation with 10 folds is enough to train and test the classifier?
Note: my training and testing datasets have the same attributes and labels. The only difference is, I have more data on the testing set which I don't think will be a problem.
I don't think there is any issue with how you use WEKA.
You mentioned that you test set is larger than training? What is the split? The usual rule of thumb is that test set should be one 1/4 of the whole dataset, i.e. 3 times smaller than training and definitely not larger. This alone could explain the drop from 97% to 72% which is by the way not so bad for real life case.
Also it will be helpful if you build the learning curve https://weka.wikispaces.com/Learning+curves as it will explain whether you have a bias or variance issue. Judging by your values sounds like you have a high variance (i.e. too many parameters for your dataset), so adding more examples or changing your split between training and test set will likely help.
Update
I ran a quick analysis of the dataset at question by randomforest and my performance was similar to the one posted by author. Details and code are available on gitpage http://omdv.github.io/2016/03/10/WEKA-stackoverflow

Resources