How to output resultant documents from Weka text-classification - machine-learning

We are running a multinomial naive Bayes classification algorithm on a set of 15k tweets. We first break up each tweet into a vector of word features using Weka's StringToWordVector filter. We then save the results to a new ARFF file to use as our training set. We repeat this process with another set of 5k tweets and evaluate that test set using the model derived from our training set.
What we would like to do is output each sentence that Weka classified in the test set along with its classification. We can see the general performance figures for the algorithm (precision, recall, F-score), but we cannot see the individual sentences that Weka classified with our classifier. Is there any way to do this?
Another problem is that ultimately our professor will give us 20k more tweets and expect us to classify this new set. We are not sure how to do this, because all of the data we have been working with has been classified manually, both the training and test sets, whereas the data we will be getting from the professor will be unclassified. How can we re-evaluate our model on the unclassified data if Weka requires that the attribute information be the same as in the set used to form the model and in the test set we are evaluating against?
Thanks for any help!

The easiest way to accomplish these tasks is to use a FilteredClassifier. This kind of classifier integrates a Filter and a Classifier, so you can connect a StringToWordVector filter with the classifier you prefer (J48, NaiveBayes, whatever). You always keep the original training set (unprocessed text), and the classifier is applied to new (unprocessed) tweets using the vocabulary derived by the StringToWordVector filter.
You can see how to do this in the command line in "Command Line Functions for Text Mining in WEKA" and via a program in "A Simple Text Classifier in Java with WEKA".
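To illustrate the idea outside Weka, here is a minimal Python sketch of a "filter plus classifier" unit. It is not Weka's API: the class name FilteredTextClassifier and the toy tweets are made up for illustration, and the classifier is a bare-bones multinomial naive Bayes with Laplace smoothing. The point is that the vocabulary is learned from the training text only, new text needs no label, and each input sentence can be printed next to its predicted class.

```python
import math
from collections import Counter, defaultdict


class FilteredTextClassifier:
    """Word-count filter and multinomial naive Bayes trained as one unit (illustrative)."""

    def fit(self, texts, labels):
        # The vocabulary is derived from the training texts only.
        self.vocab = {w for t in texts for w in t.lower().split()}
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        return self

    def predict(self, text):
        # New text is mapped onto the training vocabulary; unseen words are dropped.
        words = [w for w in text.lower().split() if w in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # Laplace smoothing
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[w] + 1) / denom) for w in words)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, hand-labeled training data (hypothetical).
train_texts = ["great phone love it", "awful service hate it", "love this great day"]
train_labels = ["pos", "neg", "pos"]
model = FilteredTextClassifier().fit(train_texts, train_labels)

# Unlabeled tweets: only the raw text is required.
new_tweets = ["love the great weather", "hate this awful traffic"]
predictions = [(tweet, model.predict(tweet)) for tweet in new_tweets]
for tweet, label in predictions:
    print(f"{label}\t{tweet}")
```

Because the filter lives inside the model, the same object handles both the labeled test set and the later unlabeled batch.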

Related

How to perform classification on training and test dataset in Weka

I am using Weka to build a classification model, and I am confused about how to use the training and test partitions. I saved 60% of the whole dataset to my hard disk as the training set and the remaining 40% to another file as the test set. The data is imbalanced, so I applied SMOTE to the training set. After that, in the Classify tab of Weka, I selected the Use training set option under Test options and ran a Random Forest classifier on the training set. After getting the result, I chose the Supplied test set option under Test options, loaded my test set from disk, and ran the classifier again.
I tried to find a tutorial on how to load a training set and a test set in Weka but could not find one, so the process above is based on my own understanding.
Therefore, I would like to know: is that the right way to perform classification on training and test datasets?
Thank you.
There is no need to evaluate your classifier on the training set (this will be overly optimistic, since the classifier has already seen this data). Just use the Supplied test set option, then your classifier will get trained automatically on the currently loaded dataset before being evaluated on the specified test set.
Instead of manually splitting your data, you could also use the Percentage split test option, with 60% to be used for your training data.
When using filters, you should always wrap them (in this case SMOTE) and your classifier (in this case RandomForest) in the FilteredClassifier meta-classifier. That way, you will ensure that the training and test set data will get transformed correctly. This will also avoid the problem of leaking information into the test set when transforming the full dataset with a supervised filter and splitting the dataset into train/test afterwards. Finally, it also documents nicely what preprocessing is being done to your input data, all in a single command-line string.
If you need to apply more than one filter, use the MultiFilter to apply them sequentially.
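The leakage point can be shown with a small Python sketch. This is not Weka code and not SMOTE: oversample_minority is a hypothetical helper that simply duplicates random minority-class rows, and the data is a toy set. What it demonstrates is the order of operations that a wrapper such as FilteredClassifier enforces: split first, then transform the training part only, and leave the test part untouched.

```python
import random


def oversample_minority(rows, labels, seed=0):
    """Duplicate random minority-class rows until all classes are the same size.

    Plain random oversampling, used here as a stand-in for SMOTE.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy imbalanced dataset: every fifth row belongs to the "rare" class.
data = [([i], "rare" if i % 5 == 0 else "common") for i in range(20)]

# 1. Split first (60% train, 40% test).
split = int(len(data) * 0.6)
train, test = data[:split], data[split:]

# 2. Resample the training part only; the test part stays as it was.
train_rows, train_labels = oversample_minority(
    [row for row, _ in train], [label for _, label in train]
)
```

If the resampling were applied before the split, copies of the same minority rows could end up on both sides, which would inflate the test score.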

Do I still need to load word2vec model at model testing?

This may sound like a naive question, but I am quite new to this. Let's say I use the Google pre-trained word2vec model (https://github.com/dav/word2vec) to train a classification model. I save my classification model. Now I load the classification model back into memory to test new instances. Do I need to load the Google word2vec model again? Or is it only used for training my model?
It depends on how your corpuses and test examples are structured and pre-processed.
You are probably using the pre-trained word vectors to turn text into numerical features. At first, text examples are vectorized to train the classifier. Later, other (test/production) text examples will be vectorized in the same way and presented to the classifier to get its judgements.
So you will need to use the same text-to-vectors process for test/production text examples as was used during training. Perhaps you've done that in a separate earlier bulk step, in which case you already have the features in the vector form the classifier uses. But often your classifier pipeline will itself take raw text, and vectorize it – in which case it will need the same pre-trained (word)->(vector) mappings available at test time as were available during training.
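A minimal Python sketch of why the mappings are still needed at test time. Everything here is a stand-in: word_vectors is a tiny hypothetical table in place of a real pre-trained model, averaging word vectors is just one common way to vectorize a text, and classify is a placeholder for a saved classifier.

```python
# Hypothetical embedding table standing in for a pre-trained word2vec model.
word_vectors = {
    "good": [1.0, 0.0],
    "bad": [-1.0, 0.0],
    "movie": [0.0, 1.0],
}


def vectorize(text, vectors, dim=2):
    """Average the vectors of the known words; unknown words are skipped."""
    found = [vectors[w] for w in text.lower().split() if w in vectors]
    if not found:
        return [0.0] * dim
    return [sum(column) / len(found) for column in zip(*found)]


def classify(features):
    """Placeholder for a saved classifier: decides on the sign of the first feature."""
    return "positive" if features[0] > 0 else "negative"


# At test time the raw text must pass through the same lookup before
# the classifier can see it, so the table has to be loaded again.
features = vectorize("good movie", word_vectors)
label = classify(features)
```

If the new examples were already converted to vectors in an earlier bulk step, only classify would be needed and the table could stay on disk.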

Extract WHY label was chosen on classification?

I currently have a system set up where I train from old posts/categories and try to predict what category a new post will be. I am using a pipeline with TfidfVectorizer and LinearSVC to train the dataset and storing that in a pickle, then I process new posts by loading that pickle and using predict from the loaded pickle to classify the new posts. Currently, I am struggling with a few labels and I don't know why.
I am looking to provide some output on what words were triggered in the new post for each classification label so that I can see why a certain label was chosen when classifying new data against a training set, but I cannot find a way to do this.
I know that I can output the top features in my vectorizer when I am training, but how can I output essentially the reason why a certain label was chosen over another one?
During the training phase, the SVM learns a weight for each word of the corpus vocabulary for each of the classes.
Then, during inference, you calculate the dot product between each class's weights and the vector representation of the instance to be classified. The algorithm returns the class that yields the highest dot product score. Hence, you can get an estimate of why a label was chosen by examining those weights (the coef_ attribute) for the words present in your instance.
I agree however that other methods like trees are more interpretable.
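As a rough sketch of that inspection in plain Python: the weights and tf-idf values below are invented for illustration and stand in for the rows of coef_ and for one vectorized post, and the intercept term is left out for simplicity. Multiplying each feature value by the class weight gives a per-word contribution, and summing the contributions gives the class score.

```python
# Hypothetical learned weights: one weight per (class, word), as in coef_.
class_weights = {
    "sports": {"match": 1.2, "goal": 0.9, "election": -0.7},
    "politics": {"match": -0.3, "goal": -0.5, "election": 1.5},
}

# Hypothetical tf-idf values for one new post.
post_features = {"match": 0.5, "election": 0.2}


def explain(features, weights):
    """Return, per class, the total score and the per-word contributions (largest first)."""
    report = {}
    for label, word_weights in weights.items():
        contributions = {
            word: value * word_weights.get(word, 0.0)
            for word, value in features.items()
        }
        ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
        report[label] = (sum(contributions.values()), ranked)
    return report


report = explain(post_features, class_weights)
predicted = max(report, key=lambda label: report[label][0])

for label, (score, ranked) in report.items():
    print(label, round(score, 3), ranked)
```

Comparing the ranked lists of the winning class and the runner-up shows which words pushed the decision one way or the other.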

Classification accuracy on Weka

I am using Weka GUI for a classification. I am new to Weka and getting confused with the options
Use training Set
Supplied test set
Cross validation
to train my classification algorithm (for example J48). I trained with 10-fold cross-validation and the accuracy is pretty good (97%). When I test my classifier, the accuracy drops to about 72%. I am so confused. Any tips please? This is how I did it:
I train my model on the training data (for example: train.arff)
I right-click in the Results list on the item for the model I want to save
I select Save model and save it, for example as j48tree.model
and then
I load the test data (for example: test.arff) via the Supplied test set button
I right-click in the Results list, select Load model and choose j48tree.model
I select Re-evaluate model on current test set
Is the way I do it wrong? Why is the accuracy dropping so badly, from 97% to 72%? Or is 10-fold cross-validation alone enough to train and test the classifier?
Note: my training and testing datasets have the same attributes and labels. The only difference is, I have more data on the testing set which I don't think will be a problem.
I don't think there is any issue with how you use WEKA.
You mentioned that your test set is larger than your training set? What is the split? The usual rule of thumb is that the test set should be about 1/4 of the whole dataset, i.e. 3 times smaller than the training set, and definitely not larger. This alone could explain the drop from 97% to 72%, which by the way is not so bad for a real-life case.
It would also help to build the learning curve (https://weka.wikispaces.com/Learning+curves), as it will show whether you have a bias or a variance issue. Judging by your values, it sounds like you have high variance (i.e. too many parameters for your dataset), so adding more examples or changing your split between training and test set will likely help.
Update
I ran a quick analysis of the dataset in question with a random forest and my performance was similar to that posted by the author. Details and code are available at http://omdv.github.io/2016/03/10/WEKA-stackoverflow

spam classification - machine learning

I have to build a spam detection application using a few classifiers (e.g. Naive Bayes, SVM and one more) and compare their efficiency, but unfortunately I don't know exactly what I should do.
Is this correct:
First, I need a spam corpus such as trec2005, SpamAssassin or enron-spam.
Then I do text pre-processing such as stemming, stop-word removal, tokenization, etc.
After that I can weight my features/terms in the spam emails using tf-idf.
Next I remove the features with very low and very high frequencies.
And then I can classify my emails. Right?
After that I can measure my correct classifications by true positives, false positives, etc.
Is 10-fold cross-validation required for something?
How should I use it?
Could you tell me if these steps for email classification are OK?
If not, please explain what the correct steps for spam classification are.
Here are roughly the steps you need to build a spam classifier:
1- Input: a labeled training set that contains enough samples of spam and legitimate e-mails
2- Feature Extraction: convert your e-mail text into useful features for training, e.g. stemming, removing stop words, word frequencies. Then evaluate these features (i.e. apply an attribute selection method) to select the most significant ones.
3- If you have a large enough dataset, split it into training, validation and testing sets. If not, you can use your entire dataset for training and do cross-validation to evaluate the classifier performance
4- Train your classifier and either use the testing dataset to evaluate its performance or do cross-validation
5- Use the trained model to classify new e-mails. Done.
The purpose of cross-validation is to estimate your model's performance on new/unseen data. So if you have an independent testing dataset you might not need cross-validation at all, because you can evaluate the model performance on the testing dataset. However, when your dataset is small you can divide it into subsets (e.g. 10 folds) and then repeat the training 10 times; each time you train on 90% of your data and test on the remaining 10%, using a different fold for testing.
You will end up with 10 estimates of the classifier error; average them to get the mean error (for example the mean squared or mean absolute error).
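The fold procedure described above can be sketched in plain Python as follows. This is only an illustration: k_fold_indices, cross_validate and train_majority are hypothetical names, the "classifier" just predicts the majority label of its training fold, and the data is not shuffled or stratified, which a real experiment would normally do first.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(samples, labels, train_fn, k=10):
    """Train k times, each time testing on the held-out fold; return mean and per-fold error."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = train_fn([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        wrong = sum(model(samples[i]) != labels[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / len(errors), errors


def train_majority(train_samples, train_labels):
    """Toy classifier: always predicts the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda sample: majority


# Toy data: 20 e-mails, 5 spam followed by 15 ham.
samples = list(range(20))
labels = ["spam"] * 5 + ["ham"] * 15
mean_error, fold_errors = cross_validate(samples, labels, train_majority, k=10)
print(mean_error, fold_errors)
```

Swapping train_majority for a function that trains a real classifier keeps the rest of the loop unchanged.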
