Classifying instances of a set with a Classification Model on WEKA GUI - machine-learning

I am new to data mining and I would like to ask you a classification question.
I have trained a classification algorithm in WEKA (GUI), using a training set in ARFF format. I then saved it as a model file for future use.
Now I want to use this classification model in WEKA (GUI) to get the predicted class of the instances of a set that is also in ARFF format. Could you please give me instructions on how to do this in WEKA? Unlike the Weka Java API, the GUI version has really poor documentation on the Web and I couldn't find anything relevant.
Is it possible to store the classified set back in ARFF format, with the '?'s in the class attribute replaced by the predicted class label? I need such output files for some computations.
Thank you in advance.

Related

Do I still need to load word2vec model at model testing?

This may sound like a naive question, but I am quite new to this. Let's say I use the Google pre-trained word2vec model (https://github.com/dav/word2vec) to train a classification model. I save my classification model. Now I load the classification model back into memory for testing new instances. Do I need to load the Google word2vec model again? Or is it only used for training my model?
It depends on how your corpora and test examples are structured and pre-processed.
You are probably using the pre-trained word vectors to turn text into numerical features. At first, text examples are vectorized to train the classifier. Later, other (test/production) text examples will be vectorized in the same way and presented to the classifier to get its judgements.
So you will need to use the same text-to-vectors process for test/production text examples as was used during training. Perhaps you've done that in a separate earlier bulk step, in which case you already have the features in the vector form the classifier uses. But often your classifier pipeline will itself take raw text, and vectorize it – in which case it will need the same pre-trained (word)->(vector) mappings available at test time as were available during training.
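As a rough illustration of why the same mappings are needed at both stages, here is a minimal sketch in plain Python. The tiny WORD_VECTORS table is a made-up stand-in for the real pre-trained vectors, and averaging word vectors is only one assumed way of turning text into features; the actual pipeline may differ.

```python
# Toy stand-in for pre-trained word vectors (hypothetical values, 3 dimensions).
WORD_VECTORS = {
    "good": [1.0, 0.0, 0.0],
    "bad": [0.0, 1.0, 0.0],
    "movie": [0.0, 0.0, 1.0],
}

def vectorize(text, word_vectors=WORD_VECTORS):
    """Average the vectors of the known words; unknown words are skipped."""
    dims = len(next(iter(word_vectors.values())))
    known = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not known:
        return [0.0] * dims
    return [sum(column) / len(known) for column in zip(*known)]

# Training time: text is vectorized before it reaches the classifier.
train_features = [vectorize(t) for t in ["good movie", "bad movie"]]

# Test time: the very same lookup table is needed again for new text.
test_features = vectorize("a good film")
```

If only pre-computed feature vectors are ever fed to the classifier, the lookup table is not needed at test time; if raw text comes in, it is.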

Extract Classification Function In Supervised Learning

Possibly I am asking a trivial question but the answer is so crucial to me.
I'm really new to machine learning. I have read about supervised learning and I know the basics of these kinds of algorithms. The question is: when I use an algorithm such as J48 on a dataset, how can I find the resulting function so that I can use it later to classify unlabeled data?
Thank you in advance.
The "function" you are referring to is the classifier itself. It is learned during the training procedure. Consequently, in order to use your model to classify new data later, you have to dump it to disk or a database. How? It depends completely on the language and implementation used. In Python you would simply pickle the object. In Java you can serialize your trained object, or use Weka to learn a J48 decision tree and save it for later use:
https://weka.wikispaces.com/Saving+and+loading+models
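For the Python route, a minimal sketch of the save-then-load cycle with pickle might look like this. The "model" here is a hypothetical one-split decision stump stored as a plain dict, standing in for whatever trained object your library produces; the feature name, labels and file name are all made up for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical "trained model": a one-split decision stump stored as plain data.
model = {"feature": "petal_length", "threshold": 2.5,
         "below": "setosa", "above": "other"}

def predict(model, instance):
    """Apply the stump to one instance (a dict of feature values)."""
    value = instance[model["feature"]]
    return model["below"] if value < model["threshold"] else model["above"]

# Save the model to disk after training ...
path = os.path.join(tempfile.mkdtemp(), "stump.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ... and load it back later to classify unlabeled data.
with open(path, "rb") as f:
    loaded = pickle.load(f)

label = predict(loaded, {"petal_length": 1.4})
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.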

How to load & use a WEKA model in Ruby?

I'm writing a Rails app that helps the user predict sentiment in text.
After using the WEKA GUI, I have a model file as output.
I would like to know how to load the WEKA model file in Ruby and run a prediction on a specific input (a string).
Thanks for the help.

How to output resultant documents from Weka text-classification

So we are running a multinomial Naive Bayes classification algorithm on a set of 15k tweets. We first break up each tweet into a vector of word features using Weka's StringToWordVector filter. We then save the results to a new ARFF file to use as our training set. We repeat this process with another set of 5k tweets and evaluate it as a test set using the model derived from our training set.
What we would like to do is output each sentence that Weka classified in the test set along with its classification. We can see the general information (precision, recall, F-score) on the performance and accuracy of the algorithm, but we cannot see the individual sentences that were classified by Weka based on our classifier. Is there any way to do this?
Another problem is that our professor will ultimately give us 20k more tweets and expect us to classify them. We are not sure how to do this, because all of the data we have been working with, both the training and test sets, was classified manually, whereas the data from the professor will be unclassified. How can we run our model on the unclassified data if Weka requires the attribute information to be the same as in the set used to build the model and the test set we are evaluating against?
Thanks for any help!
The easiest way to accomplish these tasks is to use a FilteredClassifier. This kind of classifier combines a filter and a classifier, so you can connect a StringToWordVector filter with the classifier you prefer (J48, NaiveBayes, whatever). That way you always keep the original training set as unprocessed text, and you apply the classifier to new, unprocessed tweets using the vocabulary derived by the StringToWordVector filter.
You can see how to do this in the command line in "Command Line Functions for Text Mining in WEKA" and via a program in "A Simple Text Classifier in Java with WEKA".
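To make the idea concrete, here is a conceptual sketch in plain Python of what "using the vocabulary derived from the training set" means. It is not Weka code and does not reproduce StringToWordVector's actual options or tokenization; the function names and example tweets are invented for illustration.

```python
def build_vocabulary(training_texts):
    """Collect the distinct lower-cased words of the training texts, sorted."""
    words = {w for text in training_texts for w in text.lower().split()}
    return sorted(words)

def to_word_vector(text, vocabulary):
    """Count occurrences of each vocabulary word; unseen words are ignored."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

# The vocabulary is fixed once, from the (labeled) training tweets.
training_tweets = ["great phone", "terrible battery", "great battery life"]
vocabulary = build_vocabulary(training_tweets)

# A new, unlabeled tweet is mapped onto the same attributes.
new_tweet = "great great camera"
features = to_word_vector(new_tweet, vocabulary)
```

Because the new tweet is projected onto the training vocabulary, its feature vector always has the same attributes as the training data, which is the compatibility the classifier needs.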

NLP text tagging

I am a newbie in NLP, just doing it for the first time.
I am trying to solve a problem.
My problem is I have some documents which are manually tagged like:
doc1 - categoryA, categoryB
doc2 - categoryA, categoryC
doc3 - categoryE, categoryF, categoryG
...
docN - categoryX
Here I have a fixed set of categories and any document can have any number of tags associated with it.
I want to train the classifier using this input, so that this tagging process can be automated.
Thanks
What you are trying to do is called multi-way supervised text categorization (or classification). Knowing the right question to ask is half the problem.
As for how this can be done, here are two references:
RCV1: A New Benchmark Collection for Text Categorization Research
Improved Nearest Neighbor Methods For Text Classification With Language Modeling and Harmonic Functions
Most classifiers work on a bag-of-words model. There are several things you can try to get the expected result.
Try the general-purpose multinomial Naive Bayes classifier, vary its input parameters, and check the results.
Try the other Naive Bayes variants (http://scikit-learn.org/0.11/modules/naive_bayes.html)
You can also try a classifier that takes sentence structure into account. Using n-grams, you can try 2-, 3-, 4- and 5-gram models and check how the results vary. The count vectorizer supports n-grams; see this link for an example - http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
Depending on the features of your dataset, no single classifier is guaranteed to be best for your scenario; you have to try different approaches and see which fits best.
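To show what the n-gram features mentioned above look like, here is a small sketch in plain Python. It uses naive whitespace tokenization and hypothetical helper names, and it is not scikit-learn's own implementation.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(text, max_n):
    """Collect every 1-gram up to max_n-gram of a whitespace-tokenized text."""
    tokens = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features

bigrams = ngrams("the cat sat".split(), 2)
all_features = ngram_features("The cat sat", 2)
```

Raising max_n captures more word order but makes the feature space grow quickly, so it is worth comparing results for a few values.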
As a first approach, start with a simple classifier using scikit-learn.
Treat each category as a training class and train the classifier with these classes.
For any input docX, run the classifier with the trained model.
You will get a probability for each category.
Now apply a threshold, for example on the probability difference between the three highest-scoring categories; the categories that satisfy the threshold are taken as the result for that input document.
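The thresholding step can be sketched as follows in plain Python. The rule used here (keep up to three top categories whose probability is within max_gap of the best one) is only one possible reading of the description above, and the parameter names and values are assumptions to tune on your own data.

```python
def select_categories(probabilities, max_gap=0.15, top_k=3):
    """Keep the top_k categories whose probability is within max_gap of the best.

    probabilities maps category name to predicted probability.
    max_gap and top_k are illustrative defaults, not recommended values.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    best = ranked[0][1]
    return [category for category, p in ranked[:top_k] if best - p <= max_gap]

# Hypothetical classifier output for one document.
doc_probabilities = {"categoryA": 0.45, "categoryB": 0.40,
                     "categoryC": 0.10, "categoryE": 0.05}
tags = select_categories(doc_probabilities)
```

A document with one dominant category ends up with a single tag, while a document with several close scores gets several tags.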
It's not clear what you have tried or what programming language you are using, but as most have suggested, try text classification with document vectors or bag of words (as long as there are words in the documents that can help with classification).
Here are some simple tools that can help get you started
Weka http://www.cs.waikato.ac.nz/ml/weka/ (GUI & Java)
NLTK http://www.nltk.org (Python)
Mallet http://mallet.cs.umass.edu/ (command line & Java)
NUML http://numl.net/ (C#)

Resources