About training HMM by using EM - machine-learning

I am new to the EM algorithm and am studying Hidden Markov Models.
While training my HMM with EM, I got very confused about the data setup (this is for text processing).
Please confirm whether my use of EM is okay or not.
At first, I calculated the statistics for the emission probability matrix from my whole training set. Then I ran EM on the same set.
-> The emission probabilities for unseen data converged to zero.
While reading the textbook Speech and Language Processing, I found that exercise 8.3 describes a two-phase training method.
8.3 Extend the HMM tagger you built in Exercise 8.?? by adding the ability to make use of some unlabeled data in addition to your labeled training corpus. First acquire a large unlabeled corpus. Next, implement the forward-backward training algorithm. Now start with the HMM parameters you trained on the training corpus in Exercise 8.??; call this model M0. Run the forward-backward algorithm with these HMM parameters to label the unsupervised corpus. Now you have a new model M1. Test the performance of M1 on some held-out labeled data.
Following this, I selected some instances from my training set (1/3 of the training set) to get the initial statistics.
Then I ran the EM procedure on the whole training set to optimize the parameters.
Is that okay?

The procedure that the exercise is referring to is a type of semi-supervised learning known as self-training. The idea is that you use your entire labeled training set to build a model. Then you collect more data that is unlabeled; it is much easier to find new unlabeled data than new labeled data. After that, you label the new data using the model you originally trained. Finally, using the automatically generated labels, you train a new model.
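
A minimal sketch of that self-training loop, using scikit-learn's MultinomialNB as a stand-in for the HMM tagger (the tiny corpora are placeholders, and the single retraining pass is an assumption for illustration):

    # Self-training sketch: train M0 on labeled data, pseudo-label the
    # unlabeled corpus with M0, then train M1 on the union.
    import numpy as np
    from scipy.sparse import vstack
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    labeled_texts = ["the dog runs", "a cat sleeps"]    # placeholder corpus
    labels = np.array([0, 1])
    unlabeled_texts = ["the dog sleeps", "a cat runs"]  # no labels here

    vec = CountVectorizer()
    X_labeled = vec.fit_transform(labeled_texts)
    m0 = MultinomialNB().fit(X_labeled, labels)         # model M0

    # Label the unlabeled corpus with M0 (pseudo-labels).
    X_unlabeled = vec.transform(unlabeled_texts)
    pseudo_labels = m0.predict(X_unlabeled)

    # Train M1 on the labeled data plus the pseudo-labeled data;
    # M1 is then tested on held-out labeled data.
    X_all = vstack([X_labeled, X_unlabeled])
    y_all = np.concatenate([labels, pseudo_labels])
    m1 = MultinomialNB().fit(X_all, y_all)              # model M1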

Related

Feature selection with Decision Tree

I'm supposed to perform feature selection on my dataset (independent variables: some attributes of a patient; target variable: whether the patient is ill or not) using a decision tree. After that, with the selected features, I have to implement a different ML model.
My doubt is: when I'm fitting the decision tree, is it necessary to have a train and a test set, or can I just fit the model on the whole data?
It's necessary to split the dataset into a train set and a test set, because otherwise you would measure performance on the same data used for training and could end up over-fitting.
Over-fitting is where the training error constantly decreases but the generalization error increases, where generalization error means the ability of the model to correctly classify new (never-seen-before) samples.
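
A minimal sketch of that split, assuming scikit-learn and synthetic data in place of the patient table; comparing train and test accuracy makes over-fitting visible, and feature_importances_ gives the ranking needed for the later feature-selection step:

    # Fit a decision tree on a train split, check generalization on the
    # held-out test split, then read off the feature importances.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for the patient dataset (8 features, ill or not).
    X, y = make_classification(n_samples=500, n_features=8, random_state=0)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    print("train accuracy:", tree.score(X_train, y_train))  # typically ~1.0
    print("test accuracy:", tree.score(X_test, y_test))     # the honest number

    # Importance score per feature, usable to select features for the next model.
    print("feature importances:", tree.feature_importances_)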

Using training data and testing data in a shared task

I am working on this shared task http://alt.qcri.org/semeval2017/task4/index.php?id=data-and-tools
which is just a Twitter sentiment analysis task. Since I am pretty new to machine learning, I am not quite sure how to use both the training data and the testing data.
So the shared task provides two similar sets of tweets: one without the result (train) and one with the result.
My current understanding of using these kinds of data in machine learning is as follows:
training set: we are supposed to split this into training and testing portions (90% training and 10% testing, maybe?)
But the existence of a separate test set kind of confuses me.
Are we supposed to take the result we get on the 10% portion of the 'training set' and compare it to the actual result in the 'testing set'?
Can someone correct my understanding?
When training a machine learning model, you feed your algorithm the dataset called the training set. At this stage you tell the algorithm the ground truth for each sample you put in, so the algorithm learns from each sample you feed it. The training set is usually 80% of the whole dataset; the other 20% is the testing set. For the testing set you also know the ground truth of each sample, but you let your algorithm predict what it thinks the truth is for each sample. All those predictions over the testing set are based on what the algorithm learned from the training set you fed it before.
After you have made all the predictions over your testing set, you can check how accurate your model is by comparing the ground truth to the predictions the model made.
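
A short sketch of that workflow, assuming scikit-learn, placeholder tweets, and the 80/20 split described above:

    # Hold out 20% of the labeled data to estimate accuracy, then predict
    # on the shared task's separate (unlabeled) test tweets.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    tweets = ["good movie", "bad service", "love it", "hate this"]  # placeholders
    labels = ["pos", "neg", "pos", "neg"]

    tr_x, te_x, tr_y, te_y = train_test_split(
        tweets, labels, test_size=0.2, random_state=0)

    vec = TfidfVectorizer()
    clf = LogisticRegression().fit(vec.fit_transform(tr_x), tr_y)

    pred = clf.predict(vec.transform(te_x))
    print("estimated accuracy:", accuracy_score(te_y, pred))

    # The official test tweets are predicted with the very same model.
    official_test = ["what a great day"]
    print(clf.predict(vec.transform(official_test)))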

Text Classification Technique for this scenario

I am completely new to Machine Learning algorithms and I have a quick question with respect to Classification of a dataset.
Currently there is training data that consists of two columns, Message and Identifier.
Message - a typical message extracted from a log, containing a timestamp and some text.
Identifier - the category, assigned based on the message content.
The training data was prepared by extracting a particular category from the tool and labelling it accordingly.
Now the test data contains just the message and I am trying to obtain the Category accordingly.
Which approach is most helpful in this scenario? Supervised or unsupervised learning?
I have a labelled training dataset and I am trying to predict the Category for the test data.
Thanks in advance,
Adam
If your labels are exact, then you can classify using an ANN, SVM, etc. But if your labels are not exact, you have to cluster the data with respect to the features you have. K-means or nearest neighbour can be a starting point for clustering.
It is supervised learning, and a classification problem.
However, obviously you do not have the label column (the to-be-predicted value) for your test set. Thus, you cannot calculate error measures (such as false positive rate, accuracy, etc.) for that test set.
You could, however, split the set of labeled training data that you do have into a smaller training set and a validation set. Split it 70%/30%, perhaps. Then build a prediction model from your smaller 70% training dataset and tune it on your 30% validation set. When accuracy is good enough, apply it to your test set to predict the missing values.
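
A sketch of that split-tune-predict workflow, assuming scikit-learn; the log messages, categories, and tuned model are placeholders:

    # 70/30 split of the labeled data, a simple tuning loop on the
    # validation part, then prediction of the unlabeled test messages.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    messages = ["disk full", "login failed", "disk error", "bad password"]
    categories = ["storage", "auth", "storage", "auth"]  # placeholder data

    m_train, m_val, c_train, c_val = train_test_split(
        messages, categories, test_size=0.3, random_state=0, stratify=categories)

    vec = TfidfVectorizer()
    X_train, X_val = vec.fit_transform(m_train), vec.transform(m_val)

    # Tune one hyperparameter (regularization strength C) on the validation set.
    best_score, best_model = -1.0, None
    for C in (0.1, 1.0, 10.0):
        model = LogisticRegression(C=C).fit(X_train, c_train)
        score = model.score(X_val, c_val)
        if score > best_score:
            best_score, best_model = score, model

    # When accuracy is good enough, predict the unlabeled test messages.
    print(best_model.predict(vec.transform(["disk is failing"])))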
Which techniques / algorithms to use is a different question; you do not give enough information to answer that. And even if you did, you would still need to tune the model yourself.
You have labels to predict, and training data.
So by definition it is a supervised problem.
Try any classifier for text, such as NB, kNN, SVM, ANN, RF, ...
It's hard to predict which will work best on your data. You will have to try and evaluate several.
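
A hedged sketch of trying several of the classifiers named above on one split, with scikit-learn stand-ins and placeholder log messages:

    # Try a few text classifiers and evaluate each on the same split.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import LinearSVC

    messages = ["disk full", "login failed", "disk error", "bad password",
                "disk quota hit", "wrong password"]      # placeholder data
    categories = ["storage", "auth", "storage", "auth", "storage", "auth"]

    m_train, m_val, c_train, c_val = train_test_split(
        messages, categories, test_size=0.33, random_state=0, stratify=categories)

    vec = TfidfVectorizer()
    X_train, X_val = vec.fit_transform(m_train), vec.transform(m_val)

    for name, clf in [("NB", MultinomialNB()),
                      ("kNN", KNeighborsClassifier(n_neighbors=1)),
                      ("SVM", LinearSVC()),
                      ("RF", RandomForestClassifier(random_state=0))]:
        clf.fit(X_train, c_train)
        print(name, clf.score(X_val, c_val))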

labelling of dataset in machine learning

I have a question about some basic concepts of machine learning. The examples I observed gave a brief overview. For training the system, a feature vector is given as input. In the case of supervised learning, the dataset is labelled. My confusion is about the labelling. For example, if I have to distinguish between two types of pictures, I will provide a feature vector and, on the output side for testing, I'll provide 1 for type A and 2 for type B. But what if I want to extract a region of interest (ROI) from a dataset of images? How will I label my data to extract the ROI using an SVM? I hope I am able to convey my confusion. Thanks in anticipation.
In supervised learning, such as SVMs, the dataset should be composed as follows:
<i-th feature vector><i-th label>
where i goes from 1 to the number of patterns (also called examples or observations) in your training set; each such pair represents a single record in your training set, which can be used to train the SVM classifier.
So you basically have a set composed of such tuples, and if you have just 2 labels (a binary classification problem) you can easily use an SVM. The SVM model will be trained on the training set and the training labels, and once the training phase has finished you can use another set (called the validation set or test set), structured in the same way as the training set, to test the accuracy of your SVM.
In other words the SVM workflow should be structured as follows:
train the SVM using the training set and the training labels
predict the labels for the validation set using the model trained in the previous step
if you know what the actual validation labels are, you can match the predicted labels with the actual labels and check how many were correctly predicted. The ratio between the number of correctly predicted labels and the total number of labels in the validation set is a scalar in [0,1] and is called the accuracy of your SVM model.
if you're interested in the ROI, you might want to check the trained SVM parameters (mainly the weights and bias) to reconstruct the separation hyperplane
It is also important to know that the training set records should be correctly labelled a priori: if the training labels are not correct, the SVM will never be able to correctly predict the output for previously unseen patterns. You do not have to label your data according to the ROI you want to extract; the data must be correctly labelled a priori. The SVM will take the entire set of type A pictures and the set of type B pictures and will learn the decision boundary that separates them. You should not trick the labels: if you do, you're not doing classification and/or machine learning and/or pattern recognition. You're basically tricking the results.
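
A minimal sketch of that workflow, assuming scikit-learn's LinearSVC and made-up feature vectors:

    # Train an SVM on (feature vector, label) records, predict the
    # validation set, compute the accuracy, and inspect the hyperplane.
    import numpy as np
    from sklearn.metrics import accuracy_score
    from sklearn.svm import LinearSVC

    # Placeholder training records: <i-th feature vector><i-th label>.
    X_train = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.0]])
    y_train = np.array([1, 1, 2, 2])          # 1 = type A, 2 = type B

    svm = LinearSVC().fit(X_train, y_train)

    # Validation set structured the same way as the training set.
    X_val = np.array([[0.1, 0.0], [0.9, 1.1]])
    y_val = np.array([1, 2])

    pred = svm.predict(X_val)
    print("accuracy:", accuracy_score(y_val, pred))   # scalar in [0, 1]

    # Separation hyperplane parameters: the weights and the bias.
    print("weights:", svm.coef_, "bias:", svm.intercept_)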

How to output resultant documents from Weka text-classification

So we are running a multinomial naive Bayes classification algorithm on a set of 15k tweets. We first break up each tweet into a vector of word features based on Weka's StringToWordVector function. We then save the results to a new arff file to use as our training set. We repeat this process with another set of 5k tweets and re-evaluate the test set using the same model derived from our training set.
What we would like to do is output each sentence that Weka classified in the test set along with its classification. We can see the general information (precision, recall, F-score) about the performance and accuracy of the algorithm, but we cannot see the individual sentences that were classified by Weka based on our classifier. Is there any way to do this?
Another problem is that ultimately our professor will give us 20k more tweets and expect us to classify this new document. We are not sure how to do this, however, as all of the data we have been working with has been classified manually, both the training and test sets. The data we will be getting from the professor will be UNclassified. How can we re-evaluate our model on the unclassified data if Weka requires that the attribute information be the same for the set used to form the model and the test set we are evaluating against?
Thanks for any help!
The easiest way to accomplish these tasks is using a FilteredClassifier. This kind of classifier integrates a Filter and a Classifier, so you can connect a StringToWordVector filter with the classifier you prefer (J48, NaiveBayes, whatever). You always keep the original training set (unprocessed text), and you apply the classifier to new tweets (unprocessed) by using the vocabulary derived by the StringToWordVector filter.
You can see how to do this in the command line in "Command Line Functions for Text Mining in WEKA" and via a program in "A Simple Text Classifier in Java with WEKA".
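
For readers outside Weka, the same pattern (keep the raw text, learn the vocabulary once, reuse it for new unlabeled tweets) can be sketched with scikit-learn, where a Pipeline plays a role analogous to FilteredClassifier; the tweets below are placeholders:

    # Pipeline = vectorizer + classifier, analogous to Weka's
    # FilteredClassifier wrapping a StringToWordVector filter.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_tweets = ["great game today", "terrible referee",
                    "what a win", "awful performance"]   # placeholder data
    train_labels = ["pos", "neg", "pos", "neg"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_tweets, train_labels)

    # Print each test sentence next to its predicted class.
    test_tweets = ["fantastic goal", "the referee was awful"]
    for tweet, label in zip(test_tweets, model.predict(test_tweets)):
        print(label, "\t", tweet)

    # New unlabeled tweets go through the same predict() call; the
    # vocabulary learned during training is reused automatically.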
