How to perform annotation in sentiment analysis? - machine-learning

I collected some reviews of Books, DVDs, Mobiles, and Cameras from www.amazon.com. I converted reviews with 1 star to negative and reviews with 5 stars to positive. The ratio of negative to positive reviews is 1:5. The collected reviews were converted into a Document-Term Matrix, and a few features were selected using the chi-square feature subset selection method and some of our proposed feature selection methods. We employed classification algorithms such as MLP, SVM, and DT to classify the samples, and reported the results under a 10-fold cross-validation framework.
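For reference, a hypothetical sketch of this kind of pipeline (assuming scikit-learn; the reviews and labels below are toy stand-ins for the Amazon data):

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    reviews = ["great book, loved it", "terrible camera, broke fast"] * 50
    labels = [1, 0] * 50        # 1 = positive (5 stars), 0 = negative (1 star)

    pipe = make_pipeline(
        CountVectorizer(),           # document-term matrix
        SelectKBest(chi2, k=5),      # chi-square feature subset selection
        LinearSVC(),                 # one of the classifiers mentioned (SVM)
    )
    scores = cross_val_score(pipe, reviews, labels, cv=10)
    print("10-fold CV accuracy:", scores.mean())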
To compare our results with a baseline, the reviewers asked me to perform a human evaluation. How should the annotation be done here? Should we employ annotators on randomly selected samples, or should we annotate all samples?
My professor is asking me to annotate all samples, divide the dataset into 10 folds, and then calculate the average accuracy of the annotators' responses over the 10 folds to compare with our results.
In the literature I found, annotation is performed on randomly selected samples. Any references suggested in this regard would be quite helpful to me.
Thanks in advance.
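If it helps, a sketch of the fold-wise comparison the professor describes (assuming scikit-learn; the gold and annotator labels are toy arrays standing in for the star-derived and human labels):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.metrics import accuracy_score

    gold = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0] * 10)       # star-derived labels (toy)
    annotator = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1] * 10)  # human labels (toy)

    fold_acc = []
    for _, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(gold):
        fold_acc.append(accuracy_score(gold[test_idx], annotator[test_idx]))

    print("mean annotator accuracy over 10 folds:", np.mean(fold_acc))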

Related

Interpretation of Classifier Result in Weka

I am running a classification algorithm in Weka, but I am unsure about some of the results that Weka generates for reporting purposes.
In a classification problem (either Yes = has the disease or No = does not have the disease), Weka produces a result for each class, but also provides a weighted result at the bottom for both classes.
[Image: Weka classifier output (attached)]
My question is: from a reporting perspective, which score should I report? (Basically, I want to compare my results with other people's results.)
As per the attached Weka result for F-measure, would it be 91 percent or 89 percent? The same applies to all the other measurements (recall and precision).
Also, I would like to know which score is reported in research papers for a given classifier: the weighted score, or the score for the class we are trying to predict (in my case, only the result for the 'Yes' class)?
Many thanks,
The use case defines what you report. In general, research papers report the entire confusion matrix and statistics tables. This allows readers to extract the data needed for the way they will use the research.
If a patient receives a "disease-free" result from this classifier, there is a chance of about 18% that the person actually does have the disease. Is this acceptable? That's not a question SO (Stack Overflow) can answer: that's the use case.
If you insist on describing the test with a single, scalar statistic, you need to clarify the use case and report that single metric accurately. In general, the weighted (summary) F-measure is what you report.
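For concreteness, here is a minimal sketch (assuming scikit-learn and toy labels) of the difference between the per-class and weighted F-measures that Weka reports:

    from sklearn.metrics import f1_score

    y_true = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "No"]
    y_pred = ["Yes", "No", "No", "No", "No", "Yes", "Yes", "No"]

    # F-measure computed separately for each class
    per_class = f1_score(y_true, y_pred, average=None, labels=["Yes", "No"])
    # Per-class F-measures averaged, weighted by each class's support
    weighted = f1_score(y_true, y_pred, average="weighted")

    print("F-measure (Yes):", per_class[0])
    print("F-measure (No):", per_class[1])
    print("weighted F-measure:", weighted)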

Sentiment Analysis using classification and clustering algorithms: Which is better?

I am trying to do a Sentiment Analysis on Song Lyrics using Python.
After studying many simple classification problems with known labels (such as email classification, spam/not spam), I thought that lyrics sentiment analysis falls in the classification field.
While actually coding it, I discovered that I had to compute the sentiment for each song's lyrics, probably adding a column to the original dataset marking it as positive or negative, or using the actual sentiment score.
Couldn't this be done using a clustering approach? Since we don't know each song's class in the first place (positive sentiment / negative sentiment), the algorithm would cluster the data by sentiment.
Clustering usually won't produce sentiments.
It is more likely to produce, e.g., a cluster for rap and one for non-rap, or one for lyrics with an even length and one for lyrics with an odd length.
There is more in the data than sentiment. So why would clustering produce sentiment clusters?
If you want particular labels (positive sentiment, negative sentiment) then you need to provide training data and use a supervised approach.
You are thinking of clustering without supervision, i.e., unsupervised clustering, which might give low accuracy because you don't actually know the threshold score that separates the positive and negative classes. So first try to find that threshold, which will be the parameter separating your classes; use supervised learning to find the threshold.
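As an illustration of the supervised approach suggested above, here is a minimal baseline sketch (assuming scikit-learn; the lyrics snippets and labels are hypothetical hand-labelled examples):

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Tiny hand-labelled training set (stand-in for real annotated lyrics)
    lyrics = [
        "i feel so alive tonight",
        "tears fall and nothing heals",
        "dancing in the summer sun",
        "my heart is broken again",
    ]
    labels = ["positive", "negative", "positive", "negative"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(lyrics, labels)

    print(model.predict(["the sun is shining and i smile"]))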

Machine Learning - Huge Only positive text dataset

I have a dataset with thousands of sentences belonging to a subject. I would like to know the best way to create a classifier that predicts a text as "True" or "False" depending on whether it talks about that subject or not.
I've been using solutions with Weka (basic classifiers) and Tensorflow (neural network approaches).
I use the StringToWordVector filter to preprocess the data.
Since there are no negative samples, I am dealing with a single class. I've tried a one-class classifier (libSVM in Weka), but the number of false positives is so high that I cannot use it.
I also tried adding negative samples, but when the text to predict does not fall in the negative space, the classifiers I've tried (NB, CNN, ...) tend to predict it as a false positive. I guess it's because of the sheer number of positive samples.
I'm open to discarding ML as the tool to predict the new incoming data if necessary.
Thanks for any help
I eventually added data for the negative class and built a multinomial Naive Bayes classifier, which is doing the job as expected.
(The size of the added data is around one million samples :) )
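For reference, a minimal sketch of this kind of two-class text setup (assuming scikit-learn; the sentences and subject here are toy stand-ins):

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    on_topic = ["the engine uses a turbocharger", "fuel injection timing matters"]
    off_topic = ["the cake needs more sugar", "the match ended in a draw"]

    texts = on_topic + off_topic
    labels = [1] * len(on_topic) + [0] * len(off_topic)   # 1 = about the subject

    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(texts, labels)

    # New incoming sentence; its tokens overlap with the on-topic examples
    print(clf.predict(["adjust the injection timing"]))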
My answer is based on the assumption that adding at least 100 negative samples to the author's dataset of 1000 positive samples is acceptable to the author of the question, since I have not yet received an answer to my question about this.
Since this case of detecting a specific topic looks like a particular case of topic classification, I would recommend starting with a classification approach using two simple classes: one class for your topic and another for all other topics.
I succeeded with the same approach in a face recognition task. At the beginning I built a model with one output neuron, with a high output level when a face was detected and a low level when it was not.
Nevertheless, this approach gave me too low an accuracy – less than 80%.
But when I tried using 2 output neurons – one class for the presence of a face in the image and another for no face detected – it gave me more than 90% accuracy for an MLP, even without using a CNN.
The key point here is using a SoftMax function for the output layer. It gives a significant increase in accuracy. From my experience, it increased accuracy on the MNIST dataset from 92% up to 97% for the same MLP model.
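A minimal sketch of such a two-output MLP with a SoftMax output layer (assuming TensorFlow/Keras and bag-of-words features; the feature dimension is hypothetical):

    from tensorflow import keras

    n_features = 1000  # hypothetical vocabulary size

    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),
        # Two output neurons with softmax instead of a single sigmoid output
        keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(X_train, y_train, epochs=10, validation_split=0.2)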
About the dataset: the majority of classification algorithms with a trainer, at least in my experience, are more efficient with an equal quantity of samples for each class in the training set. In fact, if one class has less than 10% of the average quantity of the other classes, the model becomes almost useless for detecting that class. So if you have 1000 samples for your topic, I suggest creating 1000 negative samples covering as many different topics as possible.
Alternatively, if you don't want to create such a big set of negative samples, you can create a smaller negative set and use batch training with a batch size of 2x your negative sample quantity. To do so, split your positive samples into n chunks, each roughly the size of the negative set, and in each training iteration feed the network n batches, where batch i contains chunk[i] of the positive samples plus all of your negative samples. Just be aware that lower accuracy will be the price for this trade-off.
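A sketch of that batching idea (assumptions: NumPy arrays X_pos/X_neg with labels y_pos/y_neg, and a hypothetical Keras-style model with train_on_batch; the arrays below are toy):

    import numpy as np

    def balanced_batches(X_pos, y_pos, X_neg, y_neg):
        """Yield batches pairing one chunk of positives with all negatives."""
        n_chunks = int(np.ceil(len(X_pos) / len(X_neg)))
        for chunk in np.array_split(np.arange(len(X_pos)), n_chunks):
            X_batch = np.concatenate([X_pos[chunk], X_neg])
            y_batch = np.concatenate([y_pos[chunk], y_neg])
            order = np.random.permutation(len(X_batch))   # shuffle within the batch
            yield X_batch[order], y_batch[order]

    X_pos = np.random.rand(10, 3)
    y_pos = np.ones(10)
    X_neg = np.random.rand(4, 3)
    y_neg = np.zeros(4)

    for X_batch, y_batch in balanced_batches(X_pos, y_pos, X_neg, y_neg):
        print(X_batch.shape, y_batch.mean())
        # model.train_on_batch(X_batch, y_batch)   # hypothetical Keras-style call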
Also, you could consider creating a more generic topic detector: figure out all of the topics that can appear in the texts your model should analyze – for example, 10 topics – and create a training dataset with 1000 samples per topic. This can also give higher accuracy.
One more point about the dataset: the best practice is to train your model on only part of the dataset, for example 80%, and use the remaining 20% for validation. Validating on data the model has not seen before gives a good estimate of your model's accuracy in real life, not just on the training set, and helps you avoid overfitting.
About building the model: I like the "from simple to complex" approach, so I would suggest starting with a simple MLP with a SoftMax output and a dataset of 1000 positive and 1000 negative samples. After reaching 80%-90% accuracy, you can consider using a CNN, and I would also suggest increasing the size of the training dataset, because deep learning algorithms are more efficient with bigger datasets.
For text data you can use Spy EM.
The basic idea is to combine your positive set with a whole bunch of random samples, some of which you hold out. You initially treat all the random documents as the negative class, and train a classifier with your positive samples and these negative samples.
Now some of those random samples will actually be positive, and you can conservatively relabel any documents that are scored higher than the lowest scoring held out true positive samples.
Then you iterate this process until it stabilizes.
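A rough sketch of a single pass of that idea (assumptions: scikit-learn, lists of strings positives and unlabeled; the full Spy EM algorithm iterates this and refines the classifier):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    def spy_relabel(positives, unlabeled, spy_frac=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Hold out a fraction of the positives as "spies" and hide them in the
        # unlabeled pool, which is initially treated as the negative class.
        n_spies = max(1, int(spy_frac * len(positives)))
        spy_idx = set(rng.choice(len(positives), n_spies, replace=False).tolist())
        spies = [p for i, p in enumerate(positives) if i in spy_idx]
        kept = [p for i, p in enumerate(positives) if i not in spy_idx]

        docs = kept + unlabeled + spies
        labels = [1] * len(kept) + [0] * (len(unlabeled) + len(spies))

        X = TfidfVectorizer().fit_transform(docs)
        scores = MultinomialNB().fit(X, labels).predict_proba(X)[:, 1]

        # The lowest score among the held-out true positives sets the cut-off.
        threshold = scores[len(kept) + len(unlabeled):].min()
        unlabeled_scores = scores[len(kept):len(kept) + len(unlabeled)]
        # Conservatively relabel unlabeled docs scoring at or above the cut-off.
        return [d for d, s in zip(unlabeled, unlabeled_scores) if s >= threshold]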

What does this learning curve show? And how to handle non-representativity of a sample?

[Link to the learning curves]
I am trying a random forest regressor for a machine learning problem (price estimation of spatial points). I have a sample of spatial points in a city. The sample is not randomly drawn since there are very few observations downtown. And I want to estimate prices for all addresses in the city.
I have a good cross-validation score (absolute mean squared error) and also a good test score after splitting the training set, but the predictions are very bad.
What could explain these results?
I plotted the learning curve (link above): the cross-validation score increases with the number of instances (that sounds logical), and the training score remains high (should it decrease?). What do these learning curves show? And, in general, how do we "read" learning curves?
Moreover, I suppose that the sample is not representative. I tried to make the dataset for which I want predictions spatially similar to the training set by drawing without replacement according to the proportions of observations in each district of the training set, but this didn't change the result. How can I handle this non-representativity?
Thanks in advance for any help
There are a few common cases that pop up when looking at training and cross-validation scores:
Overfitting: When your model has a very high training score but a poor cross-validation score. Generally this occurs when your model is too complex, allowing it to fit the training data exceedingly well but giving it poor generalization to the validation dataset.
Underfitting: When neither the training nor the cross-validation scores are high. This occurs when your model is not complex enough.
Ideal fit: When both the training and cross-validation scores are fairly high. Your model not only learns to represent the training data, but it also generalizes well to new data.
Here's a nice graphic from this Quora post showing how model complexity and error relate to the type of fit a model exhibits.
In the plot above, the errors for a given complexity are the errors found at equilibrium. In contrast, learning curves show how the score progresses throughout the entire training process. Generally you never want to see the score decreasing during training, as this usually means your model is diverging. But the difference between the training and validation scores as they move forward in time (towards equilibrium) indicates how well your model is fitting.
Notice that even when you have an ideal fit (middle of complexity axis) it is common to see a training score that's higher than the cross-validation score, since the model's parameters are updated using the training data. But since you're getting poor predictions, and since validation score is ~10% lower than training score (assuming the score is out of 1), I would guess that your model is overfitting and could benefit from less complexity.
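If it helps, here is a minimal sketch for plotting such learning curves (assuming scikit-learn and matplotlib; the generated toy data stands in for the real spatial features and prices):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import learning_curve

    # Toy regression data as a stand-in for the spatial price dataset
    X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

    sizes, train_scores, val_scores = learning_curve(
        RandomForestRegressor(n_estimators=50, random_state=0),
        X, y,
        cv=5,
        scoring="neg_mean_absolute_error",
        train_sizes=np.linspace(0.1, 1.0, 5),
    )

    plt.plot(sizes, train_scores.mean(axis=1), label="training score")
    plt.plot(sizes, val_scores.mean(axis=1), label="cross-validation score")
    plt.xlabel("number of training instances")
    plt.ylabel("negative mean absolute error")
    plt.legend()
    plt.show()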
To answer your second point, models will generalize better if the training data is a better representation of the validation data. So when splitting the data into training and validation sets, I recommend finding a way to randomly segregate the data. For example, you could generate a list of all the points in the city, iterate over the list, and for each point draw from a uniform distribution to decide which dataset that point belongs to.
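For example, something like this sketch of the uniform random assignment (the grid of points is a hypothetical stand-in for the city's addresses):

    import random

    random.seed(0)
    city_points = [(x, y) for x in range(100) for y in range(100)]   # toy grid

    train, validation = [], []
    for point in city_points:
        # Uniform draw: roughly 80% training, 20% validation
        (train if random.random() < 0.8 else validation).append(point)

    print(len(train), len(validation))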

binary classification with sparse binary matrix

My crime classification dataset has indicator features, such as has_rifle.
The job is to train a model and predict whether data points are criminals or not. The metric is a weighted mean absolute error: if the person is a criminal and the model predicts him/her as not, the weight is as large as 5; if the person is not a criminal and the model predicts that he/she is, the weight is 1; if the model predicts correctly, the weight is 0.
I've used the classif:multinom method in mlr in R and tuned the threshold to 1/6. The result is not that good. Adaboost is slightly better, though neither is perfect.
I'm wondering which method is typically used for this kind of binary classification problem with a sparse {0,1} matrix, and how to improve the performance measured by the weighted mean absolute error metric?
Dealing with sparse data is not a trivial task. The lack of information makes it difficult to capture features such as variance. I would suggest searching for subspace clustering methods or, to be more specific, soft subspace clustering. The latter usually identifies relevant/irrelevant data dimensions. It is a good approach when you want to improve classification accuracy.
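As a side note on the metric from the question, here is a small sketch of the cost-weighted error and the 1/6 threshold (assumptions: NumPy arrays with 1 = criminal; the labels below are toy):

    import numpy as np

    def weighted_error(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
        """Mean cost: 5 for a missed criminal, 1 for a false alarm, 0 if correct."""
        fn = (y_true == 1) & (y_pred == 0)   # criminal predicted as not
        fp = (y_true == 0) & (y_pred == 1)   # non-criminal predicted as criminal
        return (fn_cost * fn + fp_cost * fp).mean()

    # With these costs, predicting "criminal" whenever the predicted probability
    # exceeds fp_cost / (fp_cost + fn_cost) = 1/6 minimizes the expected cost,
    # which is consistent with the threshold mentioned in the question.
    y_true = np.array([1, 0, 1, 0, 0, 1])
    y_pred = np.array([1, 0, 0, 1, 0, 1])
    print(weighted_error(y_true, y_pred))   # (5*1 + 1*1) / 6 = 1.0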
