Identifying positivity and negativity separately from a negative dataset - machine-learning

First of all, i would like you to know that I am new to machine learning (ML). I am working on a project which detects how positive or negative a set of words can be, therefore i have created a database containing possible negative words. So that ML can take place and predict the overall score on how positive or negative the whole set of words.
My questions are is it possible to classify positive words with only negative words in the dataset? Does it affect the accuracy of predicting if it is possible?

No, it's not generally possible. The model will have no way to differentiate among (1) new negative phrases; (2) neutral phrases; (3) positive phrases. In fact, with only negative phrases, the model will have a hard time learning that "bad" and "not bad" are opposites, as it has seen plenty of "not" references in the negative literature, such as "not worth watching, even for free."

Related

Features of vector form of sentences for opinion finding.

I want to find the opinion of a sentence either positive or negative. For example talk about only one sentence.
The play was awesome
If change it to vector form
[0,0,0,0]
After searching through the Bag of words
bad
naughty
awesome
The vector form becomes
[0,0,0,1]
Same for other sentences. Now I want to pass it to the machine learning algorithm for training it. How can I train the network using these multiple vectors? (for finding the opinion of unseen sentences) Obviously not! Because the input is fix in neural network. Is there any way? The above procedure is just my thinking. Kindly correct me if I am wrong. Thanks in advance.
Since your intuitive input format is "Sentence". Which is, indeed, a string of tokens with arbitrary length. Abstracting sentences as token series is not a good choice for many existing algorithms only works on determined format of inputs.
Hence, I suggest try using tokenizer on your entire training set. This will give you vectors of length of the dictionary, which is fixed for given training set.
Because when the length of sentences vary drastically, then size of the dictionary always keeps stable.
Then you can apply Neural Networks(or other algorithms) to the tokenized vectors.
However, vectors generated by tokenizer is extremely sparse because you only work on sentences rather than articles.
You can try LDA (supervised, not PCA), to reduce the dimension as well as amplify the difference.
That will keep the essential information of your training data as well as express your data at fixed size, while this "size" is not too large.
By the way, you may not have to label each word by its attitude since the opinion of a sentence also depends on other kind of words.
Simple arithmetics on number of opinion-expressing words many leave your model highly biased. Better label the sentences and leave the rest job to classifiers.
For the confusions
PCA and LDA are Dimensional Reduction techniques.
difference
Let's assume each tuple of sample is denoted as x (1-by-p vector).
p is too large, we don't like that.
Let's find a matrix A(p-by-k) in which k is pretty small.
So we get reduced_x = x*A, and most importantly, reduced_x must
be able to represent x's characters.
Given labeled data, LDA can provide proper A that can maximize
distance between reduced_x of different classes, and also minimize
the distance within identical classes.
In simple words: compress data, keep information.
When you've got
reduced_x, you can define training data: (reduced_x|y) where y is
0 or 1.

training set with only one label, missing the other

Hi I've been doing a machine learning project about predicting if a given (query, answer) pair is a good match (label the pair with 1 if it is a good match, 0 otherwise). But the problem is, in the training set, all the items are labelled with 1. So I got confused because I don't think the training set has strong discriminative power. To be more specific, now I could extract some features like:
1. textual similarity between query and answer
2. some attributes like the posting date, who created it, which aspect is it about etc.
Maybe I should try semi supervised learning (never studied it so have no idea if it will work)? But with such a training set I even cannot do validation....
Actually, you can train a data set on only positive examples; 1-class SVM does this. However, this presumes that anything "sufficiently outside" the original data set is negative data, with "sufficiently outside" affected mainly by gamma (allowed error rate) and k (degree of the kernel function).
A solution for your problem depends on the data you have. You are quite correct that a model trains better when given representative negative examples. The description you give strongly suggests that you do know there are insufficient matches.
Do you need a strict +/- scoring for the matches? Most applications simply rank them: the match strength is the score. This changes your problem from a classification to a prediction case. If you do need a strict +/- partition (classification), then I suggest that you slightly alter your training set: include only obvious examples: throw out anything scored near your comfort threshold for declaring a match.
With these inputs only, train your model. You'll have a clear "alley" between good and bad matches, and the model will "decide" which way to judge the in-between cases in testing and production.

Classifier or heuristics?

I need to classify questions asking user to specify brand.
I has some set of samples featuring word "brand".
Positives like:
"What is your favorite cosmetic brand?",
"Which fragrance brand (if any) do you think this advert is for?"...
and negatives like:
"Is there any particular reason why you chose this brand?"
Of cause, it's possible to train 2-class classifier based on concrete samples. However precision and recall will be poor. Is there any way to construct something having good precision based on variety of positive samples?
Precision and recall does not have to be poor. You should try and build a binary classifier (I would recommend SVM or decision tree for this purpose). I would recommend extracting features like the number of occurrences of each word in a sample (or tf-idf) or the length of the words and sentences. I guess that the question word in the sentence will have a major impact on the classification.
In addition, please note that a good precision value is very easy to get when you do not care about recall.
Choosing a set of words as features using tf-idf and training a tree algorithm seems the easiest way to go but I would also suggest to also try k-means clustering in the case that noe or more categories of answers considered as "neutral" emerge. This will possible help you decide which of these you consider positive or negative in order to re-factor your feature vector and subsequently your algorithm.
I am also a huge fan of HMM variants (I have used them to perform energy disaggregation) and I suggest you have a look at the following. It might give you some extra ideas:
http://www.merl.com/publications/docs/TR2004-085.pdf

Naive Bays spam filtering

I am trying to implement my first spam filter using a naive bayes classifier. I am using the data provided by UCI’s machine learning data repository (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). The data is a table of features corresponding to a few thousand spam and non-spam(ham) messages. Therefore, my features are limited to those provided by the table.
My goal is to implement a classifier that can calculate P(S∣M), the probability of being spam given a message. So far I have been using the following equation to calculate P(S∣F), the probability of being spam given a feature.
P(S∣F)=P(F∣S)/(P(F∣S)+P(F∣H))
from http://en.wikipedia.org/wiki/Bayesian_spam_filtering
where P(F∣S) is the probability of feature given spam and P(F∣H) is the probability of feature given ham. I am having trouble bridging the gap from knowing a P(S∣F) to P(S∣M) where M is a message and a message is simply a bag of independent features.
At a glance I want to just multiply the features together. But that would make most numbers very small, I am not sure if that is normal.
In short these are the questions I have right now.
1.) How to take a set of P(S∣F) to a P(S∣M).
2.) Once P(S∣M) has been calculated, how do I define a a threshold for my classifier?
3.) Fortunately my feature set was selected for me, how would I go about selecting or finding my own feature set?
I would also appreciate resources that might help me out as well. Thanks for your time.
You want to use Naive Bayes:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
It's probably beyond the scope of this answer to explain it, but essentially you multiply the probability of each feature give spam together, and multiply that by the prior probability of spam. Then repeat for ham (i.e. multiple each feature given ham together, and multiply that by the prior probability of ham). Now you have two numbers which can be normalized to probabilities by dividing each by the total of both. That will give you the probability of S|M and S|H. Again read the article above. If you want to avoid numerical underflow, take the log of each conditional and prior probability (any base) and add, instead of multiplying the original probabilities. Adding logs is equivalent to multiplying the original numbers. This won't give you a probability number at the end, but you can still take the one with the larger value as the predicted class.
You should not need to set a threshold, simply classify each instance by what is more likely, spam or ham (or whichever gives you the greater log likelihood).
There is no simple answer to this. Using a bag of words model is reasonable for this problem. Avoid very infrequent (occurring in < 5 documents) and also very frequent words, such as the, and a. A stop word list is often used to remove these. A feature selection algorithm can also help. Removing features that are highly correlated will help, particularly with Naive Bayes, which is highly sensitive to this.

Good performance only for one class naive bayes

I use Naive Bayes from Weka to do text classification. I have two classes for my sentences, "Positive" and "Negative". I collected about 207 sentences with positive meaning and 189 sentences with negative meaning, in order to create my training set.
When I ran Naive Bayes with a test set that contains sentences with strong negative meaning, such as the one of the word "hate", the accuracy of the results is pretty good, about 88%. But when I use sentences with positive meaning, such as the one of the word "love", as a test set, the accuracy is much worse, about 56%.
I think that this difference probably has something to do with my training set and especially its "Positive" sentences.
Can you think of any reason that could explain this difference? Or maybe a way to help me find out where the problem begins?
Thanks a lot for your time,
Nantia
Instead of creating test sets which contain only positive or negative samples I would just create a test set with mixed samples. You can the view the resulting confusion matrix in Weka which allows you to see how well both the positive and negative samples where classified. Furthermore I would use (10-fold) cross-validation to get a more stable measure of the performance (once you have done this you might want to edit your post with the confusion matrix cross-validation results and we might be able to help out more).
It may be that your negative sentences have words that are more consistently present, whereas your positive sentences have more variations in the words that are present or those words may also often be present in the negative sentences.
It is hard to give specific advice without knowing the size of your dictionary (i.e., number of attributes), size of your test set, etc. Since the Naive Bayes Classifier calculates the product of the probabilities of individual words being present or absent, I would take some of the misclassified positive examples and examine the conditional probabilities for both positive and negative classification to see why the examples are being misclassified.
To better understand how your classifier works, you can inspect the parameters to see which words the classifier thinks are the most predictive of positive/negative of sentence. Can you print out the top predictors for positive and negative cases?
e.g.,
top positive predictors:
p('love'|positive) = 0.05
p('like'|positive) = 0.016
...
top negative predictors:
p('hate'|negative) = 0.25
p('dislike'|negative) = 0.17
...

Resources