Sentiment analysis of social media with Naive Bayes - Twitter

I am trying to implement sentiment analysis on Google+ posts using a Naive Bayes classifier.
I have been looking for a dataset, and the only one I found was made for Twitter. A Google+ post and a tweet have a lot in common, but perhaps not the length. I'd like to know if that changes anything for a NB classifier.
Also, Naive Bayes takes smileys into account, but suppose we want to give them more weight than other tokens (since they express the emotion of a tweet unambiguously). Is there a way to do that with Naive Bayes?
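One common trick for up-weighting specific tokens in a multinomial Naive Bayes model is simply to repeat them, since the classifier works on token counts. A minimal sketch with scikit-learn, where the emoticon list, the weight factor, and the tiny training set are all made-up assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical emoticon list and weight factor; tune these for real data.
EMOTICONS = {":)", ":(", ":D", ":'("}
SMILEY_WEIGHT = 3  # each emoticon token is repeated this many times

def boost_smileys(text):
    # Duplicate emoticon tokens so their counts (and therefore their
    # influence on the multinomial NB likelihood) are multiplied.
    out = []
    for tok in text.split():
        out.extend([tok] * (SMILEY_WEIGHT if tok in EMOTICONS else 1))
    return " ".join(out)

# Toy training set, made up for illustration (1 = positive, 0 = negative).
train_texts = ["great day :)", "awful service :(", "love it :D", "so sad :'("]
train_labels = [1, 0, 1, 0]

# Tokenize on whitespace so emoticons survive (the default token pattern
# would strip punctuation-only tokens like ":)").
vec = CountVectorizer(tokenizer=str.split, token_pattern=None)
X = vec.fit_transform(boost_smileys(t) for t in train_texts)
clf = MultinomialNB().fit(X, train_labels)

print(clf.predict(vec.transform([boost_smileys("terrible weather :(")]))[0])
```

Because the boosted ":(" appears three times instead of once, its per-class likelihood dominates the unseen words and pulls the prediction toward the negative class.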

Related

How do I find the backend generated code in weka?

I am using weka for classification.
If I use Naive Bayes for classification of datasets, how can I see the backend code of the Naive Bayes algorithm in Weka? Is there any way?
Weka is open source, so you can see the code in its repository, as stated on the Weka website. The Naive Bayes part is here

Collecting Machine learning training data

I am very new to machine learning and need a couple of things clarified. I am trying to predict the probability of someone liking an activity based on their Facebook likes. I am using the Naive Bayes classifier, but am unsure about a couple of things. 1. What would my labels/inputs be? 2. What info do I need to collect for training data? My guess is to create a survey with questions on whether the person would enjoy an activity (scale from 1-10).
In supervised classification, all classifiers need to be trained with known labeled data; this is called training data. Each instance should be a vector of features followed by a special attribute called the class: in your problem, whether the person enjoyed the activity or not.
Once you train the classifier, you should test its behavior on a different dataset so the evaluation is not biased. This dataset must have the same class attribute as the training data. If you train and test on the same dataset, your classifier's predictions may look very good, but the estimate is unfair.
I suggest you take a look at evaluation techniques like k-fold cross-validation.
Another thing you should know is that the common Naïve Bayes classifier is used for binary classification, so your class should be 0 or 1, meaning the surveyed person did or did not enjoy the activity. It is implemented in packages like Weka (Java) and scikit-learn (Python).
If you are really interested in Bayesian classifiers, I should say that Naïve Bayes is in fact not the best choice for binary classification, because Minsky showed in 1961 that its decision boundaries are hyperplanes. Its Brier score also tends to be poor, and the classifier is said not to be well calibrated. Still, it makes good predictions after all.
Hope it helps.
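The setup described above (binary like/not-like features plus a 0/1 class) can be sketched with scikit-learn's Bernoulli Naive Bayes; every value below is invented for illustration:

```python
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary feature vectors: each column is one Facebook page
# (1 = the person liked it, 0 = they did not). The class is whether the
# person said they enjoyed the activity (1) or not (0).
X_train = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
]
y_train = [1, 1, 0, 0]

clf = BernoulliNB()
clf.fit(X_train, y_train)

# Predict for a new person whose like vector was collected the same way.
print(clf.predict([[1, 0, 1, 1]])[0])
```

The key point from the answer above holds here too: the new person's vector must be encoded with exactly the same columns, in the same order, as the training data.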
This may be fairly difficult with Naive Bayes. You'll need to collect (or calculate) samples of whether or not a person likes activity X, and also details on their Facebook likes (organized in some consistent way).
Basically, for Naive Bayes, your training data should be the same data type as your testing data.
The survey approach may work, if you have access to each person's Facebook like history.

Random Forest, text classification

How can I use words as features to classify text with the random forest algorithm for sentiment analysis? I'm using words as features, whereas random forest expects numeric features; this is where I'm stuck.
I think scikit-learn can help you solve this. You can look at the scikit-learn tutorial on its website here; it will be very useful.
When working with text features you can use CountVectorizer or DictVectorizer. Take a look at feature extraction and especially section 4.1.3 here.
To learn more, you can find an example of classifying text documents here.
You can use CountVectorizer or TF-IDF in the preprocessing step of the random forest pipeline. Post an excerpt of your data and I will demonstrate.
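The CountVectorizer-plus-random-forest pipeline suggested in these answers can be sketched as follows; the corpus and labels are made up for illustration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy sentiment corpus (1 = positive, 0 = negative), invented for the sketch.
texts = [
    "I loved this movie",
    "worst film ever",
    "what a great story",
    "truly terrible acting",
]
labels = [1, 0, 1, 0]

# CountVectorizer turns each text into a vector of word counts, which is
# exactly the numeric input the random forest needs.
model = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(texts, labels)

print(model.predict(["a great movie"]))
```

Swapping `CountVectorizer` for `TfidfVectorizer` changes only the first pipeline step; the rest stays the same.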

Sentiment Analysis on Twitter Data?

I am working on a project where I wish to classify the general mood of a Twitter user from his recent tweets. Since the tweets can belong to a huge variety of domains, how should I go about it?
I could use the Naive Bayes algorithm (like here: http://phpir.com/bayesian-opinion-mining), but since the tweets can belong to such a large variety of domains, I am not sure this will be very accurate.
The other option is using sentiment dictionaries like SentiWordNet or here. Would this be a better approach? I don't know.
Also, where can I get data to train my classifier if I plan to use Naive Bayes or some other algorithm?
Just to add here, I am primarily coding in PHP.
It appears you could use SentiWordNet as the classifier data if you are focused on a word-by-word approach. That is how simple Bayesian spam filters work: they focus on each word.
The advantage here is that while many of the words in SentiWordNet have multiple meanings, each with different positive/objective/negative scores, you could experiment with using the scores of the other words in the tweet to narrow in on the most appropriate meaning of each ambiguous word. That could give you a more accurate score for each word and for the overall tweet.
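The word-by-word lexicon scoring described above can be sketched in a few lines. Note the simplification: real SentiWordNet entries are keyed by synset and sense, not by plain word, so the flat dictionary and its scores below are stand-ins for illustration:

```python
# Tiny stand-in lexicon: word -> (positive score, negative score).
# Values are invented; a real system would derive them from SentiWordNet.
LEXICON = {
    "good":  (0.75, 0.0),
    "bad":   (0.0, 0.625),
    "happy": (0.8, 0.0),
    "awful": (0.0, 0.875),
}

def score_tweet(text):
    """Sum positive minus negative scores over the words we recognize."""
    pos = neg = 0.0
    for word in text.lower().split():
        p, n = LEXICON.get(word, (0.0, 0.0))
        pos += p
        neg += n
    return pos - neg

print(score_tweet("what a good happy day"))   # net-positive score
print(score_tweet("awful bad service"))       # net-negative score
```

Disambiguating multi-sense words, as the answer suggests, would replace the flat `LEXICON.get` lookup with a step that picks one sense per word based on the scores of its neighbors.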

Weighted Naive Bayes Classifier in Apache Mahout

I am using a Naive Bayes classifier for sentiment analysis on customer support data. Unfortunately, I don't have a huge annotated dataset in the customer support domain, only a small amount of annotated data (around 100 positive and 100 negative examples). I also have the Amazon product review dataset.
Is there any way I can implement a weighted Naive Bayes classifier using Mahout, so that I can give more weight to the small customer support set and less weight to the Amazon product review data? Training on such a weighted dataset would, I guess, drastically improve accuracy. Kindly help me with the same.
One really simple approach is oversampling, i.e. just repeating the customer support examples multiple times in your training data.
Though it's not the same problem, you might get some further ideas by looking into the approaches used for class imbalance, in particular oversampling (as mentioned) and undersampling.
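The oversampling idea from these answers needs no special classifier support; you just build the training set with the in-domain examples repeated. A minimal sketch, where both datasets and the repetition factor are made up:

```python
# Hypothetical (text, label) pairs; replace with your real annotated data.
support_data = [
    ("thanks for the quick fix", 1),
    ("still broken, very unhappy", 0),
]
amazon_data = [
    ("great product", 1),
    ("arrived damaged", 0),
    ("works fine", 1),
]

# How much extra weight the in-domain customer support data gets.
# Tuning this factor (e.g. via cross-validation) is left to experiment.
OVERSAMPLE_FACTOR = 10

# Repeating the small set multiplies its effective token counts,
# which in Naive Bayes is equivalent to weighting those examples.
training_data = amazon_data + support_data * OVERSAMPLE_FACTOR

print(len(training_data))
```

The resulting list is then fed to whichever trainer you use (Mahout, scikit-learn, etc.) exactly as unweighted data would be.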
