I'm trying to predict the sentiment of the next tweet posted by a twitter user. Right now I have the following steps (step 1 and 2 are already implemented in python):
Learn how to classify a tweet as postive (1), neutral (0), or negative (-1). I use a naive bayes classifier for this and it works pretty well.
Classify existing tweets from a user. This results in a series of numbers like this: [0, 1, -1, -1, -1, 0, 1, 1, ..] There's also information about the posting time.
Would it be possible to predict the sentiment (1, 0 or -1) for the next tweet?
What algorithm could I use for this?
I don't know how this one works yet, but are hidden markov models suitable or some kind of regression?
I think one appealing way to think about this is in terms prior and likelihood for sentiment. Naive Bayes is a likelihood model (how probable am I to see this exact tweet, given that it's positive?). You're asking about the prior probability of a the next tweet being positive, given that you've observed a certain sequence of sentiments so far. There are a few ways you could do this:
The most naive way is the fraction of tweets the user has uttered which are positive is the probability that the next one will be positive
However, this ignores recency. You could come up with a transition-based model: from each possible previous state, there's a probability of the next tweet being positive, negative or neutral. Thus you have a 3x3 transition matrix, and the conditional probability of the next tweet being positive given the last one was positive is the transition probability pos->pos. This can be estimated from counts, and is a Markovian process (previous state is all that matters, basically).
You can get more and more complex with these transition models, for example the current 'state' could be the sentiments of the last two, or indeed last-n, tweets, meaning you get more specific predictions at the expense of more and more parameters in the model. You can overcome this with smoothing schemes, parameter tying etc. etc.
As a final point, I think #Anony-Mousse's point about the prior being weak evidence is going to be true: really, whatever your prior tells you, I think this is going to be dominated by the likelihood function (what's actually in the tweet in question). If you get to see the tweet as well, consider a CRF as #Neil McGuigan suggests.
On the machine learning side, you could consider sequential associations:
http://web.mit.edu/rudin/www/RudinEtAlCOLT11.pdf
This site has some java libraries:
http://www.philippe-fournier-viger.com/spmf/
A Hidden Markov Model should also work. An HMM is a special case of a Conditional Random Field, which lets you look at other factors such as say weather or news events.
I wonder if the next tweet of a person is also affected by the current tweets of a) everyone b) or those that they follow
Related
I am using matrix factorization as a recommender system algorithm based on the user click behavior records. I try two matrix factorization method:
The first one is the basic SVD whose prediction is just the product of user factor vector u and item factor i: r = u * i
The second one I used is the SVD with bias component.
r = u * i + b_u + b_i
where b_u and b_i represents the bias of preference of users and items.
One of the models I use has a very low performance, and the other one is reasonable. I really do not understand why the latter one performs worse, and I doubt that it is overfitting.
I googled methods to detect overfitting, and found the learning curve is a good way. However, the x-axis is the size of the training set and y-axis is the accuracy. This make me quite confused. How can I change the size of the training set? Pick out some of the records out of the data set?
Another problem is, I tried to plot the iteration-loss curve (The loss is the ). And it seems the curve is normal:
But I am not sure whether this method is correct because the metrics I use are precision and recall. Shall I plot the iteration-precision curve??? Or this one already tells that my model is correct?
Can anybody please tell me whether I am going in the right direction? Thank you so much. :)
I will answer in reverse:
So you are trying two different models, one that uses straight matrix factorization r = u * i and the other which enters the biases, r = u * i + b_u + b_i.
You mentioned you are doing Matrix Factorization for a recommender system which looks at user's clicks. So my question here is: Is this an Implicit ratings case? or Explicit one? I believe is an Implicit ratings problem if it is about clicks.
This is the first important thing you need to be very aware of, whether your problem is about Explicit or Implicit ratings. Because there are some differences about the way they are used and implemented.
If you check here:
http://yifanhu.net/PUB/cf.pdf
Implicit ratings are treated in a way that the number of times someone clicked or bought a given item for example is used to infer a confidence level. If you check the error function you can see that the confidence levels are used almost as a weight factor. So the whole idea is that in this scenario the biases have no meaning.
In the case of Explicit Ratings, where one has ratings as a score for example from 1-5, one can calculate those biases for users and products (averages of these bounded scores) and introduce them in the ratings formula. They make sense int his scenario.
The whole point is, depending whether you are in one scenario or the other you can use the biases or not.
On the other hand, your question is about over fitting, for that you can plot training errors with test errors, depending on the size of your data you can have a holdout test data, if the errors differ a lot then you are over fitting.
Another thing is that matrix factorization models usually include regularization terms, see the article posted here, to avoid over fitting.
So I think in your case you are having a different problem the one I mentioned before.
I am building a RNN for a time series model, which have a categorical output.
For example, if precious 3 pattern is "A","B","A","B" model predict next is "A".
there's also a numerical level associated with each category.
For example A is 100, B is 50,
so A(100), B(50), A(100), B(50),
I have the model framework to predict next is "A", it would be nice to predict the (100) at the same time.
For real life examples, you have national weather data.
You are predicting the next few days weather type(Sunny, windy, raining ect...) at the same time, it would be nice model will also predict the temperature.
Or for Amazon, analysis customer's trxns pattern.
Customer A shopped category
electronic($100), household($10), ... ...
predict what next trxn category that this customer is likely to shop and predict at the same time what would be the amount of that trxns.
Researched a bit, have not found any relevant research on similar topics.
What is stopping you from adding an extra output to your model? You could have one categorical output and one numerical output next to each other. Every neural network library out there supports multiple outputs.
Your will need to normalise your output data though. Categories should be normalised with one-hot encoding and numerical values should be normalised by dividing by some maximal value.
Researched a bit, have not found any relevant research on similar topics.
Because this is not really a 'topic'. This is something completely normal, and it does not require some special kind of network.
There is a stream of short texts. Each one has the size of a tweet, or let us just assume they are all tweets.
The user can vote on any tweet. So, each tweet has one of the following three states:
relevant (positive vote)
default (neutral i.e. no vote)
irrelevant (negative vote)
Whenever a new set of tweets come, they will be displayed in a specific order. This order is determined by the votes of the user on all previous tweets. The aim is to assign a score to each new tweet. This score is calculated based on the word similarity or match between the text of this tweet and all the previous tweets voted by the user. In other words, the tweet with the highest score is going to be the one which contains the maximum number of words voted previously positive and the minimum of words voted previously as negative. Also, the new tweets having a high score will trigger a notification to the user as they are considered very relevant.
One last thing, a minimum of semantic consideration (natural language processing) would be great.
I have read about Term Frequency–Inverse Document Frequency and come up with this very simple and basic solution:
Reminder: a high weight in tf–idf is reached by a high word frequency and a low total frequency of the word in the whole collection.
If the user votes positive on a Tweet, all the words of this tweet will receive a positive point (same thing for the negative case). This means that we will have a large set of words where each word has the total number of positive points and negative points.
If (Tweet score > 0) then this tweet will trigger a notification.
Tweet score = sum of all individual words’ scores of this tweet
word score = word frequency * inverse total frequency
word frequency in all previous votes = ( total positive votes for this word - total negative votes for this word) / total votes for this word
Inverse total frequency = log ( total votes of all words / total votes for this word)
Is this method enough? I am open to any better methods and any ready API or algorithm.
One possible solution would be to train a classifier such as Naive Bayes on the tweets that a user has voted on. You can take a look at the documentation of scikit-learn, a Python library, which explains how you can easily preprocess your text and train such a classifier.
I would look at Naive Bayes, however I would also look at the K-Nearest Neighbours algorithm when performing a simple classification - this is contained within the Sci-kit Learn library and documented well.
RE: "running SKLearn on GAE is not possible" - you will either need to use the Google Predict API, or, run a VPS which would serve as a worker to process your classification tasks; this would obviously have to live on a different system though.
I would say though, if you are only hoping to perform simple classification on a suitably small dataset, you could actually implement a classifier in JavaScript, like
`http://jsfiddle.net/bkanber/hevFK/light/`
With a JS implementation, the processing time will become unacceptably slow if the dataset is too large, but it's nice to have as an option, even preferable in many cases.
Ultimately, GAE is not the platform I would use when building anything which may require all but the most basic of ML techniques. I would look at Heroku or a VPS in such a place as Digital Ocean, AWS et al.
I am trying to implement my first spam filter using a naive bayes classifier. I am using the data provided by UCI’s machine learning data repository (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). The data is a table of features corresponding to a few thousand spam and non-spam(ham) messages. Therefore, my features are limited to those provided by the table.
My goal is to implement a classifier that can calculate P(S∣M), the probability of being spam given a message. So far I have been using the following equation to calculate P(S∣F), the probability of being spam given a feature.
P(S∣F)=P(F∣S)/(P(F∣S)+P(F∣H))
from http://en.wikipedia.org/wiki/Bayesian_spam_filtering
where P(F∣S) is the probability of feature given spam and P(F∣H) is the probability of feature given ham. I am having trouble bridging the gap from knowing a P(S∣F) to P(S∣M) where M is a message and a message is simply a bag of independent features.
At a glance I want to just multiply the features together. But that would make most numbers very small, I am not sure if that is normal.
In short these are the questions I have right now.
1.) How to take a set of P(S∣F) to a P(S∣M).
2.) Once P(S∣M) has been calculated, how do I define a a threshold for my classifier?
3.) Fortunately my feature set was selected for me, how would I go about selecting or finding my own feature set?
I would also appreciate resources that might help me out as well. Thanks for your time.
You want to use Naive Bayes:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
It's probably beyond the scope of this answer to explain it, but essentially you multiply the probability of each feature give spam together, and multiply that by the prior probability of spam. Then repeat for ham (i.e. multiple each feature given ham together, and multiply that by the prior probability of ham). Now you have two numbers which can be normalized to probabilities by dividing each by the total of both. That will give you the probability of S|M and S|H. Again read the article above. If you want to avoid numerical underflow, take the log of each conditional and prior probability (any base) and add, instead of multiplying the original probabilities. Adding logs is equivalent to multiplying the original numbers. This won't give you a probability number at the end, but you can still take the one with the larger value as the predicted class.
You should not need to set a threshold, simply classify each instance by what is more likely, spam or ham (or whichever gives you the greater log likelihood).
There is no simple answer to this. Using a bag of words model is reasonable for this problem. Avoid very infrequent (occurring in < 5 documents) and also very frequent words, such as the, and a. A stop word list is often used to remove these. A feature selection algorithm can also help. Removing features that are highly correlated will help, particularly with Naive Bayes, which is highly sensitive to this.
I collect a bunch of questions from Twitter's stream by using a regular expression to pick out any tweet that contains a text that starts with a question type: who, what, when, where etc and ends with a question mark.
As such, I end up getting several non-useful questions in my database like: 'who cares?', 'what's this?' etc and some useful ones like: 'How often is there a basketball fight?', 'How much does a polar bear weigh?' etc
However, I am only interested in useful questions.
I have got about 3000 questions, ~2000 of them are not useful, ~1000 of them are useful that I have manually label them. I am attempting to use a naive Bayesian classifier (that comes with NLTK) to try to classify questions automatically so that I don't have to manually pick out the useful questions.
As a start, I tried choosing the first three words of a question as a feature but this doesn't help very much. Out of 100 questions the classifier predicted only around 10%-15% as being correct for useful questions. It also failed to pick out the useful questions from the ones that it predicted not useful.
I have tried other features such as: including all the words, including the length of the questions but the results did not change significantly.
Any suggestions on how I should choose the features or carry on?
Thanks.
Some random suggestions.
Add a pre-processing step and remove stop-words like this, a, of, and, etc.
How often is there a basketball fight
First you remove some stop words, you get
how often basketball fight
Calculate tf-idf score for each word (Treating each tweet as a document, to calculate the score, you need the whole corpus in order to get document frequency.)
For a sentence like above, you calculate tf-idf score for each word:
tf-idf(how)
tf-idf(often)
tf-idf(basketball)
tf-idf(fight)
This might be useful.
Try below additional features for your classifier
average tf-idf score
median tf-idf score
max tf-idf score
Furthermore, try a pos-tagger and generate a categorized sentence for each tweet.
>>> import nltk
>>> text = nltk.word_tokenize(" How often is there a basketball fight")
>>> nltk.pos_tag(text)
[('How', 'WRB'), ('often', 'RB'), ('is', 'VBZ'), ('there', 'EX'), ('a', 'DT'), ('basketball', 'NN'), ('fight', 'NN')]
Then you have possibly additional features to try that related to pos-tags.
Some other features that might be useful, see paper - qtweet (that is a paper for tweet question identification) for details.
whether the tweet contains any url
whether the tweet contains any email or phone number
whether there is any strong feeling such as ! follows the question.
whether unigram words present in the contexts of tweets.
whether the tweet mentions other user's name
whether the tweet is a retweet
whether the tweet contains any hashtag #
FYI, the author of qtweet attempted 4 different classifiers, namely, Random Forest, SVM, J48 and Logistic regression. Random forest performed best among them.
Hope they help.
A most likely very powerful feature you could try and build (Not sure if its possible) is it there is a reply to the tweet in question.