I am trying to find the sentiment of tweets using the Stanford NLP package. Here is an example tweet:
#SouthwestAir Fastest response all day. Hour on the phone: never got off hold. Hour in line: never got to the Flight Booking Problems desk.
Stanford NLP labels sentiment per sentence. This tweet has three sentences separated by full stops, so NLP gives me three different sentiment labels, one for each sentence of the tweet. How can I label the entire tweet as positive, negative or neutral?
Neutral #SouthwestAir Fastest response all day.
Negative Hour on the phone: never got off hold.
Negative Hour in line: never got to the Flight Booking Problems desk.
The sentiment software we release produces sentiment judgments per sentence. You will have to come up with a heuristic to turn that into document-level sentiment. For each tweet you might look at the set of sentence-level judgments {neutral, negative, negative} and just map that to a final judgment. One way might be to take the judgment with the highest count. Tweets are short, so you will usually be collapsing at most three sentence-level sentiments into one per tweet.
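The highest-count heuristic can be sketched in a few lines of Python. This is only an illustration: the function name `tweet_sentiment` and the choice to fall back to "Neutral" on a tie are my assumptions, not part of the Stanford tools.

```python
from collections import Counter

def tweet_sentiment(sentence_labels, tie_label="Neutral"):
    """Collapse per-sentence labels into one tweet label by majority vote."""
    if not sentence_labels:
        return tie_label
    counts = Counter(sentence_labels).most_common()
    # Assumption: when two labels tie for first place, fall back to tie_label.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]

print(tweet_sentiment(["Neutral", "Negative", "Negative"]))
```

For the example tweet this picks Negative, since two of the three sentences are labelled that way.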
I am new to Data Science. This could be a dumb question, but just want to know opinions and confirm if I could enhance it well.
I have a question about getting the 5 most frequent sentences from the database. I know I could gather all the data (sentences) into a list and use the Counter class to fetch the 5 most frequent sentences, but I would like to know whether any algorithm (ML/DL/NLP) exists for such a requirement. All the sentences are given by the user. I need to know the user's top 5 (most frequent) sentences (not phrases, please)!
Examples of sentences -
"Welcome to the world of Geeks"
"This portal has been created to provide well written subject"
"If you like Geeks for Geeks and would like to contribute"
"to contribute at geeksforgeeks org See your article appearing on "
"to contribute at geeksforgeeks org See your article appearing on " (occurring for the second time)
"the Geeks for Geeks main page and help thousands of other Geeks."
Note: All the sentences in my database are distinct (in context, and with no exact duplicates either). This is just an example of my requirement.
Thanks in Advance.
I'd suggest starting with sentence embeddings. Briefly, an embedding model returns a vector for a given sentence that roughly represents the meaning of the sentence.
Let's say you have n sentences in your database and you found the sentence embeddings for each sentence so now you have n vectors.
Once you have the vectors, you can use dimensionality reduction techniques such as t-SNE to visualize your sentences in 2 or 3 dimensions. In this visualization, sentences that have similar meanings should ideally be close to each other. That may help you pinpoint the most frequent sentences that are also close in meaning.
I think one problem is that it's still hard to draw boundaries to the meanings of sentences since meaning is intrinsically subjective. You may have to add some heuristics to the process I described above.
Adding to MGoksu's answer: once you have sentence embeddings, you can apply LSH (Locality Sensitive Hashing) to group the embeddings into clusters.
Once you have the clusters of embeddings, it is trivial to find the clusters with the highest number of vectors.
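As a rough sketch of that idea, here is random-hyperplane LSH in plain Python. It assumes the sentence embeddings have already been computed elsewhere and are passed in as lists of floats; the function name `lsh_buckets` and the parameters `n_planes` and `seed` are made up for the example.

```python
import random
from collections import defaultdict

def lsh_buckets(vectors, n_planes=8, seed=0):
    """Group vectors by the sign pattern of their dot products with
    random hyperplanes. Returns lists of indices, largest bucket first."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        # One bit per hyperplane: which side of the plane the vector is on.
        signature = tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes
        )
        buckets[signature].append(idx)
    return sorted(buckets.values(), key=len, reverse=True)

# Toy 2-d "embeddings": the first two are identical, the third points away.
groups = lsh_buckets([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
print(groups[0])
```

The first five entries of the returned list would be the five largest groups of similar sentences. Fewer hyperplanes give coarser (larger) buckets, more hyperplanes give finer ones, so `n_planes` needs tuning on real data.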
I have a dataset of user ratings on images. I am normalizing the ratings using mean and standard deviation normalization to remove bias in the dataset due to user-specific preferences. Is this a correct way to handle bias, or is there another way to remove bias in user ratings?
This is certainly wrong on a couple of points:
If you 'normalise' input by standard deviation in this way, what you are saying is that "low variability doesn't matter much, only the outliers really count", because the outliers will themselves have a deviation larger than the standard one...
You are dealing with 'votes' of user satisfaction, not 'measurements'. Bias is, by definition, information about satisfaction, and you are throwing it away. For example, 150 years ago people used to find the "No dogs, no Irish" sign acceptable; these days not so much. If you want to predict how well a restaurant is likely to be regarded after a visit, you can't discount 0 star votes merely because the people objected to the sign!
When it comes to star ratings as a prediction for how likely something is to be "enjoyed" or "regretted" you might want to read this article: https://www.evanmiller.org/how-not-to-sort-by-average-rating.html
Note that the linked article is primarily interested in modelling "given past ratings, does the current vote indicate: (a) a continuation of past 'satisfaction', (b) a shifting trend towards increasing 'satisfaction', (c) a shifting trend towards decreasing 'satisfaction'" in terms of stars to award.
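The ranking rule that article recommends is the lower bound of the Wilson score confidence interval on the fraction of positive ratings. A minimal sketch follows; it assumes you first reduce star ratings to positive/negative votes, which is a simplification on my part, and the function name is made up.

```python
import math

def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for the fraction of
    positive votes. z=1.96 corresponds to roughly 95% confidence."""
    if total == 0:
        return 0.0
    phat = positive / total
    denom = 1 + z * z / total
    centre = phat + z * z / (2 * total)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
    return (centre - margin) / denom

# Same 90% positive share, but far more evidence in the first case.
print(wilson_lower_bound(90, 100), wilson_lower_bound(9, 10))
```

An item with 90 positive votes out of 100 ranks above one with 9 out of 10, because the larger sample supports the same proportion with more confidence.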
I want to use the resolution time in minutes and the client description of the tickets on Zendesk to predict the resolution time of future tickets based on their description. I will use only these two values, but the description is a large text. I searched for a way to hash the feature values instead of the feature names in Vowpal Wabbit, but with no success. Which is the better approach for using large text feature values to predict with Vowpal Wabbit?
Values of features in Vowpal Wabbit can only be real numbers. If you have a categorical feature with n possible values you simply represent it as n binary features (so e.g. color=red is a name of a binary feature and its value is 1 by default).
If you have a text description you can use the individual words of the text as features (i.e. feature names). You only need to escape ":", "|" and whitespace characters in feature names, all other characters are allowed (including "="). So an example can look like
9 |USER avg_time:11 |SUMMARY words:5 sentences:1 |TEXT I have a big problem
So this ticket with text "I have a big problem" took 9 minutes to resolve and previous tickets from the same user took on average 11 minutes to resolve.
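If you build these lines from raw ticket data, a small helper along these lines could do it. This is a sketch: the function names are hypothetical, replacing the special characters with "_" is just one possible way to escape them, and the sentence count is a crude punctuation-based guess.

```python
import re

def vw_escape(token):
    """Replace ':', '|' and whitespace, which are special in VW feature names."""
    return re.sub(r"[:|\s]", "_", token)

def vw_line(minutes, avg_time, text):
    """Format one ticket as a VW example with USER, SUMMARY and TEXT namespaces."""
    words = [vw_escape(w) for w in text.split()]
    # Crude assumption: count runs of ., ! or ? as sentence ends, minimum 1.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return (f"{minutes} |USER avg_time:{avg_time} "
            f"|SUMMARY words:{len(words)} sentences:{sentences} "
            f"|TEXT {' '.join(words)}")

print(vw_line(9, 11, "I have a big problem"))
```

For the ticket above this reproduces the example line exactly.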
If you have enough training examples, I would recommend adding many more features (any details about the user, more summary features about the text, etc.). Also, the time of day (morning, afternoon, evening) and day of week when the ticket was reported may be good predictors (tickets reported on Friday evening tend to take longer), but maybe you intentionally don't want to model this and want to focus only on the "difficulty" of the ticket, irrespective of reporting time.
You can also try using word bigrams as features with --ngram T2, which means that 2-grams features will be created for all namespaces beginning with T (only TEXT namespace in my example). Maybe the individual words "big" and "problem" are not strong predictors, but the bigram "big problem" will get a high positive weight (indicating it is a good predictor of long resolution time).
I will use only this two values
You mean resolution time and text of the ticket, am I right? But the resolution time is the (dependent) variable you want to predict, so this does not count as a feature (aka independent variable). Of course, if you know the identity of the user and have enough training examples for each user, you can include the average time of previous tickets (excluding the current one, of course) of the user as a feature as I tried to show in the example.
There is a stream of short texts. Each one has the size of a tweet, or let us just assume they are all tweets.
The user can vote on any tweet. So, each tweet has one of the following three states:
relevant (positive vote)
default (neutral i.e. no vote)
irrelevant (negative vote)
Whenever a new set of tweets comes in, they will be displayed in a specific order. This order is determined by the votes of the user on all previous tweets. The aim is to assign a score to each new tweet. This score is calculated based on the word similarity or match between the text of this tweet and all the previous tweets voted on by the user. In other words, the tweet with the highest score is going to be the one which contains the maximum number of words previously voted positive and the minimum number of words previously voted negative. Also, new tweets with a high score will trigger a notification to the user, as they are considered very relevant.
One last thing, a minimum of semantic consideration (natural language processing) would be great.
I have read about Term Frequency–Inverse Document Frequency and come up with this very simple and basic solution:
Reminder: a high weight in tf–idf is reached by a high word frequency and a low total frequency of the word in the whole collection.
If the user votes positive on a Tweet, all the words of this tweet will receive a positive point (same thing for the negative case). This means that we will have a large set of words where each word has the total number of positive points and negative points.
If (Tweet score > 0) then this tweet will trigger a notification.
Tweet score = sum of all individual words’ scores of this tweet
word score = word frequency * inverse total frequency
word frequency in all previous votes = ( total positive votes for this word - total negative votes for this word) / total votes for this word
Inverse total frequency = log ( total votes of all words / total votes for this word)
Is this method enough? I am open to any better methods and any ready API or algorithm.
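For reference, the scoring scheme described in the question can be sketched as follows. The class and method names are made up, tokenisation is a naive lowercase split, and words that were never voted on are assumed to score 0.

```python
import math
from collections import Counter

class VoteScorer:
    def __init__(self):
        self.pos = Counter()  # positive points per word
        self.neg = Counter()  # negative points per word

    def vote(self, text, positive):
        """Give every word of a voted tweet one positive or negative point."""
        target = self.pos if positive else self.neg
        target.update(text.lower().split())

    def word_score(self, word):
        votes = self.pos[word] + self.neg[word]
        if votes == 0:
            return 0.0  # assumption: unseen words are neutral
        total = sum(self.pos.values()) + sum(self.neg.values())
        frequency = (self.pos[word] - self.neg[word]) / votes
        inverse_total_frequency = math.log(total / votes)
        return frequency * inverse_total_frequency

    def tweet_score(self, text):
        return sum(self.word_score(w) for w in text.lower().split())

scorer = VoteScorer()
scorer.vote("polar bear weight", positive=True)
scorer.vote("who cares", positive=False)
print(scorer.tweet_score("polar bear"), scorer.tweet_score("who cares"))
```

A tweet would trigger a notification when `tweet_score` is greater than 0. Note that a word making up all the votes gets log(1) = 0, so with very few votes the scores can collapse to zero.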
One possible solution would be to train a classifier such as Naive Bayes on the tweets that a user has voted on. You can take a look at the documentation of scikit-learn, a Python library, which explains how you can easily preprocess your text and train such a classifier.
I would look at Naive Bayes; however, I would also look at the K-Nearest Neighbours algorithm when performing a simple classification. It is included in the scikit-learn library and is well documented.
RE: "running SKLearn on GAE is not possible" - you will either need to use the Google Predict API, or, run a VPS which would serve as a worker to process your classification tasks; this would obviously have to live on a different system though.
I would say though, if you are only hoping to perform simple classification on a suitably small dataset, you could actually implement a classifier in JavaScript, like
`http://jsfiddle.net/bkanber/hevFK/light/`
With a JS implementation, the processing time will become unacceptably slow if the dataset is too large, but it's nice to have as an option, even preferable in many cases.
Ultimately, GAE is not the platform I would use when building anything which may require all but the most basic of ML techniques. I would look at Heroku or a VPS from a provider such as DigitalOcean or AWS.
I collect a bunch of questions from Twitter's stream by using a regular expression to pick out any tweet that contains a text that starts with a question type: who, what, when, where etc and ends with a question mark.
As such, I end up getting several non-useful questions in my database like: 'who cares?', 'what's this?' etc and some useful ones like: 'How often is there a basketball fight?', 'How much does a polar bear weigh?' etc
However, I am only interested in useful questions.
I have about 3000 questions that I have labelled manually: ~2000 of them are not useful and ~1000 of them are useful. I am attempting to use a naive Bayesian classifier (the one that comes with NLTK) to classify questions automatically so that I don't have to manually pick out the useful questions.
As a start, I tried using the first three words of a question as a feature, but this doesn't help very much. Out of 100 questions, the classifier was correct for only around 10%-15% of the useful questions. It also failed to pick out the useful questions from the ones that it predicted as not useful.
I have tried other features such as: including all the words, including the length of the questions but the results did not change significantly.
Any suggestions on how I should choose the features or carry on?
Thanks.
Some random suggestions.
Add a pre-processing step and remove stop-words like this, a, of, and, etc.
How often is there a basketball fight
First you remove some stop words, you get
how often basketball fight
Calculate a tf-idf score for each word. (Treat each tweet as a document; to calculate the score you need the whole corpus in order to get the document frequency.)
For a sentence like above, you calculate tf-idf score for each word:
tf-idf(how)
tf-idf(often)
tf-idf(basketball)
tf-idf(fight)
This might be useful.
Try the following additional features for your classifier:
average tf-idf score
median tf-idf score
max tf-idf score
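A minimal sketch of computing these three features without any library, assuming each tweet is already tokenised with stop words removed. The smoothed idf formula and the function name are my choices; other tf-idf variants would work just as well.

```python
import math
from statistics import mean, median

def tfidf_features(tokens, corpus):
    """Return avg/median/max tf-idf over the distinct words of one tweet.
    corpus is a list of token lists, one per tweet (document)."""
    if not tokens:
        return {"avg": 0.0, "median": 0.0, "max": 0.0}
    n_docs = len(corpus)
    doc_sets = [set(doc) for doc in corpus]
    scores = []
    for word in set(tokens):
        tf = tokens.count(word) / len(tokens)
        df = sum(1 for doc in doc_sets if word in doc)
        # Assumption: smoothed idf so unseen words do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores.append(tf * idf)
    return {"avg": mean(scores), "median": median(scores), "max": max(scores)}

corpus = [
    ["how", "often", "basketball", "fight"],
    ["who", "cares"],
    ["how", "much", "polar", "bear", "weigh"],
]
print(tfidf_features(corpus[0], corpus))
```

The returned dict can be merged straight into the feature dict you pass to the NLTK classifier.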
Furthermore, try a pos-tagger and generate a categorized sentence for each tweet.
>>> import nltk
>>> text = nltk.word_tokenize(" How often is there a basketball fight")
>>> nltk.pos_tag(text)
[('How', 'WRB'), ('often', 'RB'), ('is', 'VBZ'), ('there', 'EX'), ('a', 'DT'), ('basketball', 'NN'), ('fight', 'NN')]
That gives you additional features to try that are related to the POS tags.
Some other features that might be useful are listed below; see the qtweet paper (a paper on tweet question identification) for details.
whether the tweet contains any url
whether the tweet contains any email or phone number
whether a marker of strong feeling, such as !, follows the question
whether particular unigram words are present in the context of the tweet
whether the tweet mentions other user's name
whether the tweet is a retweet
whether the tweet contains any hashtag #
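Several of these surface features can be extracted with regular expressions. The sketch below is only an approximation: the patterns are simplified guesses of mine rather than the ones used in the paper, the retweet check assumes the old "RT " prefix convention, and phone numbers and unigram context are left out.

```python
import re

def surface_features(tweet):
    """Boolean surface features of a tweet, as a dict usable as an NLTK featureset."""
    return {
        "has_url": bool(re.search(r"https?://\S+", tweet)),
        "has_email": bool(re.search(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", tweet)),
        # '@' not preceded by a word character, so emails are not counted.
        "has_mention": bool(re.search(r"(?<!\w)@\w+", tweet)),
        "is_retweet": tweet.startswith("RT "),
        "has_hashtag": bool(re.search(r"(?<!\w)#\w+", tweet)),
        "exclaims_after_question": bool(re.search(r"\?+\s*!", tweet)),
    }

print(surface_features("RT @zoo: How much does a polar bear weigh?! #animals"))
```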
FYI, the author of qtweet attempted 4 different classifiers, namely, Random Forest, SVM, J48 and Logistic regression. Random forest performed best among them.
Hope they help.
A feature that is most likely very powerful, and that you could try to build (not sure if it's possible), is whether there is a reply to the tweet in question.