Choosing Features to identify Twitter Questions as "Useful" - machine-learning

I collect a bunch of questions from Twitter's stream by using a regular expression to pick out any tweet that contains a text that starts with a question type: who, what, when, where etc and ends with a question mark.
As such, I end up getting several non-useful questions in my database like: 'who cares?', 'what's this?' etc and some useful ones like: 'How often is there a basketball fight?', 'How much does a polar bear weigh?' etc
However, I am only interested in useful questions.
I have about 3000 questions that I have labeled manually: ~2000 of them are not useful and ~1000 of them are useful. I am attempting to use a naive Bayes classifier (the one that comes with NLTK) to classify questions automatically, so that I don't have to pick out the useful questions by hand.
As a start, I tried using the first three words of a question as a feature, but this doesn't help very much. In a test on 100 questions, the classifier got only around 10%-15% of the useful questions right. It also missed the useful questions among the ones that it predicted as not useful.
I have tried other features, such as including all the words or the length of the question, but the results did not change significantly.
Any suggestions on how I should choose the features or carry on?
Thanks.

Some random suggestions.
Add a pre-processing step that removes stop words such as this, a, of, and, etc. For example, take:
How often is there a basketball fight
After removing the stop words, you get:
how often basketball fight
Calculate a tf-idf score for each word, treating each tweet as a document. (You need the whole corpus to get the document frequencies.)
For a sentence like the one above, you calculate the tf-idf score of each remaining word:
tf-idf(how)
tf-idf(often)
tf-idf(basketball)
tf-idf(fight)
This might be useful.
Try the following additional features for your classifier:
average tf-idf score
median tf-idf score
max tf-idf score
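The tf-idf features above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the stop-word list is a tiny made-up sample, the function names are hypothetical, and the smoothed IDF formula log((1 + N) / (1 + df)) is one common choice among several.

```python
import math
import re
import statistics
from collections import Counter

# Tiny illustrative stop-word list (an assumption; use a fuller list in practice).
STOP_WORDS = {"a", "an", "the", "is", "are", "there", "this", "of", "and", "to", "does"}

def tokenize(text):
    """Lowercase, split into word tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def document_frequencies(corpus):
    """Count, for each word, how many tweets contain it."""
    df = Counter()
    for tweet in corpus:
        df.update(set(tokenize(tweet)))
    return df

def tfidf_features(tweet, df, n_docs):
    """Return average, median and max tf-idf over the words of one tweet."""
    tokens = tokenize(tweet)
    if not tokens:
        return {"avg_tfidf": 0.0, "median_tfidf": 0.0, "max_tfidf": 0.0}
    tf = Counter(tokens)
    scores = [
        (count / len(tokens)) * math.log((1 + n_docs) / (1 + df[word]))
        for word, count in tf.items()
    ]
    return {
        "avg_tfidf": statistics.mean(scores),
        "median_tfidf": statistics.median(scores),
        "max_tfidf": max(scores),
    }

corpus = [
    "who cares?",
    "what's this?",
    "How often is there a basketball fight?",
    "How much does a polar bear weigh?",
]
df = document_frequencies(corpus)
features = tfidf_features(corpus[2], df, len(corpus))
```

The resulting dictionary can be merged into the feature dictionary passed to the classifier. Naive Bayes in NLTK treats feature values as discrete, so the scores may need to be binned first.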
Furthermore, try a POS tagger to generate a tagged sentence for each tweet.
>>> import nltk
>>> text = nltk.word_tokenize(" How often is there a basketball fight")
>>> nltk.pos_tag(text)
[('How', 'WRB'), ('often', 'RB'), ('is', 'VBZ'), ('there', 'EX'), ('a', 'DT'), ('basketball', 'NN'), ('fight', 'NN')]
Then you have additional features to try that are related to the POS tags.
Some other features that might be useful (see the qtweet paper, which is about tweet question identification, for details):
whether the tweet contains any url
whether the tweet contains any email or phone number
whether any marker of strong feeling, such as !, follows the question
whether certain unigram words are present in the context of the tweet
whether the tweet mentions another user's name
whether the tweet is a retweet
whether the tweet contains any hashtag #
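Several of the surface features in this list can be approximated with regular expressions. The sketch below is only a rough illustration: the patterns are simplified assumptions (real URLs, emails and phone numbers are messier), the "RT" prefix is a text convention rather than a reliable retweet flag, and the function name is hypothetical. It returns a dictionary because NLTK classifiers take feature dictionaries as input.

```python
import re

# Simplified patterns (assumptions, not exhaustive).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
MENTION_RE = re.compile(r"(?<!\w)@\w+")
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
RETWEET_RE = re.compile(r"^RT\b")
EXCLAIM_AFTER_QUESTION_RE = re.compile(r"\?\s*!")

def surface_features(tweet):
    """Map a tweet's text to a dictionary of boolean features."""
    return {
        "has_url": bool(URL_RE.search(tweet)),
        "has_email": bool(EMAIL_RE.search(tweet)),
        "has_phone": bool(PHONE_RE.search(tweet)),
        "has_mention": bool(MENTION_RE.search(tweet)),
        "has_hashtag": bool(HASHTAG_RE.search(tweet)),
        "is_retweet": bool(RETWEET_RE.search(tweet)),
        "exclamation_after_question": bool(EXCLAIM_AFTER_QUESTION_RE.search(tweet)),
    }

example = surface_features(
    "RT @nasa: How much does a polar bear weigh?! http://example.com #bears"
)
```

If the tweets come from the API rather than raw text, the structured metadata (URLs, mentions, retweet status) is likely a more reliable source than these patterns.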
FYI, the author of qtweet attempted 4 different classifiers, namely, Random Forest, SVM, J48 and Logistic regression. Random forest performed best among them.
Hope they help.

A potentially very powerful feature you could try to build (not sure if it's possible) is whether there is a reply to the tweet in question.

Related

Computing a similarity score for a set of sentences

My team does a lot of chatbot training, and I'm trying to come up with some tools to improve the quality of our work. In chatbot training, it is really important to train intents with diverse utterances that phrase the same intent in very different ways. Ideally, there would be very little similarity in the syntax of the utterances in the set.
Here's an example for an intent inquiring about medical insurance coverage
Bad set of utterances
Is my daughter covered by insurance?
Is my son covered by medical insurance?
Will my son be covered by insurance?
Decent set of utterances
How can I look up whether we have insurance coverage for the whole family?
Seeking details on eligibility for medical coverage
Is there a document that details who is protected under our medical insurance policy?
I want to be able to take all of the utterances associated to an intent and analyze them for similarity. I would expect my set of bad utterances to have a high similarity score and my set of decent utterances to have a low similarity score.
I've tried playing around with a few doc2vec tutorials, but I feel like I'm missing something. I keep seeing stuff like this:
Train a set of data and then measure the similarity of a new sentence to your set of data
Measure the similarity between two sentences
I need to have an array of sentences and understand how similar they are to each other.
Any advice on achieving this?
Answering some questions:
What makes the bad utterances bad? The utterances themselves are not bad, it is the lack of variety between them. If most of the training had been like the “bad” set, then real user utterances of greater variety will not be recognized correctly.
Are you trying to discover new intents? No, this is for prerelease training, trying to improve the effectiveness of it.
Why do bad utterances have high similarity scores and decent utterances have low similarity scores? This is a hypothesis. I know how varied real user utterances are, and I have found my trainers fall into ruts when training, asking things the same way, and not seeing good accuracy results. Improving the variety in the utterances tends to result in better accuracy.
What will I do with this info? I’ll use it to assess the training quality of an intent, to determine if more training is likely necessary. In the future we might build real time tools as utterances are being added to let trainers know if they’re being too repetitive.
Most applications of text vectors benefit from the vectors capturing the "essential meaning" of a text, *without* regard to variances in word choice.
That is, it's considered a feature, not a flaw, if two completely different wordings with similar meaning have nearly the same vector. (Or, if some similarity-measure indicates they are totally similar.)
For example, to contrive an example similar to yours, consider the two phrasings:
"health coverage for brother"
"male sibling medical insurance"
There's no reuse of words, but the likely intended meaning is the same – so a good text-vectorization for typical purposes would create very similar vectors. And a similarity-measure using those vectors, or otherwise using the words/word-vectors as input, would indicate very high similarity.
But from your clarifying answers, it seems you actually want a more superficial "similarity" measure. You'd like a measure that reveals when certain phrasings show variety/contrast in their wording. (And specifically, you already know from other factors, like how they were hand-crafted, that groups of these phrasings are semantically related.)
What you want this similarity measure to show is actually a behavior that many projects using text-vectors would consider a failure of the vectors. So semantic methods like those in Word2Vec, Paragraph Vectors (aka "Doc2Vec"), etc are likely the wrong tool for your goal.
You could probably do well with a simpler measure based just on the words, or perhaps character-n-grams, of the texts.
For example, for two texts A and B, you could just tally the number of shared words (that appear in both A and B), and divide by the total number of unique words in both A and B, to get a 0.0 to 1.0 "word choice similarity" number.
And, when considering a new text against a set of prior texts, if its average similarity to the prior texts is low, it'd be "good" for your purposes.
Rather than just words, you could also use all n-character substrings ("n-grams") of your texts – which might help better highlight differences in word-forms, or common typos, which may also be useful variances for your purposes.
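The shared-words ratio described above is the Jaccard similarity of the two word sets. Here is a minimal sketch of it, with a character n-gram variant and an average over all pairs in a set of utterances. The function names and the tokenization regex are assumptions made for illustration.

```python
import re
from itertools import combinations

def words(text):
    """Set of lowercase word tokens in a text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def char_ngrams(text, n=3):
    """Set of all n-character substrings of the whitespace-normalized text."""
    s = re.sub(r"\s+", " ", text.lower().strip())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Shared items divided by total unique items, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def average_pairwise_similarity(texts, featurize=words):
    """Mean Jaccard similarity over every pair of texts in the set."""
    sets = [featurize(t) for t in texts]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

bad = [
    "Is my daughter covered by insurance?",
    "Is my son covered by medical insurance?",
    "Will my son be covered by insurance?",
]
decent = [
    "How can I look up whether we have insurance coverage for the whole family?",
    "Seeking details on eligibility for medical coverage",
    "Is there a document that details who is protected under our medical insurance policy?",
]
bad_score = average_pairwise_similarity(bad)
decent_score = average_pairwise_similarity(decent)
```

On the example utterances from the question, the "bad" set scores noticeably higher than the "decent" set, which is the behavior asked for. Passing `featurize=char_ngrams` switches to the substring-based variant.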
In general, I'd look at the scikit-learn text-vectorization functionality for ideas:
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

Retrieving the top 5 sentences- Algorithm if any present

I am new to Data Science. This could be a dumb question, but just want to know opinions and confirm if I could enhance it well.
I have a question about getting the 5 most frequent sentences from a database. I know I could gather all the sentences into a list and use Counter to fetch the 5 most frequent ones, but I would like to know whether any algorithm (ML/DL/NLP) exists for such a requirement. All the sentences are given by the user. I need to know the user's top 5 (most frequent) sentences (not phrases, please)!
Examples of sentences -
"Welcome to the world of Geeks"
"This portal has been created to provide well written subject"
"If you like Geeks for Geeks and would like to contribute"
"to contribute at geeksforgeeks org See your article appearing on "
"to contribute at geeksforgeeks org See your article appearing on " (occurring for the second time)
"the Geeks for Geeks main page and help thousands of other Geeks."
Note: All my sentences in my database are distinct (contextual wise and no duplicates too). This is just an example for my requirement.
Thanks in Advance.
I'd suggest you start with sentence embeddings. Briefly, a sentence embedding model returns a vector for a given sentence that roughly represents the meaning of the sentence.
Let's say you have n sentences in your database and you found the sentence embeddings for each sentence so now you have n vectors.
Once you have the vectors, you can use dimensionality reduction techniques such as t-sne to visualize your sentences in 2 or 3 dimensions. In this visualization, sentences that have similar meanings should ideally be close to each other. That may help you pinpoint the most-frequent sentences that are also close in meaning.
I think one problem is that it's still hard to draw boundaries to the meanings of sentences since meaning is intrinsically subjective. You may have to add some heuristics to the process I described above.
Adding to MGoksu's answer: once you get sentence embeddings, you can apply LSH (Locality Sensitive Hashing) to group the embeddings into clusters.
Once you have the clusters of embeddings, it is trivial to find the clusters with the highest number of vectors.
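One common LSH scheme for vectors is random-hyperplane hashing, sketched below in plain Python. This is an illustration under assumptions: the three-dimensional "embeddings" are toy placeholders rather than the output of a real sentence-embedding model, and all function names are hypothetical.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, n_planes, seed=0):
    """Draw random hyperplanes (as normal vectors) with a fixed seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )

def bucket_sentences(embeddings, hyperplanes):
    """Group sentence ids whose vectors share the same signature."""
    buckets = defaultdict(list)
    for sentence_id, vector in embeddings.items():
        buckets[signature(vector, hyperplanes)].append(sentence_id)
    return buckets

def largest_buckets(buckets, k=5):
    """Return the k buckets containing the most sentences."""
    return sorted(buckets.values(), key=len, reverse=True)[:k]

# Toy vectors standing in for real sentence embeddings (assumption).
embeddings = {"a": [1.0, 0.0, 0.0], "b": [2.0, 0.0, 0.0], "c": [-1.0, 0.0, 0.0]}
planes = make_hyperplanes(dim=3, n_planes=8)
buckets = bucket_sentences(embeddings, planes)
top = largest_buckets(buckets, k=1)
```

Vectors pointing in similar directions tend to land in the same bucket. Fewer hyperplanes give coarser, larger buckets, and more hyperplanes give finer ones, so the number of planes is a tuning knob.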

what methods are there to classify documents?

I am trying to do document classification. But I am really confused between feature selections and tf-idf. Are they the same or two different ways of doing classification?
Hope somebody can tell me? I am not really sure that my question will make sense to you guys.
Yes, you are confusing a lot of things.
Feature selection is the abstract term for choosing features (0 or 1). Stopword removal can be seen as feature selection.
TF is one method of extracting features from text: counting words.
IDF is one method of assigning weights to features.
Neither of them is classification... they are popular for text classification, but they are even more popular for information retrieval, which is not classification...
However, many classifiers work on numeric data, so the common process is:
1. Extract features (e.g. TF).
2. Select features (e.g. remove stopwords).
3. Weight features (e.g. IDF).
4. Train a classifier on the resulting numerical vectors.
5. Predict the classes of new/unlabeled documents.
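The five steps can be sketched end to end in plain Python. This is a toy illustration, not a recommended classifier: the training texts, the stop-word list and the function names are made up, and a simple nearest-centroid rule with cosine similarity stands in for whatever classifier you would actually train.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption).
STOPWORDS = {"a", "an", "the", "is", "and", "with", "that", "from", "into",
             "which", "has", "how", "does"}

def extract(text):
    """Step 1: extract features by counting words (TF)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def select(tf):
    """Step 2: select features by dropping stop words."""
    return {w: c for w, c in tf.items() if w not in STOPWORDS}

def fit_idf(docs):
    """Compute smoothed IDF weights from the training documents."""
    df = Counter()
    for tf in docs:
        df.update(tf.keys())
    n = len(docs)
    return {w: math.log((1 + n) / (1 + d)) + 1.0 for w, d in df.items()}

def weight(tf, idf):
    """Step 3: weight features (TF * IDF); unseen words are dropped."""
    return {w: c * idf[w] for w, c in tf.items() if w in idf}

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train(labeled_texts):
    """Step 4: build one summed vector (centroid direction) per class."""
    selected = [select(extract(text)) for text, _ in labeled_texts]
    idf = fit_idf(selected)
    centroids = {}
    for tf, (_, label) in zip(selected, labeled_texts):
        centroid = centroids.setdefault(label, {})
        for w, x in weight(tf, idf).items():
            centroid[w] = centroid.get(w, 0.0) + x
    return idf, centroids

def predict(text, idf, centroids):
    """Step 5: assign the class whose centroid is most similar."""
    vec = weight(select(extract(text)), idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

training = [
    ("the sparrow is a small bird with brown feathers", "birds"),
    ("an eagle is a large bird that hunts from the sky", "birds"),
    ("the compiler translates source code into machine code", "computing"),
    ("a database stores records and answers queries", "computing"),
]
idf, centroids = train(training)
label = predict("Which bird has blue feathers?", idf, centroids)
```

Note that TF and IDF only appear in steps 1 and 3; the classification itself happens in steps 4 and 5, which is why tf-idf on its own is not a classifier.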
Taking a look at this explanation may help a lot when it comes to understanding text classifiers.
TF-IDF is a good way to find a document that answers a given query, but it does not by itself assign classes to documents.
Examples that may be helpful:
1) You have a bunch of documents with subjects ranging from politics, economics, computer science and the arts. The documents belonging to each subject are separated into the appropriate directories for each subject (you have a labeled dataset). Now, you received a new document whose subject you do not know. In which directory should it be stored? A classifier can answer this question from the documents that are already labeled.
2) Now, you received a query regarding computer science. For instance, you received the query "Good methods for finding textual similarity". Which document in the directory of computer science can provide the best response to that query? TF-IDF would be a good approach to figure that out.
So, when you are classifying documents, you are trying to make a decision about whether a document is a member of a particular class (like, say, 'about birds' or 'not about birds').
Classifiers predict the value of the class given a set of features. A good set of features will be highly discriminative - they will tell you a lot about whether the document is of one class or another.
Tf-idf (term frequency inverse document frequency) is a particular feature that seems to be discriminative for document classification tasks. There are others, like word counts (tf or term frequency) or whether a regexp matches the text or what have you.
Feature selection is the task of selecting good (discriminative) features. Tfidf is probably a good feature to select.

What is a good way to classify or cluster free form text entries?

I have a set of ratings entered by users for N-items, along with reasons as to why they select that rating for that item. The ratings are in an ordinal scale (-2, -1, 0, +1, +2).
I would like to come up with meaningful groupings of these reasons. For example, say users are rating movies and the reasons behind the ratings might fall under 3 broad groups: 1) 'They are a huge fan of the actor', 2) 'Amazing story line', 3) 'Lacks originality'. This is just a dummy example.
More concretely, given a set of free form textual entries, can one come up with such groupings? I know that topic modeling is one way of doing this. I can specify the number of topics K and then feed data into my topic model (LDA etc.); the model will output K topics, where each topic is a list of the most probable words in that topic. So with respect to this dummy example, topic 1 may contain words and phrases like 'fan', 'actor', 'great acting'.
Are there other ways to do this clustering? Do I need to consider the ordinal rating scale while clustering? How can I take that into account?
Word embeddings might be useful. Here is a recent, relevant Stanford project.
It depends on how sophisticated you wish the handling of the text to be. If just matching single words (1-grams) were sufficient then:
remove stop words
possibly do stemming or other text preprocessing
apply a naive Bayes classification algorithm. Options are here: http://en.wikipedia.org/wiki/Naive_Bayes_classifier
However, you may also wish to do a better job with phrases and related words. In that case there is plenty of research, and there are implementations, to help you. N-grams are a relatively simple approach, but more advanced methods that understand the semantics of the language have better statistical performance.
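As a small illustration of the n-gram idea, the sketch below turns a text into unigram and bigram features so that a phrase like "great acting" becomes a feature of its own. The function names and the whitespace tokenization are simplifying assumptions.

```python
def word_ngrams(tokens, n):
    """All runs of n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(text, max_n=2):
    """Presence features for every 1-gram up to max_n-gram in the text."""
    tokens = text.lower().split()
    features = {}
    for n in range(1, max_n + 1):
        for gram in word_ngrams(tokens, n):
            features[" ".join(gram)] = True
    return features

bigrams = word_ngrams(["great", "acting", "overall"], 2)
features = ngram_features("Great acting overall")
```

The feature dictionary has the same shape as single-word features, so it can be fed to the same classifier or clustering step.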

Classified dataset for emotion recognition

I am working on a research/educational task and need a dataset of labeled facial emotions to train a classifier. For example, gender classification is simple: I can create a CSV file and mark each image file as 0 or 1 according to gender. Something like this:
.../../male.jpg:1
.../../female.jpg:0
...
...
So, I need something similar, but for facial emotion classification. I found an image dataset with keypoints, so I could cluster the images by emotion, but the accuracy would be better if the images had been labeled manually beforehand. Maybe somebody has direct sources or links to data like this. Thanks.
It's tricky because emotions are not uniquely characterized, even by humans. But there are academics who have gone through the trouble to prepare the supervised data that you want, i.e. you could contact the authors below and ask about their data set:
"We introduce two large databases consisting of 750 000 and 1.2 million thumbnail-sized images, labeled with emotion-related keywords." Solli and Lenz, Linkoping University, Norrkoping, Sweden.
Twitter is often a good place to start with sentiment analysis, because its advanced search provides the possibility to filter for positive and negative tweets.
You can have a look here : https://twitter.com/search-advanced
If you want to go that way, you would need to write some code to use the twitter API as documented here :
https://dev.twitter.com/docs/using-search
You can play with the API if you want here :
https://dev.twitter.com/console
Choose "OAuth 1" for Authentication
Set GET as method
Get positive tweets at the following URL: https://api.twitter.com/1.1/search/tweets.json?q=%3A)
Get negative tweets: https://api.twitter.com/1.1/search/tweets.json?q=%3A(
The results are returned as JSON.
This is usually enough to get off to a good start!
You will simply associate each tweet to the corresponding sentiment.
If you want a more "atomic" dataset, you can calculate a score for each word given how often it appears in the positive and negative classes, and normalize with a tf-idf approach.
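A per-word score of this kind can be sketched as follows. Note that the sketch uses a smoothed log-odds ratio between the two classes rather than a tf-idf normalization, as a simpler stand-in. The example tweets and function names are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; emoticons and punctuation are dropped."""
    return re.findall(r"[a-z']+", text.lower())

def word_scores(positive_tweets, negative_tweets):
    """Score > 0 leans positive, < 0 leans negative (add-one smoothed log-odds)."""
    pos = Counter(w for t in positive_tweets for w in tokenize(t))
    neg = Counter(w for t in negative_tweets for w in tokenize(t))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }

# Made-up tweets standing in for the ":)" and ":(" search results (assumption).
positive = ["I love this sunny day :)", "love my new phone :)"]
negative = ["I hate this rain :(", "hate waiting in line :("]
scores = word_scores(positive, negative)
```

Words that appear about equally often in both classes, such as "this" here, end up with scores near zero, which plays a role similar to the down-weighting that IDF gives to common words.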
Please note that if you want to build a more advanced classifier, you will need to handle "neutral" emotions as well, and this is not provided by twitter.

Resources