I am working on a research/educational task and need a dataset of facial images classified by emotion to train a classifier. For example, gender classification is simple: I can create a CSV file and mark each image file as 0 or 1 according to gender. Something like this:
.../../male.jpg:1
.../../female.jpg:0
...
...
So, I need something similar, but for facial emotion classification. I found an image dataset with keypoints, so I could cluster the images by emotion, but accuracy would be better if they were labeled manually beforehand. Maybe somebody has direct sources, or links to information like this. Thanks.
It's tricky because emotions are not uniquely characterized, even by humans. But there are academics who have gone to the trouble of preparing the supervised data that you want, i.e. you could contact the authors below and ask about their data set:
"We introduce two large databases consisting of 750 000 and 1.2 million thumbnail-sized images, labeled with emotion-related keywords." Solli and Lenz, Linkoping University, Norrkoping, Sweden.
Twitter is often a good place to start with sentiment analysis, because its advanced search provides the possibility to filter for positive and negative tweets.
You can have a look here : https://twitter.com/search-advanced
If you want to go that way, you would need to write some code to use the Twitter API, as documented here:
https://dev.twitter.com/docs/using-search
You can play with the API here:
https://dev.twitter.com/console
Choose "OAuth 1" for Authentication
Set GET as method
Get positive tweets at the following URL: https://api.twitter.com/1.1/search/tweets.json?q=%3A)
Get negative tweets at: https://api.twitter.com/1.1/search/tweets.json?q=%3A(
The results are returned as json.
That is usually enough to get you started!
You simply associate each tweet with the corresponding sentiment.
If you want a more "atomic" dataset, you can calculate a score for each word given how often it appears in the positive and negative classes, and normalize with a tf-idf approach.
Please note that if you want to build a more advanced classifier, you will also need to handle "neutral" emotions, and Twitter does not provide those.
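Putting the steps above together, here is a rough sketch of fetching and labeling the tweets, assuming Python with the requests and requests_oauthlib packages (the credentials are placeholders you would replace with your own app's keys):

# Rough sketch: pull ":)" and ":(" tweets from the v1.1 search endpoint shown
# above and label them 1/0; requests_oauthlib supplies the "OAuth 1" authentication.
import requests
from requests_oauthlib import OAuth1

auth = OAuth1("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")  # placeholders
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

dataset = []
for query, sentiment in [(":)", 1), (":(", 0)]:
    resp = requests.get(SEARCH_URL, params={"q": query, "count": 100}, auth=auth)
    for status in resp.json().get("statuses", []):
        dataset.append((status["text"], sentiment))  # tweet text paired with its label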
We are currently working on integrating ICD10-CM for our medical company, to be used for patient diagnosis. ICD10-CM is a coding system for diagnoses.
I tried to import the ICD10-CM data as description-code pairs, but it didn't work since AutoML needs more text per code (label). I found a dataset on Kaggle, but it only contained hrefs to an ICD10 website. I did find that the website contains multiple texts and descriptions associated with each code that could be used to train our desired model.
Kaggle Dataset:
https://www.kaggle.com/shamssam/icd10datacom
Sample of a page from ICD10data.com:
https://www.icd10data.com/ICD10CM/Codes/A00-B99/A15-A19/A17-/A17.0
Most notable fields are:
- Approximate Synonyms
- Clinical Information
- Diagnosis Index
If I made a dataset from the sentences found on these pages and assigned them to their codes (labels), would that be enough for AutoML training? Each label would then have 2 or more texts instead of just one, but still far fewer than the 100 per code used in the demos/tutorials.
From what I can see here, the disease codes have a tree-like structure where, for instance, all L00-L99 codes refer to "Diseases of the skin and subcutaneous tissue", while the L00-L08 codes within that range refer to "Infections of the skin and subcutaneous tissue", and so on.
What I mean is that the problem is not 90,000 examples for 90,000 independent labels, but a decision tree: you take several decisions, each conditioned on the previous one. The first step would be choosing which of the roughly 15 most general categories fits best, then choosing among its subcategories, and so on.
In this sense, AutoML is probably not the best product, given that you cannot implement a specially designed decision-tree-style model that takes all of this into account.
Another way of using AutoML would be to train a separate model for each decision and then combine the different models. This would easily work for the first layer of decisions but becomes exponentially time-consuming: the number of models you need to train grows exponentially with the level of specificity (by specific I mean affirming a code is in L00-L08 instead of just L00-L99).
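A minimal sketch of that per-level idea outside of AutoML, assuming scikit-learn; the texts, chapter labels and block labels below are made-up placeholders:

# Sketch of training one model per level of the hierarchy: a first classifier
# picks the chapter (e.g. "L00-L99"), then a chapter-specific classifier picks
# the block (e.g. "L00-L08"). All texts and labels are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts    = ["impetigo infection of the skin", "bullous disorder of the skin",
            "tuberculosis of lung", "typhoid fever"]          # placeholder descriptions
chapters = ["L00-L99", "L00-L99", "A00-B99", "A00-B99"]       # level-1 labels
blocks   = ["L00-L08", "L10-L14", "A15-A19", "A00-A09"]       # level-2 labels

# level 1: choose among the most general categories
chapter_clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, chapters)

# level 2: one classifier per chapter, trained only on that chapter's examples
block_clf = {}
for ch in set(chapters):
    idx = [i for i, c in enumerate(chapters) if c == ch]
    block_clf[ch] = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(
        [texts[i] for i in idx], [blocks[i] for i in idx])

def predict(text):
    ch = chapter_clf.predict([text])[0]          # first decision
    return ch, block_clf[ch].predict([text])[0]  # second decision, conditioned on the first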
I hope this helps you understand better the problem and the different approaches you can give to it!
I need to perform text classification on a set of emails. But the words in my text are very sparse, i.e. the frequency of each word across all documents is very low; words are not repeated very often. Because of this, I think a document-term matrix with raw frequency weighting is not suitable for training the classifiers. Can you please suggest what other kinds of methods I should use?
Thanks
The real problem is that if your words are that sparse, a learned classifier will not generalise to real-world data. However, there are several solutions to it:
1.) Use more data. This is kind of a no-brainer. However, not only can you add labelled data, you can also use unlabelled data in a semi-supervised setting.
2.) Use more data (part b). You can look into the transfer learning setting: you build a classifier on a large data set with similar characteristics (this might be Twitter streams, for example) and then adapt that classifier to your domain.
3.) Get your processing pipeline right. Your problem might originate from a suboptimal processing pipeline. Are you doing stemming? In an email, the words stemming and stemmed should both be mapped onto stem. This can be pushed even further by using synonym matching with a dictionary; see the sketch below.
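A minimal sketch of point 3, assuming NLTK for stemming and scikit-learn for a sub-linear tf-idf representation (the email texts and labels are placeholders):

# Sketch: stem tokens so that "stemming", "stemmed" and "stems" collapse onto one
# feature, then build a sub-linear tf-idf matrix, which dampens raw counts and helps
# when word frequencies are low. May require nltk.download('punkt') once.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()

def stem_tokenizer(text):
    return [stemmer.stem(tok) for tok in word_tokenize(text.lower())]

emails = ["stemming maps stemmed words onto their stems",
          "please wire the payment today"]        # placeholder email texts
labels = [0, 1]                                   # placeholder labels

vectorizer = TfidfVectorizer(tokenizer=stem_tokenizer, sublinear_tf=True)
X = vectorizer.fit_transform(emails)
clf = LinearSVC().fit(X, labels)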
In the normal case, I had earlier tried out naive Bayes and a linear SVM to classify comments on a certain page, where I had access to training data manually labelled as spam or ham.
Now I am being asked to check whether there are any ways to classify comments as spam when we don't have training data. Something like obtaining two clusters from the data, which would then be marked as spam or ham.
I need to know some ways to approach this problem and what would be a good way to implement it.
I am still learning and experimenting. Any help will be appreciated.
Are the new comments very different from the old comments in terms of vocabulary? Because words are almost everything the classifiers for this task look at.
You can always try using your old training data and applying the classifier to the new domain. You would have to label a few examples from your new domain in order to measure performance (or better, let others do the labeling in order to get more reliable results).
If this doesn't work well, you could try domain adaptation or look for datasets more similar to your new domain, using Google or looking at existing spam/ham corpora.
Finally, there may be some regularity or pattern in your new setting, e.g. downvotes for a comment, which may indicate spam/ham. In such cases, you could compile training data yourself. This would then be called distant supervision (you can search for papers using this keyword).
The best I could get to was this research work, which mentions active learning. What I came up with is this: I first performed k-means clustering and took the central clusters (assuming 5 clusters, I took the 3 largest, ordered by size descending) and sampled 1000 messages from each. I then had the user label them. Next, I trained logistic regression on the labelled data and computed probabilities for the unlabelled data; whenever a probability was close to 0.5 (in the range 0.4 to 0.6), meaning the model was uncertain, I sent that message to be labelled as well, and the process continued.
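A rough sketch of that loop, assuming scikit-learn; the comments and the user-provided labels below are placeholders:

# Sketch of the clustering + uncertainty-sampling loop described above.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

comments = ["free money click here"] * 10 + ["thanks for the great post"] * 10  # placeholder
X = TfidfVectorizer().fit_transform(comments)

# 1) cluster, keep the largest clusters, and draw messages from them for manual labelling
km = KMeans(n_clusters=5, random_state=0, n_init=10).fit(X)
largest = np.argsort(np.bincount(km.labels_, minlength=5))[::-1][:3]
seed_idx = np.where(np.isin(km.labels_, largest))[0][:1000]

# 2) the user labels the seed messages (alternating placeholder labels here), then train
y_seed = np.array([i % 2 for i in range(len(seed_idx))])  # replace with real spam/ham labels
clf = LogisticRegression().fit(X[seed_idx], y_seed)

# 3) route uncertain messages (probability between 0.4 and 0.6) back to the user for labelling
proba = clf.predict_proba(X)[:, 1]
to_label = np.where((proba > 0.4) & (proba < 0.6))[0]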
Suppose that for a given ML problem, we have a feature describing which car the person possesses. We can encode this information in one of the following ways:
Assign an id to each car. Make a column 'CAR_POSSESSED' and put the car's id as the value.
Make a column for each car and put 0 or 1 according to whether that car is possessed by the sample in question. Columns would be like "BMW_POSSESSED", "AUDI_POSSESSED".
In my experiments, the 2nd way performed much better than the 1st when tried with an SVM.
How does the choice of encoding affect model learning, and are there resources in which the effect of encoding has been studied? Or do we need to use trial and error to check which performs best?
The problem with the first way is that you use arbitrary numbers to represent the categories (e.g. BMW=2, etc.), and the SVM takes those numbers seriously, as if they had an order: e.g. it may try to use cases with CAR_POSSESSED > 3 for the prediction.
So the second way is better.
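A small sketch of the two encodings, assuming pandas (the car values are made up):

# First encoding: a single column of arbitrary integer ids; second encoding: one 0/1 column per car.
import pandas as pd

df = pd.DataFrame({"CAR": ["BMW", "AUDI", "BMW", "FORD"]})   # made-up samples

# 1st way: one CAR_POSSESSED column of ids -- imposes a fake ordering on the cars
df["CAR_POSSESSED"] = df["CAR"].astype("category").cat.codes

# 2nd way: one-hot columns like BMW_POSSESSED, AUDI_POSSESSED -- no spurious order
one_hot = pd.get_dummies(df["CAR"]).add_suffix("_POSSESSED")
print(one_hot)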
Chapter 2.1 Categorical Features:
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
You'll find many more if you search for "svm Categorical Features"
I collect a bunch of questions from Twitter's stream by using a regular expression to pick out any tweet containing text that starts with a question word (who, what, when, where, etc.) and ends with a question mark.
As such, I end up getting several non-useful questions in my database like: 'who cares?', 'what's this?' etc and some useful ones like: 'How often is there a basketball fight?', 'How much does a polar bear weigh?' etc
However, I am only interested in useful questions.
I have about 3000 questions that I have manually labelled: ~2000 of them are not useful and ~1000 are useful. I am attempting to use the naive Bayes classifier that comes with NLTK to classify questions automatically, so that I don't have to manually pick out the useful ones.
As a start, I tried using the first three words of a question as features, but this doesn't help very much. Out of 100 questions, the classifier predicted only around 10%-15% of the useful questions correctly, and it also failed to pick out the useful questions from the ones it predicted as not useful.
I have tried other features, such as including all the words and the length of the questions, but the results did not change significantly.
Any suggestions on how I should choose the features or carry on?
Thanks.
Some random suggestions.
Add a pre-processing step and remove stop-words like this, a, of, and, etc.
How often is there a basketball fight
First you remove some stop words, you get
how often basketball fight
Calculate a tf-idf score for each word (treating each tweet as a document; to calculate the score, you need the whole corpus in order to get document frequencies).
For a sentence like above, you calculate tf-idf score for each word:
tf-idf(how)
tf-idf(often)
tf-idf(basketball)
tf-idf(fight)
This might be useful.
Try the following additional features for your classifier:
average tf-idf score
median tf-idf score
max tf-idf score
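A small sketch of those aggregate scores, assuming scikit-learn (the tweets are placeholders for your corpus):

# Sketch: per-tweet average / median / max tf-idf scores as extra features.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["How often is there a basketball fight?",
          "who cares?"]                       # placeholder: replace with the whole corpus

vec = TfidfVectorizer(stop_words="english")   # also drops stop words, as suggested above
X = vec.fit_transform(tweets)                 # rows = tweets, columns = words

features = []
for row in X:
    scores = row.toarray().ravel()
    scores = scores[scores > 0]               # only words actually present in the tweet
    if scores.size == 0:                      # e.g. a tweet made of stop words only
        scores = np.array([0.0])
    features.append([scores.mean(), np.median(scores), scores.max()])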
Furthermore, try a POS tagger and generate a tagged sentence for each tweet.
>>> import nltk
>>> text = nltk.word_tokenize(" How often is there a basketball fight")
>>> nltk.pos_tag(text)
[('How', 'WRB'), ('often', 'RB'), ('is', 'VBZ'), ('there', 'EX'), ('a', 'DT'), ('basketball', 'NN'), ('fight', 'NN')]
Then you possibly have additional features to try that are related to POS tags.
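For example, here is a hedged sketch of turning the tags into simple count features; the particular tag groupings are illustrative assumptions, not requirements:

# Sketch: count how many wh-words, nouns and verbs a tweet contains, based on the
# POS tags shown above. May require nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger') once.
import nltk
from collections import Counter

def pos_features(tweet):
    tags = Counter(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(tweet)))
    return {
        "num_wh":    sum(v for t, v in tags.items() if t in ("WRB", "WP", "WDT")),
        "num_nouns": sum(v for t, v in tags.items() if t.startswith("NN")),
        "num_verbs": sum(v for t, v in tags.items() if t.startswith("VB")),
    }

print(pos_features("How often is there a basketball fight?"))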
Some other features that might be useful (see the qtweet paper, which is about tweet question identification, for details; a quick regex sketch follows the list):
whether the tweet contains any url
whether the tweet contains any email or phone number
whether a strong-feeling marker such as ! follows the question
whether certain unigram words are present in the context of the tweet
whether the tweet mentions other user's name
whether the tweet is a retweet
whether the tweet contains any hashtag #
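A quick regex sketch of these surface features; the patterns are rough approximations, not the qtweet paper's exact rules:

# Sketch: simple boolean surface features for a tweet.
import re

def tweet_features(tweet):
    return {
        "has_url":     bool(re.search(r"https?://\S+", tweet)),
        "has_email":   bool(re.search(r"\b\S+@\S+\.\S+\b", tweet)),
        "has_phone":   bool(re.search(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b", tweet)),
        "has_exclaim": "!" in tweet,
        "has_mention": bool(re.search(r"@\w+", tweet)),
        "is_retweet":  tweet.strip().startswith("RT "),
        "has_hashtag": "#" in tweet,
    }

print(tweet_features("RT @someone: How much does a polar bear weigh?! http://t.co/x"))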
FYI, the author of qtweet attempted 4 different classifiers, namely, Random Forest, SVM, J48 and Logistic regression. Random forest performed best among them.
Hope they help.
A most likely very powerful feature you could try to build (not sure if it's possible) is whether there is a reply to the tweet in question.