I am trying to solve a problem where I identify entities in articles (e.g., names of cars) and predict the sentiment expressed about each car within the article. For that, I need to extract the text relevant to each entity from the article.
Currently, the approach I am using is as follows:
If a sentence contains only one entity, tag the sentence as text for that entity
If a sentence contains more than one entity, ignore it
If a sentence contains no entity, tag it as text for the previously identified entity
However, this approach is not yielding accurate results, even if we assume that our sentiment classification is working.
Is there any method that the community may have come across that can solve this problem?
The approach fails in many cases and gives wrong results. For example, suppose the article says: 'Let's talk about the Honda Civic. The car was great, but failed in comparison to the Ford Focus. The car also has good economy.'
Here, the program would pick up Ford Focus as the entity for the last two sentences and tag them for it, even though both sentences actually describe the Honda Civic.
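For reference, here is a minimal sketch of the current heuristic (the regex sentence splitter and the hard-coded entity list are simplifications; in practice the entity mentions come from your entity-identification step):

    import re

    ENTITIES = ["Honda Civic", "Ford Focus"]  # hypothetical list; assume entities are known in advance

    def tag_sentences(text):
        sentences = re.split(r"(?<=[.!?])\s+", text)
        tagged, last_entity = [], None
        for sent in sentences:
            found = [e for e in ENTITIES if e.lower() in sent.lower()]
            if len(found) == 1:          # rule 1: exactly one entity in the sentence
                last_entity = found[0]
                tagged.append((sent, last_entity))
            elif len(found) > 1:         # rule 2: more than one entity -> ignore
                continue
            elif last_entity:            # rule 3: no entity -> previous entity
                tagged.append((sent, last_entity))
        return tagged

    text = ("Let's talk about the Honda Civic. The car was great, but failed "
            "in comparison to the Ford Focus. The car also has good economy.")
    print(tag_sentences(text))
    # the last two sentences end up tagged 'Ford Focus', even though they describe the Civic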
I am using NLTK for tagging descriptive words and scikit-learn for classification (a linear SVM model).
If anyone could point me in the right direction, it would be greatly appreciated. Is there a classifier I could build with custom features that can detect this type of text, if I were to manually tag, say, 50 articles and the relevant text in them?
Thanks in advance!
Related
I am trying to extract previous job titles from a CV using spaCy and named entity recognition.
I would like to train spaCy to detect a custom named entity type: 'JOB'. For that, I have around 800 job titles from https://www.careerbuilder.com/browse/titles/ that I can use as training data.
In my training data for spaCy, do I need to embed these job titles in sentences to provide context, or not?
In a CV, the job title generally stands on its own and is not really part of a full sentence.
Also, if providing coherent context for each of the 800 titles is required, it will be too time-consuming for what I am trying to do, so maybe there are solutions other than NER?
Generally, named entity recognition relies on the context of the surrounding words; otherwise the model would not be able to detect entities in previously unseen words. Consequently, a bare list of titles will not help you train a model. You could instead run string matching to find any of those 800 titles in the CV documents, and you would even be guaranteed to find all of them; no unknown titles, though.
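For example, a minimal string-matching sketch using spaCy's PhraseMatcher (assuming spaCy v3; the three titles stand in for your list of ~800):

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.blank("en")
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")   # case-insensitive matching

    job_titles = ["Software Engineer", "Data Analyst", "Project Manager"]  # placeholder list
    matcher.add("JOB", [nlp.make_doc(title) for title in job_titles])

    doc = nlp("2015-2018: Senior Data Analyst, Acme Corp.")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)   # prints "Data Analyst"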
If you could find 800 (or fewer) real CVs and replace the job titles in them with those in your list (or others!), then you would be all set to train a model capable of NER. This would be the way to go, I suppose. Just download as many freely available CVs from the web as you can and see where that gets you. If it is not enough data, you can augment it, for example by swapping the job titles in the data for some of the titles in your list.
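A rough sketch of that augmentation step, producing training examples in spaCy's (text, {"entities": [(start, end, label)]}) offset format; the function, the CV snippet, and the title list are hypothetical:

    import random

    job_titles = ["Software Engineer", "Data Analyst", "Project Manager"]  # placeholder list

    def swap_title(cv_text, old_title):
        # replace one known job title in a real CV with a random title from the list
        # and return a (text, annotations) pair with character offsets for the JOB span
        new_title = random.choice(job_titles)
        start = cv_text.index(old_title)
        text = cv_text[:start] + new_title + cv_text[start + len(old_title):]
        return text, {"entities": [(start, start + len(new_title), "JOB")]}

    example = swap_title("Worked three years as a Web Developer at Acme.", "Web Developer")
    print(example)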
I am currently playing with spaCy NER and wondering whether it can do these two things:
Case 1
Let's say we have 2 sentences that we want to do NER with:
Sugar level in his body is increasing.
His overall health quality is increasing.
Can we tag "increasing" in the first sentence as a "symptoms" entity, and tag "increasing" in the second one as a "good outcome" entity? Will NER see the difference between those two "increasing" tokens?
Case 2
We also have 2 different sentences:
My salary is USD 8000 per month
My spending is USD 5000 per month
Can NER tag the number in the first sentence as an "income" entity and the number in the second sentence as a "spending" entity?
Thank you
These tasks go beyond what you would expect an NER model to be able to do, in a number of ways. spaCy's NER algorithm could be used to find types of entities like MONEY (which is an entity type in its English models) or maybe something like SYMPTOM, but it does not look at a very large context when detecting/classifying entities, so it is not going to be able to differentiate these cases, where the relevant context is fairly far away.
You probably want to combine NER (or another type of relevant span detection, which could also be rule-based) with another type of analysis that focuses more on the context. This could be some kind of text classification, an examination of the dependency parse, etc.
Here is a simple example from the spaCy docs about extracting entity relations: it uses NER (to find MONEY) and then examines the dependency parse to figure out what the money element might refer to:
https://spacy.io/usage/examples#entity-relations
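In the same spirit, here is a rough sketch for the salary/spending sentences: spaCy NER finds the MONEY span, and the dependency parse supplies its subject, which a keyword check turns into a label. The model name and keyword lists are assumptions, and the exact parse and entity spans may vary between models:

    import spacy

    nlp = spacy.load("en_core_web_sm")   # assumes this model is installed

    INCOME_WORDS = {"salary", "income", "wage"}                  # hypothetical keyword lists
    SPENDING_WORDS = {"spending", "expense", "expenses", "cost"}

    def classify_money(text):
        doc = nlp(text)
        for ent in doc.ents:
            if ent.label_ != "MONEY":
                continue
            # look at the nominal subject attached to the head of the money span
            subjects = {tok.text.lower() for tok in ent.root.head.lefts
                        if tok.dep_ in ("nsubj", "nsubjpass")}
            if subjects & INCOME_WORDS:
                label = "income"
            elif subjects & SPENDING_WORDS:
                label = "spending"
            else:
                label = "unknown"
            print(f"{ent.text!r} -> {label}")

    classify_money("My salary is USD 8000 per month")
    classify_money("My spending is USD 5000 per month")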
I am working on an NLP project wherein I have a list of emails, all related to appreciation. I am trying to determine, from the email content, who is being appreciated. This in turn will help the organization in our performance evaluation program.
Apart from identifying who is being appreciated, I am also trying to identify the type of work the person has done and score it. I am using OpenNLP (maximum entropy/logistic regression) for classifying the emails, and some heuristics to identify the person being appreciated.
The approach for person identification is as follows:
Determine if an email is related to appreciation
Get the list of people in the "To:" list
Check if that person is being referred to in the email
Tag that person as the receiver of appreciation
However, this approach is very simple and does not work for the more complex emails we generally see. An email can contain many email IDs or mentions of people who are not the receivers of the appreciation. The context around each person is not available, and hence the accuracy is not very good.
I am thinking of using an HMM and word2vec to solve the person-identification issue. I would appreciate it if anyone who has come across this problem could share any suggestions.
Use the tm package for R, and use tf-idf (term frequency - inverse document frequency) to determine who is being appreciated.
I am suggesting this because, from what I can read, this is an unsupervised learning problem (you do not know in advance who is being appreciated). So you have to describe the documents' (emails') content, and that formula (tf-idf) will highlight the words that are used heavily in a particular document but rarely appear in all the others.
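For illustration, here is the same tf-idf idea sketched in Python with scikit-learn's TfidfVectorizer rather than R's tm (assuming scikit-learn 1.x; the two emails are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer

    emails = [
        "Great job John, the release went smoothly thanks to your effort.",
        "Thanks Mary for stepping up and handling the client escalation.",
    ]  # placeholder email bodies

    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(emails)
    terms = vec.get_feature_names_out()

    # top-weighted terms per email: names and task words tend to score high
    for i, row in enumerate(X.toarray()):
        top = sorted(zip(terms, row), key=lambda t: -t[1])[:5]
        print(f"email {i}:", [t for t, _ in top])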
One way to solve this problem is through named entity recognition. You could run something like Stanford NER over the text, which will help you recognize all the names of people mentioned in the email, and then use a rule-based chunker such as Stanford TokensRegex to extract the sentences where names of people and appreciation words appear.
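The same idea can be sketched with spaCy standing in for the Stanford tools: PERSON entities plus a crude appreciation-keyword rule (the model name and keyword list are assumptions):

    import spacy

    nlp = spacy.load("en_core_web_sm")   # assumes this model is installed
    APPRECIATION_WORDS = {"great", "excellent", "thanks", "kudos", "appreciate", "well done"}

    def appreciation_sentences(email_body):
        doc = nlp(email_body)
        hits = []
        for sent in doc.sents:
            people = [ent.text for ent in sent.ents if ent.label_ == "PERSON"]
            if people and any(w in sent.text.lower() for w in APPRECIATION_WORDS):
                hits.append((sent.text.strip(), people))
        return hits

    print(appreciation_sentences("Kudos to Priya for the smooth migration. The servers are stable now."))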
The best way to solve this would be to treat it as a supervised learning problem. You would then need to annotate a set of training data with entities, expression phrases, and the relations between them. Then you can use the Stanford Relation Extractor to extract the appropriate relations.
I am having a hard time understanding the process of building a bag-of-words. This will be a multiclass classification, supervised machine learning problem, wherein a webpage or a piece of text is assigned to one category out of multiple pre-defined categories. The method I am familiar with for building a bag-of-words for a specific category (for example, 'Math') is to collect a lot of webpages related to Math. From there, I would perform some data processing (such as removing stop words and applying TF-IDF) to obtain the bag-of-words for the category 'Math'.
Question: Another method I am thinking of is to instead search Google for something like 'list of terms related to Math' to build my bag-of-words. Is this method okay?
Another question: in the context of this question, do bag-of-words and corpus mean the same thing?
Thank you in advance!
This is not what bag-of-words is. Bag-of-words is the term for a specific way of representing a given document. Namely, a document (paragraph, sentence, webpage) is represented as a mapping of the form
word: how many times this word is present in a document
For example, "John likes cats and likes dogs" would be represented as {john: 1, likes: 2, cats: 1, and: 1, dogs: 1}. This kind of representation can easily be fed into typical ML methods (especially if one assumes that the total vocabulary is finite, so we end up with numeric vectors).
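For instance, scikit-learn's CountVectorizer produces exactly this representation (assuming scikit-learn 1.x; note that its default tokenizer lowercases):

    from sklearn.feature_extraction.text import CountVectorizer

    vec = CountVectorizer()
    X = vec.fit_transform(["John likes cats and likes dogs"])
    counts = dict(zip(vec.get_feature_names_out(), X.toarray()[0].tolist()))
    print(counts)   # {'and': 1, 'cats': 1, 'dogs': 1, 'john': 1, 'likes': 2}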
Note that this is not about "creating a bag of words for a category". A category, in typical supervised learning, consists of multiple documents, and each of them is independently represented as a bag of words.
In particular, this invalidates your final proposal of asking Google for words related to a category; that is not how typical ML methods work. You gather a lot of documents, represent each as a bag of words (or something else), and then perform statistical analysis (build a model) to figure out the best set of rules for discriminating between categories. These rules will usually not be as simple as "if word X is present, this is related to Y".
I have a data set (example below) of the following type:
Food Type: Chinese, Indian, Thai, Mexican
Ingredient 1: Salt, Chinese Salt
Ingredient 2: Chilli, Red Chilli, Thai Chilli, Green Chilli
Ingredient 3: Turmeric, Cardamom
Ingredient 4: Chicken, Beef, Fish, Tofu
I have some combinations of data made by hand, and I classified them into different food types based on ingredients and recipes. I need to generate more data based on the most probable combinations. One way I have done it so far is to generate all combinations of all ingredients and then classify them into food types based on previous learning. But this approach is not practical, because the data is large: each category of ingredient can have more than 30-40 values, and there are far more than four ingredients in the real data set. I am looking for better ways to generate and classify the data than the approach I have proposed. I have applied an NB (naive Bayes) classifier to classify the data. Your help is much appreciated.
Since I did not get any replies for over four months, I thought I would post my solution, which might help someone else.
The technique I used was to take the top five most important features from each of the attribute types (food types in my example). Then I made combinations of all those features. For the rest of the features, I chose a value randomly. This generated new data that was manageable in size.
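Roughly, in code (with placeholder values; "top" here just means whatever your feature-selection step picked):

    import itertools
    import random

    # top values per attribute, as chosen by whatever importance measure you used (placeholders)
    top = {
        "Ingredient 1": ["Salt", "Chinese Salt"],
        "Ingredient 2": ["Chilli", "Red Chilli", "Thai Chilli"],
    }
    # remaining attributes get a single random value instead of a full cross-product
    rest = {
        "Ingredient 3": ["Turmeric", "Cardamom"],
        "Ingredient 4": ["Chicken", "Beef", "Fish", "Tofu"],
    }

    rows = []
    for combo in itertools.product(*top.values()):
        row = dict(zip(top.keys(), combo))
        row.update({attr: random.choice(values) for attr, values in rest.items()})
        rows.append(row)

    print(len(rows))   # 2 * 3 = 6 rows instead of the full cross-product
    print(rows[0])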
If you need any clarifications, please feel free to ask.