How to train an OpenNLP model to extract multi-word entities - machine-learning

I am a newbie to OpenNLP entity extraction with NER. I have trained and evaluated models for entity extraction in OpenNLP NER, and they work fine when the input text contains a single-word entity, e.g. "I want to buy Cadbury".
But it does not work for multi-word scenarios, e.g. "I want to buy an Apple MacBook".
How do I train the models to pick up multi-word entities?
PS: I understand that I need to do something related to the bigrams provided in NLP, but how do I do that with OpenNLP?

You need to provide training data which covers multi-word spans. Example from the OpenNLP documentation:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 . Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
Besides the above format, IO/BIO/etc tags are also common.
In your example, Apple MacBook could be one entity of type Product Name, but could also be two, with Apple as Company Name and MacBook as Product Name. How that works depends completely on your training data.
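For illustration, the two readings could be annotated like this in the OpenNLP training format (the type names product and company here are placeholders, not predefined types):
I want to buy an <START:product> Apple MacBook <END> .
I want to buy an <START:company> Apple <END> <START:product> MacBook <END> .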
You can create data like this by hand or visually using brat.

Related

SpaCy NER differentiating numbers or entities

I am currently playing with SpaCy NER and wondering if SpaCy NER can do these 2 things:
Case 1
Let's say we have 2 sentences that we want to do NER with:
Sugar level in his body is increasing.
His overall health quality is increasing.
Can we tag "increasing" in the first sentence as a "symptoms" entity, and tag "increasing" in the second one as a "good outcome" entity? Will NER see the difference between those two "increasing" words?
Case 2
We also have 2 different sentences:
My salary is USD 8000 per month
My spending is USD 5000 per month
Can NER see the number in the first sentence as an "income" entity and the number in the second sentence as a "spending" entity?
Thank you
These tasks go beyond what you would expect an NER model to be able to do in a number of ways. spaCy's NER algorithm could be used to find types of entities like MONEY (which is an entity type in its English models) or maybe something like SYMPTOM, but it doesn't look at a very large context to detect/classify entities, so it's not going to be able to differentiate these cases where the relevant context is fairly far away.
You probably want to combine NER (or another type of relevant span detection, which could also be rule-based) with another type of analysis that focuses more on the context. This could be some kind of text classification, an examination of the dependency parse, etc.
Here is a simple example from the spacy docs about extracting entity relations using NER (to find MONEY) followed by examining the dependency parse to try to figure out what the money element could be referring to:
https://spacy.io/usage/examples#entity-relations
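As a rough sketch of that idea (assuming the en_core_web_sm model is installed; this mirrors the spirit of the linked example rather than reproducing it):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("My salary is USD 8000 per month. My spending is USD 5000 per month.")

for money in [ent for ent in doc.ents if ent.label_ == "MONEY"]:
    # climb from the money span's root to the sentence root,
    # then look for the nominal subject the amount is attributed to
    head = money.root
    while head.head is not head:
        head = head.head
    subjects = [child for child in head.children if child.dep_ in ("nsubj", "nsubjpass")]
    if subjects:
        print(subjects[0].text, "->", money.text)   # e.g. salary -> USD 8000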

Detecting text relevant to an entity in nlp

I am trying to solve a problem where I'm identifying entities in articles (ex: names of cars), and trying to predict sentiment about each car within the article. For that, I need to extract the text relevant to each entity from within the article.
Currently, the approach I am using is as follows:
If a sentence contains only one entity, tag the sentence as text for that entity.
If a sentence has more than one entity, ignore it.
If a sentence contains no entity, tag it as a sentence for the previously identified entity.
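A rough sketch of this heuristic (assign_sentences and the entities list are hypothetical; in practice the entities would come from the NER step):
# Simplified sketch of the three rules above; `entities` is a hypothetical
# list of entity names already found by NER (e.g. car models).
def assign_sentences(sentences, entities):
    assignments = []        # (sentence, entity or None)
    last_entity = None
    for sent in sentences:
        found = [e for e in entities if e.lower() in sent.lower()]
        if len(found) == 1:
            last_entity = found[0]
            assignments.append((sent, last_entity))    # rule 1
        elif len(found) > 1:
            assignments.append((sent, None))           # rule 2: ignore
        else:
            assignments.append((sent, last_entity))    # rule 3: carry over
    return assignments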
However, this approach is not yielding accurate results, even if we assume that our sentiment classification is working.
Is there any method that the community may have come across that can solve this problem?
The approach fails for many cases and gives wrong results. For example, if the text is: 'Let's talk about the Honda Civic. The car was great, but failed in comparison to the Ford Focus. The car also has good economy.'
Here, the program would pick up Ford Focus as the entity in the last two sentences and tag those sentences for it.
I am using nltk for descriptive-word tagging and scikit-learn for classification (linear SVM model).
If anyone could point me in the right direction, it would be greatly appreciated. Is there some classifier I could build with custom features that can detect this type of text if I were to manually tag, say, 50 articles and the text in them?
Thanks in advance!

Named Entity Recognition upper case issue

I recently switched the model I use for NER in spacy from en_core_web_md to xx_ent_wiki_sm.
I noticed that the new model always recognises fully upper-case words such as NEW JERSEY or NEW YORK as organisations. I would be able to provide training data to retrain the model, although it would be very time consuming. However, I am uncertain whether the model would lose the assumption that upper-case words are organisations, or whether it would instead keep the assumption and create some exceptions to it. Does it maybe even learn that every all-upper-case word with fewer than 5 letters is likely to be an organisation and everything with more letters is not? I just don't know how exactly the training will affect the model.
en_core_web_md seems to deal fine with acronyms while ignoring words like NEW JERSEY. However, the overall performance of xx_ent_wiki_sm is better for my use case.
I ask because the assumption as such is still pretty useful, as it allows us to identify acronyms such as IBM as an organisation.
The xx_ent_wiki_sm model was trained on Wikipedia, so it's very biased towards what Wikipedia considers an entity, and what's common in the data. (It also tends to frequently recognise "I" as an entity, since sentences in the first person are so rare on Wikipedia.) So post-training with more examples is definitely a good strategy, and what you're trying to do sounds feasible.
The best way to prevent the model from "forgetting" about the uppercase entities is to always include examples of entities that the model previously recognised correctly in the training data (see: the "catastrophic forgetting problem"). The nice thing is that you can create those programmatically by running spaCy over a bunch of text and extracting uppercase entities:
uppercase_ents = [ent for ent in doc.ents if all(t.is_upper for t in ent)]
See this section for more examples of how to create training data using spaCy. You can also use spaCy to generate the lowercase and titlecase variations of the selected entities to bootstrap your training data, which should hopefully save you a lot of time and work.
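For example, a minimal sketch of that bootstrapping step (the variants list is hypothetical; you would still need to embed these strings in example sentences and annotate them for training):
# generate lowercase and titlecase variants of the recognised uppercase entities
# so the retrained model keeps both the IBM-style and the NEW JERSEY-style cases
variants = []
for ent in uppercase_ents:
    variants.append(ent.text)           # e.g. "NEW JERSEY"
    variants.append(ent.text.lower())   # "new jersey"
    variants.append(ent.text.title())   # "New Jersey"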

Searching for list of terms using Google in order to build a bag-of-words for a particular category

I am having a hard time understanding the process of building a bag-of-words. This will be a multiclass classification supervised machine learning problem wherein a webpage or a piece of text is assigned to one category from multiple pre-defined categories. Now the method that I am familiar with when building a bag-of-words for a specific category (for example, 'Math') is to collect a lot of webpages that are related to Math. From there, I would perform some data processing (such as removing stop words and performing TF-IDF) to obtain the bag-of-words for the category 'Math'.
Question: Another method that I am thinking of is to instead search Google for something like 'List of terms related to Math' to build my bag-of-words. I would like to ask if this method is okay?
Another question: In the context of this question, does bag-of-words and corpus mean the same thing?
Thank you in advance!
This is not what bag of words is. Bag of words is the term for a specific way of representing a given document. Namely, a document (paragraph, sentence, webpage) is represented as a mapping of the form
word: how many times this word is present in a document
for example "John likes cats and likes dogs" would be represented as: {john: 1, likes: 2, cats: 1, and: 1, dogs: 1}. This kind of representation can be easily fed into typical ML methods (especially if one assumes that total vocabulary is finite so we end up with numeric vectors).
Note that this is not about "creating a bag of words for a category". A category, in typical supervised learning, would consist of multiple documents, and each of them is independently represented as a bag of words.
In particular, this invalidates your final proposal of asking Google for words that are related to a category - this is not how typical ML methods work. You get a lot of documents, represent them as bags of words (or something else), and then perform statistical analysis (build a model) to figure out the best set of rules to discriminate between categories. These rules usually will not be simply "if the word X is present, this is related to Y".

Unsupervised feature extraction of dishes by building a tree structure of ingredients with Natural Language Processing

I am building a recommendation system for dishes. Consider a user who eats french fries and rates them a 5. Then I want to give a good rating to all the ingredients that the dish is made of. In the case of french fries, the linked words should be "fried", "potato", "junk food", "salty", and so on. From the word Tsatsiki I want to extract "Cucumbers", "Yoghurt", "Garlic". From Yoghurt I want to extract milk product, from Cucumbers vegetable, and so on.
What is this problem called in Natural Language Processing and is there a way to address it?
I have no data at all, and I am thinking of building a web crawler that analyzes the web for the dish. I would like it to be as little ad hoc as possible and not necessarily in English. Is there a way, maybe within deep learning, to do this? I would like a dish to be linked not only to its ingredients but also to a category: junk food, vegetarian, Italian food, and so on.
This type of problem is called ontology engineering or ontology building. For an example of a large ontology and how it's structured, you might check out something like YAGO. It seems like you are going to be building a boutique ontology for food and then overlaying a ratings system. I don't know of any ontologies out there of the form you're looking for, but there are relevant things out there you should take a look at, for example, this OWL-based food ontology and this recipe ontology.
Do you have a recipe like this:
Ingredients:
* Cucumbers
* Garlic
* Yoghurt
or like this:
Grate a cucumber or chop it. Add garlic and yoghurt.
If the former, your features have already been extracted. The next step would be to convert them to a vector representation and recommend other recipes. The simplest way would be to do (unsupervised) clustering of recipes.
If the latter, I suspect you can get away with a simple rule of thumb. First, use a part-of-speech tagger to extract all the nouns in the recipe. This will extract all the ingredients and a bit more (e.g. kitchen appliances, cutlery, etc.). Look up the nouns in a database of food ingredients such as this one.
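As a minimal sketch of that rule of thumb (using spaCy here, though nltk's pos_tag would work just as well; assumes en_core_web_sm is installed):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Grate a cucumber or chop it. Add garlic and yoghurt.")
# keep lemmatised nouns as candidate ingredients, to be filtered
# against a food-ingredient database afterwards
candidates = [tok.lemma_ for tok in doc if tok.pos_ == "NOUN"]
print(candidates)   # e.g. ['cucumber', 'garlic', 'yoghurt']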
