Trying to find the name of a specific location from tweets - machine-learning

I am trying to find mentions of a specific location in tweets and run sentiment analysis on the hits I get from the search. The problem is that when the location is called, say, "Sammy's Tap and Grill", searching for the full name returns no hits; I need to search for something like "Sammys" or "Sammy's" to get any. Conversely, for "Empire State Building" I cannot search for "Empire" alone, because that returns unrelated tweets about, for example, the Mayan and Chola empires, so I have to search for "Empire State Building" or "Empire State". Is there an NLP technique that, given the full name of a location, picks the search term that returns the most relevant hits? So far I have only managed to check whether the hits are nouns, because some places have names like "Excellent" and "Fantastic" and I didn't want adjectives to show up. Is there an NLP way to find a location name in a tweet?

Your problem is very similar to the named entity recognition (NER) problem. You can try a standard named entity extractor or train your own NER model.
There are different libraries for NER, such as:
Stanford NER,
spaCy's NER tool,
NLTK's NER module.
If you want to train your own named entity recognition model, check these links:
CRF git repository
Named Entity Recognition with Tensorflow
Good luck!
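Independently of which NER library you pick, the search-term part of the question (apostrophes, and partial names such as "Empire State") can be approached with plain normalisation. The following is only a minimal sketch, not a replacement for NER: the function names and the `min_words` parameter are made up for illustration. It normalises both the tweet and the location name, then looks for the longest leading part of the name that appears in the tweet.

```python
import re

def normalize(text):
    """Lowercase, drop apostrophes, and keep only alphanumeric words."""
    text = text.lower().replace("'", "").replace("\u2019", "")
    return " ".join(re.findall(r"[a-z0-9]+", text))

def candidate_terms(full_name, min_words=1):
    """Leading word n-grams of the normalized name, longest first."""
    words = normalize(full_name).split()
    return [" ".join(words[:n]) for n in range(len(words), min_words - 1, -1)]

def longest_match(tweet, full_name, min_words=1):
    """Return the longest candidate term found in the tweet, or None."""
    padded = f" {normalize(tweet)} "  # padding forces whole-word matches
    for term in candidate_terms(full_name, min_words):
        if f" {term} " in padded:
            return term
    return None

# A one-word match is acceptable for a distinctive name ...
print(longest_match("Great wings at Sammys tonight!", "Sammy's Tap and Grill"))
# ... but an ambiguous first word like "Empire" needs at least two words.
print(longest_match("The Chola empire was vast", "Empire State Building", min_words=2))
```

Choosing `min_words` per location (1 for distinctive names, 2 or more for names that start with a common word) is a manual decision in this sketch; an NER model could take over that judgement.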

Related

How much data / context needed to train custom NER Spacy model?

I am trying to extract previous Job titles from a CV using spacy and named entity recognition.
I would like to train spacy to detect a custom named entity type : 'JOB'. For that I have around 800 job title names from https://www.careerbuilder.com/browse/titles/ that I can use as training data.
In my training data for spacy, do I need to integrate these job titles in sentences added to provide context or not?
In general, in a CV the job title kind of stands on its own and is not really part of a full sentence.
Also, if I need to provide coherent context for each of the 800 titles, it will be too time-consuming for what I'm trying to do, so maybe there are other solutions than NER?
Generally, Named Entity Recognition relies on the context of words, otherwise the model would not be able to detect entities in previously unseen words. Consequently, the list of titles would not help you to train any model. You could rather run string matching to find any of those 800 titles in CV documents and you will even be guaranteed to find all of them - no unknown titles, though.
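The string-matching option could look roughly like this. It is a sketch using only the standard `re` module; the function name `find_titles` and the sample titles are made up for illustration.

```python
import re

def find_titles(cv_text, titles):
    """Return (start, end, matched_text) for every known title found in cv_text."""
    # Longest titles first so "Senior Data Engineer" wins over "Data Engineer".
    ordered = sorted(set(titles), key=len, reverse=True)
    if not ordered:
        return []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(cv_text)]

titles = ["Data Engineer", "Senior Data Engineer", "Accountant"]
cv = "2015-2018 Senior Data Engineer, Acme\n2012-2015 accountant"
for start, end, text in find_titles(cv, titles):
    print(start, end, text)
```

Note that the `\b` word boundaries assume titles begin and end with a letter or digit; titles ending in punctuation would need a different boundary check.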
If you could find 800 (or fewer) real CVs and replace the job titles in them with those from your list (or others!), you would be all set to train a model capable of NER. This would be the way to go, I suppose. Just download as many freely available CVs from the web as you can and see where this gets you. If that is not enough data, you can augment it, for example by swapping the job titles in the data for some of the titles in your list.
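A sketch of that augmentation step, assuming you already have one annotated sentence with the character offsets of its job title. The output uses the `(text, {"entities": [(start, end, label)]})` layout that spaCy training data has commonly used; check the format your spaCy version expects. The function name `augment` and the sample data are hypothetical.

```python
import random

def augment(text, span, new_titles, label="JOB", n=3, seed=0):
    """Swap the annotated job title for other titles, keeping offsets correct."""
    start, end = span
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    picks = rng.sample(new_titles, k=min(n, len(new_titles)))
    examples = []
    for title in picks:
        new_text = text[:start] + title + text[end:]
        # The entity always starts at the same place; only its end moves.
        examples.append((new_text, {"entities": [(start, start + len(title), label)]}))
    return examples

text = "Worked as Software Engineer at Acme from 2015 to 2018."
for example in augment(text, (10, 27), ["Nurse", "Data Analyst", "Chef"], n=2):
    print(example)
```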

Bi-gram model to predict text

I am planning to implement a bi-gram model to predict search text. If a user has frequently searched for "Test search word" and then types "Test", I want to automatically suggest "Test search word".
I have a list of previously searched texts. I am trying bi-grams because even if the user types "Tast", it should still suggest "Test search word". I am implementing this in Java, and I am looking for a library that I can feed my data to and that returns a prediction when I pass it the text the user has typed.
After some research I found the links below,
https://www.javatips.net/api/Solbase-Lucene-master/contrib/analyzers/common/src/java/org/apache/lucene/analysis/shingle/ShingleFilter.java
https://opennlp.apache.org/docs/1.8.1/apidocs/opennlp-tools/opennlp/tools/ngram/NGramUtils.html
but they are not helping in my case. Are there any Java libraries that suit my purpose?
I'm thinking of two solutions:
First
Index each of your users' query strings in a MARISA (Matching Algorithm with Recursively Implemented StorAge) trie, a data structure optimised for keyword search and autocomplete.
Prepare a Levenshtein distance function to tolerate typos.
Now, for each new user query q, get all strings indexed in the MARISA trie that have q as a prefix (after typo tolerance).
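To make the first approach concrete, here is a small sketch of typo-tolerant prefix matching. It is written in Python purely for illustration (the question targets Java), and it scans a plain list of past queries instead of a MARISA trie, so it shows the matching logic rather than the index. The function names and the `max_typos` and `limit` parameters are made up.

```python
from collections import Counter

def prefix_edit_distance(query, candidate):
    """Smallest Levenshtein distance between query and any prefix of candidate."""
    # Row 0: turning an empty query into candidate[:j] costs j insertions.
    prev = list(range(len(candidate) + 1))
    for i, qc in enumerate(query, start=1):
        curr = [i]
        for j, cc in enumerate(candidate, start=1):
            cost = 0 if qc == cc else 1
            curr.append(min(prev[j] + 1,          # delete qc
                            curr[j - 1] + 1,      # insert cc
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    # The last row holds the distance to every prefix; take the best one.
    return min(prev)

def suggest(query, history, max_typos=1, limit=5):
    """Past queries whose prefix is within max_typos edits, most frequent first."""
    counts = Counter(h.lower() for h in history)
    q = query.lower()
    scored = []
    for text, freq in counts.items():
        dist = prefix_edit_distance(q, text)
        if dist <= max_typos:
            scored.append((dist, -freq, text))
    return [text for _, _, text in sorted(scored)[:limit]]

history = ["Test search word"] * 3 + ["Team meeting notes", "Weather today"]
print(suggest("Tast", history))
```

Suggestions come back lowercased because matching is case-insensitive here. With a real trie you would restrict the candidates to a subtree first instead of scoring every stored query.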
Second
Use an Elasticsearch suggester.
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/7.5/search-suggesters.html#completion-suggester
Please note that parts of the suggest feature are still under development.

Detecting text relevant to an entity in nlp

I am trying to solve a problem where I'm identifying entities in articles (ex: names of cars), and trying to predict sentiment about each car within the article. For that, I need to extract the text relevant to each entity from within the article.
Currently, the approach I am using is as follows:
If a sentence contains only 1 entity, tag the sentence as text for that entity
If sentence has more than 1 entity, ignore it
If sentence contains no entity, tag as a sentence for previously identified entity
However, this approach is not yielding accurate results, even if we assume that our sentiment classification is working.
Is there any method that the community may have come across that can solve this problem?
The approach fails in many cases and gives wrong results. For example, take: 'Let's talk about the Honda Civic. The car was great, but failed in comparison to the Ford Focus. The car also has good economy.'
Here, the program would pick up Ford Focus as the entity in last 2 sentences and tag those sentences for it.
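For reference, the three rules and the failure case can be reproduced with a few lines. This is only a sketch of the heuristic as described above, not the actual nltk/scikit-learn pipeline; the regex sentence splitter and the function name `assign_sentences` are simplifications made up for illustration.

```python
import re

def assign_sentences(text, entities):
    """One entity: tag the sentence with it. Several: skip. None: carry over the previous one."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tagged, previous = [], None
    for sentence in sentences:
        lowered = sentence.lower()
        found = [e for e in entities if e.lower() in lowered]
        if len(found) == 1:
            previous = found[0]
            tagged.append((sentence, previous))
        elif not found and previous is not None:
            tagged.append((sentence, previous))
        # Sentences with several entities (or none before any entity) are skipped.
    return tagged

article = ("Let's talk about the Honda Civic. The car was great, but failed in "
           "comparison to the Ford Focus. The car also has good economy.")
for sentence, entity in assign_sentences(article, ["Honda Civic", "Ford Focus"]):
    print(entity, "<-", sentence)
```

Running it shows the last two sentences attributed to Ford Focus, which is exactly the misattribution described: "The car" in both sentences still refers to the Honda Civic.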
I am using nltk for descriptive words tagging, and scikit-learn for classification (linear svm model).
If anyone could point me in the right direction, it would be greatly appreciated. Is there some classifier I could build with custom features that can detect this type of text if I were to manually tag say - 50 articles and the text in them?
Thanks in advance!

Named Entity Recognition upper case issue

I recently switched the model I use for NER in spacy from en_core_web_md to xx_ent_wiki_sm.
I noticed that the new model always recognises fully upper-case words such as NEW JERSEY or NEW YORK as organisations. I could provide training data to retrain the model, although that would be very time-consuming. However, I am uncertain whether the model would lose the assumption that upper-case words are organisations, or whether it would keep the assumption and learn some exceptions to it. Might it even learn that every all-upper-case word with fewer than 5 letters is likely to be an organisation and everything longer is not? I just don't know how exactly the training will affect the model.
en_core_web_md seems to deal fine with acronyms, while ignoring words like NEW JERSEY. However, the overall performance of xx_ent_wiki_sm is better for my use case.
I ask because the assumption as such is still pretty useful, as it allows us to identify acronyms such as IBM as an organisation.
The xx_ent_wiki_sm model was trained on Wikipedia, so it's very biased towards what Wikipedia considers an entity, and what's common in the data. (It also tends to frequently recognise "I" as an entity, since sentences in the first person are so rare on Wikipedia.) So post-training with more examples is definitely a good strategy, and what you're trying to do sounds feasible.
The best way to prevent the model from "forgetting" about the uppercase entities is to always include examples of entities that the model previously recognised correctly in the training data (see: the "catastrophic forgetting problem"). The nice thing is that you can create those programmatically by running spaCy over a bunch of text and extracting uppercase entities:
uppercase_ents = [ent for ent in doc.ents if all(t.is_upper for t in ent)]
See this section for more examples of how to create training data using spaCy. You can also use spaCy to generate the lowercase and titlecase variations of the selected entities to bootstrap your training data, which should hopefully save you a lot of time and work.
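Generating the case variations could look roughly like this. It is a plain-Python sketch that assumes you have already collected each entity's character offsets and label (for a spaCy span these would be `ent.start_char`, `ent.end_char` and `ent.label_`); the helper name `case_variants` and the `(text, {"entities": [...]})` output layout are assumptions, so adapt them to the training format your spaCy version expects.

```python
def case_variants(text, start, end, label):
    """Build training examples with the entity in original, lower and title case."""
    span = text[start:end]
    examples = []
    # dict.fromkeys removes duplicate variants while keeping their order.
    for variant in dict.fromkeys([span, span.lower(), span.title()]):
        new_text = text[:start] + variant + text[end:]
        # Recompute the end offset in case a variant differs in length.
        entity = (start, start + len(variant), label)
        examples.append((new_text, {"entities": [entity]}))
    return examples

for example in case_variants("She moved to NEW JERSEY last year.", 13, 23, "GPE"):
    print(example)
```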

Identify the person referred to in an email using ML/NLP

I am working on an NLP project, wherein I have a list of emails all related to appreciation. I am trying to determine from the email content, who is being appreciated. This in turn will help the organization in our performance evaluation program.
Apart from identifying who is being appreciated, I am also trying to identify the type of work a person has done and score it. I am using OpenNLP (max entropy/logistic regression) to classify the emails, and some heuristics to identify the person being appreciated.
The approach for person identification is as follows:
Determine if an email is related to appreciation
Get the list of people in the "To:" list
Check if that person is being referred to in the email
Tag that person as the receiver of appreciation
However, this approach is very simple and does not work for complex emails we generally see. An email can consist of many email ids or people being referred to and they are not the receivers of the appreciation. The context of the person is not available and hence the accuracy is not very good.
I am thinking of using an HMM and word2vec to solve the person identification issue. I would appreciate it if anyone who has come across this problem could share suggestions.
Use the tm package for R, and use tf-idf (term frequency - inverse document frequency) to determine who is being appreciated.
I'm suggesting this because, from what I can read, this is an unsupervised learning problem (you don't know beforehand who is being appreciated). So you have to describe the content of the documents (emails), and that formula (tf-idf) will help you find which words are used most in a particular document but rarely in all the others.
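To show what the formula computes, here is a small sketch in plain Python rather than R's tm package. It uses one common variant, tf = count / document length and idf = log(N / document frequency); tm and other libraries may weight or smooth differently. The sample emails are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(documents)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

emails = [
    ["thanks", "alice", "great", "work"],
    ["thanks", "bob", "great", "help"],
    ["thanks", "alice", "for", "the", "release"],
]
for scores in tf_idf(emails):
    print(sorted(scores, key=scores.get, reverse=True)[:2])
```

A word such as "thanks" that occurs in every email scores zero, while a name that occurs in only one email scores high, which is why tf-idf can surface the person a specific email is about.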
One way to solve this problem is through the use of Named Entity Recognition. You can possibly run something like Stanford NER over the text which will help you recognize all people names mentioned in the email and then use a rules based chunker such as Stanford TokensRegex to extract sentences where names of people and appreciation words are mentioned.
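The rule-based part of that idea can be sketched without any NLP library. In the sketch below, a hand-written list of first names stands in for the output of an NER tool such as Stanford NER, a regex sentence splitter stands in for a proper chunker such as TokensRegex, and the appreciation word list is invented; all names here are hypothetical.

```python
import re

# Hypothetical lexicon; a real system would need a much richer list.
APPRECIATION_WORDS = {"thanks", "thank", "kudos", "appreciate", "congratulations"}

def appreciation_mentions(body, people):
    """Count, per person, the sentences mentioning both that person and an appreciation word."""
    counts = {}
    for sentence in re.split(r"(?<=[.!?])\s+", body):
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        if not words & APPRECIATION_WORDS:
            continue  # no appreciation word in this sentence
        for person in people:
            if person.lower() in words:
                counts[person] = counts.get(person, 0) + 1
    return counts

body = ("Hi team. Kudos to Priya for fixing the outage. "
        "Ravi, please schedule the review. Thanks again Priya!")
print(appreciation_mentions(body, ["Priya", "Ravi"]))
```

Here Ravi is mentioned but never in a sentence with an appreciation word, so only Priya is counted. Multi-word names and pronouns are out of scope for this sketch.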
The best way to solve this will be by treating this as a supervised learning problem. You will then need to annotate a bunch of training data with entities and expression phrases and the relations between them. Then you can use Stanford Relation Extractor to extract appropriate relations.
