I'm trying to detect contact information in a huge list of offers I receive. The offers contain unstructured text; some examples:
if you're interested, send an email to test#test.com
want to know more? call 000 000 000
come to the public viewing on 25nd of january
public viewing is on the coming wednesday
is this what you're searching for? We're looking forward to hearing from you
As you can see, there are multiple possibilities:
there is no date for a viewing, but there's a phone number
there is no date for a viewing, but there's an email
there is a date for a viewing
there is no detailed information
The tricky part is that the text can also contain other dates, so I can't just parse out every date.
What is the best way to solve something like this? I've already tried regex. I think I could get it to work, but there is an enormous number of cases, which makes it very hard.
I've also looked into NLP libraries like https://spacy.io/ and prodi.gy, but I feel like I'm not on the right track.
The original texts are written in German.
In 2020, how should I approach this?
You can use an NLP-powered rule-based matcher. With spaCy you already explored the right tool, you just didn't go deep enough with it. And it's available in German.
Here are some example patterns:
# call-number pattern
call_pattern = [{'LOWER': 'call'},
                {'ORTH': '(', 'OP': '?'}, {'SHAPE': 'ddd'}, {'ORTH': ')', 'OP': '?'},
                {'SHAPE': 'ddd'},
                {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'ddd'}]

# e-mail pattern
email_pattern = [{'LIKE_EMAIL': True}]

# pattern for a public viewing with a date
# ('ENT_TYPE' is the valid Matcher key for entity labels; 'label' is not)
public_viewing_pattern = [{'LOWER': 'public'},
                          {'LOWER': 'viewing'},
                          {'POS': 'AUX', 'OP': '?'},
                          {'POS': 'ADP', 'OP': '?'},
                          {'ENT_TYPE': 'DATE', 'OP': '+'}]
Then you register the patterns with a matcher and apply it to each sentence:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
# or, for German:
# import de_core_news_sm
# nlp = de_core_news_sm.load()

matcher = Matcher(nlp.vocab)
# spaCy 2.x API: matcher.add(name, callback, pattern)
matcher.add("call_pattern", None, call_pattern)
matcher.add("email_pattern", None, email_pattern)
matcher.add("public_viewing_pattern", None, public_viewing_pattern)

# the example offers from the question
sentences = ["if you're interested, send an email to test#test.com",
             "want to know more? call 000 000 000",
             "come to the public viewing on 25nd of january",
             "public viewing is on the coming wednesday"]

found = {'numbers': [], 'emails': [], 'public_viewings': []}
for sent in sentences:
    doc = nlp(sent)
    matches = matcher(doc)
    for match_id, start, end in matches:
        if doc.vocab.strings[match_id] == 'call_pattern':
            found['numbers'].append(doc[start:end])
        if doc.vocab.strings[match_id] == 'email_pattern':
            found['emails'].append(doc[start:end])
        if doc.vocab.strings[match_id] == 'public_viewing_pattern':
            found['public_viewings'].append(doc[start:end])

print(found)
Result:
{'numbers': [call 000 000 000], 'emails': [test#test.com], 'public_viewings': [public viewing on, public viewing on 25nd, public viewing on 25nd of, public viewing on 25nd of january, public viewing is, public viewing is on, public viewing is on the, public viewing is on the coming, public viewing is on the coming wednesday]}
P.S.: The repeated, overlapping matches are caused by a bug in spaCy versions prior to 2.1. Just add some manual validation to keep only the longest of the repeating matches and you'll be good.
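On newer versions, spacy.util.filter_spans (available since spaCy 2.1.4, if I recall correctly) already implements this keep-the-longest logic; a minimal sketch, reusing doc and matches from the loop above:

from spacy.util import filter_spans

# Turn the raw matches into spans, then keep only the longest,
# non-overlapping ones.
spans = [doc[start:end] for match_id, start, end in matches]
longest_only = filter_spans(spans)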
The hard part will be generalizing enough and getting your patterns right, but they are very powerful and you can apply all sorts of tweaks to them. Check spaCy's online demo for testing, and refer to the docs for more complex stuff.
Related
So I have a task in front of me: build a custom NER model for the pharmaceutical industry, where I have a finite list of drugs and over 4,000 text files on which NER is supposed to be done. I have also tried entity matching using spaCy, but it is showing an error. So now I plan on using sklearn-crfsuite, but in order to do that my data needs to be in CoNLL format and annotated. I would really appreciate it if someone could guide me in annotating my text files! Is there any way I can run automatic annotation on the text files using the drug list I have? It is a humongous effort for an individual to achieve this manually. I also had a look at the question in the link mentioned below:
NER model to recognize Indian names
But no one has actually addressed my question. I would really appreciate it if someone could help me out.
spaCy code:
import pandas as pd
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for match_id, start, end in matches:
            # match_id is the hash of the label passed to matcher.add
            span = Span(doc, start, end, label=match_id)
            spans.append(span)
        doc.ents = spans
        return doc

data = pd.read_excel(r'C:\Users\xyz\pname.xlsx')
ld = list(set(data['Product']))

nlp = spacy.load('en')
entity_matcher = EntityMatcher(nlp, ld, 'DRUG')
nlp.add_pipe(entity_matcher)
print(nlp.pipe_names)

doc = nlp('Hi bnbbn, ope all is well. In preparation for the bcbcb is there anything that BGTD requires specifically? We had sent you the US centric Briefing Package to align with our previous discussion on having bkjnsd included in the Wave 1 IMOVAX POLIO submission plan. If you would like, we can set-up a BGTD specific meeting after the June 20th meeting to discuss any jk specific product questions you may have as the product mix is a bit different between countries.')
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
When I run my script, this is the error I get:
[T002] Pattern length (11) >= phrase_matcher.max_length (10). Length can be set on initialization, up to 10.
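(For reference, the message itself points at the workarounds: set the length on initialization, or avoid over-long patterns. A hedged sketch, assuming nlp and the terms drug list from the class above; upgrading to spaCy 2.1+, where the PhraseMatcher was rewritten, reportedly removes this cap altogether.)

# Skip phrases that exceed the old PhraseMatcher length cap from the error.
patterns = [nlp(term) for term in terms]
short_patterns = [p for p in patterns if len(p) < 10]  # 10 = the cap in [T002]
matcher = PhraseMatcher(nlp.vocab)
matcher.add('DRUG', None, *short_patterns)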
Background
I'm writing a Swift application that requires classifying user events into categories. These categories can be things like:
Athletics
Cinema
Food
Work
However, I have a fixed list of these categories and do not wish to create more than the minimal number I believe is needed to classify any type of event.
Question
Is there a machine learning (NLP) procedure that does the following?
Takes a block of text (in my case, a description of an event).
Produces a "percentage match" for each possible classification.
For instance, suppose the description of an event is as follows:
Fun, energetic bike ride for people of all ages.
The algorithm this description is passed into would return an object that looks something like this:
{
    athletics: 0.8,
    cinema: 0.1,
    food: 0.06,
    work: 0.04
}
where the value of each key in the object is a confidence.
If anyone can guide me in the right direction (or even send some general resources or solutions specific to iOS dev), I'd be super appreciative!
You are talking about a typical classification model. I believe iOS offers you APIs to do this inside your app; look for the natural language processing (NLP) bit.
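To make that concrete, here is a minimal sketch in Python (not Swift, but the concept carries over to Apple's frameworks) of a classifier that returns a confidence per category; the toy training texts and the use of scikit-learn are my own illustration, not part of the question:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One toy example per category; a real model needs many more.
texts = ["bike ride for all ages", "new movie premiere tonight",
         "pizza and pasta tasting event", "quarterly report deadline at work"]
labels = ["athletics", "cinema", "food", "work"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# predict_proba returns a "percentage match" for every category.
probs = model.predict_proba(["Fun, energetic bike ride for people of all ages."])[0]
print(dict(zip(model.classes_, probs.round(2))))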
Also, you are probably being downvoted because this forum typically looks to solve specific programming questions rather than generic ones (this is an assumption; there could be another reason for the downvotes).
{
    "blogid": 11,
    "blog_authorid": 2,
    "blog_content": "(this is blog complete content: html encoded on base64 such as) PHNlY3Rpb24+PGRpdiBjbGFzcz0icm93Ij4KICAgICAgICA8ZGl2IGNsYXNzPSJjb2wtc20tMTIiIGRhdGEtdHlwZT0iY29udGFpbmVyLWNvbnRlbn",
    "blog_timestamp": "2018-03-17 00:00:00",
    "blog_title": "Amazon India Fashion Week: Autumn-",
    "blog_subtitle": "",
    "blog_featured_img_link": "link to image",
    "blog_intropara": "Introductory para to article",
    "blog_status": 1,
    "blog_lastupdated": "\"Mar 19, 2018 7:42:23 AM\"",
    "blog_type": "Blog",
    "blog_tags": "1,4,6",
    "blog_uri": "Amazon-India-Fashion-Week-Autumn",
    "blog_categories": "1",
    "blog_readtime": "5",
    "ViewsCount": 0
}
Above is one sample blog from my API. I have a JSON array of such blogs.
I am trying to predict 3 similar blogs based on a blog's properties (e.g. tags, categories, author, keywords in title/subtitle) and contents. I have no user data, i.e., there is no logged-in user data (such as ratings or reviews). I know that without user data it will not be accurate, but I'm just getting started with data science/ML. Any suggestion/link is appreciated. I prefer Java, but Python, PHP, or any other language also works for me. I need an easy-to-implement model, as I am a beginner. Thanks in advance.
My intuition is that this question might not be at the right address.
BUT
I would do the following:
Create a dataset of sites that would be the inventory from which to predict. For each site you will need to list one or more features: amount of tags, amount of posts, average time between posts in days, etc.
Since this sounds like it is for training and you are not worried too much about accuracy, numeric features should suffice.
Work back from a k-NN algorithm. Don't worry about the classifiers: instead of classifying a blog, you list the 3 closest neighbors (k = 3). A good implementation of the algorithm is here. Have fun simplifying it for your purposes.
Your algorithm should be a step or two shorter than full k-NN, which is considered to be among the simpler ML methods and a good place to start. A minimal sketch of the idea follows.
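This sketch uses scikit-learn's NearestNeighbors; the feature values are made up for illustration:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows are blogs; columns are e.g. amount of tags, amount of posts,
# average days between posts (made-up numbers).
features = np.array([
    [3, 120, 2.5],
    [5,  80, 4.0],
    [2, 200, 1.0],
    [4,  95, 3.5],
    [3, 110, 2.0],
])

nn = NearestNeighbors(n_neighbors=4).fit(features)  # 4 = the blog itself + 3 neighbors
distances, indices = nn.kneighbors(features[[0]])
print(indices[0][1:])  # indices of the 3 blogs most similar to blog 0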
Good luck.
EDIT:
You want to build a recommender engine using text, tags, numeric, and maybe time-series data. This is a broad request. Just like you, when faced with this request, I'd need to dive into the data and research the best approach. Some approaches require different sets of data, e.g. collaborative vs. content-based filtering.
Few things may’ve been missed on the user side that can be used like a sort of rating: You do not need a login feature get information: Cookie ID or IP based DMA, GEO and viewing duration should be available to the Web Server.
On the blog side: you need to process the texts to identify related terms. For other blog features, I gave examples above.
I am aware that this is a lot of hand-waving, but there's no actual code question here. To reiterate, my intuition is that this question might not be at the right address.
I really want to help but this is the best I can do.
EDIT 2:
If I understand your new comments correctly, each blog has the following for each other blog:
A Jaccard similarity coefficient.
A set of TF-IDF generated words with scores.
A Euclidean distance based on numeric data.
I would create a heuristic from these and allow the process to adjust the importance of each statistic.
The challenge would be to quantify the TF-IDF word-score output. You could treat the words (over a certain score) as tags and run another similarity analysis, or count the overlap.
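A minimal sketch of such a heuristic; the weights are placeholders you would let the process adjust:

def similarity_score(jaccard, tfidf_overlap, euclidean, w=(0.4, 0.4, 0.2)):
    # Euclidean distance is a dissimilarity, so fold it into (0, 1] first.
    euclid_sim = 1.0 / (1.0 + euclidean)
    return w[0] * jaccard + w[1] * tfidf_overlap + w[2] * euclid_sim

print(similarity_score(jaccard=0.3, tfidf_overlap=0.5, euclidean=2.0))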
You already started on this path, and this answer assumes you are going to continue. IMO the best path is to see which dedicated recommender engines can help you without constructing the statistics piecemeal (numeric with Euclidean, tags with Jaccard, text with TF-IDF).
I have some user chat data, categorised into various categories. The problem is that a lot of the categories are algorithm-generated; see the example below:
Message                        | Category
I want to play cricket         | Play cricket
I wish to watch cricket        | Watch cricket
I want to play cricket outside | Play cricket outside
As you can see, the categories (essentially phrases) are extracted from the text itself. In my data there are 10,000 messages with approx. 4,500 unique categories.
Is there any suitable algorithm that can give me good prediction accuracy in such cases?
Well, I habitually use OpenNLP's DocumentCategorizer for tasks like this, but Stanford CoreNLP, I think, does some similar stuff. OpenNLP uses maximum entropy for this, but there are many ways to do it.
First, some thoughts on the number of unique labels. Basically you only have a few samples per class, and that is generally a bad thing: your classifier is going to give poor results no matter what it is, if you try to do it the way you are implying, because of overlap and/or underfitting. So here's what I've done before in a similar situation: separate the concepts into different thematic classifiers, then assemble the best scores from each. For example, based on what you wrote above, you may be able to detect OUTSIDE vs INSIDE with one classification model, and WATCHING CRICKET vs PLAYING CRICKET with another. Then at runtime, you would pass the text into both classifiers and take the best hit from each to assemble a single category. Pseudocode:
DoccatModel outOrIn = new DoccatModel(modelThatDetectsOutsideOrInside);
DoccatModel cricketMode = new DoccatModel(modelThatDetectsPlayingOrWatchingCricket);

String stringToDetectClassOf = "Some dude is playing cricket outside, he sucks";
String outOrInCat = outOrIn.classify(stringToDetectClassOf);
String cricketModeCat = cricketMode.classify(stringToDetectClassOf);
String best = outOrInCat + " " + cricketModeCat;
You get the point, I think.
Also some other random thoughts:
- Use a text index to explore the amount of data you get back, to figure out how to break up the categories.
- You want a few hundred examples for each model.
Let me know if you want some code examples from OpenNLP, if you are doing this in Java.
We want to identify the address fields in a document. To identify the address fields, we converted the documents to OCR files using Tesseract. From the Tesseract output we want to check whether a string contains an address field or not. What is the right strategy to solve this problem?
It's not possible to solve this problem using regex, because address fields differ across documents and countries.
We tried NLTK for classifying the words, but it does not work perfectly for address fields.
Required output:
I am staying at 234 23 Philadelphia - Contains address field <234 23 Philadelphia>
I am looking for a place to stay - Does not contain an address
Please provide your suggestions for solving this problem.
As in many ML problems, there are multiple possible solutions, and the important part (the one that commonly has the greater impact) is not which algorithm or model you use, but feature engineering, data preprocessing and standardization, and things like that. The first solution that comes to my mind (and it's just an idea; I would test it and see how it performs) is:
Take your training-set examples and list the "N" most commonly used words across all examples (that's your vocabulary). Every one of those "N" words is represented by a number (its list index).
Transform your training examples: read every training example and change its representation, replacing every word by that word's number in the vocabulary.
Finally, for every training example create a feature vector of the same size as the vocabulary; for every word in the vocabulary, your feature vector holds 0 (the word does not exist in your example) or 1 (it exists), or the count of how many times the word appears (again, this is feature engineering).
Train multiple classifiers, varying algorithms, parameters, training-set sizes, etc., and do cross-validation to choose your best model.
And from there, follow the standard ML workflow... A minimal sketch of these steps is below.
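This sketch uses scikit-learn, where CountVectorizer performs the vocabulary and 0/1-or-count feature-vector steps described above in one go; the training examples are made up:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy labeled data, repeated so cross-validation has enough samples.
texts = ["I am staying at 234 23 Philadelphia",
         "I am looking for a place to stay",
         "Our office is at 10 Main Street, Boston",
         "Call me when you arrive"] * 5
labels = [1, 0, 1, 0] * 5  # 1 = contains an address

# binary=True gives the 0/1 vectors; drop it to use word counts instead.
model = make_pipeline(CountVectorizer(max_features=1000, binary=True),
                      LogisticRegression())
print(cross_val_score(model, texts, labels, cv=5).mean())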
If you are interested in just checking YES or NO, and not in extracting the complete address, one simple solution can be NER (named entity recognition).
You can check whether the text contains a location or not. For example:
import nltk

def check_location(text):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text))):
        if hasattr(chunk, "label"):
            if chunk.label() == "GPE" or chunk.label() == "GSP":
                return "True"
    return "False"

text = "I am staying at 234 23 Philadelphia."
print(text + " - " + check_location(text))

text = "I am looking for a place to stay."
print(text + " - " + check_location(text))
Output:
# I am staying at 234 23 Philadelphia. - True
# I am looking for a place to stay. - False
If you want to extract the complete address as well, you will need to train your own model.
You can check: NER with NLTK, CRF++.
You're right. Using regex to find an address in a string is messy.
There are APIs that will attempt to extract addresses for you. These APIs are not always guaranteed to extract addresses from strings, but they will do their best. One example of a street address extraction API is from SmartyStreets. Documentation here and demo here.
Something to consider is that even your example (I am staying at 234 23 Philadelphia) doesn't contain a full address; it's missing a state or ZIP code field. This makes it very difficult to programmatically determine whether there is an address. Once a state or ZIP code is added to that sample string (I am staying at 234 23 Philadelphia PA), it becomes much easier to programmatically determine that the string contains an address.
Disclaimer: I work for SmartyStreets
A better method for this task could be as follows:
Train your own custom NER model (extending a pre-trained spaCy model, or building your own CRF++ / CRF-biLSTM model if you have annotated data), or use a pre-trained model such as spaCy's large model, geopandas, etc.
Define a weighted score mechanism based on your problem statement.
For example, let's assume every address has 3 important components: an address, a telephone number, and an email ID.
Text that has all three of them would get a score of 33.33% + 33.33% + 33.33% = 100%.
For identifying whether something is an address field or not, you can look at the percentage of spaCy's location tags (GPE, FAC, LOC, etc.) out of the total tokens in the text, which gives a good estimate of how many location tags are present. Then run a regex for postal codes, and match the found city names against the 3-4 words just before the found postal code; if there's an overlap, you have correctly identified a postal code and hence an address field (got your 33.33% score!).
For telephone numbers, certain checks and regexes can do it, but an important criterion is to perform these phone checks only if an address field was located in the preceding text.
For emails/web addresses, you can again perform nominal regex checks, and finally add all these 3 scores into a cumulative value.
An ideal address would get a score of 100, while missing fields yield 66%, etc.; the rest of the text would get a score of 0. A rough sketch of this scoring idea follows.
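In this sketch the regexes, the 0.1 threshold, and the en_core_web_lg model name are illustrative assumptions, and the postal-code/city overlap check is simplified to a location-tag percentage:

import re
import spacy

# Assumes the model is installed: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

def address_score(text):
    doc = nlp(text)
    score = 0.0
    # Fraction of tokens covered by location entities.
    loc_tokens = sum(len(ent) for ent in doc.ents
                     if ent.label_ in ("GPE", "LOC", "FAC"))
    has_address = loc_tokens / max(len(doc), 1) > 0.1  # illustrative threshold
    if has_address:
        score += 33.33
    # Per the answer, only run the phone check if an address field was found.
    if has_address and re.search(r"\+?\d[\d\s\-()]{7,}\d", text):
        score += 33.33
    if re.search(r"\S+@\S+\.\S+", text):  # rough email check
        score += 33.33
    return score

print(address_score("Visit us at 234 Market St, Philadelphia PA 19106, call 215 555 0123"))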
Hope it helped! :)
Why do you say regular expressions won't work?
Basically, define all the different forms of address you might encounter as regular expressions. Then just match the expressions. For example, see the sketch below.
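A minimal sketch of that approach; this single pattern only covers a simple US-style "number street-name suffix" form, and a real system would need many such expressions:

import re

# Matches e.g. "10 Main Street" or "234 Market St"; purely illustrative.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s?){1,3}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd)\b"
)

print(ADDRESS_RE.search("Our office is at 10 Main Street, Boston"))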