I have a task in front of me: building a custom NER model for the pharmaceutical industry. I have a finite list of drugs and over 4,000 text files from which NER is supposed to be done. I have also tried entity matching with spaCy, but it throws an error. So now I plan to use sklearn-crfsuite, but for that my data needs to be annotated in CoNLL format. I would really appreciate it if someone could guide me in annotating my text files. Is there any way I can run automatic annotation on the text files using the drug list I have? It is a humongous effort for an individual to do the same manually. I also had a look at the question linked below:
NER model to recognize Indian names
But no one has actually addressed my question. I would really appreciate it if someone could help me out.
spaCy code:
import pandas as pd
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for label, start, end in matches:
            span = Span(doc, start, end, label=label)
            spans.append(span)
        doc.ents = spans
        return doc

data = pd.read_excel(r'C:\Users\xyz\pname.xlsx')
ld = list(set(data['Product']))

nlp = spacy.load('en')
entity_matcher = EntityMatcher(nlp, ld, 'DRUG')
nlp.add_pipe(entity_matcher)
print(nlp.pipe_names)

doc = nlp('Hi bnbbn, ope all is well. In preparation for the bcbcb is there anything that BGTD requires specifically? We had sent you the US centric Briefing Package to align with our previous discussion on having bkjnsd included in the Wave 1 IMOVAX POLIO submission plan. If you would like, we can set-up a BGTD specific meeting after the June 20th meeting to discuss any jk specific product questions you may have as the product mix is a bit different between countries.')
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
When I run my script, this is the error I get:
[T002] Pattern length (11) >= phrase_matcher.max_length (10). Length can be set on initialization, up to 10.
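A likely workaround for this error (a sketch, assuming a pre-2.1 spaCy, where PhraseMatcher enforces a 10-token cap that the 2.1 rewrite removed) is to either upgrade spaCy or skip the few drug names longer than the cap when building the patterns:

# inside EntityMatcher.__init__, drop patterns that hit the default cap of 10 tokens
patterns = [nlp(term) for term in terms]
patterns = [p for p in patterns if len(p) < 10]
self.matcher.add(label, None, *patterns)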
When I use spaCy NER, spaCy recognizes 'TodoA' as PERSON. This is obviously unreasonable. Is there any way to verify whether an entity extracted by spaCy is reasonable? Thanks!
Most of these unreasonable entities are extracted by spaCy's beam search. The beam search code is:
import spacy
from collections import defaultdict

nlp = spacy.load('en')
text = u'Will Japan join the European Union? If yes, we should \
move to United States. Fasten your belts, America we are coming'

with nlp.disable_pipes('ner'):
    doc = nlp(text)

threshold = 0.2
beams, _ = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)

# Accumulate the probability mass of each candidate entity across all beams
entity_scores = defaultdict(float)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

print('Entities and scores (detected with beam search)')
for key in entity_scores:
    start, end, label = key
    score = entity_scores[key]
    if score > threshold:
        print('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))
The "unreasonable" annotation you are seeing is directly linked with the nature of the model that is used to perform the annotation and the process of obtaining it.
In short, the model is an approximation of a very complex function (in mathematical terms) from some characteristics of sequences of words (e.g. presence of particular letters, upper-casing, usage of particular terms, etc.) to a closed set of tags (e.g. PERSON). It is an approximation that is close to best across a large body of text (e.g. a few GBs of ASCII text) but certainly it is not a mapping of particular phrases to tags. Therefore, even though the data which is used for training is accurate, the result of applying the model might be not ideal in some circumstances.
In your case it is likely that the model is clinging on upper-casing of a word, and maybe there was a large number of words used in training that share the prefix that were marked with tag PERSON) - e.g. Toddy, toddler, etc. and a very small number of words with such a prefix that were not PERSONs.
This phenomenon that we observe was not chosen explicitly by person preparing the model, it is only a by-product of the combination of the process of preparing it (training), and the data used.
I am doing a college project where I need to compare a string with a list of other strings. I want to know if there is any library that can do this.
Suppose I have a table called DOCTORS_DETAILS.
Other table names are HOSPITAL_DEPARTMENTS, DOCTOR_APPOINTMENTS, PATIENT_DETAILS, PAYMENTS, etc.
Now I want to calculate which of those is the most relevant to DOCTORS_DETAILS.
Expected output can be:
DOCTOR_APPOINTMENTS - more relevant, because the term DOCTOR appears in both strings
PATIENT_DETAILS - the term DETAILS appears in both strings
HOSPITAL_DEPARTMENTS - least relevant
PAYMENTS - least relevant
Therefore I want to find relevance based on the number of terms shared by the two strings in question.
Ex : DOCTOR_DETAILS -> DOCTOR_APPOITMENT(1/2) > DOCTOR_ADDRESS_INFORMATION(1/3) > DOCTOR_SPECILIZATION_DEGREE_INFORMATION (1/4) > PATIENT_INFO (0/2)
Semantic similarity is a common NLP problem. There are multiple approaches to look into, but at their core they are all going to boil down to:
Turn each piece of text into a vector
Measure the distance between vectors, and call closer vectors more similar
Three possible ways to do step 1 are:
tf-idf
fasttext
bert-as-service
To do step 2, you almost certainly want to use cosine distance. It is pretty straightforward in Python; here is an implementation from a blog post:
import numpy as np

def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according
    to the definition of the dot product
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
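If you want to try the tf-idf option first, here is a minimal sketch using scikit-learn (my assumption, not part of the original answer; character n-grams are used so that DOCTOR and DOCTORS still overlap):

from sklearn.feature_extraction.text import TfidfVectorizer

names = ["DOCTORS_DETAILS", "HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS",
         "PATIENT_DETAILS", "PAYMENTS"]
# character n-grams tolerate small spelling differences in table names
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 4))
vectors = vectorizer.fit_transform(names).toarray()
# compare every candidate against DOCTORS_DETAILS using cos_sim from above
for name, vec in zip(names[1:], vectors[1:]):
    print(name, cos_sim(vectors[0], vec))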
For your particular use case, my instinct says to use fastText. The official site shows how to download some pretrained word vectors, but you will want to download a pretrained model (see this GH issue; use https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.bin.zip).
You'd then want to do something like:
import fasttext

model = fasttext.load_model("model_filename.bin")

def order_tables_by_name_similarity(main_table, candidate_tables):
    '''Note: we use a fasttext model, not just pretrained vectors, so we get subword information.
    You can modify this to also output the distances if you need them.
    '''
    main_v = model[main_table]
    similarity_to_main = lambda w: cos_sim(main_v, model[w])
    return sorted(candidate_tables, key=similarity_to_main, reverse=True)

order_tables_by_name_similarity("DOCTORS_DETAILS", ["HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS", "PATIENT_DETAILS", "PAYMENTS"])
# outputs: ['PATIENT_DETAILS', 'DOCTOR_APPOINTMENTS', 'HOSPITAL_DEPARTMENTS', 'PAYMENTS']
If you need to put this into production, the giant model size (6.7 GB) might be an issue. At that point, you'd want to train your own model and constrain its size. You can probably get roughly the same accuracy out of a 6 MB model!
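A minimal sketch of what that could look like (the parameters are illustrative, and table_names_corpus.txt is a hypothetical file containing your own text):

import fasttext

# skipgram model with small dimensions and short subword n-grams;
# a smaller dim means a much smaller file, at some cost in accuracy
small_model = fasttext.train_unsupervised('table_names_corpus.txt',
                                          model='skipgram', dim=50,
                                          minn=2, maxn=5)
small_model.save_model('small_model.bin')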
I have an ML.NET project, and as of right now everything has gone great. I have a motor that collects a power reading 256 times around each rotation, and I push that into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values on it at a time, so I have been spending several rotations to collect the full 256 samples for my training data.
I would like to cut the sample size down to 38 so that every rotation I can determine its state. If I just evenly space the samples down to 38, my model degrades a lot. I know I am not feeding the model the features it thinks are most important, but rather making a guess and randomly selecting data for the model.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this, and I found the statement below about it (link).
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out such things as the weight for each column, and how would I do that?
I have actually only been working with ML.NET for a couple of weeks now, so I apologize if the question is naive; I assure you I have googled this as many ways as I can think of. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer; I was going down a completely useless path. I have been trying to get it to work following the example you linked to. I have 260 columns with numbers and one column with the condition as one of five text strings. This is the condition I am trying to predict.
The first time I tried it, it threw the error "expecting single but got string". No problem, I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert to key values, and then it threw the error "expected Single, got Key UInt32". Any ideas on how to push that into this function?
At any rate, thank you for the reply, but I guess my upvotes don't count yet, sorry. Hopefully I can upvote it later, or someone else here can. Below is the code example.
using System;
using System.Collections.Immutable;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Create MLContext
MLContext mlContext = new MLContext();

// Load data (ModelInput and TRAIN_DATA_FILEPATH are defined elsewhere in the project)
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);

// 1. Get the column names of the input features.
string[] featureColumnNames =
    data.Schema
        .Select(column => column.Name)
        .Where(columnName => columnName != "Label").ToArray();

// 2. Define estimator with data pre-processing steps
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", featureColumnNames)
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));

// 3. Create transformer using the data pre-processing estimator
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data); // error here

// 4. Pre-process the training data
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);

// 5. Define Stochastic Dual Coordinate Ascent machine learning estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();

// 6. Train machine learning model
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);

ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
    mlContext
        .Regression
        .PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);

// Order features by importance
var featureImportanceMetrics =
    permutationFeatureImportance
        .Select((metric, index) => new { index, metric.RSquared })
        .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));

Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
    Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}
I believe what you are looking for is called Permutation Feature Importance (PFI). It tells you which features are most important by changing each feature in isolation and then measuring how much that change affects the model's performance metrics. You can use this to see which features are the most important to the model.
Interpret model predictions using Permutation Feature Importance is the doc that describes how to use this API in ML.NET.
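For intuition, the underlying algorithm is framework-agnostic; here is a minimal Python sketch of the idea (purely illustrative, not the ML.NET API):

import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=3):
    # metric is a "higher is better" score, e.g. R-squared
    baseline = metric(y, model.predict(X))
    rng = np.random.default_rng(0)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])  # break the link between feature j and the target
            drops.append(baseline - metric(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)  # a big drop means an important feature
    return importances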
You can also use an open-source set of packages; they are much more sophisticated than what is found in ML.NET. I have an example on my GitHub of how to use R with advanced explainer packages to explain ML.NET models. You can get local, instance-level explanations as well as global model breakdowns/details/diagnostics/feature interactions, etc.
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX
I'm trying to detect contact information in a huge list of offers I receive. The offers contain text without any given structure; some examples could be the following:
if you're interested, send an email to test@test.com
want to know more? call 000 000 000
come to the public viewing on 25nd of january
public viewing is on the coming wednesday
is this what you're searching for? We're looking forward to hearing from you
As you can see, there are multiple possibilities:
there is no date for a viewing, but there's a phone number
there is no date for a viewing, but there's an email
there is a date for a viewing
there is no detailed information
The tricky part is that there can also be other dates in the text, so I can't just parse out dates.
What is the best way to solve something like this? I've already tried regex. I think I could get it to work, but there is an enormous number of cases, which makes it very hard.
I've also looked into NLP libraries like https://spacy.io/ and prodi.gy, but I feel like I'm not on the right track.
The original texts are written in German.
In 2020, how should I approach this?
You can use an NLP-powered rule-based matcher. With spaCy you explored the right tool, you just didn't go deep enough with it. And it's available in German.
Here are some example patterns:
# call-number pattern
call_pattern = [{'LOWER': 'call'}, {'ORTH': '(', 'OP': '?'}, {'SHAPE': 'ddd'},
                {'ORTH': ')', 'OP': '?'}, {'SHAPE': 'ddd'},
                {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'ddd'}]

# e-mail pattern
email_pattern = [{'LIKE_EMAIL': True}]

# pattern for public viewing
public_viewing_pattern = [{'LOWER': 'public'},
                          {'LOWER': 'viewing'},
                          {'POS': 'AUX', 'OP': '?'},
                          {'POS': 'ADP', 'OP': '?'},
                          {'ENT_TYPE': 'DATE', 'OP': '+'}]  # one or more tokens inside a DATE entity
Then you add the patterns to a matcher and apply it to your sentences:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
# or, for German:
# import de_core_news_sm
# nlp = de_core_news_sm.load()

matcher = Matcher(nlp.vocab)
matcher.add("call_pattern", None, call_pattern)
matcher.add("email_pattern", None, email_pattern)
matcher.add("public_viewing_pattern", None, public_viewing_pattern)

# example sentences from the question
sentences = ["if you're interested, send an email to test@test.com",
             "want to know more? call 000 000 000",
             "come to the public viewing on 25nd of january",
             "public viewing is on the coming wednesday"]

found = {'numbers': [], 'emails': [], 'public_viewings': []}
for sent in sentences:
    doc = nlp(sent)
    matches = matcher(doc)
    for match_id, start, end in matches:
        if doc.vocab.strings[match_id] == 'call_pattern':
            found['numbers'].append(doc[start:end])
        if doc.vocab.strings[match_id] == 'email_pattern':
            found['emails'].append(doc[start:end])
        if doc.vocab.strings[match_id] == 'public_viewing_pattern':
            found['public_viewings'].append(doc[start:end])
print(found)
result:
{'numbers': [call 000 000 000], 'emails': [test@test.com], 'public_viewings': [public viewing on, public viewing on 25nd, public viewing on 25nd of, public viewing on 25nd of january, public viewing is, public viewing is on, public viewing is on the, public viewing is on the coming, public viewing is on the coming wednesday]}
P.S.: This repetition is caused by a bug in spaCy versions prior to 2.1. Just add some manual validation for repeated matches (keep the longest one) and you'll be good.
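A minimal sketch of that validation (my own addition, assuming overlapping matches should collapse to the longest span):

def keep_longest(matches):
    # prefer longer (start, end) spans; drop any match overlapping a kept one
    kept = []
    for match_id, start, end in sorted(matches, key=lambda m: m[2] - m[1], reverse=True):
        if all(end <= s or start >= e for _, s, e in kept):
            kept.append((match_id, start, end))
    return sorted(kept, key=lambda m: m[1])

# usage: matches = keep_longest(matcher(doc))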
The hard part will be to generalize enough and get your patterns right, but they are very powerful and you can do all sorts of tweaks to them. Check the spaCy online demo for testing. Also, refer to the manual for more complex stuff.
I want to tag text based on the category it belongs to.
For example ...
"Clutch and gear is monitored using microchip " -> clutch /mechanical , gear/mechanical , microchip / electronic
"software used here to monitor hydrogen levels" -> software/computer , hydrogen / chemistry ..
How to do this using openNLP or other NLP engines.
My work so far
I tried an NER model, but it needs a large training corpus, which I don't have.
My need
Are any ready-made training corpora available for NER or classification? They must contain scientific and engineering words.
If you want to create a set of class labels for an entire sentence, then you will want to use the Doccat lib (OpenNLP's document categorizer). With Doccat you get a probability distribution for each chunk of text.
With Doccat, your sample would produce something like this:
"Clutch and gear is monitored using microchip" -> mechanical 0.85847568, electronic 0.374658
With Doccat you will lose the keyword -> class-label mapping, so if you really need that, Doccat might not cut it.
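For reference, the document categorizer trains on plain text with one document per line, the category label first; a tiny hypothetical training file based on your examples might look like:

mechanical	Clutch and gear is monitored using microchip
computer	software used here to monitor hydrogen levels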
As for NER, OpenNLP has an addon called Modelbuilder-addon that may help you. It is designed to expedite NER model building. You can create a file/list of as many terms for each category as you can think of, then create a file with a bunch of sentences, and use the addon to create an NER model from the seed terms and the sentence file. See this post, where I described it before with a code example; you will have to pull the addon down from SVN:
OpenNLP: foreign names does not get recognized