Using a model to compare name and surname - machine-learning

I have names of employees saved in a text file. I processed the file and compared a name that already exist.
When I checked using most_similar method, I found that it returns totally unrelated name even if the exact same name exist in the corpus.
import gensim
training_file='todel.txt'
mylist=list()
with open(training_file, encoding="iso-8859-1") as f:
for i, line in enumerate(f):
mylist.append(gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i]))
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=55)
model.build_vocab(mylist)
inferred_vector=model.infer_vector(['aakash', 'prakash', 'patel'])
sims = model.docvecs.most_similar([inferred_vector])
' '.join(mylist[sims[0][0]].words)
How do I correctly train the data to return (closely) matching names?

You define similarity in terms of edit distance, i.e. how similar two strings are.
x2vec models define similarity in terms of semantic closeness, i.e. how similar two meanings are, computed through machine learning and co-occurrence statistics.
In other words, you're using a sledgehammer to kill a fly. Look into tools for computing string distance instead:
from Levenshtein import distance
string1 = 'aakash'
string2 = 'akash'
string3 = 'konstantinos'
print(distance(string1, string2))
1
print(distance(string1, string3))
11

Related

Categorical features encoding in H2O

I train GBM models with H2O and want to use them in my backend (not Java). To do so, I download the MOJOs, convert it to ONNX and run it in my apps.
In order to make inference, I need to know how categorical columns transformed to their one-hot encoded versions. I was able to find it in the POJO:
static final void fill(String[] sa) {
sa[0] = "Age";
sa[1] = "Fare";
sa[2] = "Pclass.1";
sa[3] = "Pclass.2";
sa[4] = "Pclass.3";
sa[5] = "Pclass.missing(NA)";
sa[6] = "Sex.female";
sa[7] = "Sex.male";
sa[8] = "Sex.missing(NA)";
}
So, here is the workflow for non-Java backend as I see it:
Encode categorical features with OneHotExplicit.
Train GBM model.
Download MOJO and convert to ONNX.
Download POJO and find feature alignment in the source code.
Implement the inference in your backend.
Is it the most straightforward and correct way?
Thank you for your question.
Can you access the stored categorical values here?
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/algos/tree/SharedTreeMojoModel.java#L72
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/algos/tree/SharedTreeMojoReader.java#L34
https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/main/java/hex/tree/SharedTreeMojoWriter.java#L61
The index in the array means the translated categorical value.
The EasyPredictModelWrapper did it this way:
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/easy/RowToRawDataConverter.java#L44
Can you access the model.ini inside of the zip? There is [domains] tag and under the tag is a list of files in domains/ directory which correspond the categorical encoding for each feature.
e.g:
[columns]
AGE
RACE
DPROS
DCAPS
PSA
VOL
GLEASON
CAPSULE
[domains]
7: 2 d000.txt
means 7th column (CAPSULE) has 2 categorical variables in d000.txt
or there is a experimental/modelDetails.json file that has categorical values under output.domains. The index in the list correspond to the feature in the output.names list.
e.g output.domains[7] are domains for output.names[7] feature.

How Spacy NER verifies the rationality of entities?

When I use SpaCy NER, SpaCy will recognize 'TodoA' as PERSON. This is obviously unreasonable. Is there any way to verify whether the entity extracted by SpaCy is reasonable? Thanks!
Most of these unreasonable entities are extracted by spacy beam search. The beam search code is:
import spacy
import sys
from collections import defaultdict
nlp = spacy.load('en')
text = u'Will Japan join the European Union? If yes, we should \
move to United States. Fasten your belts, America we are coming'
with nlp.disable_pipes('ner'):
doc = nlp(text)
threshold = 0.2
(beams, somethingelse) = nlp.entity.beam_parse([ doc ], beam_width = 16, beam_density = 0.0001)
entity_scores = defaultdict(float)
for beam in beams:
for score, ents in nlp.entity.moves.get_beam_parses(beam):
for start, end, label in ents:
entity_scores[(start, end, label)] += score
print ('Entities and scores (detected with beam search)')
for key in entity_scores:
start, end, label = key
score = entity_scores[key]
if ( score > threshold):
print ('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))
The "unreasonable" annotation you are seeing is directly linked with the nature of the model that is used to perform the annotation and the process of obtaining it.
In short, the model is an approximation of a very complex function (in mathematical terms) from some characteristics of sequences of words (e.g. presence of particular letters, upper-casing, usage of particular terms, etc.) to a closed set of tags (e.g. PERSON). It is an approximation that is close to best across a large body of text (e.g. a few GBs of ASCII text) but certainly it is not a mapping of particular phrases to tags. Therefore, even though the data which is used for training is accurate, the result of applying the model might be not ideal in some circumstances.
In your case it is likely that the model is clinging on upper-casing of a word, and maybe there was a large number of words used in training that share the prefix that were marked with tag PERSON) - e.g. Toddy, toddler, etc. and a very small number of words with such a prefix that were not PERSONs.
This phenomenon that we observe was not chosen explicitly by person preparing the model, it is only a by-product of the combination of the process of preparing it (training), and the data used.

Can you search for related database tables/fields using text similarity?

I am doing a college project where I need to compare a string with list of other strings. I want to know if we have any kind of library which can do this or not.
Suppose I have a table called : DOCTORS_DETAILS
Other Table names are : HOSPITAL_DEPARTMENTS , DOCTOR_APPOINTMENTS, PATIENT_DETAILS,PAYMENTS etc.
Now I want to calculate which one among those are more relevant to DOCTOR_DETAILS ?
Expected output can be,
DOCTOR_APPOINTMENTS - More relevant because of the term doctor matches in both string
PATIENT_DETAILS - The term DETAILS present in both string
HOSPITAL_DEPARTMENTS - least relevant
PAYMENTS - least relevant
Therefore I want to find RELEVENCE based on number of similar terms present on both the strings in question.
Ex : DOCTOR_DETAILS -> DOCTOR_APPOITMENT(1/2) > DOCTOR_ADDRESS_INFORMATION(1/3) > DOCTOR_SPECILIZATION_DEGREE_INFORMATION (1/4) > PATIENT_INFO (0/2)
Semantic similarity is a common NLP problem. There are multiple approaches to look into, but at their core they all are going to boil down to:
Turn each piece of text into a vector
Measure distance between vectors, and call closer vectors more similar
Three possible ways to do step 1 are:
tf-idf
fasttext
bert-as-service
To do step 2, you almost certainly want to use cosine distance. It is pretty straightforward with Python, here is a implementation from a blog post:
import numpy as np
def cos_sim(a, b):
"""Takes 2 vectors a, b and returns the cosine similarity according
to the definition of the dot product
"""
dot_product = np.dot(a, b)
norm_a = np.linalg.norm(a)
norm_b = np.linalg.norm(b)
return dot_product / (norm_a * norm_b)
For your particular use case, my instincts say to use fasttext. So, the official site shows how to download some pretrained word vectors, but you will want to download a pretrained model (see this GH issue, use https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.bin.zip),
Then you'd then want to do something like:
import fasttext
model = fasttext.load_model("model_filename.bin")
def order_tables_by_name_similarity(main_table, candidate_tables):
'''Note: we use a fasttext model, not just pretrained vectors, so we get subword information
you can modify this to also output the distances if you need them
'''
main_v = model[main_table]
similarity_to_main = lambda w: cos_sim(main_v, model[w])
return sorted(candidate_tables, key=similarity_to_main, reverse=True)
order_tables_by_name_similarity("DOCTORS_DETAILS", ["HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS", "PATIENT_DETAILS", "PAYMENTS"])
# outputs: ['PATIENT_DETAILS', 'DOCTOR_APPOINTMENTS', 'HOSPITAL_DEPARTMENTS', 'PAYMENTS']
If you need to put this in production, the giant model size (6.7GB) might be an issue. At that point, you'd want to build your own model, and constrain the model size. You can probably get roughly the same accuracy out of a 6MB model!

How to train one model for several devices

I have some tabular device data comprising a
time column, some tabular features, target classes
There are around 500 rows (not same) in all devices data and target classes are same.
I have same data for around 1000 devices,
I want to train a general model for all the devices for detecting the class.
Can someone help me with the approach to train for the target variable. What kind of models work in this condition
If your device type is part of the data, you can train a decision tree. If the device type feature is important for classification sake, it will be added to the tree. First, create the device type features yourself - a binary column for each device type, like done in one-hot encoding. There will be a binary column per device type - is_device_samsung, is_device_lg, is_device_iphone and so forth. The number of columns created is equal to the number of device types. All but one of these columns will be 0, and the one indicating the current type will be 1. This will not guarantee the device type will be a part of the model - but let the AI decide this for you.
BTW - don't use get_dummies unless you know how to reuse it exactly as needed in the test data.
Another option is to use the python-weka wrapper, which accepts nominal attributes:
Example:
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier
def get_weka_prob(inst):
dist = c.distribution_for_instance(inst)
p = dist[next((i for i, x in enumerate(inst.class_attribute.values) if x == 'DONE'), -1)]
return p
jvm.start()
loader = Loader(classname="weka.core.converters.CSVLoader")
data = loader.load_file(r'.\recs_csv\df.csv')
data.class_is_last()
datatst = loader.load_file(r'.\recs_csv\dftst.csv')
datatst.class_is_last()
c = Classifier("weka.classifiers.trees.J48", options=["-C", "0.1"])
c.build_classifier(data)
print(c)
probstst = [get_weka_prob(inst) for inst in datatst]
jvm.stop()
Weka models are different models that use a java bridge to python - the methods are java methods that can be called using this bridge. To use the dataframe in sklearn - you would have to manipulate it with one-hot encoding. Note that the nominal attributes in weka cannot have any special character in them. so use
df = df.replace([',', '"', "'", "%", ";"], '', regex=True)
for any nominal attribute before saving it to csv.
If you want to ensure that the model_type feature will be included in your model, you can trick it and add a dummy model type - and ensure that the class column for this dummy model is always "1" or "True" - depending on your class variable. If you have enough rows with this dummy model - j48 will open it as the first branch. Once the attribute is selected by j48 - it will be branched for all of the model types, not just the dummy one.

How do I speedup adding two big vectors of tuples?

Recently, I am implementing an algorithm from a paper that I will be using in my master's work, but I've come across some problems regarding the time it is taking to perform some operations.
Before I get into details, I just want to add that my data set comprehends roughly 4kk entries of data points.
I have two lists of tuples that I've get from a framework (annoy) that calculates cosine similarity between a vector and every other vector in the dataset. The final format is like this:
[(name1, cosine), (name2, cosine), ...]
Because of the algorithm, I have two of that lists with the same names (first value of the tuple) in it, but two different cosine similarities. What I have to do is to sum the cosines from both lists, and then order the array and get the top-N highest cosine values.
My issue is: is taking too long. My actual code for this implementation is as following:
def topN(self, user, session):
upref = self.m2vTN.get_user_preference(user)
spref = self.sm2vTN.get_user_preference(session)
# list of tuples 1
most_su = self.indexer.most_similar(upref, len(self.m2v.wv.vocab))
# list of tuples 2
most_ss = self.indexer.most_similar(spref, len(self.m2v.wv.vocab))
# concat both lists and add into a dict
d = defaultdict(int)
for l, v in (most_ss + most_su):
d[l] += v
# convert the dict into a list, and then sort it
_list = list(d.items())
_list.sort(key=lambda x: x[1], reverse=True)
return [x[0] for x in _list[:self.N]]
How do I make this code faster? I've tried using threads but I'm not sure if it will make it faster. Getting the lists is not the problem here, but the concatenation and sorting is.
Thanks! English is not my native language, so sorry for any misspelling.
What do you mean by "too long"? How large are the two lists? Is there a chance your model, and interim results, are larger than RAM and thus forcing virtual-memory paging (which would create frustrating slowness)?
If you are in fact getting the cosine-similarity with all vectors in the model, the annoy-indexer isn't helping any. (Its purpose is to get a small subset of nearest-neighbors much faster, at the expense of perfect accuracy. But if you're calculating the similarity to every candidate, there's no speedup or advantage to using ANNOY.
Further, if you're going to combine all of the distances from two such calculation, there's no need for the sorting that most_similar() usually does - it just makes combining the values more complex later. For the gensim vector-models, you can supply a False-ish topn value to just get the unsorted distances to all model vectors, in order. Then you'd have two large arrays of the distances, in the model's same native order, which are easy to add together elementwise. For example:
udists = self.m2v.most_similar(positive=[upref], topn=False)
sdists = self.m2v.most_similar(positive=[spref], topn=False)
combined_dists = udists + sdists
The combined_dists aren't labeled, but will be in the same order as self.m2v.index2entity. You could then sort them, in a manner similar to what the most_similar() method itself does, to find the ranked closest. See for example the gensim source code for that part of most_similar():
https://github.com/RaRe-Technologies/gensim/blob/9819ce828b9ed7952f5d96cbb12fd06bbf5de3a3/gensim/models/keyedvectors.py#L557
Finally, you might not need to be doing this calculation yourself at all. You can provide more-than-one vector to most_similar() as the positive target, and then it will return the vectors closest to the average of both vectors. For example:
sims = self.m2v.most_similar(positive=[upref, spref], topn=len(self.m2v))
This won't be the same value/ranking as your other sum, but may behave very similarly. (If you wanted less-than-all of the similarities, then it might make sense to use the ANNOY indexer this way, as well.)

Resources