Normalize company names using a long set of rules - parsing

We have a large table (>30M rows) containing company names and other characteristics.
Data:
Company_id  Type  Name               Address (more...)
497651684   8     Big mall Toys'rUs  BigMall adress
468468486   1     McDonnnals         WhateverStreet
161684314   8     Toys R Us          Another street
546846846   1     BgKing             BigMall2 adress
484984988   5     IKEA store103      Other Mall
489616848   5     Mss Duty           Addrs
484984984   5     Pull&Bear          Adrss
468484867   5     Zara store         Adress2
(...)
From that table, we have identified about 300 company groups whose names could be normalized easily with something along the lines of:
if type is (8,10,85,2)
and
(
    contains name("toys","us")
    or
    stringDistance name("toys R us") < (X)
)
new name is "Toys 'R us"

if type is (1,90,7)
and
(
    contains name("donalds")
    or
    stringDistance name("mcdonalds") < (X)
)
new name is "Mc donalds"
(...)
I'm honestly not sure what the best approach would be. Previously we took an ad hoc approach for a much smaller set with simpler logic, as a quick solution. This time I would love to know what the ideal way would be.
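The rule list above could be expressed as data plus one small matching function. This is only a sketch under assumptions: `RULES`, `MIN_SIMILARITY` and `normalize` are hypothetical names, and `difflib.SequenceMatcher` from the standard library stands in for a real edit distance (it returns a similarity in [0, 1], so the threshold test is flipped compared with `stringDistance < X`).

```python
import difflib

# Hypothetical rule table: (allowed types, required substrings,
# reference spelling, canonical name). Taken from the two rules above.
RULES = [
    ({8, 10, 85, 2}, ("toys", "us"), "toys r us", "Toys 'R us"),
    ({1, 90, 7}, ("donalds",), "mcdonalds", "Mc donalds"),
]

MIN_SIMILARITY = 0.8  # assumed threshold; plays the role of X in the rules


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from difflib; a stand-in for an edit distance."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def normalize(company_type: int, name: str) -> str:
    """Return the canonical name of the first matching rule, else the name unchanged."""
    lowered = name.lower()
    for types, substrings, reference, canonical in RULES:
        if company_type not in types:
            continue
        contains_all = all(s in lowered for s in substrings)
        close_enough = similarity(lowered, reference) >= MIN_SIMILARITY
        if contains_all or close_enough:
            return canonical
    return name
```

Keeping the rules in a table rather than in nested conditionals means the roughly 300 groups can be maintained as data, for example loaded from a file, without touching the matching logic.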

While string edit distance, e.g. stringDistance name("toys R us") < (X), is a good approach, I would also recommend trying clustering, especially hierarchical clustering.
All the names falling into the same cluster should get the same standard company name. For this to work you will have to cut the dendrogram (http://en.wikipedia.org/wiki/Dendrogram) of the hierarchy pretty close to the leaves. You will have to try different features (the ones used to calculate the distance or similarity) to arrive at a suitable set. For example, you could represent each company name as a vector of characters (such as character counts) and use cosine similarity to measure distances. Cosine similarity tends to work well for text.
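To make the clustering idea concrete, here is a minimal standard-library sketch. It assumes character bigram counts as the features and single linkage as the merge rule; neither choice comes from the answer. Cutting a single-linkage dendrogram at height `cut` gives the same groups as linking every pair whose distance is at most `cut`, which is what the union-find loop does. The function names and the default `cut=0.5` are hypothetical.

```python
import math
from collections import Counter


def bigrams(name: str) -> Counter:
    """Character-bigram counts of a lowercased name, non-alphanumerics removed."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum())
    return Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity of two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def cluster(names, cut=0.5):
    """Single-linkage clusters obtained by cutting the dendrogram at height `cut`."""
    vectors = [bigrams(n) for n in names]
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if cosine_distance(vectors[i], vectors[j]) <= cut:
                parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())
```

The pairwise loop is quadratic, so on a table with tens of millions of rows it would only be practical inside small blocks, for example per Type value or per shared first letter.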

Related

Can you search for related database tables/fields using text similarity?

I am doing a college project where I need to compare a string with a list of other strings. I would like to know whether there is a library that can do this.
Suppose I have a table called DOCTORS_DETAILS.
Other table names are HOSPITAL_DEPARTMENTS, DOCTOR_APPOINTMENTS, PATIENT_DETAILS, PAYMENTS, etc.
Now I want to calculate which of those are most relevant to DOCTOR_DETAILS.
The expected output could be:
DOCTOR_APPOINTMENTS - more relevant because the term DOCTOR matches in both strings
PATIENT_DETAILS - the term DETAILS is present in both strings
HOSPITAL_DEPARTMENTS - least relevant
PAYMENTS - least relevant
So I want to compute relevance based on the number of shared terms between the two strings in question.
Ex: DOCTOR_DETAILS -> DOCTOR_APPOINTMENT (1/2) > DOCTOR_ADDRESS_INFORMATION (1/3) > DOCTOR_SPECIALIZATION_DEGREE_INFORMATION (1/4) > PATIENT_INFO (0/2)
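The scoring in the example above (shared terms divided by the number of terms in the candidate name) can be sketched without any library. The function names are hypothetical, and terms are compared by exact match, so DOCTORS and DOCTOR would count as different terms unless some stemming is added.

```python
def terms(table_name: str) -> set:
    """Split an underscore-separated table name into a set of upper-case terms."""
    return {t for t in table_name.upper().split("_") if t}


def relevance(main_table: str, candidate: str) -> float:
    """Shared terms divided by the number of terms in the candidate name."""
    candidate_terms = terms(candidate)
    if not candidate_terms:
        return 0.0
    return len(terms(main_table) & candidate_terms) / len(candidate_terms)


def rank(main_table, candidates):
    """Candidates ordered from most to least relevant."""
    return sorted(candidates, key=lambda c: relevance(main_table, c), reverse=True)


candidates = [
    "PATIENT_INFO",
    "DOCTOR_SPECIALIZATION_DEGREE_INFORMATION",
    "DOCTOR_APPOINTMENT",
    "DOCTOR_ADDRESS_INFORMATION",
]
print(rank("DOCTOR_DETAILS", candidates))
```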
Semantic similarity is a common NLP problem. There are multiple approaches to look into, but at their core they all boil down to:
1. Turn each piece of text into a vector
2. Measure the distance between vectors, and call closer vectors more similar
Three possible ways to do step 1 are:
tf-idf
fasttext
bert-as-service
To do step 2, you almost certainly want to use cosine distance. It is pretty straightforward in Python; here is an implementation from a blog post:
import numpy as np

def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according
    to the definition of the dot product
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
For your particular use case, my instincts say to use fasttext. The official site shows how to download some pretrained word vectors, but you will want to download a pretrained model (see this GH issue, use https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.bin.zip).
You'd then want to do something like:
import fasttext

model = fasttext.load_model("model_filename.bin")

def order_tables_by_name_similarity(main_table, candidate_tables):
    '''Note: we use a fasttext model, not just pretrained vectors, so we get subword information
    you can modify this to also output the distances if you need them
    '''
    main_v = model[main_table]
    similarity_to_main = lambda w: cos_sim(main_v, model[w])
    return sorted(candidate_tables, key=similarity_to_main, reverse=True)

order_tables_by_name_similarity("DOCTORS_DETAILS", ["HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS", "PATIENT_DETAILS", "PAYMENTS"])
# outputs: ['PATIENT_DETAILS', 'DOCTOR_APPOINTMENTS', 'HOSPITAL_DEPARTMENTS', 'PAYMENTS']
If you need to put this in production, the giant model size (6.7GB) might be an issue. At that point, you'd want to build your own model, and constrain the model size. You can probably get roughly the same accuracy out of a 6MB model!
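If a large pretrained model is not an option, the tf-idf route listed earlier needs no model at all. The sketch below is an assumption-laden illustration, not what the answer ran: it uses character trigrams as the terms, a smoothed idf of log((1 + N) / (1 + df)) + 1, and hypothetical function names. Vectors are kept as sparse dicts, so the cosine is computed over dict keys rather than with the numpy function above.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(names, n=3):
    """Map each name to a sparse {ngram: tf * idf} dict using a smoothed idf."""
    counts = [Counter(char_ngrams(name, n)) for name in names]
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(names)
    idf = {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]


def sparse_cos_sim(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


names = ["DOCTORS_DETAILS", "HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS",
         "PATIENT_DETAILS", "PAYMENTS"]
vectors = tfidf_vectors(names)
for name, vector in zip(names[1:], vectors[1:]):
    print(name, round(sparse_cos_sim(vectors[0], vector), 3))
```

Unlike fasttext, this only captures surface overlap between names; it knows nothing about meaning, so two related tables with no characters in common score zero.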

How to check an input string contains street address or not?

We want to identify the address fields in a document. To do so, we ran the document through OCR with Tesseract. Given the Tesseract output, we want to check whether a string contains an address or not. What is the right strategy to solve this problem?
It's not possible to solve this with a regex, because address formats differ across documents and countries.
We tried NLTK for classifying the words, but it does not work well for address fields.
Required output
I am staying at 234 23 Philadelphia - Contains address field <234 23 Philadelphia>
I am looking for a place to stay - Does not contain an address
Please suggest how to solve this problem.
As in many ML problems, there are multiple possible solutions, and the important part (the one that commonly has the greatest impact) is not which algorithm or model you use, but feature engineering, data preprocessing, standardization and the like. The first solution that comes to my mind (it's just an idea; I would test it and see how it performs) is:
1. Take your training set examples and list the N most commonly used words across all examples. That list is your vocabulary, and every word in it is represented by a number (its list index).
2. Transform your training examples: read every training example and replace every word with its number in the vocabulary.
3. Finally, for every training example create a feature vector of the same size as the vocabulary. For every word in the vocabulary, the feature vector holds 0 (the word does not occur in the example) or 1 (it occurs), or the count of how many times the word appears (again, this is feature engineering).
4. Train multiple classifiers, varying algorithms, parameters, training set sizes, etc., and use cross-validation to choose your best model.
And from there, follow the standard ML workflow...
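The vocabulary and feature-vector steps above can be sketched in a few lines of plain Python. The function names are hypothetical and tokenization is a naive whitespace split, which OCR output would likely need to improve on; training and cross-validating classifiers on the resulting vectors is left to whichever ML library you use.

```python
from collections import Counter


def build_vocabulary(examples, n):
    """Map each of the n most frequent words across all examples to an index."""
    counts = Counter(word for text in examples for word in text.lower().split())
    return {word: index for index, (word, _) in enumerate(counts.most_common(n))}


def to_feature_vector(text, vocabulary, binary=True):
    """Vector of vocabulary size holding 0/1 presence flags or word counts."""
    vector = [0] * len(vocabulary)
    for word in text.lower().split():
        index = vocabulary.get(word)
        if index is not None:
            vector[index] = 1 if binary else vector[index] + 1
    return vector


examples = [
    "I am staying at 234 23 Philadelphia",
    "I am looking for a place to stay",
]
vocabulary = build_vocabulary(examples, n=5)
features = [to_feature_vector(text, vocabulary) for text in examples]
```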
If you are only interested in a YES or NO answer and not in extracting the complete address, one simple solution is NER.
You can check whether the text contains a location or not.
For example:
import nltk

# Requires the NLTK tokenizer, POS tagger and named-entity chunker data to be downloaded first.
def check_location(text):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text))):
        # Only named-entity subtrees have a label; plain tokens are tuples.
        if hasattr(chunk, "label"):
            if chunk.label() == "GPE" or chunk.label() == "GSP":
                return "True"
    return "False"

text = "I am staying at 234 23 Philadelphia."
print(text + " - " + check_location(text))
text = "I am looking for a place to stay."
print(text + " - " + check_location(text))
Output:
# I am staying at 234 23 Philadelphia. - True
# I am looking for a place to stay. - False
If you want to extract the complete address as well, you will need to train your own model.
You can check: NER with NLTK, CRF++.
You're right. Using regex to find an address in a string is messy.
There are APIs that will attempt to extract addresses for you. They are not guaranteed to find every address in a string, but they will do their best. One example of a street address extraction API is from SmartyStreets. Documentation here and demo here.
Something to consider is that even your example (I am staying at 234 23 Philadelphia) doesn't contain a full address. It's missing a state or ZIP code field. This makes it very difficult to programmatically determine whether there is an address. Once a state or ZIP code is added to that sample string (I am staying at 234 23 Philadelphia PA), it becomes much easier to programmatically determine whether the string contains an address.
Disclaimer: I work for SmartyStreets
A better way to approach this task could be the following:
Train your own custom NER model (extending spaCy's pre-trained model, or building your own CRF++ / CRF-biLSTM model if you have annotated data), or use a pre-trained model such as spaCy's large model, geopandas, etc.
Define a weighted scoring mechanism based on your problem statement.
For example, let's assume every address has 3 important components: an address, a telephone number and an email ID.
Text that has all three of them would get a score of 33.33% + 33.33% + 33.33% = 100%.
To identify whether it is an address field, you may take into account the percentage of spaCy's location tags (GPE, FAC, LOC, etc.) out of the total tokens in the text, which gives a good estimate of how many location tags are present. Then run a regex for postal codes and match the found city names against the 3-4 words just before the found postal code. If there is an overlap, you have correctly identified a postal code and hence an address field (got your 33.33% score!).
For telephone numbers, certain checks and regexes could do it, but an important criterion is that these phone checks are performed only if an address field has been located in the text above.
For emails/web addresses, again you could perform normal regex checks, and finally add all 3 scores into a cumulative value.
An ideal address would get a score of 100, while missing fields will yield 66%, etc. The rest of the text would get a score of 0.
Hope it helped! :)
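A rough sketch of that scoring scheme follows. Everything specific in it is an assumption made for illustration: the regular expressions are simplistic and US-centric, `KNOWN_CITIES` is a hard-coded stand-in for the location entities an NER model would return, and the function names are hypothetical.

```python
import re

# Hypothetical, US-centric patterns; real documents will need broader ones.
POSTAL_CODE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
PHONE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

# Stand-in for the location entities an NER model would return.
KNOWN_CITIES = {"philadelphia", "boston"}


def has_address(text):
    """True when a known city appears in the four words before a postal code."""
    for match in POSTAL_CODE.finditer(text):
        preceding = re.findall(r"[a-z]+", text[:match.start()].lower())[-4:]
        if KNOWN_CITIES & set(preceding):
            return True
    return False


def address_score(text):
    """Score in [0, 100]: one third each for address, phone and email."""
    if not has_address(text):
        # Phone and email are only checked once an address has been located.
        return 0.0
    found = 1 + bool(PHONE.search(text)) + bool(EMAIL.search(text))
    return round(100 * found / 3, 2)


print(address_score("234 Market St, Philadelphia PA 19104, 215-555-0100, jane@example.com"))
```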
Why do you say regular expressions won't work?
Basically, define all the different forms of address you might encounter in the form of regular expressions. Then, just match the expressions.
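As a small illustration of that approach, the sketch below keeps one pattern per address form and reports every match. The two patterns (a US-style street line and a German-style one) are hypothetical examples, not a complete catalogue, and the list would have to grow with every new country or document layout, which is the maintenance cost the question worries about.

```python
import re

# Hypothetical patterns; each covers one address form.
ADDRESS_PATTERNS = [
    # US style: house number, one to three name words, street type.
    re.compile(
        r"\b\d{1,5}\s+(?:[A-Za-z]+\s+){1,3}(?:St|Street|Ave|Avenue|Rd|Road|Blvd)\b\.?",
        re.IGNORECASE,
    ),
    # German style: street name ending in "straße"/"strasse", then house number.
    re.compile(r"\b[A-Za-zäöüÄÖÜß]+(?:straße|strasse)\s+\d{1,4}[a-z]?\b", re.IGNORECASE),
]


def find_addresses(text):
    """Return every substring matched by any of the address patterns."""
    return [m.group(0) for p in ADDRESS_PATTERNS for m in p.finditer(text)]


print(find_addresses("Send it to 1600 Pennsylvania Avenue today"))
print(find_addresses("I am staying at 234 23 Philadelphia"))
```

Note that the asker's own sample sentence matches neither pattern, since it has no street type, so a pattern list alone would classify it as containing no address.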

How to get the data to Feature Space Y from Input Space I

I am trying to implement a Support Vector Machine to understand the ins and outs of it, but I am stuck on how to implement it.
Everywhere it is explained how to find a hyperplane that separates the different classes. My question is how to get the data from Input Space I to Feature Space Y.
Like for example consider below data:
date userId pc activity
01/04/2010 07:12:31 RES0962 PC-3736 Connect
01/04/2010 07:35:40 RES0962 PC-2588 Disconnect
01/04/2010 08:02:14 ZKH0388 PC-1021 Connect
01/04/2010 08:20:17 ZKH0388 PC-3736 Disconnect
Q) Assume we are trying to build a user behavior model. We can extract features for each user and use them for training, but in terms of code, how does that work? I have no idea about that. If someone could explain it, that would be a great help.
Mapping to feature space requires you to have a weight for each of the distinct features that determine the classes of your input. Choosing the weights depends on clearly understanding the theoretical basis of your project. For example, your financial worth is determined by money in the bank and investments. The weight of money in the bank might be 2, while the weight for investments might be 5. Therefore, somebody with more investments and less money in the bank will likely have a higher net worth.
The two features, money in the bank and investments, are then treated as coordinates x and y respectively for each input data point (with two features, of course). Imagine plotting each data point at its weighted (x, y) coordinates. Getting the hyperplane is then the next challenge. I hope this helps. Good luck.
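In terms of code, the first step is usually to turn the raw log into one numeric vector per user. The sketch below does that for the sample rows in the question; the four features (connect count, disconnect count, number of distinct PCs, mean hour of activity) are hypothetical choices for illustration, and the date is assumed to be month/day/year. These vectors are the points in input space; with a kernel SVM, the mapping into feature space is done implicitly by the kernel function inside the library rather than computed by hand.

```python
from collections import defaultdict
from datetime import datetime

LOG = """\
01/04/2010 07:12:31 RES0962 PC-3736 Connect
01/04/2010 07:35:40 RES0962 PC-2588 Disconnect
01/04/2010 08:02:14 ZKH0388 PC-1021 Connect
01/04/2010 08:20:17 ZKH0388 PC-3736 Disconnect
"""


def user_features(log_text):
    """Return {user_id: [connects, disconnects, distinct_pcs, mean_hour]}."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        if not line.strip():
            continue
        date, time, user, pc, activity = line.split()
        # Assumes month/day/year; the sample data does not disambiguate this.
        stamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
        events[user].append((stamp, pc, activity))

    features = {}
    for user, rows in events.items():
        hours = [stamp.hour + stamp.minute / 60 for stamp, _, _ in rows]
        features[user] = [
            sum(1 for _, _, a in rows if a == "Connect"),
            sum(1 for _, _, a in rows if a == "Disconnect"),
            len({pc for _, pc, _ in rows}),
            sum(hours) / len(hours),
        ]
    return features


for user, vector in user_features(LOG).items():
    print(user, vector)
```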

How to quantify these features so they can be analysed upon using Logistic Regression?

I have a very small question which has been baffling me for a while. I have a dataset with interesting features, but some of them are dimensionless quantities (I've tried using z-scores on them, but that made things worse). These are:
Timestamps (like YYYYMMDDHHMMSSMis). I am taking the last 9 chars from this.
User IDs (in a hash form). How do I extract meaning from them?
IP addresses (you know what those are). I only extract the first 3 chars.
City (has an ID like 1, 15, 72). How do I extract meaning from this?
Region (same as city). Should I extract meaning from this or just leave it?
The rest of the features are prices, widths and heights, which I understand. Any help or insight would be much appreciated. Thank you.
Timestamps can be transformed into Unix timestamps, which are reasonable natural numbers.
User IDs/cities/regions are nominal values, which have to be encoded somehow. The most common approach is to create as many "dummy" dimensions as there are possible values. So if you have 100 cities, then you create 100 dimensions and put "1" only in the one representing a particular city (and 0 in the others).
IPs should rather be removed, or transformed into a small number of groups (based on DNS/network identification, followed by the nominal-to-dummy transformation above).
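Both transformations are short to write by hand. The sketch below assumes the timestamp string is 14 digits of date and time followed by three digits of milliseconds and that it is in UTC; neither is confirmed by the question. The function names are hypothetical.

```python
from datetime import datetime, timezone


def to_unix_seconds(stamp):
    """Convert 'YYYYMMDDHHMMSSmmm' (assumed UTC, trailing milliseconds) to Unix time."""
    parsed = datetime.strptime(stamp[:14], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    millis = int(stamp[14:17] or 0)
    return parsed.timestamp() + millis / 1000


def one_hot(value, categories):
    """Vector with a 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


city_ids = [1, 15, 72]
print(to_unix_seconds("20000101000000000"))
print(one_hot(15, city_ids))
```

For high-cardinality values such as hashed user IDs, one dimension per value can become very wide, so it is worth checking how many distinct values there are before encoding them this way.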

Naive Bayes training set optimization

I am working on a naive bayes classifier that takes a bunch of user profile data such as:
Name
City
State
School
Email Address
URLS { ... }
The last bit is a bunch of URLs that are search results for the user, gathered by a Google search on the user's name. The objective is to decide whether each search result is accurate (i.e. it is about the person) or inaccurate. To do this, each piece of the profile data is searched for within each link in the URL array, and a binary value is assigned per attribute depending on whether that profile data (e.g. City) is matched on the page. The results are then represented as a vector of binaries (i.e. 1 0 0 0 1 means Name and Email Address were matched on the URL).
My question revolves around creating the optimal training set. If a person's profile has incomplete information (such as a missing email address), is that a good profile to use in my training set? Should I only train on profiles with complete information? Would it make sense to make different training sets (one for each combination of complete profile attributes), and then, when I am given a user's URL to test, determine which training set to use based on how much of the user profile is on record for the test person? How can I go about this?
In general, there is no "should". Whichever way you create a model, the only thing which matters is its performance, no matter how you created it.
However, it is highly unlikely you'd be able to create a proper model with a hand-picked training set. The simple idea is that you should train your model on data which looks exactly like live data. Will live data have missing values, incomplete profiles, etc.? If so, you need your model to know what to do in such situations and, therefore, such profiles should be in the training set.
Yes, certainly, you can make a model composed of several sub-models, however you might run into problems with having not enough training data and overfitting. You'll have to create multiple good models for it to work, which is harder. I suppose it would be better to leave this reasoning to the model itself rather than trying to hand-hold it into the right direction, this is what machine learning is for - save you the trouble... But there is really no way to say before trying it on your data set. Again, whatever works in your particular case is right.
Because you're using Naive Bayes as your model (and only because of that) you can benefit from the independence assumption to use every piece of data you have available and only consider those present in the new sample.
You have features f_1 ... f_n, some of which may or may not be present in any given entry. The posterior probability p( relevant | f_1 ... f_n ) decomposes as:
p( relevant | f_1 ... f_n ) \propto p( relevant ) * p( f_1 | relevant ) * p( f_2 | relevant ) ... p( f_n | relevant )
p( irrelevant | f_1 ... f_n ) is similar. If some particular f_i isn't present, just drop its term from both posteriors. Given that they are defined over the same feature space, the probabilities are comparable and can be normalised in the standard way. All you then need is to estimate the terms p( f_i | relevant ): this is simply the fraction of the relevant links where the i-th feature is 1 (possibly smoothed). To estimate this parameter, simply use the set of relevant links where the i-th feature is defined.
This is only going to work if you implement it yourself, as I don't think you can do this with a standard package, but given how easy it is to implement I wouldn't be concerned.
Edit: an example
Imagine you have the following features and data (they're binary, since you say that's what you have, but the extension to categorical or continuous is not difficult, I hope):
D = [ {email: 1, city: 1, name: 1, RELEVANT: 1},
      {city: 1, name: 1, RELEVANT: 0},
      {city: 0, email: 0, RELEVANT: 0},
      {name: 1, city: 0, email: 1, RELEVANT: 1} ]
where each element of the list is an instance, and the target variable for classification is the special RELEVANT field (note that some of these instances have some variables missing).
You then want to classify the following instance, missing the RELEVANT field since that's what you're hoping to predict:
t = {email: 0, name: 1}
The posterior probability
p(RELEVANT=1 | t) = [p(RELEVANT=1) * p(email=0|RELEVANT=1) * p(name=1|RELEVANT=1)] / evidence(t)
while
p(RELEVANT=0 | t) = [p(RELEVANT=0) * p(email=0|RELEVANT=0) * p(name=1|RELEVANT=0)] / evidence(t)
where evidence(t) is just a normaliser obtained by summing the two numerators above.
To get each of the parameters of the form p(email=0|RELEVANT=1), look at the fraction of training instances where RELEVANT=1 which have email=0:
p(email=0|RELEVANT=1) = count(email=0,RELEVANT=1) / [count(email=0,RELEVANT=1) + count(email=1,RELEVANT=1)].
Notice that this term simply ignores instances for which email is not defined.
In this instance, the posterior probability of relevance goes to zero because the count(email=0,RELEVANT=1) is zero. So I would suggest using a smoothed estimator where you add one to every count, so that:
p(email=0|RELEVANT=1) = [count(email=0,RELEVANT=1)+1] / [count(email=0,RELEVANT=1) + count(email=1,RELEVANT=1) + 2].
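Putting the example together, a minimal implementation might look like the sketch below. It uses the data D and instance t from above, the add-one smoothed estimator for the likelihoods, and the raw class fraction for the prior (the prior is left unsmoothed here, which is an assumption). The function names are hypothetical.

```python
D = [
    {"email": 1, "city": 1, "name": 1, "RELEVANT": 1},
    {"city": 1, "name": 1, "RELEVANT": 0},
    {"city": 0, "email": 0, "RELEVANT": 0},
    {"name": 1, "city": 0, "email": 1, "RELEVANT": 1},
]


def likelihood(data, feature, value, label):
    """Add-one smoothed p(feature=value | RELEVANT=label), skipping rows that lack the feature."""
    defined = [row for row in data if row["RELEVANT"] == label and feature in row]
    matches = sum(1 for row in defined if row[feature] == value)
    return (matches + 1) / (len(defined) + 2)


def posterior_relevant(data, instance):
    """p(RELEVANT=1 | instance) using only the features present in `instance`."""
    numerators = {}
    for label in (0, 1):
        # Prior is the raw fraction of training rows with this label.
        score = sum(1 for row in data if row["RELEVANT"] == label) / len(data)
        for feature, value in instance.items():
            score *= likelihood(data, feature, value, label)
        numerators[label] = score
    # evidence(t) is the sum of the two numerators.
    return numerators[1] / (numerators[0] + numerators[1])


t = {"email": 0, "name": 1}
print(posterior_relevant(D, t))
```

For t, the two numerators are 0.5 * (1/4) * (3/4) and 0.5 * (2/3) * (2/3), so the posterior of relevance is 27/91, roughly 0.297, instead of the hard zero the unsmoothed estimator would give.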

Resources