Cluster URLs based on their pattern using Python - machine-learning

I am new to clustering techniques and would highly value any input you can provide on my problem below.
Basically, I want to cluster URLs based on their structural patterns.
For example:
cluster1 - simple URLs https://domain/path/file
cluster2 - shortened URLs
cluster3 - redirect URLs
....
cluster k - new URL pattern
Given a URL dataset, I want to understand how many different URL pattern clusters exist and then visually see the differences.
The existing methods I have seen cluster URLs domain-wise (grouping URLs of the same website together), which is not what I am after. When I try NLP-based (word-based) similarity clustering, the same thing happens, since URLs from the same website tend to share the same words with only small differences.
Instead, I want to focus on the URL structure and identify URL patterns. Removing all the special characters and simply creating a bag of words for each URL destroys the URL structure. Can anyone help me identify a suitable clustering technique, as well as a vectorization technique, for finding different URL pattern clusters?
Thanks in advance
Matheesha

Here is an example of how to cluster text.
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance

words = "kitten belly squooshy merley best eating google feedback face extension impressed map feedback google eating face extension climbing key".split(" ")  # Replace this line
words = np.asarray(words)  # So that indexing with a list will work
lev_similarity = -1 * np.array([[distance.levenshtein(w1, w2) for w1 in words] for w2 in words])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)

for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
Result:
- *eating:* climbing, eating
- *google:* google, squooshy
- *feedback:* feedback
- *face:* face, map
- *impressed:* impressed
- *extension:* extension
- *key:* belly, best, key, kitten, merley
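To aim this at URL structure rather than raw text, one option is to first normalise each URL into a structural signature and cluster those signatures instead. This is a minimal sketch, not part of the original answer: it uses Python's standard urlparse, and the signature format (scheme, path depth, query/fragment flags) is just one possible design.

from urllib.parse import urlparse

def url_to_pattern(url):
    # Collapse a URL into a coarse structural signature, ignoring the concrete words.
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    pieces = [
        parts.scheme,                           # http vs https
        "host",                                 # drop the actual domain so clustering is not domain-wise
        "seg%d" % len(segments),                # path depth
        "query" if parts.query else "noquery",  # query string present?
        "frag" if parts.fragment else "nofrag", # fragment present?
    ]
    return "/".join(pieces)

urls = ["https://domain/path/file", "http://sho.rt/Ab3", "https://domain/redirect?url=target"]
patterns = [url_to_pattern(u) for u in urls]
# `patterns` can replace `words` in the snippet above, so the Levenshtein distance
# is computed between structural signatures rather than between raw URLs.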

Related

How does SpaCy NER verify the rationality of entities?

When I use SpaCy NER, SpaCy will recognize 'TodoA' as PERSON. This is obviously unreasonable. Is there any way to verify whether an entity extracted by SpaCy is reasonable? Thanks!
Most of these unreasonable entities are extracted by spaCy's beam search. The beam search code is:
import spacy
import sys
from collections import defaultdict

nlp = spacy.load('en')
text = u'Will Japan join the European Union? If yes, we should \
move to United States. Fasten your belts, America we are coming'

with nlp.disable_pipes('ner'):
    doc = nlp(text)

threshold = 0.2
(beams, somethingelse) = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)

entity_scores = defaultdict(float)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

print('Entities and scores (detected with beam search)')
for key in entity_scores:
    start, end, label = key
    score = entity_scores[key]
    if score > threshold:
        print('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))
The "unreasonable" annotation you are seeing is directly linked with the nature of the model that is used to perform the annotation and the process of obtaining it.
In short, the model is an approximation of a very complex function (in mathematical terms) from some characteristics of sequences of words (e.g. presence of particular letters, upper-casing, usage of particular terms, etc.) to a closed set of tags (e.g. PERSON). It is an approximation that is close to best across a large body of text (e.g. a few GBs of ASCII text) but certainly it is not a mapping of particular phrases to tags. Therefore, even though the data which is used for training is accurate, the result of applying the model might be not ideal in some circumstances.
In your case it is likely that the model is clinging on upper-casing of a word, and maybe there was a large number of words used in training that share the prefix that were marked with tag PERSON) - e.g. Toddy, toddler, etc. and a very small number of words with such a prefix that were not PERSONs.
This phenomenon that we observe was not chosen explicitly by person preparing the model, it is only a by-product of the combination of the process of preparing it (training), and the data used.

Can you search for related database tables/fields using text similarity?

I am doing a college project where I need to compare a string with a list of other strings. I want to know if there is any kind of library that can do this.
Suppose I have a table called DOCTORS_DETAILS.
Other table names are: HOSPITAL_DEPARTMENTS, DOCTOR_APPOINTMENTS, PATIENT_DETAILS, PAYMENTS, etc.
Now I want to calculate which one among those is most relevant to DOCTORS_DETAILS.
The expected output could be:
DOCTOR_APPOINTMENTS - more relevant, because the term DOCTOR appears in both strings
PATIENT_DETAILS - the term DETAILS appears in both strings
HOSPITAL_DEPARTMENTS - least relevant
PAYMENTS - least relevant
Therefore I want to find RELEVANCE based on the number of terms shared by the two strings in question.
Ex: DOCTORS_DETAILS -> DOCTOR_APPOINTMENT (1/2) > DOCTOR_ADDRESS_INFORMATION (1/3) > DOCTOR_SPECIALIZATION_DEGREE_INFORMATION (1/4) > PATIENT_INFO (0/2)
Semantic similarity is a common NLP problem. There are multiple approaches to look into, but at their core they are all going to boil down to:
Turn each piece of text into a vector
Measure distance between vectors, and call closer vectors more similar
Three possible ways to do step 1 are:
tf-idf
fasttext
bert-as-service
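For the tf-idf option, here is a minimal sketch with scikit-learn; splitting the names on underscores is an assumption that fits table names like these, not part of the original answer. Each resulting row can be fed to the cos_sim function defined below.

from sklearn.feature_extraction.text import TfidfVectorizer

table_names = ["DOCTORS_DETAILS", "HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS",
               "PATIENT_DETAILS", "PAYMENTS"]

# Treat each underscore-separated word as a token and weight it by tf-idf
# across all table names; each row of `vectors` is the vector for one name.
vectorizer = TfidfVectorizer(analyzer=lambda name: name.lower().split("_"))
vectors = vectorizer.fit_transform(table_names).toarray()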
To do step 2, you almost certainly want to use cosine distance. It is pretty straightforward in Python; here is an implementation from a blog post:
import numpy as np

def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according
    to the definition of the dot product
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
For your particular use case, my instinct is to use fasttext. The official site shows how to download pretrained word vectors, but you will want to download a pretrained model (see this GH issue; use https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.bin.zip).
Then you'd want to do something like:
import fasttext

model = fasttext.load_model("model_filename.bin")

def order_tables_by_name_similarity(main_table, candidate_tables):
    '''Note: we use a fasttext model, not just pretrained vectors, so we get subword information.
    You can modify this to also output the distances if you need them.
    '''
    main_v = model[main_table]
    similarity_to_main = lambda w: cos_sim(main_v, model[w])
    return sorted(candidate_tables, key=similarity_to_main, reverse=True)

order_tables_by_name_similarity("DOCTORS_DETAILS", ["HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS", "PATIENT_DETAILS", "PAYMENTS"])
# outputs: ['PATIENT_DETAILS', 'DOCTOR_APPOINTMENTS', 'HOSPITAL_DEPARTMENTS', 'PAYMENTS']
If you need to put this in production, the giant model size (6.7GB) might be an issue. At that point, you'd want to build your own model, and constrain the model size. You can probably get roughly the same accuracy out of a 6MB model!
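For completeness, here is a minimal sketch of training a small custom model with the fasttext package; the corpus file name and the dim/minn/maxn values are assumptions and would need tuning against your own data.

import fasttext

# Train an unsupervised skipgram model on your own text corpus (one document per line).
# A small dimension keeps the saved model far smaller than the 6.7 GB pretrained one.
small_model = fasttext.train_unsupervised("my_corpus.txt", model="skipgram",
                                          dim=50, minn=2, maxn=5)
small_model.save_model("small_model.bin")

# Use it exactly like the pretrained model above.
vec = small_model["DOCTORS_DETAILS"]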

How to create a recommender based on tags?

I am working on an e-learning platform using PHP. It recommends videos if you fail a specific question. How do I go about creating a recommender system that takes in tags and recommends relevant videos?
import pandas as pd

videos = pd.read_csv("/file_path/vid_com_dup.csv",
                     sep=',',
                     names=['vid_id', 'ques_id', 'vid_name', 'vid_tags'])
videos.head()
The CSV file includes the following columns:
vid_id - primary key and id for the videos.
ques_id - foreign key.
vid_name - the name of the video.
vid_tags - tags in the form (1+1, single digit, addition, grade 1).
The same kinds of tags also appear in the question table.
If a question has tags (1+1, single digit, addition, grade 1), I want to build a recommender that takes those tags, compares them with videos that have similar tags, and returns recommendations.
I finally got around it; I hope it will help someone else. These were the steps (each was illustrated with a screenshot in the original answer):
Load the dataset.
Split the tags: build a matrix in which each tag becomes a column that is 1 if the tag is present for a video and 0 otherwise.
Scale and transform the feature matrix above.
Apply scikit-learn's unsupervised nearest neighbors; you get back indices and a distance matrix. What is unsupervised nearest neighbors? For this problem we are only interested in finding the nearest neighbours based on distance and recommending them, not in classifying anything.
You're all done. All that is needed now is a function that returns the nearest videos for a given one; a sketch of the whole pipeline is shown below.
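Since the original screenshots are not reproduced here, the following is a minimal sketch of that pipeline; the column names follow the CSV loaded above, while the tag-splitting rule, the scaling step and the number of neighbours are assumptions to adjust for your data.

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from sklearn.neighbors import NearestNeighbors

videos = pd.read_csv("/file_path/vid_com_dup.csv", sep=',',
                     names=['vid_id', 'ques_id', 'vid_name', 'vid_tags'])

# Split the comma-separated tags and one-hot encode them (1 if the tag is present, else 0).
tag_lists = videos['vid_tags'].fillna('').apply(lambda s: [t.strip() for t in s.split(',') if t.strip()])
mlb = MultiLabelBinarizer()
features = mlb.fit_transform(tag_lists)

# Scale the feature matrix, then fit unsupervised nearest neighbours.
features = StandardScaler().fit_transform(features)
nn = NearestNeighbors(n_neighbors=4).fit(features)
distances, indices = nn.kneighbors(features)

def recommend(vid_index, n=3):
    """Return the names of the n videos whose tag profiles are closest to the given video."""
    neighbour_rows = indices[vid_index][1:n + 1]  # skip the video itself
    return videos.iloc[neighbour_rows]['vid_name'].tolist()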

Custom Named entity recognition

So I have a task in front of me: to build a custom NER model for the pharmaceutical industry, where I have a finite list of drugs and over 4,000 text files on which NER is supposed to be performed. I have also tried entity matching using spaCy, but it is showing an error. So now I plan on using sklearn-crfsuite, but in order to do that my data needs to be in CoNLL format and needs to be annotated. I would really appreciate it if someone could guide me in annotating my text files. Is there any way I can run automatic annotation on the text files using the drug list I have? It would be a huge effort for an individual to do this manually. I also had a look at the question asked in the link mentioned below.
NER model to recognize Indian names
But no one has actually addressed my question. I would really appreciate it if someone could help me out.
SpaCy code:
import pandas as pd
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for label, start, end in matches:
            span = Span(doc, start, end, label=label)
            spans.append(span)
        doc.ents = spans
        return doc

data = pd.read_excel(r'C:\Users\xyz\pname.xlsx')
ld = list(set(data['Product']))

nlp = spacy.load('en')
entity_matcher = EntityMatcher(nlp, ld, 'DRUG')
nlp.add_pipe(entity_matcher)
print(nlp.pipe_names)

doc = nlp('Hi bnbbn, ope all is well. In preparation for the bcbcb is there anything that BGTD requires specifically? We had sent you the US centric Briefing Package to align with our previous discussion on having bkjnsd included in the Wave 1 IMOVAX POLIO submission plan. If you would like, we can set-up a BGTD specific meeting after the June 20th meeting to discuss any jk specific product questions you may have as the product mix is a bit different between countries.')
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
When I run my script, this is the error I get:
[T002] Pattern length (11) >= phrase_matcher.max_length (10). Length can be set on initialization, up to 10.

Best algorithm to predict 3 similar blogs based on a blog's props and contents only

{
  "blogid": 11,
  "blog_authorid": 2,
  "blog_content": "(this is blog complete content: html encoded on base64 such as) PHNlY3Rpb24+PGRpdiBjbGFzcz0icm93Ij4KICAgICAgICA8ZGl2IGNsYXNzPSJjb2wtc20tMTIiIGRhdGEtdHlwZT0iY29udGFpbmVyLWNvbnRlbn",
  "blog_timestamp": "2018-03-17 00:00:00",
  "blog_title": "Amazon India Fashion Week: Autumn-",
  "blog_subtitle": "",
  "blog_featured_img_link": "link to image",
  "blog_intropara": "Introductory para to article",
  "blog_status": 1,
  "blog_lastupdated": "\"Mar 19, 2018 7:42:23 AM\"",
  "blog_type": "Blog",
  "blog_tags": "1,4,6",
  "blog_uri": "Amazon-India-Fashion-Week-Autumn",
  "blog_categories": "1",
  "blog_readtime": "5",
  "ViewsCount": 0
}
Above is one sample blog as returned by my API. I have a JSON array of such blogs.
I am trying to predict 3 similar blogs based on a blog's props (e.g. tags, categories, author, keywords in the title/subtitle) and contents. I have no user data, i.e. there is no logged-in user data (such as ratings or reviews). I know that without user data it will not be accurate, but I'm just getting started with data science and ML. Any suggestion/link is appreciated. I prefer using Java, but Python, PHP or any other language also works for me. I need an easy-to-implement model as I am a beginner. Thanks in advance.
My intuition is that this question might not be at the right address.
BUT
I would do the following:
Create a dataset of sites that serves as an inventory from which to predict. For each site you will need to list one or more features: number of tags, number of posts, average time between posts in days, etc.
It sounds like this is for learning and you are not too worried about accuracy, so numeric features should suffice.
Work back from a k-NN algorithm. Don't worry about the classifiers: instead of classifying a blog, you list the 3 closest neighbors (k = 3); see the sketch below. A good implementation of the algorithm is here. Have fun simplifying it for your purposes.
Your algorithm should be a step or two shorter than full k-NN, which is considered one of the simpler ML methods and a good place to start.
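As a rough sketch of the 3-closest-neighbours idea (in Python rather than Java, with made-up feature values; the real numbers would come from your own blog inventory):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical inventory: one row per blog, columns = [number of tags, number of posts,
# average days between posts]. Replace with features computed from your API data.
blog_ids = [11, 12, 13, 14, 15]
features = np.array([
    [3, 40, 2.5],
    [2, 35, 3.0],
    [7, 10, 9.0],
    [3, 42, 2.0],
    [1,  5, 30.0],
])

nn = NearestNeighbors(n_neighbors=4).fit(features)  # 4 = the blog itself + 3 neighbours
distances, indices = nn.kneighbors(features)

def three_similar(blog_index):
    """Return the ids of the 3 blogs closest to the given one."""
    return [blog_ids[i] for i in indices[blog_index][1:]]

print(three_similar(0))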
Good luck.
EDIT:
You want to build a recommender engine using text, tags, numeric and maybe time-series data. This is a broad request. Just like you, when faced with it, I'd need to dive into the data and research the best approach. Different approaches require different sets of data, e.g. collaborative vs. content-based filtering.
A few things may have been missed on the user side that can be used as a sort of rating: you do not need a login feature to get this information; a cookie ID or IP-based DMA, geo and viewing duration should all be available to the web server.
On the blog side, you need to process the texts to identify related terms; I gave examples of other blog features above.
I am aware that this is a lot of hand-waving, but there is no actual code question here. To reiterate, my intuition is that this question might not be at the right address.
I really want to help but this is the best I can do.
EDIT 2:
If I understand your new comments correctly, each blog has the following for each other blog:
A Jaccard similarity coefficient.
A set of TF-IDF generated words with scores.
A Euclidean distance based on numeric data.
I would create a heuristic from these and allow the process to adjust the importance of each statistic.
The challenge would be to quantify the word-score TF-IDF output. You can treat the words above a certain score as tags and run another similarity analysis, or simply count the overlap.
You have already started down this path, and this answer assumes you will continue. IMO the best path is to see which dedicated recommender engines can help you, rather than constructing the statistics piecemeal (numeric with Euclidean, tags with Jaccard, text with TF-IDF).
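A minimal sketch of such a heuristic; the weights, the example numbers and the 1/(1+d) conversion of the Euclidean distance into a similarity are all assumptions, not part of the original answer:

def blog_similarity(pair_stats, w_jaccard=1.0, w_tfidf=1.0, w_euclidean=1.0):
    """Combine the three per-pair statistics into one score (higher = more similar).

    pair_stats is a dict like:
      {"jaccard": 0.4,         # Jaccard coefficient on tags, already in [0, 1]
       "tfidf_overlap": 0.25,  # fraction of high-scoring TF-IDF words shared, in [0, 1]
       "euclidean": 3.2}       # Euclidean distance on numeric features (smaller = closer)
    """
    closeness = 1.0 / (1.0 + pair_stats["euclidean"])  # turn a distance into a similarity
    return (w_jaccard * pair_stats["jaccard"]
            + w_tfidf * pair_stats["tfidf_overlap"]
            + w_euclidean * closeness)

# Rank all candidate blogs against one blog and keep the top 3.
candidates = {12: {"jaccard": 0.4, "tfidf_overlap": 0.25, "euclidean": 3.2},
              13: {"jaccard": 0.1, "tfidf_overlap": 0.05, "euclidean": 9.0},
              14: {"jaccard": 0.6, "tfidf_overlap": 0.30, "euclidean": 1.1},
              15: {"jaccard": 0.0, "tfidf_overlap": 0.00, "euclidean": 20.0}}
top3 = sorted(candidates, key=lambda b: blog_similarity(candidates[b]), reverse=True)[:3]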
