Why is "machine_learning" lemmatized both as "machine_learning" and "machine_learne"? - machine-learning

I am running LDA on a number of texts. When I generated some visualizations of the produced topics, I found that the bigram "machine_learning" had been lemmatized both as "machine_learning" and "machine_learne". Here is as minimal a reproducible example as I can provide:
import en_core_web_sm

tokenized = [
    [
        'artificially_intelligent', 'funds', 'generating', 'excess', 'returns',
        'artificial_intelligence', 'deep_learning', 'compelling', 'reasons',
        'join_us', 'artificially_intelligent', 'fund', 'develop', 'ai',
        'machine_learning', 'capabilities', 'real', 'cases', 'big', 'players',
        'industry', 'discover', 'emerging', 'trends', 'latest_developments',
        'ai', 'machine_learning', 'industry', 'players', 'trading',
        'investing', 'live', 'investment', 'models', 'learn', 'develop',
        'compelling', 'business', 'case', 'clients', 'ceos', 'adopt', 'ai',
        'machine_learning', 'investment', 'approaches', 'rare', 'gathering',
        'talents', 'including', 'quants', 'data_scientists', 'researchers',
        'ai', 'machine_learning', 'experts', 'investment_officers', 'explore',
        'solutions', 'challenges', 'potential', 'risks', 'pitfalls',
        'adopting', 'ai', 'machine_learning'
    ],
    [
        'recent_years', 'topics', 'data_science', 'artificial_intelligence',
        'machine_learning', 'big_data', 'become_increasingly', 'popular',
        'growth', 'fueled', 'collection', 'availability', 'data',
        'continually', 'increasing', 'processing', 'power', 'storage', 'open',
        'source', 'movement', 'making', 'tools', 'widely', 'available',
        'result', 'already', 'witnessed', 'profound', 'changes', 'work',
        'rest', 'play', 'trend', 'increase', 'world', 'finance', 'impacted',
        'investment', 'managers', 'particular', 'join_us', 'explore',
        'data_science', 'means', 'finance_professionals'
    ]
]

nlp = en_core_web_sm.load(disable=['parser', 'ner'])

def lemmatization(descrips, allowed_postags=None):
    if allowed_postags is None:
        allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']
    lemmatized_descrips = []
    for descrip in descrips:
        doc = nlp(" ".join(descrip))
        lemmatized_descrips.append([
            token.lemma_ for token in doc if token.pos_ in allowed_postags
        ])
    return lemmatized_descrips

lemmatized = lemmatization(tokenized)
print(lemmatized)
As you will notice, "machine_learne" is found nowhere in the input tokenized, but both "machine_learning" and "machine_learne" are found in the output lemmatized.
What is the cause of this and can I expect it to cause issues with other bigrams/trigrams?

I think you have misunderstood how POS tagging and lemmatization work.
POS tagging relies on more than the word itself: it also uses the surrounding words (for example, one commonly learned rule is that in many statements a verb is preceded by a noun acting as the verb's agent). This holds for many languages, not just English.
When you pass these pre-joined 'tokens' to the pipeline, spaCy's lemmatizer has to guess the part of speech of each word more or less in isolation.
In many cases it defaults to a noun and, if the word is not in its lookup table of common and irregular nouns, it applies generic rules (such as stripping a plural 's').
In other cases it defaults to a verb based on surface patterns (the "-ing" ending), which is probably what happens here. Since no verb "machine_learning" exists in the model's lookup tables, it falls through to the generic rules.
Therefore, machine_learning is probably being lemmatized by a generic '"-ing" to "-e"' rule (as in making -> make, baking -> bake), common to many regular verbs.
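To see that generic rule in action on ordinary regular verbs, here is a small sketch (the exact tags depend on the model, so treat the output as indicative only):
doc = nlp("making baking machine_learning")
print([(token.text, token.pos_, token.lemma_) for token in doc])
# if the tagger guesses VERB for these "-ing" forms, the rule-based lemmatizer
# strips the suffix and appends "e": make, bake, machine_learne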
Look at this test example:
for descrip in tokenized:
    doc = nlp(" ".join(descrip))
    print([
        (token.pos_, token.text) for token in doc
    ])
Output:
[('NOUN', 'artificially_intelligent'), ('NOUN', 'funds'), ('VERB',
'generating'), ('ADJ', 'excess'), ('NOUN', 'returns'), ('NOUN',
'artificial_intelligence'), ('NOUN', 'deep_learning'), ('ADJ',
'compelling'), ('NOUN', 'reasons'), ('PROPN', 'join_us'), ('NOUN',
'artificially_intelligent'), ('NOUN', 'fund'), ('NOUN', 'develop'),
('VERB', 'ai'), ('VERB', 'machine_learning'), ('NOUN',
'capabilities'), ('ADJ', 'real'), ('NOUN', 'cases'), ('ADJ', 'big'),
('NOUN', 'players'), ('NOUN', 'industry'), ('VERB', 'discover'),
('VERB', 'emerging'), ('NOUN', 'trends'), ('NOUN',
'latest_developments'), ('VERB', 'ai'), ('VERB', 'machine_learning'),
('NOUN', 'industry'), ('NOUN', 'players'), ('NOUN', 'trading'),
('VERB', 'investing'), ('ADJ', 'live'), ('NOUN', 'investment'),
('NOUN', 'models'), ('VERB', 'learn'), ('VERB', 'develop'), ('ADJ',
'compelling'), ('NOUN', 'business'), ('NOUN', 'case'), ('NOUN',
'clients'), ('NOUN', 'ceos'), ('VERB', 'adopt'), ('VERB', 'ai'),
('ADJ', 'machine_learning'), ('NOUN', 'investment'), ('NOUN',
'approaches'), ('ADJ', 'rare'), ('VERB', 'gathering'), ('NOUN',
'talents'), ('VERB', 'including'), ('NOUN', 'quants'), ('NOUN',
'data_scientists'), ('NOUN', 'researchers'), ('VERB', 'ai'), ('ADJ',
'machine_learning'), ('NOUN', 'experts'), ('NOUN',
'investment_officers'), ('VERB', 'explore'), ('NOUN', 'solutions'),
('VERB', 'challenges'), ('ADJ', 'potential'), ('NOUN', 'risks'),
('NOUN', 'pitfalls'), ('VERB', 'adopting'), ('VERB', 'ai'), ('NOUN',
'machine_learning')]
You get machine_learning both as a verb and as a noun, depending on context. But note that simply concatenating the words makes a mess, because they are not ordered the way natural language would be.
Not even a human could make sense of, and correctly POS-tag, this text:
artificially_intelligent funds generating excess returns
artificial_intelligence deep_learning compelling reasons join_us
artificially_intelligent fund develop ai machine_learning capabilities
real cases big players industry discover emerging trends
latest_developments ai machine_learning industry players trading
investing live investment models learn develop compelling business
case clients ceos adopt ai machine_learning investment approaches rare
gathering talents including quants data_scientists researchers ai
machine_learning experts investment_officers explore solutions
challenges potential risks pitfalls adopting ai machine_learning
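If the goal is simply to keep merged n-grams like machine_learning intact, one pragmatic workaround (a sketch, not part of the original answer) is to skip lemmatization for any token containing an underscore:
def lemmatization(descrips, allowed_postags=None):
    if allowed_postags is None:
        allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']
    lemmatized_descrips = []
    for descrip in descrips:
        doc = nlp(" ".join(descrip))
        lemmatized_descrips.append([
            # keep merged n-grams verbatim; lemmatize everything else
            token.text if "_" in token.text else token.lemma_
            for token in doc
            if token.pos_ in allowed_postags
        ])
    return lemmatized_descrips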

Related

The stacking model: I want to see the recall and precision results

In the stacking model, I want to see the recall and precision results. I have tried many methods and have not found a way. I can get recall and precision for other models, but I am stuck with the stacking model; a little help would go a long way.
estimator = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('dec_tree', dec_tree),
    ('knn', knn),
    ('xgb', xgb),
    ('ext', ext),
    ('grad', grad),
    ('hist', hist)]

# build stack model
stack_model = StackingClassifier(
    estimators=estimator, final_estimator=LogisticRegression())

# train stack model
stack_model.fit(x_train, y_train)

# make predictions
y_train_pred = stack_model.predict(x_train)
y_test_pred = stack_model.predict(x_test)

# training set performance
stack_model_train_accuracy = accuracy_score(y_train, y_train_pred)
stack_model_train_f1 = f1_score(y_train, y_train_pred, average='weighted')

# testing set performance
stack_model_test_accuracy = accuracy_score(y_test, y_test_pred)
stack_model_test_f1 = f1_score(y_test, y_test_pred, average='weighted')

# print
print('Model Performance For Training Set')
print('- Accuracy: %s' % stack_model_train_accuracy)
print('- f1: %s' % stack_model_train_f1)
print('______________________________________')
print('Model Performance For Testing Set')
print('- Accuracy: %s' % stack_model_test_accuracy)
print('- f1: %s' % stack_model_test_f1)
Up to here it works, but I also need recall and precision. If I compute them the same way I computed accuracy and the F-score, the results are wrong, and if I use classification_report I get an error too.
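A likely fix (a sketch, assuming a multiclass target, which would explain why the binary defaults fail) is to pass an average argument to precision_score and recall_score, exactly as the code above already does for f1_score:
from sklearn.metrics import precision_score, recall_score, classification_report

stack_model_test_precision = precision_score(y_test, y_test_pred, average='weighted')
stack_model_test_recall = recall_score(y_test, y_test_pred, average='weighted')
print('- Precision: %s' % stack_model_test_precision)
print('- Recall: %s' % stack_model_test_recall)

# classification_report prints per-class precision, recall and f1 in one go
print(classification_report(y_test, y_test_pred))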

How to calculate cluster coherence/quality?

I created embeddings with fastText and obtained clusters with KMeans.
I would like to calculate the similarities inside each cluster to check whether the sentences in it are well clustered, and keep only sentences with good similarity in each cluster. If the similarity is not good, I want to remove the sentence that does not belong to the cluster, and then group together the similar sentences that do not belong to any cluster.
How can I do this properly? I thought of using cosine similarity, but I don't know how to compare all the sentences inside a cluster.
Maybe something like this...
# clustering words into similar groups:
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance

words = 'XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL'.split(',')  # Replace this line
words = np.asarray(words)  # So that indexing with a list will work
lev_similarity = -1 * np.array([[distance.levenshtein(w1, w2) for w1 in words] for w2 in words])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
Result:
- *LDPELDKSL:* LDPELDKSL
- *DFKLKSLFD:* DFKLKSLFD
- *XYZ:* ABC, XYZ
- *DLFKFKDLD:* DLFKFKDLD
See these links for additional guidance on how to cluster text.
https://towardsdatascience.com/applying-machine-learning-to-classify-an-unsupervised-text-document-e7bb6265f52
https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html
https://pythonprogramminglanguage.com/kmeans-text-clustering/
http://brandonrose.org/clustering
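To directly measure cohesion inside each KMeans cluster, one option is the mean pairwise cosine similarity of its members. A sketch, assuming X is a 2-D array of sentence embeddings and labels comes from KMeans (both names are placeholders):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cluster_cohesion(X, labels):
    """Mean pairwise cosine similarity of the members of each cluster."""
    scores = {}
    for label in np.unique(labels):
        members = X[labels == label]
        if len(members) < 2:
            scores[label] = 1.0  # a singleton cluster is trivially "coherent"
            continue
        sims = cosine_similarity(members)
        n = len(members)
        scores[label] = (sims.sum() - n) / (n * (n - 1))  # average of off-diagonal entries
    return scores
Sentences whose average similarity to the rest of their cluster is low are then candidates for removal and re-grouping, as described in the question.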
Here are a couple of examples using cosine similarity.
d1 = "plot: two teen couples go to a church party, drink and then drive."
d2 = "films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . "
d3 = "every now and then a movie comes along from a suspect studio , with every indication that it will be a stinker , and to everybody's surprise ( perhaps even the studio ) the film becomes a critical darling . "
d4 = "damn that y2k bug . "
documents = [d1, d2, d3, d4]
import nltk, string, numpy
nltk.download('punkt')  # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

nltk.download('wordnet')  # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)
print(LemVectorizer.vocabulary_)

tf_matrix = LemVectorizer.transform(documents).toarray()
print(tf_matrix)
tf_matrix.shape

from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
print(tfidfTran.idf_)

import math
def idf(n, df):
    result = math.log((n + 1.0) / (df + 1.0)) + 1
    return result
print("The idf for terms that appear in one document: " + str(idf(4, 1)))
print("The idf for terms that appear in two documents: " + str(idf(4, 2)))

tfidf_matrix = tfidfTran.transform(tf_matrix)
print(tfidf_matrix.toarray())

cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()
print(cos_similarity_matrix)

from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf * tfidf.T).toarray()
cos_similarity(documents)
https://sites.temple.edu/tudsc/2017/03/30/measuring-similarity-between-texts-in-python/
# Define the documents
doc_trump = "Mr. Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin"
doc_election = "President Trump says Putin had no political interference is the election outcome. He says it was a witchhunt by political parties. He claimed President Putin is a friend who had nothing to do with the election"
doc_putin = "Post elections, Vladimir Putin became President of Russia. President Putin had served as the Prime Minister earlier in his political career"
documents = [doc_trump, doc_election, doc_putin]
# Scikit Learn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Create the Document Term Matrix
count_vectorizer = CountVectorizer(stop_words='english')
count_vectorizer = CountVectorizer()  # note: this overrides the line above, so stop words are NOT removed
sparse_matrix = count_vectorizer.fit_transform(documents)

# OPTIONAL: Convert Sparse Matrix to Pandas Dataframe if you want to see the word frequencies.
doc_term_matrix = sparse_matrix.todense()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names(),  # get_feature_names_out() on newer scikit-learn
                  index=['doc_trump', 'doc_election', 'doc_putin'])
df

# Compute Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(df, df))
https://www.machinelearningplus.com/nlp/cosine-similarity/

BERT problem with context/semantic search in Italian language

I am using a BERT model for context search in Italian, but it does not capture the contextual meaning of the sentence and returns the wrong result.
In the example code below, when I compare "milk with chocolate flavour" against two other kinds of milk and one chocolate, it returns the highest similarity with the chocolate; it should return the highest similarity with the other milks.
Can anyone suggest any improvement to the code below so that it returns semantic results?
Code:
!python -m spacy download it_core_news_lg
!pip install sentence-transformers

import scipy.spatial.distance  # needed for cdist below
import numpy as np
from sentence_transformers import models, SentenceTransformer

model = SentenceTransformer('distiluse-base-multilingual-cased')  # works with Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish

corpus = [
    "Alpro, Cioccolato bevanda a base di soia 1 ltr",  # Alpro, chocolate soy drink 1 ltr (soya milk)
    "Milka cioccolato al latte 100 g",  # Milka milk chocolate 100 g
    "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",  # Danone, HiPRO 25g protein chocolate flavour 330 ml (milk with chocolate flavour)
]
corpus_embeddings = model.encode(corpus)

queries = [
    'latte al cioccolato',  # milk with chocolate flavour
]
query_embeddings = model.encode(queries)

# Calculate cosine similarity of the query against each corpus sentence
closest_n = 10
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]
    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])
    print("\n======================\n")
    print("Query:", query)
    print("\nTop 10 most similar sentences in corpus:")
    for idx, distance in results[0:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1 - distance))
Output :
======================
Query: latte al cioccolato
Top 10 most similar sentences in corpus:
Milka cioccolato al latte 100 g (Score: 0.7714)
Alpro, Cioccolato bevanda a base di soia 1 ltr (Score: 0.5586)
Danone, HiPRO 25g Proteine gusto cioccolato 330 ml (Score: 0.4569)
The problem is not with your code; it is simply insufficient model performance.
There are a few things you can do. First, try the Universal Sentence Encoder (USE). In my experience its embeddings are a little better, at least in English.
Second, try a different model, for example sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1. It is based on RoBERTa and may perform better.
Third, you can combine embeddings from several models (simply by concatenating the representations). In some cases this helps, at the expense of much heavier compute.
And finally, you can create your own model. It is well known that monolingual models perform significantly better than multilingual ones, so you can follow the guide and train your own Italian model.
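As a rough sketch of the second and third suggestions (both model names are the ones mentioned above; this is illustrative, not a tuned solution):
import numpy as np
from sentence_transformers import SentenceTransformer

model_a = SentenceTransformer('distiluse-base-multilingual-cased')
model_b = SentenceTransformer('sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1')

def encode_combined(sentences):
    # concatenate the two models' embeddings for each sentence
    emb_a = model_a.encode(sentences)
    emb_b = model_b.encode(sentences)
    return np.concatenate([emb_a, emb_b], axis=1)

corpus_embeddings = encode_combined(corpus)
query_embeddings = encode_combined(queries)
The scoring loop from the question can then be reused unchanged.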

ARFF file format in Weka

I am doing my machine learning homework and I am using Weka, which I am very new to. I am trying to use M5P, but the classifier is grayed out. I understand that means the file I'm using is incorrect, whether in format or parameters. Can someone help me fix my ARFF file? I'm pretty sure the problem is in the attribute section.
Here it is.
@relation world_happiness
@attribute M5P
@attribute continent {Americas, Africa, Asia, Europe, Australia, Antarctica}
@attribute country string
@attribute SWL-ranking numeric
@attribute SWL-index numeric
@attribute life-expectancy numeric
@attribute GDP-per-capita numeric
@attribute access-to-education-score numeric
@data
Europe,'Albania',157,153.33,73.8,4.9,75.8
Africa,'Algeria',134,173.33,71.1,7.2,66.9
Africa,'Angola',149,160,40.8,3.2,?
Americas, 'Antigua And Barbuda',16,246.67,73.9,11,?
Americas,'Argentina',56,226.67,74.5,13.1,93.7
Europe,'Armenia',172,123.33,71.5,4.5,?
Australia,'Australia',26,243.33,80.3,31.9,?
Europe,'Austria',3,260,79,32.7,99.1
Asia,'Azerbaijan',144,163.33,66.9,4.8,80.2
Americas,'Bahamas',5,256.67,69.7,20.2,?
Asia,'Bahrain',33,240,74.3,23,102
Asia,'Bangladesh',104,190,62.8,2.1,53.7
Americas,'Barbados',27,243.33,75,17,101.1
Europe,'Belarus',170,133.33,68.1,6.9,94.2
Europe,'Belgium',28,243.33,78.9,31.4,145.4
Americas,'Belize',48,230,71.9,6.8,71.6
Africa,'Benin',122,180,54,1.1,21.8
Asia,'Bhutan',8,253.33,62.9,1.4,?
Americas,'Bolivia',117,183.33,64.1,2.9,?
Europe, 'Bosnia & Herzegovina',137,170,74.2,6.8,?
Africa,'Botswana',123,180,36.3,10.5,81.8
Americas,'Brazil',81,210,70.5,8.4,103.2
Asia, 'Brunei Darussalam',9,253.33,76.4,23.6,?
Europe,'Bulgaria',164,143.33,72.2,9.6,92
Africa, 'Burkina Faso',152,156.67,47.5,1.3,10
Asia,'Burma',130,176.67,60.2,1.7,?
Africa,'Burundi',178,100,43.6,0.7,?
Asia,'Cambodia',110,186.67,56.2,2.2,17.3
Africa,'Cameroon',138,170,45.8,2.4,?
Americas,'Canada',10,253.33,80,34,102.6
Africa, 'Cape Verdi',100,193.33,70.4,6.2,?
Africa, 'Central African Republic',145,163.33,39.3,1.1,?
Africa,'Chad',159,150,43.6,1.5,11.5
Americas,'Chile',71,216.67,77.9,11.3,87.5
Asia,'China',82,210,71.6,6.8,62.8
Americas,'Colombia',34,240,72.4,7.9,70.9
Africa,'Comoros',97,196.67,63.2,0.6,?
Africa, 'Congo Democratic Republic',176,110,43.1,0.7,18.4
Africa, 'Congo Republic',105,190,52,1.3,?
Americas, 'Costa Rica',13,250,78.2,11.1,50.9
Europe,'Croatia',98,196.67,75,11.6,?
Americas,'Cuba',83,210,77.3,3.5,?
Europe,'Cyprus',49,230,78.6,7.14,?
Europe, 'Czech Republic',77,213.33,75.6,19.5,87.9
Europe,'Denmark',1,273.33,77.2,34.6,?
Africa,'Dijbouti',150,160,52.8,1.3,14.7
Americas,'Dominica',29,243.33,75.6,5.5,?
Americas, 'Dominican Republic',42,233.33,67.2,7,?
Americas,'Ecuador',111,186.67,74.3,4.3,56.7
Africa,'Egypt',151,160,69.8,3.9,?
Americas, 'El Salvador',61,220,70.9,4.7,49.8
Africa, 'Equatorial Guinea',135,173.33,43.3,50.2,?
Africa,'Eritrea',162,146.67,53.8,1,28.2
Europe,'Estonia',139,170,71.3,16.7,107
Africa,'Ethiopia',153,156.67,47.6,0.9,5.2
Australia, 'Fiji',57,223.33,67.8,6,?
Europe,'Finland',6,256.67,78.5,30.9,124.5
Europe,'France',62,220,79.5,29.9,108.7
Africa,'Gabon',88,206.67,54.5,6.8,54.4
Africa,'Gambia',106,190,55.7,1.9,27
Europe,'Georgia',169,136.67,70.5,3.3,77.7
Europe,'Germany',35,240,78.7,30.4,99
Africa,'Ghana',89,206.67,56.8,2.5,37.3
Europe,'Greece',84,210,78.3,22.2,94.6
Americas,'Grenada',72,216.67,65.3,5,?
Americas,'Guatemala',43,233.33,67.3,4.7,32.7
Africa,'Guinea',140,170,53.7,2,?
Africa,'Guinea-Bissau',124,180,44.7,0.8,20.4
Americas,'Guyana',36,240,63.1,4.6,81
Americas,'Haiti',118,183.33,51.6,1.7,?
Americas,'Honduras',37,240,67.8,2.9,?
Asia, 'Hong Kong',63,220,81.6,32.9,?
Europe,'Hungary',107,190,72.7,16.3,98.6
Europe,'Iceland',4,260,80.7,35.6,108.8
Asia,'India',125,180,63.3,3.3,49.9
Asia,'Indonesia',64,220,66.8,3.6,?
Asia,'Iran',96,200,70.4,8.3,80
Europe,'Ireland',11,253.33,77.7,41,123.1
Asia,'Israel',58,223.33,79.7,24.6,93
Europe,'Italy',50,230,80.1,29.2,92.8
Africa, 'Ivory Coast',160,150,45.9,1.6,21.7
Americas,'Jamaica',44,233.33,70.8,4.4,83.6
Asia,'Japan',90,206.67,82,31.5,102.1
Asia,'Jordan',141,170,71.3,4.7,87.7
Asia,'Kazakhstan',101,193.33,63.2,8.2,87
Africa,'Kenya',112,186.67,47.2,1.1,?
Asia,'Kuwait',38,240,76.9,19.2,55.6
Asia,'Kyrgyzstan',65,220,66.8,2.1,83
Asia,'Laos',126,180,54.7,1.9,35.6
Europe,'Latvia',154,156.67,71.6,13.2,88.9
Asia,'Lebanon',113,186.67,72,6.2,78.2
Africa,'Lesotho',165,143.33,36.3,2.5,28
Africa,'Libya',108,190,73.6,11.4,?
Europe,'Lithuania',155,156.67,72.3,13.7,93.4
Europe,'Luxembourg',12,253.33,78.5,55.6,95.3
Europe,'Macedonia',146,163.33,73.8,7.8,?
Africa,'Madagascar',103,193.33,55.4,0.9,?
Africa,'Malawi',158,153.33,39.7,0.6,?
Asia,'Malaysia',17,246.67,73.2,12.1,98.8
Asia,'Maldives',66,220,66.6,3.9,42.7
Africa,'Mali',131,176.67,47.9,1.2,15
Europe,'Malta',14,250,78.4,19.9,90.4
Africa,'Mauritania',132,176.67,52.7,2.2,?
Africa,'Mauritius',73,216.67,72.2,13.1,107.3
Americas,'Mexico',51,230,75.1,10,73.4
Europe,'Moldova',175,116.67,67.7,1.8,?
Asia,'Mongolia',59,223.33,64,1.9,64.4
Africa,'Morocco',114,186.67,69.7,4.2,39.3
Africa,'Mozambique',127,180,41.9,1.3,13.9
Africa,'Namibia',74,216.67,48.3,7,59.8
Asia,'Nepal',119,183.33,61.6,1.4,53.9
Europe,'Netherlands',15,250,78.4,30.5,124.1
Australia,' New Zealand',18,246.67,79.1,25.2,112.9
Americas,'Nicaragua',85,210,69.7,2.9,?
Africa,'Niger',161,150,44.4,0.9,?
Africa,'Nigeria',120,183.33,43.4,1.4,?
Europe,'Norway',19,246.67,79.4,42.3,117
Asia,'Oman',30,243.33,74.1,13.2,67.8
Asia,'Pakistan',166,143.33,63,2.4,39
Asia,'Palestine',128,180,72.5,5.8,80.7
Americas,'Panama',39,240,74.8,7.2,68.7
Australia, 'Papua New Guinea',86,210,55.3,2.6,21.2
Americas,'Paraguay',75,216.67,71,4.9,56.9
Americas,'Peru',115,186.67,70,5.9,80.8
Asia,'Philippines',78,213.33,70.4,5.1,75.9
Europe,'Poland',99,196.67,74.3,13.3,?
Europe,'Portugal',92,203.33,77.2,19.3,112
Asia,'Qatar',45,233.33,72.8,27.4,92.4
Europe,'Romania',136,173.33,71.3,8.2,80.2
Europe,'Russia',167,143.33,65.3,11.1,81.9
Africa,'Rwanda',163,146.67,43.9,1.5,12.1
Australia, 'Samoa Western',52,230,70.2,5.8,76
Africa, 'Sao Tome And Principe',60,223.33,63,1.2,?
Asia, 'Saudi Arabia',31,243.33,71.8,12.8,68.5
Africa,'Senegal',116,186.67,55.7,1.8,19.5
Africa,'Seychelles',20,246.67,72.7,7.8,?
Africa, 'Sierra Leone',143,166.67,40.8,0.8,23.9
Asia,'Singapore',53,230,78.7,28.1,?
Europe,'Slovakia',129,180,74,16.1,86.6
Europe,'Slovenia',67,220,76.4,21.6,98.8
Australia, 'Solomon Islands',54,230,62.3,1.7,?
Africa, 'South Africa',109,190,48.4,12,90.2
Asia, 'South Korea',102,193.33,77,20.4,97.4
Europe,'Spain',46,233.33,79.5,25.5,112.8
Asia, 'Sri Lanka',93,203.33,74,4.3,?
Americas, 'St Kitts And Nevis',21,246.67,70,8.8,?
Americas, 'St Lucia',47,233.33,72.4,5.4,94.3
Americas, 'St Vincent And The Grenadines',40,240,71.1,2.9,?
Africa,'Sudan',173,120,56.4,2.1,28.8
Americas,'Suriname',32,243.33,69.1,4.1,50.7
Africa,'Swaziland',168,140,32.5,5,?
Europe,'Sweden',7,256.67,80.2,29.8,152.8
Europe,'Switzerland',2,273.33,80.5,32.3,99.9
Asia,'Syria',142,170,73.3,3.9,42
Asia,'Taiwan',68,220,76.1,27.6,?
Asia,'Tajikistan',94,203.33,63.6,1.2,76
Africa,'Tanzania',121,183.33,46,0.7,5.31
Asia,'Thailand',76,216.67,70,8.3,79
Asia,'Timor-Leste',69,220,65.5,0.4,?
Africa,'Togo',147,163.33,54.3,1.7,?
Australia,' Tonga',70,220,72.2,2.3,?
Americas, 'Trinidad And Tobago',55,230,69.9,16.7,78.4
Africa,'Tunisia',79,213.33,73.3,8.3,74.6
Europe,'Turkey',133,176.67,68.7,8.2,?
Asia,'Turkmenistan',171,133.33,62.4,8,?
Asia,'Uae',22,246.67,78,43.4,74.4
Africa,'Uganda',156,156.67,47.3,1.8,?
Europe,'Ukraine',174,120,66.1,7.2,92.8
Europe, 'United Kingdom',41,236.67,78.4,30.3,157.2
Americas,'Uruguay',87,210,75.4,9.6,91.6
Americas,'Usa',23,246.67,77.4,41.8,94.6
Asia,'Uzbekistan',80,213.33,66.5,1.8,?
Australia,' Vanuatu',24,246.67,68.6,2.9,28.5
Americas,'Venezuela',25,246.67,72.9,6.1,?
Asia,'Vietnam',95,203.33,70.5,2.8,64.6
Asia,'Yemen',91,206.67,60.6,0.9,?
Africa,'Zambia',148,163.33,37.5,0.9,25.5
Africa,'Zimbabwe',177,110,36.9,2.3,45.3
You don't need the M5P line; that's not an attribute. Just omit line 2.
The country attribute also causes a problem: I get the message "Attribute is neither numeric nor nominal". (You have it declared as string, which matches the data, but M5P cannot use it.) When I remove the country attribute, I can run M5P (3 rules, correlation = 0.85).
Now, you may be thinking "but I want to keep track of what country my predictions are for". Here's how to do that:
First, set up a FilteredClassifier that removes attribute 2 (country) and runs M5P.
Second, under "More options", choose "Output predictions" and pick a format. Here I chose CSV (comma separated values), and then right-clicked to select all attributes (first-last) to output.
Now start the model. This gives you the actual value, the prediction, and all the data, including the country name.
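For reference, the header with the stray M5P line removed would look like this (the country attribute can stay in the file and be filtered out at classification time, as described above):
@relation world_happiness
@attribute continent {Americas, Africa, Asia, Europe, Australia, Antarctica}
@attribute country string
@attribute SWL-ranking numeric
@attribute SWL-index numeric
@attribute life-expectancy numeric
@attribute GDP-per-capita numeric
@attribute access-to-education-score numeric
@data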

What should be given as input to the linkage function - the tf-idf matrix, or the similarity between different elements of the tf-idf matrices?

I have the following Python notebook, which aims to cluster different groups of abstracts based on the similarity of their text.
I have two approaches here: one is to use the tf-idf numpy array of the documents as-is in the linkage function; the second is to compute the similarity between the tf-idf arrays of the different documents and then use that similarity matrix for clustering. I am unable to understand which one is correct.
Approach 1:
I used cosine_similarity to compute the similarity matrix (cosine) of the tf-idf matrix. I first converted the redundant square matrix (cosine) into a condensed distance matrix (distance_matrix) using the squareform function. Then distance_matrix is fed into the linkage function, and I plotted the dendrogram.
Approach 2:
I fed the dense tf-idf numpy array directly into the linkage function and plotted the dendrograms.
My question is: which is correct? As far as I can tell from the data, approach 2 seems to be correct, but approach 1 makes sense to me. It would be great if someone could explain what is right in this scenario. Thanks in advance.
Let me know if anything remains unclear in the question.
import pandas, numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

### Data cleaning
stop_words = stopwords.words('english')
tokenizer = RegexpTokenizer(r'\w+')
df = pandas.read_csv('WIPO_CSV.csv')

import sys
reload(sys)                      # Python 2 only
sys.setdefaultencoding('utf8')

documents_no_stopwords = []

def preprocessing(word):
    tokens = tokenizer.tokenize(word)
    processed_words = []
    for w in tokens:
        if w in stop_words:
            continue
        else:
            processed_words.append(w)
    # This step creates a list of text documents with only the nouns in them
    documents_no_stopwords.append(' '.join(processed_words))

for text in df['TEXT'].tolist():
    preprocessing(text)

# Converting into tfidf form
# (latin1 is used as the utf8 decoder was facing some trouble with the text)
vectoriser = TfidfVectorizer(encoding='latin1')
# tfidf_documents is a numpy sparse matrix in normalised form
tfidf_documents = vectoriser.fit_transform(documents_no_stopwords)

## Cosine similarity, as the input to linkage should be a distance vector
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import squareform
cosine = cosine_similarity(tfidf_documents)
distance_matrix = squareform(cosine, force='tovector', checks=False)

from scipy.cluster.hierarchy import dendrogram, linkage

## Linkage based on tfidf of each document
z_num = linkage(tfidf_documents.todense(), 'ward')
z_num  # tfidf
array([[11. , 12. , 0. , 2. ],
[18. , 19. , 0. , 2. ],
[20. , 31. , 0. , 3. ],
[21. , 32. , 0. , 4. ],
[22. , 33. , 0. , 5. ],
[17. , 34. , 0.38208619, 6. ],
[15. , 28. , 1.19375843, 2. ],
[ 6. , 9. , 1.24241258, 2. ],
[ 7. , 8. , 1.27069483, 2. ],
[13. , 37. , 1.28868301, 3. ],
[ 4. , 24. , 1.30850122, 2. ],
[36. , 39. , 1.32090275, 5. ],
[10. , 16. , 1.32602346, 2. ],
[27. , 38. , 1.32934025, 3. ],
[23. , 25. , 1.32987072, 2. ],
[ 3. , 29. , 1.35143582, 2. ],
[ 5. , 14. , 1.35401753, 2. ],
[26. , 42. , 1.35994878, 3. ],
[ 2. , 45. , 1.40055438, 3. ],
[ 0. , 40. , 1.40811825, 3. ],
[ 1. , 46. , 1.41383622, 3. ],
[44. , 50. , 1.4379821 , 5. ],
[41. , 43. , 1.44575227, 8. ],
[48. , 51. , 1.45876241, 8. ],
[49. , 53. , 1.47130328, 11. ],
[47. , 52. , 1.49944936, 11. ],
[54. , 55. , 1.69814818, 22. ],
[30. , 56. , 1.91299937, 24. ],
[35. , 57. , 3.1967033 , 30. ]])
from matplotlib import pyplot as plt
plt.figure(figsize=(25, 10))
dn = dendrogram(z_num)
plt.show()
Linkage based on similarity:
z_sim = linkage(distance_matrix, 'ward')
z_sim  # cosine similarity
array([[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 2.00000000e+00],
[2.00000000e+00, 3.00000000e+01, 0.00000000e+00, 3.00000000e+00],
[1.70000000e+01, 3.10000000e+01, 0.00000000e+00, 4.00000000e+00],
[3.00000000e+00, 4.00000000e+00, 0.00000000e+00, 2.00000000e+00],
[1.00000000e+01, 3.30000000e+01, 0.00000000e+00, 3.00000000e+00],
[5.00000000e+00, 7.00000000e+00, 0.00000000e+00, 2.00000000e+00],
[6.00000000e+00, 1.80000000e+01, 0.00000000e+00, 2.00000000e+00],
[1.10000000e+01, 1.90000000e+01, 0.00000000e+00, 2.00000000e+00],
[1.20000000e+01, 2.00000000e+01, 0.00000000e+00, 2.00000000e+00],
[8.00000000e+00, 2.40000000e+01, 0.00000000e+00, 2.00000000e+00],
[1.60000000e+01, 2.10000000e+01, 0.00000000e+00, 2.00000000e+00],
[2.20000000e+01, 2.70000000e+01, 0.00000000e+00, 2.00000000e+00],
[9.00000000e+00, 2.90000000e+01, 0.00000000e+00, 2.00000000e+00],
[2.60000000e+01, 4.20000000e+01, 0.00000000e+00, 3.00000000e+00],
[1.40000000e+01, 3.40000000e+01, 3.97089886e-03, 4.00000000e+00],
[2.30000000e+01, 4.40000000e+01, 1.81733052e-02, 5.00000000e+00],
[3.20000000e+01, 3.50000000e+01, 2.14592323e-02, 6.00000000e+00],
[2.50000000e+01, 4.00000000e+01, 2.84944415e-02, 3.00000000e+00],
[1.30000000e+01, 4.70000000e+01, 5.02045376e-02, 4.00000000e+00],
[4.10000000e+01, 4.30000000e+01, 5.10902795e-02, 5.00000000e+00],
[3.70000000e+01, 4.50000000e+01, 5.40176402e-02, 7.00000000e+00],
[3.80000000e+01, 3.90000000e+01, 6.15118462e-02, 4.00000000e+00],
[1.50000000e+01, 4.60000000e+01, 7.54874869e-02, 7.00000000e+00],
[2.80000000e+01, 5.00000000e+01, 9.55487454e-02, 8.00000000e+00],
[5.20000000e+01, 5.30000000e+01, 3.86911095e-01, 1.50000000e+01],
[4.90000000e+01, 5.40000000e+01, 4.16693529e-01, 2.00000000e+01],
[4.80000000e+01, 5.50000000e+01, 4.58764920e-01, 2.40000000e+01],
[3.60000000e+01, 5.60000000e+01, 5.23422380e-01, 2.60000000e+01],
[5.10000000e+01, 5.70000000e+01, 5.49419077e-01, 3.00000000e+01]])
from matplotlib import pyplot as plt
plt.figure(figsize=(25, 10))
dn = dendrogram(z_sim)
plt.show()
The clustering accuracy is compared with this photo: https://drive.google.com/file/d/1EgkPqwh7AKhGqOe1zf9KNjSMxPQ9Xfd9/view?usp=sharing
The dendrograms I got are available in the following notebook link: https://drive.google.com/file/d/1TB7aFK4lPDo43GY74FPOqVOx1AxWV-A_/view?usp=sharing (open the HTML file in a web browser).
SciPy only supports distances for HAC (hierarchical agglomerative clustering), not similarities.
Given proper distances, the results should then be the same, so there is no "right" or "wrong".
At some point you need the distance matrix in linearized (condensed) form. It is probably most efficient to use a) a method that can process sparse data (avoiding any todense call) and b) one that directly produces the linearized form, rather than generating the entire square matrix and then dropping half of it.
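As an illustration of point b), the condensed cosine distance vector can be produced directly with pdist (a sketch; note that pdist still requires a dense array, so it does not address point a)):
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

# condensed (linearized) cosine distances, computed directly
condensed_distances = pdist(tfidf_documents.toarray(), metric='cosine')
# 'average' linkage works with arbitrary precomputed distances
z = linkage(condensed_distances, method='average')
dn = dendrogram(z)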
