BERT problem with context/semantic search in Italian language - machine-learning

I am using a BERT model for contextual search in Italian, but it does not capture the contextual meaning of a sentence and returns the wrong result.
In the example code below I compare "milk with chocolate flavour" against two other kinds of milk and one chocolate product, and it returns the highest similarity with the chocolate. It should return high similarity with the other milks.
Can anyone suggest an improvement to the code below so that it returns semantic results?
Code:
!python -m spacy download it_core_news_lg
!pip install sentence-transformers

import scipy.spatial
import numpy as np
from sentence_transformers import models, SentenceTransformer

model = SentenceTransformer('distiluse-base-multilingual-cased')  # works with Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish

corpus = [
    "Alpro, Cioccolato bevanda a base di soia 1 ltr",      # Alpro, chocolate soy drink 1 ltr (soy milk)
    "Milka cioccolato al latte 100 g",                     # Milka milk chocolate 100 g
    "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",  # Danone, HiPRO 25g protein chocolate flavor 330 ml (milk with chocolate flavor)
]
corpus_embeddings = model.encode(corpus)

queries = [
    'latte al cioccolato',  # milk with chocolate flavor
]
query_embeddings = model.encode(queries)

# Calculate cosine similarity of the query against each corpus sentence
closest_n = 10
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]
    results = sorted(zip(range(len(distances)), distances), key=lambda x: x[1])
    print("\n======================\n")
    print("Query:", query)
    print("\nTop 10 most similar sentences in corpus:")
    for idx, distance in results[0:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1 - distance))
Output:
======================
Query: latte al cioccolato
Top 10 most similar sentences in corpus:
Milka cioccolato al latte 100 g (Score: 0.7714)
Alpro, Cioccolato bevanda a base di soia 1 ltr (Score: 0.5586)
Danone, HiPRO 25g Proteine gusto cioccolato 330 ml (Score: 0.4569)

The problem is not with your code; it is simply insufficient model performance.
There are a few things you can do. First, you can try the Universal Sentence Encoder (USE). In my experience its embeddings are a little better, at least in English.
Second, you can try a different model, for example sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1. It is based on RoBERTa and may give better performance.
Third, you can combine embeddings from several models (simply by concatenating the representations). In some cases this helps, at the expense of much heavier compute; a sketch of both ideas is shown below.
And finally, you can create your own model. It is well known that single-language models perform significantly better than multilingual ones, so you can follow the sentence-transformers guide and train your own Italian model.
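As a rough illustration of the second and third suggestions, here is a minimal sketch (assuming the named checkpoints are still downloadable from the sentence-transformers hub) that encodes the same corpus with a second model and concatenates the two representations before computing cosine distances:

import numpy as np
import scipy.spatial
from sentence_transformers import SentenceTransformer

# Two multilingual models; the second is a paraphrase-tuned RoBERTa variant
model_a = SentenceTransformer('distiluse-base-multilingual-cased')
model_b = SentenceTransformer('sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1')

corpus = [
    "Alpro, Cioccolato bevanda a base di soia 1 ltr",
    "Milka cioccolato al latte 100 g",
    "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",
]
query = "latte al cioccolato"

def embed(texts):
    # Concatenate the two models' representations along the feature axis
    return np.concatenate([model_a.encode(texts), model_b.encode(texts)], axis=1)

corpus_emb = embed(corpus)
query_emb = embed([query])

distances = scipy.spatial.distance.cdist(query_emb, corpus_emb, "cosine")[0]
for idx in np.argsort(distances):
    print(corpus[idx], "(Score: %.4f)" % (1 - distances[idx]))

Whether concatenation actually helps has to be checked empirically on your own data; it roughly doubles both encoding time and embedding size.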

Related

How to get vocabulary size of word2vec?

I have a pretrained word2vec model in pyspark and I would like to know how big is its vocabulary (and perhaps get a list of words in the vocabulary).
Is this possible? I would guess it has to be stored somewhere since it can predict for new data, but I couldn't find a clear answer in the documentation.
I tried w2v_model.getVectors().count(), but the result (970) seems too small for my use case. In case it is relevant, I'm using short-text data and my dataset has tens of millions of messages, each with about 10 to 40 words. I am using min_count=50.
Not quite sure why you doubt the result of .getVectors().count(); it does give the desired result, as shown in the documentation link you provided yourself.
Here is the example posted there, with a vocabulary of just three (3) tokens - a, b, and c:
from pyspark.ml.feature import Word2Vec
sent = ("a b " * 100 + "a c " * 10).split(" ") # 3-token vocabulary
doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
model = word2Vec.fit(doc)
So, unsurprisingly, it is
model.getVectors().count()
# 3
and asking for the vectors themselves
model.getVectors().show()
gives
+----+--------------------+
|word| vector|
+----+--------------------+
| a|[0.09511678665876...|
| b|[-1.2028766870498...|
| c|[0.30153277516365...|
+----+--------------------+
In your case, with min_count=50, every word that appears fewer than 50 times in your corpus will not be represented; reducing this number will result in more vectors.
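If you also want the list of words themselves (as asked in the question), a minimal sketch, assuming the same fitted model object as above (w2v_model in your code), is to collect the word column of the vectors DataFrame:

vocab = [row.word for row in model.getVectors().select("word").collect()]
print(len(vocab))   # vocabulary size, same value as model.getVectors().count()
print(vocab[:20])   # first few vocabulary words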

Why am I getting almost the same top 10 features using a Multinomial Naive Bayes classifier for the positive and negative classes?

After running MultinomialNB multiple times I'm getting the same top features for the positive and negative classes, with both BoW and TF-IDF features.
I even tried bi-grams and tri-grams, and I still get the same features for both classes.
best_alpha = 6
clf = MultinomialNB( alpha=best_alpha )
clf.fit(X_tr, y_train)
y_train_pred = batch_predict(clf, X_tr)
y_test_pred = batch_predict(clf, X_te)
train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)
This is the code for getting top 10 features for positive and negative classes of text data Tf-Idf.
feats_tfidf contains the features of categorical, numerical and text data.
For Positive class
sorted_idx = np.argsort(clf.feature_log_prob_[1])[-10:]
for p, q in zip(feats_tfidf[sorted_idx], clf.feature_log_prob_[1][sorted_idx]):
    print('{:45}:{}'.format(p, q))
Output:
Mathematics :-7.134937347073638
Literacy :-6.910334729871051
Grades_3_5 :-6.832969821702653
Ms :-6.791634814736902
Math_Science :-6.748584860699069
Grades_PreK_2 :-6.664767807632341
Literacy_Language :-6.4833650280402875
Mrs :-6.404885953106168
Teacher number of previously posted projects :-3.285663623429455
price :-0.09775430166978438
For negative class
sorted_idx = np.argsort(clf.feature_log_prob_[0])[-10:]
for p, q in zip(feats_tfidf[sorted_idx], clf.feature_log_prob_[0][sorted_idx]):
    print('{:45}:{}'.format(p, q))
Output:
Literacy :-7.31906682336635
Mathematics :-7.318545582802034
Grades_3_5 :-7.088236519755028
Ms :-6.970453484098645
Math_Science :-6.887189615718408
Grades_PreK_2 :-6.85882128589294
Literacy_Language :-6.8194613665941155
Mrs :-6.648860662073821
Teacher number of previously posted projects :-4.008908256269724
price :-0.08131982830664697
Can someone please tell me whether this is the correct way of doing it?
It should be like this:
sorted_idx = np.argsort(-1 * clf_bow.feature_log_prob_[0])[0:10]
for i in sorted_idx:
    print(count_vect.get_feature_names()[i])
When you write [-10:] you print the last 10 positions of the ascending sort, i.e. from least to most probable among the top 10,
but we want them printed from most to least probable; the small sketch below shows that both slices select the same indices, just in opposite order.
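For clarity, here is a tiny self-contained sketch (illustrative only, using a made-up log-probability vector rather than a real feature_log_prob_ row):

import numpy as np

log_prob = np.array([-7.1, -0.1, -6.9, -3.3, -6.4, -6.8])  # hypothetical log-probabilities

asc = np.argsort(log_prob)[-3:]    # three largest, in ascending order: the top feature comes last
desc = np.argsort(-log_prob)[:3]   # three largest, in descending order: the top feature comes first

print(asc)   # [4 3 1]
print(desc)  # [1 3 4]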
I'm working on the same problem, and yes, I also got many top features that are common to both classes, though not in exactly the same order as yours.
Here's how I did it:
I first chained all the features and their probability values (the exponential of the log-probability) together and then sorted them in descending order.
Top 10 positive class features
Top 10 negative class features
So yes, I think what you're getting is correct.

nltk naivebayes classifier for text classification

In the following code, I know that my Naive Bayes classifier works correctly because it works on trainset1, but why does it not work on trainset2? I even tried it with two classifiers, one from TextBlob and the other directly from NLTK.
from textblob.classifiers import NaiveBayesClassifier
from textblob import TextBlob
from nltk.tokenize import word_tokenize
import nltk
trainset1 = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]
trainset2 = [('hide all brazil and everything plan limps to anniversary inflation plan initiallyis limping its first anniversary amid soaring prices', 'class1'),
('hello i was there and no one came', 'class2'),
('all negative terms like sad angry etc', 'class2')]
def nltk_naivebayes(trainset, test_sentence):
    all_words = set(word.lower() for passage in trainset for word in word_tokenize(passage[0]))
    t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in trainset]
    classifier = nltk.NaiveBayesClassifier.train(t)
    test_sent_features = {word.lower(): (word in word_tokenize(test_sentence.lower())) for word in all_words}
    return classifier.classify(test_sent_features)

def textblob_naivebayes(trainset, test_sentence):
    cl = NaiveBayesClassifier(trainset)
    blob = TextBlob(test_sentence, classifier=cl)
    return blob.classify()
test_sentence1 = "he is my horrible enemy"
test_sentence2 = "inflation soaring limps to anniversary"
print(nltk_naivebayes(trainset1, test_sentence1))
print(nltk_naivebayes(trainset2, test_sentence2))
print(textblob_naivebayes(trainset1, test_sentence1))
print(textblob_naivebayes(trainset2, test_sentence2))
Output:
neg
class2
neg
class2
Although test_sentence2 clearly belongs to class1.
I will assume you understand that you cannot expect a classifier to learn a good model from only 3 examples, and that your question is more about understanding why it behaves this way in this specific example.
The likely reason is that a Naive Bayes classifier uses a prior class probability, i.e. the probability of each class regardless of the text. In your case, 2/3 of the examples belong to class2, so the prior is 66% for class2 and 33% for class1. The distinguishing words in your single class1 instance are 'anniversary' and 'soaring', which are unlikely to be enough to compensate for this prior class probability.
In particular, be aware that the calculation of word probabilities involves various 'smoothing' functions (for instance, log10(term frequency + 1) in each class rather than log10(term frequency)) to prevent low-frequency words from having too much impact on the classification results, to avoid divisions by zero, and so on. Thus the probabilities for "anniversary" and "soaring" are not 0.0 for class2 and 1.0 for class1, unlike what you may have expected.
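A quick way to see both effects is to inspect what the NLTK classifier actually learned. A minimal sketch, reusing the feature extraction from the question purely for inspection, might look like this:

from nltk.tokenize import word_tokenize
import nltk

trainset2 = [('hide all brazil and everything plan limps to anniversary inflation plan initiallyis limping its first anniversary amid soaring prices', 'class1'),
             ('hello i was there and no one came', 'class2'),
             ('all negative terms like sad angry etc', 'class2')]

all_words = set(word.lower() for passage in trainset2 for word in word_tokenize(passage[0]))
t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in trainset2]
classifier = nltk.NaiveBayesClassifier.train(t)

# Which features the model considers most informative
classifier.show_most_informative_features(10)

# How close the per-class probabilities are for the test sentence
test_sentence = "inflation soaring limps to anniversary"
features = {word.lower(): (word in word_tokenize(test_sentence.lower())) for word in all_words}
dist = classifier.prob_classify(features)
for label in dist.samples():
    print(label, dist.prob(label))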

Best similarity (dissimilarity) measure among multidimensional categorical vectors

I would like to find the similarity (dissimilarity) among the following data points.
My categorical data set is as follows: { Art, Science, Maths, Medical, Physics, Chemistry, Engineering, etc. }, for example 15 or 20 categories.
I would like to find the similarity (dissimilarity) among libraries, where each library row (data point) is a row vector of book counts per category.
Books attributes:
libraries   total-books   Art   Science   Math.   Chemistry
lib1        1000          50    200       0       3
lib2        500           12    0         0       44
lib3        etc.
The table shows the number of books found in each library. Once we compute each category's frequency as a percentage of the total books, we can re-order the categories for each library by that percentage,
ignoring the zero-count categories, for example:
library 1 = { Science, Art, Chemistry, ... }
library 2 = { Chemistry, Art, ... }
etc.
How can I find the similarity/dissimilarity between lib1 and lib2, and so on?
Any suggestion please.
If you normalize by the total number of books, you can treat the remaining columns as a histogram.
Then you could try any of the distribution-based distances (a small sketch follows this list):
histogram intersection distance
Kullback-Leibler divergence
$\chi^2$ distance
Jensen-Shannon divergence
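As a minimal sketch, using only the lib1 and lib2 rows from the question and SciPy's Jensen-Shannon implementation (the histogram_intersection helper is written here just for illustration):

import numpy as np
from scipy.spatial.distance import jensenshannon

# Category counts for Art, Science, Math., Chemistry (from the table above);
# in the full dataset you would divide by the total-books column instead
lib1 = np.array([50.0, 200.0, 0.0, 3.0])
lib2 = np.array([12.0, 0.0, 0.0, 44.0])

# Normalize so each row becomes a histogram (probability distribution)
p = lib1 / lib1.sum()
q = lib2 / lib2.sum()

def histogram_intersection(a, b):
    # Similarity in [0, 1]; 1 means identical histograms
    return np.minimum(a, b).sum()

print("Jensen-Shannon distance:", jensenshannon(p, q))
print("Histogram intersection :", histogram_intersection(p, q))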

Supervised Latent Dirichlet Allocation for Document Classification?

I have a bunch of documents already classified by humans into some groups.
Is there a modified version of LDA which I can use to train a model and then classify unknown documents with it later?
For what it's worth, LDA as a classifier is going to be fairly weak because it's a generative model, and classification is a discriminative problem. There is a variant of LDA called supervised LDA which uses a more discriminative criterion to form the topics (you can get source for this in various places), and there's also a paper with a max margin formulation that I don't know the status of source-code-wise. I would avoid the Labelled LDA formulation unless you're sure that's what you want, because it makes a strong assumption about the correspondence between topics and categories in the classification problem.
However, it's worth pointing out that none of these methods use the topic model directly to do the classification. Instead, they take documents and, instead of using word-based features, use the posterior over the topics (the vector that results from inference for the document) as the feature representation before feeding it to a classifier, usually a linear SVM. This gets you topic-model-based dimensionality reduction, followed by a strong discriminative classifier, which is probably what you're after. This pipeline is available in most languages using popular toolkits.
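For illustration only, here is a minimal sketch of that pipeline in scikit-learn (my choice of toolkit, not one the answer prescribes, and with tiny made-up documents): reduce documents to topic proportions with LDA, then train a linear SVM on those proportions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# docs: raw document strings; labels: the human-assigned groups
docs = ["a film about love", "stock markets fell sharply",
        "a touching romantic movie", "bond yields rose today"]
labels = ["cinema", "finance", "cinema", "finance"]

pipeline = make_pipeline(
    CountVectorizer(stop_words="english"),
    LatentDirichletAllocation(n_components=2, random_state=0),  # topic-model dimensionality reduction
    LinearSVC(),                                                # discriminative classifier on topic proportions
)
pipeline.fit(docs, labels)
print(pipeline.predict(["a beautiful love story on screen"]))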
You can implement supervised LDA with PyMC, using a Metropolis sampler to learn the latent variables of the sLDA graphical model.
The training corpus consists of 10 movie reviews (5 positive and 5 negative) along with the associated star rating for each document. The star rating is known as a response variable which is a quantity of interest associated with each document. The documents and response variables are modeled jointly in order to find latent topics that will best predict the response variables for future unlabeled documents. For more information, check out the original paper.
Consider the following code:
import pymc as pm
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = ["exploitative and largely devoid of the depth or sophistication ",
                "simplistic silly and tedious",
                "it's so laddish and juvenile only teenage boys could possibly find it funny",
                "it shows that some studios firmly believe that people have lost the ability to think",
                "our culture is headed down the toilet with the ferocity of a frozen burrito",
                "offers that rare combination of entertainment and education",
                "the film provides some great insight",
                "this is a film well worth seeing",
                "a masterpiece four years in the making",
                "offers a breath of the fresh air of true sophistication"]
test_corpus = ["this is a really positive review, great film"]
train_response = np.array([3, 1, 3, 2, 1, 5, 4, 4, 5, 5]) - 3

# LDA parameters
num_features = 1000  # vocabulary size
num_topics = 4       # fixed for LDA

tfidf = TfidfVectorizer(max_features=num_features, max_df=0.95, min_df=0, stop_words='english')

# generate tf-idf term-document matrix
A_tfidf_sp = tfidf.fit_transform(train_corpus)  # size D x V
print("number of docs: %d" % A_tfidf_sp.shape[0])
print("dictionary size: %d" % A_tfidf_sp.shape[1])

# tf-idf dictionary
tfidf_dict = tfidf.get_feature_names()

K = num_topics           # number of topics
V = A_tfidf_sp.shape[1]  # number of words
D = A_tfidf_sp.shape[0]  # number of documents
data = A_tfidf_sp.toarray()

# Supervised LDA graphical model
Wd = [len(doc) for doc in data]
alpha = np.ones(K)
beta = np.ones(V)

theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])
z = pm.Container([pm.Categorical('z_%s' % d, p=theta[d], size=Wd[d], value=np.random.randint(K, size=Wd[d])) for d in range(D)])

@pm.deterministic
def zbar(z=z):
    zbar_list = []
    for i in range(len(z)):
        hist, bin_edges = np.histogram(z[i], bins=K)
        zbar_list.append(hist / float(np.sum(hist)))
    return pm.Container(zbar_list)

eta = pm.Container([pm.Normal("eta_%s" % k, mu=0, tau=1.0/10**2) for k in range(K)])
y_tau = pm.Gamma("tau", alpha=0.1, beta=0.1)

@pm.deterministic
def y_mu(eta=eta, zbar=zbar):
    y_mu_list = []
    for i in range(len(zbar)):
        y_mu_list.append(np.dot(eta, zbar[i]))
    return pm.Container(y_mu_list)

# response likelihood
y = pm.Container([pm.Normal("y_%s" % d, mu=y_mu[d], tau=y_tau, value=train_response[d], observed=True) for d in range(D)])

# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d, i),
                                 p=pm.Lambda('phi_z_%i_%i' % (d, i), lambda z=z[d][i], phi=phi: phi[z]),
                                 value=data[d][i], observed=True) for d in range(D) for i in range(Wd[d])])

model = pm.Model([theta, phi, z, eta, y, w])
mcmc = pm.MCMC(model)
mcmc.sample(iter=1000, burn=100, thin=2)

# visualize topics
phi0_samples = np.squeeze(mcmc.trace('phi_0')[:])
phi1_samples = np.squeeze(mcmc.trace('phi_1')[:])
phi2_samples = np.squeeze(mcmc.trace('phi_2')[:])
phi3_samples = np.squeeze(mcmc.trace('phi_3')[:])

ax = plt.subplot(221)
plt.bar(np.arange(V), phi0_samples[-1, :])
ax = plt.subplot(222)
plt.bar(np.arange(V), phi1_samples[-1, :])
ax = plt.subplot(223)
plt.bar(np.arange(V), phi2_samples[-1, :])
ax = plt.subplot(224)
plt.bar(np.arange(V), phi3_samples[-1, :])
plt.show()
Given the training data (observed words and response variables), we can learn the global topics (beta) and regression coefficients (eta) for predicting the response variable (Y), in addition to the topic proportions for each document (theta).
In order to make predictions of Y given the learned beta and eta, we can define a new model where Y is not observed and use the previously learned beta and eta.
Here we predicted a positive review (approximately 2, given a rating range of -2 to 2) for the test corpus consisting of one sentence, "this is a really positive review, great film", as shown by the mode of the posterior histogram.
See ipython notebook for a complete implementation.
Yes, you can try the Labeled LDA implementation in the Stanford Topic Modeling Toolbox at
http://nlp.stanford.edu/software/tmt/tmt-0.4/
