How to retrieve array in Word2Vec - machine-learning

I am trying to retrieve the array/vector of a word from my trained word2vec model. In spaCy this is possible with model.vocab.get_vector("word"), but I can't find a way to do it with gensim's Word2Vec.

From the gensim documentation, initialize a model with e.g.:
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

path = get_tmpfile("word2vec.model")  # temporary file; unused below, you can save to any path
model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
Now you can get the vector of a word, e.g. "word", with:
model.wv['word']  # numpy vector of a word (or: model.wv.word_vec("word"))
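If you need the vector later (e.g. in another script), reload the saved model first. A minimal sketch, assuming the same gensim version and the filename used above; note that common_texts does not actually contain the token "word", so look up a word that is in your vocabulary:

from gensim.models import Word2Vec

model = Word2Vec.load("word2vec.model")  # reload the trained model
vector = model.wv['computer']            # 'computer' occurs in common_texts
print(vector.shape)                      # (100,)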

Related

I created TF-IDF code to analyze an annual report and I want to know the importance of specific keywords

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

with open(r'C:\Users\maxim\PycharmProjects\THESIS\data\santander2020_1.txt', 'r') as file:
    data = file.read()

dataset = [data]
tfIdfVectorizer = TfidfVectorizer(use_idf=True, stop_words="english",
                                  lowercase=True, max_features=100, ngram_range=(1, 3))
tfIdf = tfIdfVectorizer.fit_transform(dataset)
df = pd.DataFrame(tfIdf[0].T.todense(), index=tfIdfVectorizer.get_feature_names(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)
print(df.head(25))
The above code is what I've created to do a TF-IDF analysis on an annual report. Currently it gives me the values of the most important words in the report, but I only need the TF-IDF values for the keywords
["digital","hardware","innovation","software","analytics","data","digitalisation","technology"]. Is there a way to specify that it should only look for the TF-IDF values of these terms?
I'm very new to programming with little experience; I'm doing this for my thesis.
Any help is greatly appreciated.
You have defined tfIdf as tfIdf = tfIdfVectorizer.fit_transform(dataset).
So tfIdf.toarray() is a 2-D array where each row refers to a document and each element in the row is the TF-IDF score of the corresponding word. To know which word each element represents, use the .get_feature_names() method, which returns a list of words. You can then use this information to create a mapping (dict) from words to scores, like this:
wordScores = dict(zip(tfIdfVectorizer.get_feature_names(), tfIdf.toarray()[0]))
Now suppose your document contains the word "digital" and you want to know its TF-IDF score, you could simply print the value of wordScores["digital"].
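Putting this together for the keywords in the question, a short sketch (dict.get() with a default covers keywords that never entered the vocabulary, e.g. because of max_features=100):

keywords = ["digital", "hardware", "innovation", "software", "analytics",
            "data", "digitalisation", "technology"]
wordScores = dict(zip(tfIdfVectorizer.get_feature_names(), tfIdf.toarray()[0]))
for kw in keywords:
    print(kw, wordScores.get(kw, 0.0))  # 0.0 if the keyword is not in the vocabulary

Alternatively, TfidfVectorizer accepts a vocabulary argument, so passing vocabulary=keywords when constructing it restricts scoring to exactly those terms.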

sklearn how to use saved model to predict new data

I used sklearn to train an SVM text classifier, with tf-idf (TfidfVectorizer) to extract the features.
Now I need to save the model and load it to predict unseen text. I will load the model in another file; what confuses me is how to extract the tf-idf features of the new text.
You need to save the model AND the tfidf transformer. You can either save them separately, or create a pipeline of the two and save the pipeline (this is the preferred option).
Example:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
import pickle

Tfidf = TfidfVectorizer()
LR = LogisticRegression()
pipe = Pipeline([("Tfidf", Tfidf), ("LR", LR)])
pipe.fit(X, y)  # X: raw training texts, y: labels

with open('pipe.pickle', 'wb') as picklefile:
    pickle.dump(pipe, picklefile)
You can then load the whole pipeline which upon predict will first apply the vectorizer and then pass it to the model:
with open('pipe.pickle', 'rb') as picklefile:
    saved_pipe = pickle.load(picklefile)
saved_pipe.predict(X_test)
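For pipelines holding large numpy arrays, the scikit-learn docs suggest joblib instead of plain pickle; a minimal sketch using the same (hypothetical) pipe object:

import joblib

joblib.dump(pipe, 'pipe.joblib')         # save the fitted pipeline
saved_pipe = joblib.load('pipe.joblib')  # reload it in another file
saved_pipe.predict(X_test)               # vectorizes, then classifies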

Doc2Vec with Keras

Following the Mikolov paper, I want to compute Doc2Vec using Keras. I'm new to Keras so I need your help.
There is a corpus of documents, each with an id, and I want to get two embedding matrices: one for words and one for paragraphs. Is that right?
Is it possible to adapt my Word2Vec code to get these embeddings ?
This is an extract of my W2V code:
from keras.models import Sequential
from keras.layers import Embedding, Lambda, Dense
from keras import backend as K

cbow = Sequential()
cbow.add(Embedding(input_dim=V, output_dim=dim, input_length=window_size * 2))
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(dim,)))
cbow.add(Dense(V, activation='softmax'))
Should I add another embedding layer to take into account the paragraph id ?
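In the PV-DM spirit of the paper, yes: add a second Embedding indexed by paragraph id and merge it with the context words before the softmax. A minimal functional-API sketch, assuming V (vocabulary size), dim and window_size from the code above plus a document count N; this illustrates the architecture, not the full Doc2Vec training procedure:

from keras.models import Model
from keras.layers import Input, Embedding, Lambda, Dense, Concatenate
from keras import backend as K

context_input = Input(shape=(window_size * 2,))  # ids of surrounding words
doc_input = Input(shape=(1,))                    # paragraph/document id

word_vectors = Embedding(input_dim=V, output_dim=dim)(context_input)  # word matrix
doc_vector = Embedding(input_dim=N, output_dim=dim)(doc_input)        # paragraph matrix

merged = Concatenate(axis=1)([word_vectors, doc_vector])  # (window_size*2 + 1, dim)
mean = Lambda(lambda x: K.mean(x, axis=1), output_shape=(dim,))(merged)
output = Dense(V, activation='softmax')(mean)

pvdm = Model(inputs=[context_input, doc_input], outputs=output)
pvdm.compile(optimizer='adam', loss='categorical_crossentropy')

After training, the word and paragraph embedding matrices are the first weight of each Embedding layer (layer.get_weights()[0]).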

Text Classification with scikit-learn: how to get a new document's representation from a pickle model

I have a binary document classifier that uses a tf-idf representation of a training set of documents and applies logistic regression to it:
lr_tfidf = Pipeline([('vect', tfidf),('clf', LogisticRegression(random_state=0))])
lr_tfidf.fit(X_train, y_train)
I saved the model with pickle and use it to classify new documents:
text_model = pickle.load(open('text_model.pkl', 'rb'))
results = text_model.predict_proba(new_document)
How can I get the representation (features + frequencies) used by the model for this new document without explicitly computing it?
EDIT: I am trying to explain better what I want to get.
When I use predict_proba, I guess that the new document is represented as a vector of term frequencies (according to the rules stored in the model) and those frequencies are multiplied by the coefficients learnt by the logistic regression model to predict the class. Am I right? If so, how can I get the terms and term frequencies of this new document, as used by predict_proba?
I am using sklearn v 0.19
As I understand from the comments, you need to access the tfidfVectorizer from inside the pipeline. This can be done easily by:
tfidfVect = text_model.named_steps['vect']
Now you can use the transform() method of the vectorizer to get the tfidf values.
tfidf_vals = tfidfVect.transform(new_document)
The tfidf_vals will be a sparse matrix of single row containing the tfidf of terms found in the new_document. To check what terms are present in this matrix, you need to use tfidfVect.get_feature_names().
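To pair each non-zero value with its term, you can walk the indices of the sparse row; a short sketch (this assumes new_document is an iterable of strings, which is what transform() expects):

tfidfVect = text_model.named_steps['vect']
row = tfidfVect.transform(new_document)  # sparse matrix of shape (1, n_features)
terms = tfidfVect.get_feature_names()
for idx in row.nonzero()[1]:             # column indices of non-zero entries
    print(terms[idx], row[0, idx])       # term and its tf-idf weight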

can I use PCA for dimensionality reduction and then use its output for a one-class SVM classifier in Python

I want to use PCA for dimensionality reduction and then use its output for a one-class SVM classifier in Python. My training data set is of the order 16000x60. Also, how do I map the principal components back to the original columns to use them in the SVM, or can I use the principal components directly?
It is unclear what the problem is and what you have tried already. Of course you can. You can either add the PCA output to your original feature set or just use the PCA output on its own. I encourage you to use sklearn pipelines.
Simple example:
from sklearn import decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn import svm

svc = svm.SVC()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('svc', svc)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

pipe.fit(X_digits, y_digits)  # PCA transform, then SVC fit, in one call
print(pipe.score(X_digits, y_digits))
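Since the question asks about a one-class SVM, here is a comparable sketch with OneClassSVM as the final step (n_components=10 and nu=0.1 are arbitrary illustrative choices, and the random matrix stands in for the 16000x60 training set):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import OneClassSVM

X_train = np.random.rand(16000, 60)  # stand-in for your training data
pipe = Pipeline(steps=[('pca', PCA(n_components=10)),
                       ('ocsvm', OneClassSVM(nu=0.1, kernel='rbf'))])
pipe.fit(X_train)                    # one-class: no labels needed
labels = pipe.predict(X_train)       # +1 for inliers, -1 for outliers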
