How to fetch vectors for a word list with Word2Vec? - machine-learning

I want to create a text file that is essentially a dictionary, with each word paired with its vector representation from word2vec. I'm assuming the process would be to first train word2vec and then look up each word from my list and find its representation (and then save it in a new text file)?
I'm new to word2vec and I don't know how to go about doing this. I've read from several of the main sites, and several of the questions on Stack, and haven't found a good tutorial yet.

Direct access via model[word] is deprecated and will be removed in Gensim 4.0.0, in order to separate training from the embedding lookup. It should be replaced with, simply, model.wv[word].
Using Gensim in Python, after the vocabulary is built and the model trained, you can find the word count and sampling information already mapped in model.wv.vocab, where model is the variable name of your Word2Vec object.
Thus, to create a dictionary object, you may:
my_dict = {}
for key in model.wv.vocab:
    my_dict[key] = model.wv[key]
    # Or my_dict[key] = model.wv.get_vector(key)
    # Or my_dict[key] = model.wv.word_vec(key, use_norm=False)
Now that you have your dictionary, you can write it to a file with whatever means you like. For example, you can use the pickle library. Alternatively, if you are using Jupyter Notebook, they have a convenient 'magic command' %store my_dict > filename.txt. Your filename.txt will look like:
{'one': array([-0.06590105, 0.01573388, 0.00682817, 0.53970253, -0.20303348,
-0.24792041, 0.08682659, -0.45504045, 0.89248925, 0.0655603 ,
......
-0.8175681 , 0.27659689, 0.22305458, 0.39095637, 0.43375066,
0.36215973, 0.4040089 , -0.72396156, 0.3385369 , -0.600869 ],
dtype=float32),
'two': array([ 0.04694849, 0.13303463, -0.12208422, 0.02010536, 0.05969441,
-0.04734801, -0.08465996, 0.10344813, 0.03990637, 0.07126121,
......
0.31673026, 0.22282903, -0.18084198, -0.07555179, 0.22873943,
-0.72985399, -0.05103955, -0.10911274, -0.27275378, 0.01439812],
dtype=float32),
'three': array([-0.21048863, 0.4945509 , -0.15050395, -0.29089224, -0.29454648,
0.3420335 , -0.3419629 , 0.87303966, 0.21656844, -0.07530259,
......
-0.80034876, 0.02006451, 0.5299498 , -0.6286509 , -0.6182588 ,
-1.0569025 , 0.4557548 , 0.4697938 , 0.8928275 , -0.7877308 ],
dtype=float32),
'four': ......
}
You may also wish to look into the native save / load methods of Gensim's word2vec.
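For example (a minimal sketch; the file names are just placeholders), you can either persist the whole model, which can be trained further later, or export only the word vectors in the classic word2vec text format (one word per line followed by its vector components):
model.save("word2vec.model")
model = Word2Vec.load("word2vec.model")
# or export only the word vectors as plain text
model.wv.save_word2vec_format("vectors.txt", binary=False)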

The Gensim tutorial explains it very clearly.
First, you should create a word2vec model, either by training it on text, e.g.
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
or by loading a pre-trained model (you can find them here, for example).
Then iterate over all your words and check for their vectors in the model:
for word in words:
    vector = model[word]
Having that, just write word and vector formatted as you want.
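For instance (just a sketch; the output file name and formatting are arbitrary), you could write one word per line followed by its vector components:
with open("word_vectors.txt", "w") as f:
    for word in words:
        vector = model.wv[word]  # model[word] also works on older Gensim versions
        f.write(word + " " + " ".join(str(x) for x in vector) + "\n")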

You can directly get the vectors through
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
model.wv.vectors
and words through
model.wv.vocab.keys()
Hope it helps!
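If you need the words paired with the rows of model.wv.vectors, note that in Gensim versions before 4.0 the row order matches model.wv.index2word, so a small sketch would be:
word_to_vec = dict(zip(model.wv.index2word, model.wv.vectors))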

If you are willing to use Python with the gensim package, then, building on this answer and the Gensim Word2Vec documentation, you could do something like this:
from gensim.models import Word2Vec
# Take some sample sentences
tokenized_sentences = [["here","is","one"],["and","here","is","another"]]
# Initialise model, for more information, please check the Gensim Word2vec documentation
model = Word2Vec(tokenized_sentences, size=100, window=2, min_count=0)
# Get the ordered list of words in the vocabulary
words = model.wv.vocab.keys()
# Make a dictionary
we_dict = {word:model.wv[word] for word in words}

Gensim 4.0 updates: the vocab attribute is deprecated, and the way to look up a word's vector has changed.
Get the ordered list of words in the vocabulary:
words = list(model.wv.index_to_key)
Get the vector for 'also':
print(model.wv['also'])
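To rebuild the word-to-vector dictionary from the earlier answers under Gensim 4.x, a minimal sketch:
we_dict = {word: model.wv[word] for word in model.wv.index_to_key}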

Using basic python:
all_vectors = []
for index, vector in enumerate(model.wv.vectors):
    vector_object = {}
    # index2word is ordered to match the rows of model.wv.vectors
    vector_object[model.wv.index2word[index]] = vector
    all_vectors.append(vector_object)

For gensim 4.0:
my_dict = {}
for word in word_list:
    # norm=True returns the unit-normalized vector; use norm=False for the raw vector
    my_dict[word] = model.wv.get_vector(word, norm=True)

I would suggest this; there you can find anything you need, including Word2Vec, FastText, Doc2Vec, KeyedVectors and so on...

Related

How to get vocabulary size of word2vec?

I have a pretrained word2vec model in pyspark and I would like to know how big its vocabulary is (and perhaps get a list of the words in the vocabulary).
Is this possible? I would guess it has to be stored somewhere since it can predict for new data, but I couldn't find a clear answer in the documentation.
I tried w2v_model.getVectors().count(), but the result (970) seems too small for my use case. In case it may be relevant, I'm using short-text data and my dataset has tens of millions of messages, each having from 10 to 30/40 words. I am using min_count=50.
Not quite sure why you doubt the result of .getVectors().count(), which indeed gives the desired result, as shown in the documentation link you have provided yourself.
Here is the example posted there, with a vocabulary of just three (3) tokens - a, b, and c:
from pyspark.ml.feature import Word2Vec
sent = ("a b " * 100 + "a c " * 10).split(" ") # 3-token vocabulary
doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
model = word2Vec.fit(doc)
So, unsurprisingly, it is
model.getVectors().count()
# 3
and asking for the vectors themselves
model.getVectors().show()
gives
+----+--------------------+
|word| vector|
+----+--------------------+
| a|[0.09511678665876...|
| b|[-1.2028766870498...|
| c|[0.30153277516365...|
+----+--------------------+
In your case, with min_count=50, every word that appears less than 50 times in your corpus will not be represented; reducing this number will result in more vectors.
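If you also want the actual list of vocabulary words (a small sketch using the same model object):
vocab_words = [row.word for row in model.getVectors().select("word").collect()]
print(len(vocab_words), vocab_words[:10])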

How to save sentence-Bert output vectors to a file?

I am using BERT to get the similarity between multi-term words. Here is the code I used for the embedding:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-large-uncased-whole-word-masking')
words = [
"Artificial intelligence",
"Data mining",
"Political history",
"Literature book"]
I also have a dataset which contains 540000 other words.
Vocabs = [
"Winter flooding",
"Cholesterol diet", ....]
The problem is that when I want to embed Vocabs into vectors, it takes forever.
words_embeddings = model.encode(words)
Vocabs_embeddings = model.encode(Vocabs)
Is there any way to make it faster? Alternatively, I want to embed Vocabs in a for loop and save the output vectors to a file, so I don't have to embed the 540000 vocabs every time I need them. Is there a way to save the embeddings to a file and use them again?
I really appreciate your time trying to help me.
You can pickle your corpus and embeddings like this; you can also pickle a dictionary instead, or write them to a file in any other format you prefer.
import pickle
with open("my-embeddings.pkl", "wb") as fOut:
    pickle.dump({'sentences': words, 'embeddings': word_embeddings}, fOut)
Or, more generally, like below: you encode when the embeddings don't exist, but after that, any time you need them, you load them from file instead of re-encoding your corpus:
import os
import pickle
import numpy as np

if not os.path.exists(embedding_cache_path):
    # read your corpus etc
    corpus_sentences = ...
    print("Encoding the corpus. This might take a while")
    corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_numpy=True)
    corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    print("Storing file on disc")
    with open(embedding_cache_path, "wb") as fOut:
        pickle.dump({'sentences': corpus_sentences, 'embeddings': corpus_embeddings}, fOut)
else:
    print("Loading pre-computed embeddings from disc")
    with open(embedding_cache_path, "rb") as fIn:
        cache_data = pickle.load(fIn)
        corpus_sentences = cache_data['sentences']
        corpus_embeddings = cache_data['embeddings']
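As for raw speed, the usual levers are running on a GPU and increasing the batch size. A small sketch (the batch size of 128 is just an illustrative value, and device="cuda" assumes a GPU is available):
Vocabs_embeddings = model.encode(
    Vocabs,
    batch_size=128,            # larger batches are usually faster, memory permitting
    show_progress_bar=True,
    convert_to_numpy=True,
    device="cuda")             # drop this (or use "cpu") if no GPU is available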

How to "remember" categorical encodings for actual predictions after training?

Suppose I wanted to train a machine learning algorithm on some dataset that includes some categorical parameters. (I'm new to machine learning, but my thinking is...) Even if I converted all the categorical data to 1-hot-encoded vectors, how will this encoding map be "remembered" after training?
E.g., converting the initial dataset to use 1-hot encoding before training: say the universe of categories for some column c is {"good", "bad", "ok"}, so the rows convert as
[1, 2, "good"] ---> [1, 2, [1, 0, 0]],
[3, 4, "bad"] ---> [3, 4, [0, 1, 0]],
...
Then, after training the model, all future prediction inputs would need to use the same encoding scheme for column c.
How then, during future predictions, will data inputs remember that mapping (where "good" maps to index 0, etc.), specifically when planning to use a Keras RNN or LSTM model? Do I need to save it somewhere (e.g. with python pickle), and if so, how do I get the explicit mapping? Or is there a way to have the model handle categorical inputs internally, so I can just input the original label data during training and future use?
If anything in this question shows any serious confusion on my part about something, please let me know (again, very new to ML).
** I wasn't sure if this belongs on https://stats.stackexchange.com/, but I posted here since I specifically wanted to know how to deal with the actual code implementation of this problem.
What I've been doing is the following:
After you use StringIndexer.fit(), you can save its metadata (it includes the actual encoder mapping, like "good" being the first column).
This is the code I use (in Java, but it can be adjusted to Python):
StringIndexerModel sim = new StringIndexer()
        .setInputCol(field)
        .setOutputCol(field + "_INDEX")
        .setHandleInvalid("skip")
        .fit(dataset);
sim.write().overwrite().save("IndexMappingModels/" + field + "_INDEX");
and later, when trying to make predictions on a new dataset, you can load the stored metadata:
StringIndexerModel sim = StringIndexerModel.load("IndexMappingModels/" + field + "_INDEX");
dataset = sim.transform(dataset);
I imagine you have already solved this issue, since it was posted in 2018, but I've not found this solution anywhere else, so I believe it's worth sharing.
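For reference, a rough PySpark equivalent of the above (just a sketch; field and dataset are assumed to be defined as in the Java snippet):
from pyspark.ml.feature import StringIndexer, StringIndexerModel

# fit the indexer and persist it, including its label-to-index mapping
sim = StringIndexer(inputCol=field, outputCol=field + "_INDEX", handleInvalid="skip").fit(dataset)
sim.write().overwrite().save("IndexMappingModels/" + field + "_INDEX")

# later, before predicting on new data, load and apply the same mapping
sim = StringIndexerModel.load("IndexMappingModels/" + field + "_INDEX")
dataset = sim.transform(dataset)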
My thought would be to do something like this on the training/testing dataset D (using a mix of python and plain pseudo-code):
Do something like
# Before: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ...}
# assign a unique index for each distinct label of a categorical column and store it in a new column
# http://spark.apache.org/docs/latest/ml-features.html#stringindexer
label_indexer = StringIndexer(inputCol="cat_col_i", outputCol="cat_col_i_index").fit(D)
D = label_indexer.transform(D)
# After: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ..., cat_col_1_index: int, cat_col_2_index: int, ...}
for all the categorical columns
Then for all of these categorical name and index columns in D, make a map of the form
map = {}
for all categorical column names colname in D:
    map[colname] = []
    # create mapping dict for all categorical values for all
    # see https://spark.apache.org/docs/latest/sql-programming-guide.html#untyped-dataset-operations-aka-dataframe-operations
    for all rows r in D.select(colname, '%s_index' % colname).drop_duplicates():
        enc_from = r['%s' % colname]
        enc_to = r['%s_index' % colname]
        map[colname].append((enc_from, enc_to))
    # for cats that may appear later that have yet to be seen
    # (IDK if this is best practice, may be another way, see https://medium.com/#vaibhavshukla182/how-to-solve-mismatch-in-train-and-test-set-after-categorical-encoding-8320ed03552f)
    map[colname].append(('NOVEL_CAT', len(map[colname])))
    # sort by index encoding
    map[colname].sort(key=lambda pair: pair[1])
to end up with something like
{
    'cat_col_1': [('orig_label_11', 0), ('orig_label_12', 1), ...],
    'cat_col_2': [(), (), ...],
    ...
    'cat_col_n': [('orig_label_n1', 0), ...]
}
which can then be used to generate 1-hot-encoded vectors for each categorical column in any later data sample row ds. Eg.
for all categorical column names colname in ds:
    enc_from = ds[colname]
    # make zero vector for 1-hot for category
    col_onehot = zeros(size=len(map[colname]))
    for label, index in map[colname]:
        if label == enc_from:
            col_onehot[index] = 1
            # make new column in sample for 1-hot vector
            ds['%s_onehot' % colname] = col_onehot
            break
You can then save this structure with pickle, e.g. pickle.dump(map, open("cats_map.pkl", "wb")), and use it to compare against categorical column values when making actual predictions later.
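At prediction time, the same structure would be loaded back before encoding the incoming rows (a small sketch; the variable name is kept from the pseudo-code above):
import pickle
with open("cats_map.pkl", "rb") as f:
    map = pickle.load(f)  # same mapping dict as built during training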
** There may be a better way, but I think I would need to better understand this article (https://medium.com/#satnalikamayank12/on-learning-embeddings-for-categorical-data-using-keras-165ff2773fc9). Will update the answer if I find anything.

Doc2vec: Only 10 docvecs in gensim doc2vec model?

I used gensim to fit a doc2vec model, with tagged documents (length > 10) as training data. The goal is to get doc vectors of all training docs, but only 10 vectors can be found in model.docvecs.
Example training data (length > 10):
docs = ['This is a sentence', 'This is another sentence', ....]
with some pre-processing:
doc_=[d.strip().split(" ") for d in doc]
doc_tagged = []
for i in range(len(doc_)):
    tagd = TaggedDocument(doc_[i], str(i))
    doc_tagged.append(tagd)
tagged docs
TaggedDocument(words=array(['a', 'b', 'c', ..., ],
dtype='<U32'), tags='117')
fit a doc2vec model
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(doc_tagged)
model.train(doc_tagged, total_examples= model.corpus_count, epochs= model.iter)
Then I check the final model:
len(model.docvecs)
The result is 10...
I tried other datasets (length > 100, 1000) and got the same result for len(model.docvecs).
So, my question is:
How to use model.docvecs to get full vectors? (without using model.infer_vector)
Is model.docvecs designed to provide all training docvecs?
The bug is in this line:
tagd = TaggedDocument(doc[i],str(i))
Gensim's TaggedDocument accepts a sequence of tags as a second argument. When you pass a string '123', it's turned into ['1', '2', '3'], because it's treated as a sequence. As a result, all of the documents are tagged with just 10 tags ['0', ..., '9'], in various combinations.
Another issue: you're defining doc_ and never actually using it, so your documents will be split incorrectly as well.
Here's the proper solution:
docs = [doc.strip().split(' ') for doc in docs]
tagged_docs = [doc2vec.TaggedDocument(doc, [str(i)]) for i, doc in enumerate(docs)]
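Putting it together, a quick sketch (variable names follow the question; size and iter are the pre-4.0 Gensim parameters used in the original snippet):
from gensim.models import doc2vec

docs = ['This is a sentence', 'This is another sentence']  # ... your full corpus
tokenized = [d.strip().split(' ') for d in docs]
tagged_docs = [doc2vec.TaggedDocument(words, [str(i)]) for i, words in enumerate(tokenized)]

model = doc2vec.Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.iter)
print(len(model.docvecs))  # now equals the number of training documents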

How do I use Conll 2003 corpus in python crfsuite

I have downloaded the CoNLL 2003 corpus ("eng.train"). I want to use it to extract entities via python-crfsuite training, but I don't know how to load this file for training.
I found this example, but it is not for English.
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
Also, in the future I would like to train new entity types other than POS or location. How can I add those?
Also, please suggest how to handle multiple words.
You can use ConllCorpusReader.
Here is a general implementation:
ConllCorpusReader('file path', 'file name', columntypes=['','',''])
Here is the list of column types you can use: 'WORDS', 'POS', 'TREE', 'CHUNK', 'NE', 'SRL', 'IGNORE'
Example:
from nltk.corpus.reader import ConllCorpusReader
train = ConllCorpusReader('CoNLL-2003', 'eng.train', ['words', 'pos', 'ignore', 'chunk'])
test = ConllCorpusReader('CoNLL-2003', 'eng.testa', ['words', 'pos', 'ignore', 'chunk'])
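From there (a small sketch, mirroring the Spanish CoNLL-2002 example in the question), you can pull out the IOB-tagged sentences to feed into python-crfsuite:
train_sents = list(train.iob_sents())
test_sents = list(test.iob_sents())
print(train_sents[0])  # a list of (word, pos, chunk) triples for the first sentence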
