Dynamic name formatting with epoch number for .tsv file creation - machine-learning

I've made this custom callback in Keras to obtain the embedding vectors at the end of each epoch. It should save the vectors to a .tsv file, which it does, but since the file name never changes between epochs, every epoch overwrites the same file. By the end of training I only have the vectors from the last epoch.
I need a way to add the epoch number to the .tsv file name, but I have no idea how to do it.
The code:
import io

encoder = info.features['text'].encoder

class CustomCallback(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
        vec = model.layers[0].get_weights()[0]
        out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        out_v.close()
Any suggestions would be much appreciated. Thanks a lot!

You can do that by changing the following line:
out_v = io.open('vecs_{}.tsv'.format(epoch), 'w', encoding='utf-8')
The file names will be vecs_0.tsv, vecs_1.tsv, and so on (Keras passes a zero-based epoch index to on_epoch_end).
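For reference, here is a minimal sketch of the full callback with the per-epoch file name. It assumes a tensorflow.keras setup, uses self.model (which Keras sets on every callback) instead of the global model, and writes one embedding vector per line, which is an assumption about the intended TSV layout:
import io
from tensorflow import keras

class CustomCallback(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # epoch is zero-based, so the first file written is vecs_0.tsv
        filename = 'vecs_{}.tsv'.format(epoch)
        weights = self.model.layers[0].get_weights()[0]  # embedding matrix of the first layer
        with io.open(filename, 'w', encoding='utf-8') as out_v:
            # one vector per line, values separated by tabs
            for vec in weights:
                out_v.write('\t'.join(str(x) for x in vec) + '\n')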

Related

How to save sentence-Bert output vectors to a file?

I am using BERT to get the similarity between multi-term words. Here is the code I used for the embedding:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-large-uncased-whole-word-masking')
words = [
"Artificial intelligence",
"Data mining",
"Political history",
"Literature book"]
I also have a dataset which contains 540000 other words.
Vocabs = [
"Winter flooding",
"Cholesterol diet", ....]
The problem is that when I embed Vocabs into vectors, it takes forever.
words_embeddings = model.encode(words)
Vocabs_embeddings = model.encode(Vocabs)
Is there any way to make it faster? Alternatively, I could embed Vocabs in a loop and save the output vectors to a file so I don't have to embed 540,000 vocabs every time I need them. Is there a way to save the embeddings to a file and use them again?
I would really appreciate your time and help.
You can pickle your corpus and embeddings like this; you can also pickle a dictionary instead, or write them to a file in any other format you prefer.
import pickle

with open("my-embeddings.pkl", "wb") as fOut:
    pickle.dump({'sentences': words, 'embeddings': word_embeddings}, fOut)
Or, more generally, like below: you encode only when the embeddings don't exist yet, and after that, whenever you need them, you load them from the file instead of re-encoding your corpus:
if not os.path.exists(embedding_cache_path):
    # read your corpus etc
    corpus_sentences = ...
    print("Encoding the corpus. This might take a while")
    corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_numpy=True)
    corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    print("Storing file on disc")
    with open(embedding_cache_path, "wb") as fOut:
        pickle.dump({'sentences': corpus_sentences, 'embeddings': corpus_embeddings}, fOut)
else:
    print("Loading pre-computed embeddings from disc")
    with open(embedding_cache_path, "rb") as fIn:
        cache_data = pickle.load(fIn)
        corpus_sentences = cache_data['sentences']
        corpus_embeddings = cache_data['embeddings']
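As for making the encoding itself faster: this is not from the original answer, but a common approach is to batch the encode() calls and run the model on a GPU when one is available. A hedged sketch, reusing the model name and the Vocabs list from the question (the batch size of 64 is just an illustrative value):
import torch
from sentence_transformers import SentenceTransformer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('bert-large-uncased-whole-word-masking', device=device)

# Larger batches amortise the per-call overhead; tune batch_size to your GPU memory.
Vocabs_embeddings = model.encode(Vocabs, batch_size=64,
                                 show_progress_bar=True, convert_to_numpy=True)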

Using torchtext for inference

I wonder what is the right way to use torchtext for inference.
Let's assume I've trained the model and dumped all the Fields with their built vocabularies. It seems the next step is to use torchtext.data.Example to load one single example. Somehow I should numericalize it using the loaded Fields and create an Iterator.
I would appreciate any simple examples of using torchtext for inference.
For a trained model and vocabulary (the vocabulary is part of the text field, so you don't have to save the whole class):
def read_vocab(path):
    # read vocabulary pkl
    import pickle
    pkl_file = open(path, 'rb')
    vocab = pickle.load(pkl_file)
    pkl_file.close()
    return vocab

def load_model_and_vocab():
    import torch
    import os.path
    my_path = os.path.abspath(os.path.dirname(__file__))
    vocab_path = os.path.join(my_path, vocab_file)
    weights_path = os.path.join(my_path, WEIGHTS)
    vocab = read_vocab(vocab_path)
    model = classifier(vocab_size=len(vocab))
    model.load_state_dict(torch.load(weights_path))
    model.eval()
    return model, vocab

def predict(model, vocab, sentence):
    tokenized = [w.text.lower() for w in nlp(sentence)]  # tokenize the sentence
    indexed = [vocab.stoi[t] for t in tokenized]          # convert to integer sequence
    length = [len(indexed)]                               # compute no. of words
    tensor = torch.LongTensor(indexed).to('cpu')          # convert to tensor
    tensor = tensor.unsqueeze(1).T                        # reshape in form of batch, no. of words
    length_tensor = torch.LongTensor(length)              # convert to tensor
    prediction = model(tensor, length_tensor)             # prediction
    return round(1 - prediction.item())
"classifier" is the class I defined for my model.
For saving the vocabulary pkl:
def save_vocab(vocab):
    import pickle
    output = open('vocab.pkl', 'wb')
    pickle.dump(vocab, output)
    output.close()
And for saving the model after training you can use:
torch.save(model.state_dict(), 'saved_weights.pt')
Tell me if it worked for you!
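Not part of the original answer, but a hypothetical usage sketch tying the pieces together. It assumes nlp is a loaded spaCy pipeline (predict() tokenizes with it) and that vocab_file, WEIGHTS, and the classifier class are defined as in the answer:
import spacy

nlp = spacy.load('en_core_web_sm')  # tokenizer used inside predict()

model, vocab = load_model_and_vocab()
print(predict(model, vocab, "This film was surprisingly good"))  # e.g. 1 or 0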

Keras Text Preprocessing - Saving Tokenizer object to file for scoring

I've trained a sentiment classifier model using the Keras library by following the steps below (broadly):
Convert Text corpus into sequences using Tokenizer object/class
Build a model using the model.fit() method
Evaluate this model
Now, for scoring with this model, I was able to save the model to a file and load it back. However, I've not found a way to save the Tokenizer object to a file. Without this, I'll have to process the corpus every time I need to score even a single sentence. Is there a way around this?
The most common way is to use either pickle or joblib. Here is an example of how to use pickle to save the Tokenizer:
import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
The Tokenizer class has a function to save the data in JSON format:
tokenizer_json = tokenizer.to_json()
with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))
The data can be loaded using tokenizer_from_json function from keras_preprocessing.text:
with open('tokenizer.json') as f:
    data = json.load(f)
    tokenizer = tokenizer_from_json(data)
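A quick round-trip check, not from the original answers: save the tokenizer to JSON, reload it, and confirm it produces the same sequences. Here texts is a placeholder for a list of your own documents and tokenizer is the fitted tokenizer from above:
import io
import json
from keras_preprocessing.text import tokenizer_from_json

with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer.to_json(), ensure_ascii=False))
with open('tokenizer.json') as f:
    restored = tokenizer_from_json(json.load(f))

# The restored tokenizer should encode texts exactly like the original one.
assert tokenizer.texts_to_sequences(texts) == restored.texts_to_sequences(texts)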
The accepted answer clearly demonstrates how to save the tokenizer. The following is a comment on the problem of (generally) scoring after fitting or saving. Suppose that a list texts is composed of two lists Train_text and Test_text, where the set of tokens in Test_text is a subset of the set of tokens in Train_text (an optimistic assumption). Then fit_on_texts(Train_text) gives different results for texts_to_sequences(Test_text) than first calling fit_on_texts(texts) and then texts_to_sequences(Test_text).
Concrete Example:
from keras.preprocessing.text import Tokenizer

docs = ["A heart that",
        "full up like",
        "a landfill",
        "no surprises",
        "and no alarms"
        "a job that slowly"
        "Bruises that",
        "You look so",
        "tired happy",
        "no alarms",
        "and no surprises"]
docs_train = docs[:7]
docs_test = docs[7:]

# EXPERIMENT 1: FIT TOKENIZER ONLY ON TRAIN
T_1 = Tokenizer()
T_1.fit_on_texts(docs_train)  # only train set
encoded_train_1 = T_1.texts_to_sequences(docs_train)
encoded_test_1 = T_1.texts_to_sequences(docs_test)
print("result for test 1:\n%s" % (encoded_test_1,))

# EXPERIMENT 2: FIT TOKENIZER ON BOTH TRAIN + TEST
T_2 = Tokenizer()
T_2.fit_on_texts(docs)  # both train and test set
encoded_train_2 = T_2.texts_to_sequences(docs_train)
encoded_test_2 = T_2.texts_to_sequences(docs_test)
print("result for test 2:\n%s" % (encoded_test_2,))
Results:
result for test 1:
[[3], [10, 3, 9]]
result for test 2:
[[1, 19], [5, 1, 4]]
Of course, if the above optimistic assumption is not satisfied and the set of tokens in Test_text is disjoint from that of Train_text, then test 1 results in a list of empty lists [].
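One common mitigation for unseen test tokens, not covered in the answer above but supported by the same Tokenizer class, is to reserve an out-of-vocabulary token so that unknown words map to a dedicated index instead of being silently dropped. A minimal sketch reusing docs_train and docs_test from the example:
from keras.preprocessing.text import Tokenizer

T_3 = Tokenizer(oov_token='<unk>')   # unseen words map to the <unk> index
T_3.fit_on_texts(docs_train)         # still fit on the training docs only
encoded_test_3 = T_3.texts_to_sequences(docs_test)
print(encoded_test_3)                # no empty lists; unknown words become <unk>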
I've created the issue https://github.com/keras-team/keras/issues/9289 in the Keras repo. Until the API is changed, the issue links to a gist with code that demonstrates how to save and restore a tokenizer without having the original documents the tokenizer was fit on. I prefer to store all my model information in a JSON file (because reasons, but mainly a mixed JS/Python environment), and this allows for that, even with sort_keys=True.
I found the following snippet, provided by #thusv89.
Save objects:
import pickle

with open('data_objects.pickle', 'wb') as handle:
    pickle.dump(
        {'input_tensor': input_tensor,
         'target_tensor': target_tensor,
         'inp_lang': inp_lang,
         'targ_lang': targ_lang,
         }, handle, protocol=pickle.HIGHEST_PROTOCOL)
Load objects:
with open("dataset_fr_en.pickle", 'rb') as f:
data = pickle.load(f)
input_tensor = data['input_tensor']
target_tensor = data['target_tensor']
inp_lang = data['inp_lang']
targ_lang = data['targ_lang']
Quite easy, because the Tokenizer class provides two functions for saving and loading:
save: Tokenizer.to_json()
load: keras.preprocessing.text.tokenizer_from_json
The to_json() method calls the get_config() method, which handles this:
json_word_counts = json.dumps(self.word_counts)
json_word_docs = json.dumps(self.word_docs)
json_index_docs = json.dumps(self.index_docs)
json_word_index = json.dumps(self.word_index)
json_index_word = json.dumps(self.index_word)
return {
    'num_words': self.num_words,
    'filters': self.filters,
    'lower': self.lower,
    'split': self.split,
    'char_level': self.char_level,
    'oov_token': self.oov_token,
    'document_count': self.document_count,
    'word_counts': json_word_counts,
    'word_docs': json_word_docs,
    'index_docs': json_index_docs,
    'index_word': json_index_word,
    'word_index': json_word_index
}

Why is the following partial fit not working properly?

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
Hello, I have the following list of comments:
comments = ['I am very agry','this is not interesting','I am very happy']
These are the corresponding labels:
sents = ['angry','indiferent','happy']
I am using tfidf to vectorize these comments as follows:
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
from sklearn import preprocessing
I am using a label encoder to encode the labels:
le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)
print(labels.shape)
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle', 'wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle', 'wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)
Here I am using passive aggressive to fit the model:
clf2 = PassiveAggressiveClassifier()
with open('passive.pickle', 'wb') as idxf:
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)
with open('passive.pickle', 'rb') as infile:
    clf2 = pickle.load(infile)
with open('tfidf_vectorizer.pickle', 'rb') as infile:
    tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
    tfidf = pickle.load(infile)
Here I am trying to test the usage of partial fit as follows with three new comments and their corresponding labels:
new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]
vec_new_comments = tfidf_vectorizer.transform(new_comments)
print(clf2.predict(vec_new_comments))
clf2.partial_fit(vec_new_comments, new_labels)
The problem is that I am not getting the right results after the partial fit as follows:
print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))
However, I am getting this output:
[2 2 2]
So I would really appreciate help in finding out why the model is not being updated, even though I am testing it with the same examples it was trained on. The desired output should be:
[1,0,2]
I would also appreciate help with adjusting the hyperparameters, if needed, to get the desired output.
This is the complete code, showing the partial fit:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import sys
import pickle
from sklearn.metrics.pairwise import cosine_similarity
import random
comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
#print(tfidf.shape)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle', 'wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle', 'wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)
clf2 = PassiveAggressiveClassifier()
clf2.fit(tfidf, labels)
with open('passive.pickle', 'wb') as idxf:
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)
with open('passive.pickle', 'rb') as infile:
    clf2 = pickle.load(infile)
with open('tfidf_vectorizer.pickle', 'rb') as infile:
    tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
    tfidf = pickle.load(infile)
new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]
vec_new_comments = tfidf_vectorizer.transform(new_comments)
clf2.partial_fit(vec_new_comments, new_labels)
print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))
However, I got:
AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??
[2 2 2]
Well, there are multiple problems with your code. I will start with the obvious ones and move on to the more complex ones:
You are pickling clf2 before it has learnt anything (i.e. you pickle it as soon as it is defined, so it doesn't serve any purpose). If you are only testing, that's fine. Otherwise it should be pickled after the fit() or equivalent calls.
You are calling clf2.fit() before clf2.partial_fit(). This defeats the whole purpose of partial_fit(). When you call fit(), you essentially fix the classes (labels) that the model will learn about. In your case it is acceptable, because in your subsequent call to partial_fit() you are giving the same labels, but it is still not good practice.
See this for more details
In a partial_fit() scenario, don't ever call fit(). Always call partial_fit() with your starting data and the newly arriving data, and make sure that you supply all the labels you want the model to learn in the first call to partial_fit(), via the classes parameter.
Now the last part, about your tfidf_vectorizer. You call fit_transform() (which is essentially fit() and then transform() combined) on tfidf_vectorizer with the comments array. That means that on subsequent calls to transform() (as you did with transform(new_comments)), it will not learn new words from new_comments, but will only use the words it saw during the call to fit() (the words present in comments).
The same goes for LabelEncoder and sents.
Again, this is not preferable in an online learning scenario. You should fit all the available data at once. But since you are trying to use partial_fit(), we assume that you have a very large dataset which may not fit in memory at once. So you would want to apply some sort of partial_fit to the TfidfVectorizer as well, but TfidfVectorizer doesn't support partial_fit(); in fact, it's not made for large data, so you need to change your approach (see the stateless-vectorizer sketch after the links below). See the following questions for more details:
Updating the feature names into scikit TFIdfVectorizer
How can i reduce memory usage of Scikit-Learn Vectorizers?
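One stateless alternative, not part of the original answer, is scikit-learn's HashingVectorizer: it keeps no vocabulary, so transform() never needs a prior fit() and it fits naturally into a partial_fit() pipeline. A hedged sketch reusing comments and new_comments from the question; the integer labels correspond to the LabelEncoder output used here ([0, 2, 1] and [1, 0, 2]):
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# HashingVectorizer is stateless: no vocabulary to fit, so new words are handled automatically.
hasher = HashingVectorizer(analyzer='word', n_features=2**18)

clf = PassiveAggressiveClassifier()
all_classes = [0, 1, 2]                  # every label the model will ever see

X_old = hasher.transform(comments)       # first batch
clf.partial_fit(X_old, [0, 2, 1], classes=all_classes)

X_new = hasher.transform(new_comments)   # later batch, no re-fitting needed
clf.partial_fit(X_new, [1, 0, 2])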
All that aside, if you just change the tfidf part to fit on the whole data (comments and new_comments at once), you will get your desired results.
See the code changes below (I may have reorganized it a bit and renamed vec_new_comments to new_tfidf; please go through it carefully):
comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']
new_comments = ['I love the life','I hate you','this is not important']
new_sents = ['happy','angry','indiferent']
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
le = preprocessing.LabelEncoder()
# The below lines are important
# I have given the whole data to fit in tfidf_vectorizer
tfidf_vectorizer.fit(comments + new_comments)
# same for `sents`: since the labels don't change, it doesn't matter which you use, because the result will be the same
# le.fit(sents)
le.fit(sents + new_sents)
Below is the not-so-preferred code (which you are using, and which I talked about in point 2), but the results are good as long as you make the above changes.
tfidf = tfidf_vectorizer.transform(comments)
labels = le.transform(sents)
clf2.fit(tfidf, labels)
print(clf2.predict(tfidf))
# [0 2 1]
new_tfidf = tfidf_vectorizer.transform(new_comments)
new_labels = le.transform(new_sents)
clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2] As you wanted
Correct approach, or the way partial_fit() is intended to be used:
# Declare all labels that you want the model to learn
# Using classes learnt by labelEncoder for this
# In any calls to `partial_fit()`, all labels should be from this array only
all_classes = le.transform(le.classes_)
# Notice the parameter classes here
# It needs to be present the first time
clf2.partial_fit(tfidf, labels, classes=all_classes)
print(clf2.predict(tfidf))
# [0 2 1]
# classes is not present here
clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2]

How to create feature columns for TensorFlow classifier

I have a very simple dataset for binary classification in a csv file which looks like this:
"feature1","feature2","label"
1,0,1
0,1,0
...
where the "label" column indicates class (1 is positive, 0 is negative). The number of features is actually pretty big but it doesn't matter for that question.
Here is how I read the data:
train = pandas.read_csv(TRAINING_FILE)
y_train, X_train = train['label'], train[['feature1', 'feature2']].fillna(0)
test = pandas.read_csv(TEST_FILE)
y_test, X_test = test['label'], test[['feature1', 'feature2']].fillna(0)
I want to run tensorflow.contrib.learn.LinearClassifier and tensorflow.contrib.learn.DNNClassifier on that data. For instance, I initialize DNN like this:
classifier = DNNClassifier(hidden_units=[3, 5, 3],
                           n_classes=2,
                           feature_columns=feature_columns,  # ???
                           activation_fn=nn.relu,
                           enable_centered_bias=False,
                           model_dir=MODEL_DIR_DNN)
So how exactly should I create the feature_columns when all the features are also binary (0 or 1 are the only possible values)?
Here is the model training:
classifier.fit(X_train.values,
               y_train.values,
               batch_size=dnn_batch_size,
               steps=dnn_steps)
A solution that replaces the fit() parameters with an input function would also be great.
Thanks!
P.S. I'm using TensorFlow version 1.0.1
You can directly use tf.feature_column.numeric_column:
feature_columns = [tf.feature_column.numeric_column(key = key) for key in X_train.columns]
I've just found the solution and it's pretty simple:
feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
Apparently infer_real_valued_columns_from_input() works well with categorical variables.
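As for the input-function route mentioned in the question: the following is a rough sketch using the tf.estimator API from later TensorFlow 1.x releases (not tested against 1.0.1, where the corresponding functions live under tf.contrib.learn). It reuses X_train, y_train, MODEL_DIR_DNN, dnn_batch_size, and dnn_steps from the question:
import tensorflow as tf

feature_columns = [tf.feature_column.numeric_column(key) for key in X_train.columns]

classifier = tf.estimator.DNNClassifier(hidden_units=[3, 5, 3],
                                        n_classes=2,
                                        feature_columns=feature_columns,
                                        model_dir=MODEL_DIR_DNN)

# pandas_input_fn feeds the DataFrame in batches; num_epochs=None cycles until `steps` is reached.
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=X_train, y=y_train, batch_size=dnn_batch_size, num_epochs=None, shuffle=True)

classifier.train(input_fn=train_input_fn, steps=dnn_steps)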
