How to load the saved tokenizer from a pretrained model

I fine-tuned a pretrained BERT model in PyTorch using the Hugging Face transformers library. All the training/validation is done on a GPU in the cloud.
At the end of the training, I save the model and tokenizer like below:
best_model.save_pretrained('./saved_model/')
tokenizer.save_pretrained('./saved_model/')
This creates the following files in the saved_model directory:
config.json
added_token.json
special_tokens_map.json
tokenizer_config.json
vocab.txt
pytorch_model.bin
Now, I have downloaded the saved_model directory to my computer and want to load the model and tokenizer. I can load the model like below:
model = torch.load('./saved_model/pytorch_model.bin',map_location=torch.device('cpu'))
But how do I load the tokenizer? I am new to PyTorch and not sure how, because there are multiple files. Perhaps I am not saving the model in the right way?

If you look at the syntax, it is the directory of the pre-trained model that you are supposed to pass. Hence, the correct way to load the tokenizer is:
tokenizer = BertTokenizer.from_pretrained(<Path to the directory containing pretrained model/tokenizer>)
In your case:
tokenizer = BertTokenizer.from_pretrained('./saved_model/')
./saved_model here is the directory where you saved your pretrained model and tokenizer.
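For completeness, here is a minimal sketch of loading both pieces back on a CPU-only machine. BertForSequenceClassification is an assumption; use whichever Bert* class you actually fine-tuned:
from transformers import BertTokenizer, BertForSequenceClassification

# Load tokenizer and model from the directory written by save_pretrained().
# BertForSequenceClassification is an assumption; swap in the class you trained.
tokenizer = BertTokenizer.from_pretrained('./saved_model/')
model = BertForSequenceClassification.from_pretrained('./saved_model/')
model.eval()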

Related

How to fine-tune a model from Hugging Face?

I want to download a pretrained model and fine-tune it with my own data. I have downloaded the bert-large-NER model artifacts from Hugging Face and have listed the contents below. Being new to this, I want to know which files or artifacts I need. From the looks of it, pytorch_model.bin is the trained model, but what are these other files and their purpose, like the tokenizer files and vocab.txt?
config.json
pytorch_model.bin
special_tokens_map.json
tokenizer_config.json
vocab.txt
These files are the metadata of your model and of the tokenizer that you are using (they are the output when you serialize your model). To fine-tune a pre-trained model from the HF Hub you can use PyTorch or TensorFlow directly, or use the Trainer class, where you don't have to write your own custom training code. Ex:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
Reference the official docs here as well for understanding how to fine-tune a pre-trained model end to end: https://huggingface.co/docs/transformers/training.
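As a rough sketch (not the official recipe), the downloaded bert-large-NER artifacts could be wired into that Trainer setup like this; the ./bert-large-NER/ path and the small_train_dataset / small_eval_dataset names are placeholders for your own directory and already-tokenized datasets:
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

# ./bert-large-NER/ is assumed to hold config.json, pytorch_model.bin,
# vocab.txt and the tokenizer files listed above.
tokenizer = AutoTokenizer.from_pretrained('./bert-large-NER/')
model = AutoModelForTokenClassification.from_pretrained('./bert-large-NER/')

training_args = TrainingArguments(output_dir='./results',
                                  num_train_epochs=3,
                                  per_device_train_batch_size=8)

trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=small_train_dataset,  # placeholder dataset
                  eval_dataset=small_eval_dataset)    # placeholder dataset
trainer.train()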

How to load a trained BlazingText model

I have trained a text classification model using BlazingText on AWS SageMaker. I can load the trained model and deploy an inference endpoint:
model = bt_model.deploy(initial_instance_count=1, endpoint_name=endpoint_name, instance_type='ml.m5.xlarge', serializer=JSONSerializer())
payload = {"instances": terms}
response = model.predict(payload)
predictions = json.loads(response)
and it's working fine. Now I need to load the model's bin file using an entry_point in order to run some logic before and after predictions in input_fn and output_fn.
I extracted the bin file from model.tar.gz and I can load it, but I get a segmentation fault when I try to run a prediction:
from gensim.models import FastText
from gensim.models.fasttext import load_facebook_model, load_facebook_vectors
model = FastText.load('model.bin')
model.predict('hello world')
As per the BlazingText documentation:
For both supervised (text classification) and unsupervised (Word2Vec) modes, the binaries (*.bin) produced by BlazingText can be cross-consumed by fastText and vice versa. You can use binaries produced by BlazingText with fastText. Likewise, you can host the model binaries created with fastText using BlazingText.
Here is an example of how to use a model generated with BlazingText with fastText:
# Download the model artifact from S3
aws s3 cp s3://<YOUR_S3_BUCKET>//model.tar.gz model.tar.gz
# Unzip the model archive
tar -xzf model.tar.gz
# Use the model archive with fastText
fasttext predict ./model.bin test.txt
but for some reason it's not working as expected
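One possible explanation, offered as a sketch rather than a confirmed fix: gensim's FastText.load() expects gensim's own save format rather than a Facebook-format .bin, and gensim's FastText object has no predict() method, which would explain the failure. The fasttext package can read a supervised .bin directly:
import fasttext

# Assumes the extracted model.bin is a supervised (text classification) binary
model = fasttext.load_model('model.bin')

# Returns the top-k labels and their probabilities
labels, probabilities = model.predict('hello world', k=1)
print(labels, probabilities)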

How to extract features from a PyTorch pretrained fine-tuned model

I need to extract features from a pretrained (fine-tuned) BERT model.
I fine-tuned a pretrained BERT model in PyTorch using the Hugging Face transformers library. All the training/validation is done on a GPU in the cloud.
At the end of the training, I save the model and tokenizer like below:
best_model.save_pretrained('./saved_model/')
tokenizer.save_pretrained('./saved_model/')
This creates the following files in the saved_model directory:
config.json
added_token.json
special_tokens_map.json
tokenizer_config.json
vocab.txt
pytorch_model.bin
I copy the saved_model directory to my computer and load the model and tokenizer like below:
model = torch.load('./saved_model/pytorch_model.bin',map_location=torch.device('cpu'))
tokenizer = BertTokenizer.from_pretrained('./saved_model/')
Now, to extract features, I do the following:
input_ids = torch.tensor([tokenizer.encode("Here is some text to encode", add_special_tokens=True)])
last_hidden_states = model(input_ids)[0][0]
But the last line throws the error TypeError: 'collections.OrderedDict' object is not callable.
It seems like I am not loading the model properly. Instead of loading the entire model, I think my model = torch.load(....) line is loading an ordered dictionary.
What am I missing here? Am I even saving the model in the right way? Please suggest.
torch.load() returns a collections.OrderedDict object (the model's state dict), not the model itself. Check out the recommended way of saving and loading a model's state dict.
Save:
torch.save(model.state_dict(), PATH)
Load:
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.eval()
So, in your case, it should be:
model = BertModel(config)
state_dict = torch.load('./saved_model/pytorch_model.bin',
                        map_location=torch.device('cpu'))
model.load_state_dict(state_dict)
model.eval()  # to disable dropout
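Alternatively, because the model was written out with save_pretrained(), the Transformers loader can rebuild it in one step and the original feature-extraction snippet then works. A minimal sketch, assuming a plain BertModel (swap in the class you actually fine-tuned):
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('./saved_model/')
model = BertModel.from_pretrained('./saved_model/')
model.eval()  # disable dropout for feature extraction

input_ids = torch.tensor([tokenizer.encode("Here is some text to encode",
                                           add_special_tokens=True)])
with torch.no_grad():
    last_hidden_states = model(input_ids)[0][0]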

Where should I pass pre-trained word embeddings in an encoder-decoder architecture?

I have pre-trained word embeddings for two different languages from MUSE. Now suppose I have an encoder-decoder architecture, and I created an embedding layer from one of these embeddings. But where do I pass it into the model?
The model is translating from one language to another. I have created an embedding_layer. Where do I pass it in the code below?
"""
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
batch_size=batch_size,
epochs=epochs,
validation_split=0.2)
"""
Look at the Keras docs: https://keras.io/getting-started/faq/
If you have the whole model saved, you can load the model using the command
keras.models.load_model(filepath)
This is the code example from the Keras docs:
from keras.models import load_model
model.save('my_model.h5') # creates a HDF5 file 'my_model.h5'
del model # deletes the existing model
# returns a compiled model
# identical to the previous one
model = load_model('my_model.h5')
If you have only the weights, you can use this command:
model.load_weights('my_model_weights.h5')

I applied an Inception model and my model has been saved, but how do I avoid training on the dataset again and again?

I have saved my Inception model in PyCharm using the TensorFlow library. Every time I run the project, it starts training on the dataset. I want to skip the training on every run, because once the model has been saved there is no need to train on the data again and again. How do I know whether my model has been saved successfully? And how can I use the saved model in the same file?
You can save/restore/load your model using TensorFlow:
Save:
builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
with tf.Session(graph=tf.Graph()) as sess:
    ...
    builder.add_meta_graph_and_variables(sess,
                                         [tag_constants.TRAINING],
                                         signature_def_map=foo_signatures,
                                         assets_collection=foo_assets,
                                         strip_default_attrs=True)
...
builder.save()
Load:
with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, [tag_constants.TRAINING], export_dir)
    ...
For further reference: TensorFlow Guide on Saving a Model
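To avoid retraining on every run, one option (a sketch, assuming the TF 1.x-style APIs used above; export_dir is a placeholder path) is to check for an exported SavedModel on disk before deciding whether to train:
import os
import tensorflow as tf

export_dir = './saved_inception_model'  # placeholder path

if os.path.exists(os.path.join(export_dir, 'saved_model.pb')):
    # A model was already exported on a previous run: load it instead of retraining
    with tf.Session(graph=tf.Graph()) as sess:
        tf.saved_model.loader.load(sess,
                                   [tf.saved_model.tag_constants.TRAINING],
                                   export_dir)
        # ... run inference with the restored graph here ...
else:
    # No saved model yet: run training, then export with SavedModelBuilder as above
    pass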
Actually, once you have saved your model, some files will be written to your directory with extensions such as .yaml, .h5, or .meta (for the graph). You can check the model's accuracy by restoring it from the saved files, just as a sanity check.
There are nice tutorials on this:
https://www.tensorflow.org/guide/saved_model
http://cv-tricks.com/tensorflow-tutorial/save-restore-tensorflow-models-quick-complete-tutorial/
If you use the Keras API to build your model, then this link will be useful for saving and restoring: https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model
