I want to perform a text generation task in a Flask app and host it on a web server. However, when downloading the GPT models, the Elastic Beanstalk-managed EC2 instance crashes because the download takes too much time and memory:
from transformers.tokenization_openai import OpenAIGPTTokenizer
from transformers.modeling_tf_openai import TFOpenAIGPTLMHeadModel
model = TFOpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
These are the lines causing the issue. GPT is approximately 445 MB, and I am using the transformers library. Instead of downloading the model at this line, I was wondering if I could pickle the model and then bundle it as part of the repository. Is that possible with this library? Otherwise, how can I preload this model to avoid the issues I am having?
Approach 1:
Search for the model here: https://huggingface.co/models
Download the model from these links:
pytorch-model: https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-pytorch_model.bin
tensorflow-model: https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-tf_model.h5
The config file: https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-config.json
Source: https://huggingface.co/transformers/_modules/transformers/configuration_openai.html#OpenAIGPTConfig
You can manually download the model (in your case, the TensorFlow model .h5 and the config.json file) and put it in a folder (let's say model) in the repository. (You can try compressing the model, and then decompressing it once it's on the EC2 instance, if needed.)
Then you can load the model in your web server directly from that path instead of downloading it (the model folder contains the .h5 and config.json):
model = TFOpenAIGPTLMHeadModel.from_pretrained("model")
# model folder contains .h5 and config.json
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
# this is a light download
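If you'd rather script the manual download, here is a minimal sketch using requests, with the S3 URLs listed above; note that from_pretrained expects the local files to be named tf_model.h5 and config.json inside the folder:
import os
import requests

# S3 URLs from the links above; rename the files on disk so that
# from_pretrained finds tf_model.h5 and config.json in the folder
FILES = {
    "tf_model.h5": "https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-tf_model.h5",
    "config.json": "https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-config.json",
}

os.makedirs("model", exist_ok=True)
for filename, url in FILES.items():
    response = requests.get(url, stream=True)
    response.raise_for_status()
    with open(os.path.join("model", filename), "wb") as out:
        for chunk in response.iter_content(chunk_size=1 << 20):  # 1 MB chunks
            out.write(chunk)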
Approach 2:
Instead of downloading from the links, you can download the model on your local machine using the conventional method.
from transformers.tokenization_openai import OpenAIGPTTokenizer
from transformers.modeling_tf_openai import TFOpenAIGPTLMHeadModel
model = TFOpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
This downloads the model. Now you can save the weights to a folder using the save_pretrained function.
model.save_pretrained('/content/') # saving inside content folder
Now, the content folder should contain a .h5 file and a config.json.
Just upload them to the repository and load from there.
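Putting the two halves together, here is a minimal sketch of the full round trip (the model/ folder name is just a placeholder). The tokenizer has the same save_pretrained method, so you can bundle it as well and skip its download too:
from transformers.tokenization_openai import OpenAIGPTTokenizer
from transformers.modeling_tf_openai import TFOpenAIGPTLMHeadModel

# One-time, on your local machine: download, then save to a folder
model = TFOpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model.save_pretrained("model/")
tokenizer.save_pretrained("model/")

# In the Flask app: load from the bundled folder, no download involved
model = TFOpenAIGPTLMHeadModel.from_pretrained("model/")
tokenizer = OpenAIGPTTokenizer.from_pretrained("model/")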
Open https://huggingface.co/models and search for the model you want. Click on the model name and finally click on "List all files in model". You will get a list of the files you can download.
Related
I have an existing model that was trained on Azure. I want to fully integrate and start using the model on Databricks. What's the best way to do this? How can I successfully load the model into the Databricks model workflow? I have the model in a pickle file.
I have read almost all the documentation on Databricks, but 99% of it is about new models trained on Databricks and never about importing existing models.
Since MLflow has a standardized model storage format, you just need to bring over the model files and start using them with the MLflow package. In addition, you can register the model in the workspace's model registry using mlflow.register_model() and then use it from there. These would be the steps:
On the AzureML side, I assume that you have an MLflow model saved to disk (using mlflow.sklearn.save_model() or mlflow.sklearn.autolog(), or some other mlflow.<flavor>). That should give you a folder that contains an MLmodel file and, depending on the flavor of the model, a few more files, like the below:
mlflow-model
├── MLmodel
├── conda.yaml
├── model.pkl
└── requirements.txt
Note: You can download the model from the AzureML Workspace using the v2 CLI like so: az ml model download --name <model_name> --version <model_version>
Open a Databricks Notebook and make sure it has mlflow installed
%pip install mlflow
Upload the MLflow model files to the DBFS connected to the cluster.
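For example, if the files already sit on the driver's local disk, a quick sketch of the copy into DBFS (both paths are placeholders):
# Copy the model folder from the driver's local filesystem into DBFS;
# the third argument enables a recursive copy
dbutils.fs.cp("file:/tmp/mlflow-model/", "dbfs:/FileStore/shared_uploads/mlflow-model/", True)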
In the Notebook, register the model using MLflow (adjust the dbfs: path to the location the model was uploaded to).
import mlflow
model_version = mlflow.register_model("dbfs:/FileStore/shared_uploads/mlflow-model/", "AzureMLModel")
Now your model is registered in the Workspace's model registry like any model that was created from a Databricks session. So, you can access it from the registry like so:
model = mlflow.pyfunc.load_model(f"models:/AzureMLModel/{model_version.version}")
input_example = {
"sepal_length": [5.1,4.8],
"sepal_width": [3.5,4.4],
"petal_length": [1.4,2.0],
"petal_width": [0.2,0.1]
}
model.predict(input_example)
Or use the model as a spark_udf:
import pandas as pd
model_udf = mlflow.pyfunc.spark_udf(spark=spark, model_uri=f"models:/AzureMLModel/{model_version.version}", result_type='string' )
spark_df = spark.createDataFrame(pd.DataFrame(input_example))
spark_df = spark_df.withColumn('foo', model_udf())
display(spark_df)
Note that I am using mlflow.pyfunc to load the model, since every MLflow model needs to support the pyfunc flavor. That way, you don't need to worry about the native flavor of the model.
If your source model is already in an MLflow tracking server, you can use mlflow-export-import:
https://github.com/mlflow/mlflow-export-import
If your source model was not trained with MLflow, see:
How do I create an MLflow run from a model I have trained elsewhere?
https://github.com/amesar/mlflow-resources/blob/master/MLflow_FAQ.md#how-do-i-create-an-mlflow-run-from-a-model-i-have-trained-elsewhere
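For that second case, here is a minimal sketch (the pickle path, the sklearn flavor, and the registry name are assumptions): load your existing model, log it inside a fresh MLflow run, and register that run's artifact.
import pickle

import mlflow
import mlflow.sklearn

# Load the model that was trained outside of MLflow (path is a placeholder)
with open("model.pkl", "rb") as f:
    sk_model = pickle.load(f)

# Log it into a new MLflow run so it gets the standard storage format
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(sk_model, artifact_path="model")

# Optionally register the logged model in the workspace's model registry
mlflow.register_model(f"runs:/{run.info.run_id}/model", "AzureMLModel")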
I have a project that uses GeoDjango to store GPS routes. The geometry is stored in a GeometryField. This works great when data is imported with geospatial information, but it is frustrating when I have a model which needs user-supplied data. I would like to have a widget in the Admin that will let me upload a file, and then use that file to essentially import the geospatial information.
FileField doesn't seem appropriate, since I don't want the file stored on the file system. I want it processed and stored in the geospatial DB field so I can run geospatial functions on the data.
Ideally the admin interface would contain a file upload widget and the geospatial field, shown with the typical map.
There are a couple of options for importing geo data files into the DB.
If you want to use a zipped shapefile, GeoDjango comes with a nice solution: LayerMapping.
Before importing the file, you should implement the workflow for uploading the zip file with a form, checking the required extensions ([".shp", ".shx", ".dbf", ".prj"]), and saving the files for reading (a sketch of this step follows the example below).
Then you have to define a mapping to match field names across the file and Django model.
After you complete these steps, you can save the geometries into the DB with:
from django.contrib.gis.utils import LayerMapping
layer = uploaded_and_extracted_file
mapping = {"id": "district", "name": "dis_name", "area": "shape_area", "geom": "MULTIPOLYGON"}
lm = LayerMapping(ModelName, layer, mapping, transform=True, encoding="utf-8")
lm.save(verbose=True, strict=True, silent=True)
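As for the upload step mentioned earlier, here is a minimal sketch (the form, the extension check, and the temp-dir handling are my assumptions, not GeoDjango APIs). You could pass the path returned by extract_shapefile as the layer argument to LayerMapping above:
import os
import tempfile
import zipfile

from django import forms

REQUIRED_EXTENSIONS = [".shp", ".shx", ".dbf", ".prj"]

class ShapefileUploadForm(forms.Form):
    zip_file = forms.FileField()

    def clean_zip_file(self):
        """Reject archives missing any of the required shapefile parts."""
        uploaded = self.cleaned_data["zip_file"]
        with zipfile.ZipFile(uploaded) as archive:
            extensions = {os.path.splitext(name)[1].lower() for name in archive.namelist()}
        missing = [ext for ext in REQUIRED_EXTENSIONS if ext not in extensions]
        if missing:
            raise forms.ValidationError("Missing required files: %s" % ", ".join(missing))
        uploaded.seek(0)
        return uploaded

def extract_shapefile(uploaded):
    """Extract the archive to a temp dir and return the .shp path for LayerMapping."""
    tmp_dir = tempfile.mkdtemp()
    with zipfile.ZipFile(uploaded) as archive:
        archive.extractall(tmp_dir)
    shp_name = next(n for n in os.listdir(tmp_dir) if n.lower().endswith(".shp"))
    return os.path.join(tmp_dir, shp_name)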
I am receiving the following error when trying to load a model that was downloaded from GitHub:
SavedModel file does not exist at: modelname/{saved_model.pbtxt|saved_model.pb}.
I used tf.keras.models.load_model but I believe that is used if the model was already saved. Are there methods to load in an external model that was not previously saved?
As the error says, it is not able to find the model to load. You can try changing the model name if you are saving it again. Also, make sure you follow the proper folder structure and methods as mentioned in this document.
You need to provide the complete path (including the file name) of your model to the function.
my_model = keras.models.load_model("path/to/my_h5_model.h5")
This function supports both strings and Python Path objects.
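In other words, what you pass depends on the format of the files you downloaded; a quick sketch of both cases (the paths are placeholders):
import tensorflow as tf

# SavedModel format: pass the directory that contains saved_model.pb
model = tf.keras.models.load_model("modelname")

# HDF5 format: pass the full path to the .h5 file itself
model = tf.keras.models.load_model("path/to/my_h5_model.h5")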
I would like to extend the GitDiffMargin add-in so that when a user is modifying a file in Visual Studio, she can see the updated diff in the margin even without saving the file.
Is it possible with libgit2sharp to do such a diff from a Tree and another Tree which I would have to build myself?
As far as I understand it, this question can be split into three sub-questions:
How to diff two Trees
How to build a new Tree by modifying one file (Blob) from an existing Tree
How to create a Blob from the content of a file that hasn't been previously saved to disk.
How to diff two Trees:
API: repo.Diff.Compare<T>(Tree, Tree)
Tests: DiffTreeToTreeFixture.cs
How to build a new Tree by modifying one file (Blob) from an existing Tree:
API: TreeDefinition.From(Tree), TreeDefinition.Add(string, Blob, Mode) and repo.ObjectDatabase.CreateTree(TreeDefinition)
Tests: TreeDefinitionFixture.cs and ObjectDatabaseFixture.cs
How to create a Blob from the content of a file that hasn't been previously saved to disk:
API: repo.ObjectDatabase.CreateBlob(Stream, string)
Tests: ObjectDatabaseFixture.cs
I'm building an ASP.Net MVC4 application and the customer wants to be able to supply an XML configuration file, to configure a vendor list in the application, something like this:
<Vendors>
<Vendor name="ABC Computers" deliveryDays="10"/>
<Vendor name="XYZ Computers" deliveryDays="15"/>
</Vendors>
The file needs to be dropped onto a network location (i.e. not on the web server) and I don't have a database to import and store the data.
The customer also wants the ability to update it daily. So I'm thinking I'll have to do some kind of import (and validate the file) when the application starts up.
Any good ideas on the best way to accomplish this?
- The data needs to be quickly accessible
- Ideally I just want to import/store it once, or be able to access it quickly
- I need to be able to validate the file, so it might be prudent to be able to switch to a backup
One thought was to use something like Entity Framework and simply read the file whenever I needed it, but if possible I'd hold it in memory in the application.
No need to import it into a database or use Entity Framework. You can simply use .NET XML serialization to accomplish this.
The command-line tool xsd.exe will generate C# classes from your XML file. From the command line:
xsd.exe myfile.xml
xsd.exe /c myfile.xsd
The first command infers and creates an XML schema file (myfile.xsd) from your XML. The second command converts the schema file to C# classes.
Then use the XmlSerializer class to deserialize your XML file into objects (assuming multiple objects in one file):
MyCollection myObjects = null;
string path = "mydata.xml";
XmlSerializer serializer = new XmlSerializer(typeof(MyCollection));
using (StreamReader reader = new StreamReader(path))
{
    myObjects = (MyCollection)serializer.Deserialize(reader);
}
You can use the .xsd file generated above to validate your xml files. Here's a link showing how: http://msdn.microsoft.com/en-us/library/ms162371.aspx.