Does Vespa support binary classifier model serving at feed time? - machine-learning

We want to classify documents as they are fed to Vespa using the API, and write those classification scores to document fields.
I'm not sure if it's possible to simply add the ONNX model into Vespa's application package directory, have that model classify any text fed to Vespa, and then write those classifications as document fields.
Does Vespa support model serving at feed time in this way?

Yes, you can do that with Vespa.
You need a custom document processor that reads the input, invokes the model, and stores the model's output in a new field. Vespa's stateless model evaluation support lets the processor call the model.
The DimensionReductionDocProc example is a good starting point: it is a document processor that uses the stateless model evaluation support to perform dimensionality reduction of a vector.
Then export your classifier model to ONNX format and put it in the models folder of the application package. If you wrap the inference in a component that can be shared between search and docproc, and call it Classifier, the services.xml of the container cluster looks something like this, including the ClassifyDocProc:
<container id='default' version='1.0'>
    <nodes count='1'/>
    <component id='ai.vespa.examples.Classifier'/>
    <model-evaluation>
        <onnx>
            <models>
                <model name="classifier">
                    <intraop-threads>1</intraop-threads>
                </model>
            </models>
        </onnx>
    </model-evaluation>
    <search/>
    <document-api/>
    <document-processing>
        <chain id='classifier' inherits='indexing'>
            <documentprocessor id='ai.vespa.examples.docproc.ClassifyDocProc'/>
        </chain>
    </document-processing>
</container>

Related

How to fine tune a model from hugging face?

I want to download a pretrained model and fine-tune it with my own data. I have downloaded the bert-large-NER model artifacts from Hugging Face and listed the contents below. Being new to this, I want to know which files or artifacts I need. From the looks of it, pytorch_model.bin is the trained model, but what are the other files, such as the tokenizer files and vocab.txt, and what is their purpose?
config.json
pytorch_model.bin
special_tokens_map.json
tokenizer_config.json
vocab.txt
pytorch_model.bin holds the trained weights; the other files are the configuration and metadata of your model and of the tokenizer you are using (this is the output you get when you serialize them). To fine-tune a pre-trained model from the HF Hub, you can use PyTorch or TF directly, or the Trainer class, which saves you from writing your own custom training code. For example:
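from transformers import Trainer

# model, training_args, the two datasets, and compute_metrics are assumed to be defined earlier
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)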
Reference the official docs here as well for understanding how to tune a pre-trained model end to end: https://huggingface.co/docs/transformers/training.

How to export/save/load the actual AutoKeras "super" model, not the underlying tensorflow model

Is there a way to export/save/load a previously trained autokeras model? I understand I can use the following code to save/load the underlying tensorflow best model:
import autokeras as ak
from tensorflow.keras.models import load_model
model = reg.export_model()
model.save(MODEL_FILEPATH, save_format="tf")
best_model = load_model(MODEL_FILEPATH, custom_objects=ak.CUSTOM_OBJECTS)
However, in practice that wouldn't work, since my data has been fitted by autokeras, which takes care of data preparation and scaling. I don't think I have access to what autokeras is doing to the input data (X) before actually fitting, so I can't actually use the exported tensorflow best model to predict labels for new samples with un-prepared and unscaled features.
Am I missing something major here?
Also I noticed that there are some binaries in the autokeras temporary dir. That dir seems to be generated automatically. Is there a way to use that dir to load the previously-fit autokeras "super" model?
Using pickle will do the job; see https://github.com/keras-team/autokeras/issues/1081#issuecomment-645508111.
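A pickle round trip can be sketched as follows. This is a minimal sketch rather than AutoKeras-specific code: `save_object` and `load_object` are hypothetical helper names, and the dictionary is only a stand-in for a fitted AutoKeras object such as `reg`. Whether the objects of a given AutoKeras version pickle cleanly is an assumption you would need to verify.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Serialize any picklable Python object to a file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Load an object previously written by save_object."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for the fitted object (hypothetical; replace with your own, e.g. reg).
fitted = {"name": "stand-in", "params": [0.1, 0.2, 0.3]}

# Write to a temporary directory and read the object back.
path = os.path.join(tempfile.mkdtemp(), "fitted.pkl")
save_object(fitted, path)
restored = load_object(path)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.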

Training YOLO with my own dataset

I want to build a database with Yolo, and this is my first time working with deep learning.
How can I build a database for Yolo and train it?
How do I get the weights of the classifications?
Is it too difficult for someone new to deep learning?
Yes, you can do it with ease, and welcome to the deep learning community!
First, clone the darknet repository, go inside the folder, and run make in a command prompt:
git clone https://github.com/pjreddie/darknet
cd darknet
make
Create these files:
custom_data/custom.names
custom_data/images
custom_data/train.txt
custom_data/test.txt
Now it's time to label the images using LabelImg and save the labels in YOLO format, which generates a corresponding label .txt file for each image in the dataset.
The names of our object classes should be saved in custom_data/custom.names.
Using the following script, you can split the dataset into train and test sets:
import glob
import os

dataset_path = '/media/subham/Data1/deep_learning/usecase/yolov3/images'
# Percentage of images to be used for the test set
percentage_test = 20
# Every index_test-th image goes to the test set (every 5th image for 20%)
index_test = round(100 / percentage_test)
# Create and/or truncate train.txt and test.txt, then populate them
with open('train.txt', 'w') as file_train, open('test.txt', 'w') as file_test:
    counter = 1
    for path_and_filename in sorted(glob.iglob(os.path.join(dataset_path, "*.jpg"))):
        title, ext = os.path.splitext(os.path.basename(path_and_filename))
        if counter == index_test:
            # Send this image to the test set and restart the count
            counter = 1
            file_test.write(dataset_path + "/" + title + '.jpg' + "\n")
        else:
            file_train.write(dataset_path + "/" + title + '.jpg' + "\n")
            counter = counter + 1
To train our object detector, we can start from existing pre-trained weights that were trained on huge datasets. Download the pre-trained weights (darknet53.conv.74, used in the training command below) to the root directory.
Create a detector.data file in the custom_data directory, which should contain information about the train and test datasets:
classes=2
train=custom_data/train.txt
valid=custom_data/test.txt
names=custom_data/custom.names
backup=backup/
Now we have to make changes to the YOLOv3 configuration for training our model with two classes. Based on the required performance, we can select a YOLOv3 configuration file; for this example we will use yolov3.cfg. Copy cfg/yolov3.cfg to custom_data/cfg/yolov3-custom.cfg.
The maximum number of iterations for which the network is trained is set with the param max_batches=4000. Also update steps=3200,3600, which are 80% and 90% of max_batches.
We need to update the classes param of the [yolo] layers and the filters param of the [convolutional] layers that come just before the [yolo] layers.
In this example, since we have two classes, we update the classes param in the [yolo] layers to 2 at line numbers 610, 696, and 783.
Similarly, we need to update the filters param based on the class count: filters=(classes + 5) * 3. For two classes, we should set filters=21 at line numbers 603, 689, and 776.
All the configuration changes are made to custom_data/cfg/yolov3-custom.cfg
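The arithmetic behind these cfg values can be checked with a short sketch. The function name `yolo_cfg_values` and the `anchors_per_scale` parameter are hypothetical, introduced only for illustration; the formulas are the ones stated above.

```python
def yolo_cfg_values(num_classes, max_batches=4000, anchors_per_scale=3):
    """Return the cfg values described above for a given class count."""
    # Each anchor predicts 4 box coordinates, 1 objectness score,
    # and one score per class, hence (classes + 5) * 3 filters.
    filters = (num_classes + 5) * anchors_per_scale
    # Learning-rate steps at 80% and 90% of max_batches.
    steps = (max_batches * 8 // 10, max_batches * 9 // 10)
    return {
        "classes": num_classes,
        "filters": filters,
        "max_batches": max_batches,
        "steps": steps,
    }


values = yolo_cfg_values(2)
print(values)
```

For two classes this gives filters=21 and steps=3200,3600, matching the values used in this example.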
Now we have defined all the necessary items for training the YOLOv3 model. To train:
./darknet detector train custom_data/detector.data custom_data/cfg/yolov3-custom.cfg darknet53.conv.74
You can also mark bounding boxes of objects in images for training Yolo right in your web browser, using a tool that is deployed to GitHub Pages.
Use the popular forked darknet repository https://github.com/AlexeyAB/darknet. The author describes many steps that will help you build and use your own Yolo detector model.
How do you build your own custom dataset and train on it? Follow the steps described in that repository. The author suggests using the Yolo Mark labeling tool to build your dataset, but you can also try other labeling tools.
How do you get the weights? The weights are stored in the darknet/backup/ directory after every 1000 iterations (you can adjust this value later). The repository above explains how to create and use the weights file.
I don't think it will be too difficult if you already know math, statistics, and programming. Learning basic neural networks such as the perceptron and MLP, then moving on to modern machine learning, is a good start. After that, you might want to expand your knowledge into computer vision or NLP.
Depending on your OS, you can either use https://github.com/AlexeyAB/darknet (especially for Windows) or stick to https://github.com/pjreddie/darknet.
Steps to do so:
1) Set up darknet as detailed in the posts.
2) I used LabelImg to label my images. Make sure that the format you save the labels in is YOLO. If you save using the PascalVOC format or others, you can write scripts to convert it to the YOLO format that darknet expects. Also, make sure that you do not reorder your labels file. If you want to add new labels, add them at the end of the file, not in between. YOLO format refers to each class by its position in that file, so your previously labelled images may get messed up if you insert classes in between.
3) The weights will be generated in a specific folder in darknet as you train your model. (If you need more details, I am happy to help answer that.) You can download the .74 file for YOLO and start training. Training needs a built darknet.exe, a cfg file, a .74 file, and the location of your training data.
The setup is draconian, the process itself is not.
To build your own dataset, you should use LabelImg. It is free, easy-to-use software that produces all the files you need to build a dataset. Because you are working with YOLO, you need a txt file for each image, containing important information such as the bbox coordinates and the label. LabelImg produces all these txt files automatically, so all you have to do is open the directory that contains your images with LabelImg and start labelling. Once you have all your txt files, you also need to create some other files in order to start training (see https://blog.francium.tech/custom-object-training-and-detection-with-yolov3-darknet-and-opencv-41542f2ff44e).
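If your labels were saved as PascalVOC-style pixel boxes instead, a conversion script along these lines produces the per-image txt lines. This is a minimal sketch: `voc_to_yolo` is a hypothetical helper name, and it assumes the usual YOLO label line of class index followed by the box center, width, and height, each normalized by the image size.

```python
def voc_to_yolo(class_id, xmin, ymin, xmax, ymax, img_width, img_height):
    """Convert a pixel box (xmin, ymin, xmax, ymax) to one YOLO label line."""
    # Box center, normalized to the [0, 1] range.
    x_center = (xmin + xmax) / 2 / img_width
    y_center = (ymin + ymax) / 2 / img_height
    # Box size, normalized to the [0, 1] range.
    width = (xmax - xmin) / img_width
    height = (ymax - ymin) / img_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A box from (100, 200) to (300, 400) in a 1000x800 image, class index 0.
line = voc_to_yolo(0, 100, 200, 300, 400, 1000, 800)
print(line)
```

The class index is the zero-based position of the class name in the names file, which is why new labels should only be appended at the end.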

How to load the saved tokenizer from pretrained model

I fine-tuned a pretrained BERT model in PyTorch using Hugging Face transformers. All the training/validation is done on a GPU in the cloud.
At the end of the training, I save the model and tokenizer like below:
best_model.save_pretrained('./saved_model/')
tokenizer.save_pretrained('./saved_model/')
This creates the files below in the saved_model directory:
config.json
added_token.json
special_tokens_map.json
tokenizer_config.json
vocab.txt
pytorch_model.bin
Now I download the saved_model directory to my computer and want to load the model and tokenizer. I can load the model like below:
model = torch.load('./saved_model/pytorch_model.bin',map_location=torch.device('cpu'))
But how do I load the tokenizer? I am new to PyTorch and not sure, because there are multiple files. Am I perhaps not saving the model in the right way?
from_pretrained expects the directory that contains the saved model and tokenizer files, so the correct way to load the tokenizer is:
tokenizer = BertTokenizer.from_pretrained(<Path to the directory containing pretrained model/tokenizer>)
In your case:
tokenizer = BertTokenizer.from_pretrained('./saved_model/')
Here ./saved_model/ is the directory where you saved your model and tokenizer.

Where can I find the label map between a trained model's output (such as GoogleNet's) and the real class labels?

Hi everyone, I am new to Caffe. I am trying to use the trained GoogleNet model, downloaded from the Model Zoo, to classify some images. However, the network's output seems to be a vector rather than a real label (like dog or cat).
Where can I find the label-map between trained model like googleNet's output to their real class label?
Thanks.
If you got Caffe from git, you should find a shell script, get_ilsvrc_aux.sh, in the data/ilsvrc12 folder.
This script should download several files used for ILSVRC (the subset of ImageNet used for the Large Scale Visual Recognition Challenge) training.
The most interesting file for you is synset_words.txt, which has 1000 lines, one line per class identified by the net.
The format of each line is
nXXXXXXXX description of class
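Mapping the output vector to a label can be sketched as follows. This is a minimal sketch under one assumption: index i of the network's output corresponds to line i (zero-based) of synset_words.txt. The helper names `load_labels` and `top_label` are hypothetical, and the two sample lines only illustrate the file format; in practice you would pass the lines read from the real file.

```python
def load_labels(lines):
    """Parse 'nXXXXXXXX description' lines into (wordnet_id, description) pairs."""
    labels = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The ID is everything before the first space; the rest is the description.
        wordnet_id, _, description = line.partition(" ")
        labels.append((wordnet_id, description))
    return labels


def top_label(scores, labels):
    """Return the label whose index matches the highest score."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return labels[best_index]


# Sample lines in the synset_words.txt format (illustrative only).
sample = [
    "n01440764 tench, Tinca tinca",
    "n01443537 goldfish, Carassius auratus",
]
labels = load_labels(sample)
best = top_label([0.1, 0.9], labels)
print(best)
```

With the real file, `load_labels(open("synset_words.txt"))` would yield 1000 entries, one per output index.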

Resources