Using a model created in Python in ML.NET

We have a scenario where we need to use a machine learning algorithm to predict a value. We want to do it in ML.NET because of some issues.
We tried AutoML in a project and trained it on almost 80k records for more than 30 minutes. The CSV file was 22 MB. The data looks like this:
------------------------------------
Col1               col2
------------------------------------
Some text          21
Some other text    2
------------------------------------
We have historical data of this kind and need to predict col2 from the col1 text.
The model predicts decimal values even though the column contains whole numbers.
Someone has created a model in Python that works as expected for now, and we want to use it in ML.NET.
Is it possible to use a model created in Python in ML.NET?

There are a couple of ways, depending on whether it's a TensorFlow model or a model from another framework.
If it's a TensorFlow model, it can be loaded directly using the mlContext.Model.LoadTensorFlowModel method from the Microsoft.ML.TensorFlow package.
var tensorFlowModel = mlContext.Model.LoadTensorFlowModel(_modelPath);
If it's any other model, such as a Keras or PyTorch model, it can be converted to the ONNX format using one of the ONNX converter packages. The ONNX tutorials list converters for several different source formats.
Once you have an ONNX model, you can use the Microsoft.ML.OnnxTransformer package, which provides an ApplyOnnxModel method.
mlContext.Transforms.ApplyOnnxModel(modelFile: "Model File", outputColumnName: "Output column name", inputColumnName: "Input column name")

Related

How is the LFW dataset used for evaluating a FaceNet model?

I am building a face recognition model using FaceNet. In most of the papers I have seen, LFW is used for validation. I am trying to understand how LFW is used for validation, since only 1600 of its 5400 classes have more than 2 images. I am looking for answers to the following questions:
1) For validation, do we need to use only the classes with more than 1 image and ignore the remaining classes?
2) The link below has files named 'pairs.txt' and 'people.txt'. How exactly are they used?
http://vis-www.cs.umass.edu/lfw/
One option is to prepare a flipped dataset as the query dataset.
You can use the original LFW images as the reference dataset and a flipped copy of them as the query dataset.
See this repo for details: https://github.com/ZhaoJ9014/face.evoLVe.PyTorch/blob/master/util/extract_feature_v1.py.
The author also provides extract_feature_v2.py, which adds a centre crop before the flip.
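The flip step described above can be pictured with a tiny library-free sketch. It assumes the flip is a horizontal (left-right) mirror and represents an image as a hypothetical list of pixel rows; the linked script works on tensors and may differ in detail.

```python
def hflip(image):
    """Return a left-right mirrored copy of an image given as a list of pixel rows."""
    return [row[::-1] for row in image]


# Hypothetical 2x3 grayscale image, used only for illustration.
reference = [
    [1, 2, 3],
    [4, 5, 6],
]

# The flipped copy plays the role of the query image.
query = hflip(reference)
print(query)
```

Flipping twice returns the original image, which is a quick sanity check when building the query set.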

Classification using H2O.ai H2O-3 Automl Algorithm on AWS SageMaker: Categorical Columns

I'm trying to train a model using H2O.ai's H2O-3 Automl Algorithm on AWS SageMaker using the console.
My model's goal is to predict if an arrest will be made based upon the year, type of crime, and location.
My data has 8 columns:
primary_type: enum
description: enum
location_description: enum
arrest: enum (true/false), this is the target column
domestic: enum (true/false)
year: number
latitude: number
longitude: number
When I use the SageMaker console on AWS and create a new training job using the H2O-3 Automl Algorithm, I specify the primary_type, description, location_description, and domestic columns as categorical.
However, in the logs of the training job I always see the following two lines:
Converting specified columns to categorical values:
[]
This leads me to believe the categorical_columns attribute in the training hyperparameter is not being taken into account.
I have tried the following hyperparameters with the same output in the logs each time:
{'classification': 'true', 'categorical_columns':'primary_type,description,location_description,domestic', 'target': 'arrest'}
{'classification': 'true', 'categorical_columns':['primary_type','description','location_description','domestic'], 'target': 'arrest'}
I thought the list of categorical columns was supposed to be a comma-delimited string, which would then be split into a list.
I expected the list of categorical column names to be output in the logs instead of an empty list, like so:
Converting specified columns to categorical values:
['primary_type','description','location_description','domestic']
Can anyone help me figure out how to get these categorical columns to apply to the training of my model?
Also, I think this is the code that runs when I train my model, but I have yet to confirm that: https://github.com/h2oai/h2o3-sagemaker/blob/master/automl/automl_scripts/train#L93-L151
This seems to be a bug in the h2o package. The code at https://github.com/h2oai/h2o3-sagemaker/blob/master/automl/automl_scripts/train#L106 shows that it reads categorical_columns directly from the hyperparameters, not nested under the training field. However, when I move the categorical_columns field up a level, the algorithm doesn't recognize it. So I have no solution for this.
Based on the code here: https://github.com/h2oai/h2o3-sagemaker/blob/master/automl/automl_scripts/train#L106
it seems that the parameter expects a comma-separated string, e.g. "cat,dog,bird".
I would try "primary_type,description,location_description,domestic" as the input parameter, rather than ['primary_type', 'description'... etc]
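To make the comma-separated suggestion concrete, here is a minimal sketch of how such a string could be turned into a list of column names. The `parse_categorical_columns` helper is hypothetical and is not the code in the linked training script; it only illustrates the expected input shape.

```python
def parse_categorical_columns(raw):
    """Split a comma-separated hyperparameter string into column names."""
    if not raw:
        return []
    # Strip whitespace around each name and drop empty entries.
    return [name.strip() for name in raw.split(",") if name.strip()]


# Hyperparameters in the string form suggested above.
hyperparameters = {
    "classification": "true",
    "categorical_columns": "primary_type,description,location_description,domestic",
    "target": "arrest",
}

columns = parse_categorical_columns(hyperparameters.get("categorical_columns", ""))
print(columns)
```

If the key is missing or empty, the helper returns an empty list, which would match the `[]` seen in the logs.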

ARIMA model producing a straight line prediction

I did some experiments with the ARIMA model on 2 datasets:
1) Airline passengers data
2) USD vs Indian rupee data
I am getting a normal zig-zag prediction on Airline passengers data
ARIMA order=(2,1,2)
Model Results
But on USD vs Indian rupee data, I am getting prediction as a straight line
ARIMA order=(2,1,2)
Model Results
SARIMAX order=(2,1,2), seasonal_order=(0,0,1,30)
Model Results
I tried different parameters, but for the USD vs Indian rupee data I always get a straight-line prediction.
One more question: I have read that the ARIMA model does not support time series with a seasonal component (for that we have SARIMA). Then why does the ARIMA model produce predictions with a cycle for the Airline passengers data?
Having gone through a similar issue recently, I would recommend the following:
Visualize the seasonal decomposition of the data to make sure that seasonality exists in it. Make sure that the dataframe index has a frequency set. You can enforce a frequency on a pandas dataframe with the following:
dh = df.asfreq('W')  # enforce weekly frequency; fill any resulting NaNs with an appropriate method
Here is some sample code for seasonal decomposition:
import matplotlib.pyplot as plt
import statsmodels.api as sm
# whether to use an additive or multiplicative model is data specific
decomposition = sm.tsa.seasonal_decompose(dh['value'], model='additive', extrapolate_trend='freq')
fig = decomposition.plot()
plt.show()
The plot will show whether seasonality exists in your data. For more background, see the linked document on seasonal decomposition.
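A rough numeric companion to the decomposition plot is the sample autocorrelation at the suspected seasonal lag. The sketch below is plain Python on synthetic data; the `autocorr` helper and the series are illustrative assumptions, not part of statsmodels or pmdarima.

```python
import math


def autocorr(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var


# Synthetic series with a 12-step cycle (illustrative data only).
seasonal = [10 + 5 * math.sin(2 * math.pi * t / 12) for t in range(120)]

# Strongly positive at the full period, strongly negative at the half period.
print(autocorr(seasonal, 12))
print(autocorr(seasonal, 6))
```

On real data with a trend, difference the series first, since a trend alone produces high autocorrelation at every lag.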
If you're sure that the seasonal period of the data is 30, then you should be able to get a good result with the pmdarima package. The package is very effective at finding optimal p, d, q values for your model.
The pmdarima documentation includes example code.
If you're unsure about seasonality, consult a domain expert about the seasonal effects in your data, or try experimenting with different seasonal periods in your model and compare the errors.
Make sure to check the stationarity of the data with the Dickey-Fuller test before training the model. pmdarima supports estimating the d component as follows:
from pmdarima.arima import ndiffs
kpss_diff = ndiffs(dh['value'].values, alpha=0.05, test='kpss', max_d=12)
adf_diff = ndiffs(dh['value'].values, alpha=0.05, test='adf', max_d=12)
n_diffs = max(adf_diff, kpss_diff)
You may also find d with the help of the document linked above. If this answer isn't helpful, please provide the data source for the exchange rate, and I will try to explain the process flow with sample code.

Gensim word embedding training with initial values

I have a dataset with documents separated into different years, and my objective is to train an embedding model for each year's data such that the same word appearing in different years has similar vector representations. For example, for the word 'compute', its vector in year 1 is
[0.22, 0.33, 0.20]
and in year 2 it's something around:
[0.20, 0.35, 0.18]
Is there a way to accomplish this? For example, train the model of year 2 with both initial values (if the word is trained already in year 1, modify its vector) and randomness (if this is a new word for the corpus).
I think the easiest solution is to save the embeddings after training on the first data set, then load the trained model and continue training for the second data set. This way you shouldn't expect the embeddings to drift away from the saved state much (unless your data sets are very different).
It would also make sense to create a single vocabulary from all documents: vocabulary words that aren't present in a particular document will get some random representation, but still it will be a working word2vec model.
Example from the documentation:
>>> # note: `size` was renamed to `vector_size` in gensim 4.0
>>> model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
>>> model.save(fname)
>>> model = Word2Vec.load(fname) # continue training with the loaded model
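The initialization idea from the question (reuse year 1 vectors for known words, random vectors for new ones) can be pictured with a small plain-Python sketch. This is not gensim's API; `warm_start_vectors` and the sample data are hypothetical. In gensim, loading the saved model and continuing training plays this role.

```python
import random


def warm_start_vectors(vocab, previous_vectors, dim, seed=0):
    """Build initial vectors: reuse last year's vector if known, else random."""
    rng = random.Random(seed)
    initial = {}
    for word in vocab:
        if word in previous_vectors:
            # Copy so that later updates do not modify last year's vectors.
            initial[word] = list(previous_vectors[word])
        else:
            # Small random values for words that are new in this year.
            initial[word] = [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
    return initial


year1 = {"compute": [0.22, 0.33, 0.20]}
year2_vocab = ["compute", "cloud"]

year2_init = warm_start_vectors(year2_vocab, year1, dim=3)
print(year2_init["compute"])
```

Training on the year 2 corpus would then start from these values, so vectors for shared words only move as far as the new data pushes them.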

Where can I find the label map between a trained model's output (e.g. GoogLeNet) and the real class labels?

Hi everyone, I am new to Caffe. I am currently trying to use the trained GoogLeNet, downloaded from the model zoo, to classify some images. However, the network's output seems to be a vector rather than a real label (like dog, cat).
Where can I find the label map between a trained model's output, such as GoogLeNet's, and the real class labels?
Thanks.
If you got Caffe from git, you should find a shell script, get_ilsvrc_aux.sh, in the data/ilsvrc12 folder.
This script should download several files used for ILSVRC training (ILSVRC is the subset of ImageNet used for the large scale image recognition challenge).
The most interesting file for you is synset_words.txt. It has 1000 lines, one line per class identified by the net.
The format of the line is
nXXXXXXXX description of class
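A minimal sketch of using that file follows, assuming the i-th line corresponds to the i-th element of the network's output vector. The sample lines, IDs, and scores are made up for illustration, and `parse_synset_words` is a hypothetical helper, not part of Caffe.

```python
def parse_synset_words(lines):
    """Turn 'nXXXXXXXX description' lines into a list of (wnid, description)."""
    labels = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The ID is everything before the first space; the rest is the description.
        wnid, _, description = line.partition(" ")
        labels.append((wnid, description))
    return labels


# Made-up lines in the format described above; the real file has 1000.
sample_lines = [
    "n00000001 first example class",
    "n00000002 second example class, with an alias",
]
labels = parse_synset_words(sample_lines)

# Hypothetical network output: one score per class, in file order.
scores = [0.1, 0.9]
best = max(range(len(scores)), key=scores.__getitem__)
print(labels[best])
```

With the real file, the lines would come from open("synset_words.txt") and the scores from the network's final probability layer.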
