Training and Prediction in Spark Streaming Machine Learning Model

I am having a hard time understanding how we can both update a machine learning model and use it to make predictions in one Spark Streaming job.
This code is from Spark's StreamingLinearRegressionExample class:
val trainingData = ssc.textFileStream(args(0)).map(LabeledPoint.parse).cache()
val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)
val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.zeros(numFeatures))
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
The variable model is being updated by one stream and used to make predictions on another stream. I do not understand how the changes made to the model in the trainOn method are reflected in predictOnValues.

Related

Converting a PyTorch model to nn.Module for exporting to ONNX for Lens Studio

I am trying to convert pix2pix to a .pb or ONNX model that can run in Lens Studio. Lens Studio has strict requirements for the models. I am trying to export this PyTorch model to ONNX using this guide provided by Lens Studio. The issue is that the PyTorch model found here uses its own base class, whereas the example uses nn.Module, and therefore it doesn't have the methods/variables that the torch.onnx.export function needs to run. So far I've run into it missing a variable called training and a method called train.
Would it be worth it to try to modify the base model, or should I try to build it from scratch using nn.Module? Is there a way to make the pix2pix model inherit from both the abstract base class and nn.Module? Am I misunderstanding the situation? The reason I want to do it using the Lens Studio tutorial is that I have gotten it to export to ONNX in different ways, but Lens Studio won't accept those for various reasons.
Also, this is my first time asking an SO question (after 6 years of coding); let me know if I make any mistakes and I can correct them. Thank you.
This is the important code from the tutorial for creating a PyTorch model for Lens Studio:
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Conv2d(in_channels=3, out_channels=1,
                               kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        out = self.layer(x)
        out = nn.functional.interpolate(out, scale_factor=2,
                                        mode='bilinear', align_corners=True)
        out = torch.nn.functional.softmax(out, dim=1)
        return out
I'm not going to include all the code from the PyTorch model because it's large, but the beginning of base_model.py is:
import os
import torch
from collections import OrderedDict
from abc import ABC, abstractmethod
from . import networks

class BaseModel(ABC):
    """This class is an abstract base class (ABC) for models.
    To create a subclass, you need to implement the following five functions:
        -- <__init__>: initialize the class; first call BaseModel.__init__(self, opt).
        -- <set_input>: unpack data from dataset and apply preprocessing.
        -- <forward>: produce intermediate results.
        -- <optimize_parameters>: calculate losses, gradients, and update network weights.
        -- <modify_commandline_options>: (optionally) add model-specific options and set default options.
    """

    def __init__(self, opt):
        """Initialize the BaseModel class.

        Parameters:
            opt (Option class) -- stores all the experiment flags; needs to be a subclass of BaseOptions

        When creating your custom class, you need to implement your own initialization.
        In this function, you should first call <BaseModel.__init__(self, opt)>
        Then, you need to define four lists:
            -- self.loss_names (str list): specify the training losses that you want to plot and save.
            -- self.model_names (str list): define networks used in our training.
            -- self.visual_names (str list): specify the images that you want to display and save.
            -- self.optimizers (optimizer list): define and initialize optimizers. You can define one optimizer for each network. If two networks are updated at the same time, you can use itertools.chain to group them. See cycle_gan_model.py for an example.
        """
        self.opt = opt
        self.gpu_ids = opt.gpu_ids
        self.isTrain = opt.isTrain
        self.device = torch.device('cuda:{}'.format(self.gpu_ids[0])) if self.gpu_ids else torch.device('cpu')  # get device name: CPU or GPU
        self.save_dir = os.path.join(opt.checkpoints_dir, opt.name)  # save all the checkpoints to save_dir
        if opt.preprocess != 'scale_width':  # with [scale_width], input images might have different sizes, which hurts the performance of cudnn.benchmark.
            torch.backends.cudnn.benchmark = True
        self.loss_names = []
        self.model_names = []
        self.visual_names = []
        self.optimizers = []
        self.image_paths = []
        self.metric = 0  # used for learning rate policy 'plateau'
and for pix2pix_model.py
import torch
from .base_model import BaseModel
from . import networks

class Pix2PixModel(BaseModel):
    """This class implements the pix2pix model, for learning a mapping from input images to output images given paired data.

    The model training requires '--dataset_mode aligned' dataset.
    By default, it uses a '--netG unet256' U-Net generator,
    a '--netD basic' discriminator (PatchGAN),
    and a '--gan_mode' vanilla GAN loss (the cross-entropy objective used in the original GAN paper).

    pix2pix paper: https://arxiv.org/pdf/1611.07004.pdf
    """

    @staticmethod
    def modify_commandline_options(parser, is_train=True):
        """Add new dataset-specific options, and rewrite default values for existing options.

        Parameters:
            parser          -- original option parser
            is_train (bool) -- whether training phase or test phase. You can use this flag to add training-specific or test-specific options.

        Returns:
            the modified parser.

        For pix2pix, we do not use image buffer
        The training objective is: GAN Loss + lambda_L1 * ||G(A)-B||_1
        By default, we use vanilla GAN loss, UNet with batchnorm, and aligned datasets.
        """
        # changing the default values to match the pix2pix paper (https://phillipi.github.io/pix2pix/)
        parser.set_defaults(norm='batch', netG='unet_256', dataset_mode='aligned')
        if is_train:
            parser.set_defaults(pool_size=0, gan_mode='vanilla')
            parser.add_argument('--lambda_L1', type=float, default=100.0, help='weight for L1 loss')
        return parser

    def __init__(self, opt):
        """Initialize the pix2pix class.

        Parameters:
            opt (Option class) -- stores all the experiment flags; needs to be a subclass of BaseOptions
        """
(Also, a side note: if you see this and it looks like there is no easy way out, let me know; I know what it's like to see someone who's just getting started go in too deep too early.)
You can definitely have your model inherit from both the base class and torch.nn.Module (Python allows multiple inheritance). However, you should watch out for conflicts if both inherited classes have methods with identical names (I can see at least one: their base class provides an eval function, and so does nn.Module).
However, since you do not need CycleGAN, and a lot of the code exists only for compatibility with their training environment, you would probably be better off just re-implementing pix2pix. Take the code, have it inherit from nn.Module, copy-paste the useful/mandatory functions from the base class, and translate everything into clean PyTorch code. You already have the forward function (which is the only requirement for a PyTorch module).
All the subnetworks they use (like the ResNet blocks) already seem to inherit from nn.Module, so there is nothing to change there (double-check that, though).
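For reference, here is a minimal sketch of the multiple-inheritance route described above. The BaseModel below is only a stand-in for the repo's abstract base class, and the Conv2d generator is a placeholder, so treat this as an illustration of where the name clash and the ONNX export fit, not as the repo's actual export path:
import torch
import torch.nn as nn
from abc import ABC

class BaseModel(ABC):
    """Stand-in for the repo's BaseModel; the real one takes an `opt` object and does much more."""
    def __init__(self, opt):
        self.opt = opt

    def eval(self):
        # The real BaseModel also defines eval(), which clashes with nn.Module.eval().
        pass

class Pix2PixExportable(nn.Module, BaseModel):
    """Put nn.Module first in the MRO so torch.onnx.export finds `training`, `train()`, `eval()`, etc."""
    def __init__(self, opt):
        nn.Module.__init__(self)       # sets up parameters, buffers and the `training` flag
        BaseModel.__init__(self, opt)  # the repo's own bookkeeping
        self.netG = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # placeholder for the real U-Net generator

    def forward(self, x):
        return self.netG(x)

# Usage sketch (`opt` can be any object the base class expects):
# model = Pix2PixExportable(opt).eval()   # resolves to nn.Module.eval() because of the MRO
# torch.onnx.export(model, torch.randn(1, 3, 256, 256), "pix2pix.onnx")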

How to use pretrained weights of a model for initializing the weights in next iteration?

I have a model architecture. I have saved the entire model, using torch.save(), after some n number of iterations. I want to run another iteration of my code using the pre-trained weights of the model I saved previously.
Edit: I want the weight initialization for the new iteration to be done from the weights of the pretrained model.
Edit 2: Just to add, I don't plan to resume training. I intend to save the model and use it for a separate training run with the same parameters. Think of it as using a saved model, with its weights etc., for a larger run with more samples (i.e. a completely new training job).
Right now, I do something like:
# default_lr = 5
# default_weight_decay = 0.001
# model_io = the pretrained model
model = torch.load(model_io)
optim = torch.optim.Adam(model.parameters(),lr=default_lr, weight_decay=default_weight_decay)
loss_new = BCELoss()
epochs = default_epoch
.
.
training_loop():
    ....
    outputs = model(input)
    ....
.
# similarly for the test loop
Am I missing something? I have to run for a very large number of epochs on a huge number of samples, so I cannot afford to wait for the results and then figure things out.
Thank you!
From the code that you have posted, I see that you are only loading the previous model parameters in order to restart your training from where you left off. This is not sufficient to restart your training correctly. Along with your model parameters (weights), you also need to save and load your optimizer state, especially when your choice of optimizer is Adam, which keeps running moment estimates for all your weights that adapt the effective per-parameter learning rate.
In order to smoothly restart training, I would do the following:
# For saving your model
state = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict()
}
model_save_path = "Enter/your/model/path/here/model_name.pth"
torch.save(state, model_save_path)
# ------------------------------------------
# For loading your model
state = torch.load(model_save_path)
model = MyNetwork()
model.load_state_dict(state['model'])
optim = torch.optim.Adam(model.parameters(),lr=default_lr, weight_decay=default_weight_decay)
optim.load_state_dict(state['optimizer'])
Besides these, you may also want to save your learning rate if you are using a learning-rate decay strategy, your best validation accuracy so far (which you may want for checkpointing purposes), and any other changeable parameter that might affect your training. But in most cases, saving and loading just the model weights and optimizer state should be sufficient.
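As a rough sketch of such a fuller checkpoint (the scheduler, epoch and best_val_acc fields, and the toy model and optimizer, are illustrative additions, not part of the question's code):
import torch
import torch.nn as nn

# Toy objects just to make the sketch runnable; substitute your own model/optimizer.
model = nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
epoch, best_val_acc = 10, 0.87  # illustrative values

state = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'scheduler': scheduler.state_dict(),  # learning-rate decay state, if you use a scheduler
    'epoch': epoch,
    'best_val_acc': best_val_acc,
}
torch.save(state, "checkpoint.pth")

# Restoring everything needed to continue consistently:
state = torch.load("checkpoint.pth")
model.load_state_dict(state['model'])
optimizer.load_state_dict(state['optimizer'])
scheduler.load_state_dict(state['scheduler'])
start_epoch = state['epoch'] + 1
best_val_acc = state['best_val_acc']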
EDIT: You may also want to look at the following answer, which explains in detail how you should save your model in different scenarios.

Use tested machine learning model on new unlabeled single observation or dataset?

How can I use a trained and tested algorithm (e.g. a machine learning classifier), after it has been saved, on a new observation or dataset whose class I do not know (e.g. ill vs. healthy), based on the predictors used for model training?
I use caret but can't find any lines of code for this.
Many thanks.
After training and testing any machine learning model, you can save the model as an .rds file and load it back as follows:
# Save the fitted model as an .rds file
saveRDS(model_fit, "model.rds")
# Load it back later
my_model <- readRDS("model.rds")
Create a new observation from the same dataset, or you can also use a new dataset:
new_obs <- iris[100, ]  # using the default iris dataset, sample no. 100
Predict on the new observation:
predicted_new <- predict(my_model, new_obs)
confusionMatrix(reference = new_obs$Species, data = predicted_new)
table(new_obs$Species, predicted_new)

How to reset specific layer weights for transfer learning?

I am looking for a way to re-initialize a layer's weights in an existing pre-trained Keras model.
I am using Python with Keras and need to use transfer learning.
I use the following code to load the pre-trained Keras models:
from keras.applications import vgg16, inception_v3, resnet50, mobilenet
vgg_model = vgg16.VGG16(weights='imagenet')
I read that when using a dataset that is very different from the original dataset, it might be beneficial to create new layers over the lower-level features that we have in the trained net.
I found out how to allow fine-tuning of parameters, and now I am looking for a way to reset a selected layer so it can be retrained. I know I can create a new model, use layer n-1 as input and add layer n to it, but I am looking for a way to reset the parameters of an existing layer in an existing model.
For whatever reason you may want to re-initialize the weights of a single layer k, here is a general way to do it:
from keras.applications import vgg16
from keras import backend as K
vgg_model = vgg16.VGG16(weights='imagenet')
sess = K.get_session()
initial_weights = vgg_model.get_weights()
from keras.initializers import glorot_uniform # Or your initializer of choice
k = 30 # say for layer 30
new_weights = [glorot_uniform()(initial_weights[i].shape).eval(session=sess) if i==k else initial_weights[i] for i in range(len(initial_weights))]
vgg_model.set_weights(new_weights)
You can easily verify that initial_weights[k]==new_weights[k] returns an array of False, while initial_weights[i]==new_weights[i] for any other i returns an array of True.
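Continuing from the snippet above (this reuses initial_weights, new_weights and k from there), a quick sketch of that check with NumPy:
import numpy as np

# Only the k-th weight array should differ after the re-initialization above.
for i, (old, new) in enumerate(zip(initial_weights, new_weights)):
    same = np.array_equal(old, new)
    assert same == (i != k), f"unexpected result at index {i}"
print("only index", k, "was re-initialized")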

Difference between train(), run() and fit() functions in Spark

There are several options for using logistic regression with Apache Spark (version 1.5.2) in Java:
spark.ml:
1) LogisticRegression lr = new LogisticRegression();
a) lr.train(dataFrame);
b) lr.fit(dataFrame);
spark.mllib:
2) LogisticRegressionWithSGD lr = new LogisticRegressionWithSGD();
a) lr.train(rdd);
b) lr.run(rdd);
3) LogisticRegressionWithLBFGS lr = new LogisticRegressionWithLBFGS();
a) lr.train(rdd);
b) lr.run(rdd);
I was wondering what the difference is between a) and b), apart from the GeneralizedLinearAlgorithm output from the run() function instead of the LogisticRegressionModel from the other? I couldn't find any hint in the Java or Scala documentation. Thanks in advance for the help.
Spark contains two libraries that can be used for machine learning: ML and MLLib. Could you specify which version of Spark you're using, please?
MLLib. It was Spark's first machine learning library. It has a fairly shallow structure and runs on RDDs. Things are somewhat inconsistent in MLLib, so you have to look at the code to know which one to use. I'm not sure which language or version you're using, but for Spark 1.6.0 in Scala, there's a singleton object:
object LogisticRegressionWithSGD {
  def train(input: RDD[LabeledPoint], ...) = new LogisticRegressionWithSGD(...).run(input, ...)
}
which means that train is to be called as a static method on the object LogisticRegressionWithSGD, but if you have an instance of LogisticRegressionWithSGD there's only a run method:
LogisticRegressionWithSGD.train(rdd, parameters)
// OR
val lr = new LogisticRegressionWithSGD()
lr.run(rdd)
Anyway, if you have another version, you should definitely prefer the more general version, i.e. run.
ML. It's the newest library, and it's based on the use of DataFrame, which is basically an RDD[Row] (a Row is just a sequence of untyped objects) with a schema (i.e. an object that contains information about the column names, types, metadata...). I definitely advise you to use it, as it enables optimizations! In this case, you should use the fit method, which is the method that all estimators need to implement.
Explanation: The ML library uses the notion of a Pipeline (much the same as in scikit-learn). A pipeline instance is basically an array of stages (of type PipelineStage), each of them being either an Estimator or a Transformer (there are some other types, e.g. Evaluator, but I won't get into them here as they are rare). A Transformer is simply an algorithm that transforms your data, so its main method is transform(DataFrame) and it outputs another DataFrame. An Estimator is an algorithm that produces a Model (a subtype of Transformer). It's basically any block that needs to fit on data, so it has a function fit(DataFrame) that outputs a Transformer.
For instance, if you want to multiply all your data by 2, you only need a transformer that implements a transform method that takes your input and multiplies it by 2. If you need to compute the mean and subtract it, you need an estimator that fits on the data to compute the mean and outputs a transformer that subtracts the learned mean. So any time you use ML, use the fit and transform methods. It allows you to do something like:
val trainingSet = // training DataFrame
val testSet = // test DataFrame
val lr = new LogisticRegression().setFeaturesCol(...).setLabelCol(...) // + other parameter setters
val stage = // another stage, i.e. something that implements PipelineStage
val stages = Array(lr, stage)
val pipeline: Pipeline = new Pipeline().setStages(stages)
val model: PipelineModel = pipeline.fit(trainingSet)
val result: DataFrame = model.transform(testSet)
Now, if you really want to know why train exists: it's a function inherited from Predictor, which itself extends Estimator. Indeed, there are tons of possible Estimators: you could compute the mean, IDF, ... When you implement a predictor such as logistic regression, you have an abstract class Predictor that extends Estimator and gives you some shortcuts (e.g. it has a label column, a features column and a prediction column). In particular, it already overrides fit to change the schema of your DataFrame according to those label/features/prediction columns, and you simply need to implement your own train:
override def fit(dataset: DataFrame): M = {
  // This handles a few items such as schema validation.
  // Developers only need to implement train().
  transformSchema(dataset.schema, logging = true)
  copyValues(train(dataset).setParent(this))
}

protected def train(dataset: DataFrame): M
As you can see, the train method is meant to be protected/private, so it is not used by an external user.
