There are several options for using logistic regression with Apache Spark (version 1.5.2) in Java:
spark.ml:
1) LogisticRegression lr = new LogisticRegression();
a) lr.train(dataFrame);
b) lr.fit(dataFrame);
spark.mllib:
2) LogisticRegressionWithSGD lr = new LogisticRegressionWithSGD();
a) lr.train(rdd);
b) lr.run(rdd);
3) LogisticRegressionWithLBFGS lr = new LogisticRegressionWithLBFGS();
a) lr.train(rdd);
b) lr.run(rdd);
I was wondering what the difference is between a) and b), apart from run() returning a GeneralizedLinearAlgorithm output instead of the LogisticRegressionModel returned by the others? I couldn't find any hint in the Java or Scala documentation. Thanks in advance for the help.
Spark does contain two libraries that can be used for machine learning: ML and MLlib. Could you specify which version of Spark you're using, please?
MLlib. This was Spark's first machine learning library. It has a fairly shallow structure and runs on RDDs. Things are kind of anarchical in MLlib, so you have to look at the code to know which method to use. I'm not sure which language or version you're using, but for Spark 1.6.0 in Scala there's a singleton:
object LogisticRegressionWithSGD {
  def train(input: RDD[LabeledPoint], ...) = new LogisticRegressionWithSGD(...).run(input, ...)
}
which means that train is meant to be called as a static method on the object LogisticRegressionWithSGD, whereas if you have an instance of LogisticRegressionWithSGD there's only a run method:
LogisticRegressionWithSGD.train(rdd, parameters)
// OR
val lr = new LogisticRegressionWithSGD()
lr.run(rdd)
Anyway, if you have another version, you should definitely prefer the super version, i.e. run.
ML. This is the newer library, based on DataFrame, which is basically an RDD[Row] (a Row is just a sequence of untyped objects) with a schema (i.e. an object that contains information about the column names, types, metadata...). I definitely advise you to use it, as it enables optimizations! In this case, you should use the fit method, which is the method that all Estimators need to implement.
Explanation: The ML library uses the notion of a Pipeline (much like scikit-learn). A Pipeline instance is basically an array of stages (of type PipelineStage), each of them being either an Estimator or a Transformer (there are some other types, e.g. Evaluator, but I won't get into them here as they are rare).
A Transformer is simply an algorithm that transforms your data, so its main method is transform(DataFrame) and it outputs another DataFrame. An Estimator is an algorithm that produces a Model (a subtype of Transformer). It's basically any block that needs to fit on data, so it has a function fit(DataFrame) that outputs a Transformer. For instance, if you want to multiply all your data by 2, you only need a Transformer whose transform method takes your input and multiplies it by 2. If you need to compute the mean and subtract it, you need an Estimator that fits on the data to compute the mean and outputs a Transformer that subtracts the mean it learned.
So any time you use ML, use the fit and transform methods. It allows you to do something like:
val trainingSet = // training DataFrame
val testSet = // test DataFrame
val lr = new LogisticRegression().setFeaturesCol(...).setLabelCol(...) // + other params, e.g. setMaxIter()
val stage = // another stage, i.e. something that implements PipelineStage
val stages = Array(lr, stage)
val pipeline: Pipeline = new Pipeline().setStages(stages)
val model: PipelineModel = pipeline.fit(trainingSet)
val result: DataFrame = model.transform(testSet)
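As a concrete instance of the Estimator/Transformer duality described above, the built-in StandardScaler works just like the mean-subtraction example (a minimal Scala sketch, assuming a "features" vector column):
import org.apache.spark.ml.feature.StandardScaler

// StandardScaler is an Estimator: fit() learns the per-feature means/stds
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(true) // subtract the learned mean

// fit() returns a StandardScalerModel, i.e. a Transformer that applies the learned statistics
val scalerModel = scaler.fit(trainingSet)
val scaled = scalerModel.transform(trainingSet)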
Now if you really want to know why train exists: it's a method inherited from Predictor, which itself extends Estimator. Indeed there are tons of possible Estimators - you could compute the mean, an IDF,... When you implement a predictor such as logistic regression, you have an abstract class Predictor that extends Estimator and gives you some shortcuts (e.g. it has a label column, a features column and a prediction column). In particular, Predictor already overrides fit to change the schema of your DataFrame according to those label/features/prediction columns, and you simply need to implement your own train:
override def fit(dataset: DataFrame): M = {
  // This handles a few items such as schema validation.
  // Developers only need to implement train().
  transformSchema(dataset.schema, logging = true)
  copyValues(train(dataset).setParent(this))
}

protected def train(dataset: DataFrame): M
As you can see, the train method is protected/private, so it's not meant to be used by an external user.
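Since you're working in Java, a minimal sketch of the recommended spark.ml route (assuming a DataFrame with the default "label" and "features" columns) would be:
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.DataFrame;

// training and test are DataFrames with "label" and "features" columns
LogisticRegression lr = new LogisticRegression()
    .setMaxIter(10)
    .setRegParam(0.01);
LogisticRegressionModel model = lr.fit(training);  // fit() is the public entry point
DataFrame predictions = model.transform(test);     // the fitted model is a Transformer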
Related
I am currently working on a research project where I apply multiple pretrained models (for feature extraction) and train a new model on audio for a classification task. I am using Python, PyTorch, and the Dataset/DataLoader classes. My typical project structure and software design has become unruly because of the complexity. It is a research project and doesn't need to run "in production", so I would like a minimal solution that abides by the most common patterns for this type of project.
Below are some examples I have tried. Don't pay attention to the variable names, as I've changed some of them to make the code generic.
Implementation 1: a Pipeline class with a run() method that accepts parameters.
I like that there is only 1 Pipeline instance, but the for-loops at the call site and inside the actual run() method are still nasty.
train_dataset = AudioDataset(train_data_path)
val_dataset = AudioDataset(val_data_path)
test_dataset = AudioDataset(test_data_path)

train_loader = DataLoader(train_dataset, batch_size=1, shuffle=False)
val_loader = DataLoader(val_dataset, batch_size=1, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)
pipeline = Pipeline(
    config.model_path,
    config.audio_output_path,
    config.x_output_path,
    config.y_output_path,
    config.z_output_path,
)

for subset, loader in zip(
    ["train", "val", "test"], [train_loader, val_loader, test_loader]
):
    for i, (data_name, audio, _) in enumerate(
        tqdm(loader, desc=f"Processing {subset} data:", position=0, leave=True)
    ):
        data_name = data_name[0]
        pipeline.run(audio, data_name, subset)
Implementation 2: multiple instantiations of a Pipeline class with a run() method that does not accept parameters.
The call to the run() method is a bit cleaner, but I don't know whether I should instantiate the Pipeline multiple times.
train_pipeline = Pipeline(
    config.x_path,
    os.path.join(config.audio_output_path, "train"),
    train_loader,
    config.fs,
)
train_pipeline.run()
Implementation 3: a composable Pipeline like scikit-learn.
How to have splits and joins in the pipeline?
How to implement without scikit-learn?
for subset in train, val, test:
    for position in positions:
        for channel in channels:
            pipeline_steps = [step1(subset), step2(position), step3(channel)]
            pipeline = Pipeline(pipeline_steps)
            pipeline.run()
Do you have any recommended design patterns for my use case? Are there any GitHub repos that you recommend I look at for this? I think it is a rather simple problem in the grand scheme of things, but there doesn't seem to be a consensus on good implementations that sit between a beginner ML project and production-level pipelines.
I am trying to convert pix2pix to a pb or onnx model that can run in Lens Studio. Lens Studio has strict requirements for the models. I am trying to export this pytorch model to onnx using this guide provided by Lens Studio. The issue is that the pytorch model found here uses its own base class, whereas the example uses nn.Module, and therefore it doesn't have the methods/variables that the torch.onnx.export function needs to run. So far I've run into it missing a variable called training and a method called train.
Would it be worth it to try to modify the base model, or should I try to build it from scratch using nn.Module? Is there a way to make the pix2pix model inherit from both the abstract base class and nn.Module? Am I not understanding the situation? The reason I want to do it using the Lens Studio tutorial is that I have gotten it to export onnx in different ways, but Lens Studio won't accept those for various reasons.
Also, this is my first time asking an SO question (after 6 years of coding); let me know if I make any mistakes and I can correct them. Thank you.
This is the important code from the tutorial creating a pytorch model for Lens Studio:
import torch
import torch.nn as nn
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Conv2d(in_channels=3, out_channels=1,
                               kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        out = self.layer(x)
        out = nn.functional.interpolate(out, scale_factor=2,
                                        mode='bilinear', align_corners=True)
        out = torch.nn.functional.softmax(out, dim=1)
        return out
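For reference, the export step would then be something like the following (a sketch; the input shape, file name and opset version are my assumptions, not values from the guide):
model = Model().eval()
dummy_input = torch.randn(1, 3, 256, 256)  # N x C x H x W
torch.onnx.export(model, dummy_input, 'model.onnx',
                  input_names=['input'], output_names=['output'],
                  opset_version=11)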
I'm not going to include all the code from the pytorch model because it's large, but the beginning of base_model.py is
import os
import torch
from collections import OrderedDict
from abc import ABC, abstractmethod
from . import networks
class BaseModel(ABC):
    """This class is an abstract base class (ABC) for models.

    To create a subclass, you need to implement the following five functions:
        -- <__init__>: initialize the class; first call BaseModel.__init__(self, opt).
        -- <set_input>: unpack data from dataset and apply preprocessing.
        -- <forward>: produce intermediate results.
        -- <optimize_parameters>: calculate losses, gradients, and update network weights.
        -- <modify_commandline_options>: (optionally) add model-specific options and set default options.
    """

    def __init__(self, opt):
        """Initialize the BaseModel class.

        Parameters:
            opt (Option class) -- stores all the experiment flags; needs to be a subclass of BaseOptions

        When creating your custom class, you need to implement your own initialization.
        In this function, you should first call <BaseModel.__init__(self, opt)>.
        Then, you need to define four lists:
            -- self.loss_names (str list): specify the training losses that you want to plot and save.
            -- self.model_names (str list): define networks used in our training.
            -- self.visual_names (str list): specify the images that you want to display and save.
            -- self.optimizers (optimizer list): define and initialize optimizers. You can define one optimizer for each network. If two networks are updated at the same time, you can use itertools.chain to group them. See cycle_gan_model.py for an example.
        """
        self.opt = opt
        self.gpu_ids = opt.gpu_ids
        self.isTrain = opt.isTrain
        self.device = torch.device('cuda:{}'.format(self.gpu_ids[0])) if self.gpu_ids else torch.device('cpu')  # get device name: CPU or GPU
        self.save_dir = os.path.join(opt.checkpoints_dir, opt.name)  # save all the checkpoints to save_dir
        if opt.preprocess != 'scale_width':  # with [scale_width], input images might have different sizes, which hurts the performance of cudnn.benchmark.
            torch.backends.cudnn.benchmark = True
        self.loss_names = []
        self.model_names = []
        self.visual_names = []
        self.optimizers = []
        self.image_paths = []
        self.metric = 0  # used for learning rate policy 'plateau'
and for pix2pix_model.py
import torch
from .base_model import BaseModel
from . import networks
class Pix2PixModel(BaseModel):
    """This class implements the pix2pix model, for learning a mapping from input images to output images given paired data.

    The model training requires '--dataset_mode aligned' dataset.
    By default, it uses a '--netG unet256' U-Net generator,
    a '--netD basic' discriminator (PatchGAN),
    and a '--gan_mode' vanilla GAN loss (the cross-entropy objective used in the original GAN paper).

    pix2pix paper: https://arxiv.org/pdf/1611.07004.pdf
    """

    @staticmethod
    def modify_commandline_options(parser, is_train=True):
        """Add new dataset-specific options, and rewrite default values for existing options.

        Parameters:
            parser          -- original option parser
            is_train (bool) -- whether training phase or test phase. You can use this flag to add training-specific or test-specific options.

        Returns:
            the modified parser.

        For pix2pix, we do not use image buffer.
        The training objective is: GAN Loss + lambda_L1 * ||G(A)-B||_1
        By default, we use vanilla GAN loss, UNet with batchnorm, and aligned datasets.
        """
        # changing the default values to match the pix2pix paper (https://phillipi.github.io/pix2pix/)
        parser.set_defaults(norm='batch', netG='unet_256', dataset_mode='aligned')
        if is_train:
            parser.set_defaults(pool_size=0, gan_mode='vanilla')
            parser.add_argument('--lambda_L1', type=float, default=100.0, help='weight for L1 loss')
        return parser

    def __init__(self, opt):
        """Initialize the pix2pix class.

        Parameters:
            opt (Option class) -- stores all the experiment flags; needs to be a subclass of BaseOptions
        """
(Also, a sidenote: if you read this and it looks like there's no easy way out, let me know; I know what it's like to see someone getting started go in too deep too early.)
You can definitely have your model inherit from both the base class and torch.nn.Module (Python allows multiple inheritance). However, you should watch out for conflicts if both inherited classes have functions with identical names (I can see at least one: their base class provides an eval function, and so does nn.Module).
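For illustration, a minimal sketch of the multiple-inheritance route (Pix2PixModule is a made-up name; which eval you get is decided by Python's method resolution order, i.e. by the order of the parent classes):
import torch.nn as nn
from models.base_model import BaseModel  # path in the pix2pix repo

class Pix2PixModule(BaseModel, nn.Module):
    def __init__(self, opt):
        nn.Module.__init__(self)       # initialize each parent explicitly,
        BaseModel.__init__(self, opt)  # since their __init__ signatures differ

    def forward(self, x):
        ...  # reuse the generator's forward pass here

# With this parent order, instance.eval resolves to BaseModel.eval rather than
# nn.Module.eval, which may not be what torch.onnx.export expects.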
However, since you don't need CycleGAN, and a lot of the code is compatibility with their training environment, you'd probably be better off just re-implementing pix2pix. Just steal the code, have it inherit from nn.Module, copy-paste the useful/mandatory functions from the base class, and have everything translated into clean pytorch code. You already have the forward function (which is the only requirement for a pytorch module).
All the subnetworks they use (like the resnet blocks) seem to inherit from nn.Module already, so there is nothing to change there (double-check that, though).
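A minimal sketch of that route, wrapping just the generator (the define_G arguments are my assumptions based on the repo's defaults; adjust them to match your checkpoint):
import torch
import torch.nn as nn
from models import networks  # from the pix2pix repo

class Pix2PixGenerator(nn.Module):
    def __init__(self, weights_path):
        super().__init__()
        # build the U-Net generator the same way the repo does
        self.netG = networks.define_G(3, 3, 64, 'unet_256', norm='batch')
        self.netG.load_state_dict(torch.load(weights_path, map_location='cpu'))

    def forward(self, x):
        return self.netG(x)

# Being a plain nn.Module, this exports like the tutorial's model:
# torch.onnx.export(Pix2PixGenerator('latest_net_G.pth').eval(),
#                   torch.randn(1, 3, 256, 256), 'pix2pix.onnx')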
I would like to use the DeepQLearning.jl package from https://github.com/JuliaPOMDP/DeepQLearning.jl. In order to do so, we have to do something similar to
using DeepQLearning
using POMDPs
using Flux
using POMDPModels
using POMDPSimulators
using POMDPPolicies
# load MDP model from POMDPModels or define your own!
mdp = SimpleGridWorld();
# Define the Q network (see Flux.jl documentation)
# the gridworld state is represented by a 2 dimensional vector.
model = Chain(Dense(2, 32), Dense(32, length(actions(mdp))))
exploration = EpsGreedyPolicy(mdp, LinearDecaySchedule(start=1.0, stop=0.01, steps=10000/2))
solver = DeepQLearningSolver(qnetwork=model, max_steps=10000,
                             exploration_policy=exploration,
                             learning_rate=0.005, log_freq=500,
                             recurrence=false, double_q=true, dueling=true, prioritized_replay=true)
policy = solve(solver, mdp)
sim = RolloutSimulator(max_steps=30)
r_tot = simulate(sim, mdp, policy)
println("Total discounted reward for 1 simulation: $r_tot")
In the line mdp = SimpleGridWorld(), we create the MDP. When I was trying to create my own MDP, I ran into the problem of a very large state space. A state in my MDP is a vector in {1,2,...,m}^n for some m and n. So, when defining the function POMDPs.states(mdp::myMDP), I realized that I would have to iterate over all the states, of which there are very many, i.e., m^n.
Am I using the package in the wrong way? Or must we iterate over the states even if there are exponentially many? If the latter, then what is the point of using deep Q-learning? I thought deep Q-learning was supposed to help when the action and state spaces are very large.
DeepQLearning does not require you to enumerate the state space and can handle continuous-space problems.
DeepQLearning.jl only uses the generative interface of POMDPs.jl. As such, you do not need to implement the states function, just gen and initialstate (see the link on how to implement the generative interface).
However, due to the discrete-action nature of DQN, you also need POMDPs.actions(mdp::YourMDP), which should return an iterator over the action space.
By making those modifications to your implementation you should be able to use the solver.
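To give an idea, a minimal sketch of such an implementation (MyMDP and its transition/reward are made up for illustration; Deterministic comes from POMDPModelTools):
using POMDPs
using POMDPModelTools # for Deterministic
using Random

# state: a vector in {1,...,m}^n; no states() enumeration is needed
struct MyMDP <: MDP{Vector{Int}, Int}
    m::Int
    n::Int
end

POMDPs.actions(mdp::MyMDP) = 1:mdp.n  # discrete, enumerable action space
POMDPs.discount(mdp::MyMDP) = 0.95
POMDPs.initialstate(mdp::MyMDP) = Deterministic(ones(Int, mdp.n))

# generative model: sample a next state and reward for (s, a)
function POMDPs.gen(mdp::MyMDP, s, a, rng::AbstractRNG)
    sp = copy(s)
    sp[a] = rand(rng, 1:mdp.m)  # toy transition
    return (sp=sp, r=-1.0)      # toy reward
end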
The neural network in DQN takes as input a vector representation of the state. If your state is an m-dimensional vector, the neural network input will be of size m. The output size of the network is equal to the number of actions in your model.
In the case of the grid world example, the input size of the Flux model is 2 (the x, y position) and the output size is length(actions(mdp)) = 4.
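Concretely, for an n-dimensional state like the one described in the question, the Q-network could look like this (a sketch; the hidden width of 64 is arbitrary):
using Flux
# input layer: one unit per state dimension; output layer: one unit per action
model = Chain(Dense(n, 64, relu), Dense(64, length(actions(mdp))))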
sklearn contains implementations of different feature selection methods (filter/wrapper/embedded).
All those methods are designed for static systems.
Does sklearn support feature selection on dynamic data? (data which varies with time)
With dynamic data, we need to improve the efficiency of feature selection in order to be more effective.
I found some methods in IEEE papers (incremental approaches for feature selection),
so is there any implementation in sklearn or another open-source library?
Couldn't you just re-run your process on a scheduled basis and load your data dynamically? I wouldn't expect the dependent variable to change at all, but I suppose the independent variables could change somewhat.
# 1) load your dataframe

# 2) copy your target variable into a new dataframe
y = df[['SeriousDlqin2yrs']]

# 3) drop your target variable
x = df[df.columns[df.columns != 'SeriousDlqin2yrs']]
Finally, run this.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

features = np.array(x.columns)  # feature names, used to label the plot
clf = RandomForestClassifier()
clf.fit(x, y.values.ravel())

# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
padding = np.arange(len(features)) + 0.5

plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()
I just tried that, and got this result.
If you expect to get non-numeric features, you will need to use one-hot encoding to handle them:
import pandas as pd
pd.get_dummies(df)
http://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example
I am trying to cluster sentence embeddings based on a GloVe model from text2vec. I generated the embeddings using the GloVe model like so (I create the iterator, vocab, etc. in the standard way):
# create document term matrix
dtm = create_dtm(it, vectorizer)
# assign the word embeddings
common_terms = intersect(colnames(dtm), rownames(word_vectors) )
# normalise
dtm_averaged <- text2vec::normalize(dtm[, common_terms], "l1")
# compute average sentence embeddings
sentence_vectors = dtm_averaged %*% word_vectors[common_terms, ]
The resulting object is of class dgeMatrix, which as I understand is equivalent to the base matrix class. The dgeMatrix class isn't used by many downstream tasks, so I would like to convert the matrix. The object, however, is 6 GB large, and I have problems converting it to a data frame or even a text file for further processing.
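(For reference, a minimal sketch of the direct coercion route; note that at 6 GB this roughly doubles peak memory, which is part of the problem:)
library(Matrix)
# coerce the dense Matrix-package object to a base R matrix
m <- as.matrix(sentence_vectors)
# or write it out rather than keeping an intermediate data frame around
data.table::fwrite(as.data.frame(m), "sentence_vectors.csv")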
Ideally, I'd use this matrix in Spark for further analysis, such as k-means clustering. My question is: what would be the best strategy to use the matrix for downstream tasks?
a) Convert to matrix class or data frame
b) write the matrix to file?
c) something completely different
I run the models on Google Cloud on a machine with 32 GB of RAM and 28 CPUs.
Thanks for your help.