How to implement an efficient structure like GRU in PyTorch? - machine-learning

I have a model A which is pretrained. This model A will take x_{t-1} and p_{t} to predict x_{t}. Since it is actually a PyTorch neural network it is differentiable.
What I want to do is roll this model along P = {p_i} (a sequence predicted by the last layer) and predict X = {x_i}.
However, to predict x_{t} this model needs the output of the previous frame, x_{t-1}. Thus, I implemented this structure with a for loop, which makes the process quite slow…
Is there a way to accelerate the process?
This structure is quite similar to a GRU or LSTM, but a GRU is much faster. How can I improve mine?
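For reference, a minimal sketch of the sequential rollout described above, assuming model A is a callable PyTorch module (model_A, the initial frame x0, and the shapes are placeholders); the loop is inherently sequential because each step consumes the previous output:
import torch

def rollout(model_A, x0, P):
    # P: tensor of shape (T, p_dim); x0: initial frame x_{0}
    xs = []
    x_prev = x0
    for p_t in P:                      # sequential: step t needs x_{t-1}
        x_prev = model_A(x_prev, p_t)  # predict x_t from x_{t-1} and p_t
        xs.append(x_prev)
    return torch.stack(xs)             # X = {x_i}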

Related

Can logistic and linear regression produce a prediction on a scale?

I currently have a dataset of drawings, each drawing being represented by some features. Each feature (independent variable) is a continuous number. None of the drawings have a label as of yet, which is why I am planning to start a sort of questionnaire with people. However, before I can correctly set up such a questionnaire, I should have an idea of what kind of labels I should use for my training data.
At first thought, I was thinking about letting people rate the drawings on a scale, for example from 1 to 5 with 1 being bad, 3 being average and 5 being good. Alternatively, I could also reduce the question to a simple good or bad question. The latter would mean I lose some valuable information, but the dependent variable could then be considered 'binary'.
Using the training data I then composed, I would need a machine learning algorithm (model) which, given a drawing, predicts if the drawing is good or not. Ideally, I would have some way of tuning the strictness of this prediction. For example, the model could, instead of simply predicting 'good' or 'bad', predict the likelihood of a painting being good on a scale of 0 to 1. I could then say "Well, let's say all paintings which are 70% likely to be good are considered good". Another example would be that the model predicts the goodness using the same categorical values the people used to rate the drawings initially. So it would predict the drawing as being a 1, 2, 3, 4 or 5. Similar to my first example, I could then say "Well, all paintings which are rated at least a 4 are considered good paintings" and tune this threshold to my liking.
After doing some research, I came up with logistic and linear regression as good candidates. However, which of the two would be best for my scenario? Equally important, how would I need to format my labels: just simple 0s and 1s, or a scale?
You could use a one-vs-all representation if you wanted to do multi-class categorical classification:
Essentially, you train one classifier for every category you have (if you have 10 categories, you have 10 classifiers), and each classifier is trained only to predict whether or not an example belongs to its specific category.
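A rough illustration of this one-vs-all setup using scikit-learn (the drawing features and ratings below are random placeholders):
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
import numpy as np

X = np.random.rand(100, 5)              # placeholder drawing features
y = np.random.randint(1, 6, size=100)   # placeholder ratings on a 1-5 scale

# One binary logistic regression per rating category
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, y)
print(ovr.predict_proba(X[:3]))         # per-category probabilities for the first 3 drawings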
There are alternative ways to make multi-class logistic regression work that only require training a single model, such as using categorical cross-entropy, but given that you have ordinal data, a linear regression used as a regression model is likely a better fit. You'd predict a value between 1 and 10 and then just round to the nearest integer. This way you aren't penalizing close guesses as much as far-off guesses.
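And a sketch of the regression-plus-rounding alternative on the question's 1-5 rating scale (again with placeholder data):
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.random.rand(100, 5)              # placeholder drawing features
y = np.random.randint(1, 6, size=100)   # placeholder ordinal ratings 1-5

reg = LinearRegression().fit(X, y)
# Predict a continuous score, then round and clip back onto the rating scale
pred = np.clip(np.rint(reg.predict(X)), 1, 5).astype(int)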
What keeps you from using a logistic regression model? For lack of a better dataset, I used the standard diabetes data; the target is an integer-valued disease-progression score. I normalised the target to [0, 1] so that I can use a sigmoid as the output activation (a sigmoid cannot produce values outside that range). For the loss I decided to use mean squared error:
import tensorflow as tf
from sklearn import datasets

# Standard diabetes regression data
diabetes = datasets.load_diabetes()
x_train = diabetes.data
# Scale the target to [0, 1] so that the sigmoid output can cover its full range
y_train = (diabetes.target - diabetes.target.min()) / (diabetes.target.max() - diabetes.target.min())

# A single dense unit with a sigmoid activation: essentially a logistic-regression-style
# model used for a bounded regression target
inputs = tf.keras.Input(shape=(x_train.shape[1],))
outputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(inputs)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.MSE,
              metrics=['mae'])  # mean absolute error is a more meaningful metric here than accuracy

history = model.fit(x_train, y_train,
                    batch_size=64,
                    epochs=300,
                    validation_data=(x_train, y_train))  # validates on the training data, as in the original
You could also use a linear regression model; you would only need to replace the sigmoid activation with a linear one. However, I think the squashing behaviour of the sigmoid is an advantage, besides ensuring that no prediction is larger than 1 or smaller than 0.
A last alternative would be to train on pairwise preferences. The idea is to show a person two drawings and ask which one they like more, and then build a binary model on the pairs, e.g. a logistic regression. This approach appears preferable to me, as the question is easier for the human to answer.
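A minimal sketch of this pairwise-preference idea (the pairs and features are placeholders; one common trick, assumed here, is to train a logistic regression on feature differences of each pair):
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.random.rand(100, 5)                        # placeholder drawing features
# Placeholder comparisons: pairs of drawing indices and a label
# that is 1 if the first drawing of the pair was preferred
pairs = np.random.randint(0, 100, size=(500, 2))
prefer_first = np.random.randint(0, 2, size=500)

# Represent each comparison by the feature difference of the two drawings
X_diff = X[pairs[:, 0]] - X[pairs[:, 1]]
clf = LogisticRegression().fit(X_diff, prefer_first)

# The learned weights induce a score (and thus a ranking) over all drawings
scores = X @ clf.coef_.ravel()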

Classification with Keras, unbalanced classes

I have a binary classification problem I'm trying to tackle in Keras. To start, I was following the usual MNIST example, using softmax as the activation function in my output layer.
However, in my problem, the two classes are highly unbalanced (one appears ~10 times more often than the other). What is even more critical, they are not symmetric in the way they may be mistaken.
Mistaking an A for a B is way less severe than mistaking a B for an A. Just like a caveman trying to classify animals into pets and predators: mistaking a pet for a predator is no big deal, but the other way round will be lethal.
So my question is: how would I model something like this with Keras?
thanks a lot
A non-exhaustive list of things you could do:
Generate a balanced data set using data augmentation. If the data are images, you can add image augmentations in a custom data generator that outputs balanced amounts of data from each class per batch, and save the results as a new data set. If the data are tabular, you can use a library like imbalanced-learn to perform over-/under-sampling (see the sketch after this list).
As #Daniel said, you can use class_weight during training (in the fit method) so that mistakes on the important class are penalized more. See this tutorial: Classification on imbalanced data. The same idea can be implemented with a custom loss function, with or without class_weight. Both options are sketched below.
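A rough sketch of both options (the data, the model, and the weight of 10 are placeholders; imbalanced-learn's RandomOverSampler and Keras' class_weight argument are the relevant tools):
import numpy as np
import tensorflow as tf
from imblearn.over_sampling import RandomOverSampler

X = np.random.rand(1000, 20)                     # placeholder features
y = (np.random.rand(1000) < 0.1).astype(int)     # placeholder labels, ~10:1 imbalance

# Option 1: oversample the minority class to get a balanced training set
X_bal, y_bal = RandomOverSampler().fit_resample(X, y)

# Option 2: keep the data as-is but penalize mistakes on the rare/important class more
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X, y, epochs=5, batch_size=32,
          class_weight={0: 1.0, 1: 10.0})         # mistakes on class 1 cost 10x more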

Function of batch in TensorFlow?

I am new to TensorFlow and machine learning and came across the concept of a batch.
What is the purpose of splitting the dataset into batches, and how does TensorFlow perform an optimization task on the variables using different subsets?
You are confusing a few things, as far as I understand.
First, you need to split the dataset into two (or more) distinct sets: one set that you train your system on, and another that is used to test your model.
These are ML basics and you can easily find more on the internet; look for "cross-validation" or "train, validation, test sets".
A batch is something that is usually important for neural networks (NNs). You are not using one example at each training step (the algorithm would then be called Stochastic Gradient Descent), nor every example at each training step (that would be Batch Gradient Descent). Usually, it is best to train NNs using mini-batches (Mini-batch Gradient Descent).
It is a trade-off in optimization between accuracy and training speed.
TensorFlow is just a library for NNs. You can easily find how sets and batches are split in many tutorials. Remember to learn the basic concepts first, for example in this great class:
https://www.coursera.org/specializations/deep-learning
The purpose of splitting the DataSet into batches is typically to speed up the learning. Instead of processing the entire set of training examples, it processes only a batch, which is just a small subset of the entire training set, at a time. There are various techniques about how to formulate such batches and process them to get the final trained model.
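As a concrete illustration, a minimal tf.data sketch of splitting a training set into shuffled mini-batches (the arrays are placeholders):
import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 10).astype('float32')   # placeholder features
y = np.random.randint(0, 2, size=1000)           # placeholder labels

# Shuffle the examples and yield mini-batches of 32;
# the optimizer then takes one gradient step per batch
dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1000).batch(32)
for x_batch, y_batch in dataset:
    pass  # one optimization step would be performed here per batch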

Calling "fit" multiple times in Keras

I've been working on a CNN over several hundred GB of images. I've created a training function that bites off 4 GB chunks of these images and calls fit over each of these pieces. I'm worried that I'm only training on the last piece and not the entire dataset.
Effectively, my pseudo-code looks like this:
DS = lazy_load_400GB_Dataset()
for section in DS:
    X_train = section.images
    Y_train = section.classes
    model.fit(X_train, Y_train, batch_size=16, nb_epoch=30)
I know that the API and the Keras forums say that this will train over the entire dataset, but I can't intuitively understand why the network wouldn't end up learning only from the last training chunk.
Some help understanding this would be much appreciated.
Best,
Joe
This question was raised at the Keras github repository in Issue #4446: Quick Question: can a model be fit for multiple times? It was closed by François Chollet with the following statement:
Yes, successive calls to fit will incrementally train the model.
So, yes, you can call fit multiple times.
For datasets that do not fit into memory, there is an answer in the Keras Documentation FAQ section
You can do batch training using model.train_on_batch(X, y) and
model.test_on_batch(X, y). See the models documentation.
Alternatively, you can write a generator that yields batches of
training data and use the method model.fit_generator(data_generator, samples_per_epoch, nb_epoch).
You can see batch training in action in our CIFAR10 example.
So if you want to iterate your dataset the way you are doing, you should probably use model.train_on_batch and take care of the batch sizes and iteration yourself.
One more thing to note: you should make sure the order of the samples you train your model on is shuffled after each epoch. The way you have written the example code, the dataset does not seem to be shuffled. You can read a bit more about shuffling here and here
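A rough sketch of the manual loop this suggests, reusing DS and model from the question's pseudo-code (the chunk layout and slicing are placeholders); the outer loop runs the epochs, and the samples within each chunk are shuffled every epoch:
import numpy as np

n_epochs = 30
batch_size = 16

for epoch in range(n_epochs):
    for section in DS:                               # chunks as in the question
        X_train, Y_train = section.images, section.classes
        idx = np.random.permutation(len(X_train))    # reshuffle within the chunk each epoch
        for start in range(0, len(X_train), batch_size):
            batch = idx[start:start + batch_size]
            model.train_on_batch(X_train[batch], Y_train[batch])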

How to use stacked autoencoders for pretraining

Let's say I wish to use stacked autoencoders as a pretraining step.
Let's say my full autoencoder is 40-30-10-30-40.
My steps are:
1. Train a 40-30-40 autoencoder using the original 40-feature data set as both input and output.
2. Using only the trained encoder part of the above, i.e. the 40-30 encoder, derive a new 30-feature representation of the original 40 features.
3. Train a 30-10-30 autoencoder using the new 30-feature data set (derived in step 2) as both input and output.
4. Take the trained encoder from step 1 (40-30) and feed it into the encoder from step 3 (30-10), giving a 40-30-10 encoder.
5. Take the 40-30-10 encoder from step 4 and use it as the input to the NN.
a) Is that correct?
b) Do I freeze the weights in the 40-30-10 encoder when training the NN? That would be the same as pre-generating the 10-feature representation from the original 40-feature data set and training on that new representation.
PS. I already have a question out asking about whether I need to tie the weights of the encoder and decoder
a) Is that correct?
This is one of the typical approaches. You could also try to fit the autoencoder directly, as a "raw" autoencoder with that many layers should be possible to fit right away. As an alternative, you might consider fitting stacked denoising autoencoders instead, which may benefit more from "stacked" training.
b) Do I freeze the weights in the 40-30-10 encoder when training the NN? That would be the same as pre-generating the 10-feature representation from the original 40-feature data set and training on that new representation.
When you train the whole NN you do not freeze anything. Pretraining is only a kind of preconditioning for the optimization process: it shows the method where to start, but you do not want to limit the fitting procedure of the actual supervised learning.
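A minimal tf.keras sketch of this scheme, using the 40-30-40 and 30-10-30 sizes from the question; the data, labels, and the final supervised head are placeholders, and nothing is frozen in the last fit:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

X = np.random.rand(1000, 40).astype("float32")   # placeholder data
y = np.random.randint(0, 2, size=(1000,))        # placeholder labels

# Step 1: train a 40-30-40 autoencoder on the raw features
inp = layers.Input(shape=(40,))
h1 = layers.Dense(30, activation="relu")(inp)
out = layers.Dense(40, activation="linear")(h1)
ae1 = Model(inp, out)
ae1.compile(optimizer="adam", loss="mse")
ae1.fit(X, X, epochs=10, batch_size=32, verbose=0)

# Step 2: use the 40-30 encoder to derive the 30-feature representation
enc1 = Model(inp, h1)
X30 = enc1.predict(X)

# Step 3: train a 30-10-30 autoencoder on that representation
inp2 = layers.Input(shape=(30,))
h2 = layers.Dense(10, activation="relu")(inp2)
out2 = layers.Dense(30, activation="linear")(h2)
ae2 = Model(inp2, out2)
ae2.compile(optimizer="adam", loss="mse")
ae2.fit(X30, X30, epochs=10, batch_size=32, verbose=0)

# Steps 4-5: stack the two encoders into a 40-30-10 encoder, add a supervised
# head, and fine-tune everything end to end (no weights are frozen)
enc2 = Model(inp2, h2)
stacked_in = layers.Input(shape=(40,))
code = enc2(enc1(stacked_in))
pred = layers.Dense(1, activation="sigmoid")(code)
clf = Model(stacked_in, pred)
clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
clf.fit(X, y, epochs=10, batch_size=32, verbose=0)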
PS. I already have a question out asking about whether I need to tie the weights of the encoder and decoder
No, you do not have to tie the weights, especially since you throw away the decoder anyway. Tying the weights is important for some more probabilistic models in order to make the minimization procedure possible (as in the case of RBMs), but for an autoencoder there is no point.
