If I understood correctly, an RNN allows predicting the next value in a sequence by considering the last values. Say I want to predict the next value of the function cos(x) and I have a dataset with the results for x in range(0, 1000). First I feed the model with cos(0) to predict cos(1), then cos(1) to predict cos(2), etc. At each step the weights are tuned, and the model keeps a memory of the last values to make the next prediction.
In my case I want to train a model to predict the quality of videos. For this I have a dataset of annotated videos. For each video, for each frame, I compute a set of 36 features which are not spatially related, so the shape of the inputs is (nb_videos, nb_frames, 36). For each video I have a single score representing the global video quality, so the shape of the labels is (nb_videos, 1).
I don't know which kind of NN I can use. n_frames x 36 is, I think, far too big for a simple multi-layer perceptron. The features may make sense along the time axis but not along the feature axis, so unless I train 36 models with 1D convolutions, a CNN seems useless. Finally, the features come as a sequence, but the problem with an RNN is that it seems to need a score for each element of the sequence, and the model only works to predict the next values in that particular sequence.
My idea is to have one RNN model which is trained on any video. I feed the RNN n_frames times with the 36 features in the correct order, and only after these n_frames iterations does the model give a prediction. This prediction is then used to tune the weights. We repeat this for a number of epochs, each time with a video picked randomly from the dataset.
Does it make sense?
Does something similar exist?
I don't think you're making an unconventional use of an RNN/LSTM, and your idea makes sense. If I understood it correctly, it involves using a many-to-one RNN:
[Many-to-one RNN diagram. Source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/]
where the input at each timestep corresponds to one frame with 36 features, and the output at the last timestep conveys information about the whole video. In Keras, this could be something along the lines of:
from keras.models import Sequential
from keras.layers import LSTM, Dense

nb_frames = 10  # example value; in your case, the number of frames per video

model = Sequential()
# many-to-one: the LSTM reads all nb_frames timesteps of 36 features and
# returns only its final output (return_sequences=False is the default)
model.add(LSTM(20, input_shape=(nb_frames, 36)))
# a single unit producing the quality score for the whole video
model.add(Dense(1, activation='relu'))
model.compile('rmsprop', 'mse')
model.summary()
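To train it, you would then call fit with arrays shaped like your dataset. A minimal sketch with dummy data, just to illustrate the shapes (nb_videos and the random arrays are placeholders):

import numpy as np

nb_videos = 100  # placeholder
x = np.random.rand(nb_videos, nb_frames, 36)  # 36 features per frame
y = np.random.rand(nb_videos, 1)              # one quality score per video
model.fit(x, y, epochs=10, batch_size=16)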
Many-to-one RNNs are very common, so you wouldn't be making an unconventional use of them.
Related
I currently have a dataset of drawings, each drawing being represented by some features. Each feature (independent variable) is a continuous number. None of the drawings have a label as of yet, which is why I am planning to run a sort of questionnaire with people. However, before I can correctly set up such a questionnaire, I should have an idea of what kind of labels I should use for my training data.
At first, I was thinking about letting people rate the drawings on a scale, for example from 1 to 5, with 1 being bad, 3 being average and 5 being good. Alternatively, I could reduce the question to a simple good-or-bad question. The latter would mean I lose some valuable information, but the dependent variable could then be considered binary.
Using the training data I then composed, I would need to have a machine learning algorithm (model) which, given a drawing, predicts if the drawing is good or not. Ideally, I would have some way of tuning the strictness of this prediction. For example, the model could, instead of simply predicting 'good' or 'bad', predict the likelihood of a painting being good on a scale of 0 to 1. I could then say "Well, let's say all paintings which are 70% likely to be good are considered good". Another example would be that the model predicts the goodness using the same categorical values the people used to rate the drawing initially. So it would predict the drawing being a 1, 2, 3, 4 or 5. Similar to my first example, I could then say "Well, all paintings which are rated at least a 4 are considered good paintings" and tune this threshold to my liking.
After doing some research, I came up with logistic and linear regression as good candidates. However, which of the two would be best for my scenario? Equally important, how would I need to format my labels? Just simple 0s and 1s, or a scale?
You could use a one-vs-all representation if you wanted to use multi-class categorical classification:
Essentially, you train one classifier for every category you have (e.g., with 10 categories you train 10 classifiers), and each classifier is trained to predict whether or not an input belongs to its specific class.
There are alternative ways to make multi-class logistic regression work that only require training a single model, such as using categorical cross-entropy, but given that your labels are ordinal, a linear regression used as a regression model is likely more suitable. You'd predict a value on the rating scale and then round to the nearest integer. This way you aren't penalizing close guesses as much as far-away guesses.
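As a minimal sketch of that regression-and-round approach (all data here is synthetic, and the 1-to-5 scale mirrors the rating example above):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))        # placeholder features for 200 drawings
y = rng.integers(1, 6, size=200)     # placeholder ratings on the 1-to-5 scale

reg = LinearRegression().fit(X, y)
pred = np.clip(np.rint(reg.predict(X)), 1, 5)  # round and clamp to the scale
good = pred >= 4  # "at least a 4 counts as good"; tune this threshold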
What keeps you from using a logistic regression model? Due to a lack of a better dataset I used the standard diabetes data. The target variable is an integer between 50 and 200. I normalised the target to [0, 1] so that I can use sigmoid as the activation function (a sigmoid outputs values in (0, 1), so the targets must live in that range). For the loss I used MSE:
import tensorflow as tf
from sklearn import datasets

diabetes = datasets.load_diabetes()
x_train = diabetes.data
# scale the target to [0, 1] to match the range of the sigmoid output
y_train = (diabetes.target - min(diabetes.target)) / (max(diabetes.target) - min(diabetes.target))

inputs = tf.keras.Input(shape=(x_train.shape[1],))
outputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(inputs)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.MSE,
              metrics=['mse'])  # a regression metric; categorical accuracy does not apply here
history = model.fit(x_train, y_train,
                    batch_size=64,
                    epochs=300,
                    # no separate set in this toy example; use a real split in practice
                    validation_data=(x_train, y_train))
You could also use a linear regression model; there you only need to replace the activation function by a linear one. However, I think the squashing character of the sigmoid is useful here, besides ensuring that there is no rating larger than 1 or smaller than 0.
A last alternative would be to train on pair-wise preferences. The idea is to show a human two drawings and ask which one they like more, then build a binary model on these pairs, e.g., a logistic regression. This approach appears preferable to me as it is easier for the human to answer; a sketch follows below.
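A minimal sketch of the pair-wise idea, using the common simplification of classifying feature differences (all data here is synthetic):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))           # placeholder drawing features

# for each judged pair (i, j), the input is the feature difference and the
# label is 1 if the human preferred drawing i over drawing j
i = rng.integers(0, 200, size=500)
j = rng.integers(0, 200, size=500)
prefs = rng.integers(0, 2, size=500)    # placeholder human judgements

clf = LogisticRegression().fit(X[i] - X[j], prefs)
scores = X @ clf.coef_.ravel()          # a learned "goodness" score per drawing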
Is it possible to quantify the importance of variables in figuring out the probability of an observation falling into one class? Something similar to logistic regression.
For example:
If I have the following independent variables
1) Number of cats the person has
2) Number of dogs a person has
3) Number of chickens a person has
With my dependent variable being: Whether a person is a part of PETA or not
Is it possible to say something like "if the person adopts one more cat in addition to his existing animals, his probability of being a part of PETA increases by 0.12"?
I am currently using the following methodology to approach this particular scenario:
1) Build a random forest model using the training data
2) Predict the customer's probability of falling in one particular class (PETA vs non-PETA)
3) Artificially increase the number of cats owned by each observation by 1
4) Predict the customer's new probability to fall in one of the two classes
5) The average change between (4)'s probability and (2)'s probability is the average increase in a person's probability if he has adopted a cat.
Does this make sense? Is there any flaw in the methodology that I haven't thought of? Is there a better way of doing the same?
If you're using scikit-learn, you can easily do this by accessing the feature_importances_ property of the fitted RandomForestClassifier. According to the scikit-learn documentation:
The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features. By averaging those expected activity rates over several randomized trees one can reduce the variance of such an estimate and use it for feature selection.
The feature_importances_ property stores each feature's importance score averaged over the trees (in scikit-learn this is computed as the normalized mean decrease in impurity).
Here's an example. Let's start by importing the necessary libraries.
# using this for some array manipulations
import numpy as np
# of course we're going to plot stuff!
import matplotlib.pyplot as plt
# dummy iris dataset
from sklearn.datasets import load_iris
# random forest classifier
from sklearn.ensemble import RandomForestClassifier
Once we have these, we're going to load the dummy dataset, define a classification model and fit the data to the model.
data = load_iris()
# we're gonna use 100 trees
forest = RandomForestClassifier(n_estimators = 100)
# fit data to model by passing features and labels
forest.fit(data.data, data.target)
Now we can use the feature_importances_ property to get a score for each feature, based on how well it helps classify the data into the different targets.
# find importances of each feature
importances = forest.feature_importances_
# find the standard dev of each feature's importance across trees to assess the spread
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
# find sorting indices of importances (descending)
indices = np.argsort(importances)[::-1]
# print the feature ranking
print("Feature ranking:")
for f in range(data.data.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
Feature ranking:
1. feature 2 (0.441183)
2. feature 3 (0.416197)
3. feature 0 (0.112287)
4. feature 1 (0.030334)
Now we can plot the importance of each feature as a bar graph and decide if it's worth keeping them all. We also plot the error bars to assess the significance.
plt.figure()
plt.title("Feature importances")
plt.bar(range(data.data.shape[1]), importances[indices],
        color="b", yerr=std[indices], align="center")
plt.xticks(range(data.data.shape[1]), indices)
plt.xlim([-1, data.data.shape[1]])
plt.show()
[Bar graph of feature importances]
I apologize; I didn't catch the part where you mention what kind of statement you're trying to make. I'm assuming your response variable is either 1 or 0. You could try something like this:
1) Fit a linear regression model over the data. This won't give you the most accurate fit, but it will be robust enough to get the information you're looking for.
2) Find the response of the model with the original inputs (it most likely won't be exact ones or zeros).
3) Artificially change the inputs, and find the difference in the outputs between the original and modified data, as you suggested in your question.
Try it out with a logistic regression as well. It really depends on your data and how it is distributed which kind of regression will work best. You definitely have to use a regression to find the change in probability with a change in input.
You can even try a single-layer neural network with a regression/linear output layer to do the same thing; add layers or non-linear activation functions if the relationship looks non-linear. A sketch of the perturbation approach follows below.
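A minimal sketch of that perturbation idea with a logistic regression (all data here is synthetic, and the cat/dog/chicken columns are just placeholders):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.poisson(2, size=(500, 3))   # columns: cats, dogs, chickens (synthetic counts)
y = rng.integers(0, 2, size=500)    # 1 = PETA member (synthetic labels)

clf = LogisticRegression().fit(X, y)

base = clf.predict_proba(X)[:, 1]   # original membership probabilities
X_plus = X.copy()
X_plus[:, 0] += 1                   # everyone adopts one more cat
bumped = clf.predict_proba(X_plus)[:, 1]

print("average change in P(PETA):", (bumped - base).mean())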
Cheers!
Suppose I want to use a multilayer perceptron to classify 3 classes. When it comes to the number of output neurons, anybody would instantly say: use 3 output neurons with softmax activation. But what if I use 2 output neurons with sigmoid activations to output [0,0] for class 1, [0,1] for class 2 and [1,0] for class 3? Basically getting a binary-encoded output, with each bit being output by one output neuron. Wouldn't this technique decrease the number of output neurons (and hence the number of parameters) by a lot? A 100-class word classification for a simple NLP application would require 100 output neurons for softmax, whereas you can cover it with 7 output neurons with the above technique. One disadvantage is that you won't get the probability scores for all the classes. My question is: is this approach correct? If so, would you consider it to be more efficient than softmax for datasets with a large number of classes?
You could do this, but then you would have to rethink your loss function. The cross-entropy loss used in training a classifier is the negative log-likelihood of a categorical distribution, which assumes you have a probability associated with every class. That loss requires 3 output probabilities, and you only have 2 output values.
However, there are ways to do it anyway: you could use a binary cross-entropy loss on each element of your output, but this would be a different probabilistic assumption about your model. You'd be assuming that your classes have some shared characteristics, since, e.g., [0,0] and [0,1] share a value in the first bit. The decreased degrees of freedom will probably give you marginally worse performance (though other parts of the MLP may pick up the slack).
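A minimal Keras sketch of the binary-coded alternative (the input size and layer widths are arbitrary placeholders):

import numpy as np
from tensorflow import keras

# 3 classes encoded on 2 sigmoid outputs: class 1 -> [0,0], class 2 -> [0,1], class 3 -> [1,0]
codes = np.array([[0, 0], [0, 1], [1, 0]], dtype="float32")

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(8,)),
    keras.layers.Dense(2, activation="sigmoid"),  # one sigmoid per code bit
])
# binary cross-entropy applied element-wise to each output bit
model.compile("adam", "binary_crossentropy")

x = np.random.rand(100, 8)
labels = np.random.randint(0, 3, size=100)
model.fit(x, codes[labels], epochs=5)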
If you're really worried about the parameter cost of the final layer, then you might be better off not training it at all. This paper shows that a fixed Hadamard matrix on the final layer is as good as training it.
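As a rough sketch of that fixed-final-layer idea (the dimensions are placeholders and the paper's exact setup may differ; scipy.linalg.hadamard requires a power-of-2 size):

import numpy as np
from scipy.linalg import hadamard
from tensorflow import keras

n_hidden, n_classes = 16, 10
H = hadamard(n_hidden)[:, :n_classes].astype("float32")

model = keras.Sequential([
    keras.layers.Dense(n_hidden, activation="relu", input_shape=(8,)),
    # final projection onto fixed Hadamard codes; never trained
    keras.layers.Dense(n_classes, use_bias=False, trainable=False),
    keras.layers.Softmax(),
])
model.layers[1].set_weights([H])  # install the fixed Hadamard weights
model.compile("adam", "sparse_categorical_crossentropy")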
I have a dataset with thousands of sentences belonging to a subject. I would like to know what would be best to create a classifier that will predict a text as "True" or "False" depending on whether it talks about that subject or not.
I've been using solutions with Weka (basic classifiers) and TensorFlow (neural network approaches).
I use Weka's StringToWordVector filter to preprocess the data.
Since there are no negative samples, I deal with a single class. I've tried a one-class classifier (libSVM in Weka), but the number of false positives is so high I cannot use it.
I also tried adding negative samples, but when the text to predict does not fall in the negative space, the classifiers I've tried (NB, CNN, ...) tend to predict it as a false positive. I guess it's because of the sheer amount of positive samples.
I'm open to discarding ML as the tool to predict the new incoming data if necessary.
Thanks for any help.
I have eventually added data for the negative class and built a Multinomial Naive Bayes classifier, which is doing the job as expected.
(the size of the data added is around one million samples :) )
My answer is based on the assumption that adding at least 100 negative samples to the author's dataset of 1000 positive samples is acceptable, since the author has not yet answered my question about it.
Since detecting a specific topic looks like a particular case of topic classification, I would recommend starting with a classification approach using two simple classes: one class for your topic and another for all other topics.
I succeeded with the same approach on a face recognition task: at the beginning I built a model with one output neuron, with a high output level when a face was detected and a low one when it wasn't.
Nevertheless, this approach gave me too low an accuracy: less than 80%.
But when I tried using 2 output neurons, one class for face presence in the image and another for no face detected, it gave me more than 90% accuracy for an MLP, even without using a CNN.
The key point here is using a softmax function for the output layer. It gives a significant increase in accuracy. In my experience, it raised accuracy on the MNIST dataset from 92% up to 97% for the same MLP model.
About the dataset: most classification algorithms with a trainer, at least in my experience, are more efficient with an equal quantity of samples for each class in the training set. In fact, if one class has less than 10% of the average quantity of the other classes, the model becomes almost useless for detecting that class. So if you have 1000 samples for your topic, I suggest creating 1000 negative samples covering as many different topics as possible.
Alternatively, if you don't want to create such a big set of negative samples, you can create a smaller one and use batch training with a batch size of 2x your negative sample quantity. To do so, split your positive samples into n chunks, each roughly the size of your negative sample set, and train your NN on n batches per training iteration, pairing chunk[i] of positive samples with all your negative samples in each batch (a sketch follows below). Just be aware that lower accuracy will be the price for this trade-off.
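A minimal sketch of that chunked batching scheme (all arrays are synthetic placeholders, and a single sigmoid output is used for brevity instead of the two-neuron softmax variant described above):

import numpy as np
from tensorflow import keras

pos = np.random.rand(1000, 300)   # positive samples (e.g., text vectors)
neg = np.random.rand(100, 300)    # much smaller negative set

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(300,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile("adam", "binary_crossentropy")

n = len(pos) // len(neg)          # number of positive chunks
for epoch in range(10):
    for chunk in np.array_split(pos, n):
        x = np.vstack([chunk, neg])  # each batch: one positive chunk + all negatives
        y = np.concatenate([np.ones(len(chunk)), np.zeros(len(neg))])
        model.train_on_batch(x, y)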
Also, you could consider creating a more generic topic detector: figure out all the possible topics that can appear in the texts your model should analyze, for example 10 topics, and create a training dataset with 1000 samples per topic. This can also give higher accuracy.
One more point about the dataset: the best practice is to train your model on only part of the dataset, for example 80%, and use the remaining 20% for cross-validation. Validating on data the model has not seen before gives a good estimate of your model's accuracy in real life, rather than on the training set, and helps you avoid overfitting.
About building the model: I like the "from simple to complex" approach, so I would suggest starting with a simple MLP with a softmax output and a dataset of 1000 positive and 1000 negative samples. After reaching 80%-90% accuracy you can consider using a CNN, and I would also suggest growing the training dataset, because deep learning algorithms are more efficient with bigger datasets.
For text data you can use Spy EM.
The basic idea is to combine your positive set with a whole bunch of random samples, some of which you hold out. You initially treat all the random documents as the negative class, and train a classifier with your positive samples and these negative samples.
Now some of those random samples will actually be positive, and you can conservatively relabel any documents that score higher than the lowest-scoring held-out true positive samples.
Then you iterate this process until it stabilizes.
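A rough sketch of that loop (all data is synthetic; classic Spy EM runs EM with Naive Bayes, while this simplified version just captures the hold-out-and-relabel step with a logistic regression):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
pos = rng.normal(1, 1, size=(500, 20))        # known positive documents (as vectors)
unlabeled = rng.normal(0, 1, size=(2000, 20)) # random/unlabeled pool

# hold out 10% of the positives as "spies" mixed into the unlabeled pool
n_spies = len(pos) // 10
spies, pos_train = pos[:n_spies], pos[n_spies:]
mixed = np.vstack([unlabeled, spies])

labels = np.zeros(len(mixed))                 # initially treat the whole pool as negative
for _ in range(5):                            # iterate until the labels stabilize
    x = np.vstack([pos_train, mixed])
    y = np.concatenate([np.ones(len(pos_train)), labels])
    clf = LogisticRegression(max_iter=1000).fit(x, y)
    threshold = clf.predict_proba(spies)[:, 1].min()   # lowest-scoring spy
    scores = clf.predict_proba(mixed)[:, 1]
    labels = (scores >= threshold).astype(float)       # conservatively relabel as positive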
In many examples, I see train/cross-validation dataset splits being performed by using KFold, StratifiedKFold, or another pre-built dataset splitter. Keras models have a built-in validation_split kwarg that can be used for training.
model.fit(self, x, y, batch_size=32, nb_epoch=10, verbose=1, callbacks=[], validation_split=0.0, validation_data=None, shuffle=True, class_weight=None, sample_weight=None)
(https://keras.io/models/model/)
validation_split: float between 0 and 1: fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch.
I am new to the field and tools, so I only have an intuition about what the different splitters offer. Mainly, though, I can't find any information on how Keras' validation_split works. Can someone explain it to me, and tell me when a separate method is preferable? The built-in kwarg seems to me like the cleanest and easiest way to split off test datasets, without having to architect your training loops much differently.
The difference between the two is quite subtle and they can be used in conjunction.
KFold and similar functions in scikit-learn will randomly split your data into k folds. You can then train models holding out a single fold each time and testing on the held-out fold.
validation_split takes a fraction of your data non-randomly. According to the Keras documentation it will take the fraction from the end of your data, e.g. 0.1 will hold out the final 10% of rows in the input matrix. The purpose of the validation split is to allow you to assess how the model is performing on the training set and a held out set at every epoch in the training period. If the model continues to improve on the training set but not the validation set then it is a clear sign of potential overfitting.
You could theoretically use KFold cross-validation to construct a model while also using validation_split to monitor the performance of each model. At each fold you will be generating a new validation_split from the training data.
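A minimal sketch of combining the two (synthetic data; build_model stands in for whatever compiled Keras model you actually use):

import numpy as np
from sklearn.model_selection import KFold
from tensorflow import keras

def build_model():
    m = keras.Sequential([
        keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    m.compile("adam", "binary_crossentropy", metrics=["accuracy"])
    return m

x = np.random.rand(500, 8)
y = np.random.randint(0, 2, size=500)

for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(x):
    model = build_model()  # fresh model per fold
    # validation_split carves a per-epoch monitoring set out of this fold's training data
    model.fit(x[train_idx], y[train_idx], epochs=10,
              validation_split=0.1, verbose=0)
    print("fold test accuracy:", model.evaluate(x[test_idx], y[test_idx], verbose=0)[1])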