Keras classification model - machine-learning

I need help building a Keras model for classification.
I have:
Input: 167 points of an optical spectrum.
Output: 11 classes of the investigated substance.
However, a single sample can be the spectrum of a mixture of several substances (for example, one containing classes 2, 3, and 4).
I tried to use categorical_crossentropy, but it is suitable only for non-intersecting (mutually exclusive) classes.
KerasDoc:
Note: when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample). In order to convert integer targets into categorical targets, you can use the Keras utility to_categorical:
My code:
from keras.models import Sequential
from keras.layers import Dense, Activation
model = Sequential()
model.add(Dense(64, input_dim=167))
model.add(Dense(32))
model.add(Dense(11))
model.add(Activation('sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
I tried many models but cannot get a good result.

You will probably do well with a sigmoid output and binary_crossentropy (see here).
PS: This is not your case, but with categorical_crossentropy you should ideally use a softmax activation. Softmax produces outputs that sum to 1, so it is suited to picking exactly one class.
(If anyone would like to complement this answer with a good or better "optimizer", feel free).
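To make the suggestion concrete, here is a minimal NumPy sketch of the multi-label setup: targets become multi-hot vectors, each of the 11 outputs gets its own sigmoid, and the loss is the binary cross-entropy averaged over all outputs. In Keras this corresponds to a final Dense(11, activation='sigmoid') layer compiled with loss='binary_crossentropy'. The helper names, the example samples and the 0.5 threshold are assumptions made for illustration only.

```python
import numpy as np

N_CLASSES = 11  # number of substance classes, from the question


def multi_hot(label_sets, n_classes=N_CLASSES):
    """Encode lists of 1-based class ids as rows of 0s and 1s."""
    targets = np.zeros((len(label_sets), n_classes), dtype=np.float32)
    for row, labels in enumerate(label_sets):
        for label in labels:
            targets[row, label - 1] = 1.0
    return targets


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Two-class cross-entropy averaged over all (sample, class) pairs."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_output = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(per_output))


def decode(y_prob, threshold=0.5):
    """Return the 1-based class ids whose probability exceeds the threshold."""
    return [(np.flatnonzero(row > threshold) + 1).tolist() for row in y_prob]


# Two hypothetical samples: a mixture of classes 2, 3, 4 and a pure class 7.
y_true = multi_hot([[2, 3, 4], [7]])

# Hypothetical raw network outputs (logits) that are confident and correct.
logits = np.where(y_true == 1.0, 4.0, -4.0)
y_prob = sigmoid(logits)

print(decode(y_prob))                       # [[2, 3, 4], [7]]
print(binary_crossentropy(y_true, y_prob))  # small, since every output is right
```

Because each output is thresholded on its own, a prediction can contain zero, one or several substances, which is what the mixed spectra require.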

Related

Can logistic and linear regression produce a prediction on a scale?

I currently have a dataset of drawings, each drawing being represented by some features. Each feature (independent variable) is a continuous number. None of the drawings have a label as of yet, which is why I am planning to start a sort of questionnaire with people. However, before I can correctly set up such a questionnaire, I should have an idea of what kind of labels I should use for my training data.
At first, I was thinking about letting people rate the drawings on a scale, for example from 1 to 5 with 1 being bad, 3 being average and 5 being good. Alternatively, I could also reduce the question to a simple good-or-bad question. The latter would mean I lose some valuable information, but the dependent variable could then be considered 'binary'.
Using the training data I then composed, I would need a machine learning algorithm (model) which, given a drawing, predicts whether the drawing is good or not. Ideally, I would have some way of tuning the strictness of this prediction. For example, instead of simply predicting 'good' or 'bad', the model could predict the likelihood of a drawing being good on a scale of 0 to 1. I could then say "Well, let's say all drawings which are at least 70% likely to be good are considered good". Another example would be that the model predicts the goodness using the same categorical values the people used to rate the drawing initially. So it would predict the drawing being a 1, 2, 3, 4 or 5. Similar to my first example, I could then say "Well, all drawings which are rated at least a 4 are considered good drawings" and tune this threshold to my liking.
After doing some research, I came up with logistic and linear regression as good candidates. However, which of the two would be best for my scenario? Equally important, how would I need to format my labels? Just simple 0's and 1's, or a scale?
You could use a one-vs-all representation if you wanted to use multi-class categorical classification:
Essentially, you train one classifier for every category you have (with a 1 to 5 scale, that is five classifiers), and each classifier is trained only to predict whether or not a drawing belongs to its specific category.
There are alternative ways to make multi-class logistic regression work that only require training a single model, such as using categorical cross-entropy, but given that you'd like to use ordinal data, linear regression is likely a better fit. You'd predict a value between 1 and 5 and then round to the nearest integer. This way you aren't penalizing close guesses as much as far-away guesses.
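The regression-then-round idea can be sketched with ordinary least squares in NumPy. Everything about the data here is made up for illustration (200 drawings, 4 features, simulated 1 to 5 ratings); only the procedure of fitting a continuous score, rounding it to the scale, and thresholding it for a tunable "good" decision follows the answer above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 drawings with 4 continuous features each.
X = rng.normal(size=(200, 4))
true_w = np.array([0.8, -0.5, 0.3, 0.0])   # made-up weights used to simulate ratings
score = 3.0 + X @ true_w + 0.1 * rng.normal(size=200)
ratings = np.clip(np.rint(score), 1, 5)    # simulated questionnaire answers, 1..5

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, ratings, rcond=None)

# Continuous prediction, then rounded and clipped back onto the 1..5 scale.
pred_continuous = A @ coef
pred_rating = np.clip(np.rint(pred_continuous), 1, 5).astype(int)

# Tunable strictness: call a drawing "good" if its predicted score is at least 4.
is_good = pred_continuous >= 4.0

print("exact matches:", np.mean(pred_rating == ratings))
print("within one step:", np.mean(np.abs(pred_rating - ratings) <= 1))
```

Moving the 4.0 cut-off up or down changes how strict the "good" decision is without refitting the model.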
What keeps you from using a logistic regression model? For lack of a better dataset, I used the standard diabetes data. The target variable is an integer between 25 and 346. I normalised the target to [0, 1] so that I can use sigmoid as the activation function. For the loss I decided to use the mean squared error:
import tensorflow as tf
from sklearn import datasets
# Load the features and rescale the target to [0, 1], the output range of a sigmoid.
diabetes = datasets.load_diabetes()
x_train = diabetes.data
y_min, y_max = diabetes.target.min(), diabetes.target.max()
y_train = (diabetes.target - y_min) / (y_max - y_min)
# A single sigmoid unit on the raw features, i.e. a logistic-regression-shaped model.
inputs = tf.keras.Input(shape=(x_train.shape[1],))
outputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(inputs)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.MSE,
              metrics=['mae'])  # a regression metric; accuracy is not meaningful here
# Note: this validates on the training data, so it only checks that the model fits.
history = model.fit(x_train, y_train,
                    batch_size=64,
                    epochs=300,
                    validation_data=(x_train, y_train))
You could also use a linear regression model. For that you only need to replace the activation function with a linear one. However, I think the squashing behaviour of the sigmoid is an advantage, as it ensures that no predicted rating is larger than 1 or smaller than 0.
A last alternative would be to train on pairwise preferences. The idea is to show a person two drawings and ask which one they like more, then build a binary model, e.g. logistic regression. This approach appears preferable to me, as the question is easier for a person to answer.
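One common way to realise the pairwise idea, sketched here under stated assumptions, is to fit a logistic regression on the difference of the two drawings' feature vectors. The data, the hidden weights used to simulate answers, the learning rate and the iteration count are all hypothetical; the answer above does not prescribe this exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 drawings with 3 features and a hidden "quality" score.
features = rng.normal(size=(100, 3))
hidden_w = np.array([1.5, -1.0, 0.5])   # made up, only used to simulate answers
quality = features @ hidden_w

# Simulated questionnaire: random pairs (a, b); label 1 means "a preferred over b".
a = rng.integers(0, 100, size=2000)
b = rng.integers(0, 100, size=2000)
keep = a != b
a, b = a[keep], b[keep]
p_prefer_a = 1.0 / (1.0 + np.exp(-(quality[a] - quality[b])))
y = (rng.random(len(a)) < p_prefer_a).astype(float)

# Logistic regression on feature differences, fitted by plain gradient descent.
diff = features[a] - features[b]
w = np.zeros(3)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(diff @ w)))
    w -= 0.5 * diff.T @ (p - y) / len(y)

# The learned weights give every drawing a score that can be ranked or thresholded.
learned_score = features @ w
print(np.corrcoef(learned_score, quality)[0, 1])
```

The resulting per-drawing score plays the same role as the 0 to 1 likelihood in the question: it can be sorted or cut at any strictness level.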

Need help choosing loss function

I have used resnet50 to solve a multi-class classification problem. The model outputs probabilities for each class. Which loss function should I choose for my model?
After choosing binary cross-entropy:
After choosing categorical cross-entropy:
The above results are for the same model with just different loss functions. This model is supposed to classify images into 26 classes, so categorical cross-entropy should work.
Also, in the first case the accuracy is about 96%, but the loss is very high. Why?
edit 2:
Model architecture:
You definitely need to use categorical_crossentropy for a multi-class classification problem. binary_crossentropy treats each of the 26 outputs as a separate yes/no problem, which is not what you want when every image belongs to exactly one class.
I would say that the reason you are seeing high accuracy in the first (and to some extent the second) case is that you are overfitting. The first dense layer you are adding contains 8 million parameters (run model.summary() to see this), and you only have 70k images and 8 epochs to train it. This architectural choice is very demanding in both computing power and data. You are also using a very basic optimizer (SGD). Try a more powerful one such as Adam.
Finally, I am a bit surprised at your choice to take a 'sigmoid' activation function in the output layer. Why not a more classic 'softmax'?
For a multi-class classification problem you use the categorical_crossentropy loss, as what it does is match the ground truth probability distribution with the one predicted by the model.
This is exactly what is used for multi-class classification; it is a misconception to think that you can't use this loss.
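A possible contributor to the 96% figure, offered as an assumption since the training logs are not shown: when the loss is binary_crossentropy, Keras may resolve the 'accuracy' metric to an element-wise binary accuracy, which counts every correctly predicted 0 among the 26 outputs. The NumPy sketch below uses a deliberately useless, hypothetical predictor to show how far the two accuracy definitions can diverge.

```python
import numpy as np

n_samples, n_classes = 1000, 26
rng = np.random.default_rng(0)

# One-hot targets for a 26-class problem.
labels = rng.integers(0, n_classes, size=n_samples)
y_true = np.eye(n_classes)[labels]

# A useless "model" that outputs a low probability for every class.
y_pred = np.full((n_samples, n_classes), 0.01)

# Element-wise (binary) accuracy: threshold each output at 0.5 and compare.
binary_acc = np.mean((y_pred > 0.5) == (y_true > 0.5))

# Categorical accuracy: compare the arg-max class with the true class.
categorical_acc = np.mean(np.argmax(y_pred, axis=1) == labels)

print(binary_acc)       # 25/26, about 0.9615
print(categorical_acc)  # close to chance level
```

Predicting "no" for everything is right on 25 of the 26 outputs per image, so the element-wise accuracy lands at about 96% even though the model never identifies a class.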

About correctly using dropout in RNNs (Keras)

I am confused about how to correctly use dropout with RNNs in Keras, specifically with GRU units. The Keras documentation refers to this paper (https://arxiv.org/abs/1512.05287), and I understand that the same dropout mask should be used for all time steps. This is achieved by the dropout argument when specifying the GRU layer itself. What I don't understand is:
Why are there several examples on the internet, including Keras's own example (https://github.com/keras-team/keras/blob/master/examples/imdb_bidirectional_lstm.py) and the "Trigger word detection" assignment in Andrew Ng's Coursera Sequence Models course, that add a dropout layer explicitly with "model.add(Dropout(0.5))", which, in my understanding, will apply a different mask at every time step?
The paper mentioned above suggests that doing this is inappropriate and that we might lose the signal as well as long-term memory due to the accumulation of this dropout noise over all the time steps.
But then, how are these models (using different dropout masks at every time step) able to learn and perform well?
I myself have trained a model which uses different dropout masks at every time step, and although I haven't gotten the results I wanted, the model is able to overfit the training data. This, in my understanding, invalidates the "accumulation of noise" and "signal getting lost" arguments over all the time steps (I have series of 1000 time steps being input to the GRU layers).
Any insights, explanations or experience with the situation will be helpful. Thanks.
UPDATE:
To make it clearer, I'll quote an extract from the Keras documentation of the Dropout layer ("noise_shape: 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input. For instance, if your inputs have shape (batch_size, timesteps, features) and you want the dropout mask to be the same for all timesteps, you can use noise_shape=(batch_size, 1, features).").
So, I believe, it can be seen that when using a Dropout layer explicitly and needing the same mask at every time step (as mentioned in the paper), we need to set this noise_shape argument, which is not done in the examples I linked earlier.
As Asterisk explained in his comment, there is a fundamental difference between dropout within a recurrent unit and dropout after the unit's output. This is the architecture from the keras tutorial you linked in your question:
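from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense
# max_features and maxlen are defined earlier in the linked example
model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))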
You're adding a dropout layer after the LSTM has finished its computation, meaning that there won't be any more recurrent passes in that unit. Imagine this dropout layer as teaching the network not to rely on the output for a specific feature of a specific time step, but to generalize over information in different features and time steps. Dropout here is no different from dropout in feed-forward architectures.
What Gal & Ghahramani propose in their paper (which you linked in the question) is dropout within the recurrent unit. There, you're dropping input information between the time steps of a sequence. I found this blogpost to be very helpful for understanding the paper and how it relates to the Keras implementation.
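The difference between a mask drawn independently at every time step and a mask shared across time steps (the noise_shape case quoted in the question) can be illustrated in plain NumPy. This is only a sketch of the masking logic, not Keras's implementation; the dropout helper and the tensor sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, timesteps, features = 2, 5, 4
rate = 0.5
x = np.ones((batch_size, timesteps, features))


def dropout(x, rate, noise_shape, rng):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    keep = rng.random(noise_shape) >= rate
    return x * keep / (1.0 - rate)   # the mask broadcasts over axes of size 1


# Independent mask for every (sample, time step, feature) entry.
per_step = dropout(x, rate, (batch_size, timesteps, features), rng)

# One mask per sample, shared by all time steps (a 1 on the time axis).
shared = dropout(x, rate, (batch_size, 1, features), rng)

# With a shared mask, every time step of a sample drops the same features.
print(np.all(shared == shared[:, :1, :]))   # True
print(per_step[0])                          # zeros scattered differently per row
```

With the shared mask a dropped feature stays dropped for the whole sequence, which is the behaviour the paper argues for inside the recurrence.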

LDA and PCA on a dataset containing two classes

I would like to compare the accuracies of running logistic regression on a dataset following PCA and LDA. The dataset I am using is the Wisconsin cancer dataset, which contains two classes (malignant or benign tumors) and 30 features. I have already conducted PCA on this data and have been able to get good accuracy scores with 10 principal components. I know that LDA is similar to PCA. My understanding is that you calculate the mean vectors of each feature for each class, compute the scatter matrices and then get the eigenvalues for the dataset. Is LDA similar to PCA in the sense that I can choose 10 LDA eigenvalues to better separate my data? I have tried LDA with scikit-learn, but it has only given me one linear discriminant back. Is this because I only have 2 classes, or do I need to do an additional step? I would like to have 10 linear discriminants in order to compare them with my 10 principal components. Is this even possible?
Actually, both LDA and PCA are linear transformation techniques: LDA is supervised, whereas PCA is unsupervised (it ignores class labels). You can picture PCA as a technique that finds the directions of maximal variance, and LDA as a technique that also cares about class separability (note that here, LD 2 would be a very bad linear discriminant). Remember that LDA makes assumptions about normally distributed classes and equal class covariances (at least the multiclass version; the generalized version by Rao).
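On the question of why only one discriminant comes back: LDA can produce at most min(n_classes - 1, n_features) components, because the between-class scatter matrix has rank at most n_classes - 1. The NumPy sketch below checks this on hypothetical two-class data with 30 features (a synthetic stand-in, not the actual cancer dataset).

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 30

# Hypothetical two-class data with 30 features.
X0 = rng.normal(loc=0.0, size=(200, n_features))
X1 = rng.normal(loc=0.5, size=(150, n_features))
X = np.vstack([X0, X1])
y = np.array([0] * len(X0) + [1] * len(X1))

overall_mean = X.mean(axis=0)
S_w = np.zeros((n_features, n_features))   # within-class scatter
S_b = np.zeros((n_features, n_features))   # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mean_c = Xc.mean(axis=0)
    S_w += (Xc - mean_c).T @ (Xc - mean_c)
    d = (mean_c - overall_mean).reshape(-1, 1)
    S_b += len(Xc) * (d @ d.T)

# LDA directions are eigenvectors of inv(S_w) @ S_b.
eigvals = np.linalg.eigvals(np.linalg.solve(S_w, S_b))
eigvals = np.sort(np.abs(eigvals))[::-1]
n_useful = int(np.sum(eigvals > 1e-6 * eigvals[0]))

print(np.linalg.matrix_rank(S_b))   # 1 for two classes
print(n_useful)                     # 1 non-negligible eigenvalue
```

So with two classes a single discriminant is all there is; a like-for-like comparison with 10 principal components is not possible on this dataset.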

The best loss function for pixelwise binary classification in keras

I built a deep learning model which accepts an image of size 250*250*3 and outputs a binary vector of length 62500 (250*250), containing 0s for pixels that represent the background and 1s for pixels that represent the ROI.
My model is based on DenseNet121, but when I use softmax as the activation function in the last layer together with the categorical cross-entropy loss function, the loss is NaN.
What is the best loss and activation function that I can use in my model?
What is the difference between the binary cross-entropy and categorical cross-entropy loss functions?
Thanks in advance.
What is the best loss and activation function that I can use in my model?
Use binary_crossentropy, because every output is independent rather than mutually exclusive and can take the value 0 or 1. Use sigmoid in the last layer.
Check this interesting question/answer
What is the difference between the binary cross-entropy and categorical cross-entropy loss functions?
Here is a good set of answers to that question.
Edit 1: My bad, use binary_crossentropy.
After a quick look at the code (again) I can see that keras uses:
for binary_crossentropy -> tf.nn.sigmoid_cross_entropy_with_logits
(From tf docs): Measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive. For instance, one could perform multilabel classification where a picture can contain both an elephant and a dog at the same time.
for categorical_crossentropy -> tf.nn.softmax_cross_entropy_with_logits
(From tf docs): Measures the probability error in discrete classification tasks in which the classes are mutually exclusive (each entry is in exactly one class). For example, each CIFAR-10 image is labeled with one and only one label: an image can be a dog or a truck, but not both.
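For intuition about the per-pixel sigmoid loss, here is a NumPy sketch of the numerically stable formula documented for tf.nn.sigmoid_cross_entropy_with_logits, max(x, 0) - x * z + log(1 + exp(-|x|)), next to the textbook formula applied to sigmoid outputs. The tiny 2x3 mask and the logits are made up, and the NaN in the question may well have other causes; the point is only that the loss is computed independently for each pixel and that the naive form can blow up on extreme logits.

```python
import numpy as np


def sigmoid_cross_entropy_with_logits(labels, logits):
    """Per-element loss: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return np.maximum(logits, 0.0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))


def naive_binary_crossentropy(labels, logits):
    """Textbook formula on sigmoid outputs; breaks down for extreme logits."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))


# A tiny hypothetical 2x3 "mask" instead of 250x250 pixels: 1 = ROI, 0 = background.
mask = np.array([[0.0, 1.0, 1.0],
                 [0.0, 0.0, 1.0]])
logits = np.array([[-3.0, 2.0, 0.5],
                   [-1.0, 40.0, -40.0]])   # the last two pixels are confidently wrong

stable = sigmoid_cross_entropy_with_logits(mask, logits)
with np.errstate(divide="ignore", over="ignore", invalid="ignore"):
    naive = naive_binary_crossentropy(mask, logits)

print(stable.mean())             # finite
print(np.isfinite(naive).all())  # False: log(0) appears for an extreme logit
```

Averaging the stable per-pixel values gives one scalar loss for the whole mask, which is what a sigmoid output layer trained with binary_crossentropy optimizes.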
