Combination of classification and regression - machine-learning

The dataset I am working with contains the readings of an 8-sensor gas-sensor array. The response of a sensor depends on the gas stimulus (methane, ethylene, etc.) and the concentration of the gas (20 ppm, 50 ppm, etc.). The dataset consists of 640 examples, and each example has shape=(6000,8) since there are 8 sensors on the array.
(Figure: sensor-array response to 100 ppm of methane)
My task is to build a model that predicts the class of a sensor-array reading (i.e. which gas produced it) and then predicts the concentration of that gas.
So far I have built a classification model based on 1D convolutional layers which successfully classifies examples into four categories (gases) with 98% accuracy.
How could I predict the concentration value of the gas? Is it possible to perform a regression analysis on the classified examples or should I look for a whole different approach?

For this task, I would just build a multi-output neural network like this:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Shared trunk: both outputs are fed by the same hidden representation
inp = Input(shape=(n_features,))
hidden1 = Dense(20, activation='relu', kernel_initializer='he_normal')(inp)
hidden2 = Dense(10, activation='relu', kernel_initializer='he_normal')(hidden1)

# Two heads: a single linear unit for the concentration (regression)
# and a softmax layer for the gas class (classification)
out_reg = Dense(1, activation='linear')(hidden2)
out_class = Dense(n_class, activation='softmax')(hidden2)

model = Model(inputs=inp, outputs=[out_reg, out_class])
model.compile(loss=['mse', 'sparse_categorical_crossentropy'], optimizer='adam')
model.fit(X_train, [y_train_reg, y_train_class], epochs=150, batch_size=32, verbose=2)
One output head handles the regression and the other the classification.
If you don't know how to create such networks, please read the documentation.
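Once trained, model.predict on such a two-headed model returns one array per output. A minimal sketch of using it (X_test here is just a placeholder name for your held-out examples):
import numpy as np
# model.predict returns a list: [regression output, classification output]
y_pred_reg, y_pred_class_probs = model.predict(X_test)
predicted_concentration = y_pred_reg.ravel()            # one concentration per example
predicted_gas = np.argmax(y_pred_class_probs, axis=1)   # most probable gas class per example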

Related

sklearn multiclass svm function

I have multi-class labels and want to compute the accuracy of my model.
I am kind of confused about which sklearn function I need to use.
As far as I understand, the code below is only used for binary classification.
# dividing X, y into train and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# training a linear SVM classifier
from sklearn.svm import SVC
svm_model_linear = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm_predictions = svm_model_linear.predict(X_test)
# model accuracy for X_test
accuracy = svm_model_linear.score(X_test, y_test)
print(accuracy)
and as I understood from the link:
Which decision_function_shape for sklearn.svm.SVC when using OneVsRestClassifier?
for multiclass classification I should use OneVsRestClassifier with decision_function_shape (either 'ovr' or 'ovo', checking which one works better):
svm_model_linear = OneVsRestClassifier(SVC(kernel = 'linear',C = 1, decision_function_shape = 'ovr')).fit(X_train, y_train)
The main problem is that prediction time matters to me, but it takes about 1 minute to run the classifier and predict the data (and this comes on top of feature reduction such as PCA, which also takes some time). Any suggestions to reduce the time of the multiclass SVM?
There are multiple things to consider here:
1) You see, OneVsRestClassifier will separate out all labels and train multiple SVM objects (one for each label) on the given data. So each time, only a binary problem is supplied to a single SVM object.
2) SVC internally uses libsvm, which follows a one-vs-one ('OvO') strategy for multiclass output. But this point is of no use because of point 1: libsvm will only ever get binary data here.
Even if it did get multiclass data, libsvm does not take decision_function_shape into account during fitting, so it does not matter whether you provide decision_function_shape = 'ovr' or decision_function_shape = 'ovo'.
So it seems that you are looking at the problem wrong. decision_function_shape should not affect the speed. Try standardizing your data before fitting. SVMs work well with standardized data.
When wrapping models with the OvR or OvO meta-classifiers, you can set the n_jobs parameter to make them run faster, e.g. sklearn.multiclass.OneVsOneClassifier(estimator, n_jobs=-1) or sklearn.multiclass.OneVsRestClassifier(estimator, n_jobs=-1).
Although each single SVM classifier in sklearn can only use one CPU core at a time, the multiclass meta-classifier can fit several of its binary models in parallel when n_jobs is set.
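Putting the two suggestions together (standardize first, parallelize the one-vs-rest wrapper), a minimal sketch could look like the following; X_train, y_train, X_test and y_test are assumed to come from the question's train_test_split:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardize the features, then fit one binary SVC per class in parallel.
clf = OneVsRestClassifier(
    make_pipeline(StandardScaler(), SVC(kernel='linear', C=1)),
    n_jobs=-1,
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))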

Why should we normalize data for deep learning in Keras?

I was testing some network architectures in Keras for classifying the MNIST dataset. I have implemented one that is similar to the LeNet.
I have seen that in the examples that I have found on the internet, there is a step of data normalization. For example:
X_train /= 255
I have performed a test without this normalization and I have seen that the performance (accuracy) of the network has decreased (keeping the same number of epochs). Why has this happened?
If I increase the number of epochs, can the accuracy reach the same level as the model trained with normalization?
So does normalization affect the accuracy, or only the training speed?
The complete source code of my training script is below:
from keras.models import Sequential
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers.core import Activation
from keras.layers.core import Flatten
from keras.layers.core import Dense
from keras.datasets import mnist
from keras.utils import np_utils
from keras.optimizers import SGD, RMSprop, Adam
import numpy as np
import matplotlib.pyplot as plt
from keras import backend as k
def build(input_shape, classes):
    model = Sequential()
    model.add(Conv2D(20, kernel_size=5, padding="same", activation='relu', input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(Conv2D(50, kernel_size=5, padding="same", activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(Flatten())
    model.add(Dense(500))
    model.add(Activation("relu"))
    model.add(Dense(classes))
    model.add(Activation("softmax"))
    return model
NB_EPOCH = 4 # number of epochs
BATCH_SIZE = 128 # size of the batch
VERBOSE = 1 # set the training phase as verbose
OPTIMIZER = Adam() # optimizer
VALIDATION_SPLIT = 0.2 # percentage of the training data used for evaluating the loss function
IMG_ROWS, IMG_COLS = 28, 28 # input image dimensions
NB_CLASSES = 10 # number of outputs = number of digits
INPUT_SHAPE = (1, IMG_ROWS, IMG_COLS) # shape of the input
(X_train, y_train), (X_test, y_test) = mnist.load_data()
k.set_image_dim_ordering("th")
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
X_train = X_train[:, np.newaxis, :, :]
X_test = X_test[:, np.newaxis, :, :]
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
y_train = np_utils.to_categorical(y_train, NB_CLASSES)
y_test = np_utils.to_categorical(y_test, NB_CLASSES)
model = build(input_shape=INPUT_SHAPE, classes=NB_CLASSES)
model.compile(loss="categorical_crossentropy",
optimizer=OPTIMIZER,metrics=["accuracy"])
history = model.fit(X_train, y_train, batch_size=BATCH_SIZE, epochs=NB_EPOCH, verbose=VERBOSE, validation_split=VALIDATION_SPLIT)
model.save("model2")
score = model.evaluate(X_test, y_test, verbose=VERBOSE)
print('Test accuracy:', score[1])
Normalization is a generic concept not limited only to deep learning or to Keras.
Why to normalize?
Let me take a simple logistic regression example, which makes normalization easy to understand and explain.
Assume we are trying to predict whether a customer should be given a loan or not. Among the many available independent variables, let's just consider Age and Income.
Let the equation be of the form:
Y = weight_1 * (Age) + weight_2 * (Income) + some_constant
Just for the sake of explanation, let Age usually be in the range [0, 120] and assume Income is in the range [10000, 100000]. The scales of Age and Income are very different. If you use them as-is, the weights weight_1 and weight_2 may end up biased: weight_2 might give Income far more importance as a feature than weight_1 gives to Age, simply because of the larger magnitude of Income. To bring them to a common scale, we can normalize them, for example mapping all ages into [0, 1] and all incomes into [0, 1]. Then we can say that Age and Income are given equal importance as features.
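As a concrete illustration, here is a small sketch of min-max scaling two such columns; the values are made up for the example:
import numpy as np

age = np.array([18, 35, 52, 70], dtype=float)                  # roughly in [0, 120]
income = np.array([12000, 40000, 85000, 99000], dtype=float)   # roughly in [10000, 100000]

def min_max_scale(x):
    # Map values linearly into [0, 1]
    return (x - x.min()) / (x.max() - x.min())

age_scaled = min_max_scale(age)
income_scaled = min_max_scale(income)
print(age_scaled)
print(income_scaled)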
Does Normalization always increase the accuracy?
No, not necessarily. Normalization does not always increase accuracy. It may or may not help; you never really know until you try. It also depends on at which stage of training you apply normalization, whether you apply it after every activation, and so on.
Because normalization narrows the feature values down to a particular range, computations are performed over a smaller range of values, so the model usually trains a bit faster.
Regarding the number of epochs, accuracy usually increases with number of epochs provided that your model doesn't start over-fitting.
A very good explanation for Normalization/Standardization and related terms is here.
In a nutshell, normalization reduces the complexity of the problem your network is trying to solve. This can potentially increase the accuracy of your model and speed up the training. You bring the data on the same scale and reduce variance. None of the weights in the network are wasted on doing a normalization for you, meaning that they can be used more efficiently to solve the actual task at hand.
As @Shridhar R Kulkarni says, normalization is a general concept and doesn't only apply to Keras.
It's often applied as part of data preparation for ML models, to rescale numeric values in the dataset to a standard scale without distorting the differences in their ranges. As such, normalization reduces the chance that inconsistently scaled data distorts the model.
However, not every dataset and use case requires normalization; it is primarily necessary when features have different ranges. You may use it when:
- You want to improve your model's convergence and make optimization feasible.
- You want to make training less sensitive to the scale of the features, so the coefficients can be estimated more reliably.
- You want to improve analysis across multiple models.
Normalization is not recommended when:
- You are using decision-tree models or ensembles based on them.
- Your data is not normally distributed; you may have to use other data pre-processing techniques.
- Your dataset already comprises scaled variables.
In some cases, normalization can improve performance. However, it is not always necessary.
The critical thing is to understand your dataset and scenario first, then you’ll know whether you need it or not. Sometimes, you can experiment to see if it gives you good performance or not.
Check out deepchecks and see how to deal with important data-related checks you come across in ML.
For example, to check for duplicated data in your set, you can use the following code:
from deepchecks.checks.integrity.data_duplicates import DataDuplicates
from deepchecks.base import Dataset, Suite
from datetime import datetime
import pandas as pd
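A minimal sketch of how the check might then be run, assuming the usual deepchecks pattern of wrapping a pandas DataFrame in a Dataset and calling run() on the check (df is a made-up frame for illustration):
df = pd.DataFrame({"sensor": [1, 2, 2], "reading": [0.1, 0.5, 0.5]})
dataset = Dataset(df)                   # wrap the frame for deepchecks
result = DataDuplicates().run(dataset)  # run the duplicate-rows check
print(result.value)                     # summary of the duplicates found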
I think there is also an issue with the convergence of the optimizer. Here I show a simple linear regression with three examples:
First, an array with small values: it works as expected.
Second, an array with bigger values: the loss function explodes toward infinity, suggesting the need to normalize. Finally, in model 3, the same array as in case two but normalized: we get convergence.
github colab enabled ipython notebook
I used MSE as the loss function; I don't know whether other losses or optimizers suffer from the same issue.
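A rough sketch of the three cases described above, using a tiny Keras linear model trained with MSE; the data values and learning settings are made up for illustration:
import numpy as np
from tensorflow import keras

def final_loss(x, y, epochs=100):
    # A single dense unit is plain linear regression
    model = keras.Sequential([keras.layers.Dense(1, input_shape=(1,))])
    model.compile(optimizer='sgd', loss='mse')
    history = model.fit(x.reshape(-1, 1), y, epochs=epochs, verbose=0)
    return history.history['loss'][-1]

x_small = np.arange(0, 1, 0.01); y_small = 2 * x_small + 1
print(final_loss(x_small, y_small))      # case 1: small values, converges

x_big = np.arange(0, 1000, 10, dtype=float); y_big = 2 * x_big + 1
print(final_loss(x_big, y_big))          # case 2: big values, the loss can blow up

x_norm = (x_big - x_big.min()) / (x_big.max() - x_big.min())
y_norm = (y_big - y_big.min()) / (y_big.max() - y_big.min())
print(final_loss(x_norm, y_norm))        # case 3: normalized, converges again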

Accuracy remains zero while training LSTM in keras

I am trying to train an LSTM, but the training accuracy remains zero in every epoch.
I have transformed the data into multivariate time-series data and reshaped it into a three-dimensional shape.
I have also normalised the data using MinMaxScaler.
I have tried epoch counts from 5 to 50 and batch sizes from 25 to 200.
I have tried sample sizes from 1,000,000 down to 1,000, but none of it works: every time the training accuracy is zero.
Can anyone help me understand this or suggest some more experiments?
The following is my network:
from keras.layers.core import Dense,Activation,Dropout
from keras.layers.recurrent import LSTM
from keras.models import Sequential
from keras.layers import Flatten
model = Sequential()
model.add(LSTM(50,return_sequences=True, input_shape=(X_train_values.shape[1], X_train_values.shape[2])))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(1))
model.add(Activation('linear'))
model.compile(loss='mse',optimizer='rmsprop',metrics=['accuracy'])
history = model.fit(X_train_values, y_train.values,epochs=25, batch_size=30, verbose=2, shuffle=False)
Me too. I'm a student from China. When I train an LSTM model, the model's accuracy is very close to zero, but the predicted values and the test values are very close to each other.

How to calculate prediction uncertainty using Keras?

I would like to calculate NN model certainty/confidence (see What my deep model doesn't know) - when NN tells me an image represents "8", I would like to know how certain it is. Is my model 99% certain it is "8" or is it 51% it is "8", but it could also be "6"? Some digits are quite ambiguous and I would like to know for which images the model is just "flipping a coin".
I have found some theoretical writings about this but I have trouble putting this in code. If I understand correctly, I should evaluate a testing image multiple times while "killing off" different neurons (using dropout) and then...?
Working on MNIST dataset, I am running the following model:
from keras.models import Sequential
from keras.layers import Dense, Activation, Conv2D, Flatten, Dropout
model = Sequential()
model.add(Conv2D(128, kernel_size=(7, 7),
                 activation='relu',
                 input_shape=(28, 28, 1,)))
model.add(Dropout(0.20))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Dropout(0.20))
model.add(Flatten())
model.add(Dense(units=64, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(units=10, activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
model.fit(train_data, train_labels, batch_size=100, epochs=30, validation_data=(test_data, test_labels,))
How should I predict with this model so that I get its certainty about predictions too? I would appreciate some practical examples (preferably in Keras, but any will do).
To clarify, I am looking for an example of how to get certainty using the method outlined by Yarin Gal (or an explanation of why some other method yields better results).
If you want to implement the dropout approach to measure uncertainty, you should do the following:
Implement a function which applies dropout also at test time:
import keras.backend as K
f = K.function([model.layers[0].input, K.learning_phase()],
               [model.layers[-1].output])
Use this function as an uncertainty predictor, e.g. in the following manner:
def predict_with_uncertainty(f, x, n_iter=10):
    result = numpy.zeros((n_iter,) + x.shape)
    for iter in range(n_iter):
        result[iter] = f(x, 1)
    prediction = result.mean(axis=0)
    uncertainty = result.var(axis=0)
    return prediction, uncertainty
Of course you may use any different function to compute uncertainty.
Made a few changes to the top voted answer. Now it works for me.
It's a way to estimate model uncertainty. For other source of uncertainty, I found https://eng.uber.com/neural-networks-uncertainty-estimation/ helpful.
import numpy as np
import keras.backend as K

f = K.function([model.layers[0].input, K.learning_phase()],
               [model.layers[-1].output])

def predict_with_uncertainty(f, x, n_iter=10):
    result = []
    for i in range(n_iter):
        result.append(f([x, 1]))
    result = np.array(result)
    prediction = result.mean(axis=0)
    uncertainty = result.var(axis=0)
    return prediction, uncertainty
Your model uses a softmax activation, so the simplest way to obtain some kind of uncertainty measure is to look at the output softmax probabilities:
probs = model.predict(some input data)[0]
The probs array will then be a 10-element vector of numbers in the [0, 1] range that sum to 1.0, so they can be interpreted as probabilities. For example the probability for digit 7 is just probs[7].
Then with this information you can do some post-processing: typically the predicted class is the one with the highest probability, but you can also look at the class with the second-highest probability, and so on.
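For instance, a short sketch of pulling out the top two candidate digits and their probabilities from such a vector (probs is the array from the snippet above):
import numpy as np
top2 = np.argsort(probs)[::-1][:2]   # indices of the two most probable classes
print("best guess:", top2[0], "with probability", probs[top2[0]])
print("runner-up :", top2[1], "with probability", probs[top2[1]])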
A simpler way is to set training=True on any dropout layers you want to keep active at inference time as well (this essentially tells the layer to always operate as if it were in training mode, so dropout is applied during both training and inference).
import keras
inputs = keras.Input(shape=(10,))
x = keras.layers.Dense(3)(inputs)
outputs = keras.layers.Dropout(0.5)(x, training=True)
model = keras.Model(inputs, outputs)
Code above is from this issue.
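With dropout kept active like this, you can estimate uncertainty the same way as in the answers above: run the prediction several times and look at the spread. A minimal sketch for the toy model above (the input batch and the number of passes are arbitrary):
import numpy as np
x = np.random.rand(4, 10).astype("float32")   # placeholder input batch for the toy model

# Because the Dropout layer was built with training=True, every predict call
# is a different stochastic forward pass.
passes = np.stack([model.predict(x, verbose=0) for _ in range(20)])
prediction = passes.mean(axis=0)   # Monte Carlo mean
uncertainty = passes.std(axis=0)   # spread across the stochastic passes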

Why pretraining for DNN is not specified in keras?

The question is more about the training algorithm for DNNs than about the Keras software itself.
As far as I know, deep neural networks work thanks to improvements in the training algorithm. Since the 1980s, the BP algorithm has been used to train neural networks, but it results in over-fitting when the network is deep. About 10 years ago, Hinton improved the algorithm by first pre-training the network using unlabeled data and then applying the BP algorithm. Pre-training plays an important role in avoiding over-fitting.
However, as I began to try Keras, the example below of an MNIST DNN trained with SGD, without any mention of a pre-training step, reaches a very high prediction accuracy. So I began to wonder: where has the pre-training gone? Have I misunderstood the deep learning training algorithm (I think classical BP is essentially the same as SGD)? Or has a new training technique replaced the pre-training step?
Very grateful for your help!
'''Trains a simple deep NN on the MNIST dataset.
Gets to 98.40% test accuracy after 20 epochs
(there is *a lot* of margin for parameter tuning).
2 seconds per epoch on a K520 GPU.
'''
from __future__ import print_function
import numpy as np
np.random.seed(1337) # for reproducibility
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, Adam, RMSprop
from keras.utils import np_utils
batch_size = 128
nb_classes = 10
nb_epoch = 20
# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
model = Sequential()
model.add(Dense(512, input_shape=(784,)))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(10))
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])
history = model.fit(X_train, Y_train,
                    batch_size=batch_size, nb_epoch=nb_epoch,
                    verbose=1, validation_data=(X_test, Y_test))
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])
You are wrong.
Past vs. Today
The difference between Neural Networks in the past and the ones today is not about the training algorithm. Every DNN is trained with Backpropagation based on some SGD-based algorithm, exactly like in the past. (There are some new algorithms trying to reduce parameter-tuning with adaptive learning-rates like Adam, RMSprop and co.; but plain SGD is still the most common algorithm and was used for AlphaGo for example)
The difference is mainly the size, i.e. the number of layers (depth, which is now feasible thanks to GPU-based computation), and the choice of activation functions. ReLU simply works better than the classic sigmoid or tanh activations (regarding speed and stability).
Pre-training
I also think that pre-training was very popular 5-10 years ago, but almost nobody does it today (if you have enough data)!
Let me quote from here:
It's true that unsupervised pre-training was initially what made it possible to train deeper networks, but the last few years the pre-training approach has been largely obsoleted.
Nowadays, deep neural networks are a lot more similar to their 80's cousins. Instead of pre-training, the difference is now in the activation functions and regularisation methods used (and sometimes in the optimisation algorithm, although much more rarely).
I would say that the "pre-training era", which started around 2006, ended in the early '10s when people started using rectified linear units (ReLUs), and later dropout, and discovered that pre-training was no longer beneficial for this type of networks.
I can recommend these slides as introduction to modern Deep Learning (as starting point).
Pretraining is actually regaining a lot of traction in the NLP community; see OpenAI's GPT. The idea is that pretraining acts as an unsupervised initialization step before fine-tuning the model on the supervised data. This works because unlabeled data is far more abundant than its labeled counterpart, and it can be exploited to derive sensible weights inside the model that capture the hidden structure of the dataset.
Hope that the explanation was not too goofy :)
