Accuracy does not go up on a keras model - machine-learning

I'm trying to train a model on data from the Higgs Boson challenge on Kaggle. The first thing I decided to do was to create a simple Keras model. I've tried different numbers and widths of layers, different cost functions, different optimizers, and different activation functions, but the accuracy on the training set always stays in the 0.65-0.7 range. I don't really understand why. Here's an example of a model that behaved this way:
from keras.layers import Dense, merge, Activation, Dropout
from keras.models import Model
from keras.models import Sequential
from keras.optimizers import SGD
model = Sequential()
model.add(Dense(600, input_shape=(30,),activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(400, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
sgd = SGD(lr=0.01, decay=1e-6)
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])
model.fit(train,labels,nb_epoch=1,batch_size=1)
I also tried larger models and got such an accuracy too. Please tell me what I am doing wrong.
EDIT
I have tried training this model for 100 epochs with a batch size of 100 and got a loss of 4.9528 and an accuracy of 0.6924 again. It also always outputs zero for every example.

The problem arises from the fact that your model always outputs the majority class. The classes are not balanced (one of the classes appears more often than the other), and it seems that your network "learns" to always output the same class.
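To check whether a network has collapsed to the majority class, it can help to compare its accuracy with the accuracy of a predictor that always outputs the most frequent label. The following is a minimal sketch using only the standard library; the function name and the toy labels are made up for illustration and are not the Higgs data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Toy labels (hypothetical): 7 zeros and 3 ones.
toy_labels = [0] * 7 + [1] * 3
print(majority_baseline_accuracy(toy_labels))  # 0.7
```

If the training accuracy of the network sits at this baseline and every prediction is the same class, the model has probably learned nothing beyond the class frequencies.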
Try using a different classifier (Random Forest for example) and you'll see that the accuracy is much better.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
To address the issue with the neural network, I used SMOTE to balance the training dataset. You should use "adam" as the optimizer for the classification. Also, a much smaller network architecture should be enough for this problem.
from keras.layers import Dense, Dropout
from keras.models import Sequential
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
df = pd.read_csv("training.csv")
y = np.array(df['Label'].apply(lambda x: 0 if x=='s' else 1))
X = np.array(df.drop(["EventId","Label"], axis=1))
sm = SMOTE()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
X_res, y_res = sm.fit_resample(X_train, y_train)  # called fit_sample in older imbalanced-learn releases
model = Sequential()
model.add(Dense(25, input_shape=(31,),activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(10, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer="adam",loss='binary_crossentropy',metrics=['accuracy'])
model.fit(X_res, y_res,validation_data=(X_test, y_test),nb_epoch=100,batch_size=100)
Example results:
Epoch 11/100
230546/230546 [==============================] - 5s - loss: 0.5146 - acc: 0.7547 - val_loss: 0.3365 - val_acc: 0.9138
Epoch 12/100
230546/230546 [==============================] - 5s - loss: 0.4740 - acc: 0.7857 - val_loss: 0.3033 - val_acc: 0.9270
Epoch 13/100
230546/230546 [==============================] - 5s - loss: 0.4171 - acc: 0.8295 - val_loss: 0.2821 - val_acc: 0.9195
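The SMOTE step used above balances the training set by creating synthetic minority samples. As a rough sketch of the core idea only (not the imbalanced-learn implementation, which also picks the neighbor among the k nearest minority samples), a synthetic point can be drawn on the line segment between a minority sample and one of its neighbors. The function name and the toy points below are assumptions for illustration.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Interpolate a synthetic point between a minority sample and a neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
x = [1.0, 2.0]                  # hypothetical minority sample
neighbor = [3.0, 6.0]           # hypothetical minority neighbor
synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

Because the new points are interpolated only from training samples, the resampling should be applied to the training split alone, as the answer's code does.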

You are not training for nearly long enough:
model.fit(train,labels,nb_epoch=1,batch_size=1)
This means you go through the data only once, with an extremely small batch. It should be something along the lines of
model.fit(train, labels, nb_epoch=100, batch_size=100)

Related

ResNet50 torchvision implementation gives low accuracy on CIFAR-10

I am new to Deep Learning and PyTorch. I am using the resnet-50 model in the torchvision module on cifar10. I have imported the CIFAR-10 dataset from torchvision. The accuracy is very low on testing and I have tried configuring the classification layers but there is no change in the accuracy. Is there something wrong with my code? Am I making a mistake in calculating the accuracy?
import torchvision
import torch
import torch.nn as nn
from torch import optim
import os
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import numpy as np
from collections import OrderedDict
import matplotlib.pyplot as plt
transformations=transforms.Compose([transforms.ToTensor(),transforms.Normalize([0.485, 0.456, 0.406],[0.229, 0.224, 0.225])])
trainset=torchvision.datasets.CIFAR10(root='./CIFAR10',download=True,transform=transformations,train=True)
testset=torchvision.datasets.CIFAR10(root='./CIFAR10',download=True,transform=transformations,train=False)
trainloader=DataLoader(dataset=trainset,batch_size=4)
testloader=DataLoader(dataset=testset,batch_size=4)
inputs,labels=next(iter(trainloader))
labels=labels.float()
inputs.size()
print(labels.type())
resnet=torchvision.models.resnet50(pretrained=True)
if torch.cuda.is_available():
    resnet=resnet.cuda()
    inputs,labels=inputs.cuda(),torch.Tensor(labels).cuda()
outputs=resnet(inputs)
outputs.size()
for param in resnet.parameters():
    param.requires_grad=False
numft=resnet.fc.in_features
print(numft)
resnet.fc=torch.nn.Sequential(nn.Linear(numft,1000),nn.ReLU(),nn.Linear(1000,10))
resnet.cuda()
resnet.train(True)
optimizer=torch.optim.SGD(resnet.parameters(),lr=0.001,momentum=0.9)
criterion=nn.CrossEntropyLoss()
for epoch in range(5):
    resnet.train(True)
    trainloss=0
    correct=0
    for x,y in trainloader:
        x,y=x.cuda(),y.cuda()
        optimizer.zero_grad()
        yhat=resnet(x)
        loss=criterion(yhat,y)
        loss.backward()
        optimizer.step()
        trainloss+=loss.item()
    print('Epoch: {} Loss: {}'.format(epoch,(trainloss/len(trainloader))))
accuracy=[]
running_corrects=0.0
for x_test,y_test in testloader:
    x_test,y_test=x_test.cuda(),y_test.cuda()
    yhat=resnet(x_test)
    _,z=yhat.max(1)
    running_corrects += torch.sum(y_test == z)
accuracy.append(running_corrects/len(testloader))
print(running_corrects/len(testloader))
accuracy=max(accuracy)
print(accuracy)
OUTPUT AFTER TRAINING/TESTING
Epoch: 0 Loss: 1.9808503997325897
Epoch: 1 Loss: 1.7917569598436356
Epoch: 2 Loss: 1.624434965057373
Epoch: 3 Loss: 1.4082191940283775
Epoch: 4 Loss: 1.1343850775527955
tensor(1.1404, device='cuda:0')
tensor(1.1404, device='cuda:0')
Couple of my observations:
You may want to tune the learning rate, the number of epochs, and the batch size. For example, you are currently training for only five epochs, which might not be enough to reach high accuracy; try a larger number of epochs.
Have you tried adapting the backbone (feature extractor) to CIFAR-10 by setting param.requires_grad=True? The original model was trained on ImageNet and may need to adapt to CIFAR-10.
Before evaluation/testing, set resnet.train(False) or resnet.eval() so the model knows it is in eval mode. You may also want to evaluate inside a with torch.no_grad(): block, which speeds up inference and reduces memory usage.
Have you checked whether the class distribution of CIFAR-10 is imbalanced? (CIFAR-10 is a balanced dataset, so this is an optional EDA step here.) If a dataset is imbalanced, you may want to use weighted cross entropy for your loss. Other strategies for class imbalance include over-sampling and under-sampling.
Regarding test accuracy, you need to divide the total number of correct predictions by the total number of samples in the dataset, len(testloader.dataset), instead of by len(testloader), which is the number of batches. If you want accuracy in the range [0, 100], multiply by 100. You can also print the test accuracy for each epoch to see how it changes; currently you show only the maximum.
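The last point explains why the printed "accuracy" of 1.1404 is larger than 1. The sketch below uses plain Python lists as stand-ins for the batches a DataLoader would yield (the toy predictions and targets are made up) and contrasts dividing by the number of samples with dividing by the number of batches.

```python
def accuracy_from_batches(batches):
    """batches: iterable of (predictions, targets) pairs, one pair per batch."""
    correct = 0
    total = 0
    for predictions, targets in batches:
        correct += sum(int(p == t) for p, t in zip(predictions, targets))
        total += len(targets)  # count samples, not batches
    return correct / total


# Hypothetical toy data: 3 batches of 4 samples each, 3 correct per batch.
toy_batches = [
    ([1, 0, 2, 2], [1, 0, 2, 1]),
    ([0, 0, 1, 1], [0, 2, 1, 1]),
    ([2, 2, 2, 2], [2, 2, 0, 2]),
]
per_sample = accuracy_from_batches(toy_batches)  # 9 correct / 12 samples
per_batch = 9 / len(toy_batches)                 # 9 correct / 3 batches (the mistake)
print(per_sample, per_batch)                     # 0.75 3.0
```

By the same arithmetic, and assuming the 10,000-image CIFAR-10 test set with batch_size=4 (2,500 batches), the reported 1.1404 would correspond to roughly 2,851 correct predictions, that is, an accuracy of about 28.5%.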

Different score on cross_val_score() and accuracy_score() on sklearn

I'm working with document classification.
I have about 14000 (document + category) samples in total and I split them: 10000 for training data (x_train and y_train) and 4000 for test data (x_test and y_test).
I used gensim's Doc2Vec() to vectorize the documents; it was trained on x_train only (not on x_test).
Here is my code applying Doc2Vec():
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn import utils
total_data = [TaggedDocument(words=t, tags=[l]) for t, l in zip(prep_texts, labels)]
utils.shuffle(total_data)
train_data = total_data[:10000]
test_data = total_data[10000:]
d2v = Doc2Vec(dm=0, vector_size=100, window=5,
alpha=0.025, min_alpha=0.001, min_count=5,
sample=0, workers=8, hs=0, negative=5)
d2v.build_vocab([d for d in train_data])
d2v.train(train_data,
total_examples=len(train_data),
epochs=10)
So x_train and x_test are vectors inferred from the trained Doc2Vec().
Then I applied SVC from sklearn.svm to them as shown below.
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import accuracy_score
clf = SVC()
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
scoring = 'accuracy'
score = cross_val_score(clf, x_train, y_train, cv=k_fold, n_jobs=8, scoring=scoring)
print(score)
print('Valid acc: {}'.format(round(np.mean(score)*100, 4)))
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print('Test acc: {}'.format(accuracy_score(y_test, y_pred)))
The result I got:
[0.916 0.912 0.908 0.923 0.901 0.922 0.921 0.908 0.923 0.924]
Valid acc: 91.58
Test acc: 0.6641691196146642
I am very confused about why I got such different scores from cross_val_score() and accuracy_score().
My reasoning is as follows:
When cross_val_score() runs, it performs cross-validation.
For each fold (assuming n_splits=10), 9/10 of the train set is used to train the classifier and the remaining 1/10 is used to validate it.
This means that 1/10 of the train set is always new to the model, so in terms of newness there should be no difference between that 1/10 of the train set and the test set.
Is there anything wrong with this reasoning?
Based on it, I cannot understand why the two scores differ so much.
Thanks in advance!!
EDIT:
I realized that when I trained Doc2Vec() on both x_train and x_test, I got better scores, as shown below:
[0.905 0.886 0.883 0.91 0.888 0.903 0.904 0.897 0.906 0.905]
Valid acc: 89.87
Test acc: 0.8413165640888414
Naturally this gives better results, but it made me realize that the problem was not the classification but the vectorization.
As you can see, though, there is still about a 5% difference between validation and test accuracy.
I am still wondering why this difference occurs and am looking for ways to improve the Doc2Vec() model.

DNN binary classifier's accuracy not increasing

My binary classifier DNN's accuracy seems stuck since epoch 1. I think this means that the model is not learning. Any insight on why this is happening?
Problem statement: I would like to classify a given sequence of readings for sensors (ex. [0 1 15 1 0 3]) into either 0 or 1 (0 equivalent to "idle" state, 1 equivalent to "active" state).
About the dataset: Dataset is available here
The "state" column is the target, while the rest of the columns are the features.
I've tried using SGD instead of Adam, different kernel initializers, different numbers of hidden layers and neurons per layer, and sklearn's StandardScaler instead of the MinMaxScaler. None of these approaches seemed to change the outcome.
This is the code:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam
from keras.initializers import he_uniform
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
seed = 7
random_state = np.random.seed(seed)
data = pd.read_csv('Dataset/Reformed/Model0_Dataset.csv')
X = data.drop(['state'], axis=1).values
y = data['state'].values
#min_max_scaler = MinMaxScaler()
std_scaler = StandardScaler()
# X_scaled = min_max_scaler.fit_transform(X)
X_scaled = std_scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=random_state)
# One Hot encode targets
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
enc = OneHotEncoder(categories='auto')
y_train_enc = enc.fit_transform(y_train).toarray()
y_test_enc = enc.fit_transform(y_test).toarray()
epochs = 500
batch_size = 100
model = Sequential()
model.add(Dense(700, input_shape=(X.shape[1],), kernel_initializer=he_uniform(seed)))
model.add(Dropout(0.5))
model.add(Dense(1400, activation='relu', kernel_initializer=he_uniform(seed)))
model.add(Dropout(0.5))
model.add(Dense(700, activation='relu', kernel_initializer=he_uniform(seed)))
model.add(Dropout(0.5))
model.add(Dense(800, activation='relu', kernel_initializer=he_uniform(seed)))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.summary()
early_stopping_monitor = EarlyStopping(patience=25)
# model.compile(SGD(lr=.01, decay=1e-6, momentum=0.9, nesterov=True), loss='binary_crossentropy', metrics=['accuracy'])
model.compile(Adam(lr=.01, decay=1e-6), loss='binary_crossentropy', metrics=['accuracy'], )
history = model.fit(X_train, y_train_enc, validation_split=0.2, batch_size=batch_size,
callbacks=[early_stopping_monitor], epochs=epochs, shuffle=True, verbose=1)
eval = model.evaluate(X_test, y_test_enc, batch_size=batch_size, verbose=1)
Expected results: Accuracy increasing (and loss decreasing) with each epoch (at least for the early epochs).
Actual results: The following values are fixed throughout the entire training process:
loss: 8.0118 - acc: 0.5001 - val_loss: 8.0366 - val_acc: 0.4987
You are using the wrong loss. With a two-output softmax you should use categorical_crossentropy with one-hot encoded labels. If you want to use binary_crossentropy, the output layer should be a single unit with a sigmoid activation.
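The two pairings named above (softmax with categorical cross-entropy, or one sigmoid unit with binary cross-entropy) describe the same binary model. The following standard-library sketch checks this numerically for one example, under the assumption that a single sigmoid logit z is matched with the softmax logits [0, z]; the helper functions are written here for illustration and are not the Keras implementations.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_crossentropy(y, p):
    """y is 0 or 1, p is the predicted probability of class 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def categorical_crossentropy(one_hot, probs):
    """one_hot is the encoded label, probs is the softmax output."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


z = 0.8  # arbitrary logit for class 1
bce = binary_crossentropy(1, sigmoid(z))
cce = categorical_crossentropy([0, 1], softmax([0.0, z]))
print(bce, cce)  # the two values agree up to floating-point error
```

Either pairing is therefore a reasonable choice for a binary target; what matters is that the output layer, the label encoding, and the loss match each other.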

How to avoid overfitting on a simple feed forward network

Using the pima indians diabetes dataset I'm trying to build an accurate model using Keras. I've written the following code:
# Visualize training history
from keras import callbacks
from keras.layers import Dropout
tb = callbacks.TensorBoard(log_dir='/.logs', histogram_freq=10, batch_size=32,
write_graph=True, write_grads=True, write_images=False,
embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None)
# Visualize training history
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:, 0:8]
Y = dataset[:, 8]
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu', name='first_input'))
model.add(Dense(500, activation='tanh', name='first_hidden'))
model.add(Dropout(0.5, name='dropout_1'))
model.add(Dense(8, activation='relu', name='second_hidden'))
model.add(Dense(1, activation='sigmoid', name='output_layer'))
# Compile model
model.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
# Fit the model
history = model.fit(X, Y, validation_split=0.33, epochs=1000, batch_size=10, verbose=0, callbacks=[tb])
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
After several tries, I added dropout layers to avoid overfitting, but with no luck. The following graph shows that the validation loss and the training loss diverge at a certain point.
What else could I do to optimize this network?
UPDATE:
based on the comments I got I've tweaked the code like so:
model = Sequential()
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', kernel_regularizer=regularizers.l2(0.01),
activity_regularizer=regularizers.l1(0.01), activation='relu',
name='first_input')) # added regularizers
model.add(Dense(8, activation='relu', name='first_hidden')) # reduced to 8 neurons
model.add(Dropout(0.5, name='dropout_1'))
model.add(Dense(5, activation='relu', name='second_hidden'))
model.add(Dense(1, activation='sigmoid', name='output_layer'))
Here are the graphs for 500 epochs
The first example gave a validation accuracy above 75% and the second one gave an accuracy below 65%. If you compare the losses for epochs below 100, the loss is below 0.5 for the first one and above 0.6 for the second. So how is the second case better?
To me, the second one is a case of under-fitting: the model doesn't have enough capacity to learn. The first case has an over-fitting problem because its training was not stopped when overfitting started (early stopping). If training had been stopped at, say, epoch 100, it would be by far the better model of the two.
The goal should be to obtain a small prediction error on unseen data, and for that you increase the capacity of the network up to the point beyond which overfitting starts to happen.
So how do you avoid over-fitting in this particular case? Adopt early stopping.
CODE CHANGES: to include early stopping and input scaling.
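# imports needed for the two additions
from sklearn.preprocessing import StandardScaler
from keras.callbacks import EarlyStopping
# input scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Early stopping
early_stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=1, mode='auto')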
# create model - almost the same code
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu', name='first_input'))
model.add(Dense(500, activation='relu', name='first_hidden'))
model.add(Dropout(0.5, name='dropout_1'))
model.add(Dense(8, activation='relu', name='second_hidden'))
model.add(Dense(1, activation='sigmoid', name='output_layer'))
history = model.fit(X, Y, validation_split=0.33, epochs=1000, batch_size=10, verbose=0, callbacks=[tb, early_stop])
The Accuracy and loss graphs:
First, try adding some regularization (https://keras.io/regularizers/) like with this code:
model.add(Dense(12, input_dim=12,
kernel_regularizer=regularizers.l2(0.01),
activity_regularizer=regularizers.l1(0.01)))
Also, make sure to decrease your network size i.e. you don't need a hidden layer of 500 neurons - try just taking that out to decrease the representation power and maybe even another layer if it's still overfitting. Also, only use relu activation. Maybe also try increasing your dropout rate to something like 0.75 (although it's already high). You probably also don't need to run it for so many epochs - it will just begin to overfit after long enough.
For a dataset like the Diabetes one you can use a much simpler network. Try reducing the number of neurons in your second layer. (Is there a specific reason why you chose tanh as the activation there?)
In addition, you can simply add an EarlyStopping callback to your training: https://keras.io/callbacks/
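The early-stopping idea recommended in the answers above can be sketched in plain Python. This is a simplified illustration of a patience-based rule, not the Keras EarlyStopping implementation; the function name and the toy validation losses are assumptions made for the example.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: they improve, then drift upwards.
toy_losses = [0.70, 0.62, 0.58, 0.59, 0.60, 0.61, 0.65]
print(early_stopping_epoch(toy_losses, patience=3))  # 5
```

A small patience stops training soon after the validation loss turns upwards, while a larger one tolerates noisy epochs at the cost of training a little longer.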

What's the difference between a bidirectional LSTM and an LSTM?

Can someone please explain this? I know bidirectional LSTMs have a forward and backward pass but what is the advantage of this over a unidirectional LSTM?
What is each of them better suited for?
At its core, an LSTM preserves information from inputs that have already passed through it, using the hidden state.
A unidirectional LSTM only preserves information from the past, because the only inputs it has seen are from the past.
A bidirectional LSTM runs your inputs in two directions, one from past to future and one from future to past. What distinguishes it from the unidirectional approach is that the LSTM that runs backwards preserves information from the future; by combining the two hidden states you can, at any point in time, preserve information from both past and future.
What each is suited for is a complicated question, but BiLSTMs show very good results because they can understand context better. I will try to explain through an example.
Let's say we try to predict the next word in a sentence. At a high level, what a unidirectional LSTM will see is
The boys went to ....
and it will try to predict the next word from this context alone. With a bidirectional LSTM you can also see information further down the road, for example:
Forward LSTM:
The boys went to ...
Backward LSTM:
... and then they got out of the pool
You can see that using the information from the future it could be easier for the network to understand what the next word is.
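The forward and backward passes described above can be illustrated with a toy stand-in for an LSTM. In this sketch the "hidden state" is simply the tuple of tokens seen so far, which is an assumption made to keep the example readable; a real LSTM compresses that history into a fixed-size vector.

```python
def run_direction(inputs, reverse=False):
    """Toy recurrent pass: the state at step t is the tuple of tokens seen so far."""
    steps = range(len(inputs))
    order = reversed(steps) if reverse else steps
    seen = []
    states = [None] * len(inputs)
    for t in order:
        seen.append(inputs[t])
        states[t] = tuple(seen)
    return states


def bidirectional(inputs):
    """Pair the forward and backward states at every time step."""
    forward = run_direction(inputs)
    backward = run_direction(inputs, reverse=True)
    return list(zip(forward, backward))


sentence = ["the", "boys", "went", "to", "the", "pool"]
states = bidirectional(sentence)
print(states[3])
# (('the', 'boys', 'went', 'to'), ('pool', 'the', 'to'))
```

At position 3 the forward state only knows the words up to "to", while the backward state has already seen "pool", which is the extra context a bidirectional model can use.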
Adding to Bluesummer's answer, here is how you would implement a bidirectional LSTM from scratch without calling the BiLSTM module. This might better contrast unidirectional and bidirectional LSTMs. As you can see, we merge two LSTMs to create a bidirectional LSTM.
You can merge the outputs of the forward and backward LSTMs using one of {'sum', 'mul', 'concat', 'ave'}.
left = Sequential()
left.add(LSTM(output_dim=hidden_units, init='uniform', inner_init='uniform',
forget_bias_init='one', return_sequences=True, activation='tanh',
inner_activation='sigmoid', input_shape=(99, 13)))
right = Sequential()
right.add(LSTM(output_dim=hidden_units, init='uniform', inner_init='uniform',
forget_bias_init='one', return_sequences=True, activation='tanh',
inner_activation='sigmoid', input_shape=(99, 13), go_backwards=True))
model = Sequential()
model.add(Merge([left, right], mode='sum'))
model.add(TimeDistributedDense(nb_classes))
model.add(Activation('softmax'))
sgd = SGD(lr=0.1, decay=1e-5, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd)
print("Train...")
model.fit([X_train, X_train], Y_train, batch_size=1, nb_epoch=nb_epoches, validation_data=([X_test, X_test], Y_test), verbose=1, show_accuracy=True)
Compared with an LSTM, a BLSTM or BiLSTM has two networks: one accesses past information in the forward direction and the other accesses future information in the reverse direction (wiki).
A new class Bidirectional is added as per official doc here: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional
model = Sequential()
model.add(Bidirectional(LSTM(10, return_sequences=True), input_shape=(5, 10)))
An activation function can be added like this:
model = Sequential()
model.add(Bidirectional(LSTM(num_channels,
implementation = 2, recurrent_activation = 'sigmoid'),
input_shape=(input_length, input_dim)))
A complete example using the IMDB data looks like this. The results after 4 epochs:
Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
17465344/17464789 [==============================] - 4s 0us/step
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 78s 3ms/step - loss: 0.4219 - acc: 0.8033 - val_loss: 0.2992 - val_acc: 0.8732
Epoch 2/4
25000/25000 [==============================] - 82s 3ms/step - loss: 0.2315 - acc: 0.9106 - val_loss: 0.3183 - val_acc: 0.8664
Epoch 3/4
25000/25000 [==============================] - 91s 4ms/step - loss: 0.1802 - acc: 0.9338 - val_loss: 0.3645 - val_acc: 0.8568
Epoch 4/4
25000/25000 [==============================] - 92s 4ms/step - loss: 0.1398 - acc: 0.9509 - val_loss: 0.3562 - val_acc: 0.8606
BiLSTM or BLSTM
import numpy as np
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
from keras.datasets import imdb
n_unique_words = 10000 # vocabulary size: keep only the most frequent words
maxlen = 200 # cut texts after this number of words
batch_size = 128
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=n_unique_words)
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
y_train = np.array(y_train)
y_test = np.array(y_test)
model = Sequential()
model.add(Embedding(n_unique_words, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=4,
          validation_data=(x_test, y_test))
Another use case for a bidirectional LSTM is word classification in text: the network can see both the past and the future context of a word, which makes it much better suited to classifying that word.
It can also be helpful in time series forecasting problems, such as predicting the electricity consumption of a household. A unidirectional LSTM can be used here as well, but a bidirectional LSTM may do a better job.

Resources