ResNet50 torchvision implementation gives low accuracy on CIFAR-10 - machine-learning

I am new to deep learning and PyTorch. I am using the ResNet-50 model from the torchvision module on CIFAR-10, and I have imported the CIFAR-10 dataset from torchvision. The accuracy on the test set is very low, and I have tried configuring the classification layers but there is no change in the accuracy. Is there something wrong with my code? Am I making a mistake in calculating the accuracy?
import torchvision
import torch
import torch.nn as nn
from torch import optim
import os
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import numpy as np
from collections import OrderedDict
import matplotlib.pyplot as plt
transformations=transforms.Compose([transforms.ToTensor(),transforms.Normalize([0.485, 0.456, 0.406],[0.229, 0.224, 0.225])])
trainset=torchvision.datasets.CIFAR10(root='./CIFAR10',download=True,transform=transformations,train=True)
testset=torchvision.datasets.CIFAR10(root='./CIFAR10',download=True,transform=transformations,train=False)
trainloader=DataLoader(dataset=trainset,batch_size=4)
testloader=DataLoader(dataset=testset,batch_size=4)
inputs,labels=next(iter(trainloader))
labels=labels.float()
inputs.size()
print(labels.type())
resnet=torchvision.models.resnet50(pretrained=True)
if torch.cuda.is_available():
    resnet=resnet.cuda()
    inputs,labels=inputs.cuda(),torch.Tensor(labels).cuda()
outputs=resnet(inputs)
outputs.size()
for param in resnet.parameters():
    param.requires_grad=False
numft=resnet.fc.in_features
print(numft)
resnet.fc=torch.nn.Sequential(nn.Linear(numft,1000),nn.ReLU(),nn.Linear(1000,10))
resnet.cuda()
resnet.train(True)
optimizer=torch.optim.SGD(resnet.parameters(),lr=0.001,momentum=0.9)
criterion=nn.CrossEntropyLoss()
for epoch in range(5):
    resnet.train(True)
    trainloss=0
    correct=0
    for x,y in trainloader:
        x,y=x.cuda(),y.cuda()
        optimizer.zero_grad()
        yhat=resnet(x)
        loss=criterion(yhat,y)
        loss.backward()
        optimizer.step()
        trainloss+=loss.item()
    print('Epoch: {} Loss: {}'.format(epoch,(trainloss/len(trainloader))))
accuracy=[]
running_corrects=0.0
for x_test,y_test in testloader:
    x_test,y_test=x_test.cuda(),y_test.cuda()
    yhat=resnet(x_test)
    _,z=yhat.max(1)
    running_corrects += torch.sum(y_test == z)
accuracy.append(running_corrects/len(testloader))
print(running_corrects/len(testloader))
accuracy=max(accuracy)
print(accuracy)
OUTPUT AFTER TRAINING/TESTING
Epoch: 0 Loss: 1.9808503997325897
Epoch: 1 Loss: 1.7917569598436356
Epoch: 2 Loss: 1.624434965057373
Epoch: 3 Loss: 1.4082191940283775
Epoch: 4 Loss: 1.1343850775527955
tensor(1.1404, device='cuda:0')
tensor(1.1404, device='cuda:0')

A couple of observations:
You may want to tune the learning rate, the number of epochs, and the batch size. For example, you are currently training the model for only five epochs, which might not be sufficient to achieve high accuracy; you can try a larger number of epochs.
Have you tried adapting the backbone (feature extractor) to the CIFAR-10 dataset by setting `param.requires_grad=True`? The original model was trained on ImageNet, so it may need to be fine-tuned on CIFAR-10.
Before evaluation/testing, set resnet.train(False) or resnet.eval() so the model knows it is in evaluation mode. Furthermore, you may want to evaluate the model inside a with torch.no_grad(): block, which speeds up inference and reduces memory usage.
Have you checked the class distribution of CIFAR-10, i.e. whether it is an imbalanced dataset or not? If it is imbalanced, you may want to use a weighted cross entropy for your loss calculation; other strategies to tackle class imbalance include over-sampling and under-sampling. (CIFAR-10 is a balanced dataset, so this is an optional EDA step here.)
Regarding test accuracy: you need to divide the total number of correct predictions by the total number of samples in the dataset, i.e. len(testloader.dataset) instead of len(testloader). If you want the accuracy in the range [0, 100], just multiply by 100. You can also print the test accuracy for each epoch to check how it changes, whereas you are currently showing only the maximum accuracy. A minimal sketch of such an evaluation is given after these observations.
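A minimal sketch of how the evaluation could look, reusing the names from the question (resnet, testloader); this is an illustration of the points above, not the only possible fix, and it can also be run at the end of every training epoch:
resnet.eval()  # evaluation mode (affects layers such as BatchNorm and Dropout)
running_corrects = 0
with torch.no_grad():  # no gradients needed for inference
    for x_test, y_test in testloader:
        x_test, y_test = x_test.cuda(), y_test.cuda()
        yhat = resnet(x_test)
        _, z = yhat.max(1)
        running_corrects += (y_test == z).sum().item()
# divide by the number of samples, not the number of batches
test_accuracy = 100.0 * running_corrects / len(testloader.dataset)
print('Test accuracy: {:.2f}%'.format(test_accuracy))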

Related

How to use GPU while training a model?

I am running code to train a ResNet model in a Kaggle notebook. I have chosen GPU as the accelerator, so I haven't made any mistake there. I am training the model using the following code:
model.cuda()
for epoch in range(10):
    model.train(True)
    trainloss=0
    for x,y in trainloader:
        x,y=x.cuda(),y.cuda()
        yhat=model(x)
        optimizer.zero_grad()
        loss=criterion(yhat,y)
        loss.backward()
        optimizer.step()
        trainloss+=loss.item()
    print('Epoch {} Loss: {}'.format(epoch,(trainloss/len(trainloader.dataset))))
model.eval()
testcorrect=0
with torch.no_grad():
    for test_x,test_y in testloader:
        test_x,test_y=test_x.cuda(),test_y.cuda()
        yhat=model(test_x)
        _,z=yhat.max(1)
        testcorrect+=(test_y==z).sum().item()
print('Model Accuracy: ',(testcorrect/len(testloader.dataset)))
Network Code:
model=torchvision.models.resnet18(pretrained=True)
num_ftrs=model.fc.in_features
model.fc=nn.Sequential(nn.Linear(num_ftrs,1000),
                       nn.ReLU(),
                       nn.Linear(1000,2)
                       )
As you can see, I have used the .cuda() function on both my model and the tensors (in the training part as well as the validation part). However, the GPU usage shown for the Kaggle notebook is 0% while my CPU usage is up to 99%. Am I missing any code required to train the model on the GPU?
It might be that your model doesn't give the GPU enough work. Try making your network more GPU-hungry, e.g. introduce a linear layer with a large number of neurons, and double-check that you then see increased GPU usage. I also noticed that the usage measurement is slightly delayed, so perhaps you give the GPU work it can finish in a fraction of a second and the usage bar never gets a chance to rise above 0%.
Maybe you could share the actual network you're using?
I can see the GPU usage going to 100% in a Kaggle notebook with a toy example like this (notice the large linear layers here):
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
trainloader = [(torch.Tensor(np.random.randn(1000, 5)), torch.Tensor([1.0] * 1000))] * 1000
model = nn.Sequential(nn.Linear(5, 2500), nn.Linear(2500, 1500), nn.Linear(1500, 1))
model.cuda()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.)
criterion = lambda x,y : ((x-y)**2).mean()
for epoch in range(10):
    for x,y in trainloader:
        x,y=x.cuda(),y.cuda()
        yhat=model(x)
        optimizer.zero_grad()
        loss=criterion(yhat,y)
        loss.backward()
        optimizer.step()
    print(epoch)

Different score on cross_val_score() and accuracy_score() on sklearn

I'm working with document classification.
I have about 14000 (document + category) pairs in total, and I split them into 10000 training samples (x_train and y_train) and 4000 test samples (x_test and y_test).
I used gensim's Doc2Vec() to vectorize the documents, training it on x_train only (not on x_test).
Here is my code applying Doc2Vec():
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn import utils
total_data = [TaggedDocument(words=t, tags=[l]) for t, l in zip(prep_texts, labels)]
utils.shuffle(total_data)
train_data = total_data[:10000]
test_data = total_data[10000:]
d2v = Doc2Vec(dm=0, vector_size=100, window=5,
              alpha=0.025, min_alpha=0.001, min_count=5,
              sample=0, workers=8, hs=0, negative=5)
d2v.build_vocab([d for d in train_data])
d2v.train(train_data,
          total_examples=len(train_data),
          epochs=10)
So x_train and x_test are vectors inferred from the trained Doc2Vec() model.
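For illustration, the inferred vectors could be produced roughly like this (a sketch only; the exact inference code is not shown in the post, and train_data/test_data are assumed to hold the TaggedDocument objects from above):
# assumed sketch: infer a vector for each tagged document with the trained model
x_train = [d2v.infer_vector(doc.words) for doc in train_data]
y_train = [doc.tags[0] for doc in train_data]
x_test = [d2v.infer_vector(doc.words) for doc in test_data]
y_test = [doc.tags[0] for doc in test_data]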
Then I applied SVC of sklearn.svm to it like below.
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import accuracy_score
clf = SVC()
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
scoring = 'accuracy'
score = cross_val_score(clf, x_train, y_train, cv=k_fold, n_jobs=8, scoring=scoring)
print(score)
print('Valid acc: {}'.format(round(np.mean(score)*100, 4)))
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print('Test acc: {}'.format(accuracy_score(y_test, y_pred)))
The result I got:
[0.916 0.912 0.908 0.923 0.901 0.922 0.921 0.908 0.923 0.924]
Valid acc: 91.58
Test acc: 0.6641691196146642
I am very confused about why I got such different scores from cross_val_score() and accuracy_score().
I will write down my thinking below:
When cross_val_score() runs, it performs cross-validation.
Then for each fold (assuming n_splits=10), 9/10 of the training set is used to train the classifier and the remaining 1/10 of the training set is used to validate it.
This means that 1/10 of the training set is always new to the model, so in terms of newness to the model there is no difference between that 1/10 of the training set and the test set.
Is there anything wrong with this thinking?
Based on this reasoning, I cannot understand why I got such different scores from cross_val_score() and accuracy_score().
Thanks in advance!!
EDIT:
I realized that when I trained Doc2Vec() not only with x_train but also with x_test, I got better scores, like below:
[0.905 0.886 0.883 0.91 0.888 0.903 0.904 0.897 0.906 0.905]
Valid acc: 89.87
Test acc: 0.8413165640888414
Yes, it is natural that this would score better, but it made me realize that the problem was not the classification but the vectorization.
But as you can see, there is still about a 5% difference between validation and test accuracy.
Now I'm still wondering why this difference occurs, and I'm looking for ways to improve the Doc2Vec() vectorization.

Accuracy remains zero while training LSTM in keras

I am trying to train an LSTM, but the training accuracy remains zero in every epoch.
I have transformed the data into multivariate time-series data and reshaped it into three dimensions.
I have also normalised the data using MinMaxScaler.
I have tried numbers of epochs from 5 to 50 and batch sizes from 25 to 200.
I have tried data samples from 1,000,000 down to 1,000, but nothing works.
Every time, the training accuracy stays at zero.
Can anyone help me understand this or suggest some further experiments?
Following is my network.
from keras.layers.core import Dense,Activation,Dropout
from keras.layers.recurrent import LSTM
from keras.models import Sequential
from keras.layers import Flatten
model = Sequential()
model.add(LSTM(50,return_sequences=True, input_shape=(X_train_values.shape[1], X_train_values.shape[2])))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(1))
model.add(Activation('linear'))
model.compile(loss='mse',optimizer='rmsprop',metrics=['accuracy'])
history = model.fit(X_train_values, y_train.values,epochs=25, batch_size=30, verbose=2, shuffle=False)
Me too. I'm a student from China; when I train an LSTM model the accuracy stays very close to zero, but the predicted values and the test values are very close.

Why is keras way slower than sklearn?

I'm dealing with a simple logistic regression problem. Each sample contains 7423 features. There are 4000 training samples and 1000 test samples in total. sklearn takes 0.01s to train the model and achieves 97% accuracy, but Keras (TensorFlow backend) takes 10s to reach the same accuracy after 50 epochs (even a single epoch is 20x slower than sklearn). Can anyone shed light on this huge gap?
Samples:
X_train: matrix of 4000*7423, 0.0 <= value <= 1.0
y_train: matrix of 4000*1, value = 0.0 or 1.0
X_test: matrix of 1000*7423, 0.0 <= value <= 1.0
y_test: matrix of 1000*1, value = 0.0 or 1.0
Sklearn code:
from sklearn.linear_model.logistic import LogisticRegression
from sklearn.metrics import accuracy_score
classifier = LogisticRegression()
# Finished in 0.01s
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print('test accuracy = %.2f' % accuracy_score(predictions, y_test))
# [output]: test accuracy = 0.97
Keras code:
# Using TensorFlow as backend
from keras.models import Sequential
from keras.layers import Dense, Activation
model = Sequential()
model.add(Dense(1, input_dim=X_train.shape[1], activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
# Finished in 10s
model.fit(X_train, y_train, batch_size=64, nb_epoch=50, verbose=0)
result = model.evaluate(X_test, y_test, verbose=0)
print('test accuracy = %.2f' % result[1])
# [output]: test accuracy = 0.97
It might be the optimizer or the loss. You are also using a non-linearity, and sklearn probably uses a different (effective) batch size under the hood.
The way I see it, you have a specific task: one tool is tailored and optimised to solve it, while the other is a more general, more complex framework that can solve it but is not optimised for it, and probably does a lot of things that are not needed for this problem, which slows everything down.
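As a rough illustration of the batch-size point (something to experiment with, not a guaranteed fix): with batch_size=64 and 4000 samples, Keras performs about 63 weight updates per epoch, i.e. over 3000 updates across 50 epochs, whereas a batch solver fits the model in a handful of passes over the full data. Training full-batch makes each epoch a single update:
# sketch only: same model as in the question, trained with one full-batch update per epoch
model.fit(X_train, y_train, batch_size=X_train.shape[0], nb_epoch=50, verbose=0)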

Why pretraining for DNN is not specified in keras?

The question is more about the training algorithm for DNNs than about the Keras software itself.
As far as I know, deep neural networks work thanks to improvements in the training algorithm. Since the 1980s, the backpropagation (BP) algorithm has been used to train neural networks, but it results in over-fitting when the network is deep. About 10 years ago, Hinton improved the approach by first pre-training the network using unlabeled data and then applying the BP algorithm. The pre-training plays an important role in avoiding over-fitting.
However, as I begin to try Keras, the MNIST DNN example below, which uses an SGD-based algorithm without any mention of a pre-training step, reaches a very high prediction accuracy. So I began to wonder where the pre-training has gone. Did I misunderstand the deep learning training algorithm (I think classical BP is almost the same as SGD)? Or has a new training technique replaced the pre-training step?
Very grateful for your help!
'''Trains a simple deep NN on the MNIST dataset.
Gets to 98.40% test accuracy after 20 epochs
(there is *a lot* of margin for parameter tuning).
2 seconds per epoch on a K520 GPU.
'''
from __future__ import print_function
import numpy as np
np.random.seed(1337) # for reproducibility
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, Adam, RMSprop
from keras.utils import np_utils
batch_size = 128
nb_classes = 10
nb_epoch = 20
# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
model = Sequential()
model.add(Dense(512, input_shape=(784,)))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(10))
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])
history = model.fit(X_train, Y_train,
                    batch_size=batch_size, nb_epoch=nb_epoch,
                    verbose=1, validation_data=(X_test, Y_test))
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])
You are wrong.
Past vs. Today
The difference between neural networks in the past and those of today is not the training algorithm. Every DNN is trained with backpropagation based on some SGD-based algorithm, exactly as in the past. (There are some newer algorithms that try to reduce parameter tuning with adaptive learning rates, like Adam, RMSprop and co.; but plain SGD is still the most common algorithm, and was used for AlphaGo, for example.)
The difference is just the size, i.e. the number of layers (depth, which is possible thanks to GPU-based evaluation), and the choice of activation functions. ReLU simply works better than the classic sigmoid or tanh activations (in terms of speed and stability).
Pre-training
I also think that pre-training was very popular 5-10 years ago, but almost nobody does it today (if you have enough data)!
Let me quote from here:
It's true that unsupervised pre-training was initially what made it possible to train deeper networks, but the last few years the pre-training approach has been largely obsoleted.
Nowadays, deep neural networks are a lot more similar to their 80's cousins. Instead of pre-training, the difference is now in the activation functions and regularisation methods used (and sometimes in the optimisation algorithm, although much more rarely).
I would say that the "pre-training era", which started around 2006, ended in the early '10s when people started using rectified linear units (ReLUs), and later dropout, and discovered that pre-training was no longer beneficial for this type of networks.
I can recommend these slides as introduction to modern Deep Learning (as starting point).
Pre-training is actually gaining a lot of traction again in the NLP community; see OpenAI's GPT. The idea is that pre-training acts as an unsupervised initialization step before fine-tuning the model with the supervised data. This works because unlabeled data is much more abundant than its labeled counterpart, and it can be exploited to derive sensible weights inside the model that capture the hidden links in the dataset's structure.
Hope that the explanation was not too goofy :)
