I'm dealing with a simple logistic regression problem. Each sample has 7423 features; there are 4000 training samples and 1000 test samples in total. Sklearn takes 0.01s to train the model and achieves 97% accuracy, but Keras (TensorFlow backend) takes 10s to reach the same accuracy after 50 epochs (even a single epoch is 20x slower than sklearn). Can anyone shed light on this huge gap?
Samples:
X_train: matrix of 4000*7423, 0.0 <= value <= 1.0
y_train: matrix of 4000*1, value = 0.0 or 1.0
X_test: matrix of 1000*7423, 0.0 <= value <= 1.0
y_test: matrix of 1000*1, value = 0.0 or 1.0
Sklearn code:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
classifier = LogisticRegression()
# Finished in 0.01s
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print('test accuracy = %.2f' % accuracy_score(predictions, y_test))
[output]: test accuracy = 0.97
Keras code:
# Using TensorFlow as backend
from keras.models import Sequential
from keras.layers import Dense, Activation
model = Sequential()
model.add(Dense(1, input_dim=X_train.shape[1], activation='sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
# Finished in 10s
model.fit(X_train, y_train, batch_size=64, nb_epoch=50, verbose=0)
result = model.evaluate(X_test, y_test, verbose=0)
print('test accuracy = %.2f' % result[1])
[output]: test accuracy = 0.97
It might be the optimizer or the loss. You use a non-linearity, and sklearn probably uses a different batch size (or a full-batch solver) under the hood.
The way I see it, you have a specific task: one of the tools is tailor-made to solve it, while the other is a more general framework that can solve it but is not optimized for this case and probably does a lot of things that are not needed for this problem, which slows everything down.
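If you want to narrow the gap, here is a minimal sketch (my own addition, not the original poster's code) that makes the solver explicit on both sides: sklearn fits the whole design matrix with a batch solver, while the Keras model takes many small RMSprop steps over 50 epochs. Training the Keras model full-batch brings the two setups closer in spirit:
# Hedged sketch: solver/optimizer choice is the main difference; the numbers are illustrative
from sklearn.linear_model import LogisticRegression
from keras.models import Sequential
from keras.layers import Dense
# sklearn: a single call to a batch solver (lbfgs) over all 4000 samples
sk_clf = LogisticRegression(solver='lbfgs', max_iter=1000)
sk_clf.fit(X_train, y_train.ravel())
# Keras: the same single-sigmoid model, but full-batch so each epoch is one gradient step
nn = Sequential()
nn.add(Dense(1, input_dim=X_train.shape[1], activation='sigmoid'))
nn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
nn.fit(X_train, y_train, batch_size=X_train.shape[0], nb_epoch=200, verbose=0)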
Related
I'm working with document classification.
I have about 14000 (document, category) pairs in total, and I split them: 10000 for training data (x_train and y_train) and 4000 for test data (x_test and y_test).
I used gensim's Doc2Vec() to vectorize the documents, training it with x_train only (not with x_test).
Here is my code applying Doc2Vec():
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn import utils
total_data = [TaggedDocument(words=t, tags=[l]) for t, l in zip(prep_texts, labels)]
total_data = utils.shuffle(total_data)  # shuffle() returns a shuffled copy, so the result must be reassigned
train_data = total_data[:10000]
test_data = total_data[10000:]
d2v = Doc2Vec(dm=0, vector_size=100, window=5,
alpha=0.025, min_alpha=0.001, min_count=5,
sample=0, workers=8, hs=0, negative=5)
d2v.build_vocab([d for d in train_data])
d2v.train(train_data,
total_examples=len(train_data),
epochs=10)
So x_train and x_test are the vectors inferred from the trained Doc2Vec() model.
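For reference, a minimal sketch of how those vectors could be produced (assuming infer_vector() is used; the list comprehensions are illustrative, not the original post's code):
import numpy as np
# infer a 100-dimensional vector for every document with the trained model
x_train = np.array([d2v.infer_vector(d.words) for d in train_data])
y_train = np.array([d.tags[0] for d in train_data])
x_test = np.array([d2v.infer_vector(d.words) for d in test_data])
y_test = np.array([d.tags[0] for d in test_data])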
Then I applied SVC from sklearn.svm to these vectors, as below.
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import accuracy_score
import numpy as np  # needed for np.mean below
clf = SVC()
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
scoring = 'accuracy'
score = cross_val_score(clf, x_train, y_train, cv=k_fold, n_jobs=8, scoring=scoring)
print(score)
print('Valid acc: {}'.format(round(np.mean(score)*100, 4)))
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print('Test acc: {}'.format(accuracy_score(y_test, y_pred)))
The result I got:
[0.916 0.912 0.908 0.923 0.901 0.922 0.921 0.908 0.923 0.924]
Valid acc: 91.58
Test acc: 0.6641691196146642
I am very confused about why I got such different scores from cross_val_score() and accuracy_score().
I will write down my thinking below:
When cross_val_score() runs, it performs cross-validation.
For each fold (with n_splits=10), 9/10 of the training set is used to train the classifier and the remaining 1/10 is used to validate it.
That means the held-out 1/10 of the training set is always new to the classifier, so there should be no difference between that 1/10 and the test set in terms of newness to the model.
Is there anything wrong with this reasoning?
Based on my current thinking, I cannot understand why cross_val_score() and accuracy_score() give such different scores.
Thanks in advance!!
EDIT:
I realized that when I trained Doc2Vec() with not only x_train but also x_test, I could get better scores like below:
[0.905 0.886 0.883 0.91 0.888 0.903 0.904 0.897 0.906 0.905]
Valid acc: 89.87
Test acc: 0.8413165640888414
Yes, it is natural for this to score better, but it made me realize that the problem was not the classification but the vectorization.
Still, as you can see, there is a 5% difference between the validation and test accuracy.
Now I'm still wondering why this difference occurs, and I'm looking for ways to improve the Doc2Vec() model.
After Average Ensemble classification, I am getting a weird confusion matrix and even weirder metric scores.
Code:-
x = data_train[categorical_columns + numerical_columns]
y = data_train['target']
from imblearn.over_sampling import SMOTE
x_sample, y_sample = SMOTE().fit_sample(x, y.values.ravel())
x_sample = pd.DataFrame(x_sample)
y_sample = pd.DataFrame(y_sample)
# checking the sizes of the sample data
print("Size of x-sample :", x_sample.shape)
print("Size of y-sample :", y_sample.shape)
# Train-Test split.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_sample, y_sample,
test_size=0.40,
shuffle=False)
Accuracy is 99.9%, but recall, f1-score and precision are all 0. I have never faced this problem before. I used an AdaBoost classifier.
Confusion Matrix for ADB:
[[46399 25]
[ 0 0]]
Accuracy for ADB:
0.9994614854385663
Precision for ADB:
0.0
Recall for ADB:
0.0
f1_score for ADB:
0.0
Since it is an imbalanced dataset, I used SMOTE. Now I am getting the following results:
Confusion Matrix for ETC:
[[ 0 0]
[ 336 92002]]
Accuracy for ETC:
0.99636119474106
Precision for ETC:
1.0
Recall for ETC:
0.99636119474106
f1_score for ETC:
0.9981772811109906
This is happening because you have an unbalanced dataset (99.9% 0's and only 0.1% 1's). In such scenarios, using accuracy as a metric can be misleading.
You can read more about which metrics to use in such scenarios here.
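For example, a hedged sketch (assuming y_pred holds the classifier's predictions on X_test, as in the snippets above) that reports per-class precision, recall and F1 instead of plain accuracy:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=4))  # per-class precision / recall / F1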
Hi, as the answers above mention, it is because of skewed (unbalanced) data. However, I would like to give a simpler solution: use SVMs.
from sklearn.svm import SVC
model = SVC(class_weight='balanced')
model.fit(X_train, y_train)
Using class_weight='balanced' automatically gives equal importance to all classes, irrespective of the number of data points of each class in the dataset. Also, using the 'rbf' kernel in SVM can give really good accuracy.
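As a hedged illustration of what 'balanced' does (the class counts below are made up, not taken from the dataset above): scikit-learn weights each class as n_samples / (n_classes * class_count), so the minority class is up-weighted accordingly.
import numpy as np
y = np.array([0] * 990 + [1] * 10)        # illustrative 99:1 imbalance
weights = len(y) / (2 * np.bincount(y))   # n_samples / (n_classes * count per class)
print(dict(enumerate(weights)))           # the minority class gets a ~99x larger weight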
I used a support vector machine model for classification on the iris dataset. I used the train_test_split function to split the dataset into training and testing subsets.
When test_size was 0.3 the accuracy was low, and when I decreased the size of the testing subset to 0.06 the accuracy became 1, i.e. 100%. The reason seems clear: with less testing data, there is less noise and fluctuation in the estimate.
My question is: we want our model to be reliable, so what value of test_size is acceptable? At what value of test_size will the result be trustworthy?
Here are some lines of code from my program:
from sklearn import datasets
from sklearn import svm
import numpy as np
from sklearn import metrics
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
C=1.0
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
x_train, x_test, y_train ,y_test = train_test_split(X,y,test_size=0.06, random_state=4)
svc = svm.SVC(kernel='linear', C=C).fit(x_train,y_train)
y_pred = svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
lin_svc = svm.LinearSVC(C=C).fit(x_train,y_train)
y_pred = lin_svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(x_train,y_train)
y_pred =rbf_svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
poly_svc = svm.SVC(kernel='poly',degree=3, C=C).fit(x_train,y_train)
y_pred = poly_svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
The result is 100% accuracy in all 4 cases.
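A hedged sketch of one way to get a more stable estimate than a single tiny test split (my own addition, not from the original post): average the accuracy over k folds so that every sample is used for testing exactly once.
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score
iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
scores = cross_val_score(svm.SVC(kernel='linear', C=1.0), X, y, cv=10)
print('mean accuracy: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))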
I'm trying to reproduce the model in this WildML - Implementing a Neural Network From Scratch tutorial, but using Keras instead. I've tried to use all of the same configurations as the tutorial, but I keep getting a linear decision boundary even after tweaking the number of epochs, batch size, activation functions, and number of units in the hidden layer:
Here's my code:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.utils.visualize_util import plot
from keras.utils.np_utils import to_categorical
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn import datasets, linear_model
# Build model
model = Sequential()
model.add(Dense(input_dim=2, output_dim=3, activation="tanh", init="normal"))
model.add(Dense(output_dim=2, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
# Train
np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise=0.20)
y_binary = to_categorical(y)
model.fit(X, y_binary, nb_epoch=100)
# Helper function to plot a decision boundary.
# If you don't fully understand this function don't worry, it just generates the contour plot below.
def plot_decision_boundary(pred_func):
# Set min and max values and give it some padding
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = 0.01
# Generate a grid of points with distance h between them
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict the function value for the whole grid
Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the contour and training examples
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)
# Predict and plot
plot_decision_boundary(lambda x: model.predict_classes(x, batch_size=200))
plt.title("Decision Boundary for hidden layer size 3")
plt.show()
I believe I figured out the problem. If I remove np.random.seed(0) and train for 2000 epochs, the output isn't always linear and occasionally reaches 90%+ accuracy:
It must be that np.random.seed(0) led to the parameters being initialized poorly, and since I was fixing the random seed I saw the same graph every time.
I think I solved this problem, but I do not know why the fix works. If you change the output layer's activation function to 'sigmoid' instead of 'softmax', the model will work.
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
With this, I can achieve an accuracy of 95% or greater. If I leave the code above with softmax, the linear classifier remains.
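For completeness, a hedged usage sketch (assuming the same make_moons data as in the question): with a single sigmoid output, the raw 0/1 labels are passed directly instead of the one-hot y_binary.
# y is the raw 0/1 vector from make_moons, not to_categorical(y)
model.fit(X, y, nb_epoch=2000, verbose=0)
print(model.evaluate(X, y, verbose=0))  # [loss, accuracy]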
The problem is more about the training algorithm for DNNs than about the software (Keras).
As far as I know, deep neural networks work due to improvements in the training algorithm. Since the 1980s, the BP algorithm has been used to train neural networks, but it results in over-fitting when the network is deep. About 10 years ago, Hinton improved the algorithm by first pre-training the network using unlabeled data and then applying BP. Pre-training plays an important role in avoiding over-fitting.
However, as I began to try Keras, the MNIST DNN example below, which uses an SGD-style algorithm without any mention of a pre-training step, reaches a very high prediction accuracy. So I began to wonder: where has the pre-training gone? Did I misunderstand the deep learning training algorithm (I think classical BP is almost the same as SGD)? Or has a new training technique replaced the pre-training process?
Very grateful for your help!
'''Trains a simple deep NN on the MNIST dataset.
Gets to 98.40% test accuracy after 20 epochs
(there is *a lot* of margin for parameter tuning).
2 seconds per epoch on a K520 GPU.
'''
from __future__ import print_function
import numpy as np
np.random.seed(1337) # for reproducibility
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, Adam, RMSprop
from keras.utils import np_utils
batch_size = 128
nb_classes = 10
nb_epoch = 20
# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
model = Sequential()
model.add(Dense(512, input_shape=(784,)))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(10))
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
optimizer=RMSprop(),
metrics=['accuracy'])
history = model.fit(X_train, Y_train,
batch_size=batch_size, nb_epoch=nb_epoch,
verbose=1, validation_data=(X_test, Y_test))
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])
You are wrong.
Past vs. Today
The difference between neural networks in the past and the ones today is not the training algorithm. Every DNN is trained with backpropagation based on some SGD-style algorithm, exactly like in the past. (There are some newer algorithms that try to reduce parameter tuning with adaptive learning rates, like Adam, RMSprop and co.; but plain SGD is still the most common algorithm and was used for AlphaGo, for example.)
The difference is just the size, i.e. the number of layers (depth, which is possible thanks to GPU-based evaluation), and the choice of activation functions. ReLU simply works better than the classic sigmoid or tanh activations (regarding speed and stability).
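As a hedged illustration of that point (the learning rate and momentum here are illustrative values, not taken from the example above), the MNIST model in the question compiles and trains just as well with plain SGD; only the optimizer line changes:
from keras.optimizers import SGD
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=0.01, momentum=0.9),
              metrics=['accuracy'])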
Pre-training
I also think that pre-training was very popular 5-10 years ago, but nobody is doing it today (if you have enough data)!
Let me quote from here:
It's true that unsupervised pre-training was initially what made it possible to train deeper networks, but the last few years the pre-training approach has been largely obsoleted.
Nowadays, deep neural networks are a lot more similar to their 80's cousins. Instead of pre-training, the difference is now in the activation functions and regularisation methods used (and sometimes in the optimisation algorithm, although much more rarely).
I would say that the "pre-training era", which started around 2006, ended in the early '10s when people started using rectified linear units (ReLUs), and later dropout, and discovered that pre-training was no longer beneficial for this type of networks.
I can recommend these slides as introduction to modern Deep Learning (as starting point).
Pre-training is actually gaining a lot of traction again in the NLP community; see OpenAI's GPT. The idea is that pre-training acts as an unsupervised initialization step before fine-tuning the model on supervised data. This works because unlabeled data is much more abundant than its labeled counterpart, and it can be exploited to derive sensible weights inside the model that capture the hidden structure of the dataset.
Hope that the explanation was not too goofy :)