xgboost: Huge logloss despite reasonable accuracy

I trained an xgboost classifier on a binary classification problem. Its predictions are about 70% accurate, yet the logloss is very large at 9.13. I suspect this is because a few predictions are far off the target, but I do not understand why it happens: other people report a much better logloss (0.55 - 0.6) on the same data with xgboost.
from readCsv import x_train, y_train
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss
from xgboost import XGBClassifier
seed=7
test_size=0.09
X_train, X_test, y_train, y_test = train_test_split(
    x_train, y_train, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier(max_depth=5,
                      learning_rate=0.02,
                      objective='binary:logistic',
                      n_estimators=5000)
model.fit(X_train, y_train)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
ll = log_loss(y_test, y_pred)
print("Log_loss: %f" % ll)
print(model)
This produces the following output:
Accuracy: 73.54%
Log_loss: 9.139162
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
gamma=0, learning_rate=0.02, max_delta_step=0, max_depth=5,
min_child_weight=1, missing=None, n_estimators=5000, nthread=-1,
objective='binary:logistic', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=0, silent=True, subsample=1)
Does anyone know the reason for my high logloss? Thanks!

Solution: use model.predict_proba(), not model.predict().
This reduced the logloss from 7+ to 0.52, which is in the expected range. model.predict() was outputting values of huge magnitude like 1e18; it seems they needed to go through some function that would turn them into valid probability scores (between 0 and 1).
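The effect of feeding hard predictions into a log loss can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: `binary_log_loss` is a hypothetical helper, the labels and probabilities are made up, and the clipping constant of 1e-15 is an assumption (it was the default in older scikit-learn releases). With clipping, every misclassified hard label costs about 34.5, so the error rate from the question alone lands near the reported 9.14.

```python
import math

EPS = 1e-15  # assumed clipping constant (default in older scikit-learn)


def binary_log_loss(y_true, y_prob, eps=EPS):
    """Mean negative log-likelihood of binary labels given predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Made-up data: 7 of the 10 probabilities fall on the correct side of 0.5.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_prob = [0.8, 0.3, 0.6, 0.4, 0.2, 0.7, 0.9, 0.1, 0.3, 0.4]
y_hard = [1 if p >= 0.5 else 0 for p in y_prob]  # what predict() would give

soft_loss = binary_log_loss(y_true, y_prob)  # probabilities: moderate loss
hard_loss = binary_log_loss(y_true, y_hard)  # hard labels: huge loss

# Each wrong hard label costs about -log(EPS); scale by the error rate
# reported in the question (accuracy 73.54%).
cost_per_error = -math.log(EPS)
approx_reported = (1 - 0.7354) * cost_per_error

print("soft:", soft_loss, "hard:", hard_loss, "approx:", approx_reported)
```

Accuracy is identical for both inputs; only the loss changes, because the hard labels claim total certainty on every sample, including the wrong ones.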

Related

Different score on cross_val_score() and accuracy_score() on sklearn

I'm working on document classification.
I have about 14000 (document + category) pairs in total, and I split them into 10000 for training (x_train and y_train) and 4000 for testing (x_test and y_test).
I used gensim's Doc2Vec() to vectorize the documents; it was trained on x_train only (not on x_test).
Here is my code applying Doc2Vec():
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn import utils
total_data = [TaggedDocument(words=t, tags=[l]) for t, l in zip(prep_texts, labels)]
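# Note: sklearn.utils.shuffle returns a shuffled copy rather than shuffling
# in place, so as written total_data keeps its original order here.
utils.shuffle(total_data)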
train_data = total_data[:10000]
test_data = total_data[10000:]
d2v = Doc2Vec(dm=0, vector_size=100, window=5,
              alpha=0.025, min_alpha=0.001, min_count=5,
              sample=0, workers=8, hs=0, negative=5)
d2v.build_vocab([d for d in train_data])
d2v.train(train_data,
          total_examples=len(train_data),
          epochs=10)
So x_train and x_test are vectors inferred from the trained Doc2Vec().
Then I applied SVC from sklearn.svm to them as below.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import accuracy_score
clf = SVC()
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
scoring = 'accuracy'
score = cross_val_score(clf, x_train, y_train, cv=k_fold, n_jobs=8, scoring=scoring)
print(score)
print('Valid acc: {}'.format(round(np.mean(score)*100, 4)))
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print('Test acc: {}'.format(accuracy_score(y_test, y_pred)))
The result I got:
[0.916 0.912 0.908 0.923 0.901 0.922 0.921 0.908 0.923 0.924]
Valid acc: 91.58
Test acc: 0.6641691196146642
I am very confused about why I got such different scores from cross_val_score() and accuracy_score().
My thinking is as follows:
When cross_val_score() runs, it performs cross-validation.
For each fold (assume n_splits=10), 9/10 of the train set is used to train the classifier and the remaining 1/10 is used to validate it.
This means that 1/10 of the train set is always new to the model, so there should be no difference between that 1/10 and the test set in terms of newness to the model.
Is there anything wrong with this thinking?
Given this reasoning, I cannot understand why I got such different scores from cross_val_score() and accuracy_score().
Thanks in advance!!
EDIT:
I realized that when I trained Doc2Vec() on not only x_train but also x_test, I got better scores, as below:
[0.905 0.886 0.883 0.91 0.888 0.903 0.904 0.897 0.906 0.905]
Valid acc: 89.87
Test acc: 0.8413165640888414
Yes, it is natural for this to be better, but I realized that the problem lay not in the classification but in the vectorization.
But as you can see, there is still a 5% difference between the validation and test accuracy.
I'm still wondering why this difference occurs and looking for ways to improve the Doc2Vec().
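The fold logic described in the question can be sketched in plain Python. `k_fold_indices` is a hypothetical helper, not scikit-learn's KFold (which also supports shuffling). It shows that each validation fold is disjoint from the rows the classifier is fitted on. In the original setup, though, every one of those rows had already been seen by Doc2Vec, which was trained on all of x_train, so a validation fold was new to the SVC but not to the vectorizer. That is consistent with the improvement reported in the EDIT.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) lists for plain, unshuffled k-fold."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        valid_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, valid_idx
        start = stop


folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train_idx, valid_idx in folds:
    # The classifier never sees valid_idx while fitting on train_idx.
    assert not set(train_idx) & set(valid_idx)
    print("train:", train_idx, "valid:", valid_idx)
```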

MLPRegressor gives very negative scores

I'm kind of new to machine learning and I am using MLPRegressor. I split my data with
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Then I create and fit the model, using 10-fold cross-validation on the test set.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor
nn = MLPRegressor(hidden_layer_sizes=(100, 100), activation='relu',
                  solver='lbfgs', max_iter=500)
nn.fit(X_train, y_train)
TrainScore = nn.score(X_train, y_train)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
# run the cross-validation once and reuse the scores
cv_scores = cross_val_score(nn, X_test, y_test, cv=kfold)
print("Cross-validation scores:\t{} ".format(cv_scores))
av_cross_val_score = np.mean(cv_scores)
print("The average cross validation score is: {}".format(av_cross_val_score))
The problem is that the test scores I receive are very negative (-4256). What could possibly be wrong?
To keep a single convention, sklearn treats every score as something to maximize, whether it is classification accuracy or a regression error such as MSE (which is therefore reported negated). A more positive number is better and a more negative number is worse, so a less negative MSE is preferred. Note that cross_val_score called without a scoring argument uses the estimator's own score method, which for MLPRegressor is R^2 rather than MSE; the same "higher is better" reading applies.
As for why it is so negative in your case, it could broadly be due to two things: overfitting or underfitting. There are plenty of resources out there to help you from this point forward.
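For a scikit-learn regressor, score() returns the coefficient of determination R^2, and nothing bounds it from below. The sketch below is a plain-Python illustration with made-up numbers, and `r_squared` is a hypothetical helper rather than the library function. It shows that R^2 is 1 for perfect predictions, 0 for always predicting the mean, and strongly negative when predictions are far off the scale of the targets.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, y_true)                # every residual is zero
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
wild = r_squared(y_true, [100.0, -50.0, 80.0, -120.0])  # far off the targets

print(perfect, baseline, wild)
```

A score in the thousands below zero therefore says that the squared error is thousands of times larger than that of a constant mean predictor, which is what a network fitted on very few or unscaled samples can produce.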

DNN binary classifier's accuracy not increasing

My binary classifier DNN's accuracy seems stuck since epoch 1. I think this means that the model is not learning. Any insight on why this is happening?
Problem statement: I would like to classify a given sequence of readings for sensors (ex. [0 1 15 1 0 3]) into either 0 or 1 (0 equivalent to "idle" state, 1 equivalent to "active" state).
About the dataset: Dataset is available here
The "state" column is the target, while the rest of the columns are the features.
I've tried using SGD instead of Adam, different kernel initializers, different numbers of hidden layers and neurons per layer, and sklearn's StandardScaler instead of the MinMaxScaler. None of these approaches seemed to change the outcome.
This is the code:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam
from keras.initializers import he_uniform
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
seed = 7
np.random.seed(seed)  # seeds NumPy's global RNG; the call itself returns None
random_state = seed
data = pd.read_csv('Dataset/Reformed/Model0_Dataset.csv')
X = data.drop(['state'], axis=1).values
y = data['state'].values
#min_max_scaler = MinMaxScaler()
std_scaler = StandardScaler()
# X_scaled = min_max_scaler.fit_transform(X)
X_scaled = std_scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=random_state)
# One Hot encode targets
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
enc = OneHotEncoder(categories='auto')
y_train_enc = enc.fit_transform(y_train).toarray()
y_test_enc = enc.transform(y_test).toarray()  # reuse the encoder fitted on the training labels
epochs = 500
batch_size = 100
model = Sequential()
model.add(Dense(700, input_shape=(X.shape[1],), kernel_initializer=he_uniform(seed)))
model.add(Dropout(0.5))
model.add(Dense(1400, activation='relu', kernel_initializer=he_uniform(seed)))
model.add(Dropout(0.5))
model.add(Dense(700, activation='relu', kernel_initializer=he_uniform(seed)))
model.add(Dropout(0.5))
model.add(Dense(800, activation='relu', kernel_initializer=he_uniform(seed)))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.summary()
early_stopping_monitor = EarlyStopping(patience=25)
# model.compile(SGD(lr=.01, decay=1e-6, momentum=0.9, nesterov=True), loss='binary_crossentropy', metrics=['accuracy'])
model.compile(Adam(lr=.01, decay=1e-6), loss='binary_crossentropy', metrics=['accuracy'], )
history = model.fit(X_train, y_train_enc, validation_split=0.2, batch_size=batch_size,
callbacks=[early_stopping_monitor], epochs=epochs, shuffle=True, verbose=1)
eval = model.evaluate(X_test, y_test_enc, batch_size=batch_size, verbose=1)
Expected results: Accuracy increasing (and loss decreasing) with each epoch (at least for the early epochs).
Actual results: The following values are fixed throughout the entire training process:
loss: 8.0118 - acc: 0.5001 - val_loss: 8.0366 - val_acc: 0.4987
You are using the wrong loss. With a two-output softmax layer you should use categorical_crossentropy with one-hot encoded labels. If you want to use binary_crossentropy, the output layer should be a single unit with a sigmoid activation.
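The two valid pairings named above can be compared in a small plain-Python sketch. It is not Keras code, and the helper functions and logit values are illustrative assumptions. For two classes, a softmax over two logits with categorical cross-entropy gives the same loss as one sigmoid unit fed the logit difference with binary cross-entropy, so either pairing works as long as the loss matches the output layer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def categorical_crossentropy(one_hot, probs):
    """Loss for a multi-unit softmax output and a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


def binary_crossentropy(label, prob):
    """Loss for a single sigmoid output and a 0/1 target."""
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


logits = [0.3, 1.7]  # made-up outputs of a two-unit layer before softmax
label = 1            # true class

two_unit_loss = categorical_crossentropy([0, 1], softmax(logits))
one_unit_loss = binary_crossentropy(label, sigmoid(logits[1] - logits[0]))

print(two_unit_loss, one_unit_loss)
```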

In a SVM model, will results be viable when I decrease the test size to 0.06

I used a support vector machine model for classification on the iris data set. I used the train_test_split function to split the data set into training and testing subsets.
When test_size was 0.3 the accuracy was low; when I decreased the size of the testing subset to 0.06, the accuracy became 1, i.e. 100%. The reason seems clear: with less testing data, the amount of noise and fluctuation decreases.
My question is: we want our model to be efficient, but what value of test_size is acceptable for that? At what value of test_size will the result be viable?
Here are some lines of code from my program:
from sklearn import datasets
from sklearn import svm
import numpy as np
from sklearn import metrics
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
C=1.0
from sklearn.model_selection import train_test_split
x_train, x_test, y_train ,y_test = train_test_split(X,y,test_size=0.06, random_state=4)
svc = svm.SVC(kernel='linear', C=C).fit(x_train,y_train)
y_pred = svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
lin_svc = svm.LinearSVC(C=C).fit(x_train,y_train)
y_pred = lin_svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(x_train,y_train)
y_pred =rbf_svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
poly_svc = svm.SVC(kernel='poly',degree=3, C=C).fit(x_train,y_train)
y_pred = poly_svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
The result is 100% accuracy in all 4 cases.
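One way to judge whether such a score is trustworthy is to ask how easily it could arise by luck. The iris data set has 150 samples, so test_size=0.06 leaves 9 test samples and test_size=0.3 leaves 45. The sketch below is a rough illustration under two stated assumptions: predictions are independent, and the true accuracy is a hypothetical 0.8 (a value chosen for illustration, not measured from the code above).

```python
def prob_all_correct(true_accuracy, n_test):
    """Chance that all n_test independent predictions are right."""
    return true_accuracy ** n_test


TRUE_ACCURACY = 0.8  # hypothetical value, chosen only for illustration

p_small = prob_all_correct(TRUE_ACCURACY, 9)   # test_size=0.06 on 150 samples
p_large = prob_all_correct(TRUE_ACCURACY, 45)  # test_size=0.3 on 150 samples

print("P(100% on 9 samples): ", p_small)
print("P(100% on 45 samples):", p_large)
```

Under these assumptions a perfect score on 9 samples happens more than one time in ten, while on 45 samples it is very unlikely. A perfect result on a tiny test set therefore says little about the model, and averaging over several splits (for example with cross-validation) gives a steadier estimate.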

Why is keras way slower than sklearn?

I'm dealing with a simple logistic regression problem. Each sample contains 7423 features, and there are 4000 training samples and 1000 testing samples in total. Sklearn takes 0.01s to train the model and achieves 97% accuracy, but Keras (TensorFlow backend) takes 10s to achieve the same accuracy after 50 epochs (even a single epoch is 20x slower than sklearn). Can anyone shed light on this huge gap?
Samples:
X_train: matrix of 4000*7423, 0.0 <= value <= 1.0
y_train: matrix of 4000*1, value = 0.0 or 1.0
X_test: matrix of 1000*7423, 0.0 <= value <= 1.0
y_test: matrix of 1000*1, value = 0.0 or 1.0
Sklearn code:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
classifier = LogisticRegression()
# Finished in 0.01s
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print('test accuracy = %.2f' % accuracy_score(y_test, predictions))
[output]: test accuracy = 0.97
Keras code:
# Using TensorFlow as backend
from keras.models import Sequential
from keras.layers import Dense, Activation
model = Sequential()
model.add(Dense(1, input_dim=X_train.shape[1], activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
# Finished in 10s
model.fit(X_train, y_train, batch_size=64, epochs=50, verbose=0)
result = model.evaluate(X_test, y_test, verbose=0)
print('test accuracy = %.2f' % result[1])
[output]: test accuracy = 0.97
It might be the optimizer or the loss. You use a nonlinearity. Sklearn probably also uses a different batch size under the hood.
But the way I see it, you have a specific task: one of the tools is tailored to solve it, while the other is a more general framework that can solve it but is not optimized for it, and probably does a lot of work that this problem does not need, which slows everything down.
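Part of the gap can be made concrete by counting parameter updates. The arithmetic below uses only the numbers from the question. The interpretation is a hedged guess rather than a profile: the Keras run performs thousands of small mini-batch steps, each with its own overhead, whereas scikit-learn's LogisticRegression hands the whole problem to a batch solver (liblinear or L-BFGS, depending on the version).

```python
import math

n_samples = 4000  # training samples, from the question
batch_size = 64   # batch_size passed to model.fit
epochs = 50       # number of epochs passed to model.fit

# Keras processes one mini-batch per update; the last batch may be partial.
steps_per_epoch = math.ceil(n_samples / batch_size)
total_updates = steps_per_epoch * epochs

print("updates per epoch:", steps_per_epoch)
print("total updates:    ", total_updates)
```

Raising batch_size in model.fit reduces the number of updates per epoch and is one inexpensive thing to try when comparing the two timings.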
