Model compression: Let me explain this in simple terms.
Let X_train (features) and Y_train (target) be the training data.
X_train, Y_train ------> M1 (Example: decision tree)
X_train --------> M1 ----> Y_pred (predicted Y for X_train)
Now
Case 1:
X_train, Y_pred -----------> M2 (Example: any model that is NOT a decision tree)
X_train ---------------> M2 ----------> Y_pred1
Case 2:
X_train, Y_train -----------> M2 (Example: any model that is NOT a decision tree)
X_train ---------------> M2 ----------> Y_pred2
Now I compute the AUC score for M2 in each case.
Case 1:
AUC (Y_pred, Y_pred1)
Case 2:
AUC (Y_train, Y_pred2)
The Case 1 AUC is higher than the Case 2 AUC. Case 1 is called model compression. I would like to get the intuition behind this. Of course, the AUC is calculated from predicted probabilities.
The intuition behind the result is that Y_pred is a deterministic function of X_train, so the conditional entropy of Y_pred given X_train is zero, whereas Y_train still contains label noise. M2 can therefore learn the mapping X_train -> Y_pred more easily than the mapping X_train -> Y_train in the second case.
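A minimal sketch of this setup (assuming a binary classification dataset such as sklearn's breast_cancer, with a decision tree as M1 and logistic regression as M2; these are placeholder choices, not the original experiment):

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X_train, Y_train = load_breast_cancer(return_X_y=True)

# M1: the model whose predictions we want to compress into M2
m1 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, Y_train)
Y_pred = m1.predict(X_train)  # hard labels produced by M1 on the training features

# Case 1: M2 trained on M1's predictions (model compression)
m2_case1 = LogisticRegression(max_iter=1000).fit(X_train, Y_pred)
auc_case1 = roc_auc_score(Y_pred, m2_case1.predict_proba(X_train)[:, 1])

# Case 2: M2 trained on the original labels
m2_case2 = LogisticRegression(max_iter=1000).fit(X_train, Y_train)
auc_case2 = roc_auc_score(Y_train, m2_case2.predict_proba(X_train)[:, 1])

# Case 1 is typically higher: Y_pred is a noise-free, deterministic function of X_train
print('Case 1 AUC:', auc_case1, 'Case 2 AUC:', auc_case2)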
Related
I have some data x_train, y_train as sequences of values:
x_train.shape, y_train.shape = (5027, 60, 2), (5027, 1)
I want to use the sequences in x_train and y_train (of shape (60, 2) and (1,)) to train a model so that, using 60 observations, I can try to predict one point (class 0 or 1).
Something like this:
clf = KNeighborsClassifier(n_neighbors=5, metric=dtw)
clf.fit(x_train, y_train)
# or
for i in range(x_train.shape[0]):
    clf.fit(x_train[i], y_train[i])
Of course I got an error because of the shapes, and I would like to know if there is a way to do something like this.
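One possible workaround, sketched under the assumption that plain scikit-learn is used: flatten each (60, 2) window into a 120-dimensional vector so that KNeighborsClassifier accepts it (this uses Euclidean distance rather than DTW; for a true DTW metric a time-series library such as tslearn and its KNeighborsTimeSeriesClassifier is the usual route). The random arrays below are only stand-ins with the question's shapes:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data with the shapes from the question
x_train = np.random.rand(5027, 60, 2)
y_train = np.random.randint(0, 2, size=(5027, 1))

x_flat = x_train.reshape(x_train.shape[0], -1)  # (5027, 120): one row per sequence
y_flat = y_train.ravel()                        # (5027,): 1-D labels, as sklearn expects

clf = KNeighborsClassifier(n_neighbors=5)       # Euclidean distance on the flattened windows
clf.fit(x_flat, y_flat)
print(clf.predict(x_flat[:5]))                  # predict class 0 or 1 for a few sequences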
Generally, if a single dataset is given, we use
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))
If we are doing validation on the training data, we use
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)
If the train and test data are given as separate datasets, where do I use the test dataset in the code?
Normally I will split the available data 2 times like this:
#first split: 80% (train + validation) and 20% (test)
X_model, X_test, y_model, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
#second split: 75% (train) and 25% (validation)
X_train, X_val, y_train, y_val = train_test_split(X_model, y_model, test_size=0.25, random_state=0)
Note that the first split is an 80:20 ratio and the second split is a 75:25 ratio, in order to get a 60:20:20 ratio over the whole dataset (0.80 × 0.75 = 0.60 train, 0.80 × 0.25 = 0.20 validation, 0.20 test). If you're already given separate datasets for model training/validation and for testing, you can skip the first split.
After this you can proceed to train and evaluate the model (using _train and _val):
lr = some_model()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_val)
print(confusion_matrix(y_val, y_pred))
print(accuracy_score(y_val, y_pred)) #accuracy score (VALIDATION)
print(classification_report(y_val, y_pred))
This is repeated with various models and hyperparameter settings until you find the best-performing model and hyperparameters. It is highly recommended to do cross-validation here to identify a better and more robust model.
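For example, a minimal cross-validation sketch on the model-building portion (some_model is still a placeholder, as above):

from sklearn.model_selection import cross_val_score

# 5-fold CV on the 80% model-building split, leaving X_test/y_test untouched
cv_scores = cross_val_score(some_model(), X_model, y_model, cv=5, scoring='accuracy')
print(cv_scores.mean(), cv_scores.std())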
After the winning model is found, do a test on the holdout data (using _test):
y_pred = lr.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred)) #accuracy score (TEST)
print(classification_report(y_test, y_pred))
Now you can compare the two accuracy scores (validation and test) to see if the model is overfitting or underfitting.
I'm working with document classification.
I have about 14,000 (document, category) pairs in total, and I split them into 10,000 training samples (x_train and y_train) and 4,000 test samples (x_test and y_test).
I used gensim's Doc2Vec() to vectorize the documents, trained on x_train only (not on x_test).
Here is my code applying Doc2Vec():
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn import utils

total_data = [TaggedDocument(words=t, tags=[l]) for t, l in zip(prep_texts, labels)]
total_data = utils.shuffle(total_data)  # sklearn's shuffle returns a shuffled copy, so reassign it
train_data = total_data[:10000]
test_data = total_data[10000:]

d2v = Doc2Vec(dm=0, vector_size=100, window=5,
              alpha=0.025, min_alpha=0.001, min_count=5,
              sample=0, workers=8, hs=0, negative=5)
d2v.build_vocab(train_data)
d2v.train(train_data,
          total_examples=len(train_data),
          epochs=10)
So x_train and x_test are the vectors inferred from the trained Doc2Vec() model.
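(The inference step itself is not shown; presumably it looks something like the sketch below, which is an assumed reconstruction using Doc2Vec's infer_vector() on the TaggedDocument objects built above.)

import numpy as np

# Assumed reconstruction of the missing step: infer a vector per document,
# and take the tag as the label
x_train = np.array([d2v.infer_vector(d.words) for d in train_data])
y_train = np.array([d.tags[0] for d in train_data])
x_test = np.array([d2v.infer_vector(d.words) for d in test_data])
y_test = np.array([d.tags[0] for d in test_data])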
Then I applied sklearn.svm's SVC to them, as below.
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import accuracy_score
import numpy as np  # needed for np.mean() below
clf = SVC()
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
scoring = 'accuracy'
score = cross_val_score(clf, x_train, y_train, cv=k_fold, n_jobs=8, scoring=scoring)
print(score)
print('Valid acc: {}'.format(round(np.mean(score)*100, 4)))
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print('Test acc: {}'.format(accuracy_score(y_test, y_pred)))
The result I got:
[0.916 0.912 0.908 0.923 0.901 0.922 0.921 0.908 0.923 0.924]
Valid acc: 91.58
Test acc: 0.6641691196146642
I am very confused about why I got such different scores from cross_val_score() and accuracy_score().
I will write down my thinking below:
When processing cross_val_score(), it performs cross-validation.
For each fold (assuming n_splits=10), 9/10 of the train set is used to train the classifier and the remaining 1/10 is used to validate it.
That means the held-out 1/10 of the train set is always new to the model, so in terms of newness there should be no difference between that 1/10 and the test set.
Is there anything wrong with this thinking?
Based on my current understanding, I cannot see why cross_val_score() and accuracy_score() give such different results.
Thanks in advance!!
EDIT:
I realized that when I trained Doc2Vec() on not only x_train but also x_test, I got better scores, as below:
[0.905 0.886 0.883 0.91 0.888 0.903 0.904 0.897 0.906 0.905]
Valid acc: 89.87
Test acc: 0.8413165640888414
Yes, it is natural for this to be better, but it made me realize that the problem was not the classification but the vectorization.
But as you can see, there is still about a 5% gap between the validation and test accuracy.
Now I am still wondering why this gap occurs, and I am looking for ways to improve the Doc2Vec() model.
I'm kind of new to machine learning and I am using MLPRegressor. I split my data with
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
then I build and fit the model, and run 10-fold cross-validation on the test set.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import KFold, cross_val_score

nn = MLPRegressor(hidden_layer_sizes=(100, 100), activation='relu',
                  solver='lbfgs', max_iter=500)
nn.fit(X_train, y_train)
TrainScore = nn.score(X_train, y_train)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
print("Cross-validation scores:\t{} ".format(cross_val_score(nn, X_test, y_test, cv=kfold)))
av_corss_val_score = np.mean(cross_val_score(nn, X_test, y_test, cv=kfold))
print("The average cross validation score is: {}".format(av_corss_val_score))
The problem is that the test scores I receive are very negative (-4256). What could possibly be wrong?
To keep the API consistent, sklearn treats every metric as something to maximize, whether it is classification accuracy or a regression error. Error metrics such as MSE are therefore reported with their sign flipped (for example, the scorer is named neg_mean_squared_error), so a more positive number is good and a more negative number is bad: a less negative MSE is preferred. (Note also that the default scorer for a regressor is R², which can itself become arbitrarily negative when the model fits much worse than simply predicting the mean.)
As for why it is so negative in your case, it is broadly down to one of two things: overfitting or underfitting. There are plenty of resources out there to help you from this point forward.
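To make the sign convention concrete, one option is to request the error metric explicitly; a sketch reusing the nn, X_test and y_test from the question ('neg_mean_squared_error' is a standard sklearn scorer name):

import numpy as np
from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
# sklearn returns the *negated* MSE so that "higher is better" holds for every scorer
neg_mse = cross_val_score(nn, X_test, y_test, cv=kfold, scoring='neg_mean_squared_error')
mse = -neg_mse  # flip the sign back to get the ordinary (positive) MSE per fold
print(mse.mean(), mse.std())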
I'm trying to do a contrived, one-step-ahead stock market prediction, and I'm unsure whether I am doing everything correctly, as my validation loss does not go down and my graphs look off to me.
With no dropout I get the following [loss plot omitted]. It looks like a common case of overfitting?
However, I would expect the predictions to basically mirror the training data in this case, since the model has learnt its pattern, but instead I get the following graph when plotting y_train vs. the predictions [plot omitted].
Strangely, when plotting y_test vs. the predictions it looks more accurate [plot omitted].
How could the y_train predictions be so far off with such a low training MSE, while the y_test predictions are more accurate despite a higher MSE?
When adding dropout, the model simply fails to learn, or learns much more slowly on the training set (only 10 epochs here to cut down on training time, but the pattern of the validation loss not decreasing holds) [plot omitted].
Some information about the model.
My data shapes are as follows:
x_train size is: (172544, 20, 197)
x_test size is: (83968, 20, 197)
y_train size is: (172544, 1)
y_test size is: (83968, 1)
X is set up as 197 features at timesteps [0, 1, 2, ..., 19] with a corresponding Y label at timestep [20]. This repeats for the next sequence [1, 2, 3, ..., 20] with the Y label at [21], and so on.
All data is normalized to mean 0 and standard deviation 1, with the statistics computed on the training set and then applied to the test set.
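A sketch of how such windows are typically built; the raw array names, sizes and the window length are assumptions based on the description above, not the original preprocessing code:

import numpy as np

def make_windows(features, targets, window=20):
    """Slice a (T, n_features) array into overlapping windows of length `window`,
    each labelled with the target at the following timestep."""
    X, y = [], []
    for start in range(len(features) - window):
        X.append(features[start:start + window])  # timesteps [start, ..., start + 19]
        y.append(targets[start + window])         # label at timestep start + 20
    return np.array(X), np.array(y)

# Stand-in data: T timesteps of 197 features (normalization with training-set
# statistics would happen before windowing)
raw_features = np.random.randn(1000, 197)
raw_targets = np.random.randn(1000)
x, y = make_windows(raw_features, raw_targets)
print(x.shape, y.shape)  # (980, 20, 197) and (980,)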
Code for the model:
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.optimizers import Nadam

batch_size = 512
DROPOUT = 0.0
timesteps = x_train.shape[1]
data_dim = x_train.shape[2]

model = Sequential()
model.add(LSTM(512, stateful=True, return_sequences=True, implementation=2,
               dropout=DROPOUT,
               batch_input_shape=(batch_size, timesteps, data_dim)))
model.add(LSTM(256, stateful=True, return_sequences=True, implementation=2,
               dropout=DROPOUT))
model.add(LSTM(256, stateful=True, return_sequences=False, implementation=2,
               dropout=DROPOUT))
model.add(Dense(1, activation='linear'))

nadam = Nadam()
model.compile(loss='mse',
              optimizer=nadam,
              metrics=['mse', 'mae', 'mape'])

# reduce_lr is a learning-rate callback (presumably ReduceLROnPlateau) defined elsewhere
history = model.fit(x_train, y_train, validation_data=(x_test, y_test),
                    epochs=100, batch_size=batch_size, shuffle=False, verbose=1,
                    callbacks=[reduce_lr])
EDIT: Even when using only two samples, the same thing happens:
x_train size is: (2, 2, 197)
x_test size is: (2, 2, 197)
y_train size is: (2, 1)
y_test size is: (2, 1)
y_train vs. predictions [plot omitted]