Confusing error message from calling DecisionTreeClassifier in scikit-learn - machine-learning

Below is the second half of my code, where I call DecisionTreeClassifier in scikit-learn, but I am getting this error:
Y_pred = DecisionTreeClassifier.predict(x_test)
TypeError: predict() missing 1 required positional argument: 'X'
I can't make sense of why I would get this error message, since I am clearly passing x_test.
model = DecisionTreeClassifier(min_samples_leaf=100)
model.fit(x_train,y_train)
scores = cross_val_score(model, x_train,y_train, cv=10)
print('mean: {:.3f} (std: {:.3f})'.format(scores.mean(), scores.std()), end='\n\n')
#make prediction
Y_pred = DecisionTreeClassifier.predict(x_test)
acc_train = accuracy_score(train[y_train],Y_pred)
print ('Train Accuracy: %f'%acc_train)

DecisionTreeClassifier is a class. To use it, you need to instantiate it. You did this in the first line of your code: model = DecisionTreeClassifier(min_samples_leaf=100). Now you need to use this instance (i.e. model), which you trained on the training data, for prediction:
Y_pred = model.predict(x_test)
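For completeness, a minimal sketch of the corrected tail of the script, assuming accuracy_score is already imported and that y_test holds the labels matching x_test (y_test is an assumption; the original snippet compares against train[y_train]):
Y_pred = model.predict(x_test)             # call predict on the fitted instance, not the class
acc_test = accuracy_score(y_test, Y_pred)  # y_test: assumed test labels aligned with x_test
print('Test Accuracy: %f' % acc_test)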

Related

Model training when calculating training error and cross-validation error

I want to calculate the training error and the cross-validation error for the same training set.
Model: RandomForestRegressor
Metrics: training error -> RMSE, cross-validation error -> k-fold cross-validation
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import numpy as np

# cross-validation error: RMSE per fold, then the mean
forest_reg = RandomForestRegressor()
scores = cross_val_score(forest_reg, X_train_transformed, y_train,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
print(tree_rmse_scores.mean())

# training error: fit on the full training set and score on it
forest_reg = RandomForestRegressor()
forest_reg.fit(X_train_transformed, y_train)
error = mean_squared_error(y_train, forest_reg.predict(X_train_transformed))
print(error)
My understanding is that the model has to be trained explicitly when calculating the training error, but for cross_val_score the model is fitted on k-1 folds and validated on the remaining fold, k times, so no explicit fitting is required before calling cross_val_score().
Is there any issue with the above code? I'm getting a much larger training error than CV error. Is my understanding incorrect?
You forgot to take the square root of the MSE in the approach without CV:
forest_reg = RandomForestRegressor()
forest_reg.fit(X_train_transformed, y_train)
error = mean_squared_error(y_train, forest_reg.predict(X_train_transformed))
print(np.sqrt(error)) # <-- take the square root here or already above
The difference was so huge because you compared RMSE with MSE. Now you should see what you expected.
How much larger is it than the validation error?
Isn't it natural that the training error is larger than the validation error, whether it is plain validation or cross-validation?
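As an aside on the question's point that cross_val_score fits internally: it does roughly the following under the hood. A simplified sketch, assuming X_train_transformed and y_train are NumPy arrays and reusing the names and imports from the code above:
from sklearn.base import clone
from sklearn.model_selection import KFold

fold_rmses = []
for train_idx, val_idx in KFold(n_splits=10).split(X_train_transformed):
    fold_model = clone(forest_reg)  # fresh, unfitted copy of the estimator for each fold
    fold_model.fit(X_train_transformed[train_idx], y_train[train_idx])
    preds = fold_model.predict(X_train_transformed[val_idx])
    fold_rmses.append(np.sqrt(mean_squared_error(y_train[val_idx], preds)))
print(np.mean(fold_rmses))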

How to check the predicted output during fitting of the model in Keras?

I am new to Keras and have learned how to fit and evaluate a model.
After evaluating the model, one can see the actual predictions made by the model.
I am wondering: is it also possible to see the predictions during fitting in Keras? So far I can't find any code that does this.
Since this question doesn't specify "epochs", and since using callbacks may represent extra computation, I don't think it's exactly a duplicate.
With TensorFlow, you can use a custom training loop with eager execution turned on. A simple tutorial for creating a custom training loop: https://www.tensorflow.org/tutorials/eager/custom_training_walkthrough
Basically you will:
# transform your data into a Dataset:
dataset = tf.data.Dataset.from_tensor_slices(
    (x_train, y_train)).shuffle(some_buffer).batch(batchSize)
# the above is buggy in some versions regarding shuffling; you may need to shuffle
# again between each epoch

# create an optimizer
optimizer = tf.keras.optimizers.Adam()

# create an epoch loop:
for e in range(epochs):
    # create a batch loop
    for i, (x, y_true) in enumerate(dataset):
        # create a tape to record actions
        with tf.GradientTape() as tape:
            # take the model's predictions
            y_pred = model(x)
            # calculate loss
            loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
        # calculate gradients
        gradients = tape.gradient(loss, model.trainable_weights)
        # apply gradients
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))
You can use the y_pred variable for anything, including getting its numpy value with numpy_pred = y_pred.numpy().
The tutorial gives some more details about metrics and validation loop.
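Alternatively, if you stick with model.fit, here is a minimal sketch of a custom Keras callback that prints predictions at the end of each epoch; x_sample is an assumed small array you set aside for monitoring:
import tensorflow as tf

class PredictionLogger(tf.keras.callbacks.Callback):
    def __init__(self, x_sample):
        super().__init__()
        self.x_sample = x_sample

    def on_epoch_end(self, epoch, logs=None):
        # self.model is attached automatically by Keras during fit()
        preds = self.model.predict(self.x_sample, verbose=0)
        print("epoch %d predictions: %s" % (epoch, preds.ravel()[:5]))

# usage (hypothetical names): model.fit(x_train, y_train, epochs=epochs,
#                                       callbacks=[PredictionLogger(x_train[:5])])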

grid search random forests "RandomForestClassifier instance is not fitted yet"

I'm trying to do a grid search on a random forest classifier, testing different PCA n_components and n_estimators values:
model_rf = RandomForestClassifier()
pca_rf = Pipeline([('pca', PCA()), ('rf', RandomForestClassifier())])
param_grid_rf = [{
    'pca__n_components': [20],
    'rf__n_estimators': [5]
}]
grid_cv_rf = GridSearchCV(estimator=pca_rf, cv=5, param_grid=param_grid_rf)
grid_cv_rf.fit(x_train, y_train1)
test_pca_evaluate = pca.transform(x_test)
y_pred = model_rf.predict(test_pca_evaluate)  # error here
In the last line I get an error "This RandomForestClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this method."
It's a pretty straightforward error - the RandomForestClassifier whose method you're calling hasn't been fit yet, meaning you haven't called model_rf.fit. That object isn't fitted by your grid_cv_rf object.
I think what you want is grid_cv_rf.predict(x_test), because that grid_cv_rf object does both your PCA as well as RF fitting.
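A minimal sketch of the corrected final lines, using the fitted search object; the pipeline applies the PCA transform itself, so the separate pca.transform call is not needed (y_test below is an assumed set of test labels for scoring):
# predict through the best pipeline found (and refitted) by the grid search
y_pred = grid_cv_rf.predict(x_test)

# or pull out the refitted best pipeline explicitly
best_model = grid_cv_rf.best_estimator_
print(best_model.score(x_test, y_test))  # y_test: assumed test labels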

Difference between cross_val_score and another way of calculating accuracy

I tried to calculate the accuracy and was puzzled by the fact that cross_val_score gives a rather lower result than comparing the predicted results with the correct labels directly.
The first way of counting gives
[0.8033333333333333, 0.7908333333333334, 0.8033333333333333, 0.7925, 0.8066666666666666]:
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

kf = KFold(shuffle=True, n_splits=5)
scores = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores.append(np.sum(y_pred == y_test) / len(y_test))
The second way gives array([0.46166667, 0.53583333, 0.40916667, 0.44666667, 0.3775]):
model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
cross_val_score(model, X, y, cv = 5, scoring='accuracy')
What's my mistake?
cross_val_score will use a StratifiedKFold cv iterator when not specified otherwise. A StratifiedKFold keeps the ratio of classes the same in the train and test splits. For more explanation, see my other answer here:
https://stackoverflow.com/a/48314533/3374996
On the other hand, in your first approach you are using KFold, which will not preserve the class balance, and in addition you are shuffling the data.
So in each fold, the two approaches see different data, and hence produce different results.
The low score in cross_val_score is probably because you are providing the complete data to it instead of splitting it into training and test sets. This generally leads to leakage of information, which results in your model giving incorrect predictions. See this post for more explanation.
References
Learn the right way to validate models
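To check how much of the gap comes from the splitting strategy alone, one option is to pass the same splitter to cross_val_score that the manual loop uses; a minimal sketch, assuming X and y as above:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
kf = KFold(shuffle=True, n_splits=5, random_state=0)  # same kind of shuffled splits as the manual loop
print(cross_val_score(model, X, y, cv=kf, scoring='accuracy'))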

Tensorflow low train/test accuracy

I restored a pre-trained model in TensorFlow 1.2 to do some testing work. I assumed the model was well-trained since the loss decreased to a very low value (0.0001). However, with either the testing samples or the training samples, the accuracy op gives me a value which is almost 0. Is this because I'm using the wrong accuracy function, or is the model itself the problem?
Here is the accuracy function; test_image below is a batch with a single test sample, and test_image_label is a single label:
correct_prediction = tf.equal(tf.argmax(GoogleNet(test_image), 1), tf.argmax(test_image_label, 0))
accuracy = tf.cast(correct_prediction, tf.float32)

with Session() as sess:
    accuracy_vector = []
    for num in range(len(testnames)):
        accuracy_vector.append(sess.run(accuracy, feed_dict={keep_prob: 1.0}))
    print(accuracy_vector)

    mean_accuracy = sess.run(tf.divide(tf.add_n(accuracy_vector), len(testnames)))
    print("test accuracy %g" % mean_accuracy)
The model is defined as GoogleNet(data) above; it is a function that returns the logits of the input batch. The training was done like this:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=train_label_batch, logits=GoogleNet(train_batch)))
train_step = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(cost, global_step=global_step)
The train_step is run in every iteration. I think it is worth noting that after restoring the model, I cannot run print(GoogleNet(test_image).eval(feed_dict={keep_prob: 1.0})) in the session, with which I intended to take a look at the output of the model. It returns this error:
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value Variable_213
[[Node: Variable_213/read = Identity[T=DT_FLOAT, _class=["loc:@Variable_213"], _device="/job:localhost/replica:0/task:0/cpu:0"](Variable_213)]]
