Why is my explained variance a negative value for regression models?

This is the code I'm using to compare performance metrics of different regression models on my time-series data (basically I'm trying to predict certain values based on the month and day of the year):
import numpy as np
import sklearn.metrics as metrics

def regression_results(y_true, y_pred):
    predictions = y_pred
    test_labels = y_true
    errors = abs(predictions - test_labels)
    # Print out the mean absolute error (MAE)
    print('Mean Absolute Error:', round(np.mean(errors), 2))
    # Calculate mean absolute percentage error (MAPE)
    mape = 100 * (errors / test_labels)
    # Calculate and display accuracy
    accuracy = 100 - np.mean(mape)
    print('Accuracy:', round(accuracy, 2), '%.')
    # Regression metrics
    explained_variance = metrics.explained_variance_score(y_true, y_pred)
    mean_absolute_error = metrics.mean_absolute_error(y_true, y_pred)
    mse = metrics.mean_squared_error(y_true, y_pred)
    mean_squared_log_error = metrics.mean_squared_log_error(y_true, y_pred)
    median_absolute_error = metrics.median_absolute_error(y_true, y_pred)
    r2 = metrics.r2_score(y_true, y_pred)
    print('explained_variance: ', round(explained_variance, 4))
    print('mean_squared_log_error: ', round(mean_squared_log_error, 4))
    print('r2: ', round(r2, 4))
    print('MAE: ', round(mean_absolute_error, 4))
    print('MSE: ', round(mse, 4))
    print('RMSE: ', round(np.sqrt(mse), 4))
These are the results I'm getting for the RandomForestRegressor model (all other regression models show similar results, including the negative explained variance value).
Mean Absolute Error: 0.02
Accuracy: 98.41 %.
explained_variance: -0.4901
mean_squared_log_error: 0.0001
r2: -0.5035
MAE: 0.0163
MSE: 0.0004
RMSE: 0.0205
Does this mean my data is bad?
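For context: explained_variance_score and r2_score drop below zero whenever the model's errors are larger than those of a constant prediction at the mean of y_true, i.e. the model does worse than always predicting the average. A minimal sketch with made-up numbers (not your data):

import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

y_true = np.array([0.10, 0.20, 0.30, 0.40])

# Predictions that roughly track the trend
good_pred = np.array([0.12, 0.18, 0.33, 0.38])
# Predictions whose errors are larger than just predicting the mean (0.25)
bad_pred = np.array([0.30, 0.10, 0.45, 0.15])

print(explained_variance_score(y_true, good_pred), r2_score(y_true, good_pred))  # both positive
print(explained_variance_score(y_true, bad_pred), r2_score(y_true, bad_pred))    # both negative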

Related

LSTM regression model flat prediction

This is a time-series regression problem with battery capacity (SOC) as the output and a single input variable, voltage; the relation is non-linear.
The LSTM model's prediction on the test data always comes out as a semi-flat line, probably around the mean of the output variable in the training data.
This is an example of predicted vs test set output values, with the following model parameters:
(Window size: 10, batch size: 256, LSTM nodes: 16)
Prediction of the test data
The data had been normalized and down-sampled to 1 s and later to 3 s; the original sampling rate was 10 Hz.
I suspected the voltage fluctuation was the problem, but sampling at 3 seconds didn't result in a noticeable improvement.
Here are the data after being down-sampled to 3 seconds:
Normalized Training Data ; Y:SOC, X: Voltage
Normalized Test Data ; Y:SOC, X: Voltage
I've tried many changes to the model and learning parameters, as listed below, but the behavior is still the same.
That's why I think it's not a parameter-tuning issue; rather, the model is not learning at all.
LSTM layer: always single, followed by Dense with no options.
LSTM nodes: [4, 8, 16, 32]
Epochs: [16, 32, 64, 128]
Window size (input vector depth): [8, 32, 64, 128]
Batch size: [32, 64, 128, 256]
Learning rate: [0.0005, 0.0001, 0.001]
Optimizer: Adam, options: [none, clipnorm=1, clipvalue=0.5]
Model specification Code:
backend.clear_session()
model1 = Sequential()
model1.add(LSTM(16,input_shape=(win_sz, features_cnt) )) # stateless
model1.add(layers.Dense(1))
model1.summary()
Model training and validation Code:
n_epochs = 12
iterations = tr_samples_sh_cnt // batch_sz_tr
loss = tf.keras.losses.MeanAbsoluteError()
optimizer = tf.optimizers.Adam(learning_rate = 0.001)
loss_history = []

#tf.function
def train_model_on_batch():
    start = epoch * batch_sz_tr
    X_batch = df_feat_tr_3D[start:start+batch_sz_tr, :, :]
    y_batch = df_SOC_tr_2D[start:start+batch_sz_tr, :]
    with tf.GradientTape() as tape:
        current_loss = loss(model1(X_batch), y_batch)
    gradients = tape.gradient(current_loss, model1.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model1.trainable_variables))
    return current_loss

for epoch in range(n_epochs+1):
    for iteration in range(iterations):
        current_loss = train_model_on_batch()
    if epoch % 1 == 0:
        loss_history.append(current_loss.numpy())
        print("{}. \t\tLoss: {}".format(epoch, loss_history[-1]))

print('\nTraining complete.')

P_test = model1.predict(df_feat_test_3D)
After adding a sigmoid activation function to both the LSTM and Dense layers, a very small change was observed, but the result was still far from a reasonable fit.
Prediction of the test data after adding activation function
The problem was the activation function, as @Dr. Snoopy recommended.
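A sketch of what that change looks like in the model specification above (assuming the same win_sz and features_cnt variables); since SOC is normalized, a sigmoid on the output layer also keeps predictions in [0, 1]:

from tensorflow.keras import Sequential, backend, layers
from tensorflow.keras.layers import LSTM

# Same model as above, with explicit sigmoid activations in both layers
backend.clear_session()
model1 = Sequential()
model1.add(LSTM(16, activation='sigmoid', input_shape=(win_sz, features_cnt)))  # stateless
model1.add(layers.Dense(1, activation='sigmoid'))
model1.summary()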

The stacking model: I want to see the recall and precision results

In the stacking model, I want to see the recall and precision results. I have tried many methods and have not found a way to get them. I managed to get recall and precision for another model, but I'm stuck with the stacking model. A little help would go a long way.
estimator = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('dec_tree', dec_tree),
    ('knn', knn),
    ('xgb', xgb),
    ('ext', ext),
    ('grad', grad),
    ('hist', hist)]

# build stack model
stack_model = StackingClassifier(
    estimators=estimator, final_estimator=LogisticRegression())

# train stack model
stack_model.fit(x_train, y_train)

# make predictions
y_train_pred = stack_model.predict(x_train)
y_test_pred = stack_model.predict(x_test)

# training set performance
stack_model_train_accuracy = accuracy_score(y_train, y_train_pred)
stack_model_train_f1 = f1_score(y_train, y_train_pred, average='weighted')

# testing set performance
stack_model_test_accuracy = accuracy_score(y_test, y_test_pred)
stack_model_test_f1 = f1_score(y_test, y_test_pred, average='weighted')

# print
print('Model Performance For Training Set')
print('- Accuracy: %s' % stack_model_train_accuracy)
print('- f1: %s' % stack_model_train_f1)
print('______________________________________')
print('Model Performance For Testing Set')
print('- Accuracy: %s' % stack_model_test_accuracy)
print('- f1: %s' % stack_model_test_f1)
Up to here it works, but I need the recall and precision. If I compute them the same way I computed the accuracy and F1 score, the results come out wrong, and if I use classification_report I get an error.
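For what it's worth, precision_score and recall_score take the same arguments as f1_score, including average='weighted' for multi-class targets; a minimal sketch, assuming the same y_test and y_test_pred arrays as above:

from sklearn.metrics import precision_score, recall_score, classification_report

# Weighted precision/recall, mirroring the f1_score calls above
stack_model_test_precision = precision_score(y_test, y_test_pred, average='weighted')
stack_model_test_recall = recall_score(y_test, y_test_pred, average='weighted')
print('- Precision: %s' % stack_model_test_precision)
print('- Recall: %s' % stack_model_test_recall)

# Per-class breakdown (precision, recall, f1-score, support) in one call
print(classification_report(y_test, y_test_pred))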

Why is scikit-learn SVM classifier cross validation so slow?

I am trying to compare multiple classifiers on a dataset that I have. To get reliable accuracy scores for the classifiers I am now performing 10-fold cross-validation for each classifier. This goes well for all of them except SVM (both linear and RBF kernels). The data is loaded like this:
dataset = pd.read_csv("data/distance_annotated_indels.txt", delimiter="\t", header=None)
X = dataset.iloc[:, [5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Cross-validation for, say, a Random Forest works fine:
start = time.time()
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
cv = ShuffleSplit(n_splits=10, test_size=0.2)
scores = cross_val_score(classifier, X, y, cv=10)
print(classification_report(y_test, y_pred))
print("Random Forest accuracy after 10 fold CV: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2) + ", " + str(round(time.time() - start, 3)) + "s")
Output:
             precision    recall  f1-score   support

          0       0.97      0.95      0.96      3427
          1       0.95      0.97      0.96      3417

avg / total       0.96      0.96      0.96      6844
Random Forest accuracy after 10 fold CV: 0.92 (+/- 0.06), 90.842s
However, for SVM this process takes ages (I waited for 2 hours, still nothing). The sklearn website does not make me any wiser. Is there something I should be doing differently for SVM classifiers? The SVM code is as follows:
start = time.time()
classifier = SVC(kernel = 'linear')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
scores = cross_val_score(classifier, X, y, cv=10)
print(classification_report(y_test, y_pred))
print("Linear SVM accuracy after 10 fold CV: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2) + ", " + str(round(time.time() - start, 3)) + "s")
If you have a lot of samples the computational complexity of the problem gets in the way, see Training complexity of Linear SVM.
Consider playing with the verbose flag of cross_val_score to see more logs about progress. Also, with n_jobs set to a value > 1 (or even using all CPUs with n_jobs set to -1, if memory allows) you could speed up computation via parallelization. http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html can be useful to evaluate these options.
If performance is poor I'd consider reducing the value of cv (see https://stats.stackexchange.com/questions/27730/choice-of-k-in-k-fold-cross-validation for a discussion on this)
You can also control the runtime by changing max_iter. If it is set to -1 it can run indefinitely, depending on the solution space. Set some integer value, say 10000, as a stopping criterion.
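A sketch putting these suggestions together (the max_iter value and verbosity level are illustrative, not tuned):

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Cap the solver iterations and parallelize the folds across all CPUs;
# verbose prints per-fold progress so you can see it is not stuck.
classifier = SVC(kernel='linear', max_iter=10000)
scores = cross_val_score(classifier, X, y, cv=10, n_jobs=-1, verbose=2)
print("Linear SVM accuracy after 10 fold CV: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))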
Alternatively, you can try an optimized SVM implementation, for example scikit-learn-intelex: https://github.com/intel/scikit-learn-intelex
First install the package:
pip install scikit-learn-intelex
and then add this to your Python script:
from sklearnex import patch_sklearn
patch_sklearn()

Difference between Tensorflow and Scikitlearn log_loss function implementation

Hi, I am trying to get into TensorFlow and feeling a bit dumb.
Does log_loss in TF differ from sklearn's?
Here are some lines from my code showing how I am calculating it:
from sklearn.metrics import log_loss

tmp = np.array(y_test)
y_test_t = np.array([tmp, -(tmp-1)]).T[0]

tf_log_loss = tf.losses.log_loss(predictions=tf.nn.softmax(logits), labels=tf_y)

with tf.Session() as sess:
    # training
    a = sess.run(tf.nn.softmax(logits), feed_dict={tf_x: xtest, keep_prob: 1.})
    print(" sk.log_loss: ", log_loss(y_test, a, eps=1e-7))
    print(" tf.log_loss: ", sess.run(tf_log_loss, feed_dict={tf_x: xtest, tf_y: y_test_t, keep_prob: 1.}))
Output I get
Epoch 7, Loss: 0.4875 Validation Accuracy: 0.818981
sk.log_loss: 1.76533018874
tf.log_loss: 0.396557
Epoch 8, Loss: 0.4850 Validation Accuracy: 0.820738
sk.log_loss: 1.77217639627
tf.log_loss: 0.393351
Epoch 9, Loss: 0.4835 Validation Accuracy: 0.823374
sk.log_loss: 1.78479079656
tf.log_loss: 0.390572
It seems that while tf.log_loss converges, sk.log_loss diverges.
I had the same problem. After looking up the source code of tf.losses.log_loss, its key lines show what is going on:
losses = - math_ops.multiply(labels, math_ops.log(predictions + epsilon))
- math_ops.multiply((1 - labels), math_ops.log(1 - predictions + epsilon))
It is binary log-loss (i.e. every class is considered non-exclusive) rather than multi-class log-loss.
As I worked with probabilities (rather than logits), I couldn't use tf.nn.softmax_cross_entropy_with_logits (though, I could have applied logarithm).
My solution was to implement log-loss by hand:
loss = tf.reduce_sum(tf.multiply(- labels, tf.log(probs))) / len(probs)
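To make the difference concrete, here is a small NumPy sketch (made-up probabilities for a single 3-class sample, not taken from the question) comparing the two formulas:

import numpy as np

labels = np.array([0., 1., 0.])    # one-hot true label
probs = np.array([0.2, 0.7, 0.1])  # softmax output
eps = 1e-7

# Multi-class (categorical) log loss: only the true class contributes
multiclass_ll = -np.sum(labels * np.log(probs + eps))

# Binary log loss averaged over classes, i.e. what tf.losses.log_loss computes:
# every class is scored as an independent yes/no problem
binary_ll = -np.mean(labels * np.log(probs + eps)
                     + (1 - labels) * np.log(1 - probs + eps))

print(multiclass_ll)  # ~0.357
print(binary_ll)      # ~0.228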
See also:
https://github.com/tensorflow/tensorflow/issues/2462
difference between tensorflow tf.nn.softmax and tf.nn.softmax_cross_entropy_with_logits

How to use a fixed validation set (not K-fold cross validation) in Scikit-learn for a decision tree classifier/random forest classifier?

I am new to machine learning and data science. Sorry if it is a very stupid question.
I see there is an inbuilt function for cross-validation but not for a fixed validation set. I have a dataset with 50,000 samples labeled with years from 1990 to 2010. I need to train different classifiers on 1990-2008 samples, then validate on 2009 samples, and test on 2010 samples.
EDIT:
After @Quan Tran's answer, I tried this. Is this how it should be?
# Fit a decision tree
estimator1 = DecisionTreeClassifier(max_depth=9, max_leaf_nodes=9)
estimator1.fit(X_train, y_train)
print estimator1

# validate using validation set
acc = np.zeros((20, 20))  # store accuracy
for i in range(20):
    for j in range(20):
        estimator1 = DecisionTreeClassifier(max_depth=i+1, max_leaf_nodes=j+2)
        estimator1.fit(X_valid, y_valid)
        y_pred = estimator1.predict(X_valid)
        acc[i, j] = accuracy_score(y_valid, y_pred)
best_mod = np.where(acc == acc.max())
print best_mod
print acc[best_mod]

# Predict target values
estimator1 = DecisionTreeClassifier(max_depth=int(best_mod[0]) + 1, max_leaf_nodes=int(best_mod[1]) + 2)
estimator1.fit(X_valid, y_valid)
y_pred = estimator1.predict(X_test)
confusion = metrics.confusion_matrix(y_test, y_pred)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]

# Classification Accuracy
print "======= ACCURACY ========"
print((TP + TN) / float(TP + TN + FP + FN))
print accuracy_score(y_valid, y_pred)

# store the predicted probabilities for class 1
y_pred_prob = estimator1.predict_proba(X_test)[:, 1]

# plot a ROC curve for y_test and y_pred_prob
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for DecisionTreeClassifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

print("======= AUC ========")
print(metrics.roc_auc_score(y_test, y_pred_prob))
I get this answer, which is not the best accuracy.
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=9,
max_features=None, max_leaf_nodes=9, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
(array([5]), array([19]))
[ 0.8489011]
======= ACCURACY ========
0.574175824176
0.538461538462
======= AUC ========
0.547632099893
In this case, there are three separate sets: the train set, the validation set, and the test set.
The train set is used to fit the parameters of the classifier. For example:
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(trainfeatures, labels)
The validation set is used to tune the hyperparameters of the classifier or to find the cutoff point for the training procedure. For example, in the case of a decision tree, max_depth is a hyperparameter. You will need to find a good set of hyperparameters by experimenting with different values (tuning) and comparing the performance measures (accuracy, precision, ...) on the validation set.
The test set is used to estimate the error rate on unseen data. After having the performance measures on the test set, the model must not be trained/tuned any further.
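A minimal sketch of that workflow for the year-based split described in the question, assuming df is a pandas DataFrame; df, feature_cols, 'year', and 'label' are placeholder names, not from the original code:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Fixed split by year instead of K-fold CV (placeholder column names)
train = df[df['year'] <= 2008]
valid = df[df['year'] == 2009]
test = df[df['year'] == 2010]
X_train, y_train = train[feature_cols], train['label']
X_valid, y_valid = valid[feature_cols], valid['label']
X_test, y_test = test[feature_cols], test['label']

# Tune hyperparameters: always fit on the training years, score on the fixed 2009 validation year
best_acc, best_params = -np.inf, None
for max_depth in range(1, 21):
    for max_leaf_nodes in range(2, 22):
        clf = DecisionTreeClassifier(max_depth=max_depth, max_leaf_nodes=max_leaf_nodes)
        clf.fit(X_train, y_train)
        acc = accuracy_score(y_valid, clf.predict(X_valid))
        if acc > best_acc:
            best_acc, best_params = acc, (max_depth, max_leaf_nodes)

# Final model: refit with the chosen hyperparameters, estimate the error once on the unseen 2010 data
clf = DecisionTreeClassifier(max_depth=best_params[0], max_leaf_nodes=best_params[1])
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))

The key difference from the code in the EDIT is that the estimator is always fitted on the training set; the validation set is only scored against, never fitted on.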
