The vertical axis of the KS curve shows TPR, FPR, and (TPR - FPR); the horizontal axis is the threshold.
TPR = TP / (TP + FN).
When the threshold is 0, every sample is predicted as 1, so TP equals the number of positive samples and FN = 0.
Thus TPR = 1.
But all the KS curves I found on the Internet begin at (0, 0). Shouldn't it be (0, 1)? I am so confused! Thanks for answering!
TP: number of positive predictions that are actually 1
FP: number of positive predictions that are actually 0
TN: number of negative predictions that are actually 0
FN: number of negative predictions that are actually 1
When the threshold is 0, the model predicts only positives, so FN = TN = 0. Then FPR = FP/(FP+TN) = 1 and TPR = TP/(TP+FN) = 1, so this point should be (1,1).
When the threshold is 1, the model predicts only negatives, so TP = FP = 0. Then FPR = FP/(FP+TN) = 0 and TPR = TP/(TP+FN) = 0, so this point should be (0,0).
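The two extreme cases can be checked by hand. The snippet below is only a sketch: the labels, the scores, and the helper name rates are made up for illustration, and a sample is assumed to be predicted positive when its score is at least the threshold.

```python
# Hypothetical labels and scores, used only to illustrate the two extremes.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]


def rates(y_true, scores, threshold):
    """Return (FPR, TPR), predicting 1 when score >= threshold (assumed rule)."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, t in zip(pred, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, y_true) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, y_true) if p == 0 and t == 0)
    return fp / (fp + tn), tp / (tp + fn)


low = rates(y_true, scores, 0.0)   # every sample is predicted positive
high = rates(y_true, scores, 1.0)  # no score reaches 1.0, so nothing is predicted positive
print(low, high)
```

With these made-up scores, TPR - FPR is 0 at both extremes, because TPR and FPR are equal there.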
# roc curve and auc
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot
import pandas as pd
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[1,1], random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# fit a model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(trainX, trainy)
# predict probabilities
probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(testy, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = roc_curve(testy, probs)
# plot no skill
pyplot.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
pyplot.plot(fpr, tpr, marker='.')
# show the plot
pyplot.show()
# see calculations
pd.DataFrame({'fpr':fpr,'tpr':tpr,'thresholds':thresholds})
Outputs:
        fpr       tpr  thresholds
0  0.000000  0.000000    2.000000
1  0.054264  0.561983    1.000000
2  0.217054  0.884298    0.666667
3  0.406977  0.975207    0.333333
4  1.000000  1.000000    0.000000
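As a rough sketch of how this connects to the KS curve, the values in the table can be turned into TPR - FPR per threshold. This assumes the KS statistic is taken as the maximum of that difference; the variable names are chosen here for illustration.

```python
# Values copied from the roc_curve output shown above.
fpr = [0.000000, 0.054264, 0.217054, 0.406977, 1.000000]
tpr = [0.000000, 0.561983, 0.884298, 0.975207, 1.000000]
thresholds = [2.000000, 1.000000, 0.666667, 0.333333, 0.000000]

# KS curve: TPR - FPR at each threshold.
ks_curve = [t - f for t, f in zip(tpr, fpr)]

# KS statistic (assumed definition): the largest gap, and where it occurs.
ks_index = max(range(len(ks_curve)), key=ks_curve.__getitem__)
ks_stat = ks_curve[ks_index]
ks_threshold = thresholds[ks_index]
print('KS = %.6f at threshold %.6f' % (ks_stat, ks_threshold))
```

The difference is 0 at both ends of the table, since TPR and FPR are both 0 at the highest threshold and both 1 at the lowest.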
I have a task to find the optimal hyperparameter k of KNN. I plotted k against AUC using roc_auc_score. I need to find the k for which cv_auc is maximal and the gap between train_auc and cv_auc is minimal. How can I achieve that?
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
train_auc=[]
cv_auc=[]
k=[i for i in range(1,50,5)]
for i in k:
    knn=KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train_bow,y_train)
    y_train_pred=knn.predict_proba(x_train_bow)[:,1]
    y_cv_pred=knn.predict_proba(x_cv_bow)[:,1]
    train_auc.append(roc_auc_score(y_train,y_train_pred))
    cv_auc.append(roc_auc_score(y_cv,y_cv_pred))
#plot the roc curve
plt.plot(k,train_auc,label="Train AUC")
plt.plot(k,cv_auc,label="CV AUC")
plt.legend()
plt.xlabel('K:hyperparameter')
plt.ylabel('AUC')
plt.title("Error plot")
plt.show()
picture of the roc curve
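import numpy as np
print(cv_auc)
# index of the k with the highest CV AUC
print(cv_auc.index(max(cv_auc)))
array1 = np.array(train_auc)
array2 = np.array(cv_auc)
# gap between train AUC and CV AUC for each k
subtracted_array = np.subtract(array1, array2)
subtracted = list(subtracted_array)
print(subtracted)
# index of the k with the smallest gap
subtracted.index(min(subtracted))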
Output:
[0.6241694315220194, 0.6985803616697652, 0.7222662029418654, 0.7429448007376901, 0.7433472984472336, 0.7492335494812746, 0.7499829512940709, 0.7594353468596283, 0.757365782209453, 0.7518153165574067]
7
[0.3758305684779806, 0.1995133667387895, 0.1433755719502956, 0.10953834255228179, 0.09624883964242126, 0.08236753388538032, 0.07710481774180344, 0.06538756093043141, 0.05998659695603492, 0.06576356656762017]
8
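One possible way to combine the two criteria, offered only as a sketch and not as an established rule: keep the values of k whose CV AUC is within some tolerance of the best, then pick the one with the smallest train/CV gap. The tolerance value and the variable names below are assumptions; the numbers are copied from the output above.

```python
# k values follow range(1, 50, 5) from the loop; scores are copied from the output above.
k = list(range(1, 50, 5))
cv_auc = [0.6241694315220194, 0.6985803616697652, 0.7222662029418654,
          0.7429448007376901, 0.7433472984472336, 0.7492335494812746,
          0.7499829512940709, 0.7594353468596283, 0.757365782209453,
          0.7518153165574067]
gap = [0.3758305684779806, 0.1995133667387895, 0.1433755719502956,
       0.10953834255228179, 0.09624883964242126, 0.08236753388538032,
       0.07710481774180344, 0.06538756093043141, 0.05998659695603492,
       0.06576356656762017]

# Assumption: a CV AUC within `tolerance` of the best counts as good enough.
tolerance = 0.005  # hypothetical value, adjust for your data
best_auc = max(cv_auc)
candidates = [i for i, a in enumerate(cv_auc) if a >= best_auc - tolerance]

# Among the candidates, take the one with the smallest train/CV gap.
best_index = min(candidates, key=lambda i: gap[i])
print('chosen k =', k[best_index])
```

A stricter or looser tolerance changes which values of k are considered, so the result depends on that choice.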
How to go about plotting the decision boundaries for a Random Forest analysis with 10 classes?
I get the error:
ValueError: X has 2 features, but RandomForestClassifier is expecting 240 features as input.
Can you help me get the decision boundaries for the 10 classes if possible? Thanks for your time!
Here is my code:
from sklearn.datasets import make_classification
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
f, (ax1,ax2) = plt.subplots(nrows=1, ncols=2,figsize=(20,8))
# Generate noisy Data
num_trainsamples = 500
num_testsamples = 50
X_train,y_train = make_classification(n_samples=num_trainsamples,
                                      n_features=240,
                                      n_informative=9,
                                      n_redundant=0,
                                      n_repeated=0,
                                      n_classes=10,
                                      n_clusters_per_class=1,
                                      class_sep=9,
                                      flip_y=0.2,
                                      #weights=[0.5,0.5],
                                      random_state=17)
X_test,y_test = make_classification(n_samples=50,
                                    n_features=num_testsamples,
                                    n_informative=9,
                                    n_redundant=0,
                                    n_repeated=0,
                                    n_classes=10,
                                    n_clusters_per_class=1,
                                    class_sep=10,
                                    flip_y=0.2,
                                    #weights=[0.5,0.5],
                                    random_state=17)
model = RandomForestClassifier()
parameter_space = {
    'n_estimators': [10,50,100],
    'criterion': ['gini', 'entropy'],
    'max_depth': np.linspace(10,50,11),
}
clf = GridSearchCV(model, parameter_space, cv = 5, scoring = "accuracy", verbose = True) # model
my_model = clf.fit(X_train, y_train)
# define bounds of the domain
min1, max1 = X_train[:, 0].min()-1, X_train[:, 0].max()+1
min2, max2 = X_train[:, 1].min()-1, X_train[:, 1].max()+1
# define the x and y scale
x1grid = np.arange(min1, max1, 0.1)
x2grid = np.arange(min2, max2, 0.1)
# create all of the lines and rows of the grid
xx, yy = np.meshgrid(x1grid, x2grid)
# flatten each grid to a vector
r1, r2 = xx.flatten(), yy.flatten()
r1, r2 = r1.reshape((len(r1), 1)), r2.reshape((len(r2), 1))
# horizontal stack vectors to create x1,x2 input for the model
grid = np.hstack((r1,r2))
yhat = clf.predict(grid)
# reshape the predictions back into a grid
zz = yhat.reshape(xx.shape)
# plot the grid of x, y and z values as a surface
plt.contourf(xx, yy, zz, cmap='Paired')
# create scatter plot for samples from each class
for class_value in range(10):
    # get row indexes for samples with this class
    row_ix = np.where(y_train == class_value)
    # create scatter of these samples
    plt.scatter(X_train[row_ix, 0], X_train[row_ix, 1], cmap='Paired')
X_train = {my training data features}
y_train = {my training data truth}
kf = KFold(n_splits=5, random_state=42, shuffle=True)
score = cross_val_score(SVC(), X_train, y_train, scoring = 'accuracy', cv = kf, n_jobs = -1)
gives this:
array([1. , 0.98717949, 1. , 1. , 0.98701299])
I run this code to get AUC:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
plt.figure(figsize=(10,10))
i = 0
for train, test in kf.split(npX_train):
    model = SVC(probability=True).fit(npX_train[train], npy_train[train])
    probas_ = model.predict_proba(npX_train[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(npy_train[test], probas_[:, 1])
    # interpolate this fold's TPR onto the common FPR grid
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=1, alpha=0.3,
             label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i += 1
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
         label='Chance', alpha=.8)
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
         lw=2, alpha=.8)
std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                 label=r'$\pm$ 1 std. dev.')
plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])
plt.xlabel('False Positive Rate',fontsize=18)
plt.ylabel('True Positive Rate',fontsize=18)
plt.title('Cross-Validation ROC of SVM',fontsize=18)
plt.legend(loc="lower right", prop={'size': 15})
plt.show()
which gives me this:
but if I get a confusion matrix for each iteration:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
for train, test in kf.split(npX_train):
    model = SVC(probability=True).fit(npX_train[train], npy_train[train])
    # make confusion matrix plot for iteration
    y_pred = model.predict(npX_train[test])
    cm = confusion_matrix(npy_train[test], y_pred)
    # ConfusionMatrixDisplay already draws the matrix; plot_confusion_matrix
    # duplicated it and was removed in scikit-learn 1.2
    cm_display = ConfusionMatrixDisplay(cm).plot()
    plt.show()
The accuracy for label 1, which is the one I care about, does not look that great. Of the 22 samples whose true label is 1, the model seems to get 20 right across all runs.
My questions are:
Did I mess up that AUC plot or is that slight bend in the blue mean ROC line reflecting the inaccuracy of the model?
Is there a better way to evaluate accuracy for a biased input where I care about the accurate prediction of the more rare event?
For biased or imbalanced datasets, use the F1 score as the metric. The F1 score is the harmonic mean of precision and recall.
For more detail on the F1 score, read:
https://medium.com/analytics-vidhya/accuracy-vs-f1-score-6258237beca2
scikit-learn documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
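To make the metric concrete, here is a small sketch computed from confusion-matrix counts. The 22 true positives-class samples and 20 correct predictions come from the question; the false positive count is not given there, so the value used below is purely an assumption.

```python
# From the question: 22 samples truly have label 1, and 20 of them are predicted correctly.
tp = 20
fn = 22 - tp
# Assumption: the number of false positives is not stated in the question;
# 3 is a made-up value for illustration only.
fp = 3

precision = tp / (tp + fp)  # share of predicted 1s that are truly 1
recall = tp / (tp + fn)     # share of true 1s that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print('precision=%.3f recall=%.3f f1=%.3f' % (precision, recall, f1))
```

Unlike accuracy, none of these three quantities use the true negative count, which is why they are less flattered by a large majority class.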
DecisionTreeClassifier has a method predict_proba, which calculates class probabilities for an input data point X. How is this predicted probability calculated for an already trained model?
The predicted class probability is the fraction of samples of that class in the leaf. This means that if the leaf contains 10 samples labelled 1 and 90 samples labelled 0, the probability that the label is 1 will be 10%, as in this example:
from sklearn.tree import DecisionTreeClassifier
import numpy as np
X = np.zeros((100, 1))
y = np.zeros((100, ))
y[-10:] = 1
dtc = DecisionTreeClassifier(max_depth=1).fit(X, y)
dtc.predict_proba([[0]])
which outputs:
array([[0.9, 0.1]])
I am trying to use an LSTM recurrent neural network in Keras to forecast future purchases. My input variables are a time window of purchases over the previous 5 days and a categorical variable that I encoded as dummy variables A, B, ..., I. My input data looks like the following:
>>> dataframe.head()
          day      price  A  B  C  D  E  F  G  H  I  TS_bigHolidays
0  2015-06-16   7.031160  1  0  0  0  0  0  0  0  0               0
1  2015-06-17  10.732429  1  0  0  0  0  0  0  0  0               0
2  2015-06-18  21.312692  1  0  0  0  0  0  0  0  0               0
My problem is that my forecasts/fitted values (for both training and test data) seem to be shifted forward. Here is a plot:
My question is: which parameter of the Keras LSTM should I change to correct this issue? Or do I need to change anything in my input data?
Here is my code:
import numpy as np
import os
import matplotlib.pyplot as plt
import pandas
import math
import time
import csv
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM
from sklearn.preprocessing import MinMaxScaler
np.random.seed(1234)
exo_feature = ["A","B","C","D","E","F","G","H","I", "TS_bigHolidays"]
look_back = 5 #this is number of days we are looking back for sliding window of time series
forecast_period_length = 40
# load the dataset
dataframe = pandas.read_csv('processedDataframeGameSphere.csv', header = 0, engine='python', skipfooter=6)
dataframe["price"] = dataframe['price'].astype('float32')
scaler = MinMaxScaler(feature_range=(0, 100))
dataframe["price"] = scaler.fit_transform(dataframe['price'])
# this function is used to make sliding window for time series data
def create_dataframe(dataframe, look_back=1):
    dataX, dataY = [], []
    for i in range(dataframe.shape[0]-look_back-1):
        price_lookback = dataframe['price'][i: (i + look_back)] #i+look_back is exclusive here
        exog_feature = dataframe[exo_feature].ix[i + look_back - 1] #Y is i+ look_back ,that's why
        row_i = price_lookback.append(exog_feature)
        dataX.append(row_i)
        dataY.append(dataframe["price"][i + look_back])
    return np.array(dataX), np.array(dataY)
window_dataframe, Y = create_dataframe(dataframe, look_back)
# split into train and test sets
train_size = int(dataframe.shape[0] - forecast_period_length) #28 is the number of days we want to forecast , 4 weeks
test_size = dataframe.shape[0] - train_size
test_size_start_point_with_lookback = train_size - look_back
trainX, trainY = window_dataframe[0:train_size,:], Y[0:train_size]
print(trainX.shape)
print(trainY.shape)
#below changed datawindowY indexing, since it's just array.
testX, testY = window_dataframe[train_size:dataframe.shape[0],:], Y[train_size:dataframe.shape[0]]
# reshape input to be [samples, time steps, features]
trainX = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
print(trainX.shape)
print(testX.shape)
# create and fit the LSTM network
dimension_input = testX.shape[2]
model = Sequential()
layers = [dimension_input, 50, 100, 1]
epochs = 100
model.add(LSTM(
    input_dim=layers[0],
    output_dim=layers[1],
    return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(
    layers[2],
    return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(
    output_dim=layers[3]))
model.add(Activation("linear"))
start = time.time()
model.compile(loss="mse", optimizer="rmsprop")
print("Compilation Time : %s" % (time.time() - start))
model.fit(
    trainX, trainY,
    batch_size= 10, nb_epoch=epochs, validation_split=0.05,verbose =2)
# Estimate model performance
trainScore = model.evaluate(trainX, trainY, verbose=0)
trainScore = math.sqrt(trainScore)
trainScore = scaler.inverse_transform(np.array([[trainScore]]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = model.evaluate(testX, testY, verbose=0)
testScore = math.sqrt(testScore)
testScore = scaler.inverse_transform(np.array([[testScore]]))
print('Test Score: %.2f RMSE' % (testScore))
# generate predictions for training
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
# shift train predictions for plotting
np_price = np.array(dataframe["price"])
print(np_price.shape)
np_price = np_price.reshape(np_price.shape[0],1)
trainPredictPlot = np.empty_like(np_price)
trainPredictPlot[:, :] = np.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict
testPredictPlot = np.empty_like(np_price)
testPredictPlot[:, :] = np.nan
testPredictPlot[len(trainPredict)+look_back+1:dataframe.shape[0], :] = testPredict
# plot baseline and predictions
plt.plot(dataframe["price"])
plt.plot(trainPredictPlot)
plt.plot(testPredictPlot)
plt.show()
This is not a problem with the LSTM; a simple feed-forward network would show the same effect.
The problem is that the network tends to mimic yesterday's value instead of producing the forecast you expect
(which is a good strategy for reducing the MSE loss).
Avoiding this takes more care, and it is not a simple issue.
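One way to see the effect is to compute a persistence baseline, where the forecast for each day is simply the previous day's value. The sketch below uses made-up prices and a helper named mse chosen for illustration; it is not taken from the code in the question.

```python
# Hypothetical daily prices, used only for illustration.
prices = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0]

# Persistence forecast: the prediction for day t is the observed value of day t-1.
actual = prices[1:]
persistence = prices[:-1]


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


baseline_mse = mse(actual, persistence)
print('Persistence MSE: %.2f' % baseline_mse)
```

Plotted against the actual series, the persistence forecast is the same curve shifted one step to the right. If a trained model's error is not clearly below this baseline, it is likely doing little more than copying the previous value.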