I have a task to find the optimal hyperparameter(k) of KNN. I plotted the k vs AUC curve using roc_auc_score. I am supposed to find k such that cv_auc is maximum and the gap between train_auc and cv_auc is minimum. How can I achieve that?
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
train_auc = []
cv_auc = []
k = list(range(1, 50, 5))
for i in k:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train_bow, y_train)
    # roc_auc_score needs the probability of the positive class
    y_train_pred = knn.predict_proba(x_train_bow)[:, 1]
    y_cv_pred = knn.predict_proba(x_cv_bow)[:, 1]
    train_auc.append(roc_auc_score(y_train, y_train_pred))
    cv_auc.append(roc_auc_score(y_cv, y_cv_pred))
# plot train and CV AUC against k
plt.plot(k,train_auc,label="Train AUC")
plt.plot(k,cv_auc,label="CV AUC")
plt.legend()
plt.xlabel('K:hyperparameter')
plt.ylabel('AUC')
plt.title("Error plot")
plt.show()
[Figure: Train AUC and CV AUC plotted against k]
import numpy as np
print(cv_auc)
print(cv_auc.index(max(cv_auc)))
array1 = np.array(train_auc)
array2 = np.array(cv_auc)
# Gap between train AUC and CV AUC for each k
subtracted_array = np.subtract(array1, array2)
subtracted = list(subtracted_array)
print(subtracted)
print(subtracted.index(min(subtracted)))
Output:
[0.6241694315220194, 0.6985803616697652, 0.7222662029418654, 0.7429448007376901, 0.7433472984472336, 0.7492335494812746, 0.7499829512940709, 0.7594353468596283, 0.757365782209453, 0.7518153165574067]
7
[0.3758305684779806, 0.1995133667387895, 0.1433755719502956, 0.10953834255228179, 0.09624883964242126, 0.08236753388538032, 0.07710481774180344, 0.06538756093043141, 0.05998659695603492, 0.06576356656762017]
8
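One possible way to combine the two criteria, sketched under assumptions: keep only the k values whose CV AUC is within a small tolerance of the best CV AUC, then pick the one with the smallest train/CV gap. The function name pick_k and the tolerance of 0.01 are hypothetical choices, not something from the question; the numbers are the outputs above rounded to four decimals.

```python
def pick_k(ks, train_auc, cv_auc, tolerance=0.01):
    """Return the k with the smallest train/CV gap among near-best CV scores."""
    best_cv = max(cv_auc)
    candidates = [
        (train - cv, k)
        for k, train, cv in zip(ks, train_auc, cv_auc)
        if cv >= best_cv - tolerance
    ]
    # min() on (gap, k) tuples picks the smallest gap; ties go to the smaller k
    gap, k = min(candidates)
    return k


ks = list(range(1, 50, 5))
# Values from the output above, rounded to four decimals
cv_auc = [0.6242, 0.6986, 0.7223, 0.7429, 0.7433, 0.7492, 0.7500, 0.7594, 0.7574, 0.7518]
gap = [0.3758, 0.1995, 0.1434, 0.1095, 0.0962, 0.0824, 0.0771, 0.0654, 0.0600, 0.0658]
train_auc = [c + g for c, g in zip(cv_auc, gap)]

best_k = pick_k(ks, train_auc, cv_auc)
print(best_k)
```

A larger tolerance favours a small gap over raw CV AUC, and a tolerance of 0 reduces to simply taking the k with the maximum CV AUC.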
I've built up my own neural model, trained it, and got 99.58% accuracy. But I am facing a problem with plotting the confusion matrix. There are some examples available for flow_from_directory but no examples exist for image_dataset_from_directory. Can anyone help me?
See the post "How to plot confusion matrix for prefetched dataset in Tensorflow", which uses
true_categories = tf.concat([y for x, y in val_ds], axis=0)
to get the true labels for the validation set. Then you can plot the confusion matrix with something like this
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
# predicted_id must hold the predicted class indices, in the same order as true_categories
cm = confusion_matrix(true_categories, predicted_id)
fig = plt.figure(figsize = (8,8))
ax1 = fig.add_subplot(1,1,1)
sns.set(font_scale=1.4) #for label size
sns.heatmap(cm, annot=True, annot_kws={"size": 12},
            cbar=False, cmap='Purples')
ax1.set_ylabel('True Values',fontsize=14)
ax1.set_xlabel('Predicted Values',fontsize=14)
plt.show()
Here is the code I wrote to assemble the confusion matrix.
Note:
validation_dataset is a tf.data.Dataset variable.
I created it with validation_dataset = tf.keras.preprocessing.image_dataset_from_directory()
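import tensorflow as tf
y_true = []
y_pred = []
for x, y in validation_dataset:
    # Assumes one-hot labels (e.g. label_mode='categorical'); convert them to class indices
    y = tf.argmax(y, axis=1)
    y_true.append(y)
    y_pred.append(tf.argmax(model.predict(x), axis=1))
# Concatenate the per-batch results into single 1-D tensors
y_pred = tf.concat(y_pred, axis=0)
y_true = tf.concat(y_true, axis=0)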
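Once y_true and y_pred are flat sequences of class indices, the confusion matrix is just a count of (true, predicted) pairs. The sketch below shows that counting step in plain Python; confusion_counts and the sample labels are hypothetical, and in practice sklearn.metrics.confusion_matrix or tf.math.confusion_matrix would normally be used instead.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Count (true, predicted) pairs: rows are true classes, columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical labels; with eager tensors you could use y_true.numpy().tolist()
y_true_list = [0, 0, 1, 1, 2, 2, 2]
y_pred_list = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_counts(y_true_list, y_pred_list, num_classes=3)
for row in cm:
    print(row)
```

The resulting nested list can be passed to sns.heatmap in the same way as the array returned by scikit-learn.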
I have successfully built a binary image classification CNN in Keras and made predictions using model.predict_classes(). Here is my code:
import os
import numpy as np
from keras.models import load_model
from PIL import Image
import matplotlib.pyplot as plt
%matplotlib inline

model = load_model('./potholes16_2.h5')
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
path = os.path.abspath("./potholes14/test/positive")
extensions = 'JPG'
if __name__ == "__main__":
    for f in os.listdir(path):
        if os.path.isfile(os.path.join(path, f)):
            f_text, f_ext = os.path.splitext(f)
            f_ext = f_ext[1:].upper()
            if f_ext in extensions:
                print(f)
                img = Image.open(os.path.join(path, f))
                new_width = 200
                new_height = 200
                img = img.resize((new_width, new_height), Image.ANTIALIAS)
                # width, height = image.size
                img = np.reshape(img, [1, new_width, new_height, 3])
                classes = model.predict_classes(img)
                print(classes)
Now I want to count the total number of images that were predicted correctly. For example, how many images were correctly assigned to class 0 and how many to class 1?
You need to invoke the model.evaluate function; supposing you want to evaluate the data in x_test with the ground truth labels in y_test, then:
score = model.evaluate(x_test, y_test, verbose=0)
score[0] will give you the loss (binary cross entropy in your case), while score[1] contains the required binary accuracy.
See the docs for more details (scroll down looking for evaluate).
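You have the ground-truth labels for the data you are predicting on, correct? Then you can load those as well. Keep the code you have; suppose
classes = model.predict_classes(img)
yields (these values are probabilities, as model.predict would return; model.predict_classes already returns 0/1 labels, in which case the rounding below is a no-op)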
array([[ 0.94981687],[ 0.57888238],[ 0.58651019],[ 0.30058956],[ 0.21879381]])
and your class data looks like this
class_validation = np.array([[1],[0],[0],[0],[1]])
Then just find where they are equal after rounding classes:
np.where(np.round(classes,0)==class_validation)[0].shape[0]
Note: there are many ways to write the last line; this one assumes your numpy arrays have shape (number_of_samples, 1).
Another way to check
totalCorrect = class_validation[((np.round(classes,0) - class_validation)==0)]
print('Correct in Class 1 = ',np.count_nonzero(totalCorrect),'Correct in Class 0 = ',abs(len(totalCorrect)-np.count_nonzero(totalCorrect)))
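The same per-class count can be sketched without numpy, which may make the logic easier to follow. This is only an illustration: correct_per_class is a hypothetical helper, the 0.5 threshold is an assumption, and the data are the example values from above.

```python
from collections import Counter


def correct_per_class(probabilities, labels, threshold=0.5):
    """Count correct predictions separately for each true class."""
    correct = Counter()
    for prob, label in zip(probabilities, labels):
        predicted = 1 if prob >= threshold else 0
        if predicted == label:
            correct[label] += 1
    return correct


probs = [0.94981687, 0.57888238, 0.58651019, 0.30058956, 0.21879381]
labels = [1, 0, 0, 0, 1]

counts = correct_per_class(probs, labels)
print('Correct in Class 1 =', counts[1], 'Correct in Class 0 =', counts[0])
print('Total correct =', sum(counts.values()))
```

Here only the first and fourth samples are classified correctly, so each class gets one correct prediction.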
I am trying to calculate the Earth Mover's Distance between two 1-dimensional numpy histograms, like:
(array([ 0.53586639, 0.71448852, 1.22534781, 1.68262046, 1.20391316]), array([ 0. , 0.18648936, 0.37297871, 0.55946807, 0.74595742,
0.93244678]), <a list of 5 Patch objects>)
and
(array([ 0.05986936, 0.41133267, 1.0449142 , 2.43569242, 2.50891394]), array([ 0.17373296, 0.32851441, 0.48329586, 0.63807731, 0.79285876,
0.9476402 ]), <a list of 5 Patch objects>)
I want to do it for 1-dimensional arrays, not for images. I want a simple solution.
A simple python code:
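import numpy as np

def wasserstein_distance(A, B):
    # Assumes A and B are histograms over the same equally spaced bins with equal total mass
    n = len(A)
    dist = np.zeros(n)
    for x in range(n - 1):
        # dist[x+1] is the mass that has to be carried from bin x to bin x+1
        dist[x + 1] = A[x] - B[x] + dist[x]
    return np.sum(np.abs(dist))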
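The function above works on bin heights only, so it implicitly assumes that both histograms share the same equally spaced bins and the same total mass. A minimal sketch that makes those assumptions explicit follows; emd_1d and its bin_width parameter are hypothetical names, and each histogram is normalised to unit mass first. The two histograms in the question have different bin edges, so they would need to be recomputed on common edges before a function like this applies.

```python
from itertools import accumulate


def emd_1d(hist_a, hist_b, bin_width=1.0):
    """Earth Mover's Distance between two histograms on the same, equally spaced bins."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    total_a, total_b = sum(hist_a), sum(hist_b)
    # Normalise to unit mass so that the two histograms are comparable
    p = [a / total_a for a in hist_a]
    q = [b / total_b for b in hist_b]
    # Running surplus that has to be carried from each bin to the next one
    carried = accumulate(pa - qb for pa, qb in zip(p, q))
    return bin_width * sum(abs(c) for c in carried)


# All mass moves two bins to the right, so the distance is 2 bin widths
print(emd_1d([1, 0, 0], [0, 0, 1]))
# Same shape after normalisation, so the distance is 0
print(emd_1d([2, 2], [1, 1]))
```

If SciPy is available, scipy.stats.wasserstein_distance(u_values, v_values, u_weights, v_weights) accepts bin centres as values and bin heights as weights, which also covers histograms with different bin edges.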
Can someone give a full working example (not a snippet, but something that runs on a variable-length recurrent neural network) showing how to use the PackedSequence method in PyTorch?
There do not seem to be any examples of this in the documentation, github, or the internet.
https://github.com/pytorch/pytorch/releases/tag/v0.1.10
Not the most beautiful piece of code, but this is what I gathered for my personal use after going through the PyTorch forums and docs. There are certainly better ways to handle the sorting and restoring part, but I chose to do it in the network itself.
EDIT: See the answer from @tusonggao, which lets the torch utils take care of the sorting.
import torch
from torch import nn


class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_size, embedding_vectors=None, tune_embeddings=True, use_gru=True,
                 hidden_size=128, num_layers=1, bidirectional=True, dropout=0.6):
        super(Encoder, self).__init__()
        self.embed = nn.Embedding(vocab_size, embedding_size, padding_idx=0)
        if embedding_vectors is not None:
            assert embedding_vectors.shape[0] == vocab_size and embedding_vectors.shape[1] == embedding_size
            self.embed.weight = nn.Parameter(torch.FloatTensor(embedding_vectors))
        # Set after the optional weight replacement, since a new nn.Parameter requires grad by default
        self.embed.weight.requires_grad = tune_embeddings
        cell = nn.GRU if use_gru else nn.LSTM
        self.rnn = cell(input_size=embedding_size, hidden_size=hidden_size, num_layers=num_layers,
                        batch_first=True, bidirectional=bidirectional, dropout=dropout)

    def forward(self, x, x_lengths):
        # Sort the batch by length (longest first), as pack_padded_sequence expects by default
        sorted_seq_lens, original_ordering = torch.sort(torch.LongTensor(x_lengths), dim=0, descending=True)
        ex = self.embed(x[original_ordering])
        pack = torch.nn.utils.rnn.pack_padded_sequence(ex, sorted_seq_lens.tolist(), batch_first=True)
        out, _ = self.rnn(pack)
        unpacked, unpacked_len = torch.nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
        # Gather the output at the last valid time step of each sequence
        indices = ((unpacked_len - 1).view(-1, 1)
                   .expand(unpacked.size(0), unpacked.size(2))
                   .unsqueeze(1))
        last_encoded_states = unpacked.gather(dim=1, index=indices).squeeze(dim=1)
        # Scatter the rows back into the original batch order
        scatter_indices = original_ordering.view(-1, 1).expand_as(last_encoded_states)
        encoded_reordered = last_encoded_states.clone().scatter_(dim=0, index=scatter_indices, src=last_encoded_states)
        return encoded_reordered
Actually, there is no need to handle the sorting and restoring yourself: let torch.nn.utils.rnn.pack_padded_sequence do all the work by setting the parameter enforce_sorted=False.
The returned PackedSequence object then carries the sorting information in its sorted_indices and unsorted_indices attributes, which the subsequent nn.GRU or nn.LSTM uses to restore the original index order.
Runnable code example:
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
data = [torch.tensor([1]),
        torch.tensor([2, 3, 4, 5]),
        torch.tensor([6, 7]),
        torch.tensor([8, 9, 10])]
lengths = [d.size(0) for d in data]
padded_data = pad_sequence(data, batch_first=True, padding_value=0)
embedding = nn.Embedding(20, 5, padding_idx=0)
embeded_data = embedding(padded_data)
packed_data = pack_padded_sequence(embeded_data, lengths, batch_first=True, enforce_sorted=False)
lstm = nn.LSTM(5, 5, batch_first=True)
o, (h, c) = lstm(packed_data)
# (h, c) is the needed final hidden and cell state, with index already restored correctly by LSTM.
# but o is a PackedSequence object, to restore to the original index:
unpacked_o, unpacked_lengths = pad_packed_sequence(o, batch_first=True)
# now unpacked_o, (h, c) is just like the normal output you expected from a lstm layer.
print(unpacked_o, unpacked_lengths)
The output of unpacked_o and unpacked_lengths looks something like this:
# output (unpacked_o, unpacked_lengths):
tensor([[[ 1.5230, -1.7530, 0.5462, 0.6078, 0.9440],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]],
[[ 1.8888, -0.5465, 0.5404, 0.4132, -0.3266],
[ 0.1657, 0.5875, 0.4556, -0.8858, 1.1443],
[ 0.8957, 0.8676, -0.6614, 0.6751, -1.2377],
[-1.8999, 2.8260, 0.1650, -0.6244, 1.0599]],
[[ 0.0637, 0.3936, -0.4396, -0.2788, 0.1282],
[ 0.5443, 0.7401, 1.0287, -0.1538, -0.2202],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]],
[[-0.5008, 2.1262, -0.3623, 0.5864, 0.9871],
[-0.6996, -0.3984, 0.4890, -0.8122, -1.0739],
[ 0.3392, 1.1305, -0.6669, 0.5054, -1.7222],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]]],
grad_fn=<IndexSelectBackward>) tensor([1, 4, 2, 3])
Comparing it with the original data and lengths, we can see that the sorting and restoring problem has been neatly taken care of.
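To see what the sorting and restoring bookkeeping amounts to, here is a plain-Python sketch (no PyTorch involved). The function name sort_and_restore_indices is hypothetical; the two lists it returns correspond roughly to what the sorted_indices and unsorted_indices attributes hold, i.e. a longest-first ordering and its inverse permutation.

```python
def sort_and_restore_indices(lengths):
    """Return (sorted_indices, unsorted_indices) for a list of sequence lengths."""
    # Positions of the sequences ordered from longest to shortest
    sorted_indices = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    # Inverse permutation: where each original position ended up after sorting
    unsorted_indices = [0] * len(lengths)
    for rank, original_pos in enumerate(sorted_indices):
        unsorted_indices[original_pos] = rank
    return sorted_indices, unsorted_indices


lengths = [1, 4, 2, 3]  # same lengths as in the example above
sorted_idx, unsorted_idx = sort_and_restore_indices(lengths)

sorted_lengths = [lengths[i] for i in sorted_idx]
restored_lengths = [sorted_lengths[i] for i in unsorted_idx]
print(sorted_idx, unsorted_idx)
print(sorted_lengths, restored_lengths)
```

Indexing the sorted batch with the inverse permutation gives back the original order, which is why the lengths printed at the end of the output above match the input lengths.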