Is it an overfitting problem for SVM classification? - machine-learning

I am new to machine learning, and I want to detect emotions from faces.
Preprocessing: I use equalizeHist to equalize the histogram of the grayscale images (JAFFE database, 213 images), in order to normalize brightness and increase contrast.
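For reference, the preprocessing step looks roughly like this (a minimal sketch assuming OpenCV's equalizeHist; the exact code is not shown in the updates below):

import cv2

img = cv2.imread(path_image)                  # path_image: path to one JAFFE image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # work on the grayscale image
equalized = cv2.equalizeHist(gray)            # spread intensities: normalized brightness, higher contrast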
Feature extraction: I extract features with Gabor filters and get a 213x120 feature matrix. I split the data 60% train / 40% test, and I normalize it.
Train and test: to train the model, I use an SVM classifier with an RBF kernel. With grid search I select the best (C, gamma) pair, using 10-fold cross-validation. Then I evaluate the model on the test data (unseen data) and get 89% accuracy.
The problem is that when I try to predict the emotion of a new input face, I get a wrong result. Is this an overfitting problem?
UPDATE 1: feature extraction
gf = GaborFilter(ksize=(11, 11), freq_nbr=5, or_nbr=8, lambd=4, sigma=8, gamma=5, psi=5*np.pi/6) # Gabor filter bank built with cv2.getGaborKernel
data = []
for j in range(len(roi1_images)): # len(roi1_images) = 213
    vect1 = feature_extraction_roi_avr(roi1_images[j], gf.kernels) # feature vector from left eye (40 features)
    vect2 = feature_extraction_roi_avr(roi2_images[j], gf.kernels) # feature vector from right eye (40 features)
    vect3 = feature_extraction_roi_avr(roi3_images[j], gf.kernels) # feature vector from mouth (40 features)
    vect = np.concatenate((vect1, vect2, vect3), axis=None) # feature vector of one face (120 features)
    data.append(vect)
data = np.array(data) # Data matrix (213, 120)
UPDATE 2: Learn and test model
clf = GridSearchCV(estimator=SVC(kernel='rbf'), param_grid=svm_parameters, cv=10, n_jobs=-1) # grid search with 10-fold cross-validation
scaler = StandardScaler()
# Split data
X_train, X_test, y_train, y_test = train_test_split(data, data_labels, random_state=0, test_size=0.4, stratify=data_labels)
X_train = scaler.fit_transform(X_train) # Standardize train data (scaler fitted on the training set only)
clf.fit(X_train, y_train) # fit the SVM model
X_test = scaler.transform(X_test) # Standardize test data with the training statistics
score = clf.score(X_test, y_test) # mean accuracy on the test set
print(score) # score accuracy = 0.8953488372093024
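svm_parameters is the grid of C and gamma values searched over; for illustration only (not necessarily the values actually used), it would look something like:

svm_parameters = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
}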
UPDATE 3: predict new input
landmarks, gs_image = detect_landmarks(path_image)
roi1_image = extract_roi(gs_image, landmarks, 1) # extract region 1 (left eye)
roi2_image = extract_roi(gs_image, landmarks, 2) # extract region 2 (right eye)
roi3_image = extract_roi(gs_image, landmarks, 3) # extract region 3 (mouth)
vect1 = feature_extraction_roi_avr(roi1_image, gf.kernels)
vect2 = feature_extraction_roi_avr(roi2_image, gf.kernels)
vect3 = feature_extraction_roi_avr(roi3_image, gf.kernels)
vect = np.concatenate((vect1, vect2, vect3), axis=0)
vect = np.reshape(vect, (1,-1))
vect = scaler.transform(vect)
class_ = clf.predict(vect)[0]
print(class_,end=" ")

Related

High Train and Validation Accuracy, Bad Test Accuracy

I am trying to classify 2 classes of images. Though I am getting high train and validation accuracy (0.97) after 10 epochs, my test results are awful (precision 0.48), and the confusion matrix shows the network is predicting the images as the wrong class (results below).
There are only 2 classes in the dataset, and each class has 10,000 image examples (after augmentation). I am using the VGG16 network. 20% of the full dataset is split off as the test set (this split was performed by taking random images from each class, so it is shuffled). The remaining images are split into 80% train and 20% validation sets (as indicated in the ImageDataGenerator line of the code). So in the end there are:
12,904 Train images belonging to 2 classes
3,224 Valid images belonging to 2 classes
4,032 Test images belonging to 2 classes
This is my code:
def CNN(CNN='VGG16', choice='predict', prediction='./dataset/Test/image.jpg'):
    ''' Train images using one of several CNNs '''
    Train = './dataset/Train'
    Tests = './dataset/Test'
    shape = (224, 224)
    epochs = 10
    batches = 16
    classes = []
    for c in os.listdir(Train):
        classes.append(c)
    IDG = keras.preprocessing.image.ImageDataGenerator(validation_split=0.2)
    train = IDG.flow_from_directory(Train, target_size=shape, color_mode='rgb',
                                    classes=classes, batch_size=batches, shuffle=True, subset='training')
    valid = IDG.flow_from_directory(Train, target_size=shape, color_mode='rgb',
                                    classes=classes, batch_size=batches, shuffle=True, subset='validation')
    tests = IDG.flow_from_directory(Tests, target_size=shape, color_mode='rgb',
                                    classes=classes, batch_size=batches, shuffle=True)
    input_shape = train.image_shape
    if CNN.lower() == 'vgg16':
        model = VGG16(weights=None, input_shape=input_shape, classes=len(classes))
    elif CNN.lower() == 'vgg19':
        model = VGG19(weights=None, input_shape=input_shape, classes=len(classes))
    elif CNN.lower() == 'resnet50':
        model = ResNet50(weights=None, input_shape=input_shape, classes=len(classes))
    elif CNN.lower() == 'densenet201':
        model = DenseNet201(weights=None, input_shape=input_shape, classes=len(classes))
    model.compile(optimizer=keras.optimizers.SGD(
                      lr=1e-3,
                      decay=1e-6,
                      momentum=0.9,
                      nesterov=True),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    Esteps = int(train.samples / train.next()[0].shape[0])
    Vsteps = int(valid.samples / valid.next()[0].shape[0])
    if choice == 'train':
        history = model.fit_generator(train,
                                      steps_per_epoch=Esteps,
                                      epochs=epochs,
                                      validation_data=valid,
                                      validation_steps=Vsteps,
                                      verbose=1)
        plt.plot(history.history['loss'])
        plt.plot(history.history['val_loss'])
        plt.title('Model Loss')
        plt.ylabel('Loss')
        plt.xlabel('Epoch')
        plt.legend(['Train', 'Validation'], loc='upper left')
        plt.show()
        plt.plot(history.history['acc'])
        plt.plot(history.history['val_acc'])
        plt.title('Model Accuracy')
        plt.ylabel('Accuracy')
        plt.xlabel('Epoch')
        plt.legend(['Train', 'Validation'], loc='upper left')
        plt.show()
        Y_pred = model.predict_generator(tests, verbose=1)
        y_pred = np.argmax(Y_pred, axis=1)
        matrix = confusion_matrix(tests.classes, y_pred)
        df_cm = pd.DataFrame(matrix, index=classes, columns=classes)
        plt.figure(figsize=(10, 7))
        sn.heatmap(df_cm, annot=True)
        print(classification_report(tests.classes, y_pred, target_names=classes))
        model.save_weights('weights.h5')
    elif choice == 'predict':
        model.load_weights('./weights.h5')
        img = image.load_img(prediction, target_size=shape)
        im = image.img_to_array(img)
        im = np.expand_dims(im, axis=0)
        if CNN.lower() == 'vgg16':
            im = keras.applications.vgg16.preprocess_input(im)
            prediction = model.predict(im)
            print(prediction)
        elif CNN.lower() == 'vgg19':
            im = keras.applications.vgg19.preprocess_input(im)
            prediction = model.predict(im)
            print(prediction)
        elif CNN.lower() == 'resnet50':
            im = keras.applications.resnet50.preprocess_input(im)
            prediction = model.predict(im)
            print(prediction)
            print(keras.applications.resnet50.decode_predictions(prediction))
        elif CNN.lower() == 'densenet201':
            im = keras.applications.densenet.preprocess_input(im)
            prediction = model.predict(im)
            print(prediction)
            print(keras.applications.densenet.decode_predictions(prediction))

CNN(CNN='VGG16', choice='train')
Results:

              precision    recall  f1-score   support

    Predator       0.49      0.49      0.49      2016
    Omnivore       0.49      0.49      0.49      2016

    accuracy                           0.49      4032
I suspect that the ImageDataGenerator() is not shuffling the images "before" the train/valid split. If this is the case, how can I force the ImageDataGenerator in Keras to shuffle the dataset before the split?
If shuffling is not the problem, how can I solve my issue? What am I doing wrong?
So your model is basically overfitting, which means that it is "memorizing" your training set. I have a few suggestions:
Check that your 2 prediction classes are balanced in your training set, i.e. a 50-50 split of 0 and 1. For example, if 90% of your training data is labeled 0, then your model can simply predict everything to be 0 and still be right on the validation set 90% of the time.
If your training data is already balanced, it means that your model isn't generalizing. Perhaps you could try using the pre-trained model instead of training every layer of VGG from scratch: load the pre-trained VGG weights without the top, freeze the convolutional base, and train only the dense layers (see the sketch after this list).
Use cross-validation. Reshuffle the data in each fold and see whether results on the test set improve.
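Here is a minimal sketch of the second suggestion, assuming standalone Keras with ImageNet weights (the dense head is only an example, not a prescribed architecture):

from keras.applications import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense
from keras.optimizers import SGD

# Pre-trained convolutional base without the classification head
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False   # freeze the base: only the new dense layers are trained

# Small trainable head for the 2 classes
x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
out = Dense(2, activation='softmax')(x)

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer=SGD(lr=1e-3, momentum=0.9, nesterov=True),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit_generator(train, validation_data=valid, epochs=10)   # reuse your existing generators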
Somehow, the Keras image generator works well when combined with fit() or fit_generator(), but fails miserably when combined with predict_generator() or predict().
Since I use the PlaidML Keras backend on an AMD GPU, I would rather loop through all test images one by one and get the prediction for each image in each iteration.
import os
from PIL import Image
import keras
import numpy

# code for creating and training the model is not included

print("Prediction result:")
dir = "/path/to/test/images"
files = os.listdir(dir)
correct = 0
total = 0
# dictionary mapping class indices to animal category labels
classes = {
    0: 'This is Cat',
    1: 'This is Dog',
}
for file_name in files:
    total += 1
    image = Image.open(dir + "/" + file_name).convert('RGB')
    image = image.resize((100, 100))
    image = numpy.expand_dims(image, axis=0)
    image = numpy.array(image)
    image = image / 255
    pred = model.predict_classes([image])[0]
    animals_category = classes[pred]
    if ("cat" in file_name) and ("cat" in animals_category.lower()):
        print(correct, ". ", file_name, animals_category)
        correct += 1
    elif ("dog" in file_name) and ("dog" in animals_category.lower()):
        print(correct, ". ", file_name, animals_category)
        correct += 1
print("accuracy: ", (correct / total))

CIFAR-10 test set classification accuracy different on PyTorch and Keras

I’ve made a custom CNN in PyTorch for classifying the 10 classes of the CIFAR-10 dataset. My classification accuracy on the test dataset is 45.739%. This is very low, and I thought it was because my model is not very deep, but when I implemented the same model in Keras the classification accuracy came out to be 78.92% on the test dataset. There is no problem in Keras; however, I think there's something I'm missing in my PyTorch program.
I have used the same model architecture, strides, padding, dropout rate, optimizer, loss function, learning rate, batch size, and number of epochs in both PyTorch and Keras, and despite that the difference in classification accuracy is still huge, so I'm not able to decide how to debug my PyTorch program further.
For now I suspect three things: in Keras I use the categorical cross-entropy loss function (one-hot vector labels), while in PyTorch I use the standard cross-entropy loss function (scalar index labels); can this be a problem? If not, then I suspect either my training loop or the code for calculating classification accuracy in PyTorch. I have attached both my programs below and will be grateful for any suggestions.
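As a quick sanity check on the first suspicion: with one-hot labels, Keras-style categorical cross-entropy and PyTorch's CrossEntropyLoss (which takes raw logits plus class indices and applies log-softmax internally) compute the same quantity. A minimal sketch, assuming NumPy and PyTorch:

import numpy as np
import torch
import torch.nn.functional as F

logits = np.array([[2.0, 0.5, -1.0]])   # raw network outputs for one sample
one_hot = np.array([[1.0, 0.0, 0.0]])   # Keras-style one-hot label
index = np.array([0])                   # PyTorch-style class index

# Keras side: softmax layer followed by categorical_crossentropy(one_hot, probs)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
keras_style_loss = -np.sum(one_hot * np.log(probs), axis=1).mean()

# PyTorch side: CrossEntropyLoss on raw logits and integer labels
torch_loss = F.cross_entropy(torch.tensor(logits), torch.tensor(index))

print(keras_style_loss, torch_loss.item())   # the two values match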
My program in Keras:
#================Function that defines the CNN model===========
def CNN_model():
    model = Sequential()
    model.add(Conv2D(32, (3,3), activation='relu', padding='same', input_shape=(size,size,channels))) #SAME PADDING
    model.add(Conv2D(32, (3,3), activation='relu')) #VALID PADDING
    model.add(MaxPooling2D(pool_size=(2,2))) #VALID PADDING
    model.add(Dropout(0.25))
    model.add(Conv2D(64, (3,3), activation='relu', padding='same')) #SAME PADDING
    model.add(Conv2D(64, (3,3), activation='relu')) #VALID PADDING
    model.add(MaxPooling2D(pool_size=(2,2))) #VALID PADDING
    model.add(Dropout(0.25))
    model.add(Conv2D(128, (3,3), activation='relu', padding='same')) #SAME PADDING
    model.add(Conv2D(128, (3,3), activation='relu')) #VALID PADDING
    model.add(MaxPooling2D(pool_size=(2,2), name='feature_extractor_layer')) #VALID PADDING
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(512, activation='relu', name='second_last_layer'))
    model.add(Dropout(0.25))
    model.add(Dense(10, activation='softmax', name='softmax_layer')) #10 nodes in the softmax layer
    model.summary()
    return model

#=====Main program starts here========
#get_train_data() and get_test_data() are my own custom functions to get the CIFAR-10 dataset
images_train, labels_train, class_train = get_train_data(0,10)
images_test, labels_test, class_test = get_test_data(0,10)
model = CNN_model()
model.compile(loss='categorical_crossentropy', #loss function of the CNN
              optimizer=Adam(lr=1.0e-4),       #optimizer
              metrics=['accuracy'])            #'accuracy' metric is to be evaluated
#images_train and images_test contain images,
#class_train and class_test contain one-hot vector labels
model.fit(images_train, class_train,
          batch_size=128,
          epochs=50,
          validation_data=(images_test, class_test),
          verbose=1)
scores = model.evaluate(images_test, class_test, verbose=0)
print("Accuracy: " + str(scores[1]*100) + "% \n")
My program in PyTorch:
#========DEFINE THE CNN MODEL=====
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, 1, 1)    #SAME PADDING
        self.conv2 = nn.Conv2d(32, 32, 3, 1, 0)   #VALID PADDING
        self.pool1 = nn.MaxPool2d(2, 2)           #VALID PADDING
        self.drop1 = nn.Dropout2d(0.25)           #DROPOUT OF 0.25
        self.conv3 = nn.Conv2d(32, 64, 3, 1, 1)   #SAME PADDING
        self.conv4 = nn.Conv2d(64, 64, 3, 1, 0)   #VALID PADDING
        self.pool2 = nn.MaxPool2d(2, 2)           #VALID PADDING
        self.drop2 = nn.Dropout2d(0.25)           #DROPOUT OF 0.25
        self.conv5 = nn.Conv2d(64, 128, 3, 1, 1)  #SAME PADDING
        self.conv6 = nn.Conv2d(128, 128, 3, 1, 0) #VALID PADDING
        self.pool3 = nn.MaxPool2d(2, 2)           #VALID PADDING
        self.drop3 = nn.Dropout2d(0.25)           #DROPOUT OF 0.25
        self.fc1 = nn.Linear(128*2*2, 512)        #128*2*2 IS OUTPUT DIMENSION AFTER THE PREVIOUS LAYER
        self.drop4 = nn.Dropout(0.25)             #DROPOUT OF 0.25
        self.fc2 = nn.Linear(512, 10)             #10 OUTPUT NODES

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = self.pool1(x)
        x = self.drop1(x)
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        x = self.pool2(x)
        x = self.drop2(x)
        x = F.relu(self.conv5(x))
        x = F.relu(self.conv6(x))
        x = self.pool3(x)
        x = self.drop3(x)
        x = x.view(-1, 2*2*128)  #FLATTENING OPERATION; 2*2*128 IS OUTPUT AFTER THE PREVIOUS LAYER
        x = F.relu(self.fc1(x))
        x = self.drop4(x)
        x = self.fc2(x)          #LAST LAYER DOES NOT NEED SOFTMAX BECAUSE THE LOSS FUNCTION WILL TAKE CARE OF IT
        return x

#=======FUNCTION TO CONVERT INPUT AND TARGET TO TORCH TENSORS AND LOAD THEM ONTO THE GPU======
def PrepareInputDataAndTargetData(device, images, labels, batch_size):
    #GET MINI BATCH OF IMAGES AND RESHAPE THE TORCH TENSOR FOR CNN PROCESSING
    mini_batch_images = torch.tensor(images)
    mini_batch_images = mini_batch_images.view(batch_size, 3, 32, 32)
    #GET MINI BATCH OF LABELS; TARGET SHOULD BE IN LONG FORMAT SO CONVERT THAT TOO
    mini_batch_labels = torch.tensor(labels)
    mini_batch_labels = mini_batch_labels.long()
    #FEED THE INPUT DATA AND TARGET LABELS TO THE GPU
    mini_batch_images = mini_batch_images.to(device)
    mini_batch_labels = mini_batch_labels.to(device)
    return mini_batch_images, mini_batch_labels

#==========MAIN PROGRAM==========
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#get_train_data() and get_test_data() are my own custom functions to get the CIFAR-10 dataset
Images_train, Labels_train, Class_train = get_train_data(0, 10)
Images_test, Labels_test, Class_test = get_test_data(0, 10)
net = Net()
net = net.double() #https://discuss.pytorch.org/t/runtimeerror-expected-object-of-scalar-type-double-but-got-scalar-type-float-for-argument-2-weight/38961
print(net)
#MAP THE MODEL ONTO THE GPU
net = net.to(device)
#CROSS ENTROPY LOSS FUNCTION AND ADAM OPTIMIZER
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=1e-4)
#PREPARE THE DATALOADER
#Images_train contains images and Labels_train contains indices i.e. 0,1,...,9
dataset = TensorDataset(Tensor(Images_train), Tensor(Labels_train))
trainloader = DataLoader(dataset, batch_size=128, shuffle=True)
#START TRAINING THE CNN MODEL FOR 50 EPOCHS
for epoch in range(0, 50):
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        inputs = torch.tensor(inputs).double()
        inputs = inputs.view(len(inputs), 3, 32, 32) #RESHAPE THE IMAGES
        labels = labels.long() #MUST CONVERT LABELS TO LONG FORMAT
        #MAP THE INPUTS AND LABELS TO THE GPU
        inputs = inputs.to(device)
        labels = labels.to(device)
        #FORWARD PROP, BACKWARD PROP, PARAMETER UPDATE
        optimizer.zero_grad()
        outputs = net.forward(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    #CALCULATE CLASSIFICATION ACCURACY ON ALL 10 CLASSES
    with torch.no_grad():
        Images_class, Labels_class = PrepareInputDataAndTargetData(device, Images_test, Labels_test, len(Images_test))
        network_outputs = net.forward(Images_class)
        correct = (torch.argmax(network_outputs.data, 1) == Labels_class.data).float().sum()
        acc = float(100.0 * (correct / len(Images_class)))
        print("Accuracy is: " + str(acc) + "\n")
        del Images_class
        del Labels_class
        del network_outputs
        del correct
        del acc
    torch.cuda.empty_cache()
print("Done\n")
I am not fully aware of how the actual core backend works in either library; however, I suppose that the classification accuracy of a given model should be almost the same regardless of the library.

Using softmax in Neural Networks to decide label of input

I am using a Keras model with the following layers to predict the label of an input (one of 4 labels):
embedding_layer = keras.layers.Embedding(MAX_NB_WORDS,
                                         EMBEDDING_DIM,
                                         weights=[embedding_matrix],
                                         input_length=MAX_SEQUENCE_LENGTH,
                                         trainable=False)
sequence_input = keras.layers.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
hidden_layer = keras.layers.Dense(50, activation='relu')(embedded_sequences)
flat = keras.layers.Flatten()(hidden_layer)
preds = keras.layers.Dense(4, activation='softmax')(flat)
model = keras.models.Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
model.fit(X_train, Y_train, batch_size=32, epochs=100)
However, the softmax layer returns 4 output values (because I have 4 labels).
When I use the predict function to get the predicted Y with the same model, I get an array of 4 values for each X rather than one single label for the input.
model.predict(X_test, batch_size = None, verbose = 0, steps = None)
How do I make the output layer of the Keras model, or the model.predict function, decide on one single label rather than output a weight for each label?
The following is a common function to sample from a probability vector:
import numpy as np

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
Taken from here.
The temperature parameter decides how strongly the differences between the probability weights are weighted. A temperature of 1 considers each weight "as it is", a temperature larger than 1 reduces the differences between the weights, and a temperature smaller than 1 increases them.
Here is an example using a probability vector over 3 labels:
p = np.array([0.1, 0.7, 0.2]) # The first label has a probability of 10% of being chosen, the second 70%, the third 20%
print(sample(p, 1)) # sample using the input probabilities, unchanged
print(sample(p, 0.1)) # the new vector of probabilities from which to sample is [ 3.54012033e-09, 9.99996371e-01, 3.62508322e-06]
print(sample(p, 10)) # the new vector of probabilities from which to sample is [ 0.30426696, 0.36962778, 0.32610526]
To see the new vector make sample return preds.
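Applied to the model above, you can feed each row of model.predict into sample, or just take the argmax if you always want the single most likely label; a minimal sketch:

probs = model.predict(X_test)                                  # shape (n_samples, 4): one probability per label
sampled_labels = [sample(p, temperature=1.0) for p in probs]   # stochastic: sample according to the probabilities
argmax_labels = np.argmax(probs, axis=1)                       # deterministic: always the most likely label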

Linear Regression in Tensor Flow - Error when modifying getting started code

I am very new to TensorFlow and, in parallel, I am learning traditional machine learning techniques. Previously, I was able to successfully implement linear regression in MATLAB and in Python using scikit-learn.
When I tried to reproduce it using TensorFlow with the same dataset, I got invalid outputs. Could someone advise me on where I am making a mistake or what I am missing?
In fact, I am using the code from the TensorFlow introductory tutorial, and I just changed x_train and y_train to a different data set.
# Loading the ML coursera course ex1 (Wk 2) data to try it out
'''
path = r'C:\Users\Prasanth\Dropbox\Python Folder\ML in Python\data\ex1data1.txt'
fh = open(path,'r')
l1 = []
l2 = []
for line in fh:
    temp = (line.strip().split(','))
    l1.append(float(temp[0]))
    l2.append(float(temp[1]))
'''
l1 = [6.1101, 5.5277, 8.5186, 7.0032, 5.8598, 8.3829, 7.4764, 8.5781, 6.4862, 5.0546, 5.7107, 14.164, 5.734, 8.4084, 5.6407, 5.3794, 6.3654, 5.1301, 6.4296, 7.0708, 6.1891, 20.27, 5.4901, 6.3261, 5.5649, 18.945, 12.828, 10.957, 13.176, 22.203, 5.2524, 6.5894, 9.2482, 5.8918, 8.2111, 7.9334, 8.0959, 5.6063, 12.836, 6.3534, 5.4069, 6.8825, 11.708, 5.7737, 7.8247, 7.0931, 5.0702, 5.8014, 11.7, 5.5416, 7.5402, 5.3077, 7.4239, 7.6031, 6.3328, 6.3589, 6.2742, 5.6397, 9.3102, 9.4536, 8.8254, 5.1793, 21.279, 14.908, 18.959, 7.2182, 8.2951, 10.236, 5.4994, 20.341, 10.136, 7.3345, 6.0062, 7.2259, 5.0269, 6.5479, 7.5386, 5.0365, 10.274, 5.1077, 5.7292, 5.1884, 6.3557, 9.7687, 6.5159, 8.5172, 9.1802, 6.002, 5.5204, 5.0594, 5.7077, 7.6366, 5.8707, 5.3054, 8.2934, 13.394, 5.4369]
l2 = [17.592, 9.1302, 13.662, 11.854, 6.8233, 11.886, 4.3483, 12.0, 6.5987, 3.8166, 3.2522, 15.505, 3.1551, 7.2258, 0.71618, 3.5129, 5.3048, 0.56077, 3.6518, 5.3893, 3.1386, 21.767, 4.263, 5.1875, 3.0825, 22.638, 13.501, 7.0467, 14.692, 24.147, -1.22, 5.9966, 12.134, 1.8495, 6.5426, 4.5623, 4.1164, 3.3928, 10.117, 5.4974, 0.55657, 3.9115, 5.3854, 2.4406, 6.7318, 1.0463, 5.1337, 1.844, 8.0043, 1.0179, 6.7504, 1.8396, 4.2885, 4.9981, 1.4233, -1.4211, 2.4756, 4.6042, 3.9624, 5.4141, 5.1694, -0.74279, 17.929, 12.054, 17.054, 4.8852, 5.7442, 7.7754, 1.0173, 20.992, 6.6799, 4.0259, 1.2784, 3.3411, -2.6807, 0.29678, 3.8845, 5.7014, 6.7526, 2.0576, 0.47953, 0.20421, 0.67861, 7.5435, 5.3436, 4.2415, 6.7981, 0.92695, 0.152, 2.8214, 1.8451, 4.2959, 7.2029, 1.9869, 0.14454, 9.0551, 0.61705]
print ('List length and data type', len(l1), type(l1))
#------------------#
import tensorflow as tf
# Model parameters
W = tf.Variable([0], dtype=tf.float64)
b = tf.Variable([0], dtype=tf.float64)
# Model input and output
x = tf.placeholder(tf.float64)
linear_model = W * x + b
y = tf.placeholder(tf.float64)
# loss or cost function
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer (gradient descent) with learning rate = 0.01
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
# training data (labelled input & output swt)
# Using coursera data instead of sample data
#x_train = [1.0, 2, 3, 4]
#y_train = [0, -1, -2, -3]
x_train = l1
y_train = l2
# training loop (1000 iterations)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # reset values to wrong
for i in range(1000):
    sess.run(train, {x: x_train, y: y_train})
# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))
Output
List length and data type: 97 <class 'list'>
W: [ nan] b: [ nan] loss: nan
One major problem with your estimator is the loss function. Since you use tf.reduce_sum, the loss grows with the number of samples, which you have to compensate for by using a smaller learning rate. A better solution would be to use the mean squared error loss:
loss = tf.reduce_mean(tf.square(linear_model - y))
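Equivalently, you could keep tf.reduce_sum and divide the learning rate by the number of samples, since the summed loss makes the gradients roughly len(x_train) times larger than the mean loss (a sketch under that assumption, not tuned for this exact dataset):

loss = tf.reduce_sum(tf.square(linear_model - y))                    # summed squared error, as in the question
optimizer = tf.train.GradientDescentOptimizer(0.01 / len(x_train))   # compensate for the larger gradients
train = optimizer.minimize(loss)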

Cross validation in classifying text documents using scikit-learn

Do you first do cross-validation followed by feature extraction, or the other way around, when classifying text documents using scikit-learn?
Here is my pipeline:
union = FeatureUnion(
    transformer_list=[
        ('tfidf', TfidfVectorizer()),
        ('featureEx', FeatureExtractor()),
        ('spell_chker', Spellingchecker()),
    ], n_jobs=-1)
I am doing it in the following way, but I wonder if I should extract the features first and then do the cross-validation. In this example X is the list of documents and y is the labels.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size= 0.2)
X_train = union.fit_transform(X_train)
X_test = union.transform(X_test)
ch2 = SelectKBest(f_classif, k = 7000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)
clf = SVC(C=1, gamma=0.001, kernel='linear', probability=True).fit(X_train, y_train)
print("classification report:")
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
print()
Doing the feature selection first and then cross-validating on those features is fairly common on text data, but it is less desirable. It can lead to over-fitting, and the cross-validation procedure may over-estimate your true accuracy.
When you do the feature selection first, that feature selection process gets to look at all the data. The point of cross-validation is to hide one fold from the others. By doing the feature selection first, you leak some of that data knowledge into the other folds.
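A minimal sketch of keeping everything inside cross-validation with scikit-learn's Pipeline, reusing your FeatureUnion from the question (FeatureExtractor and Spellingchecker are your own transformers):

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Every step is re-fitted on the training folds only, so nothing leaks into the held-out fold
pipe = Pipeline([
    ('features', union),                          # the FeatureUnion defined above
    ('select', SelectKBest(f_classif, k=7000)),   # feature selection now happens inside each fold
    ('svm', SVC(C=1, gamma=0.001, kernel='linear', probability=True)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean(), scores.std())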
