Split train and test data for image classification CNN - machine-learning

So i have binary classification problem for image, there are balanced dataset for class a and b.
I have 307 images for each class. i want to ask, when i split to train and test dataset, should the train and test also balanced for each class? or any method to split the dataset

You can use sklearn.model_selection.StratifiedShuffleSplit which uses stratified random sampling, proportional random sampling, or quota random sampling. This will give a better distribution.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
# dummy dataset
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
sss.get_n_splits(X, y)
print(sss)
for train_index, test_index in sss.split(X, y):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
307 can be low for a CNN, you can also use data augmentation to increase your samples.
https://github.com/mdbloice/Augmentor

Related

Is it necessary to use cross-validation after data is split using StratifiedShuffleSplit?

I used StratifiedShuffleSplit to split the data and now I am wondering if I need to use cross-validation again as I go for building the classification model(Logistic Regression,KNN,Random Forest etc.) I am confused about it because reading the documentation in Sklearn I get the impression that StratifiedShuffleSplit is a mix of splitting the data and cross-validating it at the same time.
StratifiedShuffleSplit provides you just a list with train/test indices. How it will be used depends on you.
You can fit the model with train set and predict on the test and calculate the score manually - so implementing cross validation by yourself
Or you can use cross_val_score and pass StratifiedShuffleSplit() to it and cross_val_score will do the same thing.
Example:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=1, random_state=1)
# Calculate scores automatically
accuracy_per_split = cross_val_score(model, X, y, scoring="accuracy", cv=sss, n_jobs=1)
print(f"Accuracies per splits: {accuracy_per_split}")
# Calculate scores manually
accuracy_per_split = []
for train_index, test_index in sss.split(X, y):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
accuracy_per_split.append(acc)
print(f"Accuracies per splits: {accuracy_per_split}")

Naive Bayes model from scikit learn

What is the difference between the two gnb.fit()?
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train.ravel())
gnb.fit(X_train, y_train).predict(X_test)
I see two differences:
np.ravel flattens arrays, e.g.
x = np.array([[1, 2, 3], [4, 5, 6]])
np.ravel(x)
array([1, 2, 3, 4, 5, 6])
gnb.fit returns the model itself, and it can be used to immediately predict. Predicting does not modify the model.
If your y_train is one-dimensional, you'll get the same model regardless of whether you call gbn.fit(x, y) or gbn.fit(x, y.ravel()) because y == y.ravel()

how to predict new values when a machine learning model was standardized StandardScaler

I'm working on a machine learning model, I have a dataframe with the data
I normalize the data with a standard distribution
scaler = StandardScaler()
df = scaler.fit_transform(df)
I divide the datasets into target and characteristics
X_df = df[X_characteristics_list]
y_df = df[target]
I split into train and test then I train the model
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size = 0.25)
forest = RandomForestRegressor()
forest.fit(X_train, y_train)
I predict the test to validate the effectiveness
y_test_pred = forest.predict(X2_test)
mse = mean_squared_error(y_test, y_test_pred)
But when is time to test in real life I need to leave the model ready to predict
If i Want to predict just one record
let say [100,20,34]
I can't because I need the record standardized, and transform it with StandardScaler does not work because it depends on standard deviation so I would need the original dataset
What's the best way to solve this problem.
See below:
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.preprocessing import StandardScaler
# Create our input and output matrices
>>> X, y = make_classification()
# Split train-test... "test" will be production/unobserved/"real-life" data
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
# What does X_train look like?
>>> X_train
array([[-0.08930702, -2.71113991, -0.93849926, ..., 0.21650905,
0.68952722, 0.61365789],
[-0.31143977, -1.87817904, 0.08287492, ..., -0.41332943,
-0.58967179, 1.7239411 ],
[-1.62287589, 1.10691318, -0.630556 , ..., -0.35060008,
1.11270562, 0.08106694],
...,
[-0.59797041, 0.90218081, 0.89983074, ..., -0.54374315,
1.18534841, -0.03397969],
[-1.2006559 , 1.01890955, -1.21617181, ..., 1.76263322,
1.38280423, -1.0192972 ],
[ 0.11883425, 1.42952643, -1.23647358, ..., 1.02509208,
-1.14308885, 0.72096531]])
# Let's scale it
>>> scaler = StandardScaler()
>>> X_train = scaler.fit_transform(X_train)
>>> X_train
array([[ 0.08867642, -1.97950269, -1.1214106 , ..., 0.22075623,
0.57844552, 0.46487917],
[-0.10736984, -1.34896243, 0.00808597, ..., -0.37670234,
-0.6045418 , 1.57819736],
[-1.26479555, 0.91071257, -0.78086855, ..., -0.3171979 ,
0.96979563, -0.06916763],
...,
[-0.36025134, 0.7557329 , 0.91152449, ..., -0.50041152,
1.03697478, -0.18452874],
[-0.89215959, 0.84409499, -1.42847749, ..., 1.68739437,
1.21957946, -1.17253964],
[ 0.27237431, 1.15492649, -1.4509284 , ..., 0.98777012,
-1.116335 , 0.57247992]])
# Fit the model
>>> model = LogisticRegression()
>>> model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
# Now let's use the already-fitted StandardScaler object to simply transform
# *not fit_transform* the test data
>>> X_test = scaler.transform(X_test)
>>> model.predict(X_test)
array([1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0,
0, 0, 0])
Note that using joblib or pickle you can save the scaler object and re-load it for scaling in "real-time" later on.

Very low performance even after oversampling dataset

I'm using an MLPClassifier for classification of heart diseases. I used imblearn.SMOTE to balance the objects of each class. I was getting very good results (85% balanced acc.), but i was advised that i would not use SMOTE on test data, only for train data. After i made this changes, the performance of my classifier fell down too much (~35% balanced accuracy) and i don't know what can be wrong.
Here is a simple benchmark with training data balanced but test data unbalanced:
And this is the code:
def makeOverSamplesSMOTE(X,y):
from imblearn.over_sampling import SMOTE
sm = SMOTE(sampling_strategy='all')
X, y = sm.fit_sample(X, y)
return X,y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)
## Normalize data
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.fit_transform(X_test)
## SMOTE only on training data
X_train, y_train = makeOverSamplesSMOTE(X_train, y_train)
clf = MLPClassifier(hidden_layer_sizes=(20),verbose=10,
learning_rate_init=0.5, max_iter=2000,
activation='logistic', solver='sgd', shuffle=True, random_state=30)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
I'd like to know what i'm doing wrong, since this seems to be the proper way of preparing data.
The first mistake in your code is when you are transforming data into standard format. You only need to fit StandardScaler once and that is on X_train. You shouldn't refit it on X_test. So the correct code will be:
def makeOverSamplesSMOTE(X,y):
from imblearn.over_sampling import SMOTE
sm = SMOTE(sampling_strategy='all')
X, y = sm.fit_sample(X, y)
return X,y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)
## Normalize data
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
## SMOTE only on training data
X_train, y_train = makeOverSamplesSMOTE(X_train, y_train)
clf = MLPClassifier(hidden_layer_sizes=(20),verbose=10,
learning_rate_init=0.5, max_iter=2000,
activation='logistic', solver='sgd', shuffle=True, random_state=30)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
For the machine learning model, try reducing the learning rate. it is too high. the default learning rate in sklearn is 0.001. Try changing the activation function and the number of layers. Also not every ML model works on every dataset so you might need to look at your data and choose ML model accordingly.
Hope you have already got better result for your model.I tried by changing few parameter, and I getting accuracy of 65%, when I change it to 90:10 sample I got an accuracy of 70%.
But accuracy can mislead,so I calculated F1 score which give you better picture of prediction.
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(1,),verbose=False,
learning_rate_init=0.001,
max_iter=2000,
activation='logistic', solver='sgd', shuffle=True, random_state=50)
clf.fit(X_train_res, y_train_res)
y_pred = clf.predict(X_test)
from sklearn.metrics import accuracy_score, confusion_matrix ,classification_report
score=accuracy_score(y_test, y_pred, )
print(score)
cr=classification_report(y_test, clf.predict(X_test))
print(cr)
Accuracy = 0.65
classification report :
precision recall f1-score support
0 0.82 0.97 0.89 33
1 0.67 0.31 0.42 13
2 0.00 0.00 0.00 6
3 0.00 0.00 0.00 4
4 0.29 0.80 0.42 5
micro avg 0.66 0.66 0.66 61
macro avg 0.35 0.42 0.35 61
weighted avg 0.61 0.66 0.61 61
confusion_matrix:
array([[32, 0, 0, 0, 1],
[ 4, 4, 2, 0, 3],
[ 1, 1, 0, 0, 4],
[ 1, 1, 0, 0, 2],
[ 1, 0, 0, 0, 4]], dtype=int64)

TFLearn model evaluation

I am new to the machine learning and TensorFlow. I am trying to train a simple model to recognize gender. I use small data-set of height, weight, and shoe size. However, I have encountered a problem with evaluating model's accuracy.
Here's the entire code:
import tflearn
import tensorflow as tf
import numpy as np
# [height, weight, shoe_size]
X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],
[190, 90, 47], [175, 64, 39], [177, 70, 40], [159, 55, 37], [171, 75, 42],
[181, 85, 43], [170, 52, 39]]
# 0 - for female, 1 - for male
Y = [1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0]
data = np.column_stack((X, Y))
np.random.shuffle(data)
# Split into train and test set
X_train, Y_train = data[:8, :3], data[:8, 3:]
X_test, Y_test = data[8:, :3], data[8:, 3:]
# Build neural network
net = tflearn.input_data(shape=[None, 3])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 1, activation='linear')
net = tflearn.regression(net, loss='mean_square')
# fix for tflearn with TensorFlow 12:
col = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
for x in col:
tf.add_to_collection(tf.GraphKeys.VARIABLES, x)
# Define model
model = tflearn.DNN(net)
# Start training (apply gradient descent algorithm)
model.fit(X_train, Y_train, n_epoch=100, show_metric=True)
score = model.evaluate(X_test, Y_test)
print('Training test score', score)
test_male = [176, 78, 42]
test_female = [170, 52, 38]
print('Test male: ', model.predict([test_male])[0])
print('Test female:', model.predict([test_female])[0])
Even though model's prediction is not very accurate
Test male: [0.7158362865447998]
Test female: [0.4076206684112549]
The model.evaluate(X_test, Y_test) always returns 1.0. How do I calculate real accuracy on the test data-set using TFLearn?
You want to do binary classification in this case. Your network is set to perform linear regression.
First, transform the labels (gender) to categorical features:
from tflearn.data_utils import to_categorical
Y_train = to_categorical(Y_train, nb_classes=2)
Y_test = to_categorical(Y_test, nb_classes=2)
The output layer of your network needs two output units for the two classes you want to predict. Also the activation needs to be softmax for classification. The tf.learn default loss is cross-entropy and the default metric is accuracy, so this is already correct.
# Build neural network
net = tflearn.input_data(shape=[None, 3])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)
The output will now be a vector with the probability for each gender. For example:
[0.991, 0.009] #female
Bear in mind that you will hopelessly overfit the network with your tiny data set. This means that during training the accuracy will approach 1 while, the accuracy on your test set will be quite poor.

Resources