Properly declaring input_shape for neural network in Keras? - machine-learning

I am attempting to write code to identify data types after loading it in from CSV files. So there are 5 possible labels, and the feature vector contains a list of lists. The feature vector is a list of lists with the following shape:
[slash_count, dash_count, colon_count, letters, dot_count, digits]
I then split my feature and label vectors into training, testing, and validation sets. I found some code on Stackoverflow that someone wrote to do this and I have used the same:
X_train, X_test, y_train, y_test = train_test_split(ml_list, labels, test_size=0.3, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=1)
After doing this, I normalize the features in scale [0,1], and then I create the categorical variables for the labels:
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.fit_transform(X_test)
X_val_minmax = min_max_scaler.fit_transform(X_val)
from keras.utils import to_categorical
y_train_minmax = to_categorical(y_train)
y_test_minmax = to_categorical(y_test)
y_val_minmax = to_categorical(y_val)
Next, I attempt to find the shape of the newly recoded variables:
print(y_train_minmax.shape) #(91366, 4)
print(X_train_minmax.shape) #(91366, 6)
print(X_test_minmax.shape) #(55939, 6)
print(X_val_minmax.shape) #(39157, 6)
print(y_train_minmax.shape) #(91366, 4)
print(y_test_minmax.shape) #(55939, 4)
print(y_val_minmax.shape) #(39157, 4)
Finally, I build the model and attempt to fit it:
model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(91366, 6)))
model.add(layers.Dense(3, activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train_minmax, y_train_minmax, epochs=5, batch_size=128)
I get this message when I run the code:
ValueError: Error when checking input: expected dense_1_input to have 3 dimensions, but got array with shape (91366, 6)
I believe that the error is in when I create the neural network with the input shape. I am having a hard time understanding where I went wrong. Any help would be great!

You should change this line:
model.add(layers.Dense(512, activation='relu', input_shape=(6,)))
In keras you don't need to directly specify the number of examples you have in your dataset. As input_shape you need to provide only a shape of a single data point.
Another potential error which I spotted in your code snippet is that you should set:
model.add(layers.Dense(4, activation='softmax'))
As your output single data point has a shape of (4,). It's not consistent with what you've said about possible layers so I'd also advise rechecking your data.
Another possible mistake which I spotted is that you are not training separate scalers for train, test and valid datasets - but a single one on a train set - and then scale your other dataset using a trained scaler.

Related

How to compare baseline and GridSearchCV results fair?

I am a bit confusing with comparing best GridSearchCV model and baseline.
For example, we have classification problem.
As a baseline, we'll fit a model with default settings (let it be logistic regression):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
baseline = LogisticRegression()
baseline.fit(X_train, y_train)
pred = baseline.predict(X_train)
print(accuracy_score(y_train, pred))
So, the baseline gives us accuracy using the whole train sample.
Next, GridSearchCV:
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
X_val, X_test_val,y_val,y_test_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
parameters = [ ... ]
best_model = GridSearchCV(LogisticRegression(parameters,scoring='accuracy' ,cv=cv))
best_model.fit(X_val, y_val)
print(best_model.best_score_)
Here, we have accuracy based on validation sample.
My questions are:
Are those accuracy scores comparable? Generally, is it fair to compare GridSearchCV and model without any cross validation?
For the baseline, isn't it better to use Validation sample too (instead of the whole Train sample)?
No, they aren't comparable.
Your baseline model used X_train to fit the model. Then you're using the fitted model to score the X_train sample. This is like cheating because the model is going to already perform the best since you're evaluating it based on data that it has already seen.
The grid searched model is at a disadvantage because:
It's working with less data since you have split the X_train sample.
Compound that with the fact that it's getting trained with even less data due to the 5 folds (it's training with only 4/5 of X_val per fold).
So your score for the grid search is going to be worse than your baseline.
Now you might ask, "so what's the point of best_model.best_score_? Well, that score is used to compare all the models used when searching for the optimal hyperparameters in your search space, but in no way should be used to compare against a model that was trained outside of the grid search context.
So how should one go about conducting a fair comparison?
Split your training data for both models.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Fit your models using X_train.
# fit baseline
baseline.fit(X_train, y_train)
# fit using grid search
best_model.fit(X_train, y_train)
Evaluate models against X_test.
# baseline
baseline_pred = baseline.predict(X_test)
print(accuracy_score(y_test, baseline_pred))
# grid search
grid_pred = best_model.predict(X_test)
print(accuracy_score(y_test, grid_pred))

sklearn cross valid / cross predict

I understand that cross_val_predict / cross_val trains n out-of-folds models and then aggragate them to produce the final prediction. This is done on the train phase. Now, I want to use the fitted models to predict the test data. I can use for loop to collect predictions on the test data and aggregate them but first I want to ask if there is a build-in sklearn method for this?
from sklearn.model_selection import cross_val_predict, train_test_split
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
lasso = linear_model.Lasso()
y_train_hat = cross_val_predict(lasso, X_train, y_train, cv=3)
y_test_hat = do_somthing(lasso, X_test)```
Thanks
The 3 models from your cross_val_predict are not saved anywhere, so you can't make predictions with them. You can use instead cross_validate with return_estimator=True. You'll still be left with three models that you'll have to manually use to make and aggregate predictions. (You could in principle put those models into an ensemble model like VotingClassifier, but at least for now there is no prefit argument to prevent refitting your estimators. There some discussion in Issue 7382 and links from there.)

Model validation with separate dataset

Apologies, quite new to sklearn. I'm trying to validate a model using an external dataset for binary classification of text strings. I've trained the model but want to use it against another dataset of a different size for prediction rather than include the data in the initial dataset split. Is this even possible?
Initial split
vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(3, 3))
Xprod = vectorizer.fit_transform(prod_good)
X = vectorizer.fit_transform(total_requests)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=21)
Test the model
linear_svm=LinearSVC(C=1)
linear_svm.fit(X_train, y_train)
y_pred = linear_svm.predict(X_test)
score_test = metrics.accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
New prediction
newpred = linear_svm.predict(Xprod)
...
Error:
ValueError: X has 4553 features per sample; expecting 24422
Think I'm misunderstanding some basic concepts here
The function fit_transform makes a fit and then a transform. So this line fits your vectorizer and then transforms total_requests to X:
X = vectorizer.fit_transform(total_requests)
As your vectorizer must be fitted only one time (in order to have the same matrix of features each time you use your vectorizer), to compute Xprod, you just need to use transform:
Xprod = vectorizer.transform(prod_good)
Also, you need to compute Xprod after the vectorizer is fitted, so compute Xprod after X.

Validation accuracy and validation loss almost remains constant in every epoch

I am making an autonomous farming robot for my final year project. I want to move it autonomously in lanes in side the farms. I am just using the raspberry pi image in front of my vehicle. I collect my data through pi and then send it to my computer for training.
Initially i have just trained it for moving in a straight line. As i have not used encoders in my motors so there is a possibility of its being diverging along one direction , so i have to constantly give it the feedback to stay on the right path.
Sample image is as follows, Note this is black and white image :enter image description here
I have 836 images for training and 356 for validation. When i am trying to train it, my model accuracy doesnot improves much. I have tried changing different structures, from fully connected layers to different convolutional layers, my training accuracy doesnot improves much and perhaps most of the times validation accuracy and validation loss remains same.
I am confused that why is this so, is this to do with my code or should i apply computer vision techniques on the image so that features are more prominently visible. What should be the best approach to tackle this problem.
My code is as follows:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
# fix dimension ordering issue
from keras import backend as K
import numpy as np
import glob
import pandas as pd
from sklearn.model_selection import train_test_split
K.set_image_dim_ordering('th')
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
def load_data(path):
print("Loading training data...")
training_data = glob.glob(path)[0]
data=np.load(training_data)
a=data['train']
b=data['train_labels']
s=np.concatenate((a, b), axis=1)
data=pd.DataFrame(s)
data=data.sample(frac=1)
X = data.iloc[:,:-4]
y=data.iloc[:,-4:]
print("Image array shape: ", X.shape)
print("Label array shape: ", y.shape)
# normalize data
# train validation split, 7:3
return train_test_split(X, y, test_size=0.3)
data_path = "*.npz"
X_train,X_test,y_train,y_test=load_data(data_path)
# reshape to be [samples][channels][width][height]
X_train = X_train.values.reshape(X_train.shape[0], 1, 120, 320).astype('float32')
X_test = X_test.values.reshape(X_test.shape[0], 1, 120, 320).astype('float32')
# normalize inputs from 0-255 to 0-1
X_train = X_train / 255.0
X_test = X_test / 255.0
# one hot encode outputs
num_classes = y_test.shape[1]
# define a simple CNN model
def baseline_model():
model = Sequential()
model.add(Conv2D(30, (5, 5), input_shape=(1, 120, 320), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(15, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
# build the model
model = baseline_model()
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=10)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("CNN Error: %.2f%%" % (100-scores[1]*100))
sample output: This is the best output and it is of the above code:
enter image description here
I solved this problem by changing the structure of my algorithm and using NVIDIA's deep learning car algorithm to solve this problem. The algorithm is very robust and applies basic computer vision also on it. You can easily find sample implementation for toy cars on medium/youtube also.
this article was really helpful for me:
https://towardsdatascience.com/deeppicar-part-1-102e03c83f2c
additionally this resource was also very helpful:
https://zhengludwig.wordpress.com/projects/self-driving-rc-car/

How to Build a Decision tree Regressor model

I am learning ML and was doing a simple handsOn as below:
//
Split boston.data into two sets names x_train and x_test. Also, split boston.target into two sets y_train and y_test.
Build a Decision tree Regressor model from x_train set, with default parameters.
//
I did following code for this:
from sklearn import datasets, model_selection, tree
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = model_selection.train_test_split(boston.data,boston.target, random_state=30)
dt = tree.DecisionTreeRegressor()
dt_reg = dt.fit(x_train)
When I am doing above, it's giving:
TypeError: fit() missing 1 required positional argument: 'y'
Can I fit a model for one training dataset?
What should I give here as 'y'?
As the error states, the fit() method takes 2 parameters for a regression problem, the predictors and the outcome:
dt_reg = dt.fit(x_train, y_train)
Supervised learning models such as the regression tree you are using require a set of observations composed of features (each row of X_train can be understood as a vector containing features for one observation) and a target outcome (each element in the vector y_train)

Resources