scikit learn train_test_split function not working as expected

scikit learn train_test_split function not working as expected - machine-learning

I am using train test split function to separate data for training and testing, but function assigns wrong label for separated train test data. Instead of assigning label from expected row it assigns label from 2nd row from expected row. Please, Let me know where i am going wrong ?
data = pd.read_csv('To_Tanaji.csv')
print(data.columns)
print(data.shape)
#plt.hist(train["DiffCorrectLatRawLat"])
#test = pd.read_csv('test.csv')
#np.polyfit(data['DistanceRaw2GPS'], data['DistanceCorrected2GPS'], 2)
Output= data.DistanceCorrected2GPS
Input=data.DistanceRaw2GPS
X_train, X_test, y_train, y_test = train_test_split(Input, Output, test_size=0.2)

I won't suggest turnning off the shuffle parameter in your train_test_split function rather keep your random_state variable fixed for reproducible splits. It's better to split randomly than splitting say the top 20% of the dataset this can skew your data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Input, Output, test_size = 0.20, random_state = 0)
If the split labels are wrong you should make sure the Output and Input variables are assigned correctly or not.

The train_test_split function will shuffle your data by default. If you don't want this, use shuffle=False.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
If possible, provide your input data (scrambled or not) to reproduce the problem.

Related

sklearn cross valid / cross predict

I understand that cross_val_predict / cross_val trains n out-of-folds models and then aggragate them to produce the final prediction. This is done on the train phase. Now, I want to use the fitted models to predict the test data. I can use for loop to collect predictions on the test data and aggregate them but first I want to ask if there is a build-in sklearn method for this?
from sklearn.model_selection import cross_val_predict, train_test_split
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
lasso = linear_model.Lasso()
y_train_hat = cross_val_predict(lasso, X_train, y_train, cv=3)
y_test_hat = do_somthing(lasso, X_test)```
Thanks

The 3 models from your cross_val_predict are not saved anywhere, so you can't make predictions with them. You can use instead cross_validate with return_estimator=True. You'll still be left with three models that you'll have to manually use to make and aggregate predictions. (You could in principle put those models into an ensemble model like VotingClassifier, but at least for now there is no prefit argument to prevent refitting your estimators. There some discussion in Issue 7382 and links from there.)

Is there a way to do a stratified train/test split without shuffling the data?

I'm using time sensitive data and would like to maintain the order of the data but stratifying the data since I've got multiple labels. I haven't found any libraries that allow this.

Hi Juanro could you provide with an example what are you trying to do as it might help to understand the problem in a better way
Thanks:

Please refer to the train_test_split documentation.
You can do somthing like this:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=0,
stratify=y)
The stratify = y will give a stratified split with same proportions of class labels as the input dataset.

MLPRegressor gives very negative scores

I'm kind of new to machine learning and I am using MLPRegressor. I split my data with
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
then I make and fit the model, using 10-fold validation for test set.
nn = MLPRegressor(hidden_layer_sizes=(100, 100), activation='relu',
solver='lbfgs', max_iter=500)
nn.fit(X_train, y_train)
TrainScore = nn.score(X_train, y_train)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
print("Cross-validation scores:\t{} ".format(cross_val_score(nn, X_test, y_test, cv=kfold)))
av_corss_val_score = np.mean(cross_val_score(nn, X_test, y_test, cv=kfold))
print("The average cross validation score is: {}".format(av_corss_val_score))
The problem is that the test scores I receive are very negative (-4256). What could be possible be wrong?

To keep syntax the same, sklearn maximizes every metric, whether classification accuracy or regression MSE. Therefore, the objective function is defined in a way that a more positive number is good and more negative number is bad. Hence, a less negative MSE is preferred.
Moving on to why it may be so negative in your case, it could be broadly due to two things: overfitting or underfitting. There are tonnes of resources out there to help you from this point forward.

Difference between cross_val_score and another way of calculating accuracy

I tried to calculate the accuracy and was puzzled by the fact that cross_val_score gives a rather low result, than by comparing the predicted results with the correct.
First way of counting, that gives
[0.8033333333333333, 0.7908333333333334, 0.8033333333333333, 0.7925,0.8066666666666666]
kf = KFold(shuffle=True, n_splits=5)
scores = []
for train_index, test_index in kf.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
scores.append(np.sum(y_pred == y_test) / len(y_test))
Second way gives array([0.46166667, 0.53583333, 0.40916667, 0.44666667, 0.3775 ]):
model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
cross_val_score(model, X, y, cv = 5, scoring='accuracy')
What's my mistake?

cross_val_score will use a StratifiedKFold cv iterator when not specified otherwise. A StratifiedKFold will keep the ratio of classes balanced the same way in train and test split. For more explanation, see my other answer here:-
https://stackoverflow.com/a/48314533/3374996
On the other hand, in your first approach you are using KFold which will not keep the balance of classes. In addition you are doing shuffling of data in that.
So in each fold, there is data difference in your two approaches and hence different results.

The low score in cross_val_score is probably because of the fact that you are providing the complete data to it, instead of breaking it into test and training set. This generally leads to leakage of information which results in your model giving incorrect predictions. See this post for more explanation.
References
Learn the right way to validate models

Properly declaring input_shape for neural network in Keras?

I am attempting to write code to identify data types after loading it in from CSV files. So there are 5 possible labels, and the feature vector contains a list of lists. The feature vector is a list of lists with the following shape:
[slash_count, dash_count, colon_count, letters, dot_count, digits]
I then split my feature and label vectors into training, testing, and validation sets. I found some code on Stackoverflow that someone wrote to do this and I have used the same:
X_train, X_test, y_train, y_test = train_test_split(ml_list, labels, test_size=0.3, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=1)
After doing this, I normalize the features in scale [0,1], and then I create the categorical variables for the labels:
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.fit_transform(X_test)
X_val_minmax = min_max_scaler.fit_transform(X_val)
from keras.utils import to_categorical
y_train_minmax = to_categorical(y_train)
y_test_minmax = to_categorical(y_test)
y_val_minmax = to_categorical(y_val)
Next, I attempt to find the shape of the newly recoded variables:
print(y_train_minmax.shape) #(91366, 4)
print(X_train_minmax.shape) #(91366, 6)
print(X_test_minmax.shape) #(55939, 6)
print(X_val_minmax.shape) #(39157, 6)
print(y_train_minmax.shape) #(91366, 4)
print(y_test_minmax.shape) #(55939, 4)
print(y_val_minmax.shape) #(39157, 4)
Finally, I build the model and attempt to fit it:
model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(91366, 6)))
model.add(layers.Dense(3, activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train_minmax, y_train_minmax, epochs=5, batch_size=128)
I get this message when I run the code:
ValueError: Error when checking input: expected dense_1_input to have 3 dimensions, but got array with shape (91366, 6)
I believe that the error is in when I create the neural network with the input shape. I am having a hard time understanding where I went wrong. Any help would be great!

You should change this line:
model.add(layers.Dense(512, activation='relu', input_shape=(6,)))
In keras you don't need to directly specify the number of examples you have in your dataset. As input_shape you need to provide only a shape of a single data point.
Another potential error which I spotted in your code snippet is that you should set:
model.add(layers.Dense(4, activation='softmax'))
As your output single data point has a shape of (4,). It's not consistent with what you've said about possible layers so I'd also advise rechecking your data.
Another possible mistake which I spotted is that you are not training separate scalers for train, test and valid datasets - but a single one on a train set - and then scale your other dataset using a trained scaler.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

scikit learn train_test_split function not working as expected - machine-learning

The train_test_split function will shuffle your data by default. If you don't want this, use shuffle=False. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html If possible, provide your input data (scrambled or not) to reproduce the problem.

Related

sklearn cross valid / cross predict

Is there a way to do a stratified train/test split without shuffling the data?

MLPRegressor gives very negative scores

Difference between cross_val_score and another way of calculating accuracy

Properly declaring input_shape for neural network in Keras?

Categories

Resources