train_test_split - without random, with original order - machine-learning

I want to use train_test_split(X, y, test_size = 0.2), but I don't want the data to be shuffled: I want the first 80% of the data to be the training set and the last 20% to be the test set. Can it be done?

I thought train_test_split would still randomize the data even with the initial shuffle turned off, but it does not. This can be solved with a simple shuffle=False argument (random_state is not needed, because nothing is shuffled):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
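A quick way to confirm the behaviour is to split a small ordered array and look at what comes back. The sketch below uses made-up data (X and y here are illustrative ranges, not the asker's dataset) and assumes scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 10 ordered samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# shuffle=False keeps the original order: first 80% train, last 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(y_train)  # [0 1 2 3 4 5 6 7]
print(y_test)   # [8 9]

# Plain slicing gives the same split for this data.
split = int(len(X) * 0.8)
same_as_slicing = bool((X_train == X[:split]).all() and (X_test == X[split:]).all())
print(same_as_slicing)  # True
```

For ordered data such as time series, this keeps every test sample later than every training sample.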

Related

Errors with LSTM input shapes with time series data

I'm trying to predict torque from 8 features with an LSTM layer in my neural network. I'm having trouble with the input shape and have looked around on many sites for a solution. I'm quite new to machine learning and am having trouble understanding the problem and how I can fix this. Here is my code, dataset, and error message.
file = r'/content/drive/MyDrive/only_force_pt1.csv'
df = pd.read_csv(file)
X = df.iloc[:, 1:9]
y = df.iloc[:,9]
print(X)
print(y)
df.head()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, shuffle = True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, shuffle = True)
[verbose, epochs, batch_size] = [1, 200, 32]
input_shape = (X_train.shape[0],X_train.shape[1])
model = Sequential()
# LSTM
model.add(LSTM(64, input_shape=input_shape, return_sequences = True))
model.add(Dense(32, activation='relu', kernel_regularizer=keras.regularizers.l2(0.001)))
#model.add(Dropout(0.2))
#model.add(Dense(32, activation='relu', kernel_regularizer=keras.regularizers.l2(0.001)))
model.add(Dense(1,activation='relu'))
earlystopper = EarlyStopping(monitor='val_loss', min_delta=0, patience = 20, verbose =1, mode = 'auto')
model.summary()
model.compile(loss = 'mse', optimizer = Adam(learning_rate = 0.001), metrics=[tf.keras.metrics.RootMeanSquaredError()])
history = model.fit(X_train, y_train, batch_size = batch_size, epochs = epochs, verbose = verbose, validation_data=(X_val,y_val), callbacks = [earlystopper])
ValueError: Input 0 of layer "sequential_17" is incompatible with the layer: expected shape=(None, 3634, 8), found shape=(None, 8)
dataset: https://drive.google.com/drive/folders/1BQOXffFYioCiPug2VcBZEZVD-u3y9bcl?usp=sharing
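As I understand your problem, you are passing the number of data points as an extra dimension in the input shape of the LSTM layer. Your data has 8 features, and 3634 (= X_train.shape[0]) is the number of data points. That number belongs to the first (batch) dimension of the input tensor, shown as None, and should not be part of input_shape, which describes a single sample.
Note also that an LSTM layer expects every sample to be a sequence of shape (timesteps, features), so the input must be three-dimensional. If each row is treated as a one-step sequence, reshape the data and declare a single timestep:
X_train_seq = X_train.to_numpy().reshape(-1, 1, X_train.shape[1])
X_val_seq = X_val.to_numpy().reshape(-1, 1, X_val.shape[1])
input_shape = (1, X_train.shape[1])
Pass the reshaped arrays to fit, and the shape error should go away.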
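Keras recurrent layers take 3-D input of shape (samples, timesteps, features). Below is a minimal NumPy sketch of how 2-D tabular data could be stacked into such windows. The helper make_windows and its look_back parameter are hypothetical names chosen for illustration, and the random array only stands in for the CSV from the question.

```python
import numpy as np

def make_windows(X, y, look_back):
    """Stack `look_back` consecutive rows into one sample.

    Returns X_seq with shape (n_rows - look_back + 1, look_back, n_features)
    and y_seq holding the target aligned with the last row of each window.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    n_windows = len(X) - look_back + 1
    X_seq = np.stack([X[i:i + look_back] for i in range(n_windows)])
    y_seq = y[look_back - 1:]
    return X_seq, y_seq

# Stand-in data: 100 rows, 8 features (the question's CSV is not reproduced).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.normal(size=100)

X_seq, y_seq = make_windows(X, y, look_back=5)
print(X_seq.shape, y_seq.shape)  # (96, 5, 8) (96,)

# With look_back=1 each row becomes its own one-step sequence.
X_one, y_one = make_windows(X, y, look_back=1)
print(X_one.shape)  # (100, 1, 8)
```

The matching layer argument would then be input_shape=(look_back, 8); the batch dimension is never part of input_shape. For time series, the windows should be built before any train/test split so that each window contains consecutive rows.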

What is the use of look_back in LSTM?

X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.5, random_state=0)
model = Sequential()
look_back=1

How to predict new values when the model was trained on data standardized with StandardScaler

I'm working on a machine learning model and have a dataframe with the data.
I standardize the data (zero mean, unit variance):
scaler = StandardScaler()
df = scaler.fit_transform(df)
I divide the dataset into target and features:
X_df = df[X_characteristics_list]
y_df = df[target]
I split into train and test, then I train the model:
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size = 0.25)
forest = RandomForestRegressor()
forest.fit(X_train, y_train)
I predict on the test set to validate the model:
y_test_pred = forest.predict(X_test)
mse = mean_squared_error(y_test, y_test_pred)
But when it is time to use the model in real life, I need it to be ready to predict.
Say I want to predict just one record, for example [100, 20, 34].
I can't, because the record needs to be standardized first, and transforming it with a new StandardScaler does not work: the scaling depends on the mean and standard deviation, so I would need the original dataset.
What's the best way to solve this problem?
See below:
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.preprocessing import StandardScaler
# Create our input and output matrices
>>> X, y = make_classification()
# Split train-test... "test" will be production/unobserved/"real-life" data
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
# What does X_train look like?
>>> X_train
array([[-0.08930702, -2.71113991, -0.93849926, ..., 0.21650905,
0.68952722, 0.61365789],
[-0.31143977, -1.87817904, 0.08287492, ..., -0.41332943,
-0.58967179, 1.7239411 ],
[-1.62287589, 1.10691318, -0.630556 , ..., -0.35060008,
1.11270562, 0.08106694],
...,
[-0.59797041, 0.90218081, 0.89983074, ..., -0.54374315,
1.18534841, -0.03397969],
[-1.2006559 , 1.01890955, -1.21617181, ..., 1.76263322,
1.38280423, -1.0192972 ],
[ 0.11883425, 1.42952643, -1.23647358, ..., 1.02509208,
-1.14308885, 0.72096531]])
# Let's scale it
>>> scaler = StandardScaler()
>>> X_train = scaler.fit_transform(X_train)
>>> X_train
array([[ 0.08867642, -1.97950269, -1.1214106 , ..., 0.22075623,
0.57844552, 0.46487917],
[-0.10736984, -1.34896243, 0.00808597, ..., -0.37670234,
-0.6045418 , 1.57819736],
[-1.26479555, 0.91071257, -0.78086855, ..., -0.3171979 ,
0.96979563, -0.06916763],
...,
[-0.36025134, 0.7557329 , 0.91152449, ..., -0.50041152,
1.03697478, -0.18452874],
[-0.89215959, 0.84409499, -1.42847749, ..., 1.68739437,
1.21957946, -1.17253964],
[ 0.27237431, 1.15492649, -1.4509284 , ..., 0.98777012,
-1.116335 , 0.57247992]])
# Fit the model
>>> model = LogisticRegression()
>>> model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
# Now let's use the already-fitted StandardScaler object to simply transform
# *not fit_transform* the test data
>>> X_test = scaler.transform(X_test)
>>> model.predict(X_test)
array([1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0,
0, 0, 0])
Note that using joblib or pickle you can save the scaler object and re-load it for scaling in "real-time" later on.
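The persistence step mentioned in the note could look roughly like the sketch below. It is only an illustration under stated assumptions: the training data is randomly generated rather than the asker's dataframe, the file names scaler.joblib and model.joblib are made up, and the scaler is fitted on the features only so that the target stays in its original units.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Stand-in training data: 200 rows, 3 features (not the asker's dataframe).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[100.0, 20.0, 30.0], scale=[10.0, 5.0, 8.0], size=(200, 3))
y_train = X_train @ np.array([0.5, -1.0, 2.0])

# Fit the scaler on the training features, then train on the scaled features.
scaler = StandardScaler().fit(X_train)
model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(scaler.transform(X_train), y_train)

# Persist both objects; the scaler keeps the training mean_ and scale_.
out_dir = tempfile.mkdtemp()
joblib.dump(scaler, os.path.join(out_dir, "scaler.joblib"))
joblib.dump(model, os.path.join(out_dir, "model.joblib"))

# Later, in production: reload both and score a single new record.
scaler_prod = joblib.load(os.path.join(out_dir, "scaler.joblib"))
model_prod = joblib.load(os.path.join(out_dir, "model.joblib"))

record = np.array([[100, 20, 34]])  # 2-D: one row, three features
prediction = model_prod.predict(scaler_prod.transform(record))
print(prediction.shape)  # (1,)
```

Because the reloaded scaler carries the statistics learned from the training set, the original dataset is not needed at prediction time.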

Very low performance even after oversampling dataset

I'm using an MLPClassifier for classification of heart diseases. I used imblearn.SMOTE to balance the number of samples in each class. I was getting very good results (85% balanced accuracy), but I was advised not to use SMOTE on the test data, only on the training data. After I made this change, the performance of my classifier dropped sharply (~35% balanced accuracy), and I don't know what is wrong.
Here is a simple benchmark with training data balanced but test data unbalanced:
And this is the code:
def makeOverSamplesSMOTE(X,y):
    from imblearn.over_sampling import SMOTE
    sm = SMOTE(sampling_strategy='all')
    X, y = sm.fit_sample(X, y)
    return X,y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)
## Normalize data
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.fit_transform(X_test)
## SMOTE only on training data
X_train, y_train = makeOverSamplesSMOTE(X_train, y_train)
clf = MLPClassifier(hidden_layer_sizes=(20),verbose=10,
learning_rate_init=0.5, max_iter=2000,
activation='logistic', solver='sgd', shuffle=True, random_state=30)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
I'd like to know what I'm doing wrong, since this seems to be the proper way of preparing the data.
The first mistake in your code is in the standardization step. StandardScaler should be fitted only once, on X_train. It should not be refitted on X_test; the already-fitted scaler should only transform it. So the corrected code is:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
def makeOverSamplesSMOTE(X, y):
    sm = SMOTE(sampling_strategy='all')
    # fit_resample is the current imbalanced-learn name for the older fit_sample
    X, y = sm.fit_resample(X, y)
    return X, y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)
## Normalize data: fit on the training set only, then reuse the fitted scaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
## SMOTE only on training data
X_train, y_train = makeOverSamplesSMOTE(X_train, y_train)
clf = MLPClassifier(hidden_layer_sizes=(20,), verbose=10,
                    learning_rate_init=0.5, max_iter=2000,
                    activation='logistic', solver='sgd', shuffle=True, random_state=30)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
For the machine learning model, try reducing the learning rate; 0.5 is too high. The default learning_rate_init for MLPClassifier in sklearn is 0.001. Try changing the activation function and the number of layers as well. Also, not every ML model works on every dataset, so you might need to look at your data and choose the model accordingly.
Hopefully you have already got better results for your model. I tried changing a few parameters and got an accuracy of 65%; when I changed to a 90:10 split, I got an accuracy of 70%.
But accuracy can mislead, so I also calculated the F1 score, which gives a better picture of the predictions.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.neural_network import MLPClassifier
# X_train_res, y_train_res are presumably the SMOTE-resampled training data
clf = MLPClassifier(hidden_layer_sizes=(1,), verbose=False,
                    learning_rate_init=0.001,
                    max_iter=2000,
                    activation='logistic', solver='sgd', shuffle=True, random_state=50)
clf.fit(X_train_res, y_train_res)
y_pred = clf.predict(X_test)
score = accuracy_score(y_test, y_pred)
print(score)
cr = classification_report(y_test, y_pred)
print(cr)
Accuracy = 0.65
classification report:
              precision    recall  f1-score   support
           0       0.82      0.97      0.89        33
           1       0.67      0.31      0.42        13
           2       0.00      0.00      0.00         6
           3       0.00      0.00      0.00         4
           4       0.29      0.80      0.42         5
   micro avg       0.66      0.66      0.66        61
   macro avg       0.35      0.42      0.35        61
weighted avg       0.61      0.66      0.61        61
confusion_matrix:
array([[32, 0, 0, 0, 1],
[ 4, 4, 2, 0, 3],
[ 1, 1, 0, 0, 4],
[ 1, 1, 0, 0, 2],
[ 1, 0, 0, 0, 4]], dtype=int64)
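To see why accuracy looks better than the per-class picture, the headline numbers can be recomputed by hand from the confusion matrix above. This is a small NumPy sketch, assuming the usual scikit-learn layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the report above (rows: true class, columns: predicted).
cm = np.array([
    [32, 0, 0, 0, 1],
    [4, 4, 2, 0, 3],
    [1, 1, 0, 0, 4],
    [1, 1, 0, 0, 2],
    [1, 0, 0, 0, 4],
])

true_pos = np.diag(cm)
support = cm.sum(axis=1)    # samples per true class
predicted = cm.sum(axis=0)  # samples per predicted class

accuracy = true_pos.sum() / cm.sum()
recall = true_pos / support
# Classes that were never predicted get precision 0 instead of a division by zero.
precision = np.divide(
    true_pos, predicted, out=np.zeros(len(cm)), where=predicted > 0
)
balanced_accuracy = recall.mean()  # mean of per-class recall

print(round(accuracy, 2))           # 0.66
print(np.round(recall, 2))
print(round(balanced_accuracy, 2))  # 0.42
```

Accuracy is 40/61, about 0.66, and is dominated by class 0, which holds 33 of the 61 test samples. Balanced accuracy, the mean of the per-class recalls, is only about 0.42 because classes 2 and 3 are never predicted correctly.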

What could be the best way to make your SVM faster and reliable?

I'm new to data mining. I have implemented my linear SVM as follows.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
#print(X_train.shape, y_train.shape)
#print(X_test.shape, y_test.shape)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=10)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
scores = ['precision', 'recall']
svr = svm.SVC(C=1)
for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    clf = GridSearchCV(svr, tuned_parameters, cv=10, scoring='%s_macro' % score)
    clf.fit(X_train, y_train)
    print("best parameters %s" % clf.best_params_)
My dataset is very large, so what should I do to make my linear SVM run faster?
Do parameter tuning only on a sample.
Once you have found good parameters, then use the entire data set.
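A rough sketch of that two-step approach is shown below, on synthetic data. Note one substitution that is not in the original question: it uses LinearSVC instead of SVC(kernel='linear'). LinearSVC is a different, liblinear-based implementation that the scikit-learn documentation describes as scaling better to large numbers of samples. The 10% sample size and the C grid are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

# Stand-in for a large dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)

# 1) Draw a stratified 10% sample of the training data for tuning.
X_sample, _, y_sample, _ = train_test_split(
    X_train, y_train, train_size=0.1, random_state=0, stratify=y_train
)

# 2) Grid-search C on the sample only.
search = GridSearchCV(
    LinearSVC(dual=False, max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="f1_macro",
)
search.fit(X_sample, y_sample)
print("best parameters:", search.best_params_)

# 3) Refit once on the full training set with the chosen parameters.
final_clf = LinearSVC(dual=False, max_iter=5000, **search.best_params_)
final_clf.fit(X_train, y_train)
test_accuracy = final_clf.score(X_test, y_test)
print("test accuracy: %0.2f" % test_accuracy)
```

The expensive part, the grid search with cross-validation, only ever sees the small sample; the full data is fitted a single time at the end.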
