K-fold cross validation with different results every time

All my models are initialized with the below:
def intiailize_clf_models(self):
    model = RandomForestClassifier(random_state=42)
    self.clf_models.append(model)

    model = ExtraTreesClassifier(random_state=42)
    self.clf_models.append(model)

    model = MLPClassifier(random_state=42)
    self.clf_models.append(model)

    model = LogisticRegression(random_state=42)
    self.clf_models.append(model)

    model = xgb.XGBClassifier(random_state=42)
    self.clf_models.append(model)

    model = lgb.LGBMClassifier(random_state=42)
    self.clf_models.append(model)
I then loop through the models and perform k-fold cross validation with:
def kfold_cross_validation(self):
    clf_models = self.get_models()
    models = []
    self.results = {}

    for model in clf_models:
        self.current_model_name = model.__class__.__name__
        cross_validate = cross_val_score(model, self.xtrain, self.ytrain, cv=4)
        self.mean_cross_validation_score = cross_validate.mean()
        print("Kfold cross validation for", self.current_model_name)
        self.results[self.current_model_name] = self.mean_cross_validation_score
        models.append(model)
Any time I run this cross validation I get a different result, even though I have set a random state on the different models. I would like to know why I get different results in cross validation and what can be done about it.

This is because you did not set the random_state of your k-fold generator. By default, when you pass an int value for cv, as in
cross_validate = cross_val_score(model, self.xtrain, self.ytrain, cv=4)
cross_val_score will call (Stratified)KFold using a different random state with every call, so the folds (and therefore the fitted models) differ between runs, leading to different results.
The relevant part from the source file.
cv: int, cross-validation generator or an iterable, default=None
Determines the cross-validation splitting strategy.
Possible inputs for cv are:
- None, to use the default 5-fold cross validation,
- int, to specify the number of folds in a `(Stratified)KFold`,
- :term:`CV splitter`,
- An iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and ``y`` is
either binary or multiclass, :class:`StratifiedKFold` is used. In all
other cases, :class:`KFold` is used.
To remedy this you can pass your own cross-validation generator with a controlled random state as stated in the documentation above. For example:
# (code untested)
from sklearn.model_selection import StratifiedKFold

# shuffle=True is required when a random_state is set
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
cross_validate = cross_val_score(model, self.xtrain, self.ytrain, cv=skf)

I found the solution to my question. Setting a global NumPy random seed as below solved the problem:
np.random.seed(22)
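For completeness, here is a minimal sketch of the fixed loop, assuming the same self.xtrain, self.ytrain and get_models() as above (it is not the author's exact code): creating the splitter once with a fixed random_state keeps the folds identical across runs.
from sklearn.model_selection import StratifiedKFold, cross_val_score

def kfold_cross_validation(self):
    # Seeded splitter: identical folds on every run
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
    self.results = {}
    for model in self.get_models():
        name = model.__class__.__name__
        scores = cross_val_score(model, self.xtrain, self.ytrain, cv=skf)
        print("Kfold cross validation for", name, "->", scores.mean())
        self.results[name] = scores.mean()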

Related

How to make a prediction with TFF?

My question is: how can I predict the label of an image with TensorFlow Federated?
After completing the evaluation of the model, I would like to predict the label of a given image, like we do in Keras:
# new instance where we do not know the answer
Xnew = array([[0.89337759, 0.65864154]])
# make a prediction
ynew = model.predict_classes(Xnew)
# show the inputs and predicted outputs
print("X=%s, Predicted=%s" % (Xnew[0], ynew[0]))
Output:
X=[0.89337759 0.65864154], Predicted=[0]
Here is how state and model_fn were created:
def model_fn():
    keras_model = create_compiled_keras_model()
    return tff.learning.from_compiled_keras_model(keras_model, sample_batch)

iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=1.0),
    client_weight_fn=None)
state = iterative_process.initialize()
I get this error:
list(self._name_to_index.keys())[:10]))
AttributeError: The tuple of length 2 does not have named field "assign_weights_to". Fields (up to first 10): ['trainable', 'non_trainable']
Thanks
(Requires TFF 0.16.0 or newer)
Since the code is building a tff.learning.Model from a tf.keras.Model you may be able to use the assign_weights_to method on the tff.learning.ModelWeights object (the type of state.model).
This method is used in the Federated Learning for Text Generation tutorial.
This might look something like (near the bottom, the early portions are an example FL training loop):
def create_keras_model() -> tf.keras.Model:
    ...

def model_fn():
    ...
    return tff.learning.from_keras_model(create_keras_model())

training_process = tff.learning.build_federated_averaging_process(model_fn, ...)
state = training_process.initialize()

for _ in range(NUM_ROUNDS):
    state, metrics = training_process.next(state, ...)

model_for_inference = create_keras_model()
state.model.assign_weights_to(model_for_inference)
Once the weights from state have been assigned back into the Keras model, the code can use the standard Keras APIs, such as tf.keras.Model.predict_on_batch:
predictions = model_for_inference.predict_on_batch(batch)
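A hypothetical sketch of that last step: the name test_images and its preprocessing are assumptions, not part of the original code, and the batch must already match the model's expected input shape.
import numpy as np

# Hypothetical inference sketch: `test_images` is an assumed NumPy array of
# preprocessed images shaped the way the Keras model expects.
batch = test_images[:32]                            # one batch of inputs
predictions = model_for_inference.predict_on_batch(batch)
predicted_labels = np.argmax(predictions, axis=-1)  # class index per image
print(predicted_labels)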

LSTM sequence prediction overfits on one specific value only

Hello guys, I am new to machine learning. I am implementing federated learning with an LSTM to predict the next label in a sequence. My sequences look like this: [2,3,5,1,4,2,5,7]; for example, the intention is to predict the 7 in this sequence. So I tried simple federated learning with Keras. I used this approach for another model (not an LSTM) and it worked for me, but here it always overfits on 2: it always predicts 2 for any input. I made the input data balanced, meaning there are almost equal numbers of each label in the last index (here it is 7). I tested this data on a plain deep learning model and it works well, so it seems to me that this data may not be suitable for an LSTM, or there is some other issue. Please help me. This is my code for my federated learning. Please let me know if more information is needed, I really need it. Thanks.
def get_lstm(units):
    """LSTM (Long Short-Term Memory)
    Build LSTM Model.
    # Arguments
        units: List(int), number of input, output and hidden units.
    # Returns
        model: Model, nn model.
    """
    inp = layers.Input((units[0], 1))
    x = layers.LSTM(units[1], return_sequences=True)(inp)
    x = layers.LSTM(units[2])(x)
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(units[3], activation='softmax')(x)
    model = Model(inp, out)
    return model

optimizer = keras.optimizers.Adam(lr=0.01)
seqLen = 8 - 1
global_model = Mymodel.get_lstm([seqLen, 64, 64, 15])  # 14 categories; labels start from 0 but class zero is never predicted
global_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
                     metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1)])
def main(argv):
    for comm_round in range(comms_round):
        print("round_%d" % comm_round)
        scaled_local_weight_list = list()
        global_weights = global_model.get_weights()

        np.random.shuffle(train)
        temp_data = train[:]

        # data divided among ten users and shuffled
        for user in range(10):
            user_data = temp_data[user * userDataSize: (user + 1) * userDataSize]
            X_train = user_data[:, 0:seqLen]
            X_train = np.asarray(X_train).astype(np.float32)
            Y_train = user_data[:, seqLen]
            Y_train = np.asarray(Y_train).astype(np.float32)

            local_model = Mymodel.get_lstm([seqLen, 64, 64, 15])
            X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
            local_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
                                metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1)])
            local_model.set_weights(global_weights)
            local_model.fit(X_train, Y_train)

            scaling_factor = 1 / 10  # 10 is the number of users
            scaled_weights = scale_model_weights(local_model.get_weights(), scaling_factor)
            scaled_local_weight_list.append(scaled_weights)
            K.clear_session()

        average_weights = sum_scaled_weights(scaled_local_weight_list)
        global_model.set_weights(average_weights)

    predictions = global_model.predict(X_test)
    for i in range(len(X_test)):
        print('%d,%d' % (np.argmax(predictions[i]), Y_test[i]), file=f2)
I found some reasons for my problem, so I thought I would share them with you:
1- The proportions of the different items in the sequences are not balanced. For example, I have 1000 occurrences of "2" and only 100 of each of the other numbers, so after a few rounds the model fits on 2 because there is much more data for that specific number.
2- I changed my sequences so that no two items in a sequence have the same value. That way I could remove some repetitive data from the sequences and make them more balanced. Maybe this is not the whole presentation of the activities, but in my case it makes sense.
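Here is a small, hypothetical sketch of how the imbalance mentioned above could be checked and compensated for; the names train and seqLen come from the code above, while the class-weight idea is an assumption layered on top of the author's fix, not part of it.
import numpy as np
from collections import Counter

# Inspect the label distribution (assumes `train` holds the label in column `seqLen`).
labels = train[:, seqLen].astype(int)
print(Counter(labels))                      # raw count per class

# Weight the loss by inverse class frequency (15 = number of output units above).
counts = np.bincount(labels, minlength=15)
class_weight = {c: len(labels) / (15 * n) for c, n in enumerate(counts) if n > 0}
# local_model.fit(X_train, Y_train, class_weight=class_weight)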

Integrate the ImageDataGenerator in own customized fit_generator

I want to fit a Siamese CNN with multiple inputs that are stored in memory and have no labels (just an arbitrary dummy label). Therefore, I had to write my own data generator for using the CNN model in Keras.
My data generator has the following form:
class DataGenerator(keras.utils.Sequence):
    def __init__(self, train_data, train_triplets, batch_size=32, dim=(128,128), n_channels=3, shuffle=True):
        self.dim = dim
        self.batch_size = batch_size
        # Added
        self.train_data = train_data
        self.train_triplets = train_triplets
        self.n_channels = n_channels
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        n_row = self.train_triplets.shape[0]
        return int(np.floor(n_row / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        # print(index)
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temp = self.train_triplets.iloc[indexes,]
        # Generate data
        [anchor, positive, negative] = self.__data_generation(list_IDs_temp)
        y_train = np.random.randint(2, size=(1, 2, self.batch_size)).T
        return [anchor, positive, negative], y_train

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        n_row = self.train_triplets.shape[0]
        self.indexes = np.arange(n_row)
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples'
        # anchor, positive and negative: (n_samples, *dim, n_channels)
        # Initialization
        anchor = np.zeros((self.batch_size, *self.dim, self.n_channels))
        positive = np.zeros((self.batch_size, *self.dim, self.n_channels))
        negative = np.zeros((self.batch_size, *self.dim, self.n_channels))
        nrow_temp = list_IDs_temp.shape[0]

        # Generate data
        for i in range(nrow_temp):
            list_ind = list_IDs_temp.iloc[i,]
            anchor[i] = self.train_data[list_ind[0]]
            positive[i] = self.train_data[list_ind[1]]
            negative[i] = self.train_data[list_ind[2]]

        return [anchor, positive, negative]
where train_data is a list of all images and train_triplets is a data frame containing the image indices used to build each input triplet of images.
Now I want to do some data augmentation for each mini-batch supplied to my CNN. I have tried to integrate Keras' ImageDataGenerator, but I couldn't implement it in my code. Is it somehow possible to do this? I am not very experienced with Python and would appreciate any help.
Does this article answer your question?
To put it in a nutshell, Keras' ImageDataGenerator lacks flexibility when it comes to personalized batch generators, and the easiest way to still use data augmentation is simply to switch to another data augmentation tool (like the albumentations library described in the previous article, though you could use imgaug as well).
I just want to warn you that I encountered several issues with albumentations (which I described in a question on GitHub that has so far received no answers), so maybe using imgaug is a better idea.
Hope this helps, good luck with your model!
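As a rough illustration of the suggestion above, a per-sample augmentation step could be slotted into __data_generation. This is only a sketch with an assumed albumentations pipeline; the name augment is illustrative and not part of the original code.
import albumentations as A

# Hypothetical augmentation pipeline applied to each image as it is copied into the batch.
augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
])

# Inside the loop of DataGenerator.__data_generation, instead of the direct assignment
#     anchor[i] = self.train_data[list_ind[0]]
# one would write
#     anchor[i] = augment(image=self.train_data[list_ind[0]])["image"]
# and likewise for positive and negative.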

What is the meaning of the GridSearchCV best_score_ attribute? (the value is different from the mean of the cross validation array)

I'm confused by the results; probably I'm not getting the concept of cross validation and GridSearchCV right. I followed the logic behind this post:
https://randomforests.wordpress.com/2014/02/02/basics-of-k-fold-cross-validation-and-gridsearchcv-in-scikit-learn/
argd = CommandLineParser(argv)
folder,fname=argd['dir'],argd['fname']
df = pd.read_csv('../../'+folder+'/Results/'+fname, sep=";")
explanatory_variable_columns = set(df.columns.values)
response_variable_column = df['A']
explanatory_variable_columns.remove('A')
y = np.array([1 if e else 0 for e in response_variable_column])
X =df[list(explanatory_variable_columns)].as_matrix()
kf_total = KFold(len(X), n_folds=5, indices=True, shuffle=True, random_state=4)
dt=DecisionTreeClassifier(criterion='entropy')
min_samples_split_range=[x for x in range(1,20)]
dtgs=GridSearchCV(estimator=dt, param_grid=dict(min_samples_split=min_samples_split_range), n_jobs=1)
scores=[dtgs.fit(X[train],y[train]).score(X[test],y[test]) for train, test in kf_total]
# SAME AS DOING: cross_validation.cross_val_score(dtgs, X, y, cv=kf_total, n_jobs = 1)
print scores
print np.mean(scores)
print dtgs.best_score_
RESULTS OBTAINED:
# score [0.81818181818181823, 0.78181818181818186, 0.7592592592592593, 0.7592592592592593, 0.72222222222222221]
# mean score 0.768
# .best_score_ 0.683486238532
ADDITIONAL NOTE:
I ran it using another combination of the explanatory variables (using only some of them) and I got the inverse problem. Now the .best_score_ is higher than all the values in the cross validation array.
# score [0.74545454545454548, 0.70909090909090911, 0.79629629629629628, 0.7407407407407407, 0.64814814814814814]
# mean score 0.728
# .best_score_ 0.802752293578
The code is mixing up several things.
dtgs.fit(X[train], y[train]) performs an internal 3-fold cross-validation for every parameter combination in param_grid, producing a grid of 19 results (one per min_samples_split value), which you can inspect via dtgs.grid_scores_.
The line
scores = [dtgs.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf_total]
therefore fits the grid search five times and scores each refitted best model on its held-out fold, so the result is the array of scores from the outer 5-fold cross validation.
When you call dtgs.best_score_ you get the best score from the internal 3-fold hyperparameter grid of the last of those five fits, which is why it differs from the mean of the outer scores.
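For context, here is a rough sketch of the same idea written against the current scikit-learn API (not the versions used in the question); X and y are assumed to be the arrays built above, and newer releases require min_samples_split >= 2. The outer cross_val_score evaluates the whole tuning procedure, while best_score_ only reflects the inner CV of a single fit.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Inner CV picks the hyperparameter; the outer CV measures how well the whole
# tuning procedure generalises.
inner_grid = GridSearchCV(
    estimator=DecisionTreeClassifier(criterion='entropy'),
    param_grid={'min_samples_split': list(range(2, 20))},
    cv=3,
)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=4)
outer_scores = cross_val_score(inner_grid, X, y, cv=outer_cv)
print(outer_scores, np.mean(outer_scores))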

Using partialPlot after fitting a Random Forest model in caret

After I fit a randomForest using the train() function, I'm having problems invoking partialPlot() and plotmo(). Here's some reproducible code:
library(AER)
library(caret)
data(Mortgage)
fitControl <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 10,
                           allowParallel = TRUE)
library(doMC)
registerDoMC(cores=10)
Final.rfModel <- train(form=networth ~ ., data=Mortgage, method = "rf", metric='RMSE', trControl = fitControl, tuneLength=10, importance = TRUE)
#### partial plots fail
partialPlot(Final.rfModel$finalModel, Mortgage, "liquid")
library(plotmo)
plotmo(Final.rfModel$finalModel)
There is some inconsistency between how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method will convert factor predictors to dummy variables because their models require numerical representations of the data. The exceptions to this are tree- and rule-based models (that can split on categorical predictors), naive Bayes, and a few others.
So randomForest will not create dummy variables when you use randomForest(y ~ ., data = dat), but train (and most others) will with a call like train(y ~ ., data = dat).
The error occurs because rate, married and a few other predictors are factors. The dummy variables created by train don't have the same names so partialPlot can't find them.
Using the non-formula method with train will pass the factor predictors to randomForest and everything will work.
TL;DR
Use the non-formula method with train in this case:
Final.rfModel <- train(x = Mortgage[, names(Mortgage) != "networth"],
                       y = Mortgage$networth,
                       method = "rf",
                       metric = 'RMSE',
                       trControl = fitControl,
                       tuneLength = 10,
                       importance = TRUE)
Max
