Reproducibility of PyTorch on multi-GPU

I've set the random seed to make my model reproducible and it works when I use a single GPU to train my model.
But it doesn't seem to work when I use nn.DataParallel() to train my model on two GPUs: the result is different every time.
So where is the problem?
The seed-setting function looks like this:

def set_seed(seed):
    random.seed(seed)                           # Python's built-in RNG
    np.random.seed(seed)                        # NumPy RNG
    torch.manual_seed(seed)                     # PyTorch CPU RNG
    torch.cuda.manual_seed(seed)                # current GPU
    torch.cuda.manual_seed_all(seed)            # all GPUs
    torch.backends.cudnn.benchmark = False      # disable non-deterministic autotuning
    torch.backends.cudnn.deterministic = True   # force deterministic cuDNN kernels

Related

Is TimeSeriesSplit necessary if I set shuffle to False in train_test_split?

Currently, I am using a regression model on a time-series. To assign train and test data, I generally use train_test_split from sklearn and it looks like this:
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X.to_numpy(), Y.to_numpy(), test_size=.30, shuffle=False)
Doing this, each variable is assigned an array with the relevant data. However, since it is time-series data, I would like the training set to only use values that come before the test values. I assumed setting shuffle to False did this, and it appears that way when I just look at the variables, but surely TimeSeriesSplit wouldn't exist if that were the case, right?
When I try to implement this, the documentation for TimeSeriesSplit is extremely confusing for a beginner. It seems to only take a vaguely defined n_splits parameter, and it doesn't have any other instructions on how to simply assign xtrain, xtest, ytrain, and ytest. So is it even necessary? Does it do anything that isn't done by setting shuffle to False?
The furthest I can get, even after exhausting all resources is:
time_series_split = TimeSeriesSplit()
The dataset is structured as follows:

date  | this month's value | next month's returns
01/22 | 30                 | -5
02/22 | 10                 | 5
03/22 | 50                 | -10
04/22 | 5                  | 1
Using TimeSeriesSplit() usually means cross-validation. If you just need to separate the test set, train_test_split(shuffle=False) is sufficient.
However, if you need several train/test subsets without future-data leakage, TimeSeriesSplit() simplifies things a lot, e.g.:
time_series_split = TimeSeriesSplit(n_splits=4)
gs = GridSearchCV(cv=time_series_split, ... # add parameters as needed)
gs.fit(X_train, y_train)
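For intuition, here is a minimal sketch (with made-up data) of what TimeSeriesSplit yields: each fold trains on an expanding window of past rows and tests on the rows immediately after it, so no future data leaks into training.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered samples
y = np.arange(10)

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "test:", test_idx)
# first fold: train [0 1], test [2 3]; each later fold grows the training window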

How to use pretrained weights of a model for initializing the weights in next iteration?

I have a model architecture. I have saved the entire model using torch.save() after some n iterations. I want to run another iteration of my code using the pre-trained weights of the model I saved previously.
Edit: I want the weight initialization for the new iteration to be done from the weights of the pretrained model.
Edit 2: Just to add, I don't plan to resume training. I intend to save the model and use it for a separate training run with the same parameters. Think of it like using a saved model (weights etc.) for a larger run with more samples (i.e. a completely new training job).
Right now, I do something like:
# default_lr = 5
# default_weight_decay = 0.001
# model_io = the pretrained model
model = torch.load(model_io)
optim = torch.optim.Adam(model.parameters(), lr=default_lr, weight_decay=default_weight_decay)
loss_new = BCELoss()
epochs = default_epoch
...
training_loop():
    ...
    outputs = model(input)
    ...
# similarly for the test loop
Am I missing something? I have to train for many epochs on a huge number of samples, so I cannot afford to wait for the results and then figure things out.
Thank you!
From the code that you have posted, I see that you are only loading the previous model parameters in order to restart your training from where you left off. This is not sufficient to restart your training correctly. Along with your model parameters (weights), you also need to save and load your optimizer state, especially when your choice of optimizer is Adam, which maintains running moment estimates (velocity-like terms) for all your weights that effectively adapt the learning rate.
In order to smoothly restart training, I would do the following:
# For saving your model
state = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict()
}
model_save_path = "Enter/your/model/path/here/model_name.pth"
torch.save(state, model_save_path)

# ------------------------------------------
# For loading your model
state = torch.load(model_save_path)
model = MyNetwork()
model.load_state_dict(state['model'])
optim = torch.optim.Adam(model.parameters(), lr=default_lr, weight_decay=default_weight_decay)
optim.load_state_dict(state['optimizer'])
Besides these, you may also want to save your learning rate if you are using a learning-rate decay strategy, your best validation accuracy so far (which you may want for checkpointing purposes), and any other changeable parameter that might affect your training. But in most cases, saving and loading just the model weights and optimizer state should be sufficient.
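For instance, a fuller checkpoint might look like this (a sketch; the scheduler, epoch, and best_val_acc names are illustrative assumptions, not part of the original answer):

state = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'scheduler': scheduler.state_dict(),  # assumed torch.optim.lr_scheduler instance
    'epoch': epoch,                       # where to resume epoch counting
    'best_val_acc': best_val_acc,         # best validation accuracy seen so far
}
torch.save(state, model_save_path)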
EDIT: You may also want to look at the following answer, which explains in detail how you should save your model in different scenarios.

How to "Iterate" on Computer Vision machine learning model?

I've created a model using Google Cloud's Vision API. I spent countless hours labeling data and trained a model. After almost 20 hours of "training", the model is still hit and miss.
How can I iterate on this model? I don't want to lose the "learning" it's done so far. It works about 3/5 times.
My best guess is that I should loop over the objects again, find where it's wrong, and label accordingly. But I'm not sure of the best method for that. Should I label all images where it "misses" as TEST data images? Are there best practices or resources I can read on this topic?
I'm by no means an expert, but here's what I'd suggest in order of most to least important:
1) Add more data if possible. More data is always a good thing and helps make your network's predictions more robust.
2) Add dropout layers to prevent over-fitting
3) Have a tinker with kernel and bias initialisers
4) [The most relevant answer to your question] Save the training weights of your model and reload them into a new model prior to training.
5) Change up the type of model architecture you're using. Then, have a tinker with epoch numbers, validation splits, loss evaluation formulas, etc.
Hope this helps!
EDIT: More information about number 4
So you can save and load your model weights during or after the model has trained. See here for some more in-depth information about saving.
Broadly, let's cover the basics. I'm assuming you're using Keras, but the same applies to tf:
Saving the model after training
Simply call:
model_json = model.to_json()
with open("{Your_Model}.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("{Your_Model}.h5")
print("Saved model to disk")
Loading the model
You can load the model structure from json like so:
from keras.models import model_from_json

json_file = open('{Your_Model}.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
model = model_from_json(loaded_model_json)
And load the weights if you want to:
model.load_weights('{Your_Weights}.h5', by_name=True)
Then compile the model and you're ready to retrain/predict. by_name for me was essential to re-load the weights back into the same model architecture; leaving this out may cause an error.
Checkpointing the model during training
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath={checkpoint_path},
                                                 save_weights_only=True,
                                                 verbose=1)

# Train the model with the new callback
model.fit(train_images,
          train_labels,
          epochs=10,
          validation_data=(test_images, test_labels),
          callbacks=[cp_callback])  # Pass callback to training
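To pick up from a checkpoint later (point 4 above), you can load the saved weights into a freshly built model before retraining. A minimal sketch, where build_model is an assumed helper that recreates the same architecture:

new_model = build_model()                # assumed helper that rebuilds the same architecture
new_model.load_weights(checkpoint_path)  # restore the checkpointed weights
new_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
new_model.fit(train_images, train_labels, epochs=10)  # training resumes from the saved weights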

Can we save a partially trained machine learning model, reload it, and train from the point it was saved?

Is there any way to save a partially trained scikit-learn machine learning model and reload it to continue training from the point where it was saved?
For scikit-learn models applied to, e.g., sentiment analysis, I would suspect you need to save two important things: 1) your model and 2) your vectorizer.
Remember that after training your model, your words are represented by a vector of length N, where N is your total number of words (the vocabulary size).
Below is a piece from my test model and test vectorizer, saved in order to be used later.
SAVING THE MODEL
import pickle
pickle.dump(vectorizer, open("model5vectorizer.pickle", "wb"))
pickle.dump(classifier_fitted, open("model5.pickle", "wb"))
LOADING THE MODEL IN A NEW SCRIPT (.py)
import pickle
model = pickle.load(open("model5.pickle", "rb"))
vectorizer = pickle.load(open("model5vectorizer.pickle", "rb"))
TEST YOUR MODEL
sentence_test = ["Results by Andutta et al (2013), were completely wrong and unrealistic."]
USING THE VECTORIZER (model5vectorizer.pickle) !!
sentence_test_data = vectorizer.transform(sentence_test)
print("### sentence_test ###")
print(sentence_test)
print("### sentence_test_data ###")
print(sentence_test_data)
# OBS-1: VECTOR HERE WILL HAVE SAME LENGTH AS BEFORE :)
# OBS-2: If you load the default vectorizer or a different one, then you may see the following problems
# sklearn.exceptions.NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.
# ValueError: X has 8 features per sample; expecting 11
result1 = model.predict(sentence_test_data) # using saved vectorizer from calibrated model
print("### RESULT ###")
print(result1)
Hope that helps.
Regards,
Andutta
When a dataset is fitted to a scikit-learn machine learning model, it is trained and supposedly ready to be used for prediction. If you train a model with, say, 100 samples, then go back and fit another 50 samples to it, you will not make it better; you will rebuild it, because fit() starts from scratch and discards the previous state.
If your purpose is to build a model that becomes more powerful as it interacts with more samples, you are thinking of a real-time condition, such as a mobile robot mapping an environment with a Kalman filter.
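To illustrate the difference, here is a minimal sketch using SGDClassifier and made-up data (my choice of estimator, not from the original answer); estimators that support incremental learning expose partial_fit:

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.rand(150, 4)
y = rng.randint(0, 2, 150)

clf = SGDClassifier(random_state=0)
clf.fit(X[:100], y[:100])   # trains on the first 100 samples
clf.fit(X[100:], y[100:])   # re-fits from scratch: the first 100 samples are forgotten

inc = SGDClassifier(random_state=0)
inc.partial_fit(X[:100], y[:100], classes=np.array([0, 1]))  # classes required on first call
inc.partial_fit(X[100:], y[100:])                            # continues from the previous state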

How to split documents into training set and test set?

I am trying to build a classification model. I have 1000 text documents in a local folder. I want to divide them into a training set and a test set with a split ratio of 70:30 (70 -> training, 30 -> test). What is the best approach to do so? I am using Python.
I wanted a programmatic approach: first, read the files in the local directory; second, build a list of those files and shuffle them; third, split them into a training set and a test set.
I tried a few ways using built-in Python keywords and functions, only to fail. Finally I got an idea of how to approach it. Cross-validation is also a good option to consider when building general classification models.
Not sure exactly what you're after, so I'll try to be comprehensive. There will be a few steps:
Get a list of the files
Randomize the files
Split files into training and testing sets
Do the thing
1. Get a list of the files
Let's assume that your files all have the extension .data and they're all in the folder /ml/data/. What we want to do is get a list of all of these files. This is done simply with the os module. I'm assuming you have no subdirectories; this would change if there were.
import os

def get_file_list_from_dir(datadir):
    all_files = os.listdir(os.path.abspath(datadir))
    data_files = list(filter(lambda file: file.endswith('.data'), all_files))
    return data_files
So if we were to call get_file_list_from_dir('/ml/data'), we would get back a list of all the .data files in that directory (equivalent in the shell to the glob /ml/data/*.data).
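(As an aside, not in the original answer: the glob module gives an equivalent one-liner.)

import glob
data_files = glob.glob('/ml/data/*.data')  # note: returns paths, not bare filenames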
2. Randomize the files
We don't want the sampling to be predictable, as that is considered a poor way to train an ML classifier.
from random import shuffle

def randomize_files(file_list):
    shuffle(file_list)
Note that random.shuffle performs an in-place shuffling, so it modifies the existing list. (Of course this function is rather silly since you could just call shuffle instead of randomize_files; you can write this into another function to make it make more sense.)
3. Split files into training and testing sets
I'll assume a 70:30 ratio instead of any specific number of documents. So:
from math import floor

def get_training_and_testing_sets(file_list):
    split = 0.7
    split_index = floor(len(file_list) * split)
    training = file_list[:split_index]
    testing = file_list[split_index:]
    return training, testing
4. Do the thing
This is the step where you open each file and do your training and testing. I'll leave this to you!
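Putting steps 1-3 together, the whole non-cross-validation pipeline is just (using the /ml/data folder assumed in step 1):

data_files = get_file_list_from_dir('/ml/data')
randomize_files(data_files)
training, testing = get_training_and_testing_sets(data_files)
# `training` now holds 70% of the filenames, `testing` the remaining 30%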
Cross-Validation
Out of curiosity, have you considered using cross-validation? This is a method of splitting your data so that you use every document for training and testing. You can customize how many documents are used for training in each "fold". I could go more into depth on this if you like, but I won't if you don't want to do it.
Edit: Alright, since you requested I will explain this a little bit more.
So we have a 1000-document set of data. The idea of cross-validation is that you can use all of it for both training and testing — just not at once. We split the dataset into what we call "folds". The number of folds determines the size of the training and testing sets at any given point in time.
Let's say we want a 10-fold cross-validation system. This means that the training and testing algorithms will run ten times. The first time will train on documents 1-100 and test on 101-1000. The second fold will train on 101-200 and test on 1-100 and 201-1000.
If we did, say, a 40-fold CV system, the first fold would train on documents 1-25 and test on 26-1000; the second fold would train on 26-50 and test on 1-25 and 51-1000; and so on.
To implement such a system, we would still need to do steps (1) and (2) from above, but step (3) would be different. Instead of splitting into just two sets (one for training, one for testing), we could turn the function into a generator — a function which we can iterate through like a list.
def cross_validate(data_files, folds):
    if len(data_files) % folds != 0:
        raise ValueError(
            "invalid number of folds ({}) for the number of "
            "documents ({})".format(folds, len(data_files))
        )
    fold_size = len(data_files) // folds
    for split_index in range(0, len(data_files), fold_size):
        training = data_files[split_index:split_index + fold_size]
        testing = data_files[:split_index] + data_files[split_index + fold_size:]
        yield training, testing
That yield keyword at the end is what makes this a generator. To use it, you would do something like this:
def ml_function(datadir, num_folds):
    data_files = get_file_list_from_dir(datadir)
    randomize_files(data_files)
    for train_set, test_set in cross_validate(data_files, num_folds):
        do_ml_training(train_set)
        do_ml_testing(test_set)
Again, it's up to you to implement the actual functionality of your ML system.
As a disclaimer, I'm no expert by any means, haha. But let me know if you have any questions about anything I've written here!
That's quite simple if you use numpy. First load the documents and make them a numpy array, and then:
import numpy as np

docs = np.array([
    'one', 'two', 'three', 'four', 'five',
    'six', 'seven', 'eight', 'nine', 'ten',
])

idx = np.hstack((np.ones(7), np.zeros(3)))  # generate indices (1 = train, 0 = test)
np.random.shuffle(idx)                      # shuffle so the train/test assignment is random
train = docs[idx == 1]
test = docs[idx == 0]
print(train)
print(test)
the result:
['one' 'two' 'three' 'six' 'eight' 'nine' 'ten']
['four' 'five' 'seven']
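The same masking idea generalises to any number of documents; a sketch assuming docs holds all 1000 filenames:

n_train = int(0.7 * len(docs))  # 70% for training
idx = np.hstack((np.ones(n_train), np.zeros(len(docs) - n_train)))
np.random.shuffle(idx)
train, test = docs[idx == 1], docs[idx == 0]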
Just make a list of the filenames using os.listdir(). Use random.shuffle() to shuffle the list, and then take training_files = filenames[:700] and testing_files = filenames[700:].
You can use train_test_split method provided by sklearn. See documentation here:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
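A minimal sketch of that approach, assuming filenames is the list of 1000 document names built with os.listdir():

from sklearn.model_selection import train_test_split

train_files, test_files = train_test_split(filenames, test_size=0.3, random_state=42)
# shuffle=True is the default; random_state makes the 70:30 split reproducible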
