I've been working on a CNN over several hundred GBs of images. I've created a training function that bites off 4 GB chunks of these images and calls fit over each of these pieces. I'm worried that I'm only training on the last piece and not the entire dataset.
Effectively, my pseudo-code looks like this:
DS = lazy_load_400GB_Dataset()
for section in DS:
    X_train = section.images
    Y_train = section.classes
    model.fit(X_train, Y_train, batch_size=16, nb_epoch=30)
I know that the API and the Keras forums say that this will train over the entire dataset, but I can't intuitively understand why the network wouldn't relearn over just the last training chunk.
Some help understanding this would be much appreciated.
Best,
Joe
This question was raised at the Keras GitHub repository in Issue #4446: Quick Question: can a model be fit for multiple times? It was closed by François Chollet with the following statement:
Yes, successive calls to fit will incrementally train the model.
So, yes, you can call fit multiple times.
For datasets that do not fit into memory, there is an answer in the Keras Documentation FAQ section
You can do batch training using model.train_on_batch(X, y) and model.test_on_batch(X, y). See the models documentation.
Alternatively, you can write a generator that yields batches of training data and use the method model.fit_generator(data_generator, samples_per_epoch, nb_epoch).
You can see batch training in action in our CIFAR10 example.
So if you want to iterate your dataset the way you are doing, you should probably use model.train_on_batch and take care of the batch sizes and iteration yourself.
One more thing to note: make sure the samples you train your model on are shuffled after each epoch. The way you have written the example code, the dataset does not appear to be shuffled. You can read a bit more about shuffling here and here
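For concreteness, a rough sketch of that manual loop might look like the following. It assumes the lazy loader, the section attributes, and the model from the question's pseudo-code, and that each section yields NumPy arrays:

import numpy as np

nb_epoch = 30
batch_size = 16

for epoch in range(nb_epoch):
    # iterate over the 4 GB chunks; each train_on_batch call updates the same model incrementally
    for section in lazy_load_400GB_Dataset():
        X_train = section.images
        Y_train = section.classes
        # shuffle the samples within this chunk for the current epoch
        idx = np.random.permutation(len(X_train))
        X_train, Y_train = X_train[idx], Y_train[idx]
        for start in range(0, len(X_train), batch_size):
            model.train_on_batch(X_train[start:start + batch_size],
                                 Y_train[start:start + batch_size])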
I have a binary classification problem I'm trying to tackle in Keras. To start, I was following the usual MNIST example, using softmax as the activation function in my output layer.
However, in my problem, the 2 classes are highly unbalanced (one appears ~10 times more often than the other). And what's even more critical, they are non-symmetrical in the way they may be mistaken.
Mistaking an A for a B is way less severe than mistaking a B for an A. Just like a caveman trying to classify animals into pets and predators: mistaking a pet for a predator is no big deal, but the other way round will be lethal.
So my question is: how would I model something like this with Keras?
thanks a lot
A non-exhaustive list of things you could do:
Generate a balanced data set using data augmentation. If the data are images, you can add image augmentations in a custom data generator that will output balanced amounts of data from each class per batch and save the results to a new data set. If the data are tabular, you can use a library like imbalanced-learn to perform over/under sampling.
As @Daniel said, you can use class_weight during training (in the fit method) so that mistakes on the important class are penalized more. See this tutorial: Classification on imbalanced data. The same idea can be implemented with a custom loss function, with or without class_weight during training; a minimal sketch follows below.
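For illustration, a minimal sketch of the class_weight idea, assuming the rare-but-critical class is encoded as label 1; the exact weight values are placeholders you would tune:

# penalize mistakes on class 1 roughly ten times more than on class 0 (values are illustrative)
class_weight = {0: 1.0, 1: 10.0}

model.fit(X_train, Y_train,
          batch_size=32,
          epochs=10,
          class_weight=class_weight)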
I have a training dataset with images that looks like this:
x=[image1,image2...imageN]
and an output dataset that looks like this:
y=[output1,output2...]
I don't understand how model.fit works with regard to processing the images. Meaning, if I choose shuffle=False, will the model take the first image, go through the whole feedforward, backprop, etc., compare it to output1, and then move on to the second image, and so on?
Or does the model randomly select images from my dataset?
If you specify shuffle=True, the generator will shuffle the dataset before each epoch. It will then go through the shuffled dataset one batch at a time; if it reaches the end before the next epoch, it will go back to the start.
If you specify shuffle=False, it will go through the dataset in the same order every epoch.
I believe a similar question is asked here.
shuffle in the model.fit of keras
As far as I know, your thought process is correct to a certain extent. The model takes a random image from the dataset, along with the associated output for that index, and then trains on it. This is quite similar to using a random number to select an image from the batch, training on it by comparing with the output, and then marking it as trained to avoid retraining on the same example.
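To make the per-epoch behaviour concrete, here is a rough, purely illustrative sketch of what shuffle=True amounts to conceptually (this is not the literal Keras implementation):

import numpy as np

x = np.random.rand(100, 32, 32, 3)            # stand-ins for the real image and output arrays
y = np.random.randint(0, 2, size=100)
batch_size, num_epochs = 16, 5

indices = np.arange(len(x))
for epoch in range(num_epochs):
    np.random.shuffle(indices)                # this step is skipped when shuffle=False
    for start in range(0, len(x), batch_size):
        batch = indices[start:start + batch_size]
        x_batch, y_batch = x[batch], y[batch]
        # ... one forward pass, loss computation and backprop on (x_batch, y_batch) ...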
I am using cross_val_score function with LeaveOneOut function as my data has 60 samples.
I am confused on how cross_val_score computes the results for each estimation in Leave One Out cross validation (LOOCV).
In the LOOCV, for one instance, it fits, let's say Decision Trees Classifier (DTC), model using 59 samples for training and predicts the single remaining one.
Then the main question is this:
Does it fit a new model at each instance (namely 60 different fits) inside cross_val_score?
If so, things get confusing.
Then I can have an average accuracy score (out of 60) for performance evaluation. But I need to come up with the best DTC model in general, not just for my own data, though it is based on my data.
If I use the entire data, then it fits perfectly, but that model simply over-fits.
I want to have a single DTC model that works best in general based on my data.
Here is my code, if that makes sense:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

model = DecisionTreeClassifier(random_state=27, criterion='gini', max_depth=4, max_features='auto')
loocv = LeaveOneOut()
results = cross_val_score(model, X, y, cv=loocv)
I do not fully understand what you want to find out.
Does it fit a new model at each instance (namely 60 different fits) inside cross_val_score?
Yes, it does in your case. What is the follow-up question that would help clarify the confusion you have in that case?
The idea of CV is that one gets a performance estimate of the model-building procedure you have chosen. The final model can (and, to benefit most from the data, should) be built on the full dataset. Then you can use it to predict on test data, and you can use your cross_val_score outcome to get an estimate of performance for this model. See a more elaborate answer, as well as very useful links, in my earlier answer.
My answer applies to a larger dataset. There might be nuances related to small-dataset treatment that I'm not aware of, but I do not see why the logic would not generalise to this case.
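As a concrete, hedged sketch of that workflow, assuming X and y are the 60-sample dataset from the question:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

model = DecisionTreeClassifier(random_state=27, criterion='gini', max_depth=4)

# LOOCV estimates how well this model-building procedure performs
results = cross_val_score(model, X, y, cv=LeaveOneOut())
print("Estimated accuracy: %.3f" % results.mean())

# the final model you actually keep is then fit on the full dataset
final_model = DecisionTreeClassifier(random_state=27, criterion='gini', max_depth=4)
final_model.fit(X, y)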
I have about a 30%/70% split between class 0 (minority class) and class 1 (majority class). Since I do not have a lot of data, I am planning to oversample the minority class to balance out the classes to become a 50-50 split. I was wondering if oversampling should be done before or after splitting my data into train and test sets. I have generally seen it done before splitting in online examples, like this:
df_class0 = train[train.predict_var == 0]
df_class1 = train[train.predict_var == 1]
df_class1_over = df_class1.sample(len(df_class0), replace=True)
df_over = pd.concat([df_class0, df_class1_over], axis=0)
However, wouldn't that mean that the test data will likely have duplicated samples from the training set (because we have oversampled the training set)? This means that testing performance wouldn't necessarily be on new, unseen data. I am fine doing this, but I would like to know what is considered good practice. Thank you!
I was wondering if oversampling should be done before or after splitting my data into train and test sets.
It should certainly be done after splitting, i.e. it should be applied only to your training set, and not to your validation and test ones; see also my related answer here.
I have generally seen it done before splitting in online examples, like this
From the code snippet you show, it is not at all obvious that it is done before splitting, as you claim. It depends on what exactly the train variable is here: if it is the product of a train-test split, then the oversampling takes place after splitting indeed, as it should be.
However, wouldn't that mean that the test data will likely have duplicated samples from the training set (because we have oversampled the training set)? This means that testing performance wouldn't necessarily be on new, unseen data.
Exactly, this is the reason why the oversampling should be done after splitting to train-test, and not before.
(I once witnessed a case where the modeller was struggling to understand why he was getting a ~ 100% test accuracy, much higher than his training one; turned out his initial dataset was full of duplicates -no class imbalance here, but the idea is similar- and several of these duplicates naturally ended up in his test set after the split, without of course being new or unseen data...).
I am fine doing this
You shouldn't :)
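For reference, a minimal sketch of the correct order (split first, then oversample only the training portion), assuming a DataFrame df with the predict_var column from the question:

import pandas as pd
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, stratify=df.predict_var, random_state=42)

# oversample the minority class (class 0 here) within the training set only
df_class0 = train[train.predict_var == 0]
df_class1 = train[train.predict_var == 1]
df_class0_over = df_class0.sample(len(df_class1), replace=True)
train_over = pd.concat([df_class0_over, df_class1], axis=0)

# `test` is left untouched, so it contains no duplicates of training samples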
In my experience this is bad practice. As you mentioned, the test data should contain unseen samples, so that you do not overfit and you get a better evaluation of the training process. If you need to increase sample sizes, think about data transformation possibilities.
E.g. in human/cat image classification: since the images are symmetric, you can double the sample size by mirroring them.
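As a small, hedged sketch of the mirroring idea (assuming images is an (N, H, W, C) array with matching labels):

import numpy as np

images = np.random.rand(10, 64, 64, 3)        # placeholder data
labels = np.random.randint(0, 2, size=10)

mirrored = images[:, :, ::-1, :]              # flip each image along the width axis
images_aug = np.concatenate([images, mirrored], axis=0)
labels_aug = np.concatenate([labels, labels], axis=0)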
I am currently trying to use satellite imagery to recognize apple orchards, and I am facing a small problem with the amount of representative data for each class.
In fact, my question is:
Is it possible to randomly pick a different subset of images from my "not-apples" class at each epoch? I have many more of these (compared to the "apples" class), and I want to increase the probability that my network will correctly classify an unrepresentative image.
Thanks in advance for your help
That is not possible in Keras. Keras will, by default, shuffle your training data and then train on it in a mini-batch fashion. However, there are still ways to re-balance your dataset.
The imbalanced training data problem that you are facing is pretty common. You have many options available to you; I list a few below:
You can adjust the relative weights of your classes using class_weight keyword of the model.fit() function.
You can "up-sample" your "apples" class or "down-sample" your "non-apples" class to have equal numbers of both classes during training.
You can generate synthetic images of your "apples" class to augment your data set. To this end, the ImageDataGenerator class in Keras can be particularly useful. This Keras tutorial is a good introduction to its usage (a rough sketch is also included at the end of this answer).
In my experience, I've found #2 and #3 to be the most useful. #1 is limited by the fact that the convergence of stochastic gradient descent suffers when class weights differ by a couple of orders of magnitude and batch sizes are small.
Jason Brownlee has put together a list of tactics for dealing with imbalanced classes that might also be useful to you.
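As a rough sketch of option #3 above, using ImageDataGenerator to create additional synthetic "apples" images (apple_images is a hypothetical (N, H, W, C) array of the existing minority-class samples; the target count is illustrative):

import numpy as np
from keras.preprocessing.image import ImageDataGenerator

apple_images = np.random.rand(100, 64, 64, 3)          # placeholder for the real "apples" data

datagen = ImageDataGenerator(rotation_range=20,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

# draw augmented batches until the minority class is roughly balanced
augmented = []
for batch in datagen.flow(apple_images, batch_size=32, shuffle=True):
    augmented.append(batch)
    if len(augmented) * 32 >= 500:
        break
augmented_apples = np.concatenate(augmented, axis=0)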