What if I train a classifier two times? [duplicate] - machine-learning

This question already has an answer here:
How can i train multiple times an SVM classifier from sklearn in Python?
If I train a classifier two times, like:
clf.fit(X,y)
clf.fit(X,y)
Will it overwrite the existing classifier or will it just train it one time?

Yes, clf will be fit with the last data you try to fit it with. See the answer here https://stackoverflow.com/a/28884168/9458191 for more information.

Whenever you call .fit(...) on a classifier, it retains only the new fit, essentially overwriting any previous training.
If you use an entirely different dataset, the resulting classifier will obviously differ from the one before the second .fit(...) call. If you use the same dataset, the classifier may or may not differ. Some classifiers are deterministic in training; in that case the two fits should be identical. Other classifiers are non-deterministic, and those can produce different results on the second training.
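A minimal sketch of that behaviour, assuming scikit-learn and using LogisticRegression purely as an example estimator (not from the original question):
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_a, y_a = rng.randn(100, 3), rng.randint(0, 2, 100)   # first toy dataset
X_b, y_b = rng.randn(100, 3), rng.randint(0, 2, 100)   # second, different toy dataset

clf = LogisticRegression()
clf.fit(X_a, y_a)
coef_after_a = clf.coef_.copy()
clf.fit(X_b, y_b)                            # this fit replaces the previous one
print(np.allclose(coef_after_a, clf.coef_))  # False: only the last fit is kept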

Related

Why do we store X_test to y_preds variable in Scikit learn? [closed]

I am currently working on a machine learning project with no prior hands-on experience of machine learning or Python. I have just encountered the following code online, but I don't understand why it is written this way.
Where is the trained data stored? Is it stored in X_train or X_test?
Why do we predict on X_test and store the result in the y_preds variable? Since we use y_preds, I was expecting something like this:
y_preds = clf.predict(y_test)
Code:
from sklearn.model_selection import train_test_split
# Using the train_test_split() function: define the test data size and store the split data in train/test variables
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Fitting the data into the training model defined above
clf.fit(X_train, y_train);
# Making predictions on the test data with the trained model
y_preds = clf.predict(X_test)
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.
Learning problems fall into a few categories:
A) supervised learning, in which the data comes with additional attributes that we want to predict (see the scikit-learn supervised learning page). This problem can be either:
classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem would be handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning where one has a limited number of categories and for each of the n samples provided, one is to try to label them with the correct category or class.
regression: if the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.
B) unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data, which is called clustering; to determine the distribution of data within the input space, known as density estimation; or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization (see the scikit-learn unsupervised learning page).
Basically, machine learning is about learning some properties of a data set and then testing those properties against another data set. A common practice in machine learning is to evaluate an algorithm by splitting a data set into two. We call one of those sets the training set, on which we learn some properties; we call the other set the testing set, on which we test the learned properties.
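To tie that back to the code in the question, here is a minimal self-contained sketch (load_iris, RandomForestClassifier, and accuracy_score are illustrative choices, not from the original post). The learned parameters live inside clf after fit; predict() takes feature rows, which is why it receives X_test and not y_test, and y_test is only used afterwards as the ground truth to score the predictions:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)               # the fitted model is stored inside clf, not in X_train
y_preds = clf.predict(X_test)           # predict() needs features, so it takes X_test
print(accuracy_score(y_test, y_preds))  # y_test is the ground truth used to evaluate y_preds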
Take a look at the link below.
https://scikit-learn.org/stable/user_guide.html
That is an excellent resource for learning all about Scikit Learn. It's hard to get your mind around some of these things, but it's a great learning experience, and it really does work!

How to correctly combine my classifiers?

I have to solve a 2-class classification problem.
I have 2 classifiers that output probabilities. Both of them are neural networks with different architectures.
Those 2 classifiers are trained and saved into 2 files.
Now I want to build a meta-classifier that will take the probabilities as input and learn weights for those 2 classifiers.
So it will automatically decide how much I should "trust" each of my classifiers.
This model is described here:
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier
I plan to use the mlxtend library, but it seems that StackingClassifier refits the models.
I do not want to refit because it takes a very long time.
On the other hand, I understand that refitting is necessary to "coordinate" the work of each classifier and "tune" the whole system.
What should I do in such a situation?
I won't talk about mlxtend because I haven't worked with it, but I'll tell you the general idea.
You don't have to refit these models to the whole training set, but you do have to refit them to parts of it so you can create out-of-fold predictions.
Specifically, split your training data into a few pieces (usually 3 to 10). Keep one piece (i.e. fold) as validation data and train both models on the other folds. Then predict the probabilities for the validation data using both models. Repeat the procedure, treating each fold as the validation set in turn. In the end, you will have probabilities for every data point in the training set.
Then, you can train a meta-classifier using these probabilities and the ground truth labels. You can use the trained meta-classifier on your new data.
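A rough sketch of that procedure with scikit-learn-style models (model_a, model_b, X_train, y_train, and X_new are placeholders for your own objects, and LogisticRegression is just one possible meta-classifier):
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def out_of_fold_probs(model, X, y, n_splits=5):
    # For every training row, get a probability predicted by a copy of the model
    # trained without that row (an out-of-fold prediction).
    probs = np.zeros(len(X))
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        m = clone(model).fit(X[train_idx], y[train_idx])
        probs[val_idx] = m.predict_proba(X[val_idx])[:, 1]
    return probs

p_a = out_of_fold_probs(model_a, X_train, y_train)   # model_a, model_b: your two classifiers
p_b = out_of_fold_probs(model_b, X_train, y_train)

meta = LogisticRegression()
meta.fit(np.column_stack([p_a, p_b]), y_train)       # learns how much to trust each model

# At prediction time, use the base models already fitted on the full training set:
new_probs = np.column_stack([model_a.predict_proba(X_new)[:, 1],
                             model_b.predict_proba(X_new)[:, 1]])
final_preds = meta.predict(new_probs)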

Calling "fit" multiple times in Keras

I've been working on a CNN over several hundred GBs of images. I've created a training function that bites off 4 GB chunks of these images and calls fit over each of these pieces. I'm worried that I'm only training on the last piece and not the entire dataset.
Effectively, my pseudo-code looks like this:
DS = lazy_load_400GB_Dataset()
for section in DS:
    X_train = section.images
    Y_train = section.classes
    model.fit(X_train, Y_train, batch_size=16, nb_epoch=30)
I know that the API and the Keras forums say that this will train over the entire dataset, but I can't intuitively understand why the network wouldn't relearn over just the last training chunk.
Some help understanding this would be much appreciated.
Best,
Joe
This question was raised at the Keras github repository in Issue #4446: Quick Question: can a model be fit for multiple times? It was closed by François Chollet with the following statement:
Yes, successive calls to fit will incrementally train the model.
So, yes, you can call fit multiple times.
For datasets that do not fit into memory, there is an answer in the Keras Documentation FAQ section
You can do batch training using model.train_on_batch(X, y) and model.test_on_batch(X, y). See the models documentation.
Alternatively, you can write a generator that yields batches of training data and use the method model.fit_generator(data_generator, samples_per_epoch, nb_epoch).
You can see batch training in action in our CIFAR10 example.
So if you want to iterate your dataset the way you are doing, you should probably use model.train_on_batch and take care of the batch sizes and iteration yourself.
One more thing to note: make sure the order of the samples you train your model on is shuffled after each epoch. As written, the example code does not appear to shuffle the dataset. You can read a bit more about shuffling here and here.
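A rough sketch of that manual loop (lazy_load_400GB_Dataset, section.images, and section.classes come from the question's pseudo-code; the epoch structure and within-chunk shuffling are just one way to organise it):
import numpy as np

DS = lazy_load_400GB_Dataset()
batch_size = 16
for epoch in range(30):
    for section in DS:                               # re-visit every chunk each epoch
        X_chunk, Y_chunk = section.images, section.classes
        order = np.random.permutation(len(X_chunk))  # shuffle within the chunk
        for start in range(0, len(X_chunk), batch_size):
            idx = order[start:start + batch_size]
            model.train_on_batch(X_chunk[idx], Y_chunk[idx])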

Caffe fine-tuning vs. starting from scratch

Context: let's say I have trained a CNN on datasetA and obtained caffeModelA.
Current situation: new pictures arrive, so I can build a new dataset, datasetB.
Question: would these two situations lead to the same caffemodel?
1. Merge datasetA and datasetB and train the net from scratch.
2. Perform some fine-tuning on the existing caffeModelA by training it only on datasetB (as explained here: http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html)
It might seem a dumb question, but I'm not really sure about the answer. And it really matters, because if the two approaches lead to the same result I can save time by going with option 2.
Note: bear in mind that it's the same problem, so there is no need to change the architecture here; I just plan to add new images to the training set.
In the Flickr-style example the situation is a bit more generic. They use the weights of the first layers from a model trained for a different classification task and employ them for a new task, training only a new last layer and fine-tuning the first layers a bit (by setting a low learning rate for those pretrained layers). Your case is similar but more specific: you want to use the pretrained model to train the exact same architecture for the exact same task, but with an extension of your data.
If your question is whether Option 1 will produce exactly the same model (all resulting weights equal) as Option 2, then no, most probably not.
In Option 1 the network is trained on iterations of dataset A, then dataset B, then dataset A again, and so on (assuming both were just concatenated together).
In Option 2 the network is first trained for some iterations/epochs on dataset A only, and then later continues learning for iterations/epochs on dataset B only, and that's it. So the solver sees a different sequence of gradients in the two options, resulting in two different models. That's from a strict theoretical perspective.
If you ask from a practical perspective, the two options will probably end up with very similar models. How many epochs (not iterations) did you train on dataset A? Say N epochs; then you can safely go with Option 2 and train your existing model further on dataset B for the same number of epochs with the same learning rate and batch size.
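For reference, a rough sketch of what Option 2 can look like with the Python interface (pycaffe). solver_B.prototxt is a hypothetical solver file whose training net points at datasetB, and the epoch arithmetic values are illustrative:
import caffe

caffe.set_mode_gpu()

# Solver configured to read datasetB (hypothetical file name)
solver = caffe.SGDSolver('solver_B.prototxt')

# Start from the weights learned on datasetA instead of a random initialisation
solver.net.copy_from('caffeModelA.caffemodel')

# Continue training for roughly N epochs' worth of iterations,
# where iterations_per_epoch = dataset_B_size / batch_size
n_epochs = 10                   # illustrative value
iterations_per_epoch = 1000     # illustrative value
solver.step(n_epochs * iterations_per_epoch)

solver.net.save('caffeModelA_finetuned_on_B.caffemodel')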

Imbalanced Data for Random ferns [closed]

For a multiclass problem, should the data be balanced for machine learning algorithms such as random forests and random ferns, or is it OK for it to be imbalanced to a certain extent?
The issue with imbalanced classes arises when the disproportion alters the separability of the class instances. But this does not happen in every imbalanced dataset: sometimes the more data you have from one class, the better you can differentiate the scarce data from it, since the extra data makes it easier to find which features are meaningful for building a discriminating plane (even if you are not using discriminant analysis, the point is still to separate the instances according to their classes).
For example, I remember the KDD Cup 2004 protein classification task, in which one class had 99.1% of the instances in the training set, yet if you tried undersampling methods to alleviate the imbalance you would only get worse results. That means the large amount of data from the majority class helped define the data in the smaller one.
Concerning random forests, and decision trees in general, they work by selecting, at each step, the most promising feature for partitioning the set into two (or more) class-meaningful subsets. Having inherently more data about one class does not bias this partitioning by default (= always), but only when the imbalance is not representative of the classes' real distributions.
So I suggest that you first run a multivariate analysis to gauge the extent of imbalance among the classes in your dataset, and then run a series of experiments with different undersampling ratios if you are still in doubt.
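A minimal sketch of such an experiment with scikit-learn (RandomForestClassifier, make_classification, and balanced_accuracy_score are illustrative choices, since random ferns are not part of scikit-learn):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 95% of samples in the majority class
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for ratio in [1.0, 0.5, 0.2, 0.1]:            # fraction of majority samples to keep
    maj = np.where(y_tr == 0)[0]
    mino = np.where(y_tr == 1)[0]
    keep = np.random.RandomState(0).choice(maj, int(len(maj) * ratio), replace=False)
    idx = np.concatenate([keep, mino])
    clf = RandomForestClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    print(ratio, balanced_accuracy_score(y_te, clf.predict(X_te)))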
I have used random forests in my tasks before. Although the data does not need to be balanced, if the positive samples are too few the pattern in the data may be drowned out by noise. Most classification methods (even random forests and AdaBoost) suffer from this flaw to some degree. Oversampling may be a good way to deal with this problem.
Perhaps the paper Logistic Regression in Rare Events Data is useful for this sort of problem, although its topic is logistic regression.
