I have three datasets (train, validation, and test) and I am currently using an XGBoost classifier for a classification task.
I trained the XGBClassifier on the train set and saved it as a pickle file to avoid having to re-train it every time. Once I load the model from the pickle file, I am able to use its predict method, but I don't seem to be able to train this model on the validation set or any other new dataset.
Note: I do not get any error output; the JupyterLab cell looks like it's working perfectly, but my CPU cores are all idle while the cell runs, so I can tell the model isn't actually being fitted.
Could this be a problem with XGBoost, or is it that pickle-dumped models cannot be fitted again after loading?
I had the exact same question a year ago; you can find the question and answer here.
Note, though, that this way you will keep adding "trees" (boosters) to your existing model, using your new data.
It might be better to train a new model on your combined training + validation data.
Either way, you should try both options and evaluate the results to see which fits your data better.
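For illustration, a minimal sketch of both options with the scikit-learn API, where continued training goes through the xgb_model argument of fit (the file name and the X/y variables are placeholders):

```python
import pickle
from xgboost import XGBClassifier

# Load the previously trained model from the pickle file.
with open("xgb_model.pkl", "rb") as f:
    model = pickle.load(f)

# Option 1: continue boosting on the validation data. This adds new trees
# to the existing booster rather than training from scratch.
model.fit(X_val, y_val, xgb_model=model.get_booster())

# Option 2: train a fresh model on the combined training + validation data.
new_model = XGBClassifier()
new_model.fit(X_train_plus_val, y_train_plus_val)
```

Note that a plain fit call without xgb_model retrains from scratch rather than continuing from the loaded booster.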
I am using the Weka software to build a classification model, and I am confused about partitioning into training and testing datasets. I split the whole dataset 60/40: the 60% training portion is saved to my hard disk, and the 40% test portion is saved to another file. The data I am using is imbalanced, so I applied SMOTE to my training dataset. After that, on Weka's Classify tab, I selected the Use training set option under Test options and ran the Random Forest classifier on the training dataset. After getting the result, I chose the Supplied test set option under Test options, loaded my test dataset from the hard disk, and ran the classifier again.
I tried to find a tutorial on how to load training and test sets in Weka but could not find one, so I did the above based on my own understanding.
Therefore, I would like to know: is this the right way to perform classification with separate training and test datasets?
Thank you.
There is no need to evaluate your classifier on the training set (this will be overly optimistic, since the classifier has already seen this data). Just use the Supplied test set option; your classifier will then get trained automatically on the currently loaded dataset before being evaluated on the specified test set.
Instead of manually splitting your data, you could also use the Percentage split test option, with 60% to be used for your training data.
When using filters, you should always wrap the filter (in this case SMOTE) and your classifier (in this case RandomForest) in the FilteredClassifier meta-classifier. That way, you ensure that the training and test set data get transformed correctly. This also avoids leaking information into the test set, which happens when you transform the full dataset with a supervised filter and only split it into train/test afterwards. Finally, it documents nicely what preprocessing is being done to your input data, all in a single command-line string.
If you need to apply more than one filter, use the MultiFilter to apply them sequentially.
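For illustration, a hedged sketch of such a command-line string (assuming weka.jar is on the classpath and the SMOTE package is installed; the .arff file names are placeholders):

```
java -cp weka.jar weka.classifiers.meta.FilteredClassifier \
    -t train.arff -T test.arff \
    -F weka.filters.supervised.instance.SMOTE \
    -W weka.classifiers.trees.RandomForest
```

Here -t and -T point at the training and test files; replacing -T with -split-percentage 60 should give the percentage split behaviour mentioned above. The filter is fitted on the training data only, so the test data is not oversampled.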
I have an image dataset for multi-class image classification, split into training and testing images. I trained my model on the training data using an 80-20% train-validation split and saved it (as a .h5 file).
Now, I want to predict the classes for the test images.
Which option is better, and is that always the case?
Use the trained model as it is to predict on the test images.
Retrain the saved model on the whole training data (i.e., including the 20% used for validation) and then predict on the test images. But in that case there would be no validation data, so how would the model keep the loss to a minimum during training?
If you already properly trained the model, you do not need to retrain it (unless you are doing something specific with transfer learning). The whole purpose of having test data is to use it as a test case, to see how well your model does on unseen data.
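A minimal sketch of option 1, assuming a Keras model saved as model.h5 and a test_images array preprocessed the same way as the training inputs:

```python
import numpy as np
from tensorflow.keras.models import load_model

# Load the already-trained model; no retraining needed.
model = load_model("model.h5")  # placeholder file name

# Predict class probabilities, then take the most probable class per image.
probs = model.predict(test_images)
predicted_classes = np.argmax(probs, axis=1)
```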
If I only have to make a prediction (or a few), do I need to re-train my NN every time? Or can I, pardon me if this is silly, "save" the training and only do the test?
Currently I'm using PyCharm, but I've seen that with other IDEs, like Spyder, you can execute selected lines of code; in that case, how does the NN keep its training without the need to re-train?
Sorry if these questions are too naive.
No, you don't need to re-train your NN every time. Just save your model parameters to a file and load them to make new predictions.
Are you using a machine learning framework like TensorFlow or Keras? In Keras this is very easy to implement. There are two methods: first, you can save the model during training using callbacks; second, you can use your_model_name.save('file_name.h5') and then load it with load_model('file_name.h5') to make predictions with your_model_name.predict(x).
By the way, there is a nice guide on how to properly save the full model architecture or just the model weights.
EDIT: for both methods you can use load_model; it's very simple!
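A hedged sketch of both methods (the model, x_train, y_train, x, and file names are placeholders):

```python
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model

# Method 1: save checkpoints during training via a callback,
# keeping only the weights that scored best on the validation data.
checkpoint = ModelCheckpoint("checkpoint.h5", save_best_only=True)
model.fit(x_train, y_train, epochs=10, validation_split=0.2, callbacks=[checkpoint])

# Method 2: save the full model once training is done.
model.save("file_name.h5")

# Later, e.g. in a fresh session: load and predict, no re-training needed.
restored = load_model("file_name.h5")
predictions = restored.predict(x)
```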
How do I merge the train, validation, and test sets of MNIST in TensorFlow for batch training? Can anyone help me?
What would be the purpose of using a testing set to train a model? Then it would become a training set too.
Sets are named training, validation and testing for a reason.
So you train your model with the training data. Once the model is trained, you validate it over the validation data. You test the performance of the model over the testing data. The training method you use (batch or something else) will NEVER change the fact that training/validation/testing data should never be mixed with one another.
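For illustration, a minimal tf.keras sketch of that workflow (the tiny architecture is a throwaway example; Keras's MNIST loader only ships train and test splits, so the validation set is carved out of the training data):

```python
import tensorflow as tf

# MNIST comes pre-split into train and test; keep them separate.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Train in batches on the training data, validating on a held-out slice of it.
model.fit(x_train, y_train, batch_size=32, epochs=5, validation_split=0.1)

# The test set is touched exactly once, for the final evaluation.
model.evaluate(x_test, y_test)
```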
If this does not answer your question, then edit your question and specify, because it is rather vague at the present moment.
This may sound like a naive question, but I am quite new to this. Let's say I use the Google pre-trained word2vec model (https://github.com/dav/word2vec) to train a classification model. I save my classification model. Now I load the classification model back into memory to test new instances. Do I need to load the Google word2vec model again? Or is it only used for training my model?
It depends on how your corpora and test examples are structured and pre-processed.
You are probably using the pre-trained word vectors to turn text into numerical features. At first, text examples are vectorized to train the classifier. Later, other (test/production) text examples will be vectorized in the same way and presented to the classifier to get its judgements.
So you will need to use the same text-to-vectors process for test/production text examples as was used during training. Perhaps you've done that in a separate earlier bulk step, in which case you already have the features in the vector form the classifier uses. But often your classifier pipeline will itself take raw text, and vectorize it – in which case it will need the same pre-trained (word)->(vector) mappings available at test time as were available during training.
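A hedged sketch of that shared text-to-vector step with gensim (averaging word vectors is just one common, simple scheme; the file name matches the usual GoogleNews download but is an assumption here):

```python
import numpy as np
from gensim.models import KeyedVectors

# The same pre-trained vectors must be loaded at training AND test time.
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def vectorize(text):
    """Average the vectors of in-vocabulary tokens into one feature vector."""
    vecs = [wv[w] for w in text.split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

# Training:  clf.fit([vectorize(t) for t in train_texts], train_labels)
# Testing:   clf.predict([vectorize(t) for t in test_texts])  # same vectorize()
```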