I have a saved ML model and I want to load the training and testing data it was trained on, so that I can append the next month's data to the training and testing sets.
Currently I am concatenating the x_train, y_train, x_test, y_test of the model with the new training and testing sets without loading the model.
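A hedged sketch of the append step, assuming the splits are plain pandas DataFrames/Series (all variable names here are illustrative). Persisting the splits alongside the model lets you reload and extend them later without touching the model itself.

```python
import pandas as pd

# existing splits (illustrative stand-ins for the saved model's data)
x_train = pd.DataFrame({"f1": [1.0, 2.0]})
y_train = pd.Series([0, 1])

# next month's data
x_train_new = pd.DataFrame({"f1": [3.0]})
y_train_new = pd.Series([1])

# append row-wise, resetting the index; the same pattern applies to x_test/y_test
x_train = pd.concat([x_train, x_train_new], ignore_index=True)
y_train = pd.concat([y_train, y_train_new], ignore_index=True)
```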
I've come across code samples where a weightCol is being created for both the training and test data.
But I want to check the model's performance on unseen data; will model.transform() work if there is no weightCol?
I have an image dataset for multi-class image classification: training & testing images. I trained and saved my model (as a .h5 file) on the training data, using an 80-20% train-validation split.
Now, I want to predict the classes for test images.
Which option is better, and is that always the case?
Use the trained model as it is to predict on the "test images".
Train the saved model on the whole training data (i.e., including the 20% of images used for validation) and then predict on the test images. But in that case there will be no validation data, so how can the model ensure that it keeps the loss to a minimum during training?
If you already properly trained the model, you do not need to retrain it (unless you are doing something specific with transfer learning). The whole purpose of having test data is to use it as a test case to see how well your model does on unseen data.
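A minimal sketch of "use the trained model as it is": save once after training, then load and predict, with no retraining. The question's model is a Keras .h5 file; scikit-learn with joblib is used here purely as a stand-in, and all names are illustrative.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
joblib.dump(clf, "model.joblib")       # save once after training

loaded = joblib.load("model.joblib")   # later: load and predict directly,
preds = loaded.predict(X_test)         # no retraining needed
```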
I have training data with 16 columns and test data with 14 columns; the last two (target) columns from the training data are not present in the test data (which is very important). Both the training data and the test data are already given.
The approach I was thinking of is to start by combining the train and test data and then split the data into X_Train, Y_Train, X_Test and Y_Test. Is that a good way to do it, or is there another way to start?
I haven't coded it yet, but before I do I need some advice on how to start.
Thanks
Well, I don't know what task you want to solve, but it seems like you want to train a model on your training dataset and then predict the targets of your test dataset (that's why you don't have those targets).
If you want to evaluate how well your model is doing during the training phase, you can split your training data into a real training set and a validation set with train_test_split(X_train, y_train). If the validation accuracy is good enough, take your trained model and call model.predict(X_test) on it.
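The split-validate-predict workflow above can be sketched as follows; the dataset and model choice are illustrative assumptions, not part of the original question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# labelled training data; unlabelled test data (targets unknown in practice)
X_train, y_train = make_classification(n_samples=300, random_state=42)
X_test, _ = make_classification(n_samples=50, random_state=7)

# carve a validation set out of the labelled training data
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_acc = model.score(X_val, y_val)   # check this before trusting the model

test_preds = model.predict(X_test)    # run once validation accuracy looks good
```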
For evaluating your model you could just split your training set into training and testing (using 20% for testing) and use cross-validation.
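A short sketch of cross-validation on the labelled training data only; the estimator and fold count are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_train, y_train = make_classification(n_samples=300, random_state=0)

# 5-fold cross-validation: five train/validate rounds on the training data
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print(scores.mean(), scores.std())
```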
Your test set is useless for evaluation if it doesn't contain the target variable. I'm guessing this is an assignment or a competition you're taking part in? They usually give you a test set while keeping the targets to themselves, so they can evaluate you.
I have three datasets: train, validation, test and I am currently using an XGBoost Classifier to do the job on a classification task.
I trained the XGBClassifier on the train set and saved it as a pickle file to avoid having to re-train it every time. Once I load the model from the pickle file, I am able to use its predict method, but I don't seem to be able to train this model on the validation set or any other new dataset.
Note: I do not get any error output, the jupyter lab cell looks like it's working perfectly, but my CPU cores are all resting during this cell's operation, so I see the model isn't being fitted.
Could this be a problem with XGBoost, or can pickle-dumped models not be fitted again after loading?
I had the exact same question a year ago; you can find the question and answer here.
Note, though, that in this way you will keep adding "trees" (boosters) to your existing model using your new data.
It might be better to train a new model on your training + validation data sets.
Whatever you decide to do, you should try both options and evaluate your results to see what fits better for your data.
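The "keep adding trees" idea can be sketched without XGBoost using scikit-learn's GradientBoostingClassifier with warm_start as a rough stand-in (in XGBoost itself, continued training goes through the xgb_model argument to fit/train). Everything below is illustrative, not the asker's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_old, y_old = make_classification(n_samples=200, random_state=0)
X_new, y_new = make_classification(n_samples=100, random_state=1)

clf = GradientBoostingClassifier(n_estimators=50, warm_start=True, random_state=0)
clf.fit(X_old, y_old)      # initial training: 50 boosting stages

clf.n_estimators = 80      # request 30 additional stages
clf.fit(X_new, y_new)      # the extra trees are fitted on the new data only
```

Whether this continued training beats retraining a fresh model on training + validation combined is exactly the empirical question raised above; evaluate both.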
I have two data sets, a train data set and a test data set, and I want to predict on them.
My train data set has these features:
ID, name, age, Time of recruitment, Time fired, status
My test data set has these features:
ID, name, age, Time of recruitment
Now I want to predict "status" for the test data set, but the train set's features differ from the test set's: the train set has the "Time fired" feature while the test set does not. What should I do?
If a model uses a particular attribute, then it is required in the test set for prediction.
Is 'Time fired' an important/critical attribute for the prediction? If not, you could just leave it out of the training data as well. If it is, you'll have to find a way to collect it for the test data as well.
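A hedged sketch of the "leave it out" option: train only on the columns the test set also has, so "Time fired" never enters the model. The column names follow the question; the tiny data and model choice are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.DataFrame({
    "age": [25, 40, 31, 52],
    "Time of recruitment": [2015, 2010, 2018, 2008],
    "Time fired": [2020, 2016, 2021, 2012],   # absent from the test set
    "status": [0, 1, 0, 1],
})
test = pd.DataFrame({
    "age": [29, 45],
    "Time of recruitment": [2017, 2012],
})

# use only the feature columns present in both data sets
features = list(test.columns)
model = LogisticRegression().fit(train[features], train["status"])
preds = model.predict(test[features])
```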