Training data has more columns than test data - machine-learning

I have training data with 16 columns and test data with 14columns and the last two target columns from training data is not present in test data(which is very important). Also the test data is already given and training data is also given.
The approach i was thinking is to start off by combining the train and test data and then split the data as X_Train, Y_Train, X_Test and Y_Test. Is it a good way to do or is there any other way to start off?
I haven't coded for it yet. But before i could do i need some advice to start it.
Thanks

Well I don't know what task you want to solve, but it seems like you want to train a model on your training dataset and then predict the targets of your test dataset (that's why you don't have those).
If you want to evaluate how good your model is doing in the training phase you can split your training data into a real training set and a validation set with test_train_split(X_train,y_train). If the validation accuracy is good enough you take your trained model and call model.predict(X_test) on it

For evaluating your model you could just split your training set into training and testing ( using 20% for testing ) and use cross validation.
Your test set is useless for evaluation if it doesn't contain target variable. I'm thinking that this is an assignment or a competition ur taking ? Because they always give you a test set with keeping the targets for themselves for evaluating you

Related

Can I assess my model's performance with LOOCV on the whole dataset?

Let's say I initially split my dataset into training (80%) and test (20%) sets, perform a 10-fold CV on my training set and obtain an average R² of 75%. After that I check the best model's accuracy on the test set and obtain an R² of 74%, which indicates that the model is fairly robust. Now, before deploying it to real applications, I tune it with the whole data. Someone asks me the model's approximate R²; if I say 74% or 75%, I will me ignoring the fact that the model was now tunned with more data (test set). Is it a resonable approach to perform a leave one out CV on the chosen model with the whole data, compare the predicted targets with the real ones, check the R² (let's say it's 80% now) and say that the real-world model will most likely have an R² of 80%? I see no problems with that, but I do not know if this approach is correct.
It is true that you should train again on the whole data and it might lead to performance improvements. However, in this context your whole data should not be train + test! It should be just the trainign dataset but without any cross-validation. So before you had %80 for training and you were doing 10 fold CV, meaning that you were training your model actually on %72 of your complete data(train+test) and keeping the %8 for the validation. Now you should train it on the whole %80 percent and report your final results again on the unseen test set.
If you do LOOCV on the train + test, you can not report your performance on validation samples because this is how the model is finetuned and you might as well overfit to validation data.

Is there a training/validation split happening internally, or is there just one training set and testing set?

so recently I've been following the tutorial in https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html and I came up with the following question: is there a training/validation split happening internally?
The thing is, in this tutorial, the main dataset is spliced into training and testing. Here, the training set is used for training and the testing in the evaluate() function.
To my knowledge, when dealing with neural networks usually the data is split into 3 sets: training, validation and testing. In this tutorial though, it is only split into training and testing. For what I know, usually the model is trained and then evaluated, and the weights are then updated according to what was learnt in the evaluation step. However, I can't seem to find any connection between the evaluate function and training. Therefore, in this example the model is being evaluated AND tested using the same dataset.
Is there something here that I might be missing? Is there an internal split of the training dataset happening during training (into training and validation) and the function evaluate() is simply used for testing the performance of the model?
for epoch in range(num_epochs):
# train for one epoch, printing every 10 iterations
train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
# update the learning rate
lr_scheduler.step()
# evaluate on the test dataset
evaluate(model, data_loader_test, device=device)```
Is there a training/validation split happening internally?
Is there an internal split of the training dataset happening during
training (into training and validation) and the function evaluate() is
simply used for testing the performance of the model?
No, you are not missing anything. What you see is exactly what's being done there. There is no internal splitting happening. It's just an example to show how something is done in pytorch without cluttering it unncessarily.
Some datasets such as CIFAR10/CIFAR100, come only with a train/test set and usually it has been the norm to just train and then evaluate on the test set in examples. However, nothing stops you from splitting the training set however you like, it's up to you. In such tutorials they just tried to keep everything as simple as possible.

Can i predict and evaluated model with the whole dataset?

I split the dataset into train and test of 80-20 ration respectively. I predicted and evaluated with test dataset. And my question is can we evaluate and predict model with the whole dataset before that I shuffle entire dataset. Can we do that? If not, why should not we do that? what is wrongdoing like that?
Data Snooping is the quick answer what you are looking for.
In other words, your model would seem outperforming on your test data if it was trained on 100% data first. Model would become an overfitted model that basically would predict seen data with higher accuracy however would fail to do so with any sort of unseen test data.
You can do it, however it would result in overfitted model. You can try k fold cross validation method in stead.
If you use the whole dataset for training, the model will fit to all the variances in data (overfitting). As a result, the performance of your model on similar data will be high. However, the model will exhibit low performance on unseen data with a different distribution compared to your training dataset. One way to prevent this is to: a) split your data into training, validation, and testing datasets (see the note below), b) apply k-fold cross-validation on training and validation splits, c) verify the performance of your models from step b on the third split (test dataset).
Note: There is no consensus on the naming of the splits. Some sources name them training-validation-testing while others use training-testing-validation.

Machine learning: training model from test data

I was wondering if a model trains itself from the test data as well while evaluating it multiple times, leading to a over-fitting scenario. Normally we split the training data into train-test splits and I noticed some people split it into 3 sets of data - train, test and eval. eval is for final evaluation of the model. I might be wrong but my point is that if the above mentioned scenario is not true, then there is no need for an eval data set.
Need some clarification.
The best way to evaluate how well a model will perform in the 'wild' is to evaluate its performance on a data set it has not seen (i.e., been trained on) -- assuming you have the labels in a supervised learning problem.
People split their data into train/test/eval and use the training data to estimate/learn the model parameters and the test set to tune the model (e.g., by trying different hyperparameter combinations). A model is usually selected based on the hyperparameter combination that optimizes a test metric (regression - MSE, R^2, etc.; classification - AUC, accuracy, etc.). Then the selected model is usually retrained on the combined train + test data set. After retraining, the model is evaluated based on its performance on the eval data set (assuming you have some ground truth labels to evaluate your predictions). The eval metric is what you report as the generalization metric -- that is, how well your model performs on novel data.
Does this help?
Consider you have train and test datasets. Train dataset is the one in which you know the output and you train your model on train dataset and you try to predict the output of Test dataset.
Most people split train dataset into train and validation. So first you run your model on train data and evaluate it on validation set. Then again you run the model on test dataset.
Now you are wondering how this will help and of any use?
This helps you to understand your model performance on seen data(validation data) and unseen data(your test data).
Here comes bias-variance trade-off into picture.
https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
Let's consider a binary classification example where a student's previous semester grades, Sports achievements, Extracurriculars etc are used to predict whether or not he will pass the final semester.
Let's say we have around 10000 samples (data of 10000 students).
Now we split them:
Training set - 6000 samples
Validation set - 2000 samples
Test set - 1000 samples
The training data is generally split into three (training set, validation set, and test set) for the following reasons:
1) Feature Selection: Let's assume you have trained the model using some algorithm. You calculate the training accuracy and validation accuracy. You plot the learning curves and find if the model is overfitting or underfitting and make changes (add or remove features, add more samples etc). Repeat until you have the best validation accuracy. Now test the model with the test set to get your final score.
2) Parameter Selection: When you use algorithms like KNN, And you need to find the best K value which fits the model properly. You can plot the accuracy of different K value and choose the best validation accuracy and use it for your test set. (same applies when you find n_estimators for Random forests etc)
3) Model Selection: Also you can train the model with different algorithms and choose the model which better fits the data by testing out the accuracy using validation set.
So basically the Validation set helps you evaluate your model's performance how you must fine-tune it for best accuracy.
Hope you find this helpful.

Overfitting and Data splitting

Let's say that I have a data file like:
Index,product_buying_date,col1,col2
0,2013-01-16,34,Jack
1,2013-01-12,43,Molly
2,2013-01-21,21,Adam
3,2014-01-09,54,Peirce
4,2014-01-17,38,Goldberg
5,2015-01-05,72,Chandler
..
..
2000000,2015-01-27,32,Mike
with some more data and I have a target variable y. Assume something as per your convenience.
Now I am aware that we divide the data into 2 parts i.e. Train and Test. And then we divide Train into 70:30, build the model with 70% and validate it with 30%. We tune the parameters so that model does not get overfit. And then predict with the Test data. For example: I divide 2000000 into two equal parts. 1000000 is train and I divide it in validate i.e. 30% of 1000000 which is 300000 and 70% is where I build the model i.e. 700000.
QUESTION: Is the above logic depending upon how the original data splits?
Generally we shuffle the data and then break it into train, validate and test. (train + validate = Train). (Please don't confuse here)
But what if the split is alternate. Like When I divide it in Train and Test first, I give even rows to Test and odd rows to Train. (Here data is initially sort on the basis of 'product_buying_date' column so when i split it in odd and even rows it gets uniformly split.
And when I build the model with Train I overfit it so that I get maximum AUC with Test data.
QUESTION: Isn't overfitting helping in this case?
QUESTION: Is the above logic depending upon how the original data
splits?
If dataset is large(hundred of thousand), you can randomly split the data and you should not have any problem but if dataset is small then you can adopt the different approaches like cross-validation to generate the data set. Cross-validation states that you split you make n number of training-validation set out of your Training set.
suppose you have 2000 data points, you split like
1000 - Training dataset
1000 - testing dataset.
5-cross validation would mean that you would make five 800/200 training/validation dataset.
QUESTION: Isn't overfitting helping in this case?
Number one rule of the machine learning is that, you don't touch the test data set. It's a holly data set that should not be touched.
If you overfit the test data to get maximum AUC score then there won't be any meaning of validation dataset. Foremost aim of any ml algorithm is to reduce the generalization error i.e. algorithm should be able to perform good on unseen data. If you would tune your algorithm with testing data. you won't be able to meet this criteria. In cross-validation also you do not touch your testing set. you select your algorithm. tune its parameter with validation dataset and after you have done with that apply your algorithm to test dataset which is your final score.

Resources