Training Data Vs. Test Data - machine-learning

This might sound like an elementary question, but I am confused about the difference between the training set and the test set.
When we use supervised learning techniques such as classification to predict something, a common practice is to split the dataset into two parts: a training set and a test set. The training set contains the variable to be predicted; we train the model on it and then "predict" things.
Let's take an example. We want to predict loan defaulters at a bank, and we have the German credit data set, where we are predicting defaulters and non-defaulters. But the data already has a column that says whether each customer is a defaulter or a non-defaulter.
I understand the logic of prediction on UNSEEN data, as with the Titanic survival data, but what is the point of prediction when the class is already given, as in the German credit data?

As you said, the idea is to come up with a model that can predict UNSEEN data. The test data is only used to measure the performance of the model you built from the training data. You want to make sure the model you come up with does not "overfit" your training data. That's why the test data is important. Eventually, you will use the model to predict whether a new loan applicant is going to default or not, and so make a business decision on whether to approve the loan application.

The data set includes the known default labels so that you can verify that the model works as expected and predicts the correct results. Without them, no one could be confident that their model is working as expected.

The ultimate purpose of training a model is to apply it to what you call UNSEEN data.
Even in your German credit example, at the end of the day you will have a trained model that you can use to predict whether new (unseen) credit applications will default or not. You should be able to use it in the future for any new credit application, as long as you can represent the new credit data in the same format you used to train your model.
The test set, on the other hand, is just a formalism used to estimate how good the model is. You cannot know for sure how accurate your model is going to be on future credit applications, but what you can do is hold back a small part of your labelled data and use it only to check the model's performance after it has been built. That's what you would call the test set (or more precisely, a validation set).
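A minimal sketch of the hold-out idea described above, using only the Python standard library. The synthetic "loan" data, the 80/20 split, and the 1-nearest-neighbour classifier are all assumptions made for illustration; none of them come from the German credit data set. A 1-nearest-neighbour model memorises its training rows, so its training accuracy is perfect and tells you nothing, while the accuracy on the held-out rows is the honest estimate.

```python
import random

def make_data(n, rng):
    """Synthetic records: (income, debt) -> 1 if 'default', else 0, with 10% label noise."""
    rows = []
    for _ in range(n):
        income, debt = rng.random(), rng.random()
        label = 1 if debt > income else 0
        if rng.random() < 0.1:          # flip some labels so the task is not trivial
            label = 1 - label
        rows.append(((income, debt), label))
    return rows

def predict_1nn(train_rows, x):
    """Return the label of the training row closest to x."""
    nearest = min(
        train_rows,
        key=lambda row: (row[0][0] - x[0]) ** 2 + (row[0][1] - x[1]) ** 2,
    )
    return nearest[1]

def accuracy(train_rows, eval_rows):
    """Fraction of eval_rows whose label the 1-NN model predicts correctly."""
    hits = sum(predict_1nn(train_rows, x) == y for x, y in eval_rows)
    return hits / len(eval_rows)

rng = random.Random(0)
data = make_data(200, rng)
rng.shuffle(data)

split = int(0.8 * len(data))            # 80% train, 20% held-out test
train, test = data[:split], data[split:]

train_acc = accuracy(train, train)      # the model has memorised these rows
test_acc = accuracy(train, test)        # these rows were never seen in training

print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The known labels in the test rows are what make the second number computable at all, which is the point of keeping them in the data set.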

Related

K fold cross validation makes no part of data blind to the model

I have a conceptual question about K fold Cross validation.
In general, we train a model on training data and validate it with test data. We assume the system is blind to the test data, and this is why we can evaluate whether the system has really learned or not.
Now with K-fold, the final model has actually seen (indirectly, though) all of the data, so why is it still valid? It has already seen all the data, and we do not know how it predicts unseen data.
So my question is: given this fact, how do we know this method is valid?
Thanks.
In K-fold cross-validation, you actually train K different models. Let's say we are doing 5-fold CV and the dataset has 100 samples. We shuffle the data once and partition it into 5 disjoint folds of 20 samples each. In each round, we train on the 80 samples from four of the folds and test the trained model on the 20 samples of the held-out fold, so every sample is used for testing exactly once. We compute the accuracy and note it. At the end, we will have trained 5 different models. Then we average the accuracies over the folds and report this as the average performance of the model. Coming to your question, you need to think about why we need K-fold cross-validation at all. The answer is that you need to report the performance of your model, right? However, if you just train and evaluate your model with a single split, there is a possibility that your result is biased toward this specific split. I mean that a rare case may come up in this split, such as a large domain shift between the train and test sets, which is bad for the measured performance. None of the K models has seen its own test fold during training, which is why each fold's score is still an estimate on unseen data.
TL;DR: Think of your 'test data' more like 'validation data', which you hope represents truly unseen test data. Ideally if the model performs well for many different validation datasets it will work well when applied to real life test data which wasn't used in the training-validation process.
This confusion is justified. You are correct.
This is where the terminology of training data, validation data and test data can make things clearer. Models are trained on training data. This is the data the model sees directly as it updates its parameters and learns. Validation data is data that we use to check how well the model has actually learned. It is not directly seen by the model, and we use it to judge things like underfitting or overfitting. It is assumed that the validation data is a good representation of the test data. Test data is what we will end up applying our model to in the real world; it has never been seen in any way by the model.
Test and validation data are often used interchangeably, with most people just using training and test terminology.
An example:
If you are building a cat detector, you collect images of cats and split these images into training and validation sets. You assume the validation set is an accurate representation of the kinds of cat images people will use your model on in the real world. You train your model on the training data, validate how well it has learned on the validation data, and once you think it has learned well you deploy the model. People will use it on their own images to detect cats. These images are the true test data, which have never been seen by the model, but hopefully your validation set was a good indicator of how your model will perform on these images.
K-fold cross-validation is best used when your validation set may be small, or when you are unsure how well it represents the test data (e.g. if there are only ginger cats in your validation set, this may lead to your model failing on test data, so you would like to mix the validation set up). By performing K-fold cross-validation you can validate your model more times, with different choices of validation set, which hopefully will give a better indication of your model's generalizability.
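A minimal sketch of how the folds in the 5-fold, 100-sample example could be generated, using only the standard library. The function name `k_fold_indices` and the seed are hypothetical choices for this illustration. The point it tries to show is that within each round the training and validation indices never overlap, even though every sample ends up in a validation fold exactly once across the rounds.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # shuffle once, up front
    folds = [indices[i::k] for i in range(k)]   # k disjoint folds
    for i in range(k):
        valid_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, valid_idx

splits = list(k_fold_indices(n_samples=100, k=5))

for round_no, (train_idx, valid_idx) in enumerate(splits, start=1):
    # A separate model would be trained on train_idx and scored on valid_idx here.
    overlap = set(train_idx) & set(valid_idx)
    print(f"round {round_no}: train={len(train_idx)}, valid={len(valid_idx)}, overlap={len(overlap)}")
```

The per-round scores would then be averaged; the averaged number describes the training procedure rather than any one of the K fitted models.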

Pretraining Deep Learning Model for Weight Initialization

When pretraining a deep learning model (let's say a deep convolutional neural network) in order to achieve a good weight initialization, do I use the entire training set without the validation set (so that I avoid information leakage), or just a subset of the training set?
If you want to fine-tune your network after training it on your dataset, then you can use the same dataset (making sure that data does not move between the training, validation, and test sets). What you can also do as 'pre-training' is to download a model that is already trained on a similar dataset/problem and then train it on your dataset. This is known as transfer learning and works well for similar problems, but of course the bigger the gap between the two problems, the more you need to train.
In conclusion: you can use any dataset as long as the validation set remains hidden from the network.
I think it is more useful to divide the dataset into training, validation and test data. Keeping a completely separate test set aside and validating the model only on the validation data is a good choice. The entire training set should be used for training.

How to include variable attributes in Machine Learning models?

What machine learning techniques can be used to build a model if some attributes change over time? For example, the price of a hotel depends on the number of tourists in the city, which is time dependent, i.e. it changes from time to time.
Also, if we have a well-trained model on some static data, what are the ways to update the model when some of the data changes, other than retraining the model on the complete data again?
Regarding the first question, I would just add a feature indicating time. For instance, hotel X will appear in a few data records, each differing in the value of its "Month" feature (the data point for August might have a higher price than the one for December). This way the model will take the time of year into consideration.
Regarding the second question, unless you're using reinforcement learning or online learning, which train models from an incoming sequence of samples, I don't see a way to change the data without having to train the model again.
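To make the two ideas above concrete, here is a toy sketch that uses the month as the only feature and updates its estimate one observation at a time. The class `MonthlyPriceModel` and the prices are hypothetical, and a running mean is a deliberately simple stand-in for a real online learner (some scikit-learn estimators, for example, expose a `partial_fit` method for this purpose). It only illustrates that an online model folds in new samples without revisiting the old ones.

```python
from collections import defaultdict

class MonthlyPriceModel:
    """Toy online model: predicts the running mean price seen so far for a month."""

    def __init__(self):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)

    def update(self, month, price):
        """Fold one new observation into the model without revisiting old data."""
        self.count[month] += 1
        self.mean[month] += (price - self.mean[month]) / self.count[month]

    def predict(self, month):
        """Return the current estimate, or None if the month was never observed."""
        if self.count.get(month, 0) == 0:
            return None
        return self.mean[month]

model = MonthlyPriceModel()
for month, price in [(8, 120.0), (12, 80.0), (8, 140.0)]:
    model.update(month, price)

august_before = model.predict(8)    # mean of 120 and 140
model.update(8, 100.0)              # a new sample arrives later
august_after = model.predict(8)     # mean of 120, 140 and 100

print(august_before, august_after, model.predict(12))
```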

What are the basic steps for training a model?

I have been put in charge of an ML employee. I have never worked with ML before.
He spends most of his time training models. We give him text files and the expected result, and he then trains his SVM model.
There are roughly two models to train each month.
This appears to be full-time work for him.
Could someone please tell me what are the basic steps for training a model? I would like to know if this really requires full-time attention!
We use Python.
Thanks
The basic process of training a model involves the following steps:
Create a model
Divide the data into training and testing data sets
Apply N-fold cross-validation to reduce the dependence on any single split
Check the accuracy of the model
Repeat the above steps until you get the required accuracy.
It takes loads of repetition and fine-tuning of the model to get higher accuracy.
You hired a data scientist. Let him do his work!
Hope this helps!
Loading the Data
Pre-process/Clean the Data
Perform EDA
Treat Missing Values/Outliers
Split the Data
Scale the Data
One-Hot Encoding (if needed)
Train the model (fine-tune the parameters)
Evaluate the Model
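The steps above can be sketched end to end in a few lines. Everything here is an assumption for illustration: the data is synthetic numeric data rather than text files, rows with missing values are simply dropped, and a nearest-centroid classifier stands in for the SVM mentioned in the question. One detail worth noting is that the scaling statistics are computed from the training rows only and then reused for the test rows.

```python
import random
import statistics

# Load: hypothetical raw records (feature_a, feature_b, label); None marks a missing value.
rng = random.Random(42)
raw = []
for _ in range(120):
    label = rng.randint(0, 1)
    a = rng.gauss(2.0 * label, 1.0)
    b = rng.gauss(100.0 * label, 50.0)
    if rng.random() < 0.05:
        a = None
    raw.append((a, b, label))

# Clean / treat missing values: here we simply drop incomplete rows.
clean = [r for r in raw if r[0] is not None and r[1] is not None]

# Split: 75% train, 25% test.
rng.shuffle(clean)
cut = int(0.75 * len(clean))
train, test = clean[:cut], clean[cut:]

# Scale: statistics come from the training rows only.
means = [statistics.mean([r[i] for r in train]) for i in (0, 1)]
stdevs = [statistics.stdev([r[i] for r in train]) for i in (0, 1)]

def scale(row):
    """Standardise the two features of a row with the training statistics."""
    return [(row[i] - means[i]) / stdevs[i] for i in (0, 1)]

# Train: one centroid per class in the scaled feature space.
centroids = {}
for label in (0, 1):
    points = [scale(r) for r in train if r[2] == label]
    centroids[label] = [statistics.mean([p[i] for p in points]) for i in (0, 1)]

def predict(row):
    """Return the class whose centroid is closest to the scaled row."""
    x = scale(row)
    return min(
        centroids,
        key=lambda c: sum((x[i] - centroids[c][i]) ** 2 for i in (0, 1)),
    )

# Evaluate on the held-out rows.
test_accuracy = sum(predict(r) == r[2] for r in test) / len(test)
print(f"kept {len(clean)} of {len(raw)} rows, test accuracy: {test_accuracy:.2f}")
```

In practice most of the time goes into the steps this sketch skips over: understanding the data, choosing features, and repeating the loop with different parameters.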

Combining training, validation and test datasets

Is it possible to train a model on both the training and validation data sets? Basically, I would combine the two to create a new model, and then use that combined model to classify all of the data in the test data set.
This is what is usually done. The catch is that you have to assume the hyperparameters transfer: you usually fit the model on the training data and select hyperparameters based on the score on the validation data. When you combine train + validation you get a significantly bigger dataset, so the "optimal hyperparameters" might be quite different from the ones you selected before. So in general, yes, this is exactly what is usually done, but it might be trickier than you expect (especially if your method is highly stochastic or non-deterministic).
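A small sketch of that workflow, under stated assumptions: the data is synthetic, the model is a hand-written k-nearest-neighbours classifier, and the candidate values of `k` are arbitrary. The hyperparameter is chosen on the validation set, the model is then refit on train + validation with that choice, and the test set is touched only once at the end.

```python
import random

def make_rows(n, rng):
    """Synthetic 2-D points labelled by which side of x0 + x1 = 1 they fall on, with noise."""
    rows = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        y = 1 if x[0] + x[1] > 1.0 else 0
        if rng.random() < 0.15:         # label noise
            y = 1 - y
        rows.append((x, y))
    return rows

def knn_predict(fit_rows, x, k):
    """Majority vote among the k rows of fit_rows closest to x (k is odd, so no ties)."""
    nearest = sorted(
        fit_rows,
        key=lambda r: (r[0][0] - x[0]) ** 2 + (r[0][1] - x[1]) ** 2,
    )[:k]
    votes = sum(y for _, y in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(fit_rows, eval_rows, k):
    """Accuracy on eval_rows of a k-NN model that uses fit_rows as its memory."""
    hits = sum(knn_predict(fit_rows, x, k) == y for x, y in eval_rows)
    return hits / len(eval_rows)

rng = random.Random(1)
rows = make_rows(300, rng)
train, valid, test = rows[:180], rows[180:240], rows[240:]

# Select the hyperparameter using the validation set only.
candidates = [1, 3, 5, 9, 15]
best_k = max(candidates, key=lambda k: accuracy(train, valid, k))

# Refit on train + validation with the selected k, then evaluate once on test.
final_fit = train + valid
test_acc = accuracy(final_fit, test, best_k)
print(f"selected k={best_k}, test accuracy after refit: {test_acc:.2f}")
```

The sketch simply reuses `best_k` on the larger set, which is exactly the assumption the answer warns about: the value that was best for 180 rows is not guaranteed to be best for 240.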
