I have been put in charge of an ML employee. I have never worked with ML before.
He spends most of his time training models. We give him text files and the expected result, and he then trains his SVM model.
There are roughly two models to train each month.
This appears to be full-time work for him.
Could someone please tell me what the basic steps for training a model are? I would like to know if this really requires full-time attention!
We use Python.
Thanks
The basic process of training a model involves the following steps:
Create a model
Divide data into training and testing data sets
Apply N-fold cross-validation to reduce split bias
Check the accuracy of the model
Repeat the above steps until you reach the required accuracy.
Getting higher accuracy and fine-tuning the model takes a lot of repetition; a minimal sketch of this loop is shown below.
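A minimal sketch of that loop with scikit-learn, assuming a delimited text file with feature columns and a "label" column (the file name, column names and SVM settings are all assumptions, not your colleague's actual setup):

    # Hedged sketch of the train/evaluate loop described above.
    import pandas as pd
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.svm import SVC

    data = pd.read_csv("training_data.txt")      # hypothetical file
    X = data.drop(columns=["label"])             # hypothetical column name
    y = data["label"]

    # 1. Create a model
    model = SVC(kernel="rbf", C=1.0)

    # 2. Divide data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # 3. N-fold cross-validation on the training set to reduce split bias
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    print("5-fold CV accuracy:", cv_scores.mean())

    # 4. Check the accuracy of the model on the held-out test set
    model.fit(X_train, y_train)
    print("Test accuracy:", model.score(X_test, y_test))

    # 5. Repeat with different hyperparameters (C, kernel, ...) until the
    #    required accuracy is reached.

Much of the repetition is step 5: re-running this loop with different features and hyperparameters.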
You hired a data scientist. Let him do his work!
Hope this helps!
Loading the Data
Pre-process/Clean the Data
Perform EDA
Treat Missing Values/Outliers
Split the Data
Scale the Data
One-Hot Encoding (if needed)
Train the model (fine-tune the parameters)
Evaluate the Model
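A minimal end-to-end sketch of that pipeline with scikit-learn; the file name, column names and the choice of an SVM are assumptions:

    # Hedged sketch of the checklist above (EDA itself is interactive and omitted).
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.svm import SVC

    df = pd.read_csv("data.csv")                         # load the data
    X, y = df.drop(columns=["target"]), df["target"]

    numeric_cols = ["age", "income"]                     # hypothetical columns
    categorical_cols = ["city"]                          # hypothetical column

    preprocess = ColumnTransformer([
        # treat missing values, then scale the numeric features
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric_cols),
        # one-hot encode the categorical feature (if needed)
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])

    # split the data, train (fine-tune the parameters), evaluate the model
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    clf = Pipeline([("prep", preprocess), ("model", SVC())])
    clf.fit(X_train, y_train)
    print("Test accuracy:", clf.score(X_test, y_test))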
I have a conceptual question about K fold Cross validation.
In general, we train a model on training data and validate it with test data; we assume the system is blind to the test data, and this is why we can evaluate whether the system really learned or not.
Now with k-fold, the final model has actually seen (indirectly, though) all of the data, so why is it still valid? It has already seen all the data, and we do not know how it will predict on unseen data.
Given this, why do we consider this method valid?
Thanks.
In K-Fold Cross Validation, you actually train K different models. Let's say we are doing 5-fold CV and the size of the dataset is 100 samples. Then, in each fold, we randomly split the data into 80 train samples and 20 test samples. We train on the 80 train samples, then test the trained model on the 20 left-out test samples, compute the accuracy and note it. At the end, we will have 5 different models, and we can average the accuracies of the folds and report this as the average performance of the model.
Coming to your question, you need to think about why we need K-Fold Cross Validation in the first place. The answer is that you need to report the performance of your model, right? However, if you just train and evaluate your model on a single split, there is a possibility that your model may be biased towards this specific split: a rare case may occur, such as a large domain shift between the train and test sets, which hurts the reported performance.
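A minimal sketch of the procedure described above, with 100 synthetic samples and an SVM as a placeholder model:

    # Hedged sketch of 5-fold cross-validation as described above.
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))            # 100 samples, 4 synthetic features
    y = rng.integers(0, 2, size=100)         # synthetic binary labels

    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = SVC()                                         # a fresh model per fold
        model.fit(X[train_idx], y[train_idx])                 # 80 train samples
        scores.append(model.score(X[test_idx], y[test_idx]))  # 20 left-out samples

    print("Per-fold accuracy:", scores)
    print("Average accuracy:", np.mean(scores))               # the reported performance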
TL;DR: Think of your 'test data' more like 'validation data', which you hope represents truly unseen test data. Ideally if the model performs well for many different validation datasets it will work well when applied to real life test data which wasn't used in the training-validation process.
This confusion is justified. You are correct.
This is where the terminology training data, validation data and test data can make things clearer. Models are trained on training data: the data directly seen by the model as it updates its parameters and learns. Validation data is data that we use to check how well the model has actually learned. It is not directly seen by the model, and we use it to judge things like under- or overfitting. It is assumed that the validation data is a good representation of the test data. Test data is what we will end up applying our model to in the real world; it has never been seen in any way by the model.
Test and validation data are often used interchangeably, with most people just using training and test terminology.
An example:
If you are building a cat detector, you collect images of cats and split these images into training and validation sets. You assume the validation set is an accurate representation of the kinds of cat images people will use your model on in the real world. You train your model on the training data, validate how well it has learned on the validation data, and once you think it has learned well you deploy the model. People will use it on their own images to detect cats. These images are the true test data, which have never been seen by the model, but hopefully your validation set was a good indicator of how your model will perform on them.
K-fold cross validation is best used when your validation set may be small, or you are unsure of how well it represents the test data (e.g. if there are only ginger cats in your validation set, this could lead to your model failing on test data, so you would like to mix the validation set up). By performing k-fold cross validation you can validate your model more times, with different choices of validation set, which hopefully gives a better indication of your model's generalizability.
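To make the terminology concrete, here is a hedged sketch that keeps a truly held-out test set untouched while k-fold cross-validation rotates the validation set; the dataset and model choice are assumptions:

    # Hedged sketch: training vs. validation vs. test data.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # Keep a truly unseen test set aside; it stands in for "real world" data.
    X_dev, X_test, y_dev, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # K-fold CV on the development data plays the role of repeated validation:
    # each fold uses a different slice as the validation set.
    model = SVC()
    val_scores = cross_val_score(model, X_dev, y_dev, cv=5)
    print("Validation accuracy per fold:", val_scores)

    # Only after choosing the model do we look at the untouched test set.
    model.fit(X_dev, y_dev)
    print("Test accuracy:", model.score(X_test, y_test))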
I know this may be a basic question but I want to know if I am using the train, test split correctly.
Say I have data that ends at 2019, and I want to predict values in the next 5 years.
The graph I produced is provided below:
My training data covers 1996-2014 and my test data covers 2014-2019. The model fits the test data almost perfectly. I then used this fitted model to make predictions for 2019-2024.
Is this the correct way to do it, or should my predictions also be for 2014-2019, just like the test data?
The test/validation data is useful for evaluating which predictor to use. Once you have decided which model to use, you should retrain it on the whole 1996-2019 dataset so that you do not lose potentially valuable information from 2014-2019. Take into account that when working with time series, the newer part of the series usually matters more for your prediction than the older values.
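A minimal sketch of that advice on a synthetic yearly series: evaluate the candidate model on the 2014-2019 hold-out, then refit on all of 1996-2019 before forecasting the next five years (the data and the simple linear model are assumptions):

    # Hedged sketch: model selection on a hold-out, then refit on everything.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    years = np.arange(1996, 2020)                        # 1996-2019
    values = 2.0 * (years - 1996) + np.random.default_rng(0).normal(size=years.size)

    train_mask = years < 2014                # boundary handling is an assumption
    X_train = years[train_mask].reshape(-1, 1)
    X_test = years[~train_mask].reshape(-1, 1)
    y_train, y_test = values[train_mask], values[~train_mask]

    # 1. Evaluate the candidate model on the 2014-2019 hold-out.
    model = LinearRegression().fit(X_train, y_train)
    print("Hold-out R^2:", model.score(X_test, y_test))

    # 2. Once the model is chosen, refit on the whole 1996-2019 series so the
    #    most recent years are not wasted, then forecast 2020-2024.
    final_model = LinearRegression().fit(years.reshape(-1, 1), values)
    future_years = np.arange(2020, 2025).reshape(-1, 1)
    print("Forecast 2020-2024:", final_model.predict(future_years))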
So say for each of my ‘things’ to classify I have:
{house, flat, bungalow, electricityHeated, gasHeated, ... }
Which would be made into a feature vector:
{1,0,0,1,0,...} which would mean a house that is heated by electricity.
For my training data I would have all of this data, but for the actual thing I want to classify I might only have what kind of house it is and a couple of other things, not all of the data, i.e.
{1,0,0,?,?,...}
So how would I represent this?
I would want to find the probability that a new item would be gasHeated.
I would be using an SVM linear classifier. I don't have any code to show because this is purely theoretical at the moment. Any help would be appreciated :)
When I read this question, it seems that you may have confused features with labels.
You said that you want to predict whether a new item is "gasHeated", so "gasHeated" should be a label rather than a feature.
By the way, one of the most common ways to deal with a missing value is to set it to zero (or some unused value, say -1). But normally you should have missing values in both the training data and the testing data for this trick to be effective. If missing values only occur in your testing data and not in your training data, it means that your training and testing data are not from the same distribution, which violates a basic assumption of machine learning.
Let's say you have a trained model and a testing sample {?,0,0,0}. Then you can create two new testing samples, {1,0,0,0}, {0,0,0,0}, and you will have two predictions.
I personally don't think SVM is a good approach if you have missing values in your testing dataset. As mentioned above, you can create two new predictions, but what if they disagree? In my opinion it is difficult to assign a probability to the results of an SVM, unless you use logistic regression or Naive Bayes instead. I would prefer a Random Forest in this situation; a sketch follows below.
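A minimal sketch of the ideas above: treat "gasHeated" as the label, enumerate the missing feature at prediction time, and use a random forest so a probability can be read off (the toy data and column layout are assumptions):

    # Hedged sketch: features = [house, flat, bungalow, electricityHeated],
    # label = gasHeated (1 = gas heated).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X_train = np.array([[1, 0, 0, 1],
                        [0, 1, 0, 0],
                        [0, 0, 1, 1],
                        [1, 0, 0, 0]])
    y_train = np.array([0, 1, 0, 1])

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)

    # New item {1,0,0,?}: enumerate the unknown feature (or fill it with a
    # default such as 0) and average the predicted probabilities.
    candidates = np.array([[1, 0, 0, 0],
                           [1, 0, 0, 1]])
    probs = clf.predict_proba(candidates)[:, 1]
    print("P(gasHeated) per candidate:", probs)
    print("Averaged estimate:", probs.mean())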
This might sound like an elementary question, but I am having major confusion regarding the training set and the test set.
When we use supervised learning techniques such as classification to predict something, a common practice is to split the dataset into two parts, a training set and a test set. The training set already contains the variable we want to predict; we train the model on this dataset and then "predict" things.
Let's take an example. We are going to predict loan defaulters for a bank, and we have the German credit dataset, where we are predicting defaulters and non-defaulters, but there is already a column that says whether a customer is a defaulter or a non-defaulter.
I understand the logic of prediction on UNSEEN data, like the Titanic survival data, but what is the point of prediction when the class is already given, as in the German credit lending data?
As you said, the idea is to come up with a model that can predict UNSEEN data. The test data is only used to measure the performance of the model created from the training data. You want to make sure the model you come up with does not "overfit" your training data; that's why the test data is important. Eventually, you will use the model to predict whether a new borrower is going to default or not, and thus make a business decision on whether to approve the loan application.
The reason the defaulted values are included is so that you can verify that the model is working as expected and predicting the correct results. Without them, there is no way for anyone to be confident that their model is working as expected.
The ultimate purpose of training a model is to apply it to what you call UNSEEN data.
Even in your German credit lending example, at the end of the day you will have a trained model that you could use to predict if new - unseen - credit applications will default or not. And you should be able to use it in the future for any new credit application, as long as you are able to represent the new credit data in the same format you used to train your model.
On the other hand, the test set is just a formalism used to estimate how good the model is. You cannot know for sure how accurate your model is going to be on future credit applications, but what you can do is set aside a small part of your training data and use it only to check the model's performance after it has been built. That's what you would call the test set (or, more precisely, a validation set).
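A hedged sketch of that workflow: hold out part of the labelled data to estimate performance, then apply the fitted model to a brand-new application (the file name, label column and model choice are assumptions, and the features are assumed to be numeric already):

    # Hedged sketch: estimate performance on held-out labelled data,
    # then predict on unseen applications.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("german_credit.csv")        # hypothetical file
    X = df.drop(columns=["default"])             # hypothetical label column
    y = df["default"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print("Estimated accuracy on held-out data:", model.score(X_test, y_test))

    # A genuinely new application, encoded with the same columns as X:
    new_application = X_test.iloc[[0]]           # stand-in for future data
    print("Predicted default:", model.predict(new_application)[0])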
I want to classify news articles into the category they belong to. I have 4 categories of news, e.g. Technology, Sports, Politics and Health, and I have collected around 50 documents for each category as a training set.
Is the training data enough for classification? And which algorithm should I use for classification: SVM, Random Forest, kNN?
I am using the scikit-learn (http://scikit-learn.org/) Python library for my task.
Thanks
There are many ways to attack this problem, from CRFs to Random Forests.
With your limited training data, I would suggest going with a high-bias model such as a linear SVM. Start by training one-vs-all models for each class and predicting the class with the highest probability. This will give you a baseline for how hard your problem is with the given training data; a sketch is shown below.
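A hedged sketch of that baseline with scikit-learn: TF-IDF features plus a linear SVM trained one-vs-rest, predicting the class with the highest score (the example documents and labels are placeholders for your 50 documents per category):

    # Hedged sketch of a one-vs-all linear SVM text-classification baseline.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    docs = ["new phone released", "team wins final",
            "election results announced", "flu vaccine study"]
    labels = ["Technology", "Sports", "Politics", "Health"]

    baseline = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        OneVsRestClassifier(LinearSVC()),        # one linear SVM per class
    )
    baseline.fit(docs, labels)

    print(baseline.predict(["the match ended in a draw"]))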
I would suggest using Naive Bayes classification. There is a tool called LingPipe where this is already implemented. Just refer to:
http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html
There you have a small sample program, Classifynews.java. Run that program by training on the data and then applying the test data. A sample training dataset is provided as the "20 Newsgroups" collection:
http://qwone.com/~jason/20Newsgroups/
Training is done on the training data; if needed, you can build an intermediate model and then apply the test data to that model. Naive Bayes works well when the training data is small.
Its accuracy increases as the size of the training data grows, so try to include more newsgroups. Good luck. Try this and let me know.
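The LingPipe demo is Java; since the question mentions scikit-learn, here is a hedged Python equivalent, a multinomial Naive Bayes classifier trained on a few 20 Newsgroups categories (the category choice is an assumption):

    # Hedged sketch: Naive Bayes news classification with scikit-learn.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    categories = ["sci.space", "rec.sport.hockey", "talk.politics.misc", "sci.med"]
    train = fetch_20newsgroups(subset="train", categories=categories)
    test = fetch_20newsgroups(subset="test", categories=categories)

    nb = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
    nb.fit(train.data, train.target)
    print("Test accuracy:", nb.score(test.data, test.target))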