I know this may be a basic question but I want to know if I am using the train, test split correctly.
Say I have data that ends at 2019, and I want to predict values in the next 5 years.
The graph I produced is provided below:
My training data starts from 1996-2014 and my test data starts from 2014-2019. The test data perfectly fits the training data. I then used this test data to make predictions from 2019-2024.
Is this the correct way to do it, or my predictions should also be from 2014-2019 just like the test data?
The test/validation data is useful for you to evaluate the predictor to use. Once you have decided which model to use, you should train the model with the whole dataset 1996-2019 so that you do not lose possible valuable knowledge from 2014-2019. Take into account that when working with time-series, usually the newer part of the serie has more importance in your prediction than older values of the serie.
Related
I am using z-score to normalize my data before training my model. When I do predictions on a daily basis, I tend to have very few observations each day, perhaps just a dozen or so. My question is, can I normalize the test data just by itself, or should I attach it to the entire training set to normalize it?
The reason I am asking is, the normalization is based on mean and std_dev, which obviously might look very different if my dataset consists only of a few observations.
You need to have all of your data in the same units. Among other things, this means that you need to use the same normalization transformation for all of your input. You don't need to include the new data in the training per se -- however, keep the parameters of the normalization (the m and b of y = mx + b) and apply those to the test data as you receive them.
It's certainly not a good idea to predict on a test set using a model trained with a very different data distribution. I would use the same mean and std of your training data to normalize you test set.
I am working on this shared task http://alt.qcri.org/semeval2017/task4/index.php?id=data-and-tools
which is just a twitter sentiment analysis. Since i am pretty new to machine learning, I am not quite sure how to use both training data and testing data.
So the shared task provides two same sets of twitter tweets one without the result (train) and one with the result.
I current understandings of using these kinds of data in machine learning are as follows:
training set: we are supposed to split this into training and testing portions (90% training and 10% testing maybe?)
But the existing of a separate test data kind of confuses.
Are we supposed to use the result that we got in the test using the 10% portion of the 'training set' and compare that to the actual result 'testing set' ?
Can someone correct my understanding?
When training a machine learning model, you are feeding your algorithm with the dataset called training set, which in this stage, you are telling the algorithm what is the ground truth of each sample you put into the algorithm, that way, the algorithm learns from each sample you are feeding to it. the training set is usually 80% of the whole dataset, the other 20% of the dataset is the testing set, which in this case, you know what is the ground truth of each sample, but you let your algorithm predict what it think the truth is to each sample you let it predict. All those prediction over the testing set are based on what the algorithm have learned from the training set you fed it before.
After you make all the predictions over your testing set you can then check how accurate your model is based on the ground truth in compare to the prediction the model have made.
This might sound like an elementary question but I am having a major confusion regarding Training Set and Test.
When we use Supervised learning techniques such as Classification to predict something a common practice is to split the dataset into two parts training and test set. The training set will have a predictor variable, we train the model on the dataset and "predict" things.
Let's take an example. We are going to predict loan defaulters in a bank and we have the German credit data set where we are predicting defaulters and non- defaulters but there is already a definition column which says whether a customer is a defaulter or Non-defaulter.
I understand the logic of prediction on UNSEEN data, like the Titanic survival data but what is the point of prediction where a class is already mentioned, such as German credit lending data.
As you said, the idea is to come up a model that you can predict UNSEEN data. The test data is only used to measure the performance of your model created through training data. You want to make sure the model you comes up does not "overfit" your training data. That's why the testing data is important. Eventually, you will use the model to predict whether a new loaner is going to default or not, thus making a business decision whether to approve the loan application.
The reason why they include the defaulted values is so that you can verify that the model is working as expected and predicting the correct results. Without which there is no way for anyone to be confident that their model is working as expected.
The ultimate purpose of training a model is to apply it to what you call UNSEEN data.
Even in your German credit lending example, at the end of the day you will have a trained model that you could use to predict if new - unseen - credit applications will default or not. And you should be able to use it in the future for any new credit application, as long as you are able to represent the new credit data in the same format you used to train your model.
On the other hand, the test set is just a formalism used to estimate how good the model is. You cannot know for sure how accurate your model it is going to be with future credit applications, but what you can do is to save a small part of your training data, and use it only to check the model's performance after it has been built. That's what you would call the test set (or more precisely, a validation set).
I know you're supposed to separate your training data from your testing data, but when you make predictions with your model is it OK to use the entire data set?
I assume separating your training and testing data is valuable for assessing the accuracy and prediction strength of different models, but once you've chosen a model I can't think of any downsides to using the full data set for predictions.
You can use full data for prediction but better retain indexes of train and test data. Here are pros and cons of it:
Pro:
If you retain index of rows belonging to train and test data then you just need to predict once (and so time saving) to get all results. You can calculate performance indicators (R2/MAE/AUC/F1/precision/recall etc.) for train and test data separately after subsetting actual and predicted value using train and test set indexes.
Cons:
If you calculate performance indicator for entire data set (not clearly differentiating train and test using indexes) then you will have overly optimistic estimates. This happens because (having trained on train data) model gives good results of train data. Which depending of % split of train and test, will gives illusionary good performance indicator values.
Processing large test data at once may create memory bulge which is can result in crash in all-objects-in-memory languages like R.
In general, you're right - when you've finished selecting your model and tuning the parameters, you should use all of your data to actually build the model (exception below).
The reason for dividing data into train and test is that, without out-of-bag samples, high-variance algorithms will do better than low-variance ones, almost by definition. Consequently, it's necessary to split data into train and test parts for questions such as:
deciding whether kernel-SVR is better or worse than linear regression, for your data
tuning the parameters of kernel-SVR
However, once these questions are determined, then, in general, as long as your data is generated by the same process, the better predictions will be, and you should use all of it.
An exception is the case where the data is, say, non-stationary. Suppose you're training for the stock market, and you have data from 10 years ago. It is unclear that the process hasn't changed in the meantime. You might be harming your prediction, by including more data, in this case.
Yes, there are techniques for doing this, e.g. k-fold cross-validation:
One of the main reasons for using cross-validation instead of using the conventional validation (e.g. partitioning the data set into two sets of 70% for training and 30% for test) is that there is not enough data available to partition it into separate training and test sets without losing significant modelling or testing capability. In these cases, a fair way to properly estimate model prediction performance is to use cross-validation as a powerful general technique.
That said, there may not be a good reason for doing so if you have plenty of data, because it means that the model you're using hasn't actually been tested on real data. You're inferring that it probably will perform well, since models trained using the same methods on less data also performed well. That's not always a safe assumption. Machine learning algorithms can be sensitive in ways you wouldn't expect a priori. Unless you're very starved for data, there's really no reason for it.
Before applying SVM on my data I want to reduce its dimension by PCA. Should I separate the Train data and Test data then apply PCA on each of them separately or apply PCA on both sets combined then separate them?
Actually both provided answers are only partially right. The crucial part here is what is the exact problem you are trying to solve. There are two basic possible settings which can be considered, and both are valid under some assumptions.
Case 1
You have some data (which you splitted to train and test) and in the future you will get more data coming from the same distribution.
If this is the case, you should fit PCA on train data, then SVM on its projection, and for testing you just apply already fitted PCA followed by already fitted SVM, and you do exactly the same for new data that will come. This way your test error (under some "size assumptions" should approximate your expected error).
Case 2
You have some data (which you splitted train and test) and in the future you will obtain a big chunk of unlabeled data and you will be able to fit your model then.
In such a case, you fit PCA on whole data provided, learn SVM on labeled part (train set) and evaluate on test set. This way, once new data arrives you can fit PCA using both your data and new ones, and then - train SVM on your old data (as this is the only one having labels). Under the assumption that again - data comes from the same distributions, everything is correct here. You use more data to fit PCA only to have a better estimator (maybe your data is really high dimensional and PCA fails with small sample?).
You should do them separately. If you run pca on both sets combined then you are going to introduce a bias in your svn. The goal of the test set is to see how your algorithm will perform without prior knowledge of the data.
Learn the Projection Matrix of PCA on the train set and use this to reduce the dimensions of the test data.
One benifit is this way you don't have to rely on collecting sufficient data in the test set if you are applying your classifier for actual run time where test data comes one sample at a time.
Also I think separate train and test PCA will fail.Why?
Think of PCA as giving you features, and then you learn a classifier over these features. If over time your data shifts, then the test features you get using PCA would be different, and you don't have a classifier trained on these features. Even if the set of directions/features of the PCA remain same but their order varies your classifier still fails.