I have trained a neural network and an XGBoost model for the same problem, and now I am unsure how I should stack them. Should I pass the output of the neural network as an input feature to the XGBoost model, or should I combine their results with a weighted average? Which would be better?
This question cannot be answered in general. I would suggest trying both possibilities and choosing the one that works best.
Using the output of one model as input to the other model
I assume you know how to use the output of the NN as an input to XGBoost. Just take some time to think about how you handle the train and test data (see below). Use the predicted "probabilities" rather than the binary labels. Of course, you could also try it the other way round, so that the NN gets the output of the XGBoost model as an additional input.
Using a VotingClassifier
The other possibility is soft voting with sklearn.ensemble.VotingClassifier, i.e. VotingClassifier(estimators=..., voting='soft'). You could also play around with the weights parameter here.
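To make clear what soft voting computes, here is a minimal sketch in plain Python. It does not call scikit-learn; soft_vote is a hypothetical helper and the probabilities are made up. It only illustrates the idea that the predicted class is the one with the largest weighted average of the models' predicted probabilities.

```python
def soft_vote(prob_lists, weights=None):
    """Weighted average of per-model class-probability vectors for one sample."""
    if weights is None:
        weights = [1.0] * len(prob_lists)  # equal weights by default
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]


# Made-up probabilities for one sample, classes [0, 1].
nn_probs = [0.30, 0.70]
xgb_probs = [0.60, 0.40]

equal = soft_vote([nn_probs, xgb_probs])                     # both models count the same
weighted = soft_vote([nn_probs, xgb_probs], weights=[3, 1])  # trust the NN three times more

# The ensemble predicts the class with the largest averaged probability.
predicted_class = max(range(len(equal)), key=equal.__getitem__)
```

Note that the weights are fixed for all samples, which is exactly the limitation discussed in the next section.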
Difference
The big difference is that with the first possibility the XGBoost model might learn in which areas the NN is weak and in which it is strong, while with the VotingClassifier the outputs of both models are combined with the same weights for all samples. Soft voting therefore relies on the assumption that a model outputs a "probability" that is not close to 0 or 1 when it is not confident about its prediction for a specific input record. This assumption might not always hold.
Handling of the train/test data
In both cases, you need to think about how to handle the train/test data. The data should ideally be split the same way for both models. Otherwise you might introduce a data-leakage problem.
For the VotingClassifier this is no problem, because it can be used like a regular sklearn model class. For the first method (the output of model 1 is one feature of model 2), you should make sure that you do the train-test split (or the cross-validation) with exactly the same records. If you don't, you run the risk of validating your second model on a record that was in the training set of model 1 (apart from the additional feature, of course). This clearly could cause data leakage, resulting in a score that looks better than how the model would actually perform on unseen production data.
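As a rough sketch of "exactly the same records", the split can be computed once as index lists and reused for both models. Everything here is an assumption for illustration: shared_split and with_model1_feature are hypothetical helpers, the data is made up, and model 1 is a trivial stand-in rather than a real neural network.

```python
import random


def shared_split(n_records, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx); the same seed always gives the same split."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_records * test_fraction))
    return sorted(idx[n_test:]), sorted(idx[:n_test])


# Toy data: 12 records with two features each (made up).
X = [[float(i), float(i % 3)] for i in range(12)]
train_idx, test_idx = shared_split(len(X), test_fraction=0.25, seed=42)

# Hypothetical stand-in for model 1: it is "fitted" on the training rows only
# (here it just memorises the training mean of feature 0).
train_mean = sum(X[i][0] for i in train_idx) / len(train_idx)


def model1_output(row):
    return 1.0 if row[0] > train_mean else 0.0


def with_model1_feature(indices):
    """Rows for model 2: the original features plus model 1's output."""
    return [X[i] + [model1_output(X[i])] for i in indices]


X2_train = with_model1_feature(train_idx)  # model 2 is fitted on these rows
X2_test = with_model1_feature(test_idx)    # and scored on rows neither model saw
```

A common refinement, which goes beyond what is described above, is to build the extra feature for the training rows from out-of-fold predictions of model 1, so that model 2 does not learn from predictions model 1 made on its own training data.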
Is it common practice to augment data (add samples programmatically, such as random crops in the case of an image dataset) on both the training and test sets, or just on the training set?
Only on training. Data augmentation is used to increase the size of the training set and to get more varied images.
Technically, you could use data augmentation on the test set to see how the model behaves on such images, but usually, people don't do it.
Data augmentation is done only on the training set, as it helps the model generalize better and become more robust. So there is no point in augmenting the test set.
This answer on stats.SE makes the case for applying crops to the validation / test sets so as to make that input similar to the input that the network was trained on.
Do it only on the training set. And, of course, make sure that the augmentation does not make the label wrong (e.g. when rotating 6 and 9 by about 180°).
The reason why we use a training and a test set in the first place is that we want to estimate the error our system will have in reality. So the data for the test set should be as close to real data as possible.
If you do it on the test set, you might have the problem that you introduce errors. For example, say you want to recognize digits and you augment by rotating. Then a 6 might look like a 9. But not all examples are that easy. Better safe than sorry.
I would argue that, in some cases, using data augmentation for the validation set can be helpful.
For example, I train a lot of CNNs for medical image segmentation. Many of the augmentation transforms that I use are meant to reduce the image quality so that the network is trained to be robust against such data. If the training set looks bad and the validation set looks nice, it will be hard to compare the losses during training and therefore assessing overfit will be complicated.
I would never use augmentation for the test set unless I'm using test-time augmentation to improve results or estimate aleatoric uncertainty.
In computer vision, you can use data augmentation during test time to obtain different views on the test image. You then have to aggregate the results obtained from each image for example by averaging them.
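A minimal sketch of this test-time augmentation, assuming images are nested lists and the model is any function that returns class probabilities. The toy_model and the flip helpers are made up for illustration and do not come from any library.

```python
def hflip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def vflip(image):
    """Mirror the image top to bottom."""
    return image[::-1]


def predict_with_tta(model, image, transforms):
    """Average the model's class probabilities over the original and augmented views."""
    views = [image] + [t(image) for t in transforms]
    preds = [model(v) for v in views]
    n_classes = len(preds[0])
    return [sum(p[c] for p in preds) / len(preds) for c in range(n_classes)]


def toy_model(image):
    """Hypothetical classifier: P(class 1) is the share of intensity in the left half."""
    half = len(image[0]) // 2
    left = sum(sum(row[:half]) for row in image)
    total = sum(sum(row) for row in image)
    p1 = left / total
    return [1.0 - p1, p1]


image = [[1, 0],
         [1, 0]]
single = toy_model(image)                                  # prediction from one view
averaged = predict_with_tta(toy_model, image, [hflip, vflip])  # aggregated over three views
```

Averaging is only one way to aggregate; the spread of the per-view predictions can also be inspected as a rough indication of how stable the prediction is.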
For example, for some symbols, changing the point of view can lead to different interpretations.
Some image preprocessing software tools like Roboflow (https://roboflow.com/) apply data augmentation to test data as well. I'd say that if one is dealing with small and rare objects, say, cerebral microbleeds (which are tiny and difficult to spot on magnetic resonance images), augmenting one's test set could be useful. Then you can verify that your model has learned to detect these objects given different orientation and brightness conditions (given that your training data has been augmented in the same way).
The goal of data augmentation is to help the model generalize and learn more orientations of the images, so that during testing the model is able to handle the test data well. So it is common practice to use augmentation only for the training set.
The point of validation data is to help build a generalized model, that is, one that predicts real-world data well. In order to predict real-world data, the validation set should contain real data. There is no problem with augmenting validation data, but it won't increase the accuracy of the model.
Here are my two cents:
You train your model on the training data and the validation data: the former to optimize your parameters, and the latter to give you an appropriate stopping condition. The test data is to give you a real-world estimate of how well you can expect your model to perform.
For training, you can augment your training data to increase robustness to various factors including, but not limited to, sampling error, bias between data sources, shifts in global data distribution, positioning, and any other sort of variation you would like to account for.
The validation data should indicate to the training method when the model is most generalizable. By this logic, if you expect to see some variation in real-world data that can be simulated using data augmentation, then by all means, the validation dataset should be augmented.
The test data, on the other hand, should not be augmented, except potentially in special scenarios where data is very limited, and an estimate of real-world performance on test data has too much variance.
You can use augmentation data in training, validation and test sets.
The only thing to avoid is using the same data from the training set in validation or test sets.
For example, if you generate 3 augmented instances from a record of the training data, make sure that none of these 3 augmented instances accidentally ends up in the validation or test sets.
Using data from the training set, even augmented data, to validate or test a model is a methodological mistake.
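One way to guarantee this is to split by the original record rather than by individual sample, so that a record and all of its augmented copies always land on the same side. The sketch below assumes each sample carries the id of the record it was derived from; split_by_source and the sample names are hypothetical.

```python
import random


def split_by_source(samples, test_fraction=0.2, seed=0):
    """samples: list of (source_id, data). All samples of one source stay together."""
    source_ids = sorted({source_id for source_id, _ in samples})
    random.Random(seed).shuffle(source_ids)
    n_test = max(1, int(round(len(source_ids) * test_fraction)))
    test_sources = set(source_ids[:n_test])
    train = [s for s in samples if s[0] not in test_sources]
    test = [s for s in samples if s[0] in test_sources]
    return train, test


# Five original records, each with three augmented copies (made-up names).
samples = []
for source_id in range(5):
    samples.append((source_id, f"original-{source_id}"))
    for k in range(3):
        samples.append((source_id, f"augmented-{source_id}-{k}"))

train, test = split_by_source(samples, test_fraction=0.2, seed=1)
train_sources = {source_id for source_id, _ in train}
test_sources = {source_id for source_id, _ in test}
```

An equivalent and simpler approach is to split the original records first and only then run the augmentation on each part separately.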
In TensorFlow, I plan to build a model and compare it to other baseline models with respect to different subsets of the training data, i.e. I would like to train my model and the baseline models on the same subsets of training data.
With the naive way queue runners and TFreaders are implemented (e.g. in im2txt), this requires duplicating the data for each selection of subsets, which in my case would require very large amounts of disk space.
It would be best if there were a way to tell the queue to fetch only samples from a specified subset of ids, or to ignore samples that are not part of a given subset of ids.
If I understand correctly, ignoring samples is not trivial, because it requires stitching samples from different reads into a single batch.
Does anybody know a way to do that? Or can anyone suggest an alternative approach that does not require pre-loading all the training data into RAM?
Thanks!
You could encode your condition as part of the keep_input parameter of tf.train.maybe_batch.
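Independently of the TensorFlow API, the underlying idea (skip samples whose id is not in the subset, and stitch the surviving samples from different reads into full batches) can be sketched in plain Python. This is only an illustration of the logic, not TensorFlow code: read_records is a hypothetical streaming reader and batches_from_subset is a made-up helper. Only the current batch is held in memory.

```python
def batches_from_subset(records, keep_ids, batch_size):
    """Stream (id, sample) pairs, drop ids outside keep_ids, yield full batches."""
    keep_ids = set(keep_ids)
    batch = []
    for sample_id, sample in records:
        if sample_id not in keep_ids:
            continue  # ignored samples never enter a batch
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # last, possibly smaller, batch


def read_records(n):
    """Hypothetical reader that yields one (id, sample) pair at a time."""
    for i in range(n):
        yield i, f"sample-{i}"


even_ids = range(0, 10, 2)
batches = list(batches_from_subset(read_records(10), even_ids, batch_size=2))
```

The same data file can then be read with different id subsets for each model, without duplicating it on disk.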
I know you're supposed to separate your training data from your testing data, but when you make predictions with your model is it OK to use the entire data set?
I assume separating your training and testing data is valuable for assessing the accuracy and prediction strength of different models, but once you've chosen a model I can't think of any downsides to using the full data set for predictions.
You can use the full data for prediction, but it is better to retain the indexes of the train and test data. Here are the pros and cons:
Pro:
If you retain the indexes of the rows belonging to the train and test data, then you need to predict only once (which saves time) to get all results. You can calculate performance indicators (R2/MAE/AUC/F1/precision/recall etc.) for the train and test data separately by subsetting the actual and predicted values using the train and test indexes.
Cons:
If you calculate performance indicators for the entire data set (without clearly differentiating train and test using indexes), then you will get overly optimistic estimates. This happens because the model, having been trained on the train data, gives good results on the train data. Depending on the train/test split percentage, this gives deceptively good performance indicator values.
Processing large test data at once may cause a memory spike, which can result in a crash in all-objects-in-memory languages like R.
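A small sketch of the "predict once, evaluate separately" idea, using MAE as the indicator. The values and index lists are made up; in practice y_pred would come from a single prediction call on the full data set.

```python
def mae(y_true, y_pred, idx):
    """Mean absolute error restricted to the rows listed in idx."""
    return sum(abs(y_true[i] - y_pred[i]) for i in idx) / len(idx)


# Made-up targets and predictions for the full data set (predicted once).
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_pred = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]

# Indexes retained from the original split.
train_idx = [0, 1, 2, 3]
test_idx = [4, 5]

train_mae = mae(y_true, y_pred, train_idx)                # error on rows the model has seen
test_mae = mae(y_true, y_pred, test_idx)                  # the number to report
overall_mae = mae(y_true, y_pred, train_idx + test_idx)   # mixes both, too optimistic
```

In this toy example the overall error is three times smaller than the test error, purely because the training rows dominate the average.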
In general, you're right - when you've finished selecting your model and tuning the parameters, you should use all of your data to actually build the model (exception below).
The reason for dividing data into train and test is that, without out-of-bag samples, high-variance algorithms will do better than low-variance ones, almost by definition. Consequently, it's necessary to split data into train and test parts for questions such as:
deciding whether kernel-SVR is better or worse than linear regression, for your data
tuning the parameters of kernel-SVR
However, once these questions are settled, then, in general, as long as your data is generated by the same process, the more data you use, the better your predictions will be, and you should use all of it.
An exception is the case where the data is, say, non-stationary. Suppose you're training for the stock market, and you have data from 10 years ago. It is unclear that the process hasn't changed in the meantime. You might be harming your prediction, by including more data, in this case.
Yes, there are techniques for doing this, e.g. k-fold cross-validation:
One of the main reasons for using cross-validation instead of using the conventional validation (e.g. partitioning the data set into two sets of 70% for training and 30% for test) is that there is not enough data available to partition it into separate training and test sets without losing significant modelling or testing capability. In these cases, a fair way to properly estimate model prediction performance is to use cross-validation as a powerful general technique.
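For reference, the index bookkeeping behind k-fold cross-validation can be sketched as follows. k_fold_indices is a hypothetical helper; it splits the rows in their given order without shuffling, which is an assumption that only makes sense if the rows are not sorted in some meaningful way.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_idx, test_idx) pairs; each row is in exactly one test fold."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = k_fold_indices(10, 3)
for train_idx, test_idx in folds:
    # Fit a fresh model on train_idx and score it on test_idx here;
    # the k scores are then averaged to estimate performance.
    pass
```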
That said, there may not be a good reason for doing so if you have plenty of data, because it means that the final model you're using hasn't itself been tested on held-out data. You're inferring that it probably will perform well, since models trained using the same methods on less data also performed well. That's not always a safe assumption. Machine learning algorithms can be sensitive in ways you wouldn't expect a priori. Unless you're very starved for data, there's really no reason for it.
The "Weka: training and test set are not compatible" error can be solved using batch filtering, but at the time of training a model I don't have test.arff. My problem is caused by the StringToWordVector filter (on the CLI).
So my question is: can the caret package (R) or scikit-learn (Python) provide an alternative for this?
Note:
1. The functionality provided by StringToWordVector is a must-have requirement.
2. I don't want to retrain my model while testing, because it takes a lot of time.
Given the requirements you mentioned, you can use Weka's FilteredClassifier during training and testing. I will not repeat what I have recorded as a video cast here and here.
But the basic idea is not to use StringToWordVector as a direct filter, but rather to use it as the filter inside the FilteredClassifier. The model is generated just once. You can then apply the model directly to your unlabelled data without retraining it and without applying StringToWordVector again to the unlabelled data. FilteredClassifier will take care of these concerns for you.
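The principle behind this, namely that the word vocabulary is fixed once at training time and then reused unchanged on new text, can be illustrated with a small plain-Python sketch. This is not Weka code and not a replacement for StringToWordVector; BagOfWords is a hypothetical class written only to show why train and test features stay compatible. In scikit-learn, a Pipeline that combines CountVectorizer with a classifier plays a similar role.

```python
class BagOfWords:
    """Toy word-count vectorizer whose vocabulary is learned from training text only."""

    def fit(self, texts):
        words = sorted({w for t in texts for w in t.lower().split()})
        self.vocab = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, texts):
        rows = []
        for t in texts:
            row = [0] * len(self.vocab)
            for w in t.lower().split():
                if w in self.vocab:  # words unseen during training are ignored
                    row[self.vocab[w]] += 1
            rows.append(row)
        return rows


train_texts = ["good movie", "bad movie"]
new_texts = ["good good film"]  # arrives later, after the model was trained

bow = BagOfWords().fit(train_texts)  # vocabulary fixed here, once
X_train = bow.transform(train_texts)
X_new = bow.transform(new_texts)     # same columns as X_train, no refitting
```

Because the new text is mapped onto the training vocabulary, the feature columns match what the trained model expects, and nothing has to be retrained.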