Why do I get different predictions from the same neural-network model? - machine-learning

As the title says, I am stuck on a problem: my neural network produces different prediction values each time.
Here are the steps I took to get predictions from the neural network.
First, I normalized x and built a neural network model using 'nnet'.
Then I made predictions using the predict command: predict(nnet model, test data)
The problem is that I get different predictions every time I run the neural network.
For example,
mymodel<-nnet(~~~~)
predict(mymodel, test data)
I got the value A from the prediction.
Then I ran the same commands, 'mymodel<-nnet(~~~~)' and 'predict(mymodel, test data)', again. Naturally I expected to get A again, but this time I got B.
When I ran it again, I got C.
Why do I get different predictions from the same neural network model?
What should I do to solve this problem?

The reason is that you are re-training your model before making each prediction. By default, training a neural network starts by initializing the weights to random values, so each training run produces a somewhat different model.
To avoid that, either reuse the same model instance (execute mymodel<-nnet(~~~~) once at the beginning and only call predict() afterwards), or set a seed for random number generation, so that the same set of random initial values is drawn every time.
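To illustrate the seeding idea, here is a minimal Python sketch. The init_weights helper is hypothetical and only stands in for a network's random initialization; it is not how nnet is implemented. In R, calling set.seed() before nnet() should play the same role.

```python
import random

def init_weights(n_weights, seed=None):
    """Draw initial weights uniformly from [-0.5, 0.5].

    Hypothetical stand-in for the random initialization a
    neural-network library performs before training.
    """
    rng = random.Random(seed)
    return [rng.uniform(-0.5, 0.5) for _ in range(n_weights)]

a = init_weights(5, seed=42)
b = init_weights(5, seed=42)
c = init_weights(5, seed=7)

print(a == b)  # same seed: identical starting weights
print(a == c)  # different seed: different starting weights
```

With identical starting weights and identical data, a deterministic training procedure ends at the same model, which is why fixing the seed makes the predictions repeatable.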

You get different results even though the data and the model definition are the same because every time you re-run nnet() you are asking the program to re-initialize everything the model needs (i.e. the weights) and train again from scratch. If you want the same results each time, you need to keep the same weight values rather than re-initializing them, for example by saving them to a file or a database for later use.
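As a rough sketch of the "store the weights" idea, the Python snippet below writes a list of weights to a JSON file and reads it back. The weight values and the predict helper are made up for illustration. For a fitted R object, saveRDS() and readRDS() would likely be the more natural tools.

```python
import json
import os
import tempfile

# Hypothetical trained weights of a tiny linear model.
weights = [0.12, -0.4, 0.33]

# Persist the weights once, right after training.
path = os.path.join(tempfile.mkdtemp(), "weights.json")
with open(path, "w") as f:
    json.dump(weights, f)

# Later (or in another session): load them instead of re-training.
with open(path) as f:
    restored = json.load(f)

def predict(w, x):
    """Weighted sum of the inputs, standing in for a real forward pass."""
    return sum(wi * xi for wi, xi in zip(w, x))

print(predict(weights, [1.0, 2.0, 3.0]) == predict(restored, [1.0, 2.0, 3.0]))
```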

Related

Why do we use the validation set (not the train or test set) in an early stopping function (DL / CNN)?

This is my first attempt at a CNN in PyTorch. I have gone through a few tutorials, but I still need some clarification.
I have a theoretical question: I don't understand why the early stopping function is based on the validation set rather than the train or test set.
Does it have something to do with the metrics we get from the validation set?
The number of training epochs is one of the training hyper-parameters. Therefore, you MUST NOT use the test data to determine the value of this hyper-parameter.
Additionally, you cannot use the training set itself to decide when to stop, because the training error usually keeps decreasing even after the model has started to overfit. Therefore, you need to use the validation set to determine this value.
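A minimal sketch of that logic in plain Python follows. The function name, the patience parameter and the loss values are assumptions for illustration, not part of any particular library.

```python
def early_stopping_epoch(losses, patience=2):
    """Return the epoch with the lowest loss, scanning until the loss
    has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Made-up curves: training loss keeps falling, validation loss turns upward.
train_losses = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
val_losses = [1.1, 0.9, 0.7, 0.75, 0.8, 0.9]

print(early_stopping_epoch(val_losses))    # stops around the validation minimum
print(early_stopping_epoch(train_losses))  # never triggers: training loss only improves
```

Applied to the training curve, the rule never fires, which is the practical reason the training set cannot be used to choose the stopping point.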

How to stack neural network and xgboost model?

I have trained a neural network and an XGBoost model for the same problem, and now I am confused about how I should stack them. Should I just pass the output of the neural network as an input to the XGBoost model, or should I weight their results separately? Which would be better?
This question cannot be answered in general. I would suggest trying both possibilities and choosing the one that works best.
Using the output of one model as input to the other model
I assume you know how to use the output of the NN as input to XGBoost. You should just take some time to think about how you handle the test and train data (see below). Use the "probabilities" rather than the binary labels for that. Of course, you could also try it the other way around, so that the NN gets the output of the XGBoost model as an additional input.
Using a VotingClassifier
The other possibility is to combine the models with soft voting. You can use VotingClassifier(voting='soft') for that (to be precise, sklearn.ensemble.VotingClassifier). You could also play around with the weights here.
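To show what soft voting computes, here is a small plain-Python sketch of the arithmetic only. The soft_vote helper and the probability values are made up, and this is not the scikit-learn implementation, which additionally expects estimators that follow its fit/predict_proba interface.

```python
def soft_vote(prob_lists, weights=None):
    """Weighted average of per-model positive-class probabilities.

    prob_lists: one list of probabilities per model, aligned by sample.
    weights:    optional per-model weights (defaults to equal weighting).
    """
    if weights is None:
        weights = [1.0] * len(prob_lists)
    total = sum(weights)
    n_samples = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_samples)
    ]

# Hypothetical predicted probabilities for three samples.
nn_probs = [0.9, 0.3, 0.2]
xgb_probs = [0.7, 0.6, 0.1]

avg = soft_vote([nn_probs, xgb_probs])
labels = [int(p >= 0.5) for p in avg]
print(avg, labels)

# Trusting the NN three times as much as XGBoost:
print(soft_vote([nn_probs, xgb_probs], weights=[3, 1]))
```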
Difference
The big difference is that with the first possibility the XGBoost model might learn in which areas the NN is weak and in which it is strong, while with the VotingClassifier the outputs of both models are weighted the same way for all samples. Soft voting also relies on the assumption that a model outputs a "probability" that is not too close to 0 or 1 when it is not confident about the prediction for a specific input record. This assumption might not always hold.
Handling of the Train/Test data
In both cases, you need to think about how you handle the train/test data. The train/test data should ideally be split the same way for both models. Otherwise you might introduce a data-leakage problem.
For the VotingClassifier this is not a problem, because it can be used like a regular sklearn model class. For the first method (the output of model 1 is one feature of model 2), you should make sure that you do the train-test split (or the cross-validation) with exactly the same records. If you don't, you run the risk of validating the output of your second model on a record which was in the training set of model 1 (except for the additional feature, of course). This clearly could cause a data-leakage problem, resulting in a score that appears better than how the model would actually perform on unseen production data.
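One common way to avoid this kind of leakage, offered here as an assumption rather than something the answer spells out, is to build the extra feature from out-of-fold predictions: each record's model-1 output comes from a copy of model 1 that never saw that record. The fit and predict callables below are hypothetical placeholders for a real model.

```python
def out_of_fold_predictions(X, y, fit, predict, k=5):
    """For every record, predict with a model trained on the other folds."""
    n = len(X)
    oof = [None] * n
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            oof[i] = predict(model, X[i])
    return oof

# Toy stand-in for model 1: it just predicts the mean training target.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, x):
    return model

X = list(range(10))
y = [float(i) for i in range(10)]
oof = out_of_fold_predictions(X, y, fit_mean, predict_mean, k=5)
print(oof)  # one leakage-free model-1 output per record
```

The resulting list can then be appended as an additional feature column for the second model, using the same fold assignment when that model is validated.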

How to look at the parameters of a pytorch model?

I have a simple PyTorch neural net that I copied from OpenAI, and I modified it to some extent (mostly the input).
When I run my code, the output of the network remains the same on every episode, as if no training occurs.
I want to see if any training happens, or if some other reason causes the results to be the same.
How can I check whether the weights are changing at all?
Thanks
It depends on what you are doing, but the easiest way is to check the weights of your model.
You can do this (and compare with the ones from the previous iteration) using the following code:
for parameter in model.parameters():
    print(parameter.data)
If the weights are changing, the neural network is being optimized (which doesn't necessarily mean it learns anything useful in particular).
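Rather than comparing printed tensors by eye, the comparison can be automated. The sketch below assumes PyTorch is installed and uses a tiny Linear layer and random data as stand-ins for the real network and training batch.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)  # stand-in for your own network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Snapshot the parameters before the update (clone, so they are real copies).
before = [p.detach().clone() for p in model.parameters()]

# One training step on random stand-in data.
x = torch.randn(8, 3)
y = torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# True for every parameter tensor that differs from its snapshot.
changed = [
    not torch.equal(old, new.detach())
    for old, new in zip(before, model.parameters())
]
print(changed)
```

If every entry is False in your own training loop, typical things to check are whether optimizer.step() is actually called and whether the optimizer was built from this model's parameters.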

How do neural networks learn functions instead of memorize them?

For a class project, I designed a neural network to approximate sin(x), but ended up with a NN that just memorized my function over the data points I gave it. My NN took in x-values with a batch size of 200. Each x-value was multiplied by 200 different weights, mapping to 200 different neurons in my first layer. My first hidden layer contained 200 neurons, each one a linear combination of the x-values in the batch. My second hidden layer also contained 200 neurons, and my loss function was computed between the 200 neurons in my second layer and the 200 values of sin(x) that the input mapped to.
The problem is, my NN perfectly "approximated" sin(x) with 0 loss, but I know it wouldn't generalize to other data points.
What did I do wrong in designing this neural network, and how can I avoid memorization and instead design my NN's to "learn" about the patterns in my data?
It is the same with any machine learning algorithm. You have a dataset from which you try to learn "the" function f(x) that actually generated the data. For real-life datasets, it is impossible to recover the original function from the data, and therefore we approximate it using some function g(x).
The main goal of any machine learning algorithm is to predict unseen data as well as possible using the function g(x).
Given a dataset D, you can always train a model which will perfectly classify all the datapoints (you can use a hashmap to get 0 error on the train set), but that is overfitting, or memorization.
To avoid such things, you yourself have to make sure that the model does not memorise and instead learns the function. There are a few things which can be done. I am trying to write them down in an informal way (with links).
Train, Validation, Test
If you have a large enough dataset, use Train, Validation, Test splits. Split the dataset into three parts, typically 60%, 20% and 20% for Training, Validation and Test, respectively. (These numbers can vary based on need; in the case of imbalanced data, also check how to get stratified partitions which preserve the class ratios in every split.) Next, forget about the Test partition: keep it somewhere safe and don't touch it. Your model will be trained using the Training partition. Once you have trained the model, evaluate its performance using the Validation set. Then select another hyper-parameter configuration for your model (e.g. number of hidden layers, learning algorithm, other parameters etc.), train the model again, and evaluate it on the Validation set. Keep doing this for several such models. Then select the model that got the best validation score.
The role of the validation set here is to check what the model has learned. If the model has overfit, then the validation scores will be very bad, and therefore in the above process you will discard those overfit models. But keep in mind that although you did not use the Validation set to train the model directly, it was used indirectly to select the model.
Once you have selected a final model based on the Validation set, take out your Test set, as if you had just received a new dataset from real life which no one has ever seen. The prediction of the model on this Test set will be an indication of how well your model has "learned", as it is now trying to predict datapoints which it has never seen (directly or indirectly).
It is key to not go back and tune your model based on the Test score, because once you do this, the Test set starts contributing to your model.
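The 60/20/20 split described above can be sketched in a few lines of plain Python. The function name and the fixed seed are assumptions for illustration, and the sketch does not do stratification.

```python
import random

def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into test, validation and training parts."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The test list is set aside immediately and only used once, after model selection on the validation list is finished.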
Crossvalidation and bootstrap sampling
On the other hand, if your dataset is small, you can use bootstrap sampling or k-fold cross-validation. These ideas are similar. For example, for k-fold cross-validation with k=5, you split the dataset into 5 parts (again, be careful about stratified sampling). Let's name the parts a,b,c,d,e. Use the partitions [a,b,c,d] to train and get the prediction scores on [e] only. Next, use the partitions [a,b,c,e] to train and get the prediction scores on [d] only, and continue 5 times, where each time you hold one partition out and train the model with the other 4. After this, take an average of these scores. This is indicative of how your model might perform if it sees new data. It is also good practice to do this multiple times and average the results. For example, for smaller datasets, perform a 10 times 10-fold cross-validation, which will give a pretty stable score (depending on the dataset) that is indicative of the prediction performance.
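The k-fold procedure can be sketched as follows in plain Python. The score_fn callable is a hypothetical placeholder that would train on the given training indices and return a score on the held-out indices; stratification is not handled here.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is held out exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

def cross_val_score(n, score_fn, k=5, seed=0):
    """Average the per-fold scores returned by score_fn(train_idx, test_idx)."""
    scores = [score_fn(tr, te) for tr, te in k_fold_indices(n, k, seed)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(10, k=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), "held out;", len(train_idx), "used for training")
```

Repeating cross_val_score with several different seeds and averaging the results corresponds to the repeated cross-validation mentioned above.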
Bootstrap sampling is similar, but you sample the same number of datapoints (depends) with replacement from the dataset and use this sample to train. This set will have some datapoints repeated (as it was sampled with replacement). Then use the datapoints that are missing from the training sample to evaluate the model. Perform this multiple times and average the performance.
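A minimal sketch of one bootstrap round in plain Python is shown below. The function name is an assumption, and the points left out of the sample are what is often called the out-of-bag set.

```python
import random

def bootstrap_split(n, seed=0):
    """Sample n indices with replacement for training; the indices that
    were never drawn form the evaluation (out-of-bag) set."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(n) for _ in range(n)]
    drawn = set(train_idx)
    oob_idx = [i for i in range(n) if i not in drawn]
    return train_idx, oob_idx

train_idx, oob_idx = bootstrap_split(100, seed=0)
print(len(train_idx), "training draws,", len(set(train_idx)), "distinct")
print(len(oob_idx), "out-of-bag points for evaluation")
```

Calling bootstrap_split with different seeds gives the repeated rounds whose scores are then averaged.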
Others
Other ways are to incorporate regularisation techniques in the classifier cost function itself. For example, in Support Vector Machines the cost function enforces conditions such that the decision boundary maintains a "margin", or gap, between two class regions. In neural networks one can do similar things (although not in the same way as in an SVM).
In a neural network you can use early stopping to stop the training. It trains on the Train dataset, but at each epoch it evaluates the performance on the Validation dataset. If the model starts to overfit from a specific epoch, then the error on the Training dataset will keep decreasing, but the error on the Validation dataset will start increasing, indicating that your model is overfitting. Based on this, one can stop training.
A large dataset from the real world tends not to overfit too much (citation needed). Also, if you have too many parameters in your model (too many hidden units and layers), and if the model is unnecessarily complex, it will tend to overfit. A model with fewer parameters is much less likely to overfit (though it can underfit if it has too few parameters).
In the case of your sin function task, the neural net has to overfit, as it is ... the sin function. These tests can really help you debug and experiment with your code.
Another important note: if you try to do a Train, Validation, Test split or k-fold cross-validation on the data generated by the sin function, then splitting it in the "usual" way will not work, as in this case we are dealing with a time series, and for those cases one needs splitting techniques designed for time series.
First of all, I think it's a great project to approximate sin(x). It would be great if you could share the snippet or some additional details so that we could pinpoint the exact problem.
However, I think that the problem is that you are overfitting the data, hence you are not able to generalize well to other data points.
A few tricks that might work:
Get more training points
Go for regularization
Add a test set so that you know whether you are overfitting or not.
Keep in mind that 0 loss or 100% accuracy on the training set is usually not a good sign.

Validation Set in Backpropagation in a Neural Network

I have a neural network model, and so far I am running the training set forward, calculating the errors, and adjusting the weights.
As I understand it, after I do this for each training set example I need to run an example from the validation set forward and calculate the errors. When the validation set error stops decreasing, but the training set error is still decreasing it is time to stop because over-fitting is starting to occur. After we stop, we use the testing set to calculate how much error is in our network.
Please correct me if there are any mistakes so far.
My question is what error are we comparing? Are we just comparing the error of the output layer? Or are we comparing the errors from every node? If so, how exactly do we define the overall error of the network, just sum up all the errors?
My question is what error are we comparing?
We are comparing the error only on the output layer. So, if you plot an error vs. epoch graph, you will have two curves there. The line for training error goes down as you run more epochs. But the line for validation error goes down only up to a certain point before starting to go up. This indicates overfitting, and you want to find the point where the validation error was lowest.
Note that you are talking about individual samples while I am talking about epochs. For batch methods these errors are usually plotted after one iteration over the data set (training or validation). So each point on the plot is the mean error or mean squared error from that epoch.
Also, if we have more than 1 output, are we just taking the sum of the errors in the output layer, or should it be some kind of weighted sum?
The multiple output case is interesting. Basically we are trying to find the early stopping point at which to stop training the weights. On the very last layer of a multiple output network, the weights are trained using different error derivatives and can possibly have different optimal early stopping points. You may want to plot them separately if you think that is the case. Otherwise, a simple sum of the errors is sufficient. A weighted sum would mean that you prefer to optimize for one output over another, even when that causes the other one(s) to over/under train.
If you are thinking about implementing separate early stopping points, you can use the sum of MSEs to get the stopping point for all internal weights that depend on all error derivatives. For the weights on the last layer, use their corresponding MSEs to get their separate stopping points.
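As a small worked example of these quantities, the sketch below computes one MSE per output unit and their plain sum for a two-output network. The targets and outputs are made-up numbers.

```python
def per_output_mse(targets, outputs):
    """Mean squared error of each output unit over all samples."""
    n_samples = len(targets)
    n_outputs = len(targets[0])
    return [
        sum((targets[s][j] - outputs[s][j]) ** 2 for s in range(n_samples)) / n_samples
        for j in range(n_outputs)
    ]

# Two samples, two output units (hypothetical values).
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.5, 0.0], [0.0, 0.0]]

mses = per_output_mse(targets, outputs)
total = sum(mses)
print(mses)   # one error curve point per output unit
print(total)  # single combined error for this epoch
```

Tracking mses per epoch gives the separate curves mentioned above, while total is the single number to track when one shared stopping point is enough.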
Let's say I have 60% training, 20% validation, and 20% test set. For each epoch, I run through the 60 training set samples while adjusting the weights on each sample and also calculating the error on each validation sample.
Another way to do the weight update is to calculate the updates for each sample and then apply the average of all updates at the end of the epoch. If your training data has noise, outliers or misclassified samples, this is a good choice. For example, a couple of outliers will not be able to massively distort the weights, since their 'bad' updates get averaged out with the other 'good' updates.
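The difference between per-sample (online) updates and one averaged update per epoch can be seen on a toy one-weight model. Everything below (the model w*x, the learning rate, the samples) is a made-up illustration.

```python
def gradient(w, x, t):
    """Derivative of 0.5 * (w * x - t)**2 with respect to w."""
    return (w * x - t) * x

def online_epoch(w, samples, lr):
    """Update the weight immediately after every sample."""
    for x, t in samples:
        w -= lr * gradient(w, x, t)
    return w

def averaged_epoch(w, samples, lr):
    """Compute all per-sample gradients first, then apply their average once."""
    avg = sum(gradient(w, x, t) for x, t in samples) / len(samples)
    return w - lr * avg

# Two consistent samples (target 2.0) and one outlier (target 10.0).
samples = [(1.0, 2.0), (1.0, 2.0), (1.0, 10.0)]

w_online = online_epoch(0.0, samples, lr=0.5)
w_averaged = averaged_epoch(0.0, samples, lr=0.5)
print(w_online, w_averaged)
```

In this toy run the outlier arrives last and drags the online weight far past 2.0, while the averaged update moves only moderately. How large the effect is in practice depends on the learning rate and the order of the samples.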
Since there are only 1/3 as many validation samples as training samples, do I run through the validation 3 times for each epoch?
Why do we iterate over the validation set? Do we calculate the error on the validation set to get weight updates? No. We do all our updating using only the training set. Validation is only there to see how our trained model generalizes outside of the training data. Think of it as a test before the test you run with the test set. Now, does it make sense to run over the validation set 3 times in each epoch? No, it doesn't.
I use the last calculated weights for online learning correct?
Yes. Error calculation and weight updates happen as new samples come in.
When we use the test set to calculate the error of our final model, are we using mse for this or does it even really matter too much which we use?
If your model produces real-valued output, use MSE. If your system is trying to solve a classification problem, use classification error, e.g. a 10% classification error means that 10% of the test set was misclassified by your model during testing.
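Both test metrics are short enough to write out. The sketch below uses made-up predictions to show each one.

```python
def mse(targets, predictions):
    """Mean squared error for real-valued outputs."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

def classification_error(labels, predictions):
    """Fraction of samples whose predicted class differs from the true class."""
    wrong = sum(1 for l, p in zip(labels, predictions) if l != p)
    return wrong / len(labels)

# Regression example (hypothetical values).
print(mse([1.0, 2.0], [1.5, 2.0]))

# Classification example: 1 of 10 test samples is misclassified.
true_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
pred_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
print(classification_error(true_labels, pred_labels))
```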
