If training set is linearly separable then Perceptron finds linear classifier separating data with 0 risk/loss eventually. But what if we believe data are separable but they are not. In such case, if we would let Perceptron run on such data with number of examples going to infinity, will the error value go to 0 in such limit?
Related
My neural network trainign in pytorch is getting very wierd.
I am training a known dataset that came splitted into train and validation.
I'm shuffeling the data during training and do data augmentation on the fly.
I have those results:
Train accuracy start at 80% and increases
Train loss decreases and stays stable
Validation accuracy start at 30% but increases slowly
Validation loss increases
I have the following graphs to show:
How can you explain that the validation loss increases and the validation accuracy increases?
How can be such a big difference of accuracy between validation and training sets? 90% and 40%?
Update:
I balanced the data set.
It is binary classification. It now has now 1700 examples from class 1, 1200 examples from class 2. Total 600 for validation and 2300 for training.
I still see similar behavior:
**Can it be becuase I froze the weights in part of the network?
**Can it be becuase the hyperparametrs like lr?
I found the solution:
I had different data augmentation for training set and validation set. Matching them also increased the validation accuracy!
If the training set is very large in comparison to the validation set, you are more likely to overfit and learn the training data, which would make generalizing the model very difficult. I see your training accuracy is at 0.98 and your validation accuracy increases at a very slow rate, which would imply that you have overfit your training data.
Try reducing the number of samples in your training set to improve how well your model generalizes to unseen data.
Let me answer your 2nd question first. High accuracy on training data and low accuracy on val/test data indicates the model might not generalize well to infer real cases. That is what the validation process is all about. You need to finetune or even rebuild your model.
With regard to the first question, val loss might not necessarily correspond to the val accuracy. The model makes the prediction based on its model, and loss function calculates the difference between probablities of matrix and the target if you are using CrossEntropy function.
I'm currently training a neural network with tanh activation function (basic backpropagation without weight decay, momentum or any other improvements) to map an arbitrary function. The training data is within the interval (-1,1).
My training approach is as follows:
Splitting data into two parts (training and validation data).
Feeding the whole training data to the network (with weight update).
Feeding the validation data (without weight update) and calculate the MSE of all validation entries.
Repeat (starting at 2) until validation MSE at t-1 is almost equal to MSE at t.
I've tried to normalize the data using
f(x)=(x-mean)/std_deviation
but the MSE on the validation data is not getting below approximately 1.49. Without the normalization process, it reaches 0.005. Does someone has an explanation for this or an hind on improving my training procedure and data normalization?
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
When I trained my neural network with Theano or Tensorflow, they will report a variable called "loss" per epoch.
How should I interpret this variable? Higher loss is better or worse, or what does it mean for the final performance (accuracy) of my neural network?
The lower the loss, the better a model (unless the model has over-fitted to the training data). The loss is calculated on training and validation and its interperation is how well the model is doing for these two sets. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in training or validation sets.
In the case of neural networks, the loss is usually negative log-likelihood and residual sum of squares for classification and regression respectively. Then naturally, the main objective in a learning model is to reduce (minimize) the loss function's value with respect to the model's parameters by changing the weight vector values through different optimization methods, such as backpropagation in neural networks.
Loss value implies how well or poorly a certain model behaves after each iteration of optimization. Ideally, one would expect the reduction of loss after each, or several, iteration(s).
The accuracy of a model is usually determined after the model parameters are learned and fixed and no learning is taking place. Then the test samples are fed to the model and the number of mistakes (zero-one loss) the model makes are recorded, after comparison to the true targets. Then the percentage of misclassification is calculated.
For example, if the number of test samples is 1000 and model classifies 952 of those correctly, then the model's accuracy is 95.2%.
There are also some subtleties while reducing the loss value. For instance, you may run into the problem of over-fitting in which the model "memorizes" the training examples and becomes kind of ineffective for the test set. Over-fitting also occurs in cases where you do not employ a regularization, you have a very complex model (the number of free parameters W is large) or the number of data points N is very low.
They are two different metrics to evaluate your model's performance usually being used in different phases.
Loss is often used in the training process to find the "best" parameter values for your model (e.g. weights in neural network). It is what you try to optimize in the training by updating weights.
Accuracy is more from an applied perspective. Once you find the optimized parameters above, you use this metrics to evaluate how accurate your model's prediction is compared to the true data.
Let us use a toy classification example. You want to predict gender from one's weight and height. You have 3 data, they are as follows:(0 stands for male, 1 stands for female)
y1 = 0, x1_w = 50kg, x2_h = 160cm;
y2 = 0, x2_w = 60kg, x2_h = 170cm;
y3 = 1, x3_w = 55kg, x3_h = 175cm;
You use a simple logistic regression model that is y = 1/(1+exp-(b1*x_w+b2*x_h))
How do you find b1 and b2? you define a loss first and use optimization method to minimize the loss in an iterative way by updating b1 and b2.
In our example, a typical loss for this binary classification problem can be:
(a minus sign should be added in front of the summation sign)
We don't know what b1 and b2 should be. Let us make a random guess say b1 = 0.1 and b2 = -0.03. Then what is our loss now?
so the loss is
Then you learning algorithm (e.g. gradient descent) will find a way to update b1 and b2 to decrease the loss.
What if b1=0.1 and b2=-0.03 is the final b1 and b2 (output from gradient descent), what is the accuracy now?
Let's assume if y_hat >= 0.5, we decide our prediction is female(1). otherwise it would be 0. Therefore, our algorithm predict y1 = 1, y2 = 1 and y3 = 1. What is our accuracy? We make wrong prediction on y1 and y2 and make correct one on y3. So now our accuracy is 1/3 = 33.33%
PS: In Amir's answer, back-propagation is said to be an optimization method in NN. I think it would be treated as a way to find gradient for weights in NN. Common optimization method in NN are GradientDescent and Adam.
Just to clarify the Training/Validation/Test data sets:
The training set is used to perform the initial training of the model, initializing the weights of the neural network.
The validation set is used after the neural network has been trained. It is used for tuning the network's hyperparameters, and comparing how changes to them affect the predictive accuracy of the model. Whereas the training set can be thought of as being used to build the neural network's gate weights, the validation set allows fine tuning of the parameters or architecture of the neural network model. It's useful as it allows repeatable comparison of these different parameters/architectures against the same data and networks weights, to observe how parameter/architecture changes affect the predictive power of the network.
Then the test set is used only to test the predictive accuracy of the trained neural network on previously unseen data, after training and parameter/architecture selection with the training and validation data sets.
I have implemented Q-Learning as described in,
http://web.cs.swarthmore.edu/~meeden/cs81/s12/papers/MarkStevePaper.pdf
In order to approx. Q(S,A) I use a neural network structure like the following,
Activation sigmoid
Inputs, number of inputs + 1 for Action neurons (All Inputs Scaled 0-1)
Outputs, single output. Q-Value
N number of M Hidden Layers.
Exploration method random 0 < rand() < propExplore
At each learning iteration using the following formula,
I calculate a Q-Target value then calculate an error using,
error = QTarget - LastQValueReturnedFromNN
and back propagate the error through the neural network.
Q1, Am I on the right track? I have seen some papers that implement a NN with one output neuron for each action.
Q2, My reward function returns a number between -1 and 1. Is it ok to return a number between -1 and 1 when the activation function is sigmoid (0 1)
Q3, From my understanding of this method given enough training instances it should be quarantined to find an optimal policy wight? When training for XOR sometimes it learns it after 2k iterations sometimes it won't learn even after 40k 50k iterations.
Q1. It is more efficient if you put all action neurons in the output. A single forward pass will give you all the q-values for that state. In addition, the neural network will be able to generalize in a much better way.
Q2. Sigmoid is typically used for classification. While you can use sigmoid in other layers, I would not use it in the last one.
Q3. Well.. Q-learning with neural networks is famous for not always converging. Have a look at DQN (deepmind). What they do is solving two important issues. They decorrelate the training data by using memory replay. Stochastic gradient descent doesn't like when training data is given in order. Second, they bootstrap using old weights. That way they reduce non-stationary.
I built a classifier with 13 features ( no binary ones ) and normalized individually for each sample using scikit tool ( Normalizer().transform).
When I make predictions it predicts all training sets as positives and all test sets as negatives ( irrespective of fact whether it is positive or negative )
What anomalies I should focus on in my classifier, feature or data ???
Notes: 1) I normalize test and training sets (individually for each sample) separately.
2) I tried cross validation but the performance is same
3) I used both SVM linear and RBF Kernels
4) I tried without normalizing too. But same poor results
5) I have same number of positive and negative datasets ( 400 each) and 34 samples of positive and 1000+ samples of negative test sets.
If you're training on balanced data the fact that "it predicts all training sets as positive" is probably enough to conclude that something has gone wrong.
Try building something very simple (e.g. a linear SVM with one or two features) and look at the model as well as a visualization of your training data; follow the scikit-learn example: http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html
There's also a possibility that your input data has many large outliers impacting the transform process...
Try doing feature selection on the training data (Seperately from your test/validation data).
Feature selection on your whole dataset can easily lead to overfitting.