Neural Net learning only after first couple of epochs

I have a general question for the machine learning experts out there:
How can it be that a neural network only starts learning after a couple of epochs? For the first few epochs nothing changes, and then all of a sudden rapid progress occurs. What exactly can be behind this?
In my case I am dealing with a ResNet-type architecture and image processing.
[Figure: loss function]
Greetings

Related

machine learning loss during learning jumps all over

What should I do when the loss starts to jump all over during training?
The plot has been generated as such.
I tried different optimizers like SGD and Adam, different learning rates, and also learning-rate decay, but nothing really helped.
Thank you

Effect of learning rate on the neural network

I have a dataset of about 100 numeric values. I have set the learning rate of my neural network to 0.0001, and I have successfully trained it on the dataset over 1 million times. But my question is: what is the effect of very low learning rates in neural networks?
A low learning rate mainly implies slow convergence: you're moving down the loss function with smaller steps (the step size is the learning rate).
If your function is convex this is not a problem; you will wait longer, but you'll still reach a good solution.
If, as in the case of deep neural networks, your function is not convex, then a low learning rate could lead to reaching a "good" optimum that's not the best one (getting stuck in a local minimum, without taking steps big enough to jump out of it).
That's why there are adaptive optimization algorithms: algorithms like Adam or RMSProp maintain a different learning rate for each weight in the network (every learning rate starts from the same initial value). This way, the optimizer can adjust every parameter independently, with the aim of finding a better solution (and making the choice of the initial learning rate less critical).
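To make the contrast concrete, here is a minimal NumPy sketch (my own illustration, not code from the question or the answer) of a plain SGD step next to an Adam-style update, which keeps per-parameter moment estimates so each weight effectively gets its own step size:

    import numpy as np

    def sgd_step(w, grad, lr=1e-4):
        # Every parameter moves with the same global step size `lr`.
        return w - lr * grad

    def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # Per-parameter running averages of the gradient (m) and its square (v)
        # scale the effective step size individually for each weight.
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for the first steps
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v

Because the denominator sqrt(v_hat) is small for weights that have seen only small gradients, those weights take relatively larger steps, which is what makes the choice of the initial learning rate less critical.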

Neural networks - why does everybody have a different approach to XOR?

Currently I'm trying to learn how to work with neural networks by reading books, but mostly internet tutorials.
I often see that "XOR is the 'Hello World' of neural networks".
But here is the thing: the author of one tutorial says that for a neural network that computes XOR we should use 1 hidden layer with 2 neurons. He also uses backpropagation with deltas to adjust the weights.
I implemented this, but even after 1 million epochs the network is stuck on the input 1 and 1. The answer should be "0", but it is usually 0.5-something. I checked my code; it is correct.
If I add just 1 more neuron to the hidden layer, the network successfully computes XOR after ~50,000 epochs.
At the same time some people say that "XOR is not a trivial task and we should use a network with 2-3 or more layers". Why?
Come on, if XOR creates so many problems, maybe we shouldn't use it as the 'hello world' of neural networks? Please explain what is going on.
So neural networks are really interesting. There's a proof that a single perceptron can learn any linearly separable function given enough time. Even more impressive, a neural network with one hidden layer can apparently approximate any function, though I've yet to see a proof of that one.
XOR is a good function for teaching neural networks because, as CS students, those in the class are likely already familiar with it. In addition, it is not trivial, in the sense that a single perceptron cannot learn it: it isn't linearly separable. See this graphic I put together.
There is no line that separates these values. Yet it is simple enough for humans to understand, and, more importantly, a human can understand the neural network that solves it. NNs are very black-box-y; it very quickly becomes hard to tell why they work. Hell, here is another network config that can solve XOR.
Your example of a more complicated network solving it faster shows the power that comes from combining more neurons and more layers. It's absolutely unnecessary to use 2-3 hidden layers to solve it, but it sure helps speed up the process.
The point is that it is a problem simple enough for a human to solve on a blackboard in class, while also being slightly more challenging than a given linear function.
EDIT: Another fantastic example for teaching NNs practically is the MNIST hand-drawn digit classification dataset. I find that it very easily shows a problem that is simultaneously very simple for humans to understand, very hard to write a non-learning program for, and a very practical use case for machine learning. The problem is that the network structure is impossible to draw on a blackboard and trace in a way that is practical for a class. XOR achieves this.
EDIT 2: Also, without the code it will probably be hard to diagnose why it isn't converging. Did you write the neurons yourself? What about the optimization function, etc.?
EDIT 3: If the output of your final node is 0.5, try using a step squashing function that maps all values below 0.5 to 0 and all values at or above 0.5 to 1. You only have binary outputs anyway, so why bother with a continuous activation on the last node?
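For reference, here is a minimal NumPy sketch of the kind of small XOR network discussed above (my own illustration, not the asker's or the tutorial's code): one hidden layer, sigmoid units, plain backprop with deltas, and the step threshold from EDIT 3 applied at the end. With n_hidden = 3 it usually converges; with 2 hidden neurons it can get stuck, depending on the random initialization.

    import numpy as np

    rng = np.random.default_rng(0)

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    n_hidden = 3
    W1 = rng.uniform(-1, 1, size=(2, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-1, 1, size=(n_hidden, 1))
    b2 = np.zeros(1)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    lr = 0.5
    for epoch in range(50_000):
        # Forward pass
        h = sigmoid(X @ W1 + b1)        # hidden activations
        out = sigmoid(h @ W2 + b2)      # network output

        # Backward pass: deltas for squared error with sigmoid units
        delta_out = (out - y) * out * (1 - out)
        delta_h = (delta_out @ W2.T) * h * (1 - h)

        W2 -= lr * h.T @ delta_out
        b2 -= lr * delta_out.sum(axis=0)
        W1 -= lr * X.T @ delta_h
        b1 -= lr * delta_h.sum(axis=0)

    # Step "squashing" on the last node, as suggested in EDIT 3: threshold at 0.5.
    print((out >= 0.5).astype(int).ravel())   # expected: [0 1 1 0]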

Convnet suddenly drops accuracy

I was trying to train an emotion recognition model on the fer2013 dataset using the architecture proposed in this paper.
The paper uses a different dataset than mine, so I made some modifications to the stride and filter size.
After a couple of hours of training, the accuracy on both the training and test sets suddenly drops.
After that, the accuracy just stays around 0.1-0.2 for both sets and never improves again.
Does anybody know about this phenomenon?
In any neural network training, if both accuracies, i.e. training and validation, improve at first and then start decreasing, it is a sign that your network is failing to converge. More precisely, your optimizer has started overshooting.
The most likely reason for this is a learning rate that is too high. Reduce your learning rate and then check your example again. Also, in your linked paper (at least at first glance) I couldn't see the learning rate mentioned. Since your data is different from the paper's, the same learning rate might not work as well.
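As an illustration of that suggestion (my own sketch; the question doesn't say which framework was used, and the names below are made up), one common remedy is to lower the learning rate whenever validation accuracy stops improving instead of keeping it fixed:

    def reduce_lr_on_plateau(lr, acc_history, factor=0.5, patience=3, min_lr=1e-6):
        # Multiply lr by `factor` if the last `patience` epochs brought no new best accuracy.
        if len(acc_history) > patience and max(acc_history[-patience:]) <= max(acc_history[:-patience]):
            lr = max(lr * factor, min_lr)
        return lr

    # Toy usage with a made-up accuracy curve that plateaus:
    lr = 0.01
    history = []
    for epoch, acc in enumerate([0.30, 0.45, 0.55, 0.56, 0.56, 0.55, 0.56, 0.55]):
        history.append(acc)
        lr = reduce_lr_on_plateau(lr, history)
        print(f"epoch {epoch}: val_acc={acc:.2f}, lr={lr:.4f}")

Most frameworks ship an equivalent scheduler (often called something like ReduceLROnPlateau), so in practice you would feed the real validation metric into that rather than hand-rolling the logic.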

Neural Networks: Online learning (one-example-at-a-time)

I have a neural network that performs a classification task, and it works fairly well when the training set is large enough.
However, I'm looking for a way to train the NN with one labelled example at a time.
That is, I intercept data one example at a time, and I need to update my NN weights. I'm not allowed to store the data for future batch training.
I have a feed-forward NN built in Octave (following Stanford's ML course on Coursera). I can run backprop on the NN using each new example, but that approach converges unreliably and takes a significant amount of time.
Are there any other, more efficient algorithms for online learning in the context of neural networks?
I noticed MATLAB's adapt() function seems to do just this. The documentation doesn't specify what it's doing behind the scenes. Is it just one iteration of backprop?
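For what it's worth, online learning in this setting usually amounts to exactly one backprop/SGD step per incoming example, after which the example is discarded. Below is a minimal NumPy sketch of that idea (my own illustration with a made-up data stream; whether MATLAB's adapt() does exactly this internally is not something its documentation states):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(scale=0.5, size=(4, 8)), np.zeros(8)   # 4 inputs, 8 hidden units
    W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)   # 1 sigmoid output

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def online_update(x, t, lr=0.05):
        # One forward and one backward pass for a single labelled example (x, t).
        global W1, b1, W2, b2
        h = sigmoid(x @ W1 + b1)
        y = sigmoid(h @ W2 + b2)
        delta_out = (y - t) * y * (1 - y)            # squared-error delta at the output
        delta_h = (delta_out @ W2.T) * h * (1 - h)   # delta at the hidden layer
        W2 -= lr * np.outer(h, delta_out)
        b2 -= lr * delta_out
        W1 -= lr * np.outer(x, delta_h)
        b1 -= lr * delta_h

    # Simulated stream of intercepted examples: update immediately, never store them.
    for _ in range(1000):
        x = rng.normal(size=4)
        t = np.array([float(x.sum() > 0)])           # made-up labelling rule
        online_update(x, t)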
