I'm currently learning about convolutional neural networks by studying examples like MNIST. During the training of a neural network, I often see output like:
Epoch | Train loss | Valid loss | Train / Val
--------|--------------|--------------|---------------
50 | 0.004756 | 0.007043 | 0.675330
100 | 0.004440 | 0.005321 | 0.834432
250 | 0.003974 | 0.003928 | 1.011598
500 | 0.002574 | 0.002347 | 1.096366
1000 | 0.001861 | 0.001613 | 1.153796
1500 | 0.001558 | 0.001372 | 1.135849
2000 | 0.001409 | 0.001230 | 1.144821
2500 | 0.001295 | 0.001146 | 1.130188
3000 | 0.001195 | 0.001087 | 1.099271
Besides the epochs, can someone explain what exactly each column represents and what the values mean? I see a lot of tutorials on basic CNNs, but I haven't run into one that explains this in detail.
It appears that a held-out set of data is being used, in addition to that used to train the network. Training loss is the error on the training set of data. Validation loss is the error after running the validation set of data through the trained network. Train/valid is the ratio between the two.
As expected, as the epochs increase both validation and training error drop. At a certain point, though, while the training error continues to drop (the network learns the data better and better), the validation error begins to rise -- this is overfitting!
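A minimal sketch in plain Python (with made-up loss values shaped like the table above) of what the Train / Val column is and how it can flag overfitting:

```python
# Hypothetical per-epoch losses, shaped like the training log above.
train_loss = [0.0048, 0.0044, 0.0040, 0.0026, 0.0019, 0.0016, 0.0014]
valid_loss = [0.0070, 0.0053, 0.0039, 0.0023, 0.0016, 0.0014, 0.0012]

# The Train / Val column is simply the ratio of the two losses.
ratios = [t / v for t, v in zip(train_loss, valid_loss)]

def overfitting_epochs(train, valid):
    """Flag epochs where training loss keeps dropping but validation loss rises."""
    flagged = []
    for i in range(1, len(train)):
        if train[i] < train[i - 1] and valid[i] > valid[i - 1]:
            flagged.append(i)
    return flagged
```

With the values above nothing is flagged yet (both losses are still falling); once validation loss turns upward while training loss keeps dropping, `overfitting_epochs` starts reporting those epochs.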
I have a dataset consisting of: date, id (the id of the event), number_of_activities, and running_sum (the running sum of activities by id).
This is a part of my data:
date | id (id of the event) | number_of_activities | running_sum |
2017-01-06 | 156 | 1 | 1 |
2017-04-26 | 156 | 1 | 2 |
2017-07-04 | 156 | 2 | 4 |
2017-01-19 | 175 | 1 | 1 |
2017-03-17 | 175 | 3 | 4 |
2017-04-27 | 221 | 3 | 3 |
2017-05-05 | 221 | 7 | 10 |
2017-05-09 | 221 | 10 | 20 |
2017-05-19 | 221 | 1 | 21 |
2017-09-03 | 221 | 2 | 23 |
My goal is to predict the future number of activities for a given event. My question: can I train my model on the whole dataset (all the events) to predict the next one, and if so, how? The inputs are unequal (the number of rows per event differs). Also, is it possible to exploit the date data as well?
Sure you can. But a lot more information is needed, which you know best yourself.
I guess we are talking about time series here, since you want to predict the future.
You might want to have a look at recurrent neural nets and LSTMs:
A recurrent layer takes a time series as input and outputs a vector containing compressed information about the whole series. So let's take event 156, which has 3 steps:
The event is your sample, which has 3 timesteps, and each timestep has a different number of activities (features). To solve this, take the maximum sequence length that occurs and add a padding value (most often simply zero) so all sequences have the same length. Then you have a shape that is suitable for a recurrent neural net (where LSTMs are currently a good choice).
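A sketch of that padding step in plain Python (Keras offers a ready-made `pad_sequences` utility; this hand-rolled version just illustrates the idea, using activity counts taken from the table above):

```python
def pad_sequences(seqs, pad_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_value] * (max_len - len(s)) for s in seqs]

# number_of_activities per event, from the sample data:
event_156 = [1, 1, 2]           # 3 timesteps
event_221 = [3, 7, 10, 1, 2]    # 5 timesteps

padded = pad_sequences([event_156, event_221])
# Both sequences now have length 5 -- a rectangular shape an RNN can consume.
```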
Update
You said in the comments that padding is not an option for you; let me try to convince you. LSTMs are good in situations where sequence lengths differ. However, for this to work you also need longer sequences the model can learn patterns from. What I want to say is: when some of your sequences have only a few timesteps, like 3, but others have 50 or more, the model may have difficulty predicting these correctly, since you have to specify which timestep you want to use. So either prepare your data differently for a clearer question, or dig deeper into sequence-to-sequence learning, which is very good at handling sequences of different lengths. For this you will need to set up an encoder-decoder network.
The encoder squashes the whole sequence, whatever its length, into one vector; that single vector is compressed so that it contains the information of the whole sequence.
The decoder then learns to use this vector to predict the next outputs of the sequence. This is a well-known technique for machine translation, but it is suitable for any kind of sequence-to-sequence task. So I would recommend creating such an encoder-decoder network, which should improve your results. Have a look at this tutorial, which might help you further.
I have a dataset where some features are numerical, some categorical, and some are strings (e.g. a description). To give an example, let's say I have three features:
| Number | Type | Comment |
---------------------------------------------------------
| 1.23 | 1 | Some comment, up to 10000 characters |
| 2.34 | 2 | Different comment many words |
...
Can I have all of them as input to a multi-layer network in DL4J, where the numerical and categorical features would be regular input features, but the string comment feature would first be processed as a word series by a simple RNN (e.g. Embedding -> LSTM)? In other words, the architecture should look something like this:
"Number" "Type" "Comment"
| | |
| | Embedding
| | |
| | LSTM
| | |
Main Multi-Layer Network
|
Dense
|
...
|
Output
I think in Keras this can be achieved with a Concatenate layer. Is there something like this in DL4J?
DL4J has 99% Keras import coverage. We have concatenate layers as well. Take a look at the various vertices. Whatever you can do in Keras should be doable in DL4J, save for very specific cases. More here: https://deeplearning4j.org/docs/latest/deeplearning4j-nn-computationgraph You want a MergeVertex.
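For reference, a sketch of the Keras side of this architecture (assuming TensorFlow's Keras API; the layer sizes and the 100-token comment length are made-up values), which the DL4J Keras import / MergeVertex would mirror:

```python
from tensorflow.keras import layers, Model

# Plain numeric / categorical inputs.
num_in = layers.Input(shape=(1,), name="number")
type_in = layers.Input(shape=(1,), name="type")

# Comment branch: token ids -> Embedding -> LSTM.
comment_in = layers.Input(shape=(100,), name="comment")  # 100 token ids
x = layers.Embedding(input_dim=10000, output_dim=32)(comment_in)
x = layers.LSTM(16)(x)

# Concatenate all branches (MergeVertex in DL4J), then the main dense stack.
merged = layers.Concatenate()([num_in, type_in, x])
h = layers.Dense(32, activation="relu")(merged)
out = layers.Dense(1)(h)

model = Model([num_in, type_in, comment_in], out)
```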
Let's assume that I've got patients with information about their diseases and symptoms. I want to estimate the probability P(disease_i = TRUE | symptom_j = TRUE). I suppose I should use a naive Bayes classifier, but every example I've found applies naive Bayes when there's only one disease (like predicting the probability of a heart attack).
My data look like below:
patient | disease | if_disease_present | symptom
1 | d1 | TRUE | s1
2 | d1 | FALSE | s2
3 | d2 | TRUE | s1
4 | d3 | TRUE | s4
5 | d4 | FALSE | s8
...
My idea was to split the data by disease and build as many naive Bayes models as there are unique diseases in my data, but I have doubts about whether that's the proper method.
If you want to predict the disease, don't split the data on it.
That is your target variable!
But as is, your table is not suitable for this task. You need to preprocess it, probably by pivoting it.
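A sketch in plain Python of the kind of pivot meant here (toy rows modeled on the table above; real data would have several symptom rows per patient): one row per patient, one 0/1 column per symptom, and the disease as the class label.

```python
from collections import defaultdict

# Toy rows shaped like the table above: (patient, disease, present, symptom).
rows = [
    (1, "d1", True,  "s1"),
    (2, "d1", False, "s2"),
    (3, "d2", True,  "s1"),
    (4, "d3", True,  "s4"),
]

symptoms = sorted({r[3] for r in rows})

# Group by patient: collect the symptoms seen, and the disease (if present).
by_patient = defaultdict(lambda: {"symptoms": set(), "disease": None})
for patient, disease, present, symptom in rows:
    by_patient[patient]["symptoms"].add(symptom)
    if present:
        by_patient[patient]["disease"] = disease

# Pivoted design matrix: one 0/1 symptom indicator per column, disease as target.
X = [[int(s in rec["symptoms"]) for s in symptoms] for rec in by_patient.values()]
y = [rec["disease"] for rec in by_patient.values()]
```

With the data in this shape, a single multi-class naive Bayes model can be fit with the disease as the target, instead of one model per disease.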
I've written a simple recurrent network in TensorFlow based on this video that I watched: https://youtu.be/vq2nnJ4g6N0?t=8546
In the video the RNN is demonstrated to produce Shakespeare plays by having the network produce words one character at a time. The output of the network is fed back into the input on the next iteration.
Here's a diagram of my network:
+--------------------------------+
| |
| In: H E L L O W O R L <--+-----+
| | | | | | | | | | | | |
| V V V V V V V V V V | | Recursive feed
| +-----------------+ | |
+-> Hin ->| RNN + Softmax |-> Hout |
+-----------------+ |
| | | | | | | | | | |
Out: V V V V V V V V V V |
E L L O W O R L D ---------+
^ Character predicted by the network
I expect the network to at least do the copying bit correctly. Unfortunately, my network always outputs 32 (the ASCII code for the space character) for all values. I'm not sure what is causing the issue...
Please help me get my network producing poetry!
My code is here:
https://github.com/calebh/namepoet/blob/03f112ced94c3319055fbcc74a2acdb4a9b0d41c/main.py
The corpus can be replaced by a few paragraphs of Lorem Ipsum to speed up training (the network has the same bad behavior).
Sounds like it might be saturating your activations (i.e. the activation function is way at the far end of its range, and thus has a very low gradient and gets stuck). You might want to try initializing your neurons' parameters with a different method.
Also, is there a particular reason you're using GRUs? In my experience LSTM units are more reliable, if a bit less efficient.
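To illustrate the saturation point, here is a toy example with a sigmoid (the same effect hits tanh; poorly scaled initializations push pre-activations into this flat regime):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s).
    s = sigmoid(x)
    return s * (1.0 - s)

# Near zero the gradient is healthy; far out it all but vanishes,
# so gradient descent barely moves the weights -- the "saturated" regime.
grad_center = sigmoid_grad(0.0)      # maximum gradient, 0.25
grad_saturated = sigmoid_grad(10.0)  # tiny gradient
```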
I'd try running the code for a longer time. You have batch_size = 10, sequence_size = 30, and 20 iterations, so your network has essentially seen 6000 characters in total; with a learning rate of 0.001, that may not have been enough to move away from your initialization.
Hence, I'd try raising the learning rate to a very high value (e.g. 1 or 100) and see if the network starts outputting different letters, to confirm your implementation is somewhat correct. A network trained with such a high learning rate is usually not going to be accurate at all.
Could you please help me with a neural network?
If I have an arbitrary dataset:
+---+---------+---------+--------------+--------------+--------------+--------------+
| i | Input 1 | Input 2 | Exp.Output 1 | Exp.Output 2 | Act.output 1 | Act.output 2 |
+---+---------+---------+--------------+--------------+--------------+--------------+
| 1 | 0.1 | 0.2 | 1 | 2 | 2 | 4 |
| 2 | 0.3 | 0.8 | 3 | 5 | 8 | 10 |
+---+---------+---------+--------------+--------------+--------------+--------------+
Let's say I have x hidden layers with different numbers of neurons and different types of activation functions each.
When running backpropagation (especially iRprop+), when do I update the weights? Do I update them after calculating each line from the dataset?
I've read that batch learning is often not as efficient as "on-line" training. Does that mean it is better to update the weights after each line?
And do I understand correctly that an epoch is one loop through every line of the input dataset? If so, that would mean that in one epoch the weights are updated twice?
Then, where does the total network error (see below) come into play?
[image, from here.]
tl;dr:
Please help me understand how backpropagation works.
Typically, you would update the weights after each example in the data set (I assume that's what you mean by each line). So, for each example, you see what the neural network thinks the output should be (storing the outputs of each neuron in the process) and then backpropagate the error: starting from the final output, compare the ANN's output with the actual output (what the data set says it should be) and update the weights according to a learning rate.
The learning rate should be a small constant, since you are correcting the weights for each and every example. An epoch is one iteration through every example in the data set.
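A minimal sketch of this per-example ("on-line") update loop, using a single linear neuron and plain SGD rather than iRprop+ (the data and hyperparameters are made up):

```python
def train_online(data, lr=0.1, epochs=200):
    """On-line SGD for one linear neuron: weights change after every example."""
    w, b = 0.0, 0.0
    for _ in range(epochs):          # one epoch = one pass over all lines
        for x, target in data:       # one weight update per line
            y = w * x + b            # forward pass
            err = y - target         # error for this example
            w -= lr * err * x        # gradient step on the weight
            b -= lr * err            # gradient step on the bias
    return w, b

# Learn y = 2x + 1 from a two-line dataset; with 2 lines and 200 epochs,
# the weights are updated 400 times in total.
w, b = train_online([(0.0, 1.0), (1.0, 3.0)])
```

Note how, exactly as described above, a dataset of two lines yields two weight updates per epoch.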