I would like to use the yolo architecture for object detection. Before training the network with my custom data, I followed these steps to train it on the Pascal VOC data: https://pjreddie.com/darknet/yolo/
The instructions are very clear.
But after the final step
./darknet detector train cfg/voc.data cfg/yolo-voc.cfg darknet19_448.conv.23
darknet immediately stops training and announces that weights have been written to the backups/ directory.
At first I thought that the pretraining was simply too good and that the stopping criterion had been reached at once.
So I used the ./darknet detect command with these weights on one of the test images, data/dog. Nothing is found.
If I don't use any pretrained weights, the network does train.
I've edited cfg/yolo-voc.cfg to use
# Testing
#batch=1
#subdivisions=1
# Training
batch=32
subdivisions=8
Now the training process has been running for many hours and is keeping my GPU warm.
Is this the intended way to train darknet?
How can I use pretrained weights correctly, without training just breaking off?
Is there any setting to create checkpoints, or get an idea of the progress?
Adding -clear 1 at the end of your training command will clear the stats of how many images this model has seen in previous training. Then you can fine-tune your model on a new dataset.
You can find more info about the usage in the function signature
void train_detector(char *datacfg, char *cfgfile, char *weightfile, int *gpus, int ngpus, int clear)
at https://github.com/pjreddie/darknet/blob/b13f67bfdd87434e141af532cdb5dc1b8369aa3b/examples/detector.c
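For example, appended to the training command from the question:
./darknet detector train cfg/voc.data cfg/yolo-voc.cfg darknet19_448.conv.23 -clear 1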
I doubt that increasing the max number of iterations is a good idea, as the learning rate is usually tied to the current iteration number. We usually increase the max number of iterations when we want to resume a previous training task that ended because it reached the max number of iterations, but we believe that more iterations will give better results.
FYI, when you have a small dataset, training on it from scratch or from a classification network may not be a great idea. You may still want to re-use the weights from a detection network trained on a large dataset like COCO or ImageNet.
This is an old question so I hope you have your answer by now, but here is mine just in case it helps.
After working with darknet for about a month, I've run into most of the roadblocks that people have asked/posted about on forums. In your case, I'm pretty certain it's because the weights have been trained for the max number of batches already, and when the pre-trained weights were read in, darknet assumed training was done.
Relevant personal experience: when I used one of the pretrained weights files, it started from iteration 40101 and ran until 40200 before cutting off.
I would stick to training from scratch if you have custom data, but if you want to try the pre-trained weights again, you might find that changing max_batches in the cfg file helps.
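That setting lives in the same cfg/yolo-voc.cfg file edited above; a sketch with an illustrative number (40200 matches the cutoff reported in the previous paragraph, so you would raise max_batches beyond the iteration count stored in the pretrained weights):
# in cfg/yolo-voc.cfg
max_batches=80200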
Also, if you are using AlexeyAB/darknet, there might be a problem with the -clear option;
in detector.c:
if (clear) *nets[k].seen = 0;
should really be:
if (clear) { *nets[k].seen = 0; *nets[k].cur_iteration = 0; }
otherwise the training loop will exit immediately.
Set OpenCV to 0 in your darknet/Makefile:
OpenCV=0
I am playing with some demos of recurrent neural networks.
I noticed that the scale of my data differs a lot from column to column, so I am considering doing some preprocessing before I feed data batches into my RNN. The close column is the target I want to predict in the future.
open high low volume price_change p_change ma5 ma10 \
0 20.64 20.64 20.37 163623.62 -0.08 -0.39 20.772 20.721
1 20.92 20.92 20.60 218505.95 -0.30 -1.43 20.780 20.718
2 21.00 21.15 20.72 269101.41 -0.08 -0.38 20.812 20.755
3 20.70 21.57 20.70 645855.38 0.32 1.55 20.782 20.788
4 20.60 20.70 20.20 458860.16 0.10 0.48 20.694 20.806
ma20 v_ma5 v_ma10 v_ma20 close
0 20.954 351189.30 388345.91 394078.37 20.56
1 20.990 373384.46 403747.59 411728.38 20.64
2 21.022 392464.55 405000.55 426124.42 20.94
3 21.054 445386.85 403945.59 473166.37 21.02
4 21.038 486615.13 378825.52 461835.35 20.70
My question is: is preprocessing the data with, say, StandardScaler from sklearn necessary in my case? And why?
(You are welcome to edit my question)
It will be beneficial to normalize your training data. Feeding features with widely different scales to your model will cause the network to weight the features unequally. This can lead to a false prioritisation of some features over others in the representation.
Although the whole discussion on data preprocessing is somewhat controversial, both regarding when exactly it is necessary and how to correctly normalize the data for a given model and application domain, there is a general consensus in machine learning that a mean-subtraction step as well as a general normalization step is helpful.
In the case of mean subtraction, the mean of every individual feature is subtracted from the data, which can be interpreted geometrically as centering the data around the origin. This holds for every dimension.
Normalizing the data after the mean-subtraction step brings the data dimensions to approximately the same scale. Note that, as mentioned above, the different features will lose any prioritization over each other after this step. If you have good reasons to think that the different scales in your features carry important information that the network may need to truly understand the underlying patterns in your dataset, then normalization will be harmful. A standard approach is to scale the inputs to have a mean of 0 and a variance of 1.
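As a concrete sketch of that standard approach, using sklearn's StandardScaler on the columns from the question (the CSV file name is a placeholder; in practice, fit the scaler on the training split only and reuse it on validation/test data):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("stock_data.csv")            # placeholder for the data shown above
features = df.drop(columns=["close"])         # open, high, low, volume, ma5, ...
target = df["close"]

scaler = StandardScaler()                     # mean subtraction + scaling to unit variance
X = scaler.fit_transform(features)

# The target can be scaled the same way and mapped back after prediction:
y_scaler = StandardScaler()
y = y_scaler.fit_transform(target.to_frame())
# predictions = y_scaler.inverse_transform(model_output)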
Further preprocessing operations may be helpful in specific cases, such as performing PCA or whitening on your data. Look into the excellent CS231n notes (Setting up the data and the model) for further reference on these topics, as well as for a more detailed explanation of the points above.
Definitely yes. Most neural networks work best with data between 0 and 1 or between -1 and 1 (depending on the output function). Also, when some inputs are larger than others, the network will "think" they are more important. This can make learning very slow, because the network must first lower the weights on those inputs.
I found this: https://arxiv.org/abs/1510.01378
If you normalize, it may improve convergence, so you will get shorter training times.
I would like to implement a neural network architecture in Caffe which will perform differently based on some iterable variable. For example: the full network might use 10 layers for 4 out of 5 training or testing iterations, but for all other iterations it will truncate the network and only use the last 5 layers. This would require that the input to the first layer and the input to the 5th layer have the same dimensionality of course, but my primary question is how to implement this switching between the two architectures during training/testing.
I guess you can do that using pycaffe and caffe.NetSpec(), but the code is not going to be very nice...
On the other hand, why don't you train the full net for i iterations, save a snapshot, and then "warm start" the reduced model from the snapshot you saved?
That is: have 'full_trainval.prototxt' with 'full_solver.prototxt' configured to train the full net for i iterations, and 'top_trainval.prototxt' with 'top_solver.prototxt' configured to train only the top layers of the net. Then
~$ $CAFFE_ROOT/build/tools/caffe train -solver full_solver.prototxt
When this stage is done, make sure you have the final snapshot saved, and then
~$ $CAFFE_ROOT/build/tools/caffe train -solver top_solver.prototxt -snapshot full_train_last_snapshot.solverstate
Finally, you could use net_surgery to merge the weights of the two phases into a single full net.
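For that last merging step, a rough pycaffe sketch in the spirit of net_surgery (the .caffemodel file names are placeholders; only layers whose names exist in both nets are copied):

import caffe

# full net from phase one, top-only net from phase two
full_net = caffe.Net('full_trainval.prototxt', 'full_phase1.caffemodel', caffe.TEST)
top_net = caffe.Net('top_trainval.prototxt', 'top_phase2.caffemodel', caffe.TEST)

# overwrite the shared top layers of the full net with the freshly trained weights
for layer_name in top_net.params:
    if layer_name in full_net.params:
        for i in range(len(top_net.params[layer_name])):
            full_net.params[layer_name][i].data[...] = top_net.params[layer_name][i].data

full_net.save('merged_full.caffemodel')  # single full net holding both phases' weights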
I am experimenting with the tensorflow seq2seq_model.py model.
The target vocab size I have is around 200.
The documentation says:
For vocabularies smaller than 512, it might be a better idea to just use a standard softmax loss.
The source-code also has the check:
if num_samples > 0 and num_samples < self.target_vocab_size:
Running the model with only a 200-word target vocabulary does not satisfy the if condition.
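For reference, the check reduces to this (assuming the constructor default of 512 for num_samples):

num_samples = 512        # default in the model's constructor, assumed unchanged
target_vocab_size = 200  # my vocabulary size

# The sampled softmax is only built when this is True; with a 200-word
# vocabulary it is False, so the standard full softmax loss is used instead.
print(num_samples > 0 and num_samples < target_vocab_size)  # prints False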
Do I need to write a "standard" softmax loss function to ensure a good training, or can I just let the model run as it comes?
Thanks for the help!
I am doing the same thing. Just to get my feet wet with different kinds of structures in the training data, I am working in an artificial test world with just 117 words in the (source and) target vocabulary.
I asked myself the same question and decided not to go through that hassle. My models train well even though I didn't touch the loss, thus still using the sampled_softmax_loss.
Further experiences with those small vocab sizes:
- batch size 32 is best in my case (smaller ones make training really unstable and I quickly run into NaN issues)
- I am using AdaGrad as the optimizer and it works like magic
- I am working with model_with_buckets (addressed through translate.py), and size 512 with num_layers 2 produces the desired outcomes in many cases.
This is rather a weird problem.
I have backpropagation code which works perfectly, like this:
Now, when I do batch learning I get wrong results, even for a simple scalar function approximation.
After training, the network produces almost the same output for all input patterns.
So far I have tried:
Introduced bias weights
Tried with and without updating the input weights
Shuffled the patterns in batch learning
Tried updating after each pattern and accumulating
Initialized the weights in different possible ways
Double-checked the code 10 times
Normalized the accumulated updates by the number of patterns
Tried different numbers of layers and neurons
Tried different activation functions
Tried different learning rates
Tried different numbers of epochs, from 50 to 10000
Tried normalizing the data
I noticed that after a bunch of backpropagation steps for just one pattern, the network produces almost the same output for a large variety of inputs.
When I try to approximate a function, I always get just a line (almost a line), like this:
Related question: Neural Network Always Produces Same/Similar Outputs for Any Input
And the suggestion to add bias neurons didn't solve my problem.
I found a post that says:
When ANNs have trouble learning they often just learn to output the
average output values, regardless of the inputs. I don't know if this
is the case or why it would be happening with such a simple NN.
which describes my situation closely enough. But how to deal with it?
I am coming to the conclusion that the situation I encountered is actually possible. Really, for each net configuration, one may just "cut" all the connections up to the output layer. This can be done, for example, by setting all hidden weights to near-zero, or by setting the biases to some extreme values in order to oversaturate the hidden layer and make the output independent of the input. After that, we are free to adjust the output layer so that it reproduces the same output regardless of the input. In batch learning, what happens is that the gradients get averaged and the net reproduces just the mean of the targets. The inputs do not play ANY role.
My answer cannot be fully precise because you have not posted the contents of the functions perceptron(...) and backpropagation(...).
But from what I can guess, you train your network many times on ONE pattern, then completely on ONE other, inside a loop like for data in training_data, which means your network will only remember the last one. Instead, try training your network on every pattern once, then do that again many times (invert the order of your nested loops).
In other words, the for I = 1:number of patterns loop should be inside the backpropagation(...) function's loop, so this function should contain two loops.
EXAMPLE (in C#):
Here are some parts of a backpropagation function, which I have simplified. At each update of the weights and biases, the entire network is "propagated". The following code can be found at this URL: https://visualstudiomagazine.com/articles/2015/04/01/back-propagation-using-c.aspx
public double[] Train(double[][] trainData, int maxEpochs, double learnRate, double momentum)
{
//...
Shuffle(sequence); // visit each training data in random order
for (int ii = 0; ii < trainData.Length; ++ii)
{
//...
ComputeOutputs(xValues); // copy xValues in, compute outputs
//...
// Find new weights and biases
// Update weights and biases
//...
} // each training item
}
Maybe what is not working is simply that you need to enclose everything after this comment (in Batch learn, as an example) in a second for loop that performs multiple epochs of learning:
%--------------------------------------------------------------------------
%% Get all updates
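For completeness, a minimal sketch of the loop structure being described; the names are illustrative, and gradient(...) stands in for your backpropagation routine:

import numpy as np

def batch_train(patterns, targets, weights, gradient, epochs=1000, lr=0.01):
    """Batch gradient descent: visit EVERY pattern in EVERY epoch, update once per epoch."""
    for epoch in range(epochs):                    # outer loop over epochs
        grad_sum = [np.zeros_like(w) for w in weights]
        for x, t in zip(patterns, targets):        # inner loop over all patterns
            g = gradient(weights, x, t)            # per-layer gradients for one pattern
            grad_sum = [gs + gi for gs, gi in zip(grad_sum, g)]
        # average over the batch and apply ONE update per epoch
        weights = [w - lr * gs / len(patterns) for w, gs in zip(weights, grad_sum)]
    return weights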
Here is the deal.
I am trying to make an SVM-based POS tagger.
The feature vectors for the SVM were created with the help of format converters.
Now here is a screenshot of the training file that I am using.
http://tinypic.com/r/n4fn2r/8
I have 25 labels for various POS tags. When I use the Java implementation or the command-line tools for prediction, I get the following results.
http://tinypic.com/r/2dtw5ky/8
I have tried with all the kernels available but it gave more or less the same results.
This is happening even when the training file is used as the testing file.
Please help me out here!
P.S. I cannot share more than two links, so here is a snippet of the model file:
svm_type c_svc
kernel_type rbf
gamma 0.000548546
nr_class 25
total_sv 431
rho -0.929467 1.01073 1.0531 1.03472 1.01585 0.953263 1.03027 -0.921365 0.984535 1.02796 1.01266 1.03374 0.949463 0.977925 0.986551 -0.920912 0.940926 -0.955562 0.975386 -0.981959 -0.884042 0.0516955 -0.980884 -0.966095 0.995091 1.023 1.01489 1.00308 0.948314 1.01137 -0.845876 0.968034 1.0076 1.00064 1.01335 0.942633 0.965703 0.979212 -0.861236 0.935055 -0.91739 0.970223 -0.97103 0.0743777 0.970321 -0.971215 -0.931582 0.972377 0.958193 0.931253 0.825797 0.954894 -0.972884 -0.941726 0.945077 0.922366 0.953999 -1.00503 0.840985 0.882229 -0.961742 0.791631 -0.984971 0.855911 -0.991528 -0.951211 -0.962096 -0.99213 -0.99708 -0.957557 -0.308987 -0.455442 -0.94881 -0.995319 -0.974945 -0.964637 -0.902152 -0.955258 -1.05287 -1.00614 -0.
Update:
Just trained the SVM with svm_type c-SVC and kernel_type linear, which gave a non-zero (although very poor) accuracy.
As mentioned by @Pedrom, parameter choice is absolutely crucial when training SVMs. I suggest you have a look at this practical guide. Also, 431 words is nowhere near enough to train a 25-class model. You will definitely need more data.
That said, 0% accuracy is indeed odd. Can you please show us the commands you are using to train and evaluate the model?
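Not your exact toolchain, but to illustrate what "parameter choice" means in practice, here is a rough scikit-learn sketch of the two steps the practical guide emphasises, feature scaling and a grid search over C and gamma (with the LibSVM command-line tools, svm-scale and the bundled grid-search script play the same roles):

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

# X: the feature vectors from your format converters, y: the 25 POS labels
pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

param_grid = {
    'svc__C': [2 ** k for k in range(-5, 16, 2)],       # coarse grid from the guide
    'svc__gamma': [2 ** k for k in range(-15, 4, 2)],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
# search.fit(X, y)
# print(search.best_params_, search.best_score_)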