Does anyone know of a function for plotting the measures obtained in Caffe? I would like to plot train loss, test loss, accuracy, a moving average of the train loss, and so on, in one plot. Is there any function available online other than Caffe's built-in ones?
Edited:
First, I ran the parse_log.py script with the following command:
$python /path/to/caffe/tools/extra/parse_log.py /logfile_path/logfile.log /output_dir
Two files are created from the log file (logfile.log.train and logfile.log.test). After that, I ran the plot_training_log.py script. It has the following options:
0: Test accuracy vs. Iters
1: Test accuracy vs. Seconds
2: Test loss vs. Iters
3: Test loss vs. Seconds
4: Train learning rate vs. Iters
5: Train learning rate vs. Seconds
6: Train loss vs. Iters
7: Train loss vs. Seconds
When I choose option 3, it shows the following graph:
and when I choose option 0:
However, whenever I want to plot the train-loss figure, it gives an error:
$python /path/to/caffe/tools/extra/plot_training_log.py.example 6 /output_dir/train_loss_cnn1.png ./logfile.log
Traceback (most recent call last):
File "/home/ss/caffe-master/tools/extra/plot_training_log.py.example", line 191, in <module>
plot_chart(chart_type, path_to_png, path_to_logs)
File "/home/ss/caffe-master/tools/extra/plot_training_log.py.example", line 117, in plot_chart
data = load_data(data_file, x, y)
File "/home/ss/caffe-master/tools/extra/plot_training_log.py.example", line 88, in load_data
data[1].append(float(fields[field_idx1].strip()))
ValueError: invalid literal for float(): 0.522037s/50
My question can be broken down into three parts:
Are the plots correct? Is the network behaving well?
Where does this error stem from? I have the following columns in logfile.log.train (#Iters | Seconds | TrainingLoss | LearningRate).
How can I show all chart types in one plot? I tried to combine them with commas, like 0,2,3,6, but it shows an error.
Many thanks in advance.
Take a look at parse_log.py found in $CAFFE_ROOT/tools/extra.
This Python utility helps parse and distill information from a Caffe running log.
Start training your model by executing the command below:
/home/ubuntu/caffe/build/tools/caffe train --solver /home/ubuntu/yourpath/solver.prototxt 2>&1 | tee /home/ubuntu/yourpath/model_train.log
The training logs will be stored under yourpath/model_train.log.
I haven't looked at Caffe's built-in plot scripts, but I use the script from here. It only plots your train/test loss, but you can add a moving-average calculation (a sketch follows below).
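As a minimal sketch (assuming the parsed files are named logfile.log.train and logfile.log.test as in the question, and that they use the NumIters/loss column headers written by parse_log.py; adjust the names to whatever your Caffe version emits), the train loss, its moving average, and the test loss can be overlaid in one figure like this:
# Plot train/test loss from the CSVs produced by parse_log.py, plus a simple
# moving average of the train loss. Column names are assumptions; check the
# header of your parsed files.
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv('logfile.log.train')
test = pd.read_csv('logfile.log.test')

# 20-point moving-average window (an arbitrary choice, tune to taste)
train['loss_ma'] = train['loss'].rolling(window=20, min_periods=1).mean()

plt.plot(train['NumIters'], train['loss'], label='train loss', alpha=0.4)
plt.plot(train['NumIters'], train['loss_ma'], label='train loss (moving avg)')
plt.plot(test['NumIters'], test['loss'], label='test loss')
plt.xlabel('Iterations')
plt.ylabel('Loss')
plt.legend()
plt.savefig('loss_plot.png')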
Consider also installing DIGITS, which provides a real-time plot of all that kind of information.
I would like to use the YOLO architecture for object detection. Before training the network on my custom data, I followed these steps to train it on the Pascal VOC data: https://pjreddie.com/darknet/yolo/
The instructions are very clear.
But after the final step
./darknet detector train cfg/voc.data cfg/yolo-voc.cfg darknet19_448.conv.23
darknet immediately stops training and announces that weights have been written to the backups/ directory.
At first I thought that the pretraining was simply too good and that the stopping criterion would be reached at once.
So I used the ./darknet detect command with these weights on one of the test images, data/dog. Nothing is found.
If I don't use any pretrained weights, the network does train.
I've edited cfg/yolo-voc.cfg to use
# Testing
#batch=1
#subdivisions=1
# Training
batch=32
subdivisions=8
Now the training process has been running for many hours and is keeping my GPU warm.
Is this the intended way to train darknet?
How can I use pretrained weights correctly, without training just breaking off?
Is there any setting to create checkpoints, or to get an idea of the progress?
Adding -clear 1 at the end of your training command will clear the stats of how many images this model has seen in previous training. You can then fine-tune your model on a new dataset.
You can find more info about the usage in the function signature
void train_detector(char *datacfg, char *cfgfile, char *weightfile, int *gpus, int ngpus, int clear)
at https://github.com/pjreddie/darknet/blob/b13f67bfdd87434e141af532cdb5dc1b8369aa3b/examples/detector.c
I doubt that increasing the max number of iterations is a good idea, as the learning rate is usually tied to the current iteration number. We usually increase the max number of iterations when we want to resume a previous training task that ended because it reached the max number of iterations, but we believe that more iterations will give better results.
FYI, when you have a small dataset, training on it from scratch or from a classification network may not be a great idea. You may still want to reuse the weights from a detection network trained on a large dataset like COCO or ImageNet.
This is an old question so I hope you have your answer by now, but here is mine just in case it helps.
After working with darknet for about a month, I've run into most of the roadblocks that people have asked/posted about on forums. In your case, I'm pretty certain it's because the weights had already been trained for the max number of batches, so when the pre-trained weights were read in, darknet assumed training was done.
Relevant personal experience: when I used one of the pretrained weights files, it started from iteration 40101 and ran until 40200 before cutting off.
I would stick to training from scratch if you have custom data, but if you want to try the pre-trained weights again, you might find that changing max_batches in the cfg file helps.
Also, if you are using AlexeyAB/darknet, there may be a problem with the -clear option;
in detector.c:
if (clear) *nets[k].seen = 0;
should really be:
if (clear) { *nets[k].seen = 0; *nets[k].cur_iteration = 0; }
otherwise the training loop will exit immediately.
Set the OpenCV flag in your darknet/Makefile to 0:
OpenCV=0
During training, I write the log output to a file using the command below:
~/caffe/build/tools/caffe train --solver=solver.prototxt -gpu 0 2>&1 | tee -a my_log.log
To extract the values, I used the Python script:
python ~/caffe/tools/extra/parse_log.py ./my_model.log .
The output is as follows:
NumIters,Seconds,LearningRate,loss
0.0,2.538275,0.002,1.38629
20.0,56.872385,0.002,1.1333
40.0,106.103729,0.002,0.245525
60.0,144.78454,0.002,0.31936
80.0,168.363851,0.002,0.160776
100.0,191.590772,0.002,1.06693
120.0,215.290937,0.002,0.549629
140.0,238.70139,0.002,0.139573
160.0,262.053791,0.002,0.328959
180.0,286.324327,0.002,0.326179
The batch_size is 4. How can I draw the training-loss graph with epoch on the x-axis and loss on the y-axis? I can only draw a graph with iteration on the x-axis and loss on the y-axis.
epoch_no = iteration_no * size_of_iteration / total_number_of_samples,
where:
iteration_no - first column in your report,
size_of_iteration is defined in your prototxt file (the batch_size parameter of your data layer, if you use an ordinary data layer),
total_number_of_samples is the number of samples in your training database.
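As a minimal sketch (assuming the CSV produced by the parse_log.py command above, with the NumIters/loss columns shown in the output, batch_size = 4, and a hypothetical total of 1000 training samples that you should replace with your own count), the conversion and plot could look like this:
# Convert iterations to epochs and plot training loss vs. epoch.
import pandas as pd
import matplotlib.pyplot as plt

batch_size = 4
total_number_of_samples = 1000  # hypothetical; use the size of your training set

log = pd.read_csv('my_model.log.train')
epochs = log['NumIters'] * batch_size / total_number_of_samples

plt.plot(epochs, log['loss'])
plt.xlabel('Epoch')
plt.ylabel('Training loss')
plt.savefig('loss_vs_epoch.png')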
I am classifying short texts (tweets) using Naive Bayes (MultinomialNB) in scikit-learn.
My train data has 1000 features, and my test data has 1200 features.
Let's say 500 features are common for both train and test data.
I wonder why MultinomialNB in scikit-learn does not handle unseen features and instead gives me an error:
Traceback (most recent call last):
File "/Users/osopova/Documents/00_KSU_Masters/01_2016_Spring/Twitter_project/mda_project_1/step_4.py", line 60, in <module>
predict_Y = classifiers[i].predict(test_X)
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 65, in predict
jll = self._joint_log_likelihood(X)
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 672, in _joint_log_likelihood
return (safe_sparse_dot(X, self.feature_log_prob_.T)
File "/Library/Python/2.7/site-packages/sklearn/utils/extmath.py", line 184, in safe_sparse_dot
return fast_dot(a, b)
ValueError: matrices are not aligned
It does not handle unseen features because you do not pass any reference naming the features. Why do you have 1200 features in one case and 1000 in the other? Probably because there were objects in the test set that were not present in training - but how is Naive Bayes supposed to figure out which of those 1200 are missing from the 1000? In this implementation (the only one possible when you assume arrays as input) it is your duty to remove all columns that do not correspond to the ones in the training set, to add columns of zeros (in the right spots) if it is the other way around, and most importantly to make sure that the i-th column in one set is the same (captures the occurrence of the same word/object) as the i-th column in the other. Consequently, in your case only 500 columns can actually be used, and Naive Bayes has no information on how to find them. You have to provide, at test time, the same 1000 features that were used in training, which in your case means removing the 700 columns not seen during training and adding (in the right spots!) the 500 missing columns of zeros.
In particular, scikit-learn gives you plenty of data-preprocessing utilities that do this for you (like CountVectorizer, etc.).
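For instance, a minimal sketch along those lines (the tweet lists and labels below are placeholders): fit CountVectorizer on the training texts only, then reuse it on the test texts so both matrices share the same columns and unseen test words are simply dropped:
# Fit the vectorizer on the training texts only, then reuse it on the test
# texts so both matrices live in the same feature space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["example tweet one", "another training tweet"]   # placeholder data
train_labels = [0, 1]                                           # placeholder labels
test_texts = ["a tweet with some unseen words"]                 # placeholder data

vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(train_texts)   # learns the vocabulary
test_X = vectorizer.transform(test_texts)         # same columns as train_X

clf = MultinomialNB().fit(train_X, train_labels)
predict_Y = clf.predict(test_X)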
Here is the deal.
I am trying to make an SVM based POS tagger.
The feature vectors for the SVM were created with the help of format converters.
Now here is a screenshot of the training file that I am using.
http://tinypic.com/r/n4fn2r/8
I have 25 labels for various POS tags. When I use the Java implementation or the command-line tools for prediction, I get the following results.
http://tinypic.com/r/2dtw5ky/8
I have tried all the available kernels, but they gave more or less the same results.
This is happening even when the training file is used as the testing file.
Please help me out here!
P.S. I cannot share more than two links, so here is a snippet of the model file:
svm_type c_svc
kernel_type rbf
gamma 0.000548546
nr_class 25
total_sv 431
rho -0.929467 1.01073 1.0531 1.03472 1.01585 0.953263 1.03027 -0.921365 0.984535 1.02796 1.01266 1.03374 0.949463 0.977925 0.986551 -0.920912 0.940926 -0.955562 0.975386 -0.981959 -0.884042 0.0516955 -0.980884 -0.966095 0.995091 1.023 1.01489 1.00308 0.948314 1.01137 -0.845876 0.968034 1.0076 1.00064 1.01335 0.942633 0.965703 0.979212 -0.861236 0.935055 -0.91739 0.970223 -0.97103 0.0743777 0.970321 -0.971215 -0.931582 0.972377 0.958193 0.931253 0.825797 0.954894 -0.972884 -0.941726 0.945077 0.922366 0.953999 -1.00503 0.840985 0.882229 -0.961742 0.791631 -0.984971 0.855911 -0.991528 -0.951211 -0.962096 -0.99213 -0.99708 -0.957557 -0.308987 -0.455442 -0.94881 -0.995319 -0.974945 -0.964637 -0.902152 -0.955258 -1.05287 -1.00614 -0.
Update:
I just trained the SVM with svm_type c-SVC and kernel_type linear, which gave a non-zero (although very poor) accuracy.
As mentioned by @Pedrom, parameter choice is absolutely crucial when training SVMs. I suggest you have a look at this practical guide. Also, 431 words is nowhere near enough to train a 25-class model. You will definitely need more data.
That said, 0% accuracy is indeed odd. Can you please show us the commands you are using to train and evaluate the model?
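To illustrate that parameter-search advice, here is a sketch of the C/gamma grid search the practical guide recommends, written with scikit-learn's SVC (a wrapper around the same libsvm library) rather than the command-line tools; the feature matrix and labels are placeholders for your converted POS vectors:
# Grid search over C and gamma for an RBF-kernel SVM. X and y are placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X = np.random.rand(500, 40)   # placeholder feature vectors
y = np.arange(500) % 25       # placeholder POS tag labels (25 classes)

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1e-4, 1e-3, 1e-2, 1e-1],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)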
I am currently trying to use SGDRegressor from scikit-learn to solve a multivariate-target problem over a large dataset, X ~= (10^6, 10^4). As such, I am generating the design matrix (X) in parts with the following code, where each iteration produces a batch of size roughly (10^3, 10^4):
design = self.__iterX__(events)
reglins = [linear_model.SGDRegressor(fit_intercept=True) for i in range(nTargets)]
for X, times in design:
    for i in range(nTargets):
        reglins[i].partial_fit(X, y.ix[times].values[:,i])
However I get the following stack trace:
File ".../Enthought/Canopy_64bit/User/lib/python2.7/site- packages/sklearn/linear_model/stochastic_gradient.py", line 841, in partial_fit
coef_init=None, intercept_init=None)
File ".../Enthought/Canopy_64bit/User/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py", line 812, in _partial_fit
sample_weight, n_iter)
File ".../Enthought/Canopy_64bit/User/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py", line 948, in _fit_regressor
intercept_decay)
File "sgd_fast.pyx", line 508, in sklearn.linear_model.sgd_fast.plain_sgd (sklearn/linear_model/sgd_fast.c:8651)
ValueError: floating-point under-/overflow occurred.
Looking around, it seems that this can be caused by not normalizing X properly. I understand scikit-learn has a variety of functions for this; however, given that I generate X in blocks, is it enough to simply normalize each block, or would I need to figure out a way to normalize whole columns at a time?
Incidentally, is there a particular reason that the partial_fit function does not allow multivariate targets?
You can fit one block and apply to others:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
x1 = scaler.fit_transform(X_block_1)
xn = scaler.transform(X_block_n)
You can choose other normalization methods from this page.
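As a rough sketch of how that could slot into the block loop from the question (design, nTargets, and y are the objects from the question's snippet and are assumed to already exist; the scaler statistics are learned on the first block only, per the suggestion above):
# Fit the scaler on the first block, then reuse the same statistics for every
# subsequent block before calling partial_fit.
from sklearn import preprocessing
from sklearn import linear_model

scaler = preprocessing.StandardScaler()
reglins = [linear_model.SGDRegressor(fit_intercept=True) for i in range(nTargets)]

for block_idx, (X, times) in enumerate(design):
    if block_idx == 0:
        X = scaler.fit_transform(X)   # learn mean/std on the first block
    else:
        X = scaler.transform(X)       # reuse the same statistics
    for i in range(nTargets):
        reglins[i].partial_fit(X, y.ix[times].values[:, i])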