OpenCV 2.4.3 Haar Classifier Error AdaBoost misclass

I am using OpenCV 2.4.3 on Ubuntu 12.10 64-bit, and when I run opencv_haartraining I get the error message shown below. The training continues, so I don't think it is a critical error, but it nonetheless blatantly says 'Error'. I can't seem to find any solutions for this - what does it mean (what is AdaBoost?), why is it complaining about a 'misclass', and how can I fix it? Everything I found on Google referred to this as simply a 'warning' and basically said to forget about it. Thanks!
cd dots ; nice -20 opencv_haartraining -data dots_haarcascade -vec samples.vec -bg negatives.dat -nstages 20 -nsplits 2 -minhitrate 0.999 -maxfalsealarm 0.5 -npos 13 -nneg 10 -w 10 -h 10 -nonsym -mem 4000 -mode ALL
Data dir name: dots_w10_h10_haarcascade
Vec file name: samples.vec
BG file name: negatives.dat, is a vecfile: no
Num pos: 13
Num neg: 10
Num stages: 20
Num splits: 2 (tree as weak classifier)
Mem: 4000 MB
Symmetric: FALSE
Min hit rate: 0.999000
Max false alarm rate: 0.500000
Weight trimming: 0.950000
Equal weights: FALSE
Mode: ALL
Width: 10
Height: 10
Applied boosting algorithm: GAB
Error (valid only for Discrete and Real AdaBoost): misclass
Max number of splits in tree cascade: 0
Min number of positive samples per cluster: 500
Required leaf false alarm rate: 9.53674e-07
Stage 0 loaded
Stage 1 loaded
Stage 2 loaded
Stage 3 loaded
Stage 4 loaded
Stage 5 loaded
Stage 6 loaded
Stage 7 loaded
Tree Classifier
Stage
+---+---+---+---+---+---+---+---+
| 0| 1| 2| 3| 4| 5| 6| 7|
+---+---+---+---+---+---+---+---+
0---1---2---3---4---5---6---7
Number of features used : 7544
Parent node: 7
*** 1 cluster ***
POS: 13 96 0.135417

I don't think this is an error message; rather, it is a printout describing how the algorithm will measure its internal error rate. In this case it is using misclassification of the examples. Real and Discrete AdaBoost map input samples onto the output range [0,1], so there is a meaningful way of measuring the inaccuracy of the algorithm. If a different variant of AdaBoost is being used (here GAB, i.e. Gentle AdaBoost, as the printout shows), this error measure might cease to be meaningful.
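In other words, 'misclass' is just the fraction of training samples the boosted classifier labels incorrectly. A minimal sketch of that measure (the labels below are made up purely for illustration):

# Minimal sketch of the "misclass" error measure: the fraction of
# samples whose predicted label differs from the true label.
# These labels are made up for illustration only.
true_labels = [1, 1, 0, 1, 0, 0, 1, 0]
predicted   = [1, 0, 0, 1, 0, 1, 1, 0]

misclass = sum(t != p for t, p in zip(true_labels, predicted)) / len(true_labels)
print(f"misclassification rate: {misclass:.3f}")  # 0.250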

Related

Multi-class classification in sparse dataset

I have a dataset of factory workstations.
There are two types of errors occurring in the same time interval:
The user selects an error and a time interval (dependent variable, y)
The machines produce errors during production (independent variables, x)
There are 8 unique user-selected error types in total, so I tried to predict those errors using the machine-produced errors (188 types in total) and some other numerical features such as avg. machine speed, machine volume, etc.
Each row represents a user-selected error in a particular time interval.
For example, in the first line the user selects the time interval
2018-01-03 12:02:00 - 2018-01-03 12:05:37
and m_er_1 (machine error 1) also occurred 12 times in that same interval.
m_er_1_dur (machine error 1 duration) is the total duration of that machine error in seconds.
So I matched those two tables; the result looks like this:
user_error  m_er_1  m_er_2  m_er_3  ...  m_er_188  avg_m_speed  ...  m_er_1_dur
A           12      0       0       ...  0         150          ...  217
B           0       0       2       ...  0         10           ...  0
A           3       0       0       ...  6         34           ...  37
A           0       0       0       ...  0         5            ...  0
D           0       0       0       ...  0         3            ...  0
E           0       0       0       ...  0         1000         ...  0
In the end, I have 1900 rows and 390 columns (376 (= 188 machine error counts + 188 machine error durations) + 14 numerical features), and because of the machine errors it is a sparse dataset with lots of zeros.
There are no outliers and no NaN values. I normalized the data and tried several classification algorithms (SVM, Logistic Regression, MLPC, XGBoost, etc.).
I also tried PCA, but it didn't work well; for 165 components the explained_variance_ratio is around 0.95.
But the accuracy metrics are very low: for logistic regression the accuracy score is 0.55 and the MCC score is around 0.1; recall, f1 and precision are also very low.
Are there some steps that I missed? What would you suggest for multiclass classification on a sparse dataset?
Thanks in advance
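(For reference, a baseline pipeline matching the description above might look like the sketch below. The CSV file name and exact columns are assumptions, not part of the question; class_weight="balanced" is one common lever for skewed classes like these.)

# Hedged sketch of the baseline described above: scale the features,
# fit logistic regression, report MCC and per-class precision/recall/f1.
# The file name and columns are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef, classification_report

df = pd.read_csv("workstation_errors.csv")   # hypothetical file
X = df.drop(columns=["user_error"]).values
y = df["user_error"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(scaler.transform(X_train), y_train)

pred = clf.predict(scaler.transform(X_test))
print("MCC:", matthews_corrcoef(y_test, pred))
print(classification_report(y_test, pred))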

opencv_traincascade - samplOpenCV Error: Bad argument

Background: I am trying to train my own OpenCV Haar classifier for face detection. I am working on a VM with Ubuntu 16.04. My working directory has 2 sub-directories: face, containing 2429 positive images, and non-face, containing 4548 negative images. All images are PNG, grayscale, and 19 pixels in both width and height. I have generated a file positives.info that contains the absolute path to every positive image followed by " 1 0 0 18 18", like so:
/home/user/ML-Trainer/face/face1.png 1 0 0 18 18
/home/user/ML-Trainer/face/face2.png 1 0 0 18 18
/home/user/ML-Trainer/face/face3.png 1 0 0 18 18
and another file negatives.txt that contains the absolute path to every negative image:
/home/user/ML-Trainer/non-face/other1.png
/home/user/ML-Trainer/non-face/other2.png
/home/user/ML-Trainer/non-face/other3.png
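(A small sketch of how such annotation files can be generated; the script itself is an assumption, only the directory layout and the " 1 0 0 18 18" annotation come from the question.)

# Hedged sketch: generate positives.info and negatives.txt as described
# above. The directory layout and annotation follow the question.
import os

base = "/home/user/ML-Trainer"

with open("positives.info", "w") as f:
    for name in sorted(os.listdir(os.path.join(base, "face"))):
        if name.endswith(".png"):
            f.write(f"{base}/face/{name} 1 0 0 18 18\n")

with open("negatives.txt", "w") as f:
    for name in sorted(os.listdir(os.path.join(base, "non-face"))):
        if name.endswith(".png"):
            f.write(f"{base}/non-face/{name}\n")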
First I ran the following command:
opencv_createsamples -info positives.info -vec positives.vec -num 2429 -w 19 -h 19
and I get positives.vec as expected. I then created an empty directory data and ran the following:
opencv_traincascade -data data -vec positives.vec -bg negatives.txt -numPos 2429 -numNeg 4548 -numStages 10 -w 19 -h 19 &
It seems to run smoothly:
PARAMETERS:
cascadeDirName: data
vecFileName: positives.vec
bgFileName: negatives.txt
numPos: 2429
numNeg: 4548
numStages: 10
precalcValBufSize[Mb] : 1024
precalcIdxBufSize[Mb] : 1024
acceptanceRatioBreakValue : -1
stageType: BOOST
featureType: HAAR
sampleWidth: 19
sampleHeight: 19
boostType: GAB
minHitRate: 0.995
maxFalseAlarmRate: 0.5
weightTrimRate: 0.95
maxDepth: 1
maxWeakCount: 100
mode: BASIC
Number of unique features given windowSize [19,19] : 63960
===== TRAINING 0-stage =====
<BEGIN
POS count : consumed 2429 : 2429
NEG count : acceptanceRatio 4548 : 1
Precalculation time: 13
+----+---------+---------+
| N | HR | FA |
+----+---------+---------+
| 1| 1| 1|
+----+---------+---------+
| 2| 1| 1|
+----+---------+---------+
| 3| 0.998765| 0.396218|
+----+---------+---------+
END>
Training until now has taken 0 days 0 hours 1 minutes 7 seconds.
But then I get the following error:
===== TRAINING 1-stage =====
<BEGIN
POS current samplOpenCV Error: Bad argument (Can not get new positive sample. The most possible reason is insufficient count of samples in given vec-file.
) in get, file /home/user/opencv-3.4.0/apps/traincascade/imagestorage.cpp, line 158
terminate called after throwing an instance of 'cv::Exception'
what(): /home/user/opencv-3.4.0/apps/traincascade/imagestorage.cpp:158: error: (-5) Can not get new positive sample. The most possible reason is insufficient count of samples in given vec-file.
in function get
How do I solve this:
samplOpenCV Error: Bad argument
Any help would be greatly appreciated.
EDIT:
I have modified -numPos to a smaller number, 2186 (0.9 * 2429); I did this after reading this answer, and it got me to
===== TRAINING 3-stage =====
before it gave me the same error. How should I tune the parameters for the opencv_createsamples command?
I eventually managed to make it work by respecting this formula:
vec-file >= (numPos + (numStages - 1) * (1 - minHitRate) * numPos + S)
numPos - the number of positive samples used to train each stage
numStages - the number of stages the cascade classifier will have after training
S - the count of all the samples skipped from the vec-file (over all stages)
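(Plugging in this question's numbers as a quick check - numPos = 2186, numStages = 10, and the default minHitRate of 0.995 shown in the training output above - gives the sketch below; S is only known after training, so it is left as a placeholder.)

# Worked example of the vec-file formula above, using this question's
# numbers. S (skipped samples) is unknown before training starts.
numPos = 2186        # positives consumed per stage
numStages = 10
minHitRate = 0.995   # traincascade default, as shown in the output above

S = 0  # placeholder; grows as positives are rejected during training
required = numPos + (numStages - 1) * (1 - minHitRate) * numPos + S
print(required)  # ~2284.4 + S, which must stay <= the 2429 samples in the vec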

Haar cascade resulting file is too small

I am trying to train a cascade to detect an area with specifically structured text (MRZ).
I've gathered 200 positive samples and 572 negative samples.
Training went as follows:
opencv_traincascade.exe -data cascades -vec vector/vector.vec -bg bg.txt -numPos 200 -numNeg 572 -numStages 3 -precalcValBufSize 2048 -precalcIdxBufSize 2048 -featureType LBP -mode ALL -w 400 -h 45 -maxFalseAlarmRate 0.8 -minHitRate 0.9988
PARAMETERS:
cascadeDirName: cascades
vecFileName: vector/vector.vec
bgFileName: bg.txt
numPos: 199
numNeg: 572
numStages: 3
precalcValBufSize[Mb] : 2048
precalcIdxBufSize[Mb] : 2048
acceptanceRatioBreakValue : -1
stageType: BOOST
featureType: LBP
sampleWidth: 400
sampleHeight: 45
boostType: GAB
minHitRate: 0.9988
maxFalseAlarmRate: 0.8
weightTrimRate: 0.95
maxDepth: 1
maxWeakCount: 100
Number of unique features given windowSize [400,45] : 8778000
===== TRAINING 0-stage =====
<BEGIN
POS count : consumed 199 : 199
NEG count : acceptanceRatio 572 : 1
Precalculation time: 26.994
+----+---------+---------+
|  N |    HR   |    FA   |
+----+---------+---------+
|   1|        1|        1|
+----+---------+---------+
|   2|        1|0.0244755|
+----+---------+---------+
END>
Training until now has taken 0 days 0 hours 36 minutes 35 seconds.
===== TRAINING 1-stage =====
<BEGIN
POS count : consumed 199 : 199
NEG count : acceptanceRatio 0 : 0
Required leaf false alarm rate achieved.
Branch training terminated.
The process ran for ~35 minutes and produced a 2 kB file with only 45 lines, which seems too small for a good cascade.
Needless to say, it doesn't detect the needed area.
I tried to tune the arguments but to no avail.
I know that it is better to use a larger set of samples, but I think that with this number of samples the result should still be somewhat reasonable, if not very accurate.
Is a haar cascade a good approach for detecting areas with specific text (MRZ)?
If so, how can better accuracy be achieved?
Thanks in advance.
You want to produce 3 stages with a maximum false alarm rate of 0.8 per stage. This means that after 3 stages the classifier will have a maximum false alarm rate of 0.8^3 = 0.512. But after your first stage, the classifier already reaches a false alarm rate of 0.0244755, which is much better than your final aim (0.512), so the classifier is already good enough and does not need any more stages (see the quick check below).
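(The arithmetic, spelled out:)

# The answer's arithmetic: the cascade's overall false alarm rate is
# bounded by maxFalseAlarmRate ** numStages.
maxFalseAlarmRate = 0.8
numStages = 3
print(maxFalseAlarmRate ** numStages)  # 0.512, the target after 3 stages
# The first stage alone already reached 0.0244755, far below 0.512,
# so training stops early.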
If that's not fine for you, increase numStages or decrease maxFalseAlarmRate to the point where you don't reach the "final quality" within your first stage.
You will probably have to collect more samples, and samples that represent the environment better; reaching such low false alarm rates so early is typically a sign of bad training data (too simple or too similar?).
I can't tell you whether Haar cascades are appropriate for solving your task.

Cascade classifier can't be trained. Check the used training parameters

I need to detect a special image (something like the symbol +) in scanned documents. I am going to train a cascade using the opencv_traincascade program (OpenCV 3.0).
This is my file structure:
C:\imgs\learn1
Bad
1.bmp
....
Good
1.bmp
....
Bad.dat
Good.dat
This is my Bad.dat:
Bad\1.bmp
...
Bad\53.bmp
Bad\img001.jpg
...
Bad\img146.jpg
This is my Good.dat (every good file fully contains the special image and nothing more):
Good\1.bmp 1 0 0 60 59
...
Good\100.bmp 1 0 0 27 28
I've successfully created the vec file.
C:\opencv\build\x64\vc12\bin>opencv_createsamples.exe
-info C:\imgs\learn1\Good.dat
-vec samples.vec
-w 10 -h 10
Info file name: C:\imgs\learn1\Good.dat
Img file name: (NULL)
Vec file name: samples.vec
BG file name: (NULL)
Num: 1000
BG color: 0
BG threshold: 80
Invert: FALSE
Max intensity deviation: 40
Max x angle: 1.1
Max y angle: 1.1
Max z angle: 0.5
Show samples: FALSE
Width: 10
Height: 10
Create training samples from images collection...
C:\imgs\learn1\Good.dat(101) : parse errorDone. Created 100 samples
This is the call and the result of opencv_traincascade:
C:\opencv\build\x64\vc12\bin>
opencv_traincascade.exe
-data haarcascade
-vec C:\opencv\build\x64\vc12\bin\samples.vec
-bg C:\imgs\learn1\Bad.dat
-numStages 16
-minhiteate 0.99
-maxFalseAlarmRate 0.5
-numPos 80
-numNeg 199
-w 10
-h 10
-mode ALL
-precalcValBufSize 1024
-precalcIdxBufSize 1024
PARAMETERS:
cascadeDirName: haarcascade
vecFileName: C:\opencv\build\x64\vc12\bin\samples.vec
bgFileName: C:\imgs\learn1\Bad.dat
numPos: 80
numNeg: 199
numStages: 16
precalcValBufSize[Mb] : 1024
precalcIdxBufSize[Mb] : 1024
acceptanceRatioBreakValue : -1
stageType: BOOST
featureType: HAAR
sampleWidth: 10
sampleHeight: 10
boostType: GAB
minHitRate: 0.995
maxFalseAlarmRate: 0.5
weightTrimRate: 0.95
maxDepth: 1
maxWeakCount: 100
mode: ALL
===== TRAINING 0-stage =====
<BEGIN
POS count : consumed 80 : 80
Train dataset for temp stage can not be filled. Branch training terminated.
Cascade classifier can't be trained. Check the used training parameters.
As you can see, there is some error. Can you help me figure out what exactly is wrong? "Check the used training parameters" is a very general phrase.
(The folder C:\opencv\build\x64\vc12\bin\haarcascade exists)
I don't know exactly what was wrong, but I've got it working:
1) I increased the number of positive examples to 400.
2) I increased the number of negative examples to 398.
3) I found that if an image is 61 x 60, I should write in Good.dat:
Good\1.bmp 1 0 0 60 59
(Image coordinates begin from 0 and end at the width-1 and height-1 values.)
4) I found a typo: minhiteate -> minHitRate
...and none of that helped.
5) I tried training in OpenCV 2.4 and got my cascade.xml file.
But now I can't use it because of another error; that's off-topic, though. (Now I'm googling.)
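(For completeness, once a cascade.xml exists it is typically loaded and applied as in the sketch below; the file names here are assumptions.)

# Hedged sketch: load the resulting cascade.xml and run it on a
# scanned page. File names are assumptions.
import cv2

cascade = cv2.CascadeClassifier("cascade.xml")
img = cv2.imread("scanned_document.png", cv2.IMREAD_GRAYSCALE)

# detectMultiScale returns (x, y, w, h) rectangles for each hit
marks = cascade.detectMultiScale(img, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in marks:
    print("found symbol at", x, y, w, h)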

Vowpal Wabbit - precision recall f-measure

How do you usually get precision, recall and f-measure from a model created in Vowpal Wabbit on a classification problem?
Are there any available scripts or programs that are commonly used for this with vw's output?
To make a minimal example, use the following data in playtennis.txt:
2 | sunny 85 85 false
2 | sunny 80 90 true
1 | overcast 83 78 false
1 | rain 70 96 false
1 | rain 68 80 false
2 | rain 65 70 true
1 | overcast 64 65 true
2 | sunny 72 95 false
1 | sunny 69 70 false
1 | rain 75 80 false
1 | sunny 75 70 true
1 | overcast 72 90 true
1 | overcast 81 75 false
2 | rain 71 80 true
I create the model with:
vw playtennis.txt --oaa 2 -f playtennis.model --loss_function logistic
Then, I get predictions and raw predictions of the trained model on the training data itself with:
vw -t -i playtennis.model playtennis.txt -p playtennis.predict -r playtennis.rawp
Going from here, what scripts or programs do you usually use to get precision, recall and f-measure, given training data playtennis.txt and the predictions on the training data in playtennis.predict?
Also, if this were a multi-label classification problem (each instance can have more than one target label, which vw can also handle), would your proposed scripts or programs be capable of processing these?
Given that you have a pair of 'predicted vs actual' values for each example, you can use Rich Caruana's KDD perf utility to compute these (and many other) metrics.
In the case of multi-class, you should simply consider every correctly classified case a success and every class-mismatch a failure to predict correctly.
Here's a more detailed recipe for the binary case:
# get the labels into *.actual (correct) file
$ cut -d' ' -f1 playtennis.txt > playtennis.actual
# paste the actual vs predicted side-by-side (+ cleanup trailing zeros)
$ paste playtennis.actual playtennis.predict | sed 's/\.0*$//' > playtennis.ap
# convert original (1,2) classes to binary (0,1):
$ perl -pe 's/1/0/g; s/2/1/g;' playtennis.ap > playtennis.ap01
# run perf to determine precision, recall and F-measure:
$ perf -PRE -REC -PRF -file playtennis.ap01
PRE 1.00000 pred_thresh 0.500000
REC 0.80000 pred_thresh 0.500000
PRF 0.88889 pred_thresh 0.500000
Note that as Martin mentioned, vw uses the {-1, +1} convention for binary classification, whereas perf uses the {0, 1} convention so you may have to translate back and forth when switching between the two.
For binary classification, I would recommend using labels +1 (play tennis) and -1 (don't play tennis) with --loss_function=logistic (although --oaa 2 and labels 1 and 2 can be used as well). VW then reports the logistic loss, which may be a more informative/useful evaluation measure than accuracy/precision/recall/f1 (depending on the application). If you want 0/1 loss (i.e. "one minus accuracy"), add --binary.
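(A small sketch of that relabeling; the output file name is an assumption, and which class maps to +1 is a free choice - here 2 -> +1.)

# Hedged sketch: rewrite the 1/2 labels in playtennis.txt to vw's
# -1/+1 binary convention before training with --loss_function logistic.
# Mapping 2 -> +1 is an arbitrary choice here.
with open("playtennis.txt") as src, open("playtennis_pm1.txt", "w") as dst:
    for line in src:
        label, rest = line.split("|", 1)
        dst.write(("+1 |" if label.strip() == "2" else "-1 |") + rest)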
For precision, recall, f1-score, auc and other measures, you can use the perf tool as suggested in arielf's answer.
For standard multi-class classification (one correct class for each example), use --oaa N --loss_function=logistic and VW will report the 0/1 loss.
For multi-label multi-class classification (more correct labels per example allowed), you can use --multilabel_oaa N (or convert each original example into N binary-classification examples).
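(If the perf utility is not at hand, the same precision/recall/f1 numbers can be computed from the two files produced in the recipe above with scikit-learn; this sketch keeps the original 1/2 labels and treats class 2 as the positive class, matching the 2 -> 1 mapping used for perf.)

# Hedged sketch: compute precision/recall/f1 from playtennis.actual and
# playtennis.predict (as produced above) using scikit-learn instead of perf.
from sklearn.metrics import precision_recall_fscore_support

actual = [int(line.split()[0]) for line in open("playtennis.actual")]
predicted = [int(float(line.split()[0])) for line in open("playtennis.predict")]

p, r, f1, _ = precision_recall_fscore_support(
    actual, predicted, average="binary", pos_label=2)
print(f"precision={p:.5f} recall={r:.5f} f1={f1:.5f}")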
