Extract feature values from Solr - machine-learning

I have 80,000 questions & answers indexed in Solr, and a feature file.
I'm trying to extract the feature values for each Q&A pair in order to train a learning-to-rank algorithm (such as LambdaMART).
The training algorithm takes input in this format:
<label> qid:<qid> <feature>:<value> ... <feature>:<value> # <info>
For example:
3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B
1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A
Can anyone help me extract those feature values?
Thanks!
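One possible approach, sketched below: if the Solr LTR contrib module is enabled and the features from your feature file are uploaded to a feature store, the [features] document transformer can return per-document feature values, which a small script can reshape into the format above. This is a sketch, not a drop-in solution: the core name, feature store name, and efi parameter are assumptions about your setup, and the relevance labels still have to come from your own judgments.

import json
import urllib.parse
import urllib.request

SOLR_URL = "http://localhost:8983/solr/qa/query"   # hypothetical core name
FEATURE_STORE = "myFeatureStore"                   # hypothetical store name

def letor_lines(qid, query_text, judgments):
    """Return LETOR-format lines for one query; judgments maps doc id -> label."""
    fl = "id,[features store=%s efi.text='%s']" % (FEATURE_STORE, query_text)
    params = urllib.parse.urlencode({"q": query_text, "fl": fl, "rows": 100})
    with urllib.request.urlopen(SOLR_URL + "?" + params) as resp:
        docs = json.load(resp)["response"]["docs"]
    lines = []
    for doc in docs:
        # "[features]" comes back as "name1=0.0,name2=1.0,..."; number them 1..N
        values = [pair.split("=")[1] for pair in doc["[features]"].split(",")]
        feats = " ".join("%d:%s" % (i + 1, v) for i, v in enumerate(values))
        # <label> is the relevance judgment you supply, defaulting to 0 here
        lines.append("%d qid:%d %s # %s" % (judgments.get(doc["id"], 0), qid, feats, doc["id"]))
    return lines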

Related

Xgboost objective rank is position sensitive

I'm running XGBoost with the objective rank:ndcg, and the input dataset needs to be in libsvm format.
I'm wondering whether the libsvm row order is position sensitive for this objective, since we know position matters in search (a higher position means a higher probability of a click).
libsvm example of 1 query group:
0 qid:6 1:1 3:5 4:65 5:281 6:2 7:15
0 qid:6 1:2 3:15 4:68 5:13 6:2 7:14
1 qid:6 1:3 3:75 4:65 5:11 6:2 7:9
0 qid:6 1:4 3:20 4:65 5:113 6:2 7:10
2 qid:6 1:5 3:5 4:68 5:83 6:2 7:51
0 qid:6 1:6 3:20 4:65 5:116 6:2 7:3
1 qid:6 1:7 3:25 4:65
Is the ordering of the rows relevant?
I know that position is important for NDCG, but how is position taken into account in the XGBoost implementation?
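A minimal sketch of how the group information is passed in the Python API may make this concrete (the feature data below is made up): the ranking objectives see only the relevance labels and the group boundaries, so the row order inside a group carries no position signal by itself; if display position matters, it has to be encoded as an explicit feature.

import numpy as np
import xgboost as xgb

# 7 documents of one query group, with 7 random placeholder features each
X = np.random.rand(7, 7)
y = np.array([0, 0, 1, 0, 2, 0, 1])   # relevance labels from the example above

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group([7])                 # one group containing all 7 rows

params = {"objective": "rank:ndcg", "eta": 0.1}
bst = xgb.train(params, dtrain, num_boost_round=10)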

opencv_traincascade - samplOpenCV Error: Bad argument

Background: I am trying to train my own OpenCV Haar classifier for face detection. I am working on a VM with Ubuntu 16.04. My working directory has 2 sub-directories: face, containing 2429 positive images, and non-face, containing 4548 negative images. All images are PNG, grayscale, and 19 pixels in both width and height. I have generated a file positives.info that contains the absolute path to every positive image followed by " 1 0 0 18 18", like so:
/home/user/ML-Trainer/face/face1.png 1 0 0 18 18
/home/user/ML-Trainer/face/face2.png 1 0 0 18 18
/home/user/ML-Trainer/face/face3.png 1 0 0 18 18
and another file negatives.txt that contains the absolute path to every negative image:
/home/user/ML-Trainer/non-face/other1.png
/home/user/ML-Trainer/non-face/other2.png
/home/user/ML-Trainer/non-face/other3.png
First I ran the following command:
opencv_createsamples -info positives.info -vec positives.vec -num 2429 -w 19 -h 19
and I got positives.vec as expected. I then created an empty directory data and ran the following:
opencv_traincascade -data data -vec positives.vec -bg negatives.txt -numPos 2429 -numNeg 4548 -numStages 10 -w 19 -h 19 &
It seems to run smoothly:
PARAMETERS:
cascadeDirName: data
vecFileName: positives.vec
bgFileName: negatives.txt
numPos: 2429
numNeg: 4548
numStages: 10
precalcValBufSize[Mb] : 1024
precalcIdxBufSize[Mb] : 1024
acceptanceRatioBreakValue : -1
stageType: BOOST
featureType: HAAR
sampleWidth: 19
sampleHeight: 19
boostType: GAB
minHitRate: 0.995
maxFalseAlarmRate: 0.5
weightTrimRate: 0.95
maxDepth: 1
maxWeakCount: 100
mode: BASIC
Number of unique features given windowSize [19,19] : 63960
===== TRAINING 0-stage =====
<BEGIN
POS count : consumed 2429 : 2429
NEG count : acceptanceRatio 4548 : 1
Precalculation time: 13
+----+---------+---------+
| N | HR | FA |
+----+---------+---------+
| 1| 1| 1|
+----+---------+---------+
| 2| 1| 1|
+----+---------+---------+
| 3| 0.998765| 0.396218|
+----+---------+---------+
END>
Training until now has taken 0 days 0 hours 1 minutes 7 seconds.
But then I get the following error:
===== TRAINING 1-stage =====
<BEGIN
POS current samplOpenCV Error: Bad argument (Can not get new positive sample. The most possible reason is insufficient count of samples in given vec-file.
) in get, file /home/user/opencv-3.4.0/apps/traincascade/imagestorage.cpp, line 158
terminate called after throwing an instance of 'cv::Exception'
what(): /home/user/opencv-3.4.0/apps/traincascade/imagestorage.cpp:158: error: (-5) Can not get new positive sample. The most possible reason is insufficient count of samples in given vec-file.
in function get
How do I solve this:
samplOpenCV Error: Bad argument
Any help would be greatly appreciated.
EDIT:
I modified -numPos to a smaller number, 2186 (0.9 * 2429), after reading this answer. That got me to
===== TRAINING 3-stage =====
but then it fails with the same error. How should I tune the parameters for the opencv_createsamples command?
I eventually managed to make it work by respecting this formula:
vec-file >= (numPos + (numStages - 1) * (1 - minHitRate) * numPos + S)
numPos - the number of positive samples used to train each stage
numStages - the number of stages the cascade classifier will have after training
S - the count of all samples skipped from the vec-file (across all stages)
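In code, that check looks roughly like this (a sketch of the formula above; S must be estimated or taken from the training log):

def required_vec_count(num_pos, num_stages, min_hit_rate, skipped=0):
    # Minimum sample count the vec-file must contain, per the formula above
    return num_pos + (num_stages - 1) * (1 - min_hit_rate) * num_pos + skipped

# e.g. numPos=2186, numStages=10, minHitRate=0.995, ignoring skipped samples:
print(required_vec_count(2186, 10, 0.995))   # ~2284.4, so a 2429-sample vec-file can suffice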

Haar cascade resulting file is too small

I am trying to train a cascade to detect an area with specifically structured text (MRZ).
I've gathered 200 positive samples and 572 negative samples.
Training went as follows:
opencv_traincascade.exe -data cascades -vec vector/vector.vec -bg bg.txt -numPos 200 -numNeg 572 -numStages 3 -precalcValBufSize 2048 -precalcIdxBufSize 2048 -featureType LBP -mode ALL -w 400 -h 45 -maxFalseAlarmRate 0.8 -minHitRate 0.9988
PARAMETERS:
cascadeDirName: cascades
vecFileName: vector/vector.vec
bgFileName: bg.txt
numPos: 199
numNeg: 572
numStages: 3
precalcValBufSize[Mb] : 2048
precalcIdxBufSize[Mb] : 2048
acceptanceRatioBreakValue : -1
stageType: BOOST
featureType: LBP
sampleWidth: 400
sampleHeight: 45
boostType: GAB
minHitRate: 0.9988
maxFalseAlarmRate: 0.8
weightTrimRate: 0.95
maxDepth: 1
maxWeakCount: 100
Number of unique features given windowSize [400,45] : 8778000
===== TRAINING 0-stage =====
<BEGIN
POS count : consumed 199 : 199
NEG count : acceptanceRatio 572 : 1
Precalculation time: 26.994
+----+---------+---------+
| N | HR | FA |
+----+---------+---------+
| 1| 1| 1|
+----+---------+---------+
| 2| 1|0.0244755|
+----+---------+---------+
END>
Training until now has taken 0 days 0 hours 36 minutes 35 seconds.
===== TRAINING 1-stage =====
<BEGIN
POS count : consumed 199 : 199
NEG count : acceptanceRatio 0 : 0
Required leaf false alarm rate achieved.
Branch training terminated.
The process ran for ~35 minutes and produced a 2 kB file of only 45 lines, which seems too small for a good cascade.
Needless to say, it doesn't detect the needed area.
I tried to tune the arguments, but to no avail.
I know that it is better to use a larger set of samples, but I think this number of samples should still produce a somewhat reasonable (if not very accurate) result.
Is a Haar cascade a good approach for detecting areas with specific text (MRZ)?
If so, how can better accuracy be achieved?
Thanks in advance.
You want to produce 3 stages with a maximum false alarm rate of 0.8 per stage. This means that after 3 stages the classifier will have a maximum false alarm rate of 0.8^3 = 0.512. But after your first stage the classifier already reaches a false alarm rate of 0.0244755, which is much better than your final aim (0.512), so the classifier is already good enough and does not need any more stages.
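The arithmetic, spelled out:

print(0.8 ** 3)            # 0.512, the overall false alarm target after 3 stages
print(0.0244755 <= 0.512)  # True: stage 1 alone already beats that target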
If that's not fine for you, increase numStages or decrease maxFalseAlarmRate so that you don't reach the "final quality" within your first stage.
You will probably have to collect more samples, and samples that represent the environment better; reaching such low false alarm rates so quickly is typically a sign of bad training data (too simple or too similar?).
I can't tell you whether Haar cascades are appropriate for solving your task.

Vowpal Wabbit - precision recall f-measure

How do you usually get precision, recall and f-measure from a model created in Vowpal Wabbit on a classification problem?
Are there any available scripts or programs that are commonly used for this with vw's output?
To make a minimal example, use the following data in playtennis.txt:
2 | sunny 85 85 false
2 | sunny 80 90 true
1 | overcast 83 78 false
1 | rain 70 96 false
1 | rain 68 80 false
2 | rain 65 70 true
1 | overcast 64 65 true
2 | sunny 72 95 false
1 | sunny 69 70 false
1 | rain 75 80 false
1 | sunny 75 70 true
1 | overcast 72 90 true
1 | overcast 81 75 false
2 | rain 71 80 true
I create the model with:
vw playtennis.txt --oaa 2 -f playtennis.model --loss_function logistic
Then, I get predictions and raw predictions of the trained model on the training data itself with:
vw -t -i playtennis.model playtennis.txt -p playtennis.predict -r playtennis.rawp
Going from here, what scripts or programs do you usually use to get precision, recall and f-measure, given training data playtennis.txt and the predictions on the training data in playtennis.predict?
Also, if this were a multi-label classification problem (each instance can have more than one target label, which vw can also handle), would your proposed scripts or programs be capable of processing these?
Given that you have a 'predicted vs. actual' pair of values for each example, you can use Rich Caruana's KDD perf utility to compute these (and many other) metrics.
In the case of multi-class, you should simply consider every correctly classified case a success and every class-mismatch a failure to predict correctly.
Here's a more detailed recipe for the binary case:
# get the labels into *.actual (correct) file
$ cut -d' ' -f1 playtennis.txt > playtennis.actual
# paste the actual vs predicted side-by-side (+ cleanup trailing zeros)
$ paste playtennis.actual playtennis.predict | sed 's/\.0*$//' > playtennis.ap
# convert original (1,2) classes to binary (0,1):
$ perl -pe 's/1/0/g; s/2/1/g;' playtennis.ap > playtennis.ap01
# run perf to determine precision, recall and F-measure:
$ perf -PRE -REC -PRF -file playtennis.ap01
PRE 1.00000 pred_thresh 0.500000
REC 0.80000 pred_thresh 0.500000
PRF 0.88889 pred_thresh 0.500000
Note that as Martin mentioned, vw uses the {-1, +1} convention for binary classification, whereas perf uses the {0, 1} convention so you may have to translate back and forth when switching between the two.
For binary classification, I would recommend using labels +1 (play tennis) and -1 (don't play tennis) with --loss_function=logistic (although --oaa 2 with labels 1 and 2 can be used as well). VW then reports the logistic loss, which may be a more informative/useful evaluation measure than accuracy/precision/recall/f1 (depending on the application). If you want 0/1 loss (i.e. "one minus accuracy"), add --binary.
For precision, recall, f1-score, auc and other measures, you can use the perf tool as suggested in arielf's answer.
For standard multi-class classification (one correct class for each example), use --oaa N --loss_function=logistic and VW will report the 0/1 loss.
For multi-label multi-class classification (more correct labels per example allowed), you can use --multilabel_oaa N (or convert each original example into N binary-classification examples).
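If you'd rather not depend on the perf binary, the same numbers can be computed with a short script. This is a sketch assuming the playtennis.txt and playtennis.predict files from above; it uses scikit-learn, which is not part of VW:

from sklearn.metrics import precision_recall_fscore_support

# The first token of each playtennis.txt line is the true label (1 or 2)
actual = [int(line.split()[0]) for line in open("playtennis.txt")]
# vw -p writes one prediction per line, possibly with trailing zeros (e.g. "2.000000")
predicted = [int(float(line.split()[0])) for line in open("playtennis.predict")]

# Treat class 2 as the positive class, matching the 2 -> 1 mapping in the recipe above
p, r, f1, _ = precision_recall_fscore_support(actual, predicted,
                                              pos_label=2, average="binary")
print("PRE %.5f  REC %.5f  PRF %.5f" % (p, r, f1))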

OpenCV 2.4.3 Haar Classifier Error AdaBoost misclass

I am using OpenCV 2.4.3 on Ubuntu 12.10 64-bit, and when I run opencv_haartraining I get the error message shown below. The training continues, so I don't think it is a critical error, but it nonetheless blatantly says 'Error'. I can't seem to find any solutions for this - what does it mean (what is AdaBoost), why is it complaining about a 'misclass', and how can I fix it? Anything I found on Google referred to this as simply a 'warning' and basically said to forget about it. Thanks!
cd dots ; nice -20 opencv_haartraining -data dots_haarcascade -vec samples.vec -bg negatives.dat -nstages 20 -nsplits 2 -minhitrate 0.999 -maxfalsealarm 0.5 -npos 13 -nneg 10 -w 10 -h 10 -nonsym -mem 4000 -mode ALL
Data dir name: dots_w10_h10_haarcascade
Vec file name: samples.vec
BG file name: negatives.dat, is a vecfile: no
Num pos: 13
Num neg: 10
Num stages: 20
Num splits: 2 (tree as weak classifier)
Mem: 4000 MB
Symmetric: FALSE
Min hit rate: 0.999000
Max false alarm rate: 0.500000
Weight trimming: 0.950000
Equal weights: FALSE
Mode: ALL
Width: 10
Height: 10
Applied boosting algorithm: GAB
Error (valid only for Discrete and Real AdaBoost): misclass
Max number of splits in tree cascade: 0
Min number of positive samples per cluster: 500
Required leaf false alarm rate: 9.53674e-07
Stage 0 loaded
Stage 1 loaded
Stage 2 loaded
Stage 3 loaded
Stage 4 loaded
Stage 5 loaded
Stage 6 loaded
Stage 7 loaded
Tree Classifier
Stage
+---+---+---+---+---+---+---+---+
| 0| 1| 2| 3| 4| 5| 6| 7|
+---+---+---+---+---+---+---+---+
0---1---2---3---4---5---6---7
Number of features used : 7544
Parent node: 7
*** 1 cluster ***
POS: 13 96 0.135417
I don't think this is an error message; rather, it is a printout describing how the algorithm will measure its internal error rate. In this case it is using misclassification of the examples. Real and Discrete AdaBoost map input samples onto the output range [0,1], so there is a meaningful way of measuring the inaccuracy of the algorithm. If a different variant of AdaBoost is being used, this error measure might cease to be meaningful.
