Grid Search in SVM not at all improving the model

I want my SVM to classify the given data into three classes: 0, 1, 2. Initially I was getting 0 predictions for class 1, so I used grid search; but even after grid search, class 1 is getting 0.0 precision. What might be wrong? How can I make my model more precise?
before grid search:
              precision    recall  f1-score   support

           0       0.75      0.44      0.55        41
           1       0.00      0.00      0.00        37
           2       0.50      0.98      0.66        55

    accuracy                           0.54       133
   macro avg       0.42      0.47      0.40       133
weighted avg       0.44      0.54      0.44       133
after grid search: {'C': 100, 'gamma': 0.1, 'kernel': 'rbf'}
              precision    recall  f1-score   support

           0       0.72      0.56      0.63        41
           1       0.00      0.00      0.00        37
           2       0.52      0.96      0.68        55

    accuracy                           0.57       133
   macro avg       0.41      0.51      0.44       133
weighted avg       0.44      0.57      0.48       133

Looking at the plot, it seems that the data are linearly separable with a large amount of noise. In this case you can use the linear kernel of the SVM.
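For instance, the grid search could be widened to compare the linear and RBF kernels and to re-weight the classes. Here is a minimal sketch, assuming the usual X_train/y_train/X_test/y_test split (not shown in the question):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 0.01, 0.1, 1],    # ignored by the linear kernel
    'class_weight': [None, 'balanced'],  # 'balanced' penalises ignoring a class
}
# scoring='f1_macro' weights all three classes equally, unlike plain accuracy,
# so a parameter combination that never predicts class 1 scores poorly.
grid = GridSearchCV(SVC(), param_grid, scoring='f1_macro', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))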

Parsing a huge file in Fortran

I am trying to parse an output file of a popular QM program, in order to extract data corresponding to two related properties: 'frequencies' and 'intensities'. An example of how the output file looks can be found below:
Max difference between off-diagonal Polar Derivs IMax= 2 JMax= 3 KMax= 13 EMax= 8.65D-04
Full mass-weighted force constant matrix:
Low frequencies --- -2.0296 -1.7337 -1.3848 -0.0005 -0.0003 0.0007
Low frequencies --- 216.4611 263.3990 368.1703
Diagonal vibrational polarizability:
18.1080784 9.1046025 11.9153848
Diagonal vibrational hyperpolarizability:
127.1032599 2.7794305 -8.7599786
Harmonic frequencies (cm**-1), IR intensities (KM/Mole), Raman scattering
activities (A**4/AMU), depolarization ratios for plane and unpolarized
incident light, reduced masses (AMU), force constants (mDyne/A),
and normal coordinates:
1 2 3
A A A
Frequencies -- 216.4611 263.3989 368.1703
Red. masses -- 3.3756 1.0427 3.0817
Frc consts -- 0.0932 0.0426 0.2461
IR Inten -- 3.6192 21.7801 0.2120
Raman Activ -- 1.0049 0.1635 0.9226
Depolar (P) -- 0.6948 0.6536 0.7460
Depolar (U) -- 0.8199 0.7905 0.8546
Atom AN X Y Z X Y Z X Y Z
1 6 0.00 0.00 0.22 0.00 0.01 0.02 0.06 0.15 -0.01
2 7 0.00 0.00 0.00 0.00 0.00 0.00 0.10 -0.02 0.00
3 6 0.00 0.00 -0.23 0.00 -0.01 0.00 0.01 -0.07 0.00
4 6 0.00 0.00 0.00 0.00 0.00 0.00 -0.08 -0.02 0.00
5 6 0.00 0.00 0.21 0.00 0.01 -0.03 -0.06 0.15 0.00
6 6 0.00 0.00 0.11 0.00 0.01 0.00 -0.01 0.17 0.00
7 7 -0.02 0.00 -0.22 0.00 0.03 0.00 -0.01 -0.26 0.00
8 1 0.10 -0.02 -0.32 0.02 -0.30 0.66 0.34 -0.39 -0.13
9 1 0.07 -0.02 -0.39 -0.05 -0.25 -0.63 -0.37 -0.40 0.12
10 1 0.00 0.00 0.39 0.01 0.01 0.07 0.18 0.22 -0.03
11 1 0.00 0.00 -0.53 0.00 -0.01 0.01 0.02 -0.15 0.01
12 1 0.00 0.00 -0.03 -0.01 0.00 -0.02 -0.18 -0.09 0.00
13 1 0.00 0.00 0.31 0.00 0.00 -0.09 -0.18 0.22 0.03
4 5 6
A A A
Frequencies -- 411.0849 501.4206 548.5728
Red. masses -- 3.4204 2.8766 6.5195
Frc consts -- 0.3406 0.4261 1.1559
IR Inten -- 4.2311 30.8234 6.3698
Raman Activ -- 0.1512 0.8402 4.2329
Depolar (P) -- 0.7404 0.1511 0.4224
Depolar (U) -- 0.8508 0.2625 0.5939
Atom AN X Y Z X Y Z X Y Z
1 6 0.00 0.00 0.20 0.00 -0.01 0.01 0.02 -0.12 -0.01
2 7 0.00 0.00 -0.21 0.00 0.00 -0.16 0.06 -0.18 0.02
3 6 0.00 0.00 -0.03 0.01 0.00 0.15 0.32 -0.01 -0.02
4 6 0.00 0.00 0.27 0.01 0.00 -0.08 0.18 0.10 0.01
5 6 0.00 0.00 -0.23 0.00 0.00 -0.03 0.11 0.19 0.00
6 6 0.00 0.00 -0.02 0.00 0.00 0.32 -0.26 0.01 -0.04
7 7 0.00 -0.01 0.01 -0.04 0.00 -0.04 -0.39 0.02 0.04
8 1 -0.01 0.05 -0.10 0.17 0.03 -0.36 -0.36 0.06 -0.08
9 1 -0.02 0.04 0.16 0.15 -0.01 -0.35 -0.30 0.02 -0.11
10 1 0.01 0.01 0.48 0.01 0.00 -0.35 0.22 -0.01 0.03
11 1 0.00 0.00 -0.12 0.01 0.00 0.23 0.31 0.13 -0.02
12 1 0.00 0.00 0.54 0.00 0.00 -0.39 -0.02 -0.03 0.05
13 1 -0.01 0.00 -0.47 0.01 0.00 -0.45 0.34 0.06 0.04
7 8 9
A A A
Frequencies -- 629.8582 652.6212 716.4846
Red. masses -- 7.0000 1.4491 2.4272
Frc consts -- 1.6362 0.3637 0.7341
IR Inten -- 9.4587 253.3389 18.8342
Raman Activ -- 3.5151 11.7363 0.2311
Depolar (P) -- 0.7397 0.2892 0.7423
Depolar (U) -- 0.8504 0.4486 0.8521
Atom AN X Y Z X Y Z X Y Z
1 6 0.24 -0.18 -0.01 -0.02 0.03 -0.04 0.00 0.00 -0.12
2 7 0.30 0.27 0.02 -0.02 0.00 0.04 0.00 0.00 0.17
3 6 0.06 0.12 -0.02 -0.03 -0.01 -0.04 0.00 0.00 -0.15
4 6 -0.23 0.23 0.01 0.02 -0.04 0.02 0.00 0.00 0.18
5 6 -0.22 -0.20 -0.01 0.02 0.00 -0.04 0.00 0.00 -0.08
6 6 -0.04 -0.15 -0.02 0.04 0.01 -0.04 0.00 0.00 0.13
7 7 -0.13 -0.07 0.06 -0.05 0.00 0.14 0.01 0.00 -0.01
8 1 0.02 -0.03 -0.20 0.30 0.13 -0.57 0.00 -0.02 0.05
9 1 0.00 -0.12 -0.26 0.29 -0.10 -0.63 -0.01 0.02 0.05
The code I'm using is:
program gau_parser
   implicit none
   integer :: ierr               ! Error value for read statement
   integer, parameter :: iu = 20 ! input unit
   integer, parameter :: ou = 30 ! output unit
   character (len=*), parameter :: search_str = " Frequencies --" ! this is the property I'm looking for
   ! ^===============^ there are 15 characters here. First character is blank.
   !
   ! NOTE: a typical string looks like this: " Frequencies -- 411.0849 501.4206 548.5728"
   !                                          ===============  ========  ========  ========
   !                                          search_str       xx(1)     xx(2)     xx(3)
   !
   ! the string length is 73 but may be variable, though very seldom more than 80
   !
   real :: xx(3)                 ! the three values associated with the above property
   character (len=80) :: text
   character (len=15) :: word
   open (unit=iu, file="dummy.log", action="read")   ! the file I wish to parse
   open (unit=ou, file="output.log", action="write") ! the file the parse results are written to
   do ! the search is done line by line, until the end of the file
      read (iu, "(a)", iostat=ierr) text ! read line into character variable
      if (ierr /= 0) then
         cycle ! If a reading error occurs, advance to new line
      end if
      read (text, *) word ! read first word of line
      if (word == search_str) then ! found search string at beginning of line
         read (text, *) word, xx ! read the entire line
         write (ou, *) word, xx ! write the entire line
      end if
   end do ! finish the search cycle
end program gau_parser
My questions are the following:
a) The present code is compilable, but 'hangs up' upon execution. Can anyone compile their own version and see if the same happens to them? What (user-induced) error may be causing such behavior?
b) How can I make the multiple values of 'xx' be written in a single array in sequence? That is, they should be read like this from the parsed file
word xx(1) xx(2) xx(3)
...
junk
...
word xx(4) xx(5) xx(6)
...
more junk
...
word xx(7) xx(8) xx(9)
I know that I've declared the array in the program with dimension(3), but that is just for testing's sake. In reality it must be allocatable, its size unspecified until, upon reaching the end of the parsed file, it can be inquired (my INQUIRE:SIZE above). My idea is to print it into a scratch file, evaluate it, and then write it back into memory as an xx(INQUIRE:SIZE)-dimensioned array. Any thoughts on the matter would be most welcome!
EDIT: After trying to debug the program, I realized that it was actually looping! I've inserted a couple of write statements to see what could be going wrong
open (unit=iu, file="dummy.log", action="read") ! the file I wish to parse
print *, 'file opened'
! open (unit=ou, file="output.log", action="write") ! the file the parse results are written to
do ! the search is done line by line, until the end of the file
   print *, 'Do loop has started'
   read (iu, "(a)", iostat=ierr) text ! read line into character variable
   if (ierr /= 0) then
      write (*,*) 'Error!'
      cycle ! If a reading error occurs, advance to new line
   end if
and ... voilà! My screen started filling up with a flurry of
Error!
Do loop has started
messages! In essence, I'm stuck in a loop! Where have I failed?
There is a subtle error in the code. The statement
read (iu,"(a)",iostat=ierr) text ! read line into character variable
reads a line of text from the file into the variable text, and it uses the edit descriptor "(a)", which means that text is what you expect it to be. On the other hand, the statement
read (text,*) word
uses list-directed input (that's what the * means), and it does not get the string " Frequencies --" from the line. Helpfully, the compiler strips off the leading blank characters, and list-directed input stops at the next blank, so word gets the string Frequencies (no leading space, no trailing --). This will never match the searched-for string.
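You can see this with a tiny experiment (a minimal sketch):

program demo_list_read
   implicit none
   character (len=80) :: text
   character (len=15) :: word
   text = " Frequencies --    216.4611   263.3990   368.1703"
   read (text, *) word           ! list-directed: skips leading blanks, stops at the next blank
   print *, '[' // word // ']'   ! prints [Frequencies    ], which /= " Frequencies --"
end program demo_list_read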
An aside: especially when developing code, do not let loops run indefinitely; put in a reasonable maximum loop iteration, e.g. do ix = 1, 200 for your test case. This will stop you wasting time staring at a computation which ain't ever going to finish.
The reason that the code runs forever is that there is no end condition. Instead, the block of code
if (ierr /= 0) then
cycle ! If a reading error occurs, advance to new line
end if
sends execution back to the do statement - ad infinitum. I would use a stopping condition like this:
IF (IS_IOSTAT_END(ierr)) EXIT
The function IS_IOSTAT_END frees you from having to figure out what error code end-of-file produces on your compiler; the values of those codes are not standardised. IS_IOSTAT_EOR is useful to check for end-of-record.
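In context, the read loop then looks something like this (a minimal sketch; everything else stays as in the original program):

do ! the search is done line by line, until the end of the file
   read (iu, "(a)", iostat=ierr) text ! read line into character variable
   if (is_iostat_end(ierr)) exit      ! normal termination: end of file reached
   if (ierr /= 0) cycle               ! any other read error: skip to the next line
   ! ... process text as before ...
end do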
The next error you will find is that the statement
read (text,*) word
won't make word match " Frequencies --" either. Again, list-directed input means that the compiler treats blank spaces in the input file as separators, so the line of code will only get Frequencies into word. But that leads to another problem:
read (text,*) word,xx ! read the entire line
will try to read the string -- into the real array xx, with unhappy results.
One, perhaps the, solution to this series of problems is to use an explicit edit descriptor in the read statements. First change
read (text,*) word
to
read (text,'(a15)') word
Next, you have to change the line that reads xx to something like
read (text,'(a15,3(f18.4))') word,xx ! read the entire line
You will find that, as it stands, this line does not read all 3 values into xx correctly. That's because the edit descriptor 3(f18.4) does not quite properly describe the layout of the line; in fact it may need f18.4,2(fNN.4), where of course you replace NN with the proper field width for your file. And it's time you did some of the work.
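As for your question b), one modern-Fortran option, a sketch rather than a prescription, is allocation on assignment (Fortran 2003): append each parsed triplet to an allocatable array, and size() gives you the total at end of file, no scratch file required. The array name all_xx below is mine, not from your program:

program grow_demo
   implicit none
   real :: xx(3)
   real, allocatable :: all_xx(:) ! accumulates every parsed triplet in sequence
   allocate (all_xx(0))           ! start empty
   xx = [216.4611, 263.3989, 368.1703]
   all_xx = [all_xx, xx]          ! append; all_xx is reallocated automatically
   xx = [411.0849, 501.4206, 548.5728]
   all_xx = [all_xx, xx]          ! now holds xx(1)..xx(6) in sequence
   print *, size(all_xx)          ! 6
   print *, all_xx
end program grow_demo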

How is the precision and recall calculated in the classification report?

Confusion Matrix :
[[4 2]
[1 3]]
Accuracy Score : 0.7
Report:
             precision    recall  f1-score   support

          0       0.80      0.67      0.73         6
          1       0.60      0.75      0.67         4

avg / total       0.72      0.70      0.70        10
From the formula precision = true positives / (true positives + false positives):
4/(4+2) = 0.667
But this is listed under recall.
The formula to calculate recall is true positives / (true positives + false negatives):
4/(4+1) = 0.80
I don't seem to get the difference.
Hard to say for sure without seeing code, but my guess is that you are using sklearn and did not pass labels into your confusion matrix. Without labels it makes its own decision about the ordering, which can lead to false positives and false negatives being swapped when interpreting the confusion matrix.
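As a concrete check, here is a dummy reproduction (labels invented only to recreate the matrix [[4 2], [1 3]]) showing which entry is which:

from sklearn.metrics import confusion_matrix
import numpy as np

# dummy labels invented to reproduce the matrix [[4 2], [1 3]]
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)             # rows are true labels, columns are predicted labels
tp = cm[0, 0]         # 4: true 0, predicted 0
fn = cm[0, 1]         # 2: true 0, predicted 1 -> a false NEGATIVE for class 0
fp = cm[1, 0]         # 1: true 1, predicted 0 -> a false POSITIVE for class 0
print(tp / (tp + fp)) # precision for class 0: 4/5 = 0.80
print(tp / (tp + fn)) # recall for class 0:    4/6 = 0.67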

When doing classification, why do I get different precision for the same testing data?

I am testing a dataset with two labels, 'A' and 'B', on a decision tree classifier. I accidentally found out that the model gets different precision results on the same testing data, and I want to know why.
Here is what I do: I train the model and then test it on
1. the testing set,
2. the data labelled only 'A' in the testing set,
3. the data labelled only 'B'.
Here is what I got:
For the testing dataset:
             precision    recall  f1-score   support

          A       0.94      0.95      0.95     25258
          B       0.27      0.22      0.24      1963

For the data labelled only 'A' in the testing dataset:
             precision    recall  f1-score   support

          A       1.00      0.95      0.98     25258
          B       0.00      0.00      0.00         0

For the data labelled only 'B' in the testing dataset:
             precision    recall  f1-score   support

          A       0.00      0.00      0.00         0
          B       1.00      0.22      0.36      1963
The training dataset and model are the same, and the data in the 2nd and 3rd tests are the same as those in the 1st. Why do the precision values for 'A' and 'B' differ so much? What is the real precision of this model? Thank you very much.
You sound confused, and it is not at all clear why you are interested in metrics where you have completely removed one of the two labels from your evaluation set.
Let's explore the issue with some reproducible dummy data:
from sklearn.metrics import classification_report
import numpy as np
y_true = np.array([0, 1, 0, 1, 1, 0, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1])
target_names = ['A', 'B']
print(classification_report(y_true, y_pred, target_names=target_names))
Result:
             precision    recall  f1-score   support

          A       0.50      0.50      0.50         4
          B       0.33      0.33      0.33         3

avg / total       0.43      0.43      0.43         7
Now, let's keep only class A in our y_true:
indA = np.where(y_true==0)
print(indA)
print(y_true[indA])
print(y_pred[indA])
Result:
(array([0, 2, 5, 6], dtype=int64),)
[0 0 0 0]
[0 1 0 1]
Now, here is the definition of precision from the scikit-learn documentation:
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
For class A, a true positive (tp) would be a case where the true class is A (0 in our case), and we have indeed predicted A (0); from above, it is apparent that tp=2.
The tricky part is the false positives (fp): they are the cases where we have predicted A (0) but the true label is B (1). It is apparent here that we cannot have any such cases, since we have (intentionally) removed all the B's from our y_true (why would we want to do such a thing? I don't know; it does not make any sense at all); hence, fp=0 in this (weird) setting. Hence, our precision for class A will be tp / (tp+0) = tp/tp = 1.
Which is the exact same result given by the classification report:
print(classification_report(y_true[indA], y_pred[indA], target_names=target_names))
# result:
             precision    recall  f1-score   support

          A       1.00      0.50      0.67         4
          B       0.00      0.00      0.00         0

avg / total       1.00      0.50      0.67         4
and obviously the case for B is identical.
Why is the precision not 1 in case #1 (for both A and B)? The data are the same.
No, they are very obviously not the same: the ground truth is altered!
Bottom line: removing classes from your y_true before computing precision etc. does not make any sense at all (i.e. your reported results in cases #2 and #3 are of no practical use whatsoever); but, since for whatever reason you decided to do so, your reported results are exactly as expected.

Haar cascade resulting file is too small

I am trying to train a cascade to detect an area with specifically structured text (MRZ).
I've gathered 200 positive samples and 572 negative samples.
Training went as follows:
opencv_traincascade.exe -data cascades -vec vector/vector.vec -bg bg.txt -numPos 200 -numNeg 572 -numStages 3 -precalcValBufSize 2048 -precalcIdxBufSize 2048 -featureType LBP -mode ALL -w 400 -h 45 -maxFalseAlarmRate 0.8 -minHitRate 0.9988
PARAMETERS:
cascadeDirName: cascades
vecFileName: vector/vector.vec
bgFileName: bg.txt
numPos: 199
numNeg: 572
numStages: 3
precalcValBufSize[Mb] : 2048
precalcIdxBufSize[Mb] : 2048
acceptanceRatioBreakValue : -1
stageType: BOOST
featureType: LBP
sampleWidth: 400
sampleHeight: 45
boostType: GAB
minHitRate: 0.9988
maxFalseAlarmRate: 0.8
weightTrimRate: 0.95
maxDepth: 1
maxWeakCount: 100
Number of unique features given windowSize [400,45] : 8778000

===== TRAINING 0-stage =====
<BEGIN
POS count : consumed   199 : 199
NEG count : acceptanceRatio    572 : 1
Precalculation time: 26.994
+----+---------+---------+
|  N |    HR   |    FA   |
+----+---------+---------+
|   1|        1|        1|
+----+---------+---------+
|   2|        1|0.0244755|
+----+---------+---------+
END>
Training until now has taken 0 days 0 hours 36 minutes 35 seconds.

===== TRAINING 1-stage =====
<BEGIN
POS count : consumed   199 : 199
NEG count : acceptanceRatio    0 : 0
Required leaf false alarm rate achieved.
Branch training terminated.
The process ran for ~35 minutes and produced a 2 kB file with only 45 lines, which seems too small for a good cascade.
Needless to say, it doesn't detect the needed area.
I tried to tune the arguments, but to no avail.
I know that it is better to use a larger set of samples, but I think this number of samples should still produce a somewhat reasonable, if not so accurate, result.
Is a haar cascade a good approach for detecting areas with specific text (MRZ)?
If so, how can better accuracy be achieved?
Thanks in advance.
You want to produce 3 stages with a maximum false alarm rate of 0.8 per stage, which means that after 3 stages the classifier will have a maximum false alarm rate of 0.8^3 = 0.512. But after your first stage the classifier already reaches a false alarm rate of 0.0244755, which is much better than your final aim (0.512), so the classifier is considered already good enough and does not need any more stages.
If that's not fine for you, increase numStages or decrease maxFalseAlarmRate so that you don't reach the final quality within your first stage.
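For example, the same command as in the question with just those two parameters adjusted (the exact values are something to experiment with):
opencv_traincascade.exe -data cascades -vec vector/vector.vec -bg bg.txt -numPos 200 -numNeg 572 -numStages 10 -precalcValBufSize 2048 -precalcIdxBufSize 2048 -featureType LBP -mode ALL -w 400 -h 45 -maxFalseAlarmRate 0.5 -minHitRate 0.9988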
You will probably have to collect more samples, and samples that represent the environment better; reaching such low false alarm rates is typically a sign of bad training data (too simple or too similar?).
I can't tell you whether Haar cascades are appropriate for solving your task.

OpenCV train cascade giving error "Train dataset for temp stage can not be filled."

So I've searched this online and it is a pretty common error, but I've tried the given solutions to no avail. My cmd log is:
C:\Users\kosyn_000\Dropbox\OpenCVtrainingdata>opencv_traincascade -data my_trained -vec positives.vec -bg negativedata.txt -numPos 30 -numNeg 76 -numStages 15 -minHitRate 0.995 -w 197 -h 197 -featureType LBP -precalcValBufSize 1024 -precalcIdxBufSize 1024
PARAMETERS:
cascadeDirName: my_trained
vecFileName: positives.vec
bgFileName: negativedata.txt
numPos: 30
numNeg: 76
numStages: 15
precalcValBufSize[Mb] : 1024
precalcIdxBufSize[Mb] : 1024
acceptanceRatioBreakValue : -1
stageType: BOOST
featureType: LBP
sampleWidth: 197
sampleHeight: 197
boostType: GAB
minHitRate: 0.995
maxFalseAlarmRate: 0.5
weightTrimRate: 0.95
maxDepth: 1
maxWeakCount: 100
Number of unique features given windowSize [197,197] : 41409225
===== TRAINING 0-stage =====
<BEGIN
POS count : consumed 30 : 30
Train dataset for temp stage can not be filled. Branch training terminated.
Cascade classifier can't be trained. Check the used training parameters.
C:\Users\kosyn_000\Dropbox\OpenCVtrainingdata>
and my negativedata.txt file has 76 lines of info in the form:
negatives/1411814567410.jpg 1 2 2 199 199
negatives/20131225_192702.jpg 1 2 2 199 199
negatives/20131225_193214.jpg 1 2 2 199 199
negatives/20131225_193325.jpg 1 2 2 199 199
negatives/20131225_193327.jpg 1 2 2 199 199
negatives/20131225_193328.jpg 1 2 2 199 199
Please can someone help me pinpoint the issue because I'm still not sure why I'm getting this error. I'm doing this on a windows system. Thank you.
Found out the issue: apparently the bg file shouldn't contain constraints, so now my file is in the form
C:\Users\kosyn_000\Dropbox\OpenCVtrainingdata\negatives/ff.JPG
C:\Users\kosyn_000\Dropbox\OpenCVtrainingdata\negatives/fifa.JPG
C:\Users\kosyn_000\Dropbox\OpenCVtrainingdata\negatives/fred.JPG
C:\Users\kosyn_000\Dropbox\OpenCVtrainingdata\negatives/IMG-20140718-WA0008-1.jpg
C:\Users\kosyn_000\Dropbox\OpenCVtrainingdata\negatives/IMG-20150102-WA0013.jpg
C:\Users\kosyn_000\Dropbox\OpenCVtrainingdata\negatives/IMG-20150120-WA0005.jpg
C:\Users\kosyn_000\Dropbox\OpenCVtrainingdata\negatives/IMG_20140109_012313.jpg
C:\Users\kosyn_000\Dropbox\OpenCVtrainingdata\negatives/IMG_20140405_205621.jpg
C:\Users\kosyn_000\Dropbox\OpenCVtrainingdata\negatives/IMG_20140405_214225.jpg
C:\Users\kosyn_000\Dropbox\OpenCVtrainingdata\negatives/IMG_20140405_214225_transparent.png
C:\Users\kosyn_000\Dropbox\OpenCVtrainingdata\negatives/IMG_20140405_214225_transparent_small.png
and it outputted my xml file fine, albeit taking a bit of time. Lol, I can't believe it was something so simple holding me back.
