SVM Lite training file format - machine-learning

I am a student taking an Artificial Intelligence course, and I need to use SVM Lite for binary classification. I formatted my training.dat file with the following values:
1 1:317.5 2:489.923718552 3:15.3 4:13.5207248326 5:51.6 6:0 7:0.0118 8: 0.0003
-1 1:114.4 2:127.135783258 3:19.9 4:15.2130764246 5:101.5 6:0 7:0.0123 8:0.123456790123
1 1:107.0 2:0.0 3:6.0 4:0.0 5:52.0 6:0 7:0.0 8:0.0
-1 1:158.9 2:200.81 3:27.9 4:7.81 5:58.9 6:1 7:0.0 8:0.054
-1 1:46.0 2:0.0 3:15.0 4:0.0 5:16.0 6:1 7:0.0 8: 0.021
..
.....
...
When I try to train on this file, it reports the following:
Scanning examples done ..
Reading examples into memory -- done
Parsing Error in Line 0 !
What should I do to fix this?
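One thing that stands out in the sample above is the space after the colon in "8: 0.0003" (first and last rows shown). SVM-light expects each feature as index:value with no whitespace inside the pair, so that space may be what triggers the parsing error. The following is a minimal Python sketch of a line checker, not SVM-light's own parser. The names check_line, NUMBER, TARGET and PAIR are made up for illustration, and the regular expressions only approximate the number syntax the tool accepts.

```python
import re

# Rough approximation of a decimal number such as 1, -1, 0.0118 or 1e-3.
NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"
TARGET = re.compile(rf"^{NUMBER}$")
# One feature token: <index>:<value> with no whitespace around the colon.
PAIR = re.compile(rf"^(\d+|qid):({NUMBER})$")


def check_line(line):
    """Return a list of problems found in one training line (empty if none)."""
    body = line.split("#", 1)[0].strip()  # ignore a trailing "# comment"
    if not body:
        return []
    tokens = body.split()
    problems = []
    if not TARGET.match(tokens[0]):
        problems.append(f"bad target {tokens[0]!r}")
    last_index = 0
    for token in tokens[1:]:
        match = PAIR.match(token)
        if match is None:
            problems.append(f"bad feature token {token!r}")
            continue
        if match.group(1) == "qid":
            continue
        index = int(match.group(1))
        if index <= last_index:
            problems.append(f"feature index {index} is not increasing")
        last_index = index
    return problems


# Two rows copied from the question: the first has "8: 0.0003", the second is clean.
sample = [
    "1 1:317.5 2:489.923718552 3:15.3 4:13.5207248326 5:51.6 6:0 7:0.0118 8: 0.0003",
    "1 1:107.0 2:0.0 3:6.0 4:0.0 5:52.0 6:0 7:0.0 8:0.0",
]
for number, line in enumerate(sample, start=1):
    for problem in check_line(line):
        print(f"line {number}: {problem}")
```

Running a check like this over the whole training.dat before calling the learner should point at any row where a pair was split by whitespace. Removing the space (8:0.0003) is the fix to try first.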

Related

Error using svm predict function in openCV when loading a saved file with svm load

I'm trying to load an .xml file using the SVM load function in OpenCV, and then use the predict function to classify a traffic sign. When execution reaches the predict function, an error is thrown:
Unhandled exception at 0x00007FFE88E54008 in LicentaFunctii.exe: Microsoft C++ exception: cv::Exception at memory location 0x00000025658FD0C0.
And in the console the following message is logged:
OpenCV Error: Assertion failed (samples.cols == var_count && samples.type()== 5) in cv::ml::SVMImpl::predict, file C:\build\master_winpack-build-win64-vc14\opencv\modules\ml\src\svm.cpp, line 2005
These are the first 24 lines of the XML file:
<?xml version="1.0"?>
<opencv_storage>
<opencv_ml_svm>
<format>3</format>
<svmType>C_SVC</svmType>
<kernel>
<type>LINEAR</type></kernel>
<C>15.</C>
<term_criteria><epsilon>1.0000000000000000e-02</epsilon>
<iterations>1000</iterations></term_criteria>
<var_count>3600</var_count>
<class_count>7</class_count>
<class_labels type_id="opencv-matrix">
<rows>7</rows>
<cols>1</cols>
<dt>i</dt>
<data>
0 1 2 3 4 5 6</data></class_labels>
<sv_total>21</sv_total>
<support_vectors>
<_>
1.06024239e-02 4.48197760e-02 -4.58896300e-03 -2.43553445e-02
-7.37148002e-02 -1.85971316e-02 -1.32155744e-02 -1.38255786e-02
-3.20396386e-02 8.21578354e-02 7.99100101e-02 -1.21739525e-02
The following code is used to load the trained data from the xml file:
Ptr<SVM> svm = SVM::create();
svm->load("Images/trainedImages.xml");
Note: I'm using OpenCV 3.4.0.
Can anyone advise on this issue?
EDIT 1:
It seems that loading the trained file like this will work:
Ptr<SVM> svm = SVM::create();
svm = SVM::load("Images/trainedImages.xml");
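For context, the assertion in the log requires that the sample passed to predict has as many columns as the model's var_count (3600 in the XML above) and that its type is 5, which is OpenCV's code for single-channel 32-bit float. If the object being used was never actually filled from the file, its var_count would not be 3600 and the same assertion would fail, which may be why the static SVM::load form helps. As a rough, OpenCV-free sketch in Python, the expected feature count can be read straight from the model file and compared with the sample length. The helper names read_var_count and sample_matches and the shortened MODEL_HEADER string are illustrative assumptions, not part of OpenCV.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the saved model; only the fields used here are kept.
MODEL_HEADER = (
    '<?xml version="1.0"?>'
    "<opencv_storage><opencv_ml_svm>"
    "<var_count>3600</var_count>"
    "<class_count>7</class_count>"
    "</opencv_ml_svm></opencv_storage>"
)


def read_var_count(xml_text):
    """Return the number of features the saved SVM expects per sample."""
    root = ET.fromstring(xml_text)
    node = root.find(".//var_count")
    if node is None or node.text is None:
        raise ValueError("no <var_count> element found")
    return int(node.text)


def sample_matches(sample, var_count):
    """True if a flat feature vector has exactly var_count entries."""
    return len(sample) == var_count


expected = read_var_count(MODEL_HEADER)
features = [0.0] * 3600  # placeholder feature vector for one image
print(expected, sample_matches(features, expected))
```

For a real model file, ET.parse(path).getroot() could replace ET.fromstring. The sample itself still has to be converted to a single-row 32-bit float matrix before it is handed to predict.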

Reason of failing on first iteration of CMUSphinx Baum Welch training

I am trying to build a new acoustic model, and I used a speech synthesizer to generate the training data. The estimated total duration of the training files is 0.0389416666666667 hours, but I keep getting an error in Baum-Welch training. This is the output I'm getting:
Sphinxtrain path: /usr/local/lib/sphinxtrain
Sphinxtrain binaries path: /usr/local/libexec/sphinxtrain
Running the training
MODULE: 000 Computing feature from audio files
Extracting features from segments starting at (part 1 of 1)
Extracting features from segments starting at (part 1 of 1)
Feature extraction is done
MODULE: 00 verify training files
Phase 1: Checking to see if the dict and filler dict agrees with the
phonelist file.
Found 81 words using 49 phones
Phase 2: Checking to make sure there are not duplicate entries in
the dictionary
Phase 3: Check general format for the fileids file; utterance length
(must be positive); files exist
Phase 4: Checking number of lines in the transcript file should
match lines in fileids file
Phase 5: Determine amount of training data, see if n_tied_states
seems reasonable.
Estimated Total Hours Training: 0.0389416666666667
This is a small amount of data, no comment at this time
Phase 6: Checking that all the words in the transcript are in the
dictionary
Words in dictionary: 78
Words in filler dictionary: 3
Phase 7: Checking that all the phones in the transcript are in the
phonelist, and all phones in the phonelist appear at least once
MODULE: 0000 train grapheme-to-phoneme model
Skipped (set $CFG_G2P_MODEL = 'yes' to enable)
MODULE: 01 Train LDA transformation
Skipped for multistream setup, see CFG_NUM_STREAMS configuration
LDA/MLLT only has sense for single stream features
Skipping LDA training
MODULE: 02 Train MLLT transformation
Skipped for multistream setup, see CFG_NUM_STREAMS configuration
LDA/MLLT only has sense for single stream features
Skipping MLLT training
MODULE: 05 Vector Quantization
ERROR: This step had 2 ERROR messages and 0 WARNING messages. Please
check the log file for details.
MODULE: 10 Training Context Independent models for forced alignment and
VTLN
Skipped: $ST::CFG_FORCEDALIGN set to 'no' in sphinx_train.cfg
Skipped: $ST::CFG_VTLN set to 'no' in sphinx_train.cfg
MODULE: 11 Force-aligning transcripts
Skipped: $ST::CFG_FORCEDALIGN set to 'no' in sphinx_train.cfg
MODULE: 12 Force-aligning data for VTLN
Skipped: $ST::CFG_VTLN set to 'no' in sphinx_train.cfg
MODULE: 20 Training Context Independent models
Phase 1: Cleaning up directories:
accumulator...logs...qmanager...models...
Phase 2: Flat initialize
Phase 3: Forward-Backward
Baum welch starting for 256 Gaussian(s), iteration: 1 (1 of 1)
0% 20% 30% 60% 90% 100%
ERROR: This step had 86 ERROR messages and 0 WARNING messages. Please
check the log file for details.
ERROR: Training failed in iteration 1
I have also set CFG_CD_TRAIN to 'no' since I have only a small amount of training data.
Edit:
I checked the log file, here is a pastebin of the log:
http://pastebin.com/YBSqfxYW

What is the measure used for "importance" in the h2o random Forest

Here is my code:
set.seed(1)

# Boruta on the HouseVotes84 data from mlbench
library(mlbench)  # has HouseVotes84 data
library(h2o)      # has rf

# spin up h2o
myh20 <- h2o.init(nthreads = -1)

# read in data, throw some away
data(HouseVotes84)
hvo <- na.omit(HouseVotes84)

# move from R to h2o
mydata <- as.h2o(x = hvo,
                 destination_frame = "mydata")

# RF columns (input vs. output)
idxy <- 1
idxx <- 2:ncol(hvo)

# split data
splits <- h2o.splitFrame(mydata,
                         c(0.8, 0.1))
train <- h2o.assign(splits[[1]], key = "train")
valid <- h2o.assign(splits[[2]], key = "valid")

# make random forest
my_imp.rf <- h2o.randomForest(y = idxy, x = idxx,
                              training_frame = train,
                              validation_frame = valid,
                              model_id = "my_imp.rf",
                              ntrees = 200)

# find importance
my_varimp <- h2o.varimp(my_imp.rf)
my_varimp
The output that I am getting is "variable importance".
The classic measures are "mean decrease in accuracy" and "mean decrease in gini coefficient".
My results are:
> my_varimp
Variable Importances:
   variable relative_importance scaled_importance percentage
1        V4         3255.193604          1.000000   0.410574
2        V5         1131.646484          0.347643   0.142733
3        V3          921.106567          0.282965   0.116178
4       V12          759.443176          0.233302   0.095788
5       V14          492.264954          0.151224   0.062089
6        V8          342.811554          0.105312   0.043238
7       V11          205.392654          0.063097   0.025906
8        V9          191.110046          0.058709   0.024105
9        V7          169.117676          0.051953   0.021331
10      V15          135.097076          0.041502   0.017040
11      V13          114.906586          0.035299   0.014493
12       V2           51.939777          0.015956   0.006551
13      V10           46.716656          0.014351   0.005892
14       V6           44.336708          0.013620   0.005592
15      V16           34.779987          0.010684   0.004387
16       V1           32.528778          0.009993   0.004103
From this, the relative importance of "Vote #4" (V4) is about 3255.2.
Questions:
What units is that in?
How is that derived?
I looked through the documentation and the help pages but could not find the answer. I also used Flow to inspect the model parameters for any hint. None of them mention "gini" or "decrease in accuracy". Where should I look?
The answer is in the docs.
[ In the left pane, click on "Algorithms", then "Supervised", then "DRF". The FAQ section answers this question. ]
For convenience, the answer is also copied and pasted here:
"How is variable importance calculated for DRF? Variable importance is determined by calculating the relative influence of each variable: whether that variable was selected during splitting in the tree building process and how much the squared error (over all trees) improved as a result."
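Going by that description, relative_importance is an accumulated squared-error improvement, so it has no fixed unit and is mainly useful for comparing variables within one model. The other two columns look like simple rescalings of it. The Python sketch below assumes that scaled_importance is the relative importance divided by the largest value and that percentage is each variable's share of the total. That assumption is consistent with the numbers in the question but is worth checking against the H2O docs.

```python
# relative_importance values copied from the h2o.varimp output in the question.
relative = {
    "V4": 3255.193604, "V5": 1131.646484, "V3": 921.106567, "V12": 759.443176,
    "V14": 492.264954, "V8": 342.811554, "V11": 205.392654, "V9": 191.110046,
    "V7": 169.117676, "V15": 135.097076, "V13": 114.906586, "V2": 51.939777,
    "V10": 46.716656, "V6": 44.336708, "V16": 34.779987, "V1": 32.528778,
}

largest = max(relative.values())
total = sum(relative.values())

# Assumed derivation: scaled = value / largest, percentage = value / total.
table = {
    name: (value, value / largest, value / total)
    for name, value in relative.items()
}

for name, (rel, scaled, share) in table.items():
    print(f"{name:>4} {rel:12.6f} {scaled:9.6f} {share:9.6f}")
```

If the printed columns match the model's own table, the only quantity that really needs explaining is relative_importance itself.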

Learning Weka - Precision and Recall - Wiki example to .Arff file

I'm new to WEKA and advanced statistics, and I am starting from scratch to understand the WEKA measures. I've done all the #rushdi-shams examples, which are great resources.
The Wikipedia article http://en.wikipedia.org/wiki/Precision_and_recall explains the measures with a simple example: video recognition software detects 7 dogs in a scene containing 9 real dogs and some cats.
I understand the example and the recall calculation perfectly.
So as a first step, I want to reproduce this example in Weka.
How do I create such an .arff file?
With the file below I get the wrong confusion matrix and the wrong Detailed Accuracy By Class.
Recall should be 4/9 (0.4444), not 1.
@relation 'dogs and cat detection'
@attribute 'realanimal' {dog,cat}
@attribute 'detected' {dog,cat}
@attribute 'class' {correct,wrong}
@data
dog,dog,correct
dog,dog,correct
dog,dog,correct
dog,dog,correct
cat,dog,wrong
cat,dog,wrong
cat,dog,wrong
dog,?,?
dog,?,?
dog,?,?
dog,?,?
dog,?,?
cat,?,?
cat,?,?
Output Weka (without filters)
=== Run information ===
Scheme:weka.classifiers.rules.ZeroR
Relation: dogs and cat detection
Instances: 14
Attributes: 3
realanimal
detected
class
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===
ZeroR predicts class value: correct
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 4 57.1429 %
Incorrectly Classified Instances 3 42.8571 %
Kappa statistic 0
Mean absolute error 0.5
Root mean squared error 0.5044
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 7
Ignored Class Unknown Instances 7
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 1 0.571 1 0.727 0.65 correct
0 0 0 0 0 0.136 wrong
Weighted Avg. 0.571 0.571 0.327 0.571 0.416 0.43
=== Confusion Matrix ===
a b <-- classified as
4 0 | a = correct
3 0 | b = wrong
Is something wrong with how I encoded the false-negative dogs,
or is my ARFF approach totally wrong, so that I need a different kind of attribute?
Thanks
Let's start with the basic definitions of precision and recall.
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.
In the dog.arff file above, Weka took only the first 7 tuples into account and ignored the remaining 7, whose class value is missing (the output lists them as "Ignored Class Unknown Instances"). The output also shows that it classified all 7 of the tuples it kept as correct (4 correct tuples + 3 wrong tuples).
Let's calculate precision and recall for the correct and wrong classes.
For the correct class:
Precision = 4/(4+3) = 0.571428571
Recall = 4/(4+0) = 1
For the wrong class:
Precision = 0/(0+0), which is undefined; Weka reports it as 0
Recall = 0/(0+3) = 0
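The same arithmetic as a small Python sketch. The function name precision_recall is made up for illustration, and returning 0 for an undefined 0/0 ratio is a choice made here to mirror the Weka output. The last call uses FN = 5 for the Wikipedia scenario, which is an assumption derived from 9 real dogs minus the 4 that were detected.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); an undefined 0/0 ratio is reported as 0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Counts read off the Weka confusion matrix in the question.
print("correct:", precision_recall(tp=4, fp=3, fn=0))
print("wrong:  ", precision_recall(tp=0, fp=0, fn=3))

# Dog class in the Wikipedia scenario: 4 dogs found, 3 cats mistaken for dogs,
# and (assumed) 5 real dogs missed.
print("dog:    ", precision_recall(tp=4, fp=3, fn=5))
```

With the missed dogs counted as false negatives for the dog class, recall comes out as 4/9, which is the value the question expected.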

How to do proper testing in Weka and how to get desired results ?

I am currently applying ANN, SVM and linear regression methods to predict the fruit yield of a region from meteorological factors (13 factors).
The total data set has 36 instances.
When I run these methods in WEKA I get bad results.
For example, with MultilayerPerceptron my results are as follows
(I split the data set into 28 instances for training and 8 for testing):
=== Run information ===
Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -G -R
Relation: apr6_data
Instances: 28
Attributes: 15
Time taken to build model: 3.69 seconds
=== Predictions on test set ===
inst# actual predicted error
1 2.551 2.36 -0.191
2 2.126 3.079 0.953
3 2.6 1.319 -1.281
4 1.901 3.539 1.638
5 2.146 3.635 1.489
6 2.533 2.917 0.384
7 2.54 2.744 0.204
8 2.82 3.473 0.653
=== Evaluation on test set ===
=== Summary ===
Correlation coefficient -0.4415
Mean absolute error 0.8493
Root mean squared error 1.0065
Relative absolute error 144.2248 %
Root relative squared error 153.5097 %
Total Number of Instances 8
For SVM regression:
inst# actual predicted error
1 2.551 2.538 -0.013
2 2.126 2.568 0.442
3 2.6 2.335 -0.265
4 1.901 2.556 0.655
5 2.146 2.632 0.486
6 2.533 2.24 -0.293
7 2.54 2.766 0.226
8 2.82 3.175 0.355
=== Evaluation on test set ===
=== Summary ===
Correlation coefficient 0.2888
Mean absolute error 0.3417
Root mean squared error 0.3862
Relative absolute error 58.0331 %
Root relative squared error 58.9028 %
Total Number of Instances 8
What could be wrong with my setup? Please let me know!
Thanks
Do I need to normalize the data? I assumed the WEKA classifiers do this already.
If you want to normalize the data, you have to do it yourself: Preprocess tab -> Filter (Choose) -> find Normalize, then click Apply.
If you want to discretize your data, follow the same process.
You might have better luck with discretising the prediction e.g. into low/medium/high yield.
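To make the two preprocessing ideas above concrete, here is a minimal Python sketch of what they do to a column of numbers. It assumes min-max scaling to [0, 1] for normalization and three equal-width bins for the low/medium/high split. The function names and the choice of equal-width bins are illustrative assumptions, not a description of Weka's internals.

```python
def min_max_normalize(values):
    """Scale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, labels=("low", "medium", "high")):
    """Assign each value to one of len(labels) equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [labels[0] for _ in values]
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        # The maximum value would land one past the last bin, so clamp it.
        index = min(int((v - lo) / width), len(labels) - 1)
        result.append(labels[index])
    return result


# "actual" yields of the 8 test instances from the question.
ACTUAL_YIELDS = [2.551, 2.126, 2.6, 1.901, 2.146, 2.533, 2.54, 2.82]

print(min_max_normalize(ACTUAL_YIELDS))
print(discretize(ACTUAL_YIELDS))
```

With only eight values, the bin boundaries depend heavily on the minimum and maximum. On the full data set it may be worth trying equal-frequency bins as well.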
Whether you need to normalize or discretize cannot be determined from your data or from a single run. For instance, discretization gives better results for naive Bayes classifiers; for SVM, I am not sure.
I did not see the precision, recall or F-score for your data. But since you say you get bad results on the test set, it is quite possible that your classifier is overfitting. Try to increase the number of training instances (36 is too few, I guess). Keep us posted on what happens when you increase the number of training instances.
