Reason for failure on the first iteration of CMUSphinx Baum-Welch training

I am trying to build a new acoustic model and used a speech synthesizer to generate the training data. The estimated total hours of training audio is
0.0389416666666667
but I keep getting an error after Baum-Welch training. This is the output I'm getting:
Sphinxtrain path: /usr/local/lib/sphinxtrain
Sphinxtrain binaries path: /usr/local/libexec/sphinxtrain
Running the training
MODULE: 000 Computing feature from audio files
Extracting features from segments starting at (part 1 of 1)
Extracting features from segments starting at (part 1 of 1)
Feature extraction is done
MODULE: 00 verify training files
Phase 1: Checking to see if the dict and filler dict agrees with the
phonelist file.
Found 81 words using 49 phones
Phase 2: Checking to make sure there are not duplicate entries in
the dictionary
Phase 3: Check general format for the fileids file; utterance length
(must be positive); files exist
Phase 4: Checking number of lines in the transcript file should
match lines in fileids file
Phase 5: Determine amount of training data, see if n_tied_states
seems reasonable.
Estimated Total Hours Training: 0.0389416666666667
This is a small amount of data, no comment at this time
Phase 6: Checking that all the words in the transcript are in the
dictionary
Words in dictionary: 78
Words in filler dictionary: 3
Phase 7: Checking that all the phones in the transcript are in the
phonelist, and all phones in the phonelist appear at least once
MODULE: 0000 train grapheme-to-phoneme model
Skipped (set $CFG_G2P_MODEL = 'yes' to enable)
MODULE: 01 Train LDA transformation
Skipped for multistream setup, see CFG_NUM_STREAMS configuration
LDA/MLLT only has sense for single stream features
Skipping LDA training
MODULE: 02 Train MLLT transformation
Skipped for multistream setup, see CFG_NUM_STREAMS configuration
LDA/MLLT only has sense for single stream features
Skipping MLLT training
MODULE: 05 Vector Quantization
ERROR: This step had 2 ERROR messages and 0 WARNING messages. Please
check the log file for details.
MODULE: 10 Training Context Independent models for forced alignment and
VTLN
Skipped: $ST::CFG_FORCEDALIGN set to 'no' in sphinx_train.cfg
Skipped: $ST::CFG_VTLN set to 'no' in sphinx_train.cfg
MODULE: 11 Force-aligning transcripts
Skipped: $ST::CFG_FORCEDALIGN set to 'no' in sphinx_train.cfg
MODULE: 12 Force-aligning data for VTLN
Skipped: $ST::CFG_VTLN set to 'no' in sphinx_train.cfg
MODULE: 20 Training Context Independent models
Phase 1: Cleaning up directories:
accumulator...logs...qmanager...models...
Phase 2: Flat initialize
Phase 3: Forward-Backward
Baum welch starting for 256 Gaussian(s), iteration: 1 (1 of 1)
0% 20% 30% 60% 90% 100%
ERROR: This step had 86 ERROR messages and 0 WARNING messages. Please
check the log file for details.
ERROR: Training failed in iteration 1
I have also set CFG_CD_TRAIN to 'no', since I have a small amount of training data.
Edit:
I checked the log file, here is a pastebin of the log:
http://pastebin.com/YBSqfxYW
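For context, the flags mentioned in the log above live in etc/sphinx_train.cfg. Below is a minimal excerpt of the kind of setup in play here, assuming the stock SphinxTrain variable names; the values are illustrative, not a fix:
# etc/sphinx_train.cfg (excerpt; assumed layout, values illustrative)
$CFG_HMM_TYPE = '.semi.';         # semi-continuous models explain the Vector Quantization step
$CFG_FINAL_NUM_DENSITIES = 256;   # matches "256 Gaussian(s)" in the Baum-Welch line above
$CFG_CD_TRAIN = 'no';             # context-dependent training disabled, as described above
$CFG_FORCEDALIGN = 'no';
$CFG_VTLN = 'no';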

Related

How to interpret the output of EM in Weka

I ran the EM algorithm on my data with the default parameters in WEKA, but I am not able to understand how to interpret the output.
=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -X 10 -max -1 -ll-cv 1.0E-6 -ll-iter 1.0E-6 -M 1.0E-6 -K 10 -num-slots 1 -S 100
Relation: Chronic_Kidney_Disease-weka.filters.unsupervised.attribute.Remove-R12-weka.filters.unsupervised.attribute.Remove-R3-weka.filters.unsupervised.attribute.Remove-R3-4-weka.filters.unsupervised.attribute.Remove-R5-10,12-20
Instances: 800
Attributes: 6
age
bp
rbc
pc
hemo
class
Test mode: evaluate on training data
=== Clustering model (full training set) ===
EM
==
Number of clusters selected by cross validation: 6
Number of iterations performed: 100
Cluster
Attribute 0 1 2 3 4 5
(0.29) (0.22) (0.38) (0.02) (0.04) (0.05)
===================================================================
age
mean 53.5869 65.0962 46.44 51.3652 56.1297 10.939
std. dev. 12.4505 7.9718 15.546 3.7759 10.2604 6.7004
bp
mean 77.3114 79.7 71.4394 115.138 92.1235 66.5196
std. dev. 11.7858 12.1008 8.4722 31.4278 5.8351 10.0583
rbc
normal 185.8341 165.6585 306.8285 14.0588 7.3129 32.3071
abnormal 45.4643 13.3988 1.0652 3.3197 29.7885 6.9635
[total] 231.2984 179.0574 307.8937 17.3785 37.1015 39.2706
pc
normal 152.713 147.8797 306.8886 13.0467 1.9999 31.4721
abnormal 78.5854 31.1776 1.005 4.3319 35.1016 7.7985
[total] 231.2984 179.0574 307.8937 17.3785 37.1015 39.2706
hemo
mean 10.6591 11.7665 15.0745 9.5796 8.1499 12.0494
std. dev. 2.1313 1.1677 1.3496 2.5159 2.1512 1.5108
class
ckd 230.1835 177.972 7.2109 16.3651 36.1014 38.167
notckd 1.1149 1.0853 300.6828 1.0134 1 1.1036
[total] 231.2984 179.0574 307.8937 17.3785 37.1015 39.2706
Time taken to build model (full training data) : 13.21 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 218 ( 27%)
1 196 ( 25%)
2 302 ( 38%)
3 12 ( 2%)
4 34 ( 4%)
5 38 ( 5%)
Log likelihood: -11.18988
Please help me understand this output.
Thanks in advance.
It's given you six clusters, containing 27%, 25%, 38%, 2%, 4% and 5% of the data respectively. (These add up to more than 100% because of rounding.)
It arrived at 6 clusters via cross-validation (training on part of the data and testing on the rest, over several runs).
The mean and standard deviation of each attribute for the items in each cluster are given.
The log likelihood is a measure of how well the model fits the data; the training tries to maximise it. It is used to compare which of the candidate clusterings is better and doesn't mean much by itself.
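To make those fields concrete, here is a small Python sketch using scikit-learn's GaussianMixture as a stand-in for Weka's EM clusterer (random data, so the numbers will not match the run above). It exposes the same quantities: cluster priors, per-attribute means and standard deviations, and the log likelihood:
# EM clustering with scikit-learn, as an analogue of weka.clusterers.EM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 5))   # stand-in for the 800 instances above

gm = GaussianMixture(n_components=6, random_state=0).fit(X)
print(gm.weights_)              # cluster priors: the (0.29) (0.22) ... row
print(gm.means_)                # per-cluster, per-attribute means
print(np.sqrt(gm.covariances_.diagonal(axis1=1, axis2=2)))  # std. devs.
print(gm.score(X))              # average log likelihood per instance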

What is the measure used for "importance" in the h2o random forest?

Here is my code:
set.seed(1)
# Boruta on the HouseVotes84 data from mlbench
library(mlbench)  # has HouseVotes84 data
library(h2o)      # has rf
# spin up h2o
myh20 <- h2o.init(nthreads = -1)
# read in data, throw some away
data(HouseVotes84)
hvo <- na.omit(HouseVotes84)
# move from R to h2o
mydata <- as.h2o(x = hvo, destination_frame = "mydata")
# RF columns (input vs. output)
idxy <- 1
idxx <- 2:ncol(hvo)
# split data
splits <- h2o.splitFrame(mydata, c(0.8, 0.1))
train <- h2o.assign(splits[[1]], key = "train")
valid <- h2o.assign(splits[[2]], key = "valid")
# make random forest
my_imp.rf <- h2o.randomForest(y = idxy, x = idxx,
                              training_frame = train,
                              validation_frame = valid,
                              model_id = "my_imp.rf",
                              ntrees = 200)
# find importance
my_varimp <- h2o.varimp(my_imp.rf)
my_varimp
The output that I am getting is "variable importance".
The classic measures are "mean decrease in accuracy" and "mean decrease in gini coefficient".
My results are:
> my_varimp
Variable Importances:
variable relative_importance scaled_importance percentage
1 V4 3255.193604 1.000000 0.410574
2 V5 1131.646484 0.347643 0.142733
3 V3 921.106567 0.282965 0.116178
4 V12 759.443176 0.233302 0.095788
5 V14 492.264954 0.151224 0.062089
6 V8 342.811554 0.105312 0.043238
7 V11 205.392654 0.063097 0.025906
8 V9 191.110046 0.058709 0.024105
9 V7 169.117676 0.051953 0.021331
10 V15 135.097076 0.041502 0.017040
11 V13 114.906586 0.035299 0.014493
12 V2 51.939777 0.015956 0.006551
13 V10 46.716656 0.014351 0.005892
14 V6 44.336708 0.013620 0.005592
15 V16 34.779987 0.010684 0.004387
16 V1 32.528778 0.009993 0.004103
From this, the relative importance of "Vote #4", a.k.a. V4, is ~3255.2.
Questions:
What units is that in?
How is that derived?
I tried looking in the documentation but could not find the answer. I tried the help documentation, and I used Flow to look at the parameters to see if anything there indicated it. In none of them do I find "gini" or "decrease accuracy". Where should I look?
The answer is in the docs.
[ In the left pane, click on "Algorithms", then "Supervised", then "DRF". The FAQ section answers this question. ]
For convenience, the answer is also copied and pasted here:
"How is variable importance calculated for DRF? Variable importance is determined by calculating the relative influence of each variable: whether that variable was selected during splitting in the tree building process and how much the squared error (over all trees) improved as a result."

What is the prediction file in SVMlight?

I am new to SVMlight. I downloaded the source code and compiled it.
I created training and test data sets, and ran
[command]
to create a model file. Using this model file, I ran svm_classify, which produced a prediction file. The prediction file contains some values.
What do these numbers represent? I would like to classify my data into -1 and +1, but I see no such values in the prediction file.
model file:
SVM-light Version V6.02
0 # kernel type
3 # kernel parameter -d
1 # kernel parameter -g
1 # kernel parameter -s
1 # kernel parameter -r
empty# kernel parameter -u
9947 # highest feature index
2000 # number of training documents
879 # number of support vectors plus 1
-0.13217617 # threshold b, each following line is a SV (starting with alpha*y)
-1.0000000005381390888459236521157 6:0.013155501 9:0.10063701 27:0.038305663 41:0.12115256 63:0.056871183 142:0.020468477 206:0.12547429 286:0.073713586 406:0.12335037 578:0.40131235 720:0.13097784 960:0.30321017 1607:0.17021149 2205:0.5118736 3177:0.54580438 4507:0.27290219 #
-0.61395623101405172317157621364458 6:0.019937159 27:0.019350741 31:0.025329925 37:0.031444062 42:0.11928168 83:0.03443896 127:0.066094264 142:0.0086166598 162:0.035993244 190:0.056980081 202:0.16503957 286:0.074475288 323:0.056850906 386:0.052928429 408:0.039132856 411:0.049789339 480:0.048880257 500:0.068775021 506:0.037179198 555:0.076585822 594:0.063632675 663:0.062197074 673:0.067195281 782:0.075720288 834:0.066969693 923:0.44677126 1146:0.076086208 1191:0.5542227 1225:0.059279677 1302:0.094811738 1305:0.060443446 1379:0.070145406 1544:0.087077379 1936:0.089480147 2451:0.31556693 2796:0.1145037 2833:0.20080972 6242:0.1545693 6574:0.28386003 7639:0.29435158 #
etc...
prediction file:
1.0142989
1.3699419
1.4742762
0.52224801
0.41167112
1.3597693
0.91790572
1.1846312
1.5038173
-1.7641716
-1.4615855
-0.75832723
etc...
In your training file, did you provide the known classes (+1, -1)? e.g.
-1 1:0.43 3:0.12 9284:0.2 # abcdef
Can you provide an excerpt of this file as well as the commands you ran?
The prediction file holds the decision value for each data point under the model you trained. You can treat values below 0 as classifying the data point into the -1 category, and values above 0 into the +1 category.
When you run the classification on the training set, you will see where the model works and where it fails.
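A minimal sketch of that thresholding step in Python (the file name is whatever you passed to svm_classify):
# Map svm_classify decision values to +1 / -1 by thresholding at zero.
with open("prediction") as f:
    labels = [1 if float(line) > 0 else -1 for line in f if line.strip()]
print(labels)   # for the excerpt above: nine 1s followed by three -1s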

How to do proper testing in Weka and how to get the desired results?

I am currently working on an application of ANN, SVM and linear regression methods for predicting the fruit yield of a region from meteorological factors (13 factors).
The total data set has 36 instances.
While implementing these methods in WEKA I am getting bad results.
For example, in the case of MultilayerPerceptron my results are as follows
(I divided the dataset into 28 instances for training and 8 for testing):
=== Run information ===
Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -G -R
Relation: apr6_data
Instances: 28
Attributes: 15
Time taken to build model: 3.69 seconds
=== Predictions on test set ===
inst# actual predicted error
1 2.551 2.36 -0.191
2 2.126 3.079 0.953
3 2.6 1.319 -1.281
4 1.901 3.539 1.638
5 2.146 3.635 1.489
6 2.533 2.917 0.384
7 2.54 2.744 0.204
8 2.82 3.473 0.653
=== Evaluation on test set ===
=== Summary ===
Correlation coefficient -0.4415
Mean absolute error 0.8493
Root mean squared error 1.0065
Relative absolute error 144.2248 %
Root relative squared error 153.5097 %
Total Number of Instances 8
In the case of SVM for regression:
inst# actual predicted error
1 2.551 2.538 -0.013
2 2.126 2.568 0.442
3 2.6 2.335 -0.265
4 1.901 2.556 0.655
5 2.146 2.632 0.486
6 2.533 2.24 -0.293
7 2.54 2.766 0.226
8 2.82 3.175 0.355
=== Evaluation on test set ===
=== Summary ===
Correlation coefficient 0.2888
Mean absolute error 0.3417
Root mean squared error 0.3862
Relative absolute error 58.0331 %
Root relative squared error 58.9028 %
Total Number of Instances 8
What could be the error in my application? Please let me know!
Thanks
Do I need to normalize the data? I assumed this is done by the WEKA classifiers.
If you want to normalize the data, you have to do it yourself: Preprocess tab -> Filter (Choose) -> find Normalize, then click Apply.
If you want to discretize your data, follow the same process.
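For what it's worth, Weka's unsupervised Normalize filter rescales each numeric attribute to [0, 1] by default; here is a small Python sketch of the same min-max transformation, just to show what the filter computes:
# Min-max rescaling to [0, 1], as Weka's Normalize filter does by default.
def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(normalize([2.551, 2.126, 2.6, 1.901]))   # toy column of values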
You might have better luck with discretising the prediction e.g. into low/medium/high yield.
Whether you need to normalize or discretize cannot be decided from your data or from a single run. For instance, discretization gives better results for naive Bayes classifiers; for SVM, I am not sure.
I did not see precision, recall or F-score in your output. But since you say you get bad results on the test set, it is very possible that your classifier is overfitting. Try to increase the number of training instances (36 is probably too few). Keep us posted on what happens when you increase the training data.

SVMlight training file format

I am a student taking an Artificial Intelligence course, and I need to use SVMlight for binary classification. I formatted my training.dat file with the following values:
1 1:317.5 2:489.923718552 3:15.3 4:13.5207248326 5:51.6 6:0 7:0.0118 8: 0.0003
-1 1:114.4 2:127.135783258 3:19.9 4:15.2130764246 5:101.5 6:0 7:0.0123 8:0.123456790123
1 1:107.0 2:0.0 3:6.0 4:0.0 5:52.0 6:0 7:0.0 8:0.0
-1 1:158.9 2:200.81 3:27.9 4:7.81 5:58.9 6:1 7:0.0 8:0.054
-1 1:46.0 2:0.0 3:15.0 4:0.0 5:16.0 6:1 7:0.0 8: 0.021
..
.....
...
When I give this file for training, it says the following:
Scanning examples done ..
Reading examples into memory -- done
Parsing Error in Line 0 !
Please guide me on what to do.
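For reference, each line is expected to look like <target> <index>:<value> ... # <comment>, with strictly increasing integer indices and no whitespace inside an index:value pair. Here is a rough checker sketch in Python, assuming that format; note it would flag the stray space in "8: 0.0003" on the first line above, which is a plausible cause of a parsing error like this:
# Rough validator for SVMlight training lines.
import re
PAIR = re.compile(r"^\d+:[-+0-9.eE]+$")

def check(line, lineno):
    body = line.split("#", 1)[0].split()
    if not body:
        return
    target, feats = body[0], body[1:]
    if target not in ("+1", "1", "-1"):
        print("line %d: bad target %r" % (lineno, target))
    prev = 0
    for tok in feats:
        if not PAIR.match(tok):
            print("line %d: malformed pair %r" % (lineno, tok))
            continue
        idx = int(tok.split(":", 1)[0])
        if idx <= prev:
            print("line %d: indices not strictly increasing at %r" % (lineno, tok))
        prev = idx

with open("training.dat") as f:
    for i, line in enumerate(f, 1):
        check(line, i)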
