HTK: Understanding scores in resulting .mlf file

I am trying to understand the result file recout.mlf, which contains the following lines:
Which occurrence of 'as' was better pronounced: the one scored -524.427185 or the one scored -1054.774536?

The acoustic scores obtained during decoding are usually very tiny. To prevent underflow, log likelihoods are used instead of likelihoods (see 1.5 Recognition and Viterbi Decoding in the HTK documentation).
Smaller likelihood values correspond to larger negative values of their logarithms.
Thus, the first 'as' obtained the higher acoustic score (-524.427185). The logarithm is a monotonic function (the larger the argument, the larger the value), so you can compare the log-likelihoods directly: -524 > -1054.
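As a hedged numeric illustration (not actual HTK output): multiplying many small per-frame likelihoods underflows to zero in double precision, while summing their logarithms stays finite and preserves the ordering.
import math

# Hypothetical per-frame likelihoods for two candidate words (illustration only).
frames_a = [1e-3] * 200   # better acoustic match: larger per-frame likelihoods
frames_b = [1e-6] * 200   # worse acoustic match

# Direct products underflow to 0.0, so they can no longer be compared.
print(math.prod(frames_a), math.prod(frames_b))   # 0.0 0.0

# Log-likelihoods remain finite and keep the same ordering.
ll_a = sum(math.log(p) for p in frames_a)          # about -1382
ll_b = sum(math.log(p) for p in frames_b)          # about -2763
print(ll_a > ll_b)                                  # True: less negative = higher score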
BTW, it does not necessarily mean the first 'as' was better pronounced. The acoustic score depends on many factors, including model topology and the data the model was trained on.

Related

Difference between One hot encoding and Label Encoding of target/output label

I have a problem where there are 20 classes. I have designed a neural network and am using categorical_crossentropy as the loss.
When dealing with categorical cross entropy, the output labels must be one hot encoded.
So, when I one hot encoded the output labels, each row's label became a one-hot row in a matrix, while with the label encoder I got the encoding as a 1-D array. Below is the snippet of one hot encoding:
from sklearn.preprocessing import OneHotEncoder  # import added for completeness
import numpy as np

oht = OneHotEncoder()
y_train_oht = oht.fit_transform(np.array(y_train).reshape(-1, 1))
Below is the snippet of label encoding:
from sklearn.preprocessing import LabelEncoder  # imports added for completeness
from tensorflow.keras.utils import to_categorical  # or: from keras.utils import to_categorical

le = LabelEncoder()
y_train_le = le.fit_transform(y_train)
y_train_le_cat = to_categorical(y_train_le)
[screenshot: one hot encoding sample output]
[screenshot: label encoding sample output]
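Since the screenshots are not reproduced here, a minimal sketch with made-up labels of what the two transforms return (sklearn's OneHotEncoder yields a 2-D matrix, LabelEncoder a 1-D integer array):
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

y_train = ['cat', 'dog', 'fish', 'dog']   # hypothetical labels

oht = OneHotEncoder()
print(oht.fit_transform(np.array(y_train).reshape(-1, 1)).toarray())
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]   -> one one-hot row per sample

le = LabelEncoder()
print(le.fit_transform(y_train))
# [0 1 2 1]      -> a single integer per sample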
I find that one hot encoding gives a matrix while label encoding gives an array. If one hot encoding does the same job, why do we have a label encoder at all? What kind of optimization does the label encoder bring?
If using the label encoder happens to be more optimal, then why don't we use the label encoder to encode categorical input data instead of one hot encoding?
Label encoding imposes an artificial order: if you label-encode your pet target as 'Dog': 0, 'Cat': 1, 'Turtle': 2, 'Golden Fish': 3, then you get the awkward situation where 'Dog' < 'Cat' and 'Turtle' is the average of 'Cat' and 'Golden Fish'.
In the case of predictor features (not the target), this is a problem, since your Random Forest may end up learning something like "if it is less than 'Turtle', then ...".
Also, you may have categories in the test set (or, even worse, in new data during deployment) that were not present in training; the transformer doesn't know what to do with them, so it throws an error. Whether this matters depends on the particular problem and the particular feature you are encoding, and it obviously does not apply to the target variable.
With one hot encoding, if a category absent from training appears at prediction time, it just gets encoded as 0 in each of the encoded features (the new columns representing each category), so you don't get an error (provided the encoder is configured to ignore unknown categories). Your model still has the other features to make a reasonable guess.
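A minimal sketch of that behaviour, assuming sklearn's OneHotEncoder with handle_unknown='ignore':
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(np.array(['Dog', 'Cat', 'Turtle']).reshape(-1, 1))

# 'Golden Fish' was never seen during fitting: it becomes an all-zero row
# instead of raising an error.
print(enc.transform(np.array([['Dog'], ['Golden Fish']])).toarray())
# [[0. 1. 0.]
#  [0. 0. 0.]]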
As a general rule, you want to use label encoding for target variables and OHE for predictor features. Note that in general you don't care about an artificial order in the target, since the prediction is usually categorical as well (a forest will choose a class, not a range of numbers; a network will have one activation unit per category, ...).
I don't think optimization should be part of the discussion here, since the two are used in different scenarios demanding different outputs: surely it's more efficient to use the OHE transformer directly than to hack it by performing label encoding and then some pandas trickery to recreate the same result as one hot encoding.
Here there are useful comments about the different scenarios (type of model, type of data) and some issues related to efficiency.
Here there's an example on why label encoding is a bad practice for input features.
And let's not forget that the goal of the model is to make predictions, so in the end what matters is not just the output of <transformer>.fit_transform, but also the fitted transformer itself, which will be applied to new observations. OHE deals with new cases differently than a label encoder (e.g. when the value of a feature was not present in the training set). That is, in my opinion, reason enough to have two different methods, even if they act similarly enough that, for some inputs, you can force them to give similar outputs.

Using sklearn DictVectorizer in real-time systems

Any binary one-hot encoding is only aware of the values seen during training, so features not encountered during fitting will be silently ignored. For real time, where you receive millions of records per second and features have very high cardinality, you need to keep your hasher/mapper updated with the data.
How can we do an incremental update to the hasher (rather than recomputing the entire fit() every time we encounter a new feature-value pair)? What is the suggested approach to tackle this?
It depends on the learning algorithm you are using. If you are using a method designed for sparse data sets (FTRL, FFM, linear SVM), one possible approach is the following (note that it will introduce collisions between features and a lot of constant columns).
First, allocate for each sample a vector V of length D (as large as possible).
For each categorical variable, evaluate hash(var_name + "_" + var_value) % D. This gives you an integer i, and you can store V[i] = 1.
This way, V never grows as new features appear. However, once the number of features is large enough, some features will collide (i.e. be written to the same position), which may result in an increased error rate...
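A minimal sketch of that hashing scheme (illustration only; a production setup would use a stable hash such as MurmurHash rather than Python's built-in hash, which is salted per process):
import numpy as np

D = 2 ** 20  # fixed output dimension, as large as memory allows

def hash_vectorize(record, D=D):
    # Encode a dict of categorical variables into a fixed-length binary vector.
    v = np.zeros(D, dtype=np.float32)
    for var_name, var_value in record.items():
        i = hash(var_name + "_" + str(var_value)) % D
        v[i] = 1.0
    return v

# Never-seen feature values still map somewhere in [0, D): no refit needed.
x = hash_vectorize({"city": "New York", "device": "ios", "campaign_id": 987654})
print(x.shape, x.sum())   # (1048576,) 3.0 (less if two features collide)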
Edit. You can write your own vectorizer to avoid collisions. Let L be the current number of features. Prepare the same vector V of length 2L (this factor 2 will allow you to avoid collisions as new features arrive - at least for some time, depending on the arrival rate of new features).
Starting with an empty dictionary<input_type, int>, associate an integer to each feature. If you have already seen the feature, return the int corresponding to it. If not, create a new entry whose integer is the next free index. I think (but I am not sure) this is what LabelEncoder does for you.
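A hedged sketch of such an incremental vectorizer (the class name and interface are made up for illustration; sklearn's LabelEncoder itself does not update incrementally):
import numpy as np

class IncrementalVectorizer:
    # Grows a feature -> column-index mapping on the fly; no global refit needed.

    def __init__(self, initial_capacity=1024):
        self.index = {}                   # feature string -> column index
        self.capacity = initial_capacity  # current length of the output vectors

    def feature_index(self, feature):
        if feature not in self.index:
            self.index[feature] = len(self.index)   # new feature, new column
            if len(self.index) > self.capacity:
                self.capacity *= 2                   # double, as suggested above
        return self.index[feature]

    def transform(self, record):
        v = np.zeros(self.capacity, dtype=np.float32)
        for var_name, var_value in record.items():
            v[self.feature_index(var_name + "_" + str(var_value))] = 1.0
        return v

vec = IncrementalVectorizer()
x1 = vec.transform({"city": "New York"})
x2 = vec.transform({"city": "Boston"})   # unseen value: gets its own column, no error
print(len(vec.index))                     # 2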

How to decide numClasses parameter to be passed to Random Forest algorithm in Spark MLlib with pySpark

I am working on classification using the Random Forest algorithm in Spark and have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row will serve as the label and the rest serve as features. But I want to treat the label as a category and not a number, so 165.22352099 will denote a category, and so will -552.8234. For this I have encoded my features as well as the label into categorical data. What I am having difficulty with now is deciding what I should pass for the numClasses parameter of the Random Forest algorithm in Spark MLlib. Should it be equal to the number of unique values in my label? My label has about 10000 unique values, so if I put 10000 as the value of numClasses, wouldn't that decrease performance dramatically?
Here is the typical signature for building a Random Forest model in MLlib:
from pyspark.mllib.tree import RandomForest  # import added for completeness

model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
The confusion comes from the fact that you are doing something you should not do. Your problem is clearly regression/ranking, not classification. Why would you think of it as classification? Try to answer these two questions:
Do you have at least 100 samples for each value (10,000 values * 100 samples = 1,000,000 samples)?
Is there really no structure in the classes? For example, are objects with value "200" no more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some weird reason you answered yes twice, then the answer is: "yes, you should encode each distinct value as a different class", leading to 10000 unique classes, which in turn leads to:
extremely imbalanced classification (RF, without a balancing meta-learner, will nearly always fail in such a scenario)
an extreme number of classes (there are no models able to solve it; RF will certainly not)
an extremely small problem dimension - given how few features you have, I would be surprised if you could predict even a binary classification from them. And notice how irregular these values are: you have 3 points which differ only in the first value yet get completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up: with nearly 100% certainty this is not a classification problem. You should either:
regress on the last value (keyword: regression)
build a ranking (keyword: learn to rank)
bucket your values into at most 10 different bins and then classify (keywords: imbalanced classification, sparse binary representation); a minimal bucketing sketch follows below
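For the bucketing option, a hedged sketch (assuming the numeric labels are in a NumPy array; equal-frequency binning is just one possible choice):
import numpy as np

values = np.array([352.888890, 495.8001345, -495.8001345, 165.22352099,
                   495.8, 652.8, 495.8, 495.8001345, -552.8234, 7000.0])

# Cut the continuous label into at most 10 equal-frequency buckets;
# the bucket index then becomes the class label, so numClasses would be 10.
edges = np.quantile(values, np.linspace(0, 1, 11)[1:-1])
labels = np.digitize(values, edges)

print(labels)            # integer class per row, in [0, 9]
print(len(set(labels)))  # number of distinct classes actually produced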

Recommended values for OpenCV RTrees parameters

Any idea on the recommended parameters for OpenCV RTrees? I have read the documentation and I'm trying to apply it to the MNIST dataset, i.e. 60000 training images and 10000 testing images. I'm trying to optimize MaxDepth, MinSampleCount, MaxCategories, and Priors, e.g.:
Ptr<RTrees> model = RTrees::create();
/* Depth of the tree.
A low value will likely underfit and conversely
a high value will likely overfit.
The optimal value can be obtained using cross validation
or other suitable methods.
*/
model->setMaxDepth(?); // letter_recog.cpp uses 10
/* Minimum number of samples required at a node for it to be split.
A reasonable value is a small percentage of the total data e.g. 1%.
MNIST 70000 * 0.01 = 700
*/
model->setMinSampleCount(700?); // letter_recog.cpp uses 10
/* regression_accuracy – Termination criteria for regression trees.
If all absolute differences between an estimated value in a node and
values of train samples in this node are less than this parameter
then the node will not be split. */
model->setRegressionAccuracy(0); // I think this is already correct
/*
use_surrogates – If true then surrogate splits will be built.
These splits allow working with missing data and computing variable importance correctly.
To compute variable importance correctly, the surrogate splits must be enabled in
the training parameters, even if there is no missing data.
*/
model->setUseSurrogates(true); // I think this is already correct
/*
Cluster possible values of a categorical variable into K <= max_categories clusters
to find a suboptimal split. If a discrete variable, on which the training procedure
tries to make a split, takes more than max_categories values, the precise best subset
estimation may take a very long time because the algorithm is exponential.
Instead, many decision trees engines (including ML) try to find sub-optimal split
in this case by clustering all the samples into max_categories clusters that is
some categories are merged together. The clustering is applied only in n>2-class
classification problems for categorical variables with N > max_categories possible values.
In case of regression and 2-class classification the optimal split can be found
efficiently without employing clustering, thus the parameter is not used in these cases.
*/
model->setMaxCategories(?); // letter_recog.cpp uses 15
/*
priors – The array of a priori class probabilities, sorted by the class label value.
The parameter can be used to tune the decision tree preferences toward a certain class.
For example, if you want to detect some rare anomaly occurrence, the training base will
likely contain much more normal cases than anomalies, so a very good classification
performance will be achieved just by considering every case as normal.
To avoid this, the priors can be specified, where the anomaly probability is
artificially increased (up to 0.5 or even greater), so the weight of the misclassified
anomalies becomes much bigger, and the tree is adjusted properly. You can also think about
this parameter as weights of prediction categories which determine relative weights that
you give to misclassification. That is, if the weight of the first category is 1 and
the weight of the second category is 10, then each mistake in predicting the
second category is equivalent to making 10 mistakes in predicting the first category.
*/
model->setPriors(Mat()); // ?
/* If true then variable importance will be calculated and
then it can be retrieved by CvRTrees::get_var_importance().
*/
model->setCalculateVarImportance(true); // I think this is already correct
/*
The size of the randomly selected subset of features at each tree node and
that are used to find the best split(s). If you set it to 0 then the size
will be set to the square root of the total number of features.
*/
model->setActiveVarCount(0); // I think this is already correct
/*
CV_TERMCRIT_ITER Terminate learning by the max_num_of_trees_in_the_forest;
CV_TERMCRIT_EPS Terminate learning by the forest_accuracy;
CV_TERMCRIT_ITER | CV_TERMCRIT_EPS Use both termination criteria.
*/
model->setTermCriteria(TC(100,0.01f)); // I think this is already correct
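There is no single recommended value; as the comments above note, MaxDepth and MinSampleCount are best picked by validation. A hedged sketch of that idea using OpenCV's Python bindings (random placeholder data stands in for MNIST; the candidate depths are arbitrary):
import numpy as np
import cv2

# Placeholder data standing in for MNIST features/labels (assumption, not real data).
X = np.random.rand(1000, 784).astype(np.float32)
y = np.random.randint(0, 10, 1000).astype(np.int32)

split = int(0.8 * len(X))
X_tr, y_tr, X_va, y_va = X[:split], y[:split], X[split:], y[split:]

for depth in (5, 10, 15, 20):          # arbitrary candidate values
    model = cv2.ml.RTrees_create()
    model.setMaxDepth(depth)
    model.setMinSampleCount(10)
    model.setTermCriteria((cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS, 100, 0.01))
    model.train(X_tr, cv2.ml.ROW_SAMPLE, y_tr)
    _, preds = model.predict(X_va)
    acc = (preds.ravel().astype(np.int32) == y_va).mean()
    print("maxDepth=%d: validation accuracy %.3f" % (depth, acc))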

Vowpal Wabbit Readable Model

I was using Vowpal Wabbit and generating the trained classifier as a readable model.
My dataset had 22 features and the readable model gave as output:
Version 7.2.1
Min label:-50.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
101143:0.035237
101144:0.033885
101145:0.013357
101146:-0.007537
101147:-0.039093
101148:-0.013357
101149:0.001748
116060:0.499471
157941:-0.037318
157942:0.008038
157943:-0.011337
196772:0.138384
196773:0.109454
196774:0.118985
196775:-0.022981
196776:-0.301487
196777:-0.118985
197006:-0.000514
197007:-0.000373
197008:-0.000288
197009:-0.004444
197010:-0.006072
197011:0.000270
Can somebody please explain to me how to interpret the last portion of the file (after options:)? I was using logistic regression and I need to check how iterating over the training data updates my classifier, so that I can tell when I reach convergence...
Thanks in advance :)
The values you see are the hash-values and weights of all your 22 features and one additional "Constant" feature (its hash value is 116060) in the resulting trained model.
The format is:
hash_value:weight
In order to see your original feature names instead of the hash value, you may use one of two methods:
Use the utl/vw-varinfo (in the source tree) utility on your training set with the same options you used for training. Try utl/vw-varinfo for a help/usage message
Use the relatively new --invert_hash readable.model option
BTW: inverting the hash values back to the original feature names is not the default due to the large performance penalty. By default, vw applies the one way hash on each feature string it sees. It doesn't maintain a hash-map between feature-names and their hash-values at all.
Edit:
Another little tidbit that may be of interest is the first entry after options: which reads:
:0
It essentially means that any "other" feature (any feature that is not in the training set, and thus not hashed into the weight vector) defaults to a weight of 0. This means that in vowpal-wabbit it is redundant to train on features with a value of zero, which is the default anyway. Explicit :0 value features simply won't contribute anything to the model. When you leave out a value in your training set, as in feature_name without a trailing :<value>, vowpal wabbit implicitly assumes it is a binary feature with a TRUE value. IOW: it defaults all value-less features to a value of one (:1) rather than a value of zero (:0). HTH.
Vowpal Wabbit also now has an --invert_hash option, which will give you a readable model with the actual variables, as well as just the hashes.
It consumes a LOT more memory, but since your model seems to be pretty small it will probably work.
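Since the goal is to track how the weights evolve across training runs, here is a hedged sketch (the file names are hypothetical) that parses the hash_value:weight section of two readable models and compares them:
def load_weights(path):
    # Parse the hash_value:weight lines that follow the 'options:' header.
    weights, seen_options = {}, False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not seen_options:
                seen_options = line.startswith("options:")
                continue
            if ":" in line:
                key, _, value = line.rpartition(":")
                if key:                      # skip the bare ':0' default entry
                    weights[key] = float(value)
    return weights

w1 = load_weights("readable_model_pass1.txt")   # hypothetical file names
w2 = load_weights("readable_model_pass2.txt")
max_change = max(abs(w2[k] - w1.get(k, 0.0)) for k in w2)
print("largest weight change between runs:", max_change)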
