Vowpal Wabbit Readable Model - machine-learning

I was using Vowpal Wabbit and was generating the classifier trained as a readable model.
My dataset had 22 features and the readable model gave as output:
Version 7.2.1
Min label:-50.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
101143:0.035237
101144:0.033885
101145:0.013357
101146:-0.007537
101147:-0.039093
101148:-0.013357
101149:0.001748
116060:0.499471
157941:-0.037318
157942:0.008038
157943:-0.011337
196772:0.138384
196773:0.109454
196774:0.118985
196775:-0.022981
196776:-0.301487
196777:-0.118985
197006:-0.000514
197007:-0.000373
197008:-0.000288
197009:-0.004444
197010:-0.006072
197011:0.000270
Can somebody please explain to me how to interpret the last portion of the file (after options: )? I was using logistic regression and I need to check how iterating over the training updates my classifier so that I can understand when I reach a convergence...
Thanks in advance :)

The values you see are the hash-values and weights of all your 22 features and one additional "Constant" feature (its hash value is 116060) in the resulting trained model.
The format is:
hash_value:weight
In order to see your original feature names instead of the hash value, you may use one of two methods:
Use the utl/vw-varinfo (in the source tree) utility on your training set with the same options you used for training. Try utl/vw-varinfo for a help/usage message
Use the relatively new --invert_hash readable.model option
BTW: inverting the hash values back to the original feature names is not the default due to the large performance penalty. By default, vw applies the one way hash on each feature string it sees. It doesn't maintain a hash-map between feature-names and their hash-values at all.
Edit:
Another little tidbit that may be of interest is the first entry after options: which reads:
:0
It essentially means that any "other" feature (all those which are not in the training-set, and thus, not hashed into the weight vector) defaults to a weight of 0. This means that it is redundant in vowpal-wabbit to train on features with values of zero, which is the default anyway. Explicit :0 value features simply won't contribute anything to the model. When you leave-out a weight in your training-set, as in: feature_name without a trailing :<value> vowpal wabbit implicitly assumes that it is a binary feature, with a TRUE value. IOW: it defaults all value-less features, to a value of one (:1) rather than a value of zero (:0). HTH.

Vowpal Wabbit also now has an --invert_hash option, which will give you a readable model with the actual variables, as well as just the hashes.
It consumes a LOT more memory, but since your model seems to be pretty small it will probably work.

Related

Difference between One hot encoding and Label Encoding of target/output label

I have a problem where there are 20 classes. I have designed a neural network and using the loss as categorical_crossentropy.
When dealing with categorical cross entropy the output label must be one hot encoded.
So, when I one hot encoded the output label, the label in every row was one hot encoded in a matrix, while in label encoder I got the same encoding in an array.
oht = OneHotEncoder()
y_train_oht = oht.fit_transform(np.array(y_train).reshape(-1,1))
below is the snippet of label encoding
le = LabelEncoder()
y_train_le = le.fit_transform(y_train)
y_train_le_cat = to_categorical(y_train_le)
one hot encoding sample output one hot encoding
label encoding sample output label encoding
I find the one hot encoding gives a matrix while label encoding gives an array. Can I please know when one hot encoding does the same job why do we have a label encoder. What kind of optimization does the label encoder bring in?
If using the label encoder happens to be more optimal then why do we not use the label encoder to encode categorical input data instead of one hot encoding?
Label encoding imposes artificial order: if you label-encode your pet target as 'Dog':0, 'Cat':1, 'Turtle':2, 'Golden Fish':3, then you get the awkward situation where 'Dog' < 'Cat' and 'Turtle is the average of 'Cat' + 'Golden Fish'.
In the case of predictor features (not the target), this is a problem since your Random Forest can be learning something like "if it less than 'Turtle', then...".
Also, you may have categories in the testing set (or even worse, new data during deployment) that were not present in the training, and the transformer doesn't know what to do, so it throws an error. This may be the case or not depending on the particular problem and particular feature you are encoding, obviously not for the target variable.
When hot encoding, if a category absent in the training is present in a prediction, it just get encoded as 0 in each of the encoded features (new columns representing each category), so you don't get an error. Your model still has the other features to make a reasonable guess.
As a general rule, you want to use label encoding for target variables and OHE for predictor features. Note that in general you don't care about artificial order in the target, since the prediction is usually categorical also (A forest will choose a number, not a range of numbers; a network will have one activation unit per category...)
I don't think optimization should be part of the discussion here since they are used for different scenarios demanding different outputs: surely it's more efficient to use the OHE transformer than trying to hack it by performing label encoding and then some pandas trickery to create the same result as with one hot encoding.
Here there are useful comments about the different scenarios (type of model, type of data) and some issues related to efficiency.
Here there's an example on why label encoding is a bad practice for input features.
And let's not forget that the goal of the model is to make predictions, so at the end what's important is not just the output of <transformer>.fit_transform, but also the fitted transformer itself that's going to be applied to the new observations. OHE will deal with new cases differently than label-encoder (e.g. when the value of the feature in the observation was not present in the training set). That's in my opinion enough reason to have different methods, even when they act in a way similar enough so, for some inputs, you may be able to force them to give similar outputs.

HTK: Understanding scores in resulting .mlf file

I am trying to understand the file result recout.mlf, so I have the following lines in that file:
Which of 'as' was well prononced: the one with -524.427185 or -1054.774536
The acoustic scores obtained during decoding are usually very tiny. To prevent underflow, log likelihoods are used instead of likelihoods: 1.5 Recognition and Viterbi Decoding.
Smaller argument values correspond to larger negative values of logarithms:
Thus, the first 'as' obtained a higher (-524.427185) acoustic score. Logarithm is a monotonic function (the larger is argument - the larger is the value), so you can compare the log-likelihoods directly: -524 > -1054.
BTW, it does not necessarily mean the first 'as' was better pronounced. The acoustic score depends on many factors, including model topology and the data the model was trained on.

How to decide numClasses parameter to be passed to Random Forest algorithm in SPark MLlib with pySpark

I am working on Classification using Random Forest algorithm in Spark have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row will serve as a label and rest serve as features. But I want to treat label as a category and not a number. So 165.22352099 will denote a category and so will -552.8234. For this I have encoded my features as well as label into categorical data. Now what I am having difficulty in is deciding what should I pass for numClasses parameter in Random Forest algorithm in Spark MlLib? I mean should it be equal to number of unique values in my label? My label has like 10000 unique values so if I put 10000 as value of numClasses then wouldn't it decrease the performance dramatically?
Here is the typical signature of building a model for Random Forest in MlLib:
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
numTrees=3, featureSubsetStrategy="auto",
impurity='gini', maxDepth=4, maxBins=32)
The confusion comes from the fact that you are doing something that you should not do. You problem is clearly a regression/ranking, not a classification. Why would you think about it as a classification? Try to answer these two questions:
Do you have at least 100 samples per each value (100,000 * 100 = 1,000,000)?
Is there completely no structure in the classes, so for example - are objects with value "200" not more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some weird reason you answered twice yes, then the answer is: "yes, you should encode each distinct value as a different class" thus leading to 10000 unique classes, which leads to:
extremely imbalanced classification (RF, without balancing meta-learner will nearly always fail in such scenario)
extreme number of classes (there are no models able to solve it, for sure RF will not solve it)
extremely small dimension of the problem- looking at as small is your number of features I would be surprised if you could predict from that binary classifiaction. As you can see how irregular are these values, you have 3 points which only diverge in first value and you get completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up, with nearly 100% certainty this is not a classification problem, you should either:
regress on last value (keyword: reggresion)
build a ranking (keyword: learn to rank)
bucket your values to at most 10 different values and then - classify (keywords: imbalanced classification, sparse binary representation)

confidence level with crfsuite predictions

I am using the CRFSuite package here
http://www.chokkan.org/software/crfsuite/tutorial.html
and I have successfully used it to build a classifier and tag text. However, I'm wondering if I can get a confidence value for each prediction it makes?
It doesn't seem so. What I would really like is to get the probability of a word being each type of tag ('PER', 'LOC', 'MISC', etc), rather than just the prediction itself.
The API provides extracting conditional probabilities. I guess you mean the crfsuite binary does not have that as option. You could edit the source and add the option yourself
I hope this serves as an answer. Sklearn crfsuite provides probability for each label.
predict_marginals(X)
Make a prediction.
Parameters: X (list of lists of dicts) – feature dicts in python-crfsuite format
Returns: y – predicted probabilities for each label at each position
Return type: list of lists of dicts
Source: https://sklearn-crfsuite.readthedocs.io/en/latest/_modules/sklearn_crfsuite/estimator.html#CRF.predict_marginals

Numerically representing Nominal Data whilst retaining data semantics

I have a dataset of nominal and numerical features. I want to be able to represent this dataset entirely numerically if possible.
Ideally I would be able to do this for an n-ary nominal feature. I realize that in the binary case, one could represent the two nominal values with integers. However, when a nominal feature can have many permutations, how would this be possible, if at all?
There are a number of techniques to "embed" categorical attributes as numbers.
For example, given a categorical variable that can take the values red, green and blue, we can trivially encode this as three attributes isRed={0,1}, isGreen={0,1} and isBlue={0,1}.
While this is popular, and will obviously "work", many people fall for the fallacy of assuming that afterwards numerical processing techniques will produce sensible results.
If you run e.g. k-means on a dataset encoded this way, the result will likely not be too meaningful afterwards. In particular, if you get a mean such as isRed=.3 isGreen=.2 isBlue=.5 - you cannot reasonably map this back to the original data. Worse, with some algorithms you may even get isRed=0 isGreen=0 isBlue=0.
I suggest that you try to work on your actual data, and avoid encoding as much as possible. If you have a good tool, it will allow you to use mixed data types. Don't try to make everything a numerical vector. This mathematical view of data is quite limited and the data will not give you all the mathematical assumptions that you need to benefit from this view (e.g. metric spaces).
Don't do this: I'm trying to encode certain nominal attributes as integers.
Except if there is only two permutations for a nominal feature. It is ok to use any different integers (for example 1 and 3) for each.
But if there is more than two permutations, integers can not be used. Lets say we assigned 1, 2 and 3 to three permutations. As we can see, there is higher relation between 1-2 and 2-3 than 1-3 because of differences.
Rather, use a separate binary feature for each value of each nominal attribute. Thus, the answer of your question: It is not possible/wisely.
If you use pandas, you can use a function called .get_dummies() on your nominal value column. This will turn the column of N unique values into N (or if you want N-1, called drop_first) new columns indicating with either a 1 or a 0 if a value is present.
Example:
s = pd.Series(list('abca'))
get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0

Resources