Using match scores to determine the right features (Machine Learning)

I am familiar with determining the extent to which a given set of documents in our knowledge base matches a search query document (based on cosine distance) once the features are available: we map both the documents and the query into the vector space defined by the features.
How do I handle the reverse? I have been given a set of documents and their match scores against multiple query documents, and I have to determine the features (or decision criteria) that determine a match. This would be the training data, and the model would be used to identify matches against our knowledge base for new search queries.
Our current approach is to think up a set of features and see which combinations get the best match scores on the training set, but we end up trying many combinations. Is there a better way to do this?

Here is a simple and straightforward approach (a linear model) that should work.
If you are working with documents and queries, the features you are using are probably tokens (words), n-grams, or topics. Let's assume the features are words for simplicity.
Suppose you have a query document:
apple iphone6
and you have a set of documents and their corresponding match scores against the above query:
(Assume the documents are the contents of URLs.)
www.apple.com (Apple - iPhone 6) score: 0.8
www.cnet.com/products/apple-iphone-6 (Apple iPhone 6 review), score: 0.75
www.stuff.tv/apple/apple-iphone-6/review (Apple iPhone 6 review), score: 0.7
....
Per-query model
First you need to extract word features from the matching URLs. Suppose we get words and their L1-normalized TF-IDF scores:
www.apple.com
apple 0.5
iphone 0.4
ios8 0.1
www.cnet.com/products/apple-iphone-6
apple 0.4
iphone 0.2
review 0.2
cnet 0.2
www.stuff.tv/apple/apple-iphone-6/review
apple 0.4
iphone 0.4
review 0.1
cnet 0.05
stuff 0.05
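As a sketch of this extraction step, one could use scikit-learn's TfidfVectorizer with L1 normalization. The document texts below are made-up stand-ins for the actual page contents, so the exact numbers above won't be reproduced:
# Sketch: L1-normalized TF-IDF features per document (scikit-learn).
# The texts are hypothetical stand-ins for the actual page contents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "apple iphone ios8",                # www.apple.com
    "apple iphone review cnet",         # www.cnet.com/products/apple-iphone-6
    "apple iphone review cnet stuff",   # www.stuff.tv/apple/apple-iphone-6/review
]

vectorizer = TfidfVectorizer(norm="l1")   # L1-normalize each document vector
X = vectorizer.fit_transform(docs)

words = vectorizer.get_feature_names_out()
for row in X.toarray():
    print({word: round(val, 3) for word, val in zip(words, row) if val > 0})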
Second, you combine the feature scores with the match scores and aggregate on a per-feature basis: each word's weight is the sum, over the matched documents, of its TF-IDF score times the document's match score:
w(apple) = 0.5 * 0.8 + 0.4 * 0.75 + 0.4 * 0.7 = 0.98
w(iphone) = 0.4 * 0.8 + 0.2 * 0.75 + 0.4 * 0.7 = 0.75
w(ios8) = 0.1 * 0.8 = 0.08
w(review) = 0.2 * 0.75 + 0.1 * 0.7 = 0.22
w(cnet) = 0.2 * 0.75 + 0.05 * 0.7 = 0.185
w(stuff) = 0.05 * 0.7 = 0.035
You might want to add a normalization step that divides each w by the number of matched documents. You now get the features below, ordered by relevance in descending order:
w(apple) = 0.98 / 3
w(iphone) = 0.75 / 3
w(review) = 0.22 / 3
w(cnet) = 0.185 / 3
w(ios8) = 0.08 / 3
w(stuff) = 0.035 / 3
You even get a linear classifier by using those weights:
score = w(apple) * tf-idf(apple) + w(iphone) * tf-idf(iphone) + ... + w(stuff) * tf-idf(stuff)
Suppose now you have a new URL with these features detected:
ios8: 0.5
cnet: 0.3
iphone: 0.2
You can then calculate its match score against the query "apple iphone6":
score = w(ios8)*0.5 + w(cnet)*0.3 + w(iphone)*0.2
      = (0.08*0.5 + 0.185*0.3 + 0.75*0.2) / 3
The match score can then be used to rank documents regarding their relevancy to the same query.
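A minimal Python sketch of this per-query model, using the numbers from the worked example (the helper names learn_weights and score are mine, not from any library):
# Sketch: learn per-feature weights from (document features, match score)
# pairs, then score a new document. Numbers follow the worked example above.

def learn_weights(docs_features, match_scores):
    """docs_features: list of {word: tfidf}; match_scores: parallel list."""
    weights = {}
    for feats, match in zip(docs_features, match_scores):
        for word, tfidf in feats.items():
            weights[word] = weights.get(word, 0.0) + tfidf * match
    n = len(docs_features)  # normalize by the number of matched documents
    return {word: w / n for word, w in weights.items()}

def score(weights, doc_features):
    return sum(weights.get(w, 0.0) * tfidf for w, tfidf in doc_features.items())

docs = [
    {"apple": 0.5, "iphone": 0.4, "ios8": 0.1},                 # www.apple.com
    {"apple": 0.4, "iphone": 0.2, "review": 0.2, "cnet": 0.2},  # cnet review
    {"apple": 0.4, "iphone": 0.4, "review": 0.1, "cnet": 0.05, "stuff": 0.05},
]
w = learn_weights(docs, [0.8, 0.75, 0.7])
print(score(w, {"ios8": 0.5, "cnet": 0.3, "iphone": 0.2}))  # ~0.0818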
Any-query model
You perform the same procedure to construct a linear model for each query. Suppose you have k such queries and their matching documents in your training data; you will end up with k such models, each constructed from one query.
model(apple iphone6) = (0.98*apple + 0.75*iphone + 0.22*review + ...) / 3
model(android apps) = (0.77*google + 0.5*android + ...) / 5
model(samsung phone) = (0.5*samsung + 0.2*galaxy + ...) / 10
Note that in the models above, 3, 5, and 10 are the normalizers (the number of documents matched to each query).
Now a new query comes in; suppose it is:
samsung android release
Our remaining tasks are to:
find relevant queries q1, q2, ..., qm
use query models to score new documents and aggregate.
You first need to extract features from this query; suppose also that you have already cached the features for each query you have learned. Using any nearest-neighbor approach (e.g., locality-sensitive hashing), you can find the top-k queries most similar to "samsung android release"; they might be:
similarity(samsung phone, samsung android release) = 0.2
similarity(android apps, samsung android release) = 0.2
Overall Ranker
Thus we get our final ranker:
0.2*model(samsung phone) + 0.2*model(android apps) =
0.2 * (0.5*samsung + 0.2*galaxy + ...) / 10 +
0.2 * (0.77*google + 0.5*android + ...) / 5
Usually in these information-retrieval applications you will already have an inverted index from features (words) to documents, so the final ranker can be evaluated very efficiently over the top documents.
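Putting the pieces together, a minimal self-contained sketch of the overall ranker (all names and numbers here are illustrative, not from any library):
# Sketch: final ranker = similarity-weighted sum of per-query linear models.

def linear_score(weights, doc_features):
    return sum(weights.get(w, 0.0) * tfidf for w, tfidf in doc_features.items())

def combined_score(doc_features, query_models, sims):
    """query_models: query -> per-feature weights; sims: query -> similarity."""
    return sum(s * linear_score(query_models[q], doc_features)
               for q, s in sims.items())

query_models = {
    "samsung phone": {"samsung": 0.5 / 10, "galaxy": 0.2 / 10},
    "android apps": {"google": 0.77 / 5, "android": 0.5 / 5},
}
sims = {"samsung phone": 0.2, "android apps": 0.2}

doc = {"samsung": 0.4, "android": 0.3, "galaxy": 0.3}
print(combined_score(doc, query_models, sims))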
Reference
For details, please refer to the IND algorithm in Omid Madani et al. Learning When Concepts Abound.

Related

How to draw a Huffman tree properly

I have the following symbols and probabilities and I would like to draw a Huffman tree for them:
s = 0.04 || i = 0.1 || n = 0.2 || b = 0.04 || a = 0.3 || d = 0.26 || ~ = 0.06
Based on the Huffman algorithm, I generated the following tree:
This was done by:
Join s + i
Join the result of 1 and n
Join ~ + d
Join b + a
Join the result of 3 and 4
Join the result of 5 and 2
My questions:
Is what I have done right or not? If so, is it acceptable that the final probability (the result of step 6) is greater than 1?
Thanks
No, what you have done is not right, and no, the only thing that is acceptable is that the final number must equal the sum of the starting numbers.
The sums do match in your case, since 0.34 + 0.66 = 1, so I don't know why you're asking that. By the way, the numbers do not have to be probabilities, so the sum does not have to be 1. Often the numbers are frequencies, i.e. the count of the number of times that symbol appeared.
As for your tree, you must always join the two lowest numbers, be they a leaf or the top of a sub-tree. At the start, that's s = 0.04 and b = 0.04. You didn't do that, so your tree does not represent the application of Huffman's algorithm. Then to that 0.08 you add ~ = 0.06. And so on.
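If it helps, here is a small Python sketch (not from the question) that applies the "always join the two lowest" rule to the question's probabilities and prints the merge order:
# Sketch: trace Huffman merges for the question's probabilities.
import heapq
import itertools

weights = {'s': 0.04, 'i': 0.1, 'n': 0.2, 'b': 0.04,
           'a': 0.3, 'd': 0.26, '~': 0.06}

tie = itertools.count()  # tie-breaker so equal weights compare cleanly
heap = [(w, next(tie), label) for label, w in weights.items()]
heapq.heapify(heap)

while len(heap) > 1:
    w1, _, n1 = heapq.heappop(heap)  # the two lowest, leaf or subtree
    w2, _, n2 = heapq.heappop(heap)
    print(f"join {n1} ({w1:.2f}) + {n2} ({w2:.2f}) -> {w1 + w2:.2f}")
    heapq.heappush(heap, (w1 + w2, next(tie), f"({n1}+{n2})"))
# First s+b (0.08), then ~ joins it (0.14), ..., final total is 1.00.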

vowpalwabbit strange features count

I have found that during training vw reports a very large feature count in its log, much larger than my actual number of features.
I have tried to reproduce it using some small example:
simple.test:
-1 | 1 2 3
1 | 3 4 5
then "vw simple.test" command says that it have used 8 features. +one feature is constant but what are the other ? And in my real exmaple difference between my features and features used in wv is abot x10 more.
....
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = t
num sources = 1
average since example example current current current
loss last counter weight label predict features
finished run
number of examples = 2
weighted example sum = 2
weighted label sum = 3
average loss = 1.9179
best constant = 1.5
total feature number = 8 !!!!
total feature number displays the sum of feature counts over all observed examples, so it's 2*(3+1 constant) = 8 in your case. The number of features in the current example is displayed in the current features column. Note that by default only every 2^Nth example is printed on screen. In general, examples can have unequal numbers of features.
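To illustrate the accounting, here is a toy Python sketch for the two-line example (the parsing is simplified; real vw input can also carry namespaces and feature weights):
# Toy sketch: reproduce vw's "total feature number" for the two-line file.
lines = ["-1 | 1 2 3", "1 | 3 4 5"]

total = 0
for line in lines:
    features = line.split("|", 1)[1].split()  # tokens after the pipe
    total += len(features) + 1                # +1 for the implicit constant
print(total)  # 2 * (3 + 1) = 8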

How do I determine the weight to assign to each bucket?

Someone will answer a series of questions and will mark each important (I), very important (V), or extremely important (E). I'll then match their answers with answers given by everyone else, compute the percent of the answers in each bucket that are the same, then combine the percentages to get a final score.
For example, I answer 10 questions, marking 3 as extremely important, 5 as very important, and 2 as important. I then match my answers with someone else's, and they answer the same to 2/3 extremely important questions, 4/5 very important questions, and 2/2 important questions. This results in percentages of 66.66 (extremely important), 80.00 (very important), and 100.00 (important). I then combine these 3 percentages to get a final score, but I first weigh each percentage to reflect the importance of each bucket. So the result would be something like: score = E * 66.66 + V * 80.00 + I * 100.00. The values of E, V, and I (the weights) are what I'm trying to figure out how to calculate.
The following are the constraints present:
1 + X + X^2 = X^3
E >= X * V >= X^2 * I > 0
E + V + I = 1
E + 0.9 * V >= 0.9
0.9 > 0.9 * E + 0.75 * V >= 0.75
E + I < 0.75
When combining the percentages, I could give important a weight of 0.0749, very important a weight of 0.2501, and extremely important a weight of 0.675, but this seems arbitrary, so I'm wondering how to calculate the optimal value for each weight. Also, how do I calculate the optimal weights if I ignore all constraints?
As far as what I mean by optimal: while adhering to the last 4 constraints, I want the weight of each bucket to be the maximum possible value, while having the weights be as far apart as possible (extremely important questions weighted maximally more than very important questions, and very important questions weighted maximally more than important questions).
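For reference, a tiny Python sketch of the scoring step with the example weights mentioned above (finding the optimal weights is the open question; these values are just the arbitrary ones from the text):
# Sketch: combine per-bucket match percentages with bucket weights.
weights = {"E": 0.675, "V": 0.2501, "I": 0.0749}  # example values from above
percent = {"E": 2 / 3 * 100, "V": 4 / 5 * 100, "I": 2 / 2 * 100}

final_score = sum(weights[b] * percent[b] for b in weights)
print(round(final_score, 2))  # ~72.5 with these example weights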

Finding standard deviation using only mean, min, max?

I want to find the standard deviation, given:
Minimum = 5
Mean = 24
Maximum = 84
Overall score = 90
I just want to find out my grade by using the standard deviation
Thanks,
A standard deviation cannot in general be computed from just the min, max, and mean. This can be demonstrated with two sets of scores that have the same min, max, and mean but different standard deviations:
1 2 4 5 : min=1 max=5 mean=3 stdev≈1.5811
1 3 3 5 : min=1 max=5 mean=3 stdev≈1.4142
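A quick check of those two sets (population standard deviation, using Python's statistics module):
# Sketch: same min, max, and mean, but different standard deviations.
from statistics import mean, pstdev

for scores in ([1, 2, 4, 5], [1, 3, 3, 5]):
    print(min(scores), max(scores), mean(scores), round(pstdev(scores), 4))
# -> 1 5 3 1.5811  and  1 5 3 1.4142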
Also, what does an 'overall score' of 90 mean if the maximum is 84?
I actually did a quick-and-dirty calculation of the type M Rad mentions. It involves assuming that the distribution is Gaussian or "normal." This does not apply to your situation but might help others asking the same question. (You can tell your distribution is not normal because the distances from the mean to the max and from the mean to the min are very different.) Even if it were normal, you would need something you don't mention: the number of samples (the number of tests taken, in your case).
Those readers who DO have a normal population can use the table below to get a rough estimate: divide the difference between your measured minimum and your calculated mean by the expected distance for your sample size. On average, the estimate will be off by the given number of standard deviations. (I have no idea whether it is biased; change the code below and calculate the error without the abs to get a guess.)
Num Samples Expected distance Expected error
10 1.55 0.25
20 1.88 0.20
30 2.05 0.18
40 2.16 0.17
50 2.26 0.15
60 2.33 0.15
70 2.38 0.14
80 2.43 0.14
90 2.47 0.13
100 2.52 0.13
This experiment shows that the "rule of thumb" of dividing the range by 4 to get the standard deviation is in general incorrect -- even for normal populations. In my experiment it only holds for sample sizes between 20 and 40 (and then loosely). This rule may have been what the OP was thinking about.
You can modify the following Python code to generate the table for different values (change max_sample_size), get more accuracy (change num_simulations), or remove the limitation to multiples of 10 (change the parameters to range in the for loop for idx):
#!/usr/bin/env python3
import random

# Return the distance of the minimum of samples from its mean
#
# Samples must have at least one entry
def min_dist_from_estd_mean(samples):
    total = 0
    sample_min = samples[0]
    for sample in samples:
        total += sample
        sample_min = min(sample, sample_min)
    estd_mean = total / len(samples)
    return estd_mean - sample_min  # Positive because min cannot exceed mean

num_simulations = 4095
max_sample_size = 100

# Calculate expected distances
sum_of_dists = [0] * (max_sample_size + 1)  # +1 so we can index by sample size
for iternum in range(num_simulations):
    samples = [random.normalvariate(0, 1)]
    while len(samples) <= max_sample_size:
        sum_of_dists[len(samples)] += min_dist_from_estd_mean(samples)
        samples.append(random.normalvariate(0, 1))
expected_dist = [total / num_simulations for total in sum_of_dists]

# Calculate average error using that distance
sum_of_errors = [0] * len(sum_of_dists)
for iternum in range(num_simulations):
    samples = [random.normalvariate(0, 1)]
    while len(samples) <= max_sample_size:
        ave_dist = expected_dist[len(samples)]
        if ave_dist > 0:
            sum_of_errors[len(samples)] += \
                abs(1 - (min_dist_from_estd_mean(samples) / ave_dist))
        samples.append(random.normalvariate(0, 1))
expected_error = [total / num_simulations for total in sum_of_errors]

cols = " {0:>15}{1:>20}{2:>20}"
print(cols.format("Num Samples", "Expected distance", "Expected error"))
cols = " {0:>15}{1:>20.2f}{2:>20.2f}"
for idx in range(10, len(expected_dist), 10):
    print(cols.format(idx, expected_dist[idx], expected_error[idx]))
You can obtain an estimate of the geometric mean, sometimes called the geometric mean of the extremes or GME, from the min and the max by calculating $GME = \sqrt{Min \cdot Max}$. The SD can then be estimated from your arithmetic mean (AM) and the GME as:
$$SD = \frac{AM}{GME} \cdot \sqrt{AM^2 - GME^2}$$
This approach works well for log-normal distributions, or as long as the GME, GM, or median is smaller than the AM.
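Applied to the numbers in the question (a rough sketch; GME < AM does hold here, but whether the log-normal assumption does is another matter):
# Sketch: GME-based SD estimate with the question's numbers.
from math import sqrt

mn, mx, am = 5, 84, 24
gme = sqrt(mn * mx)                   # geometric mean of the extremes
sd = am / gme * sqrt(am**2 - gme**2)  # only valid when GME < AM
print(round(gme, 2), round(sd, 2))    # ~20.49 and ~14.63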
In principle you can make an estimate of standard deviation from the mean/min/max and the number of elements in the sample. The min and max of a sample are, if you assume normality, random variables whose statistics follow from mean/stddev/number of samples. So given the latter, one can compute (after slogging through the math or running a bunch of monte carlo scripts) a confidence interval for the former (like it is 80% probable that the stddev is between 20 and 40 or something like that).
That said, it probably isn't worth doing except in extreme situations.

Issue in training hidden markov model and usage for classification

I am having a tough time figuring out how to use Kevin Murphy's HMM Toolbox. It would be a great help if anyone who has experience with it could clarify some conceptual questions. I have somehow understood the theory behind HMMs, but it's confusing how to actually implement one and specify all the parameter settings.
There are 2 classes so we need 2 HMMs.
Let's say the training vectors are: class1 O1 = {4 3 5 1 2} and class2 O2 = {1 4 3 2 4}.
Now the system has to classify an unknown sequence O3 = {1 3 2 4 4} as either class1 or class2.
1. What is going to go in obsmat0 and obsmat1?
2. What is the syntax for specifying the transition probabilities transmat0 and transmat1?
3. What is the variable data going to be in this case?
4. Would the number of states be Q=5, since there are five unique numbers/symbols used?
5. Is the number of output symbols 5?
Instead of answering each individual question, let me illustrate how to use the HMM toolbox with an example: the weather example, which is usually used when introducing hidden Markov models.
Basically the states of the model are the three possible types of weather: sunny, rainy and foggy. At any given day, we assume the weather can be only one of these values. Thus the set of HMM states are:
S = {sunny, rainy, foggy}
However in this example, we can't observe the weather directly (apparently we are locked in the basement!). Instead the only evidence we have is whether the person who checks on you every day is carrying an umbrella or not. In HMM terminology, these are the discrete observations:
x = {umbrella, no umbrella}
The HMM model is characterized by three things:
The prior probabilities: vector of probabilities of being in the first state of a sequence.
The transition prob: matrix describing the probabilities of going from one state of weather to another.
The emission prob: matrix describing the probabilities of observing an output (umbrella or not) given a state (weather).
Next, we are either given these probabilities, or we have to learn them from a training set. Once that's done, we can do reasoning, like computing the likelihood of an observation sequence with respect to an HMM model (or with respect to a bunch of models, picking the most likely one)...
1) known model parameters
Here is a sample code that shows how to fill existing probabilities to build the model:
Q = 3; %# number of states (sun,rain,fog)
O = 2; %# number of discrete observations (umbrella, no umbrella)
%# prior probabilities
prior = [1 0 0];
%# state transition matrix (1: sun, 2: rain, 3:fog)
A = [0.8 0.05 0.15; 0.2 0.6 0.2; 0.2 0.3 0.5];
%# observation emission matrix (1: umbrella, 2: no umbrella)
B = [0.1 0.9; 0.8 0.2; 0.3 0.7];
Then we can sample a bunch of sequences from this model:
num = 20; %# 20 sequences
T = 10; %# each of length 10 (days)
[seqs,states] = dhmm_sample(prior, A, B, num, T);
For example, the 5th sequence was:
>> seqs(5,:) %# observation sequence
ans =
2 2 1 2 1 1 1 2 2 2
>> states(5,:) %# hidden states sequence
ans =
1 1 1 3 2 2 2 1 1 1
we can evaluate the log-likelihood of the sequence:
dhmm_logprob(seqs(5,:), prior, A, B)
dhmm_logprob_path(prior, A, B, states(5,:))
or compute the Viterbi path (most probable state sequence):
vPath = viterbi_path(prior, A, multinomial_prob(seqs(5,:),B))
2) unknown model parameters
Training is performed using the EM algorithm, and is best done with a set of observation sequences.
Continuing on the same example, we can use the generated data above to train a new model and compare it to the original:
%# we start with a randomly initialized model
prior_hat = normalise(rand(Q,1));
A_hat = mk_stochastic(rand(Q,Q));
B_hat = mk_stochastic(rand(Q,O));
%# learn from data by performing many iterations of EM
[LL,prior_hat,A_hat,B_hat] = dhmm_em(seqs, prior_hat,A_hat,B_hat, 'max_iter',50);
%# plot learning curve
plot(LL), xlabel('iterations'), ylabel('log likelihood'), grid on
Keep in mind that the state order doesn't have to match; that's why we need to permute the states before comparing the two models. In this example, the trained model looks close to the original one:
>> p = [2 3 1]; %# states permutation
>> prior, prior_hat(p)
prior =
1 0 0
ans =
0.97401
7.5499e-005
0.02591
>> A, A_hat(p,p)
A =
0.8 0.05 0.15
0.2 0.6 0.2
0.2 0.3 0.5
ans =
0.75967 0.05898 0.18135
0.037482 0.77118 0.19134
0.22003 0.53381 0.24616
>> B, B_hat(p,[1 2])
B =
0.1 0.9
0.8 0.2
0.3 0.7
ans =
0.11237 0.88763
0.72839 0.27161
0.25889 0.74111
There are more things you can do with hidden Markov models, such as classification or pattern recognition. You would have different sets of observation sequences belonging to different classes. You start by training a model for each set. Then, given a new observation sequence, you classify it by computing its likelihood with respect to each model and picking the model with the highest log-likelihood:
argmax[ log P(X|model_i) ] over all model_i
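The toolbox does this in MATLAB; as a language-agnostic sketch of the same classification rule, here is a small pure-Python implementation of the scaled forward algorithm for a discrete HMM, reusing the weather example's parameters (the names, like log_likelihood, are mine, not the toolbox's):
# Sketch: score a discrete observation sequence under an HMM with the
# forward algorithm, then classify by the highest log-likelihood.
from math import log

def log_likelihood(obs, prior, A, B):
    """obs: observation indices; prior[i], A[i][j], B[i][k] are plain lists."""
    n = len(prior)
    alpha = [prior[i] * B[i][obs[0]] for i in range(n)]  # forward variables
    loglik = 0.0
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
        norm = sum(alpha)          # rescale each step to avoid underflow
        loglik += log(norm)
        alpha = [a / norm for a in alpha]
    return loglik + log(sum(alpha))

# Weather example parameters (states: sun, rain, fog; observations:
# 0 = umbrella, 1 = no umbrella, i.e. the MATLAB indices shifted by one).
prior = [1.0, 0.0, 0.0]
A = [[0.8, 0.05, 0.15], [0.2, 0.6, 0.2], [0.2, 0.3, 0.5]]
B = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]

models = {"class1": (prior, A, B)}  # in practice, one trained model per class
obs = [1, 1, 0, 1, 0, 0, 0, 1, 1, 1]  # the sampled sequence from above
best = max(models, key=lambda name: log_likelihood(obs, *models[name]))
print(best)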
I do not use the toolbox you mention, but I do use HTK. There is a book that describes the functions of HTK very clearly, available for free at http://htk.eng.cam.ac.uk/docs/docs.shtml. The introductory chapters might help your understanding.
I can have a quick attempt at answering #4 on your list...
The number of emitting states is linked to the length and complexity of your feature vectors. However, it certainly does not have to equal the length of the array of feature vectors, as each emitting state can have a transition probability back into itself, or even back to a previous state, depending on the architecture. I'm also not sure whether the value you give includes the non-emitting states at the start and end of the HMM, but these need to be considered as well. Choosing the number of states often comes down to trial and error.
Good luck!
