How do I form a feature vector for a classifier targeted at Named Entity Recognition?

I have a set of tags (different from the conventional Name, Place, Object etc.). In my case, they are domain-specific and I call them: Entity, Action, Incident. I want to use these as a seed for extracting more named-entities.
I came across this paper: "Efficient Support Vector Classifiers for Named Entity Recognition" by Isozaki et al. While I like the idea of using Support Vector Machines for doing named-entity recognition, I am stuck on how to encode the feature vector. For their paper, this is what they say:
For instance, the words in “President George Herbert Bush said Clinton
is . . . ” are classified as follows: “President” = OTHER, “George” =
PERSON-BEGIN, “Herbert” = PERSON-MIDDLE, “Bush” = PERSON-END, “said” =
OTHER, “Clinton” = PERSON-SINGLE, “is” = OTHER. In this way, the first word
of a person’s name is labeled as PERSON-BEGIN. The last word is labeled as PERSON-END. Other words in
the name are PERSON-MIDDLE. If a person’s name is expressed by a
single word, it is labeled as PERSON-SINGLE. If a word does not
belong to any named entities, it is labeled as OTHER. Since IREX
defines eight NE classes, words are classified into 33 categories.
Each sample is represented by 15 features because each word has three
features (part-of-speech tag, character type, and the word itself),
and two preceding words and two succeeding words are also used for
context dependence. Although infrequent features are usually removed
to prevent overfitting, we use all features because SVMs are robust.
Each sample is represented by a long binary vector, i.e., a sequence
of 0 (false) and 1 (true). For instance, “Bush” in the above example
is represented by a vector x = x[1] ... x[D] described below. Only
15 elements are 1.
x[1] = 0 // Current word is not ‘Alice’
x[2] = 1 // Current word is ‘Bush’
x[3] = 0 // Current word is not ‘Charlie’
x[15029] = 1 // Current POS is a proper noun
x[15030] = 0 // Current POS is not a verb
x[39181] = 0 // Previous word is not ‘Henry’
x[39182] = 1 // Previous word is ‘Herbert’
I don't really understand how the binary vector here is being constructed. I know I am missing a subtle point but can someone help me understand this?

They omit a lexicon-building step, as in a bag-of-words model.
Basically, you have to build a map from the (non-rare) words in the training set to indices. Say you have 20k unique words in your training set: you'll have a mapping from every word in the training set to an index in [0, 19999].
The feature vector is then a concatenation of a few very sparse vectors: one with a 1 at the position of a particular word and 19,999 0s, then one with a 1 for a particular POS tag and, say, 50 0s for the inactive POS tags, and so on. This is generally called one-hot encoding. http://en.wikipedia.org/wiki/One-hot
def encode_word_feature(word, POStag, char_type, word_index_mapping, POS_index_mapping, char_type_index_mapping):
    # A sparsely encoded vector makes a lot more sense than a dense list, but it's clearer this way.
    ret = [0] * (len(word_index_mapping) + len(POS_index_mapping) + len(char_type_index_mapping))
    so_far = 0  # offset of the current feature type's block
    ret[word_index_mapping[word] + so_far] = 1
    so_far += len(word_index_mapping)
    ret[POS_index_mapping[POStag] + so_far] = 1
    so_far += len(POS_index_mapping)
    ret[char_type_index_mapping[char_type] + so_far] = 1
    return ret

def encode_context(context):
    # The three mappings are assumed to be module-level globals; lists concatenate with +.
    return (encode_word_feature(context.two_words_ago, context.two_pos_ago, context.two_char_types_ago,
                                word_index_mapping, POS_index_mapping, char_type_index_mapping) +
            encode_word_feature(context.one_word_ago, context.one_pos_ago, context.one_char_types_ago,
                                word_index_mapping, POS_index_mapping, char_type_index_mapping)
            # ... the pattern continues for the current word and the two following words
            )
So your feature vector has about 100k elements (5 word positions × 20k words), plus a little extra for the POS and character-type features, and is almost entirely 0s, except for 15 1s at positions picked according to your feature-to-index mappings.
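The lexicon step and the 15 active positions can be sketched end to end. This is a minimal sketch, not the paper's implementation: the toy sentence annotations, the POS tags, the character-type labels, the `<PAD>` filler for positions outside the sentence, and the helper names `build_index` and `active_indices` are all assumptions made for illustration. It stores only the indices of the 1s, which is the sparse representation.

```python
# Toy training data: (word, POS tag, character type) triples. All values are made up.
sentence = [
    ("President", "NNP", "Capitalized"),
    ("George", "NNP", "Capitalized"),
    ("Herbert", "NNP", "Capitalized"),
    ("Bush", "NNP", "Capitalized"),
    ("said", "VBD", "lower"),
    ("Clinton", "NNP", "Capitalized"),
    ("is", "VBZ", "lower"),
]
PAD = ("<PAD>", "<PAD>", "<PAD>")  # hypothetical filler for positions outside the sentence


def build_index(values):
    """Map each distinct value to a 0-based index (the lexicon-building step)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


tokens = sentence + [PAD]
word_index = build_index(t[0] for t in tokens)
pos_index = build_index(t[1] for t in tokens)
char_index = build_index(t[2] for t in tokens)
mappings = [word_index, pos_index, char_index]
block_size = sum(len(m) for m in mappings)  # width of one word's one-hot block


def active_indices(sent, i, window=2):
    """Return the positions of the 1s in the feature vector for word i."""
    active = []
    for slot, offset in enumerate(range(-window, window + 1)):
        j = i + offset
        token = sent[j] if 0 <= j < len(sent) else PAD
        base = slot * block_size  # start of this context position's block
        for feature, mapping in zip(token, mappings):
            active.append(base + mapping[feature])
            base += len(mapping)  # move on to the next feature type's sub-block
    return active


ones = active_indices(sentence, 3)  # features for "Bush"
dimension = 5 * block_size
print(len(ones), "of", dimension, "elements are 1")
```

With a real training set the dimension grows with the vocabulary, but the number of active positions stays at 15 (3 features for each of 5 word positions).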

How to get vocabulary size of word2vec?

I have a pretrained word2vec model in pyspark and I would like to know how big its vocabulary is (and perhaps get a list of the words in the vocabulary).
Is this possible? I would guess it has to be stored somewhere, since the model can predict for new data, but I couldn't find a clear answer in the documentation.
I tried w2v_model.getVectors().count() but the result (970) seems too small for my use case. In case it is relevant, I'm using short-text data: my dataset has tens of millions of messages, each with 10 to 30 or 40 words. I am using min_count=50.
Not quite sure why you doubt the result of .getVectors().count(), which gives the desired result indeed, as shown in the documentation link you have provided yourself.
Here is the example posted there, with a vocabulary of just three (3) tokens - a, b, and c:
from pyspark.ml.feature import Word2Vec
sent = ("a b " * 100 + "a c " * 10).split(" ") # 3-token vocabulary
doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
model = word2Vec.fit(doc)
So, unsurprisingly, it is
model.getVectors().count()
# 3
and asking for the vectors themselves
model.getVectors().show()
gives
+----+--------------------+
|word| vector|
+----+--------------------+
| a|[0.09511678665876...|
| b|[-1.2028766870498...|
| c|[0.30153277516365...|
+----+--------------------+
In your case, with min_count=50 (the minCount parameter in pyspark), every word that appears fewer than 50 times in your corpus will not be represented; reducing this number will result in more vectors.
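The effect of the minimum-count threshold on vocabulary size can be checked without Spark. This is a minimal sketch in plain Python, not Spark's implementation: `vocabulary` is a hypothetical helper, and it assumes the rule that a token is kept when it occurs at least `min_count` times in the whole corpus.

```python
from collections import Counter


def vocabulary(sentences, min_count):
    """Return the tokens that occur at least min_count times across all sentences."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return sorted(token for token, n in counts.items() if n >= min_count)


# Same toy corpus as in the example: per sentence, "a" occurs 110 times,
# "b" 100 times and "c" 10 times.
sent = ("a b " * 100 + "a c " * 10).split()
corpus = [sent, sent]

print(vocabulary(corpus, min_count=5))   # all three tokens survive
print(vocabulary(corpus, min_count=50))  # "c" occurs only 20 times in total and is dropped
```

Running the same kind of count on your own tokenized messages would show whether roughly 970 distinct tokens reach the threshold of 50.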

arbitrarily weighted moving average (low- and high-pass filters)

Given an input signal x (e.g. a voltage sampled a thousand times per second over a couple of minutes), I'd like to calculate e.g.
/ this is not q
y[3] = -3*x[0] - x[1] + x[2] + 3*x[3]
y[4] = -3*x[1] - x[2] + x[3] + 3*x[4]
. . .
I'm aiming for a variable window length and variable weight coefficients. How can I do it in q? I'm aware of mavg, of signal processing in q, and of the moving sum q idiom.
In the DSP world this is called applying a filter kernel by convolution. The weight coefficients define the kernel, which makes it a high- or low-pass filter. The example above calculates the slope from the last four points, fitting a straight line by the least squares method.
Something like this would work for parameterisable coefficients:
q)x:10+sums -1+1000?2f
q)f:{sum x*til[count x]xprev\:y}
q)f[3 1 -1 -3] x
0n 0n 0n -2.385585 1.423811 2.771659 2.065391 -0.951051 -1.323334 -0.8614857 ..
Specific cases can be made a bit faster (running 0 xprev is not the best thing)
q)g:{prev[deltas x]+3*x-3 xprev x}
q)g[x]~f[3 1 -1 -3]x
1b
q)\t:100000 f[3 1 -1 -3] x
4612
q)\t:100000 g x
1791
There's a kx white paper on signal processing in q if this area interests you: https://code.kx.com/q/wp/signal-processing/
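For readers who do not use q, the same sliding weighted sum can be sketched in Python. This is a minimal, unoptimized sketch; `weighted_moving_sum` is a hypothetical name, and it follows the argument order of `f` above, where the first weight multiplies the newest sample. Positions without a full window are returned as `None`, mirroring the nulls at the start of the q output.

```python
def weighted_moving_sum(weights, signal):
    """y[i] = sum(weights[k] * signal[i - k]); None where the window is incomplete."""
    n = len(weights)
    out = []
    for i in range(len(signal)):
        if i < n - 1:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(w * signal[i - k] for k, w in enumerate(weights)))
    return out


x = [10.0, 10.5, 9.8, 10.2, 11.0, 10.7]  # made-up samples
y = weighted_moving_sum([3, 1, -1, -3], x)

# y[3] should agree with -3*x[0] - x[1] + x[2] + 3*x[3] from the question
# (up to floating-point rounding).
print(y[3], -3 * x[0] - x[1] + x[2] + 3 * x[3])
```

The window length is simply the number of weights, so other kernels need no code changes.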
This may be a bit old, but I thought I'd weigh in. There is a paper I wrote last year on signal processing that may be of some value. Working purely within KDB, and depending on the signal sizes you are using, you will see much better performance with an FFT-based convolution between the kernel/window and the signal.
However, I've only written up a simple radix-2 FFT, although in my github repo I do have untested work on a more flexible Bluestein algorithm, which will allow for more variable signal lengths. https://github.com/callumjbiggs/q-signals/blob/master/signal.q
If you wish to go down the path of performing a full manual convolution by a moving sum, then the best method would be to break it up into blocks equal to the kernel/window size (this is based on some work Arthur W did many years ago):
q)vec:10000?100.0
q)weights:30?1.0
q)wsize:count weights
q)(weights$(((wsize-1)#0.0),vec)til[wsize]+) each til count vec
32.5931 75.54583 100.4159 124.0514 105.3138 117.532 179.2236 200.5387 232.168.
If your input list is not big, then you could use the technique mentioned here:
https://code.kx.com/q/cookbook/programming-idioms/#how-do-i-apply-a-function-to-a-sequence-sliding-window
That uses the 'scan' adverb. The process creates multiple lists, which might be inefficient for big lists.
Another solution using scan is:
q)f:{sum y*next\[z;x]} / x-input list, y-weights, z-window size-1
q)f[x;-3 -1 1 3;3]
This function also creates multiple lists, so again it might not be very efficient for big lists.
Another option is to use indices to fetch the target items from the input list and perform the calculation. This operates only on the input list.
q)f:{[l;w;i]sum w*l i+til count w} / l-input list, w-weights, i-current index
q)f[x;-3 -1 1 3] each til count x
This is a very basic function. You can add more parameters to it as per your requirements.

How to apply different cost functions to different output channels of a convolutional network?

I have a convolutional neural network whose output is a 4-channel 2D image. I want to apply a sigmoid activation function to the first two channels and then use BCECriterion to compute the loss of the produced images against the ground-truth ones. I want to apply a squared loss function to the last two channels, and finally compute the gradients and do backprop. I would also like to multiply the squared loss of each of the last two channels by a desired scalar.
So the cost has the following form:
cost = crossEntropyCh[{1, 2}] + l1 * squaredLossCh_3 + l2 * squaredLossCh_4
The way I'm thinking about doing this is as follows:
criterion1 = nn.BCECriterion()
criterion2 = nn.MSECriterion()
error = criterion1:forward(model.output[{{}, {1, 2}}], groundTruth1) + l1 * criterion2:forward(model.output[{{}, {3}}], groundTruth2) + l2 * criterion2:forward(model.output[{{}, {4}}], groundTruth3)
However, I don't think this is the correct way of doing it, since I would have to do 3 separate backprop steps, one for each of the cost terms. So I wonder, can anyone give me a better solution to do this in Torch?
SplitTable and ParallelCriterion might be helpful for your problem.
Follow your current output layer with nn.SplitTable, which splits the output channels and converts the output tensor into a table. You can then combine different criteria with ParallelCriterion, so that each criterion is applied to the corresponding entry of the output table.
For details, I suggest you read the Torch documentation on tables.
Following the comments, I added the code segment below, which solves the original question.
M = 100
C = 4
H = 64
W = 64
dataIn = torch.rand(M, C, H, W)
layerOfTables = nn.Sequential()
-- Because SplitTable discards the dimension it is applied on, we insert
-- an additional dimension.
layerOfTables:add(nn.Reshape(M,C,1,H,W))
-- We want to split over the second dimension (i.e. channels).
layerOfTables:add(nn.SplitTable(2, 5))
-- We use ConcatTable to create separate paths to the data for the
-- several criteria. Each branch of the ConcatTable has access to the
-- data (i.e. the output table).
criterionPath = nn.ConcatTable()
-- Starting from offset 1, NarrowTable will select 2 elements. Since you
-- want to use this portion as a 2-channel tensor, we need to combine
-- them by using JoinTable. Without JoinTable, the output would again be a
-- table with 2 elements.
criterionPath:add(nn.Sequential():add(nn.NarrowTable(1, 2)):add(nn.JoinTable(2)))
-- SelectTable is a simplified version of NarrowTable; it fetches a single element.
criterionPath:add(nn.SelectTable(3))
criterionPath:add(nn.SelectTable(4))
layerOfTables:add(criterionPath)
-- Here goes the criterion container. You can use it as if it were a regular
-- criterion (please see the examples on the documentation page).
criterionContainer = nn.ParallelCriterion()
criterionContainer:add(nn.BCECriterion())
criterionContainer:add(nn.MSECriterion())
criterionContainer:add(nn.MSECriterion())
Since I used almost every possible table operation, it looks a little bit nasty. However, this is the only way I could solve this problem. I hope that it helps you and others suffering from the same problem. This is what the result looks like:
dataOut = layerOfTables:forward(dataIn)
print(dataOut)
{
1 : DoubleTensor - size: 100x2x64x64
2 : DoubleTensor - size: 100x1x64x64
3 : DoubleTensor - size: 100x1x64x64
}
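Numerically, the target cost from the question is a weighted sum of three scalar losses. The following plain-Python sketch works this out on toy values; it is not Torch code. The helper names, the sample numbers, and the weights `l1` and `l2` are assumptions, and both losses are assumed to be averaged over elements.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(probabilities, targets):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for p, t in zip(probabilities, targets)]
    return sum(terms) / len(terms)


def mse(outputs, targets):
    """Mean squared error."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)


# Toy flattened channel outputs (raw network outputs) and targets; values are made up.
ch12_raw, ch12_target = [2.0, -1.0, 0.5, -3.0], [1, 0, 1, 0]
ch3_out, ch3_target = [0.2, 0.4], [0.0, 0.5]
ch4_out, ch4_target = [1.0, 2.0], [1.5, 2.0]
l1, l2 = 0.5, 2.0  # hypothetical scalar weights for the two squared-loss terms

cost = (bce([sigmoid(z) for z in ch12_raw], ch12_target)
        + l1 * mse(ch3_out, ch3_target)
        + l2 * mse(ch4_out, ch4_target))
print(cost)
```

Because the total is a sum, its gradient is the sum of the per-term gradients, so one backward pass through the combined cost is enough; three separate backprop steps are not needed.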

How to compute Document Length and Average Document Length in BM25

Can anyone please tell me how to compute the document length (dl) and the average document length (avdl) in BM25? For example, we have the following 4 documents:
new york times east // Doc1
los angeles times west // Doc2
washington post district columbia // Doc3
wall street journal north // Doc4
The first step is to remove stop-words and perform stemming so that we can consider a document d as a set of constituent terms with corresponding term frequencies {tf(t,d) : t \in d}.
Now, the notion of document length is slightly different in vector space models and in probabilistic models, e.g. BM25, language models etc. In the former, document length refers to the norm of a vector; in the latter, it typically refers to the total number of terms in a document.
Nonetheless, the vector-norm notion of document length can, in principle, also be applied to probabilistic models, because the term frequency values divided by the norm still lie between 0 and 1. However, these normalized term frequency values would no longer sum to 1.
To illustrate with your example: in the vector space model, the length is defined as the norm of a vector, which in the case of Doc1 is norm(Doc1) = square root of the sum of squares of the term frequency values of each unique term in Doc1 = sqrt(1^2 + 1^2 + 1^2 + 1^2) = sqrt(4) = 2.
For the probabilistic models, the length is defined as the sum of the term frequencies of the component terms = 1 + 1 + 1 + 1 = 4. The normalized term frequency of a term t is then P(t,d) = tf(t,d)/dl(d), so that \sum_{t \in d} P(t,d) = 1, e.g. 1/4 + 1/4 + 1/4 + 1/4 = 1.
The BM25Similarity implementation of Lucene uses vector norms as document lengths whereas the Terrier uses sum of tfs of constituent terms as document lengths.
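Both notions of length can be computed directly for the four example documents. This is a minimal sketch in plain Python; it tokenizes on whitespace and skips stop-word removal and stemming, on the assumption that neither would change the counts for these toy documents.

```python
import math
from collections import Counter

docs = {
    "Doc1": "new york times east",
    "Doc2": "los angeles times west",
    "Doc3": "washington post district columbia",
    "Doc4": "wall street journal north",
}

# Term frequencies per document (whitespace tokenization only).
tf = {name: Counter(text.split()) for name, text in docs.items()}

# Probabilistic-model length: total number of term occurrences in the document.
dl = {name: sum(counts.values()) for name, counts in tf.items()}
avdl = sum(dl.values()) / len(dl)

# Vector-space length: Euclidean norm of the term-frequency vector.
norm = {name: math.sqrt(sum(c * c for c in counts.values()))
        for name, counts in tf.items()}

print(dl)            # every document has dl = 4
print(avdl)          # 4.0
print(norm["Doc1"])  # 2.0
```

Since all four documents have the same length, dl/avdl is 1 for each of them, so BM25's length normalization has no effect on this particular collection.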

Issue in training hidden markov model and usage for classification

I am having a tough time figuring out how to use Kevin Murphy's HMM toolbox. It would be a great help if anyone who has experience with it could clarify some conceptual questions. I have more or less understood the theory behind HMMs, but it's confusing how to actually implement one and how to specify all the parameter settings.
There are 2 classes, so we need 2 HMMs.
Let's say the training vectors are class1: O1 = {4 3 5 1 2} and class2: O2 = {1 4 3 2 4}.
Now, the system has to classify an unknown sequence O3 = {1 3 2 4 4} as either class1 or class2.
1. What is going to go in obsmat0 and obsmat1?
2. What is the syntax for specifying the transition probabilities transmat0 and transmat1?
3. What is the variable data going to be in this case?
4. Would the number of states be Q=5, since there are five unique numbers/symbols used?
5. Is the number of output symbols 5?
6. How do I specify the transition probabilities transmat0 and transmat1?
Instead of answering each individual question, let me illustrate how to use the HMM toolbox with an example -- the weather example which is usually used when introducing hidden markov models.
Basically, the states of the model are the three possible types of weather: sunny, rainy and foggy. On any given day, we assume the weather can be only one of these values. Thus the set of HMM states is:
S = {sunny, rainy, foggy}
However, in this example we can't observe the weather directly (apparently we are locked in the basement!). Instead, the only evidence we have is whether the person who checks on us every day is carrying an umbrella or not. In HMM terminology, these are the discrete observations:
x = {umbrella, no umbrella}
The HMM is characterized by three things:
The prior probabilities: a vector of probabilities of being in each state at the start of a sequence.
The transition probabilities: a matrix describing the probabilities of going from one state of weather to another.
The emission probabilities: a matrix describing the probabilities of observing an output (umbrella or not) given a state (weather).
Next, we are either given these probabilities, or we have to learn them from a training set. Once that's done, we can do reasoning such as computing the likelihood of an observation sequence with respect to an HMM (or a bunch of models, picking the most likely one)...
1) known model parameters
Here is sample code that shows how to fill in existing probabilities to build the model:
Q = 3; %# number of states (sun,rain,fog)
O = 2; %# number of discrete observations (umbrella, no umbrella)
%# prior probabilities
prior = [1 0 0];
%# state transition matrix (1: sun, 2: rain, 3:fog)
A = [0.8 0.05 0.15; 0.2 0.6 0.2; 0.2 0.3 0.5];
%# observation emission matrix (1: umbrella, 2: no umbrella)
B = [0.1 0.9; 0.8 0.2; 0.3 0.7];
Then we can sample a bunch of sequences from this model:
num = 20; %# 20 sequences
T = 10; %# each of length 10 (days)
[seqs,states] = dhmm_sample(prior, A, B, num, T);
for example, the 5th example was:
>> seqs(5,:) %# observation sequence
ans =
2 2 1 2 1 1 1 2 2 2
>> states(5,:) %# hidden states sequence
ans =
1 1 1 3 2 2 2 1 1 1
we can evaluate the log-likelihood of the sequence:
dhmm_logprob(seqs(5,:), prior, A, B)
dhmm_logprob_path(prior, A, B, states(5,:))
or compute the Viterbi path (most probable state sequence):
vPath = viterbi_path(prior, A, multinomial_prob(seqs(5,:),B))
2) unknown model parameters
Training is performed using the EM algorithm, and is best done with a set of observation sequences.
Continuing on the same example, we can use the generated data above to train a new model and compare it to the original:
%# we start with a randomly initialized model
prior_hat = normalise(rand(Q,1));
A_hat = mk_stochastic(rand(Q,Q));
B_hat = mk_stochastic(rand(Q,O));
%# learn from data by performing many iterations of EM
[LL,prior_hat,A_hat,B_hat] = dhmm_em(seqs, prior_hat,A_hat,B_hat, 'max_iter',50);
%# plot learning curve
plot(LL), xlabel('iterations'), ylabel('log likelihood'), grid on
Keep in mind that the order of the states doesn't have to match. That's why we need to permute the states before comparing the two models. In this example, the trained model looks close to the original one:
>> p = [2 3 1]; %# states permutation
>> prior, prior_hat(p)
prior =
1 0 0
ans =
0.97401
7.5499e-005
0.02591
>> A, A_hat(p,p)
A =
0.8 0.05 0.15
0.2 0.6 0.2
0.2 0.3 0.5
ans =
0.75967 0.05898 0.18135
0.037482 0.77118 0.19134
0.22003 0.53381 0.24616
>> B, B_hat(p,[1 2])
B =
0.1 0.9
0.8 0.2
0.3 0.7
ans =
0.11237 0.88763
0.72839 0.27161
0.25889 0.74111
There are more things you can do with hidden Markov models, such as classification or pattern recognition. You would have different sets of observation sequences belonging to different classes. You start by training a model for each set. Then, given a new observation sequence, you classify it by computing its likelihood with respect to each model and predicting the class of the model with the highest log-likelihood.
argmax[ log P(X|model_i) ] over all model_i
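The classification rule can be sketched in plain Python with a scaled forward algorithm. This is an illustration of the idea rather than the toolbox's code: `log_likelihood` and `classify` are hypothetical helpers, symbols are 0-based here (unlike the 1-based MATLAB example), and the second model is made up purely so that there are two classes to choose between.

```python
import math


def log_likelihood(obs, prior, A, B):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    obs holds 0-based symbol indices; A[i][j] = P(state j | state i);
    B[i][k] = P(symbol k | state i).
    """
    n = len(prior)
    alpha = [prior[i] * B[i][obs[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_prob


def classify(obs, models):
    """Return the name of the model with the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


# Model from the weather example (symbols: 0 = umbrella, 1 = no umbrella).
weather = ([1.0, 0.0, 0.0],
           [[0.8, 0.05, 0.15], [0.2, 0.6, 0.2], [0.2, 0.3, 0.5]],
           [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
# A second, made-up single-state model that almost always emits "umbrella".
always_umbrella = ([1.0], [[1.0]], [[0.95, 0.05]])

models = {"weather": weather, "always_umbrella": always_umbrella}
print(classify([0, 0, 0, 0, 0], models))
print(classify([1, 1, 1, 1, 1], models))
```

In the two-class setting from the question, `models` would hold the two HMMs trained on the class1 and class2 sequences, and the unknown sequence would be passed to `classify`.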
I do not use the toolbox that you mention, but I do use HTK. There is a book that describes how HTK works very clearly, available for free:
http://htk.eng.cam.ac.uk/docs/docs.shtml
The introductory chapters might help your understanding.
I can have a quick attempt at answering #4 on your list...
The number of emitting states is linked to the length and complexity of your feature vectors. However, it certainly does not have to equal the length of the array of feature vectors, as each emitting state can have a transition probability of going back into itself or even back to a previous state depending on the architecture. I'm also not sure if the value that you give includes the non-emitting states at the start and the end of the hmm, but these need to be considered also. Choosing the number of states often comes down to trial and error.
Good luck!
