How to computer Document Length and Average Document Length in BM25 - search-engine

Please tell me anyone as how to compute document(dl) length and average document length(avdl) in BM25. For example we have the following 4 documents:
new york times east // Doc1
los angeles times west //Doc2
washington post district columbia //Doc3
wall street journal north //Doc4

The first step is to remove stop-words and perform stemming so that we can consider a document d as a set of constituent terms with corresponding term frequencies {tf(t,d) : t \in d}.
Now, the notion of document length is slightly different in vector space and probabilistic models, e.g. BM25, language model etc. While in the former, document length refers to the norm of a vector, in the latter it typically refers to total number of terms in a document.
Nonetheless, the vector norm notion of documents can, in principle, be also applied to probabilistic models as well because the term frequency values still remain normalized between 0 and 1. However, the normalized term frequency values would no longer sum to 1.
To illustrate with your example: In the case of vector space model, the length is defined as the norm of a vector, which is the case of doc1, is norm(doc1) = square root of the sum of squares of the term frequency values for each unique term in doc1 = sqrt(1^2 + 1^2 + 1^2 + 1^2) = sqrt(4) = 2.
For the probabilistic models, length would be defined as summation of term frequencies of the component terms = 1 + 1 + 1 + 1 = 4. The normalized term frequency values of a term t would be P(t,d) = tf(t,d)/dl(d) so that \sum{P(t,d) t \in d} = 1, e.g. 1/4+1/4+1/4+1/4=1.
The BM25Similarity implementation of Lucene uses vector norms as document lengths whereas the Terrier uses sum of tfs of constituent terms as document lengths.

Related

Misconceptions about the Shannon-Nyquist theorem

I am a student working with time-series data which we feed into a neural network for classification (my task is to build and train this NN).
We're told to use a band-pass filter of 10 Hz to 150 Hz since anything outside that is not interesting.
After applying the band-pass, I've also down-sampled the data to 300 samples per second (originally it was 768 Hz). My understanding of the Shannon Nyquist sampling theorem is that, after applying the band-pass, any information in the data will be perfectly preserved at this sample-rate.
However, I got into a discussion with my supervisor who claimed that 300 Hz might not be sufficient even if the signal was band-limited. She says that it is only the minimum sample rate, not necessarily the best sample rate.
My understanding of the sampling theorem makes me think the supervisor is obviously wrong, but I don't want to argue with my supervisor, especially in case I'm actually the one who has misunderstood.
Can anyone help to confirm my understanding or provide some clarification? And how should I take this up with my supervisor (if at all).
The Nyquist-Shannon theorem states that the sampling frequency should at-least be twice of bandwidth, i.e.,
fs > 2B
So, this is the minimal criteria. If the sampling frequency is less than 2B then there will be aliasing. There is no upper limit on sampling frequency, but more the sampling frequency, the better will be the reconstruction.
So, I think your supervisor is right in saying that it is the minimal condition and not the best one.
Actually, you and your supervisor are both wrong. The minimum sampling rate required to faithfully represent a real-valued time series whose spectrum lies between 10 Hz and 150 Hz is 140 Hz, not 300 Hz. I'll explain this, and then I'll explain some of the context that shows why you might want to "oversample", as it is referred to (spoiler alert: Bailian-Low Theorem). The supervisor is mixing folklore into the discussion, and when folklore is not properly-contexted, it tends to telephone tag into fakelore. (That's a common failing even in the peer-reviewed literature, by the way). And there's a lot of fakelore, here, that needs to be defogged.
For the following, I will use the following conventions.
There's no math layout on Stack Overflow (except what we already have with UTF-8), so ...
a^b denotes a raised to the power b.
∫_I (⋯x⋯) dx denotes an integral of (⋯x⋯) taken over all x ∈ I, with the default I = ℝ.
The support supp φ (or supp_x φ(x) to make the "x" explicit) of a function φ(x) is the smallest closed set containing all the x-es for which φ(x) ≠ 0. For regularly-behaving (e.g. continuously differentiable) functions that means a union of closed intervals and/or half-rays or the whole real line, itself. This figures centrally in the Shannon-Nyquist sampling theorem, as its main condition is that a spectrum have bounded support; i.e. a "finite bandwidth".
For the Fourier transform I will use the version that has the 2π up in the exponent, and for added convenience, I will use the convention 1^x = e^{2πix} = cos(2πx) + i sin(2πx) (which I refer to as the Ramanujan Convention, as it is the convention I frequently used in my previous life oops I mean which Ramanujan secretly used in his life to make the math a whole lot simpler).
The set ℤ = {⋯, -2, -1, 0, +1, +2, ⋯ } is the integers, and 1^{x+z} = 1^x for all z∈ℤ - making 1^x the archetype of a periodic function whose period is 1.
Thus, the Fourier transform f̂(ν) of a function f(t) and its inverse are given by:
f̂(ν) = ∫ f(t) 1^{-νt} dt, f(t) = ∫ f̂(ν) 1^{+νt} dν.
The spectrum of the time series given by the function f(t) is the function f̂(ν) of the cyclic frequency ν, which is what is measured in Hertz (Hz.); t, itself, being measured in seconds. A common convention is to use the angular frequency ω = 2πν, instead, but that muddies the picture.
The most important example, with respect to the issue at hand, is the Fourier transform χ̂_Ω of the interval function given by χ_Ω(t) = 1 if t ∈ [-½Ω,+½Ω] and χ_Ω(t) = 0 else:
χ̂_Ω(t) = ∫_[-½Ω,+½Ω] 1^ν dν
= {1^{+½Ω} - 1^{-½Ω}}/{2πi}
= {2i sin πΩ}/{2πi}
= Ω sinc πΩ
which is where the function sinc x = (sin πx)/(πx) comes into play.
The cardinal form of the sampling theorem is that a function f(t) can be sampled over an equally-spaced sampled domain T ≡ { kΔt: k ∈ ℤ }, if its spectrum is bounded by supp f̂ ⊆ [-½Ω,+½Ω] ⊆ [-1/(2Δt),+1/(2Δt)], with the sampling given as
f(t) = ∑_{t'∈T} f(t') Ω sinc(Ω(t - t')) Δt.
So, this generally applies to [over-]sampling with redundancy factors 1/(ΩΔt) ≥ 1. In the special case where the sampling is tight with ΩΔt = 1, then it reduces to the form
f(t) = ∑_{t'∈T} f(t') sinc({t - t'}/Δt).
In our case, supp f̂ = [10 Hz., 150 Hz.] so the tightest fits are with 1/Δt = Ω = 300 Hz.
This generalizes to equally-spaced sampled domains of the form T ≡ { t₀ + kΔt: k ∈ ℤ } without any modification.
But it also generalizes to frequency intervals supp f̂ = [ν₋,ν₊] of width Ω = ν₊ - ν₋ and center ν₀ = ½ (ν₋ + ν₊) to the following form:
f(t) = ∑_{t'∈T} f(t') 1^{ν₀(t - t')} Ω sinc(Ω(t - t')) Δt.
In your case, you have ν₋ = 10 Hz., ν₊ = 150 Hz., Ω = 140 Hz., ν₀ = 80 Hz. with the condition Δt ≤ 1/140 second, a sampling rate of at least 140 Hz. with
f(t) = (140 Δt) ∑_{t'∈T} f(t') 1^{80(t - t')} sinc(140(t - t')).
where t and Δt are in seconds.
There is a larger context to all of this. One of the main places where this can be used is for transforms devised from an overlapping set of windowed filters in the frequency domain - a typical case in point being transforms for the time-scale plane, like the S-transform or the continuous wavelet transform.
Since you want the filters to be smoothly-windowed functions, without sharp corners, then in order for them to provide a complete set that adds up to a finite non-zero value over all of the frequency spectrum (so that they can all be normalized, in tandem, by dividing out by this sum), then their respective supports have to overlap.
(Edit: Generalized this example to cover both equally-spaced and logarithmic-spaced intervals.)
One example of such a set would be filters that have end-point frequencies taken from the set
Π = { p₀ (α + 1)ⁿ + β {(α + 1)ⁿ - 1} / α: n ∈ {0,1,2,⋯} }
So, for interval n (counting from n = 0), you would have ν₋ = p_n and ν₊ = p_{n+1}, where the members of Π are enumerated
p_n = p₀ (α + 1)ⁿ + β {(α + 1)ⁿ - 1} / α,
Δp_n = p_{n+1} - p_n = α p_n + β = (α p₀ + β)(α + 1)ⁿ,
n ∈ {0,1,2,⋯}
The center frequency of interval n would then be ν₀ = p_n + ½ Δp₀ (α + 1)ⁿ and the width would be Ω = Δp₀ (α + 1)ⁿ, but the actual support for the filter would overlap into a good part of the neighboring intervals, so that when you add up the filters that cover a given frequency ν the sum doesn't drop down to 0 as ν approaches any of the boundary points. (In the limiting case α → 0, this produces an equally-spaced frequency domain, suitable for an equalizer, while in the case β → 0, it produces a logarithmic scale with base α + 1, where octaves are equally-spaced.)
The other main place where you may apply this is to time-frequency analysis and spectrograms. Here, the role of a function f and its Fourier transform f̂ are reversed and the role of the frequency bandwidth Ω is now played by the (reciprocal) time bandwidth 1/Ω. You want to break up a time series, given by a function f(t) into overlapping segments f̃(q,λ) = g(λ)* f(q + λ), with smooth windowing given by the functions g(λ) with bounded support supp g ⊆ [-½ 1/Ω, +½ 1/Ω], and with interval spacing Δq much larger than the time sampling Δt (the ratio Δq/Δt is called the "hop" factor). The analogous role of Δt is played, here, by the frequency interval in the spectrogram Δp = Ω, which is now constant.
Edit: (Fixed the numbers for the Audacity example)
The minimum sampling rate for both supp_λ g and supp_λ f(q,λ) is Δq = 1/Ω = 1/Δp, and the corresponding redundancy factor is 1/(ΔpΔq). Audacity, for instance, uses a redundancy factor of 2 for its spectrograms. A typical value for Δp might be 44100/2048 Hz., while the time-sampling rate is Δt = 1/(2×3×5×7)² second (corresponding to 1/Δt = 44100 Hz.). With a redundancy factor of 2, Δq would be 1024/44100 second and the hop factor would be Δq/Δt = 1024.
If you try to fit the sampling windows, in either case, to the actual support of the band-limited (or time-limited) function, then the windows won't overlap and the only way to keep their sum from dropping to 0 on the boundary points would be for the windowing functions to have sharp corners on the boundaries, which would wreak havoc on their corresponding Fourier transforms.
The Balian-Low Theorem makes the actual statement on the matter.
https://encyclopediaofmath.org/wiki/Balian-Low_theorem
And a shout-out to someone I've been talking with, recently, about DSP-related matters and his monograph, which provides an excellent introductory reference to a lot of the issues discussed here.
A Friendly Guide To Wavelets
Gerald Kaiser
Birkhauser 1994
He said it's part of a trilogy, another installment of which is forthcoming.

Accounting for time with repeated-measures in lmer when not interested in time

I am trying to conduct a repeated-measures mixed-effects test with lmer and lmerTest, but I am not sure if I am doing it appropriately.
I have 6 sites with 3 plots per site that have been sampled once per year for 24 consecutive years. I have several environmental and species variables, but for simplicity, let's say I have two environmental variables (depth and temperature) and two species (species 1 and species 2). I am not interested in the time variable, changes with time, or the interactions, as this system has strong wet/dry seasonality where the effects of the dry season outweigh carry over effects of species from the prior year. I do not necessarily have data for all variables and plots every year, with some plots not sampled at times.
The question is whether species2 (a predator) has any effect on populations of species1, relative to the environmental variables.
Is it appropriate to include year as its own random effect in the model, along with plot within site?
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
For this particular analysis, there were 435 total observations (plot/year), but I worry that it is not appropriately conducting repeated-measures.
anova(model1)
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
depth 0.0221 0.0221 1 145.75 0.0908 0.7635
temperature 9.0213 9.0213 1 422.19 37.0429 2.596e-09 ***
species2 0.0597 0.0597 1 418.95 0.2450 0.6208
This does not seem right. Is the a better way to incorporate year, or should I include year at all?
If I exclude year, why does the DenDF for depth change so drastically?
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
depth 2.599 2.599 1 431.77 7.1096 0.007955 **
temperature 58.788 58.788 1 432.10 160.7955 < 2.2e-16 ***
species2 0.853 0.853 1 429.62 2.3336 0.127343
summary(M1)
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: species1 ~ depth + temperature + species2 + (1 | site/plot)
Data: data
AIC BIC logLik deviance df.resid
833.4 861.9 -409.7 819.4 428
Scaled residuals:
Min 1Q Median 3Q Max
-2.20675 -0.66119 -0.07051 0.52722 2.99942
Random effects:
Groups Name Variance Std.Dev.
plot:site (Intercept) 0.0003221 0.01795
site (Intercept) 0.2051143 0.45290
Residual 0.3656072 0.60465
Number of obs: 435, groups: plot:site, 24; site, 6
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) -0.538258 0.325072 50.071940 -1.656 0.10401
depth 0.006338 0.002377 431.768539 2.666 0.00796 **
temperature 0.391023 0.030837 432.101095 12.681 < 2e-16 ***
species2 -0.353264 0.231252 429.615226 -1.528 0.12734
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) depth temp
depth -0.316
temperature -0.467 -0.204
specie2 -0.544 0.040 0.007
I may have asked more questions than I answered, but I hope some of this is helpful.
"The question is whether species2 (a predator) has any effect on populations of species1, relative to the environmental variables."
I think when you word it this way, it is not entirely clear. Are you interested in the effect that species2 has on species1 - depending on what the environmental variables are (in other words the effect of species2 on species1 can change depending on depth or temperature? Or do you mean you would like to compare the effects of species2 on species1 to the effects of depth or temperature on species1? Or what do you mean, exactly, by "relative to the environmental variables"?
Yes, (1|year) + (1|site/plot) is a random intercept for both year and for plot within site. If you wanted a variable to be able to vary over each group (i.e. have a random slope) you would do something like (Temperature|year) + (1|site/plot) if you thought the effect of temperature on species1 might be different in different years.
Exactly how you specify the model is going to be based on your knowledge of the biological system and your knowledge of statistics. Based on the information in your question, this random effects formulation that you have suggested appears completely reasonable to me. Yes, this is allowing you to account for grouped data (grouped by each year and by each plot within site). It is possible that with only 435 observations you may have convergence issues with an overly complex model, which you may or may not have - just something to look out for.
I am not sure what you mean by "this does not seem right" - what are you expecting to see? What is missing?
I am seeing the same model twice (below), with different values as the output, is there a copy and pasting error here, or am I missing something? The values shouldn't be off with the same model structure.
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
You haven't removed year in the above line, but have below this in the summary(M1) call.
My simple answer about the year question would be yes, I would include year. Every year is so different in any biological dataset I have seen that it is worth including as a random intercept at least - exactly as you have done. If the variance of the random effect mean is estimated to be zero, then this term is as if you didn't have it there in the first place. At that point you can choose to fit that random effect as a fixed effect instead if you still would like to account for the grouped nature of the data.
Also, there are lots of resources on this. Some examples:
Bolker, Benjamin M., Mollie E. Brooks, Connie J. Clark, Shane W. Geange, John R. Poulsen, M. Henry H. Stevens, and Jada-Simone S. White. "Generalized linear mixed models: a practical guide for ecology and evolution." Trends in ecology & evolution 24, no. 3 (2009): 127-135.
Harrison, Xavier A., Lynda Donaldson, Maria Eugenia Correa-Cano, Julian Evans, David N. Fisher, Cecily ED Goodwin, Beth S. Robinson, David J. Hodgson, and Richard Inger. "A brief introduction to mixed effects modelling and multi-model inference in ecology." PeerJ 6 (2018): e4794.
https://peerj.com/articles/4794/

Can you give me a short step by step numerical example of radial basis function kernel trick? I would like to understand how to apply on perceptron

I understand well perceptron so put accent only on kernel but I am not familiar with matemathic expressions so please give me an numerical example and a guide on kernel.
For example:
My hyperplane of perceptron is x1*w1+x2*w2+x3*w3+b=0; The RBF kernel formula: k(x,z) = exp((-|x-z|^2)/2*variance^2) where takes action the radial basis function kernel here. Is x an input and what is z variable here?
Or of what I have to calculate variance if it is variance in the formula?
Somewhere I have understood so that I have to plug this formula in perceptron decision function x1*w1+x2*w2+x3*w3+b=0; but how does it look look like If I plug in?
I would like to ask a numerical example to avoid confusion.
Linear Perceptron
As you know linear perceptrons can be trained for binary classification. More precisely, if there is n features, x1, x2, ..., xn in n-dimensional space, Rn, and you want to label them in 2 categories, y1 & y2 (usually -1 and +1), you can use linear perceptron which defines a hyperplane w1*x1 + ... + wn*xn + b = 0 to do so.
w1*x1 + ... + wn*xn + b > 0 or W.X + b > 0 ==> class = y1
w1*x1 + ... + wn*xn + b < 0 or W.X + b < 0 ==> class = y2
Linear perceptron will work well, only if the problem is linearly separable in Rn. For example, in 2D space, this means that one line can separate the 2 sets of points.
Algorithm
One common algorithm to train the perceptron, i.e., find weights and bias, w's & b, based on N data points, X1, ..., XN, and their labels, Y1, ..., YN is the following:
Initialize: W = zeros(n,1); b = 0
For i=1 to N:
Calculate F(Xi) = W.Xi + b
If F(Xi)*Yi <= 0:
W <--- W + Xi*Yi
b <--- b + Yi
This will give the final value for W & b. Besides, based on the training, W will be a linear combination of training points, Xi's, more precisely, the ones that were misclassified. So W = a1*X1 + ... + ...aN*XN where a's are in {0,y1,y2}.
Now, if there is a new point, let's say Z, to label, we check the sign of F(Z) = W.Z + b = a1*(X1.Z) + ... + aN*(XN.Z) + b. It is interesting that only the inner product of new point and training points take part in it.
Kernel Perceptron
Now, if the problem is not linearly separable, one may try to go to a higher dimensional space in which a hyperplane can do the classification. As an example, consider a circle in 2D space. The points inside and outside of the circle can't be separated by a line. However, if you find a transformation that can take the points to 3D space such that the first 2 coordinates remain the same for all points, and the 3rd coordinate become +1 and -1 for the points inside and outside of the circle respectively, then a plane defined as 3rd coordinate = 0 can separate the points.
Finding such transformations can be difficult and computationally heavy, so the kernel trick is introduced. Notice that we only used the inner product of new points with the training points. Kernel trick employs this fact and defines the inner product of the transformed points without actually finding the transformation.
If the unknown transformation is P(X) then Kernel function will be:
K(Xi,Xj) = <P(Xi),P(Xj)>. So instead of finding P, kernel functions are defined which represent the scalar result of the inner product in high-dimensional space. There are also theorems about what functions can be kernel functions, i.e., correspond to inner product in another space.
After choosing a kernel function, the algorithm will be modified as follows:
Initialize: F(X) = 0
For i=1 to N:
Calculate F(Xi)
If F(Xi)*Yi <= 0:
F(.) <--- F(.) + K(.,Xi)*Yi + Yi
At the end, F(.) = a1*K(.,X1) + ... + ...aN*K(.,XN) + b where a's are in {0,y1,y2}.
RBF Kernel
Radial basis function is one type of kernel function that is actually computing the inner product in an infinite-dimensional space. It can be written as
K(Xi,Xj) = exp(- norm2(Xi-Xj)^2 / (2*sigma^2))
Sigma is some parameter that you can work with to find an optimum value for. For example, you can train the model with different values of sigma and then find the best value based on the performance. You can start with sigma = 1
After training the model to find F(.), for a new data Z, the sign of F(Z) = a1*K(Z,X1) + ... + ...aN*K(Z,XN) + b will determine the class.
Remarks:
Regarding to your question about variance, you don't need to find any variance.
About x and z in your question, in each iteration, you should find the kernel output for the current data point and all the previously added points (the points that were misclassified and hence were added to F).
I couldn't come up with a simple instructive numerical example.
References:
I borrowed some notation from
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwjVu-fXo8DOAhVDxCYKHQkcDDAQFggoMAE&url=http%3A%2F%2Falex.smola.org%2Fteaching%2Fpune2007%2Fpune_3.pdf&usg=AFQjCNHlxy9TnY8xNe2-QDERipN_GycSqQ&bvm=bv.129422649,d.eWE

The number of parameters in Gaussian mixture model

I have D-dimensional data with K components.
How many parameters if I use a model with full covariance matrices?
and How many if I use diaogonal covariance matrices?
Answer by xyLe_ at CrossValidated
https://stats.stackexchange.com/a/229321/127305
Simply do the math.
For each Gaussian you have:
1. A Symmetric full DxD covariance matrix giving (D*D - D)/2 + D parameters ((D*D - D)/2 is the number of off-diagonal elements and D is the number of diagonal elements)
2. A D dimensional mean vector giving D parameters
3. A mixing weight giving another parameter
This results in Df = (D*D - D)/2 + 2D + 1 for each gaussian.
Given you have K components, you have K*Df parameters.
In the diagonal case the covariance matrix parameters reduce to D, because of the abscence of off-diagonal elements.
Thus yielding Df = 2D + 1.

How do I form a feature vector for a classifier targeted at Named Entity Recognition?

I have a set of tags (different from the conventional Name, Place, Object etc.). In my case, they are domain-specific and I call them: Entity, Action, Incident. I want to use these as a seed for extracting more named-entities.
I came across this paper: "Efficient Support Vector Classifiers for Named Entity Recognition" by Isozaki et al. While I like the idea of using Support Vector Machines for doing named-entity recognition, I am stuck on how to encode the feature vector. For their paper, this is what they say:
For instance, the words in “President George Herbert Bush said Clinton
is . . . ” are classified as follows: “President” = OTHER, “George” =
PERSON-BEGIN, “Herbert” = PERSON-MIDDLE, “Bush” = PERSON-END, “said” =
OTHER, “Clinton” = PERSON-SINGLE, “is”
= OTHER. In this way, the first word of a person’s name is labeled as PERSON-BEGIN. The last word is labeled as PERSON-END. Other words in
the name are PERSON-MIDDLE. If a person’s name is expressed by a
single word, it is labeled as PERSON-SINGLE. If a word does not
belong to any named entities, it is labeled as OTHER. Since IREX de-
fines eight NE classes, words are classified into 33 categories.
Each sample is represented by 15 features because each word has three
features (part-of-speech tag, character type, and the word itself),
and two preceding words and two succeeding words are also used for
context dependence. Although infrequent features are usually removed
to prevent overfitting, we use all features because SVMs are robust.
Each sample is represented by a long binary vector, i.e., a sequence
of 0 (false) and 1 (true). For instance, “Bush” in the above example
is represented by a vector x = x[1] ... x[D] described below. Only
15 elements are 1.
x[1] = 0 // Current word is not ‘Alice’
x[2] = 1 // Current word is ‘Bush’
x[3] = 0 // Current word is not ‘Charlie’
x[15029] = 1 // Current POS is a proper noun
x[15030] = 0 // Current POS is not a verb
x[39181] = 0 // Previous word is not ‘Henry’
x[39182] = 1 // Previous word is ‘Herbert
I don't really understand how the binary vector here is being constructed. I know I am missing a subtle point but can someone help me understand this?
There is a bag of words lexicon building step that they omit.
Basically you have build a map from (non-rare) words in the training set to indicies. Let's say you have 20k unique words in your training set. You'll have mapping from every word in the training set to [0, 20000].
Then the feature vector is basically a concatenation of a few very sparse vectors that have a 1 corresponding to a particular word, and 19,999 0s, and then 1 for a particular POS, and 50 other 0s for non-active POS. This is generally called a one hot encoding. http://en.wikipedia.org/wiki/One-hot
def encode_word_feature(word, POStag, char_type, word_index_mapping, POS_index_mapping, char_type_index_mapping)):
# it makes a lot of sense to use a sparsely encoded vector rather than dense list, but it's clearer this way
ret = empty_vec(len(word_index_mapping) + len(POS_index_mapping) + len(char_type_index_mapping))
so_far = 0
ret[word_index_mapping[word] + so_far] = 1
so_far += len(word_index_mapping)
ret[POS_index_mapping[POStag] + so_far] = 1
so_far += len(POS_index_mapping)
ret[char_type_index_mapping[char_type] + so_far] = 1
return ret
def encode_context(context):
return encode_word_feature(context.two_words_ago, context.two_pos_ago, context.two_char_types_ago,
word_index_mapping, context_index_mapping, char_type_index_mapping) +
encode_word_feature(context.one_word_ago, context.one_pos_ago, context.one_char_types_ago,
word_index_mapping, context_index_mapping, char_type_index_mapping) +
# ... pattern is obvious
So your feature vector is about size 100k with a little extra for POS and char tags, and is almost entirely 0s, except for 15 1s in positions picked according to your feature to index mappings.

Resources