What is the format of word alignments in machine translation?

I am reading this paper and having difficulty understanding the way word alignments are represented. To be precise, right below section 4.1, the authors say the format of an alignment is (i,j), where i ranges over the source sentence length and j ranges over the target sentence length. This means that each alignment is a pair of two numbers, and given that sentences are typically not longer than 40-100 words, the values of i and j can be stored using the short type. So I expect the amount of space required to store these alignments to be 2 x sizeof(short) x number of word alignments. But on the next page, right above section 4.2, they say the space is sizeof(short) x number of word alignments. Why? Am I confusing things?


Extracting properties of handwritten digits to speed up the nearest neighbour algorithm

I have 1024-bit binary representations of three handwritten digits: 0, 1, 8.
Basically, the rows of a 32x32 bitmap of a digit are concatenated to form a binary vector.
There are 50 binary vectors for each digit.
When we apply nearest neighbour to each digit, we can use the Hamming distance or some other metric, and then apply the algorithm to differentiate between the vectors.
Now I want to use another technique: instead of looking at every bit of a vector, I would like to analyse fewer bits when comparing the vectors.
For example, I know that when one compares the bitmaps (size: 1024 bits) of the digits '8' and '0', there must be 1s in the middle of the vector of digit '8', as the digit 8 visually appears as a combination of two zeros stacked in a column.
So our algorithm would look for the intersection of the two zeros (which would be the middle of the digit).
That's the way I want to work. I want to convert the low-level representation (looking at the 1024-bit bitmap vector) to a high-level representation (that consists of two properties extracted from the bitmap).
Any suggestions? I hope the question is reasonably clear.
Idea 1: Flood fill
This idea does not use the 50 patterns you have per digit: it is based on the idea that usually a "1" has all 0-bits connected around that "1" shape, while a "0" separates the 0-bits inside it from those outside it, and an "8" has two such enclosed areas. So counting connected areas of 0-bits would identify which of the three it is.
So you could use a flood fill algorithm, starting at any 0 bit in the vector, and set all those connected 0-bits to 1. In a 1 dimensional array you need to take care to correctly identify connected bits (either horizontally: 1 position apart, but not crossing a 32 boundary, or vertically... 32 positions apart). Of course, this flood-filling will destroy the image - so make sure to use a copy. If after one such flood-fill there are still 0 bits (which were therefore not connected to those you turned into 1), then choose one of those and start a second flood-fill there. Repeat if necessary.
When all bits have been set to 1 in that way, use the number of flood-fills you had to perform, as follows:
One flood-fill? It's a "1", because all 0-bits are connected.
Two flood-fills? It's a "0", because the shape of a zero separates two areas (inside/outside)
Three flood-fills? It's an "8", because this shape separates three areas of connected 0-bits.
Of course, this process assumes that these handwritten digits are well-formed. For example, an 8-shape may have a small gap in its outline.
It will then not be identified as an "8", but as a "0". This particular problem could be resolved by identifying "loose ends" of 1-bits (a "line" that stops). When two of those lie a short distance apart, increase the number you got from flood-fill counting by 1 (as if those two ends were connected).
Similarly, a "0" may accidentally have a small second loop.
It will then be identified as an "8" instead of a "0". You could prevent this particular problem by requiring that each flood-fill finds a minimum number of 0-bits (say, at least 10) to count as one.
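The counting described above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the bitmap is a flat, row-major list of 0/1 values, connectivity is 4-way (horizontal and vertical only), and the names count_zero_regions, from_rows and DIGIT_BY_REGIONS are made up for this example. The min_size parameter corresponds to the "minimum number of 0-bits" safeguard.

```python
from collections import deque

# Hypothetical mapping from number of 0-bit regions to digit (Idea 1).
DIGIT_BY_REGIONS = {1: "1", 2: "0", 3: "8"}


def count_zero_regions(bits, width=32, min_size=1):
    """Count 4-connected regions of 0-bits in a flat row-major bitmap."""
    grid = list(bits)              # work on a copy: the fill overwrites 0s
    height = len(grid) // width
    regions = 0
    for start in range(len(grid)):
        if grid[start] != 0:
            continue
        grid[start] = 1
        queue = deque([start])
        size = 0
        while queue:
            pos = queue.popleft()
            size += 1
            row, col = divmod(pos, width)
            neighbours = []
            if col > 0:
                neighbours.append(pos - 1)       # left, same row only
            if col < width - 1:
                neighbours.append(pos + 1)       # right, same row only
            if row > 0:
                neighbours.append(pos - width)   # up
            if row < height - 1:
                neighbours.append(pos + width)   # down
            for nxt in neighbours:
                if grid[nxt] == 0:
                    grid[nxt] = 1
                    queue.append(nxt)
        if size >= min_size:       # ignore tiny accidental loops
            regions += 1
    return regions


def from_rows(rows):
    """Helper for the toy examples: turn strings of 0/1 into a flat list."""
    return [int(ch) for row in rows for ch in row]


# Toy 5-pixel-wide shapes instead of real 32x32 digits.
ring = from_rows(["00000", "01110", "01010", "01110", "00000"])
eight = from_rows(["00000", "01110", "01010", "01110",
                   "01010", "01110", "00000"])
stroke = from_rows(["00000", "00100", "00100", "00100", "00000"])

for shape in (stroke, ring, eight):
    print(DIGIT_BY_REGIONS.get(count_zero_regions(shape, width=5)))
```

For the real data you would call count_zero_regions(vector, width=32, min_size=10) or similar; the threshold of 10 is the example value mentioned above and would need tuning.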
Idea 2: probability vector
For each digit, add up the 50 example vectors you have, so that for each position you have a count somewhere between 0 and 50. You would have one such "probability" vector per digit, so prob0, prob1 and prob8. If prob8[501] = 45, it means that it is highly probable (45/50) that an "8" vector will have a 1-bit at index 501.
Now transform these 3 probability vectors as follows: instead of storing a count per position, store the positions in order of decreasing count (probability). So if prob8[513] has the highest value (like 49), then that new array should start like [513, ...]. Let's call these new vectors A0, A8 and A1 (for the corresponding digit).
Finally, when you need to match a given input vector, simultaneously go through A0, A1 and A8 (always looking at the same index in the three vectors) and keep 3 scores. When the input vector has a 1 at the position specified in A0[i], then add 1 to score0. If it also has a 1 at the position specified in A1[i] (same i), then add 1 to score1. Same thing for score8. Increment i, and repeat. Stop this iteration as soon as you have a clear winner, i.e. when the highest score among score0, score1 and score8 has crossed a threshold difference with the second highest score among them. At that point you know which digit is being represented.
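A rough Python sketch of this second idea, under the assumption that every example is a list of 0/1 values of equal length. The function names rank_positions and classify, and the margin parameter (the "threshold difference"), are invented for illustration. If no clear winner emerges before the positions run out, this sketch simply returns the digit with the highest score.

```python
def rank_positions(examples):
    """Return bit positions ordered by decreasing count of 1-bits."""
    length = len(examples[0])
    counts = [sum(vec[i] for vec in examples) for i in range(length)]
    # sorted() is stable, so positions with equal counts keep index order.
    return sorted(range(length), key=lambda i: -counts[i])


def classify(vector, rankings, margin=5):
    """Walk all rankings in lockstep and stop once one score leads by margin."""
    scores = {digit: 0 for digit in rankings}
    for i in range(len(vector)):
        for digit, order in rankings.items():
            scores[digit] += vector[order[i]]   # +1 if the input has a 1 there
        ranked = sorted(scores.values(), reverse=True)
        if len(ranked) > 1 and ranked[0] - ranked[1] >= margin:
            break                               # clear winner, stop early
    return max(scores, key=scores.get)


# Toy data: 8-bit vectors and two classes instead of 1024 bits and three digits.
examples = {
    "a": [[1, 1, 1, 1, 0, 0, 0, 0]] * 3,
    "b": [[0, 0, 0, 0, 1, 1, 1, 1]] * 3,
}
rankings = {label: rank_positions(vecs) for label, vecs in examples.items()}
print(classify([1, 1, 1, 1, 0, 0, 0, 0], rankings, margin=2))
print(classify([0, 0, 0, 0, 1, 1, 1, 1], rankings, margin=2))
```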

Feature Scaling with Octave

I want to apply feature scaling to datasets using their means and standard deviations, and my code is below. Apparently it is not universal, since it seems to work with only one dataset. What is wrong with my code? Any help will be appreciated. Thanks!
X is the dataset I am currently using.
mu = mean(X);
sigma = std(X);
m = size(X, 1);
mu_matrix = ones(m, 1) * mu;
sigma_matrix = ones(m, 1) * sigma;
featureNormalize = (X-mu_matrix)/sigma;
Thank you for clarifying what you think the code should be doing in the comments.
My answer will effectively answer why what you think is happening is not what is happening.
First let's talk about the mean and std functions. When their input is a vector (whether this is vertically or horizontally aligned), then this will return a single number which is the mean or standard deviation of that vector respectively, as you might expect.
However, when the input is a matrix, the behaviour is different. Unless you specify the direction (dimension) along which to calculate the mean / std, it is calculated down the rows, i.e. a single number is returned for each column. The end result of this operation is therefore a horizontal vector.
Therefore, both mu and sigma will be horizontal vectors in your code.
Now let's move on to the 'matrix multiplication' operator (i.e. *).
When using the matrix multiplication operator, if you multiply a horizontal vector by a vertical vector (i.e. the usual inner product), your output is a single number (i.e. a scalar). However, if you reverse the orientations, i.e. you multiply a vertical vector by a horizontal one, the result is an 'outer product' instead (for two vectors this coincides with the Kronecker product): a matrix with as many rows as the first input and as many columns as the second. Since the size of the output of the * operation is determined by the rows of the first input and the columns of the second input, which of the two you get is implicit and depends entirely on the orientation of your inputs.
Therefore, in your case, the line mu_matrix = ones(m, 1) * mu; is not in fact appending a vector of ones, like you say. It is in fact computing the outer (Kronecker) product of a vertical vector of ones and the horizontal vector that is your mu, effectively creating an m-by-n matrix with mu repeated vertically for m rows.
Therefore, at the end of this operation, as the variable naming would suggest, mu_matrix is in fact a matrix (same with sigma_matrix), having the same size as X.
Your final step is X - mu_matrix, which gives you, at each element, the difference between x and mu at that element. Then you "divide" by the sigma matrix.
Here is why I asked if you were sure you should be using ./ instead of /.
/ is the matrix division operator. With / you are effectively performing matrix multiplication by an inverse matrix, since D / S is mathematically equivalent to D * inv(S). It seems to me you should be using ./ instead, to simply divide each element by the standard deviation of that column (which is why you had to repeat the horizontal vector over m rows in sigma_matrix, so that you could use it for 'elementwise division'). What you are trying to do, after all, is to normalise each row (i.e. observation) of a particular column by the standard deviation that is specific to that column (i.e. feature).
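To make the intended elementwise computation explicit, here is a small sketch of the same normalisation written in plain Python rather than Octave, with the dataset as a list of rows. The name feature_normalize and the toy data are made up for this example. It uses statistics.stdev, which divides by N-1 like Octave's std does by default.

```python
from statistics import mean, stdev


def feature_normalize(X):
    """Subtract each column's mean and divide by that column's std."""
    columns = list(zip(*X))                  # one tuple per feature
    mu = [mean(col) for col in columns]      # per-column mean
    sigma = [stdev(col) for col in columns]  # per-column sample std (N-1)
    X_norm = [
        [(x - m) / s for x, m, s in zip(row, mu, sigma)]  # elementwise
        for row in X
    ]
    return X_norm, mu, sigma


# Toy dataset: 3 observations (rows), 2 features (columns).
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_norm, mu, sigma = feature_normalize(X)
print(mu, sigma)
print(X_norm)
```

Every element is divided only by the standard deviation of its own column, which is what ./ with sigma_matrix does in Octave.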

Funny (rounding?) errors when adding

One column has numbers (always with 2 decimals, some are computed but all multiplications and divisions rounded to 2 decimals), the other is cumulative. The cumulative column has formula =<above cell>+<left cell>.
In the cumulative column the result is 58.78, the next number in the first column is -58.78. Because of different formatting for zero than for positive or negative numbers, I spotted something was wrong. Changing the format to several decimals, the numbers appear as:
-£58.780000000000000000000000000000 £0.000000000000007105427357601000
The non-zero zero is about 2^(-47). Another time the numbers in the same situation are:
-£50.520000000000000000000000000000 -£0.000000000000007105427357601000
How can that happen?
Also, if I change the cell in cumulative column into the actual number 58.78, the result suddenly becomes zero.
Google Sheets uses double precision floating point arithmetic, which creates such artifacts. The relative precision of this format is 2^(-53), so for a number of size around 2^6 = 64 we expect a truncation error of about 2^(-47).
Some spreadsheet users would be worried if they found out that "58.78" is actually not 58.78, because this number does not admit an exact representation in this floating point format. So the spreadsheet is hiding the truth, rounding the number for display and printing fake zeros when asked for more digits. Those zeros after 58.78 are fake.
The truth comes to light when you subtract two numbers that appear to be identical but are not — because they were computed in different ways, e.g. one obtained as a sum while the other by direct input. Rounding the result of subtraction to zero would be too much of a lie: this is no longer a matter of a small relative error, the difference between 2^(-47) and 0 may be an important one. Hence the unexpected reveal of the mechanics behind the scenes.
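The same effect can be inspected outside a spreadsheet. The short Python sketch below is only an illustration (Python floats use the same double precision format); it does not reproduce the exact sequence of operations from the sheet. It prints the value that is actually stored for 58.78 and the spacing between neighbouring doubles at that magnitude, which is where the 2^(-47) comes from. math.ulp requires Python 3.9 or later.

```python
import math
from decimal import Decimal

# Exact decimal expansion of the double closest to 58.78.
exact = Decimal(58.78)
print(exact)

# Gap between 58.78 and the next representable double.
spacing = math.ulp(58.78)
print(spacing, spacing == 2 ** -47)

# The classic demonstration that two routes to "the same" number can differ.
print(0.1 + 0.2 == 0.3)
```

Two values that both display as 58.78 may differ by one such spacing, and subtracting them then leaves a residue of that size instead of zero.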
See also: Why does Google Spreadsheets says Zero is not equals Zero?

Hidden Markov Model Notation

I have some background in machine learning and I also just completed a face-identification exercise using a support vector machine. I am in the process of trying to convert this exercise to an HMM, but I am having problems understanding the notation and how to use it (I am using Kevin Murphy’s HMM package).
I am given about 50 grayscale images of 6 different people (numbered 1-6). Each image is 10 pixels by 10 pixels and each pixel can have a value between 0 and 255 (8-bit grayscale). The goal is to be able to classify a new image as one of the 6 faces.
My approach is to turn each image into a long vector of 100 elements, each of which is a pixel value. Now I am getting to the confusing part. The notation I am using is as follows:
N: Number of hidden states - I understand that the hidden state is the person’s face (i.e. 1-6), therefore there are 6 hidden states, so N=6.
T: Length of observation sequence – is this equal to 50? I am not sure what this represents.
M: Number of observation symbols – is this equal to 100? Does the term “observation symbol” refer to the number of elements in the vector representing the observation?
O: Number of observations – what does this represent? In every example they use a single binary observed value and set this to 2 (i.e. on or off). What would this be in my case?
I greatly appreciate the help.

Optimal pairs of farthest points

I have an even number of points in 2D. I need an algorithm that pairs up those points such that the total sum of the distances within the pairs is maximal.
Dynamic programming or a greedy approach won't work, I think.
Can I use linear programming or the Hungarian algorithm? Or anything else?
You certainly can use integer linear programming. Here is an example formulation:
Introduce a binary variable x[ij] for each unordered pair of distinct points i and j (i.e. such that i<j), where x[ij]=1 iff the points i and j are grouped together.
Compute all the distances d[ij] (for i<j).
The objective is to maximize sum_[i<j] d[ij]*x[ij], subject to the constraints that each point is in exactly one pair, i.e. for every point k, sum_[i<k] x[ik] + sum_[j>k] x[kj] = 1.
Note that this also works for 3D points: you only need the distance between each pair of points.
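For a handful of points you can check such a formulation against an exhaustive search. The Python sketch below is not an ILP solver: it swaps in a recursion over subsets of unpaired points (memoised with a bitmask), which optimises the same objective under the same "each point in exactly one pair" constraint but only scales to roughly 20 points. The function name best_pairing is made up for this example.

```python
from functools import lru_cache
from math import dist


def best_pairing(points):
    """Return (total distance, pairs) of a maximum-distance perfect pairing."""
    n = len(points)
    if n % 2:
        raise ValueError("need an even number of points")

    @lru_cache(maxsize=None)
    def solve(mask):
        # mask has bit k set while point k is still unpaired
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1   # lowest unpaired point
        rest = mask & ~(1 << i)
        best = (-1.0, ())
        for j in range(i + 1, n):
            if rest & (1 << j):               # try pairing i with j
                total, pairs = solve(rest & ~(1 << j))
                total += dist(points[i], points[j])
                if total > best[0]:
                    best = (total, ((i, j),) + pairs)
        return best

    return solve((1 << n) - 1)


# Corners of a unit square: the best pairing uses the two diagonals.
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
total, pairs = best_pairing(square)
print(total, pairs)
```

Since math.dist accepts coordinates of any dimension, the same sketch also handles 3D points.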
