How to express the cosine similarity ( http://en.wikipedia.org/wiki/Cosine_similarity )
when one of the vectors is all zeros?
v1 = [1, 1, 1, 1, 1]
v2 = [0, 0, 0, 0, 0]
When we calculate according to the classic formula we get division by zero:
Let d1 = 0 0 0 0 0 0
Let d2 = 1 1 1 1 1 1
Cosine Similarity(d1, d2) = dot(d1, d2) / (||d1|| * ||d2||)
dot(d1, d2) = (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) = 0
||d1|| = sqrt((0)^2 + (0)^2 + (0)^2 + (0)^2 + (0)^2 + (0)^2) = 0
||d2|| = sqrt((1)^2 + (1)^2 + (1)^2 + (1)^2 + (1)^2 + (1)^2) = 2.44948974278
Cosine Similarity(d1, d2) = 0 / ((0) * (2.44948974278)) = 0 / 0
I want to use this similarity measure in a clustering application, and I will often need to compare such vectors.
The same issue arises for [0, 0, 0, 0, 0] vs. [0, 0, 0, 0, 0].
Do you have any experience with this?
Since this is a similarity (not a distance) measure, should I handle these as special cases, e.g.
d([1, 1, 1, 1, 1]; [0, 0, 0, 0, 0]) = 0
d([0, 0, 0, 0, 0]; [0, 0, 0, 0, 0]) = 1
And what about
d([1, 1, 1, 0, 0]; [0, 0, 0, 0, 0]) = ? etc.
If you have 0 vectors, cosine is the wrong similarity function for your application.
Cosine distance is essentially equivalent to squared Euclidean distance on L_2 normalized data. I.e. you normalize every vector to unit length 1, then compute squared Euclidean distance.
The other benefit of cosine is performance: computing it on very sparse, high-dimensional data is faster than Euclidean distance. It benefits from sparsity quadratically, not just linearly.
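To see the equivalence concretely, here is a minimal numpy sketch (the vector values are arbitrary, chosen only for illustration): for unit-length vectors, the squared Euclidean distance equals 2 - 2 * cosine similarity, so the two measures rank pairs identically.

import numpy as np

def cosine_similarity(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two arbitrary non-zero vectors, for illustration only
a = np.array([1.0, 2.0, 0.0, 3.0])
b = np.array([0.5, 0.0, 1.0, 2.0])

# L2-normalize, then compare squared Euclidean distance with cosine similarity
an = a / np.linalg.norm(a)
bn = b / np.linalg.norm(b)
sq_euclid = np.sum((an - bn) ** 2)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(sq_euclid, 2 - 2 * cosine_similarity(a, b))  # the two values agree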
While you obviously can try to hack the similarity to be 0 when exactly one is zero, and maximal when they are identical, it won't really solve the underlying problems.
Don't choose the distance by what you can easily compute.
Instead, choose the distance such that the result has a meaning on your data. If the value is undefined, you don't have a meaning...
Sometimes, it may work to discard constant-0 data as meaningless data anyway (e.g. analyzing Twitter noise, and seeing a Tweet that is all numbers, no words). Sometimes it doesn't.
It is undefined.
Suppose you replace the zero vector with some nonzero vector C, multiply it by epsilon > 0, and let epsilon go to zero. The limiting value depends on the direction of C, so the function has no continuous extension when one of the vectors is zero.
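A quick numeric check of that argument (the vectors below are arbitrary examples): scaling a direction C by epsilon never changes the cosine against a fixed vector, so the value you approach as the vector shrinks to zero depends entirely on the direction you came from.

import numpy as np

def cos_sim(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

d2 = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
c1 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # one direction of approach
c2 = np.array([1.0, -1.0, 0.0, 0.0, 0.0])  # another direction of approach

for eps in (1.0, 1e-3, 1e-9):
    # cos(eps * C, d2) is independent of eps but depends on C,
    # so there is no single value the similarity could take at the zero vector.
    print(cos_sim(eps * c1, d2), cos_sim(eps * c2, d2))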
I am trying to extract the rotation and translation from the fundamental matrix. I used the intrinsic matrices to get the essential matrix, but the SVD is not giving the expected results. So I composed an essential matrix from known rotation and translation matrices and ran my SVD code on it to recover them, and found that the SVD code is wrong.
I created the essential matrix using the rotation and translation matrices:
import numpy as np

R = np.array([[ 0.99965657,  0.02563432, -0.00544263],
              [-0.02596087,  0.99704732, -0.07226806],
              [ 0.00357402,  0.07238453,  0.9973704 ]])
T = [-0.1679611706725666, 0.1475313058767286, -0.9746915198833979]
tx = np.array([[0, -T[2], T[1]], [T[2], 0, -T[0]], [-T[1], T[0], 0]])
E = R.dot(tx)
# E output:
# [[-0.02418259,  0.97527093,  0.15178621],
#  [-0.96115177, -0.01316561,  0.16363519],
#  [-0.21769595, -0.16403593,  0.01268507]]
Now I try to recover the rotation and translation from E using SVD:
U,S,V = np.linalg.svd(E)
diag_110 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0]])
newE = U.dot(diag_110).dot(V.T)
U,S,V = np.linalg.svd(newE)
W = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]])
Z = np.array([[0, 1, 0],[-1, 0, 0],[0, 0, 0]])
R1 = U.dot(W).dot(V.T)
R2 = U.dot(W.T).dot(V.T)
T = U.dot(Z).dot(U.T);
T = [T[1,0], -T[2, 0], T[2, 1]]
'''
Output
R1 : [[-0.99965657, -0.00593909, 0.02552386],
[ 0.02596087, -0.35727319, 0.93363906],
[-0.00357402, -0.93398105, -0.35730468]]
R2 : [[-0.90837444, -0.20840016, -0.3625262 ],
[ 0.26284261, 0.38971602, -0.8826297 ],
[-0.32522244, 0.89704559, 0.29923163]]
T : [-0.1679611706725666, 0.1475313058767286, -0.9746915198833979],
'''
What is wrong with the SVD code? I referred to the code here and here.
Your R1 output is a left-handed, axis-permuted version of your initial (ground-truth) rotation matrix: notice that its first column is the negative of the ground truth's first column, its second and third columns are swapped, and its determinant is approximately -1 (i.e. it is a left-handed frame).
The reason this happens is that the SVD decomposition returns unitary matrices U and V with no guaranteed parity. In addition, you multiplied by an axis-permutation matrix W. It is up to you to flip or permute the axes so that the rotation has the correct handedness. You do so by enforcing constraints from the images and the scene, and known order of the cameras (i.e. knowing which camera is the left one).
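A minimal sketch of one common way to get proper rotations out of the decomposition (this only fixes the handedness; you still need the image/scene constraints mentioned above to choose between R1, R2 and the sign of t). Note also that np.linalg.svd returns the third factor already transposed, which is worth keeping in mind when composing the products.

import numpy as np

def recover_rotation_candidates(E):
    # Sketch: decompose an essential matrix into the two rotation candidates,
    # forcing both orthogonal factors to be proper (det = +1) rotations first.
    U, S, Vt = np.linalg.svd(E)   # np.linalg.svd already returns V transposed
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0, -1, 0],
                  [1,  0, 0],
                  [0,  0, 1]], dtype=float)
    R1 = U.dot(W).dot(Vt)
    R2 = U.dot(W.T).dot(Vt)
    t = U[:, 2]                   # translation is recovered only up to sign and scale
    return R1, R2, t

R1, R2, t = recover_rotation_candidates(E)   # E from the code above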
The docs for an Embedding Layer in Keras say:
Turns positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
I believe this could also be achieved by encoding the inputs as one-hot vectors of length vocabulary_size, and feeding them into a Dense Layer.
Is an Embedding Layer merely a convenience for this two-step process, or is something fancier going on under the hood?
An embedding layer is faster, because it is essentially the equivalent of a dense layer that makes simplifying assumptions.
Imagine a word-to-embedding layer with these weights:
w = [[0.1, 0.2, 0.3, 0.4],
[0.5, 0.6, 0.7, 0.8],
[0.9, 0.0, 0.1, 0.2]]
A Dense layer will treat these like actual weights with which to perform matrix multiplication. An embedding layer will simply treat these weights as a list of vectors, each vector representing one word; the 0th word in the vocabulary is w[0], 1st is w[1], etc.
For an example, use the weights above and this sentence:
[0, 2, 1, 2]
A naive Dense-based net needs to convert that sentence to a 1-hot encoding
[[1, 0, 0],
[0, 0, 1],
[0, 1, 0],
[0, 0, 1]]
then do a matrix multiplication
[[1 * 0.1 + 0 * 0.5 + 0 * 0.9, 1 * 0.2 + 0 * 0.6 + 0 * 0.0, 1 * 0.3 + 0 * 0.7 + 0 * 0.1, 1 * 0.4 + 0 * 0.8 + 0 * 0.2],
[0 * 0.1 + 0 * 0.5 + 1 * 0.9, 0 * 0.2 + 0 * 0.6 + 1 * 0.0, 0 * 0.3 + 0 * 0.7 + 1 * 0.1, 0 * 0.4 + 0 * 0.8 + 1 * 0.2],
[0 * 0.1 + 1 * 0.5 + 0 * 0.9, 0 * 0.2 + 1 * 0.6 + 0 * 0.0, 0 * 0.3 + 1 * 0.7 + 0 * 0.1, 0 * 0.4 + 1 * 0.8 + 0 * 0.2],
[0 * 0.1 + 0 * 0.5 + 1 * 0.9, 0 * 0.2 + 0 * 0.6 + 1 * 0.0, 0 * 0.3 + 0 * 0.7 + 1 * 0.1, 0 * 0.4 + 0 * 0.8 + 1 * 0.2]]
=
[[0.1, 0.2, 0.3, 0.4],
[0.9, 0.0, 0.1, 0.2],
[0.5, 0.6, 0.7, 0.8],
[0.9, 0.0, 0.1, 0.2]]
However, an Embedding layer simply looks at [0, 2, 1, 2] and takes the weights of the layer at indices zero, two, one, and two to immediately get
[w[0],
w[2],
w[1],
w[2]]
=
[[0.1, 0.2, 0.3, 0.4],
[0.9, 0.0, 0.1, 0.2],
[0.5, 0.6, 0.7, 0.8],
[0.9, 0.0, 0.1, 0.2]]
So it's the same result, just obtained in a hopefully faster way.
The Embedding layer does have limitations:
The input needs to be integers in [0, vocab_length).
No bias.
No activation.
However, none of those limitations should matter if you just want to convert an integer-encoded word into an embedding.
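To make the equivalence concrete, here is a small numpy sketch using the weights and sentence from above (numpy stands in for what the layers do internally; this is not Keras's actual implementation):

import numpy as np

w = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 0.0, 0.1, 0.2]])
sentence = np.array([0, 2, 1, 2])

# Dense-style: build the one-hot matrix, then matrix-multiply
one_hot = np.eye(3)[sentence]        # shape (4, 3)
dense_out = one_hot.dot(w)           # shape (4, 4)

# Embedding-style: index straight into the weight matrix
embedding_out = w[sentence]          # shape (4, 4)

print(np.allclose(dense_out, embedding_out))  # True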
Mathematically, the difference is this:
An embedding layer performs a select (lookup) operation. In Keras, this layer is equivalent to:
K.gather(self.embeddings, inputs) # just one matrix
A dense layer performs a dot-product operation, plus a bias and an optional activation:
outputs = matmul(inputs, self.kernel) # a kernel matrix
outputs = bias_add(outputs, self.bias) # a bias vector
return self.activation(outputs) # an activation function
You can emulate an embedding layer with a fully-connected layer via one-hot encoding, but the whole point of a dense embedding is to avoid the one-hot representation. In NLP, the vocabulary size can be on the order of 100k (sometimes even a million). On top of that, you often need to process batches of word sequences. Processing a batch of sequences of word indices is much more efficient than a batch of sequences of one-hot vectors. In addition, the gather operation itself is faster than a matrix dot-product, in both the forward and backward pass.
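For a rough sense of the input-size difference alone (the vocabulary here is kept smaller than the 100k mentioned above just to keep the demo light), compare a batch of integer index sequences with the equivalent batch of one-hot vectors:

import numpy as np

vocab_size, batch, seq_len = 10_000, 32, 50

indices = np.random.randint(0, vocab_size, size=(batch, seq_len)).astype(np.int32)
one_hot = np.zeros((batch, seq_len, vocab_size), dtype=np.float32)
one_hot[np.arange(batch)[:, None], np.arange(seq_len), indices] = 1.0

print(indices.nbytes)   # 6,400 bytes of integer indices
print(one_hot.nbytes)   # 64,000,000 bytes, almost all zeros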
Here I want to add more detail to the top-voted answer:
When we use an embedding layer, it is generally to reduce sparse one-hot input vectors to denser representations.
An embedding layer is much like a table lookup: when the table is small, the lookup is fast.
When the table is large, the lookup is slower; in that case, a dense layer is sometimes used as a dimensionality reducer on the one-hot input instead of an embedding layer.
I'm using Genetic Algorithms (GA) on an image processing problem (image segmentation, to be more precise). In this case, an individual represents a block of pixels (i.e. a set of pixel coordinates). I need to encourage individuals with contiguous pixels.
To encourage contiguous blocks of pixels:
The "contiguousness" of an individual needs to be considered in the fitness function, so that individuals with adjacent pixels are favored (best fit). Hence, during evolution, the contiguousness of a set of coordinates (i.e. an individual) will influence that individual's fitness.
The problem I'm facing is how to measure this feature (how contiguous the set is) on a set of pixel coordinates (x, y).
As can be seen in the image below, the individual (the set of pixels in black) on the right is clearly more "contiguous" (and therefore fitter) than the individual on the left:
I think I understand what you are asking, and my suggestion would be to count the number of shared "walls" between your pixels:
I would argue that from left to right the individuals are decreasing in continuity.
Counting the number of walls is not difficult to code, but might be slow the way I've implemented it here.
import random

width = 5
height = 5
image = [[0 for x in range(width)] for y in range(height)]
num_pts_in_individual = 4

# I realize this may give duplicate points
individual = [[int(random.uniform(0, height)), int(random.uniform(0, width))] for x in range(num_pts_in_individual)]

# Fill up the image
for point in individual:
    image[point[0]][point[1]] = 1

# Print out the image
for row in image:
    print(row)

def count_shared_walls(image):
    num_shared = 0
    height = len(image)
    width = len(image[0])
    for h in range(height):
        for w in range(width):
            if image[h][w] == 1:
                if h > 0 and image[h-1][w] == 1:
                    num_shared += 1
                if w > 0 and image[h][w-1] == 1:
                    num_shared += 1
                if h < height-1 and image[h+1][w] == 1:
                    num_shared += 1
                if w < width-1 and image[h][w+1] == 1:
                    num_shared += 1
    return num_shared

shared_walls = count_shared_walls(image)
print(shared_walls)
Different images and counts of shared walls:
[0, 0, 0, 0, 0]
[0, 1, 0, 0, 0]
[0, 0, 0, 0, 0]
[1, 0, 0, 1, 1]
[0, 0, 0, 0, 0]
2
[1, 0, 0, 0, 0]
[0, 0, 0, 1, 0]
[0, 0, 0, 0, 0]
[0, 0, 0, 0, 0]
[1, 0, 1, 0, 0]
0
[0, 0, 0, 1, 1]
[0, 0, 0, 1, 0]
[0, 0, 0, 0, 0]
[0, 0, 0, 0, 0]
[1, 0, 0, 0, 0]
4
One major problem with this is that a change in pixel locations that does not alter the number of shared walls will not affect the score. Maybe a combination of the distance method you described and the shared-walls approach would be best, as sketched below.
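A hedged sketch of such a combination (the Manhattan-distance tie-breaker and the weight alpha are my assumptions, not part of the original approach): keep the shared-wall count as the main term and subtract a small penalty based on the average pairwise distance between pixels.

from itertools import combinations

def contiguity_score(individual, alpha=0.1):
    # individual: list of (row, col) pixel coordinates; alpha is an assumed tuning weight
    pixels = set(map(tuple, individual))
    # Count shared walls once per adjacent pair (right and down neighbours)
    shared = sum(1 for (r, c) in pixels if (r, c + 1) in pixels)
    shared += sum(1 for (r, c) in pixels if (r + 1, c) in pixels)
    # Average pairwise Manhattan distance as a tie-breaker for equal wall counts
    pairs = list(combinations(pixels, 2))
    if pairs:
        avg_dist = sum(abs(r1 - r2) + abs(c1 - c2) for (r1, c1), (r2, c2) in pairs) / len(pairs)
    else:
        avg_dist = 0.0
    return shared - alpha * avg_dist

# Example (arbitrary pixel set): 2 shared walls, minus a small spread penalty
print(contiguity_score([(0, 3), (0, 4), (1, 3), (4, 0)]))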
I have several questions:
In the example given in the OpenCV documentation:
/* generate measurement */
cvMatMulAdd( kalman->measurement_matrix, state, measurement, measurement );
Is this correct?
In the tutorial An Introduction to the Kalman Filter by Welch and Bishop,
Equation 1.2 says measurement = H*state + measurement noise.
These two don't seem to be the same.
I was trying to implement bouncing ball tracking for a single ball.
I tried the following: (Please point out if I am doing it incorrectly.)
For the measurement I am measuring two things: (a) x and (b) y of the ball's centroid.
I am only listing the lines that differ from the example in the OpenCV documentation.
CvKalman* kalman = cvCreateKalman( 5, 2, 0 );
const float A[] = { 1, 0, 1, 0, 0,
0, 1, 0, 1, 0,
0, 0, 1, 0, 0,
0, 0, 0, 1, 1,
0, 0, 0, 0, 1};
CvMat* state = cvCreateMat( 5, 1, CV_32FC1 );
CvMat* measurement = cvCreateMat( 2, 1, CV_32FC1 );
//initialize the state of kalman filter
state->data.fl[0] = mean_c;
state->data.fl[1] = mean_r;
state->data.fl[2] = mean_c - prev_mean_c;
state->data.fl[3] = mean_r - prev_mean_r;
state->data.fl[4] = 9.81;
After initialization, this is the line that crashes:
cvMatMulAdd( kalman->transition_matrix, state,
kalman->process_noise_cov, state );
In this line they just use the variable measurement to store noise; see the previous line:
cvRandArr( &rng, measurement, CV_RAND_NORMAL, cvRealScalar(0),cvRealScalar(sqrt(kalman->measurement_noise_cov->data.fl[0])) );
You should change the dimensions of the H matrix as well. It must be 2 by 5 (measurements by state variables) so that H*state + measurement noise can be computed. You probably get the error in the line
memcpy( cvkalman->measurement_matrix->data.fl, H, sizeof(H));
because in the initial example cvkalman->measurement_matrix and H are allocated as 4 by 4 matrices, and you decreased the dimensions of cvkalman->measurement_matrix only, to 2 by 5, so the memcpy copies more elements than the matrix now holds (4*4 = 16 vs. 2*5 = 10).
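For reference, a measurement matrix that picks the x and y components out of the 5-element state [x, y, vx, vy, g] would look like the sketch below (written with numpy for readability, even though the question uses the old C API; the numeric state values are made up):

import numpy as np

# H maps the 5-element state [x, y, vx, vy, g] to the 2-element
# measurement [x, y]: measurement = H * state (+ measurement noise)
H = np.array([[1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0]], dtype=np.float32)

state = np.array([[10.0], [5.0], [1.0], [-0.5], [9.81]], dtype=np.float32)
print(H.dot(state))   # -> [[10.], [5.]]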
The O'Reilly book "Learning OpenCV" states on page 356:
Before we get totally lost, let's consider a particular realistic situation of taking measurements on a car driving in a parking lot. We might imagine that the state of the car could be summarized by two position variables, x and y, and two velocities, vx and vy. These four variables would be the elements of the state vector x_k. This suggests that the correct form for F is:
x_k = [ x;
        y;
        vx;
        vy ]

F = [ 1, 0, dt, 0;
      0, 1, 0, dt;
      0, 0, 1,  0;
      0, 0, 0,  1 ]
It seems natural to put dt right there in the F matrix, but I just don't get why. What if I have an n-state system: where would the dt terms go in F?
The dt terms are the coefficients that couple each velocity to its corresponding position. If you write out the state update after a time dt has elapsed:
x(t+dt) = x(t) + dt * vx(t)
y(t+dt) = y(t) + dt * vy(t)
vx(t+dt) = vx(t)
vy(t+dt) = vy(t)
You can read F off of these equations pretty easily.
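As a sketch of how this generalizes (assuming the state is ordered as n positions followed by their n velocities): F is an identity matrix with dt placed at the entries that couple each position to its velocity.

import numpy as np

def constant_velocity_F(n, dt):
    # Transition matrix for a state [p1..pn, v1..vn]:
    # each position picks up dt times its velocity, velocities stay constant.
    F = np.eye(2 * n)
    F[:n, n:] = dt * np.eye(n)
    return F

# n = 2 with dt = 0.1 reproduces the 4x4 matrix from the book
# (positions x, y; velocities vx, vy)
print(constant_velocity_F(2, dt=0.1))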