The docs for an Embedding Layer in Keras say:
Turns positive integers (indexes) into dense vectors of fixed size. e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
I believe this could also be achieved by encoding the inputs as one-hot vectors of length vocabulary_size, and feeding them into a Dense Layer.
Is an Embedding Layer merely a convenience for this two-step process, or is something fancier going on under the hood?
An embedding layer is faster because it is essentially a dense layer that makes simplifying assumptions.
Imagine a word-to-embedding layer with these weights:
w = [[0.1, 0.2, 0.3, 0.4],
[0.5, 0.6, 0.7, 0.8],
[0.9, 0.0, 0.1, 0.2]]
A Dense layer will treat these like actual weights with which to perform matrix multiplication. An embedding layer will simply treat these weights as a list of vectors, each vector representing one word; the 0th word in the vocabulary is w[0], 1st is w[1], etc.
For an example, use the weights above and this sentence:
[0, 2, 1, 2]
A naive Dense-based net needs to convert that sentence to a one-hot encoding
[[1, 0, 0],
[0, 0, 1],
[0, 1, 0],
[0, 0, 1]]
then do a matrix multiplication
[[1 * 0.1 + 0 * 0.5 + 0 * 0.9, 1 * 0.2 + 0 * 0.6 + 0 * 0.0, 1 * 0.3 + 0 * 0.7 + 0 * 0.1, 1 * 0.4 + 0 * 0.8 + 0 * 0.2],
[0 * 0.1 + 0 * 0.5 + 1 * 0.9, 0 * 0.2 + 0 * 0.6 + 1 * 0.0, 0 * 0.3 + 0 * 0.7 + 1 * 0.1, 0 * 0.4 + 0 * 0.8 + 1 * 0.2],
[0 * 0.1 + 1 * 0.5 + 0 * 0.9, 0 * 0.2 + 1 * 0.6 + 0 * 0.0, 0 * 0.3 + 1 * 0.7 + 0 * 0.1, 0 * 0.4 + 1 * 0.8 + 0 * 0.2],
[0 * 0.1 + 0 * 0.5 + 1 * 0.9, 0 * 0.2 + 0 * 0.6 + 1 * 0.0, 0 * 0.3 + 0 * 0.7 + 1 * 0.1, 0 * 0.4 + 0 * 0.8 + 1 * 0.2]]
=
[[0.1, 0.2, 0.3, 0.4],
[0.9, 0.0, 0.1, 0.2],
[0.5, 0.6, 0.7, 0.8],
[0.9, 0.0, 0.1, 0.2]]
However, an Embedding layer simply looks at [0, 2, 1, 2] and takes the weights of the layer at indices zero, two, one, and two to immediately get
[w[0],
w[2],
w[1],
w[2]]
=
[[0.1, 0.2, 0.3, 0.4],
[0.9, 0.0, 0.1, 0.2],
[0.5, 0.6, 0.7, 0.8],
[0.9, 0.0, 0.1, 0.2]]
So it's the same result, just obtained in a hopefully faster way.
The Embedding layer does have limitations:
The input needs to be integers in [0, vocab_length).
No bias.
No activation.
However, none of those limitations should matter if you just want to convert an integer-encoded word into an embedding.
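As a quick sanity check, here is a minimal sketch (assuming tf.keras) that gives both layers the same weight matrix and confirms they produce identical outputs:

import numpy as np
import tensorflow as tf

w = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 0.0, 0.1, 0.2]], dtype=np.float32)
sentence = np.array([[0, 2, 1, 2]])  # a batch of one sentence

# Embedding looks up rows of w directly; Dense needs one-hot input,
# no bias, and a linear activation to be equivalent.
emb = tf.keras.layers.Embedding(input_dim=3, output_dim=4)
dense = tf.keras.layers.Dense(4, use_bias=False)

one_hot = tf.one_hot(sentence, depth=3)
_ = emb(sentence)    # build the layers so we can set their weights
_ = dense(one_hot)
emb.set_weights([w])
dense.set_weights([w])

print(emb(sentence).numpy())   # rows w[0], w[2], w[1], w[2]
print(dense(one_hot).numpy())  # identical result via matrix multiplication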
Mathematically, the difference is this:
An embedding layer performs a select (lookup) operation. In Keras, this layer is equivalent to:
K.gather(self.embeddings, inputs) # just one matrix
A dense layer performs a dot-product operation, plus an optional activation:
outputs = matmul(inputs, self.kernel) # a kernel matrix
outputs = bias_add(outputs, self.bias) # a bias vector
return self.activation(outputs) # an activation function
You can emulate an embedding layer with a fully-connected layer via one-hot encoding, but the whole point of dense embeddings is to avoid the one-hot representation. In NLP, the vocabulary size can be on the order of 100k (sometimes even a million). On top of that, sequences of words often need to be processed in batches, and processing a batch of sequences of word indices is much more efficient than processing a batch of sequences of one-hot vectors. In addition, the gather operation itself is faster than a matrix dot-product, in both the forward and backward pass.
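A small NumPy sketch of this difference (illustrative only; the vocabulary is kept tiny so the one-hot tensor fits in memory, which it would not at 100k):

import numpy as np

vocab, dim = 1000, 64
w = np.random.rand(vocab, dim).astype(np.float32)  # embedding matrix
idx = np.random.randint(0, vocab, size=(32, 50))   # batch of 32 index sequences

gathered = w[idx]                               # select: shape (32, 50, 64)
one_hot = np.eye(vocab, dtype=np.float32)[idx]  # (32, 50, 1000) -- wasteful
via_matmul = one_hot @ w                        # dot-product: (32, 50, 64)

assert np.allclose(gathered, via_matmul)        # same values either way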
Here I want to expand on the top-voted answer with a few more details:
When we use an embedding layer, it is generally to reduce sparse one-hot input vectors to denser representations.
An embedding layer is much like a table lookup: when the table is small, the lookup is fast.
When the table is large, the lookup is much slower; in that case, in practice a dense layer is used as a dimensionality reducer on the one-hot input instead of an embedding layer.
I have simulation data that contains velocity values (x, y, and z components), as well as density and energy values, for each voxel. Is it possible to obtain the optical flow from the velocities, to serve as the ground truth?
I tried the following:
import numpy as np
import cv2

# Load the raw velocity field and reshape into a 128^3 volume of 3-component vectors
data = np.fromfile('Velocity/ns_1000_v.dat', dtype=np.float32)
data = np.reshape(data, (128, 128, 128, 3))  # 3D volumes

slice_num = 5  # let's pick a slice
vx = data[:, :, slice_num, 0]
vy = data[:, :, slice_num, 1]

# Standard HSV flow visualization: hue encodes direction, value encodes magnitude
hsv = np.zeros((128, 128, 3), dtype=np.uint8)
hsv[..., 1] = 255
mag, ang = cv2.cartToPolar(vx, vy)
hsv[..., 0] = ang * 180 / np.pi / 2
hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
bgr = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
This does not show the (2D) flow I expected to see.
I already calculated the optical flow from the density values (in an unsupervised setup) and now want to compare it with the ground truth. An example of the computed flow looks like this:
[Image: computed unsupervised flow]
Thank you!
After I downsample a vector by a constant decimation factor, I want to upsample the vector back to the original sample rate (after performing some analyses). However, I am struggling with the upsampling.
For the downsampling I apply vDSP_desamp from the Accelerate framework, and for the upsampling I tried to apply vDSP_vlint:
#include <stdlib.h>
#include <string.h>
#include <Accelerate/Accelerate.h>

// Create some test data for the input vector
float inputData[10] = {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9};
int inputLength = 10;
int decimationFactor = 2; // Downsample by factor 2
int downSampledLength = inputLength/decimationFactor;
// Allocate downsampled output vector
float* downSampledData = malloc(downSampledLength*sizeof(float));
// Create filter (average samples)
float* filter = malloc(decimationFactor*sizeof(float));
for (int i = 0; i < decimationFactor; ++i){
filter[i] = 1.0/decimationFactor;
}
// Downsample and average
vDSP_desamp(inputData,
(vDSP_Stride) decimationFactor,
filter,
downSampledData,
(vDSP_Length) downSampledLength, // Downsample to 5 samples
(vDSP_Length) decimationFactor );
free(filter);
The output of downSampledData using this code is:
0.05, 0.25, 0.45, 0.65, 0.85
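For reference, here is the same averaging decimation expressed in NumPy (just a sketch to show what the 2-tap filter is doing, not Accelerate code):

import numpy as np

inputData = np.arange(10) * 0.1           # 0.0, 0.1, ..., 0.9
# A 2-tap averaging filter applied at stride 2 is the mean of each pair:
down = inputData.reshape(-1, 2).mean(axis=1)
print(down)                               # [0.05 0.25 0.45 0.65 0.85]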
To upsample the (processed) data vector back to the original sample rate I use the following code:
// For this example, downSampledData is just copied to processedData ...
float* processedData = malloc(downSampledLength*sizeof(float));
memcpy(processedData, downSampledData, downSampledLength*sizeof(float));
// Create vector used by vDSP_vlint to indicate interpolation constants.
float* b = malloc(downSampledLength*sizeof(float));
for (int i = 0; i < downSampledLength; i++) {
b[i] = i + 0.5;
}
// Allocate data vector for upsampled data
float* upSampledData = malloc(inputLength*sizeof(float));
// Upsample and interpolate
vDSP_vlint (processedData,
b,
1,
upSampledData,
1,
(vDSP_Length) inputLength, // Resample back to 10 samples
(vDSP_Length) downSampledLength);
However, the output of upSampledData is
0.15, 0.35, 0.55, 0.75, 0.43, 0.05, 0.05, 0.05, 0.08, 0.12
which is apparently not correct. How should I apply vDSP_vlint? Or should I use other functions to upsample the data?
Let's take this example:
We use X and Y data corresponding to the known polynomial f(x) = 0.25 - x + x^2. Using POLY_FIT to compute a second-degree polynomial fit returns the exact coefficients (to within machine accuracy).
; Define an 11-element vector of independent variable data:
X = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
; Define an 11-element vector of dependent variable data:
Y = [0.25, 0.16, 0.09, 0.04, 0.01, 0.00, 0.01, 0.04, 0.09, $
0.16, 0.25]
; Define a vector of measurement errors:
measure_errors = REPLICATE(0.01, 11)
; Compute the second degree polynomial fit to the data:
result = POLY_FIT(X, Y, 2, MEASURE_ERRORS=measure_errors, $
SIGMA=sigma)
; Print the coefficients:
PRINT, 'Coefficients: ', result
PRINT, 'Standard errors: ', sigma
This example prints:
Coefficients: 0.250000 -1.00000 1.00000
Standard errors: 0.00761853 0.0354459 0.0341395
That works as expected, but let's say I already know the coefficient 0.25. How can I pass that fixed coefficient to POLY_FIT? Or do I have to use some other fitting function?
source: http://www.exelisvis.com/docs/POLY_FIT.html
Have you looked at MPFIT? It has keywords for passing information about the parameters; see the docs.
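If stepping outside IDL is acceptable, the underlying idea, holding the known coefficient fixed and fitting only the remaining ones, can be illustrated with SciPy (a hypothetical sketch, not IDL code):

import numpy as np
from scipy.optimize import curve_fit

x = np.linspace(0.0, 1.0, 11)
y = 0.25 - x + x**2

# The constant term is known to be 0.25, so fit only b and c.
def model(x, b, c):
    return 0.25 + b * x + c * x**2

coeffs, cov = curve_fit(model, x, y,
                        sigma=np.full(11, 0.01), absolute_sigma=True)
print(coeffs)                 # approximately [-1.0, 1.0]
print(np.sqrt(np.diag(cov)))  # standard errors of b and c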
How should the cosine similarity (http://en.wikipedia.org/wiki/Cosine_similarity) be expressed when one of the vectors is all zeros?
v1 = [1, 1, 1, 1, 1]
v2 = [0, 0, 0, 0, 0]
When we calculate it according to the classic formula, we get division by zero:
Let d1 = (0, 0, 0, 0, 0, 0)
Let d2 = (1, 1, 1, 1, 1, 1)

Cosine Similarity(d1, d2) = dot(d1, d2) / (||d1|| * ||d2||)

dot(d1, d2) = 0*1 + 0*1 + 0*1 + 0*1 + 0*1 + 0*1 = 0
||d1|| = sqrt(0^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2) = 0
||d2|| = sqrt(1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2) = 2.44948974278

Cosine Similarity(d1, d2) = 0 / (0 * 2.44948974278) = 0 / 0
I want to use this similarity measure in a clustering application, and I will often need to compare such vectors, including [0, 0, 0, 0, 0] vs. [0, 0, 0, 0, 0]. Does anyone have any experience with this?
Since this is a similarity (not a distance) measure, should I special-case it, e.g.

sim([1, 1, 1, 1, 1], [0, 0, 0, 0, 0]) = 0
sim([0, 0, 0, 0, 0], [0, 0, 0, 0, 0]) = 1

And what about

sim([1, 1, 1, 0, 0], [0, 0, 0, 0, 0]) = ? etc.
If you have zero vectors, cosine is the wrong similarity function for your application.
Cosine distance is essentially equivalent to squared Euclidean distance on L2-normalized data. That is, you normalize every vector to unit length, then compute the squared Euclidean distance.
The other benefit of cosine is performance: computing it on very sparse, high-dimensional data is faster than Euclidean distance. It benefits from sparsity quadratically, not just linearly.
While you obviously can try to hack the similarity to be 0 when exactly one vector is zero, and maximal when the two are identical, it won't really solve the underlying problems.
Don't choose the distance by what you can easily compute. Instead, choose the distance such that the result has a meaning for your data. If the value is undefined, you don't have a meaning.
Sometimes it may work to discard constant-zero data as meaningless anyway (e.g., when analyzing Twitter noise and seeing a tweet that is all numbers, no words). Sometimes it doesn't.
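For completeness, here is a minimal sketch of that special-casing (with all of the caveats above; treating two zero vectors as identical is a modeling convention, not a mathematical fact):

import numpy as np

def cosine_similarity(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 and nb == 0.0:
        return 1.0  # convention: two zero vectors count as identical
    if na == 0.0 or nb == 0.0:
        return 0.0  # convention: a zero vector shares no direction with anything
    return float(np.dot(a, b) / (na * nb))

print(cosine_similarity([1, 1, 1, 1, 1], [0, 0, 0, 0, 0]))  # 0.0
print(cosine_similarity([0, 0, 0, 0, 0], [0, 0, 0, 0, 0]))  # 1.0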
It is undefined.
Suppose you have a vector C that is not zero in place of your zero vector, multiply it by epsilon > 0, and let epsilon tend to zero. The cosine is the same for every epsilon > 0 (scaling does not change the angle), so the limit depends on C; hence the function is not continuous, and has no continuous extension, when one of the vectors is zero.
Suppose:
The signal S is a random variable which takes the values {0, 1, 2, 3} with probabilities {0.3, 0.2, 0.1, 0.4}.
The noise N is a random variable which takes the values {-2, -1, 0, 1, 2} with probabilities {0.1, 0.1, 0.6, 0.1, 0.1}.
The SNR is given by (Power of S)/(Power of N) = ((Amplitude of S)/(Amplitude of N))^2.
My question is: how is the amplitude of a random variable computed? Is it:
The root-mean-square (RMS) amplitude?
The variance?
It's E[S^2]/E[N^2], where E is the expectation operator.
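A quick sketch of that computation for the distributions given above:

import numpy as np

s_vals = np.array([0, 1, 2, 3]);      s_prob = np.array([0.3, 0.2, 0.1, 0.4])
n_vals = np.array([-2, -1, 0, 1, 2]); n_prob = np.array([0.1, 0.1, 0.6, 0.1, 0.1])

E_S2 = np.sum(s_prob * s_vals**2)  # 0.2 + 0.4 + 3.6 = 4.2
E_N2 = np.sum(n_prob * n_vals**2)  # 0.4 + 0.1 + 0.1 + 0.4 = 1.0
print(E_S2 / E_N2)                 # SNR = 4.2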