I am reading about k nearest neighbour, and the distance measure given in the example is as below.
It says Ri is the range of the i-th component. I am confused about which distance measure is used here? I understand Euclidean Distance but this doesn't seem to be it. Could you help to explain what "range of the i-th component" is and which distance measure is this? Many thanks. Please let me know if more information is needed.
Range is the difference between the max and min of that feature (column) in the training dataset.
You can think of the range as an L1-style quantity, since it is just the absolute distance between the max and min. Dividing by it is commonly done to normalize the distance calculation across features, so that no single feature dominates the distance.
The formula given is just for the Euclidean Distance, except that the normalization of data is done in place when calculating the distance.
Normalization of data is necessary for KNN because if not done then the features with higher values will be dominant in deciding the output.
The above formula for KNN omits the explicit step of normalization and does it in place while calculating the distance.
NOTE:- Here, i denotes the ith column and not row.
Here is the actual explanation of the formula:
R_i = x_i,max - x_i,min
During normalization, we transform each value of column i using the following transformation:
x_i = (x_i - x_i,min) / (x_i,max - x_i,min)
So, when computing the distance between the normalized points, the formula is effectively
d^2 = ((a_1 - x_1,min) - (b_1 - x_1,min))^2 / R_1^2 + ((a_2 - x_2,min) - (b_2 - x_2,min))^2 / R_2^2 + ... + ((a_n - x_n,min) - (b_n - x_n,min))^2 / R_n^2
in which the minimum terms cancel, leaving
d^2 = (a_1 - b_1)^2 / R_1^2 + (a_2 - b_2)^2 / R_2^2 + ... + (a_n - b_n)^2 / R_n^2
which is the formula shown in the image above.
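The range-normalized distance can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function names and the toy training data are made up for the example, and it assumes every feature has a nonzero range (a constant column would cause a division by zero).

```python
import math

def feature_ranges(rows):
    """Return R_i = max - min for each column of the training rows."""
    columns = zip(*rows)
    return [max(col) - min(col) for col in columns]

def range_normalized_distance(a, b, ranges):
    """Euclidean distance where each difference is divided by R_i first."""
    return math.sqrt(sum(((ai - bi) / r) ** 2 for ai, bi, r in zip(a, b, ranges)))

# Hypothetical training data: column 0 is small-scale, column 1 is large-scale.
train = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]]
ranges = feature_ranges(train)  # R_1 = 2.0, R_2 = 4000.0
d = range_normalized_distance(train[0], train[2], ranges)
print(ranges, d)
```

Both features contribute equally to `d` here, although the raw difference in the second column is 2000 times larger than in the first.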
The triplet loss is defined as follows:
L(A, P, N) = max(‖f(A) - f(P)‖² - ‖f(A) - f(N)‖² + margin, 0)
where A=anchor, P=positive, and N=negative are the data samples in the loss, and margin is the minimum amount by which the anchor-negative distance must exceed the anchor-positive distance.
I read somewhere that (1 - cosine_similarity) may be used instead of the L2 distance.
Note that I am using Tensorflow, where the cosine similarity loss is defined as a number between -1 and 1: 0 indicates orthogonality, values closer to -1 indicate greater similarity, and values closer to 1 indicate greater dissimilarity. So it is the negative of the cosine similarity metric.
Any suggestions on how to write my triplet loss with cosine similarity?
Edit
All good stuff in the comments and answers. Based on all the hints, this is working OK for me:
self.margin = 1
self.loss = tf.keras.losses.CosineSimilarity(axis=1)
ap_distance = self.loss(anchor, positive)
an_distance = self.loss(anchor, negative)
loss = tf.maximum(ap_distance - an_distance + self.margin, 0.0)
I would like to eventually use the tensorflow addon loss as @pygeek pointed out but I haven't figured out how to pass the data yet.
Note
To use it standalone - one must do something like this:
cosine_similarity = tf.keras.metrics.CosineSimilarity()
cosine_similarity.reset_state()
cosine_similarity.update_state(anch_prediction, other_prediction)
similarity = cosine_similarity.result().numpy()
Resources
pytorch cosine embedding layer
tensorflow cosine similarity implementation
tensorflow triplet loss hard/soft margin
First of all, Cosine_distance = 1 - cosine_similarity. The distance and similarity are different. This is not correctly mentioned in some of the answers!
Secondly, you should look at the TensorFlow code on how the cosine similarity loss is implemented https://github.com/keras-team/keras/blob/v2.9.0/keras/losses.py#L2202-L2272, which is different from PyTorch!!
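Finally, I suggest you use an existing loss: you should replace the || ... ||^2 terms with tf.losses.cosineDistance(...) (that is the TensorFlow.js spelling; in Python TensorFlow the corresponding function is tf.compat.v1.losses.cosine_distance).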
I am guessing that what you read about replacing L2 with cosine originates from the definition of the cosine between two vectors:
cos(f(A), f(P)) = f(A) * f(P)/(‖f(A)‖*‖f(P)‖)
where dot product along the feature dimension is implied in the above. Next, note that
2*[1 - cos(f(A), f(P))]*‖f(A)‖*‖f(P)‖ = ‖f(A) - f(P)‖² - (‖f(A)‖ - ‖f(P)‖)²
which gives a hint on where the notion comes from when ‖f(A)‖ = ‖f(P)‖. So your formula can be naturally changed to
L(A, P, N) = max(cos(f(A), f(N)) - cos(f(A), f(P)) + margin, 0)
Your margin parameter should be adjusted accordingly. Here is some Tensorflow code to compute the cosines for vectors
import tensorflow as tf
def cos(A, B):
    # Cosine similarity along the last (feature) axis.
    return tf.reduce_sum(A*B, axis=-1)/tf.norm(A, axis=-1)/tf.norm(B, axis=-1)
Whether this loss would benefit you depends on your particular problem, so good luck with your experiments.
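To make the cosine form of the triplet loss concrete, here is a framework-free sketch in plain Python for a single triplet. The function names, the margin value of 0.5, and the example vectors are illustrative assumptions, not part of TensorFlow or of the question's code. A batched TensorFlow version would apply the same formula per sample before averaging.

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_triplet_loss(anchor, positive, negative, margin=0.5):
    """max(cos(A, N) - cos(A, P) + margin, 0) for one triplet of embeddings."""
    return max(cos(anchor, negative) - cos(anchor, positive) + margin, 0.0)

anchor = [1.0, 0.0]
near = [1.0, 0.1]   # almost the same direction as the anchor
far = [0.0, 1.0]    # orthogonal to the anchor

easy_loss = cosine_triplet_loss(anchor, positive=near, negative=far)
hard_loss = cosine_triplet_loss(anchor, positive=far, negative=near)
print(easy_loss, hard_loss)
```

The first triplet is already well separated, so its loss is zero. The second has the roles swapped and is penalized by the similarity gap plus the margin.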
Given a classification problem in Machine Learning the hypothesis is described as below.
hθ(x)=g(θ'x)
z = θ'x
g(z) = 1 / (1+e^−z)
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
hθ(x)≥0.5→y=1
hθ(x)<0.5→y=0
The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:
g(z) ≥ 0.5
when z ≥ 0
Remember:
z = 0, e^0 = 1 ⇒ g(z) = 1/2
z → ∞, e^−∞ → 0 ⇒ g(z) → 1
z → −∞, e^∞ → ∞ ⇒ g(z) → 0
So if our input to g is θ^T x, then that means:
hθ(x) = g(θ^T x) ≥ 0.5
when θ^T x ≥ 0
From these statements we can now say:
θ'x≥0⇒y=1
θ'x<0⇒y=0
The decision boundary is the line that separates the area where y = 0 from the area where y = 1, and it is created by our hypothesis function.
What part of this relates to the Decision Boundary? Or where does the Decision Boundary algorithm come from?
This is basic logistic regression with a threshold. Your theta' * x is just the vector notation for your weight vector multiplied by your input. If you put that into the logistic function, which outputs a value strictly between 0 and 1, you then threshold that value at 0.5. If it is equal to or above 0.5, you treat the input as a positive sample, and as a negative one otherwise.
The classification algorithm is just that simple. The training is a bit more complicated: its goal is to find a weight vector theta that correctly classifies all your labeled data, or at least as much of it as possible. The way to do this is to minimize a cost function which measures the difference between the output of your function and the expected label. You can do this using gradient descent. I guess Andrew Ng is teaching this.
Edit: Your classification algorithm is g(theta'x)>=0.5 and g(theta'x)<0.5, so a basic step function.
Courtesy of other posters on a different tech forum.
Solving for theta'*x >= 0 and theta'*x<0 gives the decision boundary. The RHS of the inequality ( i.e. 0) comes from the sigmoid function.
Theta gives you the hypothesis that best fits the training set.
From theta, you can compute the decision boundary - it is the locus of points where (X * theta) = 0, or equivalently where g(X * theta) = 0.5.
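As a small illustration of how the boundary falls out of theta, here is a sketch in plain Python. The weights are made up for the example, and inputs are assumed to have the form x = [1, x1, x2] with a leading 1 for the intercept. With two features, the boundary theta'x = 0 is a straight line that can be solved for x2 (assuming the last weight is nonzero).

```python
import math

def g(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """Classify as 1 when theta'x >= 0, which is the same as g(theta'x) >= 0.5."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

def boundary_x2(theta, x1):
    """Solve theta0 + theta1*x1 + theta2*x2 = 0 for x2 (assumes theta2 != 0)."""
    return -(theta[0] + theta[1] * x1) / theta[2]

theta = [-3.0, 1.0, 1.0]             # hypothetical weights: boundary is x1 + x2 = 3
print(g(0.0))                        # exactly on the boundary the hypothesis is 0.5
print(predict(theta, [1.0, 2.0, 2.0]))   # x1 + x2 = 4, above the line
print(predict(theta, [1.0, 1.0, 1.0]))   # x1 + x2 = 2, below the line
print(boundary_x2(theta, 1.0))       # the boundary point with x1 = 1
```

Nothing about the boundary is learned separately: it is entirely determined by theta and by the choice of 0.5 as the threshold.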
I am doing an image classification project and I have built a corpus of features.
I want to normalize my features to between -1 and 1 as input for PyBrain. I am using the following formula to normalize the features:
Normalized value = (Value - Mean ) / Standard Deviation
but it is giving me normalized values between -3 and 3, which is not the range I need.
I have 100 inputs and 1 output in PyBrain.
The equation you used is that of standardization. It does not guarantee your values are in -1;1 but it rescales your data to have a mean of 0, and a standard deviation of 1 afterwards. But points can be more than 1x the standard deviation from the mean.
There are multiple options to bound your data.
Use a nonlinear function such as tanh (very popular in neural networks)
center, then rescale with 1/max(abs(dev))
preserve 0, then rescale with 1/max(abs(dev))
2*(x-min)/(max-min) - 1
standardize (as you did) but truncate values to -1;+1
... many more
If your dataset contains only positive values, you can normalize them using this formula:
Normalized value = (Value / (0.5*Max_Value) )-1;
This is going to give you values in the range [-1,+1].
If you have both positive and negative values:
Normalized value = ((Value - Min_Value)/(Max_Value-Min_Value)-0.5)*2
Maybe you can do this:
Mid_value = ( Max_value + Min_Value )/2
Max_difference = ( Max_value - Min_Value )/2;
Normalized_value = ( Value - Mid_value )/Max_difference;
The Normalized_value shall be within [-1,+1].
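The difference between standardization and min-max rescaling to [-1, +1] can be checked with a short sketch in plain Python. The data values and function names are made up for the example, and the rescaling assumes the maximum and minimum differ.

```python
import statistics

def standardize(values):
    """(value - mean) / standard deviation: mean 0, std 1, but not bounded."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def scale_to_minus_one_one(values):
    """(value - mid) / half_range: always inside [-1, +1] (assumes max != min)."""
    lo, hi = min(values), max(values)
    mid = (hi + lo) / 2
    half_range = (hi - lo) / 2
    return [(v - mid) / half_range for v in values]

data = [1.0, 2.0, 3.0, 4.0, 50.0]   # hypothetical feature with one large value
z_scores = standardize(data)
scaled = scale_to_minus_one_one(data)
print(max(z_scores))                # larger than 1
print(min(scaled), max(scaled))     # the minimum maps to -1, the maximum to +1
```

Note that min-max scaling is sensitive to outliers: in this example the four small values all end up close to -1.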
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec
On the spark implementation of word2vec, when the number of iterations or data partitions are greater than one, for some reason, the cosine similarity is greater than 1.
As far as I know, cosine similarity should always satisfy -1 ≤ cos ≤ 1. Does anyone know why?
In findSynonyms method of word2vec, it does not calculate cosine similarity v1・vi / |v1| |vi|, instead it calculates v1・vi / |vi|, where v1 is the vector of the query word and vi is the vector of the candidate words.
That's why the value sometimes exceeds 1.
Just to find closer words, it is not necessary to divide by |v1| because it is constant.
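The effect described above can be reproduced with a few lines of plain Python. This is a sketch of the formula as described in this answer, not Spark's actual code. The vectors and word labels are made up.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def cosine(query, vec):
    """True cosine similarity: v1.vi / (|v1| |vi|)."""
    return dot(query, vec) / (norm(query) * norm(vec))

def partial_score(query, vec):
    """Score without the division by |v1|: v1.vi / |vi|."""
    return dot(query, vec) / norm(vec)

query = [3.0, 4.0]  # |v1| = 5
candidates = {"a": [6.0, 8.0], "b": [1.0, 0.0], "c": [0.0, -2.0]}

by_partial = sorted(candidates, key=lambda w: partial_score(query, candidates[w]), reverse=True)
by_cosine = sorted(candidates, key=lambda w: cosine(query, candidates[w]), reverse=True)
print(partial_score(query, candidates["a"]), cosine(query, candidates["a"]))
print(by_partial, by_cosine)
```

The partial score for "a" is 5 rather than 1, yet both scores rank the candidates in the same order, because they differ only by the constant factor |v1|.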
Does anyone have an idea how to calculate the inverse of a 2-D filter?
Let's say I have a 3x3 filter:
0 1 0
1 1 1
0 1 0
I want to find its inverse.
It's easy to do using DFT.
But let's say I want to do it by convolution.
Now, that's the problem, Matlab symbolic isn't my specialty.
Assuming there's a 3x3 inverse filter, it means the convolution of the two will result in:
0 0 0
0 1 0
0 0 0
The problem is to create the right set of equations for that and solve it.
Doing it with symbols is easy to think about, yet I couldn't do it.
Any ideas?
Thanks.
P.S.
I'm not sure there is an inverse filter for this one, as it has zeros in its DTFT.
Moreover, someone should allow Latex in this forum the way MathOverflow has.
Let h[n] be the finite impulse response of a 1D filter. What does that imply about its inverse filter? The inverse cannot possibly be FIR.
Proof: Let H(omega) G(omega) = 1 for all omega, where H is the DTFT of h[n], and G is the DTFT of g[n]. If h[n] is FIR, then g[n] must be IIR.
Of course, there are ways to approximate the inverse IIR filter with an FIR filter. A basic method is adaptive filtering, e.g., Least Mean Squares (LMS) algorithm. Or just truncate the IIR filter. You still need to worry about stability, though.
For practical purposes, there probably is no desirable solution to your specific question. Especially if, for example, this is in image processing and you are trying to inverse an FIR blur filter with FIR sharpening filter. The final image will not look that good, period, unless your sharpening filter is really, really large.
EDIT: Let y[n] = b0 x[n-0] + b1 x[n-1] + ... + bN x[n-N]. Let this equation characterize the forward system, where y is the output and x is the input. By definition, the impulse response is the output when the input is an impulse: h[n] = b0 d[n-0] + b1 d[n-1] + ... + bN d[n-N]. This impulse response has finite length N+1.
Now, consider the inverse system where x is the output and y is the input. Then the impulse response is described by the recurrence equation d[n] = b0 h[n] + b1 h[n-1] + ... + bN h[n-N]. Equivalently, b0 h[n] = d[n] - b1 h[n-1] - ... - bN h[n-N].
Without loss of generality, assume that b0 and bN are both nonzero. For any m, if h[m] is nonzero, then h[m+N] is also nonzero. Because this system has feedback, its impulse response is infinitely long. QED.
Causality does not matter. The inverse of a delay is an advance, and vice versa. Neither a delay nor an advance alters the finiteness of an impulse response. Shift an infinite impulse response left or right; it is still infinite.
EDIT 2: For clarification, this proof is not related to my original "proof". One was in frequency domain, the other in time domain.
In practice, one useful solution is Wiener deconvolution http://en.wikipedia.org/wiki/Wiener_deconvolution which basically, in Fourier space, divides by the spectrum of the given filter. Zeros are handled by adding a fudge term: instead of 1/H(w), use conj(H(w)) / (abs(H(w))^2 + c), where H(w) is the discrete Fourier transform of the filter h(x) (add a 2nd dimension as you like) and "w" is supposed to be omega. The constant c is chosen based on the noise level of the signal or image.
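Here is a minimal 1D sketch of that regularized inverse in plain Python, using a naive DFT so it needs no external libraries. It uses the complex conjugate of H in the numerator, which is the standard Wiener form (for a real, symmetric filter H is real and the conjugate changes nothing). The filters, the function names, and the values of c are assumptions chosen for illustration, and the convolution is circular.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for k in range(n))
           for m in range(n)]
    return [v / n for v in out] if inverse else out

def wiener_inverse(h, c):
    """Regularized inverse filter: G = conj(H) / (|H|^2 + c), back in the signal domain."""
    H = dft(h)
    G = [Hk.conjugate() / (abs(Hk) ** 2 + c) for Hk in H]
    return [v.real for v in dft(G, inverse=True)]

def circular_convolve(a, b):
    n = len(a)
    return [sum(a[k] * b[(i - k) % n] for k in range(n)) for i in range(n)]

# A centered [1, 1, 1] filter on 8 samples has no spectral zeros,
# so a tiny c gives a near-exact inverse.
h8 = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
g8 = wiener_inverse(h8, c=1e-9)
delta8 = circular_convolve(h8, g8)

# The same filter on 6 samples has zeros in its DFT; c keeps the result finite.
h6 = [1.0, 1.0, 0.0, 0.0, 0.0, 1.0]
g6 = wiener_inverse(h6, c=1e-3)
print([round(v, 6) for v in delta8])
print([round(v, 3) for v in g6])
```

In the 6-sample case the frequencies where H is zero cannot be recovered at all. The fudge term simply sets the inverse to (nearly) zero there instead of dividing by zero.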
For a separable filter (i.e., one in which you can pull out a horizontal and vertical filter which can be applied in any order and give the same result as the composite 2-D filter), you could attempt to calculate the inverse of each one individually.
The inverse of a filter H(w) is G(w)=1/H(w), so one way to do it is to take the DFT of the impulse response (the h[n] time-domain coefficients), invert it pointwise, and inverse-DFT the result. It's not always easy to get an analytic expression for such a filter, so you could either compute it numerically (approximating to the desired precision) or do adaptive inverse filtering, as Steve suggested. See Widrow and Stearns' Adaptive Signal Processing for more info on this latter method.
One way is to write a matrix representation of convolution with your filter and then try to find some (regularized) inverse of this matrix.
For example the 1D [1,-1] filter is maybe the simplest discrete differential approximation that exists.
Its matrix will have two diagonals: 1 on the main diagonal and -1 on the first off-diagonal.
Its inverse will be an integral filter: an IIR filter with an infinite trail of 1s. Since our matrix has limited size, this will in practice mean filling up the matrix with 1s everywhere on one side of the diagonal.
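The [1, -1] example can be checked numerically with the recurrence b0 h[n] = d[n] - b1 h[n-1] - ... - bN h[n-N] for the causal inverse. The sketch below is plain Python with illustrative function names, and it assumes b[0] is nonzero.

```python
def inverse_impulse_response(b, length):
    """First `length` samples of the causal inverse of the FIR filter b (b[0] != 0)."""
    h = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0        # d[n], the unit impulse
        for k in range(1, len(b)):
            if n - k >= 0:
                acc -= b[k] * h[n - k]      # feedback from earlier outputs
        h.append(acc / b[0])
    return h

def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

h = inverse_impulse_response([1.0, -1.0], 8)
check = convolve([1.0, -1.0], h)
print(h)      # a run of ones that never decays: the integrator
print(check)  # an impulse, plus a -1 at the end where the inverse was truncated
```

The trailing -1 in `check` is the price of cutting the infinite response off after 8 samples. A longer truncation only moves that error further out, it never removes it.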
Deriving the Inverse Kernel of a Given 2D Convolution Kernel
This is basically a generalization of the question - Deriving the Inverse Filter of Image Convolution Kernel.
Problem Formulation
Given a Convolution Kernel $ f \in \mathbb{R}^{m \times n} $ find its inverse kernel, $ g \in \mathbb{R}^{p \times q} $ such that $ f \ast g = h = \delta $.
Solution I
One could build the Matrix Form of the Convolution Operator.
The matrix form can replicate the various modes of image filtering (Exclusive):
Boundary Conditions (applied by padding the image and then applying convolution in Valid mode):
  Zero Padding: padding the image with zeros.
  Circular: using circular / periodic continuation of the image. Matches frequency-domain convolution.
  Replicate: replicating the edge values (Nearest Neighbor). Usually creates the fewest artifacts in real-world images.
  Symmetric: mirroring the image along the edges.
Convolution Shape:
  Full: the output size is the full extent, $ \left( m + p - 1 \right) \times \left( n + q - 1 \right) $.
  Same: the output size is the same as the input size ("Image").
  Valid: the output size is the size of full overlap of the image and kernel.
When using Image Filtering, the output size matches the input size, hence the Matrix Form is square and the inverse is defined.
When using a Convolution Matrix, the matrix might not be square (unless "Same" is chosen), hence the Pseudo Inverse should be derived.
One should notice that while the input matrix is sparse, the inverse matrix isn't.
Moreover, while the Convolution Matrix has a special form (Toeplitz, neglecting the Boundary Conditions), the inverse won't.
Hence this solution is more accurate than the next one, yet it also uses more degrees of freedom (the solution isn't necessarily in Toeplitz form).
Solution II
In this solution the inverse is derived using minimization of the following cost function:
$$ \arg \min_{g} \frac{1}{2} {\left\| f \ast g - h \right\|}_{2}^{2} $$
The derivative is given by:
$$ \frac{\partial \frac{1}{2} {\left\| f \ast g - h \right\|}_{2}^{2} }{\partial g} = f \star \left( f \ast g - h \right) $$
Where $ \star $ is the Correlation operation.
In practice, the convolution in the Objective Function is done in full mode (MATLAB idiom):
hObjFun = @(mG) 0.5 * sum((conv2(mF, mG, 'full') - mH) .^ 2, 'all');
mObjFunGrad = conv2(conv2(mF, mG, 'full') - mH, mF(end:-1:1, end:-1:1), 'valid');
Here the code flips mF for Correlation and uses valid mode (in matrix form this is the Adjoint / Transpose, so the output is smaller).
The optimization problem is Strictly Convex and can be solved easily using Gradient Descent:
for ii = 1:numIterations
    mObjFunGrad = conv2(conv2(mF, mG, 'full') - mH, mF(end:-1:1, end:-1:1), 'valid');
    mG = mG - (stepSize * mObjFunGrad);
end
Full code is available in my StackExchange Code StackOverflow Q2080835 GitHub Repository (Have a look at StackOverflow\Q2080835 folder).
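For readers without MATLAB, the same gradient descent can be sketched as a 1D analogue in plain Python. This is not the repository code: the kernel, the step size, the iteration count, and the function names are assumptions chosen so the example converges, and a 2D version would replace the two helpers with 2D convolution and correlation.

```python
def conv_full(a, b):
    """Full-mode 1D convolution (output length len(a) + len(b) - 1)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def corr_valid(r, f):
    """Valid-mode correlation of r with f: out[j] = sum_i f[i] * r[i + j]."""
    return [sum(f[i] * r[i + j] for i in range(len(f)))
            for j in range(len(r) - len(f) + 1)]

def objective(f, g, h):
    residual = [y - t for y, t in zip(conv_full(f, g), h)]
    return 0.5 * sum(v * v for v in residual)

def fit_inverse(f, h, g_len, step, iterations):
    """Gradient descent on 0.5 * ||f * g - h||^2 with gradient f (corr) (f * g - h)."""
    g = [0.0] * g_len
    for _ in range(iterations):
        residual = [y - t for y, t in zip(conv_full(f, g), h)]
        grad = corr_valid(residual, f)
        g = [gi - step * di for gi, di in zip(g, grad)]
    return g

f = [1.0, 0.5]                  # hypothetical kernel to invert
h = [1.0] + [0.0] * 8           # target: a unit impulse of length len(f) + 8 - 1
g = fit_inverse(f, h, g_len=8, step=0.4, iterations=500)
print([round(v, 3) for v in g])  # close to 1, -0.5, 0.25, ... (the truncated exact inverse)
print(objective(f, g, h))
```

The step size must stay below 2 divided by the largest eigenvalue of the convolution operator's Gram matrix. For this kernel that eigenvalue is at most (1 + 0.5)^2 = 2.25, so 0.4 is safe.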