Is it possible to get negative information gain if Laplace smoothing is used too?
We know:
IG = H(Y) - H(Y|X)
Here, H is the entropy function and IG is the information gain.
H(Y) = -Σ_y P(Y=y) · log2 P(Y=y)
H(Y|X) = Σ_x P(X=x) · H(Y|X=x)
H(Y|X=x) = -Σ_y P(Y=y|X=x) · log2 P(Y=y|X=x)
For example, suppose P(Y=y|X=x) = n_{y|x}/n_x. But it is possible that n_x = 0 and n_{y|x} = 0. So I apply Laplace smoothing and define P(Y=y|X=x) = (n_{y|x}+1)/(n_x+|X|). Here, |X| denotes the number of possible values that X can take (the number of splits possible if X is chosen as the attribute). Is it possible that, due to Laplace smoothing, I get negative information gain?
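To make the formulas above concrete, here is a minimal sketch in plain Python that computes IG from a table of counts. Everything in it is an assumption made for illustration: the function names, the `counts[x][y]` layout, the `alpha` parameter, and in particular the smoothing denominator `n_x + alpha * |Y|` (the usual additive-smoothing form, chosen so the smoothed probabilities sum to 1, which differs from the `n_x + |X|` written in the question). H(Y) is estimated without smoothing; only the conditionals are smoothed.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(counts, alpha=0.0):
    """IG = H(Y) - H(Y|X), where counts[x][y] is the number of samples with X=x, Y=y.

    alpha is a hypothetical additive-smoothing constant applied only to
    P(Y=y|X=x); alpha=0 gives the plain frequency estimate.
    """
    n = sum(sum(row) for row in counts)
    num_classes = len(counts[0])

    # H(Y) from the unsmoothed class frequencies.
    class_totals = [sum(row[y] for row in counts) for y in range(num_classes)]
    h_y = entropy([c / n for c in class_totals])

    # H(Y|X) = sum over x of P(X=x) * H(Y|X=x), with smoothed conditionals.
    h_y_given_x = 0.0
    for row in counts:
        n_x = sum(row)
        denom = n_x + alpha * num_classes
        if denom == 0:
            continue  # empty branch and no smoothing: nothing to estimate
        conditional = [(c + alpha) / denom for c in row]
        h_y_given_x += (n_x / n) * entropy(conditional)

    return h_y - h_y_given_x


# Every sample has Y=0, so H(Y) = 0, but smoothing makes each branch look uncertain.
pure_counts = [[1, 0], [1, 0]]
ig_plain = information_gain(pure_counts, alpha=0.0)
ig_smoothed = information_gain(pure_counts, alpha=1.0)
```

Under these particular assumptions `ig_smoothed` comes out below zero, because the smoothed conditional entropy is larger than the unsmoothed H(Y). If both terms were estimated with the same smoothing, the outcome could differ.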
The triplet loss is defined as follows:
L(A, P, N) = max(‖f(A) - f(P)‖² - ‖f(A) - f(N)‖² + margin, 0)
where A (anchor), P (positive), and N (negative) are the data samples in the loss, and margin is the minimum amount by which the anchor-negative distance must exceed the anchor-positive distance.
I read somewhere that (1 - cosine_similarity) may be used instead of the L2 distance.
Note that I am using TensorFlow, where the cosine similarity loss is defined as a number between -1 and 1: 0 indicates orthogonality, values closer to -1 indicate greater similarity, and values closer to 1 indicate greater dissimilarity. So it is the opposite of the cosine similarity metric.
Any suggestions on how to write my triplet loss with cosine similarity?
All good stuff in the comments and answers. Based on all the hints, this is working OK for me:
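self.margin = 1
# The Keras CosineSimilarity loss returns the negative cosine similarity,
# so lower values mean "more similar", like a distance.
# With the default reduction, each call returns a batch-averaged scalar.
self.loss = tf.keras.losses.CosineSimilarity(axis=1)
ap_distance = self.loss(anchor, positive)
an_distance = self.loss(anchor, negative)
loss = tf.maximum(ap_distance - an_distance + self.margin, 0.0)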
I would like to eventually use the TensorFlow Addons loss, as @pygeek pointed out, but I haven't figured out how to pass the data yet.
To use the cosine similarity metric standalone, one must do something like this:
cosine_similarity = tf.keras.metrics.CosineSimilarity()
cosine_similarity.update_state(anch_prediction, other_prediction)
similarity = cosine_similarity.result().numpy()
pytorch cosine embedding layer
tensorflow cosine similarity implementation
tensorflow triplet loss hard/soft margin
First of all, cosine_distance = 1 - cosine_similarity. Distance and similarity are different things, and some of the answers do not state this correctly!
Secondly, you should look at how the cosine similarity loss is implemented in the TensorFlow code, which is different from PyTorch!
Finally, I suggest you use an existing loss: replace the || ... ||^2 terms with tf.losses.cosine_distance(...) (tf.compat.v1.losses.cosine_distance in TensorFlow 2).
I am guessing that what you read about replacing L2 with cosine originates from the definition of the cosine between two vectors:
cos(f(A), f(P)) = f(A)·f(P)/(‖f(A)‖*‖f(P)‖)
where · denotes the dot product along the feature dimension. Next, note that
2*[1 - cos(f(A), f(P))]*‖f(A)‖*‖f(P)‖ = ‖f(A) - f(P)‖² - (‖f(A)‖ - ‖f(P)‖)²
which gives a hint on where the notion comes from when ‖f(A)‖ = ‖f(P)‖. So your formula can be naturally changed to
L(A, P, N) = max(cos(f(A), f(N)) - cos(f(A), f(P)) + margin, 0)
Your margin parameter should be adjusted accordingly. Here is some TensorFlow code to compute the cosines for vectors:
import tensorflow as tf
def cos(A, B):
    return tf.reduce_sum(A*B, axis=-1)/tf.norm(A, axis=-1)/tf.norm(B, axis=-1)
Whether this loss would benefit your particular problem depends on the problem, so good luck with your experiments.
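For reference, the cosine version of the formula can be sketched for a single triplet in plain Python. This is an illustration only: the function names and the default `margin=0.5` are assumptions, and in TensorFlow the same arithmetic would be done on batched tensors with ops like the `cos` helper above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_cosine_loss(anchor, positive, negative, margin=0.5):
    """max(d(A, P) - d(A, N) + margin, 0) with d = 1 - cosine similarity.

    The two 1's cancel, so this equals
    max(cos(A, N) - cos(A, P) + margin, 0).
    """
    d_ap = 1.0 - cosine_similarity(anchor, positive)
    d_an = 1.0 - cosine_similarity(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)


# Positive aligned with the anchor, negative orthogonal: the margin is satisfied.
easy = triplet_cosine_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# Roles swapped: the triplet violates the margin and is penalised.
hard = triplet_cosine_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Since 1 - cosine similarity lies in [0, 2], a margin larger than 2 can never be satisfied, which is one reason the margin has to be re-tuned when switching from the L2 form.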
I tried a classification problem for fun with the scikit-learn library. My data has dimension 10000x10, and I found a very weird phenomenon (to me).
pca = PCA(n_components = 2)
ss = StandardScaler()
X = pca.fit_transform(X) # explained_variance_ratio_ = 0.8
X = ss.fit_transform(X)
In this case, I got a wonderful explained_variance_ratio_ of almost 99%. But when I apply scaling first, PCA's performance suddenly drops drastically and explained_variance_ratio_ decreases to 20%.
pca = PCA(n_components = 2)
ss = StandardScaler()
X = ss.fit_transform(X)
X = pca.fit_transform(X) # explained_variance_ratio_ = 0.2
What makes this difference? StandardScaler is just a rescaling process, so I supposed there would be no information loss. Can I apply PCA first for convenience of visualization? Or must I standardize first for mathematical soundness?
Suppose you have two features A and B that both measure distance in metres. Feature A has a greater range of values (say, 1-1000) than feature B (say, 1-10).
Then feature A will capture more of the variance in the data than B, and hence it is not a good idea to scale the features in this case.
But if the features have two different units (say, kg and metre), then it is wise to scale the features.
P.S.: PCA preserves those components along which the variance is maximal.
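The effect of the feature ranges can be reproduced with a small sketch. It uses synthetic data and a closed-form 2x2 eigenvalue computation instead of scikit-learn, so it stays self-contained. The data, the seed, and all function names are assumptions for illustration, and the two features are generated independently.

```python
import math
import random


def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def covariance(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def first_component_ratio(a, b):
    """Explained variance ratio of the first principal component of two features.

    Uses the largest eigenvalue of the 2x2 covariance matrix
    [[va, cab], [cab, vb]], divided by the total variance va + vb.
    """
    va, vb, cab = variance(a), variance(b), covariance(a, b)
    half_trace = (va + vb) / 2
    spread = math.sqrt(((va - vb) / 2) ** 2 + cab ** 2)
    return (half_trace + spread) / (va + vb)


def standardize(xs):
    """Zero mean, unit variance (what StandardScaler does per feature)."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(variance(xs))
    return [(x - mean) / std for x in xs]


rng = random.Random(0)
feature_a = [rng.uniform(1, 1000) for _ in range(1000)]  # wide range
feature_b = [rng.uniform(1, 10) for _ in range(1000)]    # narrow range

raw_ratio = first_component_ratio(feature_a, feature_b)
scaled_ratio = first_component_ratio(standardize(feature_a), standardize(feature_b))
```

On the raw data the first component is essentially feature A, so `raw_ratio` is close to 1. After standardization both features have unit variance, and for nearly uncorrelated features the first component explains only about half of the total.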
I want to learn optimal weights and exponents for a custom model I've created:
weights = tf.Variable(tf.zeros([t.num_features, 1], dtype=tf.float64))
exponents = tf.Variable(tf.ones([t.num_features, 1], dtype=tf.float64))
# works fine:
pred = tf.matmul(x, weights)
# doesn't work:
x_to_exponent = tf.mul(tf.sign(x), tf.pow(tf.abs(x), tf.transpose(exponents)))
pred = tf.matmul(x_to_exponent, weights)
cost_function = tf.reduce_mean(tf.abs(pred-y_))
optimizer = tf.train.GradientDescentOptimizer(t.LEARNING_RATE).minimize(cost_function)
The problem is that whenever there is a zero in x, the optimizer returns the weights as NaN. If I simply add 0.0001 when x = 0, then everything works as expected. But should I really have to do this? Shouldn't the TensorFlow optimizer have a way to handle this?
I've noticed Wikipedia shows no activation functions where x is raised to an exponent. Why isn't there an activation function that looks like the image below?
For the above image I'd like my program to learn that the correct exponent is 0.5.
This is correct behavior on TensorFlow's part, since the gradient is infinity there (and many computations that should mathematically be infinity end up NaN due to indeterminate limits).
If you want to work around the problem, a slightly generalized version of gradient clipping may work. You can get the gradients via Optimizer.compute_gradients, manually clip them via something like
safe_grad = tf.clip_by_value(tf.where(tf.math.is_nan(grad), tf.zeros_like(grad), grad), -lim, lim)
and then pass the clipped gradients to Optimizer.apply_gradients. The clipping will be necessary to not explode for values near the singularity, where the gradient may be arbitrarily large.
Warning: There is no guarantee that this will work, especially for deeper networks where the NaNs may pollute large swaths of the network.
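As a rough illustration of how an infinite factor ends up as NaN, and of the "replace NaN, then clip" idea, here is a scalar sketch in plain Python. It uses the derivative of |x|^p with respect to p as an example of an indeterminate form. This is not claimed to be what TensorFlow computes internally, and the function names and the `lim` value are made up for the example.

```python
import math


def grad_wrt_exponent(x, p):
    """d/dp |x|**p = |x|**p * ln|x|, evaluated naively in floating point."""
    ax = abs(x)
    log_ax = math.log(ax) if ax > 0 else float("-inf")  # ln(0) -> -inf
    return (ax ** p) * log_ax  # at x = 0 this is 0 * -inf, which is NaN


def safe_grad(grad, lim):
    """Replace NaN by 0, then clip to [-lim, lim]."""
    if math.isnan(grad):
        return 0.0
    return max(-lim, min(lim, grad))


raw = grad_wrt_exponent(0.0, 0.5)   # NaN: the indeterminate 0 * -inf
fixed = safe_grad(raw, lim=10.0)    # 0.0
clipped = safe_grad(1e9, lim=10.0)  # 10.0: huge gradients near the singularity
```

Zeroing the NaN simply means "take no step for this entry", which may or may not be what you want for your model.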
Can anyone explain to me the similarities and differences between correlation and convolution? Please explain the intuition behind them, not the mathematical equation (i.e., flipping the kernel/impulse). Application examples in the image processing domain for each would be appreciated too.
You will likely get a much better answer on the DSP Stack Exchange, but for starters: I have found a number of similar terms, and their definitions can be tricky to pin down.
Cross correlation
Correlation coefficient
Sliding dot product
Pearson correlation
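Cross correlation and sliding dot product are very similar to convolution and correlation.
Correlation coefficient and Pearson correlation are similar to each other.
Note that all of these terms have dot products rearing their heads.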
You asked about correlation and convolution: these are conceptually the same, except that one of the inputs (the kernel) is flipped in convolution. I suspect that you may have been asking about the difference between the correlation coefficient (such as Pearson) and convolution/correlation.
I am assuming that you know how to compute the dot product. Given two equal-sized vectors v and w, each with three elements, the algebraic dot product is v[0]*w[0]+v[1]*w[1]+v[2]*w[2].
There is a lot of theory behind the dot product in terms of what it represents, etc.
Notice that the dot product is a single number (scalar) representing the mapping between these two vectors/points v, w. In geometry, one frequently computes the cosine of the angle between two vectors, which uses the dot product. The cosine of the angle between two vectors is between -1 and 1 and can be thought of as a measure of similarity.
Correlation coefficient (Pearson)
The correlation coefficient between equal-length v and w is simply the dot product of two zero-mean signals (subtract mean v from v to get zmv, and mean w from w to get zmw; here zm is shorthand for zero mean) divided by the product of the magnitudes of zmv and zmw, which produces a number between -1 and 1. Close to zero means little correlation; close to +/- 1 means high correlation. It measures the similarity between these two vectors.
See for a better definition.
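The description above translates almost word for word into code. A minimal sketch in plain Python, where the names `zmv` and `zmw` follow the text and the function name is just for illustration:

```python
import math


def pearson(v, w):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_v = sum(v) / len(v)
    mean_w = sum(w) / len(w)
    zmv = [a - mean_v for a in v]  # zero-mean version of v
    zmw = [b - mean_w for b in w]  # zero-mean version of w
    dot = sum(a * b for a, b in zip(zmv, zmw))
    mag_v = math.sqrt(sum(a * a for a in zmv))
    mag_w = math.sqrt(sum(b * b for b in zmw))
    return dot / (mag_v * mag_w)


same_direction = pearson([1, 2, 3], [2, 4, 6])      # close to +1
opposite_direction = pearson([1, 2, 3], [3, 2, 1])  # close to -1
```

Note that this is the cosine of the angle between the two zero-mean vectors, which ties it back to the dot-product discussion above. It is undefined (division by zero) if either input is constant.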
Convolution and Correlation
When we want to correlate/convolve v1 and v2 we basically are computing a series of dot-products and putting them into an output vector. Let's say that v1 is three elements and v2 is 10 elements. The dot products we compute are as follows:
output[0] = v1[0]*v2[0]+v1[1]*v2[1]+v1[2]*v2[2]
output[1] = v1[0]*v2[1]+v1[1]*v2[2]+v1[2]*v2[3]
output[2] = v1[0]*v2[2]+v1[1]*v2[3]+v1[2]*v2[4]
output[3] = v1[0]*v2[3]+v1[1]*v2[4]+v1[2]*v2[5]
output[4] = v1[0]*v2[4]+v1[1]*v2[5]+v1[2]*v2[6]
output[5] = v1[0]*v2[5]+v1[1]*v2[6]+v1[2]*v2[7]
output[6] = v1[0]*v2[6]+v1[1]*v2[7]+v1[2]*v2[8]
output[7] = v1[0]*v2[7]+v1[1]*v2[8]+v1[2]*v2[9]
output[8] = v1[0]*v2[8]+v1[1]*v2[9]+v1[2]*v2[10] #note this is
#mathematically valid but might give you a run time error in a computer implementation
For a true convolution, the kernel v1 is flipped before the dot products are taken:
output[0] = v1[2]*v2[0]+v1[1]*v2[1]+v1[0]*v2[2]
output[1] = v1[2]*v2[1]+v1[1]*v2[2]+v1[0]*v2[3]
output[2] = v1[2]*v2[2]+v1[1]*v2[3]+v1[0]*v2[4]
output[3] = v1[2]*v2[3]+v1[1]*v2[4]+v1[0]*v2[5]
output[4] = v1[2]*v2[4]+v1[1]*v2[5]+v1[0]*v2[6]
output[5] = v1[2]*v2[5]+v1[1]*v2[6]+v1[0]*v2[7]
output[6] = v1[2]*v2[6]+v1[1]*v2[7]+v1[0]*v2[8]
output[7] = v1[2]*v2[7]+v1[1]*v2[8]+v1[0]*v2[9]
Notice that we have fewer than 10 elements in the output (8 here), since for simplicity I am computing the convolution only where both v1 and v2 are defined.
Notice also that the convolution is simply a number of dot products. There has been considerable work over the years on speeding up convolutions. The sweeping dot products are slow and can be sped up by first transforming the vectors into the Fourier basis, then computing a single element-wise multiplication, and then inverting the transform, though I won't go into that here...
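The sliding dot products can be written as a short loop. This is a minimal sketch in plain Python that only computes outputs where the kernel fully overlaps the signal (often called "valid" mode); the function names are chosen for illustration.

```python
def correlate_valid(kernel, signal):
    """Sliding dot product of kernel over signal, full-overlap positions only."""
    num_outputs = len(signal) - len(kernel) + 1
    return [
        sum(k * signal[i + j] for j, k in enumerate(kernel))
        for i in range(num_outputs)
    ]


def convolve_valid(kernel, signal):
    """Convolution is the same sliding dot product with the kernel flipped."""
    return correlate_valid(kernel[::-1], signal)


v1 = [1, 2, 3]          # 3-element kernel
v2 = list(range(10))    # 10-element signal: 0, 1, ..., 9

corr = correlate_valid(v1, v2)  # 8 outputs
conv = convolve_valid(v1, v2)   # 8 outputs, generally different from corr
```

For a kernel that reads the same forwards and backwards (for example a Gaussian blur), the two functions return identical results, which is why the distinction often goes unnoticed in image smoothing.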
You might want to look at these resources as well as googling: Calculating Pearson correlation and significance in Python
The best answer I got was from this document:
I'm just going to copy the excerpt from the doc:
"The key difference between the two is that convolution is associative. That is, if F and G are filters, then F*(G*I) = (F*G)*I. If you don’t believe this, try a simple example, using F=G=(-1 0 1), for example. It is very convenient to have convolution be associative. Suppose, for example, we want to smooth an image and then take its derivative. We could do this by convolving the image with a Gaussian filter, and then convolving it with a derivative filter. But we could alternatively convolve the derivative filter with the Gaussian to produce a filter called a Difference of Gaussian (DOG), and then convolve this with our image. The nice thing about this is that the DOG filter can be precomputed, and we only have to convolve one filter with our image.
In general, people use convolution for image processing operations such as smoothing, and they use correlation to match a template to an image. Then, we don’t mind that correlation isn’t associative, because it doesn’t really make sense to combine two templates into one with correlation, whereas we might often want to combine two filter together for convolution."
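The simple example suggested in the excerpt, F=G=(-1 0 1), is easy to try. Below is a minimal sketch in plain Python using "full" one-dimensional convolution. The signal `I` is an arbitrary assumption, and correlation is defined here as convolution with the filter flipped.

```python
def convolve_full(f, g):
    """Full 1-D convolution: out[n] = sum over k of f[k] * g[n - k]."""
    out = [0] * (len(f) + len(g) - 1)
    for i, fv in enumerate(f):
        for j, gv in enumerate(g):
            out[i + j] += fv * gv
    return out


def correlate_full(f, g):
    """Full 1-D correlation of filter f with g: convolution with f flipped."""
    return convolve_full(f[::-1], g)


F = [-1, 0, 1]
G = [-1, 0, 1]
I = [3, 1, 4, 1, 5]  # arbitrary 1-D "image" for the experiment

# Convolution: filtering twice equals filtering once with the combined filter.
conv_left = convolve_full(F, convolve_full(G, I))
conv_right = convolve_full(convolve_full(F, G), I)

# Correlation: the same regrouping changes the result.
corr_left = correlate_full(F, correlate_full(G, I))
corr_right = correlate_full(correlate_full(F, G), I)
```

With these inputs `conv_left` and `conv_right` agree, while `corr_left` and `corr_right` differ (here they come out as negatives of each other), which is the point the excerpt makes about precomputing a combined filter.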
Convolution is just like correlation, except that we flip over the filter before correlating
Can anybody please show me how to use the RANSAC algorithm to select common feature points in two images that have a certain portion of overlap? The problem arose from feature-based image stitching.
I implemented an image stitcher a couple of years back. The article on RANSAC on Wikipedia describes the general algorithm well.
When using RANSAC for feature-based image matching, what you want is to find the transform that best transforms the first image to the second image. This would be the model described in the Wikipedia article.
If you have already got your features for both images and have found which features in the first image best match which features in the second image, RANSAC would be used something like this.
The input to the algorithm is:
n - the number of random points to pick every iteration in order to create the transform. I chose n = 3 in my implementation.
k - the number of iterations to run
t - the threshold for the square distance for a point to be considered as a match
d - the number of points that need to be matched for the transform to be valid
image1_points and image2_points - two arrays of the same size with points. Assumes that image1_points[x] is best mapped to image2_points[x] according to the computed features.
best_model = null
best_error = Inf
for i = 0:k
    rand_indices = n random integers from 0:num_points
    base_points = image1_points[rand_indices]
    input_points = image2_points[rand_indices]
    maybe_model = find best transform from input_points -> base_points
    consensus_set = 0
    total_error = 0
    for j = 0:num_points
        error = square distance of the difference between image2_points[j] transformed by maybe_model and image1_points[j]
        if error < t
            consensus_set += 1
            total_error += error
    if consensus_set > d && total_error < best_error
        best_model = maybe_model
        best_error = total_error
The end result is the transform that best transforms the points in image2 to image1, which is exactly what you want when stitching.
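For anyone who wants something runnable, here is a minimal Python sketch of the loop above. To keep it short and dependency-free it assumes the simplest possible model, a pure 2-D translation, instead of the more general transform a real stitcher would fit. `fit_translation`, `ransac_translation`, the `seed` argument, and the test data are all made up for this illustration.

```python
import random


def fit_translation(src, dst):
    """Least-squares translation (dx, dy) that moves src points onto dst points."""
    count = len(src)
    dx = sum(d[0] - s[0] for s, d in zip(src, dst)) / count
    dy = sum(d[1] - s[1] for s, d in zip(src, dst)) / count
    return dx, dy


def ransac_translation(image1_points, image2_points, n, k, t, d, seed=0):
    """RANSAC over matched point pairs, with a translation as the model.

    n: points sampled per iteration, k: iterations,
    t: squared-distance threshold, d: minimum consensus size.
    """
    rng = random.Random(seed)
    num_points = len(image1_points)
    best_model, best_error = None, float("inf")
    for _ in range(k):
        indices = rng.sample(range(num_points), n)
        base_points = [image1_points[i] for i in indices]
        input_points = [image2_points[i] for i in indices]
        dx, dy = fit_translation(input_points, base_points)

        consensus, total_error = 0, 0.0
        for (x1, y1), (x2, y2) in zip(image1_points, image2_points):
            error = (x2 + dx - x1) ** 2 + (y2 + dy - y1) ** 2
            if error < t:
                consensus += 1
                total_error += error

        if consensus > d and total_error < best_error:
            best_model, best_error = (dx, dy), total_error
    return best_model


# Ten correct matches related by a shift of (5, -3), plus three bad matches.
image2_points = [(x, y) for x in range(5) for y in range(2)]
image1_points = [(x + 5, y - 3) for x, y in image2_points]
image2_points += [(0, 0), (1, 1), (2, 2)]
image1_points += [(100, 200), (-50, 80), (300, -40)]

model = ransac_translation(image1_points, image2_points, n=3, k=50, t=1.0, d=8)
```

Swapping `fit_translation` for an affine or homography fit (and applying that transform in the error computation) gives the version described in the answer; the sampling and consensus logic stays the same.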