I am doing an image classification project and I have built a corpus of features.
I want to normalize my features to the range -1 to 1 for input to PyBrain. I am using the following formula to normalize the features:
Normalized value = (Value - Mean ) / Standard Deviation
but it gives me some normalized values between -3 and 3, which is not what I want.
I have 100 inputs and 1 output in PyBrain.
The equation you used is standardization. It does not guarantee that your values lie in [-1, 1]; it rescales your data to have a mean of 0 and a standard deviation of 1 afterwards. Points can still be more than one standard deviation away from the mean.
There are multiple options to bound your data.
Use a nonlinear function such as tanh (very popular in neural networks)
center, then rescale with 1/max(abs(dev))
preserve 0, then rescale with 1/max(abs(dev))
2*(x-min)/(max-min) - 1
standardize (as you did) but truncate values to [-1, +1] (see the sketch after this list)
... many more
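For illustration, here is a minimal NumPy sketch of two of the options above (tanh squashing and standardize-then-truncate); the array x is just a placeholder for your feature matrix:
import numpy as np

x = np.array([[3.0, 0.02, 10.0],
              [5.0, 0.05, 15.0],
              [4.0, 0.03, 12.0]])              # placeholder feature matrix (rows = samples)

x_std = (x - x.mean(axis=0)) / x.std(axis=0)   # standardization, as in the question

x_tanh = np.tanh(x_std)                        # option: squash with tanh, strictly inside (-1, 1)
x_trunc = np.clip(x_std, -1.0, 1.0)            # option: standardize, then truncate to [-1, +1]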
If your dataset contains only positive values, you can normalize them using this formula:
Normalized value = (Value / (0.5 * Max_Value)) - 1
This will give you values in the range [-1, +1].
If you have both positive and negative values:
Normalized value = ((Value - Min_Value) / (Max_Value - Min_Value) - 0.5) * 2
Maybe you can do this:
Mid_value = (Max_Value + Min_Value) / 2
Max_difference = (Max_Value - Min_Value) / 2
Normalized_value = (Value - Mid_value) / Max_difference
The Normalized_value will be within [-1, +1].
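As a quick sanity check, the same formula in Python (the values are made up):
values = [3.0, 5.0, 4.2]                        # made-up feature values
max_value, min_value = max(values), min(values)

mid_value = (max_value + min_value) / 2
max_difference = (max_value - min_value) / 2

normalized = [(v - mid_value) / max_difference for v in values]
# the minimum maps to -1, the maximum maps to +1, everything else lands in between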
The triplet loss is defined as follows:
L(A, P, N) = max(‖f(A) - f(P)‖² - ‖f(A) - f(N)‖² + margin, 0)
where A=anchor, P=positive, and N=negative are the data samples in the loss, and margin is the minimum distance between the anchor and positive/negative samples.
I read somewhere that (1 - cosine_similarity) may be used instead of the L2 distance.
Note that I am using TensorFlow, and its cosine similarity loss is defined so that when it is a negative number between -1 and 0, 0 indicates orthogonality and values closer to -1 indicate greater similarity; values closer to 1 indicate greater dissimilarity. So it is the opposite of the cosine similarity metric.
Any suggestions on how to write my triplet loss with cosine similarity?
Edit
All good stuff in the comments and answers. Based on all the hints, this is working OK for me:
self.margin = 1
self.loss = tf.keras.losses.CosineSimilarity(axis=1)
ap_distance = self.loss(anchor, positive)
an_distance = self.loss(anchor, negative)
loss = tf.maximum(ap_distance - an_distance + self.margin, 0.0)
I would like to eventually use the TensorFlow Addons loss, as @pygeek pointed out, but I haven't figured out how to pass the data yet.
Note
To use it standalone - one must do something like this:
cosine_similarity = tf.keras.metrics.CosineSimilarity()
cosine_similarity.reset_state()
cosine_similarity.update_state(anch_prediction, other_prediction)
similarity = cosine_similarity.result().numpy()
Resources
pytorch cosine embedding layer
tensorflow cosine similarity implementation
tensorflow triplet loss hard/soft margin
First of all, Cosine_distance = 1 - cosine_similarity. The distance and similarity are different. This is not correctly mentioned in some of the answers!
Secondly, you should look at how the cosine similarity loss is implemented in the TensorFlow code (https://github.com/keras-team/keras/blob/v2.9.0/keras/losses.py#L2202-L2272), which is different from PyTorch!
Finally, I suggest you use an existing loss: you should replace the ‖ ... ‖² with tf.losses.cosineDistance(...).
I am guessing that what you read about replacing the L2 distance with cosine originates from the definition of the cosine between two vectors:
cos(f(A), f(P)) = f(A) * f(P)/(‖f(A)‖*‖f(P)‖)
where dot product along the feature dimension is implied in the above. Next, note that
2 · [1 - cos(f(A), f(P))] · ‖f(A)‖ · ‖f(P)‖ = ‖f(A) - f(P)‖² - (‖f(A)‖ - ‖f(P)‖)²
which gives a hint on where the notion comes from when ‖f(A)‖ = ‖f(P)‖. So your formula can be naturally changed to
L(A, P, N) = max(cos(f(A), f(N)) - cos(f(A), f(P)) + margin, 0)
Your margin parameter should be adjusted accordingly. Here is some Tensorflow code to compute the cosines for vectors
def cos(A, B):
    return tf.reduce_sum(A * B, axis=-1) / tf.norm(A, axis=-1) / tf.norm(B, axis=-1)
Whether this loss would benefit your particular problem depends on the problem, so good luck with your experiments.
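Putting the pieces together, a minimal sketch of a full triplet loss built on this idea might look like the following; the margin default of 0.5 and the (batch, features) input shape are assumptions, not something from the question:
import tensorflow as tf

def cos(A, B):
    # cosine similarity along the feature dimension
    return tf.reduce_sum(A * B, axis=-1) / tf.norm(A, axis=-1) / tf.norm(B, axis=-1)

def cosine_triplet_loss(anchor, positive, negative, margin=0.5):
    # higher cosine means more similar, so cos(A, P) should exceed cos(A, N) by at least `margin`
    per_sample = tf.maximum(cos(anchor, negative) - cos(anchor, positive) + margin, 0.0)
    return tf.reduce_mean(per_sample)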
I have a dataset with some outliers which are 10 or 100 times greater than the normal values. I cannot throw out these rows, and I want to normalize this data to the interval [0, 1].
First of all, here's what I thought to do:
Simply rank my dataset's rows and use the ranked positions as the variable to normalize. Since we have a uniform distribution here, it is easy. The problem is that the differences between values are not captured, so values that are far apart could get similar normalized values if there are no intermediate examples in the dataset (see the sketch after this list)
Use the sklearn.preprocessing.RobustScaler method. But I got normalized values between -0.4 and 300, which is still not usable on this scale
Distribute normalized values between 0 and 0.8 in a linear way for all values where quantile <= 0.8, and distribute the values between 0.8 and 1.0 among the remaining values in a similar way to the ranking strategy I mentioned above
Run a 1D k-means algorithm to locate all nearby values and get a cluster of non-outlier values. For these values, I just distribute normalized values between 0 and the quantile value it represents, simply by doing (value - mean) / (max - min), and for the remaining outlier values, I distribute the range between values greater than the quantile and 1 with the ranking strategy
Create a filter function, like a sigmoid, and multiply values by it. Smaller values remain unchanged, but the outlier's values are approximated to non-outlier values. Then, I normalize it. But how can I design this sigmoid's parameters?
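As a minimal sketch of the first (rank-based) idea, assuming a one-dimensional array of values:
import numpy as np
from scipy.stats import rankdata

values = np.array([1.0, 2.0, 3.0, 250.0])     # made-up data with one large outlier

ranks = rankdata(values)                      # ranks 1..n, ties get averaged
normalized = (ranks - 1) / (len(values) - 1)  # map ranks linearly onto [0, 1]
# the outlier ends up at 1.0 but no longer stretches the rest of the scale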
First of all, I would like to get some feedback about these strategies: what do you think of them?
Also, how is this problem normally solved? Are there any references you would recommend?
Thank you =)
How can I calculate the false positive rate for an object detection algorithm, where I can have multiple objects per image?
In my data, a given image may have many objects. I am counting a predicted box as a true positive if its IOU with a truth box is above a certain threshold, and as a false positive otherwise. For example:
I have 2 prediction bounding boxes and 2 ground-truth bounding boxes:
I computed IoU for each pair of prediction and ground-truth bounding boxes:
IoU = 0.00, 0.60, 0.10, 0.05
threshold = 0.50
In this case, do I have a TP or not? Could you explain it?
Summary, specific: Yes, you have a TP; you also have a FP and a FN.
Summary, detailed: Your prediction model correctly identified one GT (ground truth) box. It missed the other. It incorrectly identified a third box.
Classification logic:
At the very least, your IoU figures should be a matrix, not a linear sequence. For M predictions and N GT boxes, you will have an N×M matrix. Yours looks like this:
0.00 0.60
0.10 0.05
Now, find the largest value in the matrix, 0.60. This is above the threshold, so you declare the match and eliminate both that prediction and that GT box from the matrix. This leaves you with a rather boring matrix:
0.10
Since this value is below the threshold, you are out of matches. You have one prediction and one GT remaining. With the one "hit", you have three objects in your classification set: two expected objects, and a third created by the predictor. You code your gt and pred lists like this:
gt   = [1, 1, 0]  # the first two objects are valid; the third is a phantom
pred = [1, 0, 1]  # identified one actual box and the phantom
Is that clear enough?
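A minimal sketch of this greedy matching, assuming the IoU values are arranged in an N×M NumPy matrix with rows as GT boxes and columns as predictions:
import numpy as np

iou = np.array([[0.00, 0.60],
                [0.10, 0.05]])       # rows: ground-truth boxes, columns: predictions
threshold = 0.50

tp = 0
remaining = iou.copy()
while remaining.size and remaining.max() >= threshold:
    g, p = np.unravel_index(remaining.argmax(), remaining.shape)
    tp += 1                          # declare a match and drop that GT box and prediction
    remaining = np.delete(np.delete(remaining, g, axis=0), p, axis=1)

fn = iou.shape[0] - tp               # ground-truth boxes never matched (missed)
fp = iou.shape[1] - tp               # predictions never matched (phantoms)
# here: tp = 1, fn = 1, fp = 1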
You can use an algorithm (e.g. Hungarian algorithm aka Kuhn–Munkres algorithm aka Munkres algorithm) to assign detections to ground truths. You might incorporate the ability to not assign a detection to ground truth & vice versa (e.g. allow for false alarms and missed detections).
After assigning the detections to ground truths, just use the definition of TPR; see the Wikipedia page for Sensitivity (aka TPR) & Specificity (aka TNR).
I provide this answer since I think @Prune provided an answer which uses a greedy algorithm to assign detections to ground truths (i.e. "Now, find the largest value in the matrix, 0.60. This is above the threshold, so you declare the match and eliminate both that prediction and that GT box from the matrix."). This greedy assignment method will not work well in all scenarios. For example, imagine this matrix of IoU values between detections and ground-truth bounding boxes:
det1 det2
pred1 0.4 0.0
pred2 0.6 0.4
The Greedy algorithm would assign pred2 to det1 and pred1 to det2 (or pred1 to nothing if accounting for possibility of false alarms). However, the Hungarian algorithm would assign pred1 to det1 and pred2 to det2, which might be better in some cases.
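For comparison, a small sketch of the Hungarian assignment on that example using scipy.optimize.linear_sum_assignment (which minimizes cost, so the IoU matrix is negated):
import numpy as np
from scipy.optimize import linear_sum_assignment

iou = np.array([[0.4, 0.0],          # rows: pred1, pred2; columns: det1, det2
                [0.6, 0.4]])

pred_idx, det_idx = linear_sum_assignment(-iou)   # maximize total IoU
for p, d in zip(pred_idx, det_idx):
    print(f"pred{p + 1} -> det{d + 1} (IoU = {iou[p, d]:.2f})")
# prints pred1 -> det1 and pred2 -> det2; you would then apply your IoU threshold
# to each assigned pair to decide which ones count as true positives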
Given a classification problem in machine learning, the hypothesis is described as below:
hθ(x)=g(θ'x)
z = θ'x
g(z) = 1 / (1+e^−z)
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
hθ(x) ≥ 0.5 → y = 1
hθ(x) < 0.5 → y = 0
The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:
g(z) ≥ 0.5
when z ≥ 0
Remember:
z = 0, e^0 = 1 ⇒ g(z) = 1/2
z → ∞, e^−∞ → 0 ⇒ g(z) = 1
z → −∞, e^∞ → ∞ ⇒ g(z) = 0
So if our input to g is θᵀx, then that means:
hθ(x) = g(θᵀx) ≥ 0.5
when θᵀx ≥ 0
From these statements we can now say:
θ'x ≥ 0 ⇒ y = 1
θ'x < 0 ⇒ y = 0
The decision boundary is the line that separates the area where y = 0 from the area where y = 1, and it is created by our hypothesis function.
What part of this relates to the Decision Boundary? Or where does the Decision Boundary algorithm come from?
This is basic logistic regression with a threshold. Your theta' * x is just the vector notation of your weight vector multiplied by your input. If you put that into the logistic function, which outputs a value strictly between 0 and 1, you threshold that value at 0.5: if it is equal to or above the threshold, you treat the sample as positive, and as negative otherwise.
The classification algorithm is just that simple. The training is a bit more complicated, and its goal is to find a weight vector theta which correctly classifies all your labeled data, or at least as much of it as possible. The way to do this is to minimize a cost function which measures the difference between the output of your function and the expected label. You can do this using gradient descent. I guess Andrew Ng is teaching this.
Edit: Your classification algorithm is g(theta'x)>=0.5 and g(theta'x)<0.5, so a basic step function.
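In code, that classification step is just the sigmoid followed by a 0.5 threshold; theta and x below are placeholders for a fitted weight vector and a single input:
import numpy as np

def predict(theta, x):
    z = np.dot(theta, x)             # theta' * x
    g = 1.0 / (1.0 + np.exp(-z))     # logistic function, output in (0, 1)
    return 1 if g >= 0.5 else 0      # threshold at 0.5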
Courtesy of other posters on a different tech forum.
Solving for theta'*x >= 0 and theta'*x<0 gives the decision boundary. The RHS of the inequality ( i.e. 0) comes from the sigmoid function.
Theta gives you the hypothesis that best fits the training set.
From theta, you can compute the decision boundary - it is the locus of points where (X * theta) = 0, or equivalently where g(X * theta) = 0.5.
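For example, with two features plus an intercept term, that locus is a straight line you can compute directly; this is a sketch with a made-up theta = [theta0, theta1, theta2]:
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])   # made-up fitted parameters [theta0, theta1, theta2]

x1 = np.linspace(0.0, 5.0, 50)
# theta0 + theta1*x1 + theta2*x2 = 0  =>  x2 = -(theta0 + theta1*x1) / theta2
x2 = -(theta[0] + theta[1] * x1) / theta[2]
# every point (x1, x2) on this line satisfies g(theta' * x) = 0.5: the decision boundary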
I've been playing with some SVM implementations and I am wondering - what is the best way to normalize feature values to fit into one range? (from 0 to 1)
Let's suppose I have 3 features with values in the ranges:
3 - 5
0.02 - 0.05
10 - 15
How do I convert all of those values into range of [0,1]?
What if, during training, the highest value of feature number 1 that I encounter is 5, and after I begin to use my model on much bigger datasets I stumble upon values as high as 7? Then in the converted range it would exceed 1...
How do I normalize values during training to account for the possibility of "values in the wild" exceeding the highest (or lowest) values the model has seen during training? How will the model react to that, and how do I make it work properly when that happens?
Besides the scaling-to-unit-length method provided by Tim, standardization is most often used in machine learning. Note that when your test data arrives, it makes more sense to use the mean and standard deviation from your training samples to do this scaling. If you have a very large amount of training data, it is reasonably safe to assume it follows a normal distribution, so the chance that new test data falls far out of range won't be that high. Refer to this post for more details.
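A minimal scikit-learn sketch of that point: fit the scaler on the training data only, then reuse its mean and standard deviation on the test data:
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[3.0, 0.02, 10.0],
                    [5.0, 0.05, 15.0],
                    [4.0, 0.03, 12.0]])   # made-up training features
X_test = np.array([[7.0, 0.04, 14.0]])    # a "value in the wild" above the training maximum

scaler = StandardScaler().fit(X_train)    # learns the training mean and standard deviation
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # may fall outside the training range, which is fine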
You normalise a vector by converting it to a unit vector. This trains the SVM on the relative values of the features, not the magnitudes. The normalisation algorithm will work on vectors with any values.
To convert to a unit vector, divide each value by the length of the vector. For example, a vector of [4 0.02 12] has a length of 12.6491. The normalised vector is then [4/12.6491 0.02/12.6491 12/12.6491] = [0.316 0.0016 0.949].
If "in the wild" we encounter a vector of [400 2 1200] it will normalise to the same unit vector as above. The magnitudes of the features is "cancelled out" by the normalisation and we are left with relative values between 0 and 1.