I want to compare the following distributions, each given as key-percentage pairs.
dist1 = 200 - 0.10, 201 - 0.10, 500 - 0.80
dist2 = 200 - 0.15, 201 - 0.05, 500 - 0.80
dist3 = 200 - 0.10, 201 - 0.05, 500 - 0.85
dist1 is my original distribution, and I want to compare it with dist2 and dist3. When I use something like KL divergence, I get KL(dist2, dist1) > KL(dist3, dist1), but in my current use case I want the opposite: a metric that says dist2 is closer to dist1 than dist3 is, because in dist2 the only change relative to dist1 is between nearby buckets (200 and 201), whereas in dist3 mass moves from the 201 bucket all the way to the 500 bucket.
Something like the mean would work in this case, but I want a more rigorous way of comparing these distributions that can capture all the variation.
Thanks
You may want to look into Earth mover's distance (also known as the Wasserstein distance). It measures the difference between two distributions by thinking of probability mass as a pile of dirt and asking how much work it takes to move the dirt to transform one distribution into the other. Moving dirt farther takes more work than moving it a shorter distance, whereas KL divergence is insensitive to how far the probability mass travels.
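For the bucketed distributions above, here is a minimal sketch using SciPy's one-dimensional Wasserstein distance (a standard implementation of EMD); it uses the bucket values themselves as the ground distance, which is exactly what makes dist2 come out closer than dist3:

from scipy.stats import wasserstein_distance

buckets = [200, 201, 500]
dist1 = [0.10, 0.10, 0.80]
dist2 = [0.15, 0.05, 0.80]
dist3 = [0.10, 0.05, 0.85]

# dist1 -> dist2: 0.05 of mass moves from 201 to 200 -> distance ~0.05
print(wasserstein_distance(buckets, buckets, dist1, dist2))
# dist1 -> dist3: 0.05 of mass moves from 201 to 500 -> distance ~14.95
print(wasserstein_distance(buckets, buckets, dist1, dist3))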
Related
I'm currently developing a question plugin for an LMS that auto-grades answers based on the similarity between the answer and the answer key, using cosine similarity. Lately I found an algorithm that is claimed to be more accurate, called TS-SS, but its result ranges from 0 to infinity. Not being a machine learning guy, I assume the result is a distance, just like Euclidean distance, but I'm not sure. Since the algorithm calculates a triangle and a sector, it may be some kind of geometric similarity, though I'm not sure.
So I have some examples in my notes, and I tried converting the distance with the formula people suggest, S = 1 / (1 + D), but the result was not what I was looking for. With cosine similarity I got 0.77, but with TS-SS plus that equation I got 0.4. Then I found an SO answer that uses S = 1 / (1.1 ** D). When I tried that equation, sure enough it gave me a "relevant" result, 0.81. That is not far from cosine similarity, and in my opinion it is better suited for auto-grading against the answer key than the 0.77.
But unfortunately, I don't know where that equation comes from, and I tried to Google it with no luck, which is why I'm asking this question.
How do I convert the TS-SS result to a similarity measure the right way? Is S = 1 / (1.1 ** D) enough, or...?
Edit:
When calculating TS-SS, it actually uses the cosine similarity calculation as well. So, if the cosine similarity is 1, then the TS-SS will be 0. But if the cosine similarity is 0, the TS-SS is not infinity. So I think it is reasonable to compare the results of the two to figure out what conversion formula should be used.
TS-SS Cosine Similarity
38.19 0
7.065 0.45
3.001 0.66
1.455 0.77
0.857 0.81
0.006 0.80
0 1
Another random comparison, from multiple answer keys:
36.89 0
9.818 0.42
7.581 0.45
3.910 0.63
2.278 0.77
2.935 0.75
1.329 0.81
0.494 0.84
0.053 0.75
0.011 0.80
0.003 0.98
0 1
Comparison from the same answer key:
38.11 0.71
4.293 0.33
1.448 0
1.203 0.17
0.527 0.62
Thank you in advance
With these new figures, the answer is simply that we can't give you one. The two functions give you distance/similarity measures based on metrics that appear to be different enough that we can't simply transform between TS-SS and CS. In fact, even if the two functions are continuous (which they're supposed to be for comfortable use), the transformation between them isn't a bijection (a two-way, invertible mapping).
For a smooth translation between the two, we need at least that both functions be continuous and differentiable over the entire interval of application, so that a small change in the document results in a small change in the metric. We also need them to be monotonic over the interval, such that a rise in TS-SS always results in a drop in CS.
Your data tables show that we can't even craft such a transformation function for a single document, let alone for the metrics in general.
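To make that concrete, here is a small sketch using the "comparison from the same answer key" pairs above. If any smooth, invertible TS-SS-to-CS mapping existed, sorting by TS-SS in ascending order would have to leave the cosine similarities non-increasing, and it doesn't:

# (TS-SS, cosine similarity) pairs taken from the "same answer key" table above.
pairs = [(38.11, 0.71), (4.293, 0.33), (1.448, 0.0), (1.203, 0.17), (0.527, 0.62)]

pairs.sort(key=lambda p: p[0])                      # ascending TS-SS
cs = [c for _, c in pairs]
monotone = all(a >= b for a, b in zip(cs, cs[1:]))
print(cs)        # [0.62, 0.17, 0.0, 0.33, 0.71]
print(monotone)  # False: no monotonic mapping can reproduce these pairs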
The cited question was a much simpler problem: there, the OP already had a transformation with all of the desired properties; they only needed to alter the slopes of change and ensure the boundary properties.
I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar, then their words should overlap quite a bit. However, a cosine-theta score of 0.54 does not by itself mean "related" or "not related". I want to end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I would have to find the optimal parameters to give to such functions, and I do not know whether they are satisfactory solutions. I was thinking: I have the cosine-theta score, I can calculate the percentage of overlap between the two articles (e.g. the number of overlapping words divided by the number of words in the article), and maybe some more interesting features. With that data I could perhaps fit a function (what type of function I do not know, and that is part of the question!) and then minimize the error via the SciPy library. This means I would be doing some sort of supervised learning, and I am willing to label article pairs (0/1) in order to train such a model. Is this worth the effort?
import math

def cosine_similarity(self, s1, s2):
    # Count words of the two strings.
    v1, v2 = self.word_count(s1), self.word_count(s2)
    # Calculate the intersection of the words in both strings.
    v3 = set(v1.keys()) & set(v2.keys())
    # Ratio between the overlap and the article length (since 1 overlapping
    # word out of 2 is more important than 4 overlapping words in articles
    # of 492 words). Guard against an empty intersection.
    p = len(v3) / min(len(v1), len(v2)) if v3 else 0.0
    numerator = sum(v1[w] * v2[w] for w in v3)
    w1 = sum(v1[w] ** 2 for w in v1)
    w2 = sum(v2[w] ** 2 for w in v2)
    denominator = math.sqrt(w1) * math.sqrt(w2)
    # Calculate the cosine similarity.
    if not denominator:
        return 0.0
    return float(numerator) / denominator
As said, I would like to use variables such as p and the cosine-theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. Unless you have a labelled data set, it is up to you to choose how the overlap affects whether or not two strings are "matching". If you do have a labelled data set (i.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and optimise against that. I would recommend something like a neural net or an SVM due to the potentially high-dimensional, categorical nature of your problem.
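If you do go the labelled route, a minimal sketch with scikit-learn (an SVM, as suggested) could look like this; the feature values and labels here are made up purely for illustration, with each pair represented by its cosine score and overlap ratio p:

import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: one row per labelled article pair,
# features are [cosine score, overlap ratio p], labels are 1 = related, 0 = not.
X = np.array([[0.91, 0.62], [0.77, 0.45], [0.85, 0.55],
              [0.54, 0.20], [0.31, 0.12], [0.22, 0.08]])
y = np.array([1, 1, 1, 0, 0, 0])

clf = SVC(kernel="rbf")
clf.fit(X, y)

# Hard 0/1 prediction for a new, unlabelled pair.
print(clf.predict([[0.60, 0.30]]))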
Even the optimisation, however, is a subjective measure. For example, suppose a model that, out of 100 samples, commits to only 1 answer (giving 99 unknowns). Technically, if that one answer is correct, that is a model with 100% precision, but one with very low recall. Generally in machine learning you will find a trade-off between precision and recall.
Some people like to go for metrics which combine the two (the most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about precision: I only want to target consumers who are likely to buy my product. If, however, we are testing for a deadly disease or for markers of bank fraud, then it is feasible for only 10% of the cases the test flags to be real, as long as its recall of true positives is somewhere close to 100%.
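As a toy illustration of the hypothetical 1-correct-answer-out-of-100 model above, the numbers work out like this:

# Hypothetical counts: 1 true positive, 0 false positives, 99 missed positives.
tp, fp, fn = 1, 0, 99

precision = tp / (tp + fp)    # 1.0  -> every committed answer was right
recall = tp / (tp + fn)       # 0.01 -> but almost everything was missed
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 1.0 0.01 ~0.0198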
Finally, if you have no labelled data, then your best bet is just to define some cut-off value which you believe indicates a good match. This would then be more analogous to a binary clustering problem, and you could use some more abstract measure, such as distance to a centroid, to test which cluster (either the "related" or the "unrelated" cluster) a point belongs to. Note, however, that here your features feel like they would be incredibly hard to define.
I've been reading up on Image Moments and it looks like they are very useful for efficiently describing a blob. Apparently, the 3rd order moment represents/can tell me about a blob's skewness (is this correct?).
How can I get the 3rd order moment in OpenCV? Do you have a calculation/formula you can point me to?
Moments m = moments(contour, false);
// Are any of these the 3rd order moment?
m.m03;
m.mu03;
m.nu03;
As stated in the OpenCV docs for moments():
Calculates all of the moments up to the third order of a polygon or rasterized shape.
So yes, moments() does return what you're after.
The three quantities you mention, m03, mu03, and nu03, are all different flavours of the third-order moment.
m03 is the raw (spatial) third-order moment.
mu03 is the third-order central moment, i.e. the same as m03 but computed as if the blob were centred at (0, 0).
nu03 is mean-shifted and normalized, i.e. the same as mu03 but divided by a power of the area (m00), which makes it scale-invariant.
Let's say you wanted to describe a shape but be agnostic to its size or its location in your image. Then you would use the mean-shifted and normalized descriptor, nu03. But if you wanted to keep the size as part of the description, then you'd use mu03. And if you wanted to keep the location information as well, you'd use m03.
You can think about it in the same way as you think about distributions in general. Saying that I have a sample of x = 500 from a normal distribution with mean 450 and standard deviation 25 is basically the same thing as saying I have a sample of x = 2 from a normal distribution with mean 0 and standard deviation 1. Sometimes you might want to talk about the distribution in terms of its actual parameters (mean 450, std dev 25), sometimes you might want to talk about it as if it were mean-centred (mean 0, std dev 25), and sometimes as if it were the standard Gaussian (mean 0, std dev 1).
This excellent answer goes over how to manually calculate the moments, which in tandem with the 'basic' descriptions I gave above, should make the formulas in the OpenCV docs for moments() a little more comfortable.
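For what it's worth, here is a minimal sketch of where these three quantities show up, assuming the Python cv2 bindings (the blob is just a made-up filled triangle):

import cv2
import numpy as np

# Draw a hypothetical blob: a filled triangle on a blank 8-bit image.
img = np.zeros((200, 200), dtype=np.uint8)
pts = np.array([[20, 180], [180, 180], [60, 30]], dtype=np.int32)
cv2.fillPoly(img, [pts.reshape((-1, 1, 2))], 255)

# moments() on a rasterized shape returns everything up to third order.
m = cv2.moments(img, binaryImage=True)
print(m["m03"])   # raw third-order moment (keeps location and size)
print(m["mu03"])  # central third-order moment (location removed)
print(m["nu03"])  # normalized central moment (location and size removed)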
I'm able to read a wav file and its values. I need to find the positions and values of peaks and pits. At first, I tried smoothing with the (array[i-1] + array[i] + array[i+1]) / 3 formula and then scanning the array for points where array[i-1] > array[i] while the direction was 'up' (a pit-style solution), but because of noise and other requirements of later calculations in the project, I'm trying to find a better approach. For the last couple of days I've been researching the FFT. As I understand it, the FFT translates the audio into a series of sines and cosines; after the FFT operation, the returned values are the coefficients a0, ak, and bk of a0 + Σ (ak * cos(k*x) + bk * sin(k*x)), as in this picture:
http://zone.ni.com/images/reference/en-XX/help/371361E-01/loc_eps_sigadd3freqcomp.gif
My question is: does the FFT help me find peaks and pits in audio? Does anybody have experience with this kind of problem?
It depends on exactly what you are trying to do, which you haven't really made clear. "Finding the peaks and pits" is one thing, but since there might be various reasons for doing it, there might be various methods. It sounds like you already tried the straightforward thing of looking for local maxima and minima. Here are some tips:
1. You do not need the FFT.
2. Audio data usually swings above and below zero (there are exceptions, including 8-bit wavs, which are unsigned, but these are exceptions), so you must be aware of positive and negative values. Generally, large positive and large negative values carry large amounts of energy, so you want to count them the same.
3. Due to #2, if you want to average, you might want to take the average of the absolute values or, more commonly, the average of the squares. Once you have the average of the squares, take its square root; this gives the RMS, which is related to the power of the signal, so you might use it if you are trying to indicate signal loudness or intensity, or to approximate an analog meter. The average of absolute values may be more robust against extreme values, but it is less commonly used.
4. Another approach is to simply look for the peak of the absolute value over some number of samples. This is commonly done when drawing waveforms and for digital "peak" meters. It makes less sense to look at the minimum of the absolute value.
5. Once you've done something like the above, you may want to compute the log of the value you've found in order to display the signal in dB, but make sure you use the right formula: 10 * log_10( amplitude ) is not it. Rule of thumb: when computing logs from amplitude you will usually see a 20, not a 10. If you want to compute dBFS (the amount of "headroom" before clipping, which is the standard measurement for digital meters), the formula is -20 * log_10( |amplitude| ), where amplitude is normalized to +/- 1. Watch out for amplitude = 0, which gives infinite headroom in dB.
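Putting tips 3-5 together, here is a minimal sketch, assuming the wav data has already been loaded into a float array normalized to +/- 1 (the 440 Hz test tone is just for illustration):

import numpy as np

def rms(samples):
    # Average of the squares, then the square root (tip 3).
    return np.sqrt(np.mean(samples ** 2))

def peak(samples):
    # Peak of the absolute value over the window (tip 4).
    return np.max(np.abs(samples))

def headroom_db(amplitude, eps=1e-12):
    # Headroom in dBFS (tip 5); eps avoids log(0) for silence.
    return -20 * np.log10(max(abs(amplitude), eps))

signal = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100.0)  # 1 s at 44.1 kHz
print(rms(signal), peak(signal), headroom_db(peak(signal)))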
If I understand you correctly, you just want to estimate the relative loudness/quietness of a digital audio sample at a given point.
For this estimation, you don't need to use the FFT. However, your method of averaging the signal does not produce the appropriate picture either.
The digital signal is the value of the audio wave at a given moment. You need to find the overall amplitude of the signal around that moment. You can roughly see it as the local maximum value over an interval around the moment of interest, so you can keep a moving max over the signal to get your amplitude estimation.
With 16-bit samples, the signal value ranges from -32768 to 32767, so the positive peak tops out at 32767. At a 44.1 kHz sample rate, you can find peaks and pits of around 0.01 s by taking the max value of the 441 samples around a given moment t.
max = 1;
for (i = 0; i < 441; i++)
    if (array[t * 44100 + i] > max)
        max = array[t * 44100 + i];
Then, to represent it on a 0-to-1 scale (not quite 0, because we used a minimum of 1):
amplitude = max / 32767;
Or you might represent it on a relative dB (logarithmic) scale (here you see why we used 1 as the minimum value):
dB = 20 * log10(amplitude);
All you need to do is take dy/dx, which you can approximate by scanning through the wave, subtracting the previous value from the current one, and looking at where the difference goes to zero or changes from positive to negative.
In this code I kept things really brief and unintelligent for the sake of brevity; of course you could handle cases where dy is zero better, find the 'centre' of a long flat peak, that kind of thing. But if all you need is basic peaks and troughs, this will find them.
lastY = 0;
bool goingup = true;
for( i = 0; i < wave.length; i++ ) {
    y = wave[i];
    dy = y - lastY;
    bool stillgoingup = (dy > 0);
    if( goingup != stillgoingup ) {
        // changed direction - note the value of i (place) and y (height);
        // up -> down is a peak, down -> up is a trough
        goingup = stillgoingup;
    }
    lastY = y;
}
Ok, so here is a problem analogous to my problem (I'll elaborate on the real problem below, but I think this analogy will be easier to understand).
I have a strange two-sided coin that only comes up heads (randomly) 1 in every 1,001 tosses (the remainder being tails). In other words, for every 1,000 tails I see, there will be 1 heads.
I have a peculiar disease where I only notice 1 in every 1,000 tails I see, but I notice every heads, so among the tosses I notice, heads appear to come up at a rate of 0.5. Of course, I'm aware of this disease and its effect, so I can compensate for it.
Someone now gives me a new coin, and I noticed that the rate of noticing heads is now 0.6. Given that my disease hasn't changed (I still only notice 1 in every 1,000 tails), how do I calculate the actual ratio of heads to tails that this new coin produces?
Ok, so what is the real problem? Well, I have a bunch of data consisting of input, and outputs which are 1s and 0s. I want to teach a supervised machine learning algorithm to predict the expected output (a float between 0 and 1) given an input. The problem is that the 1s are very rare, and this screws up the internal math because it becomes very susceptible to rounding errors - even with high-precision floating point math.
So, I normalize the data by randomly omitting most of the 0 training samples so that there appears to be a roughly equal ratio of 1s and 0s. Of course, this means that the machine learning algorithm's output no longer predicts a probability, i.e. instead of predicting 0.001 as it should, it now predicts something like 0.5.
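For reference, a minimal sketch of that normalization step (the names and the factor are illustrative; the data is assumed to be a list of (input, label) pairs):

import random

def downsample_negatives(samples, keep_one_in=1000):
    # Keep every positive (label 1) and only ~1 in keep_one_in negatives.
    return [(x, y) for x, y in samples
            if y == 1 or random.random() < 1.0 / keep_one_in]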
I need a way to convert the output of the machine learning algorithm back to a probability within the original training set.
Author's Note (2015-10-07): I later discovered that this technique is commonly known as "downsampling"
You are calculating the following
calculatedRatio = heads / (heads + tails / 1000)
and you need
realRatio = heads / (heads + tails)
Solving both equations for tails (taking heads = 1, since only the ratio matters) yields the following equations.
tails = 1000 / calculatedRatio - 1000
tails = 1 / realRatio - 1
Combining both yields the following.
1000 / calculatedRatio - 1000 = 1 / realRatio - 1
And finally solving for realRatio.
realRatio = 1 / (1000 / calculatedRatio - 999)
This seems to be correct: a calculatedRatio of 0.5 yields realRatio = 1/1001, and 0.6 yields 3/2003.
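The same correction, as a small sketch with the downsampling factor kept as a parameter (1000 reproduces the figures above):

def correct_probability(predicted, keep_one_in=1000):
    # Invert the effect of keeping only 1 in keep_one_in negative samples.
    return 1.0 / (keep_one_in / predicted - (keep_one_in - 1))

print(correct_probability(0.5))  # 0.000999... == 1/1001
print(correct_probability(0.6))  # 0.001497... == 3/2003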