I have two normally distributed samples and want to know how close or similar they are. I tried a few methods to measure the similarity, such as the z-score and the Bhattacharyya distance.
Bhattacharyya distance didn't work for me. It gave the same distance whenever the standard deviations of the two samples were the same; it didn't change when the mean changed.
Is there a method that takes the samples, or their means and standard deviations, and returns a similarity score or similarity rank?
I am not from a mathematics background, so please excuse any terminology mistakes and let me know if any clarification is required.
I assume you're not looking for a relationship between the two samples, where a correlation coefficient would be appropriate?
I've been investigating a similar question for my current data and am looking at the Mahalanobis distance and the Earth mover's distance.
I found this post from a different forum which gave me a few ideas
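For two univariate normal distributions, both the Bhattacharyya distance and the 2-Wasserstein distance (a relative of the Earth mover's distance) have closed forms that need only the means and standard deviations. Here is a minimal sketch of both; the function names are my own. Note that the closed-form Bhattacharyya distance does include a mean term, so if an implementation returns the same value when only the mean changes, it may be worth checking whether that term was dropped.

```python
import math


def bhattacharyya_normal(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    # Term that grows as the means move apart.
    mean_term = 0.25 * (mu1 - mu2) ** 2 / var_sum
    # Term that grows as the standard deviations differ (0 when they are equal).
    spread_term = 0.5 * math.log(var_sum / (2.0 * sigma1 * sigma2))
    return mean_term + spread_term


def wasserstein2_normal(mu1, sigma1, mu2, sigma2):
    """2-Wasserstein distance between two univariate normal distributions."""
    return math.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)


# Same standard deviation, different means: both distances are non-zero.
print(bhattacharyya_normal(0.0, 1.0, 2.0, 1.0))
print(wasserstein2_normal(0.0, 1.0, 2.0, 1.0))
```

Both functions return 0 for identical parameters and grow as the distributions move apart, so smaller values mean more similar samples.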
I couldn't find a precise and concise answer. I'm not particularly interested in the different machine learning evaluation methods themselves; I just want to know why it's important to have more than one.
Each metric gives a different insight and evaluates your model differently.
Let's take an example for binary classification:
Accuracy tells you what percentage of your predictions are correct. But what if you also want to know how many of the actual 1's you got wrong (i.e. you predicted 0 where the label should be 1)? For this, you calculate the recall score.
So you get the idea: you may want good accuracy but also good recall (a real-world example might be spam detection), so you look at both metrics and choose wisely.
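To make the difference concrete, here is a small sketch with made-up labels (the data and function names are only for illustration). A model that always predicts 0 on an imbalanced dataset looks good on accuracy and terrible on recall.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    true_pos = sum(t == positive and p == positive
                   for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Made-up example: 9 negatives, 1 positive, and a model that always says 0.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(accuracy(y_true, y_pred))  # 0.9
print(recall(y_true, y_pred))    # 0.0
```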
I've been reading up on how NEAT (NeuroEvolution of Augmenting Topologies) works and I've got the main idea of it, but one thing that's been bothering me is how you split the different networks into species. I've gone through the algorithm, but it doesn't make a lot of sense to me, and the paper I read doesn't explain it very well either. If someone could explain what each component is and what it's doing, that would be great. Thanks.
The 2 equations are:
The original paper
Speciation in NEAT is similar to fitness sharing used by other evolutionary algorithms. The idea is to penalize similar solutions, creating a pressure toward a more diverse population.
The delta term is a measure of distance between two solutions. The measure of distance used here is specialized for the variable-length genomes used by NEAT. Small delta values indicate more similar solutions.
The sharing function implemented in NEAT returns 0 if the distance between two solutions is greater than a given threshold and 1 otherwise. Each solution is compared to every other solution in the candidate population, and its fitness is divided by the sum of the resulting sharing function values. If a solution is similar to several other solutions in the population, its modified fitness will be significantly reduced.
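As a rough sketch of how the two pieces fit together, assume each genome is a dict mapping innovation number to connection weight (a simplification of my own; real NEAT genomes carry more information). The delta term counts excess genes E, disjoint genes D and the mean weight difference W of matching genes, and the sharing step divides each fitness by the number of genomes within the threshold. The coefficient and threshold defaults below are illustrative settings, not required values.

```python
def compatibility_distance(genome_a, genome_b, c1=1.0, c2=1.0, c3=0.4):
    """delta = c1*E/N + c2*D/N + c3*W for two non-empty genomes."""
    keys_a, keys_b = set(genome_a), set(genome_b)
    matching = keys_a & keys_b
    non_matching = keys_a ^ keys_b
    # Excess genes lie beyond the other genome's highest innovation number;
    # the remaining non-matching genes are disjoint.
    cutoff = min(max(keys_a), max(keys_b))
    excess = sum(1 for k in non_matching if k > cutoff)
    disjoint = len(non_matching) - excess
    n = max(len(keys_a), len(keys_b))  # size of the larger genome
    if matching:
        w_bar = sum(abs(genome_a[k] - genome_b[k]) for k in matching) / len(matching)
    else:
        w_bar = 0.0
    return c1 * excess / n + c2 * disjoint / n + c3 * w_bar


def shared_fitness(fitnesses, genomes, threshold=3.0):
    """Divide each fitness by the number of genomes within the threshold."""
    adjusted = []
    for fitness, g_i in zip(fitnesses, genomes):
        # sh(delta) is 1 when delta is within the threshold, 0 otherwise.
        # A genome always counts itself (delta = 0), so niche >= 1.
        niche = sum(1 for g_j in genomes
                    if compatibility_distance(g_i, g_j) <= threshold)
        adjusted.append(fitness / niche)
    return adjusted


a = {1: 0.5, 2: -0.3, 3: 0.8}
b = {1: 0.1, 2: -0.3, 4: 0.2, 5: 0.9}
print(compatibility_distance(a, b))  # about 0.83: E=2, D=1, N=4, W=0.2
print(shared_fitness([10.0, 10.0, 10.0], [a, a, b], threshold=0.5))
```

In the last line the two copies of `a` share a niche, so each keeps half of its raw fitness, while `b` sits alone and keeps all of it.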
I am looking for an algorithm that can solve this problem.
The problem:
I have the following set of points:
I want to group the points that represent a line (within some epsilon) into one group.
So, the optimal output will be something like:
Some notes:
Each point belongs to one and only one line.
If a point could belong to two lines, it should belong to the strongest.
A line is considered stronger than another when more points belong to it.
The algorithm should not have to cover all points, because some may be outliers.
The space contains many outliers; they may make up 50% of the total.
Performance is critical; real-time operation is a must.
The solutions I have found so far:
1) Dealing with it as a clustering problem:
The main drawback of this method is that there is no direct distance metric between points. The metric is on the cluster itself (how linear it is). So I cannot use traditional clustering methods and would have to (as far as I can tell) use something like clustering with a genetic algorithm, where the evaluation is done on the whole cluster rather than between two points. I also do not want to use something like a genetic algorithm, since I am aiming for a real-time solution.
2) Accumulating pairs and then clustering:
Since it is hard to cluster the points directly, I thought of extracting pairs of points and then trying to cluster them with others. That gives me a distance between two pairs that can represent linearity (two pairs are really 4 points).
The drawback of this method is how to choose these pairs. If I rely on the Euclidean distance between the points, it may not be accurate, because two points may be very near each other yet far from forming a line with the others.
I appreciate any solution, suggestion, clue or note. Please ask if anything needs clarification.
P.S. You may use any ready-made OpenCV function in your solution.
As Micka advised, I used Sequential-RANSAC to solve my problem. The results were fantastic and exactly what I wanted.
The idea is simple:
1) Apply RANSAC with a fit-line model to the points.
2) Delete all points that are inliers of the line RANSAC returns.
3) While 2 or more points remain, go to 1.
I have implemented my own fit-line RANSAC, but unfortunately I cannot share the code because it belongs to the company I work for. However, there is an excellent fit-line RANSAC here on SO, implemented by Srinath Sridhar. The post is: RANSAC-like implementation for arbitrary 2D sets.
It is easy to build a Sequential-RANSAC from the 3 simple steps above.
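For readers who want something to start from, here is a minimal pure-Python sketch of those three steps. It is not the implementation I used and not the one from the linked post. The parameter names (`eps`, `iterations`, `min_inliers`) are my own, and the `min_inliers` stopping rule is an added assumption so that leftover outliers are not forced into two-point "lines".

```python
import math
import random


def fit_line_ransac(points, eps, iterations, rng):
    """Return the inliers of the best line found by sampling point pairs."""
    best = []
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        # Line through both points in the form a*x + b*y + c = 0.
        a, b = y2 - y1, x1 - x2
        norm = math.hypot(a, b)
        if norm == 0:  # the two sampled points coincide
            continue
        c = -(a * x1 + b * y1)
        inliers = [(x, y) for x, y in points
                   if abs(a * x + b * y + c) / norm <= eps]
        if len(inliers) > len(best):
            best = inliers
    return best


def sequential_ransac(points, eps=0.5, iterations=200, min_inliers=5, seed=0):
    """Repeatedly extract the strongest line and remove its inliers."""
    rng = random.Random(seed)
    remaining = list(points)
    lines = []
    while len(remaining) >= 2:
        inliers = fit_line_ransac(remaining, eps, iterations, rng)
        if len(inliers) < min_inliers:  # assumed rule: the rest are outliers
            break
        lines.append(inliers)
        inlier_set = set(inliers)
        remaining = [p for p in remaining if p not in inlier_set]
    return lines, remaining


# Made-up data: two lines plus three outliers.
pts = ([(i, i) for i in range(10)] + [(i, 50) for i in range(8)]
       + [(3, 20), (8, 35), (1, 41)])
lines, outliers = sequential_ransac(pts)
print([len(line) for line in lines], outliers)
```

Because the strongest line is extracted first, a point that could belong to two lines ends up with the one that has more supporting points, which matches the requirement in the question.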
Here are some results:
This question has troubled me for two days. I am comparing the similarity of two time series. The approach I know so far is to calculate the distance between them, and I chose Dynamic Time Warping (DTW) to compute it. The result is a warping path together with the DTW distance. My question is: how can I judge whether the two series are similar based on this distance? Is there a defined threshold for this problem?
My intuition tells me that if they are identical, the distance between them would be 0.
Can anyone help me with this question?
Why not just use some simple statistical methods like finding the correlation between the two sets of data? You could do this in Excel quite easily - see this tutorial http://www.excel-easy.com/examples/correlation.html
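If you would rather do this in code than in Excel, the Pearson correlation is only a few lines. This is a minimal sketch that assumes both series have the same length and are already aligned sample by sample; the function name is my own.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))
```

Keep in mind that correlation ignores differences in scale and offset, so two series can correlate perfectly and still be far apart in value.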
Using a distance measure on time series is always risky, and yes, you need to define a threshold. The value will depend on your data (it is a trial-and-error approach).
Further, you can also refer to the paper "A review on time series data mining".
link:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.454.773&rep=rep1&type=pdf
In the paper, you can find various approaches for measuring similarity between time series.
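To see why the threshold has to come from the data, here is a minimal sketch of the classic DTW recurrence with absolute difference as the local cost (one common choice, assumed here). The distance is 0 for identical series, but its scale otherwise depends on the length and amplitude of the inputs.

```python
import math


def dtw_distance(a, b):
    """DTW distance between two numeric sequences, using |x - y| as local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


print(dtw_distance([1, 2, 3], [1, 2, 3]))     # 0.0, identical series
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0, same shape but stretched
print(dtw_distance([0, 0, 0], [1, 1, 1]))     # 3.0, constant offset of 1
```

One data-driven way to pick a threshold, offered only as a suggestion, is to compute this distance for pairs you already know to be similar and for pairs you know to be different, then choose a cutoff between the two groups.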
I'm pretty new to the field of machine learning (even though I find it extremely interesting), and I wanted to start a small project where I'd be able to apply some of what I've learned.
Let's say I have a dataset of persons, where each person has N different attributes (only discrete values, each attribute can be pretty much anything).
I want to find clusters of people who exhibit the same behavior, i.e. who have a similar pattern in their attributes ("look-alikes").
How would you go about this? Any thoughts to get me started?
I was thinking about using PCA: since we can have an arbitrary number of dimensions, it could be useful for reducing them. K-Means? I'm not sure in this case. Any ideas on what would be best suited to this situation?
I do know how to code all those algorithms, but I'm truly missing some real world experience to know what to apply in which case.
K-means using the n-dimensional attribute vectors is a reasonable way to get started. You may want to play with your distance metric to see how it affects the results.
The first step for pretty much any clustering algorithm is to find a suitable distance function. Many algorithms, such as DBSCAN, can then be parameterized with this distance function (at least in a decent implementation; some, of course, only support Euclidean distance).
So start with considering how to measure object similarity!
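For purely discrete attributes, one simple starting point is the fraction of attributes on which two people differ (a normalized Hamming distance). This is only a sketch of one possible choice, with a made-up record layout; whether it fits depends on what the attributes mean.

```python
def mismatch_distance(person_a, person_b):
    """Fraction of attributes on which two records differ (0 = identical)."""
    if len(person_a) != len(person_b):
        raise ValueError("records must have the same number of attributes")
    differing = sum(x != y for x, y in zip(person_a, person_b))
    return differing / len(person_a)


# Made-up records: (favourite colour, pet, number of siblings).
alice = ("red", "cat", 3)
bob = ("red", "dog", 3)
carol = ("blue", "dog", 0)

print(mismatch_distance(alice, bob))    # differ on 1 of 3 attributes
print(mismatch_distance(alice, carol))  # differ on all 3 attributes
```

A function like this can be handed to any clustering implementation that accepts a custom distance or a precomputed distance matrix.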
In my opinion you should also try the expectation-maximization algorithm (also called EM). On the other hand, you must be careful when using PCA, because it may discard dimensions that are relevant to the clustering.