Calculate total distance between multiple pairwise distributions/histograms - histogram

I am not sure about the terminology I should use for my problem, so I will give an example.
I have 2 sets of measurements (6 empirical distributions per set = D1-6) that describe 2 different states of the same system (BLUE & RED). These distributions can be multimodal, skewed, undersampled, and strange in some other unpredictable ways.
BLUE is my reference and I want to make RED distributed as closely as possible to BLUE, for all pairwise distributions. For this, I will play with parameters of my RED system and monitor the RED set of measurements D1-6 trying to make it overlap BLUE perfectly.
I know that I can use Jensen-Shannon or Bhattacharyya distances to evaluate the distance between 2 distributions (i.e. RED-D1 and BLUE-D1, for example). However, I do not know if there exist other metrics that could be applied here to get a global distance between all distributions (i.e. quantify the global mismatch between 2 sets of pairwise distributions). Is that the case ?
I am thinking about building an empirical scoring function that would use all the pairwise Jensen-Shannon distances, but I have no better ideas yet. I believe I can NOT just sum all the JS distances because I would get similar scores in these 2 hypothetical, different cases:
D1-6 are distributed as in my image
RED-D1-5 are a much better fit to BLUE-D1-5, BUT RED-D6 is shifted compared to BLUE-D6
And that would be wrong because I would have missed one important feature of my system. Given these 2 cases, it is better to have D1-6 distributed as in my image (solution 1).
The pairwise match between each distribution is equally important and should be equally weighted (i.e. the match between BLUE-D1 and RED-D1 is as important as the match between BLUE-D2 and RED-D2, etc).
D1-3 has a given range DOM1 of [0, 5] and D4-6 has another range DOM2 of [50, 800]. Diamonds represent the weighted means of BLUE and RED distributions.
Thank you very much for your help!

I ended up using a sum of all pairwise Earth Mover's Distance (EMD, https://en.wikipedia.org/wiki/Earth_mover%27s_distance, also known as Wasserstein metric) as a global metric of distance between all pairwise distributions. This describes the difference or similarity between 2 states of my system in an appropriate way.
EMD is implemented in python in package 'pyemd' or using scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html.

Related

How to squish a continuous cosine-theta score to a discrete (0/1) output?

I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar then the words should contain quite some overlap. However, a cosine theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I should find the optimal parameters to give to such functions and I do not know if these functions are satisfactory solutions. I was thinking that I have the cosine theta score, I can calculate the percentage of overlap between two sentences two (e.g. the amount of overlapping words divided by the amount of words in the article) and maybe some more interesting things. Then with the data, I could maybe write a function (what type of function I do not know and is part of the question!), after which I can minimize the error via the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs with labels (0/1) in order to train a network. Is this worth the effort?
# Count words of two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)
# Calculate the intersection of the words in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Calculate some sort of ratio between the overlap and the
# article length (since 1 overlapping word on 2 words is more important
# then 4 overlapping words on articles of 492 words).
p = min(len(v1), len(v2)) / len(v3)
numerator = sum([v1[w] * v2[w] for w in v3])
w1 = sum([v1[w]**2 for w in v1.keys()])
w2 = sum([v2[w]**2 for w in v2.keys()])
denominator = math.sqrt(w1) * math.sqrt(w2)
# Calculate the cosine similarity
if not denominator:
return 0.0
else:
return (float(numerator) / denominator)
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. It is up to you to choose how the overlap affects whether or not two strings are "matching" unless you have a labelled data set. If you have a labelled data set (I.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or SVM due to the potentially high dimensional, categorical nature of your problem.
Even the optimisation, however, is a subjective measure. For example, in theory let's pretend you have a model which out of 100 samples only predicts 1 answer (Giving 99 unknowns). Technically if that one answer is correct, that is a model with 100% accuracy, but which has a very low recall. Generally in machine learning you will find a trade off between recall and accuracy.
Some people like to go for certain metrics which combine the two (The most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about accuracy - I would only want to target consumers who are likely to buy my product. If however, we are looking to test for a deadly disease or markers for bank fraud, then it's feasible for that test to be accurate only 10% of the time - if its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is just to define some cut off value which you believe indicates a good match. This is would then be more analogous to a binary clustering problem, and you could use some more abstract measure such as distance to a centroid to test which cluster (Either the "related" or "unrelated" cluster) the point belongs to. Note however that here your features feel like they would be incredibly hard to define.

Time series distance metric

In order to clusterize a set of time series I'm looking for a smart distance metric.
I've tried some well known metric but no one fits to my case.
ex: Let's assume that my cluster algorithm extracts this three centroids [s1, s2, s3]:
I want to put this new example [sx] in the most similar cluster:
The most similar centroids is the second one, so I need to find a distance function d that gives me d(sx, s2) < d(sx, s1) and d(sx, s2) < d(sx, s3)
edit
Here the results with metrics [cosine, euclidean, minkowski, dynamic type warping]
]3
edit 2
User Pietro P suggested to apply the distances on the cumulated version of the time series
The solution works, here the plots and the metrics:
nice question! using any standard distance of R^n (euclidean, manhattan or generically minkowski) over those time series cannot achieve the result you want, since those metrics are independent of the permutations of the coordinate of R^n (while time is strictly ordered and it is the phenomenon you want to capture).
A simple trick, that can do what you ask is using the cumulated version of the time series (sum values over time as time increases) and then apply a standard metric. Using the Manhattan metric, you would get as a distance between two time series the area between their cumulated versions.
Another approach would be by utilizing DTW which is an algorithm to compute the similarity between two temporal sequences. Full disclosure; I coded a Python package for this purpose called trendypy, you can download via pip (pip install trendypy). Here is a demo on how to utilize the package. You're just just basically computing the total min distance for different combinations to set the cluster centers.
what about using standard Pearson correlation coefficient? then you can assign the new point to the cluster with the highest coefficient.
correlation = scipy.stats.pearsonr(<new time series>, <centroid>)
Pietro P's answer is just a special case of applying a convolution to your time series.
If I gave the kernel:
[1,1,...,1,1,1,0,0,0,0,...0,0]
I would get a cumulative series .
Adding a convolution works because you're giving each data point information about it's neighbours - it's now order dependent.
It might be interesting to try with a guassian convolution or other kernels.

Non-linear interaction terms in Stata

I have a continuous dependent variable polity_diff and a continuous primary independent variable nb_eq. I have hypothesized that the effect of nb_eq will vary with different levels of the continuous variable gini_round in a non-linear manner: The effect of nb_eq will be greatest for mid-range values of gini_round and close to 0 for both low and high levels of gini_round (functional shape as a second-order polynomial).
My question is: How this is modelled in Stata?
To this point I've tried with a categorized version of gini_round which allows me to compare the different groups, but obviously this doesn't use data to its fullest. I can't get my head around the inclusion of a single interaction term which allows me to test my hypothesis. My best bet so far is something along the lines of the following (which is simplified by excluding some if-arguments etc.):
xtreg polity_diff c.nb_eq##c.gini_round_squared, fe vce(cluster countryno),
but I have close to 0 confidence that this is even nearly right.
Here's how I might do it:
sysuse auto, clear
reg price c.weight#(c.mpg##c.mpg) i.foreign
margins, dydx(weight) at(mpg = (10(10)40))
marginsplot
margins, dydx(weight) at(mpg=(10(10)40)) contrast(atcontrast(ar(2(1)4)._at) wald)
We interact weight with a second degree polynomial of mpg. The first margins calculates the average marginal effect of weight at different values of mpg. The graph looks like what you describe. The second margins compares the slopes at adjacent values of mpg and does a joint test that they are all equal.
I would probably give weight its own effect as well (two octothorpes rather than one), but the graph does not come out like your example:
reg price c.weight##(c.mpg##c.mpg) i.foreign

Which of the parameters in LibSVM is the slack variable?

I am a bit confused about the namings in the SVM. I am using this library LibSVM. There are so many parameters that can be set. Does anyone know which of these is the slack variable?
thx
The "slack variable" is C in c-svm and nu in nu-SVM. These both serve the same function in their respective formulations - controlling the tradeoff between a wide margin and classifier error. In the case of C, one generally test it in orders of magnitude, say 10^-4, 10^-3, 10^-2,... to 1, 5 or so. nu is a number between 0 and 1, generally from .1 to .8, which controls the ratio of support vectors to data points. When nu is .1, the margin is small, the number of support vectors will be a small percentage of the number of data points. When nu is .8, the margin is very large and most of the points will fall in the margin.
The other things to consider are your choice of kernel (linear, RBF, sigmoid, polynomial) and the parameters for the chosen kernel. Generally one has to do a lot of experimenting to find the best combination of parameters. However, be careful of over-fitting to your dataset.
Burges wrote a great tutorial: A Tutorial on Support Vector Machines for Pattern
Recognition
But if you mostly just want to know how to USE it and less about how it works, read "A Practical Guide to Support Vector Classication" by Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin (authors of libsvm)
First decide which type of SVM are u intending to use: C-SVC, nu-SVC , epsilon-SVR or nu-SVR. In my opinion u need to vary C and gamma most of the time... the rest are usually fixed..

Recommended anomaly detection technique for simple, one-dimensional scenario?

I have a scenario where I have several thousand instances of data. The data itself is represented as a single integer value. I want to be able to detect when an instance is an extreme outlier.
For example, with the following example data:
a = 10
b = 14
c = 25
d = 467
e = 12
d is clearly an anomaly, and I would want to perform a specific action based on this.
I was tempted to just try an use my knowledge of the particular domain to detect anomalies. For instance, figure out a distance from the mean value that is useful, and check for that, based on heuristics. However, I think it's probably better if I investigate more general, robust anomaly detection techniques, which have some theory behind them.
Since my working knowledge of mathematics is limited, I'm hoping to find a technique which is simple, such as using standard deviation. Hopefully the single-dimensioned nature of the data will make this quite a common problem, but if more information for the scenario is required please leave a comment and I will give more info.
Edit: thought I'd add more information about the data and what I've tried in case it makes one answer more correct than another.
The values are all positive and non-zero. I expect that the values will form a normal distribution. This expectation is based on an intuition of the domain rather than through analysis, if this is not a bad thing to assume, please let me know. In terms of clustering, unless there's also standard algorithms to choose a k-value, I would find it hard to provide this value to a k-Means algorithm.
The action I want to take for an outlier/anomaly is to present it to the user, and recommend that the data point is basically removed from the data set (I won't get in to how they would do that, but it makes sense for my domain), thus it will not be used as input to another function.
So far I have tried three-sigma, and the IQR outlier test on my limited data set. IQR flags values which are not extreme enough, three-sigma points out instances which better fit with my intuition of the domain.
Information on algorithms, techniques or links to resources to learn about this specific scenario are valid and welcome answers.
What is a recommended anomaly detection technique for simple, one-dimensional data?
Check out the three-sigma rule:
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
An alternative method is the IQR outlier test:
Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
IF (x < Q25 - 1.5*IQR) OR (Q75 + 1.5*IQR < x) THEN x is a mild outlier
IF (x < Q25 - 3.0*IQR) OR (Q75 + 3.0*IQR < x) THEN x is an extreme outlier
this test is usually employed by Box plots (indicated by the whiskers):
EDIT:
For your case (simple 1D univariate data), I think my first answer is well suited.
That however isn't applicable to multivariate data.
#smaclell suggested using K-means to find the outliers. Beside the fact that it is mainly a clustering algorithm (not really an outlier detection technique), the problem with k-means is that it requires knowing in advance a good value for the number of clusters K.
A better suited technique is the DBSCAN: a density-based clustering algorithm. Basically it grows regions with sufficiently high density into clusters which will be maximal set of density-connected points.
DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighbor points within distance epsilon of the starting point.
If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.
If the number of neighbors is less than minPoints, the point is marked as noise.
If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points until they are depleted.
Finally the set of all points marked as noise are considered outliers.
There are a variety of clustering techniques you could use to try to identify central tendencies within your data. One such algorithm we used heavily in my pattern recognition course was K-Means. This would allow you to identify whether there are more than one related sets of data, such as a bimodal distribution. This does require you having some knowledge of how many clusters to expect but is fairly efficient and easy to implement.
After you have the means you could then try to find out if any point is far from any of the means. You can define 'far' however you want but I would recommend the suggestions by #Amro as a good starting point.
For a more in-depth discussion of clustering algorithms refer to the wikipedia entry on clustering.
This is an old topic but still it lacks some information.
Evidently, this can be seen as a case of univariate outlier detection. The approaches presented above have several pros and cons. Here are some weak spots:
Detection of outliers with the mean and sigma has the obvious disadvantage of dependence of mean and sigma on the outliers themselves.
The case of the small sample limit (see question for example) is not adequately covered by, 3 sigma, K-Means, IQR etc.
And I could go on... However the statistical literature offers a simple metric: the median absolute deviation. (Medians are insensitive to outliers)
Details can be found here: https://www.sciencedirect.com/book/9780128047330/introduction-to-robust-estimation-and-hypothesis-testing
I think this problem can be solved in a few lines of python code like this:
import numpy as np
import scipy.stats as sts
x = np.array([10, 14, 25, 467, 12]) # your values
np.abs(x - np.median(x))/(sts.median_abs_deviation(x)/0.6745) #MAD criterion
Subsequently you reject values above a certain threshold (97.5 percentile of the distribution of data), in case of an assumed normal distribution the threshold is 2.24. Here it translates to:
array([ 0.6745 , 0. , 1.854875, 76.387125, 0.33725 ])
or the 467 entry being rejected.
Of course, one could argue, that the MAD (as presented) also assumes a normal dist. Therefore, why is it that argument 2 above (small sample) does not apply here? The answer is that MAD has a very high breakdown point. It is easy to choose different threshold points from different distributions and come to the same conclusion: 467 is the outlier.
Both three-sigma rule and IQR test are often used, and there are a couple of simple algorithms to detect anomalies.
The three-sigma rule is correct
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
The IQR test should be:
Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
If x > Q75 + 1.5 * IQR or x < Q25 - 1.5 * IQR THEN x is a mild outlier
If x > Q75 + 3.0 * IQR or x < Q25 – 3.0 * IQR THEN x is a extreme outlier

Resources