Inequality measures for comparing two distributions

I am looking for an inequality measure to compare the inequality of two distributions. The random variable is categorical, and I am thinking of using the Gini index and skewness. Here are my two questions:
I read in a paper that skewness is a shape measure and Gini is not. Does Gini tell us anything about the shape of the distribution? What is the benefit of each of these measures in that sense?
I want to compare the two distributions in terms of inequality. For example, if we have the distributions of two societies, can we use Gini and skewness to compare them even though the numbers of classes (categories) in the two societies are far apart, e.g. one has 10 classes and the other has 40? Is skewness bounded? Can it be used for comparison?

Related

How to select features for clustering?

I had time-series data, which I have aggregated into 3 weeks and transposed to features.
Now I have features: A_week1, B_week1, C_week1, A_week2, B_week2, C_week2, and so on.
Some of the features are discrete, some are continuous.
I am thinking of applying K-Means or DBSCAN.
How should I approach the feature selection in such situation?
Should I normalise the features? Should I introduce some new ones, that would somehow link periods together?
Since K-means and DBSCAN are unsupervised learning algorithms, feature selection for them is tied to a grid search. You can evaluate candidate feature sets with internal measures such as the Davies–Bouldin index or the Silhouette coefficient, among others. If you're using Python, you can use Exhaustive Grid Search from the scikit library to run the search.
Formalize your problem, don't just hack some code.
K-means minimizes the sum of squares. If the features have different scales, they get different influence on the optimization. Therefore, you need to choose the weights (scaling factors) of each variable carefully to balance their importance the way you want (and note that a 2x scaling factor does not make the variable twice as important, because differences enter the objective squared).
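To make the remark about scaling factors concrete, here is a minimal sketch of how a per-feature weight enters the squared distance that K-means sums up. The function name `squared_distance` and the sample points are made up for illustration.

```python
def squared_distance(x, y, weights):
    """Squared Euclidean distance after multiplying each feature by its weight."""
    return sum((w * (a - b)) ** 2 for a, b, w in zip(x, y, weights))


# Two hypothetical points with two features each.
x = (1.0, 10.0)
y = (2.0, 12.0)

# Unit weights: the features contribute 1.0 and 4.0 to the squared distance.
base = squared_distance(x, y, (1.0, 1.0))

# Doubling the scale of the first feature multiplies its contribution by 4, not 2.
scaled = squared_distance(x, y, (2.0, 1.0))
```

With these numbers the first feature's share goes from 1 out of 5 to 4 out of 8, which is why weights have to be chosen with the squaring in mind.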
For DBSCAN, the distance is only a binary decision: close enough, or not. If you use the GDBSCAN version, this is easier to understand than with distances. But with mixed variables, I would suggest using the maximum norm. Two objects are then close if they differ in each variable by at most "eps". You can set eps=1, and scale your variables such that 1 is a "too big" difference. For example in discrete variables, you may want to tolerate one or two discrete steps, but not three.
Logically, it's easy to see that the maximum distance threshold decomposes into a conjunction of one-variable clauses:
maxdistance(x,y) <= eps
<=>
forall_i |x_i-y_i| <= eps
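The decomposition above can be sketched in a few lines. This is only an illustration: the helper names and the `tolerances` list (here read as the largest per-variable difference still counted as close) are assumptions, not part of DBSCAN or GDBSCAN.

```python
def scale(point, tolerances):
    """Divide each variable by its tolerance so that eps=1 fits every variable."""
    return [value / tol for value, tol in zip(point, tolerances)]


def max_distance(x, y):
    """Maximum norm (Chebyshev distance) between two points."""
    return max(abs(a - b) for a, b in zip(x, y))


def is_neighbor(x, y, eps=1.0):
    """Binary closeness decision of the kind DBSCAN needs."""
    return max_distance(x, y) <= eps


def is_neighbor_by_clauses(x, y, eps=1.0):
    """Same decision written as a conjunction of one-variable clauses."""
    return all(abs(a - b) <= eps for a, b in zip(x, y))


# Hypothetical tolerances: 2 steps on a discrete count, 0.5 on a continuous value.
tolerances = [2, 0.5]
a = scale([3, 1.2], tolerances)
b = scale([5, 1.6], tolerances)   # 2 steps and 0.4 away from a: close
c = scale([6, 1.2], tolerances)   # 3 steps away from a: not close
```

Here `is_neighbor(a, b)` is true and `is_neighbor(a, c)` is false, and the clause form always gives the same answer as the maximum norm.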

Feature extraction for multiple sub-features

I would like to conduct some feature extraction (or clustering) on a dataset containing sub-features.
For example, dataset is like below. The goal is to classify the type of robot using the data.
Samples : 100 robot samples [Robot 1, Robot 2, ..., Robot 100]
Classes : 2 types [Type A, Type B]
Variables : 6 parts, and 3 sub-features for each part (total 18 variables)
[Part1_weight, Part1_size, Part1_strength, ..., Part6_size, Part6_strength, Part6_weight]
I want to conduct feature extraction with [weight, size, strength], and use the extracted feature as a representative value for the part.
In short, my aim is to reduce the features to 6 - [Part1_total, Part2_total, ..., Part6_total] - and then classify the type of robot with those 6 features. So the problem to solve is how to build a combined feature from 'weight', 'size', and 'strength'.
First I thought of applying PCA (Principal Component Analysis), because it is one of the most popular feature extraction algorithms. But it considers all 18 features separately, so 'Part1_weight' can be considered more important than 'Part2_weight'. What I need to know is the importance of 'weights', 'sizes', and 'strengths' among samples, so PCA does not seem to be applicable.
Is there a recommended way to solve this problem?
If you want to have exactly one feature per part, I see no other way than performing the feature reduction part-wise. However, there might be better choices than simple PCA. For example, if the parts are mostly solid, their weight is likely to correlate with the third power of the size, so you could take the cube root of the weight or the cube of the size before performing the PCA. Alternatively, you can take the logarithm of both values, which again results in a linear dependency.
Of course, there are many more fancy transformations you could use. In statistics, the Box-Cox Transformation is used to achieve a normal-looking distribution of the data.
You should also consider normalising the transformed data before performing the PCA, i.e. subtracting the mean and dividing by the standard deviation of each variable. This removes the influence of the units of measurement, so it won't matter whether you measure weight in kg, atomic units, or Sun masses.
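As a minimal sketch of the two steps just described (log transform, then standardisation), assuming made-up measurements where weight is exactly the cube of size:

```python
import math
import statistics


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical solid parts: weight grows with the third power of size.
sizes_m = [1.0, 2.0, 3.0, 4.0]
weights_kg = [s ** 3 for s in sizes_m]
weights_g = [1000.0 * w for w in weights_kg]  # same weights in another unit

# After the log transform the relationship is linear: log(weight) = 3 * log(size).
z_size = standardize([math.log(s) for s in sizes_m])
z_weight_kg = standardize([math.log(w) for w in weights_kg])
z_weight_g = standardize([math.log(w) for w in weights_g])
```

In this toy case the three standardised columns coincide up to rounding: the cubic relationship has become linear, and the change of unit has no effect. Whether to use the population or the sample standard deviation is a free choice here.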
If the part number makes parts different from one another (e.g. Part1 is different from Part2 even if their size, weight and strength are identical), you can do PCA once for each part, using only that part's size, weight and strength as the variables.
Alternatively, if the order of the parts doesn't matter, you can do a single PCA over all (size, weight, strength) triples, without distinguishing them by part number.
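A rough sketch of the per-part variant, assuming each part is given as a list of (weight, size, strength) rows, one row per robot. To stay dependency-free it finds the first principal component by power iteration rather than a full eigendecomposition; the function names and the sample data are illustrative only.

```python
import statistics


def standardize_columns(rows):
    """Z-score every column (assumes no column is constant)."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def first_component(rows, iterations=200):
    """Leading eigenvector of the covariance matrix via power iteration."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]
    v = [1.0] * d  # may fail if this start is orthogonal to the leading direction
    for _ in range(iterations):
        v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(c * c for c in v) ** 0.5
        v = [c / norm for c in v]
    return v


def part_scores(rows):
    """One combined value per robot for a single part."""
    z = standardize_columns(rows)
    v = first_component(z)
    return [sum(a * b for a, b in zip(row, v)) for row in z]


# Hypothetical Part1 measurements for four robots: (weight, size, strength).
part1 = [(1.0, 2.0, 3.0), (2.0, 4.0, 6.0), (3.0, 6.0, 9.0), (4.0, 8.0, 12.0)]
part1_total = part_scores(part1)
```

Calling `part_scores` once per part yields the six `PartN_total` columns. For the pooled variant, the triples of all parts would be concatenated into one list before calling `first_component`.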

Set Similarity measure with known item similarities and abundances

I'm looking for a similarity measure (like the Jaccard Index) but I want to use known similarities between objects within the set, and weigh the connections by the item abundances. These known similarities are scores between 0 and 1, 1 indicating an exact match.
For example, consider two sets:
SET1 {A,B,C} and SET2 {A',B',C'}
I know that
{A,A'}, {B,B'}, {C,C'} each have an item similarity of 0.9. Hence, I would expect the similarity of SET1 and SET2 to be relatively high.
Another example would be: consider two sets SET1 {A,B,C} and SET2 {A,B',C',D,E,F,.....,Z}. Although the matches between the first three items are higher than in the first example, this score should likely be lower because of the size difference (as in Jaccard).
One more issue here is how to use abundances as weights, but I have no idea how to solve this.
In general, I need a normalized set similarity measure that takes into account both item similarity and abundance.
Correct me if I'm wrong, but I guess you need clustering error as the similarity measure. It is the proportion of points which are clustered differently in A' and A after an optimal matching of clusters. In other words, it is the scaled sum of the non-diagonal elements of the confusion matrix, minimized over all possible permutations of rows and columns. It uses the Hungarian algorithm to avoid high computational cost, and it penalizes a different number of elements in the sets.

Heterogeneous Value Difference Metric (HVDM)

I would like to ask if someone knows of some worked examples of the Heterogeneous Value Difference Metric (HVDM) distance. Also, is there an implementation of this metric in R?
I would be grateful for any useful resource that would let me compute this distance manually.
This is a very involved subject, which is no doubt why you can't find examples. What worries me about your question is that it is very general, and often a given implementation or use case of this sort of machine learning / data mining may need considerable algorithm tuning to make it effective, because the nature of the data will to some extent dictate how your HVDM is best calculated.
Single-dimensional Euclidean distance can obviously be calculated by D = |a - b|. 2D distance is Pythagoras, so D = SQRT((a1-b1)^2+(a2-b2)^2), and when you have N-dimensional data D = SQRT((a1-b1)^2+(a2-b2)^2+....+(aN-bN)^2).
So, if you are comparing 2 data sets, a and b, with N numerical values, you can now calculate a distance between them...
Note that the square root is usually optional for practical purposes: it changes the magnitude of the distance but not the ordering, so nearest-neighbour comparisons come out the same. Whether to keep it is a tuning/performance/optimisation issue, and I'm not sure, but some use cases might be better with it and some without.
Since you say your dataset contains nominal values, this makes it more interesting, as Euclidean distance is meaningless for nominal values. How you reconcile that depends on the data. If you can assign numerical data to the nominals, that's good, because you can then calculate a Euclidean distance again (e.g. banana = {2,4,6}, apple = {4,2,2}, pear = {3,3,5}, these numbers being characteristics such as shape, colour and squishiness).
The next problem is that, because nominal and numerical data are fundamentally different, you almost certainly need to normalise both so that one doesn't get an unreasonable weight because of the nature of that data. It's also possible you might split each data set into its nominal and numerical parts and calculate 2 distances per comparison. Again, it's a data-dependent decision, or a decision you will make when tuning to get good or even sane performance. Sum the normalised results, or calculate a Euclidean distance of them.
Normalising, at its simplest, means dividing by the overall range of the data, so 2 pieces of data, both normalised, will each be reduced to a value between 0 and 1, thus eliminating irrelevant facts such as the magnitude of one being 10,000 times that of the other. Alternative normalising techniques might be appropriate for your data if it can or does have outliers.
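For computing the distance by hand, here is a small sketch of HVDM as it is commonly stated: numeric attributes use the absolute difference divided by four standard deviations, nominal attributes use a value difference built from class-conditional frequencies, and the per-attribute terms are combined like a Euclidean distance. Treat the details as assumptions to check against the original definition: this version uses the squared-difference form for nominal attributes and the population standard deviation, and the toy data and names are made up.

```python
import math
import statistics
from collections import Counter, defaultdict


def fit_hvdm(rows, labels, nominal):
    """Collect per-attribute statistics; `nominal` is a set of column indices."""
    classes = sorted(set(labels))
    sigma, cond = {}, {}
    for a in range(len(rows[0])):
        column = [r[a] for r in rows]
        if a in nominal:
            counts = defaultdict(Counter)
            for value, label in zip(column, labels):
                counts[value][label] += 1
            # P(class | attribute value) for every value seen in training.
            cond[a] = {
                value: {c: cnt[c] / sum(cnt.values()) for c in classes}
                for value, cnt in counts.items()
            }
        else:
            sigma[a] = statistics.pstdev(column)
    return sigma, cond, classes


def hvdm(x, y, sigma, cond, classes):
    """Distance between two rows; nominal values must have been seen in fit_hvdm."""
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if a in cond:
            px, py = cond[a][xa], cond[a][ya]
            total += sum((px[c] - py[c]) ** 2 for c in classes)
        else:
            total += (abs(xa - ya) / (4 * sigma[a])) ** 2
    return math.sqrt(total)


# Toy data: one numeric and one nominal attribute, two classes.
rows = [(1.0, "red"), (2.0, "red"), (3.0, "blue"), (4.0, "blue")]
labels = ["a", "a", "b", "b"]
sigma, cond, classes = fit_hvdm(rows, labels, nominal={1})

d_colour_only = hvdm((1.0, "red"), (1.0, "blue"), sigma, cond, classes)
d_identical = hvdm((2.0, "red"), (2.0, "red"), sigma, cond, classes)
```

In the toy data "red" only occurs with class "a" and "blue" only with class "b", so the two colours are as far apart as the nominal term allows.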
In R, the UBL package offers HVDM as a distance option, for example in its ENNClassif function.
library(datasets)
data(iris)
summary(iris)
#install.packages("UBL")
library(UBL)
# generate a small imbalanced data set
ir <- iris[-c(95:130), ]
# use HVDM as the distance for numeric and nominal features
irHVDM <- ENNClassif(Species~., ir, k = 3, dist = "HVDM")

Distance measure for categorical attributes for k-Nearest Neighbor

For my class project, I am working on the Kaggle competition - Don't get kicked
The project is to classify test data as good/bad buy for cars. There are 34 features and the data is highly skewed. I made the following choices:
Since the data is highly skewed, out of 73,000 instances, 64,000 instances are bad buy and only 9,000 instances are good buy. Since building a decision tree would overfit the data, I chose to use kNN - K nearest neighbors.
After trying out kNN, I plan to try out Perceptron and SVM techniques, if kNN doesn't yield good results. Is my understanding about overfitting correct?
Since some features are numeric, I can directly use the Euclidean distance as a measure, but there are other attributes which are categorical. To use these features properly, I need to come up with my own distance measure. I read about Hamming distance, but I am still unclear on how to merge 2 distance measures so that each feature gets equal weight.
Is there a way to find a good approximate value for k? I understand that this depends a lot on the use case and varies per problem. But if I am taking a simple vote from each neighbor, what value of k should I use? I'm currently trying out various values, such as 2, 3, 10 etc.
I researched around and found these links, but these are not specifically helpful -
a) Metric for nearest neighbor, which says that finding your own distance measure is equivalent to 'kernelizing', but I couldn't make much sense of it.
b) Distance independent approximation of kNN talks about R-trees, M-trees etc. which I believe don't apply to my case.
c) Finding nearest neighbors using Jaccard coeff
Please let me know if you need more information.
Since the data is unbalanced, you should either sample an equal number of good/bad (losing lots of "bad" records), or use an algorithm that can account for this. I think there's an SVM implementation in RapidMiner that does this.
You should use Cross-Validation to avoid overfitting. You might be using the term overfitting incorrectly here though.
You should normalize distances so that they have the same weight. By normalize I mean force to be between 0 and 1. To normalize something, subtract the minimum and divide by the range.
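One way to combine the two kinds of features with equal weight, as a minimal sketch: numeric differences are divided by the feature's range (equivalent to min-max normalising first), categorical features contribute 0 or 1 as in Hamming distance, and the per-feature terms are averaged. The helper names and the car records are hypothetical.

```python
def min_max_params(rows, numeric):
    """Minimum and maximum of every numeric column, keyed by column index."""
    return {j: (min(r[j] for r in rows), max(r[j] for r in rows)) for j in numeric}


def mixed_distance(x, y, ranges):
    """Average of per-feature distances, each between 0 and 1."""
    total = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        if j in ranges:
            lo, hi = ranges[j]
            # Range-normalised absolute difference for numeric features.
            total += abs(a - b) / (hi - lo) if hi > lo else 0.0
        else:
            # Hamming-style mismatch for categorical features.
            total += 0.0 if a == b else 1.0
    return total / len(x)


# Hypothetical records: (mileage, body style).
cars = [(20000, "SUV"), (60000, "SEDAN"), (100000, "SUV")]
ranges = min_max_params(cars, numeric=[0])

d01 = mixed_distance(cars[0], cars[1], ranges)  # half the range apart, style differs
d02 = mixed_distance(cars[0], cars[2], ranges)  # full range apart, style matches
```

Because every term lies between 0 and 1, no single feature dominates merely because of its units. The ranges should be computed on the training data and reused for test records.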
The way to find the optimal value of K is to try all possible values of K (while cross-validating) and choose the value of K with the highest accuracy. If a "good" value of K is fine, then you can use a genetic algorithm or similar to find it. Or you could try K in steps of, say, 5 or 10, see which K leads to good accuracy (say it's 55), then try steps of 1 near that "good value" (i.e. 50, 51, 52...), but this may not be optimal.
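The coarse-then-fine search can be sketched as follows, using leave-one-out cross-validation and a toy one-dimensional kNN so that the example stays self-contained. All names and the data are illustrative, and the result is only as good as the coarse grid allows.

```python
from collections import Counter


def knn_predict(train, query, k):
    """train is a list of (value, label); majority vote among the k nearest."""
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out cross-validated accuracy for a given k."""
    hits = 0
    for i, (value, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, value, k) == label
    return hits / len(data)


def coarse_to_fine_k(data, coarse_step=5, max_k=None):
    """Try k in large steps first, then every k around the best coarse value."""
    max_k = max_k or len(data) - 1
    coarse = range(1, max_k + 1, coarse_step)
    best = max(coarse, key=lambda k: loo_accuracy(data, k))
    low = max(1, best - coarse_step + 1)
    high = min(max_k, best + coarse_step - 1)
    return max(range(low, high + 1), key=lambda k: loo_accuracy(data, k))


# Toy data: low values are "bad" buys, high values are "good" buys.
data = [(float(i), "bad") for i in range(10)] + [(float(i), "good") for i in range(10, 20)]
best_k = coarse_to_fine_k(data)
```

On a data set as unbalanced as the one in the question, plain accuracy can be misleading, so the scoring function is a natural place to swap in a different metric.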
I'm looking at the exact same problem.
Regarding the choice of k, it's recommended to use an odd value to avoid getting "tie votes" (this works when there are only two classes, as here).
I hope to expand this answer in the future.
