Set similarity measure with known item similarities and abundances

I'm looking for a similarity measure (like the Jaccard Index) but I want to use known similarities between objects within the set, and weigh the connections by the item abundances. These known similarities are scores between 0 and 1, 1 indicating an exact match.
For example, consider two sets:
SET1 {A,B,C} and SET2 {A',B',C'}
I know that
{A,A'}, {B,B'}, {C,C'} each have an item similarity of 0.9. Hence, I would expect the similarity of SET1 and SET2 to be relatively high.
Another example would be: consider two sets SET1 {A,B,C} and SET2 {A,B',C',D,E,F,.....,Z}. Although the matches between the first three items are higher than in the first example, this score should likely be lower because of the size difference (as in Jaccard).
One more issue here is how to use abundances as weights, but I have no idea how to solve this.
In general, I need a normalized set similarity measure that takes both item similarity and abundance into account.

Correct me if I'm wrong, but I think what you need is the clustering error as a similarity measure. It is the proportion of points that are clustered differently in A' and A after an optimal matching of clusters. In other words, it is the scaled sum of the non-diagonal elements of the confusion matrix, minimized over all possible permutations of rows and columns. It uses the Hungarian algorithm to avoid the high computational cost of checking every permutation, and it penalizes a different number of elements in the two sets.
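
For concreteness, here is a minimal sketch in R (not necessarily what the answer above has in mind) of one way to turn this into a normalized, Jaccard-like score: match the items of the two sets with the Hungarian algorithm (clue::solve_LSAP), credit each matched pair by its item similarity weighted by the shared abundance, and normalize by the total abundance so that extra items pull the score down. The similarity matrix and the abundance vectors are assumed to be given.

# Sketch only: generalized Jaccard with item similarities and abundances.
# Requires the 'clue' package for the Hungarian algorithm.
library(clue)

set_similarity <- function(sim, w1, w2) {
  # sim: |SET1| x |SET2| matrix of item similarities in [0, 1]
  # w1, w2: abundance weights for the items of SET1 and SET2
  n1 <- nrow(sim); n2 <- ncol(sim)
  k <- max(n1, n2)
  padded <- matrix(0, k, k)          # pad so unmatched items contribute 0
  padded[1:n1, 1:n2] <- sim
  assignment <- as.integer(solve_LSAP(padded, maximum = TRUE))
  score <- 0
  for (i in 1:n1) {
    j <- assignment[i]
    if (j <= n2) {
      # matched pair: similarity weighted by the shared ("overlapping") abundance
      score <- score + sim[i, j] * min(w1[i], w2[j])
    }
  }
  # Jaccard-style normalization: intersection / union
  score / (sum(w1) + sum(w2) - score)
}

# Example 1: {A,B,C} vs {A',B',C'}, each best match has similarity 0.9
set_similarity(diag(0.9, 3), rep(1, 3), rep(1, 3))            # ~0.82, relatively high

# Example 2: {A,B,C} vs {A,B',C',D,...,Z}: good matches, but 23 extra items
set_similarity(cbind(diag(c(1, 0.9, 0.9)), matrix(0, 3, 23)),
               rep(1, 3), rep(1, 26))                         # ~0.11, much lower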

Related

machine learning, nominal data normalization

I am working on k-means clustering. I have a 3-D dataset with the attributes no. of days, frequency, and food.
-> days is normalized by mean and standard deviation (SD), i.e. standardization, which gives me a range of about [-2, 14]
-> frequency and food, which are NOMINAL data in my dataset, are normalized by dividing by the maximum (x/max(x)), which gives a range of [0, 1]
The problem is that k-means only considers the day axis for grouping, since there are obvious gaps between points on that axis, and it almost ignores the other two, frequency and food (I think because of the negligible gaps in those dimensions).
If I apply k-means on the day axis alone (1-D), I get essentially the same result as when I apply it to the full 3-D data (days, frequency, food).
(I tried x/max(x) for days as well, but the result was not acceptable.)
So I want to know: is there any way to normalize the other two nominal attributes, frequency and food, so that we get fair scaling relative to the day axis?
food => 1, 2, 3
frequency => 1-36
The point of normalization is not just to make the values small.
The purpose is to have comparable value ranges, which is really hard for attributes of different units, and may well be impossible for nominal data.
For your kind of data, k-means is probably the worst choice, because k-means relies on continuous values to work. With nominal values it easily gets stuck. So my main recommendation is not to use k-means.
For k-means to work on your data, a difference of 1 must mean the same thing in every attribute: a 1-day difference must equal the difference between food 1 and food 2. And because k-means is based on squared errors, the difference between food 1 and food 3 counts 4x as much as the difference between food 1 and food 2.
Unless your data has this property, don't use k-means.
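
To make the scaling point concrete, here is a small illustration in R with made-up data: whichever column spans the widest numeric range dominates the squared-error objective that k-means minimizes, and standardizing every column puts them on a comparable scale (this addresses the range problem only, not the nominal-data problem discussed above).

# Illustration with synthetic data only.
set.seed(1)
raw <- data.frame(days      = sample(0:365, 100, replace = TRUE),
                  frequency = sample(1:36,  100, replace = TRUE),
                  food      = sample(1:3,   100, replace = TRUE))

km_raw    <- kmeans(raw, centers = 3, nstart = 20)         # 'days' dominates the clustering
km_scaled <- kmeans(scale(raw), centers = 3, nstart = 20)  # all columns on a comparable scale

# Cross-tabulate the two clusterings; with 'raw' the split follows 'days' almost exclusively.
table(km_raw$cluster, km_scaled$cluster)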
You can try the Value Difference Metric, VDM (or any of its variants), to convert pretty much every nominal attribute you encounter into a valid numeric representation. And after that you can just apply standardisation to the whole dataset as usual.
The original definition is here:
http://axon.cs.byu.edu/~randy/jair/wilson1.html
Although it should be easy to find implementations in every common language elsewhere.
N.B. for ordered nominal attributes such as your 'frequency', most of the time it is enough to just represent them as integers.
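
As a rough sketch of the metric mentioned above (the common q = 1 form of VDM for a single nominal attribute, without the paper's exact normalization): the distance between two attribute values is simply how much their class-conditional distributions differ.

# Sketch of VDM (q = 1) for one nominal attribute.
vdm <- function(attribute, y, v1, v2, q = 1) {
  p1 <- prop.table(table(y[attribute == v1]))   # P(class | attribute value v1)
  p2 <- prop.table(table(y[attribute == v2]))   # P(class | attribute value v2)
  classes <- union(names(p1), names(p2))
  g1 <- ifelse(classes %in% names(p1), p1[classes], 0)   # missing class -> probability 0
  g2 <- ifelse(classes %in% names(p2), p2[classes], 0)
  sum(abs(g1 - g2)^q)
}

# Toy example: 'food' values 1..3 against a binary outcome
food    <- c(1, 1, 2, 2, 2, 3, 3, 3, 3)
outcome <- c("y", "y", "y", "n", "n", "n", "n", "y", "n")
vdm(food, outcome, 1, 3)   # large: value 1 is mostly "y", value 3 mostly "n"
vdm(food, outcome, 2, 3)   # small: the two values have a similar class mix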

Heterogeneous Value Difference Metric (HVDM)

I would like to ask whether anyone knows of some examples of the Heterogeneous Value Difference Metric (HVDM) distance. Also, is there an implementation of this metric in R?
I would be grateful if someone could point me to a useful resource so that I could compute this distance manually.
This is a very involved subject, which is no doubt why you can't find examples. What worries me about your question is that it is very general, and often a given implementation or use case of this sort of machine learning / data mining needs considerable algorithm tuning to be effective, because the nature of the data will to some extent dictate how your HVDM is best calculated.
Single-dimensional Euclidean distance is simply D = |a - b|. In 2-D it is Pythagoras, so D = SQRT((a1-b1)^2+(a2-b2)^2), and for N-dimensional data D = SQRT((a1-b1)^2+(a2-b2)^2+....+(aN-bN)^2).
So, if you are comparing two data records, a and b, each with N numerical values, you can now calculate a distance between them...
Note that the square root is usually optional for practical purposes, since it changes only the magnitude of the distances, not their ordering; this is a tuning/performance/optimisation issue, and some use cases might be better with it and some without.
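
In code the N-dimensional version is a one-liner; the take_sqrt switch below is only there to illustrate that tuning choice:

# N-dimensional Euclidean distance between two numeric vectors.
euclid <- function(a, b, take_sqrt = TRUE) {
  d <- sum((a - b)^2)
  if (take_sqrt) sqrt(d) else d   # sqrt preserves the ordering, so it is optional for ranking
}

euclid(c(1, 2, 3), c(4, 6, 3))   # 5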
Since you say your dataset contains nominal values, this makes it more interesting, as Euclidean distance is meaningless for nominal values... How you reconcile that depends on the data: if you can assign numerical data to the nominals, that's good, because you can then calculate a Euclidean distance again (e.g. banana = {2,4,6}, apple = {4,2,2}, pear = {3,3,5}, these numbers being characteristics such as shape, colour and squishiness, for example).
The next problem is that because nominal and numerical data are fundamentally different, you almost certainly need to normalise them so that one does not get an unreasonable weight simply because of the nature of that data. It is also possible you might split each record and calculate two distances per comparison, one for the nominal part and one for the numerical part; again this is a data-dependent decision, or a decision you will make when tuning to get good or even sane performance. Then either sum the normalised results or calculate a Euclidean distance of them.
Normalising, at its simplest, means dividing by the overall range of the data, so two pieces of data, both normalised, are each reduced to a value between 0 and 1, which eliminates irrelevant facts like the magnitude of one being 10,000 times that of the other. Alternative normalisation techniques might be more appropriate for your data if it has, or can have, outliers.
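
A minimal sketch of this "normalize, then combine" idea, using range normalization for numeric attributes and a plain 0/1 overlap for nominal ones (HVDM itself combines a standard-deviation-based normalization for the numeric attributes with a VDM-based distance for the nominal ones, so this is not HVDM proper):

# Sketch only: a simple mixed-attribute distance, not HVDM proper.
mixed_dist <- function(a, b, ranges, nominal) {
  # a, b: two records of the same length (lists, so types can be mixed)
  # ranges: max - min of each numeric attribute (NA for the nominal ones)
  # nominal: logical vector marking which attributes are nominal
  d <- numeric(length(a))
  for (j in seq_along(a)) {
    if (nominal[j]) {
      d[j] <- as.numeric(a[[j]] != b[[j]])       # 0 if equal, 1 otherwise
    } else {
      d[j] <- abs(a[[j]] - b[[j]]) / ranges[j]   # range-normalized to [0, 1]
    }
  }
  sqrt(sum(d^2))
}

row1 <- list(height = 180, weight = 80, colour = "red")
row2 <- list(height = 170, weight = 90, colour = "blue")
mixed_dist(row1, row2, ranges = c(50, 60, NA), nominal = c(FALSE, FALSE, TRUE))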
In R, you can find the UBL package, which offers HVDM as a distance option in its ENNClassif function.
library(datasets)
data(iris)
summary(iris)
#install.packages("UBL")
library(UBL)
# generate a small imbalanced data set
ir<- iris[-c(95:130), ]
# use HVDM as the distance for numeric and nominal features
irHVDM <- ENNClassif(Species~., ir, k = 3, dist = "HVDM")

finding maximum depth of random forest given the number of features

How do we find the maximum depth of a random forest if we know the number of features?
This is needed for regularizing a random forest classifier.
I have not thought about this before. In general the trees are non-deterministic, so instead of asking "what is the maximum depth?" you may want to know the average depth, or the chance that a tree reaches depth 20... Anyway, it is possible to calculate some bounds on the maximum depth. A node stops splitting when it runs out of either (a) in-bag samples or (b) possible splits.
(a) If the number of in-bag samples (N) is the limiting factor, one could imagine a classification tree where all samples except one are sent left at every split. Then the maximum depth is N-1. This outcome is highly unlikely, but possible. At the other extreme, if all child nodes are equally big, the minimal depth is ~log2(N), e.g. 16, 8, 4, 2, 1. In practice the tree depth will be somewhere between the maximum and the minimum. Settings controlling minimal node size will reduce the depth.
(b) To check whether features are limiting tree depth, and you know the training set beforehand, count how many training samples are unique. Unique samples (U) cannot be split further. Due to bootstrapping, only ~63% of the samples are selected for each tree, so N ~ U * 0.63. Then use the rules from section (a). (All unique samples could be selected during bootstrapping, but that is unlikely too.)
If you do not know your training set, try to estimate how many levels (L[i]) could possibly be found in each feature i out of the d features. For categorical features the answer may be given; for numeric features drawn from a real distribution, there would be as many levels as there are samples. The number of possible unique samples would be U = L[1] * L[2] * L[3] * ... * L[d].
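
A quick sketch of these bounds in R, under the assumptions above (they are bounds, not predictions):

# Rough depth bounds given the number of unique training samples.
depth_bounds <- function(n_unique) {
  n_inbag <- round(0.63 * n_unique)        # ~63% of samples appear in each bootstrap
  c(min_depth = ceiling(log2(n_inbag)),    # perfectly balanced splits
    max_depth = n_inbag - 1)               # one sample peeled off per split
}

depth_bounds(1000)   # min_depth 10, max_depth 629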

inequality measures for comparing two distributions

I am looking for an inequality measure to compare the inequality of two distributions. The random variable is categorical, and I am thinking of using the Gini index and skewness. Here are my two questions:
I read in a paper that skewness is a shape measure and Gini is not. Does Gini tell us anything about the shape of the distribution? What is the benefit of each of these measures in that sense?
I want to compare the two distributions in terms of inequality. To give an example, if we have the distributions of two societies, can we use Gini and skewness to compare them even though the numbers of classes (categories) in the two societies are not close at all, e.g. one has 10 classes and one has 40? Is skewness bounded? Can it be used for comparison?
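
For experimenting with the comparison, both candidate measures are easy to compute from category counts; here is a small sketch with made-up counts (note that the Gini coefficient stays in [0, 1), while moment skewness is unbounded):

# Sketch: Gini coefficient and moment skewness from category counts.
gini <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))   # mean absolute difference / (2 * mean)
}
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3/2)
}

society_a <- c(50, 30, 10, 5, 5)               # 5 classes (made-up counts)
society_b <- rep(c(40, 20, 10, 5, 2, 1), 5)    # 30 classes (made-up counts)
c(gini(society_a), gini(society_b))
c(skewness(society_a), skewness(society_b))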

Most effective similarity measure for list-ranked items

We're trying to find similarity between items (and later users) where the items are ranked in various lists by users (think Rob, Barry and Dick in High Fidelity). A lower index in a given list implies a higher rating.
I suppose a standard approach would be to use the Pearson correlation and then invert the indexes in some way.
However, as I understand it, the aim of the Pearson correlation is to compensate for differences between users who typically rate things higher or lower but have similar relative ratings.
It seems to me that if the lists are continuous (although of arbitrary length) it's not an issue that the ratings implied from the position will be skewed in this way.
I suppose in this case a Euclidean based similarity would suffice. Is this the case? Would using the Pearson correlation have a negative effect and find correlation that isn't appropriate? What similarity measure might best suit this data?
Additionally, while we want position in the list to have an effect, we don't want to penalise rankings that are too far apart. Two users who both feature an item in their lists, even with very different rankings, should still be considered similar.
Jaccard similarity looks like a better fit in your case. To include the ranks you mentioned, you can take a bag-of-items approach.
Using your example of (Rob, Barry, Dick) with their ratings being (3, 2, 1) respectively, you insert Rob 3 times into user a's bag:
Rob, Rob, Rob.
Then you do the same twice for Barry. The bag now looks like this:
Rob, Rob, Rob, Barry, Barry.
Finally you put Dick into the bag:
Rob, Rob, Rob, Barry, Barry, Dick
Suppose another user, b, has a bag of [Dick, Dick, Barry]. You calculate the Jaccard similarity as follows:
The intersection between a and b = [Dick, Barry]
The union of a and b = [Rob, Rob, Rob, Barry, Barry, Dick, Dick]
The Jaccard Similarity = 2/7,
that is, the number of items in the intersection divided by the number of items in the union.
This similarity measure does NOT penalize rankings that are far apart, which matches your requirement:
"Two users both featuring an item in a list with very differing ranking should still be considered similar."
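
In code, this bag-of-items (multiset) Jaccard boils down to taking the minimum count of each item for the intersection and the maximum for the union; a small sketch reproducing the 2/7 above:

# Multiset Jaccard from item counts.
bag_jaccard <- function(a, b) {
  items <- union(names(a), names(b))
  ca <- ifelse(items %in% names(a), a[items], 0)   # count of each item in bag a
  cb <- ifelse(items %in% names(b), b[items], 0)   # count of each item in bag b
  sum(pmin(ca, cb)) / sum(pmax(ca, cb))            # intersection size / union size
}

bag_a <- c(Rob = 3, Barry = 2, Dick = 1)   # user a's bag
bag_b <- c(Dick = 2, Barry = 1)            # user b's bag
bag_jaccard(bag_a, bag_b)                  # 2/7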
The most well-known similarity metric based only on ranking is Spearman's correlation. It just assigns 1 to the first item, 2 to the second and so on, and computes a (Pearson) correlation coefficient. (You can make the values descending instead, which is more intuitive; it won't matter to the Pearson correlation.)
Spearman's correlation is implemented in the project, but, that said, I do not think it is very useful.
Kendall's tau rank correlation is a more principled measure of how well ranked lists match, but it's not implemented. It would not be hard to add.
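
For reference, both rank correlations are available in base R via cor(); the ranks below are made up:

# Positions of the same five items in two users' lists (made-up data).
ranks_user1 <- c(1, 2, 3, 4, 5)
ranks_user2 <- c(2, 1, 3, 5, 4)

cor(ranks_user1, ranks_user2, method = "spearman")
cor(ranks_user1, ranks_user2, method = "kendall")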
