How to rank features based on differences in their distributions when the features are measured for two groups with unequal sample sizes?

Let's say I have 3 parameters, X, Y and Z, measured for three groups of people, A, B and C. The sample sizes of A, B and C are different. I need to rank the parameters X, Y and Z based on the differences in their distributions between groups A and B, A and C, and B and C. Can I define a metric that quantifies the difference in each case?
I was thinking of the symmetrised KL divergence (D_KL(P, Q) + D_KL(Q, P), since KL divergence is not symmetric), but it only works well for groups of equal size.
I was also thinking of a two-sample t-test, but then how do I compare/rank the features based on p-values when I have unequal sample sizes for A, B and C?
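One metric that does not require equal group sizes is the empirical Wasserstein distance, which scipy computes directly from two samples of different lengths. A minimal sketch for one pair of groups; the group sizes and distributions here are invented for illustration:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# Hypothetical data: features X, Y, Z measured in two groups of different sizes
group_a = {"X": rng.normal(0.0, 1, 120), "Y": rng.normal(0, 1, 120), "Z": rng.normal(0, 1, 120)}
group_b = {"X": rng.normal(0.5, 1, 80),  "Y": rng.normal(0, 2, 80),  "Z": rng.normal(0, 1, 80)}

# The Wasserstein distance works directly on raw samples of unequal size
scores = {f: wasserstein_distance(group_a[f], group_b[f]) for f in ["X", "Y", "Z"]}
# Rank features by how different their distributions are between A and B
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

The same loop can be repeated for the A/C and B/C pairs, giving one ranking per pair of groups.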

Related

Compute the similarity of two graphs of different sizes

I have two graphs G and G' (of different sizes) and I want to check how similar they are. I have read that the Wasserstein distance is used in this case.
How can I use it?
In scipy there is the function:
scipy.stats.wasserstein_distance(u_values, v_values, u_weights=None, v_weights=None)
How can I pass G and G' as u_values and v_values?
EDIT:
I got the idea from this paper: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0228728&type=printable
Where they write:
Inspired by the rich connections between graph theory and geometry, one can define a notion of distance between any two graphs by extending the notion of distance between metric spaces [58]. The construction proceeds as follows: each graph is represented as a metric space, wherein the metric is simply the shortest distance on the graph. Two graphs are equivalent if there exists an isomorphism between the graph represented as metric spaces. Finally, one can define a distance between two graphs G1 and G2 (or rather between the two classes of graph isometric to G1 and G2 respectively) by considering standard notions of distances between isometry classes of metric spaces [59]. Examples of such distances include the Gromov-Hausdorff distance [59], the Kantorovich-Rubinstein distance and the Wasserstein distance [60], which both require that the metric spaces be equipped with probability measures.
It is not clear to me though how to do this.
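One practical reading of the quoted construction: represent each graph by its multiset of pairwise shortest-path distances, then pass those two (differently sized) samples as u_values and v_values. A sketch under that assumption, using small hand-made graphs; this compares distance profiles, not the full Gromov-Wasserstein construction from the paper:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.stats import wasserstein_distance

def graph_distance_profile(adj):
    """Collect the finite, off-diagonal shortest-path distances of a graph."""
    d = shortest_path(adj, unweighted=True, directed=False)
    vals = d[np.triu_indices_from(d, k=1)]
    return vals[np.isfinite(vals)]  # drop pairs in different components

# Hypothetical example: a path graph on 4 nodes vs a triangle (different sizes)
path4 = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
tri = np.array([[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]])

w = wasserstein_distance(graph_distance_profile(path4), graph_distance_profile(tri))
print(w)
```

Because the two profiles are just 1-D samples, they may have different lengths, which is exactly the unequal-size situation wasserstein_distance is designed for.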

Why can different stocks be merged together to build a single prediction model?

Given n samples with d features of stock A, we can build a (d+1) dimensional linear model to predict the profit. However, in some books, I found that if we have m different stocks with n samples and d features for each, then they merge these data to get m*n samples with d features to build a single (d+1) dimensional linear model to predict the profit.
My confusion is that different stocks usually have little connection with each other, and their profits are influenced by different factors and environments, so why can they be merged to build a single model?
If you are using R as your tool of choice, you might like the time series embedding howto and its appendix -- the mathematics behind it is Takens's theorem:
[Takens's theorem gives] conditions under which a chaotic dynamical system can be reconstructed from a sequence of observations of the state of a dynamical system.
It looks to me as if the statements you quote relate to exactly this theorem: for d features (we are lucky if we know that number -- we usually don't), we need d+1 dimensions.
If more time series are to be predicted, we can use the same embedding space provided the features are the same. The dimensions d are usually simple variables (e.g. temperature for different energy commodity stocks) -- this example helped me to intuitively grasp the idea.
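The pooling the question describes can be sketched as a single least-squares fit on stacked data. This only makes sense under the assumption that the d features have the same meaning (and roughly the same relationship to profit) for every stock; the numbers below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d = 5, 40, 3  # hypothetical: 5 stocks, 40 samples each, 3 shared features

# Pooling assumption: one common coefficient vector governs all stocks
true_w = rng.normal(size=d)
X = rng.normal(size=(m * n, d))               # m*n pooled samples
y = X @ true_w + 0.1 * rng.normal(size=m * n)  # profit with small noise

# One (d+1)-dimensional linear model: d coefficients plus an intercept
X1 = np.hstack([X, np.ones((m * n, 1))])
w_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(w_hat)
```

If the stocks did not share a common relationship, this pooled fit would average incompatible models together, which is exactly the concern the question raises.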
Further reading
Forecasting with Embeddings

Building regression using Categorical features

I am trying to use house price prediction as a practical example to learn machine learning. Currently I have run into a problem regarding neighborhoods.
In most machine learning examples, I see features such as number of bedrooms, floor space, and land area being used. Intuitively, these features have strong correlations with house prices. However, this is not the case for neighborhood. Let's say I randomly assign a neighborhood_id to each neighborhood. I won't be able to tell whether the neighborhood with id 100 has higher or lower house prices than the neighborhood with id 53.
I am wondering whether I need to do some data pre-processing, such as finding the average price for each neighborhood and then using the processed data, or whether there is an existing machine learning algorithm that can figure out the relation from a seemingly unrelated feature?
I'm assuming that you're trying to interpret the relationship between neighborhood and housing price in a regression model with continuous and categorical data. From what I remember, R handles categorical variables automatically using one-hot encoding.
There are ways to approach this problem by creating data abstractions from categorical variables:
1) One-Hot Encoding
Let's say you're trying to predict housing prices from floor space and neighborhood. Assume that floor space is continuous and neighborhood is categorical with 3 possible neighborhoods, A, B and C. One possibility is to encode neighborhood as a one-hot vector and treat each category as a new binary variable:
neighborhood   A   B   C
A              1   0   0
B              0   1   0
B              0   1   0
C              0   0   1
The regression model would be something like:
y = c0*bias + c1*floor_space + c2*A + c3*B + c4*C
Note that this neighborhood variable is similar to bias in regression models. The coefficient for each neighborhood can be interpreted as the "bias" of the neighborhood.
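A minimal sketch of the one-hot encoding above with pandas (the data is invented for illustration):

```python
import pandas as pd

# Hypothetical housing data: floor space (continuous) and neighborhood (categorical)
df = pd.DataFrame({
    "floor_space": [50, 80, 65, 120, 90],
    "neighborhood": ["A", "B", "B", "C", "A"],
    "price": [200, 310, 280, 500, 330],
})

# One-hot encode the neighborhood column: one binary column per category
encoded = pd.get_dummies(df["neighborhood"], prefix="nbhd").astype(int)
X = pd.concat([df[["floor_space"]], encoded], axis=1)
print(X)
```

The resulting X can be fed to any regression model; the fitted coefficient of each nbhd_* column plays the role of that neighborhood's "bias" as described above.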
2) From categorical to continuous
Let's call Dx and Dy the horizontal and vertical distances from all neighborhoods to a fixed point on the map. By doing this, you create a data abstraction that transforms neighborhood, a categorical variable, into two continuous variables. By doing this, you can correlate housing prices to horizontal and vertical distance from your fixed point.
Note that this is only appropriate when the transformation from categorical to continuous makes sense.
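A tiny sketch of this transformation, with made-up neighborhood coordinates and a fixed reference point at the origin:

```python
# Hypothetical neighborhood centroids (x, y) on a map, and a fixed reference point
centroids = {"A": (2.0, 3.0), "B": (5.0, 1.0), "C": (8.0, 6.0)}
fixed = (0.0, 0.0)

# Replace the categorical neighborhood with two continuous features (Dx, Dy)
features = {n: (x - fixed[0], y - fixed[1]) for n, (x, y) in centroids.items()}
print(features)
```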

How to derive a marginal likelihood function?

I'm a little confused about the integral over theta in the marginal likelihood function (http://en.wikipedia.org/wiki/Marginal_likelihood, section "Applications" -> "Bayesian model comparison", the third equation on that page):
Why does the probability of x given M equal the integral and how to derive the equation?
This integral is nothing more than the law of total probability in continuous form. Thus it can be derived directly from the probability axioms. Given the second formula in the link (Wikipedia), the only thing you have to do to arrive at the formula you are looking for is to replace the sum over discrete states by an integral.
So, what does it mean intuitively? You assume a model for your data X, which depends on a variable theta. For a given theta, the probability of a dataset X is thus p(X|theta). As you are not sure of the exact value of theta, you choose it to follow a distribution p(theta|alpha) specified by a (constant) parameter alpha. Now, the distribution of X is directly determined by alpha (this should be clear -- just ask yourself whether there is anything else it might depend on, and find nothing). Therefore, you can calculate its exact influence by integrating out the variable theta. This is what the law of total probability states.
If you don't get it by this explanation, I suggest you to play a bit around with conditional probabilities for discrete states, which in fact often leads to obvious results. The extension to the continuous case is then straightforward.
EDIT: The third equation shows the same thing I tried to explain above. You have a model M. This model has parameters theta distributed by p(theta|M) -- you could also write this as p_M(theta), for example.
These parameters determine the distribution of the data X via p(X|theta, M) ... i.e. each theta gives a different distribution of X (for a chosen model M). This form, however, is not convenient to work with. What you want is a summarized statement on the model M, not on its various possible choices for theta. So, in a way, you now want to know the average of X given a model M (note that in the model M also a chosen distribution of its parameters is included. For example, M does not simply mean "Neural Network", but rather something like "Neural Network with weights uniformly distributed in [-1,1]").
Obtaining this "average" requires only basic statistics: just take the model, p(X|theta, M), multiply it by the density p(theta|M), and integrate over theta. This is essentially what you do for any average in statistics. Altogether, you arrive at the marginalization p(X|M).
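Written out in the notation of the Wikipedia page, the steps above are just:

p(X \mid M) = \int p(X \mid \theta, M) \, p(\theta \mid M) \, d\theta

i.e. the law of total probability with the sum over discrete states replaced by an integral over theta.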

K-Means clustering on multidimensional heterogeneous space

The data set I am trying to cluster is made of multiple heterogeneous dimensions.
For example
<A, B, C, D>
where A, B is lat, long.
C is a number.
D is a binary value.
What is the best way to approach a clustering problem in this case?
Should I normalise the data to make it homogeneous, or I should run a separate clustering problem for each homogeneous set of dimensions?
k-means is not a good choice, as it will not handle the 180° wrap-around, and distances anywhere but the equator will be distorted. IIRC in the northern USA and most parts of Europe, the distortion is over 20% already.
Similarly, it does not make sense to use k-means on binary data -- the mean does not make sense there, to be precise.
Use an algorithm that can work with arbitrary distances, and construct a combined distance function that is designed for solving your problem, on your particular data set.
Then use e.g. PAM or DBSCAN or hierarchical linkage clustering any other algorithm that works with arbitrary distance functions.
The mean of a binary feature can be seen as the frequency of that feature. There are cases in which one can standardise a binary feature v by v - \bar{v}.
However, in your case it seems to me that you have three features in three different feature spaces. I'd approach this problem by creating three distances d_v, one appropriate for each feature v \in V. The distance between two entities, say x and y, would be given by d(x, y) = \sum_{v \in V} w_v d_v(x_v, y_v). You could play with w_v, but I'd probably constrain it so that \sum_{v \in V} w_v = 1 and w_v \geq 0 for all v \in V.
The above are just some quick thoughts on it, good luck!
PS: Sorry for the text, I'm new here and I don't know how to put latex text here
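A sketch of this combined-distance approach with scipy; the weights, the min-max normalisation, and the haversine geo-distance are my own assumptions, not part of the answers above:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
n = 30
# Hypothetical records: <lat, lon, number, binary>
latlon = rng.uniform([40, -5], [50, 5], size=(n, 2))
number = rng.normal(size=(n, 1))
binary = rng.integers(0, 2, size=(n, 1))

def haversine_pairs(ll):
    """Condensed pairwise great-circle distance in km (handles the wrap-around k-means cannot)."""
    lat, lon = np.radians(ll[:, 0]), np.radians(ll[:, 1])
    i, j = np.triu_indices(len(ll), k=1)
    dlat, dlon = lat[i] - lat[j], lon[i] - lon[j]
    a = np.sin(dlat / 2) ** 2 + np.cos(lat[i]) * np.cos(lat[j]) * np.sin(dlon / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# One distance per feature space, scaled to [0, 1], combined with weights summing to 1
d_geo = haversine_pairs(latlon)
d_num = pdist(number)                      # euclidean on the numeric feature
d_bin = pdist(binary, metric="hamming")    # 0/1 disagreement on the binary feature
scale = lambda d: d / d.max()
w = (0.5, 0.3, 0.2)
d_combined = w[0] * scale(d_geo) + w[1] * scale(d_num) + w[2] * d_bin

# Hierarchical linkage clustering accepts this arbitrary condensed distance directly
labels = fcluster(linkage(d_combined, method="average"), t=3, criterion="maxclust")
print(labels)
```

PAM or DBSCAN would work the same way, since both accept a precomputed pairwise distance matrix instead of raw coordinates.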
