Compute the similarity of two graphs of different sizes - machine-learning

I have two graphs G and G' (of different sizes) and I want to check how similar they are. I have read that the Wasserstein distance is used in this case.
How can I use it?
In scipy there is the function:
scipy.stats.wasserstein_distance(u_values, v_values, u_weights=None, v_weights=None)
How can I pass G and G' as u_values and v_values?
EDIT:
I got the idea from this paper: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0228728&type=printable
Where they write:
Inspired by the rich connections between graph theory and geometry, one can define a notion of distance between any two graphs by extending the notion of distance between metric spaces [58]. The construction proceeds as follows: each graph is represented as a metric space, wherein the metric is simply the shortest distance on the graph. Two graphs are equivalent if there exists an isomorphism between the graph represented as metric spaces. Finally, one can define a distance between two graphs G1 and G2 (or rather between the two classes of graph isometric to G1 and G2 respectively) by considering standard notions of distances between isometry classes of metric spaces [59]. Examples of such distances include the Gromov-Hausdorff distance [59], the Kantorovich-Rubinstein distance and the Wasserstein distance [60], which both require that the metric spaces be equipped with probability measures.
It is not clear to me though how to do this.
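One way to get inputs that scipy.stats.wasserstein_distance accepts is to reduce each graph to a one-dimensional sample, for example the list of shortest-path lengths between all vertex pairs (the graph metric mentioned in the quoted passage). The sketch below assumes unweighted graphs stored as adjacency dicts; the helper name and the two example graphs are hypothetical. Note that this is a simplification: it compares distance distributions and is not the distance between metric measure spaces that the paper describes.

```python
from collections import deque

from scipy.stats import wasserstein_distance


def shortest_path_lengths(adj):
    """Hop distances between all ordered pairs of distinct, mutually reachable vertices.

    `adj` is an unweighted graph given as {vertex: [neighbors]}.
    """
    lengths = []
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # plain BFS from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        lengths.extend(d for node, d in dist.items() if node != source)
    return lengths


# Hypothetical graphs of different sizes: a path on 4 vertices, a cycle on 6 vertices
G1 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
G2 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

# u_values and v_values may have different lengths, so the graph sizes need not match
d = wasserstein_distance(shortest_path_lengths(G1), shortest_path_lengths(G2))
print(d)
```

Other one-dimensional summaries (degree sequences, for instance) can be passed in the same way; which summary is meaningful depends on what "similar" should mean for your graphs.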

Related

Constructing a 3-D weighted & undirected similarity graph

I am a newbie in using python and I am in need of some help.
I am trying to build a weighted and undirected k-nearest-neighbors graph for a given 13-dimensional dataset containing 200 data points.
For a start, I created a 3-dimensional embedding via PCA (preserving up to 98% of the initial data structure). I also created the embedding scatter plot using matplotlib and a similarity matrix containing each data point's distance to its 10 nearest neighbors using sklearn.neighbors.kneighbors_graph. The resulting matrix is not a symmetric one and would lead me to a directed graph.
What I want to do is to create an undirected graph, using the distances as edge weights and each data point as a vertex. Focusing on the "undirected" part of the process, this means that:
(A) two vertices (let's say v-i and v-j) would be connected with an undirected edge if v-i is among the k-nearest-neighbors of v-j or if v-j is among the k-nearest-neighbors of v-i.
(B) two vertices would be connected with an undirected edge if v-i is among the k-nearest-neighbors of v-j and v-j is among the k-nearest-neighbors of v-i.
The resulting similarity matrix (using either (A) or (B)) would be a symmetric one.
Unfortunately, I have no idea how to do this or how to plot it. Does anyone have a clue?
Thanks in advance!!!
I tried using Networkx, but I'm afraid it doesn't work.
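Both symmetrisation rules can be sketched with element-wise maximum and minimum of the directed k-NN matrix and its transpose. The example below is a minimal sketch using NumPy only; the helper knn_distance_matrix and the random data standing in for the PCA embedding are assumptions, not part of the question. It also assumes no two points coincide, since a zero distance would be indistinguishable from "no edge".

```python
import numpy as np


def knn_distance_matrix(X, k):
    """Directed k-NN matrix: D[i, j] is the distance from i to j if j is one of
    the k nearest neighbors of i, and 0 otherwise."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]   # indices of the k closest points per row
    rows = np.repeat(np.arange(n), k)
    cols = nearest.ravel()
    D = np.zeros((n, n))
    D[rows, cols] = dist[rows, cols]
    return D


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))    # hypothetical stand-in for the 3-D PCA embedding
D = knn_distance_matrix(X, k=10)

# (A) edge if i is a neighbor of j OR j is a neighbor of i
W_or = np.maximum(D, D.T)
# (B) edge only if i and j are mutual neighbors (the minimum is 0 unless both entries are set)
W_and = np.minimum(D, D.T)
```

If the matrix comes from sklearn.neighbors.kneighbors_graph it is a SciPy sparse matrix, and A.maximum(A.T) and A.minimum(A.T) should give the same two results. Either symmetric matrix can then be turned into an undirected weighted graph, for example with networkx.from_numpy_array, and drawn from there.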

Determining the number of clusters for kdd99 dataset using k-means

What is the general convention for the number of clusters k when performing k-means on the KDD99 dataset? Three different papers I read use three completely different values of k (25, 20 and 5). I would like to know the general opinion on this, e.g. what the range of k should be.
Thanks
The K-means clustering algorithm is used to find groups which have not been explicitly labeled in the data.
In general there is no method for determining the exact value of K, but it can be estimated.
A common metric for comparing candidate values of K is the mean distance between data points and their cluster centroid.
The elbow method and kernel method work more precisely, but the number of clusters can depend upon your problem. (Recommended)
One quick approach is to take the square root of half the number of data points, k ≈ sqrt(n/2), and use that as the number of clusters.
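As a rough sketch of the two quantities mentioned above, the snippet below computes the sqrt(n/2) rule of thumb and the mean point-to-centroid distance for a given cluster assignment. The function names and the toy data are hypothetical, and the labels are assumed to come from whatever k-means implementation you use.

```python
import math

import numpy as np


def rule_of_thumb_k(n_points):
    """Quick estimate k ~ sqrt(n / 2), rounded and never below 1."""
    return max(1, round(math.sqrt(n_points / 2)))


def mean_centroid_distance(X, labels):
    """Mean Euclidean distance between each point and the centroid of its cluster."""
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        total += np.linalg.norm(members - centroid, axis=1).sum()
    return total / len(X)


# Toy data: two well-separated groups on a line
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
one_cluster = mean_centroid_distance(X, np.array([0, 0, 0, 0]))
two_clusters = mean_centroid_distance(X, np.array([0, 0, 1, 1]))
```

For the elbow method you would compute this value for a range of K and look for the point where adding another cluster stops reducing it by much.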

A feature that has different meanings in different ranges

In machine learning, how to deal with a feature like salary. For example, if I'm applying k-nearest neighbors by measuring the distance between data points based on features. Let's say we have two points with salaries 2000 and 6000. The difference between them is 4000. Let's view another two points with salaries 102000 and 106000. The difference here is still 4000$ but we humans consider the last two points closer or more similar than the first two points.
How do I incorporate such an intuition in machine learning?
You can do one of the following things (and many more):
transform the feature using a log function (thus 2000 and 6000 would be much further apart than 102000 and 106000)
binarize feature into multiple buckets (you would create a feature for each range of salary and you are the one creating the buckets)
change similarity function in k-nn to look at relative instead of absolute difference
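To make the first and third options concrete, here is a small sketch using the salaries from the question. The function names are made up for illustration, and dividing by the larger value is just one possible definition of a relative difference.

```python
import math


def log_gap(a, b):
    """Absolute difference after a log transform; depends only on the ratio a / b."""
    return abs(math.log(a) - math.log(b))


def relative_gap(a, b):
    """Absolute difference relative to the larger of the two values."""
    return abs(a - b) / max(a, b)


low_pair = (2000, 6000)
high_pair = (102000, 106000)

print(log_gap(*low_pair), log_gap(*high_pair))
print(relative_gap(*low_pair), relative_gap(*high_pair))
```

In both cases the 2000/6000 pair ends up far apart and the 102000/106000 pair close together, even though the raw difference is 4000 for each.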

How to find maximal eulerian subgraph?

How to find maximal eulerian subgraph of a given graph? By "maximal" I mean subgraph with maximal number of edges, vertices, or both. My idea is to find basis of cycle space and combine basis cycles in a proper way, but I don't know how to do it (and is it a good idea or not).
UPD. Source graph is connected.
Some thoughts. A graph is Eulerian iff it is connected (with possible isolated vertices) and all vertices have even degree.
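That criterion can be checked directly. The following is a minimal sketch for simple undirected graphs stored as {vertex: set_of_neighbors}; the representation and the function name are assumptions for illustration.

```python
def is_eulerian(adj):
    """True if every vertex has even degree and all non-isolated vertices
    lie in one connected component."""
    if any(len(neighbors) % 2 for neighbors in adj.values()):
        return False
    non_isolated = [v for v, neighbors in adj.items() if neighbors]
    if not non_isolated:
        return True                      # no edges at all
    seen = {non_isolated[0]}
    stack = [non_isolated[0]]
    while stack:                         # depth-first search over the edges
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(non_isolated)


triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
path = {1: {2}, 2: {1, 3}, 3: {2}}
print(is_eulerian(triangle), is_eulerian(path))
```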
It is 'easy' to satisfy the second criterion by removing (shortest) paths between pairs of odd-degree vertices.
Connectivity is problematic since removing edges can produce a disconnected graph.
An example shows that a 'simple' (greedy) solution is not easy to produce. Modify the complete graph K5 by splitting each edge into two (or more) edges. Take two of these modified K5 graphs and pick two vertices from each (A, B from the first and C, D from the second). Connect A-C and B-D. A greedy approach would remove these added edges since they are the shortest paths, and the graph then becomes disconnected. The solution would be to remove the paths A-B and C-D.
It seems to me that the algorithm should take care of subgraph connectivity while removing edges. For sure the algorithm should preserve that each subset of odd-degree vertices, of which no pair has been used to remove a path between them, has connectivity larger than the cardinality of the subset.
I would try (as a test) a recursive brute-force solution with optimization. O is the list of odd-degree vertices.
def remove_edges(O, G):
    if O is empty:
        return solution
    for f in O:
        for t in O\{f}:
            G2 = G without path edges between (f,t)
            if G2 is unconnected:
                continue
            return remove_edges(O\{f,t}, G2)
An optimization is to order the sets O and O\{f} so that vertices with the shortest paths between them come first. That can be done by finding the shortest path lengths between all pairs of vertices from O before removing edges, using a BFS from each vertex in O.
It was proved in 1979 that determining if a given graph contains a spanning Eulerian subgraph is NP-complete.
Ref: W. R. Pulleyblank, A note on graphs spanned by Eulerian graphs, J. Graph Theory 3, 1979, pp. 309–310.
Finding the maximum size (number of edges) of spanning Eulerian subgraph of a graph (if it exists) is an active research area.
Consider the following standard definitions. Given a graph G = (V, E):
A circuit is a sequence of adjacent vertices starting and ending at the same vertex. Circuits do not allow repeated edges but they do allow repeated vertices.
A cycle is a special case of a circuit in which vertices also do not repeat.
Note that circuits and Eulerian subgraphs are the same thing. This means that finding the longest circuit in G is equivalent to finding a maximum Eulerian subgraph of G. As noted above, this problem is NP-hard. So, unless P=NP, an efficient (i.e. polynomial-time) algorithm for finding a maximum Eulerian subgraph in an arbitrary graph is impossible.
For undirected graphs, one way of randomly producing an Eulerian subgraph is to identify a cycle basis for G. A cycle basis is a set of cycles that, when combined using symmetric differences, can be used to form every Eulerian subgraph of the original graph G. Hence, we only need to take a random selection of cycles from this set and combine them to get our arbitrary Eulerian subgraph.
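As a small illustration of combining cycles by symmetric difference, the sketch below stores each cycle as a set of undirected edges and uses Python's set operator ^. The helper names and the two triangles are hypothetical. The combined edge set always has even degree at every vertex, although for an arbitrary selection of cycles it is not guaranteed to be connected.

```python
from collections import Counter


def cycle_edges(vertices):
    """Edge set of the cycle that visits `vertices` in order and returns to the start."""
    return {frozenset(pair) for pair in zip(vertices, vertices[1:] + vertices[:1])}


def degrees(edges):
    """Degree of every vertex that appears in the edge set."""
    return Counter(v for edge in edges for v in edge)


c1 = cycle_edges(["a", "b", "c"])   # triangle a-b-c
c2 = cycle_edges(["b", "c", "d"])   # triangle b-c-d, sharing edge b-c with c1

combined = c1 ^ c2                  # symmetric difference drops the shared edge
print(sorted(tuple(sorted(e)) for e in combined))
print(degrees(combined))
```

Here the two triangles merge into the 4-cycle a-b-d-c, in which every vertex has degree 2.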
Given that an Eulerian subgraph is basically a collection of overlapping cycles, here is a greedy, polynomial-time algorithm that I'd like to suggest for finding large (but not necessarily maximum) Eulerian subgraphs. This works for both directed and undirected graphs and produces a set of edges (or arcs) E’ that define an Eulerian subgraph containing a user-defined source vertex s. The following steps are for directed graphs but can be easily modified for the undirected case.
Let U = {s} and E' = {}
while U is not empty
    Let u be a random element in U
    Form a cycle C from u in G
    if no such cycle C exists
        Remove u from U
    else
        Add the arcs of C to E'
        Remove the arcs of C from G
        Add the vertices of C to U
Here are a few points to note about this algorithm.
The set U holds the vertices that are yet to be fully considered by the algorithm.
To apply this method to undirected graphs, just replace the word "arcs" with "edges".
This method can be seen as a generalisation of Hierholzer's algorithm. Hence, if the input graph G is already an Eulerian graph, then the returned set E' will contain all of the edges from G.
Various methods can be used to generate a cycle C from vertex u. For directed graphs, a simple method is to create an additional dummy vertex u' and temporarily redirect all of u's incoming arcs to u'. Various algorithms can then be used to determine a u-u'-path (which represents a cycle), such as BFS, DFS, or Wilson's algorithm.
This algorithm can be said to produce a maximal Eulerian subgraph with respect to G and s. This is because, on termination, no further cycles can be added to the solution contained in E'. Note that we should not confuse the terms maximal and maximum here: finding a maximal Eulerian subgraph is easy (using the above method); finding a maximum Eulerian subgraph is NP-hard. Similar terminology is used with matchings.
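The steps above can be sketched in Python for the directed case as follows. This is only an illustration under some assumptions: the graph is a dict mapping each vertex to its successors, vertices are sortable (so the random choice is reproducible with a seed), and instead of the dummy-vertex construction the cycle is found by a BFS from u that stops at the first arc leading back into u, which should find the same kind of cycle. The function names are hypothetical.

```python
import random
from collections import deque


def find_cycle(arcs, u):
    """Return the arcs of a directed cycle through u, or None if there is none."""
    parent = {u: None}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in arcs.get(v, ()):
            if w == u:                          # arc v -> u closes a cycle
                cycle = [(v, u)]
                while parent[v] is not None:    # walk back along the BFS tree to u
                    cycle.append((parent[v], v))
                    v = parent[v]
                return cycle[::-1]
            if w not in parent:
                parent[w] = v
                queue.append(w)
    return None


def greedy_eulerian_subgraph(graph, s, seed=0):
    """Greedily collect arc-disjoint cycles reachable from s; returns the arc list E'."""
    rng = random.Random(seed)
    arcs = {v: set(ws) for v, ws in graph.items()}   # working copy of G
    U, E_prime = {s}, []
    while U:
        u = rng.choice(sorted(U))
        cycle = find_cycle(arcs, u)
        if cycle is None:
            U.discard(u)
        else:
            E_prime.extend(cycle)                    # add the arcs of C to E'
            for a, b in cycle:
                arcs[a].discard(b)                   # remove the arcs of C from G
                U.update((a, b))                     # add the vertices of C to U
    return E_prime


# Two directed triangles sharing vertex 0, plus one extra arc 1 -> 3
graph = {0: [1, 3], 1: [2, 3], 2: [0], 3: [4], 4: [0]}
E_prime = greedy_eulerian_subgraph(graph, s=0)
print(sorted(E_prime))
```

In this example both triangles are picked up and the extra arc 1 -> 3 is left out, so every vertex in E' has equal in-degree and out-degree.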

Latent semantic analysis (LSA) singular value decomposition (SVD) understanding

Bear with me through my modest understanding of LSI (Mechanical Engineering background):
After performing SVD in LSI, you have 3 matrices:
U, S, and V transpose.
U compares words with topics and S is a sort of measure of strength of each feature. Vt compares topics with documents.
U dot S dot Vt
returns the original matrix before SVD. Without doing too much (none) in-depth algebra it seems that:
U dot S dot Ut
returns a term by term matrix, which provides a comparison between the terms. i.e. how related one term is to other terms, a DSM (design structure matrix) of sorts that compares words instead of components. I could be completely wrong, but I tried it on a sample data set, and the results seemed to make sense. It could just be bias though (I wanted it to work, so I saw what I wanted). I can't post the results as the documents are protected.
My question though is: Does this make any sense? Logically? Mathematically?
Thanks for any time/responses.
If you want to know how related one term is to another you can just compute
(U dot S)
The terms are represented by the row vectors. You can then compute the distance matrix by applying a distance function such as Euclidean distance to every pair of row vectors. The resulting matrix should be hollow and symmetric: zeros on the diagonal and non-negative distances everywhere else. If the distance A[i,j] is small then terms i and j are related; otherwise they are not.
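A minimal NumPy sketch of this, using a made-up 4-term by 3-document count matrix (the data and variable names are purely illustrative):

```python
import numpy as np

# Hypothetical term-document matrix: rows are terms, columns are documents
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# U dot S: scaling the columns of U by the singular values gives one row vector per term
term_vectors = U * s

# Pairwise Euclidean distances between the term vectors
diff = term_vectors[:, None, :] - term_vectors[None, :, :]
distances = np.linalg.norm(diff, axis=2)

# Inner products between term vectors: (U S)(U S)^T = U S^2 U^T
gram = term_vectors @ term_vectors.T
print(np.round(distances, 3))
```

Without truncation, (U S)(U S)^T equals A times A transpose, i.e. the term-by-term inner-product matrix, so the U dot S dot Ut product from the question is close in spirit but weights each topic by S instead of S squared. To work in a reduced topic space, keep only the first k columns of U and the first k singular values before forming the term vectors.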
