How node2vec works - machine-learning

I have been reading about the node2vec embedding algorithm and I am a little confused about how it works.
For reference, node2vec is parametrised by p and q and works by simulating a bunch of random walks from nodes and just running word2vec embeddings on these walks as "sentences". By setting p and q in different ways, you can get more BFS or more DFS type random walks in the simulation phase, capturing different network structure in the embedding.
Setting q > 1 gives us more BFS behaviour in that the sampled walks consist of nodes within a small locality. The thing I am confused about is that the paper says this is equivalent to embedding nodes with similar structural properties close to each other.
I don't quite understand how that works. If I have two separate say star/hub structured nodes in my network that are far apart, why would embedding based on the random walks from those two nodes put those two nodes close together in the embedding?

This question has occupied my mind also after reading the article, and more so after empirically seeing that it indeed does that.
I assume you are referring to the part of the paper that shows the following diagram and states that the resulting embeddings of u and s6 will be quite similar in the embedding space:
To understand why this indeed happens, first we must understand how the skip-gram model embeds information, which is the mechanism that consumes the random walks.
The skip-gram model eventually generates similar embeddings for tokens that can appear in similar context - but what does that really mean from the skip-gram model perspective?
If we would like to embed the structural equivalence we would favor a DFS-like walk (and additionally we would have to use an adequate window size for the skip-gram model).
So random walks would look like
1. s1 > u > s4 > s5 > s6 > s8
2. s8 > s6 > s5 > s4 > u > s1
3. s1 > s3 > u > s2 > s5 > s6
4. s7 > s6 > s5 > s2 > u > s3
...
n. ...
What will happen is that there would be many walks, where u and s6 appear in walks where their surroundings are the same. Since their surroundings will be similar it means that their context is similar and as stated similar context == similar embeddings.
One might further ask: what about order? Order doesn't really matter, since the skip-gram model uses the window size to generate pairs out of every sentence. The link I provided explains this concept further.
So bottom line, if you can create walks that will create similar context for two nodes, their embeddings will be similar.
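To make the "similar context" argument concrete, here is a minimal sketch that counts the context nodes skip-gram would pair with each node, using the four walks listed above. The function name context_counts and the window size of 2 are assumptions chosen for illustration. This is not word2vec itself, only the pair-generation step.

```python
from collections import Counter, defaultdict

# The four example walks from the answer above.
walks = [
    ["s1", "u", "s4", "s5", "s6", "s8"],
    ["s8", "s6", "s5", "s4", "u", "s1"],
    ["s1", "s3", "u", "s2", "s5", "s6"],
    ["s7", "s6", "s5", "s2", "u", "s3"],
]

def context_counts(walks, window):
    """Count, for every node, which nodes fall within `window` steps of it."""
    contexts = defaultdict(Counter)
    for walk in walks:
        for i, center in enumerate(walk):
            lo = max(0, i - window)
            hi = min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[center][walk[j]] += 1
    return contexts

contexts = context_counts(walks, window=2)  # window=2 is an arbitrary choice
shared = set(contexts["u"]) & set(contexts["s6"])
print("contexts of u: ", sorted(contexts["u"]))
print("contexts of s6:", sorted(contexts["s6"]))
print("shared:        ", sorted(shared))
```

Because the pairs are generated symmetrically around each position, a walk and its reverse (walks 1 and 2 here) contribute exactly the same pairs, which is why order does not matter.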

My understanding of the two sampling strategies goes like this:
DFS: for each node (a) the walk explores a wide context, containing not just the immediate neighbors (b), but also nodes further away (c). When optimizing the embedding and trying to get nodes closer which have similar context, the optimizer has to consider not just the relation of (a)-(b), but also (b)-(c), and so on. This is the same as trying to place nodes so that their distance in the network is conserved (each node trying to find its place based on a wide context).
BFS: for each node (a) the walk only explores the local context, but it does that extensively, so probably all neighbors (b1, b2, ...) will be included (and maybe some 2nd neighbors). Imagine trying to find a node's place in the embedding space while only having information on its neighbors. Nodes that have similarly embedded neighbors should be close, e.g. dangling nodes with only 1 neighbor (and thus a respective walk containing the source node many times), or nodes with two neighbors which have high degrees (i.e. a bridge connecting two hubs). So by only knowing the local information the embedding will not optimize for global distances, thus the result is not based on the actual graph structure, but rather on local patterns (called structural equivalence in the paper, just to make it confusing)
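For reference, the p/q-biased walk behind these two regimes can be sketched as follows. This is a minimal version assuming an unweighted, undirected graph stored as a dict of neighbor lists. The function name node2vec_walk and the toy graph are made up for illustration, and real implementations precompute the transition probabilities instead of recomputing them at every step.

```python
import random

def node2vec_walk(adj, start, length, p, q, rng):
    """One second-order random walk; adj maps node -> list of neighbors."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = adj[cur]
        if not nbrs:
            break
        if len(walk) == 1:
            # No previous node yet, so the first step is uniform.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)   # step back to the previous node
            elif x in adj[prev]:
                weights.append(1.0)       # stay at distance 1 from prev
            else:
                weights.append(1.0 / q)   # move away to distance 2 from prev
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Hypothetical toy graph: a triangle a-b-c with a tail c-d-e.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
print(node2vec_walk(graph, "a", 10, p=1.0, q=2.0, rng=random.Random(0)))
```

With q > 1 the outward weight 1/q shrinks, so the walk tends to stay near its previous node (BFS-like). With q < 1 it is pushed outward (DFS-like).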
BUT!!! I tried reproducing the results for the network of Les Miserables with the parameters used in the original paper (p=1 q=0.5 and p=1 q=2), and couldn't get node2vec to do this 2nd type structural embedding thing. There is something fishy going on, as others also struggle with getting node2vec to embed structurally, here is a paper on it. If someone was able to reproduce their results please tell me how :)

Related

Computing classes of maximal path-equivalent nodes in a rooted DAG

I have a rooted directed acyclic graph with a single root node r.
I'm interested in computing the following equivalence:
"Nodes v and w are maximal path-equivalent iff every maximal path from r contains either both of v and w or none of them"
In particular, I want to find all equivalence classes w.r.t. the above condition, possibly in O(n+m) time (n nodes, m edges).
I feel like this problem is not unknown but I don't know what terms to search for.
If anyone knows what this problem is called or has any ideas on how to solve it, I would appreciate it.

How can we implement efficiently a maximum set coverage arc of fixed cardinality?

I am working on solving the following problem and implementing the solution in C++.
Let us assume that we have an oriented weighted graph G = (V, A, w) and P a set of persons.
We receive a number of queries such that every query gives a person p and two vertices s and d and asks to compute the minimum weighted path between s and d for the person p. One person can have multiple paths.
After the end of all queries I have a number k <= |A| and I should give k arcs such that the number of persons using at least one of the k arcs is maximal (this is a maximum coverage problem).
To solve the first part I implemented Dijkstra's algorithm using priority_queue and I compute the minimal weight between s and d. (Is this a good way to do it?)
To solve the second part I store for every arc the set of persons that use this arc and I use a greedy algorithm to compute the set of arcs (at each stage, I choose an arc used by the largest number of uncovered persons). (Is this a good way to do it?)
Finally, if my algorithms are good, how can I implement them efficiently in C++?
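The greedy step described above can be sketched as follows. It is written in Python for brevity rather than C++, and the function name greedy_max_coverage and the arc_users mapping (arc -> set of persons using it) are assumptions for illustration. This simple version rescans all arcs at every stage. A faster variant could keep the gains in a heap and update them lazily.

```python
def greedy_max_coverage(arc_users, k):
    """Pick up to k arcs, each time taking the arc that covers the most
    still-uncovered persons. arc_users maps arc -> set of persons."""
    remaining = {arc: set(users) for arc, users in arc_users.items()}
    chosen = []
    covered = set()
    for _ in range(k):
        best = max(remaining, key=lambda a: len(remaining[a] - covered),
                   default=None)
        if best is None or not (remaining[best] - covered):
            break  # no arc adds any new person
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered

# Hypothetical example: three arcs, five persons.
arc_users = {
    ("a", "b"): {1, 2, 3},
    ("b", "c"): {3, 4},
    ("c", "d"): {4, 5},
}
chosen, covered = greedy_max_coverage(arc_users, k=2)
print(chosen, covered)
```

Greedy selection is the standard approximation for maximum coverage. It does not guarantee an optimal set of k arcs, only a (1 - 1/e) fraction of the optimum.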

Could you explain this question? I am new to ML and I came across this problem, but its solution is not clear to me

The problem is in the picture
Question's image:
Question 2
Many substances that can burn (such as gasoline and alcohol) have a chemical structure based on carbon atoms; for this reason they are called hydrocarbons. A chemist wants to understand how the number of carbon atoms in a molecule affects how much energy is released when that molecule combusts (meaning that it is burned). The chemist obtains the dataset below. In the column on the right, kJ/mole is the unit measuring the amount of energy released.
You would like to use linear regression (h_a(x) = a0 + a1 x) to estimate the amount of energy released (y) as a function of the number of carbon atoms (x). Which of the following do you think will be the values you obtain for a0 and a1? You should be able to select the right answer without actually implementing linear regression.
A) a0=−1780.0, a1=−530.9 B) a0=−569.6, a1=−530.9
C) a0=−1780.0, a1=530.9 D) a0=−569.6, a1=530.9
Since all the a0s are negative but two of the a1s are positive, let's figure out the latter first.
As you can see, as the number of carbon atoms increases the energy becomes more and more negative, so the relationship cannot be positively correlated, which rules out options C and D.
Then, for the intercept, the correct value is the one that produces the least error. For x = 1 and x = 10 (easier to calculate), the outputs are about -2300 and -7100 for A, and -1100 and -5900 for B, so one would prefer B over A.
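The arithmetic behind that comparison can be checked with a few lines. The sketch below only evaluates h(x) = a0 + a1 x for each option at x = 1 and x = 10. The dataset itself is in the image and is not reproduced here, so no error against the data is computed.

```python
# Candidate (a0, a1) pairs from the four answer options.
options = {
    "A": (-1780.0, -530.9),
    "B": (-569.6, -530.9),
    "C": (-1780.0, 530.9),
    "D": (-569.6, 530.9),
}

def h(a0, a1, x):
    """Linear hypothesis h(x) = a0 + a1 * x."""
    return a0 + a1 * x

predictions = {
    name: [h(a0, a1, x) for x in (1, 10)]
    for name, (a0, a1) in options.items()
}
for name, (y1, y10) in predictions.items():
    print(f"{name}: h(1) = {y1:.1f}, h(10) = {y10:.1f}")
```

Comparing these predictions with the first and last rows of the table is enough to pick between A and B.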
PS: You might be thinking there should be obvious values for a0 and a1 in the data, but there aren't. The intention of the question is to give you a general understanding of the best fit. Also, this way of solving it is kinda machine learning as well

A new edge is inserted into a minimum spanning tree

I am trying to find an algorithm for the following question, with one difference:
the edge weights are not necessarily distinct.
Give an efficient algorithm to test if T remains the minimum-cost spanning tree with the new edge added to G.
The linked question below has a solution, but it does not cover the difference I described above:
the edge weights are not necessarily distinct.
Updating a Minimum spanning tree when a new edge is inserted
Does anyone have an idea?
Well, the naive approach of just using Prim or Kruskal to find the min cost spanning tree of the new graph and then see which one has a lower total cost isn't too bad at O(|E|log|E|).
But we don't need to look at the whole graph.
Suppose your new edge connects vertices A and B. Let C be the parent of A. If B is not a descendant of A, then if A-B is lower cost than A-C, then T is no longer the MST and B should be the new parent of the subtree rooted at A.
If B is a descendant of A, then if A-B is shorter than any of the branches in T along the path from A to B, then T is no longer the MST, and the highest cost edge along that path should be removed, B is the root of the newly disconnected component, and should be added as a child of A.
I believe you may need to check these things a second time, reversing which vertices are A and B. The complexity of this is log|V| where the base of the log is the average number of children per node of T. In the case of T being a straight line, it's O(|V|), but otherwise, I think you could say it is O(log|V|).
First find an MST using one of the existing efficient algorithms.
Now adding an edge (v,w) creates a cycle in the MST. If the newly added edge has the maximum cost among the edges on the cycle then the MST remains as it is. If some other edge on the cycle has the maximum cost, then that's the edge to be removed to get a tree with lower cost.
So we need an efficient way to find the edge with the maximum cost on the cycle. You can climb from v and w until you reach LCA(v, w) (the lowest common ancestor of v and w) to get the edge with the max cost. This takes linear time in the worst case.
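A minimal sketch of this linear-time check, assuming the tree is given as a list of (a, b, cost) edges and rooted at an arbitrary node (the function names and the input layout are made up for illustration). With non-distinct weights a tie is harmless: if the new edge costs exactly as much as the maximum edge on the path, T is still a minimum spanning tree.

```python
from collections import deque

def max_cost_on_tree_path(tree_edges, root, v, w):
    """Largest edge cost on the tree path between v and w."""
    adj = {}
    for a, b, c in tree_edges:
        adj.setdefault(a, []).append((b, c))
        adj.setdefault(b, []).append((a, c))
    # BFS from the root to record parent, depth and cost of the parent edge.
    parent, depth, up_cost = {root: None}, {root: 0}, {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt, c in adj.get(node, []):
            if nxt not in parent:
                parent[nxt], depth[nxt], up_cost[nxt] = node, depth[node] + 1, c
                queue.append(nxt)
    # Climb from the deeper endpoint until both meet at LCA(v, w).
    best = float("-inf")
    while v != w:
        if depth[v] < depth[w]:
            v, w = w, v
        best = max(best, up_cost[v])
        v = parent[v]
    return best

def remains_mst(tree_edges, root, new_edge):
    """True if T is still a minimum spanning tree after adding new_edge."""
    v, w, cost = new_edge
    return cost >= max_cost_on_tree_path(tree_edges, root, v, w)

# Hypothetical example: a path-shaped tree 1-2-3-4.
tree = [(1, 2, 1), (2, 3, 2), (3, 4, 5)]
print(remains_mst(tree, 1, (1, 4, 6)), remains_mst(tree, 1, (1, 4, 3)))
```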
If you are going to answer multiple such queries then pre-processing the MST is probably better. You can pre-process the MST to get a sparse table data structure in O(N lg N) time and then use this data structure to answer max queries in O(lg N) time in the worst case.

How to design an O(m) time algorithm to compute the shortest cycle of G(undirected unweighted graph) that contains s?

How to design an O(m) time algorithm to compute the shortest cycle of G(undirected unweighted graph) that contains s(s ∈ V) ?
You can run a BFS from your node s as the starting point; this will give you a BFS-tree. Afterwards you can build a lowest-common-ancestor (LCA) data structure on this BFS-tree. This can be done for example with Tarjan's lowest-common-ancestor algorithm. I will not go into details here. Given two nodes v and w, the LCA structure lets you find the lowest node in a tree (the BFS-tree in our case) that has v and w as descendants. The idea is that when you consider two nodes that are connected by an edge of G, you want to check whether their paths to the root (s in this case) in the BFS-tree plus the edge that connects them form a cycle (with s). This is the case if their LCA is s.
Assuming you have built the LCA structure, you run a second BFS. When expanding the neighbours of a node v, you also take into consideration the nodes already marked as explored. Suppose x is a neighbour of v such that x has already been explored. If the LCA of v and x is s, then the path from x to s and from v to s in the BFS-tree plus the edge xv form a cycle. The first x and v that you encounter in your second BFS give you the desired result. If no such x exists then s is not contained in any cycle.
The cycle is also the shortest containing s.
The two BFS run in O(m) and the LCA construction can also be done in linear time, hence the whole procedure can be implemented in O(m).
This might be a bit overkill. There surely is a much simpler solution.
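One possible simplification, sketched under the assumption of a simple undirected graph stored as a dict of neighbor lists: instead of an LCA structure, label every node with the child of s whose subtree it belongs to in the BFS-tree. Two nodes have s as their LCA exactly when their labels differ. This sketch also takes the minimum over all such edges rather than stopping at the first one found. The function name and the input layout are made up for illustration.

```python
from collections import deque

def shortest_cycle_through(adj, s):
    """Length of the shortest cycle containing s, or None if there is none."""
    dist = {s: 0}
    branch = {s: None}  # which child of s the node hangs under
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for x in adj[v]:
            if x not in dist:
                dist[x] = dist[v] + 1
                branch[x] = x if v == s else branch[v]
                queue.append(x)
    best = None
    for v in dist:
        if v == s:
            continue
        for x in adj[v]:
            # An edge joining two different branches closes a cycle through s.
            if x != s and branch[v] != branch[x]:
                length = dist[v] + dist[x] + 1
                if best is None or length < best:
                    best = length
    return best

# Hypothetical example: a square s-a-c-b-s.
square = {"s": ["a", "b"], "a": ["s", "c"], "b": ["s", "c"], "c": ["a", "b"]}
print(shortest_cycle_through(square, "s"))
```

Both passes touch every edge a constant number of times, so the running time stays O(m) for the component containing s.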
