GraphX or GraphFrame - community detection in undirected weighted graph - graph-algorithm

I'm trying to identify strongly connected communities within large group (undirected weighted graph). Alternatively, identifying vertices causing connection of sub-groups (communities) that would be otherwise unrelated.
The problem is part of broader Databricks solution thus Spark GraphX and GraphFrames are the first choice for resolving it.
As you can see from attached picture, I need to find vertex "X" as a point where can be split big continuous group identified by connected componect algorithms (val result = g.connectedComponents.run())
Strongly connected components method (for directed graph only), Triangle counting, or LPA community detection algorithms are not suitable, even if all weights are same, e.g. 1.
Picture with point, where should be cut big group ST0
Similar logic is nice described in question "Cut in a Weighted Undirected Connected Graph", but as a mathematical expression only.
Thanks for any hint.
// Vertex DataFrame
val v = sqlContext.createDataFrame(List(
(1L, "A-1", 1), // "St-1"
(2L, "B-1", 1),
(3L, "C-1", 1),
(4L, "D-1", 1),
(5L, "G-2", 1), // "St-2"
(6L, "H-2", 1),
(7L, "I-2", 1),
(8L, "J-2", 1),
(9L, "K-2", 1),
(10L, "E-3", 1), // St-3
(11L, "F-3", 1),
(12L, "Z-3", 1),
(13L, "X-0", 1) // split point
)).toDF("id", "name", "myGrp")
// Edge DataFrame
val e = sqlContext.createDataFrame(List(
(1L, 2L, 1),
(1L, 3L, 1),
(1L, 4L, 1),
(1L, 13L, 5), // critical edge
(2L, 4L, 1),
(5L, 6L, 1),
(5L, 7L, 1),
(5L, 13L, 7), // critical edge
(6L, 9L, 1),
(6L, 8L, 1),
(7L, 8L, 1),
(12L, 10L, 1),
(12L, 11L, 1),
(12L, 13L, 9), // critical edge
(10L, 11L, 1)
)).toDF("src", "dst", "relationship")
val g = GraphFrame(v, e)

Betweenness centrality seems to be one of the algorithms fitting this problem. This method counts how many shortest paths are going thru each vertex from all shortest paths connecting any pair of other vertices.
As far as I know, GraphFrame does not have Betweenness centrality and its Shortest Path just provides number of hoops between vertices w/o listing the actual path. Using bfs (Breadth First Search) method can give us reasonable approximation (note: bfs doesn't reflect distance/edge length neither; it also treats each graph as directed):
Ensure each vertex is defined in both directions to make bfs treating graph as undirected
Declare mutable structure (e.g. ArrayBuffer) pathMembers with following fields [fromId, toId, pathId, vertexId]
For each vertex o in your graph g.vertices (outer loop)
For each vertex i in your graph g.vertices.filter($"id" < lit(o.id)) (inner loop - looks only into i.id smaller than o.id, because shortestPath(o.id, i.id) is exaclty same as shortestPath(i.id, o.id) in undirected graph)
apply val paths = g.bfs.fromExpr("id = " + o.id).toExpr("id = " + i.id).run()
transpose paths to store all vertices in the path for each path and store them in pathMembers
Calculate how many time was each vertexId present in each fromId, toId path (i.e. vertexId count divided by pathId count for each fromId, toId pair)
Sum-up the calculation for each vertexId to obtain betweenness centrality measure
Vertex "X" for the schema will get highest value. Value for vertices directly connected to "X" will drop. Difference will be highes if most of the groups cross-connected by "X" have comparable size.
Note: if your graph is so large the full Betweenness centrality algorithm will be prohibitively long, sub-set of pairs for shortest path calculation could be selected randomly. Sample size is compromise between acceptable processing time and probability choosing majority of pairs within single branch of the graph.

Related

Subgraph isomorphism (or even set membership) in Z3?

I'm trying to find a way to encode a sort of basic subgraph isomorphism in Z3 (preferably z3py). While I know there are papers on this in the abstract, finding any mechanism to do it has eluded me even for very trivial cases, because I'm very new to Z3 in general!
Suppose you have just about the most basic subgraph with nodes (0,1,2) and edges (0,1) with node 2 off on its own, and the supergraph has nodes (0,1,2) and edges (1,2) with node 0 off on its own. You could map the nodes of the subgraph into the supergraph with
0->1,
1->2,
2->0
...as one possible mapping that would satisfy "if these two nodes are connected in the subgraph, their mapped nodes are connected in the supergraph"
So okay :) I tried
from networkx import Graph
from networkx.linalg.graphmatrix import adjacency_matrix
subgraph = Graph()
subgraph.add_nodes_from([0,1,2])
subgraph.add_edges_from([(0,1)])
supergraph = Graph()
supergraph.add_nodes_from([0,1,2])
supergraph.add_edges_from([(1,2)])
s = Solver()
assignments = [Int(f'n{node}') for node in subgraph.nodes]
# each bit assignment in the subgraph belongs to one in the supergraph
assignment_constraint = [ And(assignments[i] >= 0, assignments[i] <= max(supergraph.nodes)) for i in subgraph.nodes ]
# subgraph bits can't be assigned to the same supergraph bits
assignment_distinct = [ Distinct([assignments[i] for i in subgraph.nodes])]
which just gets me as far as "each assignment from subgraph to supergraph should map a node in the subgraph to some node in the supergraph and no two subgraph nodes can be assigned to the same supergraph node"
...but then I get stuck because I keep thinking along the lines of
for edge in subgraph.edges:
s.add( (assignments[edge[0]], assignments[edge[1]]) in supergraph.edges )
...but of course that doesn't work because pythonically those aren't the right sort of keys so that's always false or broken.
So how does one approach that? I can add constraints like "this_var == 1" but get very confused on things like checking membership, ie
>>> assignments[0] == 1.0
n0 == 1 # so that's OK then
>>> assignments[0] in [1.0, 2.0, 3.0]
False # woops, that fails horribly
and I feel like I'm missing a very basic "frame of mind" thing here.
It is relatively straightforward to encode subgraph isomorphism in z3, pretty much along the lines of how you described. However, this encoding is unlikely to scale to large graphs. As you no doubt know, subgraph isomorphism is NP-complete in general, and this encoding will cause z3 to simply enumerate all possibilities and thus will blow up exponentially.
Having said that, here's a straightforward encoding:
from z3 import *
# Subgraph, number of nodes and edges.
# Nodes will be named implicitly from 0 to noOfNodesA - 1
noOfNodesA = 3
edgesA = [(0, 1)]
# Supergraph:
noOfNodesB = 3
edgesB = [(1, 2)]
# Mapping of subgraph nodes to supergraph nodes:
mapping = Array('Map', IntSort(), IntSort())
s = Solver()
# Check that elt is between low and high, inclusive
def InRange(elt, low, high):
return And(low <= elt, elt <= high)
# Check that (x, y) is in the list
def Contains(x, y, lst):
return Or([And(x == x1, y == y1) for x1, y1 in lst])
# Make sure mapping is into the supergraph
s.add(And([InRange(Select(mapping, n1), 0, noOfNodesB-1) for n1 in range(noOfNodesA)]))
# Make sure we map nodes to distinct nodes
s.add(Distinct([Select(mapping, n1) for n1 in range(noOfNodesA)]))
# Make sure edges are preserved:
for x, y in edgesA:
s.add(Contains(Select(mapping, x), Select(mapping, y), edgesB))
# Solve:
r = s.check()
if r == sat:
m = s.model()
for x in range(noOfNodesA):
print ("%s -> %s" % (x, m.evaluate(Select(mapping, x))))
else:
print ("Solver said: %s" % r)
I've added comments along the way, so hopefully you should be able to read the code through; feel free to ask specific questions.
When I run this, I get:
$ python a.py
0 -> 1
1 -> 2
2 -> 0
which finds exactly the mapping you alluded to in your question.
Best of luck!

Vectorizing distance to several points on Octave (Matlab)

I'm writing a k-means algorithm. At each step, I want to compute the distance of my n points to k centroids, without a for loop, and for d dimensions.
The problem is I have a hard time splitting on my number of dimensions with the Matlab functions I know. Here is my current code, with x being my n 2D-points and y my k centroids (also 2D-points of course), and with the points distributed along dimension 1, and the spatial coordinates along the dimension 2:
dist = #(a,b) (a - b).^2;
dx = bsxfun(dist, x(:,1), y(:,1)'); % x is (n,1) and y is (1,k)
dy = bsxfun(dist, x(:,2), y(:,2)'); % so the result is (n,k)
dists = dx + dy; % contains the square distance of each points to the k centroids
[_,l] = min(dists, [], 2); % we then argmin on the 2nd dimension
How to vectorize furthermore ?
First edit 3 days later, searching on my own
Since asking this question I made progress on my own towards vectorizing this piece of code.
The code above runs in approximately 0.7 ms on my example.
I first used repmat to make it easy to do broadcasting:
dists = permute(permute(repmat(x,1,1,k), [3,2,1]) - y, [3,2,1]).^2;
dists = sum(dists, 2);
[~,l] = min(dists, [], 3);
As expected it is slightly slower since we replicate the matrix, it runs at 0.85 ms.
From this example it was pretty easy to use bsxfun for the whole thing, but it turned out to be extremely slow, running in 150 ms so more than 150 times slower than the repmat version:
dist = #(a, b) (a - b).^2;
dists = permute(bsxfun(dist, permute(x, [3, 2, 1]), y), [3, 2, 1]);
dists = sum(dists, 2);
[~,l] = min(dists, [], 3);
Why is it so slow ? Isn't vectorizing always an improvement on speed, since it uses vector instructions on the CPU ? I mean of course simple for loops could be optimized to use it aswell, but how can vectorizing make the code slower ? Did I do it wrong ?
Using a for loop
For the sake of completeness, here's the for loop version of my code, surprisingly the fastest running in 0.4 ms, not sure why..
for i=1:k
dists(:,i) = sum((x - y(i,:)).^2, 2);
endfor
[~,l] = min(dists, [], 2);
Note: This answer was written when the question was also tagged MATLAB. Links to Octave documentation added after the MATLAB tag was removed.
You can use the pdist2MATLAB/Octave function to calculate pairwise distances between two sets of observations.
This way, you offload the bother of vectorization to the people who wrote MATLAB/Octave (and they have done a pretty good job of it)
X = rand(10,3);
Y = rand(5,3);
D = pdist2(X, Y);
D is now a 10x5 matrix where the i, jth element is the distance between the ith X and jth Y point.
You can pass it the kind of distance you want as the third argument -- e.g. 'euclidean', 'minkowski', etc, or you could pass a function handle to your custom function like so:
dist = #(a,b) (a - b).^2;
D = pdist2(X, Y, dist);
As saastn mentions, pdist2(..., 'smallest', k) makes things easier in k-means. This returns just the smallest k values from each column of pdist2's result. Octave doesn't have this functionality, but it's easily replicated using sort()MATLAB/Octave.
D_smallest = sort(D);
D_smallest = D_smallest(1:k, :);

Simple RNN example showing numerics

I'm trying to understand RNNs and I would like to find a simple example that actually shows the one hot vectors and the numerical operations. Preferably conceptual since actual code may make it even more confusing. Most examples I google just show boxes with loops coming out of them and its really difficult to understand what exactly is going on. In the rare case where they do show the vectors its still difficult to see how they are getting the values.
for example I don't know where the values are coming from in this picture https://i1.wp.com/karpathy.github.io/assets/rnn/charseq.jpeg
If the example could integrate LSTMs and other popular extensions that would be cool too.
In the simple RNN case, a network accepts an input sequence x and produces an output sequence y while a hidden sequence h stores the network's dynamic state, such that at timestep i: x(i) ∊ ℝM, h(i) ∊ ℝN, y(i) ∊ ℝP the real valued vectors of M/N/P dimensions corresponding to input, hidden and output values respectively. The RNN changes its state and omits output based on the state equations:
h(t) = tanh(Wxh ∗ [x(t); h(t-1)]), where Wxh a linear map: ℝM+N ↦ ℝN, * the matrix multiplication and ; the concatenation operation. Concretely, to obtain h(t) you concatenate x(t) with h(t-1), you apply matrix multiplication between Wxh (of shape (M+N, N)) and the concatenated vector (of shape M+N) , and you use a tanh non-linearity on each element of the resulting vector (of shape N).
y(t) = sigmoid(Why * h(t)), where Why a linear map: ℝN ↦ ℝP. Concretely, you apply matrix multiplication between Why (of shape (N, P)) and h(t) (of shape N) to obtain a P-dimensional output vector, on which the sigmoid function is applied.
In other words, obtaining the output at time t requires iterating through the above equations for i=0,1,...,t. Therefore, the hidden state acts as a finite memory for the system, allowing for context-dependent computation (i.e. h(t) fully depends on both the history of the computation and the current input, and so does y(t)).
In the case of gated RNNs (GRU or LSTM), the state equations get somewhat harder to follow, due to the gating mechanisms which essentially allow selection between the input and the memory, but the core concept remains the same.
Numeric Example
Let's follow your example; we have M = 4, N = 3, P = 4, so Wxh is of shape (7, 3) and Why of shape (3, 4). We of course do not know the values of either W matrix, so we cannot reproduce the same results; we can still follow the process though.
At timestep t<0, we have h(t) = [0, 0, 0].
At timestep t=0, we receive input x(0) = [1, 0, 0, 0]. Concatenating x(0) with h(0-), we get [x(t); h(t-1)] = [1, 0, 0 ..., 0] (let's call this vector u to ease notation). We apply u * Wxh (i.e. multiplying a 7-dimensional vector with a 7 by 3 matrix) and get a vector v = [v1, v2, v3], where vi = Σj uj Wji = u1 W1i + u2 W2i + ... + u7 W7i. Finally, we apply tanh on v, obtaining h(0) = [tanh(v1), tanh(v2), tanh(v3)] = [0.3, -0.1, 0.9]. From h(0) we can also get y(0) via the same process; multiply h(0) with Why (i.e. 3 dimensional vector with a 3 by 4 matrix), get a vector s = [s1, s2, s3, s4], apply sigmoid on s and get σ(s) = y(0).
At timestep t=1, we receive input x(1) = [0, 1, 0, 0]. We concatenate x(1) with h(0) to get a new u = [0, 1, 0, 0, 0.3, -0.1, 0.9]. u is again multiplied with Wxh, and tanh is again applied on the result, giving us h(1) = [1, 0.3, 1]. Similarly, h(1) is multiplied by Why, giving us a new s vector on which we apply the sigmoid to obtain σ(s) = y(1).
This process continues until the input sequence finishes, ending the computation.
Note: I have ignored bias terms in the above equations because they do not affect the core concept and they make notation impossible to follow

How to correctly implement dropout for convolution in TensorFlow

According to the original paper on Dropout said regularisation method can be applied to convolution layers often improving their performance. TensorFlow function tf.nn.dropout supports that by having a noise_shape parameter to allow the user to choose which parts of the tensors will drop out independently. However, neither the paper nor the documentation give a clear explanation of which dimensions should be kept independently, and the TensorFlow explanation of how noise_shape works is rather unclear.
only dimensions with noise_shape[i] == shape(x)[i] will make independent decisions.
I would assume that for a typical CNN layer output of the shape [batch_size, height, width, channels] we don't want individual rows or columns to drop out by themselves, but rather whole channels (which would be equivalent to a node in a fully connected NN) independently of the examples (i.e. different channels could be dropped for different examples in a batch). Am I correct in this assumption?
If so, how would one go about implementing dropout with such specificity using the noise_shape parameter? Would it be:
noise_shape=[batch_size, 1, 1, channels]
or:
noise_shape=[1, height, width, 1]
from here,
For example, if shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n], each batch and channel component will be kept independently and each row and column will be kept or not kept together.
The code may help explain this.
noise_shape = noise_shape if noise_shape is not None else array_ops.shape(x)
# uniform [keep_prob, 1.0 + keep_prob)
random_tensor = keep_prob
random_tensor += random_ops.random_uniform(noise_shape,
seed=seed,
dtype=x.dtype)
# 0. if [keep_prob, 1.0) and 1. if [1.0, 1.0 + keep_prob)
binary_tensor = math_ops.floor(random_tensor)
ret = math_ops.div(x, keep_prob) * binary_tensor
ret.set_shape(x.get_shape())
return ret
the line random_tensor += supports broadcast. When the noise_shape[i] is set to 1, that means all elements in this dimension will add the same random value ranged from 0 to 1. So when noise_shape=[k, 1, 1, n], each row and column in the feature map will be kept or not kept together. On the other hand, each example (batch) or each channel receives different random values and each of them will be kept independently.

Need a specific example of U-Matrix in Self Organizing Map

I'm trying to develop an application using SOM in analyzing data. However, after finishing training, I cannot find a way to visualize the result. I know that U-Matrix is one of the method but I cannot understand it properly. Hence, I'm asking for a specific and detail example how to construct U-Matrix.
I also read an answer at U-matrix and self organizing maps but it only refers to 1 row map, how about 3x3 map? I know that for 3x3 map:
m(1) m(2) m(3)
m(4) m(5) m(6)
m(7) m(8) m(9)
a 5x5 matrix must me created:
u(1) u(1,2) u(2) u(2,3) u(3)
u(1,4) u(1,2,4,5) u(2,5) u(2,3,5,6) u(3,6)
u(4) u(4,5) u(5) u(5,6) u(6)
u(4,7) u(4,5,7,8) u(5,8) u(5,6,8,9) u(6,9)
u(7) u(7,8) u(8) u(8,9) u(9)
but I don't know how to calculate u-weight u(1,2,4,5), u(2,3,5,6), u(4,5,7,8) and u(5,6,8,9).
Finally, after constructing U-Matrix, is there any way to visualize it using color, e.g. heat map?
Thank you very much for your time.
Cheers
I don't know if you are still interested in this but I found this link
http://www.uni-marburg.de/fb12/datenbionik/pdf/pubs/1990/UltschSiemon90
which explains very speciffically how to calculate the U-matrix.
Hope it helps.
By the way, the site were I found the link has several resources referring to SOMs I leave it here in case anyone is interested:
http://www.ifs.tuwien.ac.at/dm/somtoolbox/visualisations.html
The essential idea of a Kohonen map is that the data points are mapped to a
lattice, which is often a 2D rectangular grid.
In the simplest implementations, the lattice is initialized by creating a 3D
array with these dimensions:
width * height * number_features
This is the U-matrix.
Width and height are chosen by the user; number_features is just the number
of features (columns or fields) in your data.
Intuitively this is just creating a 2D grid of dimensions w * h
(e.g., if w = 10 and h = 10 then your lattice has 100 cells), then
into each cell, placing a random 1D array (sometimes called "reference tuples")
whose size and values are constrained by your data.
The reference tuples are also referred to as weights.
How is the U-matrix rendered?
In my example below, the data is comprised of rgb tuples, so the reference tuples
have length of three and each of the three values must lie between 0 and 255).
It's with this 3D array ("lattice") that you begin the main iterative loop
The algorithm iteratively positions each data point so that it is closest to others similar to it.
If you plot it over time (iteration number) then you can visualize cluster
formation.
The plotting tool i use for this is the brilliant Python library, Matplotlib,
which plots the lattice directly, just by passing it into the imshow function.
Below are eight snapshots of the progress of a SOM algorithm, from initialization to 700 iterations. The newly initialized (iteration_count = 0) lattice is rendered in the top left panel; the result from the final iteration, in the bottom right panel.
Alternatively, you can use a lower-level imaging library (in Python, e.g., PIL) and transfer the reference tuples onto the 2D grid, one at a time:
for y in range(h):
for x in range(w):
img.putpixel( (x, y), (
SOM.Umatrix[y, x, 0],
SOM.Umatrix[y, x, 1],
SOM.Umatrix[y, x, 2])
)
Here img is an instance of PIL's Image class. Here the image is created by iterating over the grid one pixel at a time; for each pixel, putpixel is called on img three times, the three calls of course corresponding to the three values in an rgb tuple.
From the matrix that you create:
u(1) u(1,2) u(2) u(2,3) u(3)
u(1,4) u(1,2,4,5) u(2,5) u(2,3,5,6) u(3,6)
u(4) u(4,5) u(5) u(5,6) u(6)
u(4,7) u(4,5,7,8) u(5,8) u(5,6,8,9) u(6,9)
u(7) u(7,8) u(8) u(8,9) u(9)
The elements with single numbers like u(1), u(2), ..., u(9) as just the elements with more than two numbers like u(1,2,4,5), u(2,3,5,6), ... , u(5,6,8,9) are calculated using something like the mean, median, min or max of the values in the neighborhood.
It's a nice idea calculate the elements with two numbers first, one possible code for that is:
for i in range(self.h_u_matrix):
for j in range(self.w_u_matrix):
nb = (0,0)
if not (i % 2) and (j % 2):
nb = (0,1)
elif (i % 2) and not (j % 2):
nb = (1,0)
self.u_matrix[(i,j)] = np.linalg.norm(
self.weights[i //2, j //2] - self.weights[i //2 +nb[0], j // 2 + nb[1]],
axis = 0
)
In the code above the self.h_u_matrix = self.weights.shape[0]*2 - 1 and self.w_u_matrix = self.weights.shape[1]*2 - 1 are the dimensions of the U-Matrix. With that said, for calculate the others elements it's necessary obtain a list with they neighboors and apply a mean for example. The following code implements that's idea:
for i in range(self.h_u_matrix):
for j in range(self.w_u_matrix):
if not (i % 2) and not (j % 2):
nodelist = []
if i > 0:
nodelist.append((i-1,j))
if i < 4:
nodelist.append((i+1, j))
if j > 0:
nodelist.append((i,j -1))
if j < 4:
nodelist.append((i,j+1))
meanlist = [self.u_matrix[u_node] for u_node in nodelist]
self.u_matrix[(i,j)] = np.mean(meanlist)
elif (i % 2) and (j % 2):
meanlist = [
(i - 1, j),
(i + 1, j),
(i, j - 1),
(i, j + 1)]
self.u_matrix[(i,j)] = np.mean(meanlist)

Resources