Cytoscape: Trying to Determine Which of Two Node Types Is More "Central"

So I have a decently large dataset (4k+ nodes, 16k+ edges), and there are two node types (let's call them "A" and "B," combined ~130 nodes) that should be considered the centers of many sub-networks. I'm trying to create a visualization that can illustrate whether A or B is more "central" to these sub-networks. To put it another way: is A or B the more "important" organizing type? If any of this makes any sense at all, I'd appreciate your thoughts. (As a disclaimer, I'm fairly new to the software but pretty comfortable with the fundamentals. Consider me a decently intelligent noob haha)

There is a tool included with Cytoscape called Network Analyzer (Tools->Analyze Network). What you are asking for is a measure of the "centrality" of the nodes. There are several types of centrality measures, and which one captures "importance" depends on what you mean by importance. Network Analyzer will add new columns with the main measures of centrality: degree centrality (the extent to which the node is a hub), betweenness centrality (the extent to which paths go through the node), and closeness centrality (the extent to which the node is close to other nodes). See https://cytoscape.org/cytoscape-tutorials/presentations/intro-cytoscape-2020-ucsf.html#/12 for a brief discussion of some of the common network centrality measures.
-- scooter
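If you want to turn those columns into a direct A-vs-B comparison, one option is to export the network and aggregate the centrality scores per type outside Cytoscape. Below is a minimal sketch using networkx; the file name and the "type" node attribute are assumptions standing in for however your own data labels A and B nodes.

    # Sketch: compare the mean centrality of two node types.
    # Assumes a GraphML export where each node carries a "type" attribute
    # set to "A" or "B"; both names are placeholders for your own columns.
    import networkx as nx

    G = nx.read_graphml("network.graphml")

    measures = {
        "degree": nx.degree_centrality(G),
        "betweenness": nx.betweenness_centrality(G),
        "closeness": nx.closeness_centrality(G),
    }

    for node_type in ("A", "B"):
        nodes = [n for n, d in G.nodes(data=True) if d.get("type") == node_type]
        for name, scores in measures.items():
            mean = sum(scores[n] for n in nodes) / len(nodes)
            print(f"type {node_type}: mean {name} centrality = {mean:.4f}")

A higher mean betweenness for one type, for instance, would suggest that type sits on more of the paths holding the sub-networks together.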

Related

How is Growing Neural Gas used for clustering?

I know how the algorithm works, but I'm not sure how it determines the clusters. Based on images, I guess that it treats all neurons connected by edges as one cluster, so you might have two clusters, each being a group of neurons that are all connected to one another. But is that really it?
I also wonder: is GNG really a neural network? It doesn't have a propagation function, an activation function, or weighted edges. Isn't it just a graph? I guess that depends a bit on personal opinion, but I would like to hear some.
UPDATE:
This thesis www.booru.net/download/MasterThesisProj.pdf deals with GNG clustering, and on page 11 you can see an example of what looks like clusters of connected neurons. But then I'm also confused by the number of iterations. Let's say I have 500 data points to cluster. Once I put them all in, do I remove them and add them again to adapt the existing network? And how often do I do that?
I mean, I have to re-add them at some point. When a new neuron r is added between two old neurons u and v, some data points formerly belonging to u should now belong to r because it is closer. But the algorithm contains no step that changes the assignment of these data points. And even if I remove them after one iteration and add them all again, the stale assignment of those points for the rest of that first iteration still affects how the network develops, doesn't it?
NG and GNG are a form of self-organizing map (SOM), which are also referred to as "Kohonen neural networks".
These are based on an older, much wider view of neural networks, from when they were still inspired by nature rather than driven by the matrix-operation capabilities of GPUs. Back then, when massive-SIMD architectures did not yet exist, there was nothing wrong with having neurons self-organize rather than being pre-organized in strict layers.
I would not call this clustering, although that term is commonly (ab)used in related work, because I don't see any strong property of these "clusters".
SOMs are literally maps as in geography. A SOM is a set of nodes ("neurons"), usually arranged in a 2d rectangular or hexagonal grid (the map). The positions of the neurons in the input space are then optimized iteratively to fit the data. Because they influence their neighbors, they cannot move freely. Think of wrapping a net around a tree: the knots of the net are your neurons. NG and GNG appear to be pretty much the same thing, but with a more flexible node structure. But actually, a nice property of SOMs is the 2d map that you get.
The only approach I remember for clustering was to project the input data onto the discrete 2d space of the SOM grid, then run k-means on this projection. It will probably work okayish (as in: it will perform similarly to k-means), but I'm not convinced it is theoretically well supported.
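To make that last idea concrete, here is a minimal sketch of the project-then-cluster approach, assuming the MiniSom package and scikit-learn (both are my choice of tooling, not something mandated by the method):

    # Sketch: fit a SOM, project each point to its best-matching grid cell,
    # then run k-means on the 2d grid coordinates. Data shapes, grid size,
    # and cluster count are all illustrative placeholders.
    import numpy as np
    from minisom import MiniSom
    from sklearn.cluster import KMeans

    data = np.random.rand(500, 8)            # 500 points, 8 features

    som = MiniSom(10, 10, data.shape[1], sigma=1.0, learning_rate=0.5)
    som.train_random(data, 1000)             # 1000 training iterations

    # Project: each point maps to the (i, j) coordinates of its winner node.
    projected = np.array([som.winner(x) for x in data])

    labels = KMeans(n_clusters=4, n_init=10).fit_predict(projected)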

choose cluster in hierarchical clustering

How can I choose a cluster if a point is at the same distance from two different points?
Here, X1 is at the same distance from X2 and X3. Can I directly make a cluster of X1/X2/X3, or should I go one step at a time: first X1/X2, then X1/X2/X3?
In general you should always follow the rule of merging two clusters at a time if you want to keep all the typical properties of hierarchical clustering (such as a uniform meaning of each "slice through" the tree). If you start merging many elements in one step, you get an "unbalanced" structure, and the height in the clustering tree will mean different things in different places.
Furthermore, the shortcut only really makes sense for min (single) linkage. If you use average linkage or another, more complex rule, then it is not even true that after merging two points the third one will be the next to be added; it might even end up in a different cluster.
However, clustering of this type (greedy) is just a heuristic with certain properties. Altering it a bit gives you yet another clustering with other properties. Saying which one is "correct" is impossible; both are wrong to some extent, and what matters is your exact usage later on.
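You can see the pairwise-merge behaviour directly in a library implementation. A small sketch with scipy (the three points are chosen so that X1 is equidistant from X2 and X3):

    # Sketch: scipy's hierarchical clustering always merges exactly two
    # clusters per step, even when distances are tied.
    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import pdist

    X = np.array([[0.0, 0.0],    # X1
                  [1.0, 0.0],    # X2: distance 1 from X1
                  [0.0, 1.0]])   # X3: also distance 1 from X1

    for method in ("single", "average"):
        Z = linkage(pdist(X), method=method)
        print(method)
        print(Z)    # each row is one pairwise merge with its height

Under single linkage the third point joins at the same height (1.0), so the tie is invisible in the tree; under average linkage it joins later, at (1 + sqrt(2))/2, which is exactly the situation described above.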

What parameters can I play with using mcl?

I am clustering undirected graphs using mcl. To do so, I have to choose a threshold that determines which nodes are connected, a similarity measure for each edge, and the inflation parameter to tune the granularity of my clustering. I have been playing around with these parameters, but so far the clusters I get seem too large (I did visualizations that suggest the largest clusters should be cut into 2 or more clusters). I was therefore wondering what other parameters I can play with to improve my clustering. (I am currently working with the scheme parameter of mcl to see whether increasing the accuracy helps, but if there are other, more specific parameters that could yield smaller clusters, please let me know.)
There are really mainly two things to consider. The first and most important is outside mcl (http://micans.org/mcl/) itself, namely how the network is constructed. I've written about it elsewhere, but I'll repeat it here because it is important.
If you have a weighted similarity, choose an edge-weight (similarity) cutoff such that the topology of the network becomes informative; both too many edges and too few edges yield little discriminative information in the absence/presence structure of edges. Choose it such that no edges connect things you consider very dissimilar, and such that edges connect things you consider somewhat similar to quite similar. In the case of mcl, the dynamic range in edge weight between "a bit similar" and "very similar" should be, as a rule of thumb, up to one order of magnitude, i.e. two-fold, five-fold, or ten-fold, as opposed to varying from 0.9 to 1.0. Of course, it is possible to give simple networks to mcl and it will just utilise the absence/presence of edges. Make sure the network does not become very dense; a very rough rule of thumb could be to aim for a total number of edges on the order of V * sqrt(V) if the number of nodes (vertices) is V, that is, each node has, on average, on the order of sqrt(V) neighbours.
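As a rough illustration of that density rule, here is a sketch of how one might pick the similarity cutoff so that about V * sqrt(V) candidate edges survive (the data structure is an assumption; adapt it to however your similarities are stored):

    # Sketch: choose a similarity cutoff targeting ~V*sqrt(V) edges in total,
    # i.e. ~sqrt(V) neighbours per node on average.
    import math

    def choose_cutoff(edges, num_nodes):
        # edges: iterable of (node_u, node_v, similarity) candidate pairs
        target = int(num_nodes * math.sqrt(num_nodes))
        ranked = sorted((s for _, _, s in edges), reverse=True)
        return ranked[min(target, len(ranked)) - 1]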
The above, network construction, is really crucial, and it is advisable to try different approaches. Now, given a network, there is really only one mcl parameter to vary: the inflation parameter (the -I option). A good set of values to test with is 1.4, 2, 3, 4, 6.
In summary, if you are exploring, try different ways of network construction, using your knowledge of the data to make the network a meaningful representation, and combine this with trying different mcl inflation values.
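If it helps, the inflation sweep is easy to script. A minimal sketch calling the mcl binary (file names are placeholders; "graph.abc" is assumed to be in mcl's tab-separated label format):

    # Sketch: run mcl over the suggested inflation values, keeping each result.
    import subprocess

    for inflation in (1.4, 2, 3, 4, 6):
        out = f"out.graph.I{round(inflation * 10)}"
        subprocess.run(
            ["mcl", "graph.abc", "--abc", "-I", str(inflation), "-o", out],
            check=True,
        )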

Determining groups in a hierarchical cluster

I have an algorithm that can group data into a hierarchical cluster tree. The algorithm is the one described in Toby Segaran's Programming Collective Intelligence. The tree output is a binary tree with a "distance" value at each node that tells you how far apart the two child nodes are.
I can then display this as a dendrogram, which makes it fairly easy for a human to spot which values are grouped together. However, I'm having difficulty coming up with an algorithm that automatically decides what the groups should be. I'd like to be able to determine automatically:
The number of groups
Which points should be placed in each group
Is there a standard algorithm for this?
I think there is no default way to do this. Simple "manual" methods would be to either:
specify the number of clusters you want/expect
set a threshold for the maximum distance between two nodes; any nodes with a larger distance belong to another cluster
Both are shown in the sketch below.
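Here is a minimal scipy sketch of both; the data, cluster count, and distance threshold are placeholders:

    # Sketch: two ways to cut a hierarchical clustering into flat groups.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(50, 2)                 # placeholder data
    Z = linkage(X, method="average")          # the hierarchical tree

    labels_k = fcluster(Z, t=4, criterion="maxclust")    # fix the number of groups
    labels_d = fcluster(Z, t=0.5, criterion="distance")  # cut at a distance threshold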
There are some automatic methods to determine the number of clusters. R has the Dynamic Tree Cut package, which automatically deals with this problem; pvclust could also be used. Two more methods for dealing with this problem are described in Salvador (2002) and Daniels (2006).
I have found that the Calinski-Harabasz index (also known as the Variance Ratio Criterion) works well with dendrograms produced by hierarchical clustering. You can find more information (and a comparative study) in this paper.
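For what it's worth, scanning cluster counts with that index takes only a few lines; a sketch reusing X and Z from the previous snippet (scikit-learn provides the index):

    # Sketch: pick the cut that maximizes the Calinski-Harabasz index.
    from scipy.cluster.hierarchy import fcluster
    from sklearn.metrics import calinski_harabasz_score

    scores = {}
    for k in range(2, 10):
        labels = fcluster(Z, t=k, criterion="maxclust")
        scores[k] = calinski_harabasz_score(X, labels)

    best_k = max(scores, key=scores.get)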

Reinforcement learning of a policy for multiple actors in large state spaces

I have a real-time domain where I need to assign an action to each of N actors, involving moving one of O objects to one of L locations. At each time step, I'm given a reward R indicating the overall success of all actors.
I have 10 actors, 50 unique objects, and 1000 locations, so for each actor I have to select from 50 * 1000 = 50,000 possible actions. Additionally, there are 50 environmental factors I may take into account, such as how close each object is to a wall or to an actor. This results in 50,000 * 50 = 2,500,000 state-action combinations to evaluate per actor.
Nearly all reinforcement learning algorithms seem unsuitable for this domain.
First, they nearly all involve evaluating the expected utility of each action in a given state. My state space is huge, so it would take forever for a policy to converge using something as primitive as Q-learning, even with function approximation. And even if it could, it would take too long to find the best action among tens of thousands of actions at each time step.
Secondly, most algorithms assume a single reward per actor, whereas the reward I'm given may be polluted by the mistakes of one or more other actors.
How should I approach this problem? I've found no code for domains like this, and the few academic papers I've found on multi-actor reinforcement learning algorithms don't provide nearly enough detail to reproduce the proposed algorithms.
Clarifying the problem
N=10 actors
O=50 objects
L=1K locations
S=50 features
As I understand it, you have a warehouse with N actors, O objects, L locations, and some walls. The goal is to make sure that each of the O objects ends up in one of the L locations in the least amount of time. The action space consists of decisions on which actor should be moving which object to which location at any point in time. The state space consists of some S = 50 environmental factors, including features such as the proximity of actors and objects to walls and to each other. So, at first glance, if each factor can take X values, you have X^S * (O*L)^N action values, with most action dimensions discrete.
The problem as stated is not a good candidate for reinforcement learning. However, it is unclear what the environmental factors really are and how many of the restrictions are self-imposed. So, let's look at a related, but different problem.
Solving a different problem
We look at a single actor. Say it knows its own position in the warehouse, the positions of the other 9 actors, the positions of the 50 objects, and the 1000 locations. It wants to achieve the maximum reward, which happens when each of the 50 objects is at one of the 1000 locations.
Suppose we have a P-dimensional representation of position in the warehouse. Each position could be occupied by the actor in focus, one of the other actors, an object, or a location. The action is to choose an object and a location. Therefore, we have a 4P-dimensional state space and a P^2-dimensional action space; in other words, a value function over 4P + P^2 dimensions. By further experimenting with the representation, using different-precision encodings for different parameters, and using options [2], it might be possible to bring the problem into the practical realm.
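To make the idea of a compact value function concrete, here is a heavily simplified sketch of a linear Q-function over such a factored encoding. Everything in it (the feature layout, encoding the action as the chosen object's and location's P-dimensional positions rather than the full P^2 pairing, the learning rate) is my own illustration, not something taken from the answer or the cited papers:

    # Sketch: linear action-value function over a factored (state, action)
    # encoding for a single actor. P and all shapes are illustrative only.
    import numpy as np

    P = 2                                    # dims of one position

    def features(state, obj_pos, loc_pos):
        # state: 4P-dim summary (own position plus other actors, objects,
        # locations); action: the chosen object's and location's positions.
        return np.concatenate([state, obj_pos, loc_pos])

    w = np.zeros(4 * P + 2 * P)              # weights of the linear Q

    def q_value(state, obj_pos, loc_pos):
        return w @ features(state, obj_pos, loc_pos)

    def td_update(state, obj_pos, loc_pos, reward, v_next,
                  alpha=0.01, gamma=0.99):
        # One semi-gradient TD update toward reward + gamma * v_next.
        global w
        phi = features(state, obj_pos, loc_pos)
        w += alpha * (reward + gamma * v_next - w @ phi) * phi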
For examples of learning in complicated spatial settings, I would recommend reading the Konidaris papers [1] and [2].
[1] Konidaris, G., Osentoski, S. & Thomas, P., 2008. Value function approximation in reinforcement learning using the Fourier basis. Computer Science Department Faculty Publication Series, p. 101.
[2] Konidaris, G. & Barto, A., 2009. Skill discovery in continuous reinforcement learning domains using skill chaining. In Y. Bengio et al., eds., Advances in Neural Information Processing Systems 22, pp. 1015-1023.
