Identifying nodes with many, but thin connections - neo4j

I need to identify (and label) nodes with many relationships (say, 10 on a possible scale of 1 to 60) but weak weights for the relationships (say 1 or 2 on a possible scale of 1 to 100). I could write a Cypher query to find them. I don’t need help with that. What I want to ask is, is there a GDS metric for this?

You could use a combination of degree and weighted degree (degree centrality computed with and without a relationship weight property).
If you want to materialise such a filtered GDS graph, you could use the subgraph projection option, which allows you to filter on mutated properties.
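To make that concrete, here is a minimal sketch using the official graphdatascience Python client: stream plain degree (relationship count) and weighted degree (sum of weights) from the same projection, then keep nodes with many relationships but a low average weight. The graph name, node label, relationship type, weight property, thresholds and the ThinlyConnected label are all assumptions for illustration.

# Sketch using the official graphdatascience Python client (pip install graphdatascience).
# Graph name, labels, relationship type, property names and thresholds are assumptions.
from graphdatascience import GraphDataScience

gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "secret"))

# Project the graph with the relationship weight so both metrics come from one projection.
G, _ = gds.graph.project(
    "thin-connections",
    "Entity",
    {"RELATED": {"orientation": "UNDIRECTED", "properties": "weight"}},
)

# Plain degree = number of relationships; weighted degree = sum of the weights.
deg = gds.degree.stream(G)  # columns: nodeId, score
wdeg = gds.degree.stream(G, relationshipWeightProperty="weight")

df = deg.merge(wdeg, on="nodeId", suffixes=("_count", "_weight"))

# "Many but thin": at least 10 relationships, average weight of 2 or less.
thin = df[(df.score_count >= 10) & (df.score_weight / df.score_count <= 2)]

# Label the matching nodes back in Neo4j.
gds.run_cypher(
    "MATCH (n) WHERE id(n) IN $ids SET n:ThinlyConnected",
    {"ids": [int(i) for i in thin.nodeId]},
)

The same two scores could also be written onto the in-memory graph with gds.degree.mutate and then used with the subgraph/filter projection mentioned above, if you want a filtered GDS graph rather than a node label.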

Related

Is it OK to use a graph DB on a large dense graph?

We want to present our data in a graph and thought about using one of the graph DBs. During our vendor investigation process, one of the experts suggested that using a graph DB on a dense graph won't be efficient and that we'd be better off with a columnar database like Cassandra.
Our use case will probably end up with nodes connected to tens of millions of other nodes, with about 30% overlap between different nodes - so in a way it's probably a dense graph. Overall there will probably be a few billion nodes.
Looking at the Neo4j source code I found a reference to an isDense flag on nodes that is used to differentiate the processing logic - I'm not sure what it does. I also wonder whether it was added as an edge-case patch and won't work well if most of the nodes in the graph are dense.
Does anyone have experience with graph DBs on dense graphs, and should they be considered in such cases?
All opinions are appreciated!
I gave your use case some thought. Given that your graph is very dense (number of relationships roughly equal to the number of nodes squared) and that you only seem to need a few-hop traversals from a particular node along different relationships, I'd actually recommend you also try out a columnar database.
Graph databases tend to work well when you have sparse graphs (number of relationships << number of nodes^2) and deep traversals - from 4-5 hops to hundreds of hops. If I understood your use case correctly, a columnar database should generally outperform a graph database there.
A graph DB usually comes to mind when multiple tables are linked to each other, which is a perfect use case for a graph DB.
We run JanusGraph at a scale of 20B vertices and 15B edges. It is not a large dense graph with vertices connected to tens of millions of other vertices, but we still see the super-node case, where a vertex is connected to far more vertices than expected. In our use case, though, when traversing (DFS) we always limit ourselves to at most N child nodes per node and a limited depth, say M, which is absolutely fine compared with the number of joins the same query would need in a non-graph DB (columnar, relational, Athena, etc.).
The only way (I feel) to get all relations of a node is to do a full DFS, or to keep inner-joining datasets until no common data is found.
I'm excited to hear about other creative solutions.
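For what it's worth, the bounded traversal described above (at most N children per node, at most M hops deep) is easy to sketch in plain Python over an adjacency list; the dict below just stands in for whatever backend (JanusGraph, Neo4j, ...) you actually query, and the limits are arbitrary.

# Plain-Python sketch of the bounded traversal described above: visit at most
# max_children neighbours per node (the "N") and go at most max_depth hops deep (the "M").
def bounded_dfs(adjacency, start, max_children, max_depth):
    visited = {start}
    stack = [(start, 0)]
    order = []
    while stack:
        node, depth = stack.pop()
        order.append(node)
        if depth == max_depth:
            continue
        # Only expand the first N children; on a super node this caps the fan-out.
        for neighbour in adjacency.get(node, [])[:max_children]:
            if neighbour not in visited:
                visited.add(neighbour)
                stack.append((neighbour, depth + 1))
    return order

graph = {"a": ["b", "c", "d"], "b": ["e"], "c": ["f", "g"]}
print(bounded_dfs(graph, "a", max_children=2, max_depth=2))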
I do not have experience with dense graphs in graph databases, but I do not think that a dense graph is a problem in itself. Since you are going to use graph algorithms, I suppose you would benefit from using a graph database (depending on the algorithms' complexity - the more "hops", the more you benefit from constant-time edge traversal).
A good trade-off could be to use one of the non-native graph databases (like Titan, its follow-up JanusGraph, MongoDB, ...), which actually use column-based storage (Cassandra, Berkeley DB, ...) as their backend.

What is the best way to find all possible paths between two nodes at large network scale?

I wonder what the best way is to find all possible paths from a source to a destination in a very large network (given as a network matrix), e.g. 5000 nodes. I have used this function, which is implemented using stacks, but its limit seems to be about 60 nodes and it cannot retrieve the paths for a 200-node network. DFS (depth-first search) could be another option, but that algorithm also uses a stack, so I am afraid of its scalability. So, is there an efficient way to find all paths between two given nodes in such a large network?
Depth-first search is the only way to make this scalable at the level you specify, at least until quantum computing gives us effectively unlimited processing power. Be aware, though, that the number of simple paths between two nodes grows factorially with the node count: with 100% adjacency among even a few hundred nodes it already dwarfs the number of atoms in the observable universe (roughly 10^80), so "all paths" can only ever be enumerated lazily or under a length bound.
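If enumeration really is required, the practical approach is a DFS that yields simple paths lazily, so the caller can stop early or bound the path length instead of materialising the whole set. A minimal sketch of that idea (networkx's all_simple_paths, with its cutoff argument, does essentially the same thing):

# Lazy enumeration of simple paths (no repeated nodes) between two nodes.
# A generator lets the caller stop early or bound the path length instead of
# materialising an astronomically large collection of paths.
import itertools

def simple_paths(adjacency, source, target, max_len=None):
    path = [source]
    on_path = {source}

    def dfs(node):
        if node == target:
            yield list(path)
            return
        if max_len is not None and len(path) > max_len:
            return
        for neighbour in adjacency.get(node, []):
            if neighbour not in on_path:
                path.append(neighbour)
                on_path.add(neighbour)
                yield from dfs(neighbour)
                path.pop()
                on_path.remove(neighbour)

    yield from dfs(source)

graph = {1: [2, 3], 2: [3, 4], 3: [4], 4: []}
# Take only the first 10 paths, however many exist.
print(list(itertools.islice(simple_paths(graph, 1, 4, max_len=5), 10)))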

Solving a weighted network flow using Neo4J

I have a bipartite graph (guy and girl nodes) where nodes are connected by weighted edges (how compatible the girl-guy pair is) and each node has a capacity of 5 (each guy/girl can be matched to 5 people of the opposite gender). I need to find the best possible matching to maximize the total weight.
This can be formulated as a weighted network flow - each guy is a source of 5 units, each girl is a sink of 5 units, and each possible arc has capacity of 1 unit. The problem can be solved either using linear programming, or a graph traversal algorithm (such as Ford–Fulkerson).
I'm currently looking into possible solutions using Neo4j - does anybody have any idea how to go about it? (or should I just go with a linear programming solution...)
I think it is something like this: find the five highest-weighted COMPATIBLE relationships for a given guy, ordering by the relationship weight in descending order, and then create each of them as a separate MATCH relationship.
MATCH (guy:Guy)-[rel:COMPATIBLE]->(girl:Girl)
WHERE guy.id = 'xx'
WITH guy, rel, girl
ORDER BY rel.weight DESC
LIMIT 5
CREATE (guy)-[:MATCH]->(girl)
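If you want the globally optimal assignment rather than a greedy per-guy pick (note the query above also doesn't enforce each girl's capacity), the flow formulation from the question can be solved outside Neo4j. Below is a minimal sketch using networkx's min-cost-flow routine, maximising total compatibility by negating the weights; the names, weights and the capacity of 2 are made up for illustration (the question uses 5).

# Sketch of the flow formulation from the question using networkx's min-cost flow.
# Maximising total compatibility == minimising the negated weights.
import networkx as nx

guys = ["g1", "g2"]
girls = ["d1", "d2", "d3"]
compatibility = {("g1", "d1"): 80, ("g1", "d2"): 60, ("g1", "d3"): 10,
                 ("g2", "d1"): 40, ("g2", "d2"): 90, ("g2", "d3"): 70}

CAP = 2  # per-person capacity (5 in the question)

G = nx.DiGraph()
for guy in guys:
    G.add_edge("SOURCE", guy, capacity=CAP, weight=0)
for girl in girls:
    G.add_edge(girl, "SINK", capacity=CAP, weight=0)
for (guy, girl), score in compatibility.items():
    G.add_edge(guy, girl, capacity=1, weight=-score)  # negate to maximise weight

flow = nx.max_flow_min_cost(G, "SOURCE", "SINK")
matches = [(guy, girl) for guy in guys for girl, f in flow[guy].items() if f == 1]
print(matches)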

What parameters can I play with using mcl?

I am clustering undirected graphs using mcl. To do so, I have to choose a threshold below which nodes are connected, a similarity measure for each edge, and the inflation parameter to tune the granularity of my clustering. I have been playing around with these parameters, but so far the clusters I get seem to be too large (I made visualizations that suggest the largest clusters should be cut into 2 or more clusters). I was therefore wondering what other parameters I can play with to improve my clustering. I am currently working with the -scheme parameter of mcl to see whether increasing the accuracy helps, but if there are other, more specific parameters that could help me get smaller clusters, please let me know.
There are really mainly two things to consider. The first and most important is outside mcl (http://micans.org/mcl/) itself, namely how the network is constructed. I've written about it elsewhere, but I'll repeat it here because it is important.
If you have a weighted similarity, choose an edge-weight (similarity) cutoff such that the topology of the network becomes informative; i.e. too many edges or too few edges yield little discriminative information in the absence/presence structure of edges. Choose it such that no edges connect things you consider very dissimilar, and that edges connect things you consider somewhat similar to quite similar. In the case of mcl, the dynamic range in edge weight between 'a bit similar' and 'very similar' should be, as a rule of thumb, one order of magnitude, i.e. two-fold or five-fold or ten-fold, as opposed to varying from 0.9 to 1.0. Of course, it is possible to give simple networks to mcl and it will just utilise the absence/presence of edges.
Make sure the network does not become very dense - a very rough rule of thumb could be to aim for a total number of edges that is in the order of V * sqrt(V) if the number of nodes (vertices) is V, that is, each node has, on average, in the order of sqrt(V) neighbours.
The above, network construction, is really crucial, and it is advisable to try different approaches. Now, given a network, there is really only one mcl parameter to vary: the inflation parameter (the -I option). A good set of values to test with is 1.4, 2, 3, 4, 6.
In summary, if you are exploring, try different ways of network construction, using your knowledge of the data to make the network a meaningful representation, and combine this with trying different mcl inflation values.
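A minimal sketch of that workflow, assuming numpy is available, the pairwise similarities live in a file called similarities.npy (a made-up name), and the mcl binary is installed on the PATH: threshold the similarity matrix so roughly V * sqrt(V) edges survive, write the network in mcl's ABC label format, and sweep the suggested inflation values.

# Sketch of the workflow above: threshold a similarity matrix so the network keeps
# roughly V * sqrt(V) edges, write it in mcl's ABC (label) format, then sweep inflation.
# Assumes the mcl binary (http://micans.org/mcl/) is installed and on the PATH.
import math
import subprocess
import numpy as np

sim = np.load("similarities.npy")          # square symmetric similarity matrix (assumed file)
V = sim.shape[0]
target_edges = int(V * math.sqrt(V))       # rough density rule of thumb from the answer

# Choose the cutoff that keeps about target_edges of the strongest pairs.
iu = np.triu_indices(V, k=1)
values = np.sort(sim[iu])[::-1]
cutoff = values[min(target_edges, len(values) - 1)]

with open("network.abc", "w") as f:
    for i, j in zip(*iu):
        if sim[i, j] >= cutoff:
            f.write(f"n{i}\tn{j}\t{sim[i, j]:.4f}\n")

# Inflation is the main clustering knob; higher values give finer clusters.
for inflation in (1.4, 2, 3, 4, 6):
    subprocess.run(
        ["mcl", "network.abc", "--abc", "-I", str(inflation),
         "-o", f"out.I{int(inflation * 10)}"],
        check=True,
    )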

Determining groups in a hierarchical cluster

I have an algorithm that can group data into a hierarchical cluster tree. The algorithm is the one described in Toby Segaran's Programming Collective Intelligence. The tree output is a binary tree with a "distance" value at each node that tells you how far apart the two child nodes are.
I can then display this as a dendrogram, and it makes it fairly easy for a human to spot which values are grouped together. However, I'm having difficulty coming up with an algorithm that automatically decides what the groups should be. I'd like to be able to determine automatically:
The number of groups
Which points should be placed in each group
Is there a standard algorithm for this?
I think there is no default way to do this. Simple 'manual' methods (both are sketched in the code below) would be to either:
specify the number of clusters you want/expect
set a threshold for the maximum distance between two nodes; any nodes joined at a larger distance belong to different clusters
There are some automatic methods to determine the number of clusters. R has the Dynamic Tree Cut package, which deals with this problem automatically; pvclust could also be used. Two more methods for this problem are described in Salvador (2002) and Daniels (2006).
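Both manual cuts map directly onto SciPy's fcluster, if you rebuild the tree with scipy.cluster.hierarchy (or convert your tree into a linkage matrix). A minimal sketch on toy data; the data, linkage method and thresholds are placeholders:

# The two 'manual' cuts above: 'maxclust' fixes the number of clusters,
# 'distance' cuts the tree at a merge-height threshold.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))            # toy data standing in for your observations

Z = linkage(X, method="average")        # the same kind of tree a dendrogram displays

labels_k = fcluster(Z, t=3, criterion="maxclust")    # ask for exactly 3 groups
labels_h = fcluster(Z, t=2.5, criterion="distance")  # cut where merge distance exceeds 2.5
print(labels_k[:10], labels_h[:10])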
I have found out that the Calinski-Harabasz index (also known as Variance Ratio Criterion) works well with dendrograms produced by hierarchical clustering. You can find more information (and a comparative study) in this paper.
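scikit-learn exposes this index as calinski_harabasz_score, so a simple recipe is to cut the same linkage at several candidate cluster counts and keep the best-scoring one. A sketch on made-up blob data (the candidate range 2-8 is arbitrary):

# Score candidate cuts of one dendrogram with the Calinski-Harabasz index
# and keep the number of clusters that maximises it.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (0, 5, 10)])  # three toy blobs

Z = linkage(X, method="average")
scores = {k: calinski_harabasz_score(X, fcluster(Z, t=k, criterion="maxclust"))
          for k in range(2, 9)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])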
