I have a bipartite graph (guy and girl nodes) where nodes are connected with weighted edges (how compatible the girl-guy pair is) and each node has a capacity of 5 (each guy/girl can be matched to 5 people of the opposite gender). I need to find the best possible matching to maximize the weights.
This can be formulated as a weighted network-flow problem: each guy is a source of 5 units, each girl is a sink of 5 units, and each possible guy-girl arc has a capacity of 1 unit. The problem can be solved either using linear programming or a flow algorithm (such as Ford–Fulkerson).
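For reference, here is a rough sketch of that formulation outside Neo4j, using networkx's min-cost flow (the node names and compatibility weights below are made up):

import networkx as nx

# toy data - guy/girl names and compatibility weights are placeholders
guys = ["g1", "g2"]
girls = ["f1", "f2"]
compat = {("g1", "f1"): 7, ("g1", "f2"): 3, ("g2", "f1"): 4, ("g2", "f2"): 9}

G = nx.DiGraph()
for guy in guys:
    G.add_edge("source", guy, capacity=5, weight=0)   # each guy supplies 5 units
for girl in girls:
    G.add_edge(girl, "sink", capacity=5, weight=0)    # each girl absorbs 5 units
for (guy, girl), w in compat.items():
    G.add_edge(guy, girl, capacity=1, weight=-w)      # negate weights: max weight == min cost

flow = nx.max_flow_min_cost(G, "source", "sink")
matches = [(guy, girl) for (guy, girl) in compat if flow[guy].get(girl, 0) == 1]
print(matches)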
I'm currently looking into possible solutions using Neo4j - does anybody have any idea how to go about it? (or should I just go with a linear programming solution...)
I think it is something like this: find the five highest-weighted COMPATIBLE relationships, ordering by the relationship weight in descending order, and then create each of them as a separate MATCH relationship.
// take this guy's five strongest COMPATIBLE relationships and turn them into MATCH relationships
MATCH (guy:Guy)-[rel:COMPATIBLE]->(girl:Girl)
WHERE guy.id = 'xx'
WITH guy, rel, girl
ORDER BY rel.weight DESC
LIMIT 5
CREATE (guy)-[:MATCH]->(girl)
I need to identify (and label) nodes with many relationships (say, 10 on a possible scale of 1 to 60) but weak weights for the relationships (say 1 or 2 on a possible scale of 1 to 100). I could write a Cypher query to find them. I don’t need help with that. What I want to ask is, is there a GDS metric for this?
You could use a combination of degree and weighted degree.
If you want to construct such a GDS graph, you could use the subgraph option, which allows you to filter on mutated properties (a rough sketch follows below).
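A hedged sketch of what that could look like, run through the Neo4j Python driver. The procedure names assume GDS 2.x, and the projected graph name 'g', the property names, and the thresholds are all placeholders, not a tested recipe:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # unweighted degree = number of relationships per node
    session.run("CALL gds.degree.mutate('g', {mutateProperty: 'degree'})")

    # weighted degree = sum of the relationship 'weight' property per node
    session.run("""
        CALL gds.degree.mutate('g', {
          mutateProperty: 'weightedDegree',
          relationshipWeightProperty: 'weight'
        })
    """)

    # subgraph projection filtered on the mutated properties:
    # many relationships (>= 10) but a low summed weight (<= 20), i.e. weak average weight
    session.run("""
        CALL gds.graph.filter(
          'weakHubs', 'g',
          'n.degree >= 10.0 AND n.weightedDegree <= 20.0',
          '*'
        )
    """)

driver.close()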
So I have a decently large dataset (4k+ nodes, 16k+ edges), and there are two node types (let's call them "A" and "B," combined ~130 nodes) that should be considered the centers of many sub-networks. I'm trying to create a visualization that can illustrate whether A or B is more "central" to these sub-networks. To put it another way, is A or B the more "important" organizing type? If any of this makes any sense at all, I'd appreciate your thoughts. (As a disclaimer, I'm fairly new to the software but pretty comfortable with the fundamentals. Consider me a decently intelligent noob haha)
There is a tool included with Cytoscape called Network Analyzer (Tools->Analyze Network). What you are asking for is a measure of the "centrality" of the nodes. There are several types of centrality measures that can be used for "importance" depending on what you mean by importance. Network Analyzer will provide new columns with the main measures of centrality: degree centrality (the extent to which the node is a hub), betweenness centrality (the extent to which paths go through this node) and closeness centrality (the extent to which this node is close to other nodes). See https://cytoscape.org/cytoscape-tutorials/presentations/intro-cytoscape-2020-ucsf.html#/12 for a brief discussion of some of the common network centrality measures.
-- scooter
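If you want to cross-check the same measures outside Cytoscape, here is a quick sketch with networkx; the file name and the rule used to tell A nodes from B nodes are placeholders you would replace with your own export and lookup:

import networkx as nx

G = nx.read_edgelist("network.tsv", delimiter="\t")   # hypothetical edge-list export of the network

deg = nx.degree_centrality(G)
btw = nx.betweenness_centrality(G)
clo = nx.closeness_centrality(G)

# placeholder rule for distinguishing A nodes from B nodes
node_type = {n: ("A" if str(n).startswith("A") else "B") for n in G}

# average each centrality over the two candidate "organizer" types
for t in ("A", "B"):
    nodes = [n for n in G if node_type[n] == t]
    print(t,
          sum(deg[n] for n in nodes) / len(nodes),
          sum(btw[n] for n in nodes) / len(nodes),
          sum(clo[n] for n in nodes) / len(nodes))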
We want to present our data in a graph and thought about using one of the graph DBs. During our vendor investigation process, one of the experts suggested that using a graph DB on a dense graph won't be efficient and that we'd be better off with a columnar DB like Cassandra.
I gave your use case some thought. Given that your graph is very dense (number of relationships = number of nodes squared) and that you seem to need only a few-hop traversals from a particular node along different relationships, I'd actually recommend you also try out a columnar database.
Graph databases tend to work well when you have sparse graphs (number of relationships << number of nodes ^ 2) and with deep traversals - from 4-5 hops to hundreds of hops. If I understood your use case correctly, a columnar database should generally outperform a graph database there.
Our use case will probably end up with nodes connected to tens of millions of other nodes, with about 30% overlap between different nodes - so in a way, it's probably a dense graph. Overall there will probably be a few billion nodes.
Looking into the Neo4j source code, I found some references to an isDense flag on nodes that is used to differentiate the processing logic - not sure what it does. But I also wonder whether it was added as an edge-case patch and won't work well if most of the nodes in the graph are dense.
Does anyone have any experience with graph DBs on dense graphs, and should they be considered in such cases?
All opinions are appreciated!
A graph DB comes to mind when multiple tables are linked to each other, which is a perfect use case for a graph DB.
We are handling JanusGraph at a scale of 20B vertices and 15B edges. It's not a large dense graph with single vertices connected to tens of millions of other vertices, but we still observed the super-node case, where a vertex is connected to far more vertices than expected. In our use case, though, while doing a traversal (DFS) we always traverse at most N child nodes of a node and to a limited depth, say M (a minimal sketch follows below), which is absolutely fine considering the number of joins that would be required in non-graph DBs (columnar, relational, Athena, etc.).
The only way (I feel) to get all relations of a node is to do a full DFS, or to inner-join datasets until no common data is found.
Excited to hear about other creative solutions.
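For what it's worth, a minimal sketch of that bounded traversal; the adjacency dict just stands in for whatever the database returns:

def bounded_dfs(adj, start, max_children, max_depth):
    # visit at most max_children neighbours per node, never deeper than max_depth
    seen = {start}
    stack = [(start, 0)]
    order = []
    while stack:
        node, depth = stack.pop()
        order.append(node)
        if depth == max_depth:
            continue
        for child in adj.get(node, [])[:max_children]:   # cap the fan-out for super nodes
            if child not in seen:
                seen.add(child)
                stack.append((child, depth + 1))
    return order

adj = {"a": ["b", "c", "d"], "b": ["e"], "c": ["f", "g"]}
print(bounded_dfs(adj, "a", max_children=2, max_depth=2))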
I do not have experience with dense graphs in graph databases, but I do not think that a dense graph is a problem. Since you are going to use graph algorithms, I suppose you would benefit from using a graph database (depending on the algorithms' complexity - the more "hops", the more you benefit from constant-time edge traversal).
A good trade-off could be to use one of the non-native graph databases (like Titan, its follow-up JanusGraph, MongoDB, ...), which actually use column-based storage (Cassandra, Berkeley DB, ...) as their backend.
I have a question about weighted graphs in Neo4j. Is a property (like ".setProperty("cost", weight)") the only way of constructing a weighted graph? The problem is that a program which frequently reads this weight via "(Double) rel.getProperty("cost")" becomes too slow, because the cast takes some time.
Well, you actually could encode the weight into the relationship type, which is faster, something like:
CREATE (a)-[:`KNOWS_0.34`]->(b)
See http://console.neo4j.org/r/2dez98 for an example.
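A hedged sketch of reading such an encoded weight back out, here via the Neo4j Python driver (the original question was about the embedded Java API, so treat this only as an illustration of the encoding; connection details and the LIMIT are placeholders):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    for record in session.run("MATCH (a)-[r]->(b) RETURN type(r) AS t LIMIT 10"):
        rel_type = record["t"]                       # e.g. 'KNOWS_0.34'
        weight = float(rel_type.rsplit("_", 1)[1])   # parse the weight out of the type name
        print(rel_type, weight)

driver.close()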
I have an algorithm that can group data into a hierarchical cluster tree. The algorithm is the one described in Toby Segaran's Programming Collective Intelligence. The tree output is a binary tree with a "distance" value at each node that tells you how far apart the two child nodes are.
I can then display this as a dendrogram, and it makes it fairly easy for a human to spot which values are grouped together. However, I'm having difficulty coming up with an algorithm that automatically decides what the groups should be. I'd like to be able to determine automatically:
The number of groups
Which points should be placed in each group
Is there a standard algorithm for this?
I think there is no default way to do this. Simple 'manual' methods (sketched in code after this list) would be to either:
specify the number of clusters you want/expect
set a threshold for the maximum distance between two nodes; any nodes with a larger distance belong to another cluster
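A minimal sketch of both manual cuts with scipy; the data and the thresholds are placeholders:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 3)           # placeholder observations
Z = linkage(X, method="average")    # hierarchical cluster tree

labels_k = fcluster(Z, t=4, criterion="maxclust")     # 1) fix the number of clusters
labels_d = fcluster(Z, t=0.7, criterion="distance")   # 2) cut at a maximum merge distance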
There are also some automatic methods to determine the number of clusters. R has the Dynamic Tree Cut package, which deals with this problem automatically; pvclust could also be used. Two more methods for this problem are described in Salvador (2002) and Daniels (2006).
I have found that the Calinski-Harabasz index (also known as the Variance Ratio Criterion) works well with dendrograms produced by hierarchical clustering. You can find more information (and a comparative study) in this paper.
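A hedged sketch of using it to pick the number of clusters: scan candidate values of k and keep the one with the best score (placeholder data; assumes scikit-learn is available for calinski_harabasz_score):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import calinski_harabasz_score

X = np.random.rand(50, 4)        # placeholder data
Z = linkage(X, method="ward")

scores = {}
for k in range(2, 10):
    labels = fcluster(Z, t=k, criterion="maxclust")
    scores[k] = calinski_harabasz_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])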