We want to present our data as a graph and thought about using one of the graph databases. During our vendor investigation, one of the experts suggested that using a graph DB on a dense graph won't be efficient and that we'd be better off with a columnar database like Cassandra.
I gave your use case some thought. Given that your graph is very dense (number of relationships = number of nodes squared) and that you seem to only need a few-hop traversals from a particular node along different relationships, I'd actually recommend you also try out a columnar database.
Graph databases tend to work well when you have sparse graphs (number of relationships << number of nodes squared) and deep traversals - from 4-5 hops up to hundreds of hops. If I understood your use case correctly, a columnar database should generally outperform a graph database there.
Our use case will probably end up with nodes connected to tens of millions of other nodes, with about 30% overlap between different nodes - so in a way, it's probably a dense graph. Overall there will probably be a few billion nodes.
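For a rough sense of scale, here is a back-of-the-envelope calculation in Python using assumed round numbers (2 billion nodes and 20 million relationships per node are guesses based on the figures above), just to see how many relationships that implies and what fraction of the theoretical n^2 maximum it is:

nodes = 2e9                                  # assumed: "a few billion nodes"
rels_per_node = 2e7                          # assumed: "10s of millions" of neighbours each
relationships = nodes * rels_per_node        # ignores double counting of undirected edges
density = relationships / nodes**2           # fraction of the n^2 maximum
print(f"~{relationships:.0e} relationships, density ~ {density:.1%} of n^2")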
Looking at the Neo4j source code I found references to an isDense flag on nodes that switches the processing logic - I'm not sure exactly what it does. I also wonder whether it was added as an edge-case patch and won't work well when most of the nodes in the graph are dense.
Does anyone have experience with graph DBs on dense graphs, and should they be considered in such cases?
All opinions are appreciated!
A graph DB usually comes to mind when multiple tables are linked to each other, and that is indeed a natural use case for a graph DB.
We run JanusGraph at a scale of 20B vertices and 15B edges. It is not a hugely dense graph with single vertices connected to tens of millions of others, but we still see the super-node case, where a vertex is connected to far more vertices than expected. In our use case, though, when traversing (DFS) we always limit ourselves to at most N child nodes per node and a limited depth, say M, which is absolutely fine compared with the number of joins required in non-graph DBs (columnar, relational, Athena, etc.).
The only way (I feel) to get all relations of a node is to do a full DFS, or to keep inner-joining datasets until no common data is found.
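To make the bounded traversal described above concrete, here is a small Python sketch (not JanusGraph/Gremlin - the toy graph, fan-out, and depth values are all made up) of a DFS that visits at most N children per node and stops after M levels:

def bounded_dfs(adj, node, depth, fanout, seen=None):
    # Visit at most `fanout` children per node and recurse at most `depth` levels.
    seen = set() if seen is None else seen
    seen.add(node)
    if depth == 0:
        return seen
    for child in adj.get(node, [])[:fanout]:
        if child not in seen:
            bounded_dfs(adj, child, depth - 1, fanout, seen)
    return seen

adj = {"a": ["b", "c", "d", "e"], "b": ["f"], "c": ["a", "g"]}   # toy adjacency list
print(bounded_dfs(adj, "a", depth=2, fanout=3))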
Excited to know more about other creative solutions.
I do not have experience with dense graphs in graph databases, but I do not think a dense graph is a problem in itself. Since you are going to use graph algorithms, I suppose you would benefit from a graph database (depending on the algorithms' complexity - the more hops, the more you benefit from constant-time edge traversal).
A good trade-off could be to use one of the non-native graph databases (like Titan, its follow-up JanusGraph, MongoDB, ...), which actually use column-based storage (Cassandra, Berkeley DB, ...) as their backend.
To represent terrain in a SceneKit game, I have around 20k SCNNodes structured in a hierarchy, similar to an octree or quadtree. This tree isn't balanced - some branches have far more great(*n)-grandchildren than others.
How much extra time is SceneKit spending to get at individual SCNNodes for physics, rendering, adding/deleting nodes etc. compared to if they were all flat at the root level? Does it have to do lots of extra work to traverse the entire height of the tree just to iterate or perform a random access, or is it just not a significant overhead? (Maybe it's clever enough to have structured the nodes itself in advance?)
I'm not asking how a graphics engine might theoretically handle this. I'm asking what SceneKit actually does.
Edit: just in case it's putting people off answering this... I don't need exact numbers of how much time SceneKit takes, obviously that's device-dependent anyway. I just want to know if it's a significant proportion of time. Has anyone had experience of trying both approaches and comparing, or switching from one to the other and noticing whether there was a significant difference?
Thanks for your help.
I am also looking into this as my scene is starting to pack a serious amount of nodes, but here's my take:
When you call node.childNode(withName:recursively:) with recursively set to true, the implementation presumably walks some kind of tree, balanced or not - take your pick between a binary tree, an a-b tree, or a B+ tree (for the case where one node has more children than some threshold). You mostly end up with O(log n) complexity for inserting, deleting, or searching a balanced tree structure, since a self-balancing tree keeps its height small and lookups stay cheap.
Quadtree or R-tree structures also probably give O(log n) complexity, since you iteratively descend into sub-quadrants of the root quadrant.
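To illustrate the difference being discussed, here is a toy Python model (purely illustrative - it says nothing about what SceneKit actually does internally) contrasting a recursive name lookup, which may have to visit the whole subtree, with a flat index that resolves a name in one dictionary hit:

class Node:
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)

def child_node(node, name):
    # Recursive search: worst case visits every node in the subtree, i.e. O(n).
    for child in node.children:
        if child.name == name:
            return child
        found = child_node(child, name)
        if found:
            return found
    return None

# Flat alternative: keep an index alongside the tree for O(1) lookup by name.
leaf = Node("tile_42")
root = Node("root", [Node("branch", [leaf])])
index = {"tile_42": leaf}
print(child_node(root, "tile_42") is leaf, index["tile_42"] is leaf)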
It would be good to know what kind of structure you actually have in terms of child nodes to see what the best strategy would be.
On a side note, what prompts your terrain to have 20k nodes? Are you attaching a node per Bézier point or vertex on your terrain to do some morphing?
I wonder what the best way is to find all possible paths from a source to a destination in a very large network (given as an adjacency matrix), e.g. 5000 nodes. I have used this function, which is implemented using stacks, but its limit seems to be about 60 nodes and it can't retrieve the paths for a 200-node network. DFS (depth-first search) would be another option, but that algorithm also uses a stack, so I'm worried about its scalability. Is there an efficient way to find all paths between two given nodes in such a large network?
Depth-first search is the only way to make this scalable at the level you specify, at least until quantum computing gives us effectively infinite processing power - and even then you would have to bound the depth. The number of simple paths explodes factorially: with 100% adjacency among all nodes, the count between two vertices of a 60-node graph is already on the order of 10^78, roughly the number of atoms in the observable universe, which is consistent with your stack-based approach topping out around 60 nodes.
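If you only need paths up to a small number of hops, a bounded enumeration is still feasible. Here is a rough Python sketch (the adjacency list, depth cap, and toy network are all made up) of an iterative, stack-based DFS that refuses to extend any path beyond a maximum length:

def all_paths_bounded(adj, src, dst, max_len=6):
    # Iterative DFS; a path is abandoned once it reaches max_len nodes.
    paths, stack = [], [(src, [src])]
    while stack:
        node, path = stack.pop()
        if node == dst:
            paths.append(path)
            continue
        if len(path) >= max_len:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in path:              # keep paths simple (no repeated nodes)
                stack.append((nxt, path + [nxt]))
    return paths

adj = {0: [1, 2], 1: [2, 3], 2: [3], 3: []}  # toy network
print(all_paths_bounded(adj, 0, 3))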
I have a bipartite graph (guy and girl nodes) where nodes are connected by weighted edges (how compatible the guy-girl pair is) and each node has a capacity of 5 (each guy/girl can be matched to 5 people of the opposite gender). I need to find the matching that maximizes the total weight.
This can be formulated as a weighted network flow problem - each guy is a source of 5 units, each girl is a sink of 5 units, and each possible arc has a capacity of 1 unit. The problem can be solved either with linear programming or with a network flow algorithm (a min-cost max-flow variant of Ford–Fulkerson, since the arcs carry weights).
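If the non-Neo4j route is acceptable, the flow formulation is only a few lines with networkx. This is a sketch with made-up names and compatibility scores, a capacity of 2 instead of 5 to keep the toy instance small, and negated weights because min_cost_flow minimizes cost while we want to maximize compatibility:

import networkx as nx

guys, girls = ["g1", "g2", "g3"], ["d1", "d2", "d3"]
compat = {("g1", "d1"): 7, ("g1", "d2"): 3, ("g1", "d3"): 5,
          ("g2", "d1"): 2, ("g2", "d2"): 9, ("g2", "d3"): 4,
          ("g3", "d1"): 6, ("g3", "d2"): 1, ("g3", "d3"): 8}
CAP = 2                                          # per-person capacity (5 in the question)

G = nx.DiGraph()
for g in guys:
    G.add_node(g, demand=-CAP)                   # each guy supplies CAP units
for d in girls:
    G.add_node(d, demand=CAP)                    # each girl absorbs CAP units
for (g, d), w in compat.items():
    G.add_edge(g, d, capacity=1, weight=-w)      # one unit per pair; negate to maximize

flow = nx.min_cost_flow(G)
print([(g, d) for g in guys for d, f in flow[g].items() if f == 1])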
I'm currently looking into possible solutions using Neo4j - does anybody have any idea how to go about it? (or should I just go with a linear programming solution...)
I think it is something like this: for a given guy, find the five most compatible relationships, ordering by the relationship weight in descending order, and then create each of them as a separate MATCH relationship.
// For one guy, take the five highest-weighted COMPATIBLE relationships
// and materialize each of them as a MATCH relationship.
MATCH (guy:Guy)-[rel:COMPATIBLE]->(girl:Girl)
WHERE guy.id = 'xx'
WITH guy, rel, girl
ORDER BY rel.weight DESC
LIMIT 5
CREATE (guy)-[:MATCH]->(girl)
I have an algorithm that can group data into a hierarchical cluster tree. The algorithm is the one described in Toby Segaran's Programming Collective Intelligence. The output is a binary tree with a "distance" value at each node that tells you how far apart the two child nodes are.
I can then display this as a dendrogram, which makes it fairly easy for a human to spot which values are grouped together. However, I'm having difficulty coming up with an algorithm that automatically decides what the groups should be. I'd like to be able to determine automatically:
The number of groups
Which points should be placed in each group
Is there a standard algorithm for this?
I think there is no default way to do this. Two simple 'manual' methods (sketched in code right after this list) would be to either:
specify the number of clusters you want/expect
set a threshold for the maximum distance between two nodes; any nodes with a larger distance belong to another cluster
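Both manual options map directly onto SciPy's fcluster, assuming you can recompute the linkage matrix from your raw observations (the data, the linkage method, and the thresholds below are made up for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                          # hypothetical observations
Z = linkage(X, method="average")                      # the hierarchical cluster tree

labels_k = fcluster(Z, t=3, criterion="maxclust")     # option 1: ask for 3 clusters
labels_d = fcluster(Z, t=2.5, criterion="distance")   # option 2: cut at distance 2.5
print(labels_k, labels_d)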
There are also automatic methods to determine the number of clusters. R has the Dynamic Tree Cut package, which deals with this problem automatically; pvclust could also be used. Two more methods for this problem are described in Salvador (2002) and Daniels (2006).
I have found out that the Calinski-Harabasz index (also known as Variance Ratio Criterion) works well with dendrograms produced by hierarchical clustering. You can find more information (and a comparative study) in this paper.
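As a rough sketch of how it could be applied (synthetic data; the linkage method and the candidate range of k are assumptions), cut the dendrogram at several candidate cluster counts and keep the cut with the highest score:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (0, 5, 10)])  # three synthetic groups
Z = linkage(X, method="average")

scores = {k: calinski_harabasz_score(X, fcluster(Z, k, criterion="maxclust"))
          for k in range(2, 10)}
print(max(scores, key=scores.get), scores)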
I was trying to cluster some documents using the KMeansClustering approach and successfully created the clusters. I saved the cluster id of each document so I could use it for recommendations: whenever I wanted to recommend documents similar to a particular document, I would query all the documents in its cluster and return n random ones. However, returning random documents from the cluster did not seem appropriate, and I read somewhere that we should instead return the documents nearest to the document in question.
So I started looking into how to calculate the distance between documents and stumbled upon the RowSimilarity approach, which returns the 10 most similar documents for each document, ordered by distance. This approach relies on a similarity metric such as LogLikelihood to calculate the distance between documents.
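Outside of Mahout, the "nearest documents" idea boils down to something like the following toy sketch, using TF-IDF plus cosine similarity on made-up titles rather than the LogLikelihood metric RowSimilarity uses:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["red running shoes", "blue running shoes", "ceramic coffee mug"]
tfidf = TfidfVectorizer().fit_transform(titles)
sims = cosine_similarity(tfidf)

query = 0                                   # the product we want recommendations for
ranked = sims[query].argsort()[::-1][1:]    # most similar first, skipping the product itself
print([titles[i] for i in ranked])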
Now my question is this: how is clustering better or worse than RowSimilarity, given that both approaches use a similarity/distance metric to compare documents?
What I'm trying to achieve is to cluster products on the basis of their titles and other text properties in order to recommend similar products. Any help is appreciated.
Clustering is not just another variant of classification or recommendation. It is a different discipline.
When you are doing cluster analysis, you want to discover structure in the data. But then, you should actually be analyzing the structure you found.
Now k-means is not really meant for documents. It tries to find a near optimal partitioning of a data set into k Voronoi cells. Unless you have a good reason to believe that Voronoi cells are a good partitioning for your data, the algorithm may be pretty much useless. Just because it returns a result does not at all indicate that the result is useful.
For documents, Euclidean distance (and k-means is in fact optimizing Euclidean distances) is usually pretty much meaningless. The vectors are very sparse, and the k-means cluster centers will then often resemble impossible (and thus nonsensical) "average documents".
And I haven't even started on the need to find an appropriate value of k, on the Mahout implementation likely being just an approximation of Lloyd's k-means algorithm, and so on. Did you even check the cluster sizes? In situations like these, k-means will often produce degenerate results: for example, almost all clusters containing 1 or 0 elements and one mega-cluster containing the rest. In that situation, you might in fact be returning essentially random documents from your database...
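The cluster-size sanity check is a one-liner; labels below is hypothetical and stands in for whatever cluster ids were stored:

from collections import Counter

labels = [0, 0, 0, 0, 0, 0, 0, 3, 7]        # hypothetical cluster id per document
print(Counter(labels).most_common())         # a single mega-cluster here would be a red flag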
Just because you can use it does not mean it is helpful. Make sure to validate the individual steps of your approach, for example if the clusters are in any way useful and sensible!
Similarity is not the same thing as distance -- one is big when the other is small. Clustering is not the same as computing distances either. First you should decide whether you have a clustering problem -- it does not sound like you do based on what you say. So, don't use k-means.