I have a question about weighted graphs in neo4j. Is a property (like ".setProperty("cost", weight)") the only way of constructing a weighted graph? The problem is that a program which often needs these weights via "(Double) rel.getProperty("cost")" will get too slow, because the cast takes some time.
Well, you could actually encode the weight in the relationship type, which is faster. Something like:
create a-[:`KNOWS_0.34`]->b
See http://console.neo4j.org/r/2dez98 for an example.
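If you go this route, the reader has to parse the weight back out of the type name. Here is a minimal Python sketch of that step, assuming the `NAME_weight` naming used above. The helper name `weight_from_type` is made up for illustration and is not part of any Neo4j API.

```python
def weight_from_type(rel_type: str) -> float:
    """Extract the numeric weight from a type name such as 'KNOWS_0.34'."""
    # Split on the last underscore so names like 'WORKS_WITH_1.5' still work.
    _name, _sep, raw_weight = rel_type.rpartition("_")
    return float(raw_weight)


print(weight_from_type("KNOWS_0.34"))
```

Parsing a string has its own cost, so it is worth measuring both variants on your data before committing to one.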
We want to present our data as a graph and thought about using one of the graph databases. During our vendor investigation process, one of the experts suggested that using a graph DB on a dense graph won't be efficient and that we'd be better off with a columnar DB like Cassandra.
I gave your use case some thought. Given that your graph is very dense (number of relationships = number of nodes squared) and that you seem to need only a few hops of traversal from a particular node along different relationships, I’d actually recommend you also try out a columnar database.
Graph databases tend to work well when you have sparse graphs (num of relationships << num of nodes ^ 2) and with deep traversals - from 4-5 hops to hundreds of hops. If I understood your use-case correctly, a columnar database should generally outperform graphs there.
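To put a number on "sparse" versus "dense", you can compare the relationship count with the number of possible relationships. A small sketch, assuming a directed graph without self-loops; the function name and the sample counts are made up for illustration.

```python
def directed_density(num_nodes: int, num_relationships: int) -> float:
    """Fraction of possible directed relationships (no self-loops) that exist."""
    if num_nodes < 2:
        return 0.0
    return num_relationships / (num_nodes * (num_nodes - 1))


# Hypothetical counts, for illustration only.
print(directed_density(1_000, 5_000))     # close to 0: sparse
print(directed_density(1_000, 900_000))   # close to 1: dense
```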
Our use case will probably end up with nodes connected to 10s of millions of other nodes with about 30% overlap between different nodes - so in a way, it's probably a dense graph. Overall there will be probably a few billion nodes.
Looking in the Neo4j source code I found some references to an isDense flag on nodes that switches the processing logic. I'm not sure what it does, and I also wonder whether it was added as an edge-case patch that won't work well if most of the nodes in the graph are dense.
Does anyone have any experience with graphdbs on dense graphs and should it be considered in such cases?
All opinions are appreciated!
When a graph DB comes to mind, it usually means that multiple tables are linked with each other, which is a perfect use case for a graph DB.
We run JanusGraph at a scale of 20B vertices and 15B edges. It's not a large dense graph where a vertex is connected to tens of millions of vertices. Still, we have observed the super node case, where a vertex is connected to more vertices than expected. In our use case, though, a traversal (DFS) always visits at most N child nodes per node and goes to a limited depth, say M, which is absolutely fine considering the number of joins required in non-graph DBs (columnar, relational, Athena, etc.).
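Roughly, the bounded traversal looks like the sketch below. This is plain Python over an adjacency dict, not JanusGraph or Gremlin code, and the names `bounded_dfs`, `max_children` and `max_depth` are just illustrative stand-ins for N and M.

```python
from typing import Dict, Hashable, List, Set


def bounded_dfs(graph: Dict[Hashable, List[Hashable]], start: Hashable,
                max_children: int, max_depth: int) -> List[Hashable]:
    """Depth-first traversal that follows at most `max_children` neighbours
    per vertex and goes at most `max_depth` hops away from `start`."""
    visited: Set[Hashable] = set()
    order: List[Hashable] = []

    def visit(vertex: Hashable, depth: int) -> None:
        visited.add(vertex)
        order.append(vertex)
        if depth == max_depth:
            return
        # Only the first `max_children` neighbours are followed, so a super
        # node cannot blow up the traversal.
        for neighbour in graph.get(vertex, [])[:max_children]:
            if neighbour not in visited:
                visit(neighbour, depth + 1)

    visit(start, 0)
    return order


graph = {
    "a": ["b", "c", "d", "e"],   # "a" plays the super node here
    "b": ["f"],
    "c": ["g"],
    "f": ["h"],
}
print(bounded_dfs(graph, "a", max_children=2, max_depth=2))
```

With these limits the work per start vertex is bounded by roughly N^M visits, regardless of how many neighbours a super node has.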
The only way (I feel) to get all relations of a node is to do a full DFS, or to inner-join datasets until no common data is found.
Excited to know more about other creative solutions.
I do not have experience with dense graphs in graph databases, but I do not think that a dense graph is a problem. Since you are going to use graph algorithms, I suppose you would benefit from using a graph database (depending on the algorithms' complexity: the more "hops", the more you benefit from constant edge traversal time).
A good trade-off could be to use one of the non-native graph databases (like Titan, its follow-up JanusGraph, MongoDB, ...), which store their data in a separate backend such as Cassandra (column-based) or Berkeley DB.
I have around 50K data points whose values may range between 0 and 10. I want to apply HAC to cluster this data. But to apply HAC I need to prepare an N*N similarity matrix.
For N = 50K, this matrix would simply be too large to hold in memory, even if I use short.
Is there any way to do HAC in batches, or any other method which could help me apply HAC to 50K data points? I plan to implement it in Java.
I am also worried about the total time it would take; any pointers regarding this would be quite helpful.
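As a back-of-the-envelope check, here is the arithmetic for that matrix, assuming 2 bytes per entry (a short). The helper name is made up for illustration.

```python
def matrix_bytes(n: int, bytes_per_entry: int, upper_triangle_only: bool = False) -> int:
    """Memory needed for an n x n similarity matrix."""
    entries = n * (n - 1) // 2 if upper_triangle_only else n * n
    return entries * bytes_per_entry


n = 50_000
print(matrix_bytes(n, 2) / 1e9)                            # full matrix, in GB
print(matrix_bytes(n, 2, upper_triangle_only=True) / 1e9)  # symmetric: upper triangle only
```

That is about 5 GB for the full matrix and about 2.5 GB if only the upper triangle of a symmetric matrix is stored.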
If you want to apply a top-down clustering approach you could easily distribute it, related article: http://scgroup.hpclab.ceid.upatras.gr/faculty/stratis/Papers/tm07book.pdf
Long story short (quote from other article): After your first node split, each node created can be shipped to a distributed process to be split again and so on... Each distributed process needs only to be aware of the subset of the dataset it is splitting. Only the parent process is aware of the full dataset.
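To illustrate why each split only needs its own subset, here is a toy sketch for 1-D values. It splits at the mean, which is a simple stand-in chosen for brevity and not the method from the linked article; `divisive_clusters` and `max_size` are made-up names.

```python
from typing import List


def divisive_clusters(values: List[float], max_size: int) -> List[List[float]]:
    """Recursively split 1-D values at their mean until every cluster has at
    most `max_size` members. Each call only sees its own subset."""
    if len(values) <= max_size:
        return [values]
    mean = sum(values) / len(values)
    left = [v for v in values if v <= mean]
    right = [v for v in values if v > mean]
    if not left or not right:   # e.g. all values identical: cannot split further
        return [values]
    # In a distributed setting these two calls could run on different workers.
    return divisive_clusters(left, max_size) + divisive_clusters(right, max_size)


print(divisive_clusters([0.1, 0.2, 0.3, 5.0, 5.1, 9.8, 9.9], max_size=3))
```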
Bottom-up approach is much harder to distribute and I won't try to suggest anything here.
But hey, you don't need to write this in Java yourself: the Mahout and MLlib libraries already have it, and they support Java and Hadoop.
Anyway, here is your example in Java for hadoop if you want to write it yourself:
http://sujitpal.blogspot.ru/2009/09/hierarchical-agglomerative-clustering.html
Finally, a good and thorough comparison of different distributed approaches to hierarchical clustering:
C. F. Olson. "Parallel Algorithms for Hierarchical Clustering." Parallel Computing, 21:1313-1325, 1995, doi:10.1016/0167-8191(95)00017-I.
There are various different HAC methods, but they are generally all lower bounded by O(n^2) complexity. So while 50k is still a doable number of data points, you won't be able to scale this out too far.
I don't know what code you are using, but you don't have to explicitly store the N^2-sized similarity matrix: the similarity values can be computed on the fly, as needed. Scikit-learn will do it without explicitly forming the matrix.
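A minimal sketch of the "compute as needed" idea: finding the closest pair (the first merge step of agglomerative clustering) without ever holding the matrix. It assumes 1-D values and absolute difference as the distance; `closest_pair` is an illustrative name, not a library function.

```python
from itertools import combinations
from typing import Sequence, Tuple


def closest_pair(values: Sequence[float]) -> Tuple[int, int, float]:
    """Return (i, j, distance) for the two closest values, computing each
    distance when it is needed instead of storing an N x N matrix."""
    best = (-1, -1, float("inf"))
    for i, j in combinations(range(len(values)), 2):
        distance = abs(values[i] - values[j])   # computed on the fly
        if distance < best[2]:
            best = (i, j, distance)
    return best


print(closest_pair([0.5, 9.0, 4.0, 4.25, 7.0]))
```

This only saves memory. The loop still looks at every pair, so the running time stays quadratic.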
I was trying to cluster some documents using the KMeansClustering approach and successfully created the clusters. I saved the cluster id corresponding to a particular document for recommendations. So whenever I wanted to recommend documents similar to a particular document, I would query all the documents in a particular cluster and return n random documents from the cluster. However, returning any random document from the cluster did not seem appropriate and I read somewhere that we should be returning the documents nearest to the document in question.
So I started searching for calculating distance between documents and stumbled upon the RowSimilarity approach which returns 10 most similar documents to each document, ordered by distance. Now this approach relies on a similarity metric like LogLikelihood etc to calculate the distance between documents.
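For reference, the "most similar documents" idea can be sketched in a few lines. This is not Mahout's RowSimilarity job: it uses cosine similarity over sparse term-weight dicts instead of LogLikelihood, and all names and sample documents are made up for illustration.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]   # sparse term -> weight mapping


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(doc_id: str, docs: Dict[str, Vector], n: int) -> List[Tuple[str, float]]:
    """Rank all other documents by similarity to `doc_id`, best first."""
    query = docs[doc_id]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in docs.items() if other != doc_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


docs = {
    "d1": {"red": 1.0, "shoe": 1.0},
    "d2": {"red": 1.0, "shoe": 1.0, "leather": 1.0},
    "d3": {"blue": 1.0, "phone": 1.0},
}
print(most_similar("d1", docs, n=2))
```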
Now my question is this. How is clustering better/worse than RowSimilarity given that both the approaches use a similarity distance metric to calculate the distance between documents?
What I'm trying to achieve is that I'm trying to cluster products on the basis of their titles and other text properties to recommend similar products. Any help is appreciated.
Clustering is not just another variant of classification or recommendation. It is a different discipline.
When you are doing cluster analysis, you want to discover structure in the data. But then, you should actually be analyzing the structure you found.
Now k-means is not really meant for documents. It tries to find a near optimal partitioning of a data set into k Voronoi cells. Unless you have a good reason to believe that Voronoi cells are a good partitioning for your data, the algorithm may be pretty much useless. Just because it returns a result does not at all indicate that the result is useful.
For documents, Euclidean distance (and k-means is in fact optimizing Euclidean distances) is usually pretty much meaningless. The vectors are very sparse, and k-means cluster centers will then often resemble impossible (and thus insensible) "average documents".
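A tiny illustration of the "average document" effect, using made-up two-word documents as sparse term vectors: the mean has a small weight on every term, which no real document looks like.

```python
from typing import Dict, List

Vector = Dict[str, float]


def centroid(docs: List[Vector]) -> Vector:
    """Component-wise mean of sparse term vectors, as k-means would compute it."""
    total: Vector = {}
    for doc in docs:
        for term, weight in doc.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(docs) for term, weight in total.items()}


# Hypothetical documents, two terms each.
docs = [
    {"red": 1.0, "shoe": 1.0},
    {"blue": 1.0, "phone": 1.0},
    {"green": 1.0, "lamp": 1.0},
    {"black": 1.0, "sofa": 1.0},
]
center = centroid(docs)
print(len(center), center)
```

Each input has two non-zero terms, while the center has eight, all with weight 0.25.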
And I haven't even started on the need to find an appropriate value of k, on the Mahout implementation likely just being an approximation of Lloyd's k-means approximation, and so on. Did you even check the cluster sizes? In situations like these, k-means will often produce degenerate results: for example, almost all clusters containing 1 or 0 elements, and a mega-cluster containing the rest. In this situation, you might in fact be returning just random documents from your database...
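Checking the cluster sizes takes only a few lines once you have the cluster id per document. A sketch with made-up assignments:

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def cluster_sizes(assignments: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Count how many documents were assigned to each cluster id."""
    return dict(Counter(assignments))


# Hypothetical cluster ids for ten documents.
labels = [0, 0, 0, 0, 0, 0, 0, 1, 2, 0]
sizes = cluster_sizes(labels)
print(sorted(sizes.items(), key=lambda item: item[1], reverse=True))
```

One cluster holding almost everything, next to several singletons, is the degenerate pattern to look for.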
Just because you can use it does not mean it is helpful. Make sure to validate the individual steps of your approach, for example if the clusters are in any way useful and sensible!
Similarity is not the same thing as distance -- one is big when the other is small. Clustering is not the same as computing distances either. First you should decide whether you have a clustering problem -- it does not sound like you do based on what you say. So, don't use k-means.
I'm pretty new in the field of machine learning (even if I find it extremely interesting), and I wanted to start a small project where I'd be able to apply some stuff.
Let's say I have a dataset of persons, where each person has N different attributes (only discrete values, each attribute can be pretty much anything).
I want to find clusters of people who exhibit the same behavior, i.e. who have a similar pattern in their attributes ("look-alikes").
How would you go about this? Any thoughts to get me started?
I was thinking about using PCA since we can have an arbitrary number of dimensions, that could be useful to reduce it. K-Means? I'm not sure in this case. Any ideas on what would be most adapted to this situation?
I do know how to code all those algorithms, but I'm truly missing some real world experience to know what to apply in which case.
K-means using the n-dimensional attribute vectors is a reasonable way to get started. You may want to play with your distance metric to see how it affects the results.
The first step in pretty much any clustering algorithm is to find a suitable distance function. Many algorithms, such as DBSCAN, can then be parameterized with this distance function (at least in a decent implementation; some, of course, only support Euclidean distance).
So start with considering how to measure object similarity!
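For purely discrete attributes, one simple starting point is the fraction of attributes on which two persons differ (a normalized Hamming distance). This is just one candidate, not necessarily the right measure for your data, and the attribute values below are invented.

```python
from typing import Sequence


def mismatch_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of attributes on which two persons differ
    (0 = identical, 1 = different everywhere)."""
    if len(a) != len(b):
        raise ValueError("both persons need the same number of attributes")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical persons with four discrete attributes each.
alice = ["berlin", "cyclist", "vegetarian", "renter"]
bob = ["berlin", "driver", "vegetarian", "owner"]
print(mismatch_distance(alice, bob))
```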
In my opinion you should also try the expectation-maximization algorithm (also called EM). On the other hand, you must be careful when using PCA, because it may discard dimensions that are relevant to clustering.
I have a large sparse matrix representing attributes for millions of entities. For example, one record, representing an entity, might have attributes "has(fur)", "has(tail)", "makesSound(meow)", and "is(cat)".
However, this data is incomplete. For example, another entity might have all the attributes of a typical "is(cat)" entity, but it might be missing the "is(cat)" attribute. In this case, I want to determine the probability that this entity should have the "is(cat)" attribute.
So the problem I'm trying to solve is determining which missing attributes each entity should contain. Given an arbitrary record, I want to find the top N most likely attributes that are missing but should be included. I'm not sure what the formal name is for this type of problem, so I'm unsure what to search for when researching current solutions. Is there a scalable solution for this type of problem?
My first idea is to simply calculate the conditional probability for each missing attribute (e.g. P(is(cat)|has(fur) and has(tail) and ... )), but that seems like a very slow approach. Plus, as I understand the traditional calculation of conditional probability, I imagine I'd run into problems where my entity contains a few unusual attributes that aren't common with other is(cat) entities, causing the conditional probability to be zero.
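To make the zero-probability concern concrete, here is a naive counting sketch. The attributes "makesSound(woof)" and "wears(hat)" are invented for the example, and the function is only an illustration of the direct estimate.

```python
from typing import FrozenSet, List


def conditional_probability(records: List[FrozenSet[str]], target: str,
                            given: FrozenSet[str]) -> float:
    """Estimate P(target | all attributes in `given`) by counting records."""
    matching = [r for r in records if given <= r]
    if not matching:
        return 0.0      # no record has all the given attributes
    return sum(target in r for r in matching) / len(matching)


records = [
    frozenset({"has(fur)", "has(tail)", "makesSound(meow)", "is(cat)"}),
    frozenset({"has(fur)", "has(tail)", "makesSound(meow)", "is(cat)"}),
    frozenset({"has(fur)", "has(tail)", "makesSound(woof)"}),
]
common = frozenset({"has(fur)", "has(tail)"})
unusual = frozenset({"has(fur)", "has(tail)", "wears(hat)"})
print(conditional_probability(records, "is(cat)", common))
print(conditional_probability(records, "is(cat)", unusual))
```

A single attribute that no other record shares is enough to drive the second estimate to zero.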
My second idea is to train a Maximum Entropy classifier for each attribute, and then evaluate it based on the entity's current attributes. I think the probability calculation would be much more flexible, but this would still have scalability problems, since I'd have to train separate classifiers for potentially millions of attributes. In addition, if I wanted to find the top N most likely attributes to include, I'd still have to evaluate all the classifiers, which would likely take forever.
Are there better solutions?
This sounds like a typical recommendation problem. For each attribute use the word 'movie rating' and for each row use the word 'person'. For each person, you want to find the movies that they will probably like but haven't rated yet.
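As a rough illustration of that framing, here is a very simple item-based scheme: score each attribute an entity lacks by how often it co-occurs with the attributes the entity already has. This is only a sketch of the idea, not the approach used by any particular Netflix Challenge entry; the function names and the "is(dog)" / "makesSound(woof)" attributes are made up.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, List, Set, Tuple


def cooccurrence_counts(records: List[Set[str]]) -> Dict[str, Counter]:
    """For every attribute, count how often each other attribute appears with it."""
    counts: Dict[str, Counter] = {}
    for record in records:
        for a, b in permutations(record, 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def suggest_missing(entity: Set[str], counts: Dict[str, Counter],
                    n: int) -> List[Tuple[str, int]]:
    """Score attributes the entity lacks by co-occurrence with those it has."""
    scores: Counter = Counter()
    for present in entity:
        for other, count in counts.get(present, {}).items():
            if other not in entity:
                scores[other] += count
    return scores.most_common(n)


records = [
    {"has(fur)", "has(tail)", "makesSound(meow)", "is(cat)"},
    {"has(fur)", "has(tail)", "makesSound(meow)", "is(cat)"},
    {"has(fur)", "has(tail)", "makesSound(woof)", "is(dog)"},
]
counts = cooccurrence_counts(records)
print(suggest_missing({"has(fur)", "has(tail)", "makesSound(meow)"}, counts, n=1))
```

Unlike the exact conditional probability, an unusual extra attribute on the entity just adds nothing to the scores instead of zeroing them out.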
You should look at some of the more successful approaches to the Netflix Challenge. The dataset is pretty large, so efficiency is a high priority. A good place to start might be the paper 'Matrix Factorization Techniques for Recommender Systems'.
If you have a large data set and you're worried about scalability, then I would look into Apache Mahout. Mahout is a machine learning and data mining library that might help you with your project. In particular, it has some of the most well-known algorithms already built in:
Collaborative Filtering
User and Item based recommenders
K-Means, Fuzzy K-Means clustering
Mean Shift clustering
Dirichlet process clustering
Latent Dirichlet Allocation
Singular value decomposition
Parallel Frequent Pattern mining
Complementary Naive Bayes classifier
Random forest decision tree based classifier
High performance Java collections (previously Colt collections)