Neo4j or GraphX / Giraph what to choose? - neo4j

Just started my excursion to graph processing methods and tools. What we basically do - count some standard metrics like pagerank, clustering coefficient, triangle count, diameter, connectivity etc. In the past was happy with Octave, but when we started to work with graphs having let's say 10^9 nodes/edges we stuck.
So the possible solutions can be distributed cloud made with Hadoop/Giraph, Spark/GraphX, Neo4j on top of them, etc.
But since I am a beginner, can someone advise what actually to choose? I did not get the difference when to use Spark/GraphX and when Neo4j? Right now I consider Spark/GraphX, since it have more Python alike syntax, while neo4j has the own Cypher. Visualization in neo4j is cool but not useful in such a large scale. I do not understand is there a reason to use additional level of software (neo4j) or just use Spark/GraphX? Since I understood neo4j will not save so much time like if we worked with pure hadoop vs Giraph or GraphX or Hive.
Thank you.

Neo4J: It is a graphical database which helps out identifying the relationships and entities data usually from the disk. It's popularity and choice is given in this link. But when it needs to process the very large data-sets and real time processing to produce the graphical results/representation it needs to scale horizontally. In this case combination of Neo4J with Apache Spark will give significant performance benefits in such a way Spark will serve as an external graph compute solution.
Mazerunner is a distributed graph processing platform which extends Neo4J. It uses message broker to process distribute graph processing jobs to Apache Spark GraphX module.
GraphX: GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. It supports multiple Graph algorithms.
Conclusion:
It is always recommended to use the Hybrid combination of Neo4j with GraphX as they both easier to integrate.
For real time processing and processing large data-sets, use neo4j with GraphX.
For simple persistence and to show the entity relationship for a simple graphical display representation use standalone neo4j.

Neo4j: I have not used it, but I think it does all of a graph computation (like pagerank) on a single machine. Would that be able to handle your data set? It may depend on whether your entire graph fits into memory, and if not, how efficiently does it process data from disk. It may hit the same problems you encountered with Octave.
Spark GraphX: GraphX partitions graph data (vertices and edges) across a cluster of machines. This gives you horizontal scalability and parallelism in computation. Some things you may want to consider: it only has a Scala API right now (no Python yet). It does PageRank, triangle count, and connected components, but you may have to implement clustering coefficent and diameter yourself, using the provided graph API (pregel for example). The programming guide has a list of supported algorithms: https://spark.apache.org/docs/latest/graphx-programming-guide.html

GraphX is more of a realtime processing framework for the data that can be (and it's is better when) represented in a graph form. With GraphX you can use various algorithms that require large amounts of processing power (both RAM and CPU), and with neo4j you can (reliably) persist and update that data. This is what I'd suggest.
I know for sure that #kennybastani has done some pretty interesting advancements in that area, you can take a look at his mazerunner solution. It's also shipped as a docker image, so you can poke at it with a stick and find out for yourself whether you like it or not.
This image deploys a container with Apache Spark and uses GraphX to
perform ETL graph analysis on subgraphs exported from Neo4j. The
results of the analysis are applied back to the data in the Neo4j
database.

Related

PageRank tool for large graphs

I need to compute the PageRank scores for a large graph which cannot be loaded into memory. I need a simple toolkit that can be easily modified, since I need to change its code in my research. Are you aware of any useful and simple toolkit that computes PageRank for large graphs (the size of graph is around 40 GB).
Thanks
Two packages you might want to evaluate are
Apache TinkerPop
http://tinkerpop.incubator.apache.org/docs/3.0.1-incubating/#pagerankvertexprogram
Apache Spark - GraphX
http://spark.apache.org/docs/latest/graphx-programming-guide.html#pagerank
Both are open source with Apache license, so the source code is available for you to modify or extend.

Neo4j partition

Is the a way to physically separate between neo4j partitions?
Meaning the following query will go to node1:
Match (a:User:Facebook)
While this query will go to another node (maybe hosted on docker)
Match (b:User:Google)
this is the case:
i want to store data of several clients under neo4j, hopefully lots of them.
now, i'm not sure about whats is the best design for that but it has to fulfill few conditions:
no mixed data should be returned from a cypher query ( its really hard to make sure, that no developer will forget the ":Partition1" (for example) in a cypher query)
performance of 1 client shouldn't affect another client, for example, if 1 client has lots of data, and another client has small amount of data, or if a "heavy" query of 1 client is currently running, i dont want other "lite" queries of another client to suffer from slow slow performance
in other words, storing everything under 1 node, at some point in the future, i think, will have scalability problem, when i'll have more clients.
btw, is it common to have few clusters?
also whats the advantage of partitioning over creating different Label for each client? for example: Users_client_1 , Users_client_2 etc
Short answer: no, there isn't.
Neo4j has high availability (HA) clusters where you can make a copy of your entire graph on many machines, and then serve many requests against that copy quickly, but they don't partition a really huge graph so some of it is stored here, some other parts there, and then connected by one query mechanism.
More detailed answer: graph partitioning is a hard problem, subject to ongoing research. You can read more about it over at wikipedia, but the gist is that when you create partitions, you're splitting your graph up into multiple different locations, and then needing to deal with the complication of relationships that cross partitions. Crossing partitions is an expensive operation, so the real question when partitioning is, how do you partition such that the need to cross partitions in a query comes up as infrequently as possible?
That's a really hard question, since it depends not only on the data model but on the access patterns, which may change.
Here's how bad the situation is (quote stolen):
Typically, graph partition problems fall under the category of NP-hard
problems. Solutions to these problems are generally derived using
heuristics and approximation algorithms.[3] However, uniform graph
partitioning or a balanced graph partition problem can be shown to be
NP-complete to approximate within any finite factor.[1] Even for
special graph classes such as trees and grids, no reasonable
approximation algorithms exist,[4] unless P=NP. Grids are a
particularly interesting case since they model the graphs resulting
from Finite Element Model (FEM) simulations. When not only the number
of edges between the components is approximated, but also the sizes of
the components, it can be shown that no reasonable fully polynomial
algorithms exist for these graphs.
Not to leave you with too much doom and gloom, plenty of people have partitioned big graphs. Facebook and twitter do it every day, so you can read about FlockDB on the twitter side or avail yourself of relevant facebook research. But to summarize and cut to the chase, it depends on your data and most people who partition design a custom partitioning strategy, it's not something software does for them.
Finally, other architectures (such as Apache Giraph) can auto-partition in some senses; if you store a graph on top of hadoop, and hadoop already automagically scales across a cluster, then technically this is partitioning your graph for you, automagically. Cool, right? Well...cool until you realize that you still have to execute graph traversal operations all over the place, which may perform very poorly owing to the fact that all of those partitions have to be traversed, the performance situation you're usually trying to avoid by partitioning wisely in the first place.

sharding a neo4j graph, min-cut

I've heard of a max flow min cut method for sharding or segmenting a graph database. Does someone have a sample cypher query that can do that say against the movielens dataset? Basically I want to segment users into different shards/clusters based on what they like so maybe the min cuts can naturally find clusters of users around the genres say Horror, Drama, or maybe it will create non-intuitive clusters/segments like hipster/romantics and conservative/comedy/horror groups.
my short answer is no, sorry I don't know how you would express that.
my longer answer is even if this were possible - which it very well may be - I would advise against it.
multiple algorithms 'do' min-cut max-flow, these will all have different performance characteristics and, because clustering is computationally expensive, I'd guess you want control over the specific algorithm implementation used.
Cypher is a declarative language, you specify what you're looking for but not how to do it, and it will be difficult to specify such a complex problem in a way that the Cypher engine can figure out what you're trying to do. that will make it hard for Cypher (or any declarative language engine) to produce an efficient query plan.
my suggestion is find the specific algorithm you wish to use and implement it using the Neo4j Java API.
if you're running Neo4j in embedded mode you're done at that point. if you're running Neo4j server you'll then just have to run that code as an Unmanaged Server Extension
AFAIK you're after 'Community Detection' algorithms. There are non-overlapping (communities do not overlap) and overlapping variants, where non-overlapping is generally easier to implement and understand. Common algorithms are:
Non-overlapping: Louvain
Overlapping: Label Propagation Algorithm (LPA) (typically non-overlapping, but there are extensions to make it overlapping)
Here are a few C++ code examples for the algorithms: Louvain, Oslom (overlapping), LPA (non-overlapping), and Infomap)
And if you want bleeding edge I was recommended the SCD algorithm
Academic paper: "High Quality, Scalable and Parallel Community Detection for Large Real Graphs"
C++ implementation

Neo4J Performance Benchmarking

I have created a basic implementation of high level client over Neo4J (https://github.com/impetus-opensource/Kundera/tree/trunk/kundera-neo4j) and want to compare its performance with Native neo4j driver (and maybe SpringData too). This way I would be able to determine overhead my library is putting over native driver.
I plan to create an extension of YCSB for Neo4J.
My question is: what should be considered as a basic unit of object to be written into neo4j (should it be a single node or a couple of nodes joined by an edge).
What's current practice in Neo4J world. How people benchmarking neo4j performance are doing it.
There's already been some work for benchmarking Neo4J with Gatling: http://maxdemarzi.com/2013/02/14/neo4j-and-gatling-sitting-in-a-tree-performance-t-e-s-t-ing/
You could maybe adapt it.
See graphdb-benchmarks
The project graphdb-benchmarks is a benchmark between popular graph dataases. Currently the framework supports Titan, OrientDB, Neo4j and Sparksee. The purpose of this benchmark is to examine the performance of each graph database in terms of execution time. The benchmark is composed of four workloads, Clustering, Massive Insertion, Single Insertion and Query Workload. Every workload has been designed to simulate common operations in graph database systems.
Clustering Workload (CW): CW consists of a well-known community detection algorithm for modularity optimization, the Louvain Method. We adapt the algorithm on top of the benchmarked graph databases and employ cache techniques to take advantage of both graph database capabilities and in-memory execution speed. We measure the time the algorithm needs to converge.
Massive Insertion Workload (MIW): Create the graph database and configure it for massive loading, then we populate it with a particular dataset. We measure the time for the creation of the whole graph.
Single Insertion Workload (SIW): Create the graph database and load it with a particular dataset. Every object insertion (node or edge) is committed directly and the graph is constructed incrementally. We measure the insertion time per block, which consists of one thousand edges and the nodes that appear during the insertion of these edges.
Query Workload (QW): Execute three common queries:
FindNeighbours (FN): finds the neighbours of all nodes.
FindAdjacentNodes (FA): finds the adjacent nodes of all edges.
FindShortestPath (FS): finds the shortest path between the first node and 100 randomly picked nodes.
One way to performance-test is to use e.g. http://gatling-tool.org/. There is work underway to create benchmark frameworks at http://ldbc.eu . Otherwise, benchmarking is highly dependent on your domain dataset and the queries you are trying to do. Maybe you could start at https://github.com/neo4j/performance-benchmark and improve on it?

what is the advantage of RDF and Triple Storage to Neo4j?

Neo4j is a really fast and scalable graph database, it seems that it can be used on business projects and it is free, too!
At the same time, there are no RDF triple stores that work well with large data or deliver a high-speed access. And what is more, free RDF triple stores perform even worse.
So what is the advantage of RDF and RDF triple stores to Neo4j?
The advantage of using a triple store for RDF rather than Neo4j is that that's what they're designed for. Neo4j is pretty good for many use cases, but in my experience its performance for loading and querying RDF is well below all dedicated RDF databases.
It's a fallacy that RDF databases don't scale or are not fast. Sure, they're not yet up to the performance & scale levels that relational databases have, but they have a 50 year head start. Many triple stores scale into the billions of triples, provide 'standard' enterprise features, and provide great performance for many use cases.
If you're going to use RDF for a project, use a triple store; it's going to provide the best performance and set of features/APIs for working with RDF to build your application.
RDF and SPARQL are standards, so you have a choice of multiple implementations, and can migrate your data from one RDF store to another.
Additionally, version 1.1 of the SPARQL query language is quite sophisticated (more expressive than most SQL implementations) and can do all kinds of queries that would require a lot of code to be written in Neo4J.
If you are going for graph mining (e.g., graph traversal) upon triples, neo4j is a good choice. For the large triples, you might want to use its batchInserter which is fairly fast.
So I think it's all about your use case. Both technologies can and do overlap.
In my mind, there its mostly about the use case. Do you want a full knowledge graph including all the ecosystems from the semantic web? Then go for the triple store.
If you need a general-purpose graph (e.g. store big data as a graph) use the property graph model. My reasoning is, that the underlying philosophy is very much different and this starts with how the data is stored which has implications for your usage scenario.
let's do some out-of-mind bullet points here to compare. Take it with a grain of salt please as this is not a benchmark paper just some experience-based 5 min write down.
Property graph (neo4j):
Think of nodes/Edges as documents
Implemented on top of e.g. linked list, key-value stores (deep searches, large data e.g. via gremlin)
Support for OWL/RDF but not natively (as i see its on a meta layer)
Really great when it comes to having the data in the graph and doing ML (it stores it as linked lists that gives you nice vectors which is cool for ML out of the box)
Made for large data at scale.
Use Cases: (focus is on the data entities and not their classes)
Social Graphs and other scenarios where you need deep traversal
Large data graphs, where you have a lot of documents that need to be searched in a schema-free graph manner .
Analyzing customer funnels from click data etc. You want to move out of your relational schema because actually, you are in a graph use case...
Triple Store (E.g. rdf4j)
Think of data in maximum normal form as triples (no redundant data at all)
Triples are stored in context triples. Works a lot with index.
Broad but searches and specific knowledge extractions. Deep searches are sometimes cumbersome.
Scale is impressive and can scale to trillions of nodes with fast performance. But i would not recommend storing big data in the graph e.g. time-series or so. The reason is the special way how indexes are used and in order to scale horizontally, you may consider working with subgraphs ...
Support for all the ecosystems like SPARQL, SHACL, SWIRL etc. this is a big plus in case
Use cases:
It's really about knowledge graphs. Do you need shape testing, rule evaluation, inference, and reasoning? Go for it because you have to focus on the ontology and class structure!
Also e.g. you have IoT and want to configure relations for logistics and smart factory while the telemetry is stored somewhere else and only referenced in the graph.
I have heard rumors that it takes whole day to load 10M triples into Neo4j (it is actually the slowest one because it's not built primarily for RDF).
Sesame and 4Store are the fastest ones but Jena has powerful API.

Resources