Neo4j: Cypher Performance testing/Benchmarking - neo4j

I created a Neo4j 3 database that includes some test data and also a small application that will send http cypher requests to Neo4j. These requests are always of the same time. Acutally its a query template that just differs by some attributes. I am interested in the performance of these statements.
I know that I can use the PROFILE to get some information in the browser. But I want to execute a set of statements, e. g. 10 example queries, several times and calculate the average performance. Is there an easy way or a tool to do this or do I have to write e. g. a Python script that collects these values? It does not have to be a big application, I just want to see some general performance metrics.

I don't think there is an out-of-the-box tool for benchmarking Neo4j yet. So your best option is to implement your own solution - but you have to be careful if you want to get results that are (to some degree) representative:
Check the docs on performance.
Give the Neo4j JVM sufficient time to warmup. This means that you'll want to run a warmup phase with the queries and discard the execution times of them.
Instead of using a client-server architecture, you can also opt to use Neo4j in embedded mode, which will give you a better idea of the query performance (without the overhead of the driver and the serialization/deserialization process). However, in this case you have to implement benchmark over the JVM (in Java or possibly Jython).
Run each query multiple times. Do not use the average as it is more sensitive to outlier values (you can get high values for a number of reasons, e.g. if the OS scheduler starts some job in the background during a particular query execution).
A good paper in the topic, How not to lie with statistics: the correct way to summarize benchmark results, argues that you should use the geometric mean.
It is also common practice in performance experiments in computer science papers to use the median value. I tend to use this option - e.g. this figure shows the execution times of two simple SPARQL queries on in-memory RDF engines (Jena and Sesame), for their first executions and the median values of 5 consecutive executions.
Note however, that Neo4j employs various caching mechanisms, so if you only run the same query multiple times, it will only need to compute the results on the first execution and following executions will use the cache - unless the database is updated between the query executions.
As a good approximation, you can design the benchmark to resemble your actual workload as closely as possible - in many cases, application-specific macrobenchmarks make more sense than microbenchmarks. So if each query will only be evaluated once by the application, it is perfectly acceptable to benchmark only the first evaluation.
(Bonus.) Another good read in the topic is The Benchmark Handbook - chapter 1 discusses the most important criteria for domain-specific benchmarks (relevance, portability, scalability and simplicity). These are probably not required for your benchmark but these are nice to now.
I worked on a cross-technology benchmark considering relational, graph and semantic databases, including Neo4j. You might find some useful ideas or code snippets in the repository: https://github.com/FTSRG/trainbenchmark

Related

Complex join with google dataflow

I'm a newbie, trying to understand how we might re-write a batch ETL process into Google Dataflow. I've read some of the docs, run a few examples.
I'm proposing that the new ETL process would be driven by business events (i.e. a source PCollection). These would trigger the ETL process for that particular business entity. The ETL process would extract datasets from source systems and then pass those results (PCollections) onto the next processing stage. The processing stages would involve various types of joins (including cartesian and non-key joins, e.g. date-banded).
So a couple of questions here:
(1) Is the approach that I'm proposing valid & efficient? If not what would be better, I havent seen any presentations on real-world complex ETL processes using Google Dataflow, only simple scenarios.
Are there any "higher-level" ETL products that are a better fit? I've been keeping an eye on Spark and Flink for a while.
Our current ETL is moderately complex, though there are only about 30 core tables (classic EDW dimensions and facts), and ~1000 transformation steps. Source data is complex (roughly 150 Oracle tables).
(2) The complex non-key joins, how would these be handled?
I'm obviously attracted to Google Dataflow because of it being an API first and foremost, and the parallel processing capabilities seem a very good fit (we are being asked to move from batch overnight to incremental processing).
A good worked example of Dataflow for this use case would really push adoption forward!
Thanks,
Mike S
It sounds like Dataflow would be a good fit. We allow you to write a pipeline that takes a PCollection of business events and performs the ETL. The pipeline could either be batch (executed periodically) or streaming (executed whenever input data arrives).
The various joins are for the most part relatively expressible in Dataflow. For the cartesian product, you can look at using side inputs to make the contents of a PCollection available as an input to the processing of each element in another PCollection.
You can also look at using GroupByKey or CoGroupByKey to implement the joins. These flatten multiple inputs, and allow accessing all values with the same key in one place. You can also use Combine.perKey to compute associative and commutative combinations of all the elements associated with a key (eg., SUM, MIN, MAX, AVERAGE, etc.).
Date-banded joins sound like they would be a good fit for windowing which allows you to write a pipeline that consumes windows of data (eg., hourly windows, daily windows, 7 day windows that slide every day, etc.).
Edit: Mention GroupByKey and CoGroupByKey.

Neo4j partition

Is the a way to physically separate between neo4j partitions?
Meaning the following query will go to node1:
Match (a:User:Facebook)
While this query will go to another node (maybe hosted on docker)
Match (b:User:Google)
this is the case:
i want to store data of several clients under neo4j, hopefully lots of them.
now, i'm not sure about whats is the best design for that but it has to fulfill few conditions:
no mixed data should be returned from a cypher query ( its really hard to make sure, that no developer will forget the ":Partition1" (for example) in a cypher query)
performance of 1 client shouldn't affect another client, for example, if 1 client has lots of data, and another client has small amount of data, or if a "heavy" query of 1 client is currently running, i dont want other "lite" queries of another client to suffer from slow slow performance
in other words, storing everything under 1 node, at some point in the future, i think, will have scalability problem, when i'll have more clients.
btw, is it common to have few clusters?
also whats the advantage of partitioning over creating different Label for each client? for example: Users_client_1 , Users_client_2 etc
Short answer: no, there isn't.
Neo4j has high availability (HA) clusters where you can make a copy of your entire graph on many machines, and then serve many requests against that copy quickly, but they don't partition a really huge graph so some of it is stored here, some other parts there, and then connected by one query mechanism.
More detailed answer: graph partitioning is a hard problem, subject to ongoing research. You can read more about it over at wikipedia, but the gist is that when you create partitions, you're splitting your graph up into multiple different locations, and then needing to deal with the complication of relationships that cross partitions. Crossing partitions is an expensive operation, so the real question when partitioning is, how do you partition such that the need to cross partitions in a query comes up as infrequently as possible?
That's a really hard question, since it depends not only on the data model but on the access patterns, which may change.
Here's how bad the situation is (quote stolen):
Typically, graph partition problems fall under the category of NP-hard
problems. Solutions to these problems are generally derived using
heuristics and approximation algorithms.[3] However, uniform graph
partitioning or a balanced graph partition problem can be shown to be
NP-complete to approximate within any finite factor.[1] Even for
special graph classes such as trees and grids, no reasonable
approximation algorithms exist,[4] unless P=NP. Grids are a
particularly interesting case since they model the graphs resulting
from Finite Element Model (FEM) simulations. When not only the number
of edges between the components is approximated, but also the sizes of
the components, it can be shown that no reasonable fully polynomial
algorithms exist for these graphs.
Not to leave you with too much doom and gloom, plenty of people have partitioned big graphs. Facebook and twitter do it every day, so you can read about FlockDB on the twitter side or avail yourself of relevant facebook research. But to summarize and cut to the chase, it depends on your data and most people who partition design a custom partitioning strategy, it's not something software does for them.
Finally, other architectures (such as Apache Giraph) can auto-partition in some senses; if you store a graph on top of hadoop, and hadoop already automagically scales across a cluster, then technically this is partitioning your graph for you, automagically. Cool, right? Well...cool until you realize that you still have to execute graph traversal operations all over the place, which may perform very poorly owing to the fact that all of those partitions have to be traversed, the performance situation you're usually trying to avoid by partitioning wisely in the first place.

sharding a neo4j graph, min-cut

I've heard of a max flow min cut method for sharding or segmenting a graph database. Does someone have a sample cypher query that can do that say against the movielens dataset? Basically I want to segment users into different shards/clusters based on what they like so maybe the min cuts can naturally find clusters of users around the genres say Horror, Drama, or maybe it will create non-intuitive clusters/segments like hipster/romantics and conservative/comedy/horror groups.
my short answer is no, sorry I don't know how you would express that.
my longer answer is even if this were possible - which it very well may be - I would advise against it.
multiple algorithms 'do' min-cut max-flow, these will all have different performance characteristics and, because clustering is computationally expensive, I'd guess you want control over the specific algorithm implementation used.
Cypher is a declarative language, you specify what you're looking for but not how to do it, and it will be difficult to specify such a complex problem in a way that the Cypher engine can figure out what you're trying to do. that will make it hard for Cypher (or any declarative language engine) to produce an efficient query plan.
my suggestion is find the specific algorithm you wish to use and implement it using the Neo4j Java API.
if you're running Neo4j in embedded mode you're done at that point. if you're running Neo4j server you'll then just have to run that code as an Unmanaged Server Extension
AFAIK you're after 'Community Detection' algorithms. There are non-overlapping (communities do not overlap) and overlapping variants, where non-overlapping is generally easier to implement and understand. Common algorithms are:
Non-overlapping: Louvain
Overlapping: Label Propagation Algorithm (LPA) (typically non-overlapping, but there are extensions to make it overlapping)
Here are a few C++ code examples for the algorithms: Louvain, Oslom (overlapping), LPA (non-overlapping), and Infomap)
And if you want bleeding edge I was recommended the SCD algorithm
Academic paper: "High Quality, Scalable and Parallel Community Detection for Large Real Graphs"
C++ implementation

Neo4J Performance Benchmarking

I have created a basic implementation of high level client over Neo4J (https://github.com/impetus-opensource/Kundera/tree/trunk/kundera-neo4j) and want to compare its performance with Native neo4j driver (and maybe SpringData too). This way I would be able to determine overhead my library is putting over native driver.
I plan to create an extension of YCSB for Neo4J.
My question is: what should be considered as a basic unit of object to be written into neo4j (should it be a single node or a couple of nodes joined by an edge).
What's current practice in Neo4J world. How people benchmarking neo4j performance are doing it.
There's already been some work for benchmarking Neo4J with Gatling: http://maxdemarzi.com/2013/02/14/neo4j-and-gatling-sitting-in-a-tree-performance-t-e-s-t-ing/
You could maybe adapt it.
See graphdb-benchmarks
The project graphdb-benchmarks is a benchmark between popular graph dataases. Currently the framework supports Titan, OrientDB, Neo4j and Sparksee. The purpose of this benchmark is to examine the performance of each graph database in terms of execution time. The benchmark is composed of four workloads, Clustering, Massive Insertion, Single Insertion and Query Workload. Every workload has been designed to simulate common operations in graph database systems.
Clustering Workload (CW): CW consists of a well-known community detection algorithm for modularity optimization, the Louvain Method. We adapt the algorithm on top of the benchmarked graph databases and employ cache techniques to take advantage of both graph database capabilities and in-memory execution speed. We measure the time the algorithm needs to converge.
Massive Insertion Workload (MIW): Create the graph database and configure it for massive loading, then we populate it with a particular dataset. We measure the time for the creation of the whole graph.
Single Insertion Workload (SIW): Create the graph database and load it with a particular dataset. Every object insertion (node or edge) is committed directly and the graph is constructed incrementally. We measure the insertion time per block, which consists of one thousand edges and the nodes that appear during the insertion of these edges.
Query Workload (QW): Execute three common queries:
FindNeighbours (FN): finds the neighbours of all nodes.
FindAdjacentNodes (FA): finds the adjacent nodes of all edges.
FindShortestPath (FS): finds the shortest path between the first node and 100 randomly picked nodes.
One way to performance-test is to use e.g. http://gatling-tool.org/. There is work underway to create benchmark frameworks at http://ldbc.eu . Otherwise, benchmarking is highly dependent on your domain dataset and the queries you are trying to do. Maybe you could start at https://github.com/neo4j/performance-benchmark and improve on it?

Map Reduce Algorithms on Terabytes of Data?

This question does not have a single "right" answer.
I'm interested in running Map Reduce algorithms, on a cluster, on Terabytes of data.
I want to learn more about the running time of said algorithms.
What books should I read?
I'm not interested in setting up Map Reduce clusters, or running standard algorithms. I want rigorous theoretical treatments or running time.
EDIT: The issue is not that map reduce changes running time. The issue is -- most algorithms do not distribute well to map reduce frameworks. I'm interested in algorithms that run on the map reduce framework.
Technically, there's no real different in the runtime analysis of MapReduce in comparison to "standard" algorithms - MapReduce is still an algorithm just like any other (or specifically, a class of algorithms that occur in multiple steps, with a certain interaction between those steps).
The runtime of a MapReduce job is still going to scale how normal algorithmic analysis would predict, when you factor in division of tasks across multiple machines and then find the maximum individual machine time required for each step.
That is, if you have a task which requires M map operations, and R reduce operations, running on N machines, and you expect that the average map operation will take m time and the average reduce operation r time, then you'll have an expected runtime of ceil(M/N)*m + ceil(R/N)*r time to complete all of the tasks in question.
Prediction of the values for M,R,m, and r are all something that can be accomplished with normal analysis of whatever algorithm you're plugging into MapReduce.
There are only two books that i know of that are published, but there are more in the works:
Pro hadoop and Hadoop: The Definitive Guide
Of these, Pro Hadoop is more of a beginners book, whilst The Definitive Guide is for those that know what Hadoop actually is.
I own The Definitive Guide and think its an excellent book. It provides good technical details on how the HDFS works, as well as covering a range of related topics such as MapReduce, Pig, Hive, HBase etc. It should also be noted that this book was written by Tom White who has been involved with the development of Hadoop for a good while, and now works at cloudera.
As far as the analysis of algorithms goes on Hadoop you could take a look at the TeraByte sort benchmarks. Yahoo have done a write up of how Hadoop performs for this particular benchmark: TeraByte Sort on Apache Hadoop. This paper was written in 2008.
More details about the 2009 results can be found here.
There is a great book about Data Mining algorithms applied to the MapReduce model.
It was written by two Stanford Professors and it if available for free:
http://infolab.stanford.edu/~ullman/mmds.html

Resources