I'm posting because I'm having strange results while stressing Neo4j 2.2.7 and OrientDB 2.1.4 and I am looking for an explanation (I'm pretty sure there is no bug in the code but if anyone is interested I'd be happy to share it).
Here are the facts:
I'm continuously shooting to the DBs the following OSql and Cypher queries, which are equivalent (except for the name of the attributes)
SELECT both('Meets').email FROM Employee WHERE nt_account = '<employeeid>'
MATCH (e: Employee {Nt_Account: '<employeeid>'}) -[:MEETS]- (y: Employee) RETURN y.E_Mail
nt_account and Nt_Account are both indexed.
the execution time of the queries, averaged over 100 repetitions, is:
OrientDB: 4.4ms
Neo4j: 7.6ms
to parallelise execution I'm using AKKA actors
Despite of the previous points, when continuously shooting the above mentioned queries from just one thread, I measured that Neo4j can serve ~59k requests, while OrientDB can serve ~16k requests.
The number of requests OrientDB could serve is consistently 3 to 5 times lower than Neo4j.
As you can imagine, points 5. and 6. shocked me a bit as I was expecting the number of requests served by OrientDB to be the greatest, given that it can execute the query in almost half of the time.
Does anybody have any idea of what's going on?
Is OrientDB doing something after having returned the query result?
Am I using the API unproperly?
More detail:
Here is how I execute the query in OrientDB (I found this here):
val start = System.currentTimeMillis()
graph.command(new OCommandSQL(<the_query>)).execute()
val ellapsedTime = System.currentTimeMillis() - start
graph is an OrientGraphNoTx instance, there is one such instance per actor.
I got comparable results by using the OrientGraph and a number of requests slightly lower by using the REST API.
Here is the method I used to execute the Neo4j query (notice that I turned off JSON streaming):
def queryRest(query: String): Unit = {
val reqData = s"""{"statements" : [ { "statement" : "$query" } ] }"""
val response = Http("http://localhost:7474/db/data/transaction/commit")
.postData(reqData)
.header("content-type", "application/json")
.header("accept", "application/json;stream=false")
.asString.body.length
}
Here are the measurements (the last row of both tables does not make much sense as the effective level of parallelism I achieved, computation_time / 1minute, is only ~12).
Related
I recently upgraded my Neo4j to 3.1.3, and alongside that, got the most recent APOC plugin (3.1.3.6).
I had a bit of code that worked fine, and could create ~3 million relationships in about a minute and a half wall time. But now, it's been running for over 8 hours and shows no sign of stopping...
Because the code used to run without any problems, I'm hoping something must have changed between versions that has lead to my code having been borked.
Is it rock_n_roll that should be changed (maybe to apoc.periodic.commit with positional arguments or something)? Thanks for any insight.
Here's what I'm running .
CALL apoc.periodic.rock_n_roll(
"MATCH (c:ChessPlayer),(r:Record) WHERE c.ChessPlayer_ID = r.ChessPlayer RETURN c,r",
"CYPHER planner=rule WITH {c} AS c, {r} AS r CREATE (c)-[:HAD_RECORD]->(r)",
200000)
My understanding is that call is querying the Cartesian product of ChessPlayers and Records, and then trying to filter them out row by row, and then doing the batch update on those final results (which eats a lot of memory, I think this one opening transaction is what's killing you). So if you can break it up so that each transaction can touch as few nodes as possible, it should be able to perform massively better (especially if r.ChessPlayer is indexed, since now you don't need to load all of them)
CALL apoc.periodic.rock_n_roll(
"MATCH (c:ChessPlayer) WHERE NOT EXISTS((c)-[:HAD_RECORD]->()) RETURN c",
"MATCH (r:Record) WHERE c.ChessPlayer_ID = r.ChessPlayer WITH c,r CREATE UNIQUE (c)-[:HAD_RECORD]->(r)",
100000)
periodic.commit() would work on a similar principle. The smaller (least nodes touched) you can make each transaction, the faster the batch will become.
I have been really struggling to achieve acceptable performance for my application with Neo4J 3.0.3. Here is some background:
I am trying to replace Apache Solr with Neo4j for an application to extend its capabilities, while maintaining or improving performance.
In Solr I have documents that essentially look like this:
{
"time": "2015-08-05T00:16:00Z",
"point": "45.8300018311,-129.759994507",
"sea_water_temperature": 18.49,
"sea_water_temperature_depth": 4,
"wind_speed": 6.48144,
"eastward_wind": 5.567876,
"northward_wind": -3.3178043,
"wind_depth": -15,
"sea_water_salinity": 32.19,
"sea_water_salinity_depth": 4,
"platform": 1,
"mission": 1,
"metadata": "KTDQ_20150805v20001_0016"
}
Since Solr is a key-value data store, my initial translation to Neo4J was going to be simple so I could get a feel for working with the API.
My method was essentially to have each Solr record equate to a Neo4J node, where every key-value would become a node-property.
Obviously a few tweaks were required (changing None to 'None' (python), changing ISO times to epoch times (neo4j doesnt support indexing datetimes), changing point to lat/lon (neo4j spatial indexing), etc).
My goal was to load up Neo4J using this model, regardless of how naive it might be.
Here is an example of a rest call I make when loading in a single record (using http:localhost:7474/db/data/cypher as my endpoint):
{
"query" :
"CREATE (r:record {lat : {lat}, SST : {SST}, meta : {meta}, lon : {lon}, time : {time}}) RETURN id(r);",
"params": {
"lat": 40.1021614075,
"SST": 6.521100044250488,
"meta": "KCEJ_20140418v20001_1430",
"lon": -70.8780212402,
"time": 1397883480
}
}
Note that I have actually removed quite a few parameters for testing neo4j.
Currently I have serious performance issues. Loading a document like this into Solr for me takes about 2 seconds. For Neo4J it takes:
~20 seconds using REST API
~45 seconds using BOLT
~70 seconds using py2neo
I have ~50,000,000 records I need to load. Doing this in Solr usually takes 24 hours, so Neo4J could take almost a month!!
I recorded these times without using a uniqueness constraint on my 'meta' attribute, and without adding each node into the spatial index. The time results in this scenario was extremely awful.
Running into this issue, I tried searching for performance tweaks online. The following things have not improved my situation:
-increasing the open file limit from 1024 to 40000
-using ext4, and tweaking it as documented here
-increasing the page cache size to 16 GB (my system has 32)
So far I have only addressed load times. After I had loaded about 50,000 nodes overnight, I attempted queries on my spatial index like so:
CALL spatial.withinDistance('my_layer', lon : 34.0, lat : 20.0, 1000)
as well as on my time index like so:
MATCH (r:record) WHERE r.time > {} AND r.time < {} RETURN r;
These simple queries would take literally several minutes just return possibly a few nodes.
In Apache Solr, the spatial index is extremely fast and responds within 5 seconds (even with all 50000000 docs loaded).
At this point, I am concerned as to whether or not this performance lag is due to the nature of my data model, the configuration of my server, etc.
My goal was to extrapolate from this model, and move several measurement types to their own class of Node, and create relationships from my base record node to these.
Is it possible that I am abusing Neo4j, and need to recreate this model to use relationships and several different Node types? Should I expect to see dramatic improvements?
As a side note, I originally planned to use a triple store (specifically Parliament) to store this data, and after struggling to work with RDF, decided that Neo4J looked promising and much easier to get up and running. Would it be worth while to go back to RDF?
Any advice, tips, comments are welcome. Thank you in advance.
EDIT:
As suggested in the comments, I have changed the behavior of my loading script.
Previously I was using python in this manner:
from neo4j.v1 import GraphDatabase
driver = GraphDatabase('http://localhost:7474/db/data')
session = driver.session()
for tuple in mydata:
statement = build_statement(tuple)
session.run(statement)
session.close()
With this approach, the actual .run() statements run in virtually no time. The .close() statement was where all the run time occurs.
My modified approach:
transaction = ''
for tuple in mydata:
statement = build_statement(tuple)
transaction += ('\n' + statement)
with session.begin_transaction() as tx:
tx.run(transaction)
session.close()
I'm a bit confused because the behavior of this is pretty much the same. .close() still takes around 45 seconds, except only it doesn't commit. Since I am reusing the same identifier in each of my statements (CREATE (r:record {...}) .... CREATE (r:record {...}) ...), I get the CypherError regarding this behavior. I don't really know how to avoid this problem at the moment, and furthermore, the run time did not seem to improve at all (I would expect an error to actually make this terminate much faster).
In a general sense, is there a best practice to use when attempting to estimate how long the setting of relationships takes in Neo4j?
For example, I used the data import tool successfully, and here's what I've got in my 2.24GB database:
IMPORT DONE in 3m 8s 791ms. Imported:
7432663 nodes
0 relationships
119743432 properties
In preparation for setting relationships, I set some indices:
CREATE INDEX ON :ChessPlayer(player_id);
CREATE INDEX ON :Matches(player_id);
Then I let it rip:
MATCH (p:Player),(m:Matches)
WHERE p.player_id = m.player_id
CREATE (p)-[r:HAD_MATCH]->(m)
Then, I started to realize, that I have no idea how to even estimate how long that setting these relationships might take to set. Is there a 'back of the envelope' calculation for determining at least a ballpark figure for this kind of thing?
I understand that everyone's situation is different on all levels, including software, hardware, and desired schema. But any discussion would no doubt be useful and would deepen mine (and anyone else who reads this)'s understanding.
PS: FWIW, I'm running Ubuntu 14.04 with 16GB RAM and an Intel Core i7-3630QM CPU # 2.40GHz
The problem here is that you don't take into account transaction sizes. In your example all :HAD_MATCH relationships are created in one single large transaction. A transaction internally builds up in memory first and then gets flushed to disc. If the transaction is too large to fit in your heap you'll might see massive performance degradation due to garbage collections or even OutOfMemoryExceptions.
Typically you want to limit transaction sizes to e.g. 10k - 100k atomic operations.
The probably most easy to do transaction batching in this case is using the rock_n_roll procedure from neo4j-apoc. This uses one cypher statement to provide the data to be worked on and a second one running for each of the results from the previous one in batched mode. Note that apoc requires Neo4j 3.x:
CALL apoc.periodic.rock_n_roll(
"MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
"WITH {p} AS p, {m} AS m CREATE (p)-[:HAD_MATCH]->(m)",
20000)
There was a bug in 3.0.0 and 3.0.1 causing this performing rather badly. So the above is for Neo4j >= 3.0.2.
If being on 3.0.0 / 3.0.1 use this as a workaround:
CALL apoc.periodic.rock_n_roll(
"MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
"CYPHER planner=rule WITH {p} AS p, {m} AS m CREATE (p)-[:HAD_MATCH]->(m)",
20000)
We are evaluating Neo4J for our application, testing it against a small test database with a total of around 20K nodes, 150K properties, and 100K relationships. The branching factor is ~100 relationships/node. Server and version information is below [1]. The Cypher query is:
MATCH p = ()-[r1:RATES]-(m1:Movie)-[r2:RATES]-(u1:User)-[r3:RATES]-(m2:Movie)-[r4:RATES]-()
RETURN r1.id as i_id, m1.id, r2.id, u1.id, r3.id, m2.id, r4.id as t_id;
(The first and last empty nodes aren't important to us, but I didn't see how to start with relationships.)
I killed it after a couple of hours. Maybe I'm expecting too much by hoping Neo4J would avoid combinatorial explosion. I tried tweaking some server parameters but got no further.
My main question is whether what I'm trying to do (a nine-step path query) is reasonable for Neo4J, or, for that matter, any graph database. I realize nine steps is a very deep search, and one that touches every node in the database multiple times, but unfortunately that's what our research needs to do.
Looking forward to your thoughts.
[1] System info:
The Linux server has 32 processors and 64GB of memory.
Neo4j - Graph Database Kernel (neo4j-kernel), version: 2.1.2.
java version "1.7.0_60", Java(TM) SE Runtime Environment (build 1.7.0_60-b19), Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)
To answer your main question, Neo4j has no problem doing a variable length query that does not result in a combinatorial explosion in the search space (an exponential time complexity as a result of your branching factor).
There is however an optimization that can be done to your Cypher query.
MATCH ()-[r1:RATES]->(m1:Movie),
(m1)<-[r2:RATES]-(u1:User),
(u1)-[r3:RATES]->(m2:Movie),
(m2)<-[r4:RATES]-()
RETURN r1.id as i_id, m1.id, r2.id, u1.id, r3.id, m2.id, r4.id as t_id;
That being said, Cypher has some current limitations with these kinds of queries. We call these queries "graph global operations". When you are running a query that touches the graph globally without a specific starting point, computation as well as writes and reads to disc can cause performance bottlenecks. When returning large payloads over HTTP REST, you'll encounter data transfer limitations within your network.
To test the difference between query response times due to network data transfer constraints, compare the previous query to the following:
MATCH ()-[r1:RATES]->(m1:Movie),
(m1)<-[r2:RATES]-(u1:User),
(u1)-[r3:RATES]->(m2:Movie),
(m2)<-[r4:RATES]-()
RETURN count(*)
The difference between the queries in response time should be significant.
So what are your options?
Option 1:
Write a Neo4j unmanaged extension in Java that runs on-heap embedded in the JVM using Neo4j's Java API. Your Cypher query can be translated imperatively into a traversal description that operates on your graph in-memory. Seeing that you have 64GB of memory, your Java heap should be configured so that Neo4j has access to 70-85% of your available memory.
You can learn more about the Neo4j Java API here: http://docs.neo4j.org/chunked/stable/server-unmanaged-extensions.html
Option 2:
Tune the performance configurations of Neo4j to run your graph in-memory and optimize your Cypher queries to limit the amount of data transferred over the network. Performance will still be sub-optimal for graph global operations.
I have a graph with ~89K nodes and ~1.2M relationships, and am trying to get the transitive closure of a single node via the following Cypher query:
start n=NODE(<id of a single node of interest>)
match (n)-[*1..]->(m)
where has(m.name)
return distinct m.name
Unfortunately, this query goes away and doesn't seem to come back (although to be fair I've only given it about an hour of execution time at this point).
Any suggestions on ways to optimise what I've got here, or better ways to achieve the requirement?
Notes:
Neo4J v2.0.0 (installed via Homebrew).
Mac OSX 10.8.5
Oracle Java 1.7.0_51
8GB physical RAM (neo4j JVM assigned whatever the default is)
Database is hosted on an SSD volume.
Query is submitted via the admin web UI's "Data browser".
"name" is an auto-indexed field.
CPU usage is fairly low - averaging around 20% of 8 cores.
I haven't gotten into the weeds of profiling the Neo4J server yet - my first attempt locked up VisualVM.
That's probably a combinatorial explosion of path, care to try this?
start n=NODE(<id of a single node of interest>),m=node:node_auto_index("name:*")
match shortestPath((n)-[*]->(m))
return m.name
without shortest-path it would look like that, but as you are only interested in the reachable nodes from n the above should be good enough.
start n=NODE(<id of a single node of interest>),m=node:node_auto_index("name:*")
match (n)-[*]->(m)
return distnct m.name
Try query - https://code.google.com/p/gueryframework/ - this is a standalone library but is has a neo4j adapter. I.e., you will have to rewrite your queries in the query format.
Better support for transitive closure was one of the main reasons for developing query, we mainly use this in software analysis tools where we need reachability / pattern analysis (e.g., the antipattern queries in http://xplrarc.massey.ac.nz/ are computed using query).
There is a brief discussion about this in the neo4j google group:
https://groups.google.com/forum/#!searchin/neo4j/jens/neo4j/n69ksEJxDtQ/29DNKyWKur4J
and an (older, not maintained) project with some benchmarking code:
https://code.google.com/p/graph-query-benchmarks/
Cheers, Jens