I have been really struggling to achieve acceptable performance for my application with Neo4J 3.0.3. Here is some background:
I am trying to replace Apache Solr with Neo4j for an application to extend its capabilities, while maintaining or improving performance.
In Solr I have documents that essentially look like this:
{
"time": "2015-08-05T00:16:00Z",
"point": "45.8300018311,-129.759994507",
"sea_water_temperature": 18.49,
"sea_water_temperature_depth": 4,
"wind_speed": 6.48144,
"eastward_wind": 5.567876,
"northward_wind": -3.3178043,
"wind_depth": -15,
"sea_water_salinity": 32.19,
"sea_water_salinity_depth": 4,
"platform": 1,
"mission": 1,
"metadata": "KTDQ_20150805v20001_0016"
}
Since Solr documents are essentially flat sets of key-value pairs, my initial translation to Neo4j was going to be simple, so I could get a feel for working with the API.
My method was essentially to have each Solr record equate to a Neo4J node, where every key-value would become a node-property.
Obviously a few tweaks were required: changing None to 'None' (Python), converting ISO times to epoch times (Neo4j doesn't support indexing datetimes), splitting the point into lat/lon (for Neo4j spatial indexing), and so on.
My goal was to load up Neo4J using this model, regardless of how naive it might be.
Here is an example of a REST call I make when loading a single record (using http://localhost:7474/db/data/cypher as my endpoint):
{
"query" :
"CREATE (r:record {lat : {lat}, SST : {SST}, meta : {meta}, lon : {lon}, time : {time}}) RETURN id(r);",
"params": {
"lat": 40.1021614075,
"SST": 6.521100044250488,
"meta": "KCEJ_20140418v20001_1430",
"lon": -70.8780212402,
"time": 1397883480
}
}
Note that I have actually removed quite a few parameters for testing neo4j.
Currently I have serious performance issues. Loading a document like this into Solr for me takes about 2 seconds. For Neo4J it takes:
~20 seconds using REST API
~45 seconds using BOLT
~70 seconds using py2neo
I have ~50,000,000 records I need to load. Doing this in Solr usually takes 24 hours, so Neo4J could take almost a month!!
I recorded these times without using a uniqueness constraint on my 'meta' attribute, and without adding each node to the spatial index; even so, the results were extremely poor.
Running into this issue, I tried searching for performance tweaks online. The following things have not improved my situation:
-increasing the open file limit from 1024 to 40000
-using ext4, and tweaking it as documented here
-increasing the page cache size to 16 GB (my system has 32)
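(For reference, the page cache change above is the dbms.memory.pagecache.size key in neo4j.conf; as far as I can tell that is the relevant knob in 3.0:)
dbms.memory.pagecache.size=16g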
So far I have only addressed load times. After I had loaded about 50,000 nodes overnight, I attempted queries on my spatial index like so:
CALL spatial.withinDistance('my_layer', {lon: 34.0, lat: 20.0}, 1000)
as well as on my time index like so:
MATCH (r:record) WHERE r.time > {} AND r.time < {} RETURN r;
These simple queries would take literally several minutes just to return possibly a few nodes.
In Apache Solr, the spatial index is extremely fast and responds within 5 seconds (even with all 50,000,000 docs loaded).
At this point, I am concerned as to whether or not this performance lag is due to the nature of my data model, the configuration of my server, etc.
My goal was to extrapolate from this model, and move several measurement types to their own class of Node, and create relationships from my base record node to these.
Is it possible that I am abusing Neo4j, and need to recreate this model to use relationships and several different Node types? Should I expect to see dramatic improvements?
As a side note, I originally planned to use a triple store (specifically Parliament) to store this data, and after struggling to work with RDF, decided that Neo4j looked promising and much easier to get up and running. Would it be worthwhile to go back to RDF?
Any advice, tips, comments are welcome. Thank you in advance.
EDIT:
As suggested in the comments, I have changed the behavior of my loading script.
Previously I was using Python in this manner:
from neo4j.v1 import GraphDatabase
driver = GraphDatabase.driver('bolt://localhost:7687')
session = driver.session()
for tuple in mydata:
    statement = build_statement(tuple)
    session.run(statement)
session.close()
With this approach, the actual .run() calls take virtually no time; the .close() call is where all the run time goes.
My modified approach:
transaction = ''
for tuple in mydata:
statement = build_statement(tuple)
transaction += ('\n' + statement)
with session.begin_transaction() as tx:
tx.run(transaction)
session.close()
I'm a bit confused because the behavior of this is pretty much the same: .close() still takes around 45 seconds, except now it doesn't commit. Since I am reusing the same identifier in each of my statements (CREATE (r:record {...}) ... CREATE (r:record {...}) ...), I get a CypherError about the repeated identifier. I don't really know how to avoid this problem at the moment, and furthermore the run time did not seem to improve at all (I would expect an error to make this terminate much faster).
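One direction that keeps coming up in the examples I've found (untested here, so treat it as a sketch) is to send a single parameterized UNWIND statement per batch instead of concatenating CREATE statements, so the identifier is only bound once. Roughly, where build_params is a small tuple-to-dict helper I would still have to write:
from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687')  # add auth here if it is enabled
session = driver.session()

# One parameterized statement; the rows arrive as a list of maps in the {rows} parameter.
statement = (
    "UNWIND {rows} AS row "
    "CREATE (r:record {lat: row.lat, lon: row.lon, SST: row.SST, "
    "meta: row.meta, time: row.time})"
)

BATCH_SIZE = 10000  # arbitrary; tune against heap and transaction size
for i in range(0, len(mydata), BATCH_SIZE):
    rows = [build_params(t) for t in mydata[i:i + BATCH_SIZE]]  # build_params: hypothetical helper
    with session.begin_transaction() as tx:
        tx.run(statement, {"rows": rows})
        tx.success = True  # explicitly mark the transaction to be committed when the block exits

session.close()
That would also sidestep the repeated-identifier CypherError, since there is only one CREATE in the statement.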
Related
I have the following question. I set up a Camel project to parse certain XML files, and I have to selectively take out certain nodes from a file.
I have two files, 246 KB and 347 KB in size. I am extracting a parent-child pair for 250 nodes in the example given here.
With the default factory, the times are 77 s and 106 s for the two files respectively. I wanted to improve the performance, so I switched to Saxon, and the times came down to 47 s and 54 s, which cut them considerably (nearly half for the larger file).
Is it possible to cut the time further? Any other factory or optimizations I could use would be appreciated.
I am using XPathBuilder to evaluate the XPaths; here is an example. Is it possible to avoid creating an XPathBuilder repeatedly? It seems like it has to be constructed for every XPath; I would rather have one instance and keep feeding XPaths into it, which might improve performance further.
return XPathBuilder.xpath(nodeXpath)
.saxon()
.namespace(Consts.XPATH_PREFIX, nameSpace)
.evaluate(exchange.getContext(), exchange.getIn().getBody(String.class), String.class);
Adding more details based on Michael's comments. I am in effect joining the two paths, which will become clear from my example below; I am combining the results into a JSON document.
So here we go. Let's say we have the following mappings for the first and second paths:
pData.tinf.rexd: bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:ReqdExctnDt/text()
pData.tinf.pIdentifi.instId://bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:CdtTrfTxInf[{1}]/bm:PmtId/bm:InstrId/text()
This would result in a JSON like the one below:
pData: {
  tinf: {
    rexd: <value_from_xml>,
    pIdentifi: {
      instId: <value_from_xml>
    }
  }
}
Hard to say without seeing your actual XPath expression, but given the file sizes and execution time my guess would be that you're doing a join which is being executed naively as a cartesian product, i.e. with O(n*m) performance. There is probably some way of reorganizing it to have logarithmic performance, but the devil is in the detail. Saxon-EE is quite good at optimizing join queries automatically; if not, there are often ways of doing it manually -- though XSLT gives you more options (e.g. using xsl:key or xsl:merge) than XPath does.
Actually, I was able to bring the time down to 10 seconds. I am using Apache Camel, so I added threads there so that multiple files can be read in separate threads. Once a file had been read, the processing was serial, iterating over the nodes that had to be traversed. I realized that it did not need to be serial, so I introduced parallelStream, and that gave it enough power. One thing to guard against is a proliferation of threads, since that can degrade performance, so I try to restrict the number of threads to two or three times the number of cores on the machine.
I'm posting because I'm having strange results while stressing Neo4j 2.2.7 and OrientDB 2.1.4 and I am looking for an explanation (I'm pretty sure there is no bug in the code but if anyone is interested I'd be happy to share it).
Here are the facts:
1. I'm continuously firing at the DBs the following OSql and Cypher queries, which are equivalent (except for the names of the attributes):
SELECT both('Meets').email FROM Employee WHERE nt_account = '<employeeid>'
MATCH (e: Employee {Nt_Account: '<employeeid>'}) -[:MEETS]- (y: Employee) RETURN y.E_Mail
2. nt_account and Nt_Account are both indexed.
3. The execution time of the queries, averaged over 100 repetitions, is:
OrientDB: 4.4ms
Neo4j: 7.6ms
4. To parallelise execution I'm using Akka actors.
5. Despite the previous points, when continuously firing the above queries from just one thread, I measured that Neo4j can serve ~59k requests, while OrientDB can serve ~16k requests.
6. The number of requests OrientDB can serve is consistently 3 to 5 times lower than Neo4j's.
As you can imagine, points 5 and 6 shocked me a bit, as I was expecting the number of requests served by OrientDB to be the greatest, given that it can execute the query in almost half the time.
Does anybody have any idea of what's going on?
Is OrientDB doing something after having returned the query result?
Am I using the API improperly?
More detail:
Here is how I execute the query in OrientDB (I found this here):
val start = System.currentTimeMillis()
graph.command(new OCommandSQL(<the_query>)).execute()
val ellapsedTime = System.currentTimeMillis() - start
graph is an OrientGraphNoTx instance; there is one such instance per actor.
I got comparable results using OrientGraph, and a slightly lower number of requests using the REST API.
Here is the method I used to execute the Neo4j query (notice that I turned off JSON streaming):
def queryRest(query: String): Unit = {
val reqData = s"""{"statements" : [ { "statement" : "$query" } ] }"""
val response = Http("http://localhost:7474/db/data/transaction/commit")
.postData(reqData)
.header("content-type", "application/json")
.header("accept", "application/json;stream=false")
.asString.body.length
}
Here are the measurements (the last row of both tables does not make much sense as the effective level of parallelism I achieved, computation_time / 1minute, is only ~12).
I'm using Cypher's LOAD CSV syntax in Neo4J 2.1.2. So far it's been a huge improvement over the more manual ETL process required in previous versions. But I'm running into some behavior in a single case that's not what I'd expect and I wonder if I'm missing something.
The cypher query being used is this:
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/James/Desktop/import/dependency_sets_short.csv' AS row
MATCH (s:Sense {uid: toInt(row[4])})
MERGE (ds:DependencySet {label: row[2]}) ON CREATE SET ds.optional=(row[3] = 't')
CREATE (s)-[:has]->(ds)
Here's a couple of lines of the CSV:
227303,1,TO-PURPOSE-NOMINAL,t,73830
334471,1,AT-LOCATION,t,92048
334470,1,AT-TIME,t,92048
334469,1,ON-LOCATION,t,92048
227302,1,TO-PURPOSE-INFINITIVE,t,73830
116008,1,TO-LOCATION,t,68204
116007,1,IN-LOCATION,t,68204
227301,1,TO-LOCATION,t,73830
334468,1,ON-DATE,t,92048
116006,1,AT-LOCATION,t,68204
334467,1,WITH-ASSOCIATE,t,92048
Basically, I'm matching a Sense node (previously imported) based on its ID value, which is the fifth column. Then I'm doing a merge to either get a DependencySet node if it exists, or create it. Finally, I'm creating a has edge between the Sense node and the DependencySet node. So far so good; this all works as expected. What's confusing is the performance as the size of the CSV grows.
CSV Lines Time (msec)
------------------------------
500 480
1000 717
2000 1110
5000 1521
10000 2111
50000 4794
100000 5907
200000 12302
300000 35494
400000 Java heap space error
My expectation is that growth would be more-or-less linear, particularly as I'm committing every 500 lines as recommended by the manual, but it's actually closer to polynomial:
What's worse is that somewhere between 300k and 400k rows, it runs into a Java heap space error. Based on the trend from previous imports, I'd expect the import of 400k to take a bit over a minute. Instead, it churns away for about 5-7 minutes before running into the heap space error. It seems like I could split this file into 300,000-line chunks, but isn't that what "USING PERIODIC COMMIT" is supposed to do, more or less? I suppose I could give Neo4J more memory too, but again, it's not clear why I should have to in this scenario.
Also, to be clear, the lookups on both Sense.uid and DependencySet.label are indexed, so the lookup penalty for these should be pretty small. Here's a snippet from the schema:
Indexes
ON :DependencySet(label) ONLINE (for uniqueness constraint)
ON :Sense(uid) ONLINE (for uniqueness constraint)
Any explanations or thoughts on an alternative approach would be appreciated.
EDIT: The problem definitely seems to be in the MATCH and/or CREATE part of the query. If I remove lines 3 and 5 from the Cypher query it performs fine.
I assume that you've already created all the Sense labeled nodes before running this LOAD CSV import. What I think is going on is that as you are matching nodes with the label Sense into memory and creating relationships from the Sense node to the DependencySet via CREATE (s)-[:has]->(ds), you are increasing utilization of the available heap.
Another possibility is that the size of your relationship store in your memory mapped settings needs to be increased. In your scenario it looks like the Sense nodes have a high degree of connectivity to other nodes in the graph. When this happens, the relationship store for those nodes requires more memory. Eventually, around the 400k mark, the heap is maxed out. Up until that point it needs to do more garbage collection and read from disk.
Michael Hunger put together an excellent blog post on memory mapped settings for fast LOAD CSV performance. See here: http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
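For reference, the kind of settings that post walks through live in conf/neo4j.properties on 2.1; the values below are only placeholders to size against your actual store files, not a recommendation:
neostore.nodestore.db.mapped_memory=512M
neostore.relationshipstore.db.mapped_memory=2G
neostore.propertystore.db.mapped_memory=1G
neostore.propertystore.db.strings.mapped_memory=512M
neostore.propertystore.db.arrays.mapped_memory=512M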
That should resolve your problem. I don't see anything wrong with your query.
I believe the line
MATCH (s:Sense {uid: toInt(row[4])})
is what drives the growth in time. Somewhere around the 200,000 mark on the x axis of your graph, you no longer have all the Sense nodes in memory; some of them must be cached to disk. Thus the increase in time is simply the cost of re-loading data from the disk cache into memory and vice versa (otherwise it would have stayed roughly linear if everything were kept in memory).
Maybe if you could post your server memory settings, we could dig deeper.
For the Java heap error, refer to Kenny's answer.
How do I improve performance when writing to Neo4j? I currently have Neo4j set up on a server and I am running it in embedded mode. Based on configurations I've found online, I believe my settings store the entire contents of my graph database in memory:
neostore.nodestore.db.mapped_memory=0
neostore.relationship.db.mapped_memory=0
neostore.propertystore.db.mapped_memory=0
neostore.propertystore.db.strings.mapped_memory=0
neostore.propertystore.db.arrays.mapped_memory=0
neostore.propertystore.db.index.keys.mapped_memory=0
neostore.propertystore.db.index.mapped_memory=0
node_auto_indexing=true
node_keys_indexable=type,id
cache_type=strong
use_memory_mapped_buffers=false
node_cache_size=12G
relationship_cache_size=12G
node_cache_array_fraction=10
relationship_cache_array_fraction=10
Please let me know if this is incorrect. The problem I am encountering is that when I try to persist information to the graph database, those times are not very quick compared to our MySQL times for the same thing (e.g. adding 250 items takes about 3 s, while in MySQL it takes 1 s). I read online that having multiple indexes can slow down persistence, so I am looking into that right now to see if it is the culprit. But I just wanted to make sure that my configuration is in line with running the graph database in memory.
Second question on this topic: if my configuration is good and my database is indeed in memory, is there a way to optimize persisting data, in case this isn't the silver bullet? When we run our test with 10 threads instead of one, the execution times seem to bubble up (e.g. thread 1 finishes in 1 s, thread 2 finishes in 2 s, thread 3 finishes in 3 s, etc.). Is there some special multithreaded configuration that I am missing to improve performance when multiple threads are hitting the database at the same time?
Neo4J version
1.9.1-enterprise
My Jvm configs are
-Xms25G -Xmx25G -XX:+UseNUMA -XX:+UseSerialGC
My Machine Specs:
File system type ext3
Your cache arguments are invalid.
node_cache_size=12G
relationship_cache_size=12G
node_cache_array_fraction=10
relationship_cache_array_fraction=10
These can only be used with the GCR cache. Setting the cache isn't going to put everything in memory for you at start-up; you will have to write code to do that yourself. Something like this:
GlobalGraphOperations ggo = GlobalGraphOperations.at(graphDatabaseFactory);
for (Node n : ggo.getAllNodes()) {
for (String propertyKey : n.getPropertyKeys()) {
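// reading each property pulls the node's data into the object cache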
n.getProperty(propertyKey);
}
for (Relationship relationship : n.getRelationships()) {
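// the empty body is intentional: simply iterating loads each relationship into the cache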
}
}
Beware of the strong cache: if you have a lot of nodes/relationships, your cache will eventually become large, and performing GC against it will cause long pauses in your system.
My recommendation would be to use the memory-mapped files, as these are handled by the OS and live outside of heap space. They don't provide anywhere near the speed of the object cache, but they will give you a speed-up when you have to read from the Neo store.
I've been looking around the web for information on doing maths in Redis and haven't actually found much. I'm using the redis-rb gem in Rails, and caching lists of results:
e = [1738738.0, 2019461.0, 1488842.0, 2272588.0, 1506046.0, 2448701.0, 3554207.0, 1659395.0, ...]
$redis.lpush "analytics:math_test", e
Currently, our lists of numbers max out in the thousands to tens of thousands of values per list per day, with the number of lists likely in the thousands per day. (This is not actually that much; however, we're growing, and expect much larger sample sizes very soon.)
For each of these lists, I'd like to be able to run stats. I currently do this in-app
def basic_stats(arr)
return nil if arr.nil? or arr.empty?
min = arr.min.to_f
max = arr.max.to_f
total = arr.inject(:+)
len = arr.length
mean = total.to_f / len # to_f so we don't get an integer result
sorted = arr.sort
median = len % 2 == 1 ? sorted[len/2] : (sorted[len/2 - 1] + sorted[len/2]).to_f / 2
sum = arr.inject(0){|accum, i| accum +(i-mean)**2 }
variance = sum/(arr.length - 1).to_f
std_dev = Math.sqrt(variance).nan? ? 0 : Math.sqrt(variance)
{min: min, max: max, mean: mean, median: median, std_dev: std_dev, size: len}
end
and, while I could simply store the stats, I will often have to aggregate lists together and run stats on the aggregated list. Thus, it makes sense to store the raw numbers rather than every possible aggregated set. Because of this, I need the math to be fast, and I have been exploring ways to do this. The simplest way is just doing it in-app; with ~150k items in a list, this isn't actually too terrible:
$redis_analytics.llen "analytics:math_test"
=> 156954
Benchmark.measure do
basic_stats $redis_analytics.lrange("analytics:math_test", 0, -1).map(&:to_f)
end
=> 2.650000 0.060000 2.710000 ( 2.732993)
While I'd rather not spend nearly 3 seconds on a single calculation, this is already about 10x the sample size of my current use case, so it's not terrible. But what if we were working with a sample size of one million or so?
$redis_analytics.llen("analytics:math_test")
=> 1063454
Benchmark.measure do
basic_stats $redis_analytics.lrange("analytics:math_test", 0, -1).map(&:to_f)
end
=> 21.360000 0.340000 21.700000 ( 21.847734)
Options
Use the SORT method on the list; then you can instantaneously get min/max/length in Redis (see the sketch just after this list). Unfortunately, it seems that you still have to go in-app for things like median, mean, and std_dev, unless we can calculate these in Redis.
Use Lua scripting to do the calculations. (I haven't learned any Lua yet, so can't say I know what this would look like. If it's likely faster, I'd like to know so I can try it.)
Some more efficient way to utilize Ruby, which seems a wee bit unlikely since utilizing what seems like a fairly decent stats gem has analogous results
Use a different database.
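A rough sketch of what the first option looks like at the command level. It's shown with Python/redis-py purely for illustration (redis-rb exposes the same SORT and LLEN commands), and the key name is just the one from above:
import redis

r = redis.StrictRedis(decode_responses=True)
key = 'analytics:math_test'

# SORT with LIMIT 0 1 returns the smallest element; desc=True flips it for the largest.
# LLEN gives the list length. Everything runs server-side, so no bulk transfer is needed,
# but note that SORT is O(N log N) and blocks the server while it runs.
minimum = float(r.sort(key, start=0, num=1)[0])
maximum = float(r.sort(key, start=0, num=1, desc=True)[0])
length = r.llen(key)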
Example using StatsSample gem
Using a gem seems to gain me nothing. In Python I'd probably write a C module; I'm not sure whether many Ruby stats gems are written in C.
require 'statsample'
def basic_stats(stats)
return nil if stats.nil? or stats.empty?
arr = stats.to_scale
{min: arr.min, max: arr.max, mean: arr.mean, median: arr.median, std_dev: arr.sd, size: stats.length}
end
Benchmark.measure do
basic_stats $redis_analytics.lrange("analytics:math_test", 0, -1).map(&:to_f)
end
=> 20.860000 0.440000 21.300000 ( 21.436437)
Coda
It's quite possible, of course, that such large stats calculations will simply take a long time and that I should offload them to a queue. However, given that much of this math is actually happening inside Ruby/Rails, rather than in the database, I thought I might have other options.
I want to keep this open in case anyone has any input that could help others doing the same thing. For me, however, I've just realized that I'm spending too much time trying to force Redis to do something that SQL does quite well. If I simply dump this into Postgres, I can do really efficient aggregation AND math directly in the database. I think I was just stuck using Redis for something that, when it started, was a good idea, but scaled out to something bad.
Lua scripting is probably the best way to solve this problem, if you can switch to Redis 2.6. By the way, testing the speed should be pretty straightforward, so given the small time investment needed, I strongly suggest trying Lua scripting to see what result you get.
Another thing you could do is use Lua when setting data, and have it also update a related Hash per list that directly holds the min/max/average stats, so you don't have to compute those stats every time; they get updated incrementally instead. That's not always possible, by the way; it depends on your specific use case.
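To make the scripting idea concrete, here is a minimal sketch (assuming Redis >= 2.6; shown via Python/redis-py for illustration, redis-rb can send the same script through its eval/script interface, and the key name is the one from the question):
import redis

r = redis.StrictRedis()

# Minimal Lua sketch: compute count, sum, min and max server-side in one round trip.
# It still walks the whole list inside Redis, so it blocks the server on very large lists.
STATS_SCRIPT = """
local vals = redis.call('LRANGE', KEYS[1], 0, -1)
local count = #vals
if count == 0 then return {0, '0', '0', '0'} end
local sum, min, max = 0, tonumber(vals[1]), tonumber(vals[1])
for i = 1, count do
  local v = tonumber(vals[i])
  sum = sum + v
  if v < min then min = v end
  if v > max then max = v end
end
-- return the floats as strings so the Lua-to-Redis conversion doesn't truncate them to integers
return {count, tostring(sum), tostring(min), tostring(max)}
"""

stats = r.register_script(STATS_SCRIPT)
count, total, lo, hi = stats(keys=['analytics:math_test'])
Mean is then just total/count on the client; median and standard deviation would need the script to sort or keep running aggregates, which is where the per-list Hash idea above comes in.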
I would take a look at NArray. From their homepage:
This extension library incorporates fast calculation and easy manipulation of large numerical arrays into the Ruby language.
It looks like their array class has most of the functions you need built in. Cmd-F "Statistics" on that page.