I am exploring the use of Arangodb as a graph engine for a project I am working on that needs shortest path analysis.
My collections look like this:
a route network of ~3.5M edges in an edge collection (_to/_from)
a vertex collection ~2.7M vertices (geo index on [lat,lng]).
a trips collection with start/end locations (not mapped to nodes).
The first task is to snap the origin and destination coordinates of the trips to vertices in on the network. I am using the following query to do that:
FOR t IN trips
let snappedFrom = (
FOR x IN nodes
SORT GEO_DISTANCE([t.Orig_Long, t.Orig_Lat], [x.lng, x.lat]) ASC
LIMIT 1
RETURN x._id
)[0]
let snappedTo = (
FOR x IN nodes
SORT GEO_DISTANCE([t.Dest_Long, t.Dest_Lat], [x.lng, x.lat]) ASC
LIMIT 1
RETURN x._id
)[0]
UPDATE t._key WITH {snappedFrom,snappedTo} IN trips
This is taking around 3.5 hours, and I want to reduce that significantly if possible.
I am running on an AWS instance with 32GB of RAM and 8 cores. I notice that when running this query, it is only using a single core which is killing me.
I am curious about setting up the arangodb for pure performance. My use case is using the DB as a calculator really. In fact is likely it will be part of a CI/CD workflow when done. I don't need any safe guards in there, there wont be any parallel user requests, and if the data is bad, I just blow it away and start again.
I am using a standard install with docker
docker run -it --name=adb --rm -p 8528:8528 -v arangodb:/data -d -v /var/run/docker.sock:/var/run/docker.sock arangodb/arangodb-starter --starter.address=<$IP> --starter.mode=single
I am going to run into the same issue when I run shortest_path on all trips too, that will take forever if single core.
Any help with the config, better query, or even better AWS setups would be truly appreciated.
add Geo-Spatial Indexes
on Orig and Dest fields, that will enable server to optimize / speed up sub-queries
for further speeding up of processing run main query in batches, processing more smaller batches is faster than running over all documents at once
Related
I'm running neo4j version 2.2.5. I love all the CYPHER language, Python integration, ease of use, and very responsive user community.
I've developed a prototype of an application and am encountering some very poor performance times. I've read a lot of links related to performance tuning. I will attempt to outline my entire database here so that someone can provide guidance to me.
My machine is a MacBook Pro, 16GB of RAM, and 500GB SSD. It's very fast for everything else I do in Spark + Python + Hadoop. It's fast for Neo4j too, BUT when I get to like 2-4M nodes then it's insanely slow.
I've used both of these commands to start up neo4j, thinking they will help, and neither is that helpful:
./neo4j-community-2.2.5/bin/neo4j start -Xms512m -Xmx3g -XX:+UseConcMarkSweepGC
./neo4j-community-2.2.5/bin/neo4j start -Xms512m -Xmx3g -XX:+UseG1GC
My neo4j.properties file is as follows:
################################################################
# Neo4j
#
# neo4j.properties - database tuning parameters
#
################################################################
# Enable this to be able to upgrade a store from an older version.
#allow_store_upgrade=true
# The amount of memory to use for mapping the store files, in bytes (or
# kilobytes with the 'k' suffix, megabytes with 'm' and gigabytes with 'g').
# If Neo4j is running on a dedicated server, then it is generally recommended
# to leave about 2-4 gigabytes for the operating system, give the JVM enough
# heap to hold all your transaction state and query context, and then leave the
# rest for the page cache.
# The default page cache memory assumes the machine is dedicated to running
# Neo4j, and is heuristically set to 75% of RAM minus the max Java heap size.
dbms.pagecache.memory=6g
# Enable this to specify a parser other than the default one.
#cypher_parser_version=2.0
# Keep logical logs, helps debugging but uses more disk space, enabled for
# legacy reasons To limit space needed to store historical logs use values such
# as: "7 days" or "100M size" instead of "true".
#keep_logical_logs=7 days
# Enable shell server so that remote clients can connect via Neo4j shell.
#remote_shell_enabled=true
# The network interface IP the shell will listen on (use 0.0.0 for all interfaces).
#remote_shell_host=127.0.0.1
# The port the shell will listen on, default is 1337.
#remote_shell_port=1337
# The type of cache to use for nodes and relationships.
#cache_type=soft
To create my database from a fresh start, I first create these indexes, they are on all of my node types, and edges that I'm using.
CREATE CONSTRAINT ON (id:KnownIDType) ASSERT id.id_type_value IS UNIQUE;
CREATE CONSTRAINT ON (p:PerspectiveKey) ASSERT p.perspective_key IS UNIQUE;
CREATE INDEX ON :KnownIDType(id_type);
CREATE INDEX ON :KnownIDType(id_value);
CREATE INDEX ON :KNOWN_BY(StartDT);
CREATE INDEX ON :KNOWN_BY(EndDT);
CREATE INDEX ON :HAS_PERSPECTIVE(Country);
I have 8,601,880 nodes.
I run this query, and it takes 9 minutes.
MATCH (l:KnownIDType { id_type:'CodeType1' })<-[e1:KNOWN_BY]-(m:KnownIDType { id_type:'CodeType2' })-[e2:KNOWN_BY]->(n:KnownIDType)<-[e3:KNOWN_BY]-(o:KnownIDType { id_type:'CodeType3' })-[e4:KNOWN_BY]->(p:KnownIDType { id_type:'CodeType4' }), (n)-[e5:HAS_PERSPECTIVE]->(q:PerspectiveKey {perspective_key:100})
WHERE 1=1
AND l.id_type IN ['CodeType1']
AND m.id_type IN ['CodeType2']
AND n.id_type IN ['CodeTypeA', 'CodeTypeB', 'CodeTypeC']
AND o.id_type IN ['CodeType3']
AND p.id_type IN ['CodeType4']
AND 20131231 >= e1.StartDT and 20131231 < e1.EndDT
AND 20131231 >= e2.StartDT and 20131231 < e2.EndDT
AND 20131231 >= e3.StartDT and 20131231 < e3.EndDT
AND 20131231 >= e4.StartDT and 20131231 < e4.EndDT
WITH o, o.id_value as KnownIDValue, e5.Country as Country, count(distinct p.id_value) as ACount
WHERE AmbiguousCount > 1
RETURN 20131231 as AsOfDate, 'CodeType' as KnownIDType, 'ACount' as MetricName, count(ACount) as MetricValue
;
I'm looking for more like 15s or less response time. Like I do with < 1M nodes.
What would you suggest? I am happy to provide more information if you tell me what you need.
Thanks a bunch in advance.
Here are a couple of ideas how to speed up your query:
Don't use IN if there is only one element. Use =
With a growing number of nodes, the index lookup will obviously take longer. Instead of having a single label with an indexed property, you could use the id_type property as label. Something like (l:KnownIDTypeCode1)<-[e1:KNOWN_BY]-(m:KnownIDTypeCode2).
Split up the query in two parts. First MATCH your KNOWN_BY path, then collect what you need using WITH and MATCH the HAS_PERSPECTIVE part.
The range queries on the StartDT and EndDT property could be slow. Try to remove them to test if this slows down the query.
Also, it looks like you could replace the >= and < with =, sind you use the same date everywhere.
If you really have to filter date ranges a lot, it might help to implement it in your graph model. One option would be to use Knownby nodes instead of KNOWN_BY relationships and connect them to Date nodes.
First upgrade to version of 2.3, because it should improve performance - http://neo4j.com/release-notes/neo4j-2-3-0/
Hint
It doesn't make sense to use IN for array with one element.
Profile your query with EXPLAIN and PROFILE
http://neo4j.com/docs/stable/how-do-i-profile-a-query.html
Martin, your second recommendation, has sped up my matching paths to single digit seconds, I am grateful for your help. Thank you. While it involved a refactoring the design of my graph, and query patterns, it's improved the performance exponentially. I decided to create CodeType1, CodeType2, CodeType[N] as nodes labels, and minimized the use of node properties, except for keeping the temporality properties on the edges. Thank you again so much! Please let me know if there is anything I can do to help.
We are running a 4 node cluster currently, (3 machines with 8GB and one with 4GB RAM) and while trying run a left outer join of a table (about 7.5 GB) on itself. Few of the fields of the table have been stored as columnstore.
The structure of the query is somewhat like:
SELECT a.x, a.y from tableX left outer join tableY on a.z=b.z group by a.x;
Any ideas?
We've seen this error reproduce when the machine is under a significant amount of load (it triggers an internal TCP timeout, hence the -1 as it's not one of ours).
Were the machines running with a lot of load while you ran the query? E.g. you were running several other queries concurrently or something else on the machines?
We are actively working on resolving this in a general way (it is relatively hard to reproduce), and we will keep this thread up-to-date with our progress.
I am running a bulk import of data into a neo4j instance (I have run against 2.2.0 community and enterprise editions as well as 2.1.7 community) running in server mode. My application creates a bunch of nodes in memory, and will peridoically stop to write a series .csv files and send cypher to the neo4j instance to upload the files. (this was done due to performance issues with running the application using the plain old REST API).
Overall, I'm looking to upload something like 150-5000 million nodes, so this is, in principle, the type of thing that neo4j claims to be able to handle relatively well.
Well, anyway, what I'm noticing when I run this against production data is that the application runs in two states -- one where the csv upload processes between 2k-8k of nodes per second, and one where it processes between 80-200 nodes per second. The two states are interwoven when you look at the upload as a time series, and as time goes on, it spends increasingly long amounts of time in the slow state.
Nodes are created through a series of
MERGE (:{NODE_TYPE} {csvLine.key = n.primaryKey}) on create set [PROPERTY LIST];
statements, and I have indexes on everything that I'm doing merges against. This doesn't feel like a degradation in the insert statements, because the slowdown is not linear, but rather bimodal, this feels like there are garbage collection in the neo4j instance. What is the best way to tune the neo4j JVM garbage collector for frequent bulk inserts?
neo4j.properties:
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
#neostore.relationshipgroupstore.db.mapped_memory=10M
neostore.propertystore.db.mapped_memory=100M
#neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M
neo4j-wrapper.conf:
wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional=-XX:-OmitStackTraceInFastThrow
wrapper.java.additional=-XX:hashCode=5
wrapper.java.initmemory=8194
wrapper.java.maxmemory=8194
This felt like the sweet spot for both the overall heap memory and the neostore stuff. Increasing the overall heap degraded performance. That said, the neo4j garbage collection logs frequently have that GC (Allocation Failure) message.
EDIT: in response to Michael Hunger:
the machine has 64 GB of RAM, and nothing seems to be maxed out. It also seems like only a small number of cores are being used at any time. Garbage collector profiling shows that the garbage collector seems to be running quite frequently.
The exact cypher statements are, for example:
USING PERIODIC COMMIT 110000 LOAD CSV WITH HEADERS FROM 'file:///home/jschirmer/Event_2015_4_773476.csv' AS csvLine MERGE (s:Event {primaryKey: csvLine.primaryKey}) ON CREATE SET s.checkSum= csvLine.checkSum,s.epochTime= toInt(csvLine.epochTime),s.epochTimeCreated= toInt(csvLine.epochTimeCreated),s.epochTimeUpdated= toInt(csvLine.epochTimeUpdated),s.eventDescription= csvLine.eventDescription,s.fileName= csvLine.fileName,s.ip= csvLine.ip,s.lineNumber= toInt(csvLine.lineNumber),s.port= csvLine.port,s.processPid= csvLine.processPid,s.rawEventLine= csvLine.rawEventLine,s.serverId= csvLine.serverId,s.status= toInt(csvLine.status);
USING PERIODIC COMMIT 110000 LOAD CSV WITH HEADERS FROM 'file:///home/jschirmer/Event__File_2015_4_773476.csv' AS csvLine MATCH (n:SC_CSR{primaryKey: csvLine.Event_id}), (s:File{fileName: csvLine.File_id}) MERGE n-[:DATA_SOURCE]->s;
Though there are serveral such statements being made
I have tried a single concurrent transaction as well as running several (~3) such statements in parallel (which gives a roughly 2x improvement). I've tried tuning the periodic commit frequency, and the size of the file. It seems that this maximizes performance when the csv file is roughly 100k lines, which means that really, the periodic commit can be off.
I have not run profile on the staments. I will do that, but I thought that the eager merget problem was avoided by using MERGE ... on create statements.
IN general your config looks ok, what RAM does your machine have?
For the things you merge against I'd recommend constraint instead of an index.
What's your tx size? And how many concurrent tx do you run?
Instead of your generic merge statement (which wouldn't compile) can you share the concrete statements?
Did you profile the statements? Perhaps you run into the eager pipe problem:
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
Do you use periodic commit?
How large are you CSV files?
See: http://neo4j.com/developer/guide-import-csv/
I have imported nodes.tsv ( 350MB, 18M rows, 3 cols) and rels.tsv ( 5GB, 150M rows, 2 cols) using the batch-importer script.
These are my batch.properties file entries
• neostore.nodestore.db.mapped_memory=250M
• neostore.relationshipstore.db.mapped_memory=1000M
• neostore.relationshipgroupstore.db.mapped_memory=10M
• neostore.propertystore.db.mapped_memory=500M
• neostore.propertystore.db.strings.mapped_memory=500M
• neostore.propertystore.db.arrays.mapped_memory=215M
• dump_configuration=true
I have turned on auto update and auto indexing in ne04j.properties as follows
• allow_store_upgrade=true • node_auto_indexing=true
• node_keys_indexable=name,title • relationship_auto_indexing=true
• relationship_keys_indexable=sent_date,has_read
I'm using neo4j 2.2 version on 64 bit windows server that has 1 TB SSD and 256GB ram.
What's the configuration for batch importer and neo4j server that I should use for maximum query and data loading peformance?
This query for ex: is timing out in the browser
MATCH ()-[r:BELONGS_TO]->() RETURN r
If you have that much RAM:
Your memory mapping config is wrong for 2.2
use only this setting:
`dbms.pagecache.memory=20G``
and then provide neo4j with 24G heap in neo4j-wrapper.conf
Use Neo4j enterprise which scales much better.
Disable the auto-indexes they are not used for what you're doing.
Your query doesn't make sense for any use-case:
MATCH ()-[r:BELONGS_TO]->() RETURN r
Sub-second graph queries always start at a set of concrete start-points (retrieved with an index lookup) and then traverse out from those starting points.
Global scan queries like your's will just pull all data in memory and inefficiently work over it.
Esp. if you return so much data you can't assume sub-second performance. The data volume alone will kill it.
So figure out a label + property-values that you want to start from and then write the query that traverses out from those start points.
If you want to have sub-second on something you're doing you have to go down to the Java API and aggregate there, e.g. with a server extension:
int counter=0;
for (Relationship r : GlobalGraphOperations.at(db)) {
if (r.hasType(Types.BELONGS_TO)) counter++;
}
return counter;
With millions of nodes that query might be slow no matter what you do, though with the amount of memory you have available maybe it wouldn't be a big deal. This is a good guide for calculating memory settings:
http://neo4j.com/developer/guide-performance-tuning/
While you're playing, I would set the query timeout on the server so that your queries can't jam up the server and force you to need to restart it:
http://neo4j.com/docs/stable/server-configuration.html
You might try starting with LIMIT clauses on your queries so that you can get an idea for how the performance degrades as the LIMIT increases.
If you can possibly find a way to limit your query based on node selects that would also be helpful, especially if you can do it by a label or by a label/property combination (which you can index).
Lastly, I would try using EXPLAIN in the web console to get an idea for how your queries will be executed:
http://neo4j.com/docs/2.2.0/how-do-i-profile-a-query.html
Also you can use PROFILE, though that will run the query, so you'll need to be a bit more careful there. You can probably use the LIMIT here too to play and see how things work
I have written a variety of queries using cypher that take no less than 200ms per query. They're very straightforward, so I'm having trouble identifying where the bottleneck is.
Simple Match with Parameters, 2200ms:
Simple Distinct Match with Parameters, 200ms:
Pathing, 2500ms:
At first I thought the issue was a lack of resources, because I was running neo4j and my application on the same box. While the performance monitor indicated that CPU and memory were largely free'd up and available, I moved the neo4j server to another local box and observed similar latency. Both servers are workstations with fairly new Xeon processors, 12GB memory and SSDs for the data storage. All of the above leads me to believe that the latency isn't due to my hardware. OS is Windows 7.
The graph has less than 200 nodes and less than 200 relationships.
I've attached some queries that I send to neo4j along with the configuration for the server, database, and JVM. No plugins or extensions are loaded.
Pastebin Links:
Database Configuration
Server Configuration
JVM Configuration
[Expanding a bit on a comment I made earlier.]
#TFerrell: Your comments state that "all nodes have labels", and that you tried applying indexes. However, it is not clear if you actually specified the labels in your slow Cypher queries. I noticed from your original question statement that neither of your slower queries actually specified a node label (which presumably should have been "Project").
If your Cypher query does not specify the label for a node, then the DB engine has to test every node, and it also cannot apply an index.
So, please try specifying the correct node label(s) in your slow queries.
Is that the first run or a subsequent run of these queries?
You probably don't have a label on your nodes and no index or unique constraint.
So Neo4j has to scan the whole store for your node pulling everything into memory, loading the properties and checking.
try this:
run until count returns 0:
match (n) where not n:Entity set n:Entity return count(*);
add the constraint
create constraint on (e:Entity) assert e.Id is unique;
run your query again:
match (n:Element {Id:{Id}}) return n
etc.
It seems there is something wrong with the automatic memory mapping calculation when you are on Windows (memory mapping on heap).
I just looked at your messages.log and added up some numbers, so it seems the mmio alone is enough to fill your java heap space (old-gen) leaving no room for the database, caches etc.
Please try to amend that by fixing the mmio config in your conf/neo4j.properties to more sensible values (than the auto-calculation).
For your small store just uncommenting the values starting with #neostore. (i.e. remove the #) should work fine.
Otherwise something like this (fitting for a 3GB heap) for a larger graph (2M nodes, 10M rels, 20M props,10M long strings):
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=250M
neostore.propertystore.db.mapped_memory=250M
neostore.propertystore.db.strings.mapped_memory=250M
neostore.propertystore.db.arrays.mapped_memory=0M
Here are the added numbers:
auto mmio: 134217728 + 134217728 + 536870912 + 536870912 + 1073741824 = 2.3GB
stores sizes: 1073920 + 1073664 + 3221698 + 3221460 + 1073786 = 9MB
JVM max: 3.11 RAM : 13.98 SWAP: 27.97 GB
max heaps: Eden: 1.16, oldgen: 2.33
taken from:
neostore.propertystore.db.strings] brickCount=8 brickSize=134144b mappedMem=134217728b (storeSize=1073920b)
neostore.propertystore.db.arrays] brickCount=8 brickSize=134144b mappedMem=134217728b (storeSize=1073664b)
neostore.propertystore.db] brickCount=6 brickSize=536854b mappedMem=536870912b (storeSize=3221698b)
neostore.relationshipstore.db] brickCount=6 brickSize=536844b mappedMem=536870912b (storeSize=3221460b)
neostore.nodestore.db] brickCount=1 brickSize=1073730b mappedMem=1073741824b (storeSize=1073786b)