Neo4j - query performance degradation - neo4j

Here is my query:
MATCH (p:Publisher)-[r:PUBLISHED]->(w:Woka)-[s:AUTHORED]-(a:Author)
MATCH (l:Language)-[t:USED]->(w:Woka)
WHERE (a.author_name =~ '.*Camus.*' and a.author_name =~ '.*Albert.*')
RETURN p.publisher_name, w.woka_title, a.author_name, l.language_name;
The first time this is executing the result is returned in 3.8 seconds. For the second execution couple minutes later the result is returned in 15.1 seconds. The more I am executing the longer the response time. For the third execution the response time is increasing and several moments later I am getting the results between 30 and 90 seconds.
I am the only user of this (development) database. No data is added or deleted or changed there. No indexes are dropped or created also there.
When closing two out of three connections to the database the response time goes back to 15 seconds.
Memory is set 4GB as init and max up to 8GB. Server has 16GB total memory.
What is happening here? Why the response time differs so much?

How big is your graph? Could be that it allocates a lot of heap for caching and then there is not enough space for running the queries without garbage collection.
I presume your relationships are all 1:n, if not add a WITH distinct p,w,a in between the two matches.
Your query is also suboptimal, and will probably create a lot of intermediate results, you can use PROFILE to look at the query plan.
Try this:
PROFILE
MATCH (a:Author)
WHERE (a.author_name =~ '.*Camus.*' and a.author_name =~ '.*Albert.*')
MATCH (p:Publisher)-[:PUBLISHED]->(w:Woka)<-[:AUTHORED]-(a)
MATCH (l:Language)-[:USED]->(w:Woka)
RETURN p.publisher_name, w.woka_title, a.author_name, l.language_name;

I borrowed from other solutions and I changed the params as following in the file conf/neo4j.properties (server restarted after that):
neostore.nodestore.db.mapped_memory=512M
neostore.relationshipstore.db.mapped_memory=512M
neostore.propertystore.db.mapped_memory=512M
neostore.propertystore.db.strings.mapped_memory=512M
neostore.propertystore.db.arrays.mapped_memory=512M
The results are now returning, in average, 300% faster narrowing the time from 90 seconds to 30 seconds and I am not getting disconnected anymore from the web interface to the database.

Related

Issue in GraphAware TimeTree - bad insertions for minute resolution

EDIT: Solved problem.
TL;DR: TimeTree wants milliseconds since epoch. I was using seconds since epoch as my time values.
Versions:
Neo4j community : 3.0.3
GraphAware / TimeTree community server plugins : 3.0.3.39
I recently started using a time tree to search my graph by time ranges. I noticed some funny behavior the other day when I made a query like this:
"
WITH ({start:1350542000,end:1350543000}) as tr
CALL ga.timetree.events.range(tr) YIELD node as n
RETURN n
LIMIT 5;
"
Note that the time range here is only 1000 seconds apart. What was strange is that my return nodes (which are all of the same type) looked like this:
Node[343421]{gtype:1,bbox:[121.01454162597656,20.602155685424805,121.01454162597656,20.602155685424805],meta:"KAOU_20110613v20001_1422",time:1308026580,lat:20.602155685424805,lon:121.01454162597656}
Specifically, note that the value time:1308026580 is NOT within the bounds I provided. Now, I made this example up (because the query takes forever to run right now), but I was getting similar results the last time I ran the query.
So I investigated a little. First off, this is how I insert my data into the TimeTree:
MATCH (r:record {meta:"KAOU_20110613v20001_1422"})
WITH r
CALL ga.timetree.events.attach({node: r, time: r.time, relationshipType: "observedOn", resolution:"Minute"})
YIELD node
RETURN node.meta;
Notice the resolution: "Minute". When I first wrote this query as a function, I forgot to specify the resolution. So when I added about 4-5 records with this method, the resolution defaulted to "Day".
I didn't think this was an issue, so I just left these records in the graph at "Day" resolution and everything following would be at resolution "Minute".
So I decided to check out the graph using the Neo4J browser to see if anything weird is going on. From here, I executed the following query:
MATCH p=(:TimeTreeRoot)-[:CHILD*5]-()-[]-(:record) RETURN p LIMIT 25;
Ah ha! I noticed that all those records attached to the Minute node are consecutive records in terms of their time value. For example:
KAOU_20110613v20001_0956 has time:1307998620
KAOU_20110613v20001_0957 has time:1307998680
These consecutive records are all 1 minute apart. (i.e. time1 - time2 == 60)
So why are they being added to the same Minute node? I used an epoch time converter to verify that these time stamps are in fact 1 minute apart and represent the dates they are intended to.
I believe this problem is contributing to my performance lag since all my records are globbing up on Minute nodes.
So, either I missed something regarding the time values and how the Time Tree handles them, or something else fishy is going on.
I figured out my problem. Going back through some documentation, I found the following in the Examples section:
"time instant represented by {time} which is a long (the number of milliseconds since 1/1/1970)."
I must have misread this somewhere, and assumed the value was represented in seconds. That explains the behavior I am experiencing perfectly. All I need to do is multiply all my time values by 1000 to get milliseconds instead.

neo4j not creating index on large dataset

I am creating an index on a very large (8.2M node, 63M property) neo4j db instance.
CREATE INDEX ON :Article(lowerTitle)
It takes a negligible amount of time to issue the command, and the index (presumably) begins to process.
I have a max java heap of 100GB, and 40 cores (it's a large server). It is, however stupidly, a HDD.
Right after issuing the index command, my core usage spikes up to very efficient usage. After about 20 seconds, it drops to using almost no processor power, but about 90% of MEM.
I have left it running for 3 hours, and the index is still not created (or at least, there are no improvements for simple MATCH queries on single parameters, which average out at about 16 seconds).
MATCH (arti {lowerTitle: "quantum mechanics"}) RETURN arti
Is this reasonable? What is taking so long? Am I doing something wrong?
NOTE: I have also noticed that my total database size (38.02GB) has not increased over the 3 hours
For verifying that your index is online, issue the :schema command in the browser.
You should see your index status.
ONLINE means OK
POPULATING means it is still populating the indices
FAILED means, well, failed
Your query will never run fast, because you are not using a label, so no indices will be used, change it to :
MATCH (arti:Article {lowerTitle: "quantum mechanics"}) RETURN arti

How to explain the performance of Cypher's LOAD CSV clause?

I'm using Cypher's LOAD CSV syntax in Neo4J 2.1.2. So far it's been a huge improvement over the more manual ETL process required in previous versions. But I'm running into some behavior in a single case that's not what I'd expect and I wonder if I'm missing something.
The cypher query being used is this:
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/James/Desktop/import/dependency_sets_short.csv' AS row
MATCH (s:Sense {uid: toInt(row[4])})
MERGE (ds:DependencySet {label: row[2]}) ON CREATE SET ds.optional=(row[3] = 't')
CREATE (s)-[:has]->(ds)
Here's a couple of lines of the CSV:
227303,1,TO-PURPOSE-NOMINAL,t,73830
334471,1,AT-LOCATION,t,92048
334470,1,AT-TIME,t,92048
334469,1,ON-LOCATION,t,92048
227302,1,TO-PURPOSE-INFINITIVE,t,73830
116008,1,TO-LOCATION,t,68204
116007,1,IN-LOCATION,t,68204
227301,1,TO-LOCATION,t,73830
334468,1,ON-DATE,t,92048
116006,1,AT-LOCATION,t,68204
334467,1,WITH-ASSOCIATE,t,92048
Basically, I'm matching a Sense node (previously imported) based on it's ID value which is the fifth column. Then I'm doing a merge to either get a DependencySet node if it exists, or create it. Finally, I'm creating a has edge between the Sense node and the DependencySet node. So far so good, this all works as expected. What's confusing is the performance as the size of the CSV grows.
CSV Lines Time (msec)
------------------------------
500 480
1000 717
2000 1110
5000 1521
10000 2111
50000 4794
100000 5907
200000 12302
300000 35494
400000 Java heap space error
My expectation is that growth would be more-or-less linear, particularly as I'm committing every 500 lines as recommended by the manual, but it's actually closer to polynomial:
What's worse is that somewhere between 300k and 400k rows, it runs into a Java heap space error. Based on the trend from previous imports, I'd expect the import of 400k to take a bit over a minute. Instead, it churns away for about 5-7 minutes before running into the heap space error. It seems like I could split this file into 300,000-line chunks, but isn't that what "USING PERIODIC COMMIT" is supposed to do, more or less? I suppose I could give Neo4J more memory too, but again, it's not clear why I should have to in this scenario.
Also, to be clear, the lookups on both Sense.uid and DependencySet.label are indexed, so the lookup penalty for these should be pretty small. Here's a snippet from the schema:
Indexes
ON :DependencySet(label) ONLINE (for uniqueness constraint)
ON :Sense(uid) ONLINE (for uniqueness constraint)
Any explanations or thoughts on an alternative approach would be appreciated.
EDIT: The problem definitely seems to be in the MATCH and/or CREATE part of the query. If I remove lines 3 and 5 from the Cypher query it performs fine.
I assume that you've already created all the Sense labeled nodes before running this LOAD CSV import. What I think is going on is that as you are matching nodes with the label Sense into memory and creating relationships from the DependencySet to the Sense node via CREATE (s)-[:HAS]->(ds) you are increasing utilization of the available heap.
Another possibility is that the size of your relationship store in your memory mapped settings needs to be increased. In your scenario it looks like the Sense nodes have a high degree of connectivity to other nodes in the graph. When this happens your relationship store for those nodes require more memory. Eventually when you hit 400k nodes the heap is maxed out. Up until that point it needs to do more garbage collection and reads from disk.
Michael Hunger put together an excellent blog post on memory mapped settings for fast LOAD CSV performance. See here: http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
That should resolve your problem. I don't see anything wrong with your query.
i believe the line
MATCH (s:Sense {uid: toInt(row[4])})
makes the time paradigm. somewhere around the 200 000 in the x line of your graph, you have no longer all the Sense nodes in the memory but some of them must be cached to disk. thus all the increase in time is simply re-loading data from cache to memory and vise-versa (otherwise it will be still linear if kept in memory).
maybe if you could post you server memory settings, we could dig deeper.
to the problem of java heap error refer to Kenny's answer

All queries are slow with neo4j

I have written a variety of queries using cypher that take no less than 200ms per query. They're very straightforward, so I'm having trouble identifying where the bottleneck is.
Simple Match with Parameters, 2200ms:
Simple Distinct Match with Parameters, 200ms:
Pathing, 2500ms:
At first I thought the issue was a lack of resources, because I was running neo4j and my application on the same box. While the performance monitor indicated that CPU and memory were largely free'd up and available, I moved the neo4j server to another local box and observed similar latency. Both servers are workstations with fairly new Xeon processors, 12GB memory and SSDs for the data storage. All of the above leads me to believe that the latency isn't due to my hardware. OS is Windows 7.
The graph has less than 200 nodes and less than 200 relationships.
I've attached some queries that I send to neo4j along with the configuration for the server, database, and JVM. No plugins or extensions are loaded.
Pastebin Links:
Database Configuration
Server Configuration
JVM Configuration
[Expanding a bit on a comment I made earlier.]
#TFerrell: Your comments state that "all nodes have labels", and that you tried applying indexes. However, it is not clear if you actually specified the labels in your slow Cypher queries. I noticed from your original question statement that neither of your slower queries actually specified a node label (which presumably should have been "Project").
If your Cypher query does not specify the label for a node, then the DB engine has to test every node, and it also cannot apply an index.
So, please try specifying the correct node label(s) in your slow queries.
Is that the first run or a subsequent run of these queries?
You probably don't have a label on your nodes and no index or unique constraint.
So Neo4j has to scan the whole store for your node pulling everything into memory, loading the properties and checking.
try this:
run until count returns 0:
match (n) where not n:Entity set n:Entity return count(*);
add the constraint
create constraint on (e:Entity) assert e.Id is unique;
run your query again:
match (n:Element {Id:{Id}}) return n
etc.
It seems there is something wrong with the automatic memory mapping calculation when you are on Windows (memory mapping on heap).
I just looked at your messages.log and added up some numbers, so it seems the mmio alone is enough to fill your java heap space (old-gen) leaving no room for the database, caches etc.
Please try to amend that by fixing the mmio config in your conf/neo4j.properties to more sensible values (than the auto-calculation).
For your small store just uncommenting the values starting with #neostore. (i.e. remove the #) should work fine.
Otherwise something like this (fitting for a 3GB heap) for a larger graph (2M nodes, 10M rels, 20M props,10M long strings):
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=250M
neostore.propertystore.db.mapped_memory=250M
neostore.propertystore.db.strings.mapped_memory=250M
neostore.propertystore.db.arrays.mapped_memory=0M
Here are the added numbers:
auto mmio: 134217728 + 134217728 + 536870912 + 536870912 + 1073741824 = 2.3GB
stores sizes: 1073920 + 1073664 + 3221698 + 3221460 + 1073786 = 9MB
JVM max: 3.11 RAM : 13.98 SWAP: 27.97 GB
max heaps: Eden: 1.16, oldgen: 2.33
taken from:
neostore.propertystore.db.strings] brickCount=8 brickSize=134144b mappedMem=134217728b (storeSize=1073920b)
neostore.propertystore.db.arrays] brickCount=8 brickSize=134144b mappedMem=134217728b (storeSize=1073664b)
neostore.propertystore.db] brickCount=6 brickSize=536854b mappedMem=536870912b (storeSize=3221698b)
neostore.relationshipstore.db] brickCount=6 brickSize=536844b mappedMem=536870912b (storeSize=3221460b)
neostore.nodestore.db] brickCount=1 brickSize=1073730b mappedMem=1073741824b (storeSize=1073786b)

Efficient creation of Neo4j relationship indexes

Can you please explain the best way to add relationship indexes to a Neo4j database created using the BatchInserter?
Our database contains about 30 million nodes and about 300 million relationships. If we build this without any indexes then it takes about 10 hours (just calls to BatchInserter.createNode and BatchInserter.createRelationship).
However if we also try to create relationship indexes using LuceneBatchInserterIndexProvider with repeated calls to index.add then the process takes 12 hours to add everything but then gets stuck on indexProvider.shutdown and doesn't complete. The longest I have left it is 3 days. Can you please explain what it is doing at this point? I expected the work to be done during the calls to index.add. What is going on during shutdown that is taking so long?
Our PC has 64GB RAM and we have allocated 40GB to the JVM. During this shutdown step, Windows reports that 99% of the memory is in use (far more than allocated to the JVM) and the computer becomes almost unusable.
The configuration settings I am using are:
neostore.nodestore.db.mapped_memory = 1G
neostore.propertystore.db.mapped_memory = 1G
neostore.propertystore.db.index.mapped_memory = 1M
neostore.propertystore.db.index.keys.mapped_memory = 1M
neostore.propertystore.db.strings.mapped_memory = 1G
neostore.propertystore.db.arrays.mapped_memory = 1M
neostore.relationshipstore.db.mapped_memory = 10G
We've tried changing some of these but it didn't appear to make any difference.
We have also tried adding the relationship indexes as a separate step after first building the database without any indexes. In this case we used GraphDatabaseFactory.newEmbeddedDatabaseBuilder and GraphDatabaseService.index().forRelationships. Doing it this way seems to work although it was estimated that it would take around 6 days to complete. We have tried invoking commit at various different intervals which makes some difference but not significant. Most of the time seems to be spent just iterating over the relationships.
The only thing I can think of that may be abnormal about our data is that the relationships have about 20 properties on them. But even creating an index on just 1 of these properties doesn't work.
The file sizes without any indexes are:
neostore.nodestore.db 400MB
neostore.propertystore.db 100GB
neostore.propertystore.db.strings 2GB
neostore.relationshipstore.db 10GB
Can you please give us some advice on how to get this working either during the BatchInserter process or as a separate step?
We are using version 2.0.1 of the Neo4j jars.
Thanks, Damon

Resources