I have a Neo4j database with roughly 417 million nodes, 780 million relationships and 2.6 billion properties.
Since creating indexes takes a considerable amount of time, is there any way in Neo4j to trace and monitor the progress of index creation?
In the Neo4j browser, use the command
:SCHEMA
to get information about the indexes, including if they are online or still being built.
Use
:SCHEMA await
to wait for indexes to be built.
I have installed the APOC procedures and ran "CALL apoc.warmup.run".
The result is as follows:
pageSize
8192
nodesPerPage  nodesTotal  nodesLoaded  nodesTime
546           156255221   286182       21
relsPerPage   relsTotal   relsLoaded   relsTime
240           167012639   695886       8
totalTime
30
It looks like the Neo4j server caches only part of the nodes and relationships.
But I want it to cache all the nodes and relationships in order to improve query performance.
First of all, for all data to be cached, you need a page cache large enough.
Then, the problem is not that Neo4j does not cache all it can; it's more of a bug in the apoc.warmup.run procedure: it retrieves the number of nodes (resp. relationships) in the database and expects them all to have ids between 1 and that number of nodes (resp. relationships). However, that does not hold if you've had some churn in the DB, like creating more nodes and then deleting some of them.
I believe that could be fixed by using another query instead:
MATCH (n) RETURN count(n) AS count, max(id(n)) AS maxId
as profiling it shows about the same number of DB hits as the number of nodes, and takes about 650 ms on my machine for 1.4 million nodes.
Update: I've opened an issue on the subject.
Update 2
While the issue with the ids is real, I missed the real reason why the procedure reports reading far fewer nodes: it only reads one node per page (assuming they're stored sequentially), since it's the pages that are cached. With the current values, that means trying to read one node every 546 nodes. It happens that 156255221 ÷ 546 = 286181 (rounded down), and with node 0 that makes 286182 nodes loaded.
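The same arithmetic can be checked for both nodes and relationships. This is a small sketch, assuming the procedure touches ids 0, N, 2N, ... below the reported total, where N is the number of records per page; the function name is hypothetical and this is not the procedure's actual code.

```python
def records_touched(total_records, records_per_page):
    """Count the ids 0, N, 2N, ... that are below total_records."""
    return len(range(0, total_records, records_per_page))


nodes_loaded = records_touched(156255221, 546)
rels_loaded = records_touched(167012639, 240)
print(nodes_loaded, rels_loaded)
```

Under that assumption the two values come out equal to the nodesLoaded and relsLoaded figures reported above (286182 and 695886).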
Our py2neo script ingests abstracts at a rate of about 500,000 a day with Neo4J. For comparison, we ingest 20 million of these abstracts in Solr in one day. We're wondering if this is the expected rate of ingestion for Neo4J or if there is something we can do to increase performance?
We've tried combinations of py2neo version 2 and version 3 and Neo4J Enterprise version 2 and 3. With each combination, the ingestion rate remains about the same. We use batches of 1000 abstracts to increase performance. The abstracts average about 400-500 words, we create 5 additional entities with modest properties then create a relationship between each abstract and the entities. We first ingest the entities and then the relationships (create_unique()) to avoid round trips to the server (no find() or find_one()). We prefer merge() over create() to ensure only one node is created per abstract. We did try create() and the load performance only improved slightly. The bottleneck appears to be on the server side. Our script will create the 1000 transactions quickly, then there is an extended delay during the commit, suggesting any slowdown is from Neo4J server while it processes the transaction.
We require a solution that does not wipe the entire Neo4J database. We intend to ingest multiple data streams in parallel in the future so the DB must remain stable.
We prefer Python over Java and prefer py2neo's merge()/create() based transactions over direct Cypher queries.
We were hoping Bolt would give us better performance, but currently a Bolt transaction hangs indefinitely with py2neo v3 / Neo4J 3.0.0 RC1. We also had one instance of the HTTP transaction hanging as well.
Our Neo4J instances use the default configuration.
Our server is a 2 processor, 12 core, Linux host with 32GB of memory.
Any suggestions on how to increase load performance? It would be grand if we could ingest 20 million abstracts into Neo4J in just a few days.
Our ingestion script shows a transaction rate of 54 entity transactions per second. Note that's 54, not 54K:
$ python3 neo-ingestion-rate.py
Number of batches: 8
Entity transactions per batch: 6144
Merge entities: 2016-04-22 16:31:50.599126
All entities committed: 2016-04-22 16:47:08.480335
Entity transactions per second: 53.5494121750082
Relationship transactions per batch: 5120
Merge unique relationships: 2016-04-22 16:47:08.480408
All relationships committed: 2016-04-22 16:49:38.102694
Number of transactions: 40960
Relationship transactions per second: 273.75593641599323
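The rates above follow from the timestamps. A minimal sketch, assuming each rate is the number of batches times the transactions per batch, divided by the elapsed wall-clock time (the function name is hypothetical and is not taken from the ingestion script):

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def transactions_per_second(start, end, batches, per_batch):
    """Total transactions divided by elapsed seconds between two timestamps."""
    started = datetime.strptime(start, TIMESTAMP_FORMAT)
    finished = datetime.strptime(end, TIMESTAMP_FORMAT)
    return batches * per_batch / (finished - started).total_seconds()


entity_rate = transactions_per_second(
    "2016-04-22 16:31:50.599126", "2016-04-22 16:47:08.480335", 8, 6144
)
relationship_rate = transactions_per_second(
    "2016-04-22 16:47:08.480408", "2016-04-22 16:49:38.102694", 8, 5120
)
print(entity_rate, relationship_rate)
```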
Thanks.
How about loading via neo4j-shell? I do the majority of my work in R and simply script the import.
Here is a blog post where I outline the approach. You could mirror it in Python.
The basic idea is take your data, save it to disk, and load via neo4j-shell where you execute cypher scripts that reference those files.
I have found this approach to be helpful when loading larger sets of data. But of course, it all depends on the density of your data, the data model itself, and having the appropriate indexes established.
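In Python, the "save to disk, then load from a Cypher script" idea might look roughly like the sketch below. The Abstract label, the id and text columns, the file name and the batch size of 1000 are all assumptions for illustration, and the file URL in the script would have to point to wherever the server can read the CSV.

```python
import csv
import os
import tempfile


def write_abstracts_csv(path, abstracts):
    """Write one row per abstract, with a header row that LOAD CSV can use."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "text"])
        writer.writeheader()
        writer.writerows(abstracts)


# Hypothetical Cypher script to run in neo4j-shell once the file is written.
LOAD_SCRIPT = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///abstracts.csv' AS row
MERGE (a:Abstract {id: row.id})
SET a.text = row.text;
"""

csv_path = os.path.join(tempfile.mkdtemp(), "abstracts.csv")
write_abstracts_csv(
    csv_path,
    [{"id": "1", "text": "first abstract"}, {"id": "2", "text": "second abstract"}],
)
print(csv_path)
```

An index or uniqueness constraint on the merged property matters here, since every MERGE has to look up the existing node first.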
This blog post explains how to import data in bulk:
https://neo4j.com/blog/bulk-data-import-neo4j-3-0/
They claim to be able to import ~31M nodes and ~78M relationships in ~3 min.
They just don't mention the machine this is running on, most likely a cluster.
Still, it shows it should be possible to get much much higher ingestion rate than what you observe.
The Python class likely imports one record at a time, when you really want to do bulk inserts.
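If the offline import tool is not an option because the database must stay online, one way to approximate bulk inserts from Python is to send many rows per statement. A minimal sketch, assuming a single parameterized UNWIND statement per batch; the Abstract label, the property names and the batch size are hypothetical, and the $rows parameter syntax applies to newer Neo4j versions (older ones use {rows}).

```python
def chunked(records, size):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


# Hypothetical statement: one round trip merges a whole batch of rows.
BATCH_QUERY = (
    "UNWIND $rows AS row "
    "MERGE (a:Abstract {id: row.id}) "
    "SET a.text = row.text"
)

records = [{"id": i, "text": "abstract %d" % i} for i in range(2500)]
batches = list(chunked(records, 1000))
print([len(batch) for batch in batches])
```

Each batch would then be passed as the rows parameter of BATCH_QUERY through whichever driver is in use.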
I am new to Neo4j but have been working with MySQL for many years. I have now created a database with 700,000 users, 800,000 cookbooks and 1.6M saved recipes in Neo4j.
The structure of the nodes is like this: (:User)-[:CREATED]-(:Cookbook)-[:SAVED]-(:Recipe). All the users and recipes are unique, but one user can have multiple cookbooks and every cookbook can have multiple recipes.
I use an EC2 m3.2xlarge, so it is quite fast. But the performance is very bad. This query:
MATCH (r:Recipe{recipe_id:2987431}) return r;
takes between 300 and 500 ms, whereas MySQL can execute it in around 2 ms.
Is this usual or have I configured the server all wrong?
(I have an index on :Recipe(recipe_id) )
Has your index come online yet? If you run :schema in the console, it should list all of the constraints and indexes and show whether each one has been fully populated and is online and available for use.
I have a Neo4j database with 7340 nodes. Each node has a label (neoplasm) and 2 properties (conceptID and fullySpecifiedName). Autoindexing is enabled on both properties, and I have created a schema index on neoplasm:conceptID and neoplasm:fullySpecifiedName. The nodes are concepts in a terminology tree. There is a single root node and the others descend often via several paths to a depth of up to 13 levels. From a SQL Server implementation, the hierarchy structure is as follows...
Depth Relationship Count
0 1
1 37
2 360
3 1598
4 3825
5 6406
6 7967
7 7047
8 4687
9 2271
10 825
11 258
12 77
13 3
I am adding the relationships using a C# program and neo4jclient, which constructs and executes Cypher queries like this one...
MATCH (child:neoplasm), (parent:neoplasm)
WHERE child.conceptID = "448257000" AND parent.conceptID="372095001"
CREATE (child)-[:ISA]->(parent)
Adding the relationships up to level 3 was very fast, and level 4 itself was not bad, but at level 5 things started getting very slow, an average of over 9 seconds per relationship.
The example query above was executed through the http://localhost:7474/browser/ interface and took 12917ms, so the poor execution times are not a feature of the C# code nor the neo4jclient API.
I thought graph databases were supposed to be blindingly fast and that the performance was independent of size.
So far I have added just 9033 out of 35362 relationships. Even if the speed does not degrade further as the number of relationships increases, it will take over three days to add the remainder!
Can anyone suggest why this performance is so bad? Or is write performance of this nature normal, and is it just read performance that is so good? A sample Cypher query to return the parents of a level 5 node returns a list of 23 fullySpecifiedName properties in less time than I can measure with a stopwatch (well under a second).
When a query could use several indexes on labels at the same time, Cypher does not (yet) choose them on its own to make the query faster. Instead, try giving hints to use them; see http://docs.neo4j.org/chunked/milestone/query-using.html#using-query-using-multiple-index-hints
PROFILE
MATCH (child:neoplasm), (parent:neoplasm)
USING INDEX child:neoplasm(conceptID)
USING INDEX parent:neoplasm(conceptID)
WHERE child.conceptID = "448257000" AND parent.conceptID = "372095001"
CREATE (child)-[:ISA]->(parent)
Does that improve things? Also, please post the PROFILE output for better insight.
You said you're using autoindexing. However, your query would use schema indexes, not autoindexes. Autoindexes index nodes based on properties and are not tied to labels.
Schema indexes are a new and stunning feature of Neo4j 2.0.
So get rid of the autoindexes and, as Tatham suggested, create schema indexes using:
CREATE INDEX ON :neoplasm(conceptID)
Even with schema indexes, inserting relationships will become slower as your graph grows, since index lookups typically scale as O(log n). However, it should be much faster than the times you've observed.
I appear to have found the answer. I restarted the Neo4j database (Neo4j 2.0.0-M06) and got the usual message, "Neo4j will be ready in a few seconds". Over half an hour later the status turned green. During that time I was monitoring the process, and it appeared to be rebuilding the Lucene indexes.
I have since tried loading more relationships and they are now being added at an acceptable rate (~100msec per relationship).
Thanks for the comments
I am new to Neo4j and have set up a test graph database for organizing some clickstream data, using a very small subset of what we actually handle on a day-to-day basis. This graph has about 23 million nodes and 34 million relationships. The queries seem to take forever to run, i.e. I haven't seen a response come back even after waiting for more than 30 minutes.
The data is organized as Year->Month->Day->Session{1..n}->Event{1..n}
I am running the db on a Windows 7 machine with 1.5 GB of heap allocated to the Neo4j server.
These are the configurations in neo4j-wrapper.conf:
wrapper.java.additional.1=-Dorg.neo4j.server.properties=conf/neo4j-server.properties
wrapper.java.additional.2=-Djava.util.logging.config.file=conf/logging.properties
wrapper.java.additional.3=-Dlog4j.configuration=file:conf/log4j.properties
wrapper.java.additional.6=-XX:+UseParNewGC
wrapper.java.additional.7=-XX:+UseConcMarkSweepGC
wrapper.java.additional.8=-Xloggc:data/log/neo4j-gc.log
wrapper.java.initmemory=1500
wrapper.java.maxmemory=1500
This is what my query looks like:
START n=node(3)
MATCH (n)-[:HAS]->(s)
WITH distinct s
MATCH (s)-[:HAS]->(e) WHERE e.page_name = 'Login'
WITH s.session_id as session, e
MATCH (e)-[:FOLLOWEDBY*0..1]->(e1)
WITH count(session) as session_cnt, e.page_name as startPage, e1.page_name as nextPage
RETURN startPage, nextPage, session_cnt
I also have these properties set:
node_auto_indexing=true
node_keys_indexable=name,page_name,geo_country
relationship_auto_indexing=true
Can anyone help me figure out what might be wrong?
Even when I run portions of the query, it takes 10-15 minutes before I see a response.
Note: I have no other applications running on the Windows machine.
Why would you want to return all the nodes in the first place?
If you really want to do that, use the transactional HTTP endpoint and curl to stream the response:
I tested it with a database of 100k nodes. It takes 0.9 seconds to transfer them (1.5MB) over the wire.
If you transfer all their properties by using "return n", it takes 1.4 seconds and results in 4.1MB transferred.
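A rough Python equivalent of such a request, as a sketch only. It assumes the transactional endpoint path used by Neo4j 2.x and 3.x (/db/data/transaction/commit), a server on localhost:7474, and a statement that returns only node ids; the statement actually used in the test above is not known, and the helper name is hypothetical.

```python
import json
import urllib.request


def build_commit_request(base_url, statement):
    """Build a POST request that runs one statement and commits immediately."""
    payload = {"statements": [{"statement": statement}]}
    return urllib.request.Request(
        base_url.rstrip("/") + "/db/data/transaction/commit",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "X-Stream": "true",  # asks the server to stream the result
        },
    )


# Assumed host and statement, for illustration only.
request = build_commit_request("http://localhost:7474", "MATCH (n) RETURN id(n)")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen and reading the response in chunks would avoid holding the whole result in memory on the client side.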
If you just want to know how many nodes are in your db, use something like this instead:
match (n) return count(*);