I have a Neo4j database with 7340 nodes. Each node has a label (neoplasm) and two properties (conceptID and fullySpecifiedName). Autoindexing is enabled on both properties, and I have created schema indexes on neoplasm:conceptID and neoplasm:fullySpecifiedName. The nodes are concepts in a terminology tree. There is a single root node, and the others descend, often via several paths, to a depth of up to 13 levels. From a SQL Server implementation, the hierarchy structure is as follows...
Depth  Relationship Count
    0                   1
    1                  37
    2                 360
    3                1598
    4                3825
    5                6406
    6                7967
    7                7047
    8                4687
    9                2271
   10                 825
   11                 258
   12                  77
   13                   3
I am adding the relationships using a C# program and neo4jclient, which constructs and executes Cypher queries like this one...
MATCH (child:neoplasm), (parent:neoplasm)
WHERE child.conceptID = "448257000" AND parent.conceptID="372095001"
CREATE (child)-[:ISA]->(parent)
Adding the relationships up to level 3 was very fast, and level 4 itself was not bad, but at level 5 things started getting very slow, an average of over 9 seconds per relationship.
The example query above was executed through the http://localhost:7474/browser/ interface and took 12917ms, so the poor execution times are not a feature of the C# code nor the neo4jclient API.
I thought graph databases were supposed to be blindingly fast and that the performance was independent of size.
So far I have added just 9033 out of 35362 relationships. Even if the speed does not degrade further as the number of relationships increases, it will take over three days to add the remainder!
Can anyone suggest why this performance is so bad? Or is write performance of this nature normal, with only read performance being so good? A sample Cypher query to return the parents of a level 5 node returns a list of 23 fullySpecifiedName properties in less time than I can measure with a stopwatch (well under a second)!
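For reference, such a parent query can be as simple as this (a sketch reusing the conceptID from the example above; the exact query used is not reproduced here):
// conceptID value taken from the example query above
MATCH (child:neoplasm)-[:ISA]->(parent:neoplasm)
WHERE child.conceptID = "448257000"
RETURN parent.fullySpecifiedName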
When multiple indexes on labels could apply to the same query, Cypher does not (yet) choose between them to make the query faster. Instead, try giving hints to use them; see http://docs.neo4j.org/chunked/milestone/query-using.html#using-query-using-multiple-index-hints
PROFILE
MATCH (child:neoplasm), (parent:neoplasm)
WHERE child.conceptID = "448257000" AND parent.conceptID="372095001"
USING INDEX child:neoplasm(conceptID)
USING INDEX parent:neoplasm(conceptID)
CREATE (child)-[:ISA]->(parent)
Does that improve things? Also, please post the PROFILE output for better insight.
You said you're using autoindexing. However your query would use schema indexes and not autoindexes. Autoindexes index nodes based on properties and are not tied to labels.
Schema indexes are a new and stunning feature of Neo4j 2.0.
So get rid of the autoindexes and, as Tatham suggested, create schema indexes using:
CREATE INDEX ON :neoplasm(conceptID)
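and, since you query on fullySpecifiedName as well, presumably one for that property too:
CREATE INDEX ON :neoplasm(fullySpecifiedName)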
Even with schema indexes, inserting relationships will become slower as your graph grows, since index operations typically scale at log(n). However, it should be much faster than the times you've observed.
I appear to have found the answer. I restarted the Neo4j database (Neo4j 2.0.0-M06) and got the usual message that Neo4j will be ready in a few seconds. Over half an hour later the status turned green. During that time I was monitoring the process, and it appeared to be rebuilding the Lucene indexes.
I have since tried loading more relationships, and they are now being added at an acceptable rate (~100 ms per relationship).
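As a side note for anyone bulk-loading relationships like this: on Neo4j versions that support UNWIND, batching many pairs into one parameterized statement avoids a round trip per relationship. A minimal sketch (the rows parameter and its childID/parentID fields are illustrative, not from the original program):
// rows = [{childID: "448257000", parentID: "372095001"}, ...] -- illustrative shape
UNWIND $rows AS row
MATCH (child:neoplasm {conceptID: row.childID}), (parent:neoplasm {conceptID: row.parentID})
CREATE (child)-[:ISA]->(parent)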
Thanks for the comments
I am running experiments querying Neo4j data whose size gradually increases. Some of my query expressions behave as I would expect - a larger graph means a longer execution time. But, for example, the query
MATCH (`g`:Graph)
MATCH (`g`)<-[:`Graph`]-(person)-[:birthPlace {Value: "http://persons/data/birthPlace"}]->(place)-[:`Graph`]->(`g`)
WITH count(person.Value) AS persons, place
WHERE persons > 5
RETURN place.Value AS place, persons
ORDER BY persons
has these execution times (in ms):
80.2, 208.1, 301.7, 399.23, 0.1, 2.07, 2.61, 2.81, 7.3, 1.5
How to explain the rapid acceleration from the fifth step? The data are the same, just extended; no indexes were created.
The data size on the 4th experiment:
201673 nodes,
601189 relationships,
859225 properties.
The data size on the 5th experiment:
242113 nodes,
741500 relationships,
1047060 properties.
All I can think of is that maybe Cypher starts using some indexes above a certain data size, but I can't find anything documenting whether that's the case.
Thank you for any comments.
Neo4j cache management may explain your observations, but you should explain what you are doing more precisely. What version of Neo4j are you using? What is the Graph node? Are you repeatedly running the same query on a graph and then trying again with a larger or smaller graph?
If you are running the same query multiple times on the same data set and seeing more rapid execution times, then the cache may be the reason. In v3.5 and earlier, the cache would "warm up" with repeated execution. Putatively this does not occur in v4.x.
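If you want to rule caching in or out, one common trick is to warm the caches explicitly with a query that touches every node and relationship before you start timing; a minimal sketch:
MATCH (n)
OPTIONAL MATCH (n)-[r]->()
RETURN count(n) + count(r)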
You might look at cold-start issues, or at general performance tips. You might also look at your transaction log: is it accumulating large files?
Why the backticks around your node identifiers (`g`)? Just use (g:Graph) and [r:Graph]; no quotes.
I am currently trying to use Neo4j to perform a complex query (similar to a shortest-path search, except that I have unusual conditions on the search, like a minimum path length in terms of the number of nodes traversed).
My dataset contains around 2.5M nodes of a single type and around 1.5 billion edges (also of a single type). Each node has on average 1000 directed relationships to a "next" node.
So far, I have a query that retrieves this shortest path given all of my conditions, but the only way I found to get decent response times (under one second) is to limit the number of results after each new node is added to the path, filter them, order them, and then proceed to the next node (a kind of greedy algorithm, I suppose).
I'd like to limit them far less than I do in order to yield more paths as a result, but the problem is the exponential complexity of this search: going from LIMIT 40 to LIMIT 60 usually means x10 to x100 the processing time.
That said, I am currently evaluating several solutions to speed up the query, but I'm unsure of the results they will yield, as I don't know how Neo4j really stores my data internally.
The solution I am considering is to add a property to my relationships: an integer between 1 and 15, because I will usually query only the relationships that have one or at most two specific values for this property (for example, only relationships where the property is 8 or 9).
As far as I can tell, for each relationship Neo4j currently has to fetch the properties of the source node and use them to apply my filters, which takes a very long time when crossing paths 4 nodes long with 1000 relationships each (I guess O(1000^4)). Am I right?
With relationship properties, will it have direct access to them without further data fetching? Is there any chance this will make my queries faster? How are Neo4j edge properties stored?
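To make the idea concrete, the kind of filter I have in mind would look something like this (a sketch; the NEXT relationship type and bucket property are placeholder names):
// NEXT and bucket are placeholder names for the single edge type and the new property
MATCH (a)-[r:NEXT]->(b)
WHERE r.bucket IN [8, 9]
RETURN b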
UPDATE
Following @logisima's advice, I've written a procedure directly with the Java traversal API of Neo4j. I then switched to the raw Java procedure API of Neo4j to get even more power and flexibility, as my use case required it.
The results are really good: the lower-bound complexity is overall a little lower than it was before, but the upper bound is about ten times faster, and when at least some of the nodes used for the traversal are already in Neo4j's cache, the performance becomes astonishing (depth 20 in less than a second in one of my tests, when I usually only need depth 4).
But that's not all. Procedures are very easily customisable while keeping performance at its best, with every single operation optimized. The result is that I can use far more powerful filters in far less computing time, and I can easily update my procedure to add new features. Last but not least, procedures are very easy to plug into spring-data for Neo4j (which I use to connect Neo4j to my HTTP API). With Cypher, I would have had to auto-generate the queries (they are so complex that it took about 30 Java classes to do it properly) and I would have had to use JDBC for Neo4j, handling a separate connection pool just for this request. I cannot recommend the awesome Neo4j Java API enough.
Thanks again @logisima
If you're trying to implement a custom shortest-path algorithm, then you should write a Cypher procedure using the traversal API.
The principle of Cypher is pattern matching, whereas you want to traverse the graph in a specific way to find your solution.
The response time should be much faster for your use case!
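Once compiled and deployed, such a procedure is called from Cypher like any other; a sketch, where the procedure name and signature are entirely hypothetical:
// custom.constrainedShortestPath is a hypothetical procedure name
CALL custom.constrainedShortestPath($startId, $endId, 4)
YIELD path
RETURN path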
I have a 50K-node graph with 10 properties per node. Every node has the same label but different property values. Each of the properties is indexed, and I have increased the heap and page cache sizes for the database. However, using the browser console, creating the nodes takes 6 minutes!
Also, a query for all the properties takes a very long time (~2 minutes) to appear in the browser console, yet when the results do appear, the bottom of the browser says that returning the 50K nodes' properties took only 2500 ms.
How do I improve the performance of importing/querying tens of thousands of unique instances of a single node label, with 10 properties each and no relationships?
It takes time to update 10 different indexes for each node that you create. Do you really have use cases that require an index for every single property? If not, get rid of the indexes you do not need. Remember, indexes can speed up finding the first node(s) to initiate a query, but they do not help at all when traversing paths through a graph.
If you really need all 10 indexes, then to speed up the importing step, you can: drop all the indexes, import all 50K nodes, and then create each index one at a time (which will take some time for each index). The overall time will be about the same, but the import itself should be much faster.
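A sketch of that drop-then-recreate flow, with placeholder label and property names:
// :Item(prop1) is a placeholder; repeat for each of the 10 indexes
DROP INDEX ON :Item(prop1)
// ...import the 50K nodes here...
CREATE INDEX ON :Item(prop1)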
It takes the neo4j browser a very long time to generate and display the visualization for a very large result (e.g., 10's of thousands of nodes). The browser is not intended for viewing that much data at one time.
1) Check that you are running a recent version of Neo4j. Version 3+ optimised the way that properties are stored and indexed.
2) Check how you're running the query. Maybe your query is not optimised or is problematic in some way. Note in particular that each MATCH generates a "row": multiple MATCH clauses will yield the Cartesian product of all matched sets, which can be problematic with large amounts of data.
3) Check that each of these properties needs to be attached to a node. Neo4j is optimised for searching for relationships, not for properties.
Consider turning nodes that look like this:
(:Train {
maxSpeedInKPH: 350,
fuelType: 'Diesel',
numberOfEngines: 3
})
to
(t:Train),
(t)-[:USES_FUEL_TYPE]->(:Fuel {type: 'Diesel'}),
(t)-[:HAS_MAX_SPEED]->(:MaxSpeed {value: 350, unit: 'km/h'}),
(t)-[:HAS_ENGINE]->(:Engine),
(t)-[:HAS_ENGINE]->(:Engine),
(t)-[:HAS_ENGINE]->(:Engine)
There is generally a benefit to spinning properties out into relationships, even if their uniqueness is low. If a property has a unique value per node, it generally belongs on the node. But if your 50,000 nodes have, say, only 25,000 unique values in that property, it would probably still be beneficial to spin them out into relationships. This is especially true of integer-typed properties, where you can also add "bucket relationships" to provide a form of indexing. In the example above, the max speed was 350. After spinning the property out into a relationship, you could also add a relationship of the type [:HAS_MAX_SPEED_ABOVE] to a 300 bucket. This would complicate your querying, but should make it faster, as shown in the sketch below.
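A sketch of that bucket idea, building on the Train example (the SpeedBucket label is a placeholder):
// link every sufficiently fast train to a shared bucket node
MATCH (t:Train)-[:HAS_MAX_SPEED]->(s:MaxSpeed)
WHERE s.value > 300
MERGE (b:SpeedBucket {threshold: 300})
MERGE (t)-[:HAS_MAX_SPEED_ABOVE]->(b)
Queries for fast trains can then follow the bucket relationship instead of comparing speed values on every node:
MATCH (t:Train)-[:HAS_MAX_SPEED_ABOVE]->(:SpeedBucket {threshold: 300})
RETURN t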
4) If none of the above apply, cannot be implemented, or do not help, consider switching to a traditional relational SQL database. That would be a perfect candidate for your use case: 50K different nodes (rows) with only 10 different properties (columns) and no relationships (joins).
I have installed the APOC procedures and ran "CALL apoc.warmup.run".
The result is as follows:
pageSize      8192
nodesPerPage  546
nodesTotal    156255221
nodesLoaded   286182
nodesTime     21
relsPerPage   240
relsTotal     167012639
relsLoaded    695886
relsTime      8
totalTime     30
It looks like the Neo4j server only caches some of the nodes and relationships.
But I want it to cache all the nodes and relationships in order to improve query performance.
First of all, for all data to be cached, you need a page cache large enough.
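For example, the page cache is sized in neo4j.conf; the value below is illustrative and should be at least the total size of your store files:
# neo4j.conf -- 8g is an illustrative value, not a recommendation
dbms.memory.pagecache.size=8g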
Then, the problem is not that Neo4j fails to cache all it can; it's more of a bug in the apoc.warmup.run procedure: it retrieves the number of nodes (resp. relationships) in the database and expects them all to have ids between 1 and that number. However, that's not true if you've had some churn in the DB, such as creating more nodes and then deleting some of them.
I believe that could be fixed by using another query instead:
MATCH (n) RETURN count(n) AS count, max(id(n)) AS maxId
since profiling it shows about the same number of DB hits as there are nodes, and it takes about 650 ms on my machine for 1.4 million nodes.
Update: I've opened an issue on the subject.
Update 2
While the issue with the ids is real, I missed the real reason why the procedure reports reading far fewer nodes: it only reads one node per page (assuming they're stored sequentially), since it's the pages that are cached. With the current values, that means trying to read one node every 546 nodes. It happens that 156255221 ÷ 546 = 286181, and with node 0 that makes 286182 nodes loaded.
I'm toying around with Neo4j. My data consists of users who own objects, which are tagged with tags. So my schema looks like:
(:User)-[:OWNS]->(:Object)-[:TAGGED_AS]->(:Tag)
I have written a script that generates a sample graph for me. Currently I have 100 User, ~2500 Tag, and ~10k Object nodes in the database. Between those I have ~700k relationships. I now want to find every Object that is not owned by a certain User but is related via a Tag the User has used himself. The query looks like:
MATCH (user:User {username: 'Cristal'})
WITH user
MATCH (user)-[:OWNS]->(obj:Object)-[:TAGGED_AS]->(tag:Tag)<-[:TAGGED_AS]-(other:Object)
WHERE NOT (user)-[:OWNS]->(other)
RETURN other
LIMIT 20
However, this query runs for ~1-5 minutes (depending on the user and how many objects he owns), which is more than a little too slow. What am I doing wrong? I consider this a rather "trivial" query against a graph of modest size. I'm using Neo4j 2.1.6 Community and have already set the Java heap to 2000 MB (and I can see that there is a Java process using that much). Am I missing an index or something like that? (I'm new to Neo4j.)
I honestly expected the result to be pretty much instant, especially considering that the Neo4j docs mention that I should use a heap between 1 and 4 GB for 100 million objects... and I'm only at about 1/100 of that number.
If it is my query (which I hope and expect), how can I improve it? What do you have to be aware of when writing queries?
Do you have an index on the username property?
CREATE INDEX ON :User(username)
Also you don't really need that WITH there, so maybe drop it to see if it helps:
MATCH (user:User {username: 'Cristal'})-[:OWNS]->(obj:Object)-[:TAGGED_AS]->(tag:Tag)<-[:TAGGED_AS]-(other:Object)
WHERE NOT (user)-[:OWNS]->(other)
RETURN other
LIMIT 20
Also, I don't think it will make a difference, but you can drop the obj and tag variables since you're not using them elsewhere in the query.
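That is, roughly:
MATCH (user:User {username: 'Cristal'})-[:OWNS]->(:Object)-[:TAGGED_AS]->(:Tag)<-[:TAGGED_AS]-(other:Object)
WHERE NOT (user)-[:OWNS]->(other)
RETURN other
LIMIT 20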
Also, if you're generating sample graphs you may want to check out GraphGen:
http://graphgen.neoxygen.io/