I've been experimenting with Neo4j for a while now. I've been inserting data from the medical terminology database SNOMED in order to experiment with a large dataset.
While experimenting, I have repeatedly inserted and then deleted around 450,000 nodes, and I'm finding Neo4j's performance somewhat unsatisfactory. For example, I have just removed all nodes from the database. Running the query:
MATCH (n) RETURN count(n)
takes 13085 ms to return 0 nodes.
I'm struggling to understand why it might take this long to count 0 things. Does Neo4j retain some memory of deleted nodes? Is it in some way hobbled by the fact that I have inserted and removed large numbers of nodes in the past? Might it perform better if I deleted the data directory instead of deleting all nodes with Cypher?
Or is there some twiddling with memory allocation and so forth that might help?
I'm running it on an old-ish laptop running Linux Mint.
It's partly due to Neo4j's store format. Creating a new node or relationship assigns it an id, where ids are actual offsets into the store files. Deleting a node or relationship merely marks that record as not in use in the store file; the file does not shrink. Counting all nodes is done by scanning the node store file and spotting the records that are in use, so the scan still walks over every deleted record, which is why counting 0 nodes can take seconds after deleting 450,000 of them.
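A quick way to observe this from Cypher (a minimal sketch; the :Probe label is just an illustration, and the exact id values depend on your store's history):

// Fresh nodes get ids that are offsets into the node store file:
CREATE (a:Probe), (b:Probe), (c:Probe)
RETURN id(a), id(b), id(c);

// Deleting them only marks the records as not in use; the store file keeps
// its size, so MATCH (n) RETURN count(n) still scans past the dead records.
MATCH (p:Probe) DELETE p;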
Related
I am experimenting with querying Neo4j data whose size gradually increases. Some of my query expressions behave as I would expect: a larger graph means a longer execution time. But, for example, the query
MATCH (`g`:Graph)
MATCH (`g`)<-[:`Graph`]-(person)-[:birthPlace {Value: "http://persons/data/birthPlace"}]->(place)-[:`Graph`]->(`g`)
WITH count(person.Value) AS persons, place
WHERE persons > 5
RETURN place.Value AS place, persons
ORDER BY persons
has these execution times (in ms):
Experiment: | 1    | 2     | 3     | 4      | 5   | 6    | 7    | 8    | 9   | 10  |
Time (ms):  | 80.2 | 208.1 | 301.7 | 399.23 | 0.1 | 2.07 | 2.61 | 2.81 | 7.3 | 1.5 |
How can the dramatic speed-up from the fifth experiment onward be explained? The data are the same, just extended; no indexes were created.
The data size in the 4th experiment:
201673 nodes,
601189 relationships,
859225 properties.
The data size in the 5th experiment:
242113 nodes,
741500 relationships,
1047060 properties.
All I can think of is that maybe Cypher starts using some indexes above a certain data size, but I can't find any documentation saying whether that's the case.
Thank you for any comments.
Neo4j cache management may explain your observations. Could you explain what you are doing more precisely? What version of Neo4j are you using? What is the Graph node? Are you repeatedly running the same query on the graph and then trying again with a larger or smaller graph?
If you are running the same query multiple times on the same data set and seeing faster execution times, then the cache may be the reason. In v3.5 and earlier it would "warm up" with repeated execution. Reportedly this does not occur in v4.x.
You might look at cold start or these tips. You might also check your transaction logs; are they accumulating large files?
Why the backticks around your node identifiers (`g`)? Just use (g:Graph) and [r:Graph]; no backticks needed.
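For reference, the same query from the question with the backticks removed (otherwise unchanged):

MATCH (g:Graph)
MATCH (g)<-[:Graph]-(person)-[:birthPlace {Value: "http://persons/data/birthPlace"}]->(place)-[:Graph]->(g)
WITH count(person.Value) AS persons, place
WHERE persons > 5
RETURN place.Value AS place, persons
ORDER BY persons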
I'm trying to improve a fraud detection system for a commerce website. We deal with direct bank transactions, so fraud is a risk we need to manage. I recently learned of graph databases and can see how they apply to these problems. So, over the past couple of days, I set up Neo4j and parsed our data into it.
My intuition was to create a node for each order, and a node for each piece of data associated with it, and then connect them all together. Like this:
MATCH (w:Wallet),(i:Ip),(e:Email),(o:Order)
WHERE w.wallet="ex" AND i.ip="ex" AND e.email="ex" AND o.refcode="ex"
CREATE (w)-[:USED]->(o),(i)-[:USED]->(o),(e)-[:USED]->(o)
But this query runs very slowly as the database size increases (I assume because it needs to search the whole data set for the nodes I'm asking for). It also takes a long time to run a query like this:
START a=node(179)
MATCH (a)-[:USED*]-(d)
WHERE EXISTS(d.refcode)
RETURN distinct d
This is intended to extract all orders that are connected to a starting point. I'm very new to Cypher (<24 hours), and I'm finding it particularly difficult to search for solutions.
Are there any specific issues with the data structure or queries that I can address to improve performance? It ideally needs to complete this kind of thing within a few seconds, as I'd expect from a SQL database. At this time we have about 17,000 nodes.
It's always a good idea to read through the developer manual in full.
For speeding up lookups of nodes by a property, you definitely need to create indexes or unique constraints (depending on whether the property should be unique for a given label).
Once you've created the indexes and constraints you need, they'll be used under the hood by your query to speed up your matches.
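For example, for the labels and properties in your query (this is the Neo4j 3.x syntax; newer versions use CREATE INDEX FOR (n:Label) ON (n.prop) instead):

// Unique constraint for order reference codes (also backed by an index):
CREATE CONSTRAINT ON (o:Order) ASSERT o.refcode IS UNIQUE;

// Plain indexes for the other lookup properties:
CREATE INDEX ON :Wallet(wallet);
CREATE INDEX ON :Ip(ip);
CREATE INDEX ON :Email(email);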
START is only used for legacy indexes, and for the latest Neo4j versions you should use MATCH instead. If you're matching based upon an internal id, you can use MATCH (n) WHERE id(n) = xxx.
Keep in mind that you should not persist node ids outside of Neo4j for lookup in future queries, as internal node ids can be reused as nodes are deleted and created, so an id that once referred to a node that was deleted may later end up pointing to a completely different node.
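If you need a stable handle on a node across sessions, look it up by one of your own unique properties instead. A sketch, assuming the starting point is itself an Order with a refcode (adjust the label and property to your model):

MATCH (a:Order {refcode: 'ex'})-[:USED*]-(d:Order)
RETURN DISTINCT d;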
Using labels in your queries should help your performance. In the query you gave to find orders, Neo4j must inspect every end node in your path to see if the property exists. Property access tends to be expensive, especially when you're using a variable-length match, so it's better to restrict the nodes you want by label.
MATCH (a)-[:USED*]-(d:Order)
WHERE id(a) = 179
RETURN distinct d
On larger graphs, the variable-length match might start slowing down, so you may get more performance by installing APOC Procedures and using the Path Expander procedure to gather all subgraph nodes and filter down to just Order nodes.
MATCH (a)
WHERE id(a) = 179
CALL apoc.path.expandConfig(a, {bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
WITH LAST(NODES(path)) AS d
WHERE d:Order
RETURN d
I have a 50K-node graph with 10 properties per node. Every node has the same label but different property values. Each of the properties is indexed, and I have increased the heap and page cache sizes for the database. However, using the browser console, creating the nodes takes 6 minutes!
A query for all the properties also takes a very long time (~2 minutes) to appear in the browser console, yet when the results do appear, the status bar at the bottom of the browser says the result of 50K node properties took only 2500 ms.
How do I improve the performance of importing and querying tens of thousands of unique instances of a single node type, with 10 properties each and no relationships?
It takes time to update 10 different indexes for each node that you create. Do you really have use cases that require an index for every single property? If not, get rid of the indexes you do not need. Remember, indexes can speed up finding the first node(s) to initiate a query, but they do not help at all when traversing paths through a graph.
If you really need all 10 indexes, then to speed up the importing step, you can: drop all the indexes, import all 50K nodes, and then create each index one at a time (which will take some time for each index). The overall time will be about the same, but the import itself should be much faster.
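A minimal sketch of that workflow (the :Item label, property names, and CSV file name are hypothetical placeholders; index syntax is Neo4j 3.x):

// 1. Drop the indexes before the import:
DROP INDEX ON :Item(prop1);
// ...repeat for the other nine properties...

// 2. Import the 50K nodes in batches:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///items.csv' AS row
CREATE (:Item {prop1: row.prop1, prop2: row.prop2});

// 3. Recreate the indexes afterwards, one at a time:
CREATE INDEX ON :Item(prop1);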
It takes the Neo4j browser a very long time to generate and display the visualization for a very large result (e.g., tens of thousands of nodes). The browser is not intended for viewing that much data at one time.
1) Check that you are running a recent version of Neo4j. Version 3+ optimised the way that properties are stored and indexed.
2) Check how you're running the query. Maybe your query is not optimised or is problematic in some way. Note in particular that each MATCH generates a "row": multiple MATCH clauses will yield the Cartesian product of all matched sets, which could be problematic with large amounts of data; see the sketch below.
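For illustration (the labels are hypothetical), a sketch of the Cartesian product problem:

// Two disconnected MATCH clauses multiply: with 1,000 wallets and 1,000
// emails, this produces 1,000,000 rows before any filtering happens.
MATCH (w:Wallet)
MATCH (e:Email)
RETURN w, e;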
3) Check that each of these properties needs to be attached to a node. Neo4j is optimised for searching for relationships, not for properties.
Consider turning nodes that look like this:
(:Train {
  maxSpeedInKPH: 350,
  fuelType: 'Diesel',
  numberOfEngines: 3
})
to
(t:Train),
(t)-[:USES_FUEL_TYPE]->(:Fuel {type: 'Diesel'}),
(t)-[:HAS_MAX_SPEED]->(:MaxSpeed {value: 350, unit: 'km/h'}),
(t)-[:HAS_ENGINE]->(:Engine),
(t)-[:HAS_ENGINE]->(:Engine),
(t)-[:HAS_ENGINE]->(:Engine)
There is generally a benefit to spinning properties out into relationships, even if their uniqueness is low. If a property has a unique value per node, generally keep it on the node. But if your 50,000 nodes have fewer, say 25,000, unique values for a property, it would probably still be beneficial to spin them out into relationships. This is especially true for integer-typed properties, where you can also add extra "bucket" relationships to provide a form of indexing. In the example above, the max speed was 350; after spinning the property out into a relationship, you could also add a relationship such as [:HAS_MAX_SPEED_ABOVE] pointing to a shared node representing the 300 threshold. This complicates your queries but should make them faster, as sketched below.
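A sketch of the bucket idea, continuing the hypothetical train model above (the :SpeedBucket label and threshold property are illustrative):

// Create a shared "bucket" node and link fast trains to it once:
MERGE (b:SpeedBucket {threshold: 300})
WITH b
MATCH (t:Train)-[:HAS_MAX_SPEED]->(s:MaxSpeed)
WHERE s.value > 300
CREATE (t)-[:HAS_MAX_SPEED_ABOVE]->(b);

// Finding fast trains is then a relationship hop, not a property scan:
MATCH (t:Train)-[:HAS_MAX_SPEED_ABOVE]->(:SpeedBucket {threshold: 300})
RETURN t;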
4) If none of the above apply to you, can't be implemented, or don't help, consider switching to a more traditional relational (SQL) database. That would be a perfect fit for your use case: 50K different nodes (rows) with only 10 different properties (columns) and no relationships (joins).
I have installed APOC Procedures and used CALL apoc.warmup.run().
The result is as follows:

pageSize       8192
nodesPerPage   546
nodesTotal     156255221
nodesLoaded    286182
nodesTime      21
relsPerPage    240
relsTotal      167012639
relsLoaded     695886
relsTime       8
totalTime      30
It looks like the Neo4j server only caches a fraction of the nodes and relationships, but I want it to cache all of them in order to improve query performance.
First of all, for all the data to be cached, you need a page cache large enough to hold it.
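As a rough sketch, you can inspect the current setting from Cypher and then raise dbms.memory.pagecache.size in neo4j.conf until it comfortably exceeds your total store size (setting name as of Neo4j 3.x):

// Inspect the currently configured page cache size:
CALL dbms.listConfig('dbms.memory.pagecache.size');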
Then, the problem is not that Neo4j fails to cache all it can; it's more of a bug in the apoc.warmup.run procedure: it retrieves the number of nodes (resp. relationships) in the database and expects them all to have ids between 1 and that count. However, that's not true if you've had some churn in the DB, like creating nodes and then deleting some of them.
I believe that could be fixed by using another query instead:
MATCH (n) RETURN count(n) AS count, max(id(n)) AS maxId
as profiling shows about the same number of DB hits as there are nodes, and it takes about 650 ms on my machine for 1.4 million nodes.
Update: I've opened an issue on the subject.
Update 2
While the issue with the ids is real, I missed the real reason why the procedure reports reading far fewer nodes: it only reads one node per page (assuming they're stored sequentially), since it's the pages that are cached. With the current values, that means reading one node every 546 nodes. It happens that 156255221 ÷ 546 = 286181, and with node 0 that makes 286182 nodes loaded.
How do nodes without relationships affect performance?
The input stream contains duplicate nodes, and once I've determined that a node is not of interest, I'd like a shorthand way to know that I've already seen it and can disregard it.
If I store one instance of the node in the db without any relationships will it impact performance? Potentially the number of relationship-less nodes is very large.
Usually these don't affect performance: they take up space on disk, but they will not be loaded unless you access them, and since you don't traverse them it doesn't matter much.
I would still skip them. You can do that both with neo4j-import, which has a --skip-duplicate-nodes option, and with LOAD CSV or Cypher in general, where the MERGE clause only creates a new node if it is not already there, as sketched below.
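A minimal sketch of the MERGE approach (the :Seen label and externalId property are hypothetical placeholders for your own identifier):

// MERGE matches an existing node or creates it exactly once:
MERGE (n:Seen {externalId: '12345'})
ON CREATE SET n.firstSeen = timestamp()
RETURN n;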