In chapter 6 of the O'Reilly book "Graph Databases", which covers how Neo4j stores a graph database, it says:
To understand why native graph processing is so much more efficient
than graphs based on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships.
To traverse a network of m steps, the cost of the indexed approach, at
O(m log n), dwarfs the cost of O(m) for an implementation that uses
index-free adjacency.
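To put illustrative numbers on that (my own back-of-the-envelope, not the book's): with n = 10^9 nodes, log2(n) is about 30, so a traversal of m = 5 steps costs on the order of 5 × 30 = 150 index operations with the indexed approach, versus roughly 5 constant-cost hops with index-free adjacency.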
It is then explained that Neo4j achieves this constant-time lookup by storing all nodes and relationships as fixed-size records:
With fixed sized records and pointer-like record IDs, traversals are
implemented simply by chasing pointers around a data structure, which
can be performed at very high speed. To traverse a particular
relationship from one node to another, the database performs several
cheap ID computations (these computations are much cheaper than
searching global indexes, as we’d have to do if faking a graph in a
non-graph native database)
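As a rough sketch of what those "cheap ID computations" amount to (my illustration; the record size and layout here are invented, not Neo4j's actual store format): with fixed-size records, a record's byte offset is a single multiplication away from its ID, so no index has to be searched:

import java.nio.ByteBuffer;

public class FixedRecordStore {
    // Hypothetical fixed record size; Neo4j's real node and relationship
    // records have their own, version-specific sizes.
    private static final int RECORD_SIZE = 15;
    private final ByteBuffer store; // the store file, e.g. memory-mapped

    public FixedRecordStore(ByteBuffer store) {
        this.store = store;
    }

    // O(1) lookup: offset = id * RECORD_SIZE. (A real store would also
    // handle offsets beyond 2^31, which this sketch ignores.)
    public byte[] readRecord(long id) {
        byte[] record = new byte[RECORD_SIZE];
        store.position((int) (id * RECORD_SIZE));
        store.get(record);
        return record;
    }
}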
The last sentence of that quote triggers my question: how does Titan, which uses Cassandra or HBase as a storage backend, achieve these performance gains, or make up for their absence?
Neo4j only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Neo4j is slow because of pointer chasing on disk (they have a poor disk representation).
Titan only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Titan is faster than Neo4j because it has a better disk representation.
Please see the following blog post that explains the above quantitatively:
http://thinkaurelius.com/2013/11/24/boutique-graph-data-with-titan/
Thus, it's important to understand, when people say O(1), which part of the memory hierarchy they are in. When you are in a single JVM (a single machine), it's easy to be fast, as both Neo4j and Titan demonstrate with their respective caching engines. When you can't put the entire graph in memory, you have to rely on intelligent disk layouts, distributed caches, and the like.
Please see the following two blog posts for more information:
http://thinkaurelius.com/2013/11/01/a-letter-regarding-native-graph-databases/
http://thinkaurelius.com/2013/07/22/scalable-graph-computing-der-gekrummte-graph/
OrientDB uses a similar approach, where relationships are managed without indexes (index-free adjacency) but rather with direct pointers (LINKs) between vertices. It's like in-memory pointers, but on disk. In this way OrientDB achieves O(1) traversal both in memory and on disk.
But if you have a "City" vertex with thousands of edges to "Person" vertices, and you're looking for all the people with age > 18, then OrientDB uses indexes, because a query is involved, so in this case it's O(log N).
Related
In a general sense, is there a best practice to use when attempting to estimate how long setting relationships takes in Neo4j?
For example, I used the data import tool successfully, and here's what I've got in my 2.24GB database:
IMPORT DONE in 3m 8s 791ms. Imported:
7432663 nodes
0 relationships
119743432 properties
In preparation for setting relationships, I set some indices:
CREATE INDEX ON :Player(player_id);
CREATE INDEX ON :Matches(player_id);
Then I let it rip:
MATCH (p:Player),(m:Matches)
WHERE p.player_id = m.player_id
CREATE (p)-[r:HAD_MATCH]->(m)
Then I started to realize that I have no idea how to even estimate how long setting these relationships might take. Is there a 'back of the envelope' calculation for determining at least a ballpark figure for this kind of thing?
I understand that everyone's situation is different on all levels, including software, hardware, and desired schema. But any discussion would no doubt be useful, and would deepen my understanding (and that of anyone else who reads this).
PS: FWIW, I'm running Ubuntu 14.04 with 16GB RAM and an Intel Core i7-3630QM CPU @ 2.40GHz
The problem here is that you don't take transaction sizes into account. In your example, all :HAD_MATCH relationships are created in one single large transaction. A transaction builds up in memory first and then gets flushed to disk. If the transaction is too large to fit in your heap, you might see massive performance degradation due to garbage collection, or even OutOfMemoryErrors.
Typically you want to limit transaction sizes to, e.g., 10k-100k atomic operations.
Probably the easiest way to do transaction batching in this case is the rock_n_roll procedure from neo4j-apoc. It uses one Cypher statement to provide the data to be worked on, and a second one that runs, in batched mode, for each result of the first. Note that APOC requires Neo4j 3.x (and in current APOC releases this procedure has been superseded by apoc.periodic.iterate):
CALL apoc.periodic.rock_n_roll(
"MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
"WITH {p} AS p, {m} AS m CREATE (p)-[:HAD_MATCH]->(m)",
20000)
There was a bug in 3.0.0 and 3.0.1 that caused this to perform rather badly, so the above is for Neo4j >= 3.0.2.
If you are on 3.0.0 / 3.0.1, use this as a workaround:
CALL apoc.periodic.rock_n_roll(
"MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
"CYPHER planner=rule WITH {p} AS p, {m} AS m CREATE (p)-[:HAD_MATCH]->(m)",
20000)
I'm using Cypher's LOAD CSV syntax in Neo4j 2.1.2. So far it's been a huge improvement over the more manual ETL process required in previous versions. But I'm running into some behavior in a single case that's not what I'd expect, and I wonder if I'm missing something.
The cypher query being used is this:
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/James/Desktop/import/dependency_sets_short.csv' AS row
MATCH (s:Sense {uid: toInt(row[4])})
MERGE (ds:DependencySet {label: row[2]}) ON CREATE SET ds.optional=(row[3] = 't')
CREATE (s)-[:has]->(ds)
Here are a couple of lines from the CSV:
227303,1,TO-PURPOSE-NOMINAL,t,73830
334471,1,AT-LOCATION,t,92048
334470,1,AT-TIME,t,92048
334469,1,ON-LOCATION,t,92048
227302,1,TO-PURPOSE-INFINITIVE,t,73830
116008,1,TO-LOCATION,t,68204
116007,1,IN-LOCATION,t,68204
227301,1,TO-LOCATION,t,73830
334468,1,ON-DATE,t,92048
116006,1,AT-LOCATION,t,68204
334467,1,WITH-ASSOCIATE,t,92048
Basically, I'm matching a Sense node (previously imported) based on its ID value, which is the fifth column. Then I'm doing a merge to either get a DependencySet node if it exists, or create it. Finally, I'm creating a has edge between the Sense node and the DependencySet node. So far so good; this all works as expected. What's confusing is the performance as the size of the CSV grows.
CSV Lines Time (msec)
------------------------------
500 480
1000 717
2000 1110
5000 1521
10000 2111
50000 4794
100000 5907
200000 12302
300000 35494
400000 Java heap space error
My expectation was that growth would be more-or-less linear, particularly as I'm committing every 500 lines as recommended by the manual, but the timings above are actually closer to polynomial.
What's worse is that somewhere between 300k and 400k rows, it runs into a Java heap space error. Based on the trend from previous imports, I'd expect the import of 400k to take a bit over a minute. Instead, it churns away for about 5-7 minutes before running into the heap space error. It seems like I could split this file into 300,000-line chunks, but isn't that what "USING PERIODIC COMMIT" is supposed to do, more or less? I suppose I could give Neo4J more memory too, but again, it's not clear why I should have to in this scenario.
Also, to be clear, the lookups on both Sense.uid and DependencySet.label are indexed, so the lookup penalty for these should be pretty small. Here's a snippet from the schema:
Indexes
ON :DependencySet(label) ONLINE (for uniqueness constraint)
ON :Sense(uid) ONLINE (for uniqueness constraint)
Any explanations or thoughts on an alternative approach would be appreciated.
EDIT: The problem definitely seems to be in the MATCH and/or CREATE part of the query. If I remove lines 3 and 5 from the Cypher query it performs fine.
I assume that you've already created all the Sense-labeled nodes before running this LOAD CSV import. What I think is going on is that as you match nodes with the label Sense into memory and create relationships from the DependencySet to the Sense node via CREATE (s)-[:has]->(ds), you are increasing utilization of the available heap.
Another possibility is that the size of your relationship store in your memory-mapped settings needs to be increased. In your scenario it looks like the Sense nodes have a high degree of connectivity to other nodes in the graph. When this happens, the relationship store for those nodes requires more memory. Eventually, around 400k rows, the heap is maxed out; up until that point it needs to do more garbage collection and more reads from disk.
Michael Hunger put together an excellent blog post on memory mapped settings for fast LOAD CSV performance. See here: http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
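For reference, the Neo4j 2.x settings that post tunes live in conf/neo4j.properties and look something like this (the values below are illustrative; size them relative to your actual store files, as the post describes):

neostore.nodestore.db.mapped_memory=512M
neostore.relationshipstore.db.mapped_memory=2G
neostore.propertystore.db.mapped_memory=1G
neostore.propertystore.db.strings.mapped_memory=512M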
That should resolve your problem. I don't see anything wrong with your query.
I believe the line
MATCH (s:Sense {uid: toInt(row[4])})
accounts for the timing curve. Somewhere around the 200,000 mark on the x-axis of your graph, you no longer have all the Sense nodes in memory; some of them must be evicted to disk. From then on, the increase in time is simply the cost of re-loading data from the disk cache into memory and vice versa (it would stay linear if everything were kept in memory).
Maybe if you could post your server memory settings, we could dig deeper.
For the Java heap error, refer to Kenny's answer.
I wonder why Neo4j has a capacity limit on nodes and relationships. The limit on nodes and relationships is 2^35, which is a "little" bit more than the "normal" 2^32 of an integer. Common SQL databases, for example MySQL, store their primary keys as int (2^32) or bigint (2^64). Can you explain the advantages of this decision? In my opinion this is a key decision point when choosing a database.
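(For scale: 2^35 = 34,359,738,368, i.e. about 34 billion IDs, versus 2^32 ≈ 4.3 billion for a 32-bit integer.)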
It is an artificial limit. They are going to remove it in the not-too-distant future, although I haven't heard any official ETA.
Often enough, you run into hardware limits on a single machine before you actually hit this limit.
The current option is to manually shard your graphs to different machines. Not ideal for some use cases, but it works in other cases. In the future they'll have a way to shard data automatically--no ETA on that either.
Update:
I've learned a bit more about Neo4j's storage internals. The reason the limits are exactly what they are is that the ID numbers are stored on disk as pointers in several places (node records, relationship records, etc.), and the records are already packed as tightly as they will go. Raising the limit would therefore mean spending at least one more byte per node and one more byte per relationship on disk. Learn more at this great blog post:
http://digitalstain.blogspot.com/2010/10/neo4j-internals-file-storage.html
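To make the packing idea concrete, here is a toy sketch (my own illustration; Neo4j's real record layout differs): a 35-bit ID fits in a 4-byte int plus 3 high bits tucked into spare bits of an existing flags byte, so any larger ID space would force every record to grow by at least a byte:

import java.nio.ByteBuffer;

public final class PackedId {
    // Store a 35-bit ID: low 32 bits in an int field, high 3 bits in
    // otherwise-unused bits of a flags byte (offsets are hypothetical).
    static void write(ByteBuffer record, long id) {
        record.put(0, (byte) ((id >>> 32) & 0x7)); // high 3 bits
        record.putInt(1, (int) id);                // low 32 bits
    }

    static long read(ByteBuffer record) {
        long high = record.get(0) & 0x7L;
        long low = record.getInt(1) & 0xFFFFFFFFL;
        return (high << 32) | low;
    }
}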
Update 2:
I've heard that in 2.1 they'll be increasing these limits to around another order of magnitude higher than they currently are.
As of Neo4j 3.0, all of these constraints are removed.
Dynamic pointer compression expands Neo4j’s available address space as needed, making it possible to store graphs of any size. That’s right: no more 34 billion node limits!
For more information visit http://neo4j.com/blog/neo4j-3-0-massive-scale-developer-productivity.
I'm learning about fractal tree indices such as that found in TokuDB. I'm fascinated by the strategy it uses to make writes fast: writing to CPU cache most of the time, and only rarely writing out to slower RAM. However, a fractal tree index does eventually have to do big writes out to RAM, then giant writes out to disk, then utterly huge rewrites entirely on disk. It is here that I get confused. Can the fractal tree index do this efficiently? More efficiently, say, than a B-tree can update the disk in a worst-case update? Also, what effect does a giant on-disk rewrite have on lookup time for that data? And, vice versa, what effect do several lookups on that data have on the giant rewrite in progress?
As context for answering this, you should know:
Everything I learned about fractal tree indices I learned in this slide presentation
I don't have a good mental model for how a spinning medium hard drive works.
When I say "giant rewrite", basically what happens is that you have two sorted arrays of the same length (of size 2^largeNumber) and you write them to a single array (of size 2^(largeNumber+1)) which is sorted.
I suggest you watch my video at http://www.youtube.com/watch?v=88NaRUdoWZM which may give you a better understanding of how Fractal Tree Indexes work. When the indexes do not fit in main memory, a fractal tree index is able to buffer large groups of messages which slowly push down the tree as the buffers overflow. When they eventually make it to a leaf node there is a single IO to retrieve the leaf and apply all the messages. Fractal Tree Indexes do significantly less write IO as they aggregate many operations across a single IO and writes are highly compressed. Read IO is also significantly lessened as it is reading highly compressed data.
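If it helps, here is a toy sketch of that buffering idea (illustrative only, nothing like TokuDB's actual code): each internal node holds a buffer of messages, and when the buffer overflows, the whole batch moves one level down in a single flush, so many logical updates end up sharing one IO:

import java.util.ArrayList;
import java.util.List;

class BufferedNode {
    final List<String> buffer = new ArrayList<>();
    final int capacity;
    final BufferedNode child; // null means this node is a leaf

    BufferedNode(int capacity, BufferedNode child) {
        this.capacity = capacity;
        this.child = child;
    }

    void insert(String message) {
        buffer.add(message);
        if (buffer.size() >= capacity) {
            flush(); // one flush pushes the whole batch down
        }
    }

    void flush() {
        if (child == null) {
            applyBatchToLeaf(buffer); // a single IO applies every message
        } else {
            for (String m : buffer) {
                child.insert(m);
            }
        }
        buffer.clear();
    }

    void applyBatchToLeaf(List<String> batch) {
        // In a real index this is the one read-modify-write of the leaf.
    }
}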
I'm not sure if this fully answers your questions, but hopefully it helps.
I see that the Erlang Efficiency User's Guide, Section 5.3, recommends leaving a non-flat list as it is when it's used as an iolist, because the penalty for not flattening is smaller than the penalty for flattening. Is there any quantitative example of the speed difference?
When a deep list contains n elements, then performing lists:flatten on it will require Θ(n) time, and worse, Θ(n) memory allocations. How slow that is on your machine is a function of many variables; measure and ye shall know.
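As a rough worked example (assuming the usual two-words-per-cons-cell representation on a 64-bit emulator): flattening a deep list of one million elements builds about one million new cons cells, roughly 16 MB of allocation, before a single byte is written. Handing the deep list straight to the port as an iolist costs no extra allocation at all, since it is traversed in place.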