I wonder why Neo4j has a capacity limit on nodes and relationships. The limit on nodes and relationships is 2^35, which is a "little" bit more than the "normal" 2^32 integer. Common SQL databases, for example MySQL, store their primary keys as int (2^32) or bigint (2^64). Can you explain the advantages of this decision? In my opinion this is a key decision point when choosing a database.
It is an artificial limit. They are going to remove it in the not-too-distant future, although I haven't heard any official ETA.
Often enough, you run into hardware limits on a single machine before you actually hit this limit.
The current option is to manually shard your graphs to different machines. Not ideal for some use cases, but it works in other cases. In the future they'll have a way to shard data automatically--no ETA on that either.
Update:
I've learned a bit more about Neo4j storage internals. The limits are exactly what they are because the ID numbers are stored on disk as pointers in several places (node records, relationship records, etc.). To raise the limit by another power of 2, they'd need to add 1 byte per node and 1 byte per relationship: the IDs are currently packed as tightly as they can be without using more bytes on disk. Learn more at this great blog post:
http://digitalstain.blogspot.com/2010/10/neo4j-internals-file-storage.html
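To illustrate why 2^35 falls out of a packed layout, here is a minimal sketch. It assumes (this is my assumption for illustration, not the exact on-disk format) that the low 32 bits of an ID sit in a 4-byte field and the remaining 3 high bits are tucked into spare bits of a byte the record already has. The function names are hypothetical.

```python
ID_BITS = 35                      # the limit mentioned in the question
LOW_BITS = 32                     # bits stored in a plain 4-byte field
HIGH_BITS = ID_BITS - LOW_BITS    # 3 leftover bits kept in spare bits of another byte

MAX_ID = (1 << ID_BITS) - 1       # largest ID that fits in this layout


def split_id(record_id):
    """Split an ID into (low 32 bits, high 3 bits) for storage."""
    if not 0 <= record_id <= MAX_ID:
        raise ValueError("id does not fit in %d bits" % ID_BITS)
    low = record_id & ((1 << LOW_BITS) - 1)
    high = record_id >> LOW_BITS
    return low, high


def join_id(low, high):
    """Rebuild the ID from the two stored parts."""
    return (high << LOW_BITS) | low
```

Once the spare bits are used up, one more bit of ID space means growing every record, which is the cost described above.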
Update 2:
I've heard that in 2.1 they'll be increasing these limits to around another order of magnitude higher than they currently are.
As of neo4j 3.0, all of these constraints are removed.
Dynamic pointer compression expands Neo4j’s available address space as needed, making it possible to store graphs of any size. That’s right: no more 34 billion node limits!
For more information visit http://neo4j.com/blog/neo4j-3-0-massive-scale-developer-productivity.
Assume I have an unbounded dataset with extremely high cardinality (> 1,000,000,000 unique keys), and I want to count by key over fixed windows.
My understanding is that the combine function will essentially maintain an in-memory accumulator on each machine for each key.
Question 1
Is the above assumption correct, or can workers flush keys and accumulators to disk when under memory pressure?
Question 2 (assuming above correct)
Assuming the data is not naturally partitioned (e.g. reading from Pub/Sub), would we run out of memory on each worker, since every machine may in theory see every key and have to maintain an in-memory structure for each key?
Question 3 (assuming above correct)
Suppose we store the data in Kafka and partition it by the key we are counting on. Assuming one Beam worker reads from one partition, each worker sees only a consistent subset of the keyspace. In this scenario, would the memory use of the workers be any different?
Beam is meant to be highly scalable; there are Beam pipelines that run on Dataflow with many trillions of unique keys.
When running a combining operation in Beam a table of keys and aggregated values is kept in memory, but when the table becomes full it is flushed to disk (well, technically, to shuffle) so it will not run out of memory. Another worker will read this data out of shuffle, one value at a time, to compute the final aggregate over all upstream worker outputs.
As for your other two questions, if your input is naturally partitioned by key such that each worker only sees a subset of keys it is possible that more combining could happen before the shuffle, leading to less data being shuffled, but this is by no means certain and the effects would likely be small. In particular, memory considerations won't change.
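The flush-when-full behaviour described above can be sketched in plain Python. This is only a toy model and not the Beam API: precombine, final_combine and max_table_size are hypothetical names, and the "shuffle" is just a list.

```python
from collections import Counter


def precombine(elements, max_table_size):
    """Worker side: count per key in a bounded in-memory table.

    When a new key arrives and the table is full, the partial counts
    are emitted (standing in for a flush to shuffle) and the table is
    cleared, so memory use never exceeds max_table_size entries.
    """
    table = {}
    for key in elements:
        if key not in table and len(table) >= max_table_size:
            yield from table.items()
            table.clear()
        table[key] = table.get(key, 0) + 1
    yield from table.items()  # emit whatever is left at the end


def final_combine(shuffled):
    """Downstream side: merge the partial counts for each key.

    A real runner groups by key in shuffle and handles one key at a
    time; this toy version simply holds all totals in one Counter.
    """
    totals = Counter()
    for key, partial in shuffled:
        totals[key] += partial
    return dict(totals)


events = ["a", "b", "a", "c", "d", "a", "b"]
partials = list(precombine(events, max_table_size=2))
totals = final_combine(partials)
```

With a table of two entries the worker emits several partial counts for "a", and the downstream step still arrives at the correct totals. A small table costs more shuffled data, not correctness.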
I'm playing with InfluxDB and experimenting with it for a vehicle speed tracking use case.
Every vehicle's speed at a given time is stored as a data point.
I'm modelling "vehicle_registration" as a tag and the other values as fields. I want the WHERE clause to be applied on "vehicle_registration", and it has to be quick. Therefore I'm taking advantage of the fact that tags are indexed by default.
But the biggest stumbling block for me is that tags need to have low cardinality.
What are the recommendations here? I want to filter on a high-cardinality field in a WHERE clause, and the queries should be quick.
Any advice?
High cardinality means higher memory requirements, so it really depends on what high cardinality means in your use case. 1k will probably be fine with 8GB of memory, but 1M will probably be a problem with 8GB. The best option is to try it: simulate your workload and you will see the real memory requirements. Then you will be able to size InfluxDB properly based on that (and your budget, of course).
Or you can try TSI https://docs.influxdata.com/influxdb/v1.8/concepts/tsi-details/
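When simulating, it helps to know how many series your data produces. As a rough sketch, assuming a series is identified by its measurement plus its tag set (fields do not count here), the cardinality of a sample can be counted like this. The point layout and the sample registrations are made up for illustration.

```python
def series_cardinality(points):
    """Count distinct (measurement, tag set) combinations.

    points is an iterable of (measurement, tags) pairs, where tags is
    a dict of tag key to tag value. Field values are ignored.
    """
    series = {
        (measurement, tuple(sorted(tags.items())))
        for measurement, tags in points
    }
    return len(series)


# Hypothetical sample: three points from two vehicles.
sample_points = [
    ("speed", {"vehicle_registration": "AB12CDE"}),
    ("speed", {"vehicle_registration": "XY34ZZZ"}),
    ("speed", {"vehicle_registration": "AB12CDE"}),
]
sample_cardinality = series_cardinality(sample_points)
```

With one tag, the cardinality is simply the number of distinct vehicles; every extra tag multiplies it by the number of distinct values that tag can combine with.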
Chapter 6 of the O'Reilly book "Graph Databases", which covers how Neo4j stores a graph, says:
To understand why native graph processing is so much more efficient
than graphs based on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships.
To traverse a network of m steps, the cost of the indexed approach, at
O(m log n), dwarfs the cost of O(m) for an implementation that uses
index-free adjacency.
It is then explained that Neo4j achieves this constant-time lookup by storing all nodes and relationships as fixed-size records:
With fixed sized records and pointer-like record IDs, traversals are
implemented simply by chasing pointers around a data structure, which
can be performed at very high speed. To traverse a particular
relationship from one node to another, the database performs several
cheap ID computations (these computations are much cheaper than
searching global indexes, as we’d have to do if faking a graph in a
non-graph native database)
This last sentence triggers my question: how does Titan, which uses Cassandra or HBase as a storage backend, achieve these performance gains or make up for it?
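For context, the "cheap ID computations" in the quoted passage can be pictured as a multiplication: with fixed-size records, the byte offset of a record is its ID times the record size, so no index search is needed. The sketch below uses a bytes object as a stand-in store file and a made-up record size of 4 bytes; neither reflects Neo4j's actual record layout.

```python
def record_offset(record_id, record_size):
    """Byte offset of a fixed-size record: a single multiplication."""
    return record_id * record_size


def read_record(store, record_id, record_size):
    """Slice one record out of the store without searching for it."""
    start = record_offset(record_id, record_size)
    return store[start:start + record_size]


# Hypothetical store holding four 4-byte records with IDs 0 to 3.
RECORD_SIZE = 4
store = b"AAAABBBBCCCCDDDD"
third_record = read_record(store, 2, RECORD_SIZE)
```

The cost of the lookup does not depend on how many records the store holds, which is where the O(1) claim comes from. Whether that one read is fast still depends on whether the bytes are in memory or on disk.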
Neo4j only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Neo4j is slow because of pointer chasing on disk (they have a poor disk representation).
Titan only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Titan is faster than Neo4j because it has a better disk representation.
Please see the following blog post that explains the above quantitatively:
http://thinkaurelius.com/2013/11/24/boutique-graph-data-with-titan/
Thus, it's important to understand which part of the memory hierarchy people are in when they say O(1). When you are in a single JVM (single machine), it's easy to be fast, as both Neo4j and Titan demonstrate with their respective caching engines. When you can't fit the entire graph in memory, you have to rely on intelligent disk layouts, distributed caches, and the like.
Please see the following two blog posts for more information:
http://thinkaurelius.com/2013/11/01/a-letter-regarding-native-graph-databases/
http://thinkaurelius.com/2013/07/22/scalable-graph-computing-der-gekrummte-graph/
OrientDB uses a similar approach, where relationships are managed without indexes (index-free adjacency) and instead with direct pointers (LINKS) between vertices. They are like in-memory pointers, but on disk. In this way OrientDB achieves O(1) traversal both in memory and on disk.
But if you have a vertex "City" with thousands of edges to the vertices "Person", and you're looking for all the people with age > 18, then OrientDB uses indexes because a query is involved, so in this case it's O(log N).
I have written a variety of queries using cypher that take no less than 200ms per query. They're very straightforward, so I'm having trouble identifying where the bottleneck is.
Simple Match with Parameters, 2200ms:
Simple Distinct Match with Parameters, 200ms:
Pathing, 2500ms:
At first I thought the issue was a lack of resources, because I was running Neo4j and my application on the same box. Although the performance monitor indicated that CPU and memory were largely free and available, I moved the Neo4j server to another local box and observed similar latency. Both servers are workstations with fairly new Xeon processors, 12GB memory and SSDs for the data storage. All of the above leads me to believe that the latency isn't due to my hardware. OS is Windows 7.
The graph has less than 200 nodes and less than 200 relationships.
I've attached some queries that I send to neo4j along with the configuration for the server, database, and JVM. No plugins or extensions are loaded.
Pastebin Links:
Database Configuration
Server Configuration
JVM Configuration
[Expanding a bit on a comment I made earlier.]
@TFerrell: Your comments state that "all nodes have labels", and that you tried applying indexes. However, it is not clear if you actually specified the labels in your slow Cypher queries. I noticed from your original question statement that neither of your slower queries actually specified a node label (which presumably should have been "Project").
If your Cypher query does not specify the label for a node, then the DB engine has to test every node, and it also cannot apply an index.
So, please try specifying the correct node label(s) in your slow queries.
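A toy model of why the label matters, in plain Python rather than Cypher. The node list, the property name "Id" and both functions are made up for illustration; the real planner and index structures are more involved.

```python
# Hypothetical nodes: two projects and one person.
nodes = [
    {"id": 0, "labels": {"Project"}, "props": {"Id": "p-1"}},
    {"id": 1, "labels": {"Person"}, "props": {"Id": "u-1"}},
    {"id": 2, "labels": {"Project"}, "props": {"Id": "p-2"}},
]


def find_without_label(all_nodes, key, value):
    """No label in the query: every node has to be loaded and tested."""
    hits, checked = [], 0
    for node in all_nodes:
        checked += 1
        if node["props"].get(key) == value:
            hits.append(node["id"])
    return hits, checked


def build_index(all_nodes, label, key):
    """An index is scoped to one label and one property."""
    return {
        node["props"][key]: node["id"]
        for node in all_nodes
        if label in node["labels"] and key in node["props"]
    }


project_index = build_index(nodes, "Project", "Id")
scan_hits, scan_checked = find_without_label(nodes, "Id", "p-2")
index_hit = project_index["p-2"]
```

The index only exists per label, so a query that does not name the label has nothing to look the value up in and falls back to checking every node.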
Is that the first run or a subsequent run of these queries?
You probably don't have a label on your nodes, and no index or unique constraint.
So Neo4j has to scan the whole store for your node, pulling everything into memory, loading the properties and checking them.
Try this:
Run until count returns 0:
match (n) where not n:Entity set n:Entity return count(*);
Add the constraint:
create constraint on (e:Entity) assert e.Id is unique;
Run your query again, this time with the label:
match (n:Entity {Id:{Id}}) return n
etc.
It seems there is something wrong with the automatic memory mapping calculation when you are on Windows (memory mapping on heap).
I just looked at your messages.log and added up some numbers. It seems the mmio alone is enough to fill your Java heap space (old-gen), leaving no room for the database, caches etc.
Please try to amend that by setting the mmio config in your conf/neo4j.properties to more sensible values than the auto-calculation.
For your small store, just uncommenting the values starting with #neostore. (i.e. removing the #) should work fine.
Otherwise, for a larger graph (2M nodes, 10M rels, 20M props, 10M long strings), something like this (fitting a 3GB heap):
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=250M
neostore.propertystore.db.mapped_memory=250M
neostore.propertystore.db.strings.mapped_memory=250M
neostore.propertystore.db.arrays.mapped_memory=0M
Here are the added numbers:
auto mmio: 134217728 + 134217728 + 536870912 + 536870912 + 1073741824 = 2.3GB
stores sizes: 1073920 + 1073664 + 3221698 + 3221460 + 1073786 = 9MB
JVM max: 3.11 RAM : 13.98 SWAP: 27.97 GB
max heaps: Eden: 1.16, oldgen: 2.33
taken from:
neostore.propertystore.db.strings] brickCount=8 brickSize=134144b mappedMem=134217728b (storeSize=1073920b)
neostore.propertystore.db.arrays] brickCount=8 brickSize=134144b mappedMem=134217728b (storeSize=1073664b)
neostore.propertystore.db] brickCount=6 brickSize=536854b mappedMem=536870912b (storeSize=3221698b)
neostore.relationshipstore.db] brickCount=6 brickSize=536844b mappedMem=536870912b (storeSize=3221460b)
neostore.nodestore.db] brickCount=1 brickSize=1073730b mappedMem=1073741824b (storeSize=1073786b)
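If you want to redo the addition yourself, here is a small sketch that extracts mappedMem and storeSize from the log lines quoted above and sums them. The regular expression is written against exactly these five lines; other Neo4j versions may log a different format.

```python
import re

# The five lines quoted from messages.log above.
LOG_LINES = """\
neostore.propertystore.db.strings] brickCount=8 brickSize=134144b mappedMem=134217728b (storeSize=1073920b)
neostore.propertystore.db.arrays] brickCount=8 brickSize=134144b mappedMem=134217728b (storeSize=1073664b)
neostore.propertystore.db] brickCount=6 brickSize=536854b mappedMem=536870912b (storeSize=3221698b)
neostore.relationshipstore.db] brickCount=6 brickSize=536844b mappedMem=536870912b (storeSize=3221460b)
neostore.nodestore.db] brickCount=1 brickSize=1073730b mappedMem=1073741824b (storeSize=1073786b)
"""

PATTERN = re.compile(
    r"(?P<store>[\w.]+)\].*?mappedMem=(?P<mapped>\d+)b \(storeSize=(?P<size>\d+)b\)"
)


def summarize(text):
    """Return (total mapped bytes, total store bytes) over all matching lines."""
    mapped = size = 0
    for match in PATTERN.finditer(text):
        mapped += int(match["mapped"])
        size += int(match["size"])
    return mapped, size


total_mapped, total_store = summarize(LOG_LINES)
print(total_mapped / 2**30, "GiB mapped")
print(total_store / 2**20, "MiB on disk")
```

This gives 2415919104 bytes mapped (2.25 GiB) against 9664528 bytes of actual store files (about 9.2 MiB), which matches the rounded figures above.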
I'd like to get the amount of "free memory" per NUMA node.
When dealing with a whole machine, one usually parses /proc/meminfo like free does (the number wanted is MemFree + Buffers + Cached).
There is also /sys/devices/system/node/nodeX/meminfo, which seems to display numbers per NUMA node. Does anybody know how these numbers can be correlated with the content of /proc/meminfo? My naive assumption would be that summing some numbers over all NUMA nodes in the system gives a result equal to some number in /proc/meminfo. But so far I have failed to figure out the relationships, especially for page caches.
The code for proc is in fs/proc/meminfo.c, for the sysfs files it's in drivers/base/node.c. Comparing them might give you some hints.
Note that you'll probably never get the numbers to add up 100%, because you can't atomically read the content of all the files, so the values will change while you're reading them.
There also seems to be an inconsistency in the total RAM reported via both methods. One explanation for that is that free_init_mem doesn't appear to be NUMA aware, and increments total_ram_pages but does not do any NUMA accounting.
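For experimenting with the comparison, here is a minimal sketch that parses both file formats and sums one field over all NUMA nodes. It assumes the per-node files prefix each line with "Node N" and otherwise look like /proc/meminfo; the function names are hypothetical, and which fields exist per node (for example FilePages rather than Buffers and Cached) depends on the kernel version.

```python
import glob
import re

# Matches "MemFree:  123 kB" as well as "Node 0 MemFree:  123 kB".
LINE = re.compile(r"^(?:Node\s+\d+\s+)?(\w+(?:\(\w+\))?):\s+(\d+)", re.MULTILINE)


def parse_meminfo(text):
    """Map field name to its numeric value (kB for most fields)."""
    return {name: int(value) for name, value in LINE.findall(text)}


def sum_over_nodes(field, pattern="/sys/devices/system/node/node*/meminfo"):
    """Add up one field across every per-node meminfo file."""
    total = 0
    for path in glob.glob(pattern):
        with open(path) as handle:
            total += parse_meminfo(handle.read()).get(field, 0)
    return total


def report(field="MemFree"):
    """Print the system-wide value next to the per-node sum (Linux only)."""
    with open("/proc/meminfo") as handle:
        system_wide = parse_meminfo(handle.read()).get(field)
    print(field, "system:", system_wide, "sum of nodes:", sum_over_nodes(field))
```

Calling report("MemFree") on a NUMA machine shows how close the two numbers are. As noted above, expect small differences, since the files cannot be read atomically.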