Neo4j representation of graph - internals

I have a question regarding how a graph in Neo4j is loaded into memory
from disk.
Reading the link here, I think I understand how the graph is represented on
disk, and that when a new Neo4j database is created, there are
physically separate files created for the Node, Edge and Property
stores (mainly).
When you issue a query to Neo4j, does it:
1) Load the entire graph (nodes, edges, properties) into memory using a
doubly linked list structure?
OR
2) Determine the nodes and edges required for the query and populate the
list structure with random accesses to the relevant stores (nodes,
edges) on disk? If so, how does Neo4j minimize the number of disk accesses?

As frobberOfBits mentions, it's more like #2. Disk accesses are minimized by a two-layered cache architecture, which is best described in the reference manual.
Even if your cache is smaller than the store files, this mostly results in seek operations followed by a read (records have a fixed length, so the offset of a record can be computed directly). These operations are typically fast, and even faster on appropriate hardware such as SSDs.
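As a rough illustration (assuming a fixed node record size of about 15 bytes, as in the 2.x store format): the record for node 1,000,000 starts at byte offset 1,000,000 × 15 = 15,000,000 in the node store file, so Neo4j can seek straight to it rather than scanning the file.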

Related

Neo4j web client fails with large Cypher CREATE query. 144000 lines

I'm new to neo4j and currently attempting to migrate existing data into a neo4j database. I have written a small program to convert current data (in a bespoke format) into a large CREATE Cypher query for initial population of the database. My first iteration has been to somewhat retain the structure of the existing object model, i.e. objects become nodes, the node type is the same as the object name in the current object model, and the members become properties (the member name is the property name). This is done for all fundamental types (and strings), and any member objects are decomposed in the same way as in the original object model.
This has been fine in terms of performance, and 13000+ line CREATE Cypher queries have been generated which can be executed through the web frontend/client. However, I believe the model is not ideal for a graph database, since there can be many properties; instead I would like to decompose these 'fundamental' nodes (whose members are fundamental types) into their own nodes, each relating to a more 'abstract' node which represents the higher-level object/class. This means each member is a node with a single (at first, it may grow) property, say { value:"42" }, or I could set the node type to the data type (i.e. integer). If my understanding is correct, this would also allow me to create relationships between the 'members' (since they are nodes and not properties), allowing greater freedom when expressing relationships between original members of different objects rather than just relating the parent objects to each other.
The problem is this now generates 144000+ line Cypher queries (and this isn't a large dataset in comparison to others), which the neo4j client seems to balk at. The code highlighting appears to work in the query input box of the client (i.e. it highlights correctly, which I assume implies it parsed it correctly and is a valid Cypher query), but when I come to run the query, I get the usual browser-not-responding behaviour and then a stack overflow (no pun intended) error. What's more, the neo4j client doesn't exit elegantly and always requires me to force end task, and the db sits at 2.5-3GB of memory usage from what is effectively a small amount of data (144000 lines, approx 2/3 of which are relationships, so at most ~48000 nodes). Yet I read I should be able to deal with millions of nodes and relationships in milliseconds?
I have tried it with Firefox and Chrome. I am using the neo4j community edition on Windows 10. The SDK would initially be used with C# and C++. This research is in its initial stages, so I haven't used the SDK yet.
Is this a valid approach, i.e. to initially populate the database via a CREATE query?
Also, is my approach of decomposing the data into fundamental types a good one, or are there issues likely to arise from it?
That is a very large Cypher query!!!
You would do much better to populate your database using LOAD CSV FROM... and supplying a CSV file containing the data you want to load.
For a detailed explanation, have a look at:
https://neo4j.com/developer/guide-import-csv/
(This page also discusses the batch loader for really large datasets.)
Since you are generating code for the Cypher query I wouldn't imagine you would have too much trouble generating a CSV file.
(As an indication of performance, I have been loading a 1 million record CSV today into Neo4j running on my laptop in under two minutes.)
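For example, a minimal sketch of this approach (assuming a hypothetical people.csv with id and name columns placed where the server can read it, and a Neo4j version that supports periodic commit):
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
// one node per CSV row; the property names come from the CSV header
CREATE (p:Person {id: row.id, name: row.name});
Relationships can then be created in a second pass that matches the nodes by id.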

Neo4j, do nodes without relationships affect performance?

How do nodes without relationships affect performance?
The input stream contains duplicate nodes, and once I've determined that a node is not of interest I'd like a shorthand way to know that I've already seen it and want to disregard it.
If I store one instance of the node in the db without any relationships will it impact performance? Potentially the number of relationship-less nodes is very large.
Usually these don't affect performance: they take up space on disk, but they will not be loaded if you don't access them, and since you don't traverse them it doesn't matter much.
I would still skip them. You can do that with neo4j-import, which has a --skip-duplicate-nodes option, as well as with LOAD CSV or Cypher in general, where the MERGE clause only creates a new node if it is not already there.
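A minimal sketch of the MERGE variant (the :Record label and id property are just placeholders for whatever identifies a node in your input):
// creates the node only if no :Record with this id exists yet,
// so repeated occurrences in the input collapse into a single node
MERGE (n:Record {id: "abc-123"})
RETURN n;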

How to apportion between BatchInserterIndex cache and MMIO?

In a batch insertion using Lucene indexes, given a large set of nodes and relationships such that the node and relationship stores cannot fit completely in mapped memory (hence the need for Lucene index caching), how should one divide memory between MMIO and the Lucene index caches to achieve optimal performance? Having read the documentation, I am already somewhat familiar with how to divide memory within the mapped-memory scheme; I am interested in the overall allotment of memory between MMIO and the Lucene caches. Since I am working on a prototype with whatever hardware happens to be available, and the future resources and data volume are undetermined, I would prefer the answer to be in general terms (I think this would also make the answer more useful to the rest of the Neo4j community too). So it would be good if I could pose the question like this:
Given
rwN nodes and rwR relationships that are written and must be read later in the batch insertion,
woN nodes and woR relationships that are only written,
G gigabytes of RAM (not including what is required for the operating system)
What is the optimal division of G between lucene index caches and MMIO?
If more details are needed I can supply them for my particular case.
All these considerations are only relevant for importing (multiple) billions of nodes and relationships.
Usually, when you do lookups, it depends on the "hot dataset size" of your index lookups.
By default that's all nodes, but if you know your domain better you can probably devise some paging that results in smaller caches being needed (e.g. by pre-sorting your input data for relationship creation by the start- and end-node lookup property); then you have a kind of moving window over your node data during which each node is accessed frequently.
I usually even sort by min(start,end).
Usually you try to use most of the RAM for MMIO mapping of the relationship store and node store. The property stores are only written to, but the others have to be updated as well.
The index cache lookup is only a HashMap behind the scenes, so it is quite wasteful. What I found to work better is a different approach, e.g. a multi-pass one:
Use a string array: put all your lookup properties in there, sort it, and use the array index (via Arrays.binarySearch) as the node-id; then the lookup needs only that array and is quite efficient.
Another way is to use a multi-pass over the source data so that you already create the node-ids needed for the relationships as part of the source. Friso and Kris from Xebia did something like that in their Hadoop-based solution, especially the monotonically increasing parallel ids.

Performance ramifications of filtering on Neo4j relationships

I am considering filtering by a parameter on a relationship.
For example:
If I have a graph containing
CREATE (n:Car)-[r:DRIVES_ON {side: 'left'}]->(m:Country {Name: 'England'}) RETURN n, m;
I want to extract using
MATCH (n:Car)-[r:DRIVES_ON]-(m:Country) WHERE r.side = 'left' RETURN r;
Is this a BAD idea because of performance reasons?
Since there are only two options, I would just use two separate relationship types:
(Car)-[:DRIVES_ON_LEFT]->(Country)
(Car)-[:DRIVES_ON_RIGHT]->(Country)
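A query can then filter purely on the relationship type, for example:
MATCH (n:Car)-[:DRIVES_ON_LEFT]->(m:Country) RETURN n, m;
so no relationship property needs to be read at all.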
But it would be nice to know more about your domain.
Nicole is right,
The property data is currently stored separately from the relationships on disk, while the type is stored in the relationship record itself. So checking just for the type is much faster, as no property has to be loaded (properties are loaded lazily).
So loading properties in a high-performance traversal case can hurt (on cold caches) and uses more memory to fill the caches with, especially as all properties are loaded in one go, at least the ones that fit into the property records. Only larger arrays and larger strings are not loaded by default but lazily on access.
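If you do keep the property, a rough way to see the extra work (a sketch using the relationship from the question) is to profile the query:
PROFILE MATCH (n:Car)-[r:DRIVES_ON]-(m:Country) WHERE r.side = 'left' RETURN r;
Every DRIVES_ON relationship is expanded first and its side property loaded before the filter can discard it, whereas with two relationship types the type check happens without touching the property store.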

What is the complexity of accessing a node in a graph generated by Neo4j?

I am considering working with DBpedia and using Neo4j for this purpose. There are 2 things I don't understand:
What is the complexity of accessing a node in the graph?
If I have a huge DB such as DBpedia, would any search for a node
take O(|E|+|V|)?
I mean random access to a node in the graph: are the nodes hashed so that they can be accessed in O(1)?
Accessing by ID is O(1), accessing via an index is usually O(log(n)), scanning through the db is O(n), and accessing the relationships of a node is usually O(1) too.
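Roughly, in Cypher terms (a sketch; the label, property and values are placeholders, and the index lookup assumes an index on :Resource(uri) exists):
// O(1): lookup by internal node id
MATCH (n) WHERE id(n) = 12345 RETURN n;
// ~O(log n): lookup through an index
MATCH (n:Resource {uri: 'http://dbpedia.org/resource/Berlin'}) RETURN n;
// O(n): full scan when no index applies
MATCH (n) WHERE n.name = 'Berlin' RETURN n;
// ~O(1) per relationship: expanding from a node you already have
MATCH (n:Resource {uri: 'http://dbpedia.org/resource/Berlin'})-->(m) RETURN m;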
That said, you should make sure that your hot dataset fits in the mmio buffers and the caches; see:
http://docs.neo4j.org/chunked/snapshot/embedded-configuration.html
http://video.neo4j.org/4ALA/0719-hardware-sizing-with-neo4j/
