How to apportion between BatchInserterIndex cache and MMIO? - memory

In a batch insertion using lucene indexes, given a large set of nodes and relations such that the node and relationship store cannot fit completely in mapped memory (hence the need for lucene index caching), how should one divide memory between MMIO and lucene index caches to achieve optimal performance? Having read the documentation, I am already somewhat familiar with how to divide memory within the mapped-memory schema. I am interested in the overall allotment of memory between MMIO and the lucene caches. Since I am working on a prototype with what hardware happens to be available, and the future resources and data volume are undetermined, I would prefer the answer to be in general terms (I think this would also make the answer more useful to the rest of Neo4j community too.) So it would be good if I could pose the question like this:
Given
rwN nodes and rwR relationships that are written and must be read later in the batch insertion,
woN nodes and woR relationships that are only written,
G gigabytes of RAM (not including what is required for the operating system)
What is the optimal division of G between lucene index caches and MMIO?
If more details are needed I can supply them for my particular case.

All these considerations are only relevant for importing (multiple) billions of nodes and relationships
Usually when you do lookups it depends on the "hot dataset size" of your index lookups.
By default that's all nodes but if you know your domain better, you can probably devise some paging that results in smaller needed caches (e.g. by pre-sorting your input data for relationship creation by start and end-node lookup-property) then you have kind of a moving window over your node data during which each node is accessed frequently.
I usually even sort by min(start,end).
Usually you try to use most of the RAM for mmio mapping of the rel-store and node store. The property stores are only written to but the others have to be updated as well.
The index cache lookup is only a HashMap behind the scenes, so quite wasteful. What I found to work better is to use a different approach, e.g. a multi-pass one.
use an string-array put all your lookup properties in there, sort it and use the array index (Arrays.binarySearch) as node-id then the lookup only in that array is quite efficient
another way is using a multi-pass on the source data so you already create the node-ids that are needed for the rels as part of the source, Friso and Kris from Xebia did something like that in their hadoop based solution esp. the monotonically increasing parallel id's

Related

Is there a definite answer for the performance of 1st level depth query search in Neo4j (or any other graph db) vs. SQL/NoSQL?

I am testing the use of Neo4j for social-like graph, but I also have many use-cases that require 1st level depth queries (e.g. get my likes / views). Consequently, I wish to decide whether I need another SQL/NoSQL to support (in terms of performance) this type of queries.
Up until now, I was only able to find benchmarks and quantitive data concerning > 1st level searches (i.e. friends of my friends ...)
Is it common knowledge that SQL/NoSQL db will have better performance for such queries? are there any research/benchmarks about this?
A depth of 1 shouldn't result in much difference in performance, I'd think. However, that does depend on two factors: the indexes you've set up, and the depth of expansion.
For both Neo4j and your relational db, you would want a supporting index on the starting node in the graph (the person or post whose likes/views you want to get). For the relational db, you would also want an index to support the join operation being used to get at the connected nodes.
For Neo4j, the expansion to the connected nodes is directly proportional to the number of nodes you are expanding to, since this is just pointer chasing between the nodes and relationships forming your graph. No indexes are used for that.
For a relational database, the relationship would likely be modeled as a table join (which should be index-backed), and that cost will be proportional to the size of the tables being joined, so as more data is added to the graph (no matter of whether it is connected to the user who you are querying for), it will be impacting your execution time.
Thankfully for your case only a single table join would be needed. You may not see a big difference between a graph db and a relational db. Neo4j tends to shine when many (possibly an unbounded number) of traversals are needed, like the friend-of-a-friend queries or those with longer patterns. If your use cases include longer patterns, especially if the types of the node expanded to are not known ahead of time, then Neo4j would be very helpful, especially as the data in your database grows, since traversal performance is proportional only to the directly connected data, not the total number of nodes of the given labels.

Query optimization that collects and orders nodes on very large graph

I have a decently large graph (1.8 billion nodes and roughly the same number of relationships) where I am performing the follow query:
MATCH (n:Article)
WHERE n.id IN $pmids
MATCH (n)-[:HAS_MENTION]->(m:Mention)
WITH n, collect(m) as mentions
RETURN n.id as pmid, mentions
ORDER BY pmid
where $pmids are a list of strings, e.g. ["1234", "4567"] where the length of this list varies from 100-500 length.
I am currently am holding the data within neo4j docker community instance with the following conf modifications: NEO4J_dbms_memory_pagecache_size=32G, NEO4J_dbms_memory_heap_max__size=32G. Index has been created for Article.id.
This query has been quite slow to run (roughly 5 seconds) and I would like to optimize to make for faster runtime. As part of work, I have access to neo4j enterprise so one approach would be to ingest this data as part of a neo4j enterprise account where I can tweak advanced configuration settings.
In general, does anyone have any tips in how I may improve performance, whether it be optimizing the cypher query itself, increase workers or other settings in neo4j.conf?
Thanks in advance.
For anyone interested - I posed this question in the neo4j forums as well and there have already been some interesting optimization suggestions (especially around the "type hint" to trigger backward-indexing, and using pattern comprehension instead of collect()
Initial thoughts
you are using a string field to store PMID, but PMIDs are numeric, it might reduce the database size, and possibly perform better if stored as int (and indexed as int, and searched as int)
if the PMID list is usually large, and the server has over half dozen cores, it might be worth looking into the apoc parallel cypher functions
do you really need every property from the Mention nodes? if not try gathering just what you need
what is the size of the database in GBs? (some context is required in terms of memory settings), and what did neo4j-admin memrec recommend?
If this is how the db is always used, all the time, a sql database might be better, and when building that sql db, collect the mentions into one field (once and done)
Note: Go PubMed!

Neo4J using properties on relationships for quicker lookup?

I am yet trying to make use of neo4j to perform a complex query (similar to shortest path search except I have very strange conditions applied to this search like minimum path length in terms of nodes traversed count).
My dataset contains around 2.5M nodes of one single type and around 1.5 billion edges (One single type as well). Each given node has on average 1000 directional relation to a "next" node.
Yet, I have a query that allows me to retrieve this shortest path given all of my conditions but the only way I found to have decent response time (under one second) is to actually limit the number of results after each new node added to the path, filter it, order it and then pursue to the next node (This is kind of a greedy algorithm I suppose).
I'd like to limit them a lot less than I do in order to yield more path as a result, but the problem is the exponential complexity of this search that makes going from LIMIT 40 to LIMIT 60 usually a matter of x10 ~ x100 processing time.
This being said, I am yet evaluating several solutions to increase the speed of the request but I'm quite unsure of the result they will yield as I'm not sure about how neo4j really stores my data internally.
The solution I think about yet is to actually add a property to my relationships which would be an integer in between 1 and 15 because I usually will only query the relationships that have one or two max different values for this property. (like only relationships that have this property to 8 or 9 for example).
As I can guess yet, for each relationship, neo4j then have to gather the original node properties and use it to apply my further filters which takes a very long time when crossing 4 nodes long path with 1000 relationships each (I guess O(1000^4)). Am I right ?
With relationship properties, will it have direct access to it without further data fetching ? Is there any chance it will make my queries faster? How are neo4j edges properties stored ?
UPDATE
Following #logisima 's advice I've written a procedure directly with the Java traversal API of neo4j. I then switched to the raw Java procedure API of Neo4J to leverage even more power and flexibility as my use case required it.
The results are really good : the lower bound complexity is overall a little less thant it was before but the higher bound is like ten time faster and when at least some of the nodes that will be used for the traversal are in the cache of Neo4j, the performances just becomes astonishing (depth 20 in less than a second for one of my tests when I only need depth 4 usually).
But that's not all. The procedures makes it very very easily customisable while keeping the performances at their best and optimizing every single operation at its best. The results is that I can use far more powerful filters in far less computing time and can easily update my procedure to add new features. Last but not least Procedures are very easily pluggable with spring-data for neo4j (which I use to connect neo4j to my HTTP API). Where as with cypher, I would have to auto generate the queries (as being very complex, there was like 30 java classes to do the trick properly) and I should have used jdbc for neo4j while handling a separate connection pool only for this request. Cannot recommend more to use the awesome neo4j java API.
Thanks again #logisima
If you're trying to do a custom shortespath algo, then you should write a cypher procedure with the traversal API.
The principe of Cypher is to make pattern matching, and you want to traverse the graph in a specific way to find your good solution.
The response time should be really faster for your use-case !

Neo4j representation of graph - internals

I have a question regarding how a graph in Neo4j is loaded into memory
from disk.
Reading the link here, I think I understand how the graph is represented on
disk. And when a new Neo4j databases is created, there are
physically separate files created for Nodes, Edges and Property
stores (mainly).
When you issue a query to Neo4j, does it:
1) Load the entire graph(nodes, edges, properties) in memory using a
doubly link list structure?
OR
2) Determine the nodes, edges required for the query and populate the
list structure with random accessess to the relavant stores(nodes,
edges) on disk? If so, how does Neo4j minimize the number of disk-accesses?
As frobberOfBits mentions it's more like #2. The disc accesses are minimized by a two-layered cache architecture which is best described in the reference manual.
Even if your cache is smaller than the store files this results mostly in seek operations (since a fixed record length) with a read. This kind of operations are typically fast (even faster with appropriate hardware like SSD)

How much does the architecture of data affect the speed of a query

I have the following nodes and relationships in Neo4j database.
The grey and the pink node are furtherly connected with more nodes. Running the following query:
MATCH (n:RealNode {gid:'$obj_id'})-[:CONTAINS*..3]-(z)
RETURN DISTINCT ID(z), z.id,n.id as InternalID"
I get a result very fast (the node n:RealNode is not one of the nodes in the image).
If I increase the depth to 4 like:
MATCH (n:RealNode {gid:'$obj_id'})-[:CONTAINS*..4]-(z)
RETURN DISTINCT ID(z), z.id,n.id as InternalID"
The response gets extremely slow. I will never get a response with depth 5 etc.
The depth 4 is actually the relationship between the blue-pink node. So my question is: can the architecture of data (in this case) affect in such a great level the speed of the query? If yes what should I do?
I have tried to run the query also using parameters but the result was the same. Also the gid of n:RealNode is an indexed value.
The architecture of your data has a huge, no...massive impact on query performance. There's a lot you can do with improving performance by reformulating your query, but you can do even more than that by changing your data model.
The model needs to be chosen in a way that's an accurate depiction of the real-world domain, but it often also has to make certain concessions to usage patterns. If you know you're going to do certain queries over and over, it makes sense to choose a data model that makes it easy for the DBMS to answer that query. In the RDBMS world, that entire line of thinking gets summarized in the word "denormalization". In graph databases, the concept is the same but the way you go about it is different.
The thing to keep in mind when adjusting your data model is that neo4j is good at traversing relationships fast, and that with all queries, the less data you have to consider, the faster the query will go.
So in your case, I don't know how many nodes branch off of each other node by a :CONTAINS relationship, but I'm guessing that at each level of the hierarchy you have many items below it. So going from level 4 to level 5 probably doesn't just add a fixed number of additional nodes, but if say each level of the hierarchy has 3x the number of nodes as the level above, the deeper you go, the more you're multiplying how much data you have to consider. If it's 10x...then ouch.
You have many different options. One is to create short-cut relationships, and "pre-materialize" the query. Imagine creating :grandfather and :greatgrandfather relationships to "hop" levels of the tree. That would make it faster. Another way would be to filter intermediate nodes, or the return nodes, so that you're not considering everything, but some subset.
In the end, really huge queries will always take longer than really small ones. You must first begin with a careful understanding of what data you want, and how often you have to run this query. I would not attempt to optimize your data model for infrequently run queries, but if you do this all the time, you should look at your options. Your query to me looks like it's going to return a whole lot of data no matter what you do.

Resources