Neo4j Huge database query performance configuration

I am new to Neo4j and graph databases. That said, I have around 40,000 independent graphs loaded into a Neo4j database using batch insertion, and so far everything has gone well. My current database folder is 180 GB; the problem is querying, which is far too slow. Just counting the nodes takes forever. I am using a server with 1 TB of RAM and 40 cores, so I would like to load the entire database into memory and run queries against it.
I have looked into the configuration options but am not sure which changes would cache the entire database in memory, so please suggest which properties I should modify.
I also noticed that most of the time Neo4j uses only one or two cores. How can I increase that?
I am using the free version for a university research project, so I cannot use the High-Performance Cache. Is there an alternative in the free version?
My Solution:
I added more graphs to my database; it is now 400 GB with more than a billion nodes. Following Stefan's comments, I used the Java API to access the database and moved the database onto a RAM disk. It now takes about 3 hours to walk through all the nodes and collect information from each one.
The RAM disk and the Java API gave a big boost in performance.

Counting the nodes in a graph is a global operation that obviously needs to touch each and every node. If the caches are not populated (or not configured for your dataset), the speed of your hard disk is the dominant factor.
To speed things up, make sure your caches are configured appropriately; see http://neo4j.com/docs/stable/configuration-caches.html.
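For illustration, here is a hedged sketch of what that configuration looked like in the 2.x era — the property names changed across versions (2.2+ uses one unified page-cache setting, while 2.0/2.1 mapped memory per store file), so check the linked page for yours; the sizes below are placeholders you would derive from your own store files:

```
# Neo4j 2.2/2.3: one unified page cache - size it to fit your store files
dbms.pagecache.memory=200g

# Older 2.0/2.1-style per-store memory mapping (conf/neo4j.properties)
neostore.nodestore.db.mapped_memory=25g
neostore.relationshipstore.db.mapped_memory=100g
neostore.propertystore.db.mapped_memory=75g
```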
With current versions of Neo4j, a Cypher query traverses the graph single-threaded. Since most graph applications out there are used concurrently by multiple users, this model still saturates the available cores.
If you want to run a single query multi-threaded, you need to use the Java API.
In general, the Neo4j community edition has some limitations scaling beyond 4 cores (the Enterprise edition has a more performant lock manager implementation). The HPC (high-performance cache) in the Enterprise edition also significantly reduces the impact of full garbage collections.

What Neo4j version are you using?
Please share your current config (conf/* and data/graph.db/messages.log). For a university research project, you can use the personal edition of Neo4j Enterprise.
What kinds of use cases do you want to run?
Counting all nodes is probably not your main operation (and there are ways in the Java API to make it faster).
For efficient multi-core usage, run multiple clients, or write Java code that uses more cores during traversal with thread pools - see the sketch below.
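To illustrate that last point, here is a minimal, hedged sketch of a parallel node walk with the embedded Java API, assuming Neo4j 3.x. It partitions the node-ID space across a fixed thread pool; the database path, the ID upper bound, and the thread count are placeholders, and real code would batch its transactions rather than hold one open per worker:

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.NotFoundException;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

public class ParallelNodeWalk {
    public static void main(String[] args) throws Exception {
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase(new File("/data/graph.db"));
        long highestId = 1_000_000_000L;  // placeholder: upper bound on node ids
        int threads = 40;                 // one worker per core
        long chunk = (highestId / threads) + 1;

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong visited = new AtomicLong();
        List<Future<?>> futures = new ArrayList<>();

        for (int t = 0; t < threads; t++) {
            final long start = t * chunk;
            final long end = Math.min(start + chunk, highestId);
            futures.add(pool.submit(() -> {
                // each worker opens its own transaction and scans its id range
                try (Transaction tx = db.beginTx()) {
                    for (long id = start; id < end; id++) {
                        try {
                            Node n = db.getNodeById(id);
                            // ... collect whatever per-node information you need ...
                            visited.incrementAndGet();
                        } catch (NotFoundException ignored) {
                            // node ids are not contiguous; gaps are expected
                        }
                    }
                    tx.success();
                }
            }));
        }
        for (Future<?> f : futures) f.get();  // wait for all workers
        pool.shutdown();
        System.out.println("visited " + visited.get() + " nodes");
        db.shutdown();
    }
}
```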

Related

Hold entire Neo4j graph database in RAM?

I'm researching graph databases for a work project. Since our data is highly connected, it appears that a graph database would be a good option for us.
One of the first graph DB options I've run into is neo4j, and for the most part, I like it. However, I have one question about neo4j to which I cannot find the answer: Can I get neo4j to store the entire graph in-memory? If so, how does one configure this?
The application I'm designing needs to be lightning-fast. I can't afford to wait for the db to go to disk to retrieve the data I'm searching for. I need the entire DB to be held in-memory to reduce the query time.
Is there a way to hold the entire neo4j DB in-memory?
Thanks!
Further to Bruno Peres' answer, if you want to run a regular server instance, Neo4j will load the entire graph into memory when resources are sufficient. This does indeed improve performance.
The Manual has a chapter on configuring memory.
The page cache portion holds graph data and indexes - this is configured via the dbms.memory.pagecache.size property in neo4j.conf. If it is large enough, the whole graph will be stored in memory.
The heap space portion is for query execution, state management, etc. It is set via the dbms.memory.heap.initial_size and dbms.memory.heap.max_size properties. Generally these two properties should be set to the same value, so that the whole heap is allocated on startup.
If the sole purpose of the server is to run Neo4j, you can allocate most of the memory to the heap and page cache, leaving enough left over for operating system tasks.
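For concreteness, a hedged neo4j.conf sketch for a dedicated machine — the sizes are placeholders you would derive from your store size and what the OS needs:

```
# neo4j.conf - example sizes only
dbms.memory.pagecache.size=200g     # large enough to hold the whole graph + indexes
dbms.memory.heap.initial_size=31g   # keep initial and max equal ...
dbms.memory.heap.max_size=31g       # ... so the whole heap is allocated at startup
```

Keeping the heap at or below roughly 31 GB lets the JVM use compressed object pointers, which is why heaps are often capped there even on machines with far more RAM.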
Holding Very Large Graphs In Memory
At GraphConnect San Francisco in 2016, Neo4j's CTO, Jim Webber, in his typically entertaining fashion, gave details on servers with a very large amount of high-performance memory, capable of holding an entire large graph in memory. He seemed suitably impressed by them. I forget the name of the machines, but if you're interested, the video archive should have details.
Neo4j isn't designed to hold the entire graph in main memory. That leaves you with a couple of options: you can either tune the config parameters (as Jasper Blues explained in more detail), OR you can put the store on a RAM disk.
The first option probably won't give you the best performance, since only the caches are held in memory.
The challenge with the second approach is that everything lives in memory, which means the system isn't durable and writes are inefficient.
You could take a look at Memgraph (DISCLAIMER: I'm the co-founder and CTO). Memgraph is a high-performance, in-memory transactional graph database, and it's openCypher- and Bolt-compatible. Data is stored in main memory first and then written to disk; in other words, you can choose your tradeoff between write speed and safety.

SOLR and VNodes and Tokens

Note: I have done a little reformatting and added some additional information.
Please take a look at this: Question_Answer
I want to ask - with DSE 5.0 and the upcoming changes that were mentioned at C* Summit this year for 5.1 and 5.2, will the same advice be useful?
Our use case is:
The platform MUST be available at all times. (Cassandra)
The data must be searchable. (SOLR / Lucene)
The platform MUST provide analytics / Data Warehousing / BI etc (Graph / Spark)
All of that is possible in a single product offering thanks to DSE! Thank you DataStax!
But our amount of data stored and our transaction count are VERY modest.
Our specification is for 100 concurrent sessions within the application - which of course doesn't even translate to 100 concurrent DB requests / operations.
For the most part our application resembles an everyday enterprise CRUD application.
While not ridiculous, AWS instances aren't exactly free.
Having a separate cluster for each workload (with enough replication for continuous availability), will be a cost issue for us.
I understand that a proof of concept can offer some help, but without a real workload and real users passing through the services and applications, it can't give the kind of insight that only a production system (and rogue users) can provide. The best you can do is loaded functional testing.
In short, we're a little stuck here from a platform perspective.
We're, initially, thinking of having:
2 data centres for geographic isolation
2 racks per DC
2 nodes per Rack
RF of 3
CL of local_quorum
If we find we're hitting performance issues, we can scale out - add an extra rack or extra nodes to the initial 2 racks.
As for vnodes and the number of tokens, we have no idea.
The DSE Search documentation says vnodes add a 30% overhead, which suggests you shouldn't use them - yet a table in the same documentation says to use 16 or 32. How can it be both?
If we can successfully run all workloads on a single node (our requirements are genuinely minimal), do we run with vnodes (16 or 32) or with a single token?
Lastly, is there another alternative?
Can you have Nodes with different workloads in the same data centre? Where individual nodes are set up with RAM / CPU requirements for a specific workload?
Assuming our 4 nodes per data centre (as a starting place only - we have no idea whether you can successfully run Search or Spark on a single node):
Node 1: Just Cassandra
Node 2: Cassandra and Search
Node 3: Cassandra and Graph
Node 4: Cassandra and Spark
If Search needs 64 GB of RAM - so be it... but the Cassandra-only node could well work with just 8 or 16.
So we can cater, in terms of CPU and memory, to each workload type - but still only have a single DC. (We'll have 2 for redundancy - but effectively it is a single, mirrored DC installation.) The startup sketch below shows how this split is typically expressed.
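For reference, a hedged sketch of that per-node workload split using the DSE launcher flags as commonly documented - verify the flags against your DSE version's documentation:

```
# Hypothetical per-node startup, one workload flag per node:
dse cassandra          # Node 1: pure Cassandra
dse cassandra -s       # Node 2: Cassandra + Search (Solr)
dse cassandra -g       # Node 3: Cassandra + Graph
dse cassandra -k       # Node 4: Cassandra + Analytics (Spark)
```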
Thanks in advance for your help.
Vnodes add extra overhead for the scatter-gather part of the search solution; in some benchmarks that has been as high as 30%. Some customers are willing to live with that overhead because they want the dynamic-scaling benefits of vnodes.
If you have, or are planning, a small cluster - and won't need to scale it on the fly - then I would definitely recommend sticking with single tokens. A hidden benefit of that approach is that your repairs will also be slightly faster, which helps with Search since you are reading at the equivalent of CL.ONE. (A config sketch of the two options follows below.)
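A hedged cassandra.yaml sketch of the two setups — the token value is a placeholder you would compute per node for your ring:

```
# vnodes: many small ranges per node (16 or 32, per the DSE Search docs)
num_tokens: 16

# single-token alternative: disable vnodes and assign one token per node
# num_tokens: 1
# initial_token: -9223372036854775808   # placeholder - compute per node
```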
It is possible to run all the features in the same DC (Search, Analytics and now Graph), but you will find that the overheads go up: you will need larger nodes, with more memory and CPU resources, to cope with the processing load. I'd probably start with 128 GB of RAM and go from there - though if your load is really light, you might get away with less. As with everything, benchmarking at the scale you intend to run is key.
As an aside, I'm not totally clear on your intentions regarding RF. You seem to imply 2 nodes and RF=3. I'm guessing it's just phrasing, but if not, it's worth noting that you want at least as many nodes as the RF for best coverage!

What's the real minimum RAM for running an instance of Neo4J

I want to run as many instances of Neo4J (using the Enterprise version) on a single VM as possible. What are the real minimum RAM requirements to fire up an instance?
Right now Task Manager is telling me that java.exe is taking about 70,000 K (70 MB). Does that sound right?
I'm not worried about the performance, I just want to stuff as many instances as possible on a single box so people can do some low demand search of their graph.
What is recommended is one thing; beyond that, "it depends".
Neo4j is able to run on a Raspberry Pi, but you shouldn't expect great performance. I also use an AWS t2.micro for testing, and it's enough.
The process size is bound to fluctuate as the graph is loaded into memory to perform traversals and paged back to disk when memory runs out.
If I may offer a suggestion: you could run only one database instance and give each of your users their own unconnected graph. That would very likely be far more efficient in terms of server resources.
For example, if you have (:Item) nodes that make up a graph for each user,
you could instead label them with a unique per-user prefix or suffix, e.g. (:Item_User1).
Then, to run a query for a particular user, you just add that user's label and search only their sub-graph (see the sketch below).
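A hedged Cypher sketch of that scheme — the label and property names are hypothetical, and note that a hyphenated label like Item-User1 would need backticks in Cypher, so an underscore is easier:

```
// create an item in user 1's sub-graph
CREATE (:Item_User1 {name: 'widget'});

// query only user 1's sub-graph by adding their label
MATCH (i:Item_User1)
WHERE i.name = 'widget'
RETURN i;
```

One design caveat: labels cannot be bound as query parameters, so the per-user label typically has to be spliced into the query string by the application.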
The idea is to have a separate sub-graph for each user, unconnected from every other user's sub-graph, instead of a separate database instance for each individual user. As long as the sub-graphs stay unconnected, there should be no security hole where one user is given access to another user's data.
This way you could support a very large number of users - quite possibly millions - each with their own sub-graph and with little loss in performance, instead of the handful of database instances you could spin up on a single VM, which would end up competing for resources and choking each other out.
For Neo4j 3.x, the documented minimum memory requirement is 2 GB.

What are the minimum requirements of neo4j?

I'd like to run a neo4j database in a Docker container on an Odroid XU4. The database is not big - approximately 20,000 nodes will be in it. The Odroid has only 2 GB of memory, and I'd also like to run a Samba server, some Node.js applications, and at least one PostgreSQL database, so the system is short on memory. The neo4j manual says 2 GB of memory is the minimum, but Docker examples show it running with 512 MB, so I am a little confused. What is the minimum memory I can run the neo4j Docker image with?
I have similar trouble with disk space. The system is on a 32 GB SD card. I'd like to keep the database data there and back up to an external hard drive, so I could spend at most 16 GB on neo4j. The data certainly does not require that kind of space, and I am not sure why neo4j needs it (according to the manual, again).
First, you can use http://neo4j.com/hardware-sizing-calculator/ to get a rough estimate of memory and disk usage.
The second option is to do some math yourself. You can use the record-size information on page 12 of http://graphaware.com/assets/bachman-msc-thesis.pdf (a worked example follows below).
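As a hedged back-of-the-envelope example, using the fixed record sizes commonly cited for the 2.x store format (about 15 bytes per node, 34 per relationship, 41 per property — check the thesis or the manual for your version, and the assumed per-node counts are made up):

```
 20,000 nodes          x 15 B ≈ 0.3 MB
 40,000 relationships  x 34 B ≈ 1.4 MB   (assuming ~2 rels per node)
100,000 properties     x 41 B ≈ 4.1 MB   (assuming ~5 props per node)
----------------------------------------------------------------
store data on the order of single-digit MB - tiny next to the
2 GB baseline, which is mostly JVM heap and page-cache overhead.
```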
Keep in mind that it's good to have all the data in memory, for performance reasons.
From my point of view you shouldn't have a problem with memory, but you can't expect great performance either.
It's better to try it yourself before you ask here ;)
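If you do try the official Docker image on a tight budget, here is a hedged sketch of capping both memory regions explicitly. It assumes the official image's convention of mapping neo4j.conf settings to environment variables (dots become underscores, underscores are doubled); the sizes are illustrative only:

```
docker run -d --name neo4j \
    -p 7474:7474 -p 7687:7687 \
    -e NEO4J_dbms_memory_pagecache_size=128M \
    -e NEO4J_dbms_memory_heap_initial__size=256M \
    -e NEO4J_dbms_memory_heap_max__size=256M \
    neo4j:3.1
```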

What is the recommended hardware for the following neo4j setup?

I need to build and analyze a complex network using neo4j and would like to know what is the recommended hardware for the following setup:
There are three types of nodes.
There are three types of relationships.
At steady state, the network will contain about 1M nodes of each type and about the same number of edges.
Every day, about 500K relationships are updated and 100K nodes and edges are added; approximately the same number of nodes/edges are also removed.
Network updates will be done in daily batches, and we can tolerate update times of 1-2 hours.
Once the system is up, we will query the database for shortest paths between different nodes - no more than 500K times per day. We can live with batch queries (see the sketch after this list).
Most probably, I'll use the REST API.
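For reference, a hedged sketch of the kind of query this implies, which could be sent through the transactional HTTP endpoint behind the REST API — the labels, property names, and hop limit are hypothetical, and parameter syntax varies by version ($x in 3.x, {x} earlier):

```
// shortest path between two nodes, capped at 15 hops
MATCH (a:TypeA {id: $sourceId}), (b:TypeB {id: $targetId})
MATCH p = shortestPath((a)-[*..15]-(b))
RETURN p;
```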
I think you should take a look at Neo4j Hardware requirements.
For the server you're talking about, I think the first thing needed will obviously be plenty of network bandwidth; if your requests have to complete quickly, you'll need it.
Apart from that, a "normal" server should be enough:
8 or more cores
At least 24 GB of RAM
At least 1 TB of SSD storage (this is the important and expensive part)
A good network link (1 Gbps or so)
By the way, this isn't really a programming question, so you might be better off asking Neo4j directly.
You can also use the Neo4j hardware sizing calculator for a rough estimate of the hardware needs.
