What is the recommended hardware for the following neo4j setup?

I need to build and analyze a complex network using neo4j and would like to know what is the recommended hardware for the following setup:
There are three types of nodes.
There are three types of relationships.
At the steady state, the network will contain about 1M nodes of each type and about the same number of edges.
Every day, about 500K relationships are updated and 100K nodes and edges are added. Approximately the same number of nodes/edges are also removed.
Network updates will be done in daily batches, and we can tolerate update times of 1-2 hours.
Once the system is up, we will query the database for shortest paths between different nodes, no more than 500K times per day. We can live with batch queries.
Most probably, I'll use the REST API.

I think you should take a look at Neo4j Hardware requirements.
For the server you're describing, the first thing you'll obviously need is plenty of network bandwidth, especially if your requests have to complete in a short time.
Apart from that, a "normal" server should be enough:
8 or more cores
At least 24 GB of RAM
At least 1 TB of SSD storage (this one is important and expensive)
Good network bandwidth (e.g. 1 Gbps)
By the way, this isn't a programming question, so I think you should have asked Neo4j directly.

You can use the Neo4j hardware sizing calculator for a rough estimate of the hardware needs.
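To give a rough idea of how such a machine would be used, Neo4j splits RAM between the JVM heap and the page cache. A minimal sketch of the relevant neo4j.conf settings (Neo4j 3.x property names; the values below are assumptions for the ~24 GB machine above, not an official recommendation):

    # neo4j.conf -- illustrative values, assuming ~24 GB of RAM in total
    # JVM heap: transaction state and query execution
    dbms.memory.heap.initial_size=8g
    dbms.memory.heap.max_size=8g
    # Page cache: ideally large enough to hold the store files on disk
    dbms.memory.pagecache.size=12g

Leave the remaining few gigabytes to the operating system.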

SOLR and VNodes and Tokens

Note: I have done a little reformatting and added some additional information.
Please take a look at this: Question_Answer
I want to ask: with DSE 5.0 and the upcoming changes for 5.1 and 5.2 that were mentioned at the C* Summit this year, will the same advice still apply?
Our use case is:
The platform MUST be available at all times. (Cassandra)
The data must be searchable. (SOLR / Lucene)
The platform MUST provide analytics / Data Warehousing / BI etc (Graph / Spark)
All of that is possible in a single product offering thanks to DSE! Thank you DataStax!
But our amount of data stored and our transaction count are VERY modest.
Our specification is for 100 concurrent sessions within the application - which of course doesn't even translate to 100 concurrent DB requests / operations.
For the most part our application resembles an everyday enterprise CRUD application.
While not ridiculous, AWS instances aren't exactly free.
Having a separate cluster for each workload (with enough replication for continuous availability) will be a cost issue for us.
While I understand that a proof of concept can offer some help, without a real workload and real users passing through the services and applications (in ways that only a "production" system and rogue users can really expose), the best you can do is "loaded" functional testing.
In short, we're a little stuck here from a platform perspective.
Initially, we're thinking of having:
2 data centres for geographic isolation
2 racks per DC
2 nodes per Rack
RF of 3
CL of local_quorum
If we find we're hitting performance issues, we can scale out - add an extra rack or extra nodes to the initial 2 racks.
As for vnodes or the number of tokens, we have no idea.
The documentation for DSE Search says vnodes add 30% overhead, so it sounds like you shouldn't use vnodes, but a table in the same documentation also says to use 16 or 32 tokens. How can it be both?
If we can successfully run all workloads on a single node (our requirements are genuinely minimal), do we run with vnodes (16 or 32 tokens) or a single token?
Lastly, is there another alternative?
Can you have nodes with different workloads in the same data centre, where individual nodes are sized (RAM / CPU) for a specific workload?
Assuming our 4 nodes per data centre (as a starting place only; we have no idea whether you can successfully run Search or Spark on a single node):
Node 1: Just Cassandra
Node 2: Cassandra and Search
Node 3: Cassandra and Graph
Node 4: Cassandra and Spark
If Search needs 64 GB of RAM, so be it... but the Cassandra-only node could well work with just 8 or 16.
So we can cater for CPU and memory per workload type, but still only have a single DC. (We'll have 2 for redundancy, but effectively it is a single-DC installation, mirrored.)
Thanks in advance for your help.
Vnodes add extra overhead for the scatter-gather part of the search solution; in some benchmarks that has been as high as 30%. Some customers are willing to live with that overhead and use vnodes for the benefits of dynamic scaling.
If you have or are planning a small cluster, and won't need to scale it on the fly, then I would definitely recommend sticking with single tokens. A hidden benefit of that approach is that your repairs will also be slightly faster, which helps with Search since it reads at the equivalent of CL.ONE.
It is possible to run all the features in the same DC (Search, Analytics and now Graph), but you will find that the overheads go up. You will need larger nodes with more memory and CPU resources to cope with the processing load. I'd probably start with 128 GB of RAM and go from there; if your load is really light you might get away with less. As with everything, benchmarking at the scale you intend to run is key.
As an aside, I'm not totally clear on your intentions regarding RF. You seem to imply 2 nodes and RF=3. I'm guessing it's just phrasing, but if not, it's worth noting that you want at least as many nodes as the RF for best coverage!
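For what it's worth, the token choice itself is a single line in cassandra.yaml; a sketch of the two options discussed above (the vnode count is only an example, not a DataStax recommendation):

    # cassandra.yaml -- single token per node (the recommendation above for small,
    # statically sized clusters); initial_token must then be set explicitly per node
    num_tokens: 1

    # cassandra.yaml -- vnodes, if easier dynamic scaling matters more than Search overhead
    # num_tokens: 16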

What's the real minimum RAM for running an instance of Neo4J

I want to run as many instances of Neo4J (using the Enterprise version) on a single VM as possible. What are the real minimum RAM requirements to fire up an instance?
Right now Task Manager is telling me that java.exe is taking about 70,000K (70 MB). Does that sound right?
I'm not worried about the performance, I just want to stuff as many instances as possible on a single box so people can do some low demand search of their graph.
One thing is what's recommended; the other is "it depends".
Neo4j is able to run on a Raspberry Pi, but you shouldn't expect great performance. I'm also using an AWS t2.micro for testing and it's enough.
The memory footprint is bound to fluctuate as the graph is loaded into memory for traversals and paged back to disk when memory runs low.
If I may offer up a suggestion, you could run only one database instance and have unconnected graphs for each of your users. This would very likely be far more efficient in terms of server resources.
For example, if you have, say, (:Item) nodes which make up a graph for each user, you could instead give them a per-user label such as (:Item_User1), with a unique prefix or suffix per user (a label containing a hyphen would need to be backtick-quoted in Cypher).
Then, to run a query for a particular user, you just add that unique label and search only that part of the graph.
The idea is to have a separate sub-graph for each user which is unconnected to other users' sub-graphs, instead of a separate database instance for each individual user. As long as each user's sub-graph stays unconnected from the others, there should be no security vulnerability where a user is given access to another user's data.
This way you could potentially have an almost unlimited number of users (within reason; quite possibly millions), each with their own sub-graph and with little loss in performance, instead of the handful of database instances you could spin up on a single VM, which are likely to compete for resources and choke.
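As a rough sketch of that pattern, assuming the embedded Java API (Neo4j 3.1+) and a made-up per-user label scheme (the label, property and path names are placeholders):

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Result;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.graphdb.factory.GraphDatabaseFactory;

    public class PerUserSubgraph {
        public static void main(String[] args) {
            GraphDatabaseService db =
                    new GraphDatabaseFactory().newEmbeddedDatabase(new File("data/graph.db"));
            String user = "User42"; // hypothetical user id, used as a label suffix

            try (Transaction tx = db.beginTx()) {
                // Create an item inside this user's unconnected sub-graph.
                // Labels cannot be parameterized in Cypher, hence the concatenation.
                Map<String, Object> params = new HashMap<>();
                params.put("name", "first item");
                db.execute("CREATE (:Item_" + user + " {name: $name})", params);

                // Query only this user's sub-graph.
                try (Result result = db.execute("MATCH (n:Item_" + user + ") RETURN n.name AS name")) {
                    while (result.hasNext()) {
                        System.out.println(result.next().get("name"));
                    }
                }
                tx.success();
            }
            db.shutdown();
        }
    }

In a real application the user id would of course have to be validated before being concatenated into the query string.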
For Neo4j 3.x, the documented minimum memory requirement is 2 GB.

distributed storage: why is the default number of copies 3 instead of 2?

In distributed storage, to avoid data disasters, we need multiple copies of data.
However, why is the total number of copies 3 by default instead of 2?
Two copies would need a third less storage (put differently, three copies cost 50% more than two).
What is the main reason for choosing 3 copies?
When you keep two copies of the data and they differ, which version do you choose? A third copy acts as a tie-breaker.
As to why they would differ, if one computer were down for a bit—or even if they can't talk to each other—their data would differ unless the system stops accepting writes. With three computers, though, if one is down or separated from the others, the other two can still accept data without fear of the scenario in the first paragraph. (Unless you have correlated failures, which you should still plan for.)
Update: generally you'll find that distributed systems use a quorum-based scheme for ensuring writes. In most cases it's a simple majority, meaning that at least floor(n/2)+1 of the nodes must have the value before it is durably written. After that, you are guaranteed that nothing can un-write the value, because no second, disjoint majority of floor(n/2)+1 nodes exists to overturn the decision. In a two-node system floor(n/2)+1 = 2, so if one of the nodes goes down, you cannot accept writes anymore. But in a three-node system floor(n/2)+1 is still 2, so one node can go down and the system can still accept writes.
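Just to make that arithmetic explicit, a tiny sketch of majority-quorum sizing (floor(n/2)+1) and what it means for availability:

    public class QuorumMath {
        // Majority quorum: strictly more than half the nodes must acknowledge a write.
        static int quorumSize(int totalNodes) {
            return totalNodes / 2 + 1;
        }

        // Writes can still be accepted if the surviving nodes can form a quorum.
        static boolean canAcceptWrites(int totalNodes, int downNodes) {
            return totalNodes - downNodes >= quorumSize(totalNodes);
        }

        public static void main(String[] args) {
            System.out.println(quorumSize(2));         // 2 -> both nodes are needed
            System.out.println(canAcceptWrites(2, 1)); // false: one node down blocks writes
            System.out.println(quorumSize(3));         // 2
            System.out.println(canAcceptWrites(3, 1)); // true: one node down is tolerated
        }
    }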
Really it's a question of durability vs. cost vs. latency. The more nodes you throw at your system, the less likely you are to lose data. One node is fairly ephemeral; two nodes slightly less so. Three nodes is pretty good, and many systems stop there. Systems that need higher durability use 5, 7, or 9 nodes.
I work on one of the most reliable systems on the internet, and we use 5 nodes in the quorum with up to 16 more nodes as hot backups. For us the cost is small compared to the required durability; we chose 5 quorum nodes for latency's sake, with the backups adding a little extra durability and taking some read pressure off the quorum.
Because the cost increase is not that significant compared to the improvement in redundancy.
Adding to Michael's answer in this question, three is chosen because it provides a very simple level of fault tolerance. This is called 't-fault-tolerance' in the presence of Byzantine faults, where t is 1: at most 1 of the data copies can go stale/corrupt/wrong without bringing down the system.
t is usually chosen beforehand as an SLA for the system in question, or from empirical evidence. Given a value of t, one needs 2*t+1 copies, so that a majority vote among the copies can mask the t bad ones.
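To illustrate why 2*t+1 copies are enough, here is a small, purely illustrative majority vote over the stored copies (not code from any particular storage system):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MajorityVote {
        // With 2*t+1 copies and at most t of them stale or corrupt,
        // the correct value always holds a strict majority.
        static <T> T resolve(List<T> copies) {
            Map<T, Integer> counts = new HashMap<>();
            for (T copy : copies) {
                counts.merge(copy, 1, Integer::sum);
            }
            int majority = copies.size() / 2 + 1;
            for (Map.Entry<T, Integer> entry : counts.entrySet()) {
                if (entry.getValue() >= majority) {
                    return entry.getKey();
                }
            }
            throw new IllegalStateException("no majority: more than t copies are bad");
        }

        public static void main(String[] args) {
            // t = 1, so 3 copies: a single stale copy is outvoted.
            System.out.println(resolve(Arrays.asList("v2", "v2", "v1-stale"))); // prints v2
        }
    }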

Neo4j Huge database query performance configuration

I am new to Neo4j and graph databases. That said, I have around 40,000 independent graphs loaded into a Neo4j database using batch insertion, and so far everything has gone well. My current database folder size is 180 GB; the problem is querying, which is too slow. Just counting the number of nodes takes forever. I am using a server with 1 TB of RAM and 40 cores, so I would like to load the entire database into memory and run queries against it.
I have looked into the configuration but am not sure what changes I should make to cache the entire database in memory, so please suggest which properties I should modify.
I also noticed that most of the time Neo4j uses only one or two cores. How can I increase that?
I am using the free version for a university research project, so I am unable to use the High-Performance Cache. Is there an alternative in the free version?
My Solution:
I added more graphs to my database and now my database size is 400 GB with more than a billion nodes. Following Stefan's comments, I used the Java API to access my database and moved the database to a RAM disk. It now takes about 3 hours to walk through all the nodes and collect information from each one.
The RAM disk and the Java API gave a big boost in performance.
Counting nodes in a graph is a global operation that obviously needs to touch each and every node. If the caches are not populated (or not configured for your dataset), the speed of your hard disk is the dominant factor.
To speed things up, make sure your caches are configured appropriately; see http://neo4j.com/docs/stable/configuration-caches.html.
With current versions of Neo4j, a Cypher query traverses the graph in single-threaded mode. Since most graph applications are used concurrently by multiple users, that model still saturates the available cores.
If you want to run a single query multi-threaded, you need to use the Java API.
In general, the Neo4j Community edition has some limitations when scaling beyond 4 cores (the Enterprise edition has a more performant lock manager implementation). The HPC (high-performance cache) in the Enterprise edition also significantly reduces the impact of full garbage collections.
What Neo4j version are you using?
Please share your current config (conf/* and data/graph.db/messages.log). You can use the personal edition of Neo4j Enterprise.
What kinds of use cases do you want to run?
Counting all nodes is probably not your main operation (there are ways in the Java API that make it faster).
For efficient multi-core usage, run multiple clients or write Java code that uses more cores during traversal via thread pools, as sketched below.
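A rough sketch of that, assuming the embedded Java API (Neo4j 3.x): node IDs are collected in one transaction and then processed in batches on a thread pool, each batch in its own transaction. Batch size, pool size and paths are placeholders:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.graphdb.factory.GraphDatabaseFactory;

    public class ParallelNodeScan {
        public static void main(String[] args) throws InterruptedException {
            GraphDatabaseService db =
                    new GraphDatabaseFactory().newEmbeddedDatabase(new File("data/graph.db"));

            // 1) Collect node IDs in a single transaction (for a billion nodes you would
            //    rather stream ID ranges, but this keeps the sketch short).
            List<Long> ids = new ArrayList<>();
            try (Transaction tx = db.beginTx()) {
                for (Node node : db.getAllNodes()) {
                    ids.add(node.getId());
                }
                tx.success();
            }

            // 2) Process the nodes on a thread pool; each batch opens its own transaction.
            ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
            int batchSize = 10_000; // arbitrary placeholder
            AtomicLong processed = new AtomicLong();

            for (int start = 0; start < ids.size(); start += batchSize) {
                final List<Long> batch = ids.subList(start, Math.min(start + batchSize, ids.size()));
                pool.submit(() -> {
                    try (Transaction tx = db.beginTx()) {
                        for (long id : batch) {
                            Node node = db.getNodeById(id);
                            // ... collect whatever per-node information you need here ...
                            processed.incrementAndGet();
                        }
                        tx.success();
                    }
                });
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
            System.out.println("Processed " + processed.get() + " nodes");
            db.shutdown();
        }
    }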

Is this data sharing problem an NP problem?

Here is my problem:
There are n peers in the P2P network, all requesting the same data block, with some constraints:
1. Each peer has its own upload bandwidth, and the average bandwidth equals the size of the data block.
2. Each peer has a different deadline for this data block. If a peer doesn't get the entire block before its deadline, it has to ask the server for help.
3. A peer can transfer data (partial or entire) only once it has the entire data block.
The objective is to minimize the server's total upload. I can't figure out whether there is an optimal polynomial-time algorithm or whether this is an NP problem. Earliest-deadline-first or largest-bandwidth-first may fail in some situations.
Is there a known NP problem similar to this? It resembles a graph flow problem or instruction scheduling, but I find it difficult because I have to deal with the deadlines and the growth of the suppliers' total bandwidth at the same time.
I hope I can get some directions or resources towards a solution :)
Thanks.
Considering that each peer acts individually in your case, it is not as if only one automaton is solving your issue, but many. Since fetching a data block when it is not available within a given delay is typically a polynomial problem, and since the job is accomplished by individual peers, your issue is not an NP problem for each peer locally.
On the other hand, if a server has to compute the minimal allocation of backup resources to transfer 'missing blocks', you would first have to find out the probability that a peer misses a block (average and standard deviation, for example). Assuming you know the statistical distribution of such events, you could compute the total bandwidth you would need to transfer those missing blocks with a chosen risk of failure/tolerance in the bandwidth. If you are using multiple servers to cover the need, make sure your peers contact them randomly to distribute the load.
Solving this statistical problem is not an NP issue: you can collect failure information from each peer and aggregate it on a central/server peer. Therefore, my conclusion is that your issue is not an NP problem.
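A back-of-the-envelope sketch of that sizing, assuming you have measured the mean and standard deviation of daily block misses (all numbers below are placeholders):

    public class ServerBandwidthEstimate {
        public static void main(String[] args) {
            // Hypothetical measured statistics for blocks missed per day.
            double meanMissesPerDay = 2_000;
            double stdDevMisses = 300;
            double blockSizeMB = 4;    // placeholder block size
            double z = 2.33;           // ~99th percentile of a normal distribution

            // Provision for mean + z * stddev misses, spread over a day.
            double missesToCover = meanMissesPerDay + z * stdDevMisses;
            double mbPerDay = missesToCover * blockSizeMB;
            double mbitPerSecond = mbPerDay * 8 / (24 * 3600);

            System.out.printf("Provision roughly %.2f Mbit/s of server upload%n", mbitPerSecond);
        }
    }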
PART II:
Oh, I understand your case better now: multiple 'server' peers can potentially help one peer get a full block. In that case, the number of serving peers for a given block grows exponentially in your system, and this optimization problem has all the characteristics of a flooding problem to me, and those are NP.
Even if your graph of peers and the potential connections between them were static (which is never the case in a real P2P network), computing the optimal solution in a reasonable amount of time for more than 50 or 100 nodes is virtually impossible, unless you can make very specific assumptions about this graph (which is almost never the case in general and not always useful).
But do you absolutely need the optimal solution, or is something near-optimal good enough?
Heuristically, if your peers have more or less the same download bandwidth, then it generally makes sense to serve the peers with the highest UPLOAD bandwidth first, to maximize the avalanche effect and reduce the risk of a peer having to ask the server for help.
If your graph is relatively balanced (that is, most peers can connect to most peers), then I bet the minimum bandwidth of the initial servers will be a logarithmic function of the number of nodes in your graph times the average speed at which peers expect to be served. This is only my gut feeling and should be validated with real measurements or strong modeling of your case.
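If near-optimal is good enough, that heuristic boils down to a simple greedy simulation; the class names and the one-block-per-bandwidth-unit-per-round model below are simplifications for illustration only:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class GreedySeedingSketch {
        // A peer whose upload capacity is expressed in whole blocks per round.
        static class Peer {
            final int uploadCapacity;
            boolean hasBlock = false;
            Peer(int uploadCapacity) { this.uploadCapacity = uploadCapacity; }
        }

        public static void main(String[] args) {
            List<Peer> peers = new ArrayList<>();
            peers.add(new Peer(4));
            peers.add(new Peer(2));
            peers.add(new Peer(1));
            peers.add(new Peer(1));
            peers.add(new Peer(1));

            // Heuristic: seed the highest-upload peers first to maximise the avalanche effect.
            peers.sort(Comparator.comparingInt((Peer p) -> p.uploadCapacity).reversed());
            peers.get(0).hasBlock = true; // this one upload is paid by the server

            int rounds = 0;
            while (peers.stream().anyMatch(p -> !p.hasBlock)) {
                rounds++;
                // Capacity of the peers that already hold the full block at the start of the round.
                int capacity = peers.stream().filter(p -> p.hasBlock).mapToInt(p -> p.uploadCapacity).sum();
                for (Peer p : peers) {
                    if (!p.hasBlock && capacity > 0) {
                        p.hasBlock = true; // served peer-to-peer, not by the server
                        capacity--;
                    }
                }
            }
            System.out.println("All peers served after " + rounds + " round(s) with 1 server upload");
        }
    }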
