Neo4j : Advices for hardware sizing and config [closed] - neo4j

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 8 years ago.
Improve this question
I've been playing with a neo4j 2.0 db for a few months and I plan to install the db on a dedicated server. I've already tried several configs of neo4j (jvm, caches, ...) but I'm still not sure to have found the best one.
Therefore, it seems better to ask to the experts :)
Context
Db primitives:
Nodes = 224,114,478
Relationships = 417,681,104
Properties = 224,342,951
Db files:
nodestore.db = 3.064Gb
relationshipstore.db = 13.460Gb
propertystore.db = 8.982 Gb
propertystore.db.string = 5.998 Gb
propertystore.db.arrays = 1kb
OS server:
Windows server 2012 (64b)
Db usage:
Mostly graph traversals using cypher queries.
Perfs are not too bad on my dev laptop even if some queries have huge lags (I suspect that the main reason is swap caused by lack of RAM)
Graph specificities:
I suspect that some nodes may be huge hubs (till 1M relationships) but it should remain exceptional.
What would be your advices regarding:
hardware sizing,
neo4j configuration:
heap size,
use memory map buffer (is there any reason to keep the value to false with windows ?)
cache type,
recommended jvm settings for windows,
...
Thanks in advance !
laurent

The on-disk size of your graph is in total ~32 GB. Neo4j has a two layered cache architecture. The first layer is the file buffer cache. Ideally it should have the same size as the on disk graph, so ~32GB in your case.
IMPORTANT When running Neo4j on Windows, the file buffer cache is part of Java heap (due to the suckiness of Windows by itself). On Linux/Mac it's off heap. That is reason why I generally do not recommend production environments for Neo4j on Windows.
cache_type should be hpc when using enterprise edition and soft for community.
To have some reasonable amount for the second cache layer (object cache) I'd suggest to have a machine with at least 64GB RAM. Since file buffer and object cache are both on heap, make heap large and consider using G1: -Xmx60G -XX:+UseG1GC. Observe GC behaviour by uncommenting gc logging in neo4j-wrapper.conf and tweak the settings step by step.
Please note that Neo4j 2.2 might come with a different file buffer cache implementation that works off heap on Windows as well.

Please take a look at these two guides for performance tuning and hardware sizing calculations:
http://neo4j.com/developer/guide-performance-tuning/
http://neo4j.com/developer/guide-sizing-and-hardware-calculator/

Related

Does Kafka really need SSD disk? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 2 years ago.
Improve this question
We are little confused about the disks types that kafka machine needs.
In our Kafka cluster in production, we have producers, 3 kafka brokers, and consumers.
When producer push data to topics and consumer read data from topics,
how to avoid the situation that consumer try to read data from topic partitions but the data not really inside the topic?
Second - since we are not use SSD disks in Kafka brokers, how to know when consumer read the data from memory cache or from the disks?
how to avoid the situation that consumer try to read data from topic
partitions but the data not really inside the topic ?
Kafka reads data sequentially so there is no random access. That's why you cannot read a specific data. (you can just specify offset to read from)
Also, because there is no random access, using SSD has no significant effect on performance.
From cloudera blog (link):
Using SSDs instead of spinning disks has not been shown to provide a
significant performance improvement for Kafka, for two main reasons:
Kafka writes to disk are asynchronous. That is, other than at startup/shutdown, no Kafka operation waits for a disk sync to
complete; disk syncs are always in the background. That’s why
replicating to at least three replicas is critical—because a single
replica will lose the data that has not been sync’d to disk, if it
crashes.
Each Kafka Partition is stored as a sequential write ahead log. Thus, disk reads and writes in Kafka are sequential, with very few
random seeks. Sequential reads and writes are heavily optimized by
modern operating systems.
SSD will help when consumers are slower than producers, which is quite possible.
When consumers are slow , file system cache misses ,
then random access happens , spinning disk will result in worst case scenario.

Kubernetes & Docker Efficiency [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 2 years ago.
Improve this question
I've been looking for information on how efficiently Kubernetes & Docker are in terms of using machine resources, but I haven't found much so far. Here are my three questions, all about Kubernetes+Docker:
If multiple containers on the same node are running the same binary, are the code pages shared between all these instances? That is, is there a single set of physical pages allocated on the node for all these processes? For example, if I'm running a service mesh like Istio, which runs Envoy in every pod, is the system smart enough to only load the Envoy code in memory once, or does all the indirection taking place prevent the Linux kernel from recognizing that sharing is possible?
In a large Kubernetes deployment, there will end up being a considerable number of redundantly downloaded docker images on each node. Instead, it would seem more effective to have a single in-cluster repository for these images that all nodes can fetch from. I saw this about having docker use NFS for a common image store. Is this the only answer?
I heard there's a practical limit to the number of pods Kubernetes will schedule on a single node (30). Such a small limit forces you to use smaller VMs in order to be able to fully saturate them. Anybody know why this limit exists and whether it will eventually be raised? I ask this in the context of trying to run Kubernetes on bare metal where VMs aren't used at all. In such a world, I'd want to be able to pack way more than 30 pods on a (large) physical machine.
Thank you for any insights or pointers.
You state your question in the way that you plan to use docker as container runtime for kubernetes. That is fine - but there are more choices. Depending on the runtime the answers will change.
In general kubernetes provides an abstraction over the actual scheduling and running of pods/containers. Perhaps you invest too much human time into details that can be solved with more metal, which is cheap.
Multiple containers on a single node are usually (docker/containerd/crio) just system processes. Like you launch your Apache httpd multiple times yourself. If the kernel uses memory deduplication, it can indeed share pages.
If you use a container runtime that launches micro-VMs (firecracker,kata, ...) I doubt memory deduplication will be possible.
I would not recommend to share storage for the container images, f.e. with NFS. With some customer setups I had to diagnose issues caused by this. like deadlocks. Basically you would reduce the robustness of your cluster in order to save disk space. Just use more metal.
The usual limit is 110 Pods per node which is usually plenty. You can change this limit using --max-pods parameter to the kubelet process or configuration file for kubelet. The reason for the limit is that the management of a pod incurs effort on the kubelet and etcd/apiserver side.

Exam 70-496: Limit TFS proxy data cache to 15GB

One of the question in exam 70-496 is about limiting disk usage for cache. What would be the right answer for following question?
PercentageBasedPolicy OR FixedSizeBasedPolicy
Question is
You are the administrator of a TFS system that uses version control proxies at remote sites to reduce the burden on the WAN. The hard disk that stores the cache for a version control proxy server is upgraded to a larger size.
Management wants to ensure that more of the disk is used but not all of it.
You need to ensure that the proxy always uses a maximum of 15 GB for caching.
What should you do?
The link has answer and according to me it should be fixed but various forums says it would be percentage based. Please advise.
Link for Microsoft definition
http://msdn.microsoft.com/en-us/library/ms400763(v=vs.100).aspxv
Fixed the hint is in the "always" and "maximum".
You could probably do it with a percentage one if you know the total disk size, but it's kind of a stupid way of doing it. Unless Fixed: 15000 is no part of the possible answers.
The size is specified in megabytes.
See also:
https://msdn.microsoft.com/en-us/library/ms400763(v=vs.80).aspx

What is the best distributed Erlang in-memory cache? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
Improve this question
I need some suggestion for the erlang in-memory cache system.
The cache item is key-value based storage.
key is usually an ASCII string; value is erlang's types include number / list / tuple / etc.
The cache item can be set by any of the node.
The cache item can be get by any of the node.
The cache item is shared cross all nodes even on different servers
dirty-read is permitted, I don't want any lock or transaction to reduce the performance.
Totally distributed, no centralized machine or service.
Good performance
Easy install and deployment and configuration and maintenance
First choice seems to me is mnesia, but I have no experence on it.
Does it meet my requirement?
How the performance can I expect?
Another option is memcached --
But I am afraid the performance is lower than mnesia because extra serialization/deserialization are performed as memcached daemon is from another OS process.
Yes. Mnesia meets your requirements. However, like you said, a tool is good when the one using it understands it in depth. We have used mnesia on a distributed authentication system and we have not experienced any problem thus far. When mnesia is used as a cache it is better off than memcached, for one reason "Memcached cannot guarantee that what you write, you can read at any time, due to memory swap out issues and stuff" (follow here). However, this means that your distributed system is going to be built over Erlang. Indeed mnesia in your case beats most NoSQL cache solutions because their systems are Eventually consistent. Mnesia is consistent, as long as network availability can be ensured across the cluster. For a distributed cache system, you dont want a situation where you read different values for the same key from different nodes, hence mnesia's consistency comes in handy here. Something you should think about, is that, it is possible to have a centralised Memory cache for a distributed system. This works like this: You have RABBITMQ server running and accessible by AMQP clients on each Cluster node. Systems interact over the AMQP interface. Because, the cache is centralised, consistency is ensured by the process/system responsible for writing and reading from the cache. The other systems just place a request for a key, onto the AMQP message bus, and the system responsible for cache receives this message and replies it with the value.
We have used the Message bus Architecture using RABBITMQ for a recent system which involved integration with banking systems, an ERP system and Public online service. What we built was responsible for fusing all these together and we are glad that we used RABBITMQ. The details are many but what we did is to come up with a message format, and a system identification mechanism. All systems must have a RABBITMQ client for writing and reading from the message bus. Then you would create a read Queue for each system, so that other system write their requests into that queue, whose name inside RABBITMQ, is the same as the system owning it. Then, later, you must encrypt the messages passing over the bus. In the end, you have systems bound together over large distance/across states, but with an efficient network, you wont believe how fast RABBITMQ binds these systems. Anyhow, RABBITMQ can also be clustered, and i should tell you that it is Mnesia which powers RABBITMQ (that tells you how good mnesia can be).
Another thing is that, you should do some reading and write many programs until you are comfortable with it.

What size is considered 'big' for the tbl_version table in TFS_Main database

We are currently experiencing significant waits in TFS database, and are trying to understand if these are as a consequence of the size of the tbl_Version version history table in the database.
Currently this table contains just over 20 million records, and is taking up approximately 6GB of storage space (total index space is just over 10GB). Looking at the queries that SQL Server is having to deal with, we have high PAGEIOLATCH_SH waits whenever this table is accessed. Obviously we don't have control over the queries being thrown at the database (all part of TFS).
Currently we have TFS on a Virtual Machine, and in essence want to get to understand whether we should (a) move to a physical machine, (b) attempt to reduce size of tbl_version or (c) follow a combination of these.
In our organisation it will be non-trivial to move to a physical server, so I'd like to get a feel for whether our table sizes are 'normal' or not before making any such decision.
PageLatch_SH typically indicates a wait for a page to be loaded from disk to memory. From the sounds of it tbl_Version is not being kept around in memory. There are 2 things you can do to improve the situation:
a. Get more RAM (not sure how much RAM you have on the server).
b. Get a faster disk subsystem.
In TFS 2010 we enable page compression if you have Enterprise Edition of SQL. This should help with the problem.
Based on some 2007 stats from Microsoft: http://blogs.msdn.com/b/bharry/archive/2007/03/13/march-devdiv-dogfood-statistics.aspx probably not the biggest.
But MS (as documented on that blog) had done some DB tuning, this I believe is in TFS 2010, but for earlier versions you'll probably need to talk to MS direct.
Caveat: We're using TFS 2008.
We're currently sitting with about 9GB of data (18GB index) with 31M rows. This is after about a year and a half of usage in an IS shop with 50-60 active developers.
Part of our problem, which we still need to address, is large binaries stored in the version control system. The answer to my question here may provide some information as to whether or not there are a few major offenders that are causing the size of that table to be bigger than you want.

Resources