Neo4j enterprise cluster master not fully utilizing CPU - neo4j

We have a Neo4j v3.0.4 Enterprise cluster running on a machine in AWS with 16 cores and when we issue a lot of requests to it, it seems to only utilize at most ~40% of the CPU (when looking on the box using htop it only seems to utilize 6 cores?). Disk + network IO on said box all look negligible during the test.
Screen cap of CPU profile - the flat part is when we hit it with load:
Requests are routed to the DB via a cluster of Spring Boot apps using Spring Data Neo4j 4 and from our investigations it does not look like those servers are forming any bottlenecks, from a memory, CPU, and network IO POV.
We currently are NOT using Bolt, nor are we using causal clustering; however we are planning on moving towards both. In the interim though, is there anything that may cause this type of behavior? Could our DB be misconfigured? Could this be a JVM level problem?
Any advice is much appreciated - thanks!

Related

Neo4J load distribution

We have neo4j configured in casual cluster on Kubernetnes based setup. All components are deployed on individual machine with size: t2.xlarge on aws. And we use pod affinity to schedule a deployment. While working with application under stress, we observed that there is considerable system load only one machine. For example see this:
First neo4j machine for core:
and second machine for core in same cluster:
We have bolt+router protocol configured in backend. I am not sure what causing this much resource utilization on one machine whereas others work in minimum.
I checked memory consumption on pod level as well. Neo4j-1 takes 9gb of memory whereas others are taking around 4gb. So my questions is, is this expected behavior?
I am posting my findings as answer. Not sure if its correct though. I checked status of each instance in neo4j using monitoring api's:
http://localhost:7474/db/manage/server/core/writable
This will give leader as True and False for followers. This could be the reason leader takes high resources while others are not taking much. Ref link:https://neo4j.com/docs/operations-manual/current/monitoring/causal-cluster/

AWS server became slow after traffic increase

I have a single page Angular app that makes request to a Rails API service. Both are running on a t2xlarge Ubuntu instance. I am using a Postgres database.
We had increase in traffic, and my Rails API became slow. Sometimes, I get an error saying Passenger queue full for rails application.
Auto scaling on the server is working; three more instances are created. But I cannot trace this issue. I need root access to upgrade, which I do not have. Please help me with this.
As you mentioned that you are using T2.2xlarge instance type. Firstly I want to tell you should not use T2 instance type for production environment. Cause of T2 instance uses CPU Credit. Lets take a look on this
What happens if I use all of my credits?
If your instance uses all of its CPU credit balance, performance
remains at the baseline performance level. If your instance is running
low on credits, your instance’s CPU credit consumption (and therefore
CPU performance) is gradually lowered to the base performance level
over a 15-minute interval, so you will not experience a sharp
performance drop-off when your CPU credits are depleted. If your
instance consistently uses all of its CPU credit balance, we recommend
a larger T2 size or a fixed performance instance type such as M3 or
C3.
Im not sure you won't face to the out of CPU Credit problem because you are using Xlarge type but I think you should use other fixed performance instance types. So instance's performace maybe one part of your problem. You should use cloudwatch to monitor on 2 metrics: CPUCreditUsage and CPUCreditBalance to make sure the problem.
Secondly, how about your ASG? After scale-out, did your service become stable? If so, I think you do not care about this problem any more because ASG did what it's reponsibility.
Please check the following
If you are opening a connection to Database, make sure you close it.
If you are using jquery, bootstrap, datatables, or other css libraries, use the CDN links like
<link rel="stylesheet" ref="https://cdnjs.cloudflare.com/ajax/libs/bootstrap-select/1.12.4/css/bootstrap-select.min.css">
it will reduce a great amount of load on your server. do not copy the jquery or other external libraries on your own server when you can directly fetch it from other servers.
There are a number of factors that can cause an EC2 instance (or any system) to appear to run slowly.
CPU Usage. The higher the CPU usage the longer to process new threads and processes.
Free Memory. Your system needs free memory to process threads, create new processes, etc. How much free memory do you have?
Free Disk Space. Operating systems tend to thrash when the file systems on system drives run low on free disk space. How much free disk space do you have?
Network Bandwidth. What is the average bytes in / out for your
instance?
Database. Monitor connections, free memory, disk bandwidth, etc.
Amazon has CloudWatch which can provide you with monitoring for everything except for free disk space (you can add an agent to your instance for this metric). This will also help you quickly see what is happening with your instances.
Monitor your EC2 instances and your database.
You mention T2 instances. These are burstable CPUs which means that if you have consistenly higher CPU usage, then you will want to switch to fixed performance EC2 instances. CloudWatch should help you figure out what you need (CPU or Memory or Disk or Network performance).
This is totally independent of AWS Server. Looks like your software needs more juice (RAM, StorageIO, Network) and it is not sufficient with one machine. You need to evaluate the metric using cloudwatch and adjust software needs based on what is required for the software.
It could be memory leaks or processing leaks that may lead to this as well. You need to create clusters or server farm to handle the load.
Hope it helps.

Ruby on Rails server requirements

I use rails for small applications, but I'm not at all an expert. I'm hosting them on a Digital Ocean server with 512MB ram, which seems to be insufficient.
I was wondering what are Ruby on Rails server requirements (in terms of RAM) for a single app.
Besides I can I measure if my server is able to support the number of application on my server?
Many thanks
It depends on how much traffic you think you need to handle. We have two machines (a 32GB RAM, usage see below) with 32 unicorn workers two serve one app with loads of traffic and we have one machine with loads of 2 worker apps that have very few traffic.
We also have to consider the database (which needs the most RAM by far in our case due to big caches we granted it). And on top of that all we have *nix which caches the filesystem in unused RAM.
Conclusion: It is very hard to tell without you telling us what sort of traffic you expect.
Our memory usage on one of the two servers for the big app: https://gist.github.com/2called-chaos/bc2710744374f6e4a8e9b2d8c45b91cf
The output is from a little ruby script I made called unistat: https://gist.github.com/2called-chaos/50fc5412b34aea335fe9

Portable Development Environment

I am working on a huge Grails Project, and more and more people are going to be working on it.
The problem is that I have been spending ages setting up people's machines (java etc etc).
I heard that there is something like a VM that you can set up on your machine once and transfer on other people's computers.
Also what about performance issues? Lets say I install a VM on other people machine that will slow down things no?
Yes it will be very bad in performance as VM runs as a full fledged machine with OS on top of it. If you have all systems with very high configuration like quad-core i7's and minimum of 8Gb rams then only systems will be able to handle such load.
Still it will be a bad idea to do so. And while running grails you already have multiple instances of java running like in my case with netbeans I have 2 resource hungry java instances running with 700mb and 500mb respectively.

Why is membase server so slow in response time?

I have a problem that membase is being very slow on my environment.
I am running several production servers (Passenger) on rails 2.3.10 ruby 1.8.7.
Those servers communicate with 2 membase machines in a cluster.
the membase machines each have 64G of memory and a100g EBS attached to them, 1G swap.
My problem is that membase is being VERY slow in response time and is actually the slowest part right now in all of the application lifecycle.
my question is: Why?
the rails gem I am using is memcache-northscale.
the membase server is 1.7.1 (latest).
The server is doing between 2K-7K ops per second (for the cluster)
The response time from membase (based on NewRelic) is 250ms in average which is HUGE and unreasonable.
Does anybody know why is this happening?
What can I do inorder to improve this time?
It's hard to immediately say with the data at hand, but I think I have a few things you may wish to dig into to narrow down where the issue may be.
First of all, do your stats with membase show a significant number of background fetches? This is in the Web UI statistics for "disk reads per second". If so, that's the likely culprit for the higher latencies.
You can read more about the statistics and sizing in the manual, particularly the sections on statistics and cluster design considerations.
Second, you're reporting 250ms on average. Is this a sliding average, or overall? Do you have something like max 90th or max 99th latencies? Some outlying disk fetches can give you a large average, when most requests (for example, those from RAM that don't need disk fetches) are actually quite speedy.
Are your systems spread throughout availability zones? What kind of instances are you using? Are the clients and servers in the same Amazon AWS region? I suspect the answer may be "yes" to the first, which means about 1.5ms overhead when using xlarge instances from recent measurements. This can matter if you're doing a lot of fetches synchronously and in serial in a given method.
I expect it's all in one region, but it's worth double checking since those latencies sound like WAN latencies.
Finally, there is an updated Ruby gem, backwards compatible with Fauna. Couchbase, Inc. has been working to add back to Fauna upstream. If possible, you may want to try the gem referenced here:
http://www.couchbase.org/code/couchbase/ruby/2.0.0
You will also want to look at running Moxi on the client-side. By accessing Membase, you need to go through a proxy (called Moxi). By default, it's installed on the server which means you might make a request to one of the servers that doesn't actually have the key. Moxi will go get it...but then you're doubling the network traffic.
Installing Moxi on the client-side will eliminate this extra network traffic: http://www.couchbase.org/wiki/display/membase/Moxi
Perry

Resources