Reliability with Neo4j: is there a way to configure Neo4j to not crash?

I am evaluating Neo4j for production use. While I was experimenting with it, it crashed with OutOfMemory exceptions many times because of unoptimized queries.
I understand that perhaps there's not enough memory. In other databases I've used the server doesn't just crash, but instead slows down or terminates a specific query/transaction. Neo4j on the other hand simply shuts down.
Is there a way to configure neo4j to not crash?
Here's a response I received from the neo4j team:
You can configure several options, which we have also enabled in Sandbox, for example:
Transaction and global memory limits. The global memory limit should be around 70% of the configured heap, and if you know what the concurrency of your statements is, you can also configure the per-statement memory limit, provided you don't have outlier queries.
Transaction timeouts. Limit queries to at most X seconds of runtime and abort them after that.
Those settings should be the default for new installations but unfortunately aren't yet.

You will want to control and limit transaction memory usage; see https://neo4j.com/docs/operations-manual/current/performance/memory-configuration/#memory-configuration-limit-transaction-memory
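To make the quoted advice concrete (a global limit of around 70% of the heap, a per-transaction limit derived from the expected concurrency, and a transaction timeout), here is a minimal Python sketch that turns those rules of thumb into numbers. It is not an official tool. The function name is hypothetical, splitting the global limit evenly across concurrent transactions is my own assumption, and the setting names are the Neo4j 5 names as I understand them (Neo4j 4.x uses different names, such as dbms.memory.transaction.global_max_size), so check them against the manual page linked above for your version.

```python
def transaction_memory_settings(heap_mib: int,
                                concurrent_transactions: int,
                                timeout_seconds: int) -> dict:
    """Derive candidate neo4j.conf values from the rules of thumb above.

    Assumptions (not from the Neo4j docs): the global limit is split
    evenly across the expected number of concurrent transactions, and
    the keys are the Neo4j 5 setting names.
    """
    if heap_mib <= 0 or concurrent_transactions <= 0 or timeout_seconds <= 0:
        raise ValueError("all arguments must be positive")

    # Global transaction memory limit: about 70% of the configured heap.
    global_limit_mib = heap_mib * 70 // 100

    # Per-transaction limit: an even share of the global limit (assumption).
    per_transaction_mib = max(1, global_limit_mib // concurrent_transactions)

    return {
        "dbms.memory.transaction.total.max": f"{global_limit_mib}m",
        "db.memory.transaction.max": f"{per_transaction_mib}m",
        "db.transaction.timeout": f"{timeout_seconds}s",
    }


if __name__ == "__main__":
    # Hypothetical server: 4 GiB heap, about 10 concurrent queries, 30 s timeout.
    for key, value in transaction_memory_settings(4096, 10, 30).items():
        print(f"{key}={value}")
```

An even split is conservative. If a few queries legitimately need much more memory than the rest, a larger per-transaction limit with the global limit as the backstop may fit better.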

Related

Neo4J taking out long-lived locks in non-query transaction

In our application we occasionally add around 10,000 nodes and 100,000 relationships to a Neo4J graph over the course of a few minutes, and then DETACH DELETE many of them a few minutes later. Previously the delete query was very quick (<100ms), but after a small change to our data model and some of our other queries (which are not running at the time), it now often blocks for minutes before completing.
While this blocking is happening there are no other queries running, and I have an export from Halin showing all the transactions that are happening at the time. It's difficult to reproduce here, but in summary there are exactly two transactions going on, one of which is my delete query. The delete query is stated to be blocked by the other one, which has 7 locks out, is in the Running state, and has no attached query or client at all. I imagine this means that it's an internal Neo4J process. It has 0 cpu time, and its entire 180s runtime is accounted for by idle time. There's no other information given.
What could be causing this transaction to lock the nodes that I want to delete for such a long time with no queries running?
What I've tried:
Using apoc.periodic.iterate and apoc.periodic.commit to split the query into smaller chunks - the inner queries end up locked
Looking in the query logs - difficult to be sure but I can't see any evidence of the internal transaction
Looking in the debug logs - records of garbage collections (always around 300ms) and some graph algorithms running, but never while this query is blocked, and nothing else relevant
Other info:
Neo4J version: 3.5.18-enterprise (docker)
Cluster mode: HA cluster with 2 nodes (also reproduced with only 1 node)
It turned out that a query had been started a few minutes earlier and the client had then disconnected (a missing await in C#). I still don't quite understand why this caused the observations, but my guess is that Neo4j put the query into a weird state after the client disconnected, and that some part of it then ended up waiting for the transaction timeout before releasing its locks.

neo4j high cpu and open transactions

Is there a way to check why the server (dedicated to Neo4j) has high CPU usage after running queries for a while?
Also, is the attached monitor screen OK? It shows lots of opened transactions, and the number only increases.
Opened should continue to increase. That is not how many are currently opened but rather just a total including transactions that were opened and are now finished and not running.
However, "current" shows 7, which means you still have 7 transactions running. That probably explains the high CPU usage, depending on what those transactions are doing. Is it expected that you would have 7 transactions running? If so, there is probably nothing to worry about. If not, you might want to look into why those transactions didn't finish when you expected them to. You can also configure the execution guard to limit how long each query can run before being killed.

Slow Query Performance

I am running some very large databases (500 MB and 300 MB) in my application on several different machines.
From a hardware perspective, the machines have been identically configured.
I am using SQL Server CE 4.0 as my DBMS.
The performance critical query has been indexed to improve its performance.
The problem is that on only one of the machines, I am observing egregiously slow query performance. This usually happens after a long period of inactivity (from a query perspective). After I run several (about 7-8) queries, the slow performance disappears.
The weird thing is that this initial slow query performance does not happen on the other machine.
The only difference between the two machines is the data contained inside the databases.
I suspect that the distribution of data on the slow machine is somehow reducing the effectiveness of the indexing and that SQL Server CE has to rebalance the indexing in a much more significant way than on the other faster machine.
One thing I notice is that when the query is very slow, the disk activity increases significantly and the process corresponding to reading the database shows a spike in the read bytes.
This does not happen on the other machine.
Does anyone know how I might go about root causing this issue?
My code is written in C++ and uses the ATL/OLEDB API to manipulate the database.
UPDATE: My performance profiling activities indicate that it's not the query itself that is slow - it is the processing of the returned rowset that takes a while. For each row returned, I query another database for related data. I understand that this is not the right way to do it but the performance problem only happens on one machine. One thing I noticed is that when I have other unrelated queries happening on the same database in other threads, the unrelated queries will stall the query that is exhibiting the performance problem.

Why is my PostgreSQL server cpu constrained?

My database is very CPU constrained, and I can't find the root cause of the issue. I currently have two application servers, each with a Rails API connecting to PostgreSQL via the ruby-pg gem. Both application servers also have sidekiq running background jobs, and I have a handful of support servers processing new posts from a national feed via sidekiq. If I were running out of memory, the solution would seemingly be straightforward. Any general ideas why I am CPU constrained?
Database Specs:
Rackspace 8GB Performance Tier cloud VM (8GB RAM, 8x Core CPU, SSD)
Debian 7 Wheezy Linux OS
PostgreSQL 9.1 with PostGIS extension
Possible Problems:
PostgreSQL 9.1 is bad at indexes
The database has nearly 10GB of indexes. I am going to upgrade my database to PostgreSQL version >= 9.2. In version 9.2, index only scans were introduced.
Too many connections
In postgresql.conf, I have set max_connections to 500. Usually throughout the day only 175 connections are in use, but during peak times sidekiq tasks increase the number of connections to 350. How many connections are recommended for an 8GB server instance?
Idle Connections
When I take a look at pg_stat_activity in the psql console, I see sidekiq is leaving a lot of IDLE connections. Could these connections result in CPU inflation? Does the fix exist in the api or in sidekiq?
Need a more powerful server
Maybe there is no bug and I simply need a bigger server instance. Again, this would make more sense if I were memory bound. However, both app servers and 3 of the support sidekiq servers are 4GB performance tier instances. Essentially, the servers that interact with the database have, combined, more than double the resources of the database. Should this even matter?
Additional questions:
What tools/techniques should I employ to troubleshoot the issue?
Any basic settings in the postgresql.conf related to cpu usage?
Are there any known issues related to Rails, sidekiq, or the pg gem that could be a contributing factor? (I haven't seen any open issues.)
Are there any general PostgreSQL guidelines for CPU usage?
Any other ideas or thoughts that might help my search?
You are using massively too many concurrent connections. PostgreSQL will be wasting lots of its time on housekeeping and juggling concurrent queries. All the concurrent work will be fighting for CPU and buffer space, there'll be heavy contention on spinlocks, and it'll all generally be a mess.
On an 8 core machine, you should probably not have more than 20 actively working connections if you're mostly CPU constrained. If you're I/O limited, you can go higher, but 350 is just ridiculous.
If possible, put a PgBouncer in transaction pooling mode in front of your PostgreSQL instance, so queries get queued up and executed rapidly in series instead of slowly in parallel.
See number of database connections (Pg wiki).
Additionally, PostGIS can be very CPU-heavy. It sometimes needs to do very complex calculations. I suggest using the auto_explain module to record long running queries, and using pg_stat_statements / pg_stat_plans to record what's taking up resources. Examine these queries to see if they need improvement.
Your idle in transaction sessions must be dealt with, too. Depending on why they're idle and whether they have a transaction ID or not, they might be causing serious table bloat. They're also creating unnecessary signalling overhead within PostgreSQL, as it has to do more co-ordination with backends that are actively doing things. Finally, the number of open transactions itself increases the cost of some internal housekeeping operations.
So. Your DB will probably perform better if you reduce the connection counts, put a PgBouncer in transaction pooling mode in front, and fix those idle connections.
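The idea behind transaction pooling, many clients but only a few queries actually running at once, can be illustrated in-process. The following is a minimal Python sketch of that queueing behaviour, not a PgBouncer configuration and not a replacement for one. The function name and the job callables are hypothetical stand-ins for database work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_with_connection_cap(jobs, max_active):
    """Run every job from its own client thread, but let at most
    max_active of them execute at the same time.

    Returns (results in job order, peak number of simultaneously
    active jobs). The jobs are plain callables standing in for queries.
    """
    gate = threading.BoundedSemaphore(max_active)  # the "pool" of slots
    counter_lock = threading.Lock()
    active = 0
    peak = 0

    def guarded(job):
        nonlocal active, peak
        with gate:  # wait here (queued) until a slot is free
            with counter_lock:
                active += 1
                peak = max(peak, active)
            try:
                return job()
            finally:
                with counter_lock:
                    active -= 1

    # One thread per job mimics many clients all arriving at once.
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as clients:
        results = list(clients.map(guarded, jobs))
    return results, peak


if __name__ == "__main__":
    # 350 hypothetical clients, capped at 20 concurrently active jobs.
    jobs = [lambda i=i: i * i for i in range(350)]
    results, peak = run_with_connection_cap(jobs, max_active=20)
    print(len(results), "jobs done, peak concurrency", peak)
```

The clients still all get served. They simply wait for a slot instead of competing for CPU and buffer space at the same time, which is the effect the answer expects from PgBouncer in transaction pooling mode.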
Most likely you are CPU constrained because your work needs a lot of CPU. :)
9.1 is not generally bad at indexes. There may be some specific issues, as with any version, and exactly what they are might change from version to version.
Index-only-scans are mostly a benefit when you are IO constrained. I wouldn't hold out much hope for that being a magic bullet for you.
350 connections are certainly not helpful, but probably are not very harmful, either. But when they are harmful, it can be downright catastrophic. The correct value is more determined by the number of cores, not the amount of RAM. If it is easy to throttle down the sidekiq connections, do it even if you can't prove that it helps.
If the connections are just IDLE, not IDLE in transaction, then they probably aren't very harmful, but again there are a few cases where they can be. That is pretty much the same issue as the number of connections.
The connection you showed from top was idle in transaction. That status shouldn't be taking up much CPU, so it probably means the connection is rapidly cycling through statements and top just happened to catch it between them. But you didn't say how many similar lines there were in top. If it is just that one, it suggests your code is not running concurrently and 7 of your 8 CPUs are wasted.
Regarding the db server versus the other servers, if the database is fundamentally the limit, beating on it with a bigger hammer is not going to help. Often there is some flexibility about where computation is done. If you can get the app servers to do more computation that is currently done on the db and let the db focus on ACID issues, that would be good. But no one but you can know if that is possible or feasible.
My first stop would be to use pg_stat_statements to see what SQL statements are taking the most time. Maybe just adding an index to the slowest/most frequent query would make the problem magically go away.

Unicorn CPU usage spiking during load tests, ways to optimize

I am interested in ways to optimize my Unicorn setup for my Ruby on Rails 3.1.3 app. I'm currently spawning 14 worker processes on a High-CPU Extra Large instance, since my application appears to be CPU bound during load tests. At about 20 requests per second, replaying requests in a simulated load test, all 8 cores on my instance are maxed out and the box load spikes up to 7-8. Each unicorn instance is utilizing about 56-60% CPU.
I'm curious what are ways that I can optimize this? I'd like to be able to funnel more requests per second onto an instance of this size. Memory is completely fine as is all other I/O. CPU is getting tanked during my tests.
If you are CPU bound you want to use no more unicorn processes than you have cores, otherwise you overload the system and slow down the scheduler. You can test this on a dev box using ab. You will notice that 2 unicorns will outperform 20 (number depends on cores, but the concept will hold true).
The exception to this rule is if you're IO bound. In that case, add as many unicorns as memory can hold.
A good performance trick is to route IO bound requests to a different app server hosting many unicorns, for example requests that use a slow SQL query or that wait on an external request such as a credit card transaction. If using nginx, define an upstream server for the IO bound requests and forward those URLs to a box with 40 unicorns. Forward CPU bound or really fast requests to a box with 8 unicorns (you stated you have 8 cores, but on AWS you might want to try 4-6, as their schedulers are hypervised and already very busy).
Also, I'm not sure you can count on AWS giving you reliable CPU usage, as you're getting a percentage of an obscure percentage.
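The rule of thumb above (no more workers than cores when CPU bound, as many as memory can hold when IO bound) can be written down as a small helper. This is a minimal Python sketch of that heuristic only. The function name, the memory figures, and the per-worker memory estimate are hypothetical, and the right numbers should come from measuring your own app.

```python
def suggested_worker_count(cpu_bound: bool,
                           cores: int,
                           free_memory_mb: int,
                           memory_per_worker_mb: int) -> int:
    """Apply the rule of thumb from the answer above.

    CPU bound: at most one worker per core (and never more than fit in memory).
    IO bound:  as many workers as memory can hold.
    """
    if cores < 1 or memory_per_worker_mb < 1 or free_memory_mb < 0:
        raise ValueError("cores and memory_per_worker_mb must be >= 1, "
                         "free_memory_mb must be >= 0")

    # How many workers fit in memory; always allow at least one.
    memory_cap = max(1, free_memory_mb // memory_per_worker_mb)

    if cpu_bound:
        return min(cores, memory_cap)
    return memory_cap


if __name__ == "__main__":
    # Hypothetical box: 8 cores, 7000 MB free, about 250 MB per Rails worker.
    print("CPU bound:", suggested_worker_count(True, 8, 7000, 250))
    print("IO bound: ", suggested_worker_count(False, 8, 7000, 250))
```

On a virtualized host, the answer's suggestion to try fewer workers than cores (4-6 on an 8-core box) would mean passing a smaller value for cores than the instance advertises.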
First off, you probably don't want instances at 45-60% cpu. In that case, if you get a traffic spike, all of your instances will choke.
Next, 14 Unicorn instances seems large. Unicorn does not use threading. Rather, each process runs with a single thread. Unicorn's master process will only select a thread if it is able to handle it. Because of this, the number of cores isn't a metric you should use to measure performance with Unicorn.
A more conservative setup might use 4 or so Unicorn processes per instance, each instance responding to maybe 5-8 requests per second. Then adjust the number of instances until your CPU use is around 35%. This will ensure stability under the stressful '20 requests per second' scenario.
Lastly, you can get more gritty stats and details by using God.
For a high CPU extra large instance, 20 requests per second is very low. It is likely there is an issue with the code. A unicorn-specific problem seems less likely. If you are in doubt, you could try a different app server and confirm it still happens.
In this scenario, questions I'd be thinking about...
1 - Are you doing something CPU intensive in code--maybe something that should really be in the database. For example, if you are bringing back a large recordset and looping through it in ruby/rails to sort it or do some other operation, that would explain a CPU bottleneck at this level as opposed to within the database. The recommendation in this case is to revamp the query to do more and take the burden off of rails. For example, if you are sorting the result set in your controller, rather than through sql, that would cause an issue like this.
2 - Are you doing anything unusual compared to a vanilla crud app, like accessing a shared resource, or anything where contention could be an issue?
3 - Do you have any loops that might burn CPU, especially if there was contention for a resource?
4 - Try unhooking various parts of the controller logic in question. For example, how well does it scale if you hack your code to just return a static hello world response instead? I bet unicorn will suddenly be blazingly fast. Then try adding back parts of your code until you discover the source of the slowness.
