In our application we occasionally add around 10,000 nodes and 100,000 relationships to a Neo4J graph over the course of a few minutes, and then DETACH DELETE many of them a few minutes later. Previously the delete query was very quick (<100ms), but after a small change to our data model and some of our other queries (which are not running at the time), it now often blocks for minutes before completing.
While this blocking is happening there are no other queries running, and I have an export from Halin showing all the transactions that are happening at the time. It's difficult to reproduce here, but in summary there are exactly two transactions going on, one of which is my delete query. The delete query is stated to be blocked by the other one, which has 7 locks out, is in the Running state, and has no attached query or client at all. I imagine this means that it's an internal Neo4J process. It has 0 cpu time, and its entire 180s runtime is accounted for by idle time. There's no other information given.
What could be causing this transaction to lock the nodes that I want to delete for such a long time with no queries running?
What I've tried:
Using apoc.periodic.iterate and apoc.periodic.commit to split the query into smaller chunks - the inner queries end up locked
Looking in the query logs - difficult to be sure but I can't see any evidence of the internal transaction
Looking in the debug logs - records of garbage collections (always around 300ms) and some graph algorithms running, but never while this query is blocked, and nothing else relevant
Other info:
Neo4J version: 3.5.18-enterprise (docker)
Cluster mode: HA cluster with 2 nodes (also reproduced with only 1 node)
It turned out that there was a query a few minutes before that had been set going and then the client disconnected (missing await in C#). I still don't quite understand why this caused the observations, but my guess is that Neo4j put the query into a weird state after the client disconnected, and then some part of it ended up waiting for the transaction timeout before releasing its locks.
Related
I'm using Neo4j 3.5.14 Enterprise (cypher over http/bolt). I'm seeing an issue where randomly a cypher query would be stuck never to be back again which takes out a worker thread. Eventually, if the service is not redeployed, all worker threads would be stuck and the service is no longer doing its job.
I tried using apoc.cypher.runTimeboxed but that appears to cause my queries to not return until the time limit is over (20000 ms in this case) even though in some cases it can return faster than that. I'm actually not sure that runTimeboxed would work because I believe it is actually stuck forever which might not respond to time limit anyway depending on how that's implemented.
My question is - how would you end a runaway query like that? Any tricks?
I try to find some document about it, that when some queries are running and KSQL-Server restarts. What will happened?
Does it perform similar to Kafka-Streams, so the consumer offset is not committed and at-least-once is guaranteed?
I can observe that the queries stored in the command topic, and queries are executed when ksql-server restarts
I try to find some document about it, that when some queries are running and KSQL-Server restarts. What will happened?
If you only have a single KSQL server, then stopping that server will of course stop all the queries. Once the server is running again, all queries will continue from the points they stopped processing. No data is lost.
If you have multiple KSQL servers running, then stopping one (or some) of them will cause the remaining servers to take over any query processing tasks from the stopped servers. Once the stopped servers have been restarted the query processing workload will be shared again across all servers.
Does it perform similar to Kafka-Streams, so the consumer offset is not committed and at-least-once is guaranteed?
Yes.
But (even better): Whether the processing guarantees are at-least-once or exactly-once depends solely on the KSQL server's configuration. It does of course not depend on whether or when the server is being restarted, crashes, etc.
we're on neo4j 2.1.4 soon to upgrade to 2.2.1.
We've been experiencing some slow downs with certain cypher queries and I think they are mostly centered around two to three nodes out of millions in the graph. These nodes were created with the intent on having some monitoring put in place to check the availability of the graph. I've since found out that a few apps that have been built are actually exercising these queries before actually performing their write operations on the graph. Then I found out that our load balancer was setup to actually do some tests through multiple apps that end up querying the same nodes. So we have a large mix of applications that are all either pulling or updating these same nodes. This has resulted in those two nodes taking anywhere from 8 to 40 seconds to be returned.
Is there any way to determine how many updates and how many queries are being issued against one node?
Since Neo4j 2.2 there's a config option to log queries taking longer than a given threshold, see the dbms.querylog.XXXX settings in http://neo4j.com/docs/stable/configuration-settings.html.
To get an update count for a given node you could setup a custom TransactionEventHandler that tracks write accesses to your given nodes.
I have a server on Heroku - 3 dynos, 2 processes each.
The server does 2 things:
It responds to requests from the browser (AJAX and some web pages), based on data stored in a postgresql database
It exposes a REST API to update the data in the database. This API is called by another server. The rate of calls is limited: The other server only calls my server through a queue with a single worker, which makes sure the other server doesn't issue more than one request in parallel to my server (I verified that indeed it doesn't).
When I look at new relic, I see the following graph, which suggests that even though I keep the other server at one parallel request at most, it still loads my server which creates peaks.
I'd expect that since the rate of calls from the other server is limited, my server will not get overloaded, since a request will only start when the previous request ended (I'm guessing that maybe the database gets overloaded if it gets an update request and returns but continue processing after that).
What can explain this behaviour?
Where else can I look at in order to understand what's going on?
Is there a way to avoid this behaviour?
There are whole lot of directions this investigation could go, but from your screenshot and some inferences, I have two guesses.
A long query—You'd see this graph if your other server or a browser occasionally hits a slow query. If it's just a long read query and your DB isn't hitting its limits, it should only affect the process running the query, but if the query is taking an exclusive lock, all dynos will have to wait on it. Since the spikes are so regular, first think of anything you have running on a schedule - if the cadence matches, you probably have your culprit. The next simple thing to do is run heroku pg:long-running-queries and heroku pg:seq-scans. The former shows queries that might need optimization, and the latter shows full table scans you can probably fix with a different query or a better index. You can find similar information in NewRelic's Database tab, which has time and throughput graphs you can try to match agains your queueing spikes. Finally, look at NewRelic's Transactions tab.
There are various ways to sort - slowest average response time is probably going to help, but check out all the options and see if any transactions stand out.
Click on a suspicious transaction and look at the graph on the right. If you see spikes matching your queueing buildups, that could be it, but since it looks to be affecting your whole site, watch out for several transactions seeing correlated slowdowns.
Check out the transaction traces at the bottom. Something in there taking a long time to run is as close to a smoking gun as you'll get. This should correlate with pg:long-running-queries.
Look at the breakdown table between the graph and the transaction traces. Check for things that are taking a long time (eg. a 2 second external request) or happening often (eg, a partial that gets rendered 2500 times per request). Those are places for caching or optimization.
Garbage collection—This is less likely because Ruby GCs all the time and there's no reason it would show spikes on that regular cadence, but if there's a regular request that allocates a ton of objects, both building the objects and cleaning them up will take time. It would only affect one dyno at once, and it would be correlated with a long or highly repetitive query in your NewRelic investigation. You can see some stats about this in NewRelic's Ruby VM tab.
Take a look at your dyno and DB memory usage too. Both are printed to the Heroku logs, and if you add Librato, they'll build some automatic graphs that are quite helpful. If your dyno is swapping, performance will suffer and you should either upgrade to a bigger dyno or run fewer processes per dyno. Processes will typically accumulate memory as they run and never quite release as much as you'd like, so tune it so that right before a restart, your dyno is just under its available RAM. Similarly for the DB, if you're hitting swap there, query performance will suffer and you should upgrade.
Other things it could be, but probably isn't in this case:
Sleeping dynos—Heroku puts a dyno to sleep if it hasn't served a request in a while, but only if you have just 1 dyno running. You have 3, so this isn't it.
Web Server Concurrency—If at any given moment, there are more requests than available processes, requests will be queued. The obvious fix is to increase the available dynos/processes, which will put more load on your DB and potentially move the issue there. Since some regular request is visible every time, I'm guessing request volume is low and this also isn't your problem.
Heroku Instability—Sometimes, for no obvious reason, Heroku starts queueing requests more than it should and doesn't report any issues at status.heroku.com. Restarting the dynos typically fixes that temporarily while Heroku gets their head back on straight.
I'm building a background job that's updating users' statistics for a web application. The job currently takes 55-60 seconds, and I'm concerned about what would happen if a user were to try to load his stats page at the same time that job is running.
From what I've read about PostgreSQL and concurrency, if two clients attempt to access the same row (one updating and one reading), and I'm not explicitly starting any transactions, the first one just has to wait for the second one to finish.
So if I'm understanding that correctly, the only performance hit I'm likely to incur is on the infinitesimally small chance that a user tries to load his stats page at the same moment that the row is being updated. It's not like the whole stats table is locked up during the 55-60 second job unless I were to explicitly configure Postgres to do that, right?
Is that a correct interpretation? Are there other factors I'm missing?
(I mention the Rails part just in case it has any bearing on the above scenario)
(Also: the PostgreSQL version is 9.0.4)
It depends on transaction isolation level. If I've got your case - you are talking about Dirty Read avoiding delay. And YES, Dirty Read is impossible if you are using default isolation level. Reader will wait for the writer only when it will try to get the same row that is being updated.
Read Committed is the default isolation level in PostgreSQL. When a transaction runs on this isolation level, a SELECT query sees only data committed before the query began;
specs on ISOLATION