Cypher query using HTTP/Bolt into Neo4j hangs Java thread

I'm using Neo4j 3.5.14 Enterprise (Cypher over HTTP/Bolt). I'm seeing an issue where, at random, a Cypher query gets stuck and never returns, which takes out a worker thread. Eventually, if the service is not redeployed, all worker threads end up stuck and the service stops doing its job.
I tried apoc.cypher.runTimeboxed, but that appears to keep my queries from returning until the time limit expires (20000 ms in this case), even when they could return faster. I'm also not sure runTimeboxed would help here, because I believe the query is genuinely stuck forever and may not respond to the time limit at all, depending on how that limit is implemented.
My question is - how would you end a runaway query like that? Any tricks?
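For reference, here is a minimal sketch of one way to bound a query from the client side, assuming the official Neo4j Java Driver 1.7.x (the series that pairs with 3.5): a per-transaction timeout asks the server to terminate the transaction instead of tying up a worker thread indefinitely. The URI, credentials, and Cypher below are placeholders, and if the query is truly wedged inside the server rather than merely slow, even this may not fire promptly.

import java.time.Duration;
import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Session;
import org.neo4j.driver.v1.StatementResult;
import org.neo4j.driver.v1.TransactionConfig;

public class TimeboxedQuery {
    public static void main(String[] args) {
        // Placeholder connection details -- adjust for your deployment.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"))) {
            // Ask the server to terminate this transaction after 20 seconds.
            TransactionConfig config = TransactionConfig.builder()
                    .withTimeout(Duration.ofSeconds(20))
                    .build();
            try (Session session = driver.session()) {
                // Placeholder query standing in for the one that hangs.
                StatementResult result = session.run(
                        "MATCH (n:Example) RETURN count(n) AS c", config);
                System.out.println(result.single().get("c").asLong());
            }
        }
    }
}

This uses the same server-side enforcement as dbms.transaction.timeout, so it is not precise to the millisecond, but it does not require changing the global configuration for every query.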

Related

Neo4J taking out long-lived locks in non-query transaction

In our application we occasionally add around 10,000 nodes and 100,000 relationships to a Neo4J graph over the course of a few minutes, and then DETACH DELETE many of them a few minutes later. Previously the delete query was very quick (<100ms), but after a small change to our data model and some of our other queries (which are not running at the time), it now often blocks for minutes before completing.
While this blocking is happening there are no other queries running, and I have an export from Halin showing all the transactions that are happening at the time. It's difficult to reproduce here, but in summary there are exactly two transactions going on, one of which is my delete query. The delete query is stated to be blocked by the other one, which has 7 locks out, is in the Running state, and has no attached query or client at all. I imagine this means that it's an internal Neo4J process. It has 0 cpu time, and its entire 180s runtime is accounted for by idle time. There's no other information given.
What could be causing this transaction to lock the nodes that I want to delete for such a long time with no queries running?
What I've tried:
Using apoc.periodic.iterate and apoc.periodic.commit to split the query into smaller chunks (see the sketch after this list) - the inner queries end up locked as well
Looking in the query logs - difficult to be sure but I can't see any evidence of the internal transaction
Looking in the debug logs - records of garbage collections (always around 300ms) and some graph algorithms running, but never while this query is blocked, and nothing else relevant
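For completeness, a sketch of the kind of batched delete meant above, assuming APOC is installed and issued through the Java driver (1.7.x); the :Content label, the $run parameter, and the credentials are placeholders for whatever selects the nodes to remove. As noted, in this case the inner batches were still blocked by the other transaction's locks.

import org.neo4j.driver.v1.*;

public class BatchedDelete {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"));          // placeholder credentials
             Session session = driver.session()) {
            // DETACH DELETE in batches of 1000 so each commit holds locks briefly.
            String batchedDelete =
                "CALL apoc.periodic.iterate("
              + "  'MATCH (n:Content {run: $run}) RETURN n',"  // placeholder selection query
              + "  'DETACH DELETE n',"
              + "  {batchSize: 1000, parallel: false, params: {run: $run}})"
              + " YIELD batches, total, errorMessages "
              + "RETURN batches, total, errorMessages";
            StatementResult result = session.run(batchedDelete,
                    Values.parameters("run", "2020-05-01"));    // placeholder run id
            System.out.println(result.single().asMap());
        }
    }
}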
Other info:
Neo4J version: 3.5.18-enterprise (docker)
Cluster mode: HA cluster with 2 nodes (also reproduced with only 1 node)
It turned out that a query had been started a few minutes earlier and then the client disconnected (a missing await in C#). I still don't quite understand why this caused what we observed, but my guess is that Neo4j put the query into a strange state after the client disconnected, and some part of it then waited for the transaction timeout before releasing its locks.
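If you hit something like this again, a quick way to see which transactions are holding locks is to dump dbms.listTransactions() (available on 3.4+, if I remember correctly). A rough sketch with the Java driver, printing every column rather than assuming exact field names:

import org.neo4j.driver.v1.*;

public class ListTransactions {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"));   // placeholder credentials
             Session session = driver.session()) {
            // Dump every field of every open transaction (lock counts, idle time, etc.).
            for (Record tx : session.run("CALL dbms.listTransactions()").list()) {
                System.out.println(tx.asMap());
            }
        }
    }
}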

Neo4j query execution exceeding specified timeout

I am setting a timeout for Neo4j queries in the configuration, but it does not seem to work to a meaningful degree of accuracy. Is there a way to get Neo4j to time out a query close to (i.e., within ~10% of) the specified timeout?
I have tried setting dbms.transaction.timeout, which seems to work somewhat, just not accurately.
For instance, if I set it to 2s, the query stops executing after around 8s. I have also set it to 10s and had a query stop after 30s. Not a big deal, except that on my real dataset I set timeout=3600s and had a query that was still running after 2+ hours and ultimately ran for 5 hours (18,000s).
I also saw a post about using, e.g.,
unsupported.dbms.executiontime_limit.enabled=true
unsupported.dbms.executiontime_limit.time=2s
But same issue as above.
Are there any other ways to get Neo4j to timeout a query with a little more accuracy?
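One workaround, if the server-side timeout cannot be trusted to fire promptly, is to bound the wait on the client instead. A rough sketch using a Future deadline with the Java driver (placeholder query and credentials); note that this only frees the calling thread, and the query may keep running on the server until its own timeout eventually fires.

import java.util.List;
import java.util.concurrent.*;
import org.neo4j.driver.v1.*;

public class ClientSideDeadline {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"))) {      // placeholder credentials
            Future<List<Record>> future = pool.submit(() -> {
                try (Session session = driver.session()) {
                    // Placeholder query standing in for the slow one.
                    return session.run("MATCH (n:Example) RETURN n LIMIT 100").list();
                }
            });
            try {
                List<Record> rows = future.get(2, TimeUnit.SECONDS);  // hard 2 s deadline
                System.out.println(rows.size() + " rows");
            } catch (TimeoutException e) {
                future.cancel(true);  // frees the caller; the server may still be working
                System.err.println("query exceeded the 2 s client-side deadline");
            }
        } finally {
            pool.shutdownNow();
        }
    }
}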

Neo4j query monitoring / profiling for long running queries

I have some really long-running queries. Just as background information: I am crawling my graph for all instances of a specific meta path, for example, counting all instances of a specific meta path found in the graph.
MATCH (a:Content)-[:isTaggedWith]->(t:Term)<-[:isTaggedWith]-(b:Content) RETURN count(*)
First of all, I want to measure the runtimes. Is there any way to do so, especially in the Community Edition?
Furthermore, I have the problem that I do not know whether a query is still running in Neo4j or whether it has already been terminated. I issue the query from a REST client, but I am open to other options if necessary. For example, I queried Neo4j with a REST client and set the client-side read timeout to 2 days. The problem is that I can't verify whether the query is still running or whether the client is simply waiting for a Neo4j answer that will never come because the query may already have been killed in the backend. Is there really no way to check from the browser or another client which queries are currently running, ideally with an option to terminate them as well?
Thanks in advance!
Measuring Query Performance
To answer your first question, there are two main options for measuring the performance of a query. The first is to use PROFILE; put it in front of a query (like PROFILE MATCH (a:Content)-[:isTaggedWith]->(t:Term)...), and it will execute the query and display the execution plan used, including the native API calls, the number of results from each operation, the total number of database hits, and the total execution time.
The downside is that PROFILE will execute the query, so if it is an operation that writes to the database, the changes are persisted. To profile a query without actually executing it, EXPLAIN can be used instead of PROFILE. This will show the query plan and native operations that will be used to execute the query, as well as the estimated total database hits, but it will not actually run the query, so it is only an estimate.
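As a complement that also works in Community Edition, the Java driver's result summary exposes server-side timings. A small sketch (driver 1.7.x, placeholder credentials) timing the example query both by wall clock and via the summary:

import java.util.concurrent.TimeUnit;
import org.neo4j.driver.v1.*;
import org.neo4j.driver.v1.summary.ResultSummary;

public class TimeQuery {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"));   // placeholder credentials
             Session session = driver.session()) {
            long start = System.nanoTime();
            StatementResult result = session.run(
                "MATCH (a:Content)-[:isTaggedWith]->(t:Term)<-[:isTaggedWith]-(b:Content) "
              + "RETURN count(*) AS instances");
            ResultSummary summary = result.consume();   // exhausts the stream
            long wallMillis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("wall clock:      " + wallMillis + " ms");
            System.out.println("available after: "
                + summary.resultAvailableAfter(TimeUnit.MILLISECONDS) + " ms");
            System.out.println("consumed after:  "
                + summary.resultConsumedAfter(TimeUnit.MILLISECONDS) + " ms");
        }
    }
}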
Checking Long Running Queries (Enterprise only)
Checking for running queries can be accomplished using Cypher in Enterprise Edition: CALL dbms.listQueries();. You must be logged in as an admin user to run it. If you want to stop a long-running query, use CALL dbms.killQuery() and pass in the ID of the query you wish to terminate.
Note that apart from manually killing a query or it hitting the configured query timeout, queries should in general not be getting killed on the backend unless you have something else set up to kill long-runners; with the method above, however, you can double-check your assumption that the queries are indeed still executing after being sent.
These are available only in Enterprise Edition; there is no way that I am aware of to use these functions or replicate their behavior in Community.
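If you are on Enterprise anyway, the same procedures can be scripted. A rough sketch (Java driver 1.7.x; assumes the 3.x-era columns queryId, query, and elapsedTimeMillis; the 60-second threshold and credentials are arbitrary placeholders) that lists queries and kills any that has run too long:

import org.neo4j.driver.v1.*;

public class KillLongRunners {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"));   // must be an admin user
             Session session = driver.session()) {
            StatementResult queries = session.run(
                "CALL dbms.listQueries() YIELD queryId, query, elapsedTimeMillis "
              + "RETURN queryId, query, elapsedTimeMillis");
            for (Record r : queries.list()) {
                if (r.get("elapsedTimeMillis").asLong() > 60_000) {   // arbitrary threshold
                    System.out.println("killing " + r.get("queryId").asString()
                        + ": " + r.get("query").asString());
                    session.run("CALL dbms.killQuery($id)",
                        Values.parameters("id", r.get("queryId").asString()));
                }
            }
        }
    }
}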
For measuring long-running queries I figured out the following approach:
Use a tmux terminal session (see any tmux crash course), which is really easy. That way you can execute your query, close the terminal, and get the session back later.
New session: tmux new -s *sessionName*
Detach from current session (within session): tmux detach
List sessions: tmux ls
Re-attach to session: tmux a -t *sessionName*
Within the tmux session, execute the query via cypher-shell, either directly in the shell or by piping the command into the shell. The latter approach is preferable because you can use the Unix command time to actually measure the runtime, as follows:
time cat query.cypher | cypher-shell -u neo4j -p n > result.txt
The file query.cypher simply contains the regular query, including the terminating semicolon at the end. The result of the query will be piped into result.txt, and the runtime of the execution will be displayed in the terminal.
Moreover, as correctly stated by @rebecca, listing the running queries is only possible in the Enterprise Edition.

My server gets overloaded even though I keep a limit on the requests I send it

I have a server on Heroku - 3 dynos, 2 processes each.
The server does 2 things:
It responds to requests from the browser (AJAX and some web pages), based on data stored in a postgresql database
It exposes a REST API to update the data in the database. This API is called by another server. The rate of calls is limited: The other server only calls my server through a queue with a single worker, which makes sure the other server doesn't issue more than one request in parallel to my server (I verified that indeed it doesn't).
When I look at New Relic, I see the following graph, which suggests that even though I limit the other server to at most one request in parallel, its calls still load my server and create peaks.
I'd expect that since the rate of calls from the other server is limited, my server would not get overloaded, since a request only starts when the previous request has ended (I'm guessing that maybe the database gets overloaded if it receives an update request, returns, but continues processing after that).
What can explain this behaviour?
Where else can I look at in order to understand what's going on?
Is there a way to avoid this behaviour?
There are a whole lot of directions this investigation could go, but from your screenshot and some inferences, I have two guesses.
A long query—You'd see this graph if your other server or a browser occasionally hits a slow query. If it's just a long read query and your DB isn't hitting its limits, it should only affect the process running the query, but if the query is taking an exclusive lock, all dynos will have to wait on it. Since the spikes are so regular, first think of anything you have running on a schedule - if the cadence matches, you probably have your culprit. The next simple thing to do is run heroku pg:long-running-queries and heroku pg:seq-scans. The former shows queries that might need optimization, and the latter shows full table scans you can probably fix with a different query or a better index. You can find similar information in NewRelic's Database tab, which has time and throughput graphs you can try to match against your queueing spikes. Finally, look at NewRelic's Transactions tab.
There are various ways to sort - slowest average response time is probably going to help, but check out all the options and see if any transactions stand out.
Click on a suspicious transaction and look at the graph on the right. If you see spikes matching your queueing buildups, that could be it, but since it looks to be affecting your whole site, watch out for several transactions seeing correlated slowdowns.
Check out the transaction traces at the bottom. Something in there taking a long time to run is as close to a smoking gun as you'll get. This should correlate with pg:long-running-queries.
Look at the breakdown table between the graph and the transaction traces. Check for things that are taking a long time (e.g., a 2-second external request) or happening often (e.g., a partial that gets rendered 2500 times per request). Those are places for caching or optimization.
Garbage collection—This is less likely because Ruby GCs all the time and there's no reason it would show spikes on that regular cadence, but if there's a regular request that allocates a ton of objects, both building the objects and cleaning them up will take time. It would only affect one dyno at once, and it would be correlated with a long or highly repetitive query in your NewRelic investigation. You can see some stats about this in NewRelic's Ruby VM tab.
Take a look at your dyno and DB memory usage too. Both are printed to the Heroku logs, and if you add Librato, they'll build some automatic graphs that are quite helpful. If your dyno is swapping, performance will suffer and you should either upgrade to a bigger dyno or run fewer processes per dyno. Processes will typically accumulate memory as they run and never quite release as much as you'd like, so tune it so that right before a restart, your dyno is just under its available RAM. Similarly for the DB, if you're hitting swap there, query performance will suffer and you should upgrade.
Other things it could be, but probably isn't in this case:
Sleeping dynos—Heroku puts a dyno to sleep if it hasn't served a request in a while, but only if you have just 1 dyno running. You have 3, so this isn't it.
Web Server Concurrency—If at any given moment, there are more requests than available processes, requests will be queued. The obvious fix is to increase the available dynos/processes, which will put more load on your DB and potentially move the issue there. Since some regular request is visible every time, I'm guessing request volume is low and this also isn't your problem.
Heroku Instability—Sometimes, for no obvious reason, Heroku starts queueing requests more than it should and doesn't report any issues at status.heroku.com. Restarting the dynos typically fixes that temporarily while Heroku gets their head back on straight.

Neo4j 2.0.4 browser cannot query large datasets

Whenever I try to run cypher queries in Neo4j browser 2.0 on large (anywhere from 3 to 10GB) batch-imported datasets, I receive an "Unknown Error." Then Neo4j server stops responding, and I need to exit out using Task Manager. Prior to this operation, the server shuts down quickly and easily. I have no such issues with smaller batch-imported datasets.
I work on a Win 7 64bit computer, using the Neo4j browser. I have adjusted the .properties file to allow for much larger memory allocations. I have configured my JVM heap to 12g, which should be fine for 64bit JDK. I just recently doubled my RAM, which I thought would fix the issue.
My CPU usage is pegged. I have the logs enabled but I don't know where to find them.
I really like the visualization capabilities of the 2.0.4 browser, does anyone know what might be going wrong?
Your query is taking a long time, and the web browser interface reports "Unknown Error" after a certain timeout period. The query is still running, but you won't see the results in the browser. This drove me nuts too when it first happened to me. If you run the query in the neo4j shell you can verify whether or not this is the problem, because the shell won't time out.
Once this timeout occurs, you can find that the whole system becomes quite non-responsive, especially if you re-run the query, because now you have two extremely long queries running in parallel!
Depending on the type of query, you may be able to improve performance. Sometimes it's as simple as limiting the number of returned nodes (in cases where you only need to find one node or path).
Hope this helps.
Grace and peace,
Jim
