How to debug Neo4J stalls? - neo4j

I have a Neo4J server running that periodically stalls out for 10s of seconds. The web frontend will say it's "disconnected" with the red warning bar at the top, and a normally instant REST query in my application will apparently hang, until the stall ends and then everything returns to normal. The web frontend becomes usable and my REST query completes fine.
Is there any way to debug what is happening during one of these stall periods? Can you get a list of currently running queries? Or a list of what hosts are connected to the server? Or any kind of indication of server load?

Most likely JVM garbage collection kicking in because you haven't allocated enough heap space.
There's a number of ways to debug this. You can for example enable GC logging (uncomment appropriate lines in neo4j-wrapper.conf), or use a profiler (e.g. YourKit) to see what's going on and why the pauses.

Related

neo4j 3.5.x GC running over and over again, even after just starting the server

Our application uses neo4j 3.5.x (tried both community and enterprise editions) to store some data.
No matter how we setup memory in conf/neo4j.conf (tried with lots of combinations for initial/max heap settings from 4 to 16 GB), the GC process runs periodically every 3 seconds, putting the machine to its knees, slowing the whole system down.
There's a combination (8g/16g) that seems to make stuff more stable, but a few minutes (20-30) after our system is being used, GC kicks again on neo4j and goes into this "deadly" loop.
If we restart the neo4j server w/o restarting our system, as soon as our system starts querying neo4j, GC starts again... (we've noticed this behavior consistently).
We've had a 3.5.x community instance which was working fine from last week (when we've tried to switch to enterprise). We've copied over the data/ folder from enterprise to community instance and started the community instance... only to have it behave the same way the enterprise instance did, running GC every 3 seconds.
Any help is appreciated. Thanks.
Screenshot of jvisualvm with 8g/16g of heap
On debug.log, only these are significative:
2019-03-21 13:44:28.475+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:50376]: client produced 301 messages on the worker queue, auto-read is being disabled.
2019-03-21 13:45:15.136+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:50376]: consumed messages on the worker queue below 100, auto-read is being enabled.
2019-03-21 13:45:15.140+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:50376]: client produced 301 messages on the worker queue, auto-read is being disabled.
And I have a neo4j.log excerpt from around the time the jvisualvm screenshot shows but it's 3500 lines long... so here it is on Pastebin:
neo4j.log excerpt from around the time time the jvisualvm screenshot was taken
JUST_PUT_THIS_TO_KEEP_THE_SO_EDITOR_HAPPY_IGNORE...
Hope this helps, I have also the logs for the Enterprise edition if needed, though they are a bit more 'cahotic' (neo4j restarts in between) and I have no jvisualvm screenshot for them

Distributed Erlang - network split recovery and using heart with distributed applications

I have a standard situation, two distributed Erlang nodes, one master one standby.
When I stop the master the standby comes on - failover, when I start the master the standby stops - takeover. Everything works fine as long as heart is not turned on and there is no network split.
However, when I disconnect the master from the network after 60 seconds or so the standby gives me an error message ** removing (timedout) connection ** and starts up as if the master node stopped. This makes sense to me, it doesn't know if the master is alive or not, and epmd can't connect to the master node so it is removed from the nodes() list. Lets pretend for a moment that this is the desired outcome.
The problem is that, when the connection is restored, I have master and standby running at the same time and the standby is oblivious to the fact that the master is running. Pinging the standby during the masters init does not solve the issue. I checked nodes() on the standby after doing so, it sees the master node but still it continues to run.
My solution for now has been to create a process, that monitors all nodes that are above each node in hierarchy and if any of them are online, can be pinged, the process calls erlang:halt() to terminate the standby node. It works for simple situations, but maybe someone can tell me if there is a better way? I found a similar problem described on Elixir forum so it probably a known erlang problem without an easy solution. https://elixirforum.com/t/distributed-application-network-split/10007
If during a network split you don't want to have two nodes running in parallel I'm guessing an outside monitoring application needs to be used?
The second major issue is heart. If heart is turned on, as is, the failover never happens. If heart is running with a sleep before it calls start it stops the failover node when it calls the application start. So even when it can't start the master, do to it not having access to vital resources for example, it stops the failover node, and doesn't bring it back up after it fails to start the master. I don't know if heart is not supposed to be used with a distributed application or if there is an option to run some erlang code to check if the resources are available before attempting a start the node and before stopping the failover node?
The documentation on heart is not great. Very hard to find any examples of HEART_COMMAND. I found a way to set the HEART_COMMAND to a script from within my application, but there is a limit to how long the argument can be, and it's not as long as stated in the documentation from what I can tell. This for example sets a sleep timer for 60 seconds before calling application start again. It doesn't solve any issues, because in 60 seconds it stops the failover node and hangs if master node can't start.
heart:set_cmd("sleep 60; ./bin/myapp start")
The solution I've ended up with for now is letting heart of the main release start another release, a pre-loader, which does a preliminary check that all resources are available and if they are it starts the main release-application, and if they are not it continues checking forever. This way the main app is running on the failover node without interruption. So the main release has heart turned on, and the pre-loader does not. I ended up using a bash script file because I needed to do more work than I could fit in the heart:set_cmd/1, so I'm still calling heart:set_cmd(Dir ++ "/priv/myHeartScript.sh " ++ Arg1 ++ " " ++ Arg2), but don't get carried away with the Args as there is a limit on size! I also used Environment Variables which I set in vm.args using -env to pass data to the script, such as the pre-loader path/name. This allowed me to avoid having to edit the scrip too during deployment.
If anyone has a better solution PLEASE let me know.
UPDATE
The team at Erlang Solutions was kind enough to shed some light onto the subject. Basically, nobody they know uses the Erlangs built in distributed model. Everything revolves around the data, and as long as it is available on redundant databases you can spin up new applications anytime. They recommend using the cloud hosts that can spin up new servers when one goes down or use a redundant node design, so have 5 nodes up in parallel and if a few go down you can restart them manually or by other means.
As for me, I can say that getting heart to start a pre-loader release/app gets the job done but it gets complicated fast. To launch the app now requires provisioning several extra sys.config/vm.args/rebar.config files. I will be looking into their suggestions for the next iteration.
UPDATE
Moved away from using Erlang distributed model. Using RabbitMQ to send heartbeats to all nodes, including itself. If a node is receiving heartbeats from itself and no other node it's the master, if receiving more than one use any attribute like node name to chose the master. You don't have to use RabbitMQ, but you need to make sure all nodes can reach the same destination and consume from it.
Also, devOps oppose using heart. They prefer to use standard Linux tools to monitor application status and restart it after crash or a server reboot.

My server gets overloaded even though I keep a limit on the requests I send it

I have a server on Heroku - 3 dynos, 2 processes each.
The server does 2 things:
It responds to requests from the browser (AJAX and some web pages), based on data stored in a postgresql database
It exposes a REST API to update the data in the database. This API is called by another server. The rate of calls is limited: The other server only calls my server through a queue with a single worker, which makes sure the other server doesn't issue more than one request in parallel to my server (I verified that indeed it doesn't).
When I look at new relic, I see the following graph, which suggests that even though I keep the other server at one parallel request at most, it still loads my server which creates peaks.
I'd expect that since the rate of calls from the other server is limited, my server will not get overloaded, since a request will only start when the previous request ended (I'm guessing that maybe the database gets overloaded if it gets an update request and returns but continue processing after that).
What can explain this behaviour?
Where else can I look at in order to understand what's going on?
Is there a way to avoid this behaviour?
There are whole lot of directions this investigation could go, but from your screenshot and some inferences, I have two guesses.
A long query—You'd see this graph if your other server or a browser occasionally hits a slow query. If it's just a long read query and your DB isn't hitting its limits, it should only affect the process running the query, but if the query is taking an exclusive lock, all dynos will have to wait on it. Since the spikes are so regular, first think of anything you have running on a schedule - if the cadence matches, you probably have your culprit. The next simple thing to do is run heroku pg:long-running-queries and heroku pg:seq-scans. The former shows queries that might need optimization, and the latter shows full table scans you can probably fix with a different query or a better index. You can find similar information in NewRelic's Database tab, which has time and throughput graphs you can try to match agains your queueing spikes. Finally, look at NewRelic's Transactions tab.
There are various ways to sort - slowest average response time is probably going to help, but check out all the options and see if any transactions stand out.
Click on a suspicious transaction and look at the graph on the right. If you see spikes matching your queueing buildups, that could be it, but since it looks to be affecting your whole site, watch out for several transactions seeing correlated slowdowns.
Check out the transaction traces at the bottom. Something in there taking a long time to run is as close to a smoking gun as you'll get. This should correlate with pg:long-running-queries.
Look at the breakdown table between the graph and the transaction traces. Check for things that are taking a long time (eg. a 2 second external request) or happening often (eg, a partial that gets rendered 2500 times per request). Those are places for caching or optimization.
Garbage collection—This is less likely because Ruby GCs all the time and there's no reason it would show spikes on that regular cadence, but if there's a regular request that allocates a ton of objects, both building the objects and cleaning them up will take time. It would only affect one dyno at once, and it would be correlated with a long or highly repetitive query in your NewRelic investigation. You can see some stats about this in NewRelic's Ruby VM tab.
Take a look at your dyno and DB memory usage too. Both are printed to the Heroku logs, and if you add Librato, they'll build some automatic graphs that are quite helpful. If your dyno is swapping, performance will suffer and you should either upgrade to a bigger dyno or run fewer processes per dyno. Processes will typically accumulate memory as they run and never quite release as much as you'd like, so tune it so that right before a restart, your dyno is just under its available RAM. Similarly for the DB, if you're hitting swap there, query performance will suffer and you should upgrade.
Other things it could be, but probably isn't in this case:
Sleeping dynos—Heroku puts a dyno to sleep if it hasn't served a request in a while, but only if you have just 1 dyno running. You have 3, so this isn't it.
Web Server Concurrency—If at any given moment, there are more requests than available processes, requests will be queued. The obvious fix is to increase the available dynos/processes, which will put more load on your DB and potentially move the issue there. Since some regular request is visible every time, I'm guessing request volume is low and this also isn't your problem.
Heroku Instability—Sometimes, for no obvious reason, Heroku starts queueing requests more than it should and doesn't report any issues at status.heroku.com. Restarting the dynos typically fixes that temporarily while Heroku gets their head back on straight.

Neo4j 2.0.4 browser cannot query large datasets

Whenever I try to run cypher queries in Neo4j browser 2.0 on large (anywhere from 3 to 10GB) batch-imported datasets, I receive an "Unknown Error." Then Neo4j server stops responding, and I need to exit out using Task Manager. Prior to this operation, the server shuts down quickly and easily. I have no such issues with smaller batch-imported datasets.
I work on a Win 7 64bit computer, using the Neo4j browser. I have adjusted the .properties file to allow for much larger memory allocations. I have configured my JVM heap to 12g, which should be fine for 64bit JDK. I just recently doubled my RAM, which I thought would fix the issue.
My CPU usage is pegged. I have the logs enabled but I don't know where to find them.
I really like the visualization capabilities of the 2.0.4 browser, does anyone know what might be going wrong?
Your query is taking a long time, and the web browser interface reports "Unknown Error" after a certain timeout period. The query is still running, but you won't see the results in the browser. This drove me nuts too when it first happened to me. If you run the query in the neo4j shell you can verify whether or not this is the problem, because the shell won't time out.
Once this timeout occurs, you can find that the whole system becomes quite non-responsive, especially if you re-run the query, because now you have two extremely long queries running in parallel!
Depending on the type of query, you may be able to improve performance. Sometimes it's as simple as limiting the number of returned nodes (in cases where you only need to find one node or path).
Hope this helps.
Grace and peace,
Jim

Web App Performance Problem

I have a website that is hanging every 5 or 10 requests. When it works, it works fast, but if you leave the browser sit for a couple minutes and then click a link, it just hangs without responding. The user has to push refresh a few times in the browser and then it runs fast again.
I'm running .NET 3.5, ASP.NET MVC 1.0 on IIS 7.0 (Windows Server 2008). The web app connects to a SQLServer 2005 DB that is running locally on the same instance. The DB has about 300 Megs of RAM and the rest is free for web requests I presume.
It's hosted on GoGrid's cloud servers, and this instance has 1GB of RAM and 1 Core. I realize that's not much, but currently I'm the only one using the site, and I still receive these hangs.
I know it's a difficult thing to troubleshoot, but I was hoping that someone could point me in the right direction as to possible IIS configuration problems, or what the "rough" average hardware requirements would be using these technologies per 1000 users, etc. Maybe for a webserver the minimum I should have is 2 cores so that if it's busy you still get a response. Or maybe the slashdot people are right and I'm an idiot for using Windows period, lol. In my experience though, it's usually MY algorithm/configuration error and not the underlying technology's fault.
Any insights are appreciated.
What diagnistics are available to you? Can you tell what happens when the user first hits the button? Does your application see that request, and then take ages to process it, or is there a delay and then your app gets going and works as quickly as ever? Or does that first request just get lost completely?
My guess is that there's some kind of paging going on, I beleive that Windows tends to have a habit of putting non-recently used apps out of the way and then paging them back in. Is that happening to your app, or the DB, or both?
As an experiment - what happens if you have a sneekly little "howAreYou" page in your app. Does the tiniest possible amount of work, such as getting a use count from the db and displaying it. Have a little monitor client hit that page every minute or so. Measure Performance over time. Spikes? Consistency? Does the very presence of activity maintain your applicaition's presence and prevent paging?
Another idea: do you rely on any caching? Do you have any kind of aging on that cache?
Your application pool may be shutting down because of inactivity. There is an Idle Time-out setting per pool, in minutes (it's under the pool's Advanced Settings - Process Model). It will take some time for the application to start again once it shuts down.
Of course, it might just be the virtualization like others suggested, but this is worth a shot.
Is the site getting significant traffic? If so I'd look for poorly-optimized queries or queries that are being looped.
Your configuration sounds fine assuming your overall traffic is relatively low.
To many data base connections without being release?
Connecting some service/component that is causing timeout?
Bad resource release?
Network traffic?
Looping queries or in code logic?

Resources