Perfino impact of server becoming unreachable / offline - perfino

Regarding Perfino from EJ-Technologies...:
Once the Agent is installed against PROD JVMs, what happens if the Perfino server becomes unavailable?
My concern is not that profiling data might be missed, but rather whether PROD would become unstable or unstartable if Perfino died a sudden death and could not be rapidly recovered.
Thanks, Robin.

The agent will never become unstable if there is no perfino server. It caches recorded data and transfers it to the perfino server when a connection can be made. Connection attempts are made from time to time and you can see corresponding output on stderr.
After a certain period, it discards cached data to avoid imposing substantial memory overhead on the monitored JVM.
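The behaviour described in this answer (buffer while the server is down, retry from time to time, drop old data so memory stays bounded) can be sketched roughly as follows. This is not perfino's agent code: the class `BufferingSender`, the `send` callback, and the `max_items` cap are hypothetical names chosen for illustration, and the sketch evicts by count, whereas the answer describes eviction after a period of time.

```python
from collections import deque


class BufferingSender:
    """Hypothetical sketch: buffer records while the server is unreachable."""

    def __init__(self, send, max_items=1000):
        # `send` is assumed to raise ConnectionError when the server is down.
        self._send = send
        # A bounded deque silently drops the oldest record once it is full.
        self._buffer = deque(maxlen=max_items)

    def record(self, item):
        # Recording never raises, so the monitored application is unaffected.
        self._buffer.append(item)

    def flush(self):
        """Try to deliver buffered records; return how many were sent."""
        sent = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break  # server still down; try again on the next flush
            self._buffer.popleft()
            sent += 1
        return sent

    def pending(self):
        return len(self._buffer)
```

Calling `flush()` on a timer gives the "connection attempts from time to time" behaviour; the application thread only ever calls `record()`.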

Related

Restart KSQL-Server when some queries are running

I have been trying to find documentation on what happens when KSQL-Server restarts while some queries are running. What will happen?
Does it behave similarly to Kafka Streams, so that the consumer offset is not committed and at-least-once is guaranteed?
I can observe that the queries are stored in the command topic and are executed when ksql-server restarts.
I have been trying to find documentation on what happens when KSQL-Server restarts while some queries are running. What will happen?
If you only have a single KSQL server, then stopping that server will of course stop all the queries. Once the server is running again, all queries will continue from the points they stopped processing. No data is lost.
If you have multiple KSQL servers running, then stopping one (or some) of them will cause the remaining servers to take over any query processing tasks from the stopped servers. Once the stopped servers have been restarted the query processing workload will be shared again across all servers.
Does it behave similarly to Kafka Streams, so that the consumer offset is not committed and at-least-once is guaranteed?
Yes.
But (even better): Whether the processing guarantees are at-least-once or exactly-once depends solely on the KSQL server's configuration. It does of course not depend on whether or when the server is being restarted, crashes, etc.
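As a toy illustration of why committing offsets only after processing yields at-least-once behaviour across a restart, consider the following sketch. It is not KSQL or Kafka Streams code: the function `run`, the in-memory `log` list, and the `crash_after` parameter are all hypothetical stand-ins.

```python
def run(log, committed, output, crash_after=None):
    """Process records starting at the committed offset.

    The offset is committed only after a record has been processed.
    `crash_after` simulates a crash right after processing the record at
    that offset, before its offset is committed.
    Returns the last committed offset.
    """
    offset = committed
    while offset < len(log):
        output.append(log[offset].upper())  # the "processing" step
        if crash_after == offset:
            return committed  # crash: this record's offset was never committed
        offset += 1
        committed = offset  # commit after successful processing
    return committed


log = ["a", "b", "c"]
output = []
committed = run(log, 0, output, crash_after=1)  # crashes while handling "b"
committed = run(log, committed, output)         # restart resumes at "b"
print(output)     # ['A', 'B', 'B', 'C']
print(committed)  # 3
```

The record "b" is processed twice but nothing is skipped, which is the at-least-once guarantee; avoiding the duplicate is what an exactly-once configuration is for.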

neo4j 3.5.x GC running over and over again, even after just starting the server

Our application uses neo4j 3.5.x (tried both community and enterprise editions) to store some data.
No matter how we set up memory in conf/neo4j.conf (we tried lots of combinations for initial/max heap settings, from 4 to 16 GB), the GC process runs every 3 seconds, bringing the machine to its knees and slowing the whole system down.
There's a combination (8g/16g) that seems to make things more stable, but a few minutes (20-30) after our system starts being used, GC kicks in again on neo4j and goes into this "deadly" loop.
If we restart the neo4j server without restarting our system, as soon as our system starts querying neo4j, GC starts again... (we've noticed this behavior consistently).
We had a 3.5.x community instance that had been working fine until last week (when we tried to switch to enterprise). We copied the data/ folder over from the enterprise instance to the community instance and started the community instance... only to have it behave the same way the enterprise instance did, running GC every 3 seconds.
Any help is appreciated. Thanks.
Screenshot of jvisualvm with 8g/16g of heap
In debug.log, only these lines are significant:
2019-03-21 13:44:28.475+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:50376]: client produced 301 messages on the worker queue, auto-read is being disabled.
2019-03-21 13:45:15.136+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:50376]: consumed messages on the worker queue below 100, auto-read is being enabled.
2019-03-21 13:45:15.140+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:50376]: client produced 301 messages on the worker queue, auto-read is being disabled.
And I have a neo4j.log excerpt from around the time the jvisualvm screenshot shows but it's 3500 lines long... so here it is on Pastebin:
neo4j.log excerpt from around the time the jvisualvm screenshot was taken
Hope this helps. I also have the logs for the Enterprise edition if needed, though they are a bit more 'chaotic' (neo4j restarts in between) and I have no jvisualvm screenshot for them.

Current Sidekiq job lost when deploying to Heroku

I have a Sidekiq job that runs for a while. When I deploy to Heroku while the job is running, it can't finish within the few seconds it is given.
That is fine, as the job is designed to be able to be re-run if needed.
The problem is that the job gets lost (instead of being put back into Redis and run again after the deploy).
I found that it is advised to set :timeout: 8 on Heroku and I tried it, but it had no effect (I also tried setting it to 5).
When there is an exception, I get errors reported, but I don't see any. So not sure what could be wrong.
Any tips on how to debug this?
The free version of Sidekiq will push unfinished jobs back to Redis after the timeout has passed, default of 8 seconds. Heroku gives a process 10 seconds to shut down. That means we have 2 seconds to get those jobs back to Redis or they will be lost. If your network is slow, if the Redis server is swapping, etc, that 2 sec deadline might not be met and the jobs lost.
You were on the right track: one answer is to lower the timeout so you have a better chance of meeting that deadline. But network or swapping delay can't be predicted: even 5 seconds might not be enough time.
Under normal healthy conditions, things should work as designed. Keep your machines healthy (uncongested network, plenty of RAM) and the basic fetch should work well. Sidekiq Pro's reliable fetch feature is a fundamental redesign of how Sidekiq fetches jobs and works around all of these issues by keeping jobs in Redis all the time so they can't be lost. But it comes with serious trade-offs too: it's more complicated, slower and more Redis-intensive than "basic" fetch.
In short, I don't know why you are losing jobs but make sure your instances and Redis server are healthy and the latency is low.
https://github.com/mperham/sidekiq/wiki/Using-Redis#life-in-the-cloud
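The deadline arithmetic in this answer can be written out as a small sketch. The 10-second figure is the Heroku shutdown window quoted above; the function name `requeue_window` is hypothetical.

```python
HEROKU_SHUTDOWN_SECONDS = 10  # shutdown window quoted in the answer above


def requeue_window(sidekiq_timeout, shutdown_grace=HEROKU_SHUTDOWN_SECONDS):
    """Seconds left to push unfinished jobs back to Redis before the kill."""
    return max(0, shutdown_grace - sidekiq_timeout)


print(requeue_window(8))  # 2
print(requeue_window(5))  # 5
```

A lower timeout widens the window, but as the answer notes, no fixed window is guaranteed to be enough when the network or Redis is slow.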
This is actually a feature of Sidekiq, designed to steer you toward the paid Pro version:
http://sidekiq.org/products/pro
RELIABILITY
More reliable message processing.
Cloud environments are noisy and unreliable. Seeing timeouts? Wild swings in latency or performance? Ruby VM crashes or processes disappearing?
If a Sidekiq process crashes while processing a job, that job is lost.
If the Sidekiq client gets a networking error while pushing a job to Redis, an exception is raised and the job is not delivered.
Sidekiq Pro uses Redis's RPOPLPUSH command to ensure that jobs will not be lost if the process crashes or gets a KILL signal.
The Sidekiq Pro client can withstand transient Redis outages or timeouts. It will enqueue jobs locally upon error and attempt to deliver those jobs once connectivity is restored.
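The RPOPLPUSH idea mentioned in the quote can be illustrated with an in-memory model. This is only a sketch of the general "reliable queue" pattern, not Sidekiq Pro's implementation: the helper names `fetch`, `acknowledge`, and `recover` are hypothetical, and plain Python deques stand in for Redis lists (in Redis the move between lists is atomic, which this model does not reproduce).

```python
from collections import deque


def rpoplpush(source, destination):
    """Mimic Redis RPOPLPUSH: move the tail of source to the head of destination."""
    if not source:
        return None
    item = source.pop()
    destination.appendleft(item)
    return item


def fetch(queue, in_progress):
    # The job is never held only in worker memory: it also sits in `in_progress`.
    return rpoplpush(queue, in_progress)


def acknowledge(in_progress, job):
    # The job finished, so drop it from the working list.
    in_progress.remove(job)


def recover(queue, in_progress):
    """After a crash, move every unacknowledged job back onto the queue."""
    while in_progress:
        queue.append(in_progress.pop())


queue = deque()
queue.appendleft("job1")  # LPUSH-style enqueue at the head
queue.appendleft("job2")
in_progress = deque()

job = fetch(queue, in_progress)  # a worker takes "job1", then crashes
recover(queue, in_progress)      # "job1" is back at the tail of the queue
print(fetch(queue, in_progress)) # job1
```

With a plain pop, the job would exist only in the crashed worker's memory; keeping a copy in a second list is what makes recovery possible.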
A deploy terminates all processes that belong to the user, therefore the job is lost. There is actually not much you can do there.
As @mike-perham and @esse noted, Sidekiq is designed in such a way that it can lose jobs due to its fetching mechanism. Your options to get around this are:
To buy Sidekiq Pro (although it was reported to cause the same issue)
To write your own fetcher (but that would mean you cannot use most 3rd party libraries, as they will not work with your custom fetcher)
To mimic Sidekiq Pro's reliable fetch by backing up your jobs' data. If you want to go this route, check out the attentive_sidekiq gem, which does exactly that.

How to debug Neo4J stalls?

I have a Neo4J server running that periodically stalls for tens of seconds. The web frontend will say it's "disconnected" with the red warning bar at the top, and a normally instant REST query in my application will apparently hang until the stall ends, and then everything returns to normal. The web frontend becomes usable and my REST query completes fine.
Is there any way to debug what is happening during one of these stall periods? Can you get a list of currently running queries? Or a list of what hosts are connected to the server? Or any kind of indication of server load?
Most likely JVM garbage collection kicking in because you haven't allocated enough heap space.
There are a number of ways to debug this. You can, for example, enable GC logging (uncomment the appropriate lines in neo4j-wrapper.conf), or use a profiler (e.g. YourKit) to see what's going on and why the pauses occur.

Application Pool recycling results in very long response times

I've read somewhere that application pool recycling shouldn't be very noticeable to the end user when overlapping is enabled, but in my case it results in responses at least 10 times longer than usual (depending on load, response time grows from the regular 100ms up to 5000ms). Also, this affects not just a single request but several requests right after the pool recycles (I was using ~10 concurrent connections when testing this).
So the questions would be:
In my opinion, I don't do anything that would take a long time on application start; in general, that is only IoC container and routing initialization. And even if I did do something, isn't that what overlapping should take care of?
Is the SQL connection pool destroyed during pool recycling, and could that be a reason for the long response times?
What would be the best method to profile what is taking so long? Also, are there any ideas about what could take so long on the IIS/.NET side, and how to avoid it?
Overlapping only means that the old worker process will be kept running while the new one is started. As soon as the new one is started, it begins handling all requests. "Started" does not mean that initialization (which might be contained in Application_Start, any static constructors in your application, or any one-time, contentious tasks like proxy building) has been completed. This means that new requests will have to wait while this initialization is completed, even though the "old" worker process might still be available for a short time. Also, if your application uses any kind of caching, your new caches will be "cold", meaning there will be some additional processing time required until the caches are warmed up.
Yes - your new application will have a new sql connection pool.
In my experience, in a production environment, with well tested code and an application that requires consistent, high performance, I choose to disable application pool recycling altogether. Application Pool recycling is a "feature" introduced to combat the perception that IIS was unstable, when in fact what was usually really unstable was the applications that it was hosting. In my opinion, it is a crutch that allows people to deploy less than stable code. If it is causing you problems, turn it off and make sure your application doesn't have any memory leaks, etc. that might lead to long term application instability.

Resources