Hazelcast OperationTimeoutException on remote execution - timeout

I'm running a 5-node Hazelcast cluster at version 3.6.6 in AWS.
I'm using it as a workload distributor, calling the IExecutorService API

<T> void submit(Runnable task,
                MemberSelector memberSelector,
                ExecutionCallback<T> callback)

to execute tasks on a member of my choice. I do not use partition-based balancing since different partitions would have different weights.
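For context, the submission path looks roughly like the sketch below. This is a minimal, hedged illustration only: the executor name "exec_service_3" is taken from the exception further down, while the "accepts-work" member attribute, the selector logic and the task body are made-up placeholders, not my actual code.

import com.hazelcast.core.ExecutionCallback;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.core.Member;
import com.hazelcast.core.MemberSelector;

import java.io.Serializable;

public class SubmitExample {

    // The task must be serializable so it can be shipped to the chosen member.
    static class MyTask implements Runnable, Serializable {
        @Override
        public void run() {
            System.out.println("task executed on the selected member");
        }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IExecutorService exec = hz.getExecutorService("exec_service_3");

        // Choose the target member by our own criteria instead of partition ownership.
        MemberSelector selector = new MemberSelector() {
            @Override
            public boolean select(Member member) {
                return "true".equals(member.getStringAttribute("accepts-work"));
            }
        };

        exec.submit(new MyTask(), selector, new ExecutionCallback<Object>() {
            @Override
            public void onResponse(Object response) {
                System.out.println("task completed on the target member");
            }

            @Override
            public void onFailure(Throwable t) {
                // This is where the OperationTimeoutException is delivered.
                t.printStackTrace();
            }
        });
    }
}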
After I start the cluster it works well for several days, and then the submitting members start to receive OperationTimeoutException. Once it starts, all members receive this timeout and it happens quite sporadically: there may be a short period when everything works smoothly, and then the exception starts happening again. The target member does receive the task within less than a second and executes it correctly.
The exception itself looks like this:
July 3rd 2019, 10:54:01 UTC:
No response for 560000 ms. Aborting invocation!
Invocation{serviceName='hz:impl:executorService',
op=com.hazelcast.executor.impl.operations.MemberCallableTaskOperation{identityHash=1179024466,
serviceName='hz:impl:executorService', partitionId=-1, replicaIndex=0,
callId=684145, invocationTime=1562150679963 (Wed Jul 03 10:44:39 UTC
2019), waitTimeout=-1, callTimeout=500000, name=exec_service_3},
partitionId=-1, replicaIndex=0, tryCount=250, tryPauseMillis=500,
invokeCount=1, callTimeout=500000,
target=Address[x.x.x.x]:5701, backupsExpected=0,
backupsCompleted=0, connection=Connection [/x.x.x.x:5701 ->
/x.x.x.x:35360], endpoint=Address[x.x.x.x]:5701,
alive=true, type=MEMBER} No response has been received!
backups-expected:0 backups-completed: 0, execution took: 3445
milliseconds
Stacktrace:
at com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:536) ~[anodot-arnorld-1.0-SNAPSHOT.jar:na]
at com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService$IsOperationStillRunningCallback.setOperationTimeout(IsStillRunningService.java:241)
at com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService$IsOperationStillRunningCallback.onResponse(IsStillRunningService.java:229)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture$1.run(InvocationFuture.java:127)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76) [anodot-arnorld-1.0-SNAPSHOT.jar:na]
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)
at ------ End remote and begin local stack-trace ------.(Unknown Source) ~[na:na]
... 8 frames truncated
The timing of the exception is quite weird:
July 3rd 2019, 10:53:57.956 - task submitted for execution by sending instance
July 3rd 2019, 10:53:58.024 - execution starts on target instance
July 3rd 2019, 10:54:01.391 - the sending instance receives the exception
In my logs I see that the timeout happens shortly after the task was submitted, and the "execution took:" part of the exception is quite precise: in the quoted case roughly 3.5 seconds passed since the task was sent for execution. On the other hand, the "invocationTime" (Wed Jul 03 10:44:39 UTC 2019 in the quoted case) is about 10 minutes in the past, even before the job was actually submitted for execution (July 3rd 2019, 10:53:57 UTC). Notably, the "No response for 560000 ms" figure roughly matches the gap between that stale invocationTime and the moment of the exception, and that gap exceeds the 500-second callTimeout, which would explain why the invocation is declared expired only seconds after submission.
I've seen this exception being attributed to long GC pauses, but as I'm constantly monitoring GC, I'm quite sure this is not the case. Also, the networking between cluster members looks healthy and latencies are low.
From what I've seen in the Hazelcast code, the "invocationTime" is taken from the "clusterClock" and not directly from the system time, suggesting that for some reason the cluster clock is 10 minutes off, but I can't figure out why that happens. The cluster is quite busy, but I don't see any exceptional surges in load when this exception starts to happen.
The problem disappears when I take down the whole cluster and then start it anew.
I'm planning to add monitoring of the clusterTime to see when it starts to drift, but that still won't explain why it happens.
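The monitoring I have in mind is just a periodic comparison of the cluster clock with the local system clock; a minimal sketch (the one-minute period and the 5-second alert threshold are arbitrary choices of mine, not anything Hazelcast prescribes):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ClusterClockMonitor {

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        scheduler.scheduleAtFixedRate(() -> {
            // getClusterTime() is the clock the invocationTime above is based on.
            long clusterTime = hz.getCluster().getClusterTime();
            long systemTime = System.currentTimeMillis();
            long driftMillis = systemTime - clusterTime;
            if (Math.abs(driftMillis) > 5_000) {
                System.err.println("Cluster clock drift: " + driftMillis + " ms");
            }
        }, 0, 1, TimeUnit.MINUTES);
    }
}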
Any thoughts?
Update:
In short, the cluster time gradually drifts away from the system time, and once the gap is big enough the tasks start to fail with the timeout exception.
For details: https://github.com/hazelcast/hazelcast/issues/15339

Finally, the issue was resolved by upgrading Hazelcast to version 3.12.11 (4.x.x breaks too many things); it looks like the way cluster time is managed there is insensitive to GC pauses. Some APIs were broken and needed adjustments in code, nothing too serious. A note of warning: 3.6.6 is incompatible with 3.12.11, so a rolling cluster upgrade is impossible. We did a full cluster restart; luckily, that was possible.

Related

Prometheus errors and log location

I have a Prometheus service running in a Docker container, and we have a group of servers that keep flipping between reporting up and down with the error "context deadline exceeded".
Our scrape interval is 15 seconds and the timeout is 10 seconds.
The servers had been polled with no issues for months, and no new changes have been identified. At first I suspected a networking issue, but I have triple-checked the entire path and all containers and everything is okay. I have even run tcpdump on the destination server and on the Prometheus polling server and can see the connections establish and complete, yet the targets are still reported as down.
Can anyone tell me where I can find logs relating to "context deadline exceeded"? Is there any additional information I can find on what is causing this?
From another thread it seems like this is a timeout issue, but the servers are less than a second away and, again, there is no packet loss occurring anywhere.
Thanks for any help.

How can I see how long my Cloud Run deployed revision took to spin up?

I deployed a Vue.js app and a Kotlin server app. Cloud Run promises to put a service to sleep if no request to it arrives for a specific time. I had not opened my app for a day. When I opened it, it was available almost immediately. Since I know how long it takes to spin up when started locally, I kind of don't trust that Cloud Run really put the app to sleep and spun it up that fast.
I'd love to know how I can see how long the spin-up really took - also to help improve startup time for the backend service.
After the service has been inactive for some time, record the current time and then request the service URL.
Then go to the logs for the Cloud Run service, and use this filter to see the logs for the service:
resource.type="cloud_run_revision"
resource.labels.service_name="$SERVICE_NAME"
Look for the log entry with the normal app output after your request, check its time and compare it with the recorded time.
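If you also want the spin-up to be measurable from the application side, one rough option (my own sketch, not part of the answer above, and plain JDK rather than your Kotlin stack) is to log a timestamp as the very first line of main() and again once the server is listening, then compare those two entries in the Cloud Run logs:

import com.sun.net.httpserver.HttpServer;

import java.net.InetSocketAddress;

public final class Server {

    public static void main(String[] args) throws Exception {
        long startedAt = System.currentTimeMillis();
        System.out.println("container starting, epoch-millis=" + startedAt);

        // Cloud Run injects the port to listen on via the PORT environment variable.
        int port = Integer.parseInt(System.getenv().getOrDefault("PORT", "8080"));
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> {
            byte[] body = "ok".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();

        System.out.println("server ready after " + (System.currentTimeMillis() - startedAt) + " ms");
    }
}

The delta between the "container starting" entry and the time you sent the request approximates the cold start; the "server ready after" entry shows how much of that was your own startup work.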
You can't know when the instance will be evicted or whether it is kept in memory. It could happen quickly, or it could take hours or days before eviction. It's "serverless".
About the starting time: when I test, I deploy a new revision and try it out. In the logging service, the first log entry of the new revision gives me the cold start duration (usually 300+ ms, compared to the usual 20-50 ms with a warm start).
The billable instance time is the sum of all the containers' running times. A container is considered "running" while it is processing requests.

TFS Release gates do not fire

I am using TFS Release gates to call several APIs before I deploy.
Usually this works great, but sometimes the gates don't fire at all and the API is never called.
I set the timeout to 5 minutes, so after 5 minutes it should try again. But that schedule then gets messed up: sometimes it retries after 5 minutes (as it is supposed to), but sometimes it takes 11 or 12 minutes.
It looks like the requests are queued somewhere, but I really have no idea.
Does anybody know this behaviour?
Updated
The Delay before evaluation is a time delay at the beginning of the gate evaluation process that allows the gates to initialize, stabilize, and begin providing accurate results for the current deployment
A gate is actually a server task, which is run on the server. Some examples that illustrate this:
On-premise - any on-premises/behind-the-firewall resources are inaccessible
Env variables - some of the environment variables are not accessible (they are not yet initialized at the time of the gate check)
It's run by the release service account, just like other tasks in the release pipeline.
Besides, the Delay before evaluation is meant to cover the time during which samples from the gates might be unreliable. That is an acceptable phenomenon. Sometimes the check may need more time than you configured.
Source Link: Release approvals and gates overview

In what cases does Google Cloud Run respond with "The request failed because the HTTP connection to the instance had an error."?

We've been running Google Cloud Run for a little over a month now and noticed that we periodically have cloud run instances that simply fail with:
The request failed because the HTTP connection to the instance had an error.
This message is nearly always* preceded by the following message (those are the only messages in the log):
This request caused a new container instance to be started and may thus take longer and use more CPU than a typical request.
* I cannot find, nor recall, a case where that isn't true, but I have not done an exhaustive search.
A few things that may be of importance:
Our concurrency level is set to 1 because our requests can take up to the maximum amount of memory available, 2GB.
We have received errors that we've exceeded the maximum memory, but we've dialed back our usage to obviate that issue.
This message appears to occur shortly after 30 seconds (e.g., 32, 35) and our timeout is set to 75 seconds.
In my case, this error was always thrown 120 seconds after the request was received. I figured out that the Node 12 default request timeout is 120 seconds. So if you are using a Node server, you can either change the default timeout or upgrade to Node 13, as they removed the default timeout: https://github.com/nodejs/node/pull/27558.
If your logs didn't catch anything useful, most probably the instance crashes because you run heavy CPU tasks. A mention of this can be found on the Google Issue Tracker:
A common cause for 503 errors on Cloud Run would be when requests use
a lot of CPU and as the container is out of resources it is unable to
process some requests
For me the issue got resolved by upgrading Node from "FROM node:13.10.1 AS build" to "FROM node:14.10.1 AS build" in the Dockerfile.

neo4j 3.5.x GC running over and over again, even after just starting the server

Our application uses neo4j 3.5.x (tried both community and enterprise editions) to store some data.
No matter how we set up memory in conf/neo4j.conf (we tried lots of combinations of initial/max heap settings from 4 to 16 GB), the GC process runs periodically every 3 seconds, bringing the machine to its knees and slowing the whole system down.
There's a combination (8g/16g) that seems to make things more stable, but a few minutes (20-30) after our system starts being used, GC kicks in again on neo4j and goes into this "deadly" loop.
If we restart the neo4j server without restarting our system, GC starts again as soon as our system starts querying neo4j (we've noticed this behavior consistently).
We had a 3.5.x community instance which had been working fine until last week (when we tried to switch to enterprise). We copied the data/ folder from the enterprise instance to the community instance and started the community instance... only to have it behave the same way the enterprise instance did, running GC every 3 seconds.
Any help is appreciated. Thanks.
Screenshot of jvisualvm with 8g/16g of heap
In debug.log, only these entries are significant:
2019-03-21 13:44:28.475+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:50376]: client produced 301 messages on the worker queue, auto-read is being disabled.
2019-03-21 13:45:15.136+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:50376]: consumed messages on the worker queue below 100, auto-read is being enabled.
2019-03-21 13:45:15.140+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:50376]: client produced 301 messages on the worker queue, auto-read is being disabled.
I also have a neo4j.log excerpt from around the time the jvisualvm screenshot was taken, but it's 3500 lines long, so here it is on Pastebin:
neo4j.log excerpt from around the time the jvisualvm screenshot was taken
Hope this helps. I also have the logs for the Enterprise edition if needed, though they are a bit more 'chaotic' (neo4j restarts in between) and I have no jvisualvm screenshot for them.
