We have 1 swarm cluster with 3 managers and 10 workers for performance test. When 100 concurrent requests(create service) sent to one swarm manager, dockerd may accept all the requests to dispatch to workers. But if we increase the num of concurrent requests, the dockerd error log says:
Error creating service serviceXXX: rpc error: code = 4 desc = context
deadline exceeded"
Is there a default value of max concurrent requests that dockerd can handle in code? How could we increase the concurrent requests that dockerd can process successfully?
The daemon is version 17.03.
As commented in issue 29987, this error message is not very explicit:
I think whenever we encounter a context deadline exceeded error, we should rewrite it to a coherent explanation of what timed out, and perhaps list possible reasons that could cause the timeout (loss of quorum, etc).
When working on docker/docker-e2e, I had problems where things were timing out causing context deadline exceed errors, but the root cause of the timeout was some other error that was getting ignored, superseded, or otherwise buried.
As detailed in issue 33631:
This error can have various causes (See this search for existing issues mentioning this error).
The error itself is quite generic, and could mean that the manager was not able to communicate with other managers in the cluster.
From just the error, it's not easy to discover why it fails to communicate (it can be either a bad network connection, other managers did not properly re-join the cluster, therefore you lost quorum, or it could be if (e.g.) managers did not have a static IP-address, and the IP-address changed - which is currently not supported).
You can see a similar case here but this happens also with less queries.
Related
I have a Prometheus service running in a docker container and we have a group of servers that are rotating reporting up and down with the error "context deadline exceeded".
Our time interval is 15 seconds and timeout is 10 second.
The servers have been polled with no issues for months, no new changes have been identified. At first I suspected a networking issues but I have triple checked the entire path and all containers and everything is okay. I have even tcpdumped on the destination server and Prometheus polling server and can see the connections establish and complete, yet still being reported as down.
Can anyone tell me where I can find logs relating to "content deadline exceeded"? Is there any additional information I can find on what is causing this?
From other thread it seems like this is a timeout issue, but the servers are a subsecond away and again there is no packetloss occurring anywhere.
Thanks for any help.
I'm doing load testing on an ExpressJS app hosted on Google Cloud Run, upon spike increase in traffic, there is a period where I see many 500 errors in Stackdriver with the message "The request failed because the instance could not start successfully." - which effectively leads to server downtime.
Seeing that this error occurs more frequently as the app scales up, I'm thinking this is caused by the Cloud Run load balancer assigning traffic prematurely to new instances, before these instances are ready to accept requests.
As I continue to run the load test, the instances are continuously and repeatedly killed and restarted, so there is no mechanism for recovery while the load is on.
I don't see any error logs from my NodeJS application, suggesting none of the failed requests actually reached my app.
What can I do to avoid these errors?
How does Cloud Run determine that a port is ready to accept requests?
Is it something I misconfigured in my ExpressJS app or can I somehow delay Cloud Run a bit before sending requests to a new instance?
This turned out to be caused by a combination of Cloud Run auto-scaling maximum instance limit and Cloud SQL's connection limit.
I was running a small Cloud SQL Postgres instance (3.75 GB / 1 vCPU) which comes with a default connection limit of 100. (https://cloud.google.com/sql/docs/quotas)
By default, Cloud Run assigns a maximum instance count of 1000 for auto-scaling. During the load test, the sudden spike in request count pushed the auto-scaling to create hundreds of instances, which quickly exhausted the Cloud SQL connection limit of 100.
This exact scenario is documented for Cloud SQL: https://cloud.google.com/sql/docs/postgres/connect-run#connection_limits_3 (it would be nice if this is also documented on Cloud Run, it did not immediately occur to me to look for documentation on Cloud SQL when this issue occurred)
The solution is a combination of limiting the maximum instance count on Cloud Run to a number that is tolerable, and adjusting resource allocation / maximum connection limit on Cloud SQL. The exact configuration would obviously depend on the expected level of load.
We've been running Google Cloud Run for a little over a month now and noticed that we periodically have cloud run instances that simply fail with:
The request failed because the HTTP connection to the instance had an error.
This message is nearly always* proceeded by the following message (those are the only messages in the log):
This request caused a new container instance to be started and may thus take longer and use more CPU than a typical request.
* I cannot find, nor recall, a case where that isn't true, but I have not done an exhaustive search.
A few things that may be of importance:
Our concurrency level is set to 1 because our requests can take up to the maximum amount of memory available, 2GB.
We have received errors that we've exceeded the maximum memory, but we've dialed back our usage to obviate that issue.
This message appears to occur shortly after 30 seconds (e.g., 32, 35) and our timeout is set to 75 seconds.
In my case, this error was always thrown after 120 seconds from receiving the request. I figured out the issue that Node 12 default request timeout is 120 seconds. So If you are using Node server you either can change the default timeout or update Node version to 13 as they removed the default timeout https://github.com/nodejs/node/pull/27558.
If your logs didn't catch anything useful, most probably the instance crashes because you run heavy CPU tasks. A mention about this can be found on the Google Issue Tracker:
A common cause for 503 errors on Cloud Run would be when requests use
a lot of CPU and as the container is out of resources it is unable to
process some requests
For me the issue got resolved by upgrading node "FROM node:13.10.1 AS build" to "FROM node:14.10.1 AS build" in docker file it got resolved by upgarding the node.
Got this issue while trying to clear the events for primary calendar ID.
This issue occurred with my first_account#gmail.com account, But with my another account second_account#gmail.com , clear event works.
can anyone please help me with this ?
Probable root cause could be heavy network traffic. Since you are just testing, you can try again until you get a success response.
If you will encounter this during the development stage suggested action for this error is to use exponential backoff.
Exponential backoff is a standard error handling strategy for network
applications in which the client periodically retries a failed request
over an increasing amount of time. If a high volume of requests or
heavy network traffic causes the server to return errors, exponential
backoff may be a good strategy for handling those errors. Conversely,
it is not a relevant strategy for dealing with errors unrelated to
rate-limiting, network volume or response times, such as invalid
authorization credentials or file not found errors.
Used properly, exponential backoff increases the efficiency of
bandwidth usage, reduces the number of requests required to get a
successful response, and maximizes the throughput of requests in
concurrent environments.
Create requests are not idempotent. A simple retry is insufficient and
may result in duplicate entities. Check whether the entity exists
before retrying.
With a Kafka cluster of 3 and a Zookeeper cluster of the same I brought up one distributed connector node. This node ran successfully with a single task. I then brought up a second connector, this seemed to run as some of the code in the task definitely ran. However it then didn't seem to stay alive (though with no errors thrown, the not staying alive was observed by a lack of expected activity, while the first connector continued to function correctly). When I call the URL http://localhost:8083/connectors/mqtt/tasks, on each connector node, it tells me the connector has one task. I would expect this to be two tasks, one for each node/worker. (Currently the worker configuration says tasks.max = 1 but I've also tried setting it to 3.
When I try and bring up a third connector, I get the error:
"POST /connectors HTTP/1.1" 500 90 5
(org.apache.kafka.connect.runtime.rest.RestServer:60)
ERROR IO error forwarding REST request:
(org.apache.kafka.connect.runtime.rest.RestServer:241)
java.net.ConnectException: Connection refused
Trying to call the connector POST method again from the shell returns the error:
{"error_code":500,"message":"IO Error trying to forward REST request:
Connection refused"}
I also tried upgrading to Apache Kafka 0.10.1.1 that was released today. I'm still seeing the problems. The connectors are each running on isolated Docker containers defined by a single image. They should be identical.
The problem could be that I'm trying to run the POST request to http://localhost:8083/connectors on each worker, when I only need to run it once on a single worker and then the tasks for that connector will automatically distribute to the other workers. If this is the case, how do I get the tasks to distribute? I currently have the max set to three, but only one appears to be running on a single worker.
Update
I ultimately got things running using essentially the same approach that Yuri suggested. I gave each worker a unique group ID, then gave each connector task the same name. This allowed the three connectors and their single tasks to share a single offset, so that in the case of sink connectors the messages they consumed from Kafka were not duplicated. They are basically running as standalone connectors since the workers have different group ids and thus won't communicate with each other.
If the connector workers have the same group ID, you can't add more than one connector with the same name. If you give the connectors different names, they will have different offsets and consume duplicate messages. If you have three workers in the same group, one connector and three tasks, you would theoretically have an ideal situation where the tasks share an offset and the workers make sure the tasks are always running and well distributed (with each task consuming a unique set of partitions). In practice the connector framework doesn't create more than one task, even with tasks.max set to 3 and when the topic tasks are consuming has 25 partitions.
If anyone knows why I'm seeing this behaviour, please let me know.
I've encountered with similar issue in the same situation as yours.
Task.max is configured for a topic and distributed workers automatically decide what nodes handle topic. So, if you have 3 workers in a cluster and your topic configuration says task.max=2 then only 2 of 3 workers will process the topic. In theory, if one of workers fails, 3rd should pick up workload. But..
The distributed connector turned out to be very unreliable: once you add\remove some nodes, the cluster broke down and all workers did nothing but tried to choose leader and failed. The only way to fix was to restart whole cluster and preferably all workers simultaneously.
I chose another way - I used standalone worker and it works like a charm to me because distribution of load is implemented on Kafka client level and once some worker dropped, the cluster re-balances automatically and clients connected to unoccupied topics.
PS. Maybe it will be useful for you too. Confluent connector is not tolerate to invalid payload that does not match topic's schema. Once the connector get some invalid message it silently dies. The only way to find out is to analyze metrics.
I'm posting an answer to an old question, since Kafka Connect has moved on a lot in three years.
In the latest version (2.3.1) there is incremental rebalancing which massively improves the behaviour of Kafka Connect.
It's also worth noting that when configuring Kafka Connect rest.advertised.host.name must be set correctly, as if it's not you will see errors including the one quoted
{"error_code":500,"message":"IO Error trying to forward REST request: Connection refused"}
See this post for more details.