Hikari "Add connection elided" - spring-cloud-dataflow

I am getting the following message when running a task in Spring Cloud Data Flow.
DEBUG 13167 --- [spring_batch146] com.zaxxer.hikari.pool.HikariPool : HikariPool-2 - Add connection elided, waiting 1, queue 2
I can't find any information on it.

The log message indicates that your application experienced a sudden burst of demand for connections: when more asynchronous connection-add requests are already queued than are needed to satisfy the threads currently waiting, HikariCP "elides" (skips) the redundant requests instead of opening extra connections. It is logged at DEBUG level and is not an error by itself.
You can find additional background on this behavior in HikariCP's Welcome to the Jungle document.
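If bursts like this are also exhausting the pool and causing connection timeouts, tuning the pool size and connection timeout may help. A minimal sketch using HikariConfig directly (the JDBC URL, credentials, and sizes below are placeholders; with Spring Boot you would normally set the equivalent spring.datasource.hikari.* properties instead):

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    public class PoolSetup {
        public static HikariDataSource dataSource() {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:postgresql://db-host:5432/app"); // placeholder URL
            config.setUsername("app");                               // placeholder credentials
            config.setPassword("secret");
            config.setMaximumPoolSize(20);      // upper bound available for burst demand
            config.setMinimumIdle(5);           // keep a few connections warm between bursts
            config.setConnectionTimeout(30000); // ms a caller waits for a connection before timing out
            return new HikariDataSource(config);
        }
    }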

I have the same case, opening many connections at once from parallel threads.
And in the cloud environment, timeout failures are occurring.
Can anyone help?
"Add connection elided, waiting 17, queue 18"

Related

Request timed out (System.Web.HttpException)

Periodically we experience many "Request timed out" exceptions (System.Web.HttpException) from a specific endpoint that is called often.
It does not appear to be related to peak-load periods; it has happened right after deployment and at random times. No pattern.
Increasing the execution timeout is not the solution, as the requests normally complete within seconds.
Neither the web server nor the backend SQL Server is stressed. We have even seen low CPU usage during an incident period.
From ApplicationInsights I got the exact endpoint that is failing, which is a standard controller action. However, there is no additional information. No stack trace. No error code. Nothing. The exception is thrown anywhere between 1 second and several minutes after the request starts.
From ApplicationInsights I can see that some of the requests to the failing endpoint do complete. However, the response time is extremely long (up to 8 minutes).
I have found nothing in the IIS logs. We have set up failed-request logging and are waiting for the next incident. However, we do not expect to get more information than we already got from ApplicationInsights.
I'm uncertain whether this is an ASP.NET MVC application issue or an IIS configuration issue. It puzzles me that no stack trace is available.
Any suggestions on how to approach this challenge? Pointers to articles/blogs that can help me solve the issue are very much appreciated.
UPDATE
I was looking through our trace logs and realized that they were not complete, i.e., entries were missing. We use ApplicationInsights (AI) for tracing. AI is configured to keep all traces, exceptions, and events, and it is working flawlessly in DEV and STAGING.
We have two AI environments: AI-PROD and AI-TEST. The environment is selected in web.config via instrumentation key. The entire AI config is in the ApplicationInsights.config and this file is the same in DEV, STAGING, and PROD.
I tried to connect STAGING to the AI-PROD environment to verify that it was not a problem with the environment. It worked flawlessly.
I disabled AI in PROD and the server started without throwing “Request timed out” errors during startup. When PROD is connected to either the AI-PROD or the AI-TEST environment I get “Request timed out” errors during startup.

Message Loss (message sent to spring-amqp doesn't get published to rabbitmq)

We have a setup where we use spring-amqp transacted channels to push our messages to RabbitMQ. During testing we found that messages were not even getting published from spring-amqp to RabbitMQ;
we suspect a failure in metricsCollector.basicPublish(this) in com.rabbitmq.client.impl.ChannelN (no exception is thrown),
because we can see that RabbitUtils.commitIfNecessary(channel) in org.springframework.amqp.rabbit.core.RabbitTemplate is not getting called when there is an issue executing metricsCollector.basicPublish(this) in the same code flow.
We have taken TCP dumps and could see that the messages were written to the stream/socket towards RabbitMQ, but since the commit did not happen, due to a probable AMQP API failure, the messages were not delivered to the corresponding queues.
Jar versions used in the setup:
spring-amqp-2.2.1.RELEASE.jar,
spring-rabbit-2.2.1.RELEASE.jar,
amqp-client-5.7.3.jar,
metrics-core-3.0.2.jar
Is anyone facing a similar issue?
Can someone please help?
---edit 1
(Setup): We are using the same connection factory for flows running with a parent transaction and for flows running without a parent transaction.
On further analysis of the issue, we found that isChannelLocallyTransacted sometimes shows inconsistent behavior, because ConnectionFactoryUtils.isChannelTransactional(channel, getConnectionFactory()) sometimes holds a reference to a transacted channel (it returns true, so the expression isChannelLocallyTransacted evaluates to false), due to which tx.commit never happens; so the message gets lost before it is committed to RabbitMQ.
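One mitigation we are evaluating (a sketch only, not a confirmed fix) is to stop sharing one connection factory between the two kinds of flows, so that a transacted channel can never be picked up by the non-transacted path; host names below are placeholders:

    import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
    import org.springframework.amqp.rabbit.core.RabbitTemplate;

    public class RabbitPublishers {

        // Dedicated factory and template for flows that publish inside a transaction
        public RabbitTemplate transactedTemplate() {
            CachingConnectionFactory cf = new CachingConnectionFactory("rabbit-host");
            RabbitTemplate template = new RabbitTemplate(cf);
            template.setChannelTransacted(true); // basicPublish and tx.commit happen on the same channel
            return template;
        }

        // Separate factory and template for flows without a parent transaction
        public RabbitTemplate plainTemplate() {
            CachingConnectionFactory cf = new CachingConnectionFactory("rabbit-host");
            RabbitTemplate template = new RabbitTemplate(cf);
            template.setChannelTransacted(false);
            return template;
        }
    }

Enabling publisher confirms on the connection factory would be another way to detect a publish that never reaches the broker, but we have not tried that yet.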

Prometheus errors and log location

I have a Prometheus service running in a Docker container, and we have a group of servers whose targets are alternating between reporting up and down with the error "context deadline exceeded".
Our scrape interval is 15 seconds and the scrape timeout is 10 seconds.
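For reference, the relevant part of our prometheus.yml looks roughly like this (the job name and targets are placeholders):

    global:
      scrape_interval: 15s
      scrape_timeout: 10s

    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets: ['server-1:9100', 'server-2:9100']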
The servers have been polled with no issues for months, and no new changes have been identified. At first I suspected a networking issue, but I have triple-checked the entire path and all containers and everything is okay. I have even run tcpdump on the destination server and on the Prometheus polling server and can see the connections establish and complete, yet the targets are still reported as down.
Can anyone tell me where I can find logs relating to "context deadline exceeded"? Is there any additional information I can find on what is causing this?
From other threads it seems like this is a timeout issue, but the servers are sub-second away and, again, there is no packet loss occurring anywhere.
Thanks for any help.

Cloud Run: 500 Server Error with no log output

We are investigating an issue on a deployed Cloud Run service, where requests made to the service occasionally fail with a StatusCodeError: 500, while no log of said requests appears on Cloud Run.
Served requests usually produce two log lines detailing the request, route and status code (POST 200 on https://service-name.a.run.app/route/...):
One with log name projects/XXX/logs/run.googleapis.com/stdout is produced by our application to log the serving of every request
One with log name projects/XXX/logs/run.googleapis.com/requests is automatically produced by cloud run on every request
When the incident occurs, none of those are logged. The client (running in a gke pod in the same project) has the only log of the failing requests, with the following message:
StatusCodeError: 500 - "\n<html><head>\n<meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">\n<title>500 Server Error</title>\n</head>\n<body text=#000000 bgcolor=#ffffff>\n<h1>Error: Server Error</h1>\n<h2>The server encountered an error and could not complete your request.<p>Please try again in 30 seconds.</h2>\n<h2></h2>\n</body></html>\n"
Rough timeline of the last incident:
14:41 - Service is serving requests as expected, producing both log lines each time
14:44 to 14:56 - Cloud run logs are empty, every request made to the service (~30) gets the 500 error message
14:56 - Cloud run terminates the currently running container instance, (as happens after some inactivity for instance), which is correctly logged by the application ([INFO] Handling signal: term)
14:58 - Cloud run instantiates a new container instance and starts serving incoming requests (which are logged normally)
The absence of logs during the incident makes it hard to investigate its cause, and at this stage we would be grateful for any kind of lead.
Our service has another known issue that may or may not be related. The service is designed to avoid multiple replicas, as a single one should be able to handle the load and serve concurrent requests (Cloud Run concurrency = 80), but it has a relatively long cold start time (~30s). This leads to 429 errors when a spike of requests arrives while no replica is available (because Cloud Run hard-caps concurrency to 1 during the cold start). This issue was somewhat mitigated by allowing some replication (currently maxScale = 3), since each replica can put a request on hold during the cold start, but it will require some work on the client side to handle correctly (simple retries after the cold start).
I have found this PIT that describes the aforementioned behavior. It seems to happen because a part of Cloud Run thinks that there are already provisioned instances handling the traffic but there aren't. This issue is currently being worked on internally but there's no ETA for a fix at the moment.
The current workaround is to set a maximum number of instances to at least 4.
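For example, assuming a managed service named my-service in us-central1 (names are placeholders):

    gcloud run services update my-service --max-instances=4 --region=us-central1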

In what cases does Google Cloud Run respond with "The request failed because the HTTP connection to the instance had an error."?

We've been running Google Cloud Run for a little over a month now and noticed that we periodically have cloud run instances that simply fail with:
The request failed because the HTTP connection to the instance had an error.
This message is nearly always* preceded by the following message (those are the only messages in the log):
This request caused a new container instance to be started and may thus take longer and use more CPU than a typical request.
* I cannot find, nor recall, a case where that isn't true, but I have not done an exhaustive search.
A few things that may be of importance:
Our concurrency level is set to 1 because our requests can take up to the maximum amount of memory available, 2GB.
We have received errors that we've exceeded the maximum memory, but we've dialed back our usage to obviate that issue.
This message appears to occur shortly after 30 seconds (e.g., 32, 35) and our timeout is set to 75 seconds.
In my case, this error was always thrown 120 seconds after receiving the request. I figured out that the Node 12 default request timeout is 120 seconds. So if you are using a Node server, you can either change the default timeout or update to Node 13, as the default timeout was removed there: https://github.com/nodejs/node/pull/27558.
If your logs didn't catch anything useful, most probably the instance crashed because you are running CPU-heavy tasks. A mention of this can be found on the Google Issue Tracker:
"A common cause for 503 errors on Cloud Run would be when requests use a lot of CPU and as the container is out of resources it is unable to process some requests."
For me the issue got resolved by upgrading Node in the Dockerfile from "FROM node:13.10.1 AS build" to "FROM node:14.10.1 AS build".
