My RabbitTemplate is configured to use a CachingConnectionFactory with
cache mode CONNECTION. In rare cases, a call to
rabbitTemplate.convertAndSend
returns without any error, but the message never reaches the RabbitMQ broker.
A few seconds later, another thread logs:
An unexpected connection driver error occured
and stacktrace:
com.rabbitmq.client.MissedHeartbeatException: Heartbeat missing with heartbeat = 60 seconds
Is there a setting I should enable to make sure the message has reached the broker? At the very least, I would expect an exception to be thrown in the sending thread.
Consider turning on publisher confirms and returns on the CachingConnectionFactory, and setting mandatory on the RabbitTemplate: https://docs.spring.io/spring-amqp/docs/2.1.4.RELEASE/reference/#cf-pub-conf-ret
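A minimal sketch of that configuration, assuming Spring AMQP 2.1.x (the version the link refers to) and a broker reachable at the given host. The class name ConfirmedSender, the 10-second wait, and the error handling are illustrative choices, not something taken from the question. It waits for the broker's confirm in the sending thread, so a lost connection shows up there as a timeout or a nack instead of passing silently:

```java
import java.util.UUID;
import java.util.concurrent.TimeUnit;

import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.connection.CorrelationData;
import org.springframework.amqp.rabbit.core.RabbitTemplate;

// Hypothetical helper class; only the Spring AMQP calls are real API.
public class ConfirmedSender {

    private final RabbitTemplate template;

    public ConfirmedSender(String host) {
        CachingConnectionFactory cf = new CachingConnectionFactory(host);
        cf.setCacheMode(CachingConnectionFactory.CacheMode.CONNECTION);
        cf.setPublisherConfirms(true); // broker acks/nacks every publish
        cf.setPublisherReturns(true);  // broker returns unroutable messages

        template = new RabbitTemplate(cf);
        template.setMandatory(true);   // required for returns to be delivered
        template.setReturnCallback((message, replyCode, replyText, exchange, routingKey) ->
                System.err.println("Returned: " + replyText + " (" + exchange + "/" + routingKey + ")"));
    }

    // Blocks until the broker confirms the publish; throws if it does not.
    public void send(String exchange, String routingKey, Object payload) throws Exception {
        CorrelationData correlation = new CorrelationData(UUID.randomUUID().toString());
        template.convertAndSend(exchange, routingKey, payload, correlation);

        // Assumed timeout; get() throws TimeoutException if no confirm arrives in time.
        CorrelationData.Confirm confirm = correlation.getFuture().get(10, TimeUnit.SECONDS);
        if (!confirm.isAck()) {
            throw new IllegalStateException("Broker nacked the message: " + confirm.getReason());
        }
    }
}
```

Waiting for every confirm costs throughput, so this fits best where losing a message is worse than a slower send. In Spring AMQP 2.2 and later, setPublisherConfirms is deprecated in favor of setPublisherConfirmType.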
Related
We have a setup in which we use spring-amqp transacted channels to push our messages to RabbitMQ. During testing we found that messages were not even being published from spring-amqp to RabbitMQ.
We suspect a failure in metricsCollector.basicPublish(this) in com.rabbitmq.client.impl.ChannelN (no exception is thrown),
because we can see that RabbitUtils.commitIfNecessary(channel) in org.springframework.amqp.rabbit.core.RabbitTemplate is not called when metricsCollector.basicPublish(this) runs into a problem in the same code flow.
We took TCP dumps and could see that the messages were written to the socket towards RabbitMQ, but because the commit did not happen, probably due to a failure in the AMQP API, the messages were not delivered to the corresponding queues.
Jar versions used in the setup:
spring-amqp-2.2.1.RELEASE.jar
spring-rabbit-2.2.1.RELEASE.jar
amqp-client-5.7.3.jar
metrics-core-3.0.2.jar
Is anyone else facing a similar issue?
Can someone please help?
---edit 1
(Setup): We use the same connection factory for flows that run within a parent transaction and for flows that do not.
On further analysis, we found that isChannelLocallyTransacted sometimes behaves inconsistently: ConnectionFactoryUtils.isChannelTransactional(channel, getConnectionFactory()) sometimes holds a reference to a transacted channel and returns true, so isChannelLocallyTransacted evaluates to false and tx.commit never happens. The message is therefore lost before it is committed to RabbitMQ.
My question is related to the question already posted here
It's indicated in the original post that the timeout happens about once a month. In our setup we get it once every 10 seconds, and our production logs are filled with these handshake exception messages. Would setting the handshake timeout value apply to our scenario as well?
Yes. Setting handshake-timeout=0 on the relevant acceptor URL in your broker.xml applies here even with the higher volume of timeouts.
I have a Grails application using the Grails RabbitMQ plugin to handle messages asynchronously. The queue is durable, the messages are persistent, and there are 20 concurrent consumers. Acknowledgement is turned on and is set to issue an ack or a nack depending on whether the consumer returns normally or throws an exception. These consumers usually handle messages fine, but when the queue fills up very quickly (5,000 or so messages at once), some of the messages get lost.
The consumer logs each message as it is received from RabbitMQ, and that log event never occurs for the lost messages, so the consumer is not receiving them at all. There are also no exceptions in the logs.
I have tried increasing the prefetch value of the consumers to 5 (from 1), but that did not solve the problem. I have checked the RabbitMQ UI and there are no messages stuck in the queue and there are no unacknowledged messages.
We've been running Google Cloud Run for a little over a month now and have noticed that Cloud Run instances periodically fail with:
The request failed because the HTTP connection to the instance had an error.
This message is nearly always* preceded by the following message (those are the only messages in the log):
This request caused a new container instance to be started and may thus take longer and use more CPU than a typical request.
* I cannot find, nor recall, a case where that isn't true, but I have not done an exhaustive search.
A few things that may be of importance:
Our concurrency level is set to 1 because our requests can take up to the maximum amount of memory available, 2GB.
We have received errors that we've exceeded the maximum memory, but we've dialed back our usage to obviate that issue.
This message appears to occur shortly after 30 seconds (e.g., 32, 35) and our timeout is set to 75 seconds.
In my case, this error was always thrown 120 seconds after the request was received. It turned out that the default request timeout in Node 12 is 120 seconds. So if you are running a Node server, you can either change the default timeout or upgrade to Node 13, where the default timeout was removed: https://github.com/nodejs/node/pull/27558.
If your logs didn't catch anything useful, the instance is most probably crashing because you run heavy CPU tasks. This is mentioned on the Google Issue Tracker:
A common cause for 503 errors on Cloud Run would be when requests use
a lot of CPU and as the container is out of resources it is unable to
process some requests
For me, the issue was resolved by upgrading Node in the Dockerfile from "FROM node:13.10.1 AS build" to "FROM node:14.10.1 AS build".
We have a connection pool for an embedded Derby database. We set:
max wait time: 5 seconds
max connections in pool: 100
We very frequently get org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object. Whenever this exception occurs, the number of connections owned by the application is always 1, as is evident from the logs.
The above exception states that the pool manager cannot produce a viable connection for a waiting requester and that maxWait has passed, which triggers a timeout.
Ref: Cannot get a connection, pool error Timeout waiting for idle object in PutSQL?
There is 1 application using derby, the Derby database, and 2 other applications.
As per my understanding, the following are the main reasons for not getting a connection:
There is a network issue
Connection pool has been exhausted, because of connection leak
Connection pool getting exhausted, because of long-running queries
In our case, it's an embedded Derby database that is local to the application,
so a network issue is ruled out. There are no long-running queries either.
I am not able to figure out what is causing the wait timeout. Could it be related to the OS, the filesystem, high server utilization, etc.?
Any help is appreciated.
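To illustrate the connection-leak cause listed above, here is a small stand-alone sketch using only the JDK. TinyPool and PoolLeakDemo are hypothetical stand-ins, not the commons-dbcp API; the pool size of 3 and the 50 ms wait are arbitrary. It shows that a caller which never returns what it borrows runs into "Timeout waiting for idle object" as soon as the pool is empty, regardless of network or query duration:

```java
import java.util.NoSuchElementException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical bounded pool: one permit per pooled connection.
final class TinyPool {
    private final Semaphore idle;
    private final long maxWaitMillis;

    TinyPool(int maxActive, long maxWaitMillis) {
        this.idle = new Semaphore(maxActive, true);
        this.maxWaitMillis = maxWaitMillis;
    }

    // Waits up to maxWaitMillis for a free slot, then fails like an exhausted pool.
    void borrow() {
        boolean acquired;
        try {
            acquired = idle.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        }
        if (!acquired) {
            throw new NoSuchElementException("Timeout waiting for idle object");
        }
    }

    // Must be called once per successful borrow(); skipping it is the leak.
    void giveBack() {
        idle.release();
    }

    int idleCount() {
        return idle.availablePermits();
    }
}

final class PoolLeakDemo {
    // Returns the 1-based attempt that first timed out, or -1 if none did.
    static int firstFailure(TinyPool pool, int attempts, boolean leak) {
        for (int i = 1; i <= attempts; i++) {
            try {
                pool.borrow();
            } catch (NoSuchElementException timeout) {
                return i;
            }
            if (!leak) {
                pool.giveBack();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstFailure(new TinyPool(3, 50), 10, false)); // prints -1
        System.out.println(firstFailure(new TinyPool(3, 50), 10, true));  // prints 4
    }
}
```

If a leak is the suspect, a code path that skips close() on an exception is the usual place to look. commons-dbcp 1.x also has removeAbandoned and logAbandoned settings on BasicDataSource that can help locate where a leaked connection was borrowed, though whether they fit this setup is an assumption.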