Message Loss (message sent to spring-amqp doesn't get published to rabbitmq) - spring-amqp

We are having a setup where we are using spring-amqp transacted channels to push our messages to RabbitMq. During a testing we found that messages were not even getting published from spring-amqp to rabbitmq;
we suspect metricsCollector.basicPublish(this) in com.rabbitmq.client.impl.ChannelN failure(no exception is thrown).
because we can see that RabbitUtils.commitIfNecessary(channel) in org.springframework.amqp.rabbit.core.RabbitTemplate is not getting called when there is an issue executing metricsCollector.basicPublish(this) for the same code flow.
We have taken TCP dumps and could see that message were written to stream/socket on rabbitmq, but since commit didn't happen due to an a probable amqp api failure the messages were not delivered to corresponding queues.
Jars Version Being used in the setup:-
spring-amqp-2.2.1.RELEASE.jar,
spring-rabbit-2.2.1.RELEASE.jar
amqp-client-5.7.3.jar,
metrics-core-3.0.2.jar
Is anyone facing the similar issue?
Can someone please help.
---edit 1
(Setup) :- We are using same connection Factory for flows with parent transaction and flows not running with parent transactions
On further analyzing the issue , we found that isChannelLocallyTransacted is sometimes showing in-consistent behavior because ConnectionFactoryUtils.isChannelTransactional(channel, getConnectionFactory() is sometimes having a reference to transacted channel (returns true hence expression isChannelLocallyTransacted evaluates to false) due to which tx.commit never happens; so message gets lost before getting committed to RabbitMQ.

Related

MQTT Subscribe / OTA Update Deep Sleep / ESP32 / FreeRTOS

The goal is to receive messages over MQTT in an IoT device that comes out of deep sleep periodically. The exact same considerations exist for OTA update as for any other parameter update. In my case, ultimately, I want to use this for both.
Progress
It runs
The device wakes for about 15 seconds. If during that time, I publish a bunch of messages to the relevant topic, the message arrived successfully. Inside the AWS console I can publish to :
$aws/things/<device-name>/shadow/update/delta
{
"state":{
"desired":{
"output":true
}
}
}
And the delta callback function runs for 'output'. Great but no practical use to anyone.
IoT Job
I created a custom AWS IoT job in the console in an effort to overcome the problem. My thinking was that it might retain the message to ensure delivery. I've been running the job for the past half hour but so far nothing has come through. It had a 20 timeout but is still stuck in queued, not even in progress yet... So, there is clearly a flaw in this approach.
AWS CLI test
Just for completeness, I've attempted to fire off the MQTT message from the console. It has the benefit that you can specify the QOS, (in theory) ensuring that it gets delivered at least once.
aws iot-data publish --topic "$aws/things/<device-name>/shadow/update/delta" --qos 1 --payload file://Downloads/outputTrue.json --cli-binary-format raw-in-base64-out
But oddly this didn't seem to work at all. I didn't see the message arrive at the broker at all: subscribing in the console test.
AWS IoT Core does not support retained messages, see here.
The MQTT specification provides a provision for the publisher to request that the broker retain the last message sent to a topic and send it to all future topic subscribers. AWS IoT doesn't support retained messages. If a request is made to retain messages, the connection is disconnected.
As the wake-up times are perriodically, a possible approach could be to publish the next wake-up slot of your device in a separate topic where your backend is listening to. Your backend will then publish the desired information to your device-topic once the slot opens up.
Of course this approach is quite fragile concerning latency and network stability.
Time to share the answer I found from piecing together numerous posts and reaching out to the very helpful AWS support team. This link is the one that really covers it:
https://docs.aws.amazon.com/iot/latest/developerguide/jobs-devices.html#jobs-workflow-device-online
My summarised pseudo code is :
1. init() & connect() to mqtt as before.
2. Subscribe to the following topics & create callback function for each:
a. Get pending.
b. Notify next.
c. Get next.
d. Update rejected.
e. Update accepted.
3. Create Publish topics:
a. Get pending.
b. Get Next.
4. Pending topics = optional. But necessary to handle many tasks and select between them.
5. Aws-iot-jobs-describe() to publish a request for the next job. It links up to the notify next callback (somehow).
6. In the callback, grab job document, execute job & report Success / Failure.
7. Done.
There is a helpful example in esp-aws-iot/samples/linux/jobs-samples/jobs_sample.c. You need to copy some of the constants over from the sample aws_iot_config.h.
Once you've done all of this, you are able to use AWS Jobs to manage your OTA roll out, which was the original intent.

ksqldb cli failed to teminate a query:Timeout while waiting for command topic consumer to process command topic

I want to terminate a query to drop a table. But i got the below error, and after while the query is terminated, the ksql log don't print any error message. How can i find the root cause?
ksql> terminate CTAS_KSQL1_TABLE_SACMES_PACK_STATS_275;
Could not write the statement 'terminate CTAS_KSQL1_TABLE_SACMES_PACK_STATS_275;' into the command topic.
Caused by: Timeout while waiting for command topic consumer to process command topic
Looks like you may have run into a bug in older versions of ksqlDB. Maybe this one: https://github.com/confluentinc/ksql/issues/4267
The general issue is that the query gets into a state where it can't close down cleanly. What's blocking the shutdown does eventually complete or timeout. In the case of issue #4267 above, the issue was that the sink topic, i.e. the topic ksqlDB is writing to, has been deleted out-of-band, i.e. by something other than ksqlDB, and ksqlDB is stuck trying to get metadata for a non-existent topic. Did you delete the sink topic?
There were others resolved issues too that I can't find.
Bouncing the server after issuing the terminate should clean up the stuck query. Though it's a pretty severe workaround!
Upgrading to a later version, something released after May 2020, the issue should be resolved.

Grails RabbitMQ losing messages when many messages at once

I have a Grails application using the grails RabbitMQ plugin to handle messages asynchronously. The queue is set to durable, the messages are persistent, and there are 20 concurrent consumers. Acknowledgement is turned on and is set to issue an ack/nack based on if the consumer returns normally or throws an exception. These consumers usually handle messages fine, but when the queue fills up very quickly (5,000 or so messages at once) some of the messages get lost.
There is logging in the consumer when the message from Rabbit is received and that logging event never occurs, so the consumer is not receiving the lost messages at all. Further, there are no exceptions that appear in the logs.
I have tried increasing the prefetch value of the consumers to 5 (from 1), but that did not solve the problem. I have checked the RabbitMQ UI and there are no messages stuck in the queue and there are no unacknowledged messages.

RabbitMq: Message lost from queue when exceptions occurred during message processing

I am ruby on rails developer. i am using rabbitMQ in my project to processed some data as soon as the data comes in queue. i am using bunny gem a rabbitMQ client that provide interface to interact with RabbitMq.
My issue is that whenever an exceptions occurred or server stops unexpectedly while processing data from queue my message from the queue is lost.
I want to know how people deal with lost messages from the rabbitMQ queue. is there any way to get those messages back for processing.
There is no way to get the messages back when they're lost. Maybe you could try and track down some entries in RMQ's database cache - but that's just a wild guess/long shot and I don't think that it will help.
What you do need to do for the future is:
in case you are using a single server: make the queues and messages durable, and explicitly acknowledge (so switch off the auto-ACK flag) messages on consumer side only once they're processed.
in case you are using cluster of RMQ nodes (which is of course recommended exactly to avoid these situations): set up queue mirroring
Take a look at RMQ persistance and high availability.

Deploying an SQS Consumer

I am looking to run a service that will be consuming messages that are placed into an SQS queue. What is the best way to structure the consumer application?
One thought would be to create a bunch of threads or processes that run this:
def run(q, delete_on_error=False):
while True:
try:
m = q.read(VISIBILITY_TIMEOUT, wait_time_seconds=MAX_WAIT_TIME_SECONDS)
if m is not None:
try:
process(m.id, m.get_body())
except TransientError:
continue
except Exception as ex:
log_exception(ex)
if not delete_on_error:
continue
q.delete_message(m)
except StopIteration:
break
except socket.gaierror:
continue
Am I missing anything else important? What other exceptions do I have to guard against in the queue read and delete calls? How do others run these consumers?
I did find this project, but it seems stalled and has some issues.
I am leaning toward separate processes rather than threads to avoid the the GIL. Is there some container process that can be used to launch and monitor these separate running processes?
There are a few things:
The SQS API allows you to receive more than one message with a single API call (up to 10 messages, or up to 256k worth of messages, whichever limit is hit first). Taking advantage of this feature allows you to reduce costs, since you are charged per API call. It looks like you're using the boto library - have a look at get_messages.
In your code right now, if processing a message fails due to a transient error, the message won't be able to be processed again until the visibility timeout expires. You might want to consider returning the message to the queue straight away. You can do this by calling change_visibility with 0 on that message. The message will then be available for processing straight away. (It might seem that if you do this then the visibility timeout will be permanently changed on that message - this is actually not the case. The AWS docs state that "the visibility timeout for the message the next time it is received reverts to the original timeout value". See the docs for more information.)
If you're after an example of a robust SQS message consumer, you might want to check out NServiceBus.AmazonSQS (of which I am the author). (C# - sorry, I couldn't find any python examples.)

Resources