I am trying to use x-ray to trace requests which use an SNS-SQS fanout pattern.
The request comes from API GW, lambda proxy integration, published to SNS and delivered to a subscribed SQS which has a lambda trigger which receives the messages for further processing.
However the trace stops at SNS.
Unfortunately we do not support this architecture today. The issue is that the trace information from the starting request (APIG in this case) is lost once the SNS message is invoked. There currently isn't a workaround for this behavior. We are working with SNS and SQS to provide a better user experience and support for these cases. Please stay tuned for more.
Related
Currently we're using Hangfire for scheduling and running long lived tasks. We need these tasks to be able to be retried in the event of an ungraceful shutdown, which Hangfire handles for us.
We're looking to try and move to a producer/consumer model and I've built a basic prototype with Masstransit and AWS SQS, but I have some concerns about how to handle the event of a task being processed during an ungraceful shutdown.
I understand that eventually the SQS visibility timeout will expire and the queued item will be picked up for processing again, but setting that timeout isn't trivial as the length of tasks can be quite varied and I'd prefer if the task could immediately resume/retry processing when the application starts up again.
I got reading about Job Consumers and they seemed to be better fitted to this type of scenario, but all the examples I've seen are using RabbitMQ. Wondering if it's possible/appropriate to do this using SQS, or if there's a better approach?
Thank you for taking the time to read this question :)
MassTransit will extend the visibility timeout as long as the consumer is still running.
I believe SQS has an upper-limit of something like 12 hours, but you should look it up and find out.
Job Consumers have significantly greater requirements (sagas, temporary queues, etc.) and SQS is really annoying about not having auto-delete/expiring queues, so I'd stick to a regular consumer if you can swing it.
The goal is to receive messages over MQTT in an IoT device that comes out of deep sleep periodically. The exact same considerations exist for OTA update as for any other parameter update. In my case, ultimately, I want to use this for both.
Progress
It runs
The device wakes for about 15 seconds. If during that time, I publish a bunch of messages to the relevant topic, the message arrived successfully. Inside the AWS console I can publish to :
$aws/things/<device-name>/shadow/update/delta
{
"state":{
"desired":{
"output":true
}
}
}
And the delta callback function runs for 'output'. Great but no practical use to anyone.
IoT Job
I created a custom AWS IoT job in the console in an effort to overcome the problem. My thinking was that it might retain the message to ensure delivery. I've been running the job for the past half hour but so far nothing has come through. It had a 20 timeout but is still stuck in queued, not even in progress yet... So, there is clearly a flaw in this approach.
AWS CLI test
Just for completeness, I've attempted to fire off the MQTT message from the console. It has the benefit that you can specify the QOS, (in theory) ensuring that it gets delivered at least once.
aws iot-data publish --topic "$aws/things/<device-name>/shadow/update/delta" --qos 1 --payload file://Downloads/outputTrue.json --cli-binary-format raw-in-base64-out
But oddly this didn't seem to work at all. I didn't see the message arrive at the broker at all: subscribing in the console test.
AWS IoT Core does not support retained messages, see here.
The MQTT specification provides a provision for the publisher to request that the broker retain the last message sent to a topic and send it to all future topic subscribers. AWS IoT doesn't support retained messages. If a request is made to retain messages, the connection is disconnected.
As the wake-up times are perriodically, a possible approach could be to publish the next wake-up slot of your device in a separate topic where your backend is listening to. Your backend will then publish the desired information to your device-topic once the slot opens up.
Of course this approach is quite fragile concerning latency and network stability.
Time to share the answer I found from piecing together numerous posts and reaching out to the very helpful AWS support team. This link is the one that really covers it:
https://docs.aws.amazon.com/iot/latest/developerguide/jobs-devices.html#jobs-workflow-device-online
My summarised pseudo code is :
1. init() & connect() to mqtt as before.
2. Subscribe to the following topics & create callback function for each:
a. Get pending.
b. Notify next.
c. Get next.
d. Update rejected.
e. Update accepted.
3. Create Publish topics:
a. Get pending.
b. Get Next.
4. Pending topics = optional. But necessary to handle many tasks and select between them.
5. Aws-iot-jobs-describe() to publish a request for the next job. It links up to the notify next callback (somehow).
6. In the callback, grab job document, execute job & report Success / Failure.
7. Done.
There is a helpful example in esp-aws-iot/samples/linux/jobs-samples/jobs_sample.c. You need to copy some of the constants over from the sample aws_iot_config.h.
Once you've done all of this, you are able to use AWS Jobs to manage your OTA roll out, which was the original intent.
I am using a SNS to send out notifications to SQS in another AWS account. There is a DLQ configured to this SQS listener for processing failure of messages in SQS. Is it feasible to use same DLQ for handling SNS subscription failures as well?
While you certainly can use the same SQS as a DLQ for both the SNS topic and the SQS subscription, doing so might not be advisable. You'd be writing two different kinds of messages to the same queue, which might need different types of processing. The DLQ queue would also need a wider set of permissions to accommodate multiple writers. I would recommend using separate DLQs for the two different use cases.
Background : I am using a GCP dataflow (https://cloud.google.com/dataflow) in streaming mode to process information. Dataflow receives data from GAE Endpoint through Pub/Sub. I am using apache beam wrapper function for subscribing -
beam.io.ReadFromPubSub(...). Dataflow is running always, and pub/sub subscription is waiting for new data.
Issue : I am noticing that, after some days (no fixed pattern on how many days), it stops receiving any message. Publishing part in GAE works fine, as I can see message Id returned to me.
Troubleshoot : when I restart dataflow (by cloning), it works again.
Suspect : Since I am relying on beam's api, I am not sure what's happening underneath. Is this a bug with beam.io.ReadFromPubSub or bug with dataflow.
Note : Tried https://cloud.google.com/pubsub/docs/troubleshooting, but no relevant help.
According to the documentation:
Q: How many times will I receive each message?
Amazon SQS is
engineered to provide “at least once” delivery of all messages in its
queues. Although most of the time each message will be delivered to
your application exactly once, you should design your system so that
processing a message more than once does not create any errors or
inconsistencies.
Is there any good practice to achieve the exactly-once delivery?
I was thinking about using the DynamoDB “Conditional Writes” as distributed locking mechanism but... any better idea?
Some reference to this topic:
At-least-once delivery (Service Behavior)
Exactly-once delivery (Service Behavior)
FIFO queues are now available and provide ordered, exactly once out of the box.
https://aws.amazon.com/sqs/faqs/#fifo-queues
Check your region for availability.
The best solution really depends on exactly how critical it is that you not perform the action suggested in the message more than once. For some actions such as deleting a file or resizing an image it doesn't really matter if it happens twice, so it is fine to do nothing. When it is more critical to not do the work a second time I use an identifier for each message (generated by the sender) and the receiver tracks dups by marking the ids as seen in memchachd. Fine for many things, but probably not if life or money depends on it, especially if there a multiple consumers.
Conditional writes sound like a clever solution, but it has me wondering if perhaps AWS isn't such a great solution for your problem if you need a bullet proof exactly-once solution.
Another alternative for distributed locking is Redis cluster, which can also be provisioned with AWS ElasticCache. Redis supports transactions which guarantee that concurrent calls will get executed in sequence.
One of the advantages of using cache is that you can set expiration timeouts, so if your message processing fails the lock will get timed release.
In this blog post the usage of a low-latency control database like Amazon DynamoDB is also recommended:
https://aws.amazon.com/blogs/compute/new-for-aws-lambda-sqs-fifo-as-an-event-source/
Amazon SQS FIFO queues ensure that the order of processing follows the
message order within a message group. However, it does not guarantee
only once delivery when used as a Lambda trigger. If only once
delivery is important in your serverless application, it’s recommended
to make your function idempotent. You could achieve this by tracking a
unique attribute of the message using a scalable, low-latency control
database like Amazon DynamoDB.
In short - we can put item or update item in dynamodb table with condition expretion attribute_not_exists(for put) or if_not_exists(for update), please check example here
https://stackoverflow.com/a/55110463/9783262
If we get an exception during put/update operations, we have to return from a lambda without further processing, if not get it then process the message (https://aws.amazon.com/premiumsupport/knowledge-center/lambda-function-idempotent/)
The following resources were helpful for me too:
https://ably.com/blog/sqs-fifo-queues-message-ordering-and-exactly-once-processing-guaranteed
https://aws.amazon.com/blogs/aws/introducing-amazon-sns-fifo-first-in-first-out-pub-sub-messaging/
https://youtu.be/8zysQqxgj0I