Background: I am using GCP Dataflow (https://cloud.google.com/dataflow) in streaming mode to process information. Dataflow receives data from a GAE endpoint through Pub/Sub. I am using Apache Beam's wrapper function for subscribing, beam.io.ReadFromPubSub(...). The Dataflow job runs continuously, and the Pub/Sub subscription waits for new data.
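For context, the subscription side of the pipeline is essentially the following minimal sketch (the project and subscription names are placeholders, and the real processing steps are omitted):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Streaming pipeline; the subscription path below is a placeholder.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/<project>/subscriptions/<subscription>")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "Process" >> beam.Map(print)  # placeholder for the real processing
    )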
Issue: I have noticed that, after some days (with no fixed pattern in how many), the pipeline stops receiving any messages. The publishing side in GAE works fine, as I can see the message ID returned to me.
Troubleshooting: when I restart the Dataflow job (by cloning it), it works again.
Suspicion: since I am relying on Beam's API, I am not sure what is happening underneath. Is this a bug in beam.io.ReadFromPubSub or a bug in Dataflow?
Note: I tried https://cloud.google.com/pubsub/docs/troubleshooting, but found no relevant help.
Related
We use Google Cloud Run to wrap an analysis developed in R behind a web API. For this, we have a small Fastify app that launches an R script and uploads the results to Google Cloud Storage. The process' stdout and stderr are written to a file and are also uploaded at the end of the analysis.
However, we sometimes run into issues when a process takes longer to execute than expected. In these cases, we fail to upload anything, and it's difficult to debug because stdout and stderr are "lost" on the instance. The only thing we see in the Cloud Run logs is this message:
The request has been terminated because it has reached the maximum request timeout
Is there a recommended way to handle a request timeout?
In App Engine there used to be a descriptive error: DeadlineExceededError for Python and DeadlineExceededException for Java.
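For comparison, in the first-generation App Engine Python runtime that deadline could be caught explicitly (a minimal sketch; run_analysis and save_partial_results are hypothetical placeholders):

from google.appengine.runtime import DeadlineExceededError

def handle():
    try:
        run_analysis()            # hypothetical long-running work
    except DeadlineExceededError:
        save_partial_results()    # hypothetical salvage step
        raise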
We are currently evaluating the following approach:
Explicitly set Cloud Run's request timeout
Provide the same value as an environment variable, so it's available to the container
When receiving a request, we start a timer that calls a "cleanup" function just before the timeout is exceeded
The cleanup function stops the running analysis and uploads the current stdout and stderr files to Cloud Storage
This feels a little complicated, so any feedback is very appreciated.
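To make the idea concrete, here is a rough sketch of the timer-based cleanup in Python (the actual app is a Fastify/Node service, so this is purely an illustration; the environment variable name, the process handle, and upload_logs are placeholders):

import os
import threading

# Same value as Cloud Run's request timeout, passed in as an environment variable.
REQUEST_TIMEOUT_S = int(os.environ.get("REQUEST_TIMEOUT_S", "600"))
SAFETY_MARGIN_S = 15  # leave time to upload the logs before the request is killed

def handle_request(analysis_process, upload_logs):
    def cleanup():
        # Stop the running analysis and salvage whatever output exists so far.
        analysis_process.terminate()
        upload_logs()  # push the current stdout/stderr files to Cloud Storage

    timer = threading.Timer(REQUEST_TIMEOUT_S - SAFETY_MARGIN_S, cleanup)
    timer.start()
    try:
        analysis_process.wait()  # e.g. a subprocess.Popen handle for the R script
        upload_logs()
    finally:
        timer.cancel()  # harmless if the timer already fired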
Since the default timeout is 5 minutes and can be extended up to 60 minutes, I would simply start by increasing it to 10 minutes, then observe over the course of a month how that affects your service.
Aside from that fix, I would start investigating why your process is taking longer than expected, and whether it is perhaps due to a forever-growing result set.
If there is no result-set scalability concern, then bumping the default timeout up from 5 minutes seems to be the most reasonable and simple fix. It would only become a problem again once your script has to deal with more data in the future for some reason.
The goal is to receive messages over MQTT on an IoT device that comes out of deep sleep periodically. The exact same considerations apply to an OTA update as to any other parameter update; in my case I ultimately want to use this for both.
Progress
It runs
The device wakes for about 15 seconds. If, during that time, I publish a bunch of messages to the relevant topic, the messages arrive successfully. Inside the AWS console I can publish to:
$aws/things/<device-name>/shadow/update/delta
{
  "state": {
    "desired": {
      "output": true
    }
  }
}
And the delta callback function runs for 'output'. Great, but of no practical use to anyone.
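For reference, the device-side behaviour described above boils down to subscribing to the delta topic and reacting to the desired state. A sketch with paho-mqtt, purely for illustration (the device itself uses the embedded C SDK; the endpoint and certificate paths are placeholders):

import json
import paho.mqtt.client as mqtt

DELTA_TOPIC = "$aws/things/<device-name>/shadow/update/delta"

def on_connect(client, userdata, flags, rc):
    client.subscribe(DELTA_TOPIC, qos=1)

def on_delta(client, userdata, msg):
    delta = json.loads(msg.payload)
    state = delta.get("state", {})
    if "output" in state:
        # Apply the desired value here, then report it back via .../shadow/update.
        print("desired output:", state["output"])

client = mqtt.Client()
client.on_connect = on_connect
client.message_callback_add(DELTA_TOPIC, on_delta)
client.tls_set(ca_certs="AmazonRootCA1.pem", certfile="device.crt", keyfile="device.key")
client.connect("<your-iot-endpoint>.iot.<region>.amazonaws.com", 8883)
client.loop_forever()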
IoT Job
I created a custom AWS IoT job in the console in an effort to overcome the problem. My thinking was that it might retain the message and ensure delivery. I have been running the job for the past half hour, but so far nothing has come through. It had a timeout of 20, yet it is still stuck in Queued, not even In Progress yet... So there is clearly a flaw in this approach.
AWS CLI test
Just for completeness, I've attempted to fire off the MQTT message from the command line. This has the benefit that you can specify the QoS, (in theory) ensuring that it gets delivered at least once.
aws iot-data publish --topic "$aws/things/<device-name>/shadow/update/delta" --qos 1 --payload file://Downloads/outputTrue.json --cli-binary-format raw-in-base64-out
But oddly, this didn't seem to work at all: I didn't see the message arrive at the broker when subscribing in the console's MQTT test client.
AWS IoT Core does not support retained messages; see the documentation:
The MQTT specification provides a provision for the publisher to request that the broker retain the last message sent to a topic and send it to all future topic subscribers. AWS IoT doesn't support retained messages. If a request is made to retain messages, the connection is disconnected.
As the wake-up times are periodic, a possible approach could be to publish your device's next wake-up slot to a separate topic that your backend listens to. Your backend would then publish the desired information to your device topic once the slot opens up.
Of course, this approach is quite fragile with respect to latency and network stability.
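A hypothetical sketch of the backend side of that idea, using boto3 (the application topic and the helper name are placeholders; you would call this when the published wake-up slot opens up):

import json
import boto3

iot = boto3.client("iot-data")

def publish_desired_update(device_name, desired):
    # QoS 1 so the message is delivered at least once while the device
    # is awake and connected.
    iot.publish(
        topic="devices/{}/config".format(device_name),  # placeholder topic
        qos=1,
        payload=json.dumps(desired).encode("utf-8"),
    )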
Time to share the answer I found from piecing together numerous posts and reaching out to the very helpful AWS support team. This link is the one that really covers it:
https://docs.aws.amazon.com/iot/latest/developerguide/jobs-devices.html#jobs-workflow-device-online
My summarised pseudocode is:
1. init() & connect() to mqtt as before.
2. Subscribe to the following topics & create callback function for each:
a. Get pending.
b. Notify next.
c. Get next.
d. Update rejected.
e. Update accepted.
3. Create Publish topics:
a. Get pending.
b. Get Next.
4. The pending topics are optional, but necessary if you need to handle many jobs and select between them.
5. Call aws_iot_jobs_describe() to publish a request for the next job. The response comes back through the 'notify next' callback.
6. In that callback, grab the job document, execute the job, and report success or failure.
7. Done.
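A condensed sketch of this flow, written with paho-mqtt purely for illustration (the real device uses the esp-aws-iot C SDK; the thing name, endpoint and certificate paths are placeholders, and the topic names follow the AWS IoT Jobs MQTT API):

import json
import ssl
import paho.mqtt.client as mqtt

THING = "<device-name>"  # placeholder thing name
NOTIFY_NEXT = "$aws/things/{}/jobs/notify-next".format(THING)
DESCRIBE_NEXT = "$aws/things/{}/jobs/$next/get".format(THING)

def on_connect(client, userdata, flags, rc):
    # Step 2: subscribe to the job topics before asking for work.
    client.subscribe([(NOTIFY_NEXT, 1),
                      (DESCRIBE_NEXT + "/accepted", 1),
                      (DESCRIBE_NEXT + "/rejected", 1)])
    # Step 5: ask the Jobs service to describe the next pending job.
    client.publish(DESCRIBE_NEXT, json.dumps({"clientToken": "wake-1"}), qos=1)

def on_message(client, userdata, msg):
    doc = json.loads(msg.payload)
    execution = doc.get("execution")
    if not execution:
        return  # no pending job for this device
    job_id = execution["jobId"]
    # Step 6: execute the job document here, then report the result.
    client.publish("$aws/things/{}/jobs/{}/update".format(THING, job_id),
                   json.dumps({"status": "SUCCEEDED"}), qos=1)

client = mqtt.Client()
client.tls_set(ca_certs="AmazonRootCA1.pem", certfile="device.crt",
               keyfile="device.key", tls_version=ssl.PROTOCOL_TLSv1_2)
client.on_connect = on_connect
client.on_message = on_message
client.connect("<your-iot-endpoint>.iot.<region>.amazonaws.com", 8883)
client.loop_forever()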
There is a helpful example in esp-aws-iot/samples/linux/jobs-samples/jobs_sample.c. You need to copy some of the constants over from the sample aws_iot_config.h.
Once you've done all of this, you are able to use AWS Jobs to manage your OTA roll out, which was the original intent.
I'm using KafkaIO in dataflow to read messages from one topic. I use the following code.
KafkaIO.<String, String>read()
    .withReadCommitted()
    .withBootstrapServers(endPoint)
    .withConsumerConfigUpdates(new ImmutableMap.Builder<String, Object>()
        .put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true)
        .put(ConsumerConfig.GROUP_ID_CONFIG, groupName)
        .put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 8000)
        .put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, 2000)
        .build())
    // .commitOffsetsInFinalize()
    .withTopics(Collections.singletonList(topicNames))
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    .withoutMetadata();
I run the Dataflow program locally using the direct runner, and everything works fine. Then I run another instance of the same program in parallel, i.e. another consumer. Now I see duplicate messages being processed in the pipeline.
Even though I have provided a consumer group ID, starting another consumer with the same consumer group ID (a different instance of the same program) shouldn't process the same elements that are processed by the other consumer, right?
How does this turn out when using the Dataflow runner?
I don't think the options you have set guarantee non-duplicate delivery of messages across pipelines.
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG: this is a flag for the Kafka consumer, not for the Beam pipeline itself. It seems to be best-effort and periodic, so you might still see duplicates across multiple pipelines.
withReadCommitted(): this just means that Beam will not read uncommitted messages. Again, it will not prevent duplicates across multiple pipelines.
See here for the protocol the Beam source uses to determine the starting point of the Kafka source.
To guarantee non-duplicate delivery, you probably have to read from different topics or different subscriptions.
I am trying to use x-ray to trace requests which use an SNS-SQS fanout pattern.
The request comes in through API Gateway (Lambda proxy integration), is published to SNS, and is delivered to a subscribed SQS queue with a Lambda trigger that receives the messages for further processing.
However, the trace stops at SNS.
Unfortunately we do not support this architecture today. The issue is that the trace information from the starting request (APIG in this case) is lost once the SNS message is invoked. There currently isn't a workaround for this behavior. We are working with SNS and SQS to provide a better user experience and support for these cases. Please stay tuned for more.
There is a streaming Dataflow job running on Google Cloud (Apache Beam 2.5). The job was showing some system lag, so I tried to update it with the --update flag. Now the old job is in the Updating state and the new job initiated by the update is in the Pending state.
At this point everything is stuck, and I am unable to stop or cancel the jobs. The old job is still in the Updating state and no status-change operation is permitted; I tried to change its state using gcloud dataflow jobs cancel and the REST API, but it says the job cannot be updated because it is in the RELOAD state. The newly initiated job is in a not-started/pending state, and I am unable to change its state either; it says the job is not in a condition to perform this operation.
Please let me know how to stop/cancel/delete this streaming Dataflow job.
Did you try to cancel the job from both the gcloud command-line tool and the web console UI? If nothing works, I think you need to contact Google Cloud Support.