I have an issue regarding Google Dataflow.
I'm writing a Dataflow pipeline which reads data from Pub/Sub and writes to BigQuery, and it works.
Now I have to handle late data. I was following some examples on the internet, but it's not working properly. Here is my code:
pipeline.apply(PubsubIO.readStrings()
        .withTimestampAttribute("timestamp")
        .fromSubscription(Constants.SUBSCRIBER))
    .apply(ParDo.of(new ParseEventFn()))
    .apply(Window.<Entity>into(FixedWindows.of(WINDOW_SIZE))
        // processing of late data
        .triggering(
            AfterWatermark
                .pastEndOfWindow()
                .withEarlyFirings(
                    AfterProcessingTime
                        .pastFirstElementInPane()
                        .plusDelayOf(DELAY_SIZE))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(ALLOW_LATE_SIZE)
        .accumulatingFiredPanes())
    .apply(ParDo.of(new ParseTableRow()))
    .apply("Write to BQ", BigQueryIO.<TableRow>write()...
Here is my pubsub message:
{
...,
"timestamp" : "2015-08-31T09:52:25.005Z"
}
When I manually push some messages (go to the Pub/Sub topic and publish) with a timestamp far older than ALLOW_LATE_SIZE allows, these messages still get passed through instead of being dropped.
You should specify the allowed lateness explicitly using a Duration object, e.g. .withAllowedLateness(Duration.standardMinutes(ALLOW_LATE_SIZE)), assuming you have set the value of ALLOW_LATE_SIZE in minutes.
You may check the documentation page for "Google Cloud Dataflow SDK for Java", specifically the "Triggers" sub-chapter.
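For reference, here is a minimal sketch of the windowing step with the Durations spelled out. It assumes DELAY_SIZE and ALLOW_LATE_SIZE are integer minute values and that WINDOW_SIZE is already a Duration, as in your code; swap in standardHours/standardDays as needed:

.apply(Window.<Entity>into(FixedWindows.of(WINDOW_SIZE))
    .triggering(AfterWatermark.pastEndOfWindow()
        // Speculative early results while the window is still open.
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(DELAY_SIZE)))
        // One extra firing per pane of elements that arrive late but within the allowed lateness.
        .withLateFirings(AfterPane.elementCountAtLeast(1)))
    // Data arriving later than this past the end of the window is dropped.
    .withAllowedLateness(Duration.standardMinutes(ALLOW_LATE_SIZE))
    .accumulatingFiredPanes())

Also note that withTimestampAttribute("timestamp") reads the Pub/Sub message attribute named "timestamp" (expected as milliseconds since the epoch or an RFC 3339 string), not the field inside the JSON body, so messages you publish manually from the console need that attribute set for the watermark and lateness logic to see your timestamps at all.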
I am trying to collect energy generation statistics like watts and watt-hours from an external API. I have an external REST API endpoint available for it.
Is there a way in ThingsBoard, using a rule chain, to call the external endpoint and store its response as telemetry data? Later I want to show this data in dashboards.
I know it has been a long time, but ThingsBoard's documentation is lacking and this might be useful for someone else.
You'd have to use the REST API CALL external node (https://thingsboard.io/docs/user-guide/rule-engine-2-0/external-nodes/#rest-api-call-node)
If the node call is successful, it will output its outbound message containing the HTTP response, with the metadata containing:
- metadata.status
- metadata.statusCode
- metadata.statusReason
and with the payload of the message containing the response body from your external REST service (i.e. your stored telemetry).
You then have to use a script transformation node to format the metadata, payload and msgType into the POST_TELEMETRY_REQUEST message format; see: https://thingsboard.io/docs/user-guide/rule-engine-2-0/overview/#predefined-message-types
Your external REST API should provide the correct "deviceName" and "deviceType", as well as the "ts" in UNIX milliseconds.
Notice that you also need to change the messageType (msgType return variable) to "POST_TELEMETRY_REQUEST".
Finally, route the result to the Save Timeseries action node and it will be stored as telemetry for the specified device. Hope this helps.
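Since the script node is the step people usually get stuck on, here is a rough JavaScript sketch of what such a transformation script could look like. It is only an illustration: the field names (watts, wattHour, ts, deviceName, deviceType) are hypothetical and depend on what your external REST API actually returns.

// msg holds the parsed response body from the REST API CALL node,
// e.g. { "watts": 1200, "wattHour": 350, "ts": 1533110400000, ... }
var newMsg = {
    watts: msg.watts,
    wattHour: msg.wattHour
};
var newMetadata = metadata;
// Timestamp of the reading in UNIX milliseconds.
newMetadata.ts = msg.ts;
// deviceName / deviceType as provided by (or mapped from) your external API.
newMetadata.deviceName = msg.deviceName;
newMetadata.deviceType = msg.deviceType;
// The message type has to be changed so the Save Timeseries node accepts it.
return {msg: newMsg, metadata: newMetadata, msgType: 'POST_TELEMETRY_REQUEST'};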
I have a streaming topic in JSON with 50 fields. I'm trying to create another stream with one field from that topic using KSQL, as below:
create stream data (timeGMT string) with (kafka_topic='json_data', value_format='json');
The stream was created successfully; however, no data is returned from the KSQL query below:
select * from data;
This is running on KSQL 5.0.0
There are a few things to check, including:
Is there any data in the topic?
PRINT 'json_data' FROM BEGINNING;
Have you told KSQL to read from the beginning of the topic?
SET 'auto.offset.reset' = 'earliest';
Are there messages in your topic that aren't JSON or can't be parsed? Check the KSQL Server log for errors.
You can see more information on these, and other troubleshooting tips, in this blog.
We are trying to connect our contact center to Google Cloud Speech via gRPC in C++, using async streaming.
The log where we pass down the configuration is below:
#2018-08-01 07:13:11,316||FINEST|MediaMgr|2601||ProcessMessage - Message
Content: type: StartCloudRecognition
endpointID: 5
config: { "provider": "google", "profanity-filter": false, "chunkSize": 8192,
"phrases": [] }
Then we start writing the content in:
#2018-08-01 07:13:12,101||FINEST|Tele|2673|FileName=media/audio/CloudEndpoint.cpp,LineNumber=317|CheckStatus() get 2|MPP227####
#2018-08-01 07:13:12,151||FINEST|Tele|2673|FileName=media/audio/CloudEndpoint.cpp,LineNumber=317|CheckStatus() get 1|MPP227####
#2018-08-01 07:13:12,151||FINEST|Tele|2673|FileName=media/audio/CloudEndpoint.cpp,LineNumber=325|GSRProvider::CheckStatus: EndpointID=5 Got event tag=0x1899f98 ok=true state=1|MPP227####
#2018-08-01 07:13:12,151||FINEST|Tele|2673|FileName=media/audio/CloudEndpoint.cpp,LineNumber=317|CheckStatus() get 1|MPP227####
#2018-08-01 07:13:12,151||FINEST|Tele|2673|FileName=media/audio/CloudEndpoint.cpp,LineNumber=325|GSRProvider::CheckStatus: EndpointID=5 Got event tag=0x1899f94 ok=true state=2|MPP227####
#2018-08-01 07:13:12,151||FINEST|Tele|2673|FileName=media/audio/CloudEndpoint.cpp,LineNumber=317|CheckStatus() get 1|MPP227####
#2018-08-01 07:13:12,151||FINEST|Tele|2673|FileName=media/audio/CloudEndpoint.cpp,LineNumber=325|GSRProvider::CheckStatus: EndpointID=5 Got event tag=0x1899f8c ok=true state=4|MPP227####
The final line of the snippet logs state=4, which is our write tag, so our app does write data through the gRPC channel.
A little bit later we get lines like:
#2018-08-01 07:13:12,283||FINEST|Tele|2673|FileName=media/audio/CloudEndpoint.cpp,LineNumber=325|GSRProvider::CheckStatus: EndpointID=5 Got event tag=0x1899f90 ok=false state=3|MPP227####
This event comes in on a read tag with ok set to false, which in gRPC means the stream has been closed. So the speech recognition process stops before the tester even tries to speak any meaningful content.
We have two labs sharing the same build in the same area: one in Ireland, which always fails the test, and one in the USA, which always passes the same test.
Any ideas?
It turned out that some of the packets we sent to Google were corrupted.
I have just started learning how to use some Google Cloud Products.
I am currently busy with Cloud Dataflow.
I decided to start by writing a simple program.
It does nothing more than read from a BigQuery table and write to another table.
The job is failing.
Pipeline p = Pipeline.create(options);

// Read a single column from the source table using standard SQL.
PCollection<TableRow> data = p.apply(BigQueryIO.Read.named("test")
    .fromQuery("select itemName from `Dataset.sampletable`")
    .usingStandardSql());

// Define the output schema and write the rows to the destination table.
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("category").setType("STRING"));
TableSchema schema = new TableSchema().setFields(fields);

data.apply(BigQueryIO.Write.named("Write").to("Dataset.dataflow_test")
    .withSchema(schema)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

p.run();
}
Error code:
(6093f22a86dc3c25): Workflow failed. Causes: (6093f22a86dc389a): S01:test/DataflowPipelineRunner.BatchBigQueryIONativeRead+Write/DataflowPipelineRunner.BatchBigQueryIOWrite/DataflowPipelineRunner.BatchBigQueryIONativeWrite failed., (709b1cdded98b0f6): BigQuery creating dataset "_dataflow_temp_dataset_10172746300453557418" in project "1234-project" failed., (709b1cdded98b191): BigQuery execution failed., (709b1cdded98b22c): Error:
Message: IAM setPolicy failed for 42241429167:_dataflow_temp_dataset_10172746300453557418
HTTP Code: 400
I can imagine that writing immediately after reading might be a reason for this failure, so I would like to know a good solution for this.
I just need to run a Dataflow pipeline on a daily basis, but suggested solutions like the App Engine Cron Service, which requires building a whole web app, seem a bit too much.
I was thinking about just running the pipeline from a cron job on a Compute Engine Linux VM, but maybe that's far too simple :). What's the problem with doing it that way? Why isn't anybody (besides me, I guess) suggesting it?
This is how I did it using Cloud Functions, Pub/Sub, and Cloud Scheduler
(this assumes you've already created a Dataflow template and it exists in your GCS bucket somewhere).
Create a new topic in Pub/Sub. This will be used to trigger the Cloud Function.
Create a Cloud Function that launches a Dataflow job from a template. I find it easiest to just create this from the Cloud Functions console. Make sure the service account you choose has permission to create a Dataflow job. The function's index.js looks something like:
const google = require('googleapis');

exports.triggerTemplate = (event, context) => {
  // In this case the PubSub message payload and attributes are not used,
  // but they can be used to pass parameters needed by the Dataflow template.
  const pubsubMessage = event.data;
  console.log(Buffer.from(pubsubMessage, 'base64').toString());
  console.log(event.attributes);

  google.google.auth.getApplicationDefault(function (err, authClient, projectId) {
    if (err) {
      console.error('Error occurred: ' + err.toString());
      throw new Error(err);
    }

    const dataflow = google.google.dataflow({ version: 'v1b3', auth: authClient });

    dataflow.projects.templates.create({
      projectId: projectId,
      resource: {
        parameters: {},
        jobName: 'SOME-DATAFLOW-JOB-NAME',
        gcsPath: 'gs://PATH-TO-YOUR-TEMPLATE'
      }
    }, function(err, response) {
      if (err) {
        console.error("Problem running dataflow template, error was: ", err);
      }
      console.log("Dataflow template response: ", response);
    });
  });
};
The package.json looks like
{
  "name": "pubsub-trigger-template",
  "version": "0.0.1",
  "dependencies": {
    "googleapis": "37.1.0",
    "@google-cloud/pubsub": "^0.18.0"
  }
}
Go to Pub/Sub and the topic you created, and manually publish a message. This should trigger the Cloud Function and start a Dataflow job.
Use Cloud Scheduler to publish a Pub/Sub message on a schedule:
https://cloud.google.com/scheduler/docs/tut-pub-sub
There's absolutely nothing wrong with using a cron job to kick off your Dataflow pipelines. We do it all the time for our production systems, whether they are Java or Python pipelines.
That said, however, we are trying to wean ourselves off cron jobs and move toward using either AWS Lambdas (we run multi-cloud) or Cloud Functions. Unfortunately, Cloud Functions don't have scheduling yet; AWS Lambdas do.
There is a FAQ answer to that question:
https://cloud.google.com/dataflow/docs/resources/faq#is_there_a_built-in_scheduling_mechanism_to_execute_pipelines_at_given_time_or_interval
You can automate pipeline execution by using Google App Engine (Flexible Environment only) or Cloud Functions.
You can use Apache Airflow's Dataflow Operator, one of several Google Cloud Platform Operators in a Cloud Composer workflow.
You can use custom (cron) job processes on Compute Engine.
The Cloud Functions approach is described as "Alpha", and it's still true that they don't have scheduling (no equivalent to an AWS CloudWatch scheduled event), only Pub/Sub messages, Cloud Storage changes, and HTTP invocations.
Cloud Composer looks like a good option. It is effectively a re-badged Apache Airflow, which is itself a great orchestration tool. Definitely not "too simple" like cron :)
You can use Cloud Scheduler to schedule your job as well. See my post:
https://medium.com/@zhongchen/schedule-your-dataflow-batch-jobs-with-cloud-scheduler-8390e0e958eb
Terraform script
data "google_project" "project" {}
resource "google_cloud_scheduler_job" "scheduler" {
name = "scheduler-demo"
schedule = "0 0 * * *"
# This needs to be us-central1 even if the app engine is in us-central.
# You will get a resource not found error if just using us-central.
region = "us-central1"
http_target {
http_method = "POST"
uri = "https://dataflow.googleapis.com/v1b3/projects/${var.project_id}/locations/${var.region}/templates:launch?gcsPath=gs://zhong-gcp/templates/dataflow-demo-template"
oauth_token {
service_account_email = google_service_account.cloud-scheduler-demo.email
}
# need to encode the string
body = base64encode(<<-EOT
{
"jobName": "test-cloud-scheduler",
"parameters": {
"region": "${var.region}",
"autoscalingAlgorithm": "THROUGHPUT_BASED",
},
"environment": {
"maxWorkers": "10",
"tempLocation": "gs://zhong-gcp/temp",
"zone": "us-west1-a"
}
}
EOT
)
}
}