I need to run a GCP BigQuery SQL query from the Dataflow job. In a PTransform I have loaded the landing tables into BigQuery.
Now how do I run a SQL query from the same Dataflow job code?
Please note I am using Python here.
Thanks.
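A minimal sketch of one way to do this, assuming the google-cloud-bigquery client library is installed and using placeholder project/dataset/table names: run the SQL after the pipeline has finished loading the landing tables.

# Sketch only: run a BigQuery SQL statement once the Beam pipeline has
# finished loading the landing tables. Assumes google-cloud-bigquery is
# installed; the project, dataset and table names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import bigquery

def run():
    options = PipelineOptions()  # your Dataflow options go here
    with beam.Pipeline(options=options) as p:
        # ... existing PTransforms that load the landing tables ...
        pass

    # The `with` block waits for the pipeline to finish, so the query below
    # only runs after the landing tables are populated.
    client = bigquery.Client(project="my-project")
    sql = """
        INSERT INTO `my-project.my_dataset.target_table`
        SELECT * FROM `my-project.my_dataset.landing_table`
    """
    client.query(sql).result()  # .result() blocks until the query completes

if __name__ == "__main__":
    run()

If you instead need the query results as a PCollection, newer Beam Python SDKs also let you pass a query to beam.io.ReadFromBigQuery.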
I am a newbie in the Apache Beam environment, and I am trying to fit an Apache Beam pipeline to my batch orchestration. My definition of a batch is as follows:
Batch ==> a set of jobs
Job ==> can have one or more sub-jobs
There can be dependencies between jobs/sub-jobs.
Can an Apache Beam pipeline be mapped to my custom batch?
Apache Beam is a unified model for developing both batch and streaming pipelines, which can be run on Dataflow. You can create and deploy your pipeline using Dataflow. Beam pipelines are portable, so you can use any of the available runners according to your requirements.
Cloud Composer can be used for batch orchestration as per your requirement. Cloud Composer is built on Apache Airflow. Apache Beam and Apache Airflow can be used together, since Apache Airflow can trigger the Beam jobs. Since you have custom jobs running, you can configure Beam and Airflow for batch orchestration.
Airflow is meant to perform orchestration and pipeline dependency management, while Beam is used to build data pipelines that are executed on data processing systems.
I believe Composer might be better suited for what you're trying to build. From there, you can launch Dataflow jobs from your environment using Airflow operators (for example, if you're using Python, you can use the DataflowCreatePythonJobOperator), as sketched below.
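A minimal sketch of such a DAG, assuming Airflow 2.x with the apache-airflow-providers-google package; the file paths, job names and region are placeholders, and each operator stands for one of your jobs/sub-jobs, with >> expressing the dependencies:

# Sketch only: a Cloud Composer / Airflow DAG that launches two Python
# Dataflow jobs, where job_b depends on job_a. Paths, names and the region
# are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowCreatePythonJobOperator,
)

with DAG(
    dag_id="custom_batch",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,   # trigger manually or on your own schedule
    catchup=False,
) as dag:
    job_a = DataflowCreatePythonJobOperator(
        task_id="job_a",
        py_file="gs://my-bucket/pipelines/job_a.py",
        job_name="job-a",
        location="europe-west1",
    )
    job_b = DataflowCreatePythonJobOperator(
        task_id="job_b",
        py_file="gs://my-bucket/pipelines/job_b.py",
        job_name="job-b",
        location="europe-west1",
    )
    job_a >> job_b  # job_b runs only after job_a succeeds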
I am trying to write a script to automate the deployment of a Java Dataflow job. The script creates a template and then uses the command
gcloud dataflow jobs run my-job --gcs-location=gs://my_bucket/template
The issue is that I want to update the job if it already exists and is running. I can do the update if I run the job via Maven, but I need to do it via gcloud so I can have one service account for deployment and another one for running the job. I tried different things (such as adding --parameters update to the command line), but I always get an error. Is there a way to update a Dataflow job exclusively via gcloud dataflow jobs run?
Referring to the official documentation, which describes gcloud beta dataflow jobs (a group of subcommands for working with Dataflow jobs), there is currently no way to use gcloud to update a job.
As of now, the Apache Beam SDKs provide a way to update an ongoing streaming job on the Dataflow managed service with new pipeline code; you can find more information here. Another way of updating an existing Dataflow job is by using the REST API, where you can find a Java example.
Additionally, please follow the feature request regarding recreating a job with gcloud.
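For the REST route, a rough sketch in Python (the project, region, template path and job name are placeholders, and it assumes google-api-python-client with application default credentials):

# Sketch only: launch a templated Dataflow job with update=True via the
# Dataflow REST API (projects.locations.templates.launch). Project, region,
# template path and job name are placeholders.
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
request = dataflow.projects().locations().templates().launch(
    projectId="my-project",
    location="us-central1",
    gcsPath="gs://my_bucket/template",
    body={
        "jobName": "my-job",   # must match the name of the running job
        "update": True,        # replace the running job instead of starting a new one
        "parameters": {},      # template parameters, if any
    },
)
response = request.execute()
print(response["job"]["id"])

Keep in mind that Dataflow only supports updating streaming jobs in place.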
I have only one topic, which was created in the production project. I want to run my Dataflow job in the dev environment, and it needs to consume the production Pub/Sub topic. When I submit my Dataflow job in the dev project it does not work: it always shows as running in the Dataflow UI, but it does not read any elements from Pub/Sub. If I submit it to the production project it works perfectly.
Why is it not reading messages from the other project's topic? I'm using Java SDK 2.1 and the runner is DataflowRunner.
PCollection<String> StreamData = p.apply("Read pubsub message",PubsubIO.readStrings().fromSubscription(options.getInputPubSub()));
Using mvn to submit the Dataflow job:
mvn compile exec:java -Dexec.mainClass=dataflow.streaming.SampleStream -Dexec.args="--project=project-dev-1276 --stagingLocation=gs://project-dev/dataflow/staging --tempLocation=gs://project-dev/dataflow/bq_temp --zone=europe-west1-c --bigQueryDataset=stream_events --bigQueryTable=events_sample --inputPubSub=projects/project-prod/subscriptions/stream-events --streaming=true --runner=dataflowRunner"
Note: if I use the DirectRunner, it works and consumes messages from the other project's Pub/Sub topic.
No elements are added to the queue and there is no estimated size.
You need to grant Pub/Sub Subscriber permissions in your production project to the service account that your job will use. By default, workers use your project's Compute Engine service account as the controller service account. This service account (<project-number>-compute@developer.gserviceaccount.com) should be given the Pub/Sub Subscriber permission.
Read more here https://cloud.google.com/dataflow/docs/concepts/security-and-permissions and here https://cloud.google.com/pubsub/docs/access-control
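For reference, a rough sketch of granting that role on the production subscription to the dev project's controller service account, using the google-cloud-pubsub Python client (the project number is a placeholder; the same can be done in the Cloud Console or with gcloud):

# Sketch only: grant roles/pubsub.subscriber on the production subscription
# to the dev project's Compute Engine default service account (the Dataflow
# controller service account). The project number is a placeholder.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("project-prod", "stream-events")

policy = subscriber.get_iam_policy(request={"resource": subscription_path})
policy.bindings.add(
    role="roles/pubsub.subscriber",
    members=["serviceAccount:123456789012-compute@developer.gserviceaccount.com"],
)
subscriber.set_iam_policy(
    request={"resource": subscription_path, "policy": policy}
)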
I am struggling with this, and initially thought it could be the result of switching the pipeline data source from Cloud Datastore to Firebase Firestore, which required a new project. But I have since found the same error in separate pipelines. All pipelines run successfully on the local DirectRunner, and the permissions appear to be the same as in the old project.
It looks like none of the VMs ever boot and the pipeline never scales above 0 workers. "The Dataflow appears to be stuck" is the only error message I could find, and there is nothing in Stackdriver. I tried every dependency-management variation I could find in the docs, but that doesn't seem to be the problem.
My last Dataflow job-id is 2017-10-11_11_12_01-15165703816317931044.
I tried elevating the access roles of all service accounts and still had no luck.
Without any logging information, it's hard to pinpoint the cause. But this can happen if you have changed the permissions or roles of the Dataflow service account or the Compute Engine service account, so that the service account no longer has enough permissions to get the images for the Dataflow workers.
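One way to sanity-check that, sketched below with the google-api-python-client (the project id is a placeholder): dump the project's IAM bindings and confirm that the Compute Engine default service account and the Dataflow service agent still hold their expected roles (for example roles/dataflow.worker and roles/dataflow.serviceAgent under current role names).

# Sketch only: list the project's IAM bindings for the Dataflow-related
# service accounts. Assumes google-api-python-client and application
# default credentials; the project id is a placeholder.
from googleapiclient.discovery import build

crm = build("cloudresourcemanager", "v1")
policy = crm.projects().getIamPolicy(resource="my-project", body={}).execute()

for binding in policy.get("bindings", []):
    for member in binding.get("members", []):
        if ("compute@developer.gserviceaccount.com" in member
                or "dataflow-service-producer-prod" in member):
            print(binding["role"], member)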
I am trying to setup and execute the Spring Cloud Tasks Sample of partitioned batch job (https://github.com/spring-cloud/spring-cloud-task/tree/master/spring-cloud-task-samples/partitioned-batch-job) in Spring Cloud Data Flow Server.
But for some reason there are errors in the partitioned job tasks:
A job execution for this job is already running: JobInstance: id=2, version=0, Job=[partitionedJob]
Is the partitioned job incompatible with the Spring Cloud Data Flow server?
Yes, the sample partitioned batch job is compatible with the Spring Cloud Data Flow server and works out of the box, so long as:
The datasource is either H2 or MySQL, and
you are using the Spring Cloud Data Flow Server Local.
But it is difficult to diagnose the issue without knowing which Data Flow server you are using and which database. Also, were there any exceptions?