I want to create a dataflow template from a python script - google-cloud-dataflow

I found this script and I want to create a dataflow template from it but I don't know how. I also found this command
python -m examples.mymodule \
--runner DataflowRunner \
--project YOUR_PROJECT_ID \
--staging_location gs://YOUR_BUCKET_NAME/staging \
--temp_location gs://YOUR_BUCKET_NAME/temp \
--template_location gs://YOUR_BUCKET_NAME/templates/YOUR_TEMPLATE_NAME
for creating and staging a template, but it's really confusing for me.

First, you must prepare your script to be used as a template; for this you can follow the link provided by @JayadeepJayaraman [1].
As for the python command, it creates your template and stores it at the bucket path given in the "--template_location" parameter. "examples.mymodule" is the module path (package.script_name) of the script for which you want to create the template.
[1] https://cloud.google.com/dataflow/docs/guides/templates/creating-templates
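As a rough illustration of how the pieces of that command fit together, here is a small Python sketch that assembles the same argument list. The helper name build_template_command and the example values (my_package.my_pipeline, my-project, my-bucket, my_template) are hypothetical; only the flags are taken from the command in the question.

```python
import shlex


def build_template_command(module, project, bucket, template_name):
    """Return the argv list for creating and staging a Dataflow template.

    `module` is the dotted path package.script_name, i.e. what follows
    `python -m` in the command from the question.
    """
    return [
        "python", "-m", module,
        "--runner", "DataflowRunner",
        "--project", project,
        "--staging_location", f"gs://{bucket}/staging",
        "--temp_location", f"gs://{bucket}/temp",
        # The template is created and stored at this path.
        "--template_location", f"gs://{bucket}/templates/{template_name}",
    ]


if __name__ == "__main__":
    # Hypothetical example values; replace them with your own.
    cmd = build_template_command(
        "my_package.my_pipeline", "my-project", "my-bucket", "my_template"
    )
    # Print the command for review. To actually stage the template,
    # pass `cmd` to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Note that the module passed to python -m must be importable from the directory where the command is run, which is why the path is written as package.script_name rather than as a file path.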

You can take a look at https://cloud.google.com/dataflow/docs/guides/templates/creating-templates on how to create Python Dataflow templates.

Related

How do I use a custom version of the apache beam python SDK on DataFlow?

The current version of Apache Beam does not support type code 11 (JSON) from Google Spanner, as it uses a version of google-cloud-spanner that is two major versions behind the current release. I therefore updated my own copy to do so; I haven't quite figured out how to do a proper PR on GitHub or run the tests yet.
Either way, that will take a while. I have heard that there is a way to specify a custom Apache beam SDK on DataFlow, but that was from 3 years ago and not specific. Is it still possible? What kind of file do I need to save the SDK in - zip, tar, tar.gz? What folders need to be in that archive? apache_beam, apache_beam-2.34.0.dist-info? just the files in apache_beam? Do I just set the option in sdk-location="gs://bucket" in PipelineOptions?
Thanks.
After you have your container built, you need to ensure that you are using runner V2 and you also need to set the sdk_container_image flag like so (the other flags are relevant to wordcount and may not be relevant to your pipeline):
python -m apache_beam.examples.wordcount \
--input=INPUT_FILE \
--output=OUTPUT_FILE \
--project=PROJECT_ID \
--region=REGION \
--temp_location=TEMP_LOCATION \
--runner=DataflowRunner \
--disk_size_gb=DISK_SIZE_GB \
--experiments=use_runner_v2 \
--sdk_container_image=$IMAGE_URI
Before you run your pipeline on Dataflow, you should ensure that your container works by running a small job locally like so:
python path/to/my/pipeline.py \
--runner=PortableRunner \
--job_endpoint=embed \
--environment_type=DOCKER \
--environment_config=IMAGE_URI \
--input=INPUT_FILE \
--output=OUTPUT_FILE
Please take a look at https://cloud.google.com/dataflow/docs/guides/using-custom-containers for more details.

Google Endpoints YAML file update: Is there a simpler method

When using Google Endpoints with Cloud Run to provide the container service, one creates a YAML file (Swagger 2.0 format) to specify the paths with all configurations. For EVERY CHANGE, the following is what I do, based on the documentation (https://cloud.google.com/endpoints/docs/openapi/get-started-cloud-functions):
Step 1: Deploying the Endpoints configuration
gcloud endpoints services deploy openapi-functions.yaml \
--project ESP_PROJECT_ID
This gives me the following output:
Service Configuration [CONFIG_ID] uploaded for service [CLOUD_RUN_HOSTNAME]
Then,
Step 2: Download the script to local machine
chmod +x gcloud_build_image
./gcloud_build_image -s CLOUD_RUN_HOSTNAME \
-c CONFIG_ID -p ESP_PROJECT_ID
Then,
Step 3: Redeploy the Cloud Run service
gcloud run deploy CLOUD_RUN_SERVICE_NAME \
--image="gcr.io/ESP_PROJECT_ID/endpoints-runtime-serverless:CLOUD_RUN_HOSTNAME-CONFIG_ID" \
--allow-unauthenticated \
--platform managed \
--project=ESP_PROJECT_ID
Is this the process for every API path change? Or is there a simpler direct method of updating the YAML file and uploading it somewhere?
Thanks.
Based on the documentation, yes, this is the process for every API path change. However, this may change in the future, as this feature is currently in beta, as stated in the documentation you shared.
You may want to look over here in order to create a feature request to GCP so they can improve this feature in the future.
In the meantime, I would advise creating a script for this process: the steps are always the same, so a bash script that runs these commands would help you automate the task.
Hope you find this useful.
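The same idea can be sketched in Python instead of bash. This is only a minimal sketch of the three steps from the question: the function names are hypothetical, the regular expression assumes the output line has exactly the shape quoted in the question, and searching both stdout and stderr is an assumption about where gcloud prints that line.

```python
import re
import subprocess

# Matches the step 1 output line quoted in the question.
CONFIG_RE = re.compile(
    r"Service Configuration \[(?P<config_id>[^\]]+)\] "
    r"uploaded for service \[(?P<hostname>[^\]]+)\]"
)


def parse_deploy_output(text):
    """Extract (config_id, hostname) from the step 1 output."""
    match = CONFIG_RE.search(text)
    if match is None:
        raise ValueError("could not find the service configuration line")
    return match.group("config_id"), match.group("hostname")


def deploy_config_cmd(spec_path, project):
    # Step 1: deploy the Endpoints configuration.
    return ["gcloud", "endpoints", "services", "deploy", spec_path,
            "--project", project]


def build_image_cmd(hostname, config_id, project):
    # Step 2: build the image with the already downloaded script.
    return ["./gcloud_build_image", "-s", hostname, "-c", config_id,
            "-p", project]


def redeploy_cmd(service, hostname, config_id, project):
    # Step 3: redeploy the Cloud Run service with the new image.
    image = (f"gcr.io/{project}/endpoints-runtime-serverless:"
             f"{hostname}-{config_id}")
    return ["gcloud", "run", "deploy", service, f"--image={image}",
            "--allow-unauthenticated", "--platform", "managed",
            f"--project={project}"]


def update_endpoints(spec_path, service, project):
    """Run all three steps in order, stopping at the first failure."""
    step1 = subprocess.run(deploy_config_cmd(spec_path, project),
                           check=True, capture_output=True, text=True)
    # Assumption: the line may be on stdout or stderr, so search both.
    config_id, hostname = parse_deploy_output(step1.stdout + step1.stderr)
    subprocess.run(build_image_cmd(hostname, config_id, project), check=True)
    subprocess.run(redeploy_cmd(service, hostname, config_id, project),
                   check=True)
```

Calling update_endpoints("openapi-functions.yaml", "CLOUD_RUN_SERVICE_NAME", "ESP_PROJECT_ID") with your own values would then replace the manual copy and paste of CONFIG_ID and CLOUD_RUN_HOSTNAME between steps.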
When you use the default Cloud Endpoints image as described in the documentation, the parameter --rollout_strategy=managed is set automatically.
You then have to wait up to 1 minute before the new configuration is used. Personally, that is what I observe in my deployments. Give it a try!

How to Activate Dataflow Shuffle Service through gcloud CLI

I am trying to activate the Dataflow Shuffle [DS] through gcloud command line interface.
I am using this command:
gcloud dataflow jobs run ${JOB_NAME_STANDARD} \
--project=${PROJECT_ID} \
--region=us-east1 \
--service-account-email=${SERVICE_ACCOUNT} \
--gcs-location=${TEMPLATE_PATH}/template \
--staging-location=${PIPELINE_FOLDER}/staging \
--parameters "experiments=[shuffle_mode=\"service\"]"
The job starts. The Dataflow UI reflects it.
However, the logs show an error parsing the value:
Failed to parse SDK pipeline options: json: cannot unmarshal string into Go struct
field sdkPipelineOptions.experiments of type []string
What am I doing wrong?
This question is indeed related to an existing question:
How to activate Dataflow Shuffle service?
However, the original question covered the Python API, while my problem is with the gcloud CLI.
[DS] https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#cloud-dataflow-shuffle
P.S. Update
I have also tried:
No luck.
There is currently no way (that I know of) to enable shuffle_service for a template.
You have two options:
a) Run the job without a template.
b) Create a template that already has shuffle_service enabled.
The unmarshalling issue is most likely because templates only support a fixed set of parameters, and "experiments" is not one of them.
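The error message in the question says that a JSON string arrived where a list of strings was expected. The sketch below only illustrates that type mismatch in Python; the function name check_experiments and both payloads are hypothetical and are not what gcloud or the Dataflow service actually send or run.

```python
import json


def check_experiments(options_json):
    """Mimic the type check behind the error message.

    The `experiments` field must be a list of strings; a single string
    is rejected, just as it cannot be unmarshalled into []string.
    """
    options = json.loads(options_json)
    experiments = options.get("experiments")
    is_string_list = isinstance(experiments, list) and all(
        isinstance(item, str) for item in experiments
    )
    if not is_string_list:
        raise TypeError(
            "experiments must be a list of strings, got "
            f"{type(experiments).__name__}"
        )
    return experiments


# Hypothetical payloads: the value as one string, and as a list of strings.
as_string = json.dumps({"experiments": '[shuffle_mode="service"]'})
as_list = json.dumps({"experiments": ["shuffle_mode=service"]})

if __name__ == "__main__":
    print(check_experiments(as_list))
    try:
        check_experiments(as_string)
    except TypeError as error:
        print(f"rejected: {error}")
```

A template parameter value is passed along as a string, which is consistent with the parse failure when that string ends up in a field that is typed as a list.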
----UPD----
I was asked how to create a template with shuffle_service enabled.
Here are the sample steps I took.
Follow the WordCountTutorial to create a project with a pipeline definition.
Create the template with the following command:
mvn -Pdataflow-runner compile exec:java -Dexec.mainClass=org.apache.beam.examples.WindowedWordCount -Dexec.args="--project={project-name} --stagingLocation=gs://{staging-location} --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://{output-location} --runner=DataflowRunner --experiments=shuffle_mode=service --region=us-central1 --templateLocation=gs://{resulting-template-location}"
Note the --experiments=shuffle_mode=service argument.
Invoke the template from the UI or via this command:
gcloud dataflow jobs run {job-name} --project={project-name} --region=us-central1 --gcs-location=gs://{resulting-template-location}

Getting Dataflowrunner with --experiments=upload_graph to work

I have a pipeline that produces a Dataflow graph (serialized JSON representation) that exceeds the allowable limit for the API, so it cannot be launched via the Dataflow runner for Apache Beam as one normally would. Running the Dataflow runner with the suggested parameter --experiments=upload_graph does not work either: the job fails, saying there are no steps specified.
When getting notified about this size problem via an error, the following information is provided:
the size of the serialized JSON representation of the pipeline exceeds the allowable limit for the API.
Use experiment 'upload_graph' (--experiments=upload_graph)
to direct the runner to upload the JSON to your
GCS staging bucket instead of embedding in the API request.
Using this parameter does indeed result in the Dataflow runner uploading an additional dataflow_graph.pb file to the staging location, beside the usual pipeline.pb file. I verified that it actually exists in GCP storage.
However the job in gcp dataflow then immediately fails after start with the following error:
Runnable workflow has no steps specified.
I've tried this flag with various pipelines, even apache beam example pipelines and see the same behaviour.
This can be reproduced by using word count example:
mvn archetype:generate \
-DarchetypeGroupId=org.apache.beam \
-DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
-DarchetypeVersion=2.11.0 \
-DgroupId=org.example \
-DartifactId=word-count-beam \
-Dversion="0.1" \
-Dpackage=org.apache.beam.examples \
-DinteractiveMode=false
cd word-count-beam/
Running it without the experiments=upload_graph parameter works:
(make sure to specify your project, and buckets if you want to run this)
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
--gcpTempLocation=gs://<your-gcs-bucket>/tmp \
--inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
-Pdataflow-runner
Running it with experiments=upload_graph results in the pipeline failing with the message "workflow has no steps specified":
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
--gcpTempLocation=gs://<your-gcs-bucket>/tmp \
--experiments=upload_graph \
--inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
-Pdataflow-runner
Now I would expect that dataflow runner would direct gcp dataflow to read the steps from the bucket specified as seen in the source code:
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L881
However this seems not to be the case. Has anyone gotten this to work, or has found some documentation regarding this feature that can point me in the right direction?
The experiment has since been reverted and the messaging will be corrected in Beam 2.13.0
Revert PR
I recently ran into this issue and the solution was quite silly. I had a fairly complex Dataflow streaming job that was working fine, and the next day it stopped working with the error "Runnable workflow has no steps specified.". In my case, someone had called pipeline().run().waitUntilFinish() twice after creating the options, which caused this error. Removing the duplicate pipeline run resolved the issue. I still think Beam/DataflowRunner should give a more useful error trace in this scenario.

How can I stage additional files using Google Cloud Dataflow?

I am reading a bunch of configuration files in my Google Dataflow program and wonder what the best way to stage them is. Currently I do it this way, and the system cannot find them.
FileReader filereader1 = new FileReader("config_1.csv");
FileReader filereader2 = new FileReader("config_2.csv");
config_1.csv and config_2.csv are stored in ./target/classes/org/model/examples/
My running script looks like this:
mvn compile exec:java -Dexec.mainClass=org.model.examples.MyPipeline \
-Dexec.args="--runner=DataflowRunner \
--project=mortgage-data-warehouse
--gcpTempLocation=gs://my-project-bucket/tmp \
--inputFile=gs://my-project-bucket/Data/input.txt \
--filesToStage=./target/classes/org/datamodel/examples/config_1.csv, ./target/classes/org/datamodel/examples/config_2.csv" \
-Pdataflow-runner
I have got the error
java.io.FileNotFoundException: config_1.csv (The system cannot find the file specified)
I wonder if this is the proper way to set --filesToStage.
For small configuration files, it is better to read them from the resources folder, as described in this link, and avoid the complication of using --filesToStage.
