How to Activate Dataflow Shuffle Service through gcloud CLI - google-cloud-dataflow

I am trying to activate the Dataflow Shuffle [DS] through the gcloud command-line interface.
I am using this command:
gcloud dataflow jobs run ${JOB_NAME_STANDARD} \
--project=${PROJECT_ID} \
--region=us-east1 \
--service-account-email=${SERVICE_ACCOUNT} \
--gcs-location=${TEMPLATE_PATH}/template \
--staging-location=${PIPELINE_FOLDER}/staging \
--parameters "experiments=[shuffle_mode=\"service\"]"
The job starts, and the Dataflow UI reflects it.
However, the logs show an error while parsing the value:
Failed to parse SDK pipeline options: json: cannot unmarshal string into Go struct
field sdkPipelineOptions.experiments of type []string
What am I doing wrong?
This question is indeed related to an existing question:
How to activate Dataflow Shuffle service?
however, the original question covers the Python API, while my problem is with the gcloud CLI.
[DS] https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#cloud-dataflow-shuffle
P.S. Update
I have also tried other ways of passing the parameter, with no luck.

There's currently no way (that I know of) to enable shuffle_service for a template.
You have two options:
a) Run the job not from a template.
b) Create a template that already has shuffle_service enabled.
The unmarshalling issue is most likely because templates only support a fixed set of parameters, and "experiments" is not one of them.
----UPD----
I was asked how to create a template with shuffle_service enabled.
Here are the sample steps I took.
Follow the WordCountTutorial to create a project with the pipeline definition.
Create the template with the following command:
mvn -Pdataflow-runner compile exec:java -Dexec.mainClass=org.apache.beam.examples.WindowedWordCount -Dexec.args="--project={project-name} --stagingLocation=gs://{staging-location} --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://{output-location} --runner=DataflowRunner --experiments=shuffle_mode=service --region=us-central1 --templateLocation=gs://{resulting-template-location}"
Note the --experiments=shuffle_mode=service argument.
Invoke the template from the UI or via the command:
gcloud dataflow jobs run {job-name} --project={project-name} --region=us-central1 --gcs-location=gs://{resulting-template-location}
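To verify that the experiment actually made it into the launched job, one option (a sketch; the --full flag and the exact output fields may vary across gcloud versions) is to describe the job and look for shuffle_mode=service among the experiments in its environment:
gcloud dataflow jobs describe {job-id} --project={project-name} --region=us-central1 --full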

Related

Google Endpoints YAML file update: Is there a simpler method

When using Google Endpoints with Cloud Run to provide the container service, one creates a YAML file (Swagger 2.0 format) to specify the paths with all configurations. For every change, the following is what I do (based on the documentation: https://cloud.google.com/endpoints/docs/openapi/get-started-cloud-functions).
Step 1: Deploying the Endpoints configuration
gcloud endpoints services deploy openapi-functions.yaml \
--project ESP_PROJECT_ID
This gives me the following output:
Service Configuration [CONFIG_ID] uploaded for service [CLOUD_RUN_HOSTNAME]
Then,
Step 2: Download the gcloud_build_image script to the local machine and run it
chmod +x gcloud_build_image
./gcloud_build_image -s CLOUD_RUN_HOSTNAME \
-c CONFIG_ID -p ESP_PROJECT_ID
Then,
Step 3: Redeploy the Cloud Run service
gcloud run deploy CLOUD_RUN_SERVICE_NAME \
--image="gcr.io/ESP_PROJECT_ID/endpoints-runtime-serverless:CLOUD_RUN_HOSTNAME-CONFIG_ID" \
--allow-unauthenticated \
--platform managed \
--project=ESP_PROJECT_ID
Is this the process for every API path change? Or is there a simpler direct method of updating the YAML file and uploading it somewhere?
Thanks.
Based on the documentation, yes, this would be the process for every API path change. However, this may change in the future, as this feature is currently in beta, as stated in the documentation you shared.
You may want to look over here in order to create a feature request to GCP so they can improve this feature in the future.
In the meantime, I would advise creating a script for this process, since it is always the same steps; a small bash script that runs these commands would help you automate the task.
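For example, a minimal sketch of such a wrapper script; the variable values are placeholders, and the CONFIG_ID parsing assumes the "Service Configuration [CONFIG_ID] uploaded..." output format shown above:
#!/bin/bash
set -euo pipefail

# Placeholders - adjust to your project/service
ESP_PROJECT_ID="my-esp-project"
CLOUD_RUN_SERVICE_NAME="my-service"
CLOUD_RUN_HOSTNAME="my-service-abc123-uc.a.run.app"
OPENAPI_SPEC="openapi-functions.yaml"

# Step 1: deploy the Endpoints configuration and capture the new CONFIG_ID
# (parses the "Service Configuration [CONFIG_ID] uploaded for service [...]" line)
CONFIG_ID=$(gcloud endpoints services deploy "${OPENAPI_SPEC}" --project "${ESP_PROJECT_ID}" 2>&1 \
  | grep -o 'Service Configuration \[[^]]*\]' | sed 's/.*\[\(.*\)\]/\1/')

# Step 2: build the serverless ESP image for this configuration
./gcloud_build_image -s "${CLOUD_RUN_HOSTNAME}" -c "${CONFIG_ID}" -p "${ESP_PROJECT_ID}"

# Step 3: redeploy the Cloud Run service with the freshly built image
gcloud run deploy "${CLOUD_RUN_SERVICE_NAME}" \
  --image="gcr.io/${ESP_PROJECT_ID}/endpoints-runtime-serverless:${CLOUD_RUN_HOSTNAME}-${CONFIG_ID}" \
  --allow-unauthenticated \
  --platform managed \
  --project="${ESP_PROJECT_ID}"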
Hope you find this useful.
When you use the default Cloud Endpoints image as described in the documentation, the parameter --rollout_strategy=managed is automatically set.
You then only have to wait up to one minute for the new configuration to take effect; at least, that is what I observe in my deployments. Give it a try!

I want to create a dataflow template from a python script

I found this script and I want to create a Dataflow template from it, but I don't know how. I also found this command:
python -m examples.mymodule \
--runner DataflowRunner \
--project YOUR_PROJECT_ID \
--staging_location gs://YOUR_BUCKET_NAME/staging \
--temp_location gs://YOUR_BUCKET_NAME/temp \
--template_location gs://YOUR_BUCKET_NAME/templates/YOUR_TEMPLATE_NAME
for creating and staging a template, but it's really confusing for me.
First of all, you must prepare your script to be used as a template; for this you can follow the link provided by @JayadeepJayaraman [1].
Regarding the Python command, it will create your template and store it in the bucket given by the "--template_location" parameter, and "examples.mymodule" refers to the package.module path of the script for which you want to create the template.
[1] https://cloud.google.com/dataflow/docs/guides/templates/creating-templates
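As a rough illustration of what "preparing your script" means in practice: runtime parameters are declared with add_value_provider_argument so the template can receive their values at launch time rather than at template-build time. The names and paths below are placeholders, not taken from your script:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class MyTemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Runtime parameters: resolved when the template is launched, not when it is built
        parser.add_value_provider_argument(
            '--input', type=str, help='Path of the file(s) to read from')
        parser.add_value_provider_argument(
            '--output', type=str, help='Path prefix for the output files')


def run():
    options = PipelineOptions()
    template_options = options.view_as(MyTemplateOptions)
    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromText(template_options.input)
         | 'Write' >> beam.io.WriteToText(template_options.output))


if __name__ == '__main__':
    run()
Building and staging the template is then the python -m command from the question, run against this module.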
You can take a look at https://cloud.google.com/dataflow/docs/guides/templates/creating-templates on how to create Python Dataflow templates.

Getting Dataflowrunner with --experiments=upload_graph to work

I have a pipeline that produces a Dataflow graph (serialized JSON representation) that exceeds the allowable limit for the API, and thus cannot be launched via the Dataflow runner for Apache Beam as one normally would. Running the Dataflow runner with the suggested parameter --experiments=upload_graph does not work either; it fails saying there are no steps specified.
The error notifying me about the size problem provides the following information:
the size of the serialized JSON representation of the pipeline exceeds the allowable limit for the API.
Use experiment 'upload_graph' (--experiments=upload_graph)
to direct the runner to upload the JSON to your
GCS staging bucket instead of embedding in the API request.
Using this parameter does indeed result in the Dataflow runner uploading an additional dataflow_graph.pb file to the staging location beside the usual pipeline.pb file, and I verified that it actually exists in GCP storage.
However, the job in GCP Dataflow then fails immediately after starting with the following error:
Runnable workflow has no steps specified.
I've tried this flag with various pipelines, even Apache Beam example pipelines, and I see the same behaviour.
This can be reproduced by using the word count example:
mvn archetype:generate \
-DarchetypeGroupId=org.apache.beam \
-DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
-DarchetypeVersion=2.11.0 \
-DgroupId=org.example \
-DartifactId=word-count-beam \
-Dversion="0.1" \
-Dpackage=org.apache.beam.examples \
-DinteractiveMode=false
cd word-count-beam/
Running it without the experiments=upload_graph parameter works (make sure to specify your project and buckets if you want to run this):
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
--gcpTempLocation=gs://<your-gcs-bucket>/tmp \
--inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
-Pdataflow-runner
Running it with the experiments=upload_graph parameter results in the pipeline failing with the message "workflow has no steps specified":
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
--gcpTempLocation=gs://<your-gcs-bucket>/tmp \
--experiments=upload_graph \
--inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
-Pdataflow-runner
Now I would expect that the Dataflow runner would direct GCP Dataflow to read the steps from the specified bucket, as seen in the source code:
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L881
However, this seems not to be the case. Has anyone gotten this to work, or found some documentation regarding this feature that can point me in the right direction?
The experiment has since been reverted, and the messaging will be corrected in Beam 2.13.0.
Revert PR
I recently ran into this issue and the solution was quite silly. I had developed quite a complex Dataflow streaming job; it was working fine, and the next day it stopped working with the error "Runnable workflow has no steps specified.". In my case, someone had specified pipeline().run().waitUntilFinish() twice after creating the options, and because of that I was getting this error. Removing the duplicate pipeline run resolved the issue. I still think Beam/DataflowRunner should produce a more useful error trace in this scenario.

Scanning Rest API's through OWASP zap inside a docker environment

I set up an Azure DevOps CI/CD build that starts a VM where OWASP ZAP is running as a proxy, and where the OWASP ZAP Azure DevOps task runs against a target URL and copies my report to Azure Storage.
I followed this guy's beautiful tutorial: https://kasunkodagoda.com/2017/09/03/introducing-owasp-zed-attack-proxy-task-for-visual-studio-team-services/
(he is also the person who created the Azure DevOps task).
All well and good, but recently I wanted to use a REST API as the target URL. The OWASP ZAP task in Azure DevOps doesn't have that ability. I even asked the creator (https://github.com/kasunkv/owasp-zap-vsts-task/issues/30#issuecomment-452258621), and he also didn't think this is possible through the Azure DevOps task, only through Docker.
On my next quest, I am now trying to get it running inside a Docker image (first inside Azure DevOps, but that wasn't smooth: https://github.com/zaproxy/zaproxy/issues/5176).
I finally landed on this tutorial (https://zaproxy.blogspot.com/2017/06/scanning-apis-with-zap.html), where I am trying to run a Docker image with the following steps:
Pull the image:
docker pull owasp/zap2docker-weekly
Run the container:
docker run -v ${pwd}:/zap/wrk/:rw -t owasp/zap2docker-weekly zap-api-scan.py -t https://apiurl/api.json -f openapi -z "-configfile /zap/wrk/options.prop"
The options.prop file:
-config replacer.full_list\(0\).description=auth1 \
-config replacer.full_list\(0\).enabled=true \
-config replacer.full_list\(0\).matchtype=REQ_HEADER \
-config replacer.full_list\(0\).matchstr=Authorization \
-config replacer.full_list\(0\).regex=false \
-config replacer.full_list\(0\).replacement=Bearer xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
But this scans only the root URL, not every URL. While typing this question, I tried downloading the JSON file from the root and running the docker run command passing that JSON file with -t; I get "number of imported URLs:" followed by what seems to be everything, but then it appears to freeze inside PowerShell.
Which step am I missing to get a full recursive scan of my REST API?
Anyone have some ideas or some help, please?
Firstly, your property file format is wrong. You only need the '-config' and '\'s if you set the options directly on the command line. In the property file you should have:
replacer.full_list(0).description=auth1
replacer.full_list(0).enabled=true
replacer.full_list(0).matchtype=REQ_HEADER
replacer.full_list(0).matchstr=Authorization
replacer.full_list(0).regex=false
replacer.full_list(0).replacement=Bearer xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
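For comparison, a sketch of the equivalent direct command-line form that the '-config' syntax is meant for (exact quoting/escaping depends on your shell; token truncated as above):
docker run -v ${pwd}:/zap/wrk/:rw -t owasp/zap2docker-weekly zap-api-scan.py -t https://apiurl/api.json -f openapi -z "-config replacer.full_list(0).description=auth1 -config replacer.full_list(0).enabled=true -config replacer.full_list(0).matchtype=REQ_HEADER -config replacer.full_list(0).matchstr=Authorization -config replacer.full_list(0).regex=false -config replacer.full_list(0).replacement=Bearer xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"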
Secondly, what does https://apiurl/api.json return, and have you checked that you can access it from within your Docker container?
Try running
curl https://apiurl/api.json
and see what you get.
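If the API is only reachable from inside the container's network, the same check can be attempted from the ZAP image itself (a sketch; it assumes curl is present in the owasp/zap2docker-weekly image):
docker run --rm -t owasp/zap2docker-weekly curl -sS https://apiurl/api.json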

How to use swagger code-generator

I am working on creating a REST client, and I will be calling an API which gives this big JSON output. I want to know how to create the POJO classes automatically by feeding this JSON to swagger-codegen and letting it create the POJO classes for me, which will save manual time. Here is what I have tried.
To generate a PHP client for http://petstore.swagger.io/v2/swagger.json, please run the following
git clone https://github.com/swagger-api/swagger-codegen
cd swagger-codegen
mvn clean package
java -jar modules/swagger-codegen-cli/target/swagger-codegen-cli.jar generate \
-i http://petstore.swagger.io/v2/swagger.json \
-l php \
-o /var/tmp/php_api_client
(if you're on Windows, replace the last command with java -jar modules\swagger-codegen-cli\target\swagger-codegen-cli.jar generate -i http://petstore.swagger.io/v2/swagger.json -l php -o c:\temp\php_api_client)
I could not get past mvn clean package; it is giving the error:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project swagger-codegen: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test failed: There was an error in the forked process
[ERROR] java.lang.NoClassDefFoundError: io/swagger/models/properties/Property
Has anyone used this swagger-codegen successfully? Or, if you can suggest any other framework which provides this functionality, that would be of great help. Thanks in advance.
I have seen the following link:
Update code generated by Swagger code-gen
and I am able to run the application. Can anyone explain whether I can use this to get the POJO objects created for the JSON input?
Your problem is not Swagger itself; it comes from Maven, which says that it can't find a certain class. I downloaded the repo and it compiles on my machine with mvn validate package. Make sure you have .m2\repository\io\swagger\swagger-models... in your standard Maven cache directory; that is the dependency which contains the Property class.
Maven should actually download it right before compiling. Check the Maven output for connection errors, unreachable downloads, etc.
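If the swagger-models dependency is present and the failure really is limited to the forked test run, one workaround (not a fix for the root cause) is to skip the tests so you still get a usable CLI jar:
mvn clean package -DskipTests
Once the build succeeds, generating a Java client (the model package of the generated code contains the POJOs) follows the same pattern as the PHP example, e.g.:
java -jar modules/swagger-codegen-cli/target/swagger-codegen-cli.jar generate -i http://petstore.swagger.io/v2/swagger.json -l java -o /var/tmp/java_api_client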
