How to disable public IP in a predefined template for a Dataflow job launch - google-cloud-dataflow

I am trying to deploy a Dataflow job using one of Google's predefined templates via the Python API.
I do not want my Dataflow compute instances to have public IPs, so I use something like this:
GCSPATH="gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text"
BODY = {
"jobName": "{jobname}".format(jobname=JOBNAME),
"parameters": {
"inputTopic" : "projects/{project}/topics/{topic}".format(project=PROJECT, topic=TOPIC),
"outputDirectory": "gs://{bucket}/pubsub-backup-v2/{topic}/".format(bucket=BUCKET, topic=TOPIC),
"outputFilenamePrefix": "{topic}-".format(topic=TOPIC),
"outputFilenameSuffix": ".txt"
},
"environment": {
"machineType": "n1-standard-1",
"usePublicIps": False,
"subnetwork": SUBNETWORK,
}
}
request = service.projects().templates().launch(projectId=PROJECT, gcsPath=GCSPATH, body=BODY)
response = request.execute()
but I get this error:
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://dataflow.googleapis.com/v1b3/projects/ABC/templates:launch?alt=json&gcsPath=gs%3A%2F%2Fdataflow-templates%2Flatest%2FCloud_PubSub_to_GCS_Text returned "Invalid JSON payload received. Unknown name "use_public_ips" at 'launch_parameters.environment': Cannot find field.">
If I remove usePublicIps, the launch goes through, but my compute instances get deployed with public IPs.

The parameter usePublicIps cannot be overridden at runtime. You need to pass this parameter with the value false to the Dataflow template generation command:
mvn compile exec:java -Dexec.mainClass=class -Dexec.args="--project=$PROJECT \
--runner=DataflowRunner --stagingLocation=bucket --templateLocation=bucket \
--usePublicIps=false"
This adds an ipConfiguration entry to the template's JSON, indicating that workers need private IPs only.
The links below are screenshots of the template JSON with and without the ipConfiguration entry.
Template with usePublicIps=false
Template without usePublicIps=false

It seems you are using the JSON from projects.locations.templates.create.
The environment block documented here needs to follow this format:
"environment": {
"machineType": "n1-standard-1",
"ipConfiguration": "WORKER_IP_PRIVATE",
"subnetwork": SUBNETWORK // sample: regions/${REGION}/subnetworks/${SUBNET}
}
The value for ipConfiguration is an enum documented at Job.WorkerIPAddressConfiguration
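Applied to the snippet in the question, the launch body would look roughly like this (a sketch reusing the question's variables; ipConfiguration replaces usePublicIps):
# Sketch of the corrected launch request, reusing the variables from the question.
# "usePublicIps" is not a field of the launch environment; "ipConfiguration" is.
GCSPATH = "gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text"
BODY = {
    "jobName": "{jobname}".format(jobname=JOBNAME),
    "parameters": {
        "inputTopic": "projects/{project}/topics/{topic}".format(project=PROJECT, topic=TOPIC),
        "outputDirectory": "gs://{bucket}/pubsub-backup-v2/{topic}/".format(bucket=BUCKET, topic=TOPIC),
        "outputFilenamePrefix": "{topic}-".format(topic=TOPIC),
        "outputFilenameSuffix": ".txt"
    },
    "environment": {
        "machineType": "n1-standard-1",
        "ipConfiguration": "WORKER_IP_PRIVATE",  # workers get private IPs only
        "subnetwork": SUBNETWORK  # e.g. regions/<REGION>/subnetworks/<SUBNET>
    }
}

request = service.projects().templates().launch(projectId=PROJECT, gcsPath=GCSPATH, body=BODY)
response = request.execute()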

Reading the docs for Specifying your Network and Subnetwork on Dataflow, I see that Python uses use_public_ips=false instead of the usePublicIps=false used by Java. Try changing that parameter.
Also, keep in mind that:
When you turn off public IP addresses, the Cloud Dataflow pipeline can access resources only in the following places:
another instance in the same VPC network
a Shared VPC network
a network with VPC Network Peering enabled
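For completeness, if you are writing and launching the Beam pipeline yourself in Python (rather than launching a Google-provided template), a minimal sketch of setting this option programmatically could look like the following; it assumes the Beam WorkerOptions fields use_public_ips and subnetwork, and the project/bucket/region/subnet names are placeholders:
# Sketch: running a Python Beam pipeline on Dataflow without public worker IPs.
# Project, bucket, region and subnetwork values are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, StandardOptions, WorkerOptions)

options = PipelineOptions()
options.view_as(StandardOptions).runner = "DataflowRunner"

gcp = options.view_as(GoogleCloudOptions)
gcp.project = "my-project"
gcp.region = "europe-west1"
gcp.temp_location = "gs://my-bucket/temp"

workers = options.view_as(WorkerOptions)
workers.use_public_ips = False  # same effect as passing --no_use_public_ips
workers.subnetwork = "regions/europe-west1/subnetworks/my-subnet"

with beam.Pipeline(options=options) as p:
    p | beam.Create(["hello"]) | beam.Map(print)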

I found one way to make this work:
Clone the Google-provided templates
Run the template with custom parameters:
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.PubsubToText \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=${PROJECT_ID} \
--stagingLocation=gs://${BUCKET}/dataflow/pipelines/${PIPELINE_FOLDER}/staging \
--tempLocation=gs://${BUCKET}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \
--runner=DataflowRunner \
--windowDuration=2m \
--numShards=1 \
--inputTopic=projects/${PROJECT_ID}/topics/$TOPIC \
--outputDirectory=gs://${BUCKET}/temp/ \
--outputFilenamePrefix=windowed-file \
--outputFilenameSuffix=.txt \
--workerMachineType=n1-standard-1 \
--subnetwork=${SUBNET} \
--usePublicIps=false"

Besides all the other methods mentioned so far, gcloud dataflow jobs run and gcloud dataflow flex-template run define the optional flag --disable-public-ips.

Related

Eventarc triggers for cross-project

I have created a Cloud Run service, but my Eventarc trigger is not firing for the cross-project read. How do I set the event filter for resourceName in Eventarc, using the BigQuery InsertJob / job completed event, so that it triggers on the BigQuery table in the destination project?
gcloud eventarc triggers create ${SERVICE}-test1 \
--location=${REGION} --service-account ${SVC_ACCOUNT} \
--destination-run-service ${SERVICE} \
--destination-run-region=${REGION} \
--event-filters type=google.cloud.audit.log.v1.written \
--event-filters methodName=google.cloud.bigquery.v2.JobService.InsertJob \
--event-filters serviceName=bigquery.googleapis.com \
--event-filters-path-pattern resourceName="/projects/destinationproject/locations/us-central1/jobs/*"
I have tried multiple options for the resource name, like:
"projects/projectname/datasets/outputdataset/tables/outputtable"

How to use a custom dataset for T5X?

I've created a custom seqio task and added it to the TaskRegistry following the instructions in the documentation. When I set the gin parameters to account for the new task I've created, I receive an error saying that my task does not exist:
No Task or Mixture found with name [my task name]. Available:
Am I using the correct Mixture/Task module that needs to be imported? If not, what is the correct statement that would allow me to use my custom task?
--gin.MIXTURE_OR_TASK_MODULE=\"t5.data.tasks\"
Here is the full eval script I am using.
python3 t5x/eval.py \
--gin_file=t5x/examples/t5/t5_1_0/11B.gin \
--gin_file=t5x/configs/runs/eval.gin \
--gin.MIXTURE_OR_TASK_NAME=\"task_name\" \
--gin.MIXTURE_OR_TASK_MODULE=\"t5.data.tasks\" \
--gin.partitioning.PjitPartitioner.num_partitions=8 \
--gin.utils.DatasetConfig.split=\"test\" \
--gin.DROPOUT_RATE=0.0 \
--gin.CHECKPOINT_PATH=\"${CHECKPOINT_PATH}\" \
--gin.EVAL_OUTPUT_DIR=\"${EVAL_OUTPUT_DIR}\"
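One thing to check: --gin.MIXTURE_OR_TASK_MODULE must point at the module that actually calls TaskRegistry.add for your task, so that importing it registers the task; t5.data.tasks only registers the stock T5 tasks. A rough sketch of such a module (the module name, file paths and vocabulary below are hypothetical) might look like:
# my_project/tasks.py -- hypothetical module that registers the custom task.
# Pass --gin.MIXTURE_OR_TASK_MODULE=\"my_project.tasks\" so this file is imported.
import functools

import seqio
from t5.data import preprocessors as t5_preprocessors

VOCAB = seqio.SentencePieceVocabulary("gs://my-bucket/vocab/sentencepiece.model")  # placeholder

seqio.TaskRegistry.add(
    "task_name",  # must match --gin.MIXTURE_OR_TASK_NAME
    source=seqio.TextLineDataSource(
        split_to_filepattern={
            "train": "gs://my-bucket/data/train.tsv",  # placeholder
            "test": "gs://my-bucket/data/test.tsv",    # placeholder
        }),
    preprocessors=[
        functools.partial(t5_preprocessors.parse_tsv, field_names=["inputs", "targets"]),
        seqio.preprocessors.tokenize,
        seqio.preprocessors.append_eos,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=VOCAB),
        "targets": seqio.Feature(vocabulary=VOCAB),
    },
)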

Running same DF template in parallel yields strange results

I have a Dataflow job that extracts data from Cloud SQL and loads it into Cloud Storage. We've configured the job to accept parameters so we can use the same code to extract multiple tables. The Dataflow job is compiled as a template.
When we create/run instances of the template in serial, we get the results we expect. However, if we create/run instances in parallel, only a few files turn up on Cloud Storage. In both cases we can see that the Dataflow jobs are created and terminate successfully.
For example, we have 11 instances which produce 11 output files. In serial we get all 11 files; in parallel we only get around 3 files. During the parallel run, all 11 instances were running at the same time.
Can anyone offer some advice as to why this is happening? I'm assuming that temporary files created by the Dataflow template are somehow overwritten during the parallel run?
The main motivation for running in parallel is to extract the data more quickly.
Edit
The pipeline is pretty simple:
PCollection<String> results = p
    .apply("Read from Cloud SQL", JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration
            .create(dsDriver, dsConnection)
            .withUsername(options.getCloudSqlUsername())
            .withPassword(options.getCloudSqlPassword())
        )
        .withQuery(options.getCloudSqlExtractSql())
        .withRowMapper(new JdbcIO.RowMapper<String>() {
            @Override
            public String mapRow(ResultSet resultSet) throws Exception {
                return mapRowToJson(resultSet);
            }
        })
        .withCoder(StringUtf8Coder.of()));
When I compile the template I do
mvn compile exec:java \
-Dexec.mainClass=com.xxxx.batch_ingestion.LoadCloudSql \
-Dexec.args="--project=myproject \
--region=europe-west1 \
--stagingLocation=gs://bucket/dataflow/staging/ \
--cloudStorageLocation=gs://bucket/data/ \
--cloudSqlInstanceId=yyyy \
--cloudSqlSchema=dev \
--runner=DataflowRunner \
--templateLocation=gs://bucket/dataflow/template/BatchIngestion"
When I invoke the template I also provide "tempLocation". I can see the dynamic temp locations are being used. Despite this I'm not seeing all the output files when running in parallel.
Thanks
Solution
Add a unique tempLocation per template instance
Add a unique output path & filename per instance
Move the output files to their final destination on Cloud Storage after Dataflow completes its processing (see the sketch below)
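A rough sketch of what the first two points can look like when launching the template from Python via the templates API (the parameter names cloudSqlExtractSql and outputLocation, and the table list, are illustrative; use whatever parameters your template actually exposes):
# Sketch: one launch per table, each with its own tempLocation and output path
# so that parallel runs cannot overwrite each other's temporary or output files.
from googleapiclient.discovery import build

service = build("dataflow", "v1b3")
TEMPLATE = "gs://bucket/dataflow/template/BatchIngestion"

for table in ["table_a", "table_b", "table_c"]:  # placeholder table list
    body = {
        "jobName": "batch-ingestion-{}".format(table),
        "parameters": {
            "cloudSqlExtractSql": "SELECT * FROM dev.{}".format(table),  # illustrative
            "outputLocation": "gs://bucket/data/{}/".format(table),      # unique output path
        },
        "environment": {
            "tempLocation": "gs://bucket/dataflow/temp/{}/".format(table)  # unique temp dir
        },
    }
    service.projects().templates().launch(
        projectId="myproject", gcsPath=TEMPLATE, body=body).execute()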

What is the NGSI v2 endpoint for mimicking IoT Agent commands?

When testing commands Southbound, I am currently using the NGSI v1 endpoint as shown:
curl -X POST \
'http://{{iot-agent}}/v1/updateContext' \
-H 'Content-Type: application/json' \
-H 'fiware-service: openiot' \
-H 'fiware-servicepath: /' \
-d '{
  "contextElements": [
    {
      "type": "Bell",
      "isPattern": "false",
      "id": "urn:ngsi-ld:Bell:001",
      "attributes": [
        {
          "name": "ring",
          "type": "command",
          "value": ""
        }
      ]
    }
  ],
  "updateAction": "UPDATE"
}'
As you can see this is an NGSI v1 request. According to this presentation on Slideshare (slide 16) use of NGSI v1 is discouraged - I would like to replace this with an NGSI v2 request. I believe that all IoT Agents are now NGSI v2 capable, however I have been unable to find the details of the replacement NGSI v2 request within the documentation.
So the question is what is the equivalent cUrl command to mimic a command from Orion using NGSI v2?
In this document you can see a good reference on how to send commands using the NGSIv2 API:
If you take a look at the previous device example, you can find that a "ping" command was defined. Any update to this "ping" attribute on the NGSI entity in the ContextBroker will send a command to your device. For instance, to send the "ping" command with the value "Ping request" you could use the following operation in the ContextBroker API:
PUT /v2/entities/[ENTITY_ID]/attrs/ping
{
  "value": "Ping request",
  "type": "command"
}
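As a concrete illustration, adapting that operation to the bell from the question, the full request against Orion might look like this in Python (a sketch; the Orion host/port localhost:1026 is an assumption, and the requests library is used purely for illustration):
# Sketch: NGSIv2 attribute update sent to Orion (not the IoT Agent) to trigger
# the bell's "ring" command. Host/port and service headers are placeholders.
import requests

response = requests.put(
    "http://localhost:1026/v2/entities/urn:ngsi-ld:Bell:001/attrs/ring",
    headers={
        "Content-Type": "application/json",
        "fiware-service": "openiot",
        "fiware-servicepath": "/",
    },
    json={"value": "", "type": "command"},
)
print(response.status_code)  # Orion returns 204 No Content on success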
The ContextBroker API is quite flexible and allows updating an attribute in several ways. Please have a look at the NGSIv2 specification for details.
Important note: don't use operations in the NGSI API with creation semantics. Otherwise, the entity/attribute will be created locally in ContextBroker and the command will not progress to the device (and you will need to delete the created entity/attribute if you want to make it work again). Thus, the following operations must not be used:
POST /v2/entities
PUT /v2/entities
POST /v2/op/update with actionType append, appendStrict or replace
POST /v1/updateContext with actionType APPEND, APPEND_STRICT or REPLACE
EDIT: all the above refers to the Orion endpoint used by the final client to send commands. @jason-fox has clarified that the question refers to the IOTA endpoint that receives command requests from Orion (it should have been evident from the {{iot-agent}}, but I missed that part, sorry :)
The Orion-to-IOTA communication for commands is based on the registration-forwarding mechanism. Currently, Orion always uses NGSIv1 to forward updates (even when the client uses NGSIv2 updates). In the future we envision the usage of NGSIv2, but in order to achieve this we first need:
To complete the Context Source Forwarding Specification, based on NGSIv2. It is currently under discussion in this PR. Feedback is welcome as comments to that PR!
To implement forwarding based on the Context Source Forwarding Specification in Orion
To implement an NGSIv2 endpoint compliant with the Context Source Forwarding Specification in the IOTAs.
While the above gets completed, the only mechanism is the current one based on NGSIv1. However, note that the Orion-IOTA interaction is internal to the platform components, and final clients can base all their interactions with the platform (in particular, with the Orion endpoint) on NGSIv2, so this is not a big issue.
The Context Source Forwarding Specification, based on NGSIv2, is now complete and the old /v1 endpoint has been deprecated. According to the discussions in the associated Support for NGSIv2 issue, the correct request to send is as follows:
curl -iX POST \
http://localhost:4041/v2/op/update \
-H 'Content-Type: application/json' \
-H 'fiware-service: openiot' \
-H 'fiware-servicepath: /' \
-d '{
  "actionType": "update",
  "entities": [
    {
      "type": "Bell",
      "id": "urn:ngsi-ld:Bell:001",
      "ring": {
        "type": "command",
        "value": ""
      }
    }
  ]
}'

Cron-like application of groovy script with console plugin environment?

We have an application that we would like to run a script against, just as we do in the console window, with access to the application's libraries and context, but we need to run it periodically like a cron job.
While the permanent answer is obviously a Quartz job, we need to do this before we are able to patch the application.
Is there something available that gives us the same environment as the console plugin but can be run via the command line or without a UI?
You can run a console script just like the web interface does, but with curl, like this:
curl -F 'code=
class A {
    def name
}
def foo = new A(name: "bar")
println foo.name
' localhost:8080/console/execute
You'll get the response that the console would print.
With regard to @mwaisgold's solution above, I made a couple of quick additions that helped. I added a little bit more to the script to handle authentication; also, the -F flag for curl caused an ambiguous method overloading error with GroovyShell's evaluate method, so I addressed that by using -d instead:
#!/bin/bash
curl -i -H "Content-type: application/x-www-form-urlencoded" -c cookies.txt -X POST localhost:8080/myapp/j_spring_security_check -d "j_username=admin&j_password=admin"

curl -i -b cookies.txt -d 'code=
int iterations = 0
while (iterations < 10) {
    log.error "********** Console Cron Test ${iterations++} ***********"
}
log.error "********** Console Cron Test Complete ***********"
' localhost:8080/myapp/console/execute
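The same flow translates directly to Python if you would rather schedule it without shell and curl (a sketch; the URL, credentials and form field names simply mirror the script above):
# Sketch: log in through Spring Security, then post Groovy code to the console
# plugin's execute endpoint. URL and credentials are the same placeholders as above.
import requests

BASE = "http://localhost:8080/myapp"
session = requests.Session()

# Authenticate; the session keeps the JSESSIONID cookie for the next request.
session.post(BASE + "/j_spring_security_check",
             data={"j_username": "admin", "j_password": "admin"})

groovy = '''
int iterations = 0
while (iterations < 10) {
    log.error "********** Console Cron Test ${iterations++} ***********"
}
log.error "********** Console Cron Test Complete ***********"
'''

# Form-encoded "code" parameter, matching curl -d 'code=...'
response = session.post(BASE + "/console/execute", data={"code": groovy})
print(response.text)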
