How are WorkerHarnessThreads managed in Cloud Dataflow? - google-cloud-dataflow

Is the numberOfWorkerHarnessThreads option still used by the Cloud Dataflow runner?
Earlier, the PipelineOptions property numberOfWorkerHarnessThreads was documented and shown in the Dataflow Job Monitoring UI under Pipeline options. Both references are missing now.
If this is not used, how are the worker threads managed now?

The option is still there. You can find it in DataflowPipelineDebugOptions.
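For reference, here is a minimal Java sketch of setting it programmatically, assuming the Beam Java SDK with the Dataflow runner dependency on the classpath (the thread count of 16 is just an illustrative value); the option can equally be passed on the command line as --numberOfWorkerHarnessThreads=N:

import org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class HarnessThreadsExample {
  public static void main(String[] args) {
    // Parse the usual flags, e.g. --runner=DataflowRunner --project=... --region=...
    DataflowPipelineDebugOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineDebugOptions.class);

    // Cap the number of threads each worker harness uses to process work items.
    // Leaving it unset lets the worker pick its own default.
    options.setNumberOfWorkerHarnessThreads(16);

    Pipeline pipeline = Pipeline.create(options);
    // ... add transforms here ...
    pipeline.run();
  }
}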

Related

Spring Cloud Data Flow - Task Properties

I'm using SCDF and I was wondering whether there is any way to configure default properties for an application.
I have a task application registered in SCDF, and this application needs some JDBC properties to access a business database:
app.foo.export.datasource.url=jdbc:db2://blablabla
app.foo.export.datasource.username=testuser
app.foo.export.datasource.password=**************
app.foo.export.datasource.driverClassName=com.ibm.db2.jcc.DB2Driver
Do I really need to put these properties in a properties file like this? (It feels a bit weird to define them at launch time.)
task launch fooTask --propertiesFile aaa.properties
Also, we cannot use the REST API, since the credentials would appear in the URL.
Is there another way/place to define default business properties for an application? These properties will only be used by this task.
The goal is to have one place where the OPS team can configure the URL and credentials without touching the launch command.
Thank you.
Yeah, SCDF feels a bit weird in the configuration area.
As you wrote, you can register an application and create tasks, but all the configuration is passed at the first launch of the task. Put another way, you can't fully install/configure a task without running it.
As soon as a task has run once, you can relaunch it without any configuration and it uses the configuration from before. The whole config is saved in the SCDF database.
However, if you try to overwrite an existing configuration property with a new value, SCDF seems to ignore the new value and continues to use the old one. We have no idea whether this is by design, a bug, or something we are doing wrong.
Because we run SCDF tasks on Kubernetes and we are used to configuring all our infrastructure in YAML files, the best option we found was to write our own operator for SCDF.
This operator works against the REST interface of SCDF and also compensates for the configuration quirks mentioned above.
For example, the overwrite issue is solved by first deleting the configuration and then recreating it with the new values.
With this operator we have achieved what you are looking for: all our SCDF configuration lives in a git repository, and all changes go through merge requests. Thanks to CI/CD, the new configuration is used on the next launch.
However, a Kubernetes operator should be part of the product. Without it, SCDF on Kubernetes feels quite "alien".

Enabling Debug|Trace worker logs for google cloud dataflow

I am not able to enable debug/trace-level logging for the Dataflow workers.
The documentation (https://cloud.google.com/dataflow/docs/guides/logging#SettingLevels)
describes using DataflowWorkerLoggingOptions to programmatically override the default log level on the workers and enable debug/trace-level logging; however, the interface is deprecated and no longer present in beam-sdk 2.27.0.
Has anyone been able to enable worker-level debug logging in Cloud Dataflow, in any way?
The documentation is still up to date and the interface is still present and will work.
The interface is deprecated because the Java-based Dataflow worker is not used when running a pipeline using Beam's portability framework. Quoting the deprecation message:
@deprecated This interface will no longer be the source of truth for worker logging configuration once jobs are executed using a dedicated SDK harness instead of user code being co-located alongside Dataflow worker code. Please set the option below and also the corresponding option within org.apache.beam.sdk.options.SdkHarnessOptions to ensure forward compatibility.
So what you should do is follow the instructions that you linked and also set up logging in SdkHarnessOptions.
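As a concrete illustration, here is a minimal Java sketch (assuming the Beam Java SDK with the Dataflow runner dependency) that sets both log levels to DEBUG, as the deprecation message suggests:

import org.apache.beam.runners.dataflow.options.DataflowWorkerLoggingOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.SdkHarnessOptions;

public class DebugLoggingExample {
  public static void main(String[] args) {
    // Parse the usual flags, e.g. --runner=DataflowRunner --project=... --region=...
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

    // Legacy Dataflow worker: deprecated interface, but still honored today.
    options.as(DataflowWorkerLoggingOptions.class)
        .setDefaultWorkerLogLevel(DataflowWorkerLoggingOptions.Level.DEBUG);

    // Portable SDK harness: set this as well for forward compatibility.
    options.as(SdkHarnessOptions.class)
        .setDefaultSdkHarnessLogLevel(SdkHarnessOptions.LogLevel.DEBUG);

    Pipeline pipeline = Pipeline.create(options);
    // ... add transforms here ...
    pipeline.run();
  }
}

The same settings should also be reachable as command-line flags (--defaultWorkerLogLevel=DEBUG and --defaultSdkHarnessLogLevel=DEBUG), since pipeline option flag names are derived from the getter names.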

Specifying --diskSizeGb when running a dataflow template

I'm trying to use a Google Dataflow template to export data from Bigtable to Google Cloud Storage (GCS), following the gcloud command details here. However, when running it I get a warning and an associated error where the suggested fix is to add workers (--numWorkers) or increase the attached disk size (--diskSizeGb). However, I see no way to pass those parameters when executing the Google-provided template. Am I missing something?
Reviewing a separate question, it seems like there is a way to do this. Can someone explain how?
Parameters like numWorkers and diskSizeGb are Dataflow-wide pipeline options. You should be able to specify them like so (check gcloud dataflow jobs run --help for the exact flag names supported by your gcloud version):
gcloud dataflow jobs run JOB_NAME \
--gcs-location LOCATION --num-workers=$NUM_WORKERS --diskSizeGb=$DISK_SIZE
Let me know if you have further questions.

Is it ok to directly overwrite Dataflow template parameters that are set at build-time?

We would like to prevent certain parameters (namely filesToStage) of our Dataflow template from populating in the Dataflow Job page. Is there a recommended way to achieve this? We've found that simply specifying "filesToStage=" when launching the template via gcloud suffices, but we're not sure if this is robust/stable behavior.
For context, we are hosting this Dataflow template for customer usage and would like to hide as much of the implementation as possible (including classpaths).
Specifically, filesToStage can be sent as blank, and the files to stage will then be inferred from the Java classpath:
If filesToStage is blank, Dataflow will infer the files to stage based on the Java classpath.
More information on the considerations for this and other fields can be found here.
For other parameters, the recommendation is to use Cloud KMS to keep the parameters hidden.

How could code know it's running on Google Dataflow?

I'd like to use some configs for a library that's used both on Dataflow and in a normal environment.
Is there a way for the code to check whether it's running on Dataflow? I couldn't see an environment variable, for example.
Quasi-follow-up to Google Dataflow non-python dependencies - separate setup.py?
One option is to use PipelineOptions, which contains the pipeline runner information. As mentioned in the beam documentation: "When you run the pipeline on a runner of your choice, a copy of the PipelineOptions will be available to your code. For example, you can read PipelineOptions from a DoFn’s Context."
More about PipelineOptions: https://beam.apache.org/documentation/programming-guide/#configuring-pipeline-options
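As a sketch of that approach with the Java SDK (the class and output strings here are illustrative, and whether the runner class survives serialization to the worker is an assumption worth verifying on your setup), a DoFn can inspect the options it is handed at runtime:

import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.transforms.DoFn;

// Illustrative DoFn that branches on the runner recorded in PipelineOptions.
class RunnerAwareFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    PipelineOptions options = c.getPipelineOptions();
    Class<?> runner = options.getRunner();
    boolean onDataflow = runner != null && "DataflowRunner".equals(runner.getSimpleName());
    // Choose environment-specific configuration for the library here.
    c.output((onDataflow ? "dataflow:" : "local:") + c.element());
  }
}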
This is not a good answer, but it may be the best we can do at the moment:
import os
if 'harness' in os.environ.get('HOSTNAME', ''):
