Update delay for dataflow UDF - google-cloud-dataflow

I have a Dataflow pipeline from Pub/Sub to BigQuery that uses a JavaScript UDF to manipulate the data. If I modify the file in Cloud Storage, does the running Dataflow job automatically start using the new UDF, is there a delay, or do I have to trigger the update manually? I changed the UDF, but the pipeline behaves as if it were still running with the old one.
Also, what is the best way to debug these UDFs that run on Dataflow?
Thanks!

You mean the Dataflow Template, right?
Unfortunately, the UDF does not refresh when you change the file. To pick up the new file, you need to perform a pipeline update, or stop and restart your pipeline.
As for debugging the UDFs, I am not sure what the best way is, but you can access the pipeline code in the DataflowTemplates repository on GitHub and debug the pipeline by running it locally, or by writing a reduced version of it.
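One way to iterate on the UDF faster is to evaluate the JavaScript file locally with a JVM script engine before uploading it to Cloud Storage. Below is a minimal sketch, not the template's actual test harness; the file name transform.js, the function name transform, and the sample payload are placeholders for your own values, and it assumes a JDK where the Nashorn engine is available (Java 8-14, or the standalone nashorn-core artifact on newer JDKs).
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import javax.script.Invocable;
    import javax.script.ScriptEngine;
    import javax.script.ScriptEngineManager;

    /** Evaluates a Dataflow-style JavaScript UDF locally against a sample message. */
    public class UdfLocalTest {
      public static void main(String[] args) throws Exception {
        // Assumes a JDK that still ships Nashorn (Java 8-14), or the standalone
        // nashorn-core dependency on newer JDKs.
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");
        // Path to the same .js file you upload to Cloud Storage (hypothetical name).
        engine.eval(Files.newBufferedReader(Paths.get("transform.js")));

        // "transform" stands in for whatever function name you configured for the
        // template; the payload is a made-up sample message.
        Object out = ((Invocable) engine).invokeFunction(
            "transform", "{\"id\": 1, \"name\": \"test\"}");
        System.out.println(out);
      }
    }
Feeding a few representative Pub/Sub payloads through this gives much quicker feedback than re-launching the template after every change.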

Related

Get Sonarqube Analysis Status on a variable (GUI Job)

I couldn't find any solution to this particular need.
Basically, I have a GUI job and I need the status of the SonarQube analysis so I can later send a POST request with it.
(I'm aware that a pipeline job exists and works great, but for a specific reason I need it to be a GUI job.)
In a pipeline you have the waitForQualityGate() step and its status field; I've tried using this but with no success.
Example of what is desired
Any insights? Thanks in advance.
You can use the SonarQube REST API to get the status.
Whenever you run a SonarQube analysis through a Jenkins pipeline, upon successful analysis you will see a report-task.txt file created in the workspace folder.
Note: the location of the report-task.txt file depends on the tool that was used to generate it. The mvn sonar:sonar task defaults to the path target/sonar. In my case, I used SonarScanner to analyse a Node.js project, so the location of report-task.txt is .scannerwork.
Now, you will find the ceTaskUrl and ceTaskId in report-task.txt. You can use that ceTaskUrl to get the analysisId.
Then, you can use the API below to get the quality gate status using the analysisId:
http://<sonarqube_host>/api/qualitygates/project_status?analysisId=$ANALYSIS_ID
Finally, capture the curl output of the above API call in a variable.
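If you prefer to do that from code rather than shell, here is a minimal sketch using Java's built-in HTTP client; the .scannerwork/report-task.txt path, the <sonar_token> placeholder, and the crude regex-based JSON extraction are illustrative assumptions, not part of the SonarQube tooling.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Base64;
    import java.util.Properties;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Sketch: read report-task.txt, resolve the analysisId, then fetch the quality gate status. */
    public class QualityGateStatus {
      public static void main(String[] args) throws Exception {
        // Adjust the path to wherever your scanner writes report-task.txt (e.g. .scannerwork/).
        Properties report = new Properties();
        report.load(Files.newBufferedReader(Paths.get(".scannerwork/report-task.txt")));

        HttpClient http = HttpClient.newHttpClient();
        // <sonar_token> is a placeholder: a user token passed as the username with an empty password.
        String auth = "Basic " + Base64.getEncoder().encodeToString("<sonar_token>:".getBytes());

        // Poll the CE task (ceTaskUrl) until the background analysis finishes and exposes an analysisId.
        String analysisId = null;
        while (analysisId == null) {
          String body = http.send(
              HttpRequest.newBuilder(URI.create(report.getProperty("ceTaskUrl")))
                  .header("Authorization", auth).build(),
              HttpResponse.BodyHandlers.ofString()).body();
          Matcher m = Pattern.compile("\"analysisId\":\"([^\"]+)\"").matcher(body);
          if (m.find()) {
            analysisId = m.group(1);
          } else {
            Thread.sleep(2000); // analysis still PENDING / IN_PROGRESS
          }
        }

        // Same endpoint as in the answer above; prints JSON containing the quality gate status.
        String statusJson = http.send(
            HttpRequest.newBuilder(URI.create(report.getProperty("serverUrl")
                + "/api/qualitygates/project_status?analysisId=" + analysisId))
                .header("Authorization", auth).build(),
            HttpResponse.BodyHandlers.ofString()).body();
        System.out.println(statusJson);
      }
    }
In a freestyle (GUI) job you could run something like this, or the equivalent pair of curl calls, as a build step and parse the status field out of the final response.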
If you mean to say that you want a custom variable message to pop up in your Jenkins GUI based on the SonarQube scan status, then that would require you to:
Clone the original Jenkins source code
Add a custom HTML button/div/graphic
Compile the Jenkins code
Build the new code
Execute the generated JAR
Otherwise, you can try some of the plugins available for Jenkins that give you the ability to render conditional outputs. No promises on whether they can actually help you change the original GUI.
No alternative traditional approach will be able to fulfill your GUI requirement.

Force update of SideInput on updating Dataflow pipeline

I have a Dataflow pipeline running that fetches a configuration of active tenants (stored in GCS) and feeds it into an ActiveTenantFilter as a side input. The configuration is rarely updated, which is why I decided to re-deploy the pipeline, using the --update flag, whenever it changes.
However, when using the update flag, the file is not fetched again, i.e., the state is maintained. Is it possible to enforce that this PCollectionView is updated whenever the pipeline is re-deployed?
You are correct: when you --update a pipeline, it will process new data but will not re-load old data. It sounds like what you want is slowly updating side inputs, which unfortunately has not been implemented yet. You could instead try draining and re-starting your pipeline.
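For context, a side input like this is typically materialized once, from data read when the pipeline starts, which is why --update keeps the old view. A minimal sketch of that wiring, with an assumed bucket path and View.asList() chosen just for illustration:
    import java.util.List;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    public class TenantSideInput {
      // The view is built from data read when the pipeline (re)starts. With --update,
      // Dataflow carries the existing state over, so the GCS file is not read again.
      static PCollectionView<List<String>> activeTenantsView(Pipeline pipeline) {
        PCollection<String> config =
            pipeline.apply("ReadTenantConfig",
                TextIO.read().from("gs://my-bucket/config/active-tenants.txt")); // hypothetical path
        return config.apply("AsSideInput", View.asList());
      }
    }
Draining and re-submitting the job rebuilds the view from the current contents of the file.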

Question: BigQueryIO creates one file per input line, is that correct?

I'm new to Apache Beam and I'm developing a pipeline that gets rows from JdbcIO and sends them to BigQueryIO. I'm converting the rows to Avro files with withAvroFormatFunction, but it is creating a new file for each row returned by JdbcIO. The same happens with withFormatFunction and JSON files.
It is very slow to run locally with DirectRunner because it uploads a lot of files to Google Cloud Storage. Is this approach good for scaling on Google Dataflow? Is there a better way to deal with it?
Thanks
In BigQueryIO there is an option, withNumFileShards, which controls the number of files that get generated when using BigQuery load jobs.
From the documentation:
Control how many file shards are written when using BigQuery load jobs. Applicable only when also setting withTriggeringFrequency(org.joda.time.Duration).
You can test your process by setting the value to 1 to see if only one large file gets created.
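For reference, a minimal sketch of what that configuration could look like; the table spec, triggering frequency, and dispositions are assumptions, and note that withNumFileShards only takes effect together with withTriggeringFrequency, i.e. when writing an unbounded PCollection with load jobs:
    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class WriteWithFewShards {
      static void writeToBigQuery(PCollection<TableRow> rows) {
        rows.apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")                  // hypothetical table spec
                .withMethod(Method.FILE_LOADS)                         // batch load jobs instead of streaming inserts
                .withTriggeringFrequency(Duration.standardMinutes(5))  // needed for withNumFileShards (unbounded input)
                .withNumFileShards(1)                                  // start with 1 to check a single file is produced
                .withCreateDisposition(CreateDisposition.CREATE_NEVER) // assumes the table already exists
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));
      }
    }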
BigQueryIO will commit results to BigQuery for each bundle. The DirectRunner is known to be a bit inefficient about bundling. It never combines bundles. So whatever bundling is provided by a source is propagated to the sink. You can try using other runners such as Flink, Spark, or Dataflow. The in-process open source runners are about as easy to use as the direct runner. Just change --runner=DirectRunner to --runner=FlinkRunner and the default settings will run in local embedded mode.

How to ensure db scripts runs only once in Jenkins CI pipeline

Our application uses continuous integration with Jenkins. We have a problem at hand in deploying incremental DB changes to an Oracle server.
The current mechanism is to have rollback scripts and alter/incremental scripts (both DDL and DML).
In the Jenkins pipeline, we call the rollback first and then the incremental changes every time the build runs, along with our Java code changes. This is not an ideal way to solve the problem.
I am looking for best practices that will allow incremental DB scripts to run only once.
I mentioned that best practice before, and it is not tied to Jenkins CI: you need to record the execution of your DB scripts in a dedicated table of your database.
That is what a product like Flyway does, but you can implement that "record" part yourself too. That way, when your Jenkins CI pipeline job re-executes those scripts (through a wrapper of yours), said wrapper will detect that they were already executed.
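For illustration, a minimal sketch of the Flyway approach; the JDBC URL, credentials, and migration location are placeholders:
    import org.flywaydb.core.Flyway;

    public class MigrateDb {
      public static void main(String[] args) {
        // Flyway records every applied script in its schema history table
        // (flyway_schema_history), so re-running this Jenkins step is a no-op
        // for scripts that have already been executed.
        Flyway flyway = Flyway.configure()
            .dataSource("jdbc:oracle:thin:@//db-host:1521/ORCL", "app_user", "app_password") // placeholders
            .locations("filesystem:db/migration") // e.g. V1__create_tables.sql, V2__add_index.sql
            .load();
        flyway.migrate();
      }
    }
Because the schema history table records what has already run, the same Jenkins stage can execute on every build without re-applying old scripts.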

Jenkins CI Results CSV

I am attempting to integrate Jenkins CI and ApTest so that my Jenkins run will automatically update ApTest results.
I have the ApTest importresults script below but need a way to generate the appropriate CSV file containing all results data from Jenkins.
Can someone help?
APTest import results.
Do you need all results data in the history of the job / system dumped every single time? Or are you looking for an incremental CSV after the completion of each build?
