Force update of SideInput on updating Dataflow pipeline - google-cloud-dataflow

I have a Dataflow pipeline running that fetches a configuration of active tenants (stored in GCS) and feeds it into an ActiveTenantFilter as a sideInput. The configuration is rarely updated, which is why I decided to re-deploy the pipeline with the --update flag whenever it changes.
However, when using the update flag, the file is not fetched again, i.e., the state is maintained. Is it possible to enforce that this PCollectionView is updated whenever the pipeline is re-deployed?

You are correct: when you --update a pipeline, it will process new data but will not re-load old data. It sounds like what you want is slowly updating side inputs, which unfortunately have not been implemented yet. You could instead try draining and restarting your pipeline.
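For reference, the drain-and-relaunch cycle can be scripted with the gcloud CLI; the job ID, region, and relaunch command below are placeholders for whatever your deployment normally uses:

    # Find the running job's ID
    gcloud dataflow jobs list --region=us-central1 --status=active

    # Drain it: in-flight elements are processed, then the job stops
    gcloud dataflow jobs drain JOB_ID --region=us-central1

    # Once the job reports JOB_STATE_DRAINED, relaunch with your usual
    # submission command; the new job reads the tenant config from GCS afresh
    ./deploy_pipeline.sh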

Related

How to re-run a build with the same parameters scheduled by cron

I know about the rebuild and replay functionality, but both of them are manual triggers. So here is our problem:
We have multiple servers, each of which can be deployed with any existing branch, but the deployment is manual. We want to ensure that, at least once a day, the latest version of that branch is deployed so that servers do not become outdated.
So what I want to do is create a scheduler job that runs once a day and triggers a Jenkins job to rebuild the last build using the exact same parameters.
Would be great if someone has some input here :-)
You can try out the Persistent Parameter plugin and use it to define the relevant parameters inside the deploy job that you want to reuse.
This plugin enables you to set your input parameters (string, text, boolean and choice) with default values taken from the previous build, so every time you run the build with parameters manually or trigger it from another job, the values used are those from the last execution.
Your caller job can still pass parameters to the deploy job during the daily execution - but for all parameters that are not passed their latest value will be used.
You can also override parameters defined as persistent, since persistence only affects the default value.
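As a rough sketch of the daily trigger, assuming an external cron and the Jenkins remote API (job name, URL, and credentials are placeholders; calling buildWithParameters with no explicit values should fall back to the defaults, which with this plugin are the last-used values):

    #!/bin/sh
    # trigger-deploy.sh: ask Jenkins to run the deploy job with no explicit
    # parameters, so the persistent defaults (last-used values) are applied.
    curl -fsS -X POST --user "ci-bot:API_TOKEN" \
        "https://jenkins.example.com/job/deploy-job/buildWithParameters"

    # crontab entry to run it once a day at 03:00:
    # 0 3 * * * /opt/ci/trigger-deploy.sh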

Update delay for dataflow UDF

I have a Dataflow pipeline from Pub/Sub to BigQuery that uses a JavaScript UDF to manipulate data. If I modify the file in Cloud Storage, does the running pipeline automatically pick up the new UDF, is there a delay, or do I have to trigger it manually? I changed the UDF but the pipeline behaves as if it were still running the old one.
Also, what is the best way to debug these UDF that run on dataflow?
Thanks!
You mean the Dataflow Template, right?
Unfortunately, the UDF does not refresh when you change the file. To pick up a new file, you need to perform a pipeline update, or stop and restart your pipeline.
As for debugging the UDFs, I am not sure what the best way is, but you can access the pipeline code in the DataflowTemplates repository on GitHub and debug the pipeline by running it locally, or by writing a reduced version of it.
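If stopping and restarting is acceptable, a rough sketch with the gcloud CLI could look like the following, assuming the Google-provided Pub/Sub-to-BigQuery template (job name, topic, table, UDF path and function name are placeholders; double-check the parameter names against the template you actually use):

    # Drain the current job so buffered messages are flushed before it stops
    gcloud dataflow jobs drain JOB_ID --region=us-central1

    # Relaunch from the template; the new job reads the UDF file from GCS again
    gcloud dataflow jobs run my-pubsub-to-bq \
        --region=us-central1 \
        --gcs-location=gs://dataflow-templates/latest/PubSub_to_BigQuery \
        --parameters=inputTopic=projects/MY_PROJECT/topics/MY_TOPIC,outputTableSpec=MY_PROJECT:my_dataset.my_table,javascriptTextTransformGcsPath=gs://my-bucket/udf.js,javascriptTextTransformFunctionName=transform

    # Quick local check of the UDF with a sample message (assumes Node.js and
    # that the UDF defines a function called "transform")
    node -e "$(cat udf.js); console.log(transform(JSON.stringify({test: 1})))"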

How to ensure db scripts runs only once in Jenkins CI pipeline

Our application uses continuous integration with Jenkins. The problem at hand is deploying incremental DB changes to an Oracle server.
The current mechanism is to have rollback scripts and alter/incremental scripts (both DDL and DML).
In the Jenkins pipeline, we call the rollback first and then the incremental changes every time the build runs, along with our Java code changes. This is not an ideal way to solve the problem.
I am looking for some best practices which will allow incremental db scripts to run only once.
I mentioned this best practice before, and it is not tied to Jenkins CI: you need to record the execution of your DB scripts in a dedicated table of your database.
That is what a product like Flyway does, but you can implement that "record" part yourself too. That way, when your Jenkins pipeline job re-executes those scripts (through a wrapper of yours), said wrapper will detect that they were already executed.
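A minimal sketch of such a wrapper, assuming the sqlplus client and a hypothetical SCRIPT_HISTORY bookkeeping table (connection string, table, and column names are illustrative only; Flyway maintains an equivalent history table for you):

    #!/bin/sh
    # run_once.sh SCRIPT.sql -- execute an incremental script only if it has not
    # been recorded in the SCRIPT_HISTORY table yet.
    SCRIPT="$1"
    CONN="app_user/app_pass@//dbhost:1521/ORCLPDB1"

    COUNT=$(printf "SET HEADING OFF FEEDBACK OFF PAGESIZE 0\nSELECT COUNT(*) FROM script_history WHERE script_name = '%s';\nEXIT;\n" "$SCRIPT" |
        sqlplus -s "$CONN" | tr -d '[:space:]')

    if [ "$COUNT" = "0" ]; then
        # Run the script, then record that it ran (add WHENEVER SQLERROR EXIT
        # FAILURE inside your scripts for real error handling)
        sqlplus -s "$CONN" "@$SCRIPT" < /dev/null || exit 1
        printf "INSERT INTO script_history (script_name, executed_at) VALUES ('%s', SYSDATE);\nCOMMIT;\nEXIT;\n" "$SCRIPT" | sqlplus -s "$CONN"
    else
        echo "$SCRIPT already executed, skipping."
    fi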

How to Remotely start jenkins build and get back result transactionally?

I had a request to create a Java client that starts a Jenkins build for a specific job and gets back the result of that build.
The problem is that the system is used by multiple users and their builds might get mixed up. Also, fetching the latest build may return the previously finished build instead of the current one. Is there any way to do the build/get-result cycle transactionally?
I don't think there's a way to get true transactional functionality (in the way that, say, Postgres is transactional), however, I think you can prevent collisions amongst multiple users by doing the following:
Have your build wrapped in a script (bash, Python, or similar) which takes an exclusive lock on a semfile before the build and releases it after the build is done. That is, a file which serves as a semaphore that the build process must be able to lock exclusively in order to proceed.
That way, if a build is in progress and another user triggers one, the in-progress build holds the lock on the semfile, and the second one will block waiting for the exclusive lock on that file, acquiring it only once the first build completes and releases the lock.
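A minimal sketch of such a wrapper, assuming util-linux flock; the lock path and the real build command are placeholders:

    #!/bin/sh
    # build_wrapper.sh -- run the real build under an exclusive lock on a
    # semaphore file. A second invocation blocks here until the first one
    # releases the lock (add "-w SECONDS" to fail after a timeout instead).
    LOCKFILE=/var/lock/myjob.build.lock

    flock -x "$LOCKFILE" ./run_build.sh "$@"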
Also, to be able to refer to each remote build after the fact, I would recommend you refer to my previous post Retrieve id of remotely triggered jenkins job.

Is there a possibility in jenkins to run build only if something changed (in ClearCase SCM) from last build?

I need to build in Jenkins only if there has been a change in the ClearCase stream. I want to check this also in nightly builds or when someone chooses to build manually, and to stop the build completely if there are no changes.
I tried Poll SCM but it doesn't seem to work well...
Any suggestion?
If it is possible, you should monitor the update of a snapshot view and, if the log of said update reveals any newly loaded files, trigger the Jenkins job.
You can find a similar approach in this thread.
You don't want to do something like that in a checkin trigger. It runs on the user's client and will slow things down, not to mention that you'd somehow have to figure out how to give every client access to that snapshot view.
What can work is a cron or scheduled job that runs lshistory and does something when it finds new checkins.
Yes, you could do this via a trigger, but I'd suggest a combination of a trigger and an additional script, since updating the snapshot view might be time-consuming and affect checkins.
Create a simple trigger that fires when the files you are concerned about are changed on a stream.
The trigger script should touch/create a file in some well-known network location (or perhaps write to a pipe).
The other script could be a cron (Unix) or AT (Windows) job that runs continually or each minute and, if the well-known file is there, performs the update of the snapshot view.
That script could also read the pipe written to by the trigger, if you go that route.
This is better than a cron job that has to do an lshistory each time. Martina was right to suggest not doing the whole thing in a trigger, for performance and for snapshot view accessibility for all clients, but a trigger that writes to a pipe or touches an empty file is cheap, and the cron/AT job that actually does the update is also efficient, since it does not have to query the VOB each minute, just the file (or only act when there is info on the pipe).
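A rough sketch of the cron/AT side, assuming the trigger only touches a marker file on a shared location; the paths, view location, grep pattern, and Jenkins URL are placeholders:

    #!/bin/sh
    # poll_and_build.sh -- run from cron each minute. The ClearCase trigger only
    # touches the marker file; this script does the heavier work: update the
    # snapshot view and, if anything new was loaded, start the Jenkins job.
    MARKER=/net/ccshare/triggers/stream_changed
    VIEW_PATH=/views/build_snapshot
    UPDATE_LOG=/tmp/cc_update.log

    [ -f "$MARKER" ] || exit 0      # the trigger has not fired since last run
    rm -f "$MARKER"

    cleartool update -log "$UPDATE_LOG" "$VIEW_PATH"

    # Adjust the pattern to whatever your update log prints for new versions
    if grep -qE 'Loading|Updating' "$UPDATE_LOG"; then
        curl -fsS -X POST --user "ci-bot:API_TOKEN" \
            "https://jenkins.example.com/job/clearcase-build/build"
    fi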
