How could code know it's running on Google Dataflow? - google-cloud-dataflow

I'd like to use some configs for a library that's used both on Dataflow and in a normal environment.
Is there a way for the code to check it's running on Dataflow? I couldn't see an environment variable, for example.
Quasi-follow-up to Google Dataflow non-python dependencies - separate setup.py?

One option is to use PipelineOptions, which contains the pipeline runner information. As mentioned in the beam documentation: "When you run the pipeline on a runner of your choice, a copy of the PipelineOptions will be available to your code. For example, you can read PipelineOptions from a DoFn’s Context."
More about PipelineOptions: https://beam.apache.org/documentation/programming-guide/#configuring-pipeline-options

This is not a good answer, but it may be the best we can do at the moment:
if 'harness' in os.environ.get('HOSTNAME', ''):

Related

Specifying --diskSizeGb when running a dataflow template

I'm trying to use a Google dataflow template to export data from Bigtable to Google Cloud Storage (GCS). I'm following the gcloud command details here. However, when running I get a warning and associated error where the suggested fix is to add workers (--numWorkers), increase the attached disk size (--diskSizeGb). However, I see no way to execute the Google provided template while passing those parameters. Amy I missing something?
Reviewing a separate question, it seems like there is a way to do this. Can someone explain how?
parameters like numWorkers and diskSizeGb are Dataflow wide pipeline options. You should be able to specify them like so
gcloud dataflow jobs run JOB_NAME \
--gcs-location LOCATION --num-workers=$NUM_WORKERS --diskSizeGb=$DISK_SIZE
Let me know if you have furthr questions

Jenkins pipeline from YAML file

Jenkins declarative pipeline is too powerful for us, often users can abuse it. We are thinking to use an opinionated YAML to describe CI/CD pipeline. And it seems there are two choices.
Write a plugin and consume YAML and dynamically create stage / steps.
Write a plugin to convert a YAML to Jenkins pipeline.
I am not expert on Jenkins, so I hope some expert can give some guidance and maybe an example.
using official plugin pipeline-as-yaml, but it has a fixed grammar.
using or customization wolox-ci
create your own shared libaray. However, they are easy from beginning but full grammer design is required when used widely. Here is a psudo code based on curry.
// create a file named yamlCompiler.groovy in shared library,
def call(str){
def rawMap = readYaml(text: str)
// consume yaml and get a lambda function
return {
stage{
steps.each{it ->
it."$type"(it)
}
}
}
}
Use yamlCompiler in your jenkinsfile code block.
#Library('your libs name')
def str =
'''
steps:
- type: sh
script: ls -la
- type: echo
message: xxx
'''
Closure closure = yamlCompiler(str)
closure.call()
I'm looking for a similar solution. We run hardened predefined pipelines for every project, but still want to allow dev teams to customise certain steps within the process —without allowing them the full power of a Jenkinsfile.
I'm also exploring the possibility of an —in your words— "opinionated YAML".
I've so far only found one example of such an implementation: Wolox-CI supports their own pre-defined build steps via YAML. You'll be able to see the steps they support here.
I'm thinking of parsing the YAML using Snake YAML. Here's an SO answer with an example on how to do it.
Two solutions:
create a shared library to abstract the actual pipeline and provide to your users some guidance on how to setup a shared library and a Jenkinsfile sample. Here is an example of embeded pipeline https://github.com/SAP/jenkins-library/blob/master/vars/piperPipeline.groovy
use another tool like https://drone.io/
If you're not an expert and don't want/have the time to become one, the second solution might be the best one.
Really? Is the only difference here when the plugin is executed?:
Write a plugin and consume YAML and dynamically create stage / steps.
Write a plugin to convert a YAML to Jenkins pipeline.
Forgive me, because I may be a little hardened, but abstracting a layer for the dynamic creation of a declarative, or scripted, Jenkinsfile written in the simple groovy lang syntax so that it can be pretty-printed in yml prevents users from updating your yml exactly how? It seems to me your abstraction only adds to the complexity with which you wish to implement usability.
One, all the current yml plugins for Jenkins do exactly that. Two, they don't actually have the full breadth of "features" (yes, I'm using that term loosely here) accessible by implementing the groovy/(java) classes already available in the Jenkins domain (referencing the DSL). Two solutions exist right now for this, and I've investigated both, and implemented both, extensively. One is wolox-ci, which is the better of the two, and the other is Pipeline-as-YAML. In my opinion, it's easy to use, but both lack the full breadth of implementation features simply using groovy provides. So why force it? Simply so your users can have a pretty-printed yml file, and not have to be concerned with simple syntax, which you claim hardens your infrastructure-as-code backend so that the same users can't screw it up? Sorry, I'm calling bull pucky on that assertion. What's to stop anyone from totally screwing up your builds by pushing a change to the yml file which breaks the integration with groovy, or worse, completely changes an algorithm you worked hard to customize?
Sorry, I just don't get it. Sure, making something more human readable is always a good thing. Doing it because of the reasons you've stipulated makes no sense, though. Also, unless you have a super simple defined algorithm in your CI/CD process, without any non-continuous-passing-style transform methods being implemented, then using the current iterations of the yml-as-Jenkinsfile-templates plugins is probably not the way you want to go.
Now, you could write your own plugin to do this, but what's the technical debt on that, versus just learning the groovy syntax? Also, it still doesn't prevent users from making code changes to your build infrastructure, then integrating those changes in a simple yml file.

In Jenkins what's the differences between Remote access API and Pipeline REST API

In Jenkins, we want to get the Pipeline stages information through API, e.g. a stage is success or fail. From another answer it seems we can achieve it through Pipeline REST API Plugin.
My question is:
Can Jenkinsapi and Python-Jenkins achieve the same thing? It seems they're designed for bare metal Jenkins, instead of the Pipeline plugin, right ? If that's the case, do we have similar Python library for Pipeline plugin?
Thanks!
You have all the info to answer your own question in the documentation you linked. Just to put it together:
Jenkins has a REST API
The pipeline REST API plugin injects new endpoints in the Jenkins REST API to give you access to several pipeline info
Python-Jenkins and jenkinsapi you mentioned above are both wrappers around the Jenkins REST API to help you develop scripts/apps targeting the Jenkins API in Python. Since those modules/libs are based most probably based on the core API specification, they most probably don't provide specific methods to target the pipeline endpoints (but you can probably extend that quite easily).
If you want to stay in Jenkinsapi, get_data() defined in JenkinsBase class could be used for querying Pipeline REST API directly. Not very sure if it's recommended or not.
Following codes works for me.
from jenkinsapi.jenkins import Jenkins
import requests
requests.packages.urllib3.disable_warnings()
# GET /job/:job-name/:run-id/wfapi/describe
url = 'https://localhost/job/anyjob/10/wfapi/describe'
jenkins = Jenkins(
'https://localhost',
username='username',
password='api_token',
ssl_verify=False,
lazy=True)
print(jenkins.get_data(url))

Is there a way to track usage of a global shared library in Jenkins?

Context:
At my work most developers are free to write their own Jenkinsfile for their own team's projects.
As the Jenkins admin, I provide developers with a global shared library.
Most projects are using either v1 or v2 or v3 or another version of this library, using the idiom library("theSharedLib#v#").
Question: Is there a way for me to find out which Jenkinsfile is using which version of the shared library without having to actually lookup into all those Jenkinsfile files (50+ files in as much git repos)?
What I would see best is some mechanism that write up (into a file on the Jenkins master or in a DB) which project/Jenkinsfile is using which version at the time the library is loaded.
A possible solution would be to add some code to every function inside the library that will actually do this reporting. I could then see which function is used by who. Any better solution?
I wrote https://github.com/CiscoDevNet/es-logger to gather information such as this from Jenkins. It has a plugin that will run a regex against the console log of a completed job and can then post events to elastic search.
Jenkins helpfully posts library loads at the start of the log such as:
Loading library sharedLib#version
So a simple regex like
"^Loading library\S+(?P<library_name>.*?)#(?P<library_version>.*?)\S+$"
added to the console_log_events plugin would generate events in an elastic search for each usage and each version.

How are WorkerHarnessThreads managed in Cloud Dataflow?

Is the option numberOfWorkerHarnessThreads used by cloud-dataflow runner now?
Earlier the PipelineOptions property numberOfWorkerHarnessThreads was specified in the doc and was displayed in Dataflow Job Monitoring UI under Pipeline options. Both are missing now.
If this is not used, how are the worker threads managed now?
The option is still there. You can find it in DataflowPipelineDebugOptions.

Resources