Amazon SWF vs Data Pipeline for Pig Scripts

I have a daily data-import Pig script that I want to run on Amazon EMR. Should I use Simple Workflow or Data Pipeline to schedule and monitor the job? I tried going through Data Pipeline, but it seems to require an output. What goes into this output if I'm running a custom Pig script? Are they expecting you to use a default, pre-made Pig script for data-import jobs?
In my case, the Pig script fetches input from S3 and does some data transformations that push the results to DynamoDB. Trying to schedule this script in Data Pipeline, I see there is a PigActivity type and an S3-to-DynamoDB template, but I'm not sure how to customize it so that it runs my Pig script and transforms the data before it goes to DynamoDB. Where is the S3 and DynamoDB mapping set in this process? Is it redundant, since the Pig script imports from S3 and exports to DynamoDB all by itself?

Simple Workflow is useful for managing workflows. In very simple terms, it is a queue with a lot of features on top, such as history tracking, signalling, and timers.
Data Pipeline is useful when you need an ETL-style engine. It lets you schedule tasks on a periodic basis, handle dependencies between tasks, and retry on failure. It also takes care of resources for you, such as launching and shutting down EMR/EC2.
You could build all of this on top of SWF by writing your own state machine, but IMO it's better to use Data Pipeline for your use case.
To run a custom Pig job with Data Pipeline, you should be able to disable staging:
stage = false
'stage' is an optional field on PigActivity. With staging disabled, the activity does not stage data for you, so your Pig script handles its own S3 input and DynamoDB output and no separate data-node mapping is needed.
{
  "name": "DefaultActivity",
  "id": "ActivityId_1",
  "type": "PigActivity",
  "stage": "false",
  "scriptUri": "s3://bucket/query",
  "scriptVariable": [
    "param1=value1",
    "param2=value2"
  ],
  "schedule": {
    "ref": "ScheduleId_1"
  },
  "runsOn": {
    "ref": "EmrClusterId_1"
  }
}
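If you manage the pipeline definition programmatically, the same object can be pushed with the AWS SDK. A minimal sketch using boto3, assuming a pipeline already created as df-EXAMPLE and that the ScheduleId_1 and EmrClusterId_1 objects are defined elsewhere in the same definition:
# Sketch: upload a PigActivity with staging disabled via boto3.
# The pipeline id "df-EXAMPLE" and the referenced schedule/cluster objects are assumptions.
import boto3

client = boto3.client('datapipeline')

client.put_pipeline_definition(
    pipelineId='df-EXAMPLE',
    pipelineObjects=[
        {
            'id': 'ActivityId_1',
            'name': 'DefaultActivity',
            'fields': [
                {'key': 'type', 'stringValue': 'PigActivity'},
                {'key': 'stage', 'stringValue': 'false'},
                {'key': 'scriptUri', 'stringValue': 's3://bucket/query'},
                {'key': 'scriptVariable', 'stringValue': 'param1=value1'},
                {'key': 'scriptVariable', 'stringValue': 'param2=value2'},
                {'key': 'schedule', 'refValue': 'ScheduleId_1'},
                {'key': 'runsOn', 'refValue': 'EmrClusterId_1'},
            ],
        },
        # ... the ScheduleId_1 and EmrClusterId_1 objects go here ...
    ],
)
client.activate_pipeline(pipelineId='df-EXAMPLE')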

Related

Automated ansible output parsing in jenkins pipeline

How do you automatically parse Ansible warnings and errors in your Jenkins pipeline jobs?
I greatly enjoy the power of leveraging Ansible in Jenkins when it works. Upon a failure, the hunt to locate the actual error can be challenging.
I use WarningsNG, which supports custom parsers (and allows their programmatic generation).
Do you know of any plugins or add-ons that already transform these logs into the kind of charts WarningsNG produces?
I figured I'd ask as I go off into deep regex land and make my own.
One good way to achieve this seems to be the following:
select an existing structured-output Ansible callback plugin (json, junit, and yaml are all viable). I selected junit, as I can play with the format to get a really nice view into the playbook, with errors reported in a very obvious way.
fork that GPL file (yes, so be careful with that license) and augment it with the following:
store the output as a file (a minimal sketch follows the snippet below)
implement the missing callback methods (the three mentioned above do not implement the v2...item callbacks)
forward events to the default or debug callback to ensure operators see something when they execute the plan
add a secrets cleaner - if you use the Jenkins credentials-binding-plugin, it will hide secrets from the console, but it will not hide secrets within stored files. You'll need to handle that in your playbook or via some Groovy code (if Groovy, try {...} finally { clean } seems a good pattern)
Snippet - forwarding to the default callback
from ansible.plugins.callback import CallbackBase
from ansible.plugins.callback.default import CallbackModule as CallbackModule_default
...

class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'stdout'
    CALLBACK_NAME = 'json'

    def __init__(self, display=None):
        super(CallbackModule, self).__init__(display)
        self.default_callback = CallbackModule_default()

    ...

    def v2_on_file_diff(self, result):
        self.default_callback.v2_on_file_diff(result)
        # ... do whatever you'd want to ensure the content appears in the json file
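For the "store the output as a file" step, a minimal sketch of one way to do it (the output path, the dict layout, and the CALLBACK_NAME are assumptions for illustration, not the real json/junit plugin format):
# Minimal sketch: collect per-host results and dump them to a JSON file at the end of the play.
import json

from ansible.plugins.callback import CallbackBase


class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'stdout'
    CALLBACK_NAME = 'store_to_file'  # hypothetical name

    def __init__(self, display=None):
        super(CallbackModule, self).__init__(display)
        self.results = []

    def v2_runner_on_ok(self, result):
        # result._host and result._result are the standard callback attributes
        self.results.append({'host': result._host.get_name(),
                             'status': 'ok',
                             'result': result._result})

    def v2_runner_on_failed(self, result, ignore_errors=False):
        self.results.append({'host': result._host.get_name(),
                             'status': 'failed',
                             'result': result._result})

    def v2_playbook_on_stats(self, stats):
        # Called once at the end of the play: persist everything collected so far.
        with open('ansible_run.json', 'w') as fh:
            json.dump(self.results, fh, indent=2, default=str)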

Jenkins Pipeline: How to allocate multiple nodes for a single test - distributed test orchestration

This is not a normal "Run this job on many slaves" question!
Cookie-cutter answers will not do.
Use Case
I am attempting to use the Jenkins Pipeline to instrument a test for distributed software that involves one "client" and many "servers" by allocating Jenkins slaves for those roles, running the components on the slaves and then tearing it all down. We can pretend that "servers" will run a web server and "client" runs "wget" against them.
Considerations
I'm using the scripted pipeline (not declarative). Essentially I need all the "servers" to be up when I run the client logic on the "client" node.
Obviously sequential node{} blocks won't work because I need all slaves to be up concurrently. Parallel may work, and I am open to this option, but it seems that it will be hard to debug.
My solution
So here is what I've come up with so far. This is a simplified example; there might be logic after each node is allocated (to set up that server) and near the end of each node closure (to clean up that server), or it can all be done by the client; it doesn't really matter.
def allocatedServerList = []
// Allocate 3 "servers" and then 1 client. Keep servers allocated.
node {
    allocatedServerList << env.NODE_NAME
    node {
        allocatedServerList << env.NODE_NAME
        node {
            allocatedServerList << env.NODE_NAME
            node {
                // this is the client
                sh "run some client work against ${allocatedServerList}"
                // e.g. ssh to each server, start some service, pound it for a while, shut them down
            }
        }
    }
}
Surprisingly, this works fine.
Can anyone suggest a better approach? The downside of the nested code is that you can't change the number of nodes easily (without recursive methods, which make it unreadable).
To run on all nodes, you can use something like this:
The example shows how to trigger jobs on all Jenkins nodes from a Pipeline.
Summary:
* The script uses the NodeLabel Parameter plugin to pass the job name to the payload job.
* Node list retrieval is performed using the Jenkins API, so it will require script approvals in Sandbox mode.
To see this example on jenkins.io, please visit: Trigger Job On All Nodes
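If the sandbox restrictions get in the way, the same idea can also be driven from outside the Pipeline against the Jenkins REST API. A rough Python sketch, assuming a payload job named payload-job that defines a NodeLabel parameter called NODE, and an API token for authentication (all of these names are placeholders):
# Rough sketch: list all Jenkins nodes and trigger a parameterized payload job on each one.
# JENKINS_URL, the "payload-job" job and its "NODE" parameter are assumptions.
import requests

JENKINS_URL = 'http://localhost:8080'
AUTH = ('user', 'api-token')

nodes = requests.get(f'{JENKINS_URL}/computer/api/json', auth=AUTH).json()['computer']
for node in nodes:
    if node.get('offline'):
        continue  # skip agents that are currently unavailable
    requests.post(f'{JENKINS_URL}/job/payload-job/buildWithParameters',
                  params={'NODE': node['displayName']},
                  auth=AUTH)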

Write key/value data through Jenkins API

I already use the Jenkins API for some tasks in my build pipeline. Now there is a task where I want to persist some simple dynamic data, say "50.24", for each build, and then be able to retrieve this data in a different job.
More concretely, I am looking for something along these lines:
POST to http://localhost:8080/job/myjob//api/json/store
{"code-coverage":"50.24"}
Then in a different job
GET
http://localhost:8080/job/myjob//api/json?code-coverage
One idea is to use archiveArtifacts to save it into a file and then read it back using the API/file. But I am wondering if there is a plugin or a simpler way to write some data for a job.
If you need to send a variable from one build to another:
A parameterized build is the easiest way to do this:
https://wiki.jenkins.io/display/JENKINS/Parameterized+Build
the URL would look like:
http://server/job/myjob/buildWithParameters?PARAMETER=Value
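For the question's concrete example, triggering with a value could look like this (a sketch using Python and the requests library; the CODE_COVERAGE parameter name, job name, and API token are assumptions, and the downstream job must declare that parameter):
# Sketch: push a simple value ("code coverage") into another job as a build parameter.
# The CODE_COVERAGE parameter name and the credentials are assumptions.
import requests

requests.post('http://localhost:8080/job/myjob/buildWithParameters',
              params={'CODE_COVERAGE': '50.24'},
              auth=('user', 'api-token'))
Inside the triggered job, the value is then available as params.CODE_COVERAGE (Pipeline) or as the CODE_COVERAGE environment variable.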
If you need to share complex data, you can save files in your workspace and use them (send the absolute path) from another build.
If you need to re-use a simple variable computed during your build, I would go for an environment variable, updated during your flow:
Jenkinsfile (Declarative Pipeline)
pipeline {
    agent any
    environment {
        DISABLE_AUTH = 'true'
        DB_ENGINE = 'sqlite'
    }
    stages {
        stage('Build') {
            steps {
                sh 'printenv'
            }
        }
    }
}
All the details there:
https://jenkins.io/doc/pipeline/tour/environment/
If you need to re-use complex data between two builds
There are two cases, depending on whether your builds share the same workspace or not.
In the same workspace, it's totally fine to write your data to a text file that is re-used later by another job.
The archiveArtifacts step is convenient if your use case is about extracting test results from logs and re-using them later. Otherwise you will have to write the process yourself.
If your second job uses another workspace, you will need to provide the absolute path to your child job in order for it to copy and process the file.

Perform action after Dataflow pipeline has processed all data

Is it possible to perform an action once a batch Dataflow job has finished processing all data? Specifically, I'd like to move the text file that the pipeline just processed to a different GCS bucket. I'm not sure where to place that in my pipeline to ensure it executes once after the data processing has completed.
I don't see why you need to do this after the pipeline has executed. You could use side outputs to write the file to multiple buckets and save yourself the copy after the pipeline finishes.
If that's not going to work for you (for whatever reason), then you can simply run your pipeline in blocking execution mode, i.e. use pipeline.run().waitUntilFinish(), and then just write the rest of your code (which does the copy) after that.
[..]
// do some stuff before the pipeline runs
Pipeline pipeline = ...
pipeline.run().waitUntilFinish();
// do something after the pipeline finishes here
[..]
A little trick I picked up from reading the source code of Apache Beam's PassThroughThenCleanup.java.
Right after your reader, create a side input that 'combines' the entire collection (in the source code, it is the View.asIterable() PTransform) and connect its output to a DoFn. This DoFn will be called only after the reader has finished reading ALL elements.
P.S. The code literally names the operation cleanupSignalView, which I found really clever.
Note that you can achieve the same effect using Combine.globally() (Java) or beam.CombineGlobally() (Python). For more info, check out section 4.2.4.3 here.
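As a rough illustration of that pattern with the Beam Python SDK (the move_input_file() helper is hypothetical; the point is that the cleanup step only fires once the signal side input, and therefore the read, is complete):
# Sketch of the "cleanup signal" pattern with the Beam Python SDK.
# move_input_file() is a hypothetical helper (e.g. a GCS copy followed by a delete).
import apache_beam as beam


def move_input_file(_signal):
    pass  # e.g. copy gs://my-bucket/input.txt elsewhere, then delete the original


with beam.Pipeline() as p:
    lines = p | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input.txt')

    # ... normal processing of `lines` goes here ...

    # Collapse everything into a single value; it only becomes available once the read has finished.
    done_signal = (lines
                   | 'DropContent' >> beam.Map(lambda _: 1)
                   | 'Combine' >> beam.CombineGlobally(sum))

    # This step waits on the side input, so it runs only after all elements were read.
    _ = (p
         | 'Once' >> beam.Create([None])
         | 'Cleanup' >> beam.Map(lambda _, signal: move_input_file(signal),
                                 signal=beam.pvalue.AsSingleton(done_signal)))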
I think two options can help you here:
1) Use TextIO to write to the bucket or folder you want, specifying the exact GCS path (e.g. gs://sandbox/other-bucket)
2) Use Object Change Notifications in combination with Cloud Functions. You can find a good primer on doing this here and the SDK for GCS in JS here. With this option you basically set up a trigger for when something drops into a certain bucket, and move it to another one using your self-written Cloud Function.

Jenkins: What is a good way to store a variable between two job runs?

I have a time-triggered job which needs to retrieve certain values stored in a previous run of this job.
Is there a way to store values between job runs in the Jenkins environment?
E.g., I can write something like the following in a shell-script action:
XXX=`cat /hardcoded/path/xxx`
#job itself
echo NEW_XXX > /hardcoded/path/xxx
But is there a more reliable approach?
A few options:
Store the data in the workspace. If the data isn't critical (i.e. it's OK to nuke it when the workspace is nuked) that should be fine. I only use this to cache expensive-to-compute data such as prebuilt library dependencies.
Store the data in some fixed location in the filesystem. You'll make Jenkins less self-contained and thus make migrations and backups more complex - but probably not by much, especially if you store the data in some custom subdirectory of the Jenkins home. Parallel builds will also be tricky, and distributed builds likely impossible. Jenkins has a userContent subdirectory you could use for this - that way the file is at least part of the Jenkins install and thus more easily migrated or backed up. I do this for the (rather large) code-coverage trend files for my builds.
Store the data on a different machine (e.g. a database). This is more complicated to set up, but you're less dependent on the local machine's details, and it's probably easier to get distributed and parallel builds working. I've done this to maintain a live changelog.
Store the data as a build artifact. This means looking at a previous build's artifacts (see the sketch after this list). It's safe and repeatable, and because URIs are used to access such artifacts, it's OK for distributed builds too. However, you need to deal with failed builds (should you look back several versions? start from scratch?) and you'll be storing many copies, which is just fine if it's 1KB but less fine if it's 1GB. Another downside is that you'll probably need to open up Jenkins' security settings quite far to allow anonymous access to artifacts (since you're just downloading from a URI).
The appropriate solution will depend on your situation.
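For the build-artifact option, reading the value back from the previous successful run can be as simple as this (a sketch; the job name, the artifact file name xxx.txt, and the credentials are assumptions):
# Sketch: fetch a value that a previous run archived as an artifact (xxx.txt).
# The job name, file name and credentials are assumptions.
import requests

url = 'http://localhost:8080/job/myjob/lastSuccessfulBuild/artifact/xxx.txt'
value = requests.get(url, auth=('user', 'api-token')).text.strip()
print(value)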
I would pass the variable from the first job to the second as a parameter in a parameterized build. See this question for more info on how to trigger a parameterized build from another build.
If you are using Pipelines and your variable is of a simple type, you can use a parameter to store it between runs of the same job.
Using the properties step, you can configure parameters and their default values from within the pipeline. Once configured you can read them at the start of each run and save them (as default value) at the end. In the declarative pipeline it could look something like this:
pipeline {
    agent none
    options {
        skipDefaultCheckout true
    }
    stages {
        stage('Read Variable') {
            steps {
                script {
                    try {
                        variable = params.YOUR_VARIABLE
                    }
                    catch (Exception e) {
                        echo("Could not read variable from parameters, assuming this is the first run of the pipeline. Exception: ${e}")
                        variable = ""
                    }
                }
            }
        }
        stage('Save Variable for next run') {
            steps {
                script {
                    properties([
                        parameters([
                            string(defaultValue: "${variable}", description: 'Variable description', name: 'YOUR_VARIABLE', trim: true)
                        ])
                    ])
                }
            }
        }
    }
}
