Parallel DataFlow Pipeline from Single Google Cloud DataFlow Job - google-cloud-dataflow

I am trying to run two separate pipelines from one Dataflow job, similar to the question below:
Parallel pipeline inside one Dataflow Job
If we run two separate pipelines in a single Dataflow job with a single p.run(), like below:
(
    p | 'Do one thing' >> beam.Create(List1)
)
(
    p | 'Do second thing' >> beam.Create(List2)
)

result = p.run()
result.wait_until_finish()
I think this will launch two separate pipelines in a single Dataflow job, but will it create two bundles, and will they run on two different workers?

You are correct: these two sections of the graph will run under the same Dataflow job.
Dataflow parallelizes execution of work items dynamically. This means that depending on your pipeline, the amount of work it has, etc., Dataflow may decide to schedule the work on the same worker or on separate workers.
Normally, if there is enough work in both parallel branches (roughly 20 seconds of work on each), then yes: they will run in parallel.
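For reference, here is a minimal runnable sketch of the two-branch layout discussed above. The list contents and the logging steps are placeholders added for illustration; only the branch structure comes from the question:

import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder inputs; these stand in for List1 and List2 from the question.
list1 = ['a', 'b', 'c']
list2 = [1, 2, 3]

options = PipelineOptions()  # pass --runner=DataflowRunner etc. as needed
with beam.Pipeline(options=options) as p:
    # Two independent branches hanging off the same pipeline object.
    # Both are submitted as part of the single job created when the
    # `with` block exits (equivalent to calling p.run()).
    (p | 'Do one thing' >> beam.Create(list1)
       | 'Log one' >> beam.Map(lambda x: logging.info('branch 1: %s', x)))
    (p | 'Do second thing' >> beam.Create(list2)
       | 'Log two' >> beam.Map(lambda x: logging.info('branch 2: %s', x)))

Whether the two branches end up on one worker or several is up to Dataflow's dynamic work scheduling, as described above.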

Related

Jenkins from master running multiple downstream jobs

Use case: I want to run the regression suite on multiple images (the image selection is variable; it can be 4 or 5 images).
I want to create one master job that takes the image names as input, and this master job passes the image names one by one to a downstream regression job. The number of images can vary.
Master job
INPUT image: a, b, c ...
                  |
                  |
    ---------------------------------------------
    |                   |                   |
Regression job     Regression job      Regression job
Input image: a     Input image: b      Input image: c
Can anyone tell me how I can perform this task in Jenkins?
To solve this, I have used the Pipeline and Active Choices parameter plugins.
Here is the configuration (shown as a screenshot in the original post).
Here is the problem: I am getting the ThunderImage list as [p,1,p,2,p,3] instead of ['p1','p2','p3'].
So, you want to select which regression job starts depending on the selected input from your master job? For this you could use the Post build task plugin and use its regex functionality to check for the input parameter inside the build log.
If you are using a Pipeline job, you can use the solution from Christopher Orr in: Jenkins trigger build dependent on build parameters

Prevent fusion in Apache Beam / Dataflow streaming (python) pipelines to remove pipeline bottleneck

We are currently working on a streaming pipeline on Apache Beam with the DataflowRunner. We read messages from Pub/Sub, do some processing on them, and afterwards window them in sliding windows (currently the window size is 3 seconds and the interval is 3 seconds as well). Once the window is fired we do some post-processing on the elements inside the window. This post-processing step is significantly larger than the window size; it takes about 15 seconds.
The Apache Beam code of the pipeline:
input = (pipeline | beam.io.ReadFromPubSub(subscription=<subscription_path>)
                  | beam.Map(process_fn))
windows = input | beam.WindowInto(beam.window.SlidingWindows(3, 3),
                                  trigger=AfterCount(30),
                                  accumulation_mode=AccumulationMode.DISCARDING)
group = windows | beam.GroupByKey()
group | beam.Map(post_processing_fn)
As you know, Dataflow tries to perform some optimizations on your pipeline steps. In our case it fuses everything together from the windowing onwards (clustered operations: 1/ processing, 2/ windowing + post-processing), which causes slow, sequential post-processing of all the windows by just one worker. We see logs every 15 seconds showing that the pipeline is processing the next window. However, we would like multiple workers to pick up separate windows instead of the whole workload going to a single worker.
Therefore we were looking for ways to prevent this fusion from happening, so that Dataflow separates the windowing from the post-processing of the windows. That way we would expect Dataflow to be able to assign multiple workers again to the post-processing of fired windows.
What we have tried so far:
Increase the number of workers to 20, 30 or even 40, but without effect. Only the steps before the windowing get assigned to multiple workers.
Run the pipeline for 5 or 10 minutes, but we noticed no worker re-allocation to help with this larger post-processing step after the windowing.
After the windowing, put the elements back into a global window.
Simulate another GroupByKey with a dummy key (as mentioned in https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#preventing-fusion) but without any success.
The last two actions indeed created a third clustered operation (1/ processing, 2/ windowing, 3/ post-processing), but we noticed that the same worker is still executing everything after the windowing.
Is there any solution that can resolve this problem?
The current workaround we are considering is to build another streaming pipeline which receives the windows, so that its workers can process the windows in parallel, but it is cumbersome.
You have done the right thing to try to break fusion in your pipeline. I suspect there may be a different issue getting you into trouble.
For streaming, a single key always gets processed in the same worker. By any chance, are all or most of your records assigned to a single key? If so, your processing will be done in a single worker.
Something that you can do to prevent this is to make the window a part of the key, so that the elements for multiple windows can be processed in different workers even though they have the same key:
class KeyIntoKeyPlusWindow(beam.DoFn):
    def process(self, element, window=beam.DoFn.WindowParam):
        key, values = element
        # Make the window part of the key so that different windows for the
        # same key can be processed by different workers.
        yield ((key, window), values)

group = (windows
         | beam.ParDo(KeyIntoKeyPlusWindow())
         | beam.GroupByKey())
And once you've done that, you can apply your post-processing:
group | beam.Map(post_processing_fn)
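Not part of the original answer, but worth noting as another option: Beam also ships a built-in beam.Reshuffle() transform that is commonly used to prevent fusion. A minimal sketch, assuming the windows collection and post_processing_fn from the question:

import apache_beam as beam

# Sketch only: `windows` and `post_processing_fn` are the names used in the
# question above and are not defined here.
group = (windows
         | beam.GroupByKey()
         | 'Break fusion' >> beam.Reshuffle()  # materializes and redistributes elements
         | beam.Map(post_processing_fn))

Keep in mind that this only breaks fusion; if most of your elements share a single key, the per-key processing will still land on one worker, which is what the re-keying by (key, window) above addresses.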

Iterative processing in Dataflow

As shown here, Dataflow pipelines are represented by a fixed DAG. I'm wondering if it's possible to implement a pipeline where the processing proceeds until a dynamically evaluated condition is satisfied, based on the data computed so far.
Here's some pseudo code to illustrate what I'd like to implement:
PCollection pco = null
while(true):
    pco = pco.apply(someTransform())
    if (conditionSatisfied(pco)):
        break
pco.Write()
It seems like you really want iterative computations. Right now Dataflow does not provide support for that, but we are aware that it is a very important use case and we are working on finding the right set of APIs to express it.
For now your workarounds are:
Iteratively run whole pipelines (run pipeline, inspect output, run again if the condition is not satisfied, etc). This has the obvious downside of pipeline setup and teardown overhead.
Build a pipeline with a hard-coded number of iterations by .apply()'ing in a loop unconditionally, then run the whole pipeline (see the sketch after this list).
A combination of the two, e.g. run fixed 5-iteration pipelines until you're satisfied with the result.
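To illustrate the hard-coded-iterations workaround, here is a minimal Beam Python sketch that unrolls a fixed number of iterations into the graph. The transform, input values, and output path are placeholders, not part of the original question:

import apache_beam as beam


def some_transform(x):
    # Placeholder for the per-iteration processing step.
    return x + 1


with beam.Pipeline() as p:
    pcoll = p | beam.Create([0, 1, 2])
    # Unroll a fixed number of iterations into the DAG; the graph itself
    # cannot branch on data, so the iteration count must be chosen up front.
    for i in range(5):
        pcoll = pcoll | f'iteration {i}' >> beam.Map(some_transform)
    pcoll | beam.io.WriteToText('/tmp/iterative_output')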

Your advice on a Hadoop MapReduce job

I have 2 files stored on a HDFS filesystem:
tbl_userlog: <website url (non canonical)> <tab> <username> <tab> <timestamp>
example: www.website.com, foobar87, 201101251456
tbl_websites: <website url (canonical)> <tab> <total hits>
example: website.com, 25889
I have written a sequence of Hadoop jobs which joins the 2 files on the website, filters on the amount of total hits > n per website, and then counts for each user the number of websites he has visited which have > n total hits. The details of the sequence are as follows:
A Map-only job which canonicalizes the URL in tbl_userlog (i.e. removes www, http:// and https:// from the url field)
A Map-only job which sorts tbl_websites on the URL
An identity Map-Reduce job which takes the output of the 2 previous jobs as KeyValueTextInput and feeds them to a CompositeInputFormat in order to make use of Hadoop's native joining feature defined with jobConf.set("mapred.join.expr", CompositeInputFormat.compose("inner" (...))
A Map and Reduce job which filters the result of the previous job on total hits > n in its Map phase, groups the results on the user in the shuffle phase, and performs the count of the number of websites for each user in the Reduce phase.
In order to chain these steps, I just call the jobs sequentially in the described order. Each individual job outputs its results into HDFS which the following job in the chain then retrieves and processes in turn.
As I am new to Hadoop, I would like to ask for your advice:
Is there a better way to chain these jobs? In this configuration all intermediate results are written to HDFS and then read back.
Do you see any design flaw in this job, or could it be written more elegantly by making use of some Hadoop feature that I have missed?
I am using Apache Hadoop 0.20.2 and using higher-level frameworks such as Pig or Hive is not possible in the scope of the project.
Thanks in advance for your replies!
I think what you have will work, with a couple of caveats. Before I start listing them, I want to make two definitions clear. A map-only job is a job that has a defined Mapper and runs with 0 reducers. If the job is running with > 0 IdentityReducers, then the job is not a map-only job. A reduce-only job is a job that has a defined Reducer and runs with an IdentityMapper.
Your first job can be a map-only job, since all you're doing is canonicalizing URLs. But if you want to use CompositeInputFormat, you should run with an IdentityReducer with more than 0 reducers.
For your second job, I don't know what you mean by a map-only job that sorts. Sorting is by its very nature a reduce-side task. You probably mean that it has a defined Mapper but no Reducer. But in order for the URLs to be sorted, you should run with an IdentityReducer with more than 0 reducers.
Your third job is an interesting idea, but you have to be careful with CompositeInputFormat. There are two conditions that must be met for you to be able to use this input format. The first is that there has to be the same number of files in both input directories. This can be achieved by setting the same number of reducers for Job1 and Job2. The second condition is that the input files CANNOT be splittable. This can be achieved by using a non-splittable compression such as bzip.
Your fourth job sounds good, although you could filter out websites that have < n hits in the reducer of the previous job and save yourself some I/O.
There's obviously more than one solution to a problem in software, so while your solution would work, I wouldn't recommend it. Having 4 MapReduce jobs for this task is a bit expensive IMHO. The implementation I have in mind is an M-R-R workflow that uses secondary sort.
As far as chaining jobs is concerned, you should have a look at Oozie, which is a workflow manager. I have yet to use it, but that's where I'd start.

Jenkins (Hudson) - Managing dependencies between parallel builds

Using Jenkins or Hudson I would like to create a pipeline of builds with fork and join points, for example:
        job A
       /     \
   job B    job C
     |        |
   job D      |
       \     /
        job E
I would like to create arbitrary series-parallel graphs like this and leave Jenkins the scheduling freedom to execute B/D and C in parallel whenever a slave is available.
The Join Plugin immediately joins after B has executed. The Build Pipeline Plugin does not support fork/join points. I am not sure if this is possible with the Throttle Concurrent Builds Plugin (or the deprecated Locks & Latches Plugin); if it is, I could not figure out how. One solution could be to specify build dependencies with Apache Ivy and use the Ivy Plugin. However, my jobs are all Makefile C/C++/shell script jobs, and I have no experience with Ivy to verify if this is possible.
What is the best way to specify parallel jobs and their dependencies in Jenkins?
There is a Build Flow plugin that meets this very need. It defines a DSL for specifying parallel jobs. Your example might be written like this:
build("job A")
parallel (
{
build("job B")
build("job D")
},
{
build("job C")
}
)
build("job E")
I just found it and it is exactly what I was looking for.
There is one solution that might work for you. It requires that all builds start from a single job and that each chain ends with a definite terminating job; in your diagram, "job A" would be the starting job, and jobs C and D would be the terminating jobs.
Have Job A create a fingerprinted file. Job A can then start multiple chains of builds, B/D and C in this example. Also on Job A, add a promotion via the Promotions Plugin, whose criteria is the successful completion of the successive jobs - in this case, C and D. As part of the promotion, include a trigger of the final job, in your case Job E. This can be done with the Parameterized Trigger Plugin. Then, make sure that each of the jobs you list in the promotion criteria also fingerprint the same file and get the same fingerprint; I use the Copy Artifact Plugin to ensure I get the exact same file each time.
