I'm running a Google Cloud Dataflow job and want the individual execution times for all the steps in my pipeline, including nested transforms. It's a streaming Dataflow job, and the pipeline currently looks like this:
Current dataflow
Can anyone please suggest a solution?
The answer is Wall time. You can access this info by clicking on any of the steps in your pipeline (even nested ones).
The elapsed time of a job is the total time it takes to complete your Dataflow job, while wall time is the sum of the time the assigned workers spend running each step. See the image below for more details.
During the execution of each Dataflow job, the job takes around 2-4 minutes just for the creation and deletion of the VMs (worker pool).
Is there any way to minimize this?
OR
Can we create the VMs before the Dataflow job executes, so that the execution time comes down?
Dataflow is fully managed. From the documentation:
You should not attempt to manage or otherwise interact directly with your Compute Engine Managed Instance Group; the Dataflow service will take care of that for you. Manually altering any Compute Engine resources associated with your Dataflow job is an unsupported operation.
Is it possible to automatically trigger a Dataflow job every 10 minutes? Any insights on how to achieve this?
Yes. This is explained in the blog post Scheduling Dataflow pipelines using App Engine Cron Service or Cloud Functions.
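The core of either approach is a small launcher that builds the pipeline and submits it to the Dataflow service whenever the cron trigger fires. Here is a minimal sketch, assuming the Dataflow Java SDK; the project id, bucket, and transforms are placeholders, and the blog post's exact code may differ:

public void launchPipeline() {
  DataflowPipelineOptions options =
      PipelineOptionsFactory.as(DataflowPipelineOptions.class);
  options.setRunner(DataflowPipelineRunner.class);        // submit and return, don't block
  options.setProject("my-project");                       // placeholder project id
  options.setStagingLocation("gs://my-bucket/staging");   // placeholder bucket

  Pipeline p = Pipeline.create(options);
  // ... add your transforms here ...
  p.run();   // returns once the job has been submitted to the Dataflow service
}

With App Engine Cron you would call this from a handler mapped to a cron entry with a 10-minute schedule; with Cloud Functions you would trigger it from a scheduled function instead.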
I have to join data from Google Datastore and Google BigTable to produce some report, and I need to execute that operation every minute. Is it possible to accomplish this with Google Cloud Dataflow (assuming the processing itself does not take a long time and/or can be split into independent parallel jobs)?
Should I have an endless loop inside main, creating and executing the same pipeline again and again?
If most of the time in such a scenario is taken up by bringing up the VMs, is it possible to instruct Dataflow to use customer-provided VMs instead?
Thanks,
If you expect that your job is small enough to complete in 60 seconds, you could consider using the Datastore and BigTable APIs from within a DoFn in a streaming job. Your pipeline might look something like:
// Emits one element per minute, forever, to drive the periodic reads.
PCollection<Long> impulse = p.apply(
    CountingInput.unbounded().withRate(1, Duration.standardMinutes(1)));
PCollection<A> input1 = impulse.apply(ParDo.of(readFromDatastore));
PCollection<B> input2 = impulse.apply(ParDo.of(readFromBigTable));
...
This produces a single input every minute, forever. Because it runs as a streaming pipeline, the VMs will keep running.
After reading from both APIs you can then window/join as necessary.
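As a rough sketch of that window/join step, assuming A and B can be keyed by a common String id (the keying DoFns keyByIdA/keyByIdB and the one-minute window are placeholders):

PCollection<KV<String, A>> keyed1 = input1
    .apply(ParDo.of(keyByIdA))    // placeholder DoFn: A -> KV<String, A>
    .apply(Window.<KV<String, A>>into(FixedWindows.of(Duration.standardMinutes(1))));
PCollection<KV<String, B>> keyed2 = input2
    .apply(ParDo.of(keyByIdB))    // placeholder DoFn: B -> KV<String, B>
    .apply(Window.<KV<String, B>>into(FixedWindows.of(Duration.standardMinutes(1))));

final TupleTag<A> tagA = new TupleTag<>();
final TupleTag<B> tagB = new TupleTag<>();

// Each element of 'joined' holds the A and B values that share a key within
// the same one-minute window.
PCollection<KV<String, CoGbkResult>> joined = KeyedPCollectionTuple
    .of(tagA, keyed1)
    .and(tagB, keyed2)
    .apply(CoGroupByKey.<String>create());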
IBM's implementation of JSR 352 provides a REST API which can be used to trigger jobs, restart them, and get the job logs. Can it also be used to get the status of each step and each partition of a step?
I want to build a job monitoring console from which I can trigger the jobs and monitor the status of the steps and partitions in real time, without actually having to look into the job log (after I trigger a job, it should periodically give me the status of the steps and partitions).
How should I go about doing this?
You can subscribe to our batch events, a JMS topic tree where we publish messages at various stages in the batch job lifecycle (job started/ended, step checkpointed, etc.).
See the Knowledge Center documentation and this whitepaper as well for more information.
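As a rough sketch of what the consuming side of your console could look like, here is a plain JMS topic subscriber; the JNDI names and topic are placeholders, so take the real connection factory and topic tree from the Knowledge Center documentation:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.Topic;
import javax.naming.InitialContext;

public class BatchEventListener {
  public static void main(String[] args) throws Exception {
    InitialContext ctx = new InitialContext();
    ConnectionFactory cf =
        (ConnectionFactory) ctx.lookup("jms/batchConnectionFactory"); // placeholder JNDI name
    Topic topic = (Topic) ctx.lookup("jms/batchEventsTopic");         // placeholder topic

    Connection conn = cf.createConnection();
    Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
    MessageConsumer consumer = session.createConsumer(topic);

    // Each message corresponds to a lifecycle event (job started/ended,
    // step checkpointed, partition ended, ...); push it to your console.
    consumer.setMessageListener(msg -> System.out.println(msg));

    conn.start();                    // begin delivery
    Thread.sleep(Long.MAX_VALUE);    // keep the process alive to keep listening
  }
}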
I would like to monitor the estimated time of all of my builds to catch the cases where this value is shown as 'N/A'.
In these cases the build gets stuck (probably due to network issues in my environment) and it won't start new builds for that job until killed manually.
What I am missing is how to get that data for each job, either from the API or some other source.
I would appreciate any suggestion.
Thanks.
For each job, you can click "Trend" on the job run history table, and it will show you the progress of the currently executing build along with a graph of "usual" execution times.
Using the API, you can go to http://jenkins/job/<your_job_name>/<build_number>/api/xml (or /json), and the information is in the <duration> and <estimatedDuration> fields.
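A minimal sketch of polling those fields over the JSON API; the host and job name are placeholders, and the snippet assumes anonymous read access (add credentials if your Jenkins requires them):

import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;

public class BuildDurationPoller {
  public static void main(String[] args) throws Exception {
    // The tree parameter limits the response to just the fields we need.
    URL url = new URL("http://jenkins/job/my-job/lastBuild/api/json"
        + "?tree=building,duration,estimatedDuration");   // placeholder job name
    try (InputStream in = url.openStream();
         Scanner scanner = new Scanner(in).useDelimiter("\\A")) {
      String json = scanner.hasNext() ? scanner.next() : "";
      // An estimatedDuration of -1 means Jenkins has no estimate, which is
      // what the UI renders as "N/A", so that is the value to alert on.
      System.out.println(json);
    }
  }
}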
Finally, there is a Jenkins Timeout Plugin that you can use to automatically take care of "stuck" builds.