Spring Cloud Dataflow - composed-task-runner doesn't start second task

I have a Dataflow pipeline consisting of two sequential batch jobs. The first batch completes successfully, but the second one doesn't start.
I have started the Dataflow server with the embedded H2 DB, and I've pointed Spring Batch to the same H2 instance via application.properties. After the first step in my pipeline completes, I can see batch execution logs in that same DB instance.
My composed-task-runner application seems to pick up the Dataflow datasource correctly: I can see it inherits it from the Dataflow server, and the properties are shown in the Dashboard's task execution section.
There are no errors in the logs, only log entries from the successful execution of the first batch.
My TASK_EXECUTION entries:
What could be the problem? And why are there two entries in the TASK_EXECUTION table for the first step? Judging by the task_name, these entries belong to the first batch step only.

I was able to address this issue by rebuilding my batch task using Spring Initializr. Initially I was trying to use spring-cloud-task-app-starters as the base for my work, which is probably not the right way to build Dataflow tasks.
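For anyone hitting the same thing, this is a minimal sketch of what Initializr generates with the Task and Spring Batch dependencies selected (class and job names here are illustrative, not from my actual project). The @EnableTask annotation is what records runs into the TASK_EXECUTION tables that the composed-task-runner inspects before launching the next step:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.task.configuration.EnableTask;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
@EnableTask
@EnableBatchProcessing
public class BatchTaskApplication {

    public static void main(String[] args) {
        SpringApplication.run(BatchTaskApplication.class, args);
    }

    // A single no-op step, just to show the shape of a task-enabled batch job.
    @Bean
    public Job job(JobBuilderFactory jobs, StepBuilderFactory steps) {
        Step step = steps.get("step1")
                .tasklet((contribution, chunkContext) -> RepeatStatus.FINISHED)
                .build();
        return jobs.get("job1").start(step).build();
    }
}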

Related

How does Dataflow manage current processes during upscaling of a streaming job?

When a Dataflow streaming job with autoscaling enabled is deployed, it starts with a single worker.
Let's assume that the pipeline reads Pub/Sub messages, does some DoFn operations, and uploads the results into BQ.
Let's also assume that the Pub/Sub backlog is already a bit big.
So the pipeline starts and pulls some Pub/Sub messages, processing them on the single worker.
After a couple of minutes, Dataflow realizes that some extra workers are needed and creates them.
Many Pub/Sub messages have already been pulled and are being processed, but not acked yet.
And here is my question: how will Dataflow manage those unacked, in-flight elements?
My observations suggest that Dataflow sends many of those in-flight messages to a newly created worker, and we can see the same element being processed at the same time on two workers.
Is this expected behavior?
Another question is - what next? First wins? Or new wins?
I mean, we have the same Pub/Sub message still being processed on the first worker and on the new one.
What if the process on the first worker is faster and finishes first? Will it be acked and sent downstream, or will it be dropped because a new process for this element is underway and only the new one can be finalized?
Dataflow provides exactly-once processing of every record. Funnily enough, this does not mean that user code is run only once per record, whether by the streaming or batch runner.
It might run a given record through a user transform multiple times, or it might even run the same record simultaneously on multiple workers; this is necessary to guarantee at-least-once processing in the face of worker failures. Only one of these invocations can “win” and produce output further down the pipeline.
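In practice this means any external side effects in a DoFn should be idempotent, because the same element may be in flight on two workers at once; only the emitted output is deduplicated by the runner. A rough sketch (the eventId attribute and the external store are assumptions for illustration, not part of the question's pipeline):

import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;

// processElement may run more than once per message, possibly concurrently
// on two workers; the runner deduplicates c.output(...), not side effects.
class HandleMessageFn extends DoFn<PubsubMessage, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        PubsubMessage msg = c.element();
        String key = msg.getAttribute("eventId"); // stable per-message key (assumption)
        // Safe: duplicate outputs for the same input are resolved by the runner.
        c.output(key);
        // Risky unless idempotent: this could execute on both workers.
        // externalStore.writeIfAbsent(key, msg.getPayload()); // hypothetical client
    }
}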
More information here - https://cloud.google.com/blog/products/data-analytics/after-lambda-exactly-once-processing-in-google-cloud-dataflow-part-1

Cloud Dataflow Streaming continuously failing to insert

My Dataflow pipeline functions as follows:
Read from Pub/Sub
Transform the data into rows
Write the rows to BigQuery
On occasion, data is passed which fails to insert. That is alright; I know the reason for the failure. But Dataflow continuously attempts to insert this data over and over and over. I would like to limit the number of retries, since it bloats the worker logs with irrelevant information and makes it extremely difficult to troubleshoot the actual problem when the same error appears repeatedly.
When running the pipeline locally I get:
no evaluator registered for Read(PubsubSource)
I would love to be able to test the pipeline locally, but it does not seem that Dataflow supports this option with Pub/Sub.
To clear the errors, I am left with no choice but to cancel the pipeline and run a new job on Google Cloud, which costs time and money. Is there a way to limit the errors? Is there a way to test my pipeline locally? Is there a better approach to debugging the pipeline?
Dataflow UI
Job ID: 2017-02-08_09_18_15-3168619427405502955
To run the pipeline locally with unbounded data sets, per @Pablo's suggestion, use the InProcessPipelineRunner.
dataflowOptions.setRunner(InProcessPipelineRunner.class);
Running the program locally has allowed me to handle errors with exceptions and optimize my workflow rapidly.
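For the other half of the question, limiting the retries: the SDK this job ran on retried failed streaming inserts indefinitely, but later Apache Beam SDKs let you bound the retries and capture the failed rows instead. A sketch, assuming rows is a PCollection<TableRow> and the destination table already exists (the table name is a placeholder):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

// Stop retrying permanent insert errors; failed rows go to a side output
// instead of looping forever in the worker logs.
WriteResult result = rows.apply("WriteToBQ",
        BigQueryIO.writeTableRows()
                .to("project:dataset.table")
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

result.getFailedInserts().apply("HandleFailedRows", ParDo.of(new DoFn<TableRow, Void>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // In a real pipeline, write these to a dead-letter table instead.
        System.out.println("Insert failed: " + c.element());
    }
}));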

Is Google Dataflow BQ/BT Write atomic per job?

Maybe I am a bad searcher, but I couldn't find my answers in the documentation, so I just want to try my luck here.
So my question is: say I have a Dataflow job that writes to BigQuery or Bigtable and the job fails. Will Dataflow be able to roll back to the state before it started, or might there simply be partial data in my table?
I know that writes to GCS do not seem atomic, in that partial output partitions are produced along the way while the job is running.
However, I have tried dumping data into BQ with Dataflow, and it seems that the output table is not exposed to users until the job has claimed success.
In Batch, Cloud Dataflow uses the following procedure for BigQueryIO.Write.to("some table"):
Write all data to a temporary directory on GCS.
Issue a BigQuery load job with an explicit list of all the temporary files containing the rows to be written.
If there are failures when the GCS writes are only partially complete, we will recreate the temp files on retry. Exactly one complete copy of the data will be produced by step 1 and used for loading in step 2, or the job will fail before step 2.
Each BigQuery load job, as in William V's answer, is atomic. The load job will succeed or fail, and if it fails there will be no data written to BigQuery.
For slightly more depth, Dataflow also uses a deterministic BigQuery job id (like dataflow_job_12423423) so that if the Dataflow code monitoring the load job fails and is retried we will still have exactly-once write semantics to BigQuery.
Together, this design means that each BigQueryIO.Write transform in your pipeline is atomic. In a common case, you have only one such write in your job, and so if the job succeeds the data will be in BigQuery and if the job fails there will be no data written.
However: Note that if you have multiple BigQueryIO.Write transforms in a pipeline, some of the writes may have successfully completed before the Dataflow job fails. The completed writes will not be reverted when the Dataflow job fails.
This means that you may need to be careful when rerunning a Dataflow pipeline with multiple sinks in order to ensure correctness in the presence of committed writes from the earlier failed job.
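One way to make such reruns idempotent per table is a truncating write, so a retry replaces whatever an earlier failed attempt committed rather than appending to it. A sketch using the classic API mentioned above (the table name and schema variable are placeholders):

import com.google.cloud.dataflow.sdk.io.BigQueryIO;

// WRITE_TRUNCATE replaces the table contents atomically when the load job
// completes, so a rerun cannot double-append after a partial earlier run.
rows.apply(BigQueryIO.Write
        .to("project:dataset.output_table")
        .withSchema(schema) // a com.google.api.services.bigquery.model.TableSchema
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));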
I can speak for Bigtable. Bigtable is atomic at the row level, not at the job level. A Dataflow job that fails part way will write partial data into Bigtable.
BigQuery jobs fail or succeed as a unit. From https://cloud.google.com/bigquery/docs/reference/v2/jobs
Each action is atomic and only occurs if BigQuery is able to complete the job successfully. Creation, truncation and append actions occur as one atomic update upon job completion.
Though, just to be clear, BigQuery is atomic at the level of the BigQuery job, not at the level of a Dataflow job that might have created the BigQuery job. E.g. if your Dataflow job fails but it has written to BigQuery before failing (and that BigQuery job is complete) then the data will remain in BigQuery.

Cloud Dataflow job reading from BigQuery stuck before starting

I have a Cloud Dataflow job that's stuck in the initiation phase, before running any application logic. I tested this by adding a log output statement inside the processElement step, but it's not appearing in the logs, so it seems it's not being reached.
All I can see in the logs are the following messages. This one appears every minute:
logger: Starting supervisor: /etc/supervisor/supervisord_watcher.sh: line 36: /proc//oom_score_adj: Permission denied
And these, which loop every few seconds:
VM is healthy? true.
http: TLS handshake error from 172.17.0.1:38335: EOF
Job is in state JOB_STATE_RUNNING, will check again in 30 seconds.
The job ID is 2015-09-14_06_30_22-15275884222662398973, though I have an additional two jobs (2015-09-14_05_59_30-11021392791304643671, 2015-09-14_06_08_41-3621035073455045662) that I started that morning and which have the same problem.
Any ideas on what might be causing this?
It sounds like your pipeline has a BigQuery source followed by a DoFn. Before running your DoFn (and therefore reaching your print statement) the pipeline runs a BigQuery export job to create a snapshot of the data in GCS. This ensures that the pipeline gets a consistent view of the data contained in the BigQuery tables.
It seems like this BigQuery export job for your table took a long time. There unfortunately isn't a progress indicator for the export process. If you run the pipeline again and let it run longer, the export process should complete and then your DoFn will start running.
We are looking into improving the user experience of the export job as well as figuring out why it took longer than we expected.
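For reference, the pipeline shape being described looks like this (Dataflow 1.x style; the table and transform names are invented for illustration). Nothing inside processElement runs until the export of the table to GCS has finished:

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

// The read triggers a BigQuery export job to GCS (which has no progress
// indicator); the DoFn, and its log line, only start once that completes.
p.apply(BigQueryIO.Read.from("project:dataset.big_table"))
 .apply(ParDo.of(new DoFn<TableRow, String>() {
     @Override
     public void processElement(ProcessContext c) {
         System.out.println("Processing: " + c.element()); // not reached during export
         c.output(c.element().toString());
     }
 }));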

Jenkins - monitoring the estimated time of builds

I would like to monitor the estimated time of all of my builds to catch the cases where this value is shown as 'N/A'.
In these cases the build gets stuck (probably due to network issues in my environment), and Jenkins won't start new builds for that job until the stuck one is killed manually.
What I am missing is how to get that data for each job, either from the API or another source.
I would appreciate any suggestions.
Thanks.
For each job, you can click "Trend" on the job run history table, and it will show you the currently executing progress along with a graph of "usual" execution times.
Using the API, you can go to http://jenkins/job/<your_job_name>/<build_number>/api/xml (or /json) and the information is in the <duration> and <estimatedDuration> fields.
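A small sketch of polling that endpoint to flag runs with no usable estimate; Jenkins reports estimatedDuration as -1 when it cannot compute one. The server URL and job name are placeholders:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EstimateCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point this at your own Jenkins job.
        String url = "http://jenkins/job/my-job/lastBuild/api/json";
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> resp = client.send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        String body = resp.body();
        // Crude string check; a JSON library would be more robust.
        if (body.contains("\"estimatedDuration\":-1")) {
            System.out.println("No usable estimate - the build may be stuck.");
        }
    }
}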
Finally, there is a Jenkins Timeout Plugin that you can use to automatically take care of "stuck" builds.
