Dataflow job succeeds but data is not transferred - google-cloud-dataflow

I have a Cloud Dataflow job that pulls data from Oracle into Google Cloud SQL in a GCP project with a VPC/VPN setup. The job is scheduled with Composer and executed periodically. It normally works fine, but sometimes the Dataflow job finishes successfully while no data is moved to Cloud SQL. There are no errors or notable events in the logs. It is weird, and it happens very rarely. It would be good if anyone could suggest how to find the root cause of this.
Here is the Dataflow operator definition:
dataFlowStatus = DataFlowJavaOperator(
    task_id='dataflowlabelstatus',
    jar=jar_pathr,
    gcp_conn_id='google_cloud_default',
    options={
        'project': project_id,
        'region': gcp_region,
        'usePublicIps': ip_condition,
        'stagingLocation': staging_location,
        'tempLocation': staging_location,
        'subnetwork': gcp_sub_network,
        'RTMaxDate': "{{ task_instance.xcom_pull(task_ids='validateDate', key='job_last_fetch_date') }}",
        'oracleJDBCSource': oracle_connection_url,
        'oracleUserName': rm_username,
        'oraclePassword': rm_password,
        'postgreUserName_RT': receipt_username,
        'postgrePassword_RT': receipt_password,
        'postgresDatabaseName_RT': receipt_schema,
        'postgreUserName_Label': label_username,
        'postgrePassword_Label': label_password,
        'postgresDatabaseName_Label': label_schema,
        'cloudSqlInstance_RT': receipt_cloud_instance,
        'cloudSqlInstance_Label': label_cloud_instance,
        'autoscalingAlgorithm': 'BASIC',
        'maxNumWorkers': '50'
    },
    dag=dag
)

Related

Pull Request Build Validation with Jenkins and On-Prem Azure DevOps

First off, the setup in question:
A Jenkins instance with several build nodes and an on-prem Azure DevOps server containing the Git repositories.
The repo in question is too large to build on every push for all branches and all devs, so a small workaround was put in place:
The production branches have polling enabled twice a day (because of the testing duration, which is handled downstream, more builds would not improve quality).
All other branches have automated building suppressed. Devs can still start builds/deployments/unit tests manually if they so choose.
The Jenkinsfile is parameterized for which platforms to build; on prod* branches all the platforms are true, on all other branches false.
This helps because otherwise the initial build of a feature branch would always build/deploy all platforms locally, which would put too much load on the server infrastructure.
I added a service endpoint for Jenkins in Azure DevOps and added a build validation .yml. This basically works: when I call the source branch of the pull request with the merge commit ID, I added a parameter isPullRequestBuild which contains the ID of the PR.
Snippet of the .yml:
- task: JenkinsQueueJob@2
  inputs:
    serverEndpoint: 'MyServerEndpoint'
    jobName: 'MyJob'
    isMultibranchJob: true
    captureConsole: true
    capturePipeline: true
    isParameterizedJob: true
    multibranchPipelineBranch: $(System.PullRequest.SourceBranch)
    jobParameters: |
      stepsToPerform=Build
      runUnittest=true
      pullRequestID=$(System.PullRequest.PullRequestId)
Snippet of the Jenkinsfile:
def isPullRequest = false
if ( params.pullRequestID?.trim() )
{
    isPullRequest = true
    //do stuff to change how the pipeline should react.
}
In the Jenkinsfile I check whether the parameter is non-empty, and if so reset the platforms to build to basically all of them and enable the unit tests.
The problem is: if the branch has never been built, Jenkins does not yet know the parameter on the first run, so it is ignored, nothing is built, and the job returns 0 because "nothing had to be done".
Is there any way to only run the Jenkins build if it hasn't run already?
Or is it possible to get information from the remote call about whether this was the build with ID 1?
The only other option would be to call Jenkins via the web API and check for the last successful build, but in that case I would have to have the token stored somewhere in source control.
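For illustration, such a web-API check could look roughly like the following Python sketch (the Jenkins URL, job/branch names, user and API token are hypothetical placeholders; Jenkins answers lastSuccessfulBuild with HTTP 404 when a branch has never built successfully):
# Hypothetical sketch: ask Jenkins whether a multibranch branch job has ever
# had a successful build. Requires the requests package and a Jenkins API token.
import requests

JENKINS_URL = "https://jenkins.example.com"
PIPELINE = "MyMultibranchPipeline"
BRANCH = "feature%2Fmy-branch"  # branch names are URL-encoded in multibranch jobs

url = f"{JENKINS_URL}/job/{PIPELINE}/job/{BRANCH}/lastSuccessfulBuild/api/json"
resp = requests.get(url, auth=("build-user", "API_TOKEN"))

if resp.status_code == 404:
    print("No successful build yet - this would be the first run")
else:
    resp.raise_for_status()
    print("Last successful build:", resp.json()["number"])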
Am I missing something obvious here? I don't want to trigger feature branch builds that do nothing more than once, because devs could lose useful information about the builds/deployments they started.
Any ideas appreciated.
To whom it may concern with similar problems:
In the end I used the following workaround:
The Jenkins endpoint is called via a user that is only used for automated builds. So, in case this user triggered the build, I set everything up to run a pull request validation, even if it is the first build. Along the lines of:
def causes = currentBuild.getBuildCauses('hudson.model.Cause$UserIdCause')
if (causes != null)
{
    def buildCauses = readJSON text: currentBuild.getBuildCauses('hudson.model.Cause$UserIdCause').toString()
    buildCauses.each
    { buildCause ->
        if (buildCause['userId'] == "theNameOfMyBuildUser")
        {
            triggeredByAzureDevops = true
        }
    }
}
getBuildCauses must be allowed to run by a Jenkins admin (in-process script approval) for that to work.

How to make a Jenkins declarative pipeline waitForWebhook without consuming a Jenkins executor?

We have a Jenkins declarative pipeline which, after deploying our product on a cloud VM, needs to run some product tests on it. The tests are implemented as a separate job on another Jenkins, and the main pipeline runs them by triggering the remote job on the second Jenkins using the Parameterized Remote Trigger plugin.
While this plugin works great, when using the option blockBuildUntilComplete it blocks until the remote job finishes but doesn't release the Jenkins executor. Since the tests can take a long time to complete (up to 2 days), the executor is blocked all this time just waiting for another job to complete. When setting blockBuildUntilComplete to false, the plugin returns a job handle which can be used to fetch the build status, result etc. Example:
while( !handle.isFinished() ) {
    echo 'Current Status: ' + handle.getBuildStatus().toString();
    sleep 5
    handle.updateBuildStatus()
}
But this still keeps consuming the executor, so we still have the same problem.
Based on comments in the article, we tried waitForWebhook, but even while waiting for the webhook it still keeps using the executor.
Based on the article we tried input, and we observed that it wasn't using an executor when there is an input block in the stage and agent none for the pipeline:
pipeline {
    agent none
    stages {
        stage("test") {
            input {
                message "Should we continue?"
            }
            agent any
            steps {
                script {
                    echo "hello"
                }
            }
        }
    }
}
So the input block does what we want to do with waitForWebhook, at least as far as not consuming an executor while waiting.
Our original idea was to have waitForWebhook inside a timeout, surrounded by try/catch, surrounded by a while loop waiting for the remote job to finish:
while(!handle.isFinished()){
    try{
        timeout(1) {
            data = waitForWebhook hook
        }
    } catch(Exception e){
        log.info("timeout occurred")
    }
    handle.updateBuildStatus()
}
This way we could avoid using an executor for a long period of time and also benefit from the job handle returned by the plugin. But we cannot find a way to do this with the input step: input avoids using an executor only when it is a separate block inside the stage, but it does use one when placed inside steps or steps->script. So we cannot use try/catch, cannot check the handle status and cannot loop on it. Is there a way to do this?
Even if waitForWebhook worked the way input does, we could use that.
Our main pipeline runs on a Jenkins inside the corporate network, and the test Jenkins runs in the cloud and cannot communicate in to our corp Jenkins. So when using waitForWebhook, we would publish a message on a message pipeline which a consumer reads; the consumer fetches the webhook URL corresponding to the job from a DB and posts to it. We were hoping to avoid this with our solution of using while and try/catch.

How can I programmatically cancel a Dataflow job that has run for too long?

I'm using Apache Beam on Dataflow through the Python API to read data from BigQuery, process it, and dump it into a Datastore sink.
Unfortunately, quite often the job just hangs indefinitely and I have to manually stop it. While the data gets written into Datastore and Redis, from the Dataflow graph I've noticed that it's only a couple of entries that get stuck and leave the job hanging.
As a result, when a job with fifteen 16-core machines is left running for 9 hours (normally, the job runs for 30 minutes), it leads to huge costs.
Maybe there is a way to set a timer that would stop a Dataflow job if it exceeds a time limit?
It would be great if you could create a customer support ticket so we could try to debug this with you.
Maybe there is a way to set a timer that would stop a Dataflow job if it exceeds a time limit?
Unfortunately the answer is no: Dataflow does not have an automatic way to cancel a job after a certain time. However, it is possible to do this using the APIs: call wait_until_finish() with a timeout and then cancel() the pipeline.
You would do this like so:
p = beam.Pipeline(options=pipeline_options)
p | ...  # Define your pipeline code
pipeline_result = p.run()  # submits the job; non-blocking on the Dataflow runner
pipeline_result.wait_until_finish(duration=TIME_DURATION_IN_MS)
pipeline_result.cancel()  # If the pipeline has not finished, you can cancel it
To sum up, with the help of @ankitk's answer, this works for me (Python 2.7, SDK 2.14):
pipe = beam.Pipeline(options=pipeline_options)
...  # main pipeline code
run = pipe.run()  # submits the job; returns without waiting
run.wait_until_finish(duration=3600000)  # (ms) blocks until the job finishes or the timeout elapses
run.cancel()  # cancels the job if it can still be cancelled
Thus, if the job finished successfully within the duration passed to wait_until_finish(), then cancel() will just print a warning ("already closed"); otherwise it will close the running job.
P.S. If you try to print the state of the job:
state = run.wait_until_finish(duration=3600000)
logging.info(state)
it will be RUNNING for a job that did not finish within wait_until_finish(), and DONE for a finished job.
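Building on that, a minimal sketch (assuming the Beam Python SDK on the Dataflow runner) that only cancels when the job is still running after the timeout, which avoids the "already closed" warning:
# Sketch: cancel only if the job is still RUNNING after the timeout.
# The state values returned by wait_until_finish() are plain strings.
state = run.wait_until_finish(duration=3600000)  # milliseconds
if state == 'RUNNING':
    logging.info('Job did not finish within the timeout, cancelling it')
    run.cancel()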
Note: this technique will not work when running Beam from within a Flex Template Job...
The run.cancel() method doesn't work if you are writing a template, and I haven't seen any successful workaround for it...

Unable to get the result from the Google Cloud Machine Learning REST API

I am trying to run a job using the Google Cloud Machine Learning REST API ml.projects.jobs.create.
The latest job that I submitted has the job id 'drivermonitoring20180109335'. On completion of the job, the message 'job completed successfully' is displayed, but I cannot see any of the desired output files in the specified location. The output logs can be seen in fig1.
I would also like to share a few observations from running this job:
i) Running the job took much less time than any other job I executed before.
ii) Every job I ran earlier was executed via two different tasks, viz. a) master-replica-0 and b) service (refer fig2), but this job didn't have the master-replica-0 task (refer fig3). I tried to Google the issue, but was unable to find any related solution.
So I can infer that the job I was trying to run is being scheduled, but the Python script I am trying to run is never scheduled for execution.
Kindly let me know if you require more screenshots or if you want to have a look at the project structure to help with the issue.
Thanks in advance.
EDIT 1: Added the JSON used when making the API call
POST https://ml.googleapis.com/v1/projects/drivermonitoringsystem/jobs?key={YOUR_API_KEY}
{
  "trainingInput": {
    "pythonModule": "trainer.retrain",
    "args": [
      "--bottleneck_dir=ModelTraining/tf_files/bottlenecks \
       --model_dir=ModelTraining/tf_files/models/ \
       --architecture=mobilenet_0.50_224 \
       --output_graph=gs://<BUCKET_NAME>/tf_files/retrained_graph.pb \
       --output_labels=gs://<BUCKET_NAME>/tf_files/retrained_labels.txt \
       --image_dir=gs://<BUCKET_NAME>/dataset224x224/"
    ],
    "region": "us-central1",
    "packageUris": [
      "gs://<BUCKET_NAME>/ModelTraining4.tar.gz"
    ],
    "jobDir": "gs://<BUCKET_NAME>/tf_files/",
    "runtimeVersion": "1.4"
  },
  "jobId": "job_id201801101535"
}
I have just run some sample jobs myself using both the gcloud command and the REST API, and everything worked fine in both cases. It looks like, in your case, the job was never actually executed, as no cluster was created for processing the job itself (that is why master-replica-0 is missing).
Were the jobs that you had run previously (and which worked) also launched using the REST API, or instead with gcloud or a client library?
Here I share an example JSON I used when making the API call to ml.projects.jobs.create through the API Explorer link you shared. I suggest you try adapting it to your requirements and check whether you are missing any field:
POST https://ml.googleapis.com/v1/projects/<YOUR_PROJECT>/jobs?key={YOUR_API_KEY}
{
  "jobId": "<JOB_ID>",
  "trainingInput": {
    "jobDir": "gs://<LOCATION_TO_STORE_OUTPUTS>",
    "runtimeVersion": "1.4",
    "region": "<REGION>",
    "packageUris": [
      "gs://<PATH_TO_YOUR_TRAINER>/trainer-0.0.0.tar.gz"
    ],
    "pythonModule": "<PYTHON_MODULE_TO_RUN>",
    "args": [
      "--train-files",
      "gs://<PATH_TO_YOUR_TRAINING_DATA>/data.csv",
      "--eval-files",
      "gs://<PATH_TO_YOUR_TEST_DATA>/test.csv",
      "--train-steps",
      "100",
      "--eval-steps",
      "10",
      "--verbosity",
      "DEBUG"
    ]
  }
}
Change TrainingInput to PredictionInput (and the appropriate child fields) if you are trying to run a prediction job instead of a training one, as in this example.
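If it helps to cross-check outside the API Explorer, the same request can also be issued from Python with the google-api-python-client library; a minimal sketch, assuming application-default credentials and the same placeholder values as the JSON above (args shortened for brevity):
# Minimal sketch: submit the training job via ml.projects.jobs.create
# using google-api-python-client. All <...> values are placeholders.
from googleapiclient import discovery

ml = discovery.build('ml', 'v1')

job_spec = {
    'jobId': '<JOB_ID>',
    'trainingInput': {
        'jobDir': 'gs://<LOCATION_TO_STORE_OUTPUTS>',
        'runtimeVersion': '1.4',
        'region': '<REGION>',
        'packageUris': ['gs://<PATH_TO_YOUR_TRAINER>/trainer-0.0.0.tar.gz'],
        'pythonModule': '<PYTHON_MODULE_TO_RUN>',
        'args': ['--train-files', 'gs://<PATH_TO_YOUR_TRAINING_DATA>/data.csv'],
    },
}

request = ml.projects().jobs().create(
    parent='projects/<YOUR_PROJECT>', body=job_spec)
print(request.execute())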

Use BlockingDataflowPipelineRunner and post-processing code for Dataflow template

I'd like to run some code after my pipeline finishes all processing, so I'm using BlockingDataflowPipelineRunner and placing code after pipeline.run() in main.
This works properly when I run the job from the command line using BlockingDataflowPipelineRunner. The code after pipeline.run() runs once the pipeline has finished processing.
However, it does not work when I try to run the job as a template. I deployed the job as a template (with TemplatingDataflowPipelineRunner), and then tried to run the template in a Cloud Function like this:
dataflow.projects.templates.create({
  projectId: 'PROJECT ID HERE',
  resource: {
    parameters: {
      runner: 'BlockingDataflowPipelineRunner'
    },
    jobName: `JOB NAME HERE`,
    gcsPath: 'GCS TEMPLATE PATH HERE'
  }
}, function(err, response) {
  if (err) {
    // etc
  }
  callback();
});
The runner setting does not seem to take effect. I can put gibberish under runner and the job still runs.
The code I had after pipeline.run() does not run when each job runs -- it runs only when I deploy the template.
Is it expected that the code after pipeline.run() in main would not run each time the job runs? Is there a solution for executing code after the pipeline is finished?
(For context, the code after pipeline.run() moves a file from one Cloud Storage bucket to another. It's archiving the file that was just processed by the job.)
Yes, this is expected behavior. A template represents the pipeline itself, and allows (re-)executing the pipeline by launching the template. Since the template doesn't include any of the code from the main() method, it doesn't allow doing anything after the pipeline execution.
Similarly, the dataflow.projects.templates.create API is just the API to launch the template.
The way the blocking runner accomplished this was to get the job ID from the created pipeline and periodically poll to observe when it has completed. For your use case, you'll need to do the same:
Execute dataflow.projects.templates.create(...) to create the Dataflow job. This should return the job ID.
Periodically (every 5-10s, for instance) poll dataflow.projects.jobs.get(...) to retrieve the job with the given ID, and check what state it is in.
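The Cloud Function above is Node.js, but the polling step works the same through any client. A rough sketch in Python with google-api-python-client (assuming application-default credentials; the project, region and job ID values are placeholders, with the job ID being the one returned by templates.create):
# Rough sketch: poll the Dataflow job until it reaches a terminal state,
# then run the post-processing (e.g. archiving the processed file).
import time

from googleapiclient.discovery import build

dataflow = build('dataflow', 'v1b3')

TERMINAL_STATES = {
    'JOB_STATE_DONE', 'JOB_STATE_FAILED',
    'JOB_STATE_CANCELLED', 'JOB_STATE_DRAINED',
}

def wait_for_job(project_id, region, job_id, poll_seconds=10):
    """Poll projects.locations.jobs.get until the job finishes."""
    while True:
        job = dataflow.projects().locations().jobs().get(
            projectId=project_id, location=region, jobId=job_id).execute()
        state = job['currentState']
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)

# Example: only archive the file once the job really finished successfully.
if wait_for_job('PROJECT_ID', 'REGION', 'JOB_ID_FROM_TEMPLATES_CREATE') == 'JOB_STATE_DONE':
    pass  # move the file from one Cloud Storage bucket to another here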
